|
| 1 | +--- |
| 2 | +title: "My Journey to a serverless transformers pipeline on Google Cloud" |
| 3 | +thumbnail: /blog/assets/14_how_to_deploy_a_pipeline_to_google_clouds/thumbnail.png |
| 4 | +--- |
| 5 | + |
| 6 | +# My Journey to a serverless transformers pipeline on <br>Google Cloud |
| 7 | + |
| 8 | +<div class="blog-metadata"> |
| 9 | + <small>Published March 18, 2021.</small> |
| 10 | + <a target="_blank" class="btn no-underline text-sm mb-5 font-sans" href="https://github.com/huggingface/blog/blob/master/how-to-deploy-a-pipeline-to-google-clouds.md"> |
| 11 | + Update on GitHub |
| 12 | + </a> |
| 13 | +</div> |
| 14 | + |
| 15 | +<div class="author-card"> |
| 16 | + <a href="/Maxence"> |
| 17 | + <img class="avatar avatar-user" src="https://aeiljuispo.cloudimg.io/v7/https://s3.amazonaws.com/moonup/production/uploads/1613496680893-602bfe18c4f8038e9a1e0a66.jpeg?w=200&h=200&f=face" title="Gravatar"> |
| 18 | + <div class="bfc"> |
| 19 | + <code>Maxence</code> |
| 20 | + <span class="fullname">Maxence Dominici</span> |
| 21 | + <span class="bg-gray-100 rounded px-1 text-gray-600 text-sm font-mono">guest</span> |
| 22 | + </div> |
| 23 | + </a> |
| 24 | +</div> |
| 25 | + |
| 26 | +> ##### A guest blog post by community member <a href="/Maxence">Maxence Dominici</a> |
| 27 | +
|
| 28 | +This article will discuss my journey to deploy the `transformers` _sentiment-analysis_ pipeline on [Google Cloud](https://cloud.google.com). We will start with a quick introduction to `transformers` and then move to the technical part of the implementation. Finally, we'll summarize this implementation and review what we have achieved. |
| 29 | + |
| 30 | +## The Goal |
| 31 | + |
| 32 | +I wanted to create a micro-service that automatically detects whether a customer review left in Discord is positive or negative. This would allow me to treat the comment accordingly and improve the customer experience. For instance, if the review was negative, I could create a feature that contacts the customer, apologizes for the poor quality of service, and informs them that our support team will get in touch as soon as possible to assist them and hopefully fix the problem. Since I don't expect more than 2,000 requests per month, I didn't impose any performance constraints on response time or scalability. |
| 33 | + |
| 34 | +## The Transformers library |
| 35 | +I was a bit confused at the beginning when I downloaded the .h5 file. I thought it would be compatible with `tensorflow.keras.models.load_model`, but this wasn't the case. After a few minutes of research I was able to figure out that the file was a weights checkpoint rather than a Keras model. |
| 36 | +After that, I tried out the API that Hugging Face offers and read a bit more about its pipeline feature. Since the results of both the API and the pipeline were great, I decided that I could serve the model through the pipeline on my own server. |
| 37 | + |
| 38 | +Below is the [official example](https://github.com/huggingface/transformers#quick-tour) from the Transformers GitHub page. |
| 39 | + |
| 40 | +```python |
| 41 | +from transformers import pipeline |
| 42 | + |
| 43 | +# Allocate a pipeline for sentiment-analysis |
| 44 | +classifier = pipeline('sentiment-analysis') |
| 45 | +classifier('We are very happy to include pipeline into the transformers repository.') |
| 46 | +# Output: [{'label': 'POSITIVE', 'score': 0.9978193640708923}] |
| 47 | +``` |
| 48 | + |
| 49 | + |
| 50 | +## Deploy transformers to Google Cloud |
| 51 | +> I chose GCP because it is the cloud environment I use in my personal organization. |
| 52 | +
|
| 53 | +### Step 1 - Research |
| 54 | +I already knew that I could use an API service such as `flask` to serve a `transformers` model. I searched the Google Cloud AI documentation and found a service for hosting TensorFlow models named [AI-Platform Prediction](https://cloud.google.com/ai-platform/prediction/docs). I also found [App Engine](https://cloud.google.com/appengine) and [Cloud Run](https://cloud.google.com/run) there, but I was concerned about memory usage on App Engine and was not very familiar with Docker. |
| 55 | + |
| 56 | +### Step 2 - Test on AI-Platform Prediction |
| 57 | + |
| 58 | +As the model is not a "pure TensorFlow" saved model but a checkpoint, and I couldn't turn it into a "pure TensorFlow model", I figured that the example on [this page](https://cloud.google.com/ai-platform/prediction/docs/deploying-models) wouldn't work. |
| 59 | +From there I saw that I could write some custom code, allowing me to load the `pipeline` instead of having to handle the model, which seemed easier. I also learned that I could define pre-prediction and post-prediction actions, which could be useful in the future for pre- or post-processing the data for customers' needs. |
| 60 | +I followed Google's guide but encountered an issue, as the service is still in beta and not everything is stable. This issue is detailed [here](https://github.com/huggingface/transformers/issues/9926). |
| 61 | + |
| 62 | + |
| 63 | +### Step 3 - Test on App Engine |
| 64 | + |
| 65 | +I moved to Google's [App Engine](https://cloud.google.com/appengine) as it's a service that I am familiar with, but encountered an installation issue with TensorFlow due to a missing system dependency file. I then tried PyTorch, which worked on an F4_1G instance, but it couldn't handle more than 2 requests on the same instance, which isn't great performance-wise. |
| 66 | + |
| 67 | +### Step 4 - Test on Cloud Run |
| 68 | + |
| 69 | +Lastly, I moved to [Cloud Run](https://cloud.google.com/run) with a Docker image. I followed [this guide](https://cloud.google.com/run/docs/quickstarts/build-and-deploy#python) to get an idea of how it works. In Cloud Run, I could configure more memory and more vCPUs to perform the prediction with PyTorch. I ditched TensorFlow as PyTorch seems to load the model faster. |
| 70 | + |
| 71 | + |
| 72 | +## Implementation of the serverless pipeline |
| 73 | + |
| 74 | +The final solution consists of four different components: |
| 75 | +- `main.py` handling the request to the pipeline |
| 76 | +- `Dockerfile` used to create the image that will be deployed on Cloud Run. |
| 77 | +- A model folder containing `pytorch_model.bin`, `config.json` and `vocab.txt`. |
| 78 | + - Model : [DistilBERT base uncased finetuned SST-2 |
| 79 | + ](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) |
| 80 | + - To download the model folder, follow the instructions in the button.  |
| 81 | + - You don't need to keep the `rust_model.ot` or the `tf_model.h5` as we will use [PyTorch](https://pytorch.org/). |
| 82 | +- `requirements.txt` for installing the dependencies |
| 83 | + |
| 84 | +The content of `main.py` is really simple. The idea is to receive a `GET` request containing two fields: first, the review that needs to be analysed; second, the API key to "protect" the service. The second parameter is optional; I used it to avoid setting up the OAuth2 of Cloud Run. Once these arguments are provided, we load the pipeline, which is built on the model `distilbert-base-uncased-finetuned-sst-2-english` (provided above). In the end, the best match is returned to the client. |
| 85 | + |
| 86 | +```python |
| 87 | +import os |
| 88 | +from flask import Flask, jsonify, request |
| 89 | +from transformers import pipeline |
| 90 | + |
| 91 | +app = Flask(__name__) |
| 92 | + |
| 93 | +model_path = "./model" |
| 94 | + |
| 95 | +@app.route('/') |
| 96 | +def classify_review(): |
| 97 | +    review = request.args.get('review') |
| 98 | +    api_key = request.args.get('api_key') |
| 99 | +    if review is None or api_key != "MyCustomerApiKey": |
| 100 | +        return jsonify(code=403, message="bad request"), 403  # also set the HTTP status |
| 101 | +    classify = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path) |
| 102 | +    return jsonify(classify(review)[0])  # best match for the submitted review |
| 103 | + |
| 104 | + |
| 105 | +if __name__ == '__main__': |
| 106 | +    # This is used when running locally only. When deploying to Google Cloud |
| 107 | +    # Run, a webserver process such as Gunicorn will serve the app. |
| 108 | +    app.run(debug=False, host="0.0.0.0", port=int(os.environ.get("PORT", 8080))) |
| 109 | +``` |
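As a rough illustration of how a client might call this endpoint, here is a minimal sketch using only the Python standard library. The helper names `build_url` and `classify_review` and the placeholder service URL are assumptions made for this example; only the two query fields, `review` and `api_key`, come from `main.py` above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url: str, review: str, api_key: str) -> str:
    """Build the GET URL carrying the two query fields read by main.py."""
    query = urlencode({"review": review, "api_key": api_key})
    return f"{base_url.rstrip('/')}/?{query}"


def classify_review(base_url: str, review: str, api_key: str) -> dict:
    """Send the review to the service and decode the JSON reply."""
    with urlopen(build_url(base_url, review, api_key)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder URL (hypothetical): replace it with the URL of your own service.
    print(build_url("https://YOUR-SERVICE-URL", "that was great", "MyCustomerApiKey"))
```

Percent-encoding the review with `urlencode` matters here because reviews are free text and may contain spaces or punctuation that are not valid in a raw query string.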
| 110 | + |
| 111 | +Next comes the `Dockerfile`, which is used to create a Docker image of the service. We specify that our service runs on python:3.7 and that our requirements need to be installed. Then we use `gunicorn` to serve the app on port `5000`. |
| 112 | +```dockerfile |
| 113 | +# Use Python37 |
| 114 | +FROM python:3.7 |
| 115 | +# Allow statements and log messages to immediately appear in the Knative logs |
| 116 | +ENV PYTHONUNBUFFERED True |
| 117 | +# Copy requirements.txt to the docker image and install packages |
| 118 | +COPY requirements.txt / |
| 119 | +RUN pip install -r requirements.txt |
| 120 | +# Copy the application code and the model folder into /app |
| 121 | +COPY . /app |
| 122 | +# Expose port 5000 |
| 123 | +EXPOSE 5000 |
| 124 | +ENV PORT 5000 |
| 125 | +WORKDIR /app |
| 126 | +# Use gunicorn as the entrypoint |
| 127 | +CMD exec gunicorn --bind :$PORT main:app --workers 1 --threads 1 --timeout 0 |
| 128 | +``` |
| 129 | + |
| 130 | +It is important to note the arguments `--workers 1 --threads 1`, which mean that I want to run my app on only one worker (= 1 process) with a single thread. This is because I don't want to have 2 instances up at once, as that might increase the bill. One of the downsides is that processing will take longer if the service receives two requests at once. I also limited the service to one thread because of the memory needed to load the model into the pipeline. If I were using 4 threads, each might have only 4 GB / 4 = 1 GB to perform the full process, which is not enough and would lead to a memory error. |
| 131 | + |
| 132 | +Finally, the `requirements.txt` file: |
| 133 | +```python |
| 134 | +Flask==1.1.2 |
| 135 | +torch==1.7.1 |
| 136 | +transformers~=4.2.0 |
| 137 | +gunicorn>=20.0.0 |
| 138 | +``` |
| 139 | + |
| 140 | + |
| 141 | +## Deployment instructions |
| 142 | + |
| 143 | +First, you will need to meet some requirements, such as having a project on Google Cloud, enabling billing, and installing the `gcloud` CLI. You can find more details in [Google's guide - Before you begin](https://cloud.google.com/run/docs/quickstarts/build-and-deploy#before-you-begin). |
| 144 | + |
| 145 | +Second, we need to build the Docker image and deploy it to Cloud Run, selecting the correct project (replace `PROJECT-ID`) and setting the name of the instance, for example `ai-customer-review`. You can find more information about the deployment in [Google's guide - Deploying to](https://cloud.google.com/run/docs/quickstarts/build-and-deploy#deploying_to). |
| 146 | + |
| 147 | +```shell |
| 148 | +gcloud builds submit --tag gcr.io/PROJECT-ID/ai-customer-review |
| 149 | +gcloud run deploy --image gcr.io/PROJECT-ID/ai-customer-review --platform managed |
| 150 | +``` |
| 151 | + |
| 152 | +After a few minutes, you will also need to increase the memory allocated to your Cloud Run instance from 256 MB to 4 GB. To do so, head over to the [Cloud Run Console](https://console.cloud.google.com/run) of your project. |
| 153 | + |
| 154 | +There you should find your instance; click on it. |
| 155 | + |
| 156 | + |
| 157 | + |
| 158 | +After that, you will see a blue button labelled "edit and deploy new revision" at the top of the screen. Click on it and you'll be presented with many configuration fields. At the bottom you should find a "Capacity" section where you can specify the memory. |
| 159 | + |
| 160 | + |
| 161 | + |
| 162 | +## Performance |
| 163 | + |
| 164 | + |
| 165 | +Handling a request takes less than five seconds from the moment you send it, including loading the model into the pipeline and running the prediction. A cold start might add roughly 10 seconds. |
| 166 | + |
| 167 | +We can improve request handling performance by warming the model, which means loading it once on start-up instead of on each request (in a global variable, for example). Doing so saves both time and memory. |
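The warming idea can be sketched as follows. This is only one possible way to do it, not the code deployed in this article: `load_pipeline` and `get_classifier` are hypothetical helper names, and the `loader` parameter exists only so the caching logic can be exercised without downloading a model.

```python
MODEL_PATH = "./model"  # same folder as in main.py

_classifier = None  # module-level cache shared by all requests in this process


def load_pipeline(model_path):
    """Load the sentiment-analysis pipeline from the local model folder."""
    # Imported here so the caching logic can be tried without transformers installed.
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)


def get_classifier(loader=load_pipeline):
    """Return the cached pipeline, loading it only on the first call."""
    global _classifier
    if _classifier is None:
        _classifier = loader(MODEL_PATH)
    return _classifier
```

In `main.py`, the request handler would call `get_classifier()` instead of `pipeline(...)`, and calling `get_classifier()` once at import time would move the loading cost from the first request to container start-up.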
| 168 | + |
| 169 | +## Costs |
| 170 | +I simulated the cost based on the Cloud Run instance configuration with the [Google pricing simulator](https://cloud.google.com/products/calculator#id=cd314cba-1d9a-4bc6-a7c0-740bbf6c8a78). |
| 171 | + |
| 172 | + |
| 173 | +For my micro-service, I am planning for close to 1,000 requests per month, optimistically; 500 is more likely for my usage. That's why I considered 2,000 requests as an upper bound when designing my micro-service. |
| 174 | +Given that low number of requests, I didn't worry much about scalability, but I might come back to it if my bill increases. |
| 175 | + |
| 176 | +Nevertheless, it's important to stress that you will pay for the storage of each gigabyte of your build image. It's roughly €0.10 per GB per month, which is fine if you don't keep all your versions in the cloud, since my image is slightly above 1 GB (PyTorch accounts for 700 MB and the model for 250 MB). |
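To get a feel for how the storage bill grows with the number of image versions kept, here is a small back-of-the-envelope sketch. It assumes the cost scales linearly with the number of versions, which is likely an overestimate when images share layers; the function name `monthly_storage_cost` is made up for this example, and the default price is the rough €0.10 per GB per month quoted above.

```python
def monthly_storage_cost(image_size_gb: float, versions_kept: int = 1,
                         price_per_gb_month: float = 0.10) -> float:
    """Estimate the monthly storage cost, in euros, of keeping built images."""
    # Assumption: every kept version is billed in full (no shared layers).
    return image_size_gb * versions_kept * price_per_gb_month


if __name__ == "__main__":
    for versions in (1, 5, 20):
        cost = monthly_storage_cost(1.0, versions)
        print(f"{versions} version(s): about {cost:.2f} EUR per month")
```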
| 177 | + |
| 178 | +## Conclusion |
| 179 | + |
| 180 | +By using Transformers' sentiment analysis pipeline, I saved a non-negligible amount of time. Instead of training or fine-tuning a model, I could find one ready to be used in production and start deploying it in my system. I might fine-tune it in the future, but as shown in my tests, the accuracy is already amazing! |
| 181 | +I would have liked a "pure TensorFlow" model, or at least a way to load it in TensorFlow without the Transformers dependencies, so that I could use AI Platform. It would also be great to have a lite version. |