---
title: Scaling web service | Microsoft Docs
description: Learn how to scale a web service by increasing concurrency and adding new endpoints.
services: machine-learning
documentationcenter: ''
author: neerajkh
manager: srikants
editor: cgronlun
keywords: azure machine learning, web services, operationalization, scaling, endpoint, concurrency
ms.assetid: c2c51d7f-fd2d-4f03-bc51-bf47e6969296
ms.service: machine-learning
ms.devlang: NA
ms.workload: data-services
ms.tgt_pltfrm: na
ms.topic: article
ms.date: 10/05/2016
ms.author: neerajkh
---

# Scaling a Web service

> [!NOTE]
> This topic describes techniques applicable to a Classic Machine Learning Web service.

By default, each published Web service is configured to support 20 concurrent requests, and this value can be raised to as many as 200 concurrent requests. The Azure classic portal provides a way to set this value, but Azure Machine Learning automatically optimizes the setting to provide the best performance for your web service, and the value set in the portal is ignored.

If you plan to call the API with a higher load than a Max Concurrent Calls value of 200 will support, you should create multiple endpoints on the same Web service. You can then randomly distribute your load across all of them.
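As a minimal sketch of distributing load at random, the Python snippet below picks one of several endpoints of the same Web service for each call. The endpoint URIs and API keys are placeholders for illustration; substitute the request URI and key shown for each endpoint you create.

```python
import json
import random
import urllib.request

# Hypothetical endpoint URIs and API keys for illustration only; each endpoint
# of the same Web service has its own request URI and authorization key.
ENDPOINTS = [
    {"url": "https://<endpoint-1-request-uri>", "key": "<endpoint-1-api-key>"},
    {"url": "https://<endpoint-2-request-uri>", "key": "<endpoint-2-api-key>"},
    {"url": "https://<endpoint-3-request-uri>", "key": "<endpoint-3-api-key>"},
]

def call_web_service(payload):
    """Send one scoring request to a randomly chosen endpoint."""
    endpoint = random.choice(ENDPOINTS)  # uniform random choice spreads load evenly
    request = urllib.request.Request(
        endpoint["url"],
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + endpoint["key"],
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```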

## Add new endpoints for the same web service

Scaling a Web service is a common task. Some reasons to scale are to support more than 200 concurrent requests, to increase availability through multiple endpoints, or to provide separate endpoints for the web service. You can increase the scale by adding additional endpoints for the same Web service through the Azure classic portal or the Azure Machine Learning Web Services portal.

For more information on adding new endpoints, see Creating Endpoints.

Keep in mind that using a high concurrency count can be detrimental if you're not calling the API with a correspondingly high rate. You might see sporadic timeouts and/or spikes in the latency if you put a relatively low load on an API configured for high load.

The synchronous APIs are typically used in situations where a low latency is desired. Latency here refers to the time it takes for the API to complete one request, and doesn't account for any network delays. Let's say you have an API with a 50-ms latency. To fully consume the available capacity with throttle level High and Max Concurrent Calls = 20, you need to call this API 20 * 1000 / 50 = 400 times per second. Extending this further, a Max Concurrent Calls of 200 allows you to call the API 4000 times per second, assuming a 50-ms latency.
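As a quick sketch of that arithmetic, the maximum request rate a single endpoint can sustain is Max Concurrent Calls × 1000 / latency in milliseconds:

```python
def max_requests_per_second(max_concurrent_calls, latency_ms):
    """Upper bound on synchronous throughput for one endpoint: each concurrent
    slot completes 1000 / latency_ms requests per second."""
    return max_concurrent_calls * 1000 / latency_ms

print(max_requests_per_second(20, 50))   # 400.0 requests per second
print(max_requests_per_second(200, 50))  # 4000.0 requests per second
```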