Throttling design pattern

Control the consumption of resources used by an instance of an application or service. This can allow the system to continue to function and meet service level agreements (SLAs), even when an increase in demand places an extreme load on resources.

There are many strategies available for handling varying loads in the cloud, depending on the business goals of the application. One strategy is to use autoscaling to match the provisioned resources to the user's needs at any given time. This has the potential to consistently meet user demand while optimizing running costs. However, while autoscaling can trigger the provisioning of additional resources, this provisioning isn't immediate. If demand grows quickly, there can be a window of time where there's a resource deficit.

An alternative strategy to autoscaling is to allow applications to use resources only up to a limit, and then throttle them when this limit is reached. The system should monitor how it's using resources so that, when usage exceeds the threshold, it can throttle requests from one or more users. This will enable the system to continue functioning and meet any service level agreements (SLAs) that are in place.
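
One simple way to cap resource use is to limit how many requests are in flight at once. The following is a minimal sketch, assuming that "resource usage" can be approximated by the number of concurrent requests. The `ConcurrencyLimit` class and its `max_in_flight` parameter are hypothetical names introduced for illustration, not part of any particular framework.

```python
import threading


class ConcurrencyLimit:
    """Hypothetical guard that admits at most `max_in_flight` requests at once."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight  # assumed threshold, tuned per service
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if the request may proceed, False if it should be throttled."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Call when an admitted request finishes, to free its slot."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

A request handler would call `try_acquire()` before doing any work, reject the request if it returns `False`, and call `release()` in a `finally` block once an admitted request completes.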

The system could implement several throttling strategies, including:

Rejecting requests from an individual user who's already accessed system APIs more than n times per second over a given period of time. This requires the system to meter the use of resources for each tenant or user running an application.
Disabling or degrading the functionality of selected nonessential services so that essential services can run unimpeded with sufficient resources. For example, if the application is streaming video output, it could switch to a lower resolution.
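
The first strategy above, rejecting requests from a user who exceeds a request rate, could be sketched as a fixed-window counter kept per user. This is one possible metering approach among several (sliding windows and token buckets are common alternatives). The `FixedWindowLimiter` class, its parameters, and the in-memory dictionary are assumptions made for illustration; a multi-instance service would likely need shared storage for the counters.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Hypothetical limiter: at most `limit` requests per user per window."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so behavior can be checked without sleeping
        # user_id -> (window start time, requests counted in that window)
        self._usage: Dict[str, Tuple[float, int]] = {}

    def allow(self, user_id: str) -> bool:
        """Return True if the request is within the user's quota, else False."""
        now = self.clock()
        start, count = self._usage.get(user_id, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired, so start a fresh one.
            start, count = now, 0
        if count >= self.limit:
            self._usage[user_id] = (start, count)
            return False
        self._usage[user_id] = (start, count + 1)
        return True
```

Each user's quota is tracked independently, so one user exhausting their limit doesn't affect requests from other users.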

Issues and Considerations:

Throttling an application, and the strategy to use, is an architectural decision that impacts the entire design of a system. Throttling should be considered early in the application design process because it isn't easy to add once a system has been implemented.
If a service needs to temporarily deny a user request, it should return a specific error code so the client application understands that the operation was refused because of throttling. The client application can wait for a period before retrying the request.
Throttling can be used as a temporary measure while a system autoscales. In some cases, it's better to simply throttle rather than to scale if a burst in activity is sudden and isn't expected to be long-lived, because scaling can add considerably to running costs.
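
The error-code consideration above can be illustrated with a small client-side retry loop. This sketch assumes the service signals throttling with HTTP status 429 (Too Many Requests) and a `Retry-After` header expressed in seconds; the text above doesn't prescribe a particular code, so that choice is an assumption. The `Response` tuple, `throttled_response`, and `call_with_retry` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Dict, Tuple

# Simplified stand-in for an HTTP response: (status, headers, body).
Response = Tuple[int, Dict[str, str], str]

TOO_MANY_REQUESTS = 429  # assumed throttling status code


def throttled_response(retry_after_seconds: int) -> Response:
    """Build the response a service might return when it throttles a request."""
    return (TOO_MANY_REQUESTS,
            {"Retry-After": str(retry_after_seconds)},
            "throttled")


def call_with_retry(send: Callable[[], Response], max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send`, waiting and retrying while the service reports throttling."""
    response = send()
    for _ in range(max_attempts - 1):
        status, headers, _body = response
        if status != TOO_MANY_REQUESTS:
            break
        # Assumes Retry-After holds a number of seconds; default to 1 if absent.
        sleep(float(headers.get("Retry-After", "1")))
        response = send()
    return response
```

If the service is still throttling after `max_attempts` calls, the last throttled response is returned so the caller can decide whether to give up or surface the error.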