What is a rate limiter / rate limiting?
A rate limiter controls the rate of requests a client or service can send over a given period; requests above the threshold are throttled or dropped. Rate limiting can help stop unusual bot activity to a certain extent, and it also reduces strain on web servers.
Bot attacks that can be mitigated by rate limiting:
- Brute force attacks
- DoS and DDoS attacks
- Web Scraping

A rate limiter can be built using different algorithms; as a candidate, you should ask the interviewer about the kind of rate limiter you are expected to build.
Some questions that can be asked include:
- Is the rate limiter we are going to design a client-side rate limiter or a server-side API rate limiter?
- What is the scale of the system? Is it being built for a large system or a small one?
- Will the rate limiter live inside the application code or run as a separate service?
- Is the environment it will run in distributed or not?

Typical requirements for the design include:
- The rate limiter should accurately limit excess requests.
- It should add little latency and use as few resources as possible.
- It should handle exceptions gracefully; for example, users should be informed when their requests are throttled.
- The rate limiter going down should not become a single point of failure for the system.
Client Side Rate Limiting
The client side is usually not a reliable place to enforce rate limiting, as it can easily be bypassed by malicious users. We also have very little control over the client implementation, which is another reason not to enforce rate limiting on the client side.
Server Side Rate Limiting
In server-side rate limiting, each server has a limit on how many user requests it handles, ensuring all servers are used and none is overburdened or idle. When the requests on one server exceed the threshold, new requests are forwarded to the next server, which also helps limit DDoS attacks.
Middleware Rate Limiting
A middleware rate limiter acts as an interface between the client and the server. Such HTTP middleware restricts the maximum number of HTTP requests allowed per second.
For example, if the API allows 2 requests per second, then as soon as it receives more than 2 requests in a second, it throttles the extra requests and returns HTTP 429 (Too Many Requests), as sketched below.
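Below is a minimal sketch of such a middleware in Python, assuming a generic WSGI-style application; the class name, the 2-requests-per-second limit, and the single global counter are illustrative, not taken from any particular framework.

```python
import time

class SimpleRateLimitMiddleware:
    """Illustrative WSGI-style middleware: allow at most `limit` requests per second."""

    def __init__(self, app, limit=2):
        self.app = app
        self.limit = limit
        self.window_start = int(time.time())
        self.count = 0

    def __call__(self, environ, start_response):
        now = int(time.time())
        if now != self.window_start:      # a new one-second window has started
            self.window_start, self.count = now, 0
        if self.count >= self.limit:      # over the limit: throttle the request
            start_response("429 Too Many Requests", [("Content-Type", "text/plain")])
            return [b"Too Many Requests"]
        self.count += 1
        return self.app(environ, start_response)
```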

An API gateway is a component in cloud microservice architectures that supports rate limiting, authentication, IP whitelisting, SSL termination, and more.
Server side or middleware: which rate limiter to choose?
- If the programming language and stack the company currently uses are efficient enough to implement server-side rate limiting, then server-side rate limiting can be used.
- If the microservice you are working on runs in the cloud and already has an API gateway, the rate limiter can be added to the gateway.
Rate Limiting Algorithms
- Fixed Window
- Sliding Window Log
- Sliding Window Counter
- Token Bucket
- Leaking Bucket
Fixed Window Algorithm
- The timeline is divided into fixed-size windows, and a counter is assigned to each window.
- Each time a request is received, the counter is incremented by one.
- Once the counter reaches the threshold, all new requests are dropped until a new time window starts.

- In this example, each window starts with a quota counter of 4.
- When a request arrives, it is allocated to a window based on the FLOOR of its timestamp.
- If a request is received at 12:01:45, it is allocated to the window starting at 12:01:00.
- If the window's counter is > 0, the counter is decremented and the request is processed; otherwise, the request is discarded (a minimal code sketch follows below).
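A minimal in-memory sketch of the fixed window logic described above, assuming a quota of 4 requests per one-minute window; the dictionary-based counter store and function names are illustrative.

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60   # window size, e.g. one minute
LIMIT = 4             # max requests allowed per window

counters = defaultdict(int)   # window start timestamp -> number of requests seen

def allow_request(now=None):
    now = now if now is not None else time.time()
    window = int(now // WINDOW_SECONDS) * WINDOW_SECONDS   # FLOOR of the timestamp
    if counters[window] >= LIMIT:
        return False          # quota exhausted: drop until the next window starts
    counters[window] += 1
    return True
```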
Advantages of Fixed Window Algorithm
- Easy to understand
- Memory Efficient
- Low latency
Disadvantages of Fixed Window Algorithm
- A sudden burst of traffic near the window boundary can let through more requests than the intended rate.
- If the counter runs out at the start of a window, clients have to wait for the rest of the window before any request is accepted.
Sliding Window Log Algorithm
- A log stores the timestamps of incoming requests. Timestamps and similar data are usually kept in a cache, for example Redis.
- When a new request is received, all outdated timestamps are removed first. Outdated timestamps are those older than the start of the current time window.
- The new request's timestamp is then added to the log. If the log size is within the maximum allowed count, we accept the request; otherwise we reject it (the timestamp stays in the log either way).
In the below example, rate limiter will allow only 2 requests/minute.
- The log is empty when a new request is received at 1:00:01, so this request is accepted and its timestamp is added to the log.
- After some time a new request is received at 1:00:30; it is also accepted and its timestamp is inserted into the log. The log size now equals the maximum allowed size, i.e. 2.
- A new request is received at 1:00:50 and its timestamp is inserted into the log. After insertion the log size is greater than 2, so this request is rejected, but the timestamp remains in the log.
- A new request arrives at 1:01:40. Requests in the range [1:00:40-1:01:40] fall within the latest time frame, but requests before 1:00:40 are outdated, so the outdated timestamps 1:00:01 and 1:00:30 are removed from the log. After removal the log size is again 2, and the request is accepted (a code sketch follows below).
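A minimal sketch of the sliding window log, assuming 2 requests per minute and a plain in-memory list in place of a Redis log; names are illustrative.

```python
import time

WINDOW_SECONDS = 60
LIMIT = 2

log = []   # timestamps of requests, including rejected ones

def allow_request(now=None):
    now = now if now is not None else time.time()
    cutoff = now - WINDOW_SECONDS
    # remove timestamps older than the start of the current time window
    while log and log[0] <= cutoff:
        log.pop(0)
    log.append(now)            # the timestamp is stored even if the request is rejected
    return len(log) <= LIMIT   # accept only while the log size is within the limit
```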
Advantages
Ensures the rate limit is strictly adhered to at any given point in time.
Disadvantages
More memory is required, as even rejected requests have their timestamps stored in memory.

Sliding Window Counter Algorithm
The sliding window counter algorithm is a combination of the fixed window counter and sliding window log algorithms.
- The entire window is broken down into smaller buckets. The size of each bucket depends on how much elasticity is allowed for the rate limiter.
- Each bucket stores the request count corresponding to its time range.

Let’s calculate the weighted counter for the previous fixed time window. Then, we can estimate how many requests are in the current sliding window.
The formula to estimate the number of requests in a rolling window is:
requests in the current window + requests in the previous window * overlap percentage of the rolling window with the previous window
For the above example:
previous window request count = 8
current window request count = 5
previous window count weight = 0.47
Number of requests in current sliding window: 5 + 8 * 0.47 = 8.76 -> ~9
So if the rate limiter allows a maximum of 10 requests per minute, the current request can go through; the limit will be reached after roughly one more request. The small helper below reproduces this calculation.
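A small helper reproducing the calculation above; the function name and parameters are illustrative.

```python
def sliding_window_count(current_count, previous_count, previous_window_overlap):
    """Estimate the number of requests in the rolling window.

    previous_window_overlap: fraction of the previous fixed window that still
    overlaps the rolling window (0.0 - 1.0), i.e. the previous window's weight.
    """
    return current_count + previous_count * previous_window_overlap

# Numbers from the example above: the previous window's weight is 0.47.
estimate = sliding_window_count(current_count=5, previous_count=8, previous_window_overlap=0.47)
print(round(estimate, 2))   # 8.76 -> roughly 9, still below a limit of 10 requests/minute
```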
Advantages
- Works well in most scenarios and smooths out traffic spikes, since the rate is based on the average rate of the previous window.
- Memory efficient as only counts are stored.
Disadvantages
- Only suitable for not-so-strict look-back windows, especially for smaller unit times. It is an approximation of the actual rate because it assumes requests in the previous window were evenly distributed.
Token Bucket Algorithm
So how does the token bucket algorithm work?
- The token bucket algorithm is based on an analogy of a fixed capacity bucket into which tokens are added at a fixed rate.
- Before allowing an API call to proceed, the bucket is inspected to see whether it contains at least one token.
- If so, one token is removed from the bucket and the API call is allowed to proceed. If no tokens are available, the call is rejected with a response saying the quota has been exceeded for that window.

Token bucket algorithm works on two parameters:
- Bucket size: the maximum number of tokens the bucket can hold.
- Refill rate: the number of tokens added back to the bucket every second.
- If throttling on the basis of IP addresses is required, each IP address needs its own bucket (see the sketch below).
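A minimal sketch of a token bucket, assuming the two parameters above; class and method names are illustrative.

```python
import time

class TokenBucket:
    """Illustrative token bucket: holds at most `capacity` tokens, refilled at `refill_rate` tokens/second."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity          # the bucket starts full
        self.last_refill = time.time()

    def allow_request(self):
        now = time.time()
        # add tokens for the time elapsed since the last refill, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1            # consume one token and let the request through
            return True
        return False                    # no tokens left: quota exceeded for now

# One bucket per IP address if throttling per IP, e.g. buckets[ip] = TokenBucket(10, 1)
```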
Advantages
- Easy to understand and implement
- Memory efficient
- Allows burst of traffic for short period
Disadvantages
- Bucket size and refill rate require proper tuning; otherwise rate limiting will not work as expected.
Leaking Bucket Algorithm
The leaky bucket algorithm is similar to the token bucket, except that requests are processed at a fixed rate. The idea behind the algorithm is simple: imagine a bucket that is being filled with incoming tasks. The bucket has holes at the bottom through which tasks drip down to the task processor.

There are a few states the bucket may be in:
- empty
- full
- partially filled (neither empty nor full).
As long as the bucket is not full, it’ll keep handing over tasks to the processor at a constant rate. When it’s full, it should “overflow” and no incoming tasks are allowed in until some tasks are done. It is usually implemented with a first-in-first-out (FIFO) queue.
Algorithm
- The system checks whether the bucket/queue is full. If it is not full, the request is added to the queue; otherwise the request is dropped.
- Requests are pulled from the bucket/queue at regular intervals for processing (a sketch follows below).
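A minimal sketch of the leaky bucket queue, assuming an in-process FIFO queue drained at a fixed rate; names and the leak bookkeeping are illustrative.

```python
import time
from collections import deque

class LeakyBucket:
    """Illustrative leaky bucket: a bounded FIFO queue drained at a fixed rate."""

    def __init__(self, capacity, leak_rate_per_sec):
        self.capacity = capacity
        self.queue = deque()
        self.leak_interval = 1.0 / leak_rate_per_sec
        self.last_leak = time.time()

    def _leak(self):
        now = time.time()
        if not self.queue:
            self.last_leak = now        # nothing queued: restart the leak clock
            return
        # hand queued requests to the processor at the fixed rate
        while self.queue and now - self.last_leak >= self.leak_interval:
            self.queue.popleft()
            self.last_leak += self.leak_interval

    def add_request(self, request):
        self._leak()
        if len(self.queue) >= self.capacity:
            return False                # bucket is full: the request is dropped
        self.queue.append(request)
        return True
```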
Advantages
- Memory efficient
- Requests are processed at fixed rate
Disadvantages
- If old requests are not processed in time, new requests will be rate limited even though they arrived recently.
- Fine-tuning of the two parameters (bucket size and outflow rate) is required.
High Level Design
Usually such counters are stored in an in-memory cache, as it is fast and supports time-based expiration. A cache such as Redis offers two commands that are commonly used:
- INCR: increases the counter by 1.
- EXPIRE: sets a timeout on the counter; when the timeout expires, the counter is automatically deleted.

- The client sends a request to the rate limiter.
- The rate limiter checks the counter in the cache to see whether the limit has been reached.
- If the limit has been reached, the request is rejected; otherwise the request is accepted and the counter is incremented (sketched below).
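A minimal sketch of this counter flow using the redis-py client and the INCR/EXPIRE commands described above; the key naming scheme, the limit of 10 requests, and the 60-second window are assumptions for illustration.

```python
import redis

r = redis.Redis()           # assumes a local Redis instance
LIMIT = 10                  # max requests per window
WINDOW_SECONDS = 60

def allow_request(user_id):
    key = f"rate:{user_id}"            # illustrative key scheme: one counter per user
    count = r.incr(key)                # INCR: creates the counter at 1 or increments it
    if count == 1:
        r.expire(key, WINDOW_SECONDS)  # EXPIRE: counter is deleted once the window ends
    return count <= LIMIT              # reject once the limit is reached
```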
Design In Depth
- How are rate limiting rules created? Where will these rules be stored?
- How will rate-limited requests be handled? What strategies exist for handling them?
- How does rate limiting work in a distributed environment?
- How can the rate limiter be optimized?
In this section we will answer all the above questions one by one.
Rate Limiting Rules:
This is what a rate limiting rule commonly looks like:
Example1:
```yaml
domain: messaging
descriptors:
  - key: message_type
    value: sales
    rate_limit:
      unit: week
      requests_per_unit: 5
```
As per the above rule, the rate limiter is configured to allow a maximum of 5 sales messages per week.
Example2:
```yaml
domain: auth
descriptors:
  - key: auth_type
    value: login
    rate_limit:
      unit: minute
      requests_per_unit: 3
```
As per the above rule, the rate limiter is configured to allow a maximum of 3 user logins per minute.
Handling Rate Limited Requests
How does a client know how many requests it has left before being throttled? The rate limiter returns this information in HTTP response headers:
X-Ratelimit-Remaining: the number of requests still allowed within the current window.
X-Ratelimit-Limit: the number of calls the client can make per time window.
X-Ratelimit-Retry-After: the number of seconds the client has to wait before making another request without being throttled.
If a user sends too many requests and the API returns a 429 response, the rate limiter includes the X-Ratelimit-Retry-After header so the client knows when it can retry.
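A small sketch of how a throttled response could carry these headers; the function and the example values are illustrative rather than any specific framework's API.

```python
def throttled_response(limit, remaining, retry_after_seconds):
    """Build an illustrative 429 response carrying the rate-limit headers."""
    headers = {
        "X-Ratelimit-Limit": str(limit),                       # calls allowed per window
        "X-Ratelimit-Remaining": str(remaining),               # calls left in this window
        "X-Ratelimit-Retry-After": str(retry_after_seconds),   # seconds to wait before retrying
    }
    return 429, headers, "Too Many Requests"

status, headers, body = throttled_response(limit=100, remaining=0, retry_after_seconds=30)
```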

- Workers pull rules from disk at regular intervals and store them in the cache.
- The client sends a request to the rate limiter.
- The rate limiter loads the rules from the cache.
- If the request meets the rules and is not rate limited, the rate limiter accepts it, forwards it to the server, and increments the counter in the cache.
- If the request is rate limited, the rate limiter returns a 429 status code; the request can then either be dropped or added to a queue for later processing.
Rate Limiter in a Distributed Environment
- Race condition: on receiving a request, the rate limiter reads the counter from the cache, checks whether it has reached the threshold, and if not, increments it by one and writes it back. Suppose the counter value in the cache is 2 and two concurrent requests each read this value, increment it by one, and write it back without any coordination. Both requests believe the correct value is 3, whereas it should be 4. One way to avoid this race condition is to introduce locks, but locks significantly slow down the system. The sorted set data structure in Redis and Lua scripts are commonly used to solve this problem (see the sketch after this list).
- Synchronization issue: in a distributed environment, synchronization must also be considered. As the number of users grows, a single rate limiter server is not enough, so multiple rate limiter servers are required and they must stay in sync. The rate limiter needs to track the requests belonging to each user; if one request from a user goes to one rate limiter server and the next goes to another, the rate limiter cannot do its job. One way to solve this is sticky sessions, which route a client's traffic to the same rate limiter, but this is not recommended as it is neither scalable nor flexible. A better solution is to use a centralized data store such as Redis.
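One common way to make the read-check-increment atomic is to run it as a Lua script inside Redis, since Redis executes a script as a single operation. Below is a minimal sketch with redis-py; the key scheme, limit, and window length are assumptions for illustration.

```python
import redis

r = redis.Redis()

# The Lua script runs atomically inside Redis, so two concurrent requests
# cannot both read the same counter value and overwrite each other's increment.
LUA_INCR_IF_BELOW_LIMIT = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
if current >= tonumber(ARGV[1]) then
    return 0
end
redis.call('INCR', KEYS[1])
redis.call('EXPIRE', KEYS[1], ARGV[2])
return 1
"""
check_and_increment = r.register_script(LUA_INCR_IF_BELOW_LIMIT)

def allow_request(user_id, limit=10, window_seconds=60):
    key = f"rate:{user_id}"   # illustrative per-user key
    return check_and_increment(keys=[key], args=[limit, window_seconds]) == 1
```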

Rate Limiter Performance Optimization
- Use a multi-data-center setup so that users spread across locations are routed to the closest data center, reducing latency.
- Synchronize data using an eventual consistency model.
- Ensure the rate limiting algorithm is effective.
- Ensure the rate limiting rules are effective.