Understanding AWS Lambda Scaling

AWS Lambda is a serverless computing service. In other words, it allows users to run code without provisioning or managing servers. One of the many reasons AWS Lambda has become so popular is that it is massively scalable. Let's take a look at what this means in practice.

AWS Lambda is all about burstable concurrency

When you invoke your function, AWS Lambda creates an instance of it and runs its handler method to process the event. If that function returns a response and no further events arrive, AWS Lambda simply waits for the next invocation.

If, however, you create another event while the first is still being processed, say by invoking the function again, AWS Lambda will create a second instance and work on both events in parallel. As more events arrive, AWS Lambda will keep creating new instances until either all your requests are being served or you reach your burstable concurrency limit.

After this initial burst, AWS Lambda aims to assign incoming events to existing instances as they become available, and creates new instances only when none are free.
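The assignment rule just described (reuse an idle instance if one exists, otherwise create a new one, up to the burst limit) can be sketched as a toy model. The function below is purely illustrative and has nothing to do with the actual Lambda API:

```python
def simulate_burst(arrivals, burst_limit):
    """Track instance count for a stream of events.

    `arrivals` is a list of markers: "start" means an event arrives,
    "end" means a running event finishes and frees its instance.
    Returns (instances_created, throttled_events).
    """
    instances = 0   # total instances created so far
    busy = 0        # instances currently processing an event
    throttled = 0
    for mark in arrivals:
        if mark == "end":
            busy = max(0, busy - 1)       # an instance becomes idle
        elif busy < instances:
            busy += 1                      # reuse an idle instance
        elif instances < burst_limit:
            instances += 1                 # no idle instance: create one
            busy += 1
        else:
            throttled += 1                 # at the limit: throttle
    return instances, throttled
```

For example, two overlapping events force two instances into existence, but a third event that arrives after one of them finishes is served by an existing, now-idle instance rather than a new one.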

The concurrency limit is the maximum possible number of instances serving requests

A simple way to picture the concurrency limit is to see it as a ceiling for your events. As your events increase in number, they move closer to the ceiling until finally, they bump up against it. 

If you want to push the analogy even further, AWS Lambda actually has two ceilings: think of them as the main ceiling and an attic. The first, the main ceiling, is for your initial burst of traffic; the second, the attic, is the headroom for scaling additional instances upwards until the final limit is reached.

At that point, your options are either to reduce the number of events or to raise the ceiling (i.e. request a higher concurrency limit).

Default concurrency limits vary by region

As is par for the course with AWS, default concurrency limits vary by region. Currently, limits for the initial burst are as follows:

3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland).

1000 – Asia Pacific (Tokyo), Europe (Frankfurt).

500 – Other Regions.

Those huge differentials underline the importance of choosing the best region for your needs. For example, even if you work with data from the EU and have to think about GDPR, you can still choose between Ireland and Frankfurt, and Ireland gives you a much higher burst limit (3000 versus 1000).

From this initial ceiling, your function can continue to scale by up to 500 additional instances each minute until it reaches the regional concurrency limit for your account, which defaults to 1000 instances.
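Putting the figures above together (regional burst limit, 500 extra instances per minute, account concurrency limit), the maximum possible instance count over time can be sketched as follows. The numbers are the defaults quoted in this article; your account's actual values may differ:

```python
def max_concurrency(minutes_elapsed, burst_limit, account_limit, rate=500):
    """Upper bound on concurrent instances after `minutes_elapsed` minutes:
    an immediate burst up to `burst_limit`, then at most `rate` new
    instances per minute, always capped by the account concurrency limit."""
    return min(burst_limit + rate * minutes_elapsed, account_limit)

# In a 500-burst region with a limit raised to 5000, reaching full
# scale takes 9 minutes: 500 at minute 0, then 500 more each minute.
print(max_concurrency(0, 500, 5000))  # 500
print(max_concurrency(9, 500, 5000))  # 5000
```

Note the cap: in a 3000-burst region whose account limit is still the default 1000, the account limit is the binding constraint from the very first second.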

A brief walkthrough of AWS Lambda scaling in practice

You invoke your function, and while it is processing you trigger another invocation, and another. AWS Lambda keeps creating new instances until you reach your burstable concurrency limit. As you keep creating events, AWS Lambda changes tack: it now tries to assign them to existing instances before it creates new ones, adding at most 500 new instances per minute. This results in linear scaling.

If your traffic grows faster than this 500-instances-per-minute rate, or you reach your final concurrency limit, further requests are throttled and fail with a 429 status code. You can, however, request a higher concurrency limit via the Support Center. If the reason for the overload is only temporary, remember to lower it again when you're finished; if it's long-term, remember to update your cost projections accordingly.
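A common way to ride out a temporary spike of 429s on the client side is to retry with exponential backoff. Below is a minimal, generic sketch; `ThrottlingError` is a stand-in for whatever exception your SDK raises on a 429, not a real AWS class:

```python
import random
import time

class ThrottlingError(Exception):
    """Stand-in for the 429 error a throttled invocation returns."""

def invoke_with_backoff(invoke, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry a throttled invocation with exponential backoff and jitter.

    `invoke` is any callable that raises ThrottlingError when the
    service returns a 429; real code would catch the SDK's equivalent.
    """
    for attempt in range(max_attempts):
        try:
            return invoke()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # wait 0.1s, 0.2s, 0.4s, ... with jitter to avoid thundering herds
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
```

Backoff only buys time while instances free up or scale in; if you are persistently over the limit, the fix is the limit increase, not the retries.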

Pro tip: you can use a setting called reserved concurrency to ensure that designated functions have a pool of concurrency allocated for their sole use. This is different from provisioned concurrency, which we'll explain later.

AWS Lambda and latency

If your initialization code takes a long time to run, you are likely to see this reflected in your average and percentile latency. There are two ways to address this. The first is to take a good look at your code and see whether it can be improved; if you can address a problem at its root, then you should probably do so.
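To see why slow initialization shows up mainly in the tail percentiles rather than the median, consider this toy calculation (all numbers invented for illustration):

```python
import math

def latency_percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of latencies in milliseconds."""
    ordered = sorted(samples_ms)
    idx = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[idx]

# 100 requests: 5 hit a cold start (2000 ms init + 100 ms handler),
# the other 95 land on warm instances.
latencies = [2100] * 5 + [100] * 95
print(latency_percentile(latencies, 50))  # 100: the median looks healthy
print(latency_percentile(latencies, 99))  # 2100: cold starts own the tail
```

This is why monitoring only average latency can hide a cold-start problem that your p99 makes obvious.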

If, however, this is not an option, or you are satisfied that your code is as good as it can be, you can use provisioned concurrency. Basically, provisioned concurrency keeps functions initialized and hyper-ready to respond, and, as you'd expect from AWS Lambda, you only pay for the amount of concurrency you configure and for the period it is configured.

You can even take provisioned concurrency to the next level with Application Auto Scaling. This allows you to create a policy that adjusts provisioned concurrency levels automatically based on Lambda's provisioned-concurrency utilization metric.