The following continues the series of blog posts on Guanaco, our internal service responsible for the submission, processing and serving of large language models (LLMs) to our app. Read the previous entry here!
Guanaco consists of several containerised Python web services exposing a unified API to the outside world. Part of this API is responsible for processing user messages to their bots. Naturally, this service receives a large number of concurrent requests and must therefore be load-balanced.
We achieve this by hosting Guanaco on the Cloud Run service offered by Google Cloud Platform. This cloud service runs our containers in a horizontally-scaled fashion (in fact, on a Kubernetes cluster), spinning up new containers of the same service on-demand in order to deal with fluctuations in traffic.
Recently, as we became more confident in the service, we started routing more traffic through Guanaco. We quickly noticed that the tail latencies of the service were degrading. On closer inspection, we found that the Guanaco container was taking as much as 30 seconds to spin up! This is of course no good when trying to respond quickly to user messages. In particular, we expect 70% of messages to be served within 5 seconds, and 99% of messages in under 10 seconds (for messages that reach our larger, more intensive models).
Diagnosing the cause
The first step in dealing with this problem always lies in reproducing it locally. This was actually more subtle than it sounds, as when I initially tried to do this, the Docker container took less than a second to launch on my machine 🤔
I started to play with the CPU and memory parameters passed to the container and found that, with specs closer to those Cloud Run provides, I was able to reproduce the long cold-start time.
docker run --cpus=0.2 -m 1000m -it --rm guanaco-repo /bin/bash -c "./start_guanaco.sh"
With this in hand, I next set out to profile the start-up of the container. I had a hunch that some dodgy imports were responsible for this, so I started the server up with the environment variable PYTHONPROFILEIMPORTTIME=1.
This variable (equivalent to running Python with the -X importtime option) instructs the interpreter to report how long each and every import takes when running a given script. Doing this revealed that up to 16 seconds were being spent loading imports from external dependencies!
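The raw report is long, so it helps to rank it. The following is a minimal sketch, not the script used for Guanaco: it assumes the standard three-column report that -X importtime writes to stderr, and the helper names parse_importtime and slowest_imports are hypothetical.

```python
import subprocess
import sys


def parse_importtime(report):
    """Parse `-X importtime` output into (cumulative_us, self_us, module) rows."""
    rows = []
    for line in report.splitlines():
        if not line.startswith("import time:"):
            continue
        fields = line[len("import time:"):].split("|")
        if len(fields) != 3:
            continue
        try:
            self_us, cumulative_us = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # the header row has no numbers
        rows.append((cumulative_us, self_us, fields[2].strip()))
    # Largest cumulative time first.
    return sorted(rows, reverse=True)


def slowest_imports(module, top=10):
    """Import `module` in a fresh interpreter and return its slowest imports."""
    result = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_importtime(result.stderr)[:top]


if __name__ == "__main__":
    # "json" is only a placeholder; substitute the service's entry module.
    for cumulative_us, self_us, name in slowest_imports("json"):
        print(f"{cumulative_us / 1000:8.1f} ms  {name}")
```

Running the import in a fresh interpreter matters: a module that is already cached in sys.modules would report a misleadingly small time.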
This was not great as far as the chat API was concerned, as some of these imports were not even relevant to the code responsible for processing chats. They were only being used to process requests to other non-chat related endpoints. So these imports were delaying the fulfilment of bot messages, despite not being relevant for them.
In order to solve this, we decided to convert these imports to lazy imports. To be more precise, Python evaluates all module-level imports eagerly, as soon as the importing module is loaded, recursively cascading through the import tree until all necessary packages have been imported. Lazy imports, on the other hand, are evaluated just-in-time, when the imported module is first used.
As a first approach, one could in fact import anywhere in Python, even inside functions. This in and of itself would solve the problem but, while it is considered Pythonic by many, I did not find it to be a satisfying and clean solution. In addition, it is not DRY, possibly requiring multiple imports of the same package across several different functions.
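To make the trade-off concrete, here is a small sketch of function-level imports. The two handlers are hypothetical stand-ins for non-chat endpoints, and csv stands in for a heavy dependency; note how the same import has to be repeated in each function.

```python
def export_report(rows):
    """Hypothetical non-chat endpoint: csv is imported on the first call only."""
    import csv
    import io

    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def count_rows(text):
    """A second hypothetical endpoint that has to repeat the same imports."""
    import csv
    import io

    return sum(1 for _ in csv.reader(io.StringIO(text)))


if __name__ == "__main__":
    report = export_report([["bot", "messages"], ["alice", 3]])
    print(count_rows(report))
```

After the first call the module is cached in sys.modules, so repeated calls only pay for a dictionary lookup.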
I really longed for a solution that allowed me to maintain imports in the usual global scope, but just have them be lazily evaluated.
Thankfully, the package lazy-import does exactly that. With the following code placed in the repo’s __init__.py, for example, any import of the kubernetes package is evaluated only when the package is first invoked:
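The original snippet is not reproduced here. With lazy-import it is, as far as I understand the package, a one-line call along the lines of lazy_import.lazy_module("kubernetes"). To illustrate the same idea without a third-party dependency, the sketch below swaps in the standard library's importlib.util.LazyLoader instead; the helper name lazy_module is my own, and colorsys merely stands in for a heavy package such as kubernetes.

```python
import importlib.util
import sys


def lazy_module(name):
    """Return `name` as a module whose code runs on first attribute access."""
    if name in sys.modules:
        return sys.modules[name]  # already imported; nothing to defer
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ModuleNotFoundError(f"No module named {name!r}")
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    # Register the placeholder so later `import name` statements reuse it.
    sys.modules[name] = module
    loader.exec_module(module)  # sets up laziness; the module body has not run yet
    return module


# Stand-in for a heavy dependency such as kubernetes (assumption for illustration).
colorsys = lazy_module("colorsys")

if __name__ == "__main__":
    # The first attribute access is what actually executes the module.
    print(colorsys.rgb_to_hsv(1, 0, 0))
```

Because the placeholder is registered in sys.modules, ordinary global import statements elsewhere in the code base pick it up unchanged, which is what keeps the imports in their usual place.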
And that’s all that was required! After identifying the worst offending imports and lazy importing them in this way, I saw the average container start time on Cloud Run drop to roughly five seconds.
This drastically improved our tail latencies: the chatbot message queue no longer builds up while waiting for containers to start. Rather, containers now spin up very fast, and the heavier imports in a given container are delayed until an endpoint that actually requires them is invoked.
N.B. If desired, one could go even further here and automatically lazy-load all of one's package requirements by iterating over the entries of the usual requirements.txt and passing them to lazy_import.lazy_module.
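A rough sketch of that idea follows. Everything in it is an assumption for illustration: the helper names are hypothetical, the lazy-loading function is passed in as register rather than hard-coded, and the parser only handles simple name-plus-version lines. Note also that a distribution name does not always match its import name (PyYAML is imported as yaml, for instance), so a real version would need an explicit mapping.

```python
import re

# Leading project name of a requirement line, e.g. "requests" in "requests[security]>=2".
_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def requirement_names(text):
    """Guess importable names from the contents of a requirements.txt file."""
    names = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        # Skip blanks, pip options (-r, -e, --index-url) and URL requirements.
        if not line or line.startswith("-") or "://" in line:
            continue
        match = _NAME.match(line)
        if match:
            # Assumption: the import name is the project name with "-" as "_".
            names.append(match.group(0).replace("-", "_"))
    return names


def register_lazily(names, register):
    """Call `register(name)` for each name, skipping ones that cannot be found."""
    registered = []
    for name in names:
        try:
            register(name)
        except ImportError:
            continue
        registered.append(name)
    return registered


if __name__ == "__main__":
    sample = "kubernetes==26.1.0\ntyping-extensions>=4\n"
    print(register_lazily(requirement_names(sample), register=print))
```

In practice register would be whichever lazy-loading function is in use, and the call would sit in the package's __init__.py so that it runs before any of the real imports.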
Alex is a backend engineer at Chai. He has been writing software and contributing to various open-source projects for the better part of 15 years. He holds Masters and PhD degrees in Pure Mathematics from Cambridge and King’s College London respectively.