Deploying language models is a fiddly business. It is one thing to spin up a server with a beefy GPU and start making single inference requests; it is quite another to scale to hundreds of millions of inference requests a day. Is it possible to do this seamlessly and swiftly, and to start receiving immediate human feedback from diverse audiences? In this series of blog posts we will offer a technical deep-dive into how our engineering team tackled these problems and, ultimately, how that work paved the way for the world’s first user-evaluated LLM competition: Guanaco.
Setting the scene
To set the scene, let me first give you a high-level introduction to our infrastructure here at Chai. Our frontend, the Chai app, is available on iOS and the Google Play Store, and currently serves roughly one million daily active users. Powering this behind the scenes, we host a myriad of containerized Python micro-services on Google Cloud Platform. Each of these RESTful micro-services is responsible for a different feature of the app.
Arguably the most important service we host is our bot response service: the program ultimately responsible for parsing the messages users send to bots and forwarding them to our models for inference. These models run on an impressive 700-GPU-strong Kubernetes cluster hosted by the good folks at CoreWeave.
Managing and maintaining such a cluster is a full-time job in-and-of itself (and we’re hiring!). Designing a system which can quickly and automatically launch new models to the cluster, and make them immediately available, requires a thoughtful approach.
The high-level approach
When we first embarked on this idea, we began by setting out our constraints, which included:
Models should be deployable, without manual intervention, by anyone familiar with Python.
Model developers should start receiving their first user feedback within roughly five minutes of submission.
The user experience must be seamless: choosing to switch to a Guanaco model should incur no latency penalties or other such poor UX.
The service must be maintainable by a team of no more than three engineers (after all, we’re a company of fewer than ten people!).
It became clear very quickly that to achieve this, especially with our engineering head-count, we had to apply a principled approach. Ultimately, for us, this meant heavy and judicious use of test-driven development (TDD). Coming from a quantitative-trading background, we found this a very natural approach: in that domain, poor test coverage in any one area of the code could incur extremely heavy financial losses.
We thus began with an overarching end-to-end test, which starts with the developer submitting her model and ends with an assertion that, within five minutes, she has received her first feedback. With this failing test in hand, we then started to flesh out the submission and deployment system one failing unit test at a time.
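To make this concrete, here is a minimal sketch of what such an end-to-end test might look like. The function names (`submit_model`, `get_feedback`) are hypothetical stand-ins rather than our actual interfaces, and the stubs return canned values so the example runs standalone.

```python
import time

# Hypothetical stand-ins for the real submission and feedback APIs;
# the names and return shapes are illustrative only.
def submit_model(model_url: str) -> str:
    """Pretend to submit a model and return a submission id."""
    return "submission-123"

def get_feedback(submission_id: str) -> list[dict]:
    """Pretend to poll the feedback store; a real test would hit the service."""
    return [{"thumbs_up": True, "text": "fun bot!"}]

def test_submission_receives_feedback_within_five_minutes():
    deadline = time.monotonic() + 5 * 60
    submission_id = submit_model("hf-org/my-llama-finetune")
    feedback = []
    while time.monotonic() < deadline:
        feedback = get_feedback(submission_id)
        if feedback:
            break
        time.sleep(10)  # a real test would poll the live service
    assert feedback, "expected at least one piece of user feedback"

test_submission_receives_feedback_within_five_minutes()
```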
While not a web-development project per se, we also drew heavy inspiration from familiar web-development principles such as the Model-View-Controller (MVC) pattern and Object-Relational Mapping (ORM). I personally feel that these guiding principles have helped us build a trustworthy system that is both scalable and easily maintainable.
Diving now into the more technical details, let me describe the journey of a large language model (LLM) through the Guanaco service. An LLM starts its life as a fine-tune of some foundation model (usually Meta’s Llama), hosted on the popular model-storage platform HuggingFace.
Via our pip package, chai-guanaco, a developer can submit their model with one easy command, which prompts the Guanaco service to launch a submission job.
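For a rough feel of the developer-side flow, here is an illustrative sketch; the function name, argument, and return value are assumptions for the sake of the example, not the actual chai-guanaco API.

```python
# Hypothetical sketch of the one-command submission; the real package's
# interface may differ.
def submit(model_repo: str) -> str:
    """Submit a HuggingFace-hosted model and return a submission id.

    In the real flow this would call out to the Guanaco service, which
    then launches the submission job described below."""
    return "submission-" + model_repo.replace("/", "-")

submission_id = submit("my-org/llama-7b-roleplay")
print(submission_id)
```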
This job records some metadata to a database, then very quickly begins the model preprocessing stage. This involves downloading the LLM from HuggingFace, running some sanity checks, and then serializing it with the very cool package tensorizer, written by engineers at CoreWeave. The serialized model is then uploaded to CoreWeave’s regional cache, a specialized S3-compatible storage service built to ensure incredibly fast access to models from their various data centers.
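The stages of the preprocessing job can be sketched as follows. Every function body here is a stub standing in for the real work (HuggingFace download, sanity checks, tensorizer serialization, cache upload), and the paths and bucket names are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    model_repo: str
    status: str = "queued"

def download_model(repo: str) -> str:
    return f"/tmp/models/{repo}"           # stub: HuggingFace download

def run_sanity_checks(local_path: str) -> None:
    assert local_path                      # stub: e.g. do the weights load?

def tensorize(local_path: str) -> str:
    return local_path + ".tensors"         # stub: tensorizer serialization

def upload_to_cache(artifact: str) -> str:
    # stub: upload to the regional cache; bucket name is invented
    return f"s3://guanaco-cache/{artifact.split('/')[-1]}"

def preprocess(sub: Submission) -> str:
    """Run the full preprocessing pipeline and return the cached model URI."""
    path = download_model(sub.model_repo)
    run_sanity_checks(path)
    artifact = tensorize(path)
    uri = upload_to_cache(artifact)
    sub.status = "preprocessed"
    return uri

print(preprocess(Submission("my-org/llama-finetune")))
```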
Once this stage is complete, which takes no longer than two to three minutes, we then publish a new inference service manifest to our Kubernetes cluster, instructing it to launch our model inference server code and serve the submitted model.
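As a rough illustration, the manifest for one model’s inference service might be generated along these lines. The structure follows standard Kubernetes Deployment conventions, but the image name, labels, and environment variables are assumptions rather than our actual manifest.

```python
def inference_manifest(model_id: str, model_uri: str, gpu_count: int = 1) -> dict:
    """Build a Kubernetes Deployment spec for one model's inference server.

    Field layout follows the standard apps/v1 Deployment schema; the
    image, labels, and env vars are illustrative placeholders."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"inference-{model_id}"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"model": model_id}},
            "template": {
                "metadata": {"labels": {"model": model_id}},
                "spec": {
                    "containers": [{
                        "name": "inference-server",
                        "image": "registry.example.com/inference-server:latest",
                        "env": [{"name": "MODEL_URI", "value": model_uri}],
                        # Request GPUs via the standard NVIDIA resource name
                        "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                    }],
                },
            },
        },
    }

manifest = inference_manifest("sub-123", "s3://guanaco-cache/model.tensors")
print(manifest["metadata"]["name"])
```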
Our inference server code is based heavily on the excellent open-source package vLLM. Together with some in-house optimizations and integrations with our S3 cache, this inference service is usually ready to start serving requests in under two minutes.
In principle, the model is now ready to serve. However, to ensure the best experience for our users, we run various stress-checks against it, verifying that the model’s inference latencies fall within an acceptable range for the end-user.
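A stress-check of this kind essentially boils down to a latency assertion. Here is a minimal sketch using a 95th-percentile budget; the three-second figure is purely illustrative, not our actual threshold.

```python
import statistics

def passes_stress_check(latencies_ms: list[float],
                        p95_budget_ms: float = 3000.0) -> bool:
    """Accept a model only if its 95th-percentile latency is within budget.

    The budget here is an illustrative figure, not an actual SLO."""
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]
    return p95 <= p95_budget_ms

fast_model = [800.0] * 95 + [1200.0] * 5    # tight latencies: passes
slow_model = [800.0] * 80 + [9000.0] * 20   # long tail: fails
print(passes_stress_check(fast_model), passes_stress_check(slow_model))
```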
Once this is complete, the model is published to the outside world: its status is updated to “deployed”, and the app’s backend code kicks in.
Whenever a user decides to start a new conversation in our so-called “beta mode” (after consenting to share their conversation), our bot response service queries the Guanaco service to find out which models need feedback. It then chooses a model for the user to chat with, usually one of the most recently submitted.
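The selection logic can be sketched roughly as follows; the record shape and the `feedback_target` field are illustrative assumptions, not our actual schema.

```python
from datetime import datetime, timedelta

def choose_model(deployed: list[dict]) -> str:
    """Pick a model for the next beta-mode conversation, preferring the
    most recently submitted model that still needs feedback."""
    candidates = [m for m in deployed
                  if m["feedback_count"] < m["feedback_target"]]
    newest = max(candidates, key=lambda m: m["submitted_at"])
    return newest["model_id"]

now = datetime(2023, 8, 1, 12, 0)
deployed = [
    {"model_id": "a", "submitted_at": now - timedelta(hours=3),
     "feedback_count": 10, "feedback_target": 50},
    {"model_id": "b", "submitted_at": now - timedelta(minutes=10),
     "feedback_count": 0, "feedback_target": 50},
    {"model_id": "c", "submitted_at": now - timedelta(days=2),
     "feedback_count": 50, "feedback_target": 50},  # already has enough
]
print(choose_model(deployed))  # picks the newest under-sampled model
```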
After a certain number of messages, or when backing out of the conversation, the user is prompted to give a thumbs-up rating and some plaintext feedback. This feedback is routed to our feedback micro-service, which categorizes the feedback each model has collected. Once it finds its home in the database, our model developers can easily view it, along with the anonymized conversation, via the chai-guanaco pip package.
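The collation step might look something like this sketch, where the event field names are assumptions for illustration.

```python
from collections import defaultdict

def collate_feedback(events: list[dict]) -> dict:
    """Group raw feedback events by model, tracking thumbs-up counts
    and plaintext comments. Field names are illustrative assumptions."""
    summary: dict = defaultdict(
        lambda: {"thumbs_up": 0, "total": 0, "comments": []})
    for e in events:
        s = summary[e["model_id"]]
        s["total"] += 1
        s["thumbs_up"] += int(e["thumbs_up"])
        if e.get("text"):                 # skip empty comments
            s["comments"].append(e["text"])
    return dict(summary)

events = [
    {"model_id": "m1", "thumbs_up": True, "text": "great personality"},
    {"model_id": "m1", "thumbs_up": False, "text": ""},
    {"model_id": "m2", "thumbs_up": True, "text": "a bit repetitive"},
]
print(collate_feedback(events)["m1"])
```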
Alex is a backend engineer at Chai. He has been writing software and contributing to various open-source projects for the better part of 15 years. He holds Master’s and PhD degrees in Pure Mathematics from Cambridge and King’s College London respectively.