When Elon Musk footed the bill for OpenAI in 2015, he hoped it would rival Google’s dominance in AI. A series of competitive back-and-forth developments followed, generating unprecedented advances in the field. It was Google’s AI team that developed the transformer architecture in 2017, providing the foundational insight that first suggested AGI might be possible. The crux came in November 2022, when OpenAI launched ChatGPT and transformed how humans interact with technology. From that moment on, AI systems became sophisticated, accessible, and widespread.
However, as successful as OpenAI has become, it has recently struggled to obtain enough new data to keep growing. Even with access to vast swathes of the internet, OpenAI has had to rely on appeals to the public for more data to help refine its models. Large datasets and strong feedback systems are fundamental to AI, and to Large Language Models (LLMs) in particular, for producing the best and most lifelike outcomes. Musk hoped that buying X (Twitter) in 2022 would help solve this problem and supplement OpenAI’s datasets. The limiting factor of advancement in this field is data collection: more data equals better models. It is hardly surprising, then, that an ever-increasing number of companies have turned to crowdsourcing to overcome these barriers. LLaMA-2 and Mistral have implemented open-source models so successfully that they are on par with, or even outperform, closed-source competitors. Clearly, whoever best crowdsources the collection of data and feedback for their systems will win the AI war.
At Chai, our mission is to accelerate the advent of AGI through massively distributed collaboration, i.e. crowdsourcing. By appealing to well-known scaling laws, we anticipate that achieving this will require a language model with on the order of 10 trillion parameters. Several companies, such as OpenAI and Anthropic, focus on internal closed-source research, using internal datasets and powerful compute clusters to train individual large-parameter models.
At Chai, we believe the resources to piece together this large model already exist, but are widely distributed. This is why we have cultivated a community of developers who produce small models (7B and 13B parameters) by fine-tuning them on datasets of their choosing. Once a model is ready, a developer can deploy it via our Chaiverse platform. The model is then served on the Chai App, where millions of users can start chatting with it and provide feedback on its quality.
Since these models are trained by different developers on a wide variety of different datasets, we have discovered that each of them can perform well in some particular “dimension”. For example, some models may excel at “creativity/storytelling”, some might have a very literary way of speaking, some might be “intelligent”, while others still might be “fun”.
MoEs Are Recommender Systems
The challenge, then, is to leverage these wide varieties of models and to serve them in a way that is best aligned with the user’s expectation. This is indeed very similar to the so-called Mixture-of-Experts (MoE) architectures, which consist of several small models (the “experts”). At inference time, one (or even several) expert models are selected to generate the output. The selection is done via an auxiliary gating model, which, based on the input, selects which model to use for inference.
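To make the routing idea concrete, here is a minimal sketch of how a gating model might select an expert. The expert names, feature values, and dot-product scoring rule are all hypothetical, chosen only to illustrate the mechanism.

```python
# Minimal Mixture-of-Experts routing sketch (all names and numbers are illustrative).
# A gating function scores every expert for a given input, and the top-k
# highest-scoring experts are selected to generate the output.

def gate(input_features, expert_embeddings):
    """Score each expert against the input; higher means better fit."""
    return {name: sum(a * b for a, b in zip(input_features, emb))
            for name, emb in expert_embeddings.items()}

def route(input_features, expert_embeddings, top_k=1):
    """Select the top-k experts by gating score."""
    scores = gate(input_features, expert_embeddings)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

experts = {
    "storyteller": [0.9, 0.2, 0.1],
    "factual":     [0.1, 0.1, 0.9],
}
print(route([0.8, 0.3, 0.2], experts))  # the "storyteller" expert wins here
```

In a real MoE the gate is itself a trained model rather than a fixed dot product, but the selection step has this shape.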
This approach offers multiple advantages:
Lower training costs: Training small models is accessible to a wide range of developers. It is well known that training large-parameter LLMs is prohibitively expensive, and as a result out of reach for most machine-learning practitioners.
Lower inference costs: The inference cost of a multitude of small expert models is lower than that of a single large model.
Higher iteration speed: Small models can be trained and deployed much faster than large ones, allowing for much shorter feedback loops and higher iteration speed.
Usually, MoE models are all trained together, along with the gating model, on a wide variety of datasets. This is not the case at Chai: our experts have already been trained by the community, and our goal is to build the gating model. The Chai gating model, which we typically refer to as an LLM-controller, is equivalent to a recommender system that takes a conversation as input and outputs the recommended expert model to serve.
Connecting Millions of LLMs with Chai LLM-Controller
How should this controller system work at a high level? The usual approach is to write the score of serving a model m for a given conversation C as:

score(m, C) = u_m · v_C

where u_m and v_C are relatively low-dimensional “feature” vectors obtained by representing the model and the conversation state in some common ambient feature space. The higher the score, the better aligned the model m is with the conversation.
These feature vectors can be taken to be the “dimensions” discussed in the introduction. To illustrate, imagine we have three dimensions at our disposal: [descriptive, creative/fun, informative/factual]. A conversation and each model are then represented as vectors along these dimensions, and the score measures how well they align.
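A quick worked example of this scoring, with made-up numbers over the three illustrative dimensions:

```python
# Illustrative scores over [descriptive, creative/fun, informative/factual].
# All numbers are invented for demonstration purposes.
conversation = [0.7, 0.9, 0.1]   # a playful, story-driven chat

models = {
    "model_a": [0.8, 0.9, 0.2],  # a descriptive storyteller
    "model_b": [0.1, 0.2, 0.9],  # a factual assistant
}

def score(u_m, v_c):
    """Dot-product alignment between model and conversation features."""
    return sum(a * b for a, b in zip(u_m, v_c))

for name, u in models.items():
    print(name, round(score(u, conversation), 2))
```

Here model_a scores far higher than model_b, so the storyteller is the better fit for this playful conversation.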
The goal of the LLM-controller would then be to optimise this, seeking those models m with the highest such score. When building such a recommender system, there are three main parts we need to investigate:
What is the most sensible metric to optimise?
How do we determine the feature space?
What does our training dataset look like?
There are a few offline metrics that have worked well when training our reward models:
retry: was the message retried?
rating: what was the star rating of the message?
conversation length: how long was the conversation after the message was sent?
For the purpose of the LLM-controller, it is a rather subtle question whether these metrics are sufficient, whether other, more useful ones exist, or even whether a combination of such metrics is better suited:

combined metric = α₁ · retry + α₂ · rating + α₃ · conversation length

where the alpha parameters are coefficients to be determined.
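Such a weighted combination of feedback signals can be sketched as follows. The alpha values below are placeholders: in practice they would be fit offline, and a retry presumably counts against a model.

```python
# Sketch of a combined offline metric; the alpha weights are placeholders
# to be determined from data, not real Chai values.
def combined_metric(retried, star_rating, conversation_length,
                    alphas=(-1.0, 0.5, 0.02)):
    """Weighted sum of feedback signals; retries count against a model."""
    a_retry, a_rating, a_length = alphas
    return (a_retry * float(retried)
            + a_rating * star_rating
            + a_length * conversation_length)

# A 4-star message, not retried, followed by 30 more messages:
print(combined_metric(False, 4, 30))
```

With these placeholder weights, a retried message is penalised while high ratings and long follow-on conversations push the score up.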
In order to score a model m’s suitability for a conversation C, we must map each of them to a set of k numbers: the “feature space”. Since the conversation is a string, we can tokenise it and use a language model to map it to the feature space:

v_C = LLM(tokenise(C))
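As a toy stand-in for that encoder, the sketch below maps a conversation string to a k-dimensional vector by keyword counting. A real controller would use a trained language model; only the input/output shape (string in, k numbers out) is the point here, and all keyword lists are invented.

```python
# Toy conversation encoder (illustrative only): a real LLM-controller would use
# a trained language model, but the mapping has the same shape -- a string in,
# a k-dimensional feature vector out.
KEYWORDS = {
    "descriptive": ["describe", "scene", "vivid"],
    "creative":    ["story", "imagine", "fun"],
    "factual":     ["explain", "why", "how"],
}

def encode_conversation(text, k_dims=("descriptive", "creative", "factual")):
    """Map a conversation string to a k-dimensional feature vector."""
    words = text.lower().split()
    return [sum(w in words for w in KEYWORDS[dim]) for dim in k_dims]

print(encode_conversation("tell me a story and imagine a vivid scene"))  # [2, 2, 0]
```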
Of course, the LLM architecture needs to be selected and its parameters determined via training. The question of how to map an expert model to a set of k features is, however, more complicated:
At this point, the literature on recommender systems often delineates two approaches to this problem:
Collaborative filtering: In this approach, one considers a finite set of models [m_1, m_2, ..., m_N], together with a set of trainable weights

u_{m_1}, u_{m_2}, ..., u_{m_N}

which will be determined through training. This is the approach that several large-scale algorithms take, such as YouTube’s video recommender system. The advantage of this approach is that we remain agnostic about the nature of the k-dimensional feature space: the model itself learns the embeddings on the fly. The inconvenience is that 1) we lose interpretability of the feature space and, mainly, 2) the model would need to be frequently retrained in order to learn the embeddings for newly introduced expert models: a rather daunting engineering task.
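A minimal sketch of the collaborative idea, with hypothetical model IDs and a plain SGD loop: each model’s embedding is a free parameter nudged to predict observed engagement, and the k dimensions carry no fixed meaning.

```python
import random

# Collaborative-filtering sketch: per-model embeddings are free parameters,
# learned from (conversation_features, model_id, engagement) logs.
# All IDs and numbers are hypothetical.
random.seed(0)
K = 3
model_embeddings = {m: [random.uniform(-0.1, 0.1) for _ in range(K)]
                    for m in ["m_1", "m_2", "m_3"]}

def predict(conv_features, model_id):
    """Predicted engagement: dot product of conversation and model features."""
    return sum(a * b for a, b in zip(conv_features, model_embeddings[model_id]))

def sgd_step(conv_features, model_id, engagement, lr=0.1):
    """One gradient step on the squared error between prediction and outcome."""
    err = predict(conv_features, model_id) - engagement
    emb = model_embeddings[model_id]
    for i in range(K):
        emb[i] -= lr * 2 * err * conv_features[i]

# Replaying logged interactions pulls each embedding toward the signal it earns:
for _ in range(200):
    sgd_step([1.0, 0.0, 0.5], "m_1", engagement=1.0)
print(round(predict([1.0, 0.0, 0.5], "m_1"), 2))  # converges toward 1.0
```

The retraining burden mentioned above is visible here: a newly submitted model starts with a random embedding and learns nothing until fresh interaction logs are replayed through it.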
Content-based filtering: The second approach is to model the mapping from expert models to the feature space independently. One way to accomplish this is to ask another language model (such as GPT-4) to score the response of each expert model on a set of (datasets × categories). The great advantage is that each model can be scored on a feature space without the need for retraining. The inconvenience is that we make a conscious, and possibly limiting, decision about what we think the features (or categories) should be.
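The content-based route can be sketched like this. The judge below is a trivial keyword stub standing in for the LLM judge; the categories, rubric, and sample responses are all invented for illustration.

```python
# Content-based sketch: each expert model is scored per category by a judge
# (the text suggests a strong LLM such as GPT-4); here the judge is a stub.
CATEGORIES = ["descriptive", "creative/fun", "informative/factual"]

def judge(response, category):
    """Stand-in for an LLM judge; returns a 0-1 score for one category."""
    rubric = {"descriptive": "vivid", "creative/fun": "story",
              "informative/factual": "because"}
    return 1.0 if rubric[category] in response.lower() else 0.0

def embed_model(sample_responses):
    """Average the judge's per-category scores into a fixed feature vector."""
    return [sum(judge(r, c) for r in sample_responses) / len(sample_responses)
            for c in CATEGORIES]

responses = ["Once upon a time, a story began...", "It rains because warm air rises."]
print(embed_model(responses))  # one score per category, no retraining needed
```

Note how a brand-new expert gets an embedding immediately by scoring a handful of its responses, which is exactly the advantage over the collaborative approach.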
It is also worth noting that many find success in a hybrid approach, where the model trained in the collaborative approach has its dataset augmented with some fixed additional features via the content-based approach.
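In code, the hybrid amounts to concatenating the two kinds of features, so a new model is usable from its content scores alone while the learned part improves with data (a sketch with made-up numbers):

```python
# Hybrid sketch: concatenate learned (collaborative) embeddings with fixed
# content-based scores. Values are illustrative placeholders.
def hybrid_features(learned_embedding, content_scores):
    """Learned part adapts with training; content part needs no retraining."""
    return list(learned_embedding) + list(content_scores)

print(hybrid_features([0.12, -0.4], [0.0, 0.5, 0.5]))
```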
Early Experiments: +68% Engagement vs. GPT-3.5
At Chai, the open-source solution is already well underway. Chaiverse, our developer platform, first launched on April 4th, 2023. It provides a space for developers to connect with millions of unique users, gather feedback, and iterate on and improve their models almost immediately. Thanks to this large influx of creators, the language model powering Chaiverse has already grown to a trillion parameters, with over 1,211 unique expert models submitted to the platform, allowing for a level of customised AI interaction never seen before. The results were remarkable: a carefully optimised mixture of 7B models outperformed OpenAI’s GPT-3.5. Over a four-month period, our in-house LLMs achieved a 20% improvement in day-30 engagement compared with GPT-3.5 models. When the top models from the Chaiverse LLM competition were combined, day-30 engagement rose a further 40% over the in-house models, marking a 68% total increase over GPT-3.5.