State of AI Report - 2024 ONLINE
stateof.ai airstreet.com
Introduction | Research | Industry | Politics | Safety | Predictions #stateofai | 2
Nathan Benaich
Nathan is the General Partner of Air Street Capital, a
venture capital firm investing in AI-first companies. He
runs the Research and Applied AI Summit (RAAIS), the
RAAIS Foundation (funding open-source AI projects), AI
communities in the US and Europe, and Spinout.fyi
(improving university spinout creation). He studied
biology at Williams College and earned a PhD from
Cambridge in cancer research as a Gates Scholar.
stateof.ai 2024
Alex Chalmers
Alex is Platform Lead at Air Street Capital and
regularly writes research, analysis, and commentary on
AI via Air Street Press. Before joining Air Street, he was
an associate director at Milltown Partners, where he
advised big technology companies, start-ups, and
investors on policy and positioning. He graduated from
the University of Oxford in 2017 with a degree in
History.
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is
because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its seventh year. Consider this report as a compilation of the most interesting things we’ve
seen, with the goal of triggering an informed conversation about the state of AI and its implications for the future.
Definitions
Artificial intelligence (AI): a broad discipline with the goal of creating intelligent machines, as opposed to the natural intelligence that is
demonstrated by humans and animals.
Artificial general intelligence (AGI): a term used to describe future machines that could match and then exceed the full range of human cognitive
ability across all economically valuable tasks.
AI Agent: an AI-powered system that can take actions in an environment. For example, an LLM that has access to a suite of tools and has to decide
which one to use in order to accomplish a task that it has been prompted to do.
AI Safety: a field that studies and attempts to mitigate the risks (minor to catastrophic) which future AI could pose to humanity.
Computer vision (CV): the ability of a program to analyse and understand images and video.
Deep learning (DL): an approach to AI inspired by how neurons in the brain recognise complex patterns in data. The “deep” refers to the many layers
of neurons in today’s models, which learn rich representations of data and thereby achieve better performance.
Diffusion: An algorithm that iteratively denoises an artificially corrupted signal in order to generate new, high-quality outputs. In recent years it has
been at the forefront of image generation and protein design.
Generative AI: A family of AI systems that are capable of generating new content (e.g. text, images, audio, or 3D assets) based on 'prompts'.
Graphics Processing Unit (GPU): a semiconductor processing unit that enables a large number of calculations to be computed in parallel. Historically
this was required for rendering computer graphics. Since 2012, GPUs have been adapted for training DL models, which also require a large number of
parallel calculations.
Definitions
(Large) Language model (LM, LLM): a model trained on vast amounts of (often) textual data to predict the next word in a self-supervised manner.
The term “LLM” is used to designate multi-billion parameter LMs, but this is a moving definition.
Machine learning (ML): a subset of AI that often uses statistical techniques to give machines the ability to "learn" from data without being explicitly
given the instructions for how to do so. This process is known as “training” a “model” using a learning “algorithm” that
progressively improves model performance on a specific task.
Model: a ML algorithm trained on data and used to make predictions.
Natural language processing (NLP): the ability of a program to understand human language as it is spoken and written.
Prompt: a user input often written in natural language that is used to instruct an LLM to generate something or take action.
Reinforcement learning (RL): an area of ML in which software agents learn goal-oriented behavior by trial and error in an environment that
provides rewards or penalties in response to their actions. The agent’s strategy for choosing actions is called a “policy”.
Self-supervised learning (SSL): a form of unsupervised learning, where manually labeled data is not needed. Raw data is instead modified in an
automated way to create artificial labels to learn from. An example of SSL is learning to complete text by masking random words in a sentence and
trying to predict the missing ones.
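The masking scheme described above can be sketched in a few lines. This is an illustrative toy, not any specific model's pipeline; the function name and defaults are our own:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Create a self-supervised training pair: randomly replace a fraction of
    tokens with a mask symbol, keeping the originals as the labels to predict.
    No human annotation is needed - the labels come from the data itself."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)  # this position is not predicted
    return inputs, labels
```

The model then learns to recover each masked original from its surrounding context.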
Transformer: a model architecture at the core of most state of the art (SOTA) ML research. It is composed of multiple “attention” layers which learn
which parts of the input data are the most important for a given task. Transformers started in NLP (specifically machine translation) and
subsequently were expanded into computer vision, audio, and other modalities.
Definitions
Model type legend
In the rest of the slides, icons in the top right corner indicate input and output modalities for the model.
Executive Summary
Research
- Frontier lab performance converges, but OpenAI maintains its edge following the launch of o1, as planning and reasoning emerge as a major frontier.
- Foundation models demonstrate their ability to break out of language as multimodal research drives into mathematics, biology, genomics, the physical
sciences, and neuroscience.
- US sanctions fail to stop Chinese (V)LLMs rising up community leaderboards.
Industry
- NVIDIA remains the most powerful company in the world, enjoying a stint in the $3T club, while regulators probe the concentrations of power within GenAI.
- More established GenAI companies bring in billions of dollars in revenue, while start-ups begin to gain traction in sectors like video and audio generation.
Although companies begin to make the journey from model to product, long-term questions around pricing and sustainability remain unresolved.
- Driven by a bull run in public markets, AI companies reach $9T in value, while investment levels grow healthily in private companies.
Politics
- While global governance efforts stall, national and regional AI regulation has continued to advance, with controversial legislation passing in the US and EU.
- The reality of compute requirements forces Big Tech companies to reckon with real-world physical constraints on scaling and their own emissions targets.
Meanwhile, governments’ own attempts to build capacity continue to lag.
- Anticipated AI effects on elections, employment and a range of other sensitive areas are yet to be realized at any scale.
Safety
- A vibe-shift from safety to acceleration takes place, as companies that previously warned us about the impending extinction of humanity need to ramp up
enterprise sales and usage of their consumer apps.
- Governments around the world emulate the UK in building up state capacity around AI safety, launching institutes and studying critical national infrastructure
for potential vulnerabilities.
- Every proposed jailbreaking ‘fix’ has failed, but researchers are increasingly concerned with more sophisticated, long-term attacks.
● A generative AI media company is investigated for its misuse during the 2024 US election cycle. ~ Not yet, but there’s still time.
● Self-improving AI agents crush SOTA in a complex environment (e.g. AAA game, tool use, science). NO Not yet, despite promising work on open-endedness, including strong game performance.
● Tech IPO markets unthaw and we see at least one major listing for an AI-focused company (e.g. DBRX). ~ While the Magnificent Seven have enjoyed strong gains, private companies are hanging on until markets settle. However, AI chip company Cerebras has filed to IPO.
● The GenAI scaling craze sees a group spend >$1B to train a single large-scale model. NO Not quite yet - let’s give it another year.
● The US’s FTC or UK’s CMA investigate the Microsoft/OpenAI deal on competition grounds. YES Both regulators are investigating this partnership.
● We see limited progress on global AI governance beyond high-level voluntary commitments. YES The commitments from the Bletchley and Seoul summits remain voluntary and high-level.
● Financial institutions launch GPU debt funds to replace VC equity dollars for compute funding. NO Some VC funds are rumored to be offering GPUs for equity, but we’re yet to see anyone go down the debt route.
● An AI-generated song breaks into the Billboard Hot 100 Top 10 or the Spotify Top Hits 2024. YES It turns out this had already happened last year with “Heart on My Sleeve”, but we’ve also seen an AI-generated song reach #27 in Germany and spend several days in the Top 50.
● As inference workloads and costs grow significantly, a large AI company (e.g. OpenAI) acquires or builds an inference-focused AI chip company. YES Sam Altman is reportedly raising huge sums of money to do this, while Google, Amazon, Meta and Microsoft each continue to build and improve their own AI silicon.
Section 1: Research
📝→📝
OpenAI’s reign of terror came to an end, until…
For much of the year, both benchmarks and community leaderboards pointed to a chasm between GPT-4 and ‘the
best of the rest’. However, Claude 3.5 Sonnet, Gemini 1.5, and Grok 2 have all but eliminated this gap as model
performance begins to converge.
● On both formal benchmarks and vibes-based analysis, the best-funded frontier labs are able to rack up scores
within low single digits of each other on individual capabilities.
● Models are now consistently highly capable coders and strong at factual recall
and math, but less good at open-ended question-answering and multi-modal
problem solving.
● Many of the variations are sufficiently small that they are now likely to be the
product of differences in implementation. For example, GPT-4o outperforms
Claude 3.5 Sonnet on MMLU, but apparently underperforms it on MMLU-Pro - a
benchmark designed to be more challenging.
● Considering the relatively subtle technical differences between architectures and
likely heavy overlaps in pre-training data, model builders are now increasingly
having to compete on new capabilities and product features.
📝→📝
…the Strawberry landed, doubling down on scaling inference compute
The OpenAI team had clearly clocked the potential of inference compute early, with OpenAI o1 appearing within
weeks of papers from other labs exploring the technique.
● By shifting compute from pre- and post-training to inference, o1 reasons through complex prompts step-by-step
in a chain-of-thought (COT) style, employing RL to sharpen the COT and the strategies it uses. This unlocks the
possibility of solving multi-layered math, science, and coding problems where LLMs have historically struggled,
due to the inherent limitations of next-token prediction.
● OpenAI reports significant improvements on reasoning-heavy benchmarks
versus 4o, the starkest being AIME 2024 (competition math), with a
whopping score of 83.83 versus 13.4.
● However, this capability comes at a steep price: 1M input tokens of
o1-preview costs $15, while 1M output tokens will set you back $60. This
makes it 3-4x more expensive than GPT-4o.
● OpenAI is clear in its API documentation that it is not a like-for-like 4o
replacement and that it is not the best model for tasks that require
consistently quick responses, image inputs or function calling.
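OpenAI has not published o1's internals, but the simplest way to see why spending more compute at inference helps is self-consistency-style majority voting over sampled answers. This sketch is illustrative only - o1 uses RL-refined chain-of-thought, not plain voting - and the function names are our own:

```python
from collections import Counter

def best_of_n(generate, prompt, n=16):
    """Sample n candidate answers and return the most common one. On reasoning
    tasks, accuracy typically rises with n - a crude way of trading extra
    inference compute for quality. `generate` is any function mapping a
    prompt to a sampled answer string (e.g. an LLM call with temperature > 0)."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Each extra sample costs a full forward pass, which is one reason inference-compute-heavy models carry steep per-token prices.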
📝→📝
o1 showcases both areas of incredible strength and persistent weakness
The community were quick to put o1 through its paces, finding that it performed significantly better than other
LLMs on certain logical problems and puzzles. Its true edge shone through, however, on complex math and
science tasks, with a viral video of a PhD student reacting with astonishment as it reproduced a year of his PhD
code in approximately an hour. However, the model remains weaker on certain kinds of spatial reasoning. Like its
predecessors, it can’t play chess to save its life… yet.
📝→🖼
Llama 3 closes the gap between open and closed models
In April, Meta dropped the Llama 3 family, 3.1 in July, and 3.2 in September. Llama 3.1 405B, their largest
to-date, is able to hold its own against GPT-4o and Claude 3.5 Sonnet across reasoning, math, multilingual, and
long-context tasks. This marks the first time an open model has closed the gap with the proprietary frontier.
● Meta stuck to the same decoder-only transformer architecture that it’s used since Llama 1, with minor
adaptations, namely more transformer layers and attention heads.
● Meta used an incredible 15T tokens to train the family. While this blew well past the “Chinchilla-optimal”
amount of training data for these model sizes, they found that both the 8B and 70B models improved log-linearly up to 15T tokens.
● Llama 3.1 405B was trained on over 16,000 H100 GPUs, the first
Llama model trained at this scale.
● Meta followed up with Llama 3.2 in September, which
incorporated 11B and 90B VLMs (Llama’s multimodal debut).
The former was competitive with Claude 3 Haiku, the latter
with GPT-4o-mini. The company also released 1B and 3B
text-only models, designed to operate on-device.
● Llama-based models have now racked up over 440M downloads on Hugging Face.
📝→📝
But how ‘open’ are ‘open source’ models?
With open source commanding considerable community support and becoming a hot button regulatory issue,
some researchers have suggested that the term is often used misleadingly. It can be used to lump together
vastly different openness practices across weights, datasets, licensing, and access methods.
📝→📝
Is contamination inflating progress?
With new model families reporting incredibly strong benchmark performance straight out-of-the-gate,
researchers have increasingly been shining a light on dataset contamination: when test or validation data leaks
into the training set. Researchers from Scale retested a number of models on a new Grade School Math 1000
(GSM1k) benchmark that mirrors the style and complexity of the established GSM8k, finding significant
performance drops in some cases. Similarly, researchers at xAI re-evaluated models using a dataset based on
the Hungarian national finals math exam that post-dated their release, with similar results.
📝→📝
Researchers try to correct problems in widely used benchmarks
But benchmarking challenges cut both ways. There are alarmingly high error rates in some of the most popular
benchmarks that could be leading us to underestimate the capabilities of some models, with safety implications.
Meanwhile, the temptation to overfit is strong.
● A team from the University of Edinburgh flagged the number of mistakes in MMLU, including wrong
ground truths, unclear questions, and multiple correct answers. While error rates were low across most individual topics, there
were big spikes in certain fields, such as virology, where 57% of the analyzed instances contained errors.
● On a manually corrected MMLU subset, models broadly gained in performance, although they worsened on professional
law and formal logic. This suggests that inaccurate MMLU instances are being learned during pre-training.
● In more safety-critical territory, OpenAI has warned that SWE-bench, which
evaluates models’ ability to solve real-world software issues, was underestimating
the autonomous software engineering capabilities of models, as it contained tasks
that were hard or impossible to solve.
● The researchers partnered with the creators of the benchmark to create
SWE-bench Verified.
📝→📝
Live by the vibes, die by the vibes…or close your eyes for a year and OpenAI is still #1
The LMSYS Chatbot Arena Leaderboard has emerged as the community’s favorite method of formalizing
evaluation by “vibes”. But as model performance improves, it is beginning to produce counterintuitive results.
● The arena, which allows users to interact with two randomly selected chatbots side-by-side, provides a rough
crowdsourced evaluation.
● However, controversially, this led to GPT-4o and GPT-4o Mini
receiving the same scores, with the latter also outperforming
Claude Sonnet 3.5.
● This has led to concerns that the ranking is essentially
becoming a way of assessing which writing style users happen
to prefer most.
● Additionally, as smaller models tend to perform less well on
tasks involving more tokens, the 8k context limit arguably
gives them an unfair advantage.
● However, the early version of the vision leaderboard is now
beginning to gain traction and aligns better with other evals.
📝+🖼→📝
Are neuro-symbolic systems making a comeback?
Deficiencies in both reasoning capabilities and training data mean that AI systems have frequently fallen short
on math and geometry problems. With AlphaGeometry, a symbolic deduction engine comes to the rescue.
● A Google DeepMind/NYU team generated millions of synthetic theorems and proofs using symbolic engines,
using them to train a language model from scratch.
● AlphaGeometry alternates between the language model
proposing new constructions and symbolic engines
performing deductions until a solution is found.
● Impressively, it solved 25 out of 30 problems on a benchmark of
Olympiad-level geometry problems, nearing human
International Mathematical Olympiad gold medalist
performance. The next best AI system scored only 10.
● It also demonstrated generalisation capabilities - for
example, finding that a specific detail in a 2004 IMO
problem was unnecessary for the proof.
📝→📝
It’s possible to shrink models with minimal impact on performance…
Research suggests that models are robust in the face of deeper layers - which are meant to handle complex,
abstract, or task-specific information - being pruned intelligently. Maybe it’s possible to go even further.
● A Meta/MIT team looking at open-weight pre-trained LLMs concluded that it’s possible to do away with up to
half a model’s layers and suffer only negligible performance drops on question-answering benchmarks.
● They identified optimal layers for removal based on similarity and then “healed” the model through small
amounts of efficient fine-tuning.
● NVIDIA researchers took a more radical approach by
pruning layers, neurons, attention heads, and
embeddings, and then using knowledge distillation for
efficient retraining.
● The MINITRON models, derived from Nemotron-4 15B,
achieved comparable or superior performance to
models like Mistral 7B and Llama-3 8B while using up
to 40x fewer training tokens.
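The layer-selection intuition above can be sketched with a toy similarity search: if a layer's input already looks like the output of a block several layers deeper, the block in between is doing little work. This is a sketch of the idea, not the Meta/MIT team's exact angular-distance procedure, and the helper names are our own:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_redundant_block(layer_reprs, n_drop):
    """Given per-layer hidden representations for some probe input, find the
    contiguous block of n_drop layers whose input and output representations
    are most similar - the best candidate for removal before 'healing' the
    model with a small amount of fine-tuning."""
    best_start, best_sim = 0, -1.0
    for i in range(len(layer_reprs) - n_drop):
        sim = cosine(layer_reprs[i], layer_reprs[i + n_drop])
        if sim > best_sim:
            best_start, best_sim = i, sim
    return best_start, best_sim
```

In practice the similarity would be averaged over many inputs and high-dimensional activations rather than toy vectors.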
📝→📝
…as distilled models become more fashionable
As Andrej Karpathy and others have argued, current large model sizes could be a reflection of inefficient training.
Using these big models to refine and synthesize training data could help train capable smaller models.
● Google have embraced this approach, distilling Gemini 1.5 Flash from Gemini 1.5 Pro, while Gemma 2 9B was
distilled from Gemma 2 27B, and Gemma 2B from a larger unreleased model.
● There is also community speculation that Claude 3 Haiku, a highly capable smaller model, is a distilled version
of the larger Opus, but Anthropic has never confirmed this.
● These distillation efforts are going multimodal too. Black Forest
Labs have released FLUX.1 dev, an open-weight text-to-image model
distilled from their Pro model.
● To support these efforts, the community has started to produce
open-source distillation tools, like arcee.ai’s DistillKit, which
supports both Logit-based and Hidden States-based distillation.
● Llama 3.1 405B is also being used for distillation, after Meta
updated its terms so output logits can be used to improve any
models, not just Llama ones.
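Logit-based distillation, as supported by tools like DistillKit, boils down to matching temperature-softened output distributions. A minimal pure-Python sketch of the standard loss (our own helper names; real training would operate on batched tensors):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions - the core signal in
    logit-based distillation. Scaled by T^2 so gradient magnitudes stay
    roughly constant as the temperature changes."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

A higher temperature exposes the teacher's "dark knowledge" - the relative probabilities it assigns to wrong answers - which is exactly what the smaller student learns from.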
📝→📝
Models built for mobile compete with their larger peers
As big tech companies think through large-scale end user deployment, we’re starting to see high-performing LLM
and multimodal models that are small enough to run on smartphones.
● Microsoft’s phi-3.5-mini is a 3.8B LM that competes with larger 7B and 8B
models like Llama 3.1 8B. It performs well on reasoning and question-answering, but
its size restricts its factual knowledge. To enable on-device inference, the model
was quantized to 4 bits, reducing its memory footprint to approximately 1.8GB.
● Apple introduced MobileCLIP, a family of efficient image-text models optimized
for fast inference on smartphones. Using novel multimodal reinforced training,
they improve the accuracy of compact models by transferring knowledge from
an image captioning model and an ensemble of strong CLIP encoders.
● Hugging Face also got in on the action with SmolLM, a family of small language
models, available in 135M, 360M, and 1.7B formats. By using a highly curated
synthetic dataset created via an enhanced version of Cosmopedia (see slide 31)
the team achieved SOTA performance for the size.
📝+🖼→📝
Strong results in quantization point to an on-device future
It’s possible to shrink the memory requirements of LLMs by reducing the precision of their parameters.
Researchers are increasingly managing to minimize the performance trade-offs.
● Microsoft’s BitNet uses a “BitLinear” layer to replace standard linear layers,
employing 1-bit weights and quantized activations.
● It shows competitive performance compared to full-precision models and
demonstrates a scaling law similar to full-precision transformers, with
significant memory and energy savings.
● Microsoft followed up with BitNet b1.58, with ternary weights to match
full-precision LLM performance at 3B size while retaining efficiency gains.
● Meanwhile, ByteDance’s TiTok (Transformer-based 1-Dimensional
Tokenizer) quantizes images into compact 1D sequences of discrete tokens
for image reconstruction and generation tasks. This allows images to be
represented with as few as 32 tokens, instead of hundreds or thousands.
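The ternary scheme behind BitNet b1.58 can be sketched as "absmean" quantization: scale the weights by their mean absolute value, then round and clip to {-1, 0, +1}. This is an illustrative sketch of the published recipe, not Microsoft's implementation:

```python
def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization: scale by the mean absolute weight, then
    round-and-clip each value to {-1, 0, +1}. Returns the ternary values and
    the scale, so magnitudes are approximately recovered as scale * q.
    Ternary weights need ~1.58 bits each (log2 of 3 states) - hence 'b1.58'."""
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / gamma))) for w in weights]
    return q, gamma
```

Beyond the memory savings, ternary weights turn most multiplications in a matrix product into additions, subtractions, or skips.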
📝→📝
Will representation fine tuning unlock on-device personalization?
Parameter-efficient fine-tuning (e.g. via LoRA) is nothing new, but Stanford researchers believe a more targeted
approach offers greater efficiency and adaptation.
● Inspired by model interpretability research, ReFT (Representation Fine-tuning) doesn’t alter the model’s
weights. Instead, it manipulates the model’s internal representations at inference time to steer its behavior.
● While it comes with a slight inference-time overhead, ReFT requires 15-65x fewer parameters than
weight-based fine-tuning methods.
● It also enables more selective interventions on specific
layers and token positions, enabling fine-grained control
over the adaptation process.
● The researchers show its potential in few-shot adaptation
where a chat model is given a new persona with just five
examples. Combined with the small storage footprint for
learned interventions, it could be used for real-time
personalization on devices with sufficient compute power.
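The core idea - steer activations at inference time rather than updating weights - can be illustrated with a toy intervention. ReFT's real interventions are learned low-rank projections, not a fixed offset, and the function below is our own simplification:

```python
def apply_intervention(hidden_states, layer, positions, delta):
    """Edit the hidden state of one layer at selected token positions by
    adding a vector `delta`, leaving every model weight untouched. Only the
    targeted layer is copied; all other layers are shared with the input."""
    edited = [row[:] for row in hidden_states[layer]]  # copy this layer only
    for pos in positions:
        edited[pos] = [h + d for h, d in zip(edited[pos], delta)]
    out = list(hidden_states)
    out[layer] = edited
    return out
```

Because the stored artifact is just a small set of interventions rather than a full adapter, swapping personas on-device becomes cheap.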
📝→📝
Hybrid models begin to gain traction
Models that combine attention and other mechanisms are able to maintain or even improve accuracy, while
reducing computational costs and memory footprint.
● Selective state-space models like Mamba, designed last year to handle long sequences more efficiently, can to
some extent compete with transformers, but lag on tasks that require copying or in-context learning. That
said, Falcon’s Mamba 7B shows impressive benchmark performance versus similar-sized transformer models.
● Hybrid models appear to be a more promising direction. Combining Mamba with self-attention and MLP layers,
AI21’s Mamba-Transformer hybrid model outperforms an 8B Transformer across knowledge and reasoning
benchmarks, while generating tokens up to 8x faster at inference.
● In a nostalgia trip, there are early signs of a comeback for recurrent
neural networks, which had fallen out of fashion due to training and
scaling difficulties.
● Griffin, trained by Google DeepMind, mixes linear recurrences and local
attention, holding its own against Llama-2 while being trained on 6x
fewer tokens.
📝→📝
And could we distill transformers into hybrid models? It’s…complicated.
By transferring knowledge from a larger, more powerful model, one could improve the performance of
subquadratic models, allowing us to harness their efficiency on downstream tasks.
● MOHAWK is a new method for distilling knowledge from a large, pre-trained transformer model (teacher) to a
smaller, subquadratic model (student) like a state-space model (SSM).
● It aligns i) the sequence transformation matrices of the student and teacher models and ii) the hidden states of
each layer, then iii) transfers the remaining weights of the teacher model to the student model to finetune it.
● The authors create Phi-Mamba, a new student model combining
Mamba-2 and an MLP block, and a variant called
Hybrid-Phi-Mamba that retains some attention layers from the
teacher model.
● MOHAWK can train Phi-Mamba and Hybrid-Phi-Mamba to achieve
performance close to the teacher model’s. Phi-Mamba is distilled
with only 3B tokens - less than 1% of the data used to train the
previously best-performing Mamba models, and only 2% of that used
to train Phi-1.5 itself.
📝→📝
Synthetic data starts gaining more widespread adoption…
Last year’s report pointed to the divide of opinion around synthetic data: some found it useful, while others
feared its potential to trigger model collapse by compounding errors. Opinion seems to be warming.
● As well as being the main source of training data for the Phi family, synthetic data was used by Anthropic when
training Claude 3 to help represent scenarios that might have been missing in the training data.
● Hugging Face used Mixtral-8x7B Instruct to generate over 30M files and
25B tokens of synthetic textbooks, blog posts, and stories to recreate the
Phi-1.5 training dataset, which they dubbed Cosmopedia.
● To make this process easier, NVIDIA released the Nemotron-4-340B
family, a suite of models designed specifically for synthetic data
generation, available via a permissive license. Meta’s Llama can also be
used for synthetic data generation.
● It also appears possible to create synthetic high-quality instruction data by
extracting it directly from an aligned LLM, with techniques like Magpie. Models
fine-tuned this way sometimes perform comparably to Llama-3-8B-Instruct.
📝→📝
…but Team Model Collapse isn’t going down without a fight
As model builders motor ahead, researchers have focused on trying to assess whether there is a tipping point in the
quantity of synthetic data that triggers these kinds of outcomes, and whether any mitigations work.
● A Nature paper from Oxford and Cambridge researchers found model collapse occurs across various AI
architectures, including fine-tuned language models, challenging the idea that pre-training or periodic
exposure to small amounts of original data can prevent degradation (measured by Perplexity score).
● This creates a “first mover advantage”, as sustained access to
diverse, human-generated data will become increasingly critical
for maintaining model quality.
● However, these results primarily focus on a scenario where
real data is replaced with synthetic data over generations. In
practice, real and synthetic data usually accumulate.
● Other research suggests that, provided the proportion of synthetic
data doesn’t get too high, collapse can usually be avoided.
📝→📝
Web data is decanted openly at scale - proving quality is key 🍷
Team Hugging Face built a 15T token dataset for LLM pre-training from 96 CommonCrawl snapshots; it
produces LLMs that outperform those trained on other open pre-training datasets. They also released an instruction manual.
● FineWeb, the dataset, was created through a multi-step process including base filtering, independent
MinHash deduplication per dump, selected filters derived from the C4 dataset, and the team’s custom filters.
● The text extraction using the trafilatura library produced higher quality data than default CommonCrawl
WET files, even though the resulting dataset was meaningfully smaller.
● They found deduplication drove performance improvements up to a
point, before hitting diminishing returns and eventually worsening
performance.
● The team also used llama-3-70b-instruct to annotate 500k samples
from FineWeb, scoring each for its educational quality on
a scale from 0 to 5. FineWeb-edu, which filtered out samples scored
below 3, outperformed FineWeb and all other open datasets,
despite being significantly smaller.
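The MinHash deduplication FineWeb applies per CommonCrawl dump can be sketched with a toy implementation. The parameters and helper names below are illustrative, not the team's actual configuration:

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum hash
    over the document's shingles. Two signatures agree in roughly the same
    fraction of positions as the Jaccard similarity of the shingle sets, so
    near-duplicates can be found without pairwise set comparisons."""
    sig = []
    for seed in range(num_hashes):
        min_h = min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        )
        sig.append(min_h)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Production pipelines bucket signatures with locality-sensitive hashing rather than comparing every pair, but the estimator is the same.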
📝→📝
Retrieval and embeddings take center stage
While retrieval and embeddings are not new, growing interest in retrieval augmented generation (RAG) has
prompted improvements in the quality of embedding models.
● Following the playbook that’s proven effective in regular LLMs, massive performance improvements have come
from scale (GritLM has ~ 47B parameters vs the 110M common among prior embedding models).
● Similarly, the usage of broad web scale corpora and improved
filtering methods have led to large improvements in the smaller
models.
● Meanwhile, ColPali is a vision-language embedding model that
exploits the visual structure of documents, not just their text
embeddings, to improve retrieval.
● Retrieval models are one of the few subdomains where open
models commonly outperform proprietary models from the biggest
labs. On the MTEB Retrieval Leaderboard, OpenAI’s embedding
model ranks 29th, while NVIDIA’s open NV-Embed-v2 is top.
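The retrieval step these embedding models serve reduces to ranking documents by cosine similarity between vectors. A minimal sketch with hand-written toy vectors; a real system would obtain embeddings from a model such as those on the MTEB leaderboard.

```python
# Dense-retrieval sketch: rank documents by cosine similarity between a
# query embedding and document embeddings. The 3-dim vectors are toys;
# real embedding models produce hundreds or thousands of dimensions.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query: list[float], docs: dict[str, list[float]], k: int = 2):
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.1],
    "doc_c": [0.8, 0.2, 0.1],
}
print(retrieve([1.0, 0.0, 0.0], docs, k=2))  # ['doc_a', 'doc_c']
```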
stateof.ai 2024
Introduction | Research | Industry | Politics | Safety | Predictions #stateofai | 33
📝→📝
Context proves a crucial driver of performance
Traditional RAG solutions usually involve splitting text into 256-token snippets with a sliding window, each chunk
overlapping the previous one by 128 tokens. This makes retrieval more efficient, but significantly less accurate.
● Anthropic addressed this using ‘contextual embeddings’, where a prompt instructs the model to generate text
explaining the context of each chunk in the document.
● They found that this approach reduces the top-20 retrieval
failure rate by 35% (5.7% → 3.7%).
● It can then be scaled using Anthropic’s prompt caching.
● As Fernando Diaz of CMU observed in a recent thread, this is a
great example of techniques pioneered in one area of AI
research (e.g. early speech retrieval and document expansion
work) being applied to another. Another version of “what is
new, is old”.
● Research from Chroma shows that the choice of chunking
strategy can affect retrieval performance by up to 9% in recall.
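The sliding-window chunking described above can be sketched as follows; for simplicity a "token" here is any list element, whereas a real pipeline would use the embedding model's tokenizer.

```python
# Sliding-window chunking sketch: 256-token chunks, each overlapping the
# previous chunk by 128 tokens, as in the traditional RAG setup above.

def chunk(tokens: list[str], size: int = 256, overlap: int = 128) -> list[list[str]]:
    step = size - overlap  # window advances by size - overlap each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final window already covers the end of the text
    return chunks

tokens = [f"tok{i}" for i in range(600)]
chunks = chunk(tokens)
print(len(chunks), len(chunks[0]))  # 4 256
```

The choice of `size` and `overlap` is exactly the chunking strategy Chroma's research shows can swing recall by up to 9%.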
📝→📝
Evaluation for RAG remains unsolved
Many commonly used RAG benchmarks are repurposed retrieval or question answering datasets. They don’t
effectively evaluate the accuracy of citations, the importance of each piece of text to the overall answer, or the
impact of conflicting points of information.
● Researchers are now pioneering new approaches, like Ragnarök, which introduces a web-based arena
for human evaluation through pairwise system comparisons. This addresses the challenge of assessing RAG
quality beyond traditional automated metrics.
● Meanwhile, Researchy Questions provides a large-scale collection of complex, multi-faceted questions that
require in-depth research and analysis to answer, drawn from real user queries.
Frontier labs face up to the realities of the power grid and work on mitigations
As compute clusters grow larger, they become harder to build and maintain. Clusters require high-bandwidth,
low-latency connections and are sensitive to device heterogeneity. Researchers see the potential for alternatives.
● Google DeepMind has proposed Distributed Low-Communication (DiLoCo), an optimization algorithm that
allows training to occur on multiple loosely connected “islands” of devices.
● Each island performs a large number of local update steps before communicating with the others, reducing
the need for frequent data exchange. Training across 8 of these islands matched fully synchronous
optimization while communicating 500x less.
● GDM also proposed a refined version of DiLoCo, optimized for asynchronous settings.
● Researchers at Prime Intellect released an open-source implementation and replication of DiLoCo, while
scaling it up 3x, to demonstrate its effectiveness on 1B parameter models.
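The inner/outer structure of DiLoCo can be illustrated on a toy scalar objective. This sketch simplifies the outer step to plain parameter averaging; the actual algorithm applies an outer optimizer (with momentum) to the averaged update.

```python
# DiLoCo-style training sketch: each "island" runs many local update
# steps independently, then one communication round averages parameters.
# The toy "model" is a single scalar minimising (w - target)^2.

def local_steps(w: float, target: float, steps: int, lr: float = 0.1) -> float:
    for _ in range(steps):
        w -= lr * 2 * (w - target)  # gradient of (w - target)^2
    return w

def diloco_round(weights: list[float], targets: list[float], inner_steps: int = 50):
    # Inner loop: islands train independently, with no communication.
    updated = [local_steps(w, t, inner_steps) for w, t in zip(weights, targets)]
    # Outer step: a single communication round averages the islands' weights
    # (the real algorithm feeds this average through an outer optimizer).
    avg = sum(updated) / len(updated)
    return [avg] * len(updated)

islands = [0.0] * 8                                # 8 islands, shared init
data = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]   # each island's local objective
islands = diloco_round(islands, data)
print(round(islands[0], 3))  # 2.5 - islands agree after averaging
```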
📝→📝
Could better data curation methods reduce training compute requirements?
Data curation is an essential part of effective pre-training, but is often done manually and inefficiently. This is
both hard to scale and wasteful, especially for multimodal models.
● Usually, an entire dataset is processed upfront, which doesn’t account for how the relevance of a training
example can change over the course of learning. These methods are frequently applied before training, so
cannot adapt to changing needs during training.
● Google DeepMind’s JEST selects entire batches of data jointly,
rather than individual examples independently. The selection is
guided by a ‘learnability score’ (determined by a pre-trained
reference model) which estimates how useful each batch will be for
training. It’s able to integrate data selection directly into the
training process, making it dynamic and adaptive.
● JEST uses lower-resolution image processing for both data
selection and part of the training, significantly reducing
computational costs while maintaining performance benefits.
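The core of JEST's joint selection — score candidate batches by a learnability signal (learner loss minus reference-model loss) and train on the most learnable one — can be sketched as follows, with toy per-example losses standing in for real model evaluations.

```python
# JEST-style joint batch selection sketch: pick the candidate batch with
# the highest "learnability" score. Losses here are toy lookups; in JEST
# they come from the learner and a pre-trained reference model.

def learnability(batch, learner_loss, reference_loss):
    # High learner loss but low reference loss => useful and learnable,
    # rather than irreducible noise (which both models find hard).
    return learner_loss(batch) - reference_loss(batch)

def select_batch(candidates, learner_loss, reference_loss):
    return max(candidates,
               key=lambda b: learnability(b, learner_loss, reference_loss))

learner = {"a": 2.0, "b": 0.5, "c": 3.0}     # toy per-example learner losses
reference = {"a": 1.9, "b": 0.4, "c": 0.5}   # toy reference-model losses
batches = [["a"], ["b"], ["c"]]
ll = lambda b: sum(learner[x] for x in b)
rl = lambda b: sum(reference[x] for x in b)
print(select_batch(batches, ll, rl))  # ['c']: learner struggles, reference does not
```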
📝+🖼→📝
Chinese (V)LLMs storm the leaderboards despite sanctions
Models produced by DeepSeek, 01.AI, Zhipu AI, and Alibaba have achieved strong spots on the LMSYS
leaderboard, displaying particularly impressive results in math and coding.
● The strongest models from Chinese labs are competitive with the second-most powerful tier of frontier models
produced by US labs, while challenging the SOTA on certain subtasks.
● These labs have prioritized computational efficiency to
compensate for constraints around GPU access, learning to stretch
their resources much further than their US peers.
● Chinese labs have different strengths. For example, DeepSeek has
pioneered techniques like Multi-head Latent Attention to reduce
memory requirements during inference and an enhanced MoE
architecture.
● Meanwhile, 01.AI has focused less on architectural innovation and
more on building a strong Chinese language dataset to
compensate for its relative paucity in popular repositories like
Common Crawl.
📝+🖼→📝
And Chinese open source projects win fans around the world
To drive international uptake and evaluation, Chinese labs have become enthusiastic open source contributors.
A few models have emerged as strong contenders in individual sub-domains.
● DeepSeek has emerged as a community favorite on coding tasks, with deepseek-coder-v2 prized for its
combination of speed, lightness, and accuracy.
● Alibaba recently released the Qwen-2 family, and the community has
been particularly impressed by its vision capabilities, ranging from
challenging OCR tasks to its ability to analyse complex artwork.
● At the smaller end, the NLP lab at Tsinghua University has funded
OpenBMB, a project that has spawned MiniCPM.
● These are small <2.5B parameter models that can run on-device. Their
2.8B vision model is only marginally behind GPT-4V on some metrics,
while their 8.5B Llama 3-based model surpasses it on others.
● Tsinghua University’s Knowledge Engineering Group has also created
CogVideoX - one of the most capable text-to-video models.
📝+🖼→📝
VLMs achieve SOTA performance out-of-the-box
The first State of AI Report in 2018 detailed the painstaking efforts of researchers who tried to teach models
common sense scene understanding by creating datasets of millions of labelled videos. Every major frontier
model builder now offers vision capabilities out of the box. Even smaller models, from the low hundreds of millions
to single-digit billions of parameters, like Florence-2 from Microsoft or LongVILA from NVIDIA, can achieve remarkable results.
Allen Institute for AI’s open source Molmo can hold its own against the larger, proprietary GPT-4o.
📝→🖼
Diffusion models for image generation become more and more sophisticated
Moving on from diffusion models for text-to-image, Stability AI have continued to search for refinements that
increase quality while bringing about greater efficiency.
● Adversarial diffusion distillation speeds up image generation by reducing the sampling steps needed to create
high-quality images from potentially hundreds down to 1-4, while maintaining high fidelity.
● It combines adversarial training with score distillation: the model is trained just
using a pre-trained diffusion model as a guide.
● As well as unlocking single-step generation, the authors focused on reducing
computational complexity and improving sampling efficiency.
● Rectified flow improves upon traditional diffusion methods by connecting data
and noise through a direct, straight line, rather than a curved path.
● They combined this with a novel transformer-based architecture for
text-to-image that allows for a bidirectional flow of information between
text and image components. This enhances the model's ability to generate
more accurate and coherent high-resolution images from textual descriptions.
📝→🎥
Stable Video Diffusion marks a step forward for high-quality video generation…
Stability AI released Stable Video Diffusion, one of the first models capable of generating high-quality, realistic
videos from text prompts, along with a significant step up in customizability. The team took a three-stage
approach to training: i) image pre-training on a large text-to-image dataset, ii) video pre-training on a large,
curated low-res video dataset, and iii) fine-tuning on a smaller, high-res video dataset. In March, they followed up
with Stable Video 3D, finetuned on a 3D object dataset to predict 3D orbits.
📝→🎥
…leading the big labs to release their own gated text-to-video efforts
Both Google DeepMind and OpenAI have given us sneak previews of highly powerful text-to-video diffusion
models. But access remains heavily gated and neither has supplied much technical detail.
● OpenAI’s Sora is able to generate videos up to a minute long, while maintaining 3D consistency, object
permanence, and high resolution. It uses spacetime patches, similar to the tokens used in transformer models,
but for visual content, to learn efficiently from a vast dataset of videos.
● Sora was also trained on visual data in its native size and aspect
ratio, removing the usual cropping and resizing that reduces quality.
● Google DeepMind’s Veo combines text and optional image prompts
with a noisy compressed video input, processing them through
encoders and a latent diffusion model to create a unique compressed
video representation.
● The system then decodes this representation into a final
high-resolution video.
● Also in the fight are Runway’s Gen-3 Alpha, Luma’s Dream Machine,
and Kling by Kuaishou.
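The "spacetime patches" idea behind Sora can be illustrated with a toy sketch: split a video into small blocks spanning both space and time, the visual analogue of tokens. Shapes here are illustrative; Sora's actual patching operates on compressed latents, not raw pixels.

```python
# Toy sketch of spacetime patches: partition a [T][H][W] video into
# blocks of pt frames x ph x pw pixels, each becoming one "token".

def spacetime_patches(video, pt=2, ph=4, pw=4):
    T, H, W = len(video), len(video[0]), len(video[0][0])
    patches = []
    for t in range(0, T, pt):
        for y in range(0, H, ph):
            for x in range(0, W, pw):
                patch = [
                    [row[x:x + pw] for row in video[t2][y:y + ph]]
                    for t2 in range(t, t + pt)
                ]
                patches.append(patch)
    return patches

video = [[[0] * 8 for _ in range(8)] for _ in range(4)]  # 4 frames of 8x8
patches = spacetime_patches(video)
print(len(patches))  # (4/2) * (8/4) * (8/4) = 8 patches
```

Because patching works for any duration and aspect ratio, it fits naturally with training on videos in their native size, as Sora reportedly does.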
📝→🎥
Meta goes even further, throwing audio into the mix
Keeping with the gated approach of other labs, Meta has brought together its work on different modalities via the
Make-A-Scene and Llama families to build Movie Gen.
● The core of Movie Gen is a 30B video generation and a 13B audio generation model, capable of producing
16-second videos at 16 frames per second and 45-second audio clips respectively.
● These models leverage joint optimization techniques for text-to-image and text-to-video tasks, as well as
novel audio extension methods for generating coherent audio for videos of arbitrary lengths.
● Movie Gen's video editing capabilities combine advanced image
editing techniques with video generation, allowing for both localized
edits and global changes while preserving original content.
● The models were trained on a combination of licensed and publicly
available datasets.
● Meta used A/B human evaluation comparisons to demonstrate positive
net win rates against competing industry models across their four
main capabilities. The researchers say they intend to make the model
available in future, but don’t commit to a timeline or release strategy.
AI gets en-Nobel-ed
In a sign that AI has truly come of age as both a scientific discipline and a tool to accelerate science, the Royal
Swedish Academy of Sciences awarded Nobel Prizes to OG pioneers in deep learning, alongside the architects of
its best-known application (so far) in science. The news was celebrated by the entire field.
📝→🧬
AlphaFold 3: going beyond proteins and their interactions with other biomolecules
DeepMind and Isomorphic Labs released AlphaFold 3, the successor to AF2, which can now model how small
molecule drugs, DNA, RNA and antibodies interact with protein targets.
● There were substantial and surprising algorithmic changes
from AF2: all equivariance constraints were removed in
favor of simplicity and scale, while the Structure Module
was replaced with a diffusion model to build the 3D
coordinates.
● Unsurprisingly, the researchers claim that AF3 performs
exceptionally well in comparison to other methods (esp.
for small molecule docking), although this was not
compared to stronger baselines.
● Notably, no open-source code was made available (yet).
Several independent groups are working on reproducing
the work openly.
📝→🧬
…starting a race to become the first to reproduce a fully functioning AlphaFold3 clone
The decision to not release code for the AF3 publication was highly controversial, with many blaming Nature.
Politics aside, there has been a race by start-ups and AI communities to make their model the go-to alternative.
● The first horse out of the gate was Baidu with their HelixFold3 model, which was comparable to AF3 for
ligand binding. They provide a web server and their code is fully open-sourced for non-commercial use.
● Chai Discovery (backed by OpenAI) recently released Chai-1, a
molecular structure prediction model that has taken off in
popularity due to its performance and high quality
implementation. The web server is also available for
commercial drug discovery use.
● We are still waiting for a fully open-sourced model with no
restrictions (e.g. using outputs for training of other models).
● Will DeepMind fully release AF3 sooner if they begin to fear
alternative models are becoming the community's favourite?
📝→🧬
AlphaProteo: DeepMind flexes new experimental biology capabilities
The secretive protein design team at DeepMind finally “came out of stealth” with their first model AlphaProteo, a
generative model that is able to design sub-nanomolar protein binders with 3- to 300-fold better affinities.
● While few technical details were given, it seems it was built on top of
AlphaFold3 and is likely a diffusion model. ‘Hotspots’ on the target
epitope can also be specified.
● The model was able to design protein binders with 3- to 300-fold
better binding affinities than previous works (e.g. RFDiffusion).
● The “dirty secret” of the protein design field is that in silico filtering
is just as important as (if not more important than) the generative
modelling, with the paper suggesting that AF3-based scoring is key.
● They also use their confidence metrics to screen a large number of
possible novel targets for which future protein binders could be
designed.
📝→🧬
The Bitter Lesson: Equivariance is dead…long live equivariance!
Equivariance is the idea of giving a model the inductive biases to natively handle rotations, translations and
(sometimes) reflections. It has been at the core of Geometric Deep Learning and biomolecular modelling
research since AlphaFold 2. However, recent works by top labs have questioned the existing mantra.
● The first shots were fired by Apple, with a paper that obtained SOTA
results on predicting the 3D structures of small molecules using a
non-equivariant diffusion model with a transformer encoder.
● Remarkably, the authors showed that using the domain-agnostic model
did not deleteriously impact generalization and was consistently able
to outperform specialist models (assuming sufficient scale was used).
● Next was AlphaFold 3, which infamously dropped all the equivariance
and frame constraints from the previous model in favour of another
diffusion process coupled with augmentations and, of course, scale.
● Regardless, the greatly improved training efficiency of equivariant
models means the practice is likely to stay for a while (at least for
academic groups working on large systems such as proteins).
📝→🧬
Scaling frontier models of biology: EvolutionaryScale’s ESM3
Since 2019, Meta had been publishing transformer-based language models (Evolutionary Scale Models) trained
on large-scale amino acid and protein databases. When Meta terminated these efforts in 2023, the team founded
EvolutionaryScale. This year, they released ESM3, a frontier multimodal generative model that was trained over
sequences, structures and functions of proteins rather than sequences alone.
● The model is a bidirectional transformer that fuses tokens
that represent each of the three modalities as separate tracks
into a single latent space.
● Unlike traditional masked language modelling, ESM3’s
training process uses a variable masking schedule, exposing
the model to diverse combinations of masked sequence,
structure, and function. ESM3 learns to predict completions
for any combination of modalities.
● ESM3 was prompted to generate new green fluorescent
proteins (GFP) with low sequence similarity to known ones.
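The variable masking schedule can be illustrated with a toy sketch: sample a mask rate per example, then mask tokens across the sequence, structure, and function tracks. This is a schematic of the training-data side only, not ESM3's actual implementation.

```python
# Sketch of a variable masking schedule across modality "tracks", in the
# spirit of ESM3 training. Token values are placeholders, not real
# protein, structure, or function tokens.
import random

MASK = "<mask>"

def mask_tracks(tracks, rng):
    # Sample a per-example mask rate instead of a fixed BERT-style 15%,
    # exposing the model to diverse visible/hidden combinations.
    rate = rng.uniform(0.15, 0.9)
    return {name: [MASK if rng.random() < rate else tok for tok in toks]
            for name, toks in tracks.items()}

rng = random.Random(0)
example = {
    "sequence": list("MKTLLV"),
    "structure": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "function": ["f1", "f2", "f3", "f4", "f5", "f6"],
}
out = mask_tracks(example, rng)
print(sum(tok == MASK for toks in out.values() for tok in toks))
```

The model's objective is then to predict the masked tokens given any combination of visible tracks.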
📝→🧬
Language models that learn to design human genome editors
We previously profiled how LLMs (e.g. ProGen2) pre-trained on large and diverse datasets of natural protein
sequences could be used to design functional proteins with vastly different sequences to their natural peers.
Now, Profluent has finetuned ProGen2 on their CRISPR-Cas Atlas to generate functional genome editors with
novel sequences that, importantly, were shown to edit the DNA of human cells in vitro for the first time.
● The CRISPR-Cas Atlas consists of >1M diverse CRISPR-Cas operons,
including various effector systems, that were mined from 26.2
terabases of assembled microbial genomes and metagenomes,
spanning diverse phyla and biomes.
● Generated sequences are 4.8x more diverse than natural proteins
from the CRISPR-Cas Atlas. The median identity to the nearest
natural protein typically fell between 40-60%.
● A model fine-tuned on Cas9 proteins generated novel editors
that were then validated in human cells. One such editor offered
the best editing performance, shares 71.7% sequence similarity with
SpCas9, and was open-sourced as OpenCRISPR-1.
📝+ 🖼 → 📝
Yet, evals and benchmarking in BioML remains poor
The fundamental problem with research at the intersection of biology and ML is that there are very few people
with the skills to both train a frontier model and give it a rigorous biological appraisal.
● Two works from late 2023, PoseCheck and PoseBusters, showed that ML models for molecule generation and
protein-ligand docking gave structures (poses) with gross physical violations.
● Even the AlphaFold3 paper didn’t get away without a few bruises
when Inductive Bio showed that a slightly more advanced
conventional docking pipeline beat AF3.
● A new industry consortium led by Valence Labs, including major
pharma companies (e.g. Recursion, Relay, Merck, Novartis, J&J,
Pfizer), is developing Polaris, a benchmarking platform for
AI-driven drug discovery. Polaris will provide high-quality
datasets, facilitate evaluations, and certify benchmarks.
● Meanwhile, Recursion’s work on perturbative map-building led
them to create a new set of benchmarks and metrics.
📝+ 🖼 → 📝
Foundation models across the sciences: inorganic materials
To determine the properties of physical materials and how they behave under reactions, it is necessary to run
atomic-scale simulations that today rely on density functional theory. This method is powerful, but slow and
computationally expensive. While faster, alternative approaches that calculate force fields (interatomic potentials)
tend to have insufficient accuracy to be useful, particularly for reactive events and phase transitions.
● In 2022, equivariant message passing neural networks (MPNN) combined with
efficient many-body messages (MACE) were introduced at NeurIPS.
● Now, the authors present MACE-MP-0, which uses the MACE architecture and
is trained on the Materials Project Trajectory dataset, which contains millions
of structures, energies, magnetic moments, forces and stresses.
● The model reduces the number of message passing layers to two by
considering interactions involving four atoms simultaneously, and it only uses
nonlinear activations in selective parts of the network.
● It is capable of molecular dynamics simulation across a wide variety of
chemistries in the solid, liquid and gaseous phases.
📝+ 🖼 → 📝
Expanding the protein function design space: challenging folds and soluble analogues
Characterising and generating structures for proteins that are not found in soluble form but are in membrane
environments is challenging and hinders the development of drugs meant to target membrane receptors. So too
is the design of protein folds that are large and include non-local topologies. Can AF2 and sequence models
remedy this and give drug designers access to a larger soluble proteome with previously inaccessible folds?
● To do so, the authors first use an inverted AF2 model that
generates an initial sequence given a target fold structure.
These sequences are then optimised by ProteinMPNN before
structures are re-predicted by AF2 followed by filtering on
the basis of structure similarity to the target structure.
● This AF2-MPNN pipeline was tested on three challenging
folds: IGF, BBF and TBF, which have therapeutic utility.
● It was also possible to generate soluble analogues of
membrane-only folds which could massively speed up drug
discovery targeting membrane-bound receptor proteins.
📝+ 🖼 → 📝
Foundation models for the mind: learning brain activity from fMRI
Deep learning, originally inspired by neuroscience, is now making inroads into modelling the brain itself. BrainLM is a
foundation model built on 6,700 hours of human brain activity recordings generated by functional magnetic
resonance imaging (fMRI), which detects changes in blood oxygenation (left figure). The model learns to
reconstruct masked spatiotemporal brain activity sequences and, importantly, it can generalise to held-out
distributions (right figure). This model can be fine-tuned to predict clinical variables e.g. age, neuroticism, PTSD,
and anxiety disorder scores better than a graph convolutional model or an LSTM.
📝+ 🖼 → 📝
Foundation models across the sciences: the atmosphere
Classical atmospheric simulation methods like numerical weather prediction are costly and unable to make use
of diverse and often scarce atmospheric data modalities. But, foundation models are well suited here. Microsoft
researchers created Aurora, a foundation model that produces forecasts for a wide range of atmospheric
forecasting problems such as global air pollution and high-resolution medium-term weather patterns. It can also
adapt to new tasks by making use of a general-purpose learned representation of atmospheric dynamics.
● The 1.3B model is pre-trained on >1M hours of weather
and climate data from 6 datasets, including forecasts,
analysis data, reanalysis data, and climate simulations.
● The model encodes heterogeneous inputs into a standard
3D representation of the atmosphere across space and
pressure-levels, which is evolved over time at inference by
a vision transformer and decoded into specific predictions.
● Importantly, it is the first model to predict atmospheric chemistry (6 major air pollutants, e.g. ozone, carbon
monoxide), which involves hundreds of stiff equations, better than numerical models. The model is also 5,000x
faster than the Integrated Forecasting System that uses numerical forecasting.
📝+ 🖼 → 📝
Foundation models for the mind: reconstructing what you see
MindEye2 is a generative model that maps fMRI activity to a rich CLIP space, from which images of what the
individual sees are reconstructed using a fine-tuned Stable Diffusion XL. The model is trained on the Natural
Scenes Dataset, an fMRI dataset built from 8 subjects whose brain responses were captured over 30-40 hours of
scanning sessions as they viewed hundreds of rich naturalistic stimuli from the COCO dataset for 3 seconds each.
📝+ 🖼 → 📝
Speaking what you think
Decoding speech from brain recordings with implantable microelectrodes could enable communication for
patients with impaired speech. In a recent case, a 45-year-old man with amyotrophic lateral sclerosis (ALS) with
tetraparesis and severe motor speech damage underwent surgery to implant microelectrodes into his brain. The
arrays recorded neural activity as the patient spoke in both prompted and unstructured conversational settings.
At first, cortical neural activity was decoded into a small vocabulary of 50 words with 99.6% accuracy by
predicting the most likely English phoneme being attempted. Sequences of phonemes were combined into words
using an RNN, before moving to a larger 125,000-word vocabulary enabled by further training.
📝→📝
But were implicit reasoning capabilities staring us in the face the whole time?
Some researchers have argued that, after prolonged training beyond the point of overfitting (known as grokking),
transformers learn to reason over parametric knowledge through composition and comparison tasks.
● Researchers at Ohio State University argued that a fully grokked transformer outperformed then-SOTA models
like GPT-4-Turbo and Gemini-1.5-Pro on complex reasoning tasks with a large search space.
● They conducted mechanistic analyses to understand the internal workings of the models during grokking,
revealing distinct generalizing circuits for different tasks.
● However, they found that while fully grokked models performed
well on comparison tasks (e.g. comparing attributes based on
atomic facts), they were less good at out-of-distribution
generalization in composition tasks.
● This raises questions about whether these are really meaningful
reasoning capabilities versus memorization by another name,
although the researchers believe that enhancing the transformer
with better cross-layer memory sharing could resolve this.
📝 → </>
Program search unlocks new discoveries in the mathematical sciences
Drawing on a combination of LLMs and evolutionary algorithms, FunSearch uses an LLM to generate and modify
programs, guided by an evaluation function that scores the quality of solutions. Searching for programs rather
than direct solutions allows it to discover concise, interpretable representations of complex objects or strategies.
This form of program search is one of the avenues that Chollet believes has the most potential to solve the ARC
challenge. The Google DeepMind team applied it to the cap set problem in extremal combinatorics and online
bin packing. In both cases, FunSearch discovered novel solutions that surpassed human-designed approaches.
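The FunSearch loop — propose a modified program, score it with the evaluation function, keep it if better — can be sketched with a toy search over a two-parameter "program". Here a random mutation stands in for the LLM proposer, and a greedy hill-climb stands in for FunSearch's islands-based evolutionary population.

```python
# FunSearch-style program search sketch: an evaluation function scores
# candidate "programs", and a proposer mutates the best-so-far. In the
# real system, an LLM generates program modifications.
import random

def evaluate(program):
    # Toy objective: how well does y = a*x + b fit y = 3x + 1 on x = 0..9?
    a, b = program
    return -sum((a * x + b - (3 * x + 1)) ** 2 for x in range(10))

def mutate(program, rng):
    # Stand-in for the LLM proposer: randomly perturb the candidate.
    a, b = program
    return (a + rng.uniform(-0.5, 0.5), b + rng.uniform(-0.5, 0.5))

def search(generations=300, seed=0):
    rng = random.Random(seed)
    best = (0.0, 0.0)
    for _ in range(generations):
        candidate = mutate(best, rng)
        if evaluate(candidate) > evaluate(best):
            best = candidate  # greedy; FunSearch keeps islands of programs
    return best

a, b = search()
print(a, b)  # drifts toward a = 3, b = 1
```

Searching over programs rather than raw solutions is what makes the discovered objects concise and interpretable.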
📝→🛠
RL drives improvements in VLM performance…
For agents to be useful, they need to be robust to real-world stochasticity, which SOTA models have historically
struggled with. We’re beginning to see signs of progress.
● DigiRL is a novel autonomous reinforcement learning approach for training in-the-wild device control agents
specifically for Android devices. The method involves a two-stage process: offline reinforcement learning
followed by offline-to-online reinforcement learning.
● It achieves a 62.7% task success rate on the Android-in-the-Wild
dataset, a significant improvement on the prior SOTA.
📝→🛠
…while LLMs improve RL performance
In 2019, Uber published Go-Explore, an RL agent that solved hard-exploration problems by archiving discovered
states and iteratively returning to and exploring from promising ones. In 2024, LLMs are supercharging it.
● Intelligent Go-Explore (IGE) uses an LLM to guide state selection, action choice, and archive updating, rather than
the original Go-Explore’s hand-crafted heuristics. This enabled more flexible and intelligent exploration in
complex environments.
● This approach also allowed IGE to recognize and capitalize on
promising discoveries, a key aspect of open-ended learning systems.
● It significantly outperformed other LLM agents on mathematical
reasoning, grid worlds, and text-based adventure games.
● Switching from GPT-4 to GPT-3.5 resulted in a significant
performance drop across all environments, suggesting that IGE's
performance scales with the capabilities of the underlying language
model.
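The underlying Go-Explore loop — archive states, return to a promising one, explore from it, archive any discoveries — can be sketched on a toy one-dimensional environment. The hand-crafted heuristic below ("furthest state is most promising") is exactly the kind of rule IGE replaces with LLM judgment.

```python
# Go-Explore-style loop sketch: archive interesting states, return to a
# promising one, explore from it, and archive new discoveries. The toy
# environment is a walk to the right; real tasks are far richer.
import random

def explore_from(state: int, rng: random.Random) -> int:
    return state + rng.choice([1, 2, 3])  # stochastic step forward

def go_explore(goal: int = 20, seed: int = 0) -> set[int]:
    rng = random.Random(seed)
    archive = {0}
    while max(archive) < goal:
        start = max(archive)             # heuristic: furthest state is most promising
        new_state = explore_from(start, rng)
        archive.add(new_state)           # archive the novel discovery
    return archive

archive = go_explore()
print(max(archive) >= 20)  # True
```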
📝→🛠
Could foundation models make it easier to train RL agents at scale?
One of the big bottlenecks for training RL agents is a shortage of training data. Standard approaches like
converting pre-existing environments (e.g. Atari) or manually building them are labor-intensive and don’t scale.
● Genie (winner of a Best Paper award at ICML 2024) is a world model that can generate action-controllable
virtual worlds. It analyzed 30,000 hours of video game footage from 2D platformer games, learning to
compress the visual information and infer the actions that drive changes between frames.
● By learning a latent action space from video data, it can handle
action representations without requiring explicit action labels,
which distinguishes it from other world models.
● Genie is both able to imagine entirely new interactive scenes and
demonstrate significant flexibility: it can take prompts in various
forms, from text descriptions to hand-drawn sketches, and bring
them to life as playable environments.
● This approach demonstrated applicability beyond games, with the
team successfully applying the hyperparameters from the game
model to robotics data, without fine tuning.
📝→🛠
Could foundation models make it easier to train RL agents at scale?
Imperial and UBC’s OMNI-EPIC used LLMs to create a theoretically endless stream of RL tasks and environments
to help agents build upon previously learned skills. The system generates executable Python code that can
implement simulated environments and reward functions for each task, and employs a model to assess whether
newly generated tasks are sufficiently novel and complex.
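A minimal sketch of that pattern follows, with entirely illustrative names and logic (OMNI-EPIC's generated tasks are full simulated environments and its novelty check is model-based, not a word filter): a task is emitted as executable Python — an environment step function plus a reward function — and a crude filter rejects near-duplicate task descriptions.

```python
# Hypothetical example of LLM-emitted task code plus a novelty filter.

def make_task_cross_the_gap(gap_width):
    """A 1-D toy environment a generator might emit: reach x >= gap_width."""
    def step(x, action):          # action is a signed step size
        return x + action
    def reward(x):
        return 1.0 if x >= gap_width else 0.0
    return step, reward

def is_novel(description, archive, min_new_words=2):
    """Crude stand-in for OMNI-EPIC's model-based check: require at least
    `min_new_words` words not seen in any archived task description."""
    seen = set().union(*[set(d.split()) for d in archive]) if archive else set()
    return len(set(description.split()) - seen) >= min_new_words

archive = ["walk to the flag", "jump over one block"]
print(is_novel("walk to the flag", archive))               # near-duplicate
print(is_novel("push the crate across a wide gap", archive))

step, reward = make_task_cross_the_gap(gap_width=3)
x = 0
for _ in range(3):
    x = step(x, 1)                 # agent steps right three times
print(reward(x))
```

Accepted tasks join the archive, so each new generation must build on, rather than repeat, what came before.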
📝+ 🖼 → 📝
Self-driving embraces more modalities
Wayve’s LINGO-2 is the second generation of its vision-language-action model which, unlike its predecessor, can
both generate real-time driving commentary and control a car, linking language explanations directly with
decision-making and actions. Meanwhile, the company is using generative models to enhance its simulator with
more real-world detail. PRISM-1 creates realistic 4D simulations of dynamic driving scenarios using only camera
inputs. It enables more effective testing and training by accurately reconstructing complex urban environments,
including moving elements like pedestrians, cyclists, and vehicles, without relying on LiDAR or 3D bounding
boxes.
📝+ 🖼+ 🤖 → 📝
Google DeepMind quietly emerges as a robotics leader
Despite all eyes being on Gemini, the Google DeepMind team has steadily been increasing its robotics output,
improving the efficiency, adaptability, and data collection of robots.
● The team created AutoRT, a system that uses a VLM for environmental understanding and an LLM to suggest a
list of creative tasks the robot could carry out. These models are then combined with a robot control policy.
This helps to scale up deployment quickly in previously unseen environments.
● RT-Trajectory enhances robotic learning through video input. For each
video in the dataset of demonstrations, a 2D sketch of the gripper
performing the task is overlaid. This provides practical visual hints to the
model as it learns.
● The team has also improved the efficiency of transformers. SARA-RT is
a novel ‘up-training’ method that converts pre-trained or fine-tuned robotic
policies from quadratic to linear attention while maintaining quality.
● Researchers have found that Gemini 1.5 Pro’s multimodal capabilities and
long context window make it an effective way of interacting with
robots via natural language.
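The quadratic-to-linear conversion can be illustrated with a toy attention head. This is a generic linear-attention sketch assuming a simple elu+1 feature map, not SARA-RT's actual kernel or code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # sequence length, head dimension
Q, K, V = rng.normal(size=(3, n, d))

def softmax_attention(Q, K, V):
    # Standard attention: the n x n score matrix costs O(n^2) time/memory.
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # Kernelized attention: phi(Q) (phi(K)^T V) never forms the n x n
    # matrix, so cost grows linearly in n once d is fixed.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1 > 0
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                      # d x d summary, independent of n
    normalizer = Qf @ Kf.sum(axis=0)   # per-query normalization
    return (Qf @ kv) / normalizer[:, None]

out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
print(out_soft.shape, out_lin.shape)   # both (6, 4)
```

The two outputs differ numerically, which is why a straight swap degrades a policy — hence an 'up-training' step that fine-tunes the converted model to recover quality.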
📝+ 🖼+ 🤖 → 📝
Hugging Face pulls down barriers to entry
Historically, robotics had significantly fewer open source datasets, tools, and libraries than other areas of AI -
creating an artificially high barrier to entry. Hugging Face’s LeRobot aims to bridge the gap, hosting pretrained
models and datasets of human-collected demonstrations. And the community’s
loving it.
📝+ 🖼+ 🤖 → 📝
Diffusion models drive improvements in policy and action generation
Well-established in image and audio generation, diffusion models continue to demonstrate their effectiveness in
generating complex action sequences in robotics.
● A number of research groups are aiming to bridge the gap between high-dimensional observation and
low-dimensional action spaces in robot learning. They create a unified representation that allows the learning
algorithm to understand the spatial implications of actions.
● Diffusion models excel at modeling these kinds of complex, non-linear
multimodal distributions, while their iterative denoising process allows for
the gradual refinement of actions or trajectories.
● There are multiple ways of attacking this. Researchers at Imperial and
Shanghai Qizhi Institute have opted for RGB images, which offer rich visual
information and compatibility with pre-trained models.
● Meanwhile, a team at UC Berkeley and Stanford has leveraged point
clouds for their explicit 3D information.
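The iterative-refinement intuition behind diffusion policies can be sketched in a few lines. In this toy, the "denoiser" simply knows the target trajectory; in a real diffusion policy it is a learned network conditioned on observations, and a proper noise schedule replaces the fixed rate used here.

```python
import math, random

random.seed(0)
# Desired 8-step action trajectory (stand-in for what a policy should output).
target = [math.sin(0.5 * t) for t in range(8)]

# Start from pure noise, as diffusion sampling does.
trajectory = [random.gauss(0.0, 1.0) for _ in range(8)]

for step in range(50):                               # denoising iterations
    alpha = 0.2                                      # per-step refinement rate
    # Nudge every action toward the target; a learned denoiser would
    # predict this correction from the noisy trajectory and observations.
    trajectory = [x + alpha * (t - x) for x, t in zip(trajectory, target)]

err = max(abs(x - t) for x, t in zip(trajectory, target))
print(f"max deviation after refinement: {err:.4f}")
```

Because each step only nudges the whole trajectory, the process can represent multimodal action distributions: different noise seeds converge to different, equally valid trajectories when the learned denoiser admits several modes.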
📝+ 🖼+ 🤖 → 📝
Can we stretch existing real-world robotics data further than we currently do?
Robotics policies have often been hampered by a lack of generalizability, due to limited real-world data. Rather
than finding more data, researchers are injecting more structure and knowledge into what we already have.
● One approach, outlined by a Carnegie Mellon team, involves learning more “affordance” information from
human video data, such as hand poses, object interactions, and contact points.
● This information can then be used to fine-tune existing visual
representations to make them more suitable for robotic tasks. This
consistently improved performance on real-world manipulation tasks.
● Meanwhile, a Berkeley/Stanford team found that chain-of-thought
reasoning could have a similar impact.
● Rather than just predicting actions directly, the enhanced models are
trained to reason step-by-step about plans, sub-tasks, and visual features
before deciding on actions.
● This approach uses LLMs to generate training data for the reasoning steps.
📝+ 🖼+ 🤖 → 📝
Can we overcome the data bottleneck for humanoids?
It’s challenging to model the intricacies of human behavior with imitation learning, which relies on human
demonstrators. While effective, it’s difficult to implement at scale. Stanford has some workarounds.
● HumanPlus is a full-stack system for humanoids to learn from human data. It combines a real-time shadowing
system and an imitation learning algorithm.
● The shadowing system uses a single RGB camera and a low-level policy to
allow human operators to control the humanoid's whole body in real-time.
This low-level control policy is trained on a large dataset of human motion
data in simulation and transfers to the real world without additional
training.
● The imitation learning component enables efficient learning of autonomous
skills from shadowing data. It uses binocular egocentric vision and
combines action prediction with forward dynamics prediction.
● The system demonstrates impressive results on a variety of tasks, including
complex actions like wearing a shoe and walking, using only up to 40
demonstrations.
📝+ 🖼+ 🤖 → 📝
Back with a vengeance: robot doggos 🐶
Boston Dynamics' Spot showcased progress in mobility and stability for embodied AI but lacked manipulation
skills. Researchers are now addressing this gap. A Stanford/Columbia team combined real-world demonstration
data with simulation-trained controllers to focus on controlling the robot's gripper movement rather than
individual joints. This approach simplifies transferring manipulation skills from stationary arms to mobile robots.
Meanwhile, a UC San Diego team developed a two-part system: a low-level policy for executing commands and a
high-level policy for generating visual-based commands, enhancing the robot's manipulation capabilities.
The Apple Vision Pro emerges as the must-have robotics research tool
While consumer demand for the Vision Pro has been lacklustre so far, it’s taking robotics research by storm: its
high-res displays, advanced tracking, and processing power are being leveraged by researchers working on teleoperation -
controlling robot movements and actions at a distance. Systems like Open-TeleVision and Bunny-Vision Pro use
it to help enable precise control of multi-finger robotic hands (at a 3000 mile distance in the case of the former),
demonstrating improved performance on complex manipulation tasks compared to previous approaches. They
address challenges such as real-time control, safety through collision avoidance, and effective bimanual
coordination.
📝→🖼
To finetune or not to finetune (in medicine)?
Last year, a non-finetuned GPT-4 via one API call was highly competitive with Google’s Med-PaLM2 on certain
medical knowledge benchmarks. Gemini has ridden to the rescue.
● The Med-Gemini family of multimodal models for medicine are finetuned from Gemini Pro 1.0 and 1.5 using
various medical datasets and incorporate web search for up-to-date information. They achieved SOTA 91.1%
accuracy on MedQA, surpassing GPT-4.
● For multimodal tasks (e.g. in radiology and pathology), Med-Gemini set a new SOTA on 5 out of 7 datasets.
● When quality errors in questions were fixed, model
performance improved and it exhibited strong reasoning
across other benchmarks. It also achieved high precision
and recall in retrieving rare findings in lengthy EHRs - a
challenging "needle-in-a-haystack" task.
● In a preliminary study, clinicians rated Med-Gemini's
outputs equal or better than human-written examples in
most cases.
📝→🖼
Generating synthetic data in medicine
High-quality medical imaging datasets are hard to come by or, even then, to license for research or commercial
products. They are also not immune to distributional shifts. And yet, realistic image generators have flooded the
internet in the last year. Could these be repurposed to generate realistic medical images that are useful for
model training, despite the large visual and semantic differences between natural images and medical images?
● By jointly fine-tuning both the U-Net and the CLIP text encoder from
Stable Diffusion on a large dataset of real chest x-rays (CXR) and
corresponding radiologist reports, it is possible to generate synthetic
CXR scans with high fidelity and conceptual correctness as evaluated
by board-certified radiologists.
● Generated CXRs can be used for data augmentation and
self-supervised learning.
● Consistent with other modalities, supervised classification
performance drops slightly when training on purely synthetic data.
● Moreover, generative models can improve fairness of medical classifiers by enriching
training datasets with synthetic examples that fill out underrepresented data points.
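The fairness-by-augmentation idea in the last bullet can be sketched in a few lines, with a placeholder standing in for the fine-tuned image generator (the group names and counts here are invented for illustration):

```python
def generate_synthetic(group, n):
    # Placeholder for sampling n synthetic images conditioned on `group`
    # from a fine-tuned generative model.
    return [f"synthetic_{group}_{i}" for i in range(n)]

# Imbalanced real training data: group_b is underrepresented.
train_counts = {"group_a": 900, "group_b": 100}
target = max(train_counts.values())

# Top up each subgroup with synthetic examples until counts are balanced.
augmented = dict(train_counts)
for group, count in train_counts.items():
    extra = generate_synthetic(group, target - count)
    augmented[group] = count + len(extra)

print(augmented)   # → {'group_a': 900, 'group_b': 900}
```

The classifier is then trained on the balanced mix of real and synthetic examples, which is where the fairness gains reported above come from.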
The global balance of power in AI research remains unchanged, but academia gains
As AI emerges as the new competitive battleground, big tech companies begin to hold more details of their work
close to their chest. Frontier labs have meaningfully cut publication levels for the first time since this report
began, while academia gets into gear.
Section 2: Industry
[Chart annotation: ChatGPT launches]
Buying NVIDIA stock would’ve been far better than investing in its start-up contenders
We looked at the $6B invested in AI chip challengers since 2016 and asked what would have happened if
investors had just bought the equivalent amount of NVIDIA stock at that day’s price. The answer is lime green:
that $6B would be worth $120B of NVIDIA stock today (20x!) vs. the $31B (5x) in its startup contenders.
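The comparison reduces to simple arithmetic over the round data: for each financing, price the same dollars in NVIDIA shares at that day's close and mark them to today. A sketch with made-up numbers (the report's figures aggregate real rounds since 2016; the amounts and share prices below are purely illustrative):

```python
rounds = [
    # (amount invested in a challenger, hypothetical NVIDIA share price that day)
    (1.0e9,  30.0),
    (2.0e9,  60.0),
    (3.0e9, 120.0),
]
nvda_today = 600.0   # hypothetical current share price

invested = sum(amount for amount, _ in rounds)
# Shares that each round's dollars would have bought, valued at today's price.
counterfactual_nav = sum(amount / px * nvda_today for amount, px in rounds)
print(f"invested ${invested/1e9:.0f}B -> NVIDIA NAV ${counterfactual_nav/1e9:.0f}B "
      f"({counterfactual_nav/invested:.1f}x)")
```

The report's 20x vs. 5x comparison is this same ratio computed once with NVIDIA's actual prices and once with the challengers' marked-up valuations.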
Note: Market pricing and valuation data retrieved as of 9 Oct 2024. NAV = net asset value.
Scaling up and out with faster connections between GPUs and nodes
The speed of data communication between GPUs within a node (scale-up fabric), as well as between nodes
(scale-out fabric), is critical to large-scale cluster performance. NVIDIA’s technology for the former, NVLink, has
seen bandwidth per link, the number of links, and the total number of GPUs connected per node increase
significantly over the last 8 years. Coupled with its InfiniBand technology for connecting nodes into large-scale
clusters, NVIDIA is ahead of the pack. Meanwhile, Chinese companies like Tencent have reportedly innovated
around sanctions for similar outcomes. Their Xingmai 2.0 high-performance computing network, which is said to
support over 100,000 GPUs in a single cluster, improves network communication efficiency by 60% and LLM
training by 20%. That said, it is not clear whether Tencent possesses clusters of this size.
While SoftBank starts to build its own chip empire (after prematurely selling NVIDIA)
Known for betting big, SoftBank is entering the arena, tasking subsidiary Arm with launching its first AI chips in
2025 and acquiring struggling UK start-up Graphcore for a rumoured $600-700M.
● Arm is already a player in the AI world, but historically, its instruction set architecture has not been optimal for
the large-scale parallel processing infrastructure required for datacenter training and inference. It’s also
struggled against NVIDIA’s entrenched data center business and mature software ecosystem.
● With a current market cap of over $140B, markets aren’t bothered. The company is
reportedly already in talks with TSMC and others about manufacturing.
● SoftBank also scooped up Graphcore, which pioneered Intelligence Processing Units,
processors designed to handle AI workloads more efficiently than GPUs and CPUs
using small volumes of data. Despite its sophistication, the hardware was often not
a logical choice for genAI applications as they took off.
● The company will operate semi-autonomously under the Graphcore brand.
● Meanwhile, SoftBank’s talks with Intel on designing a GPU challenger stalled after
they were unable to agree on requirements.
…but opts not to restrict the use of hardware by Chinese labs in US data centers
While Chinese labs face restrictions in their ability to import hardware, there are currently no controls on their
local affiliates renting access to it overseas. ByteDance rents access to NVIDIA H100s via Oracle in the US, while
Alibaba and Tencent are reportedly in conversations with NVIDIA about setting up their own US-based
datacenters. Meanwhile, Google and Microsoft have directly pitched big Chinese firms on their cloud offerings.
The US is planning to make hyperscalers report this kind of usage via a KYC scheme, but is yet to draw up plans
to prohibit it.
Perhaps it’s neither: vibes are all you need (to recover your share price)
Meta has produced an incredible vibe shift in public markets by ditching its substantial metaverse investments
and pivoting hard into open source AI with its Llama models. Mark Zuckerberg is, arguably, the de facto
messiah of open source AI, counterpositioning against OpenAI, Anthropic, and Google DeepMind.
The top quality model, OpenAI’s o1, comes at significant price and latency premiums
As the model menu matures, developers are choosing the right tool for the job (and their budget).
Google Gemini produced a strong model series with very competitive pricing
Prices on Gemini 1.5 Pro and 1.5 Flash have been dropped by 64-86% a few months after launch while offering
strong performance; e.g., Flash-8B is 50% cheaper than 1.5 Flash yet comparable across many benchmarks.
[Chart annotations: 76% cut, 86% cut]
Note: Pricing for <128k token prompts and outputs. Retrieved 4 Oct 2024.
Les grands modèles catch on, but another European challenger loses steam
European leaders have been desperate to point to a domestic success story as US labs have occupied the
spotlight. For now, Mistral remains the continent’s primary bright spark.
● With over €1B in the bank, Mistral has emerged as the undisputed European foundation model champion,
demonstrating both impressive computational efficiency and multilingual capabilities. Au Large, its flagship model,
is available via Azure as part of the company’s new partnership with Microsoft.
● The company has started striking partnerships with both French
companies like BNP Paribas and international start-ups like
Harvey AI. The company is also beginning to bulk out its US sales
function.
● Meanwhile, self-styled German ‘sovereign AI’ champions Aleph
Alpha have struggled.
● Despite raising $500M through equity, grants, and licensing
deals, the company’s closed models have underperformed freely
available peers. As a result, the company appears to be pivoting
to licensing Llama 2-3 and DBRX.
Databricks and Snowflake pivot to build their own models…but can they compete?
In last year’s report, we touched on Databricks and Mosaic’s combined LLM strategy, which focused on
fine-tuning models on customers’ data. Is the ‘bring your own model’ era over?
● The Mosaic research team, now folded into Databricks, open-sourced DBRX in March. A 132B MoE model,
DBRX was trained on just over 3,000 NVIDIA GPUs at a cost of $10M. Databricks is pitching the model as a
foundation for enterprises to build on and customize, while remaining in control of their own data.
● Meanwhile, Snowflake’s Arctic is pitched as the most efficient model for
enterprise workflows, based on a set of metrics covering tasks including
coding and instruction following.
● It’s unclear how much enterprises are willing to invest in costly custom
model tuning, given the constant set of releases and improvements
driven by bigger players.
● With readily available open source frontier models, the appeal of training
custom models is increasingly unclear.
Figures are total raised and latest round as of 7 Oct 2024.
…while cases jam up the court system and provide little clarity over fair use
The central question about whether creators’ copyright has been infringed by model builders via the use of their
work for training remains unresolved, but more expansive arguments have been shot down in the courts.
● Cases continue against Anthropic, OpenAI, Meta, Midjourney, Runway, Udio, Suno, Stability and others from news
outlets, image suppliers, authors, creative artists, and record labels.
● So far, model builders have failed to get any of these cases dismissed in full, but have managed to narrow their
scope significantly.
● For example, claims from two groups of authors against OpenAI and Meta arguing that the companies were
guilty of vicarious copyright infringement because all of their models’ outputs are “infringing derivative work”
failed, because they were unable to demonstrate “substantial similarity”. Only their original claims on the grounds
of copyright infringement have been allowed to proceed.
● A similar pruning happened with the cases against Midjourney, Runway, and Stability with plaintiffs told to
focus on the original scraping, with many of their wider claims dismissed.
● Amid this uncertainty, Adobe, Google, Microsoft, and OpenAI have taken the unusual step of indemnifying their
customers against any legal claims they might face on copyright grounds.
The last ones standing: Self-driving companies Wayve and Waymo power ahead
With Wayve unveiling a $1.05B Series C and Waymo scaling across the US, the industry seems to be booming,
after years of hype followed by disappointment.
● Waymo has gradually scaled in San Francisco, Los Angeles, and Phoenix, with a planned Austin launch later this
year. The company has now abolished its SF waiting list, opening up the service to anyone.
● As well as raising fresh funding from Softbank, NVIDIA, and
Microsoft, Wayve scored a win when the UK passed legislation
allowing autonomous vehicles to hit the streets in 2026.
● The technology is also beginning to demonstrate commercial
potential. Alphabet has announced an additional $5B of
investment in Waymo, after its “Other Bets” unit, which includes
Waymo, delivered $365 million in quarterly revenue.
● Meanwhile, in August, the company announced that it had
reached 100,000 paid trips a week in the US, with 300 cars on
the road in SF alone.
Cash pours into humanoid start-ups…but are they set to be the next self-driving?
Humanoid start-ups like Figure, Sanctuary, and 1X have raised close to a billion dollars from corporate investors,
including Samsung, Microsoft, Intel, OpenAI, and NVIDIA. Can the tech overcome its limitations?
● Replicating the complexity of human motion and engineering human-like dexterity has historically proven to be
an expensive and technically difficult endeavor.
● Start-ups are betting that sophisticated VLMs, real-world training data and simulation, along with better
hardware can change this.
● Avid SOAI readers, however, will be familiar with the story of
self-driving - where breakthroughs were promised every year,
before companies undershot for half a decade.
● Customers must also be convinced that humanoids are more
efficient than cheaper, non-humanlike industrial robot systems.
● The appetite for non-humanoid robotics start-ups remains
healthy, despite Amazon’s recent pseudo-acquisition of
Covariant, a Bay Area robotics foundation model builder.
2023 Prediction: A Hollywood-grade production makes use of genAI for visual effects.
Visual effects are an expensive and labor-intensive business, so Hollywood producers have been slowly trying to
integrate generative AI, amid a backlash from artists and animators. While much of this work has been done
quietly and in post-production, eagle-eyed viewers have spotted clear signs of gen-AI related mishaps in the
background of HBO and Netflix productions. This ties back to long-standing issues around models’ ability to
represent physics and geometry accurately and consistently. Our prediction never said the output would be
good…
Text-to-speech is booming
ElevenLabs, the market leader in text-to-speech (TTS), hit unicorn status at the start of the year, with a $1.1B
valuation. With the big labs approaching the space tentatively, it has much of the field to itself.
● Alongside its flagship TTS product, the company has expanded into foreign-language dubbing and voice
isolation, and has previewed an early text-to-music model. Likely seeking to avoid a copyright blow-up, the
company has opted not to release it immediately, but has provided an API for sound effect generation.
● 62% of Fortune 500 companies now have at least one employee using
ElevenLabs.
● Meanwhile, the frontier labs have approached the space with caution,
likely out of concern that misuse of voice generation capabilities could
result in a potential backlash.
● GPT-4o’s voice outputs have been restricted to preset voices for general
release, while OpenAI has said it is yet to make a decision on whether it
will ever make its Voice Engine (which can allegedly recreate a voice
based on a 15-second recording) widely available.
● Meanwhile, Cartesia is betting on state space models for efficient TTS.
2020 data starts on 1 May. 2024 data stops on 1 Sept.
…while AI-first challengers scale revenue much quicker than their SaaS peers
Analysis of the 100 highest-grossing AI companies using Stripe reveals that, as a group, they are
generating revenue at a much faster pace than previous waves of equivalently well-performing SaaS companies.
Strikingly, the average AI company that has reached $30M+ annualised revenue took just 20 months to get
there, compared to 65 months for equally promising SaaS companies.
But high-end model providers face a squeeze from cheap and OS competitors
US text-to-video start-ups sell subscription plans based on credits, but with a single second of video burning
through 5 Runway or Pika credits, users have to make sure they master the art of prompting quickly.
Text-to-video tends to come with lower GPU requirements than LLMs, creating an opportunity for cheaper
Chinese offerings like Kuaishou’s Kling, unconstrained by copyright fears, or highly capable open source models
like CogVideoX.
While over the last 2 years, mega $250M+ rounds dominated AI financings
There appears to be a clear “pre/post-GPT-4 era” (2023) that triggered all funding systems to go on steroids…
The IPO market remains lifeless, while M&A activity drifts -23% from its 2021 peak
Amid mounting regulatory scrutiny and shaky markets following post-Covid stimulus, dealmaking has been icy, as
companies maintain a ‘wait and see’ attitude.
Attention is all you need… to raise billions for and sell your AI start-up
Noam Shazeer of Character.ai sold his team back to Google for $2.5B, while Adept was acqui-hired into
Amazon and Inflection into Microsoft for $650M. These deals all involved hiring founders and star employees
while paying enough money to investors as a technology licensing fee to get the deals through.
Total raised → deal value: Adept $415M → NA; Inflection $1.5B → $650M; Character.ai $193M → $2.5B
Section 3: Politics
The EU AI Act finally passes into law, following frantic last-minute lobbying
In March, the European Parliament passed the AI Act after an intensive Franco-German influence campaign to
weaken certain provisions. Questions about implementation, however, remain unanswered.
● With the passage of the act, Europe is now the first bloc in the world to adopt a full-scale regulatory
framework for AI. Enforcement will be rolled out in stages, with the ban on “unacceptable risk” systems (e.g.
deception, social scoring) coming into force in February 2025.
● France and Germany managed to secure changes that tiered the
foundation model regulations, with a basic set of rules applying to
all models and additional regulations for those being deployed in
a sensitive environment.
● The full-on ban on facial recognition has now been watered down
to allow its use by law enforcement.
● While industry is concerned about the law, the months of
consultation and large amount of secondary legislation required
means it still has time to shape the specifics of implementation if
it engages constructively.
Big in Japan?
For a combination of political and cultural reasons, Japan has historically been a placid market for both venture
capital and AI start-ups. The government is suddenly keen to get a slice of the action.
● The Japanese government sees VC and AI as a potential vehicle for kickstarting a long-stagnant economy, while
Japan presents an opportunity for investors who’d rather not have to raise from deep-pocketed Gulf states.
● Tokyo-based Sakana has already pulled in $200M from US investors like Lux
Capital and Khosla Ventures, while a16z is reported to be planning a Japan office.
● In turn, Japanese government-funded investment vehicles have invested in two
of US VC NEA’s funds and are actively exploring others. Mitsubishi is said to be
investing in the second AI fund of Stanford’s Andrew Ng.
● Meanwhile, the country prides itself on a light-touch approach to
regulation, focusing on industry-led oversight, and appears
unsympathetic to copyright claims around generative AI. However, it
has created a UK-style safety institute.
● Sensing the momentum, Microsoft has announced $2.9B of investment in Japanese
AI and cloud infrastructure.
Amid sharply rising compute bills, sovereign wealth influence begins to grow
With the capex needs of frontier labs growing beyond what traditional VC alone can supply, labs are
looking further afield. Alarm bells are already ringing in the corridors of power.
● Following the downfall of FTX, its 8% stake in Anthropic was sold primarily to Mubadala, the government of
Abu Dhabi’s sovereign wealth fund. A Saudi bid was turned down on national security grounds, although Saudi
investors Prince Alwaleed Bin Talal and Kingdom Holding participated in X.ai’s Series B.
● Most controversially, G42, an Emirati AI-focused holding company,
has struck a partnership with OpenAI to work in the country’s
finance, energy, and healthcare sectors.
● G42’s holdings in prominent Chinese technology companies,
including Bytedance, prompted panic in the US intelligence
community.
● In the end, G42 was pressured into divesting its Chinese holdings
and accepting a $1.5B investment from Microsoft, with Microsoft
President Brad Smith joining the board.
Section 4: Safety
UK creates the world’s first AI Safety Institute and the US swiftly follows
Coinciding with the Bletchley Summit, the UK announced that its Frontier AI Taskforce was to be superseded by
the AI Safety Institute (AISI) - the world’s first. The US, Japan, and Canada have all followed with smaller efforts.
● The AISI has three core functions: i) conduct evaluations on advanced models before their deployment, ii)
build up state capacity around safety and conduct research, and iii) coordinate with international partners.
● It announced an MoU with its US equivalent, with the two agreeing to work
together on the development of tests, while the AISI is planning an SF office.
● OpenAI has said that it will offer the US AISI early access to its next model.
● The AISI has also released Inspect, a framework for LLM safety evaluation,
covering core knowledge, reasoning ability, and autonomous capabilities, among
other things.
● However, there is a debate about the extent to which the AISI should focus
on standard setting (which it is well set up to do) versus evaluations (where it
will rely more on the goodwill of industry).
Direct Preference Optimization offers an escape from “reward hacking”…or does it?
First proposed as an alternative to RLHF in 2023, DPO has no explicit reward function and comes with efficiency
advantages because it doesn’t sample from a policy during training or require extensive hyperparameter tuning.
Despite its novelty, the method has already been used to align Llama 3.1 and Qwen2.
● However, there are signs that the “over-optimization” traditionally associated with RLHF can also
happen with DPO and other direct alignment algorithms (DAAs), despite the absence of a reward
model. This worsens as models are allowed to deviate further from their starting point while learning to
align with human preferences.
● This could be the result of underconstrained objectives,
where the algorithm unintentionally assigns high
probabilities to out-of-distribution data.
● This is inherent to DAAs, but can be partially mitigated
through careful parameter tuning and increased model size.
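The DPO objective described above can be sketched in a few lines. This is a minimal, illustrative implementation for a single preference pair, assuming the inputs are summed token log-probabilities of the chosen and rejected completions under the trainable policy and a frozen reference model; all names are ours, not from any library.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected
    completions under the policy (pi_*) and the frozen reference
    model (ref_*). No explicit reward model is involved: the
    implicit reward is beta times the policy-vs-reference log-ratio.
    """
    # Implicit rewards: how far the policy has moved from the
    # reference on each completion.
    r_chosen = beta * (pi_chosen - ref_chosen)
    r_rejected = beta * (pi_rejected - ref_rejected)
    # Bradley-Terry style objective: -log sigmoid(reward margin).
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the policy prefers the chosen completion more
# strongly (relative to the reference) than the rejected one.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Note that nothing in the loss constrains behavior on sequences unlike the training pairs, which is one intuition for the underconstrained-objective failure mode discussed above.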
Transparency is on the up, but there’s significant room for improvement still
Shortly after the last SOAI, Stanford published its first Foundation Model Transparency Index, giving model
developers an average score of 37. This climbed to 58 in the team’s interim update.
● In May 2024, the latest installment of the index assessed the transparency of 14 leading foundation model
developers based on 100 indicators spanning ‘upstream’ factors (data, labor, compute), ‘model-level’ factors
around capabilities and risk, and ‘downstream’ criteria around distribution and societal impact.
● Scores on compute and usage policies have seen the strongest improvements, while ‘upstream’ ratings
remain weak.
Maybe the black box just isn’t that opaque after all?
We’ve seen a run of interpretability research, including work on sparse autoencoders (SAEs), arguing that
high-level semantic concepts are encoded “linearly” in model representations - and that they can be decoded.
● A Chicago/Carnegie Mellon team introduced a simplified model where words and sentences are represented by
binary "concept" variables. They prove that these concepts end up being represented linearly within the model’s
internal space, thanks to next-token prediction and the tendency of gradient descent to find simple, linear
solutions.
● This linearity was also the theme of work from the Moscow-based AI
Research Institute, which argued that transformations happening within
the model can be approximated with simple linear operations.
● Google has introduced a popular new method for decoding intermediate
neurons. Patchscopes takes a hidden representation from an LLM and
‘patches’ it into a different prompt. That prompt is then used to generate a
description or answer a question, revealing the encoded information.
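The core claim - that a linearly encoded concept can be read off with a simple decoder - can be illustrated with a toy linear probe. This is a synthetic sketch, not any of the cited papers’ setups: we fabricate “hidden states” in which a binary concept is encoded along one fixed direction, then fit a logistic-regression probe from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 64-dim "hidden states" in which a binary concept is
# linearly encoded along a fixed direction, plus Gaussian noise.
d, n = 64, 2000
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
hidden = rng.normal(size=(n, d)) + 1.5 * np.outer(2 * labels - 1, direction)

# Fit a linear probe (logistic regression via gradient descent).
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))   # predicted P(label=1)
    w -= 0.5 * (hidden.T @ (p - labels)) / n       # gradient step on weights
    b -= 0.5 * np.mean(p - labels)                 # gradient step on bias

acc = np.mean((hidden @ w + b > 0) == labels)
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy here simply reflects that the concept was planted linearly; in the interpretability work above, the interesting finding is that real LLM representations behave similarly.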
Section 5: Predictions
1. A $10B+ investment from a sovereign state into a US large AI lab invokes national security review.
2. An app or website created solely by someone with no coding ability will go viral (e.g. App Store Top-100).
3. Frontier labs implement meaningful changes to data collection practices after cases begin reaching trial.
4. Early EU AI Act implementation ends up softer than anticipated after lawmakers worry they’ve overreached.
7. Levels of investment in humanoids will trail off, as companies struggle to achieve product-market fit.
8. Strong results from Apple’s on-device research accelerate momentum around personal on-device AI.
10. A video game based around interacting with GenAI-based elements will achieve break-out status.
Thanks!
Congratulations on making it to the end of the State of AI Report 2024! Thanks for reading.
In this report, we set out to capture a snapshot of the exponential progress in the field of artificial intelligence, with
a focus on developments since last year’s issue, published on 12 October 2023. We believe that AI will be a
force multiplier on technological progress in our world, and that wider understanding of the field is critical if we are
to navigate such a huge transition.
This edition compiles the things that caught our attention over the last year across AI research, industry,
politics, and safety.
We would appreciate any and all feedback on how we could improve this report further, as well as contribution
suggestions for next year’s edition.
Reviewers
We’d like to thank the following individuals for providing critical review of this year’s Report:
Anastasia Borovykh, Daniel Campos, Safiye Celik, Mehdi Ghissassi, Corina Gurau, Charlie Harris, Max Jaderberg, Harry
Law, Omar Sanseviero, Patrick Schwab, Shubho Sengupta, and Joe Spisak.
Conflicts of interest
The authors declare a number of conflicts of interest as a result of being investors and/or advisors, personally or via
funds, in a number of private and public companies whose work is cited in this report. Notably, the authors are
investors in companies listed at: airstreet.com/portfolio
STATE OF AI REPORT
October 10, 2024
Nathan Benaich
AIR STREET CAPITAL
stateof.ai airstreet.com