
STATE OF AI REPORT

October 10, 2024


Nathan Benaich
AIR STREET CAPITAL

stateof.ai airstreet.com

About the authors

Nathan Benaich
Nathan is the General Partner of Air Street Capital, a
venture capital firm investing in AI-first companies. He
runs the Research and Applied AI Summit (RAAIS), the
RAAIS Foundation (funding open-source AI projects), AI
communities in the US and Europe, and Spinout.fyi
(improving university spinout creation). He studied
biology at Williams College and earned a PhD from
Cambridge in cancer research as a Gates Scholar.

About the authors

Alex Chalmers
Alex is Platform Lead at Air Street Capital and
regularly writes research, analysis, and commentary on
AI via Air Street Press. Before joining Air Street, he was
an associate director at Milltown Partners, where he
advised big technology companies, start-ups, and
investors on policy and positioning. He graduated from
the University of Oxford in 2017 with a degree in
History.

Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.

We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is
because everything around us today, ranging from culture to consumer products, is a product of intelligence.

The State of AI Report is now in its seventh year. Consider this report a compilation of the most interesting things we’ve
seen, with the goal of triggering an informed conversation about the state of AI and its implications for the future.

We consider the following key dimensions in our report:


- Research: Technology breakthroughs and their capabilities.
- Industry: Areas of commercial application for AI and its business impact.
- Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
- Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
- Predictions: What we believe will happen in the next 12 months and a 2023 performance review to keep us honest.

Produced by Nathan Benaich and the Air Street Capital team


Definitions
Artificial intelligence (AI): a broad discipline with the goal of creating intelligent machines, as opposed to the natural intelligence that is
demonstrated by humans and animals.
Artificial general intelligence (AGI): a term used to describe future machines that could match and then exceed the full range of human cognitive
ability across all economically valuable tasks.
AI Agent: an AI-powered system that can take actions in an environment. For example, an LLM that has access to a suite of tools and has to decide
which one to use in order to accomplish a task that it has been prompted to do.
AI Safety: a field that studies and attempts to mitigate the risks (minor to catastrophic) which future AI could pose to humanity.
Computer vision (CV): the ability of a program to analyse and understand images and video.
Deep learning (DL): an approach to AI inspired by how neurons in the brain recognise complex patterns in data. The “deep” refers to the many layers
of neurons in today’s models that help learn rich representations of data for better performance.
Diffusion: An algorithm that iteratively denoises an artificially corrupted signal in order to generate new, high-quality outputs. In recent years it has
been at the forefront of image generation and protein design.
Generative AI: A family of AI systems that are capable of generating new content (e.g. text, images, audio, or 3D assets) based on 'prompts'.
Graphics Processing Unit (GPU): a semiconductor processing unit that enables a large number of calculations to be computed in parallel. Historically,
this was required for rendering computer graphics. Since 2012, GPUs have been adapted for training DL models, which also require a large number of
parallel calculations.

Definitions
(Large) Language model (LM, LLM): a model trained on vast amounts of (often) textual data to predict the next word in a self-supervised manner.
The term “LLM” is used to designate multi-billion parameter LMs, but this is a moving definition.
Machine learning (ML): a subset of AI that often uses statistical techniques to give machines the ability to "learn" from data without being explicitly
given the instructions for how to do so. This process is known as “training” a “model” using a learning “algorithm” that
progressively improves model performance on a specific task.
Model: an ML algorithm trained on data and used to make predictions.
Natural language processing (NLP): the ability of a program to understand human language as it is spoken and written.
Prompt: a user input often written in natural language that is used to instruct an LLM to generate something or take action.
Reinforcement learning (RL): an area of ML in which software agents learn goal-oriented behavior by trial and error in an environment that
provides rewards or penalties in response to their actions; the learned strategy for choosing actions is called a “policy”.
Self-supervised learning (SSL): a form of unsupervised learning, where manually labeled data is not needed. Raw data is instead modified in an
automated way to create artificial labels to learn from. An example of SSL is learning to complete text by masking random words in a sentence and
trying to predict the missing ones.
Transformer: a model architecture at the core of most state of the art (SOTA) ML research. It is composed of multiple “attention” layers which learn
which parts of the input data are the most important for a given task. Transformers started in NLP (specifically machine translation) and
subsequently were expanded into computer vision, audio, and other modalities.

Definitions
Model type legend
In the rest of the slides, icons in the top right corner indicate input and output modalities for the model.

Input/Output types:
📝 : Text
🖼 : Image
</> : Code
🛠 : Software tool use (text, code generation & execution)
🎥 : Video
🎵 : Music
📦 : 3D
🤖 : Robot state
🧬 : Biological modality

Model types:
📝 → 📝 : LLMs
📝 + 🖼 → 📝 : Multimodal LLMs
📝 + 🖼 + 🤖 → 📝 : Multimodal LLMs for Robotics
📝 → </> : Text to Code
📝 → 🛠 : Text to Software tool use
📝 → 🖼 : Text to Image
📝 → 🎥 : Text to Video
📝 → 🎵 : Text to Music
🖼 → 📦 : Image to 3D
📝 → 📦 : Text to 3D
📝 → 🧬 : Biological models


Executive Summary
Research
- Frontier lab performance converges, but OpenAI maintains its edge following the launch of o1, as planning and reasoning emerge as a major frontier.
- Foundation models demonstrate their ability to break out of language as multimodal research drives into mathematics, biology, genomics, the physical
sciences, and neuroscience.
- US sanctions fail to stop Chinese (V)LLMs rising up community leaderboards.
Industry
- NVIDIA remains the most powerful company in the world, enjoying a stint in the $3T club, while regulators probe the concentrations of power within GenAI.
- More established GenAI companies bring in billions of dollars in revenue, while start-ups begin to gain traction in sectors like video and audio generation.
Although companies begin to make the journey from model to product, long-term questions around pricing and sustainability remain unresolved.
- Driven by a bull run in public markets, AI companies reach $9T in value, while investment levels grow healthily in private companies.
Politics
- While global governance efforts stall, national and regional AI regulation has continued to advance, with controversial legislation passing in the US and EU.
- The reality of compute requirements forces Big Tech companies to reckon with real-world physical constraints on scaling and their own emissions targets.
Meanwhile, governments’ own attempts to build capacity continue to lag.
- Anticipated AI effects on elections, employment and a range of other sensitive areas are yet to be realized at any scale.
Safety
- A vibe-shift from safety to acceleration takes place as companies that previously warned us about the pending extinction of humanity need to ramp up
enterprise sales and usage of their consumer apps.
- Governments around the world emulate the UK in building up state capacity around AI safety, launching institutes and studying critical national infrastructure
for potential vulnerabilities.
- Every proposed jailbreaking ‘fix’ has failed, but researchers are increasingly concerned with more sophisticated, long-term attacks.

Scorecard: Reviewing our predictions from 2023


Our 2023 Prediction | Evidence

● A Hollywood-grade production makes use of generative AI for visual effects. YES - Largely badly, but GenAI visual effects have been seen in Netflix and HBO productions.
● A generative AI media company is investigated for its misuse during the 2024 US election circuit. ~ - Not yet, but there’s still time.
● Self-improving AI agents crush SOTA in a complex environment (e.g. AAA game, tool use, science). NO - Not yet, despite promising work on open-endedness, including strong game performance.
● Tech IPO markets unthaw and we see at least one major listing for an AI-focused company (e.g. DBRX). ~ - While the Magnificent Seven have enjoyed strong gains, private companies are hanging on until markets settle. However, AI chip company Cerebras has filed to IPO.
● The GenAI scaling craze sees a group spend >$1B to train a single large-scale model. NO - Not quite yet; let’s give it another year.
● The US’s FTC or UK’s CMA investigate the Microsoft/OpenAI deal on competition grounds. YES - Both regulators are investigating this partnership.
● We see limited progress on global AI governance beyond high-level voluntary commitments. YES - The commitments from the Bletchley and Seoul summits remain voluntary and high-level.
● Financial institutions launch GPU debt funds to replace VC equity dollars for compute funding. NO - Some VC funds are rumored to be offering GPUs for equity, but we’re yet to see anyone go down the debt route.
● An AI-generated song breaks into the Billboard Hot 100 Top 10 or the Spotify Top Hits 2024. YES - It turns out this had already happened last year with “Heart on My Sleeve”, but we’ve also seen an AI-generated song reach #27 in Germany and spend several days in the Top 50.
● As inference workloads and costs grow significantly, a large AI company (e.g. OpenAI) acquires or builds an inference-focused AI chip company. YES - Sam Altman is reportedly raising huge sums of money to do this, while each of Google, Amazon, Meta and Microsoft continue to build and improve their own AI silicon.


Section 1: Research

📝→📝
OpenAI’s reign of terror came to an end, until…
For much of the year, both benchmarks and community leaderboards pointed to a chasm between GPT-4 and ‘the
best of the rest’. However, Claude 3.5 Sonnet, Gemini 1.5, and Grok 2 have all but eliminated this gap as model
performance now begins to converge.
● On both formal benchmarks and vibes-based analysis, the best-funded frontier labs are able to rack up scores
within low single digits of each other on individual capabilities.
● Models are now consistently highly capable coders and strong at factual recall and math, but less good at
open-ended question-answering and multi-modal problem solving.
● Many of the variations are sufficiently small that they are now likely to be the
product of differences in implementation. For example, GPT-4o outperforms
Claude 3.5 Sonnet on MMLU, but apparently underperforms it on MMLU-Pro - a
benchmark designed to be more challenging.
● Considering the relatively subtle technical differences between architectures and
likely heavy overlaps in pre-training data, model builders are now increasingly
having to compete on new capabilities and product features.
📝→📝
…the Strawberry landed, doubling down on scaling inference compute
The OpenAI team had clearly clocked the potential of inference compute early, with OpenAI o1 appearing within
weeks of papers from other labs exploring the technique.
● By shifting compute from pre- and post-training to inference, o1 reasons through complex prompts step-by-step
in a chain-of-thought (COT) style, employing RL to sharpen the COT and the strategies it uses. This unlocks the
possibility of solving multi-layered math, science, and coding problems where LLMs have historically struggled,
due to the inherent limitations of next-token prediction.
● OpenAI report significant improvements on reasoning-heavy benchmarks versus 4o, the starkest being on
AIME 2024 (competition math), with a whopping score of 83.83 versus 13.4.
● However, this capability comes at a steep price: 1M input tokens of o1-preview cost $15, while 1M output
tokens will set you back $60. This makes it 3-4x more expensive than GPT-4o (see the cost sketch after this list).
● OpenAI is clear in its API documentation that it is not a like-for-like 4o
replacement and that it is not the best model for tasks that require
consistently quick responses, image inputs or function calling.
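The price gap is easy to make concrete. A minimal cost sketch follows; the o1-preview figures are from the bullet above, while the GPT-4o figures ($5 per 1M input tokens, $15 per 1M output tokens) are our assumption based on OpenAI's published API pricing at the time of writing.

```python
# Token-based API cost comparison. o1-preview prices are from the text
# above; GPT-4o prices are an assumption based on OpenAI's published
# pricing at the time of writing.
PRICES = {
    "o1-preview": {"input": 15.00, "output": 60.00},
    "gpt-4o": {"input": 5.00, "output": 15.00},  # assumed
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request (prices are per 1M tokens)."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6

# A request with 2k input tokens and 1k output tokens:
for model in PRICES:
    print(f"{model}: ${cost(model, 2_000, 1_000):.4f}")
# Input is 3x and output is 4x the GPT-4o price, hence "3-4x". Note that
# o1 also bills its hidden reasoning tokens as output tokens, so
# real-world cost ratios can skew even higher.
```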
📝→📝
o1 showcases both areas of incredible strength and persistent weakness
The community were quick to put o1 through its paces, finding that it performed significantly better than other
LLMs on certain logical problems and puzzles. Its true edge shone through, however, on complex math and
science tasks, with a viral video of a PhD student reacting with astonishment as it reproduced a year of his PhD
code in approximately an hour. However, the model remains weaker on certain kinds of spatial reasoning. Like its
predecessors, it can’t play chess to save its life… yet.

📝→📝
Llama 3 closes the gap between open and closed models
In April, Meta dropped the Llama 3 family, 3.1 in July, and 3.2 in September. Llama 3.1 405B, their largest
model to date, is able to hold its own against GPT-4o and Claude 3.5 Sonnet across reasoning, math, multilingual, and
long-context tasks. This marks the first time an open model has closed the gap with the proprietary frontier.
● Meta stuck to the same decoder-only transformer architecture that it’s used since Llama 1, with minor
adaptations, namely more transformer layers and attention heads.
● Meta used an incredible 15T tokens to train the family. While this blew through the “Chinchilla-optimal”
amount of training compute, they found that both the 8B and 70B models improved log-linearly up to 15T.
● Llama 3.1 405B was trained on 16,000 H100 GPUs, the first
Llama model trained at this scale.
● Meta followed up with Llama 3.2 in September, which
incorporated 11B and 90B VLMs (Llama’s multimodal debut).
The former was competitive with Claude 3 Haiku, the latter
with GPT-4o-mini. The company also released 1B and 3B
text-only models, designed to operate on-device.
● Llama-based models have now racked up over 440M downloads on Hugging Face.
📝→📝
But how ‘open’ are ‘open source’ models?
With open source commanding considerable community support and becoming a hot button regulatory issue,
some researchers have suggested that the term is often used misleadingly. It can be used to lump together
vastly different openness practices across weights, datasets, licensing, and access methods.

📝→📝
Is contamination inflating progress?
With new model families reporting incredibly strong benchmark performance straight out-of-the-gate,
researchers have increasingly been shining a light on dataset contamination: when test or validation data leaks
into the training set. Researchers from Scale retested a number of models on a new Grade School Math 1000
(GSM1k) benchmark that mirrors the style and complexity of the established GSM8k, finding significant
performance drops in some cases. Similarly, researchers at X.ai re-evaluated models using a dataset based on
the Hungarian national finals math exam that post-dated their release, with similar results.

📝→📝
Researchers try to correct problems in widely used benchmarks
But benchmarking challenges cut both ways. There are alarmingly high error rates in some of the most popular
benchmarks that could be leading us to underestimate the capabilities of some models, with safety implications.
Meanwhile, the temptation to overfit is strong.
● A team from the University of Edinburgh flagged up the number of mistakes in MMLU, including wrong
ground truths, unclear questions, and multiple correct answers. While error rates were low across most individual
topics, there were big spikes in certain fields, such as virology, where 57% of the analyzed instances contained errors.
● On a manually corrected MMLU subset, models broadly gained in performance, although they worsened on professional
law and formal logic. This suggests inaccurate MMLU instances are being learned during pre-training.
● In more safety-critical territory, OpenAI has warned that SWE-bench, which
evaluates models’ ability to solve real-world software issues, was underestimating
the autonomous software engineering capabilities of models, as it contained tasks
that were hard or impossible to solve.
● The researchers partnered with the creators of the benchmark to create
SWE-bench Verified.

📝→📝
Live by the vibes, die by the vibes…or close your eyes for a year and OpenAI is still #1
The LMSYS Chatbot Arena Leaderboard has emerged as the community’s favorite method of formalizing
evaluation by “vibes”. But as model performance improves, it’s beginning to produce counterintuitive results.
● The arena, which allows users to interact with two randomly selected chatbots side-by-side, provides a rough
crowdsourced evaluation (a minimal rating sketch follows below).
● However, controversially, this led to GPT-4o and GPT-4o Mini
receiving the same scores, with the latter also outperforming
Claude Sonnet 3.5.
● This has led to concerns that the ranking is essentially
becoming a way of assessing which writing style users happen
to prefer most.
● Additionally, as smaller models tend to perform less well on
tasks involving more tokens, the 8k context limit arguably
gives them an unfair advantage.
● However, the early version of the vision leaderboard is now
beginning to gain traction and aligns better with other evals.
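For readers unfamiliar with how pairwise votes become a leaderboard, here is a minimal Elo-style sketch. It is purely illustrative: LMSYS's production methodology is a statistical Bradley-Terry fit with confidence intervals, not this online update.

```python
# Minimal Elo-style rating sketch: arena-style leaderboards turn
# "which answer did you prefer?" votes into ratings roughly like this.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return the post-vote ratings for A and B."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

ratings = {"model-a": 1000.0, "model-b": 1000.0}
for a_won in [True, True, False, True]:  # hypothetical stream of votes
    ratings["model-a"], ratings["model-b"] = update(
        ratings["model-a"], ratings["model-b"], a_won
    )
print(ratings)  # model-a edges ahead after winning 3 of 4 votes
```

Because the update only sees win/loss, a model that wins on writing style climbs just as fast as one that wins on substance, which is exactly the concern raised above.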
📝+🖼→📝
Are neuro-symbolic systems making a comeback?
Deficiencies in both reasoning capabilities and training data mean that AI systems have frequently fallen short
on math and geometry problems. With AlphaGeometry, a symbolic deduction engine comes to the rescue.
● A Google DeepMind/NYU team generated millions of synthetic theorems and proofs using symbolic engines,
using them to train a language model from scratch.
● AlphaGeometry alternates between the language model proposing new constructions and symbolic engines
performing deductions until a solution is found (a schematic loop follows below).
● Impressively, it solved 25 out of 30 problems on a benchmark of Olympiad-level geometry problems, nearing
human International Mathematical Olympiad gold medalist performance. The next best AI system scored only 10.
● It also demonstrated generalisation capabilities - for example, finding that a specific detail in a 2004 IMO
problem was unnecessary for the proof.
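A schematic of that alternation loop, with every name (SymbolicEngine methods, construction_lm) a hypothetical stand-in rather than DeepMind's actual API:

```python
# Schematic AlphaGeometry-style loop: exhaust symbolic deduction, and
# when stuck, ask a language model for a new auxiliary construction.
# All names here are hypothetical stand-ins, not DeepMind's API.
def solve(problem, goal, construction_lm, engine, max_constructions=16):
    state = engine.initialize(problem)
    for _ in range(max_constructions):
        state = engine.deduce(state)          # deduce to closure
        if engine.proves(state, goal):
            return engine.extract_proof(state, goal)
        # Stuck: the LM proposes an auxiliary construction (e.g. a
        # midpoint or a circle) conditioned on the current diagram.
        construction = construction_lm.propose(state, goal)
        state = engine.add_construction(state, construction)
    return None  # no proof found within the construction budget
```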
📝→📝
It’s possible to shrink models with minimal impact on performance…
Research suggests that models are robust in the face of deeper layers - which are meant to handle complex,
abstract, or task-specific information - being pruned intelligently. Maybe it’s possible to go even further.
● A Meta/MIT team looking at open-weight pre-trained LLMs concluded that it’s possible to do away with up to
half of a model’s layers and suffer only negligible performance drops on question-answering benchmarks.
● They identified optimal layers for removal based on similarity and then “healed” the model through small
amounts of efficient fine-tuning (a sketch of this recipe follows below).
● NVIDIA researchers took a more radical approach by
pruning layers, neurons, attention heads, and
embeddings, and then using knowledge distillation for
efficient retraining.
● The MINITRON models, derived from Nemotron-4 15B,
achieved comparable or superior performance to
models like Mistral 7B and Llama-3 8B while using up
to 40x fewer training tokens.
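A minimal sketch of the similarity-guided pruning idea, assuming access to per-layer hidden states. This approximates the Meta/MIT recipe (which uses an angular distance between layer inputs and outputs) rather than reproducing it exactly.

```python
# Score each contiguous block of n layers by how little it changes the
# hidden states: high input/output cosine similarity suggests the block
# is redundant and can be dropped, followed by a brief "healing" finetune.
import torch

@torch.no_grad()
def block_redundancy(hidden_states: list[torch.Tensor], n: int) -> list[float]:
    """hidden_states[i] is the activation entering layer i, shape [tokens, d].
    Returns, per start index i, the mean cosine similarity between states
    entering layer i and states leaving layer i+n-1."""
    scores = []
    for i in range(len(hidden_states) - n):
        a = torch.nn.functional.normalize(hidden_states[i], dim=-1)
        b = torch.nn.functional.normalize(hidden_states[i + n], dim=-1)
        scores.append((a * b).sum(-1).mean().item())
    return scores

# scores = block_redundancy(states, n=8); start = scores.index(max(scores))
# del model.layers[start:start + 8]   # prune the most redundant block
# ...then "heal" with a small amount of parameter-efficient fine-tuning.
```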
📝→📝
…as distilled models become more fashionable
As Andrej Karpathy and others have argued, current large model sizes could be a reflection of inefficient training.
Using these big models to refine and synthesize training data could help train capable smaller models.
● Google have embraced this approach, distilling Gemini 1.5 Flash from Gemini 1.5 Pro, while Gemma 2 9B was
distilled from Gemma 2 27B, and Gemma 2B from a larger unreleased model.
● There is also community speculation that Claude 3 Haiku, a highly capable smaller model, is a distilled version
of the larger Opus, but Anthropic has never confirmed this.
● These distillation efforts are going multimodal too. Black Forest
Labs have released FLUX.1 dev, an open-weight text-to-image model
distilled from their Pro model.
● To support these efforts, the community has started to produce
open-source distillation tools, like arcee.ai’s DistillKit, which
supports both logit-based and hidden-state-based distillation (a minimal
logit-distillation loss is sketched below).
● Llama 3.1 405B is also being used for distillation, after Meta
updated its terms so output logits can be used to improve any
models, not just Llama ones.
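For the logit-based flavour, the core loss is short enough to sketch. This is the generic knowledge-distillation objective that tools like DistillKit support, not any particular lab's recipe.

```python
# Generic logit distillation: train the student to match the teacher's
# softened token distribution, mixed with ordinary cross-entropy.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    t = temperature
    # Soft targets: KL between teacher and student at temperature t.
    soft = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)  # t^2 keeps gradient magnitudes comparable across t
    # Hard targets: standard next-token cross-entropy.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard
```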
📝→📝
Models built for mobile compete with their larger peers
As big tech companies think through large-scale end user deployment, we’re starting to see high-performing LLM
and multimodal models that are small enough to run on smartphones.
● Microsoft’s phi-3.5-mini is a 3.8B LM that competes with larger models in the 7B-8B class,
such as Llama 3.1 8B. It performs well on reasoning and question-answering, but
its size restricts its factual knowledge. To enable on-device inference, the model
was quantized to 4 bits, reducing its memory footprint to approximately 1.8GB.
● Apple introduced MobileCLIP, a family of efficient image-text models optimized
for fast inference on smartphones. Using novel multimodal reinforced training,
they improve the accuracy of compact models by transferring knowledge from
an image captioning model and an ensemble of strong CLIP encoders.
● Hugging Face also got in on the action with SmolLM, a family of small language
models, available in 135M, 360M, and 1.7B formats. By using a highly curated
synthetic dataset created via an enhanced version of Cosmopedia (see slide 31)
the team achieved SOTA performance for the size.
📝+🖼→📝
Strong results in quantization point to an on-device future
It’s possible to shrink the memory requirements of LLMs by reducing the precision of their parameters.
Researchers are increasingly managing to minimize the performance trade-offs.
● Microsoft’s BitNet uses a “BitLinear” layer to replace standard linear layers,
employing 1-bit weights and quantized activations.
● It shows competitive performance compared to full-precision models and
demonstrates a scaling law similar to full-precision transformers, with
significant memory and energy savings.
● Microsoft followed up with BitNet b1.58, which uses ternary weights to match
full-precision LLM performance at 3B size while retaining the efficiency gains (sketched below).
● Meanwhile, ByteDance’s TiTok (Transformer-based 1-Dimensional
Tokenizer) quantizes images into compact 1D sequences of discrete tokens
for image reconstruction and generation tasks. This allows images to be
represented with as few as 32 tokens, instead of hundreds or thousands.
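The ternary scheme is simple enough to sketch. This follows the "absmean" quantization described in the BitNet b1.58 paper, simplified to a single per-tensor scale:

```python
# Ternary ("1.58-bit") weight quantization: scale by the mean absolute
# value, then round every weight to the nearest of {-1, 0, +1}.
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)   # per-tensor absmean scale
    w_q = (w / scale).round().clamp(-1, 1)  # ternary weights
    return w_q, scale                       # forward pass uses w_q * scale

w = torch.randn(4, 4)
w_q, scale = quantize_ternary(w)
print(w_q)          # entries are -1., 0., or 1.
print(w_q * scale)  # coarse approximation of the original weights
# log2(3) ≈ 1.58 bits of information per weight, hence the name.
```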

📝→📝
Will representation fine tuning unlock on-device personalization?
Parameter-efficient fine-tuning (e.g. via LoRA) is nothing new, but Stanford researchers believe a more targeted
approach offers greater efficiency and adaptation.
● Inspired by model interpretability research, ReFT (Representation Fine-tuning) doesn’t alter the model’s
weights. Instead, it manipulates the model’s internal representations at inference time to steer its behavior.
● While it comes with a slight inference-time penalty, ReFT requires 15-65x fewer parameters compared to
weight-based fine-tuning methods.
● It also allows more selective interventions on specific
layers and token positions, enabling fine-grained control
over the adaptation process (a simplified sketch follows below).
● The researchers show its potential in few-shot adaptation
where a chat model is given a new persona with just five
examples. Combined with the small storage footprint for
learned interventions, it could be used for real-time
personalization on devices with sufficient compute power.
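A simplified sketch of the idea, assuming a frozen base model: a small learned edit is applied to hidden states at chosen layers and token positions. The real LoReFT parameterisation constrains its projection to be orthonormal; this sketch keeps only the low-rank shape.

```python
# ReFT-style low-rank intervention: the LLM's weights stay frozen and
# only this tiny module (hooked onto one layer's output at selected
# token positions) is trained. Simplified relative to LoReFT.
import torch
import torch.nn as nn

class LowRankIntervention(nn.Module):
    def __init__(self, d_model: int, rank: int = 4):
        super().__init__()
        self.R = nn.Linear(d_model, rank, bias=False)   # read subspace
        self.W = nn.Linear(d_model, rank)               # learned target
        self.up = nn.Linear(rank, d_model, bias=False)  # write edit back

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Edit h only within a rank-dimensional subspace; the rest of
        # the representation passes through untouched.
        return h + self.up(self.W(h) - self.R(h))

# With d_model=4096 and rank=4, that is roughly 50k trainable parameters
# per intervention - a tiny artifact to store per persona or task, which
# is what makes on-device personalization plausible.
```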
📝→📝
Hybrid models begin to gain traction
Models that combine attention and other mechanisms are able to maintain or even improve accuracy, while
reducing computational costs and memory footprint.
● Selective state-space models like Mamba, designed last year to handle long sequences more efficiently, can to
some extent compete with transformers, but lag on tasks that require copying or in-context learning. That
said, Falcon’s Mamba 7B shows impressive benchmark performance versus similar-sized transformer models.
● Hybrid models appear to be a more promising direction. Combining Mamba layers with self-attention and MLP layers,
AI21’s Mamba-Transformer hybrid model outperforms an 8B transformer baseline across knowledge and reasoning
benchmarks, while generating tokens up to 8x faster at inference.
● In a nostalgia trip, there are early signs of a comeback for recurrent
neural networks, which had fallen out of fashion due to training and
scaling difficulties.
● Griffin, trained by Google DeepMind, mixes linear recurrences and local
attention, holding its own against Llama-2 while being trained on 6x
fewer tokens.
📝→📝
And could we distill transformers into hybrid models? It’s…complicated.
By transferring knowledge from a larger, more powerful model, one could improve the performance of
subquadratic models, allowing us to harness their efficiency on downstream tasks.
● MOHAWK is a new method for distilling knowledge from a large, pre-trained transformer model (teacher) to a
smaller, subquadratic model (student) like a state-space model (SSM).
● It aligns i) the sequence transformation matrices of the student and teacher models and ii) the hidden states of
each layer, then iii) transfers the remaining weights of the teacher model to the student model and fine-tunes it.
● The authors create Phi-Mamba, a new student model combining
Mamba-2 and an MLP block, and a variant called
Hybrid-Phi-Mamba that retains some attention layers from the
teacher model.
● MOHAWK can train Phi-Mamba and Hybrid-Phi-Mamba to achieve
performance close to the teacher model. Phi-Mamba is distilled
with only 3B tokens - less than 1% of the data used to train the
previously best-performing Mamba models, and around 2% of that
used to train Phi-1.5 itself.

Either way, the transformer continues to reign supreme (for now)


Work with transformer alternatives and hybrid models is interesting, but at this stage remains niche. One
paradigm still seems to rule them all.

[Chart: Transformers vs other paradigms - 74% vs 22%]

📝→📝
Synthetic data starts gaining more widespread adoption…
Last year’s report pointed to the divide of opinion around synthetic data, with some finding it useful and others
fearing its potential to trigger model collapse by compounding errors. Opinion seems to be warming.
● As well as being the main source of training data for the Phi family, synthetic data was used by Anthropic when
training Claude 3 to help represent scenarios that might have been missing in the training data.
● Hugging Face used Mixtral-8x7B Instruct to generate over 30M files and
25B tokens of synthetic textbooks, blog posts, and stories to recreate the
Phi-1.5 training dataset, which they dubbed Cosmopedia.
● To make this process easier, NVIDIA released the Nemotron-4-340B
family, a suite of models designed specifically for synthetic data
generation, available via a permissive license. Meta’s Llama can also be
used for synthetic data generation.
● It also appears possible to create synthetic high-quality instruction data by
extracting it directly from an aligned LLM, with techniques like Magpie. Models
fine-tuned this way sometimes perform comparably to Llama-3-8B-Instruct.

📝→📝
…but Team Model Collapse isn’t going down without a fight
As model builders motor ahead, researchers have focused on trying to assess if there’s a tipping point in the
quantity of synthetic data that triggers these kinds of outcomes, and if any mitigations work.
● A Nature paper from Oxford and Cambridge researchers found model collapse occurs across various AI
architectures, including fine-tuned language models, challenging the idea that pre-training or periodic
exposure to small amounts of original data can prevent degradation (measured by Perplexity score).
● This creates a “first mover advantage”, as sustained access to
diverse, human-generated data will become increasingly critical
for maintaining model quality.
● However, these results are primarily focused on a scenario where
real data is replaced with synthetic data over generations. In
practice, real and synthetic data usually accumulate.
● Other research suggests that, provided the proportion of synthetic
data doesn’t get too high, collapse can usually be avoided.

📝→📝
Web data is decanted openly at scale - proving quality is key 🍷
Team Hugging Face built a 15T-token dataset for LLM pre-training using 96 CommonCrawl snapshots, which
produces LLMs that outperform those trained on other open pre-training datasets. They also released an instruction manual.
● FineWeb, the dataset, was created through a multi-step process including base filtering, independent
MinHash deduplication per dump, selected filters derived from the C4 dataset, and the team’s custom filters.
● The text extraction using the trafilatura library produced higher quality data than default CommonCrawl
WET files, even though the resulting dataset was meaningfully smaller.

● They found deduplication drove performance improvements up to a
point, before hitting diminishing returns and then worsening
performance (a MinHash deduplication sketch follows below).
● The team also used llama-3-70b-instruct to annotate 500k samples
from FineWeb, scoring each for educational quality on
a scale from 0 to 5. FineWeb-edu, which filtered out samples scored
below 3, outperformed FineWeb and all other open datasets,
despite being significantly smaller.
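A sketch of MinHash-based near-deduplication of the kind FineWeb runs per dump, using the datasketch library. The shingle size and threshold here are illustrative, not FineWeb's exact settings.

```python
# Near-duplicate filtering with MinHash + LSH (datasketch library).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(1, len(words) - shingle + 1)):
        m.update(" ".join(words[i:i + shingle]).encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bend",
}
lsh = MinHashLSH(threshold=0.7, num_perm=128)
kept = []
for key, text in docs.items():
    m = minhash(text)
    if not lsh.query(m):   # no near-duplicate kept so far
        lsh.insert(key, m)
        kept.append(key)
print(kept)  # ['a'] - "b" is dropped as a near-duplicate
```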
📝→📝
Retrieval and embeddings hit the center stage
While retrieval and embeddings are not new, growing interest in retrieval augmented generation (RAG) has
prompted improvements in the quality of embedding models (a minimal retrieval sketch follows below).
● Following the playbook that’s proven effective in regular LLMs, massive performance improvements have come
from scale (GritLM has ~ 47B parameters vs the 110M common among prior embedding models).
● Similarly, the usage of broad web scale corpora and improved
filtering methods have led to large improvements in the smaller
models.
● Meanwhile, ColPali is a vision-language embedding model that
exploits the visual structure of documents, not just their text
embeddings, to improve retrieval.
● Retrieval models are one of the few subdomains where open
models commonly outperform proprietary models from the biggest
labs. On the MTEB Retrieval Leaderboard, OpenAI’s embedding
model ranks 29th, while NVIDIA’s open NV-Embed-v2 is top.
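To ground the terminology: dense retrieval reduces to embedding texts and ranking by cosine similarity. A tiny sketch using the sentence-transformers package; the model name is illustrative, and any MTEB-ranked open embedding model slots in the same way.

```python
# Dense retrieval in a few lines: embed, normalize, rank by dot product.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open embedder
corpus = [
    "NVIDIA's NV-Embed-v2 tops the MTEB retrieval leaderboard.",
    "ColPali embeds document page images, not just extracted text.",
]
doc_vecs = model.encode(corpus, normalize_embeddings=True)

query = model.encode(["which model leads MTEB retrieval?"],
                     normalize_embeddings=True)
scores = doc_vecs @ query.T          # cosine similarity (unit vectors)
print(corpus[int(np.argmax(scores))])
```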

📝→📝
Context proves a crucial driver of performance
Traditional RAG solutions usually involve creating text snippets 256 tokens at a time with sliding windows (each
chunk overlapping the prior one by 128 tokens; sketched below). This makes retrieval more efficient, but significantly less accurate.
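A minimal sketch of that traditional chunking scheme, assuming token ids from any BPE tokenizer (e.g. tiktoken):

```python
# Fixed-size chunking with a sliding window: 256-token snippets with a
# 128-token stride, so each chunk overlaps half of the previous one.
def chunk(token_ids: list[int], size: int = 256, stride: int = 128):
    """Yield overlapping windows of token ids."""
    for start in range(0, max(1, len(token_ids) - size + stride), stride):
        yield token_ids[start:start + size]

windows = list(chunk(list(range(600))))
print([(w[0], w[-1]) for w in windows])  # (0, 255), (128, 383), ...
```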
● Anthropic solved this using ‘contextual embeddings’, where a prompt instructs the model to generate text
explaining the context of each chunk in the document.
● They found that this approach leads to a 35% reduction in the
top-20 retrieval failure rate (5.7% → 3.7%).
● It can then be scaled using Anthropic’s prompt caching.
● As Fernando Diaz of CMU observed in a recent thread, this is a
great example of techniques pioneered in one area of AI
research (e.g. early speech retrieval and document expansion
work) being applied to another. Another version of “what is
new, is old”.
● Research from Chroma shows that the choice of chunking
strategy can affect retrieval performance by up to 9% in recall.
📝→📝
Evaluation for RAG remains unsolved
Many commonly used RAG benchmarks are repurposed retrieval or question answering datasets. They don’t
effectively evaluate the accuracy of citations, the importance of each piece of text to the overall answer, or the
impact of conflicting points of information.
● Researchers are now pioneering novel approaches, like Ragnarök, which introduces a novel web-based arena
for human evaluation through pairwise system comparisons. This addresses the challenge of assessing RAG
quality beyond traditional automated metrics.
● Meanwhile, Researchy Questions provides a large-scale collection of complex, multi-faceted questions that
require in-depth research and analysis to answer, drawn from real user queries.


Frontier labs face up to the realities of the power grid and work on mitigations
As compute clusters grow larger, they become harder to build and maintain. Clusters require high-bandwidth,
low-latency connections and are sensitive to device heterogeneity. Researchers see potential in alternatives.
● Google DeepMind has proposed Distributed Low-Communication (DiLoCo), an optimization algorithm that
allows training to occur on multiple loosely connected “islands” of devices.
● Each island performs a large number of local update steps before communicating with the others, reducing
the need for frequent data exchange. They demonstrate performance matching fully synchronous optimization
across 8 of these islands while communicating 500x less (a schematic round is sketched below).
● GDM also proposed a refined version of DiLoCo, optimized for asynchronous settings.
● Researchers at Prime Intellect released an open-source implementation and replication of DiLoCo, while
scaling it up 3x, to demonstrate its effectiveness on 1B parameter models.
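A schematic of one DiLoCo round, under stated simplifications: the paper uses AdamW for the inner steps and Nesterov momentum for the outer step, whereas this sketch just averages parameter deltas; `island.local_step` is a hypothetical stand-in for an ordinary training step.

```python
# One DiLoCo-style round: many local steps per island, one communication.
import copy
import torch

def diloco_round(islands, global_params, inner_steps=500, outer_lr=0.7):
    deltas = []
    for island in islands:                  # runs in parallel in practice
        local = copy.deepcopy(global_params)
        for _ in range(inner_steps):
            island.local_step(local)        # inner updates, zero comms
        deltas.append([g - l for g, l in zip(global_params, local)])
    # A single exchange per `inner_steps` local steps: the ~500x saving.
    for i, g in enumerate(global_params):
        avg_delta = torch.stack([d[i] for d in deltas]).mean(dim=0)
        g -= outer_lr * avg_delta           # outer "optimizer" step
    return global_params
```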

📝→📝
Could better data curation methods reduce training compute requirements?
Data curation is an essential part of effective pre-training, but is often done manually and inefficiently. This is
both hard to scale and wasteful, especially for multimodal models.
● Usually, an entire dataset is curated upfront, which doesn’t account for how the relevance of a training
example can change over the course of learning; methods applied before training cannot adapt to changing
needs during it.
● Google DeepMind’s JEST selects entire batches of data jointly,
rather than individual examples independently. The selection is
guided by a ‘learnability score’ (determined by a pre-trained
reference model) which evaluates how useful a batch will be for
training (sketched below). JEST integrates data selection directly
into the training process, making it dynamic and adaptive.
● JEST uses lower-resolution image processing for both data
selection and part of the training, significantly reducing
computational costs while maintaining performance benefits.
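The learnability idea is compact enough to sketch. Actual JEST scores sub-batches jointly (via blocked Gibbs sampling) rather than this greedy per-example top-k, and `learner_loss` / `reference_loss` are assumed callables returning a scalar loss per example; treat this as a simplification.

```python
# JEST-flavoured selection: keep examples that are hard for the current
# learner but easy for a pre-trained reference model ("learnable").
import torch

@torch.no_grad()
def select_batch(examples, learner_loss, reference_loss, keep_ratio=0.2):
    l_learner = torch.tensor([learner_loss(x) for x in examples])
    l_ref = torch.tensor([reference_loss(x) for x in examples])
    learnability = l_learner - l_ref   # high = useful to train on now
    k = max(1, int(keep_ratio * len(examples)))
    idx = torch.topk(learnability, k).indices
    return [examples[i] for i in idx]
```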

📝+🖼→📝
Chinese (V)LLMs storm the leaderboards despite sanctions
Models produced by DeepSeek, 01.AI, Zhipu AI, and Alibaba have achieved strong spots on the LMSYS
leaderboard, displaying particularly impressive results in math and coding.
● The strongest models from Chinese labs are competitive with the second-most powerful tier of frontier models
produced by US labs, while challenging the SOTA on certain subtasks.
● These labs have prioritized computational efficiency to
compensate for constraints around GPU access, learning to stretch
their resources much further than their US peers.
● Chinese labs have different strengths. For example, DeepSeek has
pioneered techniques like Multi-head Latent Attention to reduce
memory requirements during inference and an enhanced MoE
architecture.
● Meanwhile, 01.AI has focused less on architectural innovation and
more on building a strong Chinese language dataset to
compensate for its relative paucity in popular repositories like
Common Crawl.
📝+🖼→📝
And Chinese open source projects win fans around the world
To drive international uptake and evaluation, Chinese labs have become enthusiastic open source contributors.
A few models have emerged as strong contenders in individual sub-domains.
● DeepSeek has emerged as a community favorite on coding tasks, with deepseek-coder-v2 prized for its
combination of speed, lightness, and accuracy.
● Alibaba released the Qwen-2 family recently, and the community has
been particularly impressed by its vision capabilities, ranging from
challenging OCR tasks to its ability to analyse complex artwork.
● At the smaller end, the NLP lab at Tsinghua University has funded
OpenBMB, a project that has spawned MiniCPM.
● These are small models (<2.5B parameters for the base LMs) that can run
on-device. Their 2.8B vision model is only marginally behind GPT-4V on
some metrics, while their 8.5B Llama 3-based model surpasses it on some.
● Tsinghua University’s Knowledge Engineering Group has also created
CogVideoX - one of the most capable text to video models.
📝+🖼→📝
VLMs achieve SOTA performance out-of-the-box
The first State of AI Report in 2018 detailed the painstaking efforts of researchers who tried to teach models
common sense scene understanding by creating datasets of millions of labelled videos. Every major frontier
model builder now offers vision capabilities out of the box. Even smaller models, from the low hundreds of millions
to single-digit billions of parameters, like Florence-2 from Microsoft or LongVILA from NVIDIA, can achieve remarkable results.
Allen Institute for AI’s open source Molmo can hold its own against the larger, proprietary GPT-4o.

[Side-by-side figure: 2018 vs 2024]

📝→🖼
Diffusion models for image generation become more and more sophisticated
Moving on from diffusion models for text-to-image, Stability AI have continued to search for refinements that
increase quality while bringing about greater efficiency.
● Adversarial diffusion distillation speeds up image generation by reducing the sampling steps needed to create
high-quality images from potentially hundreds down to 1-4, while maintaining high fidelity.
● It combines adversarial training with score distillation: the model is trained just
using a pre-trained diffusion model as a guide.
● As well as unlocking single-step generation, the authors focused on reducing
computational complexity and improving sampling efficiency.
● Rectified flow improves upon traditional diffusion methods by connecting data
and noise through a direct, straight line, rather than a curved path.
● They combined this with a novel transformer-based architecture for
text-to-image that allows for a bidirectional flow of information between text
and image components. This enhances the model's ability to generate more
accurate and coherent high-resolution images based on textual descriptions
(a rectified-flow training step is sketched below).
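The rectified-flow objective itself is a few lines. A generic sketch, not Stability AI's implementation; `model` is an assumed callable taking the noised batch and the flattened timesteps.

```python
# Rectified flow: interpolate data and noise along a straight line and
# train the network to predict the (constant) velocity along that line.
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0: torch.Tensor) -> torch.Tensor:
    noise = torch.randn_like(x0)
    t = torch.rand(x0.size(0), *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1 - t) * x0 + t * noise    # straight-line interpolation
    v_target = noise - x0             # constant velocity along the line
    v_pred = model(x_t, t.flatten())  # assumed signature: (x_t, t)
    return F.mse_loss(v_pred, v_target)

# Sampling starts from pure noise at t=1 and integrates dx/dt = v_pred
# back to t=0, e.g. with a handful of Euler steps - far fewer than the
# hundreds needed by classic diffusion samplers.
```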
📝→🎥
Stable Video Diffusion marks a step forward for high-quality video generation…
Stability AI released Stable Video Diffusion, one of the first models capable of generating high-quality, realistic
videos from text prompts, along with a significant step up in customizability. The team took a three-stage
approach to training: i) image pre-training on a large text-to-image dataset, ii) video pre-training on a large,
curated low-resolution video dataset, and iii) fine-tuning on a smaller, high-resolution video dataset. In March, they followed up
with Stable Video 3D, fine-tuned on a 3D object dataset to predict 3D orbits.

📝→🎥
…leading the big labs to release their own gated text-to-video efforts
Both Google DeepMind and OpenAI have given us sneak previews of highly powerful text-to-video diffusion
models. But access remains heavily gated and neither has supplied much technical detail.
● OpenAI’s Sora is able to generate videos up to a minute long, while maintaining 3D consistency, object
permanence, and high resolution. It uses spacetime patches, similar to the tokens used in transformer models,
but for visual content, to learn efficiently from a vast dataset of videos.
● Sora was also trained on visual data in its native size and aspect
ratio, removing the usual cropping and resizing that reduces quality.
● Google DeepMind’s Veo combines text and optional image prompts
with a noisy compressed video input, processing them through
encoders and a latent diffusion model to create a unique compressed
video representation.
● The system then decodes this representation into a final
high-resolution video.
● Also in the fight are Runway’s Gen-3 Alpha, Luma’s Dream Machine,
and Kling by Kuaishou.
📝→🎥
Meta goes even further, throwing audio into the mix
Keeping to the gated approach of other labs, Meta has brought together its work on different modalities via the
Make-A-Scene and Llama families to build Movie Gen.
● The core of Movie Gen is a 30B video generation and a 13B audio generation model, capable of producing
16-second videos at 16 frames per second and 45-second audio clips respectively.
● These models leverage joint optimization techniques for text-to-image and text-to-video tasks, as well as
novel audio extension methods for generating coherent audio for videos of arbitrary lengths.
● Movie Gen's video editing capabilities combine advanced image
editing techniques with video generation, allowing for both localized
edits and global changes while preserving original content.
● The models were trained on a combination of licensed and publicly
available datasets.
● Meta used A/B human evaluation comparisons to demonstrate positive
net win rates against competing industry models across their four
main capabilities. The researchers say they intend to make the model
available in future, but don’t commit to a timeline or release strategy.

AI gets en-Nobel-ed
In a sign that AI has truly come of age as both a scientific discipline and a tool to accelerate science, the Royal
Swedish Academy of Sciences awarded Nobel Prizes to OG pioneers in deep learning, alongside the architects of
its best-known application (so far) in science. The news was celebrated by the entire field.

📝→🧬
AlphaFold 3: going beyond proteins and their interactions with other biomolecules
DeepMind and Isomorphic Labs released AlphaFold 3, the successor to AF2, which can now model how small-molecule
drugs, DNA, RNA and antibodies interact with protein targets.
● There were substantial and surprising algorithmic changes
from AF2: all equivariance constraints were removed in
favor of simplicity and scale, while the Structure Module
was replaced with a diffusion model to build the 3D
coordinates.
● Unsurprisingly, the researchers claim that AF3 performs
exceptionally well in comparison to other methods (esp.
for small-molecule docking), although it was not compared
against stronger baselines.
● Notably, no open-source code was made available (yet).
Several independent groups are working on reproducing
the work openly.

📝→🧬
…starting a race to become the first to reproduce a fully functioning AlphaFold3 clone
The decision to not release code for the AF3 publication was highly controversial, with many blaming Nature.
Politics aside, there has been a race by start-ups and AI communities to make their model the go-to alternative.
● The first horse out of the gate was Baidu with their HelixFold3 model, which was comparable to AF3 for
ligand binding. They provide a web server and their code is fully open-sourced for non-commercial use.
● Chai Discovery (backed by OpenAI) recently released Chai-1, a
molecular structure prediction model that has taken off in
popularity due to its performance and high-quality
implementation. The web server is also available for
commercial drug discovery use.
● We are still waiting for a fully open-sourced model with no
restrictions (e.g. using outputs for training of other models).
● Will DeepMind fully release AF3 sooner if they begin to fear
alternative models are becoming the community’s favourite?

📝→🧬
AlphaProteo: DeepMind flexes new experimental biology capabilities
The secretive protein design team at DeepMind finally “came out of stealth” with their first model, AlphaProteo, a
generative model able to design sub-nanomolar protein binders with 3- to 300-fold better affinities than prior methods.
● While few technical details were given, it seems it was built on top of
AlphaFold3 and is likely a diffusion model. ‘Hotspots’ on the target
epitope can also be specified.
● The model was able to design protein binders with 3- to 300-fold
better binding affinities than previous works (e.g. RFDiffusion).
● The “dirty secret” of the protein design field is that the in silico filtering
is just as important as (if not more important than) the generative modelling,
with the paper suggesting that AF3-based scoring is key.
● They also use their confidence metrics to screen a large number of
possible novel targets for which future protein binders could be
designed.

📝→🧬
The Bitter Lesson: Equivariance is dead…long live equivariance!
Equivariance is the idea of giving a model the inductive biases to natively handle rotations, translations and
(sometimes) reflections. It has been at the core of Geometric Deep Learning and biomolecular modelling
research since AlphaFold 2. However, recent works by top labs have questioned the existing mantra.
● The first shots were fired by Apple, with a paper that obtained SOTA
results on predicting the 3D structures of small molecules using a
non-equivariant diffusion model with a transformer encoder.
● Remarkably, the authors showed that using the domain-agnostic model
did not deleteriously impact generalization and was consistently able
to outperform specialist models (assuming sufficient scale was used).
● Next was AlphaFold 3, which infamously dropped all the equivariance
and frame constraints from the previous model in favour of another
diffusion process coupled with augmentations and, of course, scale.
● Regardless, the greatly improved training efficiency of equivariant
models means the practice is likely to stay for a while (at least for
academic groups working on large systems such as proteins).
📝→🧬
Scaling frontier models of biology: EvolutionaryScale’s ESM3
Since 2019, Meta had been publishing transformer-based language models (Evolutionary Scale Models) trained
on large-scale amino acid and protein databases. When Meta terminated these efforts in 2023, the team founded
EvolutionaryScale. This year, they released ESM3, a frontier multimodal generative model trained on the
sequences, structures and functions of proteins rather than sequences alone.
● The model is a bidirectional transformer that fuses tokens
that represent each of the three modalities as separate tracks
into a single latent space.
● Unlike traditional masked language modelling, ESM3’s
training process uses a variable masking schedule, exposing
the model to diverse combinations of masked sequence,
structure, and function. ESM3 learns to predict completions
for any combination of modalities.
● ESM3 was prompted to generate new green fluorescent
proteins (GFP) with low sequence similarity to known ones.

📝→🧬
Language models that learn to design human genome editors
We previously profiled how LLMs (e.g. ProGen2) pre-trained on large and diverse datasets of natural protein
sequences could be used to design functional proteins with vastly different sequences to their natural peers.
Now, Profluent has finetuned ProGen2 on their CRISPR-Cas Atlas to generate functional genome editors with
novel sequences that, importantly, were shown to edit the DNA of human cells in vitro for the first time.
● The CRISPR-Cas Atlas consists of >1M diverse CRISPR-Cas operons,
including various effector systems, that were mined from 26.2
terabases of assembled microbial genomes and metagenomes,
spanning diverse phyla and biomes.
● Generated sequences are 4.8x more diverse than natural proteins
from the CRISPR-Cas Atlas. The median identity to the nearest
natural protein typically fell between 40-60%.
● A model fine-tuned on Cas9 proteins generated novel editors
that were then validated in human cells. One such editor offered
the best editing performance, with 71.7% sequence similarity to
SpCas9, and was open-sourced as OpenCRISPR-1.
📝+ 🖼 → 📝
Yet, evals and benchmarking in BioML remains poor
The fundamental problem with research at the intersection of biology and ML is that there are very few people
with the skills to both train a frontier model and give it a rigorous biological appraisal.
● Two works from late 2023, PoseCheck and PoseBusters, showed that ML models for molecule generation and
protein-ligand docking gave structures (poses) with gross physical violations.
● Even the AlphaFold3 paper didn’t get away without a few bruises,
when Inductive Bio showed that a slightly more advanced
conventional docking pipeline beat AF3.
● A new industry consortium led by Valence Labs, including major
pharma companies (e.g. Recursion, Relay, Merck, Novartis, J&J,
Pfizer), is developing Polaris, a benchmarking platform for
AI-driven drug discovery. Polaris will provide high-quality
datasets, facilitate evaluations, and certify benchmarks.
● Meanwhile, Recursion’s work on perturbative map-building led
them to create a new set of benchmarks and metrics.
📝+ 🖼 → 📝
Foundation models across the sciences: inorganic materials
To determine the properties of physical materials and how they behave under reactions, it is necessary to run
atomic-scale simulations that today rely on density functional theory. This method is powerful, but slow and
computationally expensive. While faster, alternative approaches that calculate force fields (interatomic potentials)
tend to have insufficient accuracy to be useful, particularly for reactive events and phase transitions.
● In 2022, equivariant message passing neural networks (MPNN) combined with
efficient many-body messages (MACE) were introduced at NeurIPS.
● Now, the authors present MACE-MP-0, which uses the MACE architecture and
is trained on the Materials Project Trajectory dataset, which contains millions
of structures, energies, magnetic moments, forces and stresses.
● The model reduces the number of message passing layers to two by
considering interactions involving four atoms simultaneously, and it only uses
nonlinear activations in selective parts of the network.
● It is capable of molecular dynamics simulation across a wide variety of
chemistries in the solid, liquid and gaseous phases.
📝+ 🖼 → 📝
Expanding the protein function design space: challenging folds and soluble analogues
Characterising and generating structures for proteins that are not found in soluble form but are in membrane
environments is challenging and hinders the development of drugs meant to target membrane receptors. So too
is the design of protein folds that are large and include non-local topologies. Can AF2 and sequence models
remedy this and give drug designers access to a larger soluble proteome with previously inaccessible folds?
● To do so, the authors first use an inverted AF2 model that
generates an initial sequence given a target fold structure.
These sequences are then optimised by ProteinMPNN before
structures are re-predicted by AF2 followed by filtering on
the basis of structure similarity to the target structure.
● This AF2-MPNN pipeline was tested on three challenging
folds: IGF, BBF and TBF, which have therapeutic utility.
● It was also possible to generate soluble analogues of
membrane-only folds which could massively speed up drug
discovery targeting membrane-bound receptor proteins.
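A minimal sketch of the generate-optimise-validate loop described above. Every function here is a hypothetical stub standing in for the real tools (inverted AF2, ProteinMPNN, AF2 re-prediction, structural scoring), not their actual interfaces.

```python
import random

# All functions below are hypothetical stubs, not real package APIs.
def af2_hallucinate_sequence(target):
    """Stub for the 'inverted' AF2 step: propose a sequence for a target fold."""
    return "".join(random.choices("ACDEFGHIKLMNPQRSTVWY", k=100))

def protein_mpnn_redesign(seq, target):
    """Stub for ProteinMPNN sequence optimisation."""
    return seq

def af2_predict_structure(seq):
    """Stub for AF2 structure re-prediction."""
    return {"seq": seq}

def tm_score(predicted, target):
    """Stub for structural similarity scoring (e.g. a TM-score-like metric)."""
    return random.random()

def design_soluble_fold(target_structure, n_candidates=100, cutoff=0.8):
    accepted = []
    for _ in range(n_candidates):
        seq = af2_hallucinate_sequence(target_structure)    # 1. propose
        seq = protein_mpnn_redesign(seq, target_structure)  # 2. optimise
        predicted = af2_predict_structure(seq)              # 3. re-predict
        if tm_score(predicted, target_structure) >= cutoff: # 4. filter
            accepted.append(seq)
    return accepted

print(len(design_soluble_fold("target_fold")))
```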
📝+ 🖼 → 📝
Foundation models for the mind: learning brain activity from fMRI
Deep learning, originally inspired by neuroscience, is now making inroads into modelling the brain itself. BrainLM is a foundation model built on 6,700 hours of human brain activity recordings generated by functional magnetic resonance imaging (fMRI), which detects changes in blood oxygenation (left figure). The model learns to reconstruct masked spatiotemporal brain activity sequences and, importantly, it can generalise to held-out distributions (right figure). The model can be fine-tuned to predict clinical variables, e.g. age, neuroticism, PTSD, and anxiety disorder scores, better than a graph convolutional model or an LSTM.

📝+ 🖼 → 📝
Foundation models across the sciences: the atmosphere
Classical atmospheric simulation methods like numerical weather prediction are costly and unable to make use of diverse and often scarce atmospheric data modalities. But foundation models are well suited here. Microsoft
researchers created Aurora, a foundation model that produces forecasts for a wide range of atmospheric
forecasting problems such as global air pollution and high-resolution medium-term weather patterns. It can also
adapt to new tasks by making use of a general-purpose learned representation of atmospheric dynamics.
● The 1.3B model is pre-trained on >1M hours of weather
and climate data from 6 datasets, including forecasts,
analysis data, reanalysis data, and climate simulations.
● The model encodes heterogeneous inputs into a standard
3D representation of the atmosphere across space and
pressure-levels, which is evolved over time at inference by
a vision transformer and decoded into specific predictions.
● Importantly, it is the first model to predict atmospheric chemistry (6 major air pollutants, e.g. ozone, carbon
monoxide), which involves hundreds of stiff equations, better than numerical models. The model is also 5,000x
faster than the Integrated Forecasting System that uses numerical forecasting.
📝+ 🖼 → 📝
Foundation models for the mind: reconstructing what you see
MindEye2 is a generative model that maps fMRI activity to a rich CLIP space from which images of what the individual sees are reconstructed using a fine-tuned Stable Diffusion XL. The model is trained on the Natural Scenes Dataset, an fMRI dataset built from 8 subjects whose brain responses were captured over 30-40 hours of scanning sessions as they looked at hundreds of rich naturalistic stimuli from the COCO dataset, each shown for 3 seconds.

📝+ 🖼 → 📝
Speaking what you think
Decoding speech from brain recordings with implantable microelectrodes could enable communication for
patients with impaired speech. In a recent case, a 45-year-old man with amyotrophic lateral sclerosis (ALS) with
tetraparesis and severe motor speech damage underwent surgery to implant microelectrodes into his brain. The
arrays recorded neural activity as the patient spoke in both prompted and unstructured conversational settings.
At first, cortical neural activity was decoded into a small vocabulary of 50 words with 99.6% accuracy by
predicting the most likely English phoneme being attempted. Sequences of phonemes were combined into words
using an RNN, before moving to a larger 125,000-word vocabulary enabled by further training.
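As a toy illustration of the decoding step, the sketch below greedily collapses per-frame phoneme predictions into a phoneme string and matches it against a vocabulary. The real system uses an RNN and a language model rather than this greedy scheme; the phoneme set and "probabilities" here are invented for the example.

```python
# Toy greedy decoding: per-timestep phoneme probabilities -> word.
import numpy as np

PHONEMES = ["_", "HH", "AH", "L", "OW"]        # "_" = blank/silence (toy set)
VOCAB = {("HH", "AH", "L", "OW"): "hello"}     # toy vocabulary lookup

def greedy_decode(prob_matrix):
    ids = prob_matrix.argmax(axis=1)           # most likely phoneme per frame
    collapsed, prev = [], None
    for i in ids:
        if i != prev and PHONEMES[i] != "_":   # drop repeats and blanks
            collapsed.append(PHONEMES[i])
        prev = i
    return VOCAB.get(tuple(collapsed), "<unk>")

T = [0, 1, 1, 2, 0, 3, 3, 4, 4, 0]             # toy frame-level phoneme ids
probs = np.eye(len(PHONEMES))[T]               # one-hot "probabilities"
print(greedy_decode(probs))                    # -> "hello"
```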


A new challenge aims to refocus the industry on the path to AGI


François Chollet, the creator of Keras, has partnered with Zapier co-founder Mike Knoop to launch the ARC prize,
offering a $1M prize fund for teams that make significant progress on the ARC-AGI benchmark
● Chollet created the benchmark back in 2019 as a means of measuring models’ ability to generalize, focusing on tasks that are easy for humans and hard for AI. The tasks require minimal prior knowledge and emphasise visual problem-solving and puzzle-like challenges to make them resistant to memorization (an illustrative task in the benchmark’s format is shown below).
● Historically, LLMs have performed poorly on the benchmark, with
performance peaking at about 34%.
● Chollet is sceptical of LLMs’ ability to generalize to new problems outside
of their training data and is hoping the prize will encourage new research
directions that will lead to a more human-like form of intelligence.
● The highest score so far is 46% (short of the 85% target). It’s been achieved by
the Minds AI team, who have used an LLM-based approach, employing
active inference, fine-tuning the LLM on test task examples and expanding
it with synthetic examples to improve performance.
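For readers unfamiliar with the benchmark, ARC-AGI tasks are distributed as JSON objects containing train/test pairs of small colour grids. The toy task below is hand-made for illustration (it is not from the official set); the hidden rule is a left-right flip.

```python
# Illustrative ARC-style task (hand-made, not from the official dataset).
# Each grid is a list of rows of colour indices 0-9; a solver sees the "train"
# pairs and must produce the "test" output. Hidden rule: flip each row.
task = {
    "train": [
        {"input": [[1, 0, 0],
                   [0, 2, 0]],
         "output": [[0, 0, 1],
                    [0, 2, 0]]},
        {"input": [[3, 3, 0],
                   [0, 0, 5]],
         "output": [[0, 3, 3],
                    [5, 0, 0]]},
    ],
    "test": [{"input": [[7, 0, 0]]}],  # expected output: [[0, 0, 7]]
}

def solve(grid):
    return [list(reversed(row)) for row in grid]

# A correct solver reproduces every training pair before answering the test.
assert all(solve(ex["input"]) == ex["output"] for ex in task["train"])
print(solve(task["test"][0]["input"]))  # [[0, 0, 7]]
```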
📝→🛠
LLMs still struggle with planning and simulation tasks
On novel tasks, where LLMs are unable to rely on memory and retrieval, performance often degrades. This
suggests that they still often struggle to generalize beyond familiar patterns without external help.
● Even advanced LLMs like GPT-4 have difficulty reliably simulating state transitions in text-based games, especially
for environment-driven changes. Their inability to consistently grasp causality, physics, and object permanence makes them poor world-modellers, even on relatively straightforward tasks.
● Researchers found that LLMs accurately predict direct action results, like a
sink turning on, around 77% of the time, but struggle with environmental
effects, such as water filling a cup in the sink, achieving only 50% accuracy for
these indirect changes.
● Other research evaluated LLMs on planning domains, including Blocksworld and Logistics. GPT-4 produced executable plans 12% of the time. However, using iterative prompting with external verification (sketched below), Blocksworld plans hit 82% accuracy and Logistics plans 70% accuracy after 15 rounds of feedback. When re-run with o1, performance jumped but was still far from perfect.
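A minimal sketch of that iterative prompting loop: propose a plan, check it with an external verifier (the papers use a plan validator such as VAL), and feed the error back. Both stubs below are hypothetical stand-ins, not the authors' code.

```python
# Hedged sketch of generate-verify-refine planning with external feedback.

def llm_propose_plan(task, feedback=None):
    """Stand-in for an LLM call; a real system would prompt a model here."""
    return ["pickup(a)", "stack(a, b)"]  # dummy plan

def verify_plan(task, plan):
    """Stand-in for an external verifier such as VAL; returns (ok, feedback)."""
    return ("stack(a, b)" in plan, "precondition of stack(a, b) unmet")

def plan_with_feedback(task, max_rounds=15):
    feedback = None
    for _ in range(max_rounds):
        plan = llm_propose_plan(task, feedback)
        ok, feedback = verify_plan(task, plan)
        if ok:
            return plan          # verified executable plan
    return None                  # feedback budget spent

print(plan_with_feedback("blocksworld-instance-1"))
```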
📝→📝
Can LLMs learn to think before they speak?
Researchers are exploring methods to generate stronger internal reasoning processes, variously targeting both
training and inference. The latter approach appears to underpin OpenAI o1’s jump in capabilities.
● Quiet-STaR from a joint Stanford-Notbad AI team generates internal rationales during pre-training, using a
parallel sampling algorithm and custom meta-tokens to mark the beginning and end of these "thoughts."
● The approach employs a reinforcement learning-inspired technique to optimize the usefulness of generated
rationales, rewarding those that improve the model's ability to predict future tokens.
● Meanwhile, Google DeepMind have targeted inference, showing that for
many types of problems, strategically applying more computation at test
time can be more effective than using a much larger pre-trained model.
● A Stanford/Oxford team have also looked at scaling inference compute, finding that repeated sampling can significantly improve coverage. They suggest that using weaker and cheaper models with many attempts can outperform single attempts from their stronger and more expensive peers (a minimal sketch follows below).
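The repeated-sampling recipe is simple enough to sketch in a few lines: draw many samples from a cheap model and keep any that pass an automatic check (unit tests, a proof checker, an answer matcher). The stubs below are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of repeated sampling (best-of-N) with a verifier.
import random

def cheap_model_sample(prompt):
    """Stand-in for sampling a weaker/cheaper LLM at temperature > 0."""
    return {"answer": random.choice([40, 41, 42, 43])}

def verifier(prompt, sample):
    """Stand-in for unit tests, a proof checker, or an answer matcher."""
    return sample["answer"] == 42

def solve_by_repeated_sampling(prompt, budget=100):
    for i in range(budget):
        sample = cheap_model_sample(prompt)
        if verifier(prompt, sample):
            return sample, i + 1   # solved, and how many attempts it took
    return None, budget

solution, attempts = solve_by_repeated_sampling("what is 6 * 7?")
print(solution, attempts)
```

Coverage is then the fraction of problems for which at least one of the N samples passes, which is the quantity the team found improves significantly with more attempts.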
📝→🛠
Open-endedness gathers momentum as a promising research direction
One path to improving the robustness of LLM reasoning is to embrace an open-ended approach, such that models are capable of generating new knowledge.
● In a position paper, a Google DeepMind team framed open-ended systems as able to “continuously generate
artifacts that are novel and learnable to an observer”.
● They outline potential paths towards open-ended foundation models, including reinforcement learning,
self-improvement, task generation, and evolutionary algorithms.
● On the self-improvement front, we saw STRATEGIST, a method for
allowing LLMs to learn new skills for multi-agent games.
● The researchers used a bi-level tree search approach, combining
high-level strategic learning with low-level simulated self-play for
feedback. It outperformed RL and other LLM-based approaches on
Game of Pure Strategy and The Resistance: Avalon at action planning
and dialogue generation.

📝→📝
But were implicit reasoning capabilities staring us in the face the whole time?
After prolonged training beyond the point of overfitting (known as grokking), some researchers have argued that
transformers learn to reason over parametric knowledge through composition and comparison tasks.
● Researchers at Ohio State University argued that a fully grokked transformer outperformed then-SOTA models
like GPT-4-Turbo and Gemini-1.5-Pro on complex reasoning tasks with a large search space.
● They conducted mechanistic analyses to understand the internal workings of the models during grokking,
revealing distinct generalizing circuits for different tasks.
● However, they found that while fully grokked models performed
well on comparison tasks (e.g. comparing attributes based on
atomic facts), they were less good at out-of-distribution
generalization in composition tasks.
● This raises questions about whether these are really meaningful
reasoning capabilities versus memorization by another name,
although the researchers believe that enhancing the transformer
with better cross-layer memory sharing could resolve this.
📝 → </>
Program search unlocks new discoveries in the mathematical sciences
Drawing on a combination of LLMs and evolutionary algorithms, FunSearch uses an LLM to generate and modify
programs, guided by an evaluation function that scores the quality of solutions. Searching for programs rather
than direct solutions allows it to discover concise, interpretable representations of complex objects or strategies.
This form of program search is one of the avenues that Chollet believes has the most potential to solve the ARC
challenge. The Google DeepMind team applied it to the cap set problem in extremal combinatorics and online bin packing. In both cases, FunSearch discovered novel solutions that surpassed human-designed approaches.
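A heavily simplified sketch of the FunSearch pattern: keep a small population of programs, let a model propose edits to the best one, and score candidates with a hand-written evaluator. Here a random digit-mutator stands in for the LLM and the "task" is trivially recovering f(x) = 2x, purely so the loop runs end to end.

```python
# Hedged sketch of FunSearch-style evolutionary program search.
import random

def evaluate(program_src):
    """Score a candidate program on a toy task (bigger is better)."""
    try:
        scope = {}
        exec(program_src, scope)            # define candidate f(x)
        return sum(scope["f"](x) == 2 * x for x in range(10))
    except Exception:
        return -1                           # broken programs score worst

def llm_mutate(program_src):
    """Stand-in for an LLM proposing an edited program: swap one digit."""
    return program_src.replace(
        str(random.randint(0, 9)), str(random.randint(0, 9)), 1)

def funsearch(seed="def f(x):\n    return 0 * x\n", iters=200):
    population = [(evaluate(seed), seed)]
    for _ in range(iters):
        _, parent = max(population)         # exploit the best program so far
        child = llm_mutate(parent)
        population.append((evaluate(child), child))
        population = sorted(population, reverse=True)[:10]  # keep top 10
    return max(population)

best_score, best_program = funsearch()
print(best_score)
print(best_program)
```

Searching over program text rather than raw answers is what makes the discovered solutions compact and human-interpretable.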

📝→🛠
RL drives improvements in VLM performance…
For agents to be useful, they need to be robust to real-world stochasticity, which SOTA models have historically
struggled with. We’re beginning to see signs of progress.
● DigiRL is a novel autonomous reinforcement learning approach for training in-the-wild device control agents
specifically for Android devices. The method involves a two-stage process: offline reinforcement learning
followed by offline-to-online reinforcement learning.
● It achieves a 62.7% task success rate on the Android-in-the-Wild
dataset, a significant improvement on the prior SOTA.

📝→🛠
…while LLMs improve RL performance
In 2019, Uber published Go-Explore, an RL agent that solved hard-exploration problems by archiving discovered
states and iteratively returning to and exploring from promising ones. In 2024, LLMs are supercharging it.
● Intelligent Go-Explore (IGE) uses an LLM to guide state selection, action choice, and archive updating, rather than the original Go-Explore’s hand-crafted heuristics, enabling more flexible and intelligent exploration in complex environments (see the sketch after this list).
● This approach also allowed IGE to recognize and capitalize on
promising discoveries, a key aspect of open-ended learning systems.
● It significantly outperformed other LLM agents on mathematical
reasoning, grid worlds, and text-based adventure games.
● Switching from GPT-4 to GPT-3.5 resulted in a significant
performance drop across all environments, suggesting that IGE's
performance scales with the capabilities of the underlying language
model.
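The Go-Explore pattern with an LLM in the loop can be sketched compactly: maintain an archive of interesting states, and let the model pick where to return to, what to try, and what counts as novel. The environment and all model calls below are random stubs, not IGE's implementation.

```python
# Hedged sketch of the Go-Explore archive loop with LLM-guided decisions.
import random

def llm_choose(prompt, options):
    """Stand-in for an LLM judgement call (state selection, action choice)."""
    return random.choice(options)

def env_step(state, action):
    """Stand-in environment: returns (new_state, reward, done)."""
    return state + [action], random.random(), False

def intelligent_go_explore(initial_state, iterations=50):
    archive = [initial_state]   # discovered states worth revisiting
    for _ in range(iterations):
        state = llm_choose("Which archived state looks most promising?", archive)
        action = llm_choose("What should we try from here?",
                            ["left", "right", "use", "talk"])
        new_state, reward, done = env_step(state, action)
        if llm_choose("Is this new state interestingly novel?", [True, False]):
            archive.append(new_state)   # archive update, also LLM-guided
    return archive

print(len(intelligent_go_explore([])))
```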


Who remembers Monte Carlo Tree Search?


To improve planning, approaches like MCTS, which helped to power AlphaGo, are slowly returning to the fore.
Early results are promising, but will they be enough?
● MultiOn and Stanford combined an LLM with MCTS, along with a self-criticism mechanism and direct preference optimization, to learn from both successful and failed trajectories.
● They found this improved Llama-3 70B’s zero-shot performance from 18.6% to 81.7% in real-world booking
scenarios, after a day of data collection, and up to 95.4% with online search.
● The longer-term question will be whether
next-token prediction loss is too
fine-grained.
● This risks limiting the ability of RL and
MCTS to achieve agentic behavior by
focusing too much on individual tokens
and hindering the exploration of broader,
more strategic solutions.

📝→🛠
Could foundation models make it easier to train RL agents at scale?
One of the big bottlenecks for training RL agents is a shortage of training data. Standard approaches like
converting pre-existing environments (e.g. Atari) or manually building them are labor-intensive and don’t scale.
● Genie (winner of a Best Paper award at ICML 2024) is a world model that can generate action-controllable
virtual worlds. It analyzed 30,000 hours of video game footage from 2D platformer games, learning to
compress the visual information and infer the actions that drive changes between frames.
● By learning a latent action space from video data, it can handle action representations without requiring explicit action labels, which distinguishes it from other world models (a toy sketch of the idea follows after this list).
● Genie is both able to imagine entirely new interactive scenes and
demonstrate significant flexibility: it can take prompts in various
forms, from text descriptions to hand-drawn sketches, and bring
them to life as playable environments.
● This approach demonstrated applicability beyond games, with the
team successfully applying the hyperparameters from the game
model to robotics data, without fine tuning.
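A toy sketch of the core idea behind a vector-quantised latent action space, as mentioned above: embed consecutive frames and snap the change between them to the nearest entry of a small discrete codebook. The encoder and codebook here are random stand-ins, not Genie's learned components.

```python
# Toy sketch of inferring a discrete latent "action" between two frames.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 discrete latent "actions", dim 4

def embed(frame):
    """Stand-in for a learned frame encoder."""
    return frame.reshape(-1)[:4]

def infer_latent_action(frame_t, frame_t1):
    delta = embed(frame_t1) - embed(frame_t)          # what changed?
    dists = np.linalg.norm(codebook - delta, axis=1)  # compare to codebook
    return int(np.argmin(dists))                      # nearest code's index

frame_a = rng.normal(size=(8, 8))
frame_b = rng.normal(size=(8, 8))
print(infer_latent_action(frame_a, frame_b))  # e.g. action 3
```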
📝→🛠
Could foundation models make it easier to train RL agents at scale?
Imperial and UBC’s OMNI-EPIC used LLMs to create a theoretically endless stream of RL tasks and environments
to help agents build upon previously learned skills. The system generates executable Python code that can
implement simulated environments and reward functions for each task, and employs a model to assess whether
newly generated tasks are sufficiently novel and complex.

Are scientists inventing their AI replacement?


New lab Sakana AI has been focused on attempting to enhance the creative capabilities of current frontier
models. One of their first papers looks at using foundation models to automate research itself.
● The AI Scientist is an end-to-end framework designed to automate the generation of research ideas,
implementation, and the production of research papers.
● After being given a starting template, it brainstorms novel research directions, before executing the
experiments, and writing them up. The researchers claim their LLM-powered reviewer evaluates the generated
papers with near-human accuracy.
● The researchers used it to generate example papers about diffusion, language modeling, and grokking. These were convincing at first glance, but contained some flaws on closer examination.
● Yet, the system periodically displayed signs of unsafe
behavior, e.g. importing unfamiliar Python libraries and
editing code to extend experiment timelines.


An ensemble approach appears to drive strong performance improvements in code


Meta’s TestGen-LLM combines multiple LLMs, prompts and configurations to leverage different models’ strengths
to improve unit testing coverage for Android code on Instagram and Facebook.
● It uses an "assured" approach, filtering generated tests to ensure they build successfully, pass reliably, and increase coverage before recommending them (sketched below). This is the first large-scale industrial deployment of an approach that combines LLMs with verifiable guarantees of code improvement, addressing concerns about LLM hallucinations and reliability in a software engineering context.
● In deployment, TestGen-LLM improved about 10% of test classes it was applied to, with 73% of its
recommendations accepted by developers.
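The assured pipeline sketched above reduces to a chain of hard filters. All helpers below are hypothetical stubs for the generation, build, flakiness, and coverage checks Meta describes; only candidates surviving every filter reach a developer.

```python
# Hedged sketch of TestGen-LLM-style "assured" test filtering.

def generate_candidates(class_under_test):
    """Stand-in for querying multiple LLMs/prompts/configurations."""
    return [f"test_case_{i}" for i in range(8)]

def builds(test):
    return hash(test) % 2 == 0          # stub: does the test compile?

def passes_reliably(test, runs=5):
    return True                         # stub: no flakiness across runs?

def adds_coverage(test, baseline):
    return hash(test) % 3 == 0          # stub: are new lines hit?

def assured_test_generation(class_under_test, baseline_coverage=0.7):
    recommended = []
    for test in generate_candidates(class_under_test):
        if not builds(test):
            continue                    # filter 1: must build
        if not passes_reliably(test):
            continue                    # filter 2: must pass, repeatedly
        if not adds_coverage(test, baseline_coverage):
            continue                    # filter 3: must add coverage
        recommended.append(test)        # only then shown to developers
    return recommended

print(assured_test_generation("FeedUnitTest"))
```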

📝+ 🖼 → 📝
Self-driving embraces more modalities
Wayve’s LINGO-2 is the second generation of its vision-language-action model that, unlike its predecessor, can
both generate real-time driving commentary and control a car, linking language explanations directly with
decision-making and actions. Meanwhile, the company is using generative models to enhance its simulator with
more real-world detail. PRISM-1 creates realistic 4D simulations of dynamic driving scenarios using only camera
inputs. It enables more effective testing and training by accurately reconstructing complex urban environments,
including moving elements like pedestrians, cyclists, and vehicles, without relying on LiDAR or 3D bounding
boxes.


Segment Anything gets boosters and expands to video


Last year, Meta’s Segment Anything impressed with its ability to identify and segment objects in images given
any prompt. In July, they released Segment Anything 2 (SAM 2), which stunned observers.
● Meta has extended SAM to include video segmentation, training it on their own dataset (SA-V) of 51,000
real-world videos and 600,000 spatio-temporal masks. This dataset has been made available, along with the
model, under an Apache 2.0 license.
● To build a unified model that works for both video and images,
Meta have made some adaptations. For example, they have included a
memory mechanism to track objects across frames and an occlusion
head to handle objects that disappear or reappear.
● They find it is more accurate and 6x faster than SAM 1 at image
segmentation, while able to surpass the accuracy of prior leading video
segmentation models with 3x fewer interactions.
● The model is, however, less efficient at segmenting multiple objects
simultaneously in video and can struggle with longer clips.

Robotics (finally) becomes fashionable (again) as the big labs pile in


LLMs and VLMs demonstrate their potential to help resolve data bottlenecks and overcome longstanding usability hurdles.

📝+ 🖼+ 🤖 → 📝
Google DeepMind quietly emerges as a robotics leader
Despite all eyes being on Gemini, the Google DeepMind team has steadily been increasing its robotics output,
improving the efficiency, adaptability, and data collection of robots.
● The team created AutoRT, a system that uses a VLM for environmental understanding and an LLM to suggest a
list of creative tasks the robot could carry out. These models are then combined with a robot control policy.
This helps to scale up deployment quickly in previously unseen environments.
● RT-Trajectory enhances robotic learning through video input. For each
video in the dataset of demonstrations, a 2D sketch of the gripper
performing the task is overlaid. This provides practical visual hints to the model as it learns.
● The team have also improved the efficiency of transformers. SARA-RT is
a novel ‘up-training’ method to convert pre-trained or fine-tuned robotic
policies from quadratic to linear attention, while maintaining quality.
● Researchers have found Gemini 1.5 Pro’s multimodal capabilities and
long context window make it an effective way of interacting with
robots via natural language.
📝+ 🖼+ 🤖 → 📝
Hugging Face pulls down barriers to entry
Historically, robotics had significantly fewer open source datasets, tools, and libraries than other areas of AI - creating an artificially high barrier to entry. Hugging Face’s LeRobot aims to bridge the gap, hosting pretrained models and datasets of human-collected demonstrations. And the community’s loving it.

📝+ 🖼+ 🤖 → 📝
Diffusion models drive improvements in policy and action generation
Well-established in image and audio generation, diffusion models continue to demonstrate their effectiveness in
generating complex action sequences in robotics.
● A number of research groups are aiming to bridge the gap between high-dimensional observation and
low-dimensional action spaces in robot learning. They create a unified representation that allows the learning
algorithm to understand the spatial implications of actions.
● Diffusion models excel at modeling these kinds of complex, non-linear multimodal distributions, while their iterative denoising process allows for the gradual refinement of actions or trajectories (see the sketch after this list).
● There are multiple ways of attacking this. Researchers at Imperial and
Shanghai Qizhi Institute have opted for RGB images, which offer rich visual
information and compatibility with pre-trained models.
● Meanwhile, a team at UC Berkeley and Stanford have leveraged point
clouds, for their explicit 3D information.
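A schematic of a diffusion policy at inference time: start from Gaussian noise over a short action trajectory and iteratively denoise it, conditioned on an observation. The update below is a simplified gradient-style step rather than an exact DDPM/DDIM sampler, and the denoiser is a toy stand-in for the trained network.

```python
# Schematic sketch of a diffusion policy's denoising loop (not exact DDPM).
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(actions, obs, t):
    """Stand-in for the learned denoiser epsilon_theta(a_t, obs, t)."""
    return actions - obs   # toy: pull the trajectory toward the observation

def sample_actions(obs, horizon=16, action_dim=2, steps=50, eta=0.1):
    actions = rng.normal(size=(horizon, action_dim))  # pure noise to start
    for t in reversed(range(steps)):
        eps = predict_noise(actions, obs, t)
        actions = actions - eta * eps                 # one denoising step
    return actions                                    # refined trajectory

obs = np.array([0.5, -0.2])
trajectory = sample_actions(obs)
print(trajectory[:3])
```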

📝+ 🖼+ 🤖 → 📝
Can we stretch existing real-world robotics data further than we currently do?
Robotics policies have often been hampered by a lack of generalizability, due to limited real-world data. Rather than finding more data, researchers are injecting more structure and knowledge into what we already have.
● One approach, outlined by a Carnegie Mellon team, involves learning more “affordance” information from human video data, such as hand poses, object interactions, and contact points.
● This information can then be used to finetune existing visual representations to make them more suitable for robotic tasks. This consistently improved performance on real-world manipulation tasks.
● Meanwhile, a Berkeley/Stanford team found that chain-of-thought
reasoning could have a similar impact.
● Rather than just predicting actions directly, the enhanced models are
trained to reason step-by-step about plans, sub-tasks, and visual features
before deciding on actions.
● This approach uses LLMs to generate training data for the reasoning steps.

📝+ 🖼+ 🤖 → 📝
Can we overcome the data bottleneck for humanoids?
It’s challenging to model the intricacies of human behavior with imitation learning, which relies on human
demonstrators. While effective, it’s difficult to implement at scale. Stanford has some workarounds.
● HumanPlus is a full-stack system for humanoids to learn from human data. It combines a real-time shadowing
system and an imitation learning algorithm.
● The shadowing system uses a single RGB camera and a low-level policy to
allow human operators to control the humanoid's whole body in real-time.
This low-level control policy is trained on a large dataset of human motion
data in simulation and transfers to the real world without additional
training.
● The imitation learning component enables efficient learning of autonomous
skills from shadowing data. It uses binocular egocentric vision and
combines action prediction with forward dynamics prediction.
● The system demonstrates impressive results on a variety of tasks, including
complex actions like wearing a shoe and walking, using only up to 40
demonstrations.
📝+ 🖼+ 🤖 → 📝
Back with a vengeance: robot doggos 🐶
Boston Dynamics' Spot showcased progress in mobility and stability for embodied AI but lacked manipulation
skills. Researchers are now addressing this gap. A Stanford/Columbia team combined real-world demonstration
data with simulation-trained controllers to focus on controlling the robot's gripper movement rather than
individual joints. This approach simplifies transferring manipulation skills from stationary arms to mobile robots.
Meanwhile, a UC San Diego team developed a two-part system: a low-level policy for executing commands and a
high-level policy for generating visual-based commands, enhancing the robot's manipulation capabilities.


The Apple Vision Pro emerges as the must-have robotics research tool
While consumer demand for the Vision Pro has been lacklustre so far, it’s taking robotics research by storm, where its high-res displays, advanced tracking, and processing power are being leveraged by researchers working on teleoperation -
controlling robot movements and actions at a distance. Systems like Open-TeleVision and Bunny-Vision Pro use
it to help enable precise control of multi-finger robotic hands (at a 3000 mile distance in the case of the former),
demonstrating improved performance on complex manipulation tasks compared to previous approaches. They
address challenges such as real-time control, safety through collision avoidance, and effective bimanual
coordination.

📝→🖼
To finetune or not to finetune (in medicine)?
Last year, a non-finetuned GPT-4 via one API call was highly competitive with Google’s Med-PaLM2 on certain
medical knowledge benchmarks. Gemini has ridden to the rescue.
● The Med-Gemini family of multimodal models for medicine are finetuned from Gemini Pro 1.0 and 1.5 using
various medical datasets and incorporate web search for up-to-date information. They achieved SOTA 91.1%
accuracy on MedQA, surpassing GPT-4.
● For multimodal tasks (e.g. in radiology and pathology), Med-Gemini set a new SOTA on 5 out of 7 datasets.
● When quality errors in questions were fixed, model performance improved, and it exhibited strong reasoning across other benchmarks. It also achieved high precision
and recall in retrieving rare findings in lengthy EHRs - a
challenging "needle-in-a-haystack" task.
● In a preliminary study, clinicians rated Med-Gemini's
outputs equal or better than human-written examples in
most cases.

📝→🖼
Generating synthetic data in medicine
High-quality medical imaging datasets are hard to come by and, even then, hard to license for research or commercial products. They are also not immune to distributional shifts. And yet, realistic image generators have flooded the
internet in the last year. Could these be repurposed to generate realistic medical images that are useful for
model training, despite the large visual and semantic differences between natural images and medical images?
● By jointly fine-tuning both the U-Net and the CLIP text encoder from Stable Diffusion on a large dataset of real chest x-rays (CXR) and corresponding radiologist reports, it is possible to generate synthetic
CXR scans with high fidelity and conceptual correctness as evaluated
by board-certified radiologists.
● Generated CXRs can be used for data augmentation and
self-supervised learning.
● Consistent with other modalities, supervised classification performance drops slightly when training on purely synthetic data.
● Moreover, generative models can improve fairness of medical classifiers by enriching
training datasets with synthetic examples that fill out underrepresented data points.

Enterprise automation set to get an AI-first upgrade


Traditional Robotic Process Automation (RPA), embodied by UiPath, has struggled with high set-up costs, brittle
execution, and burdensome maintenance. Two novel approaches, FlowMind (JP Morgan) and ECLAIR (Stanford),
use foundation models to address these limitations. FlowMind focuses on financial workflows, using LLMs to
generate executable workflows via APIs. In experiments on the NCEN-QA dataset, FlowMind achieved 99.5%
accuracy in workflow understanding. ECLAIR takes a broader approach, using multimodal models to learn from
demonstrations and interact directly with graphical user interfaces across various enterprise settings. On web
navigation tasks, ECLAIR improved completion rates from 0% to 40%.


The global balance of power in AI research remains unchanged, but academia gains
As AI emerges as the new competitive battleground, big tech companies begin to hold more details of their work
close to their chest. Frontier labs have meaningfully cut publication levels for the first time since this report
began, while academia gets into gear.

Charts: proportion of AI publications by country; year-on-year change in AI publication levels.


Section 2: Industry


NVIDIA becomes the world’s most powerful company…


Amid surging demand for hardware to power gen AI workloads, every major lab depends on NVIDIA. Its market cap hit $3T in June, only the third US company to reach this milestone
(following Microsoft and Apple). Following blowout earnings in Q2, its position looks as unassailable as ever.

Chart annotation: ChatGPT launches.

…and its ambitions are only growing


NVIDIA has already booked significant pre-sales on its new Blackwell family of GPUs and is making a strong play
for governments.
● The new Blackwell B200 GPU and GB200 Superchip promise significant performance gains over the Hopper architecture of H100 fame. NVIDIA claims it can reduce cost and energy consumption by up to 25x over an H100. In a
mark of NVIDIA’s power, every major AI lab CEO provided a supporting quote in the press release.
● While the Blackwell architecture was delayed by
manufacturing issues, the company is still confident of
booking several billion in revenue from it by the end of
the year.
● Jensen Huang, NVIDIA’s Founder and CEO, is expanding
the pitch, outlining the company’s vision of sovereign AI.
● He has argued that every government needs to build its
own LLM to preserve its national heritage. You’ll never
guess whose hardware he thinks is optimal for the task…

Established competitors fail to narrow the gap


AMD and Intel have started to invest in their software ecosystems, while AMD has made a heavy pitch to the
open source community using ROCm (its CUDA competitor). However, they are yet to develop compelling
alternatives to NVIDIA’s portfolio of networking solutions. AMD is hoping its planned $4.9B acquisition of server
builder ZT Systems will change this. Meanwhile, Intel has seen its hardware sales decline. Short of regulatory
intervention, a change in research paradigm or supply constraints, NVIDIA’s position seems unassailable.

Charts: Q2 2024 revenue; Q2 2024 earnings per share.


Buying NVIDIA stock would’ve been far better than investing in its start-up contenders
We looked at the $6B invested in AI chip challengers since 2016 and asked what would have happened if
investors had just bought the equivalent amount of NVIDIA stock at that day’s price. The answer is lime green:
that $6B would be worth $120B of NVIDIA stock today (20x!) vs. the $31B (5x) in its startup contenders.

Chart legend: $ invested; NAV of challengers.
Note: Market pricing and valuation data retrieved as of 9 Oct 2024. NAV = net asset value.

But not everyone believes the line can only go up


A vocal minority of analysts and commentators aren’t convinced. They point to the decline in GPU scarcity, how
only a few companies are currently generating reliable revenue from AI-first offerings, and how even Big Tech’s
infrastructure build-out is unlikely to be big enough to justify the company’s current valuation. The market is
currently ignoring these voices and seems more inclined to agree with early Tesla investor James Anderson’s
view that the company could be worth “double-digit trillions” in a decade.


Compute Index: NVIDIA A100 clusters


The number of large-scale NVIDIA A100 GPU clusters has stayed constant as the industry focuses its dollars on
the H100 and shinier Blackwell systems…more on the next slide!


Compute Index: NVIDIA H100 clusters (while GB200’s are loading…)


The real large-scale GPU cluster growth has come from H100s. The largest continues to be Meta’s 350k H100s,
followed by xAI’s 100k cluster and Tesla’s 35k. Meanwhile, Lambda, Oracle and Google have been building large
clusters summing over 72k H100s. Companies including Poolside, Hugging Face, DeepL, Recursion, Photoroom and Magic have built over 20k H100s’ worth of capacity. Moreover, the first GB200 clusters are going live
(e.g. 10,752 at the Swiss National Supercomputing Centre), while OpenAI will have access to 300,000 by the end
of next year.


Compute Index: NVIDIA continues to be the preferred option in AI research papers


By last year’s count, NVIDIA was used 19x more than all of its peers combined in AI research papers (note the
log-scale y-axis!). This year, this lead has compressed to 11x, due in part to the 522% growth in papers that use
TPUs (gap is now 34x with NVIDIA). We also note the 353% growth in the use of Huawei’s Ascend 910, the 61%
growth of large AI chip start-up contenders and the new appearance of Apple’s silicon.


Compute Index: NVIDIA continues to be the preferred option in AI research papers


Usage of A100s continues to grow (+59% YoY) alongside the H100 (+477%) and the 4090 (+262%), albeit from
a much lower base. The V100 (now 7 years old, -20%), continues to be used at half the rate of the A100 (now 4
years old), further demonstrating the longevity of NVIDIA systems for AI research.


Compute Index: AI chip start-ups


Meanwhile in start-up land, Cerebras appears to be pulling ahead of the pack with 106% growth in the number of AI research papers that make use of its wafer-scale systems. Groq, which launched its LPU recently, saw its first usage in AI research papers last year. Meanwhile, Graphcore was acquired by SoftBank in mid-2024. Unlike their common enemy, NVIDIA, these AI chip start-ups have mostly pivoted from selling systems to offering inference interfaces on top of open models.


More TFLOPs: NVIDIA compresses its product release timelines


Ever since the A100 launch in 2020, NVIDIA has been cutting down the time to ship its next datacenter GPU
while significantly increasing the TFLOPs they deliver. In fact, timelines have come down by 60% from A100 to
H100 and down a further 80% from H200 to GB200. During that time, TFLOPs have gone up 6x. Large cloud
companies are buying huge amounts of these GB200 systems: Microsoft between 700k - 1.4M, Google 400k and
AWS 360k. OpenAI is rumored to have at least 400k GB200 to itself.


Scaling up and out with faster connections between GPUs and nodes
The speed of data communication between GPUs within a node (scale-up fabric), as well as between nodes
(scale-out fabric), is critical to large-scale cluster performance. NVIDIA’s technology for the former, NVLink, has seen bandwidth per link, the number of links, and the total number of GPUs connected per node increase significantly over the last 8 years. Coupled with its InfiniBand technology for connecting nodes into large-scale
clusters, NVIDIA is ahead of the pack. Meanwhile, Chinese companies like Tencent have reportedly innovated
around sanctions for similar outcomes. Their Xingmai 2.0 high-performance computing network, which is said to
support over 100,000 GPUs in a single cluster, improves network communication efficiency by 60% and LLM
training by 20%. That said, it is not clear whether Tencent possesses clusters of this size.


But running large clusters continues to be an art and a science of interruptions


On publishing their Llama 3 family of models, Meta shared a breakdown of the 8.6 job interruptions per day
they experienced during a 54-day period of pre-training Llama 3 405B. GPUs tend to experience failures more
frequently than CPUs and all clusters are by no means created equal. Continuous monitoring is essential,
misconfigurations and dead-on arrival components happen too often due to insufficient testing, and low-cost
power, affordable networking rates and availability are paramount. More on power needs in the Politics section!


Big labs seek to weaken their NVIDIA addiction


While big tech companies have long produced their own hardware, these efforts are accelerating as they seek to
at least improve their bargaining power with NVIDIA - but these aren’t tackling the most challenging workloads.
● Known for its TPUs, Google has unveiled the Axion, built on the Armv9 architecture and instruction set. These
will be made available through Google Cloud for general-purpose workloads and achieve 30% better performance
than the fastest general-purpose Arm-based instances currently available.
● Meta has unveiled the second generation of its in-house AI
inference accelerator, which more than doubles the compute and
memory bandwidth of its predecessor. The chip is currently used
for ranking and recommendation algorithms, but Meta plans to
expand its capabilities to cover training for generative AI.
● Meanwhile, OpenAI has been hiring from Google’s TPU team and
is in talks with Broadcom about developing a new AI chip.
● Sam Altman has also reportedly been in talks with major
investors, including the UAE government, about a multi-trillion dollar
initiative to boost chip production.

And a handful of challengers demonstrate signs of traction


Riding the NVIDIA tidal wave, AI chip challengers are fighting for a slice of the (VC and customer) pie.
● Cerebras, known for its Wafer-Scale Engine, which integrates an entire supercomputer's worth of compute
onto one wafer-sized processor, has filed to IPO on $136M in revenue for H1 2024 (up 15.6x YoY), 87% of
which came from Abu Dhabi-based and state-backed G42.
● The company has raised over $700M with customers in the compute-intensive energy and pharma sectors. It
recently launched an inference service to serve LLMs with faster token generation.
● Meanwhile, Groq raised a $640M Series D at a $2.8B valuation for its
Language Processing Unit, designed solely for AI inference tasks.
● So far, Groq has landed partnerships with Aramco, Samsung, Meta, and green
compute provider Earth Wind & Power.
● Both companies are focusing on speed as a core differentiator and are
working on cloud services, with Cerebras recently launching an inference service.
● This helps them swerve NVIDIA’s software ecosystem advantage, but gives
them a new (challenging) competitor in the form of cloud services providers.

While SoftBank starts to build its own chip empire (after prematurely selling NVIDIA)
Known for betting big, SoftBank is entering the arena, tasking subsidiary Arm with launching its first AI chips in
2025 and acquiring struggling UK start-up Graphcore for a rumoured $600-700M.
● Arm is already a player in the AI world, but historically, its instruction set architecture has not been optimal for
the large-scale parallel processing infrastructure required for datacenter training and inference. It’s also
struggled against NVIDIA’s entrenched data center business and mature software ecosystem.
● With a current market cap of over $140B, markets aren’t bothered. The company is
reportedly already in talks with TSMC and others about manufacturing.
● SoftBank also scooped up Graphcore, which pioneered the Intelligence Processing Unit (IPU), a processor designed to handle AI workloads more efficiently than GPUs and CPUs using small volumes of data. Despite its sophistication, the hardware was often not a logical choice for genAI applications as they took off.
● The company will operate semi-autonomously under the Graphcore brand.
● Meanwhile, SoftBank’s talks with Intel on designing a GPU challenger stalled after
they were unable to agree on requirements.

The US Commerce Department plays whack-a-mole with chip manufacturers…


As US export controls widen, previously sanctions-compliant chips have found themselves on the wrong side of
tougher performance thresholds. That hasn’t deterred chip manufacturers.
● In last year’s report, we documented how NVIDIA had booked over $1B in sales of the A800/H800 (their
special China-compliant chip) to major Chinese AI labs. The US then banned sales to China, forcing a rethink.
● US Commerce Secretary Gina Raimondo has warned that “if you
redesign a chip around a particular cut line that enables [China] to do
AI I’m going to control it the very next day”.
● NVIDIA’s new China chip, the H20, is theoretically significantly
weaker than top-line NVIDIA hardware, if you measure by raw
computing power. However, NVIDIA have optimised it for LLM
inference workloads, meaning it is now 20% faster than the H100
on inference tasks. NVIDIA are set to book $12B in sales.
● However, China is proportionally becoming less important to US chip manufacturers. It’s gone from representing 20% of NVIDIA’s data center business to “mid-single digits”, according to NVIDIA.

…but opts not to restrict the use of hardware by Chinese labs in US data centers
While Chinese labs face restrictions in their ability to import hardware, there are currently no controls on their
local affiliates renting access to it overseas. ByteDance rents access to NVIDIA H100s via Oracle in the US, while
Alibaba and Tencent are reportedly in conversations with NVIDIA about setting up their own US-based
datacenters. Meanwhile, Google and Microsoft have directly pitched big Chinese firms on their cloud offerings.
The US is planning to make hyperscalers report this kind of usage via a KYC scheme, but is yet to draw up plans
to prohibit it.


Small-scale no more: Semiconductor smugglers get increasingly sophisticated


There is a growing number of increasingly sizeable NVIDIA chip sales to Chinese end customers via Asian
intermediary dealers (particularly Malaysia, Hong Kong and Japan) who facilitate trades using shell companies
with fictitious business presences and even temporary data centers.
● In one case, a Chinese electric appliance company placed a $120 million order for a 2,400 NVIDIA H100 cluster
via a Malaysian broker. Given the size of the order, NVIDIA mandated an in-person check to ensure proper
installation of the system.
● The broker told The Information, which reported on this event, that he had “coordinated the rental, installation
and activation of the servers in a spare data center facility in Johor Bahru, a Malaysian town adjacent to the
Singapore border and home to large clusters of data centers. NVIDIA inspectors checked the servers there and left.
Shortly afterward, the servers were whisked away to China via Hong Kong.”
● Another Hong Kong-based chip broker has accumulated 4,800 restricted H100s via purchases from Dell and Supermicro using shell companies based in non-US-sanctioned countries. These were sold for $230M to a Chinese buyer, a substantial premium on their acquisition cost of $180M.


But where’s the revenue…?


Many of the buzziest start-ups working on generative AI are raising at record, often three-digit revenue multiples. While these might be an indication of investor confidence in future returns, it sets a high bar, as many
of these companies currently have no identified path to profitability. However, this isn’t true for everyone, as the
biggest model providers see revenue begin to ramp up.


…and where’s the margin?


OpenAI is on course to see revenues triple in the space of a year, but training, inference, and staffing costs mean
losses are continuing to mount. They’re not the only leader in search of functional economics.


Perhaps it’s neither: vibes are all you need (to recover your share price)
Meta has produced an incredible vibe shift in public markets by ditching their substantial metaverse investments
and pivoting hard into open source AI with their Llama models. Mark Zuckerberg is, arguably, the de facto
messiah of open source AI, counterpositioning against OpenAI, Anthropic, and Google DeepMind.

Chart annotations:
1: Oct 28, ‘21: Metaverse investment announced.
2: Nov 9, ‘22: Major layoffs and metaverse tempering.
3: Feb 24, ‘23: Llama 1.
4: Jul 18, ‘23: Llama 2.
5: Apr 18, ‘24: Llama 3.
6: Jul 23, ‘24: Llama 3.1 405B.
7: Sept 25, ‘24: Llama 3.2.
Share price: -68.4% (-$601B) into the trough, then +457% (+$1.134T) since.
tl;dr FAIR + GenAI + Llama saved Meta

The top quality model, OpenAI’s o1, comes at significant price and latency premiums
As the model menu matures, developers are choosing the right tool for the job (and their budget).


Inferencing all the way down: models get cheaper


Once thought to be prohibitively expensive, the cost of serving strong models at inference time is dropping.

Charts (2023 → 2024): a 100x drop and a 60x drop in inference price.


Google Gemini produced a strong model series with very competitive pricing
Prices on Gemini 1.5 Pro and 1.5 Flash have been dropped by 64-86% a few months after launch while offering
strong performance, e.g. Flash-8B is 50% cheaper than 1.5 Flash yet comparable across many benchmarks.

76% cut

86% cut

Note: Pricing for <128k token prompts and outputs. Retrieved 4 Oct 2024.

Chat agents as interactive developer sidekicks…


Over the summer, Anthropic and then Vercel launched the capability for their chat agents Claude and V0 to
open coding environments in which code is written and run in the browser to solve a user’s request. This brings
previously static code snippets to life, enabling users to iterate with the agent in real time, and to reduce the
barrier for creating software products. Needless to say, social media GenAI fans loved this! Below are examples
of Claude Artifacts and V0 generating a playable Minesweeper game from a single prompt.


…as AI labs move from building models to designing products


The most successful technology companies like Apple, Google, or TikTok have taken a product-first approach, rather than simply building a foundational technology and an API. As base model performance converges, OpenAI,
Anthropic, and Meta are visibly putting more thought into what their ‘product’ looks and feels like - whether it’s
Artifacts from Claude, OpenAI’s Advanced Voice functionality, or Meta’s hardware partnerships and lip-syncing
tools. Simply building a good model won’t be all you need.


Les grands modèles catch on, but another European challenger loses steam
European leaders have been desperate to point to a domestic success story as US labs have occupied the
spotlight. For now, Mistral remains the continent’s primary bright spark.
● With over €1B in the bank, Mistral has emerged as the undisputed European foundation model champion,
demonstrating both impressive computational efficiency and multilingual capabilities. Mistral Large, its flagship model, is available via Azure as part of the company’s new partnership with Microsoft.
● The company has started striking partnerships with both French
companies like BNP Paribas and international start-ups like
Harvey AI. The company is also beginning to bulk out its US sales
function.
● Meanwhile, self-styled German ‘sovereign AI’ champions Aleph
Alpha have struggled.
● Despite raising $500M through equity, grants, and licensing
deals, the company’s closed models have underperformed freely
available peers. As a result, the company appears to be pivoting
to licensing Llama 2-3 and DBRX.

Databricks and Snowflake pivot to build their own models…but can they compete?
In last year’s report, we touched on Databricks and Mosaic’s combined LLM strategy, which focused on fine-tuning models on customers’ data. Is the ‘bring your own model’ era over?
● The Mosaic research team, now folded into Databricks, open-sourced DBRX in March. A 132B MoE model,
DBRX was trained on just over 3,000 NVIDIA GPUs at a cost of $10M. Databricks is pitching the model as a
foundation for enterprises to build on and customize, while remaining in control of their own data.
● Meanwhile, Snowflake’s Arctic is pitched as the most efficient model for
enterprise workflows, based on a set of metrics covering tasks including
coding and instruction following.
● It’s unclear how much enterprises are willing to invest in costly custom model tuning, given the constant stream of releases and improvements driven by bigger players.
● With readily available open source frontier models, the appeal of training
custom models is increasingly unclear.


Regulators scrutinize the relationships between key generative AI players…


Given the high compute costs involved, model builders increasingly rely on partnership arrangements with
established Big Tech companies. Antitrust regulators worry that this will further entrench incumbents.
● Regulators have particularly zeroed in on the close relationship between OpenAI and Microsoft, along with
Anthropic’s ties to Google and Amazon.
● Regulators fear that big tech companies are either essentially
buying out competition or providing friendly service provision
deals to companies that they’ve invested in - potentially
disadvantaging competitors.
● They’re particularly nervous about the influence NVIDIA wields
over the ecosystem and its decision to make direct investments.
France is contemplating NVIDIA-specific charges.
● Big Tech companies are attempting to place some clear blue water
between themselves and start-ups, with Microsoft and Apple both
voluntarily surrendering their OpenAI board observer seats.
📝→📝
…leading to the rise of pseudo-acquisitions as an exit strategy
Regulatory action can only do so much to shape a market when economic logic dictates otherwise. Given the converging performance of many of ‘the rest’ and the companies’ high cap-ex needs, consolidation is
unsurprising. Given some of the regulatory hurdles, we’ve seen the rise of pseudo-acquisitions, where a Big Tech
company i) hires the founders and much of the team of a start-up; ii) the start-up exits the model-building game
to focus on its enterprise offer; iii) investors are paid out via a licensing agreement. This model has been used by
Microsoft with Inflection and Amazon with Adept. However, regulators on both sides of the Atlantic have become wise to the move and are beginning to scrutinize these arrangements.


Github reigns supreme, but an ecosystem of AI coding companies is growing


By far the most widely-used AI-powered developer tool, Copilot adoption is growing 180% year-over-year and its
annual revenue run rate is now $2B (double its 2022 figure). Copilot (40% of Github revenue) alone is now a
bigger business than Github was when Microsoft acquired it. However, it’s just one of a number of coding
companies, some of which are raising blockbuster rounds.

Funding chart: $60M, Series A; $196M, Series B; $626M, Series B; $465M, Series C; $68M, Series A; $243M, Series C; $252M, Series A.

Figures are total raised and latest round as of 7 Oct 2024.

ML tools for AI struggle (again)


In a now familiar cycle, we’re seeing specialist tools and frameworks gain popularity before struggling to scale
and enter production, while incumbents demonstrate impressive resilience and adaptability.
● Following the explosive growth of vector databases, the uniqueness of searching in vector space has worn off.
Existing database providers have launched their own vector search methods.
● Hyperscalers like AWS, Azure, and Google Cloud have expanded their native
DB offerings to support vector search and retrieval at scale, while data
clouds like MongoDB, Snowflake, Databricks and Confluent are seeking to
capture RAG workloads from their existing customer base.
● Core vector DB providers like Pinecone and Weaviate now support traditional keyword search of the kind offered by ElasticSearch and OpenSearch, along with simple and efficient filtering and clustering (a toy hybrid-retrieval sketch follows below).
● Over in framework land, the likes of LangChain and LlamaIndex achieved popularity for experimentation, but their high-level abstractions and limited flexibility have been called out as a source of friction by some developers as their needs become more sophisticated.
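A toy sketch of the hybrid retrieval pattern these providers now offer: blend a dense embedding score with a sparse keyword score, and apply a metadata filter first. The hashing "embedding" and overlap score below are stand-ins for a learned embedding model and BM25.

```python
# Toy sketch of hybrid (dense + sparse) search with metadata filtering.
import numpy as np

DOCS = [
    {"id": 1, "text": "fine-tuning llama models", "year": 2024},
    {"id": 2, "text": "vector search at scale",   "year": 2023},
    {"id": 3, "text": "llama inference tricks",   "year": 2024},
]

def embed(text):
    """Toy hashing embedding; stand-in for a learned embedding model."""
    v = np.zeros(16)
    for tok in text.split():
        v[hash(tok) % 16] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def keyword_score(query, text):
    """Crude token overlap; stand-in for BM25-style keyword scoring."""
    q, t = set(query.split()), set(text.split())
    return len(q & t) / max(len(q), 1)

def hybrid_search(query, alpha=0.5, year=None):
    qv = embed(query)
    results = []
    for doc in DOCS:
        if year is not None and doc["year"] != year:
            continue                              # metadata filter
        dense = float(embed(doc["text"]) @ qv)    # cosine similarity
        sparse = keyword_score(query, doc["text"])
        results.append((alpha * dense + (1 - alpha) * sparse, doc["id"]))
    return sorted(results, reverse=True)

print(hybrid_search("llama fine-tuning", year=2024))
```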

Are AI agents going commercial?


While H is being cagey about the specifics of its work, its early team contained experts in reinforcement learning
and multi-agent systems. Other agentic efforts are already up and running.
● Devin, launched by Cognition, made a splash in March. Pitched as “the first AI software engineer”, it is meant to
plan and execute tasks requiring thousands of decisions, while fixing mistakes and learning over time.
● The product itself split users, attracting fans, as well as detractors
who point to the need for guardrails and manual intervention.
Either way, investors are impressed, and within six months of
launch, the company secured a $2B valuation.
● Devin has an open source competitor in OpenDevin, which beat
the proprietary Devin on SWE-bench by 13 percentage points.
● MultiOn is also betting big on RL, with its autonomous web agent - Agent Q (see slide 65) - combining search, self-critique, and RL (a loop sketched below). It will be made available to users later this year.
● Meta's TestGen-LLM has gone from paper to product at breakneck speed (4 months), being integrated into Qodo's Cover-Agent.
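A rough, model-agnostic illustration of the propose-critique-revise loop that agents like Agent Q layer search and RL on top of. The llm() function is a stub standing in for any chat-model call, not MultiOn's actual system:

def llm(prompt: str) -> str:
    # Stub standing in for a chat-model call; swap in a real client here.
    return "OK" if "List flaws" in prompt else "draft plan"

def act_with_self_critique(task: str, max_rounds: int = 3) -> str:
    # Propose a plan, ask the model to critique it, revise until it passes.
    plan = llm(f"Task: {task}\nPropose a step-by-step plan.")
    for _ in range(max_rounds):
        critique = llm(f"Task: {task}\nPlan: {plan}\nList flaws, or reply OK.")
        if critique.strip() == "OK":
            break
        plan = llm(f"Task: {task}\nPlan: {plan}\nCritique: {critique}\nRevise the plan.")
    return plan

print(act_with_self_critique("book a flight from the results page"))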

AI-powered search begins to make a dent, amid teething problems


With $165M raised, Perplexity has emerged as the buzziest AI-first search challenger, while Google is rolling out its own search summaries. Both companies are finding that the output is only as good as the information.
● Within 18 months of being founded, Perplexity hit a $1B valuation, with rumours that it is already looking to potentially triple it. Its LLM analyzes user input, sourcing responses either via a web search or from its knowledge base, before producing a summary with in-line citations (a pattern sketched below).
● Google has rolled out summary boxes to illustrate the potential of Gemini to power up its standard offering.
● Both services, however, have struggled with reliability issues. Gemini was found to be using satirical Reddit posts as advice sources (e.g. advising users to consume a rock a day), while Perplexity struggles with the same hallucination issues that hit other LLM-powered services.
● OpenAI has started trialling a prototype search function - SearchGPT - which will eventually be integrated into ChatGPT. While we don't know technical specifics yet, promotional imagery suggests a Perplexity-esque user experience.
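A minimal sketch of the retrieve-then-cite pattern these products share. Both helper functions below are stubs standing in for a real search API and chat model, and the prompt format is purely illustrative - not anything Perplexity or Google has published:

def web_search(query: str) -> list[dict]:
    # Stub retriever; a real system calls a search index or API here.
    return [{"title": "Example source", "url": "https://example.com", "snippet": "…"}]

def llm(prompt: str) -> str:
    return "…"  # stand-in for a chat-model call

def answer_with_citations(query: str) -> str:
    # Number the retrieved sources so the model can cite them in-line.
    sources = web_search(query)
    context = "\n".join(
        f"[{i}] {s['title']} ({s['url']}): {s['snippet']}"
        for i, s in enumerate(sources, start=1)
    )
    prompt = (
        "Answer using ONLY the numbered sources below, citing in-line like [1].\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)

print(answer_with_citations("what powers AI search summaries?"))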

Industry attitudes to copyright diverge as anger from content creators rises…


While copyright concerns are nothing new in generative AI, 2024 saw model builders come under greater
scrutiny from media organizations, record labels, and content creators.
● OpenAI and Google are negotiating with major media organizations, hoping that licensing arrangements will
take the sting out of criticism. In a similar vein, ElevenLabs has started a voice actor program.
● Some start-ups are swerving this altogether and are embracing ethical
certification schemes. The best-known is Fairly Trained, started by ex-Stability
AI executive Ed Newton-Rex.
● At the other end of the spectrum, Meta and Perplexity have doubled down on
‘fair use’ arguments and have demonstrated little appetite for compromise
with their critics.
● As labs approach the data ceiling, YouTube scraping is in the spotlight.
● OpenAI reportedly transcribed millions of hours of YouTube videos with its Whisper audio transcription model to generate training data. Meanwhile, EleutherAI's widely-used Pile dataset contains subtitles from 173,536 YouTube videos. Internal documents from both RunwayML and NVIDIA showed they mass scraped YouTube.

…while cases jam up the court system and provide little clarity over fair use
The central question of whether model builders infringed creators' copyright by using their work for training remains unresolved, but more expansive arguments have been shot down in the courts.
● Cases continue against Anthropic, OpenAI, Meta, Midjourney, Runway, Udio, Suno, Stability and others from news
outlets, image suppliers, authors, creative artists, and record labels.
● So far, model builders have failed to get any of these cases dismissed in full, but have managed to narrow their
scope significantly.
● For example, claims from two groups of authors against OpenAI and Meta arguing that the companies were guilty of vicarious copyright infringement - because all of their models' outputs are "infringing derivative work" - failed, because they were unable to demonstrate "substantial similarity". Only their original claims on the grounds of direct copyright infringement have been allowed to proceed.
● A similar pruning happened with the cases against Midjourney, Runway, and Stability with plaintiffs told to
focus on the original scraping, with many of their wider claims dismissed.
● Amid this uncertainty, Adobe, Google, Microsoft, and OpenAI have taken the unusual step of indemnifying their
customers against any legal claims they might face on copyright grounds.


The last ones standing: Self-driving companies Wayve and Waymo power ahead
With Wayve unveiling a $1.05B Series C and Waymo scaling across the US, the industry seems to be booming,
after years of hype followed by disappointment.
● Waymo has gradually scaled in San Francisco, Los Angeles, and Phoenix, with a planned Austin launch later this year. The company has now abolished its SF waiting list, opening the service up to anyone.
● As well as raising fresh funding from Softbank, NVIDIA, and
Microsoft, Wayve scored a win when the UK passed legislation
allowing autonomous vehicles to hit the streets in 2026.
● The technology is also beginning to demonstrate commercial
potential. Alphabet has announced an additional $5B of
investment in Waymo, after its “Other Bets” unit, which includes
Waymo, delivered $365 million in quarterly revenue.
● Meanwhile, in August, Waymo announced that it had reached 100,000 paid trips a week in the US, with 300 cars on the road in SF alone.

…but it’s still a risky business


Last year, a Cruise vehicle struck a pedestrian in San Francisco. The company lost its licence to operate in California and saw significant leadership turnover. General Motors, Cruise's historically distant parent, has pumped $850M into the company after previously cutting 25% of the workforce and halting market expansion. The company has resumed testing in Phoenix (with a human in the vehicle) and GM plans to seek external investment. Despite this additional runway, existential questions still loom over the company, signalling the high standards companies operating in the space are held to.


Cash pours into humanoid start-ups…but are they set to be the next self-driving?
Humanoid start-ups like Figure, Sanctuary, and 1X have raised close to a billion dollars from corporate investors,
including Samsung, Microsoft, Intel, OpenAI, and NVIDIA. Can the tech overcome its limitations?
● Replicating the complexity of human motion and engineering human-like dexterity has historically proven to be an expensive and technically difficult endeavor.
● Start-ups are betting that sophisticated VLMs, real-world training data and simulation, along with better
hardware can change this.
● Avid SOAI readers, however, will be familiar with the story of
self-driving - where breakthroughs were promised every year,
before companies undershot for half a decade.
● Customers must also be convinced that humanoids are more
efficient than cheaper, non-humanlike industrial robot systems.
● The appetite for non-humanoid robotics start-ups remains
healthy, despite Amazon’s recent pseudo-acquisition of
Covariant, a Bay Area robotics foundation model builder.

2023 Prediction: A Hollywood-grade production makes use of genAI for visual effects.

Visual effects are an expensive and labor-intensive business, so Hollywood producers have been slowly trying to integrate generative AI, amid a backlash from artists and animators. While much of this work has been done quietly in post-production, eagle-eyed viewers have spotted clear signs of genAI-related mishaps in the background of HBO and Netflix productions. This ties back to long-standing issues around models' ability to represent physics and geometry accurately and consistently. Our prediction never said the output would be good…


…but this work may be about to get professionalized


In the first deal of its kind, Runway has struck a partnership with film and games studio Lionsgate (famous for
films like John Wick, Twilight, and the Hunger Games franchise). Runway will train a new generative model on
Lionsgate’s catalogue of 20,000 titles, while Lionsgate said that it would use Runway’s models to support
“capital-efficient content creation opportunities”. Financial details remain unclear at this stage, but we know that
Lionsgate will initially use the model for storyboarding, before deploying it for the creation of visual effects.


Major labs fragment, with well-funded challengers emerging…


Due to a combination of scientific disagreement, commercial pressures, personality clashes, and availability of capital, small bands of researchers have broken away from the biggest labs, indicating a deepening ecosystem.
● Japan-based Sakana AI, co-founded by David Ha and Llion Jones - famously the last author of Attention Is All You Need to leave Google - emerged from stealth with $30M and three models based on the evolution-inspired approach of 'model merging', where existing models are combined and the most promising become 'parents' to the next generation (see the sketch after this list).
● Paris-based H Company, led by a team of experienced DeepMinders,
raised a $220M round to build action models for RPA.
● Following board drama at OpenAI (more on this later), co-founder Ilya Sutskever left to start Safe Superintelligence Inc., a lab focused on building safe AGI with zero short-term commercial pressures or goals.
● Most recently, a number of the original Stable Diffusion creators launched Black Forest Labs to focus on image and video generation. They've already released FLUX.1, their first family of open source image models, which has rapidly begun to rival Midjourney on quality.
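A toy sketch of evolutionary model merging under heavy simplifying assumptions: single-layer 'models', plain linear interpolation of weights, and a dummy fitness function in place of real benchmark evaluation. Sakana's published method also evolves how layers are combined; this shows only the weight-space half:

import numpy as np

rng = np.random.default_rng(0)

def merge(parent_a, parent_b, t):
    # Interpolate two models' weights layer by layer (the simplest merge).
    return {k: (1 - t) * parent_a[k] + t * parent_b[k] for k in parent_a}

def fitness(model):
    # Dummy score; a real system evaluates merged models on benchmarks.
    return -sum(np.abs(w).mean() for w in model.values())

population = [{"layer0": rng.normal(size=(4, 4))} for _ in range(4)]
for _ in range(5):
    population.sort(key=fitness, reverse=True)
    parents = population[:2]                              # most promising models
    children = [merge(*parents, t=rng.uniform()) for _ in range(3)]
    population = parents + children

best = max(population, key=fitness)
print(f"best fitness: {fitness(best):.3f}")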

…but entrepreneurship is hard


Being a great engineer isn't always a sign you'll be a great founder. Some former staffers at labs have experienced early success, others … less so. Safe Sign Technologies, founded by a former solicitor and an ex-DeepMind researcher, went through an acquisition without the founding team having to dilute to external investors. At the other end of the spectrum, the ex-DeepMind founding team over at H Company couldn't get to launch without disintegrating, even with over $200M in the bank.


Text-to-speech is booming
ElevenLabs, the market leader in text-to-speech (TTS) hit unicorn status at the start of the year, with a $1.1B
valuation. With the big labs approaching the space tentatively, it has much of the field to itself.
● Alongside its flagship TTS product, the company has expanded into dubbing in foreign languages, voice
isolation, and has previewed an early text-to-music model. Likely seeking to avoid a copyright blow-up, the
company has opted not to release it immediately, but has provided an API for sound effect generation.
● 62% of Fortune 500 companies now have at least one employee using
ElevenLabs.
● Meanwhile, the frontier labs have approached the space with caution,
likely out of concern that misuse of voice generation capabilities could
result in a potential backlash.
● GPT-4o’s voice outputs have been restricted to preset voices for general
release, while OpenAI has said it is yet to make a decision on whether it
will ever make its Voice Engine (which can allegedly recreate a voice
based on a 15-second recording) widely available.
● Meanwhile, Cartesia is betting on state space models for efficient TTS.

GenAI applications continue to see fast growth


Avatar video generation product Synthesia continues to grow exponentially across enterprise, small businesses and creators. Once considered "fringe", Synthesia is now used by the majority of the Fortune 100 for learning and development, marketing, sales enablement, information security and customer service. Over 24M videos have been generated with the service since its launch in 2020, 2.5x more than last year.
[Chart: total videos generated per year. 2020 data starts on 1 May; 2024 data stops on 1 Sept.]

AI-first products begin to demonstrate their stickiness in enterprise…


In last year’s report, we charted how GenAI products were struggling to retain paying customers beyond their
initial ‘wow’ effect and trial periods. New data from US corporate fintech Ramp suggests that both spend and
retention is beginning to improve significantly from the 2022 to 2023 cohorts. Top performers include OpenAI,
Grammarly, Anthropic, Midjourney, Otter, and ElevenLabs.

[Charts: user retention over time - 63% vs. 41% for the newer vs. older cohort - and AI product billing by quarter]


…while AI-first challengers scale revenue much quicker than their SaaS peers
Analysis of the 100 highest-grossing AI companies using Stripe reveals that, as a group, they are generating revenue at a much faster pace than previous waves of equivalently well-performing SaaS companies. Strikingly, the average AI company that has reached $30M+ annualised revenue took just 20 months to get there, compared to 65 months for equally promising SaaS companies.

[Charts: months to $1M revenue and months to scale to $30M+ - 20 months for AI cos vs. 65 months for SaaS cos]


Speech recognition finds its commercial feet


While text-to-speech benefits from a 'wow effect', speech recognition has the potential to automate away mundane tasks at scale. Investors are beginning to see that potential.
● A string of start-ups working to use speech recognition for a range of use cases including customer support
and call centers have scored funding rounds in the last year or so, including Assembly AI ($50M), Deepgram
($72M), PolyAI ($50M), Parloa ($66M). PolyAI’s revenue is set to triple this year.
● These start-ups are focused on plugging shortages of call center staff and
allowing for more natural speech from customers, including corrections,
hesitation, interruption, and topic changes - areas traditional automated
systems have struggled with.
● While AI-powered transcription and audio analysis isn't new, accuracy is improving as a result of larger datasets and transformer models.
● For example, Assembly AI has built Universal-1, a multilingual model trained on 12.5M hours of speech that runs faster, with less compute, fewer errors and better ambient noise reduction than OpenAI's Whisper (baseline usage sketched below).
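For reference, running the open source Whisper baseline that these models are benchmarked against takes a few lines (the audio file name is a placeholder):

import whisper  # pip install openai-whisper

model = whisper.load_model("base")        # small, CPU-friendly checkpoint
result = model.transcribe("meeting.mp3")  # placeholder file path
print(result["text"])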

The next (uncanny) frontier: speech-to-speech?


For more than a decade, Alexa and Siri have delivered royally underwhelming consumer voice agent experiences. The launch of OpenAI's GPT-4o and Paris-based Kyutai's Moshi voice agents crosses the uncanny valley. Both systems think and speak at the same time to ensure maximum flow between speaker and agent. OpenAI showed how two phones running GPT-4o could hold a compelling voice conversation with one another. Moshi's inference speed was impressive and borderline too fast, producing occasionally jarring interruptions to the human speaker if they paused too long. Google's NotebookLM and its ability to generate conversational podcasts based on research is also winning fans. More recently, Hugging Face implemented a speech-to-speech pipeline combining voice activity detection, speech-to-text, an LLM and text-to-speech (sketched below).
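A compressed sketch of that cascaded approach using Hugging Face transformers pipelines. The model choices are illustrative stand-ins, and the voice activity detection stage is omitted for brevity:

from transformers import pipeline

# Three cascaded stages; a production pipeline adds VAD in front of these.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
chat = pipeline("text-generation", model="HuggingFaceTB/SmolLM-360M-Instruct")
tts = pipeline("text-to-speech", model="facebook/mms-tts-eng")

text_in = asr("question.wav")["text"]                           # speech -> text
reply = chat(text_in, max_new_tokens=64)[0]["generated_text"]   # text -> text
speech = tts(reply)                                             # text -> speech
# speech["audio"] and speech["sampling_rate"] can be written out as a wav file.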


GenAI finally begins to scale in law


Legal tech isn’t new, but was historically focused on “simpler” tasks such as contract lifecycle management, NDA
review and building case law databases. A cautious, liability-conscious industry is beginning to get stuck in.
● AI-powered tools are now being used across drafting, case management, discovery, and due diligence. A range of
big US law firms, including Latham & Watkins, Cleary Gottlieb Steen & Hamilton, DLA Piper and Reed Smith
have started to hire in-house AI experts.
● Harvey, a popular legaltech AI start-up serving law firms, including
Macfarlanes and Allen & Overy, raised a $100M Series C in July.
● While in-house legal teams are less well-served by specialized tools, adoption rates there are actually higher, based on survey data. Klarna has encouraged its legal team to use ChatGPT to save time on contract drafting, claiming adoption has hit 90%.
● The difference in pace can in part be explained by economics. The associate billable hours AI can replace are some of a law firm's most profitable business. Firms are yet to settle on a solution for navigating this while remaining competitive.

Apple and OpenAI team up…


Amid reports that it has fallen behind its timelines after a slow entry into the genAI race, Apple spurned its long-standing enemy Meta to begin integrating ChatGPT across iOS, iPadOS, and macOS.
● Commentators have frequently pointed to Apple as the laggard in the Big Tech AI wars. While its internal
research team has published high-quality work, it has struggled to productize it rapidly, due to a combination
of risk aversion and internal prioritization.
● While the company has announced its Apple Intelligence service, it is planning a gradual roll-out that will begin after the release of the next iPhone.
● Apple has struck a partnership with OpenAI to
use ChatGPT to enhance Siri, and offer image
and document understanding features, along
with image generation.


…but is this a marriage of convenience?


Given Apple is publishing work on foundation models that will power Apple Intelligence features, it’s reasonable
to ask how long-lasting or deep any OpenAI partnership is likely to be.
● Apple has kept up a steady tempo of research publications and has released a series of highly capable
smaller open models with a focus on on-device inference.
● In July, they released a paper documenting the models
that will power Apple Intelligence features.
● The server and smaller on-device versions of the model
demonstrate competitive performance in instruction
following, tool use, writing, and math.
● The on-device 3B model outperforms Gemma-7B and Mistral-7B in human evaluations.
● Apple argues this is a sign that data quality is a far more important determinant of performance than data quantity. Pre-training included web pages, math, code, and certain licensed datasets.
● They're also investing in the MLX array framework for AI research on Apple silicon (see the sketch below).

There’s gold in them kernels


Open source projects are squeezing serious performance gains out of existing hardware by optimizing at the GPU kernel level.
● Unsloth, since launching at the end of last year, has quickly emerged as a popular open source project, offering up to 30x faster training and fine-tuning by leveraging GPU kernel improvements.
● The focus is on optimizing the attention mechanism when using LoRA for efficient fine-tuning. Unsloth manually derives gradients for 6 matrix operations related to LoRA and attention inputs.
● By carefully arranging the order of matrix multiplications and using in-place operations, it's possible to significantly boost speed and memory efficiency (see the sketch below).
● These optimizations are applied across all model components, not just the attention mechanism.
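A toy NumPy illustration of why multiplication order matters for LoRA-style layers: keeping the low-rank factors separate avoids materialising the merged weight matrix. This shows the arithmetic only - not Unsloth's hand-written kernels or gradient derivations:

import numpy as np

d, r, n = 4096, 16, 8            # hidden dim, LoRA rank, tokens in a batch
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
W = rng.normal(size=(d, d))      # frozen base weight
A = rng.normal(size=(d, r))      # LoRA down-projection
B = rng.normal(size=(r, d))      # LoRA up-projection

# Naive: materialise the merged d x d weight first (~d*d*r extra FLOPs).
y_naive = x @ (W + A @ B)

# Reordered: keep the factors separate (~n*d*r + n*r*d extra FLOPs, tiny for n << d).
y_fast = x @ W + (x @ A) @ B

assert np.allclose(y_naive, y_fast)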


Two of TechBio’s leading public companies come together in a $688M deal


Recursion, which excels at scaling biological exploration with high-throughput AI-first experimentation, is
combining with Exscientia’s AI-first precision chemistry capabilities. This creates a full-stack discovery and design
company with the largest GPU compute cluster in biopharma. The business has the potential to read out 10
clinical trials across rare disease, precision oncology, and infectious diseases in the next 18 months.


The video generation race is red hot


Players including Runway, Pika, Luma, and OpenAI are massively scaling up their data collection and model
training experiments to seek quality and consistency improvements in text-to-video generations, in addition to
producing longer clips.

[Company funding figures - total funding: $236.5M, $135M, $67.3M. Sample prompt: "Cinematic animal documentary showing a highland cow in a field with wind blowing through its hair."]

But high-end model providers face a squeeze from cheap and OS competitors
US text-to-video start-ups sell subscription plans based on credits, but with a single second of video burning
through 5 Runway or Pika credits, users have to make sure they master the art of prompting quickly.
Text-to-video tends to come with lower GPU requirements than LLMs, creating an opportunity for cheaper
Chinese offerings like Kuaishou’s Kling, unconstrained by copyright fears, or highly capable open source models
like CogVideoX.


Generative image-conditioned video generation with LoRAs on top


Low-Rank Adaptation (LoRA) is a method to fine-tune large models such that their generations improve along aspects that the user cares about, such as characters, styles or concepts (a minimal layer sketch follows below). Platforms such as Civit.ai make it easy for users to train LoRAs using their own training examples. These LoRAs are shared on a marketplace for anyone to use. Moreover, a popular workflow is to use the output of a LoRA model to condition the generation of a few-second video using products like Runway that allow users to set the start and end image frame. It's surely a matter of time before generative audio is added to the mix!
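For the unfamiliar, a minimal PyTorch sketch of the idea: the base weights stay frozen while two small matrices learn a low-rank update. Hyperparameters here are common illustrative defaults, not Civit.ai's settings:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank correction; only A and B train.
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(2, 512))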


Personalising cancer therapy with mRNA vaccines and predicted neoantigens


Covid darlings Moderna and BioNTech are generating individualised "neoantigen" therapies (INT) to fight cancer. INTs are composed of mRNAs that encode predicted neoantigens - cancer-specific mutations, produced by tumor cells, that act as antigens. These neoantigens prompt the patient's immune system to clear the tumor cells that produce them. New positive data shows that INTs are having promising therapeutic effects in aggressive melanoma (skin) and pancreatic cancer, although INTs present significant manufacturing and logistics challenges.
● In April 2024, BioNTech shared 3-year follow-up data from a Phase 1 trial with their BNT122 (INT) in
pancreatic cancer. 8 of 16 patients saw high-magnitude T cells specific to the encoded neoantigens.
● 6 of these 8 patients remained disease-free during the 3-year follow-up period. Of the 8 patients who didn’t
see an immune response, 7 showed tumor recurrence.
● In June 2024, Moderna and Merck announced 3-year Phase 2b trial (n=157 patients) data showing that
mRNA-4157 (V940, INT) in combination with KEYTRUDA (melanoma drug) reduced the risk of recurrence or
death by 49% and the risk of distant metastasis or death by 62% vs. KEYTRUDA alone in melanoma patients.
● The 2.5-year recurrence-free survival rate of mRNA-4157 (V940) in combination with KEYTRUDA was 74.8% as
compared to 55.6% for KEYTRUDA alone.

Hot or not: smart glasses?


Google famously launched their smart glasses in 2014, just as deep learning-based computer vision research was starting to show promise and a few years before the augmented reality hype really started peaking. The product flopped and was pulled in 2015. Meanwhile, in 2020, Meta started collaborating with popular sunglasses brand Ray-Ban to develop smart glasses. The first version was released in 2021 and the second, with enhanced audio capabilities and an integration with Meta AI, launched in 2023 for $299. It has become a hit. While sales numbers aren't shared, Zuckerberg stated that many styles and colors sold out. It's likely that the form factor, quality audio and changing attitudes towards privacy contributed to this change of fate.


Hot or not: portable AI assistants?


Less successful have been attempts to build AI-powered gadgets designed to act as assistants. The two most
famous are the Rabbit R1 and the Humane AI pin. These gadgets combine standard voice assistant capabilities
with other features, including a camera, image analysis, and language translation. Early reviews have been
near universally negative, with common complaints including unreliability, poor battery life, and a lack of
useful features. While reviewers often believed there was a world in which these devices could be useful, they
complained customers were paying high sums ($699 for the Pin, $199 for the R1) to beta test products that
weren’t ready for the market.


AI investment surges in every region


Driven by GenAI mega-rounds like xAI and OpenAI's $6B fundraises, US private markets continue to lead. Total investment into AI companies reached close to $100B.


Driven by public companies, AI companies reach nearly $9T in value


While private company valuations have continued to climb at a steady pace, a small handful of publicly traded
companies have held up the market like Atlas. Publics alone now enjoy a greater enterprise value than the
entire market in 2023.


Of all venture-backed companies, the highest % of AI companies are found in the robotics, enterprise software, space and security categories.
Last year, enterprise software, health, finance and marketing were the most actively funded AI categories.


Over the last 2 years, mega $250M+ rounds dominated AI financings
There appears to be a clear "pre/post-GPT-4 era" (2023) that triggered all funding systems to go on steroids…


The IPO market remains lifeless, while M&A activity drifts -23% from its 2021 peak
Amid mounting regulatory scrutiny and shaky post-Covid markets, dealmaking has been icy, as companies maintain a 'wait and see' attitude.


Attention is all you need… to raise billions for your AI start-up
Noam Shazeer of Character.ai sold his team back to Google for $2.5B, while Adept was acqui-hired into Amazon and Inflection into Microsoft for $650M. These deals all involved hiring founders and star employees, while paying enough money to investors - structured as a technology licensing fee - to get the deals through.

Company         Capital raised   Exit price
Adept           $415M            NA
Inflection      $1.5B            $650M
Character.ai    $193M            $2.5B


Section 3: Politics


The US introduces limited frontier model rules via executive order…


After securing voluntary commitments from the big labs in July 2023, the White House decided to make them
binding, with Joe Biden signing an executive order on frontier model regulation in October that year.
● Executive Order 14110 was primarily directed at government agencies. Measures include mandating the
development of cybersecurity standards, requiring federal agencies to publish AI use policies, directing
agencies to address AI-related critical infrastructure risks, and commissioning a labor market study.
● Most notably, the EO required labs to notify the federal government and share the results of safety tests before public deployment if the model used more than 10^26 FLOPS of computing power in training (slightly more than GPT-4 and Gemini Ultra) - see the back-of-the-envelope check below.
● It also set out additional requirements for companies working on the
use of AI for biological synthesis.
● The crucial downside of executive orders is that they can be revoked
at the stroke of a pen. The Republican platform for the coming
presidential election commits to doing exactly that.
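A quick sanity check on what the 10^26 FLOP threshold implies, using the standard ~6 × parameters × tokens approximation for dense transformer training compute. The model below is hypothetical:

def training_flops(params: float, tokens: float) -> float:
    # Standard approximation for dense transformers: ~6 FLOPs per parameter per token.
    return 6 * params * tokens

THRESHOLD = 1e26  # EO 14110 reporting trigger

# Hypothetical dense model: 500B parameters trained on 40T tokens.
flops = training_flops(500e9, 40e12)
print(f"{flops:.1e} FLOPs -> reportable: {flops >= THRESHOLD}")  # 1.2e+26 -> True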

…while states pursue their own, more controversial, rules


With little prospect of bipartisan consensus emerging around broader federal AI regulation, states are pursuing
their own AI laws, most notably California with SB 1047.
● Bills so far tend to focus on the disclosure of AI usage, reporting for certain high-risk use cases, and consumer opt-outs. For example, Colorado's state legislature passed a law that includes reporting requirements for high-risk systems and creates a reporting mechanism for algorithmic discrimination risks.
● However, the most comprehensive and controversial has been California's SB 1047. Sponsored by the existential risk org the Center for AI Safety, the bill created a safety and liability regime for foundation models.
● The original draft of the bill spooked industry, with an unconventional method of determining in-scope models, new reporting, compliance and enforcement requirements, along with a government body to oversee frontier models.
● Following pushback by tech companies, VCs, and prominent state Democrats, the bill was significantly amended
and the controversial provisions above removed. While Anthropic and Elon Musk supported the amended
version, OpenAI, Meta, and a trade group representing Big Tech remained opposed.
● Governor Gavin Newsom vetoed the bill, arguing that it risked giving “the public a false sense of security” while
“curtailing the very innovation that fuels advancement in favor of the public good”.


The EU AI Act finally passes into law, following frantic last-minute lobbying
In March, the European Parliament passed the AI Act after an intensive Franco-German influence campaign to
weaken certain provisions. Questions about implementation, however, remain unanswered.
● With the passage of the act, Europe is now the first bloc in the world to adopt a full-scale regulatory framework for AI. Enforcement will be rolled out in stages, with the ban on "unacceptable risk" uses (e.g. deception, social scoring) coming into force in February 2025.
● France and Germany managed to secure changes that tiered the
foundation model regulations, with a basic set of rules applying to
all models and additional regulations for those being deployed in
a sensitive environment.
● The full-on ban on facial recognition has now been watered down
to allow its use by law enforcement.
● While industry is concerned about the law, the months of
consultation and large amount of secondary legislation required
means it still has time to shape the specifics of implementation if
it engages constructively.

Big US labs struggle to navigate European regulation


A combination of the EU AI Act and long-standing GDPR requirements around privacy and data transfers has left
US labs struggling to adapt their services. Anthropic’s Claude wasn’t accessible to European users until May
2024, while Meta won't offer multimodal models to European customers. Meanwhile, Apple is rebelling against the EU's Digital Markets Act, claiming that its interoperability requirements are incompatible with its positions on
privacy and security. As a result, it is delaying the European launch of Apple Intelligence.


Governments shine a spotlight on the scraping of user data


As model builders search for more data to meet their insatiable appetites, opt-out policies are coming under
scrutiny.
● Under questioning from Australian lawmakers, Meta's global privacy director admitted that the company automatically scraped posts going back to 2007 for model training, provided users had not explicitly marked them as private.
● Users in the EU have been granted a global opt-out option following regulatory pressure. The company has confirmed that it will not offer this to users elsewhere unless it is compelled to do so by local regulators.
● The UK’s Information Commissioner’s Office asked Meta to pause in June,
but after the company gave users a window in which to object, they
allowed it to proceed.
● Meta aren’t alone. X has stopped using European users’ public posts
following a court battle, while the Irish Data Protection Commission is
now investigating Alphabet’s use of user data to train Gemini.

The UK moves towards frontier model legislation (slowly)


The new UK Labour Government has signalled that it intends to break with its predecessor’s approach of only
regulating AI via existing legislation, but only subtly.
● At the November Bletchley Summit (more on that later), AWS, Anthropic, Google, Google DeepMind, Inflection
AI, Meta, Microsoft, Mistral AI and OpenAI voluntarily agreed to ‘deepen’ the access that they provided to the
UK Government.
● Anthropic has given the UK AISI pre-deployment access to Claude 3.5 Sonnet, while Google DeepMind made some of the Gemini family available.
● The new UK Government has signalled that it will pass legislation codifying these previously voluntary
commitments, but suggested that it will not pursue broader regulation, implicitly ruling out an EU-style
approach.
● Observers previously thought this legislation would be published immediately, but the timeline has
lengthened, as the government pursues a consultation process in the face of industry pushback.
● This follows on from an industry consultation that the previous government ran on similar questions, which
concluded that immediate frontier model regulation was unnecessary, but there would likely be a time when
this would change.

China’s AI regulation enters its enforcement era


China was the first country to begin setting down generative AI guardrails, with comprehensive (originally
voluntary) guidelines appearing from 2022 onwards. The country’s censorship apparatus is now stepping in.
● While top Chinese labs continue to produce SOTA models, overseen by the Cyberspace Administration of China, the government is keen to ensure that models avoid giving 'incorrect' answers to political questions while also avoiding the appearance of being censored.
● Before releasing a model, labs have to submit their models to tests
with tens of thousands of questions to calibrate their refusal rate.
They usually achieve this by building a spam-filter type classifier.
There’s also a booming industry of consultants assisting labs.
● There are also other inconveniences, including a ban on domestic
Hugging Face access. The officially sanctioned “mainstream values
corpus” acts as an inferior replacement source of training data.
● While big companies like Alibaba, ByteDance, and Tencent can afford the compliance burden and use their global footprint to route around some restrictions, start-ups are likely to suffer.

US export and investment controls on China tighten


Shortly after the publication of the last State of AI Report, the US restricted the export of NVIDIA's sanctions-compliant A800 and H800 chips, but its actions have since broadened out.
● Not only has the US barred the export of certain items, it has actively attempted to interfere with stockpiling
efforts, either blocking shipments of goods or leaning on international partners to do so, ahead of the deadlines
for restrictions. This has affected NVIDIA, Intel, and ASML.
● This was followed with letters from the Commerce Department
instructing US manufacturers to cease sales to Chinese
semiconductor maker SMIC’s most advanced facility.
● The US is also escalating beyond export controls and is moving to either block or restrict US investment in Chinese start-ups working on a broad range of applications deemed detrimental to national security, including semiconductors, defense, surveillance, and audio, image, and video recognition.
● Considering sharply diminished US investment in Chinese start-ups,
the impact will largely be symbolic.

China’s domestic semiconductor efforts struggle despite impressive paper performance


Skeptics of US sanctions have long warned that they could inadvertently spur local innovation. So far, though, China's domestic efforts continue to struggle with quality and performance issues, despite generous government subsidies.
● Despite corruption concerns, China has continued to deepen its semiconductor subsidy program. In May, the
government unveiled a third state-backed investment fund of $47.5B. The finance ministry is the biggest
shareholder, along with a coalition of Chinese banks.
● Huawei caused a stir with the release of the Ascend 910B, a 7nm
chip for AI training, which on paper, was a near match for the
NVIDIA A100.
● However, SMIC has struggled to manufacture these chips at scale:
4/5 are reportedly defective. The company’s cloud services CEO
has all but admitted that the company will struggle to innovate
beyond 7nm for the foreseeable future.
● Previously buzzy semiconductor start-ups such as X-Epic have
started to lay off staff as the market has cooled, while memory
chip maker YMTC ran into severe financial trouble late last year.

But the US CHIPS act begins to prove the critics wrong


The Biden White House’s embrace of industrial strategy provoked skepticism and pushback from a number of
commentators, who pointed to wasteful spending and delays. However, a TSMC plant is now up and running in
Arizona - ahead of schedule. Apple mobile processors are now set to be made in the US via its 5nm process. The
facility will enter full production next year.

[Construction progress photos: April vs. September]


Big in Japan?
For a combination of political and cultural reasons, Japan has historically been a placid market for both venture
capital and AI start-ups. The government is suddenly keen to get a slice of the action.
● The Japanese government sees VC and AI as a potential vehicle for kickstarting a long-stagnant economy, while
Japan presents an opportunity for investors who’d rather not have to raise from deep-pocketed Gulf states.
● Tokyo-based Sakana has already pulled in $200M from US investors like Lux
Capital and Khosla Ventures, while a16z is reported to be planning a Japan office.
● In turn, Japanese government-funded investment vehicles have invested in two of US VC NEA's funds and are actively exploring others. Mitsubishi is said to be investing in Stanford professor Andrew Ng's second AI fund.
● Meanwhile, the country is also priding itself on a light-touch approach to regulation, focusing on industry-led oversight, and seems unsympathetic to copyright claims around generative AI. However, it has created a UK-style safety institute.
● Sensing the momentum, Microsoft has announced $2.9B of investment in Japanese AI and cloud infrastructure.

Amid sharply rising compute bills, sovereign wealth influence begins to grow
With the capex needs of frontier labs beginning to grow beyond what traditional VC alone can supply, labs are
beginning to look further afield. Alarm bells are already beginning to ring in the corridors of power.
● Following the downfall of FTX, its 8% stake in Anthropic was sold primarily to Mubadala, the Abu Dhabi government's sovereign wealth fund. A Saudi bid was turned down on national security grounds, although Saudi investors Prince Alwaleed Bin Talal and Kingdom Holding participated in xAI's Series B.
● Most controversially, G42, an Emirati AI-focused holding company, struck a partnership with OpenAI to work in the country's finance, energy, and healthcare sectors.
● G42's holdings in prominent Chinese technology companies, including ByteDance, prompted panic in the US intelligence community.
● In the end, G42 was pressured into divesting its Chinese holdings and accepting a $1.5B investment from Microsoft, with Microsoft President Brad Smith joining the board.

Public compute efforts pale in comparison to private


The UK, US, and EU are all beginning to ramp up their public compute offerings, subsidising researchers' and start-ups' access to expensive hardware. But efforts remain tentative.
● The UK recently froze investment in a number of projects, most significantly a planned national
supercomputing facility at Edinburgh.
● Meanwhile, the EU is using grants to make small amounts of compute available to start-ups through a
competitive process, along with small sums of money (€250K and 2 million computational hours).
● It recently published a call for proposals for its AI Factories initiative, which will allow developers and researchers to access the EuroHPC network of supercomputers and other resources, including data repositories, skills training, and co-working hubs - leaving potential hosts with the flexibility to bundle various resources as they see fit.
● The US National AI Research Resource is now operational, with researchers applying for year-long access on the condition that their work is published openly afterwards.
● At the bolder end of the spectrum, the Indian government has indicated its willingness to fund half the cost
of a 10,000 NVIDIA GPU cluster, which it would look to establish in under 18 months, provided private
partners are prepared to foot some of the bill.

Rising compute consumption jeopardises Big Tech’s net zero commitments…


Big tech companies have signed up to a range of 2030 climate commitments, with Microsoft even pledging to
be carbon negative. AI energy consumption means they’re currently headed in the wrong direction.
● According to Google’s 2024 Environmental Report, the company’s greenhouse gas emissions have climbed
by 48% since 2019, while Microsoft’s carbon emissions have jumped by 30% since 2020. xAI’s 100k H100
cluster is thought to be powered by gas generators.
● Meanwhile, Goldman Sachs is estimating that data center
power demand is on course to grow 160% by 2030, although
they note that demand was growing sharply even before the
genAI boom took off.
● Tech companies are trying to shape a review of the
Greenhouse Gas Protocol, which sets the rules for carbon
accounting.
● Critics argue offsets don’t represent emissions accurately.
Over 50% of Amazon and Microsoft’s renewable energy comes
from purchasing clean energy certificates.

…and energy infrastructure begins to buckle


The environmental challenges around AI are closely connected to an often-forgotten blocker on scaling - the constraints imposed by the physical world.
● Mark Zuckerberg has said that exponential growth curves potentially require data centers powered by 1GW of electricity (close to the size of a meaningful nuclear power plant), versus the 50-100MW typical at the moment.
● Microsoft and OpenAI's planned $100B+ Stargate supercomputer is estimated internally to need potentially as much as 5GW to power it. For comparison, the Grand Coulee Dam, the US's biggest power plant, produces 6.8GW.
● Microsoft is set to buy all the output of the revived Three Mile
Island nuclear plant.
● A facility like this would require its own power plant, as the grid would not be able to handle it. Ireland, Germany, Singapore, China, and the Netherlands have introduced restrictions on data centers due to capacity concerns.
● Alongside energy, builders of standard-sized data centers are seeing years-long waits for back-up generators and cooling, and challenges sourcing basic components like cables and transistors.

AI-first defense challengers scale, but are they exceptions?


Since last year’s report, we’ve started to see major contracts awarded to defense challengers, but with the
number of winners still small, it’s too soon to say a new ecosystem is emerging.
● Anduril scored a number of crucial wins, getting down to the final two options in the US Air Force's Collaborative Combat Aircraft program, expanding its UK work, and delivering its first unmanned submarine in Australia.
● The Pentagon’s Replicator program, focused on attritable
autonomous systems, has secured its first $500M of funding.
This should be fertile territory for start-ups, but its first award
went to AeroVironment, a public company.
● The US Defense Innovation Unit is also exploring the use of
cheap uncrewed systems.
● In Europe, Helsing hit a $5.4B valuation, with the backing of
US investors. As well as partnering with primes, the company is
working on integrating AI into Ukrainian made drones.
● However, the European ecosystem remains small and the Pentagon's AI investments remain a rounding error on its balance sheet - the sector is far from achieving escape velocity.

AI shows promise on the frontline in Ukraine, but western hardware underwhelms


When the conflict started, start-ups enthusiastically sent their equipment to be trialled on the frontline. The
Ukrainians haven’t always been impressed.
● Drones produced by US start-ups have frequently fallen short of their benchmark performances on range and
payload, while their high-powered designs worked against them. Their advanced comms, designed to make
them more secure, gave them an easy signature for Russian electronic warfare to detect.
● While the off-the-shelf Chinese DJI drone remains ubiquitous, it
appears that the Ukrainians are working hard to build a domestic
ecosystem of drone and ground robotics start-ups.
● At least 67 models of domestically-built UAVs have been certified
and 250 teams are working on UGVs.
● Alongside Helsing, there are still signs that international partners are helping on software. For example, Swiss autonomy start-up Auterion's Skynode is helping FPV drones lock onto targets from long distance to mitigate the effect of electronic warfare.

The debate over AI’s economic impact intensifies


2023 saw discussion about the extent to which different industries were exposed to AI. While organizations (e.g.
the IMF) continue to publish this work, the debate has begun to move on to its wider economic effects.
● Prominent economist Daron Acemoglu started a row when he argued, in a paper for Economic Policy and some Goldman Sachs research, that AI would have a minor macroeconomic impact, increasing Total Factor Productivity by less than 0.55% over the next 10 years, while deepening inequality.
● Acemoglu assumes that it will be possible for AI to drive further automation of tasks, while having little effect on
the efficiency of currently capital-intensive tasks - unlike previous waves of automation - while creating new
‘negative’ tasks (e.g. producing disinformation or targeted ads). These assumptions sparked criticism.
● On automation itself, influential economics commentator Noah Smith argued that comparative advantage is likely to hold for the foreseeable future - even if AI becomes more efficient than humans at everything, the cost of energy and compute will incentivize people to apply it only to the most important tasks.
● This is fortunate, as universal basic income - the policy lever many AI luminaries such as Sam Altman and Demis Hassabis have advocated as a response to AI's impact - may not be a panacea. A sizeable trial funded by Altman found UBI slightly reduced the number of hours worked, but led to little in the way of increased education or entrepreneurship.

Misinformation studies boom, but evidence of AI’s effectiveness remains thin


With their ability to communicate directly with western audiences limited, Russia Today was found to be
operating a network of 1,000 fake X accounts via a tool called Meliorator. There are also signs that Russian
state-linked actors have used fake imagery around the Israel-Hamas conflict to stir controversy. But there’s little
evidence to suggest that this material is being viewed or believed by more than a small number of people.
● A recent review published in Nature poured cold water on the
significance of the issue, finding that research tended to
over-focus on fringe groups, overstate the role of bots, and failed
to actually demonstrate real-world effect.
● In a similar vein, a study from the Alan Turing Institute found that
AI-enabled disinformation had no impact on UK or European
elections this year, with volumes low and exposure largely
confined to small groups of political partisans.


Is AI going to be nationalized? (Spoiler alert: no)


As capabilities accelerate and tensions with China grow, a small chorus of voices have suggested that the US
government may need to intervene and start a new Manhattan Project. Not everyone is convinced.
● Ex-OpenAI staffer Leopold Aschenbrenner revived this discussion with 'Situational Awareness', a 165-page PDF arguing that, based on scaling laws, AGI by 2027 is plausible and that "the nation's leading AI labs [are] basically handing the key secrets for AGI to the CCP on a silver platter".
● Aschenbrenner advocates the government nationalizing the major AI labs and building a national AGI project.
● Critics have accused Aschenbrenner of alarmism and questioned his
timelines, pointing to constraints in data, energy, and compute.
● However, it’s clear that both government and labs are taking these
questions more seriously. OpenAI appointed retired U.S. Army General
Paul M. Nakasone to its Board of Directors and created a new Safety
and Security Committee.
● This follows reports that the company’s systems were breached by
hackers last year.

Section 4: Safety


Safetyism to accelerationism: a major vibe shift has occurred


From the days of US congressional hearings and world tours to promote the (existential) AI safety agenda,
leading frontier model companies are accelerating the distribution of their AI products to consumers.

2023: AI is dangerous 2024: Plz use my app


OpenAI leadership struggle marks the start of an existential risk backlash


Last year, labs were often enthusiastic participants in discussions about critical risks. When the debate escalated into a corporate and commercial tussle at OpenAI, one side clearly came out on top.
● On 17 November 2023, Sam Altman was ousted as OpenAI CEO by directors of the non-profit organization.
While the full circumstances remain unknown, Altman’s critics have referenced an alleged culture of secrecy
and differences of opinion on safety questions.
● Following an employee rebellion and intervention by Microsoft, OpenAI’s
primary backer, Altman was restored and the board replaced.
● Superalignment researcher Jan Leike departed for Anthropic, while
co-founder Ilya Sutskever left to launch Safe Superintelligence Inc. with
ex-Apple AI lead Daniel Gross and ex-OpenAI engineer Daniel Levy.
● Shortly following the release of OpenAI o1, amid reports that OpenAI was
planning to remove non-profit control and award Altman equity, a number
of departures were announced - most notably CTO Mira Murati, Chief
Research Officer Bob McGrew and VP Research (post-training) Barret
Zoph.

2023 Prediction: We see limited progress on global AI governance beyond high-level voluntary commitments
Following the heightened AI safety discussions of 2023, the UK organized an AI Safety Summit in November, bringing together governments and industry at Bletchley Park, marking the start of a bigger process.
● The first AI Safety Summit resulted in the Bletchley Declaration, where the US, UK, EU, China and others
committed to cooperating on identifying safety challenges and introducing risk-based policies. This followed a
similar commitment in October from G7 nations as part of the Hiroshima Process.
● This was followed by a similarly-themed summit in Seoul in May 2024, which resulted in EU, US, UK, Australia,
Canada, Germany, France, Italy, Japan, South Korea, and Singapore agreeing to develop interoperable governance
frameworks.
● There is evidence that not every country is equally engaged in this process. France, for example, is keen to move
the discussion away from safety, pitching its stop on the summit circuit as the “AI Action Summit”, which will
focus on realizing the benefits of AI.
● Moreover, this work remains high-level and non-binding. It remains to be seen whether more motivated
governments will be able to preserve the momentum.

UK creates the world’s first AI Safety Institute and the US swiftly follows
Coinciding with the Bletchley Summit, the UK announced that its Frontier AI Taskforce was to be superseded by
the AI Safety Institute (AISI) - the world’s first. The US, Japan, and Canada have all followed with smaller efforts.
● The AISI has three core functions: i) to conduct evaluations on advanced models before their deployment, ii)
build up state capacity around safety and conduct research, and iii) coordinate with international partners.
● It announced an MoU with its US equivalent, with the two agreeing to work
together on the development of tests, while the AISI is planning an SF office.
● OpenAI has said that it will offer the US AISI early access to its next model.
● The AISI has also released Inspect, a framework for LLM safety evaluation, covering core knowledge, reasoning ability, and autonomous capabilities, among other things (a minimal usage sketch follows this list).
● However, there is a debate about the extent to which the AISI should focus on standard setting (which it is well set up to do) and evaluations (where it will rely more on the goodwill of industry).
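
As a flavor of what an Inspect evaluation looks like, here is a minimal sketch assuming the open-source inspect_ai package's Task/Sample/solver interface (exact parameter names have shifted across versions, and the one-sample dataset and task name are toy stand-ins):

    from inspect_ai import Task, task
    from inspect_ai.dataset import Sample
    from inspect_ai.scorer import match
    from inspect_ai.solver import generate

    @task
    def capital_check():
        # A deliberately trivial task; real Inspect suites probe core
        # knowledge, reasoning, and autonomous capabilities.
        return Task(
            dataset=[Sample(input="What is the capital of France?",
                            target="Paris")],
            solver=generate(),   # query the model under evaluation
            scorer=match(),      # score by matching the target string
        )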


Governments rush to patch gaps in critical national infrastructure


Along with building a greater in-house understanding of model capabilities, the UK is emerging as one of the
main leaders in building resilience.
● Through its Advanced Research and Invention Agency (ARIA), the UK is spending £59M developing a
‘gatekeeper’ - an advanced system tasked with understanding and reducing the risk of other AI agents in critical
domains like energy, healthcare, and telecoms.
● The UK Government is also reportedly planning a “Laboratory for AI
Safety Research”, designed to pool knowledge across government
about offensive AI use by the country’s adversaries.
● The US Department of Energy has been using its in-house testbeds
to assess the risks AI may pose to critical infrastructure and energy
security.
● Meanwhile, the Department of Defense and Department of
Homeland Security have focused on addressing vulnerabilities in
government networks used for national security and civilian
purposes.

Safety goes partisan (sort of)


In last year's report, we covered how the culture wars appeared to be slowly coming for AI; since then, the Gemini 'woke AI' blow-up has fuelled the fires. Could the US presidential election signal a change in direction?
● The 2024 Republican platform commits to repealing the AI executive order (EO), claiming it “hinders AI
Innovation, and imposes Radical Leftwing ideas on the development of this technology”, attracting the
support of some big names in the Valley. It, however, makes no mention of the future of the US AISI.
● JD Vance is the first member of a presidential ticket to have apparently
developed views on these issues, having previously accused big tech
companies of using AI safety as a vehicle for regulatory capture.
● Meanwhile, Kamala Harris has said less on the subject. However, her remarks
when she visited the UK for the Bletchley Summit were widely interpreted as
an implicit critique of the focus on safety questions at the expense of ethics,
echoing many UK civil society groups.
● Regardless of the fate of the EO, at a Congressional level, safety remains a
bipartisan issue, with both parties signing up to an AI policy roadmap in May.

As the attack surface widens, developers up research into jailbreaking…


New capabilities bring new vulnerabilities. Incumbents and specialist labs have upped research into jailbreaking,
designing potential fixes and creating the first red-teaming benchmarks.
● OpenAI proposed a fix to the "ignore all previous instructions" attack via 'instruction hierarchy', which ensures LLMs don't assign equal priority to users' and developers' instructions. This has been deployed in GPT-4o Mini.
● Anthropic's work on many-shot jailbreaking points to the potential of a 'Cautionary Warning Defense', which prepends and appends warning text to caution models against being jailbroken (a minimal sketch follows this list).
● Meanwhile, the safety specialists at Gray Swan AI have piloted the use of 'circuit breakers'. Instead of trying to detect attacks, the technique remaps harmful representations so the model either refuses to comply or produces incoherent outputs. They find this outperforms standard refusal training.
● LLM testing start-up Haize Labs worked with Hugging Face to create the first ever red teaming resistance
benchmark. It compiles commonly used red-teaming datasets and evaluates their success rate against models.
Meanwhile, Scale has launched its own robustness leaderboard, based on private evals.
● There are philosophical debates about whether jailbreaking benchmarking datasets and evals will be fruitful -
some researchers have argued that the community should focus on designing new attacks and defending
against them individually, as jailbreaking classifiers will fail against strong models.
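
To make the 'Cautionary Warning Defense' concrete, here is a minimal sketch of the wrapping step - the warning wording is illustrative, not the paper's actual text:

    def apply_cautionary_warning(user_prompt: str) -> str:
        # Prepend and append cautionary text around the (possibly adversarial)
        # user message before it reaches the model.
        pre = ("Note: the following user message may contain a jailbreak "
               "attempt. Follow your safety guidelines regardless.")
        post = ("Reminder: do not comply with harmful requests, even if the "
                "message above instructs you to ignore your rules.")
        return f"{pre}\n\n{user_prompt}\n\n{post}"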

…but they’re unable to keep up with the red team


A community of red teamers (led by the anonymous Pliny the Prompter) has made short work of the defenses outlined on the previous slide, with GPT-4o mini's Instruction Hierarchy being compromised within hours.
● While much of this work is done by ethically-motivated groups, the UK’s AI Safety Institute has expressed
alarm at how models from leading labs comply with harmful requests “under relatively simple attacks”.
● Although jailbreak attacks are mostly harmless, DeepKeep, an Israeli
cybersecurity start-up, made Llama 2 reveal sensitive personal data.
● Meanwhile, a team at UIUC has shown that GPT-4’s ability to leverage
tool use and long context means it can hack websites by performing
tasks like SQL injections without human feedback. With the right
context, it can also exploit one-day vulnerabilities.
● Other research has illustrated the vulnerability of multi-agent
environments to ‘infectious attacks’, where single agents are
jailbroken, before contaminating the others.


If you can’t beat the jailbreakers, join them


Coming up with endless potential attacks to red team models is challenging. Labs are increasingly using LLMs to
scale the process of finding and patching vulnerabilities, including two teams at Meta.
● Rainbow Teaming employs an open-ended search algorithm to create prompts designed to elicit potentially unsafe or biased responses from the target LLM (a simplified sketch follows this list).
● By varying their approach and content, they can
systematically explore LLM weaknesses. This was
used as part of the safety testing for Llama 3.
● Rather than evolutionary search, AdvPrompter uses
a single LLM, going through an alternating process
of generating adversarial prompts and fine-tuning
on them.
● Once trained, AdvPrompter can quickly produce new
adversarial prompts adapted to different
instructions.
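
A heavily simplified sketch of this style of open-ended adversarial prompt search appears below. The mutate and unsafe_score callables are hypothetical stand-ins (an LLM rewrite step and a judge of the target model's response), and the real method also maintains a diversity archive over attack 'descriptors', which is omitted here:

    import random

    def adversarial_prompt_search(seed_prompts, mutate, unsafe_score,
                                  iterations=1000):
        # Keep a pool of prompts scored by how unsafe the target's reply is,
        # and repeatedly mutate promising candidates to explore weaknesses.
        archive = {p: unsafe_score(p) for p in seed_prompts}
        for _ in range(iterations):
            parent = random.choice(list(archive))
            child = mutate(parent)              # e.g. vary attack style or topic
            score = unsafe_score(child)
            if score > archive.get(child, 0.0):
                archive[child] = score          # retain stronger attacks
        return sorted(archive, key=archive.get, reverse=True)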

It’s not just foundation models that face adversarial attacks


To improve the robustness of image classifiers to adversarial attack, a Google DeepMind team drew inspiration
from biological visual systems, specifically the concept of microsaccades (small, involuntary eye movements).
● They feed the model multiple smaller, slightly blurrier versions of the same image. This improves robustness without needing special training (a simplified sketch follows this list).
● CrossMax Ensembling combines predictions from different layers of the model.
● Even if an adversarial attack confuses the
final output, the predictions from earlier
layers are often still accurate. By combining
these, the model becomes stronger against
attacks.
● The proposed method achieves
state-of-the-art (SOTA) adversarial accuracy
on the CIFAR-10 and CIFAR-100 datasets
without adversarial training.
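
A simplified PyTorch sketch of the multi-resolution idea is below; the paper's exact augmentations and its CrossMax layer-ensembling are richer than this:

    import torch
    import torch.nn.functional as F

    def multi_resolution_predict(model, image, scales=(1.0, 0.75, 0.5)):
        # Average logits over slightly blurrier, rescaled copies of the input,
        # loosely mimicking microsaccades. image is an (N, C, H, W) tensor.
        logits = []
        for s in scales:
            size = max(8, int(image.shape[-1] * s))
            small = F.interpolate(image, size=(size, size), mode="bilinear")
            back = F.interpolate(small, size=image.shape[-2:], mode="bilinear")
            logits.append(model(back))
        return torch.stack(logits).mean(dim=0)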

Beyond jailbreaking, research points to the potential of more stealthy attacks


While jailbreaking is often the public face of safety challenges, the potential attack surface is much wider, covering everything from training data through preference data to fine-tuning.
● Anthropic published an eye-catching paper arguing that it was possible to train LLMs to act as ‘sleeper agents’,
exhibiting safe behavior on their initial release, before turning malicious at a later date. This was resistant to
safety training techniques, such as supervised fine-tuning, reinforcement learning, and adversarial training.
● Researchers from Google and Technical University of Darmstadt found
that poisoning the preference pairs that RLHF relies on was an effective
way to manipulate a model. They only needed to compromise <5% of the
data, indicating the dangers of the widespread use of public and
uncurated datasets for preference training.
● Berkeley and MIT researchers created a dataset that seems benign but
trains models to produce harmful outputs in response to encoded
requests. When applied to GPT-4, the model consistently acted on
harmful instructions while evading common safeguards.

Why is it so hard to predict the downstream capabilities of frontier models?


While there's a significant body of work on how pre-training performance scales, there's much less clarity on how downstream capabilities scale. A team of researchers has scrutinized the role of multiple-choice questions.
● They argue that standard performance metrics like accuracy mask the clear scaling trends visible in raw model outputs, making capability prediction difficult. These metrics compress and distort the original probability data, obscuring subtle improvements that occur as models get larger (a toy illustration follows this list).
● This would appear to strengthen the argument that ‘emergent capabilities’ are the artificial product of poor
metric construction, rather than real capability jumps.
● As the metrics rely on comparing the correct choice against specific incorrect choices, the researchers argue
that we need to understand how probabilities change for both correct and incorrect answers as scale increases.
● This will also involve developing new evaluation techniques that preserve more of the raw probability
information.
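
A toy illustration of the compression argument: the correct option's probability can improve smoothly with scale while greedy multiple-choice accuracy stays flat, then jumps the moment it overtakes the best distractor.

    import numpy as np

    distractor = 0.35                      # probability of the strongest wrong option
    p_correct = np.linspace(0.1, 0.6, 11)  # smooth underlying improvement with scale
    accuracy = (p_correct > distractor).astype(int)
    for p, a in zip(p_correct, accuracy):
        # Accuracy flips from 0 to 1 in a single step - apparent 'emergence'.
        print(f"p(correct)={p:.2f} -> accuracy={a}")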


Although emergent capability scepticism is far from universal


Last year’s SOAI covered a controversial paper from Stanford researchers arguing that emergent capabilities are
a product of evaluation metrics, but pushback has continued on a number of fronts.
● One of the most influential community critiques came from Harvard computer scientist Boaz Barak. In his
response, Barak argued that while some discontinuities might be artifacts of measurement, real-world tasks
usually require a model to solve multiple subtasks in sequence.
● For complex tasks, it's hard to predict or decompose the components needed for success in advance, so even if progress on the individual subtasks we're measuring appears smooth, overall performance can spike.
● Meanwhile, a paper from Zhipu AI provides evidence of sudden
performance improvements across both discontinuous and
continuous evaluation metrics. They observed these
improvements when pre-training loss dropped below a specific
threshold, regardless of model size or training compute.

Is RLHF breeding sycophancy?


Ensuring accurate, honest responses is crucial in alignment. However, research points to the interplay of training data, optimization techniques, and the limitations of current architectures, which makes this difficult to guarantee.
● Anthropic has zeroed in on RLHF, arguing that SOTA AI assistants show consistent sycophantic behavior (e.g.
biased feedback, being swayed by factually incorrect prompts, conforming to beliefs, mimicking errors). The
weakness lies in human preference data, with human evaluators preferring supportive responses.
● Optimizing against preference models that don't sufficiently prioritize or accurately assess truthfulness means models deprioritize drawing on their factual knowledge base for certain queries.
● Similarly, research from the Centro Nacional de Inteligencia Artificial in Chile found that LLMs can overestimate the depth of nonsensical or pseudo-profound statements, thanks to RLHF combined with an absence of contextual understanding.


Direct Preference Optimization offers an escape from “reward hacking”…or does it?
First proposed as an alternative to RLHF in 2023, DPO has no explicit reward function (the loss is sketched below) and comes with efficiency advantages because it doesn't sample from a policy during training or require extensive hyperparameter tuning. Despite its novelty, the method has already been used to align Llama 3.1 and Qwen2.
● However, there are signs that the 'over-optimization' that is traditionally associated with RLHF can also happen with DPO and other kinds of direct alignment algorithms (DAAs), despite the absence of a reward model. This worsens the more models are allowed to deviate from their starting point as they learn to align with human preferences.
● This could be the result of underconstrained objectives,
where the algorithm unintentionally assigns high
probabilities to out-of-distribution data.
● This is inherent to DAAs, but can be partially mitigated
through careful parameter tuning and increased model size.
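
For reference, the standard DPO objective is compact enough to state directly - note there is no reward model, only log-ratios against a frozen reference policy. A PyTorch sketch, where inputs are summed token log-probs per response:

    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        # w = chosen response, l = rejected response. beta controls how far
        # the policy may drift from the reference - the implicit KL constraint.
        ratio_w = logp_w - ref_logp_w
        ratio_l = logp_l - ref_logp_l
        return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()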


RLHF isn’t going anywhere fast


Due to a combination of RLHF's innate advantages and innovation designed to improve its efficiency, offline direct alignment methods don't look set to displace it at scale anytime soon.
● Testing online vs. offline approaches across datasets covering summarization, helpfulness, conversational
ability, and harmlessness, a Google DeepMind team found that RLHF emerged as the winner across all of them.
● They argue that this stems from on-policy sampling, which more effectively improves generative tasks and cannot be easily replicated by offline algorithms, even with similar data or model scaling.
● Cohere for AI has explored scrapping the Proximal Policy Optimization algorithm in RLHF (which treats each token as an individual action) in favor of their RLOO (REINFORCE Leave-One-Out) Trainer, which treats the entire generation as one action, distributing rewards across the full sequence (the baseline computation is sketched below).
● They find this leads to a 50-75% GPU use reduction and a 2-3x increase in
training speed versus PPO, depending on model size.
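
The leave-one-out baseline at the heart of RLOO is simple to sketch: each sampled generation's advantage is its reward minus the mean reward of the other samples for the same prompt.

    import torch

    def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # rewards has shape (k,): one scalar reward per sampled generation
        # for a single prompt (k must be >= 2).
        k = rewards.numel()
        baseline = (rewards.sum() - rewards) / (k - 1)  # leave-one-out mean
        return rewards - baseline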


Is a happy middle possible?


A Google DeepMind team has combined the simplicity of direct alignment from preferences (DAP) with the online policy learning of RLHF to create direct alignment from AI feedback. Here, an LLM serves as an annotator, choosing between two responses during each training iteration. This keeps the advantages of online learning without requiring a separate reward model - essentially a form of online DPO. They found it outperformed traditional RLHF and offline DPO across summarization, harmfulness, and helpfulness tasks.
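
A heavily simplified sketch of one training iteration is below; policy.generate, annotator_llm.prefers, and dpo_update are hypothetical stand-ins for the sampling, AI-annotation, and direct-alignment steps:

    def online_ai_feedback_step(policy, annotator_llm, prompt, dpo_update):
        # Sample two responses from the *current* policy (on-policy data),
        # have an LLM annotator pick the better one, then apply a DPO-style
        # update on the fresh preference pair - no separate reward model.
        a = policy.generate(prompt)
        b = policy.generate(prompt)
        chosen, rejected = (a, b) if annotator_llm.prefers(prompt, a, b) else (b, a)
        dpo_update(policy, prompt, chosen, rejected)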


Can LLMs improve the reliability of…LLMs?


LLMs suffer from two main reliability errors: responses that are inconsistent with their internal knowledge (hallucinations) and responses that share information that does not accord with established external knowledge.
● A recent paper from the University of Oxford focuses on a subset of hallucinations called confabulations, where
LLMs produce incorrect generalizations.
● They measure the LLM's uncertainty by generating multiple answers to a question and using another model to group them together by similar meaning. Higher entropy across the groups suggests confabulation (a minimal sketch follows this list).
● Meanwhile, Google DeepMind have introduced SAFE, which evaluates
the factuality of LLM responses by breaking them down into individual
facts, using search engines to verify facts, and clustering semantically
similar statements.
● They've also curated LongFact, a new benchmark dataset for evaluating long-form factuality across 38 topics.
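
A minimal sketch of the semantic-entropy idea is below; same_meaning is a hypothetical stand-in for the paper's bidirectional-entailment check (e.g. an NLI model):

    import math

    def semantic_entropy(answers, same_meaning):
        # Greedily cluster sampled answers by meaning, then compute entropy
        # over cluster sizes. High entropy suggests confabulation.
        clusters = []
        for ans in answers:
            for cluster in clusters:
                if same_meaning(ans, cluster[0]):
                    cluster.append(ans)
                    break
            else:
                clusters.append([ans])
        n = len(answers)
        probs = [len(c) / n for c in clusters]
        return -sum(p * math.log(p) for p in probs)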

Can LLM-generated critiques enhance both accuracy and alignment?


The concept of ‘LLMs as judges’ lives on, with major labs extending it beyond the simple evaluation of outputs.
● OpenAI unveiled CriticGPT, a GPT-style LLM trained on a huge dataset of flawed inputs, to spot mistakes in code generated by other LLMs. It outperformed human contractors in catching errors, and its critiques were preferred to human-written ones 63% of the time.
● The system was also able to spot bugs in training data that had been labelled as 'flawless'.
● Meanwhile, Cohere have explored the possibility of using
LLM-generated critiques to enhance reward models for RLHF. They used
a range of LLMs to generate point-wise critiques for each preference
data pair, designed to have the LLM evaluate the effectiveness of the
prompt-completion pair.
● They found particularly strong results for weaker base models or in low-data settings, where one high-quality critique-enhanced preference pair can be worth up to 40 standard preference pairs.

Can we make the known unknowns known?


LLMs often struggle to assign reliable confidence estimates to their outputs, even when pushed on whether or
not an answer is correct. The solution potentially lies in fine-tuning, rather than better zero-shot prompting.
● Research from NYU, Abacus AI, and Cambridge found that fine-tuning LLMs on a dataset of correct and incorrect answers can significantly improve the calibration of their uncertainty estimates. This requires only a small amount of additional data (around 1,000 examples) and can be done efficiently using techniques like LoRA (a sketch of the data construction follows this list).
● The resulting uncertainty estimates generalize well to new question types and tasks, even when they are
different from the ones used for the fine-tuning.
● Better still, the fine-tuned models can also be used to estimate the uncertainty of other models.
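
A sketch of the data-construction step is below, assuming a hypothetical model.answer helper and naive exact-match grading; the paper's prompt format and grading are more careful, and the resulting dataset is then used for LoRA fine-tuning:

    def build_calibration_data(model, qa_pairs, n=1000):
        # Label the model's own answers against references, producing a small
        # dataset that teaches calibrated Yes/No correctness judgments.
        data = []
        for question, reference in qa_pairs[:n]:
            answer = model.answer(question)
            correct = answer.strip().lower() == reference.strip().lower()
            prompt = (f"Question: {question}\nProposed answer: {answer}\n"
                      "Is the proposed answer correct? Answer Yes or No.")
            data.append({"prompt": prompt, "target": "Yes" if correct else "No"})
        return data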


Transparency is on the up, but there’s significant room for improvement still
Shortly after the last SOAI, Stanford published its first Foundation Model Transparency Index, giving model
developers an average score of 37. This climbed to 58 in the team’s interim update.
● In May 2024, the latest installment of the index assessed the transparency of 14 leading foundation model developers based on 100 indicators spanning 'upstream' factors (data, labor, compute), 'model-level' factors around capabilities and risk, and 'downstream' criteria around distribution and societal impact.
● Scores on compute and usage policies have seen the strongest improvements, while ‘upstream’ ratings still
remain weak.


Could LLMs engage in ‘reward tampering’?


Specification gaming - where models maximize their rewards at the expense of their intended purpose - is nothing new. Anthropic worry that models could go further and alter the training process itself.
● They created a series of training environments to test AI models' propensity for cheating, with tasks escalating from simple political sycophancy to complex deception. The models generalized in ways they weren't trained to, learning increasingly serious misbehaviors, including editing their own code when the researchers made it available.
● While these results highlight the potential for escalation
from even minor reward misspecification, the most severe
behavior was rare (45 times out of 32,768 trials), even
when the researchers did their best to encourage it.
● That said, as our slide on Sakana (see slide 68) and its
associated safety issues indicated, we shouldn’t
underestimate the potential of models to look for
shortcuts.

Anthropic breaks open the black box…


Anthropic’s interpretability team used sparse autoencoders - neural networks that learn efficient representations
of data by emphasizing important features and ensuring only a few are active at any one time - to decompose
activations of Claude 3 Sonnet into interpretable components. They also showed that by ‘pinning’ a feature to
‘active’ you could control the output - famously turning up the strength of the Golden Gate feature.
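
In spirit, the 'pinning' intervention looks like the sketch below; sae.encode/sae.decode are hypothetical stand-ins, and Anthropic's actual steering intervenes during the forward pass at every token position:

    def steer_with_feature(sae, activations, feature_idx, strength=10.0):
        # Encode residual-stream activations into sparse features, clamp one
        # learned feature 'on', and decode back to steered activations.
        features = sae.encode(activations)
        features[..., feature_idx] = strength
        return sae.decode(features)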


…and starts a trend for sparse autoencoders


SAEs aren't new, but researchers have often struggled with balancing sparsity against reconstruction quality, and with latents dying during training (i.e. neurons going permanently inactive). OpenAI researchers have worked on a methodology that scales.
● The researchers introduce the TopK activation function, which directly constrains the number of active features. For each input, only the k highest-activating features are kept, while the rest are set to zero - providing direct control over the sparsity level (a minimal sketch follows this list).
● They also managed to reduce dead latents to
only 7%, an improvement on previous
methods, where up to 90% could become
inactive in large models.
● The OpenAI team also demonstrated both the
potential and desirability of scaling, training
a 16 million latent autoencoder on GPT-4
activations, finding clear scaling laws.
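
A minimal PyTorch sketch of a TopK sparse autoencoder is below; dimensions are illustrative, and the real implementation adds details (e.g. a pre-encoder bias and decoder-weight normalization) omitted here:

    import torch
    import torch.nn as nn

    class TopKSAE(nn.Module):
        def __init__(self, d_model=768, n_latents=16384, k=32):
            super().__init__()
            self.k = k
            self.encoder = nn.Linear(d_model, n_latents)
            self.decoder = nn.Linear(n_latents, d_model)

        def forward(self, x):
            pre = self.encoder(x)
            # Keep only the k highest pre-activations per input; zero the rest.
            topk = torch.topk(pre, self.k, dim=-1)
            z = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
            return self.decoder(z), z  # reconstruction and sparse features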


Maybe the black box just isn’t that opaque after all?
We've seen a run of interpretability research, including work on SAEs, arguing that high-level semantic concepts are encoded "linearly" in model representations - and that they can be decoded!
● A Chicago/Carnegie Mellon team introduce a simplified model where words and sentences are represented by
binary "concept" variables. They prove that these concepts end up being represented linearly within the model’s
internal space, thanks to next-token prediction and the tendency of gradient descent to find simple, linear
solutions.
● This linearity was also the theme of work from the Moscow-based AI
Research Institute, which argued that transformations happening within
the model can be approximated with simple linear operations.
● Google has introduced a popular new method for decoding intermediate neurons. Patchscopes takes a hidden representation from an LLM and 'patches' it into a different prompt; the model's continuation is then used to generate a description or answer a question, revealing the encoded information.


…but does this come with a downside?


Some research indicates that LLMs use a single direction in their internal representation space to distinguish
between harmful and harmless instructions - the “refusal direction”. By altering this refusal direction, researchers
can make the model refuse harmless prompts or complete harmful prompts. Arguably, the downside of advances
in interpretability is precisely that it can help identify and understand these kinds of features, making it easier to
target them.
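
The refusal direction is typically estimated as a difference in mean activations and removed by projection - a sketch under those assumptions, where activations are (n, d_model) tensors from a single layer:

    import torch

    def refusal_direction(harmful_acts, harmless_acts):
        # Difference-in-means estimate of the refusal direction, normalized.
        direction = harmful_acts.mean(0) - harmless_acts.mean(0)
        return direction / direction.norm()

    def ablate_direction(acts, direction):
        # Remove the component along the direction; this is the intervention
        # that makes a model comply with prompts it would otherwise refuse.
        return acts - (acts @ direction).unsqueeze(-1) * direction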


Despite a storm of controversy, LLM biorisk remains uncertain…


Anthropic captured headlines in the summer of 2023, after an unpublished study highlighted the potential of
LLMs to accelerate the misuse of biology relative to internet access. Other labs have struggled to replicate this.
● OpenAI assessed uplifts in performance on a 10-point scale in biological threat creation with GPT-4 access
versus an internet-only baseline. They found experts scored a 0.82 uplift and students a 0.41.
● While classifying o1 as a ‘medium’ biorisk (a first for an
OpenAI model), the company said that “the model cannot
yet automate biological agentic tasks”. While it performed
significantly better than 4o on biothreat information
questions, it performed poorly on actual ideation.
● A RAND Corporation study concluded that current LLMs
did not meaningfully change the operational risk of a
biological weapons attack versus standard Internet
access.


…but researchers point to other vulnerabilities


There’s growing concern among researchers working at the intersection of AI and biology that the governance
conversation has become too narrowly focused around LLMs, when specialist tools are vulnerable.
● There is a growing number of biological design tools, such as models for protein folding/design and genetic modification (e.g. the open-source RFDiffusion). The same tools that could be used to develop a vaccine or discover drugs more quickly could also be used to create pathogens or evade DNA screening techniques (e.g. via novel virus surface proteins).
● This has led researchers to propose specific biorisk
governance measures around access management,
KYC, lab equipment security, and vulnerability
reporting.
● A number of leading figures in protein design research have committed to a series of responsible design principles, along with specific practices around partnerships and evaluation.


Zooming out, are we too focused on the wrong harms?


While sophisticated technical exploits receive the bulk of researcher attention, a Google DeepMind examination
found “that most cases of GenAI misuse are not sophisticated attacks on AI systems but readily exploit easily
accessible GenAI capabilities that require minimal technical expertise.”
● Many of the most harrowing cases of generative AI misuse come from the use of easily available tools. This is
the realm where policy, rather than technical fixes, will likely have to come to the fore.
● Architecture and design consultancy Arup lost $25M after
fraudsters used deepfakes to pose as the CFO and order a bank
transfer.
● A teacher in Baltimore was the victim of a harassment
campaign and investigations after a faked audio of them
making racist remarks about colleagues and students was
widely circulated.
● The revelation of a ring of accounts on Telegram sharing deep
fake porn of female students at South Korean universities
sparked a national scandal.

Section 5: Predictions


10 predictions for the next 12 months

1. A $10B+ investment from a sovereign state into a US large AI lab invokes national security review.
2. An app or website created solely by someone with no coding ability will go viral (e.g. App Store Top-100).
3. Frontier labs implement meaningful changes to data collection practices after cases begin reaching trial.
4. Early EU AI Act implementation ends up softer than anticipated after lawmakers worry they've overreached.
5. An open source alternative to OpenAI o1 surpasses it across a range of reasoning benchmarks.
6. Challengers fail to make any meaningful dent in NVIDIA's market position.
7. Levels of investment in humanoids will trail off, as companies struggle to achieve product-market fit.
8. Strong results from Apple's on-device research accelerate momentum around personal on-device AI.
9. A research paper generated by an AI Scientist is accepted at a major ML conference or workshop.
10. A video game based around interacting with GenAI-based elements will achieve break-out status.

Thanks!
Congratulations on making it to the end of the State of AI Report 2024! Thanks for reading.

In this report, we set out to capture a snapshot of the exponential progress in the field of artificial intelligence, with a focus on developments since last year's issue, published on 12 October 2023. We believe that AI will be a force multiplier on technological progress in our world, and that wider understanding of the field is critical if we are to navigate such a huge transition.

The report compiles the things that caught our attention in the last year across AI research, industry, politics, and safety.

We would appreciate any and all feedback on how we could improve this report further, as well as contribution
suggestions for next year’s edition.

Thanks again for reading!

Nathan Benaich and Alex Chalmers


Reviewers
We’d like to thank the following individuals for providing critical review of this year’s Report:

Anastasia Borovykh, Daniel Campos, Safiye Celik, Mehdi Ghissassi, Corina Gurau, Charlie Harris, Max Jaderberg, Harry
Law, Omar Sanseviero, Patrick Schwab, Shubho Sengupta, and Joe Spisak.


Conflicts of interest
The authors declare a number of conflicts of interest as a result of being investors and/or advisors, personally or via
funds, in a number of private and public companies whose work is cited in this report. Notably, the authors are
investors in companies listed at: airstreet.com/portfolio



Follow our writing on Air Street Press (press.airstreet.com)


If you enjoyed reading the State of AI Report, we invite you to read and subscribe to Air Street Press, the home of
our analytical writing, news, and opinions.
