Letta Research

Context-Bench: A benchmark for agentic context engineering

Context-Bench measures an agent's ability to perform context engineering across two tracks: a filesystem track, where the agent must retrieve, analyze, and reason about information stored in data files, and a skills track, where the agent must complete tasks by selecting, loading, and using skills.

Last updated: Feb 17, 2026

Interested in contributing a task or model? Email us or open an issue on GitHub.

Filesystem track. The Filesystem Rubric is an LLM-as-a-judge rubric that evaluates a coding agent's ability to correctly retrieve, analyze, and reason about information from filesystem data files. A hypothetical sketch of this grading pattern follows the table.

| # | Model | Filesystem Rubric | Cost |
|---|-------|------------------:|-----:|
| 1 | openai/gpt-5.2-codex-xhigh | 83% | $38.75 |
| 2 | openai/gpt-5.2-xhigh | 83% | $83.47 |
| 3 | anthropic/claude-sonnet-4-6 | 81% | $125.60 |
| 4 | google/gemini-3-flash | 80% | $49.06 |
| 5 | anthropic/claude-opus-4-6 | 77% | $307.60 |
| 6 | google/gemini-3-pro | 72% | $100.72 |
| 7 | anthropic/claude-opus-4-5-20251101 | 59% | $619.54 |
| 8 | openai/gpt-5-mini-high | 53.5% | $39.25 |
| 9 | anthropic/claude-sonnet-4-5-20250929 | 43% | $423.03 |
| 10 | anthropic/claude-haiku-4-5-20251001 | 41% | $129.30 |
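The grading setup described above is standard LLM-as-a-judge scoring: an agent transcript and a list of rubric criteria are handed to a grader model, which marks each criterion pass or fail. Below is a minimal sketch of that pattern, not the actual Context-Bench harness; it assumes the official `openai` Python SDK, and the criteria, prompt wording, and judge model are all invented for illustration.

```python
# Minimal sketch of LLM-as-a-judge rubric scoring. NOT the Context-Bench
# harness: the criteria, prompt, and judge model below are invented.
import json

from openai import OpenAI

client = OpenAI()

# Invented example criteria in the spirit of the Filesystem Rubric.
FILESYSTEM_RUBRIC = [
    "Located the relevant data file(s) before answering",
    "Read the file contents rather than guessing at them",
    "The final answer is consistent with the values in the files",
]


def judge_transcript(transcript: str, rubric: list[str]) -> float:
    """Ask a judge model to mark each criterion pass/fail; return the pass rate."""
    prompt = (
        "Grade the agent transcript against each criterion below. "
        'Reply with JSON: {"passed": [true, false, ...]} in criterion order.\n\n'
        "Criteria:\n"
        + "\n".join(f"- {c}" for c in rubric)
        + "\n\nTranscript:\n"
        + transcript
    )
    response = client.chat.completions.create(
        model="gpt-5",  # judge model is an arbitrary choice for this sketch
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for machine-parseable output
    )
    passed = json.loads(response.choices[0].message.content)["passed"]
    return sum(passed) / len(passed)  # fraction of criteria marked as passed
```

Averaging the per-criterion pass rate over all tasks is one plausible way to arrive at percentage scores like those in the table above.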
Skills track. The Task Completion Rubric is an LLM-as-a-judge rubric that evaluates the agent's ability to complete tasks; the Skill Use Rubric is an LLM-as-a-judge rubric that evaluates the agent's ability to select, load, and use skills. Example criteria for both are sketched after the table.

| # | Model | Task Completion Rubric | Skill Use Rubric |
|---|-------|-----------------------:|-----------------:|
| 1 | openai/gpt-5.2-2025-12-11 (xhigh) | 85.31% | 63.12% |
| 2 | openai/gpt-5.2-2025-12-11 (high) | 84.47% | 54.21% |
| 3 | anthropic/claude-opus-4-6 | 81.32% | 62.64% |
| 4 | anthropic/claude-sonnet-4-6 | 79.08% | 64.8% |
| 5 | openai/gpt-5.2-2025-12-11 (medium) | 77.55% | 49.74% |
| 6 | anthropic/claude-sonnet-4-5-20250929 | 76.5% | 72% |
| 7 | anthropic/claude-opus-4-5-20251101 | 75.54% | 68.82% |
| 8 | deepseek/deepseek-chat | 75.33% | 53.62% |
| 9 | anthropic/claude-opus-4-1-20250805 | 74.4% | 71.8% |
| 10 | google/gemini-3-pro | 72.73% | 64.29% |
| 11 | deepseek/deepseek-reasoner | 70.48% | 56.12% |
| 12 | openai/gpt-5-2025-08-07 | 70.2% | 51.4% |
| 13 | anthropic/claude-haiku-4-5-20251001 | 69.7% | 57.3% |
| 14 | openai/gpt-5-mini-2025-08-07 | 68.8% | 45.5% |
| 15 | z-ai/glm-4.6 | 65.9% | 50% |
| 16 | openai/gpt-5.1-codex | 64.84% | 55.73% |
| 17 | openai/gpt-5.1 | 63.75% | 55.75% |
| 18 | google/gemini-3-flash | 59.54% | 59.54% |
| 19 | mistralai/mistral-large-3 | 56.25% | 36.93% |
| 20 | openai/gpt-5-nano-2025-08-07 | 52.8% | 24% |
| 21 | openai/gpt-5.1-codex-mini | 49.72% | 23.58% |
| 22 | openai/gpt-4.1-2025-04-14 | 36.1% | 31.3% |
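For the skills track, the same judge pattern can plausibly be applied with two separate criteria lists: one grading the outcome (task completion) and one grading the process (skill selection, loading, and use). The criteria below are invented examples; they reuse the hypothetical `judge_transcript` helper from the earlier sketch.

```python
# Invented example criteria; the real Context-Bench rubrics are not shown
# on this page. Reuses the hypothetical judge_transcript() defined in the
# sketch after the first table.
TASK_COMPLETION_RUBRIC = [
    "The task described in the prompt was fully accomplished",
    "The final output is in the format the task asked for",
]

SKILL_USE_RUBRIC = [
    "Selected a skill relevant to the task rather than improvising",
    "Loaded the skill's instructions before acting on them",
    "Followed the loaded skill's steps while producing the result",
]


def score_skills_track(transcript: str) -> dict[str, float]:
    """Score one transcript on both rubrics, mirroring the two leaderboard columns."""
    return {
        "task_completion": judge_transcript(transcript, TASK_COMPLETION_RUBRIC),
        "skill_use": judge_transcript(transcript, SKILL_USE_RUBRIC),
    }
```

Splitting the two scores this way would explain why a model can rank well on task completion while scoring lower on skill use: it may finish the task without ever loading the intended skill.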

Leaderboard Updates

- February 17, 2026
- February 5, 2026
- December 18, 2025
- December 11, 2025
- December 9, 2025
- November 26, 2025
- November 7, 2025
- November 4, 2025
- October 28, 2025