Letta Research

Context-Bench: A benchmark for agentic context engineering

Context-Bench measures an agent's ability to perform context engineering across two tracks: a filesystem track, where the agent must retrieve, analyze, and reason about information stored in data files, and a skills track, where the agent must complete tasks by selecting, loading, and using skills.

Last updated: Feb 17, 2026

Interested in contributing a task or model? Email us or open an issue on GitHub.

Filesystem track. The Filesystem Rubric is an LLM-as-a-judge rubric that evaluates a coding agent's ability to correctly retrieve, analyze, and reason about information from filesystem data files. A hypothetical sketch of this grading pattern follows the table.

| # | Model | Filesystem Rubric | Cost |
|---|-------|------------------:|-----:|
| 1 | openai/gpt-5.2-codex-xhigh | 83% | $38.75 |
| 2 | openai/gpt-5.2-xhigh | 83% | $83.47 |
| 3 | anthropic/claude-sonnet-4-6 | 81% | $125.60 |
| 4 | google/gemini-3-flash | 80% | $49.06 |
| 5 | anthropic/claude-opus-4-6 | 77% | $307.60 |
| 6 | google/gemini-3-pro | 72% | $100.72 |
| 7 | anthropic/claude-opus-4-5-20251101 | 59% | $619.54 |
| 8 | openai/gpt-5-mini-high | 53.5% | $39.25 |
| 9 | anthropic/claude-sonnet-4-5-20250929 | 43% | $423.03 |
| 10 | anthropic/claude-haiku-4-5-20251001 | 41% | $129.30 |
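The grading setup described above is standard LLM-as-a-judge scoring: an agent transcript and a list of rubric criteria are handed to a grader model, which marks each criterion pass or fail. Below is a minimal sketch of that pattern, not the actual Context-Bench harness; it assumes the official `openai` Python SDK, and the criteria, prompt wording, and judge model are all invented for illustration.

```python
# Minimal sketch of LLM-as-a-judge rubric scoring. NOT the Context-Bench
# harness: the criteria, prompt, and judge model below are invented.
import json

from openai import OpenAI

client = OpenAI()

# Invented example criteria in the spirit of the Filesystem Rubric.
FILESYSTEM_RUBRIC = [
    "Located the relevant data file(s) before answering",
    "Read the file contents rather than guessing at them",
    "The final answer is consistent with the values in the files",
]


def judge_transcript(transcript: str, rubric: list[str]) -> float:
    """Ask a judge model to mark each criterion pass/fail; return the pass rate."""
    prompt = (
        "Grade the agent transcript against each criterion below. "
        'Reply with JSON: {"passed": [true, false, ...]} in criterion order.\n\n'
        "Criteria:\n"
        + "\n".join(f"- {c}" for c in rubric)
        + "\n\nTranscript:\n"
        + transcript
    )
    response = client.chat.completions.create(
        model="gpt-5",  # judge model is an arbitrary choice for this sketch
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for machine-parseable output
    )
    passed = json.loads(response.choices[0].message.content)["passed"]
    return sum(passed) / len(passed)  # fraction of criteria marked as passed
```

Averaging the per-criterion pass rate over all tasks is one plausible way to arrive at percentage scores like those in the table above.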
Skills track. The Task Completion Rubric is an LLM-as-a-judge rubric that evaluates the agent's ability to complete tasks; the Skill Use Rubric is an LLM-as-a-judge rubric that evaluates the agent's ability to select, load, and use skills. Example criteria for both are sketched after the table.

| # | Model | Task Completion Rubric | Skill Use Rubric |
|---|-------|-----------------------:|-----------------:|
| 1 | openai/gpt-5.2-2025-12-11 (xhigh) | 85.31% | 63.12% |
| 2 | openai/gpt-5.2-2025-12-11 (high) | 84.47% | 54.21% |
| 3 | anthropic/claude-opus-4-6 | 81.32% | 62.64% |
| 4 | anthropic/claude-sonnet-4-6 | 79.08% | 64.8% |
| 5 | openai/gpt-5.2-2025-12-11 (medium) | 77.55% | 49.74% |
| 6 | anthropic/claude-sonnet-4-5-20250929 | 76.5% | 72% |
| 7 | anthropic/claude-opus-4-5-20251101 | 75.54% | 68.82% |
| 8 | deepseek/deepseek-chat | 75.33% | 53.62% |
| 9 | anthropic/claude-opus-4-1-20250805 | 74.4% | 71.8% |
| 10 | google/gemini-3-pro | 72.73% | 64.29% |
| 11 | deepseek/deepseek-reasoner | 70.48% | 56.12% |
| 12 | openai/gpt-5-2025-08-07 | 70.2% | 51.4% |
| 13 | anthropic/claude-haiku-4-5-20251001 | 69.7% | 57.3% |
| 14 | openai/gpt-5-mini-2025-08-07 | 68.8% | 45.5% |
| 15 | z-ai/glm-4.6 | 65.9% | 50% |
| 16 | openai/gpt-5.1-codex | 64.84% | 55.73% |
| 17 | openai/gpt-5.1 | 63.75% | 55.75% |
| 18 | google/gemini-3-flash | 59.54% | 59.54% |
| 19 | mistralai/mistral-large-3 | 56.25% | 36.93% |
| 20 | openai/gpt-5-nano-2025-08-07 | 52.8% | 24% |
| 21 | openai/gpt-5.1-codex-mini | 49.72% | 23.58% |
| 22 | openai/gpt-4.1-2025-04-14 | 36.1% | 31.3% |
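For the skills track, the same judge pattern can plausibly be applied with two separate criteria lists: one grading the outcome (task completion) and one grading the process (skill selection, loading, and use). The criteria below are invented examples; they reuse the hypothetical `judge_transcript` helper from the earlier sketch.

```python
# Invented example criteria; the real Context-Bench rubrics are not shown
# on this page. Reuses the hypothetical judge_transcript() defined in the
# sketch after the first table.
TASK_COMPLETION_RUBRIC = [
    "The task described in the prompt was fully accomplished",
    "The final output is in the format the task asked for",
]

SKILL_USE_RUBRIC = [
    "Selected a skill relevant to the task rather than improvising",
    "Loaded the skill's instructions before acting on them",
    "Followed the loaded skill's steps while producing the result",
]


def score_skills_track(transcript: str) -> dict[str, float]:
    """Score one transcript on both rubrics, mirroring the two leaderboard columns."""
    return {
        "task_completion": judge_transcript(transcript, TASK_COMPLETION_RUBRIC),
        "skill_use": judge_transcript(transcript, SKILL_USE_RUBRIC),
    }
```

Splitting the two scores this way would explain why a model can rank well on task completion while scoring lower on skill use: it may finish the task without ever loading the intended skill.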

Leaderboard Updates

- February 17, 2026
- February 5, 2026
- December 18, 2025
- December 11, 2025
- December 9, 2025
- November 26, 2025
- November 7, 2025
- November 4, 2025
- October 28, 2025