[EMNLP 2025 Main] SolEval: Benchmarking Large Language Models for Repository-Aware Solidity Smart Contract Generation
Paper: https://arxiv.org/abs/2502.18793
Leaderboard: Coming soon
Contact: Zhiyuan Peng, Xin Yin
SolEval is an evaluation framework for repository-aware Solidity contract generation, targeting end-to-end development capabilities and engineering utility under realistic repository context. The framework covers patch generation, functional correctness (Pass@k), security (Slither), and gas evaluation, and supports RAG and few-shot configurations for reproducible, apples-to-apples comparisons across LLMs/agents.
- Datasets / Repositories: 28 real GitHub projects (original structure preserved).
- Generation: Multi-model, multi-sample patch generation with optional RAG/few-shot.
- Testing: Functional correctness via Foundry `forge` (Pass@k).
- Security: Static analysis via Slither.
- Gas: Gas metrics via `forge`.
- Artifacts: Unified `results/*.jsonl` outputs for analysis and reproducibility.
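For downstream analysis, the unified `results/*.jsonl` artifacts can be loaded with a few lines of Python. A minimal sketch (the `results_*.jsonl` naming follows the evaluation steps; the fields inside each record depend on your run):

```python
import json
from pathlib import Path

def load_results(results_dir="results"):
    """Load every results_*.jsonl file under results_dir into a flat list of records."""
    records = []
    for path in Path(results_dir).glob("**/results_*.jsonl"):
        with path.open() as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    return records
```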
- <2025/02> SolEval repository open-sourced.
- <2025/08> SolEval accepted to EMNLP 2025 Main Conference.
- Operating System: Linux / WSL2 on Windows 11 / Windows / macOS (not thoroughly tested).
- Python: 3.8+ (3.10/3.11 recommended).
- Solidity toolchain: Foundry (forge).
- Security analyzer: Slither (Python 3.8+).
- (Optional) LLMs: GPT-4o, DeepSeek, Qwen, CodeLlama, OpenCode, etc.
git clone https://github.com/pzy2000/SolEval.git
cd SolEval
pip install -r requirements.txt

Before running evaluations, prepare repository data and dependency metadata.
Download the original repositories and extract them under the project root:
- Download: repository.zip
After extraction, you should have SolEval/repository. This folder contains 28 subfolders (each corresponding to a GitHub project). Do not modify any internal file structures, or the evaluation scripts may fail. The file structure should look like this after extraction:
File Structure (click to expand)
SolEval/
├── LICENSE
├── README.MD
├── data/
├── environment.txt
├── forge/
├── libtree-sitter-solidity.so
├── prebuilt/
├── repository/            # contains 28 subfolders (GitHub projects)
├── requirements.txt
├── run_forge_test.sh
├── run_patch_gen.sh
├── run_slither.sh
└── tools/

Install Foundry from the Foundry releases page (or via the official install script). Note that using a newer version may cause inconsistent behavior (e.g., forge test errors in some repositories).
Note: Slither requires Python 3.8+. If you're not using a supported compilation framework, you'll need the Solidity compiler `solc`; we recommend managing versions with `solc-select`.
python3 -m pip install slither-analyzer

SolEval provides an end-to-end pipeline for repository-aware contract generation: patch generation → functional correctness (Pass@k) → security (Slither) → gas (forge).
Generate repository-aware patches with various LLMs; supports RAG and few-shot configurations.
# Example: GPT-4o + RAG, 1-shot, sampling 10 candidates per requirement
python generate_rag.py --context --model gpt-4o --shot 1 --sample 10
# Other configurations (uncomment and adjust as needed)
# python generate_rag.py --context --model gpt-4o-mini --shot 3 --sample 10
# python generate_rag.py --context --model Qwen-7B --shot 2 --sample 5
# python generate_random.py --context --model Qwen-7B --shot 1
# python generate_random.py --context --model OpenCode-33B --shot 2

Args:
- `--context`: enable repository context (e.g., RAG/retrieval).
- `--model`: LLM name used for generation.
- `--shot`: number of few-shot exemplars.
- `--sample`: number of candidates per requirement.
Run tests via Foundry forge to evaluate functional correctness of generated patches.
python run_forge.py --context y --model DeepSeek-V3 --sample 1 --rag true --shot 1
python run_forge.py --context n --model DeepSeek-V3 --sample 1 --rag true --shot 1
python run_forge.py --context y --model DeepSeek-V3 --sample 1 --rag false --shot 1
python run_forge.py --context n --model DeepSeek-V3 --sample 1 --rag false --shot 1

Args:
- `context`: `y|n` to toggle repository context.
- `model`: model used for generation.
- `shot`: number of few-shot exemplars.
- `sample`: number of candidates per requirement.
- `rag`: whether to enable RAG.
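Pass@k is conventionally computed with the unbiased estimator from the HumanEval paper; the sketch below shows that standard formula over n sampled candidates of which c pass (SolEval's internal implementation may differ in detail):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n candidates, c of which
    pass the tests, is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing candidate
    return 1.0 - comb(n - c, k) / comb(n, k)
```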
Run static analysis for potential security issues using Slither. Requires the verifier file produced in the previous step.
python run_slither.py --context y --verifier results/rag/results_OpenCode_shot_1_context_True_testcase_False_20250130_033003.jsonl --model OpenCode --sample 10 --rag true
python run_slither.py --context y --verifier results/rag/results_DeepSeek-Coder-33B_shot_1_context_True_testcase_False_20250201_025654.jsonl --model DeepSeek-Coder-33B --sample 10 --rag true
python run_slither.py --context y --verifier results/rag/results_CodeLlama-34B_shot_1_context_True_testcase_False_20250201_064732.jsonl --model CodeLlama-34B --sample 10 --rag true

Args:
- `context`: `y|n` for repository context.
- `model`: generation model.
- `verifier`: path to the `.jsonl` verifier from the "Pass@k Evaluation" step.
- `sample`: candidates per requirement.
- `rag`: whether to enable RAG.
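Once the security step has run, findings can be tallied per severity from the output `.jsonl`. A minimal sketch; the `findings`/`severity` field names are assumptions, not the documented schema of `run_slither.py`, so adjust them to your actual output:

```python
import json
from collections import Counter

def tally_findings(jsonl_path):
    """Count Slither findings per severity across a results .jsonl file.
    NOTE: the 'findings' and 'severity' keys are hypothetical -- adapt
    them to the schema run_slither.py actually writes."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            for finding in record.get("findings", []):
                counts[finding.get("severity", "unknown")] += 1
    return counts
```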
Compute gas metrics using forge. First, place all `results_*.jsonl` files under `results/gas`, then run in order:
python tools/utils/intersect_gas.py
python tools/run_gas.py --context y --model OpenCode --sample 10 --rag true --shot 1
# python tools/run_gas.py --context y --model DeepSeek-Coder-33B --sample 10 --rag true --shot 1
# python tools/run_gas.py --context y --model CodeLlama-34B --sample 10 --rag true --shot 1

Args:
- `context`: `y|n`.
- `model`: model name.
- `shot`: number of few-shot exemplars.
- `sample`: candidates per requirement.
- `rag`: whether to enable RAG.
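Per-model gas numbers can then be summarized from the resulting `.jsonl`. A minimal sketch; the `gas_used` field name is an assumption, so adjust it to the schema `tools/run_gas.py` actually writes:

```python
import json
from statistics import mean, median

def gas_summary(jsonl_path):
    """Summarize gas usage across a results .jsonl file.
    NOTE: the 'gas_used' key is hypothetical -- adapt it to the
    schema tools/run_gas.py actually writes."""
    gas = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            if "gas_used" in record:
                gas.append(record["gas_used"])
    if not gas:
        return None
    return {"n": len(gas), "mean": mean(gas), "median": median(gas)}
```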
Outputs for functional correctness, security, and gas are written to `results/` (or to the directory specified by your scripts).
Before opening an issue, please verify:
- The `repository/` directory matches the provided originals (no internal paths/files altered).
- Foundry, `solc`/`solc-select`, and Slither are installed and callable from your shell.
- Python deps installed: `pip install -r requirements.txt`.
- Script parameters (`--model`/`--shot`/`--sample`/`--context`/`--rag`) match your expected setup.
- The `verifier` path (for Slither) comes from the real output of the Pass@k step.
- For gas analysis, `results_*.jsonl` files are in `results/gas` and the scripts are executed in order.
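The tool-availability checks above can be automated by probing your `PATH`; a minimal sketch:

```python
import shutil

def missing_tools(tools=("forge", "slither", "solc", "solc-select")):
    """Return the subset of required CLI tools not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```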
@article{peng2025soleval,
title={SolEval: Benchmarking Large Language Models for Repository-level Solidity Code Generation},
author={Peng, Zhiyuan and Yin, Xin and Qian, Rui and Lin, Peiqin and Liu, Yongkang and Ying, Chenhao and Luo, Yuan},
journal={arXiv preprint arXiv:2502.18793},
year={2025}
}

- Some repositories require specific `solc` versions; align with `solc-select` before running evaluations.
- On Windows, certain dependency installers may require Administrator privileges.
- Public leaderboard and full evaluation report.
- More real repositories and test cases.
- Additional languages/frameworks (e.g., Vyper / Hardhat).
Contributions are welcome!
- Fork this repository.
- Create a new branch: `git checkout -b feature-branch`
- Commit changes: `git commit -am 'Add new feature'`
- Push: `git push origin feature-branch`
- Open a Pull Request.
Please follow the existing code style and include appropriate tests.
Released under the MIT License. See LICENSE for details.
This repository accompanies our EMNLP 2025 Main paper. We thank the open-source community and the authors of tools used here (Foundry, Slither, etc.).