ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels (easy/hard) across eight real-life scenarios.

Jupyter Notebook 234 7 Updated Aug 19, 2023

Junjie-Ye / RoTBench

RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning

Python 11 Updated Apr 11, 2024

fairyshine / Seal-Tools

The source code and dataset mentioned in the paper Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark.

Python 31 2 Updated Aug 7, 2024

IBM / API-BLEND

Companion code to https://arxiv.org/abs/2402.15491

Python 9 Updated Apr 26, 2024

Junjie-Ye / ToolEyes

ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios

Python 63 5 Updated Apr 11, 2024

Yifan-Song793 / RestGPT

An LLM-based autonomous agent controlling real-world applications via RESTful APIs

Python 1,297 93 Updated Jun 7, 2024

tangqiaoyu / ToolAlpaca

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

Python 281 34 Updated Jul 4, 2024

sambanova / toolbench

ToolBench, an evaluation suite for LLM tool manipulation capabilities.

Python 134 11 Updated Feb 28, 2024

OpenBMB / BMTools

Tool Learning for Big Models, Open-Source Solutions of ChatGPT-Plugins

Python 2,876 269 Updated Dec 5, 2023

jeinlee1991 / chinese-llm-benchmark

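Chinese LLM capability leaderboard: currently covers 115 large models, including commercial models such as chatgpt, gpt4o, Baidu ERNIE Bot, Alibaba Tongyi Qianwen, iFlytek Spark, SenseTime senseChat, and minimax, as well as open-source models such as Baichuan, qwen2, glm4, yi, InternLM2, and llama3, evaluated across multiple capability dimensions. It provides not only a capability score leaderboard but also the raw outputs of all models!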

2,330 112 Updated Sep 10, 2024

rgbkrk / chatlab

⚡️🧪 Fast LLM Tool Calling Experimentation, big and smol

Jupyter Notebook 133 12 Updated Mar 9, 2024

Maluuba / nlg-eval

Evaluation code for various unsupervised automated metrics for Natural Language Generation.

Python 1,338 224 Updated Aug 20, 2024

google-research / google-research

Google Research

Jupyter Notebook 33,794 7,825 Updated Sep 10, 2024

princeton-nlp / SWE-agent

SWE-agent takes a GitHub issue and tries to automatically fix it, using GPT-4, or your LM of choice. It solves 12.47% of bugs in the SWE-bench evaluation set and takes just 1 minute to run.

Python 13,270 1,295 Updated Sep 10, 2024

HowieHwong / MetaTool

[ICLR 2024] MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use

Python 60 8 Updated Mar 21, 2024

mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs

Python 705 40 Updated Sep 10, 2024

vinta / awesome-python

An opinionated list of awesome Python frameworks, libraries, software and resources.

Python 218,284 24,778 Updated Aug 11, 2024

huangjia2019 / ai-agents

Async Books (异步图书): LLM Application Development, Hands-On AI Agents

Jupyter Notebook 149 33 Updated Jul 4, 2024

langgptai / LangGPT

LangGPT: Empowering everyone to become a prompt expert!🚀 Structured Prompt, Language of GPT, structured prompting

Jupyter Notebook 5,321 461 Updated Sep 4, 2024

langgptai / wonderful-prompts

🔥Curated Chinese prompts🔥, a ChatGPT usage guide that makes ChatGPT more fun and more useful!🚀

2,971 266 Updated Jun 11, 2024

ShengranHu / ADAS

Automated Design of Agentic Systems

Python 808 121 Updated Aug 20, 2024

magicgh / Self-MAP

[ACL 2024] On the Multi-turn Instruction Following for Conversational Web Agents

Python 10 Updated Jul 12, 2024

孙松涛(Songtao Sun) whustan

Stars

salesforce / BOLAA

SkyworkAI / agent-studio

SalesforceAIResearch / xLAM

RAIVNLab / mnms

uiuc-kang-lab / InjecAgent

Junjie-Ye / ToolSword

MLLM-Tool / MLLM-Tool

ryoungj / ToolEmu

night-chen / ToolQA