Mike L.’s Post

New Product Ops at Scale AI

Excited to share our first LLM leaderboards! We've focused on three principles to improve LLM evaluation:

1. Private datasets: no overfitting.
2. Vetted experts: we trust them to rate nuanced and domain-specific model responses.
3. Open eval methodology: review our methodology and data pipeline construction; it also allows for deep dives into specific performance areas (check out the insights section of our coding leaderboard!).

A huge thank you to our talented team of researchers, operators, and engineers for making this possible. Summer Yue / Daniel Berrios / Dean L. / Ernesto Gabriel Hernández Montoya / Hugh Zhang / Cristina Menghini / Diego A. Mares Buendia / Ken Murphy / William Qian / Jorge Flores Aveledo / Vaughn R.

Scale AI

📣 Scale is excited to release the SEAL leaderboards today, kicking off the first truly expert-driven, trustworthy LLM contest open to all: https://lnkd.in/g32X8Dcz

Compared to existing benchmarks, these leaderboards developed by our Safety, Evaluations, and Alignment Lab (SEAL) are built on:

✅ Private datasets that can’t be gamed
✅ Evolving competition
✅ Expert evaluations

The initial domains covered include: Coding, Instruction Following, Math (based on GSM1k), and Multilinguality. These leaderboards are regularly updated to include new models and capabilities. Our goal is to foster a culture of transparency and openness in the development and evaluation of frontier models.

👉 Finally, we are also announcing the general availability of Scale Evaluation: a platform to enable organizations to evaluate and iterate on their AI models and applications. Learn more: https://lnkd.in/dVwvAhmN 👈

Check out the leaderboard yourself here: https://lnkd.in/gghYicsm

And learn more about the development and motivation behind the leaderboards: https://lnkd.in/gSfZYMkE

Alex Tang, CPA, CMA, MSA.

Head of Gen AI Operations | Finance, Accounting, Operations Leader | ex Anyscale, Square, Poynt | Angel Investor and Advisor | Startup Builder

3mo

Great job team!

Matteo Cera

CEO at Glaut: AI-native market research software | Y Combinator | SVP Pioneer Fund | McKinsey

3mo

Great job, Mike Lunati, we'll test it out!

Daniel Berrios

Head of Product, Model Evaluation at Scale AI

3mo

Amazing work on this, Mike!
