Mike L.’s Post

New Product Ops at Scale AI

3mo Edited

Excited to share our first LLM leaderboards! We've focused on three principles to improve LLM evaluation:
1. Private datasets: no overfitting.
2. Vetted experts: we trust them to rate nuanced and domain-specific model responses.
3. Open eval methodology: review our methodology and data pipeline construction; it also allows for deep dives into specific performance areas (check out the insights section of our coding leaderboard!)
A huge thank you to our talented team of researchers, operators, and engineers for making this possible. Summer Yue / Daniel Berrios / Dean L. / Ernesto Gabriel Hernández Montoya / Hugh Zhang / Cristina Menghini / Diego A. Mares Buendia / Ken Murphy / William Qian / Jorge Flores Aveledo / Vaughn R.

Scale AI

172,116 followers

3mo Edited

📣 Scale is excited to release the SEAL leaderboards today, kicking off the first truly expert-driven, trustworthy LLM contest open to all: https://lnkd.in/g32X8Dcz
Compared to existing benchmarks, these leaderboards developed by our Safety, Evaluations, and Alignment Lab (SEAL) are built on:
✅ Private datasets that can’t be gamed
✅ Evolving competition
✅ Expert evaluations
The initial domains covered include: Coding, Instruction Following, Math (based on GSM1k), and Multilinguality. These leaderboards are regularly updated to include new models and capabilities. Our goal is to foster a culture of transparency and openness in the development and evaluation of frontier models.
👉 Finally, we are also announcing the general availability of Scale Evaluation: a platform to enable organizations to evaluate and iterate on their AI models and applications. Learn more: https://lnkd.in/dVwvAhmN 👈
Check out the leaderboard yourself here: https://lnkd.in/gghYicsm
And learn more about the development and motivation behind the leaderboards: https://lnkd.in/gSfZYMkE

3 Comments

Alex Tang, CPA, CMA, MSA.

Head of Gen AI Operations | Finance, Accounting, Operations Leader | ex Anyscale, Square, Poynt | Angel Investor and Advisor | Startup Builder

3mo

Great job team!

Matteo Cera

CEO at Glaut: AI-native market research software | Y Combinator | SVP Pioneer Fund | McKinsey

3mo

Great job Mike Lunati, we'll test it out!

Daniel Berrios

Head of Product, Model Evaluation at Scale AI

3mo

Amazing work on this, Mike!


More Relevant Posts

Lucas Bunzel

Product Marketing @ Scale AI
3mo Edited
Introducing Scale AI's SEAL Leaderboards -- the first private, expert-driven, trustworthy LLM contest. Our Safety, Evaluations, and Alignment Lab (SEAL) designed the leaderboards with three principles:
🔒 Private evaluation datasets that can't be gamed
🥇 Evolving competition with periodic leaderboard updates
🔍 Expert evaluations using domain-specific methodologies
Initial domains covered: Coding, Instruction Following, Math, and Multilinguality.
To see where GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3 rank, visit https://lnkd.in/gaBTsK9P
Raleigh S.

Business Development at Scale AI
3mo
Exciting update for the GenAI Testing & Evaluation world (which is much smaller than my LinkedIn feed would lead me to believe)! Check it out.
Aurel Npounengnong

Full Stack Developer | Aspiring AI/ML Engineer | NLP Theory | Physics 🚀
2mo
It is very insightful for those comparing LLMs for various use cases.
Joshua Laferriere, MSc

Researcher, AI Practitioner, Data Scientist, Business Analyst, System Administrator
3mo Edited
My new go-to model; can't wait for the fine-tunes. Before this I was using WizardLM. I'm not convinced Llama 3 is the best for me, especially for processing arXiv papers. I might pivot to Llama 3 after reducing papers down to their necessary components and a manageable size, but for now I only rely on models fine-tuned to handle 32k plus.

lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF at main

huggingface.co
Dat Tran

VP of AI/ML at Beams Safety AI 🤖 | Advisor, Investor, and Mentor 🤝 | Keynote Speaker 🎙️
5mo
VoiceCraft - A new zero-shot speech editing and text-to-speech model. Just tried it out. Really amazing speed at inference. You can also do voice cloning and it only needs like 3 seconds of reference. At the moment, it only supports English though. Models are on Huggingface 🤗 Really cool to see how TTS research has progressed. Wasn't like that when we started it at Axel Springer.
🎙️ Demo: https://lnkd.in/gbxqmz_d
📝 Paper: https://lnkd.in/gbTNnF_Z
Ramin Mehran

Tech Lead @ Google DeepMind Multi-Modal perception/generation, AI Breakdown Podcaster
8mo
In this episode, we discuss Mixtral of Experts by Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model, building on Mistral 7B's architecture with 8 experts per layer, among which two experts are selected per token for processing, allowing access to 47B parameters but using only 13B actively. It excels in benchmarks, surpassing Llama 2 70B and GPT-3.5, especially in areas like math, code generation, and multilingual tasks. A special instruction-following version called Mixtral 8x7B – Instruct also outperforms leading models, with both models being open-sourced under the Apache 2.0 license.
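
The routing idea described above (8 experts per layer, 2 selected per token) can be illustrated with a small sketch. This is a toy illustration only, not Mixtral's actual implementation: the function names, the toy "experts" that just scale a vector, and the choice to renormalize the router scores with a softmax over the selected experts are all assumptions made for the example.

```python
import math

NUM_EXPERTS = 8  # experts per layer, as stated in the post
TOP_K = 2        # experts selected per token, as stated in the post


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts by router score; return (index, weight) pairs.

    Weights come from a softmax over the selected scores only
    (an assumption made for this sketch).
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, router_logits, experts, top_k=TOP_K):
    """Weighted sum of the outputs of the selected experts only.

    The unselected experts are never called, which is why only a
    fraction of the total parameters is active for each token.
    """
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, top_k):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


def make_expert(scale):
    """Hypothetical toy expert: multiplies every component by `scale`."""
    return lambda vec: [scale * v for v in vec]


# Toy experts: expert i scales its input by (i + 1).
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]
```

Calling `moe_layer(token, router_logits, experts)` with one score per expert runs only the two highest-scoring experts for that token; the other six are skipped entirely, which is the sense in which a sparse model can hold many more parameters than it uses per token.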

arxiv preprint - Mixtral of Experts

podbean.com
Michele Loi

Senior Science Manager
1mo Edited
https://lnkd.in/eAsNT8GC Congratulations to Corinna Hertweck for the deep thinking in, and super-fast and effective writing of, "What's Distributive Justice Got to Do With It?", the paper concluding and crystallizing our shared, tortuous, interdisciplinary journey on machine learning fairness. She'll have the opportunity to discuss this at #AIES '24, hopefully engaging philosophers and computer scientists in the same way. There is no better way to end our project with Christoph Heitz (and for Corinna to conclude her doctorate)! (Spoiler alert: we're not offering a "solution to the fairness issue", but we've dug the problems deeper and discovered new levels of complexity.)

Darren McKee

Best-Selling author of "Uncontrollable: The Threat of Artificial Superintelligence and the Race to Save the World"
8mo
AIs can generate new knowledge now. It's not yet 2024. What does next year look like? and 2025? and 2030? We aren't prepared for the rate of change and the increase in AI capabilities.
Pushmeet Kohli

VP of Research, Google DeepMind
9mo

Can LLMs uncover new knowledge - the scientific equivalent of AlphaGo’s move 37? In a paper that appears in Nature today, our team at Google DeepMind shows how FunSearch, our new LLM approach for searching in the space of programs, can generate new, verifiable knowledge in mathematics and computing. This work marks a significant leap forward, showing us how we can capitalise on the creativity of LLMs while ensuring the accuracy and reliability of their outputs. Alongside the paper, we are also releasing the functions and solutions discovered by FunSearch. Our hope is that this work inspires others to think about the ways we can work with LLMs to push the boundaries of algorithmics, mathematics and science. Read more about this brilliant piece of new research in our blog post below: https://lnkd.in/d8TvpzwE
Nicholas Del Negro

AI Agent Engineer / Generative AI Consultant
1mo
Daniel, this workshop sounds incredibly insightful! I'm particularly interested in your comparisons between Triton and CUDA, as well as your deep dive into the Llama model and its bug fixes. Thanks for sharing these valuable resources and for your dedication to advancing our understanding of LLMs. Looking forward to more content like this! #AI #ArtificialIntelligence #MachineLearning #DevOps #SoftwareDevelopment #Programming #FutureOfWork #Automation #Innovation #Technology #CareerDevelopment #Upskilling #Reskilling #DigitalTransformation #ITJobs #TechJobs #DeveloperProductivity #FinTech #HealthTech #CyberSecurity #DataScience #CloudComputing #InternetOfThings (IoT)

Daniel Han

unsloth.ai - open-source AI training
1mo Edited

My Low Level Technicals of LLMs 3 hour workshop is out! I talk about:
1. Triton vs CUDA
2. Why training is O(N^2) not cubic
3. GPT2 vs Llama
4. Why causal masks, layernorms, RoPE, SwiGLU
5. Bug fixes for Llama, Gemma, Phi
6. Backprop engine in Unsloth AI
3 hour workshop video: https://lnkd.in/gxjj2NPH
And my 20 min talk where I cover how we fixed & found multiple issues in the latest Llama 3 model: https://lnkd.in/gg-6Q63E
Thanks to Shawn swyx W for inviting me to the AI Engineer World's Fair late June!

Low Level Technicals of LLMs: Daniel Han

https://www.youtube.com/
