General-Purpose LLM Leaderboards

May 10, 2025

These leaderboards benchmark large language models (LLMs) across a wide variety of tasks, such as reasoning, coding, factual recall, summarization, and conversation. They typically evaluate open-source and proprietary models using common datasets like MMLU, HellaSwag, GSM8K, ARC, and TruthfulQA. They are well suited to developers, researchers, and companies comparing model accuracy, latency, cost, and safety before deployment. Some include crowd-voted scores (like Elo rankings), while others offer structured benchmarking scripts. These tools help you make informed choices between GPT-4, Claude, Mixtral, LLaMA, and similar models.

Tools:

Hugging Face Open LLM Leaderboard – Tracks open-source models across key benchmarks with performance scores, model size, and licensing info.
KLU.ai LLM Leaderboard – Provides an interactive leaderboard for LLMs, focusing on cost, latency, and hallucination rate.
LMSYS Chatbot Arena (via lmarena.ai & openlm.ai) – Crowdsources pairwise human preferences to produce Elo rankings of LLMs like GPT-4, Claude, Gemini, and Mixtral.
Aider LLM Leaderboard – Ranks LLMs based on performance inside the Aider coding assistant, focusing on dev workflows and code generation quality.
LLM Extractum Leaderboard – Offers performance comparisons across structured question-answer datasets and reasoning benchmarks.
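
For anyone wondering how crowd-voted pairwise preferences can turn into an Elo-style ranking, here is a minimal Python sketch of the classic online Elo update. It is only an illustration and not the actual method used by any leaderboard listed above. The starting rating of 1000, the K-factor of 32, and the model names are all assumptions made for the example.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Update ratings in place after one vote.

    outcome is 1.0 if model_a won, 0.0 if model_b won, 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    delta = k * (outcome - expected_score(ratings[model_a], ratings[model_b]))
    ratings[model_a] += delta
    ratings[model_b] -= delta


def rank_models(votes, initial=1000.0, k=32.0):
    """Process (model_a, model_b, outcome) votes in order; return a sorted ranking."""
    ratings = defaultdict(lambda: initial)  # assumed starting rating
    for model_a, model_b, outcome in votes:
        update_elo(ratings, model_a, model_b, outcome, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical votes, for illustration only.
    votes = [
        ("model-x", "model-y", 1.0),
        ("model-y", "model-z", 0.5),
        ("model-x", "model-z", 1.0),
    ]
    for name, rating in rank_models(votes):
        print(f"{name}: {rating:.1f}")
```

Note that this simple online version depends on the order in which votes arrive, and the ratings only mean something relative to each other within the same pool of models.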
