May 10, 2025

These leaderboards rank models on their ability to solve programming problems, complete code snippets, or write functions from docstrings. They are essential for evaluating code models such as Code Llama, StarCoder, and GPT-4 Code Interpreter. Common datasets include HumanEval, MBPP, and CodeContests.

Tools:
BigCode Leaderboard – Benchmarks open-source code models on multiple coding challenges and reports pass@k metrics.
EvalPlus Leaderboard – Focuses on program synthesis, scoring models on HumanEval+, an extended version of HumanEval with additional test cases per problem.
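
For anyone wondering what pass@k actually measures: it estimates the probability that at least one of k sampled completions for a problem passes all of that problem's tests. Below is a minimal sketch of the unbiased estimator popularized alongside HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated and c is how many of them passed. The function names and the (n, c) input layout are my own illustrative choices, and I'm assuming, not claiming, that a given leaderboard computes its scores this way.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset has only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds hypothetical (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark score is then the average over all problems, for example `mean_pass_at_k([(10, 5), (10, 10)], k=1)` for two problems with 10 samples each. Sampling more than k completions per problem (n > k) is what keeps the estimate from being too noisy.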