| xai_evals: A Framework for Evaluating Post-Hoc Local Explanation Methods | Feb 5, 2025 | Benchmarking | Unverified | 0 |
| LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation | Feb 4, 2025 | Benchmarking, Classification | Unverified | 0 |
| Dynamic benchmarking framework for LLM-based conversational data capture | Feb 4, 2025 | Benchmarking | Unverified | 0 |
| Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation | Feb 4, 2025 | Benchmarking, Information Retrieval | Code Available | 4 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | Feb 4, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models | Feb 4, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| A comparison of translation performance between DeepL and Supertext | Feb 4, 2025 | Benchmarking, Machine Translation | Code Available | 0 |
| No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets | Feb 4, 2025 | All, Benchmarking | Code Available | 0 |
| Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | Feb 3, 2025 | Benchmarking, Large Language Model | Unverified | 0 |
| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | Feb 3, 2025 | Benchmarking, Fairness | Unverified | 0 |