| Title | Date | Tasks | Code | Stars |
| --- | --- | --- | --- | --- |
| MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset | Jun 4, 2024 | Benchmarking | Code Available | 0 |
| ACCORD: Closing the Commonsense Measurability Gap | Jun 4, 2024 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| Analyzing the Feature Extractor Networks for Face Image Synthesis | Jun 4, 2024 | Benchmarking, Image Generation | Code Available | 0 |
| An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders | Jun 4, 2024 | Benchmarking, Clustering | Code Available | 1 |
| Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs | Jun 4, 2024 | Benchmarking, Fairness | Unverified | 0 |
| TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability | Jun 4, 2024 | Benchmarking, Language Modeling | Code Available | 0 |
| R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models | Jun 3, 2024 | Benchmarking, Code Completion | Unverified | 0 |
| ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection | Jun 3, 2024 | Action Recognition, Benchmarking | Unverified | 0 |
| LanEvil: Benchmarking the Robustness of Lane Detection to Environmental Illusions | Jun 3, 2024 | Autonomous Driving, Benchmarking | Unverified | 0 |
| TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine | Jun 3, 2024 | Benchmarking, Question Answering | Code Available | 2 |