SOTAVerified

Benchmarking

Papers

Showing 841-850 of 5548 papers

Title | Status | Hype
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Code | 1
UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation | Code | 1
Benchmarking Robustness of Text-Image Composed Retrieval | Code | 1
IMGTB: A Framework for Machine-Generated Text Detection Benchmarking | Code | 1
BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Code | 1
Towards a more inductive world for drug repurposing approaches | Code | 1
LogLead -- Fast and Integrated Log Loader, Enhancer, and Anomaly Detector | Code | 1
Benchmarking Pathology Feature Extractors for Whole Slide Image Classification | Code | 1
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Code | 1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified