SOTAVerified

Benchmarking

Papers

Showing 2651–2675 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Benchmarking Quality-Diversity Algorithms on Neuroevolution for Reinforcement Learning | | 0 |
| Benchmarking Quality-Dependent and Cost-Sensitive Score-Level Multimodal Biometric Fusion Algorithms | | 0 |
| Foundations for learning from noisy quantum experiments | | 0 |
| Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate | | 0 |
| FarsBase-KBP: A Knowledge Base Population System for the Persian Knowledge Graph | | 0 |
| Fantastic Questions and Where to Find Them: FairytaleQA – An Authentic Dataset for Narrative Comprehension | | 0 |
| AI PERSONA: Towards Life-long Personalization of LLMs | | 0 |
| Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension | | 0 |
| FRED: The Florence RGB-Event Drone Dataset | | 0 |
| Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models | | 0 |
| Free Performance Gain from Mixing Multiple Partially Labeled Samples in Multi-label Image Classification | | 0 |
| Benchmarking Single-Image Reflection Removal Algorithms | | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | | 0 |
| From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano | | 0 |
| Benchmarking projective simulation in navigation problems | | 0 |
| From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems | | 0 |
| A Survey on LLM-based News Recommender Systems | | 0 |
| From Code to Play: Benchmarking Program Search for Games Using Large Language Models | | 0 |
| From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks | | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | | 0 |
| Holistic Multi-View Building Analysis in the Wild with Projection Pooling | | 0 |
| How Aligned are Different Alignment Metrics? | | 0 |
| A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images | | 0 |
| From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future | | 0 |
| Benchmarking Processor Performance by Multi-Threaded Machine Learning Algorithms | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |