SOTAVerified

Benchmarking

Papers

Showing 4351–4400 of 5548 papers

Title | Status | Hype
When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks | | 0
When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | | 0
Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | | 0
Which models are innately best at uncertainty estimation? | | 0
White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs | | 0
Who Said That? Benchmarking Social Media AI Detection | | 0
Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice | | 0
Why every GBDT speed benchmark is wrong | | 0
Why is the winner the best? | | 0
WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | | 0
Wildfire Forecasting with Satellite Images and Deep Generative Model | | 0
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | | 0
Window-of-interest based Multi-objective Evolutionary Search for Satisficing Concepts | | 0
WiSoSuper: Benchmarking Super-Resolution Methods on Wind and Solar Data | | 0
Word Complexity Estimation for Japanese Lexical Simplification | | 0
WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models | | 0
Writing as a testbed for open ended agents | | 0
xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods | | 0
XCSP3: An Integrated Format for Benchmarking Combinatorial Constrained Problems | | 0
XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis | | 0
Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval | | 0
Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support | | 0
Yet Another ADNI Machine Learning Paper? Paving The Way Towards Fully-reproducible Research on Classification of Alzheimer's Disease | | 0
You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain | | 0
Zero-Forcing Max-Power Beamforming for Hybrid mmWave Full-Duplex MIMO Systems | | 0
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models | | 0
Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | | 0
λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics | | 0
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | | 0
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama | | 0
LAMBDA: Covering the Solution Set of Black-Box Inequality by Search Space Quantization | | 0
Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification | | 0
LanEvil: Benchmarking the Robustness of Lane Detection to Environmental Illusions | | 0
Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | | 0
Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance | | 0
Language Models for Automated Classification of Brain MRI Reports and Growth Chart Generation | | 0
Can LLMs Capture Human Preferences? | | 0
Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning | | 0
Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices | | 0
Large Language Models are Null-Shot Learners | | 0
Large Language Models are Few-Shot Clinical Information Extractors | | 0
Large Language Models as Automated Aligners for benchmarking Vision-Language Models | | 0
Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens | | 0
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level | | 0
Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding | | 0
Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models | | 0
Large-scale Benchmarking of Metaphor-based Optimization Heuristics | | 0
Large-Scale Quantum Separability Through a Reproducible Machine Learning Lens | | 0
Latency-aware Road Anomaly Segmentation in Videos: A Photorealistic Dataset and New Metrics | | 0
Latent Variable Models for Visual Question Answering | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified