SOTAVerified

Benchmarking

Papers

Showing 1351-1375 of 5548 papers

Title | Status | Hype
Contemporary Symbolic Regression Methods and their Relative Performance | Code | 1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Code | 1
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs | Code | 1
Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification | Code | 1
A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games | Code | 1
Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Code | 1
Benchmarking Image Retrieval for Visual Localization | Code | 1
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Code | 1
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Code | 1
Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs | Code | 1
Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA | Code | 1
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
ReMeDi: Resources for Multi-domain, Multi-service, Medical Dialogues | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
Boosting Neural Image Compression for Machines Using Latent Space Masking | Code | 1
Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets | Code | 1
Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection | Code | 1
MALPOLON: A Framework for Deep Species Distribution Modeling | Code | 1
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1
High-Dimensional Inference in Bayesian Networks | Code | 1
Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified