SOTAVerified

Benchmarking

Papers

Showing 2101–2150 of 5548 papers

Title | Status | Hype
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation | Code | 1
Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications | | 0
1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation | | 0
VisionAD, a software package of performant anomaly detection algorithms, and Proportion Localised, an interpretable metric | Code | 0
Behavior Structformer: Learning Players Representations with Structured Tokenization | | 0
GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models | | 0
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild | Code | 3
Deep Jansen-Rit Parameter Inference for Model-Driven Analysis of Brain Activity | Code | 0
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Scenarios and Approaches for Situated Natural Language Explanations | | 0
Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation | | 0
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs | Code | 3
Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking | | 0
Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation | | 0
Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As | | 0
Statistical Multicriteria Benchmarking via the GSD-Front | | 0
Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving | Code | 4
Better Late Than Never: Formulating and Benchmarking Recommendation Editing | Code | 0
Time Sensitive Knowledge Editing through Efficient Finetuning | | 0
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning | | 0
MLVU: Benchmarking Multi-task Long Video Understanding | Code | 3
Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices | | 0
BEADs: Bias Evaluation Across Domains | | 0
TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising | Code | 1
Comparative Benchmarking of Failure Detection Methods in Medical Image Segmentation: Unveiling the Role of Confidence Aggregation | | 0
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection | | 0
CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark | Code | 1
Hyperbolic Benchmarking Unveils Network Topology-Feature Relationship in GNN Performance | Code | 0
ACCORD: Closing the Commonsense Measurability Gap | Code | 0
Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check | | 0
MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset | Code | 0
Analyzing the Feature Extractor Networks for Face Image Synthesis | Code | 0
TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability | Code | 0
An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders | Code | 1
Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs | | 0
R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models | | 0
ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection | | 0
LanEvil: Benchmarking the Robustness of Lane Detection to Environmental Illusions | | 0
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics | Code | 1
TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine | Code | 2
Scaffold Splits Overestimate Virtual Screening Performance | | 0
WebSuite: Systematically Evaluating Why Web Agents Fail | Code | 0
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models | Code | 1
On the project risk baseline: integrating aleatory uncertainty into project scheduling | | 0
LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild | Code | 1
SECURE: Benchmarking Large Language Models for Cybersecurity | Code | 1
Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images | | 0
Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | Code | 1
CoSy: Evaluating Textual Explanations of Neurons | | 0
Page 43 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | – | Unverified
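The table above pairs a claimed metric with a verified (reproduced) value and a status; since no verified value is listed here, the claim is marked Unverified. A minimal sketch of that comparison logic, assuming a simple absolute-tolerance rule and status names of our own choosing (this is not SOTAVerified's actual pipeline):

```python
# Hypothetical sketch: derive a claim's status from a claimed metric and
# an optional reproduced ("verified") value. The tolerance and the
# "Disputed" label are assumptions for illustration only.
from typing import Optional


def verification_status(claimed: float, verified: Optional[float],
                        tol: float = 0.01) -> str:
    """Return a status string for a single benchmark claim."""
    if verified is None:
        return "Unverified"   # no reproduction run recorded yet
    if abs(claimed - verified) <= tol:
        return "Verified"     # reproduction matches the claim within tol
    return "Disputed"         # reproduction disagrees with the claim


# Example: GPT-4 Turbo's ACC claim of 0.56 with no reproduction yet
print(verification_status(0.56, None))  # → Unverified
```

A tolerance band rather than exact equality is the natural choice here, since re-running a benchmark rarely reproduces a floating-point metric bit-for-bit.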