SOTAVerified

Benchmarking

Papers

Showing 1301–1350 of 5548 papers

Title | Status | Hype
Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling | Code | 0
PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series | Code | 0
Multi-Agent Environments for Vehicle Routing Problems | Code | 1
Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking | | 0
Benchmarking a wide range of optimisers for solving the Fermi-Hubbard model using the variational quantum eigensolver | | 0
Delta-Influence: Unlearning Poisons via Influence Functions | Code | 0
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | Code | 5
BelHouse3D: A Benchmark Dataset for Assessing Occlusion Robustness in 3D Point Cloud Semantic Segmentation | | 0
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | | 0
The Moral Mind(s) of Large Language Models | | 0
Integrating Dynamic Correlation Shifts and Weighted Benchmarking in Extreme Value Analysis | | 0
Benchmarking Positional Encodings for GNNs and Graph Transformers | Code | 0
DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models | Code | 1
Introducing Milabench: Benchmarking Accelerators for AI | Code | 1
Benchmarking pre-trained text embedding models in aligning built asset information | Code | 0
Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts | Code | 0
Reinforcing Competitive Multi-Agents for Playing So Long Sucker | | 0
Countering Backdoor Attacks in Image Recognition: A Survey and Evaluation of Mitigation Strategies | | 0
Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML | | 0
FastDraft: How to Train Your Draft | | 0
Towards a Comprehensive Benchmark for Pathological Lymph Node Metastasis in Breast Cancer Sections | Code | 0
The Oxford Spires Dataset: Benchmarking Large-Scale LiDAR-Visual Localisation, Reconstruction and Radiance Field Methods | | 0
The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering | | 0
Automated Coding of Communications in Collaborative Problem-solving Tasks Using ChatGPT | | 0
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level | | 0
WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking | | 0
A survey of probabilistic generative frameworks for molecular simulations | Code | 0
Caravan MultiMet: Extending Caravan with Multiple Weather Nowcasts and Forecasts | Code | 3
BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation | Code | 0
Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset | Code | 0
A Survey on Vision Autoregressive Model | | 0
HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere | | 0
FM-TS: Flow Matching for Time Series Generation | Code | 1
Evaluating the Generation of Spatial Relations in Text and Image Generative Models | | 0
Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation | Code | 0
BuckTales : A multi-UAV dataset for multi-object tracking and re-identification of wild antelopes | | 0
General Geospatial Inference with a Population Dynamics Foundation Model | Code | 3
Benchmarking LLMs' Judgments with No Gold Standard | Code | 0
Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification | Code | 1
MolMiner: Towards Controllable, 3D-Aware, Fragment-Based Molecular Design | | 0
Low Dynamic Range for RIS-aided Bistatic Integrated Sensing and Communication | | 0
Benchmarking 3D multi-coil NC-PDNet MRI reconstruction | | 0
FactLens: Benchmarking Fine-Grained Fact Verification | | 0
Open-set object detection: towards unified problem formulation and benchmarking | | 0
Benchmarking Distributional Alignment of Large Language Models | Code | 0
A Retrospective on the Robot Air Hockey Challenge: Benchmarking Robust, Reliable, and Safe Learning Techniques for Real-world Robotics | | 0
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding | | 0
Performance-Guided LLM Knowledge Distillation for Efficient Text Classification at Scale | | 0
Deep Learning Models for UAV-Assisted Bridge Inspection: A YOLO Benchmark Analysis | | 0
HandCraft: Anatomically Correct Restoration of Malformed Hands in Diffusion Generated Images | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified