SOTAVerified

Benchmarking

Papers

Showing 1051–1100 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| An Interpretable Measure for Quantifying Predictive Dependence between Continuous Random Variables -- Extended Version | | 0 |
| ColorGrid: A Multi-Agent Non-Stationary Environment for Goal Inference and Assistance | Code | 0 |
| FORLAPS: An Innovative Data-Driven Reinforcement Learning Approach for Prescriptive Process Monitoring | | 0 |
| PixelBrax: Learning Continuous Control from Pixels End-to-End on the GPU | Code | 0 |
| Village-Net Clustering: A Rapid approach to Non-linear Unsupervised Clustering of High-Dimensional Data | | 0 |
| SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation | Code | 5 |
| Off-policy Evaluation for Payments at Adyen | | 0 |
| Cancer-Net PCa-Seg: Benchmarking Deep Learning Models for Prostate Cancer Segmentation Using Synthetic Correlated Diffusion Imaging | | 0 |
| MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents | | 0 |
| Similarity-Quantized Relative Difference Learning for Improved Molecular Activity Prediction | | 0 |
| ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Code | 1 |
| Benchmarking Robustness of Contrastive Learning Models for Medical Image-Report Retrieval | | 0 |
| Evaluating SAT and SMT Solvers on Large-Scale Sudoku Puzzles | Code | 0 |
| Multimodal LLMs Can Reason about Aesthetics in Zero-Shot | Code | 1 |
| Keras Sig: Efficient Path Signature Computation on GPU in Keras 3 | | 0 |
| Benchmarking Classical, Deep, and Generative Models for Human Activity Recognition | | 0 |
| Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models | Code | 4 |
| Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features | | 0 |
| Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving | | 0 |
| Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings | | 0 |
| Data-driven inventory management for new products: An adjusted Dyna-Q approach with transfer learning | | 0 |
| Benchmarking Graph Representations and Graph Neural Networks for Multivariate Time Series Classification | Code | 0 |
| Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles | | 0 |
| Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks | Code | 0 |
| Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI | | 0 |
| The Paradox of Success in Evolutionary and Bioinspired Optimization: Revisiting Critical Issues, Key Studies, and Methodological Pathways | | 0 |
| TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations | Code | 1 |
| WebWalker: Benchmarking LLMs in Web Traversal | Code | 11 |
| Lessons From Red Teaming 100 Generative AI Products | | 0 |
| ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian | Code | 1 |
| Benchmarking YOLOv8 for Optimal Crack Detection in Civil Infrastructure | | 0 |
| Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis | Code | 1 |
| Evidential Deep Learning for Uncertainty Quantification and Out-of-Distribution Detection in Jet Identification using Deep Neural Networks | Code | 0 |
| Benchmarking Rotary Position Embeddings for Automatic Speech Recognition | | 0 |
| DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information | Code | 1 |
| AgoraSpeech: A multi-annotated comprehensive dataset of political discourse through the lens of humans and AI | | 0 |
| OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Code | 2 |
| Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning | | 0 |
| CallNavi, A Challenge and Empirical Study on LLM Function Calling and Routing | | 0 |
| VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models | Code | 1 |
| Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models | | 0 |
| LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation | | 0 |
| Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks | | 0 |
| Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization | | 0 |
| IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Code | 0 |
| An Analysis of Model Robustness across Concurrent Distribution Shifts | | 0 |
| Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding | | 0 |
| Machine Learning for Identifying Grain Boundaries in Scanning Electron Microscopy (SEM) Images of Nanoparticle Superlattices | | 0 |
| The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input | | 0 |
| Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |