SOTAVerified

Benchmarking

Papers

Showing 76–100 of 5548 papers

Title | Status | Hype
GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies | Code | 1
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge | Code | 0
Egocentric Human-Object Interaction Detection: A New Benchmark and Method | - | 0
The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products | Code | 1
C-TLSAN: Content-Enhanced Time-Aware Long- and Short-Term Attention Network for Personalized Recommendation | Code | 0
A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects | - | 0
Deep Diffusion Models and Unsupervised Hyperspectral Unmixing for Realistic Abundance Map Synthesis | - | 0
Few-Shot Learning for Industrial Time Series: A Comparative Analysis Using the Example of Screw-Fastening Process Monitoring | - | 0
Robustness of Reinforcement Learning-Based Traffic Signal Control under Incidents: A Comparative Study | - | 0
JENGA: Object selection and pose estimation for robotic grasping from a stack | - | 0
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
A large-scale, physically-based synthetic dataset for satellite pose estimation | - | 0
MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios | Code | 0
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics | Code | 4
Delving into Instance-Dependent Label Noise in Graph Data: A Comprehensive Study and Benchmark | Code | 0
ANIRA: An Architecture for Neural Network Inference in Real-Time Audio Applications | Code | 3
Learning Best Paths in Quantum Networks | - | 0
Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables | - | 0
SemanticST: Spatially Informed Semantic Graph Learning for Clustering, Integration, and Scalable Analysis of Spatial Transcriptomics | - | 0
Temporal cross-validation impacts multivariate time series subsequence anomaly detection evaluation | - | 0
crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023 | - | 0
EconGym: A Scalable AI Testbed with Diverse Economic Tasks | - | 0
Mind the XAI Gap: A Human-Centered LLM Framework for Democratizing Explainable AI | Code | 0
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks | Code | 2
HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified