SOTAVerified

Benchmarking

Papers

Showing 751-800 of 5548 papers

Title | Status | Hype
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Code | 1
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos | Code | 1
VeriContaminated: Assessing LLM-Driven Verilog Coding for Data Contamination | | 0
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era | | 0
Advancing Human-Machine Teaming: Concepts, Challenges, and Applications | | 0
Genicious: Contextual Few-shot Prompting for Insights Discovery | | 0
Language Models for Automated Classification of Brain MRI Reports and Growth Chart Generation | | 0
Dataset Properties Shape the Success of Neuroimaging-Based Patient Stratification: A Benchmarking Analysis Across Clustering Algorithms | | 0
Challenges and Advancements in Modeling Shock Fronts with Physics-Informed Neural Networks: A Review and Benchmarking Study | | 0
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama | | 0
Heterogeneous graph neural networks for species distribution modeling | | 0
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | | 0
InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences | | 0
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity | | 0
RESPONSE: Benchmarking the Ability of Language Models to Undertake Commonsense Reasoning in Crisis Situation | | 0
Dynamic Obstacle Avoidance with Bounded Rationality Adversarial Reinforcement Learning | | 0
Enhancing Hand Palm Motion Gesture Recognition by Eliminating Reference Frame Bias via Frame-Invariant Similarity Measures | | 0
A Benchmarking Study of Vision-based Robotic Grasping Algorithms | Code | 0
GNNs as Predictors of Agentic Workflow Performances | Code | 1
VisTai: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan | Code | 1
DarkBench: Benchmarking Dark Patterns in Large Language Models | | 0
TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs | | 0
ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content | | 0
CULEMO: Cultural Lenses on Emotion -- Benchmarking LLMs for Cross-Cultural Emotion Understanding | | 0
SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models | | 0
MarineGym: A High-Performance Reinforcement Learning Platform for Underwater Robotics | | 0
CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection | Code | 1
Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges | Code | 0
Robust Latent Matters: Boosting Image Generation with Sampling Error | Code | 3
nnInteractive: Redefining 3D Promptable Segmentation | Code | 3
Comprehensive Benchmarking of Machine Learning Methods for Risk Prediction Modelling from Large-Scale Survival Data: A UK Biobank Study | | 0
ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness | | 0
Ev-Layout: A Large-scale Event-based Multi-modal Dataset for Indoor Layout Estimation and Tracking | | 0
Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models | Code | 0
Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models | | 0
Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies | | 0
Illuminating Darkness: Enhancing Real-world Low-light Scenes with Smartphone Images | Code | 1
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning | Code | 2
Skelite: Compact Neural Networks for Efficient Iterative Skeletonization | Code | 0
Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | Code | 2
DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning | Code | 0
Steerable Pyramid Weighted Loss: Multi-Scale Adaptive Weighting for Semantic Segmentation | | 0
DependEval: Benchmarking LLMs for Repository Dependency Understanding | Code | 1
Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models | | 0
Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems | | 0
General Scales Unlock AI Evaluation with Explanatory and Predictive Power | | 0
Removing Multiple Hybrid Adverse Weather in Video via a Unified Model | | 0
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces | | 0
SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Code | 0
Understanding the Limits of Lifelong Knowledge Editing in LLMs | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified