SOTAVerified

Benchmarking

Papers

Showing 2151–2200 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Benchmarking Reasoning Robustness in Large Language Models | | 0 |
| Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases | Code | 0 |
| Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms | | 0 |
| Eventprop training for efficient neuromorphic applications | | 0 |
| Benchmarking Dynamic SLO Compliance in Distributed Computing Continuum Systems | Code | 0 |
| Towards Universal Learning-based Model for Cardiac Image Reconstruction: Summary of the CMRxRecon2024 Challenge | | 0 |
| AttackSeqBench: Benchmarking Large Language Models' Understanding of Sequential Patterns in Cyber Attacks | Code | 0 |
| GNNMerge: Merging of GNN Models Without Accessing Training Data | Code | 0 |
| Technical report of a DMD-based Characterization Method for Vision Sensors | | 0 |
| Evaluation of Architectural Synthesis Using Generative AI | | 0 |
| A2Perf: Real-World Autonomous Agents Benchmark | | 0 |
| Optimizing open-domain question answering with graph-based retrieval augmented generation | | 0 |
| Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics | | 0 |
| MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages | Code | 0 |
| Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models | | 0 |
| Multi-Agent Reinforcement Learning with Long-Term Performance Objectives for Service Workforce Optimization | | 0 |
| FunBench: Benchmarking Fundus Reading Skills of MLLMs | | 0 |
| MAPS: Multi-Fidelity AI-Augmented Photonic Simulation and Inverse Design Infrastructure | | 0 |
| Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks | | 0 |
| A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information | | 0 |
| Solar Multimodal Transformer: Intraday Solar Irradiance Predictor using Public Cameras and Time Series | | 0 |
| Large Language Model-Based Benchmarking Experiment Settings for Evolutionary Multi-Objective Optimization | | 0 |
| NeuroMorse: A Temporally Structured Dataset For Neuromorphic Computing | Code | 0 |
| ProBench: Benchmarking Large Language Models in Competitive Programming | | 0 |
| PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice | | 0 |
| ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments | | 0 |
| Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches | Code | 0 |
| MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | | 0 |
| LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping | Code | 0 |
| Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review | | 0 |
| Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10 | | 0 |
| Modelling Regional Solar Photovoltaic Capacity in Great Britain | | 0 |
| Agentic Mixture-of-Workflows for Multi-Modal Chemical Search | | 0 |
| MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering | | 0 |
| MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors | | 0 |
| Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval | | 0 |
| Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning | Code | 0 |
| Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers | | 0 |
| CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs | | 0 |
| A Real-time Spatio-Temporal Trajectory Planner for Autonomous Vehicles with Semantic Graph Optimization | | 0 |
| OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation | | 0 |
| MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering | Code | 0 |
| SynthRAD2025 Grand Challenge dataset: generating synthetic CTs for radiotherapy | | 0 |
| Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement | | 0 |
| Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties | Code | 0 |
| Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking | | 0 |
| On Neural Inertial Classification Networks for Pedestrian Activity Recognition | | 0 |
| An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science | Code | 0 |
| VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs | | 0 |
| VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models | Code | 0 |
Page 44 of 111

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |