SOTAVerified

Benchmarking

Papers

Showing 3501-3550 of 5548 papers

Title | Status | Hype
Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat |  | 0
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations |  | 0
MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors |  | 0
Matrix-Free Preconditioning in Online Learning |  | 0
Benchmarking Large Language Model Volatility |  | 0
Benchmarking Large Language Models with Integer Sequence Generation Tasks |  | 0
Maximum Categorical Cross Entropy (MCCE): A noise-robust alternative loss function to mitigate racial bias in Convolutional Neural Networks (CNNs) by reducing overfitting |  | 0
MaxpoolNMS: Getting Rid of NMS Bottlenecks in Two-Stage Object Detectors |  | 0
Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting |  | 0
MBA-VO: Motion Blur Aware Visual Odometry |  | 0
Towards Class-agnostic Tracking Using Feature Decorrelation in Point Clouds |  | 0
MCDFN: Supply Chain Demand Forecasting via an Explainable Multi-Channel Data Fusion Network Model |  | 0
MCL-3D: a database for stereoscopic image quality assessment using 2D-image-plus-depth source |  | 0
Benchmarking Large Language Models with Augmented Instructions for Fine-grained Information Extraction |  | 0
MCUBench: A Benchmark of Tiny Object Detectors on MCUs |  | 0
MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification |  | 0
MDR-DeePC: Model-Inspired Distributionally Robust Data-Enabled Predictive Control |  | 0
Benchmarking Large Language Models via Random Variables |  | 0
Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language |  | 0
Measuring CLEVRness: Black-box Testing of Visual Reasoning Models |  | 0
Measuring CLEVRness: Blackbox testing of Visual Reasoning Models |  | 0
Measuring Large Language Models Capacity to Annotate Journalistic Sourcing |  | 0
Measuring the Complexity of Domains Used to Evaluate AI Systems |  | 0
Measuring the Effect of Causal Disentanglement on the Adversarial Robustness of Neural Network Models |  | 0
Towards Effective Disambiguation for Machine Translation with Large Language Models |  | 0
MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering |  | 0
MechProNet: Machine Learning Prediction of Mechanical Properties in Metal Additive Manufacturing |  | 0
Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models |  | 0
Benchmarking Large Language Models on Homework Assessment in Circuit Analysis |  | 0
Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs |  | 0
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization |  | 0
MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale |  | 0
Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments |  | 0
EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition |  | 0
What can 5.17 billion regression fits tell us about artificial models of the human visual system? |  | 0
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models |  | 0
Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques |  | 0
MedBrowseComp: Benchmarking Medical Deep Research and Computer Use |  | 0
Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking |  | 0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation |  | 0
MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering |  | 0
Knowledge-guided Contextual Gene Set Analysis Using Large Language Models |  | 0
MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine |  | 0
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models |  | 0
MediaEval 2018: Predicting Media Memorability Task |  | 0
Benchmarking Large Language Models for Handwritten Text Recognition |  | 0
MedMeshCNN -- Enabling MeshCNN for Medical Surface Models |  | 0
Benchmarking large language models for materials synthesis: the case of atomic layer deposition |  | 0
Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents |  | 0
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified