SOTAVerified

Benchmarking

Papers

Showing 951–1000 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Machine learning for modelling unstructured grid data in computational physics: a review | | 0 |
| SkyRover: A Modular Simulator for Cross-Domain Pathfinding | | 0 |
| LOB-Bench: Benchmarking Generative AI for Finance -- an Application to Limit Order Book Data | Code | 1 |
| Handwritten Text Recognition: A Survey | | 0 |
| Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Code | 2 |
| One-Shot Federated Learning with Classifier-Free Diffusion Models | | 0 |
| Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors | | 0 |
| The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation | Code | 0 |
| exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem | Code | 0 |
| Foundation Model of Electronic Medical Records for Adaptive Risk Estimation | Code | 1 |
| MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations | | 0 |
| Accelerating Data Processing and Benchmarking of AI Models for Pathology | Code | 4 |
| Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation | | 0 |
| CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories | | 0 |
| Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring | Code | 0 |
| Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments | Code | 1 |
| Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm) | | 0 |
| Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models | | 0 |
| Mol-MoE: Training Preference-Guided Routers for Molecule Generation | Code | 0 |
| ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts | Code | 1 |
| Surprise Potential as a Measure of Interactivity in Driving Scenarios | | 0 |
| ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks | Code | 3 |
| An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Code | 1 |
| Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound | Code | 4 |
| Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs | Code | 0 |
| EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models | | 0 |
| Verifiable Format Control for Large Language Model Generations | | 0 |
| Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization | | 0 |
| LUND-PROBE -- LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset | | 0 |
| Large Language Models for Multi-Robot Systems: A Survey | Code | 1 |
| SoK: Benchmarking Poisoning Attacks and Defenses in Federated Learning | Code | 2 |
| Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples | Code | 0 |
| PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data | Code | 0 |
| Benchmarking Time Series Forecasting Models: From Statistical Techniques to Foundation Models in Real-World Applications | | 0 |
| TGB-Seq Benchmark: Challenging Temporal GNNs with Complex Sequential Dynamics | Code | 0 |
| MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf | | 0 |
| Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation | Code | 2 |
| Optimal PMU Placement for Kalman Filtering of DAE Power System Models | | 0 |
| Energy & Force Regression on DFT Trajectories is Not Enough for Universal Machine Learning Interatomic Potentials | | 0 |
| PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design | Code | 1 |
| xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods | | 0 |
| LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation | | 0 |
| Dynamic benchmarking framework for LLM-based conversational data capture | | 0 |
| Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation | Code | 4 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | | 0 |
| Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models | | 0 |
| A comparison of translation performance between DeepL and Supertext | Code | 0 |
| No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets | Code | 0 |
| Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | | 0 |
| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | | 0 |
Page 20 of 111

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |