SOTAVerified

Benchmarking

Papers

Showing 951-975 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| LOB-Bench: Benchmarking Generative AI for Finance -- an Application to Limit Order Book Data | Code | 1 |
| Machine learning for modelling unstructured grid data in computational physics: a review | | 0 |
| SkyRover: A Modular Simulator for Cross-Domain Pathfinding | | 0 |
| Handwritten Text Recognition: A Survey | | 0 |
| One-Shot Federated Learning with Classifier-Free Diffusion Models | | 0 |
| Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Code | 2 |
| Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors | | 0 |
| exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem | Code | 0 |
| The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation | Code | 0 |
| Foundation Model of Electronic Medical Records for Adaptive Risk Estimation | Code | 1 |
| MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations | | 0 |
| Accelerating Data Processing and Benchmarking of AI Models for Pathology | Code | 4 |
| Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation | | 0 |
| CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories | | 0 |
| Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring | Code | 0 |
| Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments | Code | 1 |
| Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm) | | 0 |
| Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models | | 0 |
| Mol-MoE: Training Preference-Guided Routers for Molecule Generation | Code | 0 |
| Surprise Potential as a Measure of Interactivity in Driving Scenarios | | 0 |
| ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts | Code | 1 |
| ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks | Code | 3 |
| An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Code | 1 |
| Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound | Code | 4 |
| Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |