SOTAVerified

Benchmarking

Papers

Showing 2251–2300 of 5548 papers

Title | Status | Hype
Forecasting time series with constraints | Code | 0
SkyRover: A Modular Simulator for Cross-Domain Pathfinding | | 0
AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit | | 0
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis | | 0
Zero-shot generation of synthetic neurosurgical data with large language models | Code | 0
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | | 0
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | | 0
A Survey on LLM-based News Recommender Systems | | 0
Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation | | 0
Machine learning for modelling unstructured grid data in computational physics: a review | | 0
Handwritten Text Recognition: A Survey | | 0
Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors | | 0
One-Shot Federated Learning with Classifier-Free Diffusion Models | | 0
exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem | Code | 0
The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation | Code | 0
CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories | | 0
Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring | Code | 0
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation | | 0
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations | | 0
Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm) | | 0
Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models | | 0
Surprise Potential as a Measure of Interactivity in Driving Scenarios | | 0
Mol-MoE: Training Preference-Guided Routers for Molecule Generation | Code | 0
LUND-PROBE -- LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset | | 0
Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples | Code | 0
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization | | 0
Verifiable Format Control for Large Language Model Generations | | 0
PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data | Code | 0
Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs | Code | 0
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models | | 0
Energy & Force Regression on DFT Trajectories is Not Enough for Universal Machine Learning Interatomic Potentials | | 0
Optimal PMU Placement for Kalman Filtering of DAE Power System Models | | 0
xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods | | 0
Benchmarking Time Series Forecasting Models: From Statistical Techniques to Foundation Models in Real-World Applications | | 0
TGB-Seq Benchmark: Challenging Temporal GNNs with Complex Sequential Dynamics | Code | 0
MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf | | 0
LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation | | 0
No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets | Code | 0
Evalita-LLM: Benchmarking Large Language Models on Italian | | 0
A comparison of translation performance between DeepL and Supertext | Code | 0
Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models | | 0
Dynamic benchmarking framework for LLM-based conversational data capture | | 0
MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | | 0
SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering | | 0
EdgeMark: An Automation and Benchmarking System for Embedded Artificial Intelligence Tools | | 0
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | | 0
Learned Bayesian Cramér-Rao Bound for Unknown Measurement Models Using Score Neural Networks | Code | 0
True Online TD-Replan(lambda) Achieving Planning through Replaying | | 0
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | | 0
Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified