Benchmarking

Papers


Showing 951–1000 of 5548 papers

Title	Date	Tasks	Status	Hype
Machine learning for modelling unstructured grid data in computational physics: a review	Feb 13, 2025	Benchmarking	Unverified	0
SkyRover: A Modular Simulator for Cross-Domain Pathfinding	Feb 13, 2025	Benchmarking	Unverified	0
LOB-Bench: Benchmarking Generative AI for Finance -- an Application to Limit Order Book Data	Feb 13, 2025	Benchmarking, State Space Models	Code Available	1
Handwritten Text Recognition: A Survey	Feb 12, 2025	Benchmarking, Handwritten Text Recognition	Unverified	0
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance	Feb 12, 2025	Benchmarking, Long-Context Understanding	Code Available	2
One-Shot Federated Learning with Classifier-Free Diffusion Models	Feb 12, 2025	Benchmarking, Dataset Generation	Unverified	0
Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors	Feb 12, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation	Feb 11, 2025	Benchmarking, De-identification	Code Available	0
exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem	Feb 11, 2025	Benchmarking, Diversity	Code Available	0
Foundation Model of Electronic Medical Records for Adaptive Risk Estimation	Feb 10, 2025	Benchmarking	Code Available	1
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations	Feb 10, 2025	Benchmarking, In-Context Learning	Unverified	0
Accelerating Data Processing and Benchmarking of AI Models for Pathology	Feb 10, 2025	Benchmarking	Code Available	4
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation	Feb 10, 2025	Benchmarking	Unverified	0
CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories	Feb 10, 2025	Benchmarking	Unverified	0
Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring	Feb 10, 2025	Benchmarking	Code Available	0
Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments	Feb 10, 2025	Benchmarking, Optical Character Recognition	Code Available	1
Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)	Feb 9, 2025	Benchmarking, CPU	Unverified	0
Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models	Feb 9, 2025	Benchmarking, Code Generation	Unverified	0
Mol-MoE: Training Preference-Guided Routers for Molecule Generation	Feb 8, 2025	Benchmarking, Drug Design	Code Available	0
ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts	Feb 8, 2025	Benchmarking, Self-Supervised Learning	Code Available	1
Surprise Potential as a Measure of Interactivity in Driving Scenarios	Feb 8, 2025	Benchmarking	Unverified	0
ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks	Feb 7, 2025	Benchmarking	Code Available	3
An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks	Feb 7, 2025	Benchmarking, Multi-agent Reinforcement Learning	Code Available	1
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound	Feb 7, 2025	Benchmarking	Code Available	4
Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs	Feb 6, 2025	Benchmarking, Epidemiology	Code Available	0
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models	Feb 6, 2025	Benchmarking, Emotional Intelligence	Unverified	0
Verifiable Format Control for Large Language Model Generations	Feb 6, 2025	Benchmarking, Instruction Following	Unverified	0
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization	Feb 6, 2025	Benchmarking, Uncertainty Quantification	Unverified	0
LUND-PROBE -- LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset	Feb 6, 2025	Benchmarking, Computed Tomography (CT)	Unverified	0
Large Language Models for Multi-Robot Systems: A Survey	Feb 6, 2025	Action Generation, Benchmarking	Code Available	1
SoK: Benchmarking Poisoning Attacks and Defenses in Federated Learning	Feb 6, 2025	Benchmarking, Data Poisoning	Code Available	2
Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples	Feb 6, 2025	Benchmarking, DeepFake Detection	Code Available	0
PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data	Feb 6, 2025	Benchmarking, Time Series	Code Available	0
Benchmarking Time Series Forecasting Models: From Statistical Techniques to Foundation Models in Real-World Applications	Feb 5, 2025	Benchmarking, Feature Engineering	Unverified	0
TGB-Seq Benchmark: Challenging Temporal GNNs with Complex Sequential Dynamics	Feb 5, 2025	Benchmarking, Link Prediction	Code Available	0
MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf	Feb 5, 2025	Benchmarking, Scheduling	Unverified	0
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation	Feb 5, 2025	Benchmarking, Large Language Model	Code Available	2
Optimal PMU Placement for Kalman Filtering of DAE Power System Models	Feb 5, 2025	Benchmarking, State Estimation	Unverified	0
Energy & Force Regression on DFT Trajectories is Not Enough for Universal Machine Learning Interatomic Potentials	Feb 5, 2025	Benchmarking	Unverified	0
PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design	Feb 5, 2025	Benchmarking, Prompt Engineering	Code Available	1
xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods	Feb 5, 2025	Benchmarking	Unverified	0
LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation	Feb 4, 2025	Benchmarking, Classification	Unverified	0
Dynamic benchmarking framework for LLM-based conversational data capture	Feb 4, 2025	Benchmarking	Unverified	0
Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation	Feb 4, 2025	Benchmarking, Information Retrieval	Code Available	4
Evalita-LLM: Benchmarking Large Language Models on Italian	Feb 4, 2025	Benchmarking, Multiple-choice	Unverified	0
Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models	Feb 4, 2025	Benchmarking, Decision Making	Unverified	0
A comparison of translation performance between DeepL and Supertext	Feb 4, 2025	Benchmarking, Machine Translation	Code Available	0
No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets	Feb 4, 2025	All, Benchmarking	Code Available	0
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities	Feb 3, 2025	Benchmarking, Large Language Model	Unverified	0
MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation	Feb 3, 2025	Benchmarking, Fairness	Unverified	0


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified