Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 2251–2300 of 5548 papers

Title	Date	Tasks	Status
Forecasting time series with constraints	Feb 14, 2025	Additive modelsBenchmarking	CodeCode Available
SkyRover: A Modular Simulator for Cross-Domain Pathfinding	Feb 13, 2025	Benchmarking	—Unverified
AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit	Feb 13, 2025	BenchmarkingEdge-computing	—Unverified
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis	Feb 13, 2025	Benchmarking	—Unverified
Zero-shot generation of synthetic neurosurgical data with large language models	Feb 13, 2025	BenchmarkingDe-identification	CodeCode Available
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency	Feb 13, 2025	BenchmarkingMath	—Unverified
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents	Feb 13, 2025	Benchmarking	—Unverified
A Survey on LLM-based News Recommender Systems	Feb 13, 2025	BenchmarkingFairness	—Unverified
Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation	Feb 13, 2025	Benchmarking	—Unverified
Machine learning for modelling unstructured grid data in computational physics: a review	Feb 13, 2025	Benchmarking	—Unverified
Handwritten Text Recognition: A Survey	Feb 12, 2025	BenchmarkingHandwritten Text Recognition	—Unverified
Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors	Feb 12, 2025	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
One-Shot Federated Learning with Classifier-Free Diffusion Models	Feb 12, 2025	BenchmarkingDataset Generation	—Unverified
exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem	Feb 11, 2025	BenchmarkingDiversity	CodeCode Available
The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation	Feb 11, 2025	BenchmarkingDe-identification	CodeCode Available
CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories	Feb 10, 2025	Benchmarking	—Unverified
Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring	Feb 10, 2025	Benchmarking	CodeCode Available
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation	Feb 10, 2025	Benchmarking	—Unverified
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations	Feb 10, 2025	BenchmarkingIn-Context Learning	—Unverified
Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)	Feb 9, 2025	BenchmarkingCPU	—Unverified
Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models	Feb 9, 2025	BenchmarkingCode Generation	—Unverified
Surprise Potential as a Measure of Interactivity in Driving Scenarios	Feb 8, 2025	Benchmarking	—Unverified
Mol-MoE: Training Preference-Guided Routers for Molecule Generation	Feb 8, 2025	BenchmarkingDrug Design	CodeCode Available
LUND-PROBE -- LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset	Feb 6, 2025	BenchmarkingComputed Tomography (CT)	—Unverified
Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples	Feb 6, 2025	BenchmarkingDeepFake Detection	CodeCode Available
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization	Feb 6, 2025	BenchmarkingUncertainty Quantification	—Unverified
Verifiable Format Control for Large Language Model Generations	Feb 6, 2025	BenchmarkingInstruction Following	—Unverified
PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data	Feb 6, 2025	BenchmarkingTime Series	CodeCode Available
Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs	Feb 6, 2025	BenchmarkingEpidemiology	CodeCode Available
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models	Feb 6, 2025	BenchmarkingEmotional Intelligence	—Unverified
Energy & Force Regression on DFT Trajectories is Not Enough for Universal Machine Learning Interatomic Potentials	Feb 5, 2025	Benchmarking	—Unverified
Optimal PMU Placement for Kalman Filtering of DAE Power System Models	Feb 5, 2025	BenchmarkingState Estimation	—Unverified
xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods	Feb 5, 2025	Benchmarking	—Unverified
Benchmarking Time Series Forecasting Models: From Statistical Techniques to Foundation Models in Real-World Applications	Feb 5, 2025	BenchmarkingFeature Engineering	—Unverified
TGB-Seq Benchmark: Challenging Temporal GNNs with Complex Sequential Dynamics	Feb 5, 2025	BenchmarkingLink Prediction	CodeCode Available
MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf	Feb 5, 2025	BenchmarkingScheduling	—Unverified
LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation	Feb 4, 2025	BenchmarkingClassification	—Unverified
No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets	Feb 4, 2025	AllBenchmarking	CodeCode Available
Evalita-LLM: Benchmarking Large Language Models on Italian	Feb 4, 2025	BenchmarkingMultiple-choice	—Unverified
A comparison of translation performance between DeepL and Supertext	Feb 4, 2025	BenchmarkingMachine Translation	CodeCode Available
Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models	Feb 4, 2025	BenchmarkingDecision Making	—Unverified
Dynamic benchmarking framework for LLM-based conversational data capture	Feb 4, 2025	Benchmarking	—Unverified
MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation	Feb 3, 2025	BenchmarkingFairness	—Unverified
SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering	Feb 3, 2025	BenchmarkingCode Generation	—Unverified
EdgeMark: An Automation and Benchmarking System for Embedded Artificial Intelligence Tools	Feb 3, 2025	Benchmarking	—Unverified
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities	Feb 3, 2025	BenchmarkingLarge Language Model	—Unverified
Learned Bayesian Cramér-Rao Bound for Unknown Measurement Models Using Score Neural Networks	Feb 2, 2025	Benchmarking	CodeCode Available
True Online TD-Replan(lambda) Achieving Planning through Replaying	Jan 31, 2025	Benchmarking	—Unverified
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding	Jan 30, 2025	BenchmarkingDecision Making	—Unverified
Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency	Jan 30, 2025	BenchmarkingLanguage Modeling	—Unverified

Show:10 25 50

← PrevPage 46 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified