Benchmarking

Papers

Showing 1001–1050 of 5548 papers

Title	Date	Tasks	Status	Hype
MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation	Feb 3, 2025	Benchmarking, Fairness	Unverified	0
SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering	Feb 3, 2025	Benchmarking, Code Generation	Unverified	0
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models	Feb 2, 2025	Benchmarking	Code Available	1
Learned Bayesian Cramér-Rao Bound for Unknown Measurement Models Using Score Neural Networks	Feb 2, 2025	Benchmarking	Code Available	0
True Online TD-Replan(lambda) Achieving Planning through Replaying	Jan 31, 2025	Benchmarking	Unverified	0
Evolving Hard Maximum Cut Instances for Quantum Approximate Optimization Algorithms	Jan 30, 2025	Benchmarking, Combinatorial Optimization	Unverified	0
Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency	Jan 30, 2025	Benchmarking, Language Modeling	Unverified	0
Unraveling the Capabilities of Language Models in News Summarization	Jan 30, 2025	Benchmarking, Few-Shot Learning	Code Available	0
The iToBoS dataset: skin region images extracted from 3D total body photographs for lesion detection	Jan 30, 2025	Benchmarking, Diagnostic	Code Available	0
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding	Jan 30, 2025	Benchmarking, Decision Making	Unverified	0
Solving Urban Network Security Games: Learning Platform, Benchmark, and Challenge for AI Research	Jan 29, 2025	Benchmarking	Unverified	0
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model	Jan 28, 2025	Benchmarking, Language Modeling	Code Available	2
HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns	Jan 28, 2025	Adversarial Attack, Benchmarking	Code Available	1
Molecular-driven Foundation Model for Oncologic Pathology	Jan 28, 2025	Benchmarking, Diagnostic	Code Available	4
Benchmarking Quantum Convolutional Neural Networks for Signal Classification in Simulated Gamma-Ray Burst Detection	Jan 28, 2025	Benchmarking	Unverified	0
Making Sense of Data in the Wild: Data Analysis Automation at Scale	Jan 27, 2025	Benchmarking, Diversity	Unverified	0
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding	Jan 27, 2025	Benchmarking, Common Sense Reasoning	Unverified	0
A Benchmarking Environment for Worker Flexibility in Flexible Job Shop Scheduling Problems	Jan 27, 2025	Benchmarking, Evolutionary Algorithms	Unverified	0
Transfer of Knowledge through Reverse Annealing: A Preliminary Analysis of the Benefits and What to Share	Jan 27, 2025	Benchmarking, Transfer Learning	Unverified	0
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding	Jan 27, 2025	Benchmarking, Diversity	Unverified	0
Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation	Jan 27, 2025	Benchmarking, C++ code	Unverified	0
Benchmarking Quantum Reinforcement Learning	Jan 27, 2025	Benchmarking, reinforcement-learning	Code Available	0
GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search	Jan 26, 2025	Benchmarking, Diversity	Code Available	0
CISOL: An Open and Extensible Dataset for Table Structure Recognition in the Construction Industry	Jan 26, 2025	Benchmarking, Object Detection	Unverified	0
Self-supervised Benchmark Lottery on ImageNet: Do Marginal Improvements Translate to Improvements on Similar Datasets?	Jan 26, 2025	Benchmarking, Self-Supervised Learning	Unverified	0
Beyond Benchmarks: On The False Promise of AI Regulation	Jan 26, 2025	Benchmarking	Unverified	0
EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning	Jan 25, 2025	Benchmarking, Evolutionary Algorithms	Code Available	7
Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study	Jan 25, 2025	Benchmarking	Unverified	0
Benchmarking global optimization techniques for unmanned aerial vehicle path planning	Jan 24, 2025	Benchmarking, global-optimization	Unverified	0
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents	Jan 24, 2025	Benchmarking	Code Available	3
Feature-based Evolutionary Diversity Optimization of Discriminating Instances for Chance-constrained Optimization Problems	Jan 24, 2025	Benchmarking, Diversity	Unverified	0
The Karp Dataset	Jan 24, 2025	Benchmarking, Mathematical Reasoning	Unverified	0
Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video	Jan 24, 2025	3D Reconstruction, Benchmarking	Code Available	2
Enhancing Biomedical Relation Extraction with Directionality	Jan 23, 2025	Benchmarking, Document-level Relation Extraction	Code Available	1
AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning	Jan 23, 2025	Benchmarking, image-classification	Unverified	0
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale	Jan 23, 2025	Benchmarking	Unverified	0
You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain	Jan 23, 2025	Benchmarking, Domain Adaptation	Unverified	0
RAG-Reward: Optimizing RAG with Reward Modeling and RLHF	Jan 22, 2025	Benchmarking, Hallucination	Unverified	0
Leveraging LLMs to Create a Haptic Devices' Recommendation System	Jan 22, 2025	Benchmarking	Unverified	0
Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities	Jan 22, 2025	Benchmarking, Referring Expression	Unverified	0
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning	Jan 22, 2025	Benchmarking	Code Available	0
CHaRNet: Conditioned Heatmap Regression for Robust Dental Landmark Localization	Jan 22, 2025	Benchmarking, regression	Unverified	0
Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs)	Jan 21, 2025	Benchmarking	Unverified	0
Benchmarking Randomized Optimization Algorithms on Binary, Permutation, and Combinatorial Problem Landscapes	Jan 21, 2025	Benchmarking	Unverified	0
Optimally-Weighted Maximum Mean Discrepancy Framework for Continual Learning	Jan 21, 2025	Benchmarking, Continual Learning	Unverified	0
Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems	Jan 21, 2025	Autonomous Vehicles, Benchmarking	Code Available	0
Beyond the Hype: Benchmarking LLM-Evolved Heuristics for Bin Packing	Jan 20, 2025	Benchmarking, Evolutionary Algorithms	Unverified	0
Algorithm Selection with Probing Trajectories: Benchmarking the Choice of Classifier Model	Jan 20, 2025	Benchmarking	Unverified	0
Benchmarking Large Language Models via Random Variables	Jan 20, 2025	Benchmarking, Mathematical Reasoning	Unverified	0
InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models	Jan 19, 2025	Benchmarking, Question Answering	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified