Benchmarking

Papers


Showing 451–500 of 5548 papers

Title	Date	Tasks	Status	Hype
M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection	May 16, 2025	Benchmarking, object-detection	Code Available	1
MatTools: Benchmarking Large Language Models for Materials Science Tools	May 16, 2025	Benchmarking, Question Answering	Code Available	1
Evaluating Robustness of Deep Reinforcement Learning for Autonomous Surface Vehicle Control in Field Tests	May 15, 2025	Benchmarking, Deep Reinforcement Learning	Code Available	1
Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally	May 15, 2025	Benchmarking, Sentence	Code Available	1
Towards scalable surrogate models based on Neural Fields for large scale aerodynamic simulations	May 14, 2025	Benchmarking	Code Available	1
OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions	May 14, 2025	Autonomous Driving, Benchmarking	Code Available	1
Benchmarking AI scientists in omics data-driven biological research	May 13, 2025	Benchmarking, Multiple-choice	Code Available	1
FNBench: Benchmarking Robust Federated Learning against Noisy Labels	May 10, 2025	Benchmarking, Federated Learning	Code Available	1
JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes	May 10, 2025	Benchmarking, GPU	Code Available	1
scDrugMap: Benchmarking Large Foundation Models for Drug Response Prediction	May 8, 2025	Benchmarking, Drug Discovery	Code Available	1
PyTDC: A multimodal machine learning training, evaluation, and inference platform for biomedical foundation models	May 8, 2025	Benchmarking, Graph Representation Learning	Code Available	1
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments	May 8, 2025	Benchmarking, Prompt Engineering	Code Available	1
RGB-Event Fusion with Self-Attention for Collision Prediction	May 7, 2025	Benchmarking, Computational Efficiency	Code Available	1
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards	May 7, 2025	Benchmarking, Hallucination	Code Available	1
Benchmarking LLMs' Swarm intelligence	May 7, 2025	Benchmarking	Code Available	1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics	May 6, 2025	Benchmarking	Code Available	1
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video	May 4, 2025	Benchmarking, Question Answering	Code Available	1
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation	Apr 30, 2025	3D Molecule Generation, Benchmarking	Code Available	1
TrueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks	Apr 29, 2025	Benchmarking, Misinformation	Code Available	1
OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification	Apr 29, 2025	Benchmarking, Code Generation	Code Available	1
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text	Apr 28, 2025	Benchmarking	Code Available	1
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency	Apr 24, 2025	Benchmarking, Math	Code Available	1
LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement	Apr 22, 2025	Benchmarking, Language Modeling	Code Available	1
TinyverseGP: Towards a Modular Cross-domain Benchmarking Framework for Genetic Programming	Apr 14, 2025	Benchmarking, Program Synthesis	Code Available	1
LEMUR Neural Network Dataset: Towards Seamless AutoML	Apr 14, 2025	AutoML, Benchmarking	Code Available	1
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs	Apr 11, 2025	Benchmarking, Image Generation	Code Available	1
Evolutionary Generation of Random Surreal Numbers for Benchmarking	Apr 9, 2025	Benchmarking	Code Available	1
An Empirical Study of GPT-4o Image Generation Capabilities	Apr 8, 2025	Benchmarking, Image Generation	Code Available	1
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models	Apr 8, 2025	Benchmarking, Visual Reasoning	Code Available	1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization	Apr 6, 2025	Benchmarking, Combinatorial Optimization	Code Available	1
A Survey of Pathology Foundation Model: Progress and Future Directions	Apr 5, 2025	Benchmarking, Multiple Instance Learning	Code Available	1
Generative Evaluation of Complex Reasoning in Large Language Models	Apr 3, 2025	Benchmarking, Memorization	Code Available	1
BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing	Apr 2, 2025	3D Reconstruction, Benchmarking	Code Available	1
SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers	Mar 31, 2025	Benchmarking	Code Available	1
EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos	Mar 28, 2025	Benchmarking, Question Answering	Code Available	1
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs	Mar 27, 2025	Attribute, Benchmarking	Code Available	1
A Comprehensive Benchmark for RNA 3D Structure-Function Modeling	Mar 27, 2025	Benchmarking, Deep Learning	Code Available	1
NeoRL-2: Near Real-World Benchmarks for Offline Reinforcement Learning with Extended Realistic Scenarios	Mar 25, 2025	Benchmarking, Offline RL	Code Available	1
The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs	Mar 25, 2025	Benchmarking, Scene Segmentation	Code Available	1
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models	Mar 25, 2025	Benchmarking, Image Captioning	Code Available	1
Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery	Mar 24, 2025	Benchmarking, Humanitarian	Code Available	1
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness	Mar 24, 2025	Benchmarking, Semantic Segmentation	Code Available	1
GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks	Mar 23, 2025	Benchmarking, Hallucination	Code Available	1
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction	Mar 22, 2025	Benchmarking, Video Understanding	Code Available	1
QCPINN: Quantum-Classical Physics-Informed Neural Networks for Solving PDEs	Mar 20, 2025	Benchmarking, Physics-informed machine learning	Code Available	1
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination	Mar 20, 2025	Benchmarking, Large Language Model	Code Available	1
JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System	Mar 18, 2025	Benchmarking, In-Context Learning	Code Available	1
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos	Mar 17, 2025	Benchmarking, Question Answering	Code Available	1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research	Mar 17, 2025	Articles, Benchmarking	Code Available	1
GNNs as Predictors of Agentic Workflow Performances	Mar 14, 2025	Benchmarking, Position	Code Available	1



Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified