Benchmarking

Papers

Showing 476–500 of 5548 papers

Title	Date	Tasks	Status	Hype
False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims	May 7, 2025	Benchmarking	Code Available	0
Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?	May 7, 2025	Benchmarking, Semantic Segmentation	Code Available	0
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards	May 7, 2025	Benchmarking, Hallucination	Code Available	1
RGB-Event Fusion with Self-Attention for Collision Prediction	May 7, 2025	Benchmarking, Computational Efficiency	Code Available	1
Advancing and Benchmarking Personalized Tool Invocation for LLMs	May 7, 2025	Benchmarking, World Knowledge	Code Available	0
Benchmarking LLMs' Swarm intelligence	May 7, 2025	Benchmarking	Code Available	1
Alpha Excel Benchmark	May 7, 2025	Benchmarking	Unverified	0
Call for Action: towards the next generation of symbolic regression benchmark	May 6, 2025	Benchmarking, Diversity	Unverified	0
Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models	May 6, 2025	Benchmarking, Image Generation	Code Available	0
MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks	May 6, 2025	Benchmarking, Multiple-choice	Code Available	0
Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach	May 6, 2025	Benchmarking, Earth Observation	Code Available	0
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics	May 6, 2025	Benchmarking	Code Available	1
Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking	May 5, 2025	Benchmarking, Prediction	Unverified	0
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models	May 5, 2025	Benchmarking, Mathematical Reasoning	Code Available	2
NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities	May 5, 2025	Benchmarking, Quantization	Code Available	0
Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning	May 5, 2025	Benchmarking	Unverified	0
NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks	May 4, 2025	Benchmarking, Representation Learning	Code Available	0
Meta-Black-Box-Optimization through Offline Q-function Learning	May 4, 2025	Benchmarking, Mamba	Code Available	0
Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation	May 4, 2025	Benchmarking, Feature Upsampling	Code Available	0
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video	May 4, 2025	Benchmarking, Question Answering	Code Available	1
Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking	May 4, 2025	Benchmarking, Representation Learning	Code Available	0
Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing	May 3, 2025	Benchmarking, Image Segmentation	Unverified	0
CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture	May 3, 2025	Autonomous Driving, Benchmarking	Unverified	0
Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking	May 3, 2025	Benchmarking, Data Integration	Unverified	0
PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach	May 3, 2025	Benchmarking, Image-to-Image Translation	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified