Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 451–500 of 5548 papers

Title	Date	Tasks	Status	Hype
The Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong	May 12, 2025	Benchmarking	—Unverified	0
Benchmarking Retrieval-Augmented Generation for Chemistry	May 12, 2025	BenchmarkingRAG	—Unverified	0
Benchmarking of CPU-intensive Stream Data Processing in The Edge Computing Systems	May 12, 2025	BenchmarkingComputational Efficiency	—Unverified	0
Optimizing Recommendations using Fine-Tuned LLMs	May 11, 2025	BenchmarkingRecommendation Systems	—Unverified	0
From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering	May 11, 2025	BenchmarkingGeneral Knowledge	CodeCode Available	0
Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration	May 11, 2025	BenchmarkingDescriptive	—Unverified	0
JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes	May 10, 2025	BenchmarkingGPU	CodeCode Available	1
FNBench: Benchmarking Robust Federated Learning against Noisy Labels	May 10, 2025	BenchmarkingFederated Learning	CodeCode Available	1
Contributions of the Petabyte Scale Sequence Search Codeathon toward efforts to scale sequence-based searches on SRA	May 9, 2025	Benchmarkingscientific discovery	—Unverified	0
Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy	May 9, 2025	BenchmarkingSentiment Analysis	—Unverified	0
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization	May 9, 2025	Benchmarking	CodeCode Available	3
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information	May 9, 2025	BenchmarkingForm	—Unverified	0
Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization	May 8, 2025	AttributeBenchmarking	—Unverified	0
QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation	May 8, 2025	BenchmarkingFederated Learning	—Unverified	0
Autoregressive Stochastic Clock Jitter Compensation in Analog-to-Digital Converters	May 8, 2025	Benchmarking	—Unverified	0
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments	May 8, 2025	BenchmarkingPrompt Engineering	CodeCode Available	1
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations	May 8, 2025	BenchmarkingTask-Oriented Dialogue Systems	—Unverified	0
PyTDC: A multimodal machine learning training, evaluation, and inference platform for biomedical foundation models	May 8, 2025	BenchmarkingGraph Representation Learning	CodeCode Available	1
Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective	May 8, 2025	Active LearningBenchmarking	CodeCode Available	0
scDrugMap: Benchmarking Large Foundation Models for Drug Response Prediction	May 8, 2025	BenchmarkingDrug Discovery	CodeCode Available	1
A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge	May 8, 2025	Benchmarking	CodeCode Available	0
DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions	May 8, 2025	Autonomous NavigationBenchmarking	CodeCode Available	0
Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents	May 8, 2025	Benchmarking	—Unverified	0
Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection	May 8, 2025	BenchmarkingOut-of-Distribution Generalization	—Unverified	0
Benchmarking Traditional Machine Learning and Deep Learning Models for Fault Detection in Power Transformers	May 7, 2025	BenchmarkingFault Detection	CodeCode Available	0
False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims	May 7, 2025	Benchmarking	CodeCode Available	0
Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?	May 7, 2025	BenchmarkingSemantic Segmentation	CodeCode Available	0
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards	May 7, 2025	BenchmarkingHallucination	CodeCode Available	1
RGB-Event Fusion with Self-Attention for Collision Prediction	May 7, 2025	BenchmarkingComputational Efficiency	CodeCode Available	1
Advancing and Benchmarking Personalized Tool Invocation for LLMs	May 7, 2025	BenchmarkingWorld Knowledge	CodeCode Available	0
Benchmarking LLMs' Swarm intelligence	May 7, 2025	Benchmarking	CodeCode Available	1
Alpha Excel Benchmark	May 7, 2025	Benchmarking	—Unverified	0
Call for Action: towards the next generation of symbolic regression benchmark	May 6, 2025	BenchmarkingDiversity	—Unverified	0
Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models	May 6, 2025	BenchmarkingImage Generation	CodeCode Available	0
MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks	May 6, 2025	BenchmarkingMultiple-choice	CodeCode Available	0
Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach	May 6, 2025	BenchmarkingEarth Observation	CodeCode Available	0
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics	May 6, 2025	Benchmarking	CodeCode Available	1
Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking	May 5, 2025	BenchmarkingPrediction	—Unverified	0
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models	May 5, 2025	BenchmarkingMathematical Reasoning	CodeCode Available	2
NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities	May 5, 2025	BenchmarkingQuantization	CodeCode Available	0
Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning	May 5, 2025	Benchmarking	—Unverified	0
NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks	May 4, 2025	BenchmarkingRepresentation Learning	CodeCode Available	0
Meta-Black-Box-Optimization through Offline Q-function Learning	May 4, 2025	BenchmarkingMamba	CodeCode Available	0
Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation	May 4, 2025	BenchmarkingFeature Upsampling	CodeCode Available	0
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video	May 4, 2025	BenchmarkingQuestion Answering	CodeCode Available	1
Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking	May 4, 2025	BenchmarkingRepresentation Learning	CodeCode Available	0
Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing	May 3, 2025	BenchmarkingImage Segmentation	—Unverified	0
CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture	May 3, 2025	Autonomous DrivingBenchmarking	—Unverified	0
Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking	May 3, 2025	BenchmarkingData Integration	—Unverified	0
PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach	May 3, 2025	BenchmarkingImage-to-Image Translation	—Unverified	0

Show:10 25 50

← PrevPage 10 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified