Benchmarking

Papers

Title	Date	Tasks	Status
A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior	May 13, 2025	Benchmarking, Seismic Interpretation	Unverified
Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities	May 13, 2025	automatic-speech-translation, Benchmarking	Unverified
ExEBench: Benchmarking Foundation Models on Extreme Earth Events	May 13, 2025	Benchmarking, Management	Code Available
Benchmarking Ethical and Safety Risks of Healthcare LLMs in China-Toward Systemic Governance under Healthy China 2030	May 12, 2025	Benchmarking, Ethics	Unverified
The Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong	May 12, 2025	Benchmarking	Unverified
PRISM: Complete Online Decentralized Multi-Agent Pathfinding with Rapid Information Sharing using Motion Constraints	May 12, 2025	Benchmarking	Unverified
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning	May 12, 2025	16k, Benchmarking	Unverified
From raw affiliations to organization identifiers	May 12, 2025	Benchmarking, Metadata quality	Code Available
Benchmarking Retrieval-Augmented Generation for Chemistry	May 12, 2025	Benchmarking, RAG	Unverified
Benchmarking of CPU-intensive Stream Data Processing in The Edge Computing Systems	May 12, 2025	Benchmarking, Computational Efficiency	Unverified
Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs	May 12, 2025	Benchmarking, Document Layout Analysis	Unverified
Optimizing Recommendations using Fine-Tuned LLMs	May 11, 2025	Benchmarking, Recommendation Systems	Unverified
Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration	May 11, 2025	Benchmarking, Descriptive	Unverified
From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering	May 11, 2025	Benchmarking, General Knowledge	Code Available
Contributions of the Petabyte Scale Sequence Search Codeathon toward efforts to scale sequence-based searches on SRA	May 9, 2025	Benchmarking, scientific discovery	Unverified
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information	May 9, 2025	Benchmarking, Form	Unverified
Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy	May 9, 2025	Benchmarking, Sentiment Analysis	Unverified
DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions	May 8, 2025	Autonomous Navigation, Benchmarking	Code Available
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations	May 8, 2025	Benchmarking, Task-Oriented Dialogue Systems	Unverified
A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge	May 8, 2025	Benchmarking	Code Available
Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization	May 8, 2025	Attribute, Benchmarking	Unverified
Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective	May 8, 2025	Active Learning, Benchmarking	Code Available
Autoregressive Stochastic Clock Jitter Compensation in Analog-to-Digital Converters	May 8, 2025	Benchmarking	Unverified
Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents	May 8, 2025	Benchmarking	Unverified
Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection	May 8, 2025	Benchmarking, Out-of-Distribution Generalization	Unverified
QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation	May 8, 2025	Benchmarking, Federated Learning	Unverified
Advancing and Benchmarking Personalized Tool Invocation for LLMs	May 7, 2025	Benchmarking, World Knowledge	Code Available
False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims	May 7, 2025	Benchmarking	Code Available
Alpha Excel Benchmark	May 7, 2025	Benchmarking	Unverified
Benchmarking Traditional Machine Learning and Deep Learning Models for Fault Detection in Power Transformers	May 7, 2025	Benchmarking, Fault Detection	Code Available
Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?	May 7, 2025	Benchmarking, Semantic Segmentation	Code Available
Call for Action: towards the next generation of symbolic regression benchmark	May 6, 2025	Benchmarking, Diversity	Unverified
Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models	May 6, 2025	Benchmarking, Image Generation	Code Available
Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach	May 6, 2025	Benchmarking, Earth Observation	Code Available
MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks	May 6, 2025	Benchmarking, Multiple-choice	Code Available
Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning	May 5, 2025	Benchmarking	Unverified
NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities	May 5, 2025	Benchmarking, Quantization	Code Available
Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking	May 5, 2025	Benchmarking, Prediction	Unverified
Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation	May 4, 2025	Benchmarking, Feature Upsampling	Code Available
Meta-Black-Box-Optimization through Offline Q-function Learning	May 4, 2025	Benchmarking, Mamba	Code Available
Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking	May 4, 2025	Benchmarking, Representation Learning	Code Available
NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks	May 4, 2025	Benchmarking, Representation Learning	Code Available
Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing	May 3, 2025	Benchmarking, Image Segmentation	Unverified
CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture	May 3, 2025	Autonomous Driving, Benchmarking	Unverified
BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models	May 3, 2025	Benchmarking, Hyperparameter Optimization	Unverified
PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach	May 3, 2025	Benchmarking, Image-to-Image Translation	Unverified
Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey	May 3, 2025	Autonomous Driving, Benchmarking	Unverified
Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking	May 3, 2025	Benchmarking, Data Integration	Unverified
Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models	May 2, 2025	Benchmarking	Code Available
Can Foundation Models Really Segment Tumors? A Benchmarking Odyssey in Lung CT Imaging	May 2, 2025	Benchmarking, Computational Efficiency	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified