Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1851–1875 of 5548 papers

Title	Date	Tasks	Status
A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior	May 13, 2025	BenchmarkingSeismic Interpretation	—Unverified
Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora	May 13, 2025	BenchmarkingDiagnostic	CodeCode Available
ExEBench: Benchmarking Foundation Models on Extreme Earth Events	May 13, 2025	BenchmarkingManagement	CodeCode Available
Benchmarking Ethical and Safety Risks of Healthcare LLMs in China-Toward Systemic Governance under Healthy China 2030	May 12, 2025	BenchmarkingEthics	—Unverified
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning	May 12, 2025	16kBenchmarking	—Unverified
The Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong	May 12, 2025	Benchmarking	—Unverified
From raw affiliations to organization identifiers	May 12, 2025	BenchmarkingMetadata quality	CodeCode Available
Benchmarking Retrieval-Augmented Generation for Chemistry	May 12, 2025	BenchmarkingRAG	—Unverified
Benchmarking of CPU-intensive Stream Data Processing in The Edge Computing Systems	May 12, 2025	BenchmarkingComputational Efficiency	—Unverified
PRISM: Complete Online Decentralized Multi-Agent Pathfinding with Rapid Information Sharing using Motion Constraints	May 12, 2025	Benchmarking	—Unverified
Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs	May 12, 2025	BenchmarkingDocument Layout Analysis	—Unverified
Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration	May 11, 2025	BenchmarkingDescriptive	—Unverified
Optimizing Recommendations using Fine-Tuned LLMs	May 11, 2025	BenchmarkingRecommendation Systems	—Unverified
From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering	May 11, 2025	BenchmarkingGeneral Knowledge	CodeCode Available
Contributions of the Petabyte Scale Sequence Search Codeathon toward efforts to scale sequence-based searches on SRA	May 9, 2025	Benchmarkingscientific discovery	—Unverified
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information	May 9, 2025	BenchmarkingForm	—Unverified
Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy	May 9, 2025	BenchmarkingSentiment Analysis	—Unverified
Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents	May 8, 2025	Benchmarking	—Unverified
QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation	May 8, 2025	BenchmarkingFederated Learning	—Unverified
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations	May 8, 2025	BenchmarkingTask-Oriented Dialogue Systems	—Unverified
A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge	May 8, 2025	Benchmarking	CodeCode Available
DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions	May 8, 2025	Autonomous NavigationBenchmarking	CodeCode Available
Autoregressive Stochastic Clock Jitter Compensation in Analog-to-Digital Converters	May 8, 2025	Benchmarking	—Unverified
Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization	May 8, 2025	AttributeBenchmarking	—Unverified
Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective	May 8, 2025	Active LearningBenchmarking	CodeCode Available

Show:10 25 50

← PrevPage 75 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified