Benchmarking

Papers

Showing 501–550 of 5548 papers

Title	Date	Tasks	Status	Hype
Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking	May 3, 2025	Benchmarking, Data Integration	Unverified	0
PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach	May 3, 2025	Benchmarking, Image-to-Image Translation	Unverified	0
Overview and practical recommendations on using Shapley Values for identifying predictive biomarkers via CATE modeling	May 2, 2025	Benchmarking	Unverified	0
EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models	May 2, 2025	Benchmarking	Code Available	0
Can Foundation Models Really Segment Tumors? A Benchmarking Odyssey in Lung CT Imaging	May 2, 2025	Benchmarking, Computational Efficiency	Unverified	0
Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models	May 2, 2025	Benchmarking	Code Available	0
Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation	May 1, 2025	Benchmarking, Position	Unverified	0
EnronQA: Towards Personalized RAG over Private Documents	May 1, 2025	Benchmarking, Memorization	Unverified	0
InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method	May 1, 2025	Benchmarking, Motion Planning	Unverified	0
Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook	May 1, 2025	Benchmarking, Change Detection	Code Available	2
AI-ready Snow Radar Echogram Dataset (SRED) for climate change monitoring	May 1, 2025	Benchmarking, Deep Learning	Unverified	0
MINERVA: Evaluating Complex Video Reasoning	May 1, 2025	Benchmarking, Temporal Localization	Code Available	2
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation	Apr 30, 2025	3D Molecule Generation, Benchmarking	Code Available	1
Towards Robust and Generalizable Gerchberg Saxton based Physics Inspired Neural Networks for Computer Generated Holography: A Sensitivity Analysis Framework	Apr 30, 2025	Benchmarking, Learning Theory	Unverified	0
From Precision to Perception: User-Centred Evaluation of Keyword Extraction Algorithms for Internet-Scale Contextual Advertising	Apr 30, 2025	Benchmarking, Computational Efficiency	Unverified	0
Sadeed: Advancing Arabic Diacritization Through Small Language Model	Apr 30, 2025	Arabic Text Diacritization, Benchmarking	Unverified	0
Galvatron: An Automatic Distributed System for Efficient Foundation Model Training	Apr 30, 2025	Benchmarking	Unverified	0
Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking	Apr 29, 2025	Benchmarking, Intrusion Detection	Unverified	0
The Leaderboard Illusion	Apr 29, 2025	Benchmarking, Chatbot	Unverified	0
OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification	Apr 29, 2025	Benchmarking, Code Generation	Code Available	1
Hydra: Marker-Free RGB-D Hand-Eye Calibration	Apr 29, 2025	Benchmarking	Unverified	0
TrueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks	Apr 29, 2025	Benchmarking, Misinformation	Code Available	1
On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks	Apr 29, 2025	Anomaly Detection, Benchmarking	Unverified	0
LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs	Apr 29, 2025	Benchmarking, Face Generation	Unverified	0
SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories	Apr 29, 2025	Benchmarking, Code Generation	Unverified	0
Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation	Apr 29, 2025	Benchmarking, Fairness	Code Available	0
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models	Apr 29, 2025	Benchmarking, Dataset Generation	Code Available	0
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets	Apr 28, 2025	Articles, Benchmarking	Unverified	0
BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics	Apr 28, 2025	Benchmarking	Unverified	0
WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution	Apr 28, 2025	Benchmarking, Image Attribution	Unverified	0
ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies	Apr 28, 2025	Benchmarking, Data Augmentation	Unverified	0
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text	Apr 28, 2025	Benchmarking	Code Available	1
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese	Apr 27, 2025	Benchmarking, Proper Noun	Code Available	2
Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception	Apr 27, 2025	Benchmarking, Event-based vision	Unverified	0
The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach	Apr 27, 2025	Benchmarking, Decision Making	Unverified	0
Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider	Apr 26, 2025	Benchmarking, GPU	Code Available	0
Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis	Apr 25, 2025	Benchmarking	Unverified	0
Token Sequence Compression for Efficient Multimodal Computing	Apr 24, 2025	Benchmarking	Unverified	0
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency	Apr 24, 2025	Benchmarking, Math	Code Available	1
Design and benchmarking of a two degree of freedom tendon driver unit for cable-driven wearable technologies	Apr 24, 2025	Benchmarking	Unverified	0
QuantBench: Benchmarking AI Methods for Quantitative Investment	Apr 24, 2025	Benchmarking, Continual Learning	Unverified	0
From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories	Apr 23, 2025	Benchmarking	Code Available	0
MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark	Apr 23, 2025	Benchmarking	Code Available	0
LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement	Apr 22, 2025	Benchmarking, Language Modeling	Code Available	1
Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations	Apr 22, 2025	Benchmarking, Few-Shot Learning	Unverified	0
Benchmarking machine learning models for predicting aerofoil performance	Apr 22, 2025	Benchmarking	Unverified	0
Fluorescence Reference Target Quantitative Analysis Library	Apr 22, 2025	Benchmarking	Code Available	0
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents	Apr 22, 2025	Benchmarking, Cross-Lingual Information Retrieval	Unverified	0
Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3	Apr 22, 2025	Benchmarking, Language Modeling	Unverified	0
A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs	Apr 22, 2025	Benchmarking, Class-level Code Generation	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified