Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 526–550 of 5548 papers

Title	Date	Tasks	Status	Hype
Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation	Apr 29, 2025	BenchmarkingFairness	CodeCode Available	0
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models	Apr 29, 2025	BenchmarkingDataset Generation	CodeCode Available	0
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets	Apr 28, 2025	ArticlesBenchmarking	—Unverified	0
BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics	Apr 28, 2025	Benchmarking	—Unverified	0
WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution	Apr 28, 2025	BenchmarkingImage Attribution	—Unverified	0
ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies	Apr 28, 2025	BenchmarkingData Augmentation	—Unverified	0
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text	Apr 28, 2025	Benchmarking	CodeCode Available	1
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese	Apr 27, 2025	BenchmarkingProper Noun	CodeCode Available	2
Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception	Apr 27, 2025	BenchmarkingEvent-based vision	—Unverified	0
The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach	Apr 27, 2025	BenchmarkingDecision Making	—Unverified	0
Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider	Apr 26, 2025	BenchmarkingGPU	CodeCode Available	0
Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis	Apr 25, 2025	Benchmarking	—Unverified	0
Token Sequence Compression for Efficient Multimodal Computing	Apr 24, 2025	Benchmarking	—Unverified	0
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency	Apr 24, 2025	BenchmarkingMath	CodeCode Available	1
Design and benchmarking of a two degree of freedom tendon driver unit for cable-driven wearable technologies	Apr 24, 2025	Benchmarking	—Unverified	0
QuantBench: Benchmarking AI Methods for Quantitative Investment	Apr 24, 2025	BenchmarkingContinual Learning	—Unverified	0
From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories	Apr 23, 2025	Benchmarking	CodeCode Available	0
MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark	Apr 23, 2025	Benchmarking	CodeCode Available	0
LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement	Apr 22, 2025	BenchmarkingLanguage Modeling	CodeCode Available	1
Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations	Apr 22, 2025	BenchmarkingFew-Shot Learning	—Unverified	0
Benchmarking machine learning models for predicting aerofoil performance	Apr 22, 2025	Benchmarking	—Unverified	0
Fluorescence Reference Target Quantitative Analysis Library	Apr 22, 2025	Benchmarking	CodeCode Available	0
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents	Apr 22, 2025	BenchmarkingCross-Lingual Information Retrieval	—Unverified	0
Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3	Apr 22, 2025	BenchmarkingLanguage Modeling	—Unverified	0
A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs	Apr 22, 2025	BenchmarkingClass-level Code Generation	—Unverified	0

Show:10 25 50

← PrevPage 22 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified