Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 151–200 of 5548 papers

Title	Date	Tasks	Status	Hype
BSBench: will your LLM find the largest prime number?	Jun 5, 2025	Benchmarking	CodeCode Available	0
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation	Jun 5, 2025	Benchmarking	CodeCode Available	0
Design of intelligent proofreading system for English translation based on CNN and BERT	Jun 5, 2025	BenchmarkingMachine Translation	—Unverified	0
DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models	Jun 5, 2025	BenchmarkingDiversity	—Unverified	0
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos	Jun 5, 2025	BenchmarkingMathematical Reasoning	—Unverified	0
MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories	Jun 5, 2025	BenchmarkingOptical Character Recognition	CodeCode Available	2
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model	Jun 5, 2025	BenchmarkingLanguage Modeling	—Unverified	0
From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems	Jun 5, 2025	BenchmarkingRAG	—Unverified	0
CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx	Jun 5, 2025	2D Pose EstimationBenchmarking	—Unverified	0
A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values	Jun 5, 2025	Benchmarking	—Unverified	0
Urania: Differentially Private Insights into AI Use	Jun 5, 2025	BenchmarkingChatbot	—Unverified	0
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs	Jun 5, 2025	BenchmarkingVideo Understanding	—Unverified	0
Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset	Jun 4, 2025	3D geometryBenchmarking	—Unverified	0
Knowledge-guided Contextual Gene Set Analysis Using Large Language Models	Jun 4, 2025	Benchmarking	—Unverified	0
MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP	Jun 4, 2025	BenchmarkingLanguage Modelling	—Unverified	0
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models	Jun 4, 2025	BenchmarkingGeneral Knowledge	CodeCode Available	0
MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale	Jun 4, 2025	BenchmarkingLanguage Modeling	—Unverified	0
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents	Jun 4, 2025	BenchmarkingDomain Adaptation	CodeCode Available	1
A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series	Jun 4, 2025	BenchmarkingIrregular Time Series	CodeCode Available	0
AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance	Jun 4, 2025	BenchmarkingScheduling	CodeCode Available	5
Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence	Jun 4, 2025	Benchmarking	—Unverified	0
Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems	Jun 4, 2025	BenchmarkingCode Generation	—Unverified	0
CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking	Jun 4, 2025	BenchmarkingCode Generation	—Unverified	0
N^2: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion	Jun 4, 2025	BenchmarkingCausal Inference	—Unverified	0
ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions	Jun 3, 2025	BenchmarkingDiversity	CodeCode Available	1
AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering	Jun 3, 2025	Benchmarking	—Unverified	0
FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes	Jun 3, 2025	BenchmarkingFeature Engineering	CodeCode Available	0
Tactile MNIST: Benchmarking Active Tactile Perception	Jun 3, 2025	BenchmarkingScene Understanding	—Unverified	0
FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models	Jun 3, 2025	BenchmarkingDomain Adaptation	—Unverified	0
SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation	Jun 3, 2025	BenchmarkingStyle Transfer	—Unverified	0
NetPress: Dynamically Generated LLM Benchmarks for Network Applications	Jun 3, 2025	Benchmarking	CodeCode Available	1
Rethinking Machine Unlearning in Image Generation Models	Jun 3, 2025	BenchmarkingImage Generation	CodeCode Available	1
FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents	Jun 2, 2025	BenchmarkingForm	—Unverified	0
CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models	Jun 2, 2025	Benchmarking	CodeCode Available	0
TIIF-Bench: How Does Your T2I Model Follow Your Instructions?	Jun 2, 2025	BenchmarkingInstruction Following	—Unverified	0
Benchmarking Neural Speech Codec Intelligibility with SITool	Jun 2, 2025	BenchmarkingDiagnostic	—Unverified	0
ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code	Jun 2, 2025	BenchmarkingCode Generation	—Unverified	0
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists	Jun 2, 2025	BenchmarkingForm	—Unverified	0
GSCodec Studio: A Modular Framework for Gaussian Splat Compression	Jun 2, 2025	Benchmarking	CodeCode Available	2
Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices	Jun 2, 2025	Benchmarking	—Unverified	0
ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models	Jun 1, 2025	BenchmarkingRelational Reasoning	—Unverified	0
CODEMENV: Benchmarking Large Language Models on Code Migration	Jun 1, 2025	Benchmarking	CodeCode Available	1
ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness	Jun 1, 2025	BenchmarkingManagement	CodeCode Available	0
MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book	Jun 1, 2025	Benchmarking	CodeCode Available	0
The iNaturalist Sounds Dataset	May 31, 2025	Benchmarking	—Unverified	0
AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time	May 31, 2025	BenchmarkingTest-time Adaptation	CodeCode Available	1
Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents	May 30, 2025	BenchmarkingCode Repair	—Unverified	0
PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset	May 30, 2025	BenchmarkingMultiple Instance Learning	CodeCode Available	0
Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework	May 30, 2025	Benchmarking	CodeCode Available	0
GenSpace: Benchmarking Spatially-Aware Image Generation	May 30, 2025	BenchmarkingImage Generation	—Unverified	0

Show:10 25 50

← PrevPage 4 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified