Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1651–1700 of 5548 papers

Title	Date	Tasks	Status
TIIF-Bench: How Does Your T2I Model Follow Your Instructions?	Jun 2, 2025	BenchmarkingInstruction Following	—Unverified
Benchmarking Neural Speech Codec Intelligibility with SITool	Jun 2, 2025	BenchmarkingDiagnostic	—Unverified
CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models	Jun 2, 2025	Benchmarking	CodeCode Available
ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness	Jun 1, 2025	BenchmarkingManagement	CodeCode Available
MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book	Jun 1, 2025	Benchmarking	CodeCode Available
ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models	Jun 1, 2025	BenchmarkingRelational Reasoning	—Unverified
The iNaturalist Sounds Dataset	May 31, 2025	Benchmarking	—Unverified
Benchmarking Foundation Models for Zero-Shot Biometric Tasks	May 30, 2025	AttributeBenchmarking	—Unverified
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation	May 30, 2025	BenchmarkingMachine Translation	—Unverified
Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents	May 30, 2025	BenchmarkingCode Repair	—Unverified
PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models	May 30, 2025	Benchmarking	—Unverified
Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework	May 30, 2025	Benchmarking	CodeCode Available
Automated Structured Radiology Report Generation	May 30, 2025	Benchmarking	—Unverified
Geospatial Foundation Models to Enable Progress on Sustainable Development Goals	May 30, 2025	BenchmarkingEarth Observation	—Unverified
Progressive Class-level Distillation	May 30, 2025	BenchmarkingKnowledge Distillation	—Unverified
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs	May 30, 2025	Benchmarking	CodeCode Available
Segmenting France Across Four Centuries	May 30, 2025	BenchmarkingImage-to-Image Translation	CodeCode Available
SORCE: Small Object Retrieval in Complex Environments	May 30, 2025	BenchmarkingImage Retrieval	CodeCode Available
GenSpace: Benchmarking Spatially-Aware Image Generation	May 30, 2025	BenchmarkingImage Generation	—Unverified
PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset	May 30, 2025	BenchmarkingMultiple Instance Learning	CodeCode Available
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization	May 30, 2025	BenchmarkingCryptanalysis	—Unverified
R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation	May 29, 2025	BenchmarkingImage Generation	—Unverified
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs	May 29, 2025	BenchmarkingFairness	CodeCode Available
LLM Performance for Code Generation on Noisy Tasks	May 29, 2025	BenchmarkingCode Generation	CodeCode Available
MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge	May 29, 2025	Benchmarking	—Unverified
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services	May 29, 2025	BenchmarkingInformation Retrieval	CodeCode Available
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns	May 29, 2025	Benchmarking	—Unverified
Joint Phase Shift Optimization and Precoder Selection for RIS-Assisted 5G NR MIMO Systems	May 29, 2025	Benchmarking	—Unverified
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking	May 29, 2025	BenchmarkingGraph Question Answering	—Unverified
PGLearn -- An Open-Source Learning Toolkit for Optimal Power Flow	May 28, 2025	Benchmarking	—Unverified
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese	May 28, 2025	Benchmarking	CodeCode Available
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate	May 28, 2025	Benchmarking	—Unverified
HelixDesign-Binder: A Scalable Production-Grade Platform for Binder Design Built on HelixFold3	May 28, 2025	BenchmarkingEfficient Exploration	—Unverified
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates	May 28, 2025	BenchmarkingDiversity	—Unverified
B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data	May 28, 2025	BenchmarkingDrug Discovery	CodeCode Available
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective	May 28, 2025	BenchmarkingMemorization	CodeCode Available
TabularQGAN: A Quantum Generative Model for Tabular Data	May 28, 2025	BenchmarkingGenerative Adversarial Network	—Unverified
Jailbreak Distillation: Renewable Safety Benchmarking	May 28, 2025	BenchmarkingDiversity	—Unverified
StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis	May 28, 2025	Benchmarking	CodeCode Available
MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators	May 28, 2025	BenchmarkingChatbot	CodeCode Available
Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval	May 28, 2025	BenchmarkingRecommendation Systems	—Unverified
Fedivertex: a Graph Dataset based on Decentralized Social Networks for Trustworthy Machine Learning	May 27, 2025	Benchmarking	CodeCode Available
Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter	May 27, 2025	Benchmarking	CodeCode Available
VideoMarkBench: Benchmarking Robustness of Video Watermarking	May 27, 2025	Benchmarking	CodeCode Available
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge	May 27, 2025	BenchmarkingMultiple-choice	—Unverified
Gauss-Ramanujan Functions: Constructions, Properties, and Applications in Communications and Signal Processing	May 27, 2025	Benchmarking	—Unverified
MoE-Gyro: Self-Supervised Over-Range Reconstruction and Denoising for MEMS Gyroscopes	May 27, 2025	BenchmarkingDenoising	—Unverified
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs	May 27, 2025	BenchmarkingQuestion Selection	CodeCode Available
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding	May 27, 2025	BenchmarkingChange Detection	—Unverified
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering	May 27, 2025	BenchmarkingQuestion Answering	CodeCode Available

Show:10 25 50

← PrevPage 34 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified