Benchmarking

Papers

Showing 851–900 of 5548 papers

Title	Date	Tasks	Status	Hype
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents	Feb 27, 2025	Benchmarking	Code Available	1
MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors	Feb 26, 2025	Benchmarking	Unverified	0
Medical Hallucinations in Foundation Models and Their Impact on Healthcare	Feb 26, 2025	Benchmarking, Hallucination	Code Available	2
Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10	Feb 26, 2025	Benchmarking, object-detection	Unverified	0
Agentic Mixture-of-Workflows for Multi-Modal Chemical Search	Feb 26, 2025	Benchmarking, Retrieval	Unverified	0
Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review	Feb 26, 2025	Benchmarking, Text Detection	Unverified	0
Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study	Feb 26, 2025	Benchmarking, Blood pressure estimation	Code Available	1
Modelling Regional Solar Photovoltaic Capacity in Great Britain	Feb 26, 2025	Benchmarking	Unverified	0
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation	Feb 26, 2025	Benchmarking, Code Generation	Code Available	1
Exploring Graph Tasks with Pure LLMs: A Comprehensive Benchmark and Investigation	Feb 26, 2025	Benchmarking, Graph Learning	Code Available	1
MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering	Feb 26, 2025	Benchmarking, Question Answering	Unverified	0
BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction	Feb 26, 2025	Benchmarking, Time Series	Code Available	3
Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval	Feb 26, 2025	Benchmarking, Code Generation	Unverified	0
Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs	Feb 25, 2025	Benchmarking, Chunking	Code Available	1
Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers	Feb 25, 2025	Articles, Benchmarking	Unverified	0
CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs	Feb 25, 2025	Benchmarking, reinforcement-learning	Unverified	0
OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation	Feb 25, 2025	Benchmarking, Semantic Segmentation	Unverified	0
Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning	Feb 25, 2025	Benchmarking, Reinforcement Learning (RL)	Code Available	0
A Real-time Spatio-Temporal Trajectory Planner for Autonomous Vehicles with Semantic Graph Optimization	Feb 25, 2025	Autonomous Vehicles, Benchmarking	Unverified	0
Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking	Feb 24, 2025	Benchmarking	Unverified	0
Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement	Feb 24, 2025	Benchmarking, feature selection	Unverified	0
SynthRAD2025 Grand Challenge dataset: generating synthetic CTs for radiotherapy	Feb 24, 2025	Benchmarking, Image Generation	Unverified	0
Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties	Feb 24, 2025	Benchmarking	Code Available	0
MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering	Feb 24, 2025	Benchmarking, Question Answering	Code Available	0
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts	Feb 24, 2025	Benchmarking, Fact Verification	Code Available	2
On Neural Inertial Classification Networks for Pedestrian Activity Recognition	Feb 23, 2025	Activity Recognition, Benchmarking	Unverified	0
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation	Feb 23, 2025	Benchmarking	Code Available	4
Benchmarking Online Object Trackers for Underwater Robot Position Locking Applications	Feb 23, 2025	Benchmarking, Object Tracking	Unverified	0
VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs	Feb 23, 2025	Benchmarking	Unverified	0
BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning	Feb 23, 2025	Benchmarking	Code Available	1
VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models	Feb 23, 2025	Benchmarking, Spatial Reasoning	Code Available	0
An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science	Feb 23, 2025	Benchmarking, Code Generation	Code Available	0
Unmasking Societal Biases in Respiratory Support for ICU Patients through Social Determinants of Health	Feb 23, 2025	Benchmarking, Fairness	Code Available	0
Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries	Feb 23, 2025	Benchmarking, Image Retrieval	Code Available	0
MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models	Feb 21, 2025	Benchmarking, Diagnostic	Unverified	0
Methods and Trends in Detecting Generated Images: A Comprehensive Review	Feb 21, 2025	Benchmarking, DeepFake Detection	Unverified	0
Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation	Feb 21, 2025	Benchmarking, Language Modeling	Unverified	0
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs	Feb 21, 2025	Benchmarking	Code Available	1
Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models	Feb 21, 2025	Benchmarking, Diagnostic	Code Available	0
Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis	Feb 21, 2025	3DGS, Autonomous Driving	Unverified	0
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators	Feb 20, 2025	Benchmarking, Code Generation	Code Available	2
Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk	Feb 20, 2025	Benchmarking	Unverified	0
Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide	Feb 20, 2025	Adversarial Robustness, Benchmarking	Unverified	0
Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework	Feb 20, 2025	Benchmarking, Question Answering	Code Available	0
Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models	Feb 20, 2025	Benchmarking, Sentence	Unverified	0
Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks	Feb 20, 2025	Benchmarking, Combinatorial Optimization	Unverified	0
Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models	Feb 20, 2025	Benchmarking	Unverified	0
Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems	Feb 20, 2025	Benchmarking, Decision Making	Unverified	0
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis	Feb 20, 2025	Age Estimation, Benchmarking	Code Available	2
PredictaBoard: Benchmarking LLM Score Predictability	Feb 20, 2025	Benchmarking, Common Sense Reasoning	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified