Benchmarking

Papers

Showing 2201–2225 of 5548 papers

Title	Date	Tasks	Status
Benchmarking Online Object Trackers for Underwater Robot Position Locking Applications	Feb 23, 2025	Benchmarking, Object Tracking	Unverified
VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs	Feb 23, 2025	Benchmarking	Unverified
VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models	Feb 23, 2025	Benchmarking, Spatial Reasoning	Code Available
Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation	Feb 21, 2025	Benchmarking, Language Modeling	Unverified
Methods and Trends in Detecting Generated Images: A Comprehensive Review	Feb 21, 2025	Benchmarking, DeepFake Detection	Unverified
MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models	Feb 21, 2025	Benchmarking, Diagnostic	Unverified
Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models	Feb 21, 2025	Benchmarking, Diagnostic	Code Available
Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis	Feb 21, 2025	3DGS, Autonomous Driving	Unverified
Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide	Feb 20, 2025	Adversarial Robustness, Benchmarking	Unverified
Synthetic Porous Microstructures: Automatic Design, Simulation, and Permeability Analysis	Feb 20, 2025	Benchmarking	Code Available
Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models	Feb 20, 2025	Benchmarking	Unverified
Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models	Feb 20, 2025	Benchmarking, Sentence	Unverified
Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems	Feb 20, 2025	Benchmarking, Decision Making	Unverified
PredictaBoard: Benchmarking LLM Score Predictability	Feb 20, 2025	Benchmarking, Common Sense Reasoning	Code Available
Reinforcement Learning with Graph Attention for Routing and Wavelength Assignment with Lightpath Reuse	Feb 20, 2025	Benchmarking, Graph Attention	Unverified
Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework	Feb 20, 2025	Benchmarking, Question Answering	Code Available
Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks	Feb 20, 2025	Benchmarking, Combinatorial Optimization	Unverified
Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk	Feb 20, 2025	Benchmarking	Unverified
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking	Feb 19, 2025	Benchmarking	Unverified
A Baseline Method for Removing Invisible Image Watermarks using Deep Image Prior	Feb 19, 2025	Benchmarking, Misinformation	Unverified
Benchmarking Self-Supervised Learning Methods for Accelerated MRI Reconstruction	Feb 19, 2025	Benchmarking, MRI Reconstruction	Code Available
VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare	Feb 19, 2025	Benchmarking, Diversity	Unverified
Position: There are no Champions in Long-Term Time Series Forecasting	Feb 19, 2025	Benchmarking, Position	Unverified
Benchmarking of Different YOLO Models for CAPTCHAs Detection and Classification	Feb 19, 2025	Benchmarking	Unverified
EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking	Feb 18, 2025	Benchmarking, Binary Classification	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified