Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 876–900 of 5548 papers

Title	Date	Tasks	Status	Hype
On Neural Inertial Classification Networks for Pedestrian Activity Recognition	Feb 23, 2025	Activity RecognitionBenchmarking	—Unverified	0
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation	Feb 23, 2025	Benchmarking	CodeCode Available	4
Benchmarking Online Object Trackers for Underwater Robot Position Locking Applications	Feb 23, 2025	BenchmarkingObject Tracking	—Unverified	0
VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs	Feb 23, 2025	Benchmarking	—Unverified	0
BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning	Feb 23, 2025	Benchmarking	CodeCode Available	1
VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models	Feb 23, 2025	BenchmarkingSpatial Reasoning	CodeCode Available	0
An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science	Feb 23, 2025	BenchmarkingCode Generation	CodeCode Available	0
Unmasking Societal Biases in Respiratory Support for ICU Patients through Social Determinants of Health	Feb 23, 2025	BenchmarkingFairness	CodeCode Available	0
Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries	Feb 23, 2025	BenchmarkingImage Retrieval	CodeCode Available	0
MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models	Feb 21, 2025	BenchmarkingDiagnostic	—Unverified	0
Methods and Trends in Detecting Generated Images: A Comprehensive Review	Feb 21, 2025	BenchmarkingDeepFake Detection	—Unverified	0
Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation	Feb 21, 2025	BenchmarkingLanguage Modeling	—Unverified	0
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs	Feb 21, 2025	Benchmarking	CodeCode Available	1
Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models	Feb 21, 2025	BenchmarkingDiagnostic	CodeCode Available	0
Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis	Feb 21, 2025	3DGSAutonomous Driving	—Unverified	0
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators	Feb 20, 2025	BenchmarkingCode Generation	CodeCode Available	2
Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk	Feb 20, 2025	Benchmarking	—Unverified	0
Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide	Feb 20, 2025	Adversarial RobustnessBenchmarking	—Unverified	0
Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework	Feb 20, 2025	BenchmarkingQuestion Answering	CodeCode Available	0
Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models	Feb 20, 2025	BenchmarkingSentence	—Unverified	0
Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks	Feb 20, 2025	BenchmarkingCombinatorial Optimization	—Unverified	0
Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models	Feb 20, 2025	Benchmarking	—Unverified	0
Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems	Feb 20, 2025	BenchmarkingDecision Making	—Unverified	0
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis	Feb 20, 2025	Age EstimationBenchmarking	CodeCode Available	2
PredictaBoard: Benchmarking LLM Score Predictability	Feb 20, 2025	BenchmarkingCommon Sense Reasoning	CodeCode Available	0

Show:10 25 50

← PrevPage 36 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified