Benchmarking

Papers

Showing 551–575 of 5548 papers

Title	Date	Tasks	Status	Hype
Generative CKM Construction using Partially Observed Data with Diffusion Model	Dec 19, 2024	Benchmarking	Code Available	1
Autonomous Microscopy Experiments through Large Language Model Agents	Dec 18, 2024	Benchmarking, Experimental Design	Code Available	1
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning	Dec 18, 2024	Benchmarking, Graph Learning	Code Available	1
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment	Dec 18, 2024	Benchmarking, RAG	Code Available	1
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks	Dec 18, 2024	Benchmarking	Code Available	1
MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation	Dec 16, 2024	All, Benchmarking	Code Available	1
CharacterBench: Benchmarking Character Customization of Large Language Models	Dec 16, 2024	Benchmarking	Code Available	1
AD-LLM: Benchmarking Large Language Models for Anomaly Detection	Dec 15, 2024	Anomaly Detection, Benchmarking	Code Available	1
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning	Dec 11, 2024	Attribute, Benchmarking	Code Available	1
PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems	Dec 9, 2024	Benchmarking, Prediction	Code Available	1
Multi-Behavior Recommendation with Personalized Directed Acyclic Behavior Graphs	Dec 9, 2024	Benchmarking, Computational Efficiency	Code Available	1
Grounding Descriptions in Images informs Zero-Shot Visual Recognition	Dec 5, 2024	Attribute, Benchmarking	Code Available	1
Does your model understand genes? A benchmark of gene properties for biological and text models	Dec 5, 2024	Benchmarking, Multi-class Classification	Code Available	1
Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"	Dec 2, 2024	Benchmarking, Representation Learning	Code Available	1
Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis	Nov 29, 2024	Benchmarking, Claim Verification	Code Available	1
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning	Nov 29, 2024	Benchmarking, DeepFake Detection	Code Available	1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models	Nov 27, 2024	Benchmarking, Earth Observation	Code Available	1
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM	Nov 26, 2024	Benchmarking, Text-to-Video Generation	Code Available	1
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs	Nov 25, 2024	Benchmarking, Hallucination	Code Available	1
Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks	Nov 25, 2024	Benchmarking, object-detection	Code Available	1
Multi-Agent Environments for Vehicle Routing Problems	Nov 21, 2024	Benchmarking, reinforcement-learning	Code Available	1
StackEval: Benchmarking LLMs in Coding Assistance	Nov 21, 2024	Benchmarking	Code Available	1
DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models	Nov 19, 2024	Benchmarking, Deep Learning	Code Available	1
Introducing Milabench: Benchmarking Accelerators for AI	Nov 18, 2024	Benchmarking, Deep Learning	Code Available	1
FM-TS: Flow Matching for Time Series Generation	Nov 12, 2024	Benchmarking, Imputation	Code Available	1


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified