Benchmarking

Papers

Showing 1976–2000 of 5548 papers

Title	Date	Tasks	Status	Hype
CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans	Jun 22, 2024	Benchmarking, Decision Making	Unverified	0
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions	Jun 22, 2024	Benchmarking, Code Generation	Code Available	4
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph	Jun 21, 2024	Benchmarking, Text Generation	Code Available	2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis	Jun 21, 2024	AI Agent, AutoML	Code Available	2
Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors	Jun 21, 2024	Adversarial Defense, Adversarial Robustness	Unverified	0
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models	Jun 21, 2024	Benchmarking	Code Available	1
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking	Jun 21, 2024	Autonomous Driving, Benchmarking	Code Available	7
Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization	Jun 21, 2024	Benchmarking, Segmentation	Code Available	0
Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video	Jun 21, 2024	Benchmarking, Few-Shot Learning	Unverified	0
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents	Jun 21, 2024	Benchmarking	Unverified	0
CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines	Jun 20, 2024	Benchmarking, Decision Making	Code Available	0
Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary	Jun 20, 2024	Benchmarking, In-Context Learning	Unverified	0
QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules	Jun 20, 2024	Benchmarking	Code Available	0
Beyond Optimism: Exploration With Partially Observable Rewards	Jun 20, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available	0
Selected Languages are All You Need for Cross-lingual Truthfulness Transfer	Jun 20, 2024	All, Benchmarking	Code Available	0
How far are today's time-series models from real-world weather forecasting applications?	Jun 20, 2024	Benchmarking, Time Series	Code Available	2
The Elusive Pursuit of Reproducing PATE-GAN: Benchmarking, Auditing, Debugging	Jun 20, 2024	Benchmarking	Code Available	0
Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data	Jun 20, 2024	Animal Pose Estimation, Benchmarking	Unverified	0
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification	Jun 20, 2024	Benchmarking, Classification	Code Available	1
HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?	Jun 20, 2024	Benchmarking, Point Processes	Code Available	2
Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks	Jun 20, 2024	Benchmarking, Medical Image Analysis	Unverified	0
DASB -- Discrete Audio and Speech Benchmark	Jun 20, 2024	Benchmarking, Emotion Recognition	Unverified	0
A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data	Jun 20, 2024	Benchmarking, Kolmogorov-Arnold Networks	Code Available	1
FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability	Jun 20, 2024	Benchmarking, Fairness	Code Available	0
PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions	Jun 20, 2024	Animal Pose Estimation, Autonomous Driving	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified