Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 2376–2400 of 5548 papers

Title	Date	Tasks	Status
ANTHROPOS-V: benchmarking the novel task of Crowd Volume Estimation	Jan 3, 2025	BenchmarkingCrowd Counting	CodeCode Available
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture	Jan 3, 2025	BenchmarkingQuestion Answering	—Unverified
AI-Powered Cow Detection in Complex Farm Environments	Jan 3, 2025	Benchmarking	—Unverified
PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents	Jan 3, 2025	Benchmarking	—Unverified
Benchmarking Constraint-Based Bayesian Structure Learning Algorithms: Role of Network Topology	Jan 2, 2025	BenchmarkingSensitivity	—Unverified
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings	Jan 2, 2025	BenchmarkingCode Generation	—Unverified
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery	Jan 2, 2025	BenchmarkingExperimental Design	CodeCode Available
TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer	Jan 2, 2025	BenchmarkingQuantization	—Unverified
MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for Driving Perception	Jan 2, 2025	3D Object DetectionAutonomous Driving	—Unverified
State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects	Jan 2, 2025	BenchmarkingFace Swapping	—Unverified
RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions	Jan 1, 2025	Benchmarking	CodeCode Available
CroCoDL: Cross-device Collaborative Dataset for Localization	Jan 1, 2025	BenchmarkingPose Estimation	—Unverified
Six-CD: Benchmarking Concept Removals for Text-to-image Diffusion Models	Jan 1, 2025	Benchmarking	—Unverified
CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools	Jan 1, 2025	Benchmarking	—Unverified
CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices	Jan 1, 2025	Benchmarking	—Unverified
Segmenting Maxillofacial Structures in CBCT Volumes	Jan 1, 2025	AnatomyBenchmarking	—Unverified
Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback	Jan 1, 2025	Benchmarking	—Unverified
InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation	Jan 1, 2025	BenchmarkingHuman-Object Interaction Detection	—Unverified
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation	Jan 1, 2025	BenchmarkingDiagnostic	—Unverified
On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds	Jan 1, 2025	Benchmarking	—Unverified
Geometry Matters: Benchmarking Scientific ML Approaches for Flow Prediction around Complex Geometries	Dec 31, 2024	BenchmarkingOut-of-Distribution Generalization	—Unverified
A review of faithfulness metrics for hallucination assessment in Large Language Models	Dec 31, 2024	BenchmarkingHallucination	—Unverified
AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects	Dec 31, 2024	BenchmarkingMultiple-choice	—Unverified
Measuring Large Language Models Capacity to Annotate Journalistic Sourcing	Dec 30, 2024	BenchmarkingEthics	—Unverified
SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity	Dec 30, 2024	BenchmarkingCode Generation	—Unverified

Show:10 25 50

← PrevPage 96 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified