Benchmarking

Papers

Showing 2501–2525 of 5548 papers

Title	Date	Tasks	Status
Personalized Multimodal Large Language Models: A Survey	Dec 3, 2024	Benchmarking, Survey	Unverified
OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations	Dec 3, 2024	Benchmarking, Face Recognition	Unverified
Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods	Dec 3, 2024	Benchmarking	Code Available
BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling on Social Media Texts	Dec 3, 2024	Age And Gender Classification, Age and Gender Estimation	Code Available
Benchmarking symbolic regression constant optimization schemes	Dec 3, 2024	Benchmarking, regression	Unverified
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning	Dec 3, 2024	Benchmarking, Visual Reasoning	Unverified
AI Benchmarks and Datasets for LLM Evaluation	Dec 2, 2024	Benchmarking, Distributed Computing	Unverified
Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking	Dec 2, 2024	Benchmarking, Decision Making	Unverified
Agentic-HLS: An agentic reasoning based high-level synthesis system using large language models (AI for EDA workshop 2024)	Dec 2, 2024	Benchmarking, High-Level Synthesis	Code Available
Understanding the World's Museums through Vision-Language Reasoning	Dec 2, 2024	Benchmarking, Question Answering	Code Available
TextClass Benchmark: A Continuous Elo Rating of LLMs in Social Sciences	Nov 30, 2024	Benchmarking, Classification	Code Available
Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark	Nov 29, 2024	Benchmarking, Grounded Video Question Answering	Unverified
One-Shot Real-to-Sim via End-to-End Differentiable Simulation and Rendering	Nov 29, 2024	Benchmarking, Object	Unverified
Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks	Nov 28, 2024	Benchmarking, Natural Language Inference	Unverified
HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos	Nov 28, 2024	Benchmarking, Object Tracking	Unverified
λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics	Nov 28, 2024	Benchmarking, Diversity	Unverified
Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems	Nov 27, 2024	AutoML, Benchmarking	Unverified
Benchmarking Agility and Reconfigurability in Satellite Systems for Tropical Cyclone Monitoring	Nov 27, 2024	Benchmarking, Earth Observation	Unverified
Evaluating Generative AI-Enhanced Content: A Conceptual Framework Using Qualitative, Quantitative, and Mixed-Methods Approaches	Nov 26, 2024	Benchmarking	Unverified
Agentic AI for Improving Precision in Identifying Contributions to Sustainable Development Goals	Nov 26, 2024	Benchmarking, Retrieval	Unverified
Abnormality-Driven Representation Learning for Radiology Imaging	Nov 25, 2024	Benchmarking, Contrastive Learning	Unverified
A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation	Nov 25, 2024	Active Learning, Bayesian Inference	Unverified
Performance Benchmarking of Psychomotor Skills Using Wearable Devices: An Application in Sport	Nov 25, 2024	Benchmarking	Unverified
Benchmarking Active Learning for NILM	Nov 24, 2024	Active Learning, Benchmarking	Unverified
ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain	Nov 23, 2024	Benchmarking, Diversity	Code Available

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified