Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 2351–2400 of 5548 papers

Title	Date	Tasks	Status
Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings	Jan 14, 2025	BenchmarkingQuestion Answering	—Unverified
Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features	Jan 14, 2025	Benchmarking	—Unverified
Benchmarking Graph Representations and Graph Neural Networks for Multivariate Time Series Classification	Jan 14, 2025	BenchmarkingGraph Representation Learning	CodeCode Available
The Paradox of Success in Evolutionary and Bioinspired Optimization: Revisiting Critical Issues, Key Studies, and Methodological Pathways	Jan 13, 2025	BenchmarkingMetaheuristic Optimization	—Unverified
Lessons From Red Teaming 100 Generative AI Products	Jan 13, 2025	BenchmarkingRed Teaming	—Unverified
Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks	Jan 13, 2025	Benchmarking	CodeCode Available
Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles	Jan 13, 2025	ArticlesBenchmarking	—Unverified
Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI	Jan 13, 2025	ARCBenchmarking	—Unverified
Benchmarking YOLOv8 for Optimal Crack Detection in Civil Infrastructure	Jan 12, 2025	BenchmarkingHyperparameter Optimization	—Unverified
Evidential Deep Learning for Uncertainty Quantification and Out-of-Distribution Detection in Jet Identification using Deep Neural Networks	Jan 10, 2025	Anomaly DetectionBenchmarking	CodeCode Available
Benchmarking Rotary Position Embeddings for Automatic Speech Recognition	Jan 10, 2025	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models	Jan 9, 2025	BenchmarkingPhilosophical Reflection	—Unverified
Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning	Jan 9, 2025	BenchmarkingQuestion Answering	—Unverified
CallNavi, A Challenge and Empirical Study on LLM Function Calling and Routing	Jan 9, 2025	BenchmarkingChatbot	—Unverified
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation	Jan 9, 2025	2k8k	—Unverified
AgoraSpeech: A multi-annotated comprehensive dataset of political discourse through the lens of humans and AI	Jan 9, 2025	Benchmarkingnamed-entity-recognition	—Unverified
IOLBENCH: Benchmarking LLMs on Linguistic Reasoning	Jan 8, 2025	Benchmarking	CodeCode Available
Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization	Jan 8, 2025	BenchmarkingGeneral Knowledge	—Unverified
An Analysis of Model Robustness across Concurrent Distribution Shifts	Jan 8, 2025	Benchmarking	—Unverified
Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks	Jan 8, 2025	BenchmarkingDeep Learning	—Unverified
Machine Learning for Identifying Grain Boundaries in Scanning Electron Microscopy (SEM) Images of Nanoparticle Superlattices	Jan 7, 2025	BenchmarkingClustering	—Unverified
Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding	Jan 7, 2025	BenchmarkingCode Generation	—Unverified
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input	Jan 6, 2025	BenchmarkingForm	—Unverified
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models	Jan 6, 2025	BenchmarkingFeature Compression	—Unverified
Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks	Jan 5, 2025	Adversarial RobustnessBenchmarking	CodeCode Available
ANTHROPOS-V: benchmarking the novel task of Crowd Volume Estimation	Jan 3, 2025	BenchmarkingCrowd Counting	CodeCode Available
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture	Jan 3, 2025	BenchmarkingQuestion Answering	—Unverified
AI-Powered Cow Detection in Complex Farm Environments	Jan 3, 2025	Benchmarking	—Unverified
PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents	Jan 3, 2025	Benchmarking	—Unverified
Benchmarking Constraint-Based Bayesian Structure Learning Algorithms: Role of Network Topology	Jan 2, 2025	BenchmarkingSensitivity	—Unverified
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings	Jan 2, 2025	BenchmarkingCode Generation	—Unverified
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery	Jan 2, 2025	BenchmarkingExperimental Design	CodeCode Available
TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer	Jan 2, 2025	BenchmarkingQuantization	—Unverified
MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for Driving Perception	Jan 2, 2025	3D Object DetectionAutonomous Driving	—Unverified
State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects	Jan 2, 2025	BenchmarkingFace Swapping	—Unverified
RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions	Jan 1, 2025	Benchmarking	CodeCode Available
CroCoDL: Cross-device Collaborative Dataset for Localization	Jan 1, 2025	BenchmarkingPose Estimation	—Unverified
Six-CD: Benchmarking Concept Removals for Text-to-image Diffusion Models	Jan 1, 2025	Benchmarking	—Unverified
CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools	Jan 1, 2025	Benchmarking	—Unverified
CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices	Jan 1, 2025	Benchmarking	—Unverified
Segmenting Maxillofacial Structures in CBCT Volumes	Jan 1, 2025	AnatomyBenchmarking	—Unverified
Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback	Jan 1, 2025	Benchmarking	—Unverified
InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation	Jan 1, 2025	BenchmarkingHuman-Object Interaction Detection	—Unverified
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation	Jan 1, 2025	BenchmarkingDiagnostic	—Unverified
On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds	Jan 1, 2025	Benchmarking	—Unverified
Geometry Matters: Benchmarking Scientific ML Approaches for Flow Prediction around Complex Geometries	Dec 31, 2024	BenchmarkingOut-of-Distribution Generalization	—Unverified
A review of faithfulness metrics for hallucination assessment in Large Language Models	Dec 31, 2024	BenchmarkingHallucination	—Unverified
AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects	Dec 31, 2024	BenchmarkingMultiple-choice	—Unverified
Measuring Large Language Models Capacity to Annotate Journalistic Sourcing	Dec 30, 2024	BenchmarkingEthics	—Unverified
SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity	Dec 30, 2024	BenchmarkingCode Generation	—Unverified

Show:10 25 50

← PrevPage 48 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified