Benchmarking

Papers

Showing 1051–1100 of 5548 papers

Title	Date	Tasks	Status	Hype
An Interpretable Measure for Quantifying Predictive Dependence between Continuous Random Variables -- Extended Version	Jan 18, 2025	Benchmarking	Unverified	0
ColorGrid: A Multi-Agent Non-Stationary Environment for Goal Inference and Assistance	Jan 17, 2025	Benchmarking, Multi-agent Reinforcement Learning	Code Available	0
FORLAPS: An Innovative Data-Driven Reinforcement Learning Approach for Prescriptive Process Monitoring	Jan 17, 2025	Benchmarking, Data Augmentation	Unverified	0
PixelBrax: Learning Continuous Control from Pixels End-to-End on the GPU	Jan 16, 2025	Benchmarking, continuous-control	Code Available	0
Village-Net Clustering: A Rapid approach to Non-linear Unsupervised Clustering of High-Dimensional Data	Jan 16, 2025	Benchmarking, Clustering	Unverified	0
SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation	Jan 16, 2025	Benchmarking	Code Available	5
Off-policy Evaluation for Payments at Adyen	Jan 15, 2025	Benchmarking, Decision Making	Unverified	0
Cancer-Net PCa-Seg: Benchmarking Deep Learning Models for Prostate Cancer Segmentation Using Synthetic Correlated Diffusion Imaging	Jan 15, 2025	Benchmarking, Computational Efficiency	Unverified	0
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents	Jan 15, 2025	Benchmarking, Optical Character Recognition (OCR)	Unverified	0
Similarity-Quantized Relative Difference Learning for Improved Molecular Activity Prediction	Jan 15, 2025	Activity Prediction, Benchmarking	Unverified	0
ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind	Jan 15, 2025	Benchmarking, Multiple-choice	Code Available	1
Benchmarking Robustness of Contrastive Learning Models for Medical Image-Report Retrieval	Jan 15, 2025	Benchmarking, Contrastive Learning	Unverified	0
Evaluating SAT and SMT Solvers on Large-Scale Sudoku Puzzles	Jan 15, 2025	Benchmarking	Code Available	0
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot	Jan 15, 2025	Benchmarking, Hallucination	Code Available	1
Keras Sig: Efficient Path Signature Computation on GPU in Keras 3	Jan 14, 2025	Benchmarking, C++ code	Unverified	0
Benchmarking Classical, Deep, and Generative Models for Human Activity Recognition	Jan 14, 2025	Activity Recognition, Benchmarking	Unverified	0
Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models	Jan 14, 2025	Benchmarking, Text-to-Video Generation	Code Available	4
Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features	Jan 14, 2025	Benchmarking	Unverified	0
Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving	Jan 14, 2025	Autonomous Driving, Benchmarking	Unverified	0
Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings	Jan 14, 2025	Benchmarking, Question Answering	Unverified	0
Data-driven inventory management for new products: An adjusted Dyna-Q approach with transfer learning	Jan 14, 2025	Benchmarking, Management	Unverified	0
Benchmarking Graph Representations and Graph Neural Networks for Multivariate Time Series Classification	Jan 14, 2025	Benchmarking, Graph Representation Learning	Code Available	0
Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles	Jan 13, 2025	Articles, Benchmarking	Unverified	0
Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks	Jan 13, 2025	Benchmarking	Code Available	0
Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI	Jan 13, 2025	ARC, Benchmarking	Unverified	0
The Paradox of Success in Evolutionary and Bioinspired Optimization: Revisiting Critical Issues, Key Studies, and Methodological Pathways	Jan 13, 2025	Benchmarking, Metaheuristic Optimization	Unverified	0
TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations	Jan 13, 2025	Benchmarking, Domain Adaptation	Code Available	1
WebWalker: Benchmarking LLMs in Web Traversal	Jan 13, 2025	Benchmarking, Open-Domain Question Answering	Code Available	11
Lessons From Red Teaming 100 Generative AI Products	Jan 13, 2025	Benchmarking, Red Teaming	Unverified	0
ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian	Jan 12, 2025	Benchmarking, Math	Code Available	1
Benchmarking YOLOv8 for Optimal Crack Detection in Civil Infrastructure	Jan 12, 2025	Benchmarking, Hyperparameter Optimization	Unverified	0
Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis	Jan 11, 2025	Attribute, Benchmarking	Code Available	1
Evidential Deep Learning for Uncertainty Quantification and Out-of-Distribution Detection in Jet Identification using Deep Neural Networks	Jan 10, 2025	Anomaly Detection, Benchmarking	Code Available	0
Benchmarking Rotary Position Embeddings for Automatic Speech Recognition	Jan 10, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information	Jan 10, 2025	Benchmarking, Data Augmentation	Code Available	1
AgoraSpeech: A multi-annotated comprehensive dataset of political discourse through the lens of humans and AI	Jan 9, 2025	Benchmarking, named-entity-recognition	Unverified	0
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?	Jan 9, 2025	Benchmarking, Video Understanding	Code Available	2
Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning	Jan 9, 2025	Benchmarking, Question Answering	Unverified	0
CallNavi, A Challenge and Empirical Study on LLM Function Calling and Routing	Jan 9, 2025	Benchmarking, Chatbot	Unverified	0
VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models	Jan 9, 2025	Benchmarking, Mathematical Problem-Solving	Code Available	1
Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models	Jan 9, 2025	Benchmarking, Philosophical Reflection	Unverified	0
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation	Jan 9, 2025	2k, 8k	Unverified	0
Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks	Jan 8, 2025	Benchmarking, Deep Learning	Unverified	0
Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization	Jan 8, 2025	Benchmarking, General Knowledge	Unverified	0
IOLBENCH: Benchmarking LLMs on Linguistic Reasoning	Jan 8, 2025	Benchmarking	Code Available	0
An Analysis of Model Robustness across Concurrent Distribution Shifts	Jan 8, 2025	Benchmarking	Unverified	0
Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding	Jan 7, 2025	Benchmarking, Code Generation	Unverified	0
Machine Learning for Identifying Grain Boundaries in Scanning Electron Microscopy (SEM) Images of Nanoparticle Superlattices	Jan 7, 2025	Benchmarking, Clustering	Unverified	0
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input	Jan 6, 2025	Benchmarking, Form	Unverified	0
Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis	Jan 6, 2025	Benchmarking, Image Enhancement	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified