Benchmarking

Papers


Showing 2051–2100 of 5548 papers

Title	Date	Tasks	Status	Hype
TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs	Jun 14, 2024	Benchmarking, Knowledge Graphs	Code Available	3
Beyond Slow Signs in High-fidelity Model Extraction	Jun 14, 2024	Benchmarking, model	Code Available	0
LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data	Jun 14, 2024	Benchmarking, Decision Making	Code Available	1
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency	Jun 14, 2024	Benchmarking	Code Available	1
On the Evaluation of Speech Foundation Models for Spoken Language Understanding	Jun 14, 2024	Benchmarking, Prediction	Unverified	0
CubeSat-Enabled Free-Space Optics: Joint Data Communication and Fine Beam Tracking	Jun 13, 2024	Benchmarking	Unverified	0
ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents	Jun 13, 2024	Benchmarking, Survey	Unverified	0
Decoding the Diversity: A Review of the Indic AI Research Landscape	Jun 13, 2024	Benchmarking, Diversity	Unverified	0
ECBD: Evidence-Centered Benchmark Design for NLP	Jun 13, 2024	Benchmarking	Code Available	0
BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics	Jun 13, 2024	Benchmarking	Code Available	2
DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks	Jun 13, 2024	Benchmarking	Code Available	3
SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models	Jun 13, 2024	Benchmarking	Code Available	1
Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition	Jun 13, 2024	Benchmarking	Unverified	0
A Review of 315 Benchmark and Test Functions for Machine Learning Optimization Algorithms and Metaheuristics with Mathematical and Visual Descriptions	Jun 13, 2024	Benchmarking	Unverified	0
StreamBench: Towards Benchmarking Continuous Improvement of Language Agents	Jun 13, 2024	Benchmarking, Language Modeling	Code Available	2
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT	Jun 13, 2024	Benchmarking, LLM-generated Text Detection	Code Available	1
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living	Jun 13, 2024	Benchmarking, Human-Object Interaction Detection	Unverified	0
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs	Jun 13, 2024	Benchmarking, Question Answering	Code Available	2
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs	Jun 13, 2024	Benchmarking, GPU	Code Available	2
SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution	Jun 13, 2024	Benchmarking, Image Super-Resolution	Code Available	1
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation	Jun 13, 2024	Benchmarking, Hallucination	Code Available	0
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases	Jun 12, 2024	Benchmarking, Model Compression	Unverified	0
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets	Jun 12, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks	Jun 12, 2024	Benchmarking, Chatbot	Code Available	3
Reinforcement Learning to Disentangle Multiqubit Quantum States from Partial Observations	Jun 12, 2024	Benchmarking, Deep Reinforcement Learning	Code Available	0
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation	Jun 12, 2024	Benchmarking, Image Generation	Code Available	1
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents	Jun 12, 2024	Benchmarking, Language Modeling	Unverified	0
Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark	Jun 12, 2024	Benchmarking, Mixture-of-Experts	Code Available	1
Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework	Jun 12, 2024	Benchmarking, Causal Inference	Code Available	1
It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives	Jun 12, 2024	All, Benchmarking	Unverified	0
How well it works: Benchmarking performance of GPT models on medical natural language processing tasks	Jun 12, 2024	Benchmarking	Unverified	0
DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition	Jun 11, 2024	Benchmarking, Cross-corpus	Unverified	0
A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection	Jun 11, 2024	Benchmarking, Defect Detection	Unverified	0
Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing	Jun 11, 2024	Benchmarking, Stance Detection	Unverified	0
Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images	Jun 11, 2024	Benchmarking, GPU	Unverified	0
RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection	Jun 11, 2024	Anomaly Detection, Benchmarking	Code Available	1
Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning	Jun 11, 2024	Benchmarking, Contrastive Learning	Code Available	0
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models	Jun 11, 2024	Benchmarking, Fairness	Unverified	0
AudioMarkBench: Benchmarking Robustness of Audio Watermarking	Jun 11, 2024	Benchmarking, text-to-speech	Code Available	1
JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models	Jun 10, 2024	Benchmarking, Code Generation	Code Available	0
Data-driven Power Flow Linearization: Simulation	Jun 10, 2024	Benchmarking, Computational Efficiency	Unverified	0
Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture	Jun 10, 2024	Benchmarking, Decoder	Code Available	0
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition	Jun 10, 2024	Benchmarking, Emotion Recognition	Code Available	0
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents	Jun 10, 2024	Benchmarking, scientific discovery	Code Available	3
Can Language Models Serve as Text-Based World Simulators?	Jun 10, 2024	Benchmarking, Decision Making	Unverified	0
Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking	Jun 10, 2024	Benchmarking, Econometrics	Unverified	0
TopoBench: A Framework for Benchmarking Topological Deep Learning	Jun 9, 2024	Benchmarking, Deep Learning	Code Available	3
Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking	Jun 9, 2024	Benchmarking, Drug Discovery	Code Available	1
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation	Jun 9, 2024	Benchmarking, Question Generation	Code Available	1
ICU-Sepsis: A Benchmark MDP Built from Real Medical Data	Jun 9, 2024	Benchmarking, Management	Code Available	1



Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified