Benchmarking

Papers

Showing 3076–3100 of 5548 papers

Title	Date	Tasks	Status
Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing	Jun 11, 2024	Benchmarking, Stance Detection	Unverified
Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning	Jun 11, 2024	Benchmarking, Contrastive Learning	Code Available
DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition	Jun 11, 2024	Benchmarking, Cross-corpus	Unverified
Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images	Jun 11, 2024	Benchmarking, GPU	Unverified
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models	Jun 11, 2024	Benchmarking, Fairness	Unverified
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition	Jun 10, 2024	Benchmarking, Emotion Recognition	Code Available
Can Language Models Serve as Text-Based World Simulators?	Jun 10, 2024	Benchmarking, Decision Making	Unverified
Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking	Jun 10, 2024	Benchmarking, Econometrics	Unverified
Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture	Jun 10, 2024	Benchmarking, Decoder	Code Available
JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models	Jun 10, 2024	Benchmarking, Code Generation	Code Available
Data-driven Power Flow Linearization: Simulation	Jun 10, 2024	Benchmarking, Computational Efficiency	Unverified
Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications	Jun 8, 2024	Benchmarking, Mamba	Unverified
1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation	Jun 8, 2024	Benchmarking, Instance Segmentation	Unverified
GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models	Jun 7, 2024	Benchmarking, Denoising	Unverified
Deep Jansen-Rit Parameter Inference for Model-Driven Analysis of Brain Activity	Jun 7, 2024	Benchmarking, EEG	Code Available
Scenarios and Approaches for Situated Natural Language Explanations	Jun 7, 2024	Benchmarking, In-Context Learning	Unverified
Behavior Structformer: Learning Players Representations with Structured Tokenization	Jun 7, 2024	Benchmarking	Unverified
VisionAD, a software package of performant anomaly detection algorithms, and Proportion Localised, an interpretable metric	Jun 7, 2024	Anomaly Detection, Benchmarking	Code Available
Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation	Jun 7, 2024	Benchmarking	Unverified
Better Late Than Never: Formulating and Benchmarking Recommendation Editing	Jun 6, 2024	Benchmarking, Recommendation Systems	Code Available
Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation	Jun 6, 2024	Benchmarking, Drug Discovery	Unverified
Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As	Jun 6, 2024	Articles, Benchmarking	Unverified
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning	Jun 6, 2024	Benchmarking, Scheduling	Unverified
BEADs: Bias Evaluation Across Domains	Jun 6, 2024	Benchmarking, Bias Detection	Unverified
Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices	Jun 6, 2024	Benchmarking, RAG	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified