Benchmarking

Papers

Showing 3051–3100 of 5548 papers

Title	Date	Tasks	Status
NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics	Jun 16, 2024	Benchmarking, de novo peptide sequencing	Unverified
GANmut: Generating and Modifying Facial Expressions	Jun 16, 2024	Benchmarking, Diversity	Unverified
Reactor Mk.1 performances: MMLU, HumanEval and BBH test results	Jun 15, 2024	Benchmarking, HumanEval	Unverified
Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models	Jun 15, 2024	Benchmarking, Data Augmentation	Code Available
Beyond Slow Signs in High-fidelity Model Extraction	Jun 14, 2024	Benchmarking, model	Code Available
ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures	Jun 14, 2024	Answer Generation, Benchmarking	Code Available
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading	Jun 14, 2024	Benchmarking, Mathematical Proofs	Code Available
On the Evaluation of Speech Foundation Models for Spoken Language Understanding	Jun 14, 2024	Benchmarking, Prediction	Unverified
Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming	Jun 14, 2024	Benchmarking, General Knowledge	Unverified
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework	Jun 14, 2024	Benchmarking	Unverified
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation	Jun 13, 2024	Benchmarking, Hallucination	Code Available
CubeSat-Enabled Free-Space Optics: Joint Data Communication and Fine Beam Tracking	Jun 13, 2024	Benchmarking	Unverified
ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents	Jun 13, 2024	Benchmarking, Survey	Unverified
ECBD: Evidence-Centered Benchmark Design for NLP	Jun 13, 2024	Benchmarking	Code Available
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living	Jun 13, 2024	Benchmarking, Human-Object Interaction Detection	Unverified
Decoding the Diversity: A Review of the Indic AI Research Landscape	Jun 13, 2024	Benchmarking, Diversity	Unverified
Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition	Jun 13, 2024	Benchmarking	Unverified
A Review of 315 Benchmark and Test Functions for Machine Learning Optimization Algorithms and Metaheuristics with Mathematical and Visual Descriptions	Jun 13, 2024	Benchmarking	Unverified
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents	Jun 12, 2024	Benchmarking, Language Modeling	Unverified
How well it works: Benchmarking performance of GPT models on medical natural language processing tasks	Jun 12, 2024	Benchmarking	Unverified
It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives	Jun 12, 2024	All, Benchmarking	Unverified
Reinforcement Learning to Disentangle Multiqubit Quantum States from Partial Observations	Jun 12, 2024	Benchmarking, Deep Reinforcement Learning	Code Available
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets	Jun 12, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases	Jun 12, 2024	Benchmarking, Model Compression	Unverified
A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection	Jun 11, 2024	Benchmarking, Defect Detection	Unverified
Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing	Jun 11, 2024	Benchmarking, Stance Detection	Unverified
Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning	Jun 11, 2024	Benchmarking, Contrastive Learning	Code Available
DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition	Jun 11, 2024	Benchmarking, Cross-corpus	Unverified
Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images	Jun 11, 2024	Benchmarking, GPU	Unverified
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models	Jun 11, 2024	Benchmarking, Fairness	Unverified
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition	Jun 10, 2024	Benchmarking, Emotion Recognition	Code Available
Can Language Models Serve as Text-Based World Simulators?	Jun 10, 2024	Benchmarking, Decision Making	Unverified
Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking	Jun 10, 2024	Benchmarking, Econometrics	Unverified
Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture	Jun 10, 2024	Benchmarking, Decoder	Code Available
JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models	Jun 10, 2024	Benchmarking, Code Generation	Code Available
Data-driven Power Flow Linearization: Simulation	Jun 10, 2024	Benchmarking, Computational Efficiency	Unverified
Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications	Jun 8, 2024	Benchmarking, Mamba	Unverified
1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation	Jun 8, 2024	Benchmarking, Instance Segmentation	Unverified
GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models	Jun 7, 2024	Benchmarking, Denoising	Unverified
Deep Jansen-Rit Parameter Inference for Model-Driven Analysis of Brain Activity	Jun 7, 2024	Benchmarking, EEG	Code Available
Scenarios and Approaches for Situated Natural Language Explanations	Jun 7, 2024	Benchmarking, In-Context Learning	Unverified
Behavior Structformer: Learning Players Representations with Structured Tokenization	Jun 7, 2024	Benchmarking	Unverified
VisionAD, a software package of performant anomaly detection algorithms, and Proportion Localised, an interpretable metric	Jun 7, 2024	Anomaly Detection, Benchmarking	Code Available
Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation	Jun 7, 2024	Benchmarking	Unverified
Better Late Than Never: Formulating and Benchmarking Recommendation Editing	Jun 6, 2024	Benchmarking, Recommendation Systems	Code Available
Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation	Jun 6, 2024	Benchmarking, Drug Discovery	Unverified
Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As	Jun 6, 2024	Articles, Benchmarking	Unverified
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning	Jun 6, 2024	Benchmarking, Scheduling	Unverified
BEADs: Bias Evaluation Across Domains	Jun 6, 2024	Benchmarking, Bias Detection	Unverified
Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices	Jun 6, 2024	Benchmarking, RAG	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified