Benchmarking

Papers

Showing 3051–3075 of 5548 papers

Title	Date	Tasks	Status
NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics	Jun 16, 2024	Benchmarking, de novo peptide sequencing	Unverified
GANmut: Generating and Modifying Facial Expressions	Jun 16, 2024	Benchmarking, Diversity	Unverified
Reactor Mk.1 performances: MMLU, HumanEval and BBH test results	Jun 15, 2024	Benchmarking, HumanEval	Unverified
Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models	Jun 15, 2024	Benchmarking, Data Augmentation	Code Available
Beyond Slow Signs in High-fidelity Model Extraction	Jun 14, 2024	Benchmarking, model	Code Available
ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures	Jun 14, 2024	Answer Generation, Benchmarking	Code Available
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading	Jun 14, 2024	Benchmarking, Mathematical Proofs	Code Available
On the Evaluation of Speech Foundation Models for Spoken Language Understanding	Jun 14, 2024	Benchmarking, Prediction	Unverified
Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming	Jun 14, 2024	Benchmarking, General Knowledge	Unverified
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework	Jun 14, 2024	Benchmarking	Unverified
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation	Jun 13, 2024	Benchmarking, Hallucination	Code Available
CubeSat-Enabled Free-Space Optics: Joint Data Communication and Fine Beam Tracking	Jun 13, 2024	Benchmarking	Unverified
ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents	Jun 13, 2024	Benchmarking, Survey	Unverified
ECBD: Evidence-Centered Benchmark Design for NLP	Jun 13, 2024	Benchmarking	Code Available
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living	Jun 13, 2024	Benchmarking, Human-Object Interaction Detection	Unverified
Decoding the Diversity: A Review of the Indic AI Research Landscape	Jun 13, 2024	Benchmarking, Diversity	Unverified
Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition	Jun 13, 2024	Benchmarking	Unverified
A Review of 315 Benchmark and Test Functions for Machine Learning Optimization Algorithms and Metaheuristics with Mathematical and Visual Descriptions	Jun 13, 2024	Benchmarking	Unverified
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents	Jun 12, 2024	Benchmarking, Language Modeling	Unverified
How well it works: Benchmarking performance of GPT models on medical natural language processing tasks	Jun 12, 2024	Benchmarking	Unverified
It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives	Jun 12, 2024	All, Benchmarking	Unverified
Reinforcement Learning to Disentangle Multiqubit Quantum States from Partial Observations	Jun 12, 2024	Benchmarking, Deep Reinforcement Learning	Code Available
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets	Jun 12, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases	Jun 12, 2024	Benchmarking, Model Compression	Unverified
A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection	Jun 11, 2024	Benchmarking, Defect Detection	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified