Benchmarking

Papers

Showing 626–650 of 5548 papers

Title	Date	Tasks	Status	Hype
RTLRewriter: Methodologies for Large Models aided RTL Code Optimization	Sep 4, 2024	Benchmarking	Code Available	1
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs	Sep 3, 2024	16k, Benchmarking	Code Available	1
Towards Student Actions in Classroom Scenes: New Dataset and Baseline	Sep 2, 2024	Action Detection, Benchmarking	Code Available	1
STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models	Aug 29, 2024	Benchmarking, Image Generation	Code Available	1
How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models	Aug 29, 2024	Benchmarking, General Knowledge	Code Available	1
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models	Aug 28, 2024	Benchmarking, Logical Reasoning	Code Available	1
Variational Autoencoder for Anomaly Detection: A Comparative Study	Aug 24, 2024	Anomaly Detection, Benchmarking	Code Available	1
Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets	Aug 22, 2024	All, Benchmarking	Code Available	1
BLADE: Benchmarking Language Model Agents for Data-Driven Science	Aug 19, 2024	Benchmarking, Decision Making	Code Available	1
PADetBench: Towards Benchmarking Physical Attacks against Object Detection	Aug 17, 2024	Adversarial Robustness, Benchmarking	Code Available	1
SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition	Aug 14, 2024	Automatic Speech Recognition, Benchmarking	Code Available	1
TabularBench: Benchmarking Adversarial Robustness for Tabular Deep Learning in Real-world Use-cases	Aug 14, 2024	Adversarial Robustness, Benchmarking	Code Available	1
Benchmarking tree species classification from proximally-sensed laser scanning data: introducing the FOR-species20K dataset	Aug 12, 2024	Benchmarking	Code Available	1
The impact of internal variability on benchmarking deep learning climate emulators	Aug 9, 2024	Benchmarking, Deep Learning	Code Available	1
UAV-Enhanced Combination to Application: Comprehensive Analysis and Benchmarking of a Human Detection Dataset for Disaster Scenarios	Aug 9, 2024	Benchmarking, Human Detection	Code Available	1
WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models	Aug 7, 2024	AI and Safety, Benchmarking	Code Available	1
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond	Aug 7, 2024	Benchmarking, Language Identification	Code Available	1
OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents	Aug 6, 2024	Benchmarking, Retrieval-augmented Generation	Code Available	1
Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics	Aug 2, 2024	Adversarial Attack, Adversarial Purification	Code Available	1
ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks	Jul 26, 2024	Benchmarking, Model Selection	Code Available	1
VoxSim: A perceptual voice similarity dataset	Jul 26, 2024	Benchmarking, Speaker Recognition	Code Available	1
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation	Jul 26, 2024	Benchmarking, Document AI	Code Available	1
Enhancing clinical decision support with physiological waveforms -- a multimodal benchmark in emergency care	Jul 25, 2024	Benchmarking, Diagnostic	Code Available	1
AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction	Jul 25, 2024	Benchmarking, Deep Learning	Code Available	1
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning	Jul 22, 2024	Benchmarking, Hallucination	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified