Benchmarking

Papers

Showing 601–650 of 5548 papers

Title	Date	Tasks	Status	Hype
LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond	Oct 13, 2024	Autonomous Driving, Autonomous Vehicles	Code Available	1
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment	Oct 13, 2024	Benchmarking	Code Available	1
When Graph meets Multimodal: Benchmarking on Multimodal Attributed Graphs Learning	Oct 11, 2024	Attribute, Benchmarking	Code Available	1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation	Oct 11, 2024	Benchmarking, Image Segmentation	Code Available	1
Towards Generalisable Time Series Understanding Across Domains	Oct 9, 2024	Benchmarking, Time Series	Code Available	1
Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective	Oct 8, 2024	Attribute, Benchmarking	Code Available	1
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild	Oct 7, 2024	Benchmarking, Mixture-of-Experts	Code Available	1
Large Scale MRI Collection and Segmentation of Cirrhotic Liver	Oct 6, 2024	Benchmarking, Diagnostic	Code Available	1
Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning	Oct 5, 2024	Benchmarking, Drug Design	Code Available	1
EBES: Easy Benchmarking for Event Sequences	Oct 4, 2024	Benchmarking	Code Available	1
DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects	Oct 3, 2024	Benchmarking, Imitation Learning	Code Available	1
LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services	Oct 3, 2024	Benchmarking, GPU	Code Available	1
StringLLM: Understanding the String Processing Capability of Large Language Models	Oct 2, 2024	Benchmarking	Code Available	1
MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework	Oct 2, 2024	Benchmarking, Instruction Following	Code Available	1
MONICA: Benchmarking on Long-tailed Medical Image Classification	Oct 2, 2024	Benchmarking, Classification	Code Available	1
Exploring QUIC Dynamics: A Large-Scale Dataset for Encrypted Traffic Analysis	Sep 30, 2024	Benchmarking, Intrusion Detection	Code Available	1
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning	Sep 27, 2024	AutoML, Benchmarking	Code Available	1
MALPOLON: A Framework for Deep Species Distribution Modeling	Sep 26, 2024	Benchmarking, GPU	Code Available	1
HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing	Sep 25, 2024	Benchmarking, Image Dehazing	Code Available	1
RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code	Sep 23, 2024	Benchmarking, Code Generation	Code Available	1
Boosting Healthcare LLMs Through Retrieved Context	Sep 23, 2024	Benchmarking, Multiple-choice	Code Available	1
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models	Sep 20, 2024	Benchmarking, Image Captioning	Code Available	1
MetaFormer and CNN Hybrid Model for Polyp Image Segmentation	Sep 16, 2024	Benchmarking, Image Segmentation	Code Available	1
ODAQ: Open Dataset of Audio Quality - Benchmark on GitHub	Sep 13, 2024	Audio Quality Assessment, Benchmarking	Code Available	1
Insights from Benchmarking Frontier Language Models on Web App Code Generation	Sep 8, 2024	Benchmarking, Code Generation	Code Available	1
RTLRewriter: Methodologies for Large Models aided RTL Code Optimization	Sep 4, 2024	Benchmarking	Code Available	1
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs	Sep 3, 2024	16k, Benchmarking	Code Available	1
Towards Student Actions in Classroom Scenes: New Dataset and Baseline	Sep 2, 2024	Action Detection, Benchmarking	Code Available	1
STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models	Aug 29, 2024	Benchmarking, Image Generation	Code Available	1
How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models	Aug 29, 2024	Benchmarking, General Knowledge	Code Available	1
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models	Aug 28, 2024	Benchmarking, Logical Reasoning	Code Available	1
Variational Autoencoder for Anomaly Detection: A Comparative Study	Aug 24, 2024	Anomaly Detection, Benchmarking	Code Available	1
Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets	Aug 22, 2024	All, Benchmarking	Code Available	1
BLADE: Benchmarking Language Model Agents for Data-Driven Science	Aug 19, 2024	Benchmarking, Decision Making	Code Available	1
PADetBench: Towards Benchmarking Physical Attacks against Object Detection	Aug 17, 2024	Adversarial Robustness, Benchmarking	Code Available	1
SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition	Aug 14, 2024	Automatic Speech Recognition, Benchmarking	Code Available	1
TabularBench: Benchmarking Adversarial Robustness for Tabular Deep Learning in Real-world Use-cases	Aug 14, 2024	Adversarial Robustness, Benchmarking	Code Available	1
Benchmarking tree species classification from proximally-sensed laser scanning data: introducing the FOR-species20K dataset	Aug 12, 2024	Benchmarking	Code Available	1
The impact of internal variability on benchmarking deep learning climate emulators	Aug 9, 2024	Benchmarking, Deep Learning	Code Available	1
UAV-Enhanced Combination to Application: Comprehensive Analysis and Benchmarking of a Human Detection Dataset for Disaster Scenarios	Aug 9, 2024	Benchmarking, Human Detection	Code Available	1
WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models	Aug 7, 2024	AI and Safety, Benchmarking	Code Available	1
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond	Aug 7, 2024	Benchmarking, Language Identification	Code Available	1
OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents	Aug 6, 2024	Benchmarking, Retrieval-augmented Generation	Code Available	1
Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics	Aug 2, 2024	Adversarial Attack, Adversarial Purification	Code Available	1
ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks	Jul 26, 2024	Benchmarking, Model Selection	Code Available	1
VoxSim: A perceptual voice similarity dataset	Jul 26, 2024	Benchmarking, Speaker Recognition	Code Available	1
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation	Jul 26, 2024	Benchmarking, Document AI	Code Available	1
Enhancing clinical decision support with physiological waveforms -- a multimodal benchmark in emergency care	Jul 25, 2024	Benchmarking, Diagnostic	Code Available	1
AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction	Jul 25, 2024	Benchmarking, Deep Learning	Code Available	1
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning	Jul 22, 2024	Benchmarking, Hallucination	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified