Benchmarking

Papers


Showing 601–625 of 5548 papers

Title	Date	Tasks	Status	Hype
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment	Oct 13, 2024	Benchmarking	Code Available	1
LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond	Oct 13, 2024	Autonomous Driving, Autonomous Vehicles	Code Available	1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation	Oct 11, 2024	Benchmarking, Image Segmentation	Code Available	1
When Graph meets Multimodal: Benchmarking on Multimodal Attributed Graphs Learning	Oct 11, 2024	Attribute, Benchmarking	Code Available	1
Towards Generalisable Time Series Understanding Across Domains	Oct 9, 2024	Benchmarking, Time Series	Code Available	1
Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective	Oct 8, 2024	Attribute, Benchmarking	Code Available	1
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild	Oct 7, 2024	Benchmarking, Mixture-of-Experts	Code Available	1
Large Scale MRI Collection and Segmentation of Cirrhotic Liver	Oct 6, 2024	Benchmarking, Diagnostic	Code Available	1
Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning	Oct 5, 2024	Benchmarking, Drug Design	Code Available	1
EBES: Easy Benchmarking for Event Sequences	Oct 4, 2024	Benchmarking	Code Available	1
DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects	Oct 3, 2024	Benchmarking, Imitation Learning	Code Available	1
LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services	Oct 3, 2024	Benchmarking, GPU	Code Available	1
MONICA: Benchmarking on Long-tailed Medical Image Classification	Oct 2, 2024	Benchmarking, Classification	Code Available	1
MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework	Oct 2, 2024	Benchmarking, Instruction Following	Code Available	1
StringLLM: Understanding the String Processing Capability of Large Language Models	Oct 2, 2024	Benchmarking	Code Available	1
Exploring QUIC Dynamics: A Large-Scale Dataset for Encrypted Traffic Analysis	Sep 30, 2024	Benchmarking, Intrusion Detection	Code Available	1
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning	Sep 27, 2024	AutoML, Benchmarking	Code Available	1
MALPOLON: A Framework for Deep Species Distribution Modeling	Sep 26, 2024	Benchmarking, GPU	Code Available	1
HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing	Sep 25, 2024	Benchmarking, Image Dehazing	Code Available	1
Boosting Healthcare LLMs Through Retrieved Context	Sep 23, 2024	Benchmarking, Multiple-choice	Code Available	1
RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code	Sep 23, 2024	Benchmarking, Code Generation	Code Available	1
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models	Sep 20, 2024	Benchmarking, Image Captioning	Code Available	1
MetaFormer and CNN Hybrid Model for Polyp Image Segmentation	Sep 16, 2024	Benchmarking, Image Segmentation	Code Available	1
ODAQ: Open Dataset of Audio Quality - Benchmark on GitHub	Sep 13, 2024	Audio Quality Assessment, Benchmarking	Code Available	1
Insights from Benchmarking Frontier Language Models on Web App Code Generation	Sep 8, 2024	Benchmarking, Code Generation	Code Available	1


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified