SOTAVerified

Benchmarking

Papers

Showing 1551-1575 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Lightning UQ Box: A Comprehensive Framework for Uncertainty Quantification in Deep Learning | | 0 |
| AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Code | 2 |
| Towards a Benchmark for Large Language Models for Business Process Management Tasks | Code | 0 |
| EBES: Easy Benchmarking for Event Sequences | Code | 1 |
| Repurposing Foundation Model for Generalizable Medical Time Series Classification | | 0 |
| DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects | Code | 1 |
| Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning | | 0 |
| LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services | Code | 1 |
| MANTRA: The Manifold Triangulations Assemblage | Code | 0 |
| IoT-LLM: Enhancing Real-World IoT Task Reasoning with Large Language Models | | 0 |
| Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents | Code | 3 |
| A Real Benchmark Swell Noise Dataset for Performing Seismic Data Denoising via Deep Learning | | 0 |
| MONICA: Benchmarking on Long-tailed Medical Image Classification | Code | 1 |
| Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description | | 0 |
| CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations | | 0 |
| OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models | Code | 3 |
| StringLLM: Understanding the String Processing Capability of Large Language Models | Code | 1 |
| Deep learning for action spotting in association football videos | | 0 |
| Deep Unlearn: Benchmarking Machine Unlearning | | 0 |
| MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Code | 1 |
| The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs | | 0 |
| shapiq: Shapley Interactions for Machine Learning | Code | 4 |
| ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving | | 0 |
| Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents | | 0 |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |