SOTAVerified

Benchmarking

Papers

Showing 1551-1600 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Lightning UQ Box: A Comprehensive Framework for Uncertainty Quantification in Deep Learning | | 0 |
| Towards a Benchmark for Large Language Models for Business Process Management Tasks | Code | 0 |
| EBES: Easy Benchmarking for Event Sequences | Code | 1 |
| AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Code | 2 |
| Repurposing Foundation Model for Generalizable Medical Time Series Classification | | 0 |
| LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services | Code | 1 |
| DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects | Code | 1 |
| Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning | | 0 |
| MANTRA: The Manifold Triangulations Assemblage | Code | 0 |
| Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents | Code | 3 |
| IoT-LLM: Enhancing Real-World IoT Task Reasoning with Large Language Models | | 0 |
| A Real Benchmark Swell Noise Dataset for Performing Seismic Data Denoising via Deep Learning | | 0 |
| CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations | | 0 |
| Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description | | 0 |
| MONICA: Benchmarking on Long-tailed Medical Image Classification | Code | 1 |
| OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models | Code | 3 |
| StringLLM: Understanding the String Processing Capability of Large Language Models | Code | 1 |
| MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Code | 1 |
| The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs | | 0 |
| Deep Unlearn: Benchmarking Machine Unlearning | | 0 |
| ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving | | 0 |
| Deep learning for action spotting in association football videos | | 0 |
| shapiq: Shapley Interactions for Machine Learning | Code | 4 |
| Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents | | 0 |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | | 0 |
| CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset | | 0 |
| Exploring QUIC Dynamics: A Large-Scale Dataset for Encrypted Traffic Analysis | Code | 1 |
| ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning | Code | 0 |
| Benchmarking Adaptive Intelligence and Computer Vision on Human-Robot Collaboration | | 0 |
| Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs | | 0 |
| Match Stereo Videos via Bidirectional Alignment | | 0 |
| Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models | Code | 2 |
| GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks | | 0 |
| Tracking Everything in Robotic-Assisted Surgery | | 0 |
| A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends | Code | 2 |
| AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy | | 0 |
| Constrained Reinforcement Learning for Safe Heat Pump Control | Code | 0 |
| SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | | 0 |
| EarthquakeNPP: Benchmark Datasets for Earthquake Forecasting with Neural Point Processes | | 0 |
| bnRep: A repository of Bayesian networks from the academic literature | | 0 |
| CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting | | 0 |
| MCUBench: A Benchmark of Tiny Object Detectors on MCUs | | 0 |
| Data Analysis in the Era of Generative AI | | 0 |
| Constructing Confidence Intervals for 'the' Generalization Error -- a Comprehensive Benchmark Study | Code | 0 |
| ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Code | 1 |
| The Elephant in the Room: Towards A Reliable Time-Series Anomaly Detection Benchmark | Code | 3 |
| Conformal Prediction: A Theoretical Note and Benchmarking Transductive Node Classification in Graphs | Code | 0 |
| MALPOLON: A Framework for Deep Species Distribution Modeling | Code | 1 |
| Omnibenchmark (alpha) for continuous and open benchmarking in bioinformatics | | 0 |
| Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |