SOTAVerified

Benchmarking

Papers

Showing 2701–2750 of 5548 papers

Title | Status | Hype
PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms | | 0
Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels | | 0
PersoBench: Benchmarking Personalized Response Generation in Large Language Models | Code | 0
How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension | | 0
Ward: Provable RAG Dataset Inference via LLM Watermarks | | 0
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities | | 0
Towards a Benchmark for Large Language Models for Business Process Management Tasks | Code | 0
Benchmarking the Fidelity and Utility of Synthetic Relational Data | | 0
Lightning UQ Box: A Comprehensive Framework for Uncertainty Quantification in Deep Learning | | 0
Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices | | 0
IoT-LLM: Enhancing Real-World IoT Task Reasoning with Large Language Models | | 0
MANTRA: The Manifold Triangulations Assemblage | Code | 0
Repurposing Foundation Model for Generalizable Medical Time Series Classification | | 0
Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning | | 0
Deep learning for action spotting in association football videos | | 0
ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving | | 0
CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations | | 0
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs | | 0
Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description | | 0
A Real Benchmark Swell Noise Dataset for Performing Seismic Data Denoising via Deep Learning | | 0
Deep Unlearn: Benchmarking Machine Unlearning | | 0
CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset | | 0
FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | | 0
Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents | | 0
Match Stereo Videos via Bidirectional Alignment | | 0
Benchmarking Adaptive Intelligence and Computer Vision on Human-Robot Collaboration | | 0
ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning | Code | 0
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs | | 0
Constrained Reinforcement Learning for Safe Heat Pump Control | Code | 0
Tracking Everything in Robotic-Assisted Surgery | | 0
GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks | | 0
AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy | | 0
SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | | 0
Data Analysis in the Era of Generative AI | | 0
Constructing Confidence Intervals for 'the' Generalization Error -- a Comprehensive Benchmark Study | Code | 0
CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting | | 0
bnRep: A repository of Bayesian networks from the academic literature | | 0
MCUBench: A Benchmark of Tiny Object Detectors on MCUs | | 0
EarthquakeNPP: Benchmark Datasets for Earthquake Forecasting with Neural Point Processes | | 0
Conformal Prediction: A Theoretical Note and Benchmarking Transductive Node Classification in Graphs | Code | 0
Benchmarking Domain Generalization Algorithms in Computational Pathology | Code | 0
Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices | | 0
Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning | | 0
Omnibenchmark (alpha) for continuous and open benchmarking in bioinformatics | | 0
SEN12-WATER: A New Dataset for Hydrological Applications and its Benchmarking | | 0
Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework | Code | 0
HLB: Benchmarking LLMs' Humanlikeness in Language Use | | 0
Benchmarking Robustness of Endoscopic Depth Estimation with Synthetically Corrupted Data | Code | 0
Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling | | 0
Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation | Code | 0
Page 55 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified