SOTAVerified

Benchmarking

Papers

Showing 3201-3250 of 5548 papers

Title | Status | Hype
Benchmarking projective simulation in navigation problems | | 0
Benchmarking Processor Performance by Multi-Threaded Machine Learning Algorithms | | 0
JuStRank: Benchmarking LLM Judges for System Ranking | | 0
Benchmarking Pretrained Vision Embeddings for Near- and Duplicate Detection in Medical Images | | 0
Aerial Scene Parsing: From Tile-level Scene Classification to Pixel-wise Semantic Labeling | | 0
AERF: Adaptive ensemble random fuzzy algorithm for anomaly detection in cloud computing | | 0
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models | | 0
Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks | | 0
KemenkeuGPT: Leveraging a Large Language Model on Indonesia's Government Financial Data and Regulations to Enhance Decision Making | | 0
Keras Sig: Efficient Path Signature Computation on GPU in Keras 3 | | 0
KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers | | 0
Benchmarking Pretrained Attention-based Models for Real-Time Recognition in Robot-Assisted Esophagectomy | | 0
Classification of Single-View Object Point Clouds | | 0
Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design | | 0
Benchmarking Post-Hoc Unknown-Category Detection in Food Recognition | | 0
Benchmarking Poisoning Attacks against Retrieval-Augmented Generation | | 0
Benchmarking person re-identification approaches and training datasets for practical real-world implementations | | 0
Deep Reinforcement Learning for Dynamic Order Picking in Warehouse Operations | | 0
Knowledge-aware contrastive heterogeneous molecular graph learning | | 0
AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning | | 0
TIIF-Bench: How Does Your T2I Model Follow Your Instructions? | | 0
Knowledge Sharing in Manufacturing using Large Language Models: User Evaluation and Model Benchmarking | | 0
3D Compositional Zero-shot Learning with DeCompositional Consensus | | 0
Benchmarking Performance of Deep Learning Model for Material Segmentation on Two HPC Systems | | 0
Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | | 0
Benchmarking Pedestrian Odometry: The Brown Pedestrian Odometry Dataset (BPOD) | | 0
Benchmarking PathCLIP for Pathology Image Analysis | | 0
Kolmogorov-Arnold Network for Transistor Compact Modeling | | 0
Koopman Theory-Inspired Method for Learning Time Advancement Operators in Unstable Flame Front Evolution | | 0
Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex | | 0
KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models | | 0
KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | | 0
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences | | 0
Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection | | 0
Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks | | 0
L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi | | 0
L3 Fusion: Fast Transformed Convolutions on CPUs | | 0
Advocating Character Error Rate for Multilingual ASR Evaluation | | 0
Label Anchored Contrastive Learning for Language Understanding | | 0
Comparison of Open-Source and Proprietary LLMs for Machine Reading Comprehension: A Practical Analysis for Industrial Applications | | 0
Label-Efficient Point Cloud Semantic Segmentation: An Active Learning Approach | | 0
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models | | 0
AI Cyber Risk Benchmark: Automated Exploitation Capabilities | | 0
λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics | | 0
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | | 0
Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection | | 0
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama | | 0
Benchmarking Online Sequence-to-Sequence and Character-based Handwriting Recognition from IMU-Enhanced Pens | | 0
Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time | | 0
Benchmarking Online Object Trackers for Underwater Robot Position Locking Applications | | 0
Page 65 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified