SOTAVerified

Benchmarking

Papers

Showing 1676-1700 of 5548 papers

Title | Status | Hype
KamNet: An Integrated Spatiotemporal Deep Neural Network for Rare Event Search in KamLAND-Zen | Code | 0
Benchmarking and Improving Text-to-SQL Generation under Ambiguity | Code | 0
An Efficient Two-stage Gradient Boosting Framework for Short-term Traffic State Estimation | Code | 0
Joint Multi-Scale Tone Mapping and Denoising for HDR Image Enhancement | Code | 0
KArSL: Arabic Sign Language Database | Code | 0
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | Code | 0
A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches | Code | 0
JATE 2.0: Java Automatic Term Extraction with Apache Solr | Code | 0
Certifiable Black-Box Attacks with Randomized Adversarial Examples: Breaking Defenses with Provable Confidence | Code | 0
DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs | Code | 0
CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines | Code | 0
Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation | Code | 0
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | Code | 0
JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Code | 0
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese | Code | 0
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | Code | 0
Large-scale Ridesharing DARP Instances Based on Real Travel Demand | Code | 0
IPC: A Benchmark Data Set for Learning with Graph-Structured Data | Code | 0
ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Code | 0
IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Code | 0
A Benchmarking Study of Vision-based Robotic Grasping Algorithms | Code | 0
IoT Data Trust Evaluation via Machine Learning | Code | 0
Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments | Code | 0
Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs | Code | 0
PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified