SOTAVerified

Benchmarking

Papers

Showing 1701–1750 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules | Code | 0 |
| Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification | Code | 0 |
| Learn How to Query from Unlabeled Data Streams in Federated Learning | Code | 0 |
| Light Field Saliency Detection with Deep Convolutional Networks | Code | 0 |
| Machine learning classification of non-Markovian noise disturbing quantum dynamics | Code | 0 |
| KhabarChin: Automatic Detection of Important News in the Persian Language | Code | 0 |
| A Benchmarking Study of Vision-based Robotic Grasping Algorithms | Code | 0 |
| Knowing-how & Knowing-that: A New Task for Machine Comprehension of User Manuals | Code | 0 |
| Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments | Code | 0 |
| Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs | Code | 0 |
| PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series | Code | 0 |
| KArSL: Arabic Sign Language Database | Code | 0 |
| Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning | Code | 0 |
| KamNet: An Integrated Spatiotemporal Deep Neural Network for Rare Event Search in KamLAND-Zen | Code | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | Code | 0 |
| Knowledge-Driven Slot Constraints for Goal-Oriented Dialogue Systems | Code | 0 |
| Joint Multi-Scale Tone Mapping and Denoising for HDR Image Enhancement | Code | 0 |
| Anchor Points: Benchmarking Models with Much Fewer Examples | Code | 0 |
| An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0 |
| VitaGraph: Building a Knowledge Graph for Biologically Relevant Learning Tasks | Code | 0 |
| JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Code | 0 |
| Capsule Vision 2024 Challenge: Multi-Class Abnormality Classification for Video Capsule Endoscopy | Code | 0 |
| Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | Code | 0 |
| Cityscape-Adverse: Benchmarking Robustness of Semantic Segmentation with Realistic Scene Modifications via Diffusion-Based Image Editing | Code | 0 |
| An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science | Code | 0 |
| Benchmarking AutoML algorithms on a collection of synthetic classification problems | Code | 0 |
| JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | Code | 0 |
| Can Tree Based Approaches Surpass Deep Learning in Anomaly Detection? A Benchmarking Study | Code | 0 |
| DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs | Code | 0 |
| JATE 2.0: Java Automatic Term Extraction with Apache Solr | Code | 0 |
| Knowledge Enhanced Conditional Imputation for Healthcare Time-series | Code | 0 |
| IoT Data Trust Evaluation via Machine Learning | Code | 0 |
| IPC: A Benchmark Data Set for Learning with Graph-Structured Data | Code | 0 |
| Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench | Code | 0 |
| InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions | Code | 0 |
| IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Code | 0 |
| ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Code | 0 |
| Inverse Contextual Bandits: Learning How Behavior Evolves over Time | Code | 0 |
| Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance | Code | 0 |
| Can geometric combinatorics improve RNA branching predictions? | Code | 0 |
| Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM | Code | 0 |
| Air Learning: A Deep Reinforcement Learning Gym for Autonomous Aerial Robot Visual Navigation | Code | 0 |
| Can a single neuron learn predictive uncertainty? | Code | 0 |
| Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim Evidence Reasoning | Code | 0 |
| Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models | Code | 0 |
| Integrating Expert Knowledge into Logical Programs via LLMs | Code | 0 |
| JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models | Code | 0 |
| Analyzing the Feature Extractor Networks for Face Image Synthesis | Code | 0 |
| InstaIndoor and Multi-modal Deep Learning for Indoor Scene Recognition | Code | 0 |
| Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |