SOTAVerified

Benchmarking

Papers

Showing 1701–1725 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments | Code | 0 |
| Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification | Code | 0 |
| Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs | Code | 0 |
| PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series | Code | 0 |
| DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs | Code | 0 |
| Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning | Code | 0 |
| Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | Code | 0 |
| ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Code | 0 |
| Anchor Points: Benchmarking Models with Much Fewer Examples | Code | 0 |
| Benchmarking a transformer-FREE model for ad-hoc retrieval | Code | 0 |
| An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0 |
| IoT Data Trust Evaluation via Machine Learning | Code | 0 |
| VitaGraph: Building a Knowledge Graph for Biologically Relevant Learning Tasks | Code | 0 |
| IPC: A Benchmark Data Set for Learning with Graph-Structured Data | Code | 0 |
| Capsule Vision 2024 Challenge: Multi-Class Abnormality Classification for Video Capsule Endoscopy | Code | 0 |
| Learning collective multi-cellular dynamics from temporal scRNA-seq via a transformer-enhanced Neural SDE | Code | 0 |
| InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions | Code | 0 |
| An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science | Code | 0 |
| Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance | Code | 0 |
| Can Tree Based Approaches Surpass Deep Learning in Anomaly Detection? A Benchmarking Study | Code | 0 |
| Inverse Contextual Bandits: Learning How Behavior Evolves over Time | Code | 0 |
| CityNet: A Comprehensive Multi-Modal Urban Dataset for Advanced Research in Urban Computing | Code | 0 |
| City-Scale Road Audit System using Deep Learning | Code | 0 |
| Cityscape-Adverse: Benchmarking Robustness of Semantic Segmentation with Realistic Scene Modifications via Diffusion-Based Image Editing | Code | 0 |
| IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |