SOTAVerified

Benchmarking

Papers

Showing 2551-2600 of 5548 papers

Title | Status | Hype
Coherent Feed Forward Quantum Neural Network | - | 0
Benchmarking Transferable Adversarial Attacks | Code | 1
We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline | Code | 1
Benchmarking Sensitivity of Continual Graph Learning for Skeleton-Based Action Recognition | - | 0
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Code | 4
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Code | 0
Explainable Benchmarking for Iterative Optimization Heuristics | Code | 1
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios | Code | 2
Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Code | 1
ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks | Code | 0
Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets | Code | 1
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | - | 0
PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models | Code | 0
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries | Code | 3
SAM-based instance segmentation models for the automation of structural damage detection | - | 0
Benchmarking with MIMIC-IV, an irregular, spare clinical time series dataset | - | 0
Biological Valuation Map of Flanders: A Sentinel-2 Imagery Analysis | - | 0
Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs | - | 0
Automated legal reasoning with discretion to act using s(LAW) | - | 0
TriSAM: Tri-Plane SAM for zero-shot cortical blood vessel segmentation in VEM images | - | 0
Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception | Code | 1
Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding | - | 0
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval | Code | 1
Benchmarking the Fairness of Image Upsampling Methods | Code | 0
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents | Code | 3
What the Weight?! A Unified Framework for Zero-Shot Knowledge Composition | Code | 0
LLpowershap: Logistic Loss-based Automated Shapley Values Feature Selection Method | Code | 0
Benchmarking LLMs via Uncertainty Quantification | Code | 3
Deep Neural Network Benchmarks for Selective Classification | Code | 0
Subgroup analysis methods for time-to-event outcomes in heterogeneous randomized controlled trials | Code | 0
A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation | Code | 3
Benchmarking Large Multimodal Models against Common Corruptions | Code | 1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Code | 1
Data-Driven Target Localization: Benchmarking Gradient Descent Using the Cramer-Rao Bound | - | 0
Data Augmentation for Traffic Classification | - | 0
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | Code | 2
WAVES: Benchmarking the Robustness of Image Watermarks | Code | 2
NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription | - | 0
Harnessing Orthogonality to Train Low-Rank Neural Networks | Code | 0
Large Language Models are Null-Shot Learners | - | 0
TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding | - | 0
OpenDPD: An Open-Source End-to-End Learning & Benchmarking Framework for Wideband Power Amplifier Modeling and Digital Pre-Distortion | - | 0
Authorship Obfuscation in Multilingual Machine-Generated Text Detection | Code | 2
RSUD20K: A Dataset for Road Scene Understanding In Autonomous Driving | Code | 1
A Reinforcement Learning Environment for Directed Quantum Circuit Synthesis | - | 0
Lifelogging As An Extreme Form of Personal Information Management -- What Lessons To Learn | - | 0
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks | Code | 2
Knowledge Sharing in Manufacturing using Large Language Models: User Evaluation and Model Benchmarking | - | 0
Latency-aware Road Anomaly Segmentation in Videos: A Photorealistic Dataset and New Metrics | - | 0
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | Code | 7

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified