SOTAVerified

Benchmarking

Papers

Showing 2551-2575 of 5548 papers

Title | Status | Hype
Coherent Feed Forward Quantum Neural Network | - | 0
We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline | Code | 1
Benchmarking Transferable Adversarial Attacks | Code | 1
Benchmarking Sensitivity of Continual Graph Learning for Skeleton-Based Action Recognition | - | 0
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Code | 4
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Code | 0
Explainable Benchmarking for Iterative Optimization Heuristics | Code | 1
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios | Code | 2
Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Code | 1
Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets | Code | 1
ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks | Code | 0
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | - | 0
PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models | Code | 0
Benchmarking with MIMIC-IV, an irregular, spare clinical time series dataset | - | 0
SAM-based instance segmentation models for the automation of structural damage detection | - | 0
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries | Code | 3
Biological Valuation Map of Flanders: A Sentinel-2 Imagery Analysis | - | 0
Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs | - | 0
Automated legal reasoning with discretion to act using s(LAW) | - | 0
TriSAM: Tri-Plane SAM for zero-shot cortical blood vessel segmentation in VEM images | - | 0
Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception | Code | 1
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval | Code | 1
Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding | - | 0
Benchmarking the Fairness of Image Upsampling Methods | Code | 0
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents | Code | 3

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified