SOTAVerified

Benchmarking

Papers

Showing 1701-1725 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| From Grounding to Planning: Benchmarking Bottlenecks in Web Agents | | 0 |
| A practical generalization metric for deep networks benchmarking | | 0 |
| Towards Student Actions in Classroom Scenes: New Dataset and Baseline | Code | 1 |
| Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification | | 0 |
| Revisiting Safe Exploration in Safe Reinforcement learning | | 0 |
| ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems | Code | 3 |
| Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages | | 0 |
| Accelerating the discovery of steady-states of planetary interior dynamics with machine learning | | 0 |
| Understanding the User: An Intent-Based Ranking Dataset | | 0 |
| SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists | Code | 0 |
| STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models | Code | 1 |
| Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization | Code | 0 |
| How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Code | 1 |
| Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction | | 0 |
| Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions | Code | 2 |
| Benchmarking foundation models as feature extractors for weakly-supervised computational pathology | | 0 |
| LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models | Code | 1 |
| Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games | | 0 |
| Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | | 0 |
| Applications in CityLearn Gym Environment for Multi-Objective Control Benchmarking in Grid-Interactive Buildings and Districts | | 0 |
| BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and Localization | | 0 |
| VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities | Code | 0 |
| Benchmarking Reinforcement Learning Methods for Dexterous Robotic Manipulation with a Three-Fingered Gripper | | 0 |
| Cross-subject Brain Functional Connectivity Analysis for Multi-task Cognitive State Evaluation | | 0 |
| FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting | Code | 0 |
Page 69 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |