SOTAVerified

Benchmarking

Papers

Showing 1701–1750 of 5548 papers

Title | Status | Hype
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents |  | 0
A practical generalization metric for deep networks benchmarking |  | 0
Revisiting Safe Exploration in Safe Reinforcement learning |  | 0
Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification |  | 0
Towards Student Actions in Classroom Scenes: New Dataset and Baseline | Code | 1
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems | Code | 3
Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages |  | 0
Accelerating the discovery of steady-states of planetary interior dynamics with machine learning |  | 0
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists | Code | 0
Understanding the User: An Intent-Based Ranking Dataset |  | 0
STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models | Code | 1
Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization | Code | 0
How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Code | 1
Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction |  | 0
Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions | Code | 2
Benchmarking foundation models as feature extractors for weakly-supervised computational pathology |  | 0
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models | Code | 1
Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games |  | 0
Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis |  | 0
Applications in CityLearn Gym Environment for Multi-Objective Control Benchmarking in Grid-Interactive Buildings and Districts |  | 0
FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting | Code | 0
Cross-subject Brain Functional Connectivity Analysis for Multi-task Cognitive State Evaluation |  | 0
BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and Localization |  | 0
Benchmarking Reinforcement Learning Methods for Dexterous Robotic Manipulation with a Three-Fingered Gripper |  | 0
VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities | Code | 0
Comparative Analysis: Violence Recognition from Videos using Transfer Learning | Code | 0
Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study |  | 0
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences |  | 0
DHP Benchmark: Are LLMs Good NLG Evaluators? |  | 0
Data Augmentation for Continual RL via Adversarial Gradient Episodic Memory |  | 0
Variational Autoencoder for Anomaly Detection: A Comparative Study | Code | 1
No Dataset Needed for Downstream Knowledge Benchmarking: Response Dispersion Inversely Correlates with Accuracy on Domain-specific QA |  | 0
Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection |  | 0
S3Simulator: A benchmarking Side Scan Sonar Simulator dataset for Underwater Image Analysis | Code | 0
Open Llama2 Model for the Lithuanian Language |  | 0
Benchmarking Counterfactual Interpretability in Deep Learning Models for Time Series Classification |  | 0
MultiMed: Massively Multimodal and Multitask Medical Understanding |  | 0
Extraction of Research Objectives, Machine Learning Model Names, and Dataset Names from Academic Papers and Analysis of Their Interrelationships Using LLM and Network Analysis |  | 0
Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets | Code | 1
Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures |  | 0
WCEbleedGen: A wireless capsule endoscopy dataset and its benchmarking for automatic bleeding classification, detection, and segmentation | Code | 0
Advances in Preference-based Reinforcement Learning: A Review |  | 0
SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins | Code | 0
WeQA: A Benchmark for Retrieval Augmented Generation in Wind Energy Domain |  | 0
ISLES'24: Improving final infarct prediction in ischemic stroke using multimodal imaging and clinical data |  | 0
UKAN: Unbound Kolmogorov-Arnold Network Accompanied with Accelerated Library |  | 0
Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0
PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis | Code | 2
RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands |  | 0
QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning |  | 0
Page 35 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified