SOTAVerified

Benchmarking

Papers

Showing 4701-4750 of 5548 papers

Title | Status | Hype
Hard-Label Cryptanalytic Extraction of Neural Network Models | Code | 0
Dynamic Neighborhood Construction for Structured Large Discrete Action Spaces | Code | 0
Benchmarking Top-K Keyword and Top-K Document Processing with T^2K^2 and T^2K^2D^2 | Code | 0
HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios | Code | 0
MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators | Code | 0
MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | Code | 0
Benchmarking tools for a priori identifiability analysis | Code | 0
MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book | Code | 0
Benchmarking time series classification -- Functional data vs machine learning approaches | Code | 0
Benchmarking the Robustness of UAV Tracking Against Common Corruptions | Code | 0
Roughness Index and Roughness Distance for Benchmarking Medical Segmentation | Code | 0
The KANDY Benchmark: Incremental Neuro-Symbolic Learning and Reasoning with Kandinsky Patterns | Code | 0
MEDFAIR: Benchmarking Fairness for Medical Imaging | Code | 0
Benchmarking the Robustness of Optical Flow Estimation to Corruptions | Code | 0
Adaptive Power System Emergency Control using Deep Reinforcement Learning | Code | 0
Benchmarking the Linear Algebra Awareness of TensorFlow and PyTorch | Code | 0
gym-gazebo2, a toolkit for reinforcement learning using ROS 2 and Gazebo | Code | 0
Benchmarking the Hooke-Jeeves Method, MTS-LS1, and BSrr on the Large-scale BBOB Function Set | Code | 0
Guidelines and Benchmarks for Deployment of Deep Learning Models on Smartphones as Real-Time Apps | Code | 0
Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Code | 0
The KiTS19 Challenge Data: 300 Kidney Tumor Cases with Clinical Context, CT Semantic Segmentations, and Surgical Outcomes | Code | 0
RTSeg: Real-time Semantic Segmentation Comparative Study | Code | 0
Meet Spinky: An Open-Source Spindle and K-Complex Detection Toolbox Validated on the Open-Access Montreal Archive of Sleep Studies (MASS). | Code | 0
Benchmarking the Hill-Valley Evolutionary Algorithm for the GECCO 2018 Competition on Niching Methods Multimodal Optimization | Code | 0
Grounded Intuition of GPT-Vision's Abilities with Scientific Images | Code | 0
GRATIS: GeneRAting TIme Series with diverse and controllable characteristics | Code | 0
Understanding the World's Museums through Vision-Language Reasoning | Code | 0
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models | Code | 0
Grasp Pre-shape Selection by Synthetic Training: Eye-in-hand Shared Control on the Hannes Prosthesis | Code | 0
Benchmarking the Fairness of Image Upsampling Methods | Code | 0
Graph-theoretical approach to robust 3D normal extraction of LiDAR data | Code | 0
A Modular Workflow for Performance Benchmarking of Neuronal Network Simulations | Code | 0
Messing Up 3D Virtual Environments: Transferable Adversarial 3D Objects | Code | 0
Graph Neural Networks Are More Than Filters: Revisiting and Benchmarking from A Spectral Perspective | Code | 0
Meta-Black-Box-Optimization through Offline Q-function Learning | Code | 0
Learning Conjoint Attentions for Graph Neural Nets | Code | 0
Graph Convolutional Networks Meet with High Dimensionality Reduction | Code | 0
Benchmarking the Attribution Quality of Vision Models | Code | 0
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs | Code | 0
GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking | Code | 0
MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication | Code | 0
S3Simulator: A benchmarking Side Scan Sonar Simulator dataset for Underwater Image Analysis | Code | 0
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Code | 0
GOAL: Towards Benchmarking Few-Shot Sports Game Summarization | Code | 0
SACoD: Sensor Algorithm Co-Design Towards Efficient CNN-powered Intelligent PhlatCam | Code | 0
GNNMerge: Merging of GNN Models Without Accessing Training Data | Code | 0
Meta-survey on outlier and anomaly detection | Code | 0
The Legal Argument Reasoning Task in Civil Procedure | Code | 0
A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning | Code | 0
Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified