SOTAVerified

Benchmarking

Papers

Showing 2351-2400 of 5548 papers

Title | Status | Hype
HATE-ITA: New Baselines for Hate Speech Detection in Italian | Code | 0
A Collection of Quality Diversity Optimization Problems Derived from Hyperparameter Optimization of Machine Learning Models | Code | 0
HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios | Code | 0
An Evaluation of Machine Learning Approaches for Early Diagnosis of Autism Spectrum Disorder | Code | 0
A Review of Testing Object-Based Environment Perception for Safe Automated Driving | Code | 0
Dynamic Neighborhood Construction for Structured Large Discrete Action Spaces | Code | 0
Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique | Code | 0
DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning | Code | 0
Hard-Label Cryptanalytic Extraction of Neural Network Models | Code | 0
IdeaBench: Benchmarking Large Language Models for Research Idea Generation | Code | 0
DynamoRep: Trajectory-Based Population Dynamics for Classification of Black-box Optimization Problems | Code | 0
Effective Stabilized Self-Training on Few-Labeled Graph Data | Code | 0
Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Code | 0
A Deep Reinforcement Learning Framework for Dynamic Portfolio Optimization: Evidence from China's Stock Market | Code | 0
Grasp Pre-shape Selection by Synthetic Training: Eye-in-hand Shared Control on the Hannes Prosthesis | Code | 0
GRATIS: GeneRAting TIme Series with diverse and controllable characteristics | Code | 0
Grounded Intuition of GPT-Vision's Abilities with Scientific Images | Code | 0
Guidelines and Benchmarks for Deployment of Deep Learning Models on Smartphones as Real-Time Apps | Code | 0
Graph Neural Networks Are More Than Filters: Revisiting and Benchmarking from A Spectral Perspective | Code | 0
Learning Conjoint Attentions for Graph Neural Nets | Code | 0
Benchmarking LLM-based Relevance Judgment Methods | Code | 0
Graph Convolutional Networks Meet with High Dimensionality Reduction | Code | 0
Inverse Contextual Bandits: Learning How Behavior Evolves over Time | Code | 0
Graph-theoretical approach to robust 3D normal extraction of LiDAR data | Code | 0
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Code | 0
DyKgChat: Benchmarking Dialogue Generation Grounding on Dynamic Knowledge Graphs | Code | 0
Benchmarking Linguistic Diversity of Large Language Models | Code | 0
GOAL: Towards Benchmarking Few-Shot Sports Game Summarization | Code | 0
GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking | Code | 0
IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Code | 0
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition | Code | 0
Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation | Code | 0
GNNMerge: Merging of GNN Models Without Accessing Training Data | Code | 0
Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? | Code | 0
GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Code | 0
Global Prediction of COVID-19 Variant Emergence Using Dynamics-Informed Graph Neural Networks | Code | 0
DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations | Code | 0
Benchmarking Learning Efficiency in Deep Reservoir Computing | Code | 0
Geological Inference from Textual Data using Word Embeddings | Code | 0
Flexible Generation of Preference Data for Recommendation Analysis | Code | 0
DQI: Measuring Data Quality in NLP | Code | 0
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Code | 0
A General Benchmarking Framework for Text Generation | Code | 0
A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric | Code | 0
Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0
Benchmarking Large Language Model Uncertainty for Prompt Optimization | Code | 0
Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems | Code | 0
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction | Code | 0
Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments | Code | 0
GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0
Page 48 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified