SOTAVerified

Benchmarking

Papers

Showing 2451-2500 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Code | 0 |
| EvoLearner: Learning Description Logics with Evolutionary Algorithms | Code | 0 |
| Graph Convolutional Networks Meet with High Dimensionality Reduction | Code | 0 |
| Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Code | 0 |
| gym-gazebo2, a toolkit for reinforcement learning using ROS 2 and Gazebo | Code | 0 |
| Strong and Simple Baselines for Multimodal Utterance Embeddings | Code | 0 |
| GOAL: Towards Benchmarking Few-Shot Sports Game Summarization | Code | 0 |
| Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Code | 0 |
| GNNMerge: Merging of GNN Models Without Accessing Training Data | Code | 0 |
| DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models | Code | 0 |
| LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping | Code | 0 |
| Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0 |
| Benchmarking Large Language Models for Image Classification of Marine Mammals | Code | 0 |
| Divergent Creativity in Humans and Large Language Models | Code | 0 |
| Global Prediction of COVID-19 Variant Emergence Using Dynamics-Informed Graph Neural Networks | Code | 0 |
| Distributional Depth-Based Estimation of Object Articulation Models | Code | 0 |
| Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation | Code | 0 |
| A Framework for Generating Informative Benchmark Instances | Code | 0 |
| Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection | Code | 0 |
| Experimental Analysis of Large-scale Learnable Vector Storage Compression | Code | 0 |
| Benchmarking Parameter Control Methods in Differential Evolution for Mixed-Integer Black-Box Optimization | Code | 0 |
| AI-generated Image Quality Assessment in Visual Communication | Code | 0 |
| Geological Inference from Textual Data using Word Embeddings | Code | 0 |
| GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Code | 0 |
| AstroVision: Towards Autonomous Feature Detection and Description for Missions to Small Bodies Using Deep Learning | Code | 0 |
| Machine learning classification of non-Markovian noise disturbing quantum dynamics | Code | 0 |
| Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Code | 0 |
| A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Code | 0 |
| Distributed Non-Convex Optimization with Sublinear Speedup under Intermittent Client Availability | Code | 0 |
| Flexible Generation of Preference Data for Recommendation Analysis | Code | 0 |
| Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI | Code | 0 |
| Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Code | 0 |
| Benchmarking Large Language Models for Molecule Prediction Tasks | Code | 0 |
| DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | Code | 0 |
| Are Large Language Models Good at Utility Judgments? | Code | 0 |
| Benchmarking performance of object detection under image distortions in an uncontrolled environment | Code | 0 |
| DispaRisk: Auditing Fairness Through Usable Information | Code | 0 |
| A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Code | 0 |
| Exploring Context Generalizability in Citywide Crowd Mobility Prediction: An Analytic Framework and Benchmark | Code | 0 |
| Benchmarking Perturbation-based Saliency Maps for Explaining Atari Agents | Code | 0 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0 |
| GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking | Code | 0 |
| Exploring Model-based Planning with Policy Networks | Code | 0 |
| GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0 |
| Benchmarking Language-agnostic Intent Classification for Virtual Assistant Platforms | Code | 0 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Code | 0 |
| A Recipe for CAC: Mosaic-based Generalized Loss for Improved Class-Agnostic Counting | Code | 0 |
| Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Code | 0 |
| Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |