SOTAVerified

Benchmarking

Papers

Showing 2426–2450 of 5548 papers

Title | Status | Hype
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction | Code | 0
Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments | Code | 0
Learned Bayesian Cramér-Rao Bound for Unknown Measurement Models Using Score Neural Networks | Code | 0
Learn How to Query from Unlabeled Data Streams in Federated Learning | Code | 0
Learning Adaptive Discriminative Correlation Filters via Temporal Consistency Preserving Spatial Feature Selection for Robust Visual Tracking | Code | 0
Geological Inference from Textual Data using Word Embeddings | Code | 0
GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Code | 0
Domain2Vec: Domain Embedding for Unsupervised Domain Adaptation | Code | 0
Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks | Code | 0
Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M | Code | 0
Do LLM Evaluators Prefer Themselves for a Reason? | Code | 0
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Code | 0
Flexible Generation of Preference Data for Recommendation Analysis | Code | 0
Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Code | 0
Graph Convolutional Networks Meet with High Dimensionality Reduction | Code | 0
Hierarchical Neural Networks for Sequential Sentence Classification in Medical Scientific Abstracts | Code | 0
Strong and Simple Baselines for Multimodal Utterance Embeddings | Code | 0
Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Code | 0
DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models | Code | 0
Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0
Benchmarking Large Language Models for Image Classification of Marine Mammals | Code | 0
Divergent Creativity in Humans and Large Language Models | Code | 0
Generalization and Regularization in DQN | Code | 0
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified