SOTAVerified

Benchmarking

Papers

Showing 2376-2400 of 5548 papers

Title | Status | Hype
DyKgChat: Benchmarking Dialogue Generation Grounding on Dynamic Knowledge Graphs | Code | 0
Benchmarking Linguistic Diversity of Large Language Models | Code | 0
GOAL: Towards Benchmarking Few-Shot Sports Game Summarization | Code | 0
GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking | Code | 0
IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Code | 0
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition | Code | 0
Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation | Code | 0
GNNMerge: Merging of GNN Models Without Accessing Training Data | Code | 0
Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? | Code | 0
GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Code | 0
Global Prediction of COVID-19 Variant Emergence Using Dynamics-Informed Graph Neural Networks | Code | 0
DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations | Code | 0
Benchmarking Learning Efficiency in Deep Reservoir Computing | Code | 0
Geological Inference from Textual Data using Word Embeddings | Code | 0
Flexible Generation of Preference Data for Recommendation Analysis | Code | 0
DQI: Measuring Data Quality in NLP | Code | 0
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Code | 0
A General Benchmarking Framework for Text Generation | Code | 0
A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric | Code | 0
Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0
Benchmarking Large Language Model Uncertainty for Prompt Optimization | Code | 0
Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems | Code | 0
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction | Code | 0
Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments | Code | 0
GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified