| GRATIS: GeneRAting TIme Series with diverse and controllable characteristics | Mar 7, 2019 | Benchmarking, Clustering | Code Available | 0 | 5 |
| GNNMerge: Merging of GNN Models Without Accessing Training Data | Mar 5, 2025 | Benchmarking, Computational Efficiency | Code Available | 0 | 5 |
| DQI: Measuring Data Quality in NLP | May 2, 2020 | Active Learning, Benchmarking | Code Available | 0 | 5 |
| Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Apr 21, 2025 | Benchmarking | Code Available | 0 | 5 |
| A General Benchmarking Framework for Text Generation | Dec 1, 2020 | Benchmarking, Knowledge Graphs | Code Available | 0 | 5 |
| Global Prediction of COVID-19 Variant Emergence Using Dynamics-Informed Graph Neural Networks | Jan 7, 2024 | Benchmarking, Graph Neural Network | Code Available | 0 | 5 |
| A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric | Jan 22, 2021 | Benchmarking, Sentence | Code Available | 0 | 5 |
| Benchmarking Large Language Model Uncertainty for Prompt Optimization | Sep 16, 2024 | Benchmarking, Diversity | Code Available | 0 | 5 |
| Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems | Oct 8, 2023 | Benchmarking | Code Available | 0 | 5 |
| Geological Inference from Textual Data using Word Embeddings | Apr 10, 2025 | Benchmarking, Word Embeddings | Code Available | 0 | 5 |
| GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Jan 26, 2025 | Benchmarking, Diversity | Code Available | 0 | 5 |
| GOAL: Towards Benchmarking Few-Shot Sports Game Summarization | Jul 18, 2022 | Benchmarking | Code Available | 0 | 5 |
| Flexible Generation of Preference Data for Recommendation Analysis | Jul 23, 2024 | Benchmarking, Recommendation Systems | Code Available | 0 | 5 |
| Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction | May 23, 2023 | Aspect-Based Sentiment Analysis, Aspect-Based Sentiment Analysis (ABSA) | Code Available | 0 | 5 |
| Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments | Feb 20, 2023 | Benchmarking, Robot Navigation | Code Available | 0 | 5 |
| Domain2Vec: Domain Embedding for Unsupervised Domain Adaptation | Jul 17, 2020 | Benchmarking, Disentanglement | Code Available | 0 | 5 |
| Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks | Nov 15, 2023 | Benchmarking, Network Pruning | Code Available | 0 | 5 |
| Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses | May 19, 2023 | Benchmarking, Form | Code Available | 0 | 5 |
| Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M | May 15, 2025 | Benchmarking, Memorization | Code Available | 0 | 5 |
| Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks | Jul 2, 2024 | Activity Prediction, Anomaly Detection | Code Available | 0 | 5 |
| Do LLM Evaluators Prefer Themselves for a Reason? | Apr 4, 2025 | Benchmarking, Code Generation | Code Available | 0 | 5 |
| Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Jan 22, 2025 | Benchmarking | Code Available | 0 | 5 |
| Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Feb 8, 2024 | Benchmarking | Code Available | 0 | 5 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Apr 26, 2025 | Benchmarking, GPU | Code Available | 0 | 5 |
| Strong and Simple Baselines for Multimodal Utterance Embeddings | May 14, 2019 | Benchmarking | Code Available | 0 | 5 |