| NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens | Mar 18, 2024 | Benchmarking, Question Answering | Code Available | 1 |
| Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Mar 18, 2024 | Benchmarking, object-detection | Code Available | 1 |
| An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models | Mar 15, 2024 | Benchmarking, Drug Discovery | Code Available | 1 |
| Histo-Genomic Knowledge Distillation For Cancer Prognosis From Histopathology Whole Slide Images | Mar 15, 2024 | Benchmarking, Knowledge Distillation | Code Available | 1 |
| Leveraging Foundation Models for Content-Based Medical Image Retrieval in Radiology | Mar 11, 2024 | Benchmarking, Content-Based Image Retrieval | Code Available | 1 |
| Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages | Mar 11, 2024 | Benchmarking, Data Augmentation | Code Available | 1 |
| Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark | Mar 9, 2024 | Benchmarking, Fairness | Code Available | 1 |
| Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents | Mar 8, 2024 | Benchmarking, Decision Making | Code Available | 1 |
| Benchmarking Micro-action Recognition: Dataset, Methods, and Applications | Mar 8, 2024 | Action Recognition, Benchmarking | Code Available | 1 |
| R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations | Mar 7, 2024 | Benchmarking | Code Available | 1 |
| Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation | Mar 7, 2024 | Benchmarking, Multimodal Recommendation | Code Available | 1 |
| Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Mar 2, 2024 | Attribute, Benchmarking | Code Available | 1 |
| TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs | Mar 1, 2024 | Benchmarking | Code Available | 1 |
| Efficient Lifelong Model Evaluation in an Era of Rapid Progress | Feb 29, 2024 | Benchmarking, GPU | Code Available | 1 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| Beacon, a lightweight deep reinforcement learning benchmark library for flow control | Feb 27, 2024 | Benchmarking, CPU | Code Available | 1 |
| Benchmarking Data Science Agents | Feb 27, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Feb 27, 2024 | Benchmarking | Code Available | 1 |
| PST-Bench: Tracing and Benchmarking the Source of Publications | Feb 25, 2024 | Benchmarking | Code Available | 1 |
| API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Feb 23, 2024 | Benchmarking, slot-filling | Code Available | 1 |
| CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Feb 22, 2024 | Benchmarking | Code Available | 1 |
| Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Feb 21, 2024 | Adversarial Robustness, Benchmarking | Code Available | 1 |
| The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning | Feb 21, 2024 | Benchmarking, Representation Learning | Code Available | 1 |
| CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Feb 20, 2024 | Atomic number classification, Benchmarking | Code Available | 1 |
| Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Feb 18, 2024 | Benchmarking, Language Modeling | Code Available | 1 |