| Large-scale Ridesharing DARP Instances Based on Real Travel Demand | May 30, 2023 | Benchmarking | Code Available | 0 |
| Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement | May 26, 2025 | Benchmarking | Code Available | 0 |
| JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Feb 16, 2025 | Benchmarking, GPU | Code Available | 0 |
| Anchor Points: Benchmarking Models with Much Fewer Examples | Sep 14, 2023 | Benchmarking, Language Modeling | Code Available | 0 |
| Laughing Heads: Can Transformers Detect What Makes a Sentence Funny? | May 19, 2021 | Benchmarking, Sentence | Code Available | 0 |
| THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models | Sep 17, 2024 | Benchmarking, Binary Classification | Code Available | 0 |
| JATE 2.0: Java Automatic Term Extraction with Apache Solr | May 1, 2016 | Benchmarking, Term Extraction | Code Available | 0 |
| JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | May 23, 2025 | Benchmarking, Diversity | Code Available | 0 |
| Calibrated Adaptive Probabilistic ODE Solvers | Dec 15, 2020 | Benchmarking, Descriptive | Code Available | 0 |
| Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | May 29, 2025 | Benchmarking, Fairness | Code Available | 0 |