| Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | May 29, 2025 | BenchmarkingFairness | CodeCode Available | 0 | 5 |
| JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | May 23, 2025 | BenchmarkingDiversity | CodeCode Available | 0 | 5 |
| A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches | May 22, 2023 | BenchmarkingClassification | CodeCode Available | 0 | 5 |
| Certifiable Black-Box Attacks with Randomized Adversarial Examples: Breaking Defenses with Provable Confidence | Apr 10, 2023 | Benchmarkingspeech-recognition | CodeCode Available | 0 | 5 |
| CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines | Jun 20, 2024 | BenchmarkingDecision Making | CodeCode Available | 0 | 5 |
| Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation | Apr 5, 2024 | AttributeBenchmarking | CodeCode Available | 0 | 5 |
| ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Oct 22, 2024 | BenchmarkingSelf-Supervised Learning | CodeCode Available | 0 | 5 |
| DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs | Apr 10, 2024 | Benchmarkingknowledge editing | CodeCode Available | 0 | 5 |
| Joint Multi-Scale Tone Mapping and Denoising for HDR Image Enhancement | Mar 16, 2023 | BenchmarkingDemosaicking | CodeCode Available | 0 | 5 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | May 21, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 0 | 5 |