| VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning | Mar 19, 2024 | BenchmarkingImage Captioning | CodeCode Available | 2 |
| InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Mar 5, 2024 | BenchmarkingLanguage Modeling | CodeCode Available | 2 |
| SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis | Mar 4, 2024 | BenchmarkingDrug Discovery | CodeCode Available | 2 |
| REAL-Colon: A dataset for developing real-world AI applications in colonoscopy | Mar 4, 2024 | Benchmarking | CodeCode Available | 2 |
| Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks | Feb 29, 2024 | BenchmarkingDisentanglement | CodeCode Available | 2 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Feb 23, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 2 |
| CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Feb 19, 2024 | BenchmarkingInterpretability Techniques for Deep Learning | CodeCode Available | 2 |
| Class-incremental Learning for Time Series: Benchmark and Evaluation | Feb 19, 2024 | Activity RecognitionBenchmarking | CodeCode Available | 2 |
| Event-Based Motion Magnification | Feb 19, 2024 | BenchmarkingMotion Detection | CodeCode Available | 2 |
| Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | Feb 18, 2024 | Benchmarking | CodeCode Available | 2 |