| Understanding Long Videos with Multimodal Language Models | Mar 25, 2024 | Action Recognition, Fine-grained Action Recognition | Code Available | 2 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Feb 23, 2024 | Benchmarking, Multiple-choice | Code Available | 2 |
| tinyBenchmarks: evaluating LLMs with fewer examples | Feb 22, 2024 | MMLU, Multiple-choice | Code Available | 2 |
| CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Feb 12, 2024 | General Knowledge, Multiple-choice | Code Available | 2 |
| SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Feb 7, 2024 | Diversity, Multiple-choice | Code Available | 2 |
| Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models | Jan 27, 2024 | Medical Question Answering, Multiple-choice | Code Available | 2 |
| Steering Llama 2 via Contrastive Activation Addition | Dec 9, 2023 | Multiple-choice | Code Available | 2 |
| Biomedical knowledge graph-optimized prompt generation for large language models | Nov 29, 2023 | Benchmarking, Knowledge Graphs | Code Available | 2 |
| SEED-Bench-2: Benchmarking Multimodal Large Language Models | Nov 28, 2023 | Benchmarking, Image Generation | Code Available | 2 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA), Diagnostic | Code Available | 2 |