| Multi-Head RAG: Solving Multi-Aspect Problems with LLMs | Jun 7, 2024 | Benchmarking, Decoder | Code Available | 3 |
| WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild | Jun 7, 2024 | Benchmarking, Chatbot | Code Available | 3 |
| MLVU: Benchmarking Multi-task Long Video Understanding | Jun 6, 2024 | Benchmarking, Video Understanding | Code Available | 3 |
| Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving | May 27, 2024 | Autonomous Driving, Benchmarking | Code Available | 3 |
| Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | May 17, 2024 | 16k, Benchmarking | Code Available | 3 |
| Are EEG-to-Text Models Working? | May 10, 2024 | Benchmarking, EEG | Code Available | 3 |
| ACEGEN: Reinforcement learning of generative chemical agents for drug discovery | May 7, 2024 | Benchmarking, Decision Making | Code Available | 3 |
| SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension | Apr 25, 2024 | Benchmarking, Multiple-choice | Code Available | 3 |
| STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases | Apr 19, 2024 | Benchmarking, Retrieval | Code Available | 3 |
| DeepFake-O-Meter v2.0: An Open Platform for DeepFake Detection | Apr 19, 2024 | Benchmarking, DeepFake Detection | Code Available | 3 |
| Advancing LLM Reasoning Generalists with Preference Trees | Apr 2, 2024 | Benchmarking, Code Generation | Code Available | 3 |
| AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework | Mar 19, 2024 | Benchmarking, Financial Analysis | Code Available | 3 |
| Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection | Mar 19, 2024 | Anomaly Detection, Benchmarking | Code Available | 3 |
| Recurrent Drafter for Fast Speculative Decoding in Large Language Models | Mar 14, 2024 | Benchmarking, Knowledge Distillation | Code Available | 3 |
| MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries | Jan 27, 2024 | Benchmarking, RAG | Code Available | 3 |
| AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents | Jan 24, 2024 | Benchmarking | Code Available | 3 |
| Benchmarking LLMs via Uncertainty Quantification | Jan 23, 2024 | Benchmarking, Uncertainty Quantification | Code Available | 3 |
| A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation | Jan 22, 2024 | Benchmarking, Diagnostic | Code Available | 3 |
| SEED-Bench: Benchmarking Multimodal Large Language Models | Jan 1, 2024 | Benchmarking, Image Generation | Code Available | 3 |
| AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One | Dec 10, 2023 | All, Benchmarking | Code Available | 3 |
| LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion | Nov 4, 2023 | Benchmarking, Imitation Learning | Code Available | 3 |
| CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving | Oct 11, 2023 | Autonomous Driving, Benchmarking | Code Available | 3 |
| Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis | Oct 9, 2023 | Benchmarking, Multivariate Time Series Forecasting | Code Available | 3 |
| T^3Bench: Benchmarking Current Progress in Text-to-3D Generation | Oct 4, 2023 | 3D Generation, Benchmarking | Code Available | 3 |
| SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation | Sep 29, 2023 | 3D Human Pose Estimation, 3D Human Reconstruction | Code Available | 3 |