| How far are today's time-series models from real-world weather forecasting applications? | Jun 20, 2024 | Benchmarking, Time Series | Code Available | 2 |
| HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? | Jun 20, 2024 | Benchmarking, Point Processes | Code Available | 2 |
| A large-scale multicenter breast cancer DCE-MRI benchmark dataset with expert segmentations | Jun 19, 2024 | Benchmarking | Code Available | 2 |
| OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI | Jun 18, 2024 | Benchmarking, scientific discovery | Code Available | 2 |
| GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models | Jun 18, 2024 | Benchmarking, Depth Estimation | Code Available | 2 |
| Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models | Jun 17, 2024 | Benchmarking | Code Available | 2 |
| RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models | Jun 16, 2024 | Adversarial Attack, Benchmarking | Code Available | 2 |
| BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics | Jun 13, 2024 | Benchmarking | Code Available | 2 |
| Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Jun 13, 2024 | Benchmarking, GPU | Code Available | 2 |
| Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Jun 13, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| StreamBench: Towards Benchmarking Continuous Improvement of Language Agents | Jun 13, 2024 | Benchmarking, Language Modeling | Code Available | 2 |
| TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine | Jun 3, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| Benchmarking and Improving Detail Image Caption | May 29, 2024 | Benchmarking, Image Captioning | Code Available | 2 |
| LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | May 27, 2024 | Benchmarking, GSM8K | Code Available | 2 |
| S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models | May 23, 2024 | Benchmarking | Code Available | 2 |
| Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning | May 20, 2024 | Benchmarking, MRI segmentation | Code Available | 2 |
| MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | May 20, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | May 15, 2024 | Benchmarking | Code Available | 2 |
| OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs | May 9, 2024 | Benchmarking, Fact Checking | Code Available | 2 |
| iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval | May 5, 2024 | Benchmarking, Composed Image Retrieval (CoIR) | Code Available | 2 |
| Benchmarking Representations for Speech, Music, and Acoustic Events | May 2, 2024 | Audio Classification, Benchmarking | Code Available | 2 |
| HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond | May 1, 2024 | Benchmarking, High-Level Synthesis | Code Available | 2 |
| SIDBench: A Python Framework for Reliably Assessing Synthetic Image Detection Methods | Apr 29, 2024 | Benchmarking, Image Generation | Code Available | 2 |
| Benchmarking Benchmark Leakage in Large Language Models | Apr 29, 2024 | Benchmarking, Mathematical Reasoning | Code Available | 2 |
| LongEmbed: Extending Embedding Models for Long Context Retrieval | Apr 18, 2024 | 4k, 8k | Code Available | 2 |