| Exponentially Faster Language Modelling | Nov 15, 2023 | Benchmarking, CPU | Code Available | 2 | 5 |
| GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Jun 21, 2024 | AI Agent, AutoML | Code Available | 2 | 5 |
| BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models | Apr 17, 2021 | Argument Retrieval, Benchmarking | Code Available | 2 | 5 |
| Event-Based Motion Magnification | Feb 19, 2024 | Benchmarking, Motion Detection | Code Available | 2 | 5 |
| Extended Agriculture-Vision: An Extension of a Large Aerial Image Dataset for Agricultural Pattern Analysis | Mar 4, 2023 | Benchmarking, Contrastive Learning | Code Available | 2 | 5 |
| AlignBench: Benchmarking Chinese Alignment of Large Language Models | Nov 30, 2023 | Benchmarking | Code Available | 2 | 5 |
| EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking | Apr 2, 2024 | Benchmarking, Reinforcement Learning (RL) | Code Available | 2 | 5 |
| EvalGIM: A Library for Evaluating Generative Image Models | Dec 13, 2024 | Benchmarking, Diversity | Code Available | 2 | 5 |
| Authorship Obfuscation in Multilingual Machine-Generated Text Detection | Jan 15, 2024 | Adversarial Robustness, Benchmarking | Code Available | 2 | 5 |
| EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Dec 11, 2023 | Benchmarking, Emotional Intelligence | Code Available | 2 | 5 |
| Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details | Feb 1, 2021 | Benchmarking, object-detection | Code Available | 2 | 5 |
| HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? | Jun 20, 2024 | Benchmarking, Point Processes | Code Available | 2 | 5 |
| HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance | Jul 9, 2024 | Benchmarking, Conditional Image Generation | Code Available | 2 | 5 |
| FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | Jul 1, 2024 | Benchmarking, Fairness | Code Available | 2 | 5 |
| EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Feb 3, 2024 | Benchmarking, Code Completion | Code Available | 2 | 5 |
| A large annotated medical image dataset for the development and evaluation of segmentation algorithms | Feb 25, 2019 | Benchmarking, Segmentation | Code Available | 2 | 5 |
| InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks | Jan 10, 2024 | Benchmarking | Code Available | 2 | 5 |
| InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Mar 5, 2024 | Benchmarking, Language Modeling | Code Available | 2 | 5 |
| InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior | Feb 7, 2024 | Benchmarking, Decoder | Code Available | 2 | 5 |
| Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping | Nov 5, 2024 | Benchmarking, Code Generation | Code Available | 2 | 5 |
| InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback | Jun 26, 2023 | Benchmarking, Code Generation | Code Available | 2 | 5 |
| IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Oct 19, 2024 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 2 | 5 |
| A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Sep 26, 2023 | Benchmarking, Multi-Objective Reinforcement Learning | Code Available | 2 | 5 |
| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Oct 13, 2024 | Benchmarking, Graph Generation | Code Available | 2 | 5 |
| State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Sep 30, 2022 | Benchmarking, Blind Docking | Code Available | 2 | 5 |