| Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios | Jan 30, 2024 | Benchmarking | CodeCode Available | 2 |
| R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | Jan 18, 2024 | Benchmarking | CodeCode Available | 2 |
| WAVES: Benchmarking the Robustness of Image Watermarks | Jan 16, 2024 | Benchmarking | CodeCode Available | 2 |
| Authorship Obfuscation in Multilingual Machine-Generated Text Detection | Jan 15, 2024 | Adversarial RobustnessBenchmarking | CodeCode Available | 2 |
| InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks | Jan 10, 2024 | Benchmarking | CodeCode Available | 2 |
| A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark | Jan 1, 2024 | Age EstimationBenchmarking | CodeCode Available | 2 |
| EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Dec 11, 2023 | BenchmarkingEmotional Intelligence | CodeCode Available | 2 |
| AlignBench: Benchmarking Chinese Alignment of Large Language Models | Nov 30, 2023 | Benchmarking | CodeCode Available | 2 |
| Biomedical knowledge graph-optimized prompt generation for large language models | Nov 29, 2023 | BenchmarkingKnowledge Graphs | CodeCode Available | 2 |
| SEED-Bench-2: Benchmarking Multimodal Large Language Models | Nov 28, 2023 | BenchmarkingImage Generation | CodeCode Available | 2 |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Nov 22, 2023 | BenchmarkingPhrase Grounding | CodeCode Available | 2 |
| Exponentially Faster Language Modelling | Nov 15, 2023 | BenchmarkingCPU | CodeCode Available | 2 |
| What's In My Big Data? | Oct 31, 2023 | Benchmarking | CodeCode Available | 2 |
| Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks | Oct 30, 2023 | Benchmarkingobject-detection | CodeCode Available | 2 |
| Formalizing and Benchmarking Prompt Injection Attacks and Defenses | Oct 19, 2023 | Benchmarking | CodeCode Available | 2 |
| Octopus: Embodied Vision-Language Programmer from Environmental Feedback | Oct 12, 2023 | BenchmarkingCode Generation | CodeCode Available | 2 |
| ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons | Oct 11, 2023 | BenchmarkingPosition | CodeCode Available | 2 |
| MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation | Oct 5, 2023 | BenchmarkingDecision Making | CodeCode Available | 2 |
| RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models | Oct 1, 2023 | Benchmarking | CodeCode Available | 2 |
| LawBench: Benchmarking Legal Knowledge of Large Language Models | Sep 28, 2023 | ArticlesBenchmarking | CodeCode Available | 2 |
| GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Sep 28, 2023 | Benchmarking | CodeCode Available | 2 |
| A Content-Driven Micro-Video Recommendation Dataset at Scale | Sep 27, 2023 | BenchmarkingRecommendation Systems | CodeCode Available | 2 |
| A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Sep 26, 2023 | BenchmarkingMulti-Objective Reinforcement Learning | CodeCode Available | 2 |
| VerilogEval: Evaluating Large Language Models for Verilog Code Generation | Sep 14, 2023 | BenchmarkingCode Generation | CodeCode Available | 2 |
| PyGraft: Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips | Sep 7, 2023 | BenchmarkingKnowledge Graphs | CodeCode Available | 2 |
| Benchmarking Large Language Models in Retrieval-Augmented Generation | Sep 4, 2023 | Benchmarkingcounterfactual | CodeCode Available | 2 |
| Orientation-Independent Chinese Text Recognition in Scene Images | Sep 3, 2023 | BenchmarkingImage Reconstruction | CodeCode Available | 2 |
| Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations | Aug 23, 2023 | BenchmarkingDecoder | CodeCode Available | 2 |
| BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents | Aug 11, 2023 | BenchmarkingDecision Making | CodeCode Available | 2 |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Jul 30, 2023 | BenchmarkingMultiple-choice | CodeCode Available | 2 |
| IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer | Jul 27, 2023 | BenchmarkingImage Manipulation | CodeCode Available | 2 |
| Foundational Models Defining a New Era in Vision: A Survey and Outlook | Jul 25, 2023 | Benchmarking | CodeCode Available | 2 |
| Remote Bio-Sensing: Open Source Benchmark Framework for Fair Evaluation of rPPG | Jul 24, 2023 | Benchmarking | CodeCode Available | 2 |
| Benchmarking Potential Based Rewards for Learning Humanoid Locomotion | Jul 19, 2023 | BenchmarkingReinforcement Learning (RL) | CodeCode Available | 2 |
| EasyTPP: Towards Open Benchmarking Temporal Point Processes | Jul 16, 2023 | BenchmarkingPoint Processes | CodeCode Available | 2 |
| A Dynamic Points Removal Benchmark in Point Cloud Maps | Jul 14, 2023 | BenchmarkingDynamic Point Removal | CodeCode Available | 2 |
| ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Jul 4, 2023 | BenchmarkingWeather Forecasting | CodeCode Available | 2 |
| InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback | Jun 26, 2023 | BenchmarkingCode Generation | CodeCode Available | 2 |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Jun 23, 2023 | BenchmarkingLanguage Modeling | CodeCode Available | 2 |
| OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems | Jun 19, 2023 | BenchmarkingDecoder | CodeCode Available | 2 |
| PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs | Jun 15, 2023 | Benchmarking | CodeCode Available | 2 |
| Datasets and Benchmarks for Offline Safe Reinforcement Learning | Jun 15, 2023 | Autonomous DrivingBenchmarking | CodeCode Available | 2 |
| Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception | Jun 10, 2023 | 3D Object DetectionBenchmarking | CodeCode Available | 2 |
| LibAUC: A Deep Learning Library for X-Risk Optimization | Jun 5, 2023 | BenchmarkingClassification | CodeCode Available | 2 |
| The Brain Tumor Segmentation (BraTS) Challenge 2023: Focus on Pediatrics (CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs) | May 26, 2023 | BenchmarkingBrain Tumor Segmentation | CodeCode Available | 2 |
| Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models | May 19, 2023 | BenchmarkingDiversity | CodeCode Available | 2 |
| RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning | Apr 9, 2023 | BenchmarkingDeep Reinforcement Learning | CodeCode Available | 2 |
| OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception | Mar 7, 2023 | Autonomous DrivingBenchmarking | CodeCode Available | 2 |
| FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation | Mar 4, 2023 | BenchmarkingGPU | CodeCode Available | 2 |
| Extended Agriculture-Vision: An Extension of a Large Aerial Image Dataset for Agricultural Pattern Analysis | Mar 4, 2023 | BenchmarkingContrastive Learning | CodeCode Available | 2 |