| ODRL: A Benchmark for Off-Dynamics Reinforcement Learning | Oct 28, 2024 | Benchmarking, reinforcement-learning | Code Available | 2 |
| CoqPilot, a plugin for LLM-based generation of proofs | Oct 25, 2024 | Benchmarking | Code Available | 2 |
| Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach | Oct 24, 2024 | Benchmarking, Instruction Following | Code Available | 2 |
| RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | Oct 21, 2024 | Benchmarking, Language Modeling | Code Available | 2 |
| Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Oct 21, 2024 | Benchmarking, Instruction Following | Code Available | 2 |
| SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Oct 19, 2024 | AI Agent, Benchmarking | Code Available | 2 |
| IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Oct 19, 2024 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 2 |
| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Oct 13, 2024 | Benchmarking, Graph Generation | Code Available | 2 |
| COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Oct 10, 2024 | Benchmarking, Fairness | Code Available | 2 |
| Benchmarking Agentic Workflow Generation | Oct 10, 2024 | Benchmarking | Code Available | 2 |
| Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond | Oct 9, 2024 | Benchmarking | Code Available | 2 |
| FedGraph: A Research Library and Benchmark for Federated Graph Learning | Oct 8, 2024 | Benchmarking, Federated Learning | Code Available | 2 |
| MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense | Oct 7, 2024 | Adversarial Robustness, Benchmarking | Code Available | 2 |
| dattri: A Library for Efficient Data Attribution | Oct 6, 2024 | Benchmarking | Code Available | 2 |
| AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Oct 4, 2024 | Benchmarking | Code Available | 2 |
| Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models | Sep 30, 2024 | Benchmarking, Continual Learning | Code Available | 2 |
| A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends | Sep 29, 2024 | Benchmarking, graph construction | Code Available | 2 |
| GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Sep 24, 2024 | 3D geometry, 3DGS | Code Available | 2 |
| Small Language Models: Survey, Measurements, and Insights | Sep 24, 2024 | Benchmarking, Decoder | Code Available | 2 |
| A Survey on Multimodal Benchmarks: In the Era of Large AI Models | Sep 21, 2024 | Benchmarking, Survey | Code Available | 2 |
| Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework | Sep 17, 2024 | Benchmarking, Federated Learning | Code Available | 2 |
| Assessing SPARQL capabilities of Large Language Models | Sep 9, 2024 | Benchmarking, Knowledge Graphs | Code Available | 2 |
| PlantSeg: A Large-Scale In-the-wild Dataset for Plant Disease Segmentation | Sep 6, 2024 | Benchmarking, image-classification | Code Available | 2 |
| Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions | Aug 28, 2024 | Benchmarking | Code Available | 2 |
| PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis | Aug 20, 2024 | Benchmarking | Code Available | 2 |