| ODRL: A Benchmark for Off-Dynamics Reinforcement Learning | Oct 28, 2024 | Benchmarking, reinforcement-learning | Code Available | 2 |
| CoqPilot, a plugin for LLM-based generation of proofs | Oct 25, 2024 | Benchmarking | Code Available | 2 |
| Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach | Oct 24, 2024 | Benchmarking, Instruction Following | Code Available | 2 |
| RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | Oct 21, 2024 | Benchmarking, Language Modeling | Code Available | 2 |
| Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Oct 21, 2024 | Benchmarking, Instruction Following | Code Available | 2 |
| IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Oct 19, 2024 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 2 |
| SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Oct 19, 2024 | AI Agent, Benchmarking | Code Available | 2 |
| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Oct 13, 2024 | Benchmarking, Graph Generation | Code Available | 2 |
| Benchmarking Agentic Workflow Generation | Oct 10, 2024 | Benchmarking | Code Available | 2 |
| COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Oct 10, 2024 | Benchmarking, Fairness | Code Available | 2 |