| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2k, Benchmarking | Code Available | 1 |
| Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning | Oct 14, 2024 | Atari Games, Benchmarking | Unverified | 0 |
| RMB: Comprehensively Benchmarking Reward Models in LLM Alignment | Oct 13, 2024 | Benchmarking | Code Available | 1 |
| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Oct 13, 2024 | Benchmarking, Graph Generation | Code Available | 2 |
| LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Oct 13, 2024 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models | Oct 12, 2024 | Benchmarking, Misinformation | Code Available | 0 |
| LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English | Oct 12, 2024 | Benchmarking | Code Available | 0 |
| FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback | Oct 12, 2024 | Benchmarking | Code Available | 0 |
| A Comparative Analysis on Ethical Benchmarking in Large Language Models | Oct 11, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Enterprise Benchmarks for Large Language Model Evaluation | Oct 11, 2024 | Benchmarking, Language Model Evaluation | Code Available | 0 |