| Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond | Jun 16, 2023 | Benchmarking, Evidence Selection | Code Available | 1 |
| Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Nov 2, 2020 | Benchmarking | Code Available | 1 |
| Benchmarking Neural Network Generalization for Grammar Induction | Aug 16, 2023 | Benchmarking | Code Available | 1 |
| Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations | Jul 4, 2018 | Adversarial Defense, Benchmarking | Code Available | 1 |
| Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Apr 3, 2024 | Benchmarking, General Knowledge | Code Available | 1 |
| Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Feb 27, 2025 | Benchmarking | Code Available | 1 |
| CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | May 6, 2025 | Benchmarking | Code Available | 1 |
| EventEA: Benchmarking Entity Alignment for Event-centric Knowledge Graphs | Nov 5, 2022 | Attribute, Benchmarking | Code Available | 1 |
| Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Dec 13, 2022 | Benchmarking, Code Generation | Code Available | 1 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Jul 8, 2024 | Benchmarking, knowledge editing | Code Available | 1 |