| Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Nov 15, 2023 | Benchmarking, Instruction Following | Code Available | 1 |
| MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration | Nov 14, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| Combinatorial Optimization with Policy Adaptation using Latent Space Search | Nov 13, 2023 | Benchmarking, Combinatorial Optimization | Code Available | 1 |
| Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime | Nov 13, 2023 | Benchmarking, Combinatorial Optimization | Code Available | 1 |
| WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models | Nov 13, 2023 | Benchmarking, Instruction Following | Code Available | 1 |
| Flames: Benchmarking Value Alignment of LLMs in Chinese | Nov 12, 2023 | Benchmarking, Fairness | Code Available | 1 |
| CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Nov 10, 2023 | Benchmarking, Cloud Computing | Code Available | 1 |
| MultiIoT: Benchmarking Machine Learning for the Internet of Things | Nov 10, 2023 | Benchmarking, Representation Learning | Code Available | 1 |
| TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs | Nov 9, 2023 | Benchmarking, Question Answering | Code Available | 1 |
| The voraus-AD Dataset for Anomaly Detection in Robot Applications | Nov 8, 2023 | Anomaly Detection, Benchmarking | Code Available | 1 |