| Generative CKM Construction using Partially Observed Data with Diffusion Model | Dec 19, 2024 | Benchmarking | CodeCode Available | 1 |
| Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning | Dec 18, 2024 | BenchmarkingGraph Learning | CodeCode Available | 1 |
| RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment | Dec 18, 2024 | BenchmarkingRAG | CodeCode Available | 1 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | BenchmarkingExperimental Design | CodeCode Available | 1 |
| TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | Dec 18, 2024 | Benchmarking | CodeCode Available | 1 |
| MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation | Dec 16, 2024 | AllBenchmarking | CodeCode Available | 1 |
| CharacterBench: Benchmarking Character Customization of Large Language Models | Dec 16, 2024 | Benchmarking | CodeCode Available | 1 |
| AD-LLM: Benchmarking Large Language Models for Anomaly Detection | Dec 15, 2024 | Anomaly DetectionBenchmarking | CodeCode Available | 1 |
| Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Dec 11, 2024 | AttributeBenchmarking | CodeCode Available | 1 |
| PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems | Dec 9, 2024 | BenchmarkingPrediction | CodeCode Available | 1 |