| Generative CKM Construction using Partially Observed Data with Diffusion Model | Dec 19, 2024 | Benchmarking | Code Available | 1 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning | Dec 18, 2024 | Benchmarking, Graph Learning | Code Available | 1 |
| RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment | Dec 18, 2024 | Benchmarking, RAG | Code Available | 1 |
| TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | Dec 18, 2024 | Benchmarking | Code Available | 1 |
| MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation | Dec 16, 2024 | All, Benchmarking | Code Available | 1 |
| CharacterBench: Benchmarking Character Customization of Large Language Models | Dec 16, 2024 | Benchmarking | Code Available | 1 |
| AD-LLM: Benchmarking Large Language Models for Anomaly Detection | Dec 15, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Dec 11, 2024 | Attribute, Benchmarking | Code Available | 1 |
| PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems | Dec 9, 2024 | Benchmarking, Prediction | Code Available | 1 |
| Multi-Behavior Recommendation with Personalized Directed Acyclic Behavior Graphs | Dec 9, 2024 | Benchmarking, Computational Efficiency | Code Available | 1 |
| Grounding Descriptions in Images informs Zero-Shot Visual Recognition | Dec 5, 2024 | Attribute, Benchmarking | Code Available | 1 |
| Does your model understand genes? A benchmark of gene properties for biological and text models | Dec 5, 2024 | Benchmarking, Multi-class Classification | Code Available | 1 |
| Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs" | Dec 2, 2024 | Benchmarking, Representation Learning | Code Available | 1 |
| Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis | Nov 29, 2024 | Benchmarking, Claim Verification | Code Available | 1 |
| Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Nov 29, 2024 | Benchmarking, DeepFake Detection | Code Available | 1 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | Benchmarking, Earth Observation | Code Available | 1 |
| AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Nov 26, 2024 | Benchmarking, Text-to-Video Generation | Code Available | 1 |
| VidHal: Benchmarking Temporal Hallucinations in Vision LLMs | Nov 25, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks | Nov 25, 2024 | Benchmarking, object-detection | Code Available | 1 |
| Multi-Agent Environments for Vehicle Routing Problems | Nov 21, 2024 | Benchmarking, reinforcement-learning | Code Available | 1 |
| StackEval: Benchmarking LLMs in Coding Assistance | Nov 21, 2024 | Benchmarking | Code Available | 1 |
| DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models | Nov 19, 2024 | Benchmarking, Deep Learning | Code Available | 1 |
| Introducing Milabench: Benchmarking Accelerators for AI | Nov 18, 2024 | Benchmarking, Deep Learning | Code Available | 1 |
| FM-TS: Flow Matching for Time Series Generation | Nov 12, 2024 | Benchmarking, Imputation | Code Available | 1 |