| Generative CKM Construction using Partially Observed Data with Diffusion Model | Dec 19, 2024 | Benchmarking | Code Available | 1 |
| Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning | Dec 18, 2024 | Benchmarking, Graph Learning | Code Available | 1 |
| RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment | Dec 18, 2024 | Benchmarking, RAG | Code Available | 1 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | Dec 18, 2024 | Benchmarking | Code Available | 1 |
| MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation | Dec 16, 2024 | All, Benchmarking | Code Available | 1 |
| CharacterBench: Benchmarking Character Customization of Large Language Models | Dec 16, 2024 | Benchmarking | Code Available | 1 |
| AD-LLM: Benchmarking Large Language Models for Anomaly Detection | Dec 15, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Dec 11, 2024 | Attribute, Benchmarking | Code Available | 1 |
| Multi-Behavior Recommendation with Personalized Directed Acyclic Behavior Graphs | Dec 9, 2024 | Benchmarking, Computational Efficiency | Code Available | 1 |
| PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems | Dec 9, 2024 | Benchmarking, Prediction | Code Available | 1 |
| Does your model understand genes? A benchmark of gene properties for biological and text models | Dec 5, 2024 | Benchmarking, Multi-class Classification | Code Available | 1 |
| Grounding Descriptions in Images informs Zero-Shot Visual Recognition | Dec 5, 2024 | Attribute, Benchmarking | Code Available | 1 |
| Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs" | Dec 2, 2024 | Benchmarking, Representation Learning | Code Available | 1 |
| Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis | Nov 29, 2024 | Benchmarking, Claim Verification | Code Available | 1 |
| Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Nov 29, 2024 | Benchmarking, DeepFake Detection | Code Available | 1 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | Benchmarking, Earth Observation | Code Available | 1 |
| AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Nov 26, 2024 | Benchmarking, Text-to-Video Generation | Code Available | 1 |
| VidHal: Benchmarking Temporal Hallucinations in Vision LLMs | Nov 25, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks | Nov 25, 2024 | Benchmarking, object-detection | Code Available | 1 |
| StackEval: Benchmarking LLMs in Coding Assistance | Nov 21, 2024 | Benchmarking | Code Available | 1 |
| Multi-Agent Environments for Vehicle Routing Problems | Nov 21, 2024 | Benchmarking, reinforcement-learning | Code Available | 1 |
| DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models | Nov 19, 2024 | Benchmarking, Deep Learning | Code Available | 1 |
| Introducing Milabench: Benchmarking Accelerators for AI | Nov 18, 2024 | Benchmarking, Deep Learning | Code Available | 1 |
| FM-TS: Flow Matching for Time Series Generation | Nov 12, 2024 | Benchmarking, Imputation | Code Available | 1 |
| Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification | Nov 11, 2024 | Benchmarking, Image Segmentation | Code Available | 1 |
| Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Nov 5, 2024 | Benchmarking, Language Modeling | Code Available | 1 |
| Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Nov 4, 2024 | Action Generation, Benchmarking | Code Available | 1 |
| LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation | Nov 4, 2024 | Benchmarking, Graph Generation | Code Available | 1 |
| ROAD-Waymo: Action Awareness at Scale for Autonomous Driving | Nov 3, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 |
| MIRFLEX: Music Information Retrieval Feature Library for Extraction | Nov 1, 2024 | Benchmarking, Information Retrieval | Code Available | 1 |
| LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models | Nov 1, 2024 | Benchmarking, Mixture-of-Experts | Code Available | 1 |
| AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Oct 31, 2024 | Benchmarking, Cloud Removal | Code Available | 1 |
| Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking | Oct 31, 2024 | Benchmarking, Imputation | Code Available | 1 |
| DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Oct 31, 2024 | Benchmarking, LLM-generated Text Detection | Code Available | 1 |
| LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction | Oct 31, 2024 | Benchmarking, Prediction | Code Available | 1 |
| EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography | Oct 31, 2024 | Benchmarking, Electromyography (EMG) | Code Available | 1 |
| DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Oct 30, 2024 | Benchmarking, Management | Code Available | 1 |
| Survey of Cultural Awareness in Language Models: Text and Beyond | Oct 30, 2024 | Benchmarking | Code Available | 1 |
| LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment | Oct 28, 2024 | Benchmarking, Language Modeling | Code Available | 1 |
| SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance | Oct 27, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Oct 25, 2024 | Benchmarking, Diversity | Code Available | 1 |
| Benchmarking Multi-Scene Fire and Smoke Detection | Oct 22, 2024 | Benchmarking | Code Available | 1 |
| Comprehensive benchmarking of large language models for RNA secondary structure prediction | Oct 21, 2024 | Benchmarking | Code Available | 1 |
| MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Oct 18, 2024 | Benchmarking, Question Answering | Code Available | 1 |
| Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Oct 18, 2024 | Autonomous Navigation, Benchmarking | Code Available | 1 |
| Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all | Oct 17, 2024 | All, Benchmarking | Code Available | 1 |
| WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Oct 16, 2024 | Benchmarking, Fairness | Code Available | 1 |
| RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation | Oct 15, 2024 | Benchmarking, Interactive Segmentation | Code Available | 1 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2k, Benchmarking | Code Available | 1 |