| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Jul 22, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| POGEMA: A Benchmark Platform for Cooperative Multi-Agent Pathfinding | Jul 20, 2024 | Benchmarking, Heuristic Search | Code Available | 1 |
| Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations | Jul 19, 2024 | Benchmarking, Fairness | Code Available | 1 |
| Restore Anything Model via Efficient Degradation Adaptation | Jul 18, 2024 | 5-Degradation Blind All-in-One Image Restoration, Benchmarking | Code Available | 1 |
| SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities | Jul 16, 2024 | Benchmarking, Domain Adaptation | Code Available | 1 |
| Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Jul 16, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| When Heterophily Meets Heterogeneity: Challenges and a New Large-Scale Graph Benchmark | Jul 15, 2024 | Benchmarking, Graph Learning | Code Available | 1 |
| Separable Operator Networks | Jul 15, 2024 | Benchmarking, GPU | Code Available | 1 |
| CIBench: Evaluating Your LLMs with a Code Interpreter Plugin | Jul 15, 2024 | Benchmarking | Code Available | 1 |
| OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Jul 13, 2024 | Benchmarking, Math | Code Available | 1 |
| Retrospective for the Dynamic Sensorium Competition for predicting large-scale mouse primary visual cortex activity from videos | Jul 12, 2024 | Benchmarking, Pupil Dilation | Code Available | 1 |
| Benchmarking Language Model Creativity: A Case Study on Code Generation | Jul 12, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines | Jul 11, 2024 | Benchmarking, Prediction | Code Available | 1 |
| Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation | Jul 11, 2024 | Benchmarking | Code Available | 1 |
| Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Jul 10, 2024 | Benchmarking, Diagnostic | Code Available | 1 |
| Training on the Test Task Confounds Evaluation and Emergence | Jul 10, 2024 | Benchmarking, Language Modelling | Code Available | 1 |
| OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning | Jul 8, 2024 | Benchmarking, class-incremental learning | Code Available | 1 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Jul 8, 2024 | Benchmarking, knowledge editing | Code Available | 1 |
| Replication in Visual Diffusion Models: A Survey and Outlook | Jul 7, 2024 | Benchmarking, Survey | Code Available | 1 |
| Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Jul 5, 2024 | Benchmarking, valid | Code Available | 1 |
| Benchmark on Drug Target Interaction Modeling from a Structure Perspective | Jul 4, 2024 | Benchmarking, Drug Discovery | Code Available | 1 |
| Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Jul 3, 2024 | Benchmarking, Object | Code Available | 1 |
| GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models | Jul 3, 2024 | Benchmarking | Code Available | 1 |
| Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset | Jul 3, 2024 | Benchmarking, Diversity | Code Available | 1 |
| Occlusion-Aware Seamless Segmentation | Jul 2, 2024 | Benchmarking, Domain Adaptation | Code Available | 1 |
| FineSurE: Fine-grained Summarization Evaluation using LLMs | Jul 1, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| AI Agents That Matter | Jul 1, 2024 | Benchmarking | Code Available | 1 |
| Overcoming Common Flaws in the Evaluation of Selective Classification Systems | Jul 1, 2024 | Benchmarking, Classification | Code Available | 1 |
| Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents | Jul 1, 2024 | Benchmarking | Code Available | 1 |
| GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Jun 29, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities | Jun 27, 2024 | Benchmarking | Code Available | 1 |
| Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection | Jun 25, 2024 | Benchmarking, Prompt Learning | Code Available | 1 |
| SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It) | Jun 25, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| MatText: Do Language Models Need More than Text & Scale for Materials Modeling? | Jun 25, 2024 | Benchmarking | Code Available | 1 |
| AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Jun 24, 2024 | Benchmarking, Data Augmentation | Code Available | 1 |
| General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design | Jun 24, 2024 | Benchmarking, Drug Design | Code Available | 1 |
| Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track | Jun 24, 2024 | Benchmarking, RAG | Code Available | 1 |
| A Closer Look at Mortality Risk Prediction from Electrocardiograms | Jun 24, 2024 | Benchmarking, Prediction | Code Available | 1 |
| Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models | Jun 21, 2024 | Benchmarking | Code Available | 1 |
| A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data | Jun 20, 2024 | Benchmarking, Kolmogorov-Arnold Networks | Code Available | 1 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | Benchmarking, Classification | Code Available | 1 |
| BeHonest: Benchmarking Honesty in Large Language Models | Jun 19, 2024 | Benchmarking, Misinformation | Code Available | 1 |
| Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking | Jun 17, 2024 | Benchmarking, Demand Forecasting | Code Available | 1 |
| MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Jun 17, 2024 | Benchmarking, Fact Checking | Code Available | 1 |
| A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Jun 15, 2024 | Benchmarking, GPU | Code Available | 1 |
| VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs | Jun 14, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Jun 14, 2024 | Benchmarking | Code Available | 1 |
| LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data | Jun 14, 2024 | Benchmarking, Decision Making | Code Available | 1 |
| SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution | Jun 13, 2024 | Benchmarking, Image Super-Resolution | Code Available | 1 |
| SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models | Jun 13, 2024 | Benchmarking | Code Available | 1 |