| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Jul 22, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| POGEMA: A Benchmark Platform for Cooperative Multi-Agent Pathfinding | Jul 20, 2024 | Benchmarking, Heuristic Search | Code Available | 1 |
| Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations | Jul 19, 2024 | Benchmarking, Fairness | Code Available | 1 |
| Restore Anything Model via Efficient Degradation Adaptation | Jul 18, 2024 | 5-Degradation Blind All-in-One Image Restoration, Benchmarking | Code Available | 1 |
| SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities | Jul 16, 2024 | Benchmarking, Domain Adaptation | Code Available | 1 |
| Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Jul 16, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| CIBench: Evaluating Your LLMs with a Code Interpreter Plugin | Jul 15, 2024 | Benchmarking | Code Available | 1 |
| Separable Operator Networks | Jul 15, 2024 | Benchmarking, GPU | Code Available | 1 |
| When Heterophily Meets Heterogeneity: Challenges and a New Large-Scale Graph Benchmark | Jul 15, 2024 | Benchmarking, Graph Learning | Code Available | 1 |
| OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Jul 13, 2024 | Benchmarking, Math | Code Available | 1 |
| Benchmarking Language Model Creativity: A Case Study on Code Generation | Jul 12, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| Retrospective for the Dynamic Sensorium Competition for predicting large-scale mouse primary visual cortex activity from videos | Jul 12, 2024 | Benchmarking, Pupil Dilation | Code Available | 1 |
| Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation | Jul 11, 2024 | Benchmarking | Code Available | 1 |
| PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines | Jul 11, 2024 | Benchmarking, Prediction | Code Available | 1 |
| Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Jul 10, 2024 | Benchmarking, Diagnostic | Code Available | 1 |
| Training on the Test Task Confounds Evaluation and Emergence | Jul 10, 2024 | Benchmarking, Language Modelling | Code Available | 1 |
| OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning | Jul 8, 2024 | Benchmarking, class-incremental learning | Code Available | 1 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Jul 8, 2024 | Benchmarking, knowledge editing | Code Available | 1 |
| Replication in Visual Diffusion Models: A Survey and Outlook | Jul 7, 2024 | Benchmarking, Survey | Code Available | 1 |
| Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Jul 5, 2024 | Benchmarking, valid | Code Available | 1 |
| Benchmark on Drug Target Interaction Modeling from a Structure Perspective | Jul 4, 2024 | Benchmarking, Drug Discovery | Code Available | 1 |
| GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models | Jul 3, 2024 | Benchmarking | Code Available | 1 |
| Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset | Jul 3, 2024 | Benchmarking, Diversity | Code Available | 1 |
| Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Jul 3, 2024 | Benchmarking, Object | Code Available | 1 |
| Occlusion-Aware Seamless Segmentation | Jul 2, 2024 | Benchmarking, Domain Adaptation | Code Available | 1 |