| Benchmarking Language Model Creativity: A Case Study on Code Generation | Jul 12, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| A Comprehensive Survey on Retrieval Methods in Recommender Systems | Jul 11, 2024 | Benchmarking, Recommendation Systems | Unverified | 0 |
| Evaluating Nuanced Bias in Large Language Model Free Response Answers | Jul 11, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| WayveScenes101: A Dataset and Benchmark for Novel View Synthesis in Autonomous Driving | Jul 11, 2024 | Autonomous Driving, Benchmarking | Code Available | 2 |
| Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation | Jul 11, 2024 | Benchmarking | Code Available | 1 |
| PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines | Jul 11, 2024 | Benchmarking, Prediction | Code Available | 1 |
| Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models | Jul 10, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Jul 10, 2024 | Benchmarking, Diagnostic | Code Available | 1 |
| How Aligned are Different Alignment Metrics? | Jul 10, 2024 | Benchmarking | Unverified | 0 |
| InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior | Jul 10, 2024 | Benchmarking, Decoder | Code Available | 2 |
| Training on the Test Task Confounds Evaluation and Emergence | Jul 10, 2024 | Benchmarking, Language Modelling | Code Available | 1 |
| Revisiting, Benchmarking and Understanding Unsupervised Graph Domain Adaptation | Jul 9, 2024 | Benchmarking, Domain Adaptation | Code Available | 3 |
| SPINEX-Clustering: Similarity-based Predictions with Explainable Neighbors Exploration for Clustering Problems | Jul 9, 2024 | Benchmarking, Clustering | Unverified | 0 |
| Analyzing the Effectiveness of Listwise Reranking with Positional Invariance on Temporal Generalizability | Jul 9, 2024 | Benchmarking, Decoder | Unverified | 0 |
| HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance | Jul 9, 2024 | Benchmarking, Conditional Image Generation | Code Available | 2 |
| HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction | Jul 9, 2024 | Benchmarking | Code Available | 0 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Jul 8, 2024 | Benchmarking, knowledge editing | Code Available | 1 |
| Simulation-based Benchmarking for Causal Structure Learning in Gene Perturbation Experiments | Jul 8, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning | Jul 8, 2024 | Benchmarking, class-incremental learning | Code Available | 1 |
| GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation | Jul 8, 2024 | Benchmarking, Graph Embedding | Unverified | 0 |
| TARGO: Benchmarking Target-driven Object Grasping under Occlusions | Jul 8, 2024 | Benchmarking, Object | Unverified | 0 |
| A Benchmark for Multi-speaker Anonymization | Jul 8, 2024 | Benchmarking, Disentanglement | Unverified | 0 |
| MERGE -- A Bimodal Audio-Lyrics Dataset for Static Music Emotion Recognition | Jul 8, 2024 | Benchmarking, Deep Learning | Unverified | 0 |
| Replication in Visual Diffusion Models: A Survey and Outlook | Jul 7, 2024 | Benchmarking, Survey | Code Available | 1 |
| Rethinking the Effectiveness of Graph Classification Datasets in Benchmarks for Assessing GNNs | Jul 6, 2024 | Benchmarking, Dataset Generation | Code Available | 0 |