| The 6th Affective Behavior Analysis in-the-wild (ABAW) Competition | Feb 29, 2024 | Action Unit Detection, Arousal Estimation | Unverified | 0 |
| Efficient Lifelong Model Evaluation in an Era of Rapid Progress | Feb 29, 2024 | Benchmarking, GPU | Code Available | 1 |
| Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks | Feb 29, 2024 | Benchmarking, Disentanglement | Code Available | 2 |
| FlowCyt: A Comparative Study of Deep Learning Approaches for Multi-Class Classification in Flow Cytometry Benchmarking | Feb 28, 2024 | Benchmarking, Inductive Learning | Code Available | 0 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models | Feb 28, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| The Seeker's Dilemma: Realistic Formulation and Benchmarking for Hardware Trojan Detection | Feb 27, 2024 | Benchmarking | Unverified | 0 |
| Beacon, a lightweight deep reinforcement learning benchmark library for flow control | Feb 27, 2024 | Benchmarking, CPU | Code Available | 1 |
| Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies | Feb 27, 2024 | Benchmarking, Systematic Generalization | Unverified | 0 |
| Benchmarking Data Science Agents | Feb 27, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| The KANDY Benchmark: Incremental Neuro-Symbolic Learning and Reasoning with Kandinsky Patterns | Feb 27, 2024 | Benchmarking, Binary Classification | Code Available | 0 |
| A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images | Feb 27, 2024 | Benchmarking, Defect Detection | Unverified | 0 |
| Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Feb 27, 2024 | Benchmarking | Code Available | 1 |
| Partial Rankings of Optimizers | Feb 26, 2024 | Benchmarking | Code Available | 0 |
| Benchmarking LLMs on the Semantic Overlap Summarization Task | Feb 26, 2024 | Benchmarking, Document Summarization | Unverified | 0 |
| Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset | Feb 26, 2024 | Benchmarking, Cross-Lingual Transfer | Unverified | 0 |
| Performance Comparison of Surrogate-Assisted Evolutionary Algorithms on Computational Fluid Dynamics Problems | Feb 26, 2024 | Benchmarking, Evolutionary Algorithms | Unverified | 0 |
| HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs | Feb 25, 2024 | Benchmarking, Chatbot | Code Available | 0 |
| PST-Bench: Tracing and Benchmarking the Source of Publications | Feb 25, 2024 | Benchmarking | Code Available | 1 |
| Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs | Feb 24, 2024 | Benchmarking, Knowledge Graphs | Unverified | 0 |
| E(3)-equivariant models cannot learn chirality: Field-based molecular generation | Feb 24, 2024 | Benchmarking, Graph Neural Network | Unverified | 0 |
| API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Feb 23, 2024 | Benchmarking, slot-filling | Code Available | 1 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Feb 23, 2024 | Benchmarking, Multiple-choice | Code Available | 2 |
| Benchmarking the Robustness of Panoptic Segmentation for Automated Driving | Feb 23, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Benchmarking Observational Studies with Experimental Data under Right-Censoring | Feb 23, 2024 | Benchmarking | Unverified | 0 |