| Efficient Lifelong Model Evaluation in an Era of Rapid Progress | Feb 29, 2024 | Benchmarking, GPU | Code Available | 1 |
| The 6th Affective Behavior Analysis in-the-wild (ABAW) Competition | Feb 29, 2024 | Action Unit Detection, Arousal Estimation | Unverified | 0 |
| Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks | Feb 29, 2024 | Benchmarking, Disentanglement | Code Available | 2 |
| FlowCyt: A Comparative Study of Deep Learning Approaches for Multi-Class Classification in Flow Cytometry Benchmarking | Feb 28, 2024 | Benchmarking, Inductive Learning | Code Available | 0 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models | Feb 28, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| The Seeker's Dilemma: Realistic Formulation and Benchmarking for Hardware Trojan Detection | Feb 27, 2024 | Benchmarking | Unverified | 0 |
| Beacon, a lightweight deep reinforcement learning benchmark library for flow control | Feb 27, 2024 | Benchmarking, CPU | Code Available | 1 |
| Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies | Feb 27, 2024 | Benchmarking, Systematic Generalization | Unverified | 0 |
| The KANDY Benchmark: Incremental Neuro-Symbolic Learning and Reasoning with Kandinsky Patterns | Feb 27, 2024 | Benchmarking, Binary Classification | Code Available | 0 |
| Benchmarking Data Science Agents | Feb 27, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images | Feb 27, 2024 | Benchmarking, Defect Detection | Unverified | 0 |
| Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Feb 27, 2024 | Benchmarking | Code Available | 1 |
| Benchmarking LLMs on the Semantic Overlap Summarization Task | Feb 26, 2024 | Benchmarking, Document Summarization | Unverified | 0 |
| Partial Rankings of Optimizers | Feb 26, 2024 | Benchmarking | Code Available | 0 |
| Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset | Feb 26, 2024 | Benchmarking, Cross-Lingual Transfer | Unverified | 0 |
| Performance Comparison of Surrogate-Assisted Evolutionary Algorithms on Computational Fluid Dynamics Problems | Feb 26, 2024 | Benchmarking, Evolutionary Algorithms | Unverified | 0 |
| HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs | Feb 25, 2024 | Benchmarking, Chatbot | Code Available | 0 |
| PST-Bench: Tracing and Benchmarking the Source of Publications | Feb 25, 2024 | Benchmarking | Code Available | 1 |
| E(3)-equivariant models cannot learn chirality: Field-based molecular generation | Feb 24, 2024 | Benchmarking, Graph Neural Network | Unverified | 0 |
| Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs | Feb 24, 2024 | Benchmarking, Knowledge Graphs | Unverified | 0 |
| API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Feb 23, 2024 | Benchmarking, slot-filling | Code Available | 1 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Feb 23, 2024 | Benchmarking, Multiple-choice | Code Available | 2 |
| Benchmarking the Robustness of Panoptic Segmentation for Automated Driving | Feb 23, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Benchmarking Observational Studies with Experimental Data under Right-Censoring | Feb 23, 2024 | Benchmarking | Unverified | 0 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Feb 22, 2024 | Benchmarking | Code Available | 0 |
| CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Feb 22, 2024 | Benchmarking | Code Available | 1 |
| The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning | Feb 21, 2024 | Benchmarking, Representation Learning | Code Available | 1 |
| Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Feb 21, 2024 | Adversarial Robustness, Benchmarking | Code Available | 1 |
| MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms | Feb 21, 2024 | Benchmarking, Hate Speech Detection | Code Available | 0 |
| PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models | Feb 21, 2024 | Benchmarking, Form | Code Available | 0 |
| CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models | Feb 21, 2024 | Benchmarking | Unverified | 0 |
| A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | Feb 21, 2024 | Benchmarking, Image to text | Unverified | 0 |
| KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers | Feb 20, 2024 | Benchmarking | Unverified | 0 |
| CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Feb 20, 2024 | Atomic number classification, Benchmarking | Code Available | 1 |
| Benchmarking Retrieval-Augmented Generation for Medicine | Feb 20, 2024 | Benchmarking, Information Retrieval | Code Available | 4 |
| CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Feb 19, 2024 | Benchmarking, Interpretability Techniques for Deep Learning | Code Available | 2 |
| Synthetic location trajectory generation using categorical diffusion models | Feb 19, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation | Feb 19, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| Event-Based Motion Magnification | Feb 19, 2024 | Benchmarking, Motion Detection | Code Available | 2 |
| Class-incremental Learning for Time Series: Benchmark and Evaluation | Feb 19, 2024 | Activity Recognition, Benchmarking | Code Available | 2 |
| AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies | Feb 19, 2024 | Benchmarking | Code Available | 0 |
| Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | Feb 18, 2024 | Benchmarking | Code Available | 2 |
| Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Feb 18, 2024 | Benchmarking, Language Modeling | Code Available | 1 |
| PEDANTS: Cheap but Effective and Interpretable Answer Equivalence | Feb 17, 2024 | Benchmarking, Form | Code Available | 2 |
| VATr++: Choose Your Words Wisely for Handwritten Text Generation | Feb 16, 2024 | Benchmarking, Text Generation | Unverified | 0 |
| Learning Disentangled Audio Representations through Controlled Synthesis | Feb 16, 2024 | Benchmarking, Disentanglement | Unverified | 0 |
| Benchmarking federated strategies in Peer-to-Peer Federated learning for biomedical data | Feb 15, 2024 | Benchmarking, Federated Learning | Unverified | 0 |
| Large-scale Benchmarking of Metaphor-based Optimization Heuristics | Feb 15, 2024 | Benchmarking, Experimental Design | Unverified | 0 |
| The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse | Feb 15, 2024 | Benchmarking, Model Editing | Code Available | 0 |