| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Feb 22, 2024 | Benchmarking | CodeCode Available | 0 |
| CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Feb 22, 2024 | Benchmarking | CodeCode Available | 1 |
| The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning | Feb 21, 2024 | BenchmarkingRepresentation Learning | CodeCode Available | 1 |
| Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Feb 21, 2024 | Adversarial RobustnessBenchmarking | CodeCode Available | 1 |
| MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms | Feb 21, 2024 | BenchmarkingHate Speech Detection | CodeCode Available | 0 |
| PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models | Feb 21, 2024 | BenchmarkingForm | CodeCode Available | 0 |
| CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models | Feb 21, 2024 | Benchmarking | —Unverified | 0 |
| A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | Feb 21, 2024 | BenchmarkingImage to text | —Unverified | 0 |
| KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers | Feb 20, 2024 | Benchmarking | —Unverified | 0 |
| CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Feb 20, 2024 | Atomic number classificationBenchmarking | CodeCode Available | 1 |
| Benchmarking Retrieval-Augmented Generation for Medicine | Feb 20, 2024 | BenchmarkingInformation Retrieval | CodeCode Available | 4 |
| CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Feb 19, 2024 | BenchmarkingInterpretability Techniques for Deep Learning | CodeCode Available | 2 |
| Synthetic location trajectory generation using categorical diffusion models | Feb 19, 2024 | BenchmarkingDecision Making | CodeCode Available | 0 |
| FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation | Feb 19, 2024 | BenchmarkingChatbot | —Unverified | 0 |
| Event-Based Motion Magnification | Feb 19, 2024 | BenchmarkingMotion Detection | CodeCode Available | 2 |
| Class-incremental Learning for Time Series: Benchmark and Evaluation | Feb 19, 2024 | Activity RecognitionBenchmarking | CodeCode Available | 2 |
| AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies | Feb 19, 2024 | Benchmarking | CodeCode Available | 0 |
| Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | Feb 18, 2024 | Benchmarking | CodeCode Available | 2 |
| Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Feb 18, 2024 | BenchmarkingLanguage Modeling | CodeCode Available | 1 |
| PEDANTS: Cheap but Effective and Interpretable Answer Equivalence | Feb 17, 2024 | BenchmarkingForm | CodeCode Available | 2 |
| VATr++: Choose Your Words Wisely for Handwritten Text Generation | Feb 16, 2024 | BenchmarkingText Generation | —Unverified | 0 |
| Learning Disentangled Audio Representations through Controlled Synthesis | Feb 16, 2024 | BenchmarkingDisentanglement | —Unverified | 0 |
| Benchmarking federated strategies in Peer-to-Peer Federated learning for biomedical data | Feb 15, 2024 | BenchmarkingFederated Learning | —Unverified | 0 |
| Large-scale Benchmarking of Metaphor-based Optimization Heuristics | Feb 15, 2024 | BenchmarkingExperimental Design | —Unverified | 0 |
| The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse | Feb 15, 2024 | BenchmarkingModel Editing | CodeCode Available | 0 |