| E(3)-equivariant models cannot learn chirality: Field-based molecular generation | Feb 24, 2024 | BenchmarkingGraph Neural Network | —Unverified | 0 |
| API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Feb 23, 2024 | Benchmarkingslot-filling | CodeCode Available | 1 |
| Benchmarking the Robustness of Panoptic Segmentation for Automated Driving | Feb 23, 2024 | BenchmarkingDecision Making | —Unverified | 0 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Feb 23, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 2 |
| Benchmarking Observational Studies with Experimental Data under Right-Censoring | Feb 23, 2024 | Benchmarking | —Unverified | 0 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Feb 22, 2024 | Benchmarking | CodeCode Available | 0 |
| CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Feb 22, 2024 | Benchmarking | CodeCode Available | 1 |
| Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Feb 21, 2024 | Adversarial RobustnessBenchmarking | CodeCode Available | 1 |
| The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning | Feb 21, 2024 | BenchmarkingRepresentation Learning | CodeCode Available | 1 |
| MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms | Feb 21, 2024 | BenchmarkingHate Speech Detection | CodeCode Available | 0 |