| Trust but Verify: Programmatic VLM Evaluation in the Wild | Oct 17, 2024 | Benchmarking, Language Modelling | —Unverified | 0 |
| ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization | Oct 17, 2024 | Benchmarking, Stance Detection | Code Available | 0 |
| Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation | Oct 16, 2024 | Benchmarking, Panoptic Segmentation | —Unverified | 0 |
| Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Oct 16, 2024 | Benchmarking, Large Language Model | Code Available | 0 |
| WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Oct 16, 2024 | Benchmarking, Fairness | Code Available | 1 |
| AERO: Softmax-Only LLMs for Efficient Private Inference | Oct 16, 2024 | Benchmarking, Decoder | —Unverified | 0 |
| Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions | Oct 16, 2024 | Benchmarking | —Unverified | 0 |
| Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs | Oct 16, 2024 | Benchmarking | —Unverified | 0 |
| MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI | Oct 15, 2024 | Benchmarking | Code Available | 4 |
| Benchmarking Data Efficiency in Δ-ML and Multifidelity Models for Quantum Chemistry | Oct 15, 2024 | Benchmarking | Code Available | 0 |