| TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | Jun 12, 2024 | Benchmarking, Image Generation | Code Available | 1 |
| MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents | Jun 12, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark | Jun 12, 2024 | Benchmarking, Mixture-of-Experts | Code Available | 1 |
| Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Jun 12, 2024 | Benchmarking, Causal Inference | Code Available | 1 |
| It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives | Jun 12, 2024 | All, Benchmarking | Unverified | 0 |
| How well it works: Benchmarking performance of GPT models on medical natural language processing tasks | Jun 12, 2024 | Benchmarking | Unverified | 0 |
| DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition | Jun 11, 2024 | Benchmarking, Cross-corpus | Unverified | 0 |
| A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection | Jun 11, 2024 | Benchmarking, Defect Detection | Unverified | 0 |
| Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing | Jun 11, 2024 | Benchmarking, Stance Detection | Unverified | 0 |
| Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images | Jun 11, 2024 | Benchmarking, GPU | Unverified | 0 |
| RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection | Jun 11, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Jun 11, 2024 | Benchmarking, Contrastive Learning | Code Available | 0 |
| MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models | Jun 11, 2024 | Benchmarking, Fairness | Unverified | 0 |
| AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Jun 11, 2024 | Benchmarking, text-to-speech | Code Available | 1 |
| JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models | Jun 10, 2024 | Benchmarking, Code Generation | Code Available | 0 |
| Data-driven Power Flow Linearization: Simulation | Jun 10, 2024 | Benchmarking, Computational Efficiency | Unverified | 0 |
| Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture | Jun 10, 2024 | Benchmarking, Decoder | Code Available | 0 |
| INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition | Jun 10, 2024 | Benchmarking, Emotion Recognition | Code Available | 0 |
| DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents | Jun 10, 2024 | Benchmarking, scientific discovery | Code Available | 3 |
| Can Language Models Serve as Text-Based World Simulators? | Jun 10, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking | Jun 10, 2024 | Benchmarking, Econometrics | Unverified | 0 |
| TopoBench: A Framework for Benchmarking Topological Deep Learning | Jun 9, 2024 | Benchmarking, Deep Learning | Code Available | 3 |
| Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking | Jun 9, 2024 | Benchmarking, Drug Discovery | Code Available | 1 |
| QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation | Jun 9, 2024 | Benchmarking, Question Generation | Code Available | 1 |
| ICU-Sepsis: A Benchmark MDP Built from Real Medical Data | Jun 9, 2024 | Benchmarking, Management | Code Available | 1 |