| Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Jun 13, 2024 | Benchmarking, LLM-generated Text Detection | Code Available | 1 |
| TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | Jun 12, 2024 | Benchmarking, Image Generation | Code Available | 1 |
| Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark | Jun 12, 2024 | Benchmarking, Mixture-of-Experts | Code Available | 1 |
| Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Jun 12, 2024 | Benchmarking, Causal Inference | Code Available | 1 |
| RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection | Jun 11, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Jun 11, 2024 | Benchmarking, text-to-speech | Code Available | 1 |
| QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation | Jun 9, 2024 | Benchmarking, Question Generation | Code Available | 1 |
| EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models | Jun 9, 2024 | Benchmarking | Code Available | 1 |
| Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking | Jun 9, 2024 | Benchmarking, Drug Discovery | Code Available | 1 |
| ICU-Sepsis: A Benchmark MDP Built from Real Medical Data | Jun 9, 2024 | Benchmarking, Management | Code Available | 1 |