| On the Evaluation of Speech Foundation Models for Spoken Language Understanding | Jun 14, 2024 | Benchmarking, Prediction | Unverified | 0 |
| Beyond Slow Signs in High-fidelity Model Extraction | Jun 14, 2024 | Benchmarking, model | Code Available | 0 |
| LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data | Jun 14, 2024 | Benchmarking, Decision Making | Code Available | 1 |
| Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Jun 14, 2024 | Benchmarking | Code Available | 1 |
| TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs | Jun 14, 2024 | Benchmarking, Knowledge Graphs | Code Available | 3 |
| CubeSat-Enabled Free-Space Optics: Joint Data Communication and Fine Beam Tracking | Jun 13, 2024 | Benchmarking | Unverified | 0 |
| ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents | Jun 13, 2024 | Benchmarking, Survey | Unverified | 0 |
| Decoding the Diversity: A Review of the Indic AI Research Landscape | Jun 13, 2024 | Benchmarking, Diversity | Unverified | 0 |
| DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks | Jun 13, 2024 | Benchmarking | Code Available | 3 |
| BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics | Jun 13, 2024 | Benchmarking | Code Available | 2 |
| ECBD: Evidence-Centered Benchmark Design for NLP | Jun 13, 2024 | Benchmarking | Code Available | 0 |
| StreamBench: Towards Benchmarking Continuous Improvement of Language Agents | Jun 13, 2024 | Benchmarking, Language Modeling | Code Available | 2 |
| Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition | Jun 13, 2024 | Benchmarking | Unverified | 0 |
| A Review of 315 Benchmark and Test Functions for Machine Learning Optimization Algorithms and Metaheuristics with Mathematical and Visual Descriptions | Jun 13, 2024 | Benchmarking | Unverified | 0 |
| SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models | Jun 13, 2024 | Benchmarking | Code Available | 1 |
| LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | Jun 13, 2024 | Benchmarking, Human-Object Interaction Detection | Unverified | 0 |
| Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Jun 13, 2024 | Benchmarking, LLM-generated Text Detection | Code Available | 1 |
| Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Jun 13, 2024 | Benchmarking, GPU | Code Available | 2 |
| Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Jun 13, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution | Jun 13, 2024 | Benchmarking, Image Super-Resolution | Code Available | 1 |
| DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Jun 13, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases | Jun 12, 2024 | Benchmarking, Model Compression | Unverified | 0 |
| ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets | Jun 12, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks | Jun 12, 2024 | Benchmarking, Chatbot | Code Available | 3 |
| TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | Jun 12, 2024 | Benchmarking, Image Generation | Code Available | 1 |