| Agentic-HLS: An agentic reasoning based high-level synthesis system using large language models (AI for EDA workshop 2024) | Dec 2, 2024 | Benchmarking, High-Level Synthesis | Code Available | 0 |
| TextClass Benchmark: A Continuous Elo Rating of LLMs in Social Sciences | Nov 30, 2024 | Benchmarking, Classification | Code Available | 0 |
| Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Nov 29, 2024 | Benchmarking, DeepFake Detection | Code Available | 1 |
| One-Shot Real-to-Sim via End-to-End Differentiable Simulation and Rendering | Nov 29, 2024 | Benchmarking, Object | Unverified | 0 |
| Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis | Nov 29, 2024 | Benchmarking, Claim Verification | Code Available | 1 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| OpenQDC: Open Quantum Data Commons | Nov 29, 2024 | Benchmarking | Code Available | 2 |
| λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics | Nov 28, 2024 | Benchmarking, Diversity | Unverified | 0 |
| GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Nov 28, 2024 | Benchmarking, Object Counting | Code Available | 2 |
| Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks | Nov 28, 2024 | Benchmarking, Natural Language Inference | Unverified | 0 |