| CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans | Jun 22, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | Jun 22, 2024 | Benchmarking, Code Generation | Code Available | 4 |
| Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph | Jun 21, 2024 | Benchmarking, Text Generation | Code Available | 2 |
| GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Jun 21, 2024 | AI Agent, AutoML | Code Available | 2 |
| Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors | Jun 21, 2024 | Adversarial Defense, Adversarial Robustness | Unverified | 0 |
| Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models | Jun 21, 2024 | Benchmarking | Code Available | 1 |
| NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking | Jun 21, 2024 | Autonomous Driving, Benchmarking | Code Available | 7 |
| Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization | Jun 21, 2024 | Benchmarking, Segmentation | Code Available | 0 |
| Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video | Jun 21, 2024 | Benchmarking, Few-Shot Learning | Unverified | 0 |
| FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | Jun 21, 2024 | Benchmarking | Unverified | 0 |
| CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines | Jun 20, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary | Jun 20, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules | Jun 20, 2024 | Benchmarking | Code Available | 0 |
| Beyond Optimism: Exploration With Partially Observable Rewards | Jun 20, 2024 | Benchmarking, Reinforcement Learning (RL) | Code Available | 0 |
| Selected Languages are All You Need for Cross-lingual Truthfulness Transfer | Jun 20, 2024 | All, Benchmarking | Code Available | 0 |
| How far are today's time-series models from real-world weather forecasting applications? | Jun 20, 2024 | Benchmarking, Time Series | Code Available | 2 |
| The Elusive Pursuit of Reproducing PATE-GAN: Benchmarking, Auditing, Debugging | Jun 20, 2024 | Benchmarking | Code Available | 0 |
| Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data | Jun 20, 2024 | Animal Pose Estimation, Benchmarking | Unverified | 0 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | Benchmarking, Classification | Code Available | 1 |
| HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? | Jun 20, 2024 | Benchmarking, Point Processes | Code Available | 2 |
| Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks | Jun 20, 2024 | Benchmarking, Medical Image Analysis | Unverified | 0 |
| DASB -- Discrete Audio and Speech Benchmark | Jun 20, 2024 | Benchmarking, Emotion Recognition | Unverified | 0 |
| A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data | Jun 20, 2024 | Benchmarking, Kolmogorov-Arnold Networks | Code Available | 1 |
| FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability | Jun 20, 2024 | Benchmarking, Fairness | Code Available | 0 |
| PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions | Jun 20, 2024 | Animal Pose Estimation, Autonomous Driving | Unverified | 0 |