| RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | Oct 21, 2024 | Benchmarking, Language Modeling | Code Available | 2 |
| Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Oct 21, 2024 | Benchmarking, Instruction Following | Code Available | 2 |
| Hiding in Plain Sight: Reframing Hardware Trojan Benchmarking as a Hide&Seek Modification | Oct 21, 2024 | Benchmarking | Unverified | 0 |
| Comprehensive benchmarking of large language models for RNA secondary structure prediction | Oct 21, 2024 | Benchmarking | Code Available | 1 |
| A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data | Oct 21, 2024 | Benchmarking | Unverified | 0 |
| Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping | Oct 21, 2024 | Benchmarking | Unverified | 0 |
| Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence | Oct 20, 2024 | Benchmarking | Unverified | 0 |
| FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning | Oct 19, 2024 | Benchmarking, Drug Discovery | Code Available | 0 |
| IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Oct 19, 2024 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 2 |
| SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Oct 19, 2024 | AI Agent, Benchmarking | Code Available | 2 |