| RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | Oct 21, 2024 | Benchmarking, Language Modeling | Code Available | 2 |
| Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios | Oct 21, 2024 | Benchmarking, Few-Shot Learning | Code Available | 0 |
| Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Oct 21, 2024 | Benchmarking, Instruction Following | Code Available | 2 |
| Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping | Oct 21, 2024 | Benchmarking | Unverified | 0 |
| A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data | Oct 21, 2024 | Benchmarking | Unverified | 0 |
| Comprehensive benchmarking of large language models for RNA secondary structure prediction | Oct 21, 2024 | Benchmarking | Code Available | 1 |
| Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence | Oct 20, 2024 | Benchmarking | Unverified | 0 |
| IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Oct 19, 2024 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 2 |
| SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Oct 19, 2024 | AI Agent, Benchmarking | Code Available | 2 |
| FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning | Oct 19, 2024 | Benchmarking, Drug Discovery | Code Available | 0 |
| Advancing Histopathology with Deep Learning Under Data Scarcity: A Decade in Review | Oct 18, 2024 | Benchmarking, Deep Learning | Unverified | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | Oct 18, 2024 | Benchmarking, Fairness | Unverified | 0 |
| Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Oct 18, 2024 | Autonomous Navigation, Benchmarking | Code Available | 1 |
| MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Oct 18, 2024 | Benchmarking, Question Answering | Code Available | 1 |
| Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all | Oct 17, 2024 | All, Benchmarking | Code Available | 1 |
| UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models | Oct 17, 2024 | Benchmarking | Code Available | 0 |
| Sum Secrecy Rate Maximization for Full Duplex ISAC Systems | Oct 17, 2024 | Benchmarking, Integrated sensing and communication | Unverified | 0 |
| Trust but Verify: Programmatic VLM Evaluation in the Wild | Oct 17, 2024 | Benchmarking, Language Modelling | Unverified | 0 |
| Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large p | Oct 17, 2024 | Benchmarking, regression | Code Available | 0 |
| debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias | Oct 17, 2024 | Benchmarking, Bias Detection | Code Available | 0 |
| ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization | Oct 17, 2024 | Benchmarking, Stance Detection | Code Available | 0 |
| Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs | Oct 17, 2024 | Benchmarking | Code Available | 0 |
| WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Oct 16, 2024 | Benchmarking, Fairness | Code Available | 1 |
| Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Oct 16, 2024 | Benchmarking, Large Language Model | Code Available | 0 |
| Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation | Oct 16, 2024 | Benchmarking, Panoptic Segmentation | Unverified | 0 |
| AERO: Softmax-Only LLMs for Efficient Private Inference | Oct 16, 2024 | Benchmarking, Decoder | Unverified | 0 |
| Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions | Oct 16, 2024 | Benchmarking | Unverified | 0 |
| Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs | Oct 16, 2024 | Benchmarking | Unverified | 0 |
| MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI | Oct 15, 2024 | Benchmarking | Code Available | 4 |
| Benchmarking Data Efficiency in Δ-ML and Multifidelity Models for Quantum Chemistry | Oct 15, 2024 | Benchmarking | Code Available | 0 |
| Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos | Oct 15, 2024 | Benchmarking, Blind Face Restoration | Unverified | 0 |
| FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting | Oct 15, 2024 | Benchmarking, energy management | Unverified | 0 |
| RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation | Oct 15, 2024 | Benchmarking, Interactive Segmentation | Code Available | 1 |
| The Trap of Presumed Equivalence: Artificial General Intelligence Should Not Be Assessed on the Scale of Human Intelligence | Oct 14, 2024 | Benchmarking | Unverified | 0 |
| Personalised Feedback Framework for Online Education Programmes Using Generative AI | Oct 14, 2024 | Benchmarking, Management | Unverified | 0 |
| ChakmaNMT: A Low-resource Machine Translation On Chakma Language | Oct 14, 2024 | Benchmarking, Machine Translation | Unverified | 0 |
| LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory | Oct 14, 2024 | Benchmarking, Large Language Model | Code Available | 3 |
| Revisiting and Benchmarking Graph Autoencoders: A Contrastive Learning Perspective | Oct 14, 2024 | Benchmarking, Contrastive Learning | Code Available | 0 |
| Building a Multivariate Time Series Benchmarking Datasets Inspired by Natural Language Processing (NLP) | Oct 14, 2024 | Benchmarking, Multi-Task Learning | Unverified | 0 |
| SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing | Oct 14, 2024 | Benchmarking, Management | Code Available | 0 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2k, Benchmarking | Code Available | 1 |
| Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning | Oct 14, 2024 | Atari Games, Benchmarking | Unverified | 0 |
| RMB: Comprehensively Benchmarking Reward Models in LLM Alignment | Oct 13, 2024 | Benchmarking | Code Available | 1 |
| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Oct 13, 2024 | Benchmarking, Graph Generation | Code Available | 2 |
| LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Oct 13, 2024 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models | Oct 12, 2024 | Benchmarking, Misinformation | Code Available | 0 |
| LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English | Oct 12, 2024 | Benchmarking | Code Available | 0 |
| FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback | Oct 12, 2024 | Benchmarking | Code Available | 0 |
| A Comparative Analysis on Ethical Benchmarking in Large Language Models | Oct 11, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Enterprise Benchmarks for Large Language Model Evaluation | Oct 11, 2024 | Benchmarking, Language Model Evaluation | Code Available | 0 |