| UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models | Oct 17, 2024 | Benchmarking | Code Available | 0 |
| debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias | Oct 17, 2024 | Benchmarking, Bias Detection | Code Available | 0 |
| Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions | Oct 16, 2024 | Benchmarking | Unverified | 0 |
| Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation | Oct 16, 2024 | Benchmarking, Panoptic Segmentation | Unverified | 0 |
| Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Oct 16, 2024 | Benchmarking, Large Language Model | Code Available | 0 |
| Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs | Oct 16, 2024 | Benchmarking | Unverified | 0 |
| AERO: Softmax-Only LLMs for Efficient Private Inference | Oct 16, 2024 | Benchmarking, Decoder | Unverified | 0 |
| Benchmarking Data Efficiency in Δ-ML and Multifidelity Models for Quantum Chemistry | Oct 15, 2024 | Benchmarking | Code Available | 0 |
| Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos | Oct 15, 2024 | Benchmarking, Blind Face Restoration | Unverified | 0 |
| FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting | Oct 15, 2024 | Benchmarking, energy management | Unverified | 0 |
| Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning | Oct 14, 2024 | Atari Games, Benchmarking | Unverified | 0 |
| ChakmaNMT: A Low-resource Machine Translation On Chakma Language | Oct 14, 2024 | Benchmarking, Machine Translation | Unverified | 0 |
| Building a Multivariate Time Series Benchmarking Datasets Inspired by Natural Language Processing (NLP) | Oct 14, 2024 | Benchmarking, Multi-Task Learning | Unverified | 0 |
| The Trap of Presumed Equivalence: Artificial General Intelligence Should Not Be Assessed on the Scale of Human Intelligence | Oct 14, 2024 | Benchmarking | Unverified | 0 |
| Personalised Feedback Framework for Online Education Programmes Using Generative AI | Oct 14, 2024 | Benchmarking, Management | Unverified | 0 |
| SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing | Oct 14, 2024 | Benchmarking, Management | Code Available | 0 |
| Revisiting and Benchmarking Graph Autoencoders: A Contrastive Learning Perspective | Oct 14, 2024 | Benchmarking, Contrastive Learning | Code Available | 0 |
| LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English | Oct 12, 2024 | Benchmarking | Code Available | 0 |
| FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback | Oct 12, 2024 | Benchmarking | Code Available | 0 |
| Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models | Oct 12, 2024 | Benchmarking, Misinformation | Code Available | 0 |
| Guidelines for Fine-grained Sentence-level Arabic Readability Annotation | Oct 11, 2024 | Benchmarking, Sentence | Unverified | 0 |
| Can we hop in general? A discussion of benchmark selection and design using the Hopper environment | Oct 11, 2024 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example | Oct 11, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| ∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | Oct 11, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Enterprise Benchmarks for Large Language Model Evaluation | Oct 11, 2024 | Benchmarking, Language Model Evaluation | Code Available | 0 |
| A Comparative Analysis on Ethical Benchmarking in Large Language Models | Oct 11, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Identifying Money Laundering Subgraphs on the Blockchain | Oct 10, 2024 | Benchmarking | Code Available | 0 |
| Audio Explanation Synthesis with Generative Foundation Models | Oct 10, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations | Oct 10, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| Advocating Character Error Rate for Multilingual ASR Evaluation | Oct 9, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| InAttention: Linear Context Scaling for Transformers | Oct 9, 2024 | Benchmarking, Decoder | Unverified | 0 |
| Benchmarking Data Heterogeneity Evaluation Approaches for Personalized Federated Learning | Oct 9, 2024 | Benchmarking, Fairness | Code Available | 0 |
| TuringQ: Benchmarking AI Comprehension in Theory of Computation | Oct 9, 2024 | Benchmarking | Code Available | 0 |
| Analysis of different disparity estimation techniques on aerial stereo image datasets | Oct 9, 2024 | Benchmarking, Depth Estimation | Unverified | 0 |
| OmniPose6D: Towards Short-Term Object Pose Tracking in Dynamic Scenes from Monocular RGB | Oct 9, 2024 | Benchmarking, Diversity | Unverified | 0 |
| HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding | Oct 9, 2024 | Benchmarking, Instruction Following | Unverified | 0 |
| M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes | Oct 9, 2024 | Benchmarking, Motion Generation | Unverified | 0 |
| Active Evaluation Acquisition for Efficient LLM Benchmarking | Oct 8, 2024 | Benchmarking | Unverified | 0 |
| Manual Verbalizer Enrichment for Few-Shot Text Classification | Oct 8, 2024 | Benchmarking, Classification | Unverified | 0 |
| Benchmarking of a new data splitting method on volcanic eruption data | Oct 8, 2024 | Benchmarking | Unverified | 0 |
| QGym: Scalable Simulation and Benchmarking of Queuing Network Controllers | Oct 8, 2024 | Benchmarking | Code Available | 0 |
| Named Clinical Entity Recognition Benchmark | Oct 7, 2024 | Benchmarking, Decoder | Code Available | 0 |
| Precise Model Benchmarking with Only a Few Observations | Oct 7, 2024 | Benchmarking, model | Unverified | 0 |
| Rule-based Data Selection for Large Language Models | Oct 7, 2024 | Benchmarking, Math | Unverified | 0 |
| TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models | Oct 7, 2024 | Benchmarking, Segmentation | Code Available | 0 |
| Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems | Oct 7, 2024 | Benchmarking, Machine Translation | Unverified | 0 |
| Adjusting Pretrained Backbones for Performativity | Oct 6, 2024 | Benchmarking, Deep Learning | Code Available | 0 |
| ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | Oct 6, 2024 | Benchmarking, Mathematical Reasoning | Unverified | 0 |
| Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels | Oct 5, 2024 | Benchmarking | Unverified | 0 |
| Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends | Oct 5, 2024 | Benchmarking, Chart Understanding | Unverified | 0 |