| Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models | Dec 17, 2024 | Benchmarking | —Unverified | 0 |
| Benchmarking and Understanding Compositional Relational Reasoning of LLMs | Dec 17, 2024 | Benchmarking, Relational Reasoning | Code Available | 0 |
| Selective Shot Learning for Code Explanation | Dec 17, 2024 | Benchmarking | —Unverified | 0 |
| C-FedRAG: A Confidential Federated Retrieval-Augmented Generation System | Dec 17, 2024 | Benchmarking, RAG | —Unverified | 0 |
| AI PERSONA: Towards Life-long Personalization of LLMs | Dec 17, 2024 | Benchmarking | —Unverified | 0 |
| F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration | Dec 17, 2024 | Benchmarking, Face Generation | —Unverified | 0 |
| A Scalable Approach to Benchmarking the In-Conversation Differential Diagnostic Accuracy of a Health AI | Dec 17, 2024 | Benchmarking, Chatbot | —Unverified | 0 |
| ShiftedBronzes: Benchmarking and Analysis of Domain Fine-Grained Classification in Open-World Settings | Dec 17, 2024 | Benchmarking | —Unverified | 0 |
| How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games | Dec 16, 2024 | Benchmarking, Chatbot | —Unverified | 0 |
| CharacterBench: Benchmarking Character Customization of Large Language Models | Dec 16, 2024 | Benchmarking | Code Available | 1 |
| SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation | Dec 16, 2024 | Benchmarking, Dataset Generation | Code Available | 0 |
| MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation | Dec 16, 2024 | All, Benchmarking | Code Available | 1 |
| QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs | Dec 16, 2024 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension | Dec 16, 2024 | Benchmarking, Image Captioning | —Unverified | 0 |
| AD-LLM: Benchmarking Large Language Models for Anomaly Detection | Dec 15, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation | Dec 15, 2024 | Articles, Benchmarking | Code Available | 0 |
| Sequence-Level Leakage Risk of Training Data in Large Language Models | Dec 15, 2024 | Benchmarking | —Unverified | 0 |
| Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation | Dec 15, 2024 | 3D Generation, Benchmarking | —Unverified | 0 |
| NoisyEQA: Benchmarking Embodied Question Answering Against Noisy Queries | Dec 14, 2024 | Benchmarking, Embodied Question Answering | —Unverified | 0 |
| NeuralPLexer3: Accurate Biomolecular Complex Structure Prediction with Flow Models | Dec 14, 2024 | Benchmarking, Drug Design | Code Available | 2 |
| EvalGIM: A Library for Evaluating Generative Image Models | Dec 13, 2024 | Benchmarking, Diversity | Code Available | 2 |
| CRS Arena: Crowdsourced Benchmarking of Conversational Recommender Systems | Dec 13, 2024 | Benchmarking, Recommendation Systems | —Unverified | 0 |
| Benchmarking Table Comprehension In The Wild | Dec 13, 2024 | Benchmarking, Question Answering | —Unverified | 0 |
| Benchmarking Linguistic Diversity of Large Language Models | Dec 13, 2024 | Benchmarking, Diversity | Code Available | 0 |
| Benchmarking large language models for materials synthesis: the case of atomic layer deposition | Dec 13, 2024 | Benchmarking, Hallucination | —Unverified | 0 |