| UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI | Dec 30, 2024 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| Stratify: Unifying Multi-Step Forecasting Strategies | Dec 29, 2024 | Benchmarking | Unverified | 0 |
| Towards Ideal Temporal Graph Neural Networks: Evaluations and Conclusions after 10,000 GPU Hours | Dec 28, 2024 | Benchmarking, GPU | Unverified | 0 |
| On dataset transferability in medical image classification | Dec 28, 2024 | Benchmarking, Classification | Code Available | 0 |
| Machine Generated Product Advertisements: Benchmarking LLMs Against Human Performance | Dec 27, 2024 | Benchmarking, Persuasiveness | Unverified | 0 |
| How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study | Dec 25, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| Re-assessing ImageNet: How aligned is its single-label assumption with its multi-label nature? | Dec 24, 2024 | Benchmarking | Unverified | 0 |
| The Jungle of Generative Drug Discovery: Traps, Treasures, and Ways Out | Dec 24, 2024 | Benchmarking, Deep Learning | Unverified | 0 |
| A Deep Reinforcement Learning Framework for Dynamic Portfolio Optimization: Evidence from China's Stock Market | Dec 24, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning | Dec 24, 2024 | Benchmarking | Code Available | 0 |
| Benchmarking Generative AI Models for Deep Learning Test Input Generation | Dec 23, 2024 | Benchmarking, Deep Learning | Code Available | 0 |
| Chumor 2.0: Towards Benchmarking Chinese Humor Understanding | Dec 23, 2024 | Benchmarking | Code Available | 0 |
| SCBench: A Sports Commentary Benchmark for Video LLMs | Dec 23, 2024 | Benchmarking | Unverified | 0 |
| Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations | Dec 23, 2024 | Benchmarking, Question Answering | Unverified | 0 |
| Multimodal Deep Reinforcement Learning for Portfolio Optimization | Dec 23, 2024 | Articles, Benchmarking | Unverified | 0 |
| StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs | Dec 23, 2024 | Benchmarking, Logical Reasoning | Unverified | 0 |
| First-frame Supervised Video Polyp Segmentation via Propagative and Semantic Dual-teacher Network | Dec 21, 2024 | Benchmarking, Transfer Learning | Code Available | 0 |
| Patherea: Cell Detection and Classification for the 2020s | Dec 21, 2024 | Benchmarking, Cell Detection | Unverified | 0 |
| HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios | Dec 21, 2024 | Benchmarking | Code Available | 0 |
| Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts | Dec 20, 2024 | Benchmarking, Optical Character Recognition | Code Available | 0 |
| TelcoLM: collecting data, adapting, and benchmarking language models for the telecommunication domain | Dec 20, 2024 | Benchmarking | Unverified | 0 |
| Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage | Dec 20, 2024 | Attribute, Benchmarking | Unverified | 0 |
| AI-generated Image Quality Assessment in Visual Communication | Dec 20, 2024 | Benchmarking, Image Quality Assessment | Code Available | 0 |
| Enriching Social Science Research via Survey Item Linking | Dec 20, 2024 | Benchmarking, Entity Disambiguation | Code Available | 0 |
| Benchmarking LLMs and SLMs for patient reported outcomes | Dec 20, 2024 | Benchmarking, Privacy Preserving | Unverified | 0 |
| A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Dec 20, 2024 | Benchmarking, Diagnostic | Code Available | 0 |
| Pitfalls of topology-aware image segmentation | Dec 19, 2024 | Benchmarking, Image Segmentation | Unverified | 0 |
| AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge | Dec 18, 2024 | Benchmarking, World Knowledge | Code Available | 0 |
| Generation of Large District Heating System Models Using Open-Source Data and Tools: An Exemplary Workflow | Dec 18, 2024 | Benchmarking | Unverified | 0 |
| Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning | Dec 18, 2024 | Benchmarking, Position | Unverified | 0 |
| DateLogicQA: Benchmarking Temporal Biases in Large Language Models | Dec 17, 2024 | Benchmarking | Code Available | 0 |
| Selective Shot Learning for Code Explanation | Dec 17, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking and Understanding Compositional Relational Reasoning of LLMs | Dec 17, 2024 | Benchmarking, Relational Reasoning | Code Available | 0 |
| C-FedRAG: A Confidential Federated Retrieval-Augmented Generation System | Dec 17, 2024 | Benchmarking, RAG | Unverified | 0 |
| AI PERSONA: Towards Life-long Personalization of LLMs | Dec 17, 2024 | Benchmarking | Unverified | 0 |
| A Scalable Approach to Benchmarking the In-Conversation Differential Diagnostic Accuracy of a Health AI | Dec 17, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models | Dec 17, 2024 | Benchmarking | Unverified | 0 |
| ShiftedBronzes: Benchmarking and Analysis of Domain Fine-Grained Classification in Open-World Settings | Dec 17, 2024 | Benchmarking | Unverified | 0 |
| F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration | Dec 17, 2024 | Benchmarking, Face Generation | Unverified | 0 |
| SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation | Dec 16, 2024 | Benchmarking, Dataset Generation | Code Available | 0 |
| PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension | Dec 16, 2024 | Benchmarking, Image Captioning | Unverified | 0 |
| QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs | Dec 16, 2024 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games | Dec 16, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation | Dec 15, 2024 | Articles, Benchmarking | Code Available | 0 |
| Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation | Dec 15, 2024 | 3D Generation, Benchmarking | Unverified | 0 |
| Sequence-Level Leakage Risk of Training Data in Large Language Models | Dec 15, 2024 | Benchmarking | Unverified | 0 |
| NoisyEQA: Benchmarking Embodied Question Answering Against Noisy Queries | Dec 14, 2024 | Benchmarking, Embodied Question Answering | Unverified | 0 |
| CRS Arena: Crowdsourced Benchmarking of Conversational Recommender Systems | Dec 13, 2024 | Benchmarking, Recommendation Systems | Unverified | 0 |
| Benchmarking Table Comprehension In The Wild | Dec 13, 2024 | Benchmarking, Question Answering | Unverified | 0 |
| Benchmarking Linguistic Diversity of Large Language Models | Dec 13, 2024 | Benchmarking, Diversity | Code Available | 0 |