| OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | Dec 31, 2024 | Benchmarking, Logical Reasoning | Code Available | 4 |
| A review of faithfulness metrics for hallucination assessment in Large Language Models | Dec 31, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | Dec 31, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| Measuring Large Language Models Capacity to Annotate Journalistic Sourcing | Dec 30, 2024 | Benchmarking, Ethics | Unverified | 0 |
| TrajLearn: Trajectory Prediction Learning using Deep Generative Models | Dec 30, 2024 | Autonomous Navigation, Benchmarking | Code Available | 1 |
| UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI | Dec 30, 2024 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | Dec 30, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| Stratify: Unifying Multi-Step Forecasting Strategies | Dec 29, 2024 | Benchmarking | Unverified | 0 |
| On dataset transferability in medical image classification | Dec 28, 2024 | Benchmarking, Classification | Code Available | 0 |
| Towards Ideal Temporal Graph Neural Networks: Evaluations and Conclusions after 10,000 GPU Hours | Dec 28, 2024 | Benchmarking, GPU | Unverified | 0 |
| Machine Generated Product Advertisements: Benchmarking LLMs Against Human Performance | Dec 27, 2024 | Benchmarking, Persuasiveness | Unverified | 0 |
| How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study | Dec 25, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| The Jungle of Generative Drug Discovery: Traps, Treasures, and Ways Out | Dec 24, 2024 | Benchmarking, Deep Learning | Unverified | 0 |
| Re-assessing ImageNet: How aligned is its single-label assumption with its multi-label nature? | Dec 24, 2024 | Benchmarking | Unverified | 0 |
| MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning | Dec 24, 2024 | Benchmarking | Code Available | 0 |
| A Deep Reinforcement Learning Framework for Dynamic Portfolio Optimization: Evidence from China's Stock Market | Dec 24, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations | Dec 23, 2024 | Benchmarking, Question Answering | Unverified | 0 |
| StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs | Dec 23, 2024 | Benchmarking, Logical Reasoning | Unverified | 0 |
| Benchmarking Generative AI Models for Deep Learning Test Input Generation | Dec 23, 2024 | Benchmarking, Deep Learning | Code Available | 0 |
| Multimodal Deep Reinforcement Learning for Portfolio Optimization | Dec 23, 2024 | Articles, Benchmarking | Unverified | 0 |
| SCBench: A Sports Commentary Benchmark for Video LLMs | Dec 23, 2024 | Benchmarking | Unverified | 0 |
| SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC | Dec 23, 2024 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 1 |
| On the Generalization Ability of Machine-Generated Text Detectors | Dec 23, 2024 | Benchmarking, Misinformation | Code Available | 1 |
| Chumor 2.0: Towards Benchmarking Chinese Humor Understanding | Dec 23, 2024 | Benchmarking | Code Available | 0 |
| Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders | Dec 23, 2024 | 3D Shape Modeling, Benchmarking | Code Available | 4 |