| Title | Date | Tasks | Code | Stars |
|---|---|---|---|---|
| Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | Jan 1, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs | Jan 1, 2025 | Multiple-choice, Video Generation | Unverified | 0 |
| Separation of Powers: On Segregating Knowledge from Observation in LLM-enabled Knowledge-based Visual Question Answering | Jan 1, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding | Jan 1, 2025 | Action Recognition, Multiple-choice | Code Available | 0 |
| IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models | Jan 1, 2025 | Hallucination, Multiple-choice | Unverified | 0 |
| EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta | Dec 31, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| A review of faithfulness metrics for hallucination assessment in Large Language Models | Dec 31, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | Dec 31, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | Dec 31, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models | Dec 31, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs | Dec 31, 2024 | Conformal Prediction, Decision Making | Unverified | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | Dec 30, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| HindiLLM: Large Language Model for Hindi | Dec 29, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Using Large Language Models for Automated Grading of Student Writing about Science | Dec 25, 2024 | Astronomy, Multiple-choice | Unverified | 0 |
| In Case You Missed It: ARC 'Challenge' Is Not That Challenging | Dec 23, 2024 | ARC, Multiple-choice | Unverified | 0 |
| MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | Dec 19, 2024 | MMLU, Multiple-choice | Code Available | 2 |
| LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks | Dec 19, 2024 | 8k, In-Context Learning | Code Available | 5 |
| Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation | Dec 16, 2024 | Multiple-choice | Unverified | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | Dec 16, 2024 | Hallucination, Multiple-choice | Unverified | 0 |
| Auto-bidding in real-time auctions via Oracle Imitation Learning (OIL) | Dec 16, 2024 | Imitation Learning, Multiple-choice | Unverified | 0 |
| Seeing the Forest and the Trees: Solving Visual Graph and Tree Based Data Structure Problems using Large Multimodal Models | Dec 15, 2024 | Multiple-choice | Unverified | 0 |
| MedG-KRP: Medical Graph Knowledge Representation Probing | Dec 14, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 |
| Do LLMs Act as Repositories of Causal Knowledge? | Dec 14, 2024 | Causal Inference, Multiple-choice | Unverified | 0 |
| A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options | Dec 14, 2024 | Multiple-choice | Unverified | 0 |
| Superhuman performance of a large language model on the reasoning tasks of a physician | Dec 14, 2024 | Diagnostic, Language Modeling | Unverified | 0 |