| The Achievement of Higher Flexibility in Multiple Choice-based Tests Using Image Classification Techniques | Nov 2, 2017 | BIG-bench Machine Learning, General Classification | —Unverified | 0 | 0 |
| AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | Dec 31, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic | Mar 14, 2024 | Ethics, Multiple-choice | —Unverified | 0 | 0 |
| A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options | Dec 14, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | May 15, 2025 | Informativeness, Multiple-choice | —Unverified | 0 | 0 |
| A review of faithfulness metrics for hallucination assessment in Large Language Models | Dec 31, 2024 | Benchmarking, Hallucination | —Unverified | 0 | 0 |
| Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation | Dec 16, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| ARGUS: Hallucination and Omission Evaluation in Video-LLMs | Jun 9, 2025 | Descriptive, Form | —Unverified | 0 | 0 |
| ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition | Oct 8, 2024 | Action Recognition, Multiple-choice | —Unverified | 0 | 0 |
| Aryl: An Elastic Cluster Scheduler for Deep Learning | Feb 16, 2022 | Deep Learning, GPU | —Unverified | 0 | 0 |