| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | Benchmarking, Classification | Code Available | 1 |
| SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages | Jun 20, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism | Jun 19, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World | Jun 19, 2024 | Diagnostic, Multiple-choice | Code Available | 2 |
| Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | Jun 19, 2024 | Benchmarking, Distractor Generation | Unverified | 0 |
| On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models | Jun 18, 2024 | Multiple-choice | Unverified | 0 |
| UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Jun 18, 2024 | Benchmarking, Multiple-choice | Code Available | 0 |
| QOG:Question and Options Generation based on Language Model | Jun 18, 2024 | Information Retrieval, Language Modeling | Unverified | 0 |
| DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence? | Jun 18, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Jun 18, 2024 | Management, Multiple-choice | Code Available | 0 |
| Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models | Jun 18, 2024 | Multiple-choice | Unverified | 0 |
| Grade Score: Quantifying LLM Performance in Option Selection | Jun 17, 2024 | Decision Making, Fairness | Code Available | 0 |
| FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Jun 16, 2024 | Diversity, Multiple-choice | Code Available | 1 |
| Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions | Jun 16, 2024 | Decision Making, Language Modelling | Code Available | 0 |
| VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | Jun 16, 2024 | Action Understanding, Benchmarking | Unverified | 0 |
| VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It | Jun 15, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Jun 15, 2024 | Domain Adaptation, Language Modeling | Code Available | 1 |
| CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models | Jun 14, 2024 | Multiple-choice, Question Answering | Code Available | 2 |
| Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Jun 14, 2024 | Fairness, Logical Reasoning | Code Available | 0 |
| BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Jun 14, 2024 | Multiple-choice | Code Available | 1 |
| IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Jun 14, 2024 | Multiple-choice, Question Answering | Code Available | 1 |
| Bayesian Statistical Modeling with Predictors from LLMs | Jun 13, 2024 | Multiple-choice | Unverified | 0 |
| AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | Jun 13, 2024 | Multiple-choice | Unverified | 0 |
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Jun 13, 2024 | Multiple-choice, Visual Reasoning | Code Available | 1 |