| Adversarial Databases Improve Success in Retrieval-based Large Language Models | Jul 19, 2024 | Multiple-choice, RAG | Unverified | 0 |
| TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish | Jul 17, 2024 | Math, Multiple-choice | Code Available | 1 |
| Fine-tuning Multimodal Large Language Models for Product Bundling | Jul 16, 2024 | In-Context Learning, Multiple-choice | Code Available | 1 |
| MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models | Jul 16, 2024 | GPU, Multiple-choice | Unverified | 0 |
| Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models | Jul 15, 2024 | Backdoor Attack, Multiple-choice | Code Available | 1 |
| AstroMLab 1: Who Wins Astronomy Jeopardy!? | Jul 15, 2024 | Astronomy, Benchmarking | Unverified | 0 |
| NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models | Jul 15, 2024 | Common Sense Reasoning, Multiple-choice | Unverified | 0 |
| LAB-Bench: Measuring Capabilities of Language Models for Biology Research | Jul 14, 2024 | Language Modelling, Multiple-choice | Unverified | 0 |
| Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures? | Jul 12, 2024 | Logical Reasoning, Multiple-choice | Code Available | 0 |
| Evaluating Nuanced Bias in Large Language Model Free Response Answers | Jul 11, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Self-Recognition in Language Models | Jul 9, 2024 | Multiple-choice | Code Available | 0 |
| ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks | Jul 8, 2024 | Anomaly Detection, Code Generation | Code Available | 1 |
| Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty? | Jul 7, 2024 | Multiple-choice | Code Available | 0 |
| LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts | Jul 6, 2024 | Logical Reasoning, Mathematical Reasoning | Code Available | 1 |
| MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding | Jul 6, 2024 | Articles, Instruction Following | Code Available | 2 |
| Are Large Language Models Consistent over Value-laden Questions? | Jul 3, 2024 | Multiple-choice | Code Available | 0 |
| CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models | Jul 2, 2024 | Multiple-choice | Unverified | 0 |
| Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? | Jul 2, 2024 | Graph Mining, Language Modeling | Code Available | 0 |
| MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | Jun 29, 2024 | Multiple-choice | Code Available | 1 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 |
| Changing Answer Order Can Decrease MMLU Accuracy | Jun 27, 2024 | MMLU, Multiple-choice | Unverified | 0 |
| Length Optimization in Conformal Prediction | Jun 27, 2024 | Conformal Prediction, Language Modeling | Code Available | 0 |
| DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions | Jun 27, 2024 | Distractor Generation, Math | Code Available | 0 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Jun 25, 2024 | ARC, Benchmarking | Code Available | 0 |
| Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration | Jun 24, 2024 | Diversity, Multiple-choice | Unverified | 0 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | Benchmarking, Classification | Code Available | 1 |
| SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages | Jun 20, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism | Jun 19, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World | Jun 19, 2024 | Diagnostic, Multiple-choice | Code Available | 2 |
| Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | Jun 19, 2024 | Benchmarking, Distractor Generation | Unverified | 0 |
| On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models | Jun 18, 2024 | Multiple-choice | Unverified | 0 |
| UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Jun 18, 2024 | Benchmarking, Multiple-choice | Code Available | 0 |
| QOG:Question and Options Generation based on Language Model | Jun 18, 2024 | Information Retrieval, Language Modeling | Unverified | 0 |
| DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence? | Jun 18, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Jun 18, 2024 | Management, Multiple-choice | Code Available | 0 |
| Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models | Jun 18, 2024 | Multiple-choice | Unverified | 0 |
| Grade Score: Quantifying LLM Performance in Option Selection | Jun 17, 2024 | Decision Making, Fairness | Code Available | 0 |
| FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Jun 16, 2024 | Diversity, Multiple-choice | Code Available | 1 |
| Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions | Jun 16, 2024 | Decision Making, Language Modelling | Code Available | 0 |
| VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | Jun 16, 2024 | Action Understanding, Benchmarking | Unverified | 0 |
| VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It | Jun 15, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Jun 15, 2024 | Domain Adaptation, Language Modeling | Code Available | 1 |
| CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models | Jun 14, 2024 | Multiple-choice, Question Answering | Code Available | 2 |
| Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Jun 14, 2024 | Fairness, Logical Reasoning | Code Available | 0 |
| BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Jun 14, 2024 | Multiple-choice | Code Available | 1 |
| IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Jun 14, 2024 | Multiple-choice, Question Answering | Code Available | 1 |
| Bayesian Statistical Modeling with Predictors from LLMs | Jun 13, 2024 | Multiple-choice | Unverified | 0 |
| AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | Jun 13, 2024 | Multiple-choice | Unverified | 0 |
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Jun 13, 2024 | Multiple-choice, Visual Reasoning | Code Available | 1 |