| Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Jul 24, 2023 | Contrastive Learning, Multimodal Reasoning | Code Available | 1 |
| SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models | Jul 20, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla | Jul 18, 2023 | Multiple-choice, Question Answering | Unverified | 0 |
| Assessing the Quality of Multiple-Choice Questions Using GPT-4 and Rule-Based Methods | Jul 16, 2023 | Multiple-choice | Code Available | 0 |
| MMBench: Is Your Multi-modal Model an All-around Player? | Jul 12, 2023 | All, Instruction Following | Code Available | 5 |
| Analyzing Multiple-Choice Reading and Listening Comprehension Tests | Jul 3, 2023 | Multiple-choice, Reading Comprehension | Unverified | 0 |
| Structured Dialogue Discourse Parsing | Jun 26, 2023 | Discourse Parsing, Multiple-choice | Code Available | 0 |
| Chance-Constrained Multiple-Choice Knapsack Problem: Model, Algorithms, and Applications | Jun 26, 2023 | Combinatorial Optimization, Multiple-choice | Code Available | 0 |
| Analysis of the Cambridge Multiple-Choice Questions Reading Dataset with a Focus on Candidate Response Distribution | Jun 22, 2023 | Multiple-choice | Unverified | 0 |
| Solving and Generating NPR Sunday Puzzles with Large Language Models | Jun 21, 2023 | Multiple-choice, Prompt Engineering | Code Available | 0 |
| RECAP-KG: Mining Knowledge Graphs from Raw GP Notes for Remote COVID-19 Assessment in Primary Care | Jun 17, 2023 | Decision Making, Graph Construction | Unverified | 0 |
| Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses | Jun 15, 2023 | Multiple-choice | Unverified | 0 |
| Can ChatGPT pass the Vietnamese National High School Graduation Examination? | Jun 15, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Questioning the Survey Responses of Large Language Models | Jun 13, 2023 | Multiple-choice, Survey | Code Available | 0 |
| Investigating the Effectiveness of ChatGPT in Mathematical Reasoning and Problem Solving: Evidence from the Vietnamese National High School Graduation Examination | Jun 10, 2023 | Math, Mathematical Reasoning | Unverified | 0 |
| Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation | Jun 9, 2023 | Jurisprudence, Management | Code Available | 1 |
| Network-based Representations and Dynamic Discrete Choice Models for Multiple Discrete Choice Analysis | Jun 7, 2023 | Discrete Choice Models, Multiple-choice | Unverified | 0 |
| Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Jun 5, 2023 | Benchmarking, Multiple-choice | Code Available | 1 |
| Conformal Prediction with Large Language Models for Multi-Choice Question Answering | May 28, 2023 | Conformal Prediction, Multiple-choice | Code Available | 1 |
| Fine-Tuning Language Models with Just Forward Passes | May 27, 2023 | GPU, In-Context Learning | Code Available | 3 |
| BUCA: A Binary Classification Approach to Unsupervised Commonsense Question Answering | May 25, 2023 | Binary Classification, Knowledge Graphs | Code Available | 0 |
| ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind | May 24, 2023 | Multiple-choice, Question Answering | Code Available | 0 |
| Have Large Language Models Developed a Personality?: Applicability of Self-Assessment Tests in Measuring Personality in LLMs | May 24, 2023 | Multiple-choice | Unverified | 0 |
| This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models | May 24, 2023 | Language Modelling, Large Language Model | Code Available | 0 |
| Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy | May 24, 2023 | In-Context Learning, Multiple-choice | Code Available | 0 |