| Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training | Feb 5, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| QLESS: A Quantized Approach for Data Valuation and Selection in Large Language Model Fine-Tuning | Feb 3, 2025 | Data ValuationLanguage Modeling | CodeCode Available | 0 |
| Evaluation of Large Language Models via Coupled Token Generation | Feb 3, 2025 | ChatbotLarge Language Model | CodeCode Available | 0 |
| Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? | Feb 2, 2025 | MathMMLU | —Unverified | 0 |
| LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient | Feb 2, 2025 | MMLU | CodeCode Available | 0 |
| DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance | Jan 29, 2025 | DiversityMMLU | CodeCode Available | 0 |
| IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | Jan 27, 2025 | BenchmarkingDiversity | —Unverified | 0 |
| HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | Jan 26, 2025 | MMLUMultiple-choice | —Unverified | 0 |
| Humanity's Last Exam | Jan 24, 2025 | Humanity's Last ExamLanguage Modeling | —Unverified | 0 |
| On the Reasoning Capacity of AI Models and How to Quantify It | Jan 23, 2025 | MemorizationMMLU | —Unverified | 0 |
| Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs | Jan 21, 2025 | GSM8KIn-Context Learning | —Unverified | 0 |
| MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Jan 20, 2025 | Decision MakingGSM8K | CodeCode Available | 1 |
| Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy | Jan 20, 2025 | MMLU | CodeCode Available | 0 |
| Control LLM: Controlled Evolution for Intelligence Retention in LLM | Jan 19, 2025 | MathMathematical Reasoning | CodeCode Available | 1 |
| DNA 1.0 Technical Report | Jan 18, 2025 | BelebeleGSM8K | —Unverified | 0 |
| Inference-Time-Compute: More Faithful? A Research Note | Jan 14, 2025 | AttributeMMLU | —Unverified | 0 |
| CHAIR -- Classifier of Hallucination as Improver | Jan 5, 2025 | HallucinationMMLU | CodeCode Available | 0 |
| Unraveling Indirect In-Context Learning Using Influence Functions | Jan 1, 2025 | In-Context LearningInformativeness | —Unverified | 0 |
| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | Dec 31, 2024 | Language Model EvaluationLanguage Modeling | —Unverified | 0 |
| Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs | Dec 31, 2024 | Conformal PredictionDecision Making | —Unverified | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | Dec 30, 2024 | BenchmarkingCode Generation | —Unverified | 0 |
| MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design | Dec 19, 2024 | MMLUQuantization | —Unverified | 0 |
| MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | Dec 19, 2024 | MMLUMultiple-choice | CodeCode Available | 2 |
| ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study | Dec 19, 2024 | AstronomyDomain Adaptation | CodeCode Available | 0 |
| ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers | Dec 18, 2024 | MMLUReranking | —Unverified | 0 |