| FRAMES: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy | Feb 8, 2025 | MMLU | Unverified | 0 |
| Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training | Feb 5, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| QLESS: A Quantized Approach for Data Valuation and Selection in Large Language Model Fine-Tuning | Feb 3, 2025 | Data Valuation, Language Modeling | Code Available | 0 |
| Evaluation of Large Language Models via Coupled Token Generation | Feb 3, 2025 | Chatbot, Large Language Model | Code Available | 0 |
| Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? | Feb 2, 2025 | Math, MMLU | Unverified | 0 |
| LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient | Feb 2, 2025 | MMLU | Code Available | 0 |
| DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance | Jan 29, 2025 | Diversity, MMLU | Code Available | 0 |
| IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | Jan 27, 2025 | Benchmarking, Diversity | Unverified | 0 |
| HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | Jan 26, 2025 | MMLU, Multiple-choice | Unverified | 0 |
| Humanity's Last Exam | Jan 24, 2025 | Humanity's Last Exam, Language Modeling | Unverified | 0 |
| On the Reasoning Capacity of AI Models and How to Quantify It | Jan 23, 2025 | Memorization, MMLU | Unverified | 0 |
| Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs | Jan 21, 2025 | GSM8K, In-Context Learning | Unverified | 0 |
| Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy | Jan 20, 2025 | MMLU | Code Available | 0 |
| DNA 1.0 Technical Report | Jan 18, 2025 | Belebele, GSM8K | Unverified | 0 |
| Inference-Time-Compute: More Faithful? A Research Note | Jan 14, 2025 | Attribute, MMLU | Unverified | 0 |
| CHAIR -- Classifier of Hallucination as Improver | Jan 5, 2025 | Hallucination, MMLU | Code Available | 0 |
| Unraveling Indirect In-Context Learning Using Influence Functions | Jan 1, 2025 | In-Context Learning, Informativeness | Unverified | 0 |
| Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs | Dec 31, 2024 | Conformal Prediction, Decision Making | Unverified | 0 |
| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | Dec 31, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | Dec 30, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study | Dec 19, 2024 | Astronomy, Domain Adaptation | Code Available | 0 |
| MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design | Dec 19, 2024 | MMLU, Quantization | Unverified | 0 |
| ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers | Dec 18, 2024 | MMLU, Reranking | Unverified | 0 |
| Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | Dec 17, 2024 | MMLU | Unverified | 0 |
| Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models | Dec 15, 2024 | MMLU, Quantization | Unverified | 0 |
| LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering | Dec 13, 2024 | Few-Shot Learning, Knowledge Distillation | Unverified | 0 |
| Llama 3 Meets MoE: Efficient Upcycling | Dec 13, 2024 | Mixture-of-Experts, MMLU | Code Available | 0 |
| Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | Dec 4, 2024 | MMLU | Unverified | 0 |
| Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset | Dec 3, 2024 | ARC, MMLU | Unverified | 0 |
| Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models | Dec 2, 2024 | MMLU, Multiple-choice | Code Available | 0 |
| The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? | Dec 2, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents | Dec 1, 2024 | Mathematical Reasoning, MMLU | Unverified | 0 |
| Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models | Nov 29, 2024 | MMLU | Unverified | 0 |
| Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference | Nov 27, 2024 | GSM8K, Language Modeling | Unverified | 0 |
| Predicting Emergent Capabilities by Finetuning | Nov 25, 2024 | CoLA, GSM8K | Unverified | 0 |
| Learning from "Silly" Questions Improves Large Language Models, But Only Slightly | Nov 21, 2024 | Econometrics, Global Facts | Unverified | 0 |
| GenBFA: An Evolutionary Optimization Approach to Bit-Flip Attacks on LLMs | Nov 21, 2024 | MMLU, Text Generation | Unverified | 0 |
| Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models | Nov 14, 2024 | Domain Generalization, In-Context Learning | Unverified | 0 |
| Reasoning Robustness of LLMs to Adversarial Typographical Errors | Nov 8, 2024 | GSM8K, MMLU | Unverified | 0 |
| Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents | Nov 5, 2024 | MMLU | Unverified | 0 |
| TODO: Enhancing LLM Alignment with Ternary Preferences | Nov 2, 2024 | ARC, MMLU | Code Available | 0 |
| Project MPG: towards a generalized performance benchmark for LLM capabilities | Oct 28, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment | Oct 23, 2024 | GSM8K, HumanEval | Unverified | 0 |
| Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning | Oct 18, 2024 | Math, Mathematical Reasoning | Unverified | 0 |
| BenTo: Benchmark Task Reduction with In-Context Transferability | Oct 17, 2024 | In-Context Learning, MMLU | Code Available | 0 |
| MIND: Math Informed syNthetic Dialogues for Pretraining LLMs | Oct 15, 2024 | GSM8K, Math | Unverified | 0 |
| G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks | Oct 15, 2024 | HumanEval, Language Modelling | Unverified | 0 |
| Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning | Oct 14, 2024 | In-Context Learning, MMLU | Code Available | 0 |
| Towards Multilingual LLM Evaluation for European Languages | Oct 11, 2024 | ARC, GSM8K | Unverified | 0 |
| Upcycling Large Language Models into Mixture of Experts | Oct 10, 2024 | Mixture-of-Experts, MMLU | Unverified | 0 |