| What Matters in Transformers? Not All Attention is Needed | Jun 22, 2024 | All, MMLU | Code Available | 2 |
| Pistis-RAG: Enhancing Retrieval-Augmented Generation with Human Feedback | Jun 21, 2024 | Information Retrieval, Learning-To-Rank | Unverified | 0 |
| Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling | Jun 21, 2024 | Clustering, MMLU | Unverified | 0 |
| DEM: Distribution Edited Model for Training with Mixed Data Distributions | Jun 21, 2024 | Diversity, Instruction Following | Unverified | 0 |
| Optimised Grouped-Query Attention Mechanism for Transformers | Jun 21, 2024 | MMLU | Unverified | 0 |
| Understanding Finetuning for Factual Knowledge Extraction | Jun 20, 2024 | MMLU, Question Answering | Unverified | 0 |
| Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Jun 20, 2024 | GSM8K, Language Model Evaluation | Code Available | 0 |
| LiveMind: Low-latency Large Language Models with Simultaneous Inference | Jun 20, 2024 | Collaborative Inference, Language Modeling | Code Available | 1 |
| ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Jun 18, 2024 | All, GSM8K | Code Available | 14 |
| Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting | Jun 17, 2024 | Ethics, MMLU | Unverified | 0 |
| Input Conditioned Graph Generation for Language Agents | Jun 17, 2024 | Graph Generation, MMLU | Code Available | 0 |
| DataComp-LM: In search of the next generation of training sets for language models | Jun 17, 2024 | Language Modelling, MMLU | Code Available | 7 |
| The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance | Jun 17, 2024 | counterfactual, MMLU | Unverified | 0 |
| ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | Jun 16, 2024 | Continual Learning, GSM8K | Code Available | 0 |
| MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models | Jun 15, 2024 | Mathematical Reasoning, MMLU | Unverified | 0 |
| Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | Jun 15, 2024 | Benchmarking, HumanEval | Unverified | 0 |
| GEB-1.3B: Open Lightweight Large Language Model | Jun 14, 2024 | CPU, Language Modeling | Unverified | 0 |
| Quantifying Variance in Evaluation Benchmarks | Jun 14, 2024 | MMLU | Unverified | 0 |
| An Empirical Study of Mamba-based Language Models | Jun 12, 2024 | 16k, In-Context Learning | Unverified | 0 |
| Are We Done with MMLU? | Jun 6, 2024 | MMLU, Virology | Code Available | 3 |
| Does your data spark joy? Performance gains from domain upsampling at the end of training | Jun 5, 2024 | GSM8K, HumanEval | Unverified | 0 |
| MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures | Jun 3, 2024 | Chatbot, MMLU | Unverified | 0 |
| Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function | Jun 3, 2024 | Diversity, MMLU | Code Available | 0 |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | Jun 3, 2024 | MMLU, Multi-task Language Understanding | Code Available | 3 |
| Spanish and LLM Benchmarks: is MMLU Lost in Translation? | May 28, 2024 | MMLU, Translation | Unverified | 0 |