| Applying RLAIF for Code Generation with API-usage in Lightweight LLMs | Jun 28, 2024 | Code Generation, Hallucination | Unverified | 0 |
| Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts | Jun 24, 2024 | Mathematical Reasoning, Visual Question Answering (VQA) | Unverified | 0 |
| Anomaly Detection of Tabular Data Using LLMs | Jun 24, 2024 | Anomaly Detection, Long-Context Understanding | Unverified | 0 |
| Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads | Jun 22, 2024 | Mathematical Reasoning | Unverified | 0 |
| Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models | Jun 18, 2024 | Mathematical Reasoning | Code Available | 0 |
| CodeGemma: Open Code Models Based on Gemma | Jun 17, 2024 | Code Completion, Mathematical Reasoning | Unverified | 0 |
| Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | Jun 16, 2024 | Benchmarking, Math | Unverified | 0 |
| MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models | Jun 15, 2024 | Mathematical Reasoning, MMLU | Unverified | 0 |
| ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models | Jun 13, 2024 | Code Generation, Domain Classification | Unverified | 0 |
| Robustness Assessment of Mathematical Reasoning in the Presence of Missing and Contradictory Conditions | Jun 7, 2024 | Hallucination, Mathematical Reasoning | Unverified | 0 |