SOTAVerified

Instruction Following

Instruction following is a fundamental capability of large language models. This task evaluates how well a model adheres to human instructions, with the goal of producing controllable and safe responses.

Papers

Showing 501–550 of 1135 papers

| Title | Status | Hype |
| --- | --- | --- |
| CASTILLO: Characterizing Response Length Distributions of Large Language Models | Code | 0 |
| ToDi: Token-wise Distillation via Fine-Grained Divergence Control | | 0 |
| ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models | | 0 |
| LIFEBench: Evaluating Length Instruction Following in Large Language Models | Code | 0 |
| Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective | | 0 |
| Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought | | 0 |
| ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy | | 0 |
| FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management | | 0 |
| Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning | | 0 |
| Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training | | 0 |
| DecIF: Improving Instruction-Following through Meta-Decomposition | | 0 |
| Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels | | 0 |
| Domain Adaptation of VLM for Soccer Video Understanding | | 0 |
| Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks | | 0 |
| Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers | | 0 |
| What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts | Code | 0 |
| Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers | | 0 |
| KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025 | | 0 |
| CompBench: Benchmarking Complex Instruction-guided Image Editing | | 0 |
| Enhancing Complex Instruction Following for Large Language Models with Mixture-of-Contexts Fine-tuning | | 0 |
| Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors | Code | 0 |
| Navigating the Alpha Jungle: An LLM-Powered MCTS Framework for Formulaic Factor Mining | | 0 |
| HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages | | 0 |
| When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs | | 0 |
| GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents | | 0 |
| UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation | | 0 |
| Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation | | 0 |
| Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? | Code | 0 |
| Efficient Telecom Specific LLM: TSLAM-Mini with QLoRA and Digital Twin Data | | 0 |
| Assessing Robustness to Spurious Correlations in Post-Training Language Models | | 0 |
| T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models | | 0 |
| Incentivizing Inclusive Contributions in Model Sharing Markets | | 0 |
| PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents | | 0 |
| T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation | | 0 |
| UAV-VLN: End-to-End Vision Language guided Navigation for UAVs | | 0 |
| Ask, Fail, Repeat: Meeseeks, an Iterative Feedback Benchmark for LLMs' Multi-turn Instruction-Following Ability | | 0 |
| TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models | Code | 0 |
| CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks | | 0 |
| Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs | | 0 |
| ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance | | 0 |
| ParamΔ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost | | 0 |
| Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code | | 0 |
| DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models | | 0 |
| Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators | Code | 0 |
| Improving Instruct Models for Free: A Study on Partial Adaptation | | 0 |
| SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning | | 0 |
| Playpen: An Environment for Exploring Learning Through Conversational Interaction | Code | 0 |
| VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding | | 0 |
| Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models | | 0 |
| Holistic Capability Preservation: Towards Compact Yet Comprehensive Reasoning Models | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | AutoIF (Llama3 70B) | Inst-level loose-accuracy | 90.4 | | Unverified |
| 2 | AutoIF (Qwen2 72B) | Inst-level loose-accuracy | 88 | | Unverified |
| 3 | GPT-4 | Inst-level loose-accuracy | 85.37 | | Unverified |
| 4 | PaLM 2 S | Inst-level loose-accuracy | 59.11 | | Unverified |
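
For reference, instruction-level loose accuracy (the metric in the table above, as used in IFEval-style verifiable-instruction evaluation) scores each checkable instruction independently and counts it as followed if any of several relaxed renderings of the response passes its programmatic check. The sketch below is a minimal, hypothetical illustration: the `loose_variants` transformations and the example verifiers are assumptions for demonstration, not any benchmark's actual implementation.

```python
# Minimal sketch of instruction-level loose accuracy.
# Assumption: each prompt carries a list of verifier functions, one per
# instruction; the "loose" relaxations below are illustrative only.

def loose_variants(response: str):
    """Yield relaxed renderings of a response to check before failing it."""
    lines = response.strip().split("\n")
    candidates = [
        response,
        response.replace("*", ""),   # drop markdown emphasis markers
        "\n".join(lines[1:]),        # drop a leading preamble line
        "\n".join(lines[:-1]),       # drop a trailing remark line
    ]
    return [c for c in candidates if c]

def inst_level_loose_accuracy(examples):
    """examples: list of (response, [verifier_fn, ...]) pairs.
    A verifier_fn takes a string and returns True if its instruction holds.
    An instruction counts as followed if ANY loose variant passes."""
    followed = total = 0
    for response, verifiers in examples:
        for verify in verifiers:
            total += 1
            if any(verify(v) for v in loose_variants(response)):
                followed += 1
    return followed / total if total else 0.0

# Example: one prompt with two verifiable instructions.
examples = [
    ("**Sure!**\nparis is the capital of france",
     [lambda r: r == r.lower(),          # "answer in all lowercase"
      lambda r: len(r.split()) >= 5]),   # "use at least 5 words"
]
print(f"{inst_level_loose_accuracy(examples):.2%}")
```

A strict variant would check only the raw response, and a prompt-level variant would credit a response only when every instruction attached to its prompt passes; instruction-level loose accuracy is therefore the most permissive of these scores.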