SOTAVerified

Instruction Following

Instruction following is a foundational capability of large language models. This task evaluates how well a model follows human instructions, with the goal of producing controllable and safe responses.

Papers

Showing 351–375 of 1135 papers

Title | Status | Hype
On the Exploitability of Instruction Tuning | Code | 1
Improving Translation Faithfulness of Large Language Models via Augmenting Instructions | Code | 1
ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning | Code | 1
Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models | Code | 1
DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following | Code | 1
Instruct and Extract: Instruction Tuning for On-Demand Information Extraction | Code | 1
Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping | Code | 1
CB2: Collaborative Natural Language Interaction Research Platform | Code | 1
Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning | Code | 1
Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems | Code | 1
NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models | Code | 1
OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks | Code | 1
On the Multi-turn Instruction Following for Conversational Web Agents | Code | 1
Engineering flexible machine learning systems by traversing functionally-invariant paths | Code | 1
Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection | Code | 1
Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning | Code | 1
Are Emergent Abilities in Large Language Models just In-Context Learning? | Code | 1
Hybrid Alignment Training for Large Language Models | Code | 1
MergeBench: A Benchmark for Merging Domain-Specialized LLMs | Code | 1
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis | Code | 1
Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight | Code | 1
ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments | Code | 1
Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models | Code | 1
Page 15 of 46

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | AutoIF (Llama3 70B) | Inst-level loose-accuracy | 90.4 | – | Unverified
2 | AutoIF (Qwen2 72B) | Inst-level loose-accuracy | 88 | – | Unverified
3 | GPT-4 | Inst-level loose-accuracy | 85.37 | – | Unverified
4 | PaLM 2 S | Inst-level loose-accuracy | 59.11 | – | Unverified
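
The metric reported above, Inst-level loose-accuracy, comes from IFEval-style evaluation: each prompt carries one or more programmatically verifiable instructions, and an instruction counts as followed if the response, or a lightly relaxed variant of it, passes that instruction's checker. The sketch below illustrates the computation; the relaxations in loose_variants, the checker lambdas, and the example data are illustrative assumptions, not the benchmark's actual verifiers.

```python
# Minimal sketch of instruction-level loose accuracy (IFEval-style).
# The relaxations, checkers, and data below are illustrative assumptions.
from typing import Callable, List


def loose_variants(response: str) -> List[str]:
    """Relaxed views of a response; 'loose' credit is given if any passes."""
    lines = response.split("\n")
    variants = [
        response,
        response.replace("*", ""),   # strip markdown emphasis markers
        "\n".join(lines[1:]),        # drop a leading preamble line
        "\n".join(lines[:-1]),       # drop a trailing sign-off line
    ]
    return [v for v in variants if v]


def inst_level_loose_accuracy(
    responses: List[str],
    checks: List[List[Callable[[str], bool]]],
) -> float:
    """Fraction of individual instructions followed across all prompts."""
    followed, total = 0, 0
    for response, instruction_checks in zip(responses, checks):
        for check in instruction_checks:
            total += 1
            if any(check(v) for v in loose_variants(response)):
                followed += 1
    return followed / total if total else 0.0


# Hypothetical prompt with two verifiable instructions:
# "answer in all caps" and "use at most 10 words".
resp = "Sure! Here it is:\nHELLO WORLD"
checks = [[lambda r: r.isupper(), lambda r: len(r.split()) <= 10]]
print(f"Inst-level loose accuracy: {inst_level_loose_accuracy([resp], checks):.2f}")
```

The loose relaxations matter in practice: a response wrapped in a conversational preamble or markdown can satisfy an instruction even when the raw string fails the strict check, which is why loose accuracy typically runs a few points above its strict counterpart.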