Instruction Following

Instruction following is a core capability of large language models. This task evaluates how well a model follows human instructions, with the goal of producing controllable and safe responses.

Papers

Showing 1091–1100 of 1135 papers

Title	Date	Tasks	Status
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings	Mar 19, 2025	Instruction Following, Large Language Model	Code Available
CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships?	Feb 16, 2025	Instruction Following	Code Available
From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models	Oct 11, 2023	In-Context Learning, Instruction Following	Code Available
FMDLlama: Financial Misinformation Detection based on Large Language Models	Sep 24, 2024	Explanation Generation, Instruction Following	Code Available
X-Shot: A Unified System to Handle Frequent, Few-shot and Zero-shot Learning Simultaneously in Classification	Mar 6, 2024	Domain Generalization, Instruction Following	Code Available
Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning	Jun 12, 2025	Instruction Following, Mathematical Reasoning	Code Available
Mitigating the Bias of Large Language Model Evaluation	Sep 25, 2024	Instruction Following, Language Model Evaluation	Code Available
Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages	Jun 5, 2024	Instruction Following, Retrieval	Code Available
Self-Judge: Selective Instruction Following with Alignment Self-Evaluation	Sep 2, 2024	Instruction Following, Semantic Similarity	Code Available
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning	Apr 1, 2024	Image Captioning, Instruction Following	Code Available

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	AutoIF (Llama3 70B)	Inst-level loose-accuracy	90.4	—	Unverified
2	AutoIF (Qwen2 72B)	Inst-level loose-accuracy	88	—	Unverified
3	GPT-4	Inst-level loose-accuracy	85.37	—	Unverified
4	PaLM 2 S	Inst-level loose-accuracy	59.11	—	Unverified
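
The page does not define the "Inst-level loose-accuracy" metric. As a rough sketch, assuming it follows the common convention in which every instruction in every prompt is scored separately and a response counts as following an instruction if any lightly relaxed variant of it passes the check, the computation could look like the code below. The function names, the particular relaxations, and the sample data are all hypothetical and do not come from any benchmark's official implementation.

```python
from typing import Callable, List, Tuple

# A checker returns True when a response satisfies one instruction.
Checker = Callable[[str], bool]


def loose_variants(response: str) -> List[str]:
    """Return the response plus relaxed versions of it (assumed relaxations)."""
    lines = response.split("\n")
    no_first = "\n".join(lines[1:]).strip()   # drop a possible preamble line
    no_last = "\n".join(lines[:-1]).strip()   # drop a possible closing line
    no_stars = response.replace("*", "")      # drop markdown emphasis markers
    candidates = [response, no_stars, no_first, no_last]
    return [c for c in candidates if c]       # ignore empty variants


def followed_loosely(response: str, checker: Checker) -> bool:
    """An instruction counts as followed if any variant passes its checker."""
    return any(checker(v) for v in loose_variants(response))


def inst_level_loose_accuracy(samples: List[Tuple[str, List[Checker]]]) -> float:
    """Percentage of individual instructions followed, pooled over all prompts."""
    total = 0
    followed = 0
    for response, checkers in samples:
        for checker in checkers:
            total += 1
            followed += followed_loosely(response, checker)
    return 100.0 * followed / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: two responses with two instructions each.
    samples = [
        ("Sure, here it is:\nhello world",
         [lambda r: r.startswith("hello"), lambda r: "world" in r]),
        ("**DONE**",
         [lambda r: r == "DONE", lambda r: len(r) > 20]),
    ]
    print(inst_level_loose_accuracy(samples))
```

Pooling over instructions rather than prompts means a response that satisfies two of its three instructions still earns partial credit, which is what distinguishes an instruction-level score from a prompt-level one under this assumed definition.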