SOTAVerified

Instruction Following

Instruction following is a core capability of large language models. This task evaluates how faithfully a model follows human instructions, with the goal of producing controllable and safe responses.

Papers

Showing 526–550 of 1135 papers

| Title | Status | Hype |
| --- | --- | --- |
| UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation |  | 0 |
| Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation |  | 0 |
| Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? | Code | 0 |
| Efficient Telecom Specific LLM: TSLAM-Mini with QLoRA and Digital Twin Data |  | 0 |
| Assessing Robustness to Spurious Correlations in Post-Training Language Models |  | 0 |
| T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models |  | 0 |
| Incentivizing Inclusive Contributions in Model Sharing Markets |  | 0 |
| PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents |  | 0 |
| T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation |  | 0 |
| UAV-VLN: End-to-End Vision Language guided Navigation for UAVs |  | 0 |
| Ask, Fail, Repeat: Meeseeks, an Iterative Feedback Benchmark for LLMs' Multi-turn Instruction-Following Ability |  | 0 |
| TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models | Code | 0 |
| CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks |  | 0 |
| Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs |  | 0 |
| ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance |  | 0 |
| ParamΔ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost |  | 0 |
| Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code |  | 0 |
| DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models |  | 0 |
| Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators | Code | 0 |
| Improving Instruct Models for Free: A Study on Partial Adaptation |  | 0 |
| SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning |  | 0 |
| Playpen: An Environment for Exploring Learning Through Conversational Interaction | Code | 0 |
| VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding |  | 0 |
| Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models |  | 0 |
| Holistic Capability Preservation: Towards Compact Yet Comprehensive Reasoning Models |  | 0 |
Page 22 of 46

Benchmark Results

| # | Model | Metric | Claimed (%) | Verified (%) | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | AutoIF (Llama3 70B) | Inst-level loose-accuracy | 90.4 |  | Unverified |
| 2 | AutoIF (Qwen2 72B) | Inst-level loose-accuracy | 88 |  | Unverified |
| 3 | GPT-4 | Inst-level loose-accuracy | 85.37 |  | Unverified |
| 4 | PaLM 2 S | Inst-level loose-accuracy | 59.11 |  | Unverified |
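
The "Inst-level loose-accuracy" metric above follows the IFEval convention: each prompt carries one or more automatically verifiable instructions, instruction-level accuracy is the fraction of individual instructions satisfied across all prompts, and the "loose" variant re-runs each check on relaxed transformations of the response (for example, with markdown markers or a boilerplate first or last line removed). The sketch below is a minimal illustration of that counting scheme; the verifier kinds, relaxation rules, and data layout are assumptions for illustration, not the benchmark's actual code.

```python
# Minimal sketch of IFEval-style instruction-level loose accuracy.
# The verifier set, relaxation rules, and data layout are illustrative
# assumptions, not the benchmark's actual implementation.

def loose_variants(response: str) -> list[str]:
    """Relaxed views of a response; the loose check passes if ANY variant passes."""
    lines = response.split("\n")
    variants = [
        response,
        response.replace("*", ""),   # strip markdown emphasis markers
        "\n".join(lines[1:]),        # drop a possible preamble line
        "\n".join(lines[:-1]),       # drop a possible sign-off line
    ]
    return [v.strip() for v in variants]

def check_instruction(instruction: dict, text: str) -> bool:
    """Verify a single instruction against one response variant (toy verifiers)."""
    kind = instruction["kind"]
    if kind == "min_words":
        return len(text.split()) >= instruction["threshold"]
    if kind == "no_commas":
        return "," not in text
    if kind == "ends_with":
        return text.endswith(instruction["suffix"])
    raise ValueError(f"unknown instruction kind: {kind}")

def inst_level_loose_accuracy(examples: list[dict]) -> float:
    """Fraction of individual instructions satisfied by at least one loose variant."""
    passed = total = 0
    for ex in examples:
        variants = loose_variants(ex["response"])
        for instruction in ex["instructions"]:
            total += 1
            if any(check_instruction(instruction, v) for v in variants):
                passed += 1
    return passed / total if total else 0.0

if __name__ == "__main__":
    examples = [
        {
            "response": "Sure, here you go:\nRoses are red. Violets are blue.",
            "instructions": [
                {"kind": "min_words", "threshold": 5},
                # Fails strictly (comma in the preamble), passes loosely
                # once the first line is dropped.
                {"kind": "no_commas"},
            ],
        },
    ]
    print(f"Inst-level loose accuracy: {inst_level_loose_accuracy(examples):.1%}")
```

Prompt-level accuracy, by contrast, credits a prompt only when all of its instructions pass, so it is always at or below the instruction-level number reported here.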