SOTAVerified

Instruction Following

Instruction following is a fundamental capability of large language models. This task evaluates how well a model follows human instructions, with the goal of generating controllable and safe responses.

Papers

Showing 301–350 of 1135 papers

Title | Status | Hype
Instruction Position Matters in Sequence Generation with Large Language Models | Code | 1
InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction | Code | 1
AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data | Code | 1
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness | Code | 1
Mosaic-IT: Free Compositional Data Augmentation Improves Instruction Tuning | Code | 1
NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models | Code | 1
Instruct and Extract: Instruction Tuning for On-Demand Information Extraction | Code | 1
Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement | Code | 1
MoDS: Model-oriented Data Selection for Instruction Tuning | Code | 1
Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages | Code | 1
Factorizing Perception and Policy for Interactive Instruction Following | Code | 1
Monte Carlo Thought Search: Large Language Model Querying for Complex Scientific Reasoning in Catalyst Design | Code | 1
InfMLLM: A Unified Framework for Visual-Language Tasks | Code | 1
Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following | Code | 1
Instruction-Following Agents with Multimodal Transformer | Code | 1
AceGPT, Localizing Large Language Models in Arabic | Code | 1
ChatGPT may Pass the Bar Exam soon, but has a Long Way to Go for the LexGLUE benchmark | Code | 1
Improving Translation Faithfulness of Large Language Models via Augmenting Instructions | Code | 1
EventHallusion: Diagnosing Event Hallucinations in Video LLMs | Code | 1
Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models | Code | 1
Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach | Code | 1
MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios | Code | 1
Ex3: Automatic Novel Writing by Extracting, Excelsior and Expanding | Code | 1
Agri-LLaVA: Knowledge-Infused Large Multimodal Assistant on Agricultural Pests and Diseases | Code | 1
Infer Human's Intentions Before Following Natural Language Instructions | Code | 1
Inferring Rewards from Language in Context | Code | 1
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis | Code | 1
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Code | 1
MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation | Code | 1
Evaluating LLMs at Detecting Errors in LLM Responses | Code | 1
Evaluating Large Language Models at Evaluating Instruction Following | Code | 1
IHEval: Evaluating Language Models on Following the Instruction Hierarchy | Code | 1
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | Code | 1
Facial Affective Behavior Analysis with Instruction Tuning | Code | 1
Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering | Code | 1
Hybrid Alignment Training for Large Language Models | Code | 1
FaithScore: Fine-grained Evaluations of Hallucinations in Large Vision-Language Models | Code | 1
Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction | Code | 1
MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning | Code | 1
InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | Code | 1
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs | Code | 1
Instruction-Guided Visual Masking | Code | 1
Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following | Code | 1
Alexa Arena: A User-Centric Interactive Platform for Embodied AI | Code | 1
ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning | Code | 1
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists | Code | 1
M-IFEval: Multilingual Instruction-Following Evaluation | Code | 1
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Code | 1
DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation | Code | 1
Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping | Code | 1
Page 7 of 23

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | AutoIF (Llama3 70B) | Inst-level loose-accuracy | 90.4 | | Unverified
2 | AutoIF (Qwen2 72B) | Inst-level loose-accuracy | 88.0 | | Unverified
3 | GPT-4 | Inst-level loose-accuracy | 85.37 | | Unverified
4 | PaLM 2 S | Inst-level loose-accuracy | 59.11 | | Unverified
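The "Inst-level" metric above aggregates over individual instructions rather than whole prompts: a prompt that bundles several instructions contributes one pass/fail flag per instruction. A minimal sketch of that aggregation, assuming per-instruction boolean judgments are already available (the function name and input layout here are illustrative, not the benchmark's actual API):

```python
def inst_level_accuracy(results):
    """Instruction-level accuracy in percent.

    results: list of lists of booleans, one inner list per prompt,
    one boolean per instruction (True = instruction was followed).
    """
    # Flatten so every instruction counts equally, regardless of
    # how many instructions its prompt contained.
    flags = [ok for prompt in results for ok in prompt]
    return 100.0 * sum(flags) / len(flags) if flags else 0.0

# Example: 3 prompts with 2, 3, and 1 instructions respectively;
# 4 of the 6 instructions were followed.
score = inst_level_accuracy([[True, True], [True, False, True], [False]])
```

The "loose" variant typically refers to lenient response parsing before judging each instruction (e.g. stripping markdown or boilerplate), which affects the booleans fed in, not the aggregation itself.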