SOTAVerified

Instruction Following

Instruction following is a foundational capability of large language models. This task evaluates how well a model follows human instructions, with the goal of generating controllable and safe responses.

Papers

Showing 201–225 of 1135 papers

Title | Status | Hype
LITA: Language Instructed Temporal-Localization Assistant | Code | 2
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment | Code | 2
BLSP-Emo: Towards Empathetic Large Speech-Language Models | Code | 2
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning | Code | 2
Learning to Decode Collaboratively with Multiple Language Models | Code | 2
DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data | Code | 2
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning | Code | 2
TESS 2: A Large-Scale Generalist Diffusion Language Model | Code | 2
SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding | Code | 2
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning | Code | 1
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding | Code | 1
A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following | Code | 1
Generation-driven Contrastive Self-training for Zero-shot Text Classification with Instruction-following LLM | Code | 1
A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models | Code | 1
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models | Code | 1
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Code | 1
Instruction Following without Instruction Tuning | Code | 1
Large Language Models as Evaluators for Recommendation Explanations | Code | 1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1
A modular vision language navigation and manipulation framework for long horizon compositional tasks in indoor environment | Code | 1
AlpaGasus: Training A Better Alpaca with Fewer Data | Code | 1
DANLI: Deliberative Agent for Following Natural Language Instructions | Code | 1
AlpaCare: Instruction-tuned Large Language Models for Medical Application | Code | 1
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing | Code | 1
FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion | Code | 1
Page 9 of 46

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | AutoIF (Llama3 70B) | Inst-level loose-accuracy | 90.4 | | Unverified
2 | AutoIF (Qwen2 72B) | Inst-level loose-accuracy | 88 | | Unverified
3 | GPT-4 | Inst-level loose-accuracy | 85.37 | | Unverified
4 | PaLM 2 S | Inst-level loose-accuracy | 59.11 | | Unverified
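The metric above, instruction-level loose accuracy, counts each verifiable instruction separately and credits a response if it, or a relaxed variant of it (e.g. with a leading or trailing boilerplate line removed), satisfies the instruction's checker. The sketch below is a simplified, hypothetical illustration of that computation; the checker functions and the exact set of loose variants are assumptions, not the benchmark's actual implementation.

```python
# Simplified sketch of instruction-level "loose" accuracy.
# Each response is paired with the checkers for the instructions it carries;
# loose mode also accepts relaxed variants of the response (assumption here:
# stripped whitespace, first line dropped, last line dropped).

def loose_variants(response: str) -> list[str]:
    """Return the response plus relaxed variants of it (simplified)."""
    lines = response.strip().splitlines()
    variants = [response, response.strip()]
    if len(lines) > 1:
        variants.append("\n".join(lines[1:]))   # drop a leading preamble line
        variants.append("\n".join(lines[:-1]))  # drop a trailing line
    return variants

def inst_level_loose_accuracy(results: list[tuple[str, list]]) -> float:
    """results: (response, [checker functions]) pairs.

    A checker takes a response string and returns True if the
    corresponding instruction is satisfied.
    """
    followed = total = 0
    for response, checkers in results:
        for check in checkers:
            total += 1
            if any(check(v) for v in loose_variants(response)):
                followed += 1
    return followed / total if total else 0.0

# Toy example: one response checked against two instructions.
resp = "Sure, here you go:\nHELLO WORLD"
checkers = [
    lambda r: r.isupper(),          # "answer in all caps"
    lambda r: len(r.split()) >= 2,  # "use at least two words"
]
acc = inst_level_loose_accuracy([(resp, checkers)])  # 1.0: both pass loosely
```

In this toy case the all-caps check fails on the full response but passes once the preamble line is dropped, which is exactly the gap between strict and loose scoring.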