
VGSI

Given a textual goal and multiple images representing candidate events, a model must choose the one image that constitutes a reasonable step towards the given goal. A model should correctly recognize not only the specific action illustrated in an image (e.g., “turning on the oven”), but also the intent of the action (“baking fish”).
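The multiple-choice setup described above can be illustrated with a minimal sketch. It assumes the goal text and each candidate image have already been embedded into a shared vector space by some model, which is not specified here. The names `choose_step_image` and `accuracy`, the cosine-similarity scoring, and the toy vectors are hypothetical illustrations, not the method of any listed paper.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def choose_step_image(goal_vec: Sequence[float],
                      candidate_vecs: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate image most similar to the goal.

    Assumption: similarity in the embedding space stands in for
    "is a reasonable step towards the goal".
    """
    scores = [cosine(goal_vec, c) for c in candidate_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of examples where the chosen image index matches the gold index."""
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy example: made-up 3-d embeddings for one goal and four candidate images.
goal = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.8],   # closest in direction to the goal vector
    [-1.0, 0.0, 0.0],
    [0.2, 0.9, 0.1],
]
choice = choose_step_image(goal, candidates)
```

In this toy case `choice` is the index of the second candidate. A real system would replace the hand-written vectors with outputs of text and image encoders, and would report `accuracy` over a held-out set of goal and candidate groups.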

Papers



Title	Date	Tasks	Status	Hype
Visual Goal-Step Inference using wikiHow	Apr 12, 2021	Multimodal Reasoning, VGSI	Code Available	0


No leaderboard results yet.