SOTAVerified

Temporal Localization

Papers

Showing 1-25 of 153 papers

Title | Status | Hype
Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements | | 0
VideoMolmo: Spatio-Temporal Grounding Meets Pointing | Code | 2
DisTime: Distribution-based Time Representation for Video Large Language Models | Code | 1
Transforming faces into video stories -- VideoFace2.0 | Code | 0
MINERVA: Evaluating Complex Video Reasoning | Code | 2
Hierarchical and Multimodal Data for Daily Activity Understanding | Code | 0
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | | 0
A Large-Language Model Framework for Relative Timeline Extraction from PubMed Case Reports | | 0
Crash Time Matters: HybridMamba for Fine-Grained Temporal Localization in Traffic Surveillance Footage | | 0
SocialGesture: Delving into Multi-person Gesture Understanding | | 0
ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset | Code | 0
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation | Code | 2
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | Code | 3
Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds | Code | 0
Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding | | 0
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization | | 0
Towards Fine-Grained Video Question Answering | | 0
TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos | Code | 1
Weakly Supervised Multiple Instance Learning for Whale Call Detection and Temporal Localization in Long-Duration Passive Acoustic Monitoring | Code | 0
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding | Code | 1
Fusion of Millimeter-wave Radar and Pulse Oximeter Data for Low-burden Diagnosis of Obstructive Sleep Apnea-Hypopnea Syndrome | | 0
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | Code | 2
Pseudo Strong Labels from Frame-Level Predictions for Weakly Supervised Sound Event Detection | | 0
Do Current Video LLMs Have Strong OCR Abilities? A Preliminary Study | Code | 0
ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries | | 0

No leaderboard results yet.