The current system (top) does not learn from user feedback, so the same errors reappear and require repeated corrections (e.g., 14 prompts to add the missing cards), resulting in substantial annotation time (e.g., 5.62 min). In contrast, our LIT-LoRA method continually adapts to user corrections and generalizes to similar future errors, reducing the number of required corrections (e.g., down to 4) and the annotation time (e.g., down to 3.18 min).
Interactive video segmentation often requires many user interventions to achieve robust performance in challenging scenarios (e.g., occlusions, object separations, and camouflage). Yet even state-of-the-art models such as SAM2 use corrections only as immediate fixes, without learning from this feedback, leading to inefficient, repetitive user effort.
To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems in which models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on the fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an 18–34% reduction in total corrections, on average, on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5 s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks.
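To make the on-the-fly update concrete, the following is a minimal PyTorch-style sketch, not the paper's actual implementation: a LoRALinear wrapper adds a trainable low-rank term to a frozen layer, and train_on_correction briefly fits only the LoRA parameters to a single user correction. All names, the loss choice, and the hyperparameters (r, alpha, steps, lr) are illustrative assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                    # backbone weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

def train_on_correction(model, frame, correction_mask, steps=10, lr=1e-3):
    """Briefly fit only the LoRA parameters to one user correction (hypothetical loss/API)."""
    lora_params = [p for p in model.parameters() if p.requires_grad]   # LoRA params only
    opt = torch.optim.AdamW(lora_params, lr=lr)
    for _ in range(steps):
        logits = model(frame)                                          # predicted mask logits
        loss = nn.functional.binary_cross_entropy_with_logits(logits, correction_mask)
        opt.zero_grad()
        loss.backward()
        opt.step()

Because only the low-rank parameters receive gradients, each correction costs just a handful of optimizer steps, which is consistent with the ~0.5 s per-correction overhead reported above.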
User interaction patterns and their impact across datasets. (a) The number of user corrections follows a clear long-tailed distribution: a small fraction of challenging videos accounts for the majority of interactions. (b) Challenging cases (≥ 10 corrections) require substantially more user inputs than the dataset average. (c) User feedback consistently improves segmentation performance, especially on the challenging subset. (d) Corrections are not confined to the start of a video; prompts keep appearing from the early through the late portions of each sequence, indicating that errors recur over time.
Intuition: User corrections matter for improving performance, especially on the challenging subset. However, errors repeat frequently, which is precisely why the model should learn from them.
Left: Overview of the LIT-LoRA framework on VOS. As the video progresses, segmentation errors may arise. When the user provides a correction (which can be time-consuming), it is used to train a LoRA module on the fly. The LoRA module is then consulted on later errors: if its prediction meets the validation criterion, it is accepted as the correction; otherwise, the adapter is further refined with the latest user correction. Right: Illustration of the LIT-LoRA module.
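The accept-or-refine logic described in the left panel can be summarized by the sketch below. Here validate stands in for the validation criterion (whose exact form is not specified in this text), ask_user stands in for the human-in-the-loop prompt, and train_on_correction is the illustrative update routine sketched earlier; all names are hypothetical.

def handle_error(adapted_model, frame, ask_user, validate):
    """On a detected error, consult the LoRA-adapted model before asking the user."""
    pred = adapted_model(frame)
    if validate(pred):                  # adapter output passes the validation criterion
        return pred                     # accept it as the correction; the user is not interrupted
    correction = ask_user(frame)        # otherwise fall back to a (time-consuming) user correction
    train_on_correction(adapted_model, frame, correction)  # refine the adapter on the latest feedback
    return correction

The key design choice this captures is that the adapter is queried first, so user effort is spent only on errors the adapter cannot yet handle, and each such failure becomes new training signal.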
LIT-LoRA notably reduces both the number of user corrections and the total annotation time across challenging videos.
LIT-LoRA consistently achieves higher performance than the baseline under the same number of corrections, validating the advantage of our approach.