A Summarize-then-Search Method for Long Video Question Answering: Experiment Details

26 May 2024

This paper is available on arxiv under a CC BY 4.0 license.


(1) Jiwan Chung, MIR Lab Yonsei University (https://jiwanchung.github.io/);

(2) Youngjae Yu, MIR Lab Yonsei University (https://jiwanchung.github.io/).

A. Experiment Details

Computational Budget. Long Story Short uses GPT-3 (175B parameters) via the OpenAI API as the backbone. An average prompt to summarize a video segment processes ∼3000 tokens, while a QA prompt usually takes ∼4000 tokens. For CLIPCheck, we extract CLIP features and compute cosine similarities on a single NVIDIA A6000 GPU; processing the video frames of the MovieQA validation split takes about 0.5 hours.
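The similarity computation above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random arrays stand in for CLIP image and text embeddings (in practice these would come from the Huggingface CLIP model named later in this appendix), and the 512-dimensional feature size matches the base CLIP variant.

```python
import numpy as np

def cosine_similarity(frame_feats, text_feats):
    # L2-normalize each embedding, then take dot products so that
    # entry [i, j] is the cosine similarity of frame i and text j.
    f = frame_feats / np.linalg.norm(frame_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return f @ t.T

# Placeholder arrays standing in for precomputed CLIP embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 512))   # 4 video frames
texts = rng.normal(size=(2, 512))    # 2 candidate answer sentences
sims = cosine_similarity(frames, texts)  # shape (4, 2), values in [-1, 1]
```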

Hyperparameters. All hyperparameters are determined in advance by analyzing a single training sample. For narrative search, we use a sentence-similarity threshold of α ≥ 0.5 to find plot pieces when GPT-3 does not output a single index. We use a binary entropy threshold of E′ ≥ 0.4 in CLIPCheck. We run each experiment only once, as our method is deterministic and not susceptible to randomness in initialization.
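A minimal sketch of how the binary entropy threshold might gate CLIPCheck. The gating rule and the helper `needs_clipcheck` are illustrative assumptions; the source only specifies the threshold value E′ ≥ 0.4.

```python
import math

E_PRIME = 0.4  # binary entropy threshold from the paper

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p); defined as 0 at p in {0, 1}."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def needs_clipcheck(p_top):
    # Hypothetical gating rule: apply the visual CLIPCheck only when
    # the answer distribution is uncertain, i.e. entropy >= E_PRIME.
    return binary_entropy(p_top) >= E_PRIME

needs_clipcheck(0.95)  # confident answer -> skip the visual check
needs_clipcheck(0.60)  # uncertain answer -> run CLIPCheck
```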

Video Segmentation Scheme. All datasets used in this paper provide predefined segment boundary annotations. Since we summarize each clip segmented with these predefined boundaries, every plot piece has an aligned clip segment. Before applying LSS, we filter out clip segments that (1) are too short, (2) have no aligned image frame, or (3) have no text context, ensuring that every remaining clip segment can be retrieved via its plot summary.
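The three filtering rules can be sketched as a single predicate. The `clip` field names and the 3-second minimum duration are illustrative assumptions, not values given in the paper.

```python
def keep_clip(clip, min_duration=3.0):
    """Return True if a clip segment passes all three filters.

    `clip` is a hypothetical dict with keys "start"/"end" (seconds),
    "frames" (list of aligned images), and "text" (subtitles or
    descriptions); the 3-second minimum is an illustrative choice.
    """
    long_enough = (clip["end"] - clip["start"]) >= min_duration  # rule (1)
    has_frames = len(clip["frames"]) > 0                         # rule (2)
    has_text = bool(clip["text"].strip())                        # rule (3)
    return long_enough and has_frames and has_text

clips = [
    {"start": 0.0, "end": 10.0, "frames": ["f0"], "text": "Some dialogue."},
    {"start": 10.0, "end": 11.0, "frames": ["f1"], "text": "Too short."},
    {"start": 11.0, "end": 30.0, "frames": [], "text": "No aligned frame."},
]
kept = [c for c in clips if keep_clip(c)]  # only the first clip survives
```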

External Libraries. We use the OpenAI API to access the GPT-3 language model. The CLIP features are computed with the Huggingface implementation (https://huggingface.co/docs/transformers/main/en/model_doc/clip).