"Kurosawa": A Script Writer's Assistant: Experiments and Evaluation

23 May 2024

Authors:

(1) Prerak Gandhi, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, prerakgandhi@cse.iitb.ac.in (the first two authors contributed equally to this work);

(2) Vishal Pramanik, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, vishalpramanik,pb@cse.iitb.ac.in (the first two authors contributed equally to this work);

(3) Pushpak Bhattacharyya, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai.

5. Experiments and Evaluation

We fine-tune GPT-3 on our datasets (see Appendix A.6).

5.1. Plot Generation

We created five models by fine-tuning GPT-3 on our movie plot dataset. The models differ in their inputs and outputs as follows:

(i) Original, without annotation (O): the input is a short storyline and the output is a plot without any annotations;

(ii) Annotation and short input (AS): the input is a short storyline and the output is a plot annotated with the 4-act structure;

(iii) Annotation and long input (AL): the input is a long, more descriptive storyline and the output is a plot annotated with the 4-act structure;

(iv) Annotation and short input with genres included (ASG): the input is a short storyline together with the genre and the output is a plot annotated with the 4-act structure;

(v) Annotation and long input with genres included (ALG): the input is a long, more descriptive storyline together with the genre and the output is a plot annotated with the 4-act structure.
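To make the five configurations concrete, the following is a minimal sketch of how one training record per variant could be assembled. It is an illustration only: the field names (short_storyline, long_storyline, genre, plain_plot, annotated_plot), the "Genre:" prefix, and the prompt/completion record layout are assumptions and are not taken from the paper.

```python
# Hypothetical table describing the five variants (O, AS, AL, ASG, ALG).
# Each entry says whether the input is the long storyline, whether the
# output plot carries 4-act annotations, and whether the genre is included.
VARIANTS = {
    "O":   {"long_input": False, "annotated": False, "genre": False},
    "AS":  {"long_input": False, "annotated": True,  "genre": False},
    "AL":  {"long_input": True,  "annotated": True,  "genre": False},
    "ASG": {"long_input": False, "annotated": True,  "genre": True},
    "ALG": {"long_input": True,  "annotated": True,  "genre": True},
}


def build_record(variant, example):
    """Build one prompt/completion pair for the given variant.

    `example` is a dict with hypothetical keys: short_storyline,
    long_storyline, genre, plain_plot, annotated_plot.
    """
    cfg = VARIANTS[variant]
    key = "long_storyline" if cfg["long_input"] else "short_storyline"
    prompt = example[key]
    if cfg["genre"]:
        # Assumed formatting: the genre is prepended on its own line.
        prompt = f"Genre: {example['genre']}\n{prompt}"
    completion = example["annotated_plot"] if cfg["annotated"] else example["plain_plot"]
    return {"prompt": prompt, "completion": completion}
```

Keeping the variants in one table makes the differences between the models explicit: only the input length, the presence of annotations, and the genre change from one model to the next.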

For automatic evaluation, we use BLEU (Papineni et al., 2002), perplexity (Jelinek et al., 1977), and ROUGE (Lin, 2004). We also conduct human evaluation using a five-point Likert scale (Likert, 1932), where 1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, and 5 = Strongly Agree. Human-written stories are assumed to have a rating of 5 on each of the following five features: (1) Fluency: grammatical correctness; (2) Coherence: logical ordering of sentences and paragraphs; (3) Relevance: whether the key points from the prompt are highlighted in the output; (4) Likability: how enjoyable the story is; (5) Creativity: whether the output introduces any new events, character profiles, or relationships.
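As an illustration of how the Likert ratings could be summarized, the sketch below averages the ratings per feature across evaluators. The paper does not state its aggregation method, so the use of a simple mean, the function name mean_ratings, and the lowercase feature keys are all assumptions.

```python
from statistics import mean

# The five features rated on the 1-5 Likert scale.
FEATURES = ("fluency", "coherence", "relevance", "likability", "creativity")


def mean_ratings(ratings):
    """Average Likert ratings per feature (assumed aggregation: simple mean).

    `ratings` is a non-empty list of dicts, one per evaluator and story,
    each mapping every feature name to an integer from 1 to 5.
    """
    for rating in ratings:
        for feature in FEATURES:
            if not 1 <= rating[feature] <= 5:
                raise ValueError(f"{feature} rating out of range: {rating[feature]}")
    return {feature: mean(r[feature] for r in ratings) for feature in FEATURES}
```

Under this scheme, a feature's mean can be read against the assumed human-written reference score of 5.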

For plot generation, we generate 50 plots from 50 test prompts. We divide the generated plots into five groups of 10 and assign three evaluators to each group.
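The grouping for the plot evaluation can be sketched as follows. The paper does not say whether evaluators are shared between groups; this sketch assumes each group gets its own three evaluators (15 in total), and the function name assign_groups and the integer identifiers are hypothetical.

```python
def assign_groups(num_plots=50, group_size=10, evaluators_per_group=3):
    """Split plot ids into consecutive groups and give each group its own evaluators.

    Assumption: evaluators are not shared between groups.
    """
    plot_ids = list(range(num_plots))
    groups = [plot_ids[i:i + group_size] for i in range(0, num_plots, group_size)]
    assignment = []
    next_evaluator = 0
    for plots in groups:
        evaluators = list(range(next_evaluator, next_evaluator + evaluators_per_group))
        next_evaluator += evaluators_per_group
        assignment.append({"plots": plots, "evaluators": evaluators})
    return assignment
```

With the default arguments, every plot is rated by three evaluators, which yields 150 individual plot ratings per feature.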

For scene generation, we generate ten scenes from ten test prompts and assign five evaluators to rate them.

This paper is available on arXiv under the CC 4.0 DEED license.