# V1 Teaching Video Production Workflow

Goal: turn a reference teaching video into an original MentorAI teaching video with a detailed storyboard, character design, generated visuals, edited video, and licensed narration.

## Voice Use Rule

Do not clone or modify a real teacher's YouTube voice unless we have clear permission from the speaker or rights holder.

Allowed uses of a reference video's audio:

- Transcribe it for internal analysis.
- Extract lesson structure, pacing, board usage, and teaching tactics.
- Rewrite the lesson in our own words.

Allowed narration outputs:

- Existing Edge-TTS voice: `zh-TW-YunJheNeural`.
- User-recorded voice with the user's consent.
- Licensed voice actor recording.
- Voice clone only when the speaker has explicitly authorized the clone and the intended use.

If authorization is expected but not yet collected, keep production in `tts` or `human_recording_pending` mode. Switch to `authorized_voice_clone` only after the consent record is stored with the project.
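
The mode rule above can be expressed as a small gate. This is a minimal sketch, not the project's implementation: `VoiceRightsState`, its field names, and `resolveNarrationMode()` are hypothetical, and only the three mode strings come from this document.

```typescript
// The three narration modes named in this document.
type NarrationMode = "tts" | "human_recording_pending" | "authorized_voice_clone";

// Hypothetical project state; the field names are illustrative, not the real schema.
interface VoiceRightsState {
  requestedMode: NarrationMode;
  consentPath?: string; // e.g. storage/voice-rights/<speaker-id>/consent.md
  consentStored: boolean; // true only once the record is saved with the project
}

// Returns the mode production may actually run in right now.
function resolveNarrationMode(
  state: VoiceRightsState,
  fallback: "tts" | "human_recording_pending" = "tts"
): NarrationMode {
  if (state.requestedMode !== "authorized_voice_clone") {
    return state.requestedMode;
  }
  const hasConsent = state.consentStored && !!state.consentPath;
  // No stored consent record yet: stay in a non-clone mode.
  return hasConsent ? "authorized_voice_clone" : fallback;
}
```

Keeping the check in one function means a missing consent record downgrades the mode instead of silently producing a clone.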

## Pipeline Overview

```text
Reference Video
  -> Download metadata, audio, and sample frames
  -> Whisper transcript
  -> Teaching analysis
  -> Original lesson script
  -> Storyboard
  -> Character bible
  -> Image/video generation prompts
  -> Narration generation or recording
  -> FFmpeg composition
  -> QA and release package
```

## Stage 1: Reference Ingestion

Inputs:

- YouTube URL.
- Target subject, grade, and lesson objective.
- Permission status for voice and likeness.

Outputs:

- `metadata.json`
- `audio.m4a`
- `video.mp4`
- `frames/contact_sheet.jpg`
- `transcript.txt`

Commands used in the current V1 research:

```bash
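yt-dlp --write-info-json --skip-download <youtube-url>  # metadata only, no media download
mkdir -p chunks  # ffmpeg does not create the output directory
ffmpeg -i audio.m4a -f segment -segment_time 300 -c copy chunks/chunk_%03d.m4a  # 300 s chunks, no re-encode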
```

For transcription, run Whisper through a hosted provider, authenticated with the provider's API key, and store only the working transcript locally.

## Stage 2: Teaching Analysis

Analyze the reference into reusable teaching decisions, not copied wording:

- Opening hook: why this lesson matters.
- Course level: for example, `國一國文` (seventh-grade Chinese).
- Reading axis: for example, `觀察力 + 想像力 = 物外之趣` (observation + imagination = delight beyond the ordinary world).
- Blackboard strategy: title, author, source, key vocabulary, sentence structure, rhetoric.
- Teacher behavior: conversational, exam-aware, checks student misconceptions.
- Pacing: first overview, then sentence-level explanation.

Output file:

- `docs/research/v1-reference-<video-id>-analysis.md`

## Stage 3: Original Script

Each teaching unit should have these fields:

```ts
interface TeachingSegment {
  id: number;
  title: string;
  sourceText: string;
  focus: string;
  boardNotes: string[];
  narration: string;
  visualPrompt: string;
}
```

Writing rules:

- `sourceText` can include public-domain original text.
- `boardNotes` must be short enough to read on a blackboard.
- `narration` should be spoken Mandarin, not textbook prose.
- Do not paste long transcript passages from the reference teacher.
- Every segment should answer: what did the author see, what did he imagine, and why does it matter?
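
Some of these rules can be checked mechanically before a script goes to storyboard. The sketch below is illustrative only: the three limits are hypothetical numbers (the rules say "short enough to read" and "no long passages" without giving thresholds), and `lintSegment()` and `sharesLongRun()` are not existing project functions.

```typescript
interface TeachingSegment {
  id: number;
  title: string;
  sourceText: string;
  focus: string;
  boardNotes: string[];
  narration: string;
  visualPrompt: string;
}

// Hypothetical limits; tune them against real blackboard renders.
const MAX_BOARD_NOTES = 4;
const MAX_NOTE_CHARS = 16;
const MAX_SHARED_RUN = 20;

// True if any run of `runLength` consecutive characters of `text` also appears in `reference`.
function sharesLongRun(text: string, reference: string, runLength: number): boolean {
  for (let i = 0; i + runLength <= text.length; i++) {
    if (reference.indexOf(text.slice(i, i + runLength)) !== -1) return true;
  }
  return false;
}

// Returns human-readable problems; an empty array means the segment passes.
function lintSegment(segment: TeachingSegment, referenceTranscript: string): string[] {
  const problems: string[] = [];
  if (segment.boardNotes.length > MAX_BOARD_NOTES) {
    problems.push(`segment ${segment.id}: too many board notes`);
  }
  for (const note of segment.boardNotes) {
    // .length counts UTF-16 code units, which equals characters for common CJK text.
    if (note.length > MAX_NOTE_CHARS) {
      problems.push(`segment ${segment.id}: board note too long: ${note}`);
    }
  }
  if (segment.narration.trim() === "") {
    problems.push(`segment ${segment.id}: narration is empty`);
  }
  if (sharesLongRun(segment.narration, referenceTranscript, MAX_SHARED_RUN)) {
    problems.push(`segment ${segment.id}: narration repeats a long reference passage`);
  }
  return problems;
}
```

The overlap check is deliberately crude. It will also flag narration that quotes `sourceText` if the reference teacher read the same passage aloud, so treat a hit as a prompt for human review, not an automatic rejection.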

## Stage 4: Storyboard

Storyboard table format:

| Segment | Time | Board Text | Visual Action | Teacher Action | Audio |
| --- | --- | --- | --- | --- | --- |
| 01 | 0:03-0:45 | 題解、出處、主軸 | Blackboard title and child observation visual | Teacher points at title and reading axis | Overview narration |
| 02 | 0:45-1:30 | 余、童稚、明察秋毫 | Sunlight and tiny leaf detail | Teacher underlines vocabulary | Word explanation |
| 03 | 1:30-2:20 | 夏蚊成雷、私擬 | Mosquitoes transform into cranes | Teacher contrasts reality and imagination | Rhetoric explanation |

For automated production, store storyboard data as JSON:

```json
{
  "lessonId": "childhood-wonder-v1",
  "visualStyle": "modern green-board Chinese class",
  "segments": [
    {
      "id": 1,
      "title": "題解與主軸",
      "durationPolicy": "match_narration_audio",
      "board": {
        "sourceText": "余憶童稚時，能張目對日，明察秋毫。",
        "notes": ["出處：《浮生六記》", "經：明察秋毫", "緯：物外之趣"]
      },
      "teacher": {
        "pose": "front three-quarter, pointing to board",
        "emotion": "warm, confident, patient"
      },
      "visual": {
        "prompt": "modern Taiwanese junior-high Chinese literature lesson, green chalkboard, teacher silhouette, child observing tiny leaf details"
      },
      "audio": {
        "mode": "tts",
        "voice": "zh-TW-YunJheNeural"
      }
    }
  ]
}
```
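
A Stage 3 `TeachingSegment` can be mapped onto this JSON shape with a small converter. This is a sketch under assumptions: the `StoryboardSegment` and `TeacherDirection` types are inferred from the example above, not taken from project code, and the audio block is hard-coded to the default TTS voice.

```typescript
interface TeachingSegment {
  id: number;
  title: string;
  sourceText: string;
  focus: string;
  boardNotes: string[];
  narration: string;
  visualPrompt: string;
}

// Shape inferred from the JSON example; not a confirmed project type.
interface StoryboardSegment {
  id: number;
  title: string;
  durationPolicy: "match_narration_audio";
  board: { sourceText: string; notes: string[] };
  teacher: { pose: string; emotion: string };
  visual: { prompt: string };
  audio: { mode: string; voice?: string };
}

interface TeacherDirection {
  pose: string;
  emotion: string;
}

// Copies script fields into the storyboard layout; teacher direction is supplied per segment.
function toStoryboardSegment(
  segment: TeachingSegment,
  teacher: TeacherDirection
): StoryboardSegment {
  return {
    id: segment.id,
    title: segment.title,
    durationPolicy: "match_narration_audio",
    board: { sourceText: segment.sourceText, notes: segment.boardNotes.slice() },
    teacher: { pose: teacher.pose, emotion: teacher.emotion },
    visual: { prompt: segment.visualPrompt },
    audio: { mode: "tts", voice: "zh-TW-YunJheNeural" },
  };
}
```

Serializing the result with `JSON.stringify(segments, null, 2)` produces entries for the `segments` array.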

## Stage 5: Character Design

Create a reusable character bible.

Teacher character:

- Role: modern Taiwanese junior-high Chinese literature teacher.
- Clothing: clean white shirt, simple trousers, optional lapel mic.
- Personality: patient, precise, slightly conversational, not theatrical.
- Expression: relaxed eyebrows, soft eyes, small reassuring smile.
- Gesture set: point to board, underline, open palm explanation, slight lean toward students.
- Boundary: not a real-person likeness unless authorized.

Student/visual world:

- Student is optional and symbolic.
- Use miniature-world visuals for `兒時記趣`: insects, grass, smoke, cranes, hills, valleys.
- Keep blackboard as the cognitive anchor; story visuals support the board instead of replacing it.

Prompt template:

```text
Modern Taiwanese junior-high Chinese literature lesson, green chalkboard classroom,
a kind male teacher in a neat white shirt, not a real-person likeness,
warm patient teaching expression, pointing to the board,
story visual: {scene},
clear educational composition, no readable text, no logos, no watermarks.
```
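
Filling the `{scene}` slot can be done with a small helper so every segment shares the same fixed wording. This is a minimal sketch; `buildVisualPrompt()` is a hypothetical name, and the template text is copied from the block above.

```typescript
const TEACHER_PROMPT_TEMPLATE = [
  "Modern Taiwanese junior-high Chinese literature lesson, green chalkboard classroom,",
  "a kind male teacher in a neat white shirt, not a real-person likeness,",
  "warm patient teaching expression, pointing to the board,",
  "story visual: {scene},",
  "clear educational composition, no readable text, no logos, no watermarks.",
].join("\n");

// Inserts a per-segment scene description into the shared character prompt.
function buildVisualPrompt(scene: string): string {
  const cleaned = scene.trim().replace(/\s+/g, " ");
  if (cleaned === "") {
    throw new Error("scene must not be empty");
  }
  // A replacer function avoids special handling of "$" sequences in the scene text.
  return TEACHER_PROMPT_TEMPLATE.replace("{scene}", () => cleaned);
}
```

For example, `buildVisualPrompt("mosquitoes transforming into cranes")` yields the full prompt with only the `story visual:` line changed, which helps keep the teacher persona consistent across images.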

## Stage 6: Visual Generation

For current V1:

- Use `generateWithKie()`.
- Model: `gpt-image-2-text-to-image`.
- Aspect ratio: `16:9`.
- Size: `1k`.
- Generated images are background/story assets.
- Final readable text should be rendered by our SVG overlay, not generated inside the image.

Output pattern:

```text
storage/childhood-wonder-video/
├── segment-01-illustration.png
├── segment-01-slide.png
├── segment-01-audio.mp3
└── childhood-wonder-full.mp4
```
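
The per-segment filenames follow one pattern, so they can be derived from the segment ID. This is a sketch: `segmentAssetPath()` is a hypothetical helper, and two-digit zero padding is an assumption read off the example names above.

```typescript
type SegmentAsset = "illustration.png" | "slide.png" | "audio.mp3";

// Zero-pads to two digits, matching names like segment-01-audio.mp3.
function segmentLabel(id: number): string {
  if (id < 1 || Math.floor(id) !== id) {
    throw new Error(`segment id must be a positive integer, got ${id}`);
  }
  return id < 10 ? `0${id}` : String(id);
}

// Builds the path of one asset for one segment inside the output directory.
function segmentAssetPath(outputDir: string, id: number, asset: SegmentAsset): string {
  const dir = outputDir.replace(/\/+$/, ""); // drop trailing slashes
  return `${dir}/segment-${segmentLabel(id)}-${asset}`;
}
```

Routing every stage through one path helper keeps image generation, narration, and composition agreed on where each file lives.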

## Stage 7: Voice And Narration

Default path:

```text
original narration text -> Edge-TTS -> segment MP3 -> measured duration
```

Recommended voice settings:

- Voice: `zh-TW-YunJheNeural`
- Rate: `-10%`
- Pitch: `+2Hz`
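
If narration is generated through the Python `edge-tts` command-line tool (an assumption; this document does not say which Edge-TTS client the project uses), the settings above map onto its `--voice`, `--rate`, `--pitch`, `--text`, and `--write-media` options. The helper name and settings type below are hypothetical.

```typescript
interface EdgeTtsSettings {
  voice: string;
  rate: string;
  pitch: string;
}

const RECOMMENDED_VOICE: EdgeTtsSettings = {
  voice: "zh-TW-YunJheNeural",
  rate: "-10%",
  pitch: "+2Hz",
};

// Builds the argument list for one segment; the caller decides how to spawn the process.
function buildEdgeTtsArgs(
  text: string,
  outputPath: string,
  settings: EdgeTtsSettings = RECOMMENDED_VOICE
): string[] {
  if (text.trim() === "") {
    throw new Error("narration text must not be empty");
  }
  return [
    "--voice", settings.voice,
    // The "=" form keeps values that start with "-" from being parsed as flags.
    `--rate=${settings.rate}`,
    `--pitch=${settings.pitch}`,
    `--text=${text}`,
    "--write-media", outputPath,
  ];
}
```

The array can be passed to Node's `child_process.execFile("edge-tts", args, callback)`, which avoids shell quoting issues with Chinese punctuation in the narration.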

If using a human voice:

1. Record one WAV per segment.
2. Normalize loudness.
3. Convert to MP3 and replace the generated segment MP3s.
4. Keep the same duration-driven video composition.

If using an authorized voice clone:

1. Save written permission with project notes.
2. Store speaker consent and allowed usage.
3. Generate per-segment narration from our rewritten script.
4. Label outputs as synthetic/authorized where required by the publishing platform.

### Authorized Voice Clone Path

This is the preferred path if the teacher grants permission for a reusable voice model.

Required authorization fields:

- Speaker legal/display name.
- Rights holder, if different from speaker.
- Allowed project: MentorAI teaching videos.
- Allowed voice use: synthetic narration generated from rewritten lesson scripts.
- Allowed likeness use: yes/no. Voice permission does not automatically imply face/likeness permission.
- Allowed platforms: internal testing, YouTube, course platform, paid product, ads.
- Allowed duration: one video, one course, fixed period, or ongoing.
- Revocation process.
- Disclosure requirement, if any.
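
These fields can be stored in a structured record alongside `consent.md` so publishing checks can be automated. The sketch below is one possible shape, not an existing schema: every type name, field name, and enum value is hypothetical, including `revokedOn` and `expiresOn`.

```typescript
type PublishPlatform = "internal_testing" | "youtube" | "course_platform" | "paid_product" | "ads";

interface VoiceConsentRecord {
  speakerName: string;
  rightsHolder?: string; // only when different from the speaker
  allowedProject: string;
  allowedVoiceUse: string;
  likenessAllowed: boolean; // voice permission alone does not set this
  allowedPlatforms: PublishPlatform[];
  allowedDuration: "one_video" | "one_course" | "fixed_period" | "ongoing";
  expiresOn?: string; // ISO date (YYYY-MM-DD); assumed to apply to "fixed_period"
  revocationProcess: string;
  disclosureRequirement?: string;
  revokedOn?: string; // ISO date; hypothetical field for recording a revocation
}

interface UseDecision {
  allowed: boolean;
  reason: string;
}

// Decides whether one planned release is covered. ISO dates compare correctly as strings.
function checkVoiceUse(
  consent: VoiceConsentRecord,
  platform: PublishPlatform,
  usesLikeness: boolean,
  today: string
): UseDecision {
  if (consent.revokedOn !== undefined && consent.revokedOn <= today) {
    return { allowed: false, reason: "consent revoked" };
  }
  if (consent.allowedPlatforms.indexOf(platform) === -1) {
    return { allowed: false, reason: `platform not covered: ${platform}` };
  }
  if (usesLikeness && !consent.likenessAllowed) {
    return { allowed: false, reason: "likeness use not authorized" };
  }
  if (consent.allowedDuration === "fixed_period") {
    if (consent.expiresOn === undefined || consent.expiresOn < today) {
      return { allowed: false, reason: "fixed period missing or expired" };
    }
  }
  return { allowed: true, reason: "covered by consent record" };
}
```

The function does not track how many videos or courses have already used the voice, so `one_video` and `one_course` grants still need a manual check.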

Audio dataset requirements:

- At least 10-30 minutes of clean speech for a basic model; 60+ minutes is better.
- Same language and speaking style as target videos.
- WAV preferred, 44.1 kHz or 48 kHz, mono or stereo.
- Avoid background music, classroom noise, echo, overlapping speakers, and compressed livestream audio.
- Include 2-5 minutes of the teacher reading the exact new narration style if possible.
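
The measurable parts of these requirements (total length, file format, sample rate) can be checked before training. This is a sketch with hypothetical names. It assumes each clip's duration and sample rate have already been measured, for example with `ffprobe`, and it treats 10 minutes as the floor of the basic tier.

```typescript
interface AudioClip {
  file: string;
  durationSeconds: number;
  sampleRateHz: number;
}

type DatasetTier = "insufficient" | "basic" | "better";

interface DatasetReport {
  totalMinutes: number;
  tier: DatasetTier;
  warnings: string[];
}

// Sums clip durations and flags clips that miss the format guidance.
function assessDataset(clips: AudioClip[]): DatasetReport {
  let totalSeconds = 0;
  const warnings: string[] = [];
  for (const clip of clips) {
    totalSeconds += clip.durationSeconds;
    if (!/\.wav$/i.test(clip.file)) {
      warnings.push(`${clip.file}: WAV is preferred`);
    }
    if (clip.sampleRateHz !== 44100 && clip.sampleRateHz !== 48000) {
      warnings.push(`${clip.file}: expected 44.1 kHz or 48 kHz, got ${clip.sampleRateHz} Hz`);
    }
  }
  const totalMinutes = totalSeconds / 60;
  const tier: DatasetTier =
    totalMinutes >= 60 ? "better" : totalMinutes >= 10 ? "basic" : "insufficient";
  return { totalMinutes, tier, warnings };
}
```

Noise, echo, and overlapping speakers are not detectable from this metadata and still need a listening pass.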

Dataset folder:

```text
storage/voice-rights/<speaker-id>/
├── consent.md
├── source-audio/
│   ├── raw-001.wav
│   └── raw-002.wav
├── cleaned-audio/
│   ├── clean-001.wav
│   └── clean-002.wav
├── training-manifest.json
└── voice-model.json
```

Generation contract:

```json
{
  "mode": "authorized_voice_clone",
  "speakerId": "teacher-liu-authorized",
  "consentPath": "storage/voice-rights/teacher-liu-authorized/consent.md",
  "inputScript": "segment.narration",
  "outputPattern": "storage/childhood-wonder-video/segment-{id}-audio.mp3",
  "durationPolicy": "match_generated_audio"
}
```

Integration rule:

- The rest of the video pipeline should not care whether the segment MP3 came from Edge-TTS, human recording, or an authorized voice clone.
- Keep the same filename pattern so `renderAnimatedSlideSegment()` can continue to derive timing from audio durations.
- If a voice-clone provider returns WAV, normalize and convert to MP3 before composition.

Suggested post-processing:

```bash
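# loudnorm upsamples to 192 kHz internally, so pin the output rate with -ar.
ffmpeg -y -i input.wav -af loudnorm=I=-16:TP=-1.5:LRA=11 -ar 48000 output-normalized.wav
# Encode to VBR MP3 under the shared segment filename.
ffmpeg -y -i output-normalized.wav -codec:a libmp3lame -q:a 3 segment-01-audio.mp3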
```

### Human Recording Path

This is often the fastest high-quality path if the teacher can record the rewritten script:

1. Export the final segment narration script.
2. Ask the teacher to record one file per segment.
3. Normalize loudness.
4. Convert to MP3 and replace the segment MP3 files.
5. Compose the video using the same duration-driven pipeline.

Recording brief:

- Speak like a class explanation, not an advertisement.
- Leave half-second pauses after important board points.
- Keep each segment in one take when possible.
- Do not read the source text too fast; students need time to see the board.

## Stage 8: Video Composition

The current implementation composes still-image states:

1. Render title slide.
2. Render each segment slide with SVG overlay.
3. Generate one MP3 per segment.
4. Use `ffprobe` to measure MP3 durations.
5. Concatenate title silence and all segment MP3 files.
6. Use `renderAnimatedSlideSegment()` to create the final MP4.

Command:

```bash
npm run video:childhood-wonder -w @mentorai/api
```

Do not manually guess durations. Let the generated narration determine segment timing.
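
Duration-driven timing means each segment's start is the running sum of the title lead-in and the measured audio lengths before it. The sketch below shows only that arithmetic. `buildTimeline()` is a hypothetical helper, not the internals of `renderAnimatedSlideSegment()`, and the durations are assumed to come from `ffprobe`.

```typescript
interface TimelineEntry {
  id: number;
  start: number; // seconds from the start of the video
  end: number;
}

// Lays segments end to end after the title lead-in, using measured audio durations.
function buildTimeline(titleSeconds: number, segmentDurations: number[]): TimelineEntry[] {
  let cursor = titleSeconds;
  return segmentDurations.map((duration, index) => {
    if (!(duration > 0)) {
      throw new Error(`segment ${index + 1} has no measurable audio`);
    }
    const entry: TimelineEntry = { id: index + 1, start: cursor, end: cursor + duration };
    cursor = entry.end;
    return entry;
  });
}
```

With a 3-second title and measured durations of 42 s and 45 s, segment 1 runs from 3 s to 45 s and segment 2 from 45 s to 90 s.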

## Stage 9: Editing And QA

Required checks:

- No generated image contains fake readable text.
- Blackboard text does not overflow.
- Each segment starts when its narration starts.
- Final video has both video and audio streams.
- Total duration is approximately `3s + sum(segment audio durations)`, where the 3 s is the silent title slide.
- Teacher persona remains consistent across images.
- Narration uses original wording, not copied reference-video phrasing.

Verification commands:

```bash
find storage/childhood-wonder-video -maxdepth 1 -type f | sort
ffprobe -v error -show_entries format=duration,size -of default=nw=1 \
  storage/childhood-wonder-video/childhood-wonder-full.mp4
ffprobe -v error -show_entries stream=index,codec_type,codec_name,width,height \
  -of default=nw=1 storage/childhood-wonder-video/childhood-wonder-full.mp4
```
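
The `ffprobe` output from these commands is plain `key=value` lines, so the stream and duration checks can be automated. This is a sketch: the function names and the 0.5-second tolerance are hypothetical, and the 3-second title default follows the duration check above.

```typescript
interface ProbePair {
  key: string;
  value: string;
}

// Parses `key=value` lines as printed by `-of default=nw=1`.
function parseProbeLines(output: string): ProbePair[] {
  const pairs: ProbePair[] = [];
  for (const line of output.split(/\r?\n/)) {
    const eq = line.indexOf("=");
    if (eq > 0) {
      pairs.push({ key: line.slice(0, eq).trim(), value: line.slice(eq + 1).trim() });
    }
  }
  return pairs;
}

interface QaResult {
  hasVideo: boolean;
  hasAudio: boolean;
  expectedSeconds: number;
  actualSeconds: number;
  durationOk: boolean;
}

// Compares the probed file against the expected stream set and total duration.
function checkFinalVideo(
  formatOutput: string,
  streamOutput: string,
  segmentDurations: number[],
  titleSeconds: number = 3,
  toleranceSeconds: number = 0.5
): QaResult {
  const durationPair = parseProbeLines(formatOutput).filter((p) => p.key === "duration")[0];
  const actualSeconds = durationPair ? parseFloat(durationPair.value) : NaN;
  const expectedSeconds = titleSeconds + segmentDurations.reduce((sum, d) => sum + d, 0);
  const types = parseProbeLines(streamOutput)
    .filter((p) => p.key === "codec_type")
    .map((p) => p.value);
  return {
    hasVideo: types.indexOf("video") !== -1,
    hasAudio: types.indexOf("audio") !== -1,
    expectedSeconds,
    actualSeconds,
    // A missing duration yields NaN, which fails this comparison.
    durationOk: Math.abs(actualSeconds - expectedSeconds) <= toleranceSeconds,
  };
}
```

Feed it the captured stdout of the two `ffprobe` commands and the per-segment durations. A result with `hasVideo`, `hasAudio`, and `durationOk` all true covers the stream and duration checks, while the visual checks remain manual.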

## Stage 10: Release Package

Package these files for review:

- Final MP4.
- Contact sheet of slides.
- Script/storyboard JSON.
- Character bible.
- Prompt log.
- Audio mode declaration: TTS, human-recorded, or authorized clone.
- Reference analysis note.

Suggested review order:

1. Watch without sound for visual clarity.
2. Listen without watching for narration logic.
3. Watch full video for timing.
4. Check one segment against storyboard and source text.