feat(skills): remotion-to-hyperframes corpus T1+T2 (3/7) #508

Merged: jrusso1020 merged 2 commits into main from skill/r2hf-corpus-t1-t2 on Apr 28, 2026

Conversation

@jrusso1020 jrusso1020 commented Apr 27, 2026

What

The first two test fixtures the remotion-to-hyperframes skill is graded against. Each fixture is a self-contained directory:

tier-N-name/
├── remotion-src/   full Remotion project (package.json, src/, remotion.config.ts, tsconfig.json)
├── hf-src/         hand-translated HyperFrames composition (index.html)
├── expected.json   tier metadata + SSIM threshold + translation notes + measured validation
├── README.md       human walk-through of the translation choices
└── setup.sh        (T2 only) generates binary assets via ffmpeg

T1 — title-card-fade

3 s @ 30 fps, 1280×720. Single AbsoluteFill, single useCurrentFrame-driven interpolate with multi-segment input [0, 15, 75, 90] → [0, 1, 1, 0] (fade in / hold / fade out). No audio, no media, no custom components.
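
To make the multi-segment mapping concrete, here is a minimal Python sketch of a piecewise-linear interpolation over the same ranges. It is an illustrative port, not Remotion's `interpolate`. The function names and the clamping outside the input range are assumptions, and clamping does not affect frames 0 to 89 of this 90-frame clip.

```python
from bisect import bisect_right


def interpolate(frame, input_range, output_range):
    """Piecewise-linear map from input_range to output_range.

    Values outside the input range are clamped (an assumption made for
    this sketch; it does not matter for frames inside the clip).
    """
    if frame <= input_range[0]:
        return output_range[0]
    if frame >= input_range[-1]:
        return output_range[-1]
    # Find the segment [x0, x1] that contains the frame.
    i = bisect_right(input_range, frame) - 1
    x0, x1 = input_range[i], input_range[i + 1]
    y0, y1 = output_range[i], output_range[i + 1]
    return y0 + (y1 - y0) * (frame - x0) / (x1 - x0)


def title_opacity(frame):
    """Fade in over frames 0-15, hold until 75, fade out by 90."""
    return interpolate(frame, [0, 15, 75, 90], [0, 1, 1, 0])
```

Halfway through each fade (frames 7.5 and 82.5) the opacity is 0.5, and every frame in the hold segment maps to 1.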

Validated mean SSIM: 0.974 · threshold 0.95.

T2 — title-image-outro

6 s @ 30 fps, 1280×720, three <Sequence> scenes:

  • TitleScene (0–2 s) — spring({damping:12, stiffness:100, mass:1}) driving scale on text
  • ImageScene (2–4 s) — <Img src={staticFile("square.png")}> with linear fade-in + scale
  • OutroScene (4–6 s) — 1-second linear fade-in
  • <Audio src={staticFile("music.wav")} volume={0.5} /> throughout

setup.sh generates the 200×200 PNG and 6-second silent WAV via ffmpeg so binaries stay out of the repo.

Validated mean SSIM: 0.985 · threshold 0.95. The spring → back.out(1.4) translation came out cleaner than the original ~0.05 SSIM budget anticipated.
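
As a rough intuition for why a back.out ease can stand in for this spring, the sketch below compares peak overshoot of the two curves. Both functions are assumptions made for illustration: `spring_closed_form` is the textbook underdamped step response for the given damping, stiffness and mass (not Remotion's `spring` implementation), and `back_out` is the commonly cited cubic back-ease-out formula with a configurable overshoot (believed to match GSAP's, but not verified against its source here).

```python
import math


def spring_closed_form(t, damping=12.0, stiffness=100.0, mass=1.0):
    """Underdamped spring moving from 0 to 1, starting at rest (t in seconds)."""
    omega0 = math.sqrt(stiffness / mass)                    # natural frequency
    zeta = damping / (2.0 * math.sqrt(stiffness * mass))    # damping ratio (< 1 here)
    omega_d = omega0 * math.sqrt(1.0 - zeta * zeta)         # damped frequency
    decay = math.exp(-zeta * omega0 * t)
    return 1.0 - decay * (
        math.cos(omega_d * t) + (zeta * omega0 / omega_d) * math.sin(omega_d * t)
    )


def back_out(p, overshoot=1.4):
    """Cubic back-ease-out on normalized progress p in [0, 1]."""
    u = p - 1.0
    return 1.0 + u * u * ((overshoot + 1.0) * u + overshoot)


def peak(fn, span=1.0, steps=2000):
    """Largest sampled value of fn over [0, span]."""
    return max(fn(span * i / steps) for i in range(steps + 1))


spring_peak = peak(spring_closed_form)
ease_peak = peak(back_out)
```

Under these assumptions the spring peaks at about 1.09 and the ease at about 1.07, so the overshoot is of similar size. The time axes are not matched here, since the tween duration used in the HF translation is not stated above.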

End-to-end validation

Rendered Remotion baseline + HF translation locally, ran scripts/render_diff.sh. Both fixtures meet their thresholds with comfortable margin.

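| Tier | Measured mean | Measured p05 | Threshold | Margin          |
|------|---------------|--------------|-----------|-----------------|
| T1   | 0.974         | 0.972        | 0.95      | +0.022 from p05 |
| T2   | 0.985         | 0.966        | 0.95      | +0.016 from p05 |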
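
The margin column is the measured p05 minus the threshold. A small sketch of that arithmetic, using the numbers from the table (the dictionary layout and function name are illustrative only):

```python
measurements = {
    "T1": {"mean": 0.974, "p05": 0.972, "threshold": 0.95},
    "T2": {"mean": 0.985, "p05": 0.966, "threshold": 0.95},
}


def margin_from_p05(row):
    """Headroom between the 5th-percentile SSIM and the gate."""
    return round(row["p05"] - row["threshold"], 3)


margins = {tier: margin_from_p05(row) for tier, row in measurements.items()}
```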

The dominant source of non-translation noise is system font fallback divergence between Remotion's bundled Chromium and HF's chrome-headless-shell. The same font-weight: 800 renders perceptibly bolder on HF, which costs ~0.025 mean SSIM and caps how high the threshold can be set.

Critical Remotion config

Both fixtures' remotion.config.ts call setVideoImageFormat("png") and setColorSpace("bt709"). Remotion's default JPEG output writes yuvj420p (full-range), which costs ~0.05 SSIM vs HF's yuv420p (limited-range). Without these settings, T1 lands at 0.958 instead of 0.974, and render_diff would be measuring an encoder difference rather than translation fidelity.
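
One way to see why full-range and limited-range output are not directly comparable is to look at how the same luma value is coded in each. The sketch below uses the standard 8-bit mapping (full range 0 to 255, limited range 16 to 235). The function name is hypothetical, and this only illustrates the range difference, not how the SSIM loss actually arises in the render pipeline.

```python
def full_to_limited_luma(y_full):
    """Map an 8-bit full-range luma code (0-255) to limited range (16-235)."""
    return round(16 + 219 * y_full / 255)


# The same black, mid-gray and white pixels get different code values
# depending on which range the encoder writes.
codes = {y: full_to_limited_luma(y) for y in (0, 128, 255)}
```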

Why

Third PR in the 7-PR stack.

T1 + T2 together exercise:

  • AbsoluteFill, Sequence
  • useCurrentFrame, useVideoConfig
  • interpolate (single-segment + multi-segment, with extrapolation)
  • spring
  • Audio, Img, staticFile

That's the bulk of what real Remotion compositions use day-to-day. T3 (PR 4) adds custom React subcomponents + Zod schemas; T4 (PR 5) covers escape-hatch cases.

Stack

#506 (1/7) — scaffold
#507 (2/7) — eval harness
this PR (3/7) — T1 + T2 fixtures
#509 (4/7) — T3 data-driven fixture
5/7 — T4 escape-hatch fixtures
6/7 — references/*.md (translation map)
7/7 — SKILL.md body + corpus orchestrator

Test plan

  • T1: lint_source.py reports 0 blockers / 0 warnings / 0 infos
  • T2: lint_source.py reports 0 blockers / 0 warnings / 2 infos (the two staticFile() references — correctly classified as translatable)
  • T2: setup.sh runs successfully, generates PNG + WAV
  • All fixture .tsx, .ts, .json, .md files pass oxfmt --check and oxlint
  • T1: end-to-end render + SSIM diff (mean 0.974, ≥ 0.95 threshold)
  • T2: end-to-end render + SSIM diff (mean 0.985, ≥ 0.95 threshold)

@miguel-heygen miguel-heygen left a comment

The fixture content is directionally useful, but the human-facing corpus docs disagree with the machine-readable thresholds. Since agents will read both, these should be made consistent before merging this layer.

```
../../../scripts/render_diff.sh ./remotion-src/out/baseline.mp4 ./hf.mp4 ./diff
```

`expected.json` documents the SSIM threshold (0.97) for this fixture.

This says the fixture threshold is 0.97, but expected.json sets ssim_threshold to 0.95 and the rationale also says 0.95. Please make the README match the executable contract; otherwise the skill/reference reader gets two different gates for the same fixture.

```
../../../scripts/render_diff.sh ./remotion-src/out/baseline.mp4 ./hf.mp4 ./diff
```

## Why threshold 0.92?

Same issue here: this section is written around a 0.92 threshold and compares against T1 at 0.97, but the checked-in expected.json uses 0.95 for T2 and T1 also uses 0.95. The corpus needs one source of truth, especially because the final orchestrator reads expected.json while humans/agents read this README.

@jrusso1020 jrusso1020 force-pushed the skill/r2hf-eval-harness branch from 00de7e4 to 2a309f2 Compare April 27, 2026 23:03
@jrusso1020 jrusso1020 force-pushed the skill/r2hf-corpus-t1-t2 branch from 08fa028 to abaa743 Compare April 27, 2026 23:04
jrusso1020 commented Apr 27, 2026

@miguel-heygen — addressed in the amended commit abaa7430:

T1 README: The trailing line "expected.json documents the SSIM threshold (0.97)" is now "0.95" (matching expected.json). Calibrated mean (0.974) is also called out so the reader knows where the value comes from.

T2 README: The "Why threshold 0.92?" header and surrounding paragraph now reference 0.95 (matching expected.json). The reasoning was rewritten to reflect what calibration actually showed: spring → back.out(1.4) came in cleaner than the original 0.05-SSIM budget anticipated, so 0.95 is the gate (validated mean 0.985).

grep -n "0\.97\|0\.92" tier-1-title-card/README.md tier-2-multi-scene/README.md now only matches the validated-mean number 0.974 (one occurrence in T1's note about the calibrated SSIM), no threshold mentions.
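
The same check could be automated so the README and expected.json cannot drift apart again. The sketch below is a hypothetical helper, not part of the PR. Only the `ssim_threshold` key comes from the review above; the regular expression and function names are assumptions. It flags any number that directly follows the word "threshold" in the README and differs from the JSON value.

```python
import json
import re

# A decimal such as 0.95 appearing shortly after the word "threshold".
THRESHOLD_MENTION = re.compile(r"threshold\D{0,20}?(0\.\d+)", re.IGNORECASE)


def readme_thresholds(readme_text):
    """Set of threshold values mentioned in the README text."""
    return {float(value) for value in THRESHOLD_MENTION.findall(readme_text)}


def check_consistency(readme_text, expected_json_text):
    """Return (expected threshold, sorted README values that disagree)."""
    expected = json.loads(expected_json_text)["ssim_threshold"]
    mismatches = sorted(v for v in readme_thresholds(readme_text) if v != expected)
    return expected, mismatches
```

A validated-mean figure such as 0.974 is not flagged as long as it is not preceded by the word "threshold", which mirrors the intent of the grep above.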

@miguel-heygen miguel-heygen left a comment

Re-reviewed the latest head. The T1/T2 threshold documentation now matches the executable expected.json values, and I do not have remaining blockers on this layer.

@jrusso1020 jrusso1020 force-pushed the skill/r2hf-eval-harness branch from 2a309f2 to 90845dd Compare April 27, 2026 23:31
@jrusso1020 jrusso1020 force-pushed the skill/r2hf-corpus-t1-t2 branch from abaa743 to 2649d6d Compare April 27, 2026 23:31
Adds the deterministic eval primitives the skill calls into:

  scripts/render_diff.sh    SSIM diff between two MP4s, JSON summary, configurable threshold
  scripts/frame_strip.sh    side-by-side comparison strip for visual debugging
  scripts/lint_source.py    pre-translation lint over Remotion source — blocks/warnings/infos

The harness is decoupled from the render pipeline: it accepts paths to
already-rendered MP4s. The skill orchestrator (PR 7) drives both renders
and feeds the outputs in. This keeps the harness usable in CI, in
sandboxes, and on any machine that has ffmpeg without needing the full
Remotion + HyperFrames toolchain.
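
For illustration, the JSON summary step can be sketched in Python as below. This is not render_diff.sh itself. The field names, the nearest-rank percentile, and gating on the mean (rather than on p05) are assumptions made for this sketch.

```python
import json


def summarize_ssim(per_frame, threshold):
    """Summarize per-frame SSIM scores into mean, p05 and a pass flag."""
    ordered = sorted(per_frame)
    n = len(ordered)
    # Nearest-rank 5th percentile: ceil(0.05 * n), computed in integers.
    rank = max(1, (5 * n + 99) // 100)
    mean = sum(ordered) / n
    return {
        "frames": n,
        "mean": round(mean, 4),
        "p05": round(ordered[rank - 1], 4),
        "threshold": threshold,
        "pass": mean >= threshold,  # assumption: the gate compares the mean
    }


summary = summarize_ssim([0.99] * 19 + [0.90], threshold=0.95)
summary_json = json.dumps(summary)
```

Because the function only needs a list of scores, it stays independent of how the two MP4s were rendered, in the same spirit as the harness.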

Lint catches the patterns from the skill's out-of-scope list:
- useState / useReducer (state-machine driven animation)
- useEffect with deps (side effects)
- async calculateMetadata (Promise-returning composition metadata)
- @remotion/lambda imports
- third-party React UI libraries (MUI, Chakra, Mantine, antd, shadcn, Radix, NextUI)
- delayRender / useCallback / useMemo (warnings)
- staticFile / interpolateColors (info — translatable but flagged)
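
A minimal sketch of how such a lint could classify source text follows. It is not the actual lint_source.py. The regular expressions, the package prefixes used to detect UI libraries, and the choice to report one finding per match are all assumptions. It also simplifies two rules: every useEffect call is treated as a blocker regardless of deps, and shadcn is left out because it is not imported from a single package name.

```python
import re

# (severity, label, pattern) triples; patterns are illustrative only.
RULES = [
    ("blocker", "useState/useReducer", re.compile(r"\buse(State|Reducer)\s*\(")),
    ("blocker", "useEffect", re.compile(r"\buseEffect\s*\(")),
    ("blocker", "async calculateMetadata",
     re.compile(r"\basync\s+calculateMetadata\b|calculateMetadata\s*[:=]\s*async\b")),
    ("blocker", "@remotion/lambda import",
     re.compile(r"""from\s+['"]@remotion/lambda""")),
    ("blocker", "third-party UI library",
     re.compile(r"""from\s+['"](@mui/|@chakra-ui/|@mantine/|antd|@radix-ui/|@nextui-org/)""")),
    ("warning", "delayRender/useCallback/useMemo",
     re.compile(r"\b(delayRender|useCallback|useMemo)\s*\(")),
    ("info", "staticFile/interpolateColors",
     re.compile(r"\b(staticFile|interpolateColors)\s*\(")),
]


def lint(source):
    """Group every rule match in the source text by severity."""
    findings = {"blocker": [], "warning": [], "info": []}
    for severity, label, pattern in RULES:
        for match in pattern.finditer(source):
            findings[severity].append((label, match.group(0)))
    return findings
```

With these rules, a source file that calls staticFile() twice and uses none of the blocked patterns yields two info findings and no blockers or warnings.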

Smoke test (scripts/tests/smoke.sh) exercises all three scripts against
synthetic inputs: identical ffmpeg testsrc videos pass at threshold 0.99,
different ffmpeg testsrc videos fail at 0.99, frame_strip produces a
strip.png, lint produces 0 blockers on a clean fixture and >=3 blockers
on a fixture that uses useState + useEffect + MUI + async metadata.

Validated locally: smoke.sh exits 0.
@jrusso1020 jrusso1020 force-pushed the skill/r2hf-eval-harness branch from 90845dd to 70e0b8b Compare April 27, 2026 23:54
Adds the first two test fixtures the skill is graded against. Each fixture
ships:
  - remotion-src/  full Remotion project (package.json, src/, remotion.config.ts, tsconfig.json)
  - hf-src/        hand-translated HyperFrames composition (index.html)
  - expected.json  tier metadata + SSIM threshold + translation notes + measured validation
  - README.md      human walk-through of the translation choices
  - setup.sh       (T2 only) generates binary assets (PNG, WAV) via ffmpeg

T1 — title-card-fade
- 3 s @ 30 fps, 1280x720
- Single AbsoluteFill, single useCurrentFrame interpolate
  with multi-segment input [0,15,75,90] -> [0,1,1,0]
- Validated mean SSIM 0.974, threshold 0.95
  (~0.025 gap from font-fallback divergence between Remotion's bundled
   Chromium and HF's chrome-headless-shell)

T2 — title-image-outro
- 6 s @ 30 fps, 1280x720, three Sequences (TitleScene, ImageScene, OutroScene)
- Exercises spring, interpolate, Audio, Img, staticFile
- Spring -> GSAP back.out(1.4) translation
- Validated mean SSIM 0.985, threshold 0.95
  (translation came out cleaner than predicted; spring->back.out drift was
   smaller than the ~0.05 budget I'd expected)
- setup.sh generates a 200x200 blue PNG and a 6 s silent WAV via ffmpeg
  so binaries stay out of the repo

Calibration done end-to-end: rendered Remotion baseline + HF translation,
ran scripts/render_diff.sh, set thresholds ~0.02 below measured p05.

Critical Remotion config: setVideoImageFormat("png") + setColorSpace("bt709").
The default JPEG output writes yuvj420p (full-range) which costs ~0.05 SSIM
vs HF's yuv420p (limited-range). Both fixtures' remotion.config.ts encode
this so render_diff.sh measures translation fidelity, not encoder differences.

Both fixtures lint clean (0 blockers via scripts/lint_source.py).
T2 staticFile() references correctly flagged as info-level findings.

The fixtures are not yet wired into CI — that comes with PR 7's orchestrator.
For now, render and eval are documented in each README and run by hand.
@jrusso1020 jrusso1020 force-pushed the skill/r2hf-corpus-t1-t2 branch from 2649d6d to 9ff46d7 Compare April 27, 2026 23:54
@jrusso1020 jrusso1020 marked this pull request as ready for review April 28, 2026 00:29
@jrusso1020 jrusso1020 changed the base branch from skill/r2hf-eval-harness to main April 28, 2026 05:13
@jrusso1020 jrusso1020 merged commit 4ab4576 into main Apr 28, 2026
20 checks passed
@jrusso1020 jrusso1020 deleted the skill/r2hf-corpus-t1-t2 branch April 28, 2026 05:17