feat(skills): remotion-to-hyperframes corpus T1+T2 (3/7)#508
miguel-heygen
left a comment
The fixture content is directionally useful, but the human-facing corpus docs disagree with the machine-readable thresholds. Since agents will read both, these should be made consistent before merging this layer.
../../../scripts/render_diff.sh ./remotion-src/out/baseline.mp4 ./hf.mp4 ./diff
`expected.json` documents the SSIM threshold (0.97) for this fixture.
This says the fixture threshold is 0.97, but expected.json sets ssim_threshold to 0.95 and the rationale also says 0.95. Please make the README match the executable contract; otherwise the skill/reference reader gets two different gates for the same fixture.
../../../scripts/render_diff.sh ./remotion-src/out/baseline.mp4 ./hf.mp4 ./diff
## Why threshold 0.92?
Same issue here: this section is written around a 0.92 threshold and compares against T1 at 0.97, but the checked-in expected.json uses 0.95 for T2 and T1 also uses 0.95. The corpus needs one source of truth, especially because the final orchestrator reads expected.json while humans/agents read this README.
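One way to keep the README and the executable contract from drifting again is a small consistency check. The sketch below is hypothetical and not part of this PR: it assumes only that `expected.json` carries an `ssim_threshold` key (as quoted above) and it treats any `0.xx` number on a README line containing the word "threshold" as a threshold mention. That heuristic is crude and would also flag a line that quotes a measured SSIM next to the threshold.

```python
import json
import re

# Assumption: a threshold mention is any decimal like 0.95 on a line that
# also contains the word "threshold". This is a heuristic, not a parser.
DECIMAL = re.compile(r"\b0\.\d+\b")


def readme_threshold_mismatches(readme_text: str, expected_json_text: str) -> list[str]:
    """Return README lines whose threshold mention disagrees with expected.json."""
    expected = json.loads(expected_json_text)["ssim_threshold"]
    mismatches = []
    for line in readme_text.splitlines():
        if "threshold" not in line.lower():
            continue
        for token in DECIMAL.findall(line):
            if abs(float(token) - expected) > 1e-9:
                mismatches.append(line.strip())
                break  # report each offending line once
    return mismatches
```

Run against the two README lines flagged in this review with `{"ssim_threshold": 0.95}`, both lines would be reported; after the fix the list would be empty.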
@miguel-heygen, addressed in the amended commit:
- T1 README: the trailing line "expected.json documents the SSIM threshold (0.97)" now reads "0.95".
- T2 README: the "Why threshold 0.92?" header and surrounding paragraph now reference 0.95.
miguel-heygen
left a comment
Re-reviewed the latest head. The T1/T2 threshold documentation now matches the executable expected.json values, and I do not have remaining blockers on this layer.
Adds the deterministic eval primitives the skill calls into:
- scripts/render_diff.sh SSIM diff between two MP4s, JSON summary, configurable threshold
- scripts/frame_strip.sh side-by-side comparison strip for visual debugging
- scripts/lint_source.py pre-translation lint over Remotion source (blocks/warnings/infos)
The harness is decoupled from the render pipeline: it accepts paths to
already-rendered MP4s. The skill orchestrator (PR 7) drives both renders and
feeds the outputs in. This keeps the harness usable in CI, in sandboxes, and
on any machine that has ffmpeg without needing the full Remotion + HyperFrames
toolchain.
Lint catches the patterns from the skill's out-of-scope list:
- useState / useReducer (state-machine driven animation)
- useEffect with deps (side effects)
- async calculateMetadata (Promise-returning composition metadata)
- @remotion/lambda imports
- third-party React UI libraries (MUI, Chakra, Mantine, antd, shadcn, Radix, NextUI)
- delayRender / useCallback / useMemo (warnings)
- staticFile / interpolateColors (info: translatable but flagged)
Smoke test (scripts/tests/smoke.sh) exercises all three scripts against
synthetic inputs: identical ffmpeg testsrc videos pass at threshold 0.99,
different ffmpeg testsrc videos fail at 0.99, frame_strip produces a strip.png,
lint produces 0 blockers on a clean fixture and >=3 blockers on a fixture that
uses useState + useEffect + MUI + async metadata.
Validated locally: smoke.sh exits 0.
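For readers who have not opened the harness PR, the three-severity lint described above can be pictured roughly as follows. This is an illustrative sketch only, not the contents of scripts/lint_source.py: the `Finding` type, the `RULES` table, the regexes, and the npm package prefixes are all assumptions made for the example, and the `useEffect` rule is simplified to match any call rather than only calls with a deps array.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    severity: str  # "blocker", "warning", or "info"
    rule: str
    line: int      # 1-based line number in the scanned source


# Hypothetical rule table; the real lint may match these patterns differently.
RULES = [
    ("blocker", "useState/useReducer", re.compile(r"\buse(State|Reducer)\s*\(")),
    ("blocker", "useEffect", re.compile(r"\buseEffect\s*\(")),
    ("blocker", "async calculateMetadata",
     re.compile(r"\basync\b.*\bcalculateMetadata\b|\bcalculateMetadata\b.*\basync\b")),
    ("blocker", "@remotion/lambda import", re.compile(r"""from\s+['"]@remotion/lambda""")),
    ("blocker", "third-party UI library",
     re.compile(r"""from\s+['"](@mui/|@chakra-ui/|@mantine/|antd\b|@radix-ui/|@nextui-org/)""")),
    ("warning", "delayRender/useCallback/useMemo",
     re.compile(r"\b(delayRender|useCallback|useMemo)\s*\(")),
    ("info", "staticFile/interpolateColors",
     re.compile(r"\b(staticFile|interpolateColors)\s*\(")),
]


def lint(source: str) -> list[Finding]:
    """Scan source line by line and emit one finding per matching rule."""
    findings = []
    for number, text in enumerate(source.splitlines(), start=1):
        for severity, rule, pattern in RULES:
            if pattern.search(text):
                findings.append(Finding(severity, rule, number))
    return findings


def count(findings: list[Finding], severity: str) -> int:
    """Number of findings at the given severity."""
    return sum(1 for f in findings if f.severity == severity)
```

A line-based regex pass like this is cheap and deterministic, which suits a pre-translation gate, but it cannot see multi-line constructs; an AST-based lint would be more precise.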
Adds the first two test fixtures the skill is graded against. Each fixture
ships:
- remotion-src/ full Remotion project (package.json, src/, remotion.config.ts, tsconfig.json)
- hf-src/ hand-translated HyperFrames composition (index.html)
- expected.json tier metadata + SSIM threshold + translation notes + measured validation
- README.md human walk-through of the translation choices
- setup.sh (T2 only) generates binary assets (PNG, WAV) via ffmpeg
T1 — title-card-fade
- 3 s @ 30 fps, 1280x720
- Single AbsoluteFill, single useCurrentFrame interpolate
with multi-segment input [0,15,75,90] -> [0,1,1,0]
- Validated mean SSIM 0.974, threshold 0.95
(~0.025 gap from font-fallback divergence between Remotion's bundled
Chromium and HF's chrome-headless-shell)
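The multi-segment mapping used by T1 is a piecewise-linear function of the frame number. The sketch below is a plain Python stand-in written for illustration, not Remotion's `interpolate`: the function name and signature are made up here, and it clamps outside the input range as an assumption (which does not matter for T1, since the range [0, 90] covers the whole 3 s clip).

```python
def piecewise_linear(frame, input_range, output_range):
    """Map frame through matching input/output breakpoints, clamping at the ends."""
    if len(input_range) != len(output_range) or len(input_range) < 2:
        raise ValueError("ranges must have equal length of at least 2")
    if frame <= input_range[0]:
        return output_range[0]
    if frame >= input_range[-1]:
        return output_range[-1]
    # Find the segment containing the frame and interpolate linearly within it.
    for i in range(len(input_range) - 1):
        x0, x1 = input_range[i], input_range[i + 1]
        if x0 <= frame <= x1:
            y0, y1 = output_range[i], output_range[i + 1]
            return y0 + (y1 - y0) * (frame - x0) / (x1 - x0)


def t1_opacity(frame):
    """Fade in over frames 0-15, hold until 75, fade out by 90."""
    return piecewise_linear(frame, [0, 15, 75, 90], [0, 1, 1, 0])
```

The same four breakpoints are what the hand translation has to reproduce on the HyperFrames side as a fade-in, a hold, and a fade-out.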
T2 — title-image-outro
- 6 s @ 30 fps, 1280x720, three Sequences (TitleScene, ImageScene, OutroScene)
- Exercises spring, interpolate, Audio, Img, staticFile
- Spring -> GSAP back.out(1.4) translation
- Validated mean SSIM 0.985, threshold 0.95
(translation came out cleaner than predicted; spring->back.out drift was
smaller than the ~0.05 budget I'd expected)
- setup.sh generates a 200x200 blue PNG and a 6 s silent WAV via ffmpeg
so binaries stay out of the repo
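To get a feel for why a back.out ease can stand in for this spring, the overshoot of the two curves can be compared numerically. The sketch below rests on two assumptions that are not taken from this PR: the spring is modelled as the textbook underdamped unit-step response (Remotion's own `spring` implementation may differ in detail), and `back_out` uses the standard back-ease polynomial with the overshoot parameter set to 1.4.

```python
import math


def back_out(p, overshoot=1.4):
    """Standard back ease-out on progress p in [0, 1]; overshoots 1 before settling."""
    u = p - 1.0
    return 1.0 + u * u * ((overshoot + 1.0) * u + overshoot)


def damped_spring(t, damping=12.0, stiffness=100.0, mass=1.0):
    """Unit-step response of an underdamped spring starting at rest at 0 (t in seconds)."""
    w0 = math.sqrt(stiffness / mass)                      # natural frequency
    zeta = damping / (2.0 * math.sqrt(stiffness * mass))  # damping ratio
    if zeta >= 1.0:
        raise ValueError("this closed form only covers the underdamped case")
    wd = w0 * math.sqrt(1.0 - zeta * zeta)                # damped frequency
    decay = math.exp(-zeta * w0 * t)
    return 1.0 - decay * (math.cos(wd * t) + (zeta * w0 / wd) * math.sin(wd * t))


def peak(fn, start, stop, steps=2000):
    """Largest sampled value of fn on [start, stop]."""
    return max(fn(start + (stop - start) * i / steps) for i in range(steps + 1))
```

Under these assumptions, `peak(damped_spring, 0.0, 2.0)` and `peak(back_out, 0.0, 1.0)` both land a few percent above 1, with the spring overshooting slightly more. That is consistent with the small SSIM drift reported for T2, though the timing of the two curves still has to be matched by choosing the tween duration.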
Calibration done end-to-end: rendered Remotion baseline + HF translation,
ran scripts/render_diff.sh, set thresholds ~0.02 below measured p05.
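The calibration rule above (threshold about 0.02 below the measured p05) can be written down as a short helper. This is a sketch under stated assumptions: it takes a plain list of per-frame SSIM values as input, uses a nearest-rank percentile, and rounds to two decimals. The function names are hypothetical, and the actual JSON layout that render_diff.sh emits is not assumed here.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of values for q in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def calibrate_threshold(per_frame_ssim, margin=0.02, q=5):
    """Suggest a threshold sitting `margin` below the q-th percentile SSIM."""
    return round(percentile(per_frame_ssim, q) - margin, 2)
```

Anchoring on a low percentile rather than the mean keeps a handful of weak frames from being hidden by an otherwise clean render.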
Critical Remotion config: setVideoImageFormat("png") + setColorSpace("bt709").
The default JPEG output writes yuvj420p (full-range) which costs ~0.05 SSIM
vs HF's yuv420p (limited-range). Both fixtures' remotion.config.ts encode
this so render_diff.sh measures translation fidelity, not encoder differences.
Both fixtures lint clean (0 blockers via scripts/lint_source.py).
T2 staticFile() references correctly flagged as info-level findings.
The fixtures are not yet wired into CI — that comes with PR 7's orchestrator.
For now, render and eval are documented in each README and run by hand.
What
The first two test fixtures the `remotion-to-hyperframes` skill is graded against. Each fixture is a self-contained directory.
T1: title-card-fade
3 s @ 30 fps, 1280×720. Single `AbsoluteFill`, single `useCurrentFrame`-driven `interpolate` with multi-segment input `[0, 15, 75, 90] → [0, 1, 1, 0]` (fade in / hold / fade out). No audio, no media, no custom components.
Validated mean SSIM: 0.974 · threshold 0.95.
T2: title-image-outro
6 s @ 30 fps, 1280×720, three `<Sequence>` scenes:
- `spring({damping:12, stiffness:100, mass:1})` driving scale on text
- `<Img src={staticFile("square.png")}>` with linear fade-in + scale
- `<Audio src={staticFile("music.wav")} volume={0.5} />` throughout
`setup.sh` generates the 200×200 PNG and 6-second silent WAV via ffmpeg so binaries stay out of the repo.
Validated mean SSIM: 0.985 · threshold 0.95. The spring → `back.out(1.4)` translation came out cleaner than the original ~0.05 SSIM budget anticipated.
End-to-end validation
Rendered Remotion baseline + HF translation locally, ran `scripts/render_diff.sh`. Both fixtures meet their thresholds with comfortable margin.
The dominant non-translation noise floor is system font fallback divergence between Remotion's bundled Chromium and HF's `chrome-headless-shell`. The same `font-weight: 800` renders perceptibly bolder on HF, which costs ~0.025 mean SSIM and bounds how tight a threshold can be set.
Critical Remotion config
Both fixtures' `remotion.config.ts` set `setVideoImageFormat("png")` + `setColorSpace("bt709")`. Remotion's default JPEG output writes `yuvj420p` (full-range), which costs ~0.05 SSIM vs HF's `yuv420p` (limited-range). Without this, T1 lands at 0.958 instead of 0.974, and render_diff would be measuring an encoder difference, not translation fidelity.
Why
#3 in the 7-PR stack.
T1 + T2 together exercise:
- `AbsoluteFill`, `Sequence`
- `useCurrentFrame`, `useVideoConfig`
- `interpolate` (single-segment + multi-segment, with extrapolation)
- `spring`
- `Audio`, `Img`, `staticFile`
That's the bulk of what real Remotion compositions use day-to-day. T3 (PR 4) adds custom React subcomponents + Zod schemas; T4 (PR 5) covers escape-hatch cases.
Stack
#506 (1/7) — scaffold
#507 (2/7) — eval harness
this PR (3/7) — T1 + T2 fixtures
#509 (4/7) — T3 data-driven fixture
5/7 — T4 escape-hatch fixtures
6/7 — references/*.md (translation map)
7/7 — SKILL.md body + corpus orchestrator
Test plan
- T1: `lint_source.py` reports 0 blockers / 0 warnings / 0 infos
- T2: `lint_source.py` reports 0 blockers / 0 warnings / 2 infos (the two `staticFile()` references, correctly classified as translatable)
- T2: `setup.sh` runs successfully, generates PNG + WAV