Web Pattern Capture Toolkit · research / tooling

Lego.
Pipeline.

Fully capture production websites, reproduce them locally, and extract interaction patterns as Lego blocks.

The goal isn't simple scraping — it's building a database of implementation patterns verified on real sites. Handles Three.js 3D, Next.js SSR, Webflow, and Vue/Nuxt.

"Playwright capture is
the general-purpose solution.
After fixing 14 bugs,
promoted to the primary Path."
— D11 / DECISION_LOG
capture.py: 1,076 lines
Bug fixes: 14
Decisions: 12
Verified sites: 5
Paths: A + D
Sourcemap methods: 23

§0 · Why

Why I built it.

When you see a great interaction on Awwwards or similar showcases, it usually ends with "how did they make this?" The texture, timing, and scroll response — design language that's hard to describe in words — just disappears without ever becoming a reusable block.

So I wanted to capture great sites and stack them up as block-level pieces, like a design palette. Drop them into Figma, hand them to Claude, plug them into a project — usable anywhere as copy-paste-ready components.

A block is "one self-contained interaction technique" — things like sticky scroll reveal, parallax text mask, or cursor follow blob. Finely classified by category, with reverse search too ("an interaction with this kind of feel" → matching blocks).

Ultimately, connected via API. "Give me 10 blocks with this concept" → loaded directly inside your design tool. The hypothesis: for AI to reproduce web design well, it needs to learn from blocks, not words.

§1 · The Story

From regex, all the way to Path D.

Path A
sourcemap restore
If the JS bundle has a sourcemap, go all the way back to the original components.

Pipeline: collect → check_sourcemap → restore_source → transform.mjs → assemble → Vite. teetran.com (Gatsby + Prismic) was restored at 100%. D1 replaced regex with a Babel AST because nested JSX and scope analysis are impossible with regex.

Path B
deprecated
HTML → JSX reconstruction.

For sites with no sourcemap. Problem: interactions didn't survive. Abandoned.

Path C
mirror + proxy
Mirror the original HTML + JS as-is, then hydrate.

opalcamera.com (Next.js + GSAP) succeeded — 17 ScrollTriggers, 351-frame image sequence. D10: failed on kprverse due to Three.js timing sensitivity. Confirmed the limits of the proxy approach.

Path D
browser capture
Take the result of an actual browser run via Playwright, wholesale.

D11: promoted to primary. All 14 bugs found across 3 sites (landonorris, igloo, immersive-g) are resolved, and the fixes accumulated into capture.py's 1,076 lines. It now handles full Three.js 3D, SPA live fallback, and CDN 403 blocks.

Now
blocks design complete
Capture engine done → block extraction pipeline.

Separate "one self-contained interaction technique" from each captured site into a React component. 10 categories, 7 triggers, 3 difficulty levels. Design complete, implementation pending.

§2 · Quick start (macOS)

Just one URL.

# 1. Clone
git clone https://github.com/your-username/lego-pipeline.git
cd lego-pipeline

# 2. Install dependencies
pip install playwright beautifulsoup4 cssutils lxml Pillow --break-system-packages
playwright install chromium
cd scripts && npm install && cd ..

# 3. Capture
python3 scripts/capture.py https://example.com

# 4. Play back locally
cd data/sites/example.com/capture
python3 serve.py
# → http://localhost:8080

Requirements
  • Python 3.10+ · brew install python@3.12
  • Node.js 18+ · brew install node
  • Playwright latest · + chromium

§3 · Pipeline (Path D)

One URL → an identical local copy.

INPUT
  https://...

PLAYWRIGHT · CHROMIUM
  1. Page load · save every response
  2. Capture raw server HTML (not rendered DOM)
  3. Scroll · trigger lazy-load assets
  4. Image preload · hover simulation
  5. networkidle (wait for 3D textures)

POST-CAPTURE
  • URL rewrite · sorted by length desc
  • Domain mapping (CDN → /__ext__/cdn)
  • Inject fetch/XHR/Image/Worker interceptors
  • Strip integrity, srcset, tracking
  • 403 stub JS · skip 400+
  • JS/HTML scan → auto-DL missing assets
  • 3D quality variants detected (ultralow → high)

OUTPUT
  data/sites/{domain}/capture/
  • index.html (rewritten)
  • serve.py (live fallback)
  • *.js *.css (static rewrite)
  • *.png *.jpg (media)
  • *.glb *.ktx2 (3D)
  • __ext__/ (CDN)
  ● localhost:8080

BLOCKS (planned)
  Extract Lego-block units · text-slide-up · marquee · 3d-scene · smooth-scroll · variable-font · hover-effect · cursor · loading …

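
The Playwright steps above (page load, raw HTML, scroll, networkidle) can be sketched with Playwright's sync API. This is a minimal sketch, not capture.py itself: the names `capture` and `local_path_for`, the `__ext__/<host>/` layout, and the scroll step size are all assumptions made for illustration, and hover simulation (step 4) is left out.

```python
from __future__ import annotations

from pathlib import Path
from urllib.parse import urlsplit


def local_path_for(url: str, out_dir: Path, origin_host: str) -> Path:
    """Map a response URL to a file path (hypothetical layout, query string ignored)."""
    parts = urlsplit(url)
    rel = parts.path.lstrip("/")
    if not rel or rel.endswith("/"):
        rel += "index.html"
    if parts.netloc != origin_host:
        # Assumption: external hosts are stored under __ext__/<host>/.
        rel = f"__ext__/{parts.netloc}/{rel}"
    return out_dir / rel


def capture(url: str, out_dir: Path, wait_s: float = 5.0) -> None:
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    origin_host = urlsplit(url).netloc
    out_dir.mkdir(parents=True, exist_ok=True)

    def on_response(response) -> None:
        # Step 1: save every response body as it arrives (error statuses skipped).
        if response.status >= 400:
            return
        try:
            body = response.body()
        except Exception:
            return  # e.g. redirects carry no body
        target = local_path_for(response.url, out_dir, origin_host)
        try:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(body)
        except OSError:
            pass  # file/directory name collision; ignored in this sketch

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", on_response)
        main = page.goto(url, wait_until="load")
        # Step 2: keep the raw server HTML rather than the rendered DOM.
        if main is not None:
            (out_dir / "index.html").write_bytes(main.body())
        # Step 3: scroll through the page to trigger lazy-loaded assets.
        height = page.evaluate("document.documentElement.scrollHeight")
        for _ in range(0, height, 800):
            page.mouse.wheel(0, 800)
            page.wait_for_timeout(200)
        # Step 5: wait for the network to go quiet, then a fixed grace period.
        page.wait_for_load_state("networkidle")
        page.wait_for_timeout(wait_s * 1000)
        browser.close()
```

Saving the main response body instead of `page.content()` is what keeps the captured HTML free of nodes that scripts added at runtime.
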
§4 · Path A vs Path D

It splits on whether a sourcemap exists.

Path A · sourcemap restore
ideal
collect → check_sourcemap → restore_source → transform.mjs → assemble → Vite
  • ✓ Original component structure preserved
  • ✓ Gatsby → Vite conversion works
  • ✓ AST-level code rewriting
  • ✗ Impossible without a sourcemap
  • ✗ Requires framework-conversion rules
Verified: teetran.com · 100% fidelity
Path D · browser capture
primary
Playwright + Chromium → the actual browser result, captured wholesale
  • ✓ No sourcemap required
  • ✓ Handles Three.js · WebGL
  • ✓ SPA live fallback
  • ✓ Generalized after solving 14 bugs
  • ✗ Loses the original component structure
Verified: landonorris · igloo · immersive-g · ~90-95%
run.py auto-detects whether a sourcemap exists and routes to Path A or Path D. Override manually with --force-path A or --force-path D.
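
The routing decision can be sketched as follows. This is an illustrative sketch, not run.py's actual code: `find_sourcemap_url`, `is_real_sourcemap`, and `choose_path` are hypothetical names. The content check reflects the lesson of D6, where a catch-all route answered 200 with JSON that was not a sourcemap.

```python
from __future__ import annotations

import json
import re

# Trailing "//# sourceMappingURL=..." comment (or the legacy "//@" form).
_MAP_COMMENT = re.compile(r"//[#@]\s*sourceMappingURL=(\S+)\s*$", re.MULTILINE)


def find_sourcemap_url(js_text: str, headers: dict[str, str]) -> str | None:
    """Return the sourcemap reference from response headers or the JS comment."""
    lowered = {k.lower(): v for k, v in headers.items()}
    for name in ("sourcemap", "x-sourcemap"):
        if name in lowered:
            return lowered[name]
    matches = _MAP_COMMENT.findall(js_text)
    return matches[-1] if matches else None


def is_real_sourcemap(body: bytes) -> bool:
    """Validate content: a 200 status alone does not prove a sourcemap exists."""
    try:
        data = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    if not isinstance(data, dict) or data.get("version") != 3:
        return False
    if "sections" in data:  # index map format
        return True
    return isinstance(data.get("sources"), list) and isinstance(data.get("mappings"), str)


def choose_path(map_bodies: list[bytes], force: str | None = None) -> str:
    """Route to Path A when any fetched .map body validates, else Path D."""
    if force in ("A", "D"):
        return force
    return "A" if any(is_real_sourcemap(body) for body in map_bodies) else "D"
```

The `force` argument stands in for the --force-path override described above.
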
§5 · Verified captures

5 sites, all at different difficulty levels.

| Site | Stack | Path | Fidelity | Note |
| --- | --- | --- | --- | --- |
| teetran.com | Gatsby + Prismic | A | 100% | sourcemap → AST → Vite build |
| opalcamera.com | Next.js + GSAP | C | 95%+ | 17 ScrollTriggers · 351-frame sequence |
| landonorris.com | Webflow + Three.js | D | ~95% | 3D helmet scene · social cards · hover images |
| igloo.inc | Full Three.js 3D | D | working | KTX2 · WASM · EXR · 4 textures manually patched |
| immersive-g.com | Vue/Nuxt + Three.js | D | ~90% | SPA live fallback, minor details missing |

Recommended options by site type

| Site type | Options |
| --- | --- |
| Standard marketing (HTML+CSS) | capture.py URL |
| Webflow | --wait 5 |
| Next.js SSR / Nuxt SSR | --wait 5 --crawl |
| Three.js / WebGL 3D | --wait 10 + live fallback |
| Rive / Lottie | --wait 8 |
| GSAP animations | --wait 5 |

§6 · 14 bugs that made it universal

3D, SPA, CDN blocks — each breaks for a different reason.

  • C1 · COEP/COOP headers blocking resources · ✓ fixed
  • C2 · CSS integrity hash mismatch · ✓ fixed
  • C3 · 22 duplicate canvases in rendered DOM (WebGL conflicts) · ★ key
  • C4 · URL encoding mismatches (font/image 404s) · ✓ fixed
  • C5 · CDN 403 blocks JS → site stops working · ✓ fixed
  • C6 · 129 srcset responsive images missed · ✓ fixed
  • C7 · Lazy-load / hidden images not captured · ✓ fixed
  • C8 · Dynamic JS URL construction (runtime fetch) · ★ key
  • C9 · CSS url() external references not rewritten · ✓ fixed
  • C10 · Tracking / dev script errors · ✓ fixed
  • C11 · Protocol-relative URLs (//cdn...) · ✓ fixed
  • C12 · 403 HTML saved as an asset → "Unexpected token '<'" · ✓ fixed
  • C13 · URL prefix substitution collisions · ✓ fixed
  • C14 · 3D textures not captured (Three.js fetch) · ★ key

Detailed write-ups for each bug are in progress. Reach out if you're curious → mattheogim@gmail.com

Error pattern categories (auto-resolution strategies)

A. Missing assets (most common)
  • 3D textures · networkidle + JS scan
  • Progressive quality · auto-detect variants
  • SPA chunks · serve.py live fallback
  • hover/click triggers (partial)
  • CDN 403 · stub + skip 400+
B. URL / paths
  • URL encoding · unquote() fallback
  • Prefix collisions · sort by length
  • Protocol-relative URLs (//) · add entry
  • Dynamic JS URLs · runtime interceptor
C. Rendering
  • SRI hashes · strip integrity
  • Duplicate canvases · use raw HTML
  • COEP/COOP · strip in serve.py
  • Tracking · strip patterns
D. Partially unresolved
  • CSS custom properties (JS-runtime dependent)
  • Fonts · worked around with live fallback
  • JS init-order dependencies
  • Web Worker assets · interceptor doesn't apply
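
The "prefix collisions · sort by length" fix from category B can be sketched like this. It is a hedged illustration, not the code in capture.py: `rewrite_urls` and the example mapping are hypothetical. The idea is that when one original URL is a prefix of another, the longer one has to win, and a protocol-relative entry (//cdn...) is just one more key in the mapping.

```python
import re


def rewrite_urls(text: str, mapping: dict[str, str]) -> str:
    """Rewrite original URL prefixes to local paths, longest prefix first."""
    if not mapping:
        return text
    # Longest keys first: regex alternation takes the first alternative that
    # matches at a position, so ordering by length yields longest-prefix wins.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass also avoids rewriting text that was already rewritten.
    return pattern.sub(lambda match: mapping[match.group(0)], text)


if __name__ == "__main__":
    mapping = {
        "https://cdn.example.com": "/__ext__/cdn",
        "https://cdn.example.com/assets": "/assets",
        "//cdn.example.com": "/__ext__/cdn",  # protocol-relative entry
    }
    print(rewrite_urls('<img src="https://cdn.example.com/assets/a.png">', mapping))
    print(rewrite_urls('<script src="//cdn.example.com/b.js"></script>', mapping))
```

With shortest-first ordering, the first example would come out as /__ext__/cdn/assets/a.png, which is the collision this ordering prevents.
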
§7 · 12 decisions

The map of how it got this way.

| # | Decision | Summary |
| --- | --- | --- |
| D1 | regex → AST | Nested JSX / scope analysis impossible → introduced Babel AST |
| D2 | stdout JSON | Node writeFileSync EPERM → routed via stdout |
| D3 | Gatsby → Vite | Gatsby runtime too heavy → plain React + Vite |
| D4 | navigate() → pushState() | Prevents full reload → preserves SPA routing |
| D5 | opalcamera first | Build the Next.js static base before tackling immersive-g (extreme) |
| D6 | kprverse sourcemap false positive | Nuxt catch-all returns 200+JSON → content validation required |
| D7 | Expanded source discovery | deep_check.sh went from 9 to 23 methods |
| D8 | Split docs structure | One doc → four (current/decisions/archive/vision) |
| D9 | Introduced Path C mirroring | Raw HTML+JS as-is → restore interactions via hydration |
| D10 | kprverse mirror failure | Three.js timing-sensitive → confirmed limits of the proxy approach |
| D11 | Introduced Path D ★ | Playwright capture = the general-purpose solution. Promoted to primary |
| D12 | Skip all 400+ | From specific extensions → all (prevents CDN HTML errors) |
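
One way to read D12 together with the "403 stub JS" rule is sketched below. This is an interpretation, not the logic in capture.py: `triage_response`, `body_to_store`, and the stub contents are hypothetical. The point is that an error page saved under a .js name is what produces "Unexpected token '<'", so no 400+ body is stored whatever the extension, and a 403'd script gets a harmless placeholder.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Placeholder written in place of a script the CDN refused to serve.
STUB_JS = b"/* stub: origin returned 403 during capture */\n"


def triage_response(url: str, status: int) -> str:
    """Classify a captured response as 'save', 'stub', or 'skip'."""
    if status < 400:
        return "save"
    path = urlsplit(url).path.lower()
    if status == 403 and path.endswith((".js", ".mjs")):
        # The script tag still loads, but as an empty file, not an HTML error page.
        return "stub"
    # Every other 400+ response is dropped, regardless of file extension.
    return "skip"


def body_to_store(url: str, status: int, body: bytes) -> bytes | None:
    """Return the bytes to write to disk, or None when nothing should be saved."""
    action = triage_response(url, status)
    if action == "save":
        return body
    if action == "stub":
        return STUB_JS
    return None
```
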
§8 · serve.py · live fallback

It's not just a static server.

BROWSER → GET /some-asset → serve.py (localhost:8080)
  • Local file exists → return immediately, file as-is
  • Local file missing (404) → fetch from origin server → cache locally → return
Fast · Captured site
Progressive offline · Cache grows with use · Eventually --offline works

This design enables SPA page transitions (chunks for other pages fetched live), auto-recovery of missing assets, and progressive offline conversion. The --offline flag serves purely from the local store, with no origin-server access.
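
The lookup order can be sketched as a single function. This is a minimal sketch under stated assumptions, not serve.py's actual code: `resolve_asset`, `fetch_from_origin`, and the on-disk naming are hypothetical, and the unquote() retry mirrors the URL-encoding fallback listed among the error patterns.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable
from urllib.parse import unquote
from urllib.request import urlopen


def fetch_from_origin(origin: str, url_path: str) -> bytes:
    """Fetch one asset from the live site (network access)."""
    with urlopen(origin.rstrip("/") + url_path, timeout=10) as resp:
        return resp.read()


def _inside(root: Path, rel: str) -> Path | None:
    """Resolve rel under root, refusing paths that escape the capture directory."""
    target = (root / rel).resolve()
    try:
        target.relative_to(root.resolve())
    except ValueError:
        return None
    return target


def resolve_asset(
    root: Path,
    url_path: str,
    origin: str,
    offline: bool = False,
    fetch: Callable[[str, str], bytes] = fetch_from_origin,
) -> bytes | None:
    """Return the asset bytes, or None when the server should answer 404."""
    rel = url_path.split("?", 1)[0].lstrip("/")
    if not rel or rel.endswith("/"):
        rel += "index.html"
    # Try the path as requested, then its percent-decoded form.
    candidates = [_inside(root, rel), _inside(root, unquote(rel))]
    if any(candidate is None for candidate in candidates):
        return None
    for target in candidates:
        if target.is_file():
            return target.read_bytes()  # local hit: served as-is
    if offline:
        return None  # --offline: never contact the origin
    try:
        body = fetch(origin, url_path)
    except OSError:  # URLError and HTTPError are OSError subclasses
        return None
    # Cache the fetched asset so the next request is a local hit.
    cache_target = candidates[1]
    cache_target.parent.mkdir(parents=True, exist_ok=True)
    cache_target.write_bytes(body)
    return body
```

A function like this could sit behind the `do_GET` method of an `http.server.BaseHTTPRequestHandler`. Injecting `fetch` keeps the cache logic testable without network access.
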

§9 · Scripts

Each one has its own job.

| Script | Path | What it does |
| --- | --- | --- |
| run.py | router | Auto-detects (sourcemap or not) → routes to Path A / D |
| capture.py ★ | D | Primary engine · 1,076 lines · accumulated from 14 bug fixes |
| collect.py | A | Collects HTML, CSS, JS, images, fonts, 3D, screenshots |
| check_sourcemap.py | A | Checks for JS .map files · parses X-SourceMap · sourceMappingURL |
| restore_source.py | A | Sourcemap → recreate original structure from webpack module paths |
| transform.mjs | A | Babel AST · framework conversion (Gatsby → Vite) |
| assemble.py | A | Scaffolds a Vite project → buildable clone |
| deep_check.sh | util | Probes source-code accessibility through 23 methods |
| parse_css.py | util | Extracts CSS selectors, media queries, variables |
§10 · Block extraction (design complete)

Next step: decompose the site into Legos.

A block = one self-contained web interaction technique → extracted as a React component.

Pipeline:
CSS technique inventory → JS pattern detection → DOM section splitting → technique ↔ section mapping → LLM-assisted block assembly → standalone-render verification

Output shape:
data/sites/{domain}/blocks/
├── blocks.json
├── block-001-text-slide-up.jsx
├── block-002-marquee.jsx
└── all-blocks.jsx

10 categories: text-animation · scroll-behavior · layout-transition · hover-effect · typography · 3d-scene · media · navigation · cursor · loading
7 triggers: load · scroll · hover · click · mousemove · resize · time
3 difficulty levels: simple (CSS-only) · medium (CSS+JS) · complex (external lib)
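
Since the block pipeline is designed but not yet implemented, the following is only a sketch of what one blocks.json entry and a reverse lookup might look like. The `Block` fields, `find_blocks`, `to_blocks_json`, and the example category assignments are all hypothetical. Only the category, trigger, and difficulty vocabularies come from the taxonomy above.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass

CATEGORIES = frozenset({
    "text-animation", "scroll-behavior", "layout-transition", "hover-effect",
    "typography", "3d-scene", "media", "navigation", "cursor", "loading",
})
TRIGGERS = frozenset({"load", "scroll", "hover", "click", "mousemove", "resize", "time"})
DIFFICULTIES = frozenset({"simple", "medium", "complex"})


@dataclass(frozen=True)
class Block:
    """One self-contained interaction technique (field names are assumptions)."""
    id: str
    name: str
    category: str
    triggers: tuple[str, ...]
    difficulty: str
    file: str

    def __post_init__(self) -> None:
        # Reject values outside the 10 / 7 / 3 taxonomy.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        unknown = set(self.triggers) - TRIGGERS
        if unknown:
            raise ValueError(f"unknown triggers: {sorted(unknown)}")
        if self.difficulty not in DIFFICULTIES:
            raise ValueError(f"unknown difficulty: {self.difficulty}")


def find_blocks(blocks: list[Block], category: str | None = None,
                trigger: str | None = None) -> list[Block]:
    """Filter blocks by category and/or trigger, keeping input order."""
    return [
        block for block in blocks
        if (category is None or block.category == category)
        and (trigger is None or trigger in block.triggers)
    ]


def to_blocks_json(blocks: list[Block]) -> str:
    """Serialize blocks to a JSON array, one object per block."""
    return json.dumps([asdict(block) for block in blocks], indent=2)
```

Validating at construction time keeps the index searchable by a fixed vocabulary, which is what a "give me 10 blocks with this concept" query would rely on.
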
🎯
Open questions · what I'm thinking about now

If you've read this far, a single line on any of these would help a lot.

Q1 A lighter-weight capture approach — right now I spin up a real browser to capture the whole page (Playwright) and attempt original-source restoration (sourcemap). Is there a way to dissect web design at a similar fidelity without a browser?
Q2 Predicting runtime assets — libraries like Three.js fetch additional files mid-execution. Today I trace through network logs after the run, but is there a way to detect them ahead of time?
Q3 How to feed this to AI — when handing extracted components (blocks) to an AI as learning references, is it better to give the entire set, or just representative examples per category?
Q4 Copyright — to what extent is it legitimate to capture publicly-shipped sites as "learning references" and use them for AI training / generation? Creative-ownership questions worry me most.
If you have thoughts, email me → mattheogim@gmail.com
§11 · Roadmap

Where it is · where it's going.

With Path A and D done, moving into the block-extraction phase.

Capture · Path A+D → Block extraction (← now) → AI integration
Now · Shipped

Already working

  • Path A (sourcemap restore) · Path D (Playwright capture, the general-purpose solution)
  • All 14 bugs resolved
  • 5 sites verified (teetran · opalcamera · landonorris · igloo · immersive-g)
  • serve.py live fallback
  • 12 design decisions logged (DECISION_LOG)
  • 23 source-discovery methods (deep_check.sh)
  • capture.py stabilized at 1,076 lines
Next · In progress

Currently working on

  • Block extraction pipeline (10 categories · 7 triggers · 3 difficulty levels)
  • Auto-generated CSS technique inventory
  • JS pattern detection + DOM section splitting
  • Standalone-render verification per block
Later · Backlog

Future plans

  • Block library → fine-tune AI code generation
  • Site → block auto-recommendation (search)
  • Block composition preview editor
  • Auto-expanded capture targets (Awwwards top 100)