Visual Pairing System.
Audio → visual identity layer for Hook Objects. Each visual pairing should make the song feel more immediate, more artist-authored, and more shareable.
Casset's thesis is artist homes for audiovisual identity in the generative media era. Visual Pairing is useful only when it strengthens the Hook Object: the audio moment, lyric timing, visual identity, share artifact, provenance, and return path. It should not become a standalone visual marketplace or generated-content layer before the world format is loved.
1. Product overview
What it is
A system that lets every track on Casset carry multiple visual interpretations. Today a track has one cover image; going forward it has a first-class library of visuals: short vertical loops (5–30s) that can be paired with a hook and exported as a 9:16 video.
Sources of visuals:
- Artist-uploaded — the artist's own aesthetic (photos, stills, disposable-camera imagery, BTS, short clips).
- Release-native — cover art, saved stills, and Shader Lab treatments already attached to the hook.
- Community-created future — filmmakers, motion designers, and 3D artists submit visuals to a track or to a marketplace.
- Feature-flagged generation future — architecture stays parked for later premium packs or alternate pressings, but does not define V1.
Core artist actions
- Browse visuals associated with their track.
- Select a visual as the "active" pairing.
- Export the hook as a 1080×1920 video using that visual.
- Share — TikTok / IG / X / Casset link.
Why it matters
- Identity over decoration. Most artists don't have a director on retainer. A cover image is thin. A moving visual is a brand.
- Share-worthy by default. Vertical video is the unit of distribution. A static cover on a hook clip loses to any competitor with motion.
- Creator flywheel. Visual makers currently have no way to build a career around music. Casset becomes the exchange.
- Network effect per track. Every track becomes surface area for multiple creators → more shares, more variants, more discovery.
2. Foundation — already shipped
We are not building this from scratch. The pairing system is mostly a composition of systems already in production. Every row below is a primitive we reuse in V1–V3.
- Track.previewStartSec, lib/hook-constants.ts, components/hooks/
- lib/tiktok-video.ts: generateTikTokVideo(). Canvas + WebCodecs + mp4-muxer + Meyda. No server ffmpeg.
- analyzeAudioAsync() + renderFrame() in lib/tiktok-video.ts
- pickQualityPreset() in lib/tiktok-video.ts: auto-fallback on mobile
- app/preview/ExportProgressModal.tsx
- videoCache keyed on trackId:start:duration[:shareUrl]
- Artist, User models; /u/[username]; preview pages
- Artist.stripeAccountId, User.stripeAccountId, /api/checkout/*, bounty payout pipeline in lib/bounty-payout.ts
- HookShare model, /s/{shareId}, referral attribution
- /api/upload/media: mp4/mov/webm up to 500MB, magic-byte validation in lib/magic-bytes.ts
- ArtistMedia model: per-artist; useful precedent, not the target surface
- HookVideoSubmission: per-track, PENDING/APPROVED/REJECTED moderation + upload source + attribution. Schema-level blueprint for community visuals.
- VisualGenerationJob, visual-pack metadata, and Casset Studios runtime hooks remain feature-flagged for a later expansion path.
generateTikTokVideo() loop + a single upload UI. No new infra primitives required.
3. User flows
A. Artist flow
Artist opens track in Studio
→ "Visuals" tab (new)
→ V1: Upload one visual (MP4/MOV/WebM, ≤ 30s, 9:16 recommended)
→ V2+: Browse attached visuals (own + community + AI) — filter by mood/BPM/tag
→ Tap visual to preview with the hook (real-time, in-app)
→ Select "Use this visual" → becomes active pairing for the track
→ Export hook video (reuses existing TikTok-ready pipeline, now with visual layer)
→ Share sheet → TikTok / IG / X / copy link / Casset share URL
B. Visual creator flow (V3)
Creator signs up → claims handle → completes VisualCreator profile
→ Upload a visual (loop) — mp4/mov/webm, ≤ 30s
→ Tag: mood (chill / dark / euphoric / nostalgic / ...), BPM range, genre affinity,
aesthetic (film / 3D / anime / generative)
→ Either:
(a) Attach directly to a specific track (if invited / open submission)
(b) Submit to marketplace — discoverable by any artist
→ Optional: set licensing tier (free / paid / rev-share)
→ Dashboard: impressions, pairings, exports, shares, earnings
C. Viewer / fan flow
Viewer lands on a hook preview (feed card, share link, profile page)
→ Sees the artist's currently active visual looping behind the hook
→ Taps the "visual switcher" affordance (bottom-left chip, say)
→ Swipes through alternative visuals for this track
→ Can share "this version" — the share URL encodes the visual choice
→ Recipient opens the link → same track, same hook, that visual → exportable as their own share
Fans don't edit; they curate. The act of sharing a specific pairing is a feature.
4. Feature breakdown (V1 → V3)
V1: MVP (ship fast, 2–3 weeks)
Goal: prove the creative loop (upload visual → export video → share) end-to-end with minimum surface.
- Artist can upload a single visual per track. One VisualAsset row, kind = ARTIST, isActive = true.
- Visual replaces the static cover background in the export pipeline. Waveform bars, avatar, title typography, watermark all remain.
- Preview plays the visual behind the audio hook on the track detail page and in the export modal.
- Fallback: tracks without a visual render exactly as today (no regression).
- No marketplace, no AI, no tags, no moderation queue — artist-only uploads are implicitly trusted (same trust model as their audio upload today).
- Storage: Vercel Blob via existing /api/upload/media. MP4/MOV/WebM, ≤ 30s, ≤ 500MB already enforced.
Out of scope for V1: multiple visuals per track, fan-facing switcher, generated visuals, any revenue share, moderation UI.
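The V1 upload constraints above (MP4/MOV/WebM, at most 30s, at most 500MB, 9:16 recommended) could be checked once more before a VisualAsset row is created. This is a minimal sketch, not the actual /api/upload/media logic: validateVisualUpload, its input shape, and the 1% aspect tolerance are all hypothetical.

```typescript
// Hypothetical pre-create check; names and tolerances are assumptions.
type VisualUploadMeta = {
  mime: string;
  sizeBytes: number;
  durationSec: number;
  widthPx: number;
  heightPx: number;
};

type UploadCheck = { ok: boolean; errors: string[]; warnings: string[] };

const ALLOWED_VISUAL_MIME = ["video/mp4", "video/quicktime", "video/webm"];
const MAX_VISUAL_DURATION_SEC = 30;
const MAX_VISUAL_BYTES = 500 * 1024 * 1024;

function validateVisualUpload(meta: VisualUploadMeta): UploadCheck {
  const errors: string[] = [];
  const warnings: string[] = [];

  if (!ALLOWED_VISUAL_MIME.includes(meta.mime)) {
    errors.push(`unsupported type: ${meta.mime}`);
  }
  if (meta.sizeBytes > MAX_VISUAL_BYTES) {
    errors.push("file exceeds 500MB");
  }
  if (!(meta.durationSec > 0) || meta.durationSec > MAX_VISUAL_DURATION_SEC) {
    errors.push("duration must be greater than 0 and at most 30 seconds");
  }
  if (meta.widthPx <= 0 || meta.heightPx <= 0) {
    errors.push("invalid dimensions");
  } else {
    // 9:16 is recommended, not required, so a mismatch is only a warning.
    const ratio = meta.widthPx / meta.heightPx;
    if (Math.abs(ratio - 9 / 16) > 0.01) {
      warnings.push("not 9:16; the export will crop or pad");
    }
  }
  return { ok: errors.length === 0, errors, warnings };
}
```

Keeping the aspect-ratio mismatch as a warning rather than an error matches the "recommended" wording in the artist flow.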
V2: Multi-visual + browse
Goal: each track becomes an actual library, and the fan switcher exists.
- Multiple VisualAsset rows per track; one marked isActive.
- Artist UI: "Add visual" → upload or select from their previous uploads. Reorder, archive.
- Tag system (lightweight): mood, genre, optional bpmMin/bpmMax. Free-text allowed but normalized via a small allowlist server-side.
- Fan visual switcher on preview + feed cards. Swipe between visuals without interrupting audio.
- Share URL encodes visual selection. ?v={visualAssetId} resolves server-side so the exported video inherits the chosen visual.
- Export cache key updated to include visual ID.
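The share-URL encoding in the V2 list could look roughly like this. Only the v query parameter comes from the spec; the helper names and the id format check are assumptions for illustration (cuid-style ids are treated here as lowercase alphanumerics).

```typescript
// Hypothetical helpers; only the `v` parameter name comes from the spec.
function withVisualParam(shareUrl: string, visualAssetId: string | null): string {
  const url = new URL(shareUrl);
  if (visualAssetId) {
    url.searchParams.set("v", visualAssetId);
  } else {
    // No explicit choice: the recipient sees the artist's active pairing.
    url.searchParams.delete("v");
  }
  return url.toString();
}

function readVisualParam(shareUrl: string): string | null {
  const v = new URL(shareUrl).searchParams.get("v");
  // Assumed id shape: lowercase alphanumerics. Reject anything else early
  // so malformed values never reach the database lookup.
  return v !== null && /^[a-z0-9]+$/.test(v) ? v : null;
}
```

A null result from readVisualParam would fall back to the track's active visual, which keeps old share links working unchanged.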
V3: Marketplace layer
Goal: creator ecosystem with attribution + optional payout.
- VisualCreatorProfile: distinct identity surface (reuses User; adds profile fields + optional Stripe payout via existing User.stripeAccountId).
- VisualSubmission: creator-submitted visuals with PENDING/APPROVED/REJECTED (modeled directly on HookVideoSubmission).
- Marketplace browse: artists search visuals by tag, mood, BPM, creator. Attach with one tap → creates a VisualAsset referencing the submission.
- Attribution: exported video frames carry a subtle "visuals by @handle" watermark line below the existing watermark. Always shown; cannot be removed.
- Metrics per visual: impressions (fan views), pairings (times selected by an artist), exports, shares.
- Optional rev-share: creators mark a visual as paid (one-time unlock, priced in cents) or rev-share (% of future bounty pool). Payouts ride the existing Stripe Connect transfer pipeline (lib/bounty-payout.ts).
- Feature-flagged visual generation future: optional alternate visual packs seeded by hook identity, palette, lyrics, and artist visual DNA. Stored with provenance metadata, but kept out of the primary Studios workflow until quality and positioning are ready.
5. Data model
All new tables. Prisma-shaped — aligned to existing Casset conventions (cuid, createdAt/updatedAt, camelCase, @@index where queried, app-layer enforcement of single-active matching the Bounty winner pattern).
enum VisualKind {
  ARTIST     // uploaded by the track's artist
  COMMUNITY  // submitted by a VisualCreator, approved
  AI         // feature-flagged future generation provider
}
enum VisualStatus {
  PROCESSING  // upload/render in flight
  READY       // usable in pairings + exports
  FAILED      // transcode or generation error
  ARCHIVED    // soft-hidden by owner
}
enum VisualSubmissionStatus {
  PENDING
  APPROVED
  REJECTED
}
model VisualAsset {
  id      String       @id @default(cuid())
  trackId String
  track   Track        @relation(fields: [trackId], references: [id], onDelete: Cascade)
  kind    VisualKind
  status  VisualStatus @default(PROCESSING)

  // Source
  uploaderUserId String?
  uploader       User?             @relation("VisualUploader", fields: [uploaderUserId], references: [id], onDelete: SetNull)
  submissionId   String?           @unique
  submission     VisualSubmission? @relation(fields: [submissionId], references: [id])

  // AI provenance (kind = AI only)
  sourceModel      String?
  sourcePromptHash String?
  sourcePromptText String? @db.VarChar(2000)

  // Media
  videoUrl       String  @db.VarChar(2048)
  posterImageUrl String? @db.VarChar(2048)
  durationSec    Float
  widthPx        Int
  heightPx       Int
  fps            Int?

  // Pairing
  isActive Boolean @default(false)

  // Metrics (denormalized; authoritative counts via events)
  impressions Int @default(0)
  pairings    Int @default(0)
  exports     Int @default(0)
  shares      Int @default(0)

  tags      VisualTag[] @relation("VisualAssetTags")
  createdAt DateTime    @default(now())
  updatedAt DateTime    @updatedAt

  @@index([trackId, status])
  @@index([trackId, isActive])
  @@index([uploaderUserId, createdAt])
  @@index([kind, status, createdAt])
}
model VisualTag {
  id       String        @id @default(cuid())
  slug     String        @unique // "chill", "dark", "film", "3d", ...
  label    String
  category String        // "mood" | "aesthetic" | "genre"
  assets   VisualAsset[] @relation("VisualAssetTags")

  @@index([category])
}
model VisualSubmission {
  id              String @id @default(cuid())
  submitterUserId String
  submitter       User   @relation(fields: [submitterUserId], references: [id], onDelete: Cascade)
  targetTrackId   String?
  targetTrack     Track? @relation(fields: [targetTrackId], references: [id], onDelete: SetNull)

  videoUrl       String  @db.VarChar(2048)
  posterImageUrl String? @db.VarChar(2048)
  durationSec    Float
  widthPx        Int
  heightPx       Int
  note           String? @db.VarChar(280)
  pitchTagsJson  Json?

  status         VisualSubmissionStatus @default(PENDING)
  reviewerUserId String?
  reviewedAt     DateTime?
  rejectedReason String?                @db.VarChar(280)

  // Licensing
  licenseKind String @default("FREE") // "FREE" | "PAID_ONE_TIME" | "REV_SHARE"
  priceCents  Int?
  revSharePct Int?

  createdAt DateTime @default(now())
  updatedAt DateTime @updatedAt
  asset     VisualAsset?

  @@index([status, createdAt])
  @@index([targetTrackId, status])
  @@index([submitterUserId, createdAt])
}
model VisualCreatorProfile {
  id          String  @id @default(cuid())
  userId      String  @unique
  user        User    @relation(fields: [userId], references: [id], onDelete: Cascade)
  displayName String?
  bio         String? @db.VarChar(500)
  toolsJson   Json?
  websiteUrl  String?
  reelUrl     String?

  // Aggregates (computed, not source-of-truth)
  totalSubmissions   Int @default(0)
  totalApprovals     Int @default(0)
  totalPairings      Int @default(0)
  totalExports       Int @default(0)
  totalEarningsCents Int @default(0)

  createdAt DateTime @default(now())
  updatedAt DateTime @updatedAt
}
Relationships
- Track 1…N VisualAsset (cascade on delete)
- Track 1…N VisualSubmission (nullable target for marketplace-only submissions)
- VisualAsset N…N VisualTag
- VisualSubmission 1…0/1 VisualAsset (approval creates the asset)
- User 1…N VisualAsset (as uploader)
- User 1…0/1 VisualCreatorProfile
Why not reuse ArtistMedia or HookVideoSubmission?
- ArtistMedia is per-artist, not per-track, and is positioned as bonus download material: conceptually and relationally the wrong surface.
- HookVideoSubmission is per-track but represents fan-made clips of the hook (i.e. another artifact: people reacting to the track), not reusable visual beds.
Keeping them separate avoids overloaded semantics in queries, metrics, and payouts.
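The single-active rule noted at the top of this section is enforced in the app layer rather than by a database constraint. A minimal in-memory sketch of that transition follows; PairingRow and setActiveVisual are hypothetical names, and a real handler would presumably perform both writes inside one database transaction.

```typescript
// Hypothetical in-memory model of the single-active rule.
type PairingRow = {
  id: string;
  trackId: string;
  status: "PROCESSING" | "READY" | "FAILED" | "ARCHIVED";
  isActive: boolean;
};

// Returns the new row state; throws if the target cannot become active.
function setActiveVisual(rows: PairingRow[], trackId: string, targetId: string): PairingRow[] {
  const target = rows.find((r) => r.id === targetId && r.trackId === trackId);
  if (!target) throw new Error("visual does not belong to this track");
  if (target.status !== "READY") throw new Error("only READY visuals can be activated");

  // Clear every other row on the same track and set the target in one pass.
  // Rows on other tracks are returned untouched.
  return rows.map((r) =>
    r.trackId === trackId ? { ...r, isActive: r.id === targetId } : r
  );
}
```

Doing the clear and the set in one pass means there is no intermediate state where a track has zero or two active visuals.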
6. API design
All routes under /api/visuals/* and /api/tracks/[trackId]/visuals/*. Auth + rate limits follow existing patterns (getUserIdFromRequest, rateLimit).
Upload + create
POST /api/upload/media ← already exists, reused as-is
  → returns { url, sizeBytes, mime, category: "video" }
POST /api/tracks/:trackId/visuals ← new
  body: { videoUrl, posterImageUrl?, durationSec, widthPx, heightPx,
          kind: "ARTIST", tagSlugs?: string[] }
  → creates VisualAsset (PROCESSING → READY after server-side probe)
  → 403 if caller is not the track's artist (unless submission flow)
Browse / fetch
GET /api/tracks/:trackId/visuals ← new
  ?kind=ARTIST|COMMUNITY|AI&active=true|false&limit=&cursor=
  → paginated VisualAssets (READY only) with tags, creator, metrics
  → anonymous callers allowed (same visibility as hook preview)
GET /api/visuals/:id ← new
  → single asset with creator profile expanded
Selection (set active pairing)
PATCH /api/tracks/:trackId/visuals/:id/active ← new
  body: { active: true }
  → artist-only; transactionally clears previous active, sets this one active
  → pattern mirrors /api/hooks/submissions/[id]/winner/route.ts
Export
POST /api/hooks/tiktok-video ← already exists
  body: { trackId, hookStartSec, hookDurationSec, visualAssetId?, shareUrl? }
  → server returns signed URLs + metadata; encoding runs client-side
    via generateTikTokVideo()
Submission flow (V3)
POST /api/visuals/submissions ← new (marketplace OR targeted)
  body: { targetTrackId?, videoUrl, ..., licenseKind,
          priceCents?, revSharePct?, pitchTagsJson }
GET /api/visuals/submissions ← new (admin + submitter's own list)
POST /api/visuals/submissions/:id/approve ← new (admin) → creates VisualAsset
POST /api/visuals/submissions/:id/reject ← new (admin) → sets rejectedReason
Attach (artist accepts a community visual)
POST /api/tracks/:trackId/visuals/attach ← new (V3)
  body: { submissionId }
  → creates a VisualAsset linked to the submission
  → if licenseKind=PAID_ONE_TIME: creates a Stripe PaymentIntent (artist pays)
  → if REV_SHARE: records licenseKind on the asset for future payout split
Metrics ingestion
POST /api/visuals/:id/event ← new, fire-and-forget
  body: { type: "impression" | "pairing" | "export" | "share" }
  → rate-limited, client-fired on actual UX events
  → increments denormalized counters, emits to an analytics sink
7. Export system integration
The highest-leverage integration point. The existing pipeline in lib/tiktok-video.ts already does 95% of the work. We're not rebuilding — we're swapping one draw call.
Current architecture
generateTikTokVideo() runs entirely in the browser:
- Fetch cover image + avatar + audio in parallel.
- Decode audio via OfflineAudioContext.
- Run Meyda on the mono downmix to produce per-frame waveform bars.
- Pre-render a static background canvas (cover image + gradient + overlays).
- Per frame: drawImage(bgCanvas) → draw waveform → draw avatar/text/watermark.
- Encode via VideoEncoder, mux to MP4 via mp4-muxer.
The static background is the extension point.
V1 integration: video background
- Accept optional visualAssetId in TikTokVideoParams.
- Resolve to a videoUrl server-side (signed if needed); pass to the client.
- Client-side: instead of pre-rendering a static bg canvas, load an offscreen <video> element, seek to 0, play muted + looped, and drawImage(videoEl, ...) on each frame in the encoding loop.
- The waveform, avatar, typography, and watermark all layer on top unchanged.
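Because the encoding loop does not necessarily run at real-time speed, one way to keep the looped background deterministic is to derive the source timestamp for each output frame instead of relying on wall-clock playback. The sketch below shows only that mapping; loopedSourceTime is a hypothetical helper and is not taken from lib/tiktok-video.ts.

```typescript
// Maps an output frame index to a timestamp inside a looping source clip.
// Hypothetical helper: the real pipeline may rely on live playback instead.
function loopedSourceTime(frameIndex: number, fps: number, clipDurationSec: number): number {
  if (fps <= 0 || clipDurationSec <= 0) {
    throw new RangeError("fps and clipDurationSec must be positive");
  }
  const outputTimeSec = frameIndex / fps;
  // Wrap around so a 5s clip repeats cleanly under a 30s hook.
  return outputTimeSec % clipDurationSec;
}
```

On the client, each value could be assigned to the video element's currentTime, waiting for the seeked event before calling drawImage. That trades some speed for frame-accurate, repeatable exports.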
Aspect ratio handling (9:16)
- Target surface is 1080×1920 logical. If the source visual is:
  - 9:16 already → cover-fill, no crop.
  - 16:9 or 1:1 → center-crop with a slight zoom (preserve the "identity" feel vs letterboxing).
  - Portrait but not 9:16 → fit with blurred backdrop extension (reuse the cover-image treatment we already have).
- Decision function lives in lib/visuals/fit.ts (new), returns { sx, sy, sw, sh, dx, dy, dw, dh } consumed by drawImage.
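A possible shape for the decision function in lib/visuals/fit.ts, following the three cases above. The rectangle fields match the spec; the extra mode field, the 1.05 default zoom, and the ratio tolerance are assumptions added for illustration.

```typescript
type FitRect = {
  sx: number; sy: number; sw: number; sh: number; // source rectangle
  dx: number; dy: number; dw: number; dh: number; // destination rectangle
};
// `mode` is an assumed extra so the caller knows when to paint a backdrop.
type FitResult = FitRect & { mode: "fill" | "crop" | "contain" };

const TARGET_W = 1080;
const TARGET_H = 1920;

function fitVisual(srcW: number, srcH: number, zoom = 1.05): FitResult {
  if (srcW <= 0 || srcH <= 0) throw new RangeError("source dimensions must be positive");
  const targetRatio = TARGET_W / TARGET_H;
  const srcRatio = srcW / srcH;
  const full = { dx: 0, dy: 0, dw: TARGET_W, dh: TARGET_H };

  // Already 9:16 (within tolerance): draw the whole frame, no crop.
  if (Math.abs(srcRatio - targetRatio) < 0.005) {
    return { mode: "fill", sx: 0, sy: 0, sw: srcW, sh: srcH, ...full };
  }
  // Landscape or square: center-crop a 9:16 window, shrunk by `zoom`.
  if (srcRatio >= 1) {
    const sh = srcH / zoom;
    const sw = sh * targetRatio;
    return { mode: "crop", sx: (srcW - sw) / 2, sy: (srcH - sh) / 2, sw, sh, ...full };
  }
  // Portrait but not 9:16: contain; the caller paints the blurred backdrop.
  const scale = Math.min(TARGET_W / srcW, TARGET_H / srcH);
  const dw = srcW * scale;
  const dh = srcH * scale;
  return {
    mode: "contain",
    sx: 0, sy: 0, sw: srcW, sh: srcH,
    dx: (TARGET_W - dw) / 2, dy: (TARGET_H - dh) / 2, dw, dh,
  };
}
```

The eight numeric fields map directly onto the nine-argument form of drawImage: ctx.drawImage(videoEl, sx, sy, sw, sh, dx, dy, dw, dh).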
Watermarking
- Existing watermark (Casset logo + optional share URL) stays unchanged.
- V3 only: when the active visual is kind != ARTIST, append a second line: "visuals by @creatorHandle", rendered in the same watermark block at slightly reduced opacity. Always present: this is the attribution guarantee that keeps creators contributing.
Waveform rendering
- No changes to Meyda analysis.
- Contrast guard: sample the visual's mean luminance across the hook duration. If the visual is very bright, darken the waveform container with a 30–40% bottom gradient scrim. Keeps bars legible without muddying the visual.
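The contrast guard could be sketched as two small functions: one that averages brightness over sampled RGBA pixels, and one that maps it to a scrim opacity in the 30–40% band. The 0.6 brightness threshold and the linear ramp are assumptions; the Rec. 709 weights are applied to gamma-encoded values, which is an approximation that is usually adequate for a legibility check.

```typescript
// Mean luma (0..1) of RGBA pixel data, using Rec. 709 weights on
// gamma-encoded channel values (an approximation of true luminance).
function meanLuma(rgba: Uint8ClampedArray): number {
  const pixels = rgba.length / 4;
  if (pixels === 0) return 0;
  let sum = 0;
  for (let i = 0; i < rgba.length; i += 4) {
    sum += 0.2126 * rgba[i] + 0.7152 * rgba[i + 1] + 0.0722 * rgba[i + 2];
  }
  return sum / pixels / 255;
}

// Maps brightness to scrim opacity: none below the (assumed) threshold,
// then a linear ramp from 0.3 to 0.4 as the visual approaches pure white.
function scrimOpacity(luma: number, brightThreshold = 0.6): number {
  if (luma < brightThreshold) return 0;
  const t = Math.min(1, (luma - brightThreshold) / (1 - brightThreshold));
  return 0.3 + 0.1 * t;
}
```

Averaging meanLuma over a handful of frames sampled across the hook duration would give the single value passed to scrimOpacity.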
Performance
- Per-frame drawImage on a decoded video is cheap on modern hardware (WebCodecs + VideoFrame from HTMLVideoElement is the fast path).
- Existing quality presets (high/standard/low) already auto-downgrade on mobile, so no new branching.
- Cache key gains visualAssetId: ${trackId}:${start}:${duration}:${visualAssetId}:${shareUrl}.
- Budget target: export time stays under 30s on mid-tier mobile for a 30s hook with visual.
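A sketch of the extended cache key. The segment order follows the spec; using empty segments for a missing visual or share URL is an assumption (the existing key treats shareUrl as an optional trailing segment), chosen here so positions stay fixed.

```typescript
type ExportCacheKeyParts = {
  trackId: string;
  startSec: number;
  durationSec: number;
  visualAssetId?: string | null;
  shareUrl?: string | null;
};

function exportCacheKey(p: ExportCacheKeyParts): string {
  // Fixed positions: an empty segment means "no visual" or "no share URL",
  // so a cover-only export never collides with a visual export.
  // shareUrl stays last because it contains ":" itself.
  return [
    p.trackId,
    p.startSec,
    p.durationSec,
    p.visualAssetId ?? "",
    p.shareUrl ?? "",
  ].join(":");
}
```

With this layout, switching the visual on the same hook always produces a new key, so a cached cover-only render is never served for a visual export.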
Fallback
- Track with no active visual → existing codepath, zero behavioral change.
- Visual fails to decode / load → falls back to static cover bg (same codepath as today), non-fatal. Log + surface a toast.
8. Matching system (future-facing)
We start dumb, we end smart. The V1 data model must not close the door on V3 matching.
Phase 1 — Tags (V2)
- Artist visuals and community submissions both carry tags.
- UI exposes tag filters: mood, aesthetic, genre, BPM range.
- "Suggested visuals" on a track = intersection(track genre tags, visual tags) ranked by recency + pairing count.
- Zero ML. Zero embeddings. Ship this.
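Phase 1 fits in a few lines. In the sketch below, suggestVisuals and the TaggedVisual shape are hypothetical, and the ordering (tag overlap first, then pairing count, then recency) is one assumed reading of "ranked by recency + pairing count".

```typescript
type TaggedVisual = {
  id: string;
  tagSlugs: string[];
  pairings: number;
  createdAt: Date;
};

function suggestVisuals(trackTags: string[], visuals: TaggedVisual[], limit = 10): TaggedVisual[] {
  const wanted = new Set(trackTags.map((t) => t.trim().toLowerCase()));
  return visuals
    // Count how many of the visual's tags intersect the track's tags.
    .map((v) => ({ v, overlap: v.tagSlugs.filter((s) => wanted.has(s.toLowerCase())).length }))
    .filter((x) => x.overlap > 0)
    // Assumed ordering: more overlap, then more pairings, then newer first.
    .sort(
      (a, b) =>
        b.overlap - a.overlap ||
        b.v.pairings - a.v.pairings ||
        b.v.createdAt.getTime() - a.v.createdAt.getTime()
    )
    .slice(0, limit)
    .map((x) => x.v);
}
```

Visuals with no shared tags are dropped rather than ranked last, which keeps the suggestion list explainable from tags alone.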
Phase 2 — Audio-feature suggestion (V2.5)
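- Reuse Meyda to extract per-track features on upload (RMS, spectral centroid, tempo estimate, mode). Meyda exposes RMS and spectral centroid directly; tempo and mode would need a separate estimator layered on top.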
- Persist a compact TrackFeatures row (separate from this doc's schema).
- Score VisualAsset tags against track features with a hand-tuned weight table. Good enough for "this is lofi → surface muted, slow visuals".
Phase 3 — Embedding-based matching (V3+)
- Embed visuals via a CLIP-style model (frames sampled every 1s → mean-pooled).
- Embed tracks via an audio embedding model (prototype with existing features → swap for a dedicated API when available).
- Joint match: score = cosine(trackEmb, visualEmb) + tag_prior + recency_boost + creator_quality.
- Recommender endpoint: GET /api/tracks/:id/visuals/recommendations.
- Design constraint: every recommendation must be explainable to the artist ("matched on: moody, nocturnal, 80–90 BPM"). No black-box.
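The joint score and its explanation string can be sketched together, so the "why" is produced at the same point as the number. The MatchInput shape, the field names, and the fallback reason text are assumptions; the additive formula is the one given above.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new RangeError("embeddings must share a non-zero length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction; treat it as "no similarity".
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical input shape; the priors are assumed to be precomputed.
type MatchInput = {
  trackEmb: number[];
  visualEmb: number[];
  tagPrior: number;
  recencyBoost: number;
  creatorQuality: number;
  matchedTags: string[];
};

function scoreMatch(m: MatchInput): { score: number; reason: string } {
  const score =
    cosineSimilarity(m.trackEmb, m.visualEmb) + m.tagPrior + m.recencyBoost + m.creatorQuality;
  // The reason string is built from tags, never from the raw embedding.
  const reason =
    m.matchedTags.length > 0
      ? `matched on: ${m.matchedTags.join(", ")}`
      : "matched on: overall audio-visual similarity";
  return { score, reason };
}
```

Returning the reason alongside the score keeps the explainability constraint enforceable at the type level: a recommendation cannot be emitted without one.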
Reusing the Intelligence Layer
Casset already has lib/intelligence/ with an OpenAI + rule-based fallback and a coach-voice tone system. The matching recommender fits this layer cleanly — same fallback ladder, same UI voice.
9. Risks & constraints
Content moderation
- Risk: community visuals and any future generated packs can carry NSFW, copyrighted, or violent content.
- V1 (artist-only) has the same trust profile as existing audio uploads → no new surface.
- V2: no community uploads yet.
- V3: mandatory moderation queue (VisualSubmission.status = PENDING), modeled on HookVideoSubmission. Auto-screen with a provider (AWS Rekognition / Hive / Cloudflare Images moderation), human review for edge cases.
- Report-to-moderation action on every played visual. One-strike takedown.
Copyright / ownership
- Risk: creators submit visuals containing third-party footage they don't own.
- Upload agreement (ToS checkbox) — explicit representation of ownership/license.
- DMCA handler endpoint + takedown workflow (reuse the bounty takedown pattern).
- V3: payout release delayed 7 days post-approval to allow challenges before funds move.
Low-quality spam visuals
- Soft rate limit: 3 pending submissions per creator.
- Require tag completeness + poster frame.
- Auto-dedupe on sourcePromptHash for AI kind; perceptual-hash on video for uploads.
- Creator reputation (reuse PromoterReputation pattern): approval rate, pairing rate, flag count.
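The first two spam gates above (pending limit, tag completeness + poster frame) are cheap to check before a submission is queued. A minimal sketch; submissionGate and SubmissionDraft are hypothetical, and "tag completeness" is read here as "at least one tag", which is an assumption.

```typescript
type SubmissionDraft = {
  posterImageUrl: string | null;
  tagSlugs: string[];
};

const MAX_PENDING_PER_CREATOR = 3;

// Returns the list of reasons a draft cannot be queued; empty means allowed.
function submissionGate(pendingCount: number, draft: SubmissionDraft): string[] {
  const problems: string[] = [];
  if (pendingCount >= MAX_PENDING_PER_CREATOR) problems.push("pending limit reached");
  if (!draft.posterImageUrl) problems.push("poster frame required");
  // Assumed reading of "tag completeness": at least one tag is present.
  if (draft.tagSlugs.length === 0) problems.push("at least one tag required");
  return problems;
}
```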
Performance / cost of video rendering
- Client-side path (V1–V3) reuses the existing budget. Only new cost is the extra drawImage(video) per frame, negligible vs encode.
- AI generation is the cost center. Gate behind credits (Casset already has CreditLedger).
- Server-side export considered only if WebCodecs coverage drops or the quality ceiling becomes binding. Not on the roadmap.
Storage
- Vercel Blob. Visuals ≤ 30s at reasonable bitrate (≤ 10MB typical). A track with 20 visuals is ≤ 200MB.
- Lifecycle rule: ARCHIVED assets drop to cold storage after 30 days; FAILED assets purge after 7.
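The lifecycle rule reduces to a small pure function that a scheduled job could call per asset. This is a sketch: lifecycleAction is a hypothetical name, and it assumes the job knows when the asset entered its current status (the schema above only carries updatedAt, which may or may not be a good enough proxy).

```typescript
type AssetStatus = "PROCESSING" | "READY" | "FAILED" | "ARCHIVED";
type LifecycleAction = "keep" | "move_to_cold" | "purge";

const DAY_MS = 24 * 60 * 60 * 1000;

// `statusSince` is assumed to be when the asset entered its current status.
function lifecycleAction(status: AssetStatus, statusSince: Date, now: Date): LifecycleAction {
  const ageDays = (now.getTime() - statusSince.getTime()) / DAY_MS;
  if (status === "ARCHIVED" && ageDays >= 30) return "move_to_cold";
  if (status === "FAILED" && ageDays >= 7) return "purge";
  return "keep";
}
```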
Export file size
Existing MAX_OUTPUT_BYTES = 15 * 1024 * 1024 (15MB) ceiling in lib/tiktok-video.ts already bounds output. Adding a video bg doesn't change this — dynamic bitrate already adapts to duration.
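As a worked example of why the ceiling already bounds output: a fixed byte budget divided by duration yields the bitrate available to the encoder. The function below is an illustration only, not the formula in lib/tiktok-video.ts; the 128 kbps audio allowance and the headroom factor are assumptions.

```typescript
// Mirrors the 15MB ceiling quoted above; the name is local to this sketch.
const OUTPUT_CEILING_BYTES = 15 * 1024 * 1024;

// Video bitrate (bits/s) that keeps a clip of `durationSec` under the ceiling.
// audioBitrateBps and headroom are assumed values, not read from the codebase.
function maxVideoBitrate(durationSec: number, audioBitrateBps = 128000, headroom = 0.95): number {
  if (durationSec <= 0) throw new RangeError("durationSec must be positive");
  const totalBps = (OUTPUT_CEILING_BYTES * 8 * headroom) / durationSec;
  return Math.max(0, Math.floor(totalBps - audioBitrateBps));
}
```

With no headroom, a 30s hook gets 15 × 1024 × 1024 × 8 / 30 = 4,194,304 bits/s in total, or 4,066,304 bits/s for video after the audio allowance. The background source, static or video, does not enter the calculation, which is why adding a visual does not change the output bound.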
10. UX principles
- Visuals are identity, not decoration. The pairing UI lives next to the track, not under "extras". The export preview shows the visual as the dominant surface, not a thumbnail.
- One-tap everything. Selection = tap. Preview = instant. Export = one tap → progress modal (already built) → share sheet.
- Motion is the default. A track without a visual shows the existing static cover. A track with a visual should feel like a new object category on Casset.
- Attribution is non-negotiable. Community visuals always carry the creator's handle on the exported video. This is the trust contract with creators.
- Never block the artist. Failed visual load → fallback to cover. Unmoderated submission → invisible to fans but visible to admin + submitter. The artist's export always works.
- Swipe, don't menu. Fan switcher on feed cards is a horizontal swipe, not a dropdown. Mirrors TikTok intuition.
- Explainable matching. When we recommend visuals, we say why — in plain language, using the same coach voice as the Intelligence Layer.
11. Strategic impact
Increases sharing
- Working assumption: a hook video with motion outperforms a static cover in short-video feeds. We already export 1080×1920, so this is the multiplier, not the foundation.
- The visual switcher creates multiple distinct share artifacts per track. One track, N videos, N share chances.
Creates a creator ecosystem
- Visual artists (motion designers, VJs, 3D artists, AI operators) have no native home in music today. Album art is one-shot and mostly static. TikTok effects are for memes.
- Casset becomes the first platform where a filmmaker's reel has a direct revenue path via music pairings.
- Reputation compounds: the same primitives we already have for promoters (PromoterReputation, trust badges, streaks) port directly to visual creators.
Differentiates from feed + music platforms
- Linktree / Beacons: no world layer, no audio-native participation.
- Spotify / Apple Music: one cover per track, no identity surface, no fan-side mutation.
- Bandcamp: static artifacts, no short-video pipeline.
- TikTok / Reels: consumption layer only. No identity layer for the artist. No creator-to-artist pairing market.
- Casset + Visual Pairing: Hook Object + identity + share artifact in one loop.
Second-order effects
- New acquisition surface: visual creators import their audience to Casset.
- New engagement loop: fans who can't make music but can make visuals now have a reason to sign up.
- New Drop variants: artists can run "visual drops" — crowdsource the visual bed, pick a winner, ride the existing bounty rail.
Appendices
A. Rollout
- Pre-V1: feature flag visual_pairing_v1 scoped to internal + 10 launch-partner artists.
- V1 launch: open to all artists. No fan-facing UI beyond the export carrying the visual.
- V2 launch: fan switcher behind visual_pairing_v2 flag; graduate on retention delta.
- V3 launch: marketplace behind visual_marketplace flag; graduate once moderation SLA + payout flow are green for 2 weeks.
B. Success metrics
- V1: % of active artists with at least one visual attached; export completion rate with vs without visual; share-through rate from exported video.
- V2: visuals-per-track distribution; fan switcher interaction rate; share URL visual-retention (does the recipient keep the selected visual).
- V3: approved submissions / week; creator activation (first pairing → first export); rev-share payout volume; creator retention week-4.
C. Non-goals (explicit)
- Full video editor in-app (trimming, effects, transitions) — not a Casset product.
- User-generated visuals on top of other users' exported videos — this is a remix feature, separately scoped.
- Live / real-time visuals tied to playback — out of scope for Visual Pairing V1, not for Casset's broader Hook Object runtime.
docs/roadmap/visual-pairing-system.md. This page and the markdown stay in sync — edit one, edit both.