Introduction
We built a video shortener in a single day using Claude Code and CE.SDK. It extracts 3-4 short clips from long-form video, handles transcription, identifies the best moments via AI, detects speakers, and outputs vertical/horizontal/square formats—all running in the browser.
Features:
- Extracts 3-4 clips per video (highlights, summaries, or cleaned-up edits)
- Outputs 9:16 (vertical), 16:9 (landscape), or 1:1 (square)
- Detects speakers and maps them to faces with user confirmation
- Auto-crops to follow the active speaker
- Adds captions and text hooks
- Non-destructive: change aspect ratio or template without re-processing
Best suited for: Videos with speech/dialogue (podcasts, interviews, presentations, vlogs)
Why Client-Side?
CE.SDK's CreativeEngine runs in the browser via WebAssembly. Video decoding, timeline manipulation, effects, and preview all happen on the user's device.
Benefits:
- No upload/download wait — edits preview instantly
- Non-destructive — switch aspect ratio or template without rendering
- Lower infrastructure costs — rendering happens on the user's device, so server costs don't grow with video length or user count
Tech Stack
- Frontend: Next.js + React
- Video Engine: CE.SDK (CreativeEngine)
- Transcription: ElevenLabs Scribe v2
- AI Analysis: Google Gemini
Architecture Overview
High-Level Flow
Upload a long-form video → extract audio with CE.SDK → transcribe with ElevenLabs Scribe → detect highlight segments with Gemini → map suggestions back to word-level timestamps → trim, crop, and lay out clips on the CE.SDK timeline → preview and export in the browser.
Required API Keys
| Service | Purpose | Environment Variable |
|---|---|---|
| CE.SDK | Video editing engine | NEXT_PUBLIC_CESDK_LICENSE |
| ElevenLabs | Speech-to-text transcription | ELEVENLABS_API_KEY |
| Gemini (via OpenRouter or direct) | AI highlight detection | OPENROUTER_API_KEY or GEMINI_API_KEY |
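For a Next.js app these typically live in .env.local. A minimal sketch (values are placeholders), assuming Gemini is routed through OpenRouter; use GEMINI_API_KEY instead if you call Gemini directly:
NEXT_PUBLIC_CESDK_LICENSE=your-cesdk-license
ELEVENLABS_API_KEY=your-elevenlabs-api-key
OPENROUTER_API_KEY=your-openrouter-api-key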
Setting Up CE.SDK
What is CE.SDK?
CE.SDK (CreativeEngine SDK) is a browser-based engine for video, image, and design editing—a programmable video editor you can embed in your app.
Key Concepts:
- Engine: The runtime that manages the editing session
- Scene: The document/project containing all elements
- Blocks: Individual elements (video clips, text, shapes, audio)
- Timeline: Time-based arrangement of blocks for video editing
Installation
npm install @cesdk/cesdk-js
Initializing the CreativeEngine
import CreativeEngine from '@cesdk/cesdk-js';
const engine = await CreativeEngine.init({
license: process.env.NEXT_PUBLIC_CESDK_LICENSE,
});
// Create a video scene
const scene = engine.scene.createVideo();
// Get the page (timeline container)
const pages = engine.scene.getPages();
const page = pages[0];
// Configure page dimensions for your target aspect ratio
engine.block.setWidth(page, 1080); // 9:16 vertical
engine.block.setHeight(page, 1920);
Uploading Video to CE.SDK
CE.SDK works with video through a fill-based system. The graphic block is the container, while the video fill holds the actual media source and playback properties.
// Create a video block
const videoBlock = engine.block.create('graphic');
const videoFill = engine.block.createFill('video');
// Set the video source
engine.block.setString(
videoFill,
'fill/video/fileURI',
videoUrl // Can be a blob URL or remote URL
);
// Apply fill to block
engine.block.setFill(videoBlock, videoFill);
// Add to timeline
engine.block.appendChild(page, videoBlock);
Extracting Audio for Transcription
// Configure audio-only export
const mimeType = 'audio/mp4';
// Export just the audio track
const audioBlob = await engine.block.export(page, mimeType, {
targetWidth: 0,
targetHeight: 0,
});
// audioBlob can now be sent to transcription API
Setting both dimensions to 0 tells CE.SDK to skip video encoding entirely, making this export much faster than exporting the full video.
Getting Video Metadata
// Get video duration
const duration = engine.block.getDuration(videoBlock);
// Get dimensions from the fill
const videoFill = engine.block.getFill(videoBlock);
const sourceWidth = engine.block.getSourceWidth(videoFill);
const sourceHeight = engine.block.getSourceHeight(videoFill);
console.log(`Video: ${sourceWidth}x${sourceHeight}, ${duration}s`);
AI-Powered Transcription & Highlight Detection
The Pipeline
- Audio → Transcription: Send extracted audio to ElevenLabs Scribe
- Transcription → Analysis: Feed word-level transcript to Gemini
- Analysis → Timestamps: Map AI suggestions back to precise video times
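Glued together, the pipeline is roughly the sketch below; transcribeAudio, buildPrompt, detectHighlights, and matchTextToTimestamps are placeholder names for the steps detailed in the rest of this section.
// Hypothetical glue code for the three pipeline steps (helper names are placeholders)
async function buildClipSuggestions(engine: CreativeEngine, page: number) {
  // 1. Audio → Transcription: export only the audio track and send it to ElevenLabs
  const audioBlob = await engine.block.export(page, 'audio/mp4', {
    targetWidth: 0,
    targetHeight: 0,
  });
  const words = await transcribeAudio(audioBlob); // word-level timestamps + speaker ids
  // 2. Transcription → Analysis: ask Gemini for 3-4 standalone segments
  const concepts = await detectHighlights(buildPrompt(words));
  // 3. Analysis → Timestamps: match each concept's trimmed_text back to the words
  return concepts.map((concept) => ({
    ...concept,
    timing: matchTextToTimestamps(concept.trimmed_text, words),
  }));
}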
Transcription with Speaker Diarization
ElevenLabs Scribe v2 provides:
- Word-level timestamps (start/end time for each word)
- Speaker diarization (which speaker said what)
The output is a structured transcript where each word has a precise timestamp, enabling frame-accurate editing.
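Calling Scribe from a Next.js route handler might look like the sketch below; the endpoint path, form fields, and model id are assumptions to double-check against the current ElevenLabs docs.
// app/api/transcribe/route.ts (hypothetical route handler)
export async function POST(request: Request) {
  const audio = await request.blob();
  const form = new FormData();
  form.append('file', audio, 'audio.mp4');
  form.append('model_id', 'scribe_v2'); // assumption: use the Scribe model you have access to
  form.append('diarize', 'true'); // request speaker labels alongside word timestamps
  const response = await fetch('https://api.elevenlabs.io/v1/speech-to-text', {
    method: 'POST',
    headers: { 'xi-api-key': process.env.ELEVENLABS_API_KEY! },
    body: form,
  });
  // Expected shape (simplified): { words: [{ text, start, end, speaker_id }] }
  return Response.json(await response.json());
}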
AI Highlight Detection with Gemini
The prompt structure matters. Here's what works:
You are analyzing a video transcript to identify segments for short-form content.
TRANSCRIPT:
[Word-by-word transcript with timestamps]
TASK:
Identify 3-4 segments that work as standalone short videos. For each segment:
1. Find the exact starting and ending words
2. Ensure clean sentence boundaries (no mid-sentence cuts)
3. Aim for 30-60 second segments
OUTPUT FORMAT (JSON):
{
"concepts": [
{
"id": "concept_1",
"title": "Hook title",
"description": "Why this segment works as a standalone clip",
"trimmed_text": "The exact transcript text to keep...",
"estimated_duration_seconds": 45
}
]
}
CRITERIA FOR SELECTION:
- Strong hooks (surprising statements, questions, bold claims)
- Complete thoughts (don't cut mid-explanation)
- Emotional peaks (humor, insight, controversy)
- Standalone value (makes sense without context)
Before finalizing each segment, ask: "If someone started watching here,
would they understand what's being discussed?"
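Sending this prompt to the model can be a single fetch call. Here's a sketch assuming OpenRouter's chat-completions endpoint; the model slug and the JSON response format option are choices you'd adapt to your setup.
// Hypothetical call to Gemini via OpenRouter
async function detectHighlights(prompt: string) {
  const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'google/gemini-2.0-flash-001', // assumption: pick the Gemini model you use
      messages: [{ role: 'user', content: prompt }],
      response_format: { type: 'json_object' }, // nudge the model toward parseable JSON
    }),
  });
  const data = await response.json();
  // The prompt asks for { concepts: [...] }; parse and validate before trusting it
  return JSON.parse(data.choices[0].message.content).concepts;
}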
Mapping Back to Timestamps
Once Gemini returns the trimmed_text, we match it against our word-level transcript to find exact timestamps:
AI returns: "The secret to success is actually quite simple..."
Transcript has: [{ word: "The", start: 45.2 }, { word: "secret", start: 45.4 }, ...]
Result: Trim video from 45.2s to 52.8s
This text-matching approach is more reliable than asking the AI to output timestamps directly. LLMs can hallucinate timestamps or miscalculate offsets, but they're excellent at identifying the right words—so we let the transcript data provide the ground truth for timing.
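A sketch of that matching step, using the TranscriptWord shape shown later in this post and assuming a simple exact match after normalization; real transcripts may need fuzzier matching.
// Find where the AI's trimmed_text occurs in the word-level transcript
function matchTextToTimestamps(
  trimmedText: string,
  words: TranscriptWord[]
): { start: number; end: number } | null {
  const normalize = (s: string) => s.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, '').trim();
  const target = normalize(trimmedText).split(/\s+/);
  // Slide a window of the same length over the transcript and compare word-by-word
  for (let i = 0; i <= words.length - target.length; i++) {
    const window = words.slice(i, i + target.length);
    const matches = window.every((w, j) => normalize(w.word) === target[j]);
    if (matches) {
      return { start: window[0].start, end: window[window.length - 1].end };
    }
  }
  return null; // fall back to fuzzy matching or manual review if no exact match
}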
Working with the CE.SDK Timeline
Understanding Blocks
// Video/Image content
const graphic = engine.block.create('graphic');
// Audio track
const audio = engine.block.create('audio');
// Text overlay
const text = engine.block.create('text');
// Each block can be positioned on the timeline
engine.block.setTimeOffset(block, startTimeInSeconds);
engine.block.setDuration(block, durationInSeconds);
Manipulating Trim Points
Trimming controls which portion of the source media is shown:
const videoFill = engine.block.getFill(videoBlock);
// Set where in the source video to start (in seconds)
engine.block.setTrimOffset(videoFill, 45.2);
// Set how long to play from that point
engine.block.setTrimLength(videoFill, 30.0);
// Also update the block's duration to match
engine.block.setDuration(videoBlock, 30.0);
Working with Fills and Their Timing
// Get the fill (contains the actual media)
const fill = engine.block.getFill(block);
// Fills have their own timing properties
const trimStart = engine.block.getTrimOffset(fill);
const trimDuration = engine.block.getTrimLength(fill);
// The block's duration should typically match the fill's trim length
engine.block.setDuration(block, trimDuration);
Think of the fill as the media source (which part of the original video to use) and the block as the timeline placement (when and how long it appears). Both need to be updated together for clean edits.
Creating Time-Based Edits from Transcript Words
interface TranscriptWord {
word: string;
start: number;
end: number;
speaker_id?: string;
}
function applyTranscriptTrim(
engine: CreativeEngine,
videoBlock: number,
words: TranscriptWord[]
) {
if (words.length === 0) return;
const startTime = words[0].start;
const endTime = words[words.length - 1].end;
const duration = endTime - startTime;
const fill = engine.block.getFill(videoBlock);
engine.block.setTrimOffset(fill, startTime);
engine.block.setTrimLength(fill, duration);
engine.block.setDuration(videoBlock, duration);
}
Generating Speaker Thumbnails
async function generateSpeakerThumbnail(
engine: CreativeEngine,
videoBlock: number,
timestampSeconds: number,
size: number = 128
): Promise<string> {
const fill = engine.block.getFill(videoBlock);
// Remember the current trim so we can restore it after the export
const previousOffset = engine.block.getTrimOffset(fill);
const previousLength = engine.block.getTrimLength(fill);
// Seek to the specific timestamp
engine.block.setTrimOffset(fill, timestampSeconds);
engine.block.setTrimLength(fill, 0.1); // Just a single frame
// Export as image
const blob = await engine.block.export(videoBlock, 'image/jpeg', {
targetWidth: size,
targetHeight: size,
});
// Restore the original trim so the edit itself is untouched
engine.block.setTrimOffset(fill, previousOffset);
engine.block.setTrimLength(fill, previousLength);
return URL.createObjectURL(blob);
}
We sample multiple timestamps throughout each speaker's talk time to show different facial angles and expressions—this helps users identify the right person even if they're looking away or mid-gesture in one frame.
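A sketch of that sampling, assuming the diarized transcript has already been grouped into one word list per speaker; it reuses generateSpeakerThumbnail from above.
// Take a handful of evenly spaced frames from the times a given speaker is talking
async function sampleSpeakerThumbnails(
  engine: CreativeEngine,
  videoBlock: number,
  speakerWords: TranscriptWord[],
  sampleCount: number = 4
): Promise<string[]> {
  if (speakerWords.length === 0) return [];
  const thumbnails: string[] = [];
  const step = Math.max(1, Math.floor(speakerWords.length / sampleCount));
  for (let i = 0; i < speakerWords.length && thumbnails.length < sampleCount; i += step) {
    const word = speakerWords[i];
    // Aim for the middle of the word so the speaker is mid-speech, not mid-pause
    const timestamp = (word.start + word.end) / 2;
    thumbnails.push(await generateSpeakerThumbnail(engine, videoBlock, timestamp));
  }
  return thumbnails;
}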
Speaker Detection & Face Tracking
Why Semi-Automatic?
Fully automatic speaker detection fails often enough that we added a confirmation step. Users verify detected faces against speaker names from the transcript—it takes a few seconds and prevents bad crops across the entire video.
How It Works
- Sample frames throughout the video
- Detect & cluster faces using face-api.js (runs in browser, no server needed)
- User confirms speaker identities via thumbnails
- Correlate with transcript diarization to map speakers → face locations
This gives us verified speaker-to-face mapping for dynamic cropping and picture-in-picture layouts.
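The detection step with face-api.js might look like the sketch below; the /models path and detector threshold are assumptions, and each sampled frame is expected as a canvas you've drawn the video frame onto.
import * as faceapi from 'face-api.js';
// Load the lightweight detector once (serving models from /models is an assumption)
async function loadFaceModels() {
  await faceapi.nets.tinyFaceDetector.loadFromUri('/models');
  await faceapi.nets.faceLandmark68Net.loadFromUri('/models');
  await faceapi.nets.faceRecognitionNet.loadFromUri('/models');
}
// Detect faces in a sampled frame; descriptors let us cluster the same face across frames
async function detectFacesInFrame(frame: HTMLCanvasElement) {
  return faceapi
    .detectAllFaces(frame, new faceapi.TinyFaceDetectorOptions({ scoreThreshold: 0.5 }))
    .withFaceLandmarks()
    .withFaceDescriptors();
}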
Multi-Speaker Templates & Dynamic Switching
The Concept
When a video has multiple speakers, we can create layouts that show:
- The active speaker prominently
- Other speakers in smaller picture-in-picture views
- Dynamic switching as the conversation flows
Creating Picture-in-Picture with CE.SDK
// Duplicate the video block for each speaker slot
const pipBlock = engine.block.duplicate(originalVideoBlock);
// Position and size the PiP
engine.block.setWidth(pipBlock, 200);
engine.block.setHeight(pipBlock, 200);
engine.block.setPositionX(pipBlock, 20); // 20px from left
engine.block.setPositionY(pipBlock, 20); // 20px from top
// Enable cropping
engine.block.setClipped(pipBlock, true);
engine.block.setContentFillMode(pipBlock, 'Cover');
Key Technique: Muting Duplicate Audio
When duplicating video blocks for multi-speaker layouts, each copy has its own audio track. We must mute all but one. The setMuted API operates on the video fill, not the block itself:
// For each speaker slot after the first, mute the video fill
if (slotIndex > 0) {
const videoFill = engine.block.getFill(duplicatedBlock);
if (videoFill) {
engine.block.setMuted(videoFill, true);
}
}
Dynamic Speaker Switching
As the active speaker changes throughout the video, we:
- Detect which speaker is talking (from transcript diarization)
- Swap speaker positions in the template
- Keep the active speaker in the prominent position
The layout updates automatically as the conversation switches between speakers. We apply different trim offsets to each duplicated block based on the transcript timing—so the main speaker slot shows the person currently talking while PiP slots show the listeners.
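A simplified sketch of that switching for a two-slot layout: it assumes the transcript has been collapsed into contiguous { speaker_id, start, end } segments, gives both slots the same time range, and pans the main slot's crop toward the active speaker. The crop-translation calls assume CE.SDK's crop API; adjust to however you position your crops.
interface SpeakerSegment {
  speaker_id: string;
  start: number; // seconds in the source video
  end: number;
}
// Show one transcript segment: both slots play the same time range,
// and the main slot's crop follows the face of whoever is speaking.
function layoutSegment(
  engine: CreativeEngine,
  mainBlock: number,
  pipBlock: number,
  segment: SpeakerSegment,
  speakerCrops: Record<string, { x: number; y: number }>
) {
  const duration = segment.end - segment.start;
  for (const block of [mainBlock, pipBlock]) {
    const fill = engine.block.getFill(block);
    engine.block.setTrimOffset(fill, segment.start);
    engine.block.setTrimLength(fill, duration);
    engine.block.setDuration(block, duration);
  }
  // Pan the main slot's crop toward the active speaker's face position
  const crop = speakerCrops[segment.speaker_id];
  if (crop) {
    engine.block.setCropTranslationX(mainBlock, crop.x);
    engine.block.setCropTranslationY(mainBlock, crop.y);
  }
}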
Preview, Playback & Export
Setting Up the Canvas
const container = document.getElementById('cesdk-canvas');
// Append the engine's canvas element to the DOM
container?.append(engine.element);
Playback Controls
engine.player.play();
engine.player.pause();
engine.player.setPlaybackTime(30.5); // seek to 30.5 seconds
const currentTime = engine.player.getPlaybackTime();
const isPlaying = engine.player.isPlaying();
Syncing UI State
engine.player.onPlaybackTimeChanged(() => {
const time = engine.player.getPlaybackTime();
updateTimeDisplay(time);
updateProgressBar(time / totalDuration);
});
engine.player.onPlaybackStateChanged(() => {
updatePlayButton(engine.player.isPlaying());
});
Export Options
const exportOptions = {
targetWidth: 1080,
targetHeight: 1920,
framerate: 30,
videoBitrate: 8_000_000, // 8 Mbps
};
// Video export goes through exportVideo, which reports progress in frames
const blob = await engine.block.exportVideo(
page,
'video/mp4',
(renderedFrames, encodedFrames, totalFrames) =>
updateProgressBar((encodedFrames / totalFrames) * 100),
exportOptions
);
// Trigger download
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = 'shortened-video.mp4';
a.click();
For longer videos, consider showing estimated time remaining or allowing background export. Browser export is single-threaded and blocks the tab—a 5-minute export of a 60-second clip isn't unusual on average hardware, so user feedback is critical.
The Finished App
The user flow:
- Upload → Drop a long-form video into the browser
- Configure → Pick output mode (highlights/summary/cleanup) and aspect ratio (9:16, 16:9, 1:1)
- Verify speakers → Match detected faces to transcript speaker names
- Review clips → Browse the 3-4 AI-suggested segments, adjust if needed
- Choose template → Solo speaker, sidecar, stacked, etc.
- Preview → Scrub through the timeline, see exactly what you'll get
- Export → Download the final video directly from the browser
What's Next
Ideas for Extension
- Caption style controls: Custom fonts, animations, and positioning for subtitles
- B-roll insertion: Automatically add relevant stock footage
- Music & sound effects: AI-selected background audio
- Brand templates: Custom overlays, intros, outros
- Batch processing: Process multiple videos in sequence
Taking It Server-Side
Client-side processing for large files strains browser memory, and users must keep the tab open during export. A hybrid approach works better for production—upload in the background while users edit, then render on a server. You can also offload just the export step—let users build their edits in the browser, then send the CE.SDK scene JSON to your backend for faster, background rendering.
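Handing the edit off could be as small as the sketch below: engine.scene.saveToString() serializes the scene, and /api/render is a hypothetical endpoint that runs the render server-side.
// Serialize the current scene and send it to a hypothetical render endpoint
async function queueServerExport(engine: CreativeEngine): Promise<{ jobId: string }> {
  const sceneString = await engine.scene.saveToString();
  const response = await fetch('/api/render', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ scene: sceneString, format: 'video/mp4' }),
  });
  return response.json(); // e.g. a job id to poll for the finished render
}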
CE.SDK runs server-side with the same API. For batch processing, background jobs, or offloading rendering from user devices, see the CE.SDK Renderer for creative automation.
Resources
- CE.SDK Documentation
- CE.SDK Video Editing Guide
- GitHub: Video Shortener Source
- ElevenLabs API Docs
- Gemini API Docs