Eray – IMG.LY Blog

Build in a Day: AI Video Clipping with CE.SDK

Eray — Thu, 05 Feb 2026 12:17:28 GMT

Introduction

We built a video shortener in a single day using Claude Code and CE.SDK. It extracts 3-4 short clips from long-form video, handles transcription, identifies the best moments via AI, detects speakers, and outputs vertical/horizontal/square formats—all running in the browser.

Features:

Extracts 3-4 clips per video (highlights, summaries, or cleaned-up edits)
Outputs 9:16 (vertical), 16:9 (landscape), or 1:1 (square)
Detects speakers and maps them to faces with user confirmation
Auto-crops to follow the active speaker
Adds captions and text hooks
Non-destructive: change aspect ratio or template without re-processing

Best suited for: Videos with speech/dialogue (podcasts, interviews, presentations, vlogs)

Why Client-Side?

CE.SDK’s CreativeEngine runs in the browser via WebAssembly. Video decoding, timeline manipulation, effects, and preview all happen on the user’s device.

Benefits:

No upload/download wait — edits preview instantly
Non-destructive — switch aspect ratio or template without rendering
Lower infrastructure costs — your costs don’t scale with video length or user count

Tech Stack

Frontend: Next.js + React
Video Engine: CE.SDK (CreativeEngine)
Transcription: ElevenLabs Scribe v2
AI Analysis: Google Gemini

Architecture Overview

High-Level Flow

Required API Keys

Service	Purpose	Environment Variable
CE.SDK	Video editing engine	`NEXT_PUBLIC_CESDK_LICENSE`
ElevenLabs	Speech-to-text transcription	`ELEVENLABS_API_KEY`
Gemini (via OpenRouter or direct)	AI highlight detection	`OPENROUTER_API_KEY` or `GEMINI_API_KEY`

Setting Up CE.SDK

What is CE.SDK?

CE.SDK (CreativeEngine SDK) is a browser-based engine for video, image, and design editing—a programmable video editor you can embed in your app.

Key Concepts:

Engine: The runtime that manages the editing session
Scene: The document/project containing all elements
Blocks: Individual elements (video clips, text, shapes, audio)
Timeline: Time-based arrangement of blocks for video editing

Installation

npm install @cesdk/cesdk-js

Initializing the CreativeEngine

import CreativeEngine from '@cesdk/cesdk-js';

const engine = await CreativeEngine.init({
  license: process.env.NEXT_PUBLIC_CESDK_LICENSE,
});

// Create a video scene
const scene = engine.scene.createVideo();

// Get the page (timeline container)
const pages = engine.scene.getPages();
const page = pages[0];

// Configure page dimensions for your target aspect ratio
engine.block.setWidth(page, 1080); // 9:16 vertical
engine.block.setHeight(page, 1920);
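
Since the app offers three output formats, the page dimensions can come from a small lookup instead of hard-coded numbers. This is a minimal sketch: the 9:16 entry matches the values above, while the other two sizes (1080 px on the short side) and the names `AspectRatio`, `PageSize`, and `pageSizeFor` are illustrative assumptions, not part of CE.SDK.

```typescript
type AspectRatio = '9:16' | '16:9' | '1:1';

interface PageSize {
  width: number;
  height: number;
}

// Assumed sizes: 1080 px on the short side for every format.
const PAGE_SIZES: Record<AspectRatio, PageSize> = {
  '9:16': { width: 1080, height: 1920 },
  '16:9': { width: 1920, height: 1080 },
  '1:1': { width: 1080, height: 1080 },
};

// Look up the page size for a given output format.
function pageSizeFor(ratio: AspectRatio): PageSize {
  return PAGE_SIZES[ratio];
}
```

The returned width and height can then be passed to `engine.block.setWidth` and `engine.block.setHeight` as shown above.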

Uploading Video to CE.SDK

CE.SDK works with video through a fill-based system. The graphic block is the container, while the video fill holds the actual media source and playback properties.

// Create a video block
const videoBlock = engine.block.create('graphic');
const videoFill = engine.block.createFill('video');

// Set the video source
engine.block.setString(
  videoFill,
  'fill/video/fileURI',
  videoUrl // Can be a blob URL or remote URL
);

// Apply fill to block
engine.block.setFill(videoBlock, videoFill);

// Add to timeline
engine.block.appendChild(page, videoBlock);

Extracting Audio for Transcription

// Configure audio-only export
const mimeType = 'audio/mp4';

// Export just the audio track
const audioBlob = await engine.block.export(page, mimeType, {
  targetWidth: 0,
  targetHeight: 0,
});

// audioBlob can now be sent to transcription API

Setting both dimensions to 0 tells CE.SDK to skip video encoding entirely, making this export much faster than exporting the full video.

Getting Video Metadata

// Get video duration
const duration = engine.block.getDuration(videoBlock);

// Get dimensions from the fill
const videoFill = engine.block.getFill(videoBlock);
const sourceWidth = engine.block.getSourceWidth(videoFill);
const sourceHeight = engine.block.getSourceHeight(videoFill);

console.log(`Video: ${sourceWidth}x${sourceHeight}, ${duration}s`);

AI-Powered Transcription & Highlight Detection

The Pipeline

Audio → Transcription: Send extracted audio to ElevenLabs Scribe
Transcription → Analysis: Feed word-level transcript to Gemini
Analysis → Timestamps: Map AI suggestions back to precise video times

Transcription with Speaker Diarization

ElevenLabs Scribe v2 provides:

Word-level timestamps (start/end time for each word)
Speaker diarization (which speaker said what)

The output is a structured transcript where each word has a precise timestamp, enabling frame-accurate editing.
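
To make the word-level structure concrete, here is a rough sketch of how such a transcript could be flattened into text for the analysis prompt. The `TranscriptWord` shape mirrors the interface used later in this post; the exact field names in the Scribe response are not shown here, and `formatTranscriptForPrompt` and its line format are assumptions for illustration.

```typescript
interface TranscriptWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
  speaker_id?: string;
}

// Render one line per word: "[start-end] speaker: word".
function formatTranscriptForPrompt(words: TranscriptWord[]): string {
  return words
    .map((w) => {
      const speaker = w.speaker_id ?? 'unknown';
      return `[${w.start.toFixed(2)}-${w.end.toFixed(2)}] ${speaker}: ${w.word}`;
    })
    .join('\n');
}
```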

AI Highlight Detection with Gemini

The prompt structure matters. Here’s what works:

You are analyzing a video transcript to identify segments for short-form content.

TRANSCRIPT:
[Word-by-word transcript with timestamps]

TASK:
Identify 3-4 segments that work as standalone short videos. For each segment:
1. Find the exact starting and ending words
2. Ensure clean sentence boundaries (no mid-sentence cuts)
3. Aim for 30-60 second segments

OUTPUT FORMAT (JSON):
{
  "concepts": [
    {
      "id": "concept_1",
      "title": "Hook title",
      "description": "Why this segment works as a standalone clip",
      "trimmed_text": "The exact transcript text to keep...",
      "estimated_duration_seconds": 45
    }
  ]
}

CRITERIA FOR SELECTION:
- Strong hooks (surprising statements, questions, bold claims)
- Complete thoughts (don't cut mid-explanation)
- Emotional peaks (humor, insight, controversy)
- Standalone value (makes sense without context)


Before finalizing each segment, ask: "If someone started watching here,
would they understand what's being discussed?"
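
Model output is not guaranteed to be clean JSON, so it helps to validate the reply before using it. The sketch below assumes the reply contains one JSON object in the format requested above, possibly wrapped in extra text. `ClipConcept`, `isClipConcept`, and `parseConcepts` are hypothetical names, not taken from the app's code.

```typescript
interface ClipConcept {
  id: string;
  title: string;
  description: string;
  trimmed_text: string;
  estimated_duration_seconds: number;
}

// Runtime check that a parsed value has the fields the prompt asked for.
function isClipConcept(value: unknown): value is ClipConcept {
  if (typeof value !== 'object' || value === null) return false;
  const c = value as Record<string, unknown>;
  const text = c['trimmed_text'];
  return (
    typeof c['id'] === 'string' &&
    typeof c['title'] === 'string' &&
    typeof c['description'] === 'string' &&
    typeof text === 'string' &&
    text.trim().length > 0 &&
    typeof c['estimated_duration_seconds'] === 'number'
  );
}

// Extract the outermost JSON object from the reply and keep valid concepts only.
function parseConcepts(raw: string): ClipConcept[] {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model output');
  }
  const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Model output is not a JSON object');
  }
  const concepts = (parsed as Record<string, unknown>)['concepts'];
  if (!Array.isArray(concepts)) {
    throw new Error('Missing "concepts" array');
  }
  return concepts.filter(isClipConcept);
}
```

Dropping malformed entries rather than failing the whole request is a design choice; a stricter variant could throw as soon as any concept is invalid.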

Mapping Back to Timestamps

Once Gemini returns the trimmed_text, we match it against our word-level transcript to find exact timestamps:

AI returns:     "The secret to success is actually quite simple..."
Transcript has: [{ word: "The", start: 45.2 }, { word: "secret", start: 45.4 }, ...]

Result:         Trim video from 45.2s to 52.8s

This text-matching approach is more reliable than asking the AI to output timestamps directly. LLMs can hallucinate timestamps or miscalculate offsets, but they’re excellent at identifying the right words—so we let the transcript data provide the ground truth for timing.
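
The matching step can be sketched as a simple sequence search. This is not the app's actual matcher, which is not shown in this post; it is a minimal version that assumes the model returns the words verbatim, that the transcript is English, and that the first occurrence is the right one. `normalizeToken` and `findSegment` are hypothetical helper names.

```typescript
interface TranscriptWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
}

interface TimeRange {
  start: number;
  end: number;
}

// Lowercase and drop everything except ASCII letters and digits,
// so "simple..." and "Simple" compare equal. (Assumes English text.)
function normalizeToken(token: string): string {
  return token.toLowerCase().replace(/[^a-z0-9]/g, '');
}

// Find the first run of transcript words that matches trimmedText.
function findSegment(words: TranscriptWord[], trimmedText: string): TimeRange | null {
  const target = trimmedText
    .split(/\s+/)
    .map(normalizeToken)
    .filter((t) => t.length > 0);
  if (target.length === 0) return null;

  // Keep each original word next to its normalized token; skip empty tokens.
  const source = words
    .map((w) => ({ token: normalizeToken(w.word), word: w }))
    .filter((s) => s.token.length > 0);

  for (let i = 0; i + target.length <= source.length; i++) {
    let matched = true;
    for (let j = 0; j < target.length; j++) {
      if (source[i + j]?.token !== target[j]) {
        matched = false;
        break;
      }
    }
    if (matched) {
      const first = source[i];
      const last = source[i + target.length - 1];
      if (first && last) return { start: first.word.start, end: last.word.end };
    }
  }
  return null;
}
```

If the model paraphrases or drops filler words, exact matching returns `null`; a fuzzy comparison would be the natural next step in that case.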

Working with the CE.SDK Timeline

Understanding Blocks

// Video/Image content
const graphic = engine.block.create('graphic');

// Audio track
const audio = engine.block.create('audio');

// Text overlay
const text = engine.block.create('text');

// Each block can be positioned on the timeline
engine.block.setTimeOffset(block, startTimeInSeconds);
engine.block.setDuration(block, durationInSeconds);

Manipulating Trim Points

Trimming controls which portion of the source media is shown:

const videoFill = engine.block.getFill(videoBlock);

// Set where in the source video to start (in seconds)
engine.block.setTrimOffset(videoFill, 45.2);

// Set how long to play from that point
engine.block.setTrimLength(videoFill, 30.0);

// Also update the block's duration to match
engine.block.setDuration(videoBlock, 30.0);

Working with Fills and Their Timing

// Get the fill (contains the actual media)
const fill = engine.block.getFill(block);

// Fills have their own timing properties
const trimStart = engine.block.getTrimOffset(fill);
const trimDuration = engine.block.getTrimLength(fill);

// The block's duration should typically match the fill's trim length
engine.block.setDuration(block, trimDuration);

Think of the fill as the media source (which part of the original video to use) and the block as the timeline placement (when and how long it appears). Both need to be updated together for clean edits.

Creating Time-Based Edits from Transcript Words

interface TranscriptWord {
  word: string;
  start: number;
  end: number;
  speaker_id?: string;
}

function applyTranscriptTrim(
  engine: CreativeEngine,
  videoBlock: number,
  words: TranscriptWord[]
) {
  if (words.length === 0) return;

  const startTime = words[0].start;
  const endTime = words[words.length - 1].end;
  const duration = endTime - startTime;

  const fill = engine.block.getFill(videoBlock);

  engine.block.setTrimOffset(fill, startTime);
  engine.block.setTrimLength(fill, duration);
  engine.block.setDuration(videoBlock, duration);
}

Generating Speaker Thumbnails

async function generateSpeakerThumbnail(
  engine: CreativeEngine,
  videoBlock: number,
  timestampSeconds: number,
  size: number = 128
): Promise<string> {
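  const fill = engine.block.getFill(videoBlock);

  // Remember the current trim so the thumbnail export doesn't alter the edit
  const previousOffset = engine.block.getTrimOffset(fill);
  const previousLength = engine.block.getTrimLength(fill);

  try {
    // Seek to the specific timestamp
    engine.block.setTrimOffset(fill, timestampSeconds);
    engine.block.setTrimLength(fill, 0.1); // Just a single frame

    // Export as image
    const blob = await engine.block.export(videoBlock, 'image/jpeg', {
      targetWidth: size,
      targetHeight: size,
    });

    return URL.createObjectURL(blob);
  } finally {
    // Restore the original trim, even if the export fails
    engine.block.setTrimOffset(fill, previousOffset);
    engine.block.setTrimLength(fill, previousLength);
  }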
}

We sample multiple timestamps throughout each speaker’s talk time to show different facial angles and expressions—this helps users identify the right person even if they’re looking away or mid-gesture in one frame.
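
One way to choose those sample points is to spread them evenly over the words a speaker actually says. The sketch below assumes the `TranscriptWord` shape used earlier in this post; `sampleSpeakerTimestamps` and the default of three samples are illustrative assumptions.

```typescript
interface TranscriptWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
  speaker_id?: string;
}

// Return up to `count` timestamps spread evenly over one speaker's words.
function sampleSpeakerTimestamps(
  words: TranscriptWord[],
  speakerId: string,
  count: number = 3
): number[] {
  const spoken = words.filter((w) => w.speaker_id === speakerId);
  if (spoken.length === 0 || count <= 0) return [];

  const n = Math.min(count, spoken.length);
  const timestamps: number[] = [];
  for (let i = 0; i < n; i++) {
    // Take the word at the centre of each of the n equal slices
    const index = Math.floor(((i + 0.5) * spoken.length) / n);
    const w = spoken[index];
    // Use the midpoint of the word, when the speaker is mid-utterance
    if (w) timestamps.push((w.start + w.end) / 2);
  }
  return timestamps;
}
```

Each timestamp can then be passed to `generateSpeakerThumbnail` from the snippet above.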

Speaker Detection & Face Tracking

Why Semi-Automatic?

Fully automatic speaker detection fails often enough that we added a confirmation step. Users verify detected faces against speaker names from the transcript. This takes a few seconds and prevents bad crops across the entire video.

How It Works

Sample frames throughout the video
Detect & cluster faces using face-api.js (runs in browser, no server needed)
User confirms speaker identities via thumbnails
Correlate with transcript diarization to map speakers → face locations

This gives us verified speaker-to-face mapping for dynamic cropping and picture-in-picture layouts.
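
For the dynamic cropping part, the geometry can be sketched independently of CE.SDK: given a face position in the source frame, compute the largest rectangle of the target aspect ratio that stays inside the frame and is centred on the face where possible. How the rectangle is then applied to the block is not covered here; `Rect` and `cropAroundFace` are hypothetical names.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Largest crop with the target aspect ratio (width / height), centred on
// the face and clamped so it never leaves the source frame.
function cropAroundFace(
  sourceWidth: number,
  sourceHeight: number,
  faceCenterX: number,
  faceCenterY: number,
  targetAspect: number
): Rect {
  const sourceAspect = sourceWidth / sourceHeight;
  let width = sourceWidth;
  let height = sourceHeight;
  if (targetAspect < sourceAspect) {
    // Target is narrower than the source: keep full height, cut the sides
    width = sourceHeight * targetAspect;
  } else {
    // Target is wider than (or equal to) the source: keep full width
    height = sourceWidth / targetAspect;
  }

  const clamp = (value: number, min: number, max: number): number =>
    Math.min(Math.max(value, min), max);

  const x = clamp(faceCenterX - width / 2, 0, sourceWidth - width);
  const y = clamp(faceCenterY - height / 2, 0, sourceHeight - height);
  return { x, y, width, height };
}
```

For a 1920x1080 source and a 9:16 target, the crop is 607.5 px wide and slides horizontally with the speaker's face.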

Multi-Speaker Templates & Dynamic Switching

The Concept

When a video has multiple speakers, we can create layouts that show:

The active speaker prominently
Other speakers in smaller picture-in-picture views
Dynamic switching as the conversation flows

Creating Picture-in-Picture with CE.SDK

// Duplicate the video block for each speaker slot
const pipBlock = engine.block.duplicate(originalVideoBlock);

// Position and size the PiP
engine.block.setWidth(pipBlock, 200);
engine.block.setHeight(pipBlock, 200);
engine.block.setPositionX(pipBlock, 20); // 20px from left
engine.block.setPositionY(pipBlock, 20); // 20px from top

// Enable cropping
engine.block.setClipped(pipBlock, true);
engine.block.setContentFillMode(pipBlock, 'Cover');

Key Technique: Muting Duplicate Audio

When duplicating video blocks for multi-speaker layouts, each copy has its own audio track. We must mute all but one. The setMuted API operates on the video fill, not the block itself:

// For each speaker slot after the first, mute the video fill
if (slotIndex > 0) {
  const videoFill = engine.block.getFill(duplicatedBlock);
  if (videoFill) {
    engine.block.setMuted(videoFill, true);
  }
}

Dynamic Speaker Switching

As the active speaker changes throughout the video, we:

Detect which speaker is talking (from transcript diarization)
Swap speaker positions in the template
Keep the active speaker in the prominent position

The layout updates automatically as the conversation switches between speakers. We apply different trim offsets to each duplicated block based on the transcript timing—so the main speaker slot shows the person currently talking while PiP slots show the listeners.
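
The first step, deriving who is talking when, can be sketched by collapsing consecutive words from the same speaker into turns. This assumes the `TranscriptWord` shape used earlier in this post; `SpeakerTurn`, `toSpeakerTurns`, and `activeSpeakerAt` are hypothetical names.

```typescript
interface TranscriptWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
  speaker_id?: string;
}

interface SpeakerTurn {
  speakerId: string;
  start: number;
  end: number;
}

// Merge consecutive words by the same speaker into one turn.
function toSpeakerTurns(words: TranscriptWord[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];
  for (const w of words) {
    const speakerId = w.speaker_id ?? 'unknown';
    const current = turns[turns.length - 1];
    if (current && current.speakerId === speakerId) {
      current.end = w.end; // extend the ongoing turn
    } else {
      turns.push({ speakerId, start: w.start, end: w.end });
    }
  }
  return turns;
}

// Which speaker holds the floor at a given time, if any.
function activeSpeakerAt(turns: SpeakerTurn[], time: number): string | null {
  const turn = turns.find((t) => time >= t.start && time < t.end);
  return turn ? turn.speakerId : null;
}
```

Each turn boundary is a candidate point for swapping which speaker occupies the prominent slot.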

Preview, Playback & Export

Setting Up the Canvas

const container = document.getElementById('cesdk-canvas');
engine.element.attachTo(container);

Playback Controls

engine.player.play();
engine.player.pause();
engine.player.setPlaybackTime(30.5); // seek to 30.5 seconds

const currentTime = engine.player.getPlaybackTime();
const isPlaying = engine.player.isPlaying();

Syncing UI State

engine.player.onPlaybackTimeChanged(() => {
  const time = engine.player.getPlaybackTime();
  updateTimeDisplay(time);
  updateProgressBar(time / totalDuration);
});

engine.player.onPlaybackStateChanged(() => {
  updatePlayButton(engine.player.isPlaying());
});

Export Options

const exportOptions = {
  targetWidth: 1080,
  targetHeight: 1920,
  framerate: 30,
  videoBitrate: 8_000_000, // 8 Mbps
};

const blob = await engine.block.export(
  page,
  'video/mp4',
  exportOptions,
  (progress) => updateProgressBar(progress * 100)
);

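// Trigger download
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = 'shortened-video.mp4';
a.click();

// Release the blob URL afterwards so the exported video can be garbage-collected
setTimeout(() => URL.revokeObjectURL(url), 0);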

For longer videos, consider showing estimated time remaining or allowing background export. Browser export is single-threaded and blocks the tab—a 5-minute export of a 60-second clip isn’t unusual on average hardware, so user feedback is critical.
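
An estimated time remaining can be derived from the progress callback alone. The sketch below assumes progress is reported as a fraction between 0 and 1, as in the export snippet above, and that the export proceeds at a roughly constant rate. `estimateRemainingMs` is a hypothetical helper.

```typescript
// Linear extrapolation: if `progress` of the work took `elapsedMs`,
// the rest should take elapsedMs * (1 - progress) / progress.
function estimateRemainingMs(progress: number, elapsedMs: number): number | null {
  // No estimate is possible before any progress has been reported
  if (!(progress > 0) || progress > 1) return null;
  return (elapsedMs * (1 - progress)) / progress;
}
```

In practice, `elapsedMs` would be the difference between `Date.now()` at the time of the callback and the moment the export started.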

The Finished App

The user flow:

Upload → Drop a long-form video into the browser
Configure → Pick output mode (highlights/summary/cleanup) and aspect ratio (9:16, 16:9, 1:1)
Verify speakers → Match detected faces to transcript speaker names
Review clips → Browse the 3-4 AI-suggested segments, adjust if needed
Choose template → Solo speaker, sidecar, stacked, etc.
Preview → Scrub through the timeline, see exactly what you’ll get
Export → Download the final video directly from the browser

What’s Next

Ideas for Extension

Caption style controls: Custom fonts, animations, and positioning for subtitles
B-roll insertion: Automatically add relevant stock footage
Music & sound effects: AI-selected background audio
Brand templates: Custom overlays, intros, outros
Batch processing: Process multiple videos in sequence

Taking It Server-Side

Client-side processing for large files strains browser memory, and users must keep the tab open during export. A hybrid approach works better for production—upload in the background while users edit, then render on a server. You can also offload just the export step—let users build their edits in the browser, then send the CE.SDK scene JSON to your backend for faster, background rendering.

CE.SDK runs server-side with the same API. For batch processing, background jobs, or offloading rendering from user devices, see the CE.SDK Renderer for creative automation.

Resources

Made by IMG.LY with CE.SDK

One SDK to Power Them All: Announcing IMG.LY SDK

Eray — Tue, 19 Aug 2025 10:25:44 GMT

From day one, IMG.LY has been driven by a singular mission: to empower creativity at scale. By embedding our technology directly into apps, websites, and workflows, we have helped hundreds of customers, from social networks to e-commerce platforms, deliver rich, media-powered experiences to millions of users. We started out by offering two specialized SDKs, the Photo Editor SDK and the Video Editor SDK. Later on, we introduced the Creative Editor SDK (CE.SDK) as a more flexible, all-in-one creative editing solution.

As CE.SDK has matured into a foundation for all creative workflows including photo and video editing, we’re taking the next step in our evolution. Later this year, we’ll unify our offerings under a single brand:

IMG.LY SDK.

For existing customers of PE.SDK and VE.SDK, rest assured, both products will continue to receive long-term support. We’ll explain what that means below.

From Fragmentation to Unification

Over the years, our Photo Editor SDK (PE.SDK) and Video Editor SDK (VE.SDK) have each enabled customers to deliver creative editing for a specific domain (photo or video). While both excelled within their use cases, maintaining separate SDKs brought growing complexity, from divergent feature sets across platforms to duplicated codebases and asynchronous upgrade paths. At the same time, our customers’ use cases expanded, driving demand for more custom editing solutions, and this fragmentation ultimately limited our ability to move fast, innovate, and serve that growing demand.

That’s why we introduced Creative Editor SDK (CE.SDK): a unified foundation designed from the ground up to scale across creative domains and workflows. At its core is a single, shared engine capable of handling image, video, layout, text, and audio processing paired with native UI components for every major platform. With each release, CE.SDK has steadily absorbed functionality from both PE.SDK and VE.SDK, making it our most capable and extensible offering yet.

But as this convergence progressed, it also introduced a new kind of complexity: customers were often unsure which SDK to choose. By bringing everything into one consistent, cross-platform solution where every feature is developed in sync for web and mobile, CE.SDK eliminates that uncertainty and simplifies the path forward.

One SDK to Power Them All

Beyond the engine, the SDK now includes prebuilt solutions tailored to common use cases, for example photo, video, or AI-powered design tooling. These solutions help teams get started faster without compromising on customization or scalability.

As we move forward, the IMG.LY SDK will be the sole focus of our platform investments. That means:

Regular updates and new features delivered consistently across all supported platforms.
Turnkey solutions that provide an even simpler integration path, requiring only a few lines of code for setup.
Plugin support and a modular architecture for maximum extensibility.
Automation and Headless Mode for powerful backend or non-UI-driven workflows.
A unified API surface: one learning curve, one integration, fewer surprises.
And for new prospects and existing partners alike, a clear and confident message: IMG.LY SDK is the future.

Long-Term Support for PE.SDK and VE.SDK

As part of this transition, PE.SDK and VE.SDK are now in long-term support mode (LTS). This means:

They remain actively supported by a dedicated team.
They are no longer sold to new customers.
We encourage current users to begin exploring CE.SDK — soon branded as IMG.LY SDK — as the future-ready path forward.

If you’re currently using PE.SDK or VE.SDK, rest assured: we’ve got you covered. But for faster innovation and access to our latest AI-powered features, we strongly recommend evaluating IMG.LY SDK.

What’s Next?

Of course, the journey doesn’t stop here. The unification of our SDKs paves the road for many more exciting capabilities, such as:

Improve Developer Experience
Simpler setup, smarter defaults, and faster time-to-interaction, so teams can go from idea to integration in record time.
Build for Agentic AI
Designed for agent-driven workflows: responsive canvases, multimodal orchestration, and APIs tailored for autonomous actions.
Better Serve Designers & Teams
Our upcoming Studio offering will include project dashboards, version control, and multi-user permissions for collaborative work.
Personalization at Scale
While we already support variables and templating, we are exploring how to leverage AI to effortlessly create smart variations.
Customizable UI & Fast Workflow Building
We’re making it easier than ever to adapt the editor to your needs whether you’re modifying the UI, embedding your own logic, or orchestrating custom creative workflows.

Now is the perfect time to connect with our team, share your roadmap, test-drive upcoming modules, and plan your migration. Your feedback will help shape the future of IMG.LY SDK.

IMG.LY x AI

Eray — Mon, 30 Jun 2025 13:18:14 GMT

Over the past two years, one question has consistently emerged in conversations with customers: “What about AI?”
While AI promises to disrupt many industries, it remains difficult to grasp how this technology will reshape creative workflows.

Our customers look to us for guidance: What is IMG.LY’s vision? How will our SDK help them harness this wave of innovation?
Last month, we took our first significant step by launching an initial suite of generative AI-powered features for our web SDK. We’ve embedded these tools deeply into photo, video, and design editing workflows and the response from customers and prospects was overwhelmingly positive.

And this is just the beginning. An immense opportunity lies ahead for both IMG.LY and our customers to drive transformation in the creative domain through our SDK. In this post, I want to share our vision for the future of creative tools powered by AI and our SDK.

Going forward, we’re focusing on three central goals:

1. Deep Integration of AI Capabilities into Editing Workflows

The pace of AI innovation continues to be remarkably fast. New models and configurations emerge almost daily, some as APIs, others open-sourced on platforms like Hugging Face. While some offer generalist features, others provide specialized, industry-specific capabilities, such as automatically obscuring license plates, blurring faces or replacing skies for property exteriors.

The true value of these AI tools emerges not in isolation, but when they’re seamlessly combined within existing workflows.

That’s why we built a plugin system for CE.SDK.
The CE.SDK plugin system has fundamentally changed how quickly we, and our customers, can integrate new capabilities. We’ve created a way to bring AI models and agents directly into connected workflows within the editor, making AI feel native rather than bolted on.

A recent example is our integration of OpenAI’s gpt-image-1 API, which we implemented just a few days after its release. We used it to build a visual prompting workflow that takes into account all elements of a page (text, images, and annotations) to generate results based on the complete layout context.

These early successes are encouraging us to ship even faster and bring new features to our customers.

We’re now partnering with fal.ai to provide a nearly out-of-the-box experience to use any new generative AI model through their platform (more on this soon). Fal.ai excels at speed—when exciting new models emerge, they offer API access within days. This partnership ensures our users can quickly access new AI capabilities with minimal effort. More partnerships and integrations are on the horizon, bringing world-class AI APIs with intuitive interfaces directly into the editor.

2. Enabling AI Agents as Creative Collaborators

AI agents, like humans, leverage tools to accomplish tasks efficiently. CE.SDK occupies a unique position as a highly adaptable, multi-platform technology for editing various media types—including a fully documented headless version that’s perfect for programmatic control.

Thanks to CE.SDK’s fully documented API and headless architecture, AI agents can be created to navigate and operate the editor programmatically. This opens the door to agents that act as powerful scaffolders:

Generating initial designs and layouts based on brief descriptions
Automatically aligning content with brand guidelines
Transforming static designs into dynamic videos with a single command
Creating engaging short video content from prompts
Adapting existing designs to new formats and dimensions

Crucially, everything an AI agent creates remains fully editable by human users. You can refine, add nuance, and perfect the results through collaboration. AI provides the scaffolding; humans remain the tastemakers, adding the creative spark that makes designs truly exceptional.

3. AI-Powered SDK Configuration and Customization

AI agents are already actively involved in building, refining, and optimizing software tools—a trend that’s accelerating rapidly. This shift directly impacts developer experience and points to an exciting future where our SDK serves not just human developers, but AI agents as well.

To facilitate this interaction, we’ve already taken steps like making our documentation available in LLM-friendly formats. But this is just the beginning of our journey toward radically improving the experience for both developers and AI agents.

Our ultimate goal is conversational configuration. Imagine describing your requirements in plain language, and having an AI agent handle the rest:

Configuring the SDK’s visual aesthetics to match your brand
Setting up custom functionality and workflows
Integrating media libraries and plugins
Optimizing performance for specific use cases

This isn’t just about making development faster. It’s about democratizing access to powerful creative tools, allowing anyone to build sophisticated editing experiences—regardless of technical expertise.

Looking Forward

We believe that AI collaboration represents a transformative shift in creative technology, empowering users and developers to achieve extraordinary outcomes. At IMG.LY, we’re committed to being at the forefront of this exciting journey. Our vision extends beyond simply adding AI features: we’re reimagining how creative tools are built, configured, and used in an AI-augmented world.


AI-first Visual Editor using GPT-4o’s gpt-image-1 Model

Eray — Mon, 05 May 2025 20:58:07 GMT

What We Built

We integrated OpenAI’s new gpt-image-1 API (from GPT-4o) directly into our fully functional visual editor, CreativeEditor SDK (CE.SDK), enabling generation, editing, and refinement of images without ever leaving your creative workflow.

Open AI Editor Demo Page

From Simple Image Generation to Visual Prompting on a Canvas

Inside the editor, users can now:

Generate Images
Use prompts to generate images from scratch.

Generate Images from Visual Prompts
Turn full compositions—images, text, and annotations—into fresh visual content. Just select your page and let AI handle the rest, as shown in the video.

Reimagine Images & Text
Edit existing images and text with prompts to iterate faster and create variants.

Create Incredible Compositions
Combine generated and uploaded images into complex compositions.

Each step builds on the last, evolving from basic generation into true visual prompting powered by multiple input modes, all within one canvas. Check out the live demo here.

How We Built It

We built this integration using our CE.SDK and its flexible plugin system, designed from the ground up to support AI-first creative workflows.

This approach lets developers plug in any model or API—text, image, video, or audio—and run them all in one seamless editing flow. Whether you’re using OpenAI, Stability, or an in-house model, CE.SDK gives you the tools to bring it into the visual workflow natively.

🔗 Check out our AI Editor.
📘 Learn how to integrate AI into CE.SDK.

Why This Matters

Generative AI’s full potential isn’t unlocked by prompting alone; it’s unlocked when embedded into real-world creative workflows.

Designers, marketers, and content teams don’t just need outputs; they need control, iteration, and context. By bringing AI directly into the canvas where assets are created and edited, we turn generative models into tools for actual production, not just ideation.

This shift enables:

Creative work in context: No switching between ChatGPT and design tools.
Real-time augmentation: Prompt, edit, refine in place.
Scalable content generation: Automate localization, personalization, and variants.
Multimodal orchestration: Use visuals, layouts, and annotations as inputs.

It’s a step toward making multimodal AI usable for real design workflows, not just concept generation.

Integration & Feedback

This linked demo is rate-limited. If you would like to test more extensively, or if you are interested in giving the AI editor a spin inside your own app, you can get started with our documentation.

We’d love your feedback: any thoughts, questions, and ideas are welcome!
Reach out to us.


How to Build a Short Video Generator Using CE.SDK

Eray — Tue, 04 Mar 2025 12:57:47 GMT

In this post, I present a simple cookbook for building an AI-based video generator app, as described in my previous blog post. We’re using a combination of different APIs to generate text, audio, and images, and will compose and render the final video using the headless CreativeEditor SDK, which we also call the Creative Engine.

This cookbook showcases the powerful capabilities of our client-side Creative Engine. The engine enables real-time video generation directly in the browser, eliminating the need for server-side processing. What sets this approach apart is its ability to produce editable source files that can then be opened with CreativeEditor SDK.

This approach gives users complete control over every aspect of their video, from text and images to animations and overall composition. This means your users can refine and perfect their content even after the initial generation.

Get the complete code on GitHub.

Scope

This tutorial focuses on building an app with a simple UX:

Input your keywords/topics
Choose between landscape or portrait format
Generate and preview your video
Edit the video in the CE.SDK video editor

The app flow we will create:

The post-editing we will get with CE.SDK:

Technical Overview

The app follows three major steps to generate the video.

A script is generated based on user input. The output is a structured XML file.
The XML script is parsed to extract text and image information. The extracted data is then used to generate audio and image files through third-party APIs.
All assets are loaded into the Creative Engine. This is where the composition, including animation and effects, takes place. The Creative Engine then exports a video and a scene file, which can be edited with the Creative Editor.

Setup

We’ll use a boilerplate with Next.js, React, TypeScript & Tailwind. Make sure you retrieve all necessary keys:

Anthropic (LLM)
ElevenLabs (text to speech)
fal.ai (text to image)
IMG.LY CE.SDK – Retrieve a free trial key

# Required environment variables
NEXT_PUBLIC_ANTHROPIC_API_KEY=your_claude_api_key
NEXT_PUBLIC_FAL_API_KEY=your_fal_ai_key
NEXT_PUBLIC_ELEVEN_LABS_KEY=your_eleven_labs_key
NEXT_PUBLIC_IMG_LY_KEY=your_img_ly_key

Implementation

1. Generate The Script

In this step, we’ll focus on generating the initial prompt and then passing it to the Anthropic API.

As with many things involving LLMs, there are many different strategies for structuring the initial prompt. From experience, the best results come from providing examples of the desired output. We’ve decided to use an XML document; it can be easily parsed later on and is less error-prone than JSON.

We now define the structure of how information should be saved in the XML.

<video>
  <group part="intro">
    <element>
      <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.2">
        Did you know these fascinating facts about pyramids?
      </text>
      <image>Ancient Egyptian pyramid at sunset</image>
    </element>
  </group>
  <group part="content">
    <element>
      <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.2">
        The Great Pyramid was the tallest structure for over 3,800 years!
      </text>
      <image>Great Pyramid comparison to modern buildings</image>
    </element>
  </group>
  <group part="outro">
    <element>
      <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.4">
        The pyramids continue to reveal their secrets to this day...
      </text>
      <image>A giant 3D question mark hovering over the pyramids</image>
    </element>
    <element>
      <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.4">
        Stay curious - there's always more to discover!
      </text>
      <image>Pyramids under starry night sky</image>
    </element>
  </group>
</video>

In this tutorial, we’ll focus only on the trivia format shown in this example. For later iterations, however, I’m planning to implement different content formats (e.g., trivia, quiz, recipe). Each of these formats will have its own example XML. Therefore, I’m nesting the XML in a simple format object to scale this up easily later.

interface Format {
  name: string;
  example: string;
}

const formats: Record<string, Format> = {
  trivia: {
    name: 'Trivia',
    example: `<video><group>...</group></video>` // Add example from above
  }
};

Using this format object with the example, we can now generate the prompt.

What do we need for this prompt?

Description of the task
Description of the desired output, incl. an example for the specified format
Topic as provided by the user

The topic provided by the user is passed to the function as a string.

export const createVideoScriptPrompt = (
  topic: string,
  formatName: string = 'trivia'
) => {
  const format = formats[formatName];
  if (!format) throw new Error(`Format ${formatName} not found`);

  return `
Format: ${format.name}
Topic: ${topic}

Please write a detailed script for this short video, considering the specified format and topic.
Include an introduction, main content sections, and an outro. Each section should have an image.
Structure the script as an XML Document with clear sections, descriptions for the images.
The image description should be written as a prompt. This prompt will be used to generate an image.
Put the description between the image tags. The video shouldn't be longer than 30 seconds.

Example format:
${format.example}`;
};

2. Generate All Assets

In the second step, we’ll parse through the LLM response, which should be the XML. We’ll create a simple parsing function to extract all text information that should be sent to text-to-speech and text-to-image AIs.

Please note that all these steps can be easily streamlined by using AI-assisted coding. Just provide the example XML as input and your desired output.
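To make the parsing step concrete, here is a minimal sketch of such a function. It is not the code from the project: the names `ScriptElement` and `parseVideoScript` are hypothetical, and the regex-based approach assumes the LLM returns well-formed XML in exactly the shape of the example above (no nested or self-closing tags). In the browser, `DOMParser` would be a more robust choice.

```typescript
// One <element> of the script: narration text, voice, and image prompt.
// (Hypothetical shape, chosen for this sketch.)
interface ScriptElement {
  part: string; // value of the enclosing <group part="...">
  text: string;
  voiceId: string | null;
  imagePrompt: string | null;
}

// Extracts every <element> from the script XML.
// Assumes the flat structure of the example: <video> > <group> > <element>.
function parseVideoScript(xml: string): ScriptElement[] {
  const elements: ScriptElement[] = [];
  const groupRe = /<group\s+part="([^"]*)"\s*>([\s\S]*?)<\/group>/g;
  let group: RegExpExecArray | null;

  while ((group = groupRe.exec(xml)) !== null) {
    const part = group[1];
    const groupBody = group[2];
    const elementRe = /<element>([\s\S]*?)<\/element>/g;
    let element: RegExpExecArray | null;

    while ((element = elementRe.exec(groupBody)) !== null) {
      const body = element[1];
      const text = body.match(/<text\b([^>]*)>([\s\S]*?)<\/text>/);
      if (!text) continue; // skip elements without narration

      const voice = text[1].match(/voiceId="([^"]*)"/);
      const image = body.match(/<image>([\s\S]*?)<\/image>/);

      elements.push({
        part,
        text: text[2].trim(),
        voiceId: voice ? voice[1] : null,
        imagePrompt: image ? image[1].trim() : null,
      });
    }
  }
  return elements;
}
```

Each returned entry carries one text to send to text-to-speech and one prompt to send to text-to-image.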

API Calls
When we find text and image tags in the XML, we call the API functions for text-to-speech and text-to-image. For this example, I’m using the ElevenLabs and fal APIs. You will find all API calls in api.ts.

Since the LLM generated a script that includes image prompts, make sure to pass them to the API.

export async function generateImage(prompt: string): Promise<string | null> {
  try {
    console.log('Generating image for prompt:', prompt);
    const result = await fal.subscribe('fal-ai/flux/dev', {
      input: {
        prompt: prompt,
        // fal's Flux endpoints take the aspect preset as `image_size`
        image_size: 'portrait_16_9',
      },
    });
    const typedResult = result as { images: { url: string }[] };
    console.log('Image generation successful. URL:', typedResult.images[0].url);
    return typedResult.images[0].url;
  } catch (error) {
    console.error('Error generating image:', error);
    return null;
  }
}

Timestamps
Last but not least, we need timestamps: how long is each segment? This is critical information for composing the video. Luckily, it is easy to derive: each scene lasts as long as the generated audio for its segment. Most TTS services, including ElevenLabs, can return timestamps alongside the audio file. These are typically character-based, so we first compute a timestamp for each word and then the duration of the entire text section.
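The conversion from character timestamps to word timestamps can be sketched as follows. This is an illustration, not the project's code: `CharAlignment` mirrors the alignment shape I assume the TTS response to have (parallel arrays of characters, start times, and end times in seconds), and `toWordTimestamps` and `totalDuration` are hypothetical helper names.

```typescript
// Assumed shape of the character-level alignment returned by the TTS API.
interface CharAlignment {
  characters: string[];
  character_start_times_seconds: number[];
  character_end_times_seconds: number[];
}

interface WordTimestamp {
  word: string;
  start: number;
  duration: number;
}

// Groups consecutive non-whitespace characters into words.
// A word starts with its first character and ends with its last one.
function toWordTimestamps(alignment: CharAlignment): WordTimestamp[] {
  const words: WordTimestamp[] = [];
  let current = '';
  let wordStart = 0;
  let wordEnd = 0;

  for (let i = 0; i < alignment.characters.length; i++) {
    const ch = alignment.characters[i];
    if (/\s/.test(ch)) {
      // Whitespace closes the current word, if there is one.
      if (current) {
        words.push({ word: current, start: wordStart, duration: wordEnd - wordStart });
        current = '';
      }
      continue;
    }
    if (!current) wordStart = alignment.character_start_times_seconds[i];
    current += ch;
    wordEnd = alignment.character_end_times_seconds[i];
  }
  // Flush the last word when the text does not end with whitespace.
  if (current) {
    words.push({ word: current, start: wordStart, duration: wordEnd - wordStart });
  }
  return words;
}

// The segment lasts until the end time of its final character.
function totalDuration(alignment: CharAlignment): number {
  const ends = alignment.character_end_times_seconds;
  return ends.length > 0 ? ends[ends.length - 1] : 0;
}
```

The word timestamps are useful for captions, while the total duration determines how long the scene stays on screen.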

Ready For The Next Steps
For convenience, all generated asset URLs are saved in a VideoBlock object. The duration of the VideoBlock is the duration of the audio, as calculated above.

interface VideoBlock {
  text: string;
  imageUrl: string | null;
  audioUrl: string | null;
  startTime: number;
  duration: number;
  wordTimestamps: Array<{ word: string; start: number; duration: number }>;
}
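The `startTime` of each block follows from the durations if we assume the segments play back to back without gaps. A minimal sketch under that assumption (`layoutSequentially` is a hypothetical helper, not part of the project):

```typescript
// Places blocks one after another on the timeline.
// Assumption: no gaps or overlaps between segments.
function layoutSequentially<T extends { duration: number }>(
  blocks: T[]
): Array<T & { startTime: number }> {
  let cursor = 0; // running position on the timeline, in seconds
  return blocks.map((block) => {
    const placed = { ...block, startTime: cursor };
    cursor += block.duration;
    return placed;
  });
}
```

With durations of 2, 3.5, and 1 seconds, the blocks start at 0, 2, and 5.5 seconds, and the whole video runs for 6.5 seconds.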

3. Generate The Video

We have everything together now: The completed XML with timestamps, duration, and all assets. It’s now time to generate the video using the creative engine.

Let’s first add an empty container in our HTML that will be referenced for initiating the creative engine.

{/* Add container for Creative Engine */}
<div id="cesdk_container" className="invisible mt-8 rounded-lg bg-gray-100" />

We can now initialize the engine. Use this code snippet from our documentation.

We’ll then set up a function that creates a simple composition using the provided VideoBlocks. The engine requires you to first create a scene, append a page to the scene, and then create tracks within the page. The tracks are basically the layers in the timeline. I recommend setting one track as a background track using the following snippet:

// Set video track as a background track by connecting the page duration to the video track
engine.block.setAlwaysOnBottom(videotrack, true);
engine.block.setPageDurationSource(page, videotrack);
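Put together, the scene setup described above might look like the following sketch. The `EngineLike` interface only lists the calls used here, and I assume the real engine instance satisfies it. The function name `createTimeline` and the second track, `captionTrack`, are hypothetical.

```typescript
// Minimal subset of the engine API used in this sketch.
// Assumption: the CE.SDK engine instance matches this shape.
interface EngineLike {
  scene: { createVideo(): number };
  block: {
    create(type: string): number;
    appendChild(parent: number, child: number): void;
    setAlwaysOnBottom(block: number, enabled: boolean): void;
    setPageDurationSource(page: number, block: number): void;
  };
}

function createTimeline(engine: EngineLike) {
  // 1. Create the video scene and append a page to it.
  const scene = engine.scene.createVideo();
  const page = engine.block.create('page');
  engine.block.appendChild(scene, page);

  // 2. Create the background track that drives the page duration.
  const videotrack = engine.block.create('track');
  engine.block.appendChild(page, videotrack);
  engine.block.setAlwaysOnBottom(videotrack, true);
  engine.block.setPageDurationSource(page, videotrack);

  // 3. Add a second track on top, e.g. for captions (hypothetical).
  const captionTrack = engine.block.create('track');
  engine.block.appendChild(page, captionTrack);

  return { scene, page, videotrack, captionTrack };
}
```

Images and audio for each VideoBlock would then be appended to these tracks as child blocks.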

The Creative Engine provides powerful API calls to style & manipulate blocks in many ways. Here is an example of how we can animate the images with a slow zoom effect:

const imageZoomAnimation = engine.block.createAnimation('crop_zoom');
engine.block.setInAnimation(image, imageZoomAnimation);
engine.block.setDuration(imageZoomAnimation, block.duration);
engine.block.setBool(imageZoomAnimation, 'animation/crop_zoom/fade', false);

Export The Video & Scene
Exporting the video is easy. Just pass the page to the export function. In our example, we’re also saving the scene file so we can edit the video later.

// Export video
const progressCallback = (
  renderedFrames: number,
  encodedFrames: number,
  totalFrames: number
) => {
  console.log(`Progress: ${Math.round((encodedFrames / totalFrames) * 100)}%`);
};

const blob = await engine.block.exportVideo(
  page,
  'video/mp4',
  progressCallback,
  {}
);

// Save scene to string
const sceneData = await engine.scene.saveToString();

// Create scene blob
const sceneBlob = new Blob([sceneData], {
  type: 'text/plain',
});

4. Add A Video Editor

The last step is to add the video editor for post editing and pass the scene file. With CE.SDK, this effort is reduced to adding a few lines of code. In the init function, we’re configuring the editor and adding callbacks for the export:

const initEditor = async () => {
        const config = {
          license: 'A-O53TWXK5bfyconUx7e53S5YU7DzjuGpMAH5vvKjLd0zBa6IhsoF7zChy1uCVbj',
          userId: 'guides-user',
          theme: 'dark',
          baseURL: 'https://cdn.img.ly/packages/imgly/cesdk-js/1.44.0/assets',
          role: 'Creator',
          ui: {
            elements: {
              view: 'default',
              panels: {
              },
              navigation: {
                position: 'top',
                action: {
                  save: true,
                  load: true,
                  close: true,
                  download: true,
                  export: true
                }
              },
              dock: {
                iconSize: 'normal', // 'large' or 'normal'
                hideLabels: true // false or true
              }
            }
          },
          callbacks: {
            onUpload: 'local',
            onSave: (scene: string) => {
              const element = document.createElement('a')
              const base64Data = btoa(unescape(encodeURIComponent(scene)))
              element.setAttribute(
                'href',
                `data:application/octet-stream;base64,${base64Data}`
              )
              element.setAttribute(
                'download',
                `video-${new Date().toISOString()}.scene`
              )
              element.style.display = 'none'
              document.body.appendChild(element)
              element.click()
              document.body.removeChild(element)
            },
            onClose: () => {
              onClose();
            },
            onLoad: 'upload',
            onDownload: 'download',
            onExport: 'download'
          }
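        }

        // Assumption: CreativeEditorSDK is imported from '@cesdk/cesdk-js' and
        // sceneData is the scene string saved in step 3.
        const cesdk = await CreativeEditorSDK.create('#cesdk_container', config)
        await cesdk.engine.scene.loadFromString(sceneData)
}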

Conclusion

By following this cookbook, you can streamline the process of AI-generated video creation, making it fast and efficient. This method is especially useful for content creators, educators, and marketers looking to automate video production while maintaining creative control.
Next, try experimenting with video styles, refining AI scripts, or exploring advanced editing.
Feel free to check out the GitHub repo and share your creations with us on X. Happy creating!

3,000+ creative professionals gain early access to new features and updates—don’t miss out, and subscribe to our newsletter.

How I Built a Short Video Generator with AI & CE.SDK in One Day

Eray — Thu, 09 Jan 2025 11:27:13 GMT

Here’s the crux of product development in the age of LLMs: how much can AI truly accelerate the development process?

We have seen videos of solo developers building small apps entirely with AI with just a few prompts. But how does it scale to more complex development projects? As LLMs rapidly evolve, their scope and impact will only increase.

That’s why I regularly challenge myself to build a small project with the help of AI. I’m a prime candidate to test the AI productivity boost: a jack-of-all-trades (and a master of none) with a background in both design and engineering, yet no hands-on experience in the past five years. My latest challenge? Build a web-based short video generator within one day.

In this post, I’ll share the most intriguing takeaways from tackling this project.

Why a Short Video Generator?

Why focus on this idea? It’s simple: to ride the wave of a new trend. A format called “faceless” short videos is gaining traction among creators on platforms like YouTube and TikTok.

https://www.youtube.com/embed/DfQ3fhqfKVc?feature=oembed

What’s fascinating about these videos is their automation: an LLM generates a script, which is then transformed into speech, images, and text assets using various AI services. These assets are automatically assembled into a cohesive video.

The general concept is compelling: It’s still generative content, but mixed with classic video composition techniques. This approach offers greater accuracy, consistency, and control over pure generative AI.

The potential to automate video production at this scale is exciting. Add its relatively low complexity and high production value, and it became the perfect topic for my challenge.

Enter CE.SDK

Another reason I chose this challenge was its compatibility with CE.SDK, our design and video editor library. CE.SDK offers a robust editing toolkit that integrates into any product with just a few lines of code. Its features, like headless mode, are ideal for automating workflows like video generation.

Most faceless video services use React-based video generation and achieve fair results. However, using CE.SDK instead of a React-based library could improve the overall experience in three critical ways:

Editable Outputs: This is huge. Full automation often needs human adjustments for fine-tuning. CE.SDK enables automated video generation while allowing manual refinement of the results.
Enhanced Visual Quality: CE.SDK has its own rendering pipeline, allowing for more nuanced visual effects and animations. When you’re competing against others in this space, it can make a huge difference if you’re able to produce higher fidelity in the visual output.
Visual Design Workflow: Create design components or even entire templates visually, and then use them via code. This authoring workflow can be extremely helpful in creating rich, interesting designs for the generated videos.

The Ground Rules

To keep the challenge focused, I set strict rules:

Time Limit: Spend no more than 12 hours on the challenge.
No Manual Coding: Avoid writing any code yourself—everything should be built through conversations with AI.
Trust the AI: Do not read or analyze code generated by the AI. Rely entirely on its decisions.
Skip External Research: Do not read or explore the APIs you intend to use. Instead, provide links to the AI and let it determine how to use them.
Compare AI Performance: Alternate Claude Sonnet 3.5 and ChatGPT o1 for code generation to evaluate which performs better.

The Tools & Workflow

Code Editor: Cursor
Built on VSCode’s foundation, Cursor stood out as the only editor offering both an integrated chat interface and the ability to switch between different LLMs. However, with GitHub’s recent significant updates to Copilot, I’ll switch to VSCode with Copilot for future challenges.

UI Prototyping: Claude Artifacts
Rather than building the entire project in my code editor, I chose to prototype the UI directly through Claude’s web interface. The benefits were immense:

Instant results: When creating an artifact, Claude writes and compiles the code automatically while leveraging essential UI libraries and components. This automation eliminates setup time and technical overhead, allowing me to focus purely on design iterations.
Instant variations: Claude enables rapid prototyping through parallel conversations. When a design direction didn’t quite work, I could simply start a fresh conversation with modified requirements and evaluate a new prototype. This approach helped me develop three viable concepts quickly, a pace that would have been impossible in a traditional code editor.
Quality of execution: Claude transforms rough concepts into polished, intuitive interfaces. Its suggestions often surpassed my initial ideas, offering sophisticated solutions I hadn’t considered.
Keep it clean: By prototyping outside the code editor, I kept the main project’s codebase clean and focused. This separation prevented the accumulation of experimental code and maintained the clarity of our primary development environment.

Quickly prototype your interface with Claude Artifacts.

APIs
Key APIs used in the project included:

Script Generation: Claude Sonnet 3.5 vs various ChatGPT models.
Image/Video Assets: Fal.ai Flux models.
Speech Synthesis: ElevenLabs.

Building the App: Divide and Conquer

After prototyping the UI, I started chatting with the LLM inside the code editor so that it could code the app. To work with the AI efficiently, I followed a divide-and-conquer approach. Rather than simply asking it to “build me a video app,” I broke down the problem into manageable steps:

Generate a video script
Create an AI prompt that includes user input and examples of the desired output format. Pass this prompt to the LLM API.
Parse the script to generate assets (speech, images, text)
Parse the LLM’s response to extract image prompts and speech paragraphs. Send these to their respective APIs.
Compose the final video
Load all the generated assets into a predefined template to generate the finished video through the CE.SDK library.

After completing these steps, I was finally able to generate my first fully automated videos! With a few more tweaks and additions, I had an MVP ready within twelve hours.

The final result: A Short Video Generator

There are still some missing features, partly because I spent a significant amount of time refining the prompt to generate the video script. I also had to bend the rules occasionally—sometimes the LLM would hit a wall, and I had to read or write small snippets of code.

Key Takeaways

Engineering Knowledge Is Essential
You should have some engineering background to achieve the AI productivity boost in development.

AI doesn’t solve everything for you. You are still the architect. You provide a lot of input and guidance. AI often needs to be pointed to the right strategy. Foundational knowledge of computer science is hugely advantageous for working with AI effectively.
As mentioned, I had to read and write a few lines of code myself. Without coding experience, I probably would not have been able to progress at the points where the LLM got stuck.
The getting-started experience is nowhere close to novice-friendly. How do you get started with a new project in a code editor that actually requires you to do the setup manually? My workaround was to create an empty project and then ask the LLM to walk me through using a boilerplate for React. Again, this is engineering knowledge; any novice would have hit a wall at this point already.

Claude Outperformed ChatGPT
Claude was the clear winner in the side-by-side comparison, for three reasons:

Claude Artifacts was a game changer for UI prototyping.
It was generally better at writing and understanding code. This is difficult to quantify, but in some cases Claude fixed the mess ChatGPT had left in the code.
Claude can process URLs, which makes working with APIs much smoother.

Who would have thought new LLMs would catch up to OpenAI so quickly after they released the first version of ChatGPT?

Complexity Slows AI
The more code in my project, the slower the overall progress. LLMs struggled with the growing complexity. Their context windows filled more quickly, and their responses became increasingly unreliable. At some point, it becomes extremely difficult to make architectural changes, especially if this affects multiple parts of the app. When trying to fix errors, you’ll often find yourself in a whack-a-mole game. While the AI would resolve one issue, it would inadvertently introduce new problems elsewhere, creating an endless loop of fixes and regressions.

Ultimately, the time invested in this challenge was well worth it. While LLMs can’t build products end to end on their own, they can significantly streamline product development when paired with the right human collaboration. The real question is whether development teams are ready to adapt their habits and explore new workflows to boost productivity.

Next Steps

This challenge has inspired me to refine and expand on this project. Future iterations will focus on harnessing CE.SDK’s unique features to push the boundaries of automated video generation.

Stay tuned for part two of this series—there’s much more to explore!

UPDATE: Read part two, a cookbook on how to build your own short video creator!


Unleash Creativity with CE.SDK’s New Plugin System

Daniel — Mon, 03 Jun 2024 08:42:25 GMT

At IMG.LY, we have always believed that a superb design editor should be effortlessly customizable and extensible. We are thrilled to roll out a brand-new Plugin system for CE.SDK in the upcoming months—to take creative editing and feature development to the next level.

From one-click features like background removal or vectorizers, through smart design tools like QR codes or subtitle generators, to deeply interactive features like generative AI for text and images: customers will soon be able to use or build all of these tools.

Additionally, our upcoming Plugin system will bring you unparalleled autonomy by making feature development for our SDK accessible and offering extensive options to reconfigure our editor’s UI.

Start exploring our Plugin System rollout now to immediately benefit from upcoming features.

Built for Modification

Let’s dive deep into some of the opportunities CE.SDK plugins help unlock.

Unlocking the AI Revolution

While the AI transformation is already fully underway, much of the tech is still not very accessible to product builders, often requiring deep technical knowledge to get started. At the same time, AI features become significantly more valuable when integrated with other editing functionalities in workflows or automation. With our plugins, we aim to make it effortless to leverage innovation and put it to use.

Boosting Customer Autonomy

Key to our success is providing maximum flexibility and autonomy to our customers about product decisions. Ultimately, you shouldn’t depend on our product roadmap; rather, you should be able to add features when you like. While our SDKs are already highly configurable, plugins allow tailoring the whole user experience not only on a look & feel level but through custom functionalities and editing experiences.

Accelerate Product Expansion

Many ecosystems witnessed explosive growth in added value to the user after releasing plugin mechanisms. Currently, only IMG.LY core developers can contribute to the SDK. We have started to extend this to solution engineers and even designers on our team who don’t have much knowledge of the inner workings of the SDK. Ultimately, we will push this more and more into a community of contributors, making the community’s innovation accessible to everyone.

Key Concepts

Three important concepts have driven the development of our Plugins:

Customizable Menu Bars
We are extending our API so that it allows easy hooking into various parts of the UI. Our editor has key components like the inspector, toolbar, and on-canvas menu. These are now all accessible through an API, so you can hook your feature anywhere in the editor.

UI Building Blocks to Provide Consistency
To reach a high level of consistency and speed up time for development, we will be providing out-of-the-box UI components such as buttons, sliders, text inputs, etc.

Escape Hatches
From experience, we know that sometimes unique functionality needs unique solutions, so we have added escape hatches to add custom elements via HTML whenever needed.

What Can You Build with our Plugin System?

Let’s explore some potential use cases of the plugin system.

Custom Actions

Adding custom actions is a great option to make simple third-party APIs accessible within the editor. This can be one-click edits such as background removal, vectorizers, or auto-enhancement for images. You can also add custom actions for text in combination with Large Language Models (LLM) to provide features like autocorrection and improved writing, etc.

Custom Tools

Some custom functionality will require more than just a single button, e.g., to generate AI images, background patterns, QR codes, or maps. In these cases, you’ll require sliders, text input, drop-downs, and many other UI elements. With plugins, you can easily create panels with your own UI to bring any custom tool to life.

Custom Assets & Presets

Apart from building custom tools, you can also bundle and group effects into presets and make them accessible in a custom panel. This is especially useful to simplify the design process or create standard design components: for example, providing beautiful text presets will enable your users to create instantly great text designs without any design knowledge.

Custom Libraries

Integrate third-party libraries, such as Unsplash, Getty Images, Pexels, or your own.

Custom Editor Behavior

Some of our customers asked us how they could move a functionality from one place in the editor to another. Let’s say you wish to move the function ‘move to front’ from the inspector to the canvas menu. This is not a problem! You can do this by using the internal API endpoints of our editor.

As for custom editor behavior, you can do far more than just move functionality. Here is a demo of layer lists we built as a plugin.

Custom User Feedback

Additionally, you can enhance the canvas with overlays, which are useful for providing alerts, instructions, or feedback directly on the canvas.

What’s Next

We are now rolling out Plugins and building an initial set of features through Plugins ourselves—available for the web first, and mobile SDKs will follow. Keep your eyes peeled for our next releases and don’t hesitate to get in touch with us to learn more about plugins, and how your product benefits from integration without losing time and resources.


Work Where You Work Best

Daniel — Tue, 23 Jun 2020 13:01:57 GMT

When the coronavirus pandemic started spreading in Germany, our team started working from home, weeks before it became mandatory. This step seemed comparatively easy for us, like for many other tech-driven companies.

From the get-go, img.ly has been a company of digital natives, and as such we are used to digital tools for day-to-day work. Everyone already worked with laptops instead of desktop computers, we used online collaboration tools most of the time, communicated with tools like Slack, and relied on cloud services whenever possible. So when the pandemic hit, we were perfectly equipped to work from home.

As for everyone, there were a few hiccups in the beginning. Many of our teammates missed day-to-day social contact. We all like to chit chat once in a while, enjoy a coffee break in our neighborhood or have lunch together in one of the many local restaurants.

From day one, we added more video calls and played online games to compensate for the missing real-life interactions. We also tried to recreate our office atmosphere by introducing voice channels aside from our other communication channels.

We realized that the newly introduced voice channels were used extensively in the first few weeks but became vacant later on. Instead, people relied on Slack or Zoom to hop on a call whenever necessary or wrote down problems to discuss with the team asynchronously over time.

After two to four weeks everything seemed to be back at what we would call normal operation. Everyone was back on track and we were making progress towards our goals.

While it is obvious to us that the current situation is pretty special, we realized that it might be a good time to reassess how we work and what defines img.ly.

How can we use the current situation to come out stronger when it’s over?
Somehow it bothered us how everyone talked about things getting back to normal. Obviously, we want our social contacts back, go to restaurants, bars and meet with friends, but what about day-to-day work? Was it better before? Or maybe, does it feel better now?

Work when you work best

Before this new situation, we always had flexible working hours and relied on everyone deciding for themselves when they work best. While everyone started at different times during the day, we all came to the office eventually. Working from home was possible but only used as an exception when needed.

The reason was simple: we preferred brief, sometimes spontaneous communication between teams and team members in the office during the day, which made us feel quick in our decision making. Over the years, with a growing team and office space, we sat together to discuss what “our” version of a good working environment was, eventually realizing that there is no one-size-fits-all solution, as everyone has their own needs. That’s why we always thought about different areas in the office that are built to amplify either trivial or deep work. The “virtual office” has similar requirements: We introduced Slack statuses like “Focus work” or “Out for a break” so that teammates can signal that they will get back to questions later. Eventually, everyone seemed to have more control over their own time while being productive, and some are even more productive than before.

Work where you work best

As said, some tensions came up. People were missing real-life contacts, and it seemed that communication between teams could be improved. While the first problem remains a challenge, the latter one got our attention quickly. Before jumping to conclusions, we asked ourselves if this was a new problem, only to realize that it had been there before, just mitigated by the daily run-ins in the office. This resulted in people feeling informed but eventually missing some important details. However, these occasional run-ins in the office didn’t happen anymore, and as such, details would now easily get lost.

Best practice would be to write down all information, record a video, or use other media that can be accessed asynchronously at any time by all team members. However, most saw this as a chore imposed by management, and as such, it was easily neglected or simply forgotten. Even if the information was written down, it just landed somewhere in our Google Drive, Asana, Jira, or some other tool. In principle, it was available to everyone, but only some found and read it.

But now everyone needs this information as nobody has the occasional run-ins anymore, which creates a general urge to get more information. Instead of lengthy transcripts, people started sharing their results in Slack channels and updating their daily to-dos in detail. Most importantly, it created awareness and partially converted a chore into something meaningful to the benefit of everyone. However, it is still far from perfect and we will continue to work on that.

Making a temporary state a permanent solution

The taste of increased flexibility in your life and productivity at work leads us to consider making our temporary state a permanent solution.

There are obvious strategic advantages for a tech company: The talent pool increases with a radically increased search radius. Also, we need less office space, a huge pain point we had recently.

Still, there is a yin to the yang, and nothing comes on a silver platter. We now have to filter our talent pool for people that bring the right mindset in terms of expertise, structure, and seniority for remote work. Believe it or not, there are not so many left when applying such filters.

We have to heavily invest in bringing our people together on a very regular basis because we are still a people business, and in-person interactions were key to our company’s success. We know about the perks of being an office-first company. According to Andreas Klinger, “Head of Remote” at AngelList, remote work is great for iterations, while in-person meetings foster innovation. We can see that. It’s all solvable with the right process and mindset, but it’s important to be diligent about this.

Thankfully, this topic has already been extensively covered with conferences, articles, and pioneer companies such as GitLab, Basecamp, and PSPDFKit. I can always recommend them as a reference when you start thinking about becoming a remote company. Of course, it’s easier when you’re a remote company from day one – nevertheless, we are convinced that this change is an opportunity for us to grow and thrive. So in the end, we made the best of the pandemic.

Building the Creative Engine of the Digital World

Daniel — Wed, 22 Apr 2020 12:42:48 GMT

After we released our first Software Development Kit (SDK) for photo editing in 2015, it quickly found its way into the hands of thousands of application developers. The demand allowed us to ramp up our efforts to expand our SDK to cover more platforms, add features and launch VideoEditor SDK.

At the same time, two very important things happened to us.

First of all, we gained experience in building rendering engines, while learning in-depth about the requirements of processing photos for professional and semi-professional use-cases, on every platform. Our team leveled up big time, to a degree where we are confident there aren’t many teams with such domain-specific expertise.

Secondly, we gained critical insights into the visual creation process across many different industries, ranging from print services and social networks to marketing tools. In these processes, photo editing is only one part of the creative flow. There are topics like layouting, animation, generative design and many more, which go beyond the mere editing of an image, but are still essential to the business case.

Our vision is to fill the obvious gap, providing a holistic solution that allows users to map entire creative flows into a tool. By leveraging our technology our customers will become a lot faster and more competitive in their markets.

Six months ago, we decided to build the foundation for this vision and kicked off the development of UBQ, an engine empowering a new generation of tools for creativity.

It’s time to shed some light on our creative engine.

What is a Creative Engine?

Before we go forward, let’s establish what we mean by a creative engine. In general, an engine is a platform that allows developers to create tools that support creators in performing specific tasks such as editing photos or creating visually appealing posters or social media posts. From our perspective, a creative engine will ease the development of design tools for various niches, and as such it must be easily adaptable to different usage and user requirements.

To ensure that we don’t have to reinvent the wheel for each creative tool, a creative engine provides building blocks especially targeted to creative output. Some examples for these are:

High-quality image adjustment, editing, filtering, and manipulation.
Automated layouting of design elements on a canvas.
High-quality text rendering and layout algorithms.
Support for various industry-standard assets, image, and video formats.

However, these are only some of the basic features – there are far more advanced things that a creative engine can provide.

On the one hand, this may include scripting tools to ease the generation of generative art or automation for tedious design tasks via automated image segmentation.

On the other hand, there are also tools that not only help the individual creator but ease the cooperation and collaboration between multiple creators and editors. The details may vary, but it can be as simple as defining formats to exchange assets and designs between creators or even allow the simultaneous editing of their creation.

Introducing the UBQ Creative Engine

With UBQ we strive to lay the groundwork for modern visual and communication design.

We believe that these days a good design tool has to be there for you every step of the way: it has to be ubiquitous. Let us explain what this means for us.

Usable on any Platform

In this sense, we want it to be easily accessible to everyone, and as such the name UBQ stands for availability across multiple platforms. It is conceptually web- and mobile-first but is also available for classic operating systems such as macOS, Linux, and Windows, while being easily portable across various platforms.

Unified Rendering, Generative, and Inference Engine

While our previous products focused on processing a single image or video, our scope is greatly extended with UBQ. Besides being able to process multiple images, with effects such as filters, adjustments, and so forth, in UBQ we generalized images to a concept we call Design Blocks. As such, static images are not the only source of data. It can be anything, even code that generates the visual input. In turn, this allows generative patterns and complex generative art to be processed directly in the engine.

We have put a lot of work over the last years into modern tools for automation to help creators get rid of time-consuming tasks such as image segmentation, color adjustments, and many others. Our foundation for this stems from modern machine learning and neural network advancements. As such, UBQ supports executing neural networks and inferring information. This allows basically any neural network to be part of a design, generalized in what we call Compute Blocks.

Collaboration – Interact in Realtime with other Creators

The cooperation and collaboration between multiple creators are dear to our hearts, because we believe that creation is not only the effort of an individual but of many.

UBQ will enable multiple creators to work together, allowing a design team to share thoughts and assets instantly. This contributes to accelerating their workflows and enriching their creative process.

All of this is reinforced by the multiplayer approach to design as well as by reusable design elements.

Creators shall be enabled to work together locally as well as remotely and interact on the same design at the same time.

Cooperation – Ease Exchange of Designs with your Peers

Besides the collaborative aspect, UBQ is built around the interaction of creators and non-designers – which we refer to as editors.

Editors will be enabled to take existing designs and adapt them to their needs, whether by giving instant feedback to the creators, changing color palettes to fit their corporate design guidelines, or changing images and text passages to make a design appealing to their audience.

Design Blocks – Smart and Reusable Design Elements

Another important design decision is the ability to exchange and reuse design elements between several projects. Therefore, UBQ’s foundation is built around the concept of smart design elements that we call blocks.

Blocks are reusable design elements that encapsulate complex and tedious design tasks.

They currently come in two flavors, Design Blocks and Compute Blocks, ranging from simple predefined image adjustment settings to automated image segmentation as well as predefined adaptable design elements.

From our point of view, creators should not have to spend their time recreating preexisting designs. They should not have to watch tutorials just to reach the same output someone else has already produced. In UBQ, these design blocks are the core foundation for building adaptable and interchangeable designs.

Design blocks respond to the available space, adapting their content accordingly, and may include complex logical rules or just a combination of assets. UBQ comes with batteries included, providing a rich library of predefined blocks. By design, blocks can easily be built from a combination of other blocks or by using a scripting language similar to Processing or p5.js, which even allows generative and parametric design.
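To make the idea of space-aware, composable blocks more concrete, here is a minimal sketch in Python. The names `Block`, `Row`, and `layout` are hypothetical and chosen for illustration only; they are not UBQ’s API, and the even-split rule is just one simple example of a layout rule.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical names for illustration only; this is not UBQ's API.
Rect = Tuple[float, float, float, float]  # x, y, width, height


@dataclass
class Block:
    name: str

    def layout(self, x: float, y: float, width: float, height: float) -> List[Tuple[str, Rect]]:
        """Return (name, rect) pairs for everything this block draws."""
        return [(self.name, (x, y, width, height))]


@dataclass
class Row(Block):
    """A block built from other blocks: splits its width evenly among children."""
    children: List[Block] = field(default_factory=list)

    def layout(self, x: float, y: float, width: float, height: float) -> List[Tuple[str, Rect]]:
        if not self.children:
            return []
        cell = width / len(self.children)
        placed: List[Tuple[str, Rect]] = []
        for i, child in enumerate(self.children):
            placed.extend(child.layout(x + i * cell, y, cell, height))
        return placed


# The same block adapts to whatever space it is given.
header = Row("header", children=[Block("logo"), Block("title")])
wide = header.layout(0, 0, 800, 100)
narrow = header.layout(0, 0, 400, 100)
```

Because `Row` is itself a `Block`, rows can be nested inside other rows, which mirrors the idea of building blocks from a combination of other blocks.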

Final Words

As of April 2020, we have been working on the UBQ creative engine for almost half a year, incorporating knowledge from 5+ years of developing our SDKs.

We believe we can make an impact on how designs and design tools are created, adapted, and distributed. We are interested in your feedback and open to discussions. We are also looking for people who want to get involved as team members or early adopters, or who just want to play with it. Expect some demos toward the end of summer showcasing what the UBQ engine is capable of.

Soon, we will get back to you with more exciting news.

Stay tuned!

We’re hiring!

Oh, and by the way, we’re hiring a Senior Frontend Developer at img.ly to reinforce our team and help shape the frontend of our new design tool. Check out the job here and drop us a line if you are interested.

From 2D to 3D Photo Editing

Malte — Tue, 26 Jun 2018 00:00:00 GMT

Last November, we released Portrait, an iOS app that helps create amazing, stylized selfies and portraits instantly.

With over a million downloads and many more portrait images created, we feel that the idea and vision of Portrait was more than confirmed. The central component of Portrait is an AI that is trained to clip portraits from the background, a technique we are eager to further improve and refine. In fact, Portrait helped us to explore a novel technique for image editing, as we were able to leverage a new powerful data set in photography: depth data.

We began feeding our AI models with the depth data from the iPhone X’s TrueDepth camera and had one goal in mind: to infer depth information for portrait imagery, or to bring three-dimensionality into a two-dimensional photo. Along the way, we created a new architecture concept that allows performance and memory improvements through modularizing and reusing neural networks.

In the following article, we’d like to present some of our results along with the insights we made.

The New Cool: Depth Data

The usage of depth data in image editing initially became available with the iPhone 7 Plus when Apple introduced ‘Portrait Mode’. By combining a depth map and face detection, the devices are able to blur out distant objects and backgrounds, mimicking a ‘bokeh’ or depth of field effect, which is well known from DSLR cameras.

While the actual implementation varies, all major manufacturers nowadays offer a similar mode by incorporating depth data into their image editing pipeline. This is achieved through the conventional dual or even triple camera on the back of a phone, through dual-pixel offset calculations combined with machine learning, or through dedicated sensors like Apple’s TrueDepth module. In fact, for a modern flagship phone, some sort of depth-based portrait mode is almost a commodity.

From a developer’s perspective, things look a little different: depth data became a first-class citizen throughout the iOS APIs in iOS 11, and such data is now easily accessible on supported devices. Android users obviously have access to depth data as well, either by utilizing multiple cameras or by Google’s dual-pixel based machine learning approach, seen in the newer Pixel 2 phones. But contrary to iOS, Android doesn’t yet offer a common developer interface to access such data. In fact, developers aren’t able to access any of the depth information Google or other manufacturers collected within their camera apps. This means developers would either need to implement the algorithm to infer depth from two images themselves or try to rebuild Google’s sophisticated machine learning powered system. Neither of these options is practical and probably not even possible given the usual limitations of camera APIs.

So although it is quite common, depth data isn’t as easily accessible for developers as one might think. Right now you’re out of luck on Android, dependent on hardware on iOS, and even then limited to the $1,000 flagship if you’re interested in depth for images taken with the front camera. And last but not least, across all devices and platforms, there is no way for you to generate a depth map for an existing image.

Deep Possibilities

Despite the restrictions, we decided to first explore the power of depth for image editing, as depth data provides many new exciting creative possibilities:

If we have a depth map for a given image, our editing possibilities are increased dramatically. Instead of a 2D image, a flat plane of color values, we suddenly have a depth value for each individual pixel, which translates into a 3D landscape highlighting distinct objects in the foreground and a clear indication of background.

Depth-aware Editing

Instead of relying on color and texture differences to determine foreground and background, one could literally edit these regions individually. This allows adjustments like darkening the background while lightening the foreground, which makes portraits ‘pop’. If we were able to generate a high-resolution depth map, we could easily replace the AI currently used in Portrait and allow even more sophisticated creatives. Thanks to the new APIs, there are already some awesome iOS apps available that specialize in depth-based editing. One famous example is Darkroom with its “depth-aware filters”.
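The foreground and background adjustment described in this section can be sketched in a few lines of Python. This is a minimal illustration and not how Darkroom or Portrait implement it; the function name, the gain values, the threshold, and the convention that 0.0 means nearest and 1.0 means farthest are all assumptions.

```python
from typing import List

Image = List[List[float]]     # grayscale values in [0, 1]
DepthMap = List[List[float]]  # assumed convention: 0.0 = nearest, 1.0 = farthest


def depth_aware_exposure(image: Image, depth: DepthMap,
                         threshold: float = 0.5,
                         foreground_gain: float = 1.2,
                         background_gain: float = 0.6) -> Image:
    """Brighten pixels nearer than `threshold` and darken the rest."""
    result = []
    for pixel_row, depth_row in zip(image, depth):
        out_row = []
        for value, d in zip(pixel_row, depth_row):
            gain = foreground_gain if d < threshold else background_gain
            out_row.append(min(1.0, value * gain))  # clamp to the valid range
        result.append(out_row)
    return result


image = [[0.5, 0.5], [0.5, 0.5]]
depth = [[0.1, 0.9], [0.2, 0.8]]  # left column is close, right column is far
adjusted = depth_aware_exposure(image, depth)
```

A hard threshold produces a visible seam in practice; blending the two gains smoothly by depth would be a natural refinement.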

Depth of Field Effects

As a depth of field or bokeh effect was the initial motivation for Apple to incorporate depth sensing technology, it is one of the most obvious applications. Depth is crucial for such an effect, as the amount of blurriness of any given region directly depends on its distance to the camera lens.
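As a rough sketch of that dependency, the snippet below derives a per-pixel blur radius from a depth map and a chosen focal depth. The linear relation, the `max_radius` parameter, and the depth convention (0.0 nearest, 1.0 farthest) are simplifying assumptions; a physically accurate lens model is more involved.

```python
from typing import List

DepthMap = List[List[float]]  # assumed convention: 0.0 = nearest, 1.0 = farthest


def blur_radius_map(depth: DepthMap, focal_depth: float,
                    max_radius: float = 8.0) -> List[List[float]]:
    """Blur radius grows linearly with the distance from the focal plane."""
    return [[max_radius * abs(d - focal_depth) for d in row] for row in depth]


depth = [[0.25, 0.25, 1.0],
         [0.25, 0.5, 1.0]]
# Focus on the subject at depth 0.25; the far background gets the most blur.
radii = blur_radius_map(depth, focal_depth=0.25)
```

Moving the focal plane then only means calling `blur_radius_map` again with a different `focal_depth` and re-applying the blur.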

3D Asset Placement

As mentioned above, a depth map gives us a 3D understanding of the image. We’re able to tell if subject A is positioned in front of or behind subject B. This allows placement of digital assets like stickers or text in a ‘depth-aware’ fashion, but could also be used to apply ‘intelligent’ depth of field, e.g. a bokeh effect that ensures all faces are in focus.
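A depth-aware placement can be sketched as a simple per-pixel occlusion test: the asset is drawn only where the scene lies behind it. The function name, the flat single-value sticker, and the depth convention (larger values are farther away) are assumptions made for this illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]


def place_sticker(image: Grid, depth: Grid, sticker_value: float,
                  sticker_depth: float,
                  region: Tuple[int, int, int, int]) -> Grid:
    """Paint `sticker_value` inside `region` (top, left, bottom, right),
    but only on pixels whose scene depth is greater than `sticker_depth`."""
    top, left, bottom, right = region
    result = [row[:] for row in image]  # copy so the input stays untouched
    for y in range(top, bottom):
        for x in range(left, right):
            if depth[y][x] > sticker_depth:
                result[y][x] = sticker_value
    return result


image = [[0.2, 0.2, 0.2],
         [0.2, 0.2, 0.2]]
depth = [[0.9, 0.3, 0.9],
         [0.9, 0.3, 0.9]]  # the middle column is a subject close to the camera
composited = place_sticker(image, depth, sticker_value=1.0,
                           sticker_depth=0.5, region=(0, 0, 2, 3))
```

Here the sticker covers the background columns while the nearer subject in the middle column stays in front of it.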

Enter Deep Learning

Motivated by the possibilities enabled by depth maps, we were wondering if we could bring this magic to any type of portrait image. We consulted existing literature on depth inference and found various papers¹ and articles on the topic, some of which even presented results that seemed sufficient for our use cases. In our case, we didn’t need accurate results, as in ‘this pixel is 30cm in front of the camera’; we were only interested in getting the general distance relations correct. For us, knowing that region A was slightly behind but definitely way in front of region B was enough to generate a visually pleasing effect, and by constraining our domain to portrait imagery, we were able to further reduce the task’s complexity.

Given our experience with deep learning and our current focus on introducing machine learning powered features to the PhotoEditor SDK, we immediately decided to tackle the new challenge with deep learning or, more specifically, convolutional neural networks. Having a huge dataset of image and depth map pairs available made this choice even easier. We stuck to a system similar to our previous segmentation model but decided to put more emphasis on allowing the reuse of individual parts, as this would come in handy when adding additional features in the future. To achieve this, we created a new modularized neural network approach named Hydra, which will be presented in an upcoming blog post.
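One way to express “getting the distance relations correct” in code is to check only the near/far ordering of pixel pairs and ignore absolute values. The sketch below is an illustrative measure and an assumption on our part, not the loss or metric actually used for the model; the function name and sample values are made up.

```python
from itertools import combinations
from typing import Sequence


def _sign(x: float) -> int:
    return (x > 0) - (x < 0)


def ordering_agreement(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Fraction of pixel pairs whose near/far order matches the reference."""
    agree = total = 0
    for i, j in combinations(range(len(reference)), 2):
        expected = _sign(reference[i] - reference[j])
        if expected == 0:
            continue  # reference tie: there is no order to get right
        total += 1
        if _sign(predicted[i] - predicted[j]) == expected:
            agree += 1
    return agree / total if total else 1.0


reference_cm = [30.0, 45.0, 200.0]  # metric depths (made-up values)
prediction = [0.1, 0.2, 0.9]        # unitless, but with the same ordering
score = ordering_agreement(prediction, reference_cm)
```

A prediction with completely different units still reaches a perfect score here as long as it ranks the regions in the same order.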

During development, we followed our tried and tested workflow of starting with a complex custom model, which is then tweaked and refined to match our performance requirements while maintaining the prediction quality we need. Once that was done, we had a fast and small model, trained on thousands of iPhone front camera selfies and capable of inferring high fidelity depth maps from a plain RGB image in under a second.

The Prototype

After creating a small model capable of inferring depth maps for any given portrait image, we immediately wanted to evaluate its performance in a ‘real-world’ environment. We decided to build a prototype that applies a depth of field effect to a portrait image, by using the model and its outputs. With our long-term goal of deploying the model to iOS, Android and the web in mind, we built the prototype using TensorFlowJS to explore this newly released library. Our browser demo consists of a minimal ‘Hydra’ implementation with individual modules, one for extracting features and one for the actual depth inference, which can both be executed individually.
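The split into a feature-extraction module and a separately executable inference module can be sketched as follows. This is only a guess at the general pattern, not Hydra’s actual design; the `ModularModel` class, the extra `mask` head, and the stand-in lambdas are all hypothetical.

```python
from typing import Callable, Dict, List

Features = List[float]


class ModularModel:
    """A shared feature extractor with independent, individually runnable heads."""

    def __init__(self, extractor: Callable[[List[float]], Features]) -> None:
        self.extractor = extractor
        self.heads: Dict[str, Callable[[Features], List[float]]] = {}
        self.extractor_runs = 0  # counts how often the shared part is executed

    def add_head(self, name: str, head: Callable[[Features], List[float]]) -> None:
        self.heads[name] = head

    def extract(self, image: List[float]) -> Features:
        self.extractor_runs += 1
        return self.extractor(image)

    def run_head(self, name: str, features: Features) -> List[float]:
        return self.heads[name](features)


# Stand-in functions; a real system would use trained networks here.
model = ModularModel(extractor=lambda image: [p * 2 for p in image])
model.add_head("depth", lambda f: [v / 4 for v in f])
model.add_head("mask", lambda f: [1.0 if v > 1.0 else 0.0 for v in f])

features = model.extract([0.2, 0.8])       # run the shared part once
depth = model.run_head("depth", features)  # reuse the features for each head
mask = model.run_head("mask", features)
```

The point of the structure is that a second feature can reuse the extracted features instead of paying for a second full forward pass.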

While being optimized for performance and memory footprint, the trained weights of the model still add up to ~18MB, which we will improve by further fine-tuning or even applying pruning or quantization. Once the models are loaded, all further processing happens on the device though, so you may try out all the samples without worrying about your data plan.
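To give a feeling for what quantization could save, here is a minimal sketch of linear 8-bit quantization together with the size arithmetic. It is one common option, not necessarily the method that will be applied to this model, and the estimate assumes the ~18MB consist of 32-bit float weights, which the text above does not state.

```python
from typing import List, Tuple


def quantize_uint8(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map float weights linearly onto the integers 0..255."""
    low, high = min(weights), max(weights)
    scale = (high - low) / 255 or 1.0  # fall back to 1.0 for constant weights
    return [round((w - low) / scale) for w in weights], scale, low


def dequantize(codes: List[int], scale: float, low: float) -> List[float]:
    """Recover approximate float weights from the 8-bit codes."""
    return [low + c * scale for c in codes]


weights = [-1.0, -0.5, 0.5, 1.0]
codes, scale, low = quantize_uint8(weights)
restored = dequantize(codes, scale, low)

# Size arithmetic, assuming the ~18 MB are 32-bit floats (an assumption).
float32_megabytes = 18.0
uint8_megabytes = float32_megabytes * 8 / 32  # 8 bits instead of 32 per weight
```

Under that assumption the weights would shrink to roughly a quarter of their size, at the cost of a rounding error of at most half a quantization step per weight.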

Results

Seeing our vision come to life was quite a stunning experience. Suddenly our browser was able to perform a complex depth of field effect without the need for special hardware, manual annotations or anything else apart from our image. And the best part was manually moving the focal plane through the image, either by sliding or tapping on different regions. Although being trained on ‘just’ selfies the model handles turned heads, silhouettes and multiple people pretty well and isn’t as restricted to its domain as we initially expected.

And while our initial prototype still weighs in at ~18MB, we’re certain we can slim that down further in order to use the model in production. Performance-wise, we were very impressed with the TensorFlowJS inference speed. Even though everything happens on the client side and is therefore dependent on the client’s hardware, we saw inference times below one second right off the bat, and those greatly improved after the initial run, as the resources were already allocated. While not being immediately helpful for the depth inference part, this allowed us to further confirm our theory behind Hydra: re-running inference once the necessary resources on the machine have been allocated greatly increases performance and might even allow real-time performance after an initial setup time.

To summarise, we’re definitely eager to further explore the use of depth data in image editing and think we have found a way to overcome the access restrictions on different platforms and hardware with our custom model. Combined with our new Hydra approach we can see lots of potential features that will delight both our users and customers and we will keep you updated right here.

(1)
The papers we drew the most knowledge from for our use case were:
“Depth Map Prediction from a Single Image using a Multi-Scale Deep Network” (arXiv)
“Deeper Depth Prediction with Fully Convolutional Residual Networks” (arXiv)

**Thanks for reading! To stay in the loop, subscribe to our Newsletter.**

Rise above the Noise

Eray — Tue, 01 May 2018 22:00:00 GMT

A few years back, a close friend came by with a presentation she had prepared for a marketing campaign. Overenergized by this career-boosting assignment, she had nothing else on her mind for weeks, barely reachable and always stressed out. When she finally popped out of her little bubble, you could feel how proud she was, presenting her latest draft to me with a gleam in her eyes. She knew that I’m a very visual guy, so she asked for my opinion.

While I could see the enormous amount of work she had put into research, structure and storytelling, most of its brilliance was overshadowed by its design, or rather its non-existent design. Oversaturated images, logos placed maliciously in the middle of nowhere, no concept of harmonic paddings, margins or color palettes. As a consequence, the visual presentation was an emotional flatline, to put it mildly. Slide after slide, I was in a constant struggle over whether I should tell her or not.

If you’re a designer, you might be familiar with this feeling. Your trained eyes can be a true pain in the ass, often triggering discomfort whenever something is unbalanced, or, let’s say, lacks a certain design attitude. In the digital world, with each ad, presentation, website or user interface you come across, there lurks an urge to set things straight, like with a crooked frame on the wall. While some might argue that this can quickly end in an unhealthy compulsiveness, to some degree this is important. Good design is an essential ingredient to the effectiveness of the underlying medium. After all, your design is the packaging for your content, for your message, and as such, it deserves equal attention. When done right, its true power unfolds with emotions; it can make people feel excited, and most importantly, **it can create desirability**. In the end, it’s the perfect tool to maximise the impact of your message and let your work truly stand out. But heck, enough with design philosophy.

Back at my friend’s computer, I suggested a few visual changes to the presentation, asking her to choose other fonts, pick different colours, change some pictures, and add more consistency to the layout. She did most of the changes herself following instructions and simple questions, which gave her a sense of ownership over the design.

Looking back, two things really frustrated me. On the one hand, it’s my conviction that a marketeer should know the basic principles of design. You’re responsible for communication, and design is an important tool to shape it and help it rise above the noise. On the other hand, the tools she had been using for the presentation and assets were also responsible for that mess. Tools made for designers, so a designer can craft the most beautiful visuals, seem powerless in the hands of a novice. Even business tools like PowerPoint do so little to educate and empower their users to create good design. Time after time, I’ve seen how marketeers, coders, product managers and business managers across industries had to rely on design departments or just struggled with their own mediocre design skills.

Who could have known that this frequent observation would help us pave the way for our own product? In our early days, when we built the first versions of the PhotoEditor SDK, we focused on helping developers to add basic photo editing functions to their products. With non-destructive editing and the handling of multiple elements on the canvas, we expanded our editor beyond photography into a design tool.

Once we had all the essential functions in place, we started looking at every tool, from text and brush to adjustments, from different angles. The most essential question was: how can this tool deliver beautiful output to those who lack design expertise? To those who don’t really know what character spacing is, or to the ones that don’t really understand what clarity does to an image and what it’s good for.

We believe in the vast potential of democratising design, and digital design starts with an editor.

Our journey towards that goal has just started. In April, we launched a novel text design tool that makes text layouting a breeze. Our app Portrait showcases how we can generate beautifully designed portraits instantly and has already been downloaded over a million times. There is much more to come, with tons of ideas, prototypes and data in our backpack.

Of course, making our vision happen is tremendously challenging from a UI perspective, and again, it is the quality of the design that will make the difference. As editing is a process that spans platforms, use cases and mediums — I consider this job a true boss fight for every designer. A huge challenge with a huge accomplishment in return, provided that we’re able to pull it off.

So, if you happen to be a designer, and feel the itching in your fingertips, just shoot us an email. Let’s start a conversation, and once you join our team let’s shape the future of design.


When Creativity meets A.I.

Eray — Thu, 16 Nov 2017 23:00:00 GMT

A new generation of A.I. algorithms, propelled by rising computational power, new hardware, and a shift in paradigms, has made its first notable impact in the creative world: the works of Gatys et al. and Krizhevsky et al. have not only gathered considerable public attention but have also helped apps like Prisma to be adopted and used by millions. I strongly believe that this is merely the beginning. **With the help of machine learning, we will fine-tune, simplify, and automate creative processes and ultimately empower new techniques for design and content creation.**

We’ve been following this topic for quite some time now and have spent considerable effort in researching the opportunities of deep learning for our PhotoEditorSDK. After more than a year of research and development, today, we’re finally bringing one of our apps to beta. **Portrait** combines supervised deep learning with the visual power of our SDK. In a nutshell, Portrait makes creating beautifully designed portrait images as easy as taking a selfie. You turn your selfies into movie poster-like portraits, with styles ranging from double-exposure photography to stencil art. One may consider it as the next iteration of what Apple and Google recently brought to market with their new camera features.

We’ve now come a long way and gained invaluable insights on our journey so far. Not only did we get our hands dirty with countless training sessions and refinements to the neural net, but our first-hand experience also helped us to set expectations right and to separate hype from substance. Most notably, it changed our product shaping process, making it more important than ever to foster strong ties between the product stakeholders and to share a common vision and goal everybody can get behind.

In the following I’d like to share the story of how we built the app and closed the gaps between roles of the stakeholders within this process.

Preface: Before Neural Networks were the Hot New Thing

My journey begins over ten years ago, while I was graduating in neuroscience. Back then, the idea of A.I. was just a vague promise. Artificial Neural Networks were too small, computers lacked the necessary power, and the results were certainly nice, but still too weak to compete with other traditional algorithms. Research felt stuck in tiny little specializations without really following a broader vision. Dazzled by its impracticability, my interest in Neural Networks slowly began to fade.

It took research on Neural Networks another six years to get back on my radar. At that time, I was leading several product developments at 9elements. When I learned about the work of DeepMind (now Google) I had a genuine feeling that this time, A.I. was ready for the limelight.

As we were in the course of building a library for image editing and computer vision, the PhotoEditorSDK, we realized how much neural nets could also affect the creative space, given their ability to abstract and formalize rules. What if there was a machine that could reproduce the common and dull tasks you have to do as an art director within a second? What if designers could get rid of repetitive and tedious activities that interrupt their creative flow?

But this topic isn’t something you’d learn in a week, obviously. Still, innovations cannot happen if you’re not willing to take a risk, so we decided to invest considerable time and resources into this technology.

From a product management perspective, this process is actually an anti-pattern: usually, you wouldn’t want to start by finding the right purpose for a technology; instead, you’d find the right technology for a purpose. I still believe that this is essentially the right approach, but sometimes you have to abandon your best practices and take a swim in uncharted waters. Consequently, we asked Malte, one of our iOS engineers, to spearhead our research and take a deep dive into this topic. We decided to start off with image segmentation as the first process that we wanted to optimize through machine learning. Masking and clipping can sometimes be tedious tasks, and ultimately we wanted to reduce this process, which can take several minutes, to a single click.

Chapter 1: The Machine Engineer

Malte, who is a diligent engineer and (how convenient) a passionate photographer, started investigating some approaches that focused on image segmentation. You can read more about his journey in his article. Although he experimented with various neural networks and post-processing techniques, the resulting masks sometimes lacked the desired accuracy and wouldn’t have matched a user’s expectations. This was a first, expected insight. As we want to deliver ready-to-use products to our customers that don’t need any complex tweaking, this was something we had to fix. Our problems originated mostly from our rather ambitious goal to segment any type of object within an image. It would have required training on vast amounts of data and scaling up the number of filters in our network. However, due to our on-device constraint, this would have killed our carefully crafted performance.

Therefore, we shifted from this generalist approach to a specialized network for images of a certain domain that the model can be applied to. In hindsight, this seems quite obvious, as our rather small model would never have been able to cope with the amount of variation existing in ‘the real world’ anyway. So, we went back to the drawing board and started discussing which domain to focus on. That’s where we got stuck; we struggled to find an obvious trend in our customers’ use cases or known photography platforms.

It was actually during his summer holiday that Malte had a flash of genius. At a stop-over in Singapore, he noticed how the city was flooded with selfie-stick-wielding tourists. The sheer number of selfies taken at any public place in Singapore left him astonished, and he realised that he had just found the right domain. Selfies, and portraits in general, felt like an infinite data source and a prime use case for our image segmentation algorithm. Back home, we decided to focus on selfies and portrait-like photography.

Malte started searching for portrait datasets and found a collection of roughly 2000 portrait images collected from Flickr. Those were a great starting point, and after a few training runs, he already reached satisfactory results, as the model was now capable of capturing all available variations. At that point, we had a system at our hands that was able to segment portrait or selfie images in real-time on the device you’re capturing them with. This seemed like a great opportunity, but we didn’t want to stop just there. Releasing a prototype that can free a selfie from its background is nice, but doesn’t feel like something that would truly showcase how AI can make a difference in our creative process.

Chapter 2: The Art Director

This is where our Art Director Tommi, a renowned graphic artist and former sprayer, stepped in to explore what can be done with **a selfie, an accurate alpha mask and the image editing features from our PhotoEditor SDK**. When Tommi took the lead, I asked him to draft a vision, a creative direction for our app that combines all the tools and possibilities at our disposal.

Together, we started exploring portrait trends and unique imagery that would help us find a direction for our showcase. Soon, the walls of our meeting rooms and offices were plastered with inspirational works on portrait photography of all different kinds and styles. This visual catalogue kept inspiring us, although we weren’t sure on which style to settle in the end. It was when we could hardly find any more free spots on our walls and after looking at them for countless times that the idea struck:

What if we could enable users to turn their portraits into what we saw on these walls? And this without actually having the design expertise they would normally need to do so.

Instead of brooding over a completely new form of portraits, we could take all these styles and instantly realize them with the technology we had. From that point on, we flipped our process upside down. Instead of thinking about what the technology is capable of, or identifying a problem worth solving, we aimed for the creative output that we wanted our app to produce. While we started our venture with a technology, we now had visual results that we could work towards. The main question shifted from “What is our technology capable of?” to “How can we achieve this visual output with our technology?”

Tommi designed five lead graphics, so our team of engineers and designers could grasp what we ultimately wanted to achieve, using only a selfie and the features of our SDK.

Chapter 3: Closing Ranks

With such a clear vision for our app, we started separating the wheat from the chaff, categorizing the portraits and understanding which operations of our SDK we had to combine, assemble and enhance to create these visuals.

What followed was a remarkable interplay across multiple stakeholders of our team. While we were always very vocal proponents of building strong relationships between product stakeholders, **the introduction of the AI layer actually glued our team further together**.

Our designers started to embrace the engineering perspective, playfully identifying both opportunities and constraints through the tech layer. At the same time, our engineers embraced the design vision and formalized it into code. Let me give you some examples:

While thinking of the UI, we understood that the transformation of a selfie into a graphical artwork required immediate feedback for the user, so they can find a pose that works best with the respective artwork. Consequently, we optimized our networks for real-time processing, a true challenge that needed strong expertise in both iOS engineering and neural net architecture.

Our designs and recipes in turn had to be tweaked to gracefully allow for errors of our AI, because an error rate of 3% can still produce undesired artefacts and mask inaccuracies. We did that by using techniques that beautifully fringed edges of the portrait.
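One simple way to make a design tolerant of small mask errors is to soften the mask edge so that misclassified pixels fade out instead of producing a hard artefact. The sketch below uses a plain box blur for this; it is an illustrative assumption and not necessarily the technique used in Portrait, and the function name and `radius` parameter are made up.

```python
from typing import List

Mask = List[List[float]]  # 1.0 = portrait, 0.0 = background


def feather(mask: Mask, radius: int = 1) -> Mask:
    """Soften mask edges with a box blur so small errors fade out."""
    height, width = len(mask), len(mask[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            count = 0
            # Average over the neighbourhood, clipped at the image border.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += mask[ny][nx]
                    count += 1
            row.append(total / count)
        result.append(row)
    return result


hard_mask = [[0.0, 0.0, 1.0, 1.0]]
soft_mask = feather(hard_mask)
```

The hard 0-to-1 jump in `hard_mask` becomes a gradual ramp in `soft_mask`, which is what lets an effect blend across an imperfect edge.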

Altogether, the close cooperation, as well as countless meetings, feedback loops, and the continuous fine tuning of the code and underlying recipes is what brings us here.

All of this wouldn’t have happened if we hadn’t taken the risk to invest in a rising technology in the first place. And all of this wouldn’t have been possible if it weren’t for the exemplary cooperation between all the stakeholders. Portrait is a showcase of how technology can inspire and tie a team together. This, in the end, is absolutely necessary if we want to achieve the leaps we expect with AI. If you want to impact the creative space by introducing an AI layer to it, your engineers have to think like designers, or at least deeply understand their work.

The Road Ahead

Portrait is a first showcase and one step of many in our venture to wire several AI aspects deep into our SDK. On our journey, we’ve identified many more opportunities where we can help broader audiences to make creative work and design more accessible. Of course, we will also improve our models and networks with better and more data, always keeping in mind the aesthetic and visual output we’d like to achieve. We’ll keep you posted on our updates and next ventures into this exciting new era.

If you liked what you read, I’d encourage you to check out Portrait and our PhotoEditor SDK!

Thanks to my co-authors Malte & Felix!
