AI – IMG.LY Blog

AI Design Agents and Creative Automation: How to Ship a Full Campaign Without a Designer

Klaudia — Wed, 01 Apr 2026 10:14:03 GMT

You have your files, your hooks, and your Google Ads data, but you still can’t launch the campaign because the final assets have to come from a designer.

That gap is exactly where campaign momentum dies. Not at the strategy stage or in copy but at the last mile, when everything is ready except the thing people will actually see.

Luckily, it no longer has to.

The AI Marketing Stack Has a Design-Shaped Hole in It

Most marketing workflows have been quietly transformed over the last two years. Copy generation, audience segmentation, keyword research, performance analysis - all of it runs faster now, with less manual input. A solo marketer can do what used to require a team. Except for one part.

Design hasn’t moved. The workflow is still: write a brief, hand it to a designer, wait, review, revise, wait again. Then do the same thing in three more formats because the 1:1 you approved doesn’t fit Stories, Display, or LinkedIn. That loop can take days. And it doesn’t matter how good your AI-generated copy is if it’s sitting in a doc waiting for someone to have bandwidth.

This isn’t a resourcing problem. Hiring more designers doesn’t fix the structural issue; it just adds capacity to a fundamentally slow process. The problem is that the tools built for design weren’t designed for the workflow a modern marketer actually runs. They expect a designer at the keyboard. They don’t expect a marketer with a campaign brief and a conversation window.

The result: campaigns that are otherwise fully automated still stall before they ship. The bottleneck moved from copy to creative. And most teams haven’t noticed yet, because design delays feel normal. They’ve always been there.

What an AI Design Agent Actually Changes

An AI design agent is an autonomous AI system, not a prompt-response tool. It plans, reasons, and executes design tasks independently. Given a goal, it breaks that goal into steps, uses the tools available to it, retains context across the session, and can self-correct when the output isn’t right. That’s the category: a system that drives the workflow rather than waiting for instruction at each step.

Most design agents deliver speed. The handoff that used to take days can happen in minutes. But the output is still a deliverable: a file you receive, use as-is, or move somewhere else. Most operate inside existing tools or generate assets you work with elsewhere. The speed is real; the editability usually isn’t. If something needs to change, you’re back to prompting from scratch.

CoDesign sits differently. It doesn’t hand you a finished asset. It gives you a working starting point on a real canvas. Layers, text boxes, placeholders - real assets you can continue to work with, in the same conversation, without leaving the session.

Brand consistency gets handled at the foundation. Load your brand kit once, including colors, fonts, logo, and layout rules, and every output in that session respects those constraints. You’re not eyeballing hex codes or hoping the font looks right. The rules are applied from the start, not checked at the end.

Multi-format adaptation is where the time savings become concrete. A campaign that runs across Instagram, Google Display, LinkedIn, and print doesn’t produce four separate briefs and four separate rounds of designer revisions. You describe the formats you need, and the agent adapts the work. The campaign stays consistent across all of them.
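To make the idea of adapting one layout to several formats concrete, here is a minimal sketch. It is not how CoDesign works internally; it only assumes that each element's position is stored as a fraction of the canvas, so the same layout can be projected onto any target size. The format names and pixel sizes are illustrative assumptions, and a real agent would also reflow the hierarchy rather than just scale boxes.

```python
from dataclasses import dataclass

# Illustrative target sizes in pixels (assumptions; check each platform's specs).
FORMATS = {
    "feed_square": (1080, 1080),
    "story_vertical": (1080, 1920),
    "display_rectangle": (300, 250),
}


@dataclass(frozen=True)
class Element:
    name: str
    x: float  # left edge as a fraction of canvas width
    y: float  # top edge as a fraction of canvas height
    w: float  # width as a fraction of canvas width
    h: float  # height as a fraction of canvas height


def place(element, canvas_size):
    """Convert a normalized element box into pixel coordinates."""
    width, height = canvas_size
    return {
        "name": element.name,
        "x": round(element.x * width),
        "y": round(element.y * height),
        "w": round(element.w * width),
        "h": round(element.h * height),
    }


def adapt(layout, formats=FORMATS):
    """Project one normalized layout onto every target format."""
    return {
        name: [place(element, size) for element in layout]
        for name, size in formats.items()
    }


layout = [
    Element("headline", 0.1, 0.1, 0.8, 0.2),
    Element("logo", 0.75, 0.85, 0.15, 0.1),
]
variants = adapt(layout)
```

Because the layout is described once and projected many times, a change to the headline box carries through to every format. That is the property that keeps a campaign consistent across sizes.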

The biggest shift isn’t speed, though that’s real. It’s that you stay in creative control throughout. There’s no handoff moment where you lose the thread and have to re-explain the brief to someone else. The context lives in the conversation.

How to Run a Campaign Production Session with CoDesign

This is the actual sequence. Walk through it once and the workflow becomes repeatable.

1. Start with the brief. Open CoDesign and describe the campaign in plain language. Audience, objective, platform, tone, any constraints. The more specific you are here, the less iteration you’ll need later. Treat it like briefing a senior designer who hasn’t worked with your brand before.

2. Generate copy variations before you open the canvas. Use whichever AI writing tool you already work with (ChatGPT, Claude, or whatever is in your stack) to produce multiple copy directions: headline hooks, body copy, CTAs. Get four or five versions per element. Copy and design are separate steps that feed into each other. Having real options ready before you start the design session means CoDesign has something specific to work with, not a blank brief waiting to be interpreted.

3. Feed the brief to the AI design companion. With copy variations ready, ask CoDesign to generate initial ad designs. Describe the format, the hierarchy you want, any layout preferences. You’ll get structured, editable designs on the canvas. Not a rendered image, but a working starting point with real layers.

4. Apply your brand kit. If you haven’t already, load your brand assets: logo, color palette, type system, approved imagery. The agent applies these rules to the designs. Every output from this point respects your brand standards automatically.

5. Adapt across formats. Tell the agent which formats you need. The social variant, the display variant, the vertical for Stories, the square for feed. Watch the layouts adapt to each context, maintaining the campaign idea and brand consistency across dimensions. If the hierarchy needs adjusting for a specific format, describe what’s not working and the agent fixes it in conversation.

6. Refine in conversation. This is where the canvas-based approach earns its value. You’re not generating new versions from scratch. You’re iterating on what’s already there, in the same session. “Move the logo to the bottom right. Try the headline in the lighter weight. Swap this layout for something with more white space.” Each exchange builds on the last, so the conversation stays grounded in what’s on the canvas rather than starting over from a new prompt.

7. Export and ship. When the designs are approved, export in the formats your media plan requires. The session context lives with the file, so if something needs to change post-launch, you’re not starting from zero.

One honest note: the quality of the output is proportional to the quality of the brief. Vague prompts produce generic starting points. Teams that invest two minutes in a specific, structured brief consistently get more usable first outputs than teams that describe the campaign in one sentence and expect the agent to fill in the gaps.

This Is What Closing the Loop Actually Looks Like

Design agents don’t replace designers. They remove the bottleneck that sits between strategy and execution.

The marketer who used to wait three days for ad assets can now produce a full campaign set in a single session. The designer who spent half their week on format resizes and small-copy tweaks can spend that time on the work that genuinely requires their judgment. That means brand-defining creative, campaign concepts, and anything where taste and experience are the actual input.

It was never about willingness to collaborate. The tools just didn’t allow for anything else. Every design change, no matter how small, had to go through a handoff. A headline adjustment on a banner resize does not need a creative director. A resize from 1:1 to 9:16 does not need a brief, a Slack message, and a 48-hour turnaround.

Thanks to design agents, the conversation shifts. It moves from “can you make this” to “how should this look.” And that’s a far more interesting conversation.

Interested in trying IMG.LY CoDesign? Reach out to our team.

What Is an AI Design Agent?

Klaudia — Mon, 30 Mar 2026 04:52:01 GMT

The phrase “design agent” is being used in two completely different ways right now, and the confusion is worth clearing up before either meaning becomes the default. Search for it today and you’ll mostly find AI design agencies - services firms that use AI in their process. That’s a reasonable business, but it has nothing to do with what this article is about.

The second meaning, a design agent as a category of AI tool, is the one that matters for product builders, developers, and creative teams trying to understand where AI-assisted design is actually going. That’s what we’re defining here: what it is, how it works, how it differs from tools you already know, and why the distinction matters if you’re building or evaluating creative software in 2026.

First, What an AI Design Agent Is Not

An AI design agency. This is a services firm that uses AI tools in its creative process. It’s a completely different category - a service, not a tool - and it’s the most common result you’ll find when you search “design agent” right now.

An AI image generator. Midjourney, DALL-E, and similar tools produce images from text prompts. They generate a single output from a single instruction.

An AI feature inside a design tool. Figma’s AI suggestions, auto-layout assistance, and background removal are AI features. They augment a manual workflow, making specific tasks faster.

A design automation pipeline. Server-side systems that batch-generate assets from templates are automation, not agency. They’re fast and scalable, but they don’t converse, can’t interpret an ambiguous brief, and don’t refine their output in response to feedback. The confusion here is understandable: modern automation systems use AI models internally and can produce polished, varied-looking results that are easy to mistake for something more intelligent. But give one an ambiguous brief and it either fails or produces something technically correct that misses the point entirely. Give a design agent the same brief and it asks a question. That distinction - executing a fixed process versus reasoning about intent - is what separates the two categories.

The distinction matters because “design agent” is becoming a meaningful category term, and what it actually describes is different enough from all of the above that collapsing the distinctions creates real confusion, both for people evaluating tools and for people building products.

AI Design Agent - a Working Definition

There is no settled definition of an AI design agent yet; the term is still new. Based on what we see in the tools that genuinely deliver on the promise, three properties distinguish a real design agent from something that merely resembles one:

1. Autonomy. The agent takes multi-step actions without requiring a human instruction at each step. Given “create a five-page product catalog in a Scandinavian minimal style,” it works out the layout, typography, and image placement independently and produces the result.

2. Conversational interface. The agent communicates in natural language. It can ask clarifying questions, explain what it’s done, and accept follow-up instructions. The interaction feels like briefing a designer, not operating software.

3. Refinement loop. The agent takes feedback across multiple conversation turns (“make the typography lighter,” “apply the brand’s warm color palette”) and updates the design accordingly. This iterative loop is what separates a design agent from a one-shot generation tool.

A tool that has all three of these properties is a design agent. A tool that has one or two is something else; probably useful, but in a different category.

How an AI Design Agent Works in Practice

An abstract definition only takes you so far. Here’s what a fully implemented design agent workflow actually looks like, drawn from a real demo of IMG.LY CoDesign.

Step 1. The user opens CoDesign. The agent interface - a chat panel - sits alongside the canvas. They’re in the same window.

Step 2. The user types a brief: “Here’s a CSV of five products from our furniture brand. Generate one landscape catalog page per product — two-column grid, full-bleed photo left, typography-led layout right. Clean, minimal. Think Hay, Muuto, Frama.”

Step 3. The agent processes the brief, the data, and any brand context it has access to. It generates a five-page catalog in the editor: product names, descriptions, prices, photo placeholders, layout structure consistent across all five pages.

Step 4. The user reviews the output and follows up in the chat: “Nice. Pre-fill the photo placeholders with black-and-white product photography, soft contrast.”

Step 5. The agent updates the design. The user manually adjusts one headline, corrects a price, and exports.

The full workflow, from brief to print-ready output, took minutes rather than hours. The user never left the tool. The agent handled all structural and stylistic decisions. The human handled final judgment and small corrections.
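The data-driven part of that workflow (one page per CSV row, same structure on every page) can be sketched in a few lines. This is only an illustration of the pattern, not CoDesign's implementation: the column names, the sample products, and the page dictionary shape are all assumptions made up for this example.

```python
import csv
import io

# Hypothetical product data. The column names are assumptions for this sketch.
PRODUCTS_CSV = """name,description,price
Lounge Chair,Oak frame with a linen seat,490
Side Table,Powder-coated steel,180
Floor Lamp,Brushed brass with a paper shade,260
Bookshelf,Modular ash shelving,720
Dining Bench,Solid walnut,540
"""


def build_catalog(csv_text):
    """Return one landscape page spec per CSV row, all sharing one structure."""
    pages = []
    rows = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(rows, start=1):
        pages.append({
            "page": number,
            "orientation": "landscape",
            # Two-column grid: full-bleed photo left, typography right.
            "left": {"type": "photo_placeholder", "bleed": "full"},
            "right": {
                "type": "text",
                "name": row["name"],
                "description": row["description"],
                "price": float(row["price"]),
            },
        })
    return pages


catalog = build_catalog(PRODUCTS_CSV)
```

The photo slot stays a placeholder until a follow-up instruction fills it, which mirrors Step 4 above. Correcting a price by hand, as in Step 5, is then a one-field change on one page.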

But generating layouts from a brief is only part of what a design agent can do. In the same chat interface, the agent can build functional decision-making tools directly inside the editor. Ask it to create a Color Themes panel, and it produces a working component: five named presets, each one applying a complete color theme across the entire design in a single click. The user doesn’t switch to a settings screen or manually update individual elements. The agent has built the control they need, right where they need it, as part of the same conversation. That’s a meaningfully different capability: not just producing a designed output, but constructing the tools that let the user make better decisions about that output as they finalize it.
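A rough way to picture what sits behind a panel like that: each preset maps color roles to values, and one click applies the whole mapping to every element by role. The sketch below assumes that model. The preset names, hex values, and element structure are hypothetical and are not taken from CoDesign.

```python
# Five hypothetical presets. Each maps a color role to a hex value.
THEMES = {
    "Sand": {"background": "#F4EDE4", "text": "#2B2118", "accent": "#C2703D"},
    "Slate": {"background": "#E8ECEF", "text": "#1F2A33", "accent": "#4A6FA5"},
    "Forest": {"background": "#EEF3EC", "text": "#1E2B20", "accent": "#3F7D4E"},
    "Ink": {"background": "#111418", "text": "#F5F5F2", "accent": "#E0B84C"},
    "Blush": {"background": "#FBEFEF", "text": "#3A2326", "accent": "#D46A7E"},
}


def apply_theme(elements, theme_name):
    """Return a recolored copy of the design; the input list is left unchanged."""
    theme = THEMES[theme_name]
    return [{**element, "color": theme[element["role"]]} for element in elements]


design = [
    {"id": "page-bg", "role": "background", "color": "#FFFFFF"},
    {"id": "headline", "role": "text", "color": "#000000"},
    {"id": "price-tag", "role": "accent", "color": "#FF0000"},
]
themed = apply_theme(design, "Slate")
```

Tagging elements by role rather than by literal color is what lets a single click restyle the entire design.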

This table is meant as a fair comparison; each column represents how that tool category actually behaves today:

Capability	AI Image Generator	AI Design Features	Design Agent
Takes a brief	Prompt only	No	Yes
Produces editable layouts	No	Partial	Depends on the tool
Multi-step autonomy	No	No	Yes
Conversational refinement	No	Limited	Yes
Brand / context awareness	No	Partial	Yes
Iterative across a session	No	No	Yes

The pattern here isn’t that design agents are better at everything - it’s that they operate at a different level of the workflow. Image generators and AI design features are task-level tools. A design agent is a workflow-level tool.

The Autonomy Slider: Why Human Control Still Matters

One of the most important design decisions when building or evaluating a design agent is how much autonomy the AI exercises by default.

Full autonomy is fast. The agent makes all decisions and presents a finished output. But it can produce results that don’t match the user’s actual intent, especially when the brief is ambiguous or the stakes are high.

Minimal autonomy is safe. The agent only suggests, the human decides everything. But at that point, you’ve lost much of the value of having an agent in the first place.

The best implementations give users what you might call an autonomy slider: the ability to let the agent run with full independence on some tasks, take targeted direction on others, and step aside entirely when the user wants to edit manually. The right level of autonomy depends on the task and the user’s confidence, not a fixed setting applied uniformly across every interaction.

For any design agent, this means the interface needs to let the user reach in and adjust manually, mid-workflow, without losing what the agent has already done. The agent and the editor have to exist in the same environment. Separate windows with an export step between them break the loop entirely.
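One possible way to model an autonomy slider is as a level that decides which agent actions are applied directly and which wait for approval. The sketch below is a thought experiment under that assumption, not a description of any shipping product. The level names and action categories are invented for illustration.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    SUGGEST_ONLY = 0  # agent proposes, the human applies everything
    DIRECTED = 1      # agent applies small edits, asks before structural ones
    FULL = 2          # agent applies everything and reports back


# Hypothetical action categories with the lowest level that may auto-apply them.
MIN_LEVEL = {
    "recolor": Autonomy.DIRECTED,
    "move_element": Autonomy.DIRECTED,
    "restructure_layout": Autonomy.FULL,
}


def needs_approval(action, level):
    """True when the agent should pause and ask before applying the action."""
    # Unknown actions are treated as high impact and need full autonomy.
    return level < MIN_LEVEL.get(action, Autonomy.FULL)


# The level can differ per task, so it is passed in rather than fixed globally.
checks = {
    "recolor at DIRECTED": needs_approval("recolor", Autonomy.DIRECTED),
    "restructure at DIRECTED": needs_approval("restructure_layout", Autonomy.DIRECTED),
}
```

Passing the level per call, instead of storing one global setting, reflects the point above: the right amount of autonomy depends on the task and on the user's confidence.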

Where AI Design Agents Create the Most Value

Not every creative workflow benefits equally from an AI agent. The scenarios where design agents consistently create the most value tend to fall into three categories:

High-volume, high-variation creative production. Marketing teams producing dozens of ad variants, e-commerce teams generating product imagery at scale, publishers creating template-based content in bulk. A team can brief the agent once (brand colors, copy, format specs) and get 40 properly formatted variants back in the time it would have taken to manually produce three, each one consistent with the last.

Non-designer users who need professional results. Not everyone who needs to produce a designed output is a designer. Marketers, retailers, operations teams, small business owners - they know what they want but don’t have the tools or time to build it manually. A design agent gives them a way in that doesn’t require either.

Expert designers who want to explore faster. Experienced designers can use the agent to generate starting points, explore multiple directions quickly, or offload the time-consuming production work while retaining full control over the final result. A designer can spend 20 minutes reviewing six distinct layout directions the agent generated from a single brief, rather than a full afternoon building each one from scratch.

These aren’t exhaustive categories, but they represent the use cases where the workflow-level shift actually changes what’s possible, not just what’s faster.
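For the high-volume case, the arithmetic behind "brief once, get 40 variants" is a simple combination of the options in the brief. The sketch below assumes a brief with five headlines, four formats, and two palettes. All of the values are placeholders, and a real agent would produce a designed layout for each combination rather than a dictionary.

```python
from itertools import product

# Hypothetical brief: every value here is a placeholder for illustration.
brief = {
    "headlines": [
        "Summer starts here",
        "Made for long evenings",
        "Light, by design",
        "Your space, simplified",
        "New season, new room",
    ],
    "formats": ["1:1", "4:5", "9:16", "16:9"],
    "palettes": ["light", "dark"],
}


def expand(brief):
    """List every headline, format, and palette combination in the brief."""
    return [
        {"headline": headline, "format": fmt, "palette": palette}
        for headline, fmt, palette in product(
            brief["headlines"], brief["formats"], brief["palettes"]
        )
    ]


variants = expand(brief)  # 5 headlines x 4 formats x 2 palettes
```

Adding one more headline to the brief adds eight more variants without any extra manual work, which is why this category benefits so much from a single well-structured brief.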

A Different Kind of Design Tool

A design agent makes creative work faster and more accessible: for designers who want to explore more directions in less time, and for non-designers who have always known what they wanted but lacked the tools to build it.

That’s what IMG.LY CoDesign is built around. It’s available now in early access, inside a fully featured design editor. Get in touch to request access.

Vibe Design Tools Compared

Klaudia — Sun, 29 Mar 2026 03:51:56 GMT

Vibe coding gave non-technical people the ability to build software by describing what they wanted. The same shift is now happening with vibe design. For the first time, a marketer, a retailer, or a small business owner can create a professional design exactly to their specification just by explaining it to an AI. No design skills required. No external tool. No waiting for a designer to become available.

This article compares six tools leading that shift - what they do well, where they differ, and how to choose between them.

Tools Built for Creators

These tools represent the current state of standalone vibe design software. Each is genuinely good at what it does. The differences come down to what kind of work they’re built for.

Google Stitch

Google Stitch is an AI-native design canvas from Google Labs, launched for public beta in early 2026. It accepts text, voice, images, and code as input and produces high-fidelity UI designs and interactive prototypes on an infinite canvas. Its design agent can reason across an entire project rather than just the current frame - a meaningful step beyond tools that only operate on one screen at a time.

Stitch introduces a DESIGN.md file format for exporting and importing design rules across tools, and it connects to Cursor, Claude Code, and Gemini CLI via an MCP server. It’s available free during beta with monthly generation limits, and it’s clearly aimed at UI/UX designers and developers building product interfaces - not marketing or graphic design work.

Figma Make

Figma Make takes natural language prompts or existing Figma designs and produces working prototypes and web apps - functional code, not just mockups, with optional Supabase backend integration. Because it lives inside Figma, it has access to your existing design libraries, component systems, and tokens from the start.

It’s best understood as a rapid ideation and prototyping tool. It generates interactive, working outputs fast, but the code benefits from developer review before going to production. It’s distinct from Figma Sites, which handles actual publishing. AI credits are required for usage.

Lovart

Lovart is a standalone AI design agent founded by a former ByteDance senior product director. Its scope goes well beyond graphic asset creation: from a single prompt, it can generate brand identity systems, UI mockups, video content, packaging designs, and full marketing campaigns. It uses a proprietary MCoT (Mind Chain of Thought) reasoning engine designed to mimic how a creative director thinks - analyzing business context, target audience, and brand requirements, not just aesthetic style.

Its infinite canvas continuously analyzes all assets present to maintain visual consistency across an entire project, and outputs are compatible with Figma, Photoshop, and After Effects. Lovart is positioned for marketing creatives, solo designers, and small creative teams who need to produce full campaigns without a large agency.

Adobe Firefly Boards

Adobe Firefly Boards is Adobe’s AI-first collaborative ideation canvas, built for creative professionals who need to move quickly from inspiration to concept. It lives inside the Adobe Firefly web app and mobile apps (iOS and Android), and syncs with Creative Cloud so work can move from moodboard to production without switching platforms.

The core of Boards is an infinite multimedia canvas where you bring in text prompts, reference images, video clips, and Adobe Stock assets, then generate across all of them at once. What makes it unusual is the model breadth: Boards gives you access to Adobe’s own Firefly models alongside partner models from Runway, Pika, Luma AI, Google Veo, Black Forest Labs, and others—all inside the same workspace. Outputs are images, video clips, text overlays, and assembled mood boards or storyboards, and every generated asset carries Content Credentials that automatically record which AI model produced it.

Canva Magic Studio

Canva Magic Studio is a suite of more than 15 AI-powered tools embedded directly into the Canva editor on web, mobile, and desktop. It’s not a separate app—if you’re already a Canva user, you’re already inside Magic Studio.

The suite covers writing, image generation, video creation, format switching, and multi-language translation, all accepting text prompts, uploaded media, documents, and existing canvas elements as input. Brand Kit integration is what distinguishes it from more isolated AI generation tools: when you generate a design through Magic Design, it pulls from your saved Brand Kit—logo, colors, fonts—rather than producing something generically styled. That keeps AI-generated output on-brand without manual cleanup.

The trade-off is structural. Canva outputs are flat exports within a consumer-grade editor, not structured design objects you can manipulate at the element level. Generations are fast and on-brand, but the editing ceiling is Canva’s editor—for users who need granular creative control or complex asset structures, that’s a real constraint, not just a preference.

IMG.LY CoDesign

IMG.LY CoDesign is an AI-powered design companion built for creators and teams producing marketing materials, branded content, social assets, and multi-format campaigns. What sets it apart from other standalone tools is how it treats AI output: every element CoDesign generates is a structured, editable design object on the canvas - not a flat image to export and rebuild elsewhere. Move a headline, swap a font, resize a block, change a color - directly in the same tool that generated it.

The agent and the editor aren’t two modes. They’re the same tool. You can prompt a direction, refine it in conversation, and take over the canvas directly at any point - without switching tools, exporting, or starting over. CoDesign also goes beyond the canvas: ask it to build a brand panel, a form-based template, or a custom configurator, and it builds that too - directly inside the editor, from the same prompt interface.

Here is a quick demo to showcase what IMG.LY CoDesign can do.

Vibe Design Tools at a Glance

	Google Stitch	Figma Make	Lovart	Adobe Firefly Boards	Canva Magic Studio	IMG.LY CoDesign
Primary use	UI/UX design from prompts	Prompt-to-prototype / app	Full campaign and brand creation	AI-first concepting and moodboarding	AI design generation inside Canva editor	AI-assisted design for creators and teams
Where it lives	Google (standalone web app)	Inside Figma	Lovart (standalone)	Adobe Firefly web + mobile; syncs with Creative Cloud	Inside Canva (web, mobile, desktop)	IMG.LY
Input types	Text, voice, images, code	Text, existing Figma designs	Text, images, brand briefs	Text, images, video, Adobe Stock assets	Text, images, video, documents, existing design elements	Text, images, CSV, brand kits
Output type	UI designs and prototypes	Interactive prototypes and web apps	Brand assets, campaigns, video, packaging	Images, video, text overlays, mood boards, storyboards	Images, video, presentations, styled text, format-converted designs	Editable multi-page designs, videos, animations
Output format	Flat / exportable	Code / exportable	Flat / exportable	Flat / exportable; Content Credentials attached	Flat / exportable within Canva editor	Structured editable objects on canvas
Brand context	Manual per session	Via Figma design system	Canvas-aware within session	Enterprise Custom Models; personal library import	Brand Kit applied automatically at generation	Loaded at session start
In-chat UI generation	No	No	No	No	No	Yes
Multi-model AI	No	No	No	Yes (Runway, Pika, Luma, Google Veo, BFL, and more)	No (Dream Lab uses Leonardo.ai Phoenix)	No
Best for	Product designers, developers	Product teams in Figma ecosystem	Marketing creatives, solo designers	Creative professionals doing concepting and ideation	Non-designers and content teams needing fast, on-brand output	Creators and teams needing editable AI-generated design

For a detailed breakdown of how Canva Magic Studio compares to CoDesign specifically, see Canva Magic Studio vs. CoDesign.

How to Choose

The criteria here aren’t about feature counts. They’re about the kind of work you’re doing.

Choose Google Stitch if you’re a product designer or developer building UI and want an AI agent that reasons across your whole project, not just individual screens. It’s the most technically connected of the six—MCP server, code input, DESIGN.md export—and it’s clearly built for product interface work.

Choose Figma Make if you’re already in the Figma ecosystem and want to go from prompt to working prototype fast. It inherits your design system automatically, which removes a lot of setup friction, and it outputs functional code rather than static designs.

Choose Lovart if you need to produce full creative campaigns—brand identity, packaging, video, marketing assets—from a single prompt. Its MCoT reasoning engine and canvas-level consistency analysis make it particularly strong for high-volume marketing creative work.

Choose Adobe Firefly Boards if you’re a creative professional working inside the Adobe ecosystem who needs to generate and align on large volumes of visual concepts before moving them into production. It’s the right call for designers, photographers, and creative directors doing serious concepting work that will ultimately land in Photoshop, Premiere, or Illustrator.

Choose Canva Magic Studio if you need fast, on-brand creative output and your users aren’t designers. Magic Studio’s Brand Kit integration means generated designs pull from your saved colors, fonts, and logos automatically—no cleanup, no manual adjustment. It’s not the tool for pixel-level creative control, but for marketing teams and content creators producing high volumes across multiple formats and channels, that’s rarely the priority.

Choose CoDesign if you need AI-generated output you can actually edit. Every element lands on the canvas as a structured design object—headline, image block, color field—not a flat export. That matters when the brief changes, the brand needs adjusting, or the output needs to become ten variations rather than one. For creators and teams producing branded content at scale, that editability is the difference between a starting point and a finished asset.

CoDesign is in early access. Be among the first to prompt a direction and walk away with a design you can actually use. Talk to our team to learn more.

What Is Vibe Design? The Definitive Guide for Product Builders, Designers, and Creative Teams

Klaudia — Fri, 20 Mar 2026 17:23:51 GMT

Vibe Design - A Concept That Just Got a Name

Creative work has always involved describing an idea and having someone skilled make it real. Vibe design follows the same principle — except the “someone” is AI, ready whenever you are, no brief needed. You describe what you want through text, images, brand assets, or sketches, and the AI generates the design. You direct the process; the tool handles the making.

The term itself is not new, but Google brought it into mainstream use in early 2026 with its Stitch announcement, and within a day it was appearing across The Register, CNBC, and TechRadar. The practice it describes had been building for well over a year: in Figma’s AI features, in generative design tools, and in the growing number of products letting users create from a prompt rather than from a toolbox.

Today, we’ll define what vibe design actually is, trace where the idea came from, and explain what it means for people building products.

Where the Term Comes From: Vibe Coding’s Design Sibling

To understand vibe design, it helps to start with the concept it’s directly descended from: vibe coding.

Andrej Karpathy, former Director of AI at Tesla and a co-founder of OpenAI, coined “vibe coding” in early 2025 to describe a shift in how developers work with AI-generated code. The idea: instead of writing implementation yourself, you describe what you want in natural language and an AI writes the code. The developer’s role shifts from implementation to direction. You stop thinking about how things are built and focus entirely on what you want to create.

Vibe design is the same shift applied to visual creation. Instead of opening a design tool and manually placing elements, adjusting colors, choosing fonts and spacing, you describe what you want — in words, or by uploading a reference image, a photo, a sketch, a brand kit — and the design takes shape. The phrase “vibe coding design” captures this lineage neatly: it’s the same intent-first principle, moved from engineering into the visual layer.

The underlying movement had been gathering momentum since AI image generators went mainstream, accelerating rapidly as tools like Figma Make, Lovart, and now Google Stitch brought the concept directly into design workflows. The label arrived late to a trend that had already changed how a lot of creative work gets done.

What Vibe Design Actually Means: A Working Definition

Here is a working definition that holds up across tools and use cases:

Vibe design is a creative workflow in which the primary input is intent — described in natural language or visual references — rather than manual manipulation of design tools. The designer’s role becomes one of direction, curation, and refinement rather than construction.

Three things define a genuine vibe design workflow:

1. Intent-first input. The starting point is a brief, a description, a reference image, or a combination — not a blank canvas and a toolbox. You’re communicating what you want, not building it.

2. Generative execution. An AI interprets that intent and produces a designed output — a layout, a color scheme, a complete page, a set of variations. The construction step is handled by the system.

3. Human refinement in the loop. The human stays involved throughout — approving directions, adjusting outputs, steering away from things that don’t work. The AI handles execution; the human handles judgment.

What vibe design is not: it’s not simply using an AI image generator to produce pictures. Image generation is one possible input into a vibe design workflow, not the workflow itself. Vibe design produces editable, structured design outputs — layouts, components, documents, campaigns — not static images. The output is something you can work with, not just something you can look at.

How It Differs from AI-Assisted Design

“AI-assisted design” has covered a lot of ground over the past few years: autocomplete for design tokens, background removal, content generation within a layout. These are useful additions to a manual workflow. But in all of them, the designer still drives — AI is a tool called on for specific tasks while the human remains in the seat.

Vibe design flips the ratio. The AI drives the initial creation; the human steers and refines. It’s a different relationship with the tool, not a faster version of the same one.

The distinction matters because it changes three things:

What skills are most useful. Writing clear, directed prompts and making fast curatorial judgments matters more than knowing every keyboard shortcut.
What the workflow looks like. You’re reviewing and steering outputs rather than constructing from scratch.
What software is relevant. Vibe design tools are built around a different interaction model than tools built to accelerate traditional manual design.

Vibe Design in Practice: Three Scenarios

The best way to make this concrete is to show what an AI design workflow built around vibe design actually looks like. These three scenarios cover the range of contexts where it’s becoming relevant.

Scenario 1: A Marketing Team, No Designer Available

A marketing manager at a mid-sized e-commerce brand needs a product launch campaign for social media. There’s no designer available this week — they’re tied up on a bigger project.

She opens the creative tool embedded in their marketing platform, uploads the product photo and the brand guidelines, and types: “Create a campaign for our summer collection — clean, minimal, white space heavy, headline-driven.”

She receives a set of formatted, brand-consistent assets sized for each social channel. The layouts are on-brand. The typography follows the guidelines she uploaded. She adjusts the headline copy on two of the assets and swaps one background color. The whole thing takes 12 minutes.

No design skills required. No third-party tool. No waiting for a designer to become available. The campaign goes out on schedule.

Scenario 2: A Designer Exploring Variations

A senior designer is working on a brand campaign for a luxury lifestyle client. She’s settled on a layout she likes, but she can’t land on the right color direction — everything she’s tried manually feels either too cold or too safe, and exporting variations to compare them side by side is eating time she doesn’t have.

Instead, she types a single instruction into the agent chat: “Add a panel on the left with five color theme presets I can click to instantly apply to my design.”

The agent builds the panel directly inside the editor. Five named presets — Warm Sand, Midnight, Rose Quartz, Forest, Slate Blue — each applying a complete color theme across the entire design in one click: backgrounds, accents, headings, body text, all updated together. She works through all five in under a minute and finds the direction she was looking for without typing another prompt.

The variation-exploration workflow, at its most useful, doesn’t just produce more outputs — it builds the tools you need to make the decision faster.

Scenario 3: A Product Team Embedding Creative Capability

A print-on-demand platform serves retailers and small brands who need to produce product catalogues regularly but don’t have in-house design resource. One of their customers — a retailer for a furniture brand — opens the editor, pastes a CSV of five products into the agent chat, and describes the layout style she wants: two-column landscape, typography-led, minimal, referencing the aesthetic of Hay, Muuto, and Frama.

The agent generates a complete five-page catalogue inside the editor — one page per product, consistent layout throughout, with product names, descriptions, prices, and photo placeholders already in place. She follows up in plain language: “Pre-fill the photo placeholders with elegant product photography, make them black and white, soft contrast.” The agent updates all five pages. She adjusts one headline manually and exports.

The Vibe Design Tools Shaping the Space Right Now

Several tools are explicitly built around this workflow.

Tool	What it does	Where the agent lives	Best for
Google Stitch	Voice and text prompts to UI design	Google’s standalone tool	UI/UX designers, developers
Figma Make	Prompt to prototype inside Figma	Inside Figma (standalone)	Product designers working in Figma
Lovart	AI design agent for graphic creation	Lovart’s standalone platform	Marketing creatives, solo designers
IMG.LY	Design companion for all types of design work and tasks	IMG.LY Codesign studio	Designers, marketers, brand owners

Google Stitch is built around the idea that UI design should start with a conversation. You describe a screen — its purpose, the actions it needs to support, the general feel — and Stitch produces an interface design you can refine. It’s aimed at developers and UI/UX designers who want to move faster in the early stages of building a product interface. Where it works well is in getting from a rough idea to a structured screen layout without having to make every decision from scratch.

Figma Make extends the environment that product designers already work in. Because it lives inside Figma, it has access to your existing components, tokens, and design system. The prompt-to-prototype workflow is useful for designers who want to explore how a brief might translate into a working layout without manually composing every frame. Its biggest advantage is that the output lands directly in a space where a full design team can take over.

Lovart is focused on graphic and campaign creation rather than UI or product design. It’s built for the kind of work that marketing creatives and solo designers do a lot of — producing visual assets for social, campaigns, brand activations. The emphasis is on speed and aesthetic quality for graphic outputs rather than on structured, component-based design systems.

IMG.LY pairs a design companion with manual canvas editing: each element created by AI can be manually moved, changed, or adapted. It maintains brand and template context throughout the generation process, redefining what a template is: from a simple asset with placeholders to true brand guidelines. It also excels at the number of import and export options, making it a great option for true graphic and asset design.

Where Vibe Design Has Limits

Vibe design works well when there’s something to work from — a reference image, a brand kit, an existing visual direction. When there’s genuinely nothing to draw from, outputs tend toward the generic. A completely new brand identity with no existing visual language is a poor fit for a vibe design workflow; that kind of work still benefits from the deliberate, decision-by-decision process of traditional design.

It’s also less suited to accessibility-critical UI, where precise specification — contrast ratios, touch targets, interaction states — matters more than mood or aesthetic direction. A generated layout might look right without being accessible, and catching that requires careful manual review.
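
Part of that review can be scripted. As a minimal sketch, the contrast check below follows the WCAG 2.x contrast-ratio formula; the function names and the idea of feeding it colors read from a generated layout are assumptions for illustration, not a feature of any tool mentioned here.

```typescript
// An sRGB color as 0-255 channel values.
type Rgb = [number, number, number];

// Convert one 0-255 sRGB channel to linear light (WCAG 2.x definition).
function toLinear(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance: weighted sum of the linearized channels.
function relativeLuminance(color: Rgb): number {
  const [r, g, b] = color;
  return 0.2126 * toLinear(r) + 0.7152 * toLinear(g) + 0.0722 * toLinear(b);
}

// Contrast ratio between two colors, from 1 (identical) to 21 (black on white).
function contrastRatio(foreground: Rgb, background: Rgb): number {
  const l1 = relativeLuminance(foreground);
  const l2 = relativeLuminance(background);
  const lighter = Math.max(l1, l2);
  const darker = Math.min(l1, l2);
  return (lighter + 0.05) / (darker + 0.05);
}

// WCAG AA asks for at least 4.5:1 for normal-size body text.
function passesAaBodyText(foreground: Rgb, background: Rgb): boolean {
  return contrastRatio(foreground, background) >= 4.5;
}
```

A check like this only covers color contrast. Touch targets and interaction states still need a human pass.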

Finally, the more technically constrained the brief, the more refinement the output will need. Vibe design compresses the path to a starting point; it doesn’t always compress the path to a final, production-ready output. Teams that go in expecting to iterate will get more out of it than teams that expect to export and ship.

The Human Element: Vibe Design Is Not Autonomous Design

A common concern about vibe design workflows is that they reduce the role of skill and judgment in creative work. The evidence so far points the other way.

The most effective workflows keep the human firmly in control of direction, curation, and final judgment. The AI generates; the human decides what’s good, what fits the brief, what needs to change. Removing that layer doesn’t improve outcomes — it just produces more output with no quality filter.

What changes is not whether human judgment matters, but at which stage it matters most. In a traditional design workflow, judgment is exercised continuously — at every click, every color choice, every alignment decision. In a vibe design workflow, judgment operates at a higher level: Is this the right direction? Does this match the intent? What needs to change?

The craft is still there. The instruments are different.

A Shift That’s Already Underway

Vibe design isn’t a trend arriving from the future. It’s a name for a shift that’s been building for several years and just became visible enough to label properly.

The creative AI tools exist. The workflows are being adopted. The user expectation is forming. Naming the practice in 2026 didn’t create the movement — it just gave it a shared vocabulary that makes it easier to talk about and build toward.

If what you’re looking for is a balance between manual control of the output and the power of AI generation, IMG.LY Codesign might be the right fit for you. Talk to our team to see how it can fit inside your stack.

CE.SDK v1.69 Release Notes

Neslihan — Fri, 27 Feb 2026 18:29:53 GMT

v1.69 is the biggest developer experience release we’ve shipped. Agent Skills collapse the time from idea to working editor from days to under a minute. Production-ready Starter Kits eliminate days of boilerplate setup. The PPTX and Canva Importer lets users bring their existing designs directly into your editor. And a comprehensive video overhaul across Web, iOS, and Android closes the gap between CE.SDK and native professional editing apps.

Let’s dive in!

Introducing IMG.LY Agent Skills for CE.SDK

The biggest change to CE.SDK’s developer experience since launch.

AI agents have changed how developers build. With Agent Skills, CE.SDK works natively with your AI agent.

Install the plugin for Claude Code or Cursor and your AI coding agent becomes a CE.SDK expert with bundled access to guides, API references, and starter kits across 10 web frameworks. No external services, no MCP servers, no context-switching.

What your developers can now do:

/cesdk:build — Describe a use case in plain language. The agent detects the framework, pulls the right starter kit, and scaffolds a working project autonomously.
/cesdk:explain — Ask any CE.SDK implementation question and get answers adapted to the specific framework and experience level.
/cesdk:docs-[framework] — Retrieve guides and API references directly from the IDE, without leaving the editor.

For your product, this means: developers prototype in minutes, not days. Non-technical team members can now build and extend functional editors independently. And onboarding new engineers to CE.SDK shrinks from a multi-day ramp to a single session.

→ Quick Start & Full Announcement
→ Explore Agent Skills Documentation

Launch in Minutes with Production-Ready Starter Kits for Web

Previously, getting CE.SDK running in a real product meant adapting complex demo projects: stripping features, adjusting architecture, and undoing assumptions baked into example code.

Starter Kits replace all of that with pre-configured, production-ready editor UIs.

Specialized kits for Photo, Video, and Design editors are scoped, clean, and immediately deployable. Each kit starts with a focused feature set. You explicitly opt in to what you need, so your UI stays performant and your users don’t see capabilities they can’t use.

For your product, this means: a functional, shippable editor in minutes rather than days of setup. Clean architecture from day one, without accumulating technical debt.

→ Explore Starter Kit Documentation

Let Users Bring Their PPTX & Canva Designs to CE.SDK

Some customers don’t start from scratch. They have existing creative assets locked inside PowerPoint and Canva — and until now, bringing those into CE.SDK required manual recreation.

The new PPTX Importer removes that barrier.

Users can import native PowerPoint files or export Canva designs as PPTX and bring them directly into your CE.SDK-powered editor, with layouts, elements, and structure intact.

→ Try the PPTX Import Demo
→ View PPTX npm

Professional-Grade Video Editing Across Platforms

This release expands CE.SDK’s video features across web & mobile platforms.

Control Video Speed (iOS & Android)

Users can now adjust the playback speed of video and audio clips via a dedicated Speed UI. This has been one of the most-requested mobile features and is now available on both platforms.
→ View Video Speed Documentation iOS
→ View Video Speed Documentation Android

Create Video Groups in the Timeline (Web)

Multiple clips can now be combined into a single manageable unit with synchronized timing, trimming, and a unified timeline representation.

This enables complex, multi-layer content production and speeds up the design process.

Video Timeline Now Grows Automatically (Web)

The timeline now automatically adjusts its height based on content. Manual track configuration is gone, and complex projects are easier to pick up for your new users.

Use a Clutter-Free iOS Timeline (iOS)

Non-video tracks now show a single thumbnail per clip, reducing clutter in multi-layer timelines on your iOS app.

Full Changelog

See all technical details, breaking changes, and performance improvements in the v1.69.0 Changelog.

Thank you for building with IMG.LY.

Introducing IMG.LY Agent Skills

Neslihan — Mon, 16 Feb 2026 18:54:06 GMT

TL;DR
We are moving beyond static documentation. IMG.LY Agent Skills is a specialized intelligence layer that transforms AI coding assistants (Claude Code, Cursor, Windsurf) into CE.SDK implementation experts. Instead of mapping APIs manually, you now inject our full SDK expertise directly into your IDE and coding agent.

→ Explore Agent Skill Documentation
→ Access Agent Skills GitHub Repository

The “Quick Start” Paradox

In our recent analysis of client integrations, a consistent pattern emerged: even with a high-performance SDK, the “Time to First Edit” is often delayed by research. Developers spend a significant portion of their initial integration time context-switching between their IDE and documentation, manually mapping complex API hierarchies to their specific framework.

We figured: You don’t just need better documentation; you need an expert partner who is already inside your codebase.

The Solution: IMG.LY Agent Skills for Web

We are moving beyond static documentation. Today, we are launching IMG.LY Agent Skills, a specialized intelligence layer that transforms AI coding assistants like Claude Code, Cursor, and Gemini into CE.SDK implementation experts.

Instead of searching for answers, you now give your agent our complete SDK knowledge with a single command.

How It Works: Your Autonomous Implementation Partner

By injecting versioned, live documentation and framework-specific starter kits directly into your agent’s context window, we’ve created three adaptive paths to launch:

1. The Explain Path

Architecture is complex. The Explain Skill acts as a digital implementation partner that adapts to you. It can provide a technical deep-dive on the export pipeline for a Senior Dev or a high-level summary for a Product Manager—in any language and at any level of detail.

2. The Build Path

Command your agent to “Add a photo editor to my React app.” The agent autonomously detects your environment, pulls the correct starter kit, and handles the boilerplate. It doesn’t just show you code; it builds the foundation for you.

3. The Docs Path

Stop the tab-switching. Your agent retrieves versioned API references offline, keeping your focus entirely within the terminal or IDE.

The Shift to Autonomous Engineering

This release marks a strategic shift for IMG.LY. We believe the next era of software development isn’t about “better documentation,” but about Portable Expertise.

By providing an AI coding agent with a “skill” rather than a manual, we are empowering teams to move from a multi-month build cycle to a 30-second scaffold.

Supported Frameworks

The initial Web release supports 10 frameworks out of the box:
React · Vue.js · Svelte · Angular · Next.js · Nuxt.js · SvelteKit · Electron · Node.js · Vanilla JS

Get Started

Run the command:

npx skills add imgly/agent-skills

For example, Claude Code users can run:

claude plugin marketplace add imgly/agent-skills

claude plugin install cesdk@imgly

→ Explore the full Agent Skill Documentation

The “editor of your dreams” is now a conversation away.


Build in a Day: AI Video Clipping with CE.SDK

Eray — Thu, 05 Feb 2026 12:17:28 GMT

Introduction

We built a video shortener in a single day using Claude Code and CE.SDK. It extracts 3-4 short clips from long-form video, handles transcription, identifies the best moments via AI, detects speakers, and outputs vertical/horizontal/square formats—all running in the browser.

Features:

Extracts 3-4 clips per video (highlights, summaries, or cleaned-up edits)
Outputs 9:16 (vertical), 16:9 (landscape), or 1:1 (square)
Detects speakers and maps them to faces with user confirmation
Auto-crops to follow the active speaker
Adds captions and text hooks
Non-destructive: change aspect ratio or template without re-processing

Best suited for: Videos with speech/dialogue (podcasts, interviews, presentations, vlogs)

Why Client-Side?

CE.SDK’s CreativeEngine runs in the browser via WebAssembly. Video decoding, timeline manipulation, effects, and preview all happen on the user’s device.

Benefits:

No upload/download wait — edits preview instantly
Non-destructive — switch aspect ratio or template without rendering
Lower infrastructure costs — your costs don’t scale with video length or user count

Tech Stack

Frontend: Next.js + React
Video Engine: CE.SDK (CreativeEngine)
Transcription: ElevenLabs Scribe v2
AI Analysis: Google Gemini

Architecture Overview

High-Level Flow

Required API Keys

Service	Purpose	Environment Variable
CE.SDK	Video editing engine	`NEXT_PUBLIC_CESDK_LICENSE`
ElevenLabs	Speech-to-text transcription	`ELEVENLABS_API_KEY`
Gemini (via OpenRouter or direct)	AI highlight detection	`OPENROUTER_API_KEY` or `GEMINI_API_KEY`
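
A small startup check can catch a missing key before the first upload. The sketch below is an assumption about how you might wire it, not part of the original app: `readApiKeys` and the `ApiKeys` shape are hypothetical, and preferring OpenRouter when both Gemini keys are set is an arbitrary choice.

```typescript
interface ApiKeys {
  cesdkLicense: string;
  elevenLabsKey: string;
  gemini: { provider: 'openrouter' | 'direct'; key: string };
}

type Env = Record<string, string | undefined>;

// Collect every missing variable first so the error lists them all at once.
function readApiKeys(env: Env): ApiKeys {
  const missing: string[] = [];

  const cesdkLicense = env['NEXT_PUBLIC_CESDK_LICENSE'];
  if (!cesdkLicense) missing.push('NEXT_PUBLIC_CESDK_LICENSE');

  const elevenLabsKey = env['ELEVENLABS_API_KEY'];
  if (!elevenLabsKey) missing.push('ELEVENLABS_API_KEY');

  const openRouterKey = env['OPENROUTER_API_KEY'];
  const geminiKey = env['GEMINI_API_KEY'];
  if (!openRouterKey && !geminiKey) {
    missing.push('OPENROUTER_API_KEY or GEMINI_API_KEY');
  }

  if (!cesdkLicense || !elevenLabsKey || missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }

  // Assumption: prefer OpenRouter when both Gemini options are configured.
  const gemini = openRouterKey
    ? { provider: 'openrouter' as const, key: openRouterKey }
    : { provider: 'direct' as const, key: geminiKey ?? '' };

  return { cesdkLicense, elevenLabsKey, gemini };
}
```

In a Next.js app you would call this with `process.env` on the server. Only variables prefixed with `NEXT_PUBLIC_` are exposed to browser code, so the transcription and Gemini keys should stay behind an API route.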

Setting Up CE.SDK

What is CE.SDK?

CE.SDK (CreativeEngine SDK) is a browser-based engine for video, image, and design editing—a programmable video editor you can embed in your app.

Key Concepts:

Engine: The runtime that manages the editing session
Scene: The document/project containing all elements
Blocks: Individual elements (video clips, text, shapes, audio)
Timeline: Time-based arrangement of blocks for video editing

Installation

npm install @cesdk/cesdk-js

Initializing the CreativeEngine

import CreativeEngine from '@cesdk/cesdk-js';

const engine = await CreativeEngine.init({
  license: process.env.NEXT_PUBLIC_CESDK_LICENSE,
});

// Create a video scene
const scene = engine.scene.createVideo();

// Get the page (timeline container)
const pages = engine.scene.getPages();
const page = pages[0];

// Configure page dimensions for your target aspect ratio
engine.block.setWidth(page, 1080); // 9:16 vertical
engine.block.setHeight(page, 1920);
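
Since the app offers three output formats, it helps to keep the page sizes in one place and to work out which part of the source survives the reframing. The following is a sketch under assumptions: the 1080 px short side matches the snippet above, but `PAGE_SIZES` and `centerCrop` are hypothetical helpers, not CE.SDK APIs.

```typescript
type AspectRatio = '9:16' | '16:9' | '1:1';

interface Size {
  width: number;
  height: number;
}

interface CropRect extends Size {
  x: number;
  y: number;
}

// Assumed presets: 1080 px on the short side for every format.
const PAGE_SIZES: Record<AspectRatio, Size> = {
  '9:16': { width: 1080, height: 1920 },
  '16:9': { width: 1920, height: 1080 },
  '1:1': { width: 1080, height: 1080 },
};

// Largest centered window of the source that has the target's aspect ratio.
function centerCrop(source: Size, target: Size): CropRect {
  const targetRatio = target.width / target.height;
  const sourceRatio = source.width / source.height;

  if (sourceRatio > targetRatio) {
    // Source is wider than the target: keep full height, trim the sides.
    const width = source.height * targetRatio;
    return { x: (source.width - width) / 2, y: 0, width, height: source.height };
  }

  // Source is taller than (or equal to) the target: keep full width.
  const height = source.width / targetRatio;
  return { x: 0, y: (source.height - height) / 2, width: source.width, height };
}
```

The page itself would then be sized with `engine.block.setWidth(page, size.width)` and `engine.block.setHeight(page, size.height)`, where `size` is the preset looked up for the chosen ratio.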

Uploading Video to CE.SDK

CE.SDK works with video through a fill-based system. The graphic block is the container, while the video fill holds the actual media source and playback properties.

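// Create a video block (a graphic block needs a shape to render its fill)
const videoBlock = engine.block.create('graphic');
engine.block.setShape(videoBlock, engine.block.createShape('rect'));
const videoFill = engine.block.createFill('video');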

// Set the video source
engine.block.setString(
  videoFill,
  'fill/video/fileURI',
  videoUrl // Can be a blob URL or remote URL
);

// Apply fill to block
engine.block.setFill(videoBlock, videoFill);

// Add to timeline
engine.block.appendChild(page, videoBlock);

Extracting Audio for Transcription

// Configure audio-only export
const mimeType = 'audio/mp4';

// Export just the audio track
const audioBlob = await engine.block.export(page, mimeType, {
  targetWidth: 0,
  targetHeight: 0,
});

// audioBlob can now be sent to transcription API

Setting both dimensions to 0 tells CE.SDK to skip video encoding entirely, making this export much faster than exporting the full video.

Getting Video Metadata

// Get video duration
const duration = engine.block.getDuration(videoBlock);

// Get dimensions from the fill
const videoFill = engine.block.getFill(videoBlock);
const sourceWidth = engine.block.getSourceWidth(videoFill);
const sourceHeight = engine.block.getSourceHeight(videoFill);

console.log(`Video: ${sourceWidth}x${sourceHeight}, ${duration}s`);

AI-Powered Transcription & Highlight Detection

The Pipeline

Audio → Transcription: Send extracted audio to ElevenLabs Scribe
Transcription → Analysis: Feed word-level transcript to Gemini
Analysis → Timestamps: Map AI suggestions back to precise video times

Transcription with Speaker Diarization

ElevenLabs Scribe v2 provides:

Word-level timestamps (start/end time for each word)
Speaker diarization (which speaker said what)

The output is a structured transcript where each word has a precise timestamp, enabling frame-accurate editing.
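
To make that structure concrete, here is a minimal sketch that folds word-level output into per-speaker segments. The `TranscriptWord` fields mirror the interface used later in this post. `SpeakerSegment` and `groupBySpeaker` are hypothetical names, and falling back to `'unknown'` for words without a speaker is an assumption.

```typescript
interface TranscriptWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
  speaker_id?: string;
}

interface SpeakerSegment {
  speakerId: string;
  start: number;
  end: number;
  text: string;
}

// Merge consecutive words from the same speaker into one segment.
function groupBySpeaker(words: TranscriptWord[]): SpeakerSegment[] {
  const segments: SpeakerSegment[] = [];

  for (const w of words) {
    const speakerId = w.speaker_id ?? 'unknown';
    const last = segments[segments.length - 1];

    if (last && last.speakerId === speakerId) {
      // Same speaker keeps talking: extend the current segment.
      last.end = w.end;
      last.text += ' ' + w.word;
    } else {
      // Speaker changed (or first word): start a new segment.
      segments.push({ speakerId, start: w.start, end: w.end, text: w.word });
    }
  }

  return segments;
}
```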

AI Highlight Detection with Gemini

The prompt structure matters. Here’s what works:

You are analyzing a video transcript to identify segments for short-form content.

TRANSCRIPT:
[Word-by-word transcript with timestamps]

TASK:
Identify 3-4 segments that work as standalone short videos. For each segment:
1. Find the exact starting and ending words
2. Ensure clean sentence boundaries (no mid-sentence cuts)
3. Aim for 30-60 second segments

OUTPUT FORMAT (JSON):
{
  "concepts": [
    {
      "id": "concept_1",
      "title": "Hook title",
      "description": "Why this segment works as a standalone clip",
      "trimmed_text": "The exact transcript text to keep...",
      "estimated_duration_seconds": 45
    }
  ]
}

CRITERIA FOR SELECTION:
- Strong hooks (surprising statements, questions, bold claims)
- Complete thoughts (don't cut mid-explanation)
- Emotional peaks (humor, insight, controversy)
- Standalone value (makes sense without context)


Before finalizing each segment, ask: "If someone started watching here,
would they understand what's being discussed?"
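
Model output does not always arrive as clean JSON, so it is worth validating before using it. The sketch below assumes the response is a string that contains the JSON object described in the prompt, possibly wrapped in extra text. `parseConcepts` and `isConcept` are hypothetical helpers, not part of any Gemini client library.

```typescript
interface ClipConcept {
  id: string;
  title: string;
  description: string;
  trimmed_text: string;
  estimated_duration_seconds: number;
}

// Runtime check that a parsed value has the fields the prompt asked for.
function isConcept(value: unknown): value is ClipConcept {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const text = v['trimmed_text'];
  return (
    typeof v['id'] === 'string' &&
    typeof v['title'] === 'string' &&
    typeof v['description'] === 'string' &&
    typeof text === 'string' &&
    text.trim().length > 0 &&
    typeof v['estimated_duration_seconds'] === 'number'
  );
}

// Extract the outermost JSON object from the raw reply and keep valid concepts.
function parseConcepts(raw: string): ClipConcept[] {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch (error) {
    return []; // Not valid JSON: treat as "no usable concepts".
  }

  if (typeof parsed !== 'object' || parsed === null) return [];
  const concepts = (parsed as Record<string, unknown>)['concepts'];
  return Array.isArray(concepts) ? concepts.filter(isConcept) : [];
}
```

Returning an empty list rather than throwing lets the caller decide whether to retry the request or show an error.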

Mapping Back to Timestamps

Once Gemini returns the trimmed_text, we match it against our word-level transcript to find exact timestamps:

AI returns:     "The secret to success is actually quite simple..."
Transcript has: [{ word: "The", start: 45.2 }, { word: "secret", start: 45.4 }, ...]

Result:         Trim video from 45.2s to 52.8s

This text-matching approach is more reliable than asking the AI to output timestamps directly. LLMs can hallucinate timestamps or miscalculate offsets, but they’re excellent at identifying the right words—so we let the transcript data provide the ground truth for timing.
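
One way to implement that lookup is an exact, contiguous token match after light normalization. This is a sketch of the idea rather than the code the app uses: `findSegment` and `normalize` are hypothetical names, and a production version would likely need fuzzy matching for cases where the model paraphrases a word.

```typescript
interface TranscriptWord {
  word: string;
  start: number;
  end: number;
  speaker_id?: string;
}

interface MatchedSegment {
  startIndex: number; // index of the first matched word in the transcript
  endIndex: number; // index of the last matched word
  start: number; // seconds
  end: number; // seconds
}

// Lowercase, unify apostrophes, and drop punctuation so "simple..." matches "simple".
function normalize(token: string): string {
  return token
    .toLowerCase()
    .replace(/[‘’]/g, "'")
    .replace(/[.,!?;:"“”()\[\]…-]/g, '')
    .trim();
}

function findSegment(words: TranscriptWord[], trimmedText: string): MatchedSegment | null {
  const needle = trimmedText.split(/\s+/).map(normalize).filter((t) => t.length > 0);

  // Skip transcript entries that normalize to nothing (spacing, bare punctuation).
  const tokens = words
    .map((word, index) => ({ index, word, text: normalize(word.word) }))
    .filter((t) => t.text.length > 0);

  if (needle.length === 0 || needle.length > tokens.length) return null;

  for (let i = 0; i + needle.length <= tokens.length; i++) {
    const matches = needle.every((text, j) => tokens[i + j]?.text === text);
    if (!matches) continue;

    const first = tokens[i];
    const last = tokens[i + needle.length - 1];
    if (!first || !last) continue;

    // Timing comes from the transcript, never from the model.
    return {
      startIndex: first.index,
      endIndex: last.index,
      start: first.word.start,
      end: last.word.end,
    };
  }

  return null; // No exact match found.
}
```

The `start` and `end` values can then be turned into a trim offset and trim length on the video fill.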

Working with the CE.SDK Timeline

Understanding Blocks

// Video/Image content
const graphic = engine.block.create('graphic');

// Audio track
const audio = engine.block.create('audio');

// Text overlay
const text = engine.block.create('text');

// Each block can be positioned on the timeline
engine.block.setTimeOffset(block, startTimeInSeconds);
engine.block.setDuration(block, durationInSeconds);

Manipulating Trim Points

Trimming controls which portion of the source media is shown:

const videoFill = engine.block.getFill(videoBlock);

// Set where in the source video to start (in seconds)
engine.block.setTrimOffset(videoFill, 45.2);

// Set how long to play from that point
engine.block.setTrimLength(videoFill, 30.0);

// Also update the block's duration to match
engine.block.setDuration(videoBlock, 30.0);

Working with Fills and Their Timing

// Get the fill (contains the actual media)
const fill = engine.block.getFill(block);

// Fills have their own timing properties
const trimStart = engine.block.getTrimOffset(fill);
const trimDuration = engine.block.getTrimLength(fill);

// The block's duration should typically match the fill's trim length
engine.block.setDuration(block, trimDuration);

Think of the fill as the media source (which part of the original video to use) and the block as the timeline placement (when and how long it appears). Both need to be updated together for clean edits.

Creating Time-Based Edits from Transcript Words

interface TranscriptWord {
  word: string;
  start: number;
  end: number;
  speaker_id?: string;
}

function applyTranscriptTrim(
  engine: CreativeEngine,
  videoBlock: number,
  words: TranscriptWord[]
) {
  if (words.length === 0) return;

  const startTime = words[0].start;
  const endTime = words[words.length - 1].end;
  const duration = endTime - startTime;

  const fill = engine.block.getFill(videoBlock);

  engine.block.setTrimOffset(fill, startTime);
  engine.block.setTrimLength(fill, duration);
  engine.block.setDuration(videoBlock, duration);
}

Generating Speaker Thumbnails

async function generateSpeakerThumbnail(
  engine: CreativeEngine,
  videoBlock: number,
  timestampSeconds: number,
  size: number = 128
): Promise<string> {
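  const fill = engine.block.getFill(videoBlock);

  // Remember the current trim so taking a thumbnail does not alter the edit
  const previousOffset = engine.block.getTrimOffset(fill);
  const previousLength = engine.block.getTrimLength(fill);

  try {
    // Seek to the specific timestamp
    engine.block.setTrimOffset(fill, timestampSeconds);
    engine.block.setTrimLength(fill, 0.1); // Just a single frame

    // Export as image
    const blob = await engine.block.export(videoBlock, 'image/jpeg', {
      targetWidth: size,
      targetHeight: size,
    });

    // Callers should release this with URL.revokeObjectURL when done
    return URL.createObjectURL(blob);
  } finally {
    // Restore the original trim even if the export fails
    engine.block.setTrimOffset(fill, previousOffset);
    engine.block.setTrimLength(fill, previousLength);
  }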
}

We sample multiple timestamps throughout each speaker’s talk time to show different facial angles and expressions—this helps users identify the right person even if they’re looking away or mid-gesture in one frame.
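
Picking those sample points can be done straight from the diarized transcript. A minimal sketch, assuming the `TranscriptWord` shape from above. `sampleSpeakerTimestamps` is a hypothetical helper, and spacing samples evenly by word count is just one reasonable strategy.

```typescript
interface TranscriptWord {
  word: string;
  start: number;
  end: number;
  speaker_id?: string;
}

// Return up to `count` timestamps spread across the words a speaker said.
function sampleSpeakerTimestamps(
  words: TranscriptWord[],
  speakerId: string,
  count: number = 3
): number[] {
  const spoken = words.filter((w) => w.speaker_id === speakerId);
  if (spoken.length === 0 || count <= 0) return [];

  const samples: number[] = [];
  const n = Math.min(count, spoken.length);

  for (let i = 0; i < n; i++) {
    // Take the word in the middle of each of n equal slices of the talk time.
    const index = Math.floor(((i + 0.5) * spoken.length) / n);
    const w = spoken[index];
    // Use the word's midpoint so the speaker is likely mid-speech in the frame.
    if (w) samples.push((w.start + w.end) / 2);
  }

  return samples;
}
```

Each returned timestamp can be passed to `generateSpeakerThumbnail` to build the set of thumbnails shown to the user.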

Speaker Detection & Face Tracking

Why Semi-Automatic?

Fully automatic speaker detection fails often enough that we added a confirmation step. Users verify detected faces against speaker names from the transcript. It takes a few seconds and prevents bad crops on the entire video.

How It Works

Sample frames throughout the video
Detect & cluster faces using face-api.js (runs in browser, no server needed)
User confirms speaker identities via thumbnails
Correlate with transcript diarization to map speakers → face locations

This gives us verified speaker-to-face mapping for dynamic cropping and picture-in-picture layouts.

Multi-Speaker Templates & Dynamic Switching

The Concept

When a video has multiple speakers, we can create layouts that show:

The active speaker prominently
Other speakers in smaller picture-in-picture views
Dynamic switching as the conversation flows

Creating Picture-in-Picture with CE.SDK

// Duplicate the video block for each speaker slot
const pipBlock = engine.block.duplicate(originalVideoBlock);

// Position and size the PiP
engine.block.setWidth(pipBlock, 200);
engine.block.setHeight(pipBlock, 200);
engine.block.setPositionX(pipBlock, 20); // 20px from left
engine.block.setPositionY(pipBlock, 20); // 20px from top

// Enable cropping
engine.block.setClipped(pipBlock, true);
engine.block.setContentFillMode(pipBlock, 'Cover');

Key Technique: Muting Duplicate Audio

When duplicating video blocks for multi-speaker layouts, each copy has its own audio track. We must mute all but one. The setMuted API operates on the video fill, not the block itself:

// For each speaker slot after the first, mute the video fill
if (slotIndex > 0) {
  const videoFill = engine.block.getFill(duplicatedBlock);
  if (videoFill) {
    engine.block.setMuted(videoFill, true);
  }
}

Dynamic Speaker Switching

As the active speaker changes throughout the video, we:

Detect which speaker is talking (from transcript diarization)
Swap speaker positions in the template
Keep the active speaker in the prominent position

The layout updates automatically as the conversation switches between speakers. We apply different trim offsets to each duplicated block based on the transcript timing—so the main speaker slot shows the person currently talking while PiP slots show the listeners.
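
The slot logic itself is small once you have speaker turns. The sketch below is illustrative only: `SpeakerTurn`, `SlotLayout`, and `layoutsForTurns` are hypothetical names, and it assumes slot 0 is the prominent position while the remaining slots are picture-in-picture views.

```typescript
interface SpeakerTurn {
  speakerId: string;
  start: number; // seconds
  end: number; // seconds
}

interface SlotLayout {
  start: number;
  end: number;
  slots: string[]; // slots[0] is the prominent slot, the rest are PiP
}

// For each turn, put the active speaker first and keep the others in a stable order.
function layoutsForTurns(speakers: string[], turns: SpeakerTurn[]): SlotLayout[] {
  return turns.map((turn) => {
    const listeners = speakers.filter((id) => id !== turn.speakerId);
    return {
      start: turn.start,
      end: turn.end,
      slots: [turn.speakerId].concat(listeners),
    };
  });
}
```

Each `SlotLayout` describes which speaker occupies which duplicated block for that stretch of the timeline. In practice you may also want to merge very short turns so the layout does not flicker on brief interjections.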

Preview, Playback & Export

Setting Up the Canvas

const container = document.getElementById('cesdk-canvas');
// engine.element is the engine's canvas element; append it to the container
container?.append(engine.element);

Playback Controls

engine.player.play();
engine.player.pause();
engine.player.setPlaybackTime(30.5); // seek to 30.5 seconds

const currentTime = engine.player.getPlaybackTime();
const isPlaying = engine.player.isPlaying();

Syncing UI State

engine.player.onPlaybackTimeChanged(() => {
  const time = engine.player.getPlaybackTime();
  updateTimeDisplay(time);
  updateProgressBar(time / totalDuration);
});

engine.player.onPlaybackStateChanged(() => {
  updatePlayButton(engine.player.isPlaying());
});

Export Options

const exportOptions = {
  targetWidth: 1080,
  targetHeight: 1920,
  framerate: 30,
  videoBitrate: 8_000_000, // 8 Mbps
};

const blob = await engine.block.export(
  page,
  'video/mp4',
  exportOptions,
  (progress) => updateProgressBar(progress * 100)
);

// Trigger download
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = 'shortened-video.mp4';
a.click();

For longer videos, consider showing estimated time remaining or allowing background export. Browser export is single-threaded and blocks the tab—a 5-minute export of a 60-second clip isn’t unusual on average hardware, so user feedback is critical.
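
A simple remaining-time estimate can be derived from the progress callback. This sketch assumes progress arrives as a fraction between 0 and 1, as in the export snippet above, and that export speed is roughly constant. Both function names are hypothetical.

```typescript
// Linear extrapolation: if `progress` of the work took `elapsedSeconds`,
// the rest should take elapsed * (1 - progress) / progress.
function estimateRemainingSeconds(elapsedSeconds: number, progress: number): number | null {
  if (progress <= 0 || progress > 1 || elapsedSeconds < 0) return null;
  return (elapsedSeconds * (1 - progress)) / progress;
}

// Human-friendly label for the UI, rounded up to whole seconds.
function formatRemaining(seconds: number): string {
  const total = Math.ceil(seconds);
  const minutes = Math.floor(total / 60);
  const rest = total % 60;
  return minutes > 0 ? `${minutes}m ${rest}s` : `${rest}s`;
}
```

Early estimates tend to be noisy, so it can be worth waiting until a few percent of the export is done before showing a number.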

The Finished App

The user flow:

Upload → Drop a long-form video into the browser
Configure → Pick output mode (highlights/summary/cleanup) and aspect ratio (9:16, 16:9, 1:1)
Verify speakers → Match detected faces to transcript speaker names
Review clips → Browse the 3-4 AI-suggested segments, adjust if needed
Choose template → Solo speaker, sidecar, stacked, etc.
Preview → Scrub through the timeline, see exactly what you’ll get
Export → Download the final video directly from the browser

What’s Next

Ideas for Extension

Caption style controls: Custom fonts, animations, and positioning for subtitles
B-roll insertion: Automatically add relevant stock footage
Music & sound effects: AI-selected background audio
Brand templates: Custom overlays, intros, outros
Batch processing: Process multiple videos in sequence

Taking It Server-Side

Client-side processing for large files strains browser memory, and users must keep the tab open during export. A hybrid approach works better for production—upload in the background while users edit, then render on a server. You can also offload just the export step—let users build their edits in the browser, then send the CE.SDK scene JSON to your backend for faster, background rendering.

CE.SDK runs server-side with the same API. For batch processing, background jobs, or offloading rendering from user devices, see the CE.SDK Renderer for creative automation.

Resources

Made by IMG.LY with CE.SDK

From Prompt to Editor: Running CE.SDK Inside ChatGPT with the Apps SDK

Jan — Fri, 19 Dec 2025 15:43:16 GMT

With the new ChatGPT Apps SDK and Model Context Protocol (MCP), chat interfaces are starting to look less like Q&A tools and more like places where work actually happens. To explore what that means for creative workflows, we built a small technical demo: CE.SDK running directly inside ChatGPT.

From a user’s perspective, the flow is almost trivial. You ask ChatGPT for something like an ecommerce template. ChatGPT searches our template catalog, selects a matching design, and opens it instantly in a fully interactive CE.SDK editor — right inside the chat interface. What looks like a preview is, in fact, a real editor loaded with a real template scene.

This isn’t meant as a product announcement. It’s a technical proof of concept showing how creative SDKs can plug directly into AI-native interfaces.

CE.SDK as a ChatGPT App

https://www.youtube.com/embed/AoDqUNVLLJ4?feature=oembed

The integration is built around a custom MCP server that exposes CE.SDK to ChatGPT as a tool. The server speaks OpenAI’s JSON-RPC–style MCP and implements the standard lifecycle methods (initialize, tools/list, tools/call, resources/read). It knows about our premium template catalog and emits structured payloads that the frontend understands.

On the client side, a Next.js app listens to tool output events streamed from ChatGPT, renders CE.SDK widgets, and hydrates them with the payloads returned by the tool — such as a scene URL, placeholder values, or export permissions. Templates are loaded via CE.SDK’s Template API, either from a URL or from a serialized scene string.

Under the hood, the stack is fairly conventional:

Next.js 15 (App Router)
CE.SDK Web / CreativeEngine
A custom MCP handler to normalize JSON-RPC
Vercel for hosting

What’s new is not the technology itself, but the context in which it runs.

Working with MCP in Practice

The hardest part of the demo wasn’t CE.SDK — it was MCP.

OpenAI’s MCP implementation is extremely strict. Even the smallest schema mismatch can trigger the infamous “TaskGroup 424” error, usually without any hint as to what went wrong. In many cases, the HTTP response is technically successful, but the JSON structure doesn’t match the expected schema closely enough.

The key lesson here is to treat MCP responses as hard contracts:

Validate every response against a schema (for example with zod).
Mirror OpenAI’s field names exactly, even for empty or optional capabilities.
Assume that a 424 almost always means “your JSON shape is wrong”.
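As a rough illustration of treating responses as hard contracts, the sketch below checks a tool-call response before it leaves the server. It is hand-rolled so it runs without dependencies; in practice a schema library such as zod would take its place. The field names (result.content with text items) are assumptions for illustration and should be checked against the schema your client actually expects.

```typescript
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of human-readable problems; an empty list means the shape passed.
// The expected shape here is an assumption, not OpenAI's published schema.
function validateToolCallResponse(value: unknown): string[] {
  if (!isRecord(value)) return ["response must be a JSON object"];
  const problems: string[] = [];
  if (value["jsonrpc"] !== "2.0") problems.push('jsonrpc must be "2.0"');
  const id = value["id"];
  if (typeof id !== "string" && typeof id !== "number") {
    problems.push("id must be a string or number");
  }
  const result = value["result"];
  if (!isRecord(result)) {
    problems.push("result must be an object");
    return problems;
  }
  const content = result["content"];
  if (!Array.isArray(content)) {
    problems.push("result.content must be an array");
  } else {
    content.forEach((item: unknown, index: number) => {
      if (!isRecord(item) || item["type"] !== "text" || typeof item["text"] !== "string") {
        problems.push("result.content[" + index + '] must be { type: "text", text: string }');
      }
    });
  }
  return problems;
}

// A response with a misspelled field name is reported instead of silently sent.
console.log(validateToolCallResponse({ jsonrpc: "2.0", id: 1, result: { contents: [] } }));
```

Returning a list of problems rather than throwing on the first one makes the log more useful when several fields drift at once.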

Another important insight was how critical visual context is in chat-based tools. If your MCP responses don’t include thumbnails or preview images, ChatGPT will often fall back to rendering links. For creative tools, that immediately breaks the experience. In a chat UI, visuals aren’t an enhancement — they are the interface.

State handling also requires a shift in mindset. ChatGPT can replay tool calls, and each prompt effectively creates a new widget instance. You can’t rely on mutating an existing editor. The frontend needs to be idempotent: load scenes from serialized state first, then apply changes. Every tool call should be treated as a fresh render.
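A minimal sketch of that idempotent pattern, assuming a hypothetical payload with a serialized scene and placeholder values. The EditorState type is a stand-in so the example runs without CE.SDK; in a real widget, the two steps would call the editor's scene-loading and variable APIs instead.

```typescript
// Hypothetical payload emitted by the tool; the field names are assumptions.
interface WidgetPayload {
  sceneString: string;                  // serialized scene, the source of truth
  placeholders: Record<string, string>; // changes to apply after loading
}

// Stand-in for an editor instance so the sketch runs without CE.SDK.
interface EditorState {
  scene: string;
  variables: Record<string, string>;
}

// Builds a brand-new state from the payload alone.
// It never reads or mutates a previously rendered widget.
function renderWidget(payload: WidgetPayload): EditorState {
  // Step 1: load the scene from serialized state.
  const state: EditorState = { scene: payload.sceneString, variables: {} };
  // Step 2: apply the requested changes on top of it.
  Object.keys(payload.placeholders).forEach((name) => {
    state.variables[name] = payload.placeholders[name];
  });
  return state;
}

const payload: WidgetPayload = {
  sceneString: "SERIALIZED_SCENE",
  placeholders: { headline: "Summer Sale" },
};

// A replayed tool call produces the same result as the first render.
const first = renderWidget(payload);
const replay = renderWidget(payload);
console.log(JSON.stringify(first) === JSON.stringify(replay)); // prints true
```

Because the output depends only on the payload, it does not matter how often ChatGPT replays the call or how many widget instances exist.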

Why This Pattern Matters

This demo points to a broader change in how creative software may be accessed. Chat becomes a coordination layer, not just a conversational one. Instead of explaining how something could be designed, the AI opens the actual editor and lets the user continue from there.

For CE.SDK, this fits naturally. Editors become embeddable capabilities rather than standalone applications, and AI systems become the entry point into creative workflows. Prompting turns into doing.

Beyond OpenAI: The MCP UI Standard

Although this demo uses OpenAI’s MCP, the architecture maps cleanly to the new MCP UI standard recently introduced by Anthropic. That standard aims to make tool definitions and UI rendering more consistent across models and platforms.

Because this integration already separates tool logic from UI rendering and relies on structured, explicit payloads, transferring it to Anthropic’s MCP UI model is conceptually straightforward. CE.SDK can act as a reusable creative surface across ChatGPT, Claude, and future AI app ecosystems.

You can read more about Anthropic’s MCP UI direction here:

https://blog.modelcontextprotocol.io/posts/2025-11-21-mcp-apps/

This demo is intentionally small and technical, but it highlights a meaningful shift: AI systems that don’t just describe creative outcomes, but open the tools to actually create them.

Animate Between Images - AI-Native Video Workflows with CE.SDK and Veo 3

Jan — Wed, 22 Oct 2025 14:10:31 GMT

With the release of Veo 3.1, we wanted to show just how effortless it is to embed generative AI capabilities directly into creative workflows. In this quick demo, we integrated Veo 3 into CreativeEditor SDK (CE.SDK), enabling users to animate between two still images in just a few clicks.

From Still Images to Motion

In our demo, we start with two images of the same person, one wearing a hat, the other without. Inside the editor, users simply select both images, click the AI context button, and choose “Animate between images.”

The images are loaded into a side panel where users can optionally add a text prompt to guide the transition. Once generated, the resulting short video is placed directly on the canvas, ready for editing, compositing, or export.

What’s particularly impressive is the generation speed. In this example, Veo 3.1 produced a smooth 8-second transition in just 9 seconds, a major improvement compared to earlier versions. This speed makes iterative creative workflows feel fluid and responsive, bridging the gap between prompt-driven generation and real-time editing.

Try It Out

You can check out the implementation on GitHub and give Veo 3.1 a spin inside your CE.SDK instance (sign up for a free trial if you haven’t already).

A Glimpse of AI-Native Editing

This simple feature highlights how easy it is to make your editor AI-native, combining traditional editing tools with generative intelligence.

Practical use cases include:

E-commerce: Show products “in action” or animate between styles and configurations.
Marketing: Create quick product reveal animations from static assets.
Content creation: Generate short motion clips or “tween” between creative scenes.

And if users want longer clips, they can simply add another 8-second track and transition seamlessly into the next.

The Future of Creative Workflows

With Veo 3 integrated, CE.SDK becomes a powerful playground for AI-driven creativity, from image-to-video to scene interpolation and contextual animation.

We have already empowered over 600 innovative startups, government entities, and Fortune 500 companies to add powerful design, video, and photo editing workflows to their products. Get in touch to see how we can do the same for you.

What is Visual Prompting?

Jan — Tue, 29 Jul 2025 13:06:11 GMT

A New Paradigm for Creative AI, Built by IMG.LY

To say it’s trite to refer to the impact of AI in this or that domain as disruptive or groundbreaking would be an understatement. Yet, few areas have been as profoundly affected as the creative process. With just a text prompt, anyone can produce stunning images, remix visual styles, and explore design possibilities at a scale and speed never seen before. AI has inserted itself so quickly into this process that it’s gone from curious novelty to an essential part of the creator toolchain.

The more serious adoption we see, however, the more one key limitation of today’s AI tooling comes into focus: the prompt itself.

Text alone, for all its expressive power, struggles to capture the essence of visual intent. Most creative work doesn’t begin with a sentence; it begins with a sketch, a layout, a mood board, or an arrangement of elements. Visual ideas are shared by pointing, placing, showing.

At IMG.LY, we have begun to think about better ways to direct AI for visual generation. The term we use is Visual Prompting.

Visual Prompting: the practice of composing a visual scene or layout as input for a generative model.

Instead of describing what you want with paragraphs of text, you show it directly using a canvas of images, text, spatial cues, and annotations. This visual composition then becomes the prompt for the AI to generate new content in return. It’s a more natural, intuitive, and powerful way to collaborate with AI, especially when integrated directly into the creative process.

Problem: the Chat Disconnect

The current generation of AI tools has largely been shaped by language-first interfaces. Whether it’s ChatGPT for writing or Midjourney for image generation, the assumption is the same: the user will type a descriptive prompt, and the AI will generate a result based on it.

But when it comes to design, this workflow quickly runs into friction. Visual ideas are inherently spatial and non-linear. Trying to express layout, balance, mood, or specific spatial relationships through text can feel like trying to describe a painting over the phone. It’s possible but unnecessarily cumbersome.

A designer might want to:

Indicate that a certain area in the image should be blue.
Replace a background with a texture sample.
Position a character precisely in a composition.
Annotate which parts of a scene to preserve or modify.

All of these are difficult to express fluently in text. But they’re effortless in a visual interface. The truth is: an image is worth more than a thousand words when prompting an image.

What Is Visual Prompting?

Visual Prompting is a multimodal approach to generative AI, where the input to the model is not just text, but a full visual composition: images, text, annotations, and layout.

Rather than prompting AI in isolation, the user builds their intent on a canvas. This might include:

Reference images that communicate mood or style.
Text blocks indicating desired copy or instructions.
Annotations pointing to specific areas with notes like “make this glow” or “replace this object.”
Spatial composition: where elements are arranged meaningfully to convey intent.

The visual prompt is then interpreted by a multimodal model such as OpenAI’s gpt-image-1 to generate new visual content that reflects not only the textual description, but also the visual context.

How Visual Prompting Works in CE.SDK

About time for an example. As part of our recent AI release, we demoed how to use OpenAI’s gpt-image-1 model to build visual prompting into CreativeEditor SDK (CE.SDK).

Here’s what the process looks like inside CE.SDK:

Compose Visually: The user creates a layout with reference content, uploaded images, icons, color schemes, design elements, placeholder text, and annotations. This composition represents the “prompt” in visual form.
Add AI Layers: With a single click, the user can trigger image generation using CE.SDK’s built-in AI plugin. The plugin sends the visual context (alongside any optional text input) to a multimodal model capable of interpreting both.
Refine and Iterate: Users can adjust the layout, reposition elements, change annotations, or layer in new references, then prompt again. Because the canvas is interactive and editable, the feedback loop is tight.
Build Up Complexity: Over time, users can layer generated images with manually designed components or other generated outputs, creating rich compositions that blend AI creativity with human direction.

This workflow turns the traditional prompt/response cycle into a conversation between the designer and the model, with the canvas acting as the shared language.

Who Is Visual Prompting For?

The use cases for Visual Prompting extend across industries:

Creative teams can go from reference to generation in seconds, iterating visually instead of wrangling prompts.
Marketing teams can generate regionalized or personalized creative variants from a shared layout.
Product designers can prototype in context, turning layouts into realistic screens without leaving the editor.
Storytellers and content creators can use annotated sketches to generate detailed illustrations or scene variations.
E-commerce platforms can give sellers the power to visually customize their brand materials with AI assistance.

In every case, Visual Prompting replaces friction with flow and text-based prompting with something more expressive, more reliable, and more fun.

Built for This: Multimodal Models and CE.SDK’s Plugin System

Visual Prompting is only possible because of two parallel advancements:

Multimodal AI models, such as OpenAI’s gpt-image-1, that can interpret both images and text, understand spatial relationships, and respond to annotated cues.
A flexible, composable editor SDK like CE.SDK, which enables the construction of visual prompts on a live canvas, and makes it easy to integrate AI models directly into the design flow.

Our SDK was built from the ground up to support AI-first creative workflows. Its plugin architecture allows you to add any model or API (image generation, video generation, captioning, text rewriting) and use it natively inside the editor without the need to switch tools or copy/paste.

Generative AI’s full potential is only unlocked when it is embedded directly into the tools creatives use, not siloed in chatbots or separate interfaces. Visual Prompting allows that embedding to go even deeper, aligning the mode of input (visual) with the desired output (visual).

Explore It Yourself

🎨 Try out Visual Prompting in our AI Editor demo
📘 Learn How to Integrate AI into CE.SDK
💬 Contact Us to Bring Visual Prompting to Your Product

Vibe-Engineering: When AI Does All the Coding, What Do We Actually Do?

Daniel — Mon, 07 Jul 2025 09:51:05 GMT

This isn’t an article about AI replacing engineers; it’s about discovering what (software) engineering will become when freed from the mechanics of coding itself.

Vibe-Coding and Vibe-Engineering are omnipresent in my social feeds these days. As with every new trend, it’s hard to judge what really works and what is pure marketing hype. But the central question nagged at me: if AI really can do all the coding, what exactly are we humans supposed to do?

Therefore, I wanted to find out for myself. I didn’t want to build just a Hello World example, so I pulled an old idea out of the closet and dusted it off.

The Experiment: AI Agents Code, Humans Curate

For some time, I had considered porting our IMG.LY background removal library from JavaScript to other platforms using Rust. However, I had postponed this side project due to the anticipated effort involved. This seemed like the perfect project to discover what my role would be in a world where AI does the heavy lifting. To pull it off, I self-imposed only one strict rule:

Hands off the code!

Every commit, every fix, every feature would be handled by an AI coding agent. My role? Being the curator, formulating my intent, reviewing, play-testing, and guiding the agent with feedback.

The agent should be the sole coder, from start to finish.

This was a true hands-off experiment: could an AI build an entire Rust library if I never touched the code?

Is it really something new? After years of handing off development to teammates as a CTO or teaching students, this didn’t feel completely different.

So, what happens when you let an AI do all the coding, and you just review and give feedback? What does it actually mean to be a human in this new paradigm?

The result is baffling and gives a glimpse of what we as engineers can expect for the future.

10,000+ lines of production-ready code. Authored by AI, orchestrated by a human.

Seeing is believing? Then explore the results and judge for yourself at github.com/imgly/background-removal-rs.
I open-sourced the full code and the agent rules used during this experience.

Here’s what I learned: vibe-engineering an entire Rust library with a coding agent CLI as my only pair programmer.

On a quick note, I chose to use Claude Code with Opus 4 and Sonnet 4, as I was already familiar with them and had achieved the best results with them so far. The costs were kept in check by using a Claude Max account, which at the time of writing costs around €100.

The General Workflow That Emerged

After some initial time to get used to working with an AI agent, I realized that the workflow follows pretty closely a typical software engineering flow, but with an important twist that answers our central question: what do humans actually do when AI codes? While most people focus on the code writing part, I discovered that the quality assurance part becomes the human’s primary domain. In my experiments, I realized that to keep the code intact over time, this QA phase is essential—and this is where humans truly shine. Most interestingly, with the right rules and prompting, we can let the agent do the heavy lifting here, too.

Here’s how the collaboration actually worked in practice:

Getting Started
The project began with me setting up the environment and configuring timeouts for long-running tasks—because AI agents need time to compile Rust code and run extensive test suites. Together, the AI and I created a CLAUDE.md file with coding standards and rules (you can see the actual rules we used). We also wrote project requirements documents collaboratively, with me providing the vision and the AI helping structure the technical details.

The Development Dance
Once we got rolling, a natural rhythm emerged. I’d describe what I wanted—something like “add support for different image formats with color profile preservation.” The AI would then create a detailed implementation plan, breaking down the work into manageable chunks. I’d review this plan, often tweaking the approach or scope, then give the green light.

What happened next was fascinating: the AI would write the code, run correctness checks, update all the tests (unit tests, documentation tests, end-to-end tests), run the linter, analyze coverage, and even update documentation and examples. Sometimes it would run performance benchmarks without me asking—the overzealous builder at work.

My New Role: The Quality Guardian
My job became verifying that the intent was met and the API actually made sense. I’d play-test the functionality, try to break it in ways a real user might, and provide feedback. When something worked well, I’d document what we learned for future development. The AI would then write tests to lock in that new functionality, ensuring we never accidentally broke what was working.

This cycle repeated for every feature, with the AI handling the heavy lifting while I focused on direction, quality, and user experience.

The interesting part was that the more I got used to the workflow, the more I handed off to the agent. The validation and general experience had to be largely influenced or orchestrated by me, but building, testing, and even updating documentation could be delegated. In the beginning, I started just building things, but the more we built, the more the AI agent failed to keep existing functionality intact. With those small context windows, it’s kind of understandable after all.

Quality Assurance & Knowledge Capturing Matters More Than Ever

As engineers, we know that at some point we can no longer grasp every effect of our code changes on the whole codebase. Therefore, we use tests and tooling to help us keep our sanity.

The good thing is that the Rust and Cargo toolchains are of exceptionally high quality and provide the right tooling for agents as well.

Because the agent lacks long-term context and has limited knowledge of all the library’s capabilities, the tests and verification mechanisms become crucial. While these are part of the context, they’re not stored in the agent’s memory but baked into the codebase itself. With the tests and the tooling available, the agent will still make mistakes, as I do, but it can effectively auto-heal or auto-correct them using the feedback provided.

I can only advise being even stricter with QA from the very beginning.

Another thing that became apparent very quickly is that, unlike a human colleague with lots of context and a good memory, agents still need significant help to remember things to avoid rediscovering the same information repeatedly. This revealed another crucial human role: context engineering becomes critically important. As such, we engineers become the keepers of institutional knowledge, the ones that help the agent remember what worked and what didn’t. I would describe it as knowledge capture, for lack of a better term.

Tips & Tricks

AI Is Eager, Resourceful, and Has Memory Like a Sieve

Here are the key insights I gained from building the library with an AI coding agent:

Code that works: This might seem obvious, but it wasn’t clear to me whether the library would ever be usable or publishable. It is. The AI proved it could deliver production-ready code.

Time Effectiveness: While creating the library took around 3 weeks, don’t let this fool you. Most of the time, the agent worked alone, and I only had to check in once in a while to review and give feedback. Iteration cycles and rewriting still take time. So I wouldn’t say development was faster than if I did it myself, but much more scalable and effective with my own time. Here’s what I actually spent my time on: reviewing architectural decisions, ensuring the API made sense, and play-testing the final product.

Stamina: The agent never complained, even when fixing 400+ warnings after I set the linter to pedantic. It pushed through all of them, even clustering the warnings and suggesting which to fix first. To be fair, most humans would have either gone crazy or aborted the effort claiming “it’s good enough.” This revealed something important: as humans, we don’t need to be the ones grinding through tedious tasks anymore.

Overzealous Builder: The agent often built more than I asked for. Scoping and clear implementation plans were essential. This taught me that one of the key human roles is setting boundaries and maintaining focus.

Premature “Done”: Sometimes the agent claimed it was finished but left code mocked or incomplete. I learned to always ask, “Is there anything left to implement?” Quality assurance becomes fundamentally a human responsibility.

Old Knowledge: The AI sometimes relied on outdated library knowledge. I had to remind it to check the latest docs and version of dependencies, but when asked, it searched the web, GitHub, crates.io, etc., and gathered all necessary information for using a library. Humans become the validators of information freshness.

Forgetting and Compaction: Larger features grow outside the context window, so the agent would “compact” context, sometimes forgetting important details. Forcing it to write implementation plans in markdown files helps maintain flow and allows better resuming and bookkeeping. We humans become the keepers of long-term project memory.

Helps Establish Rules: While working with the agent, I saw repeating patterns, such as it forgetting how to check, build, and test the application, or best practices like using git worktrees. I started adding new rules to CLAUDE.md myself at first, but quickly realized that the agent is far better at formulating and adding new rules when I ask it to. The human role evolved into being the pattern recognizer and rule creator.

Doesn’t Always Obey the Rules: Most of the time, the rules are followed, but occasionally it doesn’t follow them and forgets to automatically run all tests. This reinforced that humans must remain the final guardians of process and quality.

Code Deletion: Occasionally, the AI “fixed” issues by removing important code paths. Solution: Insist on never removing functionality without discussion, and build a robust test suite. We become the protectors of existing functionality.

Process Management: The AI struggled with background processes, understanding when to run things in the background, and keeping track of what it started. For example, starting a web server via Bash (see Claude Code Bash tool docs), and then trying to use a playwright tool (see also playwright-mcp) to access this server was blocked by these issues. Complex orchestration remains a distinctly human skill.

Overly Agreeable: Too often, it seems to just agree with whatever I said. For the future, I wish agents wouldn’t be so obedient. This highlighted that humans need to be the challengers and devil’s advocates in the process.

Further Remarks

After three weeks of this experiment, the patterns became clear. Here’s what I learned about the human role in AI-assisted development:

Ensure, Verify, and Self-Correct: Always ask the AI to check, format, lint, test, and benchmark its work. Your job becomes quality orchestration, not quality execution.

Always Plan First: Insist on a clear implementation plan before coding begins. Humans excel at high-level architectural thinking and breaking down complex problems.

Write It Down: Let an agent keep notes, todos, and open issues in markdown files. Ideally in the repository itself. Don’t rely on the AI’s memory. We become the institutional memory keepers.

Scope Features: Break tasks into small, manageable pieces. Feature scoping and boundary setting become a core human skill.

Test Suite is King: A robust test suite prevents accidental “fixes” that remove functionality. Humans become the guardians of existing behavior.

Stay Involved: Fast iteration and discussion with the AI is crucial. Don’t go fully hands-off. Active curation and guidance remain essential.

Conclusion

So, what do humans actually do when AI does all the coding?

To answer that, let’s look at what AI agents are today:

Agents are super-talented, overly obedient coding assistants with lots of stamina and endless potential.

They’re fast, never complain, and iterate like champs. But they need your guidance, structure, and a healthy dose of skepticism.

In the end, the best results come from a true partnership: you plan and guide—the AI builds, fixes, and learns.

If you follow these concepts, then

“Vibe engineering is a highly engaging experience.”

And if you’re wondering: yes, I’d do it again. But next time, I’ll make sure the AI writes everything down too, and I’ll put even more effort into validating and securing new features as soon as they’re play-tested and verified. The human role isn’t disappearing—it’s evolving into something more strategic and impactful.

Through this experiment, I’ve identified three fundamental shifts that define what engineering becomes in an AI-assisted world:

From Execution to Orchestration
We’re no longer the ones typing code—we’re the conductors. Our new skills revolve around prompting (framing tasks so AI understands our intent), directing workflows (knowing when to use AI versus human judgment), and tool fluency (combining different AI tools to create something greater than their parts).

From Knowledge to Reasoning
AI can recall facts and syntax better than any human ever could. But what matters now is our ability to interpret, contextualize, and make sense of uncertainty—something current AI still struggles with. We become the ones who ask “why” and “what if” rather than just “how.”

From Repetition to Adaptation
Skills rooted in routine are increasingly automated. The enduring value lies in adaptive thinking, problem-framing, and reinventing approaches when the usual patterns don’t apply. We’re the ones who recognize when something feels off, even if we can’t immediately articulate why.

Afterthought

During this experiment, I saw that the agent repeated similar tasks over and over again. One thing that struck me was looking up documentation. Whenever something with a third-party library didn’t pan out, it started web-searching for info and guides about the library. It rarely started out with reading the Rust docs first, so I had to tell it to use the Rust docs. Therefore, I assume that providing quick access via tools to guides, docs, API references, and examples will spare me some roundtrips, time, and tokens and accelerate the process. I know that there are tools like context7, but it seems to be focused on JavaScript. At least for the Rust ecosystem, the docs are centralized and standardized and can even be used locally. A tool to access those quickly and find things in them would be highly beneficial.

Additionally, for coding projects, there are typical steps to follow to verify if the code is sound. It’s probably also best to provide specific tools or workflows to follow these steps exactly. Rules already help with this, though.

Last but not least, a natural next step would be for an agent to analyze its history periodically and provide new memory entries (rules) based on that to reduce repetition and common mistakes. For more details, see the Appendix: Most used prompts section, where I analyzed the top 10 most common prompts to help agents proactively develop useful rules.

If you’re curious about improvements, I’ve put together a wishlist for Claude Code that would make the agent even more useful.

Try It Yourself

If you want to experience the results of vibe-engineering firsthand:

# Clone and explore the codebase
git clone https://github.com/imgly/background-removal-rs.git
cd background-removal-rs
git checkout v0.2.0

# Or install the CLI directly
cargo install --git https://github.com/imgly/background-removal-rs.git --tag v0.2.0

Remember: every line of code in this library was written by AI while I played the role of curator, architect, and quality guardian. Judge for yourself whether this new paradigm produces production-ready results.

Appendix

Appendix: Claude Code Wishlist

During the experiment, I encountered some issues whose resolution would improve the agent.

Project-scoped history (not global).
Project-scoped todos (not global).
Project-scoped implementation plans (not global).
Customizable /compact prompt or choose other compact strategies.
Improved handling of background processes.
History analytics with memory proposal.
Code-specific predefined Code and Validate workflows.
Allow forking of multiple agents from a single point in the session history to create multiple trials with the same intent and context.

Appendix: Most Used Prompts

Claude Code stores its history under ~/.claude/history as JSON. I asked Claude Code to read it in and categorize the most used prompts.

#	Category	% Usage	Description	Examples
1	Other/Specific Instructions	19.5%	Technical specs, detailed requirements	“Relax the success metrics”, “Use tensor data directly as alpha channel”
2	Questions	12.7%	Status checks, clarifications	“What’s next?”, “How can I test it myself?”, “Do we have any mock implementations?”
3	Implementation Requests	11.1%	Direct build/create requests	“Create a PRD for Rust port”, “Make ONNX Runtime injectable”
4	Simple Responses	10.0%	Short confirmations, approvals	“ok”, “go for it”, specific technical choices
5	Continuation Commands	8.4%	Requests to proceed	“go on” (59 times)
6	File Operations	8.3%	File management	“Write this into PRD.md”, “Move packages into crates directory”
7	Context Summaries	8.3%	Session continuation due to context limits	“This session is being continued from a previous conversation…”
8	Analysis/Review Requests	7.7%	Requests to analyze/review	“Analyze my background removal project”, “Check the preprocessing”
9	Bug Fix/Issue Resolution	4.4%	Problem identification/fixes	“You mix up release and debug”, “This is wrong”
10	Testing	3.8%	Running/validating tests	“Test images are incorrect”, “Rerun comparison tests”
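The categorization above was produced by the agent itself, not by a script. For readers who want to run a similar analysis on their own prompts, here is a rough sketch of a keyword-based classifier that computes each category's share. The rules and the fallback category are hypothetical, and the input is a plain list of prompt strings because the layout of the history data is not described here.

```typescript
// Hypothetical keyword rules, checked in order; the first match wins.
const rules: { category: string; pattern: RegExp }[] = [
  { category: "Continuation Commands", pattern: /^(go on|continue)$/i },
  { category: "Simple Responses", pattern: /^(ok|yes|go for it)$/i },
  { category: "Questions", pattern: /\?$/ },
  { category: "Testing", pattern: /\btests?\b/i },
];

function categorize(prompt: string): string {
  const text = prompt.trim();
  for (const rule of rules) {
    if (rule.pattern.test(text)) return rule.category;
  }
  return "Other/Specific Instructions"; // fallback when no rule matches
}

// Returns each category's share of all prompts as a percentage with one decimal place.
function usageShare(prompts: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const prompt of prompts) {
    const category = categorize(prompt);
    counts[category] = (counts[category] || 0) + 1;
  }
  const share: Record<string, number> = {};
  Object.keys(counts).forEach((category) => {
    share[category] = Math.round((counts[category] / prompts.length) * 1000) / 10;
  });
  return share;
}

console.log(usageShare(["go on", "ok", "What's next?", "Rerun comparison tests"]));
```

Simple rules like these will misfile many prompts, which is exactly why handing the categorization to the agent was the more practical route in the experiment.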

Appendix: Claude Code Settings

In real-world projects, the default timeout of two minutes is not enough. I bumped the timeouts to the maximum values to allow execution of unit tests, E2E tests, benchmarks, and long-running tasks.

BASH_DEFAULT_TIMEOUT_MS: Sets the default timeout (in milliseconds) for long-running bash commands.
BASH_MAX_TIMEOUT_MS: Specifies the maximum timeout (in milliseconds) that can be set for bash commands.
BASH_MAX_OUTPUT_LENGTH: Limits the maximum number of characters in bash outputs before they are truncated in the middle.
CLAUDE_BASH_MAINTAIN_PROJECT_WORKING_DIR: Ensures the working directory is reset to the original after each Bash command.

{
  "env": {
    "CLAUDE_BASH_MAINTAIN_PROJECT_WORKING_DIR": "true",
    "BASH_MAX_TIMEOUT_MS": "3600000",
    "BASH_DEFAULT_TIMEOUT_MS": "3600000",
    "MCP_TIMEOUT": "3600000",
    "MCP_TOOL_TIMEOUT": "3600000"
  }
}

Library Capabilities

High-performance Rust library for AI-powered background removal with hardware acceleration, built for production scale and developer productivity.

Performance Highlights

Hardware acceleration: CUDA (NVIDIA), CoreML (Apple Silicon), CPU fallback
Sub-second processing on modern hardware (100-1200ms depending on image size)
Memory efficient with optimized threading and session reuse

Key Features

AI Models & Quality

Multiple state-of-the-art models (ISNet, BiRefNet)
FP16/FP32 precision variants for performance vs. quality
Portrait-optimized and general-purpose models
Custom model support via ONNX models from Hugging Face

Format and Color Profile Support

Input: JPEG, PNG, WebP, TIFF, BMP with ICC color profile preservation
Output: PNG (transparency), JPEG, WebP, TIFF, raw RGBA8

Integration Patterns

One-liner API for simple use cases
Session-based processing for batch operations
Stream processing from any AsyncRead source
CLI tool for standalone usage and pipeline integration

Architecture & Platforms

Dual Backend System

ONNX Runtime: Maximum performance, GPU acceleration
Tract: Pure Rust, zero external dependencies

Platform Support

macOS: Apple Silicon + Intel with CoreML acceleration
Linux/Windows: NVIDIA CUDA + CPU fallback

Developer Experience

Modern Rust Ecosystem

Async/await native support
Comprehensive documentation and examples
Zero-warning policy with extensive testing
Structured tracing for production observability

CLI Capabilities

Batch processing with recursive directory support
Model management (download, cache, clear)
Provider diagnostics and performance monitoring
Pipeline-friendly with stdin/stdout support

3,000+ creative professionals get early access to new features and updates—don’t miss out, and subscribe to our newsletter.

IMG.LY x AI

Eray — Mon, 30 Jun 2025 13:18:14 GMT

Over the past two years, one question has consistently emerged in conversations with customers: “What about AI?”
While AI promises to disrupt many industries, it remains difficult to grasp how this technology will reshape creative workflows.

Our customers look to us for guidance: What is IMG.LY’s vision? How will our SDK help them harness this wave of innovation?
Last month, we took our first significant step by launching an initial suite of generative AI-powered features for our web SDK. We’ve embedded these tools deeply into photo, video, and design editing workflows, and the response from customers and prospects was overwhelmingly positive.

And this is just the beginning. An immense opportunity lies ahead for both IMG.LY and our customers to drive transformation in the creative domain through our SDK. In this post, I want to share our vision for the future of creative tools powered by AI and our SDK.

Going forward, we’re focusing on three central goals:

1. Deep Integration of AI Capabilities into Editing Workflows

The pace of AI innovation continues to be remarkably fast. New models and configurations emerge almost daily, some as APIs, others open-sourced on platforms like Hugging Face. While some offer generalist features, others provide specialized, industry-specific capabilities, such as automatically obscuring license plates, blurring faces or replacing skies for property exteriors.

The true value of these AI tools emerges not in isolation, but when they’re seamlessly combined within existing workflows.

That’s why we built a plugin system for CE.SDK.
The CE.SDK plugin system has fundamentally changed how quickly we, and our customers, can integrate new capabilities. We’ve created a way to bring AI models and agents directly into connected workflows within the editor, making AI feel native rather than bolted on.

A recent example is our integration of OpenAI’s gpt-image-1 API, which we implemented within days of its release. We used it to build a visual prompting workflow that takes into account all elements of a page (text, images, and annotations) to generate results based on the complete layout context.

These early successes are encouraging us to ship even faster and bring new features to our customers.

We’re now partnering with fal.ai to provide a nearly out-of-the-box experience to use any new generative AI model through their platform (more on this soon). Fal.ai excels at speed—when exciting new models emerge, they offer API access within days. This partnership ensures our users can quickly access new AI capabilities with minimal effort. More partnerships and integrations are on the horizon, bringing world-class AI APIs with intuitive interfaces directly into the editor.

2. Enabling AI Agents as Creative Collaborators

AI agents, like humans, leverage tools to accomplish tasks efficiently. CE.SDK occupies a unique position as a highly adaptable, multi-platform technology for editing various media types—including a fully documented headless version that’s perfect for programmatic control.

Thanks to CE.SDK’s fully documented API and headless architecture, AI agents can be created to navigate and operate the editor programmatically. This opens the door to agents that act as powerful scaffolders:

Generating initial designs and layouts based on brief descriptions
Automatically aligning content with brand guidelines
Transforming static designs into dynamic videos with a single command
Creating engaging short video content from prompts
Adapting existing designs to new formats and dimensions

Crucially, everything an AI agent creates remains fully editable by human users. You can refine, add nuance, and perfect the results through collaboration. AI provides the scaffolding; humans remain the tastemakers, adding the creative spark that makes designs truly exceptional.

3. AI-Powered SDK Configuration and Customization

AI agents are already actively involved in building, refining, and optimizing software tools—a trend that’s accelerating rapidly. This shift directly impacts developer experience and points to an exciting future where our SDK serves not just human developers, but AI agents as well.

To facilitate this interaction, we’ve already taken steps like making our documentation available in LLM-friendly formats. But this is just the beginning of our journey toward radically improving the experience for both developers and AI agents.

Our ultimate goal is conversational configuration. Imagine describing your requirements in plain language, and having an AI agent handle the rest:

Configuring the SDK’s visual aesthetics to match your brand
Setting up custom functionality and workflows
Integrating media libraries and plugins
Optimizing performance for specific use cases

This isn’t just about making development faster. It’s about democratizing access to powerful creative tools, allowing anyone to build sophisticated editing experiences—regardless of technical expertise.

Looking Forward

We believe that AI collaboration represents a transformative shift in creative technology, empowering users and developers to achieve extraordinary outcomes. At IMG.LY, we’re committed to being at the forefront of this exciting journey. Our vision extends beyond simply adding AI features: we’re reimagining how creative tools are built, configured, and used in an AI-augmented world.

Stay ahead with us: subscribe to our newsletter for exclusive updates.

AI-first Visual Editor using GPT-4o’s gpt-image-1 Model

Eray — Mon, 05 May 2025 20:58:07 GMT

What We Built

We integrated OpenAI’s new gpt-image-1 API (from GPT-4o) directly into our fully functional visual editor, CreativeEditor SDK (CE.SDK), enabling generation, editing, and refinement of images without ever leaving your creative workflow.

Open AI Editor Demo Page

From Simple Image Generation to Visual Prompting on a Canvas

Inside the editor, users can now:

Generate Images
Use prompts to generate images from scratch.

Generate Images from Visual Prompts
Turn full compositions—images, text, and annotations—into fresh visual content. Just select your page and let AI handle the rest, as shown in the video.

Reimagine Images & Text
Edit existing images and text with prompts to iterate faster and create variants.

Create Incredible Compositions
Combine generated and uploaded images into complex compositions.

Each step builds on the last, evolving from basic generation into true visual prompting powered by multiple input modes, all within one canvas. Check out the live demo here.

How We Built It

We built this integration using our CE.SDK and its flexible plugin system, designed from the ground up to support AI-first creative workflows.

This approach lets developers plug in any model or API—text, image, video, or audio—and run them all in one seamless editing flow. Whether you’re using OpenAI, Stability, or an in-house model, CE.SDK gives you the tools to bring it into the visual workflow natively.

🔗 Check out our AI Editor.
📘 Learn how to integrate AI into CE.SDK.

Why This Matters

Generative AI’s full potential isn’t unlocked by prompting alone; it’s unlocked when the models are embedded into real-world creative workflows.

Designers, marketers, and content teams don’t just need outputs; they need control, iteration, and context. By bringing AI directly into the canvas where assets are created and edited, we turn generative models into tools for actual production, not just ideation.

This shift enables:

Creative work in context: No switching between ChatGPT and design tools.
Real-time augmentation: Prompt, edit, refine in place.
Scalable content generation: Automate localization, personalization, and variants.
Multimodal orchestration: Use visuals, layouts, and annotations as inputs.

It’s a step toward making multimodal AI usable for real design workflows, not just concept generation.

Integration & Feedback

The linked demo is rate-limited. If you would like to test more extensively, or are interested in giving the AI editor a spin inside your own app, you can get started with our documentation.

We’d love your feedback: any thoughts, questions, and ideas are welcome!
Reach out to us.


OpenAI GPT-4o Image Generation (gpt-image-1) API: A Complete Guide for Creative Workflows for 2025

Jan — Mon, 28 Apr 2025 07:55:48 GMT

Update: AI-first Visual Editing

A day after the release of the gpt-image-1 API, we took it for a spin and integrated it into CreativeEditor SDK. Users can now generate images, create variants and use the canvas to compose visual prompts with our design editor. See it in action:

Open AI Editor Demo Page

Introduction

The release of OpenAI’s gpt-image-1 model signals a pivotal shift in the creative developer landscape—one that moves beyond static, one-shot image generation and toward a more dynamic, multimodal interaction model. Until recently, most image APIs followed a predictable pattern: submit a prompt, receive a finished image. The process was useful, but flat. What’s changing now is not just image quality or style fidelity, but the shape of the workflow itself. With gpt-image-1, built on the GPT-4o foundation, developers can start designing creative tools that feel conversational and iterative. This evolution invites a new kind of interface where prompting, tweaking, and refining happen inside the canvas, not outside of it.

For teams building a creative editing experience into their app, this moment coincides with the release of IMG.LY’s AI Editor SDK, a powerful, fully integrated toolkit designed for generative workflows. The SDK is already equipped to support interactive image generation, contextual editing, and multimodal inputs, and you can try it today through this live demo.

This guide is a comprehensive introduction to the gpt-image-1 API, but it also goes further. It’s not just about wiring up an endpoint; it’s about rethinking what image generation means in a user-centric product.

From prompt handling to interactive iteration, we’ll walk through how to design creative cycles, not just outputs: how to go from generating images to integrating gpt-image-1 into real creative workflows, where AI becomes a tool that bends to user intent, not the other way around.

Overview of `gpt-image-1`

OpenAI’s gpt-image-1 model, released in April 2025, is the latest evolution in the company’s generative image lineup and marks a turning point in how developers approach visual creation inside applications. Built on the same multimodal foundation as GPT-4o, this model allows applications to move beyond one-shot static generation and instead build toward more conversational, iterative image workflows.

Model Architecture and Capabilities

gpt-image-1 is rooted in GPT-4o’s ability to understand and generate across modalities. It is designed to produce high-resolution images (up to 1536×1024 pixels, per the size options listed below) based on natural language prompts. The model handles complex scenes with more fidelity than previous iterations and provides improved consistency in how it interprets detailed descriptions. This is particularly relevant for tools that need reliability when turning prompt inputs into design elements.

Parameter Control

Developers working with gpt-image-1 have access to a streamlined set of parameters. Here is a subset of the most important ones:

prompt: The primary text input describing the desired image.
size: Choose between “1024x1024”, “1024x1536” (portrait), “1536x1024” (landscape), or “auto” (default, based on prompt).
n: Number of images to generate (default is 1).
quality: Rendering quality, one of “low”, “medium”, “high”, or “auto” (default).
response_format: Not accepted by gpt-image-1. The model always returns base64-encoded image data (b64_json); URL outputs are not supported.

Unlike DALL·E 3, gpt-image-1 does not accept the style parameter. Output is driven by the text prompt together with the size and quality selection.

Full documentation of these options is available via OpenAI’s official guide.
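As a minimal sketch of how the parameters above could be checked before a request is sent, the helper below builds a plain options object. The function name buildImageRequest and its validation rules are illustrative assumptions, not part of the OpenAI client; the returned object is shaped so it could be passed to openai.images.generate().

```javascript
// Sizes taken from the parameter list above.
const ALLOWED_SIZES = ['1024x1024', '1024x1536', '1536x1024', 'auto'];

// Hypothetical helper: validates the options and returns a plain request object.
function buildImageRequest(prompt, { size = 'auto', n = 1 } = {}) {
  if (typeof prompt !== 'string' || prompt.trim() === '') {
    throw new Error('prompt must be a non-empty string');
  }
  if (!ALLOWED_SIZES.includes(size)) {
    throw new Error(`unsupported size: ${size}`);
  }
  if (!Number.isInteger(n) || n < 1) {
    throw new Error('n must be a positive integer');
  }
  // response_format is deliberately left out: output is always base64.
  return { model: 'gpt-image-1', prompt: prompt.trim(), size, n };
}

console.log(buildImageRequest('A lighthouse at dusk', { size: '1536x1024' }));
```

Validating on the client side keeps obviously invalid requests from counting against your quota.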

Style and Use Case Alignment

By supporting a wide range of stylistic templates, gpt-image-1 positions itself as a flexible backend for everything from marketing collateral to storyboarding tools. The output can be tailored to suit technical illustrations, concept art, or even photorealistic renderings, allowing developers to map visual outputs more directly to brand or product requirements.

Limitations and Future Direction

As of April 2025, gpt-image-1 keeps no context between requests: every call is stateless, so conversational refinement has to be orchestrated by the application. Edits to existing images, including masked inpainting, go through the separate image edits endpoint rather than a chat-style exchange. However, its tight coupling with GPT-4o suggests that future iterations may embrace persistent context, conversational refinement, or even integrated image-plus-text exchanges within the same session. For developers building editors or multimodal workflows, the current model lays a strong foundation for these future capabilities.

API Setup and Usage

2.1 Get Access

To start using gpt-image-1, developers must first register for access via the OpenAI platform at platform.openai.com. Access requires an API key, which is tied to your OpenAI account and associated usage limits based on your billing tier. Be sure to confirm that your account is approved for image generation, as availability may differ by region and subscription level. Once authenticated, keys can be created in your dashboard and stored securely in your server or development environment.

2.2 First Image Generation (Node.js Example)

The image generation API for gpt-image-1 can be used directly via OpenAI’s official Node.js client. Below is a complete example showing how to send a prompt, receive base64-encoded image data in response, and save it to disk as a PNG file:

import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY, // make sure this is securely set
});

async function generateImage() {
  try {
    const prompt = `
    A studio ghibli style illustration of a cyberpunk girl holding a butterfly on her finger.
    `;

    const result = await openai.images.generate({
      model: 'gpt-image-1',
      prompt,
      size: '1024x1024', // or "1024x1536", "1536x1024", or "auto"
    });

    const image_base64 = result.data[0].b64_json;
    const image_bytes = Buffer.from(image_base64, 'base64');
    fs.writeFileSync('butterfly.png', image_bytes);
    console.log('Image saved as butterfly.png');
  } catch (err) {
    console.error('Error generating image:', err);
  }
}

generateImage();

Remember that all outputs from gpt-image-1 are delivered as base64-encoded image data inside the JSON response. Developers should decode this data for display, storage, or further processing within their applications. For complete parameter options and examples, consult the OpenAI Images API guide.
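The decoding step can be sketched as a small helper. This is an illustration only: decodeImage is a hypothetical name, and the PNG check assumes the default output format, which the example above saves as a .png file.

```javascript
// First four bytes of every PNG file.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47];

// Hypothetical helper: turns a b64_json string into bytes and a data URL.
function decodeImage(b64Json) {
  const bytes = Buffer.from(b64Json, 'base64');
  const isPng = PNG_SIGNATURE.every((byte, i) => bytes[i] === byte);
  // Fall back to a generic MIME type when the signature is not PNG.
  const mime = isPng ? 'image/png' : 'application/octet-stream';
  return { bytes, isPng, dataUrl: `data:${mime};base64,${b64Json}` };
}

// Usage with a stand-in payload (a PNG signature, not a real image).
const demo = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]).toString('base64');
console.log(decodeImage(demo).isPng);
```

The bytes can be written to disk or uploaded, while the data URL is convenient for an immediate preview in the browser.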

Integrating with CE.SDK

Embedding gpt-image-1 into a creative editor like CE.SDK is about more than just piping an image into a canvas. It reshapes how users interact with content creation, bridging manual design work and AI-driven generation within the same editing environment. Rather than operating as a standalone prompt generator, gpt-image-1 becomes a continuous creative partner inside your editor. For an in-depth technical guide on how to integrate gpt-image-1, stay tuned for our upcoming tutorial and sign up to our newsletter to be notified when it goes live.

Embedding Image Generation in a Creative Editing Workflow

The natural entry point for gpt-image-1 inside CE.SDK is a dual-mode experience that lets users start either from scratch or from existing context. In “from scratch” mode, a user might open a blank scene and initiate an image generation by writing a prompt, for example “Create a vibrant festival scene at sunset.” The result appears directly on the canvas, immediately editable like any other design element.

Where gpt-image-1 shows its real potential is in “in-context editing.” Here, users interact with existing content (a background, a product shot, or a decorative element) and trigger AI enhancements based on that visual context. A user might select an image of a bird, as in the example below, and ask for variants, initiate a background swap, or request a change like adding more birds through a conversational interface embedded in the editor. Because CE.SDK treats generated images as first-class canvas elements, context such as positioning, layering, and cropping is preserved throughout the process.

Let’s see what this might look like in practice. We positioned an image of a single bird on our canvas. By opening the AI context menu, we can now manipulate that image in place using the OpenAI API:

We edit the image and prompt the API to add more birds:

We see that the model correctly identified the type of bird in the picture (seagull) and filled the scene with a flock of flying seagulls.

We can now continue to work with the image, overlaying filters, changing the texture, cropping etc.

Switching Between Manual Edits and AI-Powered Enhancements

A critical design principle when integrating gpt-image-1 is giving users freedom to toggle between manual edits and AI suggestions. Manual edits such as cropping, masking, and compositing should always remain possible after generation, while users can also seamlessly prompt gpt-image-1 for additional changes without losing prior work. Think of variant generation as a branch: a user picks a generated image and creates “forks” by asking for alternate styles, different lighting, or new thematic elements.

In this setup, the generated image serves as a stable node in the creative graph, while edits and regenerations can attach contextually. This workflow minimizes user frustration by avoiding the “start over” penalty typical of isolated generation APIs. It also opens up more complex creative behaviors, like blending user-drawn sketches with AI-augmented refinements, or iteratively developing an asset library around a consistent visual theme.
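One way to picture the branching model is a small in-memory tree in which each generated image is a node and each fork records its parent. This is a sketch under our own assumptions: the VariantTree class and its methods are hypothetical and not part of CE.SDK or the OpenAI client.

```javascript
// Hypothetical model of generated images and their forks.
class VariantTree {
  constructor() {
    this.nodes = new Map(); // id -> { id, parentId, prompt }
    this.nextId = 1;
  }

  // Register an image; parentId stays null for a fresh generation.
  add(prompt, parentId = null) {
    if (parentId !== null && !this.nodes.has(parentId)) {
      throw new Error(`unknown parent: ${parentId}`);
    }
    const id = this.nextId++;
    this.nodes.set(id, { id, parentId, prompt });
    return id;
  }

  // Ids of the direct forks of a node.
  forksOf(id) {
    return [...this.nodes.values()]
      .filter((node) => node.parentId === id)
      .map((node) => node.id);
  }

  // Prompts from the original generation down to the given node.
  lineage(id) {
    const prompts = [];
    for (let node = this.nodes.get(id); node; node = this.nodes.get(node.parentId)) {
      prompts.unshift(node.prompt);
    }
    return prompts;
  }
}

const tree = new VariantTree();
const original = tree.add('a seagull over the sea');
const warm = tree.add('warmer evening light', original);
tree.add('add more birds', original);
console.log(tree.forksOf(original), tree.lineage(warm));
```

Keeping the lineage of prompts makes it possible to return to any earlier node and branch again without regenerating from scratch.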

An upcoming in-depth tutorial will walk through implementing this multimodal workflow step-by-step, but the key takeaway is that gpt-image-1 shines brightest when it is embedded into a creative loop—not treated as a black-box generator, but as an interactive, iterative design companion.

Prompt Engineering Tips

One of the most overlooked but critical factors in successful image generation is prompt design. With gpt-image-1, prompt engineering isn’t just about describing an image—it’s about steering the model toward intent, tone, composition, and usability. Because the model is capable of rendering complex scenes and a wide range of styles, thoughtful phrasing and contextual hints can dramatically affect the outcome.

Writing for Visual Intent

Start by clarifying what the image is supposed to communicate. Are you looking for atmosphere, action, product detail, or narrative clarity? A prompt like “a city skyline at night” is a starting point, but it leaves too much to chance. Adding elements like “view from a rooftop bar, with glowing signage and overcast haze” gives the model anchors for both composition and mood.

Leveraging Artistic Language

You can further refine outputs by referencing mediums or artistic schools. Prompts that include terms like “in watercolor style,” “oil painting,” “80s anime aesthetic,” or “studio photography” help the model lock onto a particular visual identity. These cues not only improve stylistic fidelity but also align the output with specific brand or genre expectations, which is especially important for products with a defined look and feel.

Creating Consistency in Branded Outputs

When generating a set of related images, such as social media creatives, campaign assets, or UI visuals, consistency becomes more important than variety. To achieve this, structure prompts with repeatable patterns and include brand elements such as color palettes, motifs, or reference characters. While gpt-image-1 doesn’t yet support persistent memory across requests, consistency can be enforced by prompting with the same style terms, layout descriptions, and constraints. Teams working within CE.SDK can even pair prompt templates with locked canvas layers to preserve composition between generations.
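A repeatable prompt pattern can be as simple as a template that appends the same style, palette, and layout constraints to every subject. The sketch below is illustrative: the brand preset, its field names, and the brandedPrompt helper are all made-up examples.

```javascript
// Hypothetical brand preset; all values are placeholders.
const brand = {
  style: 'flat vector illustration',
  palette: ['deep teal', 'warm sand', 'coral accent'],
  layout: 'subject centered, generous empty space at the top for a headline',
};

// Appends identical constraints to every subject so a series stays consistent.
function brandedPrompt(subject, preset) {
  return [
    subject.trim(),
    `Style: ${preset.style}.`,
    `Color palette: ${preset.palette.join(', ')}.`,
    `Layout: ${preset.layout}.`,
  ].join(' ');
}

const subjects = ['a runner tying their shoes', 'a cyclist at sunrise'];
const prompts = subjects.map((subject) => brandedPrompt(subject, brand));
console.log(prompts);
```

Because only the subject changes between requests, differences in the results are easier to attribute to the subject and not to drifting style wording.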

Ultimately, good prompt engineering is not about verbosity but about clarity and constraint. It’s less like writing poetry and more like drafting a product spec. The best prompts are focused, directive, and give the model just enough creative freedom within clear boundaries. However, effective prompting should not burden the user. In practice, the interface should abstract most of the complexity away. Users can be guided toward better outputs through simple UI choices—selecting predefined styles, choosing themes, or adjusting mood settings—while the system dynamically enhances and augments their input behind the scenes. By managing the technical depth invisibly, you enable a creative process that feels intuitive and powerful without ever making prompt engineering the center of the user experience.

Real-World Use Cases

The versatility of gpt-image-1 makes it especially impactful across a variety of industries where visual content creation is either a core product feature or a major operational need. Beyond isolated image generation, the model supports workflows that demand contextual awareness, brand consistency, and iterative refinement, key ingredients for modern digital products.

Web-to-Print

In web-to-print applications, customers expect to customize marketing materials, event invitations, signage, or packaging with minimal friction. By integrating gpt-image-1, developers can offer template-driven personalization where users simply select a theme or enter a few keywords, and receive ready-to-edit visual assets. Combined with CE.SDK’s layout and editing capabilities, this enables a highly interactive experience where generated backgrounds, graphical elements, or themed illustrations can be dynamically placed into editable templates.

Marketing

Marketing teams rely on high-frequency content creation, often needing visually consistent, campaign-specific assets. gpt-image-1 can assist by automating the generation of background scenes, promotional visuals, and thematic graphics based on campaign briefs. Brands can define style presets aligned with their visual identity, making it easy for marketing teams to produce “on-brand” assets without heavy design overhead. Integrating image generation directly into campaign builders or social scheduling tools amplifies speed without sacrificing quality.

Digital Asset Management (DAM)

Asset libraries often suffer from gaps: missing variants, seasonal versions, or content tailored to different demographics. DAM systems can integrate gpt-image-1 to extend asset catalogs dynamically. Instead of manually commissioning variations, users can generate alternative backgrounds, localize visuals with region-specific elements, or adjust brand visuals for different markets—all from a single master file. With CE.SDK handling structured editing, teams maintain asset consistency while boosting creative flexibility.

E-Commerce

Product visualization remains a huge challenge in e-commerce, especially for smaller retailers. gpt-image-1 can be used to automatically create product lifestyle imagery, context backgrounds, or thematic campaigns without expensive photo shoots. For example, a single shoe photograph can be placed into a generated “urban,” “sporty,” or “luxury” background, customized according to target audiences. When tightly integrated into e-commerce platforms, this enables faster product launches, A/B tested visuals, and localized campaigns at scale.

E-Learning

Educational platforms can harness gpt-image-1 to generate explanatory diagrams, thematic illustrations, or scene-based visual storytelling assets. Instead of relying solely on static stock imagery, teachers, course designers, or even learners themselves can prompt the generation of custom visuals aligned with the curriculum. When embedded into authoring tools, this approach accelerates content creation and enables more engaging, visually enriched learning experiences tailored to specific topics and age groups.

Cost Optimization

While gpt-image-1 opens up impressive creative possibilities, it also introduces new cost considerations that developers and product teams must plan for carefully. Since image generation typically incurs higher API costs than text-based operations, structuring workflows efficiently becomes critical, especially at scale.

Balancing Price, Quality, and Resolution

The cost of generating an image with gpt-image-1 depends significantly on both the requested size and the selected quality setting. Larger sizes and higher quality settings produce sharper, more detailed results, but they also consume more compute resources and therefore cost more. For many use cases, especially previews, 1024×1024 at low or medium quality strikes a good balance between visual fidelity and API efficiency. Reserving the highest quality settings for final exports or premium workflows can help manage overall spend without compromising user experience.
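The preview-versus-final split could be encoded as a small lookup so that the cheaper setting is the default path. In this sketch the stage names and the requestForStage helper are hypothetical, and the quality values assume the API's quality option accepts “low” and “high”.

```javascript
// Illustrative mapping from workflow stage to quality setting.
const QUALITY_BY_STAGE = new Map([
  ['preview', 'low'],
  ['final', 'high'],
]);

// Hypothetical helper: builds request options for a given workflow stage.
function requestForStage(prompt, stage, size = '1024x1024') {
  const quality = QUALITY_BY_STAGE.get(stage);
  if (quality === undefined) {
    throw new Error(`unknown stage: ${stage}`);
  }
  return { model: 'gpt-image-1', prompt, size, quality };
}

console.log(requestForStage('a festival scene at sunset', 'preview'));
console.log(requestForStage('a festival scene at sunset', 'final', '1536x1024'));
```

Keeping the mapping in one place makes it easy to adjust once real usage data shows where the spend actually goes.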

Image Reuse and Smart Upscaling

One practical cost-saving approach is to design workflows that encourage image reuse. Instead of regenerating similar images for every small variation, applications can create high-quality master images and allow users to crop, edit, or layer additional design elements dynamically. Integrating smart upscaling techniques (for instance, using specialized image enhancement libraries after initial generation) also allows teams to work with smaller base images without sacrificing end-user quality.
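Image reuse can start with a simple cache keyed on the request, so an identical prompt with identical settings is never paid for twice. The ImageCache class below is a hypothetical in-memory sketch; a production system would more likely persist the data in a shared store.

```javascript
// Hypothetical in-memory cache for generated images (base64 strings).
class ImageCache {
  constructor() {
    this.store = new Map();
  }

  // Build a stable key; whitespace differences in the prompt are ignored.
  static keyFor({ prompt, size = 'auto', quality = 'auto' }) {
    const normalized = prompt.trim().replace(/\s+/g, ' ');
    return JSON.stringify([normalized, size, quality]);
  }

  get(request) {
    return this.store.get(ImageCache.keyFor(request));
  }

  set(request, b64Json) {
    this.store.set(ImageCache.keyFor(request), b64Json);
  }
}

const cache = new ImageCache();
cache.set({ prompt: 'a red  shoe on concrete', size: '1024x1024' }, 'BASE64_PLACEHOLDER');
// Same prompt with different spacing: served from the cache.
console.log(cache.get({ prompt: ' a red shoe on concrete ', size: '1024x1024' }));
```

Checking the cache before calling the API means only genuinely new prompts or settings trigger a paid generation.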

Rate Limits and Batching Strategies

Every call to gpt-image-1 counts toward your usage quota, and OpenAI imposes rate limits depending on account tier. To optimize performance and cost, it’s helpful to batch generation requests thoughtfully where possible, for instance by combining multiple prompts into structured queues or allowing users to preview low-res draft versions before finalizing a high-res render. Building this logic into your app’s generation flow not only controls expenses but also improves perceived responsiveness, an important UX factor for creative applications.
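A structured queue can be planned ahead of time by splitting pending prompts into batches that fit a per-minute budget. The planBatches helper is a hypothetical sketch, and the limit passed to it is a placeholder, since actual rate limits depend on your account tier.

```javascript
// Hypothetical planner: groups prompts into batches of at most maxPerMinute
// and assigns each batch a start delay in milliseconds.
function planBatches(prompts, maxPerMinute) {
  if (!Number.isInteger(maxPerMinute) || maxPerMinute < 1) {
    throw new Error('maxPerMinute must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < prompts.length; i += maxPerMinute) {
    batches.push({
      startAfterMs: (i / maxPerMinute) * 60000,
      prompts: prompts.slice(i, i + maxPerMinute),
    });
  }
  return batches;
}

// Placeholder limit of 2 requests per minute, for illustration only.
console.log(planBatches(['hero', 'banner', 'story', 'square', 'thumbnail'], 2));
```

Each batch can then be dispatched with a timer, which keeps bursts of user activity from running into rate-limit errors.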

By considering cost optimization as an early design constraint rather than a late-stage patch, developers can build scalable, sustainable creative tools powered by gpt-image-1.

Bonus: Starter Kit Repo

We are currently in the process of integrating the new GPT-4o-powered gpt-image-1 model into CE.SDK. As part of this effort, we are preparing a comprehensive Starter Kit that will showcase a complete CE.SDK integration, real-time prompt input, image generation workflows, and best practices for building an AI-powered creative editor.

Both a public GitHub repository and a live demo will be made available soon. If you want to be notified when the Starter Kit launches, you can subscribe to updates here.

This Starter Kit is designed to help developers move beyond simple image generation into building full creative cycles, where users can generate, edit, refine, and remix visuals seamlessly inside the editor.

FAQs

Choosing to work with gpt-image-1 raises a number of practical and strategic questions. Below, we address the most common topics for teams evaluating the model for integration into creative workflows.

How is `gpt-image-1` different from DALL·E 3?

While DALL·E 3 and gpt-image-1 both translate text prompts into images, the underlying architecture and integration paths are different. gpt-image-1 is built on GPT-4o’s multimodal framework, making it better suited for future conversational and iterative workflows. It also supports a wide range of styles and is positioned for deeper integration into dynamic user experiences rather than one-off generation tasks.

Can you fine-tune or train `gpt-image-1`?

As of April 2025, OpenAI does not allow fine-tuning of gpt-image-1. The model is optimized for broad creative use cases out of the box. Developers seeking more control typically customize the user-facing prompt engineering or combine outputs with structured editing tools like CE.SDK to achieve brand or project-specific consistency.

Is offline support available?

Currently, gpt-image-1 requires access to OpenAI’s cloud APIs. There is no offline inference mode or local deployment option. Teams requiring strict data residency, offline workflows, or private model hosting should consider hybrid architectures where images are generated securely via backend services and then edited locally using embedded tools like CE.SDK.

What about copyright and licensing?

Images generated by gpt-image-1 can be used commercially according to OpenAI’s usage policies, but developers are encouraged to review the latest terms. Outputs are not directly copyrighted by OpenAI or the user, and responsibility for ensuring compliance with branding, likeness, or content standards typically falls on the developer or platform operator. When deploying generation features to end-users, it is good practice to provide clear terms of use and, if needed, additional moderation or review layers.

By addressing these considerations early, teams can integrate gpt-image-1 more effectively and responsibly into creative products and workflows.

Conclusion

gpt-image-1 offers developers a significant opportunity to rethink what image generation can mean inside creative applications. It is not simply a tool for producing pictures on command, but a foundation for building interactive, iterative design workflows where users stay in control of the creative process. When combined with CE.SDK, it becomes even easier to move from static outputs to living, editable canvases that support real-world design needs. As we continue to integrate GPT-4o capabilities, the next wave of creative tooling will be about more than prompting images; it will be about shaping truly collaborative creative environments. Now is the time to start experimenting, iterating, and reimagining the user experience around this new generation of multimodal AI.

How OpenAI's Upcoming GPT-4o Image Generation API Will Change Creative Workflows

Jan — Mon, 14 Apr 2025 10:51:53 GMT

If you’ve been working with image-generation APIs over the past year, you’ve probably gotten used to a certain flow: send a prompt, wait a few seconds, and get a flat image back. It’s a one-shot deal. Useful? Definitely. But not exactly interactive. That’s what will change with OpenAI’s upcoming GPT-4o image-generation capabilities.
IMG.LY, which recently released a suite of AI features for its design editor, is eagerly awaiting the release to expand how users can interact with AI-driven creativity even further.

Update: AI-first Visual Editing

A day after the release of the gpt-image-1 API, we put the UX principles outlined in this post into practice and integrated it into CreativeEditor SDK. Users can now generate images, create variants and use the canvas to compose visual prompts with our design editor. See it in action:

Open AI Editor Demo Page

GPT-4o: Beyond the Prompt-to-Image Pipeline

GPT-4o isn’t just another version of DALL·E. It represents a shift in how developers will integrate AI into creative applications. DALL·E 3 is powerful but somewhat siloed: you send a prompt, you get an image. GPT-4o, by contrast, looks like it will be part of a much more dynamic, conversational model, one that accepts both text and image inputs and could soon generate visual content in context, on the fly, and as part of a back-and-forth user interaction.

If you’ve used ChatGPT recently, you’ve already seen glimpses of this. You can drop an image into the chat, ask GPT to describe or edit it, and get a response that feels fluid and visual. Developers should expect the API version to follow a similar pattern. It likely won’t just be a /generate-image endpoint. Instead, we may be looking at an extension of the chat/completions endpoint that handles multimodal messages. That changes the way you integrate this capability into your application. Rather than simply placing an image generation step in your pipeline, you will have to build your app’s UX around this new user flow. This comes with its own set of unique challenges.

Rethinking the Interface: Prompting as a Conversation

So what does this mean if you’re planning to integrate multimodal image generation into your own product? For starters, you’ll probably need to rethink how users initiate and refine prompts. In the DALL·E flow, you might offer a text box with a few style dropdowns and call it a day. But in a GPT-4o world, your UI needs to support image inputs, persistent context, and dynamic editing. Image generation becomes more like a conversation than a command.

This is where the rubber meets the road. The tools that will benefit most from GPT-4o aren’t static generators but interactive editors. Think collaborative design apps, video editors with generative overlays, or product customizers that let users sketch or upload a photo and then iterate with AI. Put differently, the model output isn’t the endpoint but rather a checkpoint in the creation process.

A Typical Iteration Cycle in a Multimodal Workflow

Here’s a rough sketch of a workflow we might be seeing more of: The user starts with a prompt and an image, maybe a rough sketch or collage created inside an editor, a product photo, or a UI frame. GPT-4o returns a generated image based on that input. The user then edits or annotates the result, maybe adds new prompt text for refinement, and resubmits that combination to further develop the output. This cycle might loop several times: generate, tweak, refine, regenerate.

That’s a fundamentally different interaction model from past AI tooling. It’s less about one-off generation and more about a guided creative journey, where the user is in dialogue with the model. The result: better alignment with the original intent, more control, and more usable creative outputs.

There is an additional, more subjective benefit to this kind of workflow: it gives the user a sense of autonomy again; they are back in the driver’s seat and less at the whim of an inscrutable machine. In many contexts, that makes a difference. Most notably, as we discussed in our white paper on print personalization, the psychological benefit of personalization lies to a large extent in the investment, the sense of ownership that comes about when you create something. “Make it yours” is the common tagline attached to personalization campaigns in e-commerce. That only works if the user exerts more control over the output than iterating over a set of prompts.

The pithiest encapsulation of this paradigm that I have heard is “Humans on top, AI on tap.”

Persistent Elements and Visual Consistency

One particularly interesting frontier here is character and object persistence. If a user defines a character early in the workflow, either via prompt, image, or a combination, they’ll increasingly expect that character to appear consistently across assets. Think of it as visual continuity, whether you’re generating scenes in a story, slides in a deck, or frames in a video.

If the user of a creative marketing cloud creates a campaign avatar or mascot, that character needs to be consistent within and across campaigns.

Being able to reference earlier outputs, prompts, or style cues gives the user control over not just individual assets but the whole arc of the design narrative. GPT-4o’s ability to maintain that continuity is a game-changer for workflows that involve storytelling, brand identity, or serialized design work.

What to Expect from the API

Technically, if GPT-4o follows OpenAI’s recent design philosophy, you can expect a JSON-based API with a messages array, where content can include both text and image_url types. The output will likely be returned either as an image URL hosted by OpenAI or as base64-encoded image data, depending on the format you request.
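To make the expected shape more concrete, here is a minimal sketch of how a client could assemble such a multimodal message. Everything in it is an assumption modeled on the chat-style format described above: the type names, the buildUserMessage and appendTurn helpers, and the example URL are hypothetical, and the final API may look different.

```typescript
// Hypothetical request shape modeled on a chat-style messages array.
// None of these names are confirmed for GPT-4o image generation.
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } };

interface ChatMessage {
  role: 'user' | 'assistant';
  content: ContentPart[];
}

// Combine a text prompt with an optional reference image into one user message.
function buildUserMessage(prompt: string, imageUrl?: string): ChatMessage {
  const content: ContentPart[] = [{ type: 'text', text: prompt }];
  if (imageUrl) {
    content.push({ type: 'image_url', image_url: { url: imageUrl } });
  }
  return { role: 'user', content };
}

// Append a turn without mutating the existing conversation,
// so earlier prompts and images stay available as context.
function appendTurn(conversation: ChatMessage[], next: ChatMessage): ChatMessage[] {
  return [...conversation, next];
}

const conversation = appendTurn(
  [],
  buildUserMessage('Turn this sketch into a poster', 'https://example.com/sketch.png')
);

// Serialized body that would be sent to the (assumed) endpoint.
const requestBody = JSON.stringify({ messages: conversation });
```

Keeping the conversation as an immutable array makes it straightforward to store in whatever state container the front end already uses.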

That structure plays nicely with modern JavaScript front-end frameworks. React, Svelte, and Vue are all well-suited to async generation flows with visual previews. If you’re already using tools like Zustand or Jotai for local state or something like tRPC or GraphQL for structured calls, you’re in a good position to layer GPT-4o in without breaking the flow.

Trade-offs and Technical Considerations

There are trade-offs, of course. GPT-4o will probably cost more per call than a standard DALL·E 2 or 3 generation. Its latency is still an open question, and the multimodal input support will likely require more thoughtful UX decisions. What happens when a user drops an image and wants to undo just part of the generation? Where do you store prompt context for edits? How do you communicate what’s editable and what’s not?

This is where design and engineering need to work together. You’ll want to build an interface that makes AI feel like a creative partner, not just a backend service. That might mean giving users a visual prompt history or allowing partial re-generations of specific canvas elements. You’ll need sensible fallback states. What happens when generation fails or the result isn’t what the user wanted?
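One way to reason about fallback states is to model each generated canvas element as a small state machine. The sketch below is illustrative only: the GenerationState and GenerationEvent types and the reduceGeneration function are hypothetical names, not part of any SDK. It keeps the last good image around so a failed regeneration can fall back to it.

```typescript
// Hypothetical state model for one AI-generated canvas element.
type GenerationState =
  | { status: 'idle' }
  | { status: 'generating'; prompt: string; previousUrl?: string }
  | { status: 'ready'; prompt: string; imageUrl: string }
  | { status: 'failed'; prompt: string; error: string; fallbackUrl?: string };

type GenerationEvent =
  | { type: 'start'; prompt: string }
  | { type: 'success'; imageUrl: string }
  | { type: 'failure'; error: string };

function reduceGeneration(state: GenerationState, event: GenerationEvent): GenerationState {
  switch (event.type) {
    case 'start': {
      // Remember the last good image so a failure can fall back to it.
      const previousUrl =
        state.status === 'ready'
          ? state.imageUrl
          : state.status === 'failed'
            ? state.fallbackUrl
            : undefined;
      return { status: 'generating', prompt: event.prompt, previousUrl };
    }
    case 'success':
      // Ignore results that arrive when nothing is being generated.
      return state.status === 'generating'
        ? { status: 'ready', prompt: state.prompt, imageUrl: event.imageUrl }
        : state;
    case 'failure':
      return state.status === 'generating'
        ? {
            status: 'failed',
            prompt: state.prompt,
            error: event.error,
            fallbackUrl: state.previousUrl,
          }
        : state;
  }
}
```

With this shape, the UI can always answer two questions: what should be shown right now, and which prompt produced it.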

Where IMG.LY’s CE.SDK Fits In

We have already given the questions raised above some serious thought, and most of the complexities introduced by this new workflow are table stakes for the Creative Editor. So, if you’ve already integrated IMG.LY’s CE.SDK, we have taken care of most of these problems, and you can seamlessly integrate with any AI model. We are actively working on an off-the-shelf integration of the GPT-4o image model once its public API launches.

In general, you can treat GPT-4o’s image outputs as just another layer in the editing canvas, positioned, styled, cropped, and ultimately editable in the same environment as everything else. That’s the real power of multimodal workflows: not just generating but integrating. And once GPT-4o’s API goes live, you’ll want your infrastructure ready to slot it in with minimal friction.

The Loop: Prompt, Generate, Refine

The era of single-shot generation is winding down. What’s coming next is a loop: edit, prompt, generate, refine, repeat. And this loop doesn’t just belong in the backend. It needs to live in the UI, in a way that invites user input, creativity, and correction.

We’ll be publishing more on how this integrates into IMG.LY’s upcoming AI workflows soon. Expect tools that don’t just generate visuals but help teams and individuals work through ideas in real time. Because especially as AI gets more potent, it needs humans on top.

3,000+ creative professionals gain early access to new features and updates—don’t miss out, and subscribe to our newsletter.

Top 5 Generative AI APIs for Creative Apps in 2025: A Developer’s Guide (GPT-4o, Gemini, Firefly, and More)

Jan — Mon, 14 Apr 2025 07:57:51 GMT

If you’re working on creative tooling right now, anything from a lightweight design editor to a marketing automation suite, you’re probably already thinking about or actively working on bringing image generation into the mix. The tech is here, expectations are rising, and if your users can’t type a prompt and get a visual back in seconds, your app might feel like it’s lagging behind.

But choosing which AI model to integrate, and how, isn’t all that straightforward. There’s a growing ecosystem of APIs out there, and they don’t all behave the same way. Some are designed for open-ended creativity, others for structured workflows. Some offer pixel-perfect fidelity with fine control, others lean toward rapid ideation. And, importantly in our context, not all of them are equally accessible to developers.

This is a guide to help you make sense of it all: which models are available, how they differ, and what you should consider when embedding them into your product. This isn’t supposed to be a hype piece or a leaderboard, just a clear-eyed look at what’s out there and what’s coming.

OpenAI GPT-4o

GPT-4o is OpenAI’s next-gen multimodal model, currently only available inside ChatGPT. It can take both text and images as input and is capable of generating image outputs in context.

The potential upside is significant. With GPT-4o, you may soon be able to create deeply interactive creative tools where users chat, sketch, and prompt all within a single UI. It’s likely to support richer input types and more natural iteration flows.

The main downside is availability. There’s no API yet, so you can’t build on it directly. It also remains to be seen how OpenAI will expose generation tools—whether through a dedicated endpoint or via the chat interface.

GPT-4o is right for you if you’re planning ahead and want to design for a future where multimodal interaction is the norm. It’s not something you can use today, but it should inform how you architect your UI and prompt handling.

OpenAI DALL·E 3

DALL·E 3 is OpenAI’s current image generation API, available via both the platform and ChatGPT. It translates text prompts into images and is known for interpreting prompts accurately and producing clean, useful visuals.

Its strengths are clarity, commercial readiness, and reliability. It’s easy to use and integrates well into frontend flows that involve text-to-image generation.

However, it lacks features like inpainting, style tuning, or detailed layout control. You also don’t get deep iteration features—each image is a new generation.

DALL·E 3 is a good fit if you want high-quality results from text prompts with minimal complexity. It’s especially useful for marketing visuals, content automation, and simple design tools.

Google Gemini (Imagen)

Gemini, powered by Google’s Imagen models, is available via fal.ai, MakerSuite, and Vertex AI. It supports not only text prompts but also sketches and inpainting, making it one of the more flexible APIs for creative work.

Its big advantage is control. You can use sketches to guide composition and make visual edits to generated outputs. That makes it ideal for iterative design processes.

The downside is that it can be tricky to navigate Google’s ecosystem. Access and feature sets can change quickly, and the integration overhead is higher than OpenAI.

Gemini is right for you if your product needs image refinement, visual grounding, or sketch-to-image workflows. It fits e-commerce editors, mockup tools, and design collaboration features.

Adobe Firefly

Firefly is Adobe’s generative image model, integrated tightly into Creative Cloud. It stands out for its licensing model: the model is trained on Adobe Stock images, meaning its outputs are cleared for commercial use.

The biggest strength here is trust and integration. Designers already using Photoshop or Illustrator can use Firefly to generate content directly in their layers and work non-destructively.

The drawback is API access. There is no public endpoint for Firefly yet, and its features are embedded in Adobe’s own ecosystem.

Firefly is a strong option if you’re building for agencies, brand teams, or other users with high expectations around copyright and integration with existing Adobe workflows.

Stability AI (SDXL)

Stability AI offers an open-source model suite, with SDXL as the flagship for high-resolution image generation. It supports both text and image inputs and can be run locally or hosted via services like Replicate.

Its biggest advantage is flexibility. You can fine-tune models, build custom workflows, or even run inference offline. It’s ideal for teams that want full control.

The challenge is quality consistency. Compared to closed models like DALL·E, SDXL may require more tuning, and prompt engineering matters more. Hosting and scaling also require more effort.

SDXL is right for you if you need an open, customizable system that fits into a broader pipeline. It’s a solid choice for research tools, OSS projects, and privacy-conscious applications.

Midjourney

Midjourney is a proprietary model with a focus on aesthetic, stylized image generation. It runs exclusively via Discord and is popular for its distinctive look and community-driven prompts.

Its upside is the quality of its visuals, especially for stylized scenes or concept art. Designers often use it as an ideation tool.

The limitation is integration. There’s no API, no SDK, and limited ways to embed it in your own product beyond scraping or bots.

Midjourney is best used as an inspiration engine. If your workflow includes moodboarding or creative brainstorming, it can supplement—but not power—your product.

Hugging Face

Hugging Face is a hub for open models, offering hosted APIs for SDXL variants, Playground v2, and other creative generation tools.

The main benefit is diversity. You can try multiple models, experiment with variations, and deploy quickly using their hosted inference endpoints.

That said, it’s not always ready for production. Some models lack documentation or support, and you may need to piece together features.

Hugging Face is a great choice for experimental projects, prototyping, or if you want to stay vendor-neutral and build your own stack.

Runway Gen-2 and Leonardo.Ai

Runway and Leonardo are rising players at the edge of AI and media. Runway’s Gen-2 supports text-to-video and animated image generation, while Leonardo focuses on style-consistent 2D asset generation.

These platforms bring specialization. Runway is tailored to video and cinematic scenes, while Leonardo offers structured design features for asset creators.

They’re less open from a dev perspective. APIs are limited, and integration support is still maturing.

Use these tools if your use case leans into video, motion, or asset generation for games and content libraries. They’re best when you’re not looking to build your own editor, but to enhance creative capacity.

Quick Comparison

Model/API	Input	Output	Control	API Access	Best For
GPT-4o (OpenAI)	text, image (chat)	image (likely)	medium-high	not yet	assistants, multimodal UIs
DALL·E 3	text	image	medium	yes	content tools, illustrations
Gemini (Google)	text, sketch	image	high	yes	e-commerce, product mockups
Firefly (Adobe)	text	image, layers	very high	no	professional design tools
SDXL	text, image	image	high	yes	custom tools, OSS projects
Midjourney	text	image	very high	no	stylized inspiration
Hugging Face	text, image	image	medium-high	yes	experimentation, open models
Runway Gen-2	text	video/image	medium	yes	motion design, AI video
Leonardo.Ai	text	image	high	limited	game assets, style templates

Conclusion

If you’re building for creative users, especially those used to real-time feedback and control, then how you wrap these APIs into your workflow matters more than which model you use. It’s not just about generating images. It’s about how you let users prompt, refine, iterate, and remix inside your canvas.

That’s the opportunity here. Not just plugging in a model, but designing a loop where generation feels native to creation. The APIs are improving fast. The real challenge, and the real product value, is in how you build around them.

How to Build a Short Video Generator Using CE.SDK

Eray — Tue, 04 Mar 2025 12:57:47 GMT

In the following, I’m presenting a simple cookbook for building an AI-based video generator app, as described in my previous blog post. We’re using a combination of different APIs to generate text, audio, and images and will compose & render the final video using the headless CreativeEditor SDK. We also call it the Creative Engine.

This cookbook showcases the powerful capabilities of our client-side Creative Engine. The engine enables real-time video generation directly in the browser, eliminating the need for server-side processing. What sets this approach apart is its ability to produce editable source files that can then be opened with CreativeEditor SDK.

This approach gives users complete control over every aspect of their video, from text and images to animations and overall composition. This means your users can refine and perfect their content even after the initial generation.

Get the complete code on GitHub.

Scope

This tutorial focuses on building an app with a simple UX:

Input your keywords/topics
Choose between landscape or portrait format
Generate and preview your video
Edit the video in the CE.SDK video editor

The app flow we will create:

The post-editing we will get with CE.SDK:

Technical Overview

The app follows three major steps to generate the video.

A script is generated based on user input; the output is a structured XML file.
The XML script is parsed to extract text and image information. The extracted data is then used to generate audio and image files through third-party APIs.
All assets are loaded into the Creative Engine. This is where the composition, including animation and effects, takes place. The Creative Engine then exports a video and a scene file, which can be edited with the Creative Editor.

Setup

We’ll use a boilerplate with Next.js, React, TypeScript & Tailwind. Make sure you retrieve all necessary keys:

Anthropic (LLM)
ElevenLabs (text to speech)
fal.ai (text to image)
IMG.LY CE.SDK – Retrieve a free trial key

# Required environment variables
NEXT_PUBLIC_ANTHROPIC_API_KEY=your_claude_api_key
NEXT_PUBLIC_FAL_API_KEY=your_fal_ai_key
NEXT_PUBLIC_ELEVEN_LABS_KEY=your_eleven_labs_key
NEXT_PUBLIC_IMG_LY_KEY=your_img_ly_key

Implementation

1. Generate The Script

In this step, we’ll focus on generating the initial prompt and then passing it to the Anthropic API.

As with many things involving LLMs, there are many different strategies for structuring the initial prompt. From experience, the best results come from providing examples of the desired output. We’ve decided to use an XML document; it can be easily parsed later on and is less error-prone than JSON.

We now define the structure of how information should be saved in the XML.

<video>
  <group part="intro">
    <element>
      <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.2">
        Did you know these fascinating facts about pyramids?
      </text>
      <image>Ancient Egyptian pyramid at sunset</image>
    </element>
  </group>
  <group part="content">
    <element>
      <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.2">
        The Great Pyramid was the tallest structure for over 3,800 years!
      </text>
      <image>Great Pyramid comparison to modern buildings</image>
    </element>
  </group>
  <group part="outro">
    <element>
      <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.4">
        The pyramids continue to reveal their secrets to this day...
      </text>
      <image>A giant 3D question mark hovering over the pyramids</image>
    </element>
    <element>
      <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.4">
        Stay curious - there's always more to discover!
      </text>
      <image>Pyramids under starry night sky</image>
    </element>
  </group>
</video>

In this tutorial, we’ll focus only on the trivia format, as shown in this example. For later iterations, however, I’m planning to implement different content formats (e.g., trivia, quiz, recipe, etc.). Each of these formats will have its own example XML. Therefore, I’m nesting the XML in a simple format object to scale this up easily later.

interface Format {
  name: string;
  example: string;
}

const formats: Record<string, Format> = {
  trivia: {
    name: 'Trivia',
    example: `<video><group>...</group></video>` // Add example from above
  }
};

Using this format object with the example, we can now generate the prompt.

What do we need for this prompt?

Description of the task
Description of the desired output, incl. an example for the specified format
Topic as provided by the user

The topic provided by the user is passed to the function as a string.

export const createVideoScriptPrompt = (
  topic: string,
  formatName: string = 'trivia'
) => {
  const format = formats[formatName];
  if (!format) throw new Error(`Format ${formatName} not found`);

  return `
Format: ${format.name}
Topic: ${topic}

Please write a detailed script for this short video, considering the specified format and topic.
Include an introduction, main content sections, and an outro. Each section should have an image.
Structure the script as an XML Document with clear sections, descriptions for the images.
The image description should be written as a prompt. This prompt will be used to generate an image.
Put the description between the image tags. The video shouldn't be longer than 30 seconds.

Example format:
${format.example}`;
};
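Passing the prompt to the Anthropic API can be sketched roughly as follows. This is a minimal illustration, not the code from the repository, which may well use the official SDK instead. The helper names buildScriptRequest and extractScript are hypothetical, and the model ID is left as a parameter so you can pass whichever one you use.

```typescript
// Request description that can be handed to fetch(url, init).
interface AnthropicRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

// Build a request for Anthropic's Messages API from the generated prompt.
function buildScriptRequest(
  prompt: string,
  apiKey: string,
  model: string,
  maxTokens: number = 2048
): AnthropicRequest {
  return {
    url: 'https://api.anthropic.com/v1/messages',
    init: {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        'x-api-key': apiKey,
        'anthropic-version': '2023-06-01',
      },
      body: JSON.stringify({
        model,
        max_tokens: maxTokens,
        messages: [{ role: 'user', content: prompt }],
      }),
    },
  };
}

// Pull the first text block out of a Messages API response body.
function extractScript(response: { content: Array<{ type: string; text?: string }> }): string {
  const textBlock = response.content.find((part) => part.type === 'text');
  if (!textBlock || !textBlock.text) {
    throw new Error('No text content in response');
  }
  return textBlock.text;
}
```

In use, this would look like `const { url, init } = buildScriptRequest(prompt, key, model)` followed by `fetch(url, init)`, with the parsed JSON body passed to `extractScript`.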

2. Generate All Assets

In the second step, we’ll parse through the LLM response, which should be the XML. We’ll create a simple parsing function to extract all text information that should be sent to text-to-speech and text-to-image AIs.

Please note that all these steps can be easily streamlined by using AI-assisted coding. Just provide the example XML as input and your desired output.
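As an illustration of that parsing step, here is a minimal sketch that extracts the narration text, voice settings, and image prompt for every element. It uses regular expressions rather than a full XML parser, which is an assumption made to keep the example short; it only works for the flat structure shown in the example XML. The names ScriptElement, readAttribute, and parseVideoScript are hypothetical.

```typescript
interface ScriptElement {
  part: string; // "intro", "content", or "outro"
  text: string; // narration sent to text-to-speech
  voiceId: string;
  style: number;
  imagePrompt: string; // prompt sent to text-to-image
}

// Read a double-quoted attribute value out of a tag's attribute string.
function readAttribute(attributes: string, attributeName: string): string {
  const match = new RegExp(`${attributeName}="([^"]*)"`).exec(attributes);
  return match?.[1] ?? '';
}

// Extract all <element> entries from the script, group by group.
// Text around the XML (for example an LLM preamble) is ignored.
function parseVideoScript(xml: string): ScriptElement[] {
  const elements: ScriptElement[] = [];
  const groupPattern = /<group part="([^"]*)">([\s\S]*?)<\/group>/g;
  let group: RegExpExecArray | null;
  while ((group = groupPattern.exec(xml)) !== null) {
    const part = group[1] ?? '';
    const groupBody = group[2] ?? '';
    const elementPattern = /<element>([\s\S]*?)<\/element>/g;
    let element: RegExpExecArray | null;
    while ((element = elementPattern.exec(groupBody)) !== null) {
      const body = element[1] ?? '';
      const text = /<text\b([^>]*)>([\s\S]*?)<\/text>/.exec(body);
      const image = /<image>([\s\S]*?)<\/image>/.exec(body);
      if (!text) continue; // skip elements without narration
      const attributes = text[1] ?? '';
      elements.push({
        part,
        text: (text[2] ?? '').trim(),
        voiceId: readAttribute(attributes, 'voiceId'),
        style: Number(readAttribute(attributes, 'style') || '0'),
        imagePrompt: (image?.[1] ?? '').trim(),
      });
    }
  }
  return elements;
}
```

In the browser, DOMParser would be the sturdier choice; the regex version is used here only so the sketch runs anywhere without dependencies.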

API Calls
When we find text & image tags in the XML, we’ll call API functions for text-to-speech and text-to-image. For this example, I’m using the ElevenLabs & fal APIs. You will find all API calls in api.ts.

Since the LLM generated a script that includes image prompts, make sure to pass them to the API.

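// Assumes the @fal-ai/serverless-client package, whose subscribe() resolves
// to the model output directly.
import * as fal from '@fal-ai/serverless-client';

// Generate one image for a prompt; resolves to its URL, or null on failure.
export async function generateImage(prompt: string): Promise<string | null> {
  try {
    console.log('Generating image for prompt:', prompt);
    const result = await fal.subscribe('fal-ai/flux/dev', {
      input: {
        prompt,
        // Flux on fal expects the size preset under `image_size`
        image_size: 'portrait_16_9',
      },
    });
    const typedResult = result as { images: { url: string }[] };
    console.log('Image generation successful. URL:', typedResult.images[0].url);
    return typedResult.images[0].url;
  } catch (error) {
    console.error('Error generating image:', error);
    return null;
  }
}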

Timestamps
Last but not least, we need to come up with timestamps. What’s the duration of each segment? This is critical information for composing the video. Luckily, this is quite easy: each scene is as long as the generated audio for its segment. The duration of the audio segments can be calculated: most TTS services, like ElevenLabs, provide timestamps alongside the audio file. These are typically character-based timestamps, so we first have to calculate the timestamps for each word and then the duration of the entire text section.
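The conversion from character timestamps to word timestamps can be sketched like this. The CharacterAlignment shape assumes the alignment object that ElevenLabs returns from its with-timestamps endpoint; treat the field names as an assumption and check them against the response you actually get. The function names toWordTimestamps and totalDuration are hypothetical.

```typescript
// Assumed shape of the character-level alignment returned by the TTS service.
interface CharacterAlignment {
  characters: string[];
  character_start_times_seconds: number[];
  character_end_times_seconds: number[];
}

interface WordTimestamp {
  word: string;
  start: number; // seconds from the start of the audio
  duration: number; // seconds
}

// Group consecutive non-whitespace characters into words and carry over
// the start of the first character and the end of the last one.
function toWordTimestamps(alignment: CharacterAlignment): WordTimestamp[] {
  const words: WordTimestamp[] = [];
  let current = '';
  let wordStart = 0;
  let wordEnd = 0;

  const flush = (): void => {
    if (current.length > 0) {
      words.push({ word: current, start: wordStart, duration: wordEnd - wordStart });
      current = '';
    }
  };

  alignment.characters.forEach((char, index) => {
    if (/\s/.test(char)) {
      flush(); // whitespace closes the current word
      return;
    }
    if (current.length === 0) {
      wordStart = alignment.character_start_times_seconds[index] ?? 0;
    }
    current += char;
    wordEnd = alignment.character_end_times_seconds[index] ?? wordStart;
  });
  flush(); // close the last word

  return words;
}

// The segment lasts until the final character has been spoken.
function totalDuration(alignment: CharacterAlignment): number {
  const ends = alignment.character_end_times_seconds;
  return ends.length > 0 ? ends[ends.length - 1] ?? 0 : 0;
}
```

The word timestamps are what you would use for animated captions, while the total duration determines how long the scene stays on screen.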

Ready For The Next Steps
All generated asset URLs are saved in a VideoBlock object for convenience. The duration of the VideoBlock is the duration of the audio, as calculated above.

interface VideoBlock {
  text: string;
  imageUrl: string | null;
  audioUrl: string | null;
  startTime: number;
  duration: number;
  wordTimestamps: Array<{ word: string; start: number; duration: number }>;
}
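Filling in startTime is then a running sum over the durations. The sketch below assumes that segments play back to back without gaps or overlaps; the types and the scheduleSegments helper are hypothetical and only carry the fields needed for the calculation.

```typescript
interface TimedSegment {
  text: string;
  duration: number; // seconds, taken from the generated audio
}

interface ScheduledSegment extends TimedSegment {
  startTime: number; // seconds from the start of the video
}

// Assign each segment a start time equal to the sum of all earlier durations.
function scheduleSegments(segments: TimedSegment[]): ScheduledSegment[] {
  let cursor = 0;
  return segments.map((segment) => {
    const scheduled: ScheduledSegment = { ...segment, startTime: cursor };
    cursor += segment.duration;
    return scheduled;
  });
}
```

The final value of the running sum is also the total video length, which is useful for checking the result against the 30-second limit requested in the prompt.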

3. Generate The Video

We have everything together now: The completed XML with timestamps, duration, and all assets. It’s now time to generate the video using the creative engine.

Let’s first add an empty container in our HTML that will be referenced for initiating the creative engine.

{/* Add container for Creative Engine */}
<div id="cesdk_container" className="invisible mt-8 rounded-lg bg-gray-100" />

We can now initialize the engine. Use this code snippet from our documentation.

We’ll then set up a function that creates a simple composition using the provided VideoBlocks. The engine requires you to first create a scene, append a page to the scene, and then create tracks within the page. The tracks are basically the layers in the timeline. I recommend setting one track as a background track using the following snippet:

// Set video track as a background track by connecting the page duration to the video track
engine.block.setAlwaysOnBottom(videotrack, true);
engine.block.setPageDurationSource(page, videotrack);
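Put together, the scene, page, and track setup described above could look roughly like the following sketch. It is written against a small EngineLike interface rather than the real engine type so that it stays self-contained. The method names follow my reading of the CE.SDK block API, and the property key for the image URI is likewise an assumption, so verify both against the current documentation. TimelineSegment and buildTimeline are hypothetical names.

```typescript
// Minimal subset of the engine this sketch relies on (assumed signatures).
interface EngineLike {
  scene: { createVideo(): number };
  block: {
    create(type: string): number;
    appendChild(parent: number, child: number): void;
    createShape(type: string): number;
    setShape(block: number, shape: number): void;
    createFill(type: string): number;
    setFill(block: number, fill: number): void;
    setString(block: number, property: string, value: string): void;
    setDuration(block: number, duration: number): void;
    setAlwaysOnBottom(block: number, enabled: boolean): void;
    setPageDurationSource(page: number, block: number): void;
  };
}

interface TimelineSegment {
  imageUrl: string | null;
  duration: number; // seconds
}

function buildTimeline(
  engine: EngineLike,
  segments: TimelineSegment[]
): { scene: number; page: number; track: number } {
  // Scene -> page -> track, in that order.
  const scene = engine.scene.createVideo();
  const page = engine.block.create('page');
  engine.block.appendChild(scene, page);

  const track = engine.block.create('track');
  engine.block.appendChild(page, track);

  for (const segment of segments) {
    const graphic = engine.block.create('graphic');
    engine.block.setShape(graphic, engine.block.createShape('rect'));
    if (segment.imageUrl) {
      const fill = engine.block.createFill('image');
      engine.block.setString(fill, 'fill/image/imageFileURI', segment.imageUrl);
      engine.block.setFill(graphic, fill);
    }
    engine.block.setDuration(graphic, segment.duration);
    // Assumption: the track lays out its children back to back,
    // so no explicit time offset is set here.
    engine.block.appendChild(track, graphic);
  }

  // Make this track the background track that drives the page duration.
  engine.block.setAlwaysOnBottom(track, true);
  engine.block.setPageDurationSource(page, track);
  return { scene, page, track };
}
```

Audio and caption tracks would be added to the same page in the same way, each as its own track.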

The Creative Engine provides powerful API calls to style & manipulate blocks in many ways. Here is an example of how we can animate the images with a slow zoom effect:

const imageZoomAnimation = engine.block.createAnimation('crop_zoom');
engine.block.setInAnimation(image, imageZoomAnimation);
engine.block.setDuration(imageZoomAnimation, block.duration);
engine.block.setBool(imageZoomAnimation, 'animation/crop_zoom/fade', false);

Export The Video & Scene
Exporting the video is easy. Just pass the page to the export function. In our example, we’re also saving the scene file so we can edit the video later.

// Export video
const progressCallback = (
  renderedFrames: number,
  encodedFrames: number,
  totalFrames: number
) => {
  console.log(`Progress: ${Math.round((encodedFrames / totalFrames) * 100)}%`);
};

const blob = await engine.block.exportVideo(
  page,
  'video/mp4',
  progressCallback,
  {}
);

// Save scene to string
const sceneData = await engine.scene.saveToString();

// Create scene blob
const sceneBlob = new Blob([sceneData], {
  type: 'text/plain',
});

4. Add A Video Editor

The last step is to add the video editor for post-editing and pass it the scene file. With CE.SDK, this effort is reduced to adding a few lines of code. In the init function, we’re configuring the editor and adding callbacks for the export:

const initEditor = async () => {
  const config = {
    license: 'A-O53TWXK5bfyconUx7e53S5YU7DzjuGpMAH5vvKjLd0zBa6IhsoF7zChy1uCVbj',
    userId: 'guides-user',
    theme: 'dark',
    baseURL: 'https://cdn.img.ly/packages/imgly/cesdk-js/1.44.0/assets',
    role: 'Creator',
    ui: {
      elements: {
        view: 'default',
        panels: {},
        navigation: {
          position: 'top',
          action: {
            save: true,
            load: true,
            close: true,
            download: true,
            export: true
          }
        },
        dock: {
          iconSize: 'normal', // 'large' or 'normal'
          hideLabels: true // false or true
        }
      }
    },
    callbacks: {
      onUpload: 'local',
      // Download the current scene as a .scene file
      onSave: (scene: string) => {
        const element = document.createElement('a');
        const base64Data = btoa(unescape(encodeURIComponent(scene)));
        element.setAttribute(
          'href',
          `data:application/octet-stream;base64,${base64Data}`
        );
        element.setAttribute(
          'download',
          `video-${new Date().toISOString()}.scene`
        );
        element.style.display = 'none';
        document.body.appendChild(element);
        element.click();
        document.body.removeChild(element);
      },
      // Forward to the onClose handler of the surrounding component
      onClose: () => {
        onClose();
      },
      onLoad: 'upload',
      onDownload: 'download',
      onExport: 'download'
    }
  };
  // ... create the editor with this config and load the saved scene
};

Conclusion

By following this cookbook, you can streamline the process of AI-generated video creation, making it fast and efficient. This method is especially useful for content creators, educators, and marketers looking to automate video production while maintaining creative control.
Next, try experimenting with video styles, refining AI scripts, or exploring advanced editing.
Feel free to check out the GitHub repo and share your creations with us on X. Happy creating!

How I Built a Short Video Generator with AI & CE.SDK in One Day

Eray — Thu, 09 Jan 2025 11:27:13 GMT

Here’s the crux of product development in the age of LLMs: how much can AI truly accelerate the development process?

We have seen videos of solo developers building small apps entirely with AI with just a few prompts. But how does it scale to more complex development projects? As LLMs rapidly evolve, their scope and impact will only increase.

That’s why I regularly challenge myself to build a small project with the help of AI. I’m a prime candidate to test the AI productivity boost: a jack-of-all-trades (and a master of none) with a background in both design and engineering, yet no hands-on experience in the past five years. My latest challenge? Build a web-based short video generator within one day.

In this post, I’ll share the most intriguing takeaways from tackling this project.

Why a Short Video Generator?

Why focus on this idea? It’s simple: to ride the wave of a new trend. A format called “faceless” short videos is gaining traction among creators on platforms like YouTube and TikTok.

https://www.youtube.com/embed/DfQ3fhqfKVc?feature=oembed

What’s fascinating about these videos is their automation: an LLM generates a script, which is then transformed into speech, images, and text assets using various AI services. These assets are automatically assembled into a cohesive video.

The general concept is compelling: it’s still generative content, but mixed with classic video composition techniques. This approach offers greater accuracy, consistency, and control than pure generative AI.

The potential to automate video production at this scale is exciting. Add its relatively low complexity and high production value, and it became the perfect topic for my challenge.

Enter CE.SDK

Another reason I chose this challenge was its compatibility with CE.SDK, our design and video editor library. CE.SDK offers a robust editing toolkit that integrates into any product with just a few lines of code. Its features, like headless mode, are ideal for automating workflows like video generation.

Most faceless video services use React-based video generation and achieve fair results. However, using CE.SDK instead of a React-based library could potentially boost the overall experience with three critical improvements:

Editable Outputs: This is huge. Full automation often needs human adjustments for fine-tuning. CE.SDK enables automated video generation while allowing manual refinement of the results.
Enhanced Visual Quality: CE.SDK has its own rendering pipeline, allowing for more nuanced visual effects and animations. When you’re competing against others in this space, it can make a huge difference if you’re able to produce higher fidelity in the visual output.
Visual Design Workflow: Create design components or even entire templates visually, and then use them via code. This authoring workflow can be extremely helpful in creating rich, interesting designs for the generated videos.

The Ground Rules

To keep the challenge focused, I set strict rules:

Time Limit: Spend no more than 12 hours on the challenge.
No Manual Coding: Avoid writing any code yourself—everything should be built through conversations with AI.
Trust the AI: Do not read or analyze code generated by the AI. Rely entirely on its decisions.
Skip External Research: Do not read or explore the APIs you intend to use. Instead, provide links to the AI and let it determine how to use them.
Compare AI Performance: Alternate Claude Sonnet 3.5 and ChatGPT o1 for code generation to evaluate which performs better.

The Tools & Workflow

Code Editor: Cursor
Built on VSCode’s foundation, Cursor stood out as the only editor offering both an integrated chat interface and the ability to switch between different LLMs. However, with GitHub’s recent significant updates to Copilot, I’ll switch to VSCode with Copilot for future challenges.

UI Prototyping: Claude Artifacts
Rather than building the entire project in my code editor, I chose to prototype the UI directly through Claude’s web interface. The benefits were immense:

Instant results: To create an artifact, Claude streamlines development by automatically writing and compiling code while leveraging essential UI libraries and components. This automation eliminates setup time and technical overhead, allowing me to focus purely on design iterations.
Instant variations: Claude enables rapid prototyping through parallel conversations. When a design direction didn’t quite work, I could simply start a fresh conversation with modified requirements and evaluate a new prototype. This approach helped me develop three viable concepts quickly, a pace that would have been impossible in a traditional code editor.
Quality of execution: Claude transforms rough concepts into polished, intuitive interfaces. Its suggestions often surpassed my initial ideas, offering sophisticated solutions I hadn’t considered.
Keep it clean: By prototyping outside the code editor, I kept the main project’s codebase clean and focused. This separation prevented the accumulation of experimental code and maintained the clarity of our primary development environment.

Quickly prototype your interface with Claude Artifacts.

APIs
Key APIs used in the project included:

Script Generation: Claude Sonnet 3.5 vs various ChatGPT models.
Image/Video Assets: Fal.ai Flux models.
Speech Synthesis: ElevenLabs.

Building the App: Divide and Conquer

After prototyping the UI, I started chatting with the LLM inside the code editor so that it could code the app. To work with the AI efficiently, I followed a divide-and-conquer approach. Rather than simply asking it to “build me a video app,” I broke down the problem into manageable steps:

Generate a video script
Create an AI prompt that includes user input and examples of the desired output format. Pass this prompt to the LLM API.
Parse the script to generate assets (speech, images, text)
Parse the LLM’s response to extract image prompts and speech paragraphs. Send these to their respective APIs.
Compose the final video
Load all the generated assets into a predefined template to generate the finished video through the CE.SDK library.
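
The second step above can be sketched roughly as follows. This is a minimal illustration, not the code the AI produced during the challenge: the JSON script format (a `scenes` list with `narration` and `image_prompt` fields) and the names `Scene`, `parse_script`, and `plan_asset_requests` are assumptions, since the post does not show the actual prompt or response format.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Scene:
    narration: str      # paragraph that would go to the speech synthesis API
    image_prompt: str   # prompt that would go to the image generation API


def parse_script(llm_response: str) -> List[Scene]:
    """Extract scenes from an LLM response that contains one JSON object.

    Assumes the prompt asked for:
    {"scenes": [{"narration": "...", "image_prompt": "..."}]}
    """
    # LLMs often wrap JSON in prose, so cut from the first "{" to the last "}".
    start = llm_response.find("{")
    end = llm_response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM response")
    data = json.loads(llm_response[start:end + 1])
    return [Scene(s["narration"], s["image_prompt"]) for s in data["scenes"]]


def plan_asset_requests(scenes: List[Scene]) -> Dict[str, List[str]]:
    """Group the extracted texts by the API that should receive them."""
    return {
        "speech": [s.narration for s in scenes],
        "images": [s.image_prompt for s in scenes],
    }
```

Keeping parsing separate from the API calls makes it easier to refine the script prompt without touching the asset generation code.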

After completing these steps, I was finally able to generate my first fully automated videos! With a few more tweaks and additions, I had an MVP ready within twelve hours.

The final result: A Short Video Generator

There are still some missing features, partly because I spent a significant amount of time refining the prompt to generate the video script. I also had to bend the rules occasionally—sometimes the LLM would hit a wall, and I had to read or write small snippets of code.

Key Takeaways

Engineering Knowledge Is Essential
You should have some engineering background to achieve the AI productivity boost in development.

AI doesn’t solve everything for you. You are still the architect. You provide a lot of input and guidance. AI often needs to be pointed to the right strategy. Foundational knowledge of computer science is hugely advantageous for working with AI effectively.
As mentioned, I had to read and write a few lines of code myself. Without coding experience, I probably would not have been able to progress at the points where the LLM got stuck.
The getting-started experience is nowhere close to novice-friendly. How do you get started with a new project in a code editor that actually requires you to do the setup manually? My workaround was to create an empty project and then ask the LLM to walk me through using a boilerplate for React. Again, this is engineering knowledge; any novice would have hit a wall at this point.

Claude Outperformed ChatGPT
Claude was the clear winner in the side-by-side comparison, for three reasons:

Claude Artifacts was a game changer for UI prototyping.
It was generally better at writing and understanding code. This is difficult to quantify, but in some cases Claude fixed the mess ChatGPT left in the code.
Claude can process URLs, which makes working with APIs much smoother.

Who would have thought new LLMs would catch up to OpenAI so quickly after they released the first version of ChatGPT?

Complexity Slows AI
The more code in my project, the slower the overall progress. LLMs struggled with the growing complexity. Their context windows filled more quickly, and their responses became increasingly unreliable. At some point, it becomes extremely difficult to make architectural changes, especially if this affects multiple parts of the app. When trying to fix errors, you’ll often find yourself in a whack-a-mole game. While the AI would resolve one issue, it would inadvertently introduce new problems elsewhere, creating an endless loop of fixes and regressions.

Ultimately, the time invested in this challenge was well worth it. While LLMs can’t build products end to end on their own, they can significantly streamline product development when paired with the right human collaboration. The real question is whether development teams are ready to adapt their habits and explore new workflows to boost productivity.

Next Steps

This challenge has inspired me to refine and expand on this project. Future iterations will focus on harnessing CE.SDK’s unique features to push the boundaries of automated video generation.

Stay tuned for part two of this series—there’s much more to explore!

UPDATE: Read part two, a cookbook on how to build your own short video creator!

IMG.LY Research: AI-based Generative Design Editing

Mirko — Tue, 16 Jul 2024 10:44:05 GMT

Generative AI is transforming the tech landscape, finding applications in virtually every field. At IMG.LY, we’re exploring how these advancements can revolutionize creative workflows. This article presents a research project, where we integrate Large Language Models (LLMs) with our flagship product, CreativeEditor SDK (CE.SDK), to enable natural language-driven design edits.

CE.SDK enables advanced creative workflows for countless use cases in industries ranging from print to marketing tech. Most use cases can be realized with the out-of-the-box feature set, but it also exposes a best-in-class API, called the Engine API, for building complex custom workflows with designs and videos.

In this article, we will showcase how to combine our CE.SDK Engine API with LLMs to edit designs with natural language.

Introduction to LLMs

Generative AI is often associated with chatbots, but its capabilities stretch further. It is a versatile text processor that can transform any textual input into various structured outputs. This adaptability is due to its training on diverse textual patterns, allowing it to support a wide range of text-to-text applications beyond just generating conversational prose.

Carefully crafting an input text (prompt) in a way that instructs the LLM to output a specific structured output format allows us to use LLMs to solve almost any arbitrary text-based task.

Crafting prompts is an art in itself. When ensuring that we receive the required output we need to adhere to the following steps:

Consider what type of data our model was trained on to ensure the correct formatting of our input text. The most basic example is using English as our “main prompt language” since most LLMs are mainly trained on English text samples.
All necessary problem-specific information to solve the task needs to be included in the prompt. While LLMs often possess an inherent understanding of the world inferred from the vast amount of text they are trained on, they may not know much about our specific problem. Furthermore, LLMs have a notorious tendency to hallucinate, that is, to fill in missing context with incoherent or incorrect information. To ensure the best performance, the input to the LLM must provide as much context as possible.
Finally, we need to instruct the LLM well enough to output text-based data in a format we can then parse and process.

Human vs. AI Workflows for Executing Design Tasks

We started this project with the vision to use generative AI to magically handle requests like these (in increasing order of complexity):

Make the logo bigger
Translate this design into German
Adapt this template to our brand colors and brand assets
Transform this Instagram story portrait design into a landscape YouTube thumbnail

When trying to delegate a task to an AI, it’s best to start by thinking about how these tasks are currently solved by humans.

Let’s walk through how a human would complete a task such as
Make the logo bigger:

Humans would visually scan the design and automatically segment it by its elements such as objects, backgrounds, or text.
Humans would then read and comprehend the task “Make the logo bigger”
Finally, humans would use their existing knowledge of how to move and interact with design software to fulfill the task by manipulating the individual elements in the design.

Based on these considerations we can extract the implicit knowledge necessary to fulfill a task and make it explicit for the benefit of our LLM.

Design Representation:

To enable the LLM to understand and manipulate a design, it is essential to provide a representation of the design. This can be either textual or a mix of textual and visual (if the LLM has vision capabilities) data. While supplying the current design as a raster image to the LLM is trivial, serializing a CE.SDK design into a textual format requires a custom serialization process. The textual representation is important since it allows the LLM to identify, address, and comprehend the different components of the design effectively.

Refer to Appendix: Use-Case dependent serialization of CE.SDK Designs for a more in-depth explanation of how to accomplish this.

Editing Protocol:

LLMs do not interact with design software using traditional human interfaces like a mouse, keyboard, or visual feedback. Therefore, we need a specific protocol for the LLM to propose changes to the design. We have developed a method where we pass a textual representation of the design to the LLM as part of the prompt so that the LLM can indicate changes to the design by returning a modification of this representation.

Practically, this means that if we pass in an element such as <Image id="1337" x="100" y="100" ... />, the LLM can change those x and y attributes by simply returning <Image id="1337" x="0" y="0" ... /> inside its output text. Since we can identify the design element that was changed using the ID attribute, we can then calculate the programmatic changes that need to be applied to the design, in this case engine.block.setPositionX(1337, 0) and engine.block.setPositionY(1337, 0).
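
As a rough sketch of that idea, the snippet below compares the element we sent with the element the LLM returned and derives the corresponding Engine API calls. It is written in Python purely for illustration and emits the calls as strings rather than executing them; the `SETTERS` table and the `diff_element` helper are assumptions and cover only the x and y attributes from the example.

```python
import xml.etree.ElementTree as ET

# Maps a serialized attribute to the Engine API setter named in the text.
# Only x and y are covered here; other attributes would need their own entries.
SETTERS = {"x": "setPositionX", "y": "setPositionY"}


def diff_element(before_xml: str, after_xml: str) -> list:
    """Return Engine API calls (as strings) for the attributes that changed."""
    before = ET.fromstring(before_xml)
    after = ET.fromstring(after_xml)
    if before.get("id") != after.get("id"):
        raise ValueError("elements do not refer to the same block")
    block_id = before.get("id")
    calls = []
    for name, setter in SETTERS.items():
        new_value = after.get(name)
        # Emit a call only when the attribute is present and actually differs.
        if new_value is not None and new_value != before.get(name):
            calls.append(f"engine.block.{setter}({block_id}, {new_value})")
    return calls


calls = diff_element(
    '<Image id="1337" x="100" y="100" />',
    '<Image id="1337" x="0" y="0" />',
)
# calls == ["engine.block.setPositionX(1337, 0)", "engine.block.setPositionY(1337, 0)"]
```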

Refer to Appendix: Parsing and transforming LLM response for a deeper look into this topic.

Based on what we have learned, we assemble a workflow, with the LLM as the center point that allows the LLM to execute design-related tasks on any of our CE.SDK designs. This workflow can be divided into two sub-tasks: Composing an input text (prompt) with all necessary context and parsing and applying the output of the LLM to the CE.SDK design.

Composing the Input Text

As seen in the graphic above we first compose the input text based on different components to provide the model with all necessary context to fulfill the user’s editing request: This includes a general text to “instruct” the model, the actual request that the user entered, an exemplary design representation to explain our output format to the model and a representation of the currently edited design.

The general, static text to “instruct” the model is composed of the following three parts.

What we are trying to achieve in general:
"You are an AI with expertise in design, specifically focused on XML representations of designs".
Output format instructions:
Your responses should only contain one XML document. Ensure that you do not introduce new attributes to any XML elements. You can change image elements by setting the alt attribute, which then will be used to search Unsplash for a fitting image. A sample alt text is “A mechanic changing tires with a pair of beautiful work gloves on”.
Additionally, pay close attention to the layout: verify that no elements in the XML document extend beyond the page boundaries. This constraint is critical for maintaining consistency and accuracy in XML formatting. Always double-check your XML output for these requirements.
The actual user request:
e.g., "Make the logo bigger"

By including a textual representation of a comprehensive example design we can show the model which layer types are available as well as which properties of those layers can be manipulated.

The model now has all the necessary context and building blocks to respond to user requests in a format that we can process downstream.
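
A minimal sketch of how these components could be assembled into one input text is shown below. The instruction wording is taken from the parts quoted above, but the order of the sections, the section labels, and the `compose_prompt` function are assumptions rather than the exact prompt used in the project.

```python
INSTRUCTIONS = (
    "You are an AI with expertise in design, specifically focused on "
    "XML representations of designs. "
    "Your responses should only contain one XML document. "
    "Ensure that you do not introduce new attributes to any XML elements."
)


def compose_prompt(user_request: str, example_design: str, current_design: str) -> str:
    """Join the static instructions, both design representations, and the request."""
    sections = [
        INSTRUCTIONS,
        "Example design showing the available elements and attributes:\n" + example_design,
        "Current design:\n" + current_design,
        "User request: " + user_request,
    ]
    # A blank line between sections keeps the parts visually separate for the model.
    return "\n\n".join(sections)
```

For example, `compose_prompt("Make the logo bigger", example_xml, current_xml)` yields a single string that can be sent as the prompt, where `example_xml` and `current_xml` stand for serialized designs.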

Applying the Output Text

In the first step, we scan the output text for an XML-like document and if we find one, attempt to parse it.

This will yield a structured data object we can compare to the one we passed in and calculate which elements have been modified, added, or removed.

The resulting change set can then be translated into specific calls to the CE.SDK Engine API to change the current design.
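
The steps above can be sketched as follows, again in Python for illustration only. The root tag name `Page` and the helper names are assumptions; the sketch looks for the first XML-like document in the output text, parses it, and classifies blocks as added, removed, or modified by comparing their ids.

```python
import re
import xml.etree.ElementTree as ET


def extract_xml(output_text: str, root_tag: str = "Page"):
    """Find the first <root_tag>...</root_tag> span in the LLM output and parse it."""
    match = re.search(rf"<{root_tag}\b.*?</{root_tag}>", output_text, re.DOTALL)
    if match is None:
        return None
    try:
        return ET.fromstring(match.group(0))
    except ET.ParseError:
        # The span looked like XML but is not well formed.
        return None


def index_by_id(root) -> dict:
    """Map each element's id attribute to its tag and attributes."""
    return {el.get("id"): (el.tag, dict(el.attrib)) for el in root.iter() if el.get("id")}


def change_set(before_root, after_root) -> dict:
    """Classify blocks as added, removed, or modified by comparing ids."""
    before, after = index_by_id(before_root), index_by_id(after_root)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(i for i in before.keys() & after.keys() if before[i] != after[i]),
    }
```

Each entry in the resulting dictionary would then be translated into the matching Engine API calls.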

Issues faced

Latency: One issue with state-of-the-art models is their high latency. Each LLM response contains approximately 1,000 tokens, which means each request takes 7-45 seconds (depending on the model) to complete. This long delay may be unacceptable for some user experiences. However, we see this issue as transitory and expect upcoming models to have much lower latency while maintaining their capabilities.

Pricing: Each request/response with GPT-4 Turbo as the backing model costs around 5 cents, which rules out some use cases. We expect pricing to drop significantly as well; the new GPT-4o model, for example, cuts the price in half.
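
To put the latency and pricing figures into perspective, here is a small back-of-the-envelope calculation. The token count, latency range, and per-request prices come from the text; the volume of 1,000 requests per day is an assumed example, not a measured number.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Generation throughput implied by a response length and its latency."""
    return tokens / seconds


def monthly_cost_usd(requests_per_day: int, cost_per_request_usd: float, days: int = 30) -> float:
    """Total spend for a steady request volume."""
    return requests_per_day * cost_per_request_usd * days


# Figures from the text: about 1,000 tokens per response, 7-45 s per request.
fastest = tokens_per_second(1000, 7)    # roughly 143 tokens per second
slowest = tokens_per_second(1000, 45)   # roughly 22 tokens per second

# About 5 cents per request with GPT-4 Turbo, half of that with GPT-4o.
# The request volume below is an assumption for illustration.
gpt4_turbo = monthly_cost_usd(1000, 0.05)    # about 1,500 USD per month
gpt4o = monthly_cost_usd(1000, 0.025)        # about 750 USD per month
```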

Hallucinations: LLMs do not always follow instructions properly and may, for example, produce output that cannot be parsed. Hallucinations become less frequent as model capability increases, and this issue is not apparent with current state-of-the-art LLMs such as GPT-4 and GPT-4o.

Conclusion

We present a novel and adaptable approach to use Generative AI and LLMs specifically to interact with IMG.LY’s CreativeEditor SDK. We showcase how this technology can be used to execute common design requests on arbitrary CE.SDK Designs. The proposition that LLMs can understand textual representations of visual elements was by no means obvious. This research project has revealed that it is very well within the scope of LLMs to translate instructions from a visual semantic context to its textual representation and back. This invites more inquiries into LLMs as assistants for tasks with a heavy visual component such as design.

While further research is needed to make this technology available in production environments, we are confident that Generative AI-based editing will play a big role in the future of Graphics and Video editing.

Appendix: Use-case Dependent Serialization of CE.SDK Designs

LLMs operate on “tokens”, which roughly correspond to words or word fragments. However, a design, such as a poster or a social media graphic, is highly visual. That means we need a way to convert a design into text, that is, to serialize it. Our CE.SDK engine can serialize an existing scene using our engine.block.saveToString method. However, this serialization contains a large amount of information that is not needed for editing the design. LLMs are priced by token, and their speed also depends on the number of tokens in the input and output. Thus, the number of tokens should be kept as low as possible.

We looked at several ways to convert the current state of the design into a textual representation. Since GenAI is trained on a lot of (X)HTML which has an XML-like format, we decided to serialize any designs into a tag-based XML-like format.

The IMG.LY editor internally refers to design elements like images or texts as “blocks”. These blocks are uniquely identifiable and addressable using a numeric ID. We use this ID to be able to identify a serialized design block in the input and output of the LLM. Example: <Image id="12582927" x="0" y="0" width="800" height="399" />

For each of the CE.SDK block types, such as “Text” or “Graphics” (representing images or vector shapes), we use the CE.SDK Engine API to query very specific data from the block. That means, for example, that only Text blocks have a text attribute.

This rather specific mapping of only certain properties from the CE.SDK design into the text serialization allows us to optimize the design serialization for different use cases. A use case where we want to automatically name each layer, for example, probably does not require fine-grained information such as the font size.
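
The sketch below illustrates this use-case dependent serialization. Plain Python dictionaries stand in for blocks that would really be queried through the Engine API, and the two profiles (`layout` and `naming`), the `Page` wrapper, and the function names are assumptions made for the example.

```python
from xml.sax.saxutils import quoteattr

# Which properties to keep per block type, for two hypothetical use cases.
PROFILES = {
    "layout": {"Image": ["x", "y", "width", "height"],
               "Text": ["x", "y", "width", "height", "text"]},
    "naming": {"Image": ["alt"], "Text": ["text"]},
}


def serialize_block(block: dict, profile: str) -> str:
    """Render one block as a self-closing tag with only the whitelisted attributes."""
    attrs = [("id", block["id"])]
    attrs += [(k, block[k]) for k in PROFILES[profile][block["type"]] if k in block]
    # quoteattr escapes special characters and adds the surrounding quotes.
    rendered = " ".join(f"{k}={quoteattr(str(v))}" for k, v in attrs)
    return f"<{block['type']} {rendered} />"


def serialize_design(blocks: list, profile: str) -> str:
    """Wrap all serialized blocks in a single root element."""
    lines = [f"  {serialize_block(b, profile)}" for b in blocks]
    return "<Page>\n" + "\n".join(lines) + "\n</Page>"
```

With the `naming` profile, position and size are dropped entirely, which keeps the token count low for a use case that does not need them.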

Appendix: Parsing and transforming LLM response

The LLM answers with arbitrary tokens. With prompting alone, it’s not possible to strictly restrict the response to a certain syntax. Instead, by settling on a well-defined and widely used format, we instruct the model to reply with an XML-like document, similar to the one we passed in as “current state”.

After receiving the LLM’s response, we first make sure that only a single XML document is present inside the response. We then compare the retrieved XML document with the state of the design that we passed into the LLM and generate a change set. This change set contains entries like “Color of block with ID=123 has changed”. These change set entries are then converted into programmatic commands, such as engine.block.setColor(123), and executed on the current design.

One challenge when working with an LLM is the inability to restrict the output space. Thus, we are never guaranteed that the LLM did not add, for example, new XML node names, or that it even replies with a proper, valid XML-like document. The only lever to influence the probability of a proper XML-like document is to use strong prompting and LLMs that are good at following those instructions.
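
Because the output space cannot be restricted, it helps to validate a response before applying it. The following is a minimal sketch of such a check; the `allowed` mapping and the `validate_response` function are assumptions, not part of CE.SDK.

```python
import xml.etree.ElementTree as ET


def validate_response(xml_text: str, allowed: dict) -> list:
    """Return a list of problems; an empty list means the response is usable.

    `allowed` maps each permitted tag name to the set of attributes it may carry.
    """
    try:
        # fromstring raises ParseError on malformed XML and on trailing
        # content after the root element, such as a second XML document.
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError as err:
        return [f"not a single well-formed XML document: {err}"]
    problems = []
    for el in root.iter():
        if el.tag not in allowed:
            problems.append(f"unknown element <{el.tag}>")
            continue
        extra = set(el.attrib) - allowed[el.tag]
        if extra:
            problems.append(f"<{el.tag}> has new attributes: {sorted(extra)}")
    return problems
```

A response with problems could be rejected outright or sent back to the model with the problem list attached.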

In our tests, state-of-the-art models like GPT-4 can follow those instructions without any further tooling.

Further Research Topics

It’s also worth exploring fine-tuning an LLM specifically for this task, which could improve its performance.

It would also be possible to use more advanced libraries like Guidance, which allows defining a grammar for the LLM response, thus making sure that the output of the LLM is always parseable.

Another way to improve the performance would be to methodically test different prompt templates and find a way to measure and compare the output quality.
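
One simple, measurable starting point would be the share of responses per prompt template that parse as a single XML document. The sketch below assumes the responses have already been collected; the metric and the function names are illustrative assumptions, and parseability alone says nothing about design quality.

```python
import xml.etree.ElementTree as ET


def is_parseable(response: str) -> bool:
    """True if the response is exactly one well-formed XML document."""
    try:
        ET.fromstring(response.strip())
        return True
    except ET.ParseError:
        return False


def parse_rate(responses: list) -> float:
    """Fraction of responses that can be parsed."""
    if not responses:
        return 0.0
    return sum(is_parseable(r) for r in responses) / len(responses)


def rank_templates(results: dict) -> list:
    """Sort template names by parse rate, best first."""
    scored = ((name, parse_rate(rs)) for name, rs in results.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```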

Thank you for reading!

Cutting Through The Jungle: An In-depth Review of Cloud GPU Providers to Train Your AI Models in 2024

Walter — Mon, 22 Apr 2024 06:35:54 GMT

Navigating the World of AI Models Hosting

Here at IMG.LY, we recently dug into finding the best place to host AI models to support apps we’re dreaming up. We wanted to figure out if using cloud GPUs or going serverless would work better for us. As we were looking specifically for service providers to run Image Generation Workloads on, we focused on those that could be the best fit for that. Along the way, we picked up some cool insights and ran into a few hiccups. We think sharing our journey and the things we figured out could help you when you’re looking to deploy your own AI models.

First off, we’ll explain what cloud GPU and serverless hosting really mean. Then, we’ll chat about their good and not-so-good sides when it comes to hosting AI models. It’s super important to make sure whatever hosting you choose fits your model like a glove. We’ll talk about some tools we stumbled upon that could help with that. Next up, we’ll give you a peek at some of the providers we checked out and our thoughts on how they might fit with what we’re working on. We decided to skip over the big names like IBM, Google, and Amazon this time. We were curious about what the newer, smaller companies have to offer.

To wrap things up, we’ll share some final thoughts on all our research. Plus, we’ll throw in some tips and ideas you might want to think about when you’re doing your own digging. Whether you’re developing AI models or planning to host some of the well-known ones, we hope our adventure helps you nail down the perfect hosting solution for what you need. Ready to jump in?

Kinds of Cloud Hosting for AI Models

Cloud hosting has been around for as long as there has been a cloud. Though the server hardware is not at your location, earlier versions of cloud hosting required that your team learnt lots about server infrastructure. As things have evolved, providers now manage the infrastructure so that you can focus on your work. You can now host even just a single function in the cloud, if that’s what you need. In our research, we looked at general serverless hosting and at Cloud GPU AI providers.

Serverless Hosting

Serverless hosting can be defined as an architecture model that lets developers build and run applications and services without managing the servers they run on. The cloud provider manages things like security, provisioning, scaling, and connectivity.

With serverless hosting for CPU workloads, the host provisions your services onto the most appropriate available hardware. With most providers of GPU workloads, however, you get to choose the hardware.

Serverless Pros:

Pay-per-compute model: you only pay for the compute time you consume.
Autoscaling: the provider will automatically scale up or down depending on load, from a few requests a day to thousands per second.
No server management: eliminates the need for developers to also understand server infrastructure. Often, just a Docker image holding an application is sufficient.

Serverless Cons:

Cold starts: instances are deallocated after a certain idle time (which is what enables the pay-per-compute model), so the first request after that can be noticeably slow.
Limited control over specifics: certain GPU hardware or even server hardware may be unavailable at times, which can impact performance.
Limitations on time: there may be limits on the execution time of functions, which can impact long-running processes.

Cloud GPU Hosting

Cloud GPU hosting provides access to GPU and TPU (Tensor Processing Unit) hardware that can perform the parallel operations essential for AI model training and inference. The provider allows users to configure specific hardware for their jobs.

With cloud GPU each service or model gets its own GPU while running. Your other services communicate with the model through an API.

Cloud GPU Pros:

High performance: GPUs are specifically designed to run AI models and other tasks like deep learning and complex simulations.
Full control of hardware: users can specify specific hardware configurations for their projects.
Persistent availability: resources are not deallocated, so there is no latency for provisioning for the first request.
Cost-effective experiments: the upfront cost of purchasing GPU hardware to experiment with different configurations is eliminated. Services are priced with a pay-as-you-go model.

Cloud GPU Cons:

Costs over time: costs do not go down during periods of low demand. Over time, costs can potentially surpass the cost of investing in local hardware.
Management overhead: managing and optimizing hardware configurations is not automatically part of the hosting. You’ve got to learn some server administration and manage security and upgrades.
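
The trade-off between pay-per-compute and an always-on GPU comes down to utilization, which a small calculation makes concrete. The hourly prices used in the usage note are hypothetical placeholders, not quotes from any provider in this review.

```python
def serverless_cost(busy_hours: float, price_per_hour: float) -> float:
    """Pay-per-compute: billed only for the hours the GPU is actually working."""
    return busy_hours * price_per_hour


def dedicated_cost(total_hours: float, price_per_hour: float) -> float:
    """Always-on cloud GPU: billed for every hour, busy or idle."""
    return total_hours * price_per_hour


def break_even_utilization(serverless_price: float, dedicated_price: float) -> float:
    """Fraction of time the GPU must be busy before the dedicated option is cheaper."""
    return dedicated_price / serverless_price
```

With an assumed serverless price of 2.00 per hour and a dedicated price of 1.00 per hour, `break_even_utilization(2.0, 1.0)` is 0.5: below 50% utilization the serverless option costs less, above it the dedicated GPU does.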

Providers

It’s important to understand that this isn’t a ranking of the best providers or an endorsement. It’s what we discovered with some web searching, reviewing the available documentation, and tinkering with any demo or free tools and models the provider makes available. The list could easily have been different providers and we think some of the pros and cons and qualities would be the same. Hopefully, some of the questions we raise and the pros or cons we noticed in our research can help you to guide your research.

Our goal was to find potential hosts for various workflows with different models in a scalable manner. We want to be able to build applications around the workflows. Some of our specific requirements include:

Autoscaling, ideally out-of-the-box without the need for custom Kubernetes setup or similar technologies.
Minimal vendor lock-in.
Compatibility with various technologies (REST API, WebSocket, Webhooks, etc.).
Support for Windows Server.

With those disclaimers and caveats, here is a short summary of our research.

Provider	Best For
Runpod IO (Serverless)	Deploying AI models with GPU support behind customizable API interfaces.
Vast AI (Serverless)	Affordable GPU resources and a wide variety of GPU options for AI model training.
Paperspace (Serverless)	Flexible workflows and support for different stages of AI model development.
CoreWeave (Serverless)	Teams with strong Kubernetes knowledge that need autoscaling for AI workloads.
Modal (Serverless)	Deploying AI models in containers, backed by comprehensive documentation and examples.
ComfyICU (Serverless)	Serverless infrastructure tailored for hosting ComfyUI applications.
Replicate (Serverless)	Executing AI tasks through an easy-to-use API without managing infrastructure.
Genesis Cloud (Cloud GPU)	Teams that value sustainability and need scalable GPU instances for AI model training.
Fly IO (Cloud GPU)	Deploying complete applications with GPU support in a scalable environment.
Runpod IO (Cloud GPU)	GPU resources in various regions with customizable Docker-based deployments.
Lambda Labs (Cloud GPU)	On-demand GPU resources for model training and inference tasks.
Together AI (Cloud GPU)	Testing serverless models, with occasional access to GPU clusters.

If you want to skip ahead to a specific part, here are the providers we will be diving into:

Serverless Providers
Runpod IO (Serverless)
Vast AI
Paperspace
Banana Dev
CoreWeave
Modal
ComfyICU
Replicate

GPU Cloud Providers
Genesis Cloud
Fly IO
Runpod IO (Cloud GPU)
Lambda Labs
Together AI

Serverless Providers

Runpod IO (Serverless)

Concept:

A Docker image that includes the installation of Python + GPU packages, models, and ComfyUI.
Python/Go handlers act as an API interface to ComfyUI, which is vendor-specific, but can be wrapped in a more general API for reuse. For more information, see this article on hosting a ComfyUI workflow via API.
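
To illustrate how a vendor-specific handler can be kept thin, the sketch below wraps a workflow runner in a handler that takes a job dictionary with an `input` key. The `make_handler` factory and the injected `run_workflow` function are hypothetical, and the registration call in the trailing comment reflects our understanding of Runpod’s Python SDK, so check it against their documentation.

```python
from typing import Callable


def make_handler(run_workflow: Callable[[dict], dict]):
    """Build a handler around a workflow runner.

    `run_workflow` is a hypothetical function that submits a workflow to
    ComfyUI and returns its result. It is injected so that the same logic
    can be reused with another provider.
    """
    def handler(job: dict) -> dict:
        payload = job.get("input") or {}
        if "workflow" not in payload:
            return {"error": "missing 'workflow' in input"}
        return {"output": run_workflow(payload["workflow"])}
    return handler


# On Runpod, the handler would be registered with the provider's Python SDK:
#   import runpod
#   runpod.serverless.start({"handler": make_handler(run_comfyui_workflow)})
```

Only the registration line is tied to the provider, which limits vendor lock-in to a single call.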

Pros:

Good documentation, including public GitHub repositories with examples.
Relatively large community for a new provider.
Compatibility with Windows Server.
Handlers allow for webhook and WebSocket-like communication for API feedback.
Network volume to store models/data and reduce cold start times.
Control over the number of workers and the ability to define persistently active workers.

Cons:

Availability of GPUs, especially in Europe, needs to be validated.
Handlers can only be written in Python and Go.

Open Questions:

General open questions regarding serverless infrastructure and AI inference tasks.

Conclusion:

The overall package seems very mature. The setup can largely be adopted from the GitHub examples. Good documentation and community support (notably on Reddit). The open questions regarding pricing and cold starts are typical for serverless infrastructure.

Vast AI

Concept:

Peer-to-Peer Sharing. Companies/organizations can rent out their unused GPUs.
A GPU Marketplace approach.

Pros:

Affordable prices through their peer-to-peer GPU sharing model.
A wide selection of different GPUs.
Good global availability of GPUs.
Ability to define autoscaler groups, allowing different workflows to scale differently.

Cons:

The autoscaler is currently only in beta mode.
Data privacy/security concerns when renting GPUs from anonymous providers.

Open Questions:

How will the autoscaler beta evolve?
Control over GPU providers: Can one allow only certain trusted providers (e.g., those based in the EU)?

Conclusion:

Even though the pricing is more affordable, there may be significant issues, in terms of security and data protection, as well as the fact that the autoscaler is still in the beta phase.

Paperspace

Concept:

The serverless approach (Workflows or Gradient) is still in beta. Paperspace Gradient Workflows is based on Argo Workflows, which utilizes Kubernetes.
A predefined API is available for communicating with workflows, as detailed in DigitalOcean’s documentation for Paperspace commands.

Pros:

The ability to use different machines (GPUs) at different stages of a workflow.
Provided by DigitalOcean, which allows general hosting customers to expand into GPU hosting without finding a new vendor.
Possible Windows support as outlined in DigitalOcean’s documentation on running Windows apps.

Cons:

Complex documentation: offers many features for various use cases (AI learning, data preparation, validation, and inference).
Vendor lock-in through a proprietary system: Gradient Workflows and YAML config are specific to Paperspace.
No real-time feedback over the API.

Open Questions:

Since it’s still in beta, how will the ecosystem continue to develop?
How extensive is the knowledge of Kubernetes required to implement autoscaling?

Conclusion:

It’s positive that it’s offered by DigitalOcean, as they are a more mature company with general hosting experience. However, the approach seems very specific to DigitalOcean, and it may require experience with Kubernetes.

Banana Dev

It has been excluded: Recently, they announced the termination of their serverless model as it was not cost-effective.

Learning from this: Currently, there are many new providers entering the market aiming to establish themselves as cloud GPU or serverless GPU providers. This highlights the importance of minimizing vendor lock-in.

CoreWeave

Concept:

Heavily based on Kubernetes.
A Kubernetes file is created for setup; scaling and additional infrastructure are managed by CoreWeave.

Pros:

Autoscaling by default with the possibility of scaling to zero.
Supports Windows.
Minimal vendor lock-in due to Kubernetes configuration.

Cons:

Strong dependency on Kubernetes, with the serverless setup based on the Knative documentation.
Does not offer a handler API, etc., to communicate directly with ComfyUI.

Open Questions:

How complicated would it be to implement an API interface, along with the resulting scaling logic to address the correct instances?

Conclusion:

Good documentation and a close interface to Kubernetes. For a team with strong knowledge of Kubernetes, this could be a prime candidate.

Modal

Concept:

Container Setup: Containers are defined through Modal’s own container setup (see the Modal custom container documentation).
Docker images can also be used.
Modal-specific handlers to communicate with ComfyUI and other models.

Pros:

Supports webhooks and custom endpoints (see the Modal webhooks documentation).
Focus on fast startups/cold starts.
Emphasis on AI inference tasks.
Comprehensive documentation with many examples.

Cons:

Vendor lock-in if Modal’s container setup is used.
Autoscaling and scaling configuration are not directly described.

Open Questions:

How exactly does the autoscaling work?

Assessment:

For us, this is a candidate for closer consideration. The container setup can be managed through Dockerfiles, and the API defined by Modal’s own interface.

ComfyICU

Concept:

Pure focus on ComfyUI, serverless infrastructure.
API interface for communication.

Pros:

Minimal setup effort.

Cons:

Limited control over the API.
Limited GPU resources.

Open Questions:

How does the autoscaling work, if it exists at all?
Community-based open source. What is the long-term support for this project?

Conclusion:

Potentially useful for testing or building a demo site, but probably not suitable for developing our commercial applications.

Replicate

Concept:

Execution of AI tasks/models in the cloud via an API.
No access to infrastructure, etc.

Pros:

Supports various languages: Node, Python, Swift.

Cons:

No control over the infrastructure, number of GPUs, or workers.
API rate limits.

Open Questions:

How can autoscaling be enabled?
Is it possible to create custom API endpoints, webhooks, websockets?

Conclusion:

For testing or as a demo for one’s own model, this can be a very good platform. However, as a standalone application interface, it doesn’t meet some of our core requirements.

GPU Cloud Providers

Genesis Cloud

Concept:

Focus on sustainability and renewable energy.
Scaling is done through instances.

Pros:

A REST API is available for managing instances.

Cons:

The availability of GPUs varies significantly by region.
Limited selection of GPUs.

Open Questions:

How quickly can new instances be scaled up or down?

Conclusion:

The use case for Genesis Cloud appears to be more suited for model training or tasks that require a significant amount of computing power for extended periods.

Fly.io

Concept:

Focus on the deployment of complete applications.
Also offers its own GPU servers.

Pros:

Dockerfile support with additional configuration via a TOML file.
Quick scaling of GPUs up or down facilitated by the launch process.

Cons:

Limited selection of GPUs, with only very large GPUs available.
Specifically tailored for Linux.

Open Questions:

How well does the launch system perform for relatively fast inference tasks?

Conclusion:

Since primarily large GPUs are available, the focus here also appears to be more on model training or other long-duration tasks. However, the launch system might also potentially be used for inference.

Runpod IO (Cloud GPU)

Concept:

A wide range of GPUs available across various regions.
Base Docker images for popular tasks or support for custom Docker images.

Pros:

Many different data center regions.
A variety of CPUs available.
Simple setup via Docker images.

Cons:

No direct autoscaling (would need to use Runpod Serverless for that).
Despite a large selection of GPUs and many different data center locations, the availability of GPUs is not very high.

Open Questions:

Can autoscaling be implemented without using serverless?

Conclusion:

The setup can largely be adopted from the GitHub examples. There is good documentation and a community (much of it on Reddit). The availability of GPUs could become a problem, especially for smaller GPUs.

Lambda Labs

Concept:

On-demand cloud with a focus on model training and inference.
Similar concept to Runpod, offering a variety of GPUs.
- GPU availability is very limited.

Conclusion:

Runpod and Lambda Labs seem to have a similar approach and similar offerings. Runpod appears to have greater availability.

Together AI

Concept:

Offers an API and playground for testing serverless models.
Also offers GPU clusters but only upon request.

Conclusion:

We didn’t dig into the GPU clusters since information is available only upon request. Otherwise, in the API/serverless area, it appears to be similar to Replicate.

Established Providers

As we said in the introduction, we did not examine the older, large providers like Google Cloud, AWS, Azure, and Nvidia in detail. Rather, we focused on the new providers aiming specifically at the market segment of AI GPUs. With the older providers, we are more in the realm of cloud GPUs and less in serverless. Given the size of these providers and the wide range of market segments they cover, it can make sense to opt for them if you are already familiar with their architecture and documentation.

Google Cloud Platform (GCP)
AWS
Microsoft Azure
IBM Cloud
NVIDIA GPU Cloud (NGC)

Conclusion

Just as performance can vary wildly between models, pricing can be similarly complex. When evaluating costs, consider factors like response times, the number of required workers, and potential charges for features like caching. Many providers publish detailed pricing guidelines on their websites, which can be crucial for ensuring you only pay for the computing power you truly need. Experimenting with the performance of your model and application during development will help you make sure both your hardware and your pricing fit your application.
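To make the cost factors above a little more concrete, here is a rough sketch of the arithmetic. It is not based on any provider's real price list: the function names, the per-second and per-hour prices, and the 730-hour month are all assumptions for illustration, and the worker estimate ignores queueing and cold starts.

```python
import math


def workers_needed(peak_requests_per_second, seconds_per_request):
    """Estimate concurrent workers at peak (arrival rate * service time).

    This is a simple Little's law estimate and ignores cold starts.
    """
    return max(1, math.ceil(peak_requests_per_second * seconds_per_request))


def serverless_monthly_cost(requests_per_month, seconds_per_request,
                            price_per_gpu_second):
    """Cost when you are billed only for the seconds a worker is busy."""
    return requests_per_month * seconds_per_request * price_per_gpu_second


def dedicated_monthly_cost(workers, price_per_gpu_hour, hours_per_month=730):
    """Cost when GPU instances run around the clock, busy or idle."""
    return workers * price_per_gpu_hour * hours_per_month


if __name__ == "__main__":
    # All numbers below are made up; replace them with your own measurements
    # and the prices from your provider's pricing page.
    workers = workers_needed(peak_requests_per_second=2, seconds_per_request=5)
    serverless = serverless_monthly_cost(100_000, 5, 0.0005)
    dedicated = dedicated_monthly_cost(workers, 1.5)
    print(f"workers at peak: {workers}")
    print(f"serverless: ${serverless:,.2f} per month")
    print(f"dedicated:  ${dedicated:,.2f} per month")
```

Even a back-of-the-envelope calculation like this shows how strongly the result depends on response time and peak load, which is why measuring your own model during development matters.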

Another thing to consider is the experience your team already has. Most cloud GPU services provide tools such as a CLI or REST API to manage resources, which can mean a steep learning curve if your team is not familiar with these technologies. Additionally, while serverless platforms may support multiple programming languages, compatibility with your team’s preferred language, be it JavaScript, Python, or Go, is essential. As exciting as it can be to learn new languages, it’s probably not the best use of your team’s time.

The size of files you’ll be moving between your model and the other parts of your project may also be a factor. Your users may not notice latency for models that communicate using text only. Text moves quickly from point to point in a network. However, if your model takes large image files as input or output, you may find that moving data between data centers is too slow. Then you’d want to focus on providers who can offer more general hosting in addition to cloud GPU hosting.
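As a rough illustration of why payload size matters, the sketch below estimates how long a single transfer takes. The function name and the example bandwidth and round-trip figures are assumptions, not measurements from any provider, and real transfers add protocol overhead on top.

```python
def transfer_seconds(size_megabytes, bandwidth_megabits_per_second,
                     round_trip_ms=0.0):
    """Estimate transfer time: payload size over bandwidth, plus one round trip."""
    if bandwidth_megabits_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    size_megabits = size_megabytes * 8  # 1 byte = 8 bits
    return size_megabits / bandwidth_megabits_per_second + round_trip_ms / 1000


if __name__ == "__main__":
    # Hypothetical link between two data centers: 100 Mbit/s, 50 ms round trip.
    text_prompt = transfer_seconds(0.002, 100, round_trip_ms=50)  # about 2 KB
    large_image = transfer_seconds(10, 100, round_trip_ms=50)     # about 10 MB
    print(f"text prompt: {text_prompt:.3f} s")
    print(f"large image: {large_image:.3f} s")
```

For the text prompt the round trip dominates, while for the image the payload itself does. Since an image model may move large files in both directions, that cost can apply twice per request.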

As we continue to research this for our own projects, we think the best configuration for us is to use a cloud GPU exclusively for generation tasks and communicate with it via an API from our existing back end. By using the higher-cost cloud GPU for as few tasks as possible, we’ll know we aren’t wasting compute power on things easily handled by a general CPU. We will have to experiment to see if we can keep those functions geographically separate, or if we need to find one hosting company and one data center for both. As we learn more we may change our ideas, but that’s part of the fun of working in technology: things change.
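The split described above could look roughly like the following sketch. It is only an illustration of the idea, not our actual back end: the endpoint URL, the task names, the JSON payload shape, and the helper functions are all hypothetical, and a real provider will define its own API.

```python
import json
import urllib.request

# Hypothetical values; a real provider defines its own URL and payload format.
GPU_ENDPOINT = "https://gpu.example.com/generate"
GPU_TASKS = {"generate_image"}


def needs_gpu(task_type):
    """Only generation tasks are sent to the (more expensive) cloud GPU."""
    return task_type in GPU_TASKS


def build_generation_request(prompt, endpoint=GPU_ENDPOINT):
    """Build the HTTP POST request that the back end sends to the GPU service."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def handle_task(task_type, payload, send=urllib.request.urlopen):
    """Route a task: GPU work goes over the API, everything else stays local."""
    if needs_gpu(task_type):
        request = build_generation_request(payload["prompt"])
        with send(request, timeout=60) as response:
            return json.loads(response.read())
    # Cheap work (validation, resizing, bookkeeping) runs on the existing CPU.
    return {"handled_locally": True, "task": task_type}
```

The `send` parameter is there so the network call can be swapped out in tests; in production it defaults to `urllib.request.urlopen`.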

We hope this has given you some useful background and ideas as you research hosting options for your AI projects. Understanding the subtle differences between serverless and cloud GPU hosting can spark innovative ideas tailored to your needs. Perhaps some of the lesser-known providers we’ve explored might just be the perfect fit for your next project. As always, the dynamic nature of technology keeps us on our toes—ready to adapt and evolve. Happy hosting!

AI – IMG.LY Blog

AI Design Agents and Creative Automation: How to Ship a Full Campaign Without a Designer

The AI Marketing Stack Has a Design-Shaped Hole in It

What an AI Design Agent Actually Changes

How to Run a Campaign Production Session with CoDesign

This Is What Closing the Loop Actually Looks Like

What Is an AI Design Agent?

First, What an AI Design Agent Is Not

AI Design Agent - a Working Definition

How an AI Design Agent Works in Practice

Design Agents vs. Related Tools

The Autonomy Slider: Why Human Control Still Matters

Where AI Design Agents Create the Most Value

A Different Kind of Design Tool

Vibe Design Tools Compared

Tools Built for Creators

Google Stitch

Figma Make

Lovart

Adobe Firefly Boards

Canva Magic Studio

IMG.LY CoDesign

Vibe Design Tools at a Glance

How to Choose

What Is Vibe Design? The Definitive Guide for Product Builders, Designers, and Creative Teams

Vibe Design - A Concept That Just Got a Name

Where the Term Comes From: Vibe Coding’s Design Sibling

What Vibe Design Actually Means: A Working Definition

How It Differs from AI-Assisted Design

Vibe Design in Practice: Three Scenarios

Scenario 1: A Marketing Team, No Designer Available

Scenario 2: A Designer Exploring Variations

Scenario 3: A Product Team Embedding Creative Capability

The Vibe Design Tools Shaping the Space Right Now

Where Vibe Design Has Limits

The Human Element: Vibe Design Is Not Autonomous Design

A Shift That’s Already Underway

CE.SDK v1.69 Release Notes

Introducing IMG.LY Agent Skills for CE.SDK

Launch in Minutes with Production-Ready Starter Kits for Web

Let Users Bring Their PPTX & Canva Designs to CE.SDK

Professional-Grade Video Editing Across Platforms

Full Changelog

Introducing IMG.LY Agent Skills

The “Quick Start” Paradox

The Solution: IMG.LY Agent Skills for Web

How It Works: Your Autonomous Implementation Partner

1. The Explain Path

2. The Build Path

3. The Docs Path

The Shift to Autonomous Engineering

Supported Frameworks

Get Started

Build in a Day: AI Video Clipping with CE.SDK

Introduction

Why Client-Side?

Tech Stack

Architecture Overview

Setting Up CE.SDK

What is CE.SDK?

Installation

Initializing the CreativeEngine

Uploading Video to CE.SDK

Extracting Audio for Transcription

Getting Video Metadata

AI-Powered Transcription & Highlight Detection

The Pipeline

Transcription with Speaker Diarization

AI Highlight Detection with Gemini

Mapping Back to Timestamps

Working with the CE.SDK Timeline

Understanding Blocks

Manipulating Trim Points

Working with Fills and Their Timing

Creating Time-Based Edits from Transcript Words

Generating Speaker Thumbnails

Speaker Detection & Face Tracking

Why Semi-Automatic?

How It Works

Multi-Speaker Templates & Dynamic Switching

Overview of `gpt-image-1`