In the following, I’ll present a simple cookbook for building an AI-based video generator app, as described in my previous blog post. We’ll use a combination of different APIs to generate text, audio, and images, and compose & render the final video using the headless CreativeEditor SDK, also known as the Creative Engine.

This cookbook showcases the powerful capabilities of our client-side Creative Engine. The engine enables real-time video generation directly in the browser, eliminating the need for server-side processing. What sets this approach apart is its ability to produce editable source files that can then be opened with CreativeEditor SDK.

This approach gives users complete control over every aspect of the video, from text and images to animations and overall composition. That means they can refine and perfect the content even after the initial generation.

Get the complete code on GitHub.

Scope

This tutorial focuses on building an app with a simple UX:

  1. Input your keywords/topics
  2. Choose between landscape or portrait format
  3. Generate and preview your video
  4. Edit the video in the CE.SDK video editor

The app flow we will create:

The post-editing we will get with CE.SDK:

Technical Overview

The app follows three major steps to generate the video.

  1. A script is generated based on user input; the output is a structured XML file.
  2. The XML script is parsed to extract text and image information. The extracted data is then used to generate audio & image files through third-party APIs.
  3. All assets are loaded into the Creative Engine. This is where the composition, including animations and effects, takes place. The Creative Engine then exports a video and a scene file, which can be edited with the Creative Editor.

Setup

We’ll use a boilerplate with Next.js, React, TypeScript & Tailwind. Make sure you retrieve all necessary keys:

# Required environment variables
NEXT_PUBLIC_ANTHROPIC_API_KEY=your_claude_api_key
NEXT_PUBLIC_FAL_API_KEY=your_fal_ai_key 
NEXT_PUBLIC_ELEVEN_LABS_KEY=your_eleven_labs_key
NEXT_PUBLIC_IMG_LY_KEY=your_img_ly_key

Implementation

1. Generate The Script

In this step, we’ll focus on generating the initial prompt and then passing it to the Anthropic API.

As with many things involving LLMs, there are different strategies for structuring the initial prompt. In our experience, the best results come from providing examples of the desired output. We’ve decided to use an XML document; it can easily be parsed later on and is less error-prone than JSON.

We now define the structure of how information should be saved in the XML.

<video>
    <group part="intro">
        <element>
            <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.2">Did you know these fascinating facts about pyramids?</text>
            <image>Ancient Egyptian pyramid at sunset</image>
        </element>
        
    </group>
    <group part="content">
        <element>
            <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.2">The Great Pyramid was the tallest structure for over 3,800 years!</text>
            <image>Great Pyramid comparison to modern buildings</image>
        </element>
    </group>
    <group part="outro">
        <element>
            <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.4">The pyramids continue to reveal their secrets to this day...</text>
            <image>A giant 3D question mark hovering over the pyramids</image>
        </element>
        <element>
            <text voiceId="50YSQEDPA2vlOxhCseP4" style="0.4">Stay curious - there's always more to discover!</text>
            <image>Pyramids under starry night sky</image>
        </element>
    </group>
</video>

In this tutorial, we’ll focus only on the trivia format, as shown in this example. In later iterations, however, I’m planning to implement different content formats (e.g., trivia, quiz, recipe). Each of these formats will have its own example XML. Therefore, I’m nesting the XML in a simple format object so this can easily be scaled up later.

interface Format {
  name: string;
  example: string;
}

const formats: Record<string, Format> = {
  trivia: {
    name: 'Trivia',
    example: `<video><group>...</group></video>` // Add example from above
  }
};

Using this format object with the example, we can now generate the prompt.

What do we need for this prompt?

  • Description of the task
  • Description of the desired output, incl. an example for the specified format
  • Topic as provided by the user

The topic provided by the user is passed to the function as a string.

export const createVideoScriptPrompt = (topic: string, formatName: string = 'trivia') => {
  const format = formats[formatName];
  if (!format) throw new Error(`Format ${formatName} not found`);

  return `
Format: ${format.name}
Topic: ${topic}

Please write a detailed script for this short video, considering the specified format and topic.
Include an introduction, main content sections, and an outro. Each section should have an image.
Structure the script as an XML Document with clear sections, descriptions for the images.
The image description should be written as a prompt. This prompt will be used to generate an image.
Put the description between the image tags. The video shouldn't be longer than 30 seconds.

Example format:
${format.example}`;
}
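
The prompt is then sent to the Anthropic API. Here is a minimal sketch using the @anthropic-ai/sdk package; the model name and the dangerouslyAllowBrowser flag (needed because we call the API directly from the browser) are assumptions on my part:

import Anthropic from '@anthropic-ai/sdk';

// Assumption: the key from the env file is used directly in the browser
const anthropic = new Anthropic({
  apiKey: process.env.NEXT_PUBLIC_ANTHROPIC_API_KEY,
  dangerouslyAllowBrowser: true,
});

export async function generateScript(topic: string): Promise<string> {
  const response = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-latest', // assumed model name
    max_tokens: 2048,
    messages: [{ role: 'user', content: createVideoScriptPrompt(topic) }],
  });
  // The XML script is returned as plain text in the first content block
  const block = response.content[0];
  return block.type === 'text' ? block.text : '';
}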

2. Generate All Assets

In the second step, we’ll parse the LLM response, which should be valid XML. We’ll create a simple parsing function to extract all the text and image information that should be sent to the text-to-speech and text-to-image models.

Please note that all these steps can easily be streamlined with AI-assisted coding. Just provide the example XML as input and describe your desired output.
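
For illustration, here is a minimal parsing sketch using the browser’s built-in DOMParser; the ScriptElement shape is my own naming, not a fixed schema:

interface ScriptElement {
  text: string;
  voiceId: string;
  imagePrompt: string;
}

export function parseScript(xml: string): ScriptElement[] {
  const doc = new DOMParser().parseFromString(xml, 'text/xml');
  // One entry per <element> node, combining its <text> and <image> children
  return Array.from(doc.querySelectorAll('element')).map((el) => {
    const textNode = el.querySelector('text');
    return {
      text: textNode?.textContent?.trim() ?? '',
      voiceId: textNode?.getAttribute('voiceId') ?? '',
      imagePrompt: el.querySelector('image')?.textContent?.trim() ?? '',
    };
  });
}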

API Calls
Whenever we find text & image tags in the XML, we call API functions for text-to-speech and text-to-image. For this example, I’m using the ElevenLabs & fal APIs. You’ll find all API calls in api.ts.

Since the LLM generated a script that includes image prompts, make sure to pass them to the API.

import * as fal from '@fal-ai/serverless-client';

// Authenticate the fal client with the key from the env file
fal.config({ credentials: process.env.NEXT_PUBLIC_FAL_API_KEY });

export async function generateImage(prompt: string): Promise<string | null> {
  try {
    console.log('Generating image for prompt:', prompt);
    // flux/dev takes the aspect ratio via the image_size parameter
    const result = await fal.subscribe("fal-ai/flux/dev", {
      input: {
        prompt: prompt,
        image_size: "portrait_16_9",
      },
    });
    const typedResult = result as { images: { url: string }[] };
    console.log('Image generation successful. URL:', typedResult.images[0].url);
    return typedResult.images[0].url;
  } catch (error) {
    console.error('Error generating image:', error);
    return null;
  }
}
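
The text-to-speech call works similarly. Below is a rough sketch against ElevenLabs’ with-timestamps endpoint, which returns the audio as base64 together with a character-level alignment object; the chosen model_id is an assumption:

export async function generateSpeech(text: string, voiceId: string) {
  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/with-timestamps`,
    {
      method: 'POST',
      headers: {
        'xi-api-key': process.env.NEXT_PUBLIC_ELEVEN_LABS_KEY ?? '',
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ text, model_id: 'eleven_multilingual_v2' }),
    }
  );
  const data = await response.json();
  // data.audio_base64: the audio file, data.alignment: character-level timestamps
  return data as {
    audio_base64: string;
    alignment: {
      characters: string[];
      character_start_times_seconds: number[];
      character_end_times_seconds: number[];
    };
  };
}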

Timestamps
Last but not least, we need timestamps: how long is each segment? This is critical information for composing the video. Luckily, it’s quite easy to determine: each scene is as long as the audio generated for its segment. The audio duration can be calculated because most TTS services, such as ElevenLabs, return timestamps along with the audio file. These are typically character-based, so we first have to derive the timestamps for each word and then the duration of the entire text section.
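
Here is a minimal sketch of that conversion, assuming the character-level alignment shape returned by ElevenLabs (adapt the field names if your TTS provider differs):

interface WordTimestamp { word: string; start: number; duration: number; }

export function toWordTimestamps(alignment: {
  characters: string[];
  character_start_times_seconds: number[];
  character_end_times_seconds: number[];
}): WordTimestamp[] {
  const words: WordTimestamp[] = [];
  let current = '';
  let wordStart = 0;

  alignment.characters.forEach((char, i) => {
    if (current === '') wordStart = alignment.character_start_times_seconds[i];
    if (char === ' ') {
      // A space closes the current word; its end is the previous character's end time
      if (current) {
        words.push({
          word: current,
          start: wordStart,
          duration: alignment.character_end_times_seconds[i - 1] - wordStart,
        });
      }
      current = '';
    } else {
      current += char;
    }
  });
  // Flush the last word
  if (current) {
    const end = alignment.character_end_times_seconds[alignment.characters.length - 1];
    words.push({ word: current, start: wordStart, duration: end - wordStart });
  }
  return words;
}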

Ready For The Next Steps
All generated asset URLs are saved in a VideoBlock object for convenience. The duration of a VideoBlock is the duration of its audio, as calculated above.

interface VideoBlock {
  text: string;
  imageUrl: string | null;
  audioUrl: string | null;
  startTime: number;
  duration: number;
  wordTimestamps: Array<{ word: string, start: number, duration: number }>;
}
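
Putting the pieces together, the blocks can be assembled back to back, with each startTime being the sum of the previous durations. This is only a sketch reusing the helpers sketched above (parseScript, generateSpeech, toWordTimestamps) plus generateImage:

export async function generateAssets(xml: string): Promise<VideoBlock[]> {
  const videoBlocks: VideoBlock[] = [];
  let currentStart = 0;

  for (const element of parseScript(xml)) {
    const speech = await generateSpeech(element.text, element.voiceId);
    const wordTimestamps = toWordTimestamps(speech.alignment);
    // Each scene lasts as long as its audio: the end time of the last character
    const duration = speech.alignment.character_end_times_seconds.at(-1) ?? 0;

    // Decode the base64 audio into a blob URL the engine can load later
    const audioBytes = Uint8Array.from(atob(speech.audio_base64), (c) => c.charCodeAt(0));
    const audioUrl = URL.createObjectURL(new Blob([audioBytes], { type: 'audio/mpeg' }));

    videoBlocks.push({
      text: element.text,
      imageUrl: await generateImage(element.imagePrompt),
      audioUrl,
      startTime: currentStart,
      duration,
      wordTimestamps,
    });
    currentStart += duration;
  }
  return videoBlocks;
}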

3. Generate The Video

We now have everything together: the completed XML with timestamps, durations, and all assets. It’s time to generate the video using the Creative Engine.

Let’s first add an empty container to our HTML that will be referenced when initializing the Creative Engine.

  {/* Add container for Creative Engine */}
  <div
    id="cesdk_container"
    className="bg-gray-100 invisible rounded-lg mt-8"
  />

We can now initialize the engine. Use this code snippet from our documentation.
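
For reference, the initialization looks roughly like this (a condensed sketch; the engine version in the baseURL is an assumption and should match the version you install):

import CreativeEngine from '@cesdk/engine';

const engine = await CreativeEngine.init({
  license: process.env.NEXT_PUBLIC_IMG_LY_KEY,
  baseURL: 'https://cdn.img.ly/packages/imgly/cesdk-engine/1.44.0/assets',
});
// Attach the engine's canvas to the container defined above
document.getElementById('cesdk_container')?.append(engine.element);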

We’ll then set up a function that creates a simple composition using the provided VideoBlocks. The engine requires you to first create a scene, append a page to the scene, and then create tracks within the page. The tracks are basically the layers in the timeline. I recommend setting one track as a background track using the following snippet:


// Set video track as a background track by connecting the page duration to the video track
engine.block.setAlwaysOnBottom(videotrack, true);
engine.block.setPageDurationSource(page, videotrack);
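
For orientation, here is a condensed sketch of that scaffolding plus one image and audio block per VideoBlock. The block types and property keys follow the CE.SDK block API, but treat the exact keys as assumptions and check them against the documentation:

// Create a video scene with a single page and a background video track
const scene = engine.scene.createVideo();
const page = engine.block.create('page');
engine.block.appendChild(scene, page);

const videotrack = engine.block.create('track');
engine.block.appendChild(page, videotrack);
engine.block.fillParent(videotrack);

// One image block per VideoBlock, placed back to back on the track
for (const block of videoBlocks) {
  const image = engine.block.create('graphic');
  engine.block.setShape(image, engine.block.createShape('rect'));

  const fill = engine.block.createFill('image');
  engine.block.setString(fill, 'fill/image/imageFileURI', block.imageUrl ?? '');
  engine.block.setFill(image, fill);

  engine.block.appendChild(videotrack, image);
  engine.block.setDuration(image, block.duration);

  // Audio is attached to the page and offset to the block's start time
  const audio = engine.block.create('audio');
  engine.block.setString(audio, 'audio/fileURI', block.audioUrl ?? '');
  engine.block.setTimeOffset(audio, block.startTime);
  engine.block.setDuration(audio, block.duration);
  engine.block.appendChild(page, audio);
}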

The Creative Engine provides powerful API calls to style & manipulate blocks in many ways. Here is an example of how we can animate the images with a slow zoom effect:

const imageZoomAnimation = engine.block.createAnimation('crop_zoom');
engine.block.setInAnimation(image, imageZoomAnimation);
engine.block.setDuration(imageZoomAnimation, block.duration);
engine.block.setBool(imageZoomAnimation, 'animation/crop_zoom/fade', false);

Export The Video & Scene
Exporting the video is easy. Just pass the page to the export function. In our example, we’re also saving the scene file so we can edit the video later.

  // Export video
  const progressCallback = (renderedFrames: number, encodedFrames: number, totalFrames: number) => {
    console.log(`Progress: ${Math.round((encodedFrames / totalFrames) * 100)}%`);
  };

  const blob = await engine.block.exportVideo(
    page,
    'video/mp4',
    progressCallback,
    {}
  );

  // Save scene to string
  const sceneData = await engine.scene.saveToString();
  
  // Create scene blob
  const sceneBlob = new Blob([sceneData], {
    type: 'text/plain'
  });
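
To let users preview and download the result in the app, both blobs can be exposed via object URLs (a small usage sketch):

  // Object URLs can be bound to a <video> element or to download links
  const videoUrl = URL.createObjectURL(blob);
  const sceneUrl = URL.createObjectURL(sceneBlob);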

4. Add A Video Editor

The last step is to add the video editor for post-editing and pass it the scene file. With CE.SDK, this boils down to a few lines of code. In the init function, we configure the editor and add callbacks for exporting:

const initEditor = async () => {
        const config = {
          license: process.env.NEXT_PUBLIC_IMG_LY_KEY,
          userId: 'guides-user',
          theme: 'dark',
          baseURL: 'https://cdn.img.ly/packages/imgly/cesdk-js/1.44.0/assets',
          role: 'Creator',
          ui: {
            elements: {
              view: 'default',
              panels: {
              },
              navigation: {
                position: 'top',
                action: {
                  save: true,
                  load: true,
                  close: true,
                  download: true,
                  export: true
                }
              },
              dock: {
                iconSize: 'normal', // 'large' or 'normal'
                hideLabels: true // false or true
              }
            }
          },
          callbacks: {
            onUpload: 'local',
            onSave: (scene: string) => {
              const element = document.createElement('a')
              const base64Data = btoa(unescape(encodeURIComponent(scene)))
              element.setAttribute(
                'href',
                `data:application/octet-stream;base64,${base64Data}`
              )
              element.setAttribute(
                'download',
                `video-${new Date().toISOString()}.scene`
              )
              element.style.display = 'none'
              document.body.appendChild(element)
              element.click()
              document.body.removeChild(element)
            },
            onClose: () => {
              onClose();
            },
            onLoad: 'upload',
            onDownload: 'download',
            onExport: 'download'
          }
        }
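
        // Finally, pass the config to CreativeEditorSDK.create (imported from
        // '@cesdk/cesdk-js') and load the scene saved in step 3. This is a
        // minimal sketch; `sceneString` is assumed to hold that saved scene.
        const cesdk = await CreativeEditorSDK.create('#cesdk_container', config)
        await cesdk.addDefaultAssetSources()
        await cesdk.engine.scene.loadFromString(sceneString)
}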

Conclusion

By following this cookbook, you can streamline the process of AI-generated video creation, making it fast and efficient. This method is especially useful for content creators, educators, and marketers looking to automate video production while maintaining creative control.
Next, try experimenting with video styles, refining AI scripts, or exploring advanced editing.
Feel free to check out the GitHub repo and share your creations with us on X. Happy creating!
