Artificial Intelligence – IMG.LY Blog

How I Built a Short Video Generator with AI & CE.SDK in One Day

Eray — Thu, 09 Jan 2025 11:27:13 GMT

Here’s the crux of product development in the age of LLMs: how much can AI truly accelerate the development process?

We have seen videos of solo developers building small apps entirely with AI with just a few prompts. But how does it scale to more complex development projects? As LLMs rapidly evolve, their scope and impact will only increase.

That’s why I regularly challenge myself to build a small project with the help of AI. I’m a prime candidate to test the AI productivity boost: a jack-of-all-trades (and a master of none) with a background in both design and engineering, yet no hands-on experience in the past five years. My latest challenge? Build a web-based short video generator within one day.

In this post, I’ll share the most intriguing takeaways from tackling this project.

Why a Short Video Generator?

Why focus on this idea? It’s simple: to ride the wave of a new trend. A format called “faceless” short videos is gaining traction among creators on platforms like YouTube and TikTok.

https://www.youtube.com/embed/DfQ3fhqfKVc?feature=oembed

What’s fascinating about these videos is their automation: an LLM generates a script, which is then transformed into speech, images, and text assets using various AI services. These assets are automatically assembled into a cohesive video.

The general concept is compelling: It’s still generative content, but mixed with classic video composition techniques. This approach offers greater accuracy, consistency, and control than pure generative AI.

The potential to automate video production at this scale is exciting. Combined with its relatively low complexity and high production value, that made it the perfect topic for my challenge.

Enter CE.SDK

Another reason I chose this challenge was its compatibility with CE.SDK, our design and video editor library. CE.SDK offers a robust editing toolkit that integrates into any product with just a few lines of code. Its features, like headless mode, are ideal for automating workflows like video generation.

Most faceless video services use React-based video generation and achieve fair results. However, using CE.SDK instead of a React-based library could improve the overall experience in three critical ways:

Editable Outputs: This is huge. Full automation often needs human adjustments for fine-tuning. CE.SDK enables automated video generation while allowing manual refinement of the results.
Enhanced Visual Quality: CE.SDK has its own rendering pipeline, allowing for more nuanced visual effects and animations. When you’re competing against others in this space, it can make a huge difference if you’re able to produce higher fidelity in the visual output.
Visual Design Workflow: Create design components or even entire templates visually, and then use them via code. This authoring workflow can be extremely helpful in creating rich, interesting designs for the generated videos.

The Ground Rules

To keep the challenge focused, I set strict rules:

Time Limit: Spend no more than 12 hours on the challenge.
No Manual Coding: Avoid writing any code yourself—everything should be built through conversations with AI.
Trust the AI: Do not read or analyze code generated by the AI. Rely entirely on its decisions.
Skip External Research: Do not read or explore the APIs you intend to use. Instead, provide links to the AI and let it determine how to use them.
Compare AI Performance: Alternate Claude Sonnet 3.5 and ChatGPT o1 for code generation to evaluate which performs better.

The Tools & Workflow

Code Editor: Cursor
Built on VSCode’s foundation, Cursor stood out as the only editor offering both an integrated chat interface and the ability to switch between different LLMs. However, with GitHub’s recent significant updates to Copilot, I’ll switch to VSCode with Copilot for future challenges.

UI Prototyping: Claude Artifacts
Rather than building the entire project in my code editor, I chose to prototype the UI directly through Claude’s web interface. The benefits were immense:

Instant results: To create an artifact, Claude writes and compiles the code automatically, drawing on common UI libraries and components. This eliminates setup time and technical overhead, allowing me to focus purely on design iterations.
Instant variations: Claude enables rapid prototyping through parallel conversations. When a design direction didn’t quite work, I could simply start a fresh conversation with modified requirements and evaluate a new prototype. This approach helped me develop three viable concepts quickly, a pace that would have been impossible in a traditional code editor.
Quality of execution: Claude transforms rough concepts into polished, intuitive interfaces. Its suggestions often surpassed my initial ideas, offering sophisticated solutions I hadn’t considered.
Keep it clean: By prototyping outside the code editor, I kept the main project’s codebase clean and focused. This separation prevented the accumulation of experimental code and maintained the clarity of our primary development environment.

Quickly prototype your interface with Claude Artifacts.

APIs
Key APIs used in the project included:

Script Generation: Claude Sonnet 3.5 vs various ChatGPT models.
Image/Video Assets: Fal.ai Flux models.
Speech Synthesis: ElevenLabs.

Building the App: Divide and Conquer

After prototyping the UI, I started chatting with the LLM inside the code editor so that it could code the app. To work with the AI efficiently, I followed a divide-and-conquer approach. Rather than simply asking it to “build me a video app,” I broke down the problem into manageable steps:

Generate a video script
Create an AI prompt that includes user input and examples of the desired output format. Pass this prompt to the LLM API.
Parse the script to generate assets (speech, images, text)
Parse the LLM’s response to extract image prompts and speech paragraphs. Send these to their respective APIs.
Compose the final video
Load all the generated assets into a predefined template to generate the finished video through the CE.SDK library.
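The first two steps can be sketched as follows. This is a minimal illustration, not the code the AI generated for the project: the JSON shape, the `Scene` type, and the function names `build_prompt` and `parse_script` are all assumptions made for this example, and the actual calls to the LLM, image, and speech APIs are left out.

```python
import json
from dataclasses import dataclass


@dataclass
class Scene:
    narration: str     # paragraph that would be sent to the speech API
    image_prompt: str  # prompt that would be sent to the image API


def build_prompt(topic: str) -> str:
    """Combine the user input with an example of the desired output format."""
    example = {"scenes": [{"narration": "...", "image_prompt": "..."}]}
    return (
        f"Write a short video script about: {topic}\n"
        "Answer with JSON only, in exactly this shape:\n"
        f"{json.dumps(example)}"
    )


def parse_script(llm_response: str) -> list[Scene]:
    """Extract the scenes from an LLM response that should contain one JSON object."""
    # LLMs often wrap JSON in extra prose, so cut out the outermost braces first.
    start = llm_response.find("{")
    end = llm_response.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in LLM response")
    data = json.loads(llm_response[start : end + 1])
    return [
        Scene(narration=s["narration"].strip(), image_prompt=s["image_prompt"].strip())
        for s in data["scenes"]
    ]


# A hand-written response standing in for the real LLM output.
response = """Here is your script:
{"scenes": [
  {"narration": "Octopuses have three hearts.", "image_prompt": "an octopus, macro photo"},
  {"narration": "Two pump blood to the gills.", "image_prompt": "diagram of octopus gills"}
]}"""

scenes = parse_script(response)
speech_paragraphs = [s.narration for s in scenes]
image_prompts = [s.image_prompt for s in scenes]
```

Keeping the narration and the image prompt together per scene makes the later composition step simpler, because every image already knows which piece of speech it belongs to.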

After completing these steps, I was finally able to generate my first fully automated videos! With a few more tweaks and additions, I had an MVP ready within twelve hours.

The final result: A Short Video Generator

There are still some missing features, partly because I spent a significant amount of time refining the prompt to generate the video script. I also had to bend the rules occasionally—sometimes the LLM would hit a wall, and I had to read or write small snippets of code.

Key Takeaways

Engineering Knowledge Is Essential
You should have some engineering background to achieve the AI productivity boost in development.

AI doesn’t solve everything for you. You are still the architect. You provide a lot of input and guidance. AI often needs to be pointed to the right strategy. Foundational knowledge of computer science is hugely advantageous for working with AI effectively.
As mentioned, I had to read and write a few lines of code myself. Without coding experience, I probably would not have been able to progress at the points where the LLM got stuck.
The getting started experience is nowhere close to novice-friendly. How do you get started with a new project in a code editor that actually requires you to do the setup manually? My workaround was to create an empty project, and then ask the LLM to instruct me on how to use a boilerplate for React. Again, this is engineering knowledge. Any novice would already have hit a wall at this point.

Claude Outperformed ChatGPT
Claude was a clear winner in the side-by-side comparison, for three reasons:

Claude Artifacts was a game changer for UI prototyping.
It was generally better at writing and understanding code. This is difficult to quantify, but in some cases Claude fixed the mess ChatGPT left in the code.
Claude can process URLs, which makes working with APIs much smoother.

Who would have thought new LLMs would catch up to OpenAI so quickly after they released the first version of ChatGPT?

Complexity Slows AI
The more code in my project, the slower the overall progress. LLMs struggled with the growing complexity. Their context windows filled more quickly, and their responses became increasingly unreliable. At some point, it becomes extremely difficult to make architectural changes, especially if this affects multiple parts of the app. When trying to fix errors, you’ll often find yourself in a whack-a-mole game. While the AI would resolve one issue, it would inadvertently introduce new problems elsewhere, creating an endless loop of fixes and regressions.

Ultimately, the time invested in this challenge was well worth it. While LLMs can’t build products end to end on their own, they can significantly streamline product development when paired with the right human collaboration. The real question is whether development teams are ready to adapt their habits and explore new workflows to boost productivity.

Next Steps

This challenge has inspired me to refine and expand on this project. Future iterations will focus on harnessing CE.SDK’s unique features to push the boundaries of automated video generation.

Stay tuned for part two of this series—there’s much more to explore!

UPDATE: Read part two - a cookbook on how to build your own short video creator!

Inpainting: Removing Distracting Objects in High-Resolution Images

Vivien — Tue, 08 Dec 2020 15:03:28 GMT

Introduction

You may know this situation: You are out on a trip when suddenly a unique opportunity for a photograph appears, like a wild animal showing up or sun rays breaking through the rain clouds for a few seconds. Without hesitation, you grab your camera and capture the sight. Later you discover that a distracting object, like a road sign, is ruining your shot. Time for some cumbersome retouching.

Now, imagine you could erase the distracting object just by highlighting it. Wonderful! From the field of deep learning, a technique for image manipulation called Image Inpainting makes it possible. Image Inpainting aims to cut out undesired parts of an image and fill in the missing information with plausible patterns, colors, and textures that match the surroundings.

Today we would like to share experiences that we have gained during the application of deep learning inpainting approaches. Furthermore, we’ll present some quality optimization steps that we have implemented to improve results addressing the transformation to high-resolution outputs. But let us start with a quick introduction: who are we and why are we concerned with these kinds of topics?

We are a small consortium consisting of the Bochumer Institute of Technology, a research institute aiming to transfer knowledge from academia into industry, and the company IMG.LY, a team of software engineers and designers developing creative tools like the PhotoEditor SDK and the UBQ engine. Together we are working in the EFRE.NRW funded research project KI Design that targets artificial intelligence (AI) and deep learning-based algorithms for image content analysis and modification, as well as a leveraging tool kit for aesthetic improvements.

Image Inpainting has been a viable technique in image processing for quite some time, even before “Artificial Intelligence” was on everyone’s lips. Common to most inpainting algorithms is that the area of the image to be corrected is highlighted. Many conventional algorithms then analyze the statistical distribution to fill the resulting gap by finding and using nearest neighbor patches. The best-known approach of this kind is the PatchMatch algorithm. It uses a fast, structured randomized search to identify the approximate nearest neighbor patches that will fill in the respective part of the image.

However, there are two drawbacks: first, regardless of the approximation, performance still might be an issue, and second, the results suffer from a lack of semantic understanding of the scene. Research therefore turned to AI- and neural network-based approaches to solve these issues.

For us, the removal of annoying background content is a useful feature, as it improves the overall image aesthetic. Having this available on mobile devices would be particularly interesting. Due to performance limitations and ever-improving integrated cameras, a mobile solution requires a fast and lightweight model architecture as well as the ability to process high-resolution images.

In summary, our expectations for an AI-based inpainting algorithm are:

removal of (manually highlighted) background objects/persons
feasibility to process high-resolution images
fast and lightweight network (applicable for smartphones)

The number of publications addressing this or similar requirements has increased enormously in recent years. After digging into the literature, we identified two promising approaches and tested them.

Testing and Comparing Model Architectures

These selected networks were based on the latest scientific findings and appeared to provide high-quality output. Both approaches have well-documented repositories – a special thank you to the authors for their great work (of repositories and papers as well)! The selected networks are:

Partial Convolutional Neural Networks (PCONV); [paper, github-repository]
Generative Multi-column Convolutional Neural Networks (GMCNN); [paper, github-repository]

You may be wondering why exactly we chose these models for comparison purposes, as “the latest scientific findings” sounds a bit vague. Indeed, it is considerably difficult to identify the best fitting model architecture. As far as we know, there is no standardized validation method or data set. Most papers demonstrate their results on self-selected test images and compare them with approaches that are again self-selected. The only option we had was to evaluate models that seemed reasonable to us. A validation method or standardized test set could be a valuable scientific contribution here. Let’s turn back to the selected models.

The PCONV network uses multiple convolutional layers and adds a partial convolutional layer. The key feature is that the convolution does not consider invalid pixels, which are indicated by a mask that is updated from layer to layer. This prevents the algorithm from picking up the color of the mask (typically the average color tone of the image) and transmitting it into the reconstruction process.
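To make the idea of a partial convolution more tangible, here is a minimal single-channel sketch on plain nested lists. It follows the rule as we understand it from the PCONV paper: only valid pixels contribute, the result is rescaled by the share of valid pixels, and the mask is updated. It is not taken from the repository. The function name `partial_conv2d` is made up for this example, and note that here the mask holds 1 for valid pixels and 0 for holes.

```python
def partial_conv2d(image, mask, kernel, bias=0.0):
    """Partial convolution without padding (stride 1) on 2D lists.

    mask holds 1 for valid pixels and 0 for holes.
    Returns (output, updated_mask).
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    new_mask = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            valid = sum(mask[y + i][x + j] for i in range(k) for j in range(k))
            if valid == 0:
                continue  # window contains only holes: output stays 0, mask stays 0
            acc = sum(
                kernel[i][j] * image[y + i][x + j] * mask[y + i][x + j]
                for i in range(k)
                for j in range(k)
            )
            # Rescale to compensate for the pixels that were ignored.
            output[y][x] = acc * (k * k / valid) + bias
            new_mask[y][x] = 1  # at least one valid pixel was seen
    return output, new_mask


# A flat gray patch (value 5) with a hole in the middle; 999 is the placeholder
# color stored in the hole, which must not leak into the result.
image = [[5.0, 5.0, 5.0], [5.0, 999.0, 5.0], [5.0, 5.0, 5.0]]
mask = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
mean_kernel = [[1 / 9] * 3 for _ in range(3)]

output, updated_mask = partial_conv2d(image, mask, mean_kernel)
```

With an averaging kernel, the output at the hole is approximately 5.0, the value of the surrounding valid pixels, and the updated mask marks the position as valid for the next layer.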

The GMCNN, a GAN-based model, has a special architecture consisting of three networks: a generator, split up into three branches addressing different feature levels; a local and global discriminator; and a VGG19 net calculating the implicit diversified Markov random field (ID-MRF) introduced in the paper. This ID-MRF serves as a loss term comparing generated content with nearest-neighbor patches of the ground truth image. While the interaction of all three networks is required in the training phase, only the generative network is used for testing and production. More details and figures regarding the model architecture are available in the official paper.

Due to the lack of standardized sets, we created our own test sets addressing different levels of complexity. This also included image data requiring an understanding of semantic structures. In our comparison, we paid special attention to whether the filled content was harmonious and whether artifacts were kept to a minimum. Image artifacts in particular could cause problems when the result is later transferred to high resolution. Here is an example output of our tests:

In comparison, the inpainting result based on PCONV suffered from some blurred artifacts and erratically deviating shades, cf. Figure 1C, whereas the GMCNN-based result appeared to be more precise and plausible concerning the semantic context, cf. Figure 1D. You can see this clearly when you look at the grille door that was covered by a person. The GMCNN approach, cf. Figure 1D, recognized and respected the grid structure, while the PCONV overlaid it with a uniform (black) color tone. In consideration of all test data results, we decided to follow up with the GMCNN.

However, we would like to emphasize that this does not mean that one model architecture is better suited for Image Inpainting than the other. With further training or different test sets, the PCONV weights we used may achieve similar results.

What About High-Resolution Inpainting?

At the current state of the model, the processing of high-resolution images remains unaddressed. Out of all the papers and repositories we found, even those promising high resolution often targeted image sizes of at most 1024x1024 pixels. We expected a resolution of substantially more than 2000x2000 pixels. One reason for this gap seemed to be the hardware-demanding and time-consuming training phase when processing high-resolution images.
Furthermore, applying a high-resolution inpainting model could entail performance issues. These are not negligible for us, as we are aiming at an implementation on smartphones, which can’t keep up with the power of a modern graphics card. Thus, an additional challenge is a high-quality transformation of low-resolution outputs to high resolution.

Apply Low-Resolution Inpainting Output to High-Resolution Images

The GMCNN model was trained with the Places dataset, formatted in a 512x680 resolution. Feeding in high-resolution images would exceed the training input size by far and further require information of feature dimensions that the model has never seen before. That could result in almost completely distorted reconstructions.

A straightforward solution is to downscale the high-resolution image before feeding it to the model and then resize the result back up to the original image size. Because of the upscaling (e.g., via bicubic interpolation), image details lose quality. A better approach is therefore to take only the masked areas of the upscaled inpainting prediction and stitch them back into the original image. That preserves the initially known details in the unmasked regions. For the inpainted regions that remain, the lack of image detail as well as the artifacts and distortions pose a complex challenge that we aimed to overcome with the following approaches.
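The stitching idea can be illustrated with a few lines of dependency-free Python on nested lists. This is a simplified sketch, not our production code: `upscale_nearest` uses nearest-neighbor repetition as a stand-in for bicubic interpolation, the low-resolution prediction is hand-written instead of coming from the model, and the mask holds 1 where content was removed.

```python
def upscale_nearest(image, factor):
    """Enlarge a 2D list by repeating every pixel `factor` times in both directions."""
    return [
        [px for px in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def stitch(original, prediction, mask):
    """Take predicted pixels where mask is 1 and original pixels everywhere else."""
    return [
        [p if m else o for o, p, m in zip(o_row, p_row, m_row)]
        for o_row, p_row, m_row in zip(original, prediction, mask)
    ]


# "High-resolution" original (2x4) whose right half is masked out.
original = [[10, 11, 12, 13], [14, 15, 16, 17]]
mask = [[0, 0, 1, 1], [0, 0, 1, 1]]

# Stand-in for the model output at half the resolution (1x2).
low_res_prediction = [[50, 60]]

upscaled = upscale_nearest(low_res_prediction, 2)
result = stitch(original, upscaled, mask)
# result == [[10, 11, 60, 60], [14, 15, 60, 60]]
```

The unmasked left half keeps its original values (10, 11, 14, 15) untouched, while only the masked right half is filled with the coarser upscaled prediction.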

Shrinking-Mask-Approach

While the inpainted area mostly yields realistic-looking content near its margins, the quality decreases strongly towards the center, cf. Figure 3D. We noticed this behavior especially for larger masks. Consequently, a recursive inpainting procedure with an iteratively shrinking mask, cf. Figure 3E, seems to be a reasonable approach. With this concept, we try to improve the inpainting results progressively from the boundary to the center of the masked regions while utilizing the information generated in the preceding recursion, cf. Figure 3F.

To us, it was essential to have a dynamic method that handles all mask shapes and sizes. Therefore, we decided to apply an erosion kernel to the original mask recursively until it is fully eroded. The number of shrunk masks determines the number of inpainting passes performed by the network.
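The mask sequence described above could be produced like this. The sketch assumes a 3x3 erosion kernel and treats pixels outside the image as unmasked. Both are choices made for this example, and a real implementation would more likely use an image library. In the mask, 1 marks a pixel to inpaint.

```python
def erode(mask):
    """Erode a binary mask with a 3x3 kernel.

    A pixel stays set only if it and all 8 neighbors are set; pixels outside
    the image count as unset (an assumption of this sketch).
    """
    h, w = len(mask), len(mask[0])

    def is_set(y, x):
        return 0 <= y < h and 0 <= x < w and mask[y][x] == 1

    return [
        [
            int(all(is_set(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
            for x in range(w)
        ]
        for y in range(h)
    ]


def shrinking_masks(mask):
    """Return the original mask followed by its erosions, until nothing is left."""
    masks = []
    current = mask
    while any(any(row) for row in current):
        masks.append(current)
        current = erode(current)
    return masks


# 7x7 image with a 5x5 masked block in the middle.
mask = [[1 if 1 <= y <= 5 and 1 <= x <= 5 else 0 for x in range(7)] for y in range(7)]
masks = shrinking_masks(mask)
sizes = [sum(map(sum, m)) for m in masks]  # masked pixels per pass

# One inpainting pass per mask, each pass reusing the previous result:
# for m in masks:
#     image = inpaint(image, m)   # `inpaint` is a hypothetical model call
```

For the 5x5 block, the masks shrink to 3x3 and then to a single pixel, so `sizes` is `[25, 9, 1]` and the network would run three times.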

Two-Step-Approach

While investigating and testing various quality optimization steps, we also fed high-resolution images into our model and discovered that the results for smaller masks were convincing. That led us to the hypothesis that not the resolution but rather the number of pixels to reconstruct seems to be the limiting factor. This finding served as the basis for our two-step-approach.

Briefly, the approach works as follows. In the first step, we perform inpainting on a downscaled high-resolution image while applying the original mask. In a second step, we transfer the model output of step one into a higher resolution and perform inpainting again. This time we apply a modified mask containing only small coherent mask regions, for which we exploit the context information provided by the higher resolution.
In more detail, the first step corresponds to the baseline approach, cf. Figure 5: We scale the masked image down to the training resolution of 512x680 pixels and fill in the missing information.

Optionally, the shrinking-mask-approach can be applied in the first step.

In the second step, we upscale the output of step 1 to 1024x1360 pixels, doubling each side and thereby quadrupling the pixel count. To prevent the resolution loss that this upscaling causes in unmasked regions, we stitch the generated content into the input image downscaled to the same size. The resulting image serves as the model input for step 2.

To avoid or reduce image artifacts in the subsequent inpainting process, we modify the original mask to contain only the small mask regions and the boundaries of the large mask regions. In detail, we temporarily shrink the mask with an erosion kernel to remove small mask segments and the marginal areas of larger mask sections, cf. Figure 6B. Finally, we calculate the difference between the original mask and the eroded mask, resulting in our desired modified mask, cf. Figure 6C.
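The mask modification boils down to "original mask minus eroded mask". A minimal sketch on nested lists, assuming a single pass of a 3x3 erosion kernel (the kernel size and number of passes are assumptions of this example; 1 marks a pixel to inpaint):

```python
def erode(mask):
    """3x3 erosion: a pixel stays set only if it and all 8 neighbors are set."""
    h, w = len(mask), len(mask[0])

    def is_set(y, x):
        return 0 <= y < h and 0 <= x < w and mask[y][x] == 1

    return [
        [
            int(all(is_set(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
            for x in range(w)
        ]
        for y in range(h)
    ]


def modified_mask(mask):
    """Keep small regions entirely and only the boundary ring of large regions."""
    eroded = erode(mask)
    return [
        [m - e for m, e in zip(m_row, e_row)]  # eroded is a subset of mask, so 0 or 1
        for m_row, e_row in zip(mask, eroded)
    ]


# 7x10 mask with a large 5x5 region (left) and a small 2x2 region (right).
mask = [[0] * 10 for _ in range(7)]
for y in range(1, 6):
    for x in range(1, 6):
        mask[y][x] = 1
for y in range(2, 4):
    for x in range(7, 9):
        mask[y][x] = 1

step2_mask = modified_mask(mask)
```

In this example the small 2x2 region vanishes during erosion and therefore survives completely in `step2_mask`, while the 5x5 region loses its 3x3 core and keeps only its one-pixel boundary ring.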

By re-inpainting, we double the resolution of the generated content for the small contiguous mask regions, cf. Figure 7A bottom right, as well as for the masked boundary areas, cf. Figure 7A upper left. Moreover, through the latter, we achieve smoothing of the intense decay in resolution between the unmasked regions and the generated content arising from step 1. Finally, we scale our image back to the original input resolution and stitch the generated content to the original image to maintain the original resolution for unmasked areas.

Conclusion

For us, it was impressive to see how successfully and how deceptively realistically AI-based inpainting can fill in missing information. Compared to conventional approaches, the advantage lies not only in the consideration of structural (semantic) content but especially in the lower hardware requirements. In our view, this opens up the opportunity to reach a much larger group of users of inpainting algorithms: in place of using powerful hardware and professional software, mobile devices could achieve small but decisive changes.

In summary, we have dealt with the application of high-resolution images, which is undoubtedly gaining in importance due to the ever-improving smartphone cameras. Processing high-resolution images entails an increasing number of pixels to “inpaint” and could further lead to quality as well as performance issues. Thus, we decided to improve the output of low-resolution networks and to provide them with more information to support a subsequent upscaling procedure.
We have implemented two different approaches, the shrinking-mask approach and the two-step approach, which can be applied independently or in combination. It turned out that both methods subjectively increased the image quality. However, this comes with higher computational demands, as models are applied multiple times.
Overall, we think that the combination of these two approaches will represent a good toolkit for AI-based high-resolution image inpainting. But we’ll keep an eye on the upcoming scientific developments.

Smart Cropping - Automatically crop images to optimal regions with deep neural networks

Vivien — Fri, 24 Jul 2020 11:48:34 GMT

Pictures are omnipresent on the social web. It is common to instantly post photos of all kinds of events to share with friends and followers. Also, businesses want to show presence on social networks and employ designated social media managers to represent the company and to communicate to customers.

Let’s assume you work as a social media manager in a company. Your job is to communicate with customers and represent your company on various social media platforms. One part of your job is to share pictures of your company’s work. Since you’re serving multiple social media platforms, you always have to consider their specific aspect ratio requirements for images. One platform wants you to provide square photos, whereas another one asks for pictures in a wide landscape format.

You are a busy person: you don’t want to waste time cropping hundreds of images into the proper format, but you also don’t want to crop your pictures weirdly.

Suppose you want a portrait-oriented version of the following image. Simply choosing the center would lead to an odd picture containing only one half of the bird. What you want is the image to include the region of interest; here, probably the whole bird in the center of the image.

But how can we automatically find such image regions?

When humans look at images, they intuitively focus on significant elements of the photos first. If you look at the following pictures, …

… you will probably notice that you first focus on the salient parts of the image (maybe the geyser or the sundown for the first image, and the reindeer on the road for the second image).

As it turns out, it is possible to train neural networks to predict such salient regions. A prediction of such a network is called a saliency map. It basically is a grayscale image of the same size as the picture. Each pixel intensity encodes the degree of saliency. These saliency maps allow us to find the best image region for a given aspect ratio.

But how can networks be trained to predict salient regions in a picture? And how do we, given the salience information, crop an image to an optimal region?

Fortunately, there was already a considerable amount of research regarding saliency prediction. Basically, there are two main approaches: attention-based saliency prediction and segmentation-based saliency prediction. The first group focuses on predicting the center points of human attention regardless of object segmentation and boundaries, whereas the latter considers the most salient objects as a whole.

For our application, it seemed more suitable to choose an attention-based approach. We decided to go with an LSTM-based model.

Briefly summarized, the approach works as follows: A deep convolutional neural network (CNN), pre-trained on image classification, acts as a feature extractor. The value of some intermediate layer (or hidden layer) is forwarded to the recurrent LSTM that further improves the prediction. The saliency map then is the output of the LSTM, combined with the Gaussian priors.

In particular, a dilated convolutional network, in our case, a modified RESNET50 already pre-trained on the SALICON dataset, is deployed for feature extraction. The original paper for this method used a network for image classification. Many CNNs can act as feature detectors, but they don’t perform equally well. For example, compared to the standard convolutional feature extraction networks, the dilated networks prevent the harmful effects of image rescaling on the saliency prediction. The extracted feature maps are then fed into an attentive convolutional LSTM (recurrent neural network). This iteratively improves the saliency prediction on the obtained feature maps. Finally, multiple trainable (isotropic) Gaussian priors are added to take the bias of human attention into account, since humans tend to focus on the image center.

We trained the network on the SALICON dataset, which includes 20,000 images from Microsoft COCO and 15,000 corresponding saliency maps. The saliency maps were generated by empirical studies modeling human eye fixation by mouse movements. We optimized our network with a composed loss function considering the Pearson Correlation Coefficient and the Kullback-Leibler divergence, representing standard saliency prediction loss measures.
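For readers unfamiliar with these two measures, here is a minimal sketch of how such a composed loss could look on flattened saliency maps. The equal weighting and the small `eps` constant are assumptions made for this example, not the values from our training setup, and a real training loop would compute this with tensors in the deep learning framework.

```python
import math


def pearson_cc(pred, target):
    """Pearson correlation between two flattened saliency maps (1.0 = perfect)."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    return cov / denom if denom > 0 else 0.0


def kl_divergence(pred, target, eps=1e-7):
    """KL divergence between target and prediction, both normalized to sum to 1."""
    sum_p, sum_t = sum(pred), sum(target)
    total = 0.0
    for p, t in zip(pred, target):
        p, t = p / sum_p, t / sum_t
        total += t * math.log(eps + t / (p + eps))
    return total


def saliency_loss(pred, target, kl_weight=1.0, cc_weight=1.0):
    """Lower is better: penalize divergence, reward correlation."""
    return kl_weight * kl_divergence(pred, target) - cc_weight * pearson_cc(pred, target)


target = [0.0, 0.1, 0.8, 0.1]   # ground-truth saliency (flattened)
good = [0.05, 0.15, 0.7, 0.1]   # prediction with the peak in the right place
bad = [0.7, 0.1, 0.1, 0.1]      # prediction with the peak in the wrong place

loss_good = saliency_loss(good, target)
loss_bad = saliency_loss(bad, target)
```

The correlation term enters with a negative sign because a higher correlation is better, whereas a higher divergence is worse. As expected, `loss_good` comes out lower than `loss_bad`.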

With this approach, we could already predict pleasing saliency maps suitable for smart image cropping. Unfortunately, our first successful model took up far too much memory and was therefore useless for practical applications. We had to compress the model to a suitable size while keeping the smart cropping performance as high as possible. After a series of unsatisfactory trials, we found that instead of the RESNET50, we could simply deploy the much smaller MobileNet that ships with Keras as our feature extraction model.

This model option indeed provides less precise results for the saliency map prediction. However, it is still suitable for smart image cropping, since we only need to know the position of the focus points roughly. Not only did this model variation save a lot of memory, it also reduced the runtime of our model significantly, which was our goal. This way, we created a model suitable for practical applications that is supposed to take over the inconvenient manual cropping process.

Once we have the saliency map, the smart crop can be determined quite easily. First, we compute the edge length of a window covering the given image as much as possible while fulfilling the required aspect ratio. Afterward, we slide this window over the predicted saliency map and determine the position that maximizes the covered saliency density. Now we only need to crop the image based on the optimal window position. Thus we obtain our smart cropped image suiting the required aspect ratio. The method is inspired by this paper.
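The window search can be sketched in plain Python as follows. The function names and the tiny hand-made saliency map are illustrative assumptions. Because the window size is fixed, maximizing the saliency sum inside the window is equivalent to maximizing the covered density, and a summed-area table keeps each window evaluation cheap.

```python
def largest_window(width, height, aspect_ratio):
    """Largest (w, h) with w / h close to aspect_ratio that fits into the image."""
    if width / height > aspect_ratio:
        return min(width, round(height * aspect_ratio)), height
    return width, min(height, round(width / aspect_ratio))


def smart_crop(saliency, aspect_ratio):
    """Return (left, top, width, height) of the window covering the most saliency."""
    height, width = len(saliency), len(saliency[0])
    win_w, win_h = largest_window(width, height, aspect_ratio)

    # Summed-area table: sat[y][x] is the sum of saliency[0:y][0:x].
    sat = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            sat[y + 1][x + 1] = saliency[y][x] + sat[y][x + 1] + sat[y + 1][x] - sat[y][x]

    best_pos, best_sum = (0, 0), -1.0
    for top in range(height - win_h + 1):
        for left in range(width - win_w + 1):
            total = (
                sat[top + win_h][left + win_w]
                - sat[top][left + win_w]
                - sat[top + win_h][left]
                + sat[top][left]
            )
            if total > best_sum:
                best_pos, best_sum = (left, top), total
    return best_pos[0], best_pos[1], win_w, win_h


# 8x4 landscape "image" whose salient object sits near the right edge.
saliency = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 2, 3, 1],
    [0, 0, 0, 0, 0, 2, 3, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

crop = smart_crop(saliency, aspect_ratio=1.0)  # square crop
# crop == (4, 0, 4, 4)
```

A plain center crop would start at column 2 and cut the object in half. The saliency-based search instead moves the square window to columns 4 to 7, where it covers the whole object.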

To sum up, using saliency prediction and maximization, smart cropping enables us to find the best image regions for any aspect ratio. This technique, which we’re currently building into our UBQ engine, reduces the user’s burden to manually crop images into the required aspect ratio.

This project was funded by the European Regional Development Fund (ERDF).

When Creativity meets A.I.

Eray — Thu, 16 Nov 2017 23:00:00 GMT

A new generation of A.I. algorithms, propelled by rising computational power, new hardware, and a shift in paradigms, made its first notable impact in the creative world: The works of Gatys et al. and Krizhevsky et al. have not only gathered considerable public attention but have helped apps like Prisma to be adopted and used by millions. I strongly believe that this is merely the beginning. **With the help of machine learning, we will fine-tune, simplify, and automate creative processes and ultimately empower new techniques for design and content creation.**

We’ve been following this topic for quite some time now and have spent considerable effort in researching the opportunities of deep learning for our PhotoEditorSDK. After more than a year of research and development, today, we’re finally bringing one of our apps to beta. **Portrait** combines supervised deep learning with the visual power of our SDK. In a nutshell, Portrait makes creating beautifully designed portrait images as easy as taking a selfie. You turn your selfies into movie poster-like portraits, with styles ranging from double-exposure photography to stencil art. One may consider it as the next iteration of what Apple and Google recently brought to market with their new camera features.

We’ve now come a long way and gained invaluable insights on our journey so far. Not only did we get our hands dirty with countless training sessions and refinements to the neural net, but our first-hand experience also helped us set expectations right and separate hype from substance. Most notably, it changed our product shaping process, making it more important than ever to foster strong ties between the product stakeholders and to share a common vision and goal everybody can get behind.

In the following I’d like to share the story of how we built the app and closed the gaps between roles of the stakeholders within this process.

Preface: Before Neural Networks were the Hot New Thing

My journey begins over ten years ago, while I was graduating in neuroscience. Back then, the idea of A.I. was just a vague promise. Artificial Neural Networks were too small, computers lacked the necessary power, and the results were certainly nice, but still too weak to compete with traditional algorithms. Research felt stuck in tiny specializations without really following a broader vision. Discouraged by their impracticability, I slowly lost interest in Neural Networks.

It took research on Neural Networks another six years to get back on my radar. At that time, I was leading several product developments at 9elements. When I learned about the work of DeepMind (now Google) I had a genuine feeling that this time, A.I. was ready for the limelight.

As we were in the course of building the PhotoEditorSDK, a library for image editing and computer vision, we realized how much neural nets could also affect the creative space, given their ability to abstract and formalize rules. What if there was a machine that could reproduce the common and dull tasks you have to do as an art director within a second? What if designers could get rid of repetitive and tedious activities that interrupt their creative flow?

But this topic isn’t something you’d learn in a week, obviously. Still, innovations cannot happen if you’re not willing to take a risk, so we decided to invest considerable time and resources into this technology.

From a product management perspective, this process is actually an anti-pattern: Usually, you wouldn’t want to start by finding the right purpose for a technology; instead, you’d find the right technology for a purpose. I still believe that this is essentially the right approach, but sometimes you have to abandon your best practices and take a swim in uncharted waters. Consequently, we asked Malte, one of our iOS engineers, to spearhead our research and take a deep dive into this topic. We decided to start off with image segmentation as the first process that we wanted to optimize through machine learning. Masking and clipping can sometimes be tedious tasks, and ultimately we wanted to reduce a process that can take several minutes to a single click.

Chapter 1: The Machine Engineer

Malte, who is a diligent engineer and (how convenient) a passionate photographer, started investigating some approaches that focused on image segmentation. You can read more about his journey in his article. Although he experimented with various neural networks and post-processing techniques, the resulting masks sometimes lacked the desired accuracy and wouldn’t have matched a user’s expectations. This was a first, if expected, insight. As we want to deliver ready-to-use products that don’t need any complex tweaking to our customers, this was something we had to fix. Our problems originated mostly from our rather ambitious goal to segment any type of object within an image. This would have required training on vast amounts of data and scaling up the number of filters in our network. However, due to our on-device constraint, this would have killed our carefully crafted performance.

Therefore, we shifted from this generalist approach to a specialized network for images of a certain domain that the model can be applied to. In hindsight, this seems quite obvious, as our rather small model would have never been able to cope with the amount of variations existing in ‘the real world’ anyway. So, we went back to the drawing board and started discussing which domain to focus on. That’s where we got stuck; we struggled to find an obvious trend in our customers’ use cases or known photography platforms.

It was actually during his summer holiday that Malte had his flash of genius. At a stop-over in Singapore, he noticed how the city was flooded with selfie-stick-wielding tourists. The sheer amount of selfies taken at any public place in Singapore left him astonished, and he realised that he had just found the right domain. Selfies, and portraits in general, felt like an infinite data source and a prime use case for our image segmentation algorithm. Back home, we decided to focus on selfies and portrait-like photography.

Malte started searching for portrait datasets and found a collection of roughly 2000 portrait images collected from Flickr. Those were a great starting point and after a few training runs, he already reached satisfactory results, as the model was now capable of capturing all available variations. At that point, we had a system at our hands that was able to segment portrait or selfie images in real-time on the device you’re capturing them with. This seemed like a great opportunity, but we didn’t want to stop just there. Releasing a prototype that can free a selfie from its background is nice, but doesn’t feel like something that would truly showcase how AI can make a difference in our creative process.

Chapter 2: The Art Director

This is where our Art Director Tommi, a renowned graphic artist and former sprayer, stepped in to explore what can be done with **a selfie, an accurate alpha mask and the image editing features from our PhotoEditor SDK**. When Tommi took the lead, I asked him to draft a vision, a creative direction for our app that combines all the tools and possibilities at our disposal.

Together, we started exploring portrait trends and unique imagery that would help us find a direction for our showcase. Soon, the walls of our meeting rooms and offices were plastered with inspirational works on portrait photography of all different kinds and styles. This visual catalogue kept inspiring us, although we weren’t sure which style to settle on in the end. It was when we could hardly find any more free spots on our walls, and after looking at them countless times, that the idea struck:

What if we could enable users to turn their portraits into what we saw on these walls? And this without the design expertise that would normally be required to do so.

Instead of brooding over a completely new form of portraits, we could take all these styles and instantly realize them with the technology we had. From that point on, we flipped our process upside down. Instead of thinking about what the technology is capable of, or identifying a problem worth solving, we aimed for the creative output that we wanted our app to produce. While we started our venture with a technology, we now had visual results that we could work towards. The main question shifted from “What is our technology capable of?” to “How can we achieve this visual output with our technology?”

Tommi designed five lead graphics, so our team of engineers and designers could grasp what we ultimately wanted to achieve, using only a selfie and the features of our SDK.

Chapter 3: Closing ranks

With such a clear vision for our app, we started separating the wheat from the chaff, categorizing the portraits and understanding which operations of our SDK we had to combine, assemble and enhance to create these visuals.

What followed was a remarkable interplay across multiple stakeholders of our team. While we were always very vocal proponents of building strong relationships between product stakeholders, **the introduction of the AI layer actually glued our team further together**.

Our designers started to embrace the engineering perspective, playfully identifying both opportunities and constraints through the tech layer. At the same time, our engineers embraced the design vision and formalized it into code. Let me give you some examples:

While thinking of the UI, we understood that the transformation of a selfie into a graphical artwork required immediate feedback for the user, so they can find a pose that works best with the respective artwork. Consequently, we optimized our networks for real-time processing, a true challenge that needed strong expertise in both iOS engineering and neural net architecture.

Our designs and recipes in turn had to be tweaked to gracefully allow for errors of our AI, because an error rate of 3% can still produce undesired artefacts and mask inaccuracies. We did that by using techniques that beautifully fringed edges of the portrait.

Altogether, the close cooperation, countless meetings and feedback loops, and the continuous fine-tuning of the code and underlying recipes are what brought us here.

All of this wouldn’t have happened if we hadn’t taken the risk to invest in a rising technology in the first place. And all of this wouldn’t have been possible if it wasn’t for the exemplary cooperation between all the stakeholders. Portrait is a showcase of how technology can inspire and tie a team together. This, in the end, is absolutely necessary if we want to achieve the leaps we expect with AI. If you want to impact the creative space by introducing an AI layer to it, your engineers have to think like designers, or at least deeply understand their work.

The Road Ahead

Portrait is a first showcase and one step of many in our venture to wire several AI aspects deep into our SDK. On our journey, we’ve identified many more opportunities to make creative work and design more accessible to broader audiences. Of course, we will also improve our models and networks with better and more data, always keeping in mind the aesthetic and visual output we’d like to achieve. We’ll keep you posted on our updates and next ventures into this exciting new era.

If you liked what you read, I’d encourage you to check out Portrait and our PhotoEditor SDK!

Thanks to my co-authors Malte & Felix!

Thanks for reading! To stay in the loop, subscribe to our Newsletter.

Deep Learning for Photo Editing

Malte — Thu, 20 Apr 2017 00:00:00 GMT

Deep learning, a subfield of machine learning, has become one of the best-known areas in the ongoing AI hype. Having led to many important publications and impressive results, it is applied to dozens of different scenarios and has already yielded interesting results like human-like speech generation, high accuracy object detection, advanced machine translation, super resolution and many more.

There is a steady flow of papers and publications that describe the latest advances in network design, compare existing architectures or describe unseen approaches leading to even better results than the current state-of-the-art. At the same time more and more companies and developers jump on the deep learning bandwagon and deploy the ideas and architectures to real world production systems.

This article describes our approach to applying deep learning to our image editing product, the struggle we had with finding the right architecture and the experiences we made while developing a system that can be deployed to mobile devices.

Our vision

At 9elements, we’ve had various AI topics on our radar for quite some time now. With deep learning, we finally found a tremendous opportunity for our product, the PhotoEditor SDK: We believe AI-based algorithms could be the ideal approach to boost our users’ creative output and simplify complex design tasks.

Given the hype and results, we decided to dip our toes into deep learning, which quickly led to some research regarding the most common challenges in interactive image editing. Image segmentation soon surfaced as a major challenge that could be solved using deep learning, and we started investigating further.

If you have ever tried to select a distinctive region in a picture, say your best friend on the beach or your cute pet, you know the struggle of carefully moving your cursor along the object’s outer bounds until you eventually miss a part or accidentally select something that doesn’t belong to the object. Professional image editing tools can be quite helpful in accomplishing such tasks, but on the one hand, they aren’t available on your mobile device, where you take and publish the images, and on the other hand, can be quite expensive and usually require some hands-on time, before you can produce anything usable.

Our goal was to finally remove the hassle from image clipping. We wanted to reduce the required user interaction to a minimum while offering an intuitive solution that doesn’t require any manuals or online courses. On top of that, as we provide native SDKs for web, iOS, and Android, the solution had to be deployable to all of these systems without relying on a powerful backend or being limited to certain features.

Having formulated our rather ambitious goals, we started our journey by looking into the most common research papers and classic techniques for image segmentation. We then focused on the deep learning part and quickly had an idea on how to design our approach.

Our journey

Image segmentation, the process of classifying each pixel in a picture as either foreground or background, is a popular research field and still perceived as quite challenging due to the complicated nature of the task. We humans are extremely well trained at perceiving scenes, identifying objects and making logical assumptions based on the visual input we receive.

For a long time, all approaches were based on colors, edges, and contrast and relied heavily on fine-tuned parameters, which had to be adjusted to every new scenario. That changed in 2012 when Krizhevsky et al. presented astonishing object classification results on the ImageNet benchmark using a neural network. Suddenly a system was able to classify objects with unprecedented accuracy and no need for any human fine-tuning. The neural network was ‘just’ trained on the dataset by seeing images combined with their corresponding labels and adjusting its internal representation until it couldn’t learn any further.

As we had already decided on using deep learning for our task, using a neural net clearly was our way to go. We started by examining the existing solutions and approaches, created our first prototype based on our findings and refined our approach and implementation until everything met our expectations.

Scene Labeling

The first approaches we examined focused on segmenting the whole image. This is a common task called scene labeling or semantic labeling, because it allows robots and other systems to understand a scene. The goal is to classify each pixel in an image to a particular object category. An example could be a self-driving car that searches for the road and tries to determine whether any pedestrians are crossing the street. Such a car would try to classify each pixel as road, pedestrian, tree, traffic sign, etc.
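As a rough illustration of per-pixel classification, the sketch below picks the highest-scoring category for every pixel. The class list, the score layout and the numbers are made up for the example; a real network would produce one score map per class at image resolution.

```python
# Hypothetical class list; the scores below are invented for illustration.
CLASSES = ["road", "pedestrian", "tree", "traffic sign"]


def label_pixels(scores):
    """Return a label map by picking the highest-scoring class per pixel.

    scores[y][x] is a list with one score per entry in CLASSES.
    """
    return [
        [CLASSES[max(range(len(CLASSES)), key=pixel.__getitem__)] for pixel in row]
        for row in scores
    ]


# A tiny 2x2 "image" with made-up scores.
scores = [
    [[0.9, 0.0, 0.1, 0.0], [0.2, 0.7, 0.1, 0.0]],
    [[0.8, 0.1, 0.1, 0.0], [0.1, 0.1, 0.2, 0.6]],
]
label_map = label_pixels(scores)
print(label_map)
```

The result is a grid of category names with the same shape as the input, which is all that scene labeling promises: a category per pixel, with no guarantee that region outlines are precise.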

While offering lots of possibilities, the existing solutions were lacking the desired accuracy we needed to provide visually pleasing image segmentations. For a self-driving car, it doesn’t matter if the ‘person’ region for some pedestrian accurately covers the person’s outlines. However, for us it does.

To overcome these issues, we experimented with post-processing techniques that used the segmentations we found as a base for further optimisations. This led to our first approach, where we would initially segment the entire image using a convolutional neural network, offer the found regions as selectable regions to a user and then try to refine the user’s selection using conventional image segmentation techniques to find the best possible mask.

While already yielding some useful results, the system did not quite match our requirements. If the initial segmentation was too coarse or off in critical regions, the user could never select an area that would lead to their desired segmentation.

Image segmentation based on user inputs

We went back to the drawing board and searched for other approaches that would fit our use case. It didn’t take long, and we stumbled upon Deep Interactive Object Selection, a paper that presents an interactive system which creates image segmentations based on user clicks. It looked like a good fit for our requirements, and we updated our existing system to generate fake user inputs and train on combinations of these inputs and images.

To train the net, we used the publicly available COCO dataset, which contains around 300,000 images with more than 2 million annotated object instances. To handle the amount of data, we limited our training data to a subset of the full dataset. This subset was made up of images that contain objects from certain categories and cover a minimum area within the image. As we generated the inputs artificially by adding clicks on the object mask, we could generate as much training data from the COCO subset as we wanted. After some experiments, we settled on three different strategies to create user inputs and trained the net with roughly 300,000 training records.
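To make the idea of artificial clicks more concrete, here is a minimal sketch of one naive way to simulate them. It is not one of the three strategies mentioned above, which are not described here. It assumes a binary object mask, draws random positive clicks inside the object and negative clicks outside, and rasterizes them into extra channels; the function names and the binary-channel encoding are assumptions for illustration only.

```python
import random


def sample_clicks(mask, n_positive, n_negative, seed=0):
    """Sample simulated clicks from a binary object mask.

    mask[y][x] is 1 inside the object and 0 outside. Positive clicks are
    drawn from object pixels, negative clicks from background pixels.
    """
    rng = random.Random(seed)
    inside = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    outside = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if not v]
    positive = rng.sample(inside, min(n_positive, len(inside)))
    negative = rng.sample(outside, min(n_negative, len(outside)))
    return positive, negative


def click_map(clicks, height, width):
    """Rasterize clicks into a binary channel with the image's dimensions."""
    channel = [[0] * width for _ in range(height)]
    for y, x in clicks:
        channel[y][x] = 1
    return channel


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
positive, negative = sample_clicks(mask, n_positive=2, n_negative=1, seed=42)
positive_channel = click_map(positive, 4, 4)
negative_channel = click_map(negative, 4, 4)
```

Because every seed yields a different set of clicks, a single annotated image can produce many distinct training records, which is why the amount of training data is no longer bound by the number of images.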

The masks generated by the updated system were quite impressive already. The neural net could infer which object the user wanted to mask in the image, just by looking at raw pixel data and the user’s clicks on the object. Happy with the first results, we tried to tackle the next hurdle. Before diving deeper into optimizing the neural net, which is a rather error-prone process and consumes lots of time, we wanted to deploy the net to a mobile device. We wanted to make sure that such a tool is usable on any device and the performance would match our expectations.

Neural nets on mobile devices

Neural nets are sets of operations, executed in a specific order and based on millions of parameters. Therefore one “run” of such a net requires a lot of computation power, as millions of calculations have to be carried out. At the same time, the millions of parameters need to be deployed, as they represent the model or the representation the neural net has learned during training. So, to deploy our neural net, we had to solve these two requirements on an iPhone.

The first requirement, computing power, was thankfully solved by Apple. With the latest iOS version, a specialised framework called Metal Performance Shaders was introduced. It offers all the required operations and is tailored to run these on the phone’s GPU, which is fast and efficient. To execute our net using the framework, we had to translate our TensorFlow network code to Swift and rebuild the net’s architecture using Metal Performance Shaders operations. Sadly, Apple only supports a subset of today’s common neural network operations, so we were forced to write some shader code to reconstruct the full network.

The second requirement, extracting the trained parameters and deploying them to the device, was much easier. We just had to restore our previously trained model from a TensorFlow checkpoint, write all trained variables into a file and deploy this file with our iOS app. When needed, the iOS app would load the file into memory, and our network implementation would use the given parameters to run an inference pass.
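The export and load steps can be sketched roughly as follows. This is not the actual conversion script and it does not touch the TensorFlow checkpoint API; it only assumes that the trained variables are already available as a flat list of numbers and stores them as raw 32-bit floats. The file name and function names are hypothetical.

```python
import os
import tempfile
from array import array


def export_parameters(values, path):
    """Write parameters to disk as raw 32-bit floats, in order."""
    with open(path, "wb") as f:
        array("f", values).tofile(f)


def load_parameters(path):
    """Read the whole file back into memory as a list of floats."""
    count = os.path.getsize(path) // array("f").itemsize
    params = array("f")
    with open(path, "rb") as f:
        params.fromfile(f, count)
    return params.tolist()


# Stand-in values; a real export would flatten every trained variable.
weights = [0.5, -1.25, 2.0, 0.0]
path = os.path.join(tempfile.mkdtemp(), "model.bin")
export_parameters(weights, path)
restored = load_parameters(path)
```

A raw dump like this uses the machine's native byte order and carries no shape information, so the writer and the reader have to agree on the order, layout and endianness of the variables.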

Having met the two requirements, our network worked fine on an iPhone. We added the postprocessing operations and were able to segment images by a single tap without the need for a backend or any network communication. But there were some caveats.

While our neural net was a very common and widely used network, it was huge regarding the trainable variables. A trained model contains ~134 million parameters, which translates to about half a gigabyte of data that needs to be deployed with the app. This was obviously a showstopper for a mobile image editing app, as we couldn’t justify a 500MB download just to be able to segment images with your finger.
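The half-gigabyte figure follows from simple arithmetic, assuming each parameter is stored as a 32-bit float (4 bytes), which is an assumption on top of the text:

```python
BYTES_PER_FLOAT32 = 4


def model_size_mb(parameter_count, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Approximate size of the raw parameters in megabytes (10**6 bytes)."""
    return parameter_count * bytes_per_parameter / 1_000_000


size = model_size_mb(134_000_000)
print(f"{size:.0f} MB")  # prints "536 MB"
```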

Furthermore, the results were still very coarse. If your colleague waved his arms in an image, the net usually could easily detect his torso, head and maybe his legs, but almost never the arms or hands. Fixing this using our postprocessing algorithms wasn’t much of an option, as it would have required lots of computing power. And why bother using a neural net with millions of parameters if we fall back to conventional image processing techniques anyway?

So all in all, we had already learned a lot: Our approach of processing user inputs combined with raw image data as neural net input led to usable outputs, although quite coarse. Deploying such a net to mobile devices was possible, and the performance was good enough for using it in an interactive tool. The next step was to optimize the system to fix the parameter size and get finer results.

Combining SqueezeNet and SharpMask

We decided to tackle the network size first, as laying a proper foundation for optimizing the coarseness seemed like a sane thing to do. When looking for small nets with few parameters and fast inference, it’s hard not to stumble across the SqueezeNet architecture by Iandola et al., which was published in November 2016. It met our use case, didn’t use any exotic operations that would be hard to implement on mobile and the results looked promising, so we removed the original network from our system and replaced it with an altered SqueezeNet implementation. And to our surprise, it worked almost right away. We had to tweak our training pipeline, and the results differed slightly, but all in all the small network with only ~5 million parameters matched the performance of our previous behemoth with ~134 million parameters. We quickly updated our conversion script and found out that our deployable model file had just shrunk from ~500 MB to 2.9 MB. What a happy day!

Having solved the network size issue, we went ahead and thought about increasing the precision of our predictions. A loss of resolution is unavoidable in convolutional neural networks, as later layers acquire a larger “view” of the inputs by reducing their input size with so-called “pooling” layers. These layers take for example four values from the previous layer and merge them into a single one. Therefore our new SqueezeNet-based system created a 32 by 32-pixel image mask from a 512 by 512-pixel input image. Up to now we just scaled these up by using a transposed convolution. This allowed the net to learn how the upscaling worked best, but the fine details from the initial input image were already lost at this point.
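The loss of resolution can be illustrated with a small sketch. It assumes max pooling over non-overlapping 2x2 windows and that the step from 512 to 32 pixels corresponds to four halvings; the exact layers that perform the reduction in the network are not described here.

```python
def max_pool_2x2(grid):
    """Merge each non-overlapping 2x2 block into its maximum value."""
    return [
        [
            max(grid[y][x], grid[y][x + 1], grid[y + 1][x], grid[y + 1][x + 1])
            for x in range(0, len(grid[0]) - 1, 2)
        ]
        for y in range(0, len(grid) - 1, 2)
    ]


def side_after_halvings(side, halvings):
    """Side length after repeatedly halving the resolution."""
    for _ in range(halvings):
        side //= 2
    return side


features = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(features)          # [[4, 2], [2, 8]]
mask_side = side_after_halvings(512, 4)  # 32
```

Each pooled value summarises four inputs, so the exact position of an edge inside a 2x2 block can no longer be recovered from the pooled output alone. After four such steps, one mask pixel covers a 16 by 16 patch of the input.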

We remembered Facebook’s SharpMask system, introduced in summer 2016, and revisited the accompanying paper. Their refinement modules seemed like a good fit, as they were able to gradually incorporate features from lower levels, but with higher resolution, into the coarse outputs. We adopted the idea and altered the refinement modules to take the final SqueezeNet output. The modules then combined the coarse SqueezeNet output with the pooling layers’ intermediate results and were able to refine the result. This increased our model size and the computation costs by a fair amount, but led to much finer and more detailed results.
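The general idea of merging a coarse output with higher-resolution features can be sketched as follows. This is a heavily simplified stand-in and not the SharpMask refinement module: a fixed weighted average replaces the learned layers, and all names and values are invented for illustration.

```python
def upsample_2x(grid):
    """Nearest-neighbour upsampling: repeat every value in a 2x2 block."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def refine(coarse, skip, skip_weight=0.5):
    """Upsample the coarse map and blend it with a higher-resolution map.

    The fixed weighted average stands in for the learned layers of a real
    refinement module.
    """
    up = upsample_2x(coarse)
    return [
        [(1 - skip_weight) * u + skip_weight * s for u, s in zip(up_row, skip_row)]
        for up_row, skip_row in zip(up, skip)
    ]


coarse = [[1.0, 0.0]]           # 1x2 coarse prediction
skip = [[1.0, 1.0, 1.0, 0.0],   # 2x4 higher-resolution features
        [1.0, 1.0, 0.0, 0.0]]
refined = refine(coarse, skip)
```

In this toy example the coarse map only knows "left half object, right half background", while the higher-resolution map shifts the boundary in the top row. Stacking several such steps would move from the coarse mask back towards the input resolution.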

Once we had settled on our architecture, we started an extensive training run, in which we tested more than one hundred different variations of hyperparameters, architectural details, and resizing techniques. Evaluating the results, we selected the variation that offered the best compromise between accuracy and inference speed/model size.

Our results and prototype

Having managed to fix all the issues, we were eager to see how the whole system performed on a mobile device with limited computing power and inputs. We updated our mobile app to use the new network architecture and the freshly trained model to compare the refined system to our previous approach. The results were amazing. When selecting objects that matched the categories of our training data and were fully visible in the image, we were able to generate fine-grained selection masks with just a single tap. More complex or larger objects required a few more taps, but we could always find a selection mask for our object that was at least a solid starting point for further optimizations.

We decided to build a more polished prototype based on our existing img.ly iOS app. This app uses our PhotoEditor SDK to offer advanced image editing including focus and filter operations. As we were now able to create masks based on objects in the image we quickly settled on enhancing our filter and focus tools with selective masking.

Retrospective

Looking back at our journey into deep learning, it was one of the more frustrating yet fascinating ones. The sheer amount of possible applications is exciting, and once you get the hang of training something on your data, you immediately want to start experimenting with new things. On the other hand, you’re usually building huge black boxes with millions of float values, which makes debugging a pain. Especially when trying to replicate an already implemented architecture on other platforms, this can quickly become rather frustrating. If your outputs don’t match the expected results, your only option is to repeatedly go over your code, check all parameters and hope you stumble upon the wrong number somewhere. But once you manage to set everything up and start seeing some good results, you instantly want to tweak and optimise the bits and pieces of your system.

Overall, deep learning is a pain to debug, but it yields great results and opens up a new field of photo editing applications. We’ll definitely keep exploring the new possibilities of applying the techniques in our product. Stay tuned for upcoming features!
