Deep Learning – IMG.LY Blog

Inpainting: Removing Distracting Objects in High-Resolution Images

Vivien — Tue, 08 Dec 2020 15:03:28 GMT

Introduction

You may know this situation: You are out on a trip when suddenly a unique opportunity for a photograph appears, like a wild animal showing up or sun rays breaking through the rain clouds for a few seconds. Without hesitation, you grab your camera and capture the sight. Later you discover that a distracting object, like a road sign, is ruining your shot. Time for some cumbersome retouching.

Now, imagine you could erase the distracting object just by highlighting it. Wonderful! A deep learning technique for image manipulation called Image Inpainting makes it possible. Image Inpainting cuts out undesired parts of an image and fills in the missing information with plausible patterns, colors, and textures that match the surroundings.

Today we would like to share what we have learned from applying deep learning inpainting approaches. We’ll also present some quality optimization steps that we implemented to improve results when transferring them to high-resolution outputs. But let us start with a quick introduction: who are we and why are we concerned with these kinds of topics?

We are a small consortium consisting of the Bochumer Institute of Technology, a research institute aiming to transfer knowledge from academia into industry, and the company IMG.LY, a team of software engineers and designers developing creative tools like the PhotoEditor SDK and the UBQ engine. Together we are working in the EFRE.NRW funded research project KI Design that targets artificial intelligence (AI) and deep learning-based algorithms for image content analysis and modification, as well as a leveraging tool kit for aesthetic improvements.

Image Inpainting has been a viable technique in image processing for quite some time, even before “Artificial Intelligence” was on everyone’s lips. Most inpainting algorithms have in common that an area of an image is highlighted to be corrected. Many conventional algorithms then analyze the statistical distribution of the image and fill the resulting gap by finding and using nearest-neighbor patches. The best-known state-of-the-art representative of this method is the PatchMatch algorithm. It uses a fast, structured randomized search to identify the approximate nearest-neighbor patches that will fill in the respective part of the image.

However, there are two drawbacks: first, despite the approximation, performance might still be an issue, and second, the results suffer from a lack of semantic understanding of the scene. Research therefore turned to new ideas and directions and applied AI- and neural network-based approaches to solve these issues.

For us, the removal of annoying background content is a useful feature, as it improves the overall image aesthetic. Having this available on mobile devices would be particularly interesting. Due to performance limitations and ever-improving integrated cameras, a mobile solution requires a fast and lightweight model architecture as well as the ability to process high-resolution images.

In summary, our expectations for an AI-based inpainting algorithm are:

removal of (manually highlighted) background objects/persons
feasibility to process high-resolution images
fast and lightweight network (applicable for smartphones)

The number of publications addressing this or similar requirements has increased enormously in recent years. After digging into the literature, we identified two promising approaches and tested them.

Testing and Comparing Model Architectures

These selected networks were based on the latest scientific findings and appeared to provide high-quality output. Both approaches have well-documented repositories – a special thank you to the authors for their great work (of repositories and papers as well)! The selected networks are:

Partial Convolutional Neural Networks (PCONV); [paper, github-repository]
Generative Multi-column Convolutional Neural Networks (GMCNN); [paper, github-repository]

You may be wondering why exactly we chose these models for comparison, as “the latest scientific findings” sounds a bit vague. Indeed, it is quite difficult to identify the best-fitting model architecture. As far as we know, there is no standardized validation method or data set. Most papers demonstrate their results on self-selected test images and compare them with equally self-selected approaches. The only option we had was to evaluate models that seemed reasonable to us. A validation method or standardized test set could be a valuable scientific contribution here.

Let’s turn back to the selected models. The PCONV network uses multiple convolutional layers and adds a partial convolutional layer. Its key feature is that the convolution does not consider invalid pixels, which are indicated by a mask that is updated along the way. This prevents the algorithm from picking up the color of the mask (typically the average color tone of the image) and transmitting it into the reconstruction process.
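To make the idea of a partial convolution more concrete, here is a minimal single-channel NumPy sketch. It is not the code of the PCONV repository: the function name `partial_conv2d`, the valid padding, and the toy mean filter are assumptions for illustration. The re-normalization by the number of valid pixels and the mask update follow the general scheme described in the partial convolution paper.

```python
import numpy as np

def partial_conv2d(x, mask, weight, bias=0.0):
    """Single-channel partial convolution (valid padding, stride 1).

    x, mask: (H, W) arrays; mask is 1 for valid pixels and 0 for holes.
    Returns the output feature map and the updated mask.
    """
    kh, kw = weight.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    new_mask = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            m = mask[i:i + kh, j:j + kw]
            valid = m.sum()
            if valid > 0:
                patch = x[i:i + kh, j:j + kw] * m   # hole pixels contribute nothing
                scale = (kh * kw) / valid           # compensate for the missing inputs
                out[i, j] = (weight * patch).sum() * scale + bias
                new_mask[i, j] = 1.0                # location becomes valid for the next layer
    return out, new_mask

# Toy example: a constant image whose hole pixel holds an arbitrary "mask color".
x = np.ones((5, 5))
x[2, 2] = 99.0
mask = np.ones((5, 5))
mask[2, 2] = 0.0
weight = np.full((3, 3), 1.0 / 9.0)   # mean filter

out, new_mask = partial_conv2d(x, mask, weight)
```

In this toy case the value 99 inside the hole never reaches the output: every window sees only the valid ones, and the re-scaling restores the mean.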

The GMCNN, a GAN-based model, is built from three networks: a generator split into three branches that address different feature levels; a local and a global discriminator; and a VGG19 net that calculates the implicit diversified Markov random field (ID-MRF) introduced in the paper. This ID-MRF serves as a loss term comparing generated content with nearest-neighbor patches of the ground truth image. While the interaction of all three networks is required in the training phase, only the generative network is used for testing and production. More details and figures regarding the model architecture are available in the official paper.

Due to the lack of standardized sets, we created our own test sets addressing different levels of complexity. These also included image data requiring an understanding of semantic structures. In our comparison, we paid special attention to whether the filled content was harmonious and whether artifacts were kept to a minimum. Image artifacts in particular could cause problems when transferring the results to high resolution. Here is an example output of our tests:

In comparison, the inpainting result based on PCONV suffered from some blurred artifacts and erratically deviating shades, cf. Figure 1C, whereas the GMCNN-based result appeared more precise and plausible with regard to the semantic context, cf. Figure 1D. You can see this clearly when you look at the grille door that was covered by a person. The GMCNN approach, cf. Figure 1D, recognized and respected the grid structure, while PCONV overlaid it with a uniform (black) color tone. Considering the results on all test data, we decided to follow up with the GMCNN.

However, we would like to emphasize that this does not mean that one model architecture is better suited for Image Inpainting than the other. With further training or different test sets, the PCONV architecture and the weights we used may achieve similar results.

What About High-Resolution Inpainting?

At the current state of the model, the processing of high-resolution images remains unaddressed. Of all the papers and repositories we found, even those promising high resolution often targeted image sizes of at most 1024x1024 pixels. We were aiming for substantially more than 2000x2000 pixels. One reason for this gap seems to be the hardware-demanding and time-consuming training phase when processing high-resolution images.
Furthermore, applying a high-resolution inpainting model could entail performance issues. These are not negligible for us, as we are aiming at a prospective implementation on smartphones, which can’t keep up with the power of a modern graphics card. Thus, an additional challenge is a high-quality transformation of low-resolution outputs to high resolution.

Apply Low-Resolution Inpainting Output to High-Resolution Images

The GMCNN model was trained with the Places dataset, formatted in a 512x680 resolution. Feeding in high-resolution images would exceed the training input size by far and further require information of feature dimensions that the model has never seen before. That could result in almost completely distorted reconstructions.

A straightforward solution is to downscale the high-resolution image before feeding it to the model and then resize the result back up to the original image size. Due to the upscaling (e.g., via bicubic interpolation), image details lose quality. A better approach is therefore to take only the masked areas of the upscaled inpainting prediction and stitch them back into the original image. That prevents the loss of originally known details in the unmasked regions. For the remaining inpainted regions, the lack of image detail, as well as artifacts and distortions, poses a complex challenge that we aimed to overcome with the following approaches.
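The stitching idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the project: `upscale_nearest` and `stitch` are hypothetical helpers, nearest-neighbor upscaling stands in for bicubic interpolation, and the mask convention (1 = pixel to inpaint) is an assumption.

```python
import numpy as np

def upscale_nearest(img, factor):
    """Upscale an (H, W, C) image by an integer factor (nearest neighbor).

    Stand-in for the bicubic interpolation mentioned in the text.
    """
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def stitch(original, upscaled_prediction, mask):
    """Keep original pixels where mask == 0, take inpainted pixels where mask == 1."""
    mask3 = mask.astype(bool)[..., None]   # broadcast the mask over the color channels
    return np.where(mask3, upscaled_prediction, original)

# Toy example: a 4x4 "high-resolution" image and a 2x2 low-resolution prediction.
original = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1                                     # region to replace
low_res_prediction = np.full((2, 2, 3), 200, dtype=np.uint8)

result = stitch(original, upscale_nearest(low_res_prediction, 2), mask)
```

Only the four masked pixels come from the upscaled prediction; every other pixel keeps its original, full-resolution value.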

Shrinking-Mask-Approach

While the inpainted area mostly yields realistic-looking content near its margins, the quality decreases strongly towards the center, cf. Figure 3D. We noticed this behavior especially for larger masks. Consequently, a recursive inpainting procedure with an iteratively shrinking mask, cf. Figure 3E, seems to be a reasonable approach. With this concept, we try to improve the inpainting results progressively, from the boundary to the center of the masked regions, while utilizing the generated information of the preceding recursion, cf. Figure 3F.

To us, it was essential to have a dynamic method that handles all mask shapes and sizes. Therefore, we decided to apply an erosion kernel to the original mask recursively until it is fully eroded. The number of shrunken masks determines the number of inpainting passes performed by the network.
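A minimal NumPy sketch of this shrinking-mask loop could look as follows. The 3x3 erosion kernel, the helper names, and the `inpaint_fn` callback (a placeholder for the actual network call) are assumptions for illustration, not the project’s implementation.

```python
import numpy as np

def erode(mask):
    """One binary erosion step with a 3x3 square kernel (outside the image counts as 0)."""
    padded = np.pad(mask.astype(bool), 1, constant_values=False)
    h, w = mask.shape
    out = np.ones((h, w), dtype=bool)
    for dy in range(3):
        for dx in range(3):
            out &= padded[dy:dy + h, dx:dx + w]
    return out

def shrinking_masks(mask):
    """Return the original mask followed by ever smaller eroded versions,
    stopping once the mask is fully eroded."""
    masks = []
    current = mask.astype(bool)
    while current.any():
        masks.append(current)
        current = erode(current)
    return masks

def recursive_inpaint(image, mask, inpaint_fn):
    """Run inpaint_fn once per mask, feeding each result into the next pass."""
    for m in shrinking_masks(mask):
        image = inpaint_fn(image, m)
    return image

# Toy example: a 7x7 hole in a 9x9 image shrinks to 5x5, 3x3, 1x1 and then vanishes.
mask = np.zeros((9, 9), dtype=bool)
mask[1:8, 1:8] = True
levels = shrinking_masks(mask)
```

For this hole, four masks are produced, so the network would be applied four times, each pass re-inpainting a smaller core on top of the previous result.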

Two-Step-Approach

While investigating and testing various quality optimization steps, we also fed high-resolution images into our model and discovered that the results for smaller masks were convincing. That led us to the hypothesis that not the resolution but rather the number of pixels to reconstruct seems to be the limiting factor. This finding served as the basis for our two-step-approach.

Briefly, the approach works as follows. In the first step, we perform inpainting on a downscaled version of the high-resolution image using the original mask. In the second step, we transfer the model output of step one to a higher resolution and perform inpainting again. This time we apply a modified mask containing only small coherent mask regions, for which we exploit the higher-resolution context information.
In more detail, the first step corresponds to the baseline approach, cf. Figure 5: We scale the masked image down to the training resolution of 512x680 pixels and fill in the missing information.

Optionally, the shrinking-mask-approach can be applied in the first step.

In the second step, we upscale the output of step 1 to a resolution of 1024x1360 pixels, doubling each side and thus quadrupling the pixel count. To prevent the resolution loss that this upscaling causes in unmasked regions, we stitch the generated content into the input image downscaled to the same size. The resulting image serves as the model input for step 2.

To reduce image artifacts in the subsequent inpainting process, we modify the original mask so that it contains only the small mask regions and the boundaries of the large mask regions. In detail, we temporarily shrink the mask with an erosion kernel, which removes small mask segments and the marginal areas of larger mask sections, cf. Figure 6B. Finally, we calculate the difference between the original mask and the eroded mask, resulting in our desired modified mask, cf. Figure 6C.
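The mask modification can be sketched with plain NumPy as below. The 3x3 kernel, the number of erosion iterations, and the function names are assumptions chosen for the example; the text does not state the actual kernel size.

```python
import numpy as np

def erode(mask, iterations=1):
    """Binary erosion with a 3x3 square kernel, repeated `iterations` times."""
    mask = mask.astype(bool)
    h, w = mask.shape
    for _ in range(iterations):
        padded = np.pad(mask, 1, constant_values=False)
        out = np.ones((h, w), dtype=bool)
        for dy in range(3):
            for dx in range(3):
                out &= padded[dy:dy + h, dx:dx + w]
        mask = out
    return mask

def second_step_mask(original_mask, iterations=2):
    """Original mask minus its eroded core: small regions survive completely,
    large regions keep only their boundary band."""
    original_mask = original_mask.astype(bool)
    core = erode(original_mask, iterations)
    return original_mask & ~core

# Toy example with one small and one large mask region.
mask = np.zeros((12, 16), dtype=bool)
mask[1:3, 1:3] = True      # small 2x2 region
mask[2:10, 6:14] = True    # large 8x8 region
modified = second_step_mask(mask, iterations=2)
```

Here the small region is kept in full, while the large region loses its 4x4 core and keeps a two-pixel-wide boundary band.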

By re-inpainting, we double the resolution of the generated content for the small contiguous mask regions, cf. Figure 7A bottom right, as well as for the masked boundary areas, cf. Figure 7A upper left. The latter also smooths the sharp drop in resolution between the unmasked regions and the content generated in step 1. Finally, we scale our image back to the original input resolution and stitch the generated content into the original image to maintain the original resolution for unmasked areas.

Conclusion

For us, it was impressive to see how successfully and deceptively realistically AI-based inpainting can fill in missing information. Compared to conventional approaches, the consideration of structural (semantic) content is an advantage, and the reduced hardware demand even more so. In our view, this opens up the opportunity to reach a much larger group of users of inpainting algorithms: instead of relying on powerful hardware and professional software, users could make small but decisive changes on mobile devices.

In summary, we have dealt with inpainting for high-resolution images, which is undoubtedly gaining in importance due to ever-improving smartphone cameras. Processing high-resolution images entails an increasing number of pixels to “inpaint” and can lead to quality as well as performance issues. Thus, we decided to improve the output of low-resolution networks and to provide them with more information to support a subsequent upscaling procedure.
We have implemented two different approaches, the shrinking-mask-approach and the two-step-approach, which can be applied independently or in combination. It turned out that both methods subjectively increased the image quality. However, this comes with higher computational demands, as the model is applied multiple times.
Overall, we think that the combination of these two approaches is a good toolkit for AI-based high-resolution image inpainting. But we’ll keep an eye on upcoming scientific developments.

Smart Cropping - Automatically crop images to optimal regions with deep neural networks

Vivien — Fri, 24 Jul 2020 11:48:34 GMT

Pictures are omnipresent on the social web. It is common to instantly post photos of all kinds of events to share with friends and followers. Businesses, too, want to be present on social networks and employ dedicated social media managers to represent the company and communicate with customers.

Let’s assume you work as a social media manager in a company. Your job is to communicate with customers and represent your company on various social media platforms. One part of your job is to share pictures of your company’s work. Since you’re serving multiple social media platforms, you always have to consider their specific aspect ratio requirements for images. One platform wants you to provide square photos, whereas another one asks for pictures in a wide landscape format.

You are a busy person: you don’t want to waste time cropping hundreds of images into the proper format, but you also don’t want to crop your pictures weirdly.

Suppose you want a portrait-oriented version of the following image. Simply choosing the center would lead to an odd picture containing only one half of the bird. What you want is the image to include the region of interest; here, probably the whole bird in the center of the image.

But how can we automatically find such image regions?

When humans look at images, they intuitively focus on significant elements of the photos first. If you look at the following pictures, …

… you will probably notice that you first focus on the salient parts of the image (maybe the geyser or the sundown in the first image, and the reindeer on the road in the second image).

As it turns out, it is possible to train neural networks to predict such salient regions. A prediction of such a network is called a saliency map. It basically is a grayscale image of the same size as the picture. Each pixel intensity encodes the degree of saliency. These saliency maps allow us to find the best image region for a given aspect ratio.

But how can networks be trained to predict salient regions in a picture? And how do we, given the salience information, crop an image to an optimal region?

Fortunately, there was already a considerable amount of research on saliency prediction. Basically, there are two main approaches: attention-based saliency prediction and segmentation-based saliency prediction. The first group focuses on predicting the center points of human attention regardless of object segmentation and boundaries, whereas the latter considers the most salient objects as a whole.

For our application, it seemed more suitable to choose an attention-based approach. We decided to go with an LSTM-based model.

Briefly summarized, the approach works as follows: A deep convolutional neural network (CNN), pre-trained on image classification, acts as a feature extractor. The activations of an intermediate (hidden) layer are forwarded to a recurrent LSTM that further refines the prediction. The saliency map is then the output of the LSTM, combined with Gaussian priors.

In particular, a dilated convolutional network, in our case, a modified RESNET50 already pre-trained on the SALICON dataset, is deployed for feature extraction. The original paper for this method used a network for image classification. Many CNNs can act as feature detectors, but they don’t perform equally well. For example, compared to the standard convolutional feature extraction networks, the dilated networks prevent the harmful effects of image rescaling on the saliency prediction. The extracted feature maps are then fed into an attentive convolutional LSTM (recurrent neural network). This iteratively improves the saliency prediction on the obtained feature maps. Finally, multiple trainable (isotropic) Gaussian priors are added to take the bias of human attention into account, since humans tend to focus on the image center.

We trained the network on the SALICON dataset, which includes 20,000 images from Microsoft COCO and 15,000 corresponding saliency maps. The saliency maps were generated in empirical studies that used mouse movements to model human eye fixations. We optimized our network with a composite loss function combining the Pearson Correlation Coefficient and the Kullback-Leibler divergence, both standard saliency prediction loss measures.
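As a rough illustration of such a composite loss, here is a NumPy sketch using the common textbook definitions of both measures. The way the two terms are combined and the weights `w_kl` and `w_cc` are assumptions; the exact weighting used for our model is not given here.

```python
import numpy as np

EPS = 1e-7

def pearson_cc(pred, target):
    """Pearson correlation coefficient between two saliency maps (1 = perfect match)."""
    p = pred - pred.mean()
    t = target - target.mean()
    return float((p * t).sum() / (np.sqrt((p ** 2).sum() * (t ** 2).sum()) + EPS))

def kl_divergence(pred, target):
    """KL divergence of the prediction from the target, both normalized to sum to 1."""
    p = pred / (pred.sum() + EPS)
    t = target / (target.sum() + EPS)
    return float((t * np.log(EPS + t / (p + EPS))).sum())

def saliency_loss(pred, target, w_kl=1.0, w_cc=1.0):
    """Lower is better: a small KL divergence and a high correlation are rewarded."""
    return w_kl * kl_divergence(pred, target) - w_cc * pearson_cc(pred, target)

# Toy example: a Gaussian blob as ground truth and a prediction that misses it.
ys, xs = np.mgrid[0:32, 0:32]
target = np.exp(-((ys - 16) ** 2 + (xs - 16) ** 2) / (2 * 5.0 ** 2))
shifted = np.exp(-((ys - 8) ** 2 + (xs - 24) ** 2) / (2 * 5.0 ** 2))

loss_perfect = saliency_loss(target, target)
loss_shifted = saliency_loss(shifted, target)
```

A prediction identical to the ground truth yields a correlation close to 1 and a divergence close to 0, so its loss is lower than that of the shifted prediction.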

With this approach, we could already predict pleasing saliency maps suitable for smart image cropping. Unfortunately, our first successful model took up way too much memory, making it useless for practical applications. We therefore had to compress the model to a suitable size while keeping the smart cropping performance as high as possible. After a while of unsatisfactory trials, we found that instead of the RESNET50, we could simply deploy the much smaller MobileNet that ships with Keras as our feature extraction model.

This model option does provide less precise saliency maps. However, it is still suitable for smart image cropping, since we only need to know the position of the focus points roughly. Not only did this model variation save a lot of memory, it also reduced the runtime of our model significantly, which was our goal. This way, we created a model suitable for practical applications that can take over the inconvenient manual cropping process.

Once we have the saliency map, the smart crop can be determined quite easily. First, we compute the edge length of a window covering the given image as much as possible while fulfilling the required aspect ratio. Afterward, we slide this window over the predicted saliency map and determine the position that maximizes the covered saliency density. Now we only need to crop the image based on the optimal window position. Thus we obtain our smart cropped image suiting the required aspect ratio. The method is inspired by this paper.
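The window search described above can be sketched with NumPy as follows. The function names are hypothetical, the aspect ratio is assumed to be given as width divided by height, and the integral image is just one convenient way to evaluate all window positions; it is not necessarily how our implementation does it.

```python
import numpy as np

def crop_window_size(height, width, aspect_ratio):
    """Largest (h, w) window with w / h close to aspect_ratio that fits inside the image."""
    if width / height > aspect_ratio:
        # Image is wider than the target format: use the full height.
        return height, max(1, int(round(height * aspect_ratio)))
    # Otherwise use the full width.
    return max(1, int(round(width / aspect_ratio))), width

def smart_crop(saliency, aspect_ratio):
    """Return (top, left, h, w) of the window that covers the most saliency."""
    H, W = saliency.shape
    h, w = crop_window_size(H, W, aspect_ratio)
    # Integral image with a leading zero row/column: each window sum costs O(1).
    integral = np.pad(saliency.astype(np.float64), ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    sums = (integral[h:, w:] - integral[:-h, w:]
            - integral[h:, :-w] + integral[:-h, :-w])
    top, left = np.unravel_index(np.argmax(sums), sums.shape)
    return int(top), int(left), h, w

# Toy example: the salient object sits at the right edge of a landscape image.
saliency = np.zeros((6, 10))
saliency[2:4, 8:10] = 1.0

square = smart_crop(saliency, 1.0)     # square crop
portrait = smart_crop(saliency, 0.5)   # portrait crop, half as wide as high
```

A plain center crop would start at column 2 for the square format and miss the object completely, whereas the saliency-driven window moves to the right edge.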

To sum up, using saliency prediction and maximization, smart cropping enables us to find the best image regions for any aspect ratio. This technique, which we’re currently building into our UBQ engine, reduces the user’s burden to manually crop images into the required aspect ratio.

This project was funded by the European Regional Development Fund (ERDF).

From 2D to 3D Photo Editing

Malte — Tue, 26 Jun 2018 00:00:00 GMT

Last November, we released Portrait, an iOS app that helps create amazing, stylized selfies and portraits instantly.

With over a million downloads and many more portrait images created, we feel that the idea and vision of Portrait were more than confirmed. The central component of Portrait is an AI that is trained to clip portraits from the background, a technique we are eager to further improve and refine. In fact, Portrait helped us to explore a novel technique for image editing, as we were able to leverage a powerful new data set in photography: depth data.

We began feeding our AI models with the depth data from the iPhone X’s TrueDepth camera and had one goal in mind: to infer depth information for portrait imagery, that is, to bring three-dimensionality into a two-dimensional photo. Along the way, we created a new architecture concept that allows performance and memory improvements through modularizing and reusing neural networks.

In the following article, we’d like to present some of our results along with the insights we gained.

The New Cool: Depth Data

The usage of depth data in image editing initially became available with the iPhone 7 Plus when Apple introduced ‘Portrait Mode’. By combining a depth map and face detection, the devices are able to blur out distant objects and backgrounds, mimicking a ‘bokeh’ or depth of field effect, which is well known from DSLR cameras.

While the actual implementation varies, all major manufacturers nowadays offer a similar mode by incorporating depth data into their image editing pipeline. This is achieved through the conventional dual or even triple camera on the back of a phone, dual-pixel offset calculations combined with machine learning, or dedicated sensors like Apple’s TrueDepth module. In fact, for a modern flagship phone, some sort of depth-based portrait mode is almost a commodity.

From a developer’s perspective, things look a little different: Depth data became a first-class citizen throughout the iOS APIs in iOS 11, and such data is now easily accessible on supported devices. Android users obviously have access to depth data as well, either through multiple cameras or through Google’s dual-pixel-based machine learning approach, seen in the newer Pixel 2 phones. But contrary to iOS, Android doesn’t yet offer a common developer interface to access such data. In fact, developers aren’t able to access any of the depth information Google or other manufacturers collect within their camera apps. This means developers would either need to implement the algorithm to infer depth from two images themselves or try to rebuild Google’s sophisticated machine learning powered system. Neither of these options is practical and probably not even possible given the usual limitations of camera APIs.

So although it is quite common, depth data isn’t as easily accessible for developers as one might think. Right now you’re out of luck on Android, dependent on hardware on iOS, and even then limited to the $1,000 flagship if you’re interested in depth for images taken with the front camera. And last but not least, across all devices and platforms, there is no way for you to generate a depth map for an existing image.

Deep Possibilities

Despite the restrictions, we decided to first explore the power of depth for image editing, as depth data provides many new exciting creative possibilities:

If we have a depth map for a given image, our editing possibilities are increased dramatically. Instead of a 2D image, a flat plane of color values, we suddenly have a depth value for each individual pixel, which translates into a 3D landscape highlighting distinct objects in the foreground and a clear indication of background.

Depth-aware Editing

Instead of relying on color and texture differences to determine fore- and background, one could literally edit these regions individually. This allows adjustments like darkening the background while lightening the foreground, which makes portraits ‘pop’. If we were able to generate a high-resolution depth map, we could easily replace the AI currently used in Portrait and would allow even more sophisticated creatives. Thanks to the new APIs, there are already some awesome iOS apps available that specialize in depth-based editing. One famous example is Darkroom with their “depth-aware filters”.

Depth of Field Effects

As a depth of field or bokeh effect was the initial motivation for Apple to incorporate depth sensing technology, it is one of the most obvious applications. Depth is crucial for such an effect, as the amount of blurriness of any given region directly depends on its distance to the camera lens.
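To illustrate how a depth map can drive such an effect, here is a deliberately simplified NumPy sketch. It blends each pixel between the sharp image and a single box-blurred copy, weighted by the pixel’s distance from a chosen focal depth. Real bokeh rendering uses depth-dependent blur kernels; the function names, the linear falloff, and the parameters `focus_range` and `radius` are assumptions made for this example.

```python
import numpy as np

def box_blur(img, radius):
    """Blur a 2D grayscale image with a (2*radius+1)^2 box filter (edges replicated)."""
    k = 2 * radius + 1
    padded = np.pad(img.astype(np.float64), radius, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def blur_amount(depth, focus_depth, focus_range):
    """0 at the focal depth, rising linearly to 1 at `focus_range` away from it."""
    return np.clip(np.abs(depth - focus_depth) / focus_range, 0.0, 1.0)

def depth_of_field(img, depth, focus_depth, focus_range=0.5, radius=2):
    """Blend the sharp and the blurred image per pixel according to the depth map."""
    alpha = blur_amount(depth, focus_depth, focus_range)
    return (1.0 - alpha) * img + alpha * box_blur(img, radius)

# Toy example: checkerboard image, left half near (0.2), right half far (1.0).
img = (np.indices((8, 8)).sum(axis=0) % 2).astype(np.float64)
depth = np.full((8, 8), 1.0)
depth[:, :4] = 0.2

out = depth_of_field(img, depth, focus_depth=0.2)
```

With the focal depth set to the near half, that half stays untouched while the far half is blurred. Changing `focus_depth` moves the focal plane through the image.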

3D Asset Placement

As mentioned above, a depth map gives us a 3D understanding of the image. We’re able to tell if subject A is positioned in front of or behind subject B. This allows placement of digital assets like stickers or text in a ‘depth-aware’ fashion, but could also be used to apply ‘intelligent’ depth of field, e.g. a bokeh effect that ensures all faces are in focus.
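Depth-aware placement boils down to a per-pixel depth test. The following NumPy sketch shows the idea under stated assumptions: larger depth values mean farther away, the asset lives at a single depth `asset_depth`, and `place_asset` is a hypothetical helper rather than part of any SDK.

```python
import numpy as np

def place_asset(image, depth, asset, asset_alpha, asset_depth):
    """Composite `asset` into `image`, hiding it behind anything closer to the camera."""
    # The asset is drawn only where it has coverage and the scene is farther away.
    visible = (asset_alpha > 0) & (depth > asset_depth)
    out = image.copy()
    out[visible] = asset[visible]
    return out

# Toy example: a person (depth 0.3) on the left, background (depth 2.0) on the right.
image = np.zeros((4, 6), dtype=np.uint8)
depth = np.full((4, 6), 2.0)
depth[:, :3] = 0.3

# A horizontal white text bar placed at depth 1.0, between person and background.
asset = np.full((4, 6), 255, dtype=np.uint8)
asset_alpha = np.zeros((4, 6))
asset_alpha[1, :] = 1.0

composited = place_asset(image, depth, asset, asset_alpha, asset_depth=1.0)
```

The bar shows up in front of the background but disappears behind the person, which is what makes the asset appear to sit inside the scene.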

Enter Deep Learning

Motivated by the possibilities enabled by depth maps, we were wondering if we could bring this magic to any type of portrait image. We consulted existing literature on depth inference and found various papers¹ and articles on the topic, some of which even presented results that seemed sufficient for our use cases. In our case, we didn’t need accurate results, as in ‘this pixel is 30cm in front of the camera’; we were only interested in getting the general distance relations correct. For us, knowing that region A was slightly behind but definitely way in front of region B was enough to generate a visually pleasing effect, and by constraining our domain to portrait imagery, we were able to further reduce the task’s complexity.

Given our experience with deep learning and our current focus on introducing machine learning powered features to the PhotoEditor SDK, we immediately decided to tackle the new challenge with deep learning or, more specifically, convolutional neural networks. Having a huge dataset of image and depth map pairs available made this choice even easier. We stuck to a system similar to our previous segmentation model but decided to put more emphasis on allowing the reuse of individual parts, as this would come in handy when adding features in the future. To achieve this, we created a new modularized neural network approach named Hydra, which will be presented in an upcoming blog post.

During development, we followed our tried and tested workflow of starting with a complex custom model, which is then tweaked and refined to match our performance requirements while maintaining the prediction quality we need. Once that was done, we had a fast and small model, trained on thousands of iPhone front camera selfies and capable of inferring high fidelity depth maps from a plain RGB image in under a second.

The Prototype

After creating a small model capable of inferring depth maps for any given portrait image, we immediately wanted to evaluate its performance in a ‘real-world’ environment. We decided to build a prototype that applies a depth of field effect to a portrait image, by using the model and its outputs. With our long-term goal of deploying the model to iOS, Android and the web in mind, we built the prototype using TensorFlowJS to explore this newly released library. Our browser demo consists of a minimal ‘Hydra’ implementation with individual modules, one for extracting features and one for the actual depth inference, which can both be executed individually.

While being optimized for performance and memory footprint, the trained weights of the model still add up to ~18MB, which we will improve by further fine-tuning or even applying pruning or quantization. Once the models are loaded, all further processing happens on the device though, so you may try out all the samples without worrying about your data plan.

Results

Seeing our vision come to life was quite a stunning experience. Suddenly our browser was able to perform a complex depth of field effect without the need for special hardware, manual annotations or anything else apart from our image. And the best part was manually moving the focal plane through the image, either by sliding or tapping on different regions. Although trained on ‘just’ selfies, the model handles turned heads, silhouettes and multiple people pretty well and isn’t as restricted to its domain as we initially expected.

And while our initial prototype still weighs in at ~18MB, we’re certain we can slim that down further in order to use the model in production. Performance-wise, we were very impressed with the TensorFlowJS inference speed. Even though everything happens on the client side and therefore depends on the client’s hardware, we saw inference times below one second right off the bat, and these improved greatly after the initial run, as the resources were already allocated. While not immediately helpful for the depth inference part, this allowed us to further confirm our theory behind Hydra: Re-running inference once the necessary resources on the machine have been allocated greatly increases performance and might even allow real-time performance after an initial setup time.

To summarise, we’re definitely eager to further explore the use of depth data in image editing and think we have found a way to overcome the access restrictions on different platforms and hardware with our custom model. Combined with our new Hydra approach we can see lots of potential features that will delight both our users and customers and we will keep you updated right here.

(1)
The papers we drew on most for our use case were:
“Depth Map Prediction from a Single Image using a Multi-Scale Deep Network” (arXiv)
“Deeper Depth Prediction with Fully Convolutional Residual Networks” (arXiv)


When Creativity meets A.I.

Eray — Thu, 16 Nov 2017 23:00:00 GMT

A new generation of A.I. algorithms, propelled by rising computational power, new hardware, and a shift in paradigms, has made its first notable impact in the creative world: The works of Gatys et al. and Krizhevsky et al. have not only gathered considerable public attention but have helped apps like Prisma to be adopted and used by millions. I strongly believe that this is merely the beginning. **With the help of machine learning, we will fine-tune, simplify, and automate creative processes and ultimately empower new techniques for design and content creation.**

We’ve been following this topic for quite some time now and have spent considerable effort in researching the opportunities of deep learning for our PhotoEditorSDK. After more than a year of research and development, today, we’re finally bringing one of our apps to beta. **Portrait** combines supervised deep learning with the visual power of our SDK. In a nutshell, Portrait makes creating beautifully designed portrait images as easy as taking a selfie. You turn your selfies into movie poster-like portraits, with styles ranging from double-exposure photography to stencil art. One may consider it as the next iteration of what Apple and Google recently brought to market with their new camera features.

We’ve come a long way and gained invaluable insights on our journey so far. Not only did we get our hands dirty with countless training sessions and refinements to the neural net, but our first-hand experience also helped us set expectations right and separate hype from substance. Most notably, it changed our product shaping process, making it more important than ever to foster strong ties between the product stakeholders and to share a common vision and goal everybody can get behind.

In the following I’d like to share the story of how we built the app and closed the gaps between roles of the stakeholders within this process.

Preface: Before Neural Networks were the Hot New Thing

My journey begins over ten years ago, while I was graduating in neuroscience. Back then, the idea of A.I. was just a vague promise. Artificial Neural Networks were too small, computers lacked the necessary power, and the results were certainly nice, but still too weak to compete with other traditional algorithms. Research felt stuck in tiny little specializations without really following a broader vision. Discouraged by their impracticability, I slowly lost interest in Neural Networks.

It took research on Neural Networks another six years to get back on my radar. At that time, I was leading several product developments at 9elements. When I learned about the work of DeepMind (now Google) I had a genuine feeling that this time, A.I. was ready for the limelight.

As we were in the course of building a library for image editing and computer vision, the PhotoEditor SDK, we realized how much neural nets could also affect the creative space, given their ability to abstract and formalize rules. What if there was a machine that could reproduce the common and dull tasks you have to do as an art director within a second? What if designers could get rid of repetitive and tedious activities that interrupt their creative flow?

But this topic isn’t something you’d learn in a week, obviously. Still, innovations cannot happen if you’re not willing to take a risk, so we decided to invest considerable time and resources into this technology.

From a product management perspective, this process is actually an anti-pattern: Usually, you wouldn’t want to start by finding the right purpose for a technology. Instead, you’d find the right technology for a purpose. I still believe that this is essentially the right approach, but sometimes you have to abandon your best practices and take a swim in uncharted waters. Consequently, we asked Malte, one of our iOS engineers, to spearhead our research and take a deep dive into this topic. We decided to start off with image segmentation as the first process that we wanted to optimize through machine learning. Masking and clipping can be tedious tasks, and ultimately we wanted to reduce a process that can take several minutes to a single click.

Chapter 1: The Machine Engineer

Malte, who is a diligent engineer and, conveniently, a passionate photographer, started investigating some approaches that focused on image segmentation. You can read more about his journey in his article. Although he experimented with various neural networks and post-processing techniques, the resulting masks sometimes lacked the desired accuracy and wouldn’t have matched a user’s expectations. This was a first expected insight. As we want to deliver ready-to-use products to our customers that don’t need any complex tweaking, this was something we had to fix. Our problems originated mostly from our rather ambitious goal to segment any type of object within an image. It would have required training with vast amounts of data and scaling up the number of filters in our network. However, due to our on-device constraint, this would have killed our carefully crafted performance.

Therefore, we shifted from this generalist approach to a specialized network for images of a certain domain that the model can be applied to. In hindsight, this seems quite obvious, as our rather small model would never have been able to cope with the amount of variation existing in ‘the real world’ anyway. So, we went back to the drawing board and started discussing which domain to focus on. That’s where we got stuck; we struggled to find an obvious trend in our customers’ use cases or known photography platforms.

It was actually during his summer holiday that Malte had a flash of genius. At a stop-over in Singapore, he noticed how the city was flooded with selfie-stick wielding tourists. The sheer number of selfies taken at any public place in Singapore left him astonished, and he realised that he had just found the right domain. Selfies, and portraits in general, felt like an infinite data source and a prime use case for our image segmentation algorithm. Back home, we decided to focus on selfies and portrait-like photography.

Malte started searching for portrait datasets and found a collection of roughly 2000 portrait images collected from Flickr. Those were a great starting point, and after a few training runs he already reached satisfactory results, as the model was now capable of capturing all available variations. At that point, we had a system at our hands that was able to segment portrait or selfie images in real-time on the device you’re capturing them with. This seemed like a great opportunity, but we didn’t want to stop just there. Releasing a prototype that can free a selfie from its background is nice, but doesn’t feel like something that would truly showcase how AI can make a difference in our creative process.

Chapter 2: The Art Director

This is where our Art Director Tommi, a renowned graphic artist and former sprayer, stepped in to explore what can be done with **a selfie, an accurate alpha mask and the image editing features from our PhotoEditor SDK**. When Tommi took the lead, I asked him to draft a vision, a creative direction for our app that combines all the tools and possibilities at our disposal.

Together, we started exploring portrait trends and unique imagery that would help us find a direction for our showcase. Soon, the walls of our meeting rooms and offices were plastered with inspirational works on portrait photography of all different kinds and styles. This visual catalogue kept inspiring us, although we weren’t sure which style to settle on in the end. It was when we could hardly find any more free spots on our walls, and after looking at them countless times, that the idea struck:

What if we could enable users to turn their portraits into what we saw on these walls? And all of this without the design expertise normally required to do so.

Instead of brooding over a completely new form of portraits, we could take all these styles and instantly realize them with the technology we had. From that point on, we flipped our process upside down. Instead of thinking about what the technology is capable of, or identifying a problem worth solving, we aimed for the creative output that we wanted our app to produce. While we started our venture with a technology, we now had visual results that we could work towards. The main question shifted from “What is our technology capable of?” to “How can we achieve this visual output with our technology?”

Tommi designed five lead graphics, so our team of engineers and designers could grasp what we ultimately wanted to achieve, using only a selfie and the features of our SDK.

Chapter 3: Closing Ranks

With such a clear vision for our app, we started separating the wheat from the chaff, categorizing the portraits and understanding which operations of our SDK we had to combine, assemble and enhance to create these visuals.

What followed was a remarkable interplay across multiple stakeholders of our team. While we were always very vocal proponents of building strong relationships between product stakeholders, **the introduction of the AI layer actually glued our team further together**.

Our designers started to embrace the engineering perspective, playfully identifying both opportunities and constraints through the tech layer. At the same time, our engineers embraced the design vision and formalized it into code. Let me give you some examples:

While thinking of the UI, we understood that the transformation of a selfie into a graphical artwork required immediate feedback for the user, so they can find a pose that works best with the respective artwork. Consequently, we optimized our networks for real-time processing, a true challenge that needed strong expertise in both iOS engineering and neural net architecture.

Our designs and recipes in turn had to be tweaked to gracefully allow for errors of our AI, because an error rate of 3% can still produce undesired artefacts and mask inaccuracies. We did that by using techniques that beautifully fringed edges of the portrait.

Altogether, the close cooperation, as well as countless meetings, feedback loops, and the continuous fine-tuning of the code and underlying recipes, is what brought us here.

All of this wouldn’t have happened if we hadn’t taken the risk to invest in a rising technology in the first place. And all of this wouldn’t have been possible if it wasn’t for the exemplary cooperation between all the stakeholders. Portrait is a showcase of how technology can inspire and tie a team together. This, in the end, is absolutely necessary if we want to achieve the leaps we expect with AI. If you want to impact the creative space by introducing an AI layer to it, your engineers have to think like designers, or at least deeply understand their work.

The Road Ahead

Portrait is a first showcase and one step of many in our venture to wire several AI aspects deep into our SDK. On our journey, we’ve identified many more opportunities where we can help broader audiences to make creative work and design more accessible. Of course, we will also improve our models and networks with better and more data, always keeping in mind the aesthetic and visual output we’d like to achieve. We’ll keep you posted on our updates and next ventures into this exciting new era.

If you liked what you read, I’d encourage you to check out Portrait and our PhotoEditor SDK!

Thanks to my co-authors Malte & Felix!


Deep Learning for Photo Editing

Malte — Thu, 20 Apr 2017 00:00:00 GMT

Deep learning, a subfield of machine learning, has become one of the best-known areas in the ongoing AI hype. Having led to many important publications and impressive results, it is applied to dozens of different scenarios and has already yielded interesting results like human-like speech generation, high accuracy object detection, advanced machine translation, super resolution and many more.

There is a steady flow of papers and publications that describe the latest advances in network design, compare existing architectures or describe unseen approaches leading to even better results than the current state-of-the-art. At the same time more and more companies and developers jump on the deep learning bandwagon and deploy the ideas and architectures to real world production systems.

This article describes our approach to applying deep learning to our image editing product, the struggle we had with finding the right architecture and the experiences we made while developing a system that can be deployed to mobile devices.

Our vision

At 9elements, we’ve had various AI topics on our radar for quite some time now. With deep learning, we finally found a tremendous opportunity for our product, the PhotoEditor SDK: We believe AI-based algorithms could be the ideal approach to boost our users’ creative output and simplify complex design tasks.

Given the hype and results, we decided to dip our toes into deep learning, which quickly led to some research regarding the most common challenges in interactive image editing. We soon identified image segmentation as a major challenge that could be solved using deep learning and started investigating further.

If you have ever tried to select a distinctive region in a picture, say your best friend on the beach or your cute pet, you know the struggle of carefully moving your cursor along the object’s outer bounds until you eventually miss a part or accidentally select something that doesn’t belong to the object. Professional image editing tools can be quite helpful in accomplishing such tasks, but on the one hand, they aren’t available on your mobile device, where you take and publish the images, and on the other hand, can be quite expensive and usually require some hands-on time, before you can produce anything usable.

Our goal was to finally remove the hassle from image clipping. We wanted to reduce the required user interaction to a minimum while offering an intuitive solution that doesn’t require any manuals or online courses. On top of that, as we provide native SDKs for web, iOS, and Android, the solution had to be deployable to all of these systems without relying on a powerful backend or being limited to certain features.

Having formulated our rather ambitious goals, we started our journey by looking into the most common research papers and classic techniques for image segmentation. We then focused on the deep learning part and quickly had an idea on how to design our approach.

Our journey

Image segmentation, the process of classifying each pixel in a picture as either foreground or background, is a popular research field and still perceived as quite challenging due to the complicated nature of the task. We humans are extremely well trained at perceiving scenes, identifying objects and making logical assumptions based on the visual input we receive.

For a long time, all approaches were based on colors, edges, and contrast and relied heavily on fine-tuned parameters, which had to be adjusted to every new scenario. That changed in 2012 when Krizhevsky et al. presented astonishing object classification results on the ImageNet benchmark using a neural network. Suddenly a system was able to classify objects with unprecedented accuracy and no need for any human fine-tuning. The neural network was ‘just’ trained on the dataset by seeing images combined with their corresponding labels and adjusting its internal representation until it couldn’t learn any further.

As we had already decided on using deep learning for our task, using a neural net clearly was our way to go. We started by examining the existing solutions and approaches, created our first prototype based on our findings and refined our approach and implementation until everything met our expectations.

Scene Labeling

The first approaches we examined focused on segmenting the whole image. This is a common task called scene labeling or semantic labeling, because it allows robots and other systems to understand a scene. The goal is to classify each pixel in an image to a particular object category. An example could be a self-driving car that searches for the road and tries to determine whether any pedestrians are crossing the street. Such a car would try to classify each pixel as road, pedestrian, tree, traffic sign, etc.
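To make the per-pixel classification concrete, here is a minimal sketch. It assumes a network has already produced one score per category for every pixel; the label list, the function name and the scores are made up for illustration and are not taken from any system described here.

```python
# Hypothetical label set, following the self-driving example in the text.
LABELS = ["road", "pedestrian", "tree", "traffic sign"]

def label_pixels(scores):
    """Turn per-pixel class scores into a label map.

    scores[y][x] is a list with one score per entry in LABELS.
    Returns a grid of the same height and width holding label names.
    """
    label_map = []
    for row in scores:
        label_row = []
        for pixel_scores in row:
            # Index of the highest score for this pixel.
            best = max(range(len(pixel_scores)), key=pixel_scores.__getitem__)
            label_row.append(LABELS[best])
        label_map.append(label_row)
    return label_map

# A tiny 2x2 "image" with made-up scores.
scores = [
    [[0.9, 0.0, 0.1, 0.0], [0.2, 0.7, 0.1, 0.0]],
    [[0.8, 0.1, 0.1, 0.0], [0.1, 0.1, 0.2, 0.6]],
]
label_map = label_pixels(scores)
```

Each pixel simply receives the category with the highest score, which is why the accuracy of region outlines depends entirely on how fine-grained those scores are.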

While offering lots of possibilities, the existing solutions were lacking the desired accuracy we needed to provide visually pleasing image segmentations. For a self-driving car, it doesn’t matter if the ‘person’ region for some pedestrian accurately covers the person’s outlines. However, for us it does.

To overcome these issues we experimented with post processing techniques that used the segmentations we found as a base for further optimisations. This led to our first approach where we would initially segment the entire image using a convolutional neural network, offer the found regions as selectable regions to a user and then try to refine the user’s selection using conventional image segmentations to find the best possible mask.

While already yielding some useful results, the system did not quite match our requirements. If the initial segmentation was too coarse or off in critical regions, the user could never select an area that would lead to the desired segmentation.

Image segmentation based on user inputs

We went back to the drawing board and searched for other approaches that would fit our use case. It didn’t take long, and we stumbled upon Deep Interactive Object Selection, a paper that presents an interactive system which creates image segmentations based on user clicks. It looked like a good fit for our requirements, and we updated our existing system to generate fake user inputs and train on combinations of these inputs and images.

To train the net, we used the publicly available COCO dataset, which contains around 300,000 images with more than 2 million annotated object instances. To handle the amount of data, we limited our training data to a subset of the full dataset. This subset was made up of images that contain objects from certain categories and cover a minimum area within the image. As we generated the inputs artificially by adding clicks on the object mask, we could generate as much training data from the COCO subset as we wanted. After some experiments, we settled on three different strategies to create user inputs and trained the net with roughly 300,000 training records.
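Generating artificial clicks from an object mask could look roughly like the sketch below. It shows only one possible strategy (uniform random clicks inside the mask), and the encoding of the clicks as a truncated distance map is an assumption made for illustration; the three strategies we actually used are not spelled out here, and all function names are hypothetical.

```python
import math
import random

def sample_clicks(mask, n_clicks, rng):
    """Pick n_clicks random (row, col) positions where mask is 1."""
    inside = [(r, c) for r, row in enumerate(mask)
              for c, value in enumerate(row) if value == 1]
    return rng.sample(inside, n_clicks)

def click_distance_map(height, width, clicks, max_dist=255.0):
    """Distance from every pixel to its nearest click, truncated at max_dist."""
    return [[min(max_dist,
                 min(math.hypot(r - cr, c - cc) for cr, cc in clicks))
             for c in range(width)]
            for r in range(height)]

# Toy object mask: 1 marks pixels that belong to the object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
clicks = sample_clicks(mask, n_clicks=2, rng=rng)
# Extra input channel that would accompany the raw image pixels.
click_channel = click_distance_map(len(mask), len(mask[0]), clicks)
```

Because the clicks are drawn at random, the same annotated image can yield many different training records, which is what makes the amount of training data effectively unlimited.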

The masks generated by the updated system were quite impressive already. The neural net could infer which object the user wanted to mask in the image, just by looking at raw pixel data and the user’s clicks on the object. Happy with the first results, we tried to tackle the next hurdle. Before diving deeper into optimizing the neural net, which is a rather error prone process and consumes lots of time, we wanted to deploy the net to a mobile device. We wanted to make sure that such a tool is usable on any device and the performance would match our expectations.

Neural nets on mobile devices

Neural nets are sets of operations, executed in a specific order and based on millions of parameters. Therefore one “run” of such a net requires a lot of computation power, as millions of calculations have to be carried out. At the same time, the millions of parameters need to be deployed, as they represent the model or the representation the neural net has learned during training. So, to deploy our neural net, we had to solve these two requirements on an iPhone.

The first requirement, computing power, was thankfully solved by Apple. With the latest iOS version, a specialised framework called Metal Performance Shaders was introduced. It offers all the required operations and is tailored to run these on the phone’s GPU, which is fast and efficient. To execute our net using the framework, we had to translate our TensorFlow network code to Swift and rebuild the net’s architecture using Metal Performance Shaders operations. Sadly, Apple only supports a subset of today’s common neural network operations, so we were forced to write some shader code to reconstruct the full network.

The second requirement, extracting the trained parameters and deploying them to the device was much easier. We just had to restore our previously trained model from a TensorFlow checkpoint, write all trained variables into a file and deploy this file with our iOS app. When needed, the iOS app would load the file into memory, and our network implementation would use the given parameters to run an inference pass.
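The export step can be sketched as follows. This is not our actual conversion script: it assumes the trained variables have already been read from the checkpoint into plain Python lists, and the file layout (32-bit little-endian floats written back to back, plus an offset index) and the variable names are hypothetical.

```python
import array
import os
import sys
import tempfile

def export_weights(variables, path):
    """Write all variables back to back as 32-bit floats.

    variables maps a variable name to a flat list of floats.
    Returns an index of name -> (offset, count), both counted in floats,
    which the loader needs to find each variable again.
    """
    index = {}
    offset = 0
    with open(path, "wb") as f:
        for name, values in variables.items():
            buf = array.array("f", values)
            if sys.byteorder != "little":
                buf.byteswap()  # store little-endian regardless of host
            buf.tofile(f)
            index[name] = (offset, len(values))
            offset += len(values)
    return index

def load_variable(path, index, name):
    """Read a single variable back from the exported file."""
    offset, count = index[name]
    buf = array.array("f")
    with open(path, "rb") as f:
        f.seek(offset * buf.itemsize)
        buf.fromfile(f, count)
    if sys.byteorder != "little":
        buf.byteswap()
    return list(buf)

# Hypothetical variable names and values.
variables = {
    "conv1/weights": [0.5, -1.25, 2.0, 0.0],
    "conv1/biases": [0.125, 0.25],
}
path = os.path.join(tempfile.mkdtemp(), "model.bin")
index = export_weights(variables, path)
restored = load_variable(path, index, "conv1/biases")
```

A flat layout like this keeps the loader on the device trivial: it only needs the index and a way to read raw floats into memory.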

Having met the two requirements, our network worked fine on an iPhone. We added the postprocessing operations and were able to segment images by a single tap without the need for a backend or any network communication. But there were some caveats.

While our neural net was a very common and widely used network, it was huge regarding the trainable variables. A trained model contains ~134 million parameters, which translates to about half a gigabyte of data that needs to be deployed with the app. This was obviously a showstopper for a mobile image editing app, as we couldn’t justify a 500MB download just to be able to segment images with your finger.
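The half-gigabyte figure follows from simple arithmetic. The sketch below assumes each parameter is stored as a 32-bit float (4 bytes), which is the usual default but is an assumption here; the function name is made up for illustration.

```python
def model_size_mb(num_parameters, bytes_per_parameter=4):
    """Approximate size of the raw parameter file in megabytes (10^6 bytes)."""
    return num_parameters * bytes_per_parameter / 1_000_000

# ~134 million parameters stored as 32-bit floats.
size = model_size_mb(134_000_000)
```

Under that assumption, ~134 million parameters come out at roughly 536 MB before any compression.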

Furthermore, the results were still very coarse. If your colleague waved his arms in an image, the net usually could easily detect his torso, head and maybe his legs, but almost never the arms or hands. Fixing this using our postprocessing algorithms wasn’t much of an option, as it would have required lots of computing power. And why bother using a neural net with millions of parameters if we fall back to conventional image processing techniques anyway?

So all in all, we had already learned a lot: Our approach of processing user inputs combined with raw image data as neural net input led to usable outputs, although quite coarse. Deploying such a net to mobile devices was possible, and the performance was good enough for using it in an interactive tool. The next step was to optimize the system to fix the parameter size and get finer results.

Combining SqueezeNet and SharpMask

We decided to tackle the network size first, as laying a proper foundation for optimizing the coarseness seemed like a sane thing to do. When looking for small nets with few parameters and fast inference, it’s hard not to stumble across the SqueezeNet architecture by Iandola et al., which was published in November 2016. It met our use case, didn’t use any exotic operations that would be hard to implement on mobile and the results looked promising, so we removed the original network from our system and replaced it with an altered SqueezeNet implementation. And to our surprise, it worked almost right away. We had to tweak our training pipeline, and the results differed slightly, but all in all the small network with only ~5 million parameters matched the performance of our previous behemoth with ~134 million parameters. We quickly updated our conversion script and found out that our deployable model file had just shrunk from ~500 MB to 2.9 MB. What a happy day!

Having solved the network size issue, we went ahead and thought about increasing the precision of our predictions. A loss of resolution is unavoidable in convolutional neural networks, as later layers acquire a larger “view” of the inputs by reducing their input size with so-called “pooling” layers. These layers take for example four values from the previous layer and merge them into a single one. Therefore our new SqueezeNet-based system created a 32 by 32-pixel image mask from a 512 by 512-pixel input image. Up to now we just scaled these up by using a transposed convolution. This allowed the net to learn how the upscaling worked best, but the fine details from the initial input image were already lost at this point.
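The loss of resolution can be illustrated with a small sketch. It uses a 2x2 max pooling as the merge operation and four halving stages to get from 512 to 32 pixels; both choices are assumptions for illustration, since the exact layers that reduce the resolution in our network are not listed here.

```python
def max_pool_2x2(grid):
    """Merge each 2x2 block of values into its maximum (stride 2)."""
    return [[max(grid[r][c], grid[r][c + 1],
                 grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, len(grid[0]) - 1, 2)]
            for r in range(0, len(grid) - 1, 2)]

def size_after_halvings(size, n_stages):
    """Spatial size after n_stages resolution-halving layers."""
    for _ in range(n_stages):
        size //= 2
    return size

grid = [
    [1, 2, 0, 0],
    [3, 4, 0, 5],
    [0, 0, 7, 0],
    [0, 6, 0, 8],
]
pooled = max_pool_2x2(grid)               # 4x4 input becomes 2x2
final_size = size_after_halvings(512, 4)  # 512 -> 256 -> 128 -> 64 -> 32
```

Every stage keeps only one value out of four, so whatever fine detail was in the discarded values cannot be recovered by upscaling the final 32 by 32 mask alone.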

We remembered Facebook’s SharpMask system introduced in summer 2016 and revisited the accompanying paper. Their refinement modules seemed like a good fit, as they were able to gradually incorporate features from lower levels, but with higher resolution, into the coarse outputs. We adopted the idea and altered the refinement modules to take the final SqueezeNet output. The modules then combined the coarse SqueezeNet output with the pooling layers’ intermediate results and were able to refine the result. This increased our model size and the computation costs by a fair amount, but led to much finer and more detailed results.
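The data flow of such a refinement step can be sketched in a heavily simplified form. In the sketch below, a fixed weighted blend stands in for the learned convolutions of a real refinement module, so it only illustrates how a coarse output and higher-resolution intermediate results are merged step by step; the function names and the blend weight are hypothetical.

```python
def upsample_2x(grid):
    """Nearest-neighbour upsampling: repeat every value in a 2x2 block."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def refine(coarse, skip, skip_weight=0.5):
    """Blend a coarse map with a same-sized, more detailed feature map,
    then double the resolution for the next refinement step."""
    merged = [[(1 - skip_weight) * c + skip_weight * s
               for c, s in zip(c_row, s_row)]
              for c_row, s_row in zip(coarse, skip)]
    return upsample_2x(merged)

# Start from a 1x1 coarse output and refine it twice with made-up features.
step1 = refine([[1.0]], [[0.0]])                    # 1x1 -> 2x2
step2 = refine(step1, [[1.0, 0.0], [0.0, 1.0]])     # 2x2 -> 4x4
```

Each step doubles the resolution while pulling in detail from an earlier layer, which is why the refined masks are finer than a plain upscaling of the coarse output.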

Once we settled on our architecture, we started an extensive training run, in which we tested more than one hundred different variations of hyperparameters, architectural details, and resizing techniques. Evaluating the results, we selected the variation that made the best compromise between accuracy and inference speed/model size.

Our results and prototype

Having managed to fix all the issues, we were eager to see how the whole system performed on a mobile device with limited computing power and inputs. We updated our mobile app to use the new network architecture and the freshly trained model to compare the refined system to our previous approach. The results were amazing. When selecting objects that matched the categories of our training data and were fully visible in the image, we were able to generate fine-grained selection masks with just a single tap. More complex or larger objects required a few more taps, but we could always find a selection mask for our object that was at least a solid starting point for further optimizations.

We decided to build a more polished prototype based on our existing img.ly iOS app. This app uses our PhotoEditor SDK to offer advanced image editing including focus and filter operations. As we were now able to create masks based on objects in the image we quickly settled on enhancing our filter and focus tools with selective masking.

Retrospective

Looking back at our journey into deep learning, it was one of the more frustrating yet fascinating ones. The sheer amount of possible applications is exciting, and once you get the hang of training something on your data, you immediately want to start experimenting with new things. On the other hand, you’re usually building huge black boxes with millions of float values, which makes debugging a pain. Especially when trying to replicate an already implemented architecture on other platforms, this can quickly become rather frustrating. If your outputs don’t match the expected results, your only option is to repeatedly go over your code, check all parameters and hope you stumble upon the wrong number somewhere. But once you manage to set everything up and start seeing some good results, you instantly want to tweak and optimise the bits and pieces of your system.

Overall, deep learning is a pain to debug, but yields great results, opens up a new field of photo editing applications and we’ll definitely keep exploring the new possibilities of applying the techniques in our product. Stay tuned for upcoming features!
