Machine Learning – IMG.LY Blog

IMG.LY Research: AI-based Generative Design Editing

Mirko — Tue, 16 Jul 2024 10:44:05 GMT

Generative AI is transforming the tech landscape, finding applications in virtually every field. At IMG.LY, we’re exploring how these advancements can revolutionize creative workflows. This article presents a research project, where we integrate Large Language Models (LLMs) with our flagship product, CreativeEditor SDK (CE.SDK), to enable natural language-driven design edits.

Our flagship product, CreativeEditor SDK (CE.SDK) allows for advanced creative workflows for countless use cases in industries ranging from print to marketing tech. Most use cases can be realized with the out-of-the-box feature set, but it also exposes a best-in-class API, called Engine API, to build complex custom workflows with designs and videos.

In this article, we will showcase how to combine our CE.SDK Engine API with LLMs to edit designs with natural language.

Introduction to LLMs

Generative AI is often associated with chatbots, but its capabilities stretch further. It is a versatile text processor that can transform any textual input into various structured outputs. This adaptability is due to its training on diverse textual patterns, allowing it to support a wide range of text-to-text applications beyond just generating conversational prose.

Carefully crafting an input text (prompt) in a way that instructs the LLM to output a specific structured output format allows us to use LLMs to solve almost any arbitrary text-based task.

Crafting prompts is an art in itself. When ensuring that we receive the required output we need to adhere to the following steps:

Consider what type of data our model was trained on to ensure the correct formatting of our input text. The most basic example is using English as our “main prompt language” since most LLMs are mainly trained on English text samples.
All necessary problem-specific information to solve the task needs to be included in the prompt. While LLMs often possess an inherent understanding of the world inferred from the vast amount of text they are trained on, they may not know much about our specific problem. Furthermore, LLMs have a notorious tendency to hallucinate, that is, to fill in missing context with incoherent or incorrect information. To ensure the best performance, the input to the LLM must provide as much context as possible.
Finally, we need to instruct the LLM well enough to output text-based data in a format we can then parse and process.

Human vs. AI Workflows for Executing Design Tasks

We started this project with the vision to use generative AI to magically handle requests like these (in increasing order of complexity):

Make the logo bigger
Translate this design into German
Adapt this template to our brand colors and brand assets
Transform this Instagram story portrait design into a landscape YouTube thumbnail

When trying to delegate a task to an AI, it’s best to start by thinking about how these tasks are currently solved by humans.

Let’s walk through how a human would complete a task such as
Make the logo bigger:

Humans would visually scan the design and automatically segment it by its elements such as objects, backgrounds, or text.
Humans would then read and comprehend the task “Make the logo bigger”
Finally, humans would use their existing knowledge of how to navigate and operate design software to fulfill the task by manipulating the individual elements in the design.

Based on these considerations we can extract the implicit knowledge necessary to fulfill a task and make it explicit for the benefit of our LLM.

Design Representation:

To enable the LLM to understand and manipulate a design, it is essential to provide a representation of the design. This can be either textual or a mix of textual and visual (if the LLM has vision capabilities) data. While supplying the current design as a raster image to the LLM is trivial, serializing a CE.SDK design into a textual format requires a custom serialization process. The textual representation is important since it allows the LLM to identify, address, and comprehend the different components of the design effectively.

Refer to Appendix: Use-Case dependent serialization of CE.SDK Designs for a more in-depth explanation of how to accomplish this.

Editing Protocol:

LLMs do not interact with design software using traditional human interfaces like a mouse, keyboard, or visual feedback. Therefore, we need a specific protocol for the LLM to propose changes to the design. We have developed a method where we pass a textual representation of the design to the LLM as part of the prompt so that the LLM can indicate changes to the design by returning a modification of this representation.

Practically, this means that if we pass in an element such as <Image id="1337" x="100" y="100" ... />, the LLM can change those x and y attributes by simply returning <Image id="1337" x="0" y="0" ... /> inside its output text. Since we can identify the changed design element by its id attribute, we can then calculate the programmatic changes that need to be applied to the design, in this case engine.block.setPositionX(1337, 0) and engine.block.setPositionY(1337, 0).
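The attribute comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SETTERS table and the position_commands helper are hypothetical names, and the sketch emits the Engine API calls as strings instead of executing them against a real engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a serialized attribute to the Engine API setter
# named in the text. Only the two setters mentioned above are listed.
SETTERS = {"x": "setPositionX", "y": "setPositionY"}


def position_commands(before_xml: str, after_xml: str) -> list[str]:
    """Compare one serialized element before and after the LLM edit."""
    before = ET.fromstring(before_xml)
    after = ET.fromstring(after_xml)
    if before.get("id") != after.get("id"):
        raise ValueError("elements do not describe the same block")
    block_id = before.get("id")
    commands = []
    for attribute, setter in SETTERS.items():
        new_value = after.get(attribute)
        # Emit a command only when the attribute is present and has changed.
        if new_value is not None and new_value != before.get(attribute):
            commands.append(f"engine.block.{setter}({block_id}, {new_value})")
    return commands


commands = position_commands(
    '<Image id="1337" x="100" y="100" />',
    '<Image id="1337" x="0" y="0" />',
)
print(commands)
```

For the example element, the sketch yields the two calls mentioned in the text, one for each changed attribute.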

Refer to Appendix: Parsing and transforming LLM response for a deeper look into this topic.

Based on what we have learned, we assemble a workflow, with the LLM as the center point that allows the LLM to execute design-related tasks on any of our CE.SDK designs. This workflow can be divided into two sub-tasks: Composing an input text (prompt) with all necessary context and parsing and applying the output of the LLM to the CE.SDK design.

Composing the Input Text

As seen in the graphic above, we first compose the input text from several components to provide the model with all the context necessary to fulfill the user’s editing request. This includes a general text to “instruct” the model, the actual request that the user entered, an exemplary design representation to explain our output format to the model, and a representation of the currently edited design.

The general, static text to “instruct” the model is composed of the following three parts.

What we are trying to achieve in general:
"You are an AI with expertise in design, specifically focused on XML representations of designs".
Output format instructions:
Your responses should only contain one XML document. Ensure that you do not introduce new attributes to any XML elements. You can change image elements by setting the alt attribute, which then will be used to search Unsplash for a fitting image. A sample alt text is “A mechanic changing tires with a pair of beautiful work gloves on”.
Additionally, pay close attention to the layout: verify that no elements in the XML document extend beyond the page boundaries. This constraint is critical for maintaining consistency and accuracy in XML formatting. Always double-check your XML output for these requirements.
The actual user request:
e.g., "Make the logo bigger"

By including a textual representation of a comprehensive example design we can show the model which layer types are available as well as which properties of those layers can be manipulated.

The model now has all the necessary context and building blocks to respond to user requests in a format that we can process downstream.
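As a rough sketch of how these components could be assembled, the following Python function joins them into a single prompt. The section labels ("Example design", "Current design", "Request") and the function name compose_prompt are assumptions made for illustration, and the instruction text is shortened from the version quoted above.

```python
# Shortened version of the static instructions quoted in the article.
INSTRUCTIONS = (
    "You are an AI with expertise in design, specifically focused on "
    "XML representations of designs. "
    "Your responses should only contain one XML document. "
    "Ensure that you do not introduce new attributes to any XML elements."
)


def compose_prompt(user_request: str, example_design_xml: str, current_design_xml: str) -> str:
    """Join the static instructions, both design representations, and the request."""
    sections = [
        INSTRUCTIONS,
        # The example design shows which elements and attributes exist.
        "Example design:\n" + example_design_xml,
        # The current design is the document the model should modify.
        "Current design:\n" + current_design_xml,
        "Request: " + user_request,
    ]
    return "\n\n".join(sections)


prompt = compose_prompt(
    "Make the logo bigger",
    '<Page id="1"><Image id="2" x="0" y="0" width="100" height="100" /></Page>',
    '<Page id="10"><Image id="1337" x="100" y="100" width="50" height="50" /></Page>',
)
print(prompt)
```

The order of the sections is a design choice of this sketch, not something the article prescribes.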

Applying the Output Text

In the first step, we scan the output text for an XML-like document and if we find one, attempt to parse it.

This will yield a structured data object we can compare to the one we passed in and calculate which elements have been modified, added, or removed.

The resulting change set can then be translated into specific calls to the CE.SDK Engine API to change the current design.
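The steps above (find the XML-like document, parse it, compare it with the input) might look roughly like this in Python. The sketch assumes that the design is wrapped in a single root element with a closing tag, such as a hypothetical <Page> element, and that every block carries an id attribute; the helper names are illustrative.

```python
import re
import xml.etree.ElementTree as ET


def extract_xml_document(llm_output: str):
    """Return the first parseable XML-like document in the text, or None."""
    start = re.search(r"<([A-Za-z_][\w.-]*)", llm_output)
    if start is None:
        return None
    closing = f"</{start.group(1)}>"
    end = llm_output.rfind(closing)
    if end == -1:
        return None
    try:
        return ET.fromstring(llm_output[start.start(): end + len(closing)])
    except ET.ParseError:
        return None


def index_by_id(root):
    """Map each element's id attribute to the element itself."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def diff_designs(before_root, after_root):
    """Report modified, added, and removed blocks, keyed by id."""
    before, after = index_by_id(before_root), index_by_id(after_root)
    modified = {}
    for block_id in before.keys() & after.keys():
        # Only new or changed attribute values are reported here;
        # attributes dropped by the LLM are ignored in this sketch.
        changed = {
            name: value
            for name, value in after[block_id].attrib.items()
            if before[block_id].get(name) != value
        }
        if changed:
            modified[block_id] = changed
    return {
        "modified": modified,
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
    }


before_root = ET.fromstring(
    '<Page id="1"><Image id="2" x="100" y="100" /><Text id="3" text="Hello" /></Page>'
)
llm_output = (
    "Here is the updated design:\n"
    '<Page id="1"><Image id="2" x="0" y="100" /><Text id="4" text="New" /></Page>\n'
    "Done."
)
after_root = extract_xml_document(llm_output)
changes = diff_designs(before_root, after_root)
print(changes)
```

Each entry of the resulting change set could then be mapped to the corresponding Engine API call.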

Issues faced

Latency: One issue with state-of-the-art models is their high latency. Each LLM response contains approximately 1000 tokens, which means each request takes 7-45 seconds (depending on the model) to complete. This long delay may be unacceptable for some user experiences. However, we see this issue as transitory and expect upcoming models to have much lower latency while maintaining their capabilities.

Pricing: Each request/response with GPT-4 Turbo as the backing model costs around 5 cents, which rules out some use cases. We expect pricing to drop significantly as well. The newer GPT-4o model, for example, cuts the price in half.
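To put these figures into perspective, the short calculation below derives the implied generation speed and a monthly cost estimate. The response size, latency range, and per-request price come from the text; the volume of 1000 requests per day is a purely hypothetical assumption.

```python
RESPONSE_TOKENS = 1000        # approximate response size from the text
LATENCY_SECONDS = (7, 45)     # fastest and slowest model from the text
COST_PER_REQUEST_USD = 0.05   # GPT-4 Turbo figure from the text


def tokens_per_second(tokens: float, seconds: float) -> float:
    """Implied generation speed for one response."""
    return tokens / seconds


def monthly_cost(requests_per_day: float, cost_per_request: float, days: int = 30) -> float:
    """Total cost for a constant daily request volume."""
    return requests_per_day * cost_per_request * days


fastest = tokens_per_second(RESPONSE_TOKENS, LATENCY_SECONDS[0])
slowest = tokens_per_second(RESPONSE_TOKENS, LATENCY_SECONDS[1])

# Hypothetical volume: 1000 requests per day.
gpt4_turbo_monthly = monthly_cost(1000, COST_PER_REQUEST_USD)
gpt4o_monthly = monthly_cost(1000, COST_PER_REQUEST_USD / 2)

print(f"speed: {slowest:.0f} to {fastest:.0f} tokens per second")
print(f"monthly cost: {gpt4_turbo_monthly:.0f} USD vs {gpt4o_monthly:.0f} USD")
```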

Hallucinations: LLMs do not always follow the instructions properly and may, for example, produce output that cannot be parsed. How often this happens depends on the capabilities of the model, and we did not observe the issue with current state-of-the-art LLMs such as GPT-4 and GPT-4o.

Conclusion

We present a novel and adaptable approach to use Generative AI and LLMs specifically to interact with IMG.LY’s CreativeEditor SDK. We showcase how this technology can be used to execute common design requests on arbitrary CE.SDK Designs. The proposition that LLMs can understand textual representations of visual elements was by no means obvious. This research project has revealed that it is very well within the scope of LLMs to translate instructions from a visual semantic context to its textual representation and back. This invites more inquiries into LLMs as assistants for tasks with a heavy visual component such as design.

While further research is needed to make this technology available in production environments, we are confident that Generative AI-based editing will play a big role in the future of Graphics and Video editing.

Appendix: Use-case Dependent Serialization of CE.SDK Designs

LLMs work based on “tokens”, which roughly correspond to words or fragments of words. However, a design, such as a poster or a social media graphic, is highly visual. That means that we need a way to convert a design into text, that is, a way to serialize it. Our CE.SDK engine can serialize an existing scene using our engine.block.saveToString method. However, this serialization contains a large amount of information that is not necessary for editing the design. LLMs are priced by token, and their speed also depends on the number of tokens in the input and output. Thus, the number of tokens should be kept as small as possible.

We looked at several ways to convert the current state of the design into a textual representation. Since GenAI is trained on a lot of (X)HTML which has an XML-like format, we decided to serialize any designs into a tag-based XML-like format.

The IMG.LY editor internally refers to design elements like images or texts as “blocks”. These blocks are uniquely identifiable and addressable using a numeric ID. We use this ID to be able to identify a serialized design block in the input and output of the LLM. Example: <Image id="12582927" x="0" y="0" width="800" height="399" />

For each of the CE.SDK block types, such as “Text” or “Graphics” (representing images or vector shapes), we use the CE.SDK Engine API to query very specific data from the block. That means that, for example, only Text blocks have a text attribute.

This rather specific mapping of only certain properties from the CE.SDK design into the text serialization allows us to optimize the design serialization for different use cases. A use case where we want to automatically name each layer, for example, probably does not require fine-grained information such as the font size.
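A minimal sketch of such a use-case dependent serialization is shown below. It assumes the block properties have already been queried from the engine into plain dictionaries; the PROFILES table, the profile names, and the attribute lists are hypothetical and only illustrate the idea of an allow list per block type.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical serialization profiles: for each use case, the attributes
# that are emitted per block type.
PROFILES = {
    "layout": {
        "Image": ["x", "y", "width", "height"],
        "Text": ["x", "y", "width", "height", "text", "fontSize"],
    },
    "naming": {
        "Image": ["alt"],
        "Text": ["text"],
    },
}


def serialize_block(block: dict, profile: str) -> str:
    """Serialize one block, keeping only the attributes the profile allows."""
    allowed = PROFILES[profile].get(block["type"], [])
    # The id is always emitted so the block can be identified in the response.
    attributes = [f'id={quoteattr(str(block["id"]))}']
    for name in allowed:
        if name in block:
            attributes.append(f"{name}={quoteattr(str(block[name]))}")
    return f'<{block["type"]} ' + " ".join(attributes) + " />"


image = {"type": "Image", "id": 12582927, "x": 0, "y": 0, "width": 800, "height": 399}
text = {"type": "Text", "id": 7, "x": 10, "y": 20, "width": 300, "height": 40,
        "text": "Summer Sale", "fontSize": 24}

print(serialize_block(image, "layout"))
print(serialize_block(text, "naming"))
```

With the "naming" profile, the Text block is reduced to its id and text, which keeps the token count low for that use case.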

Appendix: Parsing and transforming LLM response

The LLM answers with arbitrary tokens. It’s not possible to restrict the response to a certain syntax. By settling on a well-defined and widely used format we instruct the model to also reply with an XML-like document, similar to the one we passed in as “current state”.

After receiving the LLM’s response, we first make sure that only a single XML document is present inside the response. We then compare the retrieved XML document with the state of the design that we passed into the LLM and generate a change set. This change set contains entries like “Color of block with ID=123 has changed”. These change set entries are then converted into programmatic commands, such as engine.block.setColor(123), and executed on the current design.

One challenge when working with an LLM is the inability to restrict the output space. Thus, we are never guaranteed that the LLM did not add, for example, new XML node names, or that it even replies with a proper, valid XML-like document. The only lever to influence the probability of a proper XML-like document is to use strong prompting and LLMs that are good at following those instructions.

In our tests, state-of-the-art models like GPT-4 can follow those instructions without any further tooling.
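Since the output cannot be guaranteed, a defensive check before applying any changes seems reasonable. The sketch below is one possible approach and not necessarily what the project uses: it rejects responses that are not well-formed (which also covers more than one root document) and reports element or attribute names outside an allow list. The ALLOWED table is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow list: element names and their permitted attributes.
ALLOWED = {
    "Page": {"id", "width", "height"},
    "Image": {"id", "x", "y", "width", "height", "alt"},
    "Text": {"id", "x", "y", "width", "height", "text"},
}


def validate_response(xml_text: str, allowed: dict[str, set[str]]) -> list[str]:
    """Return a list of problems; an empty list means the response is usable."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        # Also raised when the text contains more than one root element.
        return [f"not well-formed: {error}"]
    problems = []
    for element in root.iter():
        if element.tag not in allowed:
            problems.append(f"unknown element <{element.tag}>")
            continue
        for name in element.attrib:
            if name not in allowed[element.tag]:
                problems.append(f"unknown attribute {name!r} on <{element.tag}>")
    return problems


print(validate_response('<Page id="1"><Image id="2" rotation="45" /></Page>', ALLOWED))
```

A response with problems could be discarded or sent back to the model with a request to correct it.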

Further Research Topics

It’s also worth exploring fine-tuning an LLM specifically for this task, which could improve the quality of its responses.

It would also be possible to use more advanced libraries like Guidance, which allows defining a grammar for the LLM response, thus making sure that the output of the LLM is always parseable.

Another way to improve the performance would be to methodically test different prompt templates and find a way to measure and compare the output quality.
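One simple, measurable quality signal would be the share of responses that parse as XML at all. The sketch below compares two prompt templates on that metric; the template names and the recorded responses are made-up sample data, and a real evaluation would also need to judge whether the edit matches the request.

```python
import xml.etree.ElementTree as ET


def is_parseable(response: str) -> bool:
    """True if the whole response is one well-formed XML document."""
    try:
        ET.fromstring(response.strip())
        return True
    except ET.ParseError:
        return False


def parse_rate(responses: list[str]) -> float:
    """Fraction of responses that are well-formed XML."""
    if not responses:
        return 0.0
    return sum(is_parseable(r) for r in responses) / len(responses)


# Made-up recorded responses for two hypothetical prompt templates.
recorded = {
    "template_a": ['<Page><Image id="1" x="0" /></Page>', "Sure! <Page>", "<Page />"],
    "template_b": ['<Page><Image id="1" x="0" /></Page>', "<Page />",
                   '<Page><Text id="2" text="Hi" /></Page>'],
}

scores = {name: parse_rate(responses) for name, responses in recorded.items()}
best = max(scores, key=scores.get)
print(scores, best)
```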

Thank you for reading!


Cutting Through The Jungle: An In-depth Review of Cloud GPU Providers to Train Your AI Models in 2024

Walter — Mon, 22 Apr 2024 06:35:54 GMT

Navigating the World of AI Models Hosting

Here at IMG.LY, we recently dug into finding the best place to host AI models to support apps we’re dreaming up. We wanted to figure out if using cloud GPUs or going serverless would work better for us. As we were looking specifically for service providers to run Image Generation Workloads on, we focused on those that could be the best fit for that. Along the way, we picked up some cool insights and ran into a few hiccups. We think sharing our journey and the things we figured out could help you when you’re looking to deploy your own AI models.

First off, we’ll explain what cloud GPU and serverless hosting really mean. Then, we’ll chat about their good and not-so-good sides when it comes to hosting AI models. It’s super important to make sure whatever hosting you choose fits your model like a glove. We’ll talk about some tools we stumbled upon that could help with that. Next up, we’ll give you a peek at some of the providers we checked out and our thoughts on how they might fit with what we’re working on. We decided to skip over the big names like IBM, Google, and Amazon this time. We were curious about what the newer, smaller companies have to offer.

To wrap things up, we’ll share some final thoughts on all our research. Plus, we’ll throw in some tips and ideas you might want to think about when you’re doing your own digging. Whether you’re developing AI models or planning to host some of the well-known ones, we hope our adventure helps you nail down the perfect hosting solution for what you need. Ready to jump in?

Kinds of Cloud Hosting for AI Models

Cloud hosting has been around for as long as there has been a cloud. Though the server hardware is not at your location, earlier versions of cloud hosting required that your team learnt lots about server infrastructure. As things have evolved, providers now manage the infrastructure so that you can focus on your work. You can now host even just a single function in the cloud, if that’s what you need. In our research, we looked at general serverless hosting and at Cloud GPU AI providers.

Serverless Hosting

Serverless hosting can be defined as an architecture model that lets developers build and run applications and services without managing the servers they run on. The cloud provider manages things like security, provisioning, scaling, and connectivity.

With serverless hosting for CPU workloads, the provider provisions your services on the most appropriate available hardware. With most providers of serverless GPU workloads, however, you get to choose the hardware.

Serverless Pros:

Pay-per-compute model: you only pay for the compute time you consume.
Autoscaling: the provider will automatically scale up or down depending on load, from a few requests a day to thousands per second.
No server management: eliminates the need for developers to also understand server infrastructure. Often, just a Docker image holding an application is sufficient.

Serverless Cons:

Cold starts: an instance is deallocated after a certain idle time (which is what enables the pay-per-compute model), so the first request afterwards can be noticeably slow.
Limited control over specifics: certain GPU hardware or even server hardware may be unavailable at times which can impact performance.
Limitations on time: there may be limitations on the execution time of functions, which can impact long-running processes.

Cloud GPU Hosting

Cloud GPU hosting provides access to GPU and TPU (Tensor Processing Unit) hardware that can perform the parallel operations essential for AI model training and inference. The provider allows users to configure specific hardware for their jobs.

With cloud GPU each service or model gets its own GPU while running. Your other services communicate with the model through an API.

Cloud GPU Pros:

High performance: GPUs are specifically designed to run AI models and other tasks like deep learning and complex simulations.
Full control of hardware: users can specify specific hardware configurations for their projects.
Persistent availability: resources are not deallocated, so there is no latency for provisioning for the first request.
Cost-effective experiments: the upfront cost of purchasing GPU hardware to experiment with different configurations is eliminated. Services are priced with a pay-as-you-go model.

Cloud GPU Cons:

Costs over time: costs do not go down during periods of low demand. Over time, costs can potentially surpass the cost of investing in local hardware.
Management overhead: managing and optimizing hardware configurations is not automatically part of the hosting. You’ve got to learn some server administration and manage security and upgrades.
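The cost trade-off between the two models can be estimated with a simple break-even calculation. The sketch below uses entirely hypothetical prices (0.50 USD per hour for an always-on GPU, 0.0004 USD per second of serverless compute) and ignores cold starts, idle workers, and storage fees.

```python
def breakeven_busy_hours(dedicated_per_hour: float, serverless_per_second: float,
                         hours_in_month: float = 730) -> float:
    """Busy hours per month at which serverless costs as much as an always-on GPU."""
    dedicated_monthly = dedicated_per_hour * hours_in_month
    serverless_per_hour = serverless_per_second * 3600
    return dedicated_monthly / serverless_per_hour


# Hypothetical prices, not taken from any provider's price list.
hours = breakeven_busy_hours(dedicated_per_hour=0.50, serverless_per_second=0.0004)
utilization = hours / 730

print(f"break-even at about {hours:.0f} busy hours per month "
      f"({utilization:.0%} utilization)")
```

Below the break-even utilization, paying per compute second is cheaper in this simplified model; above it, the always-on instance wins.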

Providers

It’s important to understand that this isn’t a ranking of the best providers or an endorsement. It’s what we discovered with some web searching, reviewing the available documentation, and tinkering with any demo or free tools and models the provider makes available. The list could easily have been different providers and we think some of the pros and cons and qualities would be the same. Hopefully, some of the questions we raise and the pros or cons we noticed in our research can help you to guide your research.

Our goal was to find potential hosts for various workflows with different models in a scalable manner. We want to be able to build applications around the workflows. Some of our specific requirements include:

Autoscaling, ideally out-of-the-box without the need for custom Kubernetes setup or similar technologies.
Minimal vendor lock-in.
Compatibility with various technologies (REST API, WebSocket, Webhooks, etc.).
Support for Windows Server.

With those disclaimers and caveats, here is a short summary of our research.

Provider	Best For
Runpod IO (Serverless)	Deploying AI models with GPU support and customizable API interfaces.
Vast AI (Serverless)	Affordable GPU resources and a variety of GPU options for AI model training.
Paperspace (Serverless)	Flexible workflows and support for different stages of AI model development.
CoreWeave (Serverless)	Teams with strong Kubernetes knowledge that need autoscaling capabilities for AI workloads.
Modal (Serverless)	Comprehensive documentation and examples for deploying AI models in containers.
ComfyICU (Serverless)	Serverless infrastructure tailored for hosting ComfyUI applications.
Replicate (Serverless)	Easy-to-use API for executing AI tasks without managing infrastructure.
Genesis Cloud (Cloud GPU)	Teams that value sustainability and need scalable GPU instances for AI model training.
Fly IO (Cloud GPU)	Deploying complete applications with GPU support in a scalable environment.
Runpod IO (Cloud GPU)	GPU resources in various regions with customizable Docker-based deployments.
Lambda Labs (Cloud GPU)	On-demand GPU resources for model training and inference tasks.
Together AI (Cloud GPU)	A platform for testing serverless models and occasional access to GPU clusters.

If you want to skip ahead to a specific part, here are the providers we will be diving into:

Serverless Providers
Runpod IO (Serverless)
Vast AI
Paperspace
Banana Dev
CoreWeave
Modal
ComfyICU
Replicate

GPU Cloud Providers
Genesis Cloud
Fly IO
Runpod IO (Cloud GPU)
Lambda Labs
Together AI

Serverless Providers

Runpod IO (Serverless)

Concept:

A Docker image that includes the installation of Python + GPU packages, models, and ComfyUI.
Python/Go handlers act as an API interface to ComfyUI, which is vendor-specific, but can be wrapped in a more general API for reuse. For more information, see this article on hosting a ComfyUI workflow via API.

Pros:

Good documentation, including public GitHub repositories with examples.
Relatively large community for a new provider.
Compatibility with Windows Server.
Handlers allow for webhook and WebSocket-like communication for API feedback.
Network volume to store models/data and reduce cold start times.
Control over the number of workers and the ability to define persistently active workers.

Cons:

Availability of GPUs, especially in Europe, needs to be validated.
Handlers can only be written in Python and Go.

Open Questions:

General open questions regarding serverless infrastructure and AI inference tasks.

Conclusion:

The overall package seems very mature. The setup can largely be adopted from the GitHub examples. Good documentation and community support (notably on Reddit). The open questions regarding pricing and cold starts are typical for serverless infrastructure.

Vast AI

Concept:

Peer-to-Peer Sharing. Companies/organizations can rent out their unused GPUs.
A GPU Marketplace approach.

Pros:

Affordable prices through their peer-to-peer GPU sharing model.
A wide selection of different GPUs.
Good global availability of GPUs.
Ability to define autoscaler groups, allowing different workflows to scale differently.

Cons:

The autoscaler is currently only in beta mode.
Data privacy/security concerns when renting GPUs from anonymous providers.

Open Questions:

How will the autoscaler beta evolve?
Control over GPU providers: Can one allow only certain trusted providers (e.g., those based in the EU)?

Conclusion:

Even though the pricing is more affordable, there may be significant issues in terms of security and data protection, and the autoscaler is still in beta.

Paperspace

Concept:

The serverless approach (Workflows or Gradient) is still in beta.
Paperspace Gradient Workflows is based on Argo Workflows, which utilizes Kubernetes.
A predefined API is available for communicating with workflows, as detailed in DigitalOcean’s documentation for Paperspace commands.

Pros:

The ability to use different machines (GPUs) at different stages of a workflow.
Provided by DigitalOcean, which allows general hosting customers to expand into GPU hosting without finding a new vendor.
Possible Windows support as outlined in DigitalOcean’s documentation on running Windows apps.

Cons:

Complex documentation: offers many features for various use cases (AI learning, data preparation, validation, and inference).
Vendor lock-in through a proprietary system: Gradient Workflows and YAML config are specific to Paperspace.
No real-time feedback over the API.

Open Questions:

Since it’s still in beta, how will the ecosystem continue to develop?
How extensive is the knowledge of Kubernetes required to implement autoscaling?

Conclusion:

It’s positive that it’s offered by DigitalOcean, as they are a more mature company with general hosting experience. The approach seems very specific to DigitalOcean. Furthermore, it may require experience with Kubernetes.

Banana Dev

It has been excluded: Recently, they announced the termination of their serverless model as it was not cost-effective.

Learning from this: Currently, there are many new providers entering the market aiming to establish themselves as cloud GPU or serverless GPU providers. This highlights the importance of minimizing vendor lock-in.

CoreWeave

Concept:

Heavily based on Kubernetes.
- A Kubernetes file is created for setup; scaling and additional infrastructure are managed by CoreWeave.

Pros:

Autoscaling by default with the possibility of scaling to zero.
Supports Windows.
Minimal vendor lock-in due to Kubernetes configuration.

Cons:

Strong dependency on Kubernetes, with the serverless setup based on Knative documentation.
Does not offer a handler API, etc., to communicate directly with ComfyUI.

Open Questions:

How complicated would it be to implement an API interface and the resulting scaling, for example to address the correct instances?

Conclusion:

Good documentation and a close interface to Kubernetes. For a team with strong knowledge of Kubernetes, this could be a prime candidate.

Modal

Concept:

Container Setup: Containers are defined through Modal’s own container setup (see Modal’s custom container documentation).
- Docker images can also be used.
Modal-specific handlers to communicate with ComfyUI and other models.

Pros:

Supports webhooks and custom endpoints (see Modal’s webhooks documentation).
Focus on fast startups/cold starts.
Emphasis on AI inference tasks.
Comprehensive documentation with many examples.

Cons:

Vendor lock-in if Modal’s container setup is used.
Autoscaling and scaling configuration are not directly described.

Open Questions:

How exactly does the autoscaling work?

Assessment:

For us, this is a candidate for closer consideration. The container setup can be managed through Dockerfiles, and the API defined by Modal’s own interface.

ComfyICU

Concept:

Pure focus on ComfyUI, serverless infrastructure.
API interface for communication.

Pros:

Minimal setup effort.

Cons:

Limited control over the API.
Limited GPU resources.

Open Questions:

How does the autoscaling work, if it exists at all?
Community-based open source. What is the long-term support for this project?

Conclusion:

Potentially useful for testing or building a demo site, but probably not suitable for developing our commercial applications.

Replicate

Concept:

Execution of AI tasks/models in the cloud via an API.
No access to infrastructure, etc.

Pros:

Supports various languages: Node, Python, Swift.

Cons:

No control over the infrastructure, number of GPUs, or workers.
API rate limits.

Open Questions:

How can autoscaling be enabled?
Is it possible to create custom API endpoints, webhooks, websockets?

Conclusion:

For testing or as a demo for one’s own model, this can be a very good platform. However, as a standalone application interface, it doesn’t meet some of our core requirements.

GPU Cloud Providers

Genesis Cloud

Concept:

Focus on sustainability and renewable energy.
Scaling is done through instances.

Pros:

A REST API is available for managing instances.

Cons:

The availability of GPUs varies significantly by region.
Limited selection of GPUs.

Open Questions:

How quickly can new instances be scaled up or down?

Conclusion:

The use case for Genesis Cloud appears to be more suited for model training or tasks that require a significant amount of computing power for extended periods.

Fly IO

Concept:

Focus on the deployment of complete applications.
Also offers its own GPU servers.

Pros:

Dockerfile support with additional configuration via a TOML file.
Quick scaling of GPUs up or down facilitated by the launch process.

Cons:

Limited selection of GPUs, with only very large GPUs available.
Specifically tailored for Linux.

Open Questions:

How well does the launch system perform for relatively fast inference tasks?

Conclusion:

Since primarily large GPUs are available, the focus here also appears to be more on model training or other long-duration tasks. However, the launch system might also potentially be used for inference.

Runpod IO (Cloud GPU)

Concept:

A wide range of GPUs available across various regions.
Base Docker images for popular tasks or support for custom Docker images.

Pros:

Many different data center regions.
A variety of CPUs available.
Simple setup via Docker images.

Cons:

No direct autoscaling (would need to use Runpod Serverless for that).
Despite a large selection of GPUs and many different data center locations, the availability of GPUs is not very high.

Open Questions:

Can autoscaling be implemented without using serverless?

Conclusion:

The setup can largely be adopted from the GitHub examples. There is good documentation and a community (much of it on Reddit). The availability of GPUs could become a problem, especially for smaller GPUs.

Lambda Labs

Concept:

On-demand cloud with a focus on model training and inference.
Similar concept to Runpod, offering a variety of GPUs.
- GPU availability is very limited.

Conclusion:

Runpod and Lambda Labs seem to have a similar approach and similar offerings. Runpod appears to have greater availability.

Together AI

Concept:

Offers an API and playground for testing serverless models.
Also offers GPU clusters but only upon request.

Conclusion:

We didn’t dig into the GPU clusters since information is available only upon request. Otherwise, in the API/serverless area, it appears to be similar to Replicate.

Established Providers

As we said in the introduction, we did not examine the older, large providers like Google Cloud, AWS, Azure, Nvidia, etc., in detail. Rather, we focused on the new providers aiming specifically at the market segment of AI GPUs. With the older providers, we are more in the realm of cloud GPUs and less in serverless. Given the size of these providers and the wide range of market segments they cover, it can make sense to opt for them if one is already familiar with their architecture and documentation.

Google Cloud Platform (GCP)
AWS
Microsoft Azure
IBM Cloud
NVIDIA GPU Cloud (NGC)

Conclusion

Just as we saw that performance can vary wildly for different models, pricing can be similarly complex. When evaluating costs, consider factors like response times, the number of required workers, and potential charges for features like caching. Many providers offer detailed pricing guidelines on their websites, which can be crucial for ensuring you only pay for the computing power you truly need. Experimenting with performance of your model and applications during development will be helpful to make sure your hardware and pricing are both optimized for your application.

Another thing to consider is what kind of experience your team already has. Most cloud GPU services provide tools like a CLI or REST APIs to manage resources, which can mean a steep learning curve if your team is not familiar with these technologies. Additionally, while serverless platforms may support multiple programming languages, compatibility with your team’s preferred language (be it JavaScript, Python, or Go) is essential. As exciting as it can be to learn new languages, it’s probably not the best use of your team’s time.

The size of files you’ll be moving between your model and the other parts of your project may also be a factor. Your users may not notice latency for models that communicate using text only. Text moves quickly from point to point in a network. However, if your model takes large image files as input or output, you may find that moving data between data centers is too slow. Then you’d want to focus on providers who can offer more general hosting in addition to cloud GPU hosting.
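A back-of-the-envelope estimate makes the difference between text and image payloads concrete. The bandwidth, round-trip time, and payload sizes below are hypothetical values chosen for illustration; real transfers also depend on protocol overhead and congestion.

```python
def transfer_seconds(size_megabytes: float, bandwidth_megabits_per_second: float,
                     round_trip_ms: float = 0.0) -> float:
    """Rough transfer time: payload size over bandwidth plus one round trip."""
    megabits = size_megabytes * 8
    return megabits / bandwidth_megabits_per_second + round_trip_ms / 1000


# Hypothetical link between two data centers: 100 Mbit/s, 50 ms round trip.
text_prompt = transfer_seconds(0.002, 100, round_trip_ms=50)   # about 2 KB of text
large_image = transfer_seconds(20, 100, round_trip_ms=50)      # a 20 MB image

print(f"text: {text_prompt:.3f} s, image: {large_image:.2f} s")
```

In this example the text payload is dominated by the round trip, while the image transfer is dominated by bandwidth.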

As we continue to research this for our own projects, we are thinking the best configuration for us is to use a cloud GPU exclusively for generation tasks and communicate with it via an API from our existing back end. We will have to experiment to see if we can have those functions geographically separate, or if we need to find one hosting company and one data center for both. As we learn more we may change our ideas, but that’s part of the fun of working in technology: things change. By using the higher-cost cloud GPU for as few tasks as possible, we’ll know we aren’t wasting compute power on things easily handled by a general CPU.

We hope this has given you some useful background and ideas as you research hosting options for your AI projects. Understanding the subtle differences between serverless and cloud GPU hosting can spark innovative ideas tailored to your needs. Perhaps some of the lesser-known providers we’ve explored might just be the perfect fit for your next project. As always, the dynamic nature of technology keeps us on our toes—ready to adapt and evolve. Happy hosting!


How to Remove Backgrounds Using Core ML

Walter — Tue, 28 Jun 2022 15:26:53 GMT

In this tutorial, you will learn how to use machine learning to identify an image background and mask it out. This is particularly useful for making stickers or avatars or adding a fake background to a video call. The process of assigning the pixels in an image to a specific object is called “segmentation”. Apple provides an optimized method for pictures of people in its Vision framework. To perform the same task with non-human subjects, you can use the DeepLabV3 machine learning model with Core ML. The code examples in this tutorial have been tested using Xcode 13 and Swift 5. Because this tutorial uses Core ML, Vision, and CoreImage, you should run the demo code on a device, not on the Simulator. An iOS project with the demo code is on GitHub.

Image segmentation is a different process and requires other machine learning models than image recognition. With recognition, the model produces bounding rectangles that the system believes to contain the entire object. With segmentation, the model identifies the actual pixels of the object.

In this tutorial, you’ll start with an image of your subject. Then you’ll generate a mask for the background using Vision. Finally, you’ll use CoreImage filters to blend the original image, image mask, and the new image background.

Whether using the Vision framework alone or supplementing Vision with another Core ML model – the process will be the same:

Create a Vision request object with some parameters.
Create a Vision request handler with the image to be processed.
Process the image with the handler and object.
Process the results into the final image using CoreImage.

When working with still images, Apple’s CoreImage framework is usually the best option, mainly because of the large number of available filters and the ability to create reusable pipelines that you can run on the CPU or the GPU.

Using Vision to Segment People

Without an external model, Vision can only segment documents or people in an image. It is vital to understand that Vision will only identify a pixel in the image as “this pixel is part of a person” or “this pixel is not part of a person.” If an image contains a group of people, Vision will not be able to separate individuals.

To segment the image, start by creating an instance of a VNGeneratePersonSegmentationRequest.

var segmentationRequest = VNGeneratePersonSegmentationRequest()
segmentationRequest.qualityLevel = .balanced

This type of request has a few options you can set. .qualityLevel can be .fast, .balanced or .accurate. The level of .accurate is the default. This will determine how closely the mask conforms to the boundaries of the original image. The different levels process at different speeds. The .fast setting is intended for use in a video so that frames don’t get dropped. Using .balanced or .accurate causes noticeable delay in an app processing the image on most devices. Experiment with different settings depending on your needs.

The output of the segmentation request will be a CVPixelBuffer. This is a structure that contains information for each pixel of the image. Most of Apple’s video frameworks as well as CoreImage can work with pixel buffers. The default for VNGeneratePersonSegmentationRequest is a buffer where the color of each pixel is represented by an 8-bit number. Any pixel that Vision thinks contains part of a person will be white and any not-a-person will be black and represented by zero. This will be exactly what you want for generating a mask to work with CIBlendWithMask.

Next, create a VNImageRequestHandler with the image to be processed. The handler class has a number of initializers for different types of data. In this example we will use CGImage, but you could also start with CIImage, CVPixelBuffer, Data, or others. You can also specify options for the handler. In this tutorial we will not specify any, but one of the options is to pass in a CIContext. This can help with performance as you can tell your app to do all of the processing with a single context on the GPU using Metal.

guard let originalCG = originalImage?.cgImage else { abort() }
let handler = VNImageRequestHandler(cgImage: originalCG)
try? handler.perform([segmentationRequest])
guard let maskPixelBuffer =
  segmentationRequest.results?.first?.pixelBuffer else { return }
let maskImage = CGImage.create(pixelBuffer: maskPixelBuffer)

In the code above, the originalImage (which happens to be a UIImage) gets converted to a CGImage. Then we use the image to initialize a request handler. The VNImageRequestHandler has a method .perform which takes an array of all of the Vision requests you want to use to process the image. The .perform method will not return until all of the requests in the array have been completed. Depending on how you want to structure your code, you can either provide a completion handler to use with each of the requests or just process the results in-line. In this example, we’ll process in-line.

If the segmentationRequest found any people in the image, the results array will contain a .pixelBuffer. Create the mask we need by converting the pixelBuffer into a CGImage. To convert to a CGImage, the example uses some helper methods that Matthijs Hollemans published to GitHub specifically for working with Core ML inputs and outputs.

Now that we have the mask image, use CIFilter to compose the final image. It will take a few steps. First, resize the new background, mask, and original image to the same size. Then, blend the three images.

//Convert main image to a CIImage and get the size
let mainImage = CIImage(cgImage: self.originalImage!.cgImage!)
let originalSize = mainImage.extent.size
//Convert the maskimage to CIImage and set the size
//to be the same as the original
var maskCI = CIImage(cgImage: maskImage!)
let scaleX = originalSize.width / maskCI.extent.width
let scaleY = originalSize.height / maskCI.extent.height
maskCI = maskCI.transformed(by: .init(scaleX: scaleX, y: scaleY))
//Convert the new background to a CIImage and set the size
//to be the same as the original
let backgroundUIImage = UIImage(named: "starfield")!.resized(to: originalSize)
let background = CIImage(cgImage: backgroundUIImage.cgImage!)
//Use CIBlendWithMask to combine the three images
let filter = CIFilter(name: "CIBlendWithMask")
filter?.setValue(background, forKey: kCIInputBackgroundImageKey)
filter?.setValue(mainImage, forKey: kCIInputImageKey)
filter?.setValue(maskCI, forKey: kCIInputMaskImageKey)
//Update the UI
self.filteredImageView.image = UIImage(ciImage: filter!.outputImage!)

The above code resizes the images to match the size of the original image and converts them to CIImage. Many machine learning models commonly resize inputs and outputs. For example, the output of the VNGeneratePersonSegmentationRequest using the demo image is 384x512.

Both CIImage and UIImage formats describe an image but do not always have a bitmap representation. That is why each image gets converted to a CGImage first. Though this guarantees that the demo code will work, converting formats makes the code run slower. In your app, you should experiment with different methods to get your images into CIImage format for filtering. Also, the above code uses Matthijs’ resized helper method to do some of the resizing.

With everything resized, the CIBlendWithMask filter stitches the images together. In place of the black pixels in the mask, the final image will show the background – for white pixels, the foreground. Wherever the mask image has a gray pixel, the final image will blend background and foreground.
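To make the role of the mask concrete, the blend can be thought of as a per-pixel weighted average. The following Python sketch is a language-neutral illustration, not Core Image's actual implementation: the function name blend_with_mask and the flat pixel lists are assumptions made for this example, and a real filter also has to deal with color spaces and premultiplied alpha.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def blend_with_mask(foreground: List[Pixel], background: List[Pixel],
                    mask: List[int]) -> List[Pixel]:
    """Blend two equally sized pixel lists using an 8-bit mask.

    A mask value of 255 (white) keeps the foreground, 0 (black) keeps the
    background, and gray values mix the two.
    """
    if not (len(foreground) == len(background) == len(mask)):
        raise ValueError("all three inputs must have the same number of pixels")
    result = []
    for fg, bg, m in zip(foreground, background, mask):
        alpha = m / 255.0  # 1.0 keeps the foreground, 0.0 keeps the background
        result.append(tuple(round(f * alpha + b * (1.0 - alpha))
                            for f, b in zip(fg, bg)))
    return result


fg = [(200, 100, 50)] * 3   # made-up foreground color
bg = [(0, 0, 100)] * 3      # made-up background color
blended = blend_with_mask(fg, bg, [255, 0, 128])
```

With a white, a black, and a mid-gray mask pixel, the three output pixels are the pure foreground, the pure background, and a roughly even mix of both.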

Now let’s see how to use an additional CoreML model to process images that don’t have a person as the main subject.

Choosing a Model

Apple provides a number of pre-built models for text and image processing. The DeepLab v3 Machine Learning model can perform segmentation requests on images that have subjects like dogs. You can download it from Apple directly or also find it at the DeepLab repo on GitHub. Once you have downloaded the model, add it to your Xcode project the same as any other file. The DeepLab model will recognize people, the same as the VNGeneratePersonSegmentationRequest, but also other objects. This comes at the cost of an increased file size for your application. You may notice that Apple provides multiple versions of the model with different file sizes. They all perform the same task, but differ in how they represent the output and a few other things.

Models on Apple’s website are packaged to work with Core ML and Xcode, so you can easily try a different model. The input and output method names will be the same. It is beyond the scope of this tutorial, yet Apple provides tutorials and example scripts on how to convert TensorFlow and other machine learning models to work with Core ML.

Core ML and Xcode

When you are working with a Core ML model, Xcode provides some convenient tools. Access them by highlighting the name of the model in the File Navigator pane of Xcode. For instance, clicking “Preview” will let you test the model with your data.

You can also use the “Predictions” tab to determine how the input needs to be formatted and what to expect for the output.

Here you can see that the DeepLabV3 model expects images to be 513 x 513 and will output a multidimensional array of integers that is 513 x 513 in size.

Finally, on this screen, you can see an entry for the “Model Class” in the headers. Double-click on the name to jump into the Swift wrapper class for the model to see how to call the model in your code and work with the inputs and outputs.

The DeepLabV3 Model

The DeepLab model input will be a color image that is 513 x 513 pixels. The Vision framework will handle resizing the input, but you can provide options on how that resize should work. The DeepLabV3 model has been trained to recognize and segment these items:

aeroplane
bicycle
bird
boat
bottle
bus
car
cat
chair
cow
dining table
dog
horse
motorbike
person
potted plant
sheep
sofa
train
tv or monitor

Anything that the model does not recognize, it will consider a background. After performing the recognition, the model returns a two-dimensional 513 x 513 array. Each entry in the array corresponds to one pixel in the original 513 x 513 input image. For instance, every pixel in the original image that shows a dog will be represented in the output by a 12 in the corresponding array entry. Any pixel that is not a recognized object will be given a 0 in the output array to represent a background.

If the model recognizes multiple objects, there will be multiple numbers in the array. If that is not what you want, you need to make changes to the array before creating the mask. For example, using an image of a dog riding a horse, some entries in the array will be 12, others will be 13, and the rest will be 0. To filter out the horse, you need to loop through the array and change any 13s to 0s.
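The filtering step described above is a plain loop over the label array. Here is a minimal sketch in Python, used purely as a compact, language-neutral illustration. The helper name keep_only and the tiny 3 x 4 label map are assumptions for this example. In the app itself you would apply the same idea to the model's multi-array output before converting it to a mask image.

```python
from typing import List

BACKGROUND = 0
DOG = 12
HORSE = 13


def keep_only(label_map: List[List[int]], wanted: int) -> List[List[int]]:
    """Return a copy of label_map where every label except `wanted` becomes background."""
    return [[label if label == wanted else BACKGROUND for label in row]
            for row in label_map]


# Tiny stand-in for the 513 x 513 output: a dog (12) sitting on a horse (13).
label_map = [
    [0, 12, 12, 0],
    [13, 13, 12, 0],
    [13, 13, 13, 0],
]
dog_only = keep_only(label_map, DOG)
```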

Segmentation with DeepLab

To use the DeepLab model in your code, you again use a request to the Vision framework but this time you can specify a model.

let config = MLModelConfiguration()
var segmentationModel = try! DeepLabV3(configuration: config)
if let visionModel = try? VNCoreMLModel(for: segmentationModel.model) {
  self.request = VNCoreMLRequest(model: visionModel)
  self.request?.imageCropAndScaleOption = .scaleFill
}

In the code above, we create an instance of the DeepLab Core ML object and then initialize a generic VNCoreMLRequest with its model. The only option we set on the request is to tell it how to modify the image when it resizes it for input.

Now the code is very similar to the first example.

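// Unwrap the optional image and request before handing them to Vision
guard let cgImage = originalImage?.cgImage, let request = request else { return }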
let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
try? handler.perform([request])
if let observations = request.results as? [VNCoreMLFeatureValueObservation],
  let segmentationmap = observations.first?.featureValue.multiArrayValue {
  guard let maskUIImage = segmentationmap.image(min: 0.0, max: 1.0) else { return }
  applyBackgroundMask(maskUIImage)
}

Because we are using a generic VNCoreMLRequest we have to cast the results as VNCoreMLFeatureValueObservation. The DeepLab model returns a .multiArrayValue instead of a .pixelBuffer so we will again rely on the helper methods to convert it to an image. Now that the mask image has been created, applying the background is the same as in the original example.

Going Further

This tutorial focused on still images, yet both the standard Vision segmentation requests and DeepLabV3 allow processing video input as they can work with CVPixelBuffer and VNSequenceRequestHandler. The rest of the process will be almost identical to our example. You will need to fine-tune the performance, or else you will experience frame drops.

Apple provides another method for identifying the background and primary subject in a still image: Portrait mode. When a device renders Portrait mode, it takes pictures with all cameras on the device and stitches them together. This way, the depth of field can change, and you can adjust what items are in focus or blurred.

If the DeepLab model does not cover your subject matter, you can train the DeepLabV3 model to recognize other objects using your own data.

However, you may consider using SDKs such as PE.SDK and VE.SDK. Enabling background removal for your users is easy — all it takes is a few lines in your configuration as described in the official PE.SDK documentation for iOS and Android, which adds a control to the photo editor to toggle the setting.

Integrate a fully customizable photo editor with Background Removal into your app with PE.SDK.

The latest VE.SDK release introduced Background Removal for stickers. This feature recognizes people in pictures and removes the background with one tap – no need for manual outlines or masks.

Let your users create beautiful visuals with Sticker Background Removal.

This feature is available on Android and iOS 15.0 and higher only. See the official documentation for Sticker Background Removal on Android or iOS.

Using an SDK will simplify and accelerate your app development, as you can save time and resources, and focus on the growth and innovation of your application instead.

Wrapping Up

In this tutorial, you saw how to use Vision and Core ML to segment an image of a person and remove the background. You also saw how to use DeepLab to work with images of non-persons.

Thanks for reading! We hope that you found this tutorial helpful. Feel free to reach out on Twitter with any questions, comments, or suggestions.

A Remote Data Aggregation Pipeline to Provide Machine Learning Datasets

Leonard — Fri, 16 Apr 2021 15:55:59 GMT

In our NRW.EFRE funded research project KI Design, we spent a lot of time in training and testing convolutional networks together with our fellows from the Bochumer Institute of Technology (BO-I-T). Our goal: Using Artificial Intelligence (AI) to make photo editing more comprehensive and easier at the same time. Details about the project and the motivation behind it, and some results can be found on our project homepage and in this blog post.

Today, we want to share details about our custom-made approach for a remote data aggregation and data transfer pipeline that we developed to support seamless integration of data preprocessing and storage into the training procedure. We think that this topic is of interest to the machine learning community, because the generation, versioning, and handling of datasets for the training of machine learning algorithms constitute a challenge for many researchers and developers.

Introduction

A major challenge in nearly every project that applies machine learning and deep learning lies in the data preparation and augmentation for the training process. Since many approaches and algorithms are data-driven nowadays, having training data in the right amount and quality can even make the difference between a project’s success and failure. This also includes a well-organized data infrastructure to store data, possibly in different versions of datasets.

In our joint project KI Design, we face the setting that our server cluster is split up geographically, having a data storage server at the BO-I-T laboratory and a computing server at the IMG.LY site. This, of course, makes the design of a training pipeline a bit more demanding in how to efficiently use resources. After searching for an existing software solution, we decided to develop our custom-made approach, adapted to our requirements that we formulated as follows:

An efficient workflow between data (pre)processing, storage, and provisioning on one side, as well as training initiation and execution on the other side
A possibility to request data preparations and augmentations based on configurations generated at the training side
Versioning of datasets or configurations to ensure reproducibility
Availability of already computed datasets to omit preparation and processing of equal data requests multiple times
A notification process implemented on the data server to signal the availability of a dataset and trigger download as well as training processes to the computation side
Integration into a TensorFlow training pipeline

Our Custom-made Approach - Overview

In this part, we briefly sketch our concept, before diving into our technical implementation in the next part. To train new models on our work-thirsty computing server at IMG.LY, we need training data. As mentioned before, data is stored on the data server at BO-I-T. We work on different tasks with different experiments in our project, so we need to perform several different training runs of machine learning algorithms, many of which require different datasets.

A dataset can be created in a data preparation process. In our context, the data preparation can be based on one of several “raw datasets” such as the COCO, the DUTS, or some self-assembled datasets and might include data pre-formatting (e.g. adjusting image size or section) and augmentation (e.g. image rotation, brightness adjustment, or combining foreground subjects with different backgrounds). Our idea was to use the data server for the whole process of data preparation and augmentation, to not waste valuable resources on the client for this.

The pre-training data aggregation phase can be described as follows:

The client on the computing server should be able to request datasets from the data server. To do so, the client should transmit a (parameter) description of an exact configuration of the required dataset (image type and resolution size, as well as meta information) to the data server.
While waiting for the requested dataset to be generated and provided, the computing server can spend its resources on other scheduled training jobs. (Computation time is money!)
Meanwhile, the data server checks if the requested dataset is already existent — prepared from a previous request — or whether a new data generation process needs to be launched.
Once a dataset is available and ready to be downloaded, either directly after the request (because it was already created in a previous request with the exact same configuration) or after the time it took to create the new dataset on the server, the data server sends a notification message back to the client.
Receiving this response, the computing server can download the dataset and initiate the training process.
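The steps above can be condensed into a small in-memory sketch. It is not our production code: the class DataServerSketch, the configuration key, and the generate callback are hypothetical names, and the real pipeline runs these steps asynchronously across two machines rather than in one process.

```python
from typing import Callable, Dict, List


class DataServerSketch:
    """In-memory stand-in for the data server: caches datasets by configuration key."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate          # hypothetical dataset generation function
        self._cache: Dict[str, str] = {}   # configuration key -> archive name
        self.generation_runs: List[str] = []

    def request_dataset(self, config_key: str, notify: Callable[[str], None]) -> None:
        if config_key not in self._cache:
            # No dataset exists for this configuration yet: generate it once.
            self.generation_runs.append(config_key)
            self._cache[config_key] = self._generate(config_key)
        # Either way, tell the client which archive it can download.
        notify(self._cache[config_key])


notifications: List[str] = []
server = DataServerSketch(generate=lambda key: key + ".zip")
server.request_dataset("coco_v1_abc", notifications.append)
server.request_dataset("coco_v1_abc", notifications.append)  # served from the cache
```

The second request triggers a notification but no second generation run, which is the behavior the requirements ask for.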

To implement this concept, we set up an architecture that is organized into three services, cf. Figure 3:

a dataset-client,
a dataset-server and
a dataset-handler.

For our implementation, we developed a dataset-client service that provides the functionality required on the client side. The dataset-server service is the counterpart on the server side. While these two components are pretty straightforward to understand, the third part, the dataset-handler, requires a little explanation: the implementation and training of machine learning models is not the only part of our research project. With similar importance, we develop new strategies and approaches for data preparation. Thus, the server cannot be provided with all preparation functions a priori. Instead, it needs to be able to fetch the required code, e.g., for new augmentation procedures and other algorithms, in its latest version just before the preparation starts. The dataset-handler is the concept for this: it is the adjustable tool the server uses to perform the data preparation.

Our Custom-made Approach - Technical Implementation

In this part, we do not aim to provide a full description of our implementation. Instead, our goal is to give some insights into which frameworks we used and how we implemented the interfaces between the client, the data server, and the data handler.

Framework and Languages

To set up our remote data pipeline we used a combination of different tools, frameworks and programming languages:

Apache2
Node.js
Python

For our data server, we used Apache2 as a web server to accept requests from outside and redirect them. Our REST API and WebSocket connection are set up in Node.js. Here we use the Express.js and WebSocket libraries to handle the REST requests and establish a WebSocket connection. Further server-side processes, such as checking the availability of datasets by their hash, setting up a virtual environment (to be sure that all libraries are available on the server, we set up a pipenv environment), and triggering the data handler configuration, are written in Python. Here, we used libraries like hashlib, subprocess, and zipfile for the implementation, as well as some other basic libraries.

Apart from the data-server service, the data-handler and the data-client are built in Python. Here we used libraries such as requests, asyncio, and WebSocket to establish the client-side connection with the data server. The other libraries used by the data handler strongly depend on the task to perform and thus vary a lot. To name a few examples: we frequently use libraries such as pillow, imageio, or OpenCV for image manipulation.

Client Side

The dataset client is running on the client and triggers the whole data aggregation process. It covers the following functionalities:

requesting datasets,
establishing a WebSocket connection,
downloading aggregated datasets,
starting the model training.

To get preprocessed training data from the server, a REST-based POST request is sent from the client to the data server. This request includes a configuration that defines the exact dataset attributes, as well as meta information that specifies the raw dataset to be used for the preprocessing and the version of the data-handler. Here is a code snippet showing the required POST parameters and their data types:

request_dataset(dataset_name: str, handler_version: str, dataset_config: Dict)

The parameters dataset_name and handler_version are passed as strings, and the config settings dataset_config as a dictionary. Different handlers are assigned to different projects and are customized to their specific needs. Therefore, the config settings vary between projects. To give an idea of a typical configuration, here is a (short) example:

short_example_config = {
    'input_attributes': ['image'],
    'batch_size': 100,
    'image_size': [1024, 1024],
    'variations_per_sample': 1,
    'crop_size': [800, 800],
    'random_crop_centered': True,
}

It shows some basic parameters such as image type, image resolution, and batch size, but also options for further operations such as centered cropping. These settings serve as instructions for the data_handler to create the dataset and are fully customizable to the use case.
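To illustrate how such a request could be assembled, here is a minimal sketch that builds the POST request without sending it. The endpoint URL, the JSON payload layout, and the function name build_dataset_request are assumptions for this example. Our client uses the requests library, whereas the sketch sticks to the standard library's urllib so that it stays self-contained.

```python
import json
import urllib.request
from typing import Any, Dict

# Hypothetical endpoint: the real URL and payload layout are not part of this post.
DATA_SERVER_URL = "https://data-server.example/datasets"


def build_dataset_request(dataset_name: str, handler_version: str,
                          dataset_config: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request describing the wanted dataset."""
    payload = {
        "dataset_name": dataset_name,
        "handler_version": handler_version,
        "dataset_config": dataset_config,
    }
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return urllib.request.Request(
        DATA_SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "coco" and "1.0.0" are placeholder values for illustration only.
request = build_dataset_request(
    "coco", "1.0.0", {"batch_size": 100, "image_size": [1024, 1024]}
)
```

Passing the resulting object to urllib.request.urlopen would actually send the request.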

After receiving a successful response to the POST request, the dataset client establishes a WebSocket connection. This allows continuous communication between the client and the data server, which is important for a server-side notification once the data preprocessing has finished (as an alternative, we could have implemented regularly scheduled client-side polling to check for dataset states). Here is an example of our asyncio event loop command:

asyncio.get_event_loop().run_until_complete(wait_for_completion('topic_id'))

With this loop, the client waits until it receives a related response from the server. If a success token is returned, the client leaves the loop, downloads the dataset, and then starts the actual training process (as soon as the training scheduler assigns resources, but that is a different story).
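The waiting logic can be sketched as a small coroutine. This is an illustration under assumptions, not our actual client: an asyncio.Queue stands in for the WebSocket connection, and the message fields topic_id, status, and file_name are made up for the example. On Python 3.7 and later, asyncio.run is the usual replacement for the get_event_loop().run_until_complete pattern.

```python
import asyncio
import json


async def wait_for_completion(topic_id: str, messages: "asyncio.Queue[str]") -> str:
    """Wait until a success message for topic_id arrives and return the archive name."""
    while True:
        message = json.loads(await messages.get())
        if message.get("topic_id") == topic_id and message.get("status") == "success":
            return message["file_name"]
        # Messages for other topics or non-final states are ignored.


async def demo() -> str:
    # The queue simulates the socket; the two messages simulate server notifications.
    messages: "asyncio.Queue[str]" = asyncio.Queue()
    await messages.put(json.dumps(
        {"topic_id": "other", "status": "success", "file_name": "x.zip"}))
    await messages.put(json.dumps(
        {"topic_id": "topic_id", "status": "success", "file_name": "coco_v1_abc.zip"}))
    return await wait_for_completion("topic_id", messages)


file_name = asyncio.run(demo())
```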

Server Side

On the server side, two services are running: the dataset_server and the dataset_handler. The dataset_server handles the communication with the client and receives configuration requests as well as download requests. Furthermore, it checks the availability of datasets and, if necessary, triggers the data_handler to run a dataset creation process. In summary, the dataset_server covers the following functionalities:

receiving request over REST service
set up websocket communication protocol
check for existing datasets
install dataset_handler and initialize dataset creation
send notification to the client

Using a REST API as an entry point, the data server receives the POST request of the client and checks whether the required dataset has already been created. This check is done by string matching on a unique identifier (referred to as UUID below) of the configuration: using an md5 hash, each request is converted to an identifier generated from the transferred dataset configuration. The information taken into account for the UUID/hash generation is the dataset name, the handler version, and the md5 hash of the config settings. We put the dataset name and handler version in front of the md5 hash, as this might be useful information in case we run out of disk space. Here is our function for generating the UUID:

import hashlib


def hash_dataset_name(dataset_name: str, handler_version: str, dataset_config: str):
    # Hash only the config string; name and version stay readable in the file name
    h = hashlib.md5()
    h.update(dataset_config.encode("utf-8"))

    return "".join([dataset_name, "_", handler_version, "_", h.hexdigest(), ".zip"])
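Since the client sends the configuration as a dictionary while the hash function takes a string, the dictionary has to be serialized in a stable way, otherwise two equal configurations could end up with different hashes. The following sketch shows one way to do this with sorted JSON keys. The helper names config_to_string and archive_name are assumptions for this example, and the serialization may differ from what runs on our server.

```python
import hashlib
import json
from typing import Any, Dict


def config_to_string(dataset_config: Dict[str, Any]) -> str:
    """Serialize a config so that equal configs always give the same string."""
    return json.dumps(dataset_config, sort_keys=True, separators=(",", ":"))


def archive_name(dataset_name: str, handler_version: str,
                 dataset_config: Dict[str, Any]) -> str:
    """Build '<name>_<version>_<md5 of config>.zip' from a config dictionary."""
    digest = hashlib.md5(config_to_string(dataset_config).encode("utf-8")).hexdigest()
    return "_".join([dataset_name, handler_version, digest]) + ".zip"


# Same settings in a different key order map to the same archive name.
a = archive_name("coco", "1.0.0", {"batch_size": 100, "image_size": [1024, 1024]})
b = archive_name("coco", "1.0.0", {"image_size": [1024, 1024], "batch_size": 100})
# A changed setting yields a different name, so a new dataset would be generated.
c = archive_name("coco", "1.0.0", {"batch_size": 50, "image_size": [1024, 1024]})
```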

If the UUID matches an existing dataset, a notification is returned to the client, pointing to the related zip archive on the datastore. Otherwise, the data_handler is instructed to start data preprocessing. However, before the generation process can be started, the correct data_handler has to be downloaded and installed. Different data_handler versions are stored in a GitHub repository, from which the data_server can download them. This is done by calling “pip install” as a subprocess from Python:

subprocess.call(["pipenv","run","pip", "install", "--no-cache-dir", "--upgrade", "--process-dependency-links", "git+https://github.com/imgly/dataset-handlers@{}#egg=dataset-handlers".format(handler_version)])

Note: We need pip in version 18 here because it is the last version supporting the “--process-dependency-links” option, which ensures that the dependencies inside of dataset-handlers are installed. The data_handler version is directly passed into the pip command, linking to the corresponding GitHub repository and handler release version.

Data Handler

After successful installation of the dataset_handler, we can simply import the handler in Python: import dataset_handlers

As described earlier, the core functionality of the data_handler is to create a dataset based on a given configuration and return zipped .tfrecord files. For those of you who are not familiar with the .tfrecord file format: it is a special TensorFlow format for handling data as a sequence of binary records. This has the advantage of using less disk space, being faster to copy, and being more efficient to read from disk. But let’s go on with our data_handler: depending on the project and use case, the data handlers vary strongly and offer different methods. This makes an example a bit difficult at this point. But we can present some abstract methods of our class DatasetHandler():

We start with the initialization method or constructor:

from typing import Any, Dict


class DatasetHandler():
    def __init__(self, config: Dict[str, Any], base_data_path: str):
        self.base_data_path = base_data_path
        # Expose every config entry as an attribute of the handler
        for k, v in config.items():
            setattr(self, k, v)

The input variables from the client are passed into the constructor method and set as class attributes. Additionally, the base path of the raw data is required and set as an attribute. These are the main settings needed for data generation. Furthermore, the following methods are important for a generation process:

a static method keeping all allowed config operations,
a method to create the correct raw data path and
a method that returns a .tfrecord file list.

    def as_tfrecord(self) -> List[str]:
        pass

    @staticmethod
    def config_options() -> Dict[str, Any]:
        pass

    def tf_records_mapper(self, files: List[str]) -> tf.data.TFRecordDataset:
        pass
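To show how the constructor and the static options method can work together, here is a TensorFlow-free toy handler. It is only a sketch: the class ToyDatasetHandler, its option names and defaults, and the idea of rejecting unknown options in the constructor are assumptions for illustration and not taken from our real handlers.

```python
from typing import Any, Dict


class ToyDatasetHandler:
    """TensorFlow-free sketch of the constructor pattern, with option checking added."""

    def __init__(self, config: Dict[str, Any], base_data_path: str):
        unknown = set(config) - set(self.config_options())
        if unknown:
            raise ValueError("unsupported config options: {}".format(sorted(unknown)))
        self.base_data_path = base_data_path
        # Start from the defaults, then let the request override them.
        for key, value in {**self.config_options(), **config}.items():
            setattr(self, key, value)

    @staticmethod
    def config_options() -> Dict[str, Any]:
        # Hypothetical options and defaults, loosely following the example config.
        return {
            "batch_size": 100,
            "image_size": [1024, 1024],
            "random_crop_centered": False,
        }


# The base path is a placeholder; nothing is read from disk in this sketch.
handler = ToyDatasetHandler({"batch_size": 32}, "/data/raw/coco")
```

Checking the request against the allowed options early means that a typo in a configuration fails fast instead of silently producing a differently configured dataset.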

Finally, after successful dataset creation, the server creates a .zip from the .tfrecord files and returns the file name to the client over the WebSocket connection. The client can now download the dataset.
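The packaging step can be sketched with Python's zipfile module. The function name zip_tfrecords, the file names, and the placeholder file contents are assumptions for this example. The files written here are not valid TFRecords; they only stand in for the handler's output.

```python
import os
import tempfile
import zipfile
from typing import List


def zip_tfrecords(tfrecord_paths: List[str], archive_path: str) -> str:
    """Write the given files into one zip archive and return the archive's file name."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in tfrecord_paths:
            # Store only the base name so the archive does not contain server paths.
            archive.write(path, arcname=os.path.basename(path))
    return os.path.basename(archive_path)


with tempfile.TemporaryDirectory() as workdir:
    paths = []
    for index in range(2):
        path = os.path.join(workdir, "part-{}.tfrecord".format(index))
        with open(path, "wb") as handle:
            handle.write(b"placeholder bytes, not a real TFRecord")
        paths.append(path)
    name = zip_tfrecords(paths, os.path.join(workdir, "coco_1.0.0_abc.zip"))
    with zipfile.ZipFile(os.path.join(workdir, name)) as archive:
        members = sorted(archive.namelist())
```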

Summary

Today we introduced our concept and gave some insights into the implementation of the remote data pipeline we developed to prepare and provide training data for complex machine learning training in our research project. Using this pipeline, we have solved the challenging task of a split infrastructure without suffering from strong performance-related deficits. This remote data transfer was one of the first milestones reached in our project and forms the basis of a successful collaboration.

The advantage of our presented solution is the shared use of resources: the computing server focuses its resources entirely on training models, while the data server handles the labor-intensive creation of datasets. To give you an idea of why this is so important: as we work with large image data, a full augmentation process can easily take up to 12 hours. That would be a waste of valuable computing resources if the tasks were not split up. Furthermore, we usually start multiple training sessions at the same time that all require the same dataset. Without our data pipeline, each experiment would create its own version of the same dataset and block even more resources. With our solution, this is handled far more efficiently.

Our solution could be extended in future versions with features like a monitoring tool with visualization. Important information and statistics to display could be, for example, the status of currently running data preparation processes, a list of all cached datasets including their configurations, as well as statistics about usage and downloads. This could help to keep the storage more structured and clean.

All in all, we are very content with our self-developed approach and with how fast the communication and transfer between our servers take place.

Smart Cropping - Automatically crop images to optimal regions with deep neural networks

Vivien — Fri, 24 Jul 2020 11:48:34 GMT

Pictures are omnipresent on the social web. It is common to instantly post photos of all kinds of events to share with friends and followers. Also, businesses want to show presence on social networks and employ designated social media managers to represent the company and to communicate to customers.

Let’s assume you work as a social media manager in a company. Your job is to communicate with customers and represent your company on various social media platforms. One part of your job is to share pictures of your company’s work. Since you’re serving multiple social media platforms, you always have to consider their specific aspect ratio requirements for images. One platform wants you to provide square photos, whereas another one asks for pictures in a wide landscape format.

You are a busy person, you don’t want to waste time on cropping hundreds of images into the proper format, but you also don’t want to crop your pictures weirdly.

Suppose you want a portrait-oriented version of the following image. Simply choosing the center would lead to an odd picture containing only one half of the bird. What you want is the image to include the region of interest; here, probably the whole bird in the center of the image.

But how can we automatically find such image regions?

When humans look at images, they intuitively focus on significant elements of the photos first. If you look at the following pictures, …

… you will probably notice that you first focus on the salient parts of the image (maybe the geyser or the sundown for the first image, and the reindeer on the road for the second image).

As it turns out, it is possible to train neural networks to predict such salient regions. A prediction of such a network is called a saliency map. It basically is a grayscale image of the same size as the picture. Each pixel intensity encodes the degree of saliency. These saliency maps allow us to find the best image region for a given aspect ratio.

But how can networks be trained to predict salient regions in a picture? And how do we, given the salience information, crop an image to an optimal region?

Fortunately, there was already a considerable amount of research regarding saliency prediction. Basically, there are two main approaches: attention-based saliency prediction and segmentation-based saliency prediction. The first group focuses on predicting the center points of human attention regardless of object segmentation and boundaries, whereas the latter considers the most salient objects as a whole.

For our application, it seemed more suitable to choose an attention-based approach. We decided to go with an LSTM-based model.

Briefly summarized, the approach works as follows: A deep convolutional neural network (CNN), pre-trained on image classification, acts as a feature extractor. The value of some intermediate layer (or hidden layer) is forwarded to the recurrent LSTM that further improves the prediction. The saliency map then is the output of the LSTM, combined with the Gaussian priors.

In particular, a dilated convolutional network, in our case a modified ResNet50 already pre-trained on the SALICON dataset, is deployed for feature extraction. The original paper for this method used a network for image classification. Many CNNs can act as feature detectors, but they don’t perform equally well. For example, compared to the standard convolutional feature extraction networks, the dilated networks prevent the harmful effects of image rescaling on the saliency prediction. The extracted feature maps are then fed into an attentive convolutional LSTM (recurrent neural network). This iteratively improves the saliency prediction on the obtained feature maps. Finally, multiple trainable (isotropic) Gaussian priors are added to take the bias of human attention into account, since humans tend to focus on the image center.

We trained the network on the SALICON dataset, which includes 20,000 images from Microsoft COCO and 15,000 corresponding saliency maps. The saliency maps were generated by empirical studies modeling human eye fixation by mouse movements. We optimized our network with a composed loss function considering the Pearson Correlation Coefficient and the Kullback-Leibler divergence, representing standard saliency prediction loss measures.
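
As a rough illustration of such a composed loss, the following NumPy sketch combines the Pearson Correlation Coefficient and the Kullback-Leibler divergence between a predicted and a ground-truth saliency map. The function names, the equal default weights, and the `EPS` constant are assumptions for this example; the weighting actually used in training is not stated here.

```python
import numpy as np

EPS = 1e-7  # small constant to avoid division by zero and log(0)


def pearson_cc(pred, target):
    """Pearson correlation between two saliency maps (1.0 = perfect match)."""
    p = pred - pred.mean()
    t = target - target.mean()
    return float((p * t).sum() / (np.sqrt((p ** 2).sum() * (t ** 2).sum()) + EPS))


def kl_divergence(pred, target):
    """KL divergence of the prediction from the target, both normalized to sum to 1."""
    p = pred / (pred.sum() + EPS)
    t = target / (target.sum() + EPS)
    return float((t * np.log(EPS + t / (p + EPS))).sum())


def saliency_loss(pred, target, cc_weight=1.0, kl_weight=1.0):
    """Composed loss: low KL divergence and high correlation are both rewarded.

    The weights are hypothetical placeholders, not values from the project.
    """
    return kl_weight * kl_divergence(pred, target) - cc_weight * pearson_cc(pred, target)
```

The correlation term enters with a negative sign because a higher correlation means a better prediction, while the loss is minimized.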

With this approach, we could already predict pleasing saliency maps suitable for smart image cropping. Unfortunately, our first successful model took up far too much memory, making it useless for practical applications. Therefore, we had to compress the model to a suitable size while keeping the smart cropping performance as high as possible. After a series of unsatisfactory trials, we found that instead of the ResNet50, we could simply deploy the much smaller MobileNet that ships with Keras as our feature extraction model.

This model option does provide less precise saliency map predictions. However, it is still suitable for smart image cropping, since we only need to know the position of the focus points roughly. Not only did this model variation save a lot of memory, it also significantly reduced the runtime of our model, which was our goal. This way, we created a model suitable for practical applications that are supposed to take over the inconvenient manual cropping process.

Once we have the saliency map, the smart crop can be determined quite easily. First, we compute the edge length of a window covering the given image as much as possible while fulfilling the required aspect ratio. Afterward, we slide this window over the predicted saliency map and determine the position that maximizes the covered saliency density. Now we only need to crop the image based on the optimal window position. Thus we obtain our smart cropped image suiting the required aspect ratio. The method is inspired by this paper.
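
The window search described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the project: the function names `crop_window_size` and `smart_crop`, the `stride` parameter, and the use of a summed-area table to evaluate all window positions at once are assumptions made for this example.

```python
import numpy as np


def crop_window_size(img_h, img_w, aspect_ratio):
    """Largest (height, width) with width / height ~ aspect_ratio that fits the image."""
    if img_w / img_h > aspect_ratio:
        # Image is wider than the target format: use the full height.
        win_h = img_h
        win_w = int(round(img_h * aspect_ratio))
    else:
        # Image is taller than (or equal to) the target format: use the full width.
        win_w = img_w
        win_h = int(round(img_w / aspect_ratio))
    return max(1, min(win_h, img_h)), max(1, min(win_w, img_w))


def smart_crop(saliency, aspect_ratio, stride=1):
    """Return (top, left, height, width) of the window covering the most saliency."""
    h, w = saliency.shape
    win_h, win_w = crop_window_size(h, w, aspect_ratio)

    # Summed-area table with a leading zero row and column, so that the sum
    # of any rectangle can be read from four table entries.
    sat = np.zeros((h + 1, w + 1), dtype=np.float64)
    sat[1:, 1:] = saliency.cumsum(axis=0).cumsum(axis=1)

    tops = np.arange(0, h - win_h + 1, stride)
    lefts = np.arange(0, w - win_w + 1, stride)
    t, l = np.meshgrid(tops, lefts, indexing="ij")
    sums = (sat[t + win_h, l + win_w] - sat[t, l + win_w]
            - sat[t + win_h, l] + sat[t, l])

    # All windows have the same area, so the largest sum is also the highest density.
    best = np.unravel_index(np.argmax(sums), sums.shape)
    return int(tops[best[0]]), int(lefts[best[1]]), win_h, win_w
```

A crop would then be taken as `image[top:top + height, left:left + width]`, assuming the saliency map has the same resolution as the image.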

To sum up, using saliency prediction and maximization, smart cropping enables us to find the best image regions for any aspect ratio. This technique, which we’re currently building into our UBQ engine, reduces the user’s burden to manually crop images into the required aspect ratio.

This project was funded by the European Regional Development Fund (ERDF).

From 2D to 3D Photo Editing

Malte — Tue, 26 Jun 2018 00:00:00 GMT

Last November, we released Portrait, an iOS app that helps create amazing, stylized selfies and portraits instantly.

With over a million downloads and many more portrait images created, we feel that the idea and vision of Portrait was more than confirmed. The central component of Portrait is an AI that is trained to clip portraits from the background, a technique we are eager to further improve and refine. In fact, Portrait helped us to explore a novel technique for image editing, as we were able to leverage a new powerful data set in photography: depth data.

We began feeding our AI models with the depth data from the iPhone X’s TrueDepth camera and had one goal in mind: to infer depth information for portrait imagery, in other words to bring three-dimensionality into a two-dimensional photo. Along the way, we created a new architecture concept that allows performance and memory improvements through modularizing and reusing neural networks.

In the following article, we’d like to present some of our results along with the insights we made.

The New Cool: Depth Data

The usage of depth data in image editing initially became available with the iPhone 7 Plus when Apple introduced ‘Portrait Mode’. By combining a depth map and face detection, the devices are able to blur out distant objects and backgrounds, mimicking a ‘bokeh’ or depth of field effect, which is well known from DSLR cameras.

While the actual implementation varies, all major manufacturers nowadays offer a similar mode by incorporating depth data into their image editing pipeline. This is achieved through the conventional dual or even triple camera on the back of a phone, dual-pixel offset calculations combined with machine learning, or dedicated sensors like Apple’s TrueDepth module. In fact, for a modern flagship phone, some sort of depth-based portrait mode is almost a commodity.

From a developer’s perspective, things look a little different: Depth data became a first-class citizen throughout the iOS APIs in iOS 11 and such data is now easily accessible on supported devices. Android users obviously have access to depth data as well, either by utilizing multiple cameras or by Google’s dual-pixel based machine learning approach, seen in the newer Pixel 2 phones. But contrary to iOS, Android doesn’t yet offer a common developer interface to access such data. In fact, developers aren’t able to access any of the depth information Google or other manufacturers collect within their camera apps. This means developers would either need to implement the algorithm to infer depth from two images themselves or try to rebuild Google’s sophisticated machine learning powered system. Neither of these options is practical, and they are probably not even possible given the usual limitations of camera APIs.

So despite being quite common, depth data isn’t as easily accessible for developers as one might think. Right now you’re out of luck on Android, dependent on hardware on iOS and even then limited to the $1,000 flagship if you’re interested in depth for images taken with the front camera. And last but not least, across all devices and platforms, there is no way for you to generate a depth map for an existing image.

Deep Possibilities

Despite the restrictions, we decided to first explore the power of depth for image editing, as depth data provides many new exciting creative possibilities:

If we have a depth map for a given image, our editing possibilities are increased dramatically. Instead of a 2D image, a flat plane of color values, we suddenly have a depth value for each individual pixel, which translates into a 3D landscape highlighting distinct objects in the foreground and a clear indication of background.

Depth-aware Editing

Instead of relying on color and texture differences to determine fore- and background, one could literally edit these regions individually. This allows adjustments like darkening the background while lightening the foreground, which makes portraits ‘pop’. If we were able to generate a high-resolution depth map, we could easily replace the AI currently used in Portrait and allow even more sophisticated creatives. Thanks to the new APIs, there are already some awesome iOS apps available that specialize in depth-based editing. One famous example is Darkroom with their “depth-aware filters”:

Depth of Field Effects

As a depth of field or bokeh effect was the initial motivation for Apple to incorporate depth sensing technology, it is one of the most obvious applications. Depth is crucial for such an effect, as the amount of blurriness of any given region directly depends on its distance to the camera lens.
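
To make the relation between depth and blur concrete, here is a simplified NumPy sketch that blurs each pixel more strongly the further its depth value lies from a chosen focal depth. It is only an illustration under stated assumptions: the depth map is normalized to [0, 1], the image is a single-channel float array, a box blur stands in for a real lens blur, and the names `blur_radius_map`, `box_blur`, `depth_of_field`, `max_radius` and `focus_range` are invented for this example.

```python
import numpy as np


def blur_radius_map(depth, focal_depth, max_radius=8.0, focus_range=0.1):
    """Per-pixel blur radius that grows with the distance from the focal depth."""
    distance = np.abs(depth - focal_depth)
    # Pixels within focus_range of the focal depth stay perfectly sharp.
    out_of_focus = np.clip(distance - focus_range, 0.0, None)
    scale = max(1.0 - focus_range, 1e-6)  # depth is assumed to lie in [0, 1]
    return np.clip(out_of_focus / scale, 0.0, 1.0) * max_radius


def box_blur(image, radius):
    """Box blur of a 2D float image with edge padding, via cumulative sums."""
    if radius <= 0:
        return image.copy()
    k = 2 * radius + 1
    padded = np.pad(image, radius, mode="edge")
    # Running mean along rows, then along columns.
    c = np.cumsum(np.pad(padded, ((0, 0), (1, 0))), axis=1)
    rows = (c[:, k:] - c[:, :-k]) / k
    c = np.cumsum(np.pad(rows, ((1, 0), (0, 0))), axis=0)
    return (c[k:, :] - c[:-k, :]) / k


def depth_of_field(image, depth, focal_depth, max_radius=4):
    """Pick, for every pixel, the blurred version matching its blur radius."""
    radii = np.rint(blur_radius_map(depth, focal_depth, max_radius)).astype(int)
    source = image.astype(np.float64)
    result = np.empty_like(source)
    for r in np.unique(radii):
        blurred = box_blur(source, int(r))
        result[radii == r] = blurred[radii == r]
    return result
```

Moving the focal plane through the image then simply means calling `depth_of_field` again with a different `focal_depth`.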

3D Asset Placement

As mentioned above, a depth map gives us a 3D understanding of the image. We’re able to tell if subject A is positioned in front of or behind subject B. This allows placement of digital assets like stickers or text in a ‘depth-aware’ fashion, but could also be used to apply ‘intelligent’ depth of field, e.g. a bokeh effect that ensures all faces are in focus.
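
A ‘depth-aware’ placement can be sketched as an alpha composite that hides the asset wherever the scene is closer to the camera than the asset itself. The snippet below is a minimal example under assumptions: smaller depth values mean closer to the camera, the image and sticker are float RGB arrays, and the function name `place_sticker` and its parameters are hypothetical.

```python
import numpy as np


def place_sticker(image, depth, sticker, sticker_alpha, top, left, sticker_depth):
    """Composite a sticker into the image, occluded by closer scene content.

    image:         (H, W, 3) float array
    depth:         (H, W) float array, smaller = closer to the camera (assumption)
    sticker:       (h, w, 3) float array
    sticker_alpha: (h, w) float array in [0, 1]
    """
    out = image.astype(np.float64).copy()
    h, w = sticker_alpha.shape
    region = out[top:top + h, left:left + w]          # view into the output
    region_depth = depth[top:top + h, left:left + w]

    # The sticker is only visible where the scene lies behind it.
    visible = (region_depth > sticker_depth).astype(np.float64)
    alpha = (sticker_alpha * visible)[..., None]

    region[:] = alpha * sticker + (1.0 - alpha) * region
    return out
```

Placing a text label at a depth between subject A and subject B would therefore let A cover the label while the label covers B.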

Enter Deep Learning

Motivated by the possibilities enabled by depth maps, we were wondering if we could bring this magic to any type of portrait image. We consulted existing literature on depth inference and found various papers¹ and articles on the topic, some of which even presented results that seemed sufficient for our use cases. In our case, we didn’t need accurate results, as in ‘this pixel is 30cm in front of the camera’; we were only interested in getting the general distance relations correct. For us, knowing that region A was slightly behind but definitely way in front of region B was enough to generate a visually pleasing effect, and by constraining our domain to portrait imagery, we were able to further reduce the task’s complexity.

Given our experience with deep learning and our current focus on introducing machine learning powered features to the PhotoEditor SDK, we immediately decided to tackle the new challenge with deep learning, or more specifically convolutional neural networks. Having a huge dataset of image and depth map pairs available made this choice even easier. We stuck to a system similar to our previous segmentation model but decided to put more emphasis on allowing the reuse of individual parts, as this would come in handy when adding additional features in the future. To achieve this, we created a new modularized neural network approach named Hydra, which will be presented in an upcoming blog post.

During development, we followed our tried and tested workflow of starting with a complex custom model, which is then tweaked and refined to match our performance requirements while maintaining the prediction quality we need. Once that was done, we had a fast and small model, trained on thousands of iPhone front camera selfies and capable of inferring high fidelity depth maps from a plain RGB image in under a second.

The Prototype

After creating a small model capable of inferring depth maps for any given portrait image, we immediately wanted to evaluate its performance in a ‘real-world’ environment. We decided to build a prototype that applies a depth of field effect to a portrait image, by using the model and its outputs. With our long-term goal of deploying the model to iOS, Android and the web in mind, we built the prototype using TensorFlowJS to explore this newly released library. Our browser demo consists of a minimal ‘Hydra’ implementation with individual modules, one for extracting features and one for the actual depth inference, which can both be executed individually.

While being optimized for performance and memory footprint, the trained weights of the model still add up to ~18MB, which we will improve by further fine-tuning or even applying pruning or quantization. Once the models are loaded, all further processing happens on the device though, so you may try out all the samples without worrying about your data plan.

Results

Seeing our vision come to life was quite a stunning experience. Suddenly our browser was able to perform a complex depth of field effect without the need for special hardware, manual annotations or anything else apart from our image. And the best part was manually moving the focal plane through the image, either by sliding or tapping on different regions. Although trained on ‘just’ selfies, the model handles turned heads, silhouettes and multiple people pretty well and isn’t as restricted to its domain as we initially expected.

And while our initial prototype is still weighing in at ~18MB, we’re certain to slim that down further in order to use the model in production. Performance-wise, we were very impressed with the TensorFlowJS inference speed. Even though everything is happening on the client side and is therefore dependent on the client’s hardware, we saw inference times below one second right off the bat, and these improved greatly after the initial run, as the resources were already allocated. While not being immediately helpful for the depth inference part, this allowed us to further confirm our theory behind Hydra: Re-running inference once the necessary resources on the machine have been allocated greatly increases performance and might even allow real-time performance after an initial setup time.

To summarise, we’re definitely eager to further explore the use of depth data in image editing and think we have found a way to overcome the access restrictions on different platforms and hardware with our custom model. Combined with our new Hydra approach we can see lots of potential features that will delight both our users and customers and we will keep you updated right here.

(1)
The papers we extracted most knowledge for our use case from were:
“Depth Map Prediction from a Single Image using a Multi-Scale Deep Network” (arXiv)
“Deeper Depth Prediction with Fully Convolutional Residual Networks” (arXiv)

**Thanks for reading! To stay in the loop, subscribe to our Newsletter.**

When Creativity meets A.I.

Eray — Thu, 16 Nov 2017 23:00:00 GMT

A new generation of A.I. algorithms, propelled by rising computational power, new hardware, and a shift in paradigms, made its first notable impact in the creative world: The works of Gatys et al. and Krizhevsky et al. have not only gathered considerable public attention but have helped apps like Prisma to be adopted and used by millions. I strongly believe that this is merely the beginning. **With the help of machine learning, we will fine-tune, simplify, and automate creative processes and ultimately empower new techniques for design and content creation.**

We’ve been following this topic for quite some time now and have spent considerable effort in researching the opportunities of deep learning for our PhotoEditorSDK. After more than a year of research and development, today, we’re finally bringing one of our apps to beta. **Portrait** combines supervised deep learning with the visual power of our SDK. In a nutshell, Portrait makes creating beautifully designed portrait images as easy as taking a selfie. You turn your selfies into movie poster-like portraits, with styles ranging from double-exposure photography to stencil art. One may consider it as the next iteration of what Apple and Google recently brought to market with their new camera features.

We’ve now come a long way and gained invaluable insights on our journey so far. Not only did we get our hands dirty with countless training sessions and refinements to the neural net, but our first-hand experience also helped us to set expectations right and to separate hype from substance. Most notably, it changed our product shaping process, making it more important than ever to foster strong ties between the product stakeholders and to share a common vision and goal everybody can get behind.

In the following I’d like to share the story of how we built the app and closed the gaps between roles of the stakeholders within this process.

Preface: Before Neural Networks were the Hot New Thing

My journey begins over ten years ago, while I was graduating in neuroscience. Back then, the idea of A.I. was just a vague promise. Artificial Neural Networks were too small, computers lacked the necessary power, and the results were certainly nice, but still too weak to compete with other traditional algorithms. Research felt stuck in tiny little specializations without really following a broader vision. Dazzled by its impracticability, my interest in Neural Networks slowly began to fade.

It took research on Neural Networks another six years to get back on my radar. At that time, I was leading several product developments at 9elements. When I learned about the work of DeepMind (now Google) I had a genuine feeling that this time, A.I. was ready for the limelight.

As we were in the course of building a library for image editing and computer vision, the PhotoEditorSDK, we realized how much neural nets could also affect the creative space, given their ability to abstract and formalize rules. What if there was a machine that could reproduce the common and dull tasks you have to do as an art director within a second? What if designers could get rid of repetitive and tedious activities that interrupt their creative flow?

But this topic isn’t something you’d learn in a week, obviously. Still, innovations cannot happen if you’re not willing to take a risk, so we decided to invest considerable time and resources into this technology.

From a product management perspective, this process is actually an anti-pattern: Usually, you wouldn’t want to start by finding the right purpose for a technology; instead, you’d find the right technology for a purpose. I still believe that this is essentially the right approach, but sometimes you have to abandon your best practices and take a swim in uncharted waters. Consequently, we asked Malte, one of our iOS engineers, to spearhead our research and take a deep dive into this topic. We decided to start off with image segmentation as the first process that we wanted to optimize through machine learning. Masking and clipping can sometimes be tedious tasks, and ultimately we wanted to reduce this process, which can take several minutes, to a single click.

Chapter 1: The Machine Engineer

Malte, who is a diligent engineer and (how convenient) a passionate photographer, started investigating some approaches that focused on image segmentation. You can read more about his journey in his article. Although he experimented with various neural networks and post-processing techniques, the resulting masks sometimes lacked the desired accuracy and wouldn’t have matched a user’s expectations. This was a first expected insight. As we want to deliver ready-to-use products to our customers that don’t need any complex tweaking, this was something we had to fix. Our problems originated mostly from our rather ambitious goal to segment any type of object within an image. That would have required training on vast amounts of data and scaling up the number of filters in our network. However, due to our on-device constraint, this would have killed our carefully crafted performance.

Therefore, we shifted from this generalist approach to a specialized network for images of a certain domain that the model can be applied to. In hindsight, this seems quite obvious, as our rather small model would never have been able to cope with the number of variations existing in ‘the real world’ anyway. So, we went back to the drawing board and started discussing which domain to focus on. That’s where we got stuck; we struggled to find an obvious trend in our customers’ use cases or known photography platforms.

It was actually during his summer holiday that Malte had the flash of genius. At a stop-over in Singapore, he noticed how the city was flooded with selfie-stick wielding tourists. The sheer number of selfies taken at any public place in Singapore left him astonished and he realised that he had just found the right domain. Selfies, and portraits in general, felt like an infinite data source and prime use case for our image segmentation algorithm. Back home, we decided to focus on selfies and portrait-like photography.

Malte started searching for portrait datasets and found a collection of roughly 2000 portrait images collected from Flickr. Those were a great starting point and after a few training runs, he already reached satisfactory results, as the model was now capable of capturing all available variations. At that point, we had a system at our hands that was able to segment portrait or selfie images in real-time on the device you’re capturing them with. This seemed like a great opportunity, but we didn’t want to stop just there. Releasing a prototype that can free a selfie from its background is nice, but doesn’t feel like something that would truly showcase how AI can make a difference in our creative process.

Chapter 2: The Art Director

This is where our Art Director Tommi, a renowned graphic artist and former sprayer, stepped in to explore what can be done with **a selfie, an accurate alpha mask and the image editing features from our PhotoEditor SDK**. When Tommi took the lead, I asked him to draft a vision, a creative direction for our app that combines all the tools and possibilities at our disposal.

Together, we started exploring portrait trends and unique imagery that would help us find a direction for our showcase. Soon, the walls of our meeting rooms and offices were plastered with inspirational works on portrait photography of all different kinds and styles. This visual catalogue kept inspiring us, although we weren’t sure on which style to settle in the end. It was when we could hardly find any more free spots on our walls and after looking at them for countless times that the idea struck:

What if we could enable users to turn their portraits into what we saw on these walls? And this without actually having the design expertise they would normally need to do so.

Instead of brooding over a completely new form of portraits, we could take all these styles and instantly realize them with the technology we had. From that point on, we flipped our process upside down. Instead of thinking about what the technology is capable of, or identifying a problem worth solving, we aimed for the creative output that we wanted our app to produce. While we started our venture with a technology, we now had visual results that we could work towards. The main question shifted from “What is our technology capable of?” to “How can we achieve this visual output with our technology?”

Tommi designed five lead graphics, so our team of engineers and designers could grasp what we ultimately wanted to achieve, using only a selfie and the features of our SDK.

Chapter 3: Closing Ranks

With such a clear vision for our app, we started separating the wheat from the chaff, categorizing the portraits and understanding which operations of our SDK we had to combine, assemble and enhance to create these visuals.

What followed was a remarkable interplay across multiple stakeholders of our team. While we were always very vocal proponents of building strong relationships between product stakeholders, **the introduction of the AI layer actually glued our team further together**.

Our designers started to embrace the engineering perspective, playfully identifying both opportunities and constraints through the tech layer. At the same time, our engineers embraced the design vision and formalized it into code. Let me give you some examples:

While thinking of the UI, we understood that the transformation of a selfie into a graphical artwork required immediate feedback for the user, so they can find a pose that works best with the respective artwork. Consequently, we optimized our networks for real-time processing, a true challenge that needed strong expertise in both iOS engineering and neural net architecture.

Our designs and recipes in turn had to be tweaked to gracefully allow for errors of our AI, because an error rate of 3% can still produce undesired artefacts and mask inaccuracies. We did that by using techniques that beautifully fringed edges of the portrait.

Altogether, the close cooperation, as well as countless meetings, feedback loops, and the continuous fine tuning of the code and underlying recipes is what brings us here.

All of this wouldn’t have happened if we hadn’t taken the risk to invest in a rising technology in the first place. And all of this wouldn’t have been possible if it weren’t for the exemplary cooperation between all the stakeholders. Portrait is a showcase of how technology can inspire and tie a team together. This, in the end, is absolutely necessary if we want to achieve the leaps we expect with AI. If you want to impact the creative space by introducing an AI layer to it, your engineers have to think like designers, or at least deeply understand their work.

The Road Ahead

Portrait is a first showcase and one step of many in our venture to wire several AI aspects deep into our SDK. On our journey, we’ve identified many more opportunities where we can help broader audiences to make creative work and design more accessible. Of course, we will also improve our models and networks with better and more data, always keeping in mind the aesthetic and visual output we’d like to achieve. We’ll keep you posted on our updates and next ventures into this exciting new era.

If you liked what you read, I’d encourage you to check out Portrait and our PhotoEditor SDK!

Thanks to my co-authors Malte & Felix!

Thanks for reading! To stay in the loop, subscribe to our Newsletter.

Deep Learning for Photo Editing

Malte — Thu, 20 Apr 2017 00:00:00 GMT

Deep learning, a subfield of machine learning, has become one of the best-known areas in the ongoing AI hype. Having led to many important publications and impressive results, it is applied to dozens of different scenarios and has already yielded interesting results like human-like speech generation, high accuracy object detection, advanced machine translation, super resolution and many more.

There is a steady flow of papers and publications that describe the latest advances in network design, compare existing architectures or describe unseen approaches leading to even better results than the current state-of-the-art. At the same time more and more companies and developers jump on the deep learning bandwagon and deploy the ideas and architectures to real world production systems.

This article describes our approach to applying deep learning to our image editing product, the struggle we had with finding the right architecture and the experiences we made while developing a system that can be deployed to mobile devices.

Our vision

At 9elements, we’ve had various AI topics on our radar for quite some time now. With deep learning, we finally found a tremendous opportunity for our product, the PhotoEditor SDK: We believe AI-based algorithms could be the ideal approach to boost our users’ creative output and simplify complex design tasks.

Given the hype and results, we decided to dip our toes into deep learning, which quickly led to some research regarding the most common challenges in interactive image editing. We soon identified image segmentation as a major challenge that could be solved using deep learning and started investigating further.

If you have ever tried to select a distinctive region in a picture, say your best friend on the beach or your cute pet, you know the struggle of carefully moving your cursor along the object’s outer bounds until you eventually miss a part or accidentally select something that doesn’t belong to the object. Professional image editing tools can be quite helpful in accomplishing such tasks, but on the one hand, they aren’t available on your mobile device, where you take and publish the images, and on the other hand, can be quite expensive and usually require some hands-on time, before you can produce anything usable.

Our goal was to finally remove the hassle from image clipping. We wanted to reduce the required user interaction to a minimum while offering an intuitive solution that doesn’t require any manuals or online courses. On top of that, as we provide native SDKs for web, iOS, and Android, the solution had to be deployable to all of these systems without relying on a powerful backend or being limited to certain features.

Having formulated our rather ambitious goals, we started our journey by looking into the most common research papers and classic techniques for image segmentation. We then focused on the deep learning part and quickly had an idea on how to design our approach.

Our journey

Image segmentation, the process of classifying each pixel in a picture as either foreground or background, is a popular research field and still perceived as quite challenging due to the complicated nature of the task. We humans are extremely well trained at perceiving scenes, identifying objects and making logical assumptions based on the visual input we receive.

For a long time, all approaches were based on colors, edges, and contrast and relied heavily on fine-tuned parameters, which had to be adjusted to every new scenario. That changed in 2012 when Krizhevsky et al. presented astonishing object classification results on the ImageNet benchmark using a neural network. Suddenly a system was able to classify objects with unprecedented accuracy and no need for any human fine-tuning. The neural network was ‘just’ trained on the dataset by seeing images combined with their corresponding labels and adjusting its internal representation until it couldn’t learn any further.

As we had already decided on using deep learning for our task, using a neural net clearly was our way to go. We started by examining the existing solutions and approaches, created our first prototype based on our findings and refined our approach and implementation until everything met our expectations.

Scene Labeling

The first approaches we examined focused on segmenting the whole image. This is a common task called scene labeling or semantic labeling, which allows robots and other systems to understand a scene. The goal is to assign each pixel in an image to a particular object category. An example could be a self-driving car that searches for the road and tries to determine whether any pedestrians are crossing the street. Such a car would try to classify each pixel as road, pedestrian, tree, traffic sign, etc.
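To make the per-pixel classification concrete, here is a minimal sketch in plain Python. It assumes that some model has already produced one score per category for every pixel; the category list and the scores are invented for illustration and are not taken from our system.

```python
# Toy scene labeling: every pixel gets the category with the highest score.
# CATEGORIES and example_scores are made-up illustration values.
CATEGORIES = ["road", "pedestrian", "tree", "traffic sign"]

def label_pixels(scores):
    """scores[y][x] holds one score per category for that pixel.
    Returns a 2D grid of category names (a per-pixel argmax)."""
    labels = []
    for row in scores:
        label_row = []
        for pixel_scores in row:
            best = max(range(len(CATEGORIES)), key=lambda i: pixel_scores[i])
            label_row.append(CATEGORIES[best])
        labels.append(label_row)
    return labels

# A 2x2 "image" with four category scores per pixel.
example_scores = [
    [[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1]],
    [[0.1, 0.1, 0.7, 0.1], [0.1, 0.2, 0.1, 0.6]],
]
example_labels = label_pixels(example_scores)
```

In a real network the scores would come from the final layer, and the grid would have the resolution of the image rather than 2 by 2.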

While offering lots of possibilities, the existing solutions lacked the accuracy we needed to provide visually pleasing image segmentations. For a self-driving car, it doesn’t matter if the ‘person’ region for some pedestrian accurately covers the person’s outline. However, for us it does.

To overcome these issues, we experimented with postprocessing techniques that used the segmentations we found as a base for further optimisations. This led to our first approach, where we would initially segment the entire image using a convolutional neural network, offer the found regions as selectable regions to the user and then try to refine the user’s selection using conventional image segmentation techniques to find the best possible mask.

While it already yielded some useful results, the system did not quite match our requirements. If the initial segmentation was too coarse or off in critical regions, the user could never select an area that would lead to the desired segmentation.

Image segmentation based on user inputs

We went back to the drawing board and searched for other approaches that would fit our use case. It didn’t take long, and we stumbled upon Deep Interactive Object Selection, a paper that presents an interactive system which creates image segmentations based on user clicks. It looked like a good fit for our requirements, and we updated our existing system to generate fake user inputs and train on combinations of these inputs and images.

To train the net, we used the publicly available COCO dataset, which contains around 300,000 images with more than 2 million annotated object instances. To handle the amount of data, we limited our training data to a subset of the full dataset. This subset was made up of images that contain objects from certain categories and cover a minimum area within the image. As we generated the inputs artificially by adding clicks on the object mask, we could generate as much training data from the COCO subset as we wanted. After some experiments, we settled on three different strategies to create user inputs and trained the net with roughly 300,000 training records.
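The following sketch illustrates the two ideas from this paragraph: filtering objects by the area they cover, and generating artificial clicks from an object mask. It is not one of our three strategies, which are not described here. The function names, the random sampling and the optional background clicks are assumptions made for illustration.

```python
import random

def covers_minimum_area(mask, min_fraction):
    """mask is a 2D list of 0/1 values (1 = object pixel).
    Returns True if the object covers at least min_fraction of the image."""
    total = sum(len(row) for row in mask)
    object_pixels = sum(sum(row) for row in mask)
    return object_pixels / total >= min_fraction

def simulate_clicks(mask, num_positive=1, num_negative=0, seed=0):
    """Sample artificial user clicks as (x, y) tuples.
    Positive clicks lie on the object; negative clicks (an assumption,
    not mentioned in the article) lie on the background."""
    rng = random.Random(seed)
    inside = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value == 1]
    outside = [(x, y) for y, row in enumerate(mask)
               for x, value in enumerate(row) if value == 0]
    positive = rng.sample(inside, min(num_positive, len(inside)))
    negative = rng.sample(outside, min(num_negative, len(outside)))
    return positive, negative

# A tiny 3x3 mask whose object occupies the top-right 2x2 block.
example_mask = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
positive_clicks, negative_clicks = simulate_clicks(example_mask, 2, 2, seed=1)
```

Because the clicks are sampled rather than recorded, calling `simulate_clicks` with different seeds yields arbitrarily many training records from a single annotated object.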

The masks generated by the updated system were quite impressive already. The neural net could infer which object the user wanted to mask in the image, just by looking at raw pixel data and the user’s clicks on the object. Happy with the first results, we tried to tackle the next hurdle. Before diving deeper into optimizing the neural net, which is a rather error-prone and time-consuming process, we wanted to deploy the net to a mobile device. We wanted to make sure that such a tool would be usable on any device and that the performance would match our expectations.

Neural nets on mobile devices

Neural nets are sets of operations, executed in a specific order and based on millions of parameters. Therefore one “run” of such a net requires a lot of computation power, as millions of calculations have to be carried out. At the same time, the millions of parameters need to be deployed, as they represent the model or the representation the neural net has learned during training. So, to deploy our neural net, we had to meet these two requirements on an iPhone.

The first requirement, computing power, was thankfully solved by Apple. With the latest iOS version, a specialised framework called Metal Performance Shaders was introduced. It offers all the required operations and is tailored to run them on the phone’s GPU, which is fast and efficient. To execute our net using the framework, we had to translate our TensorFlow network code to Swift and rebuild the net’s architecture using Metal Performance Shaders operations. Sadly, Apple only supports a subset of today’s common neural network operations, so we were forced to write some shader code to reconstruct the full network.

The second requirement, extracting the trained parameters and deploying them to the device, was much easier. We just had to restore our previously trained model from a TensorFlow checkpoint, write all trained variables into a file and deploy this file with our iOS app. When needed, the iOS app would load the file into memory, and our network implementation would use the given parameters to run an inference pass.
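As a rough illustration of this export step, the sketch below writes named parameter lists into a flat binary stream and reads them back. The file layout (name length, name, value count, little-endian 32-bit floats) and the variable names are hypothetical and not the format we shipped. Restoring the values from a TensorFlow checkpoint is left out.

```python
import io
import struct

def write_parameters(variables, stream):
    """variables maps a variable name to a flat list of floats.
    Hypothetical layout per variable: uint32 name length, UTF-8 name,
    uint32 value count, then the values as little-endian float32."""
    for name, values in variables.items():
        encoded = name.encode("utf-8")
        stream.write(struct.pack("<I", len(encoded)))
        stream.write(encoded)
        stream.write(struct.pack("<I", len(values)))
        stream.write(struct.pack(f"<{len(values)}f", *values))

def read_parameters(stream):
    """Inverse of write_parameters: returns a dict of name -> list of floats."""
    variables = {}
    while True:
        header = stream.read(4)
        if not header:
            break  # end of stream reached
        (name_length,) = struct.unpack("<I", header)
        name = stream.read(name_length).decode("utf-8")
        (count,) = struct.unpack("<I", stream.read(4))
        values = list(struct.unpack(f"<{count}f", stream.read(4 * count)))
        variables[name] = values
    return variables

# Round trip through an in-memory buffer instead of a real file.
trained = {"conv1/weights": [0.5, -1.25, 2.0], "conv1/bias": [0.0]}
buffer = io.BytesIO()
write_parameters(trained, buffer)
buffer.seek(0)
restored = read_parameters(buffer)
```

On the device, the reading side would be implemented in Swift and would copy the float values straight into the buffers used by the network layers.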

Having met the two requirements, our network worked fine on an iPhone. We added the postprocessing operations and were able to segment images by a single tap without the need for a backend or any network communication. But there were some caveats.

While our neural net was a very common and widely used network, it was huge in terms of trainable variables. A trained model contains ~134 million parameters, which translates to about half a gigabyte of data that needs to be deployed with the app. This was obviously a showstopper for a mobile image editing app, as we couldn’t justify a 500 MB download just to be able to segment images with your finger.
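The half-gigabyte figure follows directly from the parameter count if we assume that every parameter is stored as a 32-bit float, i.e. 4 bytes. That storage format is an assumption of this small calculation:

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters are stored as 32-bit floats

def model_size_megabytes(parameter_count, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Approximate size of the raw parameters in megabytes (10^6 bytes)."""
    return parameter_count * bytes_per_parameter / 1_000_000

large_model_mb = model_size_megabytes(134_000_000)  # 536.0
```

Any file headers or variable names add a little on top, but the raw float values dominate the download size.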

Furthermore, the results were still very coarse. If your colleague waved their arms in an image, the net could usually detect the torso, head and maybe the legs, but almost never the arms or hands. Fixing this with our postprocessing algorithms wasn’t much of an option, as it would have required lots of computing power. And why bother using a neural net with millions of parameters if we fall back to conventional image processing techniques anyway?

So all in all, we had already learned a lot: Our approach of processing user inputs combined with raw image data as neural net input led to usable outputs, although quite coarse. Deploying such a net to mobile devices was possible, and the performance was good enough for using it in an interactive tool. The next step was to optimize the system to fix the parameter size and get finer results.

Combining SqueezeNet and SharpMask

We decided to tackle the network size first, as laying a proper foundation for optimizing the coarseness seemed like a sane thing to do. When looking for small nets with few parameters and fast inference, it’s hard not to stumble across the SqueezeNet architecture by Iandola et al., which was published in November 2016. It met our use case, didn’t use any exotic operations that would be hard to implement on mobile, and the results looked promising, so we removed the original network from our system and replaced it with an altered SqueezeNet implementation. To our surprise, it worked almost right away. We had to tweak our training pipeline, and the results differed slightly, but all in all the small network with only ~5 million parameters matched the performance of our previous behemoth with ~134 million parameters. We quickly updated our conversion script and found that our deployable model file had shrunk from ~500 MB to 2.9 MB. What a happy day!

Having solved the network size issue, we went ahead and thought about increasing the precision of our predictions. A loss of resolution is unavoidable in convolutional neural networks, as later layers acquire a larger “view” of the inputs by reducing their input size with so-called “pooling” layers. These layers take, for example, four values from the previous layer and merge them into a single one. As a result, our new SqueezeNet-based system created a 32 by 32-pixel image mask from a 512 by 512-pixel input image. Until then, we had simply scaled these masks up using a transposed convolution. This allowed the net to learn how the upscaling worked best, but the fine details from the initial input image were already lost at this point.
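The effect of pooling on resolution can be sketched in a few lines. The example uses 2 by 2 max pooling, which merges four values into their maximum, and assumes four halving stages to get from 512 to 32 pixels. The number and kind of downsampling stages is an assumption here; any combination of layers that reduces the size by a factor of 16 gives the same mask size.

```python
def max_pool_2x2(grid):
    """Merge each non-overlapping 2x2 block of a 2D list into its maximum.
    A trailing odd row or column is dropped."""
    height, width = len(grid), len(grid[0])
    return [
        [max(grid[y][x], grid[y][x + 1], grid[y + 1][x], grid[y + 1][x + 1])
         for x in range(0, width - 1, 2)]
        for y in range(0, height - 1, 2)
    ]

def output_size(input_size, halving_stages):
    """Edge length after the given number of stages that each halve the size."""
    return input_size // (2 ** halving_stages)

pooled = max_pool_2x2([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
])  # [[6, 8], [14, 16]]
mask_size = output_size(512, 4)  # 32
```

Each pooled value only records that something was present in its block, not where, which is why the fine outline of an object cannot be recovered from the 32 by 32 output alone.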

We remembered Facebook’s SharpMask system, introduced in summer 2016, and revisited the accompanying paper. Their refinement modules seemed like a good fit, as they were able to gradually incorporate features from lower levels, which have a higher resolution, into the coarse outputs. We adopted the idea and altered the refinement modules to take the final SqueezeNet output. The modules then combined the coarse SqueezeNet output with the pooling layers’ intermediate results and were able to refine the result. This increased our model size and the computation costs by a fair amount, but led to much finer and more detailed results.
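The general shape of such a refinement step can be sketched as follows: the coarse output is upsampled and then fused with a feature map of twice the resolution. This is a heavily simplified stand-in. In SharpMask and in our modules the fusion is done with learned convolutions, whereas this toy version uses nearest-neighbour upsampling and a fixed weighted average, and the `skip_weight` parameter is hypothetical.

```python
def upsample_2x(grid):
    """Nearest-neighbour upsampling: every value becomes a 2x2 block."""
    result = []
    for row in grid:
        wide_row = [value for value in row for _ in range(2)]
        result.append(wide_row)
        result.append(list(wide_row))
    return result

def refine(coarse_mask, skip_features, skip_weight=0.5):
    """Upsample the coarse mask and blend it with a feature map of twice
    the resolution (a fixed blend instead of learned convolutions)."""
    upsampled = upsample_2x(coarse_mask)
    return [
        [(1 - skip_weight) * up + skip_weight * skip
         for up, skip in zip(up_row, skip_row)]
        for up_row, skip_row in zip(upsampled, skip_features)
    ]

coarse = [[1.0, 0.0]]                # 1x2 coarse prediction
skip = [[1.0, 1.0, 0.0, 0.0],        # 2x4 higher-resolution features
        [1.0, 0.0, 0.0, 0.0]]
refined = refine(coarse, skip)
```

Stacking several of these steps, each using the intermediate result of an earlier layer, gradually brings the mask back towards the input resolution.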

Once we had settled on our architecture, we started an extensive training run, in which we tested more than one hundred different variations of hyperparameters, architectural details, and resizing techniques. After evaluating the results, we selected the variation that offered the best compromise between accuracy and inference speed/model size.
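One simple way to express such a compromise in code is to set budgets for inference time and model size and then pick the most accurate variation within those budgets. This is only a sketch of the idea, not the selection procedure we used. The `Variant` fields, the budgets and all numbers below are placeholders invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    accuracy: float       # higher is better, e.g. a score in [0, 1]
    inference_ms: float   # time for one inference pass
    size_mb: float        # size of the deployable model file

def pick_variant(variants, max_inference_ms, max_size_mb):
    """Return the most accurate variant that fits both budgets, or None."""
    eligible = [
        v for v in variants
        if v.inference_ms <= max_inference_ms and v.size_mb <= max_size_mb
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda v: v.accuracy)

# Placeholder values, not measurements from our training run.
candidates = [
    Variant("small", accuracy=0.70, inference_ms=40.0, size_mb=3.0),
    Variant("medium", accuracy=0.78, inference_ms=90.0, size_mb=6.0),
    Variant("large", accuracy=0.80, inference_ms=400.0, size_mb=30.0),
]
chosen = pick_variant(candidates, max_inference_ms=100.0, max_size_mb=10.0)
```

With these placeholder budgets the "large" variant is excluded despite its accuracy, which mirrors the kind of trade-off an interactive mobile tool forces.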

Our results and prototype

Having managed to fix all the issues, we were eager to see how the whole system performed on a mobile device with limited computing power and inputs. We updated our mobile app to use the new network architecture and the freshly trained model to compare the refined system to our previous approach. The results were amazing. When selecting objects that matched the categories of our training data and were fully visible in the image, we were able to generate fine-grained selection masks with just a single tap. More complex or larger objects required a few more taps, but we could always find a selection mask for our object that was at least a solid starting point for further optimizations.

We decided to build a more polished prototype based on our existing img.ly iOS app. This app uses our PhotoEditor SDK to offer advanced image editing including focus and filter operations. As we were now able to create masks based on objects in the image we quickly settled on enhancing our filter and focus tools with selective masking.

Retrospective

Looking back at our journey into deep learning, it was one of the more frustrating yet fascinating ones. The sheer amount of possible applications is exciting, and once you get the hang of training something on your data, you immediately want to start experimenting with new things. On the other hand, you’re usually building huge black boxes with millions of float values, which makes debugging a pain. Especially when trying to replicate an already implemented architecture on other platforms, this can quickly become rather frustrating. If your outputs don’t match the expected results, your only option is to repeatedly go over your code, check all parameters and hope you stumble upon the wrong number somewhere. But once you manage to set everything up and start seeing some good results, you instantly want to tweak and optimise the bits and pieces of your system.

Overall, deep learning is a pain to debug, but it yields great results and opens up a new field of photo editing applications. We’ll definitely keep exploring the new possibilities of applying these techniques in our product. Stay tuned for upcoming features!

Thanks for reading! To stay in the loop, subscribe to our Newsletter.
