Mirko – IMG.LY Blog

Make Your Docs Agent-Ready: Compiling MDX into Markdown

Mirko — Wed, 11 Mar 2026 14:38:44 GMT

TL;DR: The 7x Efficiency Gain

We rebuilt our documentation pipeline to treat AI agents as a first-class audience. By natively compiling MDX into clean, fully-resolved Markdown—rather than heavy HTML or unresolved source code—agents and LLMs can now ingest our docs 7x faster and with far higher accuracy.

Test it yourself:
Request our docs with the Markdown accept header to see the agent-optimized view:

curl -s -H "Accept: text/markdown" https://img.ly/docs/cesdk/js/settings-970c98/

The Documentation Mismatch: Humans vs. Agents

If you build developer tools today, you have two distinct audiences: human engineers and the AI agents they use to write code. Recently, we realized we were treating them exactly the same.

Our docs pipeline is built on MDX and optimized for the browser. But when an LLM tries to ingest a page, it has to wade through navigation chrome, layout wrappers, and raw JSX. We didn’t just need a “clean text” mode; we needed a fundamentally different architecture.

Humans vs. Agents: A Comparative Overview

Feature	Humans (Browser)	Agents (Context Window)
Primary Format	HTML + CSS + JS	Clean Markdown
Navigation	Visual UI (Sidebar/Tabs)	Explicit links + hierarchy
Context	Implicit (Site layout)	Explicit (Frontmatter + headers)
Constraints	Performance / Core Web Vitals	Token / Context Window budgets
Payload Size	~222 KB	~31 KB (7x smaller)

The Problem: Why HTML and Raw MDX Fail Agents

1. HTML is expensive “Chrome”

A documentation site is an application. Even static sites ship navigation chrome, scripts, and styling hooks. For AI, this is “token noise.” You are paying for bytes that provide zero value to the LLM and forcing the agent to reconstruct a hierarchy that you already had at authoring time.

2. Raw MDX is unresolved source code

Serving raw MDX files doesn’t solve the problem either. MDX is for maintainers—it is full of imports and unresolved dependencies. Our documentation often pulls code from external, tested repositories:

## Initialize the Engine
<CodeBlock file="examples/getting-started/src/index.ts" lines="12-24" />

In the browser, this renders beautifully. In raw MDX, it’s a pointer to a file the agent cannot see. The actual code isn’t there.

The Solution: Treat Markdown as a Compilation Target

We stopped thinking about this as “exporting text” and started treating it as what it really is: a second build target.

Old Pipeline: MDX (Source) → HTML (Browser View)
New Pipeline: MDX (Source) → HTML AND MDX (Source) → Markdown (Agent View)

Why we didn’t just “convert” HTML

We tried rendering to HTML and then running a converter. It failed at code blocks. Syntax highlighting adds spans and wrappers around tokens; reversing that into clean, trustworthy code is nearly impossible.

<pre><code class="language-js">
  <span class="token keyword">const</span>
  <span class="token variable">engine</span>
  ...
</code></pre>

By working at the AST (Abstract Syntax Tree) level before rendering, we avoid this entirely.

Implementation: Transforming MDX at the AST Level

Rather than trying to reconstruct meaning from rendered markup, we preserve it directly. We built a remark plugin that transforms MDX at the MDAST level.

import { remark } from 'remark';
import remarkMdx from 'remark-mdx';
import remarkStringify from 'remark-stringify';
import { remarkTransformForExport } from './remarkTransformForExport';

const processor = remark()
  .use(remarkMdx) // Parse MDX/JSX nodes
  .use(remarkTransformForExport, {
    baseUrl: '[https://img.ly/docs/cesdk/](https://img.ly/docs/cesdk/)',
    paths: resolvedPathMap,
  })
  .use(remarkStringify); // Back to markdown

const markdown = await processor.process(mdxContent);

The Component Contract

Every MDX component must define how it exports to the agent view. We colocate the transform directly with the component: Aside.astro (Human UI) ↔ Aside.toMarkdown.ts (Agent logic).

Example: Aside Component → Blockquote

Input: <Aside title="Pro Tip">Use the basePath...</Aside>
Output: > **Pro Tip:** Use the basePath...

Example: Resolving Code References This is the most critical transform. Instead of a file pointer, the agent gets the actual, inlined code block fetched during the build process.

When you strip the sidebar and footer, you lose context. We reintroduce “navigational chrome” as text-native primitives at the top and bottom of every Markdown file:

---
title: 'Working with Filters'
platform: 'react'
url: '[https://docs.example.com/react/guides/filters/](https://docs.example.com/react/guides/filters/)'
---

> You’re reading the React docs. For the full corpus, see llms-full.txt.

## **Path:** [Home](https://docs.example.com/) > [Guides](https://docs.example.com/guides/)

[Self-contained Page Content]

---

## Continue Reading

- [Color Adjustments](https://docs.example.com/react/guides/color-adjustments/)
- [API Reference](https://docs.example.com/api/)

Lessons for Scaling Agentic DX

If you maintain docs at scale, “optimizing for bots” is no longer optional—it is the new standard for Developer Experience (DX).

Don’t serve HTML and hope: Reverse-engineering structure is hard for AI. Give it the structure directly.
Every component needs an identity: If a component carries meaning, it needs a Markdown equivalent. If it’s just layout, unwrap it.
Resolve everything: Assume zero ambient context. Links must be absolute, and code must be inlined.
Content Negotiation: Use the Accept: text/markdown header. It is becoming the industry standard for tools like Claude Code and OpenCode.
The Single-File Corpus: In addition to per-page files, generate a llms-full.txt that concatenates everything. Some agents prefer one large fetch over a crawl.

By acknowledging that AI agents are a primary consumer of our documentation, we’ve made our SDK significantly easier to integrate. The future of docs isn’t just “readable”. It’s “storable and traversable.”

Note: We built this for CE.SDK. The implementation uses Astro and Vercel, but the approach is framework-agnostic.

IMG.LY Research: AI-based Generative Design Editing

Mirko — Tue, 16 Jul 2024 10:44:05 GMT

Generative AI is transforming the tech landscape, finding applications in virtually every field. At IMG.LY, we’re exploring how these advancements can revolutionize creative workflows. This article presents a research project, where we integrate Large Language Models (LLMs) with our flagship product, CreativeEditor SDK (CE.SDK), to enable natural language-driven design edits.

Our flagship product, CreativeEditor SDK (CE.SDK) allows for advanced creative workflows for countless use cases in industries ranging from print to marketing tech. Most use cases can be realized with the out-of-the-box feature set, but it also exposes a best-in-class API, called Engine API, to build complex custom workflows with designs and videos.

In this article, we will showcase how to combine our CE.SDK Engine API with LLMs to edit designs with natural language.

Introduction to LLMs

Generative AI is often associated with chatbots, but its capabilities stretch further. It is a versatile text processor that can transform any textual input into various structured outputs. This adaptability is due to its training on diverse textual patterns, allowing it to support a wide range of text-to-text applications beyond just generating conversational prose.

Carefully crafting an input text (prompt) in a way that instructs the LLM to output a specific structured output format allows us to use LLMs to solve almost any arbitrary text-based task.

Crafting prompts is an art in itself. When ensuring that we receive the required output we need to adhere to the following steps:

Consider what type of data our model was trained on to ensure the correct formatting of our input text. The most basic example is using English as our “main prompt language” since most LLMs are mainly trained on English text samples.
All necessary problem-specific information to solve the task needs to be included in the prompt. While LLMs often possess an inherent understanding of the world inferred from the vast amount of text they are trained on, they may not know much about our specific problem. Furthermore, LLMs have the notorious tendency to hallucinate, that is fill in missing context with incoherent or incorrect information. To ensure the best performance, the input to the LLM must provide as much context as possible.
Finally, we need to instruct the LLM well enough to output text-based data in a format we can then parse and process.

Human vs. AI Workflows for Executing Design Tasks

We started this project with the vision to use generative AI to magically handle requests like these (in increasing order of complexity):

Make the logo bigger
Translate this design into German
Adopt this template to our brand colors and brand assets
Transform this Instagram story portrait design into a landscape YouTube thumbnail

When trying to delegate a task to an AI, it’s best to start by thinking about how these tasks are currently solved by humans.

Let’s walk through how a human would complete a task such as
Make the logo bigger:

Humans would visually scan the design and automatically segment it by its elements such as objects, backgrounds, or text.
Humans would then read and comprehend the task “Make the logo bigger”
Finally, users would use their existing knowledge of how to move and interact with design software to fulfill the task by manipulating the individual elements in the design.

Based on these considerations we can extract the implicit knowledge necessary to fulfill a task and make it explicit for the benefit of our LLM.

Design Representation:

To enable the LLM to understand and manipulate a design, it is essential to provide a representation of the design. This can be either textual or a mix of textual and visual (if the LLM has vision capabilities) data. While supplying the current design as a raster image to the LLM is trivial, serializing a CE.SDK design into a textual format requires a custom serialization process. The textual representation is important since it allows the LLM to identify, address, and comprehend the different components of the design effectively.

Refer to Appendix: Use-Case dependent serialization of CE.SDK Designs for a more in-depth explanation of how to accomplish this.

Editing Protocol:

LLMs do not interact with design software using traditional human interfaces like a mouse, keyboard, or visual feedback. Therefore, we need a specific protocol for the LLM to propose changes to the design. We have developed a method where we pass a textual representation of the design to the LLM as part of the prompt so that the LLM can indicate changes to the design by returning a modification of this representation.

Practically, this means that if we pass in an element such as <Image id="1337" x=”100” y="100" .../>, the LLM can change those x and y attributes by simply returning <Image id="1337" x="0" y="0" .. /> inside its output text. Since we can identify the design element that was changed using the ID attribute, we can then calculate the programmatic changes that need to be applied to the design, like in this case engine.block.setPositionX(1337, 0) and engine.block.setPositionY(1337, 0).

Refer to Appendix: Parsing and transforming LLM response for a deeper look into this topic.

Based on what we have learned, we assemble a workflow, with the LLM as the center point that allows the LLM to execute design-related tasks on any of our CE.SDK designs. This workflow can be divided into two sub-tasks: Composing an input text (prompt) with all necessary context and parsing and applying the output of the LLM to the CE.SDK design.

Composing the Input Text

As seen in the graphic above we first compose the input text based on different components to provide the model with all necessary context to fulfill the user’s editing request: This includes a general text to “instruct” the model, the actual request that the user entered, an exemplary design representation to explain our output format to the model and a representation of the currently edited design.

The general, static text to “instruct” the model is composed of the following three parts.

What we are trying to achieve in general:
"You are an AI with expertise in design, specifically focused on XML representations of designs".
Output format instructions:
Your responses should only contain one XML document. Ensure that you do not introduce new attributes to any XML elements. You can change image elements by setting the alt attribute, which then will be used to search Unsplash for a fitting image. A sample alt text is “A mechanic changing tires with a pair of beautiful work gloves on”.
Additionally, pay close attention to the layout: verify that no elements in the XML document extend beyond the page boundaries. This constraint is critical for maintaining consistency and accuracy in XML formatting. Always double-check your XML output for these requirements.
The actual user request:
e.g., "Make the logo bigger"

By including a textual representation of a comprehensive example design we can show the model which layer types are available as well as which properties of those layers can be manipulated.

The model now has all the necessary context and building blocks to respond to user requests in a format that we can process downstream.

Applying the Output Text

In the first step, we scan the output text for an XML-like document and if we find one, attempt to parse it.

This will yield a structured data object we can compare to the one we passed in and calculate which elements have been modified, added, or removed.

The resulting change set can then be translated into specific calls to the CE.SDK Engine API to change the current design.

Issues faced

Latency: One issue with state-of-the-art models is their big latency: Each LLM response contains approximately 1000 tokens. That means each request takes 7-45 seconds (depending on the model) to complete. This long delay may be unacceptable for some user experiences. However, we see this issue as transitory and expect upcoming models to have much smaller latency while maintaining their capabilities.

Pricing: Each request/response with GPT-4 turbo as a backing model costs around 5 cents and restricts some use cases. We also expect the pricing to drop significantly. The new GPT-4o model for example reduces the price by half.

Hallucinations: LLMs do not always follow the instructions properly and, e.g., produce output that is not parsable. Hallucinations directly correlate with the capabilities of the model and this issue is not apparent at current state-of-the-art LLMs like e.g GPT-4/GPT-4o.

Conclusion

We present a novel and adaptable approach to use Generative AI and LLMs specifically to interact with IMG.LY’s CreativeEditor SDK. We showcase how this technology can be used to execute common design requests on arbitrary CE.SDK Designs. The proposition that LLMs can understand textual representations of visual elements was by no means obvious. This research project has revealed that it is very well within the scope of LLMs to translate instructions from a visual semantic context to its textual representation and back. This invites more inquiries into LLMs as assistants for tasks with a heavy visual component such as design.

While further research is needed to make this technology available in production environments, we are confident that Generative AI-based editing will play a big role in the future of Graphics and Video editing.

Appendix: Use-case Dependent Serialization of CE.SDK Designs

LLMs work based on “tokens” which are equal to words. However, a design, like e.g a poster design or a social media graphic, is highly visual. That means that we need a way to convert a design into text, a way to serialize it. Our CE.SDK engine can serialize an existing scene using our engine.block.saveToString method. However, this serialization contains a huge pile of information that is not necessary to do edits inside the file. LLMs are priced by token and their speed is also relative to the number of tokens the input and output have. Thus, the number of tokens should be reduced.

We looked at several ways to convert the current state of the design into a textual representation. Since GenAI is trained on a lot of (X)HTML which has an XML-like format, we decided to serialize any designs into a tag-based XML-like format.

The IMG.LY editor internally refers to design elements like images or texts as “blocks”. These blocks are uniquely identifiable and addressable using a numeric ID. We use this ID to be able to identify a serialized design block in the input and output of the LLM. Example: <Image id="12582927" x="0" y="0" width="800" height="399" />

For each of the CE.SDK block types like e.g “Text” or “Graphics” (representing images or vector shapes), we will the CE.SDK Engine API to query very specific data from the block. That means that for example we only have a text attribute for Text blocks.

This rather specific mapping of only certain properties from the CE.SDK design into the text serialization allows us to optimize the design serialization for different use cases. A use-case where we e.g. want to automatically name each layer does maybe not require fine-grained information about e.g the font size.

Appendix: Parsing and transforming LLM response

The LLM answers with arbitrary tokens. It’s not possible to restrict the response to a certain syntax. By settling on a well-defined and widely used format we instruct the model to also reply with an XML-like document, similar to the one we passed in as “current state”.

After receiving the LLM’s Response we first make sure that only a single XML document is present inside the response. We then compare the retrieved XML document with the state of the Design that we passed into the LLM and generate a change set. This change set contains entries like “Color of block with ID=123 has changed”. These change set entries are then converted into programmatic commands, like e.g engine.block.setColor(123) and executed on the current design.

One challenges when working with an LLM is the inability to restrict the output space. Thus, we are never guaranteed that the LLM did not add e.g new XML node names or that it even replies with a proper, valid XML-like document. The only lever to influence the probability of a proper XML-like document is to use strong prompting and LLM that are good at following those instructions.

In our tests, state-of-the-art models like GPT-4 can follow those instructions without any further tooling.

Further Research Topics

It’s also worth exploring fine-tuning an LLM specifically for this task which could improve the performance of the LLM for the specific tasks.

It would also be possible to use more advanced libraries like Guidance, which allows to define a grammar for the LLM response thus making sure that the output of the LLM is always parseable.

Another way to improve the performance would be to methodically test different prompt templates and find a way to measure and compare the output quality.

Thank you for reading!

3,000+ creative professionals gain exclusive access and hear of our releases first—subscribe to our newsletter and never miss out.