How to Build a Simple Real-Time Video Editor with Metal - iOS

In this tutorial you’ll see how to extract frames from live camera streams, local movie files and streamed movie files and display them in a MetalKit view. Using Metal gives you a great deal of control over how the pixels are rendered and ensures that rendering happens on the GPU, which keeps that work from slowing down the CPU.

The tutorial code always converts the video frames into CIImage objects before display. This is to help you adapt the code to your own applications. Applying filters, resizing and other effects are fast and easy when working with CIImage. Additionally, as long as your CIContext is backed by the GPU (here, by a Metal command queue), you can be sure that your code uses the GPU and the CPU efficiently.

When working with video, Apple provides a number of frameworks that operate at different levels. For the higher-level frameworks such as AVKit or Core Image, the system determines when to use the GPU and when to use the CPU for rendering and computation. The higher-level frameworks also offer a tradeoff between ease of use and granularity of control. If you want direct control over how the GPU renders every pixel, then you will want to use Metal. Remember, though, that you are now responsible for keeping video and audio in sync, encoding and decoding data and determining a reasonable UI for your user. For many use cases, a higher-level framework is probably a better choice. Apple’s engineers have worked to ensure the graphics frameworks use the GPU and CPU in reasonable ways. However, when you need to use Metal, you need it. So, let's get started.

Setting Up an MTKView

An MTKView is a subclass of UIView, so it has a frame, bounds and the other usual view properties. Its drawing and rendering are backed directly by drawables and textures rendered on the GPU, so drawing to it can be quite fast. In addition to displaying video, you can render 3D model objects and graphics shaders, so for a graphics-rich application it can be a powerful tool.

Before you can send data to the view, it needs to be configured. MetalKit views are still heavily influenced by UIKit, so if you are working in a SwiftUI project, you will need to wrap them in UIViewRepresentable code.
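For SwiftUI projects, the wrapping step could look roughly like the sketch below. The names MetalVideoView and renderer are hypothetical and not part of the tutorial's project; the sketch assumes you already have an object that conforms to MTKViewDelegate and that you keep it alive elsewhere, because MTKView only holds its delegate weakly.

```swift
import SwiftUI
import MetalKit

// Hypothetical SwiftUI wrapper around an MTKView.
// `renderer` is assumed to be an object you own that implements MTKViewDelegate.
struct MetalVideoView: UIViewRepresentable {
    let renderer: MTKViewDelegate

    func makeUIView(context: Context) -> MTKView {
        // Create the view with the default GPU; the device is nil where Metal is unavailable.
        let view = MTKView(frame: .zero, device: MTLCreateSystemDefaultDevice())
        // Same configuration the tutorial applies to its displayView.
        view.isPaused = true
        view.enableSetNeedsDisplay = false
        view.framebufferOnly = false
        view.delegate = renderer
        return view
    }

    func updateUIView(_ uiView: MTKView, context: Context) {
        // Nothing to update in this sketch; frames are pushed by calling draw() on the view.
    }
}
```

In a real app the renderer would also need a reference to the created view (or a callback) so that it can call draw() when a new frame arrives.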

//MetalKit Variables
@IBOutlet var displayView: MTKView!
var metalDevice : MTLDevice!
var metalCommandQueue : MTLCommandQueue!

//CoreImage Variables
var ciContext : CIContext!
var filteredImage: CIImage?
var cleanImage: CIImage?

//get a reference to the GPU
metalDevice = MTLCreateSystemDefaultDevice()

//link the GPU to our MTKView
displayView.device = metalDevice

//link our command queue variable to the GPU
metalCommandQueue = metalDevice.makeCommandQueue()

//associate our CIContext with the metal stack
ciContext = CIContext(mtlCommandQueue: metalCommandQueue)

You will need to get a reference to the GPU of the iOS device. At least for now, iOS devices only have one GPU. Then you will need to link the displayView and the metalCommandQueue to the metalDevice. Finally we will link the ciContext to the GPU so that all of our image manipulation code will run in the same place.

//tell our MTKView that we want to call .draw() to make updates
displayView.isPaused = true
displayView.enableSetNeedsDisplay = false
//let its drawable texture be written to dynamically
displayView.framebufferOnly = false

//set our code to be our MTKView's delegate
displayView.delegate = self

In the code above, we put the MTKView in a paused state (isPaused = true) and disable enableSetNeedsDisplay. This ensures that the view only redraws when we explicitly call its .draw() method, which in turn calls draw(in:) on our delegate. Setting framebufferOnly to false tells the view that its drawable texture will be used for more than just display, so that Core Image can write to it and your code may also read from it.
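The setup snippets above rely on implicitly unwrapped optionals, which crash if no Metal device is available. As a hedged alternative, the same steps can be gathered into one function that fails gracefully. MetalStack and makeMetalStack are hypothetical names introduced only for this sketch.

```swift
import MetalKit
import CoreImage

// Hypothetical container for the objects created during setup.
struct MetalStack {
    let device: MTLDevice
    let commandQueue: MTLCommandQueue
    let ciContext: CIContext
}

// Returns nil instead of crashing when Metal is unavailable.
func makeMetalStack(for view: MTKView) -> MetalStack? {
    guard let device = MTLCreateSystemDefaultDevice(),
          let queue = device.makeCommandQueue() else {
        return nil
    }

    // Link the GPU to the view and switch to explicit draw() calls.
    view.device = device
    view.isPaused = true
    view.enableSetNeedsDisplay = false
    view.framebufferOnly = false

    // Core Image shares the command queue so that filtering and display use the same GPU.
    let context = CIContext(mtlCommandQueue: queue)
    return MetalStack(device: device, commandQueue: queue, ciContext: context)
}
```

The delegate is left out on purpose so the caller can assign it after checking that the stack was created.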

Drawing in an MTKView

Now the MTKView is configured. The next step is to implement the delegate methods. The MTKViewDelegate protocol has two methods. If your code needs to respond to the view dimensions changing (to support device rotation, or if you want the user to be able to resize the window), make your adjustments in func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize). For this example we are only interested in the other delegate method: func draw(in view: MTKView). The code below draws a CIImage to the MTKView. When showing video, we can extract each frame of the video as a CIImage.

//create a buffer to hold this round of draw commands
guard let commandBuffer = metalCommandQueue.makeCommandBuffer() else {
  return
}

//grab the filtered or a clean image to display
guard let ciImage = filteredImage ?? cleanImage else {
  return
}

//get a drawable if the GPU is not busy
guard let currentDrawable = view.currentDrawable else {
  return
}

//make sure frame is centered on screen
let heightOfImage = ciImage.extent.height
let heightOfDrawable = view.drawableSize.height
let yOffset = (heightOfDrawable - heightOfImage)/2

//render into the metal texture
self.ciContext.render(ciImage,
             to: currentDrawable.texture,
  commandBuffer: commandBuffer,
         bounds: CGRect(origin: CGPoint(x: 0, y: -yOffset), 
                          size: view.drawableSize),
     colorSpace: CGColorSpaceCreateDeviceRGB())

//present the drawable and buffer
commandBuffer.present(currentDrawable)

//send the commands to the GPU
commandBuffer.commit()

For each pass of the draw method, we create a new MTLCommandBuffer and then get the MTKView's .currentDrawable, which contains a texture we can send pixel data to. The texture has a size and a color space. Though we are using Metal to render images to the screen, Metal can also send other valid commands to the GPU, such as compute work. Once the buffer has been filled with commands, the drawable is scheduled for presentation and the buffer is committed. When the buffer is committed, the GPU will execute all of its commands. If another .draw() gets called while the buffer is being executed, the GPU will not interrupt the current buffer in order to start the new one.

The bounds passed to the render call are expressed in the image's coordinate space, so shifting their origin moves where the CIImage appears in the MTKView. To change the size of the image within the view, scale the CIImage itself before rendering. This can be valuable when compositing multiple images or video streams onto the same MTKView. Remember that, unlike a UIView, the origin point of both the texture and the CIImage is the bottom left.
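The centering math in the draw code can be extended to an aspect-fit placement. The sketch below is a plain calculation with no Metal calls; FitPlacement and aspectFitPlacement are hypothetical names, and the numbers in the usage note assume a 1920x1080 frame shown in a 1080x1920 drawable.

```swift
import Foundation

// Hypothetical result type: uniform scale plus the offsets that center the scaled image.
struct FitPlacement {
    let scale: CGFloat
    let xOffset: CGFloat
    let yOffset: CGFloat
}

// Computes the largest uniform scale at which the image still fits inside the drawable,
// and the offsets needed to center it. Returns nil for an empty image.
func aspectFitPlacement(imageSize: CGSize, drawableSize: CGSize) -> FitPlacement? {
    guard imageSize.width > 0, imageSize.height > 0 else {
        return nil
    }
    let scale = min(drawableSize.width / imageSize.width,
                    drawableSize.height / imageSize.height)
    let scaledWidth = imageSize.width * scale
    let scaledHeight = imageSize.height * scale
    return FitPlacement(scale: scale,
                        xOffset: (drawableSize.width - scaledWidth) / 2,
                        yOffset: (drawableSize.height - scaledHeight) / 2)
}
```

For a 1920x1080 frame in a 1080x1920 drawable this gives a scale of 0.5625, an xOffset of 0 and a yOffset of 656.25. One way to use the result is to scale the image with ciImage.transformed(by: CGAffineTransform(scaleX: scale, y: scale)) and then pass the negated offsets as the origin of the render bounds, as the draw code above does with yOffset.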

Since iOS 11, Apple has provided a newer, lighter-weight API for sending renders to the buffer. Use whichever form makes sense to you.

//render into the metal texture
let destination = CIRenderDestination(mtlTexture: currentDrawable.texture, 
                                   commandBuffer: commandBuffer)
do {
 try self.ciContext.startTask(toRender: ciImage, to: destination)
} catch {
 print(error)
}

The CIRenderDestination has some other optional parameters such as height and width but will always display with an origin of 0,0. So, unlike the earlier call, if you need to move, rotate or change the dimensions of the image, you will need to apply that to the CIImage earlier in the code. However, if your app always displays the video at full size in the MTKView, this may be easier. Also, the startTask call returns as soon as the render work has been issued, rather than waiting for it to complete.

It is important to remember that you can have any number of renders in the commandBuffer before the calls to .present and .commit. This means that the same MTKView can display video from multiple streams as well as static images or animations.

Working with Local Files

To extract individual frames from a local file, we can use an AVAssetReader to get pixel data to render in the MTKView.

When using higher-level frameworks, a quick way to display a video file for playback is to load it into an AVPlayer. However, an alternative is to use an AVAssetReader to read the tracks and the individual frames from the video tracks.

let asset = AVAsset(url: Bundle.main.url(forResource: "grocery-train", withExtension: "mov")!)
let reader = try! AVAssetReader(asset: asset)

guard let track = asset.tracks(withMediaType: .video).last else {
  return
}

let outputSettings: [String: Any] = [
  kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32ARGB
]
let trackOutput = AVAssetReaderTrackOutput(track: track, outputSettings: outputSettings)
reader.add(trackOutput)
reader.startReading()

It is important that the outputSettings request a pixel format that the reader can produce from the video track and that your rendering code expects. If there is a mismatch, the color may be off or the reader may be unable to extract a usable buffer. If your code is generating empty or black buffers, the outputSettings are the first place to troubleshoot. Once the reader starts reading, it can extract pixel buffers. You can use a while loop to get all of the buffers from the track and send them to the MTKView for display.

//copyNextSampleBuffer() returns nil at the end of the track
while let sampleBuffer = trackOutput.copyNextSampleBuffer() {
  //stop if the sample does not contain image data
  guard let cvBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
    return
  }
  print(track.preferredTransform)
  print(sampleBuffer.outputPresentationTimeStamp)

  //get a CIImage out of the CVImageBuffer
  cleanImage = CIImage(cvImageBuffer: cvBuffer)
  displayView.draw()
}

In the code above, we use CMSampleBufferGetImageBuffer to ensure that the sample contains a valid image. Then we wrap the image buffer in a CIImage and call the .draw() method of the MTKView. Afterwards, we get the next sample buffer. This continues until the end of the track, when .copyNextSampleBuffer() returns nil.
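A nil sample buffer can mean either that the track ended or that reading stopped for another reason, so it can help to inspect the reader's status once the loop exits. The function name reportReaderOutcome is hypothetical; the status values are those of AVAssetReader.Status.

```swift
import AVFoundation

// Hypothetical helper: call this after the while loop to see why reading stopped.
func reportReaderOutcome(_ reader: AVAssetReader) {
    switch reader.status {
    case .completed:
        print("Reached the end of the track")
    case .failed:
        // reader.error describes what went wrong, for example an unsupported output format.
        print("Reading failed: \(String(describing: reader.error))")
    case .cancelled:
        print("Reading was cancelled")
    case .reading, .unknown:
        print("Reader has not finished yet")
    @unknown default:
        print("Unexpected reader status")
    }
}
```

A failed status together with empty or black frames is a hint to revisit the outputSettings.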

In the code above, there are two print statements to note some valuable information that your app may want to store for later use. Remember, when working with buffers and the GPU directly, much of the higher-level metadata about the video is lost, so you’ll need to keep track of it in your code if you want to use it later. An image may be rotated because of the way it was originally generated in the video. The preferredTransform will provide data that your app can use to transform the image to the “proper” orientation before display. Additionally the outputPresentationTimeStamp tells when the image represented by the buffer appeared in the original video. This can be helpful when you are trying to sync individual frames back to audio tracks or if your app only wants to modify specific frames and then reinsert them into the original clip.
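To make use of the preferredTransform, one approach is to read the rotation it encodes from its a and b components. The helper below is a hypothetical sketch of that arithmetic only; it assumes the transform is a pure rotation in multiples of 90 degrees, which is common for phone footage but not guaranteed.

```swift
import Foundation

// Hypothetical helper: rotation in whole degrees (0..<360) encoded by the
// a and b components of an affine transform such as track.preferredTransform.
func rotationDegrees(a: CGFloat, b: CGFloat) -> Int {
    // For a rotation matrix, a = cos(angle) and b = sin(angle).
    let radians = atan2(Double(b), Double(a))
    let degrees = Int((radians * 180 / Double.pi).rounded())
    // Normalize negative angles into the 0..<360 range.
    return (degrees + 360) % 360
}
```

A call such as rotationDegrees(a: track.preferredTransform.a, b: track.preferredTransform.b) could be stored alongside each frame's timestamp. Keep in mind that Core Image uses a bottom-left origin, so the direction of rotation you apply to a CIImage may need to be flipped; check the result against a known clip.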

Working with Streams

In addition to local files, an app may want to use a video stream as the input. The process is largely the same, as there is a method to extract a CVPixelBuffer from an AVPlayer that is streaming. You can find the complete example code for pushing pixel buffers from a stream to an MTKView in this blog post about streaming. However, the important part of the code perhaps looks familiar by now:

let currentTime = playerItemVideoOutput.itemTime(forHostTime: CACurrentMediaTime())
if playerItemVideoOutput.hasNewPixelBuffer(forItemTime: currentTime) {
  if let buffer = playerItemVideoOutput.copyPixelBuffer(forItemTime: currentTime, itemTimeForDisplay: nil) {
     let frameImage = CIImage(cvImageBuffer: buffer)
     self.currentFrame = frameImage //a CIImage var
     self.videoView.draw() //our MTKView
   }
}

The code above runs from a CADisplayLink callback so that the stream is queried on a regular basis for new video frames. It then extracts a buffer and sends it to the MTKView for display.
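The display link and video output themselves are not shown in the snippet, so here is a minimal sketch of how they might be wired together. StreamFrameSource and onFrame are hypothetical names, and the 32BGRA pixel format is an assumption chosen for this sketch rather than something the tutorial specifies.

```swift
import AVFoundation
import QuartzCore
import UIKit

// Hypothetical class that polls a streaming player item for new pixel buffers.
final class StreamFrameSource: NSObject {
    let player: AVPlayer
    let videoOutput: AVPlayerItemVideoOutput
    private var displayLink: CADisplayLink?

    // Called on the main thread with each new frame.
    var onFrame: ((CVPixelBuffer) -> Void)?

    init(url: URL) {
        let item = AVPlayerItem(url: url)
        // Assumption: request BGRA buffers, a format Core Image can read directly.
        let attributes: [String: Any] = [
            kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
        ]
        let output = AVPlayerItemVideoOutput(pixelBufferAttributes: attributes)
        item.add(output)
        videoOutput = output
        player = AVPlayer(playerItem: item)
        super.init()
    }

    func start() {
        // The display link fires once per screen refresh on the main run loop.
        let link = CADisplayLink(target: self, selector: #selector(tick(_:)))
        link.add(to: .main, forMode: .common)
        displayLink = link
        player.play()
    }

    func stop() {
        // CADisplayLink retains its target, so invalidate it to break the cycle.
        displayLink?.invalidate()
        displayLink = nil
        player.pause()
    }

    @objc private func tick(_ link: CADisplayLink) {
        let itemTime = videoOutput.itemTime(forHostTime: CACurrentMediaTime())
        guard videoOutput.hasNewPixelBuffer(forItemTime: itemTime),
              let buffer = videoOutput.copyPixelBuffer(forItemTime: itemTime,
                                                       itemTimeForDisplay: nil) else {
            return
        }
        onFrame?(buffer)
    }
}
```

The onFrame closure is where the buffer would be turned into a CIImage and the MTKView's draw() called, as in the snippet above.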

Working with the Cameras

Much as a video stream uses an AVPlayerItemVideoOutput to output pixel buffer data, the standard object to use with the camera is an AVCaptureVideoDataOutput. First, you need to set up a standard capture session for either the front or back camera. Then attach a video data output object to the session.
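The capture session setup is not shown in this tutorial, so the following is a minimal sketch of one way to create it. CameraSetupError and makeCaptureSession are hypothetical names; the sketch assumes the app has an NSCameraUsageDescription entry in its Info.plist and that the user has granted camera access.

```swift
import AVFoundation

// Hypothetical error type for this sketch.
enum CameraSetupError: Error {
    case noCamera
    case cannotAddInput
}

// Creates a session with the wide-angle camera at the requested position as its input.
func makeCaptureSession(position: AVCaptureDevice.Position = .back) throws -> AVCaptureSession {
    guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera,
                                               for: .video,
                                               position: position) else {
        throw CameraSetupError.noCamera
    }
    let input = try AVCaptureDeviceInput(device: camera)

    let session = AVCaptureSession()
    session.beginConfiguration()
    // Commit the configuration even if adding the input fails.
    defer { session.commitConfiguration() }

    guard session.canAddInput(input) else {
        throw CameraSetupError.cannotAddInput
    }
    session.addInput(input)
    return session
}
```

After the video data output has been attached, call startRunning() on the session, preferably off the main thread because the call blocks until the session has started.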

videoOutput = AVCaptureVideoDataOutput()
let videoQueue = DispatchQueue(label: "captureQueue", qos: .userInteractive)
videoOutput.setSampleBufferDelegate(self, queue: videoQueue)

if captureSession.canAddOutput(videoOutput) {
  captureSession.addOutput(videoOutput)
} else {
  fatalError("configuration failed")
}

Whenever the camera has collected enough data to generate a pixel buffer, it will send that buffer to its delegate. The delegate can then convert that data to a CIImage and send it to the MTKView to be rendered. The method in an AVCaptureVideoDataOutputSampleBufferDelegate would look like this.

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
  //get a CVImageBuffer from the camera
  guard let cvBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
    return
  }

  //get a CIImage out of the CVImageBuffer
  cleanImage = CIImage(cvImageBuffer: cvBuffer)
  displayView.draw()
}

Going Further

In this tutorial, you saw how to extract pixel buffers from the camera, local files and remote streams and convert them to CIImage objects. Then you saw how to render a CIImage to fill all or part of an MTKView. If your application needs to resize or apply filters to the images, you can do that before calling the .draw() method of the MTKView. If the only reason you are considering using MTKViews is because of a Metal or OpenGL kernel you want to use, consider our tutorial on how to wrap kernels into Core Image filters.
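As an illustration of filtering before the draw call, the sketch below applies one of Core Image's built-in filters. The function name makeFilteredImage and the choice of the sepia filter are assumptions made for this example; any CIFilter could take its place.

```swift
import CoreImage

// Hypothetical helper: returns a sepia-toned copy of the frame.
// Core Image is lazy, so no pixels are processed until the image is rendered.
func makeFilteredImage(from image: CIImage, intensity: Double = 0.8) -> CIImage {
    return image.applyingFilter("CISepiaTone",
                                parameters: [kCIInputIntensityKey: intensity])
}
```

Assigning the result to filteredImage before calling displayView.draw() would make the draw code pick it up in place of cleanImage.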

If your application needs to gather streams of video, filter and then render them to the screen with precision, then drawing to an MTKView may be sufficient. However, if you want to let your users dictate how and where to filter the streams, then you may want to consider an SDK like VideoEditor SDK or CreativeEditor SDK.