Metro: the ABC’s new Media Transcoding Pipeline

In December last year, the ABC launched a new video encoding system called Metro (“Media Transcoder”), which converts various sources of media into a standardised format for iview.

It’s been a fantastic project for the ABC’s Digital Network division – we’ve built a cheap, scalable, cloud-based solution that we can customise to suit our specific needs.

Metro has been live for a month, successfully transcoding thousands of pieces of content. Here’s an overview of how it’s been designed and what it does.

Background

Our previous transcoding system had a fairly straightforward job: produce the same set of renditions for each piece of content it was given. Both input and output files were fairly standardised. The previous system was convenient, but there were some aspects we couldn’t customise, and we didn’t use its #1 proposition: on-demand transcoding. Most of the content the ABC publishes is available to us days in advance, so we just need to make sure that it’s transcoded before it’s scheduled for release online.

We calculated that we could replace the existing system for less than the previous system cost, and take advantage of AWS services and their scalability. Other systems like the BBC’s Video Factory have been successfully built using the same model. Writing our own system would allow us to start batching up jobs to process in bulk, or use different sized instances to help reduce costs in the long term.

Our first step was to replicate what the existing system did, but allow it to scale when needed, and shut down when there’s nothing to do.

Architecture

Metro is a workflow pipeline that takes advantage of queuesautoscaling compute groups, a managed database, and notifications. Logically, the pipeline follows this series of steps: File upload > Queue Job > Transcode > Transfer to CDN > Notify client

 

transcodearchitecture

The pipeline is coordinated by the “Orchestrator”, an API written in node.js that understands the sequence of steps, enqueues messages, talks to our database, and tracks where each job is in the system. It’s also responsible for scaling the number of transcoding boxes that are processing our content.

Each step in our pipeline is processed by a small, isolated program written in Golang (a “queue listener”), or a simple bash script that knows only about its piece of the pipeline.

We are able to deploy each piece independently, which allows us to make incremental changes to any of the queue listeners, or to the Orchestrator.

Interesting bits

Autoscaling the Transcoders

The transcoders are the most expensive part of our system. They’re the beefiest boxes in the architecture (higher CPU = faster transcode), and we run a variable number of them throughout the day, depending on how much content is queued.

Before a piece of content is uploaded, we check to see how many idle transcoders are available. If there are no spare transcoders, we decide how many new ones to start up based on the transcoding profile. Higher bitrate outputs get one transcoder each; lower bitrates and smaller files might share one transcoder over four renditions. Once we process everything in the queue, we shut down all the transcoders so that we’re not spending money keeping idle boxes running.

Here’s a snapshot of the runtime stats (in minutes) on boxes over a 4 hour window:

ec2_transcoders2

There’s definitely some optimisation we can do with our host runtime. In future, we’d like to optimise the running time of our transcoders so that they run for a full hour, to match Amazon’s billing cycle of one hour blocks. We’d also like to take advantage of Amazon’s spot instances – using cheaper computing time overnight to process jobs in bulk.

FFmpeg

FFmpeg is the transcoding software we use on our transcoders. It’s open source, well maintained, and has an impressive list of features. We’re using it to encode our content in various bitrates, resize content, and add watermarks. We create an AMI that includes a precompiled version of FFmpeg as well as our transcoder app, so that it’s ready to go when we spin up a new box.

There’s still a way to go before we’re using FFmpeg to its full extent. It’s capable of breaking a file into even chunks, which would make it perfect to farm out to multiple transcoders, and likely giving us even faster, consistent results every time. We can also get progress alerts and partial file download (e.g taking the audio track only, avoiding downloading a bunch of video information that you won’t use).

SQS Queues

We utilise SQS queues to keep our pipeline resilient. We’ve got different queues for various step in our system, and each queue has a small app monitoring it.

When a new message arrives, the app takes the message off the queue and starts working. If an error occurs, the app cancels its processing work and puts the message back at the head of the queue, so that another worker can pick it up.

If a message is retried a number of times without success, it ends up in a “Dead Letter Queue” for failed messages, and we get notified.

Things seem to be working well so far, but we’d like to change the queues so that consumers continually confirm they’re working on each message, rather than farming out the message and waiting until a timeout before another consumer can pick it up.

In Production

Metro has been transcoding for a month, and is doing well. Our orchestrator dashboard shows all of the jobs and renditions in progress:

orchestrator2-small

And some of the work done by transcoders in a 4 hour window:

transcode_chart

The Future

We have more features to add, such as extracting captions, using cheaper computing hardware in non-peak times, and building priority/non-priority pipelines so that content can be ready at appropriate times. Metro has been really interesting to build, much cheaper than our previous solution, and we can customise features to suit our needs. I’m really looking forward to where it goes next.

Finding a Memory Leak

This post originally appeared on the 7digital developer blog on 15th February 2011. It has been moved here for preservation. 

A few weeks ago, we launched the shiny, redesigned new 7digital.com to a beta audience. Unfortunately, we had a memory leak.

The new site was hosted on the same set of hardware as a few other applications, and it was gradually bringing the other sites down. We put a limit on the amount of virtual memory to shield the other sites from the memory leak,  but performance kept deteriorating. Thankfully, the memory leak was eventually found – here’s a set of steps I followed to find it.

Step 1: Take a memory dump from the live site

Graham, a fellow dev, helpfully pointed out userdump and also gave me a crash course in windbg. Userdump is a command line tool which will take a snapshot of the memory space used by a process. It’s important to note that it freezes your process while it takes the dump, so if you’re doing this in live, your site might stop for a minute or more. You can use the inbuilt iisapp.vbs script on the command line to find out exactly which w3wp process belongs to which Application Pool, and therefore which process to dump. Once you have the process id, take the memory dump and examine it with windbg.  Two useful articles were Getting Started with windbg by JohanS, and Tess Ferrandez’s excellent lab/tutorial on how to navigate through a memory dump.

Step 2: Add some performance counters

Since the live dump didn’t highlight any obvious problems (it only had information for a minute or less of runtime before the app pool recycled), we added some performance counters to see if we could find any trends. You can access perfmon under Start > Administrative Tools > Performance.  MSDN has a good explanation of the different counters and what they mean. Since we were concentrating on memory, I added the following counters and waited for any trends to appear.

.NET CLR Exceptions\#Exceps thrown
.NET CLR Memory\#Bytes in all Heaps
.NET CLR Memory\Gen 2 Heap Size
.NET CLR Memory\Large Object Heap Size

Edit: It’s possible to show counters for a single process, but if you have multiple w3wp processes running on the same box (as we do), it’s difficult to get the counters for the right one.  I was looking at counters for the whole box, which didn’t give me a lot of detail.

Step 3: Do some local profiling 

A live memory dump is all well and good, but it just looks like a screen full of hex 🙂 Local profiling gives you some lovely graphs, stack traces, statistics on running time, etc which you can use to drill down into specific methods or lines of code. If you know what user action is causing the leak (e.g. clicking the “Purchase” button), you can profile that on your local machine and easily identify which method or line of code is causing the problem.

I downloaded ANTS Memory ProfilerDotTrace, and AQTime to try some local profiling. The learning curve on ANTS seemed to be the gentlest, although if you are familiar with any of the tools, it would help greatly. The ANTS inline help files were an excellent refresher course on how .NET garbage collection works.

Step 4: Local profiling with load testing

I spent about a day learning how ANTS works, and doing some common page loads on my local machine. I didn’t see anything unusual. But…. my mistake was to profile without load. It’s very difficult to spot trends unless the changes being made by an action are exaggerated.

ApacheBench was recommended, which is a command line tool for benchmarking performance, but also handy for making lots of concurrent requests. So I lined up multiple requests (and executed them multiple times, all while running ANTS) for common pages in our site, like the search page, artist page and album page. Nothing really turned up until I tried to add products to a basket – and got my breakthrough. Here are the two graphs of memory usage from ANTS. The first shows code behaving itself and being cleaned up by the garbage collector when some normal actions were load tested. The second illustrates our memory leak – the line in green highlights the total memory (managed + unmanaged) being used by our process, the line in red is the amount of managed memory allocated by .NET. Unforunately, this meant that our leak was in unmanaged memory, which ANTS couldn’t help me track down.

Good memory profile:

trace-good

Bad memory profile:

trace-bad

Step 5: Finding unmanaged memory leaks

So, back to the dump taken from the live site with userdump.  James Kovacs has written a helpful article which, among other things, lists reasons why you might be leaking unmanaged memory.  I took another memory dump with more user activity to examine, and had a look at the assemblies in the app domain. Along with the usual suspects:

Assembly: 034a3fd8 [C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files\b\970be4ca\1a5ec57f\assembly\dl3\139d25740cf5f9d_99b8cb01\Lucene.Net.dll]
ClassLoader: 034a4048
SecurityDescriptor: 034a3d18
Module Name
04ac1d74 C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files\b\970be4ca\1a5ec57f\assembly\dl3\139d25740cf5f9d_99b8cb01\Lucene.Net.dll

....

There were an enormous number of dynamic assemblies being loaded into our app domain:

Assembly: 286ff688 (Dynamic) []
ClassLoader: 286ff6f8
SecurityDescriptor: 286ff600
Module Name
0062429c Dynamic Module
0062461c Dynamic Module

This was the reason that the memory kept increasing. Some piece of code was dynamically loading assemblies, and once there, they never get unloaded. However, it’s very difficult to get any more information about them in windbg for framework version 2.0.  Windbg for v2.0 has less commands than windbg for v1.1 (strange!), and the internet seems to be full of demos using windbg 1.1 showing more information than you get now.   They are a good starting point, but be aware you won’t be able to follow them 100%. Tess Ferrandez again has a great tutorial on chasing down unmanaged memory leaks if dynamic assemblies aren’t your problem.

Step 6: Local debugging

The Modules window in Visual Studio shows you which assemblies have been loaded, and it gives you more information than windbg (the name of the assembly, at least) so it was just a matter of repeating the step that caused the error with the debugger attached, and watching when the number of assemblies changed. The culprit was finally found – it was the Application_Error event handler.  We were mis-using a piece of 3rd party code which was creating dynamic assemblies every time an error occurred. And unfortunately for us, it was a catch-22 because our beta users were finding errors we’d missed in testing, making the leak worse.

Step 7: Verification Profiling

We fixed the offending code, and then re-profiled with ApacheBench to verify that the memory was no longer leaking. The whole process took almost three days to track down and fix, mostly because I hadn’t managed to isolate what action was causing the leak. Once I started load testing, the leak was much easier to identify. I was amazed at the number of tools and apps used when trying to find the leak, mostly to rule things out in a process of elimination. Quite satisfying once found, though 🙂

improving page performance with WebPageTest

We are in the process of rebuilding Ninemsn’s front page in coffeescript + node.js, and moving it to Amazon Web Services (AWS) for easier deployment.

As we’re rebuilding the site, we’re quite conscious of improving page performance (or at least maintaining the current standard). I’ve been really impressed with WebPageTest, a free tool for performance analysis. It has some very in-depth analytical tools, and ranks your site according to Google’s published PageSpeed standards. It also keeps a history of past tests you’ve run for up to 12 months. The results are public and searchable by default, but there is an option to keep them private. Some of the excellent features are as follows:

Film strip view
WebPageTest takes a screenshot every 0.1 second as your site renders. This will show you how long it takes for a user to actually see something “above the fold” they can engage with on the page.  It’s not an easily obtained measurement, which is why it’s so valuable – in our case, a user can see and engage with our top stories in news while some of the slower advertising and javascript loads. If we looked at purely when the javascript finished loading, we’d have a terrible measurement (close to 10 seconds).  But in reality, onLoad happens a whopping 8+ seconds after the first “above the fold” experience a user has.  You can click on the filmstrip below to see the screenshots in detail.

filmstrip

Detailed request/Connection breakdown
This is a very similar view to Chrome or Firebug’s network tab, but it also gives the proportion of each request that was spent in DNS lookup, transfer and rendering.  It’s also broken down by asset type (css, js, etc).

ConnectionViewRequestDetails

It’s hugely valuable for you to see why a particular asset is slow – for example, if you are referencing a third party asset that takes a long time for DNS lookup and transfer, you might investigate if you can host it on your own servers or through a CDN. It also goes without saying that the fewer requests you make, the better, so sprites and combining/minifying CSS and JS assets can help you avoid network overhead time. An interesting feature that Google News does is to base-64-encode “above the fold” images directly into the page, to reduce the number of requests. I presume they are also gzipped with the page when it is sent down the wire, saving both requests and page size.

Analysis of Bandwidth and CPU usage
WebPageTest offers a graph of CPU and bandwidth utilisation throughout the render of your page – useful to see if you can improve your time to rendering after the files have been delivered to the user e.g. high CPU because there is some intensive javascript. Improving this measure would make a great difference to older hardware/browsers.

CPU-Bandwidth

Self hosting & Running from different locations around the world
WebPageTest offers the ability to run from over 40 different locations in the world, at last count. So you can check how your site performs when accessed from different parts of the world. You can also download the source and set up your own hosting server and testing agents if you wish, which gives you more control over where and when you’d like to run it. It would be very useful as part of a build pipeline, to run each check-in/day/week, and highlight the effect that recent changes have made to the page speed.

If you are using the public agent, or decide to host your own, be aware of where the test agent is hosted vs where your site is hosted. For example, if you are both in the same AWS data centre, then time to first byte will probably not be an accurate measure for the majority of users.

Multiple requests
One feature I really like is the fact that it executes two requests to your page, which highlights how much of your page is cached from the first request, and what assets could benefit from cache headers. The in-depth analysis of all the above features (film strip view, connection breakdown, bandwidth & CPU analysis) is done on both versions.

RepeatViewRepeatViewWaterfall1RepeatViewWaterfall

Visual Comparison of two different tests
You can also compare two results – for example, two weeks apart – to see the differences if you have been trying to improve speed, or if you have recently introduced a new feature to see the impact that has had.

There are even more features available – for example, a trace route. We noticed when testing our Akamai fronted page that our packets were actually being routed via Singapore (why? I don’t know whether it was AWS or Akamai, but we noted it down for later follow up, especially before we go live.)

In summary – WebPageTest is an extremely useful diagnostic tool, and best of all it’s free!