Wednesday, February 22, 2017

Optimising for 60fps everywhere - Performance and speed in JavaScript and CSS

There is no silver bullet for making web pages and web apps render efficiently — instead, the best approach is to understand the different things that can cause a page to render slowly and optimise each of them in turn, following some basic rules and best practices.

What actually happens in the browser?

Before understanding how to optimise web sites and applications for efficient rendering, it’s important to understand what is actually going on in the browser between the code you write and the pixels being drawn to screen. There are six different tasks a browser performs to accomplish all this:
  • downloading and parsing HTML, CSS and JavaScript
  • evaluating JavaScript
  • calculating styles for elements
  • laying out elements on the page
  • painting the actual pixels of elements
  • compositing layers to the screen
This post focuses only on the aspects involved in achieving smooth animation without visual delay; I won’t cover the downloading and parsing of assets.

Only 16ms per frame

In the typical flow of drawing to the screen, for each frame the browser will evaluate some JavaScript if necessary, recalculate styles for any affected elements, and recalculate layout if styles that affect geometry were modified. It will then paint a subset of the page to various “layers”, and finally use the GPU to composite these layers to the screen. Each of these stages has its own cost, which varies depending on what your web page or application does. If you’re aiming for a smooth 60fps, the browser has only roughly 16 milliseconds per frame (1000ms ÷ 60 frames ≈ 16.7ms) to accomplish all of this.
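As a rough illustration (this snippet is not from the original post), you can measure how much of that budget you’re actually using with requestAnimationFrame: the callback fires once per frame with a timestamp, so the delta between consecutive timestamps tells you whether you’re staying under roughly 16.7ms.
var last = performance.now();

function onFrame(now) {
  var frameTime = now - last; // time since the previous frame
  last = now;
  if (frameTime > 16.7) {
    console.warn('Frame took ' + frameTime.toFixed(1) + 'ms (under 60fps)');
  }

  // ...per-frame JavaScript and style updates go here...

  requestAnimationFrame(onFrame);
}
requestAnimationFrame(onFrame);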

Re-layouts

“Layout” is a term used to describe the geometry of a page. That is, where every element is and how big it is. When you modify the geometry of an element using JavaScript (say, by changing margin, padding, width, height etc.) the browser does not immediately recalculate the geometry for every part of the page that’s affected.
Instead it keeps track of which parts of the page are “dirty” (i.e. in need of recalculating) and defers the calculation until the geometry next needs to be read, either by JavaScript if you’re accessing a property like offsetWidth, or by the renderer once it’s time to draw the page to the screen. As a result, it’s generally best to allow changes to queue up as much as possible and avoid forcing the browser to re-calculate layout several times per frame.
Take this example:
// els is an array of elements
for(var i = 0; i < els.length; i += 1){
  var w = someOtherElement.offsetWidth / 3;
  els[i].style.width = w + 'px';
}

What is happening here is that on every iteration of the loop, the browser has to make sure all queued changes are applied so that it can calculate an up-to-date value for someOtherElement.offsetWidth, and then it applies the updated width style to the next element in the array. That write invalidates the layout again, meaning that on the next iteration the browser has to repeat the same expensive work in order to calculate the value.
Now, assuming that changing the width of the elements in the array does not affect someOtherElement's size, take this example:
var x = someOtherElement.offsetWidth / 3;
for(var i = 0; i < els.length; i += 1){
  els[i].style.width = x + 'px';
}

This time, we’re doing all of the reading of properties first, then writing all of the updated styles afterwards. This way, the browser only has to perform one reflow (in order to read someOtherElement.offsetWidth), and all the updates to the elements in els can be queued up and applied at once, when they next need to be read – either by subsequent JavaScript, or when the elements need to be repainted.
As a rule, reflows and relayouts should be kept to a minimum – wherever possible all properties should be read first, and then all updates written at once.
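One way to apply this rule across a larger codebase is to queue reads and writes separately and flush all of the reads before all of the writes once per frame. Here is a minimal sketch of the idea (the measure and mutate helpers are made-up names for illustration, not part of any library mentioned in this post):
var reads = [];
var writes = [];
var scheduled = false;

function measure(fn) { reads.push(fn); schedule(); } // queue a layout read
function mutate(fn) { writes.push(fn); schedule(); } // queue a style write

function schedule() {
  if (scheduled) return;
  scheduled = true;
  requestAnimationFrame(function () {
    scheduled = false;
    reads.splice(0).forEach(function (fn) { fn(); });  // run all the reads first...
    writes.splice(0).forEach(function (fn) { fn(); }); // ...then apply all the writes
  });
}
Libraries such as FastDOM are built around essentially this pattern.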

Re-paints

Painting is the process by which the browser takes its abstract collection of elements with all their properties, and actually calculates the pixels to draw. This includes calculating styles such as box shadows and gradients, as well as resizing images.
As a rule, a re-paint will occur once per frame, and will involve the browser re-drawing any pixels it has calculated as “dirty”, i.e. those affected by elements that have been added or removed or have had styles changed.
For smooth animation, it’s important to ensure that any re-paints are as efficient as possible. This means avoiding animating any properties that are expensive for the browser to draw, such as box shadows or gradients. It’s also important to avoid animating elements which have these properties, or any that will cause a re-paint of regions heavy with these effects.
It’s also worth noting that the browser will usually attempt to consolidate different regions into a single repaint for efficiency by simply drawing the smallest possible rectangle that encompasses all “dirty” pixels. This can be particularly bad, however, if you’re changing elements in different areas of the page (for example, if your webapp modifies elements at opposite corners of the screen, that’ll cause the whole page to be included in the bounding rectangle). There’s a good example of this in this blog post detailing improvements made to the Atom code editor. The best solution in these cases is usually to make sure the elements are rendered on different layers.

Layers, compositing, CPU and GPU

Back in the old days, browsers would keep one “frame” in memory which was drawn to the screen, and all paints would involve the CPU drawing pixels directly into this frame.
Nowadays, browsers take advantage of the GPU and instead draw some elements to separate “layers” using the CPU, and use the GPU to composite these layers together to give the final pixels drawn to the screen.
The GPU is very efficient at performing basic drawing operations like moving layers around relative to each other, in 2d and 3d space, rotating and scaling layers, and drawing them with varying opacities. To that end, it’s possible to take advantage of these efficiencies if you’re animating elements with these kind of properties.
Take these two examples. Admittedly they’re somewhat contrived, but they’re deliberately extreme to make the effect obvious. Both examples take 100 <div> elements with some heavy box shadows, and animate them horizontally using CSS transitions.
Firstly, using the left property:

View on JS Bin→

In this example, the browser is having to completely recalculate the pixels around these elements for every frame, which takes a large amount of computing power.
Here is what it looks like in Chrome’s DevTools timeline:

Now, let’s instead use a transform to animate the same elements

View on JS Bin→

And here’s what that one looks like in DevTools

In this example, the transform forces the browser to place each of the <div> elements into its own layer on the GPU before compositing them together for display on the screen. Now for each frame, the only work is in calculating the new position for each layer, which takes barely any computational power at all. There is no work done in recalculating the box shadows or background gradients – the pixels do not change within their layers, so there are no “Paint” events in the timeline, only “Composite Layers”.
There are a number of ways you can force the browser to place an element in its own layer. Usually applying a CSS transition on transform or opacity is enough. A common hack is to use transform: translateZ(0) – this has no visual effect (it moves the element 0 pixels in the Z direction), but because the browser sees a 3D transform it promotes the element to a new layer.
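As a small sketch of the difference in practice (the .box element and the values are made up, and this is not the code from the JS Bin examples above), moving an element with a transform instead of left might look like this:
var box = document.querySelector('.box'); // hypothetical element

box.style.transition = 'transform 0.5s ease-out';
// Read offsetWidth once to flush the starting styles, so the transition has a
// value to animate from (a deliberate, one-off forced layout).
void box.offsetWidth;
box.style.transform = 'translateX(300px)'; // only compositing work per frame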
It is possible to overdo it, however. Don’t go creating new layers willy-nilly. Here’s an example of Apple overdoing it on their homepage and actually slowing it down by having too many composited layers.
It’s also important to bear in mind that with GPU-composited layers, there’s an inherent cost incurred when pushing the actual rendered pixels onto the GPU. If you have lots of composited layers and are animating properties that can’t be animated purely on the GPU, then the browser has to re-paint the pixels on the CPU and upload them to the GPU each time, which may actually be less efficient than keeping the layers non-composited and drawn entirely on the CPU. Here’s a great post by Ariya Hidayat explaining this in more detail.
A good way to see what’s going on here is to use Chrome’s DevTools and enable “show paint rectangles” and “show composited layer borders”. Show paint rectangles will show you exactly which areas are being re-painted for each frame.
If you’re seeing lots of them, especially in regions that contain a lot of elements or fancy css effects, you’re probably at risk of inefficient repainting. Show composited layer borders will show you exactly which elements have their own layers. It’s especially useful if you want to make sure that an element is properly on a separate layer.
It’s worth noting also that if you’re aiming for smooth animations on mobile devices, you should aim wherever possible to only animate properties like transform and opacity that can be animated entirely using GPU acceleration. Mobile devices’ processors are, as a rule, pretty terrible in comparison to their GPUs. As a result it’s best to avoid animating width or height or other such properties. With a little extra effort it’s usually possible to (for example) animate an element’s transform inside another element with overflow: hidden to achieve the same effect as changing its dimensions.
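For example, here is one common form of that trick as a sketch (the class names and markup are assumed, not taken from this post): keep an outer wrapper at a fixed size with overflow: hidden and slide the inner content with a transform, so the content appears to collapse away without any per-frame layout or paint work.
var panel = document.querySelector('.panel');           // assumed: fixed-size wrapper with overflow: hidden
var content = document.querySelector('.panel-content'); // assumed: fills the wrapper

content.style.transition = 'transform 0.3s ease-in-out';
void content.offsetWidth; // flush the starting styles once
content.style.transform = 'translateX(-100%)'; // slides the content out of the clipped wrapper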

A more concrete example

Here’s an example taken straight from our recent update to the GoSquared UI: showing a modal view.


View on JS Bin→
Notice the techniques we’re using here:
  • The overlay, which has a large radial gradient, uses the transform: translateZ(0) hack to promote it to its own GPU layer. As it’s a full-screen overlay, rendering it in the same layer as the rest of the user interface would cause a re-paint of the entire interface, which would be extremely inefficient. By promoting it to its own layer and only animating the opacity property, the entire animation takes place on the GPU, which is really efficient.
  • The modal view itself is animated with a translate3d transform, which again forces it to be rendered on its own layer. This means we can apply box shadows to the modal, and also not worry so much about whatever we choose to put inside it, because the entry animation again takes place entirely on the GPU without any re-painting.

In Summary

As I said at the beginning of this post, there’s no silver bullet for render performance. Depending on your exact use-case, it may be a matter of optimising any of a number of different parts of the render pipeline. Once you understand the various hoops through which the browser has to jump in order to get pixels onto the screen, it’s more a matter of maintaining a set of tools and techniques to apply to different scenarios.

This was just a brief run-down of the major potential bottlenecks one can encounter with rendering performance. The links below all expand on the matter in further detail if you’re interested, and if you have any other examples or suggestions of your own to add then please leave them in the comments!

Tuesday, February 21, 2017

Constructing the Object Model

Before the browser can render the page, it needs to construct the DOM and CSSOM trees. As a result, we need to ensure that we deliver both the HTML and CSS to the browser as quickly as possible.

TL;DR

  • Bytes → characters → tokens → nodes → object model.
  • HTML markup is transformed into a Document Object Model (DOM); CSS markup is transformed into a CSS Object Model (CSSOM).
  • DOM and CSSOM are independent data structures.
  • Chrome DevTools Timeline allows us to capture and inspect the construction and processing costs of DOM and CSSOM.

Document Object Model (DOM)

<html>
  <head>
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <link href="style.css" rel="stylesheet">
    <title>Critical Path</title>
  </head>
  <body>
    <p>Hello <span>web performance</span> students!</p>
    <div><img src="awesome-photo.jpg"></div>
  </body>
</html>
Let’s start with the simplest possible case: a plain HTML page with some text and a single image. How does the browser process this page?
DOM construction process
  1. Conversion: The browser reads the raw bytes of HTML off the disk or network, and translates them to individual characters based on specified encoding of the file (for example, UTF-8).
  2. Tokenizing: The browser converts strings of characters into distinct tokens—as specified by the W3C HTML5 standard; for example, "<html>", "<body>"—and other strings within angle brackets. Each token has a special meaning and its own set of rules.
  3. Lexing: The emitted tokens are converted into "objects," which define their properties and rules.
  4. DOM construction: Finally, because the HTML markup defines relationships between different tags (some tags are contained within other tags) the created objects are linked in a tree data structure that also captures the parent-child relationships defined in the original markup: the HTML object is a parent of the body object, the body is a parent of the paragraph object, and so on.
DOM tree
The final output of this entire process is the Document Object Model (DOM) of our simple page, which the browser uses for all further processing of the page.
Every time the browser processes HTML markup, it goes through all of the steps above: convert bytes to characters, identify tokens, convert tokens to nodes, and build the DOM tree. This entire process can take some time, especially if we have a large amount of HTML to process.
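As an aside (this snippet is not part of the original article), you can watch the same markup-to-tree conversion happen from script with DOMParser, which makes the bytes-to-characters-to-tokens-to-nodes idea concrete:
var markup = '<p>Hello <span>web performance</span> students!</p>';
var doc = new DOMParser().parseFromString(markup, 'text/html');

// The parser has produced a tree of nodes with parent-child relationships:
console.log(doc.body.firstChild.tagName);             // "P"
console.log(doc.body.firstChild.children[0].tagName); // "SPAN"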
Tracing DOM construction in DevTools
If you open up Chrome DevTools and record a timeline while the page is loaded, you can see the actual time taken to perform this step—in the example above, it took us ~5ms to convert a chunk of HTML into a DOM tree. For a larger page, this process could take significantly longer. When creating smooth animations, this can easily become a bottleneck if the browser has to process large amounts of HTML.
The DOM tree captures the properties and relationships of the document markup, but it doesn't tell us how the element will look when rendered. That’s the responsibility of the CSSOM.

CSS Object Model (CSSOM)

While the browser was constructing the DOM of our simple page, it encountered a link tag in the head section of the document referencing an external CSS stylesheet: style.css. Anticipating that it needs this resource to render the page, it immediately dispatches a request for this resource, which comes back with the following content:
body { font-size: 16px }
p { font-weight: bold }
span { color: red }
p span { display: none }
img { float: right }
We could have declared our styles directly within the HTML markup (inline), but keeping our CSS independent of HTML allows us to treat content and design as separate concerns: designers can work on CSS, developers can focus on HTML, and so on.
As with HTML, we need to convert the received CSS rules into something that the browser can understand and work with. Hence, we repeat the HTML process, but for CSS instead of HTML:
CSSOM construction steps
The CSS bytes are converted into characters, then tokens, then nodes, and finally they are linked into a tree structure known as the "CSS Object Model" (CSSOM):
CSSOM tree
Why does the CSSOM have a tree structure? When computing the final set of styles for any object on the page, the browser starts with the most general rule applicable to that node (for example, if it is a child of a body element, then all body styles apply) and then recursively refines the computed styles by applying more specific rules; that is, the rules "cascade down."
To make it more concrete, consider the CSSOM tree above. Any text contained within a span tag that is placed within the body element has a font size of 16 pixels and red text—the font-size directive cascades down from the body to the span. However, if a span tag is a child of a paragraph (p) tag, then its contents are not displayed.
Also, note that the above tree is not the complete CSSOM tree and only shows the styles we decided to override in our stylesheet. Every browser provides a default set of styles also known as "user agent styles"—that’s what we see when we don’t provide any of our own—and our styles simply override these defaults (for example, default IE styles).
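To see the cascaded result from script (an illustrative snippet, not part of the article), getComputedStyle returns the final computed value for an element after the cascade, including the user agent defaults:
var span = document.querySelector('p span');
var style = getComputedStyle(span);

console.log(style.fontSize); // "16px" - inherited down from the body rule
console.log(style.color);    // "rgb(255, 0, 0)" - from the span rule
console.log(style.display);  // "none" - the more specific p span rule wins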
To find out how long the CSS processing takes, you can record a timeline in DevTools and look for the "Recalculate Style" event: unlike DOM parsing, the timeline doesn’t show a separate "Parse CSS" entry; instead, it captures parsing and CSSOM tree construction, plus the recursive calculation of computed styles, under this one event.
Tracing CSSOM construction in DevTools
Our trivial stylesheet takes ~0.6ms to process and affects eight elements on the page—not much, but once again, not free. However, where did the eight elements come from? The CSSOM and DOM are independent data structures! It turns out the browser is hiding an important step. Next, let’s talk about the render tree that links the DOM and CSSOM together.

Monday, February 20, 2017

Parallel For Loop

Parallel Loops

The Task Parallel Library (TPL) includes two loop commands that are parallel versions of the for and foreach looping structures of C#. They each provide the code needed for the Parallel Loop Pattern, ensuring that the entire process is completed with all iterations executed before moving on to the statement following the loop. The individual iterations are decomposed into groups that may be divided between the available processors, increasing performance on machines with multiple cores.

Parallel.For

In this article we will consider the parallel for loop. This provides some of the functionality of the basic for loop, allowing you to create a loop with a fixed number of iterations. If multiple cores are available, the iterations can be decomposed into groups that are executed in parallel.
To demonstrate, create a new console application. The parallel loops are found in the System.Threading.Tasks namespace, so add the following using directive to the generated code file:
using System.Threading.Tasks;

To begin, we can create a sequential loop. In the code below, the loop iterates ten times, with the loop control variable increasing from zero to nine. In each iteration the GetTotal method is called. This performs a calculation that is included to generate a long enough pause to see the performance improvement of the parallel version.
When you run the program it outputs the iteration number from the loop control variable and the result of the calculation. NB: You may wish to adjust the length of the loop in the GetTotal method to achieve a useful pause between iterations.
static void Main()
{
    for (int i = 0; i < 10; i++)
    {
        long total = GetTotal();
        Console.WriteLine("{0} - {1}", i, total);
    }
}
 
static long GetTotal()
{
    long total = 0;
    for (int i = 1; i < 1000000000; i++)    // Adjust this loop according
    {                                       // to your computer's speed
        total += i;
    }
    return total;
}
 
/* OUTPUT
 
0 - 499999999500000000
1 - 499999999500000000
2 - 499999999500000000
3 - 499999999500000000
4 - 499999999500000000
5 - 499999999500000000
6 - 499999999500000000
7 - 499999999500000000
8 - 499999999500000000
9 - 499999999500000000
 
*/
To convert the above loop into a parallel version, we can use the Parallel.For method. The syntax is different, as it is provided by a static method rather than a C# keyword. The version of the method that we are interested in has three parameters. The first two arguments specify the lower and upper bounds of the loop, with the upper bound being exclusive. The third parameter accepts an Action<int> delegate, usually expressed as a lambda expression, that contains the code to run during each iteration; it receives the current loop control value as its argument.
The parallel syntax for the previous loop is shown below. When you run this code on a computer with multiple cores, you should see a considerable improvement in performance. On a single-core, single-processor system the performance will be marginally slower than that of the equivalent sequential loop.
Parallel.For(0, 10, i =>
{
    long total = GetTotal();
    Console.WriteLine("{0} - {1}", i, total);
});
 
/* OUTPUT
 
5 - 499999999500000000
1 - 499999999500000000
6 - 499999999500000000
0 - 499999999500000000
2 - 499999999500000000
7 - 499999999500000000
4 - 499999999500000000
3 - 499999999500000000
8 - 499999999500000000
9 - 499999999500000000
 
*/
It is important to note that the output for the parallel version is different from that of its sequential counterpart. The results shown in the comments above were achieved using a dual-core processor. In this case iteration '5' completed first and what would have been the first iteration in the sequential version actually ran fourth. This change to the ordering of the loop almost always happens when running in parallel and can cause problems if unanticipated.