Sunday, 1 December 2013

Selection optimisation

I've just made a fairly significant optimisation to libcss's selection
engine performance. It's now using more memory, but hopefully that's
worth the extra speed. Please let me know how you get on with it.

In brief, compare:

Before: http://www.netsurf-browser.org/temp/sel-opt/before-bloom-main.png
After: http://www.netsurf-browser.org/temp/sel-opt/bloom-04-main.png

More info below.

Skip to the "Selection optimisation" section if you already know why we
were so slow before.


How selection works
-------------------

Take the following CSS rule from NetSurf's own web site:

.navigation ul.languages li a:hover,
.navigation ul.sitelinks li a:hover {
display: block;
color: #00f;
padding: 0 0 0 0.2em;
margin: 0;
background: transparent;
text-decoration: none;
}

It's a single rule with two selector chains. To determine whether to
apply that rule the element node we're selecting for, at least one of
the selector chains must match.

Considering only the first selector chain, it's saying, "Match an 'a'
element in the hover state who has an ancestor node which is an 'li'
element who has an ancestor node which is a 'ul' element with the
'languages' class who has an ancestor node with the 'navigation'
class."

Libcss is able to ask the client (NetSurf) about element nodes via
a series of callbacks. So to match a selector chain, libcss asks
if the node being selected is an 'a' element. It then needs to ask
wether the element has a ancestor of type 'li'. And so on.

On the NetSurf side the callbacks are implemented with a bunch of
calls to libdom to walk up the DOM tree to look for whatever libcss
is enquiring after.


Selection performance problem
-----------------------------

CSS selection has been dominating NetSurf profiles for ages.

For a site like http://www.bbc.co.uk/news/ at the current time there
are 4366 selector chains to match for each of 1156 element nodes.

We already had an optimisation whereby selector chains were hashed
by the right most selector details. There are four hash tables for
element name, class name, id name and finally for selector chains
whose right most selector has no element name, class name or id name.

However even with that selection is expensive. When launching
framebuffer NetSurf with the BBC news homepage, and quitting as soon
as the page finishes loading, over 45% of NetSurf's CPU usage is
spent in the selection engine:

http://www.netsurf-browser.org/temp/sel-opt/before-bloom-main.png


Selection optimisation
----------------------

I've made two optimisations. The first is quite minor, though
worthwhile. We skip selector chains which don't meet the
following criteria:

+ The rule must have some bytecode, i.e. if the rule matches,
there must be some style data to apply.
+ If the selector chain's right most selector specifies the element
name, then the name of the element we're selecting for must be a
match.
+ The rule must be applicable for the media type we're selecting for.

I moved the testing for those so they happen a bit earlier. It's now
done before we compare the rules that emerge from each of the selector
chain hash itterators for specificity and rule index (to ensure they're
applied in the correct order).

The second optimisation is to generate bloom filters loosely describing
node ancestry and selector chain ancestry requirements.

In NetSurf, we now store a node bloom filter as user data on the libdom
element nodes. This bloom filter is filled with all the node's ancestor
element names, class names and id names.

In libcss, for each selector chain, we store a bloom filter containing
all the element/class/id names that follow an ancestor or parent,
combinator (' ' or '+'). So for the first selector chain in the example
rule above, we'd add the following strings to the bloom filter.

"li"
"languages"
"ul"
"navigation"

To determine wether a selector chain can be ruled out without manually
matching it, we simply have to see whether the selector chain bloom is a
subset of the node bloom. This is a much faster operation.

All the strings involved were already interned with libwapcaplet, so
string hash values were already calculated and available.

The size of the bloom filter can be set (at compile time) to either
4, 8, or 16 * uint32_t. I've commited it with CSS_BLOOM_SIZE of 4.

That means 4 * 4 bytes per DOM node, and per selector chain. For the
example of http://www.bbc.co.uk/news/ that's:

DOM nodes: 16 * 1156 = 18,496 Bytes
Selector chains: 16 * 4366 = 69,856 Bytes
------------
88,352 Bytes

The selector chain cost is shared between pages that share the same
external stylesheets. However, there are over 20,000 element nodes
alone for a page like:

http://git.netsurf-browser.org/netsurf.git/tree/render/layout.c

I have run tests with the larger bloom filters and it does reduce false
positive rates, however I judge the overall improvement/cost ratio to be
best with a CSS_BLOOM_SIZE of 4 at the moment.

Test | Total Instruction Fetch Cost | In selection
--------------+------------------------------+-------------
Before bloom | 785,536,304 | 45.30%
Bloom size 4 | 500,727,778 | 13.95%
Bloom size 8 | 484,935,082 | 11.94%
Bloom size 16 | 480,928,151 | 11.06%

What's interesting is that from bloom size 4 to 16 there isn't a huge
drop in total computation required, but the number of calls to
match_details (used in matching selector chains manually) drops sharply:

Test | match_details calls
--------------+--------------------
Before bloom | 1,428,184
Bloom size 4 | 171,451
Bloom size 8 | 101,397
Bloom size 16 | 54,112

That shows that with a bloom size of 4, there are plenty of false
positives, but the cost of testing bloom filters for being a subset
is getting quite heavy with the longer filters.

See the other profile results using the 8 and 16 settings:

http://www.netsurf-browser.org/temp/sel-opt/


What next?
----------

I pushed my changes at this stage since they were showing a decent
improvement, but there are some more things I plan to experiment with.

I tried generating the selector chain bloom filters on the fly, which
was quite costly, so they're currently stored with the selector chains.

It may be faster to avoid generating selector chain bloom filters at all
and walk the selector chain using css_bloom_has_hash on each relevant
string. We could dismiss the selector chain as soon as
css_bloom_has_hash returns false.

--

Michael Drake (tlsa) http://www.netsurf-browser.org/

No comments:

Post a Comment