Updating our tracking prevalence metrics

Metrics that make more sense.

Feb 22nd, 2019

This month we updated the site with data from over 800 million page loads collected during January. In this data release we have also changed one of the main figures that we publish: site reach. Metrics are only as good as their ability to capture a simplified version of reality, and the main motivation for redefining site reach is exactly that: to simplify the understanding of prevalence.

The site reach stat was conceived as a measure of the number of different sites on which a tracker has some presence. In contrast to reach - the proportion of pages loading a tracker - it shows how widely a tracker is spread across the web.

We define a new site reach metric as:

The number of sites in the top 10,000 which have this tracker on more than 1% of page loads.
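As a rough illustration of this definition, the count could be computed along the following lines. This is a minimal sketch, not the WhoTracks.Me pipeline: the function name, the `(site, trackers_seen)` input format, the `top_sites` set and the `min_share` parameter are all assumptions made for the example.

```python
from collections import defaultdict


def site_reach_top10k(page_loads, tracker, top_sites, min_share=0.01):
    """Count the top sites where `tracker` is on more than `min_share` of page loads.

    page_loads: iterable of (site, trackers_seen) pairs, one per page load
                (hypothetical input format).
    top_sites:  set of site names standing in for the fixed top-10,000 list.
    """
    total = defaultdict(int)         # page loads seen per site
    with_tracker = defaultdict(int)  # page loads per site that included the tracker

    for site, trackers_seen in page_loads:
        if site not in top_sites:
            continue  # sites outside the fixed list never count
        total[site] += 1
        if tracker in trackers_seen:
            with_tracker[site] += 1

    # A site counts only if the tracker's share of its page loads exceeds the threshold.
    return sum(1 for site, n in total.items() if with_tracker[site] / n > min_share)


# Made-up data: the tracker is on 2% of a.com loads and 0.5% of b.com loads.
loads = (
    [("a.com", {"t1"})] * 2 + [("a.com", set())] * 98
    + [("b.com", {"t1"})] * 1 + [("b.com", set())] * 199
    + [("c.com", {"t1"})] * 5  # not in the top list, so ignored
)
print(site_reach_top10k(loads, "t1", {"a.com", "b.com"}))
```

Only a.com passes the 1% threshold here, so the example prints 1. The result is a plain count of sites rather than a proportion.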

The relationship between reach and site reach paints an interesting picture of prevalence:


Figure 1: Reach vs Site Reach, Jan 2019 (source: whotracks.me)

Up until this point, we defined site reach as the proportion of sites on which a given tracker was observed at least a fixed number of times. With this formula we ran into two fundamental issues in practice:

  1. How often should a tracker appear on a site for it to be counted?

The simplest solution would be to say that one observation is sufficient: if a tracker has been seen at least once on a given site, we count it. However, some noise is to be expected, introduced for example by particular browser configurations, installed extensions or ISP redirects. This could result in falsely counting extra trackers on a given site.

A low threshold therefore makes the metric vulnerable to changes in data volume. If the threshold is too low, the metric is unstable, fluctuating when no real-world change has occurred. If the threshold is too high, it fails to capture the presence of trackers on particular low(er)-volume subpages of a given website (e.g. payment pages).

  2. The long tail of low-traffic sites skews the results.

    During January our dataset counted 1.3 million distinct sites. However, the traffic distribution is heavily skewed towards the top few thousand sites. Consider the popularity metric for sites, which measures traffic relative to the most popular site (Google.com): the 10th most popular site (Pornhub) already has just 6% of Google's traffic. By 100th place this ratio is 0.6%, and by 1,000th place it is 0.08%. By the 100,000th entry there are only 430 page loads over a month, and at place 1,000,000 just 16. This long tail means that, firstly, the impact of tracker presence on these sites is low - the bottom 50% of the 1.3 million sites we see corresponds to only 1% of total traffic - and secondly, the low data volume increases the noise involved in measuring presence.
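To make the long-tail figures concrete, here is a small sketch of how a popularity ratio and the traffic share of the top sites could be derived from per-site page load counts. The function names and the input dictionary are hypothetical, and the numbers are invented for illustration, not taken from the dataset.

```python
def popularity(page_loads_by_site):
    """Traffic of each site relative to the most visited site (which gets 1.0)."""
    top = max(page_loads_by_site.values())
    return {site: n / top for site, n in page_loads_by_site.items()}


def traffic_share_of_top(page_loads_by_site, n):
    """Fraction of all page loads that go to the n most visited sites."""
    counts = sorted(page_loads_by_site.values(), reverse=True)
    return sum(counts[:n]) / sum(counts)


# Invented counts with a steep drop-off, loosely mimicking a long tail.
counts = {"first.example": 1000, "second.example": 60, "third.example": 6, "tail.example": 1}

print(popularity(counts)["second.example"])  # relative popularity of the second site
print(traffic_share_of_top(counts, 2))       # share of traffic covered by the top two sites
```

Both helpers assume a non-empty mapping of site to page load count. With a distribution this skewed, a small fixed set of top sites covers most of the traffic, while the remaining sites add many entries but few page loads.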

Due to these difficulties and the complexities arising when explaining site reach, we decided to redefine this metric - making it simpler and more intuitive to the reader, while still capturing the prevalence of the tracker.

Given that the top 10,000 sites account for 75% of page loads in our data, we decided to measure presence across this fixed set of sites. With the denominator of our formula fixed, the output is no longer influenced by the number of sites observed, which can vary with data volume. The metric is also simpler: a ratio over 10,000 is easier for most people to understand than one over 1.3 million. For example, we now show a site reach of 13 sites instead of 0.006% previously.

The data has now been updated to use this new metric, under the site_reach_top10k key. A further value, site_avg_frequency, gives the mean presence across these sites.

How does this metric compare to the previous one? We back-calculated the new metric for the last 5 months, and found that it makes the site reach of some of the top trackers look even more concerning:

The new site reach is now present on the WhoTracks.Me website in place of the old metric, and published in our data. As usual we will continue publishing monthly updates to track the development of this metric over time.