A privacy-by-design approach.
Analytics are one of the most common use-cases on the web. You want to know how many people are visiting your website, whether anyone actually clicked the link you posted on social media, or who is sending traffic to your website. For most sites, the solution is to just drop a Google Analytics script into the page - it's free, after all. This has led us to the current situation, where Google Analytics is present on 87% of the top half a million websites and, despite its use of reasonably short-lived identifiers, the way the data is collected can be used to track users across these sites.
Is counting page visits such a difficult problem that only Google has solved it? No: there are paid and open-source alternatives available. But why pay when you can use a free version which does more, and why host a server, with the extra costs that entails, when you don't have to?
But is Google Analytics actually better than the competition? We would argue that, at least among privacy-conscious users (i.e. those who contribute to the WhoTracks.Me dataset), Google Analytics will report vastly incorrect figures, for two main reasons:
So how can we accurately measure the traffic coming to our site without exposing the user to tracking and privacy side-effects? This was a problem we faced when we created the WhoTracks.Me website. We wanted some analytics so that we could measure whether we are succeeding in engaging people with the information we provide on the site. However, we had a few constraints:
Our analytics implementation satisfies these three constraints, using probably the oldest technique on the Internet: server log parsing. Daily analytics for the WhoTracks.Me site are generated as follows:
Processing of raw CloudFront logs to remove potential personal data.
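As an illustration of what such a processing step might look like, here is a minimal sketch, not the pipeline used for this site. It assumes CloudFront standard logs (tab-separated rows preceded by a `#Fields:` header that names the columns, including `c-ip`) and replaces each client IP with a keyed hash. The keyed hash (HMAC-SHA256) is our stand-in for the IP obfuscation step, and the function names and the 16-character truncation are hypothetical choices made for the example.

```python
import hashlib
import hmac
import secrets


def new_period_key() -> bytes:
    """Generate a random key; assumed to be discarded when the period ends."""
    return secrets.token_bytes(32)


def pseudonymise_ip(ip: str, key: bytes) -> str:
    """Map an IP to a keyed hash that is only linkable while `key` exists."""
    digest = hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncation length is an arbitrary choice


def scrub_log(lines, key: bytes):
    """Yield log lines with the c-ip column replaced by its keyed hash."""
    ip_index = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # The header names the columns, e.g. "#Fields: date time ... c-ip ..."
            fields = line[len("#Fields:"):].split()
            ip_index = fields.index("c-ip")
            yield line
        elif line.startswith("#") or not line:
            # Other comment lines and blank lines pass through unchanged.
            yield line
        else:
            if ip_index is None:
                raise ValueError("no #Fields header seen before data rows")
            columns = line.split("\t")
            columns[ip_index] = pseudonymise_ip(columns[ip_index], key)
            yield "\t".join(columns)
```

Once the key is thrown away, the pseudonyms can no longer be recomputed from a candidate IP address, so requests can only be linked within the period for which a single key was in use.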
This workflow allows us to keep track of how much traffic we are getting to the WhoTracks.Me website. There is also no reason that this method could not be scaled up to the more complex use-cases which services like Google Analytics provide, such as conversion counting - provided the time frame in which these conversions can occur is shorter than the period for which the IP encryption key is used.
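To make the conversion-counting idea concrete, here is a small sketch under stated assumptions: `visitor_id` stands for whatever per-period pseudonym replaces the IP in the processed logs, events arrive in chronological order, and all of them fall within a single key period. The function name and the example paths are hypothetical.

```python
def count_conversions(events, landing_path, goal_path):
    """Count visitors who reached goal_path after visiting landing_path.

    `events` is an iterable of (visitor_id, path) pairs in chronological
    order, all taken from one key period so that visitor_id is stable.
    Returns (visitors_who_landed, visitors_who_converted).
    """
    landed = set()
    converted = set()
    for visitor_id, path in events:
        if path == landing_path:
            landed.add(visitor_id)
        elif path == goal_path and visitor_id in landed:
            converted.add(visitor_id)
    return len(landed), len(converted)


# Hypothetical events from one key period.
events = [
    ("a1", "/blog/post.html"),
    ("b2", "/blog/post.html"),
    ("a1", "/download.html"),
    ("c3", "/download.html"),  # never saw the landing page: not counted
]
landed, converted = count_conversions(events, "/blog/post.html", "/download.html")
```

A conversion that spans two key periods would be missed, because the same visitor appears under two unrelated pseudonyms. That is the limitation described above.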
The method is also safe with respect to privacy regulations and user preferences. As IPs are stored for a maximum of 1 day (and this is only because CloudFront's logging does not obfuscate IPs for us), no other personal information is collected, and message linkage is limited to 1 day, there are no additional obligations regarding the usage of this data under GDPR. Furthermore, as tracking is time-limited and context-limited (this data can only be used to analyse usage of whotracks.me), it respects Do Not Track automatically (using the standard's own tracking definition).
We rolled our own analytics for this site because there was no off-the-shelf solution providing the (very basic) analytics we wanted without significant extra overhead, or potential privacy implications for users of the site. Our system leverages CloudFront logging with a data obfuscation step in order to collect privacy-safe server logs which can then be analysed for basic insights. This technique could be extended to provide most of the richer features of existing web analytics tools.
The lack of privacy-preserving tools in the web analytics ecosystem is a worrying trend. Google Analytics dominates because it provides an extremely feature-rich product at zero cost to the webmaster. It is difficult to see how a service can compete with free without selling analytics data. Existing competitors mostly aim for businesses who will pay for a premium product, and leave bloggers and smaller sites to Google.
While increasing use of adblockers is a more fundamental threat to Google's ad business, a side effect may be a loss of trust in Google Analytics, as we measure that 29% of pages with Google Analytics are affected by blocking. We already see companies which rely on analytics for core business activities (for example advertisers using affiliate schemes) deploying multiple analytics scripts and averaging the results. If the trust in analytics breaks down, then this whole ecosystem may unravel.