Removing Referral Spam

There is a lot of misinformation about clearing out so-called referral spam, so here’s the Definitive Guide to removing it all! Website reporting is tough enough, but when you start getting non-human visits in your reports, it makes life really miserable.

This article describes three techniques to stop bots and spam referrals from appearing in FUTURE website reports. To eliminate spam referrals from historical reporting, a companion article describes an Advanced Segment to use.


UPDATE 2015-04-28: google / organic search spam with the keyword "vitaly rules google..." is using a fake hostname of google.ru.
2015-04-24: free-share-buttons.com / referral, pornhub-forum.ga / referral, youporn-forum.ga / referral, domination.ml / referral, torture.ml / referral, www.Get-Free-Traffic-Now.com / referral, buy-cheap-online.info / referral and theguardlan.com / referral are fake referrals annoying a lot of people these days.
The valid hostname include filter described below would have prevented all of these from messing up your stats.

Specific Exclude Filters I am currently using (yours may differ):
12masterov.com|bard-real.com.ua|billiard-classic.com.ua|cardiosport.com.ua|ci.ua|customsua.com.ua|delfin-aqua.com.ua|dipstar.org|dvr.biz.ua|e-kwiaciarz.pl|este-line.com.ua|ghazel.ru|it-max.com.ua|maridan.com.ua|mebeldekor.com.ua

mirobuvi.com.ua|offers.bycontext.com|olgacvetmet.com|palvira.com.ua|trion.od.ua|наркомания.лечениенаркомании.com|алкоголизм.лечениенаркомании.com|med-zdorovie.com.ua|ranksonic.org|.*ranksonic.com

First, there are three types of junk visits and spam referrals, and there are different ways to deal with each of them:

1. Ghost referrals like the darodar / ilovevitaly / cenoval

2. Creepy crawlers like semalt (a.k.a. best-seo-solution.com) and fake referrals like maridan.com.ua and blog.ranksonic.com.

3. Well behaved bots and spiders

Follow Best Practices

Before you start hacking away at your Google Analytics settings, here is some great guidance from LunaMetrics on implementing new filters:

1. Make sure you always have an unfiltered view in your property that has zero filters.

2. Don’t implement it immediately in your main view. Create a new test view that mirrors your main one in every other respect, and then add the filter(s).

3. If you’re happy with the new filter based on this test, then go ahead and implement it in the main view.

BAD Advice: Use The Referral Exclusion List

A number of sites recommend using the Referral Exclusion list feature (Admin – Property – Tracking Info) — this does NOT work! While it may remove some of the annoying entries in your referral report, it may actually change the session to a Direct visit and it continues to appear in your reports.

 

1. Ghost Referrals

[Sidebar: I have noticed that the spammers target tracking IDs that end with "-1" (e.g. UA-1234567-1). If you make a second Property in your GA account and switch your tracking code to the "-2", "-3" or other variant, most of these ghost referrals will never get recorded on your site. Note: you cannot transfer your existing analytics data to the new property, but switching is easier than filtering forever.]

The latest arrivals (darodar.com and many more) are what I call “ghost referrals” because they actually NEVER VISITED YOUR SITE. Using some software magic, they post fake pageviews to Google’s tracking service using a random series of tracking IDs. When they pick a series that includes your tracking ID, Google records a referral visit from their source in your reports.

Some variants of this attack use fake google / organic search visits with keywords for you to investigate (like “google officially -recommends ilovevitaly.com search shell“).

Since they never actually visited your site, you can’t block their visits at the server using any website Javascript or .htaccess methods. You have no choice but to create a filter to exclude them (as described below). The biggest problem with these ghost referrals is that they change as quickly as they appear, so you could be continuously building filters for them.

How To Eliminate All Ghost Referrals

You could create specific filters to remove each spam source, but a method that requires a lot less effort to maintain is to create a filter based on valid hostnames. Since the spam referrers do not know whose website the tracking ID belongs to (they are picking numbers at random), they send the “referral” using a hostname that is not one of yours. You can create an INCLUDE filter that keeps ONLY what was recorded from one of your valid web hosts, and you can stop worrying about darodar.com / econom.co / ilovevitaly.co / whatever-comes-next.

Huh? They are referrals, why are you filtering on hostname?

All of your analytics reports are affected by the “spam” traffic, which is why they are so annoying. If all they did was list a fake referral, everyone would ignore them, but they affect site bounce rate, pageviews, total sessions and users, time on site... everything.

You need to remove the visit from your data. Those visits include a lot of data associated with them. You can filter by city, or by referrer, or by browser. I noticed that the “ghost referral” visits use a hostname that is different from all of my regular traffic. Because there have been so many variants over the past month, I feel it is easier to create a filter that lets IN the good traffic and just locks OUT everything else. The fact that the traffic it is removing happens to be a referral from X or Y or Z is irrelevant to the filter – it is not “good” traffic, so ignore it.

They could just as easily fake search traffic with keywords that make you go to their site [oops, did I give them that idea?], or…well, they don’t need any other good ideas. “Good” traffic comes from hits on my servers (hostnames). Throw away the rest.

To implement this solution, STEP CAREFULLY or you will exclude valid traffic! You MUST identify ALL valid hostnames that may use your website tracking ID, and this could include other websites that you are tracking as part of your web ecosystem: your own domain, PayPal, your ecommerce shopping cart, and all of your reserved domains (in case you decide to use them).

Start with a multi-year report showing just hostnames (Audience > Technology > Network > hostname), then identify the valid ones — the servers where I have real pages being tracked.

Many people have a problem with this step; here’s what I picked and why:

  • www.analyticsedge.com – my main site.
  • help.analyticsedge.com – my help site.
  • translate.googleusercontent.com – I have a lot of international visitors on my site, and they use Google translate to read my articles.
  • www.youtube.com – I have a YouTube channel with videos that I track using Google Analytics. I had to configure it with my tracking code.
  • sites.fastspring.com – I use FastSpring as my eCommerce provider to process payments. I configured my account there with my tracking code.
  • webcache.googleusercontent.com – some of my visitors use Google’s cached version of my articles.

I do not have any pages on google.com, mozilla.org, firefox.com or any of the other sites. I never configured them with my tracking code. Traffic on those hostnames is spam.

Then I create a filter with an expression that captures all of the domains that I consider valid. TEST, TEST, TEST! Then move to production when you are sure you have it all.

* The Filter Expression [Really Simplified]

Many people have a problem composing the filter expression because it is Regex (regular expressions), so let’s keep it really simple in this case. Identify YOUR hostname(s) from the Google Analytics report as above.

For your filter expression, simply enter YOUR hostname. If you have more than one, separate them by a vertical bar ( | ). If you have a third-party payment service like paypal.com, you may need to enter it as well. In my third example below, I include all of the sites I use. Note that I have used a Regex “.*” (dot-asterisk) to match all subdomains of the ones listed.

www.analyticsedge.com 
OR
www.analyticsedge.com|help.analyticsedge.com
OR 
.*analyticsedge.com|.*youtube.com|.*fastspring.com|.*googleusercontent.com

Important: do NOT put a vertical bar at the ends of the expression, and NO spaces.
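To see why the vertical bar and the spaces matter, here is a minimal Python sketch that imitates an Include filter. It assumes the filter does a partial, case-insensitive regex match against the hostname (which is my understanding of how Google Analytics filter patterns behave); the function name kept_by and the sample hostname list are made up for illustration.

```python
import re

# Hypothetical hostnames as they might appear in the hostname report.
hostnames = [
    "www.analyticsedge.com",
    "help.analyticsedge.com",
    "translate.googleusercontent.com",
    "iedit.ilovevitaly.com",
    "google.ru",
]

def kept_by(expression, names):
    """Return the hostnames an Include filter with this expression would keep.

    Assumption: partial, case-insensitive matching, imitated with re.search.
    """
    pattern = re.compile(expression, re.IGNORECASE)
    return [name for name in names if pattern.search(name)]

good = r".*analyticsedge.com|.*googleusercontent.com"
trailing_bar = good + "|"  # a bar at the end adds an empty alternative
with_spaces = "www.analyticsedge.com | help.analyticsedge.com"

print(kept_by(good, hostnames))          # only the three valid hostnames
print(kept_by(trailing_bar, hostnames))  # everything: the empty alternative matches any text
print(kept_by(with_spaces, hostnames))   # nothing: the spaces become part of each alternative
```

A trailing bar turns the Include filter into "include everything", and stray spaces turn it into "include nothing", which is why both mistakes are worth checking in a test view first.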

Note: spammers may use sites you recognize, like apple.com or theguardian.com or amazon.com. Do NOT add those to your list unless they are YOUR sites with YOUR pages.

It is CRITICAL that you maintain this filter EVERY TIME you enter your tracking ID into a new web service, and you should confirm using an unfiltered view every month that you are not excluding valid traffic.
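The monthly check can be partly scripted. The sketch below is only an illustration: it assumes you have copied hostname and session counts out of the unfiltered view into (hostname, sessions) pairs, and it reuses my third filter expression from above. The audit function, the sample rows and the hostname shop.example.net are all hypothetical.

```python
import re

# The third example expression from this article.
VALID_HOSTNAMES = r".*analyticsedge.com|.*youtube.com|.*fastspring.com|.*googleusercontent.com"

def audit(rows, expression=VALID_HOSTNAMES):
    """Split (hostname, sessions) rows into those the Include filter keeps and drops."""
    pattern = re.compile(expression, re.IGNORECASE)
    kept, dropped = [], []
    for hostname, sessions in rows:
        target = kept if pattern.search(hostname) else dropped
        target.append((hostname, sessions))
    # Largest dropped hostnames first: those deserve a second look.
    dropped.sort(key=lambda row: row[1], reverse=True)
    return kept, dropped

# Hypothetical rows standing in for the unfiltered hostname report.
rows = [
    ("www.analyticsedge.com", 5200),
    ("sites.fastspring.com", 310),
    ("shop.example.net", 45),   # imagined new service that was never added to the filter
    ("google.ru", 120),
    ("(not set)", 80),
]

kept, dropped = audit(rows)
for hostname, sessions in dropped:
    print(f"{sessions:>6}  {hostname}")
```

Anything in the dropped list that you recognize as one of YOUR services is valid traffic the filter is throwing away, and the expression needs updating.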

To eliminate ghost referrals from your historical reports, use a segment.

Why do some people insist that changing .htaccess files works? Because the comments also say to wait 2-3 days for it to take effect, and the spam traffic changes in that time, so it appears the blocking worked.

2. “Creepy” Crawlers and Fake Referrals

Not all bots identify themselves and follow the rules, and some are notorious for crawling websites and messing up reports. They seem to creep around the web, grabbing information for questionable purposes. In some cases, like the Semalt crawler, you can go to their website and ask to have your site excluded from their crawler [read the instructions and reference all of your subdomains using the full http://xxx.mydomain.com format. Get help from https://twitter.com/Nataliya_Semalt].

In many other cases, the last thing you should do is to visit the referring site, since this is an invitation to get a virus or Trojan infection on your computer. I recommend you do a quick Google search first to see if you can trust it. I always view page 2 and 3 of the search results for differing opinions because there is a lot of misinformation out there. Don’t open any of the links; the snippets in the search results are usually enough to tell you whether there is a serious problem with the site. If you can’t tell by the search results, check the Google Analytics group or Google+ page.

A lot of spam referral links are intended to get you to visit the link. Sometimes they are evil destinations, and sometimes they are legitimate businesses that contracted with a shady SEO company to increase traffic. If you clicked, they get paid. Typical destinations include Ukrainian sites like 12masterov.com and many others (a recent version of my filter list is below). None of them have real links to your site, but they seem to send you referrals. This is made possible because the web works on a trust basis: the browser visiting your site tells you where they were referred from, and that can be faked.

You can exclude them from your reports in Google Analytics by creating a filter. The way to do this is to find a “unique signature” that identifies them (and only them) and create a filter based on that. For most, filtering on Campaign Source with a matching domain usually works. Most people try filtering on Referral, and the filter doesn’t work because it must match the Full Referrer, not just the domain.

Read Google’s instructions on making filters. If you use Google Tag Manager, Lunametrics has a nice option.

Notes about this approach:

  • although the referrals are filtered from your reports, the visits to your website continue to occur and are included in the total session threshold that triggers data sampling in Google Analytics. If you can, and know how, blocking the visits from your web server would be better. On an Apache web server, this would be done by modifying the .htaccess file.
  • some people may feel the match pattern should be “semalt\.com” because it is a regular expression (the dot should be “escaped” with a backslash to prevent it from being interpreted as meaning ‘any character’), but Google doesn’t push it in their documentation for single domain matches.
  • you will need to create or modify the filter for each new crawler/referral, there is a 255-character limit in the filter expression, and there is a limit to the number of filters you can use. Always keep an unfiltered view and check on the impact at least quarterly. Many bots disappear within a few months when their effectiveness drops off.
  • Just because a lot of spam referrals came from Russia last month doesn’t mean they will come from Russia next month. Unfortunately you will need to update your filters over time.

My current filters (there is a 255 character limit) include:

12masterov.com|bard-real.com.ua|cardiosport.com.ua|customsua.com.ua|delfin-aqua.com.ua|dipstar.org|dvr.biz.ua|e-kwiaciarz.pl|este-line.com.ua|ghazel.ru|maridan.com.ua|mebeldekor.com.ua|med-zdorovie.com.ua|onlywoman.org|palvira.com.ua|trion.od.ua
it-max.com.ua|billiard-classic.com.ua|ci.ua|mirobuvi.com.ua|наркомания.лечениенаркомании.com
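Keeping lists like these under the limit by hand gets tedious. As a rough sketch (not how I built the lists above), the Python below escapes the dots, joins the domains with vertical bars, and starts a new filter expression whenever the next domain would push the current one past the limit. The function name build_exclude_patterns is made up, and it assumes the limit counts characters the way Python's len() does.

```python
MAX_PATTERN_LENGTH = 255  # the filter expression limit mentioned above

def build_exclude_patterns(domains, limit=MAX_PATTERN_LENGTH):
    """Pack escaped domains into '|'-joined expressions that each fit the limit."""
    patterns, current = [], ""
    for domain in sorted(set(domains)):
        escaped = domain.replace(".", r"\.")  # match a literal dot, not "any character"
        if len(escaped) > limit:
            raise ValueError(f"{domain!r} alone exceeds the {limit}-character limit")
        candidate = escaped if not current else current + "|" + escaped
        if len(candidate) > limit:
            patterns.append(current)  # current expression is full, start a new one
            current = escaped
        else:
            current = candidate
    if current:
        patterns.append(current)
    return patterns

# A few domains from the list above; the small limit just forces a split for the demo.
domains = ["12masterov.com", "maridan.com.ua", "ghazel.ru", "dipstar.org", "semalt.com"]
for pattern in build_exclude_patterns(domains, limit=40):
    print(len(pattern), pattern)
```

Each printed expression would go into its own Exclude filter on Campaign Source. Because no expression starts or ends with a vertical bar, none of them accidentally matches everything.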

 

3. Well Behaved Bots and Spiders

The web exists because of bots and spiders. These are normally good things. They discover your content and share it with others. Google wouldn’t know about you without them. To prevent their traffic from appearing in your web analytics, standards were put in place so they could self-identify, and web analytics applications could automatically filter them out.

Google Analytics has a simple checkbox you can use to exclude most of these well behaved bots and spiders, but you have to enable it for every View you use.

In your Google Analytics Admin section, navigate to each View you use, select View Settings, and check the box to Exclude all hits from known bots and spiders.

This is a good starting point…but it doesn’t handle everything.

Do I Need All Three Filters?

Yes. Bot filtering removes some visits, and hostname filtering eliminates the ghosts, but some of the other bots require specific filters to remove their traffic.

Summary

Well behaved bots and spiders can easily be excluded from your reports by checking the option in your Admin – View – Settings.

Questionable crawlers and fake referrals are best eliminated at the web server (.htaccess file). They can also be removed from your reports using the Admin – View – Filter, but the visit is still counted by Google when they determine whether to apply data sampling for your report.

Ghost referrals never visit your site and must be removed by a Filter. Using an Include filter with valid hostnames will greatly reduce the maintenance effort, but must be maintained or it might exclude valid traffic from a new hostname in the future.

Wondering Why They Do This?

There are a few reasons someone would run a bot like these: the first is that they may be crawling the web to gather website information, just like Google’s crawlers do. Sometimes the purpose is a little shady, like they are looking for security vulnerabilities to exploit. The web is full of bots.

The second reason is that they want a bunch of website owners to look at them, so they push referral links to your site to make you open the referral link to see who posted a link to your site. If you are selling Search Engine Optimization services like Semalt, what better way to market? They are guaranteed to find small website owners that jump at the thought someone actually linked to their site!

The third reason is that they simply want a bunch of people to look at a particular website. Maybe it is a shady SEO service: I can get you thousands of pageviews for $###. And many will NOT bounce (because we search around looking for that link to our site before we eventually conclude it was spam), so you can sell it as “quality” traffic. Yes, there is a really slim chance we’ll buy something, but I bet it’s an SEO deal gone bad.

Wondering How They Did It?

The crawlers are pretty straightforward: start with a list of web pages, and look for links; follow those links and look for more. New crawlers run Javascript on the pages so they can get dynamic page content, and hence they end up running the Google Analytics tracking code.

The Ghost Referrals don’t actually go to your site at all; they take advantage of a loophole in how Google Analytics works behind the scenes. When a visitor visits your site, they run the GA Javascript snippet and that sends a ‘ping’ to the Google Analytics servers with information about the website visited (identified by your UA-#######-# tracking ID), a unique user ID (that is supposed to come from a cookie on your computer), the page viewed, the server hostname and the referral source. The spammers send millions of fake ‘ping’s with specially crafted information, using their own website information as the referral source.

They don’t actually go to your site to get the tracking IDs; they probably just use randomly selected ones in a range. Since they never visited your website, they don’t know your server’s hostname, and that is why the hostname appears as ‘iedit.ilovevitaly.com’ or ‘apple.com’. That provides the clue that the traffic has been manufactured.
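To make the clue concrete, here is a small Python sketch that reads the fields described above out of a ‘ping’ and checks the hostname. It assumes the ping is a URL-encoded payload using the Universal Analytics Measurement Protocol parameter names (tid, cid, dp, dh, dr). Both sample payloads, the function names and the short hostname set are invented for illustration.

```python
from urllib.parse import parse_qs

# Hostnames that really serve pages carrying my tracking code (shortened for the example).
VALID_HOSTNAMES = {"www.analyticsedge.com", "help.analyticsedge.com"}

def describe_hit(payload):
    """Pull the fields discussed above out of a URL-encoded hit payload."""
    fields = {key: values[0] for key, values in parse_qs(payload).items()}
    return {
        "tracking_id": fields.get("tid"),  # the UA-#######-# property ID
        "client_id": fields.get("cid"),    # normally read from a cookie
        "page": fields.get("dp"),
        "hostname": fields.get("dh"),
        "referrer": fields.get("dr"),
    }

def looks_like_ghost(payload):
    """A hit whose hostname is not one of ours was never served by our site."""
    return describe_hit(payload)["hostname"] not in VALID_HOSTNAMES

# Made-up payloads: one as a real page view might send it, one as a spammer might fake it.
real = ("v=1&t=pageview&tid=UA-1234567-1&cid=555.777"
        "&dh=www.analyticsedge.com&dp=%2Farticle&dr=https%3A%2F%2Fwww.google.com%2F")
fake = ("v=1&t=pageview&tid=UA-1234567-1&cid=1.2"
        "&dh=iedit.ilovevitaly.com&dp=%2F&dr=http%3A%2F%2Fdarodar.com%2F")

print(looks_like_ghost(real), looks_like_ghost(fake))
```

The tracking ID is identical in both payloads, so it cannot tell the two apart. Only the hostname gives the fake away, which is exactly what the Include filter relies on.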

As for who? Well, over on blackMORE Ops, someone posted a comment saying “Good job! Vitaly Popov is my real name. […] I don’t need to hide my personality, because what I’m doing it isn’t a crime as minimum in Russia. It is just creative marketing. And yes, I’m having a lot of fun and laughing at you all!” You have to admit: Google left a hole in their analytics system, and he’s just taking advantage of it. I’m surprised it hasn’t been repeated by others yet.