There’s been a lot of attention recently on one of my favourite SEO issues: duplicate content. “Dupe content” is just one of those subjects that never go away. Why? It seems that for every fix we apply to sort the problem out, an entirely new kind of duplicate content can occur. The other problem is that it takes a while for an inexperienced SEO to learn enough about the problem, how to diagnose it, and how to solve it on their site once and for all.
A lot of the posts we’ve seen recently tend to focus on the same subject: pagination! And the same solution: using a noindex, follow tag to sort it out (this includes me, by the way). This is all fine, but what if you don’t have that kind of issue, or you have pagination, you’ve fixed it, but you still have other problems?
This post is going to focus on two more sources of duplicate content frustration in your internal site architecture: our old friend the session_id, and tracking codes for analytics packages. We’re going to talk about user agent detection, conditional redirects, JavaScript onclick events and, if you’re really desperate, robots.txt wildcards.
1) Hiding Analytics tracking with Onclick events
The ultimate irony. Google blog and offer advice about reducing duplicate content. They also have an analytics package whose tracking codes are littered all over the internet, and those codes are causing heaps of the stuff. How can you prevent leaking a tracking code like ?utm_source= when you need to track the ROI of the link? Use a JavaScript onclick event.
For the moment, Googlebot can’t see or execute these types of links. Here’s an example:
>>> Seogadget home page <<<
Here’s the code:
When you click the link, you will be taken back to my homepage, but you’ll see the utm_source= query appended to the end of the URL. Disable JavaScript with the Web Developer toolbar on this page, mouse over the same link, and you’ll only see a canonicalised homepage link, which, for the time being at least, Googlebot will respect. At least you won’t be leaking any more URLs into Google’s index.
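As a rough sketch of the idea (not necessarily the exact markup used on this page), the link keeps a clean href and the tracking query is only built when a visitor clicks. The helper name withTracking, the example.com address and the "blog-post" campaign value are all placeholders I've made up for illustration.

```javascript
// Sketch only: withTracking, the URL and the campaign value are placeholders.

// Build the tracked version of a clean URL by appending query parameters.
function withTracking(cleanUrl, params) {
  const url = new URL(cleanUrl);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, value);
  }
  return url.toString();
}

// Hypothetical markup using the helper. The href stays canonical, so a
// crawler that does not run JavaScript only ever sees the clean URL:
//
//   <a href="http://www.example.com/"
//      onclick="window.location = withTracking(this.href, { utm_source: 'blog-post' }); return false;">
//     Home page
//   </a>

console.log(withTracking("http://www.example.com/", { utm_source: "blog-post" }));
```

The "return false" at the end of the onclick stops the browser following the clean href after the script has already sent the visitor to the tracked one.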
Hiding tracking code (or any query string) from search engines can be useful, but what if that code is already in the index? We’ll come on to that in a moment.
2) Session IDs leaking all over the place
Most SEO blogs say “don’t use session IDs”. That’s great advice, and most recent content management systems no longer use them. That’s not to say that legacy content management systems should be binned.
If you have a site index full of session IDs, you might want to consider setting up a conditional 301 redirect to send any known search bot back to the canonical version of the URL. I call this “session stripping”: detecting the user agent on every URL and 301ing out any query string or session ID that will cause duplicate content in a search engine index. My good friend Gareth Jenkins is working with me on a technical post on how to do just that in ASP code. Subscribe to my RSS feed and you’ll get it in a few days’ time.
Here’s a good example of a site using user agent detection to strip session IDs from the index. Follow the link below with your Firefox user agent set to Googlebot:
http://www.goldgroup.co.uk/town-planning-recruitment/?session_id={7FA6ADAE-C397-4755-A4F1-0066FE68DC1E}
See? That session ID is cleaned out when you’re Googlebot but not when you’re a user. Introducing this method has the added benefit of cleaning up the site index: every session ID that gets recrawled is 301 redirected to the canonical form, so after a few weeks your entire site index is cleaned out. This is technically known as conditional redirecting, and there’s a lot of debate at the moment about the white-hattedness of the procedure. I personally think this kind of conditional redirection is OK. You’re making it easier for search engines to crawl your site, and you’re not cloaking your content at all. What’s the problem?
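To make the “session stripping” decision concrete, here is a minimal JavaScript sketch of the logic only. It is not the ASP implementation mentioned above. The bot list, the parameter names and the function name sessionStripTarget are my own assumptions, and a real site would plug the result into whatever its server uses to send a 301.

```javascript
// Illustrative, incomplete list of crawler user agent substrings (assumption).
const BOT_PATTERN = /googlebot|bingbot|slurp/i;

// Query parameters assumed to create duplicate URLs on this hypothetical site.
const STRIP_PARAMS = ["session_id", "utm_source"];

// Returns the canonical URL a known bot should be 301 redirected to,
// or null when no redirect is needed (ordinary visitor, or URL already clean).
function sessionStripTarget(userAgent, requestUrl) {
  if (!BOT_PATTERN.test(userAgent || "")) {
    return null; // ordinary visitors keep their session ID
  }
  const url = new URL(requestUrl);
  let changed = false;
  for (const name of STRIP_PARAMS) {
    if (url.searchParams.has(name)) {
      url.searchParams.delete(name);
      changed = true;
    }
  }
  return changed ? url.toString() : null;
}

const googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
console.log(sessionStripTarget(googlebot, "http://www.example.com/jobs/?session_id={7FA6ADAE}"));
```

When the function returns a URL, the server would answer with a 301 and that URL in the Location header; when it returns null, the page is served as normal.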
3) If you’re desperate, the robots.txt wildcard
Let me open this with the following statement: using wildcards to prevent the indexing of session IDs is a bad idea. If every URL a crawler finds carries a session ID and you block them all, your site won’t get crawled at all. That’s bad! If you’re really stuck, however, you could make sure that your internal linking uses the canonical version of the URL and that you’re only using session IDs where they are an absolute necessity. Here’s a wildcard in a robots.txt file:
User-Agent: *
Disallow: /qca/
Disallow: /form/
Disallow: /search/
Disallow: /candidate_community/
Disallow: /campaign/
Disallow: /*?session_id
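If you want to sanity-check rules like the ones above before deploying them, a rough JavaScript sketch of wildcard matching might look like this. It assumes the crawler treats "*" as “any run of characters” and a trailing "$" as “end of URL”, as Googlebot does, and it ignores Allow lines and rule precedence entirely. The function names are my own.

```javascript
// Turn a Disallow pattern into a regular expression anchored at the start
// of the path. "*" matches any run of characters, a trailing "$" anchors
// the end, and every other character is matched literally.
function patternToRegExp(pattern) {
  const anchoredEnd = pattern.endsWith("$");
  const body = anchoredEnd ? pattern.slice(0, -1) : pattern;
  const source = body
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + source + (anchoredEnd ? "$" : ""));
}

// True when the path (including any query string) matches a Disallow rule.
function isDisallowed(pathAndQuery, disallowPatterns) {
  return disallowPatterns.some((p) => patternToRegExp(p).test(pathAndQuery));
}

// The Disallow lines from the robots.txt above.
const rules = [
  "/qca/", "/form/", "/search/",
  "/candidate_community/", "/campaign/",
  "/*?session_id",
];

console.log(isDisallowed("/test-url-devon-40305/?session_id={C355FEB0}", rules));
console.log(isDisallowed("/test-url-devon-40305/", rules));
```

Note that the path passed in starts at the first slash after the domain, query string included, which is what the rules are matched against.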
This example robots.txt would disallow:
http://www.example.com/test-url-devon-40305/?session_id={C355FEB0-4043-4FE9-A07D-D788E441EFDE}
but allow:
http://www.example.com/test-url-devon-40305/
I hope this post provides a little more insight into one of the most important subjects in site architecture for SEO. I personally hope that duplicate content issues never go away, because fixing them can be extremely satisfying.