Using Google for Duplicate Content Detection


A month or so ago I was looking at a camping equipment website called outdoorpros.com. I love this site and would recommend it to anyone. Being an SEO, however, I couldn’t help but notice that they were using some suspicious-looking paginated links on their category pages, so after getting all excited about my new camping stove I decided to take a quick look at their Google site index to see how search engines might be indexing the site.

This post covers some basic tips on “site diagnostics”, specifically duplicate content detection using Google search. These are checks that every SEO should do as part of investigating potential issues that may negatively impact search engine positioning.

Here’s the approach I always follow, using outdoorpros.com as an example site:

1) Use your common sense

Let’s start by running a site:www.outdoorpros.com query in Google search.

As you can see from the screen grab, Google is reporting 72,100 indexed pages. Is that too many? If so you may have some kind of duplicate content issue.
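One way to make the "is that too many?" question concrete is to compare the reported count with the number of pages you expect the site to have. The following is a minimal sketch, assuming you know roughly how many products and categories exist. The expected figure of 10,000 and the threshold of 2.0 are made-up illustrations, not numbers from outdoorpros.com.

```python
def index_ratio(indexed_pages: int, expected_pages: int) -> float:
    """Indexed URLs per page you believe the site really has."""
    if expected_pages <= 0:
        raise ValueError("expected_pages must be positive")
    return indexed_pages / expected_pages


def looks_bloated(indexed_pages: int, expected_pages: int, threshold: float = 2.0) -> bool:
    """True when the index holds more than `threshold` URLs per expected page.

    The default threshold is an arbitrary assumption; tune it per site.
    """
    return index_ratio(indexed_pages, expected_pages) > threshold


# 72,100 is the count Google reported; 10,000 is a hypothetical inventory size.
print(index_ratio(72_100, 10_000))
print(looks_bloated(72_100, 10_000))
```

A ratio well above 1 does not prove duplication on its own, but it is a reasonable cue to keep digging.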

2) Skip around the index and see if you spot something weird

Ok, not terribly technical advice, but it doesn’t have to be.

Click through to around page 10 and take a quick look at the indexed URLs. If you don’t see anything weird, skip ahead another 10 pages. Go as far to the back of the index as you possibly can, because that’s where the good bad stuff usually hides. You’re looking out for malformed URLs, query strings (like ?=sessionid or ?first_page etc) or many repeated results with the same title / description.
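If you have copied a batch of results into a list by hand, the same eyeball checks can be roughed out in code. This is only a sketch: the `(url, title)` pairs are assumed to be collected manually from the results pages, the titles below are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def urls_with_query(urls):
    """Return the URLs that carry a query string (anything after '?')."""
    return [u for u in urls if urlsplit(u).query]


def repeated_titles(results, min_count=2):
    """Map each title seen at least `min_count` times to its count.

    `results` is an iterable of (url, title) pairs.
    """
    counts = Counter(title for _, title in results)
    return {title: n for title, n in counts.items() if n >= min_count}


# Hand-collected sample; the titles are made up for this example.
results = [
    ("http://www.outdoorpros.com/Cat/Pants/5/List", "Pants"),
    ("http://www.outdoorpros.com/Cat/Pants/5/List?first_answer=13", "Pants"),
    ("http://www.outdoorpros.com/Brands/Kershaw/96", "Kershaw"),
]

print(urls_with_query([url for url, _ in results]))
print(repeated_titles(results))
```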

In the case of our friends at outdoorpros.com you can see straight away that something doesn’t look right.

That set of results tells me a lot about this site, and I’ve only been looking at it for 30 seconds. We’ve identified some query strings in the index. They might be causing duplicate content. How do we confirm that though?

3) Assessing if there really is a problem on individual page types

Take one of the query strings we saw in the index. Let’s use:

?attribute_value_string

Is that indexed string causing a problem? Let’s see. The url was:

http://www.outdoorpros.com/Brands/Kershaw/96?attribute_value_string%7CColor=Pink

It looks like a brand / category page for Kershaw Knives. The first step is to check whether that page is indexed both with and without the query string. Here’s the cached page with a query and without. Whoops. There are at least two copies of this page in the index.

But don’t those pages have different content? Well, yes, in that the products the page links to are different, but the brand category page is the same every time. Each copy of the page has the same meta title and description, so it’s duplicating! That may be why Outdoorpros don’t rank organically for “Kershaw” or “Kershaw knives”.
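The "with and without the query string" comparison can be sketched by stripping the query from each indexed URL and grouping the variants under their base address. This assumes, as in the Kershaw example, that the query string does not change the page's title or description; the helper names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def group_variants(urls):
    """Map each base URL to its indexed variants, keeping only bases with 2+ variants."""
    groups = defaultdict(set)
    for url in urls:
        groups[strip_query(url)].add(url)
    return {base: sorted(variants) for base, variants in groups.items() if len(variants) > 1}


indexed = [
    "http://www.outdoorpros.com/Brands/Kershaw/96",
    "http://www.outdoorpros.com/Brands/Kershaw/96?attribute_value_string%7CColor=Pink",
    "http://www.outdoorpros.com/Cat/Pants/5/List",
]

for base, variants in group_variants(indexed).items():
    print(base, len(variants))
```

Any base URL that shows up with more than one variant is a candidate for the manual cached-page check described above.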

4) Deciding how many of your indexed URLs are duplicated

That’s quite easy. To get a feel for the number of URLs that are duplicating, just do a query like

site:www.outdoorpros.com inurl:attribute_value_string
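If you run this kind of check across several query strings, it can help to assemble the queries in code and open the resulting links in a browser. The sketch below only builds the query text and a search URL; it does not fetch or parse results. The function names are hypothetical, and the `https://www.google.com/search?q=` address is the ordinary search URL as typed into a browser, not an official API.

```python
from urllib.parse import urlencode


def google_query(site, inurl=None, intitle=None):
    """Build a search string from the site:, inurl: and intitle: operators."""
    terms = [f"site:{site}"]
    if inurl:
        terms.append(f"inurl:{inurl}")
    if intitle:
        # Straight double quotes ask for an exact title match.
        terms.append(f'intitle:"{intitle}"')
    return " ".join(terms)


def search_url(query):
    """Return a browser-ready search URL for the given query string."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = google_query("www.outdoorpros.com", inurl="attribute_value_string")
print(query)
print(search_url(query))
```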

Google reports at least 13,000 URLs on this site that contain the query string. Drill down a little by picking a few different titles from indexed pages, such as:

site:www.outdoorpros.com intitle:"Buck Knives – OutdoorPros.com"

There are 65 pages with that exact <title>. Doh!

5) How do I fix this?!

Ok, first of all let me recap on what we’ve done so far. We’ve used a basic site: command and taken a common sense snapshot of how many pages there are in the index. When you’re an e-commerce site with 100,000 indexed pages and only 5,000 products, you might need to think about it.

Next, we drilled down by checking Google’s index in random positions to see if there was anything that didn’t look right. Something was definitely wrong. By carrying out a query that told us how many instances of the query string were present, we had a total number of indexed pages using that string. Finally, we picked a specific page <title> and found 65 instances of the same page.

There is a solution, and sadly just nofollowing paginated links won’t work. The damage has been done: you have some indexed URLs and some housekeeping to do. I’m going to offer some advice in this post, but I’m going to cover fixing duplicate content issues in my next post soon. Add my RSS feed to get that post when it’s done.

In the meantime, my best advice to outdoorpros.com is that they need to create a list of all of the query strings that describe paginated pages and set up a rule to noindex,follow anything above the value of the first page.

Here’s my example. Let’s look at their pants page. It’s a perfectly good pants page and I’ll hear no sniggering at the back of the class, please. The main URL of this page is:

http://www.outdoorpros.com/Cat/Pants/5/List

Check out the paginated navigational links. Each one of them produces a different URL that looks like this:

http://www.outdoorpros.com/Cat/Pants/5/List?first_answer=13

The fix? A simple noindex,follow should be added in the page head whenever that query string is generated.
<html> <head> <title>...

This way, the many versions of the same page will be crawled but not indexed. All links on the page will be followed so the products will still be added to Google’s index. You’ve identified the canonical version of your pants page and Google will be grateful. Job done.
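As a rough illustration of that rule, the sketch below decides whether a page should carry the standard robots meta tag based on its URL. It is not outdoorpros.com’s actual code. `first_answer` is taken from the pants URL above, while the function name and the idea of passing in a list of pagination parameters are assumptions made for this example. It also simplifies the rule to "the parameter is present" rather than comparing against the first page’s value.

```python
from urllib.parse import parse_qs, urlsplit

# Standard robots meta tag: keep the page out of the index, still follow its links.
NOINDEX_FOLLOW = '<meta name="robots" content="noindex,follow">'


def robots_meta(url, pagination_params=("first_answer",)):
    """Return the noindex,follow tag for paginated URLs, or '' for the main page.

    `pagination_params` is the list of query-string names that describe
    paginated pages; it has to be compiled per site.
    """
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if any(name in query for name in pagination_params):
        return NOINDEX_FOLLOW
    return ""


print(repr(robots_meta("http://www.outdoorpros.com/Cat/Pants/5/List")))
print(repr(robots_meta("http://www.outdoorpros.com/Cat/Pants/5/List?first_answer=13")))
```

Whatever the template language, the returned tag belongs inside the page’s head section so that crawlers see it before the body.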
