Canonicalization is the process by which search engines determine the best, or "canonical", version of a single page that can be retrieved through more than one URL, or of several different URLs that return essentially identical content. A long time ago,
server software makers, ISPs and Web-hosting companies started to configure
their servers in a manner that was meant to be a convenience for their users, but
which can turn into a significant problem for search engine rankings. These well-intentioned
people decided they would allow websites to be accessed with or without
the very common "www" subdomain prefix in the URL. This was often a very
handy contrivance for budding webmasters and new Internet users who
would often omit the three-character prefix when they typed in a website
address in their browser. But it has come to present a particular problem in
Google. Pages indexed under both their HTTP and HTTPS versions are another common symptom of canonicalization problems. Canonicalization is an issue for every search engine whenever duplicate content and similar situations arise. This SEO Tip will show you how to specify your canonical URLs.
Whenever the search engines discover the same (duplicate) content available from more than one URL, they will attempt to choose the best (or "canonical") version and filter out the duplicates. If a webmaster does not take steps to monitor and correct this issue, it can impair the rankings of the website. Often it's just a minor issue that resolves itself over time, but it can also be a significant and long-lasting drag on your rankings.
Let's start with the canonicalization issue that is almost exclusively limited to Google. By the strictest definition, the two URLs "http://example.com" and "http://www.example.com" are separate and distinct entities. The first points to the bare (root) domain, and the second points to a host named "www" under that domain. In the earliest days of the World Wide Web, websites were conventionally served from a host named with the standard abbreviation "www", and so the common practice of beginning a website's URL with that prefix was born. But as the Internet became more popular, and webmasters and IT managers made allowances for the less techno-savvy in the population, various shorthand methods crept into usage. The one we deal with here is making the www prefix optional. I'm sure it seemed a natural thing to do. When referring to a website by its URL, the "www" part is frequently omitted both in speech and in writing, so it was only logical that users would take the same shortcut when they went online. So, rather than frustrate those users needlessly, servers were configured to allow either version to retrieve the same content. Users were happy, IT managers were happy, and webmasters were happy.

But, being products of computer-based logic, search engine algorithms do not automatically treat these two URLs as one and the same. Google has remained particularly stubborn about this issue, despite overwhelming evidence of the problems it causes. It should be noted, however, that Google will eventually resolve the situation on its own and choose what it determines to be the canonical version of your domain name. Of course, when it's your site that develops this problem, "eventually" can seem like a very, very long time.
Google now provides a method for webmasters to specify a canonical or "Preferred Domain" in their Google Webmaster Tools that will help with "www" canonicalization. But this tool is only effective for Google, and you should still install the 301 redirect if at all possible. If you can't install a 301 redirect, there are other solutions, such as the newer <link rel="canonical"> tag (described below), which is supported by all of the major search engines, or (as a last resort) a <meta> refresh tag.
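For example, a last-resort meta refresh placed in the <head> of the non-canonical home page might look like this (the domain name is a placeholder for your own canonical URL):

    <!-- Immediate (0-second) refresh to the canonical www version of the home page. -->
    <!-- Search engines generally treat an immediate meta refresh as a redirect signal, -->
    <!-- but a true server-side 301 remains the stronger fix. -->
    <meta http-equiv="refresh" content="0; url=http://www.yoursite.com/">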
Another source of canonicalization problems is when pages from your site are indexed under both the normal HTTP protocol ("http://www.example.com") and the HTTPS secure protocol ("https://www.example.com"). This problem can arise in many ways, but it's usually caused by a link to a page on your site (either an internal or external link) that uses the HTTPS protocol, where that page contains relative links to other pages on your site. Since relative links do not include the "http://" protocol header or your domain name, the search engines see each one as another HTTPS URL, and they will attempt to crawl and index every page they discover along this path. Naturally, the problem can cascade through a site quite quickly and can cause trouble until the search engines' normal canonicalization processes resolve it or you take steps to correct it.
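To illustrate (with a placeholder domain), suppose a crawler fetches a page over HTTPS and that page contains a relative link:

    <!-- Page fetched as https://www.yoursite.com/products.html -->
    <a href="widgets.html">Widgets</a>
    <!-- Because the link is relative, it inherits the current protocol and resolves to -->
    <!-- https://www.yoursite.com/widgets.html, so the crawler keeps discovering HTTPS URLs. -->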
Why do you need to care? The problem is two-fold. First, there is the issue of link popularity. Google's vaunted PageRank system depends on links, and it will not automatically canonicalize (i.e., treat as identical) the version of a URL that omits the www prefix and the version that includes it. This often means lower rankings for most searches than the site actually deserves. Second, and frequently as a result of the first, Google won't deep crawl one version of the URL or the other, based on either (a) the reduced link popularity/PageRank or (b) duplicate content issues. Having the same content available from more than one URL runs afoul of the guidelines of all the major search engines, and this www issue is one of the most common causes of canonicalization problems in Google. Your site doesn't get penalized for duplicate content unless the search engines determine it to be purposeful or malicious, but it can impair the site's rankings. Fewer pages indexed for a site means that, once again, one version of the URL is not receiving full link popularity credit for its own internal links.
So the problem compounds itself for a time, and it can be especially debilitating to sites that weren't all that strong to begin with. Sadly, webmasters are often partially responsible for this problem because, knowing they can "get away with it", they will use the shorthand version when submitting their site to directories or posting links on webpages of their own design. Once this genie is out of the bottle, it's a long battle to overcome, because even if you are able to find every incorrect link on your own site, all it takes is a malformed link on an obscure page somewhere on the web to keep this demon haunting you for a very long time. Fortunately, there is a solution.
The best solution is to use server control methods to automatically redirect requests to the proper URL. The server must return a "301 Moved Permanently" result code in order for the search engines to properly assign the link popularity and to update their internal records of the page's true URL.
Websites running on hosts that use the Apache server software usually have it the easiest in this regard, because they can control this problem on their own using the .htaccess control file. Just create a simple text file named ".htaccess" with no filename extension, and insert a short set of rewrite directives.
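A typical example, assuming your host has Apache's mod_rewrite module enabled ("yoursite.com" is a placeholder), looks like this:

    # Redirect the bare (non-www) hostname to the www version with a 301 (permanent) redirect.
    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^yoursite\.com$ [NC]
    RewriteRule ^(.*)$ http://www.yoursite.com/$1 [R=301,L]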
Simply replace "yoursite.com" in the above code with
your website's domain name. Websites based on Microsoft's IIS Server Software will
need to consult their system administrator for help. Again, be sure the server
returns the redirecting result code #301 or you're only papering over the problem
and not repairing it. A code 302 result is not acceptable because 302 means
"Moved Temporarily" and doesn't repair canonicalization problems.
You can check the code your site returns with my Server Header Checker.
The simplest and best method of preventing HTTPS canonicalization issues is to use complete URLs in your website's internal links. Specifying the correct URLs in your own links is the strongest possible signal that you can send to the search engines as to which version should be selected as the canonical. I'll admit that this is a case of "do as I say, not as I do", since I'm as guilty as anyone of using relative links for my own convenience when designing websites. But I'm changing my ways as quickly as I can.
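For instance (with a placeholder domain), an internal link written with the full canonical URL leaves no room for ambiguity:

    <!-- The fully-qualified form always points at the one canonical URL, -->
    <!-- no matter which protocol or hostname the page was fetched with. -->
    <a href="http://www.yoursite.com/widgets.html">Widgets</a>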
Repairing HTTP or HTTPS links that have already been indexed requires much the same process as repairing the www issue. You can use a rel="canonical" tag whenever necessary and practical, of course, and you can install 301 redirects for a more comprehensive fix. You'll find more information in my SEO tip: Repairing HTTPS issues.
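As a sketch (assuming Apache with mod_rewrite, and assuming you have chosen the plain HTTP version as your canonical; reverse the logic if HTTPS is the version you want treated as canonical), the redirect could look like this:

    # Send any HTTPS request to the HTTP version of the same page with a 301 redirect.
    RewriteEngine On
    RewriteCond %{HTTPS} on
    RewriteRule ^(.*)$ http://www.yoursite.com/$1 [R=301,L]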
Many webmasters don't have access to server redirect tools like Apache's .htaccess file, so they can't install conventional redirects to solve canonicalization problems. Fortunately, there is a simple alternative.
In February 2009, the major search engines gave all webmasters a powerful and easy-to-use method of preventing and repairing canonicalization problems. The four largest search engines (Google, Yahoo!, MSN, and Ask.com) all agreed to support a new canonicalization attribute for the <link> tag that goes in the <head> section of your HTML documents. The syntax is as follows (shown here with a placeholder URL):
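    <!-- Placed in the <head> of each duplicate page; the href should point at -->
    <!-- the one URL you want the search engines to treat as canonical. -->
    <link rel="canonical" href="http://www.yoursite.com/product.html">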
This tag is used as "a very strong hint" when the search engines determine the canonical version of a URL, and it is treated almost exactly like a 301 redirect for that purpose. However, it is important to remember that the rel="canonical" tag is a page-level setting and will not affect how the other pages on your site are indexed. For more information, see the Google Webmaster Blog post, Specify Your Canonical, and Matt Cutts' article, Learn About The Canonical Link Element in 5 Minutes. Both are well worth reading, but Matt Cutts really explains the impact on rankings and offers ideas for when it's appropriate to take this action.
Website owners who operate multiple websites for a single company or organization face the question of how best to deal with duplicate content on pages, such as contact details or terms of use, that are common to all of their websites. In general, there is no reason to worry about duplicate content on these pages, since they rarely need to rank well. Simply provide a clear navigation path for users who are looking for such information in the normal design of your website, and let your other pages carry the burden of ranking. But if you do have such a page that you want to rank well, you can use rel="canonical" to point to a URL on a different domain, as shown in the example below.
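A minimal sketch, with placeholder domains: the tag goes in the <head> of the duplicate page on one site and points at the preferred copy on the other site.

    <!-- Placed in the <head> of http://www.second-site.com/terms.html, telling the -->
    <!-- search engines that the copy on the first domain is the preferred version. -->
    <link rel="canonical" href="http://www.first-site.com/terms.html">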
The www issue is only one place where canonicalization problems occur. Anytime the search engines encounter a page that is essentially identical to another page, they will try to select the best, or "canonical", version and filter any duplicates from their index. As with the www issue, this can hurt your site's performance in the search engines. The proliferation of blogs and other content management systems has brought canonicalization problems to many websites, because those programs routinely create multiple URLs that point to the same content. The search engines are becoming more adept at detecting and dealing with the most common canonicalization problems in blogs and forums, but it's up to the individual webmaster to take steps to prevent the problem from arising in the first place. Fortunately, most blog platforms are supported by a community of talented programmers who have created add-ons that can reduce the number of canonicalization problems.
Ecommerce websites have their own problems with canonicalization. Many shopping cart programs require users to accept cookies in their browser, and if they don't, the software appends what are called "session IDs" to every link. Since search engine crawlers don't accept cookies, they have traditionally avoided crawling any URL that includes a session ID or other user-identification value. Ecommerce sites can also create canonicalization issues when they offer features like sorting lists of products by price, color, or size. The search engines see these pages as containing nearly identical content and suppress them. Fortunately, two of the major search engines, Google and Bing, now provide tools for webmasters to manage these problems involving dynamic URLs. Naturally, you need to register and verify your site in order to use these tools. Assuming you've already done so, here's how they work:
In Google's Webmaster Tools console, you can tell Google to ignore particular query-string parameters, such as session IDs. Click on "Site Configuration", then "Settings", and you'll see a section titled "Parameter Handling". Click on "Adjust parameter settings", and you'll see a text box labeled "Parameter name". Enter the name your site gives to the session ID parameter (for example, osCommerce uses "osCsid"), then choose "Ignore" from the drop-down menu titled "Action". Soon, Google will filter that parameter out of the URLs for your site and will start to properly index any URLs that would have caused a problem in the past.
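For illustration, with a placeholder domain and an osCommerce-style session parameter, the setting tells Google to treat these two URLs as the same page:

    http://www.yoursite.com/product_info.php?products_id=42&osCsid=1a2b3c4d5e6f7a8b
    http://www.yoursite.com/product_info.php?products_id=42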
Not to be outdone, Bing's Webmaster Tools will also let you select query string parameters (called "URL Normalization") to filter or ignore in the "Crawl Settings" tab.
This SEO Tip by Rainbo Design was last updated on September 25, 2020
In writing these SEO tips, I'm often reminded of a pearl of wisdom that my high school computer programming teacher passed on from one of his teachers, "Computers are high-speed idiots!" Remember that, and don't let them get under your skin.