Among the canonicalization problems that can harm a website's rankings
is when Google or one of the other search engines begins to index your pages with HTTPS,
the secure protocol, instead of the normal HTTP. This can happen when Google discovers a link
to your site that includes the HTTPS prefix in the URL, whether that link resides on your own
site or was posted by someone else, either accidentally or maliciously. When this happens, it can lead to duplicate content
issues because Google will usually see identical content on your site with both versions
of the URL. This article discusses the causes and the cure.
This problem begins when Google encounters a link to your site using HTTPS. This most often occurs on e-commerce sites, but any site that deals with private user information may well protect portions of its site with the SSL (Secure Sockets Layer) encryption employed via HTTPS. Ordinarily, a webmaster will block the search engines from accessing these protected areas through an instruction in the site's robots.txt file. But if you don't use robots.txt, or if the instruction is poorly crafted, the gates are open for the search engines to crawl those pages as they would any others.
Once Google or another search engine crawls a page with HTTPS, it can begin to crawl the rest of the site with HTTPS if you rely on relative links on your pages. A relative link is one that uses a shorthand version of the URL in the "href" attribute of the <a> (anchor) tag, and does not include either the protocol ('HTTP' or 'HTTPS') or the domain name ('www.example.com'). When constructing the complete URL for such a link, both search engines and browsers use the same protocol they used to access the page where the link resides. So, if your site relies heavily on relative links and does not take steps to prevent search engines from indexing pages with HTTPS, the problem can cascade through your entire website and disrupt your site's performance in the rankings. Here are some steps you can take to prevent this problem from hurting your website:
Start by creating a good robots.txt file. It's a good idea to
limit HTTPS access to specific directories within your site so that you can control
when and where HTTPS is used. Then you can include an instruction in your robots.txt
file to block the search engines from crawling those directories with something like:
User-agent: *
Disallow: /directory/
(The "User-agent: *" line is required to open a robots.txt group; the asterisk applies the rule that follows to all crawlers.)
There's a tool in Google's
Webmaster Tools console that will let you test your robots.txt file to make sure
that you are properly blocking all of the pages within the directories you want to protect
with HTTPS. Naturally, doing this falls into the "ounce of prevention" category. You'll
need to take further steps if some of your pages are already improperly indexed in the
search engines.
Use a robots <META> Tag on All Pages Using HTTPS. Using a
robots <META> tag on pages designed to be accessed with HTTPS will go a long way
toward preventing this problem. Simply add:
<meta name="robots" content="noindex,nofollow">
to the <head> section of each of these pages. This prevents the search engines
both from indexing the page where this tag resides and from following any links on that page.
If any of your pages designed for HTTPS access have already been indexed,
be sure to add this <META> tag to all such pages and then temporarily remove the
blocking instruction from your robots.txt file. This will allow the search engines to
see this <META> tag, which will cause them to remove the page from the index. Once
the pages have been removed, you can restore the blocking instruction in your robots.txt file.
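To recap the placement (the page title here is only a placeholder), the tag sits inside the <head> section of each protected page, alongside your other <head> elements:
<head>
<title>Secure Checkout</title>
<meta name="robots" content="noindex,nofollow">
</head>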
Use the rel="canonical" Tag. The rel="canonical" tag tells the search engines the correct URL for a page. It's always a good idea to add this tag to your site's main page to prevent the common canonicalization problems with the "www." prefix, but it will also serve to prevent it from being indexed with HTTPS as well. You can use this tag in many situations where a page might be accessed with different URLs, and you can also use it when a page has already been improperly indexed with HTTPS. For details on the rel="canonical" tag, see Google's article Specify Your Canonical. This is both an "ounce of prevention" and a "pound of cure" that's easy to implement and does the job pretty well in a single step. You'll find another step you can take to reinforce this setting later in this article.
Use Complete URLs in Your Internal Links. Web designers like to use relative URLs when they create webpages because it often simplifies testing page layouts on their own computers before uploading them to the server. But, as we've seen, this can lead to search engines following improper paths through the site once they've latched on to a link that resolves to a URL starting with "https://". Get in the habit of using complete URLs and you'll be doing your site a big favor.
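For example (the page in this sketch is made up), a relative link like:
<a href="/products/widget.html">Our Widgets</a>
resolves to https://www.example.com/products/widget.html whenever the page containing it was fetched over HTTPS. Spelling out the complete URL removes that ambiguity:
<a href="http://www.example.com/products/widget.html">Our Widgets</a>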
Use a Special robots.txt File For HTTPS. You can serve a special robots.txt file
when your server receives a request for /robots.txt using HTTPS. If your server uses Apache server
software, you can add an instruction near the top of your .htaccess file to handle this (if that file doesn't already contain a "RewriteEngine On" line, add one above these rules), such as:
RewriteCond %{HTTPS} ^on$
RewriteCond %{REQUEST_URI} ^/robots.txt$
RewriteRule ^(.*)$ /robots_https.txt [L]
You may need a different instruction, depending on your server environment. If the above
example doesn't work for you, try:
RewriteCond %{SERVER_PORT} !80
RewriteCond %{REQUEST_URI} ^/robots.txt$
RewriteRule ^(.*)$ /robots_https.txt [L]
This instruction should be placed before any other redirects in your .htaccess file so that it will be processed first.
Next, create a special robots.txt file using a different file name. In my example, I use "robots_https.txt".
Modify the "RewriteRule" in the .htaccess code above to use whatever file name you choose. Then, create a
new text file using that file name, and fill it with:
User-agent: *
Disallow: /
The combination of the .htaccess settings and the special robots.txt file will block the search engines from using HTTPS
for any URL on your site. If your server uses Microsoft IIS software, contact your hosting service for advice on
implementing this.
Redirect HTTPS Requests For Normal Pages. If some of your pages have already
been improperly indexed with HTTPS, it's a good idea to set up 301 redirects for those pages and
unblock them in your special robots.txt file (if any) so that the search engines can try to re-crawl
those pages and discover the new redirect. A sample .htaccess instruction for this would be:
RewriteCond %{HTTPS} ^on$
RewriteCond %{REQUEST_URI} !^/https-allowed-directory/(.*)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
or, again, if your server uses a different HTTPS indicator variable:
RewriteCond %{SERVER_PORT} !80
RewriteCond %{REQUEST_URI} !^/https-allowed-directory/(.*)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
Note that this example allows HTTPS access to one directory (such as the "admin" directory for a blog or an e-commerce website). To allow other directories that depend on HTTPS access, add another "RewriteCond %{REQUEST_URI}" exclusion line to the same instruction rather than repeating the whole block. Once the search engines have seen this redirect a few times, you should go ahead and restore the blocking instructions in your special robots.txt file. Letting the search engines see the 301 response from URLs that have been indexed with HTTPS will effectively remove them from the index.
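If you want to confirm that a redirect is working before the search engines re-crawl, you can check the response headers yourself from a command line. The curl tool is one easy way to do this (the URL below is just a placeholder; add the -k option if your test certificate isn't trusted):
curl -I https://www.example.com/old-indexed-page.html
A response whose status line shows "301 Moved Permanently" and that includes a "Location:" header pointing to the http:// version of the URL confirms that the redirect is in place.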
Google has announced that it gives a slight ranking boost to sites whose pages are served using HTTPS. This will certainly cause many webmasters to make the switch, but you need to use the same care described here when switching protocols TO HTTPS as when you want to REMOVE HTTPS. That means using the rel="canonical" tag on your pages, carefully crafting 301 redirects, and updating your robots.txt file to make sure that your sensitive pages are never indexed.
When you make that switch, your .htaccess file should include one of the two following instructions to send a 301 redirect from HTTP to HTTPS. These are only sketches that follow the same pattern as the earlier examples, so substitute your own domain name:
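RewriteCond %{HTTPS} !^on$
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
or, again, if your server uses a different HTTPS indicator variable:
RewriteCond %{SERVER_PORT} ^80$
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
As with the earlier examples, place this instruction before any other redirects in your .htaccess file.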
When you change your website to use HTTPS, it's important to notify the search engines directly about the change through the Google Webmaster Tools console and Bing's Webmaster Tools to speed up their indexing of your new URLs. See my article on Changing Your URLs for more information.
In summary, removing pages that have been improperly indexed with HTTPS requires a bit of effort. The rel="canonical" tag is the easiest way to get pages that were indexed with HTTPS removed from the search engines, but it can take a long time for them to resolve the situation. Always using the robots <META> tag set to "noindex" on pages that you never want indexed will go a long way toward preventing the problem as well. And serving a special robots.txt file is an added layer of prevention that will, in time, also repair the problem. The ultimate sledge-hammer approach is to install 301 redirects for directories or individual pages that have been improperly indexed.
These steps will help reduce the risk of your site developing duplicate content or canonicalization problems, and can also remove pages from the search engines' indexes that have been improperly indexed with "https://". A critical part of your "pound of cure" is letting the search engines see the new status of the badly indexed URLs; they won't remove those URLs from their indexes until they do.
This SEO Tip was last updated on September 25, 2020