Google crawls over 60 trillion pages, which amounts to more than 100 million gigabytes of information. Is your website showing up in their index for all the appropriate pages?
There’s no sense in worrying about SEO if your website isn’t even in Google.
How to Find Your Web Pages Indexed by Google
Begin by heading on over to Google and type in:
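The query uses Google's site: search operator, with "yourwebsite" standing in for your own domain:

```
site:yourwebsite.com
```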
Do NOT include any spaces, the http://, or www. Type it exactly as shown above, replacing “yourwebsite” with whatever your domain is.
For this example, I am using Dollar Shave Club. I just received my next shipment of razors from them, so they are top-of-mind.
Dollar Shave Club shows that 3,260 pages are in Google’s index. It is important to note that this number is just an estimate.
What if you don’t see any results?
How to Handle No Results Found
First, don’t panic. Well, maybe panic a little. Then check for typos. After that, it is time to review your robots.txt file.
Unfamiliar with what a robots.txt file is? A robots.txt file tells search engine crawlers which files and directories they may crawl and which they should ignore.
To research further, log in to your Google Search Console. On the left you will see an option for Crawl >> robots.txt Tester.
If you haven’t set up Google Search Console yet, look for this line in your robots.txt file:
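A site-wide block looks like this; the wildcard user-agent applies the Disallow rule to every crawler, and the lone slash blocks every path:

```
User-agent: *
Disallow: /
```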
If you find the site-wide disallow in your file, the fix could be as simple as removing that line.
Next, look at Google Index >> Index Status. Under the Advanced option, check the box for Blocked by Robots. Look for any issues here as well.
Another possible reason for your website not showing up in Google is deceptive SEO practices.
If you suspect deceptive SEO practices, check for any manual actions against your site.
In Google Search Console, check Search Traffic >> Manual Actions.
If your website runs on WordPress, you will want to check one more possibility. Go to your Settings >> Reading section. Make sure that you DO NOT have the “Discourage search engines from indexing this site” option checked, as it prevents web crawlers from accessing your site.
Checking Your Indexed Pages for Missing Pages
For small websites, it isn’t a big deal to check your search results pages for anything missing.
For bigger sites, it isn’t practical to try to check every search results page.
If you only have a handful of critical pages, you could do a search for specific pages.
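For instance, you can scope the site: operator to a single page by including its full path (the URL here is a hypothetical example):

```
site:yourwebsite.com/important-landing-page
```

If the page appears in the results, it is indexed; if the search comes back empty, it isn't.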
That’s a quick solution for a few pages, but you most likely don’t want to check all your important pages this way. You also don’t want to go to extreme measures to check your links.
For bigger sites, I like to use Scrapebox with proxies to prevent IP address blocks or bans. Scrapebox is a paid tool, but it is fast and has many other uses.
If you Google “free index checker” you will find several alternatives.
How to Get Pages Added to the Google Index
For the pages you found that still need to be indexed, there is one more possible issue to check. Visit the web page in question and view the page’s source code.
You’re looking for either of these meta tags:

<meta name="robots" content="noindex">
<meta name="googlebot" content="noindex">

The first one blocks all web crawlers from indexing the page. The second meta tag blocks only Google.
The next thing to check is for duplicate content on the page in question. That duplicate content can come from within your own site or another site. Google doesn’t like to index content that is the same on two different pages.
To check for duplicate content, use Copyscape.
We’re not concerned with a sentence or two of duplicate content. We’re looking for whole pages or massive chunks of a page that have duplicate content.
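Copyscape does the heavy lifting for you, but the underlying idea can be sketched in a few lines of Python: break each page's text into overlapping word n-grams ("shingles") and measure how much of one page reappears on the other. The function names and the 8-word shingle size are my own choices for illustration, not Copyscape's method:

```python
def shingles(text, size=8):
    """Split text into overlapping word n-grams ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(1, len(words) - size + 1))}

def duplication_ratio(page_a, page_b, size=8):
    """Fraction of page_a's shingles that also appear in page_b.

    A value near 1.0 means page_a is almost entirely duplicated on
    page_b; a stray shared sentence barely moves the number.
    """
    a, b = shingles(page_a, size), shingles(page_b, size)
    return len(a & b) / len(a) if a else 0.0
```

A whole-page copy scores 1.0, while two unrelated pages score near 0, which matches the distinction above: large duplicated chunks matter, a shared sentence or two doesn't.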
Once you have eliminated meta tags and duplicate content, go back to Google Search Console.
Under Crawl >> Fetch as Google, add your page’s URL and click Fetch and Render (leaving the Desktop option selected is fine).
Finally, click the link to Submit to Index.
You can submit up to 500 individual URLs within a 30-day period. For the “crawl this URL and its direct links” option, you can submit up to 10 requests within a 30-day period.
Removing Pages from the Google Index
Going back to the Dollar Shave Club example from earlier, Google shows over 3,000 pages indexed. If you look closer at the results, four of the first seven look “off” because they have a user ID in the URL.
If you click the first user ID result, the URL shows that the page is a customer sending a referral to a friend. Neither this page nor any of the other customer referral pages need to be in Google’s index.
So how do we remove pages that we don’t want to be indexed?
- Fix the reason it was indexed to begin with
- Submit a request to Google to remove the page
To prevent a page from being indexed, add a meta noindex tag to it. (Blocking the page in robots.txt stops crawling, but a blocked page can still appear in the index if other sites link to it, so noindex is the more reliable option.)
Submitting a request for Google to remove a page from the index is easy. Return to your Google Search Console and go to Google Index >> Remove URLs.
From here you can submit a new request and see a list of pages removed in the past six months.
Making Sure Google Indexes the Latest Version of Your Website’s Pages
Up to this point, we have discussed how to check, add, and block pages from the Google index. In this section, we’ll discuss how to make sure Google is indexing the most recent version of your pages.
Use an XML sitemap for your site. The XML sitemap lists all the pages on your website. It also notifies search engines when there has been a change to a page or a new page added to the site.
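A minimal sitemap is just a list of `<url>` entries following the sitemaps.org protocol; the lastmod date tells crawlers when a page last changed. The URL and date below are hypothetical placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourwebsite.com/important-landing-page</loc>
    <lastmod>2017-01-15</lastmod>
  </url>
</urlset>
```

Many platforms generate this file for you; WordPress SEO plugins, for example, typically keep one updated automatically.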
Once you have an XML sitemap on your website, you will want to add it to your Google Search Console by going to Crawl >> Sitemaps.
Crawling and Indexing Final Considerations – The Googlebot
Five things to keep in mind with the Googlebot:
- The Googlebot crawling process starts with the previous crawl list. It then checks with the XML sitemap on your site for updates.
- Googlebot will ignore your robots.txt file if it isn’t in the proper location. Make sure it is in the root directory of your server (i.e., yourwebsite.com/robots.txt).
- You can check your server logs to see when the Googlebot crawls your site, and how often. Be sure to use a reverse DNS lookup as the IP addresses used by Googlebot change from time to time.
- If you use URL parameters, and want to avoid duplicate content caused by dynamic URLs, configure the URL parameters section in Google Search Console.
- Feedfetcher is how Google grabs RSS or Atom feeds from your site. Feedfetcher doesn’t follow robots.txt rules, so if you want to prevent it from fetching your feeds, you must configure your server to return an error status, such as a 404, to this crawler.
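The reverse DNS check from the third point above can be sketched in Python. A genuine Googlebot IP reverse-resolves to a hostname ending in googlebot.com or google.com, and that hostname forward-resolves back to the same IP; checking both directions defeats spoofed user agents and spoofed PTR records alike. The function names here are my own:

```python
import socket

def is_googlebot_host(hostname):
    # Genuine Google crawler hostnames end in googlebot.com or google.com
    return hostname.rstrip(".").endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip):
    """Return True if the IP passes the two-way DNS check for Googlebot."""
    try:
        # Step 1: reverse lookup (IP -> hostname)
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not is_googlebot_host(hostname):
        return False
    try:
        # Step 2: forward lookup (hostname -> IPs) must include the original IP
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

You would feed `verify_googlebot` the client IPs pulled from your server logs; anything claiming to be Googlebot that fails the check is an impostor.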
The goal of this article was to outline the significance of the Googlebot crawling your website. When I consult with businesses and hear them say, “We tried SEO, but it didn’t work for us” my immediate first step is to perform the action items outlined in this post.
On several occasions, these businesses jumped right into link building and other SEO strategies but had their entire website blocked in the settings panel of WordPress, as I discussed earlier. You need to do the due diligence before you get to the sexier side of SEO.