
How To: Find Out If Your Site’s URLs Are Being Crawled & Indexed by Google

By on 28th January 2021

Reading Time: 14 minutes

Search Engine Spider

This blog post is in two (large) parts, covering live and staging sites:

Part 1: How To Check if Google has Indexed Your Live Site

Part 2: How To Check If Google has Indexed Your Staging/Test Site


Part 1:

How can I tell if Google has indexed my live site?

There are a few straightforward ways to find out:

Use The site: query operator

Search for your domain on Google as follows:  site:organicdigital.co

If your site is indexed, you will see a list of pages:

Site Query Operator

If no results are returned, then you may have issues:

Site Query Operator with No Results


Note:  on bigger sites, whilst you will see an approximation of how many pages are indexed, you will only be able to actually see around 300 of them in the SERPs.

Check the Coverage Section of Google Search Console

Every website should have a GSC account. It is, in my opinion, the greatest tool a site owner or SEO can use, and it gives a wealth of information about your site’s organic visibility and performance.  If you do not have one, head to the official GSC page. If you do, go to the Coverage section, where you can see a breakdown of:

  • Errors encountered whilst crawling pages
  • Pages that are blocked
  • Valid indexed pages
  • Pages that are excluded
GSC Coverage Report

If your site has issues, these will be reported under “error” or “excluded” – and you can find out the reasons why they are not being included in search such as:

  • Alternate page with proper canonical tag
  • Crawled – currently not indexed
  • Duplicate without user-selected canonical
  • Excluded by ‘noindex’ tag
  • Crawl anomaly
  • Not found (404)

If your site’s pages are not appearing in the “valid” section then you may have issues.

Use the URL Inspect Function In GSC

If some pages are indexed and others are not, you can also use the URL Inspect tool to see if Google is able to crawl and index a specific page, or if there are other issues preventing it from appearing in search. This is in the top menu and will allow you to check one URL at a time:

GSC URL Inspect Tool

If your page is indexed, it will give details as follows:

GSC Indexed Page Data

If not, you get this status which shows when Google has attempted to crawl the page and some insight into why it is not indexed:

GSC Non Indexed Page Data

Why Won’t Google Crawl or Index My Pages?

There are generally two reasons why a page cannot be either crawled or indexed.  These are particularly common when a new site has been launched or migrated, and the settings from the development environment have been carried over.

The robots.txt Disallow Directive

This is where the site, a directory, or a page is blocked from being crawled by the robots.txt file.

Every site should have a robots.txt file. It is used to give directives to search engines as to which sections of your site should and should not be crawled.

If you have one, you will find it in your root directory under the name robots.txt

https://organicdigital.co/robots.txt

The directives that would prevent a site, directory or page being crawled would be as follows:

Disallow: /
Disallow: /directory/
Disallow: /specific_page.html
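
If you would rather test these directives than eyeball them, the following is a minimal Python sketch using the standard library's robots.txt parser. The robots.txt content and the paths are made-up examples, and the function name is my own. Note that Python's parser applies the first matching rule, whereas Google uses the most specific one, so treat the result as a rough check rather than a guarantee of how Googlebot will behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, built from the directives shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
Disallow: /specific_page.html
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if this robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/", "/directory/page.html", "/specific_page.html"):
        url = "https://organicdigital.co" + path
        verdict = "allowed" if is_crawlable(ROBOTS_TXT, url) else "blocked"
        print(path, "->", verdict)
```

Swapping the sample text for "Disallow: /" should report every path as blocked, which is the situation you want to rule out on a live site.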

You can also use Screaming Frog to attempt to crawl your site. If it is unable to do so, you will see the following crawl data:

Screaming Frog Robots Issue

There are many valid reasons for blocking search engines using this directive, but if you see something along the lines of the above, you need to amend these to allow crawling of your site.

How To Amend a Robots.txt File Manually

If you have access to FTP or have a developer on hand, you can manually amend the robots.txt file to remove any directives that are blocking your site from being crawled.

Generally, the following directives will do this:

User-agent: *
Allow: /

How To Amend a Robots.txt File in WordPress

If you have the Yoast plugin installed, you can edit your file directly via the Tools -> File Editor Section – follow this link for instructions on how to do this.

Yoast robots.txt Editor

How To Amend a Robots.txt File in Magento

Go to Content -> Design -> Configuration, click into your relevant Store View and edit “Search Engine Robots”

Magento Robots Settings

The Robots Meta Tag is Set to Noindex and/or Nofollow

In addition to the robots.txt file, you can also check the robots meta tag within your site’s source code and ensure it’s not preventing search engines from indexing your pages or following their links.

Check your source code: if you do not see a robots meta tag, or it is set to “index” or “index,follow”, then this isn’t the issue.   However, if it says “noindex”, your page can be crawled but will not be indexed:

Noindex Tag In Source Code
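
The same check can be scripted. Below is a rough Python sketch that pulls the robots meta tag out of a page's HTML and reports whether it would block indexing. The class and function names are my own, and it only looks at the meta tag in the HTML you pass in (it assumes you have already fetched the source). It treats "none" the same as "noindex, nofollow", which is how Google documents that value.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for part in attributes.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def is_indexable(html: str) -> bool:
    """No robots meta tag, or one without noindex/none, counts as indexable."""
    return not ({"noindex", "none"} & robots_directives(html))


if __name__ == "__main__":
    print(is_indexable('<meta name="robots" content="index, follow">'))
    print(is_indexable('<meta name="robots" content="noindex" />'))
```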

Again, you can use Screaming Frog to check the status of the robots tags on your site.   If the tag is set to noindex,nofollow, the crawl won’t get beyond the home page:

Screaming Frog Robots Noindex/Nofollow Issue

If it is just set to noindex, the whole site can still be crawled but not indexed:

Screaming Frog Robots Noindex/Nofollow Issue

How To Amend the Robots Meta Tag Manually

Again, access your site’s page/template directly and replace/add the following tag:

<meta name="robots" content="index, follow">

How To Amend the Robots Meta Tag in WordPress

There are two ways to do this. If the issue is site wide, then go to Settings -> Reading and ensure that “Discourage search engines from indexing this site” is not ticked:

Wordpress Noindex Site Setting

I might be wrong, but I think the only way a specific page or post can be set to index or noindex is if you are using Yoast, so go to the page/post and check the following setting at the foot of the page:

Yoast NoIndex Setting

How To Amend Robots Meta Tag in Magento

As before, go to Content -> Design -> Configuration, click into your relevant Store View and amend the “Default Robots” drop down option:

Robots Meta in Magento

My Site / Pages Can Be Crawled and Indexed by Google – What Next?

Once you are satisfied that your robots.txt file and robots meta tag are correct, you can again use the Inspect URL tool to check your page and request that Google crawls and indexes your page:

GSC Request Indexing

I Also Have a Bing Webmaster Account!

Do you? I thought I was the only one. Ok, you can do pretty much all the same things written in this article in Bing Webmaster Tools as you can in GSC – so inspect the URL and Request Indexing:

Bing Request Indexing

I’ve Done All This and My Site / Pages Still Aren’t Indexed!

In which case, you need a deeper delve into the configuration and functionality of your website to identify what other issues there might be.   I can help you with this if you fill in the contact form below.


Part 2:

Someone Who Has Just Realised Their Test Site Is Indexed

How To: Check If Your Staging Site Is Indexed By Google

Only three things are certain in life: death, taxes and your test site getting indexed by Google.  

Very rarely do you come across a new site launch without at some point realising the staging server has been left open for bots to crawl and index.

It’s not necessarily the end of the world if a search engine were to index a test site as it’s fairly easy to resolve – but if you are running a test environment long term to develop new functionality alongside a live site, then you need to ensure it is protected correctly as early as possible to avoid duplicate content issues, and to ensure real life humans don’t visit and interact (i.e. try to buy something).

I am formerly a developer, and probably made these mistakes myself more than once, but back then I didn’t have an SEO being a pain in my arse all the time pointing these things out (back then, old school brochure-cum-web designers who didn’t understand the limitations of tables and inline CSS were the pain in my arse).

The following techniques are all tried and tested methods that I’ve used to identify these issues in the wild, though to protect the identity of my clients and their developers, I’ve taken the selfless decision to set up a couple of test sites using my own website content in order to illustrate what you need to do, those being:

test.organicdigital.co
alitis.co.uk

Though by the time you read this, I will have followed my own advice and taken these down. I need all the visibility I can get, and the last thing I need is indexed test sites holding me back.

1) Google Search Console (GSC) Domain Property

One of the great things about the new GSC is that you can set up domain properties, which give you key insights across all subdomains associated with your website, on both HTTP and HTTPS.   To set this up, simply select the domain option when adding a property (you also need to carry out the potentially not so straightforward task of adding a TXT record to your domain’s DNS):

GSC Domain Property

There are a whole host of reasons why a domain property is useful. In this case, it’s because if you have your test site set up on a subdomain and it’s generating impressions and clicks in search, you can spot this from within the “Performance” section by filtering or ordering your pages:

GSC Performance Data

In addition, you should also check the “coverage” section – in some cases, Google will index your content:

GSC Indexed Data

In other cases, they will spot that you have duplicate content in place and kindly refrain from indexing it, in which case you would find it within the section “Duplicate, Google chose different canonical than user”:

GSC Different Canonical

Even if this is the case, you should still endeavour to ensure it’s not crawled moving forward.

2) Check Google SERPs Using Link Clump

If you don’t have access to GSC domain properties, or any access to GSC (if not, why not?) then you can check the SERPs to see if any test URLs have made their way into the index.  

This is also a handy technique when pitching for new business, what better way to win over a potential client than to make their internal or external development team look like they are dicing with search visibility death by allowing this to happen in the first place, and that you’re here to save the day.

The steps are as follows:

i) Install the Link Clump Google Chrome Extension, which allows you to copy and paste multiple URLs from a page to somewhere more useful like Excel.

ii) Amend your Link Clump settings as follows:

Link Clump Settings

The most important one to note is the Action “copied to clipboard” – the last thing you want to happen here is to open up to a hundred URLs at once.

iii) Go to your favourite (or local) Google TLD, click “settings” which you should see at the bottom right of the page, and select “search settings” where you can set your “results per page” to 100.

iv) Return to the Google home page and use the “site:” query operator and append your domain.  If you use www or similar, remove this, so the command would be as follows:

site:organicdigital.co
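
If you check a lot of domains, building the search link by hand gets tedious. Here is a small Python sketch that strips a leading www. and builds the site: search URL. The function name and the default Google hostname are my own choices, so swap in your local Google TLD as needed.

```python
from urllib.parse import urlencode


def site_query_url(domain: str, google_host: str = "www.google.com") -> str:
    """Build a Google search URL for the site: operator on the bare domain."""
    domain = domain.strip().lower()
    if domain.startswith("www."):
        # Drop www. so results cover every subdomain, not just the www one.
        domain = domain[len("www."):]
    return f"https://{google_host}/search?" + urlencode({"q": f"site:{domain}"})


print(site_query_url("www.organicdigital.co"))
# prints https://www.google.com/search?q=site%3Aorganicdigital.co
```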

You will be presented with a sample of up to 300 URLs currently indexed by Google across all the subdomains.   Whilst you could manually review each result to spot rogue sites:

Test Site in SERPs

I find it far quicker and easier to right click and drag all the way to the bottom of the page.  You will know if Link Clump is working as you will see the following occur to denote links are being selected and copied:

Link Clump In Action
URLs in Excel

Repeat this across SERPs 2 and 3 if available, and once all URLs are pasted into Excel, use sort by A-Z to easily identify your indexed content across all relevant sub domains.
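
If you would rather not sort by hand, the copied URLs can be grouped by hostname with a few lines of Python. This is only a sketch: the function name is mine, and the sample URLs (including the /blog/ path) are made up for illustration. Any hostname in the output that you don't recognise is a candidate rogue test site.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_hostname(urls):
    """Group URLs by hostname so unexpected subdomains stand out."""
    groups = defaultdict(list)
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines from the clipboard
        hostname = urlsplit(url).hostname or "(no hostname)"
        groups[hostname].append(url)
    # Sort hostnames, and the URLs within each, for easy scanning.
    return {host: sorted(items) for host, items in sorted(groups.items())}


if __name__ == "__main__":
    # Hypothetical clipboard contents pasted from Link Clump.
    copied = """
    https://organicdigital.co/
    https://test.organicdigital.co/
    https://organicdigital.co/blog/
    """.splitlines()
    for host, items in group_by_hostname(copied).items():
        print(host, len(items))
```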

3) Search For Text Unique To Your Site

The above methods work if your test site is hosted on a subdomain on the same domain as your live website.  However, if your test site is located elsewhere, e.g. test.webdevcompany.com, then they won’t work.  In which case, this or the following methods might.

Find some content you believe is unique to your website – in my case I’ve gone with the strapline of: “Enhance Your Website’s Organic Visibility And Traffic” – then search for this within quotation marks.   If a test site containing this content has been indexed, this search should reveal it:

Test Sites In SERPs Again

As you can see, the home pages on the main site, test sub domain and separate test domain all appear.  You may also inadvertently spot a competitor who has ripped off your content.   Some would take that as a compliment, others would issue DMCAs – it’s up to you, but the last thing you want is someone outranking you with your own copy.

4) Crawl The Site Using Screaming Frog

I presume you’re into SEO and therefore use Screaming Frog. If either of those answers is no, then well done for making it this far into this article (let me guess, you’re a developer who has dropped a bollock and are looking to cover your arse before anyone else finds out?).

If you don’t have it, download it here.

Within the Basic Settings, tick “Crawl All Subdomains”.  You can also tick “Follow Internal ‘nofollow’” as some test environments may have this in place.

Once the crawl is complete, peruse the list to see if there are any internal links in place to test sites.  I came across this recently where a new Drupal site had gone live but with all internal links within the blog posts pointing to a beta subdomain:

Screaming Frog Crawl

You can then click on each test URL and click on InLinks at the bottom to find the offending internal link from the live to test site.  In this case, I amended the Contact Us link on the sitemap to point to the test URL:

Screaming Frog Internal Links

Once spotted, amend and re-crawl until there are no more internal links taking visitors elsewhere.  If you are using WordPress, use a search/replace plugin to find all test URLs and replace them with the live ones.

5) Check Google Analytics Hostnames

If your test site has the same Google Analytics account’s tracking code installed as your live site, you will be able to spot this within GA if you go to a section such as “Behavior” -> “Site Content” -> “All Pages” and select “Hostname” as a secondary dimension:

Google Analytics Hostnames

You can then filter the data further by excluding all visits to the main domain from the report, which will leave all other hostnames in the list.   In addition to test sites, you may also uncover GA Spam being triggered on a 3rd party site:

Google Analytics Exclude Hostname

There are pros and cons to having the same GA tracking ID running on both your live and test environments, but personally, I see no reason to have separate accounts and instead would create multiple views within your one account.   For the live site, set up a filter to only include traffic to the live hostname, and vice versa for the test site.

How To Remove and Prevent Your Test Site From Getting Indexed

So you’ve discovered your test site in the index using one of the techniques above, or, you want to make sure it doesn’t happen in the first place.  The following will all help with this:

1) Remove URLs via GSC

If your site is indexed, whether it’s generating traffic or not, it’s best to get it removed.   To do this, you can use the “Remove URLs” section from the “old” GSC.    

Note, this will not work at domain property level, as these aren’t catered for in old GSC.  In order to do this, you need to set up a property for the individual test domain.

Once set up, “Go To The Old Version” and go to “Google Index” -> “Remove URLs”.   From here, select “Temporarily Hide” and enter a single forward slash as the URL you wish to block, which will submit your entire site for removal:

GSC Remove URLs

This will remove your site from the SERPs for 90 days. To ensure it doesn’t return, you must take further steps.  One of the following will suffice (and should be carried out regardless of whether you are able to remove the site via GSC):

2) Set robots tag to noindex on test site

Ask your developers to ensure that when running on the test domain, each page across the site generates a robots noindex tag:

<meta name="robots" content="noindex" />
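
How your developers do this depends on the platform, but the logic can be as simple as the following Python sketch. It keys the tag off the hostname the page is served from, using an allow-list of live hostnames. The list and the function name are assumptions for illustration, not anything specific to a CMS.

```python
# Assumed list of hostnames that should be indexed; everything else is
# treated as a test environment.
LIVE_HOSTS = {"organicdigital.co", "www.organicdigital.co"}


def robots_meta_tag(hostname: str, live_hosts=LIVE_HOSTS) -> str:
    """Return the robots meta tag to output for the requesting hostname."""
    host = hostname.split(":")[0].strip().lower()  # drop any port number
    content = "index, follow" if host in live_hosts else "noindex"
    return f'<meta name="robots" content="{content}" />'


print(robots_meta_tag("test.organicdigital.co"))
# prints <meta name="robots" content="noindex" />
```

Because the decision is based on the hostname rather than a setting stored with the site, the same code can be deployed to both environments without the noindex following it to the live site.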

If your site is WordPress, you can set this via “Settings” -> “Reading” and selecting “Discourage search engines from indexing this site”:

Wordpress Reading Settings

Whatever code or settings you use to prevent the test site from being indexed, you must ensure this is not migrated to the live site when new content or functionality is made live.   Test site settings going live are one of the most common and most sure-fire ways to mess up your live site’s visibility.

3) Password Protect Your Test Site

From your web control panel or via the server, password protect the directory in which your test site resides.   There are numerous ways to do this. The best bet is to ask your hosting company or developers to configure this, or there are plenty of good resources out there that will show you how to do this, such as:

https://one-docs.com/tools/basic-auth

Once blocked, you should see an alert box when trying to access your test site:

https://alitis.co.uk/

Password Protected Site

This will prevent search engines from crawling and indexing the site.

4) Delete site and return page status 410

If you no longer have need for your test site, you can simply delete it.  When search engines try to visit pages that are no longer live, they will see the pages are deleted.   By default, a broken page will return status 404 (“Not Found”). Whilst this will get the site de-indexed in time, it will take a while, as there will be follow-up visits to see if the broken page has returned.

Instead, set the status to 410 (“Gone”), which will return the following message:

Status 410

To do this across an entire domain, delete the site and leave the .htaccess file in place with the following directive:

Redirect 410 /

This will ensure the site gets de-indexed at the first time of asking (or at least quicker than a 404).
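
To confirm what your test site now returns, you can check the status code from a script rather than the browser. The Python sketch below is a rough example with function names of my own; it makes a real HEAD request, so the result depends on the server being reachable at the time. The 401 case covers the password protection described in the previous step.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url (makes a real network request)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the code is what we want.
        return error.code


def describe_status(status: int) -> str:
    """Translate a status code into what it means for a test site."""
    if status == 410:
        return "410 Gone: deleted on purpose"
    if status == 404:
        return "404 Not Found: will drop out, but more slowly"
    if status == 401:
        return "401 Unauthorized: password protected"
    if 200 <= status < 300:
        return f"{status}: publicly reachable"
    return f"{status}: check manually"


if __name__ == "__main__":
    print(describe_status(fetch_status("https://alitis.co.uk/")))
```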

5) Block via robots.txt

You can block the site from being crawled by implementing the following commands in the test site’s robots.txt file:

User-agent: *
Disallow: /

This will prevent bots from crawling the site.  Note: if your test site is currently indexed, and you have gone down the route of adding noindex tags to the site, do not add the robots.txt command in until all pages have been de-indexed.  If you add this in before all pages have de-indexed, this will prevent them from being crawled and the robots tag detected, so the pages will remain indexed.
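
The ordering problem is easier to see written out as logic. The following Python sketch is a simplification of the reasoning in the note, not a model of how Google actually schedules crawls, and the function name and return strings are my own.

```python
def expected_outcome(currently_indexed: bool,
                     has_noindex_tag: bool,
                     blocked_by_robots_txt: bool) -> str:
    """Rough prediction of what happens to a test page in the index."""
    if blocked_by_robots_txt:
        if currently_indexed:
            # The page cannot be fetched, so any noindex tag is never seen.
            return "stays indexed"
        return "not crawled"
    if has_noindex_tag:
        return "will be de-indexed" if currently_indexed else "will not be indexed"
    return "stays indexed" if currently_indexed else "may be indexed"


# An indexed test page with noindex added AND a robots.txt block:
print(expected_outcome(True, True, True))
# prints stays indexed
```

The first branch is the trap: the robots.txt block wins, so add it only once the noindex tag has done its job.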

And that’s it. I hope the above will be enough for you to find, deindex and prevent your test site from being crawled ever again.

But Remember

I cannot stress this enough – if you decide to implement robots meta tags or robots.txt which disallow all bots from crawling and indexing your test site, make sure when you put your test site live that you do not carry these configurations over to the live site, as you will risk losing your organic visibility altogether.  

And we’ve all been there, right?

Get In Touch

Fill in the form below if you want to enhance your website's organic visibility