Robots.txt is an often misunderstood tool that can play an important part in SEO. Because of this, it is not something to be ignored. Here we will go over what exactly Robots.txt is, what it does, and how to use it properly.
What is Robots.txt and What Does It Do?
Robots.txt is a basic text file that uses the Robots Exclusion Protocol (REP) to regulate web robots, also known as crawlers. REP is a standard used to tell crawlers which areas of your website or server should and should not be scanned.
Robots.txt can disallow crawlers from scanning areas of your website or server. This function is usually used when you don’t want a crawler to scan specific folders or directories such as an image folder. This can also come in handy if you know parts of your site need SEO work and you’d rather have crawlers scan other areas instead.
Disallowing Google from crawling specific areas of your site will not necessarily keep those areas out of the search engine’s index. If other pages link to a disallowed URL, Google may still index the URL but will leave out information such as the title and meta tags.
It is also important to note that robots.txt is more of a guideline for web crawlers. While good crawlers will adhere to the rules you set forth in robots.txt, crawlers with malicious intent will most likely ignore it completely. If there are specific areas of your site you need protected, do not rely on robots.txt. Use security measures that employ passwords.
How to Properly Use Robots.txt
Using a Robots.txt file on your server is not incredibly difficult. In fact, if you have basic knowledge of how to upload files to your hosting, you should be able to implement your own Robots.txt file.
Creating a Robots.txt file is as simple as opening up a plain text editor such as Notepad on your computer and saving the blank document using the name robots. Use all lowercase letters, because crawlers request the file by the exact name robots.txt. Make sure you save the file as a .txt file. Once you have this file created, it’s time to fill it with the proper restriction codes for your website.
There are several commands you need to know. They are as follows:
The “User-Agent” command is used to identify which web crawlers will receive all the following restrictions. Web crawlers are identified by name, and an asterisk (*) is used to symbolize all crawlers. The crawler you need to worry about is Googlebot.
The “Disallow” and “Allow” commands state which directory paths crawlers are either disallowed or allowed to scan.
Now let’s combine all of these into a simple robots.txt file that allows Googlebot to scan the entirety of a website while disallowing all other crawlers from scanning the site’s Products page:

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /Products
As you can see, the first line of text identifies Googlebot as the crawler to receive the following command. The “Allow” command that follows uses a forward slash (/) to denote the entirety of a website. Think of the forward slash as your website’s root URL. Technically it’s the top-level directory of the web server.
While the first two lines allow Googlebot to scan the entire website, the last two lines disallow all other user agents from scanning the products page. The products page on this site would be found at www.YOUR-URL.com/Products.
You or your website admin will know which areas of your site, if any, should be off-limits to crawlers.
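One way to sanity-check rules like these before uploading them is to run them through a parser. The sketch below uses Python's standard urllib.robotparser module on the file described above. The crawler name SomeOtherBot is a made-up stand-in for "any other crawler", and real crawlers may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The example file described above: Googlebot may scan everything,
# every other crawler is kept out of the Products page.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /Products
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no download)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "SomeOtherBot" is a hypothetical crawler name used only for illustration.
googlebot_products = parser.can_fetch("Googlebot", "http://www.YOUR-URL.com/Products")
other_products = parser.can_fetch("SomeOtherBot", "http://www.YOUR-URL.com/Products")
other_home = parser.can_fetch("SomeOtherBot", "http://www.YOUR-URL.com/")

print("Googlebot -> /Products:", googlebot_products)
print("SomeOtherBot -> /Products:", other_products)
print("SomeOtherBot -> /:", other_home)
```

Under these assumptions, Googlebot is allowed on the Products page while the other crawler is blocked there but still allowed on the home page.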
Once you have the proper commands in your Robots.txt file, you simply need to place it in the top-level directory of your web server. If done correctly, you should be able to navigate to your robots.txt file by going to www.YOUR-URL.com/robots.txt
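Because the file always lives at the top level, its address can be worked out from any page on the same site. The helper below is a minimal sketch of that idea. The function name robots_txt_url and the sample page address are illustrative assumptions, and the code only builds the address without downloading anything.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address, used only to show the result.
print(robots_txt_url("https://www.YOUR-URL.com/Products/item?id=3"))
```

Opening the printed address in a browser is a quick way to confirm the file was uploaded to the right place.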
Robots.txt and Sitemaps
As a final note, it is important to know that you can place the URL of your sitemap in your robots.txt file to help crawlers find it. You do this with the Sitemap directive, followed by the full URL of your sitemap. For example:

Sitemap: http://www.YOUR-URL.com/sitemap.xml
You can do this for every sitemap on your site if you have sitemaps for multiple sections. Each sitemap gets its own line.
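To confirm that sitemap lines are being picked up, the same standard-library parser can list them. This is a sketch that assumes Python 3.8 or newer (where site_maps() was added), and the two sitemap addresses are hypothetical examples for a site with a separate blog section.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one sitemap line per sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /Products

Sitemap: http://www.YOUR-URL.com/sitemap.xml
Sitemap: http://www.YOUR-URL.com/blog/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file has none.
sitemaps = parser.site_maps()
for url in sitemaps or []:
    print(url)
```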
Now that you know how to set up a robots.txt file with simple allow and disallow commands for individual crawlers, you can set up the proper crawler restrictions for your site. If you are unsure which areas should be disallowed, it’s probably best to not mess with anything. When in doubt, just ensure that Googlebot has access to your site and make sure your sitemap is in your robots.txt.