Search engine crawlers are your best friends if you own a website and want it to rank high on search engine results pages. But what if your website has a certain page, a specific product, a news item or some other information that you don’t want web spiders to crawl and add to their index? How do you tell the robots? Yes, there is a way! You can do this with a robots.txt file. Robots.txt is a plain text file that you use to communicate with web crawlers and other web robots. It tells them which files or folders of your website should not be crawled or scanned. This convention is called the Robots Exclusion Standard, the Robots Exclusion Protocol, or simply robots.txt.
Do all robots obey the instructions given in the robots.txt file?
A robots.txt file is like a “Do Not Disturb” sign on the door of your website. Search engines generally respect it, like polite people who do not open a door with a “Do Not Disturb” sign on it. Unfortunately, thieves do not care about signs on doors, and neither do robots that ignore robots.txt. Such robots either pay no attention to the instructions in the file or may even start with the very portions of the website they were told to stay away from.
Why is it extremely important to put the robots.txt file in the right place?
The right place for a robots.txt file is the place where crawlers expect to find it. The robots.txt file is the first thing most search engine crawlers look for before crawling a website, which is why it must be placed in the main (root) directory of the site. If it is anywhere else, crawlers will not discover it, because they do not scan the entire website looking for it. Instead, they simply assume that the site has no robots.txt file and index everything they find.
How does a robots.txt file work?
Suppose a robot is about to crawl a page of a website, for example, http://www.website.com/home.html. Before crawling, it first checks whether the website has a robots.txt file. To do that, the robot strips the path component from the URL, that is, everything from the first single slash after the host name, and puts “/robots.txt” in its place. For http://www.website.com/home.html, it replaces “/home.html” with “/robots.txt” and ends up with “http://www.website.com/robots.txt”. If a file exists there, the robot then crawls the website according to the instructions in it.
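The URL rewriting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular crawler is implemented; the function name robots_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that hosts page_url.

    robots_url is a hypothetical helper name used only for this sketch.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.website.com/home.html"))
# http://www.website.com/robots.txt
```

Whatever page of the site the robot starts from, the result is always the same file in the root directory.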
What is the structure of a robots.txt file?
The basic structure of a robots.txt file is a “User-agent:” line followed by one or more “Disallow:” lines. “User-agent:” names the robot that the rules apply to, and each “Disallow:” line gives that robot a path on the website that it is not supposed to visit or crawl.
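To make the structure concrete, here is a rough Python sketch that groups the lines of a robots.txt file into records. It is a simplified reading of the format written for this article, not a complete parser; the function name parse_records, the sample rules and the paths in them are all assumptions made for illustration.

```python
def parse_records(text):
    """Group robots.txt lines into records of user agents and disallowed paths.

    Simplified, hypothetical helper: it only understands the
    "User-agent:" and "Disallow:" fields discussed in this article.
    """
    records = []
    current = None
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped:
            current = None  # a blank line ends the current record
            continue
        line = stripped.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # comment-only line
        field, _, value = line.partition(":")
        field = field.strip().lower()
        value = value.strip()
        if field == "user-agent":
            # Start a new record unless we are still listing agents.
            if current is None or current["disallow"]:
                current = {"user-agent": [], "disallow": []}
                records.append(current)
            current["user-agent"].append(value)
        elif field == "disallow" and current is not None:
            current["disallow"].append(value)
    return records


# Made-up sample rules for illustration.
SAMPLE = """\
# keep every robot out of two directories
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

print(parse_records(SAMPLE))
# [{'user-agent': ['*'], 'disallow': ['/private/', '/tmp/']}]
```

Each record pairs the robots it addresses with the paths those robots are asked to stay away from.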
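Here are some examples:
- To exclude all robots from the entire website, use “User-agent: *” followed by “Disallow: /”
(The ‘*’ in the User-agent field is a special value meaning “any robot”)
(The ‘/’ in the Disallow field matches every path on the site, so nothing may be crawled)
- To give all robots complete access, use “User-agent: *” followed by an empty “Disallow:”
(alternatively, create an empty “/robots.txt” file, or don’t use one at all)
- To exclude all robots from part of the website, use “User-agent: *” followed by one “Disallow:” line for each directory you want to keep them out of
- To exclude a single robot, name that robot in “User-agent:” and follow it with “Disallow: /”
- To allow a single robot, name that robot in “User-agent:” with an empty “Disallow:”, then add a second record with “User-agent: *” and “Disallow: /” for everyone else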
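The five cases above can be tried out with the robots.txt parser in Python's standard library (urllib.robotparser). The sketch below feeds it rule sets written for this article; the robot names (AnyBot, BadBot, GoodBot) and the paths are invented for illustration, and the result only shows how this particular parser interprets the rules.

```python
from urllib.robotparser import RobotFileParser


def build_parser(lines):
    """Return a standard-library parser loaded with the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


# All robot names and paths below are made up for illustration.
EXCLUDE_ALL = ["User-agent: *", "Disallow: /"]
ALLOW_ALL = ["User-agent: *", "Disallow:"]
EXCLUDE_PART = ["User-agent: *", "Disallow: /private/", "Disallow: /tmp/"]
EXCLUDE_ONE = ["User-agent: BadBot", "Disallow: /"]
ALLOW_ONE = ["User-agent: GoodBot", "Disallow:", "", "User-agent: *", "Disallow: /"]

PAGE = "http://www.website.com/home.html"
PRIVATE = "http://www.website.com/private/notes.html"

print(build_parser(EXCLUDE_ALL).can_fetch("AnyBot", PAGE))      # False
print(build_parser(ALLOW_ALL).can_fetch("AnyBot", PAGE))        # True
print(build_parser(EXCLUDE_PART).can_fetch("AnyBot", PAGE))     # True
print(build_parser(EXCLUDE_PART).can_fetch("AnyBot", PRIVATE))  # False
print(build_parser(EXCLUDE_ONE).can_fetch("BadBot", PAGE))      # False
print(build_parser(EXCLUDE_ONE).can_fetch("AnyBot", PAGE))      # True
print(build_parser(ALLOW_ONE).can_fetch("GoodBot", PAGE))       # True
print(build_parser(ALLOW_ONE).can_fetch("AnyBot", PAGE))        # False
```

A well-behaved crawler performs a check like can_fetch before requesting each page; nothing forces a badly behaved one to do so.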
There are a few other important points to remember:
- Always use all lowercase for the filename: “robots.txt”. “Robots.TXT” is wrong.
- robots.txt files are publicly available, so anyone can see which sections of your website you don’t want robots to crawl.
- Never use robots.txt files to hide information.
- If you have really sensitive data to protect, don’t rely on robots.txt, because no search engine or other robot is obliged to follow it.