Website owners have been excluding web crawlers using the Robots Exclusion Protocol (REP) in robots.txt files for 25 years. Until now, there has been no official Internet standard and no documented specification for writing the rules correctly according to the protocol. Over the years, developers shared their own interpretations of the protocol, which produced many different and ambiguous methods for controlling crawlers.
Google is working together with Martijn Koster, the original author of the protocol, webmasters, and other search engines to create a proposal to submit to the Internet Engineering Task Force (IETF) for standardizing the REP.
The proposed REP draft reflects over 20 years of real-world experience of relying on robots.txt rules, used by both Googlebot and other major crawlers, as well as about half a billion websites that rely on REP. These fine-grained controls give publishers the power to decide what they’d like to be crawled on their sites and potentially shown to interested users. The draft doesn’t change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends the protocol for the modern web.
The proposed specification includes several major items that webmasters and developers will want to review. It extends the use of robots.txt to any URI-based transfer protocol (FTP, CoAP, and others), instead of limiting it to HTTP. It also sets a new maximum caching time of 24 hours, which lets website owners update robots.txt whenever they choose without having crawlers overload their sites with requests. If a previously accessible robots.txt file becomes inaccessible for whatever reason, crawlers will continue to respect the previously known disallowed pages for “a reasonably long period of time.”
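The caching behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the draft or from any crawler: the `RobotsCache` class, the `fetch` callable (which returns the robots.txt text, or `None` when the file is unreachable), and the injectable `clock` are all hypothetical names. The sketch also keeps stale rules indefinitely, whereas the draft only asks for "a reasonably long period of time", so a real crawler would pick its own limit.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# The 24-hour maximum caching time mentioned in the proposal, in seconds.
MAX_CACHE_AGE_SECONDS = 24 * 60 * 60


@dataclass
class CacheEntry:
    body: str          # raw robots.txt text from the last successful fetch
    fetched_at: float  # clock value at the time of that fetch


class RobotsCache:
    """Hypothetical per-host cache for robots.txt bodies."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[str]],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch  # assumed: returns None if robots.txt is unreachable
        self._clock = clock
        self._entries: Dict[str, CacheEntry] = {}

    def get(self, host: str) -> Optional[str]:
        now = self._clock()
        entry = self._entries.get(host)

        # A fresh entry is served from the cache, so the site is not re-fetched.
        if entry is not None and now - entry.fetched_at < MAX_CACHE_AGE_SECONDS:
            return entry.body

        body = self._fetch(host)
        if body is None:
            # The file became inaccessible: fall back to the last known rules,
            # if there are any, instead of treating the site as unrestricted.
            return entry.body if entry is not None else None

        self._entries[host] = CacheEntry(body=body, fetched_at=now)
        return body
```

Passing the clock in as a parameter is a design choice made only so that the expiry logic can be exercised without waiting a day.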
Google has also open sourced the C++ library it uses for parsing and matching rules in robots.txt files, along with a tool for testing the rules. Developers can use this library to build parsers that follow the proposed REP requirements. It has been updated to ensure that Googlebot only crawls what it’s allowed to and is now available on GitHub.
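For readers who want a feel for what "parsing and matching" means in practice, here is a minimal sketch that uses Python's standard-library `urllib.robotparser` as a stand-in. It is not Google's C++ library, and its matching behavior may differ from the proposed draft in edge cases. The robots.txt content, the `examplebot` user agent, and the helper names `build_parser` and `is_allowed` are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content: one group for all crawlers, one for "examplebot".
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /drafts/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable matcher."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    checks = [
        ("ExampleBot/1.0", "https://example.com/drafts/post"),
        ("ExampleBot/1.0", "https://example.com/private/page"),
        ("otherbot", "https://example.com/private/page"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(rules, agent, url))
```

A crawler with its own group (`examplebot` here) is matched against that group only, while every other crawler falls back to the `*` group.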
“This library has been around for 20 years and it contains pieces of code that were written in the 90’s,” Google’s Search Open Sourcing team said in the announcement. “Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.”
Lizzi Harvey, who maintains Google’s Search developer docs, updated the robots.txt spec to match the REP draft. Check out the full list of changes if you want to compare your robots.txt file to the proposed spec. If the proposal for standardizing the REP is successfully adopted by the IETF, the days of googling and wading through undocumented robots.txt rules will soon be over.