Yes, you read that right, and no, it is not clickbait! This is the first time I have encountered an issue with GSC's robots.txt tester. The tool claimed the URL(s) were blocked, but as per the URL Inspection tool, GSC Crawl Stats and the server logs, the page(s) were getting crawled.

While working on the monthly crawls for one of our clients, we noticed that the crawler was taking too long for some reason. We had witnessed a spider trap on this particular project and wondered if it might have popped up again.

The URL(s) in question were a particular pattern of parameterized URLs generated from the faceted navigation, something you could find easily on most ecommerce sites. This was a medium-sized site with around 3,000+ pages, with the URLs generated from faceted navigation blocked intentionally.

With the Screaming Frog crawler set to respect robots.txt, parameterized URLs wouldn't normally be crawled. This hinted there was a problem with the robots.txt instructions.

Reviewing the problem in GSC's robots.txt

Please note: we'll only discuss Googlebot in this article.

The first obvious check was using GSC's robots.txt tester to verify whether the URLs were indeed blocked by the robots.txt or not. As shown in the screenshot below, GSC's robots.txt tester says the URL pattern is blocked for Googlebot.

[Image 1 – GSC's robots.txt tester reporting the URL pattern as blocked for Googlebot]

Meanwhile, the URL Inspection tool, GSC Crawl Stats and the server logs told a different story: these URLs were getting crawled.

[Image 4 – Terminal command to count the occurrence of a particular pattern in a document]

The above command in the terminal counts the occurrences of the string "param_color" in the log file "googlebot.log" (an assumed equivalent of that command is sketched at the end of this post).

The real issue in the robots.txt instructions

All evidence points to the fact that these URLs were getting crawled, and that was undesired. To confirm, I tested the robots.txt instructions without any changes on two other external validators. One of them did not flag the URL as blocked for Googlebot; the other one marked it as blocked, just like GSC's robots.txt tester.

Here are the original instructions; the parameter in question is "param_color":

…
#Rest of the instructions, which do not matter here

Clearly, the issue was how the instructions were grouped. As per the robots.txt specifications related to grouping, the above instructions mean that all the instructions after #Parameter handling apply only to the third user-agent group. This would mean the instruction Disallow: *?param_color=* doesn't apply to Googlebot.

However, if you test these instructions on GSC's robots.txt tester, it shows the URL will be blocked for Googlebot, as shown in image 1 above. One can say GSC's robots.txt tester sees it like this:

User-agent: *
…

Our assumption is that since Googlebot doesn't consider "crawl-delay", the robots.txt tester behaves as if the crawl-delay instructions do not exist. Note: extra line breaks without any instructions have no meaning and are discarded. This entire instruction set is treated as one single group by GSC's robots.txt tester.
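To make the grouping issue concrete, here is a minimal robots.txt sketch of the kind of file described above. The user-agent names, the crawl-delay values and the exact set of directives are illustrative assumptions; the client's actual file is not reproduced in this post.

```
# Hypothetical robots.txt – bot names and values are assumptions for illustration

User-agent: *
Crawl-delay: 10

User-agent: examplebot-a
Crawl-delay: 5

# The third user-agent group
User-agent: examplebot-b
Crawl-delay: 5

#Parameter handling
Disallow: *?param_color=*

#Rest of the instructions which do not matter here
```

Read strictly by the grouping rules, Disallow: *?param_color=* belongs to the group opened by the last preceding User-agent line (examplebot-b here), so it would not apply to Googlebot, which only matches User-agent: *. If, however, a parser drops the crawl-delay lines it does not support along with the empty lines, the three User-agent lines are left with no rules between them and collapse into a single group, and the Disallow then applies to every bot, Googlebot included. That is the behaviour GSC's robots.txt tester appears to show in image 1.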
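A command along the following lines counts the occurrences of "param_color" in "googlebot.log". This is an assumed equivalent of the command shown in Image 4, not the exact command from the screenshot; only the file name and the string come from the article.

```sh
# Count the log lines in googlebot.log that contain the string "param_color"
grep -c "param_color" googlebot.log

# Or count every occurrence of the string, even when a line contains it more than once
grep -o "param_color" googlebot.log | wc -l
```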