The robots protocol

程序员文章站 2022-04-14 15:41:53

<div id="cnblogs_post_body" class="blogpost-body"><h3><strong>What is robots.txt?</strong></h3>
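<p>robots.txt is a plain-text file, usually located in the root directory of a website, and it is the first file a crawler should check before crawling the site. It defines the restrictions that apply to crawlers: which parts of the site they may crawl and which they may not. Compliance is voluntary, so the file keeps well-behaved crawlers out but does nothing to stop a crawler that ignores it.</p>
<p>For more information about the robots.txt protocol, see www.robotstxt.org</p>
<p>Checking robots.txt before crawling a site minimizes the chance that the crawler gets banned.</p>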
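<p>Because the file normally sits at the site root, its address can be derived from any page URL on the same site. The following is a minimal sketch using only the Python standard library; the helper name <code>robots_url</code> and the sample page URL are illustrative assumptions, not part of the original article.</p>

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the path, query, and fragment,
    # because robots.txt is expected at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Hypothetical page URL, used only for illustration.
    print(robots_url("https://www.baidu.com/home/news/data/index.html"))
```

<p>A crawler would fetch this URL once per site and consult the result before requesting any other page.</p>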
<p>Below is part of Baidu's robots.txt file: https://www.baidu.com/robots.txt</p>
<div class="cnblogs_code">
<pre>user-agent: baiduspider
disallow: /baidu
disallow: /s?
disallow: /ulink?
disallow: /link?
disallow: /home/news/data/

user-agent: googlebot
disallow: /baidu
disallow: /s?
disallow: /shifen/
disallow: /homepage/
disallow: /cpro
disallow: /ulink?
disallow: /link?
disallow: /home/news/data/

user-agent: msnbot
disallow: /baidu
disallow: /s?
disallow: /shifen/
disallow: /homepage/
disallow: /cpro
disallow: /ulink?
disallow: /link?
disallow: /home/news/data/

user-agent: baiduspider-image
disallow: /baidu
disallow: /s?
disallow: /shifen/
disallow: /homepage/
disallow: /cpro
disallow: /ulink?
disallow: /link?
disallow: /home/news/data/

user-agent: youdaobot
disallow: /baidu
disallow: /s?
disallow: /shifen/
disallow: /homepage/
disallow: /cpro
disallow: /ulink?
disallow: /link?
disallow: /home/news/data/

user-agent: sogou spider2
disallow: /baidu
disallow: /s?
disallow: /shifen/
disallow: /homepage/
disallow: /cpro
disallow: /ulink?
disallow: /link?
disallow: /home/news/data/

user-agent: sogou blog
disallow: /baidu
disallow: /s?
disallow: /shifen/
disallow: /homepage/
disallow: /cpro
disallow: /ulink?
disallow: /link?
disallow: /home/news/data/

user-agent: sogou news spider
disallow: /baidu
disallow: /s?
disallow: /shifen/
disallow: /homepage/
disallow: /cpro
disallow: /ulink?
disallow: /link?
disallow: /home/news/data/

user-agent: *
disallow: /</pre>
</div>
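<p>To see how a crawler might act on rules like these, here is a hedged sketch that feeds a shortened copy of the listing (only the baiduspider record and the catch-all record) to <code>urllib.robotparser</code> from the Python standard library. The helper <code>allowed</code>, the crawler name <code>mycrawler</code>, and the test paths are assumptions made for illustration.</p>

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the listing: one named record plus the catch-all record.
ROBOTS_TXT = """\
user-agent: baiduspider
disallow: /baidu
disallow: /s?
disallow: /ulink?
disallow: /link?
disallow: /home/news/data/

user-agent: *
disallow: /
"""

parser = RobotFileParser()
# parse() accepts the file content as a list of lines, so no network
# request is needed for this demonstration.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether `agent` may fetch `path` on the site."""
    return parser.can_fetch(agent, "https://www.baidu.com" + path)


if __name__ == "__main__":
    print(allowed("baiduspider", "/index.html"))  # True: no rule matches
    print(allowed("baiduspider", "/baidu/page"))  # False: matches /baidu
    print(allowed("mycrawler", "/index.html"))    # False: falls under *
```

<p>In a real crawler, <code>set_url()</code> followed by <code>read()</code> would download the live file instead of parsing an embedded string.</p>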
<p><span style="font-size: 15px;"><strong>Meaning of the fields in robots.txt:</strong></span></p>
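<p>1. user-agent: the name of the search engine spider that the record applies to. If a robots.txt file contains several user-agent records, several robots are bound by the protocol, so the file must contain at least one user-agent record. If the value is * (the wildcard), the record applies to every search engine robot. A robots.txt file may contain only one "user-agent: *" record.</p>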
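<p>The record structure described above can be made concrete with a small hand-written grouping routine. This is only a sketch under simplifying assumptions: it handles just the user-agent and disallow fields, and the function name <code>group_rules</code> and the two-record sample are hypothetical.</p>

```python
def group_rules(robots_txt: str) -> dict[str, list[str]]:
    """Map each user-agent name to the disallow paths of its record."""
    groups: dict[str, list[str]] = {}
    current: list[str] = []   # agents named by the record being read
    reading_agents = False    # True while consecutive user-agent lines run
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue  # blank or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                # A user-agent line after rule lines starts a new record.
                current = []
                reading_agents = True
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            reading_agents = False
            for agent in current:
                groups[agent].append(value)
    return groups


SAMPLE = "user-agent: baiduspider\ndisallow: /baidu\n\nuser-agent: *\ndisallow: /\n"

if __name__ == "__main__":
    print(group_rules(SAMPLE))
```

<p>Each key in the result corresponds to one user-agent record, and the <code>*</code> key holds the rules that apply to every robot without a record of its own.</p>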
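<p>2. disallow: a path that crawlers must not access.</p>
<p>For example, disallow: /home/news/data/ means a crawler may not access any URL under /home/news/data/, but it may still access /home/news/data123.</p>
<p>disallow: /home/news/data means a crawler may not access /home/news/data123, /home/news/datadasf, or any other URL whose path begins with /home/news/data.</p>
<p>The former blocks exactly one directory; the latter blocks every path that shares the prefix.</p>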
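<p>The difference made by the trailing slash can be checked with <code>urllib.robotparser</code>. The sketch below builds a one-rule file for each variant; the helper name <code>can_fetch</code>, the crawler name <code>mycrawler</code>, and the file name <code>a.html</code> are illustrative assumptions.</p>

```python
from urllib.robotparser import RobotFileParser


def can_fetch(rule: str, path: str) -> bool:
    """Check `path` against a robots.txt holding a single rule for all agents."""
    parser = RobotFileParser()
    parser.parse(["user-agent: *", rule])
    return parser.can_fetch("mycrawler", path)


if __name__ == "__main__":
    # With the trailing slash, only the directory itself is blocked.
    print(can_fetch("disallow: /home/news/data/", "/home/news/data/a.html"))  # False
    print(can_fetch("disallow: /home/news/data/", "/home/news/data123"))      # True
    # Without it, every path sharing the prefix is blocked.
    print(can_fetch("disallow: /home/news/data", "/home/news/data123"))       # False
    print(can_fetch("disallow: /home/news/data", "/home/news/datadasf"))      # False
```

<p>Both rules are plain prefix matches; the trailing slash simply makes the prefix long enough that sibling paths no longer match.</p>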
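<p>3. allow: a path that crawlers may access.</p>
<p>For example, suppose disallow: /home/ is in effect and /home/ contains several paths such as news, video, and image.</p>
<p>Adding allow: /home/news then means that everything under /home/ is off limits except the /home/news path, which may be accessed.</p>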
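<p>One common way to resolve an allow rule against a broader disallow rule is longest-match precedence, the convention described in RFC 9309: the rule with the longest matching prefix wins, and allow wins a tie. The sketch below implements that convention by hand; the function name <code>is_allowed</code>, the rule-list format, and the sample paths are assumptions for illustration. Some parsers instead apply the first matching rule in file order, so placing the allow line before the disallow line is the safer ordering.</p>

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide access using longest-match precedence.

    rules: (directive, prefix) pairs such as ("disallow", "/home/").
    """
    best_len = -1
    verdict = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        allow = directive == "allow"
        # Prefer the longest matching prefix; on a tie, prefer allow.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            verdict = allow
    return verdict


RULES = [("disallow", "/home/"), ("allow", "/home/news")]

if __name__ == "__main__":
    print(is_allowed(RULES, "/home/news/today.html"))  # True
    print(is_allowed(RULES, "/home/video/1.mp4"))      # False
    print(is_allowed(RULES, "/about"))                 # True
```

<p>Here /home/news is longer than /home/, so it takes precedence for any path beneath /home/news, while the rest of /home/ stays blocked.</p>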
</div>