Python / Scrapy: fixing errors when running a spider even though scrapy.cfg exists and the path is correct

程序员文章站 2022-07-02 22:50:16


I

When running a Scrapy crawl, scrapy.cfg exists and the path is correct, yet the run keeps failing.
In settings.py, change:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

to:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'  # replace with your own request header

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

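Uncomment USER_AGENT (delete the leading #) and replace its value with your own browser's User-Agent request header.
With ROBOTSTXT_OBEY = True, Scrapy follows the robots protocol and will not crawl anything that the site's robots.txt disallows, so this setting in the project's settings.py also has to be changed to False.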
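
To see what "disallowed by robots.txt" means in practice, here is a minimal sketch using Python's standard-library urllib.robotparser. It is not Scrapy's own robots.txt handling. The robots.txt content, the example.com URLs, and the helper name is_allowed are all hypothetical and only illustrate how a Disallow rule decides whether a URL may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/page",
                "https://example.com/private/page"):
        print(url, "allowed" if is_allowed("Mozilla/5.0", url) else "disallowed")
```

A crawler that obeys robots.txt skips the second URL. Setting ROBOTSTXT_OBEY = False tells Scrapy not to apply this kind of check at all.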

II

Running the project produces an "unknown command crawl" error

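Possible causes:
1. The command is being run at the wrong directory level.
2. The name defined in the spider is wrong, or BOT_NAME in settings.py has been commented out.
3. The project was run before the files were saved.
A final possibility is that pywin32 is not installed; install it with pip install pywin32. This generally only happens in VS Code, where the project may have been opened at the wrong directory level.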

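from scrapy import cmdline
import os

# Make the working directory the folder that contains this main file
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# The last argument is the spider's name (the name attribute in the spider file); use your own project's value
cmdline.execute(['scrapy', 'crawl', 'your_spider_name'])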
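
For cause 1 (wrong directory level), a quick diagnostic is to check whether scrapy.cfg is visible from the directory the command is run in. The sketch below assumes Scrapy finds the project by looking for scrapy.cfg in the current directory or one of its parents. The helper name find_scrapy_root is hypothetical and not part of Scrapy.

```python
from pathlib import Path
from typing import Optional


def find_scrapy_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above start that contains scrapy.cfg, or None."""
    start = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for directory in (start, *start.parents):
        if (directory / "scrapy.cfg").is_file():
            return directory
    return None


if __name__ == "__main__":
    root = find_scrapy_root(Path.cwd())
    if root is None:
        print("No scrapy.cfg found here or in any parent directory; "
              "'scrapy crawl' will not be available from this location.")
    else:
        print(f"Scrapy project root: {root}")
```

If this prints that no scrapy.cfg was found, the editor or terminal was probably opened one level too high or in an unrelated folder. Changing into the reported project root, or using os.chdir as in the script above, should make the crawl command available again.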

Article URL: https://blog.csdn.net/weixin_51277037/article/details/110458559