Randomizing part of the request header in Scrapy (crawler notes to myself, part 2)
程序员文章站
2022-05-23 10:26:41
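Introduction
Requirement: with Scrapy, send each request with a randomly chosen User-Agent (UA) while every other header stays fixed.
Approach: a header set on an individual request takes precedence over the project-wide defaults. So pick a random UA in a downloader middleware and put the fixed part in DEFAULT_REQUEST_HEADERS in settings. The result is a header whose UA is random and whose other fields are fixed.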
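To make the precedence concrete, here is a minimal sketch that uses plain dictionaries instead of Scrapy's own header objects. It assumes that defaults are filled in with setdefault(), which is how Scrapy's built-in DefaultHeadersMiddleware applies DEFAULT_REQUEST_HEADERS, and that the random UA is written by direct assignment. The names UA_LIST, DEFAULT_HEADERS, apply_defaults, apply_random_ua and build_headers are hypothetical and exist only for this illustration.

```python
import random

# Hypothetical stand-ins for the real UA pool and the fixed headers.
UA_LIST = ["ua-one", "ua-two", "ua-three"]
DEFAULT_HEADERS = {
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "fixed-ua",
}


def apply_defaults(headers, defaults):
    """Fill in default headers without overwriting ones already present."""
    for name, value in defaults.items():
        headers.setdefault(name, value)
    return headers


def apply_random_ua(headers, ua_list):
    """Overwrite User-Agent with a randomly chosen value."""
    headers["User-Agent"] = random.choice(ua_list)
    return headers


def build_headers():
    """Apply the fixed defaults first, then the random UA on top."""
    headers = {}
    apply_defaults(headers, DEFAULT_HEADERS)
    apply_random_ua(headers, UA_LIST)
    return headers


if __name__ == "__main__":
    print(build_headers())
```

Because the UA is assigned rather than set with setdefault(), the order of the two steps does not matter for User-Agent: the random value ends up in the header either way, and the other fields keep their fixed values.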
Setting a random UA in the middleware
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class AgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=""):
        self.user_agent = user_agent
        # Pool of User-Agent strings to choose from.
        self.ua_list = [
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
            "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
            "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        ]

    def process_request(self, request, spider):
        # Pick a different User-Agent for every outgoing request.
        ua = random.choice(self.ua_list)
        # Assign directly instead of calling setdefault(): if the User-Agent from
        # DEFAULT_REQUEST_HEADERS is already on the request, setdefault() would keep it.
        request.headers['User-Agent'] = ua
Setting the fixed part in settings
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Host": "baidu.cn",
    "Referer": "http://ris.szpl.gov.cn/bol/projectdetail.aspx",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Origin": "http://baidu.com",
    "Upgrade-Insecure-Requests": "1",
    "Content-Type": "application/x-www-form-urlencoded",
}
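A custom downloader middleware only runs once it is enabled in settings, which the snippets above do not show. A minimal sketch follows, assuming the class is saved in a hypothetical module myproject.middlewares (adjust the path to the actual project). The order value 543 is only an example; process_request methods run in increasing order, so any value above 400 places this middleware after the built-in DefaultHeadersMiddleware that applies DEFAULT_REQUEST_HEADERS.

```python
# settings.py (sketch; "myproject" is a hypothetical project name)
DOWNLOADER_MIDDLEWARES = {
    # Enable the custom middleware that picks a random User-Agent.
    "myproject.middlewares.AgentMiddleware": 543,
    # Disable Scrapy's built-in User-Agent middleware so only one
    # component manages the User-Agent header.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}
```

Setting the built-in middleware to None is optional here, but it keeps a single place responsible for the User-Agent header.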