
Python crawler: scraping Taobao product prices for comparison (with a small workaround for Taobao's anti-crawler mechanism)

Programmer Article Station (程序员文章站), 2022-04-27 11:03:38

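Because many people in the comments say they cannot scrape anything, let me stress a few points.

kv should be written like this:

kv = {'cookie': 'the long cookie string you copied', 'user-agent': 'Mozilla/5.0'}

Note that every string must use plain ASCII single quotes ' ' (not typographic quotes), and the two entries are separated by an ASCII comma.

Once kv is written, it has to be passed in the request later in the code:

r = requests.get(url, headers=kv, timeout=30)

You have to log in to your own Taobao account first to get a logged-in cookie. A cookie copied without logging in is of course useless.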
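The points above can be sketched as a small helper. This is only an illustration, not part of the original post: `build_headers` is a hypothetical name, and the cookie value below is a made-up placeholder. It trims the whitespace and typographic quotes that often sneak in when a cookie is pasted, then builds the same kind of `kv` dictionary.

```python
def build_headers(raw_cookie, user_agent="Mozilla/5.0"):
    """Build the headers dict (called kv in this post) from a pasted cookie.

    build_headers is a hypothetical helper, not part of the original code.
    """
    # Drop surrounding whitespace/newlines, then any straight or
    # typographic quotes that were copied along with the cookie.
    cookie = raw_cookie.strip().strip("‘’“”'\"").strip()
    if not cookie:
        raise ValueError("cookie is empty; log in to Taobao and copy it first")
    return {"cookie": cookie, "user-agent": user_agent}


# Placeholder value for illustration only; use the cookie from your own session.
kv = build_headers("  ‘t=abc123; thw=cn’ \n")
print(kv)
```

The resulting `kv` is then passed exactly as shown above, with `requests.get(url, headers=kv, timeout=30)`.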

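The original post follows.

I am new to Python and am currently following Professor Song Tian's (嵩天) web crawler course on China University MOOC. One of its examples shows how to scrape product information from Taobao.

Here is the code: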

import re

import requests


def gethtmltext(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def parsepage(ilt, html):
    """Append every [price, title] pair found in html to ilt."""
    plt = re.findall(r'"view_price":"[\d.]*"', html)
    tlt = re.findall(r'"raw_title":".*?"', html)
    for p, t in zip(plt, tlt):
        # Split on the first colon only, so a colon inside a title is kept
        price = p.split(':', 1)[1].strip('"')
        title = t.split(':', 1)[1].strip('"')
        ilt.append([price, title])


def printgoodslist(ilt):
    """Print the collected goods as a numbered table."""
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '书包'  # search keyword ("schoolbag")
    depth = 3  # number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infolist = []
    for i in range(depth):
        # The s parameter is the item offset: 44 items per page
        url = start_url + '&s=' + str(44 * i)
        html = gethtmltext(url)
        parsepage(infolist, html)
    printgoodslist(infolist)


main()
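The page URLs built in `main()` can be tried on their own without sending any request. The sketch below is an alternative to the string concatenation used in the post, not the course's method: `search_url` is a hypothetical helper, and the page size of 44 is simply taken from the `44 * i` in the code above. It uses the standard library's `urlencode`, which also percent-encodes the Chinese keyword.

```python
from urllib.parse import urlencode


def search_url(keyword, page, page_size=44):
    """Return the search URL for a zero-based result page.

    Hypothetical helper; page_size=44 mirrors the 44 * i in the post's code.
    """
    # urlencode percent-encodes the keyword (UTF-8) and joins the parameters
    query = urlencode({"q": keyword, "s": page * page_size})
    return "https://s.taobao.com/search?" + query


for page in range(3):
    print(search_url("书包", page))
```

Building the query this way avoids mistakes with `&` and `=` when more parameters are added later.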

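However, when we run this program, it raises no error and yet scrapes nothing. The reason is that Taobao has an anti-crawler mechanism: r.text contains the login page instead of the search results. So how do we get past the login page and scrape the data?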
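Since the program fails silently, it can help to check what actually came back. The following is a minimal sketch under one assumption drawn from the parsing code in this post: a real results page contains the `"view_price"` key that `parsepage` looks for. The function names and sample strings are hypothetical, and the "login page" label is only a guess based on the behaviour described above.

```python
def looks_like_results(html):
    """True if the page contains the key that parsepage searches for."""
    return '"view_price"' in html


def describe_response(html):
    """Give a short, human-readable guess about what was returned."""
    if not html:
        return "empty response (the request failed or timed out)"
    if looks_like_results(html):
        return "search results"
    return "no product data (possibly the login page)"


# Made-up sample strings, for illustration only
print(describe_response(""))
print(describe_response("<html><title>login</title></html>"))
print(describe_response('{"raw_title":"bag","view_price":"59.90"}'))
```

Printing this description for each fetched page makes it obvious when the cookie is missing or has expired.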

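First, log in to your personal Taobao account in the browser and search for a product (schoolbags, 书包, in this example). Then open the developer tools (I use Chrome), for example by pressing F12.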


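In the developer tools, the request headers show our current cookie and user-agent (usually Mozilla/5.0). (Note: if these names do not appear, refresh the page in the browser and they will show up.)

Next, add our cookie and user-agent to the code.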


Then run it.


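I am only a beginner. The course video did not cover this, and I had to search Baidu a lot before I found this little trick.
If you run into a problem, just search Baidu.

Complete code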

import re

import requests


def gethtmltext(url):
    """Fetch a page with the login cookie; return its text, or "" on failure."""
    # Replace the cookie value with the one copied from your own logged-in session
    kv = {'cookie': 'the long cookie string you copied',
          'user-agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def parsepage(ilt, html):
    """Append every [price, title] pair found in html to ilt."""
    plt = re.findall(r'"view_price":"[\d.]*"', html)
    tlt = re.findall(r'"raw_title":".*?"', html)
    for p, t in zip(plt, tlt):
        # Split on the first colon only, so a colon inside a title is kept
        price = p.split(':', 1)[1].strip('"')
        title = t.split(':', 1)[1].strip('"')
        ilt.append([price, title])


def printgoodslist(ilt):
    """Print the collected goods as a numbered table."""
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '书包'  # search keyword ("schoolbag")
    depth = 3  # number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infolist = []
    for i in range(depth):
        # The s parameter is the item offset: 44 items per page
        url = start_url + '&s=' + str(44 * i)
        html = gethtmltext(url)
        parsepage(infolist, html)
    printgoodslist(infolist)


main()
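The extraction step can also be tried offline. The sketch below is a variation on `parsepage`, not the original code: it uses capture groups so that the regular expression returns the values directly, with no need to split on `:` afterwards. The name `extract_goods` and the sample string are made up for illustration; real Taobao pages may be structured differently.

```python
import re

# Capture groups return only the part inside the parentheses
PRICE_RE = re.compile(r'"view_price":"([\d.]*)"')
TITLE_RE = re.compile(r'"raw_title":"(.*?)"')


def extract_goods(html):
    """Return a list of (price, title) tuples found in html."""
    prices = PRICE_RE.findall(html)
    titles = TITLE_RE.findall(html)
    # Pair the n-th price with the n-th title, as parsepage does
    return list(zip(prices, titles))


# Made-up sample data, only shaped like the fields the post's regexes look for
sample = ('{"raw_title":"学生书包: 大容量","view_price":"129.00"},'
          '{"raw_title":"双肩包","view_price":"59.90"}')

for price, title in extract_goods(sample):
    print(price, title)
```

Note that the first sample title contains a colon and is still extracted whole, which is the case the first-colon split in `parsepage` is meant to handle.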
