Scraping Sunshine Procurement Platform Data in Python with a Simulated User Login
At the start of every month the Sunshine procurement platform publishes that month's prices. This project simulates a user login, saves the needed data to a CSV file and a database, and sends the file to designated recipients. As a Python beginner I hit plenty of pitfalls, so I am writing them down here.
Environment | Python 2.7 |
Development tool | PyCharm |
Runtime environment | CentOS 7 |
How it runs | A scheduled job executes this Python script at 1 a.m. on the 1st of every month |
Features | Logs into the system automatically using the account, password, and the OCR-decoded captcha; parses the needed data and saves it to a CSV file and a MySQL database; mails the CSV file to the designated recipients once the crawl finishes. Automatically re-logs-in and resumes when a request is dropped. |
Setting up the development environment:
There are plenty of tutorials online, so I won't repeat them. After installing Python, install the required libraries:
bs4 (HTML page parsing)
csv (writing CSV files)
smtplib (sending email)
mysql.connector (connecting to the database)
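The third-party ones can be installed with pip. The package names below are my assumption for current PyPI (`bs4` pulls in BeautifulSoup, and mysql.connector ships as `mysql-connector-python`); csv and smtplib are part of the standard library and need no installation:

```shell
# Package names assumed; csv and smtplib ship with Python itself.
pip install bs4
pip install pytesseract
pip install mysql-connector-python
```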
I have shared the downloads you will need on Baidu netdisk, including leptonica-1.72.tar.gz, tesseract3.04.00.tar.gz, and the language packs:
Link: https://pan.baidu.com/s/1j4szdgmn6dpuq1ehxe6zkw
Extraction code: crbl
Image recognition:
There are many tutorials online for this too; below is a set of steps I put together that installs the image-recognition libraries cleanly on CentOS 7.
- Since we build from source, install the corresponding build tools first:

```shell
yum install gcc gcc-c++ make
yum install autoconf automake libtool
```

- Install the image-format support libraries; without them, later `tesseract` commands will fail:

```shell
yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
```
- Install leptonica, a library that tesseract requires. Download it, copy it to the server, then unpack and build:
Download link:

```shell
# run in the leptonica directory
./configure
make
make install
```
- Download the matching tesseract
Download link: https://link.jianshu.com/?t=https://github.com/tesseract-ocr/tesseract/wiki/downloads

```shell
# run in the tesseract-3.04.00 directory
./autogen.sh
./configure
make
make install
ldconfig
```
- Download the language packs
Download link:
Put the downloaded files into the tessdata directory.
- Environment configuration
Copy tessdata: `cp -r tessdata /usr/local/share`
Edit the environment variables:
Open the config file: `vi /etc/profile`
Add this line (tesseract 3.x appends `tessdata/` itself, so the variable should point at the parent directory): `export TESSDATA_PREFIX=/usr/local/share`
Apply it: `source /etc/profile`
- Testing
`tesseract -v` prints tesseract's version information. If it runs without errors, the installation succeeded.
Put an image image.png in the current directory, then run: `tesseract image.png 123`
A file 123.txt is created in the current directory containing the recognized text.
- Install the pytesseract library
This library is what lets Python code call tesseract.
Command: `pip install pytesseract`
Test code:

```python
import pytesseract
from PIL import Image

im1 = Image.open('image.png')
print(pytesseract.image_to_string(im1))
```
The code:
The data I want to fetch looks like this:
First get the total number of pages, then visit each page in a loop, saving every page's data to the CSV file and the database. If visiting some page throws an exception, record the page number where it broke, log in again, and resume crawling from that page.
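The resume-from-the-broken-page flow can be sketched roughly like this (`fetch_page` stands in for the real request-and-parse logic; all names here are mine, not from the actual script):

```python
def crawl_all_pages(fetch_page, total_pages, max_logins=50):
    """Crawl pages 1..total_pages, retrying from the page that broke.

    On an exception we stay on the same page number, so after re-login
    the crawl resumes exactly where it stopped.
    """
    results = []
    page = 1
    attempts = 0
    while page <= total_pages and attempts < max_logins:
        try:
            results.append(fetch_page(page))  # may raise on a dropped session
            page += 1                         # only advance on success
        except Exception:
            attempts += 1                     # re-login would happen here
    return results
```

The key point is that `page` is only incremented after a successful fetch, which is exactly what the main loop further below does with its `snum` counter.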
I wrote a gl.py to hold the global variables:

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import time

timestr = time.strftime('%Y%m%d', time.localtime(time.time()))
monthstr = time.strftime('%m', time.localtime(time.time()))
yearstr = time.strftime('%Y', time.localtime(time.time()))
log_file = "log/" + timestr + '.log'
csvfilename = "csv/" + timestr + ".csv"
filename = timestr + ".csv"
fmt = '%(asctime)s - %(filename)s:%(lineno)s - %(message)s'
loginurl = "http://yourpath/login.aspx"
producturl = 'http://yourpath/aaa.aspx'
username = 'aaaa'
password = "aaa"
precodeurl = "yourpath"
host = "yourip"
user = "aaa"
passwd = "aaa"
db = "mysql"
charset = "utf8"
postdata = {
    '__viewstate': '',
    '__eventtarget': '',
    '__eventargument': '',
    'btnlogin': "登录",
    'txtuserid': 'aaaa',
    'txtuserpwd': 'aaa',
    'txtcode': '',
    'hfip': 'yourip'
}
tdd = {
    '__viewstate': '',
    '__eventtarget': 'ctl00$contentplaceholder1$aspnetpager1',
    'ctl00$contentplaceholder1$aspnetpager1_input': '1',
    'ctl00$contentplaceholder1$aspnetpager1_pagesize': '50',
    'ctl00$contentplaceholder1$txtyear': '',
    'ctl00$contentplaceholder1$txtmonth': '',
    '__eventargument': '',
}
vs = {
    '__viewstate': ''
}
```
The main script sets up logging, the CSV file, the database connection, and cookies:

```python
handler = logging.handlers.RotatingFileHandler(gl.log_file, maxBytes=1024 * 1024, backupCount=5)
formatter = logging.Formatter(gl.fmt)
handler.setFormatter(formatter)
logger = logging.getLogger('tst')
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
csvfile = codecs.open(gl.csvfilename, 'w+', 'utf_8_sig')
writer = csv.writer(csvfile)
conn = mysql.connector.connect(host=gl.host, user=gl.user, passwd=gl.passwd, db=gl.db, charset=gl.charset)
cursor = conn.cursor()

cookiejar = cookielib.MozillaCookieJar()
cookiesupport = urllib2.HTTPCookieProcessor(cookiejar)
httpshandler = urllib2.HTTPSHandler(debuglevel=0)
opener = urllib2.build_opener(cookiesupport, httpshandler)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')]
urllib2.install_opener(opener)
```
The login method:
First recognize the captcha and convert it to digits, then submit the password, user name, and captcha to the login endpoint. This can fail, because the captcha is sometimes recognized incorrectly; on failure, fetch a new captcha, recognize it, and try to log in again until it succeeds.

```python
def get_logined_data(opener, logger, views):
    print "get_logined_data"
    indexcount = 1
    retdata = None
    while indexcount <= 15:
        print "begin login", str(indexcount), "time"
        logger.info("begin login " + str(indexcount) + " time")
        vrifycodeurl = gl.precodeurl + str(random.random())
        text = get_image(vrifycodeurl)  # helper that takes the captcha url and returns the recognized digits
        postdata = gl.postdata
        postdata["txtcode"] = text
        postdata["__viewstate"] = views

        data = urllib.urlencode(postdata)
        try:
            headers22 = {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Connection': 'keep-alive',
                'Content-Type': 'application/x-www-form-urlencoded',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
            }
            request = urllib2.Request(gl.loginurl, data, headers22)
            opener.open(request)
        except Exception as e:
            print "catch exception when login"
            print e

        request = urllib2.Request(gl.producturl)
        response = opener.open(request)
        datapage = response.read().decode('utf-8')

        bsobj = BeautifulSoup(datapage, 'html.parser')
        tabcontent = bsobj.find(id="tabcontent")  # this element only exists after a successful login, so use it to tell whether we are logged in
        if tabcontent is not None:
            print "login successfully"
            logger.info("login successfully")
            retdata = bsobj
            break
        else:
            print "login failed, try again"
            logger.info("login failed, try again")
            time.sleep(3)
        indexcount += 1
    return retdata
```
Inspecting the page shows that every data request must carry the `__viewstate` parameter, and this parameter is stored in the page itself. So `__viewstate` has to be extracted from each response and passed along as a parameter when requesting the next page.
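A minimal sketch of pulling that hidden field out of a response using only the standard library (the actual code further below uses BeautifulSoup's `find(id="__viewstate")` for the same job; I use the Python 3 spelling of `html.parser` here, and the field id matches the lowercased form used throughout this post):

```python
from html.parser import HTMLParser  # Python 3 spelling; the post itself targets Python 2


class ViewStateParser(HTMLParser):
    """Collects the value attribute of the input whose id is __viewstate."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.viewstate = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('id') == '__viewstate':
            self.viewstate = attrs.get('value')


def extract_viewstate(html):
    parser = ViewStateParser()
    parser.feed(html)
    return parser.viewstate
```

The extracted value is then put back into the POST body of the next page request, which is what keeps the ASP.NET pager working across requests.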
Captcha parsing:
Download the captcha to a local file via its URL. Because the captcha is colored, convert it to grayscale first, then run image recognition to turn it into digits. The captcha is four digits, but recognition sometimes produces letters, so letters are mapped back to digits by hand; after this correction the recognition rate is acceptable.
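The letter-to-digit correction just described can also be written as a small lookup table; this is a sketch equivalent to the if/elif chain in the code that follows (the mapping pairs are taken from that code):

```python
# Letters that tesseract commonly mistakes digits for, mapped back to digits.
CORRECTIONS = {'Z': '2', 'T': '7', 'B': '5', 'S': '8', 's': '8', 'O': '0', 'o': '0'}


def fix_digits(text):
    """Replace commonly-confused letters with their digit counterparts."""
    return ''.join(CORRECTIONS.get(ch, ch) for ch in text)
```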
```python
# Get the digits from the captcha; only a 4-digit result counts as valid
def get_image(codeurl):
    print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + " begin get code num")
    index = 1
    while index <= 15:
        file = urllib2.urlopen(codeurl).read()
        im = cStringIO.StringIO(file)
        img = Image.open(im)
        imgname = "vrifycode/" + gl.timestr + "_" + str(index) + ".png"
        print 'begin get vrifycode'
        text = convert_image(img, imgname)
        print "vrifycode", index, ":", text
        # logger.info('vrifycode' + str(index) + ":" + text)

        if len(text) != 4 or text.isdigit() == False:  # the captcha is always 4 digits, so anything else must be wrong
            print 'vrifycode:', index, ' is wrong'
            index += 1
            time.sleep(2)
            continue
        return text


# Convert the captcha image to digits
def convert_image(image, impname):
    print "enter convert_image"
    image = image.convert('L')  # grayscale
    image2 = Image.new('L', image.size, 255)
    for x in range(image.size[0]):
        for y in range(image.size[1]):
            pix = image.getpixel((x, y))
            if pix < 90:  # pixels with a gray level below 90 become black
                image2.putpixel((x, y), 0)
    print "begin save"
    image2.save(impname)  # save the grayscale image so the result can be inspected
    print "begin convert"
    text = pytesseract.image_to_string(image2)
    print "end convert"
    snum = ""
    for j in text:  # simple letter-to-digit correction
        if j == 'Z':
            snum += "2"
        elif j == 'T':
            snum += "7"
        elif j == 'B':
            snum += "5"
        elif j == 'S':
            snum += "8"
        elif j == 's':
            snum += "8"
        elif j == 'O':
            snum += "0"
        elif j == 'o':
            snum += "0"
        else:
            snum += j
    return snum
```
Data conversion:
Convert the HTML data into an array, for use when saving to the CSV file and the database.

```python
def paras_data(namelist, logger):
    data = []
    mainlist = namelist
    rows = mainlist.find_all("tr", {"class": {"row", "alter"}})
    try:
        if len(rows) != 0:
            for name in rows:
                tds = name.find_all("td")
                if tds is None:
                    print "get tds is null"
                    logger.info("get tds is null")
                else:
                    item = []
                    for index in range(len(tds)):
                        s_span = tds[index].find("span")
                        if s_span is not None:
                            tmp = s_span["title"]
                        else:
                            tmp = tds[index].get_text()
                        item.append(tmp.encode('utf-8'))
                    item.append(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))  # time this row was fetched
                    data.append(tuple(item))
    except Exception as e:
        print "catch exception when save csv", e
        logger.info("catch exception when save csv" + e.message)
    return data
```
Saving the CSV file:

```python
def save_to_csv(data, writer):
    for d in data:
        if d is not None:
            writer.writerow(d)
```
Saving to the database:

```python
def save_to_mysql(data, conn, cursor):
    try:
        cursor.executemany(
            "insert into `aaa`(aaa,bbb) values (%s,%s)",
            data)
        conn.commit()
    except Exception as e:
        print "catch exception when save to mysql", e
    else:
        pass
```
Fetching a given page's data:

```python
def get_appointed_page(snum, opener, vs, logger):
    tdd = get_tdd()
    tdd["__viewstate"] = vs['__viewstate']
    tdd["__eventargument"] = snum
    tdd = urllib.urlencode(tdd)
    op = opener.open(gl.producturl, tdd)
    if op.getcode() != 200:
        print("the " + str(snum) + " page, state not 200, try connect again")
        return None
    data = op.read().decode('utf-8', 'ignore')
    bsobj = BeautifulSoup(data, "lxml")
    namelist = bsobj.find("table", {"class": "mainlist"})
    if namelist is None or len(namelist) == 0:
        return None
    viewstate = bsobj.find(id="__viewstate")
    if viewstate is None:
        logger.info("the other page, no viewstate, try connect again")
        print("the other page, no viewstate, try connect again")
        return None
    vs['__viewstate'] = viewstate["value"]
    return namelist
```
The main routine:

```python
while flag == True and logintime < 50:
    try:
        print "global login the", str(logintime), "times"
        logger.info("global login the " + str(logintime) + " times")
        bsobj = get_logined_data(opener, logger, views)
        if bsobj is None:
            print "tried login 15 times but failed, exit"
            logger.info("tried login 15 times but failed, exit")
            exit()
        else:
            print "global login the", str(logintime), "times successfully!"
            logger.info("global login the " + str(logintime) + " times successfully!")
            viewstate_source = bsobj.find(id="__viewstate")
            if totalnum == -1:
                totalnum = get_totalnum(bsobj)
            print "totalnum:", str(totalnum)
            logger.info("totalnum:" + str(totalnum))
            vs = gl.vs
            if viewstate_source is not None:
                vs['__viewstate'] = viewstate_source["value"]

            # fetch the data of page snum
            while snum <= totalnum:
                print "begin get the", str(snum), "page"
                logger.info("begin get the " + str(snum) + " page")
                namelist = get_appointed_page(snum, opener, vs, logger)
                if namelist is None:
                    print "get the namelist failed, connect again"
                    logger.info("get the namelist failed, connect again")
                    raise Exception
                else:
                    print "get the", str(snum), "page successfully"
                    logger.info("get the " + str(snum) + " page successfully")

                mydata = paras_data(namelist, logger)
                # save to the csv file
                save_to_csv(mydata, writer)
                # save to the database
                save_to_mysql(mydata, conn, cursor)

                snum += 1
                time.sleep(3)

            flag = False
    except Exception as e:
        logintime += 1
        print "catch exception", e
        logger.error("catch exception" + e.message)
```
Setting up the scheduled job:

```shell
cd /var/spool/cron/
crontab -e   # edit the cron table
```

Add the line: `1 1 1 * * /yourpath/normal_script.sh>>/yourpath/cronlog.log 2>&1`
(This entry runs normal_script.sh at 01:01 on the 1st of every month and appends its output to cronlog.log.)
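normal_script.sh is just a wrapper for cron to call; a minimal sketch, assuming the crawler's entry point is called main.py (both the path and the file name are placeholders, not from the original project):

```shell
#!/bin/bash
# Hypothetical wrapper; adjust paths to your deployment.
cd /yourpath        # the crawler expects its log/, csv/ and vrifycode/ folders here
python main.py      # run the crawler
```

The `cd` matters because the script builds relative paths like `log/` and `csv/`, and cron jobs do not start in the project directory.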
Directory structure:
Source download: helloworld.zip