Pitfalls Encountered While Writing a Web Crawler
程序员文章站
2024-02-27 23:59:51
1. When the URL you are crawling contains Chinese characters, the request fails with an encoding error.
# -*- coding: utf-8 -*-
import urllib.request
import re
# import sys
# import codecs
# sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
# fh=open("E:/pycharmprojects/111.txt",'w')
# page=(num-1)*20
for i in range(1, 11):
    # The raw Chinese tag in the URL is what triggers the error shown below
    url = "https://book.douban.com/tag/成长?start=" + str((i - 1) * 20)
    data = urllib.request.urlopen(url).read().decode("utf-8")
    # Capture the value of every title="..." attribute on the page
    pat = 'title="(.*?)"'
    rst = re.compile(pat).findall(data)
    for title in rst:
        print(title)
Error shown:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-10: ordinal not in range(128)
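The error comes from the HTTP layer rather than from printing: Python's http.client encodes the request line as ASCII before sending it, and the two Chinese characters cannot be represented in ASCII. A minimal sketch that reproduces the same message without any network access is shown below; the `request_line` string is an illustrative approximation of what gets sent, not a capture of the actual request.

```python
# Approximation of the request line built from the URL above (assumed for illustration)
request_line = "GET /tag/成长?start=0 HTTP/1.1"

error = None
try:
    # ASCII cannot represent the two Chinese characters at indexes 9 and 10
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

The reported positions 9-10 are the offsets of "成长" in the request line, which matches the traceback above.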
Solution:
A URL that contains Chinese characters must be percent-encoded before the request is sent:
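import urllib.parse
keywd = "成长"
# Percent-encode the non-ASCII tag (UTF-8 by default)
keywd = urllib.parse.quote(keywd)
for i in range(1, 11):
    url = "https://book.douban.com/tag/" + keywd + "?start=" + str((i - 1) * 20)
    # Fetch and parse each page exactly as in the original loop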
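As a self-contained illustration of the fix, the sketch below builds all ten page URLs with the tag percent-encoded. The helper name `build_tag_url` and its `page_size` parameter are hypothetical, introduced only for this example; `quote` and `urlencode` are the standard functions from urllib.parse. No request is made, so it can be run offline to check the URLs.

```python
from urllib.parse import quote, urlencode


def build_tag_url(tag, page, page_size=20):
    """Hypothetical helper: return the Douban tag URL for a 1-based page number."""
    # safe="" also encodes "/" so the tag always stays a single path segment
    encoded_tag = quote(tag, safe="")
    # The start offset follows the (page - 1) * 20 rule used in the loop above
    query = urlencode({"start": (page - 1) * page_size})
    return "https://book.douban.com/tag/" + encoded_tag + "?" + query


urls = [build_tag_url("成长", page) for page in range(1, 11)]
for url in urls:
    print(url)
```

Each resulting URL contains only ASCII characters, so it can be passed to urllib.request.urlopen without raising the UnicodeEncodeError shown earlier.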