Scraping Douban Data with a Python Crawler
This article uses urllib under Python 3.7 to scrape Douban pages.
Only the urllib and re modules are needed; the full implementation is below.
import urllib.request
import re
import ssl

url = "https://read.douban.com/provider/all"

def doubanread(url):
    # Douban uses HTTPS; skip certificate verification so urlopen
    # works even without a local CA bundle
    ssl._create_default_https_context = ssl._create_unverified_context
    data = urllib.request.urlopen(url).read()
    data = data.decode("utf-8")
    # Each provider name sits in a <div class="name">...</div> block
    pat = '<div class="name">(.*?)</div>'
    mydata = re.compile(pat).findall(data)
    return mydata

def writetxt(mydata):
    # Write one publisher name per line; utf-8 keeps Chinese names intact
    with open("test.txt", "w", encoding="utf-8") as fw:
        for name in mydata:
            fw.write(name + "\n")

if __name__ == '__main__':
    datatest = doubanread(url)
    writetxt(datatest)
This first script scrapes the publisher information from the Douban Read providers page and writes every publisher name into a single txt file saved locally.
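As a quick sanity check, the same regular expression can be tried on a small fragment first. This is only an illustration; the snippet below is made up to mimic the shape of the real page:

import re

# Hypothetical fragment in the same shape as the real page
sample = '<div class="name">人民文学出版社</div><div class="name">译林出版社</div>'
print(re.findall('<div class="name">(.*?)</div>', sample))
# -> ['人民文学出版社', '译林出版社']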
Below is another version of the crawler, this time targeting the fiction section of Douban Books. It collects each book's title, author, publisher, publication date, price, and rating.
This version fetches the page HTML with the requests library and parses it with BeautifulSoup. Because it crawls multiple pages, a randomized delay (via the time module) is inserted between requests so the crawler is less likely to be blocked. During the crawl the records are accumulated in a list, and at the end pandas converts that list into a DataFrame and saves it as a CSV file.
One more point worth noting: what you hand to BeautifulSoup should be the response content (html.content), not the response object itself, or parsing will raise an error.
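A minimal sketch of that point (the URL here is only an example):

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4')
# BeautifulSoup(resp, 'lxml') would fail: a Response object is not markup
soup = BeautifulSoup(resp.content, 'lxml')  # pass the raw bytes (resp.text also works)
print(soup.title)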
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
urlorigin = 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start='

def doubanSpider(urlorigin):
    # Crawl 30 list pages (20 books per page), collect every record,
    # then write the whole batch to CSV once at the end
    allbooks = []
    for i in range(0, 30):
        url = urlorigin + str(i * 20) + '&type=T'
        html = requests.get(url=url, headers=headers)
        allbooks.extend(doubanparse(html))
        # Random pause between requests to reduce the risk of being blocked
        time.sleep(3 + random.random())
    doubanwrite(allbooks)

def doubanparse(html):
    booklist = []
    booknames = []
    pubinfos = []
    scores = []
    peoples = []
    # Parse the raw bytes, not the Response object itself
    soup = BeautifulSoup(html.content, 'lxml')
    for name in soup.select('h2 a'):
        booknames.append(name.get_text().strip())
    for pub in soup.select('div .pub'):
        pubinfos.append(pub.get_text().strip())
    for score in soup.select('div .rating_nums'):
        scores.append(score.get_text().strip())
    for people in soup.select('div .pl'):
        peoples.append(people.get_text().strip())
    # Note: a book with no rating is missing from scores, which can shift
    # the alignment of these parallel lists
    for bookname, info, score, people in zip(booknames, pubinfos, scores, peoples):
        # The pub line looks like "author / publisher / date / price";
        # take the first field as author and the last three as the rest
        parts = [p.strip() for p in info.split('/')]
        if len(parts) < 4:
            continue  # skip entries whose pub line lacks one of the four fields
        booklist.append([bookname, parts[0], parts[-3], parts[-2], parts[-1],
                         score, people])
    return booklist

def doubanwrite(dataList):
    fieldnames = ['bookname', 'author', 'publisher', 'date', 'price', 'score', 'numberofpeople']
    data = pd.DataFrame(columns=fieldnames, data=dataList)
    data.to_csv('douban.csv', index=False)

if __name__ == '__main__':
    doubanSpider(urlorigin)
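After a full run, the resulting file can be inspected with pandas to confirm the crawl worked (douban.csv is the file written by the script above):

import pandas as pd

df = pd.read_csv('douban.csv')
print(df.shape)  # up to 30 pages x 20 books, minus skipped entries
print(df.head())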