
Scraping Douban data with a Python crawler


This article uses urllib under Python 3.7 to crawl Douban pages.

The modules used are urllib, re, and ssl; the full implementation is shown below.

import urllib.request
import re
import ssl

url = "https://read.douban.com/provider/all"

def doubanread(url):
    # Skip certificate verification so urlopen works even without a local CA bundle
    ssl._create_default_https_context = ssl._create_unverified_context
    data = urllib.request.urlopen(url).read()
    data = data.decode("utf-8")
    # Publisher names sit inside <div class="name">...</div> blocks
    pat = '<div class="name">(.*?)</div>'
    mydata = re.compile(pat).findall(data)
    return mydata

def writetxt(mydata):
    # Write one publisher name per line; force UTF-8 so Chinese names are saved correctly
    with open("test.txt", "w", encoding="utf-8") as fw:
        for name in mydata:
            fw.write(name + "\n")

if __name__ == '__main__':
    datatest = doubanread(url)
    writetxt(datatest)

This script crawls the publisher information on the Douban Read providers page and writes all publisher names to a local txt file.
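As a quick illustration of what the regular expression captures, here is a tiny self-contained sketch run against a made-up HTML fragment (the two publisher names below are invented sample data, not scraped results):

import re

# Hypothetical fragment of the provider page, for illustration only
sample = '<div class="name">人民文学出版社</div><div class="name">上海译文出版社</div>'

pat = '<div class="name">(.*?)</div>'
print(re.compile(pat).findall(sample))
# ['人民文学出版社', '上海译文出版社']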

Below is another version of the crawler, which scrapes the fiction section of Douban Books, collecting the book title, author, publisher, publication date, price, and rating, among other fields.

This version uses the requests library to fetch the page HTML and BeautifulSoup to parse it. Because it crawls multiple pages, a delay is inserted between requests via the time module (plus a bit of random jitter) to reduce the risk of the crawler being blocked. The scraped records are collected in a list as the crawl proceeds and finally converted with pandas and written out to a CSV file.

One more point to note: the data handed to BeautifulSoup should be the response's content (the raw bytes), otherwise parsing will raise an error.
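A minimal sketch of that point, assuming lxml is installed and using the same book-tag URL as the script below (whether Douban actually returns the page depends on its anti-crawling rules):

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T',
                    headers={'User-Agent': 'Mozilla/5.0'})

# Hand the raw bytes (resp.content) to BeautifulSoup and let it detect the
# encoding itself, rather than passing the Response object directly.
soup = BeautifulSoup(resp.content, 'lxml')
print(soup.title.get_text() if soup.title else 'no <title> found')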

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}

urlorigin = 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start='

def doubanSpider(urlorigin):
    # Crawl 30 list pages (20 books per page)
    for i in range(0, 30):
        url = urlorigin + str(i * 20) + '&type=T'
        html = requests.get(url=url, headers=headers)
        datalist = doubanparse(html)
        doubanwrite(datalist, first_page=(i == 0))
        time.sleep(3 + random.random())  # pause between pages to avoid being blocked

def doubanparse(html):
    dataList = []
    bookname = []
    authors = []
    publishers = []
    dates = []
    prices = []
    scores = []
    numberofpeople = []
    # Parse the raw bytes (response.content) with the lxml parser
    soup = BeautifulSoup(html.content, 'lxml')

    # Book titles
    for name in soup.select('h2 a'):
        bookname.append(name.get_text().strip())

    # The .pub line reads "author / publisher / date / price"
    for pub in soup.select('div .pub'):
        info = pub.get_text().strip()
        authors.append(info.split('/')[0].strip())
        publishers.append(info.split('/')[-3].strip())
        dates.append(info.split('/')[-2].strip())
        prices.append(info.split('/')[-1].strip())

    # Average rating
    for score in soup.select('div .rating_nums'):
        scores.append(score.get_text().strip())

    # Number of people who rated the book
    for peoples in soup.select('div .pl'):
        numberofpeople.append(peoples.get_text().strip())

    for i in range(len(bookname)):
        dataList.append([bookname[i], authors[i], publishers[i], dates[i],
                         prices[i], scores[i], numberofpeople[i]])
    return dataList


def doubanwrite(dataList, first_page):
    fieldnames = ['bookname', 'author', 'publisher', 'date', 'price', 'score', 'numberofpeople']
    data = pd.DataFrame(columns=fieldnames, data=dataList)
    # Write the header only for the first page, then append the remaining pages
    data.to_csv('douban.csv', mode='w' if first_page else 'a',
                header=first_page, index=False, encoding='utf-8-sig')


if __name__ == '__main__':
    doubanSpider(urlorigin)
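After the crawl finishes, the CSV can be read back with pandas for a quick sanity check; a minimal sketch, assuming douban.csv was produced by the script above:

import pandas as pd

df = pd.read_csv('douban.csv')
print(df.head())                                            # first few records
print(df.sort_values('score', ascending=False).head(10))    # ten highest-rated books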