(Crawler) Scraping Toutiao image galleries with BeautifulSoup and regular expressions, in detail!

程序员文章站 (Programmer Article Station) 2022-03-07 19:51:36


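BeautifulSoup extracts the text (the gallery title), and a regular expression matches the key image data.

The results are then stored in a MongoDB database.

Takeaway after finishing: analyzing the web page is really the most critical step.

AJAX analysis, JSON handling, and the like still take more practice.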
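
As a small illustration of the JSON handling: a gallery page embeds its image list as a string passed to `JSON.parse("...")`, so the JSON text is escaped a second time. The sketch below uses only the standard library and a made-up HTML fragment; `sample_html`, the `example.com` URL, and the helper name `extract_gallery` are assumptions for illustration and are not taken from a real page. Rather than stripping backslashes, it first decodes the outer string literal with `json.loads` and then parses the inner JSON, which assumes the literal only uses JSON-compatible escapes.

```python
import json
import re

# Hypothetical page fragment; a real gallery page is far larger.
sample_html = r'''<html><head><title>Demo gallery</title></head>
<body><script>
var gallery = JSON.parse("{\"count\":1,\"sub_images\":[{\"url\":\"http:\\/\\/example.com\\/a.jpg\"}]}"),
    other = 1;
</script></body></html>'''

# Capture the argument of JSON.parse("...") lazily, up to the closing '"),'.
GALLERY_PATTERN = re.compile(r'JSON\.parse\("([\s\S]+?)"\),')


def extract_gallery(html):
    """Return the decoded gallery dict, or None if nothing usable is found."""
    match = GALLERY_PATTERN.search(html)
    if not match:
        return None
    try:
        # Step 1: read the capture as a JSON string literal to undo the outer escaping.
        inner_json = json.loads('"' + match.group(1) + '"')
        # Step 2: parse the resulting JSON text into Python objects.
        return json.loads(inner_json)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return None


gallery = extract_gallery(sample_html)
image_urls = [item['url'] for item in gallery['sub_images']]
print(image_urls)
```

Decoding in two steps keeps any backslash that is part of the data itself, which a blanket backslash removal would delete.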

Here is the code:

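'''
Steps:
1. Fetch the index page: request the target site with requests and return the index data.
2. Parse the result to get the detail-page links (the url of each gallery), then request each detail page with requests.
3. Analyze the detail page to get the image urls, and save them to the MongoDB database.
4. Use a process pool (multiprocessing) to speed up the crawl.
Libraries: beautifulsoup, re, pymongo (MongoDB), requests
'''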

import requests
from urllib.parse import urlencode
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
from hashlib import md5
from multiprocessing import Pool

import re
import os
import json
import pymongo

# Parameters used below, kept as globals here. They could also live in a separate config.py
MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'
GROUP_START = 0
GROUP_END = 10
KEYWORD = '街拍'  # search keyword ("street snap")

# pymongo: create the MongoDB connection used to store the results
mongo_client = pymongo.MongoClient(MONGO_URL)
mongo_db = mongo_client[MONGO_DB]

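'''
Analysis of the site shows that the url of each gallery is loaded via ajax,
while the images of a gallery are stored in a json string embedded in that gallery's html page.
Scraping the images therefore takes two main steps:
    1. Get the url of each gallery by setting the GET request parameters.
    2. Request each gallery url; in the returned html, use BeautifulSoup to get the text (the title),
    then use a regular expression to match the json string and process it to obtain the image urls.
Finally, store the result in the database.
'''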

def get_index_page(offset,keyword):
    # Fetch one page of search results (the index) by setting the GET query parameters
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 1,
        'from': 'search_tab'
    }
    headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
    url = 'https://www.toutiao.com/search_content/?'+urlencode(data)
    try:
        response = requests.get(url,headers=headers)
        if response.status_code == 200:
            # The endpoint answers with json, so return the decoded body directly
            return response.json()
        return None
    except (RequestException, ValueError):
        # ValueError covers a response body that is not valid json
        print('Failed to request the index page')
        return None

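def parse_index_page(data):
    # Parse the index response (a dict decoded from json)
    if data and data.get('data'):
        for item in data['data']:
            # Each article_url points to a gallery page, which is where the images live;
            # items without an article_url are skipped
            url = item.get('article_url')
            if url:
                yield url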

def get_detail_page(url):
    # Fetch the html of a gallery page and return it as text
    # The headers dict could be pulled out and shared as a global
    headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
    try:
        response = requests.get(url,headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        # print('Failed to request the detail page',url)
        return None

def parse_detail_page(html,url):
    # Parse a detail (gallery) page and return its title and image urls
    try:
        soup = BeautifulSoup(html,'lxml')
        title = soup.select('title')[0].get_text()
        # Raw string so the regex escapes are not read as Python string escapes
        image_pattern = re.compile(r'JSON\.parse\("([\s\S]+?)"\),')
        images = re.search(image_pattern,html)
        if images:
            images = images.group(1)
            # Strip the backslashes that escape the embedded json string
            images = re.sub(r'\\','',images)
            data = json.loads(images)
            if data and 'sub_images' in data:
                sub_images = data.get('sub_images')
                img_list = [item.get('url') for item in sub_images]
                # Download the images
                for img_url in img_list:
                    download_image(img_url)
                # Return the collected data as a dict
                return {
                    'title':title,
                    'url':url,
                    'images':img_list,
                }
    except (IndexError, ValueError):
        # No <title> element, or the matched string was not valid json
        return None

def save_2_mongo(result):
    # Save to the database (insert_one replaces Collection.insert, which PyMongo 4 removed)
    if mongo_db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB successfully',result)
        return True
    return False

def download_image(url):
    # Request one image and hand its bytes to save_image
    print('Downloading',url)
    headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
    try:
        response = requests.get(url,headers=headers)
        if response.status_code == 200:
            # Save the image
            save_image(response.content)
        return None
    except RequestException:
        print('Failed to request the image',url)
        return None

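def save_image(content):
    # Save the image bytes to disk
    # Target folder: <script directory>/KEYWORD, e.g. /home/xiaohaozi/进阶之路/爬虫/今日头条街拍/街拍, so the downloads are easy to find
    # The keyword can change; each new keyword gets its own folder
    save_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), KEYWORD)
    # exist_ok avoids an error when several worker processes create the folder at once
    os.makedirs(save_dir, exist_ok=True)
    # Name each file after the md5 hash of its content so identical images are not saved twice
    file_path = os.path.join(save_dir, '{0}.{1}'.format(md5(content).hexdigest(),'jpg'))
    if not os.path.exists(file_path):
        with open(file_path,'wb') as f:
            f.write(content)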

def main(offset):
    # Entry point for one index page: fetch it, then fetch, parse, and store each gallery
    index_html = get_index_page(offset,KEYWORD)
    for url in parse_index_page(index_html):
        detail_html = get_detail_page(url)
        if detail_html:
            result = parse_detail_page(detail_html,url)
            if result:
                save_2_mongo(result)


if __name__ == "__main__":
    # Multiprocessing: run main() for each offset in a process pool
    groups = [i*20 for i in range(GROUP_START,GROUP_END+1)]
    with Pool() as pool:
        pool.map(main,groups)
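
To make the paging concrete, the sketch below rebuilds the offsets that the `__main__` block hands to the pool and the index URL requested for each one, using the same query parameters as `get_index_page`. The helper names `page_offsets` and `build_index_url` are hypothetical and exist only for this illustration. No request is sent, and the sketch does not check whether the endpoint still answers in this form.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Same endpoint string as in get_index_page.
SEARCH_ENDPOINT = 'https://www.toutiao.com/search_content/?'


def page_offsets(group_start, group_end, page_size=20):
    """Offsets for result pages group_start..group_end (inclusive)."""
    return [i * page_size for i in range(group_start, group_end + 1)]


def build_index_url(offset, keyword):
    """Build the index URL for one offset without sending a request."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 1,
        'from': 'search_tab',
    }
    return SEARCH_ENDPOINT + urlencode(params)


offsets = page_offsets(0, 10)
print(len(offsets), offsets[:3], offsets[-1])

url = build_index_url(offsets[1], '街拍')
print(url)
# urlencode percent-encodes the keyword; parse_qs decodes it again.
print(parse_qs(urlsplit(url).query)['keyword'])
```

With `GROUP_START = 0` and `GROUP_END = 10` this gives 11 offsets from 0 to 200, so each worker call to `main` covers one page of up to 20 results.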

Screenshots of the scraped content:

Downloaded images
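
The files in that folder are named after the MD5 hash of their content, which is what keeps identical images from being stored twice. The sketch below isolates that idea with fake bytes and a temporary directory; the helper name `save_unique` and its return value are assumptions made for illustration, not part of the script above.

```python
import hashlib
import os
import tempfile


def save_unique(content, directory, ext='jpg'):
    """Write content under its MD5 name; return (path, written)."""
    os.makedirs(directory, exist_ok=True)
    file_name = '{0}.{1}'.format(hashlib.md5(content).hexdigest(), ext)
    file_path = os.path.join(directory, file_name)
    if os.path.exists(file_path):
        # Same bytes were saved before, so skip the write.
        return file_path, False
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path, True


with tempfile.TemporaryDirectory() as tmp_dir:
    first = save_unique(b'fake image bytes', tmp_dir)
    second = save_unique(b'fake image bytes', tmp_dir)   # duplicate content
    third = save_unique(b'different bytes', tmp_dir)
    print(first[1], second[1], third[1])
    print(len(os.listdir(tmp_dir)))
```

Note that the hash is computed from bytes that have already been downloaded, so this avoids duplicate files on disk rather than duplicate requests.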


Database (I was a bit lazy and took the screenshot straight from the terminal instead of using a GUI tool)


Diligence makes up for a lack of talent.

Keep working hard, xiaohaozi
