Crawling images from Toutiao (今日头条) with Python 3 (requests + regular expressions + BeautifulSoup + Ajax + multiprocessing)
程序员文章站
2022-04-26 10:01:06
1. Prerequisites
For this crawler you need Python 3.6 with the commonly used crawling and parsing libraries, such as requests and BeautifulSoup, plus a running MongoDB instance (a document database, used here to store the results).
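The dependencies can be installed from PyPI; a minimal setup, assuming pip3 is available (beautifulsoup4 is the PyPI name of the bs4 package, and lxml is the parser BeautifulSoup is asked to use below):

```shell
# third-party libraries used by spider.py
pip3 install requests beautifulsoup4 lxml pymongo
```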
2. The code
spider.py
# -*- coding: utf-8 -*-
import json
import re
import os
from urllib.parse import urlencode
from hashlib import md5
from multiprocessing import Pool

import pymongo
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException
from config import *

# MongoDB connection
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]
# Fetch the index (search results) page
def get_page_index(offset, keyword):
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 3  # tab 3 is the image gallery tab
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the index page')
        return None
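get_page_index builds the Ajax query string with urlencode; a quick sketch of what that produces, using the same parameters as above:

```python
from urllib.parse import urlencode

params = {
    'offset': 0,
    'format': 'json',
    'keyword': '重庆小吃',
    'autoload': 'true',
    'count': '20',
    'cur_tab': 3,
}
# non-ASCII values are percent-encoded as UTF-8 bytes
query = urlencode(params)
url = 'https://www.toutiao.com/search_content/?' + query
```

The keyword comes out as `keyword=%E9%87%8D%E5%BA%86%E5%B0%8F%E5%90%83`, which is the UTF-8 percent-encoding of 重庆小吃.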
# Parse the index page and yield the article URLs
def parse_page_index(html):
    data = json.loads(html)
    if data and 'data' in data.keys():
        for item in data.get('data'):
            yield item.get('article_url')
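parse_page_index is a generator over the `data` array of the Ajax JSON response. A sketch with a mocked response (the article URLs here are placeholders, not real Toutiao links):

```python
import json

# mocked Ajax response; the article_url values are placeholders
mock = json.dumps({'data': [{'article_url': 'https://www.toutiao.com/a1'},
                            {'article_url': 'https://www.toutiao.com/a2'}]})

def parse_page_index(html):
    data = json.loads(html)
    if data and 'data' in data.keys():
        for item in data.get('data'):
            yield item.get('article_url')

urls = list(parse_page_index(mock))
# urls == ['https://www.toutiao.com/a1', 'https://www.toutiao.com/a2']
```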
# Fetch the detail page
def get_page_detail(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the detail page', url)
        return None
# Parse the detail page: extract the title and the gallery image URLs
def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    print(title)
    # the gallery data is embedded as an escaped JSON string inside JSON.parse("...")
    images_pattern = re.compile(r'JSON\.parse\("(.*?)"\),', re.S)
    result = re.search(images_pattern, html)
    if result:
        result_url = result.group(1)
        images = re.findall(r'url\\":\\"(.*?)\\"', result_url, re.S)
        for image in images:
            download_image(image)
        return {
            'title': title,
            'url': url,
            'images': images
        }
    return None
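The double backslashes in the pattern can look odd: they match the escape backslashes inside the embedded JSON string, and download_image strips them again afterwards. A sketch on a hypothetical fragment of that escaped string (the image URL is made up):

```python
import re

# hypothetical fragment of the escaped gallery JSON embedded in the page source
fragment = r'{\"url\":\"http:\/\/p3.pstatp.com\/origin\/abc\"}'
images = re.findall(r'url\\":\\"(.*?)\\"', fragment, re.S)
# the matches still carry the JSON escape backslashes, so they must be stripped
cleaned = [re.sub(r'\\', '', u) for u in images]
# cleaned == ['http://p3.pstatp.com/origin/abc']
```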
# Store the result in MongoDB
def save_to_mongo(result):
    # insert() is deprecated in PyMongo 3; use insert_one()
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False
# Download a single image
def download_image(url):
    url = re.sub(r'\\', '', url)  # strip the JSON escape backslashes from the URL
    print('Downloading', url)
    try:
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
        headers = {'User-Agent': user_agent}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            save_image(response.content)
        return None
    except RequestException:
        print('Error requesting the image', url)
        return None
# Save an image, naming the file after the MD5 of its content to avoid duplicates
def save_image(content):
    path = 'F://重庆小吃'
    if not os.path.exists(path):
        os.makedirs(path)
    file_path = '{0}/{1}.{2}'.format(path, md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)
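Naming each file after the MD5 of its bytes means re-downloading the same image can never produce a duplicate on disk. A minimal sketch (the byte string here stands in for downloaded image content):

```python
from hashlib import md5

content = b'hello'  # stand-in for downloaded image bytes
name = '{0}.{1}'.format(md5(content).hexdigest(), 'jpg')
# identical bytes always hash to the same name, so duplicates collapse to one file
assert name == md5(b'hello').hexdigest() + '.jpg'
```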
# Main routine for a single offset
def main(offset):
    html = get_page_index(offset, KEYWORD)
    if not html:
        return
    for url in parse_page_index(html):
        detail_html = get_page_detail(url)
        if detail_html:
            result = parse_page_detail(detail_html, url)
            if result:
                save_to_mongo(result)
# Entry point: crawl the offsets in parallel with a process pool
if __name__ == '__main__':
    groups = [x * 20 for x in range(GROUP_START, GROUP_END)]
    pool = Pool()
    pool.map(main, groups)
    pool.close()
    pool.join()
config.py
# MongoDB connection settings
MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'
# first and last page group to crawl (offset = group * 20)
GROUP_START = 1
GROUP_END = 5
KEYWORD = '重庆小吃'  # search keyword: "Chongqing snacks"
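With GROUP_START = 1 and GROUP_END = 5, the pool requests offsets 20 through 80; note that offset 0, the first page of results, is skipped unless GROUP_START is lowered to 0:

```python
GROUP_START = 1
GROUP_END = 5
# each group of 20 results is one Ajax page; offset 0 is not included here
groups = [x * 20 for x in range(GROUP_START, GROUP_END)]
# groups == [20, 40, 60, 80]; set GROUP_START = 0 to include the first page
```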
3. Results
With the code above I crawled several hundred images for the keyword 重庆小吃 (Chongqing snacks). A screenshot of the results followed in the original post.
That's a wrap.