Crawler Program, Improved

程序员文章站 2022-05-04 11:25:06


import requests
from lxml import etree
import os

urls = []
num = 1


def get_urls(page_num):
    global urls
    headers = {
        'Upgrade-Insecure-Requests':'1',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
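    # Use a distinct loop variable so it is not confused with the global article counter `num`.
    for page in range(1, page_num + 1):
        url = 'https://m.cnbeta.com/wap/index.htm?page=' + str(page)
        try:
            data_list = requests.get(url, headers=headers)
            data_list.encoding = 'utf-8'
            data_html = etree.HTML(data_list.text)
            # Collect the article links from the list page.
            data_urls = data_html.xpath('//div[@id="info_list"]/div[@class="list"]/a//@href')
            # data_title = data_html.xpath('//div[@id="info_list"]/div[@class="list"]/a//text()')
            urls += data_urls
        except Exception:
            # Report the page URL (a str); data_urls is a list and may not exist yet.
            print(url + " : fetch failed...")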
    print(urls)


def write_data(title, content):
    global num
    # Create the output directory if it does not exist yet.
    os.makedirs('wenzhang', exist_ok=True)
    with open('wenzhang/' + 'cnbeta.txt', 'a', encoding='utf-8') as f:
        f.write(' ' + str(num) + ' -->  ' + title + '  <--\n\n')
        num += 1
        f.write(content + '\n\n--------------------------\n--------------------------\n\n\n')


def get_articles(urls):
    headers = {
        'Referer': 'https://m.cnbeta.com/wap/index.htm',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
    for url in urls:
        try:
            new_url = 'https://m.cnbeta.com' + url
            response = requests.get(new_url,headers=headers)
            response.encoding = "utf-8"
            response_html = etree.HTML(response.text)
            title = response_html.xpath('//div[@class="title"]/b//text()')
            print(title)
            content = response_html.xpath('//div[@class="content"]/p//text()')
            content_all = ''
            for content_x in content:
                content_all = content_all + "\n" + content_x
            write_data(title[0], content_all)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            print(url + " : article error...")


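if __name__ == '__main__':
    print('''
This is a crawler program. It crawls the WAP (mobile) pages of www.cxxxa.com.
It collects each article's title and body text. You can choose how many pages to fetch (35 news items per page).

''')
    # Prompt inside the guard so that importing this module does not block on input.
    page_num = int(input("How many pages of data do you want? "))
    get_urls(page_num)
    get_articles(urls)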
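The script builds each article URL by prepending 'https://m.cnbeta.com' to the scraped href, and it appends every href to `urls` without checking for repeats. A minimal sketch of one way to make that step more robust, assuming some hrefs may already be absolute or may appear on more than one list page (the helper name `normalize_links` and the sample hrefs are hypothetical, not part of the original program):

```python
from urllib.parse import urljoin

BASE_URL = 'https://m.cnbeta.com'  # same host the script prepends to each href


def normalize_links(hrefs, base=BASE_URL):
    """Return absolute URLs in first-seen order, with duplicates dropped."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin resolves relative hrefs against base and leaves absolute ones unchanged.
        absolute = urljoin(base, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical hrefs for illustration only; real values come from the XPath query.
sample = ['/wap/view/1.htm', '/wap/view/1.htm', 'https://m.cnbeta.com/wap/view/2.htm']
print(normalize_links(sample))
```

With something like this, get_articles could request each returned URL directly instead of concatenating the host itself; whether the site actually emits absolute or repeated links has not been verified here.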
