Python crawler: scraping information with urllib

程序员文章站 2022-05-04 11:49:38

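I wrote this myself one evening. It was meant to scrape exchange rates, but I did not finish it. Oh well, there is still a long way to go, so I will keep at it.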

import datetime
import re
import urllib.request


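def get_headers():
    '''Define the request headers; the idea is to rotate them between requests.'''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    }
    return headers


def get_ip():
    '''Proxy IPs: rotate addresses between requests (not implemented yet).'''
    pass


def grab_info():
    '''Return the URL to scrape.'''
    url = 'http://fx.cmbchina.com/hq/'
    return url


def get_url_address(url):
    '''Build the request for the URL; pagination could be handled here.
    The response may also be JSON data.
    When a proxy IP is used, the request has to be sent a different way;
    that can be written in advance.
    '''
    headers = get_headers()
    request = urllib.request.Request(url, headers=headers)
    return request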

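# Proxy opener. It is built here but not used yet: to send a request through
# the proxy, call opener.open(request) instead of urllib.request.urlopen(request).
proxy_handler = urllib.request.ProxyHandler({'http': '120.32.208.16:8118'})
opener = urllib.request.build_opener(proxy_handler)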


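def get_html(request):
    '''Send the request and return the decoded response body.'''
    with urllib.request.urlopen(request) as response:
        html = response.read().decode('utf-8')
    return html


def handle_data(html):
    '''Parse the page with XPath or re (not implemented yet).'''
    need_data = dict()  # placeholder for the parsed results

    # For now, just print the raw HTML.
    print(html)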


def need_info():
    '''Store the data, e.g. in a spreadsheet, so it can be read back later.'''
    pass


def main():
    '''Main program; an infinite loop could be added to scrape continuously.'''
    url = grab_info()
    request = get_url_address(url=url)
    html = get_html(request=request)
    handle_data(html=html)


if __name__ == '__main__':
    start_time = datetime.datetime.now()
    main()
    end_time = datetime.datetime.now()
    print('Crawl time: {time}'.format(time=end_time - start_time))
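
The script leaves handle_data and need_info as stubs. As a rough sketch of how those two steps could look, the block below pulls table rows out of HTML with re and writes them out as CSV. The names parse_rows and save_rows are hypothetical, and SAMPLE_HTML is made-up placeholder markup with made-up values: the actual structure of the scraped page has not been checked, so the patterns would almost certainly need adjusting.

```python
import csv
import io
import re

# Hypothetical sample markup with placeholder values (not real rates).
# The real page's structure is an assumption and has not been verified.
SAMPLE_HTML = '''
<table>
  <tr><td>USD</td><td>100</td><td>712.50</td></tr>
  <tr><td>EUR</td><td>100</td><td>775.10</td></tr>
</table>
'''

ROW_RE = re.compile(r'<tr\b[^>]*>(.*?)</tr>', re.S | re.I)
CELL_RE = re.compile(r'<td\b[^>]*>(.*?)</td>', re.S | re.I)
TAG_RE = re.compile(r'<[^>]+>')


def parse_rows(html):
    '''Return a list of rows, each a list of cell texts.'''
    rows = []
    for row_html in ROW_RE.findall(html):
        # Strip any nested tags and surrounding whitespace from each cell.
        cells = [TAG_RE.sub('', cell).strip() for cell in CELL_RE.findall(row_html)]
        if cells:  # skip rows that contain no <td> cells (e.g. header rows)
            rows.append(cells)
    return rows


def save_rows(rows, fileobj):
    '''Write the rows as CSV to an already opened text file object.'''
    csv.writer(fileobj).writerows(rows)


if __name__ == '__main__':
    buffer = io.StringIO()
    save_rows(parse_rows(SAMPLE_HTML), buffer)
    print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, a file opened with open('rates.csv', 'w', newline='') can be passed to save_rows (the file name is only an example). Regular expressions are fragile on real-world HTML, so an XPath-based parser may be the better choice once the page structure is known.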

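The docstrings of get_headers and get_ip talk about rotating request headers and proxy addresses, but the script never does either, and the opener it builds is never used. A minimal sketch of one way to do it with urllib alone follows. The helper names (pick_headers, build_proxy_opener, fetch_via_proxy) are hypothetical, the second User-Agent string is a placeholder, and the proxy address is the one from the script, which may well no longer be reachable.

```python
import random
import urllib.request

# Placeholder pool of User-Agent strings; the first is from the script above,
# the second is only an illustrative variant.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
]

# Proxy address taken from the script above; assumed, not verified to work.
PROXIES = ['120.32.208.16:8118']


def pick_headers():
    '''Return request headers with a randomly chosen User-Agent.'''
    return {'User-Agent': random.choice(USER_AGENTS)}


def build_proxy_opener(proxy_address):
    '''Build an opener that routes http requests through the given proxy.'''
    handler = urllib.request.ProxyHandler({'http': proxy_address})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, timeout=10):
    '''Fetch a URL through a randomly chosen proxy and return the decoded body.'''
    request = urllib.request.Request(url, headers=pick_headers())
    opener = build_proxy_opener(random.choice(PROXIES))
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode('utf-8')
```

The key difference from the script is that the request goes through opener.open rather than urllib.request.urlopen, since urlopen does not know about an opener unless it has been installed with urllib.request.install_opener.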