
Baidu PaddlePaddle "From Python Novice to Pro": Closing Reflections


I'm very glad to have taken Baidu PaddlePaddle's "From Python Novice to Pro" course. It starts from Python itself and truly assumes zero background; the instructors taught step by step with great clarity, and the curriculum is well layered and clearly structured. I gained a lot, which I can summarize in the following points.

I. Web Scraping

  1. Task: crawl the contestant pictures for 《青春有你2》, print the absolute path of every picture crawled, and print the total number of pictures.
  2. Analysis: a scraper imitates a browser: it sends a request to the target site, receives the server's response, extracts the information it needs, and saves it.
    The pipeline: send the request (the requests module); receive the response data (returned by the server); parse and extract the data (BeautifulSoup lookups or re regular expressions); save the data.
    The requests module: requests is a simple, easy-to-use HTTP library implemented in Python (docs: http://cn.python-requests.org/zh_CN/latest/); requests.get(url) sends an HTTP GET request and returns the server's response.
    The BeautifulSoup library: BeautifulSoup is a Python library for extracting data from HTML or XML files. It supports the HTML parser in Python's standard library as well as third-party parsers such as lxml.
  3. Practice:
    ① Crawl the contestant information for 《青春有你2》 from Baidu Baike and return the page data
import json
import re
import requests
import datetime
from bs4 import BeautifulSoup
import os

today = datetime.date.today().strftime('%Y%m%d')

def crawl_wiki_data():
    """
    Crawl the contestant information for 《青春有你2》 from Baidu Baike
    and return the table of contestants.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    url = 'https://baike.baidu.com/item/青春有你第二季'
    try:
        response = requests.get(url, headers=headers)
        print(response.status_code)
        soup = BeautifulSoup(response.text, 'lxml')
        # The roster is a table whose preceding <div> holds an <h3> section title
        tables = soup.find_all('table', {'class': 'table-view log-set-param'})
        crawl_table_title = "参赛学员"
        for table in tables:
            table_titles = table.find_previous('div').find_all('h3')
            for title in table_titles:
                if crawl_table_title in title.text:
                    return table
    except Exception as e:
        print(e)

② Parse the crawled page data and save it as a JSON file
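
The parsing step itself is a short function. A minimal sketch, assuming each row of the Baike table carries the contestant's name and page link in an <a> tag, and writing only the two fields that crawl_pic_urls below reads:

def parse_wiki_data(table_html):
    '''
    Turn the contestant table into a list of {name, link} records
    and save it as work/<date>.json for the next step.
    '''
    bs = BeautifulSoup(str(table_html), 'lxml')
    stars = []
    for tr in bs.find_all('tr')[1:]:        # skip the header row
        link_tag = tr.find('a')             # assumed: first <a> is the contestant link
        if link_tag is None:
            continue
        stars.append({
            'name': link_tag.text,
            'link': 'https://baike.baidu.com' + link_tag.get('href')
        })
    with open('work/' + today + '.json', 'w', encoding='UTF-8') as f:
        json.dump(stars, f, ensure_ascii=False)

③ Collect each contestant's Baidu Baike picture links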

def crawl_pic_urls():
    '''
    Collect the Baidu Baike picture URLs for every contestant and download them.
    '''
    with open('work/' + today + '.json', 'r', encoding='UTF-8') as file:
        json_array = json.loads(file.read())
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    for star in json_array:
        name = star['name']
        link = star['link']
        # Open the contestant's Baike page and follow the summary picture to the photo album
        response = requests.get(link, headers=headers)
        bs = BeautifulSoup(response.text, 'lxml')
        pic_list_url = bs.select('.summary-pic a')[0].get('href')
        pic_list_url = 'https://baike.baidu.com' + pic_list_url
        # Collect every picture URL in the album
        pic_list_response = requests.get(pic_list_url, headers=headers)
        bs = BeautifulSoup(pic_list_response.text, 'lxml')
        pic_list_html = bs.select('.pic-list img')
        pic_urls = []
        for pic_html in pic_list_html:
            pic_urls.append(pic_html.get('src'))
        down_pic(name, pic_urls)

④ Download every picture in the URL list and save it

def down_pic(name, pic_urls):
    '''
    Download every picture in pic_urls into a folder named after the contestant.
    '''
    path = 'work/' + 'pics/' + name + '/'
    if not os.path.exists(path):
        os.makedirs(path)
    for i, pic_url in enumerate(pic_urls):
        try:
            pic = requests.get(pic_url, timeout=15)
            string = str(i + 1) + '.jpg'
            with open(path + string, 'wb') as f:
                f.write(pic.content)
                print('Downloaded picture %s: %s' % (str(i + 1), str(pic_url)))
        except Exception as e:
            print('Failed to download picture %s: %s' % (str(i + 1), str(pic_url)))
            print(e)
            continue
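
The task also asks for the absolute path of every crawled picture and the total count. A minimal driver along these lines ties the four steps together (the directory walk at the end is my addition for the print-out, not from the course code):

if __name__ == '__main__':
    html = crawl_wiki_data()    # ① fetch the contestant table
    parse_wiki_data(html)       # ② save names and links as JSON
    crawl_pic_urls()            # ③④ collect picture links and download them
    # Print every downloaded picture's absolute path, then the total
    pic_num = 0
    for root, dirs, files in os.walk('work/pics/'):
        for pic_file in files:
            print(os.path.abspath(os.path.join(root, pic_file)))
            pic_num += 1
    print('Total pictures crawled: %d' % pic_num)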

II. Visualization

  1. Task: visualize the weight distribution of the 《青春有你2》 contestants as a pie chart.
  2. Analysis: this is mainly an exercise in Python's data libraries: pandas to load and bin the data, matplotlib to plot it.
  3. Practice:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager

%matplotlib inline
# Load the contestant data and strip the trailing 'kg' from each weight
df = pd.read_json('data/data31557/20200422.json')
weights = df['weight']
arrs = weights.values
for i in range(len(arrs)):
    arrs[i] = float(arrs[i][0:-2])
arrs = arrs.astype(float)
# Bucket the weights into four ranges and count contestants per bucket
bins = [0, 45, 50, 55, 100]
se1 = pd.cut(arrs, bins)
sizes = pd.value_counts(se1, sort=False).values
labels = '<=45kg', '45~50kg', '50~55kg', '>55kg'
explode = (0, 0.05, 0, 0)   # pull the second slice out slightly
colors = ['lightskyblue', 'yellow', 'yellowgreen', 'pink']
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, colors=colors, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90, pctdistance=0.6)
ax1.axis('equal')   # equal aspect ratio keeps the pie circular
plt.legend(bbox_to_anchor=(0.2, 0.2))
# A Chinese-capable font must be configured for the title to render correctly
plt.title('《青春有你2》参赛选手体重分布', fontsize=24)
plt.savefig('/home/aistudio/work/result/pie_result02.jpg')
plt.show()

III. PaddleHub

  1. Task: the PaddleHub homework for 《青春有你2》: recognize which of the five contestants appears in a photo.
  2. Analysis: PaddleHub is PaddlePaddle's tool for managing pretrained models and doing transfer learning. With it, developers can pair high-quality pretrained models with the Fine-tune API and quickly cover the whole pipeline from transfer learning to deployment. It ships pretrained models from across the PaddlePaddle ecosystem, covering image classification, object detection, lexical analysis, semantic models, sentiment analysis, video classification, image generation, image segmentation, text moderation, keypoint detection, and other mainstream tasks. For the full list see the official site: https://www.paddlepaddle.org.cn/hub
  3. Practice:
    ① Install paddlehub
!pip install paddlehub==1.6.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
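
Once installed, a pretrained module from the model zoo can also be loaded and run directly, without any fine-tuning. A quick sanity check, using the lac lexical-analysis module as an example (my choice for illustration, not part of the homework):

import paddlehub as hub

# Load the pretrained LAC model for Chinese lexical analysis
lac = hub.Module(name="lac")
# Ready-made modules predict out of the box
results = lac.lexical_analysis(texts=["百度飞桨是一个深度学习平台"])
print(results)   # [{'word': [...], 'tag': [...]}]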

② Preparation

import paddlehub as hub
# Load the pretrained model
module = hub.Module(name="resnet_v2_50_imagenet")
# Prepare the dataset
from paddlehub.dataset.base_cv_dataset import BaseCVDataset
class DemoDataset(BaseCVDataset):
    def __init__(self):
        self.dataset_dir = "dataset"
        super(DemoDataset, self).__init__(
            base_path=self.dataset_dir,
            train_list_file="train_list.txt",
            validate_list_file="validate_list.txt",
            test_list_file="test_list.txt",
            label_list_file="label_list.txt",
            )
dataset = DemoDataset()
# Build the data reader
data_reader = hub.reader.ImageClassificationReader(
    image_width=module.get_expected_image_width(),
    image_height=module.get_expected_image_height(),
    images_mean=module.get_pretrained_images_mean(),
    images_std=module.get_pretrained_images_std(),
    dataset=dataset)
# Configure the fine-tune run
config = hub.RunConfig(
    use_cuda=False,                               # no GPU here; set True if one is available
    num_epoch=3,                                  # number of fine-tune epochs
    checkpoint_dir="cv_finetune_turtorial_demo",  # where checkpoints are saved
    batch_size=10,
    eval_interval=10,                             # evaluate every 10 steps
    strategy=hub.finetune.strategy.DefaultFinetuneStrategy())
# Assemble the fine-tune task
input_dict, output_dict, program = module.context(trainable=True)
img = input_dict["image"]
feature_map = output_dict["feature_map"]
feed_list = [img.name]
task = hub.ImageClassifierTask(
    data_reader=data_reader,
    feed_list=feed_list,
    feature=feature_map,
    num_classes=dataset.num_labels,
    config=config)
# Start fine-tuning
run_states = task.finetune_and_eval()

③ Prediction

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# temp.txt lists the five test images, one "path label" pair per line
with open("dataset/temp.txt", "r") as f:
    filepath = f.readlines()

data = [filepath[i].split(" ")[0] for i in range(5)]
label_map = dataset.label_dict()
index = 0
run_states = task.predict(data=data)
results = [run_state.run_results for run_state in run_states]
for batch_result in results:
    print(batch_result)
    # Pick the highest-scoring class for each image in the batch
    batch_result = np.argmax(batch_result, axis=2)[0]
    for result in batch_result:
        index += 1
        result = label_map[result]
        print("input %i is %s, and the predict result is %s" %
              (index, data[index - 1], result))

IV. Word Frequency, Word Cloud, and Comment Moderation

  1. Task: crawl the 爱奇艺 comments on 《青春有你2》, compute word frequencies, draw a word cloud, and run content moderation over the comments.
  2. Analysis: jieba is needed for Chinese word segmentation; wordcloud for drawing the word cloud; a Chinese font for the visualizations; a Chinese stopword list from a public source; a supplementary word list built from your own segmentation results; a background image for the word cloud; and PaddleHub set up for the moderation step.
  3. Practice: the main routine
if __name__ == "__main__":
    num = 110
    lastID = '0'
    arr = []
    with open('aqy.txt','a',encoding='utf-8') as f:
        for i in range(num):
            lastID = saveMovieInfoToFile(lastID,arr)
            #print(i)
            time.sleep(0.5)
        for item in arr:
            Item = clear_special_char(item)
            if Item.strip()!='':
                try:
                    f.write(Item+'\n')
                except Exception as e:
                    print("含有特殊字符")
    print('共爬取评论:',len(arr))
    f = open('aqy.txt','r',encoding='utf-8')
    counts = {}
    for line in f:
        words = fenci(line)
        stopwords = stopwordslist('cn_stopwords.txt')
        movestopwords(words,stopwords,counts)
    # 加载模型
    humanseg = hub.Module(name = 'deeplabv3p_xception65_humanseg')
    # 抠图
    results = humanseg.segmentation(data = {"image":['cloud2.png']})
    for result in results:
        print(result['origin'])
        print(result['processed'])
    drawcounts(counts,10)
    drawcloud(counts)
    f.close()
    file_path = 'aqy.txt'
    test_text = []
    text_detection(test_text,file_path)
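
The helpers called above live in earlier notebook cells. As one illustration of the segmentation, stop-word filtering, and word-cloud steps, here is a minimal jieba/wordcloud sketch under those same names (my reconstruction, not the course code; the font path is an assumption):

import jieba
from wordcloud import WordCloud

def fenci(text):
    # Cut a line of comment text into words with jieba
    return jieba.lcut(text)

def stopwordslist(file_path):
    # One stop word per line in a plain-text file
    with open(file_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

def movestopwords(words, stopwords, counts):
    # Accumulate frequencies, skipping stop words and whitespace tokens
    for word in words:
        if word not in stopwords and len(word.strip()) > 0:
            counts[word] = counts.get(word, 0) + 1

def drawcloud(counts):
    # Render the frequency dict as a word cloud; a Chinese font file is required
    wc = WordCloud(font_path='simhei.ttf', background_color='white',
                   max_words=100, width=800, height=600)
    wc.generate_from_frequencies(counts)
    wc.to_file('pic_word.png')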

V. EasyDL

You only need to collect a small amount of task-related data and label it directly on the platform; the system then picks a suitable model and hyperparameters and trains it for you, and the trained model can be deployed straight to a cloud API or packaged as an installer.
Features:
Train a model with zero algorithm work: no machine-learning expertise required; upload and label example data and train with one click.
Validate the model: review a detailed evaluation report, check predictions in a visual interface, and add targeted training data where the model is weak.
Deploy the model: once satisfied, deploy it to the cloud, to devices, or to a private server, or buy an integrated software-and-hardware package.

Summary

PaddlePaddle is an open-source, fully featured deep-learning platform, and the more I explored it the more I liked it. In particular, much of the work can be run directly on the Baidu PaddlePaddle platform in the browser, which was a real surprise for someone stuck at home without a capable computer. I have fallen for this platform.
