
Baidu PaddlePaddle "From Python Novice to Pro": Closing Reflections


I'm very glad to have taken Baidu PaddlePaddle's "From Python Novice to Pro" course. It starts from Python itself and truly assumes zero background; the instructors taught step by step with great clarity, and the curriculum is well layered and clearly structured. I gained a lot, which I can summarize in the following points.

I. Web Scraping

  1. Task: crawl the contestant pictures for 《青春有你2》, print the absolute path of every picture crawled, and print the total number of pictures.
  2. Analysis: a scraper imitates a browser: it sends a request to the target site, receives the server's response, extracts the information it needs, and saves it.
    The pipeline: send the request (the requests module); receive the response data (returned by the server); parse and extract the data (BeautifulSoup lookups or re regular expressions); save the data.
    The requests module: requests is a simple, easy-to-use HTTP library implemented in Python (docs: http://cn.python-requests.org/zh_CN/latest/); requests.get(url) sends an HTTP GET request and returns the server's response.
    The BeautifulSoup library: BeautifulSoup is a Python library for extracting data from HTML or XML files. It supports the HTML parser in Python's standard library as well as third-party parsers such as lxml.
  3. Practice:
    ① Crawl the contestant information for 《青春有你2》 from Baidu Baike and return the page data
import json
import re
import requests
import datetime
from bs4 import BeautifulSoup
import os

today = datetime.date.today().strftime('%Y%m%d')

def crawl_wiki_data():
    """
    Crawl the contestant information for 《青春有你2》 from Baidu Baike
    and return the table of contestants.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    url = 'https://baike.baidu.com/item/青春有你第二季'
    try:
        response = requests.get(url, headers=headers)
        print(response.status_code)
        soup = BeautifulSoup(response.text, 'lxml')
        # The roster is a table whose preceding <div> holds an <h3> section title
        tables = soup.find_all('table', {'class': 'table-view log-set-param'})
        crawl_table_title = "参赛学员"
        for table in tables:
            table_titles = table.find_previous('div').find_all('h3')
            for title in table_titles:
                if crawl_table_title in title.text:
                    return table
    except Exception as e:
        print(e)

② Parse the crawled page data and save it as a JSON file
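
The parsing step itself is a short function. A minimal sketch, assuming each row of the Baike table carries the contestant's name and page link in an <a> tag, and writing only the two fields that crawl_pic_urls below reads:

def parse_wiki_data(table_html):
    '''
    Turn the contestant table into a list of {name, link} records
    and save it as work/<date>.json for the next step.
    '''
    bs = BeautifulSoup(str(table_html), 'lxml')
    stars = []
    for tr in bs.find_all('tr')[1:]:        # skip the header row
        link_tag = tr.find('a')             # assumed: first <a> is the contestant link
        if link_tag is None:
            continue
        stars.append({
            'name': link_tag.text,
            'link': 'https://baike.baidu.com' + link_tag.get('href')
        })
    with open('work/' + today + '.json', 'w', encoding='UTF-8') as f:
        json.dump(stars, f, ensure_ascii=False)

③ Collect each contestant's Baidu Baike picture links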

def crawl_pic_urls():
    '''
    Collect the Baidu Baike picture URLs for every contestant and download them.
    '''
    with open('work/' + today + '.json', 'r', encoding='UTF-8') as file:
        json_array = json.loads(file.read())
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    for star in json_array:
        name = star['name']
        link = star['link']
        # Open the contestant's Baike page and follow the summary picture to the photo album
        response = requests.get(link, headers=headers)
        bs = BeautifulSoup(response.text, 'lxml')
        pic_list_url = bs.select('.summary-pic a')[0].get('href')
        pic_list_url = 'https://baike.baidu.com' + pic_list_url
        # Collect every picture URL in the album
        pic_list_response = requests.get(pic_list_url, headers=headers)
        bs = BeautifulSoup(pic_list_response.text, 'lxml')
        pic_list_html = bs.select('.pic-list img')
        pic_urls = []
        for pic_html in pic_list_html:
            pic_urls.append(pic_html.get('src'))
        down_pic(name, pic_urls)

④ Download every picture in the URL list and save it

def down_pic(name, pic_urls):
    '''
    Download every picture in pic_urls into a folder named after the contestant.
    '''
    path = 'work/' + 'pics/' + name + '/'
    if not os.path.exists(path):
        os.makedirs(path)
    for i, pic_url in enumerate(pic_urls):
        try:
            pic = requests.get(pic_url, timeout=15)
            string = str(i + 1) + '.jpg'
            with open(path + string, 'wb') as f:
                f.write(pic.content)
                print('Downloaded picture %s: %s' % (str(i + 1), str(pic_url)))
        except Exception as e:
            print('Failed to download picture %s: %s' % (str(i + 1), str(pic_url)))
            print(e)
            continue
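
The task also asks for the absolute path of every crawled picture and the total count. A minimal driver along these lines ties the four steps together (the directory walk at the end is my addition for the print-out, not from the course code):

if __name__ == '__main__':
    html = crawl_wiki_data()    # ① fetch the contestant table
    parse_wiki_data(html)       # ② save names and links as JSON
    crawl_pic_urls()            # ③④ collect picture links and download them
    # Print every downloaded picture's absolute path, then the total
    pic_num = 0
    for root, dirs, files in os.walk('work/pics/'):
        for pic_file in files:
            print(os.path.abspath(os.path.join(root, pic_file)))
            pic_num += 1
    print('Total pictures crawled: %d' % pic_num)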

II. Visualization

  1. Task: visualize the weight distribution of the 《青春有你2》 contestants as a pie chart.
  2. Analysis: this is mainly an exercise in Python's data libraries: pandas to load and bin the data, matplotlib to plot it.
  3. Practice:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager

%matplotlib inline
# Load the contestant data and strip the trailing 'kg' from each weight
df = pd.read_json('data/data31557/20200422.json')
weights = df['weight']
arrs = weights.values
for i in range(len(arrs)):
    arrs[i] = float(arrs[i][0:-2])
arrs = arrs.astype(float)
# Bucket the weights into four ranges and count contestants per bucket
bins = [0, 45, 50, 55, 100]
se1 = pd.cut(arrs, bins)
sizes = pd.value_counts(se1, sort=False).values
labels = '<=45kg', '45~50kg', '50~55kg', '>55kg'
explode = (0, 0.05, 0, 0)   # pull the second slice out slightly
colors = ['lightskyblue', 'yellow', 'yellowgreen', 'pink']
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, colors=colors, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90, pctdistance=0.6)
ax1.axis('equal')   # equal aspect ratio keeps the pie circular
plt.legend(bbox_to_anchor=(0.2, 0.2))
# A Chinese-capable font must be configured for the title to render correctly
plt.title('《青春有你2》参赛选手体重分布', fontsize=24)
plt.savefig('/home/aistudio/work/result/pie_result02.jpg')
plt.show()

III. PaddleHub

  1. Task: the PaddleHub homework for 《青春有你2》: recognize which of the five contestants appears in a photo.
  2. Analysis: PaddleHub is PaddlePaddle's tool for managing pretrained models and doing transfer learning. With it, developers can pair high-quality pretrained models with the Fine-tune API and quickly cover the whole pipeline from transfer learning to deployment. It ships pretrained models from across the PaddlePaddle ecosystem, covering image classification, object detection, lexical analysis, semantic models, sentiment analysis, video classification, image generation, image segmentation, text moderation, keypoint detection, and other mainstream tasks. For the full list see the official site: https://www.paddlepaddle.org.cn/hub
  3. Practice:
    ① Install paddlehub
!pip install paddlehub==1.6.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
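
Once installed, a pretrained module from the model zoo can also be loaded and run directly, without any fine-tuning. A quick sanity check, using the lac lexical-analysis module as an example (my choice for illustration, not part of the homework):

import paddlehub as hub

# Load the pretrained LAC model for Chinese lexical analysis
lac = hub.Module(name="lac")
# Ready-made modules predict out of the box
results = lac.lexical_analysis(texts=["百度飞桨是一个深度学习平台"])
print(results)   # [{'word': [...], 'tag': [...]}]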

② Preparation

import paddlehub as hub
# Load the pretrained model
module = hub.Module(name="resnet_v2_50_imagenet")
# Prepare the dataset
from paddlehub.dataset.base_cv_dataset import BaseCVDataset
class DemoDataset(BaseCVDataset):
    def __init__(self):
        self.dataset_dir = "dataset"
        super(DemoDataset, self).__init__(
            base_path=self.dataset_dir,
            train_list_file="train_list.txt",
            validate_list_file="validate_list.txt",
            test_list_file="test_list.txt",
            label_list_file="label_list.txt",
            )
dataset = DemoDataset()
# Build the data reader
data_reader = hub.reader.ImageClassificationReader(
    image_width=module.get_expected_image_width(),
    image_height=module.get_expected_image_height(),
    images_mean=module.get_pretrained_images_mean(),
    images_std=module.get_pretrained_images_std(),
    dataset=dataset)
# Configure the fine-tune run
config = hub.RunConfig(
    use_cuda=False,                               # no GPU here; set True if one is available
    num_epoch=3,                                  # number of fine-tune epochs
    checkpoint_dir="cv_finetune_turtorial_demo",  # where checkpoints are saved
    batch_size=10,
    eval_interval=10,                             # evaluate every 10 steps
    strategy=hub.finetune.strategy.DefaultFinetuneStrategy())
# Assemble the fine-tune task
input_dict, output_dict, program = module.context(trainable=True)
img = input_dict["image"]
feature_map = output_dict["feature_map"]
feed_list = [img.name]
task = hub.ImageClassifierTask(
    data_reader=data_reader,
    feed_list=feed_list,
    feature=feature_map,
    num_classes=dataset.num_labels,
    config=config)
# Start fine-tuning
run_states = task.finetune_and_eval()

③ Prediction

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# temp.txt lists the five test images, one "path label" pair per line
with open("dataset/temp.txt", "r") as f:
    filepath = f.readlines()

data = [filepath[i].split(" ")[0] for i in range(5)]
label_map = dataset.label_dict()
index = 0
run_states = task.predict(data=data)
results = [run_state.run_results for run_state in run_states]
for batch_result in results:
    print(batch_result)
    # Pick the highest-scoring class for each image in the batch
    batch_result = np.argmax(batch_result, axis=2)[0]
    for result in batch_result:
        index += 1
        result = label_map[result]
        print("input %i is %s, and the predict result is %s" %
              (index, data[index - 1], result))

IV. Word Frequency, Word Cloud, and Comment Moderation

  1. Task: crawl the 爱奇艺 comments on 《青春有你2》, compute word frequencies, draw a word cloud, and run content moderation over the comments.
  2. Analysis: jieba is needed for Chinese word segmentation; wordcloud for drawing the word cloud; a Chinese font for the visualizations; a Chinese stopword list from a public source; a supplementary word list built from your own segmentation results; a background image for the word cloud; and PaddleHub set up for the moderation step.
  3. Practice: the main routine
if __name__ == "__main__":
    num = 110
    lastID = '0'
    arr = []
    with open('aqy.txt','a',encoding='utf-8') as f:
        for i in range(num):
            lastID = saveMovieInfoToFile(lastID,arr)
            #print(i)
            time.sleep(0.5)
        for item in arr:
            Item = clear_special_char(item)
            if Item.strip()!='':
                try:
                    f.write(Item+'\n')
                except Exception as e:
                    print("含有特殊字符")
    print('共爬取评论:',len(arr))
    f = open('aqy.txt','r',encoding='utf-8')
    counts = {}
    for line in f:
        words = fenci(line)
        stopwords = stopwordslist('cn_stopwords.txt')
        movestopwords(words,stopwords,counts)
    # 加载模型
    humanseg = hub.Module(name = 'deeplabv3p_xception65_humanseg')
    # 抠图
    results = humanseg.segmentation(data = {"image":['cloud2.png']})
    for result in results:
        print(result['origin'])
        print(result['processed'])
    drawcounts(counts,10)
    drawcloud(counts)
    f.close()
    file_path = 'aqy.txt'
    test_text = []
    text_detection(test_text,file_path)
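
The helpers called above live in earlier notebook cells. As one illustration of the segmentation, stop-word filtering, and word-cloud steps, here is a minimal jieba/wordcloud sketch under those same names (my reconstruction, not the course code; the font path is an assumption):

import jieba
from wordcloud import WordCloud

def fenci(text):
    # Cut a line of comment text into words with jieba
    return jieba.lcut(text)

def stopwordslist(file_path):
    # One stop word per line in a plain-text file
    with open(file_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

def movestopwords(words, stopwords, counts):
    # Accumulate frequencies, skipping stop words and whitespace tokens
    for word in words:
        if word not in stopwords and len(word.strip()) > 0:
            counts[word] = counts.get(word, 0) + 1

def drawcloud(counts):
    # Render the frequency dict as a word cloud; a Chinese font file is required
    wc = WordCloud(font_path='simhei.ttf', background_color='white',
                   max_words=100, width=800, height=600)
    wc.generate_from_frequencies(counts)
    wc.to_file('pic_word.png')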

V. EasyDL

You only need to collect a small amount of task-related data and label it directly on the platform; the system then picks a suitable model and hyperparameters and trains it for you, and the trained model can be deployed straight to a cloud API or packaged as an installer.
Features:
Train a model with zero algorithm work: no machine-learning expertise required; upload and label example data and train with one click.
Validate the model: review a detailed evaluation report, check predictions in a visual interface, and add targeted training data where the model is weak.
Deploy the model: once satisfied, deploy it to the cloud, to devices, or to a private server, or buy an integrated software-and-hardware package.

Summary

PaddlePaddle is an open-source, fully featured deep-learning platform, and the more I explored it the more I liked it. In particular, much of the work can be run directly on the Baidu PaddlePaddle platform in the browser, which was a real surprise for someone stuck at home without a capable computer. I have fallen for this platform.
