Baidu PaddlePaddle "Python 小白逆袭大神" (From Python Novice to Master): Course Completion Reflections
I was very happy to take Baidu PaddlePaddle's "Python 小白逆袭大神" course. It starts from Python itself and assumes absolutely zero background; the instructors moved from basics to advanced material very clearly, and the curriculum was well layered and well structured. I learned a great deal, and the main takeaways can be summarized as follows.
1. Web Scraping
- Task: crawl the contestant photos for 《青春有你2》, print the absolute path of every downloaded image, and print the total number of images crawled.
- Analysis: crawling means imitating a browser's behavior: send a request to the target site, receive the server's response, extract the information you need, and save it.
The crawling pipeline: send a request (the requests module); receive the response data (returned by the server); parse and extract the data (BeautifulSoup lookups or re regular expressions); save the data. A minimal end-to-end sketch follows this list.
The requests module: requests is a simple, easy-to-use HTTP library implemented in Python (docs: http://cn.python-requests.org/zh_CN/latest/). requests.get(url) sends an HTTP GET request and returns the server's response.
The BeautifulSoup library: BeautifulSoup is a Python library for extracting data from HTML or XML files. It supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml.
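To make the four steps concrete, here is a minimal sketch of the pipeline against a placeholder page (https://example.com and result.txt are illustrative stand-ins, not part of the assignment):

```python
import requests
from bs4 import BeautifulSoup

# 1. send the request; 2. receive the server's response
resp = requests.get('https://example.com', timeout=10)
# 3. parse the HTML and extract the piece we need
soup = BeautifulSoup(resp.text, 'lxml')
title = soup.find('title').text
# 4. save the extracted data
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(title)
```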
- Practice:
①. Crawl the contestant information for 《青春有你2》 from Baidu Baike and return the page data
```python
import json
import re
import requests
import datetime
from bs4 import BeautifulSoup
import os

today = datetime.date.today().strftime('%Y%m%d')

def crawl_wiki_data():
    """
    Crawl the 《青春有你2》 contestant information from Baidu Baike
    and return the contestant table as HTML.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    url = 'https://baike.baidu.com/item/青春有你第二季'
    try:
        response = requests.get(url, headers=headers)
        print(response.status_code)
        soup = BeautifulSoup(response.text, 'lxml')
        # the contestant table carries these CSS classes on the Baike page
        tables = soup.find_all('table', {'class': 'table-view log-set-param'})
        crawl_table_title = "参赛学员"
        for table in tables:
            # the table's heading lives in the preceding <div>'s <h3> tags
            table_titles = table.find_previous('div').find_all('h3')
            for title in table_titles:
                if crawl_table_title in title.text:
                    return table
    except Exception as e:
        print(e)
```
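The post jumps straight from ① to collecting pictures, yet crawl_pic_urls below reads a JSON file at work/<date>.json that some intermediate step must have produced. A minimal sketch of that omitted parsing step, continuing the imports above and assuming each contestant's name cell links to her Baike page (the exact column layout is an assumption):

```python
def parse_wiki_data(table_html):
    """
    Hypothetical reconstruction of the omitted step: parse the contestant
    table returned by crawl_wiki_data() and save it as work/<date>.json.
    """
    bs = BeautifulSoup(str(table_html), 'lxml')
    stars = []
    for tr in bs.find_all('tr')[1:]:      # skip the header row
        tds = tr.find_all('td')
        link_tag = tds[0].find('a')       # assumed: name cell links to her page
        if link_tag is None:
            continue
        stars.append({
            'name': link_tag.text,
            'link': 'https://baike.baidu.com' + link_tag.get('href'),
        })
    with open('work/' + today + '.json', 'w', encoding='UTF-8') as f:
        json.dump(stars, f, ensure_ascii=False)
```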
②. Visit each contestant's Baidu Baike page and collect the URLs of her photos
```python
def crawl_pic_urls():
    """
    Visit each contestant's Baidu Baike page and collect her photo URLs.
    """
    with open('work/' + today + '.json', 'r', encoding='UTF-8') as file:
        json_array = json.loads(file.read())
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    for star in json_array:
        name = star['name']
        link = star['link']
        response = requests.get(link, headers=headers)
        bs = BeautifulSoup(response.text, 'lxml')
        # the summary picture links to the contestant's photo album page
        pic_list_url = bs.select('.summary-pic a')[0].get('href')
        pic_list_url = 'https://baike.baidu.com' + pic_list_url
        pic_list_response = requests.get(pic_list_url, headers=headers)
        bs = BeautifulSoup(pic_list_response.text, 'lxml')
        pic_list_html = bs.select('.pic-list img')
        pic_urls = [pic_html.get('src') for pic_html in pic_list_html]
        down_pic(name, pic_urls)
```
③. Download every photo and save it in a folder named after the contestant
```python
def down_pic(name, pic_urls):
    """
    Download all images in pic_urls into a folder named after `name`.
    """
    path = 'work/pics/' + name + '/'
    if not os.path.exists(path):
        os.makedirs(path)
    for i, pic_url in enumerate(pic_urls):
        try:
            pic = requests.get(pic_url, timeout=15)
            string = str(i + 1) + '.jpg'
            with open(path + string, 'wb') as f:
                f.write(pic.content)
            print('Downloaded image %s: %s' % (str(i + 1), str(pic_url)))
        except Exception as e:
            print('Failed to download image %s: %s' % (str(i + 1), str(pic_url)))
            print(e)
            continue
```
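For completeness, a small driver that chains the three steps together (parse_wiki_data is the reconstruction sketched above, not code from the original post):

```python
if __name__ == '__main__':
    table = crawl_wiki_data()   # ① fetch the contestant table
    parse_wiki_data(table)      # write work/<date>.json (hypothetical step above)
    crawl_pic_urls()            # ② collect photo URLs and ③ download them
    print('All contestant photos downloaded.')
```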
2. Visualization
- Task: visualize the weight distribution of the 《青春有你2》 contestants as a pie chart.
- Analysis: this task is mainly an exercise in applying Python's data libraries (pandas for loading and binning the data, matplotlib for plotting).
- Practice:
```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
import matplotlib.font_manager as font_manager
%matplotlib inline

df = pd.read_json('data/data31557/20200422.json')
weights = df['weight']
arrs = weights.values
# strip the trailing 'kg' and convert each weight to a float
for i in range(len(arrs)):
    arrs[i] = float(arrs[i][0:-2])

# bin the weights and count how many contestants fall into each bin
bins = [0, 45, 50, 55, 100]
se1 = pd.cut(arrs, bins)
counts = pd.value_counts(se1, sort=False)

labels = '<=45kg', '45~50kg', '50~55kg', '>55kg'
sizes = counts.values            # one count per bin, in bin order
explode = (0, 0.05, 0, 0)
colors = ['lightskyblue', 'yellow', 'yellowgreen', 'pink']
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, colors=colors, labels=labels,
        autopct='%1.1f%%', shadow=True, startangle=90, pctdistance=0.6)
ax1.axis('equal')                # keep the pie circular
plt.legend(bbox_to_anchor=(0.2, 0.2))
plt.title('Weight distribution of 《青春有你2》 contestants', fontsize=24)
plt.savefig('/home/aistudio/work/result/pie_result02.jpg')
plt.show()
```
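One caveat: the title still contains CJK characters (《青春有你2》), and matplotlib's default font renders those as empty boxes. The font_manager import above suggests the original notebook configured a font; a minimal sketch, assuming a SimHei font file at this hypothetical path:

```python
# point matplotlib at a CJK-capable font so the title renders correctly
font = font_manager.FontProperties(fname='/home/aistudio/work/SimHei.ttf')
plt.title('Weight distribution of 《青春有你2》 contestants',
          fontproperties=font, fontsize=24)
```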
3. PaddleHub
- Task: the PaddleHub assignment for 《青春有你2》: recognize five contestants from their photos.
- Analysis: PaddleHub is PaddlePaddle's tool for managing pre-trained models and doing transfer learning. With PaddleHub, a developer can combine high-quality pre-trained models with the Fine-tune API and quickly cover the whole path from transfer learning to deployment. It provides pre-trained models from across the PaddlePaddle ecosystem, covering image classification, object detection, lexical analysis, semantic models, sentiment analysis, video classification, image generation, image segmentation, text moderation, keypoint detection, and other mainstream tasks. See the official site for the full model list: https://www.paddlepaddle.org.cn/hub
- Practice:
①. Install PaddleHub
```python
!pip install paddlehub==1.6.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
②. Preparation
```python
import paddlehub as hub

# load the pre-trained model
module = hub.Module(name="resnet_v2_50_imagenet")

# prepare the dataset
from paddlehub.dataset.base_cv_dataset import BaseCVDataset

class DemoDataset(BaseCVDataset):
    def __init__(self):
        self.dataset_dir = "dataset"
        super(DemoDataset, self).__init__(
            base_path=self.dataset_dir,
            train_list_file="train_list.txt",
            validate_list_file="validate_list.txt",
            test_list_file="test_list.txt",
            label_list_file="label_list.txt",
        )

dataset = DemoDataset()

# build the data reader
data_reader = hub.reader.ImageClassificationReader(
    image_width=module.get_expected_image_width(),
    image_height=module.get_expected_image_height(),
    images_mean=module.get_pretrained_images_mean(),
    images_std=module.get_pretrained_images_std(),
    dataset=dataset)

# configure the run strategy
config = hub.RunConfig(
    use_cuda=False,
    num_epoch=3,
    checkpoint_dir="cv_finetune_turtorial_demo",
    batch_size=10,
    eval_interval=10,
    strategy=hub.finetune.strategy.DefaultFinetuneStrategy())

# assemble the fine-tune task
input_dict, output_dict, program = module.context(trainable=True)
img = input_dict["image"]
feature_map = output_dict["feature_map"]
feed_list = [img.name]
task = hub.ImageClassifierTask(
    data_reader=data_reader,
    feed_list=feed_list,
    feature=feature_map,
    num_classes=dataset.num_labels,
    config=config)

# start fine-tuning
run_states = task.finetune_and_eval()
```
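For reference, the four list files handed to DemoDataset are plain text in PaddleHub's image-classification format, one "<image path> <label index>" per line, which is also why the prediction code below splits each line of temp.txt on a space. The sample rows here are hypothetical, not from the post:

```
train/contestant_a_01.jpg 0
train/contestant_b_02.jpg 1
```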
③. Prediction
```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# read the five test images (one "path label" pair per line)
with open("dataset/temp.txt", "r") as f:
    filepath = f.readlines()
data = [filepath[i].split(" ")[0] for i in range(5)]
label_map = dataset.label_dict()
index = 0
run_states = task.predict(data=data)
results = [run_state.run_results for run_state in run_states]
for batch_result in results:
    print(batch_result)
    # pick the class with the highest probability for each image
    batch_result = np.argmax(batch_result, axis=2)[0]
    for result in batch_result:
        index += 1
        result = label_map[result]
        print("input %i is %s, and the predict result is %s" %
              (index, data[index - 1], result))
```
4. Word-Frequency Statistics, Word Cloud, and Comment Moderation
- Task: crawl the 《青春有你2》 episode comments, run word-frequency statistics, draw a word cloud, and perform content moderation on the comments.
- Analysis: this task needs jieba for Chinese word segmentation; wordcloud for drawing the word cloud; a Chinese font for the visualization; a public Chinese stopword list found online; a custom user dictionary built from the segmentation results; a background image for the word cloud; and a PaddleHub setup.
- Practice: the main function
```python
if __name__ == "__main__":
    num = 110                 # number of comment pages to crawl
    lastID = '0'
    arr = []
    with open('aqy.txt', 'a', encoding='utf-8') as f:
        # crawl the comments page by page (helpers are defined earlier in the notebook)
        for i in range(num):
            lastID = saveMovieInfoToFile(lastID, arr)
            time.sleep(0.5)   # be polite to the server
        for item in arr:
            Item = clear_special_char(item)
            if Item.strip() != '':
                try:
                    f.write(Item + '\n')
                except Exception as e:
                    print('Comment contains special characters')
    print('Total comments crawled:', len(arr))

    # segment each comment and count word frequencies, minus stopwords
    f = open('aqy.txt', 'r', encoding='utf-8')
    counts = {}
    for line in f:
        words = fenci(line)
        stopwords = stopwordslist('cn_stopwords.txt')
        movestopwords(words, stopwords, counts)

    # load the human-segmentation model to cut out the word-cloud mask
    humanseg = hub.Module(name='deeplabv3p_xception65_humanseg')
    results = humanseg.segmentation(data={"image": ['cloud2.png']})
    for result in results:
        print(result['origin'])
        print(result['processed'])

    drawcounts(counts, 10)    # bar chart of the top-10 words
    drawcloud(counts)         # word cloud
    f.close()

    # run content moderation over the crawled comments
    file_path = 'aqy.txt'
    test_text = []
    text_detection(test_text, file_path)
```
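The helpers called above (saveMovieInfoToFile, clear_special_char, fenci, stopwordslist, movestopwords, drawcounts, drawcloud, text_detection) are defined earlier in the notebook and not reproduced in the post. A minimal sketch of a few of them, assuming jieba and wordcloud are installed and that moderation uses PaddleHub's porn_detection_lstm module (the font path is hypothetical):

```python
import jieba
import paddlehub as hub
from wordcloud import WordCloud

def fenci(text):
    # segment a line of Chinese text into a list of words
    return jieba.lcut(text)

def stopwordslist(file_path):
    # load a stopword list, one word per line
    with open(file_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

def drawcloud(word_f):
    # render a word -> frequency dict as a word cloud image
    wc = WordCloud(font_path='SimHei.ttf',   # hypothetical CJK font path
                   background_color='white',
                   max_words=200)
    wc.fit_words(word_f)
    wc.to_file('pic.png')

def text_detection(test_text, file_path):
    # flag comments with inappropriate content
    porn_detection = hub.Module(name='porn_detection_lstm')
    with open(file_path, 'r', encoding='utf-8') as f:
        test_text = [line.strip() for line in f if line.strip()]
    results = porn_detection.detection(texts=test_text, use_gpu=False)
    for item in results:
        if item['porn_detection_key'] == 'porn':
            print(item['text'], item['porn_probs'])
```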
5. EasyDL
With EasyDL you only need to collect a small amount of task-specific data and label it directly on the platform; the system then picks a suitable model and hyperparameters and trains it for you, and the trained model can be deployed straight to a cloud API or packaged as an installer.
Features:
Train models with zero algorithm knowledge: no machine-learning expertise required; just upload and label sample data, then train a model with one click.
Validate model performance: inspect a detailed evaluation report, check the model in a visual interface, and add targeted training data where it falls short.
Deploy the model: once satisfied with the results, deploy the model to the cloud, to devices, or to a private server, or buy an integrated hardware-software solution.
Summary
Baidu PaddlePaddle is an open-source, full-featured deep learning platform, and the more I explored it, the more impressed I was. In particular, much of the work can run directly on the platform, which was a delightful surprise for someone stuck at home without a well-specced computer. I have honestly fallen in love with this platform.