欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

数据分析之数据清洗(四)

程序员文章站 2022-04-28 23:49:42
...

旅游招聘数据分析之数据清洗(四)

在获取完我们的数据之后,就需要我们对数据进行清洗了,这个是一件很头疼的事情,麻烦,工作量大,首先我们先对我们的数据进行查重,毕竟那么多网站,有很多重复的,这些数据不仅没用而且还会增加我们的工作量,浪费时间,所以首先第一步就是查重了。建议最好先把全部数据放到一个Excel文件里面

import pandas as pd
data= pd.DataFrame(pd.read_excel('数据大集成.xlsx','Sheet1'))
no_re_row = data.drop_duplicates()
print(no_re_row)
no_re_row.to_excel("新(数据大集成).xls")

然后我们到查重好的文件里面,先将里面的部分内容进行复制,因为,数据量太大,文本文件大小有限制(好像超过3-4M之后就容易出错,所以数据量不易过大),超过这个范围,文本文件就容易出错,这个不是我们想要的结果,所以最好就是一步步复制,我这里是将Excel表进行一列列复制,到一个文本文件里面然后再写一个程序,把里面不需要,多余的文字进行删除,这里最好一部分一部分代码运行,不然一下子全部运行,容易发生冲突,这样就容易数据出错

import re
def clearBlankLine():
    file1 = open('整理文件.txt', 'r',encoding="utf-8")
    file2 = open('整理好的内容.txt', 'w', encoding='utf-8')
    try:
        for line in file1.readlines():
            file2.write(line)
            # line = line.replace("英语","").replace("薪聘","").replace("高","").replace("顺德区","").replace("旅游销售\\\\","") \
            #     .replace("去哪儿", "").replace("网","").replace("旅游在线","").replace("客服","").replace("门店","").replace("全球","") \
            #     .replace("包住宿", "").replace("月薪","").replace("携程","").replace("8500","").replace("8K","").replace("旅游产品专员\\\\","") \
            #     .replace("急聘", "").replace("康养","").replace("7000+","").replace("酒店","").replace("薪","").replace("招聘","") \
            #     .replace("同业", "").replace("无经验","").replace("+","").replace("轻松","").replace("工作","").replace("旅行海外","") \
            #     .replace("同业", "").replace("无经验", "").replace("+", "").replace("轻松", "").replace("工作", "").replace(
            #     "旅行海外", "") \
            #     .replace("泰语", "").replace("客服", "").replace("主管", "").replace("门店", "").replace("扶贫", "").replace(
            #     "旅行海外", "") \
            #     .replace("高铁", "").replace("资深", "").replace(",", "").replace("轻松", "").replace("工作", "").replace(
            #     "旅行海外", "") \
            #     .replace("同业", "").replace("无经验", "").replace("+", "").replace("轻松", "").replace("工作", "").replace(
            #     "旅行海外", "") \
                # line = line.replace("0.1","1").replace("0.2","2").replace("0.3","3").replace("0.4","4") \
            #     .replace("0.5", "5").replace("0.6","6").replace("0.7","7").replace("0.8","8").replace("0.9","9")
            # line = line.replace("'","").replace("\\xa0","").replace("***","").\
            #     replace(",","").replace("★","").replace("◆","").replace("(","").\
            #     replace(")","").replace("【","").replace("】","").replace("\\n","").\
            #     replace('[','').replace(']',"").replace("...","").replace('\\\\',"/")
            # line = line.strip(" ")
            # line = line.replace("1.20k","12k").replace("1.40k","14k").replace\
            #     ("1.50k","15k").replace("1.30k","13k").replace("10-150k/年","10-15k").\
            #     replace("8-150k/年","8-15k").replace("15-200k/年","15-20k").\
            #     replace("200元/天","6k").replace("1.5千以下","1.5k").\
            #     replace("300元/天","9k").replace("100元/天","3k").replace("150元/天","4.5k").\
            #     replace("7-120k/年","7-12k").replace("30-400k/年","3-4k").replace("20-300k/年","2-30k").replace("8-100k/年","8-10k").replace("8-200k/年","8-20k")
            # line = line.replace("1-","1k-").replace("2-","2k-").replace("3-","3k-").\
            #     replace("6-","6k-").replace("5-","5k-").replace("4-","4k-").replace("7-","7k-").replace("8-","8k-").replace("9-","9k-").replace("0-","0k-")
            # line = line[0:2]
            # line = line.replace("哈尔","哈尔滨").replace("大兴","大兴安岭").replace("防城","防城港").replace("呼和","呼和浩特").\
            #     replace("呼伦","呼伦贝尔").replace("葫芦","葫芦岛").replace("红河","红河州").replace("景德","景德镇").replace("克拉","克拉玛依")\
            #     .replace("喀什","喀什地区").replace("马鞍","马鞍山").replace("牡丹","牡丹江").replace("秦皇","秦皇岛").replace("齐齐","齐齐哈尔").\
            #     replace("七台","七台河").replace("黔东","黔东南").replace("石家","石家庄").replace("神农","神农架").replace("双鸭","双鸭山")\
            #     .replace("石河","石河子").replace("图木","图木舒克").replace("五指","五指山").replace("乌鲁","乌鲁木齐").replace("西双","西双版纳")\
            #     .replace("张家","张家界").replace("驻马","驻马店")
            # line = line.replace("0k","0").replace("1k","1").replace("2k","2").replace("3k","3").replace("4k","4")\
            #     .replace("5k","5").replace("6k","6").replace("7k","7").replace("8k","8").replace("9k","9")
            # line = line.replace("1-1","1-1千/月").replace("1-2","1-2千/月").replace("1-3","-3千/月").replace("1-4","1-4千/月").\
            #     replace("1-5","1-5千/月").replace("1-6","1-6千/月").replace("1-7","1-7千/月").replace("1-8","1-8千/月").\
            #     replace("1-9","1-9千/月").replace("2-3","2-3千/月").replace("2-4","2-4千/月").\
            #     replace("2-5","2-5千/月").replace("2-6","2-6千/月").replace("2-7","2-7千/月").replace("2-8","2-8千/月").\
            #     replace("2-9","2-9千/月").replace("3-4","3-4千/月").\
            #     replace("3-5","3-5千/月").replace("3-6","3-6千/月").replace("3-7","3-7千/月").replace("3-8","3-8千/月").\
            #     replace("3-9","3-9千/月").\
            #     replace("5-5","5-5千/月").replace("5-6","5-6千/月").replace("5-7","5-7千/月").replace("5-8","5-8千/月").\
            #     replace("5-9","5-9千/月").replace("6-6","6-6千/月").replace("6-7","6-7千/月").replace("6-8","6-8千/月").\
            #     replace("6-9","6-9千/月").replace("7-7","7-7千/月").replace("7-8","7-8千/月").\
            #     replace("7-9","7-9千/月").replace("8-8","8-8千/月").\
            #     replace("8-9","8-9千/月").\
            #     replace("4-5","4-5千/月").replace("4-6","4-6千/月").replace("4-7","4-7千/月").replace("4-8","4-8千/月").\
            #     replace("4-9","4-9千/月")
            # line = line.replace("20-4万/月0","2-4万/月")\
            #     .replace("-35","-3.5万/月").replace("-38","-3.8万/月").replace("-55","-5.5万/月").replace("-25","-2.5万/月").\
            #     replace("-36","-3.6万/月").replace("-26","-2.6万/月").replace("-27","-2.7万/月").replace("-28","-2.8万/月").\
            #     replace("-29","-2.9万/月").replace("-24","-2.4万/月").replace("-23","-2.3万/月").replace("-22","-2.2万/月").\
            #     replace("-21","-2.1万/月").replace("-31","-3.1万/月").replace("-32","-3.2万/月").replace("-33","-3.3万/月").\
            #     replace("-34","-3.4万/月").replace("-37","-3.7万/月").replace("-39","-3.9万/月").replace("-50","-5万/月").\
            #     replace("-60","-6万/月")
            # line = line.replace("-10","-1万/月").replace("-11","-1.1万/月").replace("-12","-1.2万/月").replace("-13","-1.3万/月").\
            #     replace("-14","-1.4万/月").replace("-15","-1.5万/月").replace("-16","-1.6万/月").replace("-17","-1.7万/月").replace("-18","-1.8万/月").\
            #     replace("-19","-1.9万/月").replace("-20","-2万/月").replace("-30","-3万/月").replace("-40","-4万/月")
            # line = line.replace("2千/月.2万/月","2.2万/月").replace("千/月万/月","万/月").\
            #     replace("1千/月.2万/月","1.2万/月").replace("1千/月.5万/月","1.5万/月").replace("3千/月.5万/月","3.5万/月")

    finally:
        file1.close()
        file2.close()


if __name__ == '__main__':
    clearBlankLine()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传

清洗好的数据之后就是用专门的数据分析工具来,当然也可以用python来做数据分析,这样更精准一点,但是本人比较懒,就直接用数据分析工具了,这里我采用的是tableau publish,我之所以选择它,是因为它简单易上手,而且弄出来的图,清晰可见,易懂,非常好用,这里我要算的是全部旅游数据,什么类型的公司占百分比,和工作地点百分比,以及公司规模百分比

数据分析之数据清洗(四)

在这里我们直接鼠标移到饼图就可以看清楚,某个东西占的比重,方便我们做数据分析。

这里就是全部思路了,代码的话,可以去我的GitHub账户上面把源代码下载下来,如果对你有帮助的话,不嫌麻烦,可以在我的GitHub点一下start,你的支持是我更新的动力

比重,方便我们做数据分析。

这里就是全部思路了,代码的话,可以去我的GitHub账户上面把源代码下载下来,如果对你有帮助的话,不嫌麻烦,可以在我的GitHub点一下start,你的支持是我更新的动力

数据分析的结果和代码

数据分析之前程无忧(一)

数据分析之大街网(二)

数据分析之拉勾网(三)

数据分析之数据清洗(四)