Learning Python: Data Visualization
Commonly Used Python Packages
- Matplotlib
- Seaborn
- Pandas
- Bokeh
- Plotly
- Vispy
- Vega
- Vega-Lite
Visualization with Matplotlib
Installing Matplotlib
pip install matplotlib -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
If that fails, try the following:
Upgrade pip first, then install matplotlib:
python -m pip install -U pip setuptools -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
python -m pip install matplotlib -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
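Matplotlib consists of two modules
- Plotting API: pyplot, the interface normally used for visualization
- Integrated library: pylab, which bundles Matplotlib with SciPy and NumPy functionality in one namespace (the Matplotlib documentation now discourages it in favor of pyplot)
Two ways of rendering Matplotlib plots (in Jupyter Notebook, selected with the %matplotlib magic command)
- inline: static plots
- notebook: interactive plots
Draw on two-dimensional axes with plt.plot()
Display the result with plt.show()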
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
plt.plot(women["height"],women["weight"])
plt.show()
To draw several lines in one call, use plt.plot(x, y1, x, y2, x, y3, ...)
import matplotlib.pyplot as plt
import numpy as np
t = np.arange(0.0, 4.0, 0.1)
print(t)
plt.plot(t, t, t, t + 2, t, t ** 2, t, t + 8)
plt.show()
Changing Plot Properties
- Setting the marker type
Pass a format string as the third argument of plt.plot(), for example 'o'
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
plt.plot(women["height"],women["weight"],'o')
plt.show()
plt.plot(women["height"],women["weight"],'D')
plt.show()
- Setting the line color and style
Change the third argument of plt.plot()
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
plt.plot(women["height"],women["weight"],'g--')
plt.show()
plt.plot(women["height"],women["weight"],'rD')
plt.show()
For the full list of format strings, see these two articles:
https://blog.csdn.net/cjcrxzz/article/details/79627483
https://blog.csdn.net/sinat_36219858/article/details/79800460?utm_source=distribute.pc_relevant.none-task
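The format strings used above ('o', 'D', 'g--', 'rD') combine a color letter, a marker symbol and a line style. The following is a minimal sketch, not taken from the original post, that inspects what Matplotlib derives from two of them. The Agg backend and the variable names are assumptions chosen so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

heights = [58, 59, 60, 61, 62]
weights = [115, 117, 120, 123, 126]

fig, ax = plt.subplots()
# 'g--' = green + dashed line, no marker
(dashed,) = ax.plot(heights, weights, "g--")
# 'rD' = red + diamond marker, no connecting line
(diamonds,) = ax.plot(heights, weights, "rD")

# Each call returns Line2D objects whose properties reflect the format string
print("g-- ->", dashed.get_linestyle(), dashed.get_marker())
print("rD  ->", diamonds.get_linestyle(), diamonds.get_marker())
```

When a format string names only a marker, Matplotlib draws no connecting line, which is why 'o' and 'D' produce isolated points.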
- Displaying Chinese characters
Set the font before calling plot
Common Chinese fonts: SimHei, Kaiti, Lisu, Fangsong, YouYuan
plt.rcParams['font.family'] = 'SimHei'
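Whether a font such as SimHei is available depends on the machine. As a hedged sketch (the candidate list and the fallback behavior are illustrative assumptions, not part of the original post), the fonts Matplotlib knows about can be checked before one is selected:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
from matplotlib import font_manager

# Fonts listed in the text above; names are compared case-insensitively
candidates = ["SimHei", "Kaiti", "Lisu", "Fangsong", "YouYuan"]

# Names of all fonts Matplotlib has found on this system
installed = {font.name.casefold() for font in font_manager.fontManager.ttflist}
available = [name for name in candidates if name.casefold() in installed]

if available:
    plt.rcParams["font.family"] = available[0]
else:
    print("None of the candidate Chinese fonts is installed; keeping the default font")

# With CJK fonts the Unicode minus sign often has no glyph, so use ASCII '-'
plt.rcParams["axes.unicode_minus"] = False
print("Usable Chinese fonts:", available)
```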
- Setting the title and the x/y axis labels
plt.title(), plt.xlabel() and plt.ylabel() set the figure title, the x-axis label and the y-axis label respectively
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
plt.rcParams['font.family'] = 'SimHei'
plt.plot(women["height"], women["weight"], 'g--')
plt.title("此处为图名")
plt.xlabel("x轴的名称")
plt.ylabel("y轴的名称")
plt.show()
- Legend position
First pass the label argument to plt.plot(), then call plt.legend(loc=...), where loc is the position, for example "upper left". The legend shows the text given in label.
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
plt.rcParams['font.family'] = 'SimHei'
plt.plot(women["height"], women["weight"], 'g--', label='weight')
plt.title("此处为图名")
plt.xlabel("x轴的名称")
plt.ylabel("y轴的名称")
plt.legend(loc="upper left")
plt.show()
Changing the Plot Type
plt.scatter() draws a scatter plot
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
plt.scatter(women["height"], women["weight"])
plt.show()
Changing the Axis Ranges
Set the x-axis range: plt.xlim()
Set the y-axis range: plt.ylim()
Set both ranges at once: plt.axis()
np.linspace(0, 10, 100) returns 100 evenly spaced values over the closed interval [0, 10]
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.xlim(11, -2)  # x-axis runs from 11 down to -2 (reversed limits invert the axis)
plt.ylim(2.2, -1.3)  # y-axis runs from 2.2 down to -1.3 (inverted as well)
plt.show()
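Two details of the example above are easy to check directly: what np.linspace returns, and that passing the limits in descending order flips the axis. The sketch below is illustrative only. It uses the Agg backend and the object-oriented interface (ax.set_xlim instead of plt.xlim) as assumptions so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
# 100 values, first one 0.0, last one 10.0 (both endpoints included)
print(len(x), x[0], x[-1])

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlim(11, -2)  # same effect as plt.xlim(11, -2)

# Descending limits mean the axis is drawn right-to-left
print("x-axis inverted:", ax.xaxis_inverted())
```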
plt.axis([a1, a2, b1, b2]): a1 and a2 are the x-axis limits, b1 and b2 are the y-axis limits (all four are passed as one list)
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.axis([-1, 21, -1.6, 1.6])
plt.show()
plt.axis("equal") gives the x-axis and y-axis the same scale
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.axis("equal")
plt.show()
Removing the blank margins
plt.axis("tight")
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.axis("tight")
plt.show()
Drawing two curves on the same axes
Call plt.plot() more than once
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x),label="sin(x)")
plt.plot(x, np.cos(x),label="cos(x)")
plt.axis("tight")
plt.legend()
plt.show()
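Displaying multiple plots
plt.subplot(x, y, z) places the plot that follows in cell z of a grid with x rows and y columns (cells are numbered from 1, row by row)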
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
plt.subplot(2, 3, 5)  # cell 5 of a 2x3 grid
plt.scatter(women["height"], women["weight"])
plt.subplot(2, 3, 1)  # cell 1 of a 2x3 grid
plt.scatter(women["height"], women["weight"])
plt.show()
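The cell number in plt.subplot(x, y, z) counts row by row starting at 1. As a rough sketch of that numbering (the helper grid_position is hypothetical and the Agg backend is an assumption), the same layout can be built with plt.subplots, which returns an array of axes indexed by row and column:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt


def grid_position(z, ncols):
    """Convert a 1-based subplot number into a 0-based (row, col) pair."""
    return divmod(z - 1, ncols)


heights = [58, 59, 60, 61, 62]
weights = [115, 117, 120, 123, 126]

# 2 rows x 3 columns; axes[row, col] addresses one cell
fig, axes = plt.subplots(2, 3)
for z in (5, 1):
    row, col = grid_position(z, ncols=3)
    axes[row, col].scatter(heights, weights)
    axes[row, col].set_title(f"cell {z}")

print(grid_position(5, ncols=3), grid_position(1, ncols=3))
```

Unlike repeated plt.subplot() calls, plt.subplots() creates all six cells at once, so the unused ones appear as empty axes.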
Saving a figure
Replace plt.show() with plt.savefig("name.format")
With a relative file name, the image is saved in the current working directory
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
plt.subplot(2, 3, 5)  # cell 5 of a 2x3 grid
plt.scatter(women["height"], women["weight"])
plt.subplot(2, 3, 1)  # cell 1 of a 2x3 grid
plt.scatter(women["height"], women["weight"])
plt.savefig("sagefig.png")
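savefig also accepts an explicit path and a few output options. The following sketch is an illustration only: the temporary directory, the file name women.png and the option values are assumptions. It writes a PNG and confirms that the file exists.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

heights = [58, 59, 60, 61, 62]
weights = [115, 117, 120, 123, 126]

fig, ax = plt.subplots()
ax.scatter(heights, weights)

# Hypothetical output location; any writable path works
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "women.png")

# dpi sets the resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)

print("saved:", os.path.exists(out_path))
```

The file format is inferred from the extension, so changing the name to end in .pdf or .svg produces a vector file instead.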
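Drawing Scatter Plots
Installing scikit-learn (the package that is imported as sklearn)
pip install scikit-learn -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
make_blobs: generates a random dataset of normally distributed (isotropic Gaussian) clusters
Parameters:
- n_samples: number of samples, i.e. rows
- n_features: number of features per sample, i.e. columns
- centers: number of clusters (classes)
- random_state: seed for the random number generator, which makes the result reproducible
- cluster_std: standard deviation of each cluster
Return values:
- X: the generated samples, an array of shape [n_samples, n_features]
- y: the cluster label of each sample, an array of shape [n_samples]
Parameters of plt.scatter()
- X[:, 0] and X[:, 1] are the x and y coordinates
- c is the color
- s is the marker size
- cmap is the colormap used to turn the values in c into colors
from sklearn.datasets import make_blobs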
import matplotlib.pyplot as plt
X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap="rainbow")
plt.show()
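To make the description of make_blobs concrete, here is a small sketch that inspects the returned arrays. It assumes scikit-learn is installed and passes n_features=2 explicitly, which the original call leaves at its default.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same settings as the example above, with n_features spelled out
X, y = make_blobs(n_samples=300, n_features=2, centers=4,
                  random_state=0, cluster_std=1.0)

print(X.shape)        # one row per sample, one column per feature
print(y.shape)        # one label per sample
print(np.unique(y))   # the distinct cluster labels

# A fixed random_state reproduces the same data on every call
X_again, y_again = make_blobs(n_samples=300, n_features=2, centers=4,
                              random_state=0, cluster_std=1.0)
print(np.array_equal(X, X_again))
```

Passing c=y to plt.scatter() therefore colors each point by the cluster it was drawn from.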
Visualization with Pandas
Pandas' plotting functions make it easier to visualize DataFrame data
The kind= argument of Pandas' plot() determines the type of plot
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
women.plot(kind="bar")
plt.show()
barh draws a horizontal bar chart
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
women.plot(kind="barh")
plt.show()
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
women.plot(kind="bar", x="height", y="weight", color='g')
plt.show()
kde draws a kernel density estimate curve (this plot type requires SciPy)
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
women.plot(kind="kde")
plt.show()
plt.legend(loc="best") lets Matplotlib pick the best legend position
import matplotlib.pyplot as plt
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
women.plot(kind="bar", x="height", y="weight", color='g')
plt.legend(loc="best")
plt.show()
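Since kind= is the only thing that changes between the examples above, several plot types can be compared side by side. This is a sketch under assumptions: the choice of four kinds, the 2x2 layout and the Agg backend are illustrative and not from the original post.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

women = pd.DataFrame({"height": [58, 59, 60, 61, 62],
                      "weight": [115, 117, 120, 123, 126]})

# Plot kinds that need no extra dependencies (kde would require SciPy)
kinds = ["line", "bar", "barh", "area"]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, kind in zip(axes.flat, kinds):
    # ax= tells pandas to draw into an existing subplot
    women.plot(kind=kind, ax=ax, title=kind)

fig.tight_layout()
```

Replacing the last line with plt.show() (and removing the backend line) displays the comparison interactively.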
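Visualization with Seaborn
np.cumsum is a NumPy function (named like MATLAB's cumsum) that computes the cumulative sum of an array along an axis; the syntax is B = np.cumsum(A, axis), or B = np.cumsum(A), which flattens A first
plt.legend() configures the legend
- Legend entries: abcdef
- Number of legend columns: ncol = 2
- Legend position: loc = "upper left"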
import matplotlib.pyplot as plt
import numpy as np
plt.style.use("classic")
Rng = np.random.RandomState(0)
X = np.linspace(0, 10, 500)  # 500 evenly spaced values from 0 to 10
y = np.cumsum(Rng.randn(500, 6), 0)
plt.plot(X, y)
plt.legend("abcdef", ncol=2, loc="upper left")
plt.show()
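The call np.cumsum(Rng.randn(500, 6), 0) above accumulates down the rows, which turns six columns of random steps into six random walks. A tiny worked example (the array A is made up for illustration) shows what the axis argument does:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

# axis=0: running total down each column
down_rows = np.cumsum(A, axis=0)
# axis=1: running total across each row
across_cols = np.cumsum(A, axis=1)
# no axis: the array is flattened first
flattened = np.cumsum(A)

print(down_rows.tolist())    # [[1, 2], [4, 6]]
print(across_cols.tolist())  # [[1, 3], [3, 7]]
print(flattened.tolist())    # [1, 3, 6, 10]
```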
Installing Seaborn
pip install seaborn -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
Adding Seaborn makes the plots look nicer
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.style.use("classic")
Rng = np.random.RandomState(0)
X = np.linspace(0, 10, 500)
y = np.cumsum(Rng.randn(500, 6), 0)
sns.set()
plt.plot(X, y)
plt.legend("abcdef", ncol=2, loc="upper left")
plt.show()
Kernel density estimate (KDE) plot
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
sns.kdeplot(women.height, fill=True)  # fill= replaces the deprecated shade= argument
plt.show()
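sns.distplot() draws a distribution plot, i.e. a histogram combined with a kdeplot (it is deprecated since seaborn 0.11 in favor of sns.histplot() and sns.displot())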
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
sns.distplot(women.height)
plt.show()
sns.pairplot(): scatter plot matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
sns.pairplot(women)
plt.show()
sns.jointplot(): joint distribution plot
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
sns.jointplot(x=women.height, y=women.weight, kind="reg")  # recent seaborn versions require x= and y= as keywords
plt.show()
A with block can also change the style parameters; remember the trailing colon and the indentation
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
with sns.axes_style("white"):
    sns.jointplot(x=women.height, y=women.weight, kind="reg")
plt.show()
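The point of the with form is that the style applies only inside the block. A minimal sketch of that behavior follows. It uses the "darkgrid" style rather than "white" because darkgrid switches on the grid, which is easy to observe; the variable names and the Agg backend are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import seaborn as sns

# Grid setting before entering the block
grid_before = plt.rcParams["axes.grid"]

with sns.axes_style("darkgrid"):
    # Inside the block the darkgrid style is active
    grid_inside = plt.rcParams["axes.grid"]

# After the block the previous setting is restored
grid_after = plt.rcParams["axes.grid"]

print(grid_before, grid_inside, grid_after)
```

By contrast, sns.set_style() changes the style for everything drawn afterwards until it is changed again.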
plt.hist() draws a histogram
The plotting call can also be placed in a for loop to overlay several variables in one figure
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
dt = {'height': pd.Series([58, 59, 60, 61, 62], index=[0, 1, 2, 3, 4]),
'weight': pd.Series([115, 117, 120, 123, 126], index=[0, 1, 2, 3, 4])}
women = pd.DataFrame(dt)
print(women)
'''
height weight
0 58 115
1 59 117
2 60 120
3 61 123
4 62 126
'''
for x in ["height", "weight"]:
    plt.hist(women[x], density=True, alpha=0.5)  # density= replaces the removed normed= argument
plt.show()
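Normalizing a histogram to a density rescales the bars so that their total area is 1, which is what makes histograms of variables with different ranges comparable. The sketch below checks this with np.histogram. The bin count of 5 is an arbitrary assumption.

```python
import numpy as np

weights = np.array([115, 117, 120, 123, 126])

# Raw counts per bin
counts, edges = np.histogram(weights, bins=5)
# Same bins, normalized so that sum(height * bin_width) == 1
density, _ = np.histogram(weights, bins=5, density=True)

bin_widths = np.diff(edges)
area = np.sum(density * bin_widths)

print(counts.sum())  # number of observations
print(area)          # total area under the normalized histogram
```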
For more Seaborn operations, see
https://www.jianshu.com/p/844f66d00ac1
Data Visualization in Practice
- Preparing the data
import os
print(os.getcwd())  # e.g. E:\py_workspace\test2
Use read_csv() from pandas to load the file into the in-memory object salaries
import pandas as pd
salaries = pd.read_csv("salaries.csv", index_col=0)
# index_col=0 uses column 0 of the file as the row index
Inspecting the data
import pandas as pd
salaries = pd.read_csv("salaries.csv", index_col=0)
# index_col=0 uses column 0 of the file as the row index
print(salaries.head())
'''
rank discipline yrs.since.phd yrs.service sex salary
1 Prof B 19 18 Male 139750
2 Prof B 20 16 Male 173200
3 AsstProf B 4 3 Male 79750
4 Prof B 45 39 Male 115000
5 Prof B 40 41 Male 141500
'''
- Importing the Python packages
import seaborn as sns
import matplotlib.pyplot as plt
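- Plotting
sns.set_style('darkgrid') sets Seaborn's plotting style (theme) to darkgrid (gray background with grid lines)
sns.stripplot() draws a categorical scatter plot (strip plot)
Parameters:
- data: data source
- x: column shown on the x-axis
- y: column shown on the y-axis
- jitter: whether to jitter the points
- alpha: transparency
sns.boxplot() draws a box plot
Parameters:
- data: data source
- x: column shown on the x-axis
- y: column shown on the y-axis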
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
salaries = pd.read_csv("salaries.csv", index_col=0)
# index_col=0 uses column 0 of the file as the row index
print(salaries.head())
'''
rank discipline yrs.since.phd yrs.service sex salary
1 Prof B 19 18 Male 139750
2 Prof B 20 16 Male 173200
3 AsstProf B 4 3 Male 79750
4 Prof B 45 39 Male 115000
5 Prof B 40 41 Male 141500
'''
sns.set_style('darkgrid')
sns.stripplot(data=salaries, x='rank', y='salary', jitter=True, alpha=0.5)
sns.boxplot(data=salaries, x='rank', y='salary')
plt.show()
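The center line of each box drawn by sns.boxplot() is the median of that group. As a hedged illustration, the sketch below rebuilds only the five rows printed by salaries.head() above (the full salaries.csv is not reproduced here, so the numbers describe this sample alone) and computes the per-rank medians with pandas:

```python
import pandas as pd

# Only the five rows shown in the head() output above, not the full dataset
sample = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "Prof", "Prof"],
    "salary": [139750, 173200, 79750, 115000, 141500],
})

# Group by the x variable and summarize the y variable,
# mirroring boxplot(data=..., x='rank', y='salary')
medians = sample.groupby("rank")["salary"].median()
print(medians)
```

Overlaying stripplot on boxplot, as the example does, shows the individual observations together with this summary.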