pandas: Common Statistics and String Discretization
2024-03-24 16:06:22
Statistical methods
print("*"*25+"Histogram"+"*"*25)
*************************Histogram*************************
import pandas as pd
file_path="./IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)
print(df.head(1))
print(df.info()) # df.info() prints the summary itself and returns None, hence the trailing "None" in the output
   Rank                    Title                    Genre  \
0     1  Guardians of the Galaxy  Action,Adventure,Sci-Fi

                                         Description    Director  \
0  A group of intergalactic criminals are forced ...  James Gunn

                                              Actors  Year  Runtime (Minutes)  \
0  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121

   Rating   Votes  Revenue (Millions)  Metascore
0     8.1  757074              333.13       76.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Rank                1000 non-null   int64
 1   Title               1000 non-null   object
 2   Genre               1000 non-null   object
 3   Description         1000 non-null   object
 4   Director            1000 non-null   object
 5   Actors              1000 non-null   object
 6   Year                1000 non-null   int64
 7   Runtime (Minutes)   1000 non-null   int64
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
None
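The `Non-Null Count` column above shows that `Revenue (Millions)` and `Metascore` have fewer than 1000 entries, so some values are missing. A minimal sketch of counting the missing cells per column directly follows. It uses a small made-up frame rather than the real CSV, so the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie table; the values are made up for illustration
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue (Millions)": [333.13, np.nan, 126.46, np.nan],
    "Metascore": [76.0, 65.0, np.nan, 59.0],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Statistics such as mean() skip NaN by default (skipna=True)
mean_revenue = toy["Revenue (Millions)"].mean()
print(mean_revenue)
```

On the real data the same `df.isna().sum()` call would report the gap between 1000 and each non-null count.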
# Distribution of rating and runtime: histograms
runtime_data=df["Runtime (Minutes)"].values # ndarray
max_runtime=runtime_data.max()
min_runtime=runtime_data.min()
from matplotlib import pyplot as plt
# Number of bins for a bin width of 5 minutes
# (bars line up with the ticks below only when the range is a multiple of 5)
num_bin = (max_runtime-min_runtime)//5
plt.figure(figsize=(20,8),dpi=80) # set the figure size
plt.hist(runtime_data,num_bin)
# print(max_runtime-min_runtime)
plt.xticks(range(min_runtime,max_runtime+5,5)) # x-axis ticks every 5 minutes
plt.show()
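When the data range is not an exact multiple of the bin width, passing a bin count lets matplotlib pick edges that drift away from the ticks. One possible alternative is to build the edges explicitly and pass them as `bins`. In the sketch below, the helper name `fixed_width_edges` and the sample runtimes are hypothetical. `np.histogram` is used only so the counts can be inspected without opening a plot window.

```python
import numpy as np

def fixed_width_edges(data, width):
    """Return bin edges spaced `width` apart that cover all of `data`."""
    lo = int(np.min(data))
    hi = int(np.max(data))
    # Ceiling division, so the last bin still contains the maximum
    n_bins = max(-(-(hi - lo) // width), 1)
    return [lo + k * width for k in range(n_bins + 1)]

runtimes = np.array([66, 70, 88, 95, 101, 121, 121, 139])  # made-up sample
edges = fixed_width_edges(runtimes, 5)
counts, _ = np.histogram(runtimes, bins=edges)
print(edges)
print(counts)
```

The same `edges` list can be passed to both `plt.hist(runtimes, edges)` and `plt.xticks(edges)`, so bars and ticks coincide by construction.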
rating_data=df["Rating"].values
max_rating=rating_data.max()
min_rating=rating_data.min()
# Explicit bin edges: one wide first bin (1.9 to 3.5), then steps of 0.5
num_bin_list=[1.9,3.5]
i=3.5
while i<=max_rating:
    i += 0.5
    num_bin_list.append(i)
# num_bin_list = [0.5]*13+[0.6]
print(num_bin_list)
plt.figure(figsize=(20,8),dpi=80)
plt.hist(rating_data,num_bin_list)
# _x = [min_rating]
# i=min_rating
# while i<=max_rating+0.5:
#     i=i+0.5
#     _x.append(i)
plt.xticks(num_bin_list)
plt.show()
[1.9, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
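Unequal bin edges like the list above can also be used to tabulate counts without plotting. The following sketch uses `pd.cut` on a made-up sample of ratings, not the real column. `right=False` is chosen so that a value sitting exactly on an interior edge goes into the bin that starts there, which is how `plt.hist` assigns such values.

```python
import pandas as pd

edges = [1.9, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
ratings = pd.Series([1.9, 3.5, 6.2, 6.7, 7.0, 7.0, 8.1, 9.0])  # made-up sample

# Each bin is [left, right): closed on the left, open on the right
binned = pd.cut(ratings, bins=edges, right=False)

# sort=False keeps the bins in edge order instead of ordering by frequency
counts = binned.value_counts(sort=False)
print(counts)
```

Because the first bin spans 1.6 rating points while the rest span 0.5, raw counts in that bin are not directly comparable with the others.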
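print("*"*25+"Common methods"+"*"*25)
print("Mean rating: %f"%df["Rating"].mean())
print("Number of directors: %d"%len(set(df["Director"].tolist())))
print("Number of directors (unique): %d"%len(df["Director"].unique()))
# Note: splitting on "," alone keeps the space after each comma (e.g. " Vin Diesel"),
# so the same actor can be counted more than once; strip the names for an exact count
actors_list=df["Actors"].str.split(",").tolist()
actors_list=[i for j in actors_list for i in j] # flatten the list of lists
# import numpy as np
# actors_list=list(np.array(actors_list).flatten())
actors_num=len(set(actors_list))
print("Number of actors: %d"%actors_num)
# max(), min(), mean(), position of the maximum argmax(), position of the minimum argmin(), median()
Mean rating: 6.723200
Number of directors: 644
Number of directors (unique): 644
Number of actors: 2394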
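The `Actors` column separates names with a comma followed by a space, so a plain split on `","` leaves a leading space on every name after the first, and the actor count above may be inflated. The sketch below shows the effect and one way around it, using two made-up rows in the same layout. The corrected count for the real data is not computed here.

```python
import pandas as pd

# Made-up rows in the same "name, name, name" layout as the Actors column
actors = pd.Series([
    "Chris Pratt, Vin Diesel, Bradley Cooper",
    "Vin Diesel, Zoe Saldana",
])

# Naive split: "Vin Diesel" and " Vin Diesel" end up as two different strings
naive = {name for row in actors.str.split(",") for name in row}

# Split, put one name per row with explode(), then strip the whitespace
clean = actors.str.split(",").explode().str.strip()

print(len(naive))
print(clean.nunique())
```

`explode()` also replaces the nested list comprehension used for flattening, which keeps the whole computation inside pandas.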
String discretization
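A column such as `Genre` packs several categories into one comma-separated string ("Action,Adventure,Sci-Fi"). One common way to discretize it is to turn each distinct category into its own 0/1 column. The sketch below does this with `Series.str.get_dummies` on a made-up three-row frame. It is an illustration under those assumptions, not necessarily the approach the original author used.

```python
import pandas as pd

# Made-up rows that mimic the Genre column's comma-separated layout
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": ["Action,Adventure,Sci-Fi", "Action,Comedy", "Comedy"],
})

# One 0/1 indicator column per distinct genre
genre_flags = toy["Genre"].str.get_dummies(sep=",")
print(genre_flags)

# Column sums give the number of movies in each genre
genre_counts = genre_flags.sum().sort_values(ascending=False)
print(genre_counts)
```

With the indicator frame in hand, per-genre statistics reduce to ordinary column operations, for example plotting `genre_counts` as a bar chart.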