
pandas: common statistics and string discretization

程序员文章站 2024-03-24 16:06:22
...

Statistical methods

print("*"*25+"Histogram"+"*"*25)
*************************Histogram*************************
import pandas as pd
file_path="./IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)
print(df.head(1))
print(df.info())
   Rank                    Title                    Genre  \
0     1  Guardians of the Galaxy  Action,Adventure,Sci-Fi   

                                         Description    Director  \
0  A group of intergalactic criminals are forced ...  James Gunn   

                                              Actors  Year  Runtime (Minutes)  \
0  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121   

   Rating   Votes  Revenue (Millions)  Metascore  
0     8.1  757074              333.13       76.0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
None
# Distribution of rating and runtime, shown as histograms
runtime_data=df["Runtime (Minutes)"].values # an ndarray
max_runtime=runtime_data.max()
min_runtime=runtime_data.min()

from matplotlib import pyplot as plt
num_bin = (max_runtime-min_runtime)//5 # number of bins, aiming for a bin width of about 5 minutes


plt.figure(figsize=(20,8),dpi=80) # set the figure size
plt.hist(runtime_data,num_bin)
# print(max_runtime-min_runtime)

plt.xticks(range(min_runtime,max_runtime+5,5)) # x-axis ticks every 5 minutes

plt.show()
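When the runtime range is not an exact multiple of 5, `plt.hist(runtime_data, num_bin)` splits the range into `num_bin` equal bins that are slightly wider than 5, so the ticks drift away from the bin edges. A minimal sketch of one way around this is to build the edges explicitly and pass them as `bins`. The helper name `bin_edges` and the sample runtimes are made up for illustration and are not part of the original post.

```python
import numpy as np

def bin_edges(values, width):
    """Return fixed-width bin edges that cover every value (hypothetical helper)."""
    lo = np.min(values)
    hi = np.max(values)
    # Round the bin count up so the last edge is never below the maximum.
    n_bins = max(int(np.ceil((hi - lo) / width)), 1)
    return lo + width * np.arange(n_bins + 1)

# Made-up sample standing in for df["Runtime (Minutes)"].values
runtimes = np.array([66, 88, 95, 101, 121, 121, 140, 191])

edges = bin_edges(runtimes, 5)
counts, _ = np.histogram(runtimes, bins=edges)  # same binning rule plt.hist uses

print(edges[0], edges[-1], len(edges) - 1)
print(counts.sum())
```

The same `edges` array can then be passed to both `plt.hist(runtimes, bins=edges)` and `plt.xticks(edges)` so that bars and ticks line up.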


rating_data=df["Rating"].values
max_rating=rating_data.max()
min_rating=rating_data.min()

# Bin edges: one wide first bin from 1.9 to 3.5, then steps of 0.5
num_bin_list=[1.9,3.5]
i=3.5
while i<=max_rating:
    i += 0.5
    num_bin_list.append(i)

# num_bin_list = [0.5]*13+[0.6]
print(num_bin_list)


plt.figure(figsize=(20,8),dpi=80)
plt.hist(rating_data,num_bin_list)

# _x = [min_rating]
# i=min_rating
# while i<=max_rating+0.5:
#   i=i+0.5
#   _x.append(i)

plt.xticks(num_bin_list)
plt.show()
[1.9, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
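To read the per-bin counts as numbers rather than from the plot, the same edges can be passed to `pd.cut`. This is a minimal sketch on made-up ratings, not the IMDB data. `right=False` is chosen here so the intervals are closed on the left, like the bins `plt.hist` draws.

```python
import pandas as pd

# Made-up sample standing in for df["Rating"]
ratings = pd.Series([1.9, 3.2, 5.6, 6.1, 6.7, 7.0, 7.4, 8.1, 9.0])

# Same edges as the printed list: 1.9, then 3.5 to 9.5 in steps of 0.5
edges = [1.9] + [3.5 + 0.5 * k for k in range(13)]

# Assign each rating to a half-open interval [left, right)
binned = pd.cut(ratings, bins=edges, right=False)

# Count per interval, in interval order; empty intervals show up as 0
counts = binned.value_counts().sort_index()
print(counts)
```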




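print("*"*25+"Common methods"+"*"*25)
print("Mean rating: %f"%df["Rating"].mean())
print("Number of directors: %d"%len(set(df["Director"].tolist())))
print("Number of directors (unique): %d"%len(df["Director"].unique()))
# Note: the names in "Actors" are separated by ", ", so split(",") leaves a leading
# space on every name but the first; "X" and " X" are counted as different actors here.
actors_list=df["Actors"].str.split(",").tolist()
actors_list=[i for j in actors_list for i in j] # flatten the list of lists
# import numpy as np
# actors_list=list(np.array(actors_list).flatten())
actors_num=len(set(actors_list))
print("Number of actors: %d"%actors_num)
# Other common methods: max(), min(), mean(), argmax() (position of the maximum), argmin() (position of the minimum), median()
Mean rating: 6.723200
Number of directors: 644
Number of directors (unique): 644
Number of actors: 2394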
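Because the actor names are separated by a comma plus a space, splitting on "," alone can count the same person twice. A minimal sketch of a stricter count is shown here on a made-up three-row frame with placeholder names; the numbers do not describe the IMDB file.

```python
import pandas as pd

# Made-up rows with placeholder names, same column layout as the CSV
df = pd.DataFrame({
    "Director": ["Dir A", "Dir B", "Dir A"],
    "Actors": ["Ann, Bob", "Cid, Ann", "Bob, Dee"],
})

# nunique() counts distinct values directly
n_directors = df["Director"].nunique()

# Naive split keeps the leading spaces, so "Ann" and " Ann" both survive
naive = {name for row in df["Actors"].str.split(",") for name in row}

# One row per actor, with surrounding whitespace stripped before counting
actors = df["Actors"].str.split(",").explode().str.strip()
n_actors = actors.nunique()

print(n_directors, len(naive), n_actors)
```

On this toy frame the naive set holds 6 entries while the stripped count is 4, which is the kind of gap to expect when names carry stray spaces.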

String discretization
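A minimal sketch of string discretization, assuming the goal is to turn the comma-separated `Genre` column into one 0/1 indicator column per genre. The first row comes from the `df.head(1)` output above; the other two rows are made up. `Series.str.get_dummies` is one way to do this and is not necessarily the approach the original author had in mind.

```python
import pandas as pd

# First row matches the sample output; the rest are made-up values
df = pd.DataFrame({
    "Genre": ["Action,Adventure,Sci-Fi", "Comedy", "Action,Comedy"],
})

# One column per distinct genre, 1 where the film carries that genre
genre_flags = df["Genre"].str.get_dummies(sep=",")

# Column sums give the number of films per genre
genre_counts = genre_flags.sum().sort_values(ascending=False)

print(genre_flags)
print(genre_counts)
```

With the indicator frame in hand, per-genre statistics reduce to ordinary column operations, for example plotting `genre_counts` as a bar chart.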
