Using the pandas category data type

程序员文章站 2022-03-17 19:43:36


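Creating a category

Creating from a Series

Passing dtype="category" when constructing a Series creates a categorical Series. A categorical has two parts: the categories (the set of distinct literal values) and a flag that says whether those categories are ordered:

In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']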
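The two parts described above can be inspected through the .cat accessor. The following is a minimal sketch; the variable names are chosen here for illustration only. It also shows the integer codes that pandas stores per row, which point into the categories.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct values, stored once.
categories = list(s.cat.categories)

# One integer code per row; each code is a position in `categories`.
codes = s.cat.codes.tolist()

# Whether the categories carry a meaningful order.
is_ordered = s.cat.ordered

print(categories, codes, is_ordered)
```

Rows 0 and 3 share the same code because they hold the same category.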

A Series in a DataFrame can be converted to a categorical:

In [3]: df = pd.DataFrame({"a": ["a", "b", "c", "a"]})

In [4]: df["b"] = df["a"].astype("category")

In [5]: df["b"]
Out[5]: 
0    a
1    b
2    c
3    a
Name: b, dtype: category
Categories (3, object): ['a', 'b', 'c']

You can also build a pandas.Categorical first and pass it to Series. Values that are not among the given categories become NaN:

In [10]: raw_cat = pd.Categorical(
   ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
   ....: )
   ....: 

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']
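As a small sketch of what the output above implies: a value outside the declared categories is treated as missing, while a declared category may exist without any rows using it. The helper variable names are illustrative.

```python
import pandas as pd

raw_cat = pd.Categorical(
    ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
)
s = pd.Series(raw_cat)

# "a" is not a declared category, so rows 0 and 3 are missing.
missing_mask = s.isna().tolist()

# "d" is a declared category even though no row contains it.
d_is_category = "d" in s.cat.categories
rows_equal_to_d = int((s == "d").sum())

print(missing_mask, d_is_category, rows_equal_to_d)
```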

Creating from a DataFrame

dtype="category" can also be passed when constructing a DataFrame:

In [17]: df = pd.DataFrame({"a": list("abca"), "b": list("bccd")}, dtype="category")

In [18]: df.dtypes
Out[18]: 
a    category
b    category
dtype: object

Columns a and b of df are both categorical, each with its own categories:

In [19]: df["a"]
Out[19]: 
0    a
1    b
2    c
3    a
Name: a, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df["b"]
Out[20]: 
0    b
1    c
2    c
3    d
Name: b, dtype: category
Categories (3, object): ['b', 'c', 'd']

Alternatively, df.astype("category") converts every Series in df to a categorical:

In [21]: df = pd.DataFrame({"a": list("abca"), "b": list("bccd")})

In [22]: df_cat = df.astype("category")

In [23]: df_cat.dtypes
Out[23]: 
a    category
b    category
dtype: object
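When only some columns should become categorical, DataFrame.astype also accepts a mapping from column name to dtype. The sketch below assumes an extra numeric column n, which is not part of the original example, to show that untouched columns keep their dtype.

```python
import pandas as pd

# "n" is a hypothetical extra column added for illustration.
df = pd.DataFrame({"a": list("abca"), "b": list("bccd"), "n": [1, 2, 3, 4]})

# Convert only column "a"; "b" and "n" are left as they were.
df_partial = df.astype({"a": "category"})

a_is_categorical = isinstance(df_partial["a"].dtype, pd.CategoricalDtype)
b_is_categorical = isinstance(df_partial["b"].dtype, pd.CategoricalDtype)

print(df_partial.dtypes)
```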

Controlling the behavior

By default, a categorical created with dtype='category' uses these defaults:

1. The categories are inferred from the data.

2. The categories are unordered.

To change either default, create a CategoricalDtype explicitly:

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(["a", "b", "c", "a"])

In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']
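A CategoricalDtype is a reusable object that carries both settings. The sketch below reads them back and also looks at the stored codes, where a missing value is encoded as -1. Variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
s_cat = pd.Series(["a", "b", "c", "a"]).astype(cat_type)

# The dtype exposes the two settings it was built with.
declared = list(cat_type.categories)
is_ordered = cat_type.ordered

# "a" is not declared, so its rows get the code -1 (missing).
codes = s_cat.cat.codes.tolist()

print(declared, is_ordered, codes)
```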

The same CategoricalDtype can be applied to a DataFrame, so that every column shares one set of categories:

In [31]: from pandas.api.types import CategoricalDtype

In [32]: df = pd.DataFrame({"a": list("abca"), "b": list("bccd")})

In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

In [34]: df_cat = df.astype(cat_type)

In [35]: df_cat["a"]
Out[35]: 
0    a
1    b
2    c
3    a
Name: a, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat["b"]
Out[36]: 
0    b
1    c
2    c
3    d
Name: b, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
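One practical consequence of sharing a dtype is that the two columns become directly comparable. This is a sketch under the same setup as above; the result names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"a": list("abca"), "b": list("bccd")})
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
df_cat = df.astype(cat_type)

# Both columns now carry the identical ordered dtype.
same_dtype = df_cat["a"].dtype == df_cat["b"].dtype

# Row-wise comparison follows the category order a < b < c < d.
a_less_than_b = (df_cat["a"] < df_cat["b"]).tolist()

print(same_dtype, a_less_than_b)
```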

Converting back to the original type

Series.astype(original_dtype) or np.asarray(categorical) converts a categorical back to its original type:

In [39]: s = pd.Series(["a", "b", "c", "a"])

In [40]: s
Out[40]: 
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype("category")

In [42]: s2
Out[42]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]: 
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
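The same round trip applies to non-string data. The sketch below uses integer values, which are an assumption made here for illustration, and converts them back with both approaches.

```python
import numpy as np
import pandas as pd

s_num = pd.Series([10, 20, 10], dtype="category")

# Back to a plain integer Series.
back_to_int = s_num.astype("int64")

# Back to a plain NumPy array of the category values.
as_array = np.asarray(s_num)

print(back_to_int.dtype, as_array)
```

After the conversion, numeric operations such as sum() work on the values again.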

Working with categories

Accessing category properties

Categorical data has two properties, categories and ordered, available as s.cat.categories and s.cat.ordered:

In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered
Out[59]: False

The order of the categories can be chosen when the categorical is constructed:

In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False
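The order in which categories are listed matters even when ordered is False, because sorting follows it rather than lexical order. A minimal sketch, with a plain Series alongside for contrast:

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

# Sorting a categorical follows the listed category order: c, b, a.
sorted_by_category = s.sort_values().tolist()

# Sorting the same values as plain strings is lexical.
sorted_lexically = pd.Series(["a", "b", "c", "a"]).sort_values().tolist()

print(sorted_by_category, sorted_lexically)
```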

Renaming categories

In older pandas versions, categories can be renamed by assigning to s.cat.categories (this setter was deprecated in pandas 1.5 and removed in 2.0):

In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [68]: s
Out[68]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: s.cat.categories = ["group %s" % g for g in s.cat.categories]

In [70]: s
Out[70]: 
0    group a
1    group b
2    group c
3    group a
dtype: category
Categories (3, object): ['group a', 'group b', 'group c']

rename_categories achieves the same effect and returns a new Series:

In [71]: s = s.cat.rename_categories([1, 2, 3])

In [72]: s
Out[72]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

Or pass a dict-like object:

# you can also pass a dict-like object to map the renaming
In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

In [74]: s
Out[74]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
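rename_categories also accepts a callable that is applied to each category, which covers the "group %s" style of renaming without assigning to s.cat.categories. A short sketch:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The callable receives each category and returns its new name.
renamed = s.cat.rename_categories(lambda c: "group " + c)

new_categories = list(renamed.cat.categories)
new_values = renamed.tolist()

print(new_categories, new_values)
```

The original Series s is left unchanged, since rename_categories returns a new object.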

Adding categories with add_categories

add_categories appends new categories:

In [77]: s = s.cat.add_categories([4])

In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

In [79]: s
Out[79]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): ['x', 'y', 'z', 4]

Removing categories with remove_categories

In [80]: s = s.cat.remove_categories([4])

In [81]: s
Out[81]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']

Removing unused categories

In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

In [83]: s
Out[83]: 
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [84]: s.cat.remove_unused_categories()
Out[84]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
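Unused categories commonly appear after filtering, because a filtered Series keeps the full category list. A minimal sketch of that situation:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Filtering drops the rows but keeps "c" as a category.
filtered = s[s != "c"]
categories_before = list(filtered.cat.categories)

# remove_unused_categories trims the list to what is actually present.
trimmed = filtered.cat.remove_unused_categories()
categories_after = list(trimmed.cat.categories)

print(categories_before, categories_after)
```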

Setting categories

set_categories() adds and removes categories in a single step. Values whose category is dropped become NaN:

In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

In [86]: s
Out[86]: 
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

In [88]: s
Out[88]: 
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']
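set_categories also takes an ordered argument, so the category list and its ordering can be fixed in one call. The sketch below uses hypothetical size labels rather than the data above.

```python
import pandas as pd

# "unknown" stands in for a placeholder value to be discarded.
s = pd.Series(["low", "high", "mid", "unknown"], dtype="category")

# Keep three categories in a meaningful order; "unknown" becomes NaN.
sized = s.cat.set_categories(["low", "mid", "high"], ordered=True)

missing_mask = sized.isna().tolist()
largest = sized.max()

print(list(sized.cat.categories), missing_mask, largest)
```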

Sorting categoricals

If a categorical is created with ordered=True, the order of its categories is meaningful: values sort by that order, and min() and max() are available:

In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

In [92]: s.sort_values(inplace=True)

In [93]: s
Out[93]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [94]: s.min(), s.max()
Out[94]: ('a', 'c')

as_ordered() and as_unordered() return a copy that is marked as ordered or unordered:

In [95]: s.cat.as_ordered()
Out[95]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [96]: s.cat.as_unordered()
Out[96]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
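The difference between the two states shows up in min() and max(): they raise TypeError on an unordered categorical and work once it is marked as ordered. A short sketch:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# min() is not defined while the categories are unordered.
try:
    s.min()
    unordered_min_failed = False
except TypeError:
    unordered_min_failed = True

# as_ordered() returns a new Series; s itself stays unordered.
ordered = s.cat.as_ordered()
lowest, highest = ordered.min(), ordered.max()

print(unordered_min_failed, lowest, highest)
```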

Reordering categories

Categorical.reorder_categories() changes the order of the existing categories:

In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

In [105]: s
Out[105]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
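reorder_categories only permutes the existing categories, so the new list must contain exactly the same items. The sketch below shows the effect on sorting and the error raised when an item is left out.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")
reordered = s.cat.reorder_categories([2, 3, 1], ordered=True)

# The values are unchanged, but sorting now follows 2 < 3 < 1.
sorted_values = reordered.sort_values().tolist()

# Leaving out a category is an error, not a removal.
try:
    s.cat.reorder_categories([2, 3])
    mismatch_failed = False
except ValueError:
    mismatch_failed = True

print(sorted_values, mismatch_failed)
```

To add or drop categories while reordering, set_categories is the method to reach for.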

Sorting by multiple columns

sort_values can sort by several columns, and a categorical column sorts by its category order:

In [109]: dfs = pd.DataFrame(
   .....:     {
   .....:         "a": pd.Categorical(
   .....:             list("bbeebbaa"),
   .....:             categories=["e", "a", "b"],
   .....:             ordered=True,
   .....:         ),
   .....:         "b": [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )
   .....: 

In [110]: dfs.sort_values(by=["a", "b"])
Out[110]: 
   a  b
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Comparisons

If a categorical is created with ordered=True, it supports all of ==, !=, >, >=, <, and <=. Two categoricals can only be compared when they have the same categories in the same order; == and != are also available on unordered categoricals.

In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

In [119]: cat > cat_base
Out[119]: 
0     True
1    False
2    False
dtype: bool

In [120]: cat > 2
Out[120]: 
0     True
1    False
2    False
dtype: bool
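The session above defines cat_base2 without using it. As a sketch of what it illustrates: its categories are inferred from the data and therefore differ from those of cat, so an ordering comparison between the two raises TypeError.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Same categories: compared by category order 3 < 2 < 1.
same_categories = (cat > cat_base).tolist()

# Different categories: the comparison is rejected.
try:
    cat > cat_base2
    mismatch_failed = False
except TypeError:
    mismatch_failed = True

# Equality against a scalar that is a category.
equal_to_two = (cat == 2).tolist()

print(same_categories, mismatch_failed, equal_to_two)
```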

Other operations

A categorical column is still a Series, so most Series operations work on it, for example Series.min(), Series.max(), and Series.mode().

value_counts (categories with no rows are reported with a count of 0):

In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

In [132]: s.value_counts()
Out[132]: 
c    2
a    1
b    1
d    0
dtype: int64

DataFrame.sum() (the level argument used here exists only in pandas versions before 2.0):

In [133]: columns = pd.Categorical(
   .....:     ["one", "one", "two"], categories=["one", "two", "three"], ordered=True
   .....: )
   .....: 

In [134]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([["a", "b", "b"], columns]),
   .....: )
   .....: 

In [135]: df.sum(axis=1, level=1)
Out[135]: 
   one  two  three
0    3    3      0
1    9    6      0

groupby:

In [136]: cats = pd.Categorical(
   .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
   .....: )
   .....: 

In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

In [138]: df.groupby("cats").mean()
Out[138]: 
      values
cats        
a        1.0
b        2.0
c        4.0
d        NaN

In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [140]: df2 = pd.DataFrame(
   .....:     {
   .....:         "cats": cats2,
   .....:         "b": ["c", "d", "c", "d"],
   .....:         "values": [1, 2, 3, 4],
   .....:     }
   .....: )
   .....: 

In [141]: df2.groupby(["cats", "b"]).mean()
Out[141]: 
        values
cats b        
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
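Whether empty categories such as d appear in a groupby result is controlled by the observed argument. Its default has differed between pandas releases, so this sketch passes it explicitly in both directions; the result names are illustrative.

```python
import pandas as pd

cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False: one group per declared category, including empty ones.
all_groups = df.groupby("cats", observed=False)["values"].mean()

# observed=True: only categories that actually occur in the data.
seen_groups = df.groupby("cats", observed=True)["values"].mean()

print(all_groups)
print(seen_groups)
```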

pivot tables:

In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [143]: df = pd.DataFrame({"a": raw_cat, "b": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

In [144]: pd.pivot_table(df, values="values", index=["a", "b"])
Out[144]: 
     values
a b        
a c       1
  d       2
b c       3
  d       4
