[Course] Introduction to Data Science in Python

程序员文章站 2024-01-04 21:19:22

Week 2: Basic Processing with Pandas

Introduction

Stack Overflow is the best place to ask pandas-related questions.
Other sources:
Learning the Pandas Library by Matt Harrison
Planet Python (planetpython.org) or its Twitter account @PlanetPython
Data Skeptic Podcast

The Series Data Structure

See the documentation of Series

import pandas as pd
# a trailing ? is IPython/Jupyter syntax for displaying the docstring
pd.Series?
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
>>> 0	Tiger
	1	Bear
	2	Moose
	dtype: object

None type in pandas

animals = ['Tiger', 'Bear', None]  # as a Series, index 2 holds None
animals = [1, 2, None]  # as a Series, index 2 holds NaN and the dtype is float64
# NaN is not the same as None
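
The difference between the two missing-value markers can be checked directly. The following is a minimal sketch rather than part of the course material; the variable names `names` and `numbers` are made up for illustration, and the dtype of the string series is assumed to be `object` as in pandas 1.x and 2.x.

```python
import numpy as np
import pandas as pd

# A list of strings containing None (illustrative data).
names = pd.Series(['Tiger', 'Bear', None])
# A list of numbers containing None: pandas converts it to float64,
# and None becomes NaN.
numbers = pd.Series([1, 2, None])

print(names.dtype)    # object on pandas 1.x and 2.x
print(numbers.dtype)  # float64

# NaN is not equal to None, and it is not even equal to itself,
# so test for it with np.isnan or pd.isna instead of ==.
print(np.nan == np.nan)             # False
print(np.isnan(numbers.iloc[2]))    # True
print(pd.isna(names.iloc[2]))       # True
```

`pd.isna` is the safer choice in general because it accepts both `None` and `NaN`, while `np.isnan` only works on numeric values.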

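When a Series is created from a dictionary, the index is set to the dictionary keys. If an explicit index is also passed, only the listed keys are kept, and any label that is not a key in the dictionary gets NaN.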

import pandas as pd
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s
>>>	Golf	Scotland
	Sumo	Japan
	Hockey	NaN
	dtype:	object
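
To make the two cases easier to compare, here is a small sketch with a shortened dictionary. The names `by_keys` and `picked` are hypothetical and used only for this illustration.

```python
import pandas as pd

sports = {'Archery': 'Bhutan', 'Golf': 'Scotland'}

# Without an explicit index, the dictionary keys become the index labels.
by_keys = pd.Series(sports)
print(list(by_keys.index))     # ['Archery', 'Golf']

# With an explicit index, only the listed labels are kept, in that order.
# 'Hockey' is not a key in the dictionary, so its value is missing.
picked = pd.Series(sports, index=['Golf', 'Hockey'])
print(list(picked.index))      # ['Golf', 'Hockey']
print(picked.isna().tolist())  # [False, True]
```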

Querying a Series

iloc and loc

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s.iloc[3]
>>> 'South Korea'
s.loc['Golf']
>>> 'Scotland'

iloc and loc are not methods; they are attributes, so they are used with square brackets rather than parentheses. iloc returns the value directly.

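s[3]  # falls back to position because the index holds strings; newer pandas versions deprecate this, so prefer s.iloc[3]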
>>>'South Korea'
s['Golf']
>>>'Scotland'
sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)
s[0]  # This won't call s.iloc[0] as one might expect; with integer labels it is a label lookup and raises an error
>>> KeyError: 0
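
The safest way around this ambiguity is to state the intent explicitly with iloc or loc. A short sketch using the same integer-labelled data:

```python
import pandas as pd

s = pd.Series({99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'})

print(s.iloc[0])  # position 0 -> Bhutan
print(s.loc[99])  # label 99   -> Bhutan

# Plain brackets are treated as a label lookup on an integer index,
# and there is no label 0, so this raises KeyError.
try:
    s[0]
except KeyError as err:
    print('KeyError:', err)
```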

Use NumPy's vectorization to speed up operations on a Series.

s = pd.Series([100.00, 120.00, 101.00, 3.00])
import numpy as np
total = np.sum(s)
print(total)
>>>324.0

Comparing the run time of the non-vectorized and vectorized approaches in Jupyter Notebook

%%timeit -n 10
# create a big series of random numbers, then update it one element at a time
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.items():  # iteritems() was removed in pandas 2.0
    s.loc[label] = value + 2
>>> 1.44 s ± 58.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2
>>> 288 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
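
Outside a notebook, where the %%timeit cell magic is not available, a similar comparison can be sketched with the standard library's timeit module. The function names `add_two_loop` and `add_two_vectorized` are hypothetical, the series is kept smaller so the loop finishes quickly, and no timings are shown because they depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd


def add_two_loop(s):
    """Add 2 to every element, one label at a time (slow)."""
    out = s.copy()
    for label, value in s.items():
        out.loc[label] = value + 2
    return out


def add_two_vectorized(s):
    """Add 2 to every element in a single vectorized operation (fast)."""
    return s + 2


s = pd.Series(np.random.randint(0, 1000, 1000))

# Both approaches should produce the same values.
assert add_two_loop(s).tolist() == add_two_vectorized(s).tolist()

# Time each approach over a few repetitions.
loop_time = timeit.timeit(lambda: add_two_loop(s), number=3)
vec_time = timeit.timeit(lambda: add_two_vectorized(s), number=3)
print(f'loop: {loop_time:.4f} s, vectorized: {vec_time:.4f} s')
```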

Index values do not have to be unique, which makes pandas data structures conceptually different from a table in a relational database. Here is an example:

original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'], 
                                   index=['Cricket',
                                          'Cricket',
                                          'Cricket',
                                          'Cricket'])
all_countries = pd.concat([original_sports, cricket_loving_countries])  # Series.append() was removed in pandas 2.0
original_sports
>>>	Archery		Bhutan
	Golf		Scotland
	Sumo		Japan
	Taekwondo	South Korea
	dtype:		object

all_countries
>>>	Archery		Bhutan
	Golf		Scotland
	Sumo		Japan
	Taekwondo	South Korea
	Cricket		Australia
	Cricket		Barbados
	Cricket		Pakistan
	Cricket		England
	dtype:		object

all_countries.loc['Cricket']
>>>	Cricket		Australia
	Cricket		Barbados
	Cricket		Pakistan
	Cricket		England
	dtype:		object

The original series is not changed; combining the two returns a new series.
When 'Cricket' is used as the index label, we don't get a single value back but a series.
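
Both points can be checked in a few lines. The sketch below uses shortened data and the hypothetical names `sports`, `cricket`, and `combined`; it relies on `pd.concat` and `Index.is_unique`, which are not mentioned in the notes above.

```python
import pandas as pd

sports = pd.Series({'Golf': 'Scotland', 'Sumo': 'Japan'})
cricket = pd.Series(['Australia', 'England'], index=['Cricket', 'Cricket'])

# pd.concat returns a new Series and leaves both inputs untouched.
combined = pd.concat([sports, cricket])
print(len(sports), len(combined))              # 2 4

# is_unique reports whether every index label appears only once.
print(combined.index.is_unique)                # False

# A label that appears once yields a scalar;
# a repeated label yields a Series of all matching rows.
print(type(combined.loc['Golf']).__name__)     # str
print(type(combined.loc['Cricket']).__name__)  # Series
```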
