[Course] Introduction to Data Science in Python

Week 2: Basic Processing with Pandas


Stack Overflow is the best place to ask questions about pandas.
Other sources:
Learning the Pandas Library by Matt Harrison
planetpython.org or its Twitter account @PlanetPython
Data Skeptic Podcast

The Series Data Structure

See the documentation of Series

import pandas as pd
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
>>> 0    Tiger
    1     Bear
    2    Moose
    dtype: object

None type in pandas

animals = ['Tiger', 'Bear', None]  # as a Series, index 2 holds None
numbers = [1, 2, None]  # as a Series, index 2 holds NaN and the dtype is float64
# NaN != None
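
Because NaN and None are different things, an equality comparison is not a reliable way to find missing values. The following is a minimal sketch of the checks involved; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# A numeric list containing None is stored as float64, with None converted to NaN.
numbers = pd.Series([1, 2, None])

# NaN is not equal to None, and it is not even equal to itself.
nan_equals_none = np.nan == None  # noqa: E711 (deliberate comparison)
nan_equals_nan = np.nan == np.nan

# Use a dedicated test instead of an equality comparison.
missing_mask = numbers.isna()

print(numbers.dtype)          # float64
print(nan_equals_none)        # False
print(nan_equals_nan)         # False
print(missing_mask.tolist())  # [False, False, True]
```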

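When a Series is built from a dictionary, the index is set to the dictionary keys. Passing an explicit index keeps only the listed labels, and a label that is not a key in the dictionary gets NaN.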

import pandas as pd
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
>>> Golf      Scotland
    Sumo         Japan
    Hockey         NaN
    dtype: object
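
As a small sketch of the two cases (with and without an explicit index), the snippet below builds both Series and inspects the result. The names s_all and s_some are only illustrative.

```python
import pandas as pd

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}

# Without an explicit index, the dictionary keys become the index.
s_all = pd.Series(sports)

# With an explicit index, only the listed labels are kept;
# 'Hockey' is not a key in the dictionary, so its value is missing.
s_some = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])

print(list(s_all.index))       # ['Archery', 'Golf', 'Sumo', 'Taekwondo']
print(s_some.isna().tolist())  # [False, False, True]
```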

Querying a Series

iloc and loc

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s.iloc[3]  # query by position
>>> 'South Korea'
s.loc['Golf']  # query by label
>>> 'Scotland'

iloc and loc are not methods, they are attributes, so they are used with square brackets rather than parentheses. iloc returns the value directly.

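s[3]  # with string labels, an integer key is treated as a position (deprecated in recent pandas versions)
>>> 'South Korea'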
sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)
s[0]  # This does not call s.iloc[0] as one might expect; it raises a KeyError because 0 is not a label
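
With integer labels, being explicit about position versus label avoids the ambiguity. A minimal sketch, using illustrative variable names:

```python
import pandas as pd

s = pd.Series({99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'})

by_position = s.iloc[0]  # first element, regardless of the labels
by_label = s.loc[99]     # element whose label is 99

# Plain s[0] is interpreted as a label lookup here, and 0 is not a label.
try:
    s[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(by_position)    # Bhutan
print(by_label)       # Bhutan
print(lookup_failed)  # True
```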

Use NumPy's vectorized operations to speed up data manipulation.

import numpy as np
s = pd.Series([100.00, 120.00, 101.00, 3.00])
total = np.sum(s)

Comparing the run time of the non-vectorized and vectorized versions in Jupyter Notebook

%%timeit -n 10
# create a big Series of random numbers, then add 2 to each element in a loop
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.items():  # iteritems() was removed in pandas 2.0
    s.loc[label] = value + 2
>>> 1.44 s ± 58.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
# same kind of Series, updated with a single vectorized operation
s = pd.Series(np.random.randint(0,1000,10000))
s += 2
>>> 288 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
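
The %%timeit magic only works inside Jupyter/IPython. As a rough sketch of the same comparison in a plain Python script, the snippet below uses time.perf_counter and also checks that both approaches give the same result. The variable names are illustrative, and the timings will vary by machine.

```python
import time

import numpy as np
import pandas as pd

values = np.random.randint(0, 1000, 10000)

# Loop version: update one element at a time by label.
looped = pd.Series(values.copy())  # copy so the source array is not modified
start = time.perf_counter()
for label, value in looped.items():
    looped.loc[label] = value + 2
loop_seconds = time.perf_counter() - start

# Vectorized version: one operation over the whole Series.
vectorized = pd.Series(values.copy())
start = time.perf_counter()
vectorized += 2
vectorized_seconds = time.perf_counter() - start

# Both approaches should produce identical data.
print(looped.equals(vectorized))
print(f"loop: {loop_seconds:.4f} s, vectorized: {vectorized_seconds:.6f} s")
```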

Index values do not have to be unique, which makes pandas data structures conceptually different from a relational database table, where keys are typically unique. For example:

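original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia', 'Barbados', 'Pakistan', 'England'],
                                     index=['Cricket', 'Cricket', 'Cricket', 'Cricket'])
# Series.append was removed in pandas 2.0; pd.concat gives the same result
all_countries = pd.concat([original_sports, cricket_loving_countries])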
original_sports
>>> Archery           Bhutan
    Golf            Scotland
    Sumo               Japan
    Taekwondo    South Korea
    dtype: object

all_countries
>>> Archery           Bhutan
    Golf            Scotland
    Sumo               Japan
    Taekwondo    South Korea
    Cricket        Australia
    Cricket         Barbados
    Cricket         Pakistan
    Cricket          England
    dtype: object

all_countries.loc['Cricket']
>>> Cricket    Australia
    Cricket     Barbados
    Cricket     Pakistan
    Cricket      England
    dtype: object

The original Series is left unchanged; combining the two returns a new Series.
When querying with 'Cricket' as the label, we don't get a single value, but a Series.
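
A short sketch of how a lookup behaves when a label is repeated, using a smaller, made-up Series for brevity (the variable names are illustrative):

```python
import pandas as pd

all_countries = pd.Series(
    ['Bhutan', 'Scotland', 'Australia', 'Barbados'],
    index=['Archery', 'Golf', 'Cricket', 'Cricket'])

unique_lookup = all_countries.loc['Golf']       # label appears once: a single value
repeated_lookup = all_countries.loc['Cricket']  # label appears twice: a Series

print(all_countries.index.is_unique)   # False
print(unique_lookup)                   # Scotland
print(type(repeated_lookup).__name__)  # Series
print(repeated_lookup.tolist())        # ['Australia', 'Barbados']
```

Checking index.is_unique before a lookup is one way to know in advance whether loc may return a Series instead of a scalar.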

