[Course] Introduction to Data Science in Python
Week 2 : Basic Processing with Pandas
Introduction
When using pandas, Stack Overflow is the best place to ask pandas-related questions.
Other sources:
Learning the Pandas Library by Matt Harrison
planetpython.org or its Twitter account @PlanetPython
Data Skeptic Podcast
The Series Data Structure
See the documentation of Series
import pandas as pd
pd.Series?
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
>>> 0 Tiger
1 Bear
2 Moose
dtype: object
None type in pandas
animals=['Tiger', 'Bear', None] # the value at index 2 is stored as None
animals=[1, 2, None] # the value at index 2 becomes NaN, and the dtype is float64
# NaN != None
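The difference between NaN and None can be checked directly. The following is a minimal sketch; the variable names are illustrative and not from the course notebook.

```python
import numpy as np
import pandas as pd

# A numeric list containing None is converted to float64, with None stored as NaN
numbers = pd.Series([1, 2, None])
numbers_dtype = str(numbers.dtype)

# NaN is not equal to None, and it is not even equal to itself
nan_equals_none = (np.nan == None)
nan_equals_nan = (np.nan == np.nan)

# Use np.isnan (or Series.isna) to test for missing values instead of ==
nan_detected = bool(np.isnan(np.nan))
missing_mask = numbers.isna().tolist()

print(numbers_dtype, nan_equals_none, nan_equals_nan, nan_detected, missing_mask)
```

Because equality tests against NaN always fail, missing values should be located with `isna()` rather than with a comparison.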
The index value can be set to the keys from our dictionary.
import pandas as pd
sports = {'Archery': 'Bhutan',
'Golf': 'Scotland',
'Sumo': 'Japan',
'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s
>>> Golf Scotland
Sumo Japan
Hockey NaN
dtype: object
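When an explicit index is passed together with a dictionary, only the listed labels are kept, and labels with no matching key get a missing value. A small sketch of how to confirm this, using the same data as above (the helper variable names are assumptions for illustration):

```python
import pandas as pd

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])

# Keys that are not listed in index= are dropped from the result
archery_kept = 'Archery' in s.index

# Labels with no matching key ('Hockey') are filled with a missing value
missing_labels = s[s.isna()].index.tolist()

print(archery_kept, missing_labels)
```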
Querying a Series
iloc and loc
sports = {'Archery': 'Bhutan',
'Golf': 'Scotland',
'Sumo': 'Japan',
'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s.iloc[3]
>>> 'South Korea'
s.loc['Golf']
>>> 'Scotland'
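iloc and loc are not methods; they are attributes, so they take square brackets rather than parentheses. iloc returns the value directly.
s[3] # falls back to position because the labels are strings; deprecated in recent pandas versions, prefer s.iloc[3]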
>>>'South Korea'
s['Golf']
>>>'Scotland'
sports = {99: 'Bhutan',
100: 'Scotland',
101: 'Japan',
102: 'South Korea'}
s = pd.Series(sports)
s[0] # This won't call s.iloc[0] as one might expect; the integer is treated as a label, so it raises an error instead
>>> KeyError: 0
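With integer labels, being explicit about label versus position avoids the ambiguity. A minimal sketch, assuming the same integer-keyed series as above:

```python
import pandas as pd

s = pd.Series({99: 'Bhutan',
               100: 'Scotland',
               101: 'Japan',
               102: 'South Korea'})

# iloc always selects by position
first_by_position = s.iloc[0]

# loc always selects by label
by_label = s.loc[100]

# Plain brackets treat the integer as a label, and no label 0 exists
try:
    s[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_by_position, by_label, lookup_failed)
```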
Using NumPy's vectorization to speed up data manipulation.
s = pd.Series([100.00, 120.00, 101.00, 3.00])
import numpy as np
total = np.sum(s)
print(total)
>>>324.0
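Compare the timing of the non-vectorized and vectorized versions in Jupyter Notebook
%%timeit -n 10
# this creates a big series of random numbers
s = pd.Series(np.random.randint(0,1000,10000))
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for label, value in s.items():
    s.loc[label] = value+2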
>>> 1.44 s ± 58.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2
>>> 288 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
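The two timed cells compute the same thing. The sketch below checks that outside of `%%timeit`, on a smaller series so it runs quickly; the seed, the size, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# A smaller series than in the timing cells, with a fixed seed for repeatability
rng = np.random.default_rng(0)
original = pd.Series(rng.integers(0, 1000, 1000))

# Non-vectorized: update one element at a time by label
looped = original.copy()
for label, value in original.items():
    looped.loc[label] = value + 2

# Vectorized: one operation applied to the whole series
vectorized = original + 2

same_result = looped.equals(vectorized)
print(same_result)
```

Since the results are identical, the vectorized form is the better choice whenever an operation can be expressed on the whole series.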
Here is an example where the index values are not unique. This makes pandas objects conceptually different from what a relational database might be.
original_sports = pd.Series({'Archery': 'Bhutan',
'Golf': 'Scotland',
'Sumo': 'Japan',
'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
'Barbados',
'Pakistan',
'England'],
index=['Cricket',
'Cricket',
'Cricket',
'Cricket'])
all_countries = pd.concat([original_sports, cricket_loving_countries]) # Series.append was removed in pandas 2.0
original_sports
>>> Archery Bhutan
Golf Scotland
Sumo Japan
Taekwondo South Korea
dtype: object
all_countries
>>> Archery Bhutan
Golf Scotland
Sumo Japan
Taekwondo South Korea
Cricket Australia
Cricket Barbados
Cricket Pakistan
Cricket England
dtype: object
all_countries.loc['Cricket']
>>> Cricket Australia
Cricket Barbados
Cricket Pakistan
Cricket England
dtype: object
Combining the two series returns a new series; the original series values are not changed.
When using ‘Cricket’ as the index, we don’t get a single value, but a series itself.
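The return type of `loc` therefore depends on whether the label is repeated. A short sketch with a reduced version of the data above (the shortened series and variable names are assumptions for illustration):

```python
import pandas as pd

sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
cricket = pd.Series(['Australia', 'England'], index=['Cricket', 'Cricket'])

# pd.concat returns a new series and leaves both inputs untouched
combined = pd.concat([sports, cricket])

# A unique label gives back a single value
unique_result = combined.loc['Golf']

# A repeated label gives back a Series containing every match
repeated_result = combined.loc['Cricket']

# index.is_unique reports whether any label is repeated
index_is_unique = combined.index.is_unique

print(type(unique_result).__name__, type(repeated_result).__name__, index_is_unique)
```

Checking `index.is_unique` before a lookup is one way to know in advance which kind of result to expect.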