[Course] Introduction to Data Science in Python

程序员文章站 2024-01-04 21:19:22

Week 2 : Basic Processing with Pandas

Introduction
The Series Data Structure
Querying a Series

Introduction

Stack Overflow is the best place to ask questions related to pandas.
Other sources:
Learning the Pandas Library by Matt Harrison
planetpython.org or its Twitter account @PlanetPython
Data Skeptic Podcast

The Series Data Structure

See the documentation of Series

import pandas as pd
pd.Series?

animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
>>> 0	Tiger
	1	Bear
	2	Moose
	dtype: object

None type in pandas

animals = ['Tiger', 'Bear', None]  # as a Series, the value at index 2 is None
animals = [1, 2, None]  # as a Series, the value at index 2 is NaN and the dtype is float64
# NaN != None
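
The difference between None and NaN can be checked directly. The following is a minimal sketch, assuming a recent pandas and NumPy; the variable name `numbers` is introduced here for illustration and does not appear in the course notes.

```python
import numpy as np
import pandas as pd

# Strings mixed with None: the last entry is treated as a missing value.
animals = pd.Series(['Tiger', 'Bear', None])
# Numbers mixed with None: the values are converted to float64 and None becomes NaN.
numbers = pd.Series([1, 2, None])

print(numbers.dtype)            # float64

# NaN is not equal to None, and it is not even equal to itself,
# so missing values should be detected with isna()/np.isnan, not with ==.
print(np.nan == None)           # False
print(np.nan == np.nan)         # False
print(np.isnan(np.nan))         # True
print(animals.isna().tolist())  # [False, False, True]
print(numbers.isna().tolist())  # [False, False, True]
```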

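When a Series is created from a dictionary, the index is set to the dictionary's keys. If an explicit index is passed as well, only the matching keys are kept, and any label that is missing from the dictionary gets NaN.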

import pandas as pd
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s
>>>	Golf	Scotland
	Sumo	Japan
	Hockey	NaN
	dtype:	object
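
To make the two cases easier to compare, here is a small sketch that builds the Series with and without an explicit index. The name `s_all` is an illustrative assumption; the data is the same dictionary used above.

```python
import pandas as pd

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}

# Without an explicit index, the dictionary keys become the index.
s_all = pd.Series(sports)
print(list(s_all.index))     # ['Archery', 'Golf', 'Sumo', 'Taekwondo']

# With an explicit index, only the listed labels are kept.
# 'Hockey' is not a key in the dictionary, so its value is missing.
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
print(list(s.index))         # ['Golf', 'Sumo', 'Hockey']
print(s.isna().tolist())     # [False, False, True]
print('Archery' in s.index)  # False
```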

Querying a Series

iloc and loc

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s.iloc[3]
>>> 'South Korea'
s.loc['Golf']
>>> 'Scotland'

iloc and loc are not methods, they are attributes, so they are used with square brackets rather than parentheses. iloc returns the value directly.

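# the index holds strings, so an integer falls back to positional lookup;
# recent pandas versions deprecate this fallback, so s.iloc[3] is the safer form
s[3]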
>>>'South Korea'
s['Golf']
>>>'Scotland'

sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)
s[0]  # This won't call s.iloc[0] as one might expect; the integer is treated as a label, so it raises an error instead
>>> KeyError: 0
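
The ambiguity disappears when iloc and loc are used explicitly. A minimal sketch with the same integer-labelled data, assuming nothing beyond pandas itself; the flag `raised` is a hypothetical helper name:

```python
import pandas as pd

s = pd.Series({99: 'Bhutan',
               100: 'Scotland',
               101: 'Japan',
               102: 'South Korea'})

# iloc always means position, loc always means label.
print(s.iloc[0])   # Bhutan
print(s.loc[99])   # Bhutan

# Plain brackets on an integer index are treated as a label lookup,
# and there is no label 0, so a KeyError is raised.
raised = False
try:
    s[0]
except KeyError as err:
    raised = True
    print('KeyError:', err)  # KeyError: 0
```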

Use NumPy's vectorization to speed up operations on a Series.

s = pd.Series([100.00, 120.00, 101.00, 3.00])
import numpy as np
total = np.sum(s)
print(total)
>>>324.0

Compare the timing of the non-vectorized and vectorized versions in a Jupyter Notebook:

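%%timeit -n 10
# create a big series of random numbers, then add 2 to every value with a Python loop
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.items():  # items() replaces iteritems(), which was removed in pandas 2.0
    s.loc[label] = value + 2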
>>> 1.44 s ± 58.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2
>>> 288 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
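
Outside a notebook, the same comparison can be sketched with `time.perf_counter`. This is only an illustration under stated assumptions: the series is shortened to 1000 values so the loop finishes quickly, the names `looped`, `vectorized`, `loop_seconds` and `vector_seconds` are made up for this example, and the timings will differ from machine to machine.

```python
import time

import numpy as np
import pandas as pd

values = np.random.randint(0, 1000, 1000)

# Non-vectorized: update one element at a time through loc.
# The array is copied so the loop does not modify `values` itself.
looped = pd.Series(values.copy())
start = time.perf_counter()
for label, value in looped.items():
    looped.loc[label] = value + 2
loop_seconds = time.perf_counter() - start

# Vectorized: a single operation applied to the whole series.
vectorized = pd.Series(values.copy())
start = time.perf_counter()
vectorized += 2
vector_seconds = time.perf_counter() - start

# Both approaches produce the same values; only the running time differs.
print(looped.tolist() == vectorized.tolist())
print(f'loop: {loop_seconds:.6f} s, vectorized: {vector_seconds:.6f} s')
```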

Index values do not have to be unique, and this makes data frames conceptually different from a table in a relational database. Here is an example:

original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'],
                                   index=['Cricket',
                                          'Cricket',
                                          'Cricket',
                                          'Cricket'])
# Series.append was removed in pandas 2.0; pd.concat is the equivalent call
all_countries = pd.concat([original_sports, cricket_loving_countries])
original_sports
>>>	Archery		Bhutan
	Golf		Scotland
	Sumo		Japan
	Taekwondo	South Korea
	dtype:		object

all_countries
>>>	Archery		Bhutan
	Golf		Scotland
	Sumo		Japan
	Taekwondo	South Korea
	Cricket		Australia
	Cricket		Barbados
	Cricket		Pakistan
	Cricket		England
	dtype:		object

all_countries.loc['Cricket']
>>>	Cricket		Australia
	Cricket     Barbados
	Cricket     Pakistan
	Cricket		England
	dtype:		object

The original series is not changed; combining the two returns a new series.
When querying with 'Cricket' as the label, we don't get a single value back, but a Series.
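
Both observations can be checked in code. The sketch below uses a shortened version of the data above and `pd.concat` to combine the two series; the name `cricket` is an illustrative assumption.

```python
import pandas as pd

original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
cricket = pd.Series(['Australia', 'England'], index=['Cricket', 'Cricket'])

# Combining returns a new series and leaves the inputs untouched.
all_countries = pd.concat([original_sports, cricket])
print(len(original_sports))   # 2
print(len(all_countries))     # 4

# The combined index contains the label 'Cricket' twice.
print(all_countries.index.is_unique)                       # False

# A unique label returns a single value, a repeated label returns a Series.
print(all_countries.loc['Golf'])                           # Scotland
print(isinstance(all_countries.loc['Cricket'], pd.Series)) # True
print(all_countries.loc['Cricket'].tolist())               # ['Australia', 'England']
```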
