Introduction to Data Science in Python Week 4
2024-01-04 22:36:46
Week 4: Statistical Analysis in Python and Project
Distributions
- The set of all possible values a random variable can take, together with how likely each value is.
Binomial Distribution
import pandas as pd
import numpy as np
np.random.binomial(1, 0.5)  # first argument: number of trials; second argument: probability of success (getting a 1) on each trial
To estimate how often a tornado occurs on two consecutive days:
chance_of_tornado = 0.01  # probability of a tornado on any given day
tornado_events = np.random.binomial(1, chance_of_tornado, 1000000)  # one draw per day, 1,000,000 days
two_days_in_a_row = 0
for j in range(1, len(tornado_events)):
    if tornado_events[j] == 1 and tornado_events[j-1] == 1:
        two_days_in_a_row += 1
print('{} tornadoes back to back in {} years'.format(two_days_in_a_row, 1000000/365))
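The loop above can also be written as a single vectorized comparison. This is a minimal sketch, not the course's code: the `default_rng` generator, the seed, and the `expected` sanity check are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
chance_of_tornado = 0.01
n_days = 1_000_000

# One Bernoulli draw per day: 1 means a tornado occurred
tornado_events = rng.binomial(1, chance_of_tornado, n_days)

# Compare every day with the day before it, without a Python loop
two_days_in_a_row = int(np.sum((tornado_events[1:] == 1) & (tornado_events[:-1] == 1)))

# If days are independent, each adjacent pair is a double hit with probability p**2
expected = (n_days - 1) * chance_of_tornado ** 2

print(f'{two_days_in_a_row} back-to-back tornado days (expected about {expected:.0f})')
```

The simulated count should land near the expected value, which is a quick way to check that the loop logic is right.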
Uniform Distribution
np.random.uniform(0, 1)
# np.random.uniform(low,high,size)
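Normal (Gaussian) Distribution (default mean is zero)
np.random.normal(0.75)  # loc=0.75 (the mean); scale keeps its default of 1.0
# np.random.normal(loc, scale, size)
# loc: center (mean) of the distribution
# scale: standard deviation; the larger the scale, the wider and flatter the bell curve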
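Since `loc` is the first positional parameter, a single positional argument sets the mean rather than the standard deviation. A short sketch that makes both parameters explicit; the generator, seed, and sample sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Keyword arguments make the intent explicit
centered = rng.normal(loc=0.75, scale=1.0, size=100_000)  # mean 0.75, std 1
narrow = rng.normal(loc=0.0, scale=0.75, size=100_000)    # mean 0, std 0.75

print(round(centered.mean(), 2), round(centered.std(), 2))
print(round(narrow.mean(), 2), round(narrow.std(), 2))
```

The sample mean and standard deviation of each array should be close to the `loc` and `scale` it was drawn with.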
How to calculate the standard deviation
distribution = np.random.normal(0.75,size=1000)
np.sqrt(np.sum((np.mean(distribution)-distribution)**2)/len(distribution))
or
np.std(distribution)
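The manual formula divides by N, which is what `np.std` does by default (`ddof=0`). The sketch below, with an arbitrary seed, checks that equivalence and also shows the N - 1 variant that is often used for a sample:

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
distribution = rng.normal(0.75, size=1000)

n = len(distribution)
squared_dev = (distribution - distribution.mean()) ** 2

population_std = np.sqrt(squared_dev.sum() / n)    # divide by N
sample_std = np.sqrt(squared_dev.sum() / (n - 1))  # divide by N - 1 (Bessel's correction)

print(np.isclose(population_std, np.std(distribution)))      # ddof=0 is the default
print(np.isclose(sample_std, np.std(distribution, ddof=1)))  # ddof=1 gives the sample version
```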
Kurtosis
import scipy.stats as stats
stats.kurtosis(distribution)  # negative: flatter than a normal distribution; positive: more peaked
stats.skew(distribution)  # measures asymmetry around the center; close to 0 for a symmetric distribution such as the normal
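To get a feel for the sign of the kurtosis, it can help to compare a few known shapes. A rough sketch; the three distributions, the seed, and the sample size are choices made here for illustration. `stats.kurtosis` reports excess kurtosis by default, so a normal sample comes out near 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # arbitrary seed
n = 200_000

samples = {
    'uniform': rng.uniform(-1, 1, n),  # flat top, light tails
    'normal': rng.normal(0, 1, n),     # reference shape
    'laplace': rng.laplace(0, 1, n),   # sharp peak, heavy tails
}

results = {}
for name, values in samples.items():
    results[name] = (stats.kurtosis(values), stats.skew(values))
    print(f'{name:8s} kurtosis={results[name][0]:6.2f} skew={results[name][1]:6.2f}')
```

In theory the excess kurtosis is -1.2 for the uniform and 3 for the Laplace distribution, and all three are symmetric, so their skew should be close to 0.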
Chi-Squared Distribution
- Right-skewed (positive skew): most values sit on the left, with a long tail to the right
- Degrees of freedom (one parameter)
chi_squared_df2 = np.random.chisquare(2, size=10000)
stats.skew(chi_squared_df2)
>>>2.067857561010524
chi_squared_df5 = np.random.chisquare(5, size=10000)
stats.skew(chi_squared_df5)
>>>1.3091894938388848
# As the degrees of freedom increase, the skew decreases and the curve becomes more symmetric
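The two skew values above match the theoretical skewness of a chi-squared distribution with k degrees of freedom, which is sqrt(8 / k). A small sketch comparing sample and theory; the extra `dof` values and the seed are assumptions added here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed

sample_skews = {}
for dof in (2, 5, 20, 100):
    sample = rng.chisquare(dof, size=200_000)
    sample_skews[dof] = stats.skew(sample)
    theoretical = np.sqrt(8 / dof)  # skewness of chi-squared with dof degrees of freedom
    print(f'dof={dof:3d} sample skew={sample_skews[dof]:.3f} theoretical={theoretical:.3f}')
```

The skew stays positive but shrinks toward 0 as the degrees of freedom grow.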
Modality
A distribution can have more than one peak (bimodal or multimodal).
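One simple way to produce a two-peaked sample is to mix two normal distributions. This is only an illustrative sketch; the centers, sizes, and bin layout are made up:

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed

# Mix two normal distributions with well-separated centers
bimodal = np.concatenate([
    rng.normal(loc=-3, scale=1, size=5000),
    rng.normal(loc=3, scale=1, size=5000),
])

# 12 equal-width bins between -6 and 6, printed as a rough text histogram
counts, edges = np.histogram(bimodal, bins=12, range=(-6, 6))
for left, count in zip(edges[:-1], counts):
    print(f'{left:5.1f} {"#" * (int(count) // 100)}')
```

The bars should be tall around -3 and 3 and nearly empty around 0.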
Hypothesis Testing in Python
- Hypothesis: A statement we can test
- Alternative hypothesis: our idea, e.g. there’s a difference between groups
- Null hypothesis: the opposite of our idea, e.g. there’s no difference between groups
- The goal is to find enough evidence to reject the null hypothesis.
df = pd.read_csv('grades.csv')
df.head()
len(df)
early = df[df['assignment1_submission'] <= '2015-12-31']
late = df[df['assignment1_submission'] > '2015-12-31']
early.mean(numeric_only=True)  # numeric_only skips the string columns (needed in pandas 2.x)
late.mean(numeric_only=True)
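The split above compares timestamps as strings. That works for consistently formatted ISO dates, but a value such as '2015-12-31 23:59:59' sorts after '2015-12-31' as a string. A hedged sketch that parses the column first: the tiny frame below is made up (only the column names come from the notes), and the cutoff at the start of 2016 is an assumption about the intended split.

```python
import pandas as pd

# Tiny made-up stand-in for grades.csv (values are illustrative only)
df = pd.DataFrame({
    'assignment1_grade': [92.7, 86.8, 85.5, 86.0, 64.8, 71.6],
    'assignment1_submission': [
        '2015-11-02 06:55:34', '2015-11-29 14:57:44', '2015-12-31 23:59:59',
        '2016-01-01 00:00:01', '2016-01-11 12:41:50', '2016-02-03 09:15:00',
    ],
})

# Parse the timestamps so the comparison is on dates, not on strings
df['assignment1_submission'] = pd.to_datetime(df['assignment1_submission'])
cutoff = pd.Timestamp('2016-01-01')  # assumed boundary between early and late

early = df[df['assignment1_submission'] < cutoff]
late = df[df['assignment1_submission'] >= cutoff]

print(early['assignment1_grade'].mean(), late['assignment1_grade'].mean())
```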
- Critical Value alpha
- The threshold for how much chance of a false positive you are willing to accept
- typical values in social science are 0.1, 0.05 or 0.01
T-test
from scipy import stats
stats.ttest_ind?
>>>Signature: stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate')
>>>Docstring:
Calculates the T-test for the means of *two independent* samples of scores.
stats.ttest_ind(early['assignment1_grade'], late['assignment1_grade'])
>>>Ttest_indResult(statistic=1.400549944897566, pvalue=0.16148283016060577)
# The p-value (0.16) is larger than a typical alpha of 0.05, so there is no statistically significant difference between the two sample means and we cannot reject the null hypothesis.
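The decision rule can be sketched as a small helper. Everything here is illustrative: the data is synthetic, and `compare` is a hypothetical helper, not part of SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)  # arbitrary seed
alpha = 0.05  # significance threshold, chosen before looking at the data

# Synthetic grades: same_a and same_b share a mean, shifted does not
same_a = rng.normal(75, 10, 500)
same_b = rng.normal(75, 10, 500)
shifted = rng.normal(80, 10, 500)

def compare(a, b, alpha):
    """Run an independent two-sample t-test and report the decision on the null hypothesis."""
    result = stats.ttest_ind(a, b)
    decision = 'reject' if result.pvalue < alpha else 'fail to reject'
    return result.statistic, result.pvalue, decision

print(compare(same_a, same_b, alpha))
print(compare(same_a, shifted, alpha))
```

`ttest_ind` assumes equal variances by default; passing `equal_var=False` (visible in the signature above) runs Welch's t-test instead.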
P-hacking (Dredging)
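- Produces spurious correlations rather than generalizable results
- Doing many tests until you find one that is statistically significant
- At a significance level of 0.05, we expect about 1 false positive in every 20 tests, even when there is no real effect
- Remedies
- Bonferroni correction (tighten the significance threshold as the number of tests grows: divide alpha by the number of tests)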
- hold-out sets
- investigation pre-registration
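The "1 in 20" effect is easy to reproduce on pure noise. A minimal simulation sketch, assuming 1000 tests on samples of 50 points each (both numbers and the seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)  # arbitrary seed
alpha = 0.05
n_tests = 1000

# Every test compares two samples drawn from the SAME distribution,
# so any "significant" result is a false positive.
p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, 50), rng.normal(0, 1, 50)).pvalue
    for _ in range(n_tests)
])

naive_hits = int(np.sum(p_values < alpha))
bonferroni_hits = int(np.sum(p_values < alpha / n_tests))  # Bonferroni: alpha divided by the number of tests

print(f'{naive_hits} of {n_tests} tests pass at alpha={alpha}')
print(f'{bonferroni_hits} pass after Bonferroni correction (alpha/{n_tests})')
```

Roughly 5% of the uncorrected tests should look significant even though no real effect exists, while the corrected threshold should let almost none through.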