Introduction to Data Science in Python Week 4

程序员文章站 2024-01-04 22:36:46

Distributions

  • The set of all possible values of a random variable, together with how likely each value is.

Binomial Distribution

import pandas as pd
import numpy as np
np.random.binomial(1,0.5) # first argument: number of trials; second argument: probability of success (getting a 1) on each trial

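To estimate the probability of tornadoes occurring on two consecutive days:

chance_of_tornado = 0.01 # probability of a tornado on any given day

tornado_events = np.random.binomial(1, chance_of_tornado, 1000000) # 1,000,000 samples, one per day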
    
two_days_in_a_row = 0
for j in range(1,len(tornado_events)):
    if tornado_events[j]==1 and tornado_events[j-1]==1:
        two_days_in_a_row+=1

print('{} tornadoes back to back in {} years'.format(two_days_in_a_row, 1000000/365))
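
The loop above can also be written without an explicit Python loop. The following is a minimal sketch, not the course's own code: it assumes NumPy's newer `default_rng` generator with an arbitrary seed (chosen only for reproducibility) and compares the simulated count with the value expected if days are independent.

```python
import numpy as np

# Assumption: default_rng and the seed are illustrative choices, not from the notes.
rng = np.random.default_rng(0)
chance_of_tornado = 0.01
n_days = 1_000_000

tornado_events = rng.binomial(1, chance_of_tornado, n_days)

# A back-to-back event is a 1 on day j preceded by a 1 on day j-1.
today = tornado_events[1:] == 1
yesterday = tornado_events[:-1] == 1
two_days_in_a_row = int(np.sum(today & yesterday))

# With independent days, each of the n_days - 1 adjacent pairs
# has probability p * p of being a double tornado.
expected_pairs = (n_days - 1) * chance_of_tornado ** 2

print('simulated: {}, expected: {:.1f}'.format(two_days_in_a_row, expected_pairs))
```

The simulated count varies from run to run, but it should stay close to the expected value of about 100.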

Uniform Distribution

np.random.uniform(0, 1)
# np.random.uniform(low,high,size)
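
As a quick check on what `size` does, the sketch below draws many uniform samples and compares their mean and variance with the textbook values (low+high)/2 and (high-low)²/12. The generator, seed, and sample size are assumptions made for illustration.

```python
import numpy as np

# Assumption: seed and sample size are arbitrary illustrative choices.
rng = np.random.default_rng(0)
low, high = 0.0, 1.0

samples = rng.uniform(low, high, size=100_000)

# Every draw lies in the half-open interval [low, high).
print(samples.min() >= low, samples.max() < high)

# Sample statistics next to their theoretical counterparts.
print(samples.mean(), (low + high) / 2)
print(samples.var(), (high - low) ** 2 / 12)
```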

Normal (Gaussian) Distribution (mean is zero for the standard normal)

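np.random.normal(0.75) # loc=0.75; scale defaults to 1.0
#np.random.normal(loc,scale,size)
#loc: center (mean) of the distribution
#scale: standard deviation; the larger the scale, the wider and flatter the bell curve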
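
To make the roles of `loc` and `scale` concrete, here is a small sketch that passes them by keyword and checks the spread of the samples. The seed, sample size, and the two scale values are assumptions picked for illustration.

```python
import numpy as np

# Assumption: seed and parameter values are illustrative only.
rng = np.random.default_rng(0)

# Same center, different spread: keyword arguments avoid mixing up loc and scale.
narrow = rng.normal(loc=0.0, scale=0.5, size=100_000)
wide = rng.normal(loc=0.0, scale=2.0, size=100_000)

print(narrow.std(), wide.std())

# For a normal distribution, roughly 68% of values fall within
# one standard deviation of the mean.
within_one_sd = np.mean(np.abs(wide) < 2.0)
print(within_one_sd)
```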

How to calculate the standard deviation

distribution = np.random.normal(0.75,size=1000)
np.sqrt(np.sum((np.mean(distribution)-distribution)**2)/len(distribution))
or
np.std(distribution)
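
The manual formula above divides by `len(distribution)`, which matches `np.std` with its default `ddof=0` (the population standard deviation). The sketch below, using an assumed seed, also shows the sample standard deviation obtained with `ddof=1`.

```python
import numpy as np

# Assumption: the seed is arbitrary and only makes the run repeatable.
rng = np.random.default_rng(0)
distribution = rng.normal(0.75, size=1000)

# Manual population standard deviation (divide by n).
manual_std = np.sqrt(np.sum((np.mean(distribution) - distribution) ** 2) / len(distribution))

population_std = np.std(distribution)      # ddof=0, divides by n
sample_std = np.std(distribution, ddof=1)  # divides by n - 1

print(manual_std, population_std, sample_std)
```

For 1000 samples the two estimates differ only slightly; the gap matters more for small samples.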

Kurtosis

import scipy.stats as stats
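stats.kurtosis(distribution) # negative: flatter than a normal distribution; positive: more peaked than a normal distribution
stats.skew(distribution) # measures asymmetry around the center; close to 0 for a symmetric distribution such as the normal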
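
By default `stats.kurtosis` reports excess kurtosis, so a normal distribution scores about 0. The sketch below compares three distributions to show the sign convention; the choice of comparison distributions, the seed, and the sample size are assumptions made for illustration.

```python
import numpy as np
import scipy.stats as stats

# Assumption: seed, sample size, and comparison distributions are illustrative.
rng = np.random.default_rng(0)
n = 100_000

normal_sample = rng.normal(size=n)
uniform_sample = rng.uniform(-1, 1, size=n)  # flatter than normal (theoretical excess kurtosis -1.2)
laplace_sample = rng.laplace(size=n)         # more peaked, heavier tails (theoretical excess kurtosis 3)

k_normal = stats.kurtosis(normal_sample)
k_uniform = stats.kurtosis(uniform_sample)
k_laplace = stats.kurtosis(laplace_sample)
s_normal = stats.skew(normal_sample)

print(k_normal, k_uniform, k_laplace, s_normal)
```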

Chi-Squared Distribution

  • Right-skewed (long tail to the right, positive skew)
  • Degrees of freedom (one parameter)
chi_squared_df2 = np.random.chisquare(2, size=10000)
stats.skew(chi_squared_df2)
>>>2.067857561010524
chi_squared_df5 = np.random.chisquare(5, size=10000)
stats.skew(chi_squared_df5)
>>>1.3091894938388848
# As the degrees of freedom increase, the skew decreases and the distribution becomes more symmetric
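
The trend in the skew values can be compared with the theoretical skewness of a chi-squared distribution, which is sqrt(8/k) for k degrees of freedom. This is a minimal sketch; the seed, the sample size, and the extra case with 50 degrees of freedom are assumptions added for illustration.

```python
import numpy as np
import scipy.stats as stats

# Assumption: seed, sample size, and the df=50 case are illustrative additions.
rng = np.random.default_rng(0)

results = {}
for df in (2, 5, 50):
    sample = rng.chisquare(df, size=100_000)
    sample_skew = stats.skew(sample)
    theoretical_skew = np.sqrt(8 / df)
    results[df] = (sample_skew, theoretical_skew)
    print(df, sample_skew, theoretical_skew)
```

The sample skew stays positive in every case and shrinks toward 0 as the degrees of freedom grow.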

Multimodal Distribution

Has more than one peak; a distribution with two peaks is called bimodal.
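
One simple way to produce a distribution with two peaks is to mix samples from two normal distributions with different centers. The sketch below assumes centers at -3 and 3 and an arbitrary seed; these values are illustrative and not taken from the course.

```python
import numpy as np

# Assumption: the centers (-3 and 3), scale, and seed are illustrative choices.
rng = np.random.default_rng(0)

left_peak = rng.normal(loc=-3.0, scale=1.0, size=5000)
right_peak = rng.normal(loc=3.0, scale=1.0, size=5000)
bimodal = np.concatenate([left_peak, right_peak])

def share_near(values, center, half_width=0.5):
    """Fraction of values within half_width of center."""
    return np.mean(np.abs(values - center) < half_width)

near_left = share_near(bimodal, -3.0)
near_center = share_near(bimodal, 0.0)
near_right = share_near(bimodal, 3.0)

# Many samples sit near each peak, very few in the valley between them.
print(near_left, near_center, near_right)
```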

Hypothesis Testing in Python

  • Hypothesis: A statement we can test
    • Alternative hypothesis: our idea, e.g. there’s a difference between groups
    • Null hypothesis: the opposite of our idea, e.g. there’s no difference between groups
    • The goal is to find evidence that lets us reject the null hypothesis.
df = pd.read_csv('grades.csv')
df.head()
len(df)
early = df[df['assignment1_submission'] <= '2015-12-31']
late = df[df['assignment1_submission'] > '2015-12-31']
early.mean(numeric_only=True) # average only the numeric (grade) columns
late.mean(numeric_only=True)
  • Critical value alpha (α)
    • The threshold as to how much chance you are willing to accept
    • typical values in social science are 0.1, 0.05 or 0.01

T-test

from scipy import stats
stats.ttest_ind?
>>>Signature: stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate')
>>>Docstring:
Calculates the T-test for the means of *two independent* samples of scores.
stats.ttest_ind(early['assignment1_grade'], late['assignment1_grade'])
>>>Ttest_indResult(statistic=1.400549944897566, pvalue=0.16148283016060577)
# The p-value is large (0.16 > 0.05), so there is no statistically significant difference between the two sample means and we cannot reject the null hypothesis.
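
Since `grades.csv` is not included here, the following sketch runs the same kind of test on synthetic data. The group names, means, and sizes are hypothetical values invented for illustration, and the seed is arbitrary. It also shows `equal_var=False`, which requests Welch's t-test when equal variances cannot be assumed.

```python
import numpy as np
from scipy import stats

# Assumption: all data below is synthetic and hypothetical, not from grades.csv.
rng = np.random.default_rng(0)
alpha = 0.05

group_a = rng.normal(loc=75.0, scale=10.0, size=200)
group_b = rng.normal(loc=75.0, scale=10.0, size=200)  # same true mean as group_a
group_c = rng.normal(loc=80.0, scale=10.0, size=200)  # true mean is 5 points higher

same_mean = stats.ttest_ind(group_a, group_b)
different_mean = stats.ttest_ind(group_a, group_c)
welch = stats.ttest_ind(group_a, group_c, equal_var=False)  # Welch's t-test

for name, result in [('a vs b', same_mean), ('a vs c', different_mean), ('a vs c (Welch)', welch)]:
    decision = 'reject' if result.pvalue < alpha else 'cannot reject'
    print('{}: p={:.4g}, {} the null hypothesis'.format(name, result.pvalue, decision))
```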

P-hacking (Dredging)

  • Spurious correlations rather than generalizable results
  • Doing many tests until you find one that is statistically significant
  • At a significance level of 0.05, we expect about 1 false positive in every 20 tests even when no real effect exists
  • Remedies
    • Bonferroni correction (lower α as the number of tests grows, e.g. α divided by the number of tests)
    • hold-out sets
    • investigation pre-registration
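
The effect of running many tests can be simulated. In this sketch every comparison is between two groups drawn from the same distribution, so any "significant" result is a false positive. The number of simulated studies, the group size, and the seed are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Assumption: study count, group size, and seed are illustrative choices.
rng = np.random.default_rng(0)
alpha = 0.05
n_tests = 20          # tests run per study
n_studies = 2000      # simulated studies
group_size = 30

# Both groups come from the same distribution, so the null hypothesis is true.
group_a = rng.normal(size=(n_studies, n_tests, group_size))
group_b = rng.normal(size=(n_studies, n_tests, group_size))
pvalues = stats.ttest_ind(group_a, group_b, axis=2).pvalue  # shape (n_studies, n_tests)

# Average number of false positives per study of 20 tests.
false_positives_per_study = (pvalues < alpha).sum(axis=1).mean()

# Share of studies with at least one "significant" result.
any_hit_uncorrected = (pvalues < alpha).any(axis=1).mean()

# Bonferroni correction: compare each p-value with alpha / n_tests.
any_hit_bonferroni = (pvalues < alpha / n_tests).any(axis=1).mean()

print(false_positives_per_study, any_hit_uncorrected, any_hit_bonferroni)
```

Without correction, roughly 1 - 0.95^20 ≈ 64% of studies report at least one spurious finding; with the Bonferroni threshold that share falls to about 5%.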