『Data Science』R Language Study Notes: Inspecting Data

Source site: 程序员文章站, 2022-03-01 15:45:26

Getting the data from the Web

if(!file.exists("./db")){
    dir.create("./db")
}

fileUrl <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/restaurants.csv", method = "auto")
restData <- read.csv("./db/restaurants.csv")

Looking at a bit of the data

head(restData, n=3)
tail(restData, n=3)

Make a summary

summary(restData)

More in-depth information

str(restData)

Quantiles of quantitative variables

The generic function quantile produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

> quantile(restData$councilDistrict, na.rm = TRUE)
  0%  25%  50%  75% 100%
   1    2    9   11   14
> quantile(restData$councilDistrict, probs = c(0.5, 0.75, 0.9))
50% 75% 90%
  9  11  12

Arguments of quantile (from the R documentation, ?quantile):

x - numeric vector whose sample quantiles are wanted, or an object of a class for which a method has been defined (see the 'Details' section of ?quantile). NA and NaN values are not allowed in numeric vectors unless na.rm is TRUE.
probs - numeric vector of probabilities with values in [0,1]. (Values up to 2e-14 outside that range are accepted and moved to the nearby endpoint.)
na.rm - logical; if TRUE, any NA and NaN values are removed from x before the quantiles are computed.
names - logical; if TRUE, the result has a names attribute. Set to FALSE for a speedup with many probs.
type - an integer between 1 and 9 selecting one of the nine quantile algorithms described in ?quantile (the default is 7).
... - further arguments passed to or from other methods.
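
The arguments above can be tried on a small vector. This is a minimal sketch on made-up data (the vector x is an assumption for illustration and is not taken from restData); it shows what na.rm, names and type change in the result.

```r
# Toy vector with one missing value (illustrative data, not from restData)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, NA)
p <- c(0.25, 0.5, 0.75)

# na.rm = TRUE drops the NA first; without it quantile() stops with an error
q_default <- quantile(x, probs = p, na.rm = TRUE)

# names = FALSE returns a plain numeric vector without the "25%" style labels
q_plain <- quantile(x, probs = p, na.rm = TRUE, names = FALSE)

# type = 1 inverts the empirical CDF, so every result is an observed value
q_type1 <- quantile(x, probs = p, na.rm = TRUE, type = 1)

print(q_default)  # default algorithm (type = 7) interpolates between observations
print(q_plain)
print(q_type1)
```

With the default type the quartiles fall between observations (3.25, 5.5, 7.75), while type = 1 returns 3, 5 and 8, which are values that actually occur in x.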

Make a table

> table(restData$zipCode, useNA = "ifany")

-21226  21201  21202  21205  21206  21207  21208  21209  21210  21211  21212  21213  21214  21215  21216  21217  21218  21220
     1    136    201     27     30      4      1      8     23     41     28     31     17     54     10     32     69      1

> table(restData$councilDistrict, restData$zipCode)

     -21226 21201 21202 21205 21206 21207 21208 21209 21210 21211 21212 21213 21214 21215 21216 21217 21218 21220 21222 21223
  1       0     0    37     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     7     0
  2       0     0     0     3    27     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  3       0     0     0     0     0     0     0     0     0     0     0     2    17     0     0     0     3     0     0     0
  4       0     0     0     0     0     0     0     0     0     0    27     0     0     0     0     0     0     0     0     0
  5       0     0     0     0     0     3     0     6     0     0     0     0     0    31     0     0     0     0     0     0
  6       0     0     0     0     0     0     0     1    19     0     0     0     0    15     1     0     0     0     0     0

Check for missing values

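sum(is.na(restData$councilDistrict))   ## number of missing values in the column
any(is.na(restData$councilDistrict))   ## TRUE if at least one value is missing
all(restData$zipCode > 0)              ## sanity check: every zip code should be positive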

Row and column sums

colSums(is.na(restData))               ## number of missing values per column
all(colSums(is.na(restData)) == 0)     ## TRUE if no column has any missing value
all(restData$zipCode > 0)
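
The checks above are easier to read on a tiny data frame where the answers are known in advance. A minimal sketch, assuming a made-up data frame named toy (hypothetical, only borrowing two column names from restData):

```r
# Hypothetical toy data frame standing in for restData
toy <- data.frame(
  zipCode         = c(21201, 21202, -21226, NA),
  councilDistrict = c(1, NA, 3, NA)
)

n_missing   <- sum(is.na(toy$councilDistrict))  # count of NAs in one column
has_missing <- any(is.na(toy$councilDistrict))  # is there at least one NA?
per_column  <- colSums(is.na(toy))              # count of NAs in every column

# A single FALSE decides all(), even when an NA is present
all_positive <- all(toy$zipCode > 0)

# With no FALSE but an NA, all() cannot decide and returns NA
undecided <- all(c(21201, NA) > 0)

# na.rm = TRUE ignores the NA and checks the remaining values only
decided <- all(c(21201, NA) > 0, na.rm = TRUE)

print(per_column)
print(c(all_positive, undecided, decided))
```

Note that all() may return NA rather than TRUE or FALSE when the column contains missing values, so it can be worth passing na.rm = TRUE or checking for NAs first.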

Values with specific characteristics

> table(restData$zipCode %in% c("21212"))

FALSE  TRUE
 1299    28

> table(restData$zipCode %in% c("21212", "21213"))

FALSE  TRUE
 1268    59

> restData[restData$zipCode %in% c("21212", "21213"), ]
                                     name zipCode                neighborhood councilDistrict policeDistrict
29                      BAY ATLANTIC CLUB   21212                    Downtown              11        CENTRAL
39                            BERMUDA BAR   21213               Broadway East              12        EASTERN
92                              ATWATER'S   21212   Chinquapin Park-Belvedere               4       NORTHERN
111            BALTIMORE ESTONIAN SOCIETY   21213          South Clifton Park              12        EASTERN
187                              CAFE ZEN   21212                    Rosebank               4       NORTHERN

Cross tabs

data(UCBAdmissions)
DF = as.data.frame(UCBAdmissions)
DF
summary(DF)

xt <- xtabs(Freq ~ Gender + Admit, data = DF)   ## Freq must be a numeric (integer or double) column; its values are summed within each cell
xt
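
To see what the left-hand side of the formula does, here is a minimal sketch on a made-up data frame (toy and its numbers are assumptions for illustration, not the UCBAdmissions data). Rows that share the same Gender and Admit combination have their Freq values added together; with an empty left-hand side, xtabs counts rows instead.

```r
# Hypothetical toy data; the first two rows share the same combination
toy <- data.frame(
  Gender = c("Male", "Male", "Male", "Female", "Female"),
  Admit  = c("Admitted", "Admitted", "Rejected", "Admitted", "Rejected"),
  Freq   = c(10, 2, 5, 7, 8)
)

# Sum Freq within each Gender x Admit cell
xt_toy <- xtabs(Freq ~ Gender + Admit, data = toy)

# Empty left-hand side: count the number of rows per Gender instead
row_counts <- xtabs(~ Gender, data = toy)

print(xt_toy)
print(row_counts)
```

Here the Male/Admitted cell is 12 (10 + 2), while row_counts reports 2 rows for Female and 3 for Male.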

Flat tables

> warpbreaks$replicate <- rep(1:9, len = 54)
> xt = xtabs(breaks ~., data = warpbreaks)        ## equivalent to xtabs(breaks ~ wool + tension + replicate, data = warpbreaks)
> xt
, , replicate = 1

    tension
wool  L  M  H
   A 26 18 36
   B 27 42 20

, , replicate = 2

    tension
wool  L  M  H
   A 30 21 21
   B 14 26 21

, , replicate = 3

    tension
wool  L  M  H
   A 54 29 24
   B 29 19 24


> ftable(xt)
             replicate  1  2  3  4  5  6  7  8  9
wool tension                                     
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28

Size of a data set

> fakeData = rnorm(1e5)
> object.size(fakeData)
800040 bytes
> print(object.size(fakeData), units = "Mb")
0.8 Mb
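
The reported size is mostly the raw data: a double takes 8 bytes, so 1e5 of them take 800000 bytes, and the remainder is the vector's header. A minimal sketch of that arithmetic (the variable names are illustrative, and the exact header size may differ between R versions and platforms):

```r
# A numeric vector of 1e5 doubles (zeros here, the values do not affect the size)
x <- numeric(1e5)

bytes    <- as.numeric(object.size(x))  # total size reported by R, in bytes
payload  <- 8 * length(x)               # 8 bytes per double
overhead <- bytes - payload             # what is left is the object header

print(bytes)
print(overhead)

# format() gives the same size as a string in a chosen unit
print(format(object.size(x), units = "Kb"))
```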

Reposted from: https://my.oschina.net/skyler/blog/714702
