『Data Science』R Language Study Notes: Inspecting the Data

程序员文章站 2022-03-01 15:45:26

Getting the data from the web

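# Create the local data directory if it does not exist yet
if (!file.exists("./db")) {
    dir.create("./db")
}

# Download the Baltimore restaurants CSV and load it into a data frame
fileUrl <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/restaurants.csv", method = "auto")
restData <- read.csv("./db/restaurants.csv")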

Looking at a bit of the data

head(restData, n=3)
tail(restData, n=3)

Make a summary

summary(restData)

More in-depth information

str(restData)

Quantiles of quantitative variables

The generic function quantile produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

> quantile(restData$councilDistrict, na.rm = TRUE)
  0%  25%  50%  75% 100%
   1    2    9   11   14
> quantile(restData$councilDistrict, probs = c(0.5, 0.75, 0.9))
50% 75% 90%
  9  11  12
Arguments of quantile(x, probs, na.rm, names, type, ...):

  • x - numeric vector whose sample quantiles are wanted, or an object of a class for which a method has been defined (see the 'Details' section of ?quantile). NA and NaN values are not allowed in numeric vectors unless na.rm is TRUE.
  • probs - numeric vector of probabilities with values in [0,1]. (Values up to 2e-14 outside that range are accepted and moved to the nearby endpoint.)
  • na.rm - logical; if TRUE, any NA and NaN values are removed from x before the quantiles are computed.
  • names - logical; if TRUE, the result has a names attribute. Set to FALSE for a speedup with many probs.
  • type - an integer between 1 and 9 selecting one of the nine quantile algorithms described in ?quantile.
  • ... - further arguments passed to or from other methods.
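
To make the na.rm, names and type arguments concrete, here is a minimal sketch on a made-up vector. The vector y and the variable names are assumptions for illustration only; they are not part of the restaurants data.

```r
# Toy vector (hypothetical values, not taken from the restaurants file)
y <- c(1, 2, 3, 4, NA, 10, 20)

# With the default na.rm = FALSE, quantile() stops with an error when NAs are present
res <- tryCatch(quantile(y), error = function(e) "error")

# Remove the NA, then compute the default 0/25/50/75/100% quantiles
q_default <- quantile(y, na.rm = TRUE)

# names = FALSE returns a plain numeric vector without the "50%" style labels
med7 <- quantile(y, probs = 0.5, na.rm = TRUE, names = FALSE)            # default type 7 interpolates: 3.5
med1 <- quantile(y, probs = 0.5, na.rm = TRUE, names = FALSE, type = 1)  # type 1 returns an observed value: 3

print(res)
print(q_default)
print(c(type7 = med7, type1 = med1))
```

With an even number of non-missing values, the default type 7 interpolates between the two middle observations, while type 1 always returns a value that actually occurs in the data.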

Make a table

> table(restData$zipCode, useNA = "ifany")

-21226  21201  21202  21205  21206  21207  21208  21209  21210  21211  21212  21213  21214  21215  21216  21217  21218  21220
     1    136    201     27     30      4      1      8     23     41     28     31     17     54     10     32     69      1

> table(restData$councilDistrict, restData$zipCode)

     -21226 21201 21202 21205 21206 21207 21208 21209 21210 21211 21212 21213 21214 21215 21216 21217 21218 21220 21222 21223
  1       0     0    37     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     7     0
  2       0     0     0     3    27     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  3       0     0     0     0     0     0     0     0     0     0     0     2    17     0     0     0     3     0     0     0
  4       0     0     0     0     0     0     0     0     0     0    27     0     0     0     0     0     0     0     0     0
  5       0     0     0     0     0     3     0     6     0     0     0     0     0    31     0     0     0     0     0     0
  6       0     0     0     0     0     0     0     1    19     0     0     0     0    15     1     0     0     0     0     0
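
The effect of useNA is easier to see on a few hand-made values. The sketch below uses two hypothetical vectors, zip and district, that only imitate the shape of the restaurants columns.

```r
# Made-up values, chosen so that each vector contains one NA
zip      <- c(21201, 21202, 21202, NA, 21201, 21202)
district <- c(11, 1, 1, 2, 11, NA)

t_drop <- table(zip)                    # by default NA is silently left out
t_keep <- table(zip, useNA = "ifany")   # adds an <NA> cell only when NAs occur
t_two  <- table(district, zip, useNA = "ifany")  # two-way table, NAs kept on both margins

print(t_drop)
print(t_keep)
print(t_two)
```

Comparing sum(t_drop) with length(zip) is a quick way to notice that a default table has dropped missing values.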

Check for missing values

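sum(is.na(restData$councilDistrict))   # number of missing values
any(is.na(restData$councilDistrict))   # TRUE if at least one value is missing
all(restData$zipCode > 0)              # FALSE here: the table above contains the invalid zip code -21226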

Row and column sums

colSums(is.na(restData))               # number of missing values in each column
all(colSums(is.na(restData)) == 0)     # TRUE only if no column has a missing value
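
The checks from the last two sections can be tried without downloading anything. The following sketch builds a small, made-up data frame (df and its values are assumptions for illustration) and also shows how all() behaves when NA values are involved.

```r
# Hypothetical data frame with one invalid zip code and a few NAs
df <- data.frame(
  zipCode         = c(21201, -21226, 21202, NA),
  councilDistrict = c(11, 1, NA, NA)
)

n_missing   <- sum(is.na(df$councilDistrict))   # count of NAs in one column
has_missing <- any(is.na(df$councilDistrict))   # is there at least one NA?

na_per_col <- colSums(is.na(df))                # NAs per column
na_per_row <- rowSums(is.na(df))                # NAs per row

# all() is FALSE as soon as one comparison is FALSE, even if others are NA
all_positive <- all(df$zipCode > 0)
# If no comparison is FALSE but some are NA, the result is NA unless na.rm = TRUE
all_with_na  <- all(c(21201, NA) > 0)
all_na_rm    <- all(c(21201, NA) > 0, na.rm = TRUE)

print(na_per_col)
print(na_per_row)
print(c(all_positive, all_with_na, all_na_rm))
```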

Values with specific characteristics

> table(restData$zipCode %in% c("21212"))

FALSE  TRUE
 1299    28

> table(restData$zipCode %in% c("21212", "21213"))

FALSE  TRUE
 1268    59

> restData[restData$zipCode %in% c("21212", "21213"), ]
                                     name zipCode                neighborhood councilDistrict policeDistrict
29                      BAY ATLANTIC CLUB   21212                    Downtown              11        CENTRAL
39                            BERMUDA BAR   21213               Broadway East              12        EASTERN
92                              ATWATER'S   21212   Chinquapin Park-Belvedere               4       NORTHERN
111            BALTIMORE ESTONIAN SOCIETY   21213          South Clifton Park              12        EASTERN
187                              CAFE ZEN   21212                    Rosebank               4       NORTHERN
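
One reason to prefer %in% over == for this kind of subsetting is its handling of missing values. The sketch below uses a hypothetical four-row data frame (not the restaurants data) to show the difference.

```r
# Made-up data frame; the last row has a missing zip code
df <- data.frame(
  name    = c("A", "B", "C", "D"),
  zipCode = c(21212L, 21201L, 21213L, NA),
  stringsAsFactors = FALSE
)

# %in% always returns TRUE or FALSE, never NA
hit <- df$zipCode %in% c("21212", "21213")
sub_in <- df[hit, ]

# == returns NA for the missing zip code, and an NA index yields a row of NAs
eq <- df$zipCode == 21212
sub_eq <- df[eq, ]

print(hit)
print(sub_in)
print(eq)
print(sub_eq)
```

Here sub_eq contains an extra all-NA row in addition to the matching row, whereas sub_in contains only genuine matches.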

Cross tabs

data(UCBAdmissions)
DF = as.data.frame(UCBAdmissions)
DF
summary(DF)

xt <- xtabs(Freq ~ Gender + Admit, data = DF)   ## Freq must be a numeric column (integer or double), because xtabs sums it within each cell
xt
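
As a sanity check on what xtabs computes, the sketch below compares its result with the margins of the original UCBAdmissions table. The variable names xt_check and margins are assumptions for illustration.

```r
# UCBAdmissions is a built-in 3-way table with dimensions Admit x Gender x Dept
DF <- as.data.frame(UCBAdmissions)

# Sum Freq over departments for every Gender/Admit combination
xt_check <- xtabs(Freq ~ Gender + Admit, data = DF)

# The same numbers, obtained by collapsing the original table over Dept
# (dimension 2 is Gender, dimension 1 is Admit)
margins <- margin.table(UCBAdmissions, c(2, 1))

print(xt_check)
print(all(xt_check == margins))
print(sum(xt_check) == sum(DF$Freq))
```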

Flat tables

> warpbreaks$replicate <- rep(1:9, length.out = 54)
> xt = xtabs(breaks ~ ., data = warpbreaks)        ## equivalent to xtabs(breaks ~ wool + tension + replicate, data = warpbreaks)
> xt
, , replicate = 1

    tension
wool  L  M  H
   A 26 18 36
   B 27 42 20

, , replicate = 2

    tension
wool  L  M  H
   A 30 21 21
   B 14 26 21

, , replicate = 3

    tension
wool  L  M  H
   A 54 29 24
   B 29 19 24


> ftable(xt)
             replicate  1  2  3  4  5  6  7  8  9
wool tension                                     
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28
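
The following sketch reproduces the flat table on a copy of warpbreaks and shows how the row.vars argument of ftable changes which variables go into the rows. The names wb, ft and ft_by_rep are assumptions for illustration.

```r
# Work on a copy so the built-in data set is left untouched
wb <- warpbreaks
wb$replicate <- rep(1:9, length.out = 54)

# Three-way cross tab: wool x tension x replicate
xt3 <- xtabs(breaks ~ ., data = wb)

# Default layout: all but the last variable in the rows (wool, tension),
# the last variable (replicate) in the columns
ft <- ftable(xt3)

# Alternative layout: one row per replicate, wool/tension combinations as columns
ft_by_rep <- ftable(xt3, row.vars = "replicate")

print(dim(xt3))
print(dim(ft))
print(ft_by_rep)
```

Both layouts hold the same 54 cell values; only the arrangement differs.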

Size of a data set

> fakeData = rnorm(1e5)
> object.size(fakeData)
800040 bytes
> print(object.size(fakeData), units = "Mb")
0.8 Mb
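
A double takes 8 bytes and an integer 4 bytes, so the storage type has a direct effect on the reported size. The sketch below compares the two; the exact byte counts include a small per-object header that can differ between R versions, so only approximate sizes are noted.

```r
# 1e5 doubles: about 8 * 1e5 bytes plus a small header
dbl <- rnorm(1e5)
# 1e5 integers: about 4 * 1e5 bytes plus a small header
int <- seq_len(1e5) + 0L

size_dbl <- object.size(dbl)
size_int <- object.size(int)

# format() accepts the same units argument as print()
print(format(size_dbl, units = "Kb"))
print(format(size_int, units = "Kb"))
print(as.numeric(size_dbl) / as.numeric(size_int))   # close to 2
```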

Reposted from: https://my.oschina.net/skyler/blog/714702