『Data Science』R Language Study Notes: Examining Data
2022-03-01 15:45:26
Getting the data from the web
if (!file.exists("./db")) {
  dir.create("./db")
}
fileUrl <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/restaurants.csv", method = "auto")
restData <- read.csv("./db/restaurants.csv")
Looking at a bit of the data
head(restData, n=3)
tail(restData, n=3)
Making a summary
summary(restData)
More in-depth information
str(restData)
Quantiles of quantitative variables
The generic function quantile produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.
> quantile(restData$councilDistrict, na.rm = TRUE)
0% 25% 50% 75% 100%
1 2 9 11 14
> quantile(restData$councilDistrict, probs = c(0.5, 0.75, 0.9))
50% 75% 90%
9 11 12
Arguments of quantile (from the R help page, ?quantile):
x: numeric vector whose sample quantiles are wanted, or an object of a class for which a method has been defined. NA and NaN values are not allowed in numeric vectors unless na.rm is TRUE.
probs: numeric vector of probabilities with values in [0,1]. (Values up to 2e-14 outside that range are accepted and moved to the nearby endpoint.)
na.rm: logical; if true, any NA and NaN's are removed from x before the quantiles are computed.
names: logical; if true, the result has a names attribute. Set to FALSE for speedup with many probs.
type: an integer between 1 and 9 selecting one of the nine quantile algorithms described in the help page.
...: further arguments passed to or from other methods.
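To make the arguments above concrete, here is a minimal sketch on a made-up vector (the values are hypothetical and are not taken from restData). It shows what na.rm, names and type change; type 1 and the default type 7 are just two of the nine algorithms.

```r
# Toy vector (hypothetical values, not from restData)
x <- c(1, 2, 4, 10, NA)

# Default algorithm (type = 7); the NA must be removed or quantile() stops with an error
q_default <- quantile(x, na.rm = TRUE)

# names = FALSE returns a plain numeric vector without the "25%", "50%" labels
q_plain <- quantile(x, probs = c(0.25, 0.5), na.rm = TRUE, names = FALSE)

# type = 1 inverts the empirical CDF, so the result is always an observed value;
# type = 7 interpolates linearly between neighbouring observations
med_type1 <- quantile(x, probs = 0.5, na.rm = TRUE, type = 1, names = FALSE)
med_type7 <- quantile(x, probs = 0.5, na.rm = TRUE, type = 7, names = FALSE)

print(q_default)
print(q_plain)
print(c(type1 = med_type1, type7 = med_type7))
```

With four non-missing values the two median definitions disagree (2 for type 1, 3 for type 7), which is why the type argument matters for small samples.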
Making tables
> table(restData$zipCode, useNA = "ifany")
-21226  21201  21202  21205  21206  21207  21208  21209  21210  21211  21212  21213  21214  21215  21216  21217  21218  21220
     1    136    201     27     30      4      1      8     23     41     28     31     17     54     10     32     69      1
> table(restData$councilDistrict, restData$zipCode)
   -21226  21201  21202  21205  21206  21207  21208  21209  21210  21211  21212  21213  21214  21215  21216  21217  21218  21220  21222  21223
1       0      0     37      0      0      0      0      0      0      0      0      2      0      0      0      0      0      0      7      0
2       0      0      0      3     27      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
3       0      0      0      0      0      0      0      0      0      0      0      2     17      0      0      0      3      0      0      0
4       0      0      0      0      0      0      0      0      0      0     27      0      0      0      0      0      0      0      0      0
5       0      0      0      0      0      3      0      6      0      0      0      0      0     31      0      0      0      0      0      0
6       0      0      0      0      0      0      0      1     19      0      0      0      0     15      1      0      0      0      0      0
Check for missing values
sum(is.na(restData$councilDistrict))
any(is.na(restData$councilDistrict))
all(restData$zipCode > 0)
Row and column sums
colSums(is.na(restData))
all(colSums(is.na(restData)) == 0)
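The checks above can be tried on a small made-up data frame (the values below are hypothetical; only the column names mirror restData). The negative zip code is included on purpose, since the zipCode table earlier lists -21226 once, so the all(... > 0) check would be expected to fail on the real data as well.

```r
# Toy data frame (hypothetical values) with one NA per column and one negative zip code
df <- data.frame(
  zipCode         = c(21201, -21226, 21212, NA),
  councilDistrict = c(11, 1, NA, 4)
)

n_missing   <- sum(is.na(df$councilDistrict))  # number of NAs in one column
has_missing <- any(is.na(df$councilDistrict))  # TRUE if at least one NA is present
na_per_col  <- colSums(is.na(df))              # NA count for every column at once
no_missing  <- all(na_per_col == 0)            # TRUE only if no column has an NA

# FALSE here: one comparison is FALSE because of the negative zip code
all_positive <- all(df$zipCode > 0)

print(na_per_col)
print(c(n_missing = n_missing, has_missing = has_missing,
        no_missing = no_missing, all_positive = all_positive))
```

is.na() returns a logical matrix for a data frame, which is why colSums() can count the missing values column by column.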
Values with specific characteristics
> table(restData$zipCode %in% c("21212"))
FALSE TRUE
1299 28
> table(restData$zipCode %in% c("21212", "21213"))
FALSE TRUE
1268 59
> restData[restData$zipCode %in% c("21212", "21213"), ]
name zipCode neighborhood councilDistrict policeDistrict
29 BAY ATLANTIC CLUB 21212 Downtown 11 CENTRAL
39 BERMUDA BAR 21213 Broadway East 12 EASTERN
92 ATWATER'S 21212 Chinquapin Park-Belvedere 4 NORTHERN
111 BALTIMORE ESTONIAN SOCIETY 21213 South Clifton Park 12 EASTERN
187 CAFE ZEN 21212 Rosebank 4 NORTHERN
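One reason to prefer %in% over == for this kind of filter is how the two treat missing values. The sketch below uses a made-up character vector (not restData) to show the difference.

```r
# Toy vector of zip codes (hypothetical values) with one missing entry
zips <- c("21212", "21213", NA, "21201")

# %in% is built on match(): it always returns TRUE or FALSE, never NA
in_set <- zips %in% c("21212", "21213")

# == propagates the NA, which would produce an all-NA row when used to subset a data frame
eq_one <- zips == "21212"

# match() coerces both sides to a common type, so a numeric value
# can be tested against a character set
num_vs_chr <- 21212 %in% c("21212")

print(in_set)
print(eq_one)
print(num_vs_chr)
```

The coercion in the last line is why the examples above work with quoted zip codes whether the zipCode column was read as numbers or as text.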
Cross tabs
data(UCBAdmissions)
DF = as.data.frame(UCBAdmissions)
DF
summary(DF)
xt <- xtabs(Freq ~ Gender + Admit, data = DF) ## Freq must be a numeric column (integer or double) so that it can be summed
xt
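To see what xtabs does with the left-hand side of the formula, the following sketch compares one cell of the cross tab with a manual sum over the Dept variable, which is left out of the formula. prop.table is an addition that the notes do not use; it is shown only as one possible next step.

```r
data(UCBAdmissions)
DF <- as.data.frame(UCBAdmissions)

# Freq is summed within each Gender x Admit combination, collapsing over Dept
xt <- xtabs(Freq ~ Gender + Admit, data = DF)

# The same cell computed by hand
manual <- sum(DF$Freq[DF$Gender == "Male" & DF$Admit == "Admitted"])

print(xt)
print(manual)

# Row proportions: the share admitted and rejected within each gender
print(prop.table(xt, margin = 1))
```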
Flat tables
> warpbreaks$replicate <- rep(1:9, len = 54)
> xt = xtabs(breaks ~ ., data = warpbreaks) ## equivalent to xtabs(breaks ~ wool + tension + replicate, data = warpbreaks)
> xt
, , replicate = 1
tension
wool L M H
A 26 18 36
B 27 42 20
, , replicate = 2
tension
wool L M H
A 30 21 21
B 14 26 21
, , replicate = 3
tension
wool L M H
A 54 29 24
B 29 19 24
> ftable(xt)
             replicate  1  2  3  4  5  6  7  8  9
wool tension
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28
Size of a data set
> fakeData = rnorm(1e5)
> object.size(fakeData)
800040 bytes
> print(object.size(fakeData), units = "Mb")
0.8 Mb
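The reported size can be checked against the vector's length: a double takes 8 bytes, so 1e5 values need 800000 bytes, and the few remaining bytes are the vector's header. The sketch below is a rough check; the exact header size depends on the R version and platform, so it is computed rather than assumed.

```r
set.seed(1)                      # only to make the toy data reproducible
fakeData <- rnorm(1e5)

payload_bytes <- length(fakeData) * 8               # 8 bytes per double
total_bytes   <- as.numeric(object.size(fakeData))  # size reported by R
overhead      <- total_bytes - payload_bytes        # header, platform dependent

print(c(payload = payload_bytes, total = total_bytes, overhead = overhead))

# units = "auto" picks a readable unit instead of a fixed one
print(format(object.size(fakeData), units = "auto"))
```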
Reposted from: https://my.oschina.net/skyler/blog/714702