『Data Science』R Language Study Notes: Obtaining Data
Obtaining Data: Motivation
- This course covers the basic ideas behind getting data ready for analysis
- Finding and extracting raw data
- Tidy data principles and how to make data tidy
- Practical implementation through a range of R packages
- What this course depends on
- What would be useful
  - Exploratory analysis
  - Reporting Data and Reproducible Research
GOAL: Raw data -> Processing script -> tidy data -> data analysis -> data communication
Raw and Processed Data
Data are values of qualitative or quantitative variables, belonging to a set of items.
- Qualitative: Country of origin, sex, treatment
- Quantitative: Height, weight, blood pressure
The components of tidy data
- The raw data.
- A tidy data set.
- A code book describing each variable and its values in the tidy data set.
- An explicit and exact recipe you used to go from 1 -> 2, 3.
The tidy data principles
- Each variable you measure should be in one column.
- Each different observation of that variable should be in a different row.
- There should be one table for each "kind" of variable.
- If you have multiple tables, they should include a column in the table that allows them to be linked.
Others:
- Include a row at the top of each file with variable names.
- Make variable names human readable: AgeAtDiagnosis instead of AgeDx.
- In general data should be saved in one file per table.
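To make these principles concrete, here is a minimal sketch of a tidy table in base R; the variable names and values are invented for illustration:
# Hypothetical tidy data set: one variable per column, one observation per row
patients <- data.frame(
    PatientId      = c(1, 2, 3),
    AgeAtDiagnosis = c(54, 61, 47),               # human-readable name, not AgeDx
    Treatment      = c("drug", "placebo", "drug")
)
head(patients)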
Downloading Data
- Get/set your working directory
getwd()
setwd()
- Checking for and creating directories
file.exists("directoryName")
dir.create("directoryName")
- Getting data from the internet
download.file()
if (!file.exists("db")) {
    dir.create("db")
}
fileUrl <- "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/callsforservice.csv", method = "curl")
list.files('./db')
PS: When I used the code above to download the data file, the error below appeared. It was caused by curl not being installed on my system; changing method = "curl" to method = "auto" fixed it.
Warning messages:
1: running command 'curl "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD" -o "./db/callsforservice.csv"' had status 127
2: In download.file(fileUrl, destfile = "./db/callsforservice.csv", :
download had nonzero exit status
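For reference, the same call with the fix applied; method = "auto" lets R pick whichever download method is available on the system:
download.file(fileUrl, destfile = "./db/callsforservice.csv", method = "auto")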
Loading flat files
read.table()
classData <- read.table('./db/callsforservice.csv', sep = ',', header = T)
head(classData)
All Reading Functions
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...)
read.csv2(file, header = TRUE, sep = ";", quote = "\"",
dec = ",", fill = TRUE, comment.char = "", ...)
read.delim(file, header = TRUE, sep = "\t", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...)
read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
dec = ",", fill = TRUE, comment.char = "", ...)
Reading XML Data
- Extensible markup language
- Frequently used to store structured data
- Particularly widely used in internet applications
- Extracting XML is the basis for most web scraping
- Components
  - Markup - labels that give the text structure
  - Content - the actual text of the document
library(XML)
html <- "http://*.com/search?q=XML+content+does+not+seem+to+be+XML%3A"
doc <- htmlTreeParse(html, useInternalNodes = T)
content <- xpathSApply(doc, "//div[@class = 'result-link']", xmlValue)
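A minimal sketch of the same XML functions on an inline document, with invented element names, to show basic node access:
library(XML)
xmlText <- "<breakfast><food><name>toast</name><price>3</price></food><food><name>eggs</name><price>5</price></food></breakfast>"
doc2 <- xmlTreeParse(xmlText, asText = TRUE, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc2)                   # the <breakfast> element
xmlName(rootNode)                           # "breakfast"
xpathSApply(rootNode, "//name", xmlValue)   # "toast" "eggs"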
Reading JSON Data
jsonlite
install.packages('jsonlite')
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData)
names(jsonData$owner)
jsonData$owner$login
## pretty-print JSON data
myjson <- toJSON(jsonData$owner, pretty = T)
cat(myjson)
data.table
> library(data.table)
> DF = data.frame(x=rnorm(9), y=rep(c("a", "b", "c"),each=3),z=rnorm(9))
> head(DF, 3)
x y z
1 1.239493 a -0.3917245
2 1.090748 a 0.3640152
3 2.462106 a 1.3424369
> DT = data.table(x=rnorm(9), y=rep(c("a", "b", "c"), each=3), z=rnorm(9))
> head(DT)
            x y           z
1:  0.1235667 a  0.94765708
2: -1.1491418 a  1.23264715
3: -2.3339784 a -0.70625463
4:  0.4896532 b  0.07144038
5:  0.7731791 b  0.45262096
6:  0.1601838 b -0.30345490
DT[2,]                                  # subset the second row
DT[DT$y == "a",]                        # subset rows where y == "a"
DT[, c(2,3)]                            # select columns 2 and 3 (recent data.table versions)
DT[, list(mean(x), sum(z))]             # apply functions inside j
DT[, table(y)]                          # tabulate a column
DT[, w:=z^2]                            # add column w by reference (no copy)
DT[, m:= {tmp <- (x+z); log2(tmp+5)}]   # multi-step expression; x+z, since y is character
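Two more data.table features worth noting, sketched under the assumption that the CSV downloaded earlier still exists: fread() is a fast drop-in replacement for read.table, and keys enable fast subsetting.
fast <- fread('./db/callsforservice.csv')   # much faster than read.table on large files
setkey(DT, y)                               # sort DT and mark column y as the key
DT["a"]                                     # key-based subset: rows where y == "a"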
Reading from MySQL
install.packages("RMySQL")
library(RMySQL)
ucscDb <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb);
hg19 <- dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)
dbListFields(hg19, "affyU133Plus2")
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)
## processing a big table
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)
affyMisSmall <- fetch(query, n = 10); dbClearResult(query);
dim(affyMisSmall)
dbDisconnect(hg19) ## close db connection
Reading from HDF5
- Used for storing large data sets.
- Supports storing a range of data types
- Hierarchical data format
- Groups containing zero or more data sets and metadata
  - Have a group header with group name and list of attributes
  - Have a group symbol table with a list of objects in group
- Datasets: multidimensional arrays of data elements with metadata
  - Have a header with name, datatype, dataspace, and storage layout
  - Have a data array with the data
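The rhdf5 package from Bioconductor covers this format in R; below is a minimal sketch, with invented file and group names:
# install with: BiocManager::install("rhdf5")
library(rhdf5)
h5createFile("example.h5")                              # create an empty HDF5 file
h5createGroup("example.h5", "foo")                      # add a group
h5write(matrix(1:10, nrow = 5), "example.h5", "foo/A")  # write a dataset into the group
h5ls("example.h5")                                      # list groups and datasets
A <- h5read("example.h5", "foo/A")                      # read the dataset back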
Reading from the web
Get web document
- Use the built-in functions url() and readLines()
> con = url("http://www.baidu.com")
> htmlCode = readLines(con)
> close(con)
> htmlCode
- Use the XML package
> library(XML)
> url <- "http://www.baidu.com"
> html <- htmlTreeParse(url, useInternalNodes = T)
> xpathSApply(html, "//div", xmlValue)
- Use the httr and XML packages
install.packages("httr")
library(httr)
url <- "http://www.baidu.com"
html <- GET(url)
content = content(html, as="text")
library(XML)
parsedHtml = htmlParse(content, asText = T)
xpathSApply(parsedHtml, "//div", xmlValue)
Accessing websites with passwords
- Before logging in
> pg1 = GET("http://httpbin.org/basic-auth/user/passwd")
> pg1
Response [http://httpbin.org/basic-auth/user/passwd]
Date: 2016-07-17 15:33
Status: 401
Content-Type: <unknown>
<EMPTY BODY>
- Logging in
> pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg2
Response [http://httpbin.org/basic-auth/user/passwd]
Date: 2016-07-17 15:34
Status: 200
Content-Type: application/json
Size: 47 B
{
"authenticated": true,
"user": "user"
}
> names(pg2)
[1] "url" "status_code" "headers" "all_headers" "cookies" "content" "date" "times"
[9] "request" "handle"
- Use a handle to carry cookies, sessions and so on across requests.
> pg = handle("http://httpbin.org")
> login = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg1 = GET(handle = pg, path = "/")
> pg2 = GET(handle = pg, path = "about")
Reading data from APIs
library(httr)
myapp = oauth_app("twitter", key = "yourConsumerKeyHere", secret = "yourConsumerSecretHere")
sig = sign_oauth1.0(myapp, token = "yourTokenHere", token_secret = "yourTokenSecretHere")
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)
json1 = content(homeTL)
json2 = jsonlite::fromJSON(toJSON(json1))
json2[1, 1:4]
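For APIs that need no OAuth, GET with a query list plus jsonlite usually suffices; a minimal sketch against the httpbin.org echo service used earlier:
library(httr); library(jsonlite)
resp <- GET("http://httpbin.org/get", query = list(q = "R"))  # send query parameters
parsed <- fromJSON(content(resp, as = "text"))                # parse the JSON body
parsed$args                                                   # the echoed parameters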
Reposted from: https://my.oschina.net/skyler/blog/713563