『Data Science』R Language Study Notes: Getting Data


Obtaining Data: Motivation

  • This course covers the basic ideas behind getting data ready for analysis
    • Finding and extracting raw data
    • Tidy data principles and how to make data tidy
    • Practical implementation through a range of R packages
  • What this course depends on
  • What would be useful
    • Exploratory analysis
    • Reporting Data and Reproducible Research

PS: free big data sources

GOAL: Raw data -> Processing script -> tidy data -> data analysis -> data communication

Raw and Processed Data

Data are values of qualitative or quantitative variables, belonging to a set of items.

  • Qualitative: Country of origin, sex, treatment
  • Quantitative: Height, weight, blood pressure

The components of tidy data

  1. The raw data.
  2. A tidy data set.
  3. A code book describing each variable and its values in the tidy data set.
  4. An explicit and exact recipe you used to go from 1 -> 2,3.

The tidy data

  1. Each variable you measure should be in one column.
  2. Each different observation of that variable should be in a different row.
  3. There should be one table for each "kind" of variable.
  4. If you have multiple tables, they should include a column in the table that allows them to be linked.

Others:

  • Include a row at the top of each file with variable names.
  • Make variable names human readable: AgeAtDiagnosis instead of AgeDx.
  • In general data should be saved in one file per table.
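
A quick illustration of these principles, as a minimal sketch (the data and the use of the reshape2 package are my own, not from the notes): an untidy wide table is melted so that each variable gets its own column and each observation its own row.

library(reshape2)

## Made-up example data. Untidy: one row per subject,
## one column per measurement occasion.
untidy <- data.frame(subject = c("s1", "s2"),
                     weight_before = c(70, 82),
                     weight_after  = c(68, 79))

## Tidy: each variable (subject, time, weight) in its own column
tidy <- melt(untidy, id.vars = "subject",
             variable.name = "time", value.name = "weight")
tidy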

Downloading Data

  1. Get/set your working directory
    • getwd()
    • setwd()
  2. Checking for and creating directories
    • file.exists("directoryName")
    • dir.create("directoryName")
  3. Getting data from the internet
    • download.file()
if (!file.exists('db')) {
  dir.create('db')
}

fileUrl <- "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/callsforservice.csv", method = "curl")
list.files('./db')
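
The course also suggests recording when a file was downloaded, since data on the web can change over time:

## Keep the download timestamp alongside the data
dateDownloaded <- date()
dateDownloaded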

PS: When downloading the data file with the code above, I got the error messages below. They were caused by curl not being installed on my system; changing method = "curl" to method = "auto" fixed it.

Warning messages:
1: running command 'curl  "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD"  -o "./db/callsforservice.csv"' had status 127
2: In download.file(fileUrl, destfile = "./db/callsforservice.csv",  :
  download had nonzero exit status

Loading flat files

  • read.table()
classData <- read.table('./db/callsforservice.csv', sep = ',', header = T)
head(classData)

All Reading Functions

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
            dec = ",", fill = TRUE, comment.char = "", ...)

Reading XML Data

  • Extensible markup language
  • Frequently used to store structured data
  • Particularly widely used in internet applications
  • Extracting XML is the basis for most web scraping
  • Components
    • Markup - labels that give the text structure
    • Content - the actual text of the document
library(XML)

html <- "http://stackoverflow.com/search?q=XML+content+does+not+seem+to+be+XML%3A"
doc <- htmlTreeParse(html, useInternal = T)
content <- xpathSApply(doc, "//div[@class = 'result-link']", xmlValue)
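
Because the example above depends on a live URL, here is a self-contained sketch showing the same XML-package workflow on an inline document (the XML snippet is made up):

library(XML)

## Made-up XML document parsed from a string
xmlText <- '<breakfast_menu>
  <food><name>Waffles</name><price>$5.95</price></food>
  <food><name>Toast</name><price>$4.50</price></food>
</breakfast_menu>'

## asText = TRUE tells xmlTreeParse the input is a string, not a file path
doc <- xmlTreeParse(xmlText, asText = TRUE, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)                           ## "breakfast_menu"
xpathSApply(rootNode, "//name", xmlValue)   ## all food names
xpathSApply(rootNode, "//price", xmlValue)  ## all prices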

Reading JSON Data

  • jsonlite
install.packages('jsonlite')
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData)
names(jsonData$owner)
jsonData$owner$login

## print JSON data in a pretty way
myjson <- toJSON(jsonData$owner, pretty = T)
cat(myjson)

The data.table package

> library(data.table)
> DF = data.frame(x=rnorm(9), y=rep(c("a", "b", "c"),each=3),z=rnorm(9))
> head(DF, 3)
         x y          z
1 1.239493 a -0.3917245
2 1.090748 a  0.3640152
3 2.462106 a  1.3424369

> DT = data.table(x=rnorm(9), y=rep(c("a", "b", "c"),each=3),z=rnorm(9))
> head(DT)
           x y           z
1  0.1235667 a  0.94765708
2 -1.1491418 a  1.23264715
3 -2.3339784 a -0.70625463
4  0.4896532 b  0.07144038
5  0.7731791 b  0.45262096
6  0.1601838 b -0.30345490
DT[2,]
DT[DT$y == "a",]
DT[, c(2,3)]
DT[, list(mean(x), sum(z))]
DT[, table(y)]
DT[, w:=z^2]

DT[, m:= {tmp <- (x+z); log2(tmp+5)}]
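
Keys are what make data.table fast for subsetting and joins; a short sketch (the example tables are my own):

library(data.table)

DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)

setkey(DT1, x)      ## sort by x and mark it as the key
setkey(DT2, x)

DT1["a"]            ## fast keyed subset: rows where x == "a"
merge(DT1, DT2)     ## keyed join on x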

Reading from MySQL

install.packages("RMySQL")
library(RMySQL)

ucscDb <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb);

hg19 <- dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)

dbListFields(hg19, "affyU133Plus2")

dbGetQuery(hg19, "select count(*) from affyU133Plus2")

affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)

## processing a big table: send the query, then fetch results as needed
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)

affyMisSmall <- fetch(query, n = 10); dbClearResult(query);
dim(affyMisSmall)

dbDisconnect(hg19)    ## close db connection

Reading from HDF5

  • Used for storing large data sets.
  • Supports storing a range of data types
  • Hierarchical data format
  • Groups containing zero or more data sets and metadata
    • Have a group header with group name and list of attributes
    • Have a group symbol table with a list of objects in the group
  • Datasets: multidimensional arrays of data elements with metadata
    • Have a header with name, datatype, dataspace, and storage layout
    • Have a data array with the data
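
The notes give no code for this section, so here is a minimal sketch using the rhdf5 package from Bioconductor, which the course uses (the file and group names are illustrative):

## install.packages("BiocManager"); BiocManager::install("rhdf5")
library(rhdf5)

h5createFile("example.h5")               ## create an empty HDF5 file
h5createGroup("example.h5", "foo")       ## add a group

A <- matrix(1:10, nrow = 5, ncol = 2)
h5write(A, "example.h5", "foo/A")        ## write a dataset into the group

h5ls("example.h5")                       ## list groups and datasets
h5read("example.h5", "foo/A")            ## read the dataset back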

Reading from web

Getting a web document

  1. Use the built-in functions url() and readLines()
> con = url("http://www.baidu.com")
> htmlCode = readLines(con)
> close(con)
> htmlCode
  2. Use the XML package
> library(XML)

> url <- "http://www.baidu.com"
> html <- htmlTreeParse(url, useInternalNodes = T)

> xpathSApply(html, "//div", xmlValue)
  3. Use the httr and XML packages
install.packages("httr")
library(httr)
url <- "http://www.baidu.com"
html <- GET(url)
content = content(html, as="text")

library(XML)
parsedHtml = htmlParse(content, asText = T)
xpathSApply(parsedHtml, "//div", xmlValue)

Accessing websites with passwords

  1. Before logging in
> pg1 = GET("http://httpbin.org/basic-auth/user/passwd")
> pg1
Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2016-07-17 15:33
  Status: 401
  Content-Type: <unknown>
<EMPTY BODY>
  2. Logging in
> pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg2
Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2016-07-17 15:34
  Status: 200
  Content-Type: application/json
  Size: 47 B
{
  "authenticated": true,
  "user": "user"
}
> names(pg2)
 [1] "url"         "status_code" "headers"     "all_headers" "cookies"     "content"     "date"        "times"      
 [9] "request"     "handle"
  3. Use a handle so cookies and sessions persist across requests to the same site.
> pg = handle("http://httpbin.org")
> login = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg1 = GET(handle = pg, path = "/")
> pg2 = GET(handle = pg, path = "about")

Reading data from APIs

library(httr)   ## oauth_app() and friends come from httr
myapp = oauth_app("twitter", key = "yourConsumerKeyHere", secret = "yourConsumerSecretHere")
sig = sign_oauth1.0(myapp, token = "yourTokenHere", token_secret = "yourTokenSecretHere")
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)

json1 = content(homeTL)
json2 = jsonlite::fromJSON(toJSON(json1))
json2[1, 1:4]

Reposted from: https://my.oschina.net/skyler/blog/713563
