『Data Science』R Language Study Notes: Getting Data


Obtaining Data: Motivation

  • This course covers the basic ideas behind getting data ready for analysis
    • Finding and extracting raw data
    • Tidy data principles and how to make data tidy
    • Practical implementation through a range of R packages
  • What this course depends on
  • What would be useful
    • Exploratory analysis
    • Reporting Data and Reproducible Research

PS: free big data sources

GOAL: Raw data -> Processing script -> tidy data -> data analysis -> data communication

Raw and Processed Data

Data are values of qualitative or quantitative variables, belonging to a set of items.

  • Qualitative: Country of origin, sex, treatment
  • Quantitative: Height, weight, blood pressure

The components of tidy data

  1. The raw data.
  2. A tidy data set.
  3. A code book describing each variable and its values in the tidy data set.
  4. An explicit and exact recipe you used to go from 1 -> 2,3.

The tidy data

  1. Each variable you measure should be in one column.
  2. Each different observation of that variable should be in a different row.
  3. There should be one table for each "kind" of variable.
  4. If you have multiple tables, they should include a column in the table that allows them to be linked.

Others:

  • Include a row at the top of each file with variable names.
  • Make variable names human readable: AgeAtDiagnosis instead of AgeDx.
  • In general data should be saved in one file per table.
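
A quick illustration of these principles, as a minimal sketch (the data and the use of the reshape2 package are my own, not from the notes): an untidy wide table is melted so that each variable gets its own column and each observation its own row.

library(reshape2)

## Made-up example data. Untidy: one row per subject,
## one column per measurement occasion.
untidy <- data.frame(subject = c("s1", "s2"),
                     weight_before = c(70, 82),
                     weight_after  = c(68, 79))

## Tidy: each variable (subject, time, weight) in its own column
tidy <- melt(untidy, id.vars = "subject",
             variable.name = "time", value.name = "weight")
tidy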

Downloading Data

  1. Get/set your working directory
    • getwd()
    • setwd()
  2. Checking for and creating directories
    • file.exists("directoryName")
    • dir.create("directoryName")
  3. Getting data from the internet
    • download.file()
if (!file.exists('db')) {
  dir.create('db')
}

fileUrl <- "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/callsforservice.csv", method = "curl")
list.files('./db')
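
The course also suggests recording when a file was downloaded, since data on the web can change over time:

## Keep the download timestamp alongside the data
dateDownloaded <- date()
dateDownloaded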

PS: When downloading the data file with the code above, I got the error messages below. They were caused by curl not being installed on my system; changing method = "curl" to method = "auto" fixed it.

Warning messages:
1: running command 'curl  "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD"  -o "./db/callsforservice.csv"' had status 127
2: In download.file(fileUrl, destfile = "./db/callsforservice.csv",  :
  download had nonzero exit status

Loading flat files

  • read.table()
classData <- read.table('./db/callsforservice.csv', sep = ',', header = T)
head(classData)

All Reading Functions

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
            dec = ",", fill = TRUE, comment.char = "", ...)

Reading XML Data

  • Extensible markup language
  • Frequently used to store structured data
  • Particularly widely used in internet applications
  • Extracting XML is the basis for most web scraping
  • Components
    • Markup - labels that give the text structure
    • Content - the actual text of the document
library(XML)

html <- "http://stackoverflow.com/search?q=XML+content+does+not+seem+to+be+XML%3A"
doc <- htmlTreeParse(html, useInternal = T)
content <- xpathSApply(doc, "//div[@class = 'result-link']", xmlValue)
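
Because the example above depends on a live URL, here is a self-contained sketch showing the same XML-package workflow on an inline document (the XML snippet is made up):

library(XML)

## Made-up XML document parsed from a string
xmlText <- '<breakfast_menu>
  <food><name>Waffles</name><price>$5.95</price></food>
  <food><name>Toast</name><price>$4.50</price></food>
</breakfast_menu>'

## asText = TRUE tells xmlTreeParse the input is a string, not a file path
doc <- xmlTreeParse(xmlText, asText = TRUE, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)                           ## "breakfast_menu"
xpathSApply(rootNode, "//name", xmlValue)   ## all food names
xpathSApply(rootNode, "//price", xmlValue)  ## all prices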

Reading JSON Data

  • jsonlite
install.packages('jsonlite')
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData)
names(jsonData$owner)
jsonData$owner$login

## print JSON data in a pretty way
myjson <- toJSON(jsonData$owner, pretty = T)
cat(myjson)

The data.table package

> library(data.table)
> DF = data.frame(x=rnorm(9), y=rep(c("a", "b", "c"),each=3),z=rnorm(9))
> head(DF, 3)
         x y          z
1 1.239493 a -0.3917245
2 1.090748 a  0.3640152
3 2.462106 a  1.3424369

> DT = data.table(x=rnorm(9), y=rep(c("a", "b", "c"),each=3),z=rnorm(9))
> head(DT)
           x y           z
1  0.1235667 a  0.94765708
2 -1.1491418 a  1.23264715
3 -2.3339784 a -0.70625463
4  0.4896532 b  0.07144038
5  0.7731791 b  0.45262096
6  0.1601838 b -0.30345490
DT[2,]
DT[DT$y == "a",]
DT[, c(2,3)]
DT[, list(mean(x), sum(z))]
DT[, table(y)]
DT[, w:=z^2]

DT[, m:= {tmp <- (x+z); log2(tmp+5)}]
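
Keys are what make data.table fast for subsetting and joins; a short sketch (the example tables are my own):

library(data.table)

DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)

setkey(DT1, x)      ## sort by x and mark it as the key
setkey(DT2, x)

DT1["a"]            ## fast keyed subset: rows where x == "a"
merge(DT1, DT2)     ## keyed join on x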

Reading from MySQL

install.packages("RMySQL")
library(RMySQL)

ucscDb <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb);

hg19 <- dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)

dbListFields(hg19, "affyU133Plus2")

dbGetQuery(hg19, "select count(*) from affyU133Plus2")

affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)

## processing a big table: send the query, then fetch results as needed
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)

affyMisSmall <- fetch(query, n = 10); dbClearResult(query);
dim(affyMisSmall)

dbDisconnect(hg19)    ## close db connection

Reading from HDF5

  • Used for storing large data sets.
  • Supports storing a range of data types
  • Hierarchical data format
  • Groups containing zero or more data sets and metadata
    • Have a group header with group name and list of attributes
    • Have a group symbol table with a list of objects in the group
  • Datasets: multidimensional arrays of data elements with metadata
    • Have a header with name, datatype, dataspace, and storage layout
    • Have a data array with the data
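
The notes give no code for this section, so here is a minimal sketch using the rhdf5 package from Bioconductor, which the course uses (the file and group names are illustrative):

## install.packages("BiocManager"); BiocManager::install("rhdf5")
library(rhdf5)

h5createFile("example.h5")               ## create an empty HDF5 file
h5createGroup("example.h5", "foo")       ## add a group

A <- matrix(1:10, nrow = 5, ncol = 2)
h5write(A, "example.h5", "foo/A")        ## write a dataset into the group

h5ls("example.h5")                       ## list groups and datasets
h5read("example.h5", "foo/A")            ## read the dataset back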

Reading from web

Getting a web document

  1. Use the built-in functions url() and readLines()
> con = url("http://www.baidu.com")
> htmlCode = readLines(con)
> close(con)
> htmlCode
  2. Use the XML package
> library(XML)

> url <- "http://www.baidu.com"
> html <- htmlTreeParse(url, useInternalNodes = T)

> xpathSApply(html, "//div", xmlValue)
  3. Use the httr and XML packages
install.packages("httr")
library(httr)
url <- "http://www.baidu.com"
html <- GET(url)
content = content(html, as="text")

library(XML)
parsedHtml = htmlParse(content, asText = T)
xpathSApply(parsedHtml, "//div", xmlValue)

Accessing websites with passwords

  1. Before logging in
> pg1 = GET("http://httpbin.org/basic-auth/user/passwd")
> pg1
Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2016-07-17 15:33
  Status: 401
  Content-Type: <unknown>
<EMPTY BODY>
  2. Logging in
> pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg2
Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2016-07-17 15:34
  Status: 200
  Content-Type: application/json
  Size: 47 B
{
  "authenticated": true,
  "user": "user"
}
> names(pg2)
 [1] "url"         "status_code" "headers"     "all_headers" "cookies"     "content"     "date"        "times"      
 [9] "request"     "handle"
  3. Use a handle so cookies and sessions persist across requests to the same site.
> pg = handle("http://httpbin.org")
> login = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg1 = GET(handle = pg, path = "/")
> pg2 = GET(handle = pg, path = "about")

Reading data from APIs

library(httr)   ## oauth_app() and friends come from httr
myapp = oauth_app("twitter", key = "yourConsumerKeyHere", secret = "yourConsumerSecretHere")
sig = sign_oauth1.0(myapp, token = "yourTokenHere", token_secret = "yourTokenSecretHere")
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)

json1 = content(homeTL)
json2 = jsonlite::fromJSON(toJSON(json1))
json2[1, 1:4]

Reposted from: https://my.oschina.net/skyler/blog/713563
