Chapter 2 Data Exploration
Contents
A. Data Types
B. Record Data
C. Types of Attributes
A. About Data Quality
B. Preprocessing
① Quality
② Sampling
③ Attribute Selection
④ Dimensionality Reduction
⑤ Discretization: Binning
⑥ Statistics
⑦ Visualization
1. What is Data:
A. Data Types: Document Data, Transaction Data, Graph Data, Sequence Data, Spatial-Temporal Data, Record Data, Data Matrix
Spatial [ˈspeɪʃl]: relating to space
Temporal [ˈtempərəl]: relating to time
B. Record Data:
A collection of data objects and their attributes
An attribute is a property or characteristic of an object
A collection of attributes describes an object
property [ˈprɑːpərti]: a quality or trait
characteristic [ˌkærəktəˈrɪstɪk]: a distinguishing feature
C. Types of Attributes:
① Discrete Attribute and Continuous Attribute
② Nominal Attribute and Ordinal Attribute
③ Interval Attribute and Ratio Attribute
Nominal [ˈnɒmɪnl]: by name
Ordinal [ˈɔːrdənl]: ordered, ranked
Interval [ˈɪntəvl]: a range between two values
Ratio [ˈreɪʃioʊ]: a proportion between two values
2. Data Exploration:
A. About Data Quality: Data in the real world is dirty.
① incomplete: lacking attribute values
② noisy: data errors, outliers
③ inconsistent: discrepancy between duplicate records
outlier [ˈaʊtlaɪər]: a value far from the rest; anomalous
discrepancy [dɪsˈkrepənsi]: a difference or inconsistency
duplicate [ˈduːplɪkət]: identical; copied
B. Preprocessing:
① Quality: Handle missing values (Ignore or Estimate), Remove Outliers, Resolve Conflicts (Merge or Identify)
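The two options for missing values (ignore or estimate) and the outlier step can be sketched as follows. This is a minimal illustration, not a method prescribed by these notes: the function names are hypothetical, missing values are assumed to be stored as `None`, and the outlier test uses the common 1.5 × IQR rule, which is one convention among several.

```python
from statistics import mean, median, quantiles

def drop_missing(values):
    """'Ignore' strategy: discard entries whose value is missing (None)."""
    return [v for v in values if v is not None]

def fill_missing(values, strategy="mean"):
    """'Estimate' strategy: replace None with the mean or median of the observed values."""
    observed = drop_missing(values)
    estimate = mean(observed) if strategy == "mean" else median(observed)
    return [estimate if v is None else v for v in values]

def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, not from the notes)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

ages = [22, 25, None, 24, 23, 120]           # one missing value, one suspicious value
filled = fill_missing(ages, strategy="median")
cleaned = remove_outliers(filled)
print(filled)
print(cleaned)
```

The median is used for the estimate here because the mean would be pulled upward by the extreme value 120.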
② Sampling:
Key principle: using a sample works almost as well as using the entire data set, if the sample is representative.
A sample is representative if it has approximately the same properties as the original data set.
Types of Sampling: Simple Random Sampling, Sampling without Replacement, Sampling with Replacement, Stratified Sampling
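The sampling types listed above might look like this with Python's standard `random` module. The function names, the `fraction` parameter, and the rule of drawing at least one object per stratum are assumptions made for this sketch.

```python
import random
from collections import defaultdict

def sample_without_replacement(data, k, seed=None):
    """Each object can be selected at most once."""
    return random.Random(seed).sample(data, k)

def sample_with_replacement(data, k, seed=None):
    """The same object may be selected more than once."""
    return random.Random(seed).choices(data, k=k)

def stratified_sample(data, key, fraction, seed=None):
    """Split the data into groups (strata) by key, then sample each group separately."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # assumed: at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 90 objects of class "a" and 10 of class "b"
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
picked = stratified_sample(records, key=lambda r: r[0], fraction=0.1, seed=0)
print(len(picked))
```

Stratified sampling keeps the rare class "b" in the sample, which a small simple random sample could miss entirely.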
Sampling Rate:
③ Attribute Selection: Redundant Attributes and Irrelevant Attributes
stratified [ˈstrætɪfaɪd]: arranged in layers
redundant [rɪˈdʌndənt]: superfluous
irrelevant [ɪˈreləvənt]: unrelated
④ Dimensionality Reduction:
Reduce the number of attributes by creating a new set of attributes.
⑤ Discretization: Binning
Convert numerical data into categorical data
Divide the range into N intervals
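A minimal sketch of binning, assuming the N intervals have equal width (the notes do not say how the range is divided; equal-frequency binning is another common choice). The function name and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each number to the index (0 .. n_bins-1) of the equal-width interval it falls in."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:                      # all values identical: everything lands in bin 0
        return [0] * len(values)
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        labels.append(min(idx, n_bins - 1))  # the maximum value belongs to the last bin
    return labels

temperatures = [0, 5, 10, 15, 20, 25, 30]
print(equal_width_bins(temperatures, 3))   # → [0, 0, 1, 1, 2, 2, 2]
```

The bin indices can then be given category names such as "low", "medium", and "high".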
⑥ Statistics:
Center Measurement: Mean, Median
Frequency Distribution: Mode
Variability Measurement: Variance, Standard Deviation
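The measures above are all available in Python's standard `statistics` module. The sketch below uses the population forms of variance and standard deviation (an assumption; use `variance` and `stdev` for the sample forms), and the `summarize` helper and the data are made up for illustration.

```python
from statistics import mean, median, multimode, pvariance, pstdev

def summarize(values):
    """Return the center, frequency, and variability measures for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),       # list of the most frequent value(s)
        "variance": pvariance(values),   # population variance
        "std_dev": pstdev(values),       # population standard deviation
    }

scores = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(scores)
print(summary)
```

For this data the mean is 5, the median is 4.5, the mode is 4, the variance is 4, and the standard deviation is 2.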
⑦ Visualization:
Visualization is the conversion of data into a visual or tabular format
so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported
Visualization of data is one of the most powerful and appealing techniques for Data Exploration
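dimensionality [dɪˌmɛnʃəˈnæləti]: the number of dimensions
discretization: conversion into discrete values
binning [ˈbɪnɪŋ]: grouping values into bins
categorical [ˌkætəˈɡɔːrɪkl]: relating to categories
mode: the most frequent value
deviation: departure from a reference value
tabular [ˈtæbjələr]: arranged in a table
appealing: attractive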
Examples of Visualization:
Sea Surface Temperature
Histogram [ˈhɪstəɡræm]
Box Plot
Scatter Plot
Correlation Matrix
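As a rough sketch of the numbers behind a correlation matrix, the following computes the Pearson correlation for every pair of attributes. Pearson's coefficient is an assumption (the notes do not name a measure), and the function names and the columns `x`, `y`, `z` are invented for the example. A plotting library would normally render the result as a colored grid.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Build a nested dict: matrix[a][b] is the correlation of attribute a with attribute b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(v, 2) for v in row.values()])
```

Values near 1 or -1 indicate strongly related attributes, and values near 0 indicate little linear relationship.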