
PCA (Principal Component Analysis): Two Implementations, with scikit-learn and NumPy

'''
PCA with the Iris dataset – a manual example: implement PCA by hand with
NumPy, then reproduce the result with scikit-learn's PCA module.
'''

# import the Iris dataset from scikit-learn
from sklearn.datasets import load_iris
# import our plotting module
import matplotlib.pyplot as plt
# load the Iris dataset
iris = load_iris()
# create X and y variables to hold the features and the response column
iris_X, iris_y = iris.data, iris.target
# the names of the flowers we are trying to predict
print(iris.target_names)
# the names of the features
print(iris.feature_names)

# Build the covariance matrix. For mean-centered data X_c with n samples,
# the covariance matrix is C = X_c.T @ X_c / (n - 1)
# import numpy
import numpy as np
# calculate the mean vector
mean_vector = iris_X.mean(axis=0)
print(mean_vector)
#[ 5.84333333  3.054       3.75866667  1.19866667]
# calculate the covariance matrix; it is symmetric, with as many rows and columns as there are features
cov_mat = np.cov((iris_X-mean_vector).T)
print(cov_mat.shape)
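
# Quick check (my addition, not in the original walkthrough): np.cov on the
# transposed, centered data matches the textbook formula C = X_c.T @ X_c / (n - 1)
# given above.
n_samples = iris_X.shape[0]
manual_cov = (iris_X - mean_vector).T @ (iris_X - mean_vector) / (n_samples - 1)
print(np.allclose(manual_cov, cov_mat))  # True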

# Compute the eigenvalues and eigenvectors of the covariance matrix
# calculate the eigenvectors and eigenvalues of our covariance matrix of the iris dataset
eig_val_cov, eig_vec_cov = np.linalg.eig(cov_mat)
# Print the eigenvectors and corresponding eigenvalues
# (note that np.linalg.eig does not guarantee any particular order)
for i in range(len(eig_val_cov)):
	eigvec_cov = eig_vec_cov[:, i]
	print('Eigenvector {}: \n{}'.format(i + 1, eigvec_cov))
	print('Eigenvalue {} from covariance matrix: {}'.format(i + 1, eig_val_cov[i]))
	print(30 * '-')
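
# Since np.linalg.eig returns the eigenpairs in no guaranteed order, here is a
# small sketch (an addition, not part of the original walkthrough) that sorts
# them by descending eigenvalue so the top-k selection below is always safe:
order = np.argsort(eig_val_cov)[::-1]  # indices of eigenvalues, largest first
eig_val_cov = eig_val_cov[order]       # reorder the eigenvalues
eig_vec_cov = eig_vec_cov[:, order]    # reorder the matching eigenvector columns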

# With the eigenpairs sorted, pick the top-k eigenvectors; each eigenvalue's
# share of the total is the fraction of variance its component explains
explained_variance_ratio = eig_val_cov / eig_val_cov.sum()
print(explained_variance_ratio)

# Scree plot to visualize how much variance each eigenvalue/eigenvector captures
plt.plot(np.cumsum(explained_variance_ratio))
plt.title('Scree Plot')
plt.xlabel('Principal Component (k)')
plt.ylabel('% of Variance Explained <= k')
plt.show()
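
# A handy follow-up (my own sketch, not from the original post): read the
# smallest k whose cumulative explained variance reaches a threshold, say 95%,
# straight off the same cumulative sum.
cum_var = np.cumsum(explained_variance_ratio)
k = int(np.searchsorted(cum_var, 0.95)) + 1  # +1 turns a 0-based index into a count
print('components needed for 95% of the variance:', k)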

# Use the retained eigenvectors to project the original data into a new matrix
# store the top two eigenvectors in a variable (here we keep the first two)
top_2_eigenvectors = eig_vec_cov[:, :2].T
# transposed so that each row is a principal component: two rows == two components
print(top_2_eigenvectors)

# to transform our data from shape (150, 4) to (150, 2),
# multiply the data matrix by the (transposed) matrix of top eigenvectors
print(np.dot(iris_X, top_2_eigenvectors.T))
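
# One caveat worth making explicit (my addition): the projection above uses the
# raw, uncentered data, whereas PCA is normally applied to mean-centered data,
# which is also what scikit-learn does internally. The centered version of the
# same projection looks like this:
iris_X_centered = iris_X - mean_vector
print(np.dot(iris_X_centered, top_2_eigenvectors.T)[:5])  # now comparable to sklearn's output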



# The above implements PCA with raw NumPy; scikit-learn ships a PCA module as well.
# scikit-learn's version of PCA
from sklearn.decomposition import PCA
# instantiate the class, asking for two components
pca = PCA(n_components=2)

# fit the PCA to our data
pca.fit(iris_X)
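
# Aside (my addition, using an option scikit-learn's PCA supports): n_components
# also accepts a float in (0, 1), in which case PCA keeps the smallest number of
# components whose cumulative explained variance reaches that fraction.
pca_95 = PCA(n_components=0.95).fit(iris_X)
print(pca_95.n_components_)  # the number of components actually retained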
# inspect the fitted principal components
print(pca.components_)
# transform the data. scikit-learn's PCA centers the data automatically, so the
# output differs slightly from the uncentered manual projection above, but this
# does not affect downstream model predictions.
print(pca.transform(iris_X)[:5])
# fraction of the variance explained by each component
print(pca.explained_variance_ratio_)
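
# Sanity check (my own sketch): up to a possible sign flip per component, the
# hand-computed eigenvectors should match scikit-learn's components.
for manual, learned in zip(top_2_eigenvectors, pca.components_):
	assert np.allclose(manual, learned) or np.allclose(manual, -learned)
print('manual eigenvectors match sklearn components (up to sign)')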

# Plot the original and projected data
label_dict = {i: k for i, k in enumerate(iris.target_names)}
def plot(X, y, title, x_label, y_label):
	plt.subplot(111)
	for label, marker, color in zip(range(3), ('^', 's', 'o'), ('blue', 'red', 'green')):
		plt.scatter(x=X[:, 0].real[y == label], y=X[:, 1].real[y == label],
			marker=marker, color=color, alpha=0.5, label=label_dict[label])
	plt.xlabel(x_label)
	plt.ylabel(y_label)
	leg = plt.legend(loc='upper right', fancybox=True)
	leg.get_frame().set_alpha(0.5)
	plt.title(title)
	plt.show()

plot(iris_X, iris_y, "Original Iris Data", "sepal length (cm)","sepal width (cm)")
plot(pca.transform(iris_X), iris_y, "Iris: Data projected onto first two PCA components", "PCA1", "PCA2")
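
# To back up the earlier claim that the centering difference does not affect
# model predictions, a minimal downstream sketch (my addition; the choice of
# KNeighborsClassifier is arbitrary) chains PCA with a classifier:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn_on_pca = Pipeline([('pca', PCA(n_components=2)),
	('knn', KNeighborsClassifier())])
scores = cross_val_score(knn_on_pca, iris_X, iris_y, cv=5)
print('mean CV accuracy with 2 PCA components:', scores.mean())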
