Randomized SVD is great!
Truncated SVD
The "full" in full SVD means that U and V are both square matrices; the parts boxed out by the dashed lines correspond to zero singular values and contribute nothing to reconstructing A.
Reduced SVD and Truncated SVD are different: Truncated SVD drops the trailing (smallest) singular values to approximate the original matrix,
while Reduced SVD just removes the redundant bottom part, so the corresponding Σ is still square.
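The distinction is easy to see in NumPy (a small sketch with a random matrix, using the `full_matrices` flag of `np.linalg.svd`; the sizes are illustrative):

```python
import numpy as np

A = np.random.randn(8, 5)  # tall matrix: m=8 rows, n=5 columns

# Full SVD: U is m x m, Vh is n x n; U's last m-n columns only ever
# multiply the zero rows of Sigma, so they contribute nothing to A
U, s, Vh = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vh.shape)   # (8, 8) (5,) (5, 5)

# Reduced SVD: drop the redundant bottom part, U becomes m x n
Ur, sr, Vhr = np.linalg.svd(A, full_matrices=False)
print(Ur.shape)                     # (8, 5)

# Truncated SVD: additionally drop the smallest singular values (keep k=2)
k = 2
A_approx = Ur[:, :k] @ np.diag(sr[:k]) @ Vhr[:k, :]
print(A_approx.shape)               # (8, 5) -- same shape as A, but rank 2
```

Note that the reduced SVD still reconstructs A exactly, while the truncated one only approximates it.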
If we want the Truncated SVD, one way is to do the full SVD first and then cut off some columns. But the full SVD is slow. This is the calculation we did above using Scipy's linalg.svd:
# 32s
%time U, s, Vh = linalg.svd(vectors, full_matrices=False)
Fortunately, there is a faster way, since we only need the top 5 singular values:
# 4.92s
%time u, s, v = decomposition.randomized_svd(vectors, 5)
# ((2034, 5), (5,), (5, 26576))
u.shape, s.shape, v.shape
Complexity of SVD
The runtime complexity for the SVD of an m×n matrix is O(min(mn², m²n)).
How to speed up SVD?
Idea: Let's use a smaller matrix (with a smaller n)!
Instead of calculating the SVD on our full matrix A, which is m×n, let's use a matrix Q, which is just m×r with r << n.
Q is tall and skinny, as if we had cut off some of A's columns. The "approximate" part is that the columns of Q span a space similar to the one spanned by the columns of A. So we can span essentially the same space with fewer columns.
We haven’t found a better general SVD method, we are just using the method we have on a smaller matrix.
Implement Randomized SVD
- 1. Compute an approximation to the range of A. Note that the range of A is the span (set of linear combinations) of A's columns. Intuitively: view A as a transformation matrix that maps a vector x to Ax; the set of all such Ax makes up the range of A. That is why the code below repeatedly multiplies by A in a loop: we want the resulting Q to have a column space similar to A's, but with fewer columns. That is, we want a Q with r orthonormal columns such that A ≈ QQ^T A.
[Note] If Q were an orthogonal matrix (rows and columns both orthonormal), we would have QQ^T = I exactly. But here only Q's columns are orthonormal, so QQ^T A only approximates A.
- 2. Construct B = Q^T A, which is small (r×n). Since r << m, B is much smaller than A.
- 3. Compute the SVD of B by standard methods (fast since B is smaller than A): B = SΣV^T.
- 4. Since A ≈ QQ^T A = Q(SΣV^T), if we set U = QS, then we have a low rank approximation A ≈ UΣV^T. This gives us an SVD-style factorization of A, and at a lower rank!
[Note] For this to count as an SVD, the columns of U must be orthonormal. Does U = QS guarantee that? It does, because Q and S both have orthonormal columns: (QS)^T (QS) = S^T (Q^T Q) S = S^T S = I.
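We can check this claim numerically. The sketch below builds a Q from a random projection (how to find Q properly is covered in the next section) and verifies that U = QS has orthonormal columns:

```python
import numpy as np

m, n, r = 50, 30, 5
A = np.random.randn(m, n)

# a Q with orthonormal columns, here via QR of a random projection of A
Q, _ = np.linalg.qr(A @ np.random.randn(n, r))

# SVD of the small matrix B = Q^T A; the left factor S also has
# orthonormal columns
B = Q.T @ A
S, sigma, Vh = np.linalg.svd(B, full_matrices=False)

# U = QS: (QS)^T (QS) = S^T (Q^T Q) S = S^T S = I
U = Q @ S
print(np.allclose(U.T @ U, np.eye(r)))  # True
```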
How to find Q
Use random sampling to construct the matrix Q:
- 1. Form an n×r Gaussian random matrix W.
- 2. Form the matrix product AW.
- 3. Use a QR decomposition of AW to obtain an orthonormal basis Q for its range.
To estimate the range of A, we can just take a bunch of random vectors w_i and evaluate the subspace formed by the Aw_i. We can form a matrix W with the w_i as its columns. Now, we take the QR decomposition AW = QR; the columns of Q then form an orthonormal basis for AW, which gives us the range of A. (Multiplying A by a bunch of random vectors w_i transforms them into other vectors, and taken together those transformed vectors can roughly be viewed as the range of A.)
And R is an upper triangular matrix.
Since the product AW has far more rows than columns, its columns are, approximately, linearly independent. This is simple probability: with lots of rows and few columns, it's unlikely that the columns are linearly dependent.
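Here is that idea in its most basic form, before the power-iteration refinement used below (a sketch with random data; the sizes are illustrative):

```python
import numpy as np

m, n, r = 100, 40, 10
A = np.random.randn(m, n)

W = np.random.randn(n, r)         # random vectors w_i as the columns of W
Q, R = np.linalg.qr(A @ W)        # QR decomposition of AW

print(Q.shape)                           # (100, 10)
print(np.allclose(Q.T @ Q, np.eye(r)))   # True: orthonormal basis for AW
print(np.allclose(R, np.triu(R)))        # True: R is upper triangular
```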
The method randomized_range_finder finds an orthonormal matrix Q whose range approximates the range of A (step 1 in our algorithm above). To do so, we use LU and QR factorizations:
# computes an orthonormal matrix whose range approximates the range of A
# power_iteration_normalizer can be safe_sparse_dot (fast but unstable)
def randomized_range_finder(A, size, n_iter=5):
    Q = np.random.normal(size=(A.shape[1], size))
    # Repeatedly forming A @ Q here pushes Q toward the range of A; recall
    # that the range of A is the set of vectors A transforms other vectors
    # into. The LU factorization of each product acts as a kind of
    # normalization: multiplying by the same matrix over and over is
    # numerically unstable (entries explode to infinity or vanish to 0).
    for i in range(n_iter):
        Q, _ = linalg.lu(A @ Q, permute_l=True)
        Q, _ = linalg.lu(A.T @ Q, permute_l=True)
    Q, _ = linalg.qr(A @ Q, mode='economic')
    return Q
How to choose r
Suppose our matrix has 100 columns, and we want 5 columns in U and V. To be safe, we should project our matrix onto an orthogonal basis with a few more rows and columns than 5 (let's use 15). At the end, we will just grab the first 5 columns of U and V.
So even though our projection was only approximate, by making it a bit bigger than we need, we can make up for the loss of accuracy (since we're only taking a subset later).
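The effect of this buffer can be probed with sklearn's `randomized_svd` on a synthetic matrix (a sketch; the matrix and sizes are made up for illustration):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
# a 200x100 matrix with a smoothly decaying spectrum
U0, _ = np.linalg.qr(rng.randn(200, 100))
V0, _ = np.linalg.qr(rng.randn(100, 100))
M = U0 @ np.diag(1.0 / np.arange(1, 101)) @ V0.T

k = 5
def rank_k_error(n_oversamples):
    u, s, v = randomized_svd(M, k, n_oversamples=n_oversamples,
                             random_state=0)
    return np.linalg.norm(M - u @ np.diag(s) @ v)

# with the buffer, the error is typically much closer to the optimal
# rank-k error than without it
print(rank_k_error(0), rank_k_error(10))
```

No randomized method can beat the optimal rank-k error given by the full SVD (Eckart-Young); oversampling just closes the gap to it.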
Implement
And here’s our randomized SVD method:
- 4. Construct the low-dimensional matrix B = Q^T M, which has (k+p) rows.
- 5. Compute the SVD of B: B = Uhat Σ V^T.
- 6. Use Q to update the left singular vectors: U = Q Uhat — this is the U = QS step from before.
- 7. Take the final result U[:, :k], Σ[:k], V^T[:k, :].
def randomized_svd(M, n_components, n_oversamples=10, n_iter=4):
    # `n_components` is the number of top singular values we want,
    # i.e. the topics in our topic model
    n_random = n_components + n_oversamples
    Q = randomized_range_finder(M, n_random, n_iter)
    # project M to the (k+p)-dim space using the basis vectors in Q
    B = Q.T @ M
    # compute the SVD on the thin matrix: only (k+p) rows tall
    Uhat, s, V = linalg.svd(B, full_matrices=False)
    del B
    U = Q @ Uhat
    return U[:, :n_components], s[:n_components], V[:n_components, :]
Q: Why do we need `n_oversamples`? If we only want 5 topics, why compute a few more, say 15?
`n_oversamples` is kind of like a safety buffer, because we know nothing in advance about how many topics these documents should really fall into. If you take only 5 topics, some documents that don't belong to any of them may get squeezed into those 5 topics. But if you take 15 topics, the documents in the top 5 topics end up much cleaner.
# 137ms
%time u, s, v = randomized_svd(vectors, 5)
#((2034, 5), (5,), (5, 26576))
u.shape, s.shape, v.shape
Plot Error
Calculate the error of the decomposition as we vary the number of topics, then plot the result:
step = 20
n = 20
error = np.zeros(n)
for i in range(n):
    U, s, V = randomized_svd(vectors, i * step)
    reconstructed = U @ np.diag(s) @ V
    error[i] = np.linalg.norm(vectors - reconstructed)
plt.plot(range(0, n*step, step), error)
As the plot shows, the first few singular values capture far more of the matrix than the later ones do: the reconstruction error drops steeply at first and then levels off.
Randomized SVD in sklearn
%time u, s, v = decomposition.randomized_svd(vectors, 5)
%time u, s, v = decomposition.randomized_svd(vectors.todense(), 5)
Background Removal with Randomized SVD
We use the classic BMC 2012 Background Models Challenge Dataset.
An image from one moment in time is 60 pixels by 80 pixels (when scaled). We can unroll that picture into a single tall column. So instead of having a 2D picture that is 60×80, we have a column vector of length 4800 = 60×80.
This isn't very human-readable, but it's handy because it lets us stack the images from different times on top of one another, to put a video all into one matrix. If we took a video image every hundredth of a second for 113 seconds (so 11,300 different images, each from a different point in time), we'd have a 4800×11300 matrix representing the video!
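The unroll-and-stack step is just a reshape per frame. A sketch with a fake video array (`video` and `n_frames` are made up for illustration, standing in for the real clip):

```python
import numpy as np

dims = (60, 80)                 # height x width of one scaled frame
n_frames = 100                  # a short fake clip instead of all 11,300 frames
video = np.random.rand(n_frames, *dims)

# unroll each 60x80 frame into a length-4800 column, stack columns side by side
M = video.reshape(n_frames, -1).T
print(M.shape)                  # (4800, 100)

# any column can be folded back into a 2D image
frame = np.reshape(M[:, 42], dims)
print(np.allclose(frame, video[42]))  # True
```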
The figure below shows the matrix M after unrolling and stacking: (4800, 11300), where 4800 is the length of one flattened 60×80 frame and 11300 is the number of frames. The solid horizontal lines in the figure are the background; the wavy streaks are people moving; each column is a moment in time. Every column of the original matrix M is one frame, flattened into a vector.
Background separation with Randomized SVD
If we do a rank=2 SVD decomposition:
from sklearn import decomposition
# rank=2
u, s, v = decomposition.randomized_svd(M, 2)
# ((4800, 2), (2,), (2, 11300))
u.shape, s.shape, v.shape
# low_rank.shape: (4800, 11300)
low_rank = u @ np.diag(s) @ v
plt.figure(figsize=(12, 12))
plt.imshow(low_rank, cmap='gray')
As the image above shows, with rank=2 we capture almost nothing but the background, so the difference between the original matrix and low_rank is the moving people.
With rank=2 we keep only the 2 largest singular values, which capture the most dominant information, i.e. the background, so the reconstruction is essentially a background matrix.
plt.imshow(np.reshape(low_rank[:,140], dims), cmap='gray');
plt.imshow(np.reshape(M[:,140] - low_rank[:,140], dims), cmap='gray')
If we do a rank=1 Randomized SVD instead, the separation is even better.