
Gradient Descent and Least Squares


Preliminaries

Taylor expansion of a univariate function

$$f(x+\Delta x)=f(x)+f'(x)\Delta x+\frac{1}{2}f''(x)\Delta x^{2}+\cdots$$

Taylor expansion of a multivariate function

$$f(\overrightarrow{x}+\Delta\overrightarrow{x})=f(\overrightarrow{x})+[\nabla f(\overrightarrow{x})]^{T}\Delta\overrightarrow{x}+\frac{1}{2}\Delta\overrightarrow{x}^{T}\nabla^{2}f(\overrightarrow{x})\Delta\overrightarrow{x}+\cdots$$

where $[\nabla f(\overrightarrow{x})]^{T}=\left[\frac{\partial f}{\partial x_{1}},\frac{\partial f}{\partial x_{2}},\cdots,\frac{\partial f}{\partial x_{d}}\right]$ and $\nabla^{2}f(\overrightarrow{x})=\begin{bmatrix}\frac{\partial^{2} f}{\partial x_{1}\partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2}\partial x_{1}} & \cdots & \frac{\partial^{2} f}{\partial x_{d}\partial x_{1}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^{2} f}{\partial x_{1}\partial x_{d}} & \cdots & \cdots & \frac{\partial^{2} f}{\partial x_{d}\partial x_{d}}\end{bmatrix}$
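
For example, for $f(\overrightarrow{x})=x_{1}^{2}+x_{1}x_{2}+x_{2}^{2}$ we have $[\nabla f(\overrightarrow{x})]^{T}=[2x_{1}+x_{2},\ x_{1}+2x_{2}]$ and $\nabla^{2}f(\overrightarrow{x})=\begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}$; since $f$ is quadratic, its second-order Taylor expansion is exact.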

Extreme point (local minimum)

$$\nabla f(\overrightarrow{x})=0,\qquad \nabla^{2}f(\overrightarrow{x})>0$$

Here a symmetric matrix satisfying $\nabla^{2}f(\overrightarrow{x})>0$ is called positive definite; equivalent conditions are that all of its eigenvalues are positive, or that all of its leading principal minors are positive.
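
Positive definiteness can be checked numerically with either criterion; a minimal NumPy sketch using the Hessian of the quadratic example above (all values are illustrative):

import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, 2.0]])     # Hessian of f(x) = x1^2 + x1*x2 + x2^2

# Criterion 1: all eigenvalues are positive
eigvals = np.linalg.eigvalsh(H)              # eigenvalues of a symmetric matrix
print(eigvals, bool(np.all(eigvals > 0)))    # [1. 3.] True

# Criterion 2: all leading principal minors are positive
minors = [np.linalg.det(H[:k, :k]) for k in range(1, H.shape[0] + 1)]
print(minors, all(m > 0 for m in minors))    # both minors are positive -> True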

Gradient Descent

Principle

To find a minimum we need $f(\overrightarrow{x}+\Delta\overrightarrow{x})-f(\overrightarrow{x})\approx[\nabla f(\overrightarrow{x})]^{T}\Delta\overrightarrow{x}<0$; this is guaranteed by taking $\Delta\overrightarrow{x}=-\alpha\nabla f(\overrightarrow{x})$ with $\alpha>0$, since then $[\nabla f(\overrightarrow{x})]^{T}\Delta\overrightarrow{x}=-\alpha\parallel\nabla f(\overrightarrow{x})\parallel^{2}<0$.

To keep the Taylor expansion valid in a neighborhood of $\overrightarrow{x}$, the step size $\alpha$ must be a small positive number.

Steps

  1. Choose an initial point $\overrightarrow{x_{i}}$ (iteration index $i=1,2,\cdots,n$).
  2. Compute $\nabla f(\overrightarrow{x_{i}})$ and set $\Delta\overrightarrow{x_{i}}=-\nabla f(\overrightarrow{x_{i}})$.
  3. Update $\overrightarrow{x_{i+1}}=\overrightarrow{x_{i}}+\alpha\Delta\overrightarrow{x_{i}}$.
  4. Check whether $\parallel f(\overrightarrow{x_{i+1}})-f(\overrightarrow{x_{i}})\parallel_{2}^{2}\leq\varepsilon$; if the inequality holds, stop, otherwise set $i=i+1$ and repeat steps 2, 3 and 4 (a minimal sketch follows these steps).
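
A minimal Python sketch of these steps, using NumPy and the illustrative quadratic objective from the preliminaries (the function and all values below are examples, not from the original):

import numpy as np

def f(x):
    # illustrative objective: f(x) = x1^2 + x1*x2 + x2^2
    return x[0] ** 2 + x[0] * x[1] + x[1] ** 2

def grad_f(x):
    # its gradient, as given in the preliminaries example
    return np.array([2 * x[0] + x[1], x[0] + 2 * x[1]])

x = np.array([3.0, -2.0])      # step 1: initial point
alpha, eps = 0.1, 1e-10        # learning rate and stopping tolerance
while True:
    delta = -grad_f(x)         # step 2: descent direction
    x_new = x + alpha * delta  # step 3: update
    if (f(x_new) - f(x)) ** 2 <= eps:  # step 4: stopping criterion
        break
    x = x_new

print(x_new)                   # converges toward the minimizer (0, 0)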

Least Squares

Steps

  1. Given $n$ data pairs $\{\overrightarrow{x_{i}},y_{i}\}$, where $\overrightarrow{x_{i}}$ is a $d$-dimensional row vector ($i=1,2,3,\cdots,n$).

  2. Introduce parameters $\{\beta_{0},\beta_{1},\cdots,\beta_{d}\}$; the error is $\varepsilon_{i}=y_{i}-(\overrightarrow{x_{i}}\overrightarrow{\beta}+\beta_{0})$, $i=1,2,3,\cdots,n$, where $\overrightarrow{\beta}=\{\beta_{1},\beta_{2},\cdots,\beta_{d}\}^{T}$.

  3. The loss function is:

    1. $J(\overrightarrow{\beta},\beta_{0})=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-(\overrightarrow{x_{i}}\overrightarrow{\beta}+\beta_{0})\right)^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\left(\sum_{j=1}^{d}x_{ij}\beta_{j}+\beta_{0}\right)\right)^{2}$

    2. Let $x_{i0}=1\ (i=1,2,\cdots,n)$ and redefine $\overrightarrow{\beta}=\{\beta_{0},\beta_{1},\cdots,\beta_{d}\}^{T}$ (so each $\overrightarrow{x_{i}}$ now begins with the constant 1); then the loss can be written as
      $J(\overrightarrow{\beta})=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\overrightarrow{x_{i}}\overrightarrow{\beta}\right)^{2}$

    3. Further, let $X=\begin{bmatrix}1 & x_{11} & \cdots & x_{1d}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & \cdots & x_{nd}\end{bmatrix}$, $Y=[y_{1},y_{2},\cdots,y_{n}]^{T}$, $B=[\beta_{0},\beta_{1},\cdots,\beta_{d}]^{T}$; then

    $J(B)=\frac{1}{n}(Y-XB)^{T}(Y-XB)$

  4. Solving with gradient descent

    1. Following the Matrix Cookbook, expand the expression and differentiate with respect to $B$:

    $J(B)=\frac{1}{n}(Y-XB)^{T}(Y-XB)=\frac{1}{n}(Y^{T}Y-Y^{T}XB-B^{T}X^{T}Y+B^{T}X^{T}XB)$

    $\nabla J(B)=\frac{1}{n}(-X^{T}Y-X^{T}Y+2X^{T}XB)=\frac{1}{n}(-2X^{T}Y+2X^{T}XB)$

    Setting the gradient to zero gives $-2X^{T}Y+2X^{T}XB=0$, so $B=(X^{T}X)^{-1}X^{T}Y$ (the closed-form solution). Alternatively, iterate:

    1. $B_{i+1}=B_{i}-\alpha\nabla J(B_{i})$
    2. Check whether $\parallel J(B_{i+1})-J(B_{i})\parallel_{2}^{2}\leq\varepsilon$; if the inequality holds, stop, otherwise set $i=i+1$ and repeat steps 1 and 2 (see the sketch after this list).
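
A minimal NumPy sketch of both solutions, the closed form $B=(X^{T}X)^{-1}X^{T}Y$ and the gradient-descent iteration (the synthetic data and all values below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # prepend the column of 1s
B_true = np.array([1.0, 2.0, -0.5, 0.3])
Y = X @ B_true + 0.01 * rng.normal(size=n)                   # noisy linear data

# Closed-form solution of X^T X B = X^T Y
B_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent on J(B) = (1/n) (Y - XB)^T (Y - XB)
B = np.zeros(d + 1)
alpha, eps = 0.1, 1e-12
while True:
    grad = (2.0 / n) * (X.T @ X @ B - X.T @ Y)   # the gradient derived above
    B_new = B - alpha * grad
    J_old = np.sum((Y - X @ B) ** 2) / n
    J_new = np.sum((Y - X @ B_new) ** 2) / n
    B = B_new
    if (J_new - J_old) ** 2 <= eps:
        break

print(B_closed)
print(B)        # the two estimates should nearly coincide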

Note: gradient formulas for several special functions (gradients taken with respect to the vector $X$)

  • $\nabla(b^{T}X)=b$
  • $\nabla(X^{T}b)=b$
  • $\nabla(X^{T}X)=2X$
  • $\nabla(X^{T}AX)=2AX$ (where $A$ is a symmetric matrix)
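
These identities can be spot-checked numerically; a minimal finite-difference sketch for the last one, with illustrative values:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])     # symmetric matrix
x = np.array([0.5, -1.0])
f = lambda v: v @ A @ v        # f(x) = x^T A x

h = 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
print(numeric)                 # central-difference gradient
print(2 * A @ x)               # analytic gradient 2AX; the two should match closely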

Stochastic Gradient Descent (SGD)

Pseudocode

network = TwoLayerNet(...)                   # a two-layer neural network (details elided)
optimizer = SGD()                            # plain SGD optimizer

for i in range(10000):
    ...
    x_batch, t_batch = get_mini_batch(...)   # sample a mini-batch of inputs and targets
    grads = network.gradient(x_batch, t_batch)   # gradients of the loss w.r.t. each parameter
    params = network.params                      # current parameter values
    optimizer.update(params, grads)              # one SGD step on the mini-batch
    ...
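
For reference, a minimal sketch of what the SGD class used above might look like, assuming params and grads are dicts of NumPy arrays keyed by parameter name (this interface is an assumption; the original only shows the calling code):

class SGD:
    # plain stochastic gradient descent: W <- W - lr * dW
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params:
            params[key] -= self.lr * grads[key]   # in-place update of each parameter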