《Deep Learning》




Chapter 2 Linear Algebra

Broadcasting: We allow the addition of a matrix and a vector, yielding another matrix: $\boldsymbol{C} = \boldsymbol{A} + \boldsymbol{b}$, where $C_{i,j} = A_{i,j} + b_j$. In other words, the vector $\boldsymbol{b}$ is added to each row of the matrix. This shorthand eliminates the need to define a matrix with $\boldsymbol{b}$ copied into each row before doing the addition. This implicit copying of $\boldsymbol{b}$ to many locations is called broadcasting.
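To make the indexing rule concrete, here is a minimal pure-Python sketch of what the broadcast sum computes. The function name `add_row_broadcast` and the sample values are illustrative assumptions, not something from the book; array libraries perform the same operation without the explicit loop.

```python
def add_row_broadcast(A, b):
    """Return C with C[i][j] = A[i][j] + b[j], i.e. b is added to every row of A."""
    if any(len(row) != len(b) for row in A):
        raise ValueError("each row of A must have the same length as b")
    return [[a_ij + b_j for a_ij, b_j in zip(row, b)] for row in A]


# Illustrative values (assumed for this sketch).
A = [[1, 2, 3],
     [4, 5, 6]]
b = [10, 20, 30]

C = add_row_broadcast(A, b)
print(C)  # [[11, 22, 33], [14, 25, 36]]
```

Note that `b` is never copied into a full matrix; each row of `A` is paired with the same vector.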

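Multiplying Matrices: We can think of the matrix product $\boldsymbol{C} = \boldsymbol{AB}$ as computing $C_{i,j}$ as the dot product between row $i$ of $\boldsymbol{A}$ and column $j$ of $\boldsymbol{B}$, that is, $C_{i,j} = \sum_k A_{i,k}B_{k,j}$. The product is defined only when $\boldsymbol{A}$ has as many columns as $\boldsymbol{B}$ has rows.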
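The row-times-column view can be sketched directly in pure Python. The helper names `dot` and `matmul` and the sample matrices are assumptions chosen for illustration; this is meant to show the definition, not an efficient implementation.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(u_k * v_k for u_k, v_k in zip(u, v))


def matmul(A, B):
    """Return C = AB, where C[i][j] is the dot product of row i of A and column j of B."""
    if any(len(row) != len(B) for row in A):
        raise ValueError("A must have as many columns as B has rows")
    columns_of_B = list(zip(*B))  # transpose B so that its columns are easy to iterate
    return [[dot(row, col) for col in columns_of_B] for row in A]


# Illustrative values (assumed for this sketch).
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

C = matmul(A, B)
print(C)  # [[19, 22], [43, 50]]
```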

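Range of A: Determining whether $\boldsymbol{Ax} = \boldsymbol{b}$ has a solution amounts to testing whether $\boldsymbol{b}$ is in the span of the columns of $\boldsymbol{A}$. This span is known as the column space, or range, of $\boldsymbol{A}$.

The reason is that the product $\boldsymbol{Ax}$ is a linear combination of the columns of $\boldsymbol{A}$, weighted by the entries of $\boldsymbol{x}$. Entry $i$ of the product is $(\boldsymbol{Ax})_i = \sum_j A_{i,j}x_j$, so $\boldsymbol{Ax} = \sum_i x_i\boldsymbol{A}_{:,i}$.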
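As a quick check of the column view of $\boldsymbol{Ax}$, the sketch below computes the product twice: once entry by entry (row dot products) and once as a weighted sum of columns. The function names and the sample values are assumptions made for this illustration.

```python
def matvec_by_rows(A, x):
    """Entry i of Ax is the dot product of row i of A with x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def matvec_by_columns(A, x):
    """Ax as a linear combination of the columns of A, weighted by the entries of x."""
    n_rows = len(A)
    result = [0] * n_rows
    for i, x_i in enumerate(x):
        column_i = [A[r][i] for r in range(n_rows)]
        result = [acc + x_i * c for acc, c in zip(result, column_i)]
    return result


# Illustrative values (assumed for this sketch).
A = [[1, 2],
     [3, 4],
     [5, 6]]
x = [10, -1]

print(matvec_by_rows(A, x))     # [8, 26, 44]
print(matvec_by_columns(A, x))  # [8, 26, 44]
```

Because every product has this form, the set of vectors reachable as $\boldsymbol{Ax}$ is exactly the span of the columns of $\boldsymbol{A}$.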

Chapter 3 Probability and Information Theory

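Multinoulli Distribution: Suppose you perform an experiment that can have $K$ outcomes, and you denote by $X_i$ a random variable that takes value 1 if you obtain the $i$-th outcome and 0 otherwise. Then the random vector $(X_1, \dots, X_K)$ follows a multinoulli (categorical) distribution, parametrized by the outcome probabilities $p_1, \dots, p_K$ with $\sum_i p_i = 1$.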
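The one-hot encoding of a multinoulli outcome can be sketched with the standard library. The function name `sample_multinoulli`, the probabilities, and the seed are assumptions chosen for this illustration.

```python
import random


def sample_multinoulli(probs, rng):
    """Draw one outcome and return it as a one-hot list [X_1, ..., X_K]."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Pick the index of the outcome, then encode it as indicator variables.
    k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1 if i == k else 0 for i in range(len(probs))]


rng = random.Random(0)    # fixed seed so the run is repeatable
probs = [0.2, 0.5, 0.3]   # assumed outcome probabilities, K = 3

samples = [sample_multinoulli(probs, rng) for _ in range(10_000)]
frequencies = [sum(s[i] for s in samples) / len(samples) for i in range(len(probs))]
print(frequencies)  # empirical frequencies; they should be close to probs
```

Exactly one $X_i$ equals 1 in every sample, and the long-run frequency of $X_i = 1$ approaches $p_i$.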


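Advantages of the normal distribution: 1. Many distributions we wish to model are truly close to normal, which is a consequence of the central limit theorem. 2. Out of all possible probability distributions over the real numbers with the same variance, the normal distribution encodes the maximum amount of uncertainty, so it can be regarded as the distribution that inserts the least amount of prior knowledge into a model. (For the proof, see Section 19.4.2.)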

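Prior and posterior probability: Consider the formula $P(x) = \sum_i P(c=i)P(x \mid c=i)$. Here $P(c)$ is the prior probability: it expresses the model's beliefs about $c$ before $x$ has been observed. The posterior probability is $P(c \mid x)$, so called because it is computed after $x$ has been observed. By comparison, $P(x \mid c=i)$ is the distribution of $x$ within component $i$. When $P(c)$ is a multinoulli distribution over component identities, this formula defines a mixture distribution.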
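A small discrete example may help separate the three quantities. The sketch below computes the marginal $P(x)$ with the mixture formula and then the posterior $P(c \mid x)$ with Bayes' rule. All numbers, and the names `marginal` and `posterior`, are assumptions made up for this illustration.

```python
def marginal(prior, likelihood, x):
    """P(x) = sum_i P(c=i) * P(x | c=i)."""
    return sum(prior[c] * likelihood[c][x] for c in range(len(prior)))


def posterior(prior, likelihood, x):
    """P(c | x) for every component c, via Bayes' rule."""
    p_x = marginal(prior, likelihood, x)
    return [prior[c] * likelihood[c][x] / p_x for c in range(len(prior))]


# Assumed toy model: two components, x takes values 0 or 1.
prior = [0.7, 0.3]            # P(c=0), P(c=1)
likelihood = [[0.9, 0.1],     # P(x=0 | c=0), P(x=1 | c=0)
              [0.2, 0.8]]     # P(x=0 | c=1), P(x=1 | c=1)

print(marginal(prior, likelihood, 1))   # 0.7*0.1 + 0.3*0.8, about 0.31
print(posterior(prior, likelihood, 1))  # about [0.226, 0.774]
```

Before observing $x$, component 0 is the more likely one (prior 0.7); after observing $x = 1$, the belief shifts toward component 1.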

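Self-information: the self-information of an event $x$ is defined as
$I(x) = -\log P(x)$
where $\log$ denotes the natural logarithm, so $I(x)$ is measured in nats. Self-information deals only with a single outcome. We can use the Shannon entropy to quantify the total amount of uncertainty in an entire probability distribution. The Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution:
$H(x) = \mathbb{E}_{x\sim P}[I(x)] = -\mathbb{E}_{x\sim P}[\log P(x)]$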
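For a discrete distribution the expectation is a finite sum, so both quantities are easy to compute. The sketch below uses the natural logarithm (nats); the function names and the example distributions are assumptions for illustration, and outcomes with zero probability are skipped, following the usual convention that $0 \log 0 = 0$.

```python
import math


def self_information(p):
    """I(x) = -log P(x), in nats, for an event of probability p > 0."""
    return -math.log(p)


def shannon_entropy(P):
    """H = sum_x P(x) * I(x); zero-probability outcomes contribute nothing."""
    return sum(p * self_information(p) for p in P if p > 0)


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
certain = [1.0, 0.0]

print(shannon_entropy(fair_coin))    # log 2, about 0.693 nats
print(shannon_entropy(biased_coin))  # smaller than for the fair coin
print(shannon_entropy(certain))      # zero: a certain outcome carries no information
```

Distributions close to deterministic have low entropy, and the uniform distribution has the highest entropy for a given number of outcomes.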
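KL divergence: If we have two separate probability distributions $P(x)$ and $Q(x)$ over the same random variable $x$, we can measure how different the two distributions are using the Kullback-Leibler (KL) divergence:
$D_{KL}(P||Q) = \mathbb{E}_{x\sim P}\Big[\log \frac{P(x)}{Q(x)}\Big] = \mathbb{E}_{x\sim P}[\log P(x) - \log Q(x)]$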

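Cross-entropy: a closely related quantity is the cross-entropy,
$H(P,Q) = H(P) + D_{KL}(P||Q) = -\mathbb{E}_{x\sim P}[\log Q(x)]$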
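The relation between entropy, KL divergence, and cross-entropy can be checked numerically for discrete distributions. This is a sketch under two assumptions: the distributions `P` and `Q` are made up for illustration, and `Q` is nonzero wherever `P` is nonzero (otherwise the KL divergence is infinite and the code would fail).

```python
import math


def entropy(P):
    """H(P) = -sum_x P(x) log P(x), in nats."""
    return -sum(p * math.log(p) for p in P if p > 0)


def kl_divergence(P, Q):
    """D_KL(P||Q) = sum_x P(x) log(P(x) / Q(x)); assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)


def cross_entropy(P, Q):
    """H(P, Q) = -sum_x P(x) log Q(x); same assumption on Q as above."""
    return -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)


# Assumed example distributions over three outcomes.
P = [0.5, 0.25, 0.25]
Q = [0.4, 0.4, 0.2]

print(kl_divergence(P, Q))                # non-negative
print(kl_divergence(Q, P))                # generally different: KL is not symmetric
print(cross_entropy(P, Q))                # equals entropy(P) + kl_divergence(P, Q)
print(entropy(P) + kl_divergence(P, Q))
```

Since $H(P)$ does not depend on $Q$, minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence.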

Chapter 4 Numerical Computation

Optimization: the task of either minimizing or maximizing some function $f(\boldsymbol{x})$ by altering $\boldsymbol{x}$.

Partial derivative: for functions with multiple inputs, the partial derivative $\frac{\partial}{\partial x_i} f(\boldsymbol{x})$ measures how $f$ changes as only the variable $x_i$ increases at point $\boldsymbol{x}$.

Gradient: the gradient of $f$ is the vector containing all the partial derivatives, denoted $\nabla_{\boldsymbol{x}} f(\boldsymbol{x})$.
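One way to see the gradient as a vector of partial derivatives is to approximate each partial derivative numerically, perturbing one coordinate at a time. The sketch below uses central differences; the function `numerical_gradient`, the step size `h`, and the test function `f` are assumptions chosen for illustration, not a method prescribed by the book.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x, one partial derivative per coordinate."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h    # move only coordinate i up
        x_minus[i] -= h   # move only coordinate i down
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def f(x):
    """Assumed example: f(x) = x_0^2 + 3 * x_0 * x_1."""
    return x[0] ** 2 + 3 * x[0] * x[1]


point = [1.0, 2.0]
g = numerical_gradient(f, point)
print(g)  # close to the analytic gradient [2*x_0 + 3*x_1, 3*x_0] = [8, 3]
```

Each entry of `g` answers the question in the definition of the partial derivative: how much $f$ changes when only $x_i$ changes.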
