Chapter 2 Linear Algebra

Broadcasting: We allow the addition of a matrix and a vector, yielding another matrix: $\boldsymbol{C} = \boldsymbol{A} + \boldsymbol{b}$ where $C_{i,j} = A_{i,j} + b_j$. In other words, the vector $\boldsymbol{b}$ is added to each row of the matrix. This shorthand eliminates the need to define a matrix with $\boldsymbol{b}$ copied into each row before doing the addition.
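
The rule $C_{i,j} = A_{i,j} + b_j$ can be spelled out directly. The sketch below uses plain Python lists, and `add_row_vector` is a hypothetical helper name chosen for illustration. Array libraries such as NumPy apply the same rule implicitly when you write `A + b` for shapes `(m, n)` and `(n,)`.

```python
def add_row_vector(A, b):
    """Return C with C[i][j] = A[i][j] + b[j], i.e. b is added to every row of A."""
    if any(len(row) != len(b) for row in A):
        raise ValueError("each row of A must have the same length as b")
    return [[a_ij + b_j for a_ij, b_j in zip(row, b)] for row in A]


A = [[1, 2, 3],
     [4, 5, 6]]
b = [10, 20, 30]

C = add_row_vector(A, b)
print(C)  # [[11, 22, 33], [14, 25, 36]]
```

No copy of `b` stacked into a full matrix is ever built, which is the point of the shorthand.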

Multiplying Matrices: We can think of the matrix product $\boldsymbol{C} = \boldsymbol{AB}$ as computing $C_{i,j}$ as the dot product between row $i$ of $\boldsymbol{A}$ and column $j$ of $\boldsymbol{B}$.
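
A minimal sketch of this row-times-column view, written with plain Python lists. The helper names `dot` and `matmul` are illustrative, and the code favors clarity over speed.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(u_k * v_k for u_k, v_k in zip(u, v))


def matmul(A, B):
    """Matrix product where C[i][j] = dot(row i of A, column j of B)."""
    if any(len(row) != len(B) for row in A):
        raise ValueError("number of columns of A must equal number of rows of B")
    columns_of_B = list(zip(*B))  # transpose B so each entry is one column
    return [[dot(row, col) for col in columns_of_B] for row in A]


A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

C = matmul(A, B)
print(C)  # [[19, 22], [43, 50]]
```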

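Range of $\boldsymbol{A}$: Determining whether $\boldsymbol{Ax} = \boldsymbol{b}$ has a solution amounts to testing whether $\boldsymbol{b}$ is in the span of the columns of $\boldsymbol{A}$. This particular span is known as the column space, or range, of $\boldsymbol{A}$.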

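The book gives an intuitive picture of solving a matrix equation, and in doing so it also makes the notion of span clear: solving the original equation is viewed as starting from the origin of some high-dimensional coordinate system and reaching the vector $\boldsymbol{b}$.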
$$
(\boldsymbol{Ax})_i = \sum_j A_{i,j}x_j \quad\Longrightarrow\quad \boldsymbol{Ax} = \sum_i x_i \boldsymbol{A}_{:,i}
$$
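Here the matrix-vector product is rewritten in an equivalent form and then reinterpreted: each term means travelling a distance $x_i$ in the direction given by column $i$. Note that once every prescribed step has been taken, the distances travelled along some directions may cancel out.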
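
The two readings of $\boldsymbol{Ax}$ (row-wise dot products versus a weighted sum of columns) can be checked against each other. This is a small sketch with plain Python lists. The function names are hypothetical and the example values are arbitrary.

```python
def matvec_by_rows(A, x):
    """Entry i of Ax is the dot product of row i of A with x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def matvec_by_columns(A, x):
    """Ax as a walk from the origin: move x[i] units along column i of A."""
    position = [0] * len(A)  # start at the origin
    for i, x_i in enumerate(x):
        column_i = [row[i] for row in A]
        position = [p + x_i * c for p, c in zip(position, column_i)]
    return position


A = [[1, -1],
     [0, 2],
     [3, 1]]
x = [2, 1]

print(matvec_by_rows(A, x))     # [1, 2, 7]
print(matvec_by_columns(A, x))  # [1, 2, 7]
```

In the first coordinate the two steps contribute $2$ and $-1$, so part of the distance travelled cancels.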

Chapter 3 Probability and Information Theory

Multinoulli Distribution: If you perform an experiment that can have $K$ outcomes and you denote by $X_i$ a random variable that takes value 1 if you obtain the $i$-th outcome and 0 otherwise, then the random vector $(X_1, \dots, X_K)$ follows a multinoulli (categorical) distribution.
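
As a rough illustration, one draw from such an experiment can be encoded as a one-hot vector. The sketch below uses only the standard library. The probabilities, the seed, and the helper name `sample_one_hot` are assumptions made for the example.

```python
import random


def sample_one_hot(probs, rng):
    """Draw one of len(probs) outcomes and return it as a one-hot list (X_1..X_K)."""
    k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1 if i == k else 0 for i in range(len(probs))]


rng = random.Random(0)    # fixed seed so the run is reproducible
probs = [0.2, 0.5, 0.3]   # assumed outcome probabilities, K = 3

x = sample_one_hot(probs, rng)
print(x)  # exactly one entry is 1

# Empirical frequencies over many draws should approach probs.
n_draws = 10_000
counts = [0] * len(probs)
for _ in range(n_draws):
    for i, x_i in enumerate(sample_one_hot(probs, rng)):
        counts[i] += x_i
frequencies = [c / n_draws for c in counts]
print(frequencies)
```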

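Central limit theorem (the chief theorem of probability theory): under suitable conditions, the mean of a large number of mutually independent random variables, once suitably standardized, converges in distribution to a normal distribution.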
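
A quick simulation can make this tangible. The sketch below assumes Uniform(0, 1) variables, 30 of them per mean, and 5000 repetitions. All of these are arbitrary choices for illustration. It standardizes each sample mean and checks how often it lands within one standard deviation of zero, which a standard normal does with probability of roughly 0.68.

```python
import math
import random


def standardized_mean(n, rng):
    """Standardized mean of n independent Uniform(0, 1) draws."""
    sample_mean = sum(rng.random() for _ in range(n)) / n
    # Uniform(0, 1) has mean 1/2 and variance 1/12, so the mean of n draws
    # has mean 1/2 and variance 1 / (12 n).
    return (sample_mean - 0.5) / math.sqrt(1.0 / (12.0 * n))


rng = random.Random(0)  # fixed seed for reproducibility
z_values = [standardized_mean(30, rng) for _ in range(5000)]

fraction_within_one_sd = sum(abs(z) <= 1.0 for z in z_values) / len(z_values)
print(fraction_within_one_sd)  # should be close to 0.68
```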

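Advantages of the normal distribution: 1. Many real-world distributions are fairly close to normal, which follows from the central limit theorem. 2. Among all probability distributions with the same variance, the normal distribution has the maximum uncertainty over the real numbers, so it can be regarded as the distribution that adds the least prior knowledge to the model. (See Section 19.4.2 for the proof.)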

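Prior and posterior probabilities: given $P(x) = \sum_i P(c=i)P(x \mid c=i)$, the factor $P(c=i)$ is the prior probability: it expresses the model's belief about $c$ before $x$ is observed. By contrast, $P(c \mid x)$ is the posterior probability, because it is computed after observing $x$. If $P(c)$ is a multinoulli distribution, this is a mixture distribution model.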
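
A small numeric sketch of the mixture formula, together with the posterior $P(c \mid x)$ obtained from Bayes' rule. The prior and likelihood values are made up for illustration, and the function names are hypothetical.

```python
def marginal(prior, likelihood):
    """P(x) = sum_i P(c=i) * P(x | c=i) for one fixed observation x."""
    return sum(p_c * l_c for p_c, l_c in zip(prior, likelihood))


def posterior(prior, likelihood):
    """P(c=i | x) = P(c=i) * P(x | c=i) / P(x), by Bayes' rule."""
    p_x = marginal(prior, likelihood)
    return [p_c * l_c / p_x for p_c, l_c in zip(prior, likelihood)]


prior = [0.7, 0.3]       # assumed P(c=i): belief about c before seeing x
likelihood = [0.1, 0.6]  # assumed P(x | c=i) for the observed x

p_x = marginal(prior, likelihood)
post = posterior(prior, likelihood)
print(p_x)   # approximately 0.25
print(post)  # approximately [0.28, 0.72]
```

Observing `x` shifts the belief toward the second component even though its prior was smaller, because that component explains `x` much better.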

A basic idea of information theory: less likely events carry more information. This motivates the definition of self-information:
$$
I(x) = -\log P(x)
$$
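where $\log$ denotes the natural logarithm, so $I(x)$ is measured in nats. Self-information deals only with a single outcome. To quantify the total amount of uncertainty in an entire probability distribution we can use the Shannon entropy: the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution: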
$$
H(x) = \mathbb{E}_{x\sim P}[I(x)] = -\mathbb{E}_{x\sim P}[\log P(x)]
$$
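
For a discrete distribution both quantities are short to compute. The sketch below assumes the distribution is given as a list of probabilities that sum to 1 and uses natural logarithms, so results are in nats. The function names are illustrative.

```python
import math


def self_information(p):
    """I(x) = -log P(x), in nats, for an event of probability p > 0."""
    return -math.log(p)


def shannon_entropy(dist):
    """H = E[I(x)] = -sum_x P(x) log P(x); zero-probability terms contribute 0."""
    return sum(p * self_information(p) for p in dist if p > 0)


fair = [0.5, 0.5]
biased = [0.9, 0.1]

print(self_information(0.9))    # small: a likely event is not very informative
print(self_information(0.1))    # larger: an unlikely event is more informative
print(shannon_entropy(fair))    # equals log 2, the maximum for two outcomes
print(shannon_entropy(biased))  # lower: the outcome is more predictable
```
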
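KL divergence: if we have two separate probability distributions $P(x)$ and $Q(x)$ over the same random variable $x$, we can use the KL divergence to measure how different the two distributions are: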
$$
D_{KL}(P||Q) = \mathbb{E}_{x\sim P} \Big[\log \frac{P(x)}{Q(x)} \Big] = \mathbb{E}_{x\sim P}[\log P(x) - \log Q(x)]
$$
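Note that the KL divergence is not symmetric; see the figure at the bottom of p. 48 of the book. This asymmetry means that the choice between $D_{KL}(P||Q)$ and $D_{KL}(Q||P)$ matters a great deal.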
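
The asymmetry is easy to see numerically. The sketch below handles discrete distributions given as probability lists and assumes $Q(x) > 0$ wherever $P(x) > 0$. The two example distributions are arbitrary.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete P and Q.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


P = [0.5, 0.5]
Q = [0.9, 0.1]

forward = kl_divergence(P, Q)  # expectation taken under P
reverse = kl_divergence(Q, P)  # expectation taken under Q

print(forward)  # roughly 0.51
print(reverse)  # roughly 0.37
print(kl_divergence(P, P))  # 0.0: a distribution does not diverge from itself
```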

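Cross-entropy: combining the Shannon entropy with the KL divergence gives the cross-entropy. Note that the $H(P)$ term does not involve $Q$, so minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence: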
$$
H(P,Q)= H(P) + D_{KL}(P||Q)
$$
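
The identity can be verified numerically for discrete distributions. This is a sketch under the assumption that the cross-entropy is computed as $-\mathbb{E}_{x\sim P}[\log Q(x)]$. The candidate distributions are arbitrary, and the function names are illustrative.

```python
import math


def entropy(p):
    """H(P) = -sum_x P(x) log P(x)."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); assumes q > 0 where p > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x); assumes q > 0 where p > 0."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)


P = [0.5, 0.3, 0.2]
candidates = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]

for Q in candidates:
    gap = cross_entropy(P, Q) - (entropy(P) + kl_divergence(P, Q))
    print(Q, round(cross_entropy(P, Q), 4), abs(gap) < 1e-12)

# H(P) is the same for every Q, so the Q with the lowest cross-entropy
# is also the Q with the lowest KL divergence.
best_Q = min(candidates, key=lambda Q: cross_entropy(P, Q))
print(best_Q)  # [0.5, 0.3, 0.2]
```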

Chapter 4 Numerical Computation

Optimization: the task of either maximizing or minimizing some function $f(\boldsymbol{x})$ by altering $\boldsymbol{x}$.

Partial derivative: for functions with multiple inputs, the partial derivative measures how $f$ changes as only the variable $x_i$ increases at point $\boldsymbol x$.

Gradient: the gradient of $f$ is the vector containing all the partial derivatives, denoted $\nabla_{\boldsymbol{x}} f(\boldsymbol{x})$.
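
One way to make these definitions concrete is to approximate each partial derivative with a central finite difference, perturbing one coordinate at a time. This is only a numerical sketch. The step size `h`, the test function `f`, and the helper name `numerical_gradient` are assumptions for the example.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with central differences.

    Entry i estimates the partial derivative: how f changes when only
    x[i] is varied and all other coordinates are held fixed.
    """
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def f(x):
    """Example function f(x) = x_0^2 + 3 x_0 x_1."""
    return x[0] ** 2 + 3 * x[0] * x[1]


point = [1.0, 2.0]
grad = numerical_gradient(f, point)
print(grad)  # close to the analytic gradient [2*x_0 + 3*x_1, 3*x_0] = [8, 3]
```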