12.4 update: fully understand the differences between supervised learning & unsupervised learning here

# 第一周

1. regression: predict continuous valued output

2. supervised learning: right answer given

3. benign tumor | malignant tumor

market segmentation

4. 监督学习及其分类

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into “regression” and “classification” problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

5. 函数h 是一个约定俗成的函数

表示的内容为hypotheses，也就是经过算法学得的函数

$h_{\theta}(x) = \theta_{0} + \theta _{1}x$ 表示参数变量为$\theta$，自变量为$x$的函数

一般会简写为$h(x) = \theta_{0} + \theta _{1}x$

6. cost function $J(\theta_{0}, \theta_{1}) = \displaystyle \frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^2$

表示代价函数的参数为theta

最小化代价函数的符号为 $\underset{\theta_{0}, \theta_{1}}{\text{minimize}}J(\theta_{0}, \theta_{1})$

7. 梯度下降法在进行参数的更新时，该层的参数必须使用上一层的参数进行更新，而不可以使用本层的参数

即在进行某个参数的更新时，必须保证使用的参数都是上一层的

使用temp0 暂存 theta0 的数据

8. 关于梯度下降的公式 $\theta_{i} := \theta_{i} - \alpha \displaystyle \frac{\partial}{\partial \theta_{i}}J(\theta_{i})$

为什么后面的符号为减号而不是加号

总的来说，梯度下降法是寻找最小值的一种收敛法

从该图来说，如果选择50作为起始点，那么因为是为了寻找较小值，所以需要向左边移动，因此要减去该点对应的正值导数的适当比例

实际上，梯度法求极值对应的自变量坐标，其中的导数不过是提供了一个方向而已

前面还有fancy的参数叫做学习率，其实也就是一次的步长

按理说学习率应该越小越好，因为梯度法当出现因学习率过大导致跨越了最优值时，该参数就不合适了。同时考虑到想要迭代次数尽可能小，所以学习率又需要适当大。可以说这里就是需要所谓的调参。

但是需要注意的是，如果学习率过小，并且代价函数最优解附近的导数都比较小，那么将会出现每一次迭代走过的步越来越小，导致还没有到达最优解就已经停止收敛。(这个要参照退出循环的条件)

所以这个参数的选择还是有一定的考究的。

如果目前已经到达了最优解，那么因为该点的导数为0，所以该值不会进行更新

9. 线性回归任务的代价函数一定是凸函数!!!

# 第三周

1. 为什么最大似然估计必定是凸函数

2. 关于为什么在代价函数中总是约定$h_{\theta}(x)= \displaystyle \frac{1}{1 + e^{-\theta ^{T}x}}$，其中后面自然指数的幂次总是这样的形式？

此时认为$\theta$和$x$都是列向量

可是在实际应用中，总是将一个样本中的数据放到一行中进行展示，最后一列表示target。所以此处存在一个别扭的转换。

3. 扩展了一些收敛迭代方法; 建议直接调用库函数，而不要自己去实现

• BFGS
• L-BFGS
4. 直接使用库函数会使得难以调试，但是带来的效率是值得的

5. 针对多分类任务，我们设置多个不同的“假设”$h_{\theta}^{(i)}(x) = P(y = i | x; \theta)$

表示目前在分类第i个类别，使用$\theta$作为参数，x作为自变量

6. 假设在数据中我们有k个需要抽取出来的特征，那么我们就需要设置k个分类器

这是因为每次筛选数据的时候，并不是每进行一步就将筛选出来的数据剔除

而是针对每一种特征训练出来一种分类器，共需要k个分类器

7. 逻辑回归 - 我的理解

关于为什么使用该函数$\displaystyle \frac{1}{1 + e^{-\theta ^T x}}$ ？ 是因为该函数可导连续，并且可以很好地从原点将坐标系分为两部分

insight：归根结底，是对$-\theta^T x$的拟合 => 如果大于0，那么 预测为正例，否则为反例。此时传入的参数可以为非线性的，即$x^2, x^3, x^4 …$此类，使其可以完成非线性任务。

8. 处理过拟合的两种办法

• 减少参数数量 - 可以人为选择保留的参数；可以使用模型选择算法，判断哪些参数保留，哪些扔掉

该方法的缺点是也许所有的参数都有用，但是仍然会把它扔掉；

归根到底还是算法或者选择不当，如果把这个因素修改完成，那么一样可以达成过拟合的目的

• 正则化 - 保留所有参数，但是减少参数的影响；适用于有很多个参数同时每个参数都有作用

9. 正则化实际上就是修改代价函数，来使得某些参数应该尽可能偏小

具体的方法是在代价函数后加上一些类似$1000\theta_{i}^2$这样的项，如此一来就可以达到减小该项影响的作用

正则化中的$\lambda$是为了均衡两个目标 - 与训练集更匹配bias & 函数图形更简单variance

10. 在正规方程中使用正规化，甚至可以将矩阵转化为绝对可逆

11. 关于运算的向量化

$$\theta_j = \displaystyle \sum_{i = 1} ^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}$$ <=> grad = X' * (sigmoid(X * theta) - y) / m

上式在起初学习的时候我不明白应该如何进行计算，于是使用了两个for循环来处理

但是如果仔细考察的话，可以发现使用矩阵运算即可完成

之前我习惯把求和以及前面的括号看成一团，后面的看成一部分，但是实际上要把求和号后面的看成一个整体

我决定对这个东西做一个总结，这其中必定存在某种普遍使用的结论

• 前面的求和号，表示括号中的元素和后面的x必定分别属于矩阵的某行某列
• 注意到i的变化是从小到大，所以相对于矩阵运算，必定是从左到右、从上到下
• 如果将括号中的元素视为一个列向量的话，那么x必定是从左到右；注意到其j下标是不变的，所以x矩阵中的一行，表示的是每个样本数据的第j个数据

搞定！

12. octave中的size函数，第二个参数用来标记返回的是行数还是列数

13. 在使用正则化时，注意代价函数不计算常数项的参数$\theta_0$

同时在更新theta(2:n)的时候，仍然可以使用向量化运算，注意将n维中的第一维删除掉

# 第四周

1. 神经网络产生的背景

当使用多项式假设时，面对多个特征，如果要暴力组合所有的配对方案，那么将会产生数目巨大的feature；而如果使用较小的feature，那么就会产生欠拟合的问题

2. 每一个输入以及输出都从1开始编号，每一层都特殊设置一个偏移量，该偏移量从0开始编号

也就是对应的阈值

input layer + hidden layer + output layer

3. 机器学习可以看做是数理统计的一个应用，在数理统计中一个常见的任务就是拟合，也就是给定一些样本点，用合适的曲线揭示这些样本点随着自变量的变化关系。

深度学习同样也是为了这个目的，只不过此时，样本点不再限定为(x, y)点对，而可以是由向量、矩阵等等组成的广义点对(X,Y)。而此时，(X,Y)之间的关系也变得十分复杂，不太可能用一个简单函数表示。然而，人们发现可以用多层神经网络来表示这样的关系，而多层神经网络的本质就是一个多层复合的函数。

作者：Anonymous

4. for the parameter $\theta$ , it is a matrix with three dimensions

suppose that now we are at i-th layer, then $\theta^{(i)}$ have a dimension of $size(s_{i + 1}) \times (size(s_{i}) + 1)$, and $x^{(i)}$ has the dimension of $(size(s_i) + 1) \times 1$

because in the future we are going to calculate $z^{(i+1)} = \theta^{(i)} * x^{(i)}$ where $x^{(i)} = a^{(i)}$,

after we get the value of $z$ and put it into the sigmoid function, we get $a^{(i)} = g(z^{(i)})$

5. in the every layer, we can regard the parameter as one function which is learned by the nodes before it

so our final result is made of many functions to evaluate the aim function.

6. network architectures refer to how the different neurons are connected to each other.

7. notes about octave

• .* will return a vector rather than a scalar value

• [a, b] = max(A, [], 2) will return the information about this matrix

the first return parameter is maximum value in every line

the second return parameter is index of every maximum value in every line

in the max function, the first parameter is the matrix we concerned, the second is the default value which shouldn’t be changed, the third parameter represents to search according to column-1, or row-2

# 第五周

1. The cost function of neural networks:

$J(\Theta) = \displaystyle - \frac{1}{m}\sum_{i = 1}^{m} \sum_{k = 1}^K[y_k^{(i)}log((h_{\Theta}(x^{(i)}))_k) + (1 - y_k^{(i)}) log(1 - (h_{\Theta}(x^{(i)}))k)] + \frac{\lambda}{2m}\sum{i = 1}^{L - 1}\sum_{i = 1}^{s_l}\sum_{j = 1}^{s_l + 1}(\Theta_{j, i}^{(l)})^2$

We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.

In the regularization part, after the square brackets, we must account for multuple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer(including the bias unit). The number of rows in our current theta matrix is euqal to the number of nodes in the next layer(excluding the bias unit).

Note:

• the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
• the triple sum simply adds up the squares of all the individual $\Theta$s in the entire network
• the i in the triple sum does not reger to training example, but the part before it does!
• the first part of this funciton is all $K$ logistic regression added up separately.
2. 如何直观地解释 backpropagation 算法？ - Anonymous的回答 - 知乎

观后感：BP神经网络就是链式法则

该文对于理解BP神经网络有极大的帮助

看到评论里有人表示这个其实是动态规划，真是naive

因为error的限制，只能从输出层向输入层传递，另外需要每次计算出同层的所有结点再进行传递，这大概就是BP神经网络的框架

3. BP神经网络的公式推导

https://www.cs.swarthmore.edu/~meeden/cs81/s10/BackPropDeriv.pdf

4. new instructions of octave

if we have many matrices of different scales, and want to unroll them into one datastructure, then we can do it by thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]

and get the matrices back by

Theta1 = reshape(thetaVector(1:110),10,11);

Theta2 = reshape(thetaVector(111:220),10,11);

Theta3 = reshape(thetaVector(221:231),1,11);

5. gradient checking, which is to check the whether the rough derivative function could get a similiar value in regard to the value we get from the derivative function of BP algorithm

6. 神经网络的初始化参数问题 - 为什么不可以像回归问题一样全部置0？

考虑给定的神经网络，其中结点1, 2, 3在输入层，结点4, 5在隐含层，结点6位于输出层

假定：输入各不相同；初始化全部参数为0。

• 第一次正向传播过程中，后面结点上激活值均为0.5

• 第一次反向传播过程中，结点4, 5的输入由于相同，所以它们学得的权重相同；同时由于输入层到隐含层的参数在计算时，它们的输入也都相同，所以学得的参数也都相同

• 第二次正向传播过程中，结点4, 5的输入相同，所以激活值相同，进而得到输入
• 第二次反向传播过程中，结点4, 5学得的权重仍然相同；同时结点1学得的两个权重相同、结点2学得的两个权重相同、结点3学得的两个权重相同(这是因为权重的学习与三个量相关，一个是该结点的激活值，一个是上一层的误差，一个是两层之间的权重)
7. 作业总结

• 因为octave类似python没有先声明后使用的规则，所以会出现一种很神奇的BUG - 变量名拼错

这种BUG在这种语言里面实在是危险，出现的时候不会有任何的报错信息

• 梯度降矩阵的规模与$\Theta$矩阵的规模一致，将来更新的时候也就是更新这些参数

# 《神经网络与深度学习》

## 人工神经元模型

### 感知机

$$output = \displaystyle \begin{cases} 0 & \text{if } \sum_jw_jx_j \leq \text{threshold} \ 1 & \text{if} \sum_j w_j x_j > \text{threshold }\end{cases}$$

### S型神经元

$\sigma$的平滑意味着权重和偏置的微小变化，即$\Delta w_j$和$\Delta b$，会从神经元产生一个微小的输出变化$\Delta \text{output}$。实际上，微积分告诉我们

$$\displaystyle \Delta \text{output} \approx \sum_j \frac{\partial\text{output}}{\partial w_j} \Delta w_j + \frac{\partial\text{output}}{\partial b} \Delta b$$

# 第六周

1. We can promote the performance of out machine learning algorithms by

• Getting more training examples
• Trying smaller sets of features
• Trying polynomial features
• Increasing or decreasing λ

However, it is important to decide which measure to take to propmote our algorithm because any of them is time-costing.

2. A hypothesis may have a low error for the training examples but still be inaccurate because of overfitting. Thus, to evaluate a hypothesis, given a dataset of training examples, we can split up teh data into two sets: a training set and a test set.

3. Given many models with different polynomial degrees, we can use a systematic approach to identify the ‘best’ function. In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.

One way to break down our dataset into the three sets is:

• Training set: 60%
• Cross validation set: 20%
• Test set: 20%

We can now calculate three separate error values for the three different sets using the following method:

1. Optimize the parameters in Θ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross validation set.
3. Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$, (d = theta from polynomial with lower error);

This way, the degree of the polynomial d has not been trained using the test set.

为了使得模型的generalization能力更强，将数据分为两部分，一部分用于训练(训练集和验证集)，另一部分用于测试。

训练集的数据用于训练模型，验证集的数据用于选择模型(如神经网络的结构、回归模型中的幂次等)，测试集的数据用于测试模型的performance。

一般来说validation上的cost会比test上的cost要小，这是因为前者相对于后者多了一些表示模型的参数。

4. The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

5. The training error will tend to decrease as we increase the degree d of the polynomial.

At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.

High bias (underfitting): both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high. Also, $J_{CV}(\Theta) \approx J_{train}(\Theta)$.

High variance (overfitting): $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be much greater than $J_{train}(\Theta)$.

6. How do we choose our parameter \lambdaλ to get it ‘just right’ ? In order to choose the model and the regularization term λ, we need to:

1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
2. Create a set of models with different degrees or any other variants.
3. Iterate through the λs and for each λ go through all the models to learn some Θ.
4. Compute the cross validation error using the learned Θ (computed with λ) on the $J_{CV}(\Theta)$ without regularization or λ = 0. That’s because we have got the model parameters with the effect of regularization.
5. Select the best combo that produces the lowest error on the cross validation set.
6. Using the best combo Θ and λ, apply it on $J_{test}(\Theta)$ to see if it has a good generalization of the problem.
7. learning curves

Experiencing high bias:

Low training set size: causes $J_{train}(\Theta)$ to be low and $J_{CV}(\Theta)$ to be high.

Large training set size: causes both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ to be high with $J_{train}(\Theta)$≈$J_{CV}(\Theta)$.

If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.

Experiencing high variance:

Low training set size: $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be high.

Large training set size: $J_{train}(\Theta)$ increases with training set size and $J_{CV}(\Theta)$ continues to decrease without leveling off. Also, $J_{train}(\Theta)$ < $J_{CV}(\Theta)$ but the difference between them remains significant.

If a learning algorithm is suffering from high variance, getting more training data is likely to help.

当出现欠拟合时，增大数据量是没有用处的；而在过拟合时，增大数据量可以使得训练集数据与真实数据更加接近，使得模型的精度提高

8. Our decision process can be broken down as follows:

• Getting more training examples: Fixes high variance

• Trying smaller sets of features: Fixes high variance

• Adding features: Fixes high bias

• Adding polynomial features: Fixes high bias

• Decreasing λ: Fixes high bias

• Increasing λ: Fixes high variance.

9. The recommended approach to solving machine learning problems is to:

• Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
• Plot learning curves to decide if more data, more features, etc. are likely to help.
• Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

# 第七周

1. The form of SVM model’s cost function is similiar to the cost function of logistic regression
$$\displaystyle minC\sum_{i=1}^{m}[y^{(i)}cost_1(\theta^Tx^{(i)}) + (1 - y^{(i)})cost_0(\theta^Tx^{(i)})] + \frac{1}{2}\sum_{i = 1} ^n \theta^2_j$$
and the two parts of the $cost$ are some function which have the following properties
$$\begin{cases}\theta^Tx \ge 1 & y = 1 \\ \theta^Tx\le -1 & y = 0\end{cases}$$

2. support vector machine is also called large margin machine

3. the value of $C$ denotes the value we view to the loss of training data

if $C$ is very large, then even if there are some bad data exist in the training data, the algorithm will still try to set the margin according to the training data. => high variance

if $C$ is reasonably small, then the algorithm could ignore the bad data and try to get a subjective regression => high bias

4. SVM with kernels
$$\displaystyle minC\sum_{i=1}^{m}[y^{(i)}cost_1(\theta^Tf^{(i)}) + (1 - y^{(i)})cost_0(\theta^Tf^{(i)})] + \frac{1}{2}\sum_{i = 1} ^n \theta^2_j$$

5. $\sigma^2$ ‘s influence on model

if it is large, then it will have very blury decision which results in high bias

if it is small, it will have somewhat very critical criticism which results in high variance

6. if $n$ is much large than $m$, then it is better to use logistic regress or SVM without a kernel

if $n$ is small and $m$ is intermediate(about ten times at maximum), then it is better to use SVM with Gaussian kernel

if $n$ is small and $m$ is large, add more features and then use logistic regression or SVM without kernel

# 第八周

## K-means algoirithm

1. unsupervised learning algorithm is some model given unlabeled data and get to find the structure of data. There are many unsupervised learning algorithm and one of them is clustering

2. A famous clustering algorithm - K-means algorithm

3. the optimization objective of K-means algorithm $\displaystyle \frac{1}{m} \sum_{i = 1}^m ||x^{(i)} - \mu_{c^{(i)}}||^2$, where $x^{(i)}$ is the position of i-th data and $\mu_{c^{(i)}}$ is the cluster centroid which the i-th data is assigned. In this case, the cost function will never increase during the process of algorithm.

why do we care about that?

• it will help to debug
• we can use this to make K-means better
4. how to randomly initialize K cluster centroids to avoid the local optima

• randomly pick K training examples
• set the cluster centroids to these K examples
5. what is the right value of K?

## Data demensionality reduction

1. data compression

if some features corrlate to each other, then maybe we can represent then by some curve

After that, we will replace them with this only one feature which will reduce the demensionality

2. data visualization

reduce the demensionality to 2D or 3D to print out the graph of data to look for some features

3. PCA(principal component analysis) is trying to find a lower dimensional surface onto which to project the data from $n$-dimensions to $k$-dimensions

• data preprocessing (feature scaling / mean normalization)
• compute “covariance matrix” $\sum = \displaystyle\frac{1}{m}\sum_{i = 1}^n(x^{(i)})(x^{(i)})^T$
• compute “eigenvectors” of matrix $\sum$ by “Singular Value Decomposition”
• get the compressed data by parameters from last step
4. recounstuction from compressed representation
$$z = U_{\text{reduce}}^T x <=> x \approx U_{\text{reduce}}z$$

5. how to choose the right number of $K$

recall that the cost function of PCA is $\displaystyle \frac{1}{m}\sum_{i = 1}^m ||x^{(i)} - x_{\text{approx}}^{(i)}||^2$, and total variation in data is $\displaystyle \frac{1}{m}\sum_{i = 1}^m ||x^{(i)}||^2$. typically, choose $k$ to be smallest value so that $\frac{\displaystyle \frac{1}{m}\sum_{i = 1}^m ||x^{(i)} - x_{\text{approx}}^{(i)}||^2}{\displaystyle \frac{1}{m}\sum_{i = 1}^m ||x^{(i)} ||^2} \le 1\%$.

Guess what, it can be calculated by the parameters “S” returned by SVD function, which is a diagonal matrix. And the formula is $\displaystyle 1 - \frac{\sum^k_i s_{ii}}{\sum_i^{n}s_{ii}} \le 1\%$

6. when using PCA on dataset, we should get parameters from SVD only from training set and apply it to cross validation set and test set.

7. bad use of PCA: to prevent overfitting because there would be less features after applying PCA. This might work OK, but isn’t a good way to address overfitting. Use regularization instead.

The reason of that is PCA would throw some information which may be valuable.

8. PCA is sometimes used where it shouldn’t be. Think about doing the whole thing without using PCA?

Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that doesn’t do what you want, then implement PCA and consider using $z^{(i)}$

1. $||xx||^2$ 表示的是二范数，也就是一个平方根
2. 在对矩阵操作的时候，

# 第九周

## anomaly detect algorithm

1. assume that every features of data follow Gaussian distribution, and we are going to get the parameter of all these Gaussian distribution

• choose features $x_i$ that you think might be indicative of anomalous examples.
• fit parameters $\mu_1, \mu_2, …, \mu_n$ , $\sigma_1^2, …, \sigma_n^2$
• given new example $x$, compute $p(x) = \prod_{j = 1} ^np(x_j; \mu_j;\sigma_j^2)$, anomaly if $p(x) < \epsilon$
2. suppose our model gives high accuracy when evaluating on the cross validation set or test set, maybe it still doesn’t have a good performance because of skewed classes

3. why not supervised learning algorithm?

| Anomaly detection | Supervised learning |
| :———————————————————– | :———————————————————– |
| very small number of positve examples;
large number of negative examples | large number of positive and negative example |
| Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like;
future anomalies may look nothing like any of the anomalous examples we’ve seen so far. | Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples likely to be similar to ones in training set. |
| fraud detection;
Manufacturing(e.g. aircraft engines)
monitoring machines in a data center | email spam classification
weather prediction
cancer classification |

4. choose what features to use

if a certain feature doesn’t look like gaussian distribution, then maybe it is better to get a log, or get some power root

## content-based recommender system

1. give some features to the content of the movie and score them

for each user, learn a parameter. Predict user as rating movie with the parameter

2. the cost function of this algorithm is just like linear regression

3. given $x^{(1)},x^{(2)}, x^{(3)}…$, can estimate $\theta^{(1)},\theta^{(2)}, \theta^{(3)}…$

## Collaborative Filter

1. the algorithm could learn what features to use
2. given the parameter from user as same as the “content-based recommender system”, try to score the features of moive. The cost function has the same form of the linear regression
3. given $\theta^{(1)},\theta^{(2)}, \theta^{(3)}…$, can estimate $x^{(1)},x^{(2)}, x^{(3)}…$

the traditional way to compund these two algorithm is to first guess some parameter of the user favor, then use “collaborative filter” to get the features of movie, then “content-based recommender system” to get a more accurate parameter of user, and repeat on

## Collaborative Filtering algorithm

1. if we mix these two algorithm together by put two cost function into one, then we can get a new algorithm wihich could simultaneously update the parameters of user and the features of movie
2. the process of “collaborative filtering algorithm”

• initialize parameter of user and features of movie
• get the prediction of user for some movie by the parameter learned before
3. it is also called “low rank matrix factorization”

4. how to find movies that user might favor? Try to find some data that has little difference compared to the movie user like.

5. what if a new user join in and hasn’t seen any movies? “Mean Normalization” could fix that.

# 第十周

1. the normal algorithm
2. plot $J_{\text{train}}(\theta)$ as a function of the number of iterations of gradient descent to see if it is performing well, then we can tune the learning rate or so

1. In this case, the cost funciton shouldn’t include the $m$ parameters because the batch we use doesn’t have the scale of $m$. So now the cost function is $cost(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2$.

But as for the loss of training set, it still should use the $m$ term.

2. in every iteration, just use one example to update the $\theta$ parameters

3. stochastic gradient descent doesn’t just converge to teh global minimum, it will oscillate a bit around the global minimum.

4. as for this algorithm, we need to plot something different to get a tuition about how it is performing.

• during leartning, compute $cost(\theta, (x^{(i)}, y^{(i)}))$ before updating $\theta$ using $(x^{(i)}, y^{(i)})$.
• every 1000 iterations(say), plot $cost(\theta, (x^{(i)}, y^{(i)}))$ averaged over the last 1000 examples processed by algorithm.
5. sometimes it is better to set this number of iterations small, because we can get more detail about the performing of algorithm; However, sometimes it is better to use a large number because maybe in detail there is just noise everywhere, and the large iteraitons could get us a global plot in this case.

6. it is better to set the learning rate to decrease by time because eventually the algorithm will oscillate around the optima value. As Andrew suggest, maybe it is good to use something like $\alpha= \displaystyle \frac{const1}{iterations + const2}$

1. every batch is a subset of training data set with the size we set, and we are going to update $\theta$ parameters based on these mini-batches

2. why look for a min batch rather than a single case of training examples?

In this case, we can get a better and efficient gradient descent algorithm, because we can parallelize the calculation by vectorization.

## Online Learning set

update the parameter of website or a specific user based on the data generated by them

## Map-reduce and Dara Parallelism

many machines doing the same task separately, and the work these machines doing all together makes up the origin sing task which maybe impossible to complete in one single machine.

# 第十一周

## Photo OCR pupeline

1. the basic process of OCR
• text detection
• character segmentation
• character classification
2. sliding windows algirithm for looking for pedestrians
• slide a littel window with a fixed size to go over the all image
• make the window bigger and slide it all over the image, but then transform this bigger window into the fixed size just like before

## Artificial Data Synthesis

1. as for the example of character recognition, we can use font libraries form the internet

and we can introduce distortion to the origin image

2. as for speech recognition,

• introduce a distortion such as cellphone connection,
• noisy background: crowd, vehicle, machinery
3. discussion on getting more data

• make sure you have a low bias classifier before expending the effort.
• “How much work would it be to get 10 times as much data as we currently have?”
• artificial data synthesis
• collect/label it ourselves
• “Crowd source”

## Ceiliing Analysis

what part of the pipeline to work on next?

assume we are part of the component of algorithm, then go from the source of data flow and get the part of job perfect, look for the final performance of algorithm. If it exert a great influence on final result, then it is worth to get more work on this part of pipeline. Then we go on to the next part while keep the previous perfect part, keep trying to see which part is more worth the studying.

0%