Neural Networks and Deep Learning

These are my notes from the Neural Networks and Deep Learning course on Coursera.


Why is deep learning taking off?

Because deep learning keeps improving as you feed it more and more data, while logistic regression, SVMs, and other traditional algorithms plateau on large datasets.

Either your intuitions are good or they’re not.

  • If your intuitions are good, you should follow them and you’ll eventually be successful. If your intuitions are not good, it doesn’t matter what you do. There is no point not trusting them.
  • never stop programming
  • read enough so you start developing intuitions, and then trust your intuitions and go for it. Don’t be too worried if everybody else says it’s nonsense.

Geoffrey Hinton


  1. some notation changes compared to Andrew’s previous videos

    • the matrix of training examples $X$ has size $n\times m$ (features by examples), and the label matrix has size $1 \times m$
    • there is no extra row of ones in the $X$ matrix for the bias value; the bias $b$ is kept as a separate parameter
  2. computation graph for how to compute the derivatives of some function

  3. vectorized computation is much faster than unvectorized version in Python

    One way to measure the running time of a program is the time.time() method in the time package, which returns the current time in seconds. Record tic just before the computation and toc just after it; then 1000*(toc-tic) is the elapsed time in milliseconds.

    The reason is that vectorized computation takes advantage of parallelization (e.g. SIMD instructions).
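The timing comparison described above can be sketched like this (the array size is my own choice):

```python
import time
import numpy as np

n = 1_000_000
a = np.random.randn(n)
b = np.random.randn(n)

# Vectorized dot product
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print(f"Vectorized: {1000 * (toc - tic):.2f} ms")

# Explicit for-loop version of the same dot product
tic = time.time()
c_loop = 0.0
for i in range(n):
    c_loop += a[i] * b[i]
toc = time.time()
print(f"For loop:   {1000 * (toc - tic):.2f} ms")
```

On a typical machine the vectorized version is orders of magnitude faster.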

  4. when we add a real number to a vector or matrix, the number is expanded into a matrix filled with that value; this is called “broadcasting”
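A minimal illustration of scalar broadcasting (the values here are my own):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])

# The scalar 100 is broadcast to a 2x2 matrix of 100s before the addition
B = A + 100
print(B)  # [[101. 102.] [103. 104.]]
```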

  5. the deduction of the derivatives for logistic regression
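A compact version of that derivation for a single example, in the course's notation ($\sigma$ is the sigmoid function):

$$z = w^T x + b,\qquad a = \sigma(z),\qquad \mathcal{L}(a, y) = -\,y\log a - (1-y)\log(1-a)$$

$$\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a},\qquad \frac{da}{dz} = a(1-a)$$

Multiplying by the chain rule and simplifying:

$$dz = \frac{\partial \mathcal{L}}{\partial z} = a - y,\qquad dw = x\,dz,\qquad db = dz$$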

  6. broadcasting not only expands a single real number into a matrix matching the one it is added to or subtracted from; it can also expand a vector into a matrix matching the one it is operated with.

  7. the axis parameter of the sum method selects whether to sum vertically down the columns (axis=0) or horizontally across the rows (axis=1).
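A quick check of the two axis choices (example matrix is mine):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = A.sum(axis=0)  # sum down each column  -> shape (3,)
row_sums = A.sum(axis=1)  # sum across each row   -> shape (2,)
print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]
```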

  8. tip when work with numpy

    it is better to work with rank-2 numpy arrays rather than rank-1 arrays

    because a rank-1 array is neither a row vector nor a column vector. A rank-1 array is created by something like a = np.random.randn(5), while a rank-2 column vector is created by a = np.random.randn(5, 1). It is also good practice to check shapes regularly with assertions such as assert(a.shape == (5, 1)).
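The difference in behavior is easy to see by transposing both kinds of array:

```python
import numpy as np

a = np.random.randn(5)      # rank-1 array: shape (5,)
print(a.shape)              # (5,)
print(a.T.shape)            # (5,) -- transposing a rank-1 array does nothing

b = np.random.randn(5, 1)   # rank-2 column vector: shape (5, 1)
print(b.shape)              # (5, 1)
print(b.T.shape)            # (1, 5) -- a proper row vector

assert b.shape == (5, 1)    # check shapes regularly
a = a.reshape(5, 1)         # promote a rank-1 array to a column vector
```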

  9. What you need to remember:

    Common steps for pre-processing a new dataset are:

    • Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, …)
    • Reshape the datasets such that each example is now a vector of size (num_px * num_px * 3, 1)
    • “Standardize” the data
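The steps above can be sketched as follows (the dataset here is a random placeholder, and the dimensions m_train = 100 and num_px = 64 are made up for illustration):

```python
import numpy as np

# Pretend dataset: 100 training images, each 64x64 RGB
m_train, num_px = 100, 64
train_x_orig = np.random.randint(0, 256, (m_train, num_px, num_px, 3))

# Reshape each example into a vector of size (num_px * num_px * 3, 1),
# stacked side by side -> shape (num_px*num_px*3, m_train)
train_x_flat = train_x_orig.reshape(m_train, -1).T

# "Standardize": for image data it is enough to divide by 255
train_x = train_x_flat / 255.0
print(train_x.shape)  # (12288, 100)
```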
  10. My Jupyter notebook from watching this week’s videos


  1. when we talk about the number of layers of a neural network, what are we counting?

    The input layer is not counted, so the first layer is the first hidden layer. A way to remember this: the input layer is layer 0, just as Python indexing counts from zero.

  2. some activation of neural network

    • sigmoid(if binary classification)

    • tanh: $\displaystyle \tanh(z) = \frac{e^{z} - e^{-z}}{e^z + e^{-z}}$

      $g'(z) = \displaystyle 1 - (\tanh(z))^2$

    • ReLU (the default choice) $= \max(0, z)$

    • leaky ReLU = $max(0.01z, z)$
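The four activations above can be written down directly in numpy (function names are my own):

```python
import numpy as np

def sigmoid(z):
    """1 / (1 + e^-z); used for binary classification outputs."""
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    """max(0, z), elementwise -- the default choice for hidden layers."""
    return np.maximum(0, z)

def leaky_relu(z):
    """max(0.01z, z), elementwise."""
    return np.maximum(0.01 * z, z)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))        # [0. 0. 2.]
print(leaky_relu(z))  # small negative slope instead of a hard zero
```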

  3. why shouldn’t we use a linear activation function in the hidden layers? Because in that case the whole network collapses into a plain linear model, no matter how many layers it has. This becomes very clear if we write down the formulas.
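A quick numerical check that stacking linear layers yields nothing more than a single linear map (the matrix sizes and random seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1))

# Two "layers" with the identity (linear) activation
a2 = W2 @ (W1 @ x + b1) + b2

# ...equal one linear layer with W = W2 W1 and b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
print(np.allclose(a2, W @ x + b))  # True
```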


  1. some notations of deep neural network(mark the input layer as layer 0, the first hidden layer as layer 1)

    • $n^{[l]}$ denotes the number of units in layer $l$
    • $a^{[l]}$ denotes the activations in layer $l$
  2. some formulas in a deep network; note there must be an explicit for-loop over the layers in forward propagation

    • $Z^{[l]} = W^{[l]}A^{[l - 1]} + b^{[l]}$
    • $A^{[l]} = g^{[l]}(Z^{[l]})$
  3. how to get your matrix dimensions right?

    • $W^{[l]}: \ (n^{[l]}, n^{[l - 1]})$
    • $b^{[l]}: \ (n^{[l]}, 1)$
  4. why deep representations ?

    • we can think of a neural network as composing small elements into something more abstract.

      For example, given the pixels of an image, the first layer may combine pixels into simple edges; the next layer may combine those edges into facial features; the deeper layers then combine those features to recognize the face they form.

    • Circuit theory and deep learning: there are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.

      I think the intuition behind this is that each repeated subproblem is solved only once, which is very similar to dynamic programming.

  5. backward propagation for layer $l$

    Input: $da^{[l]}$

    Output: $da^{[l - 1]}, dW^{[l]}, db^{[l]}$

    Calculation: $\begin{cases}dz^{[l]} = da^{[l]} * g^{[l]\prime}(z^{[l]}) \\ dW^{[l]} = dz^{[l]}\cdot a^{[l-1]T} \\ db^{[l]} = dz^{[l]} \\ da^{[l-1]} = W^{[l]T}\cdot dz^{[l]}\end{cases}$
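A sketch of that single backward step, vectorized over $m$ examples (I assume a ReLU layer here, and include the $1/m$ averaging used for cost gradients; variable names are mine):

```python
import numpy as np

def backward_step(dA, Z, A_prev, W):
    """Given dA[l], return dA[l-1], dW[l], db[l] for one ReLU layer."""
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                      # dz = da * g'(z); ReLU derivative
    dW = (dZ @ A_prev.T) / m               # dW = dz . a[l-1]^T, averaged
    db = dZ.sum(axis=1, keepdims=True) / m # db = dz, averaged over examples
    dA_prev = W.T @ dZ                     # da[l-1] = W^T . dz
    return dA_prev, dW, db

rng = np.random.default_rng(2)
A_prev = rng.standard_normal((3, 5))       # layer l-1 activations, 5 examples
W = rng.standard_normal((4, 3))
Z = W @ A_prev
dA = rng.standard_normal((4, 5))           # gradient flowing in from layer l+1

dA_prev, dW, db = backward_step(dA, Z, A_prev, W)
print(dA_prev.shape, dW.shape, db.shape)   # (3, 5) (4, 3) (4, 1)
```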

  6. a brief note on hyperparameters

    Hyperparameters are the knobs that control the final parameters we use to predict on examples.

    Hyperparameters:

    • learning rate
    • number of iterations
    • number of hidden layers
    • number of hidden units
    • choice of activation function

    Parameters:

    • weight matrices
    • biases