# Code

The solution of problem 982 is showed below. You can also find it here.

# Solution

The core method to solve it is to treat one number previously and traverse all combination of two numbers $a \& b$ in the vector. In this way, we can reduce the time complexity from $O(N^3)$ to $O(N^2)$. The treatment is to pre-calculate each number $a$ in the vector to find with which number $b$ $a \& b$ would get $0$.

# Valuable Method

The pre-treating method is really important when solving problems. If we enumerate all three numbers combination, for example $a \& b \& c$, we would repeatly calculate & operator with $c$ for many times. By remembering it, we could use more space to reduce run time. Moreoever, we could also store the result of $a \& b$ and check the result by $c$.

# Code

The solution of problem 843 is showed below. You can also find it here.

# Solution

The thought of the problem is from a discussion of the problem. Basically, we count all characters on each position from 1 to 6 as there probability. Then we generate a string with highest probability and guess it. Based on what return to us, we eliminate all other candidates which does not meet the requirement. Finally, we will guess the right word.

# Interesting Point

It’s a whole new kind of problem. The method to solve it is now fixed. Statistic is a candidate method to this problem. This kind of problem is just so fun and interesting!

# Code

The solution of problem 200 is showed below. You can also find it here.

# Solution

The solution of the problem is simply BFS/DFS. Check on each position and go BFS/DFS search when the position is “1”.

# Valueable Method

I choose the BFS and here is a trick. If you choose to change status after retreat from queue, you will push a huge amount of duplicate position into queue which will cause “Memory Limit Exceed” and slow down your program. So If treating each action before pushing into queue, it would save huge amount of memory and run faster.

# Code

The solution of problem 159 is showed below. You can also find it here.

# Solution

To solve the problem, you need to build a map to store the last position of each unique element and use sliding window to get the final result.

The sliding window means that expanding the right side of the window first until exceeding the limit, three unique elements, then querying each element in map and kicking the most left element out. In this way, the window starts sliding until the end of the string.

# Valuable Method

At the first time, I use map to store number of elements in a sliding window rather than their position. It’s OK but stores an useless information–number. Also, my first try will scan the whole string twice which slows down the whole program. However, if we just store the position of each element, we just need to scan the whole string once and also get the right result. It’s also a kind of optimization on constant, which reduces the time complexity from O(2N) to O(N).

# Code

Solution of problem 309 is showed below. You can also find it here.

# Solution

The method of the solution is from Discuss.

There is another method which use operations rather than states to record status and do calculation.

# Valueable Function

The interest part of the first solution is that it uses state machine and minimizes the memory usage by just storing constant level variables. However, it can only be applied in the situation. If the problem said you can only buy stock two days later after selling it, the solution cannot hold it. So the second solution is versatility.

# Code

Solution of problem 416 is showed below. You can also find it here.

# Solution

There are two solutions. The core method of both solutions is dp, specifically 0-1 Knapsack problem.

The first solution uses “bitset” to storage status.

The second solution, which doesn’t show here since it’s too simple and you need to implement it by yourself, uses “bool array” to storage all status.

The core operation (at line 16) of two solutions is different due to feature of the data structure they use. According to the result on LeetCode, the second solution is slower than the first one. In my opinion, bit operations is much faster than scanning through bool array.

# Valueable Function

It is a good problem to practice “bitset” data structure. It is made of bits and can store more than 64 bits, which is the length of long long or unsigned long long. It also implements almost all bit operations.

# Code

Solution of problem 146 is showed below. You can also find it here.

# Solution

To solve the problem, you need to build a list to store the order of “key”-“value” pair and to build a map to store “key”-“position in list” pair. Thus, when putting a “key”-“value” pair, you will easily know its position and update the order.

# Valueable Function

The interesting part of this code is the function “splice” of list. It can be found on c++ reference.

The function will transfer element “i” in list “x” to specific “position”.

Before using this funcition, I wrote a segment of code to implement the function by myself but it runs slower than the function. In this case, by using this function, I transfer the “mp[key]” in list “lst” to the front of the list “lst”.

This blog is a summary of my solution of LeetCode problems. I will share several interesting problems in this blog and category. Most of my solutions are on my Github.

Most of my algorithm problem solutions are written in C++ and some of them are written in Python3. Most of my databse problem solutions are written in MySQL.

This is a note of the first course of the “Deep Learning Specialization” at Coursera. The course is taught by Andrew Ng.

Almost all materials in this note come from courses’ videos. The note combines knowledge from course and some of my understanding of these konwledge. I’ve reorganized the structure of the whole course according to my understanding. Thus, it doesn’t strictly follow the order of videos.

In this note, I will keep all functions and equations vectorized (without for loop) as far as possible.

If you want to read the notes which strictly follows the course, here are some recommendations:

Some parts of this note are inspired from Tess Ferrandez.

# Brief Intro to Deep Learning

To begin with, let’s focus on some basic concepts to gain some intuition of deep learning.

## Stuctures of Deep Learning

We start with supervised learning. Here are several types of neural network (NN) in the folloing chart:

INPUT: X OUTPUT: y NN TYPE
Home features Price Standard NN
Ad, user info Click or not Standard NN
Image Objects Convolutional NN (CNN)
Audio Text Transcription Recurrent NN (RNN)
English Chineses Recurrent NN (RNN)
Image, Radar info Position of other cars Custom NN

Here are some pictures of Standard NN, Convolutional NN, Recurrent NN:

Standard NN

Convolutional NN

Recurrent NN

## Data

Neural Nework can deal with both stuctured data and unstructured data. The following will give you an intuition of both kinds of data.

Structured and Unstructured Data

Here are some conclusions of why deep learning is advanced comparing to traditional machine learning.

Firstly, deep learning models performs better when dealing with big data. Here is a comparation of deep learning models and classic machine learing models:

Comparation between deep learning and machine learning

Secondly, thanks to the booming development of hardware and advanced algorithm, computing is much faster than before. Thus, we can implement our idea and know whether it works or not in short time. As a result, we can run the following circle much faster than we image.

Iteration process

# Basic Symbols of the Course

These basic symbols will be used through out the whole specialization.

Standard notations

Standard representation

Moreover, in this course, each input x will be stacked into columns and form the input matrix X.

Input X

Output y

# Neural Network

Reviewing the whole course, there are several common concepts between logistic regression and neural network (including both shallow and deep neural network). Thus, I draw conclusions on each concept and then apply them to both logistic regression and neural network.

## Logistic Regression and Neural Network

First of all, here are pictures of logistic regression and neural network.

Logistic Regression

Neural Network

As we can see, logistic regression is also a kind of neural network, which has input layer and output layer and does not have hidden layers, so that it is also called mini neural network. In the following sections, I will write “neural network” to represent logistic regression and neural network and use pictures similar to the second one to represent neural network.

## Computation Graph

Computation graph is one of basic concepts in deep learning. By analyzing it, we could understand the whole process of computation process of neural network. The following is the basic computation graph:

In this picture, we can easily understand how $J(a,b,c)=3(a+bc)$ is computed. This process is similar to “Forward Propagation” process which I will say in next section. Moreover, in neural network, $J$ is called cost function. After computing cost function $J$, we need to feed it back to all of our parameters, such as $a$, $b$, $c$ in the picture. This process is called computing derivatives.

By analyzing the comutation graph, we can easily compute all deviatives. According to chain rule, we can compute $\frac{dJ}{da}$ by $\frac{dJ}{dv}\frac{dv}{da}$. So do parameter $b$ and $c$. The whole derivation process is similar to “backward propagation” process in neural network.

## Forward Propagation

### Computation on single neuron

Computation on single neuron

For every single neuron, the computing process is the same as the logistic regression. Logistic regression is basically the combination of linear regression and logistic function such as sigmoid. It has one input layer, x, and one output layer, a or $\hat{y}$.

The linear regression equation is: $z = w^Tx+b$
The sigmoid function equation is: $a = \sigma( z )$
The combination euquation is:   $\hat{y} = a = \sigma( w^Tx + b )$

### The whole process on Neural Network

Forward Propagation

This is an example of neural network. Since it only has one hidden layer, it’s also called shallow neural network. The forward propagation process means that we compute the graph from left to the right in this picture.

The whole process when computing the 1st layer (hidden layer) is as the following:

\begin{align} Z^{[1]} & = W^{[1]T}X + b^{[1]} \\ A^{[1]} & = \sigma( Z^{[1]} ) \end{align}

In these equations:

• $W^{[1]T}$ is a $4 \times 3$ matrix. It is also written as $W^{[1]}$. Its shape is always $n^{[l]} \times n^{[l - 1]}$.
• $X$ is a $3 \times m$ matrix. Sometimes it is also called $A^{[0]}$.
• $b^{[1]}$ is a $4 \times m$ matrix. Its shape is always $n^{[l]} \times m$.
• $A^{[1]}$ is a $4 \times m$ matrix. Its shape is always $n^{[l]} \times m$.
• $\sigma$ is an element-wise function. It is called activation function.

For each layer, it just repeats what previous layers do until the last layer (output layer).

### Cost function

Here is a definition of loss function and cost function.

• Loss function computes a single training example.
• Cost function is the average of the loss function of the whole training set.

In traditional machine learning, we use square root error as loss function, which is $L = \frac{1}{2}( \hat{y} - y )^2$. But in this case, we don’t use it since most problems we try to solve are not convex.

Here is the loss function we use:

$$L( \hat{y}, y ) = -( y \cdot log(\hat{y}) + ( 1 - y ) \cdot log( 1 - \hat{y} ) )$$

For this loss function:

• if y = 1, then $L = -y \cdot log(\hat{y})$ and it will close to 0 when $\hat{y}$ near 1.
• if y = 0, then $L = -( 1 - y ) \cdot log( 1 - \hat{y} )$ and it will close to 0 when $\hat{y}$ near 0.

Then the cost function is: $$J( w, b ) = \frac{1}{m}\sum_{i=1}^{m} L( \hat{y}, y )$$

## Backward Propagation

Here, we use gradient descent as our backward propagation method.

Backward Propagation

As we can see in the picture, it is a simplified computation graph. The neural network is on the right-top, which is almost the same as the neural network we discussing in previous section. Backward Propagation is computing derivatives from the right to the left. By following the backward process, we can get derivatives for all parameters, including $W^{[1]}$, $b^{[1]}$, $W^{[2]}$, $b^{[2]}$.

Here I give a rough derivation example of computing gradients of parameter $W^{[1]}$.

\begin{align} \frac{\partial L}{\partial Z^{[1]}} & = W^{[2]T}\frac{\partial L}{\partial Z^{[2]}} \cdot {\sigma}^{[1]\prime}(Z^{[1]}) \\ \frac{dL}{dW^{[1]}} & = \frac{\partial L}{\partial Z^{[1]}}\frac{dZ^{[1]}}{dW^{[1]}} \\ & = \frac{\partial L}{\partial Z^{[1]}}A^{[0]T} \\ \frac{dL}{db^{[1]}} & = \frac{\partial L}{\partial Z^{[1]}}\frac{dZ^{[1]}}{db^{[1]}} \\ & = \frac{\partial L}{\partial Z^{[1]}} \end{align}

It is similar to compute parameters in other layers. In these equations:

• $\frac{dL}{dW^{[1]}}$ has the same shape as $W^{[1]}$. So do other layers.
• $\frac{dL}{db^{[1]}}$ has the same shape as $b^{[1]}$. So do other layers.
• the $\cdot$ in first line is an element-wise product.

### Update parameters

After computing gradients, we can update our parameters quickly.

For every parameters (Take layer1 as an example):

\begin{align} W^{[1]} & = W^{[1]} - \alpha \frac{dL}{dW^{[1]}} \\ b^{[1]} & = b^{[1]} - \alpha \frac{dL}{db^{[1]}} \end{align}

In above equations, $\alpha$ is called learning rate, which we need to determine before training.

## Activation Functions

In previous sections, notation $\sigma$ is used to represent activation function. In neural network, there are five common activation functions: Sigmoid, Tanh, ReLU, Leaky ReLU, and Exponential LU.

Activation Functions

In the course, Prof. Andrew Ng introduces the first four activation functions.

Here are some experience on choosing those activation functions:

• Sigmoid: It is usually used in output layer to generate results between 0 and 1 when doing binary classification. In other case, you should not use it.
• Tanh: It always works better than sigmoid function since its value is between -1 and 1, so that neural network can learn more information by using it than using sigmoid function.
• ReLU: The most commonly used activation function is ReLU function. If you don’t want to use it, you can choose other ReLU derivatives, such as Leaky ReLU.