# Week1 Convolutional Neural Networks

Why CNN?

## Edge Detection Example

Question: what are the criteria for choosing the filter (kernel)? (Covered in the next section.)

## More Edge Detection

• Each convolution shrinks the image, e.g. 6×6 → 4×4.
• Pixels at the border are touched by only one output value, so part of the edge information is thrown away.

• Same convolution: output size equals input size; the required padding depends on the filter size, which is almost always odd.
• An odd-sized filter has a single center pixel, which makes it easy to refer to the filter's position (a computer-vision convention).
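The size arithmetic behind these points can be sketched as small helpers (a minimal sketch; the formula is the standard ⌊(n + 2p − f)/s⌋ + 1):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size of a convolution over an n x n input with an
    f x f filter, padding p, and stride s: floor((n + 2p - f)/s) + 1."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding for a "same" convolution (stride 1). It is an integer
    only when f is odd, one reason odd filter sizes are the convention."""
    return (f - 1) // 2

print(conv_output_size(6, 3))                      # 4: the 6x6 -> 4x4 shrink
print(conv_output_size(6, 3, p=same_padding(3)))   # 6: output size preserved
```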

## A Simple Convolutional Network Example

Types of layer in a convolutional network:

• Convolution
• Pooling
• Fully connected

## Pooling Layers

With f=2 and s=2, pooling halves the height and width of the representation.
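The halving effect of f=2, s=2 max pooling can be checked with a minimal numpy sketch (loop-based for clarity rather than speed):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling over a 2-D array with window size f and stride s."""
    h, w = x.shape
    out_h, out_w = (h - f) // s + 1, (w - f) // s + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()
    return out

x = np.arange(16).reshape(4, 4)
print(max_pool(x).shape)  # (2, 2): height and width halved
```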

## A Convolutional Neural Network Example

• Conv-layer parameters: (f·f·n_c_prev + 1) · n_filters; each filter spans all input channels, and the +1 per filter is its bias.
• CONV1: (5·5·1 + 1) · 8 = 208
• CONV2: (5·5·8 + 1) · 16 = 3216
• FC-layer parameters: n_in·n_out + n_out, with one bias per output neuron.
• FC3: 400·120 + 120 = 48120
• FC4: 120·84 + 84 = 10164
• Softmax: 84·10 + 10 = 850
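A minimal sketch of the two counting formulas, one bias per filter for conv layers and one bias per output unit for fully connected layers (helper names are illustrative):

```python
def conv_params(f, n_c_prev, n_filters):
    """(f*f*n_c_prev weights + 1 bias) per filter."""
    return (f * f * n_c_prev + 1) * n_filters

def fc_params(n_in, n_out):
    """One weight per input-output pair, plus one bias per output unit."""
    return n_in * n_out + n_out

print(conv_params(5, 1, 8))     # 208: 8 filters of 5x5 over 1 channel
print(conv_params(5, 8, 16))    # 3216: 16 filters of 5x5 over 8 channels
print(fc_params(400, 120))      # 48120: fully connected, 400 -> 120
```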

• Parameter sharing: a feature detector useful in one part of the image is likely useful in another.
• Sparsity of connections: each output value depends on only a small number of inputs.

## Notes & Tips

• Each filter in this layer spans all channels of the previous layer's output; the per-channel results are summed to give that filter's single output channel.
• Pooling is then applied to the stacked result channel by channel: with n filters in this layer, pooling runs n times, once per channel.
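The first point, that one filter's per-channel results are summed into a single output value, can be verified in a toy numpy sketch (one filter, one output position, random values):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 3, 4))      # one 3x3 patch with 4 channels
w = rng.normal(size=(3, 3, 4))      # one 3x3 filter spanning all 4 channels

# The filter produces ONE number per output position: the per-channel
# products are summed across all channels.
out = np.sum(x * w)

# Equivalent: convolve each channel separately, then add the results.
per_channel = [np.sum(x[:, :, c] * w[:, :, c]) for c in range(4)]
assert np.isclose(out, sum(per_channel))
```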

# Week2 Deep Convolutional Models: Case Studies

## Classic Networks

LeNet-5: uses Sigmoid and Tanh activations; each filter has the same number of channels as its input volume; applies a nonlinearity after pooling.

AlexNet: much larger than LeNet-5 (about 60 million parameters); uses the ReLU activation; trained across multiple GPUs; uses LRN (Local Response Normalization).

VGG-16: simplifies the network structure (a very uniform architecture); a huge number of parameters (about 138 million).

## Why Do ResNets Work?

ResNets use many "same" convolutions, so $a^{[l]}$ has the same dimensions as the output of the later layer, which is what makes the skip connection (adding $a^{[l]}$ directly) possible.
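A minimal numpy sketch of the shortcut addition; the "same" convolutions are replaced by same-shape linear maps purely for brevity (an assumption for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

rng = np.random.default_rng(1)
a_l = rng.normal(size=(4,))                    # a[l], entering the block
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

# Because the block preserves dimensions, the shortcut can add a[l]
# directly to z[l+2] with no reshaping.
z = W2 @ relu(W1 @ a_l)
a_out = relu(z + a_l)                          # a[l+2] = g(z[l+2] + a[l])
print(a_out.shape)                             # (4,), same as a[l]
```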

## Network in Network and 1×1 Convolutions

An Inception network removes the need to decide by hand which filter size to use: it applies several filter sizes (and pooling) in parallel and concatenates the results.
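A 1×1 convolution is a per-position linear combination across channels; a minimal numpy sketch (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(28, 28, 192))       # feature map with 192 channels
w = rng.normal(size=(192, 32))           # 32 filters of shape 1x1x192

# Each output position is a linear combination of that position's 192
# channel values: a 1x1 convolution, implemented as a matmul.
out = x @ w
print(out.shape)   # (28, 28, 32): spatial size kept, channels reduced
```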

## Transfer Learning

• If the dataset is small:
  • Use the deep-learning framework to freeze the already-trained layers so they do not participate in training.
  • Save the last hidden layer's features to disk and train only a softmax classifier on top of them.
• If the dataset is somewhat larger:
  • Freeze fewer layers.
• If you have a large amount of data:
  • Train the entire network, initializing with the author's published weights instead of random initialization.
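The small-dataset recipe, cache the frozen network's features and train only a softmax on top, sketched framework-free; the cached features and labels here are synthetic stand-ins (assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-ins for the last hidden layer's features saved to disk by the
# frozen pretrained network: 60 examples, 50 features, 3 classes.
# (Labels come from a hidden linear rule so the task is learnable.)
feats = rng.normal(size=(60, 50))
labels = np.argmax(feats @ rng.normal(size=(50, 3)), axis=1)
Y = np.eye(3)[labels]                      # one-hot targets

W, b = np.zeros((50, 3)), np.zeros(3)
for _ in range(300):                       # train ONLY the softmax layer
    z = feats @ W + b
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - Y) / len(feats)            # softmax cross-entropy gradient
    W -= 0.5 * feats.T @ grad
    b -= 0.5 * grad.sum(axis=0)

# The frozen layers never update; only W and b are trained.
acc = (np.argmax(feats @ W + b, axis=1) == labels).mean()
```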

## Notes & Tips

• If you have not yet achieved a very good accuracy (say, more than 80%), here are some things you can play around with to try to achieve it:

• Try stacking blocks of CONV->BATCHNORM->RELU until your height and width dimensions are quite low and your number of channels quite large (≈32, for example). You are encoding useful information in a volume with many channels. You can then flatten the volume and feed it to a fully-connected layer.

• You can use MAXPOOL after such blocks. It will help you lower the dimension in height and width.
• If the model is struggling to run and you get memory issues, lower your batch_size (12 is usually a good compromise)
• Run on more epochs, until you see the train accuracy plateauing.
• Create->Compile->Fit/Train->Evaluate/Test.

• other basic features of Keras

• Identity block

• Convolutional block

• ResNet model

# Week3 Object Detection

• Classification label: log (cross-entropy) loss
• Bounding-box coordinates: squared error
• Objectness ($p_c$): logistic regression loss

## Bounding Box Predictions

• Very similar to the classification-with-localization algorithm, and it outputs bounding-box coordinates explicitly.
• Implemented convolutionally: instead of running the algorithm once per grid cell, a single convolutional pass shares most of the computation (the YOLO algorithm).

$b_h$ and $b_w$ are expressed relative to the grid-cell size (as multiples of the cell, so they can exceed 1).
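A minimal sketch of decoding one cell's prediction into pixel coordinates (the 19×19 grid and 608-pixel input match the YOLO summary later in these notes; variable names are illustrative):

```python
def decode_box(row, col, bx, by, bh, bw, grid=19, img=608):
    """Convert a YOLO-style prediction, where (bx, by) locate the box
    center inside cell (row, col) and (bh, bw) are multiples of the
    cell size, into pixel coordinates (cx, cy, h, w)."""
    cell = img / grid                    # cell size in pixels
    cx = (col + bx) * cell               # box center, x
    cy = (row + by) * cell               # box center, y
    return cx, cy, bh * cell, bw * cell

# Center of cell (0, 0), box exactly one cell tall and wide:
print(decode_box(0, 0, 0.5, 0.5, 1.0, 1.0))  # (16.0, 16.0, 32.0, 32.0)
```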

## Anchor Boxes

Cases the anchor-box scheme does not handle well:

• A grid cell contains more objects than there are anchor boxes.
• Two objects fall into the same grid cell and have the same best-matching anchor-box shape.

Anchor boxes let the algorithm specialize, which helps when the dataset contains some tall, thin objects such as pedestrians and some short, wide objects such as cars.

• Anchor-box shapes are usually chosen by hand: pick 5 to 10 shapes that cover the variety of objects you want to detect.
• Alternatively, run k-means on the object shapes in the dataset and choose the most representative set of anchor boxes for the dozen or so objects you are trying to detect.

## Notes & Tips

• Summary for YOLO:
• Input image (608, 608, 3)
• The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
• After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
• Each cell in a 19x19 grid over the input image gives 425 numbers.
• 425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
• 85 = 5 + 80, where 5 is because $(p_c, b_x, b_y, b_h, b_w)$ has 5 numbers, and 80 is the number of classes we’d like to detect
• You then select only a few boxes based on:
• Score-thresholding: throw away boxes that have detected a class with a score less than the threshold
• Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes
• This gives you YOLO’s final output.
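The two filtering steps above, score-thresholding followed by non-max suppression, in a minimal numpy sketch (boxes as (x1, y1, x2, y2); the thresholds are illustrative):

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def filter_boxes(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Score-thresholding followed by non-max suppression."""
    idx = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_thresh]
    keep = []
    for i in idx:                       # highest score first
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return [int(i) for i in keep]

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]])
scores = np.array([0.9, 0.8, 0.7])
print(filter_boxes(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much
```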

# Week4 Special Applications: Face Recognition and Neural Style Transfer

## Siamese Networks

[Taigman et al., 2014. DeepFace: Closing the gap to human-level performance]

## What Is Neural Style Transfer?

Content + Style = Generated

## Cost Function

$J(G) = \alpha J_{content}(C,G) + \beta J_{style}(S,G)$

## Notes & Tips

• Face Recognition
• Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
• The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
• The same encoding can be used for verification and recognition. Measuring distances (np.linalg.norm) between two images’ encodings allows you to determine whether they are pictures of the same person.
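The triplet loss mentioned above can be sketched directly from its definition, $\max(\lVert f(A)-f(P)\rVert^2 - \lVert f(A)-f(N)\rVert^2 + \alpha, 0)$ (the 2-D encodings and margin here are toy values):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    pos = np.sum((anchor - positive) ** 2)
    neg = np.sum((anchor - negative) ** 2)
    return max(pos - neg + alpha, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])      # same person: close encoding
n = np.array([1.0, 1.0])      # different person: far encoding
print(triplet_loss(a, p, n))  # 0.0: the margin constraint is satisfied
```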
• Neural Style Transfer

• Content Cost

• The earlier (shallower) layers of a ConvNet tend to detect lower-level features such as edges and simple textures, and the later (deeper) layers tend to detect higher-level features such as more complex textures as well as object classes.

• The content cost takes a hidden layer activation of the neural network, and measures how different $a^{(C)}$ and $a^{(G)}$ are.

• When we minimize the content cost later, this will help make sure $G$ has similar content as $C$.

• In order to compute the cost $J_{content}(C,G)$, it might also be convenient to unroll these 3D volumes into a 2D matrix:
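A minimal numpy sketch of the unrolling and the content cost $J_{content} = \frac{1}{4\, n_H n_W n_C} \sum (a^{(C)} - a^{(G)})^2$ (toy activation shapes):

```python
import numpy as np

rng = np.random.default_rng(4)
n_H, n_W, n_C = 4, 4, 3
a_C = rng.normal(size=(n_H, n_W, n_C))   # content-image activations
a_G = rng.normal(size=(n_H, n_W, n_C))   # generated-image activations

# Unroll each 3-D volume into a (n_H*n_W, n_C) matrix.
a_C_unrolled = a_C.reshape(n_H * n_W, n_C)
a_G_unrolled = a_G.reshape(n_H * n_W, n_C)

J_content = np.sum((a_C_unrolled - a_G_unrolled) ** 2) / (4 * n_H * n_W * n_C)
# J_content is zero only when the two activations match exactly.
```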

• Style Cost

• Style matrix is also called a “Gram matrix.” In linear algebra, the Gram matrix G of a set of vectors $(v_{1},\dots ,v_{n})$ is the matrix of dot products, whose entries are $G_{ij} = v_{i}^{T} v_{j}$ (in numpy, `np.dot(v_i, v_j)`). In other words, $G_{ij}$ compares how similar $v_i$ is to $v_j$: if they are highly similar, you would expect them to have a large dot product, and thus for $G_{ij}$ to be large.

The result is a matrix of dimension $(n_C,n_C)$ where $n_C$ is the number of filters. The value $G_{ij}$ measures how similar the activations of filter $i$ are to the activations of filter $j$.

One important property of the Gram matrix is that a diagonal element such as $G_{ii}$ also measures how active filter $i$ is. For example, suppose filter $i$ detects vertical textures in the image. Then $G_{ii}$ measures how common vertical textures are in the image as a whole: if $G_{ii}$ is large, the image has a lot of vertical texture.
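The Gram matrix of a layer's activations, in a minimal numpy sketch (unroll so each row is one filter's activations across all positions, then $G = A A^T$; the shapes are toy values):

```python
import numpy as np

rng = np.random.default_rng(5)
n_H, n_W, n_C = 3, 3, 4
a_S = rng.normal(size=(n_H, n_W, n_C))      # style-image activations

# Unroll so each ROW is one filter's activations across all positions.
A = a_S.reshape(n_H * n_W, n_C).T           # shape (n_C, n_H*n_W)
G = A @ A.T                                 # Gram matrix, shape (n_C, n_C)

print(G.shape)                              # (4, 4)
# G[i, j] measures how correlated filters i and j are across the image;
# G[i, i] measures how active filter i is.
assert np.allclose(G, G.T)                  # Gram matrices are symmetric
```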

• Style cost: After generating the Style matrix (Gram matrix), the goal will be to minimize the distance between the Gram matrix of the “style” image S and that of the “generated” image G.

• Style weights: We’ll get better results if we “merge” style costs from several different layers.

• The style of an image can be represented using the Gram matrix of a hidden layer’s activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient. Minimizing the style cost will cause the image $G$ to follow the style of the image $S$.

• The generated image does NOT need to be initialized completely at random. We initialize it as a noisy image created from the content_image: pixels that are mostly noise but still slightly correlated with the content image help the content of the “generated” image more rapidly match the content of the “content” image.

• Summary

It uses representations (hidden layer activations) based on a pretrained ConvNet. The content cost function is computed using one hidden layer’s activations. The style cost function for one layer is computed using the Gram matrix of that layer’s activations. The overall style cost function is obtained using several hidden layers.