
[Translation] Deep Learning in R (1): Building a Fully Connected Neural Network from Scratch

This article is an original translation by 数盟. Reposting is welcome as long as the source, "数盟社区", is credited.

Author: PENG ZHAO

I would like to thank Feiwen, Neil and all the other technical reviewers and readers for their valuable comments and suggestions on this post.

Background

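Deep neural networks (DNNs) have made great achievements in recent years in image recognition, natural language processing and autonomous driving. As Figure 1 shows, from 2012 to 2015 recognition accuracy on ImageNet rose from about 80% to about 95%, far surpassing traditional computer vision (CV) methods.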


Figure 1 – From the talk by NVIDIA CEO Jensen Huang at the 2016 Consumer Electronics Show (CES)

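In this post we focus on fully connected neural networks, which are commonly called DNNs in data science. The biggest advantage of a DNN is that it extracts and learns features automatically through its deep layered structure, especially for complex, high-dimensional data whose features engineers cannot easily capture; see the examples on Kaggle. DNNs are therefore attractive to data scientists too, and there are many successful cases in classification, time series and recommendation systems, such as Nick's post and credit scoring with DNNs. In CRAN and the R community there are several fairly mature DNN packages, including nnet, neuralnet, H2O, DARCH, deepnet and mxnet, and I strongly recommend "H2O DNN algorithm and R interface".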

So why do we need to build a DNN from scratch at all?

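- Understand how neural networks work

With existing DNN packages, most of the time you need only one line of R code to build a DNN model, and neural network examples are available as well. For inexperienced users, however, the process and the results can be hard to understand. Building your own network is therefore valuable practice that helps you learn more of the details from the structural and algorithmic points of view.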

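- Build a specific network with your new ideas

DNN is a rapidly developing field. Every week a large number of new findings and research results are published in top journals and on the Internet. DNN users also have network architectures specific to their problems, for example different activation functions, loss functions, regularization and connectivity graphs. On the other hand, existing packages lag behind the latest research, and almost all of them are written in C/C++ or Java, so it is hard to apply the latest changes or to add your own ideas by modifying them.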

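- Train and visualize the network and the data

As mentioned, existing DNN packages are highly integrated and written in low-level languages, so stepping through a network layer by layer or node by node is a nightmare. It is not even easy to visualize the results in each layer, monitor how the data or weights change during training, or show the patterns discovered in the network.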

Basic concepts and components

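A fully connected neural network, called a DNN in data science, is one in which adjacent network layers are fully connected to each other: every neuron in the network is connected to every neuron in the adjacent layers.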

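The figure below shows a very simple and typical neural network with 1 input layer, 2 hidden layers and 1 output layer. When researchers talk about the architecture of a network, they mostly mean the configuration of the DNN, such as how many layers it has, how many neurons are in each layer, and what kind of activation, loss function and regularization are applied.

[Figure: a simple neural network with 1 input layer, 2 hidden layers and 1 output layer]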

Now we will go through the basic components of a DNN and show how each one is implemented in R.

Weights and biases

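Taking the DNN architecture above as an example, there are 3 groups of weights: from the input layer to the first hidden layer, from the first to the second hidden layer, and from the second hidden layer to the output layer. A bias unit is connected to every hidden node and affects the output scores, but it does not interact with the actual data. In our R implementation, we represent the weights and biases as matrices. The dimensions of a weight matrix are: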

(number of neurons in layer M) x (number of neurons in layer M+1)

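The weights are initialized with random numbers from rnorm. The bias is just a one-row matrix with one value per neuron, and its values are set to 0. Other initialization approaches, such as calibrating the variances with 1/sqrt(n) and sparse initialization, are introduced in the weight initialization part of Stanford CS231n.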

The R code is:

# layer.size(i) stands for the number of neurons in layer i
weight.i <- 0.01 * matrix(rnorm(layer.size(i) * layer.size(i+1), sd = 0.5),
                          nrow = layer.size(i),
                          ncol = layer.size(i+1))
# one bias value per neuron in layer i+1, so it can be added to input %*% weight.i
bias.i   <- matrix(0, nrow = 1, ncol = layer.size(i+1))
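To make the pseudocode above concrete, here is a minimal sketch that initializes every weight matrix and bias of a 4-6-3 network. It assumes the layer sizes are stored in a plain numeric vector `layer.size` (the article writes `layer.size(i)` as a function call), and the helper name `init.layer` is hypothetical.

```r
# Assumption: layer sizes kept in a vector (4 inputs, 6 hidden neurons, 3 outputs)
layer.size <- c(4, 6, 3)

# Hypothetical helper: weights and bias between layer i and layer i+1
init.layer <- function(i) {
  list(
    weight = 0.01 * matrix(rnorm(layer.size[i] * layer.size[i + 1], sd = 0.5),
                           nrow = layer.size[i],
                           ncol = layer.size[i + 1]),
    # one bias value per neuron in the receiving layer
    bias   = matrix(0, nrow = 1, ncol = layer.size[i + 1])
  )
}

set.seed(1)
params <- lapply(seq_len(length(layer.size) - 1), init.layer)

dim(params[[1]]$weight)  # 4 6
dim(params[[2]]$weight)  # 6 3
dim(params[[2]]$bias)    # 1 3
```

Keeping the parameters in a list indexed by layer is only one possible layout; the training code later in the article uses separate variables (W1, b1, W2, b2) instead.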

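Another common implementation combines the weights and bias into one matrix, so that the input dimension becomes N+1: N input features plus a constant 1 for the bias, as in the code below: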

weight <- 0.01 * matrix(rnorm((layer.size(i)+1) * layer.size(i+1), sd = 0.5),
                        nrow = layer.size(i)+1,
                        ncol = layer.size(i+1))
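The equivalence of the two layouts can be checked numerically. The sketch below uses made-up sizes (2 examples, 4 features, 3 neurons) and illustrative variable names; it appends a column of ones to the input and the bias row to the weight matrix, then compares the result with the separate form.

```r
set.seed(1)
input  <- matrix(rnorm(2 * 4), nrow = 2, ncol = 4)   # 2 examples, N = 4 features
weight <- matrix(rnorm(4 * 3), nrow = 4, ncol = 3)   # 4 inputs -> 3 neurons
bias   <- matrix(c(0.1, 0.2, 0.3), nrow = 1)

# separate form: add the bias to every row of the product
separate <- sweep(input %*% weight, 2, bias, '+')

# combined form: constant 1 appended to the input, bias row appended to the weights
input.aug  <- cbind(input, 1)       # 2 x (N+1)
weight.aug <- rbind(weight, bias)   # (N+1) x 3
combined   <- input.aug %*% weight.aug

all.equal(separate, combined)
```

The combined form needs only one matrix product per layer, at the cost of copying the input to add the extra column.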

Neuron

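The neuron is the basic unit of a DNN and is a bionic model of the human neuron. A single neuron performs a multiply-and-add (FMA) of the weights and inputs, which is the same as linear regression in data science, and the FMA result is then passed to the activation function.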

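Commonly used activation functions include sigmoid, ReLU, Tanh and Maxout. In this post I take the rectified linear unit (ReLU) as the activation function, f(x) = max(0, x). For other types of activation functions, you can refer here.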


In R we can implement a neuron in several ways, such as sum(xi*wi), but matrix multiplication is more efficient.

The R code is:

neuron.ij <- max(0, input %*% weight + bias)
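For a single example and a single neuron, the two formulations give the same number. The values below are made up for illustration.

```r
x <- c(1, 2, 3)          # inputs of one example
w <- c(0.5, -1, 0.25)    # weights of one neuron
b <- 1                   # bias of that neuron

# element-wise multiply-add followed by ReLU
z.sum <- max(0, sum(x * w) + b)

# the same neuron written as a (1 x 3) by (3 x 1) matrix product
z.mat <- max(0, matrix(x, nrow = 1) %*% matrix(w, ncol = 1) + b)

z.sum   # 0.25
z.mat   # 0.25
```

The matrix form pays off once there are many neurons and many examples, because one call to %*% then evaluates all of them at once.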

Implementation tips

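In practice, for performance reasons, we always update all the neurons in a layer with a batch of examples at once, so the code above will not work correctly.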

1) Matrix multiplication and addition

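As the code below shows, input %*% weights and bias have different dimensions, so they cannot be added directly. There are two solutions. The first repeats the bias so that it matches the dimensions of input %*% weights, but this wastes a lot of memory when the input is large. The second, sweep, is therefore the better choice.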

# dimension: 2x2
input <- matrix(1:4, nrow=2, ncol=2)
# dimension: 2x3
weights <- matrix(1:6, nrow=2, ncol=3)
# dimension: 1x3
bias <- matrix(1:3, nrow=1, ncol=3)
# doesn't work since the dimensions don't match
input %*% weights + bias
Error in input %*% weights + bias : non-conformable arrays

# solution 1: repeat bias aligned to 2x3
s1 <- input %*% weights + matrix(rep(bias, each=2), ncol=3)

# solution 2: sweep addition
s2 <- sweep(input %*% weights, 2, bias, '+')

all.equal(s1, s2)
[1] TRUE

2) Element-wise maximum of a matrix

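Another small trick is to use pmax instead of max to get the element-wise maximum rather than a single global value. Note the order of the arguments in pmax.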

# the original matrix
> s1
     [,1] [,2] [,3]
[1,]    8   17   26
[2,]   11   24   37

# max returns global maximum
> max(0, s1)
[1] 37

# s1 is aligned with a scalar, so the matrix structure is lost
> pmax(0, s1)
[1]  8 11 17 24 26 37

# correct
# put matrix in the first, the scalar will be recycled to match matrix structure
> pmax(s1, 0)
     [,1] [,2] [,3]
[1,]    8   17   26
[2,]   11   24   37
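The two tips can be combined into one batch-friendly layer function. This is a minimal sketch; the helper name `relu.layer` is hypothetical, and the negative bias values are chosen only so that some outputs are clipped to zero.

```r
# Hypothetical helper: one fully connected ReLU layer applied to a batch
relu.layer <- function(input, weights, bias) {
  z <- sweep(input %*% weights, 2, bias, '+')  # add the bias to every row
  pmax(z, 0)                                   # matrix first, so dimensions are kept
}

input   <- matrix(1:4, nrow = 2, ncol = 2)
weights <- matrix(1:6, nrow = 2, ncol = 3)
bias    <- matrix(c(-10, -20, -30), nrow = 1, ncol = 3)

out <- relu.layer(input, weights, bias)
out
```

Here the pre-activation values are (-3, -5, -7) for the first example and (0, 2, 4) for the second, so the first row of `out` is all zeros and the second row is 0, 2, 4.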

Layers

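- Input layer

The input layer is relatively fixed: there is only one, and its number of units equals the number of features in the input data.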

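- Hidden layers

Hidden layers vary widely and are the core component of a DNN. In general, more hidden layers are needed to capture the desired patterns when the problem is more complex (non-linear).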

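- Output layer

Units in the output layer usually have no activation function, because the output typically represents class scores in classification and arbitrary real values in regression. For classification, the number of output units matches the number of classes to predict, while regression has only one output node.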

Building the neural network: architecture, prediction and training

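So far we have covered the basic concepts of deep neural networks. Now we will build a neural network, which involves deciding on the network architecture, training the network, and then predicting new data with the learned network. To keep things simple, we use a small dataset, Edgar Anderson's iris data (iris), to do classification with a DNN.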

Network architecture

iris is a well-known built-in dataset in stock R for machine learning, so you can take a look at it directly with summary at the console, as below.

The R code is:

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

The summary shows four features and three categories of Species, so we can design a DNN architecture as below.

[Figure: DNN architecture for the iris data]

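Then we keep our DNN model in a list, which can be used for retraining or prediction, as below. In fact we can keep more parameters of interest in the model, which gives great flexibility.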

The R code is:

str(ir.model)
List of 7
 $ D : int 4
 $ H : num 6
 $ K : int 3
 $ W1: num [1:4, 1:6] 1.34994 1.11369 -0.57346 -1.12123 -0.00107 ...
 $ b1: num [1, 1:6] 1.336621 -0.509689 -0.000277 -0.473194 0 ...
 $ W2: num [1:6, 1:3] 1.31464 -0.92211 -0.00574 -0.82909 0.00312 ...
 $ b2: num [1, 1:3] 0.581 0.506 -1.088
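The layout of this list follows directly from the data. As a rough sketch, an untrained model with the same seven fields could be put together as below; the name `model0` is hypothetical, and the hidden size 6 is taken from the listing above.

```r
D <- ncol(iris) - 1              # 4 input features (all columns except Species)
K <- nlevels(iris$Species)       # 3 classes
H <- 6                           # hidden neurons, as in the listing above

set.seed(1)
# untrained model with the same fields as ir.model
model0 <- list(D = D, H = H, K = K,
               W1 = 0.01 * matrix(rnorm(D * H, sd = 0.5), nrow = D, ncol = H),
               b1 = matrix(0, nrow = 1, ncol = H),
               W2 = 0.01 * matrix(rnorm(H * K, sd = 0.5), nrow = H, ncol = K),
               b2 = matrix(0, nrow = 1, ncol = K))
str(model0)
```

Because the model is an ordinary list, extra fields such as the loss history can be appended without changing any code that reads W1, b1, W2 and b2.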

Prediction

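Prediction, also called classification or inference in machine learning, is simple compared with training: it walks through the network layer by layer, from input to output, by matrix multiplication. No activation function is needed in the output layer. For classification the probabilities are calculated by softmax, while for regression the output represents the predicted real value. This process is called feed forward or forward propagation.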

The R code is:

# Prediction
predict.dnn <- function(model, data = X.test) {
  # new data, transfer to matrix
  new.data <- data.matrix(data)

  # Feed Forward
  hidden.layer <- sweep(new.data %*% model$W1, 2, model$b1, '+')
  # neurons : Rectified Linear
  hidden.layer <- pmax(hidden.layer, 0)
  score <- sweep(hidden.layer %*% model$W2, 2, model$b2, '+')

  # Loss Function: softmax
  score.exp <- exp(score)
  probs <- sweep(score.exp, 1, rowSums(score.exp), '/')

  # select max possibility
  labels.predicted <- max.col(probs)
  return(labels.predicted)
}
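The softmax step above exponentiates the raw scores, which can overflow when the scores are large. A common variant, shown here as a sketch and not part of the article's code, subtracts each row's maximum first; the probabilities are unchanged because softmax is invariant to adding a constant to a row. The helper name `softmax.stable` is hypothetical.

```r
# Hypothetical helper: row-wise softmax with the row maximum subtracted first
softmax.stable <- function(score) {
  shifted   <- sweep(score, 1, apply(score, 1, max), '-')
  score.exp <- exp(shifted)
  sweep(score.exp, 1, rowSums(score.exp), '/')
}

score <- matrix(c(1, 2, 3,
                  1000, 1001, 1002), nrow = 2, byrow = TRUE)

# direct form: exp(1000) is Inf, so the second row becomes NaN
naive  <- sweep(exp(score), 1, rowSums(exp(score)), '/')
stable <- softmax.stable(score)

rowSums(stable)   # each row sums to 1
max.col(stable)   # predicted class per row
```

On the iris data the scores stay small, so the direct form in the article behaves the same way there.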

Training

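Training searches for the optimal parameters (weights and biases) under the given network architecture so as to minimize the classification error or residuals. The process has two parts: feed forward and back-propagation. Feed forward passes the input data through the network (as in the prediction part) and then computes the data loss in the output layer with a loss function (cost function). "Data loss measures the compatibility between a prediction (e.g. the class scores in classification) and the ground truth label." In our example code we choose the cross-entropy function to evaluate the data loss; see here for details.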

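After obtaining the data loss, we need to reduce it by changing the weights and biases. The most popular method is to back-propagate the loss by gradient descent or stochastic gradient descent, which requires the derivative of the data loss with respect to each parameter (W1, W2, b1, b2). Back-propagation differs for different activation functions; here are their derivative formulas, and here are more training tips from Stanford CS231n.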
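The cross-entropy data loss and its gradient with respect to the scores can be sketched on a tiny hand-made example. The probabilities below are invented for illustration; the variable names follow the training code in this article.

```r
# probabilities for 3 examples (rows) over 3 classes (columns)
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.3, 0.3, 0.4), nrow = 3, byrow = TRUE)
Y <- c(1, 2, 3)                      # true class of each example
N <- nrow(probs)

# a two-column matrix index picks probs[i, Y[i]] for every row i
Y.index <- cbind(1:N, Y)
data.loss <- sum(-log(probs[Y.index])) / N

# gradient of the data loss with respect to the scores: (probs - one-hot) / N
dscores <- probs
dscores[Y.index] <- dscores[Y.index] - 1
dscores <- dscores / N

data.loss
rowSums(dscores)   # each row sums to (numerically) zero
```

The rows of `dscores` sum to zero because each row of `probs` sums to one and exactly one entry per row has 1 subtracted from it.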

In our example, the point-wise derivative of ReLU is:

f'(x) = 1 if x > 0, and 0 otherwise
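In the training code this derivative shows up as a mask on the gradient that flows back into the hidden layer. A small sketch with made-up numbers:

```r
# pre-activation values of a hidden layer: 2 examples, 3 neurons
hidden.in <- matrix(c(-2,  0, 3,
                       1, -1, 5), nrow = 2, byrow = TRUE)
hidden.layer <- pmax(hidden.in, 0)   # ReLU output

# upstream gradient arriving at the hidden layer (arbitrary illustrative values)
dhidden <- matrix(1:6, nrow = 2, byrow = TRUE)

# derivative is 1 where the unit was active and 0 elsewhere,
# so the gradient is kept for active units and zeroed for the rest
dhidden[hidden.layer <= 0] <- 0
dhidden
```

Units whose output was zero pass no gradient back, which is why the first row keeps only its third value and the second row loses its middle value.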

The R code is:

# Train: build and train a 2-layers neural network
train.dnn <- function(x, y, traindata=data, testdata=NULL,
                  # set hidden layers and neurons
                  # currently, only support 1 hidden layer
                  hidden=c(6),
                  # max iteration steps
                  maxit=2000,
                  # delta loss
                  abstol=1e-2,
                  # learning rate
                  lr = 1e-2,
                  # regularization rate
                  reg = 1e-3,
                  # show results every 'display' step
                  display = 100,
                  random.seed = 1)
{
  # to make the case reproducible.
  set.seed(random.seed)

  # total number of training set
  N <- nrow(traindata)

  # extract the data and label
  # don't need attribute
  X <- unname(data.matrix(traindata[,x]))
  Y <- traindata[,y]
  if(is.factor(Y)) { Y <- as.integer(Y) }
  # create index for both row and col
  Y.index <- cbind(1:N, Y)

  # number of input features
  D <- ncol(X)
  # number of categories for classification
  K <- length(unique(Y))
  H <- hidden

  # create and init weights and bias
  W1 <- 0.01*matrix(rnorm(D*H, sd=0.5), nrow=D, ncol=H)
  b1 <- matrix(0, nrow=1, ncol=H)

  W2 <- 0.01*matrix(rnorm(H*K, sd=0.5), nrow=H, ncol=K)
  b2 <- matrix(0, nrow=1, ncol=K)

  # use all train data to update weights since it's a small dataset
  batchsize <- N

  # Training the network
  i <- 0
  # loss must exist before the first test of the loop condition
  loss <- Inf
  # stop when maxit is reached or the loss drops below abstol
  while(i < maxit && loss > abstol) {

    # iteration index
    i <- i + 1

    # forward ....
    # 1 indicate row, 2 indicate col
    hidden.layer <- sweep(X %*% W1, 2, b1, '+')
    # neurons : ReLU
    hidden.layer <- pmax(hidden.layer, 0)
    score <- sweep(hidden.layer %*% W2, 2, b2, '+')

    # softmax
    score.exp <- exp(score)
    probs <- sweep(score.exp, 1, rowSums(score.exp), '/')

    # compute the loss
    corect.logprobs <- -log(probs[Y.index])
    data.loss <- sum(corect.logprobs)/batchsize
    reg.loss  <- 0.5*reg*(sum(W1*W1) + sum(W2*W2))
    loss <- data.loss + reg.loss

    # display results and update model
    if(i %% display == 0) {
      if(!is.null(testdata)) {
        model <- list(D = D,
                      H = H,
                      K = K,
                      # weights and bias
                      W1 = W1,
                      b1 = b1,
                      W2 = W2,
                      b2 = b2)
        labs <- predict.dnn(model, testdata[,-y])
        accuracy <- mean(as.integer(testdata[,y]) == labs)
        cat(i, loss, accuracy, "\n")
      } else {
        cat(i, loss, "\n")
      }
    }

    # backward ....
    dscores <- probs
    dscores[Y.index] <- dscores[Y.index] - 1
    dscores <- dscores / batchsize

    dW2 <- t(hidden.layer) %*% dscores
    db2 <- colSums(dscores)

    dhidden <- dscores %*% t(W2)
    dhidden[hidden.layer <= 0] <- 0

    dW1 <- t(X) %*% dhidden
    db1 <- colSums(dhidden)

    # update ....
    dW2 <- dW2 + reg*W2
    dW1 <- dW1 + reg*W1

    W1 <- W1 - lr * dW1
    b1 <- b1 - lr * db1

    W2 <- W2 - lr * dW2
    b2 <- b2 - lr * db2
  }

  # final results
  # create list to store learned parameters
  # you can add more parameters for debug and visualization
  # such as residuals, fitted.values ...
  model <- list(D = D,
                H = H,
                K = K,
                # weights and bias
                W1 = W1,
                b1 = b1,
                W2 = W2,
                b2 = b2)

  return(model)
}

Testing and visualization

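We have built a simple 2-layer DNN model, and now we can test it. First the dataset is split into two parts for training and testing; then the training set is used to train the model and the test set to measure its generalization ability.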

The R code is:

########################################################################
# testing
#######################################################################
set.seed(1)

# 0. EDA
summary(iris)
plot(iris)

# 1. split data into test/train
samp <- c(sample(1:50,25), sample(51:100,25), sample(101:150,25))

# 2. train model
ir.model <- train.dnn(x=1:4, y=5, traindata=iris[samp,], testdata=iris[-samp,], hidden=6, maxit=2000, display=50)

# 3. prediction
labels.dnn <- predict.dnn(ir.model, iris[-samp, -5])

# 4. verify the results
table(iris[-samp,5], labels.dnn)
#          labels.dnn
#            1  2  3
#setosa     25  0  0
#versicolor  0 24  1
#virginica   0  0 25

#accuracy
mean(as.integer(iris[-samp, 5]) == labels.dnn)
# 0.98
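The accuracy can also be read off the confusion matrix itself. The sketch below re-enters the table printed above by hand (so it runs without training the model) and derives the overall accuracy and the per-class recall; the names `cm` and `recall` are illustrative.

```r
# confusion matrix copied from the output above (rows: truth, columns: prediction)
cm <- matrix(c(25,  0,  0,
                0, 24,  1,
                0,  0, 25), nrow = 3, byrow = TRUE,
             dimnames = list(c("setosa", "versicolor", "virginica"),
                             c("1", "2", "3")))

accuracy <- sum(diag(cm)) / sum(cm)    # correctly labeled / all test flowers
recall   <- diag(cm) / rowSums(cm)     # per-class share of correctly labeled flowers

accuracy
recall
```

With 74 of the 75 test flowers on the diagonal, the accuracy comes out at about 0.987, in line with the 0.98 reported above; the single error is a versicolor flower labeled as class 3.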

The data loss on the training set and the accuracy on the test set are shown below:

[Figure: data loss and test accuracy]

Then we compare our DNN model with the "nnet" package, using the code below.

library(nnet)
ird <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                  species = factor(c(rep("s",50), rep("c", 50), rep("v", 50))))
ir.nn2 <- nnet(species ~ ., data = ird, subset = samp, size = 6, rang = 0.1,
               decay = 1e-2, maxit = 2000)

labels.nnet <- predict(ir.nn2, ird[-samp,], type="class")
table(ird$species[-samp], labels.nnet)
#  labels.nnet
#   c  s  v
#c 22  0  3
#s  0 25  0
#v  3  0 22

# accuracy
mean(ird$species[-samp] == labels.nnet)
# 0.96