{"title": "On the Local Hessian in Back-propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 6520, "page_last": 6530, "abstract": "Back-propagation (BP) is the foundation for successfully training deep neural networks. However, BP sometimes has difficulties in propagating a learning signal deep enough effectively, e.g., the vanishing gradient phenomenon. Meanwhile, BP often works well when combining with ``designing tricks'' like orthogonal initialization, batch normalization and skip connection. There is no clear understanding on what is essential to the efficiency of BP. In this paper, we take one step towards clarifying this problem. We view BP as a solution of back-matching propagation which minimizes a sequence of back-matching losses each corresponding to one block of the network. We study the Hessian of the local back-matching loss (local Hessian) and connect it to the efficiency of BP. It turns out that those designing tricks facilitate BP by improving the spectrum of local Hessian. In addition, we can utilize the local Hessian to balance the training pace of each block and design new training algorithms. Based on a scalar approximation of local Hessian, we propose a scale-amended SGD algorithm. We apply it to train neural networks with batch normalization, and achieve favorable results over vanilla SGD. This corroborates the importance of local Hessian from another side.", "full_text": "On the Local Hessian in Back-propagation\n\nHuishuai Zhang\n\nMicrosoft Research Asia\n\nBeijing, 100080\n\nWei Chen\n\nMicrosoft Research Asia\n\nBeijing, 100080\n\nTie-Yan Liu\n\nMicrosoft Research Asia\n\nBeijing, 100080\n\nAbstract\n\nBack-propagation (BP) is the foundation for successfully training deep neural\nnetworks. However, BP sometimes has dif\ufb01culties in propagating a learning signal\ndeep enough effectively, e.g., the vanishing gradient phenomenon. 
Meanwhile, BP\noften works well when combining with \u201cdesigning tricks\u201d like orthogonal initial-\nization, batch normalization and skip connection. There is no clear understanding\non what is essential to the ef\ufb01ciency of BP. In this paper, we take one step towards\nclarifying this problem. We view BP as a solution of back-matching propagation\nwhich minimizes a sequence of back-matching losses each corresponding to one\nblock of the network. We study the Hessian of the local back-matching loss (local\nHessian) and connect it to the ef\ufb01ciency of BP. It turns out that those designing\ntricks facilitate BP by improving the spectrum of local Hessian. In addition, we can\nutilize the local Hessian to balance the training pace of each block and design new\ntraining algorithms. Based on a scalar approximation of local Hessian, we propose\na scale-amended SGD algorithm. We apply it to train neural networks with batch\nnormalization, and achieve favorable results over vanilla SGD. This corroborates\nthe importance of local Hessian from another side.\n\n1\n\nIntroduction\n\nDeep neural networks have been advancing the state-of-the-art performance over a number of tasks\nin arti\ufb01cial intelligence, from speech recognition [Hinton et al., 2012], computer vision [He et al.,\n2016a] to natural language understanding [Hochreiter and Schmidhuber, 1997]. These problems\nare typically formulated as minimizing non-convex objectives parameterized by the neural network\nmodels. 
Typically, the models are trained with stochastic gradient descent (SGD) or its variants, and the gradient information is computed through back-propagation (BP) [Rumelhart et al., 1986].\nIt is known that BP sometimes has difficulty propagating a learning signal effectively through deep networks, e.g., the vanishing/exploding gradient phenomenon [Hochreiter, 1991, Hochreiter et al., 2001]. Recent designing tricks, such as orthogonal initialization [Saxe et al., 2014], batch normalization [Ioffe and Szegedy, 2015] and skip connection [He et al., 2016a], improve the performance of deep neural networks on almost all tasks, and are often interpreted as alleviating the vanishing gradient to some extent. However, a recent work [Orhan and Pitkow, 2018] shows that a network with non-orthogonal skip connections always underperforms one with orthogonal skip connections (identity being a special case), even though neither network suffers vanishing gradients when back-propagating through layers. This suggests that vanishing gradient is not the core reason why a network is good or not. If vanishing gradient is only a superficial symptom, we ask: what is essential to the efficiency of BP?\nIn this paper, we consider this question from the optimization perspective and give an answer: the Hessian of the local back-matching loss is responsible for the difficulty of training deep nets with BP. Specifically, we start from a penalized loss formulation, which takes the intermediate feature outputs as variables of the optimization and uses the penalty to enforce the coordination (architecture) connection [Carreira-Perpinan and Wang, 2014]. Minimizing the penalized loss in backward order leads to a back-matching propagation procedure which involves minimizing a sequence of back-matching losses.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n
Each back-matching loss penalizes the mismatch between a target signal from\nthe upper block and the output of the current block, which is determined by the parameters and the\ninputs of the block1. We show that BP is equivalent to minimizing each back-matching loss with\none-step gradient update. This is to say, BP is a solution of the back-matching propagation procedure.\nHowever, in general, the one-step gradient update may/may not be a good solution to minimize\nthe back-matching loss contingent on the Hessian, as is well known that bad-conditioned Hessian\ncan have enormous adversarial impact on the convergence of the \ufb01rst-order methods [Ben-Tal and\nNemirovski, 2001]. Loosely speaking, if the local Hessian is badly-conditioned, the one-step gradient\nupdate does not minimize the back-matching loss suf\ufb01ciently and the target signal distorts gradually\nwhen backward through layers.\nWe mathematically derive the formula of local Hessian and show that the designing tricks including\nbatch normalization and skip connection can drive local Hessian towards a good-conditioned matrix\nto some extent. This explains why practical designing tricks can stabilize the backward process. In\nparticular, by analyzing the local Hessian of residual block, we can answer the questions about skip\nconnection in [Orhan and Pitkow, 2018] via local Hessian.\nBesides interpreting existing practical techniques and providing guidance to design neural network\nstructure, we can also utilize local Hessian to design new algorithms. The general idea is to employ\nthe information of the local Hessian to facilitate the training of neural networks. We propose a\nscale-amended SGD algorithm to balance the training pace of each block by considering the scaling\neffect of local Hessian. More speci\ufb01cally, we approximate the local Hessian with a scalar and use the\nscalar to amend the gradient of each block. 
Such a scale-amended SGD is built upon the regular BP\nprocess, and hence it is easy to implement in current deep learning frameworks [Bastien et al., 2012,\nAbadi et al., 2016, Paszke et al., 2017, Seide and Agarwal, 2016]. We apply this scale-amended SGD\nto feed-forward networks with batch normalization and empirically demonstrate that it improves the\nperformance by a considerable margin. This further advocates the key role of the local Hessian in\nef\ufb01cient learning of deep neural networks.\n\n1.1 Related Works\n\nThe penalty loss formulation is inspired by methods of auxiliary coordinates (MAC) [Carreira-\nPerpinan and Wang, 2014] and proximal backpropagation [Frerix et al., 2018]. Speci\ufb01cally, Carreira-\nPerpinan and Wang [2014] applies block coordinate descent to optimize the penalized objective.\nFrerix et al. [2018] applies proximal gradient when updating W . In contrast, we start from the\npenalty loss formulation and focus on the local Hessian for each subproblem in the back-matching\npropagation, and argue that the local Hessian is critical to the ef\ufb01ciency of BP.\nOur scale-amended SGD is related to the algorithms tackling the dif\ufb01culty of BP for training deep nets\nby incorporating second-order/metric information [Martens, 2010, Amari, 1998, Pascanu and Bengio,\n2014, Ollivier, 2015, Martens and Grosse, 2015] and the block-diagonal second order algorithms\n[Lafond et al., 2017, Zhang et al., 2017, Grosse and Martens, 2016]. These second-order algorithms\napproximate the Hessian/Fisher matrix and are computationally expensive. In contrast, the scale-\namended SGD only amends the vanilla SGD with a scalar for each block based on the approximation\nof local Hessian. The scale-amended SGD is closely related to the layer-wise adaptive learning rate\nstrategy [Singh et al., 2015, You et al., 2017]. 
However, these layer-wise learning rate strategies do not explain why the rate should be set in that particular way.\n\n2 BP as a solution of back-matching propagation\n\nIn this section, we first introduce the quadratic penalty formulation of the loss of a neural network and the back-matching propagation procedure, which minimizes a sequence of local back-matching losses following the backward order. Then we connect BP to the one-step gradient update solution of the back-matching propagation procedure. The quality of such a solution in minimizing the back-matching loss is determined by its local Hessian.\n\n1Here a block can be composed of one layer or multiple layers.\n\nProcedure 1 Back-matching Propagation\nInput: W_b^k for b = 1, ..., B and z_b^k, with z_0^k = X^k.\nfor b = B, ..., 1 do\n\n    W_b^{k+1} ← argmin_{W_b} ℓ_b(W_b, z_{b-1}^k),    (3)\n    z_{b-1}^{k+1/2} ← argmin_{z_{b-1}} ℓ_b(W_b^k, z_{b-1}),    (4)\n\nend for\nOutput: A new parameter W^{k+1}.\n\nSuppose the loss of training a neural network is given by2\n\nJ(W; X, y) = ℓ(y; F(W, X)),    (1)\n\nwhere ℓ(\u00b7) is the loss function with respect to the training targets y and the network output, F(\u00b7,\u00b7) is the network mapping, W is the trainable parameter of the network and X is the input data. Carreira-Perpinan and Wang [2014] introduce the intermediate outputs of the network as auxiliary variables and the architecture connection as a quadratic penalty, and propose to minimize the following quadratic penalty formulation of the loss,\n\nQ(W, z; γ) = ℓ(y, F_B(W_B, z_{B-1})) + (γ/2) Σ_{b=1}^{B-1} ‖z_b − F_b(W_b; z_{b-1})‖²,    (2)\n\nwhere F_b(\u00b7,\u00b7) is a block mapping, W_b, z_{b-1} are the trainable parameter and the input of network block b, respectively, for b = B, ..., 1, and z_0 is the input data X.\nIt has been argued in [Nocedal and Wright, 2006] that under mild conditions, the solutions of minimizing (2) converge to the solution of the original problem (1) as γ → ∞. 
Carreira-Perpinan and Wang [2014] minimize objective (2) via a z-step and a W-step, which is essentially a block coordinate descent algorithm.\nInspired by the form (2), we study the back-matching propagation procedure (Procedure 1), which minimizes a sequence of local back-matching losses following the backward order. The local back-matching loss for block b at step k is denoted by ℓ_b,\n\nℓ_b(W_b, z_{b-1}) = ℓ(y^k; F_B(W_B, z_{B-1}))  for b = B;  = (1/2)‖z_b^{k+1/2} − F_b(W_b, z_{b-1})‖²_2  for b = B−1, ..., 1,    (5)\n\nwhere z_b^{k+1/2} is computed by (4) repeatedly. We note that z^k is computed by a forward pass given a new X^k and is not updated by Procedure 1, while z_b^{k+1/2} is an intermediate variable that stores the desired change on the output z_b and is used to compute W_{b-1}^{k+1}. For each subproblem at b, we alternately optimize over W_b and z_{b-1} while fixing the other, as in the forward process, because jointly optimizing over W_b and z_{b-1} is non-convex even if F_b represents a matrix-vector product.\nA direct explanation of the back-matching loss is that, given the target signal z_b^{k+1/2} propagated from the upper block, which is believed to be the direction in which z_b^k should move to decrease the loss, the new weight W_b^{k+1} and the new target signal for the lower block z_{b-1}^{k+1/2} should minimize the matching loss ℓ_b.\nWe are not suggesting Procedure 1 as a new algorithm to train neural networks. Actually, Procedure 1 may not be stable in practice if each subproblem is solved fully [Lee et al., 2015, Wiseman et al., 2017], because the solutions of (3) and (4) may deviate too much from the last updated values and jump out of the trust region. Instead, we connect BP to the one-step gradient update solution of (3) and (4) and argue that the conditioning of the subproblems (3) and (4) affects the efficiency of BP, given this explanation of the back-matching loss.\nProposition 1. 
If (3) is solved by a one-step gradient update with step size \u00b5 and (4) is solved by a one-step gradient update with step size 1, then W^{k+1} produced by Procedure 1 is the same as a gradient update of the original problem (1) with step size \u00b5.\n\n2For simplicity we omit the bias term in the sequel.\n\nProof. The proof is relegated to Supplemental A due to the space limit.\n\nWe note that the form of the back-matching loss is mentioned in target propagation [Lee et al., 2015, Le Cun, 1986], which is motivated by the biological implausibility of BP, while we formulate it from minimizing a penalized objective. We also note that the connection between BP and the one-step gradient update of minimizing (2) in backward order is made in [Frerix et al., 2018] for the case where F_b(\u00b7) is either an activation function or a linear transformation.\nHere we view BP as a solution of back-matching propagation and study the local Hessian matrices of the back-matching losses (3) and (4),\n\nLocal Hessian: H_{vec(W)} = ∂²ℓ_b(W_b, z_{b-1}^k) / ∂vec(W)²,  H_z = ∂²ℓ_b(W_b^k, z_{b-1}) / ∂z_{b-1}².    (6)\n\nThe Hessian of training deep neural networks has been studied in previous works Dauphin et al. [2014], Orhan and Pitkow [2018], Li et al. [2016], Sagun et al. [2017], Jastrzebski et al. [2018]. They all analyze and calculate the Hessian of the objective with respect to the whole network parameter. In
In\ncontrast, we study the Hessian of the local back-matching loss and connect it to the ef\ufb01ciency of BP.\nLoosely speaking, if the local Hessian of (5) with respect to W is good-conditioned, the solution\nof (3) minimizes the local back-matching loss suf\ufb01ciently, which implies that the target signal is\nef\ufb01ciently approximated by updating parameters of current block, and if the local Hessian of (5)\nwith respect to z is good-conditioned, the solution of (4) minimizes the local back-matching loss\nsuf\ufb01ciently, which implies that the target signal is ef\ufb01ciently back-propagated. Next, we show how\nskip connection and batch normalization improve the spectrum of the local Hessian.\n\n3 Explain the ef\ufb01ciency of BP via local Hessian\n\nBecause the condition of local Hessian determines how ef\ufb01ciently the back-matching loss is mini-\nmized by updating the parameters of current block and how accurately the error signal propagates\nback to the lower layer, we evaluate how good a block is via analyzing its local Hessian. We \ufb01rst\nanalyze the local Hessian of a fully connected layer3 and then show that the skip connection and\nbatch normalization improve the spectrum of local Hessian and hence facilitate the ef\ufb01ciency of BP.\n\n3.1 Block of a fully connected layer\nWe consider a block b composed of a fully connected layer with nb outputs and nb1 inputs. The\nmapping function is given by\n\nzb = Fb(W b, zb1) = W b \u00b7 zb1,\n\n(7)\n\nwhere W b is an nb \u21e5 nb1 matrix.\nSuppose that after the gradient step on zb from upper layer, we get an intermediate variable zk+1/2\nThe Hessian of back matching loss (5) with respect to zb1 and wb are\n\nb\n\n.\n\nH z = (W k\n\nb )T W k\nb ,\n\nH w =\n\nmXj=1\n\nzk\nb1[j](zk\n\nb1[j])T ,\n\n(8)\n\n(9)\n\nrespectively, where m is the batch size, [j] represents the j-th sample and wb is a vector of a row of\nW b. 
Then H_{vec(W_b)} is a block diagonal matrix with each block being H_w, where vec(W_b) is a long vector stacking the rows of W_b first.\nFor (8), the local Hessian with respect to z_{b-1}, we derive the distribution of its eigenvalues. For the convenience of analysis, we assume the elements of w_b are independently generated from a Gaussian distribution with mean 0 and variance σ², and n_b, n_{b-1} → ∞ with the ratio n_b/n_{b-1} → c ∈ (0, +∞).\n\n3The formula for the convolution layer is given in Supplemental C.\n\nThen by the Marchenko-Pastur law [Mar\u010denko and Pastur, 1967], we have the density of the eigenvalue λ of (8) as follows,\n\nν(A) = (1 − c) · 1_{0∈A} + ν₂(A)  if 0 < c ≤ 1;  ν₂(A)  if c > 1,    (10)\n\nwhere\n\ndν₂(λ) = [c √((λ_{c+} − λ)(λ − λ_{c−})) / (2π λ σ²)] · 1_{[λ_{c−}, λ_{c+}]} dλ,    (11)\n\nwith λ_{c+} = σ²(1 + √c)²/c, λ_{c−} = σ²(1 − √c)²/c.\nThis result affirms that the orthonormal initialization [Mishkin and Matas, 2016, Saxe et al., 2014] facilitates backward propagation. If W_b is an orthonormal matrix, the eigenvalues of H_z are composed of n_b 1's and n_{b-1} − n_b 0's if n_b < n_{b-1}. This is the best spectrum of the Hessian we can expect for minimizing the back-matching loss (5).\nHowever, in general, W_b is not orthonormal and hence H_z is not the identity. The gradient update on z_{b-1} does not minimize the back-matching loss well. As back-propagation reaches lower blocks, the update z_{b-t}^{k+1/2} − z_{b-t}^k drifts away from the direction that minimizes the back-matching loss ℓ_b for t = 1, 2, ..., b. This discrepancy becomes larger as the conditioning of H_z of each block worsens and as the back-propagation goes deeper.\nFor the local Hessian with respect to W, it is hard to control in general. Several recent works [Frerix et al., 2018, Ye et al., 2017] suggest using forms involving H_W to precondition vanilla SGD. We note that Le Cun et al. 
[1991] have also studied the spectrum of H_w, which gives a theoretical justification for the choice of centered inputs over biased state variables.\nWe next study the local Hessian of blocks with skip connection and batch normalization and show that these designing tricks can improve the spectrum of H_z and H_W to some extent and hence make training deep neural networks easier.\n\n3.2 Block with skip connection\nSkip connection has been empirically demonstrated to be important for obtaining state-of-the-art results [He et al., 2016a,b, Huang et al., 2017a], while its functionality has various interpretations. Veit et al. [2016] argue that a residual network can be seen as an ensemble of shallow nets and avoids the vanishing gradient problem by introducing short paths. Jastrzebski et al. [2018] suggest that a residual block performs iterative refinement of features for higher layers while lower layers concentrate on representation learning. These works focus on the interpretation of how ResNet works. We here try to give an answer to why ResNet works from the optimization perspective. A recent work [Orhan and Pitkow, 2018] argues that skip connection eliminates singular points of the Hessian matrix; there are open questions in [Orhan and Pitkow, 2018] for which we can give answers by analyzing the local Hessian of the residual block.\nSuppose that the mapping of the residual block is given by\n\nz_b = F_b(W_b, z_{b-1}) = z_{b-1} + φ_b(W_b, z_{b-1}),    (12)\n\nwhere F_b(\u00b7) is the residual block mapping and φ_b(\u00b7) is the residual branch with parameters W_b and input z_{b-1}. 
The Hessians of the back-matching loss (5) with respect to z_{b-1} and W_b are given by\n\nH_z = (I + ∂φ_b/∂z_{b-1})ᵀ (I + ∂φ_b/∂z_{b-1}) − ∂²φ_b/∂z_{b-1}² · (z_b^{k+1/2} − z_b^k),    (13)\n\nH_W = (∂φ_b/∂vec(W_b))ᵀ (∂φ_b/∂vec(W_b)) − ∂²φ_b/∂vec(W_b)² · (z_b^{k+1/2} − z_b^k).    (14)\n\nWe can see that (14), the Hessian of the local matching loss for the residual block with respect to W_b, is the same as in the case without skip connection. Thus we focus on (13), the local Hessian with respect to z. Specifically, we analyze the first part of (13), the Gauss-Newton matrix, which is a good positive semidefinite approximation to the Hessian [Martens, 2016, Chen, 2011]. Define the condition number of a matrix M as C(M) := σ_max(M)/σ_min(M), where σ_max and σ_min are the largest and smallest non-zero singular values, respectively. The larger the condition number, the worse conditioned the problem.\n\nRemark 1. If a) ∂φ_b/∂z_{b-1} is relatively \u201csmall\u201d, i.e., σ_max(∂φ_b/∂z_{b-1}) < 1 − s for some constant s > 0, and b) C(∂φ_b/∂z_{b-1}) > (1+s)/(1−s), then\n\nC(I + ∂φ_b/∂z_{b-1}) < C(∂φ_b/∂z_{b-1}).    (15)\n\nThis indicates that the condition number of the Gauss-Newton matrix with skip connection is guaranteed to be smaller than that without skip connection, given the two assumptions. Assumption b) is generally satisfied for neural networks, from the spectrum distribution analysis of the fully-connected layer in Section 3.1, while assumption a) seems a bit strong. We cannot verify assumption a) analytically because φ_b(\u00b7) typically involves more than two linear layers, nonlinear activations and batch normalization layers. 
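Remark 1 is easy to check numerically. The sketch below (our own illustration in numpy; the matrix J stands in for ∂φ_b/∂z_{b-1}, and its singular values are chosen by hand so that both assumptions hold) verifies (15):

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 20, 0.4

def cond(M):
    # C(M) = sigma_max / sigma_min over the nonzero singular values.
    sv = np.linalg.svd(M, compute_uv=False)
    sv = sv[sv > 1e-12]
    return sv.max() / sv.min()

# A Jacobian J standing in for d(phi_b)/d(z_{b-1}) with prescribed singular
# values in [0.01, 0.5], so sigma_max(J) = 0.5 < 1 - s       (assumption a)
# and C(J) = 50 > (1 + s)/(1 - s) ~ 2.33                     (assumption b).
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
J = U @ np.diag(np.linspace(0.5, 0.01, n)) @ V.T

# Remark 1: adding the identity (the skip connection) improves conditioning,
# since the singular values of I + J lie in [1 - 0.5, 1 + 0.5].
assert cond(np.eye(n) + J) < cond(J)
print(cond(J), cond(np.eye(n) + J))
```

The bound used in the comment, σ(I + J) ∈ [1 − σ_max(J), 1 + σ_max(J)], guarantees C(I + J) ≤ (1 + 0.5)/(1 − 0.5) = 3, far below C(J) = 50.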
We leave the empirical study of the spectrum distribution of the local Hessian of the residual block for future work.\nInterestingly, Orhan and Pitkow [2018] demonstrate that a network with an orthogonal skip connection achieves performance as good as one with an identity skip connection, which can be easily explained by the fact that an orthogonal skip connection does not change the condition number of the local Gauss-Newton matrix (the first part of (13)). Furthermore, Orhan and Pitkow [2018] also empirically show that a network with non-orthogonal skip connections always underperforms a network with orthogonal (identity being a special case) skip connections, though neither network has vanishing gradients when back-propagating through layers. This can be easily argued from formula (13), as a non-orthogonal skip connection has a larger condition number than an orthogonal skip connection, whose singular values are all 1's.\n\n3.3 Block with batch normalization\nBatch normalization (BN) is widely used for accelerating the training of feed-forward neural networks. In this section, we consider adding a BN layer after a fully-connected layer. We fix the affine transformation of BN to be identity for simplicity. If z_b^k represents one component of the vector z_b^k and w_b the corresponding row of W_b, then the BN layer mapping is given by\n\nz_b^k = BN(z̃_b^k) = (z̃_b^k − E[z̃_b^k]) / √(Var[z̃_b^k]), where z̃_b^k = (w_b)ᵀ z_{b-1}.    (16)\n\nBP through a fully connected layer with BN is given in [Ioffe and Szegedy, 2015] and we provide the form for the back-matching loss in Supplemental B for completeness. The gradient formula is quite complicated, as the E and Var involve batch information. 
To proceed with the analysis, we ignore the terms involving 1/m, which does not lose much as the batch size becomes large.\nNow we compute the local Hessian of the fully connected layer with BN as follows4\n\nH_z ≈ Σ_{i=1}^{n_b} w_b^k(i) w_b^k(i)ᵀ / Var[z̃_b^k(i)] = Σ_{i=1}^{n_b} w_b^k(i) w_b^k(i)ᵀ / Var[w_b^k(i)ᵀ z_{b-1}^k],    (17)\n\nH_w ≈ Σ_{j=1}^m z_{b-1}^k[j] (z_{b-1}^k[j])ᵀ / Var[z̃_b^k(i)] = Σ_{j=1}^m z_{b-1}^k[j] (z_{b-1}^k[j])ᵀ / Var[w_b^k(i)ᵀ z_{b-1}^k],    (18)\n\nwhere n_b is the number of outputs of layer b, w_b^k(i) is the vector of the i-th row of W_b^k, and z_{b-1}^k[j] represents the input of block b for sample j. We next show how BN facilitates BP for training deep networks.\nWe first derive the distribution of the eigenvalues of (17) and compare it to (10) (the case without BN). Our assumption on w_b is the same as the one used to derive (11). In contrast to H_z being the sum of outer products of Gaussian vectors in Section 3.1, here H_z is the sum of outer products of the w_b/‖w_b‖'s, which are unit vectors equally distributed on the sphere. The density of the eigenvalue λ of (17) is of the form (10) with [Mar\u010denko and Pastur, 1967]\n\ndν₂(λ) = [√((λ_{c+} − λ)(λ − λ_{c−})) / (2π λ)] · 1_{[λ_{c−}, λ_{c+}]} dλ,    (19)\n\nwhere λ_{c+} = (1 + √c)², λ_{c−} = (1 − √c)².\n\n4We ignore the terms involving 1/m again.\n\nRemark 2. Scaling the variance of the block parameter does not affect the spectrum of H_z in (17). This is in contrast to (8), where the spectrum scales linearly with the variance of the weight parameters. Thus BP benefits because it acts as a one-step gradient update with fixed step size 1 for all blocks.\nAnother benefit of BN is to improve the conditioning of H_w if z_{b-1} is the output of a BN layer.\nRemark 3. 
If z_{b-1} is the output of BN and w_b is independent of z_{b-1}, then E[diag(H_w)] = I/‖w_b‖².\nThis indicates that problem (3) is well-conditioned and hence a large step size is allowed [Ioffe and Szegedy, 2015].\n\n4 Utilize local Hessian: An example\n\nAs the previous section has shown the importance of the local Hessian, this section discusses how to utilize the local Hessian to improve performance on current deep learning tasks. One direct way of using the local Hessian is to design better architectures. The spectrum of the local Hessian can be a criterion to determine whether a building block is good for BP or not. One potential usage of the local Hessian could be in neural architecture search [Zoph and Le, 2016]: as most of the time in neural architecture search is spent training a huge number of small networks, the search would be greatly accelerated by using the local Hessian to prune the search space.\nAnother direction is to utilize the local Hessian to design new algorithms to improve the training of existing neural networks. Several works can be understood as examples, e.g., proximal propagation [Frerix et al., 2018] and Riemannian approaches [Cho and Lee, 2017, Huang et al., 2017b].\nIn this section, we propose a way to employ the information of the local Hessian to facilitate the training of deep neural networks. Ideally, good alternatives to minimize the back-matching loss ℓ_b are H_W^{-1}∇w_b and H_z^{-1}∇z_{b-1}, where ∇w_b and ∇z_{b-1} are the gradients computed via the BP rule given z^{k+1/2} − z^k. However, H_W and H_z are often indefinite and expensive to compute. We suggest using two scalars m_{b,W} and m_{b,z} to evaluate how H_W and H_z scale the norm of a vector in general position, respectively. 
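The idea of summarizing a matrix by a single scalar can be illustrated by random probing (our own sketch in numpy; the matrix, its size, and the probe count are arbitrary choices): for a random unit vector v, E‖Hv‖² = tr(H²)/n, so a few probes give a stable estimate of how H scales vectors in general position.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# A positive semidefinite stand-in for a local Hessian.
A = rng.normal(size=(n, n)) / np.sqrt(n)
H = A.T @ A

# For a random unit vector v, E ||H v||^2 = tr(H^2) / n, so averaging
# ||H v|| over a handful of probes yields a scalar summary of H's scaling.
probes = rng.normal(size=(16, n))
probes /= np.linalg.norm(probes, axis=1, keepdims=True)
m_est = float(np.mean([np.linalg.norm(H @ v) for v in probes]))
m_ref = float(np.sqrt(np.trace(H @ H) / n))
print(m_est, m_ref)  # the two values agree closely for moderate n
```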
Then the back-matching loss can be approximated as\n\nℓ_b(W_b, z_{b-1}^k) ≈ ℓ_b(W_b^k, z_{b-1}^k) + ⟨∂ℓ_b/∂W_b^k, W_b − W_b^k⟩ + (1/2)(W_b − W_b^k)ᵀ H_W (W_b − W_b^k)\n ≈ ℓ_b(W_b^k, z_{b-1}^k) + ⟨∂ℓ_b/∂W_b^k, W_b − W_b^k⟩ + (1/2) m_{b,W} ‖W_b − W_b^k‖²_2,    (20)\n\nℓ_b(W_b^k, z_{b-1}) ≈ ℓ_b(W_b^k, z_{b-1}^k) + ⟨∂ℓ_b/∂z_{b-1}^k, z_{b-1} − z_{b-1}^k⟩ + (1/2) m_{b,z} ‖z_{b-1} − z_{b-1}^k‖²_2,    (21)\n\nwhere the approximation is composed of a second-order Taylor expansion and a scalar approximation of the scaling effect of the local Hessian, and W_b may represent vec(W_b) depending on the context.\nWe next propose an algorithm, scale-amended SGD, to take the effect of m_{b,W} and m_{b,z} into account to balance the training pace of each block. Scale-amended SGD uses m_{b,W} and m_{b,z} to amend the scale of the vanilla BP gradient of each block. We set the initial backward factor of the output layer to m = 1, which indicates that the derivative of the loss with respect to the output of the network is regarded as the desired change on the output to minimize the loss.\nThen, following the backward order, if a block has parameter W_b and gradient ∇W_b computed by BP, we use ∇'W_b := ∇W_b/m/m_{b,W} as the scale-amended gradient to update W_b, where m is the backward factor on the output of the block and m_{b,W} is the scalar used to approximate H_{b,W}. Then we update the backward factor m for the next block via m ← m \u00b7 m_{b,z}, where m_{b,z} is the scalar used to approximate H_{b,z}. This strategy is described in Algorithm 2.\n\n4.1 Scale-amended SGD for feed-forward networks with BN\nNote that for feed-forward networks with BN layers, we can obtain a reliable estimation of m_{b,W} and m_{b,z}. 
Specifically, we assume that W_b is row homogeneous [Ba et al., 2016], i.e., the rows represent the same level of information and are roughly of similar magnitude, and define\n\n‖W_b‖²_{2,\u00b5} := (1/#row(W_b)) Σ_{i=1}^{#row(W_b)} w_b(i)ᵀ w_b(i),\n\nwhere w_b(i) is the i-th row of W_b. Under this assumption, the scalars to approximate the local Hessians (17) and (18) of the fully connected layer with BN are computed as follows,\n\nm_{b,z} := ‖W_bᵀ‖²_{2,\u00b5} / ‖W_b‖²_{2,\u00b5},  m_{b,W} := 1/‖W_b‖²_{2,\u00b5}.    (24)\n\nAlgorithm 2 Scale-amended SGD\nInput: Gradients ∇W_b and scaling factors m_{b,W}, m_{b,z} for b = 1, ..., B; initialize m = 1.\nfor b = B, ..., 1 do\n\n    ∇'W_b ← ∇W_b/m/m_{b,W}    (22)\n    m ← m \u00b7 m_{b,z}    (23)\n\nend for\n\nWe next evaluate the scale-amended SGD on training VGG nets [Simonyan and Zisserman, 2015] for image classification tasks on two datasets: CIFAR-10 [Krizhevsky and Hinton, 2009] and CIFAR-100 [Krizhevsky and Hinton, 2009]. We modify the VGG nets by keeping the last fully connected layer and removing the intermediate two fully connected layers and all the biases. Each intermediate layer of the VGG nets is followed by a BN layer right before the activation function, and the BN has no trainable parameters.\nDuring training, the images of the CIFAR-10 and CIFAR-100 datasets are randomly flipped and rotated for data augmentation. The hyper-parameters for vanilla SGD and our scale-amended SGD are the same, including learning rate \u03b7 = 0.1 (because the backward factor for the linear layer on CIFAR-10 is around 10/512, a small learning rate \u03b7 = 0.005 works better for CIFAR-10 with scale-amended SGD), momentum 0.9 and weight decay5 coefficient 0.005. 
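Algorithm 2 with the scalars of (24) can be sketched as follows (our own minimal numpy reimplementation; the two-block toy example, names, and shapes are illustrative, not from the paper):

```python
import numpy as np

def row_mean_sq_norm(W):
    # ||W||^2_{2,mu}: mean squared row norm of W (row-homogeneity assumption).
    return float(np.mean(np.sum(W * W, axis=1)))

def scale_amended_grads(weights, grads):
    """Algorithm 2: amend BP gradients block by block in backward order.

    weights, grads: lists of arrays, index 0 = lowest block, -1 = output block.
    Returns grad'_b = grad_b / m / m_{b,W}  (eq. 22), updating the backward
    factor via m <- m * m_{b,z}  (eq. 23) as we move toward the input.
    """
    m = 1.0  # backward factor, initialized at the network output
    amended = [None] * len(weights)
    for b in reversed(range(len(weights))):
        W, g = weights[b], grads[b]
        # Scalars from eq. (24) for a fully connected layer with BN:
        m_bW = 1.0 / row_mean_sq_norm(W)
        m_bz = row_mean_sq_norm(W.T) / row_mean_sq_norm(W)  # ~ n_out / n_in
        amended[b] = g / m / m_bW
        m *= m_bz
    return amended

# Tiny example: two linear blocks, 8 -> 4 -> 2.
rng = np.random.default_rng(3)
ws = [rng.normal(size=(4, 8)), rng.normal(size=(2, 4))]
gs = [np.ones((4, 8)), np.ones((2, 4))]
out = scale_amended_grads(ws, gs)
print([a.shape for a in out])
```

The amended gradients would then be fed to the usual SGD-with-momentum update in place of the raw BP gradients (with weight decay applied before the amendment, per footnote 5).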
We reduce the learning rate by half once the validation accuracy is on a plateau (ReduceLROnPlateau in PyTorch with patience=10), which works well for both vanilla SGD and scale-amended SGD.\nWe compare the learning curves of scale-amended SGD and vanilla SGD on training VGG13 for the CIFAR-10 and CIFAR-100 classification tasks. The two algorithms start from the same initialization and pass over the same batches of data. Both algorithms are run for 300 epochs. We plot the learning curves in Figure 1.\n\nFigure 1: Comparison of vanilla SGD and scale-amended SGD on training VGG13 for CIFAR-10 and CIFAR-100 classification. Hyperparameters are the same: learning rate 0.1 (except for CIFAR-10 scale-amended SGD uses 0.005), momentum 0.9, weight decay 0.005.\n\nFrom Figure 1, we can see that the learning curves of our algorithm and SGD have a similar trend (we plot curves of multiple runs and their average in Supplemental D). This is because scale-amended SGD only modifies the magnitude of each block gradient as a whole, does not involve any further (second-order) information, and the hyper-parameters are the same for both algorithms. Scrutinizing more closely, we can see that our training loss curve is almost always lower than SGD's and our test error ends at a considerably lower number. Thus the scale-amended SGD\n\n5For scale-amended SGD, we first apply the weight decay and then amend the scale.\n\n
Moreover, we propose the scale-amended SGD algorithm, which employs the information of the local Hessian via a scalar approximation. Scale-amended SGD achieves favorable results over vanilla SGD empirically for training feed-forward networks with BN, which corroborates the importance of the local Hessian.

Acknowledgments

The authors would like to thank Prof. Yuejie Chi for helpful discussion.

References

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265-283, 2016.

S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.

A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 2. SIAM, 2001.

M. Carreira-Perpinan and W. Wang. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics (AISTATS), pages 10-19, 2014.

P. Chen. Hessian matrix vs. Gauss-Newton matrix. SIAM Journal on Numerical Analysis, 49(4):1417-1435, 2011.

M. Cho and J. Lee. Riemannian approach to batch normalization. In Advances in Neural Information Processing Systems (NIPS), pages 5231-5241, 2017.

Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems (NIPS), pages 2933-2941, 2014.

T. Frerix, T. Möllenhoff, M. Moeller, and D. Cremers. Proximal backpropagation.
In International Conference on Learning Representations (ICLR), 2018.

R. Grosse and J. Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning (ICML), 2016.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016a.

K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), pages 630-645. Springer, 2016b.

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.

S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen [Studies on dynamic neural networks]. Diploma thesis, Technische Universität München, 1991.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017a.

L. Huang, X. Liu, B. Lang, and B. Li. Projection based weight normalization for deep neural networks. arXiv preprint arXiv:1710.02338, 2017b.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448-456, 2015.

S. Jastrzebski, D. Arpit, N. Ballas, V. Verma, T. Che, and Y. Bengio. Residual connections encourage iterative inference.
In International Conference on Learning Representations (ICLR), 2018.

S. Jastrzębski, Z. Kenton, N. Ballas, A. Fischer, A. Storkey, and Y. Bengio. SGD smooths the sharpest directions. 2018.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

J. Lafond, N. Vasilache, and L. Bottou. Diagonal rescaling for neural networks. arXiv preprint arXiv:1705.09319, 2017.

Y. Le Cun. Learning process in an asymmetric threshold network. In Disordered Systems and Biological Organization, pages 233-240. Springer, 1986.

Y. Le Cun, I. Kanter, and S. A. Solla. Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters, 66(18):2396, 1991.

D.-H. Lee, S. Zhang, A. Fischer, and Y. Bengio. Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 498-515. Springer, 2015.

S. Li, J. Jiao, Y. Han, and T. Weissman. Demystifying ResNet. arXiv preprint arXiv:1611.01186, 2016.

V. A. Marčenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.

J. Martens. Deep learning via Hessian-free optimization. In International Conference on Machine Learning (ICML), pages 735-742, 2010.

J. Martens. Second-order optimization for neural networks. PhD thesis, University of Toronto, 2016.

J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (ICML), pages 2408-2417, 2015.

D. Mishkin and J. Matas. All you need is a good init. In International Conference on Learning Representations (ICLR), 2016.

J. Nocedal and S. J. Wright. Sequential quadratic programming. Springer, 2006.

Y. Ollivier. Riemannian metrics for neural networks I: feedforward networks.
Information and Inference: A Journal of the IMA, 4(2):108-153, 2015.

A. E. Orhan and X. Pitkow. Skip connections eliminate singularities. In International Conference on Learning Representations (ICLR), 2018.

R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. In International Conference on Learning Representations (ICLR), 2014.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.

A. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations (ICLR), 2014.

F. Seide and A. Agarwal. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2135-2135. ACM, 2016.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

B. Singh, S. De, Y. Zhang, T. Goldstein, and G. Taylor. Layer-specific adaptive learning rates for deep networks. In IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pages 364-368, Dec 2015.

A. Veit, M. J. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems (NIPS), pages 550-558, 2016.

S. Wiseman, S. Chopra, M. Ranzato, A. Szlam, R. Sun, S. Chintala, and N. Vasilache.
Training language models using target-propagation. arXiv preprint arXiv:1702.04770, 2017.

C. Ye, Y. Yang, C. Fermuller, and Y. Aloimonos. On the importance of consistency in training deep neural networks. arXiv preprint arXiv:1708.00631, 2017.

Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888v3, 2017.

H. Zhang, C. Xiong, J. Bradbury, and R. Socher. Block-diagonal Hessian-free optimization for training neural networks. arXiv preprint arXiv:1712.07296, 2017.

B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning, 2016.