In this paper, we propose deep networks with stochastic depth, a novel training algorithm that is based on the seemingly contradictory insight that ideally we would like to have a deep network during testing but a short network during training. We resolve this conflict by creating deep Residual Network [8] architectures (with hundreds or even thousands of layers) with sufficient modeling capacity; however, during training we shorten the network significantly by randomly removing a substantial fraction of layers independently for each sample or mini-batch. The effect is a network with a small expected depth during training, but a large depth during testing. Although seemingly simple, this approach is surprisingly effective in practice.
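To make the per-mini-batch layer dropping concrete, the sketch below shows one possible realization of a residual block with stochastic depth in PyTorch; it is a minimal illustration, not the authors' implementation. The class name `StochasticDepthBlock`, the constant `survival_prob`, and the two-convolution branch are assumptions for the example (the paper additionally varies the survival probability across layers).

```python
# Minimal sketch (not the authors' code): a residual block that is randomly
# skipped during training and always used, with scaling, at test time.
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block with stochastic depth.

    During training, the residual branch is dropped with probability
    (1 - survival_prob), reducing the block to the identity shortcut.
    At test time the full branch is kept and scaled by survival_prob.
    """
    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        if self.training:
            # One coin flip per mini-batch: keep the whole branch or skip it,
            # so the effective network seen during training is shorter on average.
            if torch.rand(1).item() < self.survival_prob:
                return self.relu(x + self.branch(x))
            return x  # block removed: identity shortcut only
        # Test time: the full (deep) network is used; the branch output is
        # scaled by its survival probability to match the training expectation.
        return self.relu(x + self.survival_prob * self.branch(x))
```

Stacking many such blocks gives a network whose expected depth during training is the sum of the blocks' survival probabilities, while testing always uses every block.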