We introduce and analyze a new technique for model reduction in deep neural networks. Our algorithm prunes (sparsifies) a trained network layer-wise, removing connections at each layer by solving a convex optimization problem. We present both parallel and cascade versions of the algorithm, along with a mathematical analysis of the consistency between the initial network and the retrained model. In terms of sample complexity, we present a general result that holds for any layer within a network using rectified linear units as the activation. If a layer taking inputs of size $N$ can be described using at most $s$ non-zero weights per node, then under some mild assumptions on the input covariance matrix, we show that these weights can be learned from $O(s\log(N/s))$ samples.
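As an illustration of the layer-wise pruning step, the sketch below re-fits each node of a trained dense layer with an $\ell_1$-regularized least-squares (Lasso) problem on the layer's input data, which encourages sparse incoming weights while keeping the pre-activation responses close to those of the original layer. This is only a minimal, hedged sketch under that assumed formulation (the function name, the regularization weight alpha, and the Lasso surrogate are illustrative choices, not the paper's exact convex program).

\begin{verbatim}
# Sketch: prune one dense layer by re-fitting each node's pre-activation
# with an l1-regularized least-squares (Lasso) problem per node.
import numpy as np
from sklearn.linear_model import Lasso

def prune_layer(X, W, b, alpha=1e-3):
    """X: (n_samples, N) layer inputs; W: (N, M) trained weights; b: (M,) biases.
    Returns sparse weights W_s and biases b_s with X @ W_s + b_s ~ X @ W + b."""
    Z = X @ W + b                      # target pre-activations of the trained layer
    W_s = np.zeros_like(W)
    b_s = np.zeros_like(b)
    for j in range(W.shape[1]):        # one independent convex problem per node
        lasso = Lasso(alpha=alpha, fit_intercept=True, max_iter=10000)
        lasso.fit(X, Z[:, j])
        W_s[:, j] = lasso.coef_
        b_s[j] = lasso.intercept_
    return W_s, b_s
\end{verbatim}

In this reading, a parallel scheme would feed every layer its original (trained-network) inputs so all layers can be pruned independently, while a cascade scheme would feed each layer the outputs produced by the already-pruned preceding layers.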