1 Introduction
Deep generative models are powerful approaches to modeling highly complex, high-dimensional data. Much recent research has been geared towards advancing deep generative modeling strategies, including Variational Autoencoders (VAEs) (Kingma & Welling, 2013), autoregressive models (Oord et al., 2016a;b) and hybrid models (Gulrajani et al., 2016; Nguyen et al., 2016). However, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have emerged as the learning paradigm of choice across a varied range of tasks, especially in computer vision (Zhu et al., 2017), simulation and robotics (Finn et al., 2016; Shrivastava et al., 2016). GANs cast the learning of a generative network in the form of a game between the generative and discriminator networks. While the discriminator is trained to distinguish between true and generated examples, the generative model is trained to fool the discriminator. Using a discriminator network in GANs avoids the need for an explicit reconstruction-based loss function, allowing this model class to generate visually sharper images than VAEs while enjoying faster sampling than autoregressive models.
Recent work, known as either ALI (Dumoulin et al., 2016) or BiGAN (Donahue et al., 2016), has shown that the adversarial learning paradigm can be extended to incorporate the learning of an inference network. While the inference network, or encoder, maps training examples x to a latent variable z, the decoder plays the role of the standard GAN generator, mapping from the space of latent variables (typically sampled from some factorial distribution) into the data space. In ALI, the discriminator is trained to distinguish between joint samples from the encoder and the decoder, while the encoder and decoder are trained to conspire together to fool the discriminator. Unlike approaches that hybridize VAE-style inference with GAN-style generative learning (e.g. Larsen et al. (2015); Chen et al. (2016)), the encoder and decoder in ALI are trained in a purely adversarial fashion. One big advantage of this adversarial-only formalism is demonstrated by the high quality of the generated samples; additionally, it provides a mechanism to infer the latent code associated with a true data example.
One interesting feature highlighted in the original ALI work (Dumoulin et al., 2016) is that even though the encoder and decoder are never explicitly trained to perform reconstruction, reconstruction can nevertheless easily be done by projecting data samples into the latent space via the encoder, copying the resulting values across to the latent variable layer of the decoder, and projecting them back to the data space. Doing so yields reconstructions that often preserve some semantic features of the original input but are perceptually quite different from the original samples. These observations naturally raise the question of the source of the discrepancy between data samples and their ALI reconstructions: is it due to a failure of the adversarial training paradigm, or to the more standard challenge of compressing the information from the data into a rather restrictive latent feature vector?
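The encode-copy-decode reconstruction procedure just described can be sketched as follows; `encode` and `decode` are hypothetical stand-ins for trained ALI networks, replaced here by a toy linear map and its pseudo-inverse so the information loss of a narrow latent bottleneck is visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a trained ALI encoder/decoder pair (hypothetical):
# a 16-dim data space compressed into a 4-dim latent space.
W = rng.normal(size=(4, 16))

def encode(x):
    """Project a data sample into the latent space."""
    return W @ x

def decode(z):
    """Map a latent code back to the data space (pseudo-inverse of W)."""
    return np.linalg.pinv(W) @ z

x = rng.normal(size=16)
z = encode(x)            # infer the latent code for a data sample
x_rec = decode(z)        # copy z into the decoder and project back

# With a 4-dim bottleneck the reconstruction cannot retain all the
# information in x, mirroring the discrepancy discussed above.
err = np.linalg.norm(x - x_rec)
assert err > 0.0
```

Re-encoding the reconstruction recovers the same latent code (the bottleneck, not the decoding step, is where information is lost in this toy setup).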
Ulyanov et al. (2017) show that reconstructions improve when additional terms that explicitly minimize reconstruction error in the data space are added to the training objective. Li et al. (2017b) mitigate the non-identifiability issues of bidirectional adversarial training by augmenting the generator's loss with an adversarial cycle-consistency loss.

In this paper we explore issues surrounding the representation of complex, richly-structured data, such as natural images, in the context of a novel hierarchical generative model, Hierarchical Adversarially Learned Inference (HALI), which is a hierarchical extension of ALI. We show that, within a purely adversarial training paradigm and by exploiting the model's hierarchical structure, we can modulate the perceptual fidelity of the reconstructions. We provide theoretical arguments for why HALI's adversarial game should be sufficient to minimize the reconstruction cost and show empirical evidence supporting this perspective. Finally, we evaluate the usefulness of the learned representations on a semi-supervised task on MNIST and an attribute prediction task on the CelebA dataset.
2 Related work
Our work fits into the general trend of hybrid approaches to generative modeling that combine aspects of VAEs and GANs. For example, Adversarial Autoencoders (Makhzani et al., 2015) replace the Kullback-Leibler divergence in the VAE training objective with an adversarial discriminator that learns to distinguish between samples from the approximate posterior and the prior. A second line of research replaces the reconstruction penalty in the VAE objective with GANs or other kinds of auxiliary losses. Examples include Larsen et al. (2015), who combine the GAN generator and the VAE decoder into one network, and Lamb et al. (2016), who use the loss of a pretrained classifier as an additional reconstruction loss in the VAE objective. Another research direction augments GANs with inference machinery. One such approach is given by
Dumoulin et al. (2016); Donahue et al. (2016), where, as in our approach, a separate inference network is jointly trained with the usual GAN discriminator and generator. Karaletsos (2016) presents a theoretical framework to jointly train inference networks and generators defined on directed acyclic graphs by leveraging multiple discriminators defined over nodes and their parents. Another related work is that of Huang et al. (2016b), which takes advantage of the representational information coming from a pretrained discriminator. Their model decomposes the data-generating task into multiple subtasks, where each level outputs an intermediate representation conditioned on the representation from the level above, and a stack of discriminators provides signals for these intermediate representations. The idea of stacking discriminators can be traced back to Denton et al. (2015), who used a succession of convolutional networks within a Laplacian pyramid framework to progressively increase the resolution of the generated images.

3 Hierarchical Adversarially Learned Inference
The goal of generative modeling is to capture the data-generating process with a probabilistic model. Most real-world data is highly complex, so exact modeling of the underlying probability density function is usually computationally intractable. Motivated by this fact, GANs (Goodfellow et al., 2014) model the data-generating distribution as a transformation of some fixed distribution over latent variables. In particular, the adversarial loss, through a discriminator network, forces the generator network to produce samples close to those of the data-generating distribution. While GANs are flexible and provide good approximations to the true data-generating mechanism, their original formulation does not permit inference on the latent variables. To mitigate this, Adversarially Learned Inference (ALI) (Dumoulin et al., 2016) extends the GAN framework with an inference network that encodes the data into the latent space. The discriminator is then trained to discriminate between the joint distributions over data and latent causes induced by the generator and by the inference network. The ALI objective thus encourages a matching of the two joint distributions, which also results in all marginals and conditionals being matched. This enables inference on the latent variables.
We endeavor to improve on ALI in two respects. First, as reconstructions from ALI only loosely match the input on a perceptual level, we want to achieve better perceptual matching in the reconstructions. Second, we wish to compress the observables x using a sequence of composed feature maps, leading to a distilled hierarchy of stochastic latent representations, denoted z_1 to z_L. Note that, as a consequence of the data processing inequality (Cover & Thomas, 2012), latent representations higher up in the hierarchy cannot contain more information than those situated lower in the hierarchy. In information-theoretic terms, the conditional entropy of the observables given a latent variable is non-decreasing as we ascend the hierarchy. This loss of information can be seen as responsible for the perceptual discrepancy observed in ALI's reconstructions. The question we seek to answer thus becomes: how can we achieve high perceptual fidelity in the data reconstructions while also having a compressed latent space that is strongly coupled with the observables? In this paper, we propose to answer this with a novel model, Hierarchical Adversarially Learned Inference (HALI), which uses a simple hierarchical Markovian inference network that is matched through adversarial training to a similarly constructed generator network. Furthermore, we discuss the hierarchy of reconstructions induced by HALI's hierarchical inference network and show that the resulting reconstruction errors are implicitly minimized during adversarial training. We also leverage HALI's hierarchical inference network to offer a novel approach to semi-supervised learning in generative adversarial models.
3.1 A Model for Hierarchical Features
Denote by P(Z) the set of all probability measures on some set Z. A Markov kernel K : X → P(Z) associates to each element x ∈ X a probability measure K(z | x) ∈ P(Z). Given two Markov kernels K_1 : X → P(Z_1) and K_2 : Z_1 → P(Z_2), a further Markov kernel K_2 ∘ K_1 : X → P(Z_2) can be defined by composing the two and then marginalizing over z_1, i.e. (K_2 ∘ K_1)(z_2 | x) = ∫ K_2(z_2 | z_1) K_1(z_1 | x) dz_1. Consider a set of random variables x, z_1, …, z_L. Using the composition operation, we can construct a hierarchy of Markov kernels, or feature transitions, as

q(z_l | x) = (q(z_l | z_{l-1}) ∘ ⋯ ∘ q(z_1 | x)), for l = 1, …, L. (1)
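The composition-and-marginalization operation above is exactly ancestral sampling: to draw from the composed kernel, draw from the first kernel and feed the result to the second; the intermediate variable is marginalized out simply by discarding it. A minimal sketch with toy linear-Gaussian kernels (the parameters are arbitrary, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(weight, scale):
    """A Markov kernel z -> N(weight * z, scale^2), returned as a sampler."""
    def sample(z):
        return weight * z + scale * rng.normal(size=np.shape(z))
    return sample

def compose(k2, k1):
    """Compose two kernels: sample k1, feed the result to k2. Marginalizing
    over the intermediate variable is implicit in discarding it."""
    def sample(x):
        return k2(k1(x))
    return sample

k1 = gaussian_kernel(0.5, 0.1)   # x  -> z1
k2 = gaussian_kernel(2.0, 0.1)   # z1 -> z2
k21 = compose(k2, k1)            # x  -> z2 directly

# For linear-Gaussian kernels the composed mean is the product of the
# weights: starting from x = 1, E[z2] = 2.0 * 0.5 * 1 = 1.0.
x = np.ones(100_000)
z2 = k21(x)
assert abs(float(z2.mean()) - 1.0) < 0.01
```

The same pattern extends to a hierarchy of L kernels by folding `compose` over the list, which is how the feature transitions in Eq. (1) are realized in practice.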
A desirable property for these feature transitions is to have some form of inverses. Motivated by this, we define the adjoint feature transition as the kernel mapping in the opposite direction, p(z_{l-1} | z_l). From this, we see that

p(x | z_L) = (p(x | z_1) ∘ ⋯ ∘ p(z_{L-1} | z_L)). (2)
This can be interpreted as saying that the generative mechanism of the latent variables given the data is the "inverse" of the data-generating mechanism given the latent variables. Let q(x) denote the distribution of the data and p(z_L) the prior on the top-level latent variables. Typically the prior will be a simple distribution, e.g. a standard Gaussian N(0, I).
The composition of Markov kernels in Eq. (1), mapping data samples to samples of the latent variables, constitutes the encoder. Similarly, the composition of kernels in Eq. (2), mapping prior samples of z_L to data samples, constitutes the decoder. Thus, the joint distribution of the encoder can be written as

q(x, z_1, …, z_L) = q(x) ∏_{l=1}^{L} q(z_l | z_{l-1}), with z_0 = x, (3)

while the joint distribution of the decoder is given by

p(x, z_1, …, z_L) = p(z_L) p(x | z_1) ∏_{l=2}^{L} p(z_{l-1} | z_l). (4)
The encoder and decoder distributions can be visualized graphically as the Markov chains x → z_1 → ⋯ → z_L and z_L → ⋯ → z_1 → x, respectively.
Having constructed the joint distributions of the encoder and decoder, we can now match these distributions through adversarial training. It can be shown that, under an ideal (non-parametric) discriminator, this is equivalent to minimizing the Jensen-Shannon divergence between the joints in Eq. (3) and Eq. (4); see Dumoulin et al. (2016). Algorithm 1 details the training procedure.
3.2 A hierarchy of reconstructions
The Markovian character of both the encoder and decoder implies a hierarchy of reconstructions in the decoder. In particular, for a given observation x, the model yields L different reconstructions x̂_1, …, x̂_L, with x̂_l the reconstruction at the l-th level of the hierarchy. Here, we can think of q(z_l | x) as projecting x to the l-th intermediate representation and p(x | z_l) as projecting it back to the input space. The reconstruction error for a given input x at the l-th hierarchical level is then given by

L_l(x) = E_{z_l ∼ q(z_l | x)} [ −log p(x | z_l) ]. (5)
Contrary to models that merge autoencoders and adversarial models, e.g. Rosca et al. (2017); Larsen et al. (2015), HALI does not require any additional terms in its loss function to minimize the above reconstruction error. Indeed, as training proceeds, the reconstruction errors at the different levels of HALI are driven down to the amount of information about x that the given level of the hierarchy is able to encode. Furthermore, under an optimal discriminator, training in HALI minimizes the Jensen-Shannon divergence between the encoder and decoder joint distributions, as formalized in Proposition 1 below, which also captures the interaction between the reconstruction error and the training dynamics.
Proposition 1.
Assuming p(x | z_l) is bounded away from zero for all l, we have that

(6)

where the expectation is computed under the encoder's distribution and the constant is as defined in Lemma 2 in the appendix.
Proposition 2 below, on the other hand, relates the intermediate representations in the hierarchy to the corresponding induced reconstruction error.
Proposition 2.
For any given latent variable z_l,

H(x | z_l) ≤ E_{q(x, z_l)} [ −log p(x | z_l) ], (7)

i.e. the reconstruction error is an upper bound on the conditional entropy H(x | z_l).
In summary, Propositions 1 and 2 establish the dynamics between the hierarchical representations learned by the inference network, the reconstruction errors, and the adversarial matching of the joint distributions in Eq. (3) and Eq. (4). The proofs of the two propositions are deferred to the appendix. Having theoretically established the interplay between layer-wise reconstructions and the training mechanics, we now move to the empirical evaluation of HALI.
4 Empirical Analysis: Setup
We designed our experiments with the objective of addressing the following questions: Does HALI improve the perceptual fidelity of reconstructions? Does HALI induce a semantically meaningful representation of the observed data? Are the learned representations useful for downstream classification tasks? Each of these questions is considered in turn in the following sections.
We evaluated HALI on four datasets: CIFAR10 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), ImageNet 128x128 (Russakovsky et al., 2015) and CelebA (Liu et al., 2015). We used two conditional hierarchies in all experiments, with the Markov kernels parametrized as conditional isotropic Gaussians. For SVHN, CIFAR10 and CelebA, the resolutions of the two levels of latent variables are and ; for ImageNet, they are and . For both the encoder and decoder, we use residual blocks (He et al., 2015) with skip connections between the blocks, in conjunction with batch normalization (Ioffe & Szegedy, 2015). We use convolutions with stride 2 for downsampling in the encoder and bilinear upsampling in the decoder. In the discriminator, we use consecutive stride 1 and stride 2 convolutions and weight normalization (Salimans & Kingma, 2016). To regularize the discriminator, we apply dropout every 3 layers with a retention probability of 0.2. We also add Gaussian noise with standard deviation 0.2 at the inputs of the discriminator and the encoder.
5 Empirical Analysis I: Reconstructions
One of the desired objectives of a generative model is to reconstruct input images from their latent representations. We show that HALI offers improved perceptual reconstructions relative to the (non-hierarchical) ALI model.
5.1 Qualitative analysis
First, we present reconstructions obtained on ImageNet; reconstructions from SVHN and CIFAR10 can be seen in Fig. missing in the appendix. Fig. missing highlights HALI's ability to reconstruct the input samples with high fidelity. We observe that reconstructions from the first level of the hierarchy exhibit local differences from the natural images, while reconstructions from the second level display global changes; reconstructions from higher levels of the hierarchy are, more often than not, a different member of the same class. Moreover, we show in Fig. missing that this increase in reconstruction fidelity does not impact the quality of generative samples from HALI's decoder.
5.2 Quantitative analysis
We further investigate the quality of the reconstructions with a quantitative assessment of how well perceptual features of the input samples are preserved. For this evaluation we use the CelebA dataset, where each image comes with a 40-dimensional binary attribute vector. A VGG16 classifier (Simonyan & Zisserman, 2014) was trained on the CelebA training set to classify the individual attributes. This trained model is then used to classify the attributes of the reconstructions from the validation set. We consider a reconstruction good if it preserves, as measured by the trained classifier, the attributes possessed by the original sample.
We report summary statistics of the classifier's accuracies in Table 1 for three models: VAE, ALI and HALI. Inspection of the table reveals that HALI's reconstructions preserve a clearly dominant proportion of the attributes relative to the other models. The encoder-decoder relationship in HALI therefore better preserves the identifiable attributes than other models leveraging such a relationship. Please refer to Table 5 in the appendix for the full table of attribute scores.
Mean  Std  # Best  

Data  
VAE  
ALI  
HALI  
HALI 
5.3 Perceptual Reconstructions
In the same spirit as Larsen et al. (2015), we construct a metric by computing the Euclidean distance between the input images and their various reconstructions in the discriminator's feature space. More precisely, let d(x) be the embedding of the input x at the penultimate layer of the discriminator. We compute the discriminator embedded distance

δ(x, x') = ‖ d(x) − d(x') ‖, (8)

where ‖·‖ is the Euclidean norm. We then compute the average distances δ(x, x̂_1) and δ(x, x̂_2) over the ImageNet validation set. Fig. missing (a) shows that, under δ, the average reconstruction errors for both x̂_1 and x̂_2 decrease steadily as training advances. Furthermore, the errors of the reconstructions from the first level of the hierarchy are uniformly bounded above by those of the second. We note that while the VAE-GAN model of Larsen et al. (2015) explicitly minimizes the perceptual reconstruction error by adding this term to its loss function, HALI minimizes it implicitly during adversarial training, as shown in subsection 3.2.
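Given the discriminator's penultimate-layer embedding, the distance in Eq. (8) is a one-liner; the sketch below uses a toy random-projection feature map as a hypothetical stand-in for that embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the discriminator's penultimate-layer
# embedding d(.): a random linear map followed by a ReLU.
P = rng.normal(size=(32, 64))

def features(x):
    """Embed an input into the (toy) discriminator feature space."""
    return np.maximum(P @ x, 0.0)

def embedded_distance(x, x_rec):
    """Euclidean distance between inputs in the feature space, Eq. (8)."""
    return float(np.linalg.norm(features(x) - features(x_rec)))

x = rng.normal(size=64)
x_close = x + 0.01 * rng.normal(size=64)   # stand-in: level-1 reconstruction
x_far = x + 0.50 * rng.normal(size=64)     # stand-in: level-2 reconstruction

# Distances order as in the text: the reconstruction that is closer in
# input space sits closer in the discriminator's feature space too.
assert embedded_distance(x, x_close) < embedded_distance(x, x_far)
assert embedded_distance(x, x) == 0.0
```

In the paper's setting, `features` would be the trained discriminator truncated at its penultimate layer, and the two perturbed inputs would be the level-1 and level-2 reconstructions.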
6 Empirical Analysis II: Learned Representations
We now move on to assessing the quality of our learned representations through inpainting, visualization of the hierarchy, and innovation vectors.
6.1 Inpainting
Inpainting is the task of reconstructing the missing or lost parts of an image. It is challenging because sufficient prior information is needed to meaningfully replace the missing parts. While it is common to incorporate inpainting-specific training (Yeh et al., 2016; Pérez et al., 2003; Pathak et al., 2016), we simply use the standard HALI adversarial loss during training and reconstruct incomplete images at inference time.
We first predict the missing portions from the higher-level reconstructions, then iteratively use the lower-level reconstructions, which are pixel-wise closer to the original image. Fig. missing shows inpaintings on center-cropped SVHN, CelebA and MSCOCO (Lin et al., 2014) datasets, without any blending post-processing or explicit supervision. The effectiveness of our model at this task is due to the hierarchy: we can extract semantically consistent reconstructions from the higher levels of the hierarchy, then leverage pixel-wise reconstructions from the lower levels.
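The inference-time procedure can be sketched as reconstruct-and-paste: run the masked image through the model, keep the observed pixels verbatim, and iterate so each pass re-conditions the fill on the observed region. `reconstruct` below is a hypothetical stand-in for a HALI encode-decode pass (a simple smoothing map), applied to a 1-D "image" for brevity:

```python
import numpy as np

# Hypothetical stand-in for one HALI encode->decode pass: a smoothing
# map that pulls each pixel toward its neighbors.
def reconstruct(x):
    kernel = np.array([0.25, 0.5, 0.25])
    return np.convolve(x, kernel, mode="same")

def inpaint(x_obs, mask, n_iters=50):
    """Fill the masked region by iterating reconstruct-and-paste.

    mask is 1 where pixels are observed, 0 where they are missing."""
    x = np.where(mask == 1, x_obs, 0.0)         # initialize missing pixels
    for _ in range(n_iters):
        x_rec = reconstruct(x)
        x = mask * x_obs + (1 - mask) * x_rec   # keep known pixels verbatim
    return x

signal = np.ones(16)
mask = np.ones(16)
mask[6:10] = 0                                  # drop four "pixels"
filled = inpaint(signal, mask)

# Information propagates from the observed region into the hole.
assert np.all(np.abs(filled[6:10] - 1.0) < 0.05)
```

In HALI the first iterations would use the higher-level (semantically consistent) reconstruction and later iterations the lower-level (pixel-wise closer) one, as described above.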
Figure: Real CelebA faces (right) and their corresponding innovation tensor (IT) updates (left). For instance, the third row features Christina Hendricks followed by hair-color IT updates. Similarly, the first two rows depict the use of a smile IT, and the fourth row a glasses-plus-hair-color IT.
6.2 Hierarchical latent representations
To qualitatively show that higher levels of the hierarchy encode increasingly abstract representation of the data, we individually vary the latent variables and observe the effect.
The process is as follows: we sample a latent code z_2 from the prior distribution p(z_2) and multiply individual components of the vector by scalars ranging from to . For z_1, we fix z_2 and multiply each feature map independently by scalars ranging from to . In all cases, the modified latent codes are then decoded back to the input data space. Fig. missing (a) and (b) exhibit some of those decodings for z_2, while (c) and (d) do the same for the lower conditional z_1. The last column contains the decodings obtained from the originally sampled latent codes. We see that the representations learned in the z_2 conditional are responsible for high-level variations like gender, while z_1 codes imply local, pixel-wise changes such as saturation or lip color.
6.3 Latent semantic Innovation
With HALI, we can exploit the jointly learned hierarchical inference mechanism to modify actual data samples by manipulating their latent codes. We refer to these sorts of manipulations as latent semantic innovations.
Consider a given instance x from a dataset. Encoding x yields ẑ_1 and ẑ_2. We modify ẑ_2 by multiplying a specific entry by a scalar, denoting the resulting vector ẑ'_2. Decoding ẑ'_2 gives z̃'_1, and decoding the unmodified ẑ_2 gives z̃_1. We then form the innovation tensor η = z̃_1 − z̃'_1. Finally, we subtract the innovation tensor from the initial encoding, obtaining z*_1 = ẑ_1 − η, and sample x* ∼ p(x | z*_1). This method provides explicit control and allows us to carry out these variations on real samples in a completely unsupervised way. The results, shown in Fig. missing, were produced on the CelebA validation set, which was not used for training.
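The innovation-tensor manipulation can be sketched end-to-end; the two-level encoder and decoder below are toy deterministic linear maps standing in for HALI's stochastic kernels, and the scaled entry and scalar are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deterministic stand-ins for HALI's two-level kernels (hypothetical).
A1 = rng.normal(size=(8, 16))                    # x  -> z1
A2 = rng.normal(size=(4, 8))                     # z1 -> z2
encode1 = lambda x: A1 @ x
encode2 = lambda z1: A2 @ z1
decode2to1 = lambda z2: np.linalg.pinv(A2) @ z2  # z2 -> z1
decode1to0 = lambda z1: np.linalg.pinv(A1) @ z1  # z1 -> x

x = rng.normal(size=16)
z1 = encode1(x)
z2 = encode2(z1)

# Scale one entry of the top-level code (entry and scalar are arbitrary).
alpha, k = 3.0, 0
z2_mod = z2.copy()
z2_mod[k] *= alpha

# Innovation tensor: difference between the decodings of the unmodified
# and the modified top-level codes.
eta = decode2to1(z2) - decode2to1(z2_mod)

# Subtract the innovation from the original low-level code and decode.
z1_new = z1 - eta
x_new = decode1to0(z1_new)      # modified sample in data space, shape (16,)
```

In HALI proper, each arrow would be a conditional Gaussian kernel and the decodings would be samples rather than deterministic maps, but the bookkeeping is the same.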
7 Empirical Analysis III: Learning Predictive Representations
We evaluate the usefulness of our learned representation for downstream tasks by quantifying the performance of HALI on attribute classification in CelebA and on a semisupervised variant of the MNIST digit classification task.
7.1 Unsupervised classification
Following the protocol established by Berg & Belhumeur (2013) and Liu et al. (2015), we train 40 linear SVMs on HALI encoder representations (i.e. we utilize the inference network) on the CelebA validation set and subsequently measure performance on the test set. As in Berg & Belhumeur (2013); Huang et al. (2016a); Kalayeh et al. (2017), we report the balanced accuracy in order to evaluate attribute prediction performance. We emphasize that, for this experiment, the HALI encoder and decoder were trained entirely without supervision; attribute labels were used only to train the linear SVM classifiers.
A summary of the results is reported in Table 2. HALI's unsupervised features surpass those of the VAE and ALI, and, more remarkably, they outperform the best hand-crafted features by a wide margin (Zhang et al., 2014). Furthermore, our approach outperforms a number of supervised (Huang et al., 2016a) and deeply supervised (Liu et al., 2015) feature sets. Table 6 in the appendix lists the results per attribute.
Mean  Std  # Best  

TripletkNN (Schroff et al., 2015) 

PANDA (Zhang et al., 2014)  
Anet (Liu et al., 2015)  
LMLEkNN (Huang et al., 2016a)  
VAE  
ALI  
HALI 
7.2 Semi-supervised learning within HALI
The HALI hierarchy can also be used in a more integrated semi-supervised setting, where the encoder also receives a training signal from the supervised objective. The currently most successful approaches to semi-supervised learning in adversarially trained generative models build on the approach introduced by Salimans et al. (2016). This formalism relies on exploiting the discriminator's features to differentiate between the individual classes present in the labeled data as well as the generated samples. Taking inspiration from Makhzani et al. (2015); Makhzani & Frey (2017), we adopt a different approach that leverages the Markovian hierarchical inference network made available by HALI,
(9)

where y is a categorical random variable representing the class label. In practice, we characterize the conditional distribution of y given the top-level latent code by a softmax. The cost of the generator is then augmented by a supervised cost. Writing D_s for the set of pairs of labeled instances along with their labels, the supervised cost reads

L_s = − E_{(x, y) ∈ D_s} [ log q(y | x) ]. (10)
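A minimal sketch of the supervised term in Eq. (10): a softmax head q(y | z) on top of (frozen, toy) encoder features, trained by gradient descent on the cross-entropy over the labeled pairs. Everything here — the random-feature encoder, the synthetic labels, the step size — is a hypothetical stand-in for HALI's inference network:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 3

# Toy frozen "encoder": random features standing in for HALI's top-level
# latent code (hypothetical; in HALI this is the inference network).
X = rng.normal(size=(300, 20))
E = rng.normal(size=(8, 20))
Z = (E @ X.T).T
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

# Toy labels that are a deterministic function of the latent code.
y = np.argmax(Z[:, :n_classes], axis=1)
onehot = np.eye(n_classes)[y]

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def supervised_cost(W):
    """Cross-entropy -mean log q(y|x) over the labeled set, as in Eq. (10)."""
    p = softmax(Z @ W.T)
    return float(-np.mean(np.log(p[np.arange(len(y)), y] + 1e-12)))

# Train only the softmax head q(y|z) by gradient descent.
W = np.zeros((n_classes, 8))
for _ in range(200):
    grad = (softmax(Z @ W.T) - onehot).T @ Z / len(y)
    W -= 0.5 * grad

assert supervised_cost(W) < np.log(n_classes)   # better than chance
```

In the integrated setting described above, the gradient of this cost would also flow back into the encoder rather than stopping at the head.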
We showcase this approach on a semi-supervised variant of the MNIST (LeCun et al., 1998) digit classification task with 100 labeled examples evenly distributed across classes.
Table 3 shows that HALI achieves a new state-of-the-art result for this setting. Note that, unlike Dai et al. (2017), HALI uses no additional regularization.
MNIST (# errors)  

VAE (M1+M2) (Kingma et al., 2014)  
VAT (Miyato et al., 2017)  
CatGAN (Springenberg, 2015)  
Adversarial Autoencoder (Makhzani et al., 2015)  
PixelGAN (Makhzani & Frey, 2017)  
ADGM (Maaløe et al., 2016)  
FeatureMatching GAN (Salimans et al., 2016)  
Triple GAN (Li et al., 2017a)  
GSSLTRABG (Dai et al., 2017)  
HALI (ours)  73 
8 Conclusion and future work
In this paper, we introduced HALI, a novel adversarially trained generative model. HALI learns a hierarchy of latent variables with a simple Markovian structure in both the generator and inference networks. We have shown both theoretically and empirically the advantages gained by extending the ALI framework to a hierarchy.
While there are many potential applications of HALI, one important future direction of research is to explore ways to render the training process more stable and straightforward. GANs are well-known to be challenging to train, and the introduction of a hierarchy of latent variables only adds to this.
References

Berg & Belhumeur (2013) Thomas Berg and Peter N Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 955–962, 2013.
 Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016. URL http://arxiv.org/abs/1606.03657.
 Cover & Thomas (2012) Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 Dai et al. (2017) Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan Salakhutdinov. Good semisupervised learning that requires a bad gan. arXiv preprint arXiv:1705.09783, 2017.
 Denton et al. (2015) Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pp. 1486–1494, 2015.
 Donahue et al. (2016) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 Dumoulin et al. (2016) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 Finn et al. (2016) Chelsea Finn, Ian J. Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. CoRR, abs/1605.07157, 2016. URL http://arxiv.org/abs/1605.07157.
 Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Gulrajani et al. (2016) Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vázquez, and Aaron C. Courville. Pixelvae: A latent variable model for natural images. CoRR, abs/1611.05013, 2016. URL http://arxiv.org/abs/1611.05013.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.
 Huang et al. (2016a) Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375–5384, 2016a.
 Huang et al. (2016b) Xun Huang, Yixuan Li, Omid Poursaeed, John E. Hopcroft, and Serge J. Belongie. Stacked generative adversarial networks. CoRR, abs/1612.04357, 2016b. URL http://arxiv.org/abs/1612.04357.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
 Kalayeh et al. (2017) Mahdi M Kalayeh, Boqing Gong, and Mubarak Shah. Improving facial attribute prediction using semantic segmentation. arXiv preprint arXiv:1704.08740, 2017.

Karaletsos (2016) Theofanis Karaletsos. Adversarial message passing for graphical models. NIPS Workshop on Advances in Approximate Bayesian Inference, 2016.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semisupervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
 Lamb et al. (2016) Alex Lamb, Vincent Dumoulin, and Aaron Courville. Discriminative regularization for generative models. arXiv preprint arXiv:1602.03220, 2016.
 Larsen et al. (2015) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.

LeCun et al. (1998) Yann LeCun, Corinna Cortes, and Christopher JC Burges. The MNIST database of handwritten digits, 1998.
 Li et al. (2017a) Chongxuan Li, Kun Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291, 2017a.
 Li et al. (2017b) Chunyuan Li, Hao Liu, Changyou Chen, Yunchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. Alice: Towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems (NIPS), 2017b.
 Lin et al. (2014) TsungYi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Springer, 2014.
 Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
 Maaløe et al. (2016) Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
 Makhzani & Frey (2017) Alireza Makhzani and Brendan Frey. Pixelgan autoencoders. arXiv preprint arXiv:1706.00531, 2017.
 Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 Miyato et al. (2017) Takeru Miyato, Shinichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semisupervised learning. arXiv preprint arXiv:1704.03976, 2017.
 Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, pp. 4. Granada, Spain, 2011.
 Nguyen et al. (2016) Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. CoRR, abs/1612.00005, 2016. URL http://arxiv.org/abs/1612.00005.
 Oord et al. (2016a) Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016a.
 Oord et al. (2016b) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016b.
 Pathak et al. (2016) Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. CoRR, abs/1604.07379, 2016. URL http://arxiv.org/abs/1604.07379.
 Pérez et al. (2003) Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Trans. Graph., 22(3):313–318, July 2003. ISSN 0730-0301. doi: 10.1145/882262.882269. URL http://doi.acm.org/10.1145/882262.882269.
 Rosca et al. (2017) Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for autoencoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

 Salimans & Kingma (2016) Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.
 Salimans et al. (2016) Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016.

 Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.
 Shrivastava et al. (2016) Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. Learning from simulated and unsupervised images through adversarial training. CoRR, abs/1612.07828, 2016. URL http://arxiv.org/abs/1612.07828.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.
 Springenberg (2015) Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
 Ulyanov et al. (2017) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Adversarial generator-encoder networks. arXiv preprint arXiv:1704.02304, 2017.
 Yeh et al. (2016) Raymond Yeh, Chen Chen, Teck-Yian Lim, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with perceptual and contextual losses. CoRR, abs/1607.07539, 2016. URL http://arxiv.org/abs/1607.07539.
 Zhang et al. (2014) Ning Zhang, Manohar Paluri, Marc’Aurelio Ranzato, Trevor Darrell, and Lubomir Bourdev. Panda: Pose aligned networks for deep attribute modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1637–1644, 2014.
 Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017. URL http://arxiv.org/abs/1703.10593.
Appendix A
A.1 Architecture Details
| Operation | Kernel | Strides | Feature maps | BN/WN? | Dropout | Nonlinearity |
| --- | --- | --- | --- | --- | --- | --- |
| – input | | | | | | |
| Convolution | | | | | 0.0 | Leaky ReLU |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Resnet Block | | | | BN | 0.0 | Leaky ReLU |
| Resnet Block | | | | BN | 0.0 | Leaky ReLU |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Resnet Block | | | | BN | 0.0 | Leaky ReLU |
| Convolution | | | | | 0.0 | Leaky ReLU |
| Gaussian Layer | | | | | | |
| – input | | | | | | |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Resnet Block | | | | BN | 0.0 | Leaky ReLU |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Convolution | | | | | 0.0 | Linear |
| – input | | | | | | |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Bilinear Upsampling | | | | | | |
| Resnet Block | | | | BN | 0.0 | Leaky ReLU |
| Bilinear Upsampling | | | | | | |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Bilinear Upsampling | | | | | | |
| Convolution | | | | | 0.0 | Leaky ReLU |
| Gaussian Layer | | | | | | |
| – input | | | | | | |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Resnet Block | | | | BN | 0.0 | Leaky ReLU |
| Bilinear Upsampling | | | | | | |
| Resnet Block | | | | BN | 0.0 | Leaky ReLU |
| Bilinear Upsampling | | | | | | |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Convolution | | | | BN | 0.0 | Leaky ReLU |
| Convolution | | | | | 0.0 | Tanh |
| – input | | | | | | |
| Convolution | | | | WN | 0.2 | Leaky ReLU |
| Convolution | | | | WN | 0.5 | Leaky ReLU |
| Convolution | | | | WN | 0.5 | Leaky ReLU |
| Convolution | | | | WN | 0.5 | Leaky ReLU |
| – input | | | | | | |
| Concatenate inputs along the channel axis | | | | | | |
| Convolution | | | | WN | 0.2 | Leaky ReLU |
| Convolution | | | | WN | 0.5 | Leaky ReLU |
| Convolution | | | | WN | 0.5 | Leaky ReLU |
| Convolution | | | | WN | 0.5 | Leaky ReLU |
| Convolution | | | | WN | 0.2 | Leaky ReLU |
| – input | | | | | | |
| Concatenate inputs along the channel axis | | | | | | |
| Convolution | | | | | 0.5 | Leaky ReLU |
| Convolution | | | | | 0.5 | Leaky ReLU |
| Convolution | | | | | 0.5 | Sigmoid |
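The BN/WN column above refers to batch normalization and weight normalization (Salimans & Kingma, 2016). As a minimal illustration of the weight-normalization reparameterization w = g · v/‖v‖ and of the Leaky ReLU nonlinearity used throughout the architecture, here is a numpy sketch; the shapes and values are hypothetical:

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization (Salimans & Kingma, 2016): w = g * v / ||v||."""
    return g * v / np.linalg.norm(v)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: identity for x >= 0, small slope otherwise."""
    return np.where(x >= 0, x, slope * x)

# A single weight-normalized dense unit (illustrative shapes).
rng = np.random.default_rng(0)
v = rng.normal(size=5)   # direction parameters
g = 2.0                  # learned scale
w = weight_norm(v, g)

# The reparameterized weight always has norm g, regardless of v.
assert np.isclose(np.linalg.norm(w), g)

x = rng.normal(size=5)
out = leaky_relu(w @ x)
```

Decoupling the norm g from the direction v in this way is what makes the optimization better conditioned than working with w directly.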
A.2 Proofs
Lemma 1.
Let f be a valid f-divergence generator and let p and q be joint distributions over a random vector v. Let s be any strict subset of the components of v and t its complement. Then

D_f(p(v) ∥ q(v)) ≥ D_f(p(s) ∥ q(s)). (11)

Proof.
By definition, we have
D_f(p(v) ∥ q(v)) = E_{q(v)}[f(p(v)/q(v))] = E_{q(s)}[ E_{q(t|s)}[ f(p(s, t)/q(s, t)) ] ].
Using that f is convex, Jensen's inequality yields
D_f(p(v) ∥ q(v)) ≥ E_{q(s)}[ f( E_{q(t|s)}[ p(s, t)/q(s, t) ] ) ].
Simplifying the inner expectation on the right-hand side, E_{q(t|s)}[p(s, t)/q(s, t)] = ∫ p(s, t)/q(s) dt = p(s)/q(s), we conclude that
D_f(p(v) ∥ q(v)) ≥ E_{q(s)}[f(p(s)/q(s))] = D_f(p(s) ∥ q(s)).
∎
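As a numerical illustration of Lemma 1 (marginalizing out components of v can only decrease an f-divergence), the sketch below uses the KL divergence as the f-divergence; the joint distributions are arbitrary illustrative values:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence (an f-divergence with generator t*log t)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Two arbitrary joint distributions over a pair (s, t), with s, t in {0, 1}.
p_joint = np.array([[0.10, 0.30],
                    [0.40, 0.20]])
q_joint = np.array([[0.25, 0.25],
                    [0.25, 0.25]])

# Marginals over the sub-vector s (rows).
p_s = p_joint.sum(axis=1)
q_s = q_joint.sum(axis=1)

# Lemma 1: D_f(p(v) || q(v)) >= D_f(p(s) || q(s)).
assert kl(p_joint.ravel(), q_joint.ravel()) >= kl(p_s, q_s)
```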
Lemma 2 (Kullback–Leibler's upper bound by Jensen–Shannon).
Assume that p and q are two probability distributions absolutely continuous with respect to each other. Moreover, assume that the ratio q/p is bounded away from zero. Then, there exists a positive scalar C such that

KL(p ∥ q) ≤ C · JSD(p ∥ q). (12)

Proof.
We start by bounding the Kullback–Leibler divergence by the χ² distance. We have

KL(p ∥ q) = E_p[log(p/q)] ≤ log E_p[p/q] = log(1 + χ²(p ∥ q)) ≤ χ²(p ∥ q). (13)

The first inequality follows by Jensen's inequality applied to the concave logarithm; the second by the first-order Taylor bound log(1 + u) ≤ u. Recall that both the χ² distance and the Jensen–Shannon divergence are f-divergences, with generators given by f_{χ²}(t) = (t − 1)² and f_{JS}(t) = (t/2) log t − ((t + 1)/2) log((t + 1)/2), respectively. We form the function h(t) = f_{χ²}(t)/f_{JS}(t); h is strictly increasing on its domain. Since we are assuming q/p to be bounded away from zero, we know that there is a constant M such that p(x)/q(x) ≤ M for all x. Subsequently, for all x we have h(p(x)/q(x)) ≤ h(M) =: C, and hence f_{χ²}(p(x)/q(x)) ≤ C · f_{JS}(p(x)/q(x)). Integrating with respect to q, we conclude
χ²(p ∥ q) ≤ C · JSD(p ∥ q),
which together with (13) proves (12). ∎
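The two bounds in the proof can be checked numerically on discrete distributions; the distributions below are arbitrary illustrative values:

```python
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence."""
    return float(np.sum(p * np.log(p / q)))

def chi2(p, q):
    """Chi-squared divergence: f-divergence with generator (t - 1)^2."""
    return float(np.sum((p - q) ** 2 / q))

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q)/2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])

# Step (13): KL(p||q) <= chi^2(p||q).
assert kl(p, q) <= chi2(p, q)

# Lemma 2: for bounded p/q, some constant C satisfies KL(p||q) <= C * JSD(p||q);
# here any C at least chi^2/JSD works for this pair.
C = chi2(p, q) / jsd(p, q)
assert kl(p, q) <= C * jsd(p, q)
```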
Proposition 3.
Assume that the conditionals q(x | z1) and p(x | z1) are positive for any x. Then there exists a positive scalar C such that

E_{q(x, z1)}[−log p(x | z1)] ≤ H_q(x | z1) + C · JSD(q(x, z1) ∥ p(x, z1)), (14)

where the conditional entropy H_q(x | z1) is computed under the encoder's distribution.
Proof.
By elementary manipulations we have
E_{q(x, z1)}[−log p(x | z1)] = H_q(x | z1) + E_{q(z1)}[KL(q(x | z1) ∥ p(x | z1))],
where the conditional entropy is computed under the encoder's distribution. By the chain rule for the KL divergence and its non-negativity we obtain
E_{q(z1)}[KL(q(x | z1) ∥ p(x | z1))] = KL(q(x, z1) ∥ p(x, z1)) − KL(q(z1) ∥ p(z1)) ≤ KL(q(x, z1) ∥ p(x, z1)).
Using Lemma 2, we have
KL(q(x, z1) ∥ p(x, z1)) ≤ C · JSD(q(x, z1) ∥ p(x, z1)),
and combining the three displays yields (14). ∎
Proposition 4.
For any given latent variable z1, the reconstruction log-likelihood term E_{q(x | z1)}[−log p(x | z1)] is an upper bound on the conditional entropy H_q(x | z1).
Proof.
By the non-negativity of the Kullback–Leibler divergence, we have that
0 ≤ KL(q(x | z1) ∥ p(x | z1)) = E_{q(x | z1)}[−log p(x | z1)] − H_q(x | z1).
Integrating over the marginal q(z1) and applying Fubini's theorem yields
E_{q(x, z1)}[−log p(x | z1)] ≥ H_q(x | z1),
where the conditional entropy is computed under the encoder distribution. ∎
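A quick discrete sanity check of Proposition 4, with a fixed latent z1 and illustrative conditional distributions:

```python
import numpy as np

# Fixed latent z1: encoder conditional q(x|z1) and decoder conditional p(x|z1)
# over a discrete x (illustrative values).
q_x = np.array([0.6, 0.3, 0.1])
p_x = np.array([0.5, 0.4, 0.1])

neg_rec_loglik = -float(np.sum(q_x * np.log(p_x)))   # E_{q(x|z1)}[-log p(x|z1)]
cond_entropy = -float(np.sum(q_x * np.log(q_x)))     # H_q(x|z1)

# Proposition 4: the reconstruction term upper-bounds the conditional entropy,
# and the gap is exactly KL(q(x|z1) || p(x|z1)) >= 0.
assert neg_rec_loglik >= cond_entropy
```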
| Attribute | Data | VAE | ALI | HALI (z1) | HALI (z2) |
| --- | --- | --- | --- | --- | --- |
| Sideburns | 86 | 80 | 72 | 92 | 80 |
| Black Hair | 81 | 96 | 90 | 93 | 78 |
| Wavy Hair | 69 | 60 | 83 | 93 | 91 |
| Young | 91 | 98 | 94 | 98 | 96 |
| Makeup | 90 | 79 | 88 | 91 | 89 |
| Blond | 88 | 74 | 81 | 95 | 94 |
| Attractive | 81 | 83 | 91 | 95 | 92 |
| With Shadow | 80 | 74 | 77 | 92 | 78 |
| With Necktie | 80 | 93 | 83 | 89 | 90 |
| Blurry | 60 | 96 | 83 | 92 | 83 |
| Double Chin | 70 | 73 | 89 | 81 | 88 |
| Brown Hair | 74 | 77 | 76 | 95 | 80 |
| Mouth Open | 92 | 83 | 77 | 92 | 83 |
| Goatee | 88 | 77 | 72 | 89 | 78 |
| Bald | 77 | 89 | 83 | 94 | 88 |
| Pointy Nose | 60 | 86 | 87 | 98 | 89 |
| Gray Hair | 84 | 69 | 84 | 88 | 87 |
| Pale Skin | 72 | 86 | 79 | 90 | 91 |
| Arched Brows | 69 | 80 | 92 | 95 | 87 |
| With Hat | 86 | 84 | 92 | 99 | 84 |
| Balding | 72 | 89 | 86 | 96 | 88 |
| Straight Hair | 60 | 94 | 87 | 91 | 88 |
| Big Nose | 69 | 69 | 89 | 90 | 89 |
| Rosy Cheeks | 71 | 67 | 82 | 71 | 75 |
| Oval Face | 56 | 83 | 89 | 91 | 90 |
| Bangs | 89 | 78 | 76 | 94 | 84 |
| Male | 97 | 93 | 90 | 97 | 90 |
| Mustache | 80 | 82 | 77 | 84 | 80 |
| High Cheeks | 85 | 91 | 86 | 96 | 89 |
| No Beard | 96 | 98 | 92 | 98 | 95 |
| Eyeglasses | 94 | 73 | 81 | 90 | 75 |
| Baggy Eyes | 69 | 65 | 87 | 94 | 86 |
| With Necklace | 54 | 92 | 93 | 93 | 99 |
| With Lipstick | 94 | 86 | 88 | 94 | 90 |
| Big Lips | 54 | 82 | 83 | 83 | 88 |
| Narrow Eyes | 58 | 78 | 85 | 88 | 87 |
| Chubby | 71 | 73 | 90 | 82 | 87 |
| Smiling | 92 | 95 | 85 | 97 | 89 |
| Bushy Brows | 72 | 64 | 82 | 91 | 85 |
| With Earrings | 74 | 62 | 83 | 83 | 81 |
| Attribute | Triplet-kNN | PANDA | ANet | LMLE-kNN | VAE | ALI | HALI (Unsup) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5 o'Clock Shadow | 66 | 76 | 81 | 82 | 78 | 78 | 86 |
| Arched Eyebrows | 73 | 77 | 76 | 79 | 65 | 70 | 77 |
| Attractive | 83 | 85 | 87 | 88 | 62 | 69 | 80 |
| Bags Under Eyes | 63 | 67 | 70 | 73 | 68 | 68 | 78 |
| Bald | 75 | 74 | 73 | 90 | 87 | 89 | 94 |
| Bangs | 81 | 92 | 90 | 98 | 86 | 87 | 93 |
| Big Lips | 55 | 56 | 57 | 60 | 58 | 57 | 62 |
| Big Nose | 68 | 72 | 78 | 80 | 67 | 69 | 74 |
| Black Hair | 82 | 84 | 90 | 92 | 75 | 75 | 85 |
| Blond Hair | 81 | 91 | 90 | 99 | 83 | 88 | 92 |
| Blurry | 43 | 50 | 56 | 59 | 64 | 65 | 78 |
| Brown Hair | 76 | 85 | 83 | 87 | 62 | 64 | 77 |
| Bushy Eyebrows | 68 | 74 | 82 | 82 | 72 | 71 | 82 |
| Chubby | 64 | 65 | 70 | 79 | 77 | 78 | 85 |
| Double Chin | 60 | 64 | 68 | 74 | 80 | 78 | 86 |
| Eyeglasses | 82 | 88 | 95 | 98 | 81 | 85 | 96 |
| Goatee | 73 | 84 | 86 | 95 | 80 | 79 | 92 |
| Gray Hair | 72 | 79 | 85 | 91 | 88 | 89 | 93 |
| Heavy Makeup | 88 | 95 | 96 | 98 | 75 | 79 | 89 |
| High Cheekbones | 86 | 89 | 89 | 92 | 75 | 64 | 85 |
| Male | 91 | 99 | 99 | 99 | 78 | 83 | 96 |
| Mouth Slightly Open | 92 | 93 | 96 | 96 | 67 | 52 | 88 |
| Mustache | 57 | 63 | 61 | 73 | 81 | 82 | 90 |
| Narrow Eyes | 47 | 51 | 57 | 59 | 60 | 62 | 72 |
| No Beard | 82 | 87 | 93 | 96 | 79 | 79 | 90 |
| Oval Face | 61 | 66 | 67 | 68 | 51 | 54 | 65 |
| Pale Skin | 63 | 69 | 77 | 80 | 86 | 85 | 89 |
| Pointy Nose | 61 | 67 | 69 | 72 | 59 | 61 | 69 |
| Receding Hairline | 60 | 67 | 70 | 76 | 79 | 78 | 84 |
| Rosy Cheeks | 64 | 68 | 76 | 78 | 79 | 80 | 89 |
| Sideburns | 71 | 81 | 79 | 88 | 79 | 77 | 91 |
| Smiling | 92 | 98 | 97 | 99 | 81 | 68 | 91 |
| Straight Hair | 63 | 66 | 69 | 73 | 55 | 60 | 70 |
| Wavy Hair | 77 | 78 | 81 | 83 | 69 | 72 | 77 |
| Wearing Earrings | 69 | 77 | 83 | 83 | 65 | 67 | 78 |
| Wearing Hat | 84 | 90 | 90 | 99 | 84 | 91 | 95 |
| Wearing Lipstick | 91 | 97 | 95 | 99 | 78 | 82 | 92 |
| Wearing Necklace | 50 | 51 | 59 | 59 | 67 | 67 | 71 |
| Wearing Necktie | 73 | 85 | 79 | 90 | 83 | 82 | 89 |
| Young | 75 | 78 | 84 | 87 | 69 | 71 | 80 |
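Per-attribute accuracies like the ones reported above are the percentage of held-out examples on which a binary attribute prediction matches the label. A minimal sketch, with hypothetical predictions and labels:

```python
import numpy as np

def attribute_accuracy(pred, target):
    """Percentage of examples where a binary attribute prediction matches the label."""
    pred, target = np.asarray(pred), np.asarray(target)
    return 100.0 * float(np.mean(pred == target))

# Hypothetical predictions for one attribute over 10 held-out examples.
pred   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
target = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
acc = attribute_accuracy(pred, target)   # 8/10 matches -> 80.0
```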