Unsupervised Domain Adaptation
Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman
In this page, we present our works of 1) geodesic flow kernel (GFK) [1], 2) landmarks [2, 3], 3) rank of domains (ROD) metric [1], a discriminative clustering approach [4], as well as how to discover latent domains from data [6], devoting to unsupervised domain adaptation problems.
Motivation
Recent studies have shown that recognition datasets are biased (see Figure 1 for an example). Paying no heed to those biases, learning algorithms often result in classifiers with poor crossdataset generalization. We are developing domain adaptation techniques to overcome those biases and yield classifiers with significantly improved performance when generalized to new testing datasets. Our work enables us to continue to harvest the benefits of existing vision datasets for the time being. Moreover, it also sheds insights about how to construct new ones. In particular, domain adaptation raises the bar for collecting datathe most informative data are those which cannot be classified well by learning algorithms that adapt from existing datasets.
Figure 1. Mismatch between two different domains. 
Introduction
Domain adaptation aims to build classifiers that are robust to mismatched distributions. When some labeled data from the target domain is accessible, the problem is similar to semisupervised learning and is referred to as semisupervised domain adaptation. In contrast, when there is no labeled data from the target domain to help learning, the problem is called unsupervised domain adaptation. We are mainly interested in developing unsupervised domain adaptation techniques, which are often easy to be extended to the semisupervied case.
In the literature, one extensively studied paradigm is to assume that there is a domaininvariant feature space. Defining and quantifying such a feature space entails careful examination of our intuition on what type of representations facilitate adaptation. Since many visual data can be assumed to lie in lowdimensional subspaces, given data from two domains, how can we exploit the subspaces in these datasets, which can be telltale in revealing the underlying difference and commonness between the domains? We propose a geodesic flow kernel (GFK) to address this chanllenge. GFK integrates an infinite number of subspaces that lie on the geodesic flow from the source subspace to the target one. The flow represents incremental changes in geometric and statistical properties between the two domains.
Despite of the progresses by the existing approaches, they so far have only been limited to macroscopically examining the distribution similarity by tuning to statistical properties of the samples as a wholewhen comparing distributions, all the samples are used. This notion is stringent, as it requires all discrepancies to be counted for and forces learning inefficiently (or even erroneously) from “hard” cases which might be just outliers to the target domains. In contrast, we will leverage the key insight that not all instances are created equally in terms of adaptability. Thus, we will examine distribution similarity microscopically at the instance level using landmarks. Landmarks are defined as a subset of labeled instances from the source domain. These instances are distributed similarly to the target domain. As a result, they are expected to function as a conduit connecting the source and target domains to facilitate adaptation.
In addition, we also study an interesting problem in domain adaptation: given multiple source domains and a target domain, how can we select which source domain to pair with the target domain? This is an especially important problem to address in order to apply domain adaptation to realworld problems. We address this problem proposing a rank of domains (ROD) measure that ranks source domains based on how suitable they are for a sourcetrained classifier to adapt to the target domain. The metric integrates two pieces of information: geometrically how much the subspaces of the source and the target domains overlap, and statistically how similarly the target and source data are distributed in the subspaces.
Existing approaches that learn a domaininvariant feature space often separate the learning of the new feature space and the learning of the classifier. We argue that the feature space learned in this twostage learning paradigm can be suboptimal, and propose a novel approach called Discriminative Clustering based Domain Adaptation, which combines the two in a single stage. Our approach assumes that, in the new feature space, 1) data in both the source and the target domains are tightly clustered and clusters corresponds class boundaries; 2) For the same class, the clusters from the two domains are geometrically close to each other.
Geodesic flow kernel (GFK)
We describe our kernelbased method, GFK, in this section. Figure 2 sketches the main idea.
Figure 2. Main idea of our geodesic flow kernelbased approach
for domain adaptation. 
We embed source
and target datasets in a Grassmann manifold. We then construct
a geodesic flow between the two points and integrate an infinite
number of subspaces along the flow . Concretely, raw
features are projected into these subspaces to form an infinitedimensional
feature vector . Inner products between
these feature vectors define a kernel function that can be computed
over the original feature space in closedform. The kernel
encapsulates incremental changes between subspaces that underly
the difference and commonness between the two domains. The
learning algorithms thus use this kernel to derive lowdimensional
representations that are invariant to the domains. 
GFK models the source domain and the target domain with ddimensional linear subspaces and embeds them onto a Grassmann manifold. Specifically, let denote the basis of the PCA subspaces for each of the two domains, respectively. The Grassmann manifold G(d,D) is the collection of all ddimensional subspaces of the feature vector space .
The geodesic flow between on the manifold parameterizes a path connecting the two subspaces. In the beginning of the flow, the subspace is similar to that of the source domain and in the end of the flow, the subspace is similar to that of the target. The original feature is projected into these subspaces and forms a feature vector of infinite dimensions:
Using the new feature representation for learning, will force the classifiers to NOT lean towards either the source domain or the target domain, or in other words, will force the classifier to use domaininvariant features. The infinitedimensional feature vector is handled conveniently by their inner product that gives rise to a positive semidefinite kernel defined on the original features,
The matrix G can be computed efficiently using singular value decomposition. Moreover, computing the kernel does not require any labeled data. Details are in [1].
Landmarks
Landmarks are data points from the source domain; however, given how they are distributed, they look like they could be samples from the target domain too (cf. Figure 3 for a schematic illustration, and Figure 5 for exemplar images of visual objects identified as landmarks in vision datasets). The intuition behind our approach is to use these landmarks to bridge the source and the target domains, more specifically, to forge the target discriminative loss function using the labeled landmarks.
Figure 3. Main idea of our landmarkbased approach
for domain adaptation. 
(a) The original domain adaptation (DA)
problem where instances in red are from the target and in
blue from the source. (b) Landmarks, shown inside the
green circles, are data instances from the source that can be
regarded as samples from the target. (c) Multiple
auxiliary tasks are created by augmenting the original
target with landmarks, which switches their color (domain
association) from blue to red. Each task gives
rise to a new feature representation. These representations
are combined discriminatively to form domaininvariant
features for the original DA problem. 
Identifying landmarks
Let and denote the source domain and the target domain, respectively. We use as the indicator variable, one for each data point in the source domain. Our goal is to choose among all possible configurations of the indicator variables such that the distribution of the selected data instances are maximally similar to that of the target domain,
where the objective function is motivated by the maximum mean discrepancy, a nonparametric twosample test for determining whether two distributions are similar.
Furthermore, we impose the constraint that labels be balanced in the selected landmarks. Concretely,
where denotes whether
Note that due to the binary constraint on the indicator variable, the problem is difficult to solve. We relax and solve it with a convex programming. Details please be referred to [2, 3].
Multiscale analysis
Note that in the above formulation, we do not need compute the feature mapping directly. Instead, we use the kernel function,
where G is the GFK kernel. Other kernels may be also applicable here. The bandwidth σ is a scaling factor for measuring distances and similarities between data points. Since we regard landmarks as likely samples from the target domain, σ determines how much the source and the target are similar to each other at different granularities. Instead of hoping to have one optimal σ, we identify landmarks at different scales. Figure 3(c) illustrates this effect, and Figure 5 shows some landmarks selected at different scales.
Auxiliary tasks
Each set of landmarks (selected at a scale σ) gives rise to a different perspective on the adaptation problem by suggesting which instances to explore to connect the source and the target. We achieve this connection by creating auxiliary tasks.
The auxiliary task is created as domain adaptation from a new source domain to a new target domain , where is the landmarks at the qth scale. The auxiliary tasks differ from the original problem in an important aspect: the new tasks should be “easier”, as the existence of landmark points ought to aid the adaptation, as justified by Theorem 1 in [2]. We solve these auxiliary tasks by GFK. As a result, we have a set of kernels .
Discriminative learning
Besides the benefit of harvesting the "easier" auxiliary tasks, we reveal an more important use of landmarks here. In unsupervised domain adaptation, there is no principled way of defining discriminative loss functions with respect to the target domain as there is no labeled data. However, landmarks serve as a conduit to the target discriminative loss, enabling us to forge the loss functions to fit the target domain. To this end, we will use only the landmarks to train a classification function, instead of using the whole labeled source domain.
Concretely, we use the landmarks to learn an kernelized SVM classifier, with the convex combination of the kernels from the auxiliary tasks,
Rank of domains (ROD)
We study an interesting and important question in this section. Imagine we need to build a classifier for a target domain for object recognition. We have several datasets, Caltech101, PASCAL VOC, and ImageNet to choose from as the source domain. Without actually running our domain adaptation algorithms and building classifiers, is it possible to determine which dataset(s) would give us the best performance on the target domain?
We introduce the rank of domains (ROD) metric to answer this question. It measures the adaptability between a sourcetarget pair from two sets of information: geometrically, the alignment between subspaces of the two domains, and statistically, KL divergences between data distributions once they are projected into the subspaces:
where and are two onedimensional Gaussians to approximate the source data and target data after being projected to the corresponding subspaces, the principal angels measure the geometric overlap of the source subspace and the target subspace. For details please be referred to [1].
Discriminative clustering
The assumptions of our approach are illustrated in Figure 4: in the new feature space, 1) data in both domains are tightly clustered and clusters corresponds class boundaries; 2) For the same class, the clusters from the two domains are geometrically close to each other.
Figure 4. Assumption of our discriminative clustering approach for domain adaptation.

Leveraging these assumptions, our formulation of learning the optimal feature space balances two forces: maximizing domain similarity that makes the source and the target domains look alike, and (approximately) minimizing the expected classification error on the target domain. We definene those two forces with informationtheoretical quantities: the domain similarity being the negated mutual information between all data and their binary domain labels (source versus target) and the expected classification error being the negated mutual information between the target data and its clusters (i.e. class) labels estimated from the source data. These two quantities are directly motivated by the nearest neighbor classifiers we use in the new feature space. Figure 5 shows the formulation of our approach.
Figure 5. Formulation of our discriminative clustering approach for domain adaptation.

Reshaping visual datasets
Image data is difficult to manually divide into the discrete domains required by adaptation algorithms, and the standard practice of equating datasets with domains is a weak proxy for all the real conditions that alter the statistics in complex ways (lighting, pose, background, resolution, etc.)
We propose an approach [6] to automatically discover latent domains in image or video datasets. Our formulation imposes two key properties on domains: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other to the maximum extent; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain. We devise a nonparametric formulation and efficient optimization procedure that can successfully discover domains among both training and test data.
Experiments
We evaluate our methods in the context of object recognition. Four datasets are used in the experiments: Caltech101, AMAZON, WEBCAM, and DSLR (see Figure 6 for example images). Each of them is regarded as one domainCaltech101 images often have clean backgrounds, AMAZON is collected from online catalog, WEBCAM images are captured by lowresolution web cameras in an office environment, and DSLR consists of highquality images. We use an 800bin histogram (bagofSURF) to represent each image.

Figure 6. Example images from the four domains. 
GFK evaluation
Table 1 shows some recognition results on target domains of unsupervised domain adaptation using our GFK. 1NN classifier is used. The OrigFeat row corresponds to results without any adaptation techniques. We can see that they are worse than domain adaptation results (SGF and GFK). Further, our GFK outperforms another adaptation approach, SGF proposed by Gopalan et al. More results can be found in [1].
Table 1 . Recognition accuracies on target domains with GFK. (C: Caltech, A: Amazon, W: Webcam, and D: DSLR). 


Landmark Evaluation
We evaluate the landmarkbased approach in Table 2. Since we use an SVM for landmarks, we report the results of ther other methods also using SVM. The best results for each sourcetarget pair are in red color, and the second best ones are in green color. Our landmarkbased approach performs the best.
Discriminative clustering Evaluation
We evaluate the discriminative clusteringbased approach in Table 3. 1NN classifier is used. We show that our method outperforms both nonadaptation methods (PCA, LMNN) and adaptation methods (TCA, GFS, Metric). More details can be found in [4].
Table 3 . Recognition accuracies on target domains with Discriminative Clustering. (C: Caltech, A: Amazon, W: Webcam, and D: DSLR). 


Selected landmarks
We show some selected landmarks from the source domain AMAZON for the target domain WEBCAM at different scales. As the scale decreases, landmarks with greater variances are selected.
Figure 7. Selected landmarks from AMAZON for the target domain WEBCAM. 
Rank of domains (ROD)
Given an unlabeled target domain and several source domain, how can we identify the best source domain that can be adapted the best to the target domain? We have proposed the ROD metric towards this problem. If we compare the ROD values in the left table in Figure 8 to the recognition results in Table 1, we can see that the ROD indicates well on how adaptable a source domain is to a particular target domain. The right two figures in Figure 8 shows on such example about how to corresponds the ROD values to the recognition accuracies..
Figure 8. Left: ROD values between all sourcetarget pairs. Middle: ROD values from three possible source domains to the target domain AMAZON. The ROD between Caltech and AMAZON is the smallest except from AMAZON to itself. Right: Recognition accuracies of different approaches adapting from the three source domains to the AMAZON domain. The best results are achieved from Caltech to AMAZON, in consistent with the ROD values.

References
[1] Geodesic Flow Kernel for Unsupervised Domain Adaptation. B. Gong, Y. Shi, F. Sha, and K. Grauman.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, June 2012. (Oral) [PDF] [Supp.] [Slides] [Code&Data]
[2] Connecting the Dots with Landmarks: Discriminatively Learning
DomainInvariant Features for Unsupervised Domain Adaptation. B. Gong, K. Grauman, and F. Sha.
Proceedings of the International Conference on Machine Learning(ICML), Atlanta, GA, June 2013. (Accepted) [PDF] [Supp.]
[3] Overcoming Dataset Bias:
An Unsupervised Domain Adaptation Approach. B. Gong, F. Sha, and K. Grauman.
Big Data Meets Computer Vision: first international workshop on Large Scale Visual Recognition and Retrieval (BigVision) at NIPS, Lake Tahoe, NV, Dec. 2012. (Oral) [PDF][Slides]
[4] InformationTheoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation. Y. Shi and F. Sha
Proceedings of the International Conference on Machine Learning (ICML), 2012 [PDF] [Code]
[5] Database for Studying Effects of Domain Shift in Object Recognition: http://www1.icsi.berkeley.edu/~saenko/projects.html#data
[6] Reshaping Visual Datasets for Domain Adaptation. B. Gong, K. Grauman, and F. Sha.
Proceedings of the Neural Information Processing Systems (NIPS), Lake Tahoe, NV, Dec. 2013. [Project] [PDF] [Supp.] [Code]
Acknowledgements
This work is partially supported by DARPA D11AP00278, NSF IIS#1065243, CSSG (B. G., Y. S. and F. S.), and ONR ATL and NSF IIS 1065390 (K. G.).
Page generated Dec. 2012.