1 Introduction

Although deep neural networks (DNNs) [1] have achieved remarkable results in many computer vision tasks, they generally assume that the training and test sets follow the same distribution. In real environments, however, the training and test sets may come from different distributions. Unsupervised Domain Adaptation (UDA) [2,3,4,5] aims to alleviate this domain gap by leveraging unlabeled target-domain data. To this end, researchers design different unsupervised losses on the target data to learn a model that performs well in the target domain. The losses used in existing unsupervised domain adaptation methods can be roughly grouped into three categories: 1) self-training losses that iteratively retrain the network with highly confident pseudo-labeled target samples [6,7,8,9]; 2) image transformation losses that transform source images into a target-like style and appearance [10,11,12,13]; 3) adversarial losses that force the two domains to align in the output space [14, 15].

Fig. 1 Overview of IDD_ICL. We cluster samples of the same category in the source domain and disperse samples of different categories using the implicit contrastive learning loss. Yellow represents the source domain, green represents the target domain. Identical shapes denote the same category, while hollow shapes represent augmented classes

To minimize domain discrepancies, most researchers have developed adversarial losses [2, 14]. For this purpose, GAN-style [16] architectures, which contain a generator and a discriminator, are widely used. The generator extracts features from raw images, and the discriminator tries to identify which domain the features come from. Through this adversarial and cooperative process, the discriminator guides the generator toward extracting target features that are close to the distribution of source features. While these methods match the marginal distributions of the two domains, they do not guarantee that features from different categories within the target domain will be well separated. It is important to describe the feature distribution separately for each category to ensure semantic consistency. In recent years, many approaches have incorporated semantic information [7, 17] into their features in order to align categories. These methods use category-level adversarial training to align semantic features across the source and target domains independently. During adaptation, however, the mini-batch size used for training is small, so an object instance from the source domain typically differs greatly from those in other images. These methods therefore inevitably introduce image-level bias, which causes features to be misaligned between domains and makes the alignment unstable.

Based on the analysis above, we present a new approach to domain adaptation that minimizes domain shift by learning sample-wise representations that attract similar samples and repel dissimilar ones, as shown in Fig. 1. To guide the directions of category alignment, our first step is to estimate the holistic distribution of each category in the source domain, since this distribution can be estimated efficiently with sufficient supervision. Unlike category-centroid-based counterparts, our method can provide diverse generations from the estimated distributions. Second, a better classifier can be obtained by increasing the intra-category compactness and inter-category separability of sample-wise representations. By sampling from the estimated distribution of the same category, we define an infinite number of positive pairs for each sample-wise representation in both the source and target domains; the remaining semantic distributions are then used to draw an infinite number of negative pairs. For contrastive adaptation, a particular form of contrastive loss is used, which we further replace with a practical upper bound. Furthermore, we propose to enhance the discriminability of our model by retraining it with high-confidence predictions. To confirm the validity of our method regarding sample-wise category alignment, we conduct an analysis using the sample-wise discrimination distance. Experimental results demonstrate that contrastively driving the source and target sample-wise representations toward the semantic distributions decreases domain discrepancies and improves generalization.

The following summarizes our main contributions.

  • We propose a novel contrastive learning method to improve diversity and discriminability for domain adaptation (IDD_ICL). Specifically, agreement between sample-wise representations and semantic distributions of the same category is explicitly encouraged, while agreement between sample-wise representations and semantic distributions of different categories is penalized.

  • The statistics of each category are used to derive an upper bound on the expected contrastive loss, making it simple yet effective to learn invariant and distinctive sample-wise representations.

  • Empirical evaluations on competitive benchmarks, including Office-31, Office-Home and VisDA-2017, show that IDD_ICL significantly improves over baseline models. Its effectiveness is further supported by analytical evidence.

The remaining sections are organized as follows: Section 2 delves into the relevant literature. Section 3 provides a concise overview of the proposed design. Section 4 presents and analyzes the results of the experiments conducted. Finally, in Section 5, we conclude and wrap up the discussion.

2 Related work

2.1 Domain adaptation

Unsupervised domain adaptation (UDA) alleviates domain shift by transferring knowledge from a similar source domain to a target domain. The problem has been tackled for image classification in a number of pioneering works [2, 3, 18,19,20,21,22]. Early domain adaptation works reduce the gap between domains by reweighting instances or learning domain-invariant features [23, 24]. Given the power of CNNs, various deep DA works have since been proposed to improve transfer performance. Minimizing the divergence between feature representations is a common strategy [20, 25, 26], for example via the maximum mean discrepancy [27]. Drawing inspiration from generative adversarial networks [16], adversarial training is another popular method for learning domain-invariant features [14, 28,29,30]. Works exploiting semantic representations are among the most relevant to ours in this subset [31, 32]. Zahra et al. [33] propose a new method for limited domain adaptation that leverages the geometry of both the source and target domains; maintaining geometry information within domains allows source samples to compensate for the missing classes in the target domain. Kang et al. [31] use a new metric named contrastive domain discrepancy to explicitly model intra- and inter-class discrepancies. Many recent works adopt the adversarial learning mechanism and achieve state-of-the-art performance for unsupervised domain adaptation. Adversarial Discriminative Domain Adaptation [28] uses an untied weight-sharing strategy to align the feature distributions of the source and target domains. Maximum Classifier Discrepancy [34] utilizes different task-specific classifiers to learn a feature extractor that generates category-related discriminative features. Multi-Adversarial Domain Adaptation [35] exploits multiplicative interactions between feature representations and category predictions to enforce adversarial learning. Table 1 summarizes the advantages (Pros) and limitations (Cons) of these existing techniques; our proposed method improves diversity and discriminability for domain adaptation.

Table 1 Pros and cons of unsupervised domain adaptation algorithms

2.2 Contrastive learning

In recent times, contrastive learning has demonstrated remarkable performance in representation learning, yielding state-of-the-art outcomes in computer vision. Its fundamental objective is to create an embedding space in which similar (positive) pairs are brought closer together, while dissimilar (negative) pairs are pushed apart. Positive pairs are established by pairing augmentations of the same image, whereas negative pairs are formed from augmentations of different images. Existing contrastive learning methods employ different strategies for generating positive and negative samples. For example, Wu et al. [36] maintain sample representations in a memory bank, MoCo [37] maintains an on-the-fly momentum encoder alongside a limited queue of previous samples, Tian et al. [38] employ all generated multi-view samples in a mini-batch, and SimCLR v1 [39] and SimCLR v2 [40] utilize all generated sample representations within the mini-batch. While these methods provide pre-trained networks for downstream tasks, they do not explicitly address domain shift when applied directly. In contrast, our approach focuses on learning representations that generalize without requiring labeled target data. Notably, contrastive learning has recently been applied in the context of unsupervised domain adaptation [37, 41,42,43,44,45]. In these settings, models have access to source labels and typically employ ImageNet-pre-trained backbones. In comparison, our work is rooted in contrastive learning, often referred to as unsupervised representation learning, and distinguishes itself by not relying on labeled target data.

3 Method

Fig. 2 Framework of our IDD_ICL. Positive sample pairs are obtained from distributions with the same semantic label and negative sample pairs from the rest, where \(\mu \) represents the mean and \(\sigma \) represents the standard deviation

3.1 Motivation and preliminaries

Formally, we denote the two domains in unsupervised domain adaptation as \(\mathcal {D}_{S}=\{(x_{sk}, y_{sk})\}_{k=1}^{n_{s}}\) with \(n_{s}\) labeled samples and \(\mathcal {D}_{T}=\{x_{tk}\}_{k=1}^{n_{t}}\) with \(n_{t}\) unlabeled samples, where \(y_{sk}\in \{1,2, ... ,K\}\) is the label of \(x_{sk}\); the distributions of the two domains differ. The different domain adaptation algorithms are compared in Table 1.

Our IDD_ICL framework is shown in Fig. 2. First, we mine comprehensive semantic information from the distribution statistics of each category; then, to mitigate the domain gap, we design a novel contrastive loss that uses a sample-level learning algorithm to simultaneously learn an infinite number of similar/dissimilar pairs.

3.1.1 Contrastive learning

In recent years, contrastive learning [37, 41, 43, 44] has been shown to be an effective method for learning meaningful representations from unlabeled data. Let f be an embedding function (realized via a CNN) that transforms a sample a into an embedding vector \(z=f(a)\,, z \in \mathbb {R}^d\), which we normalize onto the unit sphere. Let \((a\,,a^-)\) be dissimilar pairs and \((a\,,a^+)\) be similar pairs. Then the contrastive loss of InfoNCE [42] can be written as follows:

$$\begin{aligned} {\mathbb {E}_{a\,,a^+\,,\{a^{k-}\}_{k=1}^K} \left[ - \log \frac{e^{f(a)^{\top }f(a^+)/\tau }}{e^{f(a)^{\top }f(a^+)/\tau } + \sum _{k=1}^K e^{f(a)^{\top }f(a^{k-})/\tau }}\right] .} \end{aligned}$$
(1)

In practice, the expectation is replaced with an empirical estimate. As seen above, the contrastive loss is essentially a softmax formulation with temperature \(\tau \) [42].
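To make the formulation concrete, the following is a minimal PyTorch sketch of the InfoNCE loss in (1) for a single anchor; the function name and tensor layout are illustrative assumptions, not the implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss of (1) for one anchor embedding.

    anchor:    (d,)   L2-normalized embedding f(a)
    positive:  (d,)   L2-normalized embedding f(a+)
    negatives: (K, d) L2-normalized embeddings f(a^{k-})
    """
    pos_logit = anchor @ positive / tau                  # scalar similarity
    neg_logits = negatives @ anchor / tau                # (K,) similarities
    logits = torch.cat([pos_logit.view(1), neg_logits])  # positive at index 0
    # cross-entropy with target index 0 equals -log softmax of the positive
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```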

Clearly, the contrastive loss promotes sample discrimination. In contrast, our research explores sample-wise predictions for UDA, which have received little consideration in previous studies. In this study, we demonstrate that sample-by-sample representation alignment outperforms existing algorithms by a significant margin. Below we introduce the contrastive loss [46] used in this paper.

3.1.2 Estimation of semantic distributions

It is essential to identify all possible directions of feature transformation in order to facilitate meaningful cross-domain semantic augmentations. Computing the required statistics over the whole source domain at every step is prohibitively expensive in an implementation. To resolve this issue, the mean and covariance are estimated online by aggregating statistics batch by batch. Mathematically, the online estimation algorithm is as follows:

$$\begin{aligned} \mu ^i_{(t)} = \frac{n^i_{(t-1)}\mu ^i_{(t-1)} + m^i_{(t)}{\mu '}^i_{(t)}}{n^i_{(t-1)}+m^i_{(t)}}\,, \end{aligned}$$
(2)
$$\begin{aligned} \Sigma ^i_{(t)}&= \frac{n^i_{(t-1)}\Sigma ^i_{(t-1)}+m^i_{(t)}{\Sigma '}^i_{(t)}}{n^i_{(t-1)}+m^i_{(t)}} \nonumber \\&+\frac{n^i_{(t-1)}m^i_{(t)}\left( \mu ^i_{(t-1)}-{\mu '}^i_{(t)}\right) \left( \mu ^i_{(t-1)}-{\mu '}^i_{(t)}\right) ^{\top }}{\left( n^i_{(t-1)} + m^i_{(t)}\right) ^2} \,, \end{aligned}$$
(3)

where \({\mu '}^i_{(t)}\) and \({\Sigma '}^i_{(t)}\) represent the mean and covariance matrix of the features of the \(i^{th}\) category in the \(t^{th}\) image, and \(n^i_{(t-1)}\) and \(m^i_{(t)}\) denote the numbers of such features observed before and within step \(t\), respectively. As an initialization, K mean values and K covariance matrices, one pair per category, are computed on the whole source domain before training. These semantic distributions are dynamically updated during adaptation, and the estimated distributions are more informative than single centroids for guiding category alignment.
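As an illustration, the online updates of (2) and (3) can be sketched as follows in PyTorch; the class name and batching interface are our own assumptions. Note that the covariance is updated before the mean, so that (3) uses the previous mean estimate.

```python
import torch

class RunningGaussian:
    """Online per-category mean and covariance estimate, following (2)-(3)."""

    def __init__(self, dim):
        self.n = 0                          # samples seen so far: n_(t-1)
        self.mu = torch.zeros(dim)          # running mean
        self.sigma = torch.zeros(dim, dim)  # running covariance

    def update(self, feats):
        """feats: (m, dim) features of this category in the current batch."""
        m = feats.size(0)
        if m == 0:
            return
        mu_b = feats.mean(dim=0)                 # batch mean mu'_(t)
        centered = feats - mu_b
        sigma_b = centered.t() @ centered / m    # batch covariance Sigma'_(t)
        n, total = self.n, self.n + m
        delta = (self.mu - mu_b).unsqueeze(1)    # (dim, 1) mean shift
        # Eq. (3): weighted covariances plus a mean-shift correction term
        self.sigma = (n * self.sigma + m * sigma_b) / total \
                     + (n * m) * (delta @ delta.t()) / total**2
        # Eq. (2): weighted mean update
        self.mu = (n * self.mu + m * mu_b) / total
        self.n = total
```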

3.2 Contrastive domain adaptation

Recently, several methods have leveraged category feature centroids [4, 47] or instance and stuff features in the source domain as anchors to remedy the domain shift problem. However, such anchors merely preserve the basic characteristics of each category, at the expense of the diversity and discriminability within the category. Additionally, their potential is severely limited by an insufficient margin between categories.

By contrast, our approach exploits the statistics of each category's distribution for alignment at the sample level, which differs from previous methods. A particular form of contrastive loss is obtained by incorporating multiple positive/negative sample pairs into our framework. To improve UDA, this modification forces similar and dissimilar pairs to establish stronger intra-category and inter-category connections.

Therefore, every sample representation in the source and target features should yield a low loss value. We combine multiple positive sample pairs \(a^{m+}\) and negative sample pairs \(a^{n-}_j\), where \(a^{m+}\) denotes the \(m^{th}\) positive example from the same category and \(a^{n-}_j\) denotes the \(n^{th}\) negative example from the \(j^{th}\) different category. The sample-wise contrastive loss is formally defined as:

$$\begin{aligned} {\mathcal {L}^{M,N}_i = -\frac{1}{M}\sum _{m=1}^M\log \frac{e^{a_i^{\top } a^{m+}/\tau }}{e^{a_i^\top a^{m+}/\tau } + \sum _{j=1}^{K-1}\frac{1}{N}\sum _{n=1}^N e^{a_i^\top a^{n-}_{j}/\tau }}\,,} \end{aligned}$$
(4)

where M and N denote the numbers of positive and negative examples, respectively. A naive implementation of \(\mathcal {L}^{M,N}\) explicitly samples M examples from the semantic distribution sharing the latent class of \(a_i\), and N examples from each distribution with a different semantic label.
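For reference, a Monte Carlo sketch of this naive implementation is given below, assuming the per-category semantic distributions are available as `torch.distributions` objects; all names are illustrative.

```python
import torch

def sampled_contrastive_loss(a_i, pos_dist, neg_dists, M=64, N=64, tau=0.07):
    """Naive Monte Carlo implementation of Eq. (4).

    a_i:       (d,) sample representation
    pos_dist:  distribution of the same category as a_i
    neg_dists: list of K-1 distributions with different semantic labels
    """
    pos_logits = pos_dist.sample((M,)) @ a_i / tau       # (M,) positive terms
    # sum over categories of the average of N negatives per category
    neg_term = sum(torch.exp(d.sample((N,)) @ a_i / tau).mean()
                   for d in neg_dists)
    # -log of the softmax-style ratio, averaged over the M positives
    return -(pos_logits
             - torch.log(torch.exp(pos_logits) + neg_term)).mean()
```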

By taking the limit as M and N go to infinity, we absorb their effect probabilistically while achieving the same goal of multiple pairing. Mathematically, as M and N approach infinity, \(\mathcal {L}^{M,N}\) becomes an estimate of the following:

$$\begin{aligned} \mathcal {L}^{\infty }_i&= \lim _{\begin{array}{c} M\rightarrow \infty \\ N\rightarrow \infty \end{array}} \mathcal {L}^{M,N}_i \nonumber \\&= -\mathbb {E}_{\begin{array}{c} a^{+} \sim p(a^+) \\ a^{-}_j\sim p(a^{-}_j) \end{array}} \log \frac{e^{a_i^\top a^+/\tau }}{e^{a_i^\top a^+/\tau } + \sum _{j=1}^{K-1} e^{a_i^\top a^{-}_j/\tau }} \,, \end{aligned}$$
(5)

where the positive semantic distribution shares the semantic label of \(a_i\) and the negative semantic distributions carry different semantic labels. Although its analytic form is intractable, it admits a rigorous closed-form upper bound:

$$\begin{aligned}&-\mathbb {E}_{\begin{array}{c} a^{+}, a^{-}_j \end{array}} \log \frac{e^{a_i^\top a^+/\tau }}{e^{a_i^\top a^+/\tau } + \sum _{j=1}^{K-1} e^{a_i^\top a^{-}_j/\tau }} \nonumber \\&= \mathbb {E}_{a^+}\left[ \log \left[ e^{\frac{a_i^\top a^+}{\tau }} + \sum _{j=1}^{K-1} \mathbb {E}_{a^{-}_j}e^{\frac{a_i^\top a^{-}_j}{\tau }}\right] \right] - \mathbb {E}_{a^+}\left[ \frac{a_i^\top a^+}{\tau }\right] \end{aligned}$$
(6)
$$\begin{aligned}&\le \log \left[ \mathbb {E}_{a^+}\left[ e^{\frac{a_i^\top a^+}{\tau }}+\sum _{j=1}^{K-1} \mathbb {E}_{a^{-}_j}e^{\frac{a_i^\top a^{-}_j}{\tau }}\right] \right] - a_i^\top \mathbb {E}_{a^+}\left[ \frac{a^+}{\tau }\right] \end{aligned}$$
(7)
$$\begin{aligned}&= \log \left[ \mathbb {E}_{a^+}e^{\frac{a_i^\top a^+}{\tau }} + \sum _{j=1}^{K-1} \mathbb {E}_{a^{-}_j}e^{\frac{a_i^\top a^{-}_j}{\tau }}\right] - a_i^\top \mathbb {E}_{a^+}\left[ \frac{a^+}{\tau }\right] \end{aligned}$$
(8)
$$\begin{aligned}&= \bar{\mathcal {L}_i} \end{aligned}$$
(9)

To facilitate our formulation, we make a further assumption on the feature distribution. For a random variable a that follows a Gaussian distribution \(a\sim \mathcal {N}(\mu , \Sigma )\), where \(\mu \) is the expectation of a and \(\Sigma \) is the covariance matrix of a, the moment-generating function [48] satisfies:

$$\begin{aligned} \mathbb {E} \left[ e^{x^\top a}\right] = e^{x^\top \mu + \frac{1}{2}x^\top \Sigma x}\,. \end{aligned}$$
(10)
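As a quick sanity check, (10) can be verified numerically by Monte Carlo sampling; the dimensions and sample count below are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
d = 4
x = torch.randn(d)                       # the fixed vector x in (10)
mu = torch.randn(d)                      # mean of the Gaussian variable a
L = torch.randn(d, d) * 0.3
sigma = L @ L.t() + 0.1 * torch.eye(d)   # a positive-definite covariance

# Empirical estimate of E[exp(x^T a)] with a ~ N(mu, sigma)
dist = torch.distributions.MultivariateNormal(mu, covariance_matrix=sigma)
empirical = torch.exp(dist.sample((200_000,)) @ x).mean()

# Closed form from (10): exp(x^T mu + 0.5 * x^T sigma x)
closed = torch.exp(x @ mu + 0.5 * x @ sigma @ x)
print(empirical.item(), closed.item())   # the two values should agree closely
```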

Under the Gaussian assumption \(a^{+} \sim \mathcal {N}(\mu ^{+}, \Sigma ^{+})\,, a^{-}_j\) \(\sim \mathcal {N}(\mu ^{-}_j, \Sigma ^{-}_j)\), along with (10), we find that (8) for a given sample representation \(a_i\) immediately reduces to:

$$\begin{aligned}&\bar{\mathcal {L}_i} \nonumber \\&=\log \left[ e^{\frac{a_i^\top \mu ^{+}}{\tau } + \frac{a_i^{\top }\Sigma ^{+} a_i}{2\tau ^2}} + \sum _{j=1}^{K-1} e^{\frac{a_i^\top \mu _{j}^{-}}{\tau } + \frac{a_i^\top \Sigma _{j}^{-} a_i}{2\tau ^2}} \right] - \frac{a_i^\top \mu ^{+}}{\tau } \end{aligned}$$
(11)
$$\begin{aligned}&=-\log \frac{e^{\frac{a_i^\top \mu ^{+}}{\tau }+\frac{a_i^\top \Sigma ^{+} a_i}{2\tau ^2}}}{e^{\frac{a_i^\top \mu ^{+}}{\tau }+\frac{a_i^\top \Sigma ^{+} a_i}{2\tau ^2}}+\sum _{j=1}^{K-1} e^{\frac{a_i^\top \mu _{j}^{-}}{\tau }+\frac{a_i^\top \Sigma _{j}^{-} a_i}{2\tau ^2}}} + \frac{a_i^\top \Sigma ^{+}a_i}{2 \tau ^2} \,. \end{aligned}$$
(12)
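A minimal sketch of the closed-form loss (12) for a single sample representation is given below; `F.cross_entropy` reproduces the \(-\log \) softmax term in a numerically stable way, and all names are illustrative rather than our released implementation.

```python
import torch
import torch.nn.functional as F

def idd_icl_loss(feat, label, mus, sigmas, tau=0.07):
    """Closed-form contrastive loss of (12) for one sample.

    feat:   (d,)      sample representation a_i
    label:  int       category of a_i (selects the positive distribution)
    mus:    (K, d)    per-category means
    sigmas: (K, d, d) per-category covariance matrices
    """
    # logits z_k = a^T mu_k / tau + a^T Sigma_k a / (2 tau^2), one per category
    mean_term = mus @ feat / tau                                   # (K,)
    quad_term = torch.einsum('d,kde,e->k', feat, sigmas, feat) / (2 * tau**2)
    logits = mean_term + quad_term                                 # (K,)
    # -log softmax at the positive category, plus the variance penalty of (12)
    nll = F.cross_entropy(logits.unsqueeze(0),
                          torch.tensor([label], dtype=torch.long))
    return nll + quad_term[label]
```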
Algorithm 1 IDD_ICL algorithm

Table 2 Accuracy (%) on Office-31 for UDA (ResNet-50), the bold font indicates the best results
Table 3 Accuracy (%) on Office-Home for UDA (ResNet-50), the bold font indicates the best results

3.3 Overall formulation

The mutual information I(X;Y) measures the degree of dependence between two random variables. Because target features and predictions are strongly correlated, maximizing it encourages our semantic augmentations to capture meaningful rather than trivial semantic information. We therefore maximize the mutual information on the target data, i.e., minimize the loss in (13):

$$\begin{aligned} \mathcal {L}_{MI} = \sum _{k=1}^{K}\hat{P}^{k} \log \hat{P}^{k} - \frac{1}{n_{t}}\sum _{j=1}^{n_{t}}\sum _{k=1}^{K}{P_{tj}^{k}} \log P_{tj}^{k}, \end{aligned}$$
(13)

where \(\varvec{\hat{P}} = \frac{1}{n_{t}} \sum _{j=1}^{n_t} \varvec{P}_{tj}\). The ground-truth distribution on the target domain is approximated by the average of the target predictions.
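A direct PyTorch sketch of (13) follows, assuming `probs` holds the softmax predictions for a batch of target samples, whose batch average stands in for \(\varvec{\hat{P}}\).

```python
import torch

def mutual_info_loss(probs, eps=1e-8):
    """Mutual-information loss of (13).

    probs: (n, K) softmax predictions on target samples.
    Minimizing this increases marginal entropy (diverse predictions)
    and decreases conditional entropy (confident predictions).
    """
    p_hat = probs.mean(dim=0)                          # marginal \hat{P}
    marginal = (p_hat * torch.log(p_hat + eps)).sum()  # sum_k P^k log P^k
    conditional = (probs * torch.log(probs + eps)).sum(dim=1).mean()
    return marginal - conditional
```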

Therefore, IDD_ICL serves the following objective functions:

$$\begin{aligned} \mathcal {L}_{IDD\_ICL}&=\bar{\mathcal {L}_i}+\alpha \mathcal {L}_{MI}, \end{aligned}$$
(14)

where \(\alpha \) is the trade-off parameter. We summarize our training process in Algorithm 1. A detailed analysis of IDD_ICL is provided in the ablation study.
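To show how the pieces fit together, the following sketch combines the terms of (14) in one training step, reusing the sketches above; the model interface, the pseudo-labeling of target samples, and the added supervised source loss are our assumptions about Algorithm 1, not its verbatim content.

```python
import torch
import torch.nn.functional as F

alpha = 0.1  # trade-off parameter in (14), selected by deep embedded validation

def train_step(model, optimizer, src_batch, tgt_batch, gaussians, tau=0.07):
    """One illustrative step of Algorithm 1.

    `model` is assumed to return (features, logits); `gaussians` is a list of
    K RunningGaussian estimators; idd_icl_loss / mutual_info_loss are the
    sketches above.
    """
    (xs, ys), xt = src_batch, tgt_batch
    fs, logits_s = model(xs)              # source features and predictions
    ft, logits_t = model(xt)              # target features and predictions

    cls = F.cross_entropy(logits_s, ys)   # supervised loss on source labels

    # update the per-category semantic distributions with source features
    for k, g in enumerate(gaussians):
        g.update(fs[ys == k].detach())
    mus = torch.stack([g.mu for g in gaussians])
    sigmas = torch.stack([g.sigma for g in gaussians])

    # contrastive alignment: true labels on source, pseudo-labels on target
    yt = logits_t.argmax(dim=1)
    con = torch.stack(
        [idd_icl_loss(f, int(y), mus, sigmas, tau) for f, y in zip(fs, ys)] +
        [idd_icl_loss(f, int(y), mus, sigmas, tau) for f, y in zip(ft, yt)]
    ).mean()

    loss = cls + con + alpha * mutual_info_loss(logits_t.softmax(dim=1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```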

Table 4 Accuracy (%) on VisDA-2017 for UDA (ResNet-101), the bold font indicates the best results
Table 5 Accuracy (%) on Office-31 for UDA (ResNet-50), the bold font indicates the best results
Fig. 3 Accuracy with different values of the hyper-parameter \(\alpha \)

4 Experiment

4.1 Datasets

Office-31 [66] contains images of 31 distinct categories from three distinct domains: Amazon (A), Webcam (W), and DSLR (D). Amazon: 2,817 images, an average of about 90 per class, with clean backgrounds. Webcam: 795 images that exhibit significant noise, color, and white-balance artifacts. DSLR: 498 images, 5 objects per class, with each object captured on average 3 times from different viewpoints.

Office-Home [67] contains 15,500 images divided into 65 categories across four domains: Artistic (Ar), Clip Art (Cl), Product (Pr), and Real-World (Rw). Office-Home is more complex than Office-31.

VisDA-2017 [68] is a large-scale synthetic-to-real dataset spanning 12 categories. Following [34], we use 152,397 synthetic images as the source domain and 55,388 realistic images as the target domain.

4.2 Implementation details

Our backbone network for these datasets is a ResNet [69] pre-trained on ImageNet [70]. The experiments in this paper are implemented in PyTorch [71]. For network optimization, we use a mini-batch SGD optimizer with momentum 0.9, and deep embedded validation [72] to select the hyperparameter \(\alpha \) from \(\{0.01, 0.05, 0.1, 0.15, 0.2\}\). On all datasets, we find \(\alpha =0.1\) works well.

4.3 Results

Results on Office-31 are reported in Table 2. IDD_ICL outperforms JAN and DANN by a large margin, showing that the proposed contrastive learning is indispensable for UDA. In particular, IDD_ICL improves GSP by 2.5%, demonstrating that IDD_ICL complements previous UDA methods. Additionally, IDD_ICL is superior to recent classifier adaptation methods such as SymNets and TAT, showing that it is capable of exploiting useful semantic information for better diversity and discrimination.

Results on Office-Home are reported in Table 3. With large domain discrepancies, Office-Home is one of the most challenging datasets for UDA. IDD_ICL consistently improves generalization ability over the compared methods. In particular, IDD_ICL enhances GVB-GD's accuracy by 4.1%, with the average accuracy reaching 74.5%. These promising results indicate that IDD_ICL stably enhances classifier transferability across cross-domain datasets.

Results on VisDA-2017 are reported in Table 4. IDD_ICL performs dramatically better than the other augmentation methods. In general, IDD_ICL generates better augmentation results because it exploits the mean difference and target covariance to capture class-wise semantic information, which further proves its effectiveness and versatility over the baseline methods.

4.4 Analysis

Ablation study. To demonstrate the effectiveness of the proposed method, we use different methods, namely CDAN, DANN and BSP, as baselines. As shown in Table 5, our IDD_ICL method yields large improvements over the compared methods, including an 8% gain on DANN, which further demonstrates its effectiveness. When ablating the IDD_ICL component from these methods, the model without it produces worse classification accuracies on these tasks, demonstrating that contrastive learning information in the target domain makes a significant contribution to domain adaptation. All ablated variants produce inferior results, while the full model with IDD_ICL performs best. This validates the contributions of the proposed method.

Hyper-parameter sensitivity To study how the hyper-parameter \(\alpha \) affects the performance of our method, a sensitivity test is conducted. We ran experiments on two Office-31 tasks, A\(\rightarrow \)W and W\(\rightarrow \)A, varying \(\alpha \in \{0.01, 0.05, 0.1, 0.15, 0.2\}\). Figure 3 shows that IDD_ICL is not very sensitive to \(\alpha \) and achieves competitive results across different hyper-parameter values. Empirically, we recommend \(\alpha =0.1\) for a naive implementation.

Quantitative distribution discrepancy The distribution discrepancy between the source and target domains, measured by the \(\mathcal {A}\)-distance [73], is used here to evaluate the functionality of each component of our model, as shown in Fig. 4. Following [73], the \(\mathcal {A}\)-distance is defined as \(d_{\mathcal {A}}=2(1-2\epsilon )\), where \(\epsilon \) is the classification error of a binary domain classifier discriminating between the source and target domains. In general, the smaller the \(\mathcal {A}\)-distance, the better the distribution alignment. As shown in Fig. 4, the \(\mathcal {A}\)-distance between the two domains is smaller with our model than with the other three baselines. In other words, our model reduces the domain discrepancy gap more effectively.
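For reproducibility, a minimal sketch of this \(\mathcal {A}\)-distance computation on extracted features is given below; the linear domain classifier and the 50/50 train/test split are common-practice assumptions rather than details taken from [73].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def a_distance(source_feats, target_feats):
    """Proxy A-distance d_A = 2(1 - 2*eps), eps = domain-classifier error."""
    X = np.concatenate([source_feats, target_feats])
    y = np.concatenate([np.zeros(len(source_feats)),
                        np.ones(len(target_feats))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                              random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    eps = 1.0 - clf.score(X_te, y_te)   # binary domain-classification error
    return 2.0 * (1.0 - 2.0 * eps)
```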

5 Conclusion

This paper proposes a novel contrastive learning approach for improving diversity and discriminability for domain adaptation (IDD_ICL). Through sample-wise alignment guided by semantic distributions, the IDD_ICL model successfully adapts to the target domain. For each sample-wise representation in both domains, we use a particular form of contrastive loss that implicitly involves learning infinitely many similar/dissimilar sample pairs, and we derive a practical implementation of this intractable loss. This simple strategy, combined with self-supervised learning, is surprisingly effective. IDD_ICL is superior on a variety of benchmarks, as demonstrated by the experimental results.

Fig. 4 The \(\mathcal {A}\)-distance of IDD_ICL with the different baselines (DANN, CDAN, BSP)

One limitation of our study is that it focuses only on image classification. In future work, we plan to extend our benchmarking to semantic segmentation tasks. Additionally, we have only considered the closed-set domain adaptation scenario, where the source and target classes are identical. In future work, we aim to consider partial and open-set domain adaptation scenarios, which are common in image classification applications and involve differing classes between the source and target domains.