1 Introduction

In real-world scenarios, various makeup techniques are used to enhance the appearance of the human face, including foundation, eyeshadow, lipstick, and complex painted patterns. Computer vision technology enables the synthesis of facial makeup in images and is widely used in photo beautification, video enhancement, digital entertainment, live streaming, and virtual makeup applications. Typically, digital makeup is achieved by transferring and removing various styles between faces with and without makeup while preserving facial identity. Makeup transfer has two main objectives: first, transferring low-frequency color features and high-frequency patterns from the makeup image to the non-makeup image; second, preserving the facial identity of the non-makeup image. Similarly, digital makeup removal is two-fold: the color, texture, and patterns present on the makeup face must be removed accurately without introducing unnatural boundaries, color distortions, or other artifacts, and the facial identity of the makeup image must be preserved.

Existing makeup transfer and removal research can be classified into two main categories. The first category includes traditional image processing-based methods, such as image gradient editing [1] and physical operations [2, 3], which are often complex and computationally expensive. The second category encompasses deep learning-based methods, which are more efficient and have become mainstream. For example, BeautyGAN [4] effectively transfers light makeup styles between two images, while PSGAN (Pose and expression robust Spatial-aware Generative Adversarial Network) [5] and SCGAN (Style-based Controllable Generative Adversarial Network) [6] adequately transfer makeup with misaligned spatial positions. However, these methods fail to transfer extreme makeup styles with complex patterns effectively. Transferring extreme makeup styles with LADN (Local Adversarial Disentangling Network) [7] and CPM (Color-&-Pattern Makeup) [8] is feasible; however, LADN may alter the makeup color, and CPM displays noticeable boundaries and distortion. Also, while BeautyGAN [4], SCGAN [6], and LADN [7] can be used for makeup removal, issues such as color residue and background changes may occur during the removal of extreme styles.

To address the above problems, we propose a multi-granularity facial makeup transfer and removal model with local-global collaboration that can generate real and natural makeup results for light and extreme makeup styles with complex patterns.

Our main contributions are as follows:

(1) We introduce a multi-discriminator combination strategy with a novel architecture comprising one content discriminator, two global discriminators, and 10 local discriminators. This architecture improves the model's performance by effectively handling both coarse-grained overall makeup and fine-grained local makeup details.

(2) We propose 10 novel local discriminators for 10 landmark-defined facial patches, which distinguish makeup style consistency and provide feedback signals. Combined with the global discriminators, they enable comprehensive makeup detection and correction at both the global and local levels.

(3) We present a novel local loss function to ensure color distribution consistency between the transferred and reference images’ facial patches. This function enables our model’s generator to accurately capture and reproduce makeup details, leading to high-quality transferred makeup results.

(4) To prevent the alteration of unrelated content and mitigate the background's negative impact on makeup transfer and removal, we process the facial and background regions separately. The discriminators evaluate only the facial region, and the generator merges the generated facial region with the background from the makeup/non-makeup image to produce the final image, leading to more natural and realistic effects.

2 Related work

2.1 Traditional makeup transfer methods

Traditional makeup transfer methods employ filtering, analogies, and texture synthesis operations to transfer makeup between images. Tong et al. [1] calculated the changes between two images by applying geometrically transformed Laplacian operators and mapping patterns from makeup to non-makeup images. This method requires highly accurate alignment of the images' poses and expressions, limiting its applicability. In contrast, Guo et al. [2] decomposed images into facial structure, skin detail, and color layers after alignment and then transferred the information layer by layer between the images. However, this method entails a complex preprocessing step. Similarly, Li et al. [3] divided the image into three layers (the retro, diffuse, and specular reflection layers) using a physical model; the makeup is transferred by changing the parameters within the corresponding layers. In this case, the transferred results depend on the physical model's layering accuracy. Liu et al. [9] and Scherbaum et al. [10] proposed recommendation and synthesis systems that generate the most suitable makeup effects based on frontal facial images provided by a user. However, these traditional image processing-based methods cannot be used for makeup removal.

2.2 Deep learning-based makeup transfer and removal methods

Recently, deep learning methods have made significant progress in the field of image style transfer. Gao et al. [11] proposed a personalized makeup transfer network that includes an outline feature extraction module and a contour loss, achieving transfer by learning outline correspondence. However, this method cannot effectively preserve facial identity during light makeup transfer. Tiwari et al. [12] proposed the COCOTA (Cooperative COlorizaTion of Achromatic Faces) framework, which uses only monocular color and achromatic facial images to estimate colored 3D faces.

Yuan et al. [13] proposed a framework based on dynamic neural radiance fields for transferring makeup between images with different poses and expressions, and introduced a novel hybrid loss to significantly improve the transfer effect. Similarly, Li et al. [14] proposed a hybrid transformer model with attention-guided spatial embeddings for makeup transfer and removal, which can produce accurate and robust results even in images with large spatial misalignment and facial occlusion. Zhang et al. [15] introduced POFMakeup to transfer makeup styles from a Peking Opera face image to another using position and appearance guidance, ensuring semantic consistency and preserving the subject’s appearance.

Moreover, Yan et al. [16] presented the BeautyREC model, which localizes makeup between corresponding components in images, such as the eyes, lips, and skin. It uses a transformer’s long-range visual dependencies to achieve effective global makeup transfer. Sun et al. [17] developed a SSAT (symmetric semantic-aware transformer) network for makeup transfer and removal. This network renders the styles from makeup to non-makeup images according to the specifications of a semantic loss and the SSCFT (symmetric semantic corresponding feature transfer) module. Meanwhile, Lu et al. [18] proposed MakeupDiffuse, a diffusion-based model that utilizes a novel double image controller to manipulate the recognition process and adjust the makeup style, achieving precise transfer.

Goodfellow et al. [19] proposed the GAN (generative adversarial networks) framework for machine learning, which comprises a generator and a discriminator. This framework utilizes a backpropagation algorithm for parameter updates and optimizes the generator’s capabilities through adversarial training. It has proven to be a powerful makeup transfer tool. Likewise, Li et al. [20] proposed a Dual HyperST network for one-shot domain adaptation, which models the target domain’s textual and visual style through a text- and style-guide path. Also, this network adopts a dynamic domain transfer strategy to improve the style transfer process’s adjustment from the source to the target domain. Zhu et al. [21] proposed a novel DCCMF-GAN (double cycle consistently constrained multi-feature discrimination generative adversarial networks). This network comprises the nested CycleI and CycleII networks and is constrained by “cycle consistency 1” and “cycle consistency 2,” respectively.

Li et al. [4] proposed the BeautyGAN framework with dual inputs and outputs, which can simultaneously perform makeup transfer and removal. This framework enhances the accuracy of instance-level makeup transfer by matching the facial components in two images. In addition, Yuan et al. [22] developed RAMT-GAN (Region adaptive makeup transfer generative adversarial network) which utilizes a BeautyGAN-based dual-input/output network as a makeup transfer framework. Additionally, it incorporates the identity preservation and background invariance losses to synthesize authentic and accurate transferred makeup images. Xu et al. [23] utilized the LFFM(Low-Frequency Information Fusion Module) in the makeup transfer method to establish feature correspondence between faces, aiming to address the issue of significant pose and expression differences between the source face and the reference face. Li et al. [24] proposed a disentangled feature learning strategy to simultaneously perform makeup-invariant face verification and transfer tasks within a single generative network. This network preserves a face’s geometry by integrating a corresponding dense field and introduces a cosmetic loss makeup learning style to preserve the visual effects after transfer. Gu et al. [7] introduced LADN, which utilizes multiple overlapping local discriminators and asymmetric loss functions to ensure detail consistency during the transfer and removal of extreme makeup styles. However, this method cannot resolve the spatial misalignment issues between makeup and non-makeup images.

The framework proposed by Nguyen et al. [8] contains color and pattern branches and uses UV maps to eliminate shape, pose, and facial expression variations during training. Jiang et al. [5] introduced PSGAN to address the issues resulting from different head poses and facial expressions by incorporating an attention mechanism. Then, Liu et al. [25] proposed PSGAN++, a PSGAN-based model that adds local makeup transfer and removal. Similar to PSGAN, this model encounters challenges in achieving satisfactory makeup transfer results with images containing shadow effects. Deng et al. [6] proposed SCGAN, a style-based controllable generative adversarial network that transfers makeup styles between images with significant spatial misalignment by discarding spatial information when encoding images. Likewise, Chen et al. [26] proposed a makeup transfer model based on the Swin Transformer GAN. In addition, they proposed an improved PSC-GAN (progressive generative adversarial network) model that achieves high-quality makeup transfer effects by incorporating semantic perception and channel attention mechanisms.

Moreover, Yang et al. [27] used the Sow attention module in EleGANt to reduce computational costs during feature extraction. In this module, the feature maps are encoded into a pyramid structure to preserve high-frequency information and can be edited to achieve controlled makeup transfer within arbitrary regions. Hao et al. [28] proposed the CUMTGAN (Controllable U-Net makeup transfer generative adversarial network) model, which uses facial semantic segmentation results to perform makeup transfer and removal on specific regions. This model applies a U-Net structure and a superimposition module for hierarchical feature fusion, achieving controllability. Similarly, Fang et al. [29] proposed the AM Net learning framework, which can perform facial makeup transfer across different age groups while preserving identity information. Xu et al. [30] proposed the TSEV-GAN (target-aware makeup style encoding and verification generative adversarial network), which accurately captures makeup styles and imposes style representation learning on conditional discriminators, enabling accurate transfer. Furthermore, Chen et al. [31] proposed the AEP-GAN (aesthetic enhanced perception generative adversarial network), which comprises an aesthetic deformation perception block, a synthesis and removal block, and a dual-agent identification block, enabling end-to-end automatic facial beautification.

3 Proposed approach

We propose a multi-granularity facial makeup transfer and removal model with local-global collaboration. In our model, the content and makeup style features are separated using a content discriminator, and the authenticity of the entire face and specific facial patches are distinguished by two global and 10 local discriminators, respectively. Combining the discriminators forces the generator to produce more realistic and natural results. Additionally, we introduce a local loss function to constrain the color distribution consistency between the transferred makeup facial patches and those on the corresponding reference image. This loss function enables our model to transfer each patch’s makeup details more effectively. We also introduce face parsing maps into the generator and discriminators to separate the images’ facial and background regions. This approach maintains the unrelated content’s invariance and mitigates the negative impact of background information on makeup transfer and removal.

Fig. 1 Network architecture

3.1 Problem formulation

In this paper, A refers to the non-makeup image domain, and B is the makeup image domain. Note that non-makeup and makeup images do not form pairs with the same identity. We assume that the non-makeup image is a special instance of the makeup image, thereby allowing removal and transfer to be treated as the same problem. Specifically, given four inputs: a non-makeup \(I_s \in A\) and makeup image \(I_m \in B\), and a non-makeup \({M_s}\) and a makeup face parsing map \({M_m}\), the model can generate a transferred makeup \(I_s^B \in B\) and de-makeup image \(I_m^A \in A\). In this model, \(I_s^B\) synthesizes the makeup style of \(I_m\) while preserving the identity and background information of \(I_s\). Then, \(I_m^A\) synthesizes the makeup style of \(I_s\) while preserving the identity and background information of \(I_m\). We aim to learn the mapping between two domains and achieve makeup transfer and removal: \(\Omega :{I_s},{I_m} \rightarrow I_s^B,I_m^A\).

3.2 Network architecture

Our network's overall framework is displayed in Fig. 1. The network comprises the makeup style and content encoders, a generator, a content discriminator, two global discriminators, and 10 local discriminators. Using the input images \(I_s \in A\) and \(I_m \in B\), the makeup style encoders extract the latent features \(F_s^M\) and \(F_m^M\), and the content encoders extract the content features \(F_s^C\) and \(F_m^C\), such as facial structure and expression. Moreover, the content discriminator makes it possible to distinguish the unrelated content features from the makeup style features. Given the input images \(I_s\) and \(I_m\), the face parsing maps \(M_s\) and \(M_m\), and the features \(F_s^C\), \(F_m^C\), \(F_s^M\) and \(F_m^M\), we first combine the features extracted by the different encoders from the two images to generate outputs that transfer styles between the non-makeup and makeup images. We then leverage the face parsing maps to extract the facial and background regions from the network's outputs and the input images separately, combining them to generate the transferred makeup image \(I_s^B \in B\) and the de-makeup image \(I_m^A \in A\). Our model uses the global discriminators to distinguish the makeup style consistency of the facial regions extracted from the input and generated images, while the local discriminators distinguish the makeup style consistency of the facial patches separated by landmarks. By combining the local and global discriminators, the model can accurately assess the images' makeup style and generate high-quality results.

3.2.1 Content encoder

In our model, the content encoder \(\mathrm EC\) contains multiple convolutional, activation function, and normalization layers. It is designed to capture facial contours and shapes; the positions of the eyes, nose, and mouth; and higher-level abstract features such as facial expression, posture, gender, and age. The encoder converts an image into a latent content space and helps the generator preserve identity through the content features when generating makeup transfer and removal results. The process of extracting the content features \(F_s^C\) and \(F_m^C\) is expressed as follows:

$$\begin{aligned} \ F_s^C = \text {EC}({I_s}), \quad F_m^C = \text {EC}({I_m})\ . \end{aligned}$$
(1)
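For concreteness, the following PyTorch sketch shows one plausible realization of such a content encoder; the layer count, kernel sizes, and channel widths are illustrative assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, in_ch=3, base_ch=64, n_downsample=2):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base_ch, 7, 1, 3),
                  nn.InstanceNorm2d(base_ch),
                  nn.ReLU(inplace=True)]
        ch = base_ch
        for _ in range(n_downsample):          # progressively halve spatial size
            layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1),
                       nn.InstanceNorm2d(ch * 2),
                       nn.ReLU(inplace=True)]
            ch *= 2
        self.net = nn.Sequential(*layers)

    def forward(self, x):                      # x: (N, 3, H, W) face image
        return self.net(x)                     # content features F^C (Eq. (1))
```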

3.2.2 Makeup style encoder

The makeup style encoder \(\mathrm EM\) contains a Conv-Relu-Conv-BN layer that is used to capture makeup style features, such as eyeshadow, lipstick, and foundation. Using this structure, the first convolutional layer extracts basic features, while the second extracts more complex features based on the basic ones. This enables the generator to produce images with specific makeup styles using richer and more diverse makeup style features extracted by the encoder. The process of extracting makeup style features \(F_s^M\) and \(F_m^M\) is expressed as follows:

$$\begin{aligned} \ F_s^M = \text {EM}({I_s}), \quad F_m^M =\text {EM}({I_m})\ . \end{aligned}$$
(2)
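A minimal sketch of the Conv-ReLU-Conv-BN style encoder could look as follows; the channel widths, style dimensionality, and the global average pooling used to collapse the spatial dimensions are assumptions for illustration.

```python
import torch.nn as nn

class MakeupStyleEncoder(nn.Module):
    def __init__(self, in_ch=3, mid_ch=64, style_dim=8):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 4, 2, 1)       # basic features
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(mid_ch, style_dim, 4, 2, 1)   # richer features
        self.bn = nn.BatchNorm2d(style_dim)

    def forward(self, x):
        h = self.bn(self.conv2(self.relu(self.conv1(x))))
        return h.mean(dim=(2, 3))              # style code F^M (Eq. (2))
```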

3.2.3 Generator

The face parsing map is an image processing technique that enables the precise segmentation of a face image into distinct semantic regions, encompassing skin, eyes, mouth, and hair. In this paper, we utilized face parsing maps to separate facial and background regions, thereby ensuring background consistency between the generated and input images. In general, the facial region pixels primarily encompass makeup-related regions such as the eyebrows, eyeshadow, lips, skin, and nose, while the remaining pixels correspond to background regions including hair, teeth, and eyes. Based on the face parsing map, we obtained the final result by combining facial and background region pixels from the output and input images, respectively.
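The parsing-guided compositing can be summarized by the following sketch, where the set of labels treated as the facial region is an assumption; it keeps the generated facial pixels and copies all other pixels from the input image.

```python
import torch

def composite_with_parsing(output, source, parsing, face_labels):
    """output, source: (N, 3, H, W); parsing: (N, H, W) integer label map."""
    face_mask = torch.zeros_like(parsing, dtype=torch.bool)
    for lbl in face_labels:        # e.g. skin, lips, nose, eyebrows, eyeshadow
        face_mask |= parsing == lbl
    face_mask = face_mask.unsqueeze(1).float()               # (N, 1, H, W)
    # keep the generated face, copy the background (hair, teeth, eyes, scene)
    return face_mask * output + (1.0 - face_mask) * source
```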

Fig. 2
figure 2

Global discriminator \(\mathrm D_A\) structure

In the generator module, we used the content features (\(F_s^C\) and \(F_m^C\)), the makeup style features (\(F_s^M\) and \(F_m^M\)), the non-makeup and makeup images (\(I_s\) and \(I_m\)), and their corresponding face parsing maps (\(M_s\) and \(M_m\)) as inputs. The generator is divided into two branches for makeup transfer and removal. Each branch contains a feature fusion module, three Conv-BN-Relu and upsampling layers, and a Conv-Tanh layer. We upsampled the first two Conv-BN-Relu layers' features to preserve the input images' spatial information and structure, which is crucial for generating realistic images. Specifically, the non-makeup image \(I_s\), its face parsing map \(M_s\), the non-makeup content features \(F_s^C\), and the makeup style features \(F_m^M\) are sent to the transfer branch. First, we acquired the features \(F_{sm}^{CM}\) by fusing the content features \(F_s^C\) and the makeup style features \(F_m^M\) within the feature fusion module. Second, we fed the fused features \(F_{sm}^{CM}\) into the network layers to obtain an output. Finally, we generated the final transferred makeup image \(I_s^B\) by extracting the facial region from the output image and copying the background region from \(I_s\) using \(M_s\); the result contains the identity and background of \(I_s\) and the makeup style of \(I_m\). Similarly, the makeup image \(I_m\), its face parsing map \(M_m\), the makeup content features \(F_m^C\), and the non-makeup style features \(F_s^M\) are sent to the removal branch. The generator then produces the de-makeup image \(I_m^A\), which includes the identity and background of \(I_m\) and the makeup style of \(I_s\).

During makeup transfer and removal, face parsing maps are employed to extract facial and background regions from images, ensuring that the generated images maintain a consistent background with the input images of the same identity. The generator’s image generation process is expressed as follows:

$$\begin{aligned} \ I_s^B = \text {G}(F_{sm}^{CM},{I_s},{M_s}), \quad I_m^A = \text {G}(F_{ms}^{CM},{I_m},{M_m})\ . \end{aligned}$$
(3)
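The transfer branch can be sketched as below, assuming a simple concatenation-based feature fusion, upsampling after the first two Conv-BN-ReLU blocks, and the parsing-mask compositing of Eq. (3); the channel widths and the exact fusion mechanism are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, upsample=True):
    layers = [nn.Conv2d(cin, cout, 3, 1, 1),
              nn.BatchNorm2d(cout),
              nn.ReLU(inplace=True)]
    if upsample:
        layers.append(nn.Upsample(scale_factor=2, mode='bilinear',
                                  align_corners=False))
    return nn.Sequential(*layers)

class TransferBranch(nn.Module):
    def __init__(self, content_ch=256, style_dim=8, out_ch=3):
        super().__init__()
        ch = content_ch + style_dim
        # three Conv-BN-ReLU blocks, the first two followed by upsampling
        self.blocks = nn.Sequential(conv_block(ch, 128),
                                    conv_block(128, 64),
                                    conv_block(64, 32, upsample=False))
        self.to_rgb = nn.Sequential(nn.Conv2d(32, out_ch, 7, 1, 3), nn.Tanh())

    def forward(self, content, style, source, face_mask):
        # broadcast the style code F_m^M over the content map F_s^C and fuse
        style_map = style[:, :, None, None].expand(-1, -1, *content.shape[2:])
        fused = torch.cat([content, style_map], dim=1)        # F_{sm}^{CM}
        out = self.to_rgb(self.blocks(fused))
        # Eq. (3): keep the generated face, copy the background from I_s via M_s
        return face_mask * out + (1.0 - face_mask) * source
```

The removal branch would mirror this structure with \(F_m^C\), \(F_s^M\), \(I_m\), and \(M_m\) as inputs.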

3.2.4 Content discriminator

The content discriminator used in this paper consists of four Conv-IN-LeakyRelu layers and a Conv-Sigmoid layer. It can distinguish whether the given content features \(F_s^C\) and \(F_m^C\), extracted by the content encoder from non-makeup and makeup images, relate to the makeup style or content. By constructing an adversarial loss function between the content encoder and discriminator, the separation of content and makeup style features can be effectively improved.
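A minimal sketch of this discriminator is given below, with channel widths assumed to match the content encoder sketch above.

```python
import torch.nn as nn

class ContentDiscriminator(nn.Module):
    def __init__(self, in_ch=256, base_ch=256):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(4):                     # four Conv-IN-LeakyReLU layers
            layers += [nn.Conv2d(ch, base_ch, 3, 2, 1),
                       nn.InstanceNorm2d(base_ch),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = base_ch
        layers += [nn.Conv2d(ch, 1, 1), nn.Sigmoid()]   # Conv-Sigmoid head
        self.net = nn.Sequential(*layers)

    def forward(self, content_feat):           # F^C from the content encoder
        return self.net(content_feat)          # probability map: domain A or B
```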

Fig. 3 Local discriminator structure

3.2.5 Global discriminator

The discriminator accurately distinguishes between input and generated images, enabling the generator to produce more realistic results. We introduce two global discriminators, \(\mathrm D_A\) and \(\mathrm D_B\), to determine the authenticity of the de-makeup and reference non-makeup images, and of the transferred makeup and reference makeup images, respectively. The two global discriminators share a similar network architecture; as illustrated in Fig. 2, we take discriminator \(\mathrm D_A\) as an example. In an image, features at different scales represent information at different levels. We therefore employed a multiscale discriminator structure to obtain more robust information from various feature scales: a stack of four Conv-IN-LeakyRelu layers and a Conv-Sigmoid layer evaluates both the original-scale image and a version downsampled through average pooling. First, we employed face parsing maps to select the images' background region pixels and filled them with black, eliminating the background information. After that, the facial regions \(I_s^{face}\) and \(I_m^{Aface}\) were accurately extracted from the non-makeup image \(I_s\) and the de-makeup image \(I_m^A\) and fed into the discriminator for evaluation. Eyeballs and teeth are not considered part of the makeup; therefore, they are excluded from the facial region extraction. Evaluating the facial makeup style on its own with the global discriminators prevents the alteration of unrelated content and mitigates the negative impact of background information during makeup transfer, allowing the model to generate high-quality results. The processing procedure for the global discriminator \(\mathrm D_B\) is similar to that of \(\mathrm D_A\) and is not repeated here.
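The multiscale design can be sketched as follows, assuming the same tower is applied to the masked face at the original scale and at one average-pooled scale; layer widths are illustrative.

```python
import torch.nn as nn

def d_tower(in_ch=3, base_ch=64):
    layers, ch = [], in_ch
    for i in range(4):                         # four Conv-IN-LeakyReLU layers
        cout = base_ch * 2 ** i
        layers += [nn.Conv2d(ch, cout, 4, 2, 1),
                   nn.InstanceNorm2d(cout),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch = cout
    layers += [nn.Conv2d(ch, 1, 3, 1, 1), nn.Sigmoid()]   # Conv-Sigmoid head
    return nn.Sequential(*layers)

class GlobalDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale1 = d_tower()                # original-scale face region
        self.scale2 = d_tower()                # average-pooled face region
        self.downsample = nn.AvgPool2d(3, stride=2, padding=1)

    def forward(self, face):                   # background pixels already black
        return self.scale1(face), self.scale2(self.downsample(face))
```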

3.2.6 Local discriminator

As part of the model, we designed 10 local discriminators for 10 patches divided by facial landmarks. These discriminators assess the patches' style authenticity during transfer. Moreover, the sum of the local facial patches evaluated by the discriminators constitutes the total facial region. Based on this fine-grained division of facial regions and local evaluation strategy, the model can accurately capture and assess each part of a face, further improving the quality of the generated results. As shown in Fig. 3, we adopted the same network structure for the 10 local discriminators; each of them contains a combination of a convolutional and an activation function layer (Conv-Sigmoid) and six convolutional, normalization, and activation function layers (Conv-IN-LeakyRelu). We generated a synthesized makeup image \(W_m\) by warping and blending the makeup image \(I_m\) onto the non-makeup one \(I_s\) according to their facial landmarks. After that, we used face parsing maps to extract the facial regions \(W_m^{face}\), \(I_s^{Bface}\) and \(I_m^{face}\) from the synthesized \(W_m\), transferred \(I_s^B\), and makeup images \(I_m\). Then, we divided the facial regions \(W_m^{face}\), \(I_s^{Bface}\) and \(I_m^{face}\) into 10 local facial patches based on their facial landmarks. In the final step, the local discriminator \(\mathrm{{D}}_{\mathrm{{local}}}^i\) produces a negative result for the facial local patches \(P_m^{face \ i}\) and \(P_s^{Bface \ i}\) with different makeup styles and a positive outcome for the patches \(P_m^{face \ i}\) and \(P_W^{face \ i}\) with the same makeup style. The experimental results presented in Section 4 demonstrate that the local discriminators effectively preserve the color and complex pattern integrity during makeup transfer, while retaining facial identity.
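One local discriminator might be realized as in the following sketch, which scores a pair of landmark-cropped patches concatenated along the channel axis; the channel widths and the assumption that patches are resized to 64x64 before scoring are illustrative, and the landmark-based cropping itself is omitted.

```python
import torch
import torch.nn as nn

class LocalDiscriminator(nn.Module):
    def __init__(self, in_ch=6, base_ch=32):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(6):                     # six Conv-IN-LeakyReLU layers
            cout = min(base_ch * 2 ** i, 256)
            layers += [nn.Conv2d(ch, cout, 4, 2, 1),
                       nn.InstanceNorm2d(cout),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = cout
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, patch_a, patch_b):
        # patch pairs (e.g. P_m^face_i with P_W^face_i or P_s^Bface_i) are
        # assumed resized to 64x64 and concatenated along the channel axis
        return self.head(self.features(torch.cat([patch_a, patch_b], dim=1)))

# ten such discriminators, one per landmark-defined patch:
# local_Ds = nn.ModuleList(LocalDiscriminator() for _ in range(10))
```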

In this study, combining the local and global discriminators enables the model to assess the authenticity of the entire face and specific facial patches. During extreme makeup style transfer, the global discriminators focus on the overall makeup’s consistency, while the local ones concentrate on the makeup details. In addition, the sum of patches assessed by the local discriminators covers the entire face without excluding edge details. Thus, combining these two aspects enables capturing a more comprehensive makeup detail representation, resulting in more exhaustive feedback signals. This may improve the makeup transfer quality at the coarse- and fine-grained levels.

3.3 Objective function

3.3.1 Local loss

We calculated the local losses for 10 facial patches separated from the background, hair, eyeballs, eyebrows, and teeth. We expected the model to change the eyebrow color while preserving the eyebrow shape during makeup transfer; however, if a local loss were calculated on the eyebrow region, the transferred makeup image would tend to adopt the reference eyebrows' shape. Consequently, we relied on the other global loss functions instead of the local loss to constrain the eyebrow color transfer. As the makeup style is visually determined by color distributions, the transferred image must exhibit color distributions similar to those of the reference image. Thus, we enforced color distribution consistency for the local facial patches between the transferred and reference images, thereby ensuring coherence in the style details. First, we extracted the facial regions \(I_s^{face}\), \(I_m^{face}\) and \(I_s^{Bface}\) from the non-makeup, makeup, and corresponding transferred images using face parsing maps. Second, we divided \(I_s^{face}\), \(I_m^{face}\) and \(I_s^{Bface}\) into 10 local patches according to their respective facial landmarks. Third, we used histogram matching to adjust the histogram of the non-makeup patch \(P_s^{face \ i}\) toward that of the makeup patch \(P_m^{face \ i}\), forming the histogram-matched patch \(PHM_s^{Bface \ i}\). This patch retains the identity of the non-makeup patch \(P_s^{face \ i}\) while sharing the color distribution of the makeup patch \(P_m^{face \ i}\). Finally, we utilized the L1-norm to measure the difference between the histogram-matched facial patch \(PHM_s^{Bface \ i}\) and the transferred makeup patch \(P_s^{Bface \ i}\). This local loss function is crucial to extreme makeup style transfer and is expressed as follows:

$$\begin{aligned} PHM_s^{Bface \ i}= & \mathrm{{HM}}(P_s^{face \ i},P_m^{face \ i})\nonumber \\ L_\mathrm{{region\_hm}}= & \sum \limits _i {||PHM_s^{Bface \ i} - P_s^{Bface \ i}|{|_1}} . \end{aligned}$$
(4)

where \(\textrm{HM}\) \((\cdot )\) denotes histogram matching.
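A minimal sketch of this loss, using scikit-image's histogram matching as the \(\textrm{HM}(\cdot )\) operator (the channel_axis argument assumes scikit-image 0.19 or later), is given below. Because histogram matching is not differentiable, the matched patch acts as a fixed target and gradients flow only through the transferred patch.

```python
import torch
from skimage.exposure import match_histograms

def local_loss(src_patches, ref_patches, transferred_patches):
    """Each argument: list of 10 patches as (3, h, w) float tensors in [0, 1]."""
    loss = 0.0
    for p_s, p_m, p_sb in zip(src_patches, ref_patches, transferred_patches):
        # HM(P_s^face_i, P_m^face_i): give the source patch the reference colors
        matched = match_histograms(p_s.detach().permute(1, 2, 0).cpu().numpy(),
                                   p_m.detach().permute(1, 2, 0).cpu().numpy(),
                                   channel_axis=-1)
        target = torch.from_numpy(matched).permute(2, 0, 1).to(p_sb)
        # L1 term of Eq. (4); the target is constant, so gradients reach P_s^B only
        loss = loss + torch.abs(target - p_sb).mean()
    return loss
```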

3.3.2 Content adversarial loss

The purpose of the content discriminator \(\mathrm D_c\) is to determine the accuracy of the content features extracted from the non-makeup and makeup images. It forces the content encoder to concentrate on features other than the makeup style ones during the data encoding process. After that, the makeup style encoder encodes the remaining style features. Content adversarial loss is described as follows:

$$\begin{aligned} L_\mathrm{{content}}= & \mathbb {E}[\frac{1}{2}\log \mathrm{{D_c}}(F_s^C) + \frac{1}{2}\log (1 - \mathrm{{D_c}}(F_s^C))] \nonumber \\ & + \mathbb {E}[\frac{1}{2}\log \mathrm{{D_c}}(F_m^C) + \frac{1}{2}\log (1 - \mathrm{{D_c}}(F_m^C))] . \end{aligned}$$
(5)

3.3.3 Global adversarial loss

The global discriminators \(\mathrm D_A\) and \(\mathrm D_B\) distinguish the makeup style consistency among the de-makeup, reference non-makeup, transferred makeup, and reference makeup facial regions, respectively. The global adversarial loss is as follows:

$$\begin{aligned} L_\mathrm{{adv}}^\mathrm{{D_A}}= & \mathbb {E}[\log \mathrm{{D_A}}(I_s^{face})] + \mathbb {E}[\log (1 - \mathrm{{D_A}}(I_m^{Aface}))]\nonumber \\ L_\mathrm{{adv}}^\mathrm{{D_B}}= & \mathbb {E}[\log \mathrm{{D_B}}(I_m^{face})] + \mathbb {E}[\log (1 - \mathrm{{D_B}}(I_s^{Bface}))] . \end{aligned}$$
(6)

3.3.4 Local adversarial loss

We designed 10 local discriminators for 10 patches separated from the background regions and divided by facial landmarks. The local discriminator \(\mathrm{{D}}_{\mathrm{{local}}}^i\) estimates a positive result for local patch pairs (\(P_m^{face \ i}\), \(P_W^{face \ i}\)) with the same makeup style, and generates a negative result for pairs (\(P_m^{face \ i}\), \(P_s^{Bface \ i}\)) with different styles. The local adversarial loss is as follows:

$$\begin{aligned} L_\mathrm{{local}}= & \sum \limits _i \{ \mathbb {E}[\log {\mathrm{{D}}_{\mathrm{{local}}}^i}(P_m^{face\ i},P_W^{face \ i})] \nonumber \\ & +\mathbb {E}[\log (1 - {\mathrm{{D}}_{\mathrm{{local}}}^i}(P_m^{face \ i},P_s^{Bface \ i}))]\} . \end{aligned}$$
(7)

3.3.5 Kullback-Leibler divergence loss

The makeup style encoder uses the KL (Kullback-Leibler) divergence loss to quantify the difference between the makeup style feature and prior Gaussian distributions. This ensures that the makeup style features closely approach a Gaussian distribution. It effectively prevents the makeup style feature distribution from becoming excessively discrete or concentrated within a specific feature space. This ensures enhanced makeup style feature stability and controllability across the entirety of the feature space. The KL divergence loss is calculated as follows:

$$\begin{aligned} L_\mathrm{{kl}}= & \mathbb {E}[{D_\mathrm{{KL}}}(F_s^M||\mathcal{N}(0,1)) + {D_\mathrm{{KL}}}(F_m^M||\mathcal{N}(0,1))]\nonumber \\ D_\mathrm{{KL}}(P\mathrm{{ || Q}})= & \smallint P(x)\log {\textstyle {{P(x)} \over {Q(x)}}}\mathrm{{d}}x . \end{aligned}$$
(8)

In this equation, \(\mathcal{N}(0,1)\) is the Gaussian distribution, P(x) represents the feature distribution’s probability density, and Q(x) represents the Gaussian distribution’s probability density function.
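Under the common assumption that the makeup style encoder outputs the mean and log-variance of a diagonal Gaussian, the KL term has the usual closed form instead of the integral:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # closed-form D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dims
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=1).mean()

# L_kl = kl_to_standard_normal(mu_s, logvar_s) + kl_to_standard_normal(mu_m, logvar_m)
```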

3.3.6 Reconstruction loss

The reconstruction loss comprises the cycle consistency and self-reconstruction losses. The generator generates a self-non-makeup image \(I_s^{self}\) using the content \(F_s^C\) and makeup style features \(F_s^M\) from the non-makeup image \(I_s\). In this way, it also generates a self-makeup image \(I_m^{self}\) using the content and makeup style features, \(F_m^C\) and \(F_m^M\), from the makeup image \(I_m\). The difference between these generated (i.e., \(I_s^{self}\) and \(I_m^{self}\)) and input images (i.e., \(I_s\) and \(I_m\)) is the self-reconstruction loss.

Moreover, the generator generates a cycle-non-makeup \(I_s^{cyc}\) and a cycle-makeup image \(I_m^{cyc}\) using the content and makeup style features from the transferred \(I_s^B\) and de-makeup images \(I_m^A\). The difference between these cycle-generated (i.e., \(I_s^{cyc}\) and \(I_m^{cyc}\)) and input images (i.e., \(I_s\) and \(I_m\)) is the cycle consistency loss. Thus, the reconstruction loss is expressed as follows:

$$\begin{aligned} L_\mathrm{{rec}}= & ||I_\textrm{s}^{self} - {I_s}|{|_1} + ||I_m^{self} - {I_m}|{|_1} + ||I_s^{cyc} - {I_s}|{|_1}\nonumber \\ & + ||I_m^{cyc} - {I_m}|{|_1} . \end{aligned}$$
(9)

3.3.7 Smooth loss

Following the study in [7], we employed a smooth loss function to ensure the generator produces high-quality de-makeup images. The objective is to apply Laplacian filters f to the local facial patches and push the filtered results toward zero, because the de-makeup patches should be smooth and free of high-frequency details. Furthermore, each de-makeup local facial patch \(P_m^{Aface \ i}\) is assigned an individual weight \(\omega _i\), since each patch may contain a distinct facial feature. The smooth loss is as follows:

$$\begin{aligned} {L_\mathrm{{smooth}} = \sum \limits _i {{\omega _i}||f(P_m^{Aface \ i})|{|_1}}.} \end{aligned}$$
(10)

where the eye and lip areas are assigned weights of 0.5 due to the need to preserve their texture. For complex makeup-covered areas such as the nose, cheeks, and forehead, the weights are set to 2, 4, and 4, respectively, and the eyebrow areas' weights are set to 1.
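A minimal sketch of this loss, applying a fixed 3x3 Laplacian kernel per channel and weighting each patch as described above:

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def smooth_loss(demakeup_patches, weights):
    """demakeup_patches: list of (N, 3, h, w) tensors; weights: matching floats."""
    loss = 0.0
    for patch, w in zip(demakeup_patches, weights):
        c = patch.shape[1]
        kernel = LAPLACIAN.to(patch).repeat(c, 1, 1, 1)
        high_freq = F.conv2d(patch, kernel, padding=1, groups=c)
        loss = loss + w * high_freq.abs().mean()   # ||f(P_m^Aface_i)||_1 term
    return loss
```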

Fig. 4 Results on the MT dataset

Fig. 5 Results on the Makeup dataset

Following the above-mentioned losses, the total loss function is expressed as follows:

$$\begin{aligned} L_\mathrm{{total}}= & \lambda _\mathrm{{region\_hm}}{L_\mathrm{{region\_hm}}} + \lambda _\textrm{content}{L_\textrm{content}}\nonumber \\ & + \lambda _\textrm{adv}(L_\textrm{adv}^\mathrm{D_A}+ L_\textrm{adv}^\mathrm{D_B}) + \lambda _\textrm{local}{L_\textrm{local}} \nonumber \\ & + \lambda _\textrm{kl}{L_\textrm{kl}} + \lambda _\textrm{rec}{L_\textrm{rec}} + \lambda _\textrm{smooth}{L_\textrm{smooth}}. \end{aligned}$$
(11)

where \(\lambda _\mathrm{{region\_hm}}\), \(\lambda _\textrm{content}\), \(\lambda _\textrm{adv}\), \(\lambda _\textrm{local}\), \(\lambda _\textrm{kl}\), \(\lambda _\textrm{rec}\) and \(\lambda _\textrm{smooth}\) are the weights for each loss function.
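Assembling Eq. (11) is then straightforward; the sketch below plugs in the weights reported in Section 4.1 and assumes the individual terms have already been computed.

```python
def total_loss(losses,
               w_region_hm=1.5, w_content=1.0, w_adv=1.0, w_local=8.0,
               w_kl=0.01, w_rec=8.0, w_smooth=15.0):
    """losses: dict holding the individual terms from Section 3.3."""
    return (w_region_hm * losses['region_hm']
            + w_content * losses['content']
            + w_adv * (losses['adv_DA'] + losses['adv_DB'])
            + w_local * losses['local']
            + w_kl * losses['kl']
            + w_rec * losses['rec']
            + w_smooth * losses['smooth'])
```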

4 Experiments

4.1 Training details

Our experiments’ machine configuration and runtime environment are as follows: Ubuntu 20.04, GeForce RTX 3090, CUDA 11.6, PyTorch 1.12.1, Pycharm 2021.3, and Python 3.8.16. We used the Adam algorithm to optimize all the model’s modules and set the initial learning rate to 0.0001. The model was trained for 700 epochs with a batch size of 1. The various loss parameters are set as \(\lambda _\mathrm{{region\_hm}}=1.5\), \(\lambda _\textrm{content}=1\), \(\lambda _\textrm{adv}=1\), \(\lambda _\textrm{local}=8\), \(\lambda _\textrm{kl}=0.01\), \(\lambda _\textrm{rec}=8\), and \(\lambda _\textrm{smooth}=15\).
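A minimal training-loop sketch with these settings is shown below; the split into generator-side and discriminator-side optimizers with alternating updates is a common GAN training pattern assumed here rather than a detail stated in the paper, and the loss callables are placeholders for the terms defined in Section 3.3.

```python
import itertools
import torch

def train(gen_modules, disc_modules, loader, d_loss_fn, g_loss_fn,
          epochs=700, lr=1e-4):
    opt_G = torch.optim.Adam(
        itertools.chain(*(m.parameters() for m in gen_modules)), lr=lr)
    opt_D = torch.optim.Adam(
        itertools.chain(*(m.parameters() for m in disc_modules)), lr=lr)
    for _ in range(epochs):
        for I_s, I_m, M_s, M_m in loader:       # batch size 1
            # discriminator update (content, global, and local terms)
            opt_D.zero_grad()
            d_loss_fn(I_s, I_m, M_s, M_m).backward()
            opt_D.step()
            # generator update (total loss of Eq. (11))
            opt_G.zero_grad()
            g_loss_fn(I_s, I_m, M_s, M_m).backward()
            opt_G.step()
```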

4.2 Datasets

The MT dataset provided by BeautyGAN [4] was used in this study. It consists of 1,115 non-makeup and 2,719 makeup images. The faces in the images vary in race, posture, expression, and background. The makeup images contain various light categories, such as retro, Korean and Japanese makeup styles. We also used the Makeup dataset provided by LADN [7], which consists of 334 non-makeup and 355 makeup images. This dataset contains light and extreme makeup images, which show significant differences in color, style, and area coverage.

(1) Training Datasets

We selected 300 non-makeup and 300 makeup images from the Makeup dataset, ensuring that the extreme makeup images were not all concentrated in the training dataset. Using this training dataset, our network can be trained to handle both light and extreme makeup transfer.

(2) Test Datasets

We randomly selected 300 non-makeup and 200 makeup images from the MT dataset. Meanwhile, we excluded the Makeup training images and used the remaining images to build the Makeup test dataset. Because these test images are independent of the training data, they provide a rigorous evaluation benchmark.

Fig. 6 Makeup transfer results for the different model variations

As shown in Figs. 4 and 5, the proposed method was tested on the MT and Makeup test datasets, respectively. The first row displays the reference non-makeup images, the second shows the reference makeup ones, and the third depicts the transferred images generated by our model. The results demonstrate that the proposed method can transfer both light and extreme makeup styles with complex patterns.

4.3 Ablation study

To evaluate the effectiveness of the 10 local discriminators, the local loss function, and the facial region extraction with face parsing maps during transfer, we conducted an ablation study using three variants of our model. The first variant eliminates the 10 local discriminators so that the impact of their feedback signals on the local patches during makeup transfer can be examined. The second variant removes the local loss function, enabling a more comprehensive assessment of the differences in color and texture within the makeup transfer results. The third variant disables the operation that separates the background and facial regions using face parsing maps, highlighting the significance of their segregation. Using these variants, we analyze each component's contribution during makeup transfer.

Figure 6 shows the comparative makeup transfer results between our model and the three variants. The first column displays the reference non-makeup images, the second shows the reference makeup ones, and the third to sixth columns depict the results of the three variants and our model, respectively. As observed in the third column, Variant 1, which eliminates the 10 local discriminators, is still capable of transferring makeup. However, the makeup patterns in the transferred images appear light in color and incomplete. In addition, the transferred images are affected by the reference ones. For example, the eye indicated by the blue box in row (c) and the lip in row (e) are more similar to the corresponding regions in the reference makeup images than to those in the reference non-makeup ones. Therefore, the local discriminators maintain the colors and patterns and ensure that facial identity remains unchanged.

Table 1 ArcFace and FID scores for the results in Figs. 6 and 7

In the fourth column, it is evident that Variant 2 fails to perform successful makeup transfer, as the transferred images exhibit noticeable artifacts. This observation highlights the crucial role of the local loss function in preserving color and texture consistency throughout the makeup transfer process. In the fifth column, blurriness and inconsistency are observed in the hair and background after the face-parsing-based facial region extraction is disabled; in addition, the facial regions' colors change. These results demonstrate that processing the facial and background regions separately protects the background from being altered and mitigates the negative impact of background information during makeup transfer. As shown in the sixth column, the proposed model produces accurate and realistic makeup transfer results.

The local discriminators and loss function are not employed in the makeup removal segment for two reasons: (1) synthesizing non-makeup images that align with the reference makeup ones for effective guidance during makeup removal is necessary. However, the existing datasets lack these image pairs with the same identity before and after applying makeup. Using the method that synthesizes makeup images to generate non-makeup ones is impractical due to challenges such as local texture loss and noticeable intricate color residue. In addition, existing alternative methods cannot directly synthesize high-quality non-makeup images from makeup ones. Therefore, relying on flawed synthesized non-makeup images with local discriminators would be impractical. Consequently, the global discriminator is used to preserve the identity of generated images using the global content features extracted by the content encoder in the makeup removal branch. Furthermore, the smooth makeup removal loss is applied to each facial patch to enable the generator to remove high-frequency details. (2) Applying a histogram-matched local patch to calculate local loss and guide color and detail changes in facial patches during makeup transfer is reasonable. However, for makeup removal, since the genuine skin colors of the reference makeup and non-makeup images may differ significantly, using the histogram-matched patches would yield colors resembling those of the reference non-makeup images instead of maintaining the reference makeup faces’ original colors. Consequently, the makeup removal branch does not utilize the local loss function; instead, it relies on the global discriminator and smooth makeup removal loss to obtain more realistic de-makeup results.

Fig. 7 Makeup removal results for the different model variations

Therefore, this paper presents only one makeup removal variant, which disables the facial region extraction operation. Figure 7 compares the complete model with this variant. The first column depicts the reference makeup images, while the second and third display the variant's and the full model's results. As observed in the second column, the images display color residue and the background has been noticeably altered. Furthermore, the image in row (a) exhibits facial blurring and deformation. According to the third column, the complete model's makeup removal performs better at preserving identity, maintaining the background, and removing color residue than the variant.

In this paper, we applied the ArcFace [32] and FID (Fréchet Inception Distance) [33] metrics to evaluate the makeup transfer and removal ablation results illustrated in Figs. 6 and 7, as summarized in Table 1. The ArcFace and FID values in Table 1 are the averages calculated over all results of each ablation variant in Figs. 6 and 7. We utilized ArcFace to measure the facial similarity between the de-makeup and reference makeup images, and between the transferred makeup and reference non-makeup images. A high ArcFace score indicates excellent identity preservation after makeup transfer or removal. We utilized the FID metric to measure the similarity between the reference non-makeup and de-makeup images, and between the reference makeup and transferred makeup images. A low FID score indicates high makeup similarity. For the makeup transfer ablation study, the FID score of our complete method is the smallest, while the ArcFace score of the variant without the local loss function is the highest. However, as shown in Fig. 6, the variant without the local loss function generates the worst effects, with almost no makeup transferred; its outputs therefore remain highly similar to the non-makeup faces, which explains its high ArcFace score when comparing the transferred makeup images with the non-makeup ones. Therefore, our complete method outperforms the variants in the makeup transfer ablation study. For the makeup removal ablation study, our model demonstrated superior performance in both the ArcFace and FID scores.

4.4 Comparison results

We compared our method to several advanced makeup transfer approaches, including BeautyGAN [4], SCGAN [6], EleGANt [27], SSAT [17], BeautyREC [16], LADN [7], and CPM [8]. We used the official code for BeautyGAN, SCGAN, BeautyREC, EleGANt, and LADN and re-trained these models using our training dataset and environment to ensure a fair comparison. We also utilized the officially released pre-trained SSAT and CPM models to generate makeup transfer results. Although re-training them was not feasible, we ensured that the conditions for generating images with SSAT and CPM remained consistent with the other methods. To evaluate each method's performance accurately, we resized all the generated images to the same resolution (256\(\times \)256), eliminating visual differences caused by varying image resolutions and allowing an effective comparison with other state-of-the-art methods under standardized experimental conditions.

Fig. 8 Makeup transfer results for the different methods on the MT dataset

Fig. 9 Makeup transfer results for the different methods on the Makeup dataset

Fig. 10 De-makeup results for the different methods on the MT dataset

Figure 8 compares our method with the other frameworks on the MT test dataset. The first row shows the reference non-makeup images, the second displays the reference makeup ones, and the third to tenth rows display the results of BeautyGAN, SCGAN, BeautyREC, EleGANt, SSAT, CPM, LADN, and our method, respectively. The faces generated by BeautyGAN appear darkened, and its makeup transfer effects are unsatisfactory. In the green boxes in row (4), SCGAN changes the lip and eyeshadow colors in the transferred makeup images. Moreover, the images generated by this method tend to be blurry and exhibit significant distortions in the hair and background. As observed in the blue boxes in row (5), BeautyREC tends to transfer only the lower lip color. As observed in the pink and red boxes in rows (7) and (8), the lipstick color produced by SSAT and CPM appears relatively light compared to the corresponding reference makeup images.

Moreover, LADN's results appear natural; however, certain makeup details are not rendered appropriately, for example, the missing lower eyelashes in the purple box in column (b), the erroneous eye effects in the purple boxes in columns (d) and (e), the inconsistent lip color in the purple box in column (e), and the unnecessary red color in the eye marked by the purple box in column (f). We also observed a change in the background color. Compared with the above-mentioned methods, our approach effectively ensures makeup color and detail accuracy while maintaining the reference images' backgrounds during makeup transfer.

Figure 9 illustrates the method’s comparison results with BeautyGAN, SCGAN, BeautyREC, EleGANt, SSAT, LADN, and CPM on the Makeup test dataset with extreme makeup styles. In Fig. 9, the first row shows the reference non-makeup images, the second displays the reference makeup ones, and the third to tenth present the transfer results generated by BeautyGAN, SCGAN, BeautyREC, EleGANt, SSAT, CPM, LADN, and our method, respectively. The results demonstrate that BeautyGAN produces unsatisfactory effects and causes severe facial darkening. SCGAN, BeautyREC, EleGANt, and SSAT only transfer lipstick, and SCGAN alters the background and hair color.

The CPM and LADN results partially achieve makeup transfer, but the patterns produced by CPM are incomplete and inaccurate, and LADN changes the background color. Additionally, the blue, red, and green boxes in rows (8) and (9) reveal color overflow after makeup transfer, as well as visible boundaries between the forehead makeup and the natural skin color. Among the compared methods, BeautyGAN, SCGAN, BeautyREC, EleGANt, and SSAT fail to transfer extreme makeup styles. LADN and CPM can partially transfer extreme makeup styles, but issues such as color overflow and background color changes may occur. In comparison, our method achieves superior results in transferring extreme makeup styles and preserving the background.

Fig. 11 De-makeup results for the different methods on the Makeup dataset

In Figs. 10 and 11, we compare our method to BeautyGAN, SCGAN, SSAT, and LADN on the MT and Makeup test datasets due to BeautyREC, EleGANt, and CPM failing to achieve makeup removal. The first column represents the reference makeup images, and the second to sixth display the de-makeup results of BeautyGAN, SCGAN, SSAT, LADN, and our method. In Fig. 10, the red boxes in the BeautyGAN results indicate that it removes lipstick in some images, while other makeup aspects are not effectively eliminated and appear to darken the faces. The SCGAN results exhibit an overall whitening effect with low clarity. SSAT has color residue on the lipstick, as indicated by the purple boxes. According to the blue boxes, LADN effectively removes eyeshadow but does not completely remove lipstick. Moreover, all four makeup removal methods change the color of the original background. Compared to these methods, our approach effectively removes lipstick and eyeshadow, maintains the natural skin color, and preserves the background information during makeup removal. In conclusion, our method can generate high-quality de-makeup results.

Figure 11 illustrates the de-makeup results for extreme makeup styles obtained from the Makeup test dataset. BeautyGAN hardly removes the extreme makeup styles and leaves the faces darkened. SCGAN causes eye distortion, blurry backgrounds, and color changes during makeup removal. SSAT displays color residue as indicated by the green boxes. In contrast, LADN exhibits relatively effective de-makeup results while retaining some color, as shown in the red boxes. In comparison to these methods, our approach effectively removes makeup with fewer colors remaining and preserves background information the best.

4.5 Quantitative analysis results

4.5.1 Quantitative comparison of identity preservation

In this section, we compare our method's average ArcFace values with those of the other approaches on the MT and Makeup test datasets to assess each model's identity preservation performance. We randomly selected 100 non-makeup and makeup images from the MT test dataset, and 34 non-makeup and 55 makeup images from the Makeup test dataset, and then used them to generate transferred makeup and de-makeup images. Table 2 demonstrates that our method achieves the highest ArcFace scores on the MT test dataset. Additionally, Table 3 indicates that BeautyREC and LADN achieve the highest ArcFace scores on the Makeup test dataset for makeup transfer and removal, respectively. However, as shown in Figs. 9 and 11, BeautyREC performs poorly in transferring extreme makeup styles, so its results remain highly similar to the reference non-makeup images, while LADN performs poorly in removing extreme makeup styles, so its results remain highly similar to the reference makeup images. This explains their high ArcFace scores when comparing the transferred makeup images with the non-makeup ones and the de-makeup images with the makeup ones. Notably, the makeup removal ArcFace scores for BeautyREC, EleGANt, and CPM are not included in Tables 2 and 3 because these methods cannot perform makeup removal. Overall, our approach outperforms the other methods in identity preservation.

Table 2 ArcFace values calculated using different methods on the MT dataset (\(\uparrow \))
Table 3 ArcFace values calculated using different methods on the Makeup dataset (\(\uparrow \))
Table 4 FID values calculated using different methods on the MT dataset (\(\downarrow \))
Table 5 FID values calculated using different methods on the Makeup dataset (\(\downarrow \))
Table 6 User survey results for the different methods

4.5.2 Quantitative comparison of makeup similarities

We used the FID values to quantify the makeup similarities, to evaluate the makeup effects of each method after makeup transfer and removal. Tables 4 and 5 present the calculated average results for the abovementioned images from the MT and Makeup test datasets. We observed that our method achieves the lowest FID scores, indicating the best performance in makeup transfer and removal.

4.5.3 User study

We conducted a user study to qualitatively evaluate our method's visual quality compared to BeautyGAN, SCGAN, BeautyREC, EleGANt, SSAT, LADN, and CPM. We randomly selected 25 non-makeup and 10 makeup images from the MT and Makeup test datasets and used them to generate the transferred makeup and de-makeup images for the different methods. For each result set, we invited 10 participants to rank the methods independently. The makeup transfer results were ranked based on three factors: (1) comparing the color, texture, and makeup effects of the transferred and reference makeup images; (2) evaluating identity consistency and the preservation of unrelated content between the transferred makeup and reference non-makeup images; and (3) verifying the makeup authenticity of the transferred images. The participants ranked the de-makeup results by: (1) examining the residual color on the de-makeup faces; and (2) evaluating identity consistency and the preservation of unrelated content in the de-makeup images. As shown in Table 6, our method achieved the highest proportion of best-choice selections for both makeup transfer and removal.

5 Conclusion

This paper proposed a multi-granularity facial makeup transfer and removal model with local-global collaboration. Through the collaboration of the local and global discriminators, the network effectively separates content and makeup style features and detects coarse- and fine-grained makeup details. The local discriminators play a critical role in preserving identity and ensuring makeup accuracy, while the local loss function ensures makeup consistency between the transferred and reference images for extreme makeup styles, accurately transferring details such as makeup color and texture. Additionally, by introducing face parsing maps into the generator and discriminators, the background and facial regions are processed separately, making the generated images more natural and accurate. Compared with state-of-the-art methods, our model displayed high efficiency and effectiveness in makeup transfer and removal tasks. Nevertheless, our method may generate undesired results under certain extreme conditions; for example, when the non-makeup and makeup faces have significant pose variations, the transferred eye or lip makeup may be positioned incorrectly. In future work, we would like to explore new loss functions and further improve the local discriminators to address these issues.