Angular Margin Losses for Representative Embeddings Training: ArcFace (2018) vs MV-Arc-Softmax (2019) vs CurricularFace (2020)

Kıvanç Yüksel
15 min read · Jan 13, 2021

One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that enhance discriminative power.

This quote was taken from the ArcFace paper. The paper investigates the face recognition problem and introduces a loss function to train more discriminative embeddings. An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. For example, say you have embeddings representing “Person A” and “Person B” (the different embeddings come from, e.g., different images of Person A and Person B). An ideal model would produce embeddings such that embeddings belonging to the same person stick together, while embeddings of different people stand as far apart from each other as possible in the embedding space.

Although the papers discussed in this post (ArcFace, MV-Arc-Softmax and CurricularFace) investigate the face recognition problem, the usage of embeddings is not restricted to face recognition. They are used in many different Machine Learning tasks, such as person re-identification, image retrieval, word representations in NLP, and so on.

In this blog post, we are going to investigate and compare three of the most powerful angular margin loss functions for training discriminative embeddings. The post will pay special attention to CurricularFace, since it is the latest and most effective loss function among the three, and it adopts an interesting strategy called “Curriculum Learning”, which we will also talk about shortly.

By the end of this post you should:

  1. understand the difference between discriminative and non-discriminative embeddings,
  2. understand what curriculum learning is,
  3. understand how to modify softmax loss (softmax activation function + cross entropy loss) in order to convert the problem to angular margin optimization,
  4. learn how to apply the things you learned in practice using Python and PyTorch,
  5. be able to compare the explained loss functions with some other loss functions that are also used for training embeddings (benchmarks).

The Goal

First of all, what is the goal here? What are we trying to achieve?

We want to train embeddings in such a way that they capture some of the semantics of the input, so that features of the same category have a small distance while features of different categories have a considerable distance in some metric space. Ideally, the maximal intra-class distance in this metric space should be smaller than the minimal inter-class distance.

Visualization of inter and intra class distances (from Quadruplet loss paper)

In the figure above, B1B3 represents an intra-class distance, while B1A3 represents an inter-class distance.

Curriculum Learning

The full title of the paper we are going to pay special attention to in this post is “CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition”. There is an interesting term in this title, namely: Adaptive Curriculum Learning. We are going to talk about the “Adaptive” part later on in this post, but for now, what is “Curriculum Learning”?

  • It is a technique in which easy samples are emphasized in the early training stage and hard ones in the later training stage. In each stage, different samples are assigned different importance according to their difficulty,
  • It is motivated by the way humans learn: easy cases are learned first, and then come the hard ones.
Curriculum Learning

In the next section we are going to see exactly how curriculum learning is adopted in the CurricularFace loss function.

Angular Margin Losses

Angular margin losses are constructed by modifying the Softmax loss function (Softmax loss = Softmax activation + Cross-Entropy loss). It is modified because the original Softmax loss does not explicitly optimize the features to have a smaller distance between positive pairs and a larger distance between negative pairs. Let’s look at the decision boundary of the Softmax loss function (binary classification scenario):

For the Softmax loss, the goal is achieved as long as the features are separable. You can see from the image above that even though the distance between features belonging to different classes is very small in some cases, the Softmax loss makes no effort to change this. However, as we discussed before, we want our embeddings to be very discriminative: embeddings belonging to the same class should gather together as much as possible, and embeddings belonging to different classes should move as far away from each other as possible.

But why doesn’t the Softmax loss do that? Because it is a loss function for classification. To get better classification results you don’t need to force a margin between classes… As long as they are separable, you are good to go.

Okay, now let’s look at what kind of modification we can do on Softmax loss to get what we want. Below, you can see the original Softmax loss function:

Original Softmax Loss Function

m: batch size, n: number of classes, W: weight vector, b: bias term,

x: feature vector (embeddings)

(In case it is not obvious to some readers where the exponent comes from: instead of writing it with an abbreviation, we expanded the operation performed right before the Softmax activation function, which is a dot product between features and weights plus a bias, a.k.a. a linear layer.)
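For readers following along without the embedded images, here is the formula in LaTeX (this is the standard Softmax loss as written in the ArcFace paper, using the notation defined above):

L = -\frac{1}{m}\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_i + b_{j}}}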

As an initial step, we remove all bias terms by setting them to 0:

Softmax Loss Function without bias terms

When the bias terms are set to 0, the exponent becomes just a dot product between two vectors. If you remember the cosine similarity equation, the dot product of two vectors can be written like this:

Cosine Similarity
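In LaTeX, the identity referred to here is simply (with theta_j the angle between W_j and x_i):

W_{j}^{T} x_i = \|W_j\| \, \|x_i\| \cos\theta_j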

So, we replace the exponents using this identity:

Power of exponent in Softmax loss function is replaced based on cosine similarity
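Written out with the notation above, the loss at this point looks like this:

L = -\frac{1}{m}\sum_{i=1}^{m} \log \frac{e^{\|W_{y_i}\|\,\|x_i\|\cos\theta_{y_i}}}{\sum_{j=1}^{n} e^{\|W_{j}\|\,\|x_i\|\cos\theta_{j}}}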

As the next step, we fix the norm of the weights to be 1 with L2 normalization, and we fix the norm of the features to be s. By doing so, predictions become dependent only on the angle between the feature vector and the weight vector. Moreover, setting the norm of the features to s restricts the features to lie on a hypersphere of a fixed radius. Why? Think about what happens to the cosine similarity equation when we perform these operations: it reduces to s·cos(theta). Can you see it now?

Norms of the weight vectors and feature vectors are fixed
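After normalizing the weights (so ||W_j|| = 1) and rescaling the features to norm s, the exponents reduce to s·cos(theta), and the loss becomes:

L = -\frac{1}{m}\sum_{i=1}^{m} \log \frac{e^{s\cos\theta_{y_i}}}{\sum_{j=1}^{n} e^{s\cos\theta_{j}}}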

The final step is to write this modified Softmax Loss function in a general form:

Softmax Loss function in a general form

What I mean here by “in general form” is that when parts of this function are modified, we get the different angular margin loss functions that were defined in different papers on this topic. For example, as you might have already guessed, CurricularFace is one version of this generalized loss function. We are also going to compare it with MV-Arc-Softmax and ArcFace, which use different versions of this function.

Let’s first define the terms visible in this function. First of all, T(cos(theta)) is the positive cosine similarity modulator. It modulates the positive cosine similarity between the feature vector and its corresponding weight vector. N(t, cos(theta)) does the same thing but for the negative cosine similarities. The negative cosine similarity modulator is further expanded like this:

Negative Cosine Similarity Modulator

I(t, cos(theta)) is a modulation coefficient, and c is a constant. If you are confused, don’t be discouraged by all the functions and letters flying around. The reason there are so many is that we are writing the modified version of the Softmax loss in a general form. Hang in there! In the end, the final equations you will care about are quite simple. Lastly, G(p(x_i)) is an indicator function, and for the rest of this post we are going to assume that it is 1 (it is set to 1 for all of the loss functions we are going to talk about in this post).
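Putting the pieces together, the general form referred to above (following the unified formulation in the CurricularFace paper, with G(p(x_i)) already set to 1) is:

L = -\frac{1}{m}\sum_{i=1}^{m} \log \frac{e^{s\,T(\cos\theta_{y_i})}}{e^{s\,T(\cos\theta_{y_i})} + \sum_{j \neq y_i} e^{s\,N(t,\cos\theta_{j})}}, \qquad N(t, \cos\theta_j) = I(t, \cos\theta_j)\,\cos\theta_j + c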

Okay, let the fun begin… 😅

Adaptive Curriculum Learning Loss: CurricularFace

Positive Cosine Similarity Modulator:

First of all, let’s look at how the positive cosine similarity modulator is chosen:

Chosen Positive Cosine Similarity Modulator For CurricularFace
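Concretely, this is the same additive angular margin used by ArcFace:

T(\cos\theta_{y_i}) = \cos(\theta_{y_i} + m)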

The key here is the added margin parameter m. Although it is a very small change, what it forces the loss function to do is quite significant. This change forces the positive angle to be smaller, which means that intra-class distances are reduced. And because this change applies to every class, inter-class distances naturally increase. Let’s see why this happens:

Decision boundary

Before this modification, it was enough for the positive angle to be smaller than the blue decision boundary for an input to be classified correctly. However, look at what happens after the margin is added: now the angle has to be smaller than the red decision boundary for the input to be classified correctly.

I don’t know if anyone noticed, but because of the added margin parameter (m), this modulator is not monotonically decreasing over the whole range [0, π]: once theta + m exceeds 180 degrees, the value of the cosine starts to increase again. So, what about that? How do we handle this situation? Well, fortunately, we don’t need to do anything about it. When the weights are initialized randomly, the angles between W_y and x_i follow a Gaussian distribution with a mean around 90 degrees and a standard deviation around 15 degrees at the beginning of training. So, even at the beginning of training, the maximum positive angle is only around 105 degrees, which keeps theta + m well below 180 degrees.

Angle between the Feature and Target Center

Negative Cosine Similarity Modulator:

Now, let’s look at the negative cosine similarity modulation function:

Negative Cosine Similarity Modulator

Here, we have two cases, which separate “easy” samples from “hard” samples. If the difference between the modulated positive cosine similarity, T(cos(theta)), and the cosine of the negative sample is greater than or equal to 0, the sample is considered easy; otherwise, it is considered hard. Why is that? Because if the positive angle (even after we made things harder with the positive cosine similarity modulator) is already smaller than the negative angle, it means that the model has no issue correctly classifying the sample.
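Written out as a piecewise function (this is how the CurricularFace paper defines it):

N(t, \cos\theta_j) = \begin{cases} \cos\theta_j, & T(\cos\theta_{y_i}) - \cos\theta_j \ge 0 \ \text{(easy)} \\ (t + \cos\theta_j)\,\cos\theta_j, & T(\cos\theta_{y_i}) - \cos\theta_j < 0 \ \text{(hard)} \end{cases}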

In the case of easy samples, we do not modify the negative cosine similarity; for hard samples, however, we do. This is actually where adaptive curriculum learning is applied. The parameter t in this equation is adaptively adjusted during training. We are going to talk about how it is adjusted in a moment, but for now it is enough to know that early in training t + cos(theta_j) takes a value between 0 and 1, so the contribution of hard samples to the loss function is decreased. Later in training, t + cos(theta_j) takes a value bigger than 1, so the contribution of hard samples is increased. This is how adaptive curriculum learning is applied: by emphasizing easier samples during the early stages of training and harder samples in the later stages.

Now, there is only one thing left that we haven’t covered yet: how to adjust t. It is calculated with an Exponential Moving Average (EMA):

Exponential Moving Average
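Written out, with r^(k) the average positive cosine similarity of the k-th batch and α a momentum-like weight (the exact value is an implementation detail), the update is:

t^{(k)} = \alpha\, r^{(k)} + (1 - \alpha)\, t^{(k-1)}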

Here r is defined as the average of the positive cosine similarities of the k-th batch. t is set to 0 initially, and at each step it is updated according to the equation above. As we discussed before, at the beginning of training the positive angles follow a Gaussian distribution with a mean around 90 degrees, so r is around 0. Then, as the model gets better and better, the positive angles decrease, so the positive cosine similarities, and hence r, increase.
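To make the whole procedure concrete, here is a minimal PyTorch sketch of a CurricularFace-style head, written directly from the equations above. It mirrors the overall structure of the official implementation, but the parameter names, default values and details such as the EMA momentum are my own illustrative choices, so treat it as a sketch rather than the reference code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class CurricularFaceHead(nn.Module):
    """Sketch of a CurricularFace-style head producing s * modulated cosine logits."""

    def __init__(self, embedding_size, num_classes, s=64.0, m=0.5, alpha=0.01):
        super().__init__()
        self.s = s          # feature scale (radius of the hypersphere)
        self.m = m          # additive angular margin
        self.alpha = alpha  # EMA weight for the new batch statistic (assumed value)
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_size))
        nn.init.normal_(self.weight, std=0.01)
        # t starts at 0 and is updated with an exponential moving average
        self.register_buffer("t", torch.zeros(1))

    def forward(self, embeddings, labels):
        # Normalize features and class weights so the logits are pure cosine similarities
        embeddings = F.normalize(embeddings, dim=1)
        weight = F.normalize(self.weight, dim=1)
        cos_theta = F.linear(embeddings, weight).clamp(-1.0, 1.0)  # (batch, num_classes)

        # Positive cosine similarity and its margin-modulated version T(cos) = cos(theta + m)
        # (the post argues theta + m stays well below 180 degrees, so no extra handling here)
        target_cos = cos_theta.gather(1, labels.view(-1, 1))       # (batch, 1)
        sin_theta = torch.sqrt((1.0 - target_cos.pow(2)).clamp(min=1e-9))
        target_cos_m = target_cos * math.cos(self.m) - sin_theta * math.sin(self.m)

        # Update t with an EMA of the batch-average positive cosine similarity
        with torch.no_grad():
            self.t = self.alpha * target_cos.mean() + (1.0 - self.alpha) * self.t

        # Negative modulator N(t, cos):
        #   easy (T(cos_pos) >= cos_neg): keep cos_neg
        #   hard (T(cos_pos) <  cos_neg): (t + cos_neg) * cos_neg
        hard_mask = cos_theta > target_cos_m
        modulated = torch.where(hard_mask, (self.t + cos_theta) * cos_theta, cos_theta)

        # Put the modulated positive logit back into the target column and scale by s
        logits = modulated.scatter(1, labels.view(-1, 1), target_cos_m)
        return self.s * logits


# Usage: feed the head's output into a plain cross-entropy loss
if __name__ == "__main__":
    head = CurricularFaceHead(embedding_size=3, num_classes=10)
    feats = torch.randn(8, 3)
    labels = torch.randint(0, 10, (8,))
    loss = F.cross_entropy(head(feats, labels), labels)
    loss.backward()
    print(loss.item())
```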

ArcFace vs MV-Arc-Softmax vs CurricularFace

Okay, now let’s compare CurricularFace with two of its closest competitors. One of them is called ArcFace, and the other is MV-Arc-Softmax. The positive cosine similarity modulator is exactly the same for all three loss functions. The only difference is in their negative cosine similarity modulators:

ArcFace Negative Cosine Similarity Modulator
MV-Arc-Softmax Negative Cosine Similarity Modulator
CurricularFace Negative Cosine Similarity Modulator
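Since the images may not render here, the three modulators written out (following the summary in the CurricularFace paper, with T(cos theta_{y_i}) = cos(theta_{y_i} + m) in all three cases) are:

ArcFace: N(\cos\theta_j) = \cos\theta_j

MV-Arc-Softmax: N(t, \cos\theta_j) = \begin{cases} \cos\theta_j, & T(\cos\theta_{y_i}) - \cos\theta_j \ge 0 \\ t\,\cos\theta_j + t - 1, & T(\cos\theta_{y_i}) - \cos\theta_j < 0 \end{cases} \quad \text{(fixed } t > 1\text{)}

CurricularFace: N(t, \cos\theta_j) = \begin{cases} \cos\theta_j, & T(\cos\theta_{y_i}) - \cos\theta_j \ge 0 \\ (t + \cos\theta_j)\,\cos\theta_j, & T(\cos\theta_{y_i}) - \cos\theta_j < 0 \end{cases} \quad \text{(adaptive } t\text{)}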

In ArcFace, there is basically no modification: the negative cosine similarity is used exactly as it is. This means that ArcFace has no mechanism to separate easy and hard samples and modulate the loss based on sample difficulty. MV-Arc-Softmax has a modulation function similar to CurricularFace’s, and it also emphasizes hard negatives during training. However, this emphasis is fixed: the value of t is a constant larger than 1, so hard samples are always emphasized. The authors of that paper point out that the choice of t is crucial and that larger values may lead to divergence, especially for smaller network architectures. In CurricularFace, t is adaptive, and hard samples are de-emphasized early in training and emphasized later on. Setting t adaptively means one less hyper-parameter to worry about.
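A tiny numerical illustration of the difference (my own toy example, using the hard-negative formulas written out above; the value t = 1.2 is arbitrary): how a hard negative cosine of 0.5 is modulated with a fixed t (MV-Arc-Softmax style) versus an adaptive t (CurricularFace style) at different stages of training.

```python
import torch

def mv_arc_softmax_hard(cos_neg, t=1.2):
    # Fixed emphasis: always inflates the hard negative logit (t > 1)
    return t * cos_neg + t - 1.0

def curricularface_hard(cos_neg, t):
    # Adaptive emphasis: t grows during training via an EMA of positive cosines
    return (t + cos_neg) * cos_neg

cos_neg = torch.tensor(0.5)
print(mv_arc_softmax_hard(cos_neg))         # 0.80: emphasized from the start
print(curricularface_hard(cos_neg, t=0.0))  # 0.25: de-emphasized early in training
print(curricularface_hard(cos_neg, t=0.7))  # 0.60: emphasized later in training
```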

Let’s See It in Practice

In this section we are going to look at a very simple example in order to compare the methods we have learned so far. We are going to use the MNIST dataset to keep things simple and to focus more on the comparison between these methods.

The code is written in Python using the PyTorch library; you can find it here (if you find it helpful, please leave a ⭐ 🤓). What we are going to do is very simple: train four models using the same backbone and the four different heads we learned about in this post: Softmax, ArcFace, MV-Arc-Softmax and CurricularFace. At the end, with the help of some plots, we will try to see whether there is indeed a difference between these methods.

As an initial step, it would be wise to verify that the distribution of the angle between the feature vector of the positive class and its corresponding weight vector at the beginning of training is indeed a Gaussian distribution with the desired properties. If it is not, there is a chance the positive cosine similarity modulator won’t be a monotonically decreasing function, which would disturb the training process. To verify this, I chose three network architectures and plotted the distribution at the beginning of training. Below you can see these graphs for ResNet50, ResNet101 and a custom network architecture (you can find its details in the code).

ResNet50 model — Positive theta distribution at the beginning of the training
ResNet101 — Positive theta distribution at the beginning of the training
Custom model — Positive theta distribution at the beginning of the training

It is a relief to see that it is indeed a Gaussian distribution with the desired properties (it would have been a shame if it wasn’t, after spending so much time writing this blog post and preparing the code 😅). Well, at least for the models I tested, the angle never exceeded 100 degrees.
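If you want to reproduce this sanity check yourself, a minimal sketch (my own, independent of the linked repo) looks like this: take a freshly initialized backbone and head, compute the cosine similarity between each normalized embedding and its normalized target-class weight vector, and histogram the resulting angles.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def positive_angles_deg(embeddings, class_weights, labels):
    """Angle (in degrees) between each embedding and its target-class weight vector."""
    emb = F.normalize(embeddings, dim=1)
    w = F.normalize(class_weights, dim=1)
    cos_pos = (emb * w[labels]).sum(dim=1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos_pos))

# Toy check with random (i.e., "freshly initialized") embeddings and weights:
emb = torch.randn(10000, 512)
weights = torch.randn(10, 512)
labels = torch.randint(0, 10, (10000,))
angles = positive_angles_deg(emb, weights, labels)
print(angles.mean().item(), angles.std().item(), angles.max().item())
```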

Okay, since we now know that everything is fine with the theta distribution, we can start training. Since the MNIST dataset is very simple (it is usually used in “Hello World” projects in Deep Learning), I am going to use a very simple network architecture as the backbone:

Backbone Model Summary
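The exact backbone is in the linked repository; as a rough idea of the scale involved, a comparable minimal MNIST backbone producing 3-dimensional embeddings might look like this (an illustrative sketch of my own, not the author’s exact architecture):

```python
import torch
import torch.nn as nn

class MnistBackbone(nn.Module):
    """Small CNN that maps 1x28x28 MNIST images to 3-dimensional embeddings."""

    def __init__(self, embedding_size=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                      # 28x28 -> 14x14
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                      # 14x14 -> 7x7
        )
        self.embedding = nn.Linear(64 * 7 * 7, embedding_size)

    def forward(self, x):
        x = self.features(x)
        return self.embedding(x.flatten(1))

# A head (e.g., the CurricularFace-style sketch above) consumes these embeddings:
model = MnistBackbone()
print(model(torch.randn(2, 1, 28, 28)).shape)  # torch.Size([2, 3])
```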

Finally, Adam was chosen as the optimizer with a learning rate of 0.005. Furthermore, the learning rate was reduced by a factor of 0.1 halfway through training, at 3/4 of training, and 2 epochs before training ends. Training was stopped whenever there was no improvement in the last 10 epochs, or whenever accuracy exceeded 0.97 (because of this, the learning rate reduction is usually never applied; I just kept it in the code). The feature vector size (embedding vector size) is another hyper-parameter to choose. It is usually chosen to be 128, 256, 512, … However, I chose it to be 3. This way we can visualize our embeddings in 3D space and compare the differences between embeddings trained with different loss functions. Since MNIST is a very easy dataset for a Deep Learning model to learn, such a small feature vector size wasn’t an issue. Okay, let’s now see some results:

Softmax Head
ArcFace Head
MV-Arc-Softmax Head
CurricularFace Head

The first thing to notice from these plots is that we could get 98% accuracy using any of these methods (we could even get 100% accuracy, but training was stopped once the accuracy reached 98%). However, if we look at the distribution of the features on the unit sphere, we start to see some differences. Even though each of these methods allowed us to reach 98% accuracy, the embeddings of the model trained with the Softmax head are all over the place. The reason is, as we discussed, that as long as features are separable, the Softmax loss does not push the features to have a smaller distance between positive pairs and a larger distance between negative pairs. When we look at the plot belonging to the ArcFace head, we can see a big difference: embeddings belonging to the same class are now grouped together, and embeddings of different classes are pushed away from each other. With the MV-Arc-Softmax and CurricularFace heads we can see the same phenomenon even more strongly. Because the chosen dataset is a very simple one, we cannot see a big difference between the plots of MV-Arc-Softmax and CurricularFace, so it is hard to say, based on this experiment alone, which one is actually better. Hence, let’s check some benchmarks as the next step 🙂

Benchmarks

Alright, now we know how CurricularFace works, but how about some benchmarks? Is it really any better than the other loss functions used to train discriminative embeddings? Well, let’s see… First of all, a little bit of info about the datasets used in the benchmarks:

LFW: Labeled Faces in the Wild is a public benchmark for face verification

CPLFW: Cross-Pose LFW, a collection that seeks out pictures of the people in LFW with pose differences as large as possible

CFP-FP: Images of celebrities in frontal and profile views

AgeDB: This dataset contains face images of celebrities, politicians and scientists in different ages and poses. The annotations per image include gender, age and identity of the person in the image. The age variations are from 3 to 101 years old.

MegaFace: A dataset with lots of distractors. 1 million photos with 690,572 unique identities

IJB-C: The IARPA Janus Benchmark-C face challenge (IJB-C) defines eight challenges addressing verification, identification, detection, clustering, and processing of full-motion videos. It is supported by the IJB-C set of 138,000 face images, 11,000 face videos, and 10,000 non-face images

We are ready to see the benchmarks now:

CurricularFace benchmarks on various datasets

It can be seen that CurricularFace achieves results comparable to its competitors on LFW, where performance is nearly saturated. Furthermore, on all other datasets, the method shows superior results.

Conclusions

Yes, CurricularFace is better than its competitors on almost all of the datasets it was benchmarked on. However, is there really a significant difference between, say, CurricularFace and its two closest competitors? I don’t think so. So, why did I even bother to focus most of the attention on this loss function and write a post about it? Because I really liked the idea of Curriculum Learning. Training a network with easy samples at the beginning and then gradually increasing the difficulty of the samples as the model starts learning is a great idea, and it is elegantly applied in this loss function. I think most Machine Learning problems would benefit from such a learning methodology if it were embedded into their loss functions. I hope to see more examples of this in many Machine Learning articles in the near future.

Final Words

Don’t forget to 👏🏻 if you liked this post, and please leave a comment below if you have any feedback, criticism, or something that you would like to discuss. I can also be reached on social media: linkedin, twitter, instagram
