Multi-label NLP refers to the task of assigning multiple labels to a given text input, rather than just one label. In traditional NLP tasks, such as text classification or sentiment analysis, each input is typically assigned a single label based on its content. However, in many real-world scenarios, a piece of text can belong to multiple categories or express multiple sentiments simultaneously.

Multi-label NLP is important because it allows us to capture more nuanced and complex information from text data. For example, in customer feedback analysis, a customer review may express both positive and negative sentiments at the same time, or it may touch on multiple aspects of a product or service. By assigning multiple labels to such inputs, we can gain a more comprehensive understanding of the customer’s feedback and take more targeted actions to address their concerns.

This article delves into a noteworthy case of Provectus’ use of multi-label NLP.


> A client approached us with a request to help them automate labeling documents of a certain type. At first glance, the task appeared to be straightforward and easily solved. However, as we worked on the case, we encountered a dataset with inconsistent annotations. Though our customer had faced challenges with varying class numbers and changes in their review team over time, they had invested significant effort into creating a diverse dataset with a range of annotations. While there were some imbalances and uncertainties in the labels, this dataset provided a valuable opportunity for analysis and further exploration.

Let’s take a closer look at the dataset, explore the metrics and our approach, and recap how Provectus solved the problem of multi-label text classification.

The dataset has 14,354 observations, with 124 unique classes (labels). Our task is to assign one or multiple classes to every observation.
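
To make the target format concrete, here is a minimal sketch of how one-or-many labels per observation can be encoded as a multi-hot vector. The class names and annotations are hypothetical and stand in for the 124 real classes.

```python
# Hypothetical label vocabulary and annotations, for illustration only.
classes = ["billing", "delivery", "quality"]
annotations = [["billing"], ["delivery", "quality"], ["billing", "quality"]]


def to_multi_hot(tags, classes):
    """Return a 0/1 vector with one position per class."""
    tag_set = set(tags)
    return [1 if c in tag_set else 0 for c in classes]


targets = [to_multi_hot(tags, classes) for tags in annotations]
print(targets)  # [[1, 0, 0], [0, 1, 1], [1, 0, 1]]
```

Each row can contain any number of ones, which is what distinguishes this setup from multi-class classification.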

Table 1 provides descriptive statistics for the dataset.

On average, there are about two classes per observation, and each class is described by an average of 261 different texts.
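
Statistics of this kind can be derived directly from the annotations. The sketch below assumes each observation is stored as a list of tags; the toy data is hypothetical and does not reproduce the figures above.

```python
from collections import Counter

# Hypothetical annotations: one list of tags per observation.
annotations = [["a"], ["a", "b"], ["b", "c", "a"], ["c"]]


def dataset_stats(annotations):
    """Count observations and classes, and average them against each other."""
    label_counts = Counter(tag for tags in annotations for tag in tags)
    n_obs = len(annotations)
    total_assignments = sum(label_counts.values())
    return {
        "observations": n_obs,
        "classes": len(label_counts),
        "labels_per_observation": total_assignments / n_obs,
        "observations_per_class": total_assignments / len(label_counts),
    }


stats = dataset_stats(annotations)
print(stats)
```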

Table 1: Dataset statistics

In Figure 1, the top graph shows the distribution of classes: a small number of HEAD labels have the highest frequency of occurrence in the dataset, while most classes have a low frequency of occurrence.

In the bottom graph we see that there is frequent overlap between the classes that are best represented in the dataset and the classes that have low significance.

We modified the procedure of splitting the dataset into train/val/test sets. Instead of using a standard method, we employed iterative stratification to provide a well-balanced distribution of evidence of label relations. For that, we used scikit-multilearn:

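import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.model_selection import iterative_train_test_split

# df is the annotated dataset; df["tag"] holds the list of labels for each row
mlb = MultiLabelBinarizer()
mlb.fit(df["tag"])

def balanced_split(df, mlb, test_size=0.5):
    # iterative_train_test_split expects 2-D arrays, so we split row indices
    ind = np.expand_dims(np.arange(len(df)), axis=1)
    labels = mlb.transform(df["tag"])
    ind_train, _, ind_test, _ = iterative_train_test_split(ind, labels, test_size)
    return df.iloc[ind_train[:, 0]], df.iloc[ind_test[:, 0]]

df_train, df_tmp = balanced_split(df, mlb, test_size=0.4)
df_val, df_test = balanced_split(df_tmp, mlb, test_size=0.5)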

We obtained the following distribution:

1. The training dataset contains 60% of the data and covers all 124 labels
2. The validation dataset contains 20% of the data and covers all 124 labels
3. The test dataset contains 20% of the data and covers all 124 labels
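
A split like this can be sanity-checked with a few lines of code. The sketch below is an illustration on hypothetical toy data, with each split represented as a list of tag lists; the function name `split_report` is our own.

```python
def split_report(splits, all_labels):
    """Report each split's share of the data and whether it covers every label."""
    total = sum(len(rows) for rows in splits.values())
    report = {}
    for name, rows in splits.items():
        seen = {tag for tags in rows for tag in tags}
        report[name] = {
            "share": len(rows) / total,
            "covers_all_labels": seen >= set(all_labels),
        }
    return report


# Hypothetical toy splits with three labels.
splits = {
    "train": [["a"], ["b"], ["a", "c"], ["c"], ["b"], ["a"]],
    "val": [["a", "b"], ["c"]],
    "test": [["b", "c"], ["a"]],
}
report = split_report(splits, ["a", "b", "c"])
print(report)
```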

Multi-label classification is a type of supervised machine learning task in which multiple labels can be assigned to a single data sample. It differs from binary classification, where the model predicts one of only two categories, and from multi-class classification, where the model predicts exactly one out of several classes for a sample.

Evaluation metrics for multi-label classification are inherently different from those used in multi-class (or binary) classification because of the inherent differences in the classification problem. More detailed information can be found on Wikipedia.

We selected the metrics that are most suitable for us:

1. Precision measures the proportion of true positive predictions among all positive predictions made by the model.
2. Recall measures the proportion of true positive predictions among all actual positive samples.
3. F1-score is the harmonic mean of precision and recall, which helps to balance the two.
4. Hamming loss is the fraction of labels that are incorrectly predicted.

We also track the number of predicted labels, defined as the count of labels for which we obtain an F1 score > 0.
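
The metrics above can be sketched with NumPy as follows. This is an illustration under one assumption: precision, recall, and F1 are micro-averaged, that is, counted over all (sample, label) pairs. The toy matrices are hypothetical.

```python
import numpy as np


def multilabel_metrics(y_true, y_pred):
    """Micro-averaged precision/recall/F1, Hamming loss, and labels with F1 > 0."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Fraction of (sample, label) cells where prediction and truth disagree.
    hamming = float(np.mean(y_true != y_pred))
    # A label has F1 > 0 exactly when it has at least one true positive.
    predicted_labels = int(np.sum((y_true & y_pred).any(axis=0)))
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hamming_loss": hamming,
        "predicted_labels": predicted_labels,
    }


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```

In this toy example the third label is never predicted correctly, so only two of the three labels count as predicted.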

Multi-label classification is a type of supervised learning problem where a single instance can be associated with multiple labels, as opposed to traditional single-label classification, where each instance is associated with only one class label.

To solve multi-label classification problems, there are two main categories of techniques:

1. Problem transformation methods
2. Algorithm adaptation methods

Problem transformation methods allow us to transform a multi-label classification task into multiple single-label classification tasks. For example, the Binary Relevance (BR) baseline approach treats every label as a separate binary classification problem. In this case, the multi-label problem is transformed into several single-label problems.
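
The transformation step of Binary Relevance can be sketched in a few lines. The class names and annotations are hypothetical, and the sketch covers only the data transformation; training one binary classifier per target vector is left out.

```python
def binary_relevance_targets(annotations, classes):
    """Map each class to a list of 0/1 targets, one entry per observation."""
    return {c: [1 if c in tags else 0 for tags in annotations] for c in classes}


# Hypothetical data, for illustration only.
classes = ["billing", "delivery", "quality"]
annotations = [["billing"], ["delivery", "quality"], ["billing", "quality"]]

per_label = binary_relevance_targets(annotations, classes)
# per_label["quality"] is the target vector for an independent
# "quality vs. rest" binary classifier.
print(per_label)
```

Because every label gets its own independent problem, this approach ignores relations between labels, which is one reason it serves as a baseline.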

Algorithm adaptation methods modify the algorithms themselves to handle multi-label data natively, without transforming the task into multiple single-label classification tasks. An example of this approach is the BERT model, a pre-trained transformer-based language model that can be fine-tuned for various NLP tasks, including multi-label text classification. With one sigmoid output per label, a fine-tuned BERT classifier can handle multi-label data directly, without the need for problem transformation.

In the context of using BERT for multi-label text classification, the standard approach is to use Binary Cross-Entropy (BCE) loss as the loss function. BCE loss is a commonly used loss function for binary classification problems and can be easily extended to multi-label classification by computing the loss for each label independently and then summing the losses. In this case, the BCE loss function measures the error between predicted probabilities and true labels, where the predicted probabilities are obtained from the final sigmoid activation layer in the BERT model.
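
As an illustration of the per-label computation, here is a minimal NumPy sketch of multi-label BCE. It assumes raw logits as input, sums the per-label losses for each sample, and averages over the batch; the `eps` term is our own addition for numerical safety.

```python
import math

import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def multilabel_bce(logits, targets, eps=1e-12):
    """Per-label binary cross-entropy, summed over labels, averaged over samples."""
    p = sigmoid(np.asarray(logits, dtype=float))
    t = np.asarray(targets, dtype=float)
    per_label = -(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
    return float(per_label.sum(axis=1).mean())


# With zero logits every probability is 0.5, so each label contributes ln(2).
loss = multilabel_bce([[0.0, 0.0, 0.0]], [[1, 0, 1]])
print(loss, 3 * math.log(2))
```

In PyTorch, `torch.nn.BCEWithLogitsLoss` computes the same per-label terms; note that its default reduction averages over all elements rather than summing over labels.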

Now, let’s take a closer look at Figure 2 below.

Figure 2. Metrics for baseline models

The graph on the left shows a comparison of metrics for “baseline: BERT” and “baseline: ML”. For “baseline: BERT”, the F1 and Recall scores are approximately 1.5 times higher, whereas the Precision for “baseline: ML” is 2 times higher than that of “baseline: BERT”. Looking at the overall percentage of predicted classes shown on the right, we see that “baseline: BERT” predicted more than 10 times as many classes as “baseline: ML”.

Because even “baseline: BERT” predicts fewer than 50% of all classes at best, the results are quite discouraging. Let’s figure out how to improve these results.

Based on the excellent article “Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution”, we found that distribution-balanced loss could be the most suitable approach for us.

Distribution-balanced loss
Distribution-balanced loss is a technique used in multi-label text classification problems to address imbalances in class distribution. In these problems, some classes have a much higher frequency of occurrence than others, resulting in model bias towards these more frequent classes.

To address this issue, distribution-balanced loss aims to balance the contribution of each sample in the loss function. This is achieved by re-weighting the loss of each sample based on the inverse of its frequency of occurrence in the dataset. By doing so, the contribution of less frequent classes is increased, and the contribution of more frequent classes is decreased, thus balancing the overall class distribution.

This technique has been shown to be effective in improving the performance of models on long-tailed class distribution problems. By reducing the impact of frequent classes and increasing the impact of infrequent classes, the model is able to better capture patterns in the data and produce more balanced predictions.
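
The core re-weighting idea can be sketched with plain inverse-frequency weights. This is a deliberate simplification and not the full distribution-balanced loss, which also accounts for label co-occurrence and regularizes negative labels. The class frequencies are hypothetical.

```python
import numpy as np


def inverse_frequency_weights(class_freq):
    """Per-class weights proportional to 1 / frequency, scaled to average 1."""
    freq = np.asarray(class_freq, dtype=float)
    inv = 1.0 / freq
    return inv / inv.mean()


def weighted_bce(probs, targets, class_weights, eps=1e-12):
    """Binary cross-entropy in which each label's loss is scaled by its class weight."""
    p = np.asarray(probs, dtype=float)
    t = np.asarray(targets, dtype=float)
    per_label = -(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
    return float((per_label * class_weights).sum(axis=1).mean())


class_freq = [1000, 100, 10]  # hypothetical head, medium, and tail classes
weights = inverse_frequency_weights(class_freq)

# The same mistake costs more on the rare class than on the frequent one.
miss_head = weighted_bce([[0.1, 0.1, 0.1]], [[1, 0, 0]], weights)
miss_tail = weighted_bce([[0.1, 0.1, 0.1]], [[0, 0, 1]], weights)
print(weights, miss_head, miss_tail)
```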

Implementation of the ResampleLoss class
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class ResampleLoss(nn.Module):
    def __init__(
        self,
        use_sigmoid=True,
        partial=False,
        loss_weight=1.0,
        reduction="mean",
        reweight_func=None,
        weight_norm=None,
        focal=dict(focal=True, alpha=0.5, gamma=2),
        map_param=dict(alpha=10.0, beta=0.2, gamma=0.1),
        CB_loss=dict(CB_beta=0.9, CB_mode="average_w"),
        logit_reg=dict(init_bias=0.1),
        class_freq=None,
        train_num=None,
    ):
        super(ResampleLoss, self).__init__()
        assert (use_sigmoid is True) or (partial is False)
        self.use_sigmoid = use_sigmoid
        self.partial = partial
        self.loss_weight = loss_weight
        self.reduction = reduction
        # partial_cross_entropy and cross_entropy are not part of this listing;
        # the default settings only use binary_cross_entropy (defined below)
        if self.use_sigmoid:
            if self.partial:
                self.cls_criterion = partial_cross_entropy
            else:
                self.cls_criterion = binary_cross_entropy
        else:
            self.cls_criterion = cross_entropy
        # reweighting function
        self.reweight_func = reweight_func
        # normalization (optional)
        self.weight_norm = weight_norm
        # focal loss params
        self.focal = focal["focal"]
        self.gamma = focal["gamma"]
        self.alpha = focal["alpha"]
        # mapping function params
        self.map_alpha = map_param["alpha"]
        self.map_beta = map_param["beta"]
        self.map_gamma = map_param["gamma"]
        # CB loss params (optional)
        self.CB_beta = CB_loss["CB_beta"]
        self.CB_mode = CB_loss["CB_mode"]
        self.class_freq = torch.from_numpy(np.asarray(class_freq)).float().cuda()
        self.num_classes = self.class_freq.shape[0]
        self.train_num = train_num  # only used to be divided by class_freq
        # regularization params
        self.logit_reg = logit_reg
        self.neg_scale = logit_reg["neg_scale"] if "neg_scale" in logit_reg else 1.0
        init_bias = logit_reg["init_bias"] if "init_bias" in logit_reg else 0.0
        self.init_bias = torch.log(self.train_num / self.class_freq - 1) * init_bias
        self.freq_inv = torch.ones(self.class_freq.shape).cuda() / self.class_freq
        self.propotion_inv = self.train_num / self.class_freq

    def forward(
        self,
        cls_score,
        label,
        weight=None,
        avg_factor=None,
        reduction_override=None,
        **kwargs
    ):
        assert reduction_override in (None, "none", "mean", "sum")
        reduction = reduction_override if reduction_override else self.reduction
        weight = self.reweight_functions(label)
        cls_score, weight = self.logit_reg_functions(label.float(), cls_score, weight)
        if self.focal:
            logpt = self.cls_criterion(
                cls_score.clone(),
                label,
                weight=None,
                reduction="none",
                avg_factor=avg_factor,
            )
            # pt is sigmoid(logit) for pos or sigmoid(-logit) for neg
            pt = torch.exp(-logpt)
            wtloss = self.cls_criterion(
                cls_score, label.float(), weight=weight, reduction="none"
            )
            alpha_t = torch.where(label == 1, self.alpha, 1 - self.alpha)
            loss = alpha_t * ((1 - pt) ** self.gamma) * wtloss
            loss = reduce_loss(loss, reduction)
        else:
            loss = self.cls_criterion(
                cls_score, label.float(), weight, reduction=reduction
            )
        loss = self.loss_weight * loss
        return loss

    def reweight_functions(self, label):
        if self.reweight_func is None:
            return None
        elif self.reweight_func in ["inv", "sqrt_inv"]:
            weight = self.RW_weight(label.float())
        elif self.reweight_func == "rebalance":
            weight = self.rebalance_weight(label.float())
        elif self.reweight_func == "CB":
            weight = self.CB_weight(label.float())
        else:
            return None
        if self.weight_norm is not None:
            if "by_instance" in self.weight_norm:
                max_by_instance, _ = torch.max(weight, dim=-1, keepdim=True)
                weight = weight / max_by_instance
            elif "by_batch" in self.weight_norm:
                weight = weight / torch.max(weight)
        return weight

    def logit_reg_functions(self, labels, logits, weight=None):
        if not self.logit_reg:
            return logits, weight
        if "init_bias" in self.logit_reg:
            logits += self.init_bias
        if "neg_scale" in self.logit_reg:
            logits = logits * (1 - labels) * self.neg_scale + logits * labels
            if weight is not None:
                weight = weight / self.neg_scale * (1 - labels) + weight * labels
        return logits, weight

    def rebalance_weight(self, gt_labels):
        repeat_rate = torch.sum(gt_labels.float() * self.freq_inv, dim=1, keepdim=True)
        pos_weight = self.freq_inv.clone().detach().unsqueeze(0) / repeat_rate
        # pos and neg are equally treated
        weight = (
            torch.sigmoid(self.map_beta * (pos_weight - self.map_gamma))
            + self.map_alpha
        )
        return weight

    def CB_weight(self, gt_labels):
        if "by_class" in self.CB_mode:
            weight = (
                torch.tensor((1 - self.CB_beta)).cuda()
                / (1 - torch.pow(self.CB_beta, self.class_freq)).cuda()
            )
        elif "average_n" in self.CB_mode:
            avg_n = torch.sum(
                gt_labels * self.class_freq, dim=1, keepdim=True
            ) / torch.sum(gt_labels, dim=1, keepdim=True)
            weight = (
                torch.tensor((1 - self.CB_beta)).cuda()
                / (1 - torch.pow(self.CB_beta, avg_n)).cuda()
            )
        elif "average_w" in self.CB_mode:
            weight_ = (
                torch.tensor((1 - self.CB_beta)).cuda()
                / (1 - torch.pow(self.CB_beta, self.class_freq)).cuda()
            )
            weight = torch.sum(
                gt_labels * weight_, dim=1, keepdim=True
            ) / torch.sum(gt_labels, dim=1, keepdim=True)
        elif "min_n" in self.CB_mode:
            # negative labels are masked with a large value before taking the
            # minimum; train_num is an assumption here, chosen because it is an
            # upper bound on any class frequency
            min_n, _ = torch.min(
                gt_labels * self.class_freq + (1 - gt_labels) * self.train_num,
                dim=1,
                keepdim=True,
            )
            weight = (
                torch.tensor((1 - self.CB_beta)).cuda()
                / (1 - torch.pow(self.CB_beta, min_n)).cuda()
            )
        else:
            raise NameError
        return weight

    def RW_weight(self, gt_labels, by_class=True):
        if "sqrt" in self.reweight_func:
            weight = torch.sqrt(self.propotion_inv)
        else:
            weight = self.propotion_inv
        if not by_class:
            sum_ = torch.sum(weight * gt_labels, dim=1, keepdim=True)
            weight = sum_ / torch.sum(gt_labels, dim=1, keepdim=True)
        return weight


def reduce_loss(loss, reduction):
    """Reduce loss as specified.

    Args:
        loss (Tensor): Elementwise loss tensor.
        reduction (str): Options are "none", "mean" and "sum".

    Return:
        Tensor: Reduced loss tensor.
    """
    reduction_enum = F._Reduction.get_enum(reduction)
    # none: 0, elementwise_mean: 1, sum: 2
    if reduction_enum == 0:
        return loss
    elif reduction_enum == 1:
        return loss.mean()
    elif reduction_enum == 2:
        return loss.sum()


def weight_reduce_loss(loss, weight=None, reduction="mean", avg_factor=None):
    """Apply element-wise weight and reduce loss.

    Args:
        loss (Tensor): Element-wise loss.
        weight (Tensor): Element-wise weights.
        reduction (str): Same as built-in losses of PyTorch.
        avg_factor (float): Average factor when computing the mean of losses.

    Returns:
        Tensor: Processed loss values.
    """
    # if weight is specified, apply element-wise weight
    if weight is not None:
        loss = loss * weight
    # if avg_factor is not specified, just reduce the loss
    if avg_factor is None:
        loss = reduce_loss(loss, reduction)
    else:
        # if reduction is mean, then average the loss by avg_factor
        if reduction == "mean":
            loss = loss.sum() / avg_factor
        # if reduction is "none", then do nothing, otherwise raise an error
        elif reduction != "none":
            raise ValueError('avg_factor can not be used with reduction="sum"')
    return loss


def binary_cross_entropy(pred, label, weight=None, reduction="mean", avg_factor=None):
    # weighted element-wise losses
    if weight is not None:
        weight = weight.float()
    loss = F.binary_cross_entropy_with_logits(
        pred, label.float(), weight, reduction="none"
    )
    loss = weight_reduce_loss(loss, reduction=reduction, avg_factor=avg_factor)
    return loss


loss_func = ResampleLoss(
    focal=dict(focal=True, alpha=0.5, gamma=2),
    logit_reg=dict(init_bias=0.05, neg_scale=2.0),
    map_param=dict(alpha=0.1, beta=10.0, gamma=0.405),
    class_freq=class_freq,  # list of frequencies for each class
    train_num=train_num,  # size of the training dataset
)
By closely investigating the dataset, we concluded that the map_param gamma parameter should be 0.405.
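
To get a feel for what the map_param values do, the mapping used in rebalance_weight can be reproduced with NumPy. The sketch below mirrors that method with the alpha, beta, and gamma values from the loss_func call; the labels and class frequencies are hypothetical.

```python
import numpy as np


def rebalance_weight(labels, class_freq, alpha, beta, gamma):
    """NumPy re-statement of the smoothed re-balancing weight."""
    labels = np.asarray(labels, dtype=float)
    freq_inv = 1.0 / np.asarray(class_freq, dtype=float)
    # Sum of inverse frequencies over the positive labels of each sample.
    repeat_rate = (labels * freq_inv).sum(axis=1, keepdims=True)
    pos_weight = freq_inv[None, :] / repeat_rate
    # The sigmoid squashes the raw ratio into the range (alpha, alpha + 1).
    return 1.0 / (1.0 + np.exp(-beta * (pos_weight - gamma))) + alpha


labels = [[1, 0, 1], [0, 1, 0]]  # hypothetical multi-hot targets
class_freq = [1000, 100, 10]     # hypothetical class frequencies
w = rebalance_weight(labels, class_freq, alpha=0.1, beta=10.0, gamma=0.405)
print(w.round(3))
```

In this sketch gamma acts as the midpoint of the sigmoid: labels whose raw weight lies above it end up close to alpha + 1, and those below it end up close to alpha.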

Threshold tuning
Another step in improving our model was tuning the threshold, both in the training stage and in the validation and testing stages. We calculated how metrics such as F1 score, precision, and recall depend on the threshold level, and we selected the threshold that gave the highest metric score. Below you can see the function implementation of this process.

Optimization of the F1 score by tuning the threshold:

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def optimise_f1_score(true_labels: np.ndarray, pred_labels: np.ndarray):
    best_med_th = 0.0  # centre of the threshold search range
    true_bools = [tl == 1 for tl in true_labels]
    micro_thresholds = (np.array(range(-45, 15)) / 100) + best_med_th
    f1_results, prec_results, recall_results = [], [], []
    for th in micro_thresholds:
        pred_bools = [pl > th for pl in pred_labels]
        test_f1 = f1_score(
            true_bools, pred_bools, average="micro", zero_division=0
        )
        test_precision = precision_score(
            true_bools, pred_bools, average="micro", zero_division=0
        )
        test_recall = recall_score(
            true_bools, pred_bools, average="micro", zero_division=0
        )
        f1_results.append(test_f1)
        prec_results.append(test_precision)
        recall_results.append(test_recall)
    # pick the threshold with the highest micro-averaged F1 score
    best_f1_idx = np.argmax(f1_results)
    return micro_thresholds[best_f1_idx]
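
Once a threshold has been chosen, it is applied to the model scores to obtain the final label sets. A minimal sketch, with hypothetical class names, scores, and threshold value:

```python
import numpy as np


def decode_predictions(scores, threshold, classes):
    """Return, for each row, the class names whose score exceeds the threshold."""
    mask = np.asarray(scores) > threshold
    return [[c for c, keep in zip(classes, row) if keep] for row in mask]


classes = ["billing", "delivery", "quality"]  # hypothetical label names
scores = [[0.91, 0.12, 0.40], [0.05, 0.33, 0.80]]
decoded = decode_predictions(scores, 0.35, classes)
print(decoded)  # [['billing', 'quality'], ['quality']]
```

The comparison uses a strict `>` to match the thresholding in the function above.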

Evaluation and comparison with baseline
These approaches allowed us to train a new model and obtain the following result, which is compared to the baseline: BERT in Figure 3 below.

Figure 3. Comparison of metrics for the baseline and the newer approach.

By comparing the metrics that are relevant for classification, we see a significant increase in performance, by roughly 3 to 7 times depending on the metric:

The F1 score increased from 12% → 55%, while Precision increased from 9% → 59% and Recall increased from 15% → 51%.
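
The size of the improvement per metric follows directly from these numbers:

```python
baseline = {"f1": 0.12, "precision": 0.09, "recall": 0.15}
improved = {"f1": 0.55, "precision": 0.59, "recall": 0.51}

# Ratio of the new score to the baseline score, rounded to one decimal place.
factors = {name: round(improved[name] / baseline[name], 1) for name in baseline}
print(factors)  # {'f1': 4.6, 'precision': 6.6, 'recall': 3.4}
```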

With the changes shown in the right graph in Figure 3, we can now predict 80% of the classes.

Slices of classes
We divided our labels into four groups: HEAD, MEDIUM, TAIL, and ZERO. Each group contains labels with a similar number of supporting data observations.

As seen in Figure 4, the distributions of the groups are distinct. The rose box (HEAD) has a negatively skewed distribution, the middle box (MEDIUM) has a positively skewed distribution, and the green box (TAIL) appears to have a normal distribution.

All groups also have outliers, which are points outside the whiskers in the box plot. The HEAD group has a major impact on a MAJOR class.

Additionally, we identified a separate group named “ZERO”, which contains labels that the model was unable to learn and cannot recognize due to the minimal number of occurrences in the dataset (less than 3% of all observations).
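
One possible way to express such a grouping in code is sketched below. The support cut-offs (`head_min`, `medium_min`) are illustrative assumptions, since the exact boundaries between HEAD, MEDIUM, and TAIL are not given here; ZERO is taken to mean a per-label F1 score of 0.

```python
def assign_group(support, f1, head_min=100, medium_min=50):
    """Assign a label to a group; the cut-off values are illustrative assumptions."""
    if f1 == 0:
        return "ZERO"  # the model never predicts this label correctly
    if support >= head_min:
        return "HEAD"
    if support >= medium_min:
        return "MEDIUM"
    return "TAIL"


# (support, F1) pairs for four hypothetical labels.
examples = [(112, 0.73), (67, 0.52), (40, 0.21), (7, 0.0)]
groups = [assign_group(support, f1) for support, f1 in examples]
print(groups)  # ['HEAD', 'MEDIUM', 'TAIL', 'ZERO']
```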

Figure 4. Label counts vs. groups

Table 2 provides the metrics for each group of labels on the test subset of the data.

Table 2. Metrics per group.
1. The HEAD group contains 21 labels with an average of 112 support observations per label. This group is impacted by outliers and, because of its high representation in the dataset, its metrics are high: F1 – 73%, Precision – 71%, Recall – 75%.
2. The MEDIUM group consists of 44 labels with an average support of 67 observations, which is roughly two times lower than the HEAD group. The metrics for this group were expected to decrease by 50%; we observed: F1 – 52%, Precision – 56%, Recall – 51%.
3. The TAIL group has the largest number of classes, but all are poorly represented in the dataset, with an average of 40 support observations per label. As a result, the metrics drop significantly: F1 – 21%, Precision – 32%, Recall – 16%.
4. The ZERO group contains classes that the model cannot recognize at all, potentially due to their low occurrence in the dataset. Each of the 24 labels in this group has an average of 7 support observations.
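
Per-group figures of this kind can be produced by aggregating per-label scores. The sketch below assumes an unweighted mean of per-label F1 within each group, which may differ from the aggregation behind Table 2; the scores are hypothetical.

```python
from collections import defaultdict


def mean_f1_by_group(per_label):
    """per_label: iterable of (group, f1) pairs; returns the mean F1 per group."""
    buckets = defaultdict(list)
    for group, f1 in per_label:
        buckets[group].append(f1)
    return {group: sum(values) / len(values) for group, values in buckets.items()}


# Hypothetical per-label scores.
per_label = [
    ("HEAD", 0.8), ("HEAD", 0.7),
    ("TAIL", 0.2), ("TAIL", 0.3),
    ("ZERO", 0.0),
]
group_f1 = mean_f1_by_group(per_label)
print(group_f1)
```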

Figure 5 visualizes the information presented in Table 2, showing the metrics per group of labels.

Figure 5. Metrics vs. label groups. All ZERO values = 0.

In this article, we have demonstrated that a seemingly simple task of multi-label text classification can be challenging when traditional methods are applied. We have proposed the use of distribution-balancing loss functions to tackle the problem of class imbalance.

We have compared the performance of our proposed approach to the traditional method, and evaluated it using real-world business metrics. The results show that using loss functions to address class imbalances and label co-occurrences offers a viable solution for multi-label text classification.

The proposed use case highlights the importance of considering different approaches and techniques when dealing with multi-label text classification, and the potential benefits of distribution-balancing loss functions in addressing class imbalances.

If you are facing a similar issue and seeking to streamline document processing operations within your organization, please contact me or the Provectus team. We will be happy to help you find more efficient ways to automate your processes.

Oleksii Babych is a Machine Learning Engineer at Provectus. With a background in physics, he possesses excellent analytical and math skills, and has gained valuable experience through scientific research and international conference presentations, including SPIE Photonics West. Oleksii specializes in creating end-to-end, large-scale AI/ML solutions for healthcare and fintech industries. He is involved in every stage of the ML development life cycle, from identifying business problems to deploying and running production ML models.

Rinat Akhmetov is the ML Solution Architect at Provectus. With a solid practical background in Machine Learning (especially in Computer Vision), Rinat is a nerd, data enthusiast, software engineer, and workaholic whose second biggest passion is programming. At Provectus, Rinat is in charge of the discovery and proof of concept phases, and leads the execution of complex AI projects.
