<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Yachay AI]]></title><description><![CDATA[Open-Source Natural Language Processing models with a focus on text-based geolocation.]]></description><link>https://blog.yachay.ai</link><generator>RSS for Node</generator><lastBuildDate>Mon, 20 Apr 2026 10:08:13 GMT</lastBuildDate><atom:link href="https://blog.yachay.ai/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Uncertainty Estimation in Transformers]]></title><description><![CDATA[In our work on text-based geotagging, where many texts lack clear location references, having accurate uncertainty estimation is particularly valuable. In this blog post, we'll dive into different methods of achieving that.
To make it easy to follow ...]]></description><link>https://blog.yachay.ai/uncertainty-estimation-in-transformers</link><guid isPermaLink="true">https://blog.yachay.ai/uncertainty-estimation-in-transformers</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[transformers]]></category><category><![CDATA[text classification]]></category><category><![CDATA[Sentiment analysis]]></category><dc:creator><![CDATA[Yachay AI]]></dc:creator><pubDate>Thu, 22 Feb 2024 09:52:42 GMT</pubDate><content:encoded><![CDATA[<p>In our work on text-based geotagging, where many texts lack clear location references, having accurate uncertainty estimation is particularly valuable. In this blog post, we'll dive into different methods of achieving that.</p>
<p>To make it easy to follow along, we'll conduct all our experiments on a multi-class sentiment classification task.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>If you've ever worked with machine learning models, you know they're not just about making predictions; it's also crucial to understand how confident one can be in these predictions. This is where uncertainty estimation comes into play.</p>
<p>To illustrate the idea of uncertainty estimation, take a look at the bar charts below. We processed three different texts with the same model, and you can see that each text shows different confidence levels.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708432569333/a5968502-a3e2-486b-9c37-4fd7007e1c1e.png" alt class="image--center mx-auto" /></p>
<p>After reading this blog, you should have a clear understanding of different techniques, their strengths, weaknesses, and practical applications.</p>
<p>So, whether you're a seasoned data scientist or just dipping your toes into natural language processing (NLP), there's something here for everyone.</p>
<hr />
<h2 id="heading-setting-up-the-environment">Setting up the environment</h2>
<p>Let's start by setting up the environment.</p>
<ol>
<li>Install the essential libraries for transformer-based text classification.</li>
</ol>
<pre><code class="lang-python">!pip install transformers datasets &gt; /dev/null
</code></pre>
<ol start="2">
<li>Initialize the model and the tokenizer.</li>
</ol>
<p>In this example, we're using a DistilBERT-based model, renowned for its efficiency in understanding and processing language.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer, AutoModelForSequenceClassification
<span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset
<span class="hljs-keyword">import</span> random
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">initialize_model_and_tokenizer</span>(<span class="hljs-params">device</span>):</span>
    tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"bdotloh/distilbert-base-uncased-empathetic-dialogues-context"</span>)
    model = AutoModelForSequenceClassification.from_pretrained(<span class="hljs-string">"bdotloh/distilbert-base-uncased-empathetic-dialogues-context"</span>).to(device)
    <span class="hljs-keyword">return</span> model, tokenizer

device = <span class="hljs-string">'cuda'</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'cpu'</span>

<span class="hljs-comment"># Use the functions to initialize the model and tokenizer, and prepare the dataset</span>
model, tokenizer = initialize_model_and_tokenizer(device)
</code></pre>
<ol start="3">
<li><p>For this experiment, we're using a dataset from <a target="_blank" href="https://huggingface.co/datasets/bdotloh/empathetic-dialogues-contexts">'empathetic-dialogues-contexts'.</a> It's a rich dataset with 32 distinct emotion labels, offering a diverse range of contexts and emotional nuances. To collect the dataset, respondents were asked to describe events associated with specific emotions. It consists of 19,209 texts for training, 2,756 for validation, and 2,542 for testing.</p>
</li>
<li><p>After initializing the environment, the model, and the dataset, let's set a seed value to ensure consistent results across runs.</p>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">prepare_datasets</span>():</span>
    full_dataset = load_dataset(<span class="hljs-string">"bdotloh/empathetic-dialogues-contexts"</span>, split=<span class="hljs-string">'validation'</span>)
    shuffled_dataset = full_dataset.shuffle(seed=<span class="hljs-number">42</span>)
    valid_dataset = shuffled_dataset.select(range(<span class="hljs-number">1000</span>))
    train_dataset = load_dataset(<span class="hljs-string">"bdotloh/empathetic-dialogues-contexts"</span>, split=<span class="hljs-string">'train'</span>)
    <span class="hljs-keyword">return</span> train_dataset, valid_dataset

<span class="hljs-comment"># Define a seed for reproducibility</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">set_seed</span>(<span class="hljs-params">seed_value</span>):</span>
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    <span class="hljs-keyword">if</span> torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed_value)

set_seed(<span class="hljs-number">0</span>)  <span class="hljs-comment"># Set the seed for reproducibility</span>
train_dataset, valid_dataset = prepare_datasets()
</code></pre>
<ol start="5">
<li>Next, streamline the code by introducing several functions.</li>
</ol>
<p>To view the complete code of the functions, refer to the <a target="_blank" href="https://colab.research.google.com/drive/1CfKTnFRGWXoi_1w9kRkpkDGMVQ_XeUTd?usp=sharing">notebook</a>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Function to tokenize the dataset and return tensors ready for model input</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">tokenize_data</span>(<span class="hljs-params">tokenizer, dataset, max_length=<span class="hljs-number">512</span></span>):</span>
<span class="hljs-comment"># Function to perform model predictions and return logits</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_model_predictions</span>(<span class="hljs-params">model, inputs, device</span>):</span>
<span class="hljs-comment"># Function to calculate top-k accuracy</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">calculate_top_k_accuracy</span>(<span class="hljs-params">true_labels, predicted_probs, k=<span class="hljs-number">1</span></span>):</span>
<span class="hljs-comment"># Function to calculate and print calibration metrics for different top percentages</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">calculate_and_print_calibration_metrics</span>(<span class="hljs-params">predictions, method_name, percentages=[<span class="hljs-number">5</span>, <span class="hljs-number">10</span>, <span class="hljs-number">25</span>, <span class="hljs-number">50</span>]</span>):</span>
<span class="hljs-comment"># Function to prepare the DataLoader for auxiliary model training</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_data_loader</span>(<span class="hljs-params">hidden_states, labels, batch_size=<span class="hljs-number">16</span></span>):</span>
<span class="hljs-comment"># Function to calculate entropy from logits</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">calculate_entropy</span>(<span class="hljs-params">logits</span>):</span>
<span class="hljs-comment"># Renders the calibration chart</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">render_calibration_chart</span>(<span class="hljs-params">predictions, method_name</span>):</span>
</code></pre>
<h2 id="heading-uncertainty-estimation-methods">Uncertainty Estimation Methods</h2>
<p>Now that the environment is ready, let's delve into the theory behind uncertainty estimation methods. We'll start with the simple techniques, discussing the intricacies and effectiveness of each.</p>
<h3 id="heading-softmax-based-methods">Softmax-Based Methods</h3>
<p>Softmax is a mathematical function that converts raw model logits into probabilities by exponentiating and normalizing them, ensuring a 0 to 1 range with a sum of 1. This makes it a crucial bridge between raw predictions and their interpretation as confidence levels across different classes.</p>
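<p>As a quick, self-contained sketch (with arbitrary example logits, not model output), this is all the softmax step does:</p>

```python
import torch

# Arbitrary logits for a 3-class problem (illustrative values)
logits = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(logits, dim=-1)

print(probs)        # every value lies strictly between 0 and 1
print(probs.sum())  # the values sum to 1
```

Note that softmax preserves the ranking of the logits: the largest logit always receives the largest probability.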
<p><strong>Max Softmax</strong></p>
<p>The Max Softmax method gauges uncertainty by relying on the highest softmax score to indicate confidence. This approach assumes that a higher probability for a class corresponds to greater certainty.</p>
<ul>
<li><p>While computationally efficient and easy to implement, it may lack reliability, particularly for out-of-distribution samples where it tends to be overly confident.</p>
</li>
<li><p>In a graphical representation, imagine the Max Softmax method as assessing the length of the longest horizontal bar, with each bar length representing the model's confidence.</p>
</li>
</ul>
<p><strong>Softmax Difference</strong></p>
<p>The Softmax Difference method extends the Max Softmax concept by assessing the difference between the highest and second-highest softmax probabilities. A significant gap between these probabilities signals a strong preference by the model for one class, implying heightened confidence.</p>
<ul>
<li><p>This approach can offer a more nuanced measure of uncertainty compared to Max Softmax alone, particularly when the top two probabilities are closely matched.</p>
</li>
<li><p>To visualize this, picture the Softmax Difference method as examining the gap between two adjacent horizontal bars. A wider gap signifies a more decisive prediction from the model.</p>
</li>
</ul>
<p><strong>Softmax Variance</strong></p>
<p>Softmax Variance considers the distribution of probabilities across all classes. It evaluates confidence spread by calculating the variance of softmax probabilities.</p>
<p>High variance implies concentrated predictions, with one class dominating the probability mass, and thus indicates high confidence. Low variance means the probabilities are spread evenly across multiple classes, suggesting uncertainty.</p>
<ul>
<li>For a visual representation, imagine Softmax Variance as assessing the evenness of bar lengths across the chart. Greater uniformity signifies increased uncertainty in the model's predictions.</li>
</ul>
<p><strong>Softmax Entropy</strong></p>
<p>Entropy measures the uncertainty in the model's predictions by considering the entire softmax probability distribution. High entropy corresponds to greater uncertainty.</p>
<ul>
<li>Imagine Softmax Entropy as assessing how evenly the horizontal bars are sized. The more uniformly the probability mass is spread across classes, the higher the entropy and hence the uncertainty.</li>
</ul>
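<p>To make the last two measures concrete, here is a small sketch (toy probability vectors, not real model output) showing how variance and entropy separate a peaked distribution from a near-uniform one:</p>

```python
import torch

def softmax_variance(probs):
    # Higher variance: probability mass concentrated on few classes, i.e. more confident
    return probs.var(dim=-1, unbiased=False)

def softmax_entropy(probs):
    # Higher entropy: probability mass spread out, i.e. less confident
    return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)

peaked  = torch.tensor([0.90, 0.05, 0.03, 0.02])  # confident prediction
uniform = torch.tensor([0.25, 0.25, 0.25, 0.25])  # maximally uncertain

print(softmax_variance(peaked), softmax_variance(uniform))  # peaked has the larger variance
print(softmax_entropy(peaked), softmax_entropy(uniform))    # peaked has the smaller entropy
```

The small constant added inside the logarithm guards against taking the log of an exact zero.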
<hr />
<p>Now that we've covered the conceptual basis of the uncertainty estimation methods, <strong>let's delve into the practical aspects.</strong></p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">compute_confidence_scores</span>(<span class="hljs-params">logits, method=<span class="hljs-string">'Max Softmax'</span></span>):</span>
    confidences = []
    <span class="hljs-keyword">if</span> method == <span class="hljs-string">'Max Softmax'</span>:
        confidences = torch.nn.Softmax(dim=<span class="hljs-number">-1</span>)(logits).max(dim=<span class="hljs-number">1</span>).values.cpu().numpy()
    <span class="hljs-keyword">elif</span> method == <span class="hljs-string">'Softmax Difference'</span>:
        top_two_probs = torch.topk(torch.nn.Softmax(dim=<span class="hljs-number">-1</span>)(logits), <span class="hljs-number">2</span>, dim=<span class="hljs-number">1</span>).values
        confidences = (top_two_probs[:, <span class="hljs-number">0</span>] - top_two_probs[:, <span class="hljs-number">1</span>]).cpu().numpy()
    <span class="hljs-keyword">elif</span> method == <span class="hljs-string">'Softmax Variance'</span>:
        probs = torch.nn.Softmax(dim=<span class="hljs-number">-1</span>)(logits).cpu().numpy()
        confidences = np.var(probs, axis=<span class="hljs-number">1</span>)
    <span class="hljs-keyword">elif</span> method == <span class="hljs-string">'Softmax Entropy'</span>:
        probs = torch.nn.Softmax(dim=<span class="hljs-number">-1</span>)(logits)
        <span class="hljs-comment"># Negative entropy, so that a higher score still means higher confidence</span>
        confidences = (probs * torch.log(probs + <span class="hljs-number">1e-12</span>)).sum(dim=<span class="hljs-number">1</span>).cpu().numpy()
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"Unknown method <span class="hljs-subst">{method}</span>"</span>)
    <span class="hljs-keyword">return</span> confidences
</code></pre>
<p>In many cases, various methods for uncertainty estimation share similar conclusions. However, it's the moments of disagreement that are particularly insightful.</p>
<p>In the example below, the Max Softmax and Softmax Difference methods disagree on whether a text belongs to the top 10% by confidence, allowing us to explore the difference between the two assessments.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707818648666/4bfc4ac8-cc0c-4701-bbf8-fe50adcc14eb.png" alt class="image--center mx-auto" /></p>
<p>On the validation set, the top-10% threshold for Max Softmax is 0.9941, and for Softmax Difference it is 0.9932.</p>
<ul>
<li><p>On the left chart, both Max Softmax and Softmax Difference methods agree on identifying a sample text as one of the top 10% most confident predictions in the dataset, a consensus observed in 99% of the samples.</p>
</li>
<li><p>If the threshold is set at the 10% most confident predictions, the "Content" class on the left chart surpasses the threshold for both Max Softmax and Softmax Difference.</p>
</li>
<li><p>In contrast, the right chart shows the class "Grateful" surpassing the Max Softmax threshold with a value of 0.9945, but its difference of 0.9931 falls just short of the Softmax Difference threshold value.</p>
</li>
</ul>
<p>Even though transformers often generate high probability values, a minor difference, such as the one between 0.9956 and 0.9945, can be crucial in the analysis. This highlights the importance of subtle probability variations when identifying the most confident predictions.</p>
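<p>A toy example (hand-picked probabilities, not taken from the dataset) makes the disagreement mechanism explicit: two predictions can share the same Max Softmax score while their Softmax Difference scores diverge.</p>

```python
import torch

# Two illustrative probability distributions with the same top probability
probs = torch.tensor([
    [0.60, 0.20, 0.20],  # runner-up far behind the winner
    [0.60, 0.38, 0.02],  # runner-up close behind the winner
])

max_softmax = probs.max(dim=-1).values   # 0.60 for both rows
top2 = torch.topk(probs, 2, dim=-1).values
softmax_diff = top2[:, 0] - top2[:, 1]   # about 0.40 vs. about 0.22

# Max Softmax ranks both predictions as equally confident,
# while Softmax Difference flags the second one as more ambiguous.
print(max_softmax, softmax_diff)
```

Depending on where the top-10% threshold falls, the two methods can therefore admit different texts into the high-confidence subset.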
<hr />
<h2 id="heading-advanced-methods">Advanced Methods</h2>
<p>After exploring simple techniques that rely solely on a model's output from a single forward pass, let's step into more advanced territory.</p>
<p>The next methods offer a deeper dive into uncertainty estimation.</p>
<h3 id="heading-monte-carlo-dropout">Monte Carlo Dropout</h3>
<p>Monte Carlo Dropout (MCD) estimates uncertainty by utilizing dropout layers in a neural network. It keeps dropout active during inference, running multiple forward passes to measure prediction variability. High variability signals low confidence.</p>
<p>MCD offers a practical method for quantifying model uncertainty without retraining or altering the model architecture.</p>
<p>During training, MCD uses dropout to regularize the model, preventing overfitting. Enabling dropout during inference simulates a model ensemble, where each pass uses a slightly different network architecture. The variance in predictions indicates the model's uncertainty.</p>
<p><strong>Limitations and Assumptions:</strong></p>
<ul>
<li><p>Monte Carlo Dropout has a limitation in assuming that dropout layers alone can adequately capture model uncertainty. This assumption may not hold for all network architectures or datasets.</p>
</li>
<li><p>Another drawback is the increased computational cost due to the necessity for multiple forward passes.</p>
</li>
</ul>
<pre><code class="lang-python"># Monte Carlo dropout: keep dropout active at inference and aggregate several passes
def enable_dropout(model):
    # Re-enable only the dropout layers while the rest of the model stays in eval mode
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

def monte_carlo_dropout(model, inputs, num_samples=10):
    model.eval()
    enable_dropout(model)
    all_probs = []
    with torch.no_grad():
        for _ in range(num_samples):
            logits = model(**inputs).logits
            all_probs.append(torch.softmax(logits, dim=-1))
    stacked = torch.stack(all_probs)               # (num_samples, batch, num_classes)
    average_predictions = stacked.mean(dim=0)      # averaged class probabilities
    uncertainty = stacked.std(dim=0).mean(dim=-1)  # variability across passes, per sample
    return average_predictions, uncertainty
</code></pre>
<h3 id="heading-mahalanobis-distance">Mahalanobis Distance</h3>
<p>The Mahalanobis Distance method gauges uncertainty by measuring how far a new data point is from the distribution of a class in the hidden space of a neural network. Smaller distances indicate higher confidence.</p>
<ul>
<li><p>This method is good at spotting out-of-distribution examples because it shows how unusual a point is compared to what the model has learned. Points far from any class mean are likely outliers or from a new distribution.</p>
</li>
<li><p>Mahalanobis Distance can be used in combination with other techniques.</p>
</li>
<li><p>For instance, to simplify and speed up the calculations, Principal Component Analysis (PCA) can be used to reduce the dimensionality of the feature space. PCA concentrates on the most important directions while discarding noise, under the assumption that those directions capture the aspects essential for uncertainty estimation.</p>
</li>
</ul>
<pre><code class="lang-python"># This pseudocode illustrates Mahalanobis distance for estimating confidence
import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.decomposition import PCA

def prepare_mahalanobis(data, n_components=0.9):
    # Compute hidden states for the dataset (helper defined in the notebook)
    hidden_states = compute_hidden_states(data)

    # Apply PCA to reduce dimensionality
    pca = PCA(n_components=n_components)
    pca.fit(hidden_states)
    hidden_states_pca = pca.transform(hidden_states)

    # Calculate mean vector and precision (inverse covariance) matrix
    mean_vector = np.mean(hidden_states_pca, axis=0)
    covariance_matrix = np.cov(hidden_states_pca, rowvar=False)
    precision_matrix = np.linalg.inv(covariance_matrix)

    return mean_vector, precision_matrix, pca

def compute_mahalanobis_distance(samples_pca, mean_vector, precision_matrix):
    distances = []
    for sample in samples_pca:
        distance = mahalanobis(sample, mean_vector, precision_matrix)
        distances.append(distance)
    return np.array(distances)

# Usage:
mean_vector, precision_matrix, pca = prepare_mahalanobis(train_data)
# Transform validation data using the trained PCA
validation_pca = pca.transform(validation_data)
# Compute distances (smaller distance = higher confidence)
distances = compute_mahalanobis_distance(validation_pca, mean_vector, precision_matrix)
</code></pre>
<h3 id="heading-auxiliary-classifier">Auxiliary Classifier</h3>
<p>An auxiliary classifier is a secondary model trained to predict the certainty of the primary model's predictions. It takes the hidden states of the main model as input and learns to distinguish between correct and incorrect predictions.</p>
<p>This approach directly models uncertainty and can be tailored to the specific distribution of the data. By learning from the main model's hidden representations, the auxiliary classifier can provide a more nuanced understanding of the model's confidence in its predictions.</p>
<p><strong>Limitations and Assumptions:</strong></p>
<ul>
<li><p>If the auxiliary classifier overfits to the primary model's mistakes, it may inherit its biases, leading to overconfident incorrect predictions or underconfident correct predictions.</p>
</li>
<li><p>It's essential to ensure that the auxiliary classifier is trained with a representative and balanced dataset to mitigate this risk.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Pseudocode for the auxiliary classifier method</span>
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn

<span class="hljs-comment"># Define a simple MLP for binary classification</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SimpleMLP</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, input_dim</span>):</span>
        super().__init__()
        self.layer1 = nn.Linear(input_dim, <span class="hljs-number">32</span>)  <span class="hljs-comment"># First linear layer</span>
        self.layer2 = nn.Linear(<span class="hljs-number">32</span>, <span class="hljs-number">1</span>)  <span class="hljs-comment"># Second linear layer, outputting a single value</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>
        x = torch.relu(self.layer1(x))  <span class="hljs-comment"># Apply ReLU activation after first layer</span>
        x = torch.sigmoid(self.layer2(x))  <span class="hljs-comment"># Apply sigmoid activation after second layer for binary output</span>
        <span class="hljs-keyword">return</span> x

<span class="hljs-comment"># Function to create binary labels based on prediction correctness</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_binary_labels</span>(<span class="hljs-params">predicted_classes, true_labels</span>):</span>
    correct_predictions = (predicted_classes == true_labels).astype(int)  <span class="hljs-comment"># 1 for correct, 0 for incorrect</span>
    <span class="hljs-keyword">return</span> correct_predictions

<span class="hljs-comment"># Train the auxiliary classifier with the balanced dataset</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train_auxiliary_classifier</span>(<span class="hljs-params">features, labels, epochs=<span class="hljs-number">10</span></span>):</span>
    model = SimpleMLP(input_dim=features.shape[<span class="hljs-number">1</span>]).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=<span class="hljs-number">0.001</span>)
    criterion = nn.BCELoss()  <span class="hljs-comment"># Binary cross-entropy loss for binary classification</span>
    train_loader = create_data_loader(features, labels)  <span class="hljs-comment"># Helper defined earlier; yields (inputs, targets) batches</span>

    <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(epochs):
        model.train()  <span class="hljs-comment"># Set model to training mode</span>
        total_loss = <span class="hljs-number">0</span>
        <span class="hljs-keyword">for</span> inputs, targets <span class="hljs-keyword">in</span> train_loader:
            inputs, targets = inputs.to(device), targets.to(device)  <span class="hljs-comment"># Move data to the appropriate device</span>
            optimizer.zero_grad()  <span class="hljs-comment"># Clear gradients before each update</span>
            outputs = model(inputs)  <span class="hljs-comment"># Forward pass to get output</span>
            loss = criterion(outputs, targets)  <span class="hljs-comment"># Calculate loss between model predictions and true labels</span>
            loss.backward()  <span class="hljs-comment"># Calculate gradients</span>
            optimizer.step()  <span class="hljs-comment"># Update model parameters</span>
            total_loss += loss.item()  <span class="hljs-comment"># Aggregate loss</span>

    <span class="hljs-keyword">return</span> model

<span class="hljs-comment"># Use the trained auxiliary classifier to estimate uncertainty</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">estimate_uncertainty</span>(<span class="hljs-params">model, features</span>):</span>
    model.eval()  <span class="hljs-comment"># Switch model to evaluation mode</span>
    <span class="hljs-keyword">with</span> torch.no_grad():
        predictions = model(features)
    uncertainty = <span class="hljs-number">1</span> - predictions  <span class="hljs-comment"># Assuming higher prediction score means lower uncertainty</span>
    <span class="hljs-keyword">return</span> uncertainty
</code></pre>
<h2 id="heading-comparative-analysis">Comparative Analysis</h2>
<p>Let's compare these methods based on two key metrics:</p>
<ol>
<li><p>Accuracy of the top-X% most confident predictions</p>
</li>
<li><p>Calibration charts</p>
</li>
</ol>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>top_5%_accuracy</td><td>top_10%_accuracy</td><td>top_25%_accuracy</td></tr>
</thead>
<tbody>
<tr>
<td>Max Softmax</td><td><strong>0.980</strong></td><td>0.960</td><td><strong>0.864</strong></td></tr>
<tr>
<td>Softmax Difference</td><td><strong>0.980</strong></td><td><strong>0.970</strong></td><td>0.856</td></tr>
<tr>
<td>Softmax Variance</td><td><strong>0.980</strong></td><td>0.960</td><td><strong>0.864</strong></td></tr>
<tr>
<td>Softmax Entropy</td><td><strong>0.980</strong></td><td>0.960</td><td>0.856</td></tr>
<tr>
<td>Monte Carlo Dropout</td><td>0.860</td><td>0.870</td><td>0.800</td></tr>
<tr>
<td>Mahalanobis Distance</td><td>0.700</td><td>0.650</td><td>0.684</td></tr>
<tr>
<td>Auxiliary Classifier</td><td>0.840</td><td>0.760</td><td>0.720</td></tr>
</tbody>
</table>
</div><h3 id="heading-accuracy">Accuracy</h3>
<p>The comparative analysis highlights the varying efficacy of the different uncertainty estimation methods in isolating the model's most reliable predictions.</p>
<p>Compared to the baseline model with a general accuracy of approximately 54.60%, the softmax-based methods (Max Softmax, Softmax Difference, Softmax Variance, and Softmax Entropy) stand out with a higher accuracy. These methods have achieved nearly perfect accuracy (98%) in the top 5% most confident predictions.</p>
<p><mark>This highlights their effectiveness in discerning the most reliable predictions from the model's output.</mark></p>
<p>On the other hand, Monte Carlo Dropout, Mahalanobis Distance, and Auxiliary Classifier have underperformed in comparison to their softmax-based counterparts on this specific metric.</p>
<p><mark>This discrepancy could be attributed to a potential lack of sufficient training data, which is often crucial for these methods to refine their estimates of uncertainty.</mark></p>
<p>It's important to note that, despite this performance gap, current research acknowledges scenarios where more advanced methods may outperform softmax-based approaches. In a more complex task, the capabilities of Monte Carlo Dropout, Mahalanobis Distance, and the Auxiliary Classifier could become more evident, making them invaluable tools in certain contexts.</p>
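<p>The top-X% accuracy metric used in the table above can be sketched in a few lines (the confidences and labels here are synthetic, for illustration only):</p>

```python
import numpy as np

def top_percent_accuracy(confidences, correct, percent):
    """Accuracy restricted to the `percent`% most confident predictions."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    n_top = max(1, int(len(confidences) * percent / 100))
    top_idx = np.argsort(confidences)[::-1][:n_top]  # most confident first
    return correct[top_idx].mean()

# Synthetic scores: the most confident predictions happen to be correct
confidences = [0.99, 0.95, 0.90, 0.60, 0.55, 0.50, 0.40, 0.30, 0.20, 0.10]
correct     = [1,    1,    1,    0,    1,    0,    0,    1,    0,    0]

print(top_percent_accuracy(confidences, correct, 25))  # accuracy over the top 25%
```

The same confidence scores are used only for ranking here, so any of the methods above can be plugged in without rescaling.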
<h3 id="heading-calibration-charts">Calibration Charts</h3>
<p>Calibration charts function like maps, helping to navigate the model's confidence reliability. They illustrate how closely the model's perceived confidence aligns with its actual accuracy. A well-calibrated model would have a chart where confidence levels match the accuracy.</p>
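<p>The data behind such a chart can be computed with a simple binning scheme, sketched below (the confidences and labels are synthetic; the charts in this post come from the validation set):</p>

```python
import numpy as np

def calibration_bins(confidences, correct, n_bins=5):
    """Per-bin (mean confidence, accuracy) pairs for a calibration chart."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1])  # assign each prediction to a bin
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((confidences[mask].mean(), correct[mask].mean()))
    return rows

# Synthetic, well-calibrated-looking scores
conf = [0.95, 0.92, 0.55, 0.52, 0.15, 0.12]
corr = [1,    1,    1,    0,    0,    0]
for mean_conf, acc in calibration_bins(conf, corr):
    print(round(float(mean_conf), 2), round(float(acc), 2))
```

A well-calibrated model produces pairs where the mean confidence in each bin is close to the accuracy in that bin.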
<p>Upon analyzing the calibration charts, each method shows strengths in calibration. The model's confidence and actual accuracy align closely, which implies that all methods are reasonably well-calibrated.</p>
<p>While there are expected fluctuations, they do not significantly diminish the overall calibration quality. It's reassuring that, regardless of the chosen method, a consistently reliable level of confidence in the model's predictions can be expected.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707485540470/e7a47b7a-8e7f-4635-8a01-2ac1d63c4b82.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this example task, different methods for uncertainty estimation have proven effective in boosting the reliability of predictions from transformer models. Even when a model has moderate overall performance, these methods can identify subsets of predictions with significantly higher accuracy.</p>
<h2 id="heading-additional-reading-and-practice-suggestions">Additional Reading and Practice Suggestions</h2>
<p>For those who love a challenge and want to delve deeper, here are three research tasks:</p>
<h3 id="heading-task-1-bias-analysis">Task 1. Bias Analysis</h3>
<p>Investigate how the subset of most confident predictions differs from the full set. Which methods introduce more bias, and which are more impartial?</p>
<h3 id="heading-task-2-out-of-distribution-performance">Task 2. Out-of-Distribution Performance</h3>
<p>Test these methods on an out-of-distribution dataset. Choose a sentiment classification dataset, translate the current dataset, or craft a specific small dataset. How do the methods fare in unfamiliar waters?</p>
<h3 id="heading-task-3-kaggle-style-challenge">Task 3. Kaggle Style Challenge</h3>
<p>Blend the scores from different methods, add new features, and use them in a gradient boosting algorithm. Can you build a more accurate and robust ensemble compared to individual methods?</p>
<h2 id="heading-further-reading">Further reading</h2>
<p>For those hungry for more knowledge, here's a list of papers that dive deeper into the world of uncertainty estimation in transformers.</p>
<ul>
<li><p>Artem Shelmanov, Evgenii Tsymbalov, Dmitri Puzyrev, Kirill Fedyanin, Alexander Panchenko, and Maxim Panov. 2021. <a target="_blank" href="https://aclanthology.org/2021.eacl-main.157/">How certain is your Transformer?</a></p>
</li>
<li><p>Artem Vazhentsev, Gleb Kuzmin, Artem Shelmanov, Akim Tsvigun, Evgenii Tsymbalov, Kirill Fedyanin, Maxim Panov, Alexander Panchenko, Gleb Gusev, Mikhail Burtsev, Manvel Avetisian, and Leonid Zhukov. 2022. <a target="_blank" href="https://aclanthology.org/2022.acl-long.566/">Uncertainty estimation of transformer predictions for misclassification detection.</a></p>
</li>
<li><p>Andreas Nugaard Holm, Dustin Wright, and Isabelle Augenstein. 2023. <a target="_blank" href="https://arxiv.org/abs/2210.14037">Revisiting softmax for uncertainty approximation in text classification.</a></p>
</li>
<li><p>Jiahuan Pei, Cheng Wang, and György Szarvas. 2022. <a target="_blank" href="https://arxiv.org/abs/2112.13776">Transformer uncertainty estimation with hierarchical stochastic attention.</a></p>
</li>
<li><p>Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. <a target="_blank" href="https://arxiv.org/abs/1706.04599">On calibration of modern neural networks.</a></p>
</li>
<li><p>Elizaveta Kostenok, Daniil Cherniavskii, and Alexey Zaytsev. 2023. <a target="_blank" href="https://arxiv.org/abs/2308.11295">Uncertainty estimation of transformers’ predictions via topological analysis of the attention matrices</a></p>
</li>
<li><p>Jiazheng Li, Zhaoyue Sun, Bin Liang, Lin Gui, and Yulan He. 2023a. <a target="_blank" href="https://arxiv.org/abs/2306.03598">CUE: An uncertainty interpretation framework for text classifiers built on pretrained language models.</a></p>
</li>
<li><p>Yonatan Geifman and Ran El-Yaniv. 2017. <a target="_blank" href="https://arxiv.org/abs/1705.08500">Selective classification for deep neural networks</a></p>
</li>
<li><p>Karthik Abinav Sankararaman, Sinong Wang, and Han Fang. 2022. <a target="_blank" href="https://arxiv.org/abs/2206.00826">Bayesformer: Transformer with uncertainty estimation</a></p>
</li>
</ul>
]]></content:encoded></item></channel></rss>