Practical NLP for Risk Modeling, Part III - Narrative Compression vs. Truncation – The Pleasure of Finding Things Out: A blog by James Triveri

The notebook version of this post can be downloaded here.

In Part II of the Practical NLP for Risk Modeling series, we fine-tuned DistilBERT end-to-end on NOAA tornado narratives and saw significant performance improvement versus the frozen-embedding baseline from Part I. That result is encouraging, but it raises a practical question: How much narrative text do we actually need? Free-form text data can be long, repetitive, and expensive to process. If it is possible to compress narratives while simultaneously preserving most of the severity signal, we get cheaper scoring with an acceptable level of performance degradation. In this post, we explore text compression using abstractive summarization models.

Abstractive summarization is a technique where a model generates a new, shorter version of a text that captures its main ideas rather than extracting sentences from the original. Abstractive models can rephrase, combine, and restructure information to produce a more concise summary. Modern approaches typically use sequence-to-sequence transformer models that generate the summary token-by-token.

In the process of running these experiments, I discovered that I misunderstood how abstractive text summarization models actually operate. The original experimental outline was to perform a compression sweep, where text narratives would be summarized at increasingly aggressive compression levels, and the hope was to identify the point at which performance deteriorates to an unacceptable level. The plan was to compress the original text at 5%, 10%, 25%, 50%, 75%, 90%, and 95%, where “X% compression” means “keep (1 − X%) of the original token budget”, measured via the DistilBERT tokenizer. But I found that the summarized output was virtually identical regardless of the token budget. There would be a slight difference between the full narrative and 5% compression. The summary would be the same for 5% compression through 90% compression, then the 95% the summary would be identical to the 90% summary but with trailing words removed to satisfy the token budget.

After doing a little research, this behavior turns out not to be surprising. Sequence-to-sequence summarizers are trained to produce a fluent summary, not to fill an arbitrary token budget. Once the model decides the summary is complete, generation terminates before hitting any upper bound token constraint.

As a consequence, I changed the design of the experiment. Instead of the compression sweep, four conditions will be evaluated:

The full event narrative with no compression. This is the upper-bound reference point.
Truncation baseline. Keep only a fixed fraction of the original DistilBERT token budget by cutting off the narrative after the allotted number of tokens.
A medium summary condition, which limited the number of summarized output tokens to 96.
A short summary setting, which limited the number of summarized output tokens to 48.

After generating the summary, the same final DistilBERT token budget used for truncation will still be enforced (128 tokens). This matters because the summarizer and classifier use different tokenizers, so the post-generation truncation step keeps the comparison fair. The revised plan is:

Train a fine-tuned DistilBERT classifier on the full-text training set (2008–2022).
Create validation sets associated with each of the four conditions (2023-2025).
Evaluate the model on each validation set using a fixed classification threshold.
Compare performance across full-text, truncation and medium and short compression conditions.

Environment

It would be impractical to run this notebook without GPU acceleration. If you do not have access to a GPU, opt for a 15-20% subset of the original validation set prior to creating the short and medium compressed datasets later in the post. Alternatively, check out Google Colab, which provides a limited GPU runtime for their managed notebook instances free of charge.

To match the environment used in throughout this notebook, install the dependencies specified in the following requirements.txt. Note that transformers>=5.0 and torch>2.5 are required:

!python -m pip install -r https://gist.githubusercontent.com/jtrive84/e313afbf2def24687e3c3247aa836fe9/raw/f2493005d291920dc1fcfd54274b9dfa42004ebe/requirements.txt --quiet

If an NVIDIA GPU is available, install the CUDA-enabled PyTorch wheel instead of the default CPU build. This is only necessary if installing the dependencies on your own machine. GPU instances in Sagemaker, Colab and the NVIDIA RAPIDS Docker images already have CUDA-enabled PyTorch available:

!python -m pip install -r https://gist.githubusercontent.com/jtrive84/bc9dad844ab927357f0c35ea71f33963/raw/01d84194f9ad59f3b265e45d7ff39cefe1386673/requirements-gpu.txt --quiet

If the CUDA-enabled PyTorch wheel was installed, the torch version number should have a +cu*** suffix:


import torch

print(f"torch version: {torch.__version__}")
print(f"cuda version : {torch.version.cuda}")

torch version: 2.6.0+cu124
cuda version : 12.4

To keep this notebook as self-contained as possible, the next few cells walk through retrieving and reproducing the train–validation splits used in Parts I and II of the series.


import os
import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"


def get_latest_details_file(year):
    """
    Return the filename of the latest details file for the given year.
    Sample filename: StormEvents_details-ftp_v1.0_d2022_c20250721.csv.gz
    """
    html = requests.get(base_url).text
    soup = BeautifulSoup(html, "html.parser")
    candidates = [
        a["href"]
        for a in soup.find_all("a", href=True)
        if f"StormEvents_details-ftp_v1.0_d{year}" in a["href"]
    ]

    return sorted(candidates)[-1]


# Check if dataset exists locally prior to hitting NOAA servers.
if not os.path.exists("noaa-events-2008-2025.parquet"):
    
    print("Fetching latest filenames from NOAA servers...")
    # Get latest filenames for years 2008-2025.
    latest_filenames = [get_latest_details_file(year) for year in range(2008, 2026)]

    # Load each file into a DataFrame and concatenate.
    dfall = pd.concat(
        [pd.read_csv(f"{base_url + fname}", compression="gzip") for fname in latest_filenames],
        ignore_index=True,
    )
    dfall.to_parquet("noaa-events-2008-2025.parquet")
else:
    print("Loading dataset locally...")
    dfall = pd.read_parquet("noaa-events-2008-2025.parquet")

print(f"Total records loaded: {dfall.shape[0]:,}")
dfall.head()

Loading dataset locally...
Total records loaded: 1,161,550

	BEGIN_YEARMONTH	BEGIN_DAY	BEGIN_TIME	END_YEARMONTH	END_DAY	END_TIME	EPISODE_ID	EVENT_ID	STATE	STATE_FIPS	...	END_RANGE	END_AZIMUTH	END_LOCATION	BEGIN_LAT	BEGIN_LON	END_LAT	END_LON	EPISODE_NARRATIVE	EVENT_NARRATIVE	DATA_SOURCE
0	200802	22	1300	200802	22	2200	14216	79884	NEW HAMPSHIRE	33	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A noreaster moved up the coast southeast of Ca...	NaN	CSV
1	200804	1	352	200804	1	352	15549	88334	NEW HAMPSHIRE	33	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Strong southwest flow behind a warm front allo...	An amateur radio operator recorded a wind gust...	CSV
2	200803	1	0	200803	1	1320	14773	83820	NEW HAMPSHIRE	33	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Low pressure tracked from the Great Lakes acro...	NaN	CSV
3	200801	14	500	200801	14	1700	13559	75727	NEW HAMPSHIRE	33	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Low pressure moved up the Atlantic coast and s...	NaN	CSV
4	200812	19	1353	200812	21	200	25148	146588	NEW HAMPSHIRE	33	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	An intensifying coastal low spread heavy snow ...	Six to eight inches of snow fell across easter...	CSV

5 rows × 51 columns

Pre-processing steps:

Retain events from 2008 to present.
Keep only records with EVENT_TYPE = “Tornado”.
Drop records having TOR_F_SCALE “EFU”, “EF0”, “EF1”,“F0”, “F1”.
CLASS = 0 for EF2 events, CLASS = 1 for (EF3, EF4 and EF5) events.
Create a DatasetDict, with training set including events from 2008-2022 and test set events from 2023-2025.


import warnings

from datasets import Dataset, DatasetDict
import matplotlib as mpl
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import torch

np.set_printoptions(suppress=True, precision=5, linewidth=1000)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option("display.precision", 5)
warnings.filterwarnings("ignore")


# Filter for tornadoes from 2008 onward with significant damage ratings.
df = (
    dfall[
        (dfall.YEAR >= 2008) & 
        (dfall.EVENT_TYPE == "Tornado") & 
        (~dfall.TOR_F_SCALE.isin(["EFU", "EF0", "EF1" ,"F0", "F1"]))
    ]
    .dropna(subset=["EVENT_NARRATIVE"])
    .drop_duplicates(subset=["EVENT_NARRATIVE"])
    .reset_index(drop=True)
)

# Strip whitespace from EVENT_NARRATIVE.
df["EVENT_NARRATIVE"] = df["EVENT_NARRATIVE"].str.replace(r"\s+", " ", regex=True).str.strip()

# Create target class based on TOR_F_SCALE.
df["CLASS"] = np.where(df.TOR_F_SCALE.isin(["EF2"]), 0, 1)

keep_columns = [
    "EVENT_ID",
    "EVENT_NARRATIVE",
    "TOR_F_SCALE",
    "BEGIN_LAT",
    "BEGIN_LON",
    "CLASS"
]

# Create train-test splits. 
dftrain = df[df["YEAR"] <= 2022][keep_columns].reset_index(drop=True)
dfvalid = df[df["YEAR"] >  2022][keep_columns].reset_index(drop=True)

# Create Dataset objects.
ds_train = Dataset.from_pandas(dftrain)
ds_valid = Dataset.from_pandas(dfvalid)
ds = DatasetDict({"train": ds_train, "valid": ds_valid})

print(f"Train size: {len(dftrain):,}")
print(f"Valid size: {len(dfvalid):,}\n")
print(f"Training sample:\n{ds_train[0]}")

c:\Users\jtriv\miniforge3\envs\gnlp\Lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Train size: 2,621
Valid size: 642

Training sample:
{'EVENT_ID': 105611, 'EVENT_NARRATIVE': 'A tornado damaged numerous trees, including large trees uprooted, blew windows out of a home, destroyed a metal shed, blew two windows and part of a wall out of a metal building, damaged at least three grain bins, destroyed or damaged numerous outbuildings and small sheds, blew down or snapped off at least 15 power poles, bent a metal light pole, tipped one wagon and blew the top off another, blew down a barb wire fence and pushed fence posts almost to the ground, destroyed a hog barn, and flattened corn stubble, before crossing the county line into Lyon County. Contents inside several damaged or destroyed buildings and sheds were also damaged, especially on one farm where damaged buildings housed a farm and trucking business.', 'TOR_F_SCALE': 'EF2', 'BEGIN_LAT': 43.1421, 'BEGIN_LON': -96.3, 'CLASS': 0}

Fine-tune DistilBERT on Full Text (baseline model)

The compression experiment needs a single reference classifier. We train the model once on full text, then evaluate it on compressed holdout narratives. This largely reproduces the work from Part II.


import random

import numpy as np
import pandas as pd
import torch

from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    average_precision_score
)

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Set random seed for reproducibility.
SEED = 516
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"Device: {device}")


def tokenize_batch(batch):
    return tokenizer(
        batch["EVENT_NARRATIVE"],
        truncation=True,
        max_length=512
    )

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=1, keepdims=True)


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = softmax(logits)[:, 1]
    preds = (probs >= 0.5).astype(int)
    return {
        "accuracy": accuracy_score(labels, preds),
        "roc_auc": roc_auc_score(labels, preds),
        "precision": average_precision_score(labels, probs),
    }

Device: cuda


model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

ds_tokenized = ds.map(tokenize_batch, batched=True)
ds_tokenized = ds_tokenized.remove_columns(["EVENT_NARRATIVE", "TOR_F_SCALE", "EVENT_ID", "BEGIN_LAT", "BEGIN_LON"])
ds_tokenized = ds_tokenized.rename_column("CLASS", "labels")
ds_tokenized.set_format("torch")
collator = DataCollatorWithPadding(tokenizer=tokenizer)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)

# Training arguments. 
args = TrainingArguments(
    output_dir="distilbert-noaa-ef-part3",
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=50,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=False,
    metric_for_best_model="roc_auc",
    greater_is_better=True,
    seed=SEED
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds_tokenized["train"],
    eval_dataset=ds_tokenized["valid"],
    data_collator=collator,
    compute_metrics=compute_metrics
)

trainer.train()

Map: 100%|██████████| 2621/2621 [00:01<00:00, 2312.68 examples/s]
Map: 100%|██████████| 642/642 [00:00<00:00, 1815.11 examples/s]
Loading weights: 100%|██████████| 100/100 [00:00<00:00, 218.74it/s, Materializing param=distilbert.transformer.layer.5.sa_layer_norm.weight]   
DistilBertForSequenceClassification LOAD REPORT from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_projector.bias    | UNEXPECTED | 
vocab_transform.bias    | UNEXPECTED | 
vocab_layer_norm.weight | UNEXPECTED | 
vocab_transform.weight  | UNEXPECTED | 
vocab_layer_norm.bias   | UNEXPECTED | 
pre_classifier.weight   | MISSING    | 
classifier.weight       | MISSING    | 
pre_classifier.bias     | MISSING    | 
classifier.bias         | MISSING    | 

Notes:
- UNEXPECTED    :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING   :those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.

[3280/3280 1:22:38, Epoch 10/10]

Epoch	Training Loss	Validation Loss	Accuracy	Roc Auc	Precision
1	0.450666	0.459206	0.788162	0.621244	0.509355
2	0.230294	0.385231	0.900312	0.831570	0.836489
3	0.188894	0.325148	0.914330	0.903474	0.875828
4	0.160771	0.340444	0.925234	0.859782	0.900350
5	0.073583	0.407805	0.920561	0.880924	0.861684
6	0.038271	0.458676	0.919003	0.908915	0.821543
7	0.036121	0.429346	0.922118	0.884349	0.879728
8	0.032758	0.490672	0.923676	0.914356	0.822769
9	0.001190	0.458039	0.923676	0.907106	0.866877
10	0.012224	0.439459	0.928349	0.905297	0.878418

Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  2.00it/s]
Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  2.08it/s]
Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  1.68it/s]
Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  1.22it/s]
Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  3.29it/s]
Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  4.18it/s]
Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  4.09it/s]
Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  3.89it/s]
Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  4.04it/s]
Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  4.10it/s]

TrainOutput(global_step=3280, training_loss=0.14509652031631004, metrics={'train_runtime': 4959.9542, 'train_samples_per_second': 5.284, 'train_steps_per_second': 0.661, 'total_flos': 2612640273795948.0, 'train_loss': 0.14509652031631004, 'epoch': 10.0})

Compressing Tornado Event Narratives

Abstractive summarization is the process of using a summarization model to compress a narrative into fewer tokens. There are many summarization models to choose from, but we will focus on sshleifer/distilbart-cnn-12-6, which is a distilled version of the BART sequence-to-sequence transformer designed for text summarization. It was trained on the CNN/DailyMail news summarization dataset learning to generate summaries of long news articles. Because it is distilled (compressed) from the full BART model, it is smaller and faster at inference while retaining much of the original model’s summarization quality.

I also evaluated facebook/bart-large-cnn, but the generated summaries were very similar. Exploring alternative models is a topic that may warrant further examination.


from transformers import AutoModelForSeq2SeqLM
from sklearn.metrics import precision_recall_fscore_support


comp_model_id = "sshleifer/distilbart-cnn-12-6"
comp_tokenizer = AutoTokenizer.from_pretrained(comp_model_id)
summarizer = AutoModelForSeq2SeqLM.from_pretrained(comp_model_id).to(device)

# Settings for short and medium compressed narratives. 
summary_presets = {
    "short": {
        "max_new_tokens": 48,
        "min_new_tokens": 20,
        "num_beams": 4,
    },
    "medium": {
        "max_new_tokens": 96,
        "min_new_tokens": 40,
        "num_beams": 4,
    },
}


def token_len_distilbert(text):
    """
    Length of tokenized text.
    """
    return len(tokenizer(text, truncation=True, max_length=512)["input_ids"])


def summarize_text_fixed(text, summary_cfg):
    """
    Generate a summary using a fixed summarization configuration.
    """
    inputs = comp_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=1024
    ).to(device)

    with torch.no_grad():
        summary_ids = summarizer.generate(
            **inputs,
            max_new_tokens=summary_cfg["max_new_tokens"],
            min_new_tokens=summary_cfg["min_new_tokens"],
            num_beams=summary_cfg["num_beams"],
            do_sample=False,
            early_stopping=True
        )

    return comp_tokenizer.decode(summary_ids[0], skip_special_tokens=True)


def summarize_then_fit_budget(text, keep_tokens, summary_cfg):
    """
    Summarize first, then enforce the final DistilBERT token budget so the
    downstream classifier sees the same maximum length as the truncation 
    baseline.
    """
    summary_text = summarize_text_fixed(text, summary_cfg)
    return truncate_to_token_budget(summary_text, keep_tokens)


def truncate_to_token_budget(text, keep_tokens):
    """
    Truncate original text to retain at most keep_tokens.
    """
    enc = tokenizer(text, truncation=True, max_length=512, add_special_tokens=True)
    ids = enc["input_ids"]

    # keep_tokens includes special tokens in this simple implementation.
    # Enforces a minimum length of 8 tokens.
    nbr_tokens = max(8, min(keep_tokens, len(ids)))
    ids_keep = ids[:nbr_tokens]

    # decode back to text.
    return nbr_tokens, tokenizer.decode(ids_keep, skip_special_tokens=True)

summarize_text_fixed turns a the original event narrative into a generated summary using the pre-specified model in either short or medium mode.
truncate_to_token_budget ensures the compressed narrative is constrained by the same maximum length as the truncation baseline.
summarize_then_fit_budget wraps summarize_text_fixed and truncate_to_token_budget.

The data flow can be envisioned as follows:

original narrative text
      ↓
summarizer tokenizer
      ↓
summarizer token IDs
      ↓
DistilBART generate summary
      ↓
summary token IDs
      ↓
constrain number of tokens
      ↓
decode back to summarized text

The next cell provides an example using summarizer. We define a helper function, generate_compressed_narrative, to generate compressed text. generate_compressed_narrative is only used to show the summarizer in action, and isn’t used beyond the next cell.


from IPython.display import display, Markdown


def generate_compressed_narrative(txt, max_new_tokens, min_new_tokens):
    """
    Generate a compressed version of the input text using the summarization model.
    """
    inputs = comp_tokenizer(
        txt,
        return_tensors="pt",
        truncation=True,
        max_length=1024
    ).to(device)

    with torch.no_grad():
        summary_ids = summarizer.generate(
            **inputs,
            min_new_tokens=min_new_tokens,
            max_new_tokens=max_new_tokens,
            num_beams=4,
            do_sample=False,
            early_stopping=True
        )

    return comp_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Select a random event narrative.
txt_full = ds_train["EVENT_NARRATIVE"][10] # 655
nbr_tokens_orig = len(comp_tokenizer(txt_full, add_special_tokens=True)["input_ids"])

txt_medium = generate_compressed_narrative(
    txt_full, 
    max_new_tokens=summary_presets["medium"]["max_new_tokens"],
    min_new_tokens=summary_presets["medium"]["max_new_tokens"]
)
nbr_tokens_medium = len(comp_tokenizer(txt_medium, add_special_tokens=True)["input_ids"])

txt_short = generate_compressed_narrative(
    txt_full, 
    max_new_tokens=summary_presets["short"]["max_new_tokens"],
    min_new_tokens=summary_presets["short"]["max_new_tokens"]
)
nbr_tokens_short = len(comp_tokenizer(txt_short, add_special_tokens=True)["input_ids"])


display(Markdown(f"**Original event narrative ({nbr_tokens_orig} tokens):**"))
display(Markdown(txt_full))
display(Markdown(f"**Medium summary ({nbr_tokens_medium} tokens):**"))
display(Markdown(txt_medium))
display(Markdown(f"**Short summary ({nbr_tokens_short} tokens):**"))
display(Markdown(txt_short))

Original event narrative (316 tokens):

Most of the tornado damage was north of interstate 30 with some structures showing EF2 damage. In particular, the cinderblock and brick lawnmower business just north of Hwy 82 was completely destroyed with roofing debris and lawnmower parts thrown to the west and north of the building location. A brick home several hundred yards from the lawnmower business sustained significant damage to its roof and exterior walls. A metal shop building built with large metal I-beams was completely destroyed. I-beams were twisted and thrown in a northerly and westerly direction up to 200 yards from the building location with concrete still attached. The trees between the large metal building and the interstate were uprooted or snapped in a convergent pattern…indicative of tornadic winds. In total…12 structures were damaged or destroyed between Hwy 82 and the interstate and numerous trees were downed. Three tractor trailers were flipped on interstate 30 which resulted in the interstate being shut down and there was one injury. Further south of Hwy 82 on the Lonestar Army Ammunition Depot, numerous trees were snapped or uprooted and damage to parts of the Depot were reported…although it was not surveyed. North of interstate 30 along the service road…an outbuilding sales business lost several buildings and had many others damaged. Along Farm to Market 2253, numerous trees were snapped and uprooted on either side of the road and several sheds and barns were damaged or destroyed. A greenhouse was severely damaged near the end of the track. Some homes were also damaged from fallen trees.

Medium summary (96 tokens):

Most of the tornado damage was north of interstate 30 with some structures showing EF2 damage . In total…12 structures were damaged or destroyed between Hwy 82 and the interstate and numerous trees were downed . A metal shop building built with large metal I-beams was completely destroyed . Three tractor trailers were flipped on interstate 30 which resulted in the interstate being shut down and there was one injury . A greenhouse was severely damaged near the end of the track . Further south of H

Short summary (48 tokens):

Most of the tornado damage was north of interstate 30 with some structures showing EF2 damage . A cinderblock and brick lawnmower business just north of Hwy 82 was completely destroyed . A metal shop building built with large metal

build_validation_set constructs a validation dataset in order to validate our classification model under different input length strategies. Starting from the original event narratives and class labels, build_validation_setcan return either the full text, a truncated version limited to a specified token budget, or a summarized version that is subsequently constrained to that same budget. For each record, it tracks the original DistilBERT token length and the number of tokens allowed for the classifier, enabling later analysis of how compression affects model performance. The output is converted to a Hugging Face Dataset containing the processed narratives, labels, token statistics, and a method indicator identifying whether the input text was left unchanged, truncated, or summarized.



def build_validation_set(
    ds_valid, method="truncate", keep_tokens=512, summary_cfg=None, method_label=None
):
    """
    Build full, compressed or truncated datasets. 

    Parameters
    ----------
    ds_valid : Dataset
        Validation dataset containing EVENT_NARRATIVE and CLASS columns.

    keep_tokens : int
        Maximum number of DistilBERT tokens allowed for classifier input.

    method : str
        Either "full", "truncate" or "summarize".

    summary_cfg : dict or None
        Required when method="summarize". Controls summarization generation.

    """
    texts = ds_valid["EVENT_NARRATIVE"]
    ys = ds_valid["CLASS"]
    orig_nbr_tokens = [token_len_distilbert(txt) for txt in texts]

    if method == "full":

        df = pd.DataFrame({
            "EVENT_NARRATIVE": texts,
            "labels": ys,
            "orig_tokens": orig_nbr_tokens,
            "kept_tokens": orig_nbr_tokens,
            "method": ["full_text"] * len(texts)
        })

    else:
        out_texts, out_orig_tokens, out_nbr_tokens = [], [], []

        for txt in texts:
            orig_nbr_tokens = token_len_distilbert(txt)

            if method == "truncate":
                nbr_tokens, txt_c = truncate_to_token_budget(txt, keep_tokens)

            elif method == "summarize":
                nbr_tokens, txt_c = summarize_then_fit_budget(txt, keep_tokens, summary_cfg)

            else:
                raise ValueError("method must be 'full', 'truncate' or 'summarize'")

            out_texts.append(txt_c)
            out_orig_tokens.append(orig_nbr_tokens)
            out_nbr_tokens.append(nbr_tokens)

        df = pd.DataFrame({
            "EVENT_NARRATIVE": out_texts,
            "labels": ys,
            "orig_tokens": out_orig_tokens,
            "kept_tokens": out_nbr_tokens,
            "method": [(method_label or method)] * len(out_texts)
        })

    return Dataset.from_pandas(df)


def tokenize_for_eval(ds):
    ds2 = ds.map(tokenize_batch, batched=True)
    keep_cols = ["input_ids", "attention_mask", "labels", "orig_tokens", "keep_tokens", "method"]
    remove_cols = [c for c in ds2.column_names if c not in keep_cols]
    ds2 = ds2.remove_columns(remove_cols)
    ds2.set_format("torch")
    return ds2


def compute_metrics_from_predictions(yhat, yact):

    precision, recall, f1, _ = precision_recall_fscore_support(
        yhat,
        yact,
        average="weighted",
        zero_division=0
    )

    return {
        "accuracy": accuracy_score(yhat, yact),
        "precision_weighted": precision,
        "recall_weighted": recall,
        "f1_weighted": f1,
    }


def evaluate_dataset(ds):
    ds_tokenized = tokenize_for_eval(ds)
    pred_output = trainer.predict(ds_tokenized)
    yhat = np.argmax(pred_output.predictions, axis=1)
    yact = pred_output.label_ids
    metrics = compute_metrics_from_predictions(yhat, yact)
    avg_orig_tokens = float(np.mean(ds["orig_tokens"]))
    avg_keep_tokens = float(np.mean(ds["kept_tokens"]))
    
    return {
        **metrics,
        "avg_orig_tokens": avg_orig_tokens,
        "avg_nbr_tokens": avg_keep_tokens
    }

In the next cell each of the four validation sets are created and keep_tokens is set to 128. The event narratives are typically 200-300 DistilBERT tokens on average. A value around 128 compresses the average narrative by roughly 50% while still allowing multiple sentences, and is a common sequence length for BERT-style classifiers in practice. Note that the 128-token limit is treated as a maximum budget, not a target length. If a summarized narrative is already shorter than 128 tokens, it is passed to the classifier unchanged. Truncation is only applied when the compressed text exceeds this limit.


keep_tokens = 128

# Original event narratives. 
ds_valid_full = build_validation_set(
    ds_valid, 
    method="full"
)

# Event narratives truncated to 128 tokens. 
ds_valid_trunc = build_validation_set(
    ds_valid,
    keep_tokens=keep_tokens,
    method="truncate"
)

# Moderate compression.
ds_valid_sum_medium = build_validation_set(
    ds_valid,
    keep_tokens=keep_tokens,
    method="summarize",
    summary_cfg=summary_presets["medium"],
    method_label="summarize_medium"
)

# Aggressive compression.
ds_valid_sum_short = build_validation_set(
    ds_valid,
    keep_tokens=keep_tokens,
    method="summarize",
    summary_cfg=summary_presets["short"],
    method_label="summarize_short"
)

With the datasets prepared, we evaluate the classifier across the four validation sets and capture the resulting performance metrics:


from time import perf_counter

results = []

for name, ds_curr in [
    ("full_text", ds_valid_full),
    ("truncate", ds_valid_trunc),
    ("summarize_medium", ds_valid_sum_medium),
    ("summarize_short", ds_valid_sum_short)
]:
    t_i = perf_counter()
    metrics = evaluate_dataset(ds_curr)
    t_total = perf_counter() - t_i
    results.append({
        "condition": name,
        "runtime": t_total,
        **metrics
    })

dfsumm = pd.DataFrame(results)

dfsumm.head(5)

Map: 100%|██████████| 642/642 [00:00<00:00, 3081.44 examples/s]

Map: 100%|██████████| 642/642 [00:00<00:00, 4806.41 examples/s]

Map: 100%|██████████| 642/642 [00:00<00:00, 5358.20 examples/s]

Map: 100%|██████████| 642/642 [00:00<00:00, 2741.43 examples/s]

	condition	runtime	accuracy	precision_weighted	recall_weighted	f1_weighted	avg_orig_tokens	avg_nbr_tokens
0	full_text	12.38370	0.92835	0.92756	0.92835	0.92785	223.26636	223.26636
1	truncate	3.44808	0.87072	0.88704	0.87072	0.87644	223.26636	115.88785
2	summarize_medium	4.08387	0.85826	0.87870	0.85826	0.86549	223.26636	62.97508
3	summarize_short	2.94239	0.84579	0.88494	0.84579	0.85898	223.26636	46.38006

Our results show a clear trade-off between predictive performance and input length. Using full narratives resulted in the best classification performance (weighted recall of ~93%), but it is also the slowest condition because the classifier is required to process the entire text sequence for each sample. Surprisingly, truncation provides the best compromise among the three compression methods: although recall drops a bit, it retains roughly half the original tokens while cutting runtime by ~75%. The trade off is that truncation blindly discards the latter portion of the narrative, which may occasionally remove useful information.

The summarization approaches compress the narratives more aggressively, but the additional compression comes with a further decline in predictive performance. The results suggest that most of the useful signal for classification appears early in the narrative.

Whether the degradation in performance is acceptable is application dependent. For full text, the model’s inference rate was about 52 samples per second. For truncation, it was closer to 187 samples per second. The full-text approach would be more desirable in research settings where accuracy is the primary concern. If analysts are retrospectively studying severe weather events to improve risk models or validate damage classifications, runtime is far less important than ensuring the model captures every useful detail in the narrative. In a high-volume workflow, the ability to process records quickly and cheaply can matter more than extracting every last bit of predictive signal from the narrative.

Finally, an avenue for future exploration would be to carry out a truncation sweep, conceptually sticking to the original idea motivating this post but focusing on different truncation thresholds instead of constrained abstractive summarization.