Evaluating LLMs for Textual content Summarization: An Introduction

April 7, 2025

1

Massive language fashions (LLMs) have proven great potential throughout numerous purposes. On the SEI, we research the software of LLMs to a variety of DoD related use instances. One software we take into account is intelligence report summarization, the place LLMs may considerably scale back the analyst cognitive load and, probably, the extent of human error. Nevertheless, deploying LLMs with out human supervision and analysis may result in vital errors together with, within the worst case, the potential lack of life. On this publish, we define the basics of LLM analysis for textual content summarization in high-stakes purposes similar to intelligence report summarization. We first focus on the challenges of LLM analysis, give an summary of the present cutting-edge, and eventually element how we’re filling the recognized gaps on the SEI.

Why is LLM Analysis Necessary?

LLMs are a nascent expertise, and, subsequently, there are gaps in our understanding of how they may carry out in numerous settings. Most excessive performing LLMs have been educated on an enormous quantity of knowledge from a huge array of web sources, which may very well be unfiltered and non-vetted. Due to this fact, it’s unclear how typically we will count on LLM outputs to be correct, reliable, constant, and even protected. A widely known challenge with LLMs is hallucinations, which suggests the potential to provide incorrect and non-sensical data. It is a consequence of the truth that LLMs are basically statistical predictors. Thus, to securely undertake LLMs for high-stakes purposes and make sure that the outputs of LLMs nicely characterize factual knowledge, analysis is essential. On the SEI, we’ve been researching this space and revealed a number of stories on the topic thus far, together with Concerns for Evaluating Massive Language Fashions for Cybersecurity Duties and Assessing Alternatives for LLMs in Software program Engineering and Acquisition.

Challenges in LLM Analysis Practices

Whereas LLM analysis is a crucial drawback, there are a number of challenges, particularly within the context of textual content summarization. First, there are restricted knowledge and benchmarks, with floor fact (reference/human generated) summaries on the size wanted to check LLMs: XSUM and Day by day Mail/CNN are two generally used datasets that embody article summaries generated by people. It’s tough to determine if an LLM has not already been educated on the out there check knowledge, which creates a possible confound. If the LLM has already been educated on the out there check knowledge, the outcomes might not generalize nicely to unseen knowledge. Second, even when such check knowledge and benchmarks can be found, there is no such thing as a assure that the outcomes can be relevant to our particular use case. For instance, outcomes on a dataset with summarization of analysis papers might not translate nicely to an software within the space of protection or nationwide safety the place the language and magnificence will be totally different. Third, LLMs can output totally different summaries based mostly on totally different prompts, and testing below totally different prompting methods could also be essential to see which prompts give the perfect outcomes. Lastly, selecting which metrics to make use of for analysis is a serious query, as a result of the metrics must be simply computable whereas nonetheless effectively capturing the specified excessive stage contextual which means.

LLM Analysis: Present Strategies

As LLMs have turn into outstanding, a lot work has gone into totally different LLM analysis methodologies, as defined in articles from Hugging Face, Assured AI, IBM, and Microsoft. On this publish, we particularly give attention to analysis of LLM-based textual content summarization.

We will construct on this work relatively than growing LLM analysis methodologies from scratch. Moreover, many strategies will be borrowed and repurposed from current analysis strategies for textual content summarization strategies that aren’t LLM-based. Nevertheless, attributable to distinctive challenges posed by LLMs—similar to their inexactness and propensity for hallucinations—sure features of analysis require heightened scrutiny. Measuring the efficiency of an LLM for this job isn’t so simple as figuring out whether or not a abstract is “good” or “unhealthy.” As an alternative, we should reply a set of questions focusing on totally different features of the abstract’s high quality, similar to:

Is the abstract factually appropriate?
Does the abstract cowl the principal factors?
Does the abstract accurately omit incidental or secondary factors?
Does each sentence of the abstract add worth?
Does the abstract keep away from redundancy and contradictions?
Is the abstract well-structured and arranged?
Is the abstract accurately focused to its meant viewers?

The questions above and others like them reveal that evaluating LLMs requires the examination of a number of associated dimensions of the abstract’s high quality. This complexity is what motivates the SEI and the scientific group to mature current and pursue new strategies for abstract analysis. Within the subsequent part, we focus on key strategies for evaluating LLM-generated summaries with the aim of measuring a number of of their dimensions. On this publish we divide these strategies into three classes of analysis: (1) human evaluation, (2) automated benchmarks and metrics, and (3) AI red-teaming.

Human Evaluation of LLM-Generated Summaries

One generally adopted method is human analysis, the place individuals manually assess the standard, truthfulness, and relevance of LLM-generated outputs. Whereas this may be efficient, it comes with vital challenges:

Scale: Human analysis is laborious, probably requiring vital effort and time from a number of evaluators. Moreover, organizing an adequately massive group of evaluators with related subject material experience is usually a tough and costly endeavor. Figuring out what number of evaluators are wanted and methods to recruit them are different duties that may be tough to perform.
Bias: Human evaluations could also be biased and subjective based mostly on their life experiences and preferences. Historically, a number of human inputs are mixed to beat such biases. The necessity to analyze and mitigate bias throughout a number of evaluators provides one other layer of complexity to the method, making it tougher to mixture their assessments right into a single analysis metric.

Regardless of the challenges of human evaluation, it’s typically thought of the gold customary. Different benchmarks are sometimes aligned to human efficiency to find out how automated, less expensive strategies evaluate to human judgment.

Automated Analysis

A few of the challenges outlined above will be addressed utilizing automated evaluations. Two key elements frequent with automated evaluations are benchmarks and metrics. Benchmarks are constant units of evaluations that usually include standardized check datasets. LLM benchmarks leverage curated datasets to provide a set of predefined metrics that measure how nicely the algorithm performs on these check datasets. Metrics are scores that measure some facet of efficiency.

In Desk 1 beneath, we take a look at a few of the in style metrics used for textual content summarization. Evaluating with a single metric has but to be confirmed efficient, so present methods give attention to utilizing a set of metrics. There are lots of totally different metrics to select from, however for the aim of scoping down the house of attainable metrics, we take a look at the next high-level features: accuracy, faithfulness, compression, extractiveness, and effectivity. We had been impressed to make use of these features by inspecting HELM, a well-liked framework for evaluating LLMs. Under are what these features imply within the context of LLM analysis:

Accuracy typically measures how carefully the output resembles the anticipated reply. That is usually measured as a mean over the check cases.
Faithfulness measures the consistency of the output abstract with the enter article. Faithfulness metrics to some extent seize any hallucinations output by the LLM.
Compression measures how a lot compression has been achieved through summarization.
Extractiveness measures how a lot of the abstract is straight taken from the article as is. Whereas rewording the article within the abstract is typically crucial to attain compression, a much less extractive abstract might yield extra inconsistencies in comparison with the unique article. Therefore, this can be a metric one would possibly monitor in textual content summarization purposes.
Effectivity measures what number of assets are required to coach a mannequin or to make use of it for inference. This may very well be measured utilizing totally different metrics similar to processing time required, power consumption, and so on.

Whereas normal benchmarks are required when evaluating a number of LLMs throughout quite a lot of duties, when evaluating for a particular software, we might have to choose particular person metrics and tailor them for every use case.

Side

Metric

Kind

Clarification

Accuracy

ROUGE

Computable rating

Measures textual content overlap

BLEU

Computable rating

Measures textual content overlap and
computes precision

METEOR

Computable rating

Measures textual content overlap
together with synonyms, and so on.

BERTScore

Computable rating

Measures cosine similarity
between embeddings of abstract and article

Faithfulness

SummaC

Computable rating

Computes alignment between
particular person sentences of abstract and article

QAFactEval

Computable rating

Verifies consistency of
abstract and article based mostly on query answering

Compression

Compresion ratio

Computable rating

Measures ratio of quantity
of tokens (phrases) in abstract and article

Extractiveness

Protection

Computable rating

Measures the extent to
which abstract textual content is from article

Density

Computable rating

Quantifies how nicely the
phrase sequence of a abstract will be described as a sequence of extractions

Effectivity

Computation time

Bodily measure

–

Computation power

Bodily measure

–

Word that AI could also be used for metric computation at totally different capacities. At one excessive, an LLM might assign a single quantity as a rating for consistency of an article in comparison with its abstract. This state of affairs is taken into account a black-box approach, as customers of the approach are usually not capable of straight see or measure the logic used to carry out the analysis. This sort of method has led to debates about how one can belief one LLM to evaluate one other LLM. It’s attainable to make use of AI strategies in a extra clear, gray-box method, the place the interior workings behind the analysis mechanisms are higher understood. BERTScore, for instance, calculates cosine similarity between phrase embeddings. In both case, human will nonetheless must belief the AI’s skill to precisely consider summaries regardless of missing full transparency into the AI’s decision-making course of. Utilizing AI applied sciences to carry out large-scale evaluations and comparability between totally different metrics will in the end nonetheless require, in some half, human judgement and belief.

To this point, the metrics we’ve mentioned make sure that the mannequin (in our case an LLM) does what we count on it to, below ideally suited circumstances. Subsequent, we briefly contact upon AI red-teaming geared toward stress-testing LLMs below adversarial settings for security, safety, and trustworthiness.

AI Purple-Teaming

AI red-teaming is a structured testing effort to search out flaws and vulnerabilities in an AI system, typically in a managed atmosphere and in collaboration with AI builders. On this context, it entails testing the AI system—an LLM for summarization—with adversarial prompts and inputs. That is finished to uncover any dangerous outputs from an AI system that might result in potential misuse of the system. Within the case of textual content summarization for intelligence stories, we might think about that the LLM could also be deployed domestically and utilized by trusted entities. Nevertheless, it’s attainable that unknowingly to the consumer, a immediate or enter may set off an unsafe response attributable to intentional or unintentional knowledge poisoning, for instance. AI red-teaming can be utilized to uncover such instances.

LLM Analysis: Figuring out Gaps and Our Future Instructions

Although work is being finished to mature LLM analysis strategies, there are nonetheless main gaps on this house that stop the right validation of an LLM’s skill to carry out high-stakes duties similar to intelligence report summarization. As a part of our work on the SEI we’ve recognized a key set of those gaps and are actively working to leverage current strategies or create new ones that bridge these gaps for LLM integration.

We got down to consider totally different dimensions of LLM summarization efficiency. As seen from Desk 1, current metrics seize a few of these through the features of accuracy, faithfulness, compression, extractiveness and effectivity. Nevertheless, some open questions stay. As an illustration, how will we determine lacking key factors from a abstract? Does a abstract accurately omit incidental and secondary factors? Some strategies to attain these have been proposed, however not absolutely examined and verified. One option to reply these questions can be to extract key factors and evaluate key factors from summaries output by totally different LLMs. We’re exploring the small print of such strategies additional in our work.

As well as, most of the accuracy metrics require a reference abstract, which can not all the time be out there. In our present work, we’re exploring methods to compute efficient metrics within the absence of a reference abstract or solely getting access to small quantities of human generated suggestions. Our analysis will give attention to growing novel metrics that may function utilizing restricted variety of reference summaries or no reference summaries in any respect. Lastly, we are going to give attention to experimenting with report summarization utilizing totally different prompting methods and examine the set of metrics required to successfully consider whether or not a human analyst would deem the LLM-generated abstract as helpful, protected, and in keeping with the unique article.

With this analysis, our aim is to have the ability to confidently report when, the place, and the way LLMs may very well be used for high-stakes purposes like intelligence report summarization, and if there are limitations of present LLMs which may impede their adoption.

Supply hyperlink

Previous article7 Frequent Errors to Keep away from When Outsourcing Software program Growth

Next articleCloudflare broadcasts distant MCP server to scale back limitations to creating AI brokers

Evaluating LLMs for Textual content Summarization: An Introduction

Why is LLM Analysis Necessary?

Challenges in LLM Analysis Practices

LLM Analysis: Present Strategies

Human Evaluation of LLM-Generated Summaries

Automated Analysis

AI Purple-Teaming

LLM Analysis: Figuring out Gaps and Our Future Instructions

Turing Award Particular: A Dialog with John Hennessy

Spin the Retrospective Slot Machine (aka CRAP)

Sourcegraph and the Frontier of AI in Software program Engineering with Beyang Liu

LEAVE A REPLY Cancel reply

Most Popular

How tech giants like Netflix constructed resilient programs with chaos engineering

10 Finest Google Sheets Add-ons for Activity & Workflow Automation

Cloudflare broadcasts distant MCP server to scale back limitations to creating AI brokers

7 Frequent Errors to Keep away from When Outsourcing Software program Growth

Recent Comments

ABOUT US

POPULAR POSTS

How tech giants like Netflix constructed resilient programs with chaos engineering

10 Finest Google Sheets Add-ons for Activity & Workflow Automation

Cloudflare broadcasts distant MCP server to scale back limitations to creating AI brokers

POPULAR CATEGORY