Using LLMs for Text Summarization Tasks

Mayukh Rath, Machine Learning Engineer @ Infocusp
14 min read · 20 March 2025


Text summarization is a trendy topic nowadays, and alongside RAG it is one of the most commonly integrated use cases across modern AI platforms. With the advancement of LLMs, we can now design highly scalable text summarization applications and achieve good performance without training our own models. In this blog, we cover key aspects of the text summarization task, including available open-source datasets, training methods, LLM-based design, and auto-evaluation pipelines.

What is Text Summarization

Text summarization is an NLP task in which a large document is represented concisely, keeping only the important and relevant information from the text. The ideal summary should highlight the topic of the original text, preserve its crucial information, and still be much shorter than the original. There is always a tradeoff between information coverage and the length of the summary.

Open Source Datasets

There are several open-source datasets available that we can use to train our in-house models or to evaluate our prompts/applications with various LLMs.
Some well-known summarization datasets that we can explore to test our applications are listed below, followed by a short loading sketch.

  • CNN-DM: News articles paired with multi-sentence summaries. Commonly used for extractive and abstractive summarization tasks.
  • XSum: Contains BBC articles with single-sentence summaries that capture the core idea. Great for testing extreme summarization.
  • PubMed: Scientific articles from the biomedical domain with structured abstracts. Useful for summarizing long, technical content.
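
As a quick illustration, here is a minimal sketch of loading these datasets with the Hugging Face datasets library. The dataset IDs and config names are the ones commonly used on the Hub and may need adjusting (newer datasets versions may require the namespaced IDs or trust_remote_code=True for script-based datasets).

```
# pip install datasets
from datasets import load_dataset

# CNN/DailyMail: news articles paired with multi-sentence "highlights"
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test")
print(cnn_dm[0]["article"][:200])
print(cnn_dm[0]["highlights"])

# XSum: BBC articles with single-sentence summaries ("document" / "summary" fields)
xsum = load_dataset("xsum", split="test")

# PubMed: long scientific articles with abstracts ("article" / "abstract" fields)
pubmed = load_dataset("scientific_papers", "pubmed", split="test")
```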

Different Types of Summarization and Their Implementation

There are several variations of summarization tasks, each suited to different use cases and implementation methods.

Query Focused Summarization

Given a context and a query, the goal is to summarize the text so that the summary answers the query using the relevant information present in the text.
We generally use LLMs for this use case, where prompt engineering is essential: prompts need to be tailored to the specific task to achieve meaningful results.
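
As a rough illustration, a query-focused prompt can be as simple as the template below; the helper name and wording are our own assumptions rather than a prescribed format.

```
def build_query_focused_prompt(context: str, query: str) -> str:
    """Build a simple query-focused summarization prompt (illustrative only)."""
    return (
        "Summarize the following text so that the summary answers the question.\n"
        "Use only information that is present in the text.\n\n"
        f"Question: {query}\n\n"
        f"Text:\n{context}\n\n"
        "Summary:"
    )

# The resulting string can be sent to any chat/completions API of your choice.
prompt = build_query_focused_prompt("<long document>", "What were the key findings?")
```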

Extractive Summarization

This is a technique in which the most important sentences are selected from the given text to form the summary. First, sentences are ranked by their importance to the text as a whole; then the top-ranked sentences are taken in their original order to form the summary.

Pretrained models

We can use pretrained BERT models to get started with extractive summarization. Such tooling also supports integration with sentence-transformers models.
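
One such option is the bert-extractive-summarizer package; below is a minimal sketch assuming that library (and, optionally, sentence-transformers) is installed. The class names follow its documented API, but double-check against the version you install.

```
# pip install bert-extractive-summarizer
from summarizer import Summarizer

text = "..."  # the document to summarize

# Rank sentences with a pretrained BERT model and keep the top-ranked ones
model = Summarizer()
extractive_summary = model(text, num_sentences=3)

# Optional: use a sentence-transformers backbone instead
# from summarizer.sbert import SBertSummarizer
# sbert_model = SBertSummarizer("paraphrase-MiniLM-L6-v2")
# extractive_summary = sbert_model(text, num_sentences=3)
```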

Train Your Own Model

We can also train encoder-based models from scratch; TransformerSum is a good starting point for this.

Abstractive Summarization

This method goes beyond sentence extraction. It involves understanding the context and rewriting the content using key information from the input—generating new, concise sentences that weren’t in the original text. We'll walk through a few ways you can do abstractive summarization.

Pretrained models

There are plenty of pretrained models on Hugging Face that we can explore for summarization tasks. A good place to start is the SummLlama3.2-3B model, which performs well across a variety of text types. Another promising direction is this recent research paper on summarization by Taewon Yun et al., though the authors haven't released the dataset or training code yet.
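
SummLlama3.2-3B is a causal (decoder-only) model, so it is usually driven with a prompt through a text-generation pipeline. As a quick sanity check of the general Hugging Face workflow, here is a minimal sketch using the widely available facebook/bart-large-cnn checkpoint, which works directly with the summarization pipeline; the same loading pattern applies to other Hub checkpoints.

```
# pip install transformers torch
from transformers import pipeline

# Any seq2seq summarization checkpoint from the Hub can be dropped in here
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # the document to summarize
result = summarizer(article, max_length=120, min_length=30, do_sample=False)
print(result[0]["summary_text"])
```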

Train Your Own Model

If we want full control over the pipeline, training a model from scratch is a practical path. TransformerSum offers a useful starting point, with training code and access to open-source datasets.

We can also merge multiple datasets to improve generalization. Architectures like the Longformer Encoder-Decoder (LED) are particularly effective when working with longer documents that require extended context.
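
For long inputs, an LED checkpoint can be loaded through the standard seq2seq classes. The sketch below uses the allenai/led-large-16384-arxiv checkpoint and follows the documented pattern of putting global attention on the first token; treat it as a starting point rather than a tuned setup.

```
# pip install transformers torch
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "allenai/led-large-16384-arxiv"  # LED fine-tuned on long scientific papers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

long_document = "..."  # text that exceeds the usual 512/1024-token limits
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=16384)

# LED expects global attention on at least the first token for summarization
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=256,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```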

Using LLMs Directly for Summarization

We can directly use LLMs to summarize text without additional training. For example, this notebook demonstrates how to use VertexAI with Gemini for summarization tasks.
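
A minimal sketch of the same idea with the Vertex AI Python SDK is shown below. The project ID, region, and model name are placeholders, and the SDK surface changes between releases, so treat this as indicative rather than exact.

```
# pip install google-cloud-aiplatform
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-flash")

document = "..."  # the text to summarize
response = model.generate_content(
    "Write a short summary (2-3 lines) of the following text:\n\n" + document
)
print(response.text)
```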

Summarization Strategies in LangChain

We can use different strategies available in LangChain for summarization.

Stuffing

This is the simplest technique: the whole context is put into the prompt and sent to the LLM to generate the summary, so there is only one API call. It is faster than techniques that require multiple API calls, but it has limitations: a very large document may not fit into the context window of the chosen LLM, and when we do stuff very large texts into the prompt, the generated summaries are not always reliable and generation can be slow.

MapReduce

Here the summarization is done in multiple stages: we first generate summaries for each smaller chunk and then combine those summaries into an overall summary. It overcomes the limitations of the stuffing technique by parallelizing the per-chunk calls, but it may lose context between chunks.

Refine

This technique starts with generating an initial summary using a small subset of the document. In each following step, a new chunk is added along with the previously generated summary. A refinement prompt is then used to improve the summary based on the newly added content. While this method helps maintain continuity and context across the document, it involves multiple sequential API calls, which can impact speed and efficiency.

An implementation of each technique is available in this LangChain notebook on large document summarization; a condensed sketch of the three chain types is shown below.
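
The sketch uses LangChain's classic load_summarize_chain helper. The LLM object is a placeholder, and newer LangChain releases may favor LCEL/LangGraph equivalents of the same chain types, so check the version you are on.

```
# pip install langchain langchain-community langchain-text-splitters
from langchain.chains.summarize import load_summarize_chain
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

llm = ...  # any LangChain-compatible LLM/chat model (e.g. a Gemini or OpenAI wrapper)

text = "..."  # the long document to summarize
splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
docs = [Document(page_content=chunk) for chunk in splitter.split_text(text)]

# Stuffing: a single call with the whole text in the prompt
stuff_chain = load_summarize_chain(llm, chain_type="stuff")

# MapReduce: summarize chunks independently, then combine the partial summaries
map_reduce_chain = load_summarize_chain(llm, chain_type="map_reduce")

# Refine: iteratively improve the summary as each new chunk is added
refine_chain = load_summarize_chain(llm, chain_type="refine")

summary = map_reduce_chain.invoke({"input_documents": docs})["output_text"]
```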

Evaluation Metrics

This section describes different techniques we can use to assess the quality of a generated summary, either against a human-crafted reference summary or without one.
First we discuss the n-gram-based approaches, then move to a similarity-based approach (BERTScore), and finally describe LLM-based evaluation, which is also effective when no human-written reference is available.

ROUGE scores

ROUGE is one of the most widely used metrics for text summarization. It compares the candidate summary against the reference and is calculated from the n-grams they have in common. There are several ROUGE variants:

  • ROUGE-N: Measures overlap of n-grams (e.g., unigrams, bigrams)
  • ROUGE-L: Focuses on the longest common subsequence.
  • ROUGE-S: Considers skip-grams.

More details and mathematical examples can be found in this blog.
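
In practice, computing ROUGE takes only a couple of lines with the Hugging Face evaluate library, which wraps the rouge_score package; a minimal sketch:

```
# pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["A feline rested on the rug."],
    references=["The cat sat on the mat."],
)
# Returns rouge1, rouge2, rougeL and rougeLsum F-measures
print(scores)
```
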
However, ROUGE has some limitations:

  1. It only matches exact word sequences. If a model paraphrases the content effectively, it may still get a low ROUGE score: when the generated summary uses many new words, the score stays low even if it is semantically close to the reference. Example:
    • Reference: "The cat sat on the mat.”
    • Generated: "A feline rested on the rug.”

Semantically similar, but low ROUGE score.

  2. It considers only n-grams when computing the score and does not capture the fluency or readability of the generated text.
  3. It favors longer summaries that contain more reference terms, even if the overall quality of the summary is poor.

Meteor

METEOR is similar to ROUGE, but words are matched after being reduced to their base forms through stemming (with synonym matching in some implementations). It also introduces a fragmentation penalty, calculated from the number of matched fragments between the candidate and reference summaries. So, if the matched words are spread out or appear in a different order, the score goes down. This helps penalize redundant information or information that differs from the reference summary.
While METEOR improves over ROUGE in some ways, it still has its drawbacks:

  • It struggles with deeper context understanding.
  • It doesn’t handle paraphrased content well, especially with different syntactic structure or less common synonyms.

For example:

  • Reference Summary:
    "The cat sat on the mat and looked outside the window."
  • Candidate Summary A:
    "A cat was sitting on the mat gazing out of the window."
  • Candidate Summary B:
    "The mat was occupied by the cat who looked outside.”

Candidate A scores reasonably well under METEOR since it captures the core idea and uses base forms like sat/sitting and looked/gazing. However, Candidate B—despite being semantically accurate—might receive a lower score due to its fragmented structure and different phrasing.

If you're interested in the math behind it, this blog provides a clear breakdown.
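
For reference, here is a minimal sketch of computing METEOR on the candidates above with the evaluate library (backed by NLTK):

```
# pip install evaluate nltk
import evaluate

meteor = evaluate.load("meteor")
reference = ["The cat sat on the mat and looked outside the window."]

score_a = meteor.compute(
    predictions=["A cat was sitting on the mat gazing out of the window."],
    references=reference,
)
score_b = meteor.compute(
    predictions=["The mat was occupied by the cat who looked outside."],
    references=reference,
)
print(score_a["meteor"], score_b["meteor"])
```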

BLEU

BLEU is mainly used for machine translation but can also be used for summarization. It compares candidate and reference text and produces a precision-oriented score: while ROUGE and METEOR are recall-oriented, BLEU is precision-oriented. It also introduces a brevity penalty so that short candidates with high precision but missing important information are penalized. For example:

  • Reference Summary:
    "The team secured a win in the final seconds of the game."

  • Candidate Summary A:
    "The team won the game."

  • Candidate Summary B:
    "In the final moments, the team secured a victory."

Candidate A might have higher precision due to exact word overlap, but it may receive a brevity penalty for missing context. Candidate B, while semantically close, might get a lower score due to paraphrasing.

More details and mathematical explanations are available here.
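
A minimal sketch of computing BLEU on the candidates above with the evaluate library (sacrebleu is a common alternative backend):

```
# pip install evaluate
import evaluate

bleu = evaluate.load("bleu")
reference = [["The team secured a win in the final seconds of the game."]]

score_a = bleu.compute(predictions=["The team won the game."], references=reference)
score_b = bleu.compute(
    predictions=["In the final moments, the team secured a victory."],
    references=reference,
)
# "bleu" is the overall score; "brevity_penalty" shows the penalty applied to short candidates
print(score_a["bleu"], score_a["brevity_penalty"], score_b["bleu"])
```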

BERTScore

This is generally used to evaluate machine translation or paraphrasing but can also be used for summarization. BERTScore takes each token in the candidate summary and finds the most similar token in the reference summary using BERT embeddings and a similarity metric, typically cosine similarity between the token vectors. The token-to-token similarities are then aggregated into precision, recall, and F1 scores. Mathematical explanations of the different steps can be found in this blog.
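
A minimal sketch using the evaluate wrapper around the bert_score package is shown below; the lang argument picks a default English model, and model_type can pin a specific checkpoint instead.

```
# pip install evaluate bert_score
import evaluate

bertscore = evaluate.load("bertscore")
results = bertscore.compute(
    predictions=["A feline rested on the rug."],
    references=["The cat sat on the mat."],
    lang="en",
)
# Per-example precision, recall and F1 derived from token-level cosine similarities
print(results["precision"], results["recall"], results["f1"])
```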

LLM-as-Judge

We can use an LLM to judge the quality of a translation or summary without a reference summary. The judge can assess the generated text along multiple dimensions such as coherence, consistency, fluency, and relevance. This blog introduces one approach in which summaries of two documents are generated first and a similarity score between them is then produced based on the LLM's judgment.
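
As a rough illustration of the idea (not the exact prompt from the linked blog), a reference-free judging prompt could look like the sketch below; the dimensions and scale are our own choices.

```
JUDGE_PROMPT = """You are evaluating a summary of a source document.
Rate the summary from 1 (poor) to 5 (excellent) on each dimension:
- coherence
- consistency (no claims that contradict or go beyond the source)
- fluency
- relevance
Return a JSON object with one integer score per dimension and a short justification.

Source document:
{document}

Summary:
{summary}
"""

def build_judge_prompt(document: str, summary: str) -> str:
    # Fill in the template; send the result to any capable LLM and parse the returned JSON
    return JUDGE_PROMPT.format(document=document, summary=summary)
```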

Recommendation

While working on a summarization task, we can initially evaluate models with ROUGE scores. This is a cost-efficient option, but it comes with the limitations discussed above.
We can use LLM-based evaluation when we don't have a reference summary to compare against, or when we want to go beyond n-gram-based scoring. We can also use LLMs to rank the summaries produced by different models and pick the highest-ranked one to work with. Several open-source tools are available for LLM-based evaluation:

  • DeepEval: Offers customizable evaluation pipelines for tasks like summarization, classification, and more (see the sketch after this list).
  • Ragas: A useful framework originally designed for evaluating RAG pipelines that also supports a summarization score metric.
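
For example, DeepEval ships a summarization metric that uses an LLM judge under the hood. The sketch below follows its documented LLMTestCase/SummarizationMetric usage, but the library evolves quickly and needs an evaluation-model API key configured, so check the current docs.

```
# pip install deepeval
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="<the original document>",            # source text
    actual_output="<the generated summary>",    # summary to judge
)
metric = SummarizationMetric(threshold=0.5)

evaluate(test_cases=[test_case], metrics=[metric])
```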

Leaderboard

Leaderboards help us identify LLMs that are suitable for a particular task. Hugging Face hosts several task-specific leaderboards; the idea is that when we start working on a particular task (for example, summarization), we can choose LLMs that rank highly on the relevant list.

  • Summarization Leaderboard: one of the available leaderboards, where benchmarking is done on a selection of summarization datasets.
  • Hallucinations Leaderboard: benchmarks models on a few well-known summarization datasets such as CNN-DM and XSum. We can look at columns like CNN-DM/ROUGE, CNN-DM/factKB, CNN-DM/bert-p, and the corresponding columns for the XSum dataset, then combine and average columns such as Faithfulness, Factuality, and CNN-DM/ROUGE to shortlist the top models for our applications.

We are also conducting similar experiments on our end. More on this in the Experiments section.

Experiments

We conducted an experiment to understand how different LLMs perform on the summarization task. Along with comparing the quality of summaries, we also monitored GPU memory consumption, latency, and how latency varies with the number of input and output tokens. For this, we used 500 randomly selected samples from the CNN-DM dataset and generated summaries using a custom prompt. We then compared the LLMs by calculating ROUGE scores between the reference and candidate summaries.

Prompts

This is the prompt we used:

```
Write a short summary of the following text. Follow the rules mentioned below:

  1. Return your response in 2-3 lines that covers key points of the text
  2. Do not need any bullet points from text. Instead use them to write a concise summary of the text
  3. Do not add any task specific information. Ex: Sure! Here is a concise summary fo the text you provided: , Sure! Here is a summary of the text in 3 lines:, Do not add such text at the beginning

text:
```

Results:

Rouge Score Comparison

| Model | No. of Parameters | Quantization | Rouge1 | Rouge2 | RougeL | RougeLSum |
|---|---|---|---|---|---|---|
| Gemini-1.5-flash | - | - | 0.29 | 0.087 | 0.196 | 0.229 |
| tinylama | 1B | Q4_0 | 0.214 | 0.061 | 0.14 | 0.168 |
| llama 3.1-8b | 8B | Q4_K_M | 0.315 | 0.103 | 0.209 | 0.245 |
| llama3.2:3b | 3B | Q4_K_M | 0.303 | 0.098 | 0.202 | 0.241 |
| llama2 | 7B | Q4_0 | 0.267 | 0.091 | 0.181 | 0.211 |
| llama2:13b | 13B | Q4_0 | 0.288 | 0.097 | 0.196 | 0.228 |
| mistral-7b-instruct | 7B | Q4_0 | 0.265 | 0.094 | 0.178 | 0.21 |
| mistral-openorca:7b | 7B | Q4_0 | 0.273 | 0.092 | 0.181 | 0.215 |
| phi3:14b | 14B | Q4_0 | 0.29 | 0.089 | 0.192 | 0.226 |

Latency Comparison

Average input length was about 626 words (input tokens = tokenized input + prompt); average summary length was about 34 words (output tokens = tokenized output summary).

| Model | No. of Parameters | Quantization | Total Time for 500 Samples (s) | Avg per Sample (s) | Avg Input Tokens | Avg Output Tokens |
|---|---|---|---|---|---|---|
| Gemini-1.5-flash | - | - | - | - | 934 | 73 |
| tinylama | 1B | Q4_0 | 749 | 1.49 | 1059 | 180 |
| llama 3.1-8b | 8B | Q4_K_M | 1697 | 3.39 | 919 | 72 |
| llama3.2:3b | 3B | Q4_K_M | 747 | 1.49 | 919 | 63 |
| llama2 | 7B | Q4_0 | 2155 | 4.31 | 1059 | 131 |
| llama2:13b | 13B | Q4_0 | 3340 | 6.62 | 1059 | 103 |
| mistral-7b-instruct | 7B | Q4_0 | 2269 | 4.6 | 1022 | 144 |
| mistral-openorca:7b | 7B | Q4_0 | 2090 | 4.18 | 1022 | 126 |
| phi3:14b | 14B | Q4_0 | 3113 | 6.22 | 1058 | 90 |

Tools and Codebase

To support our experimentation and evaluation efforts, we explored various tools and frameworks across the summarization pipeline—ranging from model training to deployment and hosting. This section outlines the key resources and libraries we used during our work, including training platforms, hosting solutions, and utilities for preprocessing and evaluation.

Training

  1. Hugging Face

    It provides a comprehensive framework for training and fine-tuning transformer models for summarization tasks.

  2. TransformerSum

    It supports both extractive and abstractive summarization, includes utilities for efficient data preprocessing, and integrates with Weights & Biases for experiment tracking.

LLM Hosting

For LLM hosting, we used Ollama. It ships with a Docker container that serves models on localhost and is one of the easiest tools available for working with LLMs locally.
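
Once Ollama is running locally, a summary can be generated with a single call to its REST API. A minimal sketch using requests is below; the model name and prompt are just examples.

```
# pip install requests   (Ollama must be running, e.g. via `ollama serve` or its Docker image)
import requests

document = "..."  # the text to summarize
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",  # any model previously pulled with `ollama pull`
        "prompt": "Write a short 2-3 line summary of the following text:\n\n" + document,
        "stream": False,         # return a single JSON object instead of a stream
    },
    timeout=300,
)
print(response.json()["response"])
```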

Conclusion

In this post, our aim was to cover the text summarization task end to end. We first introduced the task, then described its different variants and how to implement each of them, from training models from scratch to using LLMs directly along with the functionality available in LangChain, with links to the relevant codebases. We also shared the results of our own experiments. We hope this blog serves as a practical guide for anyone with a basic understanding of Python to build their own summarizer and evaluate it effectively.

If you’re building something around summarization or experimenting with similar ideas, we’d love to hear from you. Let’s learn together, collaborate, and shape what comes next.