
WikiHow: A Large Scale Text Summarization Dataset

Mahnaz Koupaee, William Yang Wang
University of California, Santa Barbara

arXiv:1810.09305v1 [cs.CL] 18 Oct 2018

Abstract

Sequence-to-sequence models have recently achieved state-of-the-art performance in summarization. However, not many large-scale high-quality datasets are available, and almost all of the available ones consist of news articles with a specific writing style. Moreover, abstractive human-style systems, which describe content at a deeper level, require data with higher levels of abstraction. In this paper, we present WikiHow, a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and therefore represent highly diverse styles. We evaluate the performance of existing methods on WikiHow to present its challenges and set some baselines for further improvement.

1 Introduction

Summarization, the process of generating a shorter version of a piece of text while preserving important context information, is one of the most challenging NLP tasks. Sequence-to-sequence neural networks have recently obtained significant performance improvements on summarization (Rush et al., 2015; Chopra et al., 2016). However, the existence of large-scale datasets is key to the success of these models. Moreover, the length of the articles and the diversity of their styles can create further complications.

Almost all existing summarization datasets, such as DUC (Harman and Over, 2004), Gigaword (Napoles et al., 2012), New York Times (Sandhaus, 2008) and CNN/Daily Mail (Nallapati et al., 2016), consist of news articles. News articles have their own specific style, and systems trained only on news may therefore not generalize well. Furthermore, the existing datasets may not be large enough (DUC) to train a sequence-to-sequence model, their summaries may be limited to headlines (Gigaword), they may be more useful as an extractive summarization dataset (New York Times), or their level of abstraction may be limited (CNN/Daily Mail).

To overcome the issues of the existing datasets, we present a new large-scale dataset called WikiHow, built from the online WikiHow[1] knowledge base. It contains articles about various topics written in different styles, making them different from existing news datasets. Each article consists of multiple paragraphs, and each paragraph starts with a sentence summarizing it. By merging the paragraphs to form the article and the paragraph outlines to form the summary, the resulting version of the dataset contains more than 200,000 long-sequence pairs. We then present two features to show how abstractive our dataset is. Finally, we analyze the performance of some existing extractive and abstractive systems on WikiHow as benchmarks for further studies. The contribution of this work is three-fold:

• We introduce a large-scale, diverse dataset with various writing styles, convenient for long-sequence text summarization.

• We introduce level-of-abstractedness and compression-ratio metrics to show how abstractive the new dataset is.

• We evaluate the performance of existing systems on WikiHow to create benchmarks and better understand its challenges.

2 Existing Datasets

Several datasets have been used to evaluate summarization systems. We briefly describe their properties as follows.

[1] http://www.wikihow.com/
[Figure 1: Inverted Pyramid writing style. Panels: The Lead (the most important information about an event: who, what, where, when, why, how), The Body (the crucial information expanding the topic: argument, controversy, story, evidence, background details), The Tail (extra information: interesting related items, journalist assessment). The first few sentences of news articles contain the important information, making Lead-3 baselines outperform most of the systems.]

Dataset Size              230,843
Average Article Length    579.8
Average Summary Length    62.1
Vocabulary Size           556,461

Table 1: The WikiHow dataset statistics.

DUC: The Document Understanding Conference dataset (Harman and Over, 2004) contains 500 news articles and their summaries, capped at 75 bytes. The summaries are written by human authors, and there is more than one summary per article, which is its major advantage over other existing datasets. The DUC dataset cannot be used on its own to train models with a large number of parameters and is therefore used along with other datasets (Rush et al., 2015; Nallapati et al., 2017).

Gigaword: Another collection of news articles used for summarization is Gigaword (Napoles et al., 2012). The original articles in the dataset do not have summaries paired with them. However, some prior work (Rush et al., 2015; Chopra et al., 2016) used a subset of this dataset and constructed summary pairs from the first line of each article and its headline, making the dataset suitable for short text summarization tasks.

New York Times: The New York Times (NYT) dataset (Sandhaus, 2008) is a large collection of articles published between 1996 and 2007. While this dataset has mainly been used for extractive systems (Hong and Nenkova, 2014; Durrett et al., 2016), Paulus et al. (2017) were the first to evaluate their abstractive system on NYT.

CNN/Daily Mail: This dataset, used in most recent summarization papers (Nallapati et al., 2016; See et al., 2017; Nallapati et al., 2017), consists of online CNN and Daily Mail news articles and was originally developed for question answering systems. The highlights associated with each article are concatenated to form the summary. Two versions of this dataset exist, depending on the preprocessing: Nallapati et al. (2017) used entity anonymization to create the anonymized version, while See et al. (2017) replaced the anonymized entities with their actual values to create the non-anonymized version.

NEWSROOM: This corpus (Grusky et al., 2018) is the most recent large-scale dataset introduced for text summarization. It consists of diverse summaries combining abstractive and extractive strategies, yet it is another news dataset, and the average length of its summaries is limited to 26.7.

3 WikiHow Dataset

The existing summarization datasets consist of news articles. These articles are written by journalists and follow the journalistic style. Journalists usually follow the Inverted Pyramid style (Pöttker, 2003), depicted in Figure 1, to prioritize and structure a text: they start by mentioning the most important, interesting or attention-grabbing elements of a story in the opening paragraphs and later add details and any background information. This writing style may be the reason why Lead-3 baselines (where the first three sentences are selected to form the summary) usually score higher than the existing summarization systems. We introduce a new dataset called WikiHow, obtained from the WikiHow data dump. This dataset contains articles written by ordinary people, not journalists, describing the steps of performing a task throughout the text. Therefore, the Inverted Pyramid does not apply to it, as all parts of the text can be of similar importance.

3.1 WikiHow Knowledge Base

The WikiHow knowledge base contains online articles describing procedural tasks about various topics (from arts and entertainment to computers and electronics), and new articles are added to it regularly. Each article consists of a title starting with "How to" and a short description of the article. There are two types of articles: the first type describes a single-method task in several steps, while the second type presents the steps of multiple methods for a task. Each step description starts with a bold line summarizing that step and is followed by a more detailed explanation. A truncated example of a WikiHow article and how the data pairs are constructed is shown in Figure 2.
[Figure 2: Original article (left) and constructed pairs (right). Method 1, Reducing Your Water Usage: each step opens with a bold summary line, e.g. "Take quicker showers to conserve water.", followed by its detailed description, e.g. "One easy way to conserve water is to cut down on your shower time. Practice cutting your showers down to 10 minutes, then 7, then 5. Challenge yourself to take a shorter shower every day."

Constructed pair for Method 1:
Article 1: "One easy way to conserve water is to cut down on your shower time. Practice cutting your showers down to 10 minutes, then 7, then 5. Challenge yourself to take a shorter shower every day. Washing machines take up a lot of water and electricity, so running a cycle for a couple of articles of clothing is inefficient. Hold off on laundry until you can fill the machine. Avoid letting the water run while you're brushing your teeth or shaving. Keep your hoses and faucets turned off as much as possible. When you need them, use them sparingly. …"
Summary 1: "Take quicker showers to conserve water. Wait for a full load of clothing before running a washing machine. Turn off the water when you're not using it. …"

Constructed pair for Method 2, Using River-Friendly Products:
Article 2: "Any chemicals you use in your home end up back in the water supply. Choose natural soaps or create your own cleaning and disinfecting agents out of vinegar, baking soda, lemon juice, and other natural products. These products have far less of a negative impact if they reach a river. New products take way more water to make than recycled products. Reuse what you already own when possible. If you need to buy something, opt for products made out of recycled paper or other reused material. …"
Summary 2: "Select biodegradable cleaning products. Choose recycled products instead of new ones. …"]

Figure 2: An example from our new dataset, which includes 200K+ summaries. The bold lines summarizing each paragraph (shown in red boxes) are extracted to form the summary. The detailed descriptions of each step (excluding the bold lines) form the article. Note that the articles and summaries are truncated and not shown at their actual lengths.
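The construction shown in Figure 2 (bold lines concatenated into the summary, step descriptions concatenated into the article, with the size-threshold filter described in Section 3.2) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name and the (bold_line, description) step representation are assumptions.

```python
def build_pair(steps):
    """Construct one (article, summary) pair from a list of WikiHow steps.

    Each step is a (bold_line, description) tuple: the bold line
    summarizes the step, and the description expands on it (Figure 2).
    Returns None when the summary would be at least as long as the
    article, mirroring the size-threshold filter that drops pairs
    whose steps have no real explanations.
    """
    summary = " ".join(bold for bold, _ in steps)
    article = " ".join(desc for _, desc in steps)
    if len(summary) >= len(article):
        return None  # e.g. an article made of bold lines only
    return article, summary
```

Each method of a multi-method article would be passed through this function separately, since the paper treats each method as its own article.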

3.2 Data Extraction and Dataset Construction

We used the Python Scrapy[2] library to write a crawler to obtain the data from the WikiHow website. The articles, classified into 20 different categories, cover a wide range of topics. Our crawler obtained 142,783 unique articles (some containing more than one method) at the time of crawling (new articles are added regularly). To prepare the data for the summarization task, each method (if any) described in an article is treated as a separate article. To generate the reference summaries, the bold lines representing the summaries of the steps are extracted and concatenated. The remaining parts of the steps (the detailed descriptions) are also concatenated to form the source article. After this step, 230,843 articles and reference summaries are generated. Some articles contain only the bold lines, i.e. there is no further explanation of the steps, so they cannot be used for the summarization task. To filter out these articles, we used a size threshold so that pairs whose summaries are longer than the article are removed. The final dataset consists of 204,004 articles and their summaries. The statistics of the dataset are shown in Table 1. The dataset is released to the public[3].

4 WikiHow Properties

The large scale of the WikiHow dataset, with more than 230,000 pairs, and its average article and summary lengths make it a better choice than the DUC and Gigaword corpora. We also define two metrics to represent the abstraction level of WikiHow by comparing it with CNN/Daily Mail, known as one of the most abstractive and most commonly used datasets in recent summarization papers (Nallapati et al., 2016, 2017; See et al., 2017; Paulus et al., 2017).

4.1 Level of Abstractedness

The abstractedness of the dataset is measured by calculating the unique n-grams in the reference summary that do not appear in the article. The comparison is shown in Figure 3. Except for common unigrams, bigrams and trigrams between the articles and the summaries, no other common n-grams exist in the WikiHow pairs. This higher level of abstractedness creates new challenges for summarization systems, as they have to be more creative in generating more novel summaries.

[Figure 3: Uniqueness of n-grams in the CNN/Daily Mail and WikiHow datasets.]

4.2 Compression Ratio

We define a compression ratio to characterize the summarization task. We first calculate the average sentence length for both the articles and the summaries. The compression ratio is then defined as the ratio between the average length of article sentences and the average length of summary sentences. The higher the compression ratio, the more difficult the summarization task, as it needs to capture higher levels of abstraction and semantics.

[2] https://scrapy.org/
[3] https://github.com/mahnazkoupaee/WikiHow-Dataset
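The two metrics of Section 4 can be sketched as follows. This is an illustrative implementation under a whitespace-tokenization assumption (the paper does not specify its exact tokenization), and the function names are ours.

```python
def abstractedness(article, summary, n=2):
    """Fraction of distinct summary n-grams that never appear in the
    article: the novel n-gram measure of Section 4.1."""
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summ, art = ngrams(summary), ngrams(article)
    return len(summ - art) / len(summ) if summ else 0.0

def compression_ratio(article_sentences, summary_sentences):
    """Average article-sentence length divided by average
    summary-sentence length, in tokens (Section 4.2)."""
    def avg_len(sentences):
        return sum(len(s.split()) for s in sentences) / len(sentences)
    return avg_len(article_sentences) / avg_len(summary_sentences)
```

With the averages reported in Table 3 (article sentences of 100.68 tokens, summary sentences of 42.27), this ratio comes out to the 2.38 the paper reports for WikiHow.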
                                       CNN/Daily Mail                     WikiHow
Model                          ROUGE-1 ROUGE-2 ROUGE-L METEOR    ROUGE-1 ROUGE-2 ROUGE-L METEOR
TextRank                        35.23   13.90   31.48   18.03     27.53    7.40   20.00   12.92
Seq-to-seq with attention       31.33   11.81   28.83   12.03     22.04    6.27   20.87   10.06
Pointer-generator               36.44   15.66   33.42   15.35     27.30    9.10   25.65    9.70
Pointer-generator + coverage    39.53   17.28   36.38   17.32     28.53    9.23   26.54   10.56
Lead-3 baseline                 40.34   17.70   36.57   20.48     26.00    7.24   24.25   12.85

Table 2: ROUGE F1 and METEOR (exact match) scores of different methods on the non-anonymized version of the CNN/Daily Mail dataset and on the WikiHow dataset. The ROUGE scores are given with a 95% confidence interval of at most ±0.25 by the official ROUGE script.
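ROUGE-N, the family of metrics behind Table 2, is at its core an n-gram overlap F1. The following is a minimal whitespace-tokenized sketch for a single reference, not the official scorer, which additionally applies stemming and reports bootstrap confidence intervals:

```python
from collections import Counter

def rouge_n_f1(reference, candidate, n=1):
    """ROUGE-N F1 on whitespace tokens: n-gram overlap counts, clipped
    by how often each n-gram occurs in the reference, turned into a
    precision/recall F1 score."""
    def ngram_counts(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngram_counts(reference), ngram_counts(candidate)
    if not ref or not cand:
        return 0.0
    overlap = sum((ref & cand).values())  # clipped multiset intersection
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)
```

The paper instead reports scores through the pyrouge wrapper around the official Perl script, which is what the ±0.25 confidence interval in the caption refers to.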

                             WikiHow   CNN/Daily Mail
Article Sentence Length      100.68    118.73
Summary Sentence Length       42.27     82.63
Compression Ratio              2.38      1.44

Table 3: Compression ratios of the WikiHow and CNN/Daily Mail datasets. The reported article and summary lengths are means over all sentences.

Table 3 shows the results for WikiHow and CNN/Daily Mail. The higher compression ratio of WikiHow shows the need for higher levels of abstraction.

5 Experiments

We evaluate the WikiHow dataset using existing extractive and abstractive baselines. The systems used and the results generated for WikiHow and CNN/Daily Mail are described in the following sections.

5.1 Evaluated Systems

TextRank extractive system: An extractive summarization system (Mihalcea and Tarau, 2004; Barrios et al., 2016) using a graph-based ranking method to select sentences from the article and form the summary.

Sequence-to-sequence model with attention: A baseline system applied by Chopra et al. (2016) and Nallapati et al. (2016) to the abstractive summarization task, generating summaries from a predefined vocabulary. This baseline cannot handle out-of-vocabulary (OOV) words.

Pointer-generator abstractive system: A pointer-generator mechanism (See et al., 2017) allowing the model to freely switch between copying a word from the input sequence and generating a word from the predefined vocabulary.

Pointer-generator with coverage abstractive system: The pointer-generator baseline with an added coverage loss (See et al., 2017) to reduce repetition in the final generated summary.

Lead-3 baseline: A baseline selecting the first three sentences of the article to form the summary. This baseline cannot be used directly for the WikiHow dataset, as the first three sentences of each article describe only a small portion of the whole article. We therefore created the Lead-3 baseline by extracting the first sentence of each paragraph and concatenating them to create the summary.

5.2 Results

To study the performance of the evaluated systems, we used the pyrouge package[4] to report the F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005)[5], both based on exact matches and on the inclusion of stems, paraphrases and synonyms. Table 2 presents the results of multiple baselines on both the CNN/Daily Mail dataset (the well-known, most common abstractive summarization dataset) and the proposed WikiHow dataset. As can be seen, the summarization systems perform much better on CNN/Daily Mail than on WikiHow, with Lead-3 outperforming the other baselines due to the Inverted Pyramid news writing style described earlier. Conversely, the poor performance of Lead-3 on WikiHow shows the different writing styles of its articles. Moreover, all baselines score about 10 ROUGE points higher on CNN/Daily Mail than on WikiHow. This difference suggests features and aspects inherent in the new dataset which can be used to further improve summarization systems.

6 Conclusion

We present WikiHow, a new large-scale summarization dataset consisting of diverse articles from the WikiHow knowledge base. The WikiHow features discussed in this paper can pose new challenges to summarization systems. We hope that the new dataset will attract researchers' attention as a choice for evaluating their systems.

[4] pypi.python.org/pypi/pyrouge/0.1.3
[5] www.cs.cmu.edu/~alavie/METEOR
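The modified Lead-3 baseline of Section 5.1, which takes the first sentence of each paragraph rather than the first three sentences of the article, can be sketched as follows. The period-based sentence split and the blank-line paragraph delimiter are simplifying assumptions; a real system would use a proper sentence tokenizer.

```python
def wikihow_lead_baseline(article):
    """Modified Lead-3 for WikiHow (Section 5.1): concatenate the first
    sentence of every paragraph, since WikiHow articles, unlike news,
    do not front-load their key information."""
    leads = []
    for paragraph in article.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Naive split on ". "; the first fragment is the lead sentence.
        first = paragraph.split(". ")[0].rstrip(".") + "."
        leads.append(first)
    return " ".join(leads)
```

On a news article this degrades gracefully toward a one-sentence-per-paragraph summary, which is why the standard first-three-sentences Lead-3 is kept for CNN/Daily Mail in Table 2.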
References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Federico Barrios, Federico López, Luis Argerich, and Rosa Wachenchauzer. 2016. Variations of the similarity function of TextRank for automated summarization. arXiv preprint arXiv:1602.03606.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In HLT-NAACL, pages 93–98.

Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. arXiv preprint arXiv:1603.08887.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719.

Donna Harman and Paul Over. 2004. The effects of human variation in DUC summarization evaluation. In Text Summarization Branches Out.

Kai Hong and Ani Nenkova. 2014. Improving the estimation of word importance for news multi-document summarization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 712–721.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8. Barcelona, Spain.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL 2016, page 280.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100. Association for Computational Linguistics.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Horst Pöttker. 2003. News and its communicative quality: The inverted pyramid – when and why did it appear? Journalism Studies, 4(4):501–511.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.
