Wikihow: A Large Scale Text Summarization Dataset
Figure 2: An example from our new dataset, the WikiHow summary dataset, which includes more than 200K summaries. The bold lines
summarizing the paragraph (shown in red boxes) are extracted and form the summary. The detailed descriptions of each step
(except the bold lines) will form the article. Note that the articles and the summaries are truncated and the presented texts are
not in their actual lengths.
Table 2: The ROUGE-F1 scores of different methods on the non-anonymized version of the CNN/Daily Mail dataset and on the WikiHow
dataset. The official ROUGE script reports a 95% confidence interval of at most ±0.25 for these scores.
                          WikiHow   CNN/Daily Mail
Article Sentence Length   100.68    118.73
Summary Sentence Length   42.27     82.63
Compression Ratio         2.38      1.44

Table 3: Compression ratio of WikiHow and CNN/Daily Mail datasets. The represented article and summary lengths
are the mean over all sentences.

This baseline cannot be directly used for the WikiHow dataset as the first 3 sentences of each article
only describe a small portion of the whole article. We created the Lead-3 baseline by extracting the
first sentence of each paragraph and concatenating them to create the summary.
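The adapted Lead-3 baseline described here (take the first sentence of each paragraph and concatenate them) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. The function names `first_sentence` and `paragraph_lead_summary` are hypothetical, and two assumptions are made that the text does not specify: paragraphs are separated by blank lines, and a sentence ends at `.`, `!` or `?` followed by whitespace.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# This naive rule will mis-split abbreviations such as "e.g. this".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def first_sentence(paragraph: str) -> str:
    """Return the first sentence of a paragraph (naive punctuation split)."""
    text = " ".join(paragraph.split())  # collapse line breaks and extra spaces
    if not text:
        return ""
    return _SENTENCE_END.split(text, maxsplit=1)[0]


def paragraph_lead_summary(article: str) -> str:
    """Concatenate the first sentence of every paragraph in the article."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", article.strip())
    leads = [first_sentence(p) for p in paragraphs]
    return " ".join(s for s in leads if s)


if __name__ == "__main__":
    # Hypothetical three-paragraph article, for illustration only.
    article = (
        "Gather your tools. You will need a wrench.\n\n"
        "Turn off the water! Check twice.\n\n"
        "Open the valve"
    )
    print(paragraph_lead_summary(article))
```

A proper sentence tokenizer could replace the regular expression without changing the overall structure. The sketch only aims to show why this variant covers every step of an article, where the first three sentences would not.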