lxj5957/CLTS-Dataset

CLTS-Dataset

CLTS: A Chinese Long Text Summarization Dataset

Introduction

Automatic summarization datasets for Chinese currently have two problems: (1) few Chinese datasets exist, and (2) long-text datasets are lacking. Most existing datasets contain short texts, which can only be used to generate a title from a short text.

CLTS is a new Chinese long text summarization dataset, extracted from the Chinese news website ThePaper.cn. The final version of the dataset contains more than 180,000 long article-summary pairs, where each article consists of multiple paragraphs and each summary consists of multiple sentences.

The dataset is introduced in CLTS: A New Chinese Long Text Summarization Dataset.

Properties

The table below compares CLTS with other datasets. For Chinese datasets, length is measured in Chinese characters; for English datasets, it is measured in words.

| Dataset | CLTS | LCSTS | CNN/DM |
| --- | --- | --- | --- |
| Dataset Size | 185,397 | 2,412,363 | 312,085 |
| Training Set Size | 148,317 | 2,400,591 | 287,227 |
| Max Article Length | 6,130 | 135 | 2,173 |
| Mean Article Length | 1363.69 | 88.65 | 687.09 |
| Mean Summary Length | 58.12 | 16.09 | 48.49 |
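
As a rough illustration of the length convention used in the table, the sketch below counts characters for Chinese text and whitespace-separated words for English text, and derives each dataset's training share from the sizes listed above. The function names (`text_length`, `mean_length`) and the choice to count every non-whitespace character, punctuation included, are assumptions made for this example. They are not taken from the CLTS paper or its preprocessing code.

```python
def text_length(text: str, lang: str) -> int:
    """Length in characters for Chinese ("zh"), in words for anything else."""
    if lang == "zh":
        # Assumption: every non-whitespace character counts, punctuation included.
        return sum(1 for ch in text if not ch.isspace())
    # Assumption: English words are split on whitespace only.
    return len(text.split())


def mean_length(texts, lang: str) -> float:
    """Average length of a non-empty collection of texts."""
    lengths = [text_length(t, lang) for t in texts]
    return sum(lengths) / len(lengths)


# (dataset size, training set size), copied from the table above.
sizes = {
    "CLTS": (185_397, 148_317),
    "LCSTS": (2_412_363, 2_400_591),
    "CNN/DM": (312_085, 287_227),
}
train_share = {name: train / total for name, (total, train) in sizes.items()}

if __name__ == "__main__":
    print(text_length("今天天气很好。", "zh"))              # 7 characters
    print(text_length("The weather is nice today.", "en"))  # 5 words
    for name, share in train_share.items():
        print(f"{name}: {share:.1%} of pairs are in the training set")
```

By these figures, roughly 80% of CLTS pairs fall in the training set, leaving the remaining 37,080 pairs for validation and testing combined.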

Samples

A selection of samples from the dataset is available in samples.txt.

Download

The corpus is split into three parts: training, validation, and test sets. The data can be downloaded from the link here; the password is iucp.
