
This repository contains the SubSumE dataset for subjective document summarization.


afariha/SubSumE


SubSumE Dataset

This repository contains the SubSumE dataset for subjective document summarization. See the paper and the talk for details on dataset creation. Also check out our work SuDocu on example-based document summarization.

Dataset Files

Download the dataset from here.

The dataset contains:

  • Simplified text from the Wikipedia pages of 48 US states. In addition, all sentences from these documents are collected in a single file, processed_state_sentences.csv, where each sentence is assigned a unique sentence id that is referenced in the summary JSON files.
  • Intent-based summaries created by human annotators.
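As a minimal sketch of how the combined sentences file might be used, the snippet below builds a lookup from sentence id to sentence text. The column names of processed_state_sentences.csv are not documented here, so they are passed in as parameters rather than guessed; the function name and the assumption that ids are integers are also illustrative only.

```python
import csv


def load_sentence_lookup(csv_path, id_column, text_column):
    """Return {sentence_id: sentence_text} from the combined sentences file.

    id_column and text_column are supplied by the caller because the actual
    header names are not stated in this README. Ids are assumed to be integers.
    """
    lookup = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            lookup[int(row[id_column])] = row[text_column]
    return lookup
```

Check the header row of the CSV first, then pass the matching column names to the function.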

Each datapoint file in the directory user_summary_jsons contains a JSON object with summaries of the Wikipedia pages of eight states. It has the following keys:

  • intent: Summarization intent provided to human annotators for generating the summary
  • summaries: List of summary JSON objects for the eight states assigned to the annotator. Each object in the list contains the following keys:
    • state_name: Name of the state
    • sentence_ids: Global ids of the sentences (with respect to processed_state_sentences.csv) present in the summary
    • sentences: List of sentences present in the summary
    • use_keywords: Keywords used by the annotator to search the document when creating summaries
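The structure above can be read with the standard library alone. The following is a sketch, not an official loader: the function names are made up for illustration, and it assumes that the files in user_summary_jsons carry a .json extension. Only the key names (intent, summaries, state_name, sentence_ids) come from the description above.

```python
import json
from pathlib import Path


def load_user_summaries(json_dir):
    """Yield (file_name, intent, summary) for each summary in each datapoint file.

    Assumes the datapoint files end in .json (not confirmed by the README).
    """
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            datapoint = json.load(f)
        for summary in datapoint["summaries"]:
            yield path.name, datapoint["intent"], summary


def summary_lengths_by_state(json_dir):
    """Map each state name to the sentence counts of its summaries."""
    lengths = {}
    for _, _, summary in load_user_summaries(json_dir):
        count = len(summary["sentence_ids"])
        lengths.setdefault(summary["state_name"], []).append(count)
    return lengths
```

For example, `summary_lengths_by_state("user_summary_jsons")` would give a quick view of how long annotators' summaries are for each state.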

Acknowledgements

This work was supported by the NSF under grants IIS-1453543, IIS-1943971, and CCF-1763423, and by a Microsoft Research Dissertation Grant.
