Text Mining and Dataset Creation in Python

The document provides an overview of text mining, detailing the process of transforming unstructured text into structured data through steps such as data source identification, cleaning, preprocessing, feature extraction, labeling, and data integration. It also introduces Python libraries like NLTK for text processing and named entity recognition, along with practical coding examples. Additionally, it lists various tools and libraries for text mining, including Texthero and Hugging Face, to aid in natural language processing projects.

Text Data Mining using Python
MSCS F22, Spring 2023
Instructor: Dr. Umara Zahid
What is Text Mining?
• Text mining is the process of transforming unstructured text into a structured format in order to identify meaningful patterns and new insights.
• It is a useful approach for converting unstructured textual data into structured data that can be analyzed and used for many purposes.
Steps to create a structured dataset from textual data
• Here are the main steps to create a structured dataset from textual data using text-mining approaches:
1. Identify the data source: The first step is to identify the source of the textual data you want to mine. This could be anything from social media posts and customer reviews to news articles and scientific papers.
Live Demo of Data Sources in Class
Link for Text Datasets
https://paperswithcode.com/datasets?task=text-generation
2. Data cleaning: Once you have identified the data source, the next step is to clean the data. This involves removing irrelevant or redundant information such as HTML tags (if the source is a webpage), special characters, and punctuation. This helps ensure that the text is in a standardized format and can be easily processed by text-mining algorithms.
Python code is provided in the later slides.
Steps to create a structured dataset from textual data
3. Text preprocessing: After cleaning the data, you need to preprocess the text to make it suitable for text mining. This involves tokenizing the text into individual words or phrases, removing stop words (such as "the" and "and"), stemming or lemmatizing words to their root form, and converting the text to lowercase.
4. Feature extraction: Once you have preprocessed the text, you can extract relevant features from it. This could include extracting entities such as names, locations, and organizations, identifying topics and themes, or extracting sentiment or emotion.
5. Labeling: If your dataset requires labeling, such as for training a machine learning algorithm, you will need to manually label a portion of your data. This could involve categorizing the text into different classes or assigning a sentiment score.
6. Data integration: Finally, you can integrate the structured data into a database, spreadsheet, or CSV file for further analysis.
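The six steps can be sketched end-to-end in miniature. In this sketch the review texts, the tiny stopword list, and the keyword-based labeling rule are all illustrative stand-ins (real projects would use a stopword corpus and manual annotation):

```python
import csv
import io
import re

# Step 1: identified data source -- a few illustrative customer reviews.
reviews = [
    "Great phone, love it!!",
    "Terrible battery... very bad.",
    "Okay value for the price.",
]

STOP_WORDS = {"for", "it", "the", "very"}  # tiny stand-in stopword list

def clean_and_preprocess(text):
    # Steps 2-3: lowercase, strip non-letters, then drop stop words.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def label_sentiment(text):
    # Step 5: toy keyword rule standing in for manual annotation.
    return "positive" if any(w in text for w in ("great", "love", "okay")) else "negative"

# Steps 4 and 6: extract simple features and integrate rows into CSV.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["text", "num_tokens", "label"])
for raw in reviews:
    cleaned = clean_and_preprocess(raw)
    writer.writerow([cleaned, len(cleaned.split()), label_sentiment(cleaned)])

print(buf.getvalue())
```

The result is a small structured table (text, token count, label) ready to load into a spreadsheet or database, which is exactly what step 6 describes.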
Python code for Data Cleaning
• The Natural Language Toolkit (NLTK) is a popular Python library for working with text data.
• This code defines a `clean_text()` function that takes a string of text as input and performs the following operations:
1. Converts the text to lowercase.
2. Removes any non-alphanumeric characters (i.e., anything that is not a letter or number).
3. Removes any extra whitespace (i.e., multiple spaces in a row).
4. Removes any leading or trailing whitespace.
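The `clean_text()` listing itself did not survive this export; a minimal sketch matching the four operations described above, using Python's built-in `re` module, might look like this:

```python
import re

def clean_text(text):
    """Normalize a raw text string for downstream text mining."""
    text = text.lower()                        # 1. convert to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # 2. drop non-alphanumeric characters
    text = re.sub(r"\s+", " ", text)           # 3. collapse runs of whitespace
    return text.strip()                        # 4. trim leading/trailing whitespace

print(clean_text("  Hello, World!!  This   is MESSY text... "))
# hello world this is messy text
```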
Python code for Data Cleaning
• In this code, a `remove_html_tags()` function takes a string of HTML text as input and returns the same text with the HTML markup removed.
• The function uses a regular expression tokenizer from NLTK to tokenize the text and filter out the characters that make up HTML tags.
• The regular expression `\w+` matches any sequence of one or more word characters (letters, digits, or underscores), so angle brackets and other markup characters are discarded. Note that bare tag names can still survive as word tokens, which is one reason a dedicated parser is preferable.
• Punctuation is also removed as a side effect.
• Use BeautifulSoup for robust removal of complete HTML markup.
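The slide's listing is not reproduced in this export. A sketch of the idea follows, using `re.findall(r"\w+", ...)` in place of NLTK's `RegexpTokenizer(r'\w+')` (the two produce the same tokens); an explicit tag-stripping pass is added first so that bare tag names such as `p` or `b` do not survive as tokens:

```python
import re

def remove_html_tags(html_text):
    """Strip HTML tags, then keep only word tokens (punctuation removed)."""
    no_tags = re.sub(r"<[^>]+>", " ", html_text)  # drop <...> markup first
    tokens = re.findall(r"\w+", no_tags)          # \w+ keeps words, drops punctuation
    return " ".join(tokens)

print(remove_html_tags("<p>Hello, <b>world</b>!</p>"))
# Hello world
```

For messy real-world pages, `BeautifulSoup(html).get_text()` is the more robust choice, as the slide notes.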
Text Preprocessing code

• Output:
hello example text preprocessing going remove stop word lemmatize word
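The preprocessing listing is missing from this export. Below is a simplified sketch in which a hand-rolled stopword set and a naive plural-stripping lemmatizer stand in for NLTK's English stopwords corpus and `WordNetLemmatizer` (which require `nltk.download` data); the input sentence is a reconstruction chosen so the result matches the slide's output:

```python
import re

# Stand-in stopword list (the slides use NLTK's English stopwords corpus).
STOP_WORDS = {"a", "an", "and", "are", "for", "is", "of", "the", "this", "to", "we"}

def naive_lemmatize(word):
    """Toy lemmatizer: strip a plural 's' (stand-in for WordNetLemmatizer)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def preprocess(text):
    words = re.findall(r"[a-z]+", text.lower())        # lowercase + tokenize
    words = [w for w in words if w not in STOP_WORDS]  # remove stop words
    return " ".join(naive_lemmatize(w) for w in words) # reduce to root form

sample = ("Hello! This is an example of text preprocessing. "
          "We are going to remove stop words and lemmatize words.")
print(preprocess(sample))
# hello example text preprocessing going remove stop word lemmatize word
```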
Named Entity Recognition
• Named entity recognition (NER) is a task in natural language
processing (NLP) that involves identifying and extracting named
entities such as persons, organizations, locations, and other types of
entities from unstructured text.
• NLTK (Natural Language Toolkit) is a popular Python library for NLP
that provides various tools and modules for working with text data.
• NLTK provides several built-in algorithms and datasets for named
entity recognition.
Python code for Named Entity Recognition
• Here's a simple example of how to perform NER using NLTK:
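The example code is not reproduced in this export. A sketch of the usual NLTK pipeline (tokenize, POS-tag, then `ne_chunk`) is below; the `extract_entities()` helper is an addition here to flatten the chunk tree, and the commented `nltk.download` lines name the standard one-time data packages:

```python
def extract_entities(chunked):
    """Flatten an NLTK chunk tree into (entity text, entity type) pairs."""
    entities = []
    for node in chunked:
        if hasattr(node, "label"):  # subtrees mark named-entity chunks
            text = " ".join(token for token, tag in node)
            entities.append((text, node.label()))
    return entities

def ner(text):
    """Tokenize, POS-tag, and chunk raw text with NLTK's built-in NER."""
    import nltk  # one-time setup: nltk.download('punkt'),
    # nltk.download('averaged_perceptron_tagger'),
    # nltk.download('maxent_ne_chunker'), nltk.download('words')
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    return extract_entities(nltk.ne_chunk(tagged))

# Example: ner("Barack Obama visited Paris.")
# might yield [('Barack Obama', 'PERSON'), ('Paris', 'GPE')]
```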
Transform named entities recognized by NER (Named Entity Recognition) into CSV format
1. Extract the named entities from your NER output: Depending on the
NER library you are using, you might have the named entities in different
formats. In general, you need to extract the entity text, the entity type,
and the entity position (start and end index) in the original text.
2. Create a CSV file: Create a new CSV file using a spreadsheet application
like Microsoft Excel or Google Sheets.
3. Define the columns: Define the columns in the CSV file. In general, at least four columns are defined: Entity Text, Entity Type, Start Index, and End Index.
4. Add the named entities to the CSV file: Add each named entity to a
new row in the CSV file, with the entity text, entity type, start index, and
end index in the corresponding columns.
Python Code
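The listing under this heading is not included in the export. A sketch using Python's standard `csv` module, with an illustrative list of entity tuples standing in for real NER output, could be:

```python
import csv

# Hypothetical NER output: (entity text, entity type, start index, end index)
entities = [
    ("Barack Obama", "PERSON", 0, 12),
    ("Paris", "GPE", 21, 26),
]

# Write one header row, then one row per named entity.
with open("entities.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Entity Text", "Entity Type", "Start Index", "End Index"])
    writer.writerows(entities)
```

The resulting file opens directly in Excel or Google Sheets, so steps 2-4 above can also be done entirely programmatically.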
Other Text Mining Libraries/Tools
1. Texthero, to prepare a text-based dataset for your NLP project
• https://www.analyticsvidhya.com/blog/2020/08/how-to-use-texthero-to-prepare-a-text-based-dataset-for-your-nlp-project/
2. Hugging Face: an American company that develops tools for building applications using machine learning. It is most notable for its Transformers library, built for natural language processing applications, and for its platform that lets users share machine learning models and datasets and build chatbots.
• https://huggingface.co/learn/nlp-course/chapter1/1
Other Text Mining Libraries/Tools
3. Create a dataset for natural language processing, or define your own dataset, in IBM Spectrum Conductor Deep Learning Impact 1.2.
• https://www.ibm.com/docs/en/scdli/1.2.0?topic=dataset-any
4. Google Cloud's Vertex AI combines data engineering, data science, and ML engineering workflows, enabling your teams to collaborate using a common toolset. Vertex AI provides several options for model training: AutoML lets you train tabular, image, text, or video data without writing code or preparing data splits.
• https://cloud.google.com/vertex-ai/docs/start/introduction-unified-platform
• https://cloud.google.com/vertex-ai/docs/tutorials/text-classification-automl/dataset
5. Google Developers Machine Learning
• https://developers.google.com/machine-learning
