Text Mining and Dataset Creation in Python

The document provides an overview of text mining, detailing the process of transforming unstructured text into structured data through steps such as data source identification, cleaning, preprocessing, feature extraction, labeling, and data integration. It also introduces Python libraries like NLTK for text processing and named entity recognition, along with practical coding examples. Additionally, it lists various tools and libraries for text mining, including Texthero and Hugging Face, to aid in natural language processing projects.

Text Data Mining using Python
MSCS F22, Spring 2023
Instructor: Dr. Umara Zahid
What is Text Mining?
• Text mining is the process of transforming unstructured text into a structured format in order to identify meaningful patterns and new insights.
• It is a useful approach for converting unstructured textual data into structured data that can be analyzed and used for many purposes.
Steps to create a structured dataset from textual data
• Here are the main steps to create a structured dataset from textual data using text-mining approaches:
1. Identify the data source: The first step is to identify the source of the textual data you want to mine. This could be anything from social media posts and customer reviews to news articles and scientific papers.
Live Demo of Data Sources in Class
Link for Text Datasets
https://paperswithcode.com/datasets?task=text-generation
2. Data cleaning: Once you have identified the data source, the next step is to clean the data. This involves removing irrelevant or redundant information such as HTML tags (if the source is a webpage), special characters, and punctuation. This helps ensure that the text is in a standardized format and can be easily processed by text-mining algorithms.
Python code is provided in the later slides.
Steps to create a structured dataset from textual data
3. Text preprocessing: After cleaning the data, you need to preprocess the text to make it suitable for text mining. This involves tokenizing the text into individual words or phrases, removing stop words (such as "the" and "and"), stemming or lemmatizing words to their root form, and converting the text to lowercase.
4. Feature extraction: Once you have preprocessed the text, you can extract relevant features from it. This could include extracting entities such as names, locations, and organizations, identifying topics and themes, or extracting sentiment or emotion.
5. Labeling: If your dataset requires labeling, such as for training a machine learning algorithm, you will need to manually label a portion of your data. This could involve categorizing the text into different classes or assigning a sentiment score.
6. Data integration: Finally, you can integrate the structured data into a database, spreadsheet, or CSV file for further analysis.
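The six steps can be sketched end-to-end in miniature. In this sketch the review texts, the tiny stopword list, and the keyword-based labeling rule are all illustrative stand-ins (real projects would use a stopword corpus and manual annotation):

```python
import csv
import io
import re

# Step 1: identified data source -- a few illustrative customer reviews.
reviews = [
    "Great phone, love it!!",
    "Terrible battery... very bad.",
    "Okay value for the price.",
]

STOP_WORDS = {"for", "it", "the", "very"}  # tiny stand-in stopword list

def clean_and_preprocess(text):
    # Steps 2-3: lowercase, strip non-letters, then drop stop words.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def label_sentiment(text):
    # Step 5: toy keyword rule standing in for manual annotation.
    return "positive" if any(w in text for w in ("great", "love", "okay")) else "negative"

# Steps 4 and 6: extract simple features and integrate rows into CSV.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["text", "num_tokens", "label"])
for raw in reviews:
    cleaned = clean_and_preprocess(raw)
    writer.writerow([cleaned, len(cleaned.split()), label_sentiment(cleaned)])

print(buf.getvalue())
```

The result is a small structured table (text, token count, label) ready to load into a spreadsheet or database, which is exactly what step 6 describes.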
Python code for Data Cleaning
• The Natural Language Toolkit (NLTK) is a popular Python library for working with text data.
• This code defines a `clean_text()` function that takes a string of text as input and performs the following operations:
1. Converts the text to lowercase.
2. Removes any non-alphanumeric characters (i.e., anything that is not a letter or number).
3. Removes any extra whitespace (i.e., multiple spaces in a row).
4. Removes any leading or trailing whitespace.
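The `clean_text()` listing itself did not survive this export; a minimal sketch matching the four operations described above, using Python's built-in `re` module, might look like this:

```python
import re

def clean_text(text):
    """Normalize a raw text string for downstream text mining."""
    text = text.lower()                        # 1. convert to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # 2. drop non-alphanumeric characters
    text = re.sub(r"\s+", " ", text)           # 3. collapse runs of whitespace
    return text.strip()                        # 4. trim leading/trailing whitespace

print(clean_text("  Hello, World!!  This   is MESSY text... "))
# hello world this is messy text
```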
Python code for Data Cleaning
• In this code, a `remove_html_tags()` function takes a string of HTML text as input and returns the same text with the HTML markup removed.
• The function uses a regular expression tokenizer from NLTK to tokenize the text and filter out the characters that make up HTML tags.
• The regular expression `\w+` matches any sequence of one or more word characters (letters, digits, or underscores), so angle brackets and other markup characters are discarded. Note that bare tag names can still survive as word tokens, which is one reason a dedicated parser is preferable.
• Punctuation is also removed as a side effect.
• Use BeautifulSoup for robust removal of complete HTML markup.
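The slide's listing is not reproduced in this export. A sketch of the idea follows, using `re.findall(r"\w+", ...)` in place of NLTK's `RegexpTokenizer(r'\w+')` (the two produce the same tokens); an explicit tag-stripping pass is added first so that bare tag names such as `p` or `b` do not survive as tokens:

```python
import re

def remove_html_tags(html_text):
    """Strip HTML tags, then keep only word tokens (punctuation removed)."""
    no_tags = re.sub(r"<[^>]+>", " ", html_text)  # drop <...> markup first
    tokens = re.findall(r"\w+", no_tags)          # \w+ keeps words, drops punctuation
    return " ".join(tokens)

print(remove_html_tags("<p>Hello, <b>world</b>!</p>"))
# Hello world
```

For messy real-world pages, `BeautifulSoup(html).get_text()` is the more robust choice, as the slide notes.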
Text Preprocessing code

• Output:
hello example text preprocessing going remove stop word lemmatize word
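The preprocessing listing is missing from this export. Below is a simplified sketch in which a hand-rolled stopword set and a naive plural-stripping lemmatizer stand in for NLTK's English stopwords corpus and `WordNetLemmatizer` (which require `nltk.download` data); the input sentence is a reconstruction chosen so the result matches the slide's output:

```python
import re

# Stand-in stopword list (the slides use NLTK's English stopwords corpus).
STOP_WORDS = {"a", "an", "and", "are", "for", "is", "of", "the", "this", "to", "we"}

def naive_lemmatize(word):
    """Toy lemmatizer: strip a plural 's' (stand-in for WordNetLemmatizer)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def preprocess(text):
    words = re.findall(r"[a-z]+", text.lower())        # lowercase + tokenize
    words = [w for w in words if w not in STOP_WORDS]  # remove stop words
    return " ".join(naive_lemmatize(w) for w in words) # reduce to root form

sample = ("Hello! This is an example of text preprocessing. "
          "We are going to remove stop words and lemmatize words.")
print(preprocess(sample))
# hello example text preprocessing going remove stop word lemmatize word
```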
Named Entity Recognition
• Named entity recognition (NER) is a task in natural language
processing (NLP) that involves identifying and extracting named
entities such as persons, organizations, locations, and other types of
entities from unstructured text.
• NLTK (Natural Language Toolkit) is a popular Python library for NLP
that provides various tools and modules for working with text data.
• NLTK provides several built-in algorithms and datasets for named
entity recognition.
Python code for Named Entity Recognition
• Here's a simple example of how to perform NER using NLTK:
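The example code is not reproduced in this export. A sketch of the usual NLTK pipeline (tokenize, POS-tag, then `ne_chunk`) is below; the `extract_entities()` helper is an addition here to flatten the chunk tree, and the commented `nltk.download` lines name the standard one-time data packages:

```python
def extract_entities(chunked):
    """Flatten an NLTK chunk tree into (entity text, entity type) pairs."""
    entities = []
    for node in chunked:
        if hasattr(node, "label"):  # subtrees mark named-entity chunks
            text = " ".join(token for token, tag in node)
            entities.append((text, node.label()))
    return entities

def ner(text):
    """Tokenize, POS-tag, and chunk raw text with NLTK's built-in NER."""
    import nltk  # one-time setup: nltk.download('punkt'),
    # nltk.download('averaged_perceptron_tagger'),
    # nltk.download('maxent_ne_chunker'), nltk.download('words')
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    return extract_entities(nltk.ne_chunk(tagged))

# Example: ner("Barack Obama visited Paris.")
# might yield [('Barack Obama', 'PERSON'), ('Paris', 'GPE')]
```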
Transform named entities recognized by NER (Named Entity Recognition) into CSV format
1. Extract the named entities from your NER output: Depending on the
NER library you are using, you might have the named entities in different
formats. In general, you need to extract the entity text, the entity type,
and the entity position (start and end index) in the original text.
2. Create a CSV file: Create a new CSV file using a spreadsheet application
like Microsoft Excel or Google Sheets.
3. Define the columns: Define the columns in the CSV file. In general, at least four columns are defined: Entity Text, Entity Type, Start Index, and End Index.
4. Add the named entities to the CSV file: Add each named entity to a
new row in the CSV file, with the entity text, entity type, start index, and
end index in the corresponding columns.
Python Code
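The listing under this heading is not included in the export. A sketch using Python's standard `csv` module, with an illustrative list of entity tuples standing in for real NER output, could be:

```python
import csv

# Hypothetical NER output: (entity text, entity type, start index, end index)
entities = [
    ("Barack Obama", "PERSON", 0, 12),
    ("Paris", "GPE", 21, 26),
]

# Write one header row, then one row per named entity.
with open("entities.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Entity Text", "Entity Type", "Start Index", "End Index"])
    writer.writerows(entities)
```

The resulting file opens directly in Excel or Google Sheets, so steps 2-4 above can also be done entirely programmatically.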
Other Text Mining Libraries/Tools
1. Texthero, to prepare a text-based dataset for your NLP project
• https://www.analyticsvidhya.com/blog/2020/08/how-to-use-texthero-to-prepare-a-text-based-dataset-for-your-nlp-project/
2. Hugging Face: an American company that develops tools for building applications using machine learning. It is most notable for its Transformers library, built for natural language processing applications, and for its platform that lets users share machine learning models and datasets and build chatbots.
• https://huggingface.co/learn/nlp-course/chapter1/1
Other Text Mining Libraries/Tools
3. Create a dataset for natural language processing, or define your own dataset, in IBM Spectrum Conductor Deep Learning Impact 1.2.
• https://www.ibm.com/docs/en/scdli/1.2.0?topic=dataset-any
4. Google Cloud's Vertex AI combines data engineering, data science, and ML engineering workflows, enabling your teams to collaborate using a common toolset. Vertex AI provides several options for model training: AutoML lets you train tabular, image, text, or video data without writing code or preparing data splits.
• https://cloud.google.com/vertex-ai/docs/start/introduction-unified-platform
• https://cloud.google.com/vertex-ai/docs/tutorials/text-classification-automl/dataset
5. Google Developers Machine Learning
• https://developers.google.com/machine-learning
