Text Mining and Dataset Creation in Python
MSCS F22, Spring 2023
Instructor: Dr. Umara Zahid
What is Text Mining?
• Text Mining is the process of transforming unstructured text into a
structured format to identify meaningful patterns and new insights.
• Output:
hello example text preprocessing going
remove stop word lemmatize word
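The code that produced this output is not included in the slide text. The following is only a minimal sketch of such a preprocessing pipeline (lowercasing, tokenizing, stop-word removal, lemmatization) using the Python standard library. The sample sentence, the small `STOP_WORDS` set, and the plural-stripping `naive_lemma` helper are all assumptions made for illustration; in practice NLTK's stop-word corpus and `WordNetLemmatizer` would normally fill these roles.

```python
import re

# Hypothetical mini stop-word list (an assumption; NLTK ships a much fuller one).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "this", "to", "we"}


def naive_lemma(token):
    """Crude stand-in for a lemmatizer: strip a plural 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words, then lemmatize."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(naive_lemma(t) for t in kept)


# Assumed input sentence, chosen so the result matches the output shown above.
sample = ("Hello, this is an example of text preprocessing. "
          "We are going to remove stop words and lemmatize the words.")
print(preprocess(sample))
```

The unstructured sentence becomes a normalized token sequence, which is the kind of structured form that later steps such as counting or classification can work on.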
Named Entity Recognition
• Named entity recognition (NER) is a task in natural language
processing (NLP) that involves identifying and extracting named
entities such as persons, organizations, locations, and other types of
entities from unstructured text.
• NLTK (Natural Language Toolkit) is a popular Python library for NLP
that provides various tools and modules for working with text data.
• NLTK provides several built-in algorithms and datasets for named
entity recognition.
Python Code for Named Entity Recognition
• Here's a simple example of how to perform NER using NLTK:
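The original code slide did not survive in this text, so what follows is a hedged sketch rather than the instructor's example. It assumes NLTK is installed and that the tokenizer, POS tagger, and NE chunker data have already been fetched with `nltk.download` (the exact resource names differ between NLTK releases). The helper names `entities_from_tree` and `recognize_entities` are hypothetical.

```python
def entities_from_tree(tree):
    """Collect (entity_text, entity_type) pairs from a chunk tree.

    Named-entity chunks are subtrees that carry a label such as PERSON or GPE;
    ordinary tokens are plain (word, pos_tag) tuples and are skipped.
    """
    entities = []
    for node in tree:
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


def recognize_entities(sentence):
    """Tokenize, POS-tag, and chunk a sentence with NLTK, then list entities."""
    import nltk  # imported here so the helper above works without NLTK

    tokens = nltk.word_tokenize(sentence)   # split the sentence into words
    tagged = nltk.pos_tag(tokens)           # attach part-of-speech tags
    tree = nltk.ne_chunk(tagged)            # group tokens into named entities
    return entities_from_tree(tree)


# Example call (needs NLTK and its data packages, so it is left commented out):
# print(recognize_entities("Barack Obama visited Paris."))
```

The result is a list of (text, type) pairs; the labels assigned depend on the NLTK model, so no specific output is claimed here.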
Transform Named Entities Recognized by NER (Named Entity Recognition) into CSV Format
1. Extract the named entities from your NER output: Depending on the
NER library you are using, you might have the named entities in different
formats. In general, you need to extract the entity text, the entity type,
and the entity position (start and end index) in the original text.
2. Create a CSV file: Create a new CSV file using a spreadsheet application
like Microsoft Excel or Google Sheets.
3. Define the columns: Define the columns in the CSV file. In general, define
at least four columns: Entity Text, Entity Type, Start Index, and End Index.
4. Add the named entities to the CSV file: Add each named entity to a
new row in the CSV file, with the entity text, entity type, start index, and
end index in the corresponding columns.
Python Code
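The code for this slide is missing from the text. As a sketch only, the steps above could be carried out with Python's standard `csv` module instead of a spreadsheet. The example text, the entity list, and the helper names `locate_entities` and `write_entities_csv` are assumptions; the end index is treated as exclusive, and offsets are found by plain string search, which only works when the entity text appears verbatim in the source.

```python
import csv
import io

FIELDNAMES = ["Entity Text", "Entity Type", "Start Index", "End Index"]


def locate_entities(text, entities):
    """Attach character offsets to (entity_text, entity_type) pairs."""
    rows = []
    cursor = 0  # search forward so repeated entities get distinct offsets
    for entity_text, entity_type in entities:
        start = text.find(entity_text, cursor)
        if start == -1:
            continue  # entity text not found verbatim; skip it
        end = start + len(entity_text)  # exclusive end index (an assumption)
        rows.append({
            "Entity Text": entity_text,
            "Entity Type": entity_type,
            "Start Index": start,
            "End Index": end,
        })
        cursor = end
    return rows


def write_entities_csv(rows, handle):
    """Write a header row followed by one row per named entity."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Assumed sample data standing in for real NER output.
text = "Barack Obama visited Paris."
entities = [("Barack Obama", "PERSON"), ("Paris", "GPE")]

rows = locate_entities(text, entities)
buffer = io.StringIO()  # in-memory file; swap for a real file to save to disk
write_entities_csv(rows, buffer)
print(buffer.getvalue())
```

To save to disk, pass a file opened with `open("entities.csv", "w", newline="")` (the filename is hypothetical) in place of the `StringIO` buffer.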
Other Text Mining Libraries/ Tools
1. Texthero to Prepare a Text-based Dataset for Your NLP Project
• https://www.analyticsvidhya.com/blog/2020/08/how-to-use-texthero-to-prepare-a-text-based-dataset-for-your-nlp-project/
2. Hugging Face: An American company that develops tools for building
applications using machine learning. It is best known for its
transformers library, built for natural language processing
applications, and for its platform, which lets users share machine
learning models and datasets. It also builds chatbots.
• https://huggingface.co/learn/nlp-course/chapter1/1
3. Create a dataset for natural language processing or define your own dataset
in IBM Spectrum Conductor Deep Learning Impact 1.2.
• https://www.ibm.com/docs/en/scdli/1.2.0?topic=dataset-any
4. Google Cloud’s Vertex AI combines data engineering, data science, and ML
engineering workflows, enabling your teams to collaborate using a common
toolset. Vertex AI provides several options for model training: AutoML lets
you train models on tabular, image, text, or video data without writing code
or preparing data splits.
• https://cloud.google.com/vertex-ai/docs/start/introduction-unified-platform
• https://cloud.google.com/vertex-ai/docs/tutorials/text-classification-automl/dataset
5. Google Developers Machine Learning
• https://developers.google.com/machine-learning