Unit 3 AI-ML Driven Data Science and Automation
Natural language processing (NLP) is a field of computer science and a subfield of artificial
intelligence that aims to make computers understand human language.
NLP uses computational linguistics, which is the study of how language works, and various
models based on statistics, machine learning, and deep learning.
These technologies allow computers to analyze and process text or voice data, and to grasp their
full meaning, including the speaker’s or writer’s intentions and emotions.
NLP powers many applications that use language, such as text translation, voice recognition, text
summarization, and chatbots. You may have used some of these applications yourself, such as
voice-operated GPS systems, digital assistants, speech-to-text software, and customer service
bots.
NLP also helps businesses improve their efficiency, productivity, and performance by
simplifying complex tasks that involve language.
1. Data Acquisition
• Data Collection: Gathering text data from various sources such as websites, books, social media, or proprietary databases.
• Data Storage: Storing the collected text data in a structured format, such as a database or
a collection of documents.
2. Text Preprocessing
Preprocessing is crucial to clean and prepare the raw text data for analysis. Common preprocessing steps include:
• Tokenization: Splitting text into individual words or tokens.
• Lowercasing: Converting all text to lowercase for consistency.
• Stop-word Removal: Removing common words (e.g., "and", "the") that carry little meaning.
• Stemming and Lemmatization: Reducing words to their base or root forms. Stemming cuts off suffixes, while lemmatization considers the context and converts words to their meaningful base form.
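For illustration, here is a minimal sketch of the stemming/lemmatization difference using NLTK's PorterStemmer and WordNetLemmatizer; the example words and the verb part-of-speech hint are arbitrary choices, not taken from this unit.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')   # lexical database used by the lemmatizer
nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "leaves"]:
    # Stemming chops suffixes; lemmatization maps each word to a dictionary base form
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))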
3. Text Representation
• Bag of Words (BoW): Representing text as a collection of words, ignoring grammar and
word order but keeping track of word frequency.
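As a small sketch of the idea, scikit-learn's CountVectorizer builds such a Bag-of-Words matrix; the two example sentences below are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the product is amazing", "the delivery was slow"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)         # documents x vocabulary count matrix

print(vectorizer.get_feature_names_out())    # learned vocabulary (word order is ignored)
print(bow.toarray())                         # word counts per document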
4. Feature Extraction
Extracting meaningful features from the text data that can be used for various NLP tasks.
• N-grams: Capturing sequences of N words to preserve some context and word order.
• Syntactic Features: Using parts of speech tags, syntactic dependencies, and parse trees.
5. Model Selection and Training
Selecting and training a machine learning or deep learning model to perform specific NLP tasks.
• Supervised Learning: Using labeled data to train models like Support Vector Machines
(SVM), Random Forests, or deep learning models like Convolutional Neural Networks
(CNNs) and Recurrent Neural Networks (RNNs).
6. Model Deployment and Inference
Deploying the trained model and using it to make predictions or extract insights from new text data.
• Text Classification: Categorizing text into predefined classes (e.g., spam detection,
sentiment analysis).
• Named Entity Recognition (NER): Identifying and classifying entities in the text.
7. Evaluation
Evaluating the performance of the NLP algorithm using metrics such as accuracy, precision, recall, F1-score, and others.
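To make the training, prediction, and evaluation steps above concrete, here is a hedged sketch of a supervised text-classification pipeline with scikit-learn (TF-IDF features plus a linear SVM); the tiny labelled dataset is invented purely for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Tiny labelled dataset, invented for illustration (1 = spam, 0 = not spam)
texts = ["win a free prize now", "meeting at 10 am tomorrow",
         "claim your free reward", "project report attached"]
labels = [1, 0, 1, 0]

# Train: TF-IDF features + linear SVM
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

# Inference on new text
print(model.predict(["free prize waiting for you"]))

# Evaluation (here on the training data, only to show the metrics)
print(classification_report(labels, model.predict(texts)))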
Applications of NLP
• Spam Filters: One of the most irritating things about email is spam. Gmail uses natural language processing (NLP) to discern which emails are legitimate and which are spam. These spam filters look at the text in all the emails you receive and try to figure out what it means to decide whether an email is spam.
• Question Answering: NLP can be seen in action in Google Search or Siri. A major use of NLP is to make search engines understand the meaning of what we are asking and to generate natural language answers in return.
Text mining
Text mining is a component of data mining that deals specifically with unstructured text data. It
involves the use of natural language processing (NLP) techniques to extract useful information
and insights from large amounts of unstructured text data. Text mining can be used as a
preprocessing step for data mining or as a standalone process for specific tasks.
Text Mining is the process of extracting meaningful information, patterns, and insights from
unstructured text data using NLP techniques and statistical methods. It involves converting large
volumes of textual data into structured information for analysis and decision-making.
1. Data Collection:
Gathering text data from emails, articles, social media, or customer reviews.
2. Text Preprocessing:
Cleaning the text through tokenization, lowercasing, stop-word removal, and stemming or lemmatization.
3. Feature Extraction:
Converting the cleaned text into numerical features (e.g., Bag of Words or TF-IDF).
4. Analysis:
Applying NLP and machine learning techniques such as:
Classification: Categorizing text into predefined classes (e.g., spam vs. non-spam).
Named Entity Recognition (NER): Identifying entities like names, dates, locations.
Applications of Text Mining:
2. Spam Detection: Classifying emails as spam or not spam based on text patterns.
3. Topic Modeling: Automatically identifying themes or topics in large document sets using LDA (Latent Dirichlet Allocation); a small sketch follows after this list.
4. Information Retrieval: Search engines use text mining to retrieve relevant documents.
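As a rough sketch of topic modeling (point 3 above), scikit-learn's LatentDirichletAllocation can be run on a Bag-of-Words matrix; the documents and the choice of two topics are assumptions made only for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the team won the football match",
        "the election results were announced",
        "the striker scored two goals",
        "the government passed a new law"]

bow = CountVectorizer(stop_words="english")
X = bow.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # assume 2 topics
lda.fit(X)

# Show the top words for each discovered topic
words = bow.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-4:]]
    print(f"Topic {idx}: {top}")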
Let’s analyze customer reviews to determine whether they are positive or negative.
✅ Sample Dataset:
Text Preprocessing:
Lowercasing: "The product is amazing. I love it!" → "the product is amazing. i love it!"
Tokenization: each review is split into individual word tokens.
Lemmatization: tokens are reduced to their base forms.
Bag-of-Words Representation (example vector): Review 2 → [0, 0, 0, 1, 1]
TF-IDF Representation: each token is weighted by how frequent it is within a review and how rare it is across all reviews.
Output:
Review 1 → Positive
Review 2 → Negative
Review 3 → Positive
Review 4 → Negative
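Since the sample dataset above did not survive intact, here is a hedged end-to-end sketch of the same idea with invented reviews: TF-IDF vectors are built from the text and a Naive Bayes classifier predicts Positive or Negative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented reviews and labels for illustration (1 = Positive, 0 = Negative)
reviews = ["the product is amazing. i love it!",
           "terrible quality, very disappointed",
           "great value and fast delivery",
           "worst purchase i have ever made"]
labels = [1, 0, 1, 0]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(reviews)             # TF-IDF representation of each review

clf = MultinomialNB()
clf.fit(X, labels)

new_review = ["i love this, amazing quality"]
pred = clf.predict(tfidf.transform(new_review))
print("Positive" if pred[0] == 1 else "Negative")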
Sentiment analysis
Sentiment analysis is the process of classifying whether a block of text is positive, negative, or neutral. The goal of sentiment mining is to analyse people's opinions in a way that can help businesses grow. It focuses not only on polarity (positive, negative, and neutral) but also on emotions (happy, sad, angry, etc.). It uses various Natural Language Processing approaches, such as rule-based, automatic (machine learning), and hybrid methods.
Sentiment Analysis (also known as opinion mining) is a Natural Language Processing (NLP)
technique used to determine the emotional tone behind a body of text. It identifies whether the
sentiment expressed in the text is positive, negative, neutral, or even more complex emotions
like joy, anger, sadness, etc.
Types of Sentiment Analysis
Fine-Grained Sentiment Analysis
This depends on polarity. The categories are very positive, positive, neutral, negative, and very negative, often mapped to a rating scale of 1 to 5: a rating of 5 is very positive, 4 positive, 3 neutral, 2 negative, and 1 very negative.
Emotion detection
Sentiments such as happy, sad, angry, upset, jolly, and pleasant come under emotion detection. It is commonly implemented with lexicon-based methods (word lists mapped to emotions) or with machine learning classifiers.
Aspect-Based Sentiment Analysis
It focuses on a particular aspect of a product or service. For instance, if a person wants to evaluate a cell phone, aspect-based analysis examines sentiment toward individual aspects such as the battery, screen, and camera quality.
Multilingual Sentiment Analysis
Multilingual sentiment analysis classifies text written in different languages as positive, negative, or neutral. This is highly challenging and comparatively difficult.
The goal is to identify whether the expressed sentiment is positive, negative, or neutral. Let's understand the overall process in two general steps:
Preprocessing
It starts with collecting the text data that needs to be analysed for sentiment, such as customer reviews, social media posts, news articles, or any other form of textual content. The collected text is then pre-processed to clean and standardize the data through various tasks:
• Removing stop words (common words like “and,” “the,” etc. that don’t contribute much
to sentiment).
• Stemming or Lemmatization: Reducing words to their root form.
Analysis
Text is converted for analysis using techniques like bag-of-words or word embeddings (e.g., Word2Vec, GloVe); a small Word2Vec sketch follows below. Models are then trained with labeled datasets, associating text with sentiments (positive, negative, or neutral).
After training and validation, the model predicts sentiment on new data, assigning labels based
on learned patterns.
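A minimal Word2Vec sketch with gensim, assuming gensim is installed; the tokenized sentences are invented for illustration.

from gensim.models import Word2Vec

# Tokenized example sentences (invented)
sentences = [["the", "product", "is", "amazing"],
             ["i", "love", "this", "product"],
             ["the", "service", "was", "terrible"]]

# Train a small embedding model: each word becomes a 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=42)

print(model.wv["product"][:5])                   # first few dimensions of the vector
print(model.wv.most_similar("product", topn=2))  # nearest words in embedding space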
✅ Sample Dataset:
✅ Text Preprocessing:
Lowercasing:
"The product is amazing, I absolutely love it!" → "the product is amazing, i absolutely love it!"
Tokenization: the review is split into tokens: ["the", "product", "is", "amazing", "i", "absolutely", "love", "it"]
Stop-word Removal: common words such as "the", "is", "i", and "it" are removed, leaving ["product", "amazing", "absolutely", "love"]
✅ Output:
3. "The service was okay, not great but not terrible." → Neutral
Applications of Sentiment Analysis
• Customer Feedback and Reviews: Analyzing product reviews to gauge customer satisfaction and to identify positive and negative feedback for improvement.
• Market Research and Trend Analysis: Identifying customer preferences and emerging trends.
• Healthcare and Patient Feedback: Analyzing patient reviews and feedback for better healthcare services.
Named Entity Recognition (NER)
Named Entity Recognition (NER) is an NLP technique that identifies and classifies named entities in text. Some key terms used in NER are:
• Named Entity: Any word or group of words that refer to a specific person, place, organization, or other object or concept.
• Corpus: A collection of texts used for language analysis and training of NER models.
• POS Tagging: A process that involves labeling words in a text with their corresponding
parts of speech, such as nouns, verbs, adjectives, etc.
• Chunking: A process that involves grouping words together into meaningful phrases
based on their part of speech and syntactic structure.
• Training and Testing Data: The process of training a model with a set of labeled data
(called the training data) and evaluating its performance on another set of labeled data
(called the testing data).
Now, let’s take a look at the various steps involved in the NER process:
• Tokenization: The first step in NER involves breaking down the input text into
individual words or tokens.
• POS Tagging: Next, we need to label each word in the text with its corresponding part of
speech.
• Chunking: After POS tagging, we can group the words together into meaningful phrases
using a process called chunking.
• Named Entity Recognition: Once we have identified the chunks, we can apply NER
techniques to identify and classify the named entities in the text.
• Evaluation: Finally, we can evaluate the performance of our NER model on a set of
testing data to determine its accuracy and effectiveness.
NER has numerous applications in NLP, including information extraction, sentiment analysis,
question-answering, recommendation systems, and more. Here are some common use cases of
NER:
• Information Extraction: NER can be used to extract relevant information from large
volumes of unstructured text, such as news articles, social media posts, and online
reviews. This information can be used to generate insights and make informed decisions.
• Sentiment Analysis: NER can be used to identify the sentiment expressed in a text
towards a particular named entity, such as a product or service. This information can be
used to improve customer satisfaction and identify areas for improvement.
• Question Answering: NER can be used to identify the relevant entities in a text that can
be used to answer a specific question. This is particularly useful for chatbots and virtual
assistants.
• Recommendation Systems: NER can be used to identify the interests and preferences of
users based on the entities mentioned in their search queries or online interactions. This
information can be used to provide personalized recommendations and improve user
engagement.
Advantages of NER
• Improved Accuracy: NER can improve the accuracy of NLP applications by identifying
and classifying named entities in a text more accurately and efficiently.
• Speed and Efficiency: NER can automate the process of identifying and classifying
named entities in a text, saving time and improving efficiency.
• Personalization: NER can be used to identify the interests and preferences of users
based on their interactions with a system, allowing for personalized recommendations
and improved user engagement.
Disadvantages of NER
• Ambiguity: NER can be challenging to apply in cases where there is ambiguity in the
meaning of a word or phrase. For example, the word “Apple” can refer to a fruit or a
technology company.
• Limited Scope: NER is limited to identifying and classifying named entities in a text and
cannot capture the full meaning of a text.
• Data Requirements: NER requires large volumes of labeled data for training, which can
be expensive and time-consuming to collect and annotate.
Necessary requirements:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

import nltk

text = "GeeksforGeeks is a recognized platform for online learning in India"
tokens = nltk.word_tokenize(text)        # split the text into word tokens
tagged = nltk.pos_tag(tokens)            # label each token with its part of speech
entities = nltk.chunk.ne_chunk(tagged)   # group tagged tokens into named-entity chunks
print(entities)
In this code, we first define the text to be analyzed and tokenize it into words using
nltk.word_tokenize(text). We then apply part-of-speech tagging to the tokens using
nltk.pos_tag(tokens). Finally, we apply named entity recognition to the tagged words using
nltk.chunk.ne_chunk(tagged).
For the sample text "GeeksforGeeks is a recognized platform for online learning in India", NLTK recognizes "GeeksforGeeks" as an organization and "India" as a geographic location.
AI in recommendation systems
Recommendation Systems powered by Artificial Intelligence (AI) suggest relevant items,
products, or content to users based on their preferences, behavior, and historical data. These
systems use machine learning, deep learning, and NLP to personalize recommendations,
enhancing user experience and increasing engagement.
1. Content-Based Filtering:
o Based on item features (e.g., genre, description, tags) and user preferences.
o Example: recommending fantasy movies to a user who has liked other fantasy movies.
2. Collaborative Filtering:
o Based on the preferences and behavior of similar users or similar items.
o User-based filtering: recommends items liked by users whose tastes are similar to the target user.
o Item-based filtering: recommends items that are similar to items the user has already rated or purchased.
o Example: "customers who bought this item also bought..." suggestions on e-commerce sites.
3. Hybrid Recommendation:
o Combines content-based and collaborative filtering.
o Improves accuracy and handles cold-start problems (lack of data for new users).
o Example: streaming platforms that mix "similar to what you watched" with "popular among users like you".
4. Knowledge-Based Recommendation:
o Based on explicit domain knowledge and user requirements rather than past behavior.
o Suitable for domains where preferences are based on specific needs (e.g., travel, healthcare).
o Example: a travel portal suggesting destinations that match a user's stated budget, dates, and interests.
Collaborative filtering
Collaborative Filtering makes recommendations based on the behavior and preferences of similar users. It assumes that users who agreed in the past will agree in the future, and that a user will like items that similar users have liked.
Example (user-item rating matrix):

             Movie A   Movie B   Movie C   Movie D
   User 1       5         4         ?         2
   User 2       4         ?         3         1
   User 3       ?         3         4         2

? → Missing rating (unwatched or unrated movie).
Find Similar Users or Items: based on the interaction matrix, the system finds similar users or items (for example, using cosine similarity).
Generate Recommendations: suggests items that similar users have liked but the target user hasn't interacted with yet.
• Example: People who bought laptops might also buy laptop bags, even if the system
doesn't explicitly know about the relationship.
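A small, hedged sketch of user-based collaborative filtering on the rating matrix above: missing ratings are treated as 0 for the similarity computation, cosine similarity finds the most similar user, and unrated items that this user has rated become candidates (a simplification of real systems).

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rating matrix from the example above (NaN = unrated)
ratings = pd.DataFrame(
    [[5, 4, np.nan, 2],
     [4, np.nan, 3, 1],
     [np.nan, 3, 4, 2]],
    index=["User 1", "User 2", "User 3"],
    columns=["Movie A", "Movie B", "Movie C", "Movie D"])

# User-user cosine similarity (missing ratings filled with 0 for simplicity)
sim = cosine_similarity(ratings.fillna(0))
sim_df = pd.DataFrame(sim, index=ratings.index, columns=ratings.index)
print(sim_df.round(2))

# Recommend for User 1: items User 1 hasn't rated but the most similar user has
target = "User 1"
most_similar = sim_df[target].drop(target).idxmax()
unseen = ratings.columns[ratings.loc[target].isna()]
print(f"Most similar to {target}: {most_similar}")
print("Candidate items:", [m for m in unseen if not np.isnan(ratings.loc[most_similar, m])])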
Content-based filtering
Content-Based Filtering recommends items based on their features and the user's past
preferences.
• Assumes that if a user liked an item, they will like similar items.
▪ Example item profile: Genre: Fantasy
3. Similarity Calculation: item feature vectors are compared (e.g., with cosine similarity) to find the items most similar to those the user has liked.
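A hedged sketch of content-based filtering: item descriptions (invented for illustration) are turned into TF-IDF vectors, and cosine similarity ranks the items closest to one the user liked.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented item catalogue with simple feature descriptions
items = {
    "Book A": "fantasy magic dragons adventure",
    "Book B": "science space exploration technology",
    "Book C": "fantasy wizards quest adventure",
    "Book D": "romance drama relationships",
}

names = list(items.keys())
tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(items.values())

# Similarity of every item to "Book A", which the user liked
sims = cosine_similarity(vectors[0], vectors).ravel()
ranked = sorted(zip(names[1:], sims[1:]), key=lambda x: -x[1])
print("Because you liked Book A:", ranked[0][0])  # expected: Book C (shared fantasy/adventure terms)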
✅ 1. Personalized Recommendations: AI-driven recommenders tailor suggestions to each user's preferences and behavior, increasing engagement.
Benefits of Automating Data Science Workflows
• Reduces the need for manual intervention, allowing data scientists to focus on high-level decision-making and model interpretation.
• Automates repetitive and time-consuming tasks, allowing data scientists to focus on more
complex problems.
4. Enhances Scalability:
5. Continuous Monitoring:
• Automated monitoring ensures model performance does not degrade over time.
Challenges of Automation in Data Science
1. Loss of Control:
• Automated processes may reduce human oversight, making it harder to detect subtle
issues.
2. Limited Customization:
• Pre-built automation tools may lack flexibility for complex use cases.
4. Overfitting Risks:
Applications of Automation Across Industries
2. Finance (Banking)
3. Healthcare
4. Marketing
5. Manufacturing
AutoML frameworks
Automated Machine Learning (AutoML) refers to the process of automating the end-to-end tasks of machine learning (ML) model development, including:
o Feature engineering.
o Model selection.
o Hyperparameter tuning.
Google AutoML
✅ Overview
Google AutoML is a cloud-based AutoML platform that allows users to build custom machine
learning models with minimal coding.
Part of Google Cloud AI Platform, it offers pre-trained models and tools to train custom
models using structured and unstructured data.
🔥 Key Features
End-to-End Automation:
Data ingestion, preprocessing, model training, and deployment are all automated.
Pre-trained Models:
Ready-to-use models (e.g., for vision, translation, and natural language) that can be fine-tuned on custom data.
Hyperparameter Tuning:
Automatically searches for well-performing model hyperparameters during training.
Model Deployment:
Deploy models via Google Cloud and serve predictions through REST APIs.
🔹 Use Cases
Pros:
• Highly scalable with cloud infrastructure
• Pre-trained models reduce training time
Cons:
• Requires Google Cloud Platform (GCP) subscription
• May lack flexibility for custom preprocessing
🔹 Example Workflow
Train the Model: Use AutoML to automatically select and train the best model.
🚀 2. H2O AutoML
✅ Overview
H2O AutoML is an open-source automated machine learning framework from H2O.ai.
It provides a Python and R interface with seamless integration into existing workflows.
🔥 Key Features
Automatic Model Selection:
Tests multiple models: XGBoost, Random Forest, Deep Learning, and GLM.
Hyperparameter Tuning:
Performs random grid search over model hyperparameters.
Stacked Ensembles:
Combines multiple models into a single ensemble for better accuracy.
Parallel Processing:
Distributed, in-memory computation across CPU cores and cluster nodes.
Model Interpretability:
Provides variable importance and other explainability outputs for trained models.
🔹 Use Cases
Financial Services:
Healthcare:
Retail:
Pros:
• Highly scalable with distributed processing
• Ensemble modeling improves accuracy
• Great for tabular data and regression tasks
Cons:
• Limited support for unstructured data
• May require large infrastructure for big data
• Lacks deep learning support compared to Google AutoML
🔹 Example Workflow
Install H2O.ai:
pip install h2o

Load Data:
import h2o
h2o.init()
data = h2o.import_file("data.csv")           # assumes a CSV file with a "target" column
train, test = data.split_frame(ratios=[0.8])

Run AutoML:
from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="target", training_frame=train)  # "target" is a placeholder column name

Evaluate Model:
perf = aml.leader.model_performance(test)
print(perf)
🚀 3. Auto-sklearn
✅ Overview
Auto-sklearn is an open-source AutoML library built on top of scikit-learn. It automatically selects algorithms and preprocessing steps and tunes their hyperparameters.
🔥 Key Features
Preprocessing Pipelines:
Automates encoding, scaling, and missing value imputation.
Hyperparameter Optimization:
Uses Bayesian optimization to search over models and pipeline hyperparameters.
Meta-learning:
Warm-starts the search using results from previously seen, similar datasets.
Built-in Ensembles:
Builds an ensemble from the best models found during the search.
🔹 Use Cases
Pros:
• Easy to integrate with scikit-learn
Cons:
• Only works with tabular data
🔹 Example Workflow
Install Auto-sklearn:
pip install auto-sklearn

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Run AutoML:
from autosklearn.classification import AutoSklearnClassifier
automl = AutoSklearnClassifier(time_left_for_this_task=300)  # 5-minute search budget
automl.fit(X_train, y_train)
Automated Feature Engineering Tools
These tools automatically create, transform, and select features from raw data, reducing manual feature engineering effort.
🚀 1. Featuretools
✅ Overview
Featuretools is an open-source Python library for automated feature engineering.
It creates new features by combining existing ones through aggregation and transformation.
Supports relational data and time-series data.
Uses the concept of Deep Feature Synthesis (DFS) to create multiple levels of features.
🔥 Key Features
Aggregation Primitives:
Automatically applies statistical operations like sum, mean, count, max, min.
Transformation Primitives:
Applies row-level transformations, such as extracting the month, year, or weekday from date columns.
Time-based Features:
Respects time indexes and cutoff times so that features are built only from data available at prediction time.
🔹 Use Cases
Finance:
Retail:
Healthcare:
Pros:
• Supports relational and time-series data
• Improves model accuracy with meaningful features
• Easily integrates with sklearn and pandas
Cons:
• Consumes significant memory for large datasets
• Requires domain knowledge to select relevant features
• Can be slow with large, complex datasets
🔹 Example Workflow
Install Featuretools:
pip install featuretools

Load Data:
import featuretools as ft
import pandas as pd

# Sample dataset (illustrative values; the original data was not preserved)
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "signup_date": pd.to_datetime(["2023-01-01", "2023-02-15"]),
})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "transaction_date": pd.to_datetime(["2023-03-01", "2023-03-05", "2023-03-07"]),
    "amount": [120.0, 55.5, 89.9],
})

Create Entity Set and Add Data:
# Note: entity_from_dataframe/Relationship follow the legacy Featuretools (<1.0) API used in this example
es = ft.EntitySet(id="customers")
es = es.entity_from_dataframe(entity_id="customers", dataframe=customers,
                              index="customer_id", time_index="signup_date")
es = es.entity_from_dataframe(entity_id="transactions", dataframe=transactions,
                              index="transaction_id", time_index="transaction_date")
relationship = ft.Relationship(es["customers"]["customer_id"],
                               es["transactions"]["customer_id"])
es = es.add_relationship(relationship)

Run Deep Feature Synthesis:
features, feature_defs = ft.dfs(entityset=es,
                                target_entity="customers",
                                agg_primitives=["sum", "mean", "count"],
                                trans_primitives=["month", "year"])
print(features.head())
🚀 2. Feature-engine
✅ Overview
Feature-engine is an open-source Python library of feature engineering transformers that follow the scikit-learn fit/transform interface.
It is built on pandas and NumPy, making it easy to integrate with existing pipelines.
It supports tasks such as:
Variable transformation.
Feature scaling.
🔥 Key Features
Feature Transformations:
Log, square root, power, and reciprocal transformations.
Encoding Techniques:
One-hot encoding, ordinal encoding, and mean encoding for categorical variables.
Feature Scaling:
Scaling of numerical variables, typically by wrapping scikit-learn scalers into Feature-engine pipelines.
Outlier Handling:
Capping or removing extreme values, for example with Winsorization.
🔹 Use Cases
Retail:
Automatically generate purchase frequency features.
Finance:
Healthcare:
Pros:
• Easy integration with sklearn pipelines
• Improves model accuracy with transformed features
Cons:
• Less powerful than Featuretools for multi-level features
• Limited support for relational data
🔹 Example Workflow
Install Feature-engine:
pip install feature_engine

Load Data:
import pandas as pd
from feature_engine.transformation import LogTransformer
from feature_engine.creation import CyclicalFeatures

# Sample dataset (illustrative values; the original data was not preserved)
data = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-15"]),
    "sales": [120.0, 80.0, 200.0],
})
data["day"] = data["date"].dt.day
data["month"] = data["date"].dt.month
data["year"] = data["date"].dt.year

Apply Transformations:
# Log-transform the (strictly positive) sales column
log_transformer = LogTransformer(variables=["sales"])
data = log_transformer.fit_transform(data)

# Encode cyclical calendar features (day, month) as sine/cosine components
cyclical = CyclicalFeatures(variables=["day", "month"])
data = cyclical.fit_transform(data)
print(data.head())
Workflow Automation and Orchestration Tools
Workflow orchestration tools enable the automation of complex ETL pipelines, ML workflows, and data engineering tasks.
Benefits: scheduled and repeatable pipelines, explicit task dependencies, parallel execution, and monitoring of task status.
🚀 1. Apache Airflow
✅ Overview
Apache Airflow is an open-source workflow automation and orchestration tool. Workflows are defined in Python as DAGs (Directed Acyclic Graphs) of tasks.
🔥 Key Features
Task Scheduling:
Runs tasks on defined schedules (e.g., cron expressions or presets such as @daily).
Parallel Execution:
Runs independent tasks concurrently using executors such as the Celery or Kubernetes executor.
🔹 Use Cases
ETL Pipelines:
Data Ingestion:
Architecture:
Scheduler:
Monitors DAGs and triggers task runs when their schedule and upstream dependencies are met.
Executor:
Determines how and where tasks run (e.g., LocalExecutor, CeleryExecutor, KubernetesExecutor).
Metastore (Database):
Stores DAG definitions, task states, and run history.
Worker Nodes:
Execute the individual tasks dispatched by the scheduler and executor.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("Extracting data...")

def transform():
    print("Transforming data...")

def load():
    print("Loading data...")

# Define DAG, create tasks, and define dependencies
with DAG("etl_pipeline", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
🚀 2. Prefect
✅ Overview
Prefect is an open-source workflow orchestration tool for building, running, and monitoring data pipelines in Python.
🔥 Key Features
Hybrid Execution:
Data Parameterization:
Built-in Caching:
Scalable Execution:
Real-time Monitoring:
Easy Integration:
🔹 Use Cases
Data Extraction and Transformation:
Pros:
• Simple and intuitive API
Cons:
• Requires a Prefect Cloud account for advanced features
from prefect import flow, task

@task
def extract():
print("Extracting data...")
return "data"
@task
def transform(data):
print(f"Transforming {data}...")
return "transformed_data"
@task
def load(data):
print(f"Loading {data}...")
@flow
def etl_pipeline():
data = extract()
transformed = transform(data)
load(transformed)
if __name__ == "__main__":
etl_pipeline()