Ds Sem
Characteristics of Big Data (the Five V's):
1. Volume:
Refers to the amount of data. Datasets too large to store and process with traditional tools are considered "big data."
2. Variety:
Refers to the different types and sources of data, which can include structured (like
databases) and unstructured data (like videos and social media posts).
3. Velocity:
Refers to the speed at which data is generated and processed. How quickly this happens
determines how useful the data can be.
4. Variability:
Refers to the inconsistencies in data. Sometimes, data can vary or be unpredictable,
making it harder to manage and use effectively.
5. Value:
Refers to the usefulness of the data. It highlights what organizations can do with the
collected data to create benefits or insights.
Challenges of Big Data and Data Science:
1. Storage:
Storing large datasets requires significant resources.
2. Formatting and Data Cleaning:
Advanced methods are often needed to format and clean data before it can be analyzed.
3. Quality Control:
Ensuring data quality can be challenging, especially when relying on small samples to
represent larger datasets.
4. Security and Privacy Concerns:
Protecting big data is often more complex than for smaller, traditional datasets.
5. Accuracy and Consistency:
Many big data methods are still new and not perfect. Improvements are ongoing.
6. Vague Definition of Data Science:
"Data science" is a broad term without a clear definition, which can create confusion.
7. Mastering Data Science Is Difficult:
It involves skills from many fields (statistics, computer science, mathematics), making it
nearly impossible to be an expert in all areas.
8. Domain Knowledge Required:
Data science depends on expertise in specific fields. Without domain knowledge,
understanding and interpreting results can be difficult.
9. Unpredictable Results:
Sometimes, the data does not provide useful or expected insights, limiting its
effectiveness for decision-making.
10. Privacy Issues:
Many industries struggle to keep customer data private, raising ethical concerns.
Current Landscape of Perspectives:
1. What is Data?
Data represents information about real-world activities or processes. The kind of data we
collect depends on how we gather it (our data collection or sampling method).
2. How is Data Used?
Once we collect the data, we use it to develop new ideas or insights. To make sense of the data, we simplify it by creating models or formulas, which are called statistical models or estimators (a simple example is given after this list).
3. Randomness and Uncertainty:
The process of collecting data often involves randomness and uncertainty, meaning the
results can vary.
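A minimal illustration (not in the original notes): if we observe n data values x1, x2, ..., xn, the sample mean m = (x1 + x2 + ... + xn) / n is a simple estimator of the population mean. Because the data are collected with some randomness, the value of m varies from one sample to the next, which is exactly the uncertainty described above.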
Sample:
1. What is a Sample?
A sample is a small part taken from a larger group (called a population) to study and
make conclusions about the whole group. It helps us understand the population without
having to study every individual in it.
2. How Do We Get a Sample?
There are different methods to collect samples, and these methods are called sampling
mechanisms. The way we choose the sample can affect the results of our study.
3. Be Careful About Bias:
Some sampling methods might introduce bias into the data, meaning they might favor
certain outcomes or give a distorted view of the population. If bias occurs, the
conclusions we make can be wrong or misleading.
Let’s say we want to study how many emails employees send. We could collect the data in two
different ways:
Sample 1: Choose 110 employees at random and count how many emails they send.
Sample 2: Choose 110 emails at random and check which employees sent them.
These two samples will give us different results because they focus on different things. The first
sample looks at employees, while the second looks at emails. Depending on the sample we
choose, our conclusion about how many emails are sent by employees can change.
This explanation shows how populations (the whole group) and samples (a small part of the
group) work, and why choosing the right sample is important to get accurate results.
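A minimal sketch (not from the original notes) of these two sampling mechanisms in Python, using a made-up population in which each employee sends a random number of emails; all numbers are illustrative:

    import random

    random.seed(0)

    # Hypothetical population: employee_counts[e] = number of emails employee e sent.
    employee_counts = {e: random.randint(1, 50) for e in range(1000)}
    # One record per email, tagged with its sender.
    all_emails = [(e, i) for e, n in employee_counts.items() for i in range(n)]

    # Sample 1: choose 110 employees at random and count how many emails they send.
    sample1 = random.sample(list(employee_counts), 110)
    avg1 = sum(employee_counts[e] for e in sample1) / len(sample1)

    # Sample 2: choose 110 emails at random and check which employees sent them.
    sample2 = random.sample(all_emails, 110)
    avg2 = sum(employee_counts[e] for e, _ in sample2) / len(sample2)

    # avg1 estimates the true mean emails per employee (about 25.5 here);
    # avg2 comes out higher, because busy senders are more likely to be sampled.
    print(avg1, avg2)

Sample 2 over-represents heavy senders: an employee who sent 50 emails is 50 times more likely to appear than one who sent a single email. This is exactly the kind of bias a sampling mechanism can introduce.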
Modeling:
1. What is modeling?
Modeling is about describing real-world situations mathematically, using data, in order to solve problems or answer questions.
2. What does modeling involve?
o It's a creative and technical process.
o It uses math, science, and technical knowledge to explain new situations.
3. Steps in the modeling process:
o Deciding on a strategy to create the model.
o Analyzing and understanding the problem.
o Picking variables and setting relationships between them.
o Using math and computational tools to solve the problem.
4. Examples:
o Architects use models to design buildings with blueprints and small 3D versions.
o Molecular biologists use 3D visuals to understand protein structures.
5. Important note:
A model simplifies reality by removing unnecessary details.
How to Build a Data Model?
Before starting, it's important to fully understand the problem you're trying to solve. A data
scientist should talk to experts in the field to understand the business challenge clearly. This
step ensures that you are solving the right problem.
Fitting the Model:
Fitting a model means adjusting its parameters so that its predictions match the observed data as closely as possible.
Steps:
1. Run the algorithm on data where the target variable is known, to produce an initial model.
2. Compare the model’s predictions with the actual observed values to measure accuracy.
3. Adjust the algorithm’s parameters to reduce error and make the model more accurate.
4. Repeat this process several times until the model reaches an optimal state to make
accurate predictions.
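A minimal sketch of this fit, compare, and adjust loop, using plain gradient descent to fit a simple straight-line model; the toy data, learning rate, and number of iterations are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(42)

    # Toy data where the target is known: y = 3*x + 5 plus some noise.
    x = rng.uniform(0, 10, size=200)
    y = 3 * x + 5 + rng.normal(0, 1, size=200)

    w, b = 0.0, 0.0        # parameters of the model y_hat = w*x + b
    learning_rate = 0.01

    for step in range(2000):
        y_hat = w * x + b                # 1. run the model on the data
        error = y_hat - y                # 2. compare predictions with observed values
        mse = np.mean(error ** 2)
        grad_w = 2 * np.mean(error * x)  # 3. adjust the parameters to reduce the error
        grad_b = 2 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b      # 4. repeat until the error stops improving

    print(f"fitted w={w:.2f}, b={b:.2f}, final MSE={mse:.3f}")

In practice, a library such as scikit-learn runs an equivalent optimization internally when you call its fit() method.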
Overfitting:
1. Happens when the model learns random noise or fluctuations in the training data as
patterns.
2. Overfitted models perform well on training data but poorly on new, unseen data (test
data).
3. This limits the model’s ability to generalize and make predictions on new data.
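A minimal sketch of overfitting, assuming scikit-learn is available: an overly flexible polynomial model fits the training data almost perfectly but does worse on held-out test data than a simpler model (the degrees, data, and noise level are arbitrary choices):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(80, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.3, size=80)  # noisy underlying pattern

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for degree in (3, 15):  # a reasonable model vs. an overly flexible one
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        train_err = mean_squared_error(y_train, model.predict(X_train))
        test_err = mean_squared_error(y_test, model.predict(X_test))
        print(f"degree {degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")

    # The high-degree model usually shows a lower training error but a higher
    # test error than the simpler model: it has memorized noise, i.e. overfitted.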
Unit 2
What is EDA?
EDA is a method of analyzing data using visual techniques, statistical summaries, and
graphical representations.
Purpose of EDA:
o To discover trends, patterns, and check assumptions in the data.
o To ensure the data is accurate and free of obvious errors before building models
or drawing conclusions.
o EDA is an essential step in any data science project.
The goal of EDA is to help data scientists deeply understand the dataset before modeling, using tools and techniques such as the following:
Basic Tools:
o Plots and Graphs: Visualize data using charts.
o Summary Statistics: Mean, median, minimum, maximum, and quartiles.
EDA Techniques:
o Plot distributions of variables (e.g., box plots).
o Analyze data trends over time (time-series plots).
o Transform variables if needed.
o Examine relationships between variables (scatterplot matrices).
o Identify outliers or unusual patterns in the data.
o EDA involves systematically exploring data to uncover insights and prepare it for
analysis.
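A small sketch of these basic tools and techniques with pandas and Matplotlib; the file name and column names are placeholders, not from the original notes:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("sales.csv", parse_dates=["date"])  # hypothetical dataset

    print(df.describe())    # summary statistics: mean, quartiles, min, max
    print(df.isna().sum())  # quick check for missing values

    df.boxplot(column="revenue")            # distribution of one variable, outliers visible
    plt.figure()
    df.set_index("date")["revenue"].plot()  # trend over time (time series)
    pd.plotting.scatter_matrix(df.select_dtypes("number"))  # pairwise relationships
    plt.show()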
EDA (Exploratory Data Analysis) is crucial in data science as it helps analyze and understand
data from all angles.
Businesses use EDA to make better decisions by studying large amounts of data.
It identifies the most impactful features, enabling meaningful and profitable decisions.
That’s why EDA is essential in the data science process.
1. Data Collection
o Data is generated in large amounts across industries like healthcare, sports, tourism,
etc.
o Businesses gather data through surveys, social media, customer feedback, and other
sources.
o Without enough relevant data, further analysis cannot begin.
2. Understanding Variables
o The first step in analysis is to examine the data to understand its features or variables.
o Identify key variables that affect outcomes and their potential impact.
o This helps extract valuable insights and is critical to achieving meaningful results.
3. Data Cleaning
o Remove unnecessary or irrelevant data, like null values and outliers, to improve data
quality.
o Cleaning ensures that only important and relevant information is kept.
o This reduces processing time and makes computations more efficient.
4. Finding Correlations
o Check how variables are related to each other using methods like a correlation matrix (a small code sketch follows after this list).
o Understanding these relationships helps reveal patterns and connections in the data.
5. Using the Right Statistical Methods
o Depending on the type of data (categorical or numerical), different statistical tools are
used.
o Statistical formulas provide useful results, but graphs and visualizations make them
easier to understand.
6. Visualization and Analysis
o After the analysis, review the results to spot trends and patterns in the data.
o Understanding correlations and trends helps in making better decisions.
o Analysts need good skills to provide insights for industries like retail, healthcare, or
agriculture.
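A minimal sketch of the cleaning and correlation steps above, using pandas; the file name, column name, and outlier rule are illustrative assumptions:

    import pandas as pd

    df = pd.read_csv("customers.csv")  # hypothetical raw data

    # Data cleaning: drop rows with null values, then remove obvious outliers
    # using the interquartile-range rule on one numeric column.
    df = df.dropna()
    q1, q3 = df["spend"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["spend"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Finding correlations: correlation matrix of the numeric variables.
    corr = df.select_dtypes("number").corr()
    print(corr.round(2))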
Tools Commonly Used for EDA:
1. Python
Python is widely used for EDA, with libraries such as pandas and NumPy for data manipulation and Matplotlib and Seaborn for visualization.
2. R
R is commonly used for statistical analysis and detailed EDA by data scientists and statisticians.
It’s an open-source language ideal for statistical computing and data visualization.
Popular libraries in R include:
o ggplot2, Leaflet, Lattice.
Automated EDA libraries in R:
o DataExplorer, SmartEDA, and GGally.
3. MATLAB
MATLAB is a numerical computing environment with built-in plotting and statistical functions, often used for EDA in engineering and scientific work.
EDA (Exploratory Data Analysis) is the process of examining datasets to find patterns,
relationships, and insights. Depending on how many variables (columns) we are analyzing, EDA
can be divided into three main types: Univariate, Bivariate, and Multivariate Analysis.
1. Univariate Analysis
Univariate analysis examines one variable at a time to describe its distribution.
Box Plots: Help detect outliers and show the spread of the data through its quartiles (the 25%, 50%, and 75% values).
Bar Charts: Used for categorical data to show how often each category appears.
Summary Statistics: Include mean, median, mode, variance, and standard deviation.
2. Bivariate Analysis
Bivariate analysis looks at the relationship between two variables. This helps in finding
connections, trends, and correlations.
Common methods used: scatter plots and correlation coefficients (such as Pearson's r).
3. Multivariate Analysis
Multivariate analysis involves analyzing more than two variables at once to understand complex
relationships. This is often used in advanced statistical modeling.
Common methods used:
Pair Plots: Help visualize the relationships between multiple variables at once.
Principal Component Analysis (PCA): A dimensionality reduction technique that reduces the number of variables in large datasets while preserving as much variance as possible.
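A minimal PCA sketch with scikit-learn, using random numbers as stand-in data (the dataset shape and the choice of two components are arbitrary):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))  # 200 rows, 10 numeric features

    X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)       # 10 columns reduced to 2

    print(X_reduced.shape)                # (200, 2)
    print(pca.explained_variance_ratio_)  # share of variance kept by each component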
Advantages of Using EDA (Exploratory Data Analysis)
EDA helps identify patterns and trends in data after it has been prepared and
formatted.
These insights guide actions to meet business goals effectively.
Just like a good employee answers job-related questions, EDA provides clear
answers to business-related queries.
In data science, predictive models need the best data features to work well.
EDA ensures the right patterns and trends are available for training these models,
like preparing the perfect recipe for success.
Using the right EDA tools with the right data helps achieve the desired outcomes.
Philosophy of EDA (Exploratory Data Analysis)
1. What is EDA?
o Introduced by John Tukey in 1977, EDA is a way to explore data without fixed
expectations.
o The goal is to let the data reveal patterns, trends, and structures on its own.
2. Key Idea
o Be flexible and avoid rigid assumptions about the data.
o Let the data "speak for itself" and guide the analysis.
3. Impact
o EDA has influenced areas like robust analysis, data mining, and visual analytics.
o It helps avoid relying on strict mathematical models that may not fit real data.
4. EDA vs. Statistical Graphics
o Statistical Graphics: Focuses only on creating graphs to visualize data.
o EDA: Goes deeper to analyze and interpret data meaningfully, beyond just visuals.
In simple terms, EDA is a mindset and approach to explore and understand data without
making early assumptions.
Illustrate linear regression and apply this technique to house price prediction.
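Linear regression models the target as a weighted sum of the input features, e.g. price ≈ w0 + w1*area + w2*bedrooms. A minimal sketch for house price prediction with scikit-learn; the tiny dataset and the choice of features are made up for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical training data: [area in square feet, number of bedrooms].
    X = np.array([[1000, 2], [1500, 3], [1800, 3], [2400, 4], [3000, 4]])
    y = np.array([200000, 290000, 340000, 450000, 540000])  # observed prices

    model = LinearRegression()
    model.fit(X, y)                       # learn the weights and the intercept

    print(model.coef_, model.intercept_)  # contribution of area and of bedrooms
    print(model.predict([[2000, 3]]))     # predicted price for a new house

Fitting chooses the weights that minimize the squared error between predicted and actual prices; the same idea scales to many more features such as location or the age of the house.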
Unit 3
Why Linear Regression and KNN Are Poor Choices for Spam Filtering
Spam filtering is the process of identifying and separating unwanted or irrelevant emails
(spam) from useful or legitimate ones in your inbox. The goal is to prevent spam emails—like
advertisements, phishing attempts, or malicious content—from cluttering your mailbox or
causing harm.
Spam filters use algorithms to analyze incoming emails based on various factors to decide if an
email is spam or not.
Why Linear Regression Doesn't Work for Spam Filtering:
Linear regression predicts a continuous number, but spam filtering needs a binary answer (spam or not spam). Its output is not a probability and can fall outside the 0 to 1 range, so it has to be cut off at an arbitrary threshold. Emails are also represented by thousands of word features, most of which are zero for any given message, and ordinary linear regression copes poorly with such high-dimensional, sparse data.
Why KNN Doesn't Work for Spam Filtering:
To compare emails, KNN checks how many words or phrases match. However, this only captures surface similarities, like the way the email is written, and doesn't understand deeper patterns or meanings.
Depends on "distance": KNN works well only if you can clearly measure how "similar" one email is to another, and for spam filtering it is difficult to define such a reliable similarity measure, so KNN doesn't work well.
Limited generalization: KNN can only spot spam that looks very similar to what it already
knows. It has trouble recognizing new or different kinds of spam.
Same problem with non-spam: KNN will only classify an email as "not spam" if it looks very similar to the non-spam emails it has already seen during training.
Cost of using KNN:
You need a lot of labeled spam and non-spam data for KNN to work well.
KNN has to compare every new email to all the existing emails in its database. This is
very slow and expensive when processing millions of emails daily.
Better alternative:
Naive Bayes is preferred for spam filtering because it is faster, needs less data, and can
generalize better to new types of spam.
Naive Bayes Algorithm
A Naive Bayes classifier is a group of simple algorithms that use Bayes' Theorem to make
predictions. Even though it makes a "naive" assumption that features are independent, it works
well for many real-world problems, especially in text classification (like spam detection).
"Naive": Because it assumes that all features (like words in an email) are independent of each
other, which is rarely true in real life.
"Bayes": Because it uses Bayes' Theorem, a formula developed by Thomas Bayes, which
calculates the probability of something happening based on prior knowledge.
Bayes’ Theorem
Bayes' Theorem finds the probability of an event occurring given that another event has already occurred. It is stated mathematically as the following equation:
P(A|B) = P(B|A) × P(A) / P(B)
where P(A|B) is the posterior probability of event A given evidence B, P(B|A) is the likelihood of B given A, P(A) is the prior probability of A, and P(B) is the probability of B.
Assumptions of Naive Bayes:
1. Feature independence:
It assumes that each feature (like words in a message) is independent and doesn't affect
other features.
(In reality, features are often related, but this simplification helps make the algorithm
fast and easy to use.)
2. Continuous features follow a normal distribution:
If a feature (like a number) is continuous, Naive Bayes assumes that it follows a bell-
shaped curve (normal distribution).
3. Discrete features follow a multinomial distribution:
If features are discrete (like words in a document), it assumes they follow a multinomial
distribution (common in text classification).
4. All features are equally important:
It assumes that every feature contributes equally to the prediction.
5. No missing data:
Naive Bayes works best when there are no missing values in the data.
It's simple, fast, and works well for many tasks, especially text-related problems like spam filtering, sentiment analysis, and document classification.
Applications of Naive Bayes:
1. Spam Detection
o Naive Bayes is commonly used to classify emails as spam or not spam by analyzing the
frequency of certain words (like "free," "win," etc.).
2. Sentiment Analysis
o It helps determine whether a piece of text (like a product review or tweet) has a
positive, negative, or neutral sentiment.
3. Text Classification
o Widely used in applications like news categorization (e.g., sports, politics,
entertainment) or document classification based on topics.
4. Recommendation Systems
o Used in personalized recommendation systems to suggest products, movies, or music
based on a user’s past preferences.
5. Medical Diagnosis
o Applied in disease prediction by analyzing symptoms and predicting the likelihood of a
disease (e.g., whether a patient has cancer or not).
6. Fraud Detection
o Helps detect fraudulent transactions by analyzing patterns in user behavior and
identifying unusual or suspicious activity.
7. Language Detection
o Can classify a text into different languages by analyzing the structure and frequency of
words in the document.
8. Face Recognition
o Naive Bayes can be used for image classification, such as distinguishing between
different faces or objects in images.
Naive Bayes is widely used in filtering applications (like spam detection, content filtering, and
sentiment analysis) because of the following key reasons:
1. Handles Sparse Data
In applications like email filtering or text classification, the data is often sparse (many features have zero or low values). Naive Bayes handles sparse data effectively, as it assumes independence between features and focuses on feature frequency.
2. Works Well with High-Dimensional Data
Filtering applications like spam detection involve a large number of features (words in a message). Naive Bayes performs well in such high-dimensional environments by treating each feature independently.
3. Needs Little Labeled Data
Even with a small amount of labeled data, Naive Bayes can produce reasonable results. This is important in filtering tasks where collecting large labeled datasets can be difficult.
4. Probabilistic Interpretation
Naive Bayes provides probabilistic outputs, meaning it can estimate the likelihood of a message being spam or legitimate. This helps in ranking or prioritizing messages based on confidence levels.
5. Robust to Irrelevant Features
Since Naive Bayes assumes feature independence, irrelevant features have little effect on its performance. In filtering tasks, some words may not contribute to the classification, but Naive Bayes can still perform well without explicitly removing those features.
6. Proven in Practice
Naive Bayes has proven effective in real-world applications like spam filtering (e.g., Gmail's spam detection system) and sentiment analysis. Its simplicity, combined with reasonable accuracy and speed, makes it a popular choice in practice.
Conclusion
The motivation for using Naive Bayes in filtering applications lies in its simplicity, efficiency, and
effectiveness in handling high-dimensional, sparse data with relatively small training datasets.
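A minimal sketch of a Naive Bayes spam filter using scikit-learn's CountVectorizer and MultinomialNB; the example messages are made up, and a real filter would need a much larger labeled dataset:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny hypothetical training set: 1 = spam, 0 = not spam.
    messages = [
        "win a free prize now",
        "limited offer click to claim your reward",
        "meeting moved to 3pm tomorrow",
        "please review the attached project report",
    ]
    labels = [1, 1, 0, 0]

    vectorizer = CountVectorizer()  # word counts as features
    X = vectorizer.fit_transform(messages)

    model = MultinomialNB()         # multinomial NB suits word-count features
    model.fit(X, labels)

    new = vectorizer.transform(["claim your free reward now"])
    print(model.predict(new))        # expected to print [1], i.e. spam
    print(model.predict_proba(new))  # estimated class probabilities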
Advantages of Naive Bayes for Spam Filtering:
As summarized above, it is fast, works with relatively little labeled data, handles sparse and high-dimensional word features well, and produces probability scores that can be used to rank messages.
Key Ethical Considerations in Data Science:
1. Privacy
o Respect user consent and handle sensitive data carefully.
o Example: Follow rules like GDPR, which require informing users
about how their data will be used.
2. Security
o Protect data from breaches and unauthorized access.
o Example: Use encryption and allow access only to authorized
individuals for sensitive data.
3. Bias in Algorithms
o Avoid reinforcing biases by ensuring the training data is fair and
diverse.
o Example: Train facial recognition systems on diverse datasets to
prevent racial or gender bias.
4. Transparency
o Make it clear to users how algorithms work and how their decisions are made.
o Example: Explain AI decisions so users understand automated
outcomes.
5. Data Ownership
o Use third-party data only with proper consent from the owner.
o Example: Social media platforms should get explicit permission
before using user data for research.
Ethical practices in data science help build public trust and maintain the integrity
of the field.
Privacy and Security Ethics in Data Science
In data science, privacy and security ethics focus on responsibly managing and
using sensitive data to ensure individuals' rights are protected. Below are key
considerations:
1. Data Privacy:
o Ensuring personal information is protected and not shared without
consent.
o Following legal frameworks like GDPR or CCPA to safeguard user
data.
2. Informed Consent:
o Collecting data only after informing users about its purpose and
obtaining their consent.
3. Data Security:
o Protecting datasets from unauthorized access, breaches, and misuse.
4. Anonymization:
o Removing identifiable information from data to maintain user
privacy.
5. Bias and Fairness:
o Avoiding biases in data collection and analysis to ensure ethical
outcomes.
6. Transparency:
o Clearly communicating how data is used, stored, and analyzed.
7. Accountability:
o Holding organizations and individuals responsible for ethical data
practices.
Unit 4