Unit I data analytics
Here’s a list of basic terminology in data science to help you understand the key concepts and
jargon used in the field:
1. Data
• Definition: Raw facts and figures collected for analysis. Data can be structured (e.g.,
numbers, dates) or unstructured (e.g., text, images).
• Example: Sales records, sensor data, customer reviews.
2. Dataset
• Definition: A collection of data points or records, often organized in tables with rows
(records) and columns (features or attributes).
• Example: A dataset of customers with attributes like age, gender, and purchasing history.
3. Feature
• Definition: A characteristic or property of the data, typically represented as a column in a
dataset.
• Example: In a dataset of house prices, features might include size, location, and number of
bedrooms.
4. Label
• Definition: The target variable or the outcome that you are trying to predict in a dataset.
It’s often used in supervised learning.
• Example: In a dataset of email spam detection, the label might be “spam” or “not spam.”
5. Data Preprocessing
• Definition: The process of cleaning and transforming raw data into a usable format for
analysis or modeling. This may involve handling missing values, removing outliers,
normalization, or encoding categorical data.
• Example: Converting categorical values like "Male" and "Female" to 0 and 1 for machine
learning algorithms.
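The encoding example above can be sketched in a few lines of Python. This is only an illustration: the mapping `GENDER_CODES` and the function name `encode_gender` are made up for this sketch, and which category gets 0 or 1 is arbitrary.

```python
# Hypothetical category-to-integer mapping; which label gets 0 or 1 is arbitrary.
GENDER_CODES = {"Male": 0, "Female": 1}

def encode_gender(values):
    """Map each category to its integer code; unknown or missing values become None."""
    return [GENDER_CODES.get(v) for v in values]

raw = ["Male", "Female", "Female", None, "Male"]
encoded = encode_gender(raw)
print(encoded)  # [0, 1, 1, None, 0]
```

Missing values are left as None here so that they can be handled in a separate preprocessing step.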
6. Exploratory Data Analysis (EDA)
• Definition: The initial analysis of a dataset to summarize its main characteristics, often
using visual methods like histograms, scatter plots, and box plots.
• Example: Using EDA to check the distribution of customer ages or to detect patterns in
sales data.
7. Correlation
• Definition: A statistical measure that describes the extent to which two variables are
related. A positive correlation means that as one variable increases, the other also
increases.
• Example: The relationship between the amount of advertising spent and the sales revenue.
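As a rough illustration of correlation, the sketch below computes the Pearson correlation coefficient (one common correlation measure, chosen here as an assumption) for advertising spend and sales revenue. The numbers are invented for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and the two sums of squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up figures: advertising spend and sales revenue for five periods.
ad_spend = [10, 20, 30, 40, 50]
revenue = [25, 41, 62, 79, 105]
r = pearson(ad_spend, revenue)
print(round(r, 3))
```

A value of r close to +1 indicates a strong positive correlation, a value close to -1 a strong negative one, and a value near 0 little linear relationship.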
8. Model
• Definition: An algorithm or mathematical representation used to make predictions or
decisions based on data. Models are trained on data to find patterns or relationships.
• Example: A linear regression model predicting house prices based on features like square
footage, number of rooms, etc.
9. Training Data
• Definition: A subset of the data used to train a machine learning model. The model learns
patterns from this data to make predictions or classifications.
• Example: If you are building a model to predict whether an email is spam, the training data
would consist of emails labeled as spam or not spam.
10. Test Data
• Definition: A separate subset of the data used to evaluate the performance of a trained
model. Test data helps assess how well the model generalizes to new, unseen data.
• Example: The emails you use to test the spam detection model after it’s been trained.
11. Supervised Learning
• Definition: A type of machine learning where the model is trained on labeled data (i.e.,
data with known outcomes). The goal is to predict the label for new, unseen data.
• Example: Predicting house prices based on features such as size and location using a
dataset where the prices are known.
12. Unsupervised Learning
• Definition: A type of machine learning where the model is trained on unlabeled data and
must find hidden patterns or structures within the data.
• Example: Clustering customers into different groups based on purchasing behavior without
prior knowledge of the groups.
13. Overfitting
• Definition: A modeling problem where a model is too closely aligned to the training data,
capturing noise or random fluctuations rather than the actual underlying patterns. This
results in poor performance on new, unseen data.
• Example: A decision tree model that is excessively complex and performs well on training
data but poorly on test data.
14. Underfitting
• Definition: A modeling problem where the model is too simplistic to capture the
underlying patterns in the data, resulting in poor performance on both training and test
data.
• Example: A linear regression model trying to predict complex data that requires a non-linear
approach.
15. Cross-Validation
• Definition: A technique for assessing the performance of a model by partitioning the data
into multiple subsets (folds) and training the model on different subsets while testing on
others.
• Example: Using k-fold cross-validation to train and test a model multiple times on different
data splits to evaluate its robustness.
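The splitting step of k-fold cross-validation can be sketched as follows. This is a minimal illustration that only produces the index splits; the function name `k_fold_indices` is made up, the folds are contiguous, and no shuffling is done (in practice the data is usually shuffled first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so a model trained and scored once per fold is evaluated on every sample exactly once.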
16. Accuracy
• Definition: A metric used to evaluate the performance of a model, representing the
proportion of correct predictions out of the total predictions made.
• Example: If a spam detection model correctly classifies 90 out of 100 emails, its accuracy is
90%.
17. Precision and Recall
• Definition: Precision measures the proportion of positive predictions that were actually
correct, while recall measures the proportion of actual positives that were correctly
identified by the model.
• Example: In a medical diagnosis model, precision refers to how many of the predicted
"diseased" patients actually have the disease, while recall refers to how many actual
"diseased" patients were correctly identified by the model.
18. F1 Score
• Definition: The harmonic mean of precision and recall, providing a balance between the
two. It's often used when there is an uneven class distribution.
• Example: A model that is good at identifying positive cases without making too many false
positives but also ensures that few true positives are missed.
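To make accuracy, precision, recall, and the F1 score concrete, here is a small sketch that computes all four from true and predicted labels. The function name `classification_metrics` and the example labels are made up for illustration; 1 is assumed to be the positive class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Invented labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these labels there are 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, so accuracy is 0.8 while precision, recall, and F1 are all 0.75.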
19. Clustering
• Definition: A type of unsupervised learning that groups similar data points together into
clusters. It is often used to find patterns in data without predefined labels.
• Example: Segmenting customers into different groups based on purchasing behavior using
k-means clustering.
20. Dimensionality Reduction
• Definition: The process of reducing the number of input variables (features) in a dataset
while retaining essential patterns. This is often done to make the data more manageable or
to visualize high-dimensional data.
• Example: Using Principal Component Analysis (PCA) to reduce a dataset with many
features into a few principal components.
21. Deep Learning
• Definition: A subset of machine learning that uses neural networks with many layers (deep
neural networks) to model complex patterns in large datasets.
• Example: Image recognition tasks like identifying objects in photos using Convolutional
Neural Networks (CNNs).
22. Natural Language Processing (NLP)
• Definition: A field of data science focused on enabling machines to understand, interpret,
and generate human language.
• Example: Sentiment analysis of customer reviews, machine translation, and chatbots.
23. Artificial Intelligence (AI)
• Definition: The broader field that encompasses creating systems capable of performing
tasks that typically require human intelligence, such as understanding language,
recognizing patterns, and making decisions.
• Example: Autonomous driving systems, recommendation engines, and intelligent virtual
assistants.
24. Big Data
• Definition: Large and complex datasets that cannot be easily processed using traditional
data-processing methods. Big data often requires specialized tools and infrastructure.
• Example: Social media data, sensor data, and transaction logs that contain billions of
records.
25. Data Visualization
• Definition: The graphical representation of data to help convey insights clearly and
effectively through charts, graphs, and other visual formats.
• Example: Using bar charts, pie charts, and heatmaps to present findings from a dataset.
Structured data: pros and cons
Pros:
• It’s easily searchable and used for machine learning algorithms.
• It’s accessible to businesses and organizations for interpreting data.
• There are more tools available for analyzing structured data than unstructured.
Cons:
• It’s limited in usage, meaning it can only be used for its intended purpose.
• It’s limited in storage options because it’s stored in systems like data warehouses with
rigid schemas.
• It requires tabular formats that require rigid schema consisting of predefined fields.
Unstructured data: pros and cons
Pros:
• It remains undefined until it’s needed, making it adaptable for data professionals to take
only what they need for a specific query while storing most data in massive data lakes.
• Within definitions, unstructured data can be collected quickly and easily.
Cons:
• It requires data scientists to have expertise in preparing and analyzing the data, which
could restrict other employees in the organization from accessing it.
• Special tools are needed to deal with unstructured data, further contributing to its lack of
accessibility.
1. Nominal Data
What is Nominal Data?
Nominal data, a fundamental type of qualitative data, is used primarily to label or name variables
without imparting numeric values.
This simplest form of measurement categorizes variables into distinct, non-overlapping groups.
Unlike other data types, nominal data lacks an inherent order or measurable distance between its
categories, and it does not adhere to a true zero value.
It’s crucial in fields requiring classification without quantitative analysis, such as identifying different
species in biology or categorizing various types of government in political science.
Examples and Applications
• You often encounter nominal data in everyday situations. For example, when you specify
your hair color (black, brown, grey, blonde) or select your preferred mode of public transport
(bus, tram, train), you are providing nominal data.
• These categories are exclusive and descriptive, serving as identifiers without any quantitative
significance. In surveys, nominal data can be gathered through questions that offer a set list
of options.
• For instance, a survey might ask, “Which state do you live in?” followed by a drop-down list
of states, or “What is your employment status?” with options like employed, unemployed, or
retired.
Significance in Data Analysis
In data analysis, nominal data’s primary value lies in its ability to segment and organize information
categorically.
This data type is useful for statistical analysis, marketing strategies, and demographic studies where
understanding the distribution of categories is more relevant than measuring or comparing
numerical values.
For example, marketers might analyze nominal data to determine the most popular product colors or
features among different demographic groups, enabling targeted marketing strategies.

Nominal data is typically visualized using bar charts or pie charts, which effectively display the
frequency distribution of categories.
2. Ordinal Data
What is Ordinal Data?
Ordinal data classifies variables into categories that have a natural order but where the distances
between the categories are not necessarily uniform or known.
This type of data is often seen in scenarios where ranking is possible but the exact difference
between ranks is not quantifiable.
It’s a step above nominal data, which involves categories without any order, and below interval data,
where the differences between values are evenly spaced.
Examples and Applications
• You commonly encounter ordinal data in everyday situations and professional settings. For
instance, in surveys, you might be asked to rate your satisfaction on a scale from 1 to 5,
where each number represents a level of satisfaction from ‘very dissatisfied’ to ‘very
satisfied’.
• These scales are ordinal because they convey an order: higher numbers mean more
satisfaction. However, the difference in experience between consecutive numbers isn’t
necessarily the same.
• Other examples include classifying economic status (low, medium, high), or levels of
education (high school, college, university).
• Ordinal data is extensively used in market research and healthcare. It helps in assessing
consumer preferences and patient outcomes respectively, where responses are categorized
into ordered levels.

• This data type is pivotal in statistical analysis, especially in non-parametric statistics, which do
not assume data distribution patterns.
Comparison with Nominal Data
3. Discrete Data
What is Discrete Data?
Discrete data consists of countable values, limited to whole numbers or integers, and cannot be
subdivided into smaller parts.
This type of data fits into specific categories and is essential for various types of statistical analysis
because it is straightforward to summarize and compute.
Examples and Applications
• You encounter discrete data frequently in everyday life and professional environments.
• For instance, the size of your department’s workforce, the number of new clients acquired in
a quarter, or the inventory count in your stockroom are all examples of discrete data.
• This data is typically visualized using bar graphs, which effectively represent the countable
nature of the data.
• In marketing, discrete data aids in demographic analysis and helps in understanding
consumer behavior by categorizing data into different demographic variables like age,
income, and education level.
Role in Quantitative Analysis
Discrete data plays a pivotal role in quantitative analysis as it provides precise counts that are
essential for statistical calculations.
It is often used in simple statistical analyses like frequency distributions, where data is organized
against single values.
This type of data is particularly useful in scenarios where data points are distinct and separate, such
as the number of tickets sold per day or the number of students attending a class.
The clear, countable nature of discrete data makes it invaluable for making informed decisions based
on quantitative facts.
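A frequency distribution of discrete data, as described above, can be built with Python's standard `collections.Counter`. The daily ticket counts below are invented purely for illustration.

```python
from collections import Counter

# Made-up discrete data: number of tickets sold on each of seven days.
tickets_sold_per_day = [3, 5, 3, 4, 5, 5, 2]

# Count how many days each distinct value occurred.
freq = Counter(tickets_sold_per_day)

for value in sorted(freq):
    print(f"{value} tickets: {freq[value]} day(s)")
```

Because the values are whole-number counts, each distinct value gets its own row in the distribution, which is also what a bar graph of this data would show.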
Interval
The interval scale is a numerical scale which labels and orders variables, with a known, evenly
spaced interval between each of the values.
Ratio data has the same properties but also has a true zero. A good example of ratio data is weight
in kilograms: if something weighs zero kilograms, it truly weighs nothing. Compare this with
temperature in degrees Celsius (interval data), where a value of zero degrees doesn’t mean there
is “no temperature”; it is simply another point on the scale.

What Are The 5 Steps in Data Science Lifecycle
The data science lifecycle is a systematic approach to extracting value from data. It provides a
framework for data scientists to follow from problem definition to model evaluation.
The data science lifecycle encompasses five main stages, each with its own set of tasks and goals.
These stages are:
1. Defining the problem
2. Data collection and preparation
3. Data exploration and analysis
4. Model building and evaluation
5. Deployment and maintenance
Step 1: Defining the problem
The first step in the data science lifecycle is to define the problem that needs to be solved.
This involves clearly articulating the business objective and understanding the key requirements and
constraints.
Effective problem definition sets the stage for the entire data science project, as it helps to align the
goals of the analysis with the needs of the organization.
The role of problem definition in data science
A well-defined problem provides a clear direction for the data science project and helps data
scientists focus their efforts on finding relevant and actionable insights.
Furthermore, problem definition helps to manage expectations by establishing realistic goals and
timelines for the data science project.
Techniques for effective problem definition
Effective problem definition requires a systematic approach. Data scientists can employ techniques
such as:
• Stakeholder interviews: Engaging with key stakeholders to understand their requirements,
expectations, and pain points.
• Problem framing: Breaking down the overarching problem into smaller, more manageable
sub-problems.
• Defining success criteria: Establishing clear and measurable criteria for evaluating the
success of the data science project.
• Setting priorities: Identifying the most critical aspects of the problem that need to be
addressed first.
• Documenting requirements: Documenting the problem statement, goals, and constraints to
ensure that all team members are aligned.
Step 2: Data collection and preparation
Once the problem has been defined, the next step is to collect and prepare the relevant data for
analysis. This involves identifying the data sources, acquiring the data, and transforming it into a
format suitable for analysis.
The process of data collection in data science
Data collection is a critical phase in the data science lifecycle, as the quality and completeness of the
data directly impact the accuracy and reliability of the analyses.
Data scientists can collect data from various sources, including internal databases, external APIs, web
scraping, and surveys.
During the data collection process, it is essential to ensure the privacy and security of the data,
especially when dealing with sensitive or personally identifiable information.
Data scientists must also consider data governance and compliance requirements, such as data
protection regulations.
Preparing your data for analysis
Before diving into the analysis, data scientists need to prepare the data by cleaning, transforming,
and restructuring it. This involves tasks such as:
• Data cleaning: Removing outliers, handling missing values, and resolving inconsistencies.
• Data integration: Combining data from different sources and resolving any discrepancies or
conflicts.
• Feature engineering: Creating new features that capture relevant information and improve
the performance of machine learning models.
• Data reduction: Reducing the dimensionality of the data to focus on the most informative
variables.
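As one small example of the data cleaning task listed above, the sketch below fills missing values with the mean of the observed values. Mean imputation is only one of several possible strategies and is picked here as an assumption; the function name `impute_missing_with_mean` and the sample ages are made up.

```python
def impute_missing_with_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Invented column of customer ages with two missing entries.
ages = [34, None, 29, 41, None]
cleaned = impute_missing_with_mean(ages)
print(cleaned)
```

Whether to impute, drop the incomplete rows, or use another strategy depends on how much data is missing and why, so this choice should be made per dataset.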
Step 5: Deployment and maintenance
After successfully building and evaluating the data model, the next crucial phase in the data science
lifecycle is deployment and maintenance.
Deployment strategies
Deploying a data model requires careful planning to minimise disruptions and ensure its practical
utility. Common deployment strategies include:
• Batch Processing: Running the model periodically to analyse large volumes of data in
batches, suitable for scenarios with less urgency.
• Real-time Processing: Enabling the model to process data in real-time, providing
instantaneous insights and predictions, ideal for applications requiring quick responses.
• Cloud Deployment: Leveraging cloud platforms for deployment, offering scalability,
flexibility, and accessibility, facilitating easier updates and maintenance.
Continuous monitoring and maintenance
Once deployed, continuous monitoring and maintenance are essential to sustain the model’s
performance. Key considerations include:
• Performance Monitoring: Regularly assessing the model’s accuracy and responsiveness to
ensure it aligns with the expected outcomes.
• Data Drift Detection: Monitoring changes in input data distribution to identify potential
shifts that might impact the model’s performance.
• Updating Models: Periodically updating the model to incorporate new data, adapt to
changing patterns, and improve predictive capabilities.
• Security Measures: Implementing robust security measures to protect the model and data,
especially when dealing with sensitive information.
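Data drift detection can take many forms. As a deliberately simple sketch, the code below flags drift when the mean of recent input values moves more than a chosen number of baseline standard deviations away from the baseline mean. The function names, the threshold of 2.0, and the sample values are all assumptions made for illustration, not a standard method.

```python
import statistics

def mean_shift_in_std(baseline, recent):
    """Distance between the recent and baseline means, in baseline standard deviations."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    if base_std == 0:
        raise ValueError("baseline has zero variance")
    return abs(statistics.mean(recent) - base_mean) / base_std

def has_drifted(baseline, recent, threshold=2.0):
    """Flag drift when the mean shift exceeds the (assumed) threshold."""
    return mean_shift_in_std(baseline, recent) > threshold

# Invented feature values seen at training time and in two later batches.
baseline = [10, 11, 9, 10, 12, 10, 9, 11]
recent_ok = [10, 11, 10, 9]
recent_shifted = [15, 16, 14, 15]

print(has_drifted(baseline, recent_ok))       # False
print(has_drifted(baseline, recent_shifted))  # True
```

A check like this only catches a shift in the mean; changes in spread or shape of the distribution would need a different comparison.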
Data science is the in-depth study of large quantities of data, with the aim of extracting meaning
from raw data, whether structured or unstructured. This extraction relies on algorithms, statistical
and scientific techniques, and a range of tools and technologies. Data science is sometimes described
as the future of Artificial Intelligence.
For example, Jagroop loves to read, but every time he wants to buy books he is unsure which one to
choose because there are so many options in front of him. Data science techniques are useful here.
When he opens Amazon, he gets product recommendations based on his previous data. When he chooses
one of them, he also gets a recommendation to buy other books that are frequently bought together
with it. Product recommendations and suggested sets of books that are purchased together are both
examples of data science.
Real-world Applications of Data Science
1. In Search Engines
One of the most widely used applications of data science is the search engine. When we want to
search for something on the internet, we mostly use search engines like Google, Yahoo, DuckDuckGo,
and Bing. Data science is used to return relevant results faster.
For example, when we search for “Data Structure and algorithm courses”, the first link we get may
be GeeksforGeeks Courses. This happens because the GeeksforGeeks website is among the most visited
for information regarding data structure courses and computer-related subjects. This analysis is
done using data science, and the most-visited web links are shown at the top.
2. In Transport
Data science has also entered real-time fields such as transport, for example in driverless cars.
Driverless cars can help reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm and, with the help of
data science techniques, the data is analyzed: what the speed limit is on highways, busy streets,
narrow roads, and so on, and how to handle different situations while driving.
3. In Finance
Data science plays a key role in the financial industry. Financial companies always face the risk
of fraud and losses, so they need to automate risk-of-loss analysis in order to make strategic
decisions. They also use data science analytics tools to make predictions, such as estimating
customer lifetime value and anticipating stock market moves.
For example, in the stock market, data science is used to examine past behavior from past data
with the goal of anticipating future outcomes. Data is analyzed in such a way that it becomes
possible to estimate future stock prices over a set timetable.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use data science to create a better user experience
with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar to
our past choices, and we also get recommendations based on the most-bought, most-rated, and
most-searched products. This is all done with the help of data science.
5. In Health Care
In the healthcare industry, data science acts as a boon. Data science is used for:
• Detecting tumors.
• Drug discovery.
• Medical image analysis.
• Virtual medical bots.
• Genetics and genomics.
• Predictive modeling for diagnosis, etc.
6. Image Recognition
Data science is also used in image recognition. For example, when we upload a picture with a
friend on Facebook, Facebook suggests tags for who is in the picture. This is done with the help of
machine learning and data science. When an image is recognized, the faces in the picture are
compared against the user’s Facebook friends, and if a face matches someone’s profile, Facebook
suggests auto-tagging.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of data science. Whatever a user
searches for on the internet, he/she will then see numerous related posts everywhere. This can be
explained with an example: suppose I want a mobile phone, so I search for it on Google, and after
that I change my mind and decide to buy offline. Data science helps the companies that are paying
for advertisements for that phone. Everywhere on the internet, on social media, on websites, and
in apps, I will see recommendations for the mobile phone I searched for, which nudges me to buy
online.
8. Airline Routing Planning
With the help of data science, the airline sector is also growing; for example, it becomes easier
to predict flight delays. It also helps to decide whether to fly directly to the destination or
to make a halt in between: a flight can take a direct route from Delhi to the U.S.A., or it can
halt in between and then reach the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used
together with machine learning, so that the computer improves its performance with the help of
past data. Many games, such as chess and EA Sports titles, use data science concepts.
10. Medicine and Drug Development
The process of creating medicine is very difficult and time-consuming and has to be done with
full discipline because it is a matter of someone’s life. Without data science, developing a new
medicine or drug takes a lot of time, resources, and finance. With the help of data science, it
becomes easier because the likelihood of success can be estimated from biological data or
factors. Algorithms based on data science can help forecast how a drug will react in the human
body, reducing the need for lab experiments.
11. In Delivery Logistics
Various logistics companies like DHL, FedEx, etc. make use of data science. Data science helps these
companies to find the best route for the shipment of their products, the best time for delivery,
the best mode of transport to reach the destination, etc.
12. Autocomplete
The autocomplete feature is an important application of data science: the user types just a few
letters or words and is offered a completion of the line. In Google Mail, when we are writing a
formal mail to someone, the autocomplete feature offers an efficient choice to complete the whole
line. Autocomplete is also widely used in search engines, social media, and various apps.
Various data science tools and programming platforms for developing data science applications
Data science relies on a wide variety of tools and programming platforms that help with the entire
process of data collection, cleaning, analysis, modeling, and visualization. Here are some of the most
popular and widely used tools in the data science ecosystem:
1. Programming Languages
• Python:
o Overview: The most popular language for data science due to its simplicity,
readability, and powerful libraries.
o Key Libraries: