Data Science Notes - 1-PD
Data science is the process of extracting and analysing useful information from data to
solve problems that are difficult to solve analytically. For example, when you visit an e-
commerce site and look at a few categories and products before making a purchase, you
are creating data that Analysts can use to figure out how you make purchases.
It involves different disciplines like mathematical and statistical modelling, extracting data
from its source and applying data visualization techniques. It also involves handling big
data technologies to gather both structured and unstructured data.
It helps you find patterns that are hidden in the raw data. The term "Data Science" has
evolved because mathematical statistics, data analysis, and "big data" have changed over
time.
Data science is an interdisciplinary field that lets you learn from both organised and
unorganised data. With data science, you can turn a business problem into a research
project and then apply the findings to a real-world solution.
Peter Naur suggested the phrase "data science" as an alternative name for computer
science in 1974. The International Federation of Classification Societies was the first
conference to highlight data science as a special subject in 1996. Yet, the concept
remained in flux. Following his 1985 lecture at the Chinese Academy of Sciences in
Beijing, C. F. Jeff Wu again advocated in 1997 for renaming statistics to data science.
He reasoned that a new name would help statistics shed inaccurate stereotypes and
perceptions, such as being associated with accounting or confined to data description.
Hayashi Chikio proposed data science in 1998 as a new, multidisciplinary concept with
three components: data design, data collecting, and data analysis.
In the 1990s, "knowledge discovery" and "data mining" were popular phrases for the
process of identifying patterns in datasets that were growing in size.
In 2012, Thomas H. Davenport and DJ Patil proclaimed "Data Scientist: The
Sexiest Job of the 21st Century," a term that was taken up by major metropolitan
publications such as the New York Times and the Boston Globe. They repeated it a decade
later, adding that "the position is in more demand than ever"
In 2008, DJ Patil and Jeff Hammerbacher adopted the professional title of "data
scientist." Although the term was used by the National Science Board in its 2005 report "Long-
Lived Digital Data Collections: Enabling Research and Education in the 21st Century," it
referred to any significant role in administering a digital data collection.
An agreement has not yet been reached on the meaning of data science, and some believe
it to be a buzzword. Big data is a similar concept in marketing. Data scientists are
responsible for transforming massive amounts of data into useful information and
developing software and algorithms that assist businesses and organisations in
determining optimum operations.
1) Data is the oil of the modern age. With the proper tools, technologies, and
algorithms, we can leverage data to create a unique competitive edge.
2) Data Science may assist in detecting fraud using sophisticated machine learning
techniques.
3) It helps you avoid severe financial losses.
4) Enables the development of intelligent machines
5) You may use sentiment analysis to determine the brand loyalty of your customers.
This helps you to make better and quicker choices.
6) It enables you to propose the appropriate product to the appropriate consumer in
order to grow your company.
3) Demand and average salary of a data scientist:
a) According to India Today, India is the second biggest centre for data science in
the world due to the fast digitalization of companies and services. By 2026,
analysts anticipate that the nation will have more than 11 million employment
opportunities. In fact, recruiting in the data science field has surged by 46% since
2019.
b) Bank of America was one of the first financial institutions to provide mobile
banking to its consumers a decade ago. Recently, the Bank of America introduced
Erica, its first virtual financial assistant. It is regarded as one of the best financial
innovations in the world.
Erica now serves as a client adviser for more than 45 million consumers worldwide.
Erica uses Voice Recognition to receive client feedback, which represents a
technical development in Data Science.
c) The learning curves for Data Science and Machine Learning are steep. Although India sees a
massive influx of data scientists each year, relatively few possess the needed skill
set and specialization. As a consequence, people with specialised data skills are in
great demand.
The healthcare industry has benefitted from the rise of data science. In 2008, Google
employees realised that they could monitor influenza strains in real time. Previous
technologies could only provide weekly updates on instances. Google was able to build one
of the first systems for monitoring the spread of diseases by using data science.
The sports sector has similarly profited from data science. A data scientist in 2019 found
ways to measure and calculate how goal attempts increase a soccer team's odds of
winning. In reality, data science is utilised to easily compute statistics in several sports.
Government agencies also use data science on a daily basis. Governments throughout the
globe employ databases to monitor information regarding social security, taxes, and other
data pertaining to their residents. The government's usage of emerging technologies
continues to develop.
Since the Internet has become the primary medium of human communication, the
popularity of e-commerce has also grown. With data science, online firms may monitor
the whole of the customer experience, including marketing efforts, purchases, and
consumer trends. Targeted ads are one of the best examples of eCommerce firms using
data science. Have you ever looked for anything online or visited an eCommerce product
website, only to be bombarded by advertisements for that product on social networking
sites and blogs?
Ad pixels are integral to the online gathering and analysis of user information. Companies
leverage online consumer behaviour to retarget prospective consumers throughout the
internet. This usage of client information extends beyond eCommerce. Apps such as Tinder
and Facebook use algorithms to help users locate precisely what they are seeking. The
Internet is a growing treasure trove of data, and the gathering and analysis of this data
will also continue to expand.
Data Science — What is data?
Data comes in many shapes and forms, but can generally be thought of as being the result
of some random experiment — an experiment whose outcome cannot be determined in
advance, but whose workings are still subject to analysis. Data from a random experiment
are often stored in a table or spreadsheet. By statistical convention, variables (often called
features) are stored as columns and individual items (or units) as rows.
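As a minimal illustration (the values below are invented for this example), such a table can be
built as a pandas DataFrame, with each feature stored as a column and each unit as a row:

import pandas as pd

# Hypothetical outcomes of a small random experiment:
# each row is one observed unit, each column is a feature.
df = pd.DataFrame({
    "unit_id": [1, 2, 3, 4],
    "height_cm": [160.5, 172.0, 158.3, 181.2],
    "eye_colour": ["brown", "blue", "green", "brown"],
})

print(df.shape)   # (4, 3) -> 4 rows (units), 3 columns (features)
print(df.head())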
Types of data:
There are mainly two types of data, they are:
Qualitative data:
Qualitative data consists of information that cannot be counted, quantified, or expressed
simply using numbers. It is gathered from text, audio, and pictures and distributed using
data visualization tools, including word clouds, concept maps, graph databases, timelines,
and infographics.
The objective of qualitative data analysis is to answer questions about the activities and
motivations of individuals. Collecting and analyzing this kind of data may be time-
consuming. A researcher or analyst who works with qualitative data is referred to as a
qualitative researcher or analyst.
Qualitative data can give essential insights for any sector, user group, or product.
Nominal data
In statistics, nominal data (also known as nominal scale) is used to designate variables
without giving a numerical value. It is the most basic type of measuring scale. In contrast
to ordinal data, nominal data cannot be ordered or quantified.
For example, a person's name, hair colour, nationality, etc. Consider
a girl named Aby whose hair is brown and who is from America.
Nominal data may be both qualitative and quantitative. Yet, there is no numerical value
or link associated with the quantitative labels (e.g., identification number). In contrast,
several qualitative data categories can be expressed in nominal form. These might consist
of words, letters, and symbols. Names of individuals, gender, and nationality are some of
the most prevalent instances of nominal data.
Using the grouping approach, nominal data can be analyzed. The variables may be sorted
into groups, and the frequency or percentage can be determined for each category. The
data may also be shown graphically, for example using a pie chart.
Although nominal data cannot be processed using mathematical operators, they
may still be studied using statistical techniques. Hypothesis testing is one approach to
assess and analyse the data.
With nominal data, nonparametric tests such as the chi-squared test may be used to test
hypotheses. The purpose of the chi-squared test is to evaluate whether there is a
statistically significant discrepancy between the predicted frequency and the actual
frequency of the provided values.
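As a small sketch of this idea (the hair-colour counts below are made up for illustration), a
chi-squared goodness-of-fit test on nominal counts might look like this with scipy:

from scipy.stats import chisquare

# Hypothetical nominal data: observed counts of hair colour in a sample.
observed = [40, 35, 25]   # brown, black, blonde

# By default, chisquare() tests against equal expected frequencies.
stat, p_value = chisquare(observed)
print(f"chi-squared statistic = {stat:.2f}, p-value = {p_value:.4f}")

# A small p-value (e.g. < 0.05) suggests the observed frequencies differ
# significantly from the expected (equal) frequencies.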
Ordinal data:
Ordinal data is a type of data in statistics where the values are in a natural order. One of
the most important things about ordinal data is that you can't tell what the differences
between the data values are. Most of the time, the width of the data categories doesn't
match the increments of the underlying attribute.
In some cases, the characteristics of interval or ratio data can be found by grouping the
values of the data. For instance, the ranges of income are ordinal data, while the actual
income is ratio data.
Ordinal data cannot be manipulated with mathematical operators the way interval or ratio
data can. Because of this, the median is the only appropriate measure of the middle of a set
of ordinal data.
This data type is widely found in the fields of finance and economics. Consider an economic
study that examines the GDP levels of various nations. If the report rates the nations
based on their GDP, the rankings are ordinal statistics.
Using visualisation tools to evaluate ordinal data is the easiest method. For example, the
data may be displayed as a table where each row represents a separate category. In
addition, they may be represented graphically using different charts. The bar chart is the
most popular style of graph used to display these types of data.
Ordinal data may also be studied using sophisticated statistical analysis methods like
hypothesis testing. Note that parametric procedures such as the t-test and ANOVA cannot
be applied to these data sets. Only nonparametric tests, such as the Mann-Whitney U test or
the Wilcoxon matched-pairs test, may be used to evaluate the null hypothesis about the data.
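As a minimal sketch (the satisfaction ratings below are invented for illustration), a
Mann-Whitney U test on ordinal data from two independent groups could be run with scipy as follows:

from scipy.stats import mannwhitneyu

# Hypothetical ordinal data: satisfaction ratings (1 = poor ... 5 = excellent)
# collected from two independent groups of customers.
group_a = [3, 4, 2, 5, 4, 3, 4]
group_b = [2, 1, 3, 2, 3, 2, 1]

stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U statistic = {stat:.1f}, p-value = {p_value:.4f}")

# A small p-value suggests the two groups' ratings differ in rank,
# without assuming any particular distance between rating levels.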
Methods of collecting qualitative data:
1. Data records: Utilizing existing data as the data source is a good
technique for qualitative research. Similar to visiting a library, you may examine
books and other reference materials to obtain data that can be utilised for research.
2. Interviews: Personal interviews are one of the most common ways to get
deductive data for qualitative research. The interview may be casual and not have
a set plan. It is often like a conversation. The interviewer or researcher gets the
information straight from the interviewee.
3. Focus groups: Focus groups are made up of 6 to 10 people who talk to each other.
The moderator's job is to keep an eye on the conversation and direct it based on
the focus questions.
4. Case Studies: Case studies are in-depth analyses of an individual or group, with
an emphasis on the relationship between developmental characteristics and the
environment.
5. Observation: It is a technique where the researcher observes the subject and takes
detailed notes to capture innate responses and reactions without
prompting.
Quantitative data:
Quantitative data consists of numerical values on which mathematical operations, such as
addition, can be performed. Because of its numerical character, quantitative data is
mathematically verifiable and evaluable.
Discrete Data:
These are data that can only take on certain values, as opposed to a range. For instance,
the number of people of each blood type or gender in a population is discrete data.
Example of discrete quantitative data may be the number of visitors to your website; you
could have 150 visits in one day, but not 150.6 visits. Usually, tally charts, bar charts, and
pie charts are used to represent discrete data.
Since it is simple to summarise and count, discrete data is often utilised in elementary
statistical analysis.
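As a minimal sketch (the daily visit counts below are made up), discrete data can be tallied
and shown as a bar chart with pandas and matplotlib:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical discrete data: number of visits recorded per day.
visits = pd.Series([150, 162, 150, 171, 162, 150, 180], name="daily_visits")

# Discrete values can be tallied directly ...
counts = visits.value_counts().sort_index()
print(counts)

# ... and displayed with a bar chart, a common way to present discrete data.
counts.plot(kind="bar")
plt.xlabel("visits per day")
plt.ylabel("number of days")
plt.show()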
Continuous Data:
These are data that may take any value within a certain range, including the greatest and
lowest possible values. The difference between the greatest and least value is known as the data
range. For instance, the height and weight of your school's children. This is considered
continuous data. The tabular representation of continuous data is known as a frequency
distribution. These may be depicted visually using histograms.
Continuous data, on the other hand, can be purely numeric or spread out over time and
dates. This data type calls for more advanced statistical analysis methods because there are an
infinite number of possible values. The important characteristics of continuous data
are:
1. Continuous data changes over time, and at different points in time, it can have
different values.
2. Random variables, which may or may not be whole numbers, make up continuous
data.
3. Data analysis tools such as line graphs and skewness measures are used to describe
continuous data.
4. One type of continuous data analysis that is often used is regression analysis.
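As a sketch of the regression analysis mentioned in item 4 (the height and weight values below
are invented for illustration), a simple linear regression on continuous data might look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical continuous data: children's heights (cm) and weights (kg).
heights = np.array([[120.5], [132.0], [140.3], [151.2], [160.8]])
weights = np.array([23.1, 28.4, 33.0, 41.5, 49.2])

# Fit a simple linear regression of weight on height.
model = LinearRegression()
model.fit(heights, weights)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("predicted weight at 145 cm:", model.predict([[145.0]])[0])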
Methods of collecting quantitative data:
1. Surveys and questionnaires: These types of research are good for getting
detailed feedback from users and customers, especially about how people feel
about a product, service, or experience.
2. Open-source datasets: There are a lot of public datasets that can be found
online and analysed for free. Researchers sometimes look at data that has
already been collected and try to figure out what it means in a way that fits their
own research project.
3. Experiments: A common method is an experiment, which usually has a control
group and an experimental group. The experiment is set up so that it can be
controlled and the conditions can be changed as needed.
4. Sampling: When there are a lot of data points, it may not be possible to survey
each person or data point. In this case, quantitative research is done with the help
of sampling. Sampling is the process of choosing a sample of data that is
representative of the whole. The two types of sampling are Random sampling (also
called probability sampling), and non-random sampling.
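As a minimal sketch of the random sampling described in item 4 (the population below is
synthetic), a simple random sample can be drawn with pandas:

import pandas as pd

# Hypothetical population: 1,000 survey respondents with an age column.
population = pd.DataFrame({"respondent_id": range(1000),
                           "age": [20 + (i % 45) for i in range(1000)]})

# Simple random (probability) sampling: draw 50 respondents without replacement.
sample = population.sample(n=50, random_state=42)
print(sample.shape)          # (50, 2)
print(sample["age"].mean())  # sample estimate of the population's mean age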
1) Primary Data: These are the data that are acquired for the first time for a particular
purpose by an investigator. Primary data are 'pure' in the sense that they have not
been subjected to any statistical manipulations and are authentic. Examples of
primary data include the Census of India.
2) Secondary Data: These are data that were initially gathered by some other entity.
This indicates that this kind of data has already been collected by researchers or
investigators and is accessible in either published or unpublished form. This data is
less 'pure' because statistical computations may have previously been performed on it.
For example, Information accessible on the website of the Government of India or
the Department of Finance, or in other archives, books, journals, etc.
Big data:
Big data is defined as data of such large volume that handling it requires overcoming
logistical challenges. Big data refers to bigger, more complicated data collections,
particularly from novel data sources. Some data sets are so extensive that conventional
data processing software is incapable of handling them. Yet these vast quantities of data
can be used to solve business challenges that were previously unsolvable.
Data science is the study of how to analyse huge amounts of data and extract information
from them. You can compare big data and data science to crude oil and an oil refinery.
Data science and big data grew out of statistics and traditional ways of managing data,
but they are now seen as separate fields.
People often use the three Vs to describe the characteristics of big data: volume, velocity,
and variety.
Data science- Lifecycle
A standard data science lifecycle approach comprises the use of machine learning
algorithms and statistical procedures that result in more accurate prediction models. Data
extraction, preparation, cleaning, modelling, assessment, etc., are some of the most
important data science stages. This methodology is known as the "Cross-Industry Standard
Process for Data Mining" (CRISP-DM) in the field of data science.
How many phases are there in the data science life cycle?
There are mainly six phases in the data science life cycle: identifying the problem and
understanding the business, data collection, data processing, data analysis, data
modelling, and model deployment.
Identifying problem and understanding the business:
The data science lifecycle starts with "why?" just like any other business lifecycle. One of
the most important parts of the data science process is figuring out what the problem is.
This helps to find a clear goal around which all the other steps can be planned out. In
short, it's important to know the business goal as early as possible because it will determine what
the end goal of the analysis will be.
This phase should evaluate the trends of business, assess case studies of comparable
analyses, and research the industry’s domain. The group will evaluate the feasibility of the
project given the available employees, equipment, time, and technology. When these
factors have been discovered and assessed, a preliminary hypothesis will be formulated to
address the business issues resulting from the existing environment. This phase should:
1. Specify the issue, why it must be resolved immediately, and why it
demands an answer.
2. Specify the business project's potential value.
3. Identify dangers, including ethical concerns, associated with the project.
4. Create and convey a flexible, highly integrated project plan.
Data collection:
The next step in the data science lifecycle is data collection, which means getting raw data
from the appropriate and reliable source. The data that is collected can be either organized
or unorganized. The data could be collected from website logs, social media data, online
data repositories, and even data that is streamed from online sources using APIs, web
scraping, or data that could be in Excel or any other source.
The person doing the job should know the difference between the different data sets that
are available and how an organization invests its data. Professionals find it hard to keep
track of where each piece of data comes from and whether it is up to date or not. During
the whole lifecycle of a data science project, it is important to keep track of this information
because it could help test hypotheses or run any other new experiments.
The information may be gathered through surveys or through the more prevalent method of
automated data gathering, such as internet cookies, which are a primary source of
unanalysed data.
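As a hedged illustration of automated data gathering (the API endpoint below is a hypothetical
placeholder, not a real service), data might be collected over HTTP with the requests library:

import requests

# Hypothetical example: the URL below is a placeholder, not a real endpoint.
url = "https://api.example.com/v1/visits"

response = requests.get(url,
                        params={"from": "2023-01-01", "to": "2023-01-31"},
                        timeout=10)
response.raise_for_status()   # fail loudly if the request did not succeed
records = response.json()     # assume the API returns a list of JSON records
print(len(records), "records collected")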
We can also use secondary data, such as open-source datasets. There are many websites
from which we can collect data, for example
Kaggle (https://www.kaggle.com/datasets), the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/index.php), and Google Public
Datasets (https://cloud.google.com/bigquery/public-data/). There are also some predefined
datasets available in Python libraries such as scikit-learn. Let's load the Iris dataset and
use it to illustrate the phases of data science.
# Import the required libraries and load the Iris dataset
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
# Create a dataframe
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df['target'] = iris.target
X = iris.data
Data processing
After collecting high-quality data from reliable sources, the next step is to process it. The
purpose of data processing is to check whether there is any problem with the acquired data so
that it can be resolved before proceeding to the next phase. Without this step, we may
produce mistakes or inaccurate findings.
There may be several difficulties with the obtained data. For instance, the data may have
several missing values in multiple rows or columns. It may include several outliers,
inaccurate numbers, timestamps with varying time zones, etc. The data may potentially
have problems with date ranges. In certain nations, the date is formatted as DD/MM/YYYY,
and in others, it is written as MM/DD/YYYY. During the data collecting process numerous
problems can occur, for instance, if data is gathered from many thermometers and any of
them are defective, the data may need to be discarded or recollected.
At this phase, various concerns with the data must be resolved. Several of these problems
have multiple solutions, for example, if the data includes missing values, we can either
replace them with zero or the column's mean value. However, if the column is missing a
large number of values, it may be preferable to remove the column completely since it has
so little data that it cannot be used in our data science life cycle method to solve the issue.
When the time zones are all mixed up, we cannot utilize the data in those columns and
may have to remove them until we can define the time zones used in the supplied
timestamps. If we know the time zones in which each timestamp was gathered, we may
convert all timestamp data to a certain time zone. In this manner, there are a number of
strategies to address concerns that may exist in the obtained data.
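As a minimal sketch of two such fixes (the readings and timestamps below are invented), missing
values can be replaced with the column mean and mixed time zones converted to a single zone
with pandas:

import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value and timestamps in different zones.
raw = pd.DataFrame({
    "temperature": [21.5, np.nan, 23.1, 22.4],
    "timestamp": ["2023-03-01 10:00+05:30", "2023-03-01 10:00+00:00",
                  "2023-03-01 11:00+05:30", "2023-03-01 12:00+00:00"],
})

# Replace the missing reading with the column mean (one of several options).
raw["temperature"] = raw["temperature"].fillna(raw["temperature"].mean())

# Parse the timestamps and convert them all to a single time zone (UTC).
raw["timestamp"] = pd.to_datetime(raw["timestamp"], utc=True)

print(raw)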
We will access the data and then store it in a dataframe using python.
# Load Data
iris = load_iris()
# Create a dataframe
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df['target'] = iris.target
X = iris.data
All data must be in numeric representation for machine learning models. This implies that
if a dataset includes categorical data, it must be converted to numeric values before the
model can be executed. So we will be implementing label encoding.
Label encoding:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Map the numeric target codes to species names
species = []
for i in range(len(df['target'])):
    if df['target'][i] == 0:
        species.append("setosa")
    elif df['target'][i] == 1:
        species.append('versicolor')
    else:
        species.append('virginica')
df['species'] = species

# Inspect a random sample of rows
df.sample(10)

# Encode the string species labels as integers
labels = np.asarray(df.species)
le = LabelEncoder()
le.fit(labels)
labels = le.transform(labels)

# Keep only the petal measurements and the numeric target as features
df_selected1 = df.drop(['sepal length (cm)', 'sepal width (cm)', "species"], axis=1)
Data analysis
Exploratory Data Analysis (EDA) is a set of visual techniques for analysing
data. With this method, we may get specific details on the statistical summary of the data.
Also, we will be able to deal with duplicate values and outliers, and identify trends or
patterns within the collection.
At this phase, we attempt to get a better understanding of the acquired and processed
data. We apply statistical and analytical techniques to make conclusions about the data
and determine the link between several columns in our dataset. Using pictures, graphs,
charts, plots, etc., we may use visualisations to better comprehend and describe the data.
Professionals use data statistical techniques such as the mean and median to better
comprehend the data. Using histograms, spectrum analysis, and population distribution,
they also visualise data and evaluate its distribution patterns. The data will be analysed
based on the problems.
Below code is used to check if there are any null values in the dataset:
df.isnull().sum()
Output:
From the above output we can conclude that there are no null values in the dataset as the
sum of all the null values in the column is 0.
We will use the shape attribute to check the shape (rows, columns) of the dataset:
df.shape
Output:
(150, 5)
Now we will use info() to check the columns and their data types:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 target 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Only the target column represents categorical information (the species, encoded as integers);
all columns contain non-null numeric values.
Now we will use describe() on the data. The describe() method performs fundamental
statistical calculations on a dataset, such as extreme values, the number of data points,
standard deviation, etc. Any missing or NaN values are automatically disregarded. The
describe() method gives a concise picture of the distribution of the data.
df.describe()
Output:
Data visualization:
Target column: Our target column will be the Species column since we will only want
results based on species in the end.
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='species', data=df)
plt.show()
Output:
There are many other visualization plots in Data Science. To learn more about them, refer to
https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_understanding_data_with_visualization.htm
Data modelling:
Data Modelling is one of the most important aspects of data science and is sometimes
referred to as the core of data analysis. The intended output of a model should be derived
from prepared and analysed data. The environment required to execute the data model
will be chosen and constructed, before achieving the specified criteria.
At this phase, we develop datasets for training and testing the model for production-
related tasks. It also involves selecting the correct model type and determining whether the
problem involves classification, regression, or clustering. After deciding on the model type,
we must choose the appropriate implementation algorithms. This must be done with
care, as it is crucial to extract the relevant insights from the provided data.
Here machine learning comes into the picture. Machine learning is broadly divided into
classification, regression, and clustering models, and each model has algorithms
that are applied to the dataset to extract the relevant information. These models are used in
this phase. We will discuss these models in detail in the machine learning chapter.
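As a sketch of this phase (the choice of a random forest and the 70/30 split are illustrative
assumptions, not requirements of the lifecycle), the Iris data could be split into training and
testing sets and a classifier fitted as follows:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris data again so this sketch is self-contained.
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets for the modelling phase.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A random forest is one possible choice for this classification problem.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, predictions))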
Model deployment:
We have reached the final stage of the data science lifecycle. The model is finally ready to
be deployed in the desired format and chosen channel after a detailed review process.
Note that the machine learning model has no utility unless it is deployed in production.
Generally speaking, these models are associated and integrated with products and
applications.
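As a minimal sketch of one common hand-off into production (persisting the model with joblib is
an illustrative assumption, not the only deployment route), a trained model can be saved to disk
and reloaded by the serving application:

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a model (as in the modelling phase) and persist it to disk.
iris = load_iris()
model = RandomForestClassifier(random_state=42).fit(iris.data, iris.target)
joblib.dump(model, "iris_model.joblib")

# Later, an application or service can load the saved model and serve predictions.
loaded_model = joblib.load("iris_model.joblib")
sample = [[5.1, 3.5, 1.4, 0.2]]       # one hypothetical flower measurement
print(loaded_model.predict(sample))   # -> predicted class label (0, 1, or 2)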
Who are all involved in Data Science lifecycle?
Data is being generated, collected, and stored on voluminous servers and data warehouses
from the individual level to the organisational level. But how will you access this massive
data repository? This is where the data scientist comes in, since he or she is a specialist
in extracting insights and patterns from unstructured text and statistics.
Below, we present the many job profiles of the data science team participating in the data
science lifecycle.