Data Science Notes - 1-PD
Data science is the process of extracting and analysing useful information from data to
solve problems that are difficult to solve analytically. For example, when you visit an e-
commerce site and look at a few categories and products before making a purchase, you
are creating data that Analysts can use to figure out how you make purchases.
It involves different disciplines like mathematical and statistical modelling, extracting data
from its source and applying data visualization techniques. It also involves handling big
data technologies to gather both structured and unstructured data.
It helps you find patterns that are hidden in the raw data. The term "Data Science" has
evolved because mathematical statistics, data analysis, and "big data" have changed over
time.
Data science is an interdisciplinary field that lets you learn from both organised and
unorganised data. With data science, you can turn a business problem into a research
project and then apply the findings to a real-world solution.
Peter Naur suggested the phrase "data science" as an alternative name for computer
science in 1974. The International Federation of Classification Societies was the first
conference to highlight data science as a special subject in 1996. Yet, the concept
remained in flux. Following his 1985 lecture at the Chinese Academy of Sciences in
Beijing, C. F. Jeff Wu again advocated in 1997 for renaming statistics to data science.
He reasoned that a new name would help statistics shed inaccurate stereotypes and
perceptions, such as being associated with accounting or confined to data description.
Hayashi Chikio proposed data science in 1998 as a new, multidisciplinary concept with
three components: data design, data collecting, and data analysis.
In the 1990s, "knowledge discovery" and "data mining" were popular phrases for the
process of identifying patterns in datasets that were growing in size.
In 2012, Thomas H. Davenport and DJ Patil proclaimed "Data Scientist: The
Sexiest Job of the 21st Century," a term that was taken up by major metropolitan
publications such as the New York Times and the Boston Globe. They repeated it a decade
later, adding that "the position is in more demand than ever"
In 2008, DJ Patil and Jeff Hammerbacher adopted the professional title of "data
scientist." Although the term was used by the National Science Board in its 2005 report "Long-
Lived Digital Data Collections: Enabling Research and Education in the 21st Century," it
referred to any significant role in administering a digital data collection.
An agreement has not yet been reached on the meaning of data science, and some believe
it to be a buzzword. Big data is a similar concept in marketing. Data scientists are
responsible for transforming massive amounts of data into useful information and
developing software and algorithms that assist businesses and organisations in
determining optimum operations.
1) Data is the oil of the modern age. With the proper tools, technologies, and
algorithms, we can leverage data to create a unique competitive edge.
2) Data Science may assist in detecting fraud using sophisticated machine learning
techniques.
3) It helps you avoid severe financial losses.
4) Enables the development of intelligent machines
5) You may use sentiment analysis to determine the brand loyalty of your customers.
This helps you to make better and quicker choices.
6) It enables you to propose the appropriate product to the appropriate consumer in
order to grow your company.
3) Demand and average salary of a data scientist:
a) According to India Today, India is the second biggest centre for data science in
the world due to the fast digitalization of companies and services. By 2026,
analysts anticipate that the nation will have more than 11 million employment
opportunities. In fact, recruiting in the data science field has surged by 46% since
2019.
b) Bank of America was one of the first financial institutions to provide mobile
banking to its consumers a decade ago. Recently, the Bank of America introduced
Erica, its first virtual financial assistant. It is regarded as one of the best financial
innovations in the world.
Erica now serves as a client adviser for more than 45 million consumers worldwide.
Erica uses Voice Recognition to receive client feedback, which represents a
technical development in Data Science.
c) The learning curves for Data Science and Machine Learning are steep. Although India sees a
massive influx of data scientists each year, relatively few possess the needed skill
set and specialization. As a consequence, people with specialised data skills are in
great demand.
The healthcare industry has benefitted from the rise of data science. In 2008, Google
employees realised that they could monitor influenza strains in real time. Previous
technologies could only provide weekly updates on instances. Google was able to build one
of the first systems for monitoring the spread of diseases by using data science.
The sports sector has similarly profited from data science. A data scientist in 2019 found
ways to measure and calculate how goal attempts increase a soccer team's odds of
winning. In reality, data science is utilised to easily compute statistics in several sports.
Government agencies also use data science on a daily basis. Governments throughout the
globe employ databases to monitor information regarding social security, taxes, and other
data pertaining to their residents. The government's usage of emerging technologies
continues to develop.
Since the Internet has become the primary medium of human communication, the
popularity of e-commerce has also grown. With data science, online firms may monitor
the whole of the customer experience, including marketing efforts, purchases, and
consumer trends. Targeted ads are one of the best examples of eCommerce firms using
data science. Have you ever looked for anything online or visited an eCommerce product
website, only to be bombarded by advertisements for that product on social networking
sites and blogs?
Ad pixels are integral to the online gathering and analysis of user information. Companies
leverage online consumer behaviour to retarget prospective consumers throughout the
internet. This usage of client information extends beyond eCommerce. Apps such as Tinder
and Facebook use algorithms to help users locate precisely what they are seeking. The
Internet is a growing treasure trove of data, and the gathering and analysis of this data
will also continue to expand.
Data Science — What is data?
Data comes in many shapes and forms, but can generally be thought of as being the result
of some random experiment — an experiment whose outcome cannot be determined in
advance, but whose workings are still subject to analysis. Data from a random experiment
are often stored in a table or spreadsheet. By statistical convention, variables (often called
features) are stored as columns and individual items (or units) as rows.
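As a minimal illustration (the values below are invented for this example), such a table can be
built as a pandas DataFrame, with each feature stored as a column and each unit as a row:

import pandas as pd

# Hypothetical outcomes of a small random experiment:
# each row is one observed unit, each column is a feature.
df = pd.DataFrame({
    "unit_id": [1, 2, 3, 4],
    "height_cm": [160.5, 172.0, 158.3, 181.2],
    "eye_colour": ["brown", "blue", "green", "brown"],
})

print(df.shape)   # (4, 3) -> 4 rows (units), 3 columns (features)
print(df.head())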
Types of data:
There are mainly two types of data, they are:
Qualitative data:
Qualitative data consists of information that cannot be counted, quantified, or expressed
simply using numbers. It is gathered from text, audio, and pictures and distributed using
data visualization tools, including word clouds, concept maps, graph databases, timelines,
and infographics.
The objective of qualitative data analysis is to answer questions about the activities and
motivations of individuals. Collecting and analyzing this kind of data may be time-
consuming. A researcher or analyst who works with qualitative data is referred to as a
qualitative researcher or analyst.
Qualitative data can give essential insights for any sector, user group, or product.
Nominal data
In statistics, nominal data (also known as nominal scale) is used to designate variables
without giving a numerical value. It is the most basic type of measuring scale. In contrast
to ordinal data, nominal data cannot be ordered or quantified.
For example, a person's name, hair colour, nationality, etc. Consider
a girl named Aby whose hair is brown and who is from America.
Nominal data may be both qualitative and quantitative. Yet, there is no numerical value
or link associated with the quantitative labels (e.g., identification number). In contrast,
several qualitative data categories can be expressed in nominal form. These might consist
of words, letters, and symbols. Names of individuals, gender, and nationality are some of
the most prevalent instances of nominal data.
Using the grouping approach, nominal data can be analyzed. The variables may be sorted
into groups, and the frequency or percentage can be determined for each category. The
data may also be shown graphically, for example using a pie chart.
Although nominal data cannot be processed using mathematical operators, they
may still be studied using statistical techniques. Hypothesis testing is one approach to
assess and analyse the data.
With nominal data, nonparametric tests such as the chi-squared test may be used to test
hypotheses. The purpose of the chi-squared test is to evaluate whether there is a
statistically significant discrepancy between the predicted frequency and the actual
frequency of the provided values.
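As a small sketch of this idea (the hair-colour counts below are made up for illustration), a
chi-squared goodness-of-fit test on nominal counts might look like this with scipy:

from scipy.stats import chisquare

# Hypothetical nominal data: observed counts of hair colour in a sample.
observed = [40, 35, 25]   # brown, black, blonde

# By default, chisquare() tests against equal expected frequencies.
stat, p_value = chisquare(observed)
print(f"chi-squared statistic = {stat:.2f}, p-value = {p_value:.4f}")

# A small p-value (e.g. < 0.05) suggests the observed frequencies differ
# significantly from the expected (equal) frequencies.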
Ordinal data:
Ordinal data is a type of data in statistics where the values are in a natural order. One of
the most important things about ordinal data is that you can't tell what the differences
between the data values are. Most of the time, the width of the data categories doesn't
match the increments of the underlying attribute.
In some cases, the characteristics of interval or ratio data can be found by grouping the
values of the data. For instance, the ranges of income are ordinal data, while the actual
income is ratio data.
Ordinal data cannot be manipulated with mathematical operators the way interval or ratio
data can. Because of this, the median is the only appropriate measure of the middle of a set
of ordinal data.
This data type is widely found in the fields of finance and economics. Consider an economic
study that examines the GDP levels of various nations. If the report rates the nations
based on their GDP, the rankings are ordinal statistics.
Using visualisation tools to evaluate ordinal data is the easiest method. For example, the
data may be displayed as a table where each row represents a separate category. In
addition, they may be represented graphically using different charts. The bar chart is the
most popular style of graph used to display these types of data.
Ordinal data may also be studied using sophisticated statistical analysis methods like
hypothesis testing. Note that parametric procedures such as the t-test and ANOVA cannot
be applied to these data sets. Only nonparametric tests, such as the Mann-Whitney U test or
the Wilcoxon matched-pairs test, may be used to evaluate the null hypothesis about the data.
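As a minimal sketch (the satisfaction ratings below are invented for illustration), a
Mann-Whitney U test on ordinal data from two independent groups could be run with scipy as follows:

from scipy.stats import mannwhitneyu

# Hypothetical ordinal data: satisfaction ratings (1 = poor ... 5 = excellent)
# collected from two independent groups of customers.
group_a = [3, 4, 2, 5, 4, 3, 4]
group_b = [2, 1, 3, 2, 3, 2, 1]

stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U statistic = {stat:.1f}, p-value = {p_value:.4f}")

# A small p-value suggests the two groups' ratings differ in rank,
# without assuming any particular distance between rating levels.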
Methods of collecting qualitative data:
1. Data records: Utilizing existing data as the data source is a good
technique for qualitative research. Similar to visiting a library, you may examine
books and other reference materials to obtain data that can be utilised for research.
2. Interviews: Personal interviews are one of the most common ways to get
deductive data for qualitative research. The interview may be casual and not have
a set plan. It is often like a conversation. The interviewer or researcher gets the
information straight from the interviewee.
3. Focus groups: Focus groups are made up of 6 to 10 people who talk to each other.
The moderator's job is to keep an eye on the conversation and direct it based on
the focus questions.
4. Case Studies: Case studies are in-depth analyses of an individual or group, with
an emphasis on the relationship between developmental characteristics and the
environment.
5. Observation: It is a technique where the researcher observes the subject and takes
detailed notes to capture innate responses and reactions without
prompting.
Quantitative data:
Quantitative data consists of numerical values on which mathematical operations, such as
addition, can be performed. Because of its numerical character, quantitative data is
mathematically verifiable and evaluable.
Discrete Data:
These are data that can only take on certain values, as opposed to a range. For instance,
the number of people of each blood type or gender in a population is discrete data.
Example of discrete quantitative data may be the number of visitors to your website; you
could have 150 visits in one day, but not 150.6 visits. Usually, tally charts, bar charts, and
pie charts are used to represent discrete data.
Since it is simple to summarise and count, discrete data is often utilised in elementary
statistical analysis.
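As a minimal sketch (the daily visit counts below are made up), discrete data can be tallied
and shown as a bar chart with pandas and matplotlib:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical discrete data: number of visits recorded per day.
visits = pd.Series([150, 162, 150, 171, 162, 150, 180], name="daily_visits")

# Discrete values can be tallied directly ...
counts = visits.value_counts().sort_index()
print(counts)

# ... and displayed with a bar chart, a common way to present discrete data.
counts.plot(kind="bar")
plt.xlabel("visits per day")
plt.ylabel("number of days")
plt.show()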
Continuous Data:
These are data that may take any value within a certain range, including the greatest and
lowest possible values. The difference between the greatest and least value is known as the data
range. For instance, the height and weight of your school's children. This is considered
continuous data. The tabular representation of continuous data is known as a frequency
distribution. These may be depicted visually using histograms.
Continuous data, on the other hand, can be purely numeric or spread out over time and
dates. This data type calls for more advanced statistical analysis methods because there are an
infinite number of possible values. The important characteristics of continuous data
are:
1. Continuous data changes over time, and at different points in time, it can have
different values.
2. Random variables, which may or may not be whole numbers, make up continuous
data.
3. Data analysis tools such as line graphs and skewness measures are used to describe
continuous data.
4. One type of continuous data analysis that is often used is regression analysis.
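As a sketch of the regression analysis mentioned in item 4 (the height and weight values below
are invented for illustration), a simple linear regression on continuous data might look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical continuous data: children's heights (cm) and weights (kg).
heights = np.array([[120.5], [132.0], [140.3], [151.2], [160.8]])
weights = np.array([23.1, 28.4, 33.0, 41.5, 49.2])

# Fit a simple linear regression of weight on height.
model = LinearRegression()
model.fit(heights, weights)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("predicted weight at 145 cm:", model.predict([[145.0]])[0])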
Methods of collecting quantitative data:
1. Surveys and questionnaires: These types of research are good for getting
detailed feedback from users and customers, especially about how people feel
about a product, service, or experience.
2. Open-source datasets: There are a lot of public datasets that can be found
online and analysed for free. Researchers sometimes look at data that has
already been collected and try to figure out what it means in a way that fits their
own research project.
3. Experiments: A common method is an experiment, which usually has a control
group and an experimental group. The experiment is set up so that it can be
controlled and the conditions can be changed as needed.
4. Sampling: When there are a lot of data points, it may not be possible to survey
each person or data point. In this case, quantitative research is done with the help
of sampling. Sampling is the process of choosing a sample of data that is
representative of the whole. The two types of sampling are Random sampling (also
called probability sampling), and non-random sampling.
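As a minimal sketch of the random sampling described in item 4 (the population below is
synthetic), a simple random sample can be drawn with pandas:

import pandas as pd

# Hypothetical population: 1,000 survey respondents with an age column.
population = pd.DataFrame({"respondent_id": range(1000),
                           "age": [20 + (i % 45) for i in range(1000)]})

# Simple random (probability) sampling: draw 50 respondents without replacement.
sample = population.sample(n=50, random_state=42)
print(sample.shape)          # (50, 2)
print(sample["age"].mean())  # sample estimate of the population's mean age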
1) Primary Data: These are the data that are acquired for the first time for a particular
purpose by an investigator. Primary data are 'pure' in the sense that they have not
been subjected to any statistical manipulations and are authentic. Examples of
primary data include the Census of India.
2) Secondary Data: These are data that were initially gathered by some other entity.
This indicates that this kind of data has already been collected by researchers or
investigators and is accessible in either published or unpublished form. This data is
less 'pure' because statistical computations may have previously been performed on it.
For example, Information accessible on the website of the Government of India or
the Department of Finance, or in other archives, books, journals, etc.
Big data:
Big data is defined as data of such large volume that handling it requires overcoming
logistical challenges. Big data refers to bigger, more complicated data collections,
particularly from novel data sources. Some data sets are so extensive that conventional
data processing software is incapable of handling them. Yet these vast quantities of data
can be used to solve business challenges that were previously unsolvable.
Data science is the study of how to analyse huge amounts of data and extract information
from them. You can compare big data and data science to crude oil and an oil refinery.
Data science and big data grew out of statistics and traditional ways of managing data,
but they are now seen as separate fields.
People often use the three Vs to describe the characteristics of big data: volume, velocity,
and variety.
Data science- Lifecycle
A standard data science lifecycle approach comprises the use of machine learning
algorithms and statistical procedures that result in more accurate prediction models. Data
extraction, preparation, cleaning, modelling, assessment, etc., are some of the most
important data science stages. This methodology is known as the "Cross-Industry Standard
Process for Data Mining" (CRISP-DM) in the field of data science.
How many phases are there in the data science life cycle?
There are mainly six phases in the data science life cycle: identifying the problem and
understanding the business, data collection, data processing, data analysis, data
modelling, and model deployment.
Identifying problem and understanding the business:
The data science lifecycle starts with "why?" just like any other business lifecycle. One of
the most important parts of the data science process is figuring out what the problem is.
This helps to find a clear goal around which all the other steps can be planned out. In
short, it's important to know the business goal as early as possible because it will determine what
the end goal of the analysis will be.
This phase should evaluate the trends of business, assess case studies of comparable
analyses, and research the industry’s domain. The group will evaluate the feasibility of the
project given the available employees, equipment, time, and technology. When these
factors have been discovered and assessed, a preliminary hypothesis will be formulated to
address the business issues resulting from the existing environment. This phase should:
1. Specify the issue, why it must be resolved immediately, and why it
demands an answer.
2. Specify the business project's potential value.
3. Identify dangers, including ethical concerns, associated with the project.
4. Create and convey a flexible, highly integrated project plan.
Data collection:
The next step in the data science lifecycle is data collection, which means getting raw data
from the appropriate and reliable source. The data that is collected can be either organized
or unorganized. The data could be collected from website logs, social media data, online
data repositories, and even data that is streamed from online sources using APIs, web
scraping, or data that could be in Excel or any other source.
The person doing the job should know the difference between the different data sets that
are available and how an organization invests its data. Professionals find it hard to keep
track of where each piece of data comes from and whether it is up to date or not. During
the whole lifecycle of a data science project, it is important to keep track of this information
because it could help test hypotheses or run any other new experiments.
The information may be gathered through surveys or through the more prevalent method of
automated data gathering, such as internet cookies, which are a primary source of
unanalysed data.
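As a hedged illustration of automated data gathering (the API endpoint below is a hypothetical
placeholder, not a real service), data might be collected over HTTP with the requests library:

import requests

# Hypothetical example: the URL below is a placeholder, not a real endpoint.
url = "https://api.example.com/v1/visits"

response = requests.get(url,
                        params={"from": "2023-01-01", "to": "2023-01-31"},
                        timeout=10)
response.raise_for_status()   # fail loudly if the request did not succeed
records = response.json()     # assume the API returns a list of JSON records
print(len(records), "records collected")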
We can also use secondary data, such as open-source datasets. There are many websites
from which we can collect data, for example
Kaggle (https://www.kaggle.com/datasets), the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/index.php), and Google Public
Datasets (https://cloud.google.com/bigquery/public-data/). There are also some predefined
datasets available in Python libraries such as scikit-learn. Let's load the Iris dataset and
use it to illustrate the phases of data science.
# Import the required libraries and load the Iris dataset
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
# Create a dataframe
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df['target'] = iris.target
X = iris.data
Data processing
After collecting high-quality data from reliable sources, the next step is to process it. The
purpose of data processing is to check whether there is any problem with the acquired data so
that it can be resolved before proceeding to the next phase. Without this step, we may
produce mistakes or inaccurate findings.
There may be several difficulties with the obtained data. For instance, the data may have
several missing values in multiple rows or columns. It may include several outliers,
inaccurate numbers, timestamps with varying time zones, etc. The data may potentially
have problems with date ranges. In certain nations, the date is formatted as DD/MM/YYYY,
and in others, it is written as MM/DD/YYYY. During the data collecting process numerous
problems can occur, for instance, if data is gathered from many thermometers and any of
them are defective, the data may need to be discarded or recollected.
At this phase, various concerns with the data must be resolved. Several of these problems
have multiple solutions, for example, if the data includes missing values, we can either
replace them with zero or the column's mean value. However, if the column is missing a
large number of values, it may be preferable to remove the column completely since it has
so little data that it cannot be used in our data science life cycle method to solve the issue.
When the time zones are all mixed up, we cannot utilize the data in those columns and
may have to remove them until we can define the time zones used in the supplied
timestamps. If we know the time zones in which each timestamp was gathered, we may
convert all timestamp data to a certain time zone. In this manner, there are a number of
strategies to address concerns that may exist in the obtained data.
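As a minimal sketch of two such fixes (the readings and timestamps below are invented), missing
values can be replaced with the column mean and mixed time zones converted to a single zone
with pandas:

import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value and timestamps in different zones.
raw = pd.DataFrame({
    "temperature": [21.5, np.nan, 23.1, 22.4],
    "timestamp": ["2023-03-01 10:00+05:30", "2023-03-01 10:00+00:00",
                  "2023-03-01 11:00+05:30", "2023-03-01 12:00+00:00"],
})

# Replace the missing reading with the column mean (one of several options).
raw["temperature"] = raw["temperature"].fillna(raw["temperature"].mean())

# Parse the timestamps and convert them all to a single time zone (UTC).
raw["timestamp"] = pd.to_datetime(raw["timestamp"], utc=True)

print(raw)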
We will access the data and then store it in a dataframe using python.
# Load Data
iris = load_iris()
# Create a dataframe
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df['target'] = iris.target
X = iris.data
All data must be in numeric representation for machine learning models. This implies that
if a dataset includes categorical data, it must be converted to numeric values before the
model can be executed. So we will be implementing label encoding.
Label encoding:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Map the numeric target codes to species names
species = []
for i in range(len(df['target'])):
    if df['target'][i] == 0:
        species.append("setosa")
    elif df['target'][i] == 1:
        species.append('versicolor')
    else:
        species.append('virginica')
df['species'] = species

# Inspect a random sample of rows
df.sample(10)

# Encode the string species labels as integers
labels = np.asarray(df.species)
le = LabelEncoder()
le.fit(labels)
labels = le.transform(labels)

# Keep only the petal measurements and the numeric target as features
df_selected1 = df.drop(['sepal length (cm)', 'sepal width (cm)', "species"], axis=1)
Data analysis
Exploratory Data Analysis (EDA) is a set of visual techniques for analysing
data. With this method, we may get specific details on the statistical summary of the data.
Also, we will be able to deal with duplicate values and outliers, and identify trends or
patterns within the collection.
At this phase, we attempt to get a better understanding of the acquired and processed
data. We apply statistical and analytical techniques to make conclusions about the data
and determine the link between several columns in our dataset. Using pictures, graphs,
charts, plots, etc., we may use visualisations to better comprehend and describe the data.
Professionals use data statistical techniques such as the mean and median to better
comprehend the data. Using histograms, spectrum analysis, and population distribution,
they also visualise data and evaluate its distribution patterns. The data will be analysed
based on the problems.
Below code is used to check if there are any null values in the dataset:
df.isnull().sum()
Output:
From the above output we can conclude that there are no null values in the dataset as the
sum of all the null values in the column is 0.
We will use the shape attribute to check the shape (rows, columns) of the dataset:
df.shape
Output:
(150, 5)
Now we will use info() to check the columns and their data types:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 target 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Only the target column represents categorical information (the species, encoded as integers);
all columns contain non-null numeric values.
Now we will use describe() on the data. The describe() method performs fundamental
statistical calculations on a dataset, such as extreme values, the number of data points,
standard deviation, etc. Any missing or NaN values are automatically disregarded. The
describe() method gives a concise picture of the distribution of the data.
df.describe()
Output:
Data visualization:
Target column: Our target column will be the Species column since we will only want
results based on species in the end.
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='species', data=df)
plt.show()
Output:
There are many other visualization plots in Data Science. To learn more about them, refer to
https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_understanding_data_with_visualization.htm
Data modelling:
Data Modelling is one of the most important aspects of data science and is sometimes
referred to as the core of data analysis. The intended output of a model should be derived
from prepared and analysed data. The environment required to execute the data model
will be chosen and constructed, before achieving the specified criteria.
At this phase, we develop datasets for training and testing the model for production-
related tasks. It also involves selecting the correct model type and determining whether the
problem involves classification, regression, or clustering. After deciding on the model type,
we must choose the appropriate implementation algorithms. This must be done with
care, as it is crucial to extract the relevant insights from the provided data.
Here machine learning comes into the picture. Machine learning is broadly divided into
classification, regression, and clustering models, and each model has algorithms
that are applied to the dataset to extract the relevant information. These models are used in
this phase. We will discuss these models in detail in the machine learning chapter.
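As a sketch of this phase (the choice of a random forest and the 70/30 split are illustrative
assumptions, not requirements of the lifecycle), the Iris data could be split into training and
testing sets and a classifier fitted as follows:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris data again so this sketch is self-contained.
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets for the modelling phase.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A random forest is one possible choice for this classification problem.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, predictions))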
Model deployment:
We have reached the final stage of the data science lifecycle. The model is finally ready to
be deployed in the desired format and chosen channel after a detailed review process.
Note that the machine learning model has no utility unless it is deployed in production.
Generally speaking, these models are associated and integrated with products and
applications.
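As a minimal sketch of one common hand-off into production (persisting the model with joblib is
an illustrative assumption, not the only deployment route), a trained model can be saved to disk
and reloaded by the serving application:

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a model (as in the modelling phase) and persist it to disk.
iris = load_iris()
model = RandomForestClassifier(random_state=42).fit(iris.data, iris.target)
joblib.dump(model, "iris_model.joblib")

# Later, an application or service can load the saved model and serve predictions.
loaded_model = joblib.load("iris_model.joblib")
sample = [[5.1, 3.5, 1.4, 0.2]]       # one hypothetical flower measurement
print(loaded_model.predict(sample))   # -> predicted class label (0, 1, or 2)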
Who are all involved in Data Science lifecycle?
Data is being generated, collected, and stored on voluminous servers and data warehouses
from the individual level to the organisational level. But how will you access this massive
data repository? This is where the data scientist comes in, since he or she is a specialist
in extracting insights and patterns from unstructured text and statistics.
Below, we present the many job profiles of the data science team participating in the data
science lifecycle.