TYBSC CS Data Science Munotes
Since the invention of computers, people have used the word data to mean
computer information, and this information is transmitted or stored. There
are different kinds of data; some of them are as follows:
Sound
Video
Single character
Number (integer or floating-point)
Picture
Boolean (true or false)
Text (string)
In a computer's storage, data is stored in the form of a series of binary
digits (bits) that contain the value 1 or 0. The information can be in terms
of pictures, text documents, software programs, audio or video clips, or
other kinds of data. The computer data may be stored in files and folders
on the computer's storage, and processed by the computer's CPU, which
utilizes logical operations to generate output (new data) from input data.
Since the data is stored on the computer in binary form (zeros and ones),
it can be processed, created, saved, and stored digitally. This allows data to
be sent from one computer to another with the help of various media
devices or a network connection. Furthermore, if you use data multiple
times, it does not deteriorate over time or lose quality.
1. Qualitative Data: This type of data cannot be counted, measured, or
easily expressed with the help of numbers. It can be collected from audio,
text, and pictures. It is shared via data visualization tools, such as concept
maps, clouds, infographics, timelines, and databases. For instance,
collecting data on attributes such as honesty, intelligence, creativity,
wisdom, and cleanliness about the students of any class would be
considered a sample of qualitative data.
Typically, it has two types: ethnographic data and interpretive data. Data
collected to understand how a group assigns context to an event is known
as ethnographic data. Data collected to understand how individuals
interpret their experiences is known as interpretive data.
Qualitative data analysis
Qualitative data can be analyzed using either a deductive or an inductive
approach. In the deductive technique, the analyst starts with a question and
evaluates the data subjectively in terms of that question. In the inductive
technique, he or she simply evaluates the data to look for patterns; in this
approach, the analyst has no agenda. Frequently, the inductive process is
also known as grounded theory. Generally, an inductive technique takes
more time as compared to the deductive technique.
snapshot to understand the qualitative dynamics that are able to affect
success.
o Porter's five forces: It is a framework used to improve the SWOT
analysis. Developed by Harvard professor Michael E. Porter, it
improves SWOT analysis by identifying and analyzing the internal and
external factors that are able to affect success.
Furthermore, QDAS (qualitative data analysis software) helps to collect,
organize, and analyze such data.
Although analysts can easily analyze quantitative data through any
software tool such as a spreadsheet, the analysis of qualitative data
depends on the researcher's skills and experience, which help to create
parameters from a small sample so that a larger data set can be examined.
2. Quantitative Data: These types of data can be measured, not simply
observed. The data can be numerically represented and used for statistical
analysis and mathematical calculations, and these mathematical
derivations can be used in real-life decisions. For example, counting the
number of students from a class who participate in different games lets the
mathematical calculation give an estimate of how many students are
playing each sport.
This data is any quantifiable information that is used to answer questions
such as "How much?", "How often?", and "How many?". These data can
be conveniently evaluated using mathematical techniques and can also be
verified. Usually, quantitative data is collected for statistical analysis by
sending surveys, questionnaires, or polls to a particular section of a
population. Furthermore, quantitative data makes several parameters
measurable and controllable, as it supports mathematical derivations.
Types of Quantitative Data
There are various types of quantitative data; some of them are as follows:
o Measurement of physical objects: It is commonly used to calculate
the measurement of any physical thing. For instance, the cubicle
assigned to each newly joined employee in an organization is carefully
measured.
I. Surveys
Surveys were traditionally conducted with the help of paper-based
methods and have gradually evolved into online mediums. Closed-ended
questions form a major part of these surveys, as they are more appropriate
for the collection of quantitative data. The survey contains answer options
for a particular question. Also, surveys are used to collect feedback from
an audience. Surveys are classified into different categories on the basis of
the time involved in completing them:
o Longitudinal Studies: In this, a market researcher conducts surveys
from a specific time period to another as it is a type of observational
research. When the primary objective is to collect and analyze a
pattern in data, this survey is often implemented.
o Cross-sectional studies: In this, a market researcher conducts surveys
at a particular time period. It helps to understand a particular subject
from the sample at a certain time period by implementing a
questionnaire.
There are some principles given below to administer a survey to
collect quantitative data:
o Use of Different Question Types: Closed-ended questions have to be
used in a survey to collect quantitative data. These questions can be a
combination of several types of questions as well as multiple-choice
questions like rating scale questions, semantic differential scale
questions, and more. It helps to collect data, which can be understood
and analyzed.
o Fundamental Levels of Measurement: Collection of quantitative
data is based on the four fundamental levels of measurement: nominal,
ordinal, interval, and ratio scales.
o Face-to-Face Interviews: In addition to the already asked survey
questions, an interviewer can prepare a list of important interview
questions. Thus, interviewers will be capable of providing complete
details about the topic under discussion. Also, an interviewer will get
help to collect more details about the topic by managing to bond with
the interviewee on a personal level, through which the responses also
improve.
o Computer-Assisted Personal Interview: In this method, the
interviewers are able to enter the collected data directly into the
computer or any other similar device. It is also called a one-on-one
interview technique. It helps to reduce processing time and benefits
interviewers, as they do not need to carry a hard copy of the
questionnaires and can simply enter the answers on a laptop.
o Online/Telephonic Interviews: Although telephone-based interviews
are not a modern technique, these interviews have also moved to
online mediums like Zoom or Skype, which provide the option of
conducting interviews over the network. Online interviews are
beneficial because they overcome the issue of distance between the
interviewer and the interviewee and save time. In the case of
telephonic interviews, however, the interview is simply a phone call.
offer the parameters that rank the most important, including in-depth
insight into purchasing decisions.
avenues and the frequency in any organization.
o Text analysis: In this method, intelligent tools make sense of,
quantify, or fashion qualitative and open-ended data into easily
understandable data. This method is helpful when the collected data is
unstructured and needs to be converted into a structured form that
makes it understandable.
Disadvantages of Quantitative Data
Some of the disadvantages of quantitative data are as follows:
1. Input
First, the computer must receive input before it can start to process
anything. For instance, to enter input into the computer, you have to type
on the keyboard.
2. Process
A computer uses a program to process the data it has received through
input into information. The program may organize, calculate, or
manipulate the data to create understandable information.
3. Output
After the data is processed into information, it is displayed to the user as
output. For example, when you use the Windows Calculator, the numbers
you enter are the input and the calculated result shown on the screen is the
output.
4. Storage
After processing, the data or information is saved for later retrieval. It
uses storage media like hard disk, floppy disk, etc.
5. What is the difference between data and information?
Decision making: Data is not collected for any specific purpose, hence it
cannot be used for decision making. Information is processed for a
specific purpose, hence it is widely used for decision making.
Measuring unit: The data is measured in bytes and bits. The information is
measured in meaningful units such as quantity.
Usefulness: The data may not be useful as it is when collected by the
researcher. Information is easily available to the researcher for use; hence
it is valuable and useful.
A compiler is required to translate a high-level language into a low-level
language.
Just as writers use text editors and accountants use spreadsheets, software
developers use IDEs to make their job easier. Developers do not have to
learn about all the tools separately and can instead focus on just one
application. The following are some reasons why developers use IDEs:
Syntax highlighting
An IDE can format the written text by automatically making some words
bold or italic, or by using different font colors. These visual cues make the
source code more readable and give instant feedback about accidental
syntax errors.
Refactoring support
Code refactoring is the process of restructuring the source code to make it
more efficient and readable without changing its core functionality. IDEs
can auto-refactor to some extent, allowing developers to improve their
code quickly and easily. Other team members understand readable code
faster, which supports collaboration within the team.
Compilation
An IDE compiles or converts the code into a simplified language that the
operating system can understand. Some programming languages
implement just-in-time compiling, in which the IDE converts human-
readable code into machine code from within the application.
Testing
The IDE allows developers to automate unit tests locally before the
software is integrated with other developers' code and more complex
integration tests are run.
Debugging
Debugging is the process of fixing any errors or bugs that testing reveals.
One of the biggest values of an IDE for debugging purposes is that you
can step through the code, line by line, as it runs and inspect code
behavior. IDEs also integrate several debugging tools that highlight bugs
in real time as the code is being written.
Local IDEs
Developers install and run local IDEs directly on their local machines.
They also have to download and install various additional libraries
depending on their coding preferences, project requirements, and
development language. While local IDEs are customizable and do not
require an internet connection once installed, they present several
challenges:
They can be time consuming and difficult to set up.
They consume local machine resources and can slow down machine
performance significantly.
Configuration differences between the local machine and the
production environment can give rise to software errors.
Cloud IDEs
Developers use cloud IDEs to write, edit, and compile code directly in the
browser so that they don't need to download software on their local
machines. Cloud-based IDEs have several advantages over traditional
IDEs. The following are some of these advantages:
Standardized development environment
Centrally configured cloud environments help developers avoid errors
that can occur due to local configuration differences.
Platform independence
Cloud IDEs work on the browser and are independent of local
development environments. This means they connect directly to the cloud
vendor's platform, and developers can use them from any machine.
Better performance
Building and compiling functions in an IDE requires a lot of memory and
can slow down the developer's computer. The cloud IDE uses compute
resources from the cloud and frees up the local machine’s resources.
Why is exploratory data analysis important in data science?
The main purpose of EDA is to help look at data before making any
assumptions. It can help identify obvious errors, as well as better
understand patterns within the data, detect outliers or anomalous events,
and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they
produce are valid and applicable to any desired business outcomes and
goals. EDA also helps stakeholders by confirming they are asking the right
questions. EDA can help answer questions about standard deviations,
categorical variables, and confidence intervals. Once EDA is complete and
insights are drawn, its features can then be used for more sophisticated
data analysis or modeling, including machine learning.
Clustering and dimension reduction techniques, which help create
graphical displays of high-dimensional data containing many
variables.
Univariate visualization of each field in the raw dataset, with summary
statistics.
K-means clustering, a clustering method in unsupervised
learning in which data points are assigned to K groups, i.e. the number
of clusters, based on the distance from each group's centroid. The data
points closest to a particular centroid are clustered under the same
category. K-means clustering is commonly used in market
segmentation, pattern recognition, and image compression (a short
sketch follows this list).
Predictive models, such as linear regression, use statistics and data to
predict outcomes.
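As a small illustration of the K-means technique mentioned above, here is a minimal Python sketch using scikit-learn (this assumes scikit-learn and NumPy are installed; the data is randomly generated purely for the example):

import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: 200 random points with 2 variables each
rng = np.random.default_rng(42)
points = rng.normal(size=(200, 2))

# Assign each point to one of K = 3 groups based on distance to the centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.labels_[:10])        # cluster assigned to the first 10 points
print(kmeans.cluster_centers_)    # coordinates of the 3 centroids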
Univariate graphical: Non-graphical methods don't provide a full
picture of the data. Graphical methods are therefore required. Common
types of univariate graphics include:
o Stem-and-leaf plots, which show all data values and the shape of the
distribution.
o Histograms, a bar plot in which each bar represents the frequency
(count) or proportion (count/total count) of cases for a range of values.
o Box plots, which graphically depict the five-number summary of
minimum, first quartile, median, third quartile, and maximum.
Multivariate nongraphical: Multivariate data arises from more than
one variable. Multivariate non-graphical EDA techniques generally
show the relationship between two or more variables of the data
through cross-tabulation or statistics.
Multivariate graphical: Multivariate data uses graphics to display
relationships between two or more sets of data. The most used graphic
is a grouped bar plot or bar chart with each group representing one
level of one of the variables and each bar within a group representing
the levels of the other variable.
Other common types of multivariate graphics include:
Scatter plot, which is used to plot data points on a horizontal and a
vertical axis to show how much one variable is affected by another.
Some of the most common data science tools used to create an EDA
include:
Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined
with dynamic typing and dynamic binding, make it very attractive for
rapid application development, as well as for use as a scripting or glue
language to connect existing components together. Python and EDA can
be used together to identify missing values in a data set, which is
important so you can decide how to handle missing values for machine
learning (see the short sketch after this list).
R: An open-source programming language and free software
environment for statistical computing and graphics supported by the R
Foundation for Statistical Computing. The R language is widely used
among statisticians in data science in developing statistical
observations and data analysis.
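As a small sketch of the missing-value check mentioned in the Python point above (assuming pandas is installed; the file name sales.csv is only a placeholder):

import pandas as pd

# Load a data set; the file name is a placeholder
df = pd.read_csv("sales.csv")

# Count missing values per column and show them as a percentage
print(df.isnull().sum())
print((df.isnull().mean() * 100).round(2))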
Data visualization converts large and small data sets into visuals, which
are easy for humans to understand and process.
In the world of Big Data, data visualization tools and technologies are
required to analyze vast amounts of information.
Data visualizations are common in your everyday life, and they often
appear in the form of graphs and charts. The combination of multiple
visualizations and bits of information is still referred to as an infographic.
Today's data visualization tools go beyond the charts and graphs used in
Microsoft Excel spreadsheets, displaying data in more sophisticated ways
such as dials and gauges, geographic maps, heat maps, pie charts, and
fever charts.
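For instance, here is a minimal matplotlib sketch (assuming matplotlib is installed; the monthly figures are made up for the example) that draws two simple chart types from the same small data set:

import matplotlib.pyplot as plt

# Made-up monthly sales figures, purely for illustration
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 90, 150, 110]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(months, sales)             # bar chart: compare values across categories
ax1.set_title("Sales by month")
ax2.pie(sales, labels=months)      # pie chart: each month's share of the total
ax2.set_title("Share of total sales")
plt.show()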
To craft an effective data visualization, you need to start with clean data
that is well-sourced and complete. After the data is ready to visualize, you
need to pick the right chart.
After you have decided on the chart type, you need to design and
customize the visualization as per your requirements.
Data visualization tools have been essential for democratizing data and
analytics and for making data-driven insights available to workers
throughout an organization. They are easier to operate than earlier
versions of BI software or traditional statistical analysis software. This has
led to a rise in lines of business implementing data visualization tools on
their own, without support from IT.
1.7 DIFFERENT TYPES OF DATA SOURCES
The sources of data can be classified into two types: statistical and non-
statistical. Statistical sources refer to data that is gathered for official
purposes and include censuses and officially administered surveys. Non-
statistical sources refer to data collected for other administrative purposes
or for the private sector.
1. Internal sources
When data is collected from reports and records of the organisation
itself, they are known as the internal sources.
For example, a company publishes its annual report on profit and loss,
total sales, loans, wages, etc.
2. External sources
When data is collected from sources outside the organisation, they are
known as the external sources. For example, if a tour and travel
company obtains information on Karnataka tourism from Karnataka
Transport Corporation, it would be known as an external source of
data.
Types of Data
A) Primary data
Primary data means first-hand information collected by an
investigator.
It is collected for the first time.
It is original and more reliable.
For example, the population census conducted by the government of
India after every ten years is primary data.
B) Secondary data
Secondary data refers to second-hand information.
It is not originally collected and rather obtained from already
published or unpublished sources.
For example, the address of a person taken from the telephone
directory or the phone number of a company taken from Just Dial are
secondary data.
1.8 SUMMARY
In this chapter we learned the basics of data science: what data is and the
difference between data and information, the types of data and how data
can be collected, an introduction to high-level programming languages,
the use of an Integrated Development Environment (IDE) and the types of
IDEs, how to explore data and how to analyze it to fetch the proper data
from the compiled data, how to display the fetched data in a proper, user-
understandable format, and finally the different ways and types of data
sources used to collect data.
1.9 QUESTIONS
1. What is Data?
2
DATA MANAGEMENT
Unit Structure
2.1 Objective
2.2 Introduction
2.3 Data Collection
2.4 Data cleaning/extraction
2.5 Data analysis
2.6 Modeling
2.7 Summary
2.8 Questions
2.1 OBJECTIVE
1) To study data management techniques.
2) To understand what data collection is and the need for data collection.
2.2 INTRODUCTION
Data management is the process of ingesting, storing, organizing and
maintaining the data created and collected by an organization. Effective
data management is a crucial piece of deploying the IT systems that run
business applications and provide analytical information to help drive
operational decision-making and strategic planning by corporate
executives, business managers and other end users.
Our society is highly dependent on data, which underscores the
importance of collecting it. Accurate data collection is necessary to make
informed business decisions, ensure quality assurance, and keep research
integrity.
During data collection, the researchers must identify the data types, the
sources of data, and what methods are being used. We will soon see that
there are many different data collection methods. There is heavy reliance
on data collection in research, commercial, and government fields.
Before an analyst begins collecting data, they must answer three
questions first:
What’s the goal or purpose of this research?
What kinds of data are they planning on gathering?
What methods and procedures will be used to collect, store, and process
the information?
Additionally, we can break up data into qualitative and quantitative types.
Qualitative data covers descriptions such as color, size, quality, and
appearance. Quantitative data, unsurprisingly, deals with numbers, such as
statistics, poll numbers, percentages, etc.
Why Do We Need Data Collection?
Before a judge makes a ruling in a court case or a general creates a plan of
attack, they must have as many relevant facts as possible. The best courses
of action come from informed decisions, and information and data are
synonymous.
The concept of data collection isn’t a new one, as we’ll see later, but the
world has changed. There is far more data available today, and it exists in
forms that were unheard of a century ago. The data collection process has
had to change and grow with the times, keeping pace with technology.
Whether you’re in the world of academia, trying to conduct research, or
part of the commercial sector, thinking of how to promote a new product,
you need data collection to help you make better choices.
Now that you know what data collection is and why we need it, let's take a
look at the different methods of data collection. While the phrase “data
collection” may sound all high-tech and digital, it doesn’t necessarily
entail things like computers, big data, and the internet. Data collection
could mean a telephone survey, a mail-in comment card, or even some guy
with a clipboard asking passersby some questions. But let’s see if we can
sort the different data collection methods into a semblance of organized
categories.
Transactional Tracking
Interviews and Focus Groups
Observation
Online Tracking
Forms
Social Media Monitoring
Data collection breaks down into two methods. As a side note, many
terms, such as techniques, methods, and types, are used interchangeably
depending on who uses them. One source may call data collection
techniques “methods,” for instance. But whatever labels we use, the
general concepts and breakdowns apply across the board whether we’re
talking about marketing analysis or a scientific research project.
The two methods are:
Primary
As the name implies, this is original, first-hand data collected by the data
researchers. This process is the initial information gathering step,
performed before anyone carries out any further or related research.
Primary data results are highly accurate provided the researcher collects
the information. However, there’s a downside, as first-hand research is
potentially time-consuming and expensive.
Secondary
Secondary data is second-hand data collected by other parties and has
already undergone statistical analysis.
Data cleaning takes time. In this section, we will look at eight common
steps in the data cleaning process, as mentioned below.
1. Removing duplicates
2. Remove irrelevant data
3. Standardize capitalization
4. Convert data type
5. Handling outliers
6. Fix errors
7. Language Translation
8. Handle missing values
Why is Data Cleaning So Important?
As an experienced Data Scientist, I have hardly seen any perfect data.
Real-world data is noisy and contains a lot of errors. They are not in
their best format. So, it becomes important to fix these data points.
It is estimated that data scientists spend between 80 to 90 percent of
their time in data cleaning. Your workflow should start with data
cleaning. You may likely duplicate or incorrectly classify data while
working with large datasets and merging several data sources. Your
algorithms and results will lose their accuracy if you have wrong or
incomplete data.
For example: consider data where we have the gender column. If the
data is being filled in manually, then there is a chance that the column
contains inconsistent or incorrectly classified values that will need to be
fixed.
Data Cleaning Process [In 8 Steps]
Step 1: Removing duplicates
When you are working with large datasets, working across multiple
data sources, or have not implemented any quality checks before adding
an entry, your data will likely show duplicated values.
These duplicated values add redundancy to your data and can make
your calculations go wrong. Duplicate serial numbers of products in a
dataset will give you a higher count of products than the actual
numbers.
Duplicate email IDs or mobile numbers might cause your
communication to look more like spam. We take care of these duplicate
values in this step.
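A minimal pandas sketch of this de-duplication step (assuming pandas is installed; the column names and values are invented for illustration):

import pandas as pd

# Illustrative records containing an exact duplicate row
df = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "b@example.com"],
    "product": ["Book", "Book", "Pen"],
})

# Drop exact duplicate rows, keeping only the first occurrence
df = df.drop_duplicates(keep="first")
print(df)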
Step 3: Standardize capitalization
The most preferred code case is the snake case or cobra case.
Cobra case is a writing style in which the first letter of each word is
written in uppercase, and each space is substituted by the underscore
(_) character. While, in the snake case, the first letter of each word is
written in lowercase and each space is substituted by the underscore.
Therefore, the column name "Total Sales" can be written as
"Total_Sales" in the cobra case and "total_sales" in the snake case.
Along with the column names, the capitalization of the data points
should also be fixed.
For example: while collecting names and email IDs through online
forms, surveys, or other means, we can get inputs in assorted styles.
We can fix them to avoid duplicate entries getting ignored. The email
IDs '[email protected]' and '[email protected]' can be interpreted as
different email IDs (for example, when they differ only in capitalization),
so it is better to make all the email ID values in the field lowercase.
Similarly, for the names, we can follow the title case, where all words are
capitalized.
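A small pandas sketch of this standardization (assuming pandas is installed; the column names and values are placeholders):

import pandas as pd

df = pd.DataFrame({
    "Name": ["riya sharma", "AMIT VERMA"],
    "Email ID": ["Riya.Sharma@Example.com", "riya.sharma@example.com"],
})

# Rename columns in the cobra/snake style: spaces become underscores
df.columns = [c.strip().replace(" ", "_") for c in df.columns]

df["Email_ID"] = df["Email_ID"].str.lower()   # emails compared case-insensitively
df["Name"] = df["Name"].str.title()           # names written in title case
print(df)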
Step 4: Convert data type
When working with CSV data in python, pandas will attempt to guess
the types for us; for the most part, it succeeds, but occasionally we'll
need to provide a little assistance.
The most common data types that we find in the data are text, numeric,
and date data types. The text data types can accept any kind of mixed
values including alphabets, digits, or even special characters. A person’s
name, type of product, store location, email ID, password, etc., are some
examples of text data types.
Numeric data types contain integer values or decimal point numbers,
also called float. Having a numeric data type column means you can
perform mathematical computations like finding the minimum,
maximum, average, and median, or analyzing the distribution using
histogram, box plot, q-q plot, etc.
Having a numeric column as an integer column will not allow you to
perform this numerical analysis. Therefore, it becomes important to
convert the data types in the required formats if they are not already.
The monthly sales figures of a store, the price of a product, units of
electricity consumed, etc., are examples of a numeric column. However,
it is worth noting that columns like a numeric ID or phone number
should not be represented as numeric columns but instead as text
columns. Though they represent numeric values, operations like
minimum or average values on these columns do not provide any
significant information. Therefore, these columns should be represented
as text columns.
A date column, if not identified correctly, will end up being identified as
a string or text column. In such cases, we need to explicitly define the data
type of the column and the date format that is mentioned in the data. The
date column can be represented in different formats:
October 02, 2023
02-10-2023
2023/10/02
2-Oct-2023
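A minimal pandas sketch of such conversions (assuming pandas is installed; the column names and values are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "units_sold": ["10", "25", "7"],
    "customer_id": [101, 102, 103],
    "purchase_date": ["02-10-2023", "03-10-2023", "04-10-2023"],
})

df["units_sold"] = df["units_sold"].astype(int)      # numeric column for calculations
df["customer_id"] = df["customer_id"].astype(str)    # IDs are text, not true numbers
df["purchase_date"] = pd.to_datetime(df["purchase_date"], format="%d-%m-%Y")
print(df.dtypes)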
Step 5: Handling Outliers
Remove the observations that consist of outlier values.
Apply transformations like a log, square root, box-cox, etc., to make
the data values follow the normal or near-normal distribution.
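A small sketch of the log transformation mentioned above (assuming NumPy and pandas are installed; the salary values are made up, with the last one acting as an outlier):

import numpy as np
import pandas as pd

salaries = pd.Series([30_000, 32_000, 35_000, 40_000, 1_200_000])

# log1p compresses very large values so the distribution becomes closer to normal
log_salaries = np.log1p(salaries)
print(log_salaries)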
Step 6: Fix errors
Errors in your data can lead you to miss out on the key findings. This
needs to be avoided by fixing the errors that your data might have.
Systems that manually input data without any provision for data checks
are almost always going to contain errors. To fix them, we first need to
understand the data. After that, we can define logic or checks for the data
and accordingly get the data errors fixed. Consider the following example
cases.
Removing the country code from the mobile field so that all the values
are exactly 10 digits.
Remove any unit mentioned in columns like weight, height, etc. to
make it a numeric field.
Identifying any incorrect data format like email address and then either
fixing it or removing it.
Making some validation checks like customer purchase date should be
greater than the manufacturing date, the total amount should be equal
to the sum of the other amounts, any punctuation or special characters
found in a field that does not allow it, etc.
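A minimal pandas sketch of such checks (assuming pandas is installed; the column names, formats, and rules are purely illustrative):

import pandas as pd

df = pd.DataFrame({
    "mobile": ["+91 9876543210", "9123456780"],
    "weight": ["72 kg", "65"],
    "purchase_date": pd.to_datetime(["2023-10-02", "2021-01-15"]),
    "manufacturing_date": pd.to_datetime(["2023-09-20", "2022-03-01"]),
})

# Keep only the last 10 digits of the mobile number (drop country code and spaces)
df["mobile"] = df["mobile"].str.replace(r"\D", "", regex=True).str[-10:]

# Strip units such as "kg" so that weight becomes a numeric field
df["weight"] = df["weight"].str.extract(r"(\d+)", expand=False).astype(float)

# Flag rows where the purchase date is earlier than the manufacturing date
df["invalid_dates"] = df["purchase_date"] < df["manufacturing_date"]
print(df)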
Step 8: Handle missing values
Missing values can be handled by either removing the records that have
missing values or filling them in with suitable substitute values, as
discussed below.
Consider another dataset where we have information about the laborers
working on a construction site. If the gender column in this dataset has
around 30 percent missing values. We cannot drop 30 percent of data
observations, but on further digging, we found that among the remaining
70 percent of observations, 90 percent of the records are male.
we can choose to fill these missing values as the male gender. By doing
this, we have made an assumption, but it can be a safe assumption
because the laborers working on the construction site are male dominant
and even the data suggests the same. We have used a measure of central
tendency called Mode, in this case. There are also other ways of filling
missing values in a numerical field by using Mean or Median values
based on whether the field values follow a Gaussian distribution or not.
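A minimal pandas sketch of this kind of imputation (assuming pandas is installed; the values are made up to mirror the example):

import numpy as np
import pandas as pd

gender = pd.Series(["Male", "Male", np.nan, "Female", "Male", np.nan, "Male"])
# Fill missing categories with the Mode (the most frequent value), "Male" here
gender = gender.fillna(gender.mode()[0])

salary = pd.Series([20_000, 22_000, np.nan, 25_000])
# For a numeric field, the Mean or Median can be used instead
salary = salary.fillna(salary.median())

print(gender.tolist())
print(salary.tolist())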
Improving productivity: Maintaining data quality and enabling more
precise analytics that support the overall decision-making process are
made possible by cleaning the data.
Avoiding unnecessary costs and errors: Correcting faulty or
mistaken data in the future is made easier by keeping track of errors
and improving reporting to determine where errors originate.
Staying organized
Improved mapping
2.5 DATA ANALYSIS
Data analysis reduces the risks inherent in decision-making by providing
useful insights and statistics, often presented in charts, images, tables, and
graphs.
A simple example of data analysis can be seen whenever we make a
decision in our daily lives by evaluating what has happened in the past or
what will happen if we make that decision. Basically, this is the process of
analyzing the past or future and making a decision based on that analysis.
It’s not uncommon to hear the term “big data” brought up in discussions
about data analysis. Data analysis plays a crucial role in processing big
data into useful information. Neophyte data analysts who want to dig
Reduce Operational Costs: Data analysis shows you which areas in
your business need more resources and money, and which areas are
not producing and thus should be scaled back or eliminated outright.
Better Problem-Solving Methods: Informed decisions are more likely
to be successful decisions. Data provides businesses with information.
You can see where this progression is leading. Data analysis helps
businesses make the right choices and avoid costly pitfalls.
You Get More Accurate Data: If you want to make informed
decisions, you need data, but there’s more to it. The data in question
must be accurate. Data analysis helps businesses acquire relevant,
accurate information, suitable for developing future marketing
strategies, business plans, and realigning the company’s vision or
mission.
The process of data analysis, or
alternately the data analysis steps, involves gathering all the information,
processing it, exploring the data, and using it to find patterns and other
insights. The process of data analysis consists of:
Data Requirement Gathering: Ask yourself why you’re doing this
analysis, what type of data you want to use, and what data you plan to
analyze.
Data Collection: Collect the data from your identified sources.
Data Cleaning: Not all of the data you collect will be useful, so it is
time to clean it up. This process is where you remove white spaces,
duplicate records, and basic errors. Data cleaning is mandatory before
sending the information on for analysis.
Data Analysis: Here is where you use data analysis software and other
tools to help you interpret and understand the data and arrive at
conclusions. Data analysis tools include Excel, Python, R, Looker,
Rapid Miner, Chartio, Metabase, Redash, and Microsoft Power BI.
Data Interpretation: Now that you have your results, you need to
interpret them and come up with the best courses of action based on
your findings.
Data Visualization: Data visualization is a fancy way of saying,
“graphically show your information in a way that people can read and
understand it.” You can use charts, graphs, maps, bullet points, or a
host of other methods. Visualization helps you derive valuable insights
by helping you compare datasets and observe relationships.
Types of Data Analysis
A half-dozen popular types of data analysis are available today, commonly
employed in the worlds of technology and business. They are:
Diagnostic Analysis: Diagnostic analysis answers the question, “Why
did this happen?” Using insights gained from statistical analysis (more
on that later!), analysts use diagnostic analysis to identify patterns in
data. Ideally, the analysts find similar patterns that existed in the past,
and consequently use those solutions to try to resolve the present
challenges.
Predictive Analysis: Predictive analysis answers the question, “What
is most likely to happen?” By using patterns found in older data as
well as current events, analysts predict future events. While there’s no
such thing as 100 percent accurate forecasting, the odds improve if the
analysts have plenty of detailed information and the discipline to
research it thoroughly.
Prescriptive Analysis: Mix all the insights gained from the other data
analysis types, and you have prescriptive analysis. Sometimes, an issue
can’t be solved solely with one analysis type, and instead requires
multiple insights.
Statistical Analysis: Statistical analysis answers the question, “What
happened?” This analysis covers data collection, analysis, modeling,
interpretation, and presentation using dashboards. Statistical analysis
breaks down into two sub-categories: descriptive analysis and inferential
analysis.
2.6 MODELING
Data models are reflected in business
applications and in the database or file system structures used to process,
store and manage the data.
Data modeling can also help establish common data definitions and
internal data standards, often in connection with data governance
programs. In addition, it plays a big role in data architecture
processes that document data assets, map how data moves through IT
systems and create a conceptual data management framework. Data
models are a key data architecture component, along with data flow
diagrams, architectural blueprints, a unified data vocabulary and other
artifacts.
Traditionally, data models have been built by data modelers, data
architects and other data management professionals with input from
business analysts, executives and users. But data modeling is also now an
elements. Database designers use physical data models to create
designs and generate schema for databases.
processing and management. That's still the case, but the techniques used
to create data models have evolved along with the development of new
model specification in 1969. Because of that, the network technique is
often referred to as the CODASYL model.
approach for data capture and update processes, making them particularly
suitable for transaction processing applications.
5. Dimensional data modeling
Dimensional data models are primarily used in data warehouses and data
marts that support business intelligence applications. They consist of fact
tables that contain data about transactions or other events and dimension
tables that list attributes of the entities in the fact tables. For example, a
fact table could detail product purchases by customers, while connected
dimension tables hold data about the products and customers. Notable
types of dimensional models are star schemas, which connect a fact table
to different dimension tables, and snowflake schemas, which include
multiple levels of dimension tables.
2.7 SUMMARY
In this chapter we learned about data management. Data management
begins with collecting data, so we first studied data collection and the
different methods of data collection. After that, we learned about data
cleaning in data science, the need for data cleaning, and the various
methods used to clean the data. Once the data is cleaned, it is analyzed, so
we then studied the process of data analysis. As part of that process, we
finally learned what data modeling is and the different data modeling
techniques.
2.8 QUESTIONS
1. Why Do We Need Data Collection?
2. What Are the Different Methods of Data Collection?
3. What is Data Cleaning in Data Science?
4. Why is Data Cleaning So Important?
5. What are the different types of data models?
3
DATA CURATION
Unit Structure
3.0 Objective
3.1 Data Curation
3.1.1 Introduction
3.1.2 Data Curation Life Cycle
3.2 Query Languages and Operations to Specify and Transform Data
3.2.1 Query Languages and Operations
3.2.2 Relational Algebra
3.2.3 Joins
3.2.4 Aggregate/Group Functions
3.2.5 Structured Query Languages (SQL, Non-Procedural Query
Languages)
3.4.1 XML
3.4.2 XQuery
3.4.3 XPath
3.4.4 JSON
3.5 Unstructured Data
3.0 OBJECTIVES
In this chapter the students will learn about:
Data Curation
Data Curation Life Cycle
Query Languages
Structured and Unstructured Data
3.1 DATA CURATION
Curation is the round-the-clock maintenance of data. Data curation refers
to data management.
It is the process of creating, organizing, and maintaining data sets. With
the help of this process, we can access and use the information or data as
per the requirements.
It involves various methods like collecting, structuring, indexing, and
cataloguing data for users in an organization or elsewhere.
It includes the processes used to create, maintain, and validate data.
Data curation is used to determine what information is worth saving and
for how much duration.
The main aim of data curation is to ensure that the data is reliably
retrievable for future use or reuse.
1. PRESERVING
It is the first step in data curation. In this step, we gather data from many
sources; maintaining the gathered data after collection is known as
preserving.
2. SHARING
It is the second step in data curation. In this step, we make sure that the
data is available and retrievable for future use by authenticated users; this
is known as sharing.
3. DISCOVERING
It is the third step in data curation. In this step, we reuse the data we have
collected in different combinations; with the help of these various
combinations of data, we can discover new patterns and trends in the data.
This is known as discovering.
2. Description and Representation of Information:
It is layer 2 of data curation life cycle model, in this layer assignment
of administrative, descriptive, technical, structural and preservation of
the data or database or digital objects is done depending on the
standards defined.
3. Preservation and Planning:
In this layer, the planning for the preservations of the digital objects,
data, and databases are carried out for throughout the life cycle.
6. Create and Receive:
In this layer, creation of data using descriptive and technical metadata,
also it includes receiving of data from the various formats.
7. Appraise and Select:
In this layer, evaluation and selection of data are carried out to decide
which data is to be kept for long-term curation and preservation.
8. Ingest:
In this layer, the selected data is transferred to an archive, repository, data
centre, or other custodian for storage.
9. Preservation Action:
In this layer, various actions are carried out with the aim long term
preservations and retention of the data. Preservation actions includes
for data to remain reliable, authentic, and usable.
10. Store:
In this layer, data is get stored in the secured manner.
12. Transform:
In this layer, it consists of very important component, where we can
create new data from the original material and then transform that data
into the meaningful form i.e. different format for generation of final
results.
13. Conceptualize:
In this layer, the data is conceptualized to form an idea or principle that
the user wants for final result generation.
14. Dispose:
In this layer, data that is not useful for a longer time or in the future is
removed from the database; it is also known as unwanted data. The
unwanted data can be disposed of to create space for upcoming data.
15. Reappraise:
In this layer, data that cannot be processed or that fails the data validation
process is returned back.
16. Migrate:
In this layer, data is migrated to various places depending on the need
and is then converted according to the new environment.
3.2 QUERY LANGUAGES AND OPERATIONS TO
SPECIFY AND TRANSFORM DATA
Query languages are also known as database query languages (DQL).
They are computer languages that are used to make queries in databases.
An example of a DQL is Structured Query Language (SQL).
databases and it will retrieve the data from that specified databases or
information system.
Query languages are special types of languages used for retrieving the
required information or data from a database.
3.2.2 RELATIONAL ALGEBRA
1. Select Operation
It is used to select particular rows (tuples) from the given database which
satisfy the given condition. The sigma symbol (σ) is used to denote the
select operation.
The select operation is represented as σP(r), where P stands for the
selection predicate and r stands for the relation. P is a propositional logic
formula which may use connectors like and, or, and not.
It can also use relational operators like =, <, >, >=, <=, ≠, etc.
For example:
σ subject = "Datascience" (Books)
The output will select the rows from Books where the name of the subject
is "Datascience".
2. Project ( ∏ )
It is used to project columns which satisfy the given condition. The pi
symbol (∏) is used to denote the project operation.
The project operation is written as ∏ A1, A2, ..., An (r), where A1, A2, ...,
An are attribute names of relation r. Duplicate records are removed
automatically, because a relation is a set.
For example:
∏ subject, author (Books)
The output will project the columns subject and author from the Books
relation.
3. Union ( ∪ )
It is used to perform a binary union between two tables or relations. The
∪ symbol is used to denote the union operation.
r ∪ s = { t | t ∈ r or t ∈ s }
where r and s are database relations.
For a union operation, r and s must have the same number of attributes.
Attribute domains must be compatible.
Duplicate records are eliminated automatically.
For example:
∏ author (Books) ∪ ∏ author (Articles)
The output will project the names of the authors who have written either a
book or an article or both.
4. Set Difference ( − )
It is used to find the tuples or rows which are present in one relation but
not present in another relation. It is written as r − s, which finds all the
rows that are present in r but not in s.
For example:
∏ author (Books) − ∏ author (Articles)
The output will give the names of the authors who have written books but
not articles.
The output will show all the books and articles written by the author
'Balguruswamy'.
6. Rename ( ρ )
It is used when the output of a relational algebra query produces a relation
without a name; this operation is used to rename the produced output
relation.
The ρ symbol is used to denote the rename operation.
For example:
ρ X (E)
where E is the name of the table or relation, and here E is renamed to X.
3.2.3 JOINS
Join is used to combine two different relations or tables into a single
relation.
There are three types of joins as follows:
1. Theta Join ( ⋈θ )
It combines rows or tuples from two different relations if the given theta
condition is satisfied.
2. Natural Join ( ⋈ )
Natural join does not use any comparison operator.
It does not concatenate the way a Cartesian product does. Natural join will
work only if there is at least one common attribute that exists between any
two relations.
Theta and Natural join are also known as inner joins.
In the case of an inner join, only the rows or tuples with matching
attributes are included, and the remaining ones are discarded in the
resulting relation or table.
3. Outer Join
It is used to deal with the unmatched attributes of the table.
There are 3 different types of outer join
1. Left outer join
2. Right outer join
3. Full outer join
Right outer join: R ⟖ S
Here, the right outer join takes all the rows or tuples from the relation S
into the output. For the rows or tuples of S which do not have any
matching rows or tuples in R, the corresponding attributes of R are made
NULL.
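To make the join behaviour concrete, here is a small sketch that runs SQL through Python's built-in sqlite3 module (the table names and rows are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Books(author TEXT, title TEXT)")
cur.execute("CREATE TABLE Articles(author TEXT, title TEXT)")
cur.executemany("INSERT INTO Books VALUES (?, ?)",
                [("Balguruswamy", "Programming in C"), ("Ritchie", "The C Language")])
cur.executemany("INSERT INTO Articles VALUES (?, ?)",
                [("Balguruswamy", "Teaching C")])

# Inner join: only the authors present in both relations are kept
cur.execute("""SELECT b.author, b.title, a.title
               FROM Books b JOIN Articles a ON b.author = a.author""")
print(cur.fetchall())

# Left outer join: unmatched Books rows appear with NULL on the Articles side
cur.execute("""SELECT b.author, b.title, a.title
               FROM Books b LEFT OUTER JOIN Articles a ON b.author = a.author""")
print(cur.fetchall())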
3.2.4 AGGREGATE/GROUP FUNCTIONS
1. Count( )
It returns the total number of records present in a given relation.
For example
Count (*)
It shows the given number of rows or tuples of in that relation.
Count (columnname)
It shows the number of non-null values over the given columnname.
Count (salary)
It shows the number of non-null values over the column salary.
2. Sum( )
It returns the sum of values for a particular attribute.
For example
Sum(salary)
It shows the sum of salary for all the employee in that relation or table.
It will display sum of salary for non - null values.
3. Avg( )
It returns the average of the values over the given attribute.
For example
Avg(salary)
It will give the average value of salary that is total or sum of all salary
divided by total count and returns its value.
4. Min( )
It returns minimum value of a particular attribute.
For example
Min(salary)
It will return the minimum salary from the salary attribute.
5. Max( )
It returns maximum value of a particular attribute.
For example
Max(salary)
It will return the maximum salary from the salary attribute.
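A short sketch of these aggregate functions in SQL, run through Python's built-in sqlite3 module (the employee data is invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Employee(name TEXT, dept TEXT, salary INTEGER)")
cur.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                [("Asha", "IT", 50000), ("Ravi", "IT", 60000), ("Meena", "HR", None)])

# COUNT(*) counts all rows, while COUNT(salary) counts only non-null salaries
cur.execute("""SELECT COUNT(*), COUNT(salary), SUM(salary),
                      AVG(salary), MIN(salary), MAX(salary)
               FROM Employee""")
print(cur.fetchone())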
3.2.5 STRUCTURED QUERY LANGUAGES (SQL)
Alter: It is used to alter the existing table or column description.
Alter Database, Alter table
Drop: It is used to delete existing table.
Drop table
Transaction Control Language (TCL): These commands are used to
manage transactions to maintain the integrity of the database with the help
of SQL statements.
The following SQL commands are used for TCL:
Begin Transaction: It is used to opens a transaction.
Commit Transaction: It is used to commits a transaction.
Rollback Transaction: It is used to Rollback a transaction.
[source: ...ng/BDL2.pdf]
3.4.1 XML
As we all know that XML stands for Extensible Markup Language.
It is a markup language and file format that helps in storing and
transporting of data.
It is designed to carry data and not just to display data as it is self-
descriptive.
It was formed from extracting the properties of SGML (Standard
Generalized Markup Language).
It supports exchanging of information between computer systems. They
can be websites, databases, and any third-party applications.
It consists of predefined rules which makes it easy to transmit data as
XML files over any network.
The components of an XML file are
XML Document:
The content that is mentioned between the <xml></xml> tags is called the
XML document. It is mainly at the beginning and the end of an XML
code.
XML Declaration:
The content begins with some information about XML itself. It also
mentions the XML version.
For example: <?xml version=”1.0 encoding=”UTF-8”?>
XML Elements:
The other tags you create within an XML document are called XML
elements. An element can consist of the following:
1. Text
2. Attributes
3. Other elements
For example:
<Fruits>
<Berries>
<type> Strawberry </type>
<type> Blueberry </type>
<type> Raspberry </type>
</Berries>
<Citrus>
<type> Oranges </type>
<type> Lemons </type>
<type> Limes </type>
</Citrus>
</Fruits>
Here, <Fruits></Fruits> is the root element, and <Berries></Berries> and
<Citrus></Citrus> are other elements.
4. XML Attributes:
The XML elements which can have other descriptors are called as XML
Attributes. One can define his/her own attribute name and attribute values
within the quotation marks.
For example: <Score="80">
5. XML Content:
The data that is present in the XML file is called as XML content. In the
given example in XML Elements, Strawberry, Blueberry, Raspberry,
Oranges, Lemons and Limes are the content.
EXAMPLE:
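An illustrative sketch of a complete XML document combining the components described above (declaration, root element, child elements, an attribute, and content); the data is invented:

<?xml version="1.0" encoding="UTF-8"?>
<Students>
  <Student Score="80">
    <Name>Riya</Name>
    <Subject>Data Science</Subject>
  </Student>
  <Student Score="75">
    <Name>Amit</Name>
    <Subject>Statistics</Subject>
  </Student>
</Students>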
3.4.2 XQuery
One can search web documents for relevant information and generate
summary reports.
It replaces complex Java or C++ programs with a few lines of code.
3.4.3 XPath
XPath (XML Path Language) is a query language used to navigate through
an XML document and select specific elements or attributes. It is widely
used in web scraping and data extraction, as well as in data science for
parsing and analyzing XML data.
In data science, XPath can be used to extract information from XML files
or APIs. For example, you might use XPath to extract specific data fields
from an XML response returned by a web API, such as stock prices or
weather data.
XPath can also be used in combination with other tools and languages
commonly used in data science, such as Python and Beautiful Soup, to
scrape data from websites and extract structured data for analysis. By
using XPath to select specific elements and attributes, you can quickly and
easily extract the data you need for analysis.
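A minimal Python sketch of XPath-style selection using the standard library's xml.etree.ElementTree module, which supports a limited subset of XPath (the XML snippet reuses the Fruits example from the XML section):

import xml.etree.ElementTree as ET

xml_data = """
<Fruits>
  <Berries><type>Strawberry</type><type>Blueberry</type></Berries>
  <Citrus><type>Oranges</type><type>Lemons</type></Citrus>
</Fruits>
"""

root = ET.fromstring(xml_data)

# './/type' selects every <type> element anywhere under the root
for node in root.findall(".//type"):
    print(node.text)

# './Citrus/type' selects only the <type> elements inside <Citrus>
for node in root.findall("./Citrus/type"):
    print("citrus:", node.text)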
3.4.4 JSON
It is used for exchanging data between web applications and servers,
and can be used with many programming languages.
JSON data is represented in key-value pairs, similar to a dictionary or
a hash table.
The key represents a string that identifies the value, and the value can
be a string, number, Boolean, array, or another JSON object.
JSON objects are enclosed in curly braces {}, and arrays are enclosed
in square brackets [].
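A small illustrative JSON document showing these ideas (key-value pairs, a nested object, and an array); the values are invented:

{
  "name": "Riya",
  "age": 21,
  "isStudent": true,
  "subjects": ["Data Science", "Statistics"],
  "address": { "city": "Mumbai", "pin": "400001" }
}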
3.5 UNSTRUCTURED DATA
Unstructured data is data that is not organized in a pre-defined manner or
does not have a pre-defined data model; thus, it is not a good fit for a
mainstream relational database.
For unstructured data, there are alternative platforms for storing and
managing it; it is increasingly prevalent in IT systems and is used by
organizations in a variety of business intelligence and analytics
applications.
3.6 SUMMARY
This chapter contains a detailed study of what data is, the different types
of data, data curation and its various steps, query languages and the
various operators of query languages, structured, unstructured, and semi-
structured data with examples, aggregate and group functions, and a
detailed study of structured query languages such as SQL, non-procedural
query languages, XML, XQuery, XPath, and JSON.
6. Write a note on semi structured data with example.
7. What is XML? Explain its advantages and disadvantages.
8. Write a note on
a. XQuery  b. XPath  c. JSON
3.8 REFERENCES
Publication,2015
3. Hands-On Programming with R, Garrett Grolemund,1st Edition,
2014
4. An Introduction to Statistical Learning, James, G., Witten, D.,
Hastie, T., Tibshirani, R.,Springer,2015
5. http://www.icet.ac.in/Uploads/Downloads/1._MOdule_1_PDD_KQ
B__(1)%20(1).PDF
6. https://www.researchgate.net/figure/Diagram-of-the-digital-curation-
lifecycle_fig3_340183022
4
DATA BASE SYSTEMS
Unit Structure
4.0 Objective
4.1 Web Crawler & Web Scraping
4.1.1 Difference between Web Crawler and Web Scraping
4.2 Security and Ethical Considerations in Relation to Authenticating
And Authorizing
4.2.1 Access to Data on Remote Systems
4.3 Software Development Tools
4.3.1 Version Control/Source Control
4.3.2 Github
4.4 Large Scale Data Systems
4.4.1 Paradigms of Distributed Database Storage
4.4.1.1 Homogeneous Distributed Databases
4.4.3 Mongodb
4.4.4 Hbase
4.0 OBJECTIVES
In this chapter the students will learn about:
Web Crawler
Web Scraping
Security and Ethical considerations in relation to authenticating and
authorizing
to data on remote systems
Software Development Tools
Version control terminology and functionalities
Github
Large Scale Data Systems
Distributed Database Storage
NOSQL
MongoDB
HBase
AWS
Cloud Services
Map Reduce
4.1 WEB CRAWLER&WEB SCRAPER
Web Crawler
Web crawler is also known as web spider, search engine bot. It takes the
content from internet then downloads and indexes it.
The main aim of web crawler or bot is to learn from every webpage on the
web so that the content or information can be retrieved when the user
needs it.
They are known as "web crawlers" because crawling is the technical term
for automatically accessing a website and obtaining data via a software
program.
Based on the data collected by web crawlers, a search engine can provide
relevant links for the requested information or content. A search engine
generates the list of webpages that appears when a user types a search
into Google, Bing, or any other search engine like Yahoo.
A web crawler is a bot that, like someone going through all the books in a
disorganized library, compiles a card catalog so that anyone who visits the
library can easily and quickly find the content or information they need.
To sort and categorize the library's books by topic, the organizer first
reads the title and summary of each book to find out what that particular
book contains; a reader who needs it can then retrieve it and use it as per
their need.
In short, a book here is nothing but information in the web library, which
is organized in a systematic manner (sorted and indexed). The user can
retrieve whichever content is relevant and use it as per the need.
The sequence of searching in a web library starts with a certain set of
known webpages; the crawler then follows the links, i.e. hyperlinks, from
those pages to other pages, and from those other pages to additional
pages, which open up so that the user can get the information or data on
them. In this way the internet is crawled by search engine bots.
Example of web crawlers: Amazonbot (Amazon), Bingbot (Bing), Yahoo,
Baiduspider (Baidu), Googlebot (Google), DuckDuckbot (DuckDuckGo)
etc.
Search Indexing
Search indexing on the internet is like creating a library card catalog: it
lets the search engine retrieve the information or data when the user
searches for it. It can also be compared to the index at the back of a book,
which lists all the places in the book where a particular topic or phrase
typed by the user into any search engine appears.
The main aim of search indexing is to make the text that appears in the
web library searchable with the help of the internet.
Metadata is data that gives search engines details about what a webpage
is about. The meta description is what will appear on search engine result
pages.
[source: techtarget.com/whatis/definition/crawler]
Crawling process: It collects data from various websites that allow
crawling and indexing. Once collected data then it sends to the respective
search engine like Google or user defines any other search engine.
Indexing process: After crawling process the Google or respective search
engine shelves the data base on its relevance and the importance to users.
With the help of hyperlink or URLs the data which are present on various
sites get processed and stored in a Google or respective search engine
database.
Ranking Process: After the indexing process is complete, the user enters a
query on a search engine (e.g. Google), and the search engine shows the
results from the stored database to the user. Results for related relevant
keywords are also accumulated with the result. The ranking of a website
on a particular search engine is a key factor of its relevance on that search
engine.
Web Scraper
Online scraping is a computerised technique for gathering copious volumes of data from websites. The majority of this data is unstructured, in HTML format, and is transformed into structured data in a database or spreadsheet so that it can be used in multiple applications. Web scraping can be done in a variety of ways to collect data from websites. Options include leveraging specific APIs, online services, or even writing your own code from scratch for web scraping. You may access the structured data of many huge websites, including Google, Twitter, Facebook, StackOverflow, and others, using their APIs.
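As a minimal illustration, the Python sketch below downloads one page and pulls out its links and headings. It assumes the third-party requests and beautifulsoup4 packages are installed, that the target site permits scraping (check its robots.txt), and the URL is only a placeholder:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace with a page you are allowed to scrape.
url = "https://example.com/"

# Download the raw, unstructured HTML of the page.
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML so it can be queried like structured data.
soup = BeautifulSoup(response.text, "html.parser")

# Extract every hyperlink (this is also how a crawler discovers new pages).
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

# Extract the page title and all second-level headings.
title = soup.title.string if soup.title else ""
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

print(title)
print(len(links), "links found")
print(headings[:5])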
The main differences between web scraping and web crawling are:
Web scraping is used for downloading specific information, whereas web crawling is used for indexing web pages.
Web scraping is done on both a small and a large scale, whereas web crawling is mostly employed at a large scale.
Data de-duplication is not necessarily a part of web scraping, whereas it is an integral part of web crawling.
Web scraping needs both a crawl agent and a parser for parsing the response, whereas web crawling needs only a crawl agent.
4.2 SECURITY AND ETHICAL CONSIDERATIONS IN RELATION TO AUTHENTICATING AND AUTHORIZING
Authentication And Authorization for Storage System
Security is an important parameter for any data storage system. Various
security attacks that can be faced in any system can be:
1. Password guessing attack
2. Replay attack
3. Man-in-the-middle attack
4. Phishing attack
5. Masquerade attack
6. Shoulder surfing attack
7. Insider attack
Authentication and authorization are two major processes used for the security of data on a remote system.
1. Denial-of-service (DoS) and distributed denial-of-service (DDoS) attacks overwhelm a system's resources so that it cannot respond to service requests.
2. Man-in-the-middle (MitM) attack: A MitM attack occurs when an attacker inserts itself between a client and a server and intercepts the communication between them.
6. SQL injection attack: SQL injection has become a common issue with
database-driven websites. It occurs when a malefactor executes a SQL
query to the database via the input data from the client to server.
7. Cross-site scripting (XSS) attack: XSS attacks use third-party web
resources to run scripts in the victim web browser or scriptable
application.
8. Malware attack: Malicious software can be described as unwanted
software that is installed in your system without your consent.
Examples of data security technologies include data backups, data
masking and data erasure.
A key data security technology measure is encryption, where digital data, software/hardware, and hard drives are encrypted so that they are made unreadable to unauthorized users and hackers.
One of the most commonly used methods for data security is the use of
authentication and authorization.
With authentication, users must provide a password, code, biometric data, or some other form of data to verify their identity before they are granted access to a system or data.
4.2.1 ACCESS TO DATA ON REMOTE SYSTEMS
There are various major processes used for the security of data on a remote system.
Authentication
It is a process for confirming the identity of the user. The basic way of providing authentication is through a username and password, but this approach often fails against hackers or attackers: if a hacker is able to crack the username and password, then the hacker will also be able to access the system.
Authorization
It follows the authentication step: once the authentication of a particular user is done, the next step is authorization, which checks what rights are given to that user. During this process, policies are made which define the authorities of that user.
Various algorithms used for authentication and authorization are:
1. RSA algorithm.
2. AES algorithm and MD5 hashing algorithm.
3. OTP password algorithm.
4. Data encryption standard algorithm.
5. Rijndael encryption algorithm.
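As a small, hedged illustration of the authentication idea, the sketch below stores a salted password hash and later verifies a login attempt. It uses PBKDF2 with SHA-256 from Python's standard library (a deliberate substitute for plain MD5, which is no longer recommended for passwords); the password values are made up:

import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, digest) for the password using PBKDF2-HMAC-SHA256."""
    salt = salt or os.urandom(16)                      # random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    """Re-hash the supplied password and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)

# Registration: only the salt and the hash are stored, never the password itself.
salt, stored = hash_password("s3cret-pass")

# Login attempts (the authentication step).
print(verify_password("s3cret-pass", salt, stored))   # True  -> identity confirmed
print(verify_password("wrong-pass", salt, stored))    # False -> access denied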
4.3 SOFTWARE DEVELOPMENT TOOLS
Software development tools play a crucial role in data science workflows,
especially as projects become more complex and involve larger amounts
of data.
Here are some of the most commonly used software development tools in
data science:
Let’s try to understand the process with the help of this diagram
Source: (https://youtu.be/Yc8sCSeMhi4)
There are 3 workstations or three different developers at three other
locations, and there’s one repository acting as a server. The work stations
are using that repository either for the process of committing or updating
the tasks.
There may be a large number of workstations using a single server
repository. Each workstation will have its working copy, and all these
workstations will be saving their source codes into a particular server
repository.
This makes it easy for any developer to access the task being done using
the repository. If any specific developer's system breaks down, then the
work won't stop, as there will be a copy of the source code in the central
repository.
Finally, let’s have a look at some of the best Version Control Systems in
the market.
Integrated Development Environments (IDEs)
Data analysis and visualization tools
Data analysis and visualization tools help data scientists to explore, clean,
and visualize data.
Popular tools include:
Pandas
NumPy
Matplotlib
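A minimal sketch of how these three libraries are typically combined; the small dataset here is synthetic and only for illustration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: generate some synthetic numeric data.
rng = np.random.default_rng(42)
scores = rng.normal(loc=70, scale=10, size=100)

# Pandas: explore and clean the data in a DataFrame.
df = pd.DataFrame({"student_id": np.arange(100), "score": scores})
df["score"] = df["score"].clip(0, 100)          # simple cleaning step
print(df["score"].describe())                   # quick exploration

# Matplotlib: visualize the distribution.
plt.hist(df["score"], bins=15)
plt.xlabel("Score")
plt.ylabel("Number of students")
plt.title("Distribution of student scores")
plt.show()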
4.3.2 GITHUB
Github is an Internet hosting service for software development and version
control using Git. It provides the distributed version control of Git plus
access control, bug tracking, software feature requests, task management,
continuous integration, and wikis for every project.
Projects on GitHub.com can be accessed and managed using the standard
Git command-line interface; all standard Git commands work with it.
GitHub.com also allows users to browse public repositories on the site.
Multiple desktop clients and Git plugins are also available. The site provides social networking-like functions such as feeds, followers, and wikis. Anyone can browse and download public repositories, but only registered users can contribute content to repositories.
Git
GIT's full form is "Global Information Tracker." Git is a DevOps tool used for source code management. It is a free and open-source version control system used to handle small to very large projects efficiently. Git is used to track changes in the source code, enabling multiple developers to work together on non-linear development. While Git is a tool that's used to manage multiple versions of source code edits that are then transferred to files in a Git repository, GitHub serves as a location for uploading copies of a Git repository.
Need of Github
To store large data, normal databases cannot be used; hence NoSQL databases such as MongoDB and HBase are good options for large-scale data systems. Large-scale systems do not always have centralized data storage. The distributed database approach is widely used in many applications.
Distributed databases are capable of modular development, meaning that
systems can be expanded by adding new computers and local data to the
new site and connecting them to the distributed system without
interruption. When failures occur in centralized databases, the system
comes to a complete stop. When a component fails in distributed database
systems, however, the system will continue to function at reduced
performance until the error is fixed. Data is physically stored across
multiple sites. Data in each site can be managed by a DBMS independent
of the other sites. The processors in the sites are connected via a network.
They do not have any multiprocessor configuration. A distributed database
is not a loosely connected file system.
A distributed database incorporates transaction processing, but it is not
synonymous with a transaction processing system.
Distributed database systems are mainly classified as homogeneous and heterogeneous databases.
In non-autonomous databases, data is distributed across the various nodes or sites, and one node manages all the other nodes, similar to a client-server model.
4.4.1.2 HETEROGENEOUS DISTRIBUTED DATABASES
4. Replication and availability: Most NoSQL databases provide built-in
replication and fault-tolerance features that ensure high availability
and data durability.
es
5. Distributed architecture: NoSQL databases are typically designed as distributed systems, which allows them to distribute data across multiple nodes in the cluster. This enables efficient handling of large volumes of data.
6. Flexible schema: NoSQL databases do not require a predefined schema. This means that you can add new fields or attributes to the data on the fly, without having to modify the entire database schema.
4.4.3 MONGODB
MongoDB is a document-oriented NoSQL database that stores data in the
form of JSON-like documents.
Automatic sharding: MongoDB can automatically split data across
multiple servers, allowing it to handle large volumes of data and scale
horizontally.
Indexing: MongoDB supports indexes on any field, including fields
within nested documents and arrays.
Rich query language: MongoDB supports a rich query language that
includes filtering, sorting, and aggregation.
Dynamic schema: MongoDB's flexible schema allows you to add new
fields or change existing ones without affecting the existing data.
Replication: MongoDB supports replica sets, which provide
automatic failover and data redundancy.
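A minimal sketch of these ideas with the pymongo driver; it assumes MongoDB is running locally on the default port, and the database, collection, and documents are hypothetical:

from pymongo import MongoClient

# Connect to a local MongoDB server (assumed to be running on the default port).
client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]                 # hypothetical database
customers = db["customers"]         # hypothetical collection

# Dynamic schema: documents in the same collection can have different fields.
customers.insert_one({"name": "Asha", "age": 29, "city": "Mumbai"})
customers.insert_one({"name": "Ravi", "age": 35, "interests": ["cricket", "music"]})

# Indexing: an index can be created on any field.
customers.create_index("age")

# Rich query language: filtering and sorting.
for doc in customers.find({"age": {"$gte": 30}}).sort("age", -1):
    print(doc["name"], doc["age"])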
4.4.4 HBASE
HBase is also a NoSQL database, but it is a column-oriented database
built on top of Hadoop. HBase is an excellent choice for applications that
require random read/write access to large amounts of data.
Built on Hadoop: HBase is built on top of Hadoop, allowing it to
leverage Hadoop's distributed file system (HDFS) for storage and
MapReduce for processing.
Strong consistency: HBase provides strong consistency guarantees,
ensuring that all reads and writes are seen by all nodes in the cluster.
Scalability: HBase can scale to handle petabytes of data and billions
of rows.
Data compression: HBase provides data compression options,
reducing the amount of storage required for large datasets.
Transactions: HBase supports multi-row transactions, allowing for
complex operations to be executed atomically.
4.5 AWS (AMAZON WEB SERVICES)
Amazon Web Service
Amazon Web Services is a platform that offers scalable, easy-to-use, flexible and cost-effective cloud computing platforms, APIs and solutions to individuals, businesses and companies. AWS provides different IT resources available on demand. It also provides different services such as the following.
EC2 stands for Elastic Compute Cloud. EC2 provides users the opportunity to choose a virtual machine as per their requirements. It gives the user freedom to choose between a variety of storage options, configurations, services, etc.
S3 stands for Simple Storage Service, using which online backup and
archiving of data becomes easier. It allows the users to store and retrieve
various types of data using API calls. It doesn’t contain any computing
element.
EBS also known as Elastic Block Store, provides persistent block storage
volumes which are to be used in instances created by EC2. It has the
ability to replicate itself for maintaining its availability throughout.
The Important Cloud Services according to various categories that are
provided by AWS are given below :
1. Compute
Amazon EC2: Amazon Elastic Compute Cloud (Amazon EC2) is a web
service that provides secure, resizable compute capacity in the cloud. It
allows organisations to obtain and configure virtual compute capacity in
the cloud. Amazon EC2 is an example of Infrastructure as a Service(IaaS).
AWS Elastic Beanstalk: AWS Elastic Beanstalk is a Platform as a Service (PaaS) that makes it easy to deploy and manage applications written in languages such as Java, .NET, PHP, Node.js, Python, Go, and Ruby.
2. Networking
Amazon Route 53: Amazon Route 53 is a highly available and scalable cloud Domain Name System (DNS) web service.
3. Storage
Amazon S3 (Simple Storage Service): Amazon Simple Storage Service
(Amazon S3) is object storage with a simple web service interface to store
and retrieve any amount of data from anywhere on the web. You can use
Amazon S3 as primary storage for cloud-native applications as a target for
backup and recovery and disaster recovery.
Amazon Glacier: Amazon Glacier is a secure, durable, and extremely low-cost storage service for data archiving and long-term backup. Data stored in Amazon Glacier takes several hours to retrieve, which is why it's ideal for archiving.
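A minimal sketch of using S3 from Python with the boto3 SDK; it assumes AWS credentials are already configured, and the bucket and file names are placeholders:

import boto3

# boto3 reads credentials from the environment or ~/.aws/credentials.
s3 = boto3.client("s3")

bucket = "my-example-bucket"          # placeholder bucket name

# Upload (back up/archive) a local file to S3 via an API call.
s3.upload_file("report.csv", bucket, "backups/report.csv")

# List the objects stored under the backups/ prefix.
response = s3.list_objects_v2(Bucket=bucket, Prefix="backups/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Retrieve the object again when needed.
s3.download_file(bucket, "backups/report.csv", "restored_report.csv")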
4. Databases
Amazon RDS (Relational Database Service): Amazon Relational Database
Service (Amazon RDS) makes it easy to set up, operate, and scale a
relational database in the cloud. Amazon RDS is also available on several database instance types – optimised for memory, performance, or I/O.
Benefits of AWS
High Availability.
Parallel Processing.
Security.
Low Latency.
Fault Tolerance and disaster recovery.
Cost effective.
4.5.2 CLOUD SERVICES
What is Cloud Computing?
Cloud computing is the on-demand delivery of computing resources over the internet. Instead of buying and maintaining their own hardware and software, users can simply rent resources from cloud service providers. There are several types of cloud computing models, including:
1. Infrastructure as a Service (IaaS): Provides users with access to virtualized computing resources such as servers, storage, and networking over the internet.
2. Cost-effectiveness: Cloud computing reduces the need for businesses
to invest in expensive hardware and infrastructure. Instead, they can
rent computing resources from cloud service providers on a pay-as-
you-go basis. This allows businesses to only pay for what they use,
reducing overall costs.
3. Accessibility: With cloud computing, users can access computing
resources from anywhere with an internet connection. This means that
employees can work remotely and collaborate on projects from
different locations.
4. Security: Cloud service providers offer robust security measures,
including encryption, firewalls, and access controls to protect data and
applications. Additionally, cloud providers often employ dedicated
security teams to monitor and respond to potential security threats.
5. Reliability: Cloud service providers offer high levels of uptime and
availability, ensuring that resources are always accessible when
needed. Additionally, cloud providers typically have redundant
infrastructure in place to ensure that services remain available even if
there is an outage in one location.
6. Flexibility: Cloud computing allows businesses to experiment with new applications and services without having to commit to long-term investments. This means that businesses can test new ideas quickly and easily, without worrying about the cost of hardware or infrastructure. Overall, cloud computing offers numerous advantages for businesses.
5. Cost: While cloud computing can be cost-effective in some cases, it
can also be expensive if usage levels are high or if resources are not
managed effectively. Additionally, cloud providers may raise prices or
change their pricing models over time, which can impact the cost of
using cloud computing.
6. Data privacy and compliance: Businesses may face challenges in
ensuring that data stored in the cloud is compliant with regulatory
requirements. Additionally, some organizations may have concerns
about data privacy and how data is used by cloud providers.
Overall, while cloud computing offers many benefits, it is important for
businesses to carefully consider the potential drawbacks and risks before
deciding to adopt cloud computing.
Building and maintaining infrastructure in-house can be expensive and time-consuming. Cloud computing allows businesses to access computing resources over the internet, rather than having to build and maintain their own infrastructure. Overall, cloud computing addresses many of the key needs that businesses face, including scalability, flexibility, cost-efficiency, reliability, security, and innovation. As a result, cloud computing has become an essential technology for many businesses.
MapReduce is widely used in big data processing because it allows
developers to write code that can be easily parallelized and distributed
across a large number of machines. This enables the processing of very
large datasets that would otherwise be difficult or impossible to handle
with traditional data processing techniques.
Uses of MapReduce
Scalability: MapReduce is highly scalable as it allows parallel
processing of large datasets across a large number of machines. This
makes it ideal for handling big data workloads.
Fault tolerance: MapReduce is designed to handle failures in the
cluster. If a machine fails, the MapReduce framework automatically
reassigns the tasks to other machines, ensuring the job is completed
without any data loss or errors.
Flexibility: MapReduce is flexible as it can be used with a variety of
data storage systems, including Hadoop Distributed File System
(HDFS), Amazon S3, and Google Cloud Storage.
Cost-effective: MapReduce is cost-effective as it uses commodity
hardware to process data. This makes it an affordable solution for
handling big data workloads.
Efficient: MapReduce is efficient because it performs data processing operations in parallel, which reduces the overall processing time. This makes it well suited for very large datasets.
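The classic example is a word count. The sketch below imitates the map, shuffle, and reduce phases in plain Python; a real Hadoop or Spark job would distribute these steps across many machines, but the logic is the same:

from collections import defaultdict

documents = [
    "big data needs parallel processing",
    "mapreduce processes big data in parallel",
]

# Map phase: each document is turned into (word, 1) pairs independently,
# so this step can run on many machines at once.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = []
for doc in documents:
    mapped.extend(map_phase(doc))

# Shuffle phase: group all counts belonging to the same key (word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the grouped values for each key into one result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, 'parallel': 2, ...}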
4.6 SUMMARY
This chapter gives a brief introduction to database systems. After studying this chapter, you will have learned about the concepts of web crawling and web scraping, the various security and ethical considerations in relation to authentication and authorization, software development tools, version control and GitHub, a detailed study of large-scale systems with their different types, namely homogeneous and heterogeneous distributed systems, NoSQL, HBase, MongoDB, AWS, cloud services and MapReduce.
4.8 REFERENCES
1. Doing Data Science, Rachel Schutt and Cathy O’Neil, O’Reilly,2013
2. Mastering Machine Learning with R, Cory Lesmeister, PACKT
Publication,2015
3. Hands-On Programming with R, Garrett Grolemund,1st Edition, 2014
5. https://www.cloudflare.com/learning/bots/what-is-a-web-
crawler/#:~:text=A%20web%20crawler%2C%20or%20spider,appear
%20in%20search%20engine%20results
6. (https://capsicummediaworks.com/web-crawler-guide/)
5
INTRODUCTION TO MODEL
SELECTION
Unit Structure
5.0 Objectives
5.1 Introduction
5.2 Regularization
5.2.1 Regularization techniques
5.3 Bias/variance tradeoff
5.3.1 What is Bias?
5.3.2 What is Variance?
5.3.3 Bias-Variance Tradeoff
5.4 Parsimony Model
5.4.1 How to choose a Parsimonious Model
5.4.1.1 AIC
5.4.1.2 BIC
5.4.1.3 MDL
5.5 Cross validation
5.0 OBJECTIVES
To understand the factors that need to be considered while selecting a
model
To get familiar with the regularization techniques and bias-variance
tradeoffs
To understand the parsimony and cross-validation techniques
5.1 INTRODUCTION
The process of choosing a single machine learning model out of a group of
potential candidates for a training dataset is known as model selection.
Model selection is a procedure that can be used to compare models of the
same type that have been set up with various model hyperparameters (e.g.,
different kernels in an SVM) and models of other types (such as logistic regression, SVM, KNN, etc.).
A "good enough" model is particular to your project and might mean
many different things, including:
A model that is proficient in terms of current technology
5.2 REGULARIZATION
The term "regularization" describes methods for calibrating machine
learning models to reduce the adjusted loss function and avoid overfitting
or underfitting.
1] Ridge Regularization
It is also referred to as Ridge Regression and modifies over- or under-
fitted models by applying a penalty equal to the sum of the squares of the
coefficient magnitude.
As a result, coefficients are produced and the mathematical function that represents our machine learning model is minimized. The coefficients' magnitudes are squared and summed. Ridge Regression applies regularization by shrinking these coefficients. The cost function of ridge regression is shown below:
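In standard form (writing the coefficients as β and the penalty weight as λ), the ridge cost function adds a squared-magnitude penalty to the usual sum of squared errors:

\text{Cost} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2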
The penalty term can be controlled by varying the value of λ. The magnitude of the coefficients decreases as the penalty increases, and the parameters are shrunk. As a result, it serves to prevent multicollinearity and, through coefficient shrinkage, to lower the model's complexity.
Have a look at the graph below, which shows linear regression:
Figure 4: Ridge regression model
Comparing the two models, with all data points, we can see that the Ridge
regression line fits the model more accurately than the linear regression
line.
2] Lasso Regularization
By imposing a penalty equal to the total of the absolute values of the
coefficients, it alters the models that are either overfitted or underfitted.
Lasso regression likewise attempts coefficient minimization, but it uses the absolute values of the coefficients rather than squaring their magnitudes. Because negative coefficients can occur, the coefficient sum can also be 0. Think about the Lasso regression cost function:
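In standard form, the lasso cost function penalizes the sum of the absolute values of the coefficients:

\text{Cost} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right|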
We can control the penalty term λ just like we did in Ridge Regression. Again, consider a linear regression model:
Figure 8: Lasso regression
Comparing the two models, with all data points, we can see that the Lasso
regression line fits the model more accurately than the linear regression
line.
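A minimal scikit-learn sketch comparing plain linear regression with its ridge and lasso counterparts on synthetic data; the data and the alpha values are only illustrative, with alpha playing the role of the penalty term λ above:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: 100 samples, 5 features, only the first two actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),    # shrinks coefficients towards zero
    "lasso": Lasso(alpha=0.1),    # can shrink some coefficients exactly to zero
}

for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))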
The bias is the discrepancy between our actual values and the predictions.
In order for our model to be able to forecast new data, it must make some
basic assumptions about our data.
Figure 9: Bias
When the bias is significant, our model's assumptions are too simplistic, and the model is unable to capture the crucial aspects of our data. As a
result, our model cannot successfully analyze the testing data because it
has not been able to recognize patterns in the training data. If so, our
model is unable to operate on fresh data and cannot be put into use.
Underfitting refers to the situation where the model is unable to recognize
patterns in our training set and hence fails for both seen and unseen data.
Figure following provides an illustration of underfitting. The line of best
fit is a straight line that doesn't go through any of the data points, as can be
seen by the model's lack of pattern detection in our data. The model was
unable to effectively train on the provided data and is also unable to
predict fresh data.
Figure 10: Underfitting
Insufficient time spent working with the data will result in bias because patterns won't be discovered. On the other hand, if our model is given access to the data too frequently, it will only be able to train very well for that data. The majority of patterns in the data will be captured, but it will also learn from the noise present in the data.
We can see from the above picture how effectively our model has learned
from the training data, which has trained it to recognize cats. Nevertheless,
given fresh information, like the image of a fox, our model predicts it to be
a cat because that is what it has learnt to do. When variance is high, our
model will catch all the properties of the data provided, including the
noise, will adjust to the data, and predict it extremely well. However,
when given new data, it is unable to forecast since it is too specific to
training data.
As a result, while our model will perform admirably on test data and
achieve high accuracy, it will underperform on brand-new, unforeseen
data. The model won't be able to forecast new data very effectively
because it could not have the exact same characteristics. Overfitting is the
term for this.
Figure 12: Over-fitted model, where we see model performance on (a) the training data and (b) new, unseen data
We need to strike the ideal balance between bias and variance for every model. This makes sure that we capture the key patterns in our model and ignore the noise present in the data. The term for this is the bias-variance tradeoff.
Figure 13:Error in Training and Testing with high Bias and Variance
We can observe from the above figure that when bias is large, the error in both the training set and the test set is also high. When the variance is high, the model performs well on the training set and its error is low, but the error on the testing set is significant. We can see that there is a zone
in the middle where the bias and variance are perfectly balanced and the
error in both the training and testing set is minimal.
When the data is concentrated in the center, or at the target, the fit is optimal. We can see that the error in our model grows as we move farther and farther from the center. The ideal model has low bias and low variance.
2. Parsimonious models typically exhibit higher forecasting accuracy.
When used on fresh data, models with fewer parameters typically
perform better.
To demonstrate these concepts, think about the following two situations.
Example 1: Parsimonious Models=Simple Interpretation
Assume that we wish to create a model to forecast house prices using a set
of real estate-related explanatory factors. Take into account the two
models below, together with their modified R-squared:
Model 1:
Equation: House price = 8,830 + 81*(sq. ft.)
Adjusted R2: 0.7734
Model 2:
Equation: House price = 8,921 + 77*(sq. ft.) + 7*(sq. ft.)^2 – 9*(age) + 600*(rooms) + 38*(baths)
Adjusted R2: 0.7823
While the second model includes five explanatory factors and only a marginally higher adjusted R2, the first model has only one explanatory variable, with an adjusted R2 of 0.7734.
The simpler first model is also much easier to interpret, yet it has nearly the same ability to explain the fluctuation in home prices as the other model. For instance, according to the first model, an increase of one unit in a home's square footage corresponds to an $81 rise in the average price of a house.
This indicates that compared to a simpler model with fewer parameters, a very complicated model with many parameters is likely to perform poorly
on a fresh dataset that it hasn't seen before.
AIC = -2/n * LL + 2 * k/n
where:
n: Number of observations in the training dataset.
LL: Log-likelihood of the model on the training dataset.
k: Number of parameters in the model.
The AIC of each model may be determined using this procedure, and the model with the lowest AIC value will be chosen as the best model.
When compared to the next method, BIC, this strategy tends to prefer
more intricate models.
BIC = -2 * LL + log(n) * k
where:
n: Number of observations in the training dataset.
log: The natural logarithm (with base e)
LL: Log-likelihood of the model on the training dataset.
k: Number of parameters in the model.
Using this method, you can calculate the BIC of each model and then
select the model with the lowest BIC value as the best model.
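A minimal sketch of comparing two candidate models by AIC and BIC with statsmodels; the housing data here is synthetic, standing in only for the example above:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic housing data standing in for the example above.
rng = np.random.default_rng(1)
n = 200
sqft = rng.uniform(500, 3000, n)
age = rng.uniform(0, 50, n)
rooms = rng.integers(2, 8, n)
price = 8830 + 81 * sqft + rng.normal(scale=20000, size=n)

df = pd.DataFrame({"sqft": sqft, "age": age, "rooms": rooms, "price": price})

# Model 1: one explanatory variable.
X1 = sm.add_constant(df[["sqft"]])
m1 = sm.OLS(df["price"], X1).fit()

# Model 2: more parameters.
X2 = sm.add_constant(df[["sqft", "age", "rooms"]])
m2 = sm.OLS(df["price"], X2).fit()

# Lower AIC/BIC indicates the preferred (well-fitting yet parsimonious) model.
print("Model 1: AIC=%.1f BIC=%.1f" % (m1.aic, m1.bic))
print("Model 2: AIC=%.1f BIC=%.1f" % (m2.aic, m2.bic))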
By training the model on a subset of the input data and testing it on a
subset of the input data that hasn't been used before, you may validate the
model's effectiveness. It is also a method for determining how well a
statistical model generalizes to a different dataset.
Testing the model's stability is a necessary step in machine learning (ML). This indicates that we cannot fit our model to the training dataset alone. We set aside a specific sample of the dataset, one that wasn't included in the training dataset, for this use. After that, before deployment, we test our model on that sample to check how it performs.
Nevertheless, it has a significant drawback in that we are only using 50% of the dataset to train our model, which means that the model can fail to capture crucial dataset information. It frequently produces an underfitted model as well.
In the leave-one-out cross-validation (LOOCV) method, only one data point is set aside for each learning set, while the remaining dataset is used to train the model. This process is repeated for each datapoint. Hence, for n samples, n distinct training sets and n test sets are obtained. It has these characteristics:
As all the data points are used, the bias is minimal in this method.
Because the process is run n times, the execution time is long.
4] K-Fold Cross-Validation
K-fold cross-validation approach divides the input dataset into K groups of
samples of equal sizes. These samples are called folds. For each learning
set, the prediction function uses k-1 folds, and the rest of the folds are used
for the test set. This approach is a very popular CV approach because it is
easy to understand, and the output is less biased than other methods.
The steps for k-fold cross-validation are:
o Split the input dataset into K groups
o For each group:
o Take one group as the reserve or test data set.
o Use remaining groups as the training dataset
o Fit the model on the training set and evaluate the performance of the
model using the test set.
Let's take an example of 5-folds cross-validation. So, the dataset is
grouped into 5 folds. On the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train the model. On the 2nd iteration, the second fold is used to test the model, and the rest are used to train the model. This process will continue until each fold has been used as the test fold.
Consider the below diagram:
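As a minimal sketch of this 5-fold process in code, assuming scikit-learn is available (the built-in diabetes dataset is used only for illustration):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Split the input dataset into K=5 groups (folds).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Each iteration: 4 folds train the model, the remaining fold tests it.
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")

print("R^2 per fold:", scores.round(3))
print("Average R^2 :", scores.mean().round(3))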
5] Stratified K-Fold Cross-Validation
This technique is a variation of k-fold cross-validation that is useful for addressing bias and variation. It can be understood by utilizing the example of housing costs, where some homes may have substantially higher prices than others. A stratified k-fold cross-validation technique is helpful to handle such circumstances.
6] Holdout Method
In this method, we randomly set aside a portion of the dataset for testing and train the model on the remaining data. The inaccuracy that results from this procedure provides insight into how effectively our model will work with the unidentified dataset. Although this method is straightforward to use, it still struggles with high variance, and it sometimes produces misleading results as well.
5.6 SUMMARY
We have studied the following points from this chapter:
Beyond model performance, there may be several competing factors to consider throughout the model selection process, including complexity, maintainability, and resource availability.
7) Explain the Parsimony Model.
8) How will you choose a Parsimonious Model?
9) Explain: AIC, BIC and MDL.
10) Explain the Cross validation.
11) Describe the methods used for Cross-Validation.
12) Write a note on limitations and applications of Cross-Validation
techniques.
6
DATA TRANSFORMATIONS
Unit Structure
6.0 Objectives
6.1 Introduction
6.2 Dimension reduction
6.2.1 The curse of dimensionality
6.2.2 Benefits of applying dimensionality reduction
6.2.3 Disadvantages of dimensionality reduction
6.2.4 Approaches of dimension reduction
6.2.5 Common techniques of dimensionality reduction
6.3 Feature extraction
6.3.1 Why feature extraction is useful?
6.3.2Applications of Feature Extraction
6.3.3 Benefits
6.4 Smoothing
6.5 Aggregating
6.5.1 Working of data aggregation
6.0 OBJECTIVES
To understand the various data transformations involved in machine
learning
Today, data can be transformed with latency measured in seconds or minutes by using cloud-based data warehouses. Organizations can load raw data directly into the data warehouse and transform it at query time, rather than performing preload transformations, thanks to the scalability of the cloud platform.
Data transformation may be used in data warehousing, data wrangling, data integration, and migration. Data transformation makes business and analytical processes more efficient and enables organizations to make better data-driven decisions.
Working with high-dimensional data makes it challenging to visualize or anticipate the results; hence, dimensionality reduction techniques must be used.
The phrase "it is a manner of turning the higher dimensions dataset into
lower dimensions dataset, guaranteeing that it gives identical information"
can be used to describe the technique of "dimensionality reduction." These
methods are frequently used in machine learning to solve classification
and regression issues while producing a more accurate predictive model.
It is frequently utilized in disciplines like speech recognition, signal
processing, bioinformatics, etc. that deal with high-dimensional data.
Moreover, it can be applied to cluster analysis, noise reduction, and data
visualization.
A] Feature Selection
In order to create a high accuracy model, a subset of the important features
from a dataset must be chosen, and the irrelevant characteristics must be
excluded. This process is known as feature selection. To put it another
way, it is a method of choosing the best characteristics from the input
dataset.
The feature selection process employs three techniques:
1] Filter methods
In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken. Some common techniques of filters method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2] Wrapper methods
The wrapper technique uses a machine learning model to evaluate itself,
but it has the same objective as the filter method. With this approach,
some features are provided to the ML model, and performance is assessed.
To improve the model's accuracy, the performance determines whether to
include or exclude certain features. Although it is more difficult to use,
this method is more accurate than the filtering method. The following are
some typical wrapper method techniques:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3] Embedded Methods: Embedded methods check the different training
iterations of the machine learning model and evaluate the importance of
each feature. Some common techniques of Embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
B] Feature extraction
The process of converting a space with many dimensions into one with
fewer dimensions is known as feature extraction. This strategy is helpful
when we want to retain all of the information while processing it with
fewer resources.
Some common feature extraction techniques are:
es
Principal Component Analysis
Kernel PCA
Backward Elimination
Forward Selection
Score comparison
Missing Value Ratio
Low Variance Filter
High Correlation Filter
Random Forest
Factor Analysis
Auto-Encoder
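A minimal sketch of one of these techniques, principal component analysis (PCA), with scikit-learn; it compresses the 4-feature iris dataset into 2 extracted features:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)          # 150 samples, 4 original features

# Standardize, then project the data onto 2 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Original shape:", X.shape)           # (150, 4)
print("Reduced shape :", X_reduced.shape)   # (150, 2)
print("Variance explained:", pca.explained_variance_ratio_.round(3))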
Backward Elimination:
o Firstly, all the n variables of the given dataset are taken to train the model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1 features for n times, and will compute the performance of the model.
o We will check the variable that has made the smallest or no change in the performance of the model, and then we will drop that variable or feature; after that, we will be left with n-1 features.
o Repeat the complete process until no feature can be dropped.
Missing Value Ratio
If a dataset has too many missing values, then we drop those variables as
they do not carry much useful information. To perform this, we can set a
threshold level, and if a variable has missing values more than that
threshold, we will drop that variable. The higher the threshold value, the
more efficient the reduction.
High Correlation Filter: When two variables carry approximately the same information, the performance of the model can be degraded. The correlation between independent numerical variables gives the calculated value of the correlation coefficient. If this value is higher than the threshold value, we can remove one of the variables from the dataset. We can keep those variables or features that show a high correlation with the target variable.
Random Forest
Factor Analysis
Factor analysis is a technique in which each variable is kept within a
group according to the correlation with other variables, it means variables
within a group can have a high correlation between themselves, but they
have a low correlation with variables of other groups.
We can understand it with an example: suppose we have two variables, income and spending. These two variables have a high correlation, which means people with high income spend more, and vice versa. So, such
variables are put into a group, and that group is known as the factor. The
number of these factors will be reduced as compared to the original
dimension of the dataset.
Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder, which is a type of ANN (artificial neural network) whose main aim is to copy the inputs to the outputs. In this, the input is compressed into a latent-space representation, and the output is produced from this representation. It has mainly two parts:
o Encoder: The function of the encoder is to compress the input to form
the latent-space representation.
o Decoder: The function of the decoder is to recreate the output from the
latent-space representation.
The dimensionality reduction method, which divides and condenses a starting set of raw data into smaller, easier-to-manage groupings, includes feature extraction. As a result, processing will be simpler. The most important characteristic of these enormous data sets is that they contain a large number of variables, and processing these variables takes a lot of computing power. In order to efficiently reduce the amount of data, feature extraction helps to extract the best features from those large data sets by choosing and combining variables into features. These features are simple to use while still accurately and uniquely describing the real data set.
6.3.3 Benefits
Feature extraction can prove helpful when training a machine learning
model. It leads to:
A Boost in training speed
An improvement in model accuracy
A reduction in risk of overfitting
A rise in model explainability
Better data visualization
6.3.4 Feature extraction techniques
The following is a list of some common feature extraction techniques:
By minimizing the changes that may occur each month, such as vacations or petrol prices, an economist can smooth out data to make seasonal adjustments for particular indicators, such as retail sales.
Yet, there are drawbacks to using this technology. When identifying trends
or patterns, data smoothing doesn't necessarily explain them. It might also
cause certain data points to be overlooked in favor of others.
Pros
Helps identify real trends by eliminating noise from the data
Allows for seasonal adjustments of economic data
Easily achieved through several techniques including moving
averages
Cons
Removing data always comes with less information to analyze,
increasing the risk of errors in analysis
Smoothing may emphasize analysts' biases and ignore outliers that
may be meaningful
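A minimal sketch of one smoothing technique, the moving average, using pandas; the monthly sales figures are made up:

import pandas as pd

# Hypothetical monthly retail sales with some month-to-month noise.
sales = pd.Series(
    [120, 135, 128, 160, 150, 172, 168, 190, 185, 210, 200, 230],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

# 3-month moving average: each point becomes the mean of a small window,
# which reduces noise and makes the underlying upward trend easier to see.
smoothed = sales.rolling(window=3).mean()

print(pd.DataFrame({"raw": sales, "smoothed": smoothed}))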
6.5 AGGREGATING
Finding, gathering, and presenting data in a condensed style is the process known as data aggregation.
6.5.2 Examples of aggregate data
Finding the average age of customer buying a particular product
which can help in finding out the targeted age group for that
particular product. Instead of dealing with an individual customer,
the average age of the customer is calculated.
Finding the number of consumers by country. This can increase sales
in the country with more buyers and help the company to enhance its
marketing in a country with low buyers. Here also, instead of an
individual buyer, a group of buyers in a country are considered.
By collecting the data from online buyers, the company can analyze
the consumer behaviour pattern, the success of the product which
helps the marketing and finance department to find new marketing
strategies and planning the budget.
Finding the value of voter turnout in a state or country. It is done by
counting the total votes of a candidate in a particular region instead
of counting the individual voter records.
6.5.3 Data aggregators
A system used in data mining called a "data aggregator" gathers information from many sources, analyses it, and then repackages it in usable packages. Data aggregators significantly contribute to the improvement of client data by serving as an agent, for example when a consumer requests instances of data.
Working
Communications on social media
Speech recognition, as in call centers
News headlines
Browsing history and other personal data from devices.
Processing of data: After collecting data, the data aggregator finds
the atomic data and aggregates it. In the processing technique,
aggregators use various algorithms from the field of Artificial
Intelligence or Machine learning techniques. It also incorporates
statistical methods to process it, like the predictive analysis. By this,
various useful insights can be extracted from raw data.
Presentation of data: After the processing step, the data will be in a
summarized format which can provide a desirable statistical result
with detailed and accurate data.
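A minimal pandas sketch of the earlier aggregation examples, computing summary values instead of looking at individual customers; the records are made up:

import pandas as pd

# Hypothetical individual customer records.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "country":     ["India", "India", "USA", "UK", "USA", "India"],
    "product":     ["phone", "laptop", "phone", "phone", "laptop", "phone"],
    "age":         [23, 35, 41, 29, 37, 25],
})

# Average age of customers buying each product (instead of individual ages).
avg_age_per_product = customers.groupby("product")["age"].mean()

# Number of consumers by country.
buyers_per_country = customers.groupby("country")["customer_id"].count()

print(avg_age_per_product)
print(buyers_per_country)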
6.6 SUMMARY
The modification of data characteristics for better access or storage is
known as data transformation. Data's format, structure, or values may all
undergo transformation. Data analytics transformation typically takes
place after data has been extracted or loaded (ETL/ELT).
Data transformation improves the effectiveness of analytical procedures
and makes it possible to make judgements using data. There is a need for
clean, usable data since raw data is frequently challenging to examine and
has a size that is too great to yield useful insight.
An analyst or engineer will choose the data structure before starting the
transformation procedure. The following are the most typical types of data
transformation:
6.7 LIST OF REFERENCES
4] State the disadvantages of dimensionality reduction.
5] Explain the different approaches of dimension reduction.
6] What are the common techniques of dimensionality reduction?
7] What is Feature extraction?
7
SUPERVISED LEARNING
Unit Structure
7.0 Objectives
7.1 Introduction
7.2 Linear models
7.2.1 What is linear model?
7.2.2 Types of linear model
7.2.3 Applications of linear model
7.3 Regression trees
7.3.1 What are regression trees?
7.3.2 Mean square error
7.3.3 Building a regression tree
7.4 Time-series Analysis
7.4.1 What is time series analysis?
7.4.2 Types of time series analysis
7.4.3 Value of time series analysis
7.1 INTRODUCTION
A class of techniques and algorithms known as "supervised learning" in
machine learning and artificial intelligence develop predictive models
utilizing data points with predetermined outcomes. The model is trained
using an appropriate learning technique (such as neural networks, linear
regression, or random forests), which often employs some sort of
optimization procedure to reduce a loss or error function.
In other words, supervised learning is the process of training a model by
providing both the right input data and output data. The term "labelled
data" is typically used to describe this input/output pair. Consider a
teacher who, armed with the right answers, will award or deduct points
from a student depending on how accurately she answered a question. For
two different sorts of issues, supervised learning is frequently utilized to
develop machine learning models.
Regression: The model identifies outputs that correspond to real-valued variables (numbers which can have decimals).
Classification: The model creates categories for its inputs.
Using a linear function of the input data, linear models forecast the target variable. Here, we've covered
linear regression and logistic regression, two essential linear models in
machine learning. While logistic regression is a classification algorithm,
linear regression is utilized for jobs involving regression.
7.2.2 Types of linear model
1] Linear regression
A statistical method known as "linear regression" makes predictions about
the outcome of a response variable by fusing a variety of affecting
variables. It makes an effort to depict the linear relationship of the target (dependent variable) with the features (independent variables). We can determine the ideal model parameter values using the cost function.
Example: An analyst would be interested in seeing how market
movement influences the price of ExxonMobil (XOM). The value of the
S&P 500 index will be the independent variable, or predictor, in this
example, while the price of XOM will be the dependent variable. In
reality, various elements influence an event's result. Hence, we usually
have many independent features.
2] Logistic regression
A progression from linear regression is logistic regression. The result of
the linear regression is first transformed between 0 and 1 by the sigmoid
function. Following that, a predetermined threshold aids in calculating the
likelihood of the output values. Values over the threshold value have a
tendency to have a probability of 1, whereas values below the threshold
value have a tendency to have a probability of 0.
Example: A bank wants to predict if a customer will default on their loan
based on their credit score and income. The independent variables would
be credit score and income, while the dependent variable would be
whether the customer defaults (1) or not (0).
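A minimal scikit-learn sketch of both models; the square-footage, credit-score, and income figures are invented only to mirror the examples above:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: predict a real-valued target from one feature.
X_sqft = np.array([[800], [1000], [1200], [1500], [1800], [2200]])
price = np.array([95, 118, 140, 172, 205, 248])          # in thousands
lin = LinearRegression().fit(X_sqft, price)
print("Predicted price for 1600 sq. ft.:", lin.predict([[1600]])[0].round(1))

# Logistic regression: predict default (1) or not (0) from score and income.
X_credit = np.array([
    [580, 25], [600, 30], [620, 28], [700, 60], [720, 75], [750, 90],
])                                                        # [credit score, income]
default = np.array([1, 1, 1, 0, 0, 0])
log = LogisticRegression().fit(X_credit, default)
print("P(default) for score 640, income 40k:",
      log.predict_proba([[640, 40]])[0, 1].round(2))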
The connection between fertilizer application rates and agricultural
yields.
Athletes' performances and training schedule.
Now, we require a different approach. The mean square error is a measurement that indicates how much our predictions stray from the actual target.
Here Y is the actual value and Y hat is the prediction. We only care about how far the prediction deviates from the target, not in which direction, so we square the difference; the squared differences are then summed and divided by the total number of records.
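Written out, with Y_i as the actual value and \hat{Y}_i as the prediction for record i, the mean square error over n records is:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2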
We follow the same procedure as with classification trees in the regression tree approach. But rather than focusing on entropy, we strive to lower the mean squared error.
Figure 2: Actual dataset
We need to build a Regression tree that best predicts the Y given the X.
Step 1
The first step is to sort the data based on X (in this case, it is already sorted). Then, take the average of the first 2 rows in variable X (which is (1+2)/2 = 1.5 according to the given dataset). Divide the dataset into 2 parts (Part A and Part B), separated by X < 1.5 and X ≥ 1.5.
Now, Part A consists of only one point, which is the first row (1,1), and all the other points are in Part B. Now, take the average of all the Y values in each part; these averages become the predictions for the two parts, and the mean squared error for this split is computed.
Step 2
In step 1, we calculated the average for the first 2 numbers of sorted X and
split the dataset based on that and calculated the predictions. Then, we do
the same process again but this time, we calculate the average for the
second 2 numbers of sorted X ( (2+3)/2 = 2.5 ). Then, we split the dataset
again based on X < 2.5 and X ≥ 2.5 into Part A and Part B again and
predict outputs, find mean square error as shown in step 1. This process is
repeated for the third 2 numbers, the fourth 2 numbers, the 5th, 6th, 7th till
n-1th 2 numbers (where n is the number of records or rows in the dataset ).
Step 3
Now that we have n-1 mean squared errors calculated, we need to choose the point at which we are going to split the dataset, and that point is the one which resulted in the lowest mean squared error on splitting at it. In this case, the point is x = 5.5. Hence the tree will be split into 2 parts, x < 5.5 and x ≥ 5.5. The root node is selected this way, and the data points that go towards the left child and right child of the root node are further recursively exposed to the same algorithm for further splitting.
Brief Explanation of working of the algorithm:
The basic idea behind the algorithm is to find the point in the independent variable at which to split the dataset into 2 parts, so that the mean squared error is minimised at that point. The algorithm does this in a repetitive fashion and forms a tree-like structure.
A regression tree for the above shown dataset would look like this
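A similar tree can also be grown programmatically. A minimal sketch with scikit-learn's DecisionTreeRegressor (assuming a recent scikit-learn version); the (X, Y) pairs are illustrative stand-ins for the dataset above, and the library chooses the split thresholds by minimising the squared error as described:

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Illustrative stand-in for the sorted (X, Y) dataset discussed above.
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
Y = np.array([1.0, 1.2, 1.1, 1.3, 1.2, 6.0, 6.2, 6.1, 6.3, 6.4])

# criterion="squared_error" makes the tree choose splits that minimise MSE.
tree = DecisionTreeRegressor(criterion="squared_error", max_depth=2)
tree.fit(X, Y)

# Inspect the learned splits and leaf predictions (the averages of each part).
print(export_text(tree, feature_names=["X"]))
print("Prediction for X=5:", tree.predict([[5]])[0])
print("Prediction for X=8:", tree.predict([[8]])[0])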
7.4 TIME-SERIES ANALYSIS
7.4.1 What is time series analysis?
A method of examining a collection of data points gathered over time is a
time series analysis. Additionally, it is specifically utilized for non-
stationary data, or data that is constantly changing over time. The time
series data varies from all other data due to this component as well. Time
series analysis is also used to predict future data based on the past. As a
result, we can conclude that it goes beyond simply gathering data.
Predictive analytics includes the subfield of time series analysis. It
supports in forecasting by projecting anticipated variations in data, such as
seasonality or cyclical activity, which provides a greater understanding of
the variables.
Data analysts have created some intricate models to help with understanding. Analysts, on the
other hand, are unable to take into account all variations or generalize a
specific model to all samples. These are the typical time series analysis
methods:
o Classification: This model is used for the identification of data. It also
allocates categories to the data.
o Segmentation: This type typically divides the data into several
segments in order to display the underlying attributes of the original
data.
Time series analysis assists businesses and organizations in examining the root causes of trends or other systematic patterns across time. Moreover, with all these facts, you can put them down in a chart visually, and that assists in a deeper knowledge of the industry. In turn, this will help firms delve more into the fundamental causes of seasonal patterns or trends.
Also, it aids organizations in projecting how specific occurrences will turn out in the future.
Growth: Another significant feature of time series analysis is that it
also adds to the financial as well as endogenous growth of an
organization. Endogenous growth is the internal expansion of a
business that resulted in increased financial capital. Time series
analysis can be used to detect changes in policy factors, which is a great
illustration of the value of this series in many domains.
Several sectors have noted the prevalence of time series analysis. Statistics
professionals frequently utilize it to determine probability and other
fundamentals. Also, it is crucial in the medical sectors.
Mathematicians also prefer time series because econometrics uses them as
well. It is crucial for predicting earthquakes and other natural disasters,
estimating their impact zones, and identifying weather patterns for
forecasting.
1. Decompositional Models
The time series data shows certain patterns. Consequently, it is quite
beneficial to divide the time series into different parts for simple
comprehension. Each element represents a particular pattern. The term "decompositional models" refers to this procedure. The time series is typically broken down into components such as trend, seasonality, and residuals.
2. Smoothing-based Model
This technique is one of the most statistical ones for time series because it removes random variation and makes the underlying trends and patterns easier to see.
5. ARIMA
AutoRegressive Integrated Moving Average is abbreviated as ARIMA. It is the forecasting technique in time series analysis that is most frequently utilised. The Moving Average Model and the Autoregressive Model are combined, together with differencing, to create it.
Hence, rather than focusing on individual values in the series, the model
instead seeks to estimate future time series movement. When there is
evidence of non-stationarity in the data, ARIMA models are applied.
A linear mixture of seasonal historical values and/or prediction errors is
added to the SARIMA model, also known as the Seasonal ARIMA model,
in addition to these.
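A minimal forecasting sketch with the ARIMA implementation in statsmodels (assuming a recent statsmodels version); the monthly series and the (1, 1, 1) order are only illustrative:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly observations with an upward trend.
series = pd.Series(
    [112, 118, 125, 129, 135, 142, 148, 155, 161, 170, 176, 184],
    index=pd.date_range("2022-01-01", periods=12, freq="MS"),
)

# Fit an ARIMA(p=1, d=1, q=1) model: one autoregressive term,
# first-order differencing, and one moving-average term.
model = ARIMA(series, order=(1, 1, 1))
result = model.fit()

# Forecast the next three periods from the fitted model.
print(result.forecast(steps=3))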
7.5 FORECASTING
Forecasting is a method of foretelling the future using the outcomes of the
past data. In order to anticipate future events, a thorough analysis of past
and present trends or events is required. It makes use of statistical methods
and tools.
Time series forecasting is employed in various sectors, including finance and many others.
The term "trends" is used to characterize the upward or downward motion of a time series, which is often displayed in linear modes.
1] Naive model
Naive models are often implemented as a random walk and a seasonal
random walk, with the most recent value observed serving as the unit for
the forecast for the following period (a forecast is made using a value from
the same time period as the most recent observation).
3] ARIMA/ SARIMA
5] Multi-layer perceptron (MLP)
CNNs, also referred to as convolutional neural network models, decision
tree-based models like Random Forest, and variations of gradient boosting
(LightGBM, CatBoost, etc.) can be used for time series forecasting in
addition to the methods mentioned above.
Forecasting based on economic and demographic factors: Economic and demographic factors contain a wealth of statistical information that can be used to accurately forecast time series data. Hence, the optimum target market may be defined, and the most effective tactics to interact with that specific target audience (TA) may be developed.
Decision nodes and leaf nodes are the two nodes of a decision tree. Whereas leaf nodes are the results of decisions and do not have any more branches, decision nodes are used to make decisions and have numerous branches. The given dataset's features
es
are used to execute the test or make the decisions.The given dataset's
features are used to execute the test or make the decisions.It is a graphical
depiction for obtaining all feasible answers to a choice or problem based
on predetermined conditions.It is known as a decision tree because, like a
ot
tree, it begins with the root node and grows on subsequent branches to
form a structure resembling a tree.The CART algorithm, which stands for
un
Decision Tree Terminologies:
Root Node: Root node is from where the decision tree starts. It
represents the entire dataset, which further gets divided into two or
more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot
be segregated further after getting a leaf node.
Parent/Child node: The root node of the tree is called the parent
node, and other nodes are called the child nodes.
Working of an algorithm:
In a decision tree, the algorithm begins at the root node and works its way
up to forecast the class of the given dataset. This algorithm follows the
branch and jumps to the following node by comparing the values of the
root attribute with those of the record (real dataset) attribute.
The algorithm verifies the attribute value with the other sub-nodes once
again for the following node before continuing. It keeps doing this until it
reaches the tree's leaf node. The following algorithm can help you
comprehend the entire procedure:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide the S into subsets that contains possible values for the
best attributes.
o Step-4: Generate the decision tree node, which contains the best
attribute.
o Step-5: Recursively make new decision trees using the subsets of the
dataset created in step -3. Continue this process until a stage is reached
where you cannot further classify the nodes and called the final node as
a leaf node.
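A minimal scikit-learn sketch of these steps on the built-in iris dataset; the library applies an attribute selection measure (Gini impurity by default) at each split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Grow the tree from the root node, choosing the best attribute at each node
# according to an attribute selection measure (Gini impurity here).
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Inspect the learned decision and leaf nodes, then evaluate on unseen data.
print(export_text(clf, feature_names=list(iris.feature_names)))
print("Test accuracy:", round(clf.score(X_test, y_test), 3))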
7.7 LOGISTIC REGRESSION
One of the most often used Machine Learning algorithms, within the
category of Supervised Learning, is logistic regression. With a
predetermined set of independent factors, it is used to predict the
categorical dependent variable. In a categorical dependent variable, the
output is predicted via logistic regression. As a result, the result must be a
115
discrete or categorical value. Rather than providing the exact values of 0 and
1, it provides the probabilistic values that fall between 0 and 1. It can be
either Yes or No, 0 or 1, true or false, etc. With the exception of how they
are applied, logistic regression and linear regression are very similar.
Whereas logistic regression is used to solve classification difficulties,
linear regression is used to solve regression problems.
In logistic regression, we fit a "S" shaped logistic function, which predicts
two maximum values, rather than a regression line (0 or 1). The logistic
function's curve shows the possibility of several things, including whether
or not the cells are malignant, whether or not a mouse is obese depending
on its weight, etc. Logistic Regression is a major machine learning
technique since it has the capacity to offer probabilities and categorize
new data using continuous and discrete datasets. When classifying
observations using various sources of data, logistic regression can be used
to quickly identify the factors that will work well. The logistic function is
displayed in the graphic below:
The sigmoid function is a mathematical tool used in logistic regression. It transforms any real value into a value between 0 and 1. The logistic regression's result must fall within the range of 0 and 1, and because it cannot go beyond this value, it has the shape of an "S" curve. The S-form curve is called the sigmoid function or the logistic function. We apply the threshold value idea in logistic regression, which establishes the likelihood of either 0 or 1. For example, values above the threshold value tend towards 1, and values below it tend towards 0.
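Written out, with z standing for the linear combination of the inputs, the sigmoid (logistic) function is:

\sigma(z) = \frac{1}{1 + e^{-z}}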
Assumptions for logistic regression:
o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.
Type of logistic regression:
On the basis of the categories, Logistic Regression can be classified into
three types:
o Binomial: In binomial Logistic regression, there can be only two
possible types of the dependent variables, such as 0 or 1, Pass or Fail,
etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or
more possible unordered types of the dependent variable, such as
"cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more
possible ordered types of dependent variables, such as "low",
"Medium", or "High".
7.8 CLASSIFICATION USING SEPARATING
HYPERPLANES
Suppose that we have an n×p data matrix X that consists of n training
observations in p-dimensional space, and that these observations fall into
two classes; that is, y1, ..., yn ∈ {−1, 1}, where −1 represents one class
and 1 the other class. We also have a test observation, a p-vector of
observed features x∗ = (x1∗, ..., xp∗)T. Our goal is to develop a classifier
based on the training data that will correctly classify the test observation
using its feature measurements.
Suppose that it is possible to construct a hyperplane that separates the
training observations perfectly according to their class labels, labelling
the observations from one class (say, the blue class) as yi = 1 and
those from the purple class as yi = −1. Then a separating hyperplane has
the property that
β0 + β1xi1 + β2xi2 + · · · + βpxip > 0 if yi = 1, and
β0 + β1xi1 + β2xi2 + · · · + βpxip < 0 if yi = −1;
equivalently, yi(β0 + β1xi1 + · · · + βpxip) > 0 for every training observation i.
If such a separating hyperplane exists, we can use it to construct a very
natural classifier: a test observation is assigned to a class depending on
which side of the hyperplane it lies; the figure shows an
example of such a classifier. That is, we classify the test observation x∗
based on the sign of
f(x∗) = β0 + β1x1∗ + β2x2∗ + · · · + βpxp∗.
If f(x∗) is positive, then we assign the test observation to class 1, and if
f(x∗) is negative, then we assign it to class −1. We can also make use of
the magnitude of f(x∗). If f(x∗) is far from zero, then x∗ lies far from the
hyperplane, and so we can be confident about our class assignment for x∗.
On the other hand, if f(x∗) is close to zero, then x∗ is located near the
hyperplane, and so we are less certain about the class assignment for x∗.
As the figure shows, a classifier that is based on a separating hyperplane
leads to a linear decision boundary.
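As an illustrative addition (not from the original text), the short NumPy sketch below classifies test points by the sign of f(x∗); the two-dimensional coefficients β0 and β are invented for the example.

# Minimal sketch: classify test points by the sign of f(x*) = beta0 + beta^T x*.
# The hyperplane coefficients below are invented purely for illustration.
import numpy as np

beta0 = -1.0
beta = np.array([2.0, 3.0])        # hyperplane: -1 + 2*x1 + 3*x2 = 0

def classify(x_star):
    f = beta0 + beta @ x_star       # value of f(x*) for the test observation
    return f, (1 if f > 0 else -1)  # class +1 on one side, -1 on the other

for x_star in [np.array([1.0, 1.0]), np.array([0.1, 0.1]), np.array([-1.0, -1.0])]:
    f, label = classify(x_star)
    print(x_star, "->", round(f, 2), "class", label)

# |f(x*)| far from zero means the point lies far from the hyperplane, so the
# class assignment is confident; |f(x*)| near zero means less certainty.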
7.9 K-NN
One of the simplest machine learning algorithms, based on the supervised
learning method, is K-Nearest Neighbours (K-NN). The K-NN algorithm
assumes that the new case and the existing cases are comparable, and it
places the new instance in the category that is most similar to the existing
categories. A new data point is classified using the K-NN algorithm based
on similarity after all the existing data has been stored; this means that,
using the K-NN method, fresh data can be quickly and accurately sorted
into a suitable category. Although the K-NN approach is most frequently
employed for classification problems, it can also be utilized for
regression. Since K-NN is a non-parametric technique, it makes no
assumptions about the underlying data. It is also known as a lazy learner
algorithm because it saves the training dataset rather than learning from it
immediately; instead, it uses the dataset to perform an action when
classifying data. The K-NN method maintains the dataset during the
training phase and subsequently classifies new data into the category that
is most similar to the new data.
Consider the following scenario: We have a photograph of a creature that
resembles both cats and dogs, but we are unsure of its identity. However,
since the KNN algorithm is based on a similarity metric, we can utilize it
for this identification. Our KNN model will examine the new image for
features that are comparable to those found in the photographs of cats and
dogs, and based on those features, it will classify the image as belonging
to either the cat or the dog group.
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these k neighbors, count the number of the data points
in each category.
o Step-5: Assign the new data point to the category for which the
number of neighbours is maximum.
o Step-6: Our model is ready (a minimal scikit-learn sketch of these steps
is given below).
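The following sketch is an illustrative addition; the tiny two-dimensional dataset, the category labels, and k = 5 are assumptions made purely for the example.

# Minimal sketch: K-NN classification with scikit-learn (invented data).
from sklearn.neighbors import KNeighborsClassifier

# Two-dimensional points belonging to category "A" or "B".
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6], [2, 1], [7, 8]]
y = ["A", "A", "A", "B", "B", "B", "A", "B"]

# K = 5 neighbours; the default metric is the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# The new point is assigned to the category with the most neighbours among the 5.
print(knn.predict([[3, 4]]))   # expected to fall in category "A"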
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
First, we will choose the number of neighbours, say k = 5, and then
calculate the Euclidean distance between the new data point and the
existing data points. The Euclidean distance between two points is the
straight-line distance we have already studied in geometry. It can be
calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
By calculating the Euclidean distance we obtain the nearest neighbours,
with three nearest neighbours in category A and two nearest neighbours
in category B. Consider the below image:
o As we can see, three of the five nearest neighbours are from category A;
hence this new data point must belong to category A.
7.9.3 Selecting the value of k in the KNN Algorithm
o There is no particular way to determine the best value for "K", so we
need to try several values to find the best among them; the most
preferred value for K is 5. A simple way of trying several values is
sketched after this list.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to
the effects of outliers in the model.
o Large values for K are good, but they smooth the decision boundary
and may miss smaller local patterns in the data.
Advantages of the KNN algorithm:
o It is simple to implement.
o It is robust to noisy training data.
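As sketched below (an illustrative addition, not from the original text), one common practice is to try several values of K and keep the one with the best cross-validated accuracy; the candidate values and the use of the iris dataset from scikit-learn are assumptions.

# Minimal sketch: choosing K by cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()   # mean accuracy over 5 folds
    if score > best_score:
        best_k, best_score = k, score

print("best K:", best_k, "accuracy:", round(best_score, 3))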
7.10 SUMMARY
A subset of machine learning and artificial intelligence is supervised
learning, commonly referred to as supervised machine learning. It is
defined by its use of labelled datasets to train algorithms to classify data
or predict outcomes effectively. Supervised learning is the most widely
used machine learning approach since it is simple to comprehend and
apply. The model uses labelled data and variables as inputs to obtain
reliable results. The aim of supervised learning is to build an artificial
system that can learn the relationship between the input and the output
and anticipate the system's output given new inputs. We have also
covered several supervised learning algorithms, along with their working,
and delved into the fundamental aspects that affect their performance.
7.11 QUESTIONS
2] State the types of linear model.
3] Illustrate the applications of linear model.
4] What are regression trees?
5] Explain the steps involved in building a regression tree.
8
UNSUPERVISED LEARNING
Unit Structure
8.0 Objectives
8.1 Introduction
8.2 Principal Components Analysis (PCA)
8.2.1 Principal components in PCA
8.2.2 Steps for PCA algorithm
8.2.3 Applications of PCA
8.3 k-means clustering
8.3.1 k-means algorithm
8.3.2 Working of k-means algorithm
8.4 Hierarchical clustering
8.5 Ensemble methods
8.6 Summary
8.7 List of References
8.0 OBJECTIVES
To get familiar with the fundamentals and principles involved in
unsupervised learning
To get acquainted with the different algorithms associated with
unsupervised learning
8.1 INTRODUCTION
As the name suggests, unsupervised learning is a machine learning
technique in which models are not supervised using a training dataset.
Instead, the models themselves find the hidden patterns and insights in
the given data. It can be compared to the learning that takes place in the
human brain while learning new things. It can be defined as:
"Unsupervised learning is a type of machine learning in which models are
trained using an unlabeled dataset and are allowed to act on that data
without any supervision"
Unlike supervised learning, we have the input data but no corresponding
output data, so unsupervised learning cannot be used to solve a regression
or classification problem directly. Finding the underlying structure of a
dataset, grouping the data based on similarities, and representing the
dataset in a compressed format are the objectives of unsupervised
learning.
Consider the following scenario: An input dataset including photos of
various breeds of cats and dogs is provided to the unsupervised learning
algorithm. The algorithm is never trained on the provided dataset; thus, it
has no knowledge of the dataset's characteristics. The unsupervised
learning algorithm's job is to let the image features speak for themselves:
it will cluster the image collection into groups based on visual
similarities.
The following are a few key arguments for the significance of
unsupervised learning:
Finding valuable insights from the data is made easier with the aid of
unsupervised learning.
In the real world, input data with the corresponding output is not always
available, so unsupervised learning is needed to handle such cases.
8.2 PRINCIPAL COMPONENTS ANALYSIS (PCA)
Principal Components Analysis (PCA) is an unsupervised
dimensionality-reduction technique that converts a set of correlated
features into a smaller set of uncorrelated features called principal
components. Some common terms used in the PCA algorithm are:
o Orthogonal: It defines that variables are not correlated to each other,
and hence the correlation between a pair of variables is zero.
o Eigenvectors: If there is a square matrix M and a non-zero vector v,
then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariances between
pairs of variables is called the covariance matrix.
2] Data representation in a structure
We will now represent our dataset in a structure: a two-dimensional
matrix X of the independent variables, in which each row represents a
data item and each column represents a feature. The number of columns
determines the dimensionality of the dataset.
5] Calculating the Eigenvalues and Eigenvectors
The eigenvalues and eigenvectors of the resulting covariance matrix Z
must now be determined. The eigenvectors of the covariance matrix
represent the directions of the axes with the most information (highest
variance), and the eigenvalues are defined as the coefficients (magnitudes)
of these eigenvectors. A minimal sketch of these PCA steps is given below.
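The sketch below is an illustrative addition (not from the original text): it carries out the PCA steps with NumPy on a small invented dataset by arranging the data in a matrix, standardizing it, computing the covariance matrix, and taking its eigenvalues and eigenvectors.

# Minimal sketch of the PCA steps with NumPy (invented data).
import numpy as np

# Arrange the dataset as a matrix: rows are data items, columns are features.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Standardize each column (zero mean, unit variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Compute the covariance matrix of the standardized data.
cov = np.cov(Z, rowvar=False)

# Eigenvalues give the variance along each eigenvector (direction).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort the directions by decreasing eigenvalue and project onto the top component.
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order]
X_reduced = Z @ components[:, :1]   # data expressed on the first principal axis

print(eigenvalues[order])
print(X_reduced)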
8.3 K-MEANS CLUSTERING
K-means clustering groups an unlabeled dataset into k clusters in such a
way that each data point in a cluster shares similar characteristics with the
others. It gives us the ability to divide the data into various groups and
provides a practical method for automatically identifying the groups in
the unlabeled dataset without the need for any training.
Each cluster has a centroid assigned to it because the algorithm is
centroid-based. The algorithm's primary goal is to reduce the total
distance between the data points and the centroids of their clusters: it
takes the unlabeled dataset as input, divides it into k clusters, and then
repeats the procedure until the best clusters are found. In this algorithm,
the value of k should be predetermined.
The two major functions of the k-means clustering algorithm are:
Uses an iterative technique to choose the best value for the K centre
points or centroids.
Assigns each data point to its closest centroid; the data points near a
particular centroid form a cluster.
The working of the K-means clustering algorithm is explained in the
following steps (a minimal scikit-learn sketch is given after the steps):
Step 1: Select the number K to decide the number of clusters.
Step 2: Select K random points or centroids (they need not be points from
the input dataset).
Step 3: Assign each data point to its nearest centroid, which will create the
K clusters that have been predetermined.
Step 4: Determine the variance and relocate each cluster's centroid.
Step 5: Re-assign each data point to the new closest centroid of each
cluster.
Step 6: If any reassignment occurs, go back to Step 4; otherwise, go to
FINISH.
Step 7: The model is finished.
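As an illustrative addition (not part of the original text), the following scikit-learn sketch performs the same steps automatically on a small invented dataset with K = 2.

# Minimal sketch: k-means clustering with scikit-learn (invented data, K = 2).
from sklearn.cluster import KMeans

# Small two-dimensional dataset with two obvious groups.
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# n_init controls how many times the centroids are randomly re-initialized.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index (0 or 1) assigned to each point
print(kmeans.cluster_centers_)  # final centroids of the two clusters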
Let's analyze the visual plots in order to comprehend the aforementioned
steps:
Consider that there are two variables, M1 and M2. The following shows
the x-y axis scatter plot of these two variables:
o Let's take the number k of clusters as K=2, to identify the dataset and to
put the points into different clusters. It means here we will try to group
these datasets into two different clusters.
o We need to choose some random k points or centroid to form the
cluster. These points can be either the points from the dataset or any
other point. So, here we are selecting the below two points as k points,
which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-
point or centroid. We will compute it by applying some mathematics
that we have studied to calculate the distance between two points. So,
we will draw a median between both the centroids. Consider the below
image:
From the above image, it is clear that the points on the left side of the line
are near the K1 or blue centroid, and the points to the right of the line are
close to the yellow centroid. Let's colour them blue and yellow for clear
visualization.
o As we need to find the closest cluster, we will repeat the process by
choosing a new centroid. To choose the new centroids, we will
compute the centre of gravity of these centroids and will find the new
centroids as below:
o Next, we will reassign each data point to the new centroid. For this, we
will repeat the same process of finding a median line. The median will
be like the below image:
o From the above image, we can see that one yellow point is on the left
side of the line, and two blue points are to the right of the line. So,
these three points will be assigned to the new centroids.
o As we got the new centroids so again will draw the median line and
reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points
on either side of the line, which means our model is formed. Consider
the below image:
As our model is ready, so we can now remove the assumed centroids, and
the two final clusters will be as shown in the below image:
8.4 HIERARCHICAL CLUSTERING
Hierarchical clustering begins by treating every data point as a separate
cluster; it then repeatedly finds the two closest clusters and combines the
two clusters that are the most similar. These procedures must be repeated
until all of the clusters are combined.
The goal of hierarchical clustering is to create a hierarchy of nested
clusters. A dendrogram, a tree-like diagram that records the sequence of
merges or splits, depicts this hierarchy graphically; it is an inverted tree
that shows the order in which elements are combined (bottom-up view)
or clusters are broken apart (top-down view).
Hierarchical clustering is a data mining technique that builds a hierarchy
of clusters from the data, but the approach has some limitations:
1. It can have high processing costs and memory needs, particularly for
huge datasets.
2. The initial conditions, linkage criterion, and distance metric can have
an impact on the outcomes.
In conclusion, hierarchical clustering is a data mining technique that
groups related data points into clusters by giving the clusters a
hierarchical structure. The technique can handle various data formats and
shows the connections between the clusters, but the results can be
sensitive to the chosen settings and the computational cost can be large.
Hierarchical clustering is carried out in one of two ways:
1. Agglomerative: At first, treat each data point as a separate cluster.
Then, at each step, combine the closest pair of clusters. It uses a bottom-up
approach: every data point is first viewed as a distinct entity or cluster,
and the clusters merge with other clusters at each iteration until only one
cluster remains.
Agglomerative hierarchical clustering uses the following algorithm (a
minimal sketch using SciPy is given after the steps):
Step-1: Consider each alphabet as a single cluster and calculate the
distance of one cluster from all the other clusters.
Step-2: In the second step comparable clusters are merged together
to form a single cluster. Let’s say cluster (B) and cluster (C) are very
similar to each other therefore we merge them in the second step
similarly to cluster (D) and (E) and at last, we get the clusters [(A),
(BC), (DE), (F)]
Step-3: We recalculate the proximity according to the algorithm and
merge the two nearest clusters([(DE), (F)]) together to form new
clusters as [(A), (BC), (DEF)]
Step-4: Repeating the same process; The clusters DEF and BC are
comparable and merged together to form a new cluster. We’re now
left with clusters [(A), (BCDEF)].
Step-5: At last the two remaining clusters are merged together to
form a single cluster [(ABCDEF)].
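The following SciPy sketch is an illustrative addition (not from the original text); the six invented points stand in for the items A to F, and the sketch records the merge sequence that a dendrogram would display.

# Minimal sketch: agglomerative (bottom-up) hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six invented points standing in for the items A, B, C, D, E, F.
X = np.array([[1.0, 1.0], [1.1, 1.1], [1.2, 1.0],
              [5.0, 5.0], [5.1, 5.2], [9.0, 9.0]])

# linkage() repeatedly merges the two closest clusters (Ward's criterion here);
# each row of Z records one merge, i.e. one level of the dendrogram.
Z = linkage(X, method="ward")
print(Z)

# Cut the hierarchy to obtain a flat clustering with, say, two clusters.
print(fcluster(Z, t=2, criterion="maxclust"))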
2. Divisive:
We can say that Divisive Hierarchical clustering is precisely
the opposite of Agglomerative Hierarchical clustering. In Divisive
Hierarchical clustering, we take into account all of the data points as a
single cluster and in every iteration, we separate the data points from the
clusters which aren’t comparable. In the end, we are left with N clusters.
8.5 ENSEMBLE METHODS
Ensemble methods combine several base learners to produce one optimal
predictive model. They fall into two broad categories: sequential
ensemble techniques and parallel ensemble approaches. To promote
independence among the base learners, parallel techniques make use of
parallel generation of base learners; the error resulting from the use of
averages is greatly decreased by this independence of the base learners.
The majority of ensemble techniques use only one algorithm for base
learning, which makes all the base learners homogeneous; base learners
that are of the same type and have comparable traits are referred to as
homogeneous base learners.
1] Bagging
Bagging, short for bootstrap aggregating, draws bootstrap samples from
the training data with replacement, fits a base learner to each sample, and
aggregates their predictions so that all possible outcomes are taken into
account. As a result, the aggregation is based either on all of the results
from the predictive models or on the probability bootstrapping techniques.
Bagging is useful because it creates a single strong learner that is more
stable than the individual weak base learners. Moreover, it reduces
variance, which lessens overfitting in models. The computational cost of
bagging is one of its drawbacks, and applying the bagging technique
incorrectly can result in higher bias in models. A minimal scikit-learn
sketch of bagging is given below.
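This sketch is an illustrative addition using scikit-learn's BaggingClassifier on the iris dataset; the number of estimators is an assumption. By default, the classifier bags decision trees trained on bootstrap samples and aggregates their votes.

# Minimal sketch: bagging with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 50 base learners (decision trees by default) is trained on a
# bootstrap sample drawn with replacement; predictions are aggregated by voting.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bagging, X, y, cv=5).mean())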
2] Boosting
Boosting is an ensemble strategy that improves future predictions by
learning from previous predictor errors. The method greatly increases
model predictability by combining numerous weak base learners into one
strong learner. Boosting works by placing weak learners in a sequential
order so that each learner can learn from the mistakes of the one before it
and improve the predictive model.
There are many different types of boosting, such as gradient boosting,
Adaptive Boosting (AdaBoost), and XGBoost (Extreme Gradient
Boosting). AdaBoost employs weak learners in the form of decision trees,
most of which consist of a single split known as a decision stump. The
first decision stump in AdaBoost is built from observations that all carry
equal weights.
Gradient boosting adds predictors to the ensemble sequentially, with each
new predictor fitted to correct the errors of its predecessors, which
improves the model's accuracy. XGBoost uses gradient-boosted decision
trees and offers faster performance; it relies heavily on the computational
speed and the performance of the target model. Because training has to
proceed in sequence, gradient boosted machines can be comparatively
slow to train. A minimal scikit-learn sketch of boosting is given below.
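As an illustrative addition (not from the original text), the sketch below uses scikit-learn's AdaBoostClassifier, whose default weak learner is a single-split decision tree (a decision stump), on the iris dataset; the number of estimators is an assumption.

# Minimal sketch: AdaBoost with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each successive stump concentrates on the samples the previous ones
# misclassified by increasing the weights of those samples.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boosting, X, y, cv=5).mean())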
3] Stacking
Another ensemble method, stacking, is sometimes known as stacked
generalization. This method works by allowing a training algorithm to
combine the predictions of several other learning algorithms. Stacking
has been used effectively for regression, density estimation, distance
learning, and classification; it can also be used to measure the error rate
involved in bagging. A minimal scikit-learn sketch of stacking is given
below.
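The following sketch is an illustrative addition using scikit-learn's StackingClassifier on the iris dataset; the choice of base learners (K-NN and a decision tree) and of logistic regression as the final estimator is an assumption.

# Minimal sketch: stacking with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The base learners' predictions become the inputs of the final estimator.
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=5)),
                ("tree", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(cross_val_score(stack, X, y, cv=5).mean())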
8.6 SUMMARY
Any machine learning challenge aims to choose a single model that can
most accurately forecast the desired result. Ensemble approaches consider
a wide range of models and average those models to build one final model,
as opposed to creating one model and hoping that this model is the
best/most accurate predictor we can make.
Unsupervised learning, commonly referred to as unsupervised machine
learning, analyses and groups unlabeled datasets using machine learning
algorithms. These algorithms identify hidden patterns or data clusters
without the assistance of a human.
Unsupervised learning's main objective is to find hidden and intriguing
patterns in unlabeled data. Unsupervised learning techniques, in contrast
to supervised learning, cannot be used to solve a regression or
classification problem directly because it is unknown what the output
values will be. We have also studied different techniques and algorithms
for classification and to boost the performance.
In conclusion, unsupervised learning algorithms allow you to accomplish
more complex processing jobs. Unsupervised learning also has practical
benefits; for example, the analysis can take place in real time, so the
input data can be examined and categorized as it arrives.
8.7 LIST OF REFERENCES
1] Doing Data Science, Rachel Schutt and Cathy O'Neil, O'Reilly, 2013.
2] Mastering Machine Learning with R, Cory Lesmeister, PACKT
Publication, 2015.
3] Hands-On Programming with R, Garrett Grolemund, 1st Edition, 2014.