IV B. Tech. – I Semester
DATA SCIENCE
(20BT70503)
(Professional Elective – 5)
Prepared
By
CH. PRATHIMA
Assistant Professor
Data Science
School of Computing
Mohan Babu University
IV B. Tech. – I Semester
(20BT70503) DATA SCIENCE
(Professional Elective – 5)
(Common to CSE, CSBS and CSSE)
COURSE DESCRIPTION: Concepts of data science, extracting meaning from data, the
dimensionality problem, plotting with Pandas and Seaborn, Probability distributions, Time
series analysis, Predictive modeling.
COURSE OUTCOMES: After successful completion of this course, the students will be able to:
CO1. Demonstrate knowledge on the concepts of data science to perform data analysis.
CO2. Develop methods to extract meaning from data using feature selection techniques.
CO3. Create data visualization using charts, plots and histograms to identify trends,
patterns and outliers in data using Matplotlib and Seaborn.
CO4. Develop distribution functions to analyze and interpret data to extract meaningful
statistics.
CO5. Design and develop predictive models for a given problem to support prediction and
forecasting.
DETAILED SYLLABUS:
UNIT–I: INTRODUCTION (09 Periods)
Definition of data science, Skills for data science, Tools for data science, Data types, Data
collections, Data preprocessing, Data analysis and data analytics, Descriptive analysis,
Diagnostic analytics, Predictive analytics, Prescriptive analytics, Exploratory analysis,
Mechanistic analysis.
(AUTONOMOUS)
SREE SAINATH NAGAR, A. RANGAMPET – 517 102
Department of Information Technology
LESSON PLAN – 2023-24
S. No. | Topic | No. of periods | Book(s) followed | Topics for self-study
UNIT–I: INTRODUCTION
1 | Definition of data science, Skills for data science | 1 | T1 |
2 | Tools for data science, Data types | 1 | T1 |
3 | Data collections, Data preprocessing, Data analysis and data analytics | 1 | T1 | Tools for Data Science
4 | Descriptive analysis, Diagnostic analytics | 2 | T1 |
5 | Predictive analytics, Prescriptive analytics | 2 | T1 |
6 | Exploratory analysis | 1 | T1 |
7 | Mechanistic analysis | 1 | T1 |
Total periods required: 09
UNIT–II: DATA EXTRACTION
8 | Extracting meaning from data – Feature selection | 1 | T2 |
9 | User retention, Filters | 1 | T2 | Data extraction using Deep learning tools
10 | Wrappers, Entropy | 1 | T2 |
UNIT–III: DATA VISUALIZATION
16 | Plotting with Pandas and Seaborn | 1 | T2 |
17 | Line plots, Bar plots | 1 | T2 | Using Power BI Tool
18 | Histograms and density plots | 1 | T2 |
19 | Scatter plots | 1 | T2 |
20 | Facet grids and Categorical data | 1 | T2 |
21 | Other Python visualization tools | 2 | T2 |
Total periods required: 08
UNIT–IV: STATISTICAL THINKING
22 | Distributions – Representing and plotting histograms | 1 | T2 |
23 | Outliers, summarizing distributions | 1 | T2 |
24 | Variance, Reporting results | 1 | T2 | Advanced Statistical tool on data using AI
25 | Probability mass function – Plotting PMFs, Other visualizations | 1 | T2 |
26 | The class size paradox, Data frame indexing | 1 | T2 |
27 | Cumulative distribution functions – Limits of PMFs | 1 | T2 |
28 | Representing CDFs, Percentile based statistics | 1 | T2 |
29 | Random numbers, comparing percentile ranks | 2 | T2 |
30 | Modeling distributions – Exponential distribution, Normal distribution, Lognormal distribution | 2 | T2 |
Total periods required: 11
UNIT–V: TIME SERIES ANALYSIS AND PREDICTIVE MODELING
31 | Time series analysis – Importing and cleaning | 1 | T2 |
32 | Plotting, Moving averages | 1 | T2 | Advanced Data Analysis and text Analysis
33 | Missing values, Serial correlation | 1 | T2 |
34 | Autocorrelation | 1 | T2 |
35 | Predictive modeling – Overview, Evaluating predictive models | 2 | T2 |
36 | Building predictive model solutions | 1 | T2 |
37 | Sentiment analysis | 1 | T2 |
Total periods required: 08
Grand total periods required: 45
TEXT BOOKS:
1. Chirag Shah, A Hands-on Introduction to Data Science, Cambridge University Press, 2020.
2. Allen B. Downey, Think Stats: Exploratory Data Analysis, O’Reilly, 2nd Edition, 2014.
REFERENCE BOOKS:
1. Wes McKinney, Python for Data Analysis, O’Reilly, 2nd Edition, 2017.
2. Ofer Mendelevitch, Casey Stella, Douglas Eadline, Practical Data Science with Hadoop and
Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley, 2017.
3. Rachel Schutt, Cathy O’Neil, Doing Data Science: Straight Talk from the Frontline,
O’Reilly, 2014.
4. Jake VanderPlas, Python Data Science Handbook: Essential Tools for Working with Data,
O’Reilly, 2017.
ADDITIONAL LEARNING RESOURCES:
1. https://swayam.gov.in/nd1_noc19_cs60/preview
2. https://towardsdatascience.com/
3. https://www.w3schools.com/datascience/
4. https://github.com/jakevdp/PythonDataScienceHandbook
5. https://www.kaggle.com
Course Material
UNIT–I: INTRODUCTION (09 Periods)
Definition of data science, Skills for data science, Tools for data science, Data types, Data collections,
Data preprocessing, Data analysis and data analytics, Descriptive analysis, Diagnostic analytics,
Predictive analytics, Prescriptive analytics, Exploratory analysis, Mechanistic analysis.
Data science is not a one-step process that you can learn in a short time and then call yourself a
data scientist. It passes through many stages, and every element is important. One should always follow
the proper steps to climb the ladder; every step has its value and counts towards your model. Buckle up
and get ready to learn about those steps.
• Problem Statement: No work starts without motivation, and data science is no exception.
It is really important to formulate your problem statement clearly and precisely, because your
whole model and its working depend on that statement. Many practitioners consider this the most
important step in data science. So, make sure you know what your problem statement is and how well
it can add value to a business or any other organization.
• Data Collection: After defining the problem statement, the next obvious step is to go in search of
the data that you might require for your model. You must do good research and find all that you need. Data
can be structured or unstructured, and it might come in various forms such as videos, spreadsheets,
coded forms, etc. You must collect data from all these kinds of sources.
• Data Cleaning: Once you have formulated your motive and collected your data, the next
step is cleaning. Yes, data cleaning: every data scientist’s favorite chore. Data
cleaning is all about the removal of missing, redundant, unnecessary and duplicate data from your
collection. There are various tools to do so with the help of programming in either R or Python, and it is
entirely up to you to choose one of them. Various scientists have their own opinions on which to choose.
When it comes to the statistical part, R is often preferred over Python, as it has the privilege of more than
12,000 packages, while Python is used because it is fast, easily accessible, and can perform the same
things as R with the help of various packages.
• Data Analysis and Exploration: This is one of the prime tasks in data science, and the time to
bring your inner Holmes out. It is about analyzing the structure of the data, finding hidden patterns in it,
studying behaviors, visualizing the effect of one variable on others, and then drawing conclusions. We can
explore the data with the help of various graphs produced by libraries in any
programming language; in R, ggplot2 is one of the most famous packages, while Matplotlib plays the same role in Python.
• Data Modelling: Once you are done with the study you have formed from data visualization,
you must start building a hypothesis model that may yield good predictions in the future.
Here, you must choose a good algorithm that best fits your problem. There are different kinds of algorithms,
from regression to classification, SVM (support vector machines), clustering, etc. Your model will typically
be a machine learning algorithm: you train it with the training data and then test it with test
data. There are various ways to do this; a common one is k-fold cross-validation, where the data is split
into k parts and the model is repeatedly trained on k − 1 folds and tested on the remaining fold (a brief
sketch of this appears after this list).
• Optimization and Deployment: You followed each and every step and hence build a model
that you feel is the best fit. But how can you decide how well your model is performing? This where
optimization comes. You test your data and find how well it is performing by checking its accuracy.
In short, you check the efficiency of the data model and thus try to optimize it for better accurate
prediction. Deployment deals with the launch of your model and let the people outside there to benefit
from that. You can also obtain feedback from organizations and people to know their need and then
to work more on your model.
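To make the train/test and k-fold ideas above concrete, here is a minimal sketch using scikit-learn; the synthetic dataset and the choice of a logistic regression model are illustrative assumptions, not part of any particular project.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data standing in for a real, cleaned dataset (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out a test set, train on the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))

# k-fold cross-validation: split the training data into k folds, repeatedly train on
# k - 1 folds and test on the remaining one, then average the scores
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("5-fold CV accuracy:", scores.mean())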
Advantages of data science include the following:
1. Improved decision-making: Data science can help organizations make better decisions by providing
insights and predictions based on data analysis.
2. Cost-effective: With the right tools and techniques, data science can help organizations reduce costs
by identifying areas of inefficiency and optimizing processes.
3. Innovation: Data science can be used to identify new opportunities for innovation and to develop new
products and services.
4. Competitive advantage: Organizations that use data science effectively can gain a competitive
advantage by making better decisions, improving efficiency, and identifying new opportunities.
5. Personalization: Data science can help organizations personalize their products or services to better
meet the needs of individual customers.
Data science also comes with challenges and limitations:
6. Data quality: The accuracy and quality of the data used in data science can have a significant impact
on the results obtained.
7. Privacy concerns: The collection and use of data can raise privacy concerns, particularly if the data is
personal or sensitive.
8. Complexity: Data science can be a complex and technical field that requires specialized skills and
expertise.
9. Bias: Data science algorithms can be biased if the data used to train them is biased, which can lead to
inaccurate results.
10. Interpretation: Interpreting data science results can be challenging, particularly for non-technical
stakeholders who may not understand the underlying assumptions and methods used.
In another view, Dave Holtz blogs about specific skill sets desired by various positions to which
a data scientist may apply. He lists basic types of data science jobs:
1. A Data Scientist Is a Data Analyst Who Lives in San Francisco! Holtz notes that, for some
companies, a data scientist and a data analyst are synonymous. These roles are typically entry-level
and will work with pre-existing tools and applications that require the basic skills to
retrieve, wrangle, and visualize data. These digital tools may include MySQL databases and
advanced functions within Excel such as pivot tables and basic data visualizations (e.g., line and
bar charts). Additionally, the data analyst may perform the analysis of experimental testing results
or manage other pre-existing analytical toolboxes such as Google Analytics or Tableau. Holtz
further notes that, “jobs such as these are excellent entry-level positions, and may even allow a
budding data scientist to try new things and expand their skillset.”
2. Please Wrangle Our Data! Companies will discover that they are drowning in data and need
someone to develop a data management system and infrastructure that will house the enormous
(and growing) dataset, and create access to perform data retrieval and analysis. “Data engineer”
and “data scientist” are the typical job titles you will find associated with this type of required
skill set and experience. In these scenarios, a candidate will likely be one of the company’s first
data hires and thus this person should be able to do the job without significant statistics or
machine-learning expertise. A data scientist with a software engineering background might excel
at a company like this, where it is more important that they make meaningful contributions
to the production code and provide basic insights and analyses. Mentorship
opportunities for junior data scientists may be less plentiful at a company like this. As a result, an
associate will have great opportunities to shine and grow via trial by fire, but there will be less
guidance and a greater risk of flopping or stagnating.
3. We Are Data. Data Is Us. There are a number of companies for whom their data (or their data
analysis platform) is their product. These environments offer intense data analysis or machine
learning opportunities. Ideal candidates will likely have a formal mathematics, statistics, or
physics background and hope to continue down a more academic path. Data scientists at these
types of firms would focus more on producing data-driven products than answering operational
corporate questions. Companies that fall into this group include consumer-facing organizations
with massive amounts of data and companies that offer a data-based service.
4. Reasonably Sized Non-Data Companies Who Are Data-Driven. This categorizes many
modern businesses. This type of role involves joining an established team of other data scientists.
The company evaluates data but is not entirely concerned about data. Its data scientists perform
analysis, touch production code, visualize data, etc. These companies are either looking for
generalists or they are looking to fill a specific niche where they feel their team is lacking, such
as data visualization or machine learning. Some of the more important skills when interviewing
at these firms are familiarity with tools designed for “big data” (e.g., Hive or Pig), and experience
with messy, real-life datasets.
A couple of sections ago, we discussed what kind of skills one needs to have to be a successful
data scientist. We also know by now that a lot of what data scientists do involves processing data
and deriving insights. An example was given above, along with a hands-on practice problem.
These things should at least give you an idea of what you may expect to do in data science. Going
forward, it is important that you develop a solid foundation in statistical techniques and
computational thinking. And then you need to pick up a couple of programming and data processing
tools. A whole section of this book is devoted to such tools (Part II) and covers some of the most
used tools in data science – Python, R, and SQL. But let us quickly review these here so we
understand what to expect when we get to those chapters.
Let me start by noting that there are no special tools for doing data science; there just
happen to be some tools that are more suitable for the kind of things one does in data science. And
so, if you already know some programming language (e.g., C, Java, PHP) or a scientific data
processing environment (e.g., Matlab), you could use them to solve many or most of the problems
and tasks in data science. Of course, if you go through this book, you would also find that Python
or R could generate a graph with one line of code – something that could take you a lot more effort
in C or Java. In other words, while Python or R were not specifically designed for people to do
data science, they provide excellent environments for quick implementation, visualization, and
testing for most of what one would want to do in data science – at least at the level in which we
are interested in this book.
Python is a scripting language. This means that programs written in Python do not need to
be compiled as a whole as you would do with a program in C or Java; instead, a Python
program runs line by line. The language (its syntax and structure) also
provides a very easy learning curve for the beginner, while giving very powerful tools to advanced
programmers.
Let us see this with an example. If you want to write the classic “Hello, World” program
in Java, here is how it goes:
Step 1: Write the code and save it as HelloWorld.java.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}
Step 2: Compile and run the code.
% javac HelloWorld.java
% java HelloWorld
This should display “Hello, World” on the console. Do not worry if you have never done
Java (or any) programming before and all this looks confusing. I hope you can at least see that
printing a simple message on the screen is quite complicated (we have not even done any data
processing!).
In contrast, here is how you do the same in Python:
Step 1: Write the code and save it as hello.py.
print("Hello, World")
Step 2: Run the program.
% python hello.py
Again, do not worry about actually trying this now.
For now, at least you can appreciate how easy it is to code in Python. And if you want to
accomplish the same in R, you type the same – print("Hello, World") – in the R console. Both Python
and R offer a very easy introduction to programming, and even if you have never done any
programming before, it is possible to start solving data problems from day 1 of using either of
these. Both of them also offer plenty of packages that you can import or call into them to
accomplish more complex tasks such as machine learning (see Part III of this book). Most times
in this book we will see data available to us in simple text files formatted as CSV (comma-
separated values) and we can load up that data into a Python or R environment. However, such a
method has a major limit – the data we could store in a file or load in a computer’s memory cannot
be beyond a certain size. In such cases (and for some other reasons), we may need to use better
storage for the data in something called an SQL (Structured Query Language) database. The field of
databases is very rich with lots of tools, techniques, and methods for addressing
all kinds of data problems. We will, however, limit ourselves to working with SQL databases
through Python or R, primarily so that we could work with large and remote datasets.
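To give a concrete flavor of this before Part II, here is a minimal sketch; the tiny in-memory CSV and the table and column names are assumptions made purely for illustration.

import io
import sqlite3
import pandas as pd

# A tiny CSV held in memory stands in for a data file on disk
csv_text = "custid,age,income\n2848,22,52000\n2849,45,31000\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# The same rows pushed into, and queried back from, a SQLite database through Python
conn = sqlite3.connect(":memory:")
df.to_sql("customers", conn, index=False)
result = pd.read_sql_query("SELECT custid, age FROM customers WHERE income > 40000", conn)
conn.close()
print(result)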
In addition to these top three most used tools for data science (see Appendix G), we will
also skim through basic UNIX. Why? Because a UNIX environment allows one to solve many
data problems and day-to-day data processing needs without writing any code. After all, there is
no perfect tool that could address all our data science needs or meet all of our preferences and
constraints. And so, we will pick up several of the most popular tools in data science in this book,
while solving data problems using a hands-on approach.
Most commonly, structured data refers to highly organized information that can be
seamlessly included in a database and readily searched via simple search operations; whereas
unstructured data is essentially the opposite, devoid of any underlying structure. In structured data,
different values – whether they are numbers or something else – are labeled, which is not the case
when it comes to unstructured data. Let us look at these two types in more detail.
Structured Data: Structured data is the most important data type for us, as we will be using it for
most of the exercises in this book. Already we have seen it a couple of times. In the previous
chapter we discussed an example that included height and weight data. That example included
structured data because the data has defined fields or labels; we know “60” to be height and “120”
to be weight for a given record (which, in this case, is for one person). But structured data does
not need to be strictly numbers. Table 2.1 contains data about some customers. This data includes
numbers (age, income, num.vehicles), text (housing.type), Boolean type (is.employed), and
categorical data (sex, marital.stat). What matters for us is that any data we see here – whether it
is a number, a category, or a text – is labeled. In other words, we know what that number, category,
or text means. Pick a data point from the table – say, third row and eighth column. That is “22.”
We know from the structure of the table that that data is a number; specifically, it is the age of a
customer. Which customer? The one with the ID 2848 and who lives in Georgia. You see how
easily we could interpret and use the data since it is in a structured format? Of course, someone
would have to collect, store, and present the data in such a format, but for now we will not worry
about that.
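As a small illustration of why labels matter, structured data like this maps naturally onto a labeled pandas DataFrame; the values below are made up and only mimic the kind of customer table described above.

import pandas as pd

# A tiny, made-up version of the customer table: every value is labeled by a column
customers = pd.DataFrame({
    "custid": [2848, 2849],
    "sex": ["F", "M"],
    "is.employed": [True, False],
    "income": [52000, 31000],
    "marital.stat": ["Married", "Single"],
    "housing.type": ["Homeowner", "Rented"],
    "num.vehicles": [2, 1],
    "age": [22, 45],
    "state.of.res": ["Georgia", "Ohio"],
})

# Because every value is labeled, "22" is unambiguously the age of customer 2848
print(customers.loc[customers["custid"] == 2848, "age"])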
Unstructured Data: Unstructured data is data without labels. Here is an example: “It was found
that a female with a height between 65 inches and 67 inches had an IQ of 125–130. However, it
was not clear looking at a person shorter or taller than this observation if the change in IQ score
could be different, and, even if it was, it could not be possibly concluded that the change was
solely due to the difference in one’s height.” In this paragraph, we have several data points: 65,
67, 125–130, female. However, they are not clearly labeled. If we were to do some processing, as
we did in the first chapter to try to associate height and IQ, we would not be able to do that easily.
And certainly, if we were to create a systematic process (an algorithm, a program) to go through
such data or observations, we would be in trouble because that process would not be able to
identify which of these numbers corresponds to which of the quantities. Of course, humans have
no difficulty understanding a paragraph like this that contains unstructured data. But if we want
to do a systematic process for analyzing a large amount of data and creating insights from it, the
more structured it is, the better. As I mentioned, in this book for the most part we will work with
structured data. But at times when such data is not available, we will look to other ways to convert
unstructured data to structured data, or process unstructured data, such as text, directly.
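As one small, hedged illustration of converting unstructured text into structured data, a few regular expressions can pull the quantities out of a sentence like the one above; the patterns are assumptions tailored to this single sentence, not a general solution.

import re

text = ("It was found that a female with a height between 65 inches and "
        "67 inches had an IQ of 125-130.")

# Pull out the quantities with ad hoc patterns (assumptions that fit this sentence only)
height = re.search(r"between (\d+) inches and (\d+) inches", text)
iq = re.search(r"IQ of (\d+)-(\d+)", text)

record = {
    "sex": "female" if "female" in text else "male",
    "height_min_in": int(height.group(1)),
    "height_max_in": int(height.group(2)),
    "iq_min": int(iq.group(1)),
    "iq_max": int(iq.group(2)),
}
print(record)  # a small structured record extracted from free text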
reasonably structured to allow automated processing). Open data structures do not discriminate against
any person or group of persons and should be made available to the widest range of users for the
widest range of purposes, often by providing the data in multiple formats for consumption. To the
extent permitted by law, these formats should be non-proprietary, publicly available, and no
restrictions should be placed on their use.
• Described. Open data are described fully so that consumers of the data have sufficient
information to understand their strengths, weaknesses, analytical limitations, and security
requirements, as well as how to process them. This involves the use of robust, granular metadata (i.e.,
fields or elements that describe data), thorough documentation of data elements, data dictionaries, and,
if applicable, additional descriptions of the purpose of the collection, the population of interest, the
characteristics of the sample, and the method of data collection.
• Reusable. Open data are made available under an open license that places no restrictions on
their use.
• Complete. Open data are published in primary forms (i.e., as collected at the source), with the
finest possible level of granularity that is practicable and permitted by law and other requirements.
Derived or aggregate open data should also be published but must reference the primary data.
• Timely. Open data are made available as quickly as necessary to preserve the value of the data.
Frequency of release should account for key audiences and downstream needs.
• Managed Post-Release. A point of contact must be designated to assist with data use and to
respond to complaints about adherence to these open data requirements.
In this snippet, the first row mentions the variable names. The remaining rows each individually
represent one data point. It should be noted that, for some data points, values of all the columns may
not be available. The “Data Pre-processing” section later in this chapter describes how to deal with
such missing information.
An advantage of the CSV format is that it is more generic and useful when sharing with almost
anyone. Why? Because specialized tools to read or manipulate it are not required. Any spreadsheet
program such as Microsoft Excel or Google Sheets can readily open a CSV file and display it correctly
most of the time. But there are also several disadvantages. For instance, since the comma is used to
separate fields, if the data contains a comma, that could be problematic. This could be addressed by
escaping the comma (typically adding a backslash before that comma), but this remedy could be
frustrating because not everybody follows such standards.
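In practice, most CSV writers sidestep the problem by quoting fields rather than escaping commas; here is a minimal sketch with Python's csv module, using made-up field values.

import csv
import io

rows = [["id", "name", "city"],
        [1, "Gray, Jim", "San Francisco"],  # this name itself contains a comma
        [2, "Ada Lovelace", "London"]]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n").writerows(rows)
print(buf.getvalue())  # the comma-containing field is wrapped in quotes

# Reading it back recovers the original fields intact
for record in csv.reader(io.StringIO(buf.getvalue())):
    print(record)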
2. TSV (Tab-Separated Values) files are used for raw data and can be imported into and exported from
spreadsheet software. Tab-separated values files are essentially text files, and the raw data can be
viewed by text editors, though such files are often used when moving raw data between
spreadsheets. An example of a TSV file is shown below, along with the advantages and disadvantages
of this format. Suppose the registration records of all employees in an office are stored as follows:
An advantage of TSV format is that the delimiter (tab) will not need to be avoided because it is unusual
to have the tab character within a field. In fact, if the tab character is present, it may have to be
removed. On the other hand, TSV is less common than other delimited formats such as CSV.
3. XML (eXtensible Markup Language) was designed to be both human- and machine-readable, and can
thus be used to store and transport data. In the real world, computer systems and databases contain
data in incompatible formats. As the XML data is stored in plain text format, it provides a software-
and hardware-independent way of storing data. This makes it much easier to create data that can be
shared by different applications. XML has quickly become the default mechanism for sharing data
between disparate information systems. Currently, many information technology departments are
deciding between purchasing native XML databases and converting existing data from relational and
object-based storage to an XML model that can be shared with business partners. Here is an example
of a page of XML
If you have ever worked with HTML, then chances are this should look familiar. But as you can see,
unlike HTML, we are using custom tags such as <book> and <price>.
That means whosoever reads this will not be able to readily format or process it. But in
contrast to HTML, the markup data in XML is not meant for direct visualization. Instead, one could
write a program, a script, or an app that specifically parses this markup and uses it according to the
context. For instance, one could develop a website that runs in a Web browser and uses the above data
in XML, whereas someone else could write a different code and use this same data in a mobile app.
In other words, the data remains the same, but the presentation is different. This is one of the core
advantages of XML and one of the reasons XML is becoming quite important as we deal with multiple
devices, platforms, and services relying on the same data.
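A minimal sketch of how a program could parse such custom-tagged XML with Python's standard library; the <book>/<price> document below is an assumed stand-in for the example the text refers to.

import xml.etree.ElementTree as ET

# An assumed stand-in for the book catalog XML mentioned above
xml_data = """
<catalog>
  <book>
    <title>A Hands-on Introduction to Data Science</title>
    <price>49.99</price>
  </book>
  <book>
    <title>Think Stats</title>
    <price>39.99</price>
  </book>
</catalog>
"""

root = ET.fromstring(xml_data)
for book in root.findall("book"):
    # Each consumer (a website, a mobile app, ...) decides how to present these values
    print(book.find("title").text, "-", book.find("price").text)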
4. RSS (Really Simple Syndication) is a format used to share data between services, and which is
defined on top of the XML 1.0 standard. It facilitates the delivery of information from various sources on
the Web. Information provided by a website in an XML file in such a way is called an RSS feed. Most
current Web browsers can directly read RSS files, but a special RSS reader or aggregator may also be
used. The format of RSS follows XML standard usage but in addition defines the names of specific
tags (some required and some optional), and what kind of information should be stored in them. It was
designed to show selected data. So, RSS starts with the XML standard, and then further defines it so
that it is more specific. Let us look at a practical example of RSS usage. Imagine you have a website
that provides several updates of some information (news, stocks, weather) per day. To keep up with
this, and even to simply check if there are any updates, a user will have to continuously return to this
website throughout the day. This is not only time-consuming, but also unfruitful as the user may be
checking too frequently and encountering no updates, or, conversely, checking not often enough and
missing out on crucial information as it becomes available. Users can check your site faster using an
RSS aggregator (a site or program that gathers and sorts out RSS feeds). This aggregator will ensure
that it has the information as soon as the website provides it, and then it pushes that information out
to the user – often as a notification. Since RSS data is small and fast loading, it can easily be used with
services such as mobile phones, personal digital assistants (PDAs), and smart watches. RSS is useful
for websites that are updated frequently, such as
• News sites – Lists news with title, date and descriptions.
• Companies – Lists news and new products.
• Calendars – Lists upcoming events and important days.
• Site changes – Lists changed pages or new pages.
Do you want to publish your content using RSS? Here is a brief guideline on how to make it
happen. First, you need to register your content with RSS aggregator(s). To participate, first create an
RSS document and save it with an .xml extension (see example below). Then, upload the file to your
website. Finally, register with an RSS aggregator. Each day (or with a frequency you specify) the
aggregator searches the registered websites for RSS documents, verifies the link, and displays
information about the feed so clients can link to documents that interest them.
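Since the sample document itself is not reproduced here, the sketch below writes a minimal, assumed RSS 2.0 file with Python's standard library; the channel title, links, and descriptions are placeholders, not a real feed.

import xml.etree.ElementTree as ET

# Build a minimal, assumed RSS 2.0 document (all titles, links and text are placeholders)
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Example News Feed"
ET.SubElement(channel, "link").text = "https://example.com"
ET.SubElement(channel, "description").text = "Updates published several times a day"

item = ET.SubElement(channel, "item")
ET.SubElement(item, "title").text = "First headline"
ET.SubElement(item, "link").text = "https://example.com/first-headline"
ET.SubElement(item, "description").text = "Short summary of the story"

# Save with an .xml extension; this is the file you would upload and register with an aggregator
ET.ElementTree(rss).write("feed.xml", encoding="utf-8", xml_declaration=True)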
5. JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is not only easy for
humans to read and write, but also easy for machines to parse and generate. It is based on a subset of
the JavaScript Programming Language, Standard ECMA-262, 3rd Edition – December 1999. JSON
is built on two structures:
• A collection of name/value pairs. In various languages, this is realized as an object, record, struct,
dictionary, hash table, keyed list, or associative array.
• An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
When exchanging data between a browser and a server, the data can be sent only as text. JSON is
text, and we can convert any JavaScript object into JSON, and send JSON to the server. We can also
convert any JSON received from the server into JavaScript objects. This way we can work with the
data as JavaScript objects, with no complicated parsing and translations. Let us look at examples of
how one could send and receive data using JSON. Sending data: If the data is stored in a JavaScript
object, we can convert the object into JSON, and send it to a server. Below is an example
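The book's example uses JavaScript; as an assumed Python analogue, the standard json module plays the same role as JSON.stringify and JSON.parse.

import json

# "Sending": convert an in-memory object into JSON text that can travel over the network
employee = {"name": "Ravi", "age": 30, "skills": ["Python", "SQL"]}
payload = json.dumps(employee)
print(payload)

# "Receiving": parse JSON text coming back from a server into a native object
received = json.loads('{"status": "ok", "records": 2}')
print(received["status"], received["records"])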
Now that we have seen several formats of data storage and presentation, it is important to note that these
are by no means the only ways to do it, but they are some of the most preferred and commonly used ways.
Having familiarized ourselves with data formats, we will now move on with manipulating the data.
Data in the real world is often dirty; that is, it needs to be cleaned up before it can be used for a desired
purpose. This is often called data pre-processing. What makes data “dirty”? Here are some of the
factors that indicate that data is not clean or ready to process:
• Incomplete. When some of the attribute values are lacking, certain attributes of interest are lacking,
or attributes contain only aggregate data.
• Noisy. When data contains errors or outliers. For example, some of the data points in a dataset may
contain extreme values that can severely affect the dataset’s range.
• Inconsistent. Data contains discrepancies in codes or names. For example, if the “Name” column for
registration records of employees contains values other than alphabetical letters, or if records do not
start with a capital letter, discrepancies are present.
1.6.1 Data Cleaning
Since there are several reasons why data could be “dirty,” there are just as many ways to “clean” it.
For this discussion, we will look at three key methods that describe ways in which data may be
“cleaned,” or better organized, or scrubbed of potentially incorrect, incomplete, or duplicated
information.
suitable for a computer to understand. To accomplish this, there is no specific scientific method.
The approaches to take are all about manipulating or wrangling (or munging) the data to turn it
into something that is more convenient or desirable. This can be done manually, automatically,
or, in many cases, semi-automatically. Consider the following text recipe. “Add two diced
tomatoes, three cloves of garlic, and a pinch of salt in the mix.” This can be turned into a table
(Table 2.2). This table conveys the same information as the text, but it is more “analysis friendly.”
Of course, the real question is – How did that sentence get turned into the table? A not-so-
encouraging answer is “using whatever means necessary”! I know that is not what you want to
hear because it does not sound systematic. Unfortunately, often there is no better or systematic
method for wrangling. Not surprisingly, there are people who are hired to do specifically just this
– wrangle ill-formatted data into something more manageable.
Sometimes data may be in the right format, but some of the values are missing. Consider
a table containing customer data in which some of the home phone numbers are absent. This could
be due to the fact that some people do not have home phones – instead they use their mobile
phones as their primary or only phone.
Other times data may be missing due to problems with the process of collecting data, or
an equipment malfunction. Or, comprehensiveness may not have been considered important at the
time of collection. For instance, when we started collecting that customer data, it was limited to a
certain city or region, and so the area code for a phone number was not necessary to collect. Well,
we may be in trouble once we decide to expand beyond that city or region, because now we will
have numbers from all kinds of area codes.
Furthermore, some data may get lost due to system or human error while storing or
transferring the data.
So, what to do when we encounter missing data? There is no single good answer. We
need to find a suitable strategy based on the situation. Strategies to combat missing data include
ignoring that record, using a global constant to fill in all missing values, imputation, inference-
based solutions (Bayesian formula or a decision tree), etc. We will revisit some of these inference
techniques later in the book in chapters on machine learning and data mining.
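A minimal sketch of two of these strategies in pandas, using a made-up table with missing phone numbers and incomes: drop incomplete records, fill with a constant, or impute with the column mean.

import numpy as np
import pandas as pd

# A made-up customer table with some missing values
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Carla", "Dev"],
    "phone": ["555-1234", None, "555-9876", None],
    "income": [52000, np.nan, 61000, 48000],
})

print(df.dropna())                      # strategy 1: ignore incomplete records
print(df.fillna({"phone": "unknown"}))  # strategy 2: fill with a global constant
print(df.assign(income=df["income"].fillna(df["income"].mean())))  # simple mean imputation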
There are times when the data is not missing, but it is corrupted for some reason. This is,
in some ways, a bigger problem than missing data. Data corruption may be a result of faulty data
collection instruments, data entry problems, or technology limitations. For example, a digital
thermometer measures temperature to one decimal point (e.g., 70.1°F), but the storage system
ignores the decimal points. So, now we have 70.1°F and 70.9°F both stored as 70°F. This may not
seem like a big deal, but for humans a 99.4°F temperature means you are fine, and 99.8°F means
you have a fever, and if our storage system represents both of them as 99°F, then it fails to
differentiate between healthy and sick persons! Just as there is no single technique to take care of
missing data, there is no one way to remove noise, or smooth out the noisiness in the data.
However, there are some steps to try. First, you should identify or remove outliers. For example,
records of previous students who sat for a data science examination show all students scored
between 70 and 90 points, barring one student who received just 12 points. It is safe to assume
that the last student’s record is an outlier (unless we have a reason to believe that this anomaly is
really an unfortunate case for a student!). Second, you could try to resolve inconsistencies in the
data. For example, all entries of customer names in the sales data should follow the convention of
capitalizing all letters, and you could easily correct them if they are not.
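Both steps can be sketched in a few lines of pandas; the exam scores and customer names below are made up. The first part flags the obvious outlier, the second enforces one capitalization convention.

import pandas as pd

scores = pd.Series([78, 85, 90, 72, 88, 12], name="exam_score")
names = pd.Series(["alice", "BOB", "Carol"], name="customer_name")

# Step 1: identify outliers, e.g. values far outside the interquartile range
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = scores[(scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)]
print("Possible outliers:", outliers.tolist())  # the 12-point record stands out

# Step 2: resolve inconsistencies, e.g. one capitalization convention for names
print(names.str.upper().tolist())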
To be as efficient and effective for various data analyses as possible, data from various sources
commonly needs to be integrated. The following steps describe how to integrate multiple
databases or files.
1. Combine data from multiple sources into a coherent storage place (e.g., a single file or a
database).
2. Engage in schema integration, or the combining of metadata from different sources.
3. Detect and resolve data value conflicts. For example:
a. A conflict may arise, such as the presence of different attributes and values from
various sources for the same real-world entity.
b. Reasons for this conflict could be different representations or different scales; for example,
metric vs. British units.
4. Address redundant data in data integration. Redundant data is commonly generated in the
process of integrating multiple databases. For example:
a. The same attribute may have different names in different databases.
b. One attribute may be a “derived” attribute in another table; for example, annual revenue.
c. Correlation analysis may detect instances of redundant data.
The following five processes may be used for data transformation. For the time being, do
not worry if these seem too abstract. We will revisit some of them in the next section as we work
through an example of data pre-processing.
1. Smoothing: Remove noise from data.
2. Aggregation: Summarization, data cube construction.
3. Generalization: Concept hierarchy climbing.
4. Normalization: Scaled to fall within a small, specified range.
Some of the techniques that are used for accomplishing normalization (we will not be
covering them in depth here, but a brief sketch of the first two follows this list) are:
a. Min–max normalization.
b. Z-score normalization.
c. Normalization by decimal scaling.
5. Attribute or feature construction: new attributes constructed from the given ones.
Detailed explanation of all of these techniques is out of scope for this book, but later in this chapter
we will do a hands-on exercise to practice some of these in simpler forms.
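As promised above, here is a minimal sketch of min–max and z-score normalization on a made-up column of incomes.

import numpy as np

income = np.array([31000, 48000, 52000, 61000, 150000], dtype=float)

# Min–max normalization: rescale the values into the range [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization: center on the mean, scale by the standard deviation
z_scores = (income - income.mean()) / income.std()

print(np.round(min_max, 3))
print(np.round(z_scores, 3))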
Data reduction is a key process in which a reduced representation of a dataset that produces the
same or similar analytical results is obtained. One example of a large dataset that could warrant
reduction is a data cube. Data cubes are multidimensional sets of data that can be stored in a
spreadsheet. But do not let the name fool you. A data cube could be in two, three, or a higher
dimension. Each dimension typically represents an attribute of interest. Now, consider that you
are trying to make a decision using this multidimensional data. Sure, each of its attributes
(dimensions) provides some information, but perhaps not all of them are equally useful for a given
situation. In fact, often we could reduce information from all those dimensions to something much
smaller and manageable without losing much. This leads us to two of the most common techniques
used for data reduction.
1. Data Cube Aggregation. The lowest level of a data cube is the aggregated data for an individual
entity of interest. To do this, use the smallest representation that is sufficient to address the given
task. In other words, we reduce the data to its more meaningful size and structure for the task at
hand.
2. Dimensionality Reduction. In contrast with the data cube aggregation method, where the data
reduction was done with consideration of the task, the dimensionality reduction method works with
respect to the nature of the data. Here, a dimension or a column in your data spreadsheet is referred
to as a “feature,” and the goal of the process is to identify which features to remove or collapse into
a combined feature. This requires identifying redundancy in the given data and/or creating
composite dimensions or features that could sufficiently represent a set of raw features. Strategies
for reduction include sampling, clustering, principal component analysis, etc.
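A minimal sketch of one such strategy, principal component analysis with scikit-learn; the random feature table (with deliberately redundant columns) is a stand-in for real data.

import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a dataset with 10 features, half of them nearly redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 5:] = X[:, :5] + 0.05 * rng.normal(size=(200, 5))

# Collapse the 10 raw features into 3 composite features (principal components)
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (200, 3)
print(pca.explained_variance_ratio_)     # how much variance each component retains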
We are often dealing with data that are collected from processes that are continuous, such as
temperature, ambient light, and a company’s stock price. But sometimes we need to convert these
continuous values into more manageable parts. This mapping is called discretization. And as you
can see, in undertaking discretization, we are also essentially reducing data. Thus, this process of
discretization could also be perceived as a means of data reduction, but it holds particular
importance for numerical data. There are three types of attributes involved in discretization: nominal
(values from an unordered set), ordinal (values from an ordered set), and continuous (real numbers).
To achieve discretization, divide the range of continuous attributes into intervals. For instance, we
could decide to split the range of temperature values into cold, moderate, and hot, or the price of
company stock into above or below its market valuation.
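A minimal sketch of exactly this kind of binning with pandas; the temperature readings and cut points are made up.

import pandas as pd

temps = pd.Series([48, 61, 75, 83, 95], name="temperature_F")

# Map the continuous readings into three labeled intervals
labels = pd.cut(temps, bins=[-float("inf"), 60, 80, float("inf")],
                labels=["cold", "moderate", "hot"])
print(pd.DataFrame({"temperature_F": temps, "category": labels}))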
These two terms – data analysis and data analytics – are often used interchangeably and could be
confusing. Is a job that calls for data analytics really talking about data analysis and vice versa? Well,
there are some subtle but important differences between analysis and analytics. A lack of
understanding can affect the practitioner’s ability to leverage the data to their best advantage.
According to Dave Kasik, Boeing’s Senior Technical Fellow in visualization and interactive
techniques, “In my terminology, data analysis refers to hands-on data exploration and evaluation. Data
analytics is a broader term and includes data analysis as [a] necessary subcomponent. Analytics
defines the science behind the analysis. The science means understanding the cognitive processes an
analyst uses to understand problems and explore data in meaningful ways.”
One way to understand the difference between analysis and analytics is to think in terms of past
and future. Analysis looks backwards, providing marketers with a historical view of what has
happened. Analytics, on the other hand, models the future or predicts a result.
Analytics makes extensive use of mathematics and statistics and the use of descriptive techniques
and predictive models to gain valuable knowledge from data. These insights from data are used to
recommend action or to guide decision-making in a business context. Thus, analytics is not so much
concerned with individual analysis or analysis steps, but with the entire methodology.
There is no clear agreeable-to-all classification scheme available in the literature to categorize all
the analysis techniques that are used by data science professionals. However, based on their
application on various stages of data analysis, I have categorized analysis techniques into six classes
of analysis and analytics: descriptive analysis, diagnostic analytics, predictive analytics, prescriptive
analytics, exploratory analysis, and mechanistic analysis.
Descriptive analysis is about: “What is happening now based on incoming data.” It is a method for
quantitatively describing the main features of a collection of data. Here are a few key points about
descriptive analysis:
• Typically, it is the first kind of data analysis performed on a dataset.
• Usually it is applied to large volumes of data, such as census data.
• Description and interpretation processes are different steps.
Descriptive analysis can be useful in the sales cycle, for example, to categorize customers by their
likely product preferences and purchasing patterns. Another example is the Census Data Set, where
descriptive analysis is applied on a whole population.
Researchers and analysts collecting quantitative data or translating qualitative data into numbers
are often faced with a large amount of raw data that needs to be organized and summarized before it
can be analyzed. Data can only reveal patterns and allow observers to draw conclusions when it is
presented as an organized summary. Here is where descriptive statistics come into play: they facilitate
analyzing and summarizing the data and are thus instrumental to processes inherent in data science.
Data cannot be properly used if it is not
correctly interpreted. This requires appropriate statistics. For example, should we use the mean,
median, or mode, two of these, or all three? Each of these measures is a summary that emphasizes
certain aspects of the data and overlooks others. They all provide information we need to get a full
picture of the world we are trying to understand.
The process of describing something requires that we extract its important parts: to paint a scene,
an artist must first decide which features to highlight. Similarly, humans often point out significant
aspects of the world with numbers, such as the size of a room, the population of a State, or the
Scholastic Aptitude Test (SAT) score of a high-school senior. Nouns name these things or
characteristics: areas, populations, and verbal learning abilities. To describe these features, English
speakers use adjectives, for example, decent-sized room, small-town population, bright high-school
senior. But numbers can replace these words: 100 sq. ft. room, Florida population of 18,801,318, or a
senior with a verbal score of 800.
Numerical representation can hold a considerable advantage over words. Numbers allow humans
to more precisely differentiate between objects or concepts. For example, two rooms may be described
as “small,” but numbers distinguish a 9-foot expanse from a 10-foot expanse. One could argue that
even imperfect measuring instruments afford more levels of differentiation than adjectives. And, of
course, numbers can modify words by providing a count of units (2500 persons), indicating a rank, or
placing the characteristics on some scale (SAT score of 800, with a mean of 600).
• Variables
• Frequency Distribution
• Measures of Centrality
• Dispersion of a Distribution
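A minimal sketch of these descriptive measures on a made-up sample of verbal scores.

import pandas as pd

scores = pd.Series([600, 650, 650, 700, 720, 800], name="verbal_score")

print("Frequency distribution:")
print(scores.value_counts().sort_index())
print("Mean:", scores.mean())          # measures of centrality
print("Median:", scores.median())
print("Mode:", scores.mode().tolist())
print("Variance:", scores.var())       # dispersion of the distribution
print("Std. deviation:", scores.std())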
1.10 Predictive Analytics
As you may have guessed, predictive analytics has its roots in our ability to predict what might happen.
These analytics are about understanding the future using the data and the trends we have seen in the
past, as well as emerging new contexts and processes. An example is trying to predict how people will
spend their tax refunds based on how consumers normally behave around a given time of the year
(past data and trends), and how a new tax policy (new context) may affect people’s refunds.
Predictive analytics provides companies with actionable insights based on data. Such information
includes estimates about the likelihood of a future outcome. It is important to remember that no
statistical algorithm can “predict” the future with 100% certainty because the foundation of predictive
analytics is based on probabilities. Companies use these statistics to forecast what might happen. Some
of the software most commonly used by data science professionals for predictive analytics are SAS
predictive analytics, IBM predictive analytics, RapidMiner, and others.
As Figure 3.11 suggests, predictive analytics is done in stages.
1. First, once the data collection is complete, it needs to go through the process of cleaning.
2. Cleaned data can help us obtain hindsight into relationships between different variables. Plotting the
data (e.g., on a scatterplot) is a good place to look for hindsight.
3. Next, we need to confirm the existence of such relationships in the data. This is where regression
comes into play. From the regression equation, we can confirm the pattern of distribution inside the
data. In other words, we obtain insight from hindsight.
4. Finally, based on the identified patterns, or insight, we can predict the future, i.e., foresight.
The following example illustrates a use for predictive analytics. Let us assume that Salesforce
kept campaign data for the last eight quarters. This data comprises total sales generated by
newspaper, TV, and online ad campaigns and associated expenditures, as provided in Table 3.4.
With this data, we can predict the sales based on the expenditures of ad campaigns in different
media for Salesforce.
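Since Table 3.4 itself is not reproduced here, the sketch below uses made-up quarterly figures of the same shape (spend on newspaper, TV, and online ads versus total sales) to show what the regression and prediction steps would look like.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up expenditures (newspaper, TV, online) for eight quarters, and made-up total sales
X = np.array([[10, 40, 25], [12, 38, 30], [8, 45, 28], [15, 50, 35],
              [11, 42, 32], [9, 39, 27], [14, 48, 36], [13, 44, 33]])
sales = np.array([220, 235, 230, 270, 250, 225, 268, 255])

model = LinearRegression().fit(X, sales)   # insight: how each medium relates to sales
print("Coefficients:", model.coef_)

# Foresight: predicted sales for a planned campaign budget
print("Predicted sales:", model.predict([[12, 46, 34]]))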
Like data analytics, predictive analytics has a number of common applications. For example,
many people turn to predictive analytics to produce their credit scores. Financial services use such
numbers to determine the probability that a customer will make their credit payments on time.
FICO, in particular, has extensively used predictive analytics to develop the methodology to
calculate individual FICO scores.
Customer relationship management (CRM) is another common area for predictive
analytics. Here, the process contributes to objectives such as marketing campaigns, sales, and
customer service. Predictive analytics applications are also used in the healthcare field. They can
determine which patients are at risk for developing certain conditions such as diabetes, asthma,
and other chronic or serious illnesses.
Prescriptive analytics is the area of business analytics dedicated to finding the best course
of action for a given situation. This may start by first analyzing the situation (using descriptive
analysis), but then moves toward finding connections among various parameters/variables, and
their relation to each other to address a specific problem, more likely that of prediction.
A process-intensive task, the prescriptive approach analyzes potential decisions, the
interactions between decisions, the influences that bear upon these decisions, and the bearing all
of this has on an outcome to ultimately prescribe an optimal course of action in real time.
Prescriptive analytics can also suggest options for taking advantage of a future opportunity or
mitigate a future risk and illustrate the implications of each. In practice,
prescriptive analytics can continually and automatically process new data to improve the
accuracy of predictions and provide advantageous decision options.
Specific techniques used in prescriptive analytics include optimization, simulation, game
theory, and decision-analysis methods.
Prescriptive analytics can be really valuable in deriving insights from given data, but it is
largely not used. According to Gartner, 13% of organizations are using predictive analytics, but
only 3% are using prescriptive analytics. Where big data analytics in general sheds light on a
subject, prescriptive analytics gives you laser-like focus to answer specific questions.
For example, in healthcare, we can better manage the patient population by using
prescriptive analytics to measure the number of patients who are clinically obese, then add filters
for factors like diabetes and LDL cholesterol levels to determine where to focus treatment.
There are two more categories of data analysis techniques that are different from the
above-mentioned four categories – exploratory analysis and mechanistic analysis.
UNIT–II: DATA EXTRACTION
Extracting meaning from data – Feature selection, User retention, Filters, Wrappers, Entropy, Decision
tree algorithm; Random forests, The dimensionality problem, Singular value decomposition, Principal
component analysis.
Feature selection:
Feature selection is a process that chooses a subset of features from the original features so that
the feature space is optimally reduced according to a certain criterion.
Feature selection is a critical step in the feature construction process. In text categorization
problems, some words simply do not appear very often. Perhaps the word “groovy” appears in exactly one
training document, which is positive. Is it really worth keeping this word around as a feature? It’s a
dangerous endeavor because it’s hard to tell with just one training example whether it is really correlated with
the positive class or whether it is just noise. You could hope that your learning algorithm is smart enough to figure
it out. Or you could just remove it.
The techniques for feature selection in machine learning can be broadly classified into the following
categories:
Supervised Techniques: These techniques can be used for labeled data and to identify the relevant features
for increasing the efficiency of supervised models like classification and regression. For Example- linear
regression, decision tree, SVM, etc.
Unsupervised Techniques: These techniques can be used for unlabeled data. For Example- K-Means
Clustering, Principal Component Analysis, Hierarchical Clustering, etc.
There are three categories of feature selection methods, depending on how they interact with the
classifier, namely, filter, wrapper, and embedded methods.
1. Instance-based approaches: There is no explicit procedure for feature subset generation. Many small
data samples are drawn from the data. Features are weighted according to their roles in differentiating
instances of different classes for a data sample. Features with higher weights can be selected.
2. Nondeterministic approaches: Genetic algorithms and simulated annealing are also used in feature
selection.
3. Exhaustive complete approaches: Branch and Bound evaluates estimated accuracy and ABB checks an
inconsistency measure that is monotonic. Both start with a full feature set until the preset bound cannot be
maintained.
While building a machine learning model for a real-life dataset, we come across a lot of features in
the dataset, and not all of these features are important every time. Adding unnecessary features while training
the model reduces the overall accuracy of the model, increases its complexity, decreases its
generalization capability, and can make the model biased. The saying
“sometimes less is better” applies to machine learning models as well. Hence, feature selection is one of
the important steps while building a machine learning model. Its goal is to find the best possible set of
features for building the model. The main categories of methods are listed below, with a brief sketch after the list:
• Filter methods
• Wrapper methods
• Embedded methods
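As a minimal sketch of the filter idea (the wrapper and embedded categories instead involve a model in the selection itself), scikit-learn's univariate selectors score each feature against the label; the synthetic dataset is an assumption for illustration.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic labeled data: 20 features, only a handful of them actually informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Filter method: score every feature independently and keep the top 5
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                                   # (300, 5)
print("Kept feature indices:", selector.get_support(indices=True))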
User retention is the number of users who continue interacting with your product over a given period.
We’ll discuss how to measure user retention in depth below, but understand two things:
You measure user retention over a defined period that you determine. For example, you can audit your user retention
rate based on a recurring monthly or quarterly schedule, or choose a specific time period to examine.
You can calculate user retention for your app at large or focus on specific features. If you’re measuring
user retention for your overall product, you can look at the number of logins over a period of time. For
specifics, focus on the number of users who interacted with a certain feature over your timeframe.
If you offer a paid subscription service, you’ll measure customer retention instead of user retention. It’s
also crucial to look at if you’re working off a freemium model — comparing your user retention rate with
your customer retention rate can offer insights into which features are the most engaging and are worth
paying for.
A high churn rate indicates your users don’t see the value in your product. High churn is also a red flag
for vulnerability — it means that your product isn’t meeting users’ needs, which shows there is an
opportunity for a competitor to creep into your customer base.
Importance of User Retention
User retention is a critical aspect of any business, regardless of size or industry. It refers to the ability of a
company to keep its existing customers engaged and satisfied with its products or services over an
extended period of time.
Acquiring new customers can be expensive due to marketing and advertising costs. Companies can reduce
these costs by retaining existing customers while still generating revenue. Retained users are more likely
to make repeat purchases, increasing the lifetime value of each customer. Satisfied users are also more
likely to recommend your product or service to others, leading to new customers and increased revenue.
Consistently providing a positive user experience can create strong brand loyalty, leading to long-term
relationships with your customers. Loyal customers are more likely to forgive minor mistakes, like a delay
in delivery or a minor product issue, and continue to support your business. They may also become
advocates for your brand by promoting it to others through word-of-mouth recommendations.
Retained users also provide valuable feedback on your company’s product or service. This feedback can
help identify areas for improvement and refine offerings to better meet the needs of the customers.
Companies that listen to their users and incorporate their feedback are more likely to build products or
services that better align with their customers’ needs.
1. User onboarding
During onboarding, first-time users sign up and acclimate to your product. Creating a smooth and
straightforward new user onboarding experience can lead users to the next phase more quickly, while a
complicated or confusing process will lose their interest. Onboarding is also your opportunity to point
users toward your most important features — those that are most likely to motivate them to use your
product regularly.
2. Activation
User activation is the phase when your users see your product’s value. This is their “aha!” moment —
when everything clicks and they see how your product can play a beneficial role in their day-to-day tasks.
The amount of time it takes a user to reach this moment is the time-to-value (TTV). Activated users feel
positively toward your product and are much more likely to stay retained.
3. Habit forming
In the third phase, users form a habit involving your product, making regular usage a part of their routine.
Users in this phase feel they need your product, and getting more people into the habit-forming stage will
mean better user retention rates. Offering ongoing guidance with in-app messaging can help your users
continue to discover new features and functionality as they move into this stage, create more sticky
products, and drive overall product adoption.
Filter Methods
Filter methods are generally used during the pre-processing step. They select features from the dataset
irrespective of any machine learning algorithm. In terms of computation, they are very fast and
inexpensive, and they are very good at removing duplicated, correlated and redundant features, but they do
not remove multicollinearity. Each feature is evaluated individually, which can help when features act in
isolation (have no dependency on other features) but falls short when a combination of features would
increase the overall performance of the model. Commonly used filter criteria include the following.
Information Gain – It is defined as the amount of information provided by a feature for identifying the
target value, and it measures the reduction in entropy. The information gain of each attribute is calculated
with respect to the target values for feature selection.
Chi-square test – The chi-square (χ²) method is generally used to test the relationship between categorical
variables. It compares the observed values of different attributes of the dataset with their expected values.
Fisher’s Score – Fisher’s Score selects each feature independently according to its score under the Fisher
criterion, which can lead to a suboptimal set of features. The larger the Fisher’s score, the better the selected
feature.
Correlation Coefficient – Pearson’s correlation coefficient quantifies the association between two
continuous variables and the direction of the relationship, with values ranging from -1 to 1.
Variance Threshold – This is an approach where all features whose variance doesn’t meet a specific
threshold are removed. By default, this method removes features having zero variance. The assumption
behind this method is that higher-variance features are likely to contain more information.
Mean Absolute Difference (MAD) – This method is similar to the variance threshold method, but without
the squaring: it calculates the mean absolute difference from the mean value.
Dispersion Ratio – Dispersion ratio is defined as the ratio of the Arithmetic mean (AM) to that of
Geometric mean (GM) for a given feature. Its value ranges from +1 to ∞ as AM ≥ GM for a given feature.
Higher dispersion ratio implies a more relevant feature.
Mutual Dependence – This method measures if two variables are mutually dependent, and thus provides
the amount of information obtained for one variable on observing the other variable. Depending on the
presence/absence of a feature, it measures the amount of information that feature contributes to making
the target prediction.
Relief – This method measures the quality of attributes by randomly sampling an instance from the dataset
and updating the weight of each feature based on the difference between the selected instance and its two
nearest instances, one from the same class and one from the opposite class.
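A minimal sketch of filter-style selection using scikit-learn is shown below; the iris dataset, the thresholds and the value of k are illustrative choices, not part of the original notes.

# Illustrative filter-method sketch using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Variance threshold: drop features whose variance is below 0.2 (arbitrary threshold).
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-square test: keep the 2 features most associated with the target
# (chi2 requires non-negative feature values, which holds for iris).
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Mutual information (mutual dependence): keep the 2 most informative features.
X_mi = SelectKBest(score_func=mutual_info_classif, k=2).fit_transform(X, y)

print(X_var.shape, X_chi2.shape, X_mi.shape)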
Wrapper Methods
Wrapper methods, also referred to as greedy algorithms, train a model using a subset of features in an
iterative manner. Based on the conclusions drawn from the previous round of training, features are added or
removed. The stopping criteria for selecting the best subset are usually pre-defined by the person training
the model, for example when the performance of the model decreases or a specific number of features has
been reached. The main advantage of wrapper methods over filter methods is that they provide an optimal
set of features for training the model, thus resulting in better accuracy than filter methods, but they are
computationally more expensive.
Forward selection – This is an iterative approach where we initially start with an empty set of features
and keep adding the feature which best improves the model after each iteration. The process stops when
adding a new variable no longer improves the performance of the model.
Backward elimination – This is also an iterative approach where we initially start with all features and
after each iteration remove the least significant feature. The process stops when no further improvement in
the performance of the model is observed after removing a feature.
Bi-directional elimination – This method uses the forward selection and backward elimination techniques
simultaneously to reach one unique solution.
Exhaustive selection – This technique is the brute-force approach to evaluating feature subsets. It creates
all possible subsets, builds a learning algorithm for each subset, and selects the subset whose model
performs best.
Recursive elimination – This greedy optimization method selects features by recursively considering
smaller and smaller sets of features. An estimator is trained on the initial set of features and their importance
is obtained from an importance attribute (such as feature_importances_ or coef_). The least important
features are then removed from the current set of features until we are left with the required number of
features.
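A minimal sketch of wrapper-style selection with scikit-learn follows; the breast-cancer dataset, the logistic-regression estimator and the target of 5 features are assumptions made for illustration. SequentialFeatureSelector corresponds to forward selection and backward elimination, and RFE to recursive elimination.

# Illustrative wrapper-method sketch using scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

# Forward selection: start from an empty set and add the best feature each round.
forward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                    direction="forward").fit(X, y)

# Backward elimination: start from all features and drop the least useful one each round.
backward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                     direction="backward").fit(X, y)

# Recursive feature elimination: repeatedly drop the least important features.
rfe = RFE(estimator, n_features_to_select=5).fit(X, y)

print(forward.get_support(), backward.get_support(), rfe.support_)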
Embedded Methods
In embedded methods, the feature selection algorithm is blended into the learning algorithm itself, which
therefore has its own built-in feature selection. Embedded methods overcome the drawbacks of filter and
wrapper methods and merge their advantages: they are fast like filter methods, more accurate than filter
methods, and take combinations of features into consideration.
Tree-based methods – Methods such as Random Forest and Gradient Boosting provide feature importance
scores that can be used to select features. Feature importance tells us which features have the greatest
impact on the target feature.
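A minimal sketch of embedded-style selection driven by tree-based feature importance is given below; the dataset, the forest size and the median threshold are illustrative assumptions.

# Illustrative embedded-method sketch: tree-based feature importance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(forest.feature_importances_)          # importance score of each feature

# Keep only the features whose importance exceeds the median importance.
selector = SelectFromModel(forest, threshold="median", prefit=True)
X_selected = selector.transform(X)
print(X_selected.shape)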
Decision Trees
A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (e.g.
whether a coin flip comes up heads or tails), each leaf node represents a class label (the decision taken after
evaluating all features along the path) and branches represent conjunctions of features that lead to those
class labels. The paths from root to leaf represent classification rules. The diagram below illustrates the
basic flow of a decision tree for decision making with the labels Rain (Yes) and No Rain (No).
A decision tree is one of the predictive modelling approaches used in statistics, data mining and machine
learning.
Decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on
different conditions. They are one of the most widely used and practical methods for supervised learning.
Decision trees are a non-parametric supervised learning method used for both classification and
regression tasks.
Tree models where the target variable can take a discrete set of values are called classification trees.
Decision trees where the target variable can take continuous values (typically real numbers) are called
regression trees. Classification And Regression Tree (CART) is the general term for both.
Data Format
The dependent variable, Y, is the target variable that we are trying to understand, classify or generalize.
The vector x is composed of the features, x1, x2, x3 etc., that are used for that task.
Example
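The worked example in the original notes appears as a figure; a small, entirely hypothetical dataset in the same Y-versus-features format could look like this (the feature names and values are stand-ins, not the original example).

# Hypothetical example of the data format: features x1..x3 and a target Y.
import pandas as pd

data = pd.DataFrame({
    "x1": ["Sunny", "Cloudy", "Overcast", "Cloudy"],  # illustrative feature
    "x2": [85, 70, 83, 96],                           # illustrative feature
    "x3": [False, True, False, True],                 # illustrative feature
    "Y":  ["No", "Yes", "No", "Yes"],                 # target: Rain (Yes) / No Rain (No)
})
print(data)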
Approach to make decision tree
While building a decision tree, we ask a different type of question at each node of the tree. Based on the
question asked, we calculate the corresponding information gain.
Information Gain
Information gain is used to decide which feature to split on at each step in building the tree. Simplicity is
best, so we want to keep our tree small. To do so, at each step we should choose the split that results in
the purest daughter nodes. A commonly used measure of purity is called information. For each node of
the tree, the information value measures how much information a feature gives us about the class. The
split with the highest information gain will be taken as the first split, and the process will continue until all
child nodes are pure or until the information gain is 0.
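A minimal sketch of how entropy and information gain can be computed for a candidate split is shown below; the class labels and the split itself are hypothetical.

# Illustrative entropy / information-gain computation.
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - weighted sum of H(child) over the split's children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split: 10 labels divided into two daughter nodes.
parent = ["Yes"] * 5 + ["No"] * 5
left, right = ["Yes"] * 4 + ["No"], ["Yes"] + ["No"] * 4
print(information_gain(parent, [left, right]))  # > 0, so the split is informative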
Question
Now we will try to partition the dataset based on the question asked. The data is divided into two subsets at
each step.
Algorithms for constructing decision trees usually work top-down, choosing at each step the variable that
best splits the set of items. Different algorithms use different metrics for measuring the “best” split.
Gini Impurity
First let’s understand the meaning of Pure and Impure.
Pure
Pure means that in a selected sample of the dataset, all data belongs to the same class.
Impure
Impure means that the data is a mixture of different classes.
If our dataset is pure, then the likelihood of incorrect classification is 0. If our sample is a mixture of
different classes, then the likelihood of incorrect classification will be high.
Calculating Gini Impurity.
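The worked calculation in the original notes appears as a figure; the standard formula is Gini = 1 - sum(p_i ** 2), and a small sketch with hypothetical labels is given below.

# Illustrative Gini impurity computation: Gini = 1 - sum(p_i ** 2).
from collections import Counter

def gini_impurity(labels):
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity(["Yes", "Yes", "Yes", "Yes"]))   # 0.0 -> pure node
print(gini_impurity(["Yes", "Yes", "No", "No"]))     # 0.5 -> maximally impure for two classes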
Example
Now build the decision tree recursively at each node, based on the steps discussed above.
Output
From the output above we can see that at each step the data is divided into True and False rows. This process
keeps repeating until we reach a leaf node where the information gain is 0 and further splitting of the data is
not possible because the nodes are pure.
Drawbacks of Decision Trees
• Prone to overfitting.
• Require some kind of measurement as to how well they are doing.
• Need to be careful with parameter tuning.
• Can create biased learned trees if some classes dominate.
Pruning or Post-pruning
Overfitting is one of the major problems for every model in machine learning. If a model is overfitted,
it will generalize poorly to new samples. To keep a decision tree from overfitting, we remove the
branches that make use of features having low importance. This method is called pruning or post-
pruning. This way we reduce the complexity of the tree and hence improve predictive accuracy by
reducing overfitting.
Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured
by a cross-validation set. There are two major pruning techniques:
• Minimum Error: The tree is pruned back to the point where the cross-validated error is a
minimum.
• Smallest Tree: The tree is pruned back slightly further than the minimum error. Technically
the pruning creates a decision tree with cross-validation error within 1 standard error of the
minimum error.
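A minimal sketch of cost-complexity (post-)pruning with scikit-learn follows; the dataset and the way the pruning strength is chosen (best cross-validated accuracy, roughly the "minimum error" strategy above) are illustrative assumptions, not a prescribed procedure.

# Illustrative post-pruning sketch: cost-complexity pruning in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate pruning strengths (alphas) for this training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha whose pruned tree has the best cross-validated accuracy.
scores = [(cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                           X, y, cv=5).mean(), a)
          for a in path.ccp_alphas]
best_score, best_alpha = max(scores)

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, pruned.get_n_leaves())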
Early Stopping or Pre-pruning
An alternative method to prevent overfitting is to try and stop the tree-building process early,
before it produces leaves with very small samples. This heuristic is known as early stopping but is
also sometimes known as pre-pruning decision trees.
At each stage of splitting the tree, we check the cross-validation error. If the error does not decrease
significantly enough, then we stop. Early stopping may underfit by stopping too early: the current split may
be of little benefit, but subsequent splits made after it might reduce the error more significantly.
Early stopping and pruning can be used together, separately, or not at all. Post-pruning decision
trees is more mathematically rigorous, finding a tree at least as good as early stopping. Early stopping
is a quick-fix heuristic. If used together with pruning, early stopping may save time; after all, why
build a tree only to prune it back again?
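A minimal sketch of early stopping (pre-pruning) via scikit-learn's growth-limiting parameters is shown below; the dataset and the parameter values are arbitrary illustrations.

# Illustrative pre-pruning sketch: stop tree growth early via hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

pre_pruned = DecisionTreeClassifier(
    max_depth=4,                 # do not grow the tree deeper than 4 levels
    min_samples_leaf=10,         # every leaf must contain at least 10 samples
    min_impurity_decrease=0.01,  # only split if impurity drops by at least 0.01
    random_state=0,
).fit(X, y)

print(pre_pruned.get_depth(), pre_pruned.get_n_leaves())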
Random Forest
A random forest consists of multiple random decision trees. Two types of randomness are built into the
trees. First, each tree is built on a random sample from the original data. Second, at each tree node, a subset
of features is randomly selected to generate the best split.
We use the dataset below to illustrate how to build a random forest tree. Note that Class = XOR(X1, X2),
and X3 is made identical to X2 (for illustrative purposes in later sections).
The same process is applied to build multiple trees. The figure below illustrates the flow of applying a
random forest with three trees to a testing data instance.
https://www.slideshare.net/m80m07/random-forest
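The table and figure referenced above are not reproduced here; a small sketch in the same spirit (Class = XOR(X1, X2), with X3 a copy of X2) using scikit-learn's RandomForestClassifier is given below. The randomly generated rows, the forest size and max_features value are assumptions for illustration, not the exact example from the slides.

# Illustrative random-forest sketch on an XOR-style dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X1 = rng.integers(0, 2, size=200)
X2 = rng.integers(0, 2, size=200)
X3 = X2.copy()                           # X3 is identical to X2
X = np.column_stack([X1, X2, X3])
y = np.logical_xor(X1, X2).astype(int)   # Class = XOR(X1, X2)

# Each tree sees a bootstrap sample of the rows, and a random subset of
# features is considered at every split (max_features controls the subset size).
forest = RandomForestClassifier(n_estimators=3, max_features=2,
                                bootstrap=True, random_state=0).fit(X, y)

test = np.array([[1, 0, 0]])             # a testing data instance
print(forest.predict(test))              # XOR(1, 0) = 1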
The Dimensionality Problem
The dimensionality problem arises when gathering a huge amount of data: many of the resulting
dimensions may be highly noisy, carry little information, and add no significant benefit. The curse of
dimensionality refers to what happens when you add more and more variables to a multivariate model.
The more dimensions you add to a data set, the more difficult it becomes to predict certain quantities. When
the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
Singular Value Decomposition (SVD)
The Singular Value Decomposition (SVD) of a matrix is a factorization of that matrix into three matrices.
It has some interesting algebraic properties and conveys important geometrical and theoretical insights
about linear transformations. It also has some important applications in data science.
The linear algebra essential to data science, machine learning, and artificial intelligence is often
overlooked as most introductory courses fail to display the big picture. Concepts such as eigen
decomposition and singular value decomposition (SVD) are incredibly important from a practitioner's
standpoint; they are the core of dimensionality reduction techniques including principal component
analysis (PCA) and latent semantic analysis (LSA). This article aims to exhibit SVD by gently introducing
the mathematics required in tandem with tangible Python code.
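A minimal sketch of computing an SVD with NumPy is shown below; the matrix A is an arbitrary illustration.

# Illustrative SVD sketch: A = U @ diag(S) @ Vt.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, S, Vt.shape)

# Reconstruct A from the three factors to confirm the decomposition.
print(np.allclose(A, U @ np.diag(S) @ Vt))  # True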
Matrix Multiplication
To start, let’s consider the following vector, x, as the sum of two basis vectors i and j.
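The vector from the original illustration is not reproduced here; a hypothetical stand-in, with arbitrarily chosen coefficients, expressed with NumPy:

# Hypothetical sketch: a vector x written as a sum of the basis vectors i and j,
# and a matrix-vector product viewed as a linear transformation of x.
import numpy as np

i = np.array([1.0, 0.0])          # basis vector i
j = np.array([0.0, 1.0])          # basis vector j
x = 3 * i + 2 * j                 # x = 3i + 2j (coefficients are illustrative)

A = np.array([[2.0, 0.0],         # an arbitrary transformation matrix
              [0.0, 0.5]])
print(x, A @ x)                   # A stretches the i-component and shrinks the j-component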
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a technique that transforms high-dimensional data into lower
dimensions while retaining as much information as possible.
The Principal Component Analysis (PCA) technique was introduced by the mathematician Karl Pearson in
1901. It works on the condition that while the data in a higher-dimensional space is mapped to data in a
lower-dimensional space, the variance of the data in the lower-dimensional space should be maximum.
The original 3-dimensional data set: the red, blue and green arrows are the directions of the first, second
and third principal components.
• PCA is an unsupervised learning algorithm technique used to examine the interrelations among a
set of variables. It is also known as a general factor analysis where regression determines a line
of best fit.
• The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset
while preserving the most important patterns or relationships between the variables without any
prior knowledge of the target variables.
• PCA is a technique for dimensionality reduction that identifies a set of orthogonal axes, called
principal components, that capture the maximum variance in the data. The principal components
are linear combinations of the original variables in the dataset and are ordered in decreasing order
of importance. The total variance captured by all the principal components is equal to the total
variance in the original dataset.
• The first principal component captures the most variation in the data, while the second principal
component captures the maximum variance that is orthogonal to the first principal component,
and so on.
• PCA can be used for a variety of purposes, including data visualization, feature selection, and data
compression. In data visualization, PCA can be used to plot high-dimensional data in two or three
dimensions, making it easier to interpret. In feature selection, PCA can be used to identify the
most important variables in a dataset. In data compression, PCA can be used to reduce the size of
a dataset without losing important information.
• In PCA, it is assumed that the information is carried in the variance of the features, that is, the
higher the variation in a feature, the more information that features carries.
Overall, PCA is a powerful tool for data analysis and can help to simplify complex
datasets, making them easier to understand and work with.
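A minimal sketch of PCA with scikit-learn, reducing the 4-dimensional iris data to 2 principal components, is given below; the dataset, the standardization step and the component count are illustrative choices.

# Illustrative PCA sketch: project data onto its first two principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # PCA is variance-based, so standardize first

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                        # (150, 2)
print(pca.explained_variance_ratio_)     # share of total variance captured by each component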