CH 4
Tables which contain numeric or alphanumeric data. But this leads to a very critical dilemma: are these datasets accessible to all? Should these databases be accessible to all? What are the various sources of data from which we can gather such databases? Let's find out!
Sources of Data
There exist various sources of data from which we can collect any type of data required. The data collection process can be categorised in two ways: Offline and Online.

Offline Data Collection: Sensors, Surveys, Interviews, Observations
Online Data Collection: Open-sourced Government Portals, Reliable Websites (Kaggle), World Organisations' open-sourced statistical websites
While accessing data from any of these sources, the following points should be kept in mind:
1. Only data which is available for public usage should be taken up.
2. Personal datasets should only be used with the consent of the owner.
3. One should never breach someone’s privacy to collect data.
4. Data should only be taken from reliable sources, as data collected from random sources can be wrong or unusable.
5. Reliable sources of data ensure the authenticity of data which helps in proper training of the
AI model.
Types of Data
For Data Science, usually the data is collected in the form of tables. These tabular datasets can be
stored in different formats. Some of the commonly used formats are:
1. CSV: CSV stands for Comma Separated Values. It is a simple file format used to store tabular data. Each line of the file is a data record, and each record consists of one or more fields separated by commas. Since the values of the records are separated by commas, these files are known as CSV files.
2. Spreadsheet: A spreadsheet is a piece of paper or a computer program used for accounting and recording data in rows and columns into which information can be entered. Microsoft Excel is a program which helps in creating spreadsheets.
3. SQL: SQL stands for Structured Query Language. It is a domain-specific programming language designed for managing data held in different kinds of DBMS (Database Management Systems). It is particularly useful in handling structured data.
A lot of other database formats also exist; you can explore them online!
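As a quick sketch of the CSV idea above, the snippet below reads a tiny, made-up table with Python's built-in csv module. The file contents, names, and column headers are invented purely for illustration.

```python
import csv
import io

# In a real project this text would live in a file such as marks.csv;
# each line is a record, and fields are separated by commas
csv_text = """name,maths,science
Asha,78,85
Ravi,64,72
Meena,91,88"""

# DictReader maps each row to a dictionary keyed by the header line
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

print(records[0]["name"])   # Asha
print(len(records))         # 3
```

To read from an actual file, `io.StringIO(csv_text)` would be replaced with `open("marks.csv")`.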
Data Access
After collecting the data, to be able to use it for programming purposes, we should know how to access it in a Python code. To make our lives easier, there exist various Python packages which help us access structured data (in tabular form) inside the code. Let us take a look at some of these packages:
NumPy
NumPy, which stands for Numerical Python, is the fundamental package for mathematical and logical operations on arrays in Python. It is a commonly used package when it comes to working with numbers. NumPy offers a wide range of arithmetic operations on numbers, giving us an easier approach to working with them. NumPy also works with arrays, which are homogeneous collections of data.

An array is a set of multiple values of the same datatype. These can be numbers, characters, booleans, etc., but only one datatype can be stored in an array. In NumPy, the arrays used are known as ND-arrays (N-Dimensional Arrays), as NumPy comes with the feature of creating n-dimensional arrays in Python.
An array can easily be compared to a list. Let us take a look at how they are different:
NumPy Arrays
1. Homogeneous collection of data.
2. Can contain only one type of data, hence not flexible with datatypes.
3. Cannot be directly initialised; can be operated on with the NumPy package only.
4. Direct numerical operations can be done. For example, dividing the whole array by 3 divides every element by 3.
5. Widely used for arithmetic operations.
6. Arrays take less memory space.
7. Functions like concatenation, appending, reshaping, etc. are not trivially possible with arrays.
8. Example: To create a NumPy array 'A':
import numpy
A = numpy.array([1,2,3,4,5,6,7,8,9,0])

Lists
1. Heterogeneous collection of data.
2. Can contain multiple types of data, hence flexible with datatypes.
3. Can be directly initialised as lists are a part of Python syntax.
4. Direct numerical operations are not possible. For example, dividing the whole list by 3 does not divide every element by 3.
5. Widely used for data management.
6. Lists acquire more memory space.
7. Functions like concatenation, appending, reshaping, etc. are trivially possible with lists.
8. Example: To create a list 'A':
A = [1,2,3,4,5,6,7,8,9,0]
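The difference in direct numerical operations described in the comparison above can be seen in a short sketch; the numbers used here are arbitrary.

```python
import numpy as np

A = np.array([3, 6, 9, 12])   # NumPy array: homogeneous, supports vector maths
L = [3, 6, 9, 12]             # plain Python list

# Dividing the whole array by 3 divides every element by 3
print(A / 3)                  # [1. 2. 3. 4.]

# The same operation on a list raises a TypeError, so a list
# needs an explicit loop or comprehension instead
print([x / 3 for x in L])     # [1.0, 2.0, 3.0, 4.0]
```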
Pandas
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.
Pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational/statistical data sets; the data need not be labelled at all to be placed into a Pandas data structure
* Images shown here are the property of individual organisations and are used here for reference
purpose only.
The two primary data structures of Pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd-party libraries.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
• Size mutability: columns can be inserted and deleted from DataFrame and higher-dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
• Intuitive merging and joining of data sets
• Flexible reshaping and pivoting of data sets
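As a small illustrative sketch of two of the features above (missing-data handling and size mutability), the snippet below builds a tiny DataFrame; the student names and marks are invented.

```python
import pandas as pd
import numpy as np

# A tiny DataFrame with one missing value, represented as NaN
df = pd.DataFrame({
    "student": ["Asha", "Ravi", "Meena"],
    "marks": [78, np.nan, 91],
})

print(df["marks"].isna().sum())   # 1 -> one missing entry detected
print(df["marks"].mean())         # 84.5 -> NaN is ignored by default

# Size mutability: a new column can be inserted directly
df["passed"] = df["marks"] > 40
```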
Matplotlib*
Matplotlib is an amazing visualisation library in Python for 2D plots of arrays. It is a multi-platform data visualisation library built on NumPy arrays. One of the greatest benefits of visualisation is that it gives us visual access to huge amounts of data in easily digestible form. Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns, and make correlations. They are typically instruments for reasoning about quantitative information. Some types of graphs that we can make with this package are listed below.

Not just plotting: you can also modify your plots the way you wish. You can stylise them and make them more descriptive and communicable.
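A minimal sketch of creating and styling a plot with Matplotlib might look like the following; the temperature readings are invented sample data, and the file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, safe on any machine
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
temperature = [30, 32, 31, 35, 33]

plt.plot(days, temperature, marker="o")
plt.xlabel("Day")                       # labels and titles make the plot
plt.ylabel("Temperature (deg C)")       # more descriptive and communicable
plt.title("Temperature over a week")
plt.savefig("temperature.png")
```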
These packages help us in accessing the datasets we have and also in exploring them to develop a better understanding of them.
Do you remember using these formulas in your class? Let us recall all of them here:
1. What is Mean? How is it calculated?
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
2. What is Median? How is it calculated?
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
3. What is Mode? How is it calculated?
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
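As a hint for the blanks above, all three measures can be computed with Python's built-in statistics module; the list of marks here is an invented sample.

```python
import statistics

marks = [10, 20, 20, 40, 60]

print(statistics.mean(marks))     # 30 -> sum of values / number of values
print(statistics.median(marks))   # 20 -> middle value of the sorted data
print(statistics.mode(marks))     # 20 -> most frequently occurring value
```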
Bar Chart
This is an example of a double bar chart. The two axes depict two different parameters, while bars of different colours represent different entities (in this case, women and men). A bar chart also works on discontinuous data and is made at uniform intervals.

Histogram
Histograms are an accurate representation of continuous data. When it comes to plotting the variation in just one entity over a period of time, histograms come into the picture. A histogram represents the frequency of the variable at different points of time with the help of bins.
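A double bar chart like the one described above can be sketched with Matplotlib; the years and the survey counts for women and men are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend
import numpy as np
import matplotlib.pyplot as plt

categories = ["2018", "2019", "2020"]
women = [45, 50, 62]
men = [40, 48, 55]

x = np.arange(len(categories))
width = 0.35   # two bars of different colours share each interval

plt.bar(x - width / 2, women, width, label="Women")
plt.bar(x + width / 2, men, width, label="Men")
plt.xticks(x, categories)
plt.legend()
plt.savefig("double_bar.png")
```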
In the given example, the histogram shows the variation in the frequency of the entity plotted with the help of the XY plane. Here, on the left, the frequency of the element has been plotted, and it is a frequency map for the same. The colours show the transition from low to high and vice versa. Whereas on the right, a continuous dataset has been plotted which might not be talking about the frequency of occurrence of the element.
Box Plots
When the data is split according to its percentile throughout the range, box plots come in handy. Box plots, also known as box and whiskers plots, conveniently display the distribution of data throughout the range with the help of 4 quartiles.
Here, as we can see, the plot contains a box, and the two lines at its left and right are termed whiskers. The plot has 5 different parts to it:
Quartile 1: From 0 percentile to 25th percentile – Here, data lying between the 0 and 25th percentile is plotted. If the data points are close to each other, say the 0 to 25th percentile data is covered in just a 20-30 marks range, then the whisker will be smaller as the range is smaller. But if the range is larger, say 0-30 marks, then the whisker will also get elongated as the range is longer.
Quartile 2: From 25th percentile to 50th percentile – The 50th percentile is termed the median of the whole distribution, and since the data falling in the range of the 25th to 75th percentile has minimum deviation from the median, it is plotted inside the box.
Quartile 3: From 50th percentile to 75th percentile – This range is again plotted in the box as its deviation from the median is less. Quartiles 2 & 3 (from the 25th to the 75th percentile) together constitute the Inter-Quartile Range (IQR). Also, just like the whiskers, the length of the box varies depending on whether the data is less or more spread out.
Quartile 4: From 75th percentile to 100th percentile – This is the whisker for the top 25 percentile of the data.
Outliers: The advantage of box plots is that they clearly show the outliers in a data distribution. Points which do not lie in the range are plotted outside the box as dots or circles and are termed outliers, as they do not belong to the range of the data. Since being out of range is not an error, they are still plotted on the graph for visualisation.
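The five-number summary behind a box plot can be sketched with NumPy. The marks below are invented, and the 1.5 × IQR rule used to flag outliers is a common convention rather than something this text prescribes.

```python
import numpy as np

marks = np.array([20, 22, 25, 27, 30, 32, 35, 38, 40, 95])

q1 = np.percentile(marks, 25)   # lower edge of the box
q2 = np.percentile(marks, 50)   # the median of the distribution
q3 = np.percentile(marks, 75)   # upper edge of the box
iqr = q3 - q1                   # Inter-Quartile Range

# Common rule of thumb: points beyond 1.5 * IQR from the box are outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = marks[(marks < lower) | (marks > upper)]
print(outliers)   # the value 95 falls outside the range
```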
Let us now move ahead and experience data visualisation using Jupyter Notebook. The Matplotlib library will help us in plotting all sorts of graphs, while NumPy and Pandas will help us in analysing the data.
Data Sciences: Classification Model
In this section, we will be looking at one of the classification models used in Data Sciences. But before we look into the technicalities of the code, let us play a game.
Personality Prediction
Step 1: Here is a map. Take a good look at it. In this map, you can see that the arrows determine a quality. The qualities mentioned are:
1. Positive X-axis – People focussed: You focus more on people and try to deliver the best
experience to them.
2. Negative X-axis – Task focussed: You focus more on the task which is to be accomplished and
try to do your best to achieve that.
3. Positive Y-axis – Passive: You focus more on listening to people and understanding everything
that they say without interruption.
4. Negative Y-axis – Active: You actively participate in the discussions and make sure that you make your point in front of the crowd.
Think for a minute and understand which of these qualities you have in you. Now, take a chit and write your name on it. Place this chit at the point on this map which best describes you. It can be placed anywhere on the graph. Be honest about yourself and put it on the graph.
Step 2: Now that you have all put up your chits on the graph, it's time to take a quick quiz. Go to this link and finish the quiz on it individually: https://tinyurl.com/discanimal

On this link, you will find a personality prediction quiz. Take this quiz individually and try to answer all the questions honestly. Do not take anyone's help and do not discuss it with anyone. Once the quiz is finished, remember the animal which has been predicted for you. Write it down somewhere and do not show it to anyone. Keep it as your little secret.
Once everyone has gone through the quiz, go back to the board, remove your chit, and draw the symbol which corresponds to your animal in place of your chit. Here are the symbols:
Lion
Otter
Golden Retriever
Beaver
Place these symbols at the locations where you had put up your names. Ask 4 students not to do so and tell them to keep their animals a secret. Let their name chits stay on the graph so that we can predict their animals with the help of this map.
Now, we will try to use the nearest neighbour algorithm to predict the possible animal(s) for these 4 unknowns. Look at these 4 chits one by one. Which animal occurs the most in their vicinity? Do you think that if the lion symbol occurs the most near a chit, then there is a good probability that that student's animal would also be a lion? Now let us try to guess the animal for all 4 of them according to their nearest neighbours respectively. After guessing the animals, ask these 4 students if the guess is right or not.
K-Nearest Neighbour: Explained
The k-nearest neighbours (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other, as the saying goes: "Birds of a feather flock together". Some features of KNN are:
• The KNN prediction model relies on the surrounding points or neighbours to determine its class or group
• It utilises the properties of the majority of the nearest points to decide how to classify unknown points
• It is based on the concept that similar data points should be close to each other
The personality prediction activity was a brief introduction to KNN. As you recall, in that activity, we tried to predict the animal for 4 students according to the animals which were nearest to their points. This is how, in layman's language, KNN works. Here, K is a variable which tells us the number of neighbours taken into account during prediction. It can be any integer value starting from 1.
Let us look at another example to demystify this algorithm. Let us assume that we need to predict the sweetness of a fruit according to the data which we have for the same type of fruit. So here we have three maps to predict the same:
Here, X is the value which is to be predicted. The green dots depict sweet values and the blue ones denote not sweet.

Let us try it out ourselves first. Look at the map closely and decide whether X should be sweet or not sweet.
Now, let us look at each graph one by one:
Here, we can see that K is taken as 1 which means that we are taking only 1 nearest
neighbour into consideration. The nearest value to X is a blue one hence 1-nearest
neighbour algorithm predicts that the fruit is not sweet.
In the 2nd graph, the value of K is 2. Taking 2 nearest nodes to X into consideration, we
see that one is sweet while the other one is not sweet. This makes it difficult for the
machine to make any predictions based on the nearest neighbour and hence the
machine is not able to give any prediction.
In the 3rd graph, the value of K becomes 3. Here, 3 nearest nodes to X are chosen out
of which 2 are green and 1 is blue. On the basis of this, the model is able to predict that
the fruit is sweet.
On the basis of this example, let us understand KNN better:

KNN tries to predict an unknown value on the basis of the known values. The model simply calculates the distance between the unknown point and all the known points (by distance we mean the difference between two values) and takes up the K points whose distance is minimum. According to these, the predictions are made.
Let us understand the significance of the number of neighbours:
1. As we decrease the value of K to 1, our predictions become less stable. Just think for a minute: imagine K=1 and we have X surrounded by several greens and one blue, but the blue is the single nearest neighbour. Reasonably, we would think X is most likely green, but because K=1, KNN incorrectly predicts that it is blue.
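The KNN procedure described above can be sketched as a tiny one-dimensional implementation. The points, labels, and distance measure here are invented for illustration; real libraries offer far more complete versions.

```python
from collections import Counter

# (value, label) pairs: known fruits and whether they are sweet
known = [(2, "not sweet"), (3, "not sweet"),
         (6, "sweet"), (7, "sweet"), (8, "sweet")]

def knn_predict(x, k):
    # Sort the known points by their distance to the unknown value x
    nearest = sorted(known, key=lambda point: abs(point[0] - x))[:k]
    labels = [label for _, label in nearest]
    # The majority label among the K nearest points wins
    return Counter(labels).most_common(1)[0][0]

print(knn_predict(5, 1))   # sweet -> the single nearest point decides
print(knn_predict(5, 3))   # sweet -> the majority of 3 neighbours decides
```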