AI Project 1
Introduction
Each domain has its own type of data that gets fed into the machine and hence its own way of working around it. Data Science is a concept that unifies statistics, data analysis, machine learning and their related methods in order to understand and analyze actual phenomena with data. It employs techniques and theories drawn from many fields within the context of Mathematics, Statistics, Computer Science, and Information Science.
Applications of Data Sciences
Data Science is not a new field. Data Science majorly works around analyzing data, and when it comes to AI, this analysis helps in making the machine intelligent enough to perform tasks by itself.
There exist various applications of Data Science in today's world. Some of them are:
Fraud and Risk Detection: The earliest applications of data science were in Finance. Companies were fed up with the bad debts and losses they suffered every year. However, they had a lot of data which used to get collected during the initial paperwork while sanctioning loans. They decided to bring in data scientists to rescue them from these losses.
Over the years, banking companies learned to divide and conquer data via customer profiling, past
expenditures, and other essential variables to analyze the probabilities of risk and default. Moreover, it also
helped them to push their banking products based on customers' purchasing power.
Genetics & Genomics: Data Science applications also enable an advanced level of treatment
personalization through research in genetics and genomics. The goal is to understand the impact of
the DNA on our health and find individual biological connections between genetics, diseases, and
drug response. Data science techniques allow integration of different kinds of data with genomic
data in disease research, which provides a deeper understanding of genetic issues in reactions to
particular drugs and diseases. As soon as we acquire reliable personal genome data, we will achieve
a deeper understanding of the human DNA. The advanced genetic risk prediction will be a major step towards more individual care.
Internet Search: When we talk about search engines, we think 'Google' right? But there are
many other search engines like Yahoo, Bing, Ask, AOL, and so on. All these search engines (including
Google) make use of data science algorithms to deliver the best result for our searched query in a fraction of a second. Considering the fact that Google processes more than 20 petabytes of data
every day, had there been no data science, Google wouldn't have been the 'Google' we know today.
Targeted Advertising: If you thought Search would have been the biggest of all data
science applications, here is a challenger - the entire digital marketing spectrum. Starting from the
display banners on various websites to the digital billboards at the airports - almost all of them are
decided by using data science algorithms. This is the reason why digital ads have been able to get a
much higher CTR (Click-Through Rate) than traditional advertisements. They can be targeted based
on a user's past behavior.
Airline Route Planning: The airline industry across the world is known to bear heavy losses. Except for a few airline service providers, companies are struggling to maintain their occupancy ratio and operating profits. With the high rise in air-fuel prices and the need to offer heavy discounts to customers, the situation has got worse. It wasn't long before airline companies started using Data Science to identify the strategic areas of improvement in their operations.
But, before we get deeper into data analysis, let us recall how Data Sciences can be leveraged to solve some of the pressing
problems around us. For this, let us understand the AI project cycle framework around Data Sciences with the help of an
example.
Humans are social animals. We tend to organize and/or participate in various kinds of social gatherings all the time. We love eating out with friends and family, which is why we can find restaurants almost everywhere; many of these restaurants arrange buffets to offer a variety of food items to their customers. Be it small shops or big
outlets, every restaurant prepares food in bulk as they expect a good crowd to come and enjoy their food. But in most
cases, after the day ends, a lot of food is left which becomes unusable for the restaurant as they do not wish to serve stale
food to their customers the next day. So, every day, they prepare food in large quantities keeping in mind the probable
number of customers walking into their outlet. But if the expectations are not met, a good amount of food gets wasted
which eventually becomes a loss for the restaurant as they either have to dump it or give it to hungry people for free. And
if this daily loss is taken into account for a year, it becomes quite a big amount.
Now that we have understood the scenario well, let us take a deeper look into the problem to find out more about various factors around
it. Let us fill up the 4Ws problem canvas to find out.
After finalizing the goal of our project, let us now move towards looking at various data features which affect
the problem in some way or the other. Since any AI-based project requires data for testing and training, we
need to understand what kind of data is to be collected to work towards the goal. In our scenario, various
factors that would affect the quantity of food to be prepared for the next day's consumption in buffets would
be:
Now let us understand how these factors are related to our problem statement. For this, we can use the
System Maps tool to figure out the relationship of elements with the project's goal. Here is the System map for
our problem statement.
In this system map, you can see how the relationship of each element is defined with the goal of our project.
Recall that the positive arrows determine a direct relationship of elements while the negative ones show an
inverse relationship of elements.
After looking at the factors affecting our problem statement, now it's time to take a look at the data which is
to be acquired for the goal. For this problem, a dataset covering all the elements mentioned above is made for
each dish prepared by the restaurant over a period of 30 days. This data is collected offline in the form of a
regular survey since this is a personalized dataset created just for one restaurant's needs.
Specifically, the data collected comes under the following categories: Name of the dish, Price of the dish,
Quantity of dish produced per day, Quantity of dish left unconsumed per day, Total number of customers per
day, fixed customers per day, etc.
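To get a feel for what such a dataset might look like in code, here is a minimal, hypothetical sketch of the table as a pandas DataFrame; the column names and values are only illustrative and not the restaurant's actual data.

```python
import pandas as pd

# Hypothetical sample of the restaurant survey data described above;
# real data would cover every dish over the full 30-day period.
data = pd.DataFrame({
    "dish_name": ["Paneer Tikka", "Dal Makhani", "Fried Rice"],
    "price": [250, 180, 150],
    "quantity_produced_kg": [20, 35, 30],
    "quantity_unconsumed_kg": [4, 6, 2],
    "total_customers": [120, 120, 120],
    "fixed_customers": [45, 45, 45],
})

print(data.head())
```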
Data Exploration
After creating the database, we now need to look at the data collected and understand what is required out of
it. In this case, since the goal of our project is to be able to predict the quantity of food to be prepared for the
next day, we need to have the following data:
Thus, we extract the required information from the curated dataset and clean it up in such a way that there
exist no errors or missing elements in it.
Modeling
Once the dataset is ready, we train our model on it. In this case, a regression model is chosen in which the
dataset is fed as a data frame and is trained accordingly. Regression is a Supervised Learning model which
takes in continuous values of data over a period of time. Since in our case we have continuous data for 30 days, we can use a regression model to predict the next values in a similar manner. Here, the dataset of 30 days is divided in a ratio of 2:1 for training and testing respectively: the model is first trained on the 20-day data and then evaluated on the remaining 10 days.
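A minimal sketch of this modeling step, assuming a hypothetical 30-day history for a single dish and using scikit-learn's LinearRegression as one possible regression model; the feature choice and values are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical 30-day history for one dish: quantity produced and quantity
# left unconsumed each day (kg). Real values would come from the survey data.
rng = np.random.default_rng(0)
produced = rng.uniform(20, 40, size=30)
unconsumed = np.clip(produced - rng.uniform(15, 25, size=30), 0, None)

# Target: quantity actually needed the next day (produced minus unconsumed)
needed_next_day = produced - unconsumed

X = np.column_stack([produced, unconsumed])
y = needed_next_day

# 2:1 split, keeping the time order: first 20 days train, last 10 days test
X_train, X_test = X[:20], X[20:]
y_train, y_test = y[:20], y[20:]

model = LinearRegression()
model.fit(X_train, y_train)
print("Predicted quantity for the test days:", model.predict(X_test).round(1))
```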
Evaluation
Once the model has been trained on the training dataset of 20 days, it is now time to see if the model is
working properly or not. Let us see how the model works and how it is tested.
Step 1: The trained model is fed data regarding the name of the dish and the quantity produced for the same.
Step 2: It is then fed data regarding the quantity of food left unconsumed for the same dish on previous occasions.
Step 3: The model then works upon the entries according to the training it got at the modeling stage.
Step 4: The Model predicts the quantity of food to be prepared for the next day.
Step 5: The prediction is compared to the testing dataset value. From the testing dataset, ideally, we can say
that the quantity of food to be produced for next day's consumption should be the total quantity minus the
unconsumed quantity.
Step 6: The model is tested on the 10 days of testing data kept aside while training.
Step 7: If the predicted value is the same as or close to the actual values, the model is said to be accurate.
Otherwise, either the model selection is changed or the model is trained on more data for better accuracy.
Once the model is able to achieve optimum efficiency, it is ready to be deployed in the restaurant for real-time
usage.
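As a rough illustration of this evaluation step, the sketch below compares hypothetical predictions against the actual test values using the mean absolute error; the numbers are made up and only show the mechanics of the comparison.

```python
import numpy as np

# Hypothetical predictions from the trained model and the actual values
# (total produced minus unconsumed) for the 10 test days.
predicted = np.array([22.5, 30.1, 27.8, 25.0, 31.2, 28.4, 26.7, 24.9, 29.5, 27.1])
actual    = np.array([23.0, 29.5, 28.2, 24.6, 32.0, 28.0, 26.0, 25.5, 29.0, 27.5])

mean_absolute_error = np.mean(np.abs(predicted - actual))
print(f"Mean absolute error over the 10 test days: {mean_absolute_error:.2f} kg")

# If the error is small enough, the model is considered accurate;
# otherwise we retrain with more data or try a different model.
```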
Data Collection
Data collection is nothing new in our lives; it has been part of our society for ages. Even when
people did not have fair knowledge of calculations, records were still maintained in some way or the other to
keep an account of relevant things. Data collection is an exercise which does not require even a tiny bit of
technological knowledge. But when it comes to analyzing the data, it becomes a tedious process for humans as
it is all about numbers and alpha-numerical data. That is where Data Science comes into the picture. It not only
gives us a clearer idea around the dataset, but also adds value to it by providing deeper and clearer analyses
around it. And as AI gets incorporated in the process, predictions and suggestions by the machine become
possible on the same. Now that we have gone through an example of a Data Science based project, we have a
bit of clarity regarding the type of data that can be used to develop a Data Science related project. For the data
domain-based projects, majorly the type of data used is in numerical or alpha-numerical format and such
datasets are curated in the form of tables. Such databases are very commonly found in any institution for
record maintenance and other purposes. Some examples of datasets which you must already be aware of are:
Banks: Databases of loans issued, account holders, locker owners, employee registrations, bank visitors, etc.
ATM machines: Usage details per day, cash denomination transaction details, visitor details, etc.
Movie theatres: Movie details, tickets sold offline, tickets sold online, refreshment purchases, etc.
Now look around you and find out what are the different types of databases which are maintained in the
places mentioned below. Try surveying people who are responsible for the designated places to get a better
idea.
As you can see, all the types of data mentioned above are in the form of tables, and these tables contain numeric or alpha-numeric data. But this leads to a very critical dilemma: are these datasets accessible
to all? Should these databases be accessible to all? What are the various sources of data from which we can
gather such databases? Let's find out!
Sources of Data
There exist various sources of data from where we can collect any type of data required and the data
collection process can be categorized in two ways: Offline and Online.
While accessing data from any of the data sources, following points should be kept in mind:
1. Data which is available for public usage only should be taken up.
2. Personal datasets should only be used with the consent of the owner.
3. Data should only be taken from reliable sources, as data collected from random sources can be wrong or unusable.
4. Reliable sources of data ensure the authenticity of the data, which helps in proper training of the AI model.
Types of Data
For Data Science, usually the data is collected in the form of tables. These tabular datasets can be stored in
different formats. Some of the commonly used formats are:
1. CSV: CSV stands for Comma Separated Values. It is a simple file format used to store tabular data. Each line of the file is a data record, and each record consists of one or more fields which are separated by commas. Since the values of the records are separated by commas, they are known as CSV files (a short sketch of reading a CSV file in Python follows this list).
2. Spreadsheet: A Spreadsheet is a piece of paper or a computer program which is used for accounting and
recording data using rows and columns into which information can be entered. Microsoft Excel is a program which helps in creating spreadsheets.
3. SQL: SQL, also known as Structured Query Language, is a domain-specific programming language designed for managing data held in different kinds of DBMS (Database Management Systems). It is particularly useful in handling structured data.
A lot of other database formats also exist; you can explore them online!
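As mentioned in the CSV entry above, here is a minimal sketch of reading a CSV file in Python using the standard csv module; the file name and column names are hypothetical.

```python
import csv

# Hypothetical file "restaurant.csv" with a header row, e.g.:
# dish_name,price,quantity_produced_kg,quantity_unconsumed_kg
with open("restaurant.csv", newline="") as f:
    reader = csv.DictReader(f)          # each row becomes a dict keyed by the header
    for row in reader:
        print(row["dish_name"], row["price"])
```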
Data Access
After collecting the data, to be able to use it for programming purposes, we should know how to access the same in a
Python code. To make our lives easier, there exist various Python packages which help us in accessing structured data (in
tabular form) inside the code. Let us take a look at some of these packages :
NumPy
NumPy, which stands for Numerical Python, is the fundamental package for Mathematical and logical operations on arrays
in Python. It is a commonly used package when it comes to working around numbers. NumPy gives a wide range of
arithmetic operations around numbers, giving us an easier approach to working with them. NumPy also works with arrays, which are homogeneous collections of data.
An array is nothing but a set of multiple values of the same data type. They can be numbers, characters, booleans, etc., but an array can hold only one data type at a time. In NumPy, the arrays used are known as ND-arrays (N-Dimensional Arrays), as NumPy comes with a feature of creating n-dimensional arrays in Python.
An array can easily be compared to a list. Let us take a look at how they are different:
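A minimal sketch of the difference, assuming nothing beyond standard Python and NumPy:

```python
import numpy as np

py_list = [1, 2, 3, 4]
np_array = np.array([1, 2, 3, 4])

# Arithmetic on a NumPy array is element-wise...
print(np_array * 2)        # [2 4 6 8]

# ...while the same operator on a list repeats the list instead
print(py_list * 2)         # [1, 2, 3, 4, 1, 2, 3, 4]

# A list can mix data types; a NumPy array keeps a single data type
mixed = [1, "two", 3.0]
print(np.array(mixed).dtype)   # everything is converted to one type (here, a string type)
```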
Pandas
Pandas is a software library written for the Python programming language for data manipulation and analysis.
In particular, it offers data structures and operations for manipulating numerical tables and time series. The
name is derived from the term "panel data", an econometrics term for data sets that include observations over
multiple time periods for the same individuals.
Here are just a few of the things that pandas does well:
Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user
can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in
computations
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
Intuitive merging and joining data sets
Flexible reshaping and pivoting of data sets
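A short, hypothetical sketch of a few of these features (missing-data handling, label-based selection and a derived column); the table contents are illustrative only.

```python
import pandas as pd
import numpy as np

# A small, hypothetical table with a missing (NaN) value
df = pd.DataFrame({
    "dish_name": ["Idli", "Dosa", "Pulao"],
    "quantity_produced_kg": [25, 30, np.nan],
    "quantity_unconsumed_kg": [3, 5, 2],
})

print(df.isna().sum())                 # count missing values per column
df["quantity_produced_kg"] = df["quantity_produced_kg"].fillna(
    df["quantity_produced_kg"].mean()  # fill the gap with the column mean
)

# Label-based selection and a new derived column
df["consumed_kg"] = df["quantity_produced_kg"] - df["quantity_unconsumed_kg"]
print(df[["dish_name", "consumed_kg"]])
```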
Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib is a multi-platform
data visualization library built on NumPy arrays. One of the greatest benefits of visualization is that it allows us
visual access to huge amounts of data in easily digestible visuals. Matplotlib comes with a wide variety of plots.
Plots help us to understand trends and patterns, and to make correlations. They're typically instruments for reasoning about quantitative information. Some types of graphs that we can make with this package are listed
below:
Not just plotting, but you can also modify your plots the way you wish. You can stylise them and make them
more descriptive and communicable.
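A minimal sketch of a simple Matplotlib plot with a title, axis labels and a legend; the data points are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical data: quantity of a dish consumed over ten days
days = list(range(1, 11))
consumed_kg = [22, 25, 21, 27, 30, 28, 26, 24, 29, 31]

plt.plot(days, consumed_kg, marker="o", color="teal", label="Consumed (kg)")
plt.title("Daily consumption of a dish")
plt.xlabel("Day")
plt.ylabel("Quantity (kg)")
plt.legend()
plt.show()
```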
These packages help us in accessing the datasets we have and also in exploring them to develop a better
understanding of them.
Basic Statistics with Python
We have already understood that Data Sciences works around analyzing data and performing tasks around it. For analyzing
the numeric & alpha-numeric data used for this domain, mathematics comes to our rescue. Basic statistical methods used in mathematics come quite handy in Python too for analyzing and working around such datasets. Statistical tools widely
used in Python are:
Do you remember using these formulas in your class? Let us recall all of them here:
Mean: The mean, or average, of a dataset in Python can be calculated by summing all values and dividing by the number of values using `sum(data) / len(data)`. NumPy's `numpy.mean()` function provides a concise alternative for efficient numerical operations on arrays.
Median: The median is the middle value of a dataset when it is sorted. In Python, it is calculated using `statistics.median()` or `numpy.median()` for numeric data. For an odd-sized dataset, the median is the middle value; for an even-sized dataset, it is the average of the two middle values.
Mode: The mode is the most frequently occurring value in a dataset. In Python, you can calculate it using `statistics.mode()` for datasets with a single mode; for multimodal datasets, `statistics.multimode()` or SciPy's `scipy.stats.mode()` (for array-like objects) can be used.
Standard deviation: Standard deviation measures the dispersion of values in a dataset. In Python, it is calculated using `statistics.stdev()` for sample data and `numpy.std()` for numerical arrays. It involves finding the square root of the variance, where the variance is the average of the squared differences from the mean.
Variance: Variance quantifies the spread of values in a dataset. In Python, calculate it with `statistics.variance()` for sample data or `numpy.var()` for numerical arrays. It involves finding the average of the squared differences between each data point and the mean, providing a measure of data dispersion.
The advantage of using Python packages is that we do not need to write our own formula or equation to find the results. There exist a lot
of pre-defined functions with packages like NumPy which reduces this trouble for us. All we need to do is write that function and pass on
the data to it. It's that simple! Let us take a look at various Python syntaxes that can help us with the statistical work in data analysis. Head
to the Jupyter Notebook of Basic statistics with Python and start exploring! You may find the Jupyter notebook here: http://bit.ly/data
notebook
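For a quick feel of these functions, here is a small sketch using the built-in statistics module alongside NumPy on a made-up list of values; ddof=1 is passed to NumPy to get the sample (rather than population) versions.

```python
import statistics
import numpy as np

data = [24, 26, 22, 30, 26, 28, 25]   # hypothetical daily quantities (kg)

print("Mean:    ", statistics.mean(data),     "|", np.mean(data))
print("Median:  ", statistics.median(data),   "|", np.median(data))
print("Mode:    ", statistics.mode(data))     # most frequent value
print("Std dev: ", statistics.stdev(data),    "|", np.std(data, ddof=1))   # sample standard deviation
print("Variance:", statistics.variance(data), "|", np.var(data, ddof=1))   # sample variance
```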
Data Visualization
While collecting data, it is possible that the data might come with some errors. Let us first take a look at the types of issues we can face
with data:
1. Erroneous Data: There are two ways in which the data can be erroneous:
Incorrect values: The values in the dataset (at random places) are incorrect. For example, in the column of phone number,
there is a decimal value or in the marks column, there is a name mentioned, etc. These are incorrect values that do not
resemble the kind of data expected in that position.
Invalid or Null values: At some places, the values get corrupted and hence become invalid. Many times you will find NaN values in the dataset. These are null values which do not hold any meaning and cannot be processed. That is why these values (as and when encountered) are removed from the database.
2. Missing Data: In some datasets, some cells remain empty. The values of these cells are missing and hence the cells remain empty.
Missing data cannot be interpreted as an error, as the values here are not erroneous and might not be missing because of any error.
3. Outliers: Data which does not fall within the range of the rest of the data is referred to as an outlier. To understand this better, let us take an
example of marks of students in a class. Let us assume that a student was absent for exams and hence has got 0 marks in it. If his marks
are taken into account, the whole class's average would go down. To prevent this, the average is taken for the range of marks from highest
to lowest keeping this particular result separate. This makes sure that the average marks of the class are true according to the data.
Analyzing the data collected can be difficult as it is all about tables and numbers. While machines work efficiently on numbers, humans
need visual aid to understand and comprehend the information passed. Hence, data visualization is used to interpret the data collected
and identify patterns and trends out of it.
It is one of the most commonly used graphical methods. From students to scientists,
everyone uses bar charts in some way or the other. It is a very easy to draw yet
informative graphical representation. Various versions of bar chart exist like single bar
chart, double bar chart, etc.
This is an example of a double bar chart. The two axes depict two different parameters, while bars of different colours represent different entities (in this case, women and men). A bar chart also works on discontinuous data and is drawn at uniform intervals.
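A minimal sketch of how such a double bar chart could be drawn with Matplotlib; the categories and values are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data for a double bar chart: two groups compared across categories
categories = ["Mon", "Tue", "Wed", "Thu"]
women = [20, 35, 30, 27]
men = [25, 32, 34, 20]

x = np.arange(len(categories))
width = 0.35

plt.bar(x - width / 2, women, width, label="Women")
plt.bar(x + width / 2, men, width, label="Men")
plt.xticks(x, categories)
plt.ylabel("Visitors")
plt.title("Double bar chart: visitors per day")
plt.legend()
plt.show()
```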
A box plot divides the data distribution into four quartiles:
Quartile 1: From 0 percentile to 25th percentile - Here, data lying between the 0 and 25th percentiles is plotted. Now, if the data points are close to each other, let's say the 0 to 25th percentile data is covered in just a 20-30 marks range, then the whisker would be smaller as the range is smaller. But if the range is large, say a 0-30 marks range, then the whisker gets elongated as the range is longer.
Quartile 2: From 25th percentile to 50th percentile - The 50th percentile is the median of the whole distribution, and since the data falling in the range of the 25th percentile to the 75th percentile has minimum deviation from the median, it is plotted inside the box.
Quartile 3: From 50th percentile to 75th percentile - This range is again plotted in the box as its deviation from the median is less. Quartiles 2 & 3 (from the 25th percentile to the 75th percentile) together constitute the Inter-Quartile Range (IQR). Also, depending upon the range of the distribution, just like the whiskers, the length of the box varies with how spread out the data is.
Quartile 4: From 75th percentile to 100th percentile - This is the whisker plotted for the top 25 percent of the data.
Outliers: The advantage of box plots is that they clearly show the outliers in a data distribution. Points which do not lie in the range are
plotted outside the graph as dots or circles and are termed outliers as they do not belong to the range of data. Since being out of range is not an error, they are still plotted on the graph for visualization.
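A small sketch of a box plot drawn with Matplotlib; the marks are hypothetical, with the absentee's 0 showing up as an outlier dot.

```python
import matplotlib.pyplot as plt

# Hypothetical marks of a class, with one absentee scored 0 acting as an outlier
marks = [0, 55, 60, 62, 65, 67, 70, 72, 75, 78, 80, 85]

plt.boxplot(marks)                 # whiskers, box (IQR) and outlier dots are drawn automatically
plt.title("Box plot of class marks")
plt.ylabel("Marks")
plt.show()
```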
Let us now move ahead and experience data visualization using a Jupyter notebook. The Matplotlib library will help us in plotting all sorts of graphs while NumPy and Pandas will help us in analyzing the data.
In this section, we would be looking at one of the classification models used in Data Sciences. But before we look into the technicalities of
the code, let us play a game.
Personality Prediction
Step 1: Here is a map. Take a good look at it. In this map, you can see that the arrows determine a quality. The qualities mentioned are:
1. Positive X-axis - People focused: You focus more on people and try to deliver the best experience to them.
2. Negative X-axis - Task focused: You focus more on the task which is to be accomplished and try to do your best to achieve that.
3. Positive Y-axis - Passive: You focus more on listening to people and understanding everything that they say without interruption.
4. Negative Y-axis - Active: You actively participate in the discussions and make sure that you make your point in front of the crowd.
Think for a minute and understand which of these qualities you have in you. Now, take a chit and write your name on it. Place this chit at a
point in this map which best describes you. It can be placed anywhere on the graph. Be honest about yourself and put it on the graph.
Step 2: Now that you have all put up your chits on the graph, it's time to take a quick quiz. Go to this link and finish the quiz on it
individually: https://tinyurl.com/discanimal
On this link, you will find a personality prediction quiz. Take this quiz individually and try to answer all the questions honestly. Do not take anyone's help in it and do not discuss it with anyone. Once the quiz is finished, remember the animal which has been predicted for
you. Write it somewhere and do not show it to anyone.
Once everyone has gone through the quiz, go back to the board, remove your chit, and draw the symbol which corresponds to your animal
in place of your chit. Here are the symbols:
Place these symbols at the locations where you had put up your names. Ask 4 students not to do so and tell them to keep their animals a
secret. Let their name chits be on the graph so that we can predict their animals with the help of this map.
Now, we will try to use the nearest neighbor algorithm here and try to predict what can be the possible animal(s) for these 4 unknowns.
Now look at these 4 chits one by one. Which animal is occurring the most in their vicinity? Do you think that if the lion symbol is occurring the most near their chit, then there is a good probability that their animal would also be a lion? Now let us try to guess the animal for all 4 of them according to their nearest neighbours respectively. After guessing the animals, ask these 4 students if the guess is
right or not.
The k-nearest neighbours (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve
both classification and regression problems. The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other, as the saying goes: "Birds of a feather flock together". Some features of KNN are:
• The KNN prediction model relies on the surrounding points or neighbours to determine its class or group
• Utilises the properties of the majority of the nearest points to decide how to classify unknown points
• Based on the concept that similar data points should be close to each other
The personality prediction activity was a brief introduction to KNN. As you recall, in that activity, we tried to predict the animal for 4
students according to the animals which were the nearest to their points. This is how, in layman's language, KNN works. Here, K is a variable which tells us the number of neighbours that are taken into account during prediction. It can be any integer value starting
from 1.
Let us look at another example to demystify this algorithm. Let us assume that we need to predict the sweetness of a fruit according to the
data which we have for the same type of fruit. So here we have three maps to predict the same:
Here, X is the value which is to be predicted. The green dots depict sweet values and the blue ones denote not sweet.
Let us try it out by ourselves first. Look at the map closely and decide whether X should be sweet or not sweet?
In the 2nd graph, the value of K is 2. Taking 2 nearest nodes to X into consideration, we see that one is sweet while
the other one is not sweet. This makes it difficult for the machine to make any predictions based on the nearest
neighbour and hence the machine is not able to give any prediction.
In the 3rd graph, the value of K becomes 3. Here, 3 nearest nodes to X are chosen out of which 2 are green and 1 is
blue. On the basis of this, the model is able to predict that the fruit is sweet.
KNN tries to predict an unknown value on the basis of the known values. The model simply calculates the distance between the unknown point and all the known points (by distance we mean the difference between the two values) and takes up the K points whose distance is minimum. According to these, the predictions are made.
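A minimal sketch of this idea using scikit-learn's KNeighborsClassifier on made-up fruit data; the features and values are purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical fruit data: each point is (size, colour score), labelled sweet (1) or not sweet (0)
X = np.array([[7.0, 8.5], [6.5, 8.0], [7.2, 9.0], [3.0, 4.0], [2.5, 3.5], [3.5, 4.5]])
y = np.array([1, 1, 1, 0, 0, 0])

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, as in the third map above
knn.fit(X, y)

unknown = np.array([[6.8, 8.2]])            # the point "X" whose sweetness we want to predict
print("Prediction:", "sweet" if knn.predict(unknown)[0] == 1 else "not sweet")
```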
1. As we decrease the value of K to 1, our predictions become less stable. Just think for a minute: imagine K=1 and we have X surrounded by several greens and one blue, but the blue is the single nearest neighbour. Reasonably, we would think X is most likely green, but because K=1, KNN incorrectly predicts that it is blue.
2. Inversely, as we increase the value of K, our predictions become more stable due to majority voting/averaging, and thus more likely to be accurate (up to a certain point). Eventually, we begin to witness an increasing number of errors. It is at this point that we know we have pushed the value of K too far.
3. In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker. The short sketch after this list tries out a few values of K on a sample dataset.
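A minimal sketch, using scikit-learn on a small synthetic dataset, of how accuracy can be checked for a few values of K; the dataset and the K values tried are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# A small synthetic dataset just to see how accuracy changes with K
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

for k in [1, 3, 5, 7, 9, 15, 25]:           # odd values of K help avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K = {k:2d}  average accuracy = {score:.2f}")
```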