Class X AI Unit 4: Data Science

The document provides an overview of Data Science, highlighting its integration of statistics, data analysis, and machine learning to analyze real-world phenomena. It discusses various applications of Data Science in fields such as finance, genetics, internet search, targeted advertising, and airline route planning. Additionally, it outlines the importance of data acquisition, collection methods, and tools like Python libraries (NumPy, Pandas, Matplotlib) for data manipulation and visualization.


Class X – Artificial Intelligence
Unit 4: Data Science
DATA SCIENCE?
Data Science is a concept that unifies
statistics, data analysis, machine
learning and their related methods in
order to understand and analyse actual
phenomena with data.

It employs techniques and theories drawn from many fields
within the context of Mathematics, Statistics, Computer
Science, and Information Science.
Applications of Data Science
Fraud and Risk Detection:
• The earliest applications of data science were in Finance.
• Companies were fed up with the bad debts and losses they incurred every year.
• However, they had a lot of data that used to get collected during
the initial paperwork while sanctioning loans.
• They decided to bring in data scientists in order to rescue them
from losses.
• Over the years, banking companies learned to divide and conquer
data via customer profiling, past expenditures, and other
essential variables to analyse the probabilities of risk and default.
• Moreover, it also helped them to push their banking products
based on customer’s purchasing power.
Genetics & Genomics:
• Data Science applications also enable an advanced level of treatment
personalization through research in genetics and genomics.
• The goal is to understand the impact of the DNA on our health and find
individual biological connections between genetics, diseases, and drug
response.
• Data science techniques allow integration of different kinds of data with
genomic data in disease research, which provides a deeper understanding of
genetic issues in reactions to particular drugs and diseases.
• As soon as we acquire reliable personal genome data, we will achieve a deeper
understanding of the human DNA.
• The advanced genetic risk prediction will be a major step towards more
individual care.
Internet Search: When we talk about search
engines, we think ‘Google’. Right? But there
are many other search engines like Yahoo,
Bing, Ask, AOL, and so on.
All these search engines (including Google)
make use of data science algorithms to deliver
the best results for our searched query in
a fraction of a second.
Considering the fact that Google processes
more than 20 petabytes of data every day, had
there been no data science, Google wouldn’t
have been the ‘Google’ we know today.
Targeted Advertising:
If you thought Search would have been the biggest of all
data science applications, here is a challenger – the
entire digital marketing spectrum.
Starting from the display banners on various websites to
the digital billboards at the airports – almost all of them
are decided by using data science algorithms.
This is the reason why digital ads have been able to get
a much higher CTR (Click-Through Rate) than traditional
advertisements: they can be targeted based on a user's
past behaviour.
Website Recommendations:
Aren’t we all used to the suggestions about similar
products on Amazon?
They not only help us find relevant products from
billions of products available with them but also add
a lot to the user experience.
A lot of companies have fervidly used this engine to
promote their products in accordance with the user’s
interest and relevance of information.
Internet giants like Amazon, Twitter, Google Play,
Netflix, LinkedIn, IMDB and many more use this
system to improve the user experience.
The recommendations are made based on previous
search results for a user.
Airline Route Planning:
The Airline Industry across the world is
known to bear heavy losses. Except for
a few airline service providers,
companies are struggling to maintain
their occupancy ratio and operating
profits. With the steep rise in air-fuel
prices and the need to offer heavy discounts
to customers, the situation has worsened.
It wasn't long before airline
companies started using Data Science
to identify the strategic areas of
improvements.
Now, while using Data Science, the
airline companies can:
• Predict flight delays
• Decide which class of airplanes to buy
• Decide whether to fly directly to the destination or take a
halt in between
(For example, A flight can have a direct route from New
Delhi to New York. Alternatively, it can also choose to halt
in any country.)
• Effectively drive customer loyalty programs
Getting Started

Data Science is a combination of Python and mathematical
concepts like Statistics, Data Analysis, Probability, etc. Concepts
of Data Science can be used in developing applications around AI,
as it gives a strong base for data analysis in Python.
Revisiting the AI Project Cycle
Humans are social animals. We tend to organise
and/or participate in various kinds of social
gatherings all the time. We love eating out with
friends and family, which is why we can find
restaurants almost everywhere; many of these
restaurants arrange buffets to offer a variety
of food items to their customers. Be it
small shops or big outlets, every restaurant
prepares food in bulk as they expect a good crowd
to come and enjoy their food. But in most cases,
after the day ends, a lot of food is left which
becomes unusable for the restaurant as they do not
wish to serve stale food to their customers the next
day.
So, every day, they prepare food in large
quantities keeping in mind the probable
number of customers walking into their
outlet.
But if the expectations are not met, a
good amount of food gets wasted which
eventually becomes a loss for the
restaurant as they either have to dump it
or give it to hungry people for free.
And if this daily loss is taken into account
for a year, it becomes quite a big amount.
Problem Scoping

Now that we have understood the scenario well, let us take a
deeper look into the problem to find out more about the various
factors around it.

Let us fill up the 4Ws problem canvas to find out.


Data Acquisition:

After finalising the goal of our project, let us now move
towards looking at the various data features which affect the
problem in some way or the other.

Since any AI-based project requires data for testing and
training, we need to understand what kind of data is to be
collected to work towards the goal.
In our scenario, the various factors that would affect the
quantity of food to be prepared for the next day's
consumption in buffets would be:
• Now let us understand how these factors are related to
our problem statement.

• For this, we can use the System Maps tool to figure out
the relationship of elements with the project’s goal.

• Here is the System map for our problem statement.


After looking at the factors affecting our problem statement, now it’s
time to take a look at the data which is to be acquired for the goal.

For this problem, a dataset covering all the elements mentioned
above is made for each dish prepared by the restaurant over a
period of 30 days.

This data is collected offline in the form of a regular survey since
this is a personalised dataset created just for one restaurant's
needs.
Specifically, the data collected comes
under the following categories:

• Name of the dish,


• Price of the dish,
• Quantity of dish produced per day,
• Quantity of dish left unconsumed
per day,
• Total number of customers per day,
• Fixed customers per day, etc.
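As a sketch, such survey records could be held in plain Python before any analysis; every dish name and number below is invented for illustration, not real restaurant data:

```python
# Hypothetical records matching the categories above; all values are invented.
records = [
    {"dish": "Paneer Tikka", "price": 250, "produced": 40,
     "unconsumed": 5, "total_customers": 120, "fixed_customers": 45},
    {"dish": "Dal Makhani", "price": 180, "produced": 60,
     "unconsumed": 12, "total_customers": 120, "fixed_customers": 45},
]

# Fraction of each dish left unconsumed on that day:
for r in records:
    print(r["dish"], r["unconsumed"] / r["produced"])
```

Thirty days of such records, one per dish, would form the dataset described above.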
Data Collection
• Data collection is nothing new; it has been part of our society for ages.
• Even when people did not have much knowledge of calculations, records were still maintained in some way or the other to keep an account of relevant things.
• Data collection is an exercise which does not require even a tiny bit of technological knowledge.
Data Collection
But when it comes to analysing the data, it becomes a tedious
process for humans as it is all about numbers and alpha-
numerical data.
That is where Data Science comes into the picture.
It not only gives us a clearer idea about the dataset, but also
adds value to it by providing deeper and clearer analyses of it.
And as AI gets incorporated in the process, predictions and
suggestions by the machine become possible on the same.
For data domain-based projects, the type of data used is mostly in
numerical or alpha-numerical format, and such datasets are curated
in the form of tables. Such databases are very commonly found in
any institution for record maintenance and other purposes.
Some examples of datasets which you must already be aware of are:
Sources of Data

There exist various sources of data from where we can collect any
type of data required, and the data collection process can be
categorised in two ways: Offline and Online.
Types of Data

For Data Science, the data is usually collected in the form of
tables. These tabular datasets can be stored in different formats.

Some of the commonly used formats are:
• CSV
• Spreadsheet
• SQL
1. CSV:
CSV stands for Comma Separated Values. It is a simple file format
used to store tabular data.

Each line of this file is a data record, and each record consists
of one or more fields which are separated by commas.

Since the values of the records are separated by commas, these
files are known as CSV files.
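A minimal sketch of reading such comma-separated records with Python's built-in csv module; the header and values below are made up for illustration:

```python
import csv
import io

# A small in-memory CSV: each line is one record, fields separated by commas.
raw = "Name,Price,Quantity\nDal Makhani,180,60\nVeg Biryani,220,50\n"

# DictReader uses the first line as the header and maps each record to a dict.
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)
print(rows[0]["Name"])   # Dal Makhani
print(len(rows))         # 2
```

In practice the same code works on a file opened with `open("data.csv")` in place of the in-memory string.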
2. Spreadsheet:
A Spreadsheet is a piece of paper or a computer program
which is used for accounting and recording data using
rows and columns into which information can be entered.
Microsoft Excel is a program which helps in creating
spreadsheets.
3. SQL:
SQL stands for Structured Query Language.
It is a domain-specific language used in programming, designed
for managing data held in different kinds of DBMS (Database
Management Systems).
It is particularly useful in handling structured data.
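A minimal sketch of an SQL query, run here through Python's built-in sqlite3 module; the table and values are invented for illustration:

```python
import sqlite3

# An in-memory database so the example needs no file on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dishes (name TEXT, price INTEGER)")
cur.executemany("INSERT INTO dishes VALUES (?, ?)",
                [("Dal Makhani", 180), ("Veg Biryani", 220)])

# A query written in Structured Query Language: dishes priced above 200.
cur.execute("SELECT name FROM dishes WHERE price > 200")
result = cur.fetchall()
print(result)  # [('Veg Biryani',)]
conn.close()
```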
Data Access

After collecting the data, to be able to use it for programming
purposes, we should know how to access it in a Python code.

To make our lives easier, there exist various Python packages
which help us access structured data (in tabular form) inside
the code.

Let us take a look at some of these packages:

NumPy
NumPy, which stands for Numerical Python, is the fundamental
package for Mathematical and logical operations on arrays in
Python.
It is a commonly used package when it comes to working around
numbers.
NumPy also works with arrays, which are homogeneous collections
of data.
An array is a set of multiple values of the same datatype.
They can be numbers, characters, booleans, etc., but an array can
hold only one datatype at a time. In NumPy, the arrays used are
known as ND-arrays (N-Dimensional Arrays), as NumPy comes with
the feature of creating n-dimensional arrays in Python.
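The idea of ND-arrays can be sketched with a 1-D and a 2-D array; the values below are arbitrary:

```python
import numpy as np

# Every element of an array shares one datatype.
a = np.array([10, 20, 30, 40])        # 1-dimensional ND-array
b = np.array([[1, 2, 3], [4, 5, 6]])  # 2-dimensional ND-array

print(a.ndim, b.ndim)   # 1 2
print(a * 2)            # elementwise maths: [20 40 60 80]
```

Mathematical operations like `a * 2` apply to every element at once, which is what makes NumPy convenient for working with numbers.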
Pandas
• Pandas is a software library written for the Python
programming language for data manipulation and analysis.
• In particular, it offers data structures and operations for
manipulating numerical tables and time series.
• The name is derived from the term "panel data", an
econometrics term for data sets that include observations over
multiple time periods for the same individuals.
The two primary data structures of Pandas, Series (1-dimensional)
and DataFrame (2-dimensional), handle the vast majority of typical
use cases in finance, statistics, social science, and many areas
of engineering.

Pandas is built on top of NumPy and is intended to integrate well
within a scientific computing environment with many other
3rd-party libraries.
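A brief sketch of the two structures, with invented restaurant-style column names and values:

```python
import pandas as pd

# A Series is 1-dimensional; a DataFrame is a 2-dimensional table.
prices = pd.Series([250, 180, 220], name="price")
df = pd.DataFrame({
    "dish": ["Paneer Tikka", "Dal Makhani", "Veg Biryani"],
    "price": [250, 180, 220],
    "left_over": [5, 12, 8],
})

print(round(prices.mean(), 2))                   # 216.67
print(df[df["left_over"] > 6]["dish"].tolist())  # ['Dal Makhani', 'Veg Biryani']
```

The last line filters the table to the dishes with more than 6 leftover portions, the kind of question our restaurant scenario asks of its dataset.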
Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots
of arrays.
Matplotlib is a multiplatform data visualization library built on
NumPy arrays.
One of the greatest benefits of visualization is that it allows us
visual access to huge amounts of data in easily digestible visuals.
Matplotlib comes with a wide variety of plots.
Plots help us understand trends and patterns, and to make
correlations.
They’re typically instruments for reasoning about quantitative
information.
Some types of graphs that we can make with this package are
line plots, bar graphs, histograms, scatter plots, and pie charts.
Basic Statistics with Python
We have already understood that Data Science works around
analysing data and performing tasks on it.

For analysing the numeric & alpha-numeric data used in this
domain, mathematics comes to our rescue. Basic statistical
methods from mathematics come in handy for analysing and
working with such datasets.
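These basic measures are available in Python's built-in statistics module; the customer counts below are invented for illustration:

```python
import statistics

# Invented sample of daily customer counts at the restaurant.
customers = [110, 125, 98, 140, 132, 110]

print(statistics.mean(customers))    # average, approximately 119.17
print(statistics.median(customers))  # middle value: 117.5
print(statistics.mode(customers))    # most frequent value: 110
```

Mean, median, and mode each summarise the dataset in one number, which is often the first step before deeper analysis.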
