REVISION NOTES
Data Science is a multidisciplinary field that integrates statistics, data analysis, machine learning,
and related techniques to analyze real-world data. It extracts insights and trends to make informed
decisions, enhancing the ability of machines to solve problems or perform tasks autonomously. The
core components of data science involve:
Mathematics: Statistical models, probability theory, and algebra help understand and predict
data patterns.
Statistics: Crucial for data summarization and analysis, providing tools for hypothesis testing,
regression analysis, etc.
Computer Science: Implements algorithms to process and analyze large datasets efficiently.
Information Science: Deals with the management, retrieval, and storage of data.
Data science is a field that applies principles, concepts, technologies, and tools to analyze complex
sets of data in order to derive valuable knowledge. It draws on the methods and models of statistics,
the algorithmic and programming tools of computer science, and the machine learning approaches of
the field of artificial intelligence.
Data Science is essential in different AI fields, each focusing on specific data types:
1. Data Science: Works with numeric and alpha-numeric data, essential for statistical analysis
and machine learning models.
Example: A dataset containing sales figures, customer ages, and product prices for
predictive modeling.
2. Computer Vision (CV): Deals with image and visual data to enable machines to understand
and interpret visual information.
Example: A self-driving car’s camera processing traffic signals and obstacles.
3. Natural Language Processing (NLP): Focuses on textual and speech-based data, helping
machines understand and interact with human language.
Example: Voice assistants like Siri and Alexa using NLP to understand spoken
commands.
Data Science has revolutionized industries by providing insights and driving decision-making in
many domains. Some notable applications include:
In the early days, financial institutions struggled with defaults and losses due to bad debts. They had
extensive data about their customers' financial history but needed effective tools to leverage this
information. Data Science algorithms helped analyze this history to estimate each customer's credit risk.
Example: Banks now analyze transaction patterns and spending behavior to assess a loan applicant’s
risk level and offer customized banking products.
Data Science has transformed the digital marketing landscape by enabling targeted advertisements.
Based on user data, algorithms predict which products and advertisements each user is most likely to
respond to.
Companies like Amazon, Netflix, and YouTube rely heavily on recommendation engines powered by
Data Science. These systems analyze user behavior to suggest relevant content and products.
Example: Netflix’s recommendation engine suggests new shows based on what users have
previously watched, increasing user engagement and satisfaction.
Example: In the airline industry, using past data to predict the best flight routes, minimizing fuel
costs, and improving customer satisfaction.
This section presents a Data Science project example aimed at reducing food waste
in buffet restaurants. The challenge is that restaurants often overestimate the amount of food needed,
leading to waste and financial losses.
Problem Scoping:
Goal: To develop a predictive model that estimates the quantity of food to prepare daily.
Data Required: Datasets related to daily customer numbers, dish prices, quantity prepared,
and unconsumed food over a period of 30 days.
Steps Involved:
1. Data Collection: Collect data on the number of customers, types of dishes, food quantities
prepared, and leftovers.
2. Data Exploration: Clean and preprocess the data to ensure accuracy, for example by removing
missing values and outliers. The required information is extracted from the curated dataset and
cleaned so that no errors or missing elements remain in it.
3. Modeling: Train a regression model on the 30 days of data to predict the amount of food to
prepare based on historical consumption patterns. Here the 30-day dataset is divided in a 2:1
ratio for training and testing respectively: the model is first trained on 20 days of data and then
evaluated on the remaining 10 days (see the sketch after this list).
4. Evaluation: Test the model’s accuracy by comparing its predictions with actual food
consumption.
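A minimal sketch of the modeling and evaluation steps, assuming scikit-learn is available. The file
name and column names (customers, quantity_prepared, unconsumed) are assumptions for
illustration, not part of the original project:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical 30-day dataset; file and column names are assumed for illustration
data = pd.read_csv("buffet_30_days.csv")

# Feature: daily customer count. Target: food actually consumed.
X = data[["customers"]]
y = data["quantity_prepared"] - data["unconsumed"]

# 2:1 split -- first 20 days for training, last 10 days for testing
X_train, X_test = X[:20], X[20:]
y_train, y_test = y[:20], y[20:]

model = LinearRegression().fit(X_train, y_train)

# Evaluation: compare predictions with actual consumption on the 10 test days
print("R^2 on the 10 test days:", model.score(X_test, y_test))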
Various tools and programming libraries are essential in Data Science, helping analysts and
developers process, analyze, and visualize data.
Python Libraries:
NumPy
NumPy, which stands for Numerical Python, is the fundamental package for mathematical and
logical operations on arrays in Python. It is a commonly used package when it comes to working
with numbers, and it offers a wide range of arithmetic operations that make numerical work easier.
NumPy also works with arrays, which are homogeneous collections of data.
An array is a set of multiple values of the same datatype. The values can be numbers, characters,
booleans, etc., but a single array can hold only one datatype. In NumPy, the arrays used are known
as ND-arrays (N-Dimensional Arrays), since NumPy can create n-dimensional arrays in Python.
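As a small illustration (the values here are arbitrary), a 1-D and a 2-D ND-array:
import numpy as np

a = np.array([1, 2, 3])               # 1-D array
b = np.array([[1, 2, 3], [4, 5, 6]])  # 2-D array: 2 rows, 3 columns

print(a.ndim, a.shape)  # 1 (3,)
print(b.ndim, b.shape)  # 2 (2, 3)

# Arrays are homogeneous: mixing types promotes everything to one datatype
c = np.array([1, 2.5, 3])
print(c.dtype)          # float64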
An array can easily be compared to a list. Let us take a look at how they are different:
How NumPy arrays and lists compare:
1. A NumPy array is a homogeneous collection of data; a list is a heterogeneous collection of data.
2. An array can contain only one type of data, so it is not flexible with datatypes; a list can contain
multiple types of data, so it is flexible with datatypes.
3. An array cannot be directly initialized and can be created only through the NumPy package; a list
can be directly initialized, as it is part of Python syntax.
4. Direct numerical operations can be done on an array: for example, dividing the whole array by 3
divides every element by 3. Direct numerical operations are not possible on a list: dividing a whole
list by 3 does not divide every element by 3.
5. Arrays are widely used for arithmetic operations; lists are widely used for data management.
6. Arrays take less memory space; lists acquire more memory space.
7. Operations like concatenation, appending, and reshaping go through NumPy functions and
produce new arrays; on lists, operations like concatenation and appending are built into the type
itself.
8. Example: to create a NumPy array A:
import numpy
A = numpy.array([1,2,3,4,5,6,7,8,9,0])
To create a list A:
A = [1,2,3,4,5,6,7,8,9,0]
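A short sketch of point 4 above, dividing an array and a list by 3:
import numpy as np

A = np.array([3, 6, 9, 12])
L = [3, 6, 9, 12]

print(A / 3)                # [1. 2. 3. 4.] -- every element divided by 3
# L / 3 would raise a TypeError: lists have no element-wise division
print([x / 3 for x in L])   # a list needs an explicit loop or comprehension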
Pandas
The two primary data structures of Pandas, Series (1-dimensional) and DataFrame (2-dimensional),
handle the vast majority of typical use cases in finance, statistics, social science, and many areas of
engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific
computing environment with many other 3rd party libraries.
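A minimal sketch of the two structures; the labels and values are made up for illustration:
import pandas as pd

# Series: a 1-dimensional labelled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])              # 20

# DataFrame: a 2-dimensional table of labelled rows and columns
df = pd.DataFrame({
    "customers": [120, 95, 140],
    "sales": [2400.0, 1900.5, 2800.0],
})
print(df["sales"].mean())  # average of the sales column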
Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. It is a multi-platform
data visualization library built on NumPy arrays. One of the greatest benefits of visualization is that
it gives us visual access to huge amounts of data in easily digestible form.
Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns and spot
correlations; they are typically instruments for reasoning about quantitative information. Some of the
plots we can make with this package are line charts, bar graphs, scatter plots, histograms, and pie
charts.
Not just plotting: you can also modify your plots the way you wish, styling them to make them more
descriptive and communicable.
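A small sketch of a styled 2D plot; the data is invented for illustration:
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
customers = [110, 95, 130, 120, 150]

# Style the plot: color, markers, line style, labels, and a title
plt.plot(days, customers, color="green", marker="o", linestyle="--")
plt.title("Daily Customers")
plt.xlabel("Day")
plt.ylabel("Number of customers")
plt.show()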
These packages help us in accessing the datasets we have and also in exploring them to develop a
better understanding of them.
Basic statistics are fundamental to Data Science, providing tools to summarize and analyze data:
1. Mean: The average value of a dataset, calculated by summing all values and dividing by the
number of values.
2. Median: The middle value of a sorted dataset, which is less sensitive to outliers than the
mean.
3. Mode: The most frequently occurring value in the dataset.
4. Standard Deviation: Measures how spread out the values are around the mean. A low
standard deviation means values are close to the mean; a high standard deviation means they
are spread out.
5. Variance: The square of the standard deviation, showing the variability of the data.
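All five measures can be computed with Python's built-in statistics module, for example:
import statistics

data = [4, 8, 8, 5, 10]

print(statistics.mean(data))       # 7 -- sum (35) divided by count (5)
print(statistics.median(data))     # 8 -- middle of the sorted data [4, 5, 8, 8, 10]
print(statistics.mode(data))       # 8 -- most frequent value
print(statistics.pstdev(data))     # ~2.19 -- spread of values around the mean
print(statistics.pvariance(data))  # 4.8 -- the square of the standard deviation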
Data visualization is critical for interpreting large datasets. Some common visualizations include:
1. Scatter Plots: Used for plotting discontinuous (discrete) data points, often showing the
relationship between two variables (X and Y axes). Additional parameters can be represented
by the color and size of the points.
Example: Plotting customer age vs purchase amount with points representing different
product categories.
2. Bar Charts: Simple yet effective for visualizing categorical data, where each bar represents a
different category.
Example: Comparing male and female participation in a survey.
3. Histograms: Show the frequency distribution of a continuous dataset, often used to display
data ranges.
Example: Plotting the distribution of customer ages at a retail store.
4. Box Plots: Display the distribution of data across quartiles and highlight outliers, making it
useful for identifying skewness.
Example: Analyzing salary ranges in a company and spotting outliers.
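Each of these visualizations maps to a single Matplotlib call; a sketch with made-up data:
import matplotlib.pyplot as plt

ages = [22, 25, 25, 31, 34, 34, 34, 41, 48, 62]
purchases = [120, 90, 150, 200, 170, 160, 180, 220, 90, 300]

fig, axes = plt.subplots(2, 2)
axes[0, 0].scatter(ages, purchases)           # scatter plot: age vs purchase amount
axes[0, 1].bar(["Male", "Female"], [58, 42])  # bar chart: categorical comparison
axes[1, 0].hist(ages, bins=5)                 # histogram: age distribution
axes[1, 1].boxplot(purchases)                 # box plot: quartiles and outliers
plt.tight_layout()
plt.show()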
K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and
regression. It predicts outcomes by finding the 'K' nearest data points (neighbors) to a given point
and basing the prediction on the majority class (for classification) or the average value (for
regression) of those neighbors.
Suppose you want to predict if a fruit is sweet or not, based on the surrounding data points (known
fruits).
K=1: The closest point to the unknown fruit is used to predict sweetness.
K=3: The three nearest neighbors are considered, and if two are sweet and one is not, the
model predicts the fruit is sweet.
The algorithm works on the principle that similar data points exist near each other.
KNN tries to predict an unknown value on the basis of the known values. The model simply
calculates the distance between the unknown point and every known point (by distance we mean the
difference between the two values), takes the K points whose distance is minimum, and makes the
prediction according to them. A code sketch appears after the observations below.
1. As we decrease the value of K to 1, our predictions become less stable. Imagine K=1 and a point X
surrounded by several greens and one blue, where the blue happens to be the single nearest neighbor.
Reasonably, we would think X is most likely green, but because K=1, KNN incorrectly predicts that
it is blue.
2. Inversely, as we increase the value of K, our predictions become more stable due to majority voting
/ averaging, and are thus more likely to be accurate (up to a certain point).
3. In cases where we are taking a majority vote (e.g. picking the mode in a classification problem)
among labels, we usually make K an odd number to have a tiebreaker.
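A from-scratch sketch of this procedure; the fruit features (size and color score) and labels are
invented for illustration:
import numpy as np

# Known fruits: (size, color score); label 1 = sweet, 0 = not sweet
X_known = np.array([[7.0, 0.9], [6.5, 0.8], [3.0, 0.2], [2.5, 0.3], [6.8, 0.7]])
y_known = np.array([1, 1, 0, 0, 1])

def knn_predict(x_new, X, y, k=3):
    # Distance from the unknown point to every known point
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))
    # Indices of the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Majority vote among the K nearest labels (an odd K avoids ties)
    return int(round(y[nearest].mean()))

# Predict whether an unknown fruit of size 6.0 and color score 0.75 is sweet
print(knn_predict(np.array([6.0, 0.75]), X_known, y_known, k=3))  # 1 -> sweet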