
DATA SCIENCE (AI)

REVISION NOTES

Data Science is a multidisciplinary field that integrates statistics, data analysis, machine learning,
and related techniques to analyze real-world data. It extracts insights and trends to make informed
decisions, enhancing the ability of machines to solve problems or perform tasks autonomously. The
core components of data science involve:

 Mathematics: Statistical models, probability theory, and algebra help understand and predict
data patterns.
 Statistics: Crucial for data summarization and analysis, providing tools for hypothesis testing,
regression analysis, etc.
 Computer Science: Implements algorithms to process and analyze large datasets efficiently.
 Information Science: Deals with the management, retrieval, and storage of data.


Data science is a field that applies principles, concepts, technologies, and tools to analyse complex sets of data in order to derive valuable knowledge. It draws on the methods and models of statistics, the algorithmic and programming tools of computer science, and the machine learning approaches of artificial intelligence.

Core Domains of AI in Data Science:

Data Science is essential in different AI fields, each focusing on specific data types:

1. Data Science: Works with numeric and alpha-numeric data, essential for statistical analysis
and machine learning models.
 Example: A dataset containing sales figures, customer ages, and product prices for
predictive modeling.
2. Computer Vision (CV): Deals with image and visual data to enable machines to understand
and interpret visual information.
 Example: A self-driving car’s camera processing traffic signals and obstacles.
3. Natural Language Processing (NLP): Focuses on textual and speech-based data, helping
machines understand and interact with human language.
 Example: Voice assistants like Siri and Alexa using NLP to understand spoken
commands.

Prepared by: M. S. KumarSwamy, TGT(Maths) Page - 1 -


Applications of Data Science:

Data Science has revolutionized industries by providing insights and driving decision-making in
many domains. Some notable applications include:

1. Fraud and Risk Detection (Finance):

In the early days, financial institutions struggled with defaults and losses due to bad debts. They had
extensive data about their customers’ financial history but needed effective tools to leverage this
information. Data Science algorithms helped analyze:

 Customer profiling: Identifying high-risk customers based on their past behavior.
 Predicting defaults: Using statistical models to predict which customers might default on loans based on historical data.

Example: Banks now analyze transaction patterns and spending behavior to assess a loan applicant’s
risk level and offer customized banking products.

2. Genetics and Genomics (Healthcare):

Data Science plays a significant role in understanding genetic data and its impact on health. By combining genomics with data analytics, researchers can:

 Personalize treatments based on an individual’s genetic makeup.
 Predict disease risk: Analyze the correlation between genetic variations and susceptibility to certain diseases.

Example: Using genetic data to predict how a patient will respond to a specific drug, leading to personalized medicine.

3. Internet Search Engines:

Search engines like Google use Data Science to handle vast amounts of data and deliver relevant results within seconds. Algorithms analyze:

 User queries: Match them with indexed web pages.
 Click behavior: Improve ranking algorithms based on how users interact with search results.

Example: Google processes over 20 petabytes of data daily. Without advanced data science techniques, it would not be able to deliver accurate results at the speed it does.

4. Targeted Advertising (Digital Marketing):

Data Science has transformed the digital marketing landscape by enabling targeted advertisements.
Based on user data, algorithms predict:



 User preferences: Ads are tailored based on browsing history and behavior.
 Ad effectiveness: Measure and improve the click-through rate (CTR) by targeting ads at users most likely to interact.

Example: Facebook and Instagram use past browsing behavior to serve ads relevant to the user’s interests, resulting in higher engagement.

5. Website Recommendation Engines:

Companies like Amazon, Netflix, and YouTube rely heavily on recommendation engines powered by
Data Science. These systems analyze user behavior to suggest:

 Products on e-commerce sites based on browsing and purchase history.
 Movies or shows on streaming platforms based on watch history and ratings.

Example: Netflix’s recommendation engine suggests new shows based on what users have
previously watched, increasing user engagement and satisfaction.

6. Airline Route Planning:

Airlines face significant operational challenges, such as flight delays and optimizing routes for fuel efficiency. Data Science helps them by:

 Predicting flight delays based on historical data (e.g., weather, traffic).
 Route optimization: Choosing whether to fly direct or via layovers to maximize efficiency.

Example: Using past data to predict the best flight routes, minimizing fuel costs, and improving
customer satisfaction.

Case Study: Predicting Food Waste in Restaurants

This section of the document presents a Data Science project example aimed at reducing food waste
in buffet restaurants. The challenge is that restaurants often overestimate the amount of food needed,
leading to waste and financial losses.

Problem Scoping:

1. Who: The primary stakeholders are restaurant owners and chefs.
2. What: The problem is that food is often left unconsumed at the end of the day, leading to waste.
3. Where: Buffet-style restaurants where food is prepared in bulk.
4. Why: If restaurants could better predict customer turnout, they could prepare the right amount
of food, reducing waste.



Proposed Solution:

 Goal: To develop a predictive model that estimates the quantity of food to prepare daily.
 Data Required: Datasets related to daily customer numbers, dish prices, quantity prepared,
and unconsumed food over a period of 30 days.

Steps Involved:

1. Data Collection: Collect data on the number of customers, types of dishes, food quantities
prepared, and leftovers.
2. Data Exploration: Clean and preprocess the data to ensure accuracy, removing missing values and outliers, so that the curated dataset is free of errors and gaps.
3. Modeling: Train a regression model on the 30 days of data to predict the amount of food to prepare from historical consumption patterns. Here the dataset is split in a 2:1 ratio: the model is first trained on 20 days of data and then evaluated on the remaining 10 days.
4. Evaluation: Test the model’s accuracy by comparing its predictions with actual food
consumption.
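The split-train-evaluate workflow above can be sketched in plain Python. The 30-day dataset below is synthetic, and the 0.5 kg-per-customer relationship is an invented assumption for illustration; a real project would load the restaurant's actual records.

```python
import random

random.seed(1)

# Synthetic 30-day dataset: (customers, kg of food consumed) pairs.
# The underlying 0.5 kg-per-customer rule is an invented assumption.
days = [(c, 0.5 * c + random.gauss(0, 2))
        for c in [random.randint(50, 150) for _ in range(30)]]

# 2:1 split: train on the first 20 days, hold out the last 10 for testing.
train, test = days[:20], days[20:]

# Fit a least-squares line, food = a * customers + b, on the training days.
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# Evaluate on the 10 held-out days using mean absolute error (in kg).
mae = sum(abs((a * x + b) - y) for x, y in test) / len(test)
print(f"food = {a:.2f} * customers + {b:.2f}, MAE = {mae:.2f} kg")
```

The recovered slope lands close to the 0.5 used to generate the data, and the held-out error stays near the noise level, which is exactly what the evaluation step is checking for.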

Data Science Tools and Techniques:

Various tools and programming libraries are essential in Data Science, helping analysts and
developers process, analyze, and visualize data.

1. Data Collection Methods:

 Offline: Surveys, observations, and interviews conducted manually.
 Online: Data gathered from open-source websites (e.g., Kaggle) or government portals.


2. Data Storage Formats:



 CSV (Comma Separated Values): A simple file format used to store tabular data. Each line of the file is a data record, and each record consists of one or more fields separated by commas, which is why these are known as CSV files.
 Spreadsheet: A spreadsheet is a computer program (or, traditionally, a ruled sheet of paper) used for accounting and for recording data in rows and columns into which information can be entered. Microsoft Excel is a widely used program for creating spreadsheets.
 SQL: SQL (Structured Query Language) is a domain-specific programming language designed for managing data held in different kinds of DBMS (Database Management Systems). It is particularly useful for handling structured data.
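A short Python sketch can tie these storage formats together: write records to a CSV file, then load them into an SQL table and query it. The file, table, and column names below are made up for illustration; the standard-library csv and sqlite3 modules are used.

```python
import csv
import sqlite3

# Write a small table to a CSV file: each record becomes one comma-separated line.
rows = [["name", "marks"], ["Asha", "82"], ["Ravi", "67"], ["Meena", "91"]]
with open("students.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the records back, skipping the header row.
with open("students.csv") as f:
    records = list(csv.reader(f))[1:]

# Load them into an in-memory SQLite database and run a structured query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, marks INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)", records)
top = conn.execute(
    "SELECT name FROM students WHERE marks > 80 ORDER BY marks DESC"
).fetchall()
print(top)  # [('Meena',), ('Asha',)]
conn.close()
```

The same tabular data moves between all three representations: rows in a CSV file, rows in a spreadsheet-like Python list, and rows in an SQL table.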

3. Python Libraries:

 NumPy: For numerical computing and working with arrays.
 Pandas: For data manipulation and handling tabular datasets (e.g., DataFrames).
 Matplotlib: For data visualization, including plotting graphs like bar charts, histograms, and
scatter plots.

NumPy

NumPy, which stands for Numerical Python, is the fundamental package for mathematical and logical operations on arrays in Python. It is a commonly used package for working with numbers, offering a wide range of arithmetic operations that make numerical work easier. NumPy is built around arrays, which are homogeneous collections of data.

An array is a set of multiple values of the same datatype. The values can be numbers, characters, booleans, etc., but a single array can hold only one datatype. In NumPy, the arrays used are known as ND-arrays (N-Dimensional Arrays), as NumPy can create arrays of any number of dimensions in Python.

An array can easily be compared to a list. Let us look at how they differ:

1. A NumPy array is a homogeneous collection of data; a list is heterogeneous.
2. An array can contain only one type of data, so it is not flexible with datatypes; a list can contain multiple types.
3. An array cannot be initialized directly and can only be created through the NumPy package; a list is part of Python syntax and can be created directly.
4. Direct numerical operations work on arrays: dividing an array by 3 divides every element by 3. Dividing a list by 3 is an error; each element must be processed individually.
5. Arrays are widely used for arithmetic operations; lists are widely used for data management.
6. Arrays take less memory space; lists acquire more memory space.
7. Arrays have a fixed size, so operations like appending or concatenating create a new array; lists support appending and other in-place changes trivially.
8. Example: import numpy followed by A = numpy.array([1,2,3,4,5,6,7,8,9,0]) creates a NumPy array, while A = [1,2,3,4,5,6,7,8,9,0] creates a list.
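A minimal sketch shows the element-wise behaviour described above in practice (the array values are arbitrary):

```python
import numpy

A = numpy.array([3, 6, 9, 12, 15])   # homogeneous ND-array
print(A / 3)                          # element-wise: [1. 2. 3. 4. 5.]
print(A + 10)                         # element-wise: [13 16 19 22 25]
print(A.reshape(5, 1).shape)          # reshaped to a column: (5, 1)

L = [3, 6, 9, 12, 15]                 # a plain Python list
# L / 3 would raise a TypeError; lists need an explicit loop instead:
print([x / 3 for x in L])             # [1.0, 2.0, 3.0, 4.0, 5.0]
```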

Pandas



Pandas is a software library written for the Python programming language for data manipulation and
analysis. In particular, it offers data structures and operations for manipulating numerical tables and
time series. The name is derived from the term "panel data", an econometrics term for data sets that
include observations over multiple time periods for the same individuals.

Pandas is well suited for many different kinds of data:


• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels.
• Any other form of observational / statistical data set. The data need not be labelled at all to be placed into a Pandas data structure.

The two primary data structures of Pandas, Series (1-dimensional) and DataFrame (2-dimensional),
handle the vast majority of typical use cases in finance, statistics, social science, and many areas of
engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific
computing environment with many other 3rd party libraries.
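A minimal sketch of the two structures (the column names and values below are invented for illustration):

```python
import pandas as pd

# Series: 1-dimensional labelled data.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])                 # 20

# DataFrame: 2-dimensional table with labelled columns.
df = pd.DataFrame({"customer": ["Asha", "Ravi", "Meena"],
                   "spend": [250, 400, 150]})
print(df["spend"].mean())     # average spend across the three rows
print(df[df["spend"] > 200])  # only the rows where spend exceeds 200
```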

Matplotlib
Matplotlib is a powerful visualization library in Python for 2D plots of arrays. It is a multi-platform data visualization library built on NumPy arrays. One of the greatest benefits of visualization is that it gives us visual access to huge amounts of data in easily digestible form. Matplotlib offers a wide variety of plots, such as line plots, bar charts, histograms, scatter plots, and box plots. Plots help us understand trends and patterns and make correlations; they are typically instruments for reasoning about quantitative information.

Beyond basic plotting, you can also modify your plots as you wish, styling them to make them more descriptive and communicative.
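As a minimal sketch (the survey numbers are invented), a bar chart can be drawn, styled with a title and axis label, and saved to an image file:

```python
import matplotlib
matplotlib.use("Agg")            # draw off-screen; no display window needed
import matplotlib.pyplot as plt

categories = ["Male", "Female"]  # hypothetical survey categories
counts = [48, 52]

plt.bar(categories, counts)
plt.title("Survey Participation")   # styling: title and axis label
plt.ylabel("Number of respondents")
plt.savefig("participation.png")    # writes the chart as an image file
```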

These packages help us in accessing the datasets we have and also in exploring them to develop a
better understanding of them.

Statistics in Data Science (with Python)

Basic statistics are fundamental to Data Science, providing tools to summarize and analyze data:

1. Mean: The average value of a dataset, calculated by summing all values and dividing by the
number of values.
2. Median: The middle value of a sorted dataset, which is less sensitive to outliers than the
mean.
3. Mode: The most frequently occurring value in the dataset.
4. Standard Deviation: Measures how spread out the values are around the mean. A low
standard deviation means values are close to the mean; a high standard deviation means they
are spread out.
5. Variance: The square of the standard deviation, showing the variability of the data.
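All five measures can be computed with Python's built-in statistics module (the dataset below is invented for illustration):

```python
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 8]

print(statistics.mean(data))       # 5.5  (sum 55 divided by 10 values)
print(statistics.median(data))     # 5.5  (midpoint of the sorted data)
print(statistics.mode(data))       # 8    (occurs three times)
print(statistics.pstdev(data))     # ~2.54 (spread around the mean)
print(statistics.pvariance(data))  # 6.45 (square of the standard deviation)
```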



Data Visualization Techniques:

Data visualization is critical for interpreting large datasets. Some common visualizations include:

1. Scatter Plots: Plot individual data points, typically to show the relationship between two variables (X and Y axes). Additional parameters can be represented by the colour and size of the points.
 Example: Plotting customer age vs purchase amount with points representing different
product categories.
2. Bar Charts: Simple yet effective for visualizing categorical data, where each bar represents a
different category.
 Example: Comparing male and female participation in a survey.
3. Histograms: Show the frequency distribution of a continuous dataset, often used to display
data ranges.
 Example: Plotting the distribution of customer ages at a retail store.
4. Box Plots: Display the distribution of data across quartiles and highlight outliers, making it
useful for identifying skewness.
 Example: Analyzing salary ranges in a company and spotting outliers.
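Two of the visualizations above can be sketched side by side with Matplotlib (the customer ages and purchase amounts are randomly generated for illustration):

```python
import random

import matplotlib
matplotlib.use("Agg")            # off-screen rendering
import matplotlib.pyplot as plt

random.seed(0)
ages = [random.gauss(35, 10) for _ in range(200)]    # invented customer ages
spend = [3 * a + random.gauss(0, 20) for a in ages]  # invented purchase amounts

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(ages, bins=15)                  # frequency distribution of ages
ax1.set_title("Customer ages")
ax2.scatter(ages, spend, s=10)           # relationship between two variables
ax2.set_title("Age vs purchase amount")
fig.savefig("visuals.png")
```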

K-Nearest Neighbors (KNN) Algorithm

K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and regression. It predicts outcomes by finding the ‘K’ nearest data points (neighbors) to a given point and taking the majority class (for classification) or the average value (for regression) of those neighbors.

Example: Predicting Fruit Sweetness

Suppose you want to predict if a fruit is sweet or not, based on the surrounding data points (known
fruits).

 K=1: The closest point to the unknown fruit is used to predict sweetness.
 K=3: The three nearest neighbors are considered, and if two are sweet and one is not, the
model predicts the fruit is sweet.

The algorithm works on the principle that similar data points exist near each other.

KNN tries to predict an unknown value on the basis of known values. The model simply calculates the distance between each known point and the unknown point (by distance we mean the difference between two values), takes the K points whose distances are smallest, and makes its prediction from them.
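That procedure can be sketched directly in Python. The fruit data and sugar values below are invented, and distance is taken as the absolute difference between two values, as described above:

```python
from collections import Counter

def knn_predict(known, query, k=3):
    # Sort the labelled points by distance to the query and keep the k nearest.
    nearest = sorted(known, key=lambda point: abs(point[0] - query))[:k]
    # Majority vote among the k neighbours decides the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented data: (sugar content, label) pairs for known fruits.
fruits = [(2, "not sweet"), (3, "not sweet"),
          (8, "sweet"), (9, "sweet"), (10, "sweet")]

print(knn_predict(fruits, 7, k=3))    # neighbours 8, 9, 10 -> "sweet"
print(knn_predict(fruits, 2.5, k=3))  # neighbours 2, 3, 8 -> "not sweet"
```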

Let us understand the significance of the number of neighbours:

1. As we decrease the value of K to 1, our predictions become less stable. Just think for a minute: imagine K=1 and a query point X surrounded by several green points and one blue point, where the blue point happens to be the single nearest neighbour. Reasonably, we would think X is most likely green, but because K=1, KNN incorrectly predicts that it is blue.

2. Inversely, as we increase the value of K, our predictions become more stable due to majority voting
/ averaging, and thus, more likely to make more accurate predictions (up to a certain point).



Eventually, we begin to witness an increasing number of errors. It is at this point we know we have
pushed the value of K too far.

3. In cases where we are taking a majority vote (e.g. picking the mode in a classification problem)
among labels, we usually make K an odd number to have a tiebreaker.

