REVISION NOTES
Data Science is a multidisciplinary field that integrates statistics, data analysis, machine learning,
and related techniques to analyze real-world data. It extracts insights and trends to make informed
decisions, enhancing the ability of machines to solve problems or perform tasks autonomously. The
core components of data science involve:
Mathematics: Statistical models, probability theory, and algebra help understand and predict
data patterns.
Statistics: Crucial for data summarization and analysis, providing tools for hypothesis testing,
regression analysis, etc.
Computer Science: Implements algorithms to process and analyze large datasets efficiently.
Information Science: Deals with the management, retrieval, and storage of data.
Data science is a field that applies principles, concepts, technologies, and tools to analyze complex
sets of data in order to derive valuable knowledge. It draws on the methods and models of statistics,
the algorithmic and programming tools of computer science, and the machine learning approaches of
the field of artificial intelligence.
Data Science is essential in different AI fields, each focusing on specific data types:
1. Data Science: Works with numeric and alpha-numeric data, essential for statistical analysis
and machine learning models.
Example: A dataset containing sales figures, customer ages, and product prices for
predictive modeling.
2. Computer Vision (CV): Deals with image and visual data to enable machines to understand
and interpret visual information.
Example: A self-driving car’s camera processing traffic signals and obstacles.
3. Natural Language Processing (NLP): Focuses on textual and speech-based data, helping
machines understand and interact with human language.
Example: Voice assistants like Siri and Alexa using NLP to understand spoken
commands.
Data Science has revolutionized industries by providing insights and driving decision-making in
many domains. Some notable applications include:
In the early days, financial institutions struggled with defaults and losses due to bad debts. They had
extensive data about their customers' financial history but needed effective tools to leverage this
information. Data Science algorithms helped analyze this history to estimate each customer's credit risk.
Example: Banks now analyze transaction patterns and spending behavior to assess a loan applicant’s
risk level and offer customized banking products.
Data Science has transformed the digital marketing landscape by enabling targeted advertisements.
Based on user data, algorithms predict which products and advertisements each user is most likely to
respond to.
Companies like Amazon, Netflix, and YouTube rely heavily on recommendation engines powered by
Data Science. These systems analyze user behavior to suggest relevant content and products.
Example: Netflix’s recommendation engine suggests new shows based on what users have
previously watched, increasing user engagement and satisfaction.
Example: In the airline industry, using past data to predict the best flight routes, minimizing fuel
costs, and improving customer satisfaction.
This section presents a Data Science project example aimed at reducing food waste
in buffet restaurants. The challenge is that restaurants often overestimate the amount of food needed,
leading to waste and financial losses.
Problem Scoping:
Goal: To develop a predictive model that estimates the quantity of food to prepare daily.
Data Required: Datasets related to daily customer numbers, dish prices, quantity prepared,
and unconsumed food over a period of 30 days.
Steps Involved:
1. Data Collection: Collect data on the number of customers, types of dishes, food quantities
prepared, and leftovers.
2. Data Exploration: Clean and preprocess the data to ensure accuracy, for example by removing
missing values and outliers. The required information is extracted from the curated dataset and
cleaned so that no errors or missing elements remain in it.
3. Modeling: Train a regression model on the 30 days of data to predict the amount of food to
prepare based on historical consumption patterns. Here the 30-day dataset is divided in a 2:1
ratio for training and testing respectively: the model is first trained on 20 days of data and then
evaluated on the remaining 10 days (see the sketch after this list).
4. Evaluation: Test the model’s accuracy by comparing its predictions with actual food
consumption.
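A minimal sketch of the modeling and evaluation steps, assuming scikit-learn is available. The file
name and column names (customers, quantity_prepared, unconsumed) are assumptions for
illustration, not part of the original project:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical 30-day dataset; file and column names are assumed for illustration
data = pd.read_csv("buffet_30_days.csv")

# Feature: daily customer count. Target: food actually consumed.
X = data[["customers"]]
y = data["quantity_prepared"] - data["unconsumed"]

# 2:1 split -- first 20 days for training, last 10 days for testing
X_train, X_test = X[:20], X[20:]
y_train, y_test = y[:20], y[20:]

model = LinearRegression().fit(X_train, y_train)

# Evaluation: compare predictions with actual consumption on the 10 test days
print("R^2 on the 10 test days:", model.score(X_test, y_test))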
Various tools and programming libraries are essential in Data Science, helping analysts and
developers process, analyze, and visualize data.
Python Libraries:
NumPy
NumPy, which stands for Numerical Python, is the fundamental package for mathematical and
logical operations on arrays in Python. It is a commonly used package when it comes to working
with numbers, and it offers a wide range of arithmetic operations that make numerical work easier.
NumPy also works with arrays, which are homogeneous collections of data.
An array is a set of multiple values of the same datatype. The values can be numbers, characters,
booleans, etc., but a single array can hold only one datatype. In NumPy, the arrays used are known
as ND-arrays (N-Dimensional Arrays), since NumPy can create n-dimensional arrays in Python.
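As a small illustration (the values here are arbitrary), a 1-D and a 2-D ND-array:
import numpy as np

a = np.array([1, 2, 3])               # 1-D array
b = np.array([[1, 2, 3], [4, 5, 6]])  # 2-D array: 2 rows, 3 columns

print(a.ndim, a.shape)  # 1 (3,)
print(b.ndim, b.shape)  # 2 (2, 3)

# Arrays are homogeneous: mixing types promotes everything to one datatype
c = np.array([1, 2.5, 3])
print(c.dtype)          # float64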
An array can easily be compared to a list. Let us take a look at how they are different:
How NumPy arrays and lists compare:
1. A NumPy array is a homogeneous collection of data; a list is a heterogeneous collection of data.
2. An array can contain only one type of data, so it is not flexible with datatypes; a list can contain
multiple types of data, so it is flexible with datatypes.
3. An array cannot be directly initialized and can be created only through the NumPy package; a list
can be directly initialized, as it is part of Python syntax.
4. Direct numerical operations can be done on an array: for example, dividing the whole array by 3
divides every element by 3. Direct numerical operations are not possible on a list: dividing a whole
list by 3 does not divide every element by 3.
5. Arrays are widely used for arithmetic operations; lists are widely used for data management.
6. Arrays take less memory space; lists acquire more memory space.
7. Operations like concatenation, appending, and reshaping go through NumPy functions and
produce new arrays; on lists, operations like concatenation and appending are built into the type
itself.
8. Example: to create a NumPy array A:
import numpy
A = numpy.array([1,2,3,4,5,6,7,8,9,0])
To create a list A:
A = [1,2,3,4,5,6,7,8,9,0]
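A short sketch of point 4 above, dividing an array and a list by 3:
import numpy as np

A = np.array([3, 6, 9, 12])
L = [3, 6, 9, 12]

print(A / 3)                # [1. 2. 3. 4.] -- every element divided by 3
# L / 3 would raise a TypeError: lists have no element-wise division
print([x / 3 for x in L])   # a list needs an explicit loop or comprehension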
Pandas
The two primary data structures of Pandas, Series (1-dimensional) and DataFrame (2-dimensional),
handle the vast majority of typical use cases in finance, statistics, social science, and many areas of
engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific
computing environment with many other 3rd party libraries.
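A minimal sketch of the two structures; the labels and values are made up for illustration:
import pandas as pd

# Series: a 1-dimensional labelled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])              # 20

# DataFrame: a 2-dimensional table of labelled rows and columns
df = pd.DataFrame({
    "customers": [120, 95, 140],
    "sales": [2400.0, 1900.5, 2800.0],
})
print(df["sales"].mean())  # average of the sales column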
Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. It is a multi-platform
data visualization library built on NumPy arrays. One of the greatest benefits of visualization is that
it gives us visual access to huge amounts of data in easily digestible form.
Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns and spot
correlations; they are typically instruments for reasoning about quantitative information. Some of the
plots we can make with this package are line charts, bar graphs, scatter plots, histograms, and pie
charts.
Not just plotting: you can also modify your plots the way you wish, styling them to make them more
descriptive and communicable.
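A small sketch of a styled 2D plot; the data is invented for illustration:
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
customers = [110, 95, 130, 120, 150]

# Style the plot: color, markers, line style, labels, and a title
plt.plot(days, customers, color="green", marker="o", linestyle="--")
plt.title("Daily Customers")
plt.xlabel("Day")
plt.ylabel("Number of customers")
plt.show()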
These packages help us in accessing the datasets we have and also in exploring them to develop a
better understanding of them.
Basic statistics are fundamental to Data Science, providing tools to summarize and analyze data:
1. Mean: The average value of a dataset, calculated by summing all values and dividing by the
number of values.
2. Median: The middle value of a sorted dataset, which is less sensitive to outliers than the
mean.
3. Mode: The most frequently occurring value in the dataset.
4. Standard Deviation: Measures how spread out the values are around the mean. A low
standard deviation means values are close to the mean; a high standard deviation means they
are spread out.
5. Variance: The square of the standard deviation, showing the variability of the data.
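All five measures can be computed with Python's built-in statistics module, for example:
import statistics

data = [4, 8, 8, 5, 10]

print(statistics.mean(data))       # 7 -- sum (35) divided by count (5)
print(statistics.median(data))     # 8 -- middle of the sorted data [4, 5, 8, 8, 10]
print(statistics.mode(data))       # 8 -- most frequent value
print(statistics.pstdev(data))     # ~2.19 -- spread of values around the mean
print(statistics.pvariance(data))  # 4.8 -- the square of the standard deviation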
Data visualization is critical for interpreting large datasets. Some common visualizations include:
1. Scatter Plots: Used for plotting discontinuous (discrete) data points, often showing the
relationship between two variables (X and Y axes). Additional parameters can be represented
by the color and size of the points.
Example: Plotting customer age vs purchase amount with points representing different
product categories.
2. Bar Charts: Simple yet effective for visualizing categorical data, where each bar represents a
different category.
Example: Comparing male and female participation in a survey.
3. Histograms: Show the frequency distribution of a continuous dataset, often used to display
data ranges.
Example: Plotting the distribution of customer ages at a retail store.
4. Box Plots: Display the distribution of data across quartiles and highlight outliers, making it
useful for identifying skewness.
Example: Analyzing salary ranges in a company and spotting outliers.
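Each of these visualizations maps to a single Matplotlib call; a sketch with made-up data:
import matplotlib.pyplot as plt

ages = [22, 25, 25, 31, 34, 34, 34, 41, 48, 62]
purchases = [120, 90, 150, 200, 170, 160, 180, 220, 90, 300]

fig, axes = plt.subplots(2, 2)
axes[0, 0].scatter(ages, purchases)           # scatter plot: age vs purchase amount
axes[0, 1].bar(["Male", "Female"], [58, 42])  # bar chart: categorical comparison
axes[1, 0].hist(ages, bins=5)                 # histogram: age distribution
axes[1, 1].boxplot(purchases)                 # box plot: quartiles and outliers
plt.tight_layout()
plt.show()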
K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and
regression. It predicts outcomes by finding the 'K' nearest data points (neighbors) to a given point
and basing the prediction on the majority class (for classification) or the average value (for
regression) of those neighbors.
Suppose you want to predict if a fruit is sweet or not, based on the surrounding data points (known
fruits).
K=1: The closest point to the unknown fruit is used to predict sweetness.
K=3: The three nearest neighbors are considered, and if two are sweet and one is not, the
model predicts the fruit is sweet.
The algorithm works on the principle that similar data points exist near each other.
KNN tries to predict an unknown value on the basis of the known values. The model simply
calculates the distance between the unknown point and every known point (by distance we mean the
difference between the two values), takes the K points whose distance is minimum, and makes the
prediction according to them. A code sketch appears after the observations below.
1. As we decrease the value of K to 1, our predictions become less stable. Imagine K=1 and a point X
surrounded by several greens and one blue, where the blue happens to be the single nearest neighbor.
Reasonably, we would think X is most likely green, but because K=1, KNN incorrectly predicts that
it is blue.
2. Inversely, as we increase the value of K, our predictions become more stable due to majority voting
/ averaging, and are thus more likely to be accurate (up to a certain point).
3. In cases where we are taking a majority vote (e.g. picking the mode in a classification problem)
among labels, we usually make K an odd number to have a tiebreaker.
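A from-scratch sketch of this procedure; the fruit features (size and color score) and labels are
invented for illustration:
import numpy as np

# Known fruits: (size, color score); label 1 = sweet, 0 = not sweet
X_known = np.array([[7.0, 0.9], [6.5, 0.8], [3.0, 0.2], [2.5, 0.3], [6.8, 0.7]])
y_known = np.array([1, 1, 0, 0, 1])

def knn_predict(x_new, X, y, k=3):
    # Distance from the unknown point to every known point
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))
    # Indices of the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Majority vote among the K nearest labels (an odd K avoids ties)
    return int(round(y[nearest].mean()))

# Predict whether an unknown fruit of size 6.0 and color score 0.75 is sweet
print(knn_predict(np.array([6.0, 0.75]), X_known, y_known, k=3))  # 1 -> sweet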