Data Science - Unit 1 MDM

The document outlines a course on Introduction to Data Science at Jawaharlal Nehru Engineering College, detailing its objectives, outcomes, and content structure. It emphasizes the importance of data science in various sectors, the role of Python in data analysis, and the significance of exploratory data analysis (EDA) techniques. The course aims to equip students with the necessary skills to analyze and interpret data effectively using statistical and machine learning concepts.


Jawaharlal Nehru Engineering College,

Chh. Sambhajinagar

Data Science

Sandip S. Kankal
Assistant Professor, CSE,
JNEC, Chh. Sambhajinagar
Course: MDM in Data Science
Semester III
• Course code: CSE21MDL201
• Course name: Introduction to Data Science
• Course category: MDM
• Credits: 2
• Teaching scheme: L-2hrs/week
• Evaluation scheme: CA–60, ESE–40
Pre-requisite
• Basics of any programming language

Course Objectives:
• To provide the knowledge and expertise needed to become a proficient data scientist.
• To demonstrate an understanding of the statistics and machine learning concepts that are vital for data science.
• To critically evaluate data visualisations based on their design and their use for communicating stories from data.
Course Outcomes:
• At the end of the course, the students will be able to:
CO1: Explain how data is collected, managed, and stored for data science.
CO2: Understand the key concepts in data science, including their real-world applications and the toolkit used by data scientists.
CO3: Understand the different tools and languages used for data science.
Contents
• Unit 1: Introduction to Data Science
• Unit 2: Feature Generation & Extraction
• Unit 3: Data Visualization
• Unit 4: Applications & Tools used in Data Science
Unit 1: Introduction to Data Science
• Introduction to Data Science
• Different Sectors using Data Science
• Purpose & Components of Python in Data Science
• Data Analytics Process
• Knowledge Check
• EDA
• EDA – Quantitative technique
• EDA – Graphical Technique
• Data Analytics Conclusion & Predictions
Data Science
• Data + Science
• Data: factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.
• A set of values of qualitative or quantitative variables.
• Raw facts and figures.
• E.g., the number of visitors to a website in one month, or individual satisfaction scores on a customer service survey.
Data Science
• Data + Science
• Science: a rigorous, systematic discipline that builds and organizes knowledge in the form of testable hypotheses and predictions about the world.
• Science consists of observing the world: watching, listening, and recording.
Data Science
• We live in a world that’s drowning in data

Data All Around
• Lots of data is being collected and warehoused:
– Web data, e-commerce
– Financial transactions, bank/credit transactions
– Online trading and purchasing
– Social networks
What is Data?
• A collection of raw facts, figures, numbers, or observations.
• Could be anything from website visitor statistics to customer feedback survey results.
• Think of data as the building blocks of information.
• On its own, a single data point might seem insignificant, but when combined and analyzed, it reveals valuable insights.
Data vs Information
• Data is the raw material, while information is what
we derive from that data.

Data vs Information
• Data is unorganised and unrefined facts; information comprises processed, organised data presented in a meaningful context.
• Data is an individual unit that contains raw material which does not carry any specific meaning; information is a group of data that collectively carries a logical meaning.
• Data does not depend on information; information depends on data.
• Raw data alone is insufficient for decision making; information is sufficient for decision making.
• An example of data is a student's test score; the average score of a class is information derived from the given data.
How Much Data Do We Have?

Big Data
• Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time.
Big Data
Big Data is any data that is expensive to manage and hard to extract value from:
– Volume
• The size of the data
– Velocity
• The latency of data processing relative to the growing demand for interactivity
– Variety and Complexity
• The diversity of sources, formats, quality, and structures
Big Data Growth

Data Science
• Data science is the practice of mining large data sets of
raw data, both structured and unstructured, to identify
patterns and extract actionable insight from them.
• This is an interdisciplinary field, and the foundations
of data science include statistics, inference, computer
science, predictive analytics, machine learning
algorithm development, and new technologies to gain
insights from big data.

What is Data Science?
• Data science is the study of data to extract
meaningful insights for business.
• It's a multidisciplinary field that combines principles
and practices from mathematics, statistics, artificial
intelligence (AI), and computer engineering to
analyze large amounts of data.
• “data+science” refers to the scientific study of data.

Data Science
• Data science enables businesses to process huge amounts of
structured and unstructured big data to detect patterns.
• This in turn allows companies to increase efficiencies, manage
costs, identify new market opportunities, and boost their
market advantage.
• Asking a personal assistant like Alexa or Siri for a
recommendation demands data science.
• So does operating a self-driving car, using a search engine that
provides useful results, or talking to a chatbot for customer
service.
• These are all real-life applications for data science.
Big Data & Data Science
• Data comes from various sources, such as online
purchases, multimedia forms, instruments,
financial logs, sensors, text files, and others.
• Data might be unstructured, semi-structured, or
structured.
• This is all “big data,” and putting it to good use is a
pressing job of the 21st century.

Big Data & Data Science
• Data science is not one tool, skill, or method.
• Instead, it is a scientific approach that uses applied
statistical and mathematical theory and computer tools to
process big data.
• The foundations of data science combine the
interdisciplinary strengths of data cleansing, intelligent data
capture techniques, and data mining and programming.
• The result is the data scientist’s ability to capture, maintain,
and prepare big data for intelligent analysis.
Different Sectors using Data Science

Purpose of Python in Data Science
• Data science is a domain that deals with the collection,
analysis and interpretation of data, specifically for
business purposes.
• It involves statistics, machine learning, artificial
intelligence and database systems techniques altogether.
• Python is one of the most popular programming
languages used in data science owing to its simplicity and
flexibility.

Purpose of Python in Data Science
• Python is one of the most popular programming languages
• It provides simplicity and flexibility
• It uses an elegant syntax, hence programs are easier to read
• Large standard library and community support
• Python is an open-source and expressive language
• GUI support
• Rich ecosystem for statistics, machine learning, and artificial intelligence
Purpose of Python in Data Science
• In terms of application areas, Data scientists prefer
Python for the following modules:
• Data Analysis
• Data Visualizations
• Machine Learning
• Deep Learning
• Image processing
• Computer Vision
• Natural Language Processing (NLP)
Components of Python in Data Science
• Python has libraries with large collections of mathematical functions and analytical tools.
• Pandas - This library is used for structured data operations, like importing CSV files, creating DataFrames, and preparing data.
• NumPy - This is a mathematical library. It has a powerful N-dimensional array object, linear algebra, Fourier transforms, etc.
• Matplotlib - This library is used for visualization of data.
• SciPy - This library has linear algebra modules, along with other scientific computing routines.
• Seaborn - A statistical data visualization library built on top of Matplotlib.
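• For illustration, a minimal sketch of how these libraries work together (the tiny DataFrame below is made up for this example):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas: build a small structured dataset (a DataFrame)
df = pd.DataFrame({"city": ["A", "B", "C", "D"],
                   "visitors": [120, 340, 95, 210]})
print(df.describe())                      # summary statistics

# NumPy: element-wise mathematics on a column
df["log_visitors"] = np.log(df["visitors"])

# Seaborn / Matplotlib: visualise the data
sns.barplot(x="city", y="visitors", data=df)
plt.title("Visitors per city")
plt.show()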
Data Analysis
• Data analysis is the process of collecting, transforming,
cleaning and modeling data with the goal of discovering
required information
• A simple example of data analysis: whenever we make a decision in our day-to-day life, we think about what happened last time or what will happen if we choose a particular option.
• This is nothing but analyzing our past or future and making decisions based on it.
Data Analytics
• Data analytics is the process of using tools,
technologies, and processes to convert raw data
into insights that can help solve problems and
identify trends.
• It can help businesses improve decision-making,
shape processes, and grow.

Analytics Types
• Descriptive analytics - what happened
• Diagnostic analytics - why it happened
• Predictive analytics - what is likely to happen
• Prescriptive analytics - what should be done about it
Data Analytics Process

Why EDA?
• EDA is analyzing data using visual techniques.
• It is used to discover patterns or trends, or to check assumptions, with the help of statistical summaries & graphical tools
• To check for mistakes
• Checking assumptions
• Selection of appropriate models
• Determining relationships between variables
• Exploratory Data Analysis is a data analytics process to understand the data in depth and learn its different characteristics, often with visual means.
• This allows you to get a better feel of your data and find useful patterns in it.
Key aspects of EDA include:
• Distribution of Data: Examining the distribution of data points to understand their range,
central tendencies (mean, median), and dispersion (variance, standard deviation).
• Graphical Representations: Utilizing charts such as histograms, box plots, scatter plots, and
bar charts to visualize relationships within the data and distributions of variables.
• Outlier Detection: Identifying unusual values that deviate from other data points. Outliers can
influence statistical analyses and might indicate data entry errors or unique cases.
• Correlation Analysis: Checking the relationships between variables to understand how they
might affect each other. This includes computing correlation coefficients and creating
correlation matrices.
• Handling Missing Values: Detecting and deciding how to address missing data points, whether
by imputation or removal, depending on their impact and the amount of missing data.
• Summary Statistics: Calculating key statistics that provide insight into data trends and
nuances.
• Testing Assumptions: Many statistical tests and models assume the data meet certain
conditions (like normality or homoscedasticity). EDA helps verify these assumptions.
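• As a small illustration of the "missing values" and "summary statistics" aspects above, a sketch with made-up values (the column names are illustrative only):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, 32, np.nan, 41, 38],
                   "income": [42000, 58000, 51000, np.nan, 60000]})

print(df.isnull().sum())     # number of missing values per column
print(df.describe())         # count, mean, std, min, quartiles, max

# One possible treatment: fill numeric gaps with the column median
df_filled = df.fillna(df.median(numeric_only=True))
# Alternative: drop any row that contains a missing value
df_dropped = df.dropna()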
Techniques
• Most of the EDA techniques are graphical in
nature with few quantitative techniques
• EDA – Quantitative
– Descriptive Statistics (Mean, Median, Mode, Variance,
Std deviation, Range)
• EDA – Graphical
– Histogram, Scatterplot, Bar chart, Line Chart, Boxplot
etc.

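• The quantitative techniques listed above (mean, median, mode, variance, standard deviation, range) map directly onto Pandas calls; a sketch on a made-up sample of test scores:

import pandas as pd

scores = pd.Series([56, 72, 72, 64, 81, 90, 67])

print("Mean:    ", scores.mean())
print("Median:  ", scores.median())
print("Mode:    ", scores.mode().tolist())   # mode() may return several values
print("Variance:", scores.var())             # sample variance
print("Std dev: ", scores.std())
print("Range:   ", scores.max() - scores.min())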
Types of EDA
• Univariate non-graphical (quantitative)
• Univariate graphical
• Multivariate non-graphical (quantitative)
• Multivariate graphical
EDA
• Univariate analysis is a statistical method that examines
one variable at a time to summarize or describe it, and to
look for patterns in the data.
• Bivariate data involves two different variables, and the
analysis of this type of data focuses on understanding the
relationship or association between these two variables.
• Multivariate data refers to datasets where each
observation or sample point consists of multiple
variables or features.
Analysis
• Examples:
• Studying the heights of players
• Analyzing the sale of ice creams based on the temperature outside
• Analyzing revenue based on expenditure
How to perform EDA?
• This involves exploring the dataset in three ways:
• Summarizing the dataset using descriptive statistics
• Visualizing the dataset using charts
• Normalizing the dataset
Univariate Analysis
• Histograms: Used to visualize the distribution of a variable.
• Box plots: Useful for detecting outliers and understanding the spread
and skewness of the data.
• Bar charts: Employed for categorical data to show the frequency of
each category.
• Summary statistics: Calculations like mean, median, mode, variance,
and standard deviation that describe the central tendency and dispersion
of the data.

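• A sketch of these univariate techniques with Seaborn/Matplotlib on made-up data (variable names are illustrative):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({"height_cm": [160, 172, 168, 181, 175, 169, 190, 158],
                   "team":      ["A", "A", "B", "B", "A", "B", "A", "B"]})

sns.histplot(df["height_cm"], bins=5)        # histogram: distribution of one variable
plt.show()
sns.boxplot(x=df["height_cm"])               # box plot: spread, skewness, outliers
plt.show()
df["team"].value_counts().plot(kind="bar")   # bar chart: frequency of each category
plt.show()
print(df["height_cm"].describe())            # summary statistics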
Bivariate Analysis
• Scatter Plots: A scatter plot helps visualize the relationship between two continuous
variables.
• Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient
for linear relationships) quantifies the degree to which two variables are related.
• Cross-tabulation: Also known as contingency tables, cross-tabulation is used to analyze
the relationship between two categorical variables. It shows the frequency distribution of
categories of one variable in rows and the other in columns, which helps in
understanding the relationship between the two variables.
• Line Graphs: In the context of time series data, line graphs can be used to compare two
variables over time. This helps in identifying trends, cycles, or patterns that emerge in
the interaction of the variables over the specified period.
• Covariance: Covariance is a measure used to determine how much two random variables
change together. However, it is sensitive to the scale of the variables, so it’s often
supplemented by the correlation coefficient for a more standardized assessment of the
relationship.
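• A sketch of these bivariate measures (scatter plot, correlation coefficient, covariance, cross-tabulation) on made-up data:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"temperature": [18, 22, 25, 28, 31, 34],
                   "ice_creams":  [20, 35, 50, 70, 90, 110],
                   "weekend":     ["no", "no", "yes", "yes", "no", "yes"],
                   "sold_out":    ["no", "no", "no", "yes", "yes", "yes"]})

plt.scatter(df["temperature"], df["ice_creams"])     # scatter plot of two continuous variables
plt.xlabel("Temperature")
plt.ylabel("Ice creams sold")
plt.show()

print(df["temperature"].corr(df["ice_creams"]))      # Pearson correlation coefficient
print(df["temperature"].cov(df["ice_creams"]))       # covariance (sensitive to scale)
print(pd.crosstab(df["weekend"], df["sold_out"]))    # contingency table for two categorical variables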
Multivariate Analysis
• Pair plots: Visualize relationships across several variables
simultaneously to capture a comprehensive view of
potential interactions.
• Principal Component Analysis (PCA): A dimensionality
reduction technique used to reduce the dimensionality of
large datasets, while preserving as much variance as
possible.

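• A sketch of these multivariate techniques on made-up data; note that the PCA step below uses scikit-learn, an assumed extra dependency not listed earlier in this deck:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA   # assumed extra dependency (scikit-learn)

df = pd.DataFrame({"hp":     [120, 150, 200, 90, 300, 250],
                   "weight": [1100, 1250, 1400, 950, 1700, 1500],
                   "mpg":    [38, 32, 27, 45, 18, 22],
                   "price":  [15000, 18000, 26000, 12000, 55000, 40000]})

sns.pairplot(df)              # pairwise scatter plots and histograms for all variables
plt.show()

# In practice the features are usually standardised before PCA
standardised = (df - df.mean()) / df.std()
pca = PCA(n_components=2)                 # keep 2 principal components
components = pca.fit_transform(standardised)
print(pca.explained_variance_ratio_)      # share of variance kept by each component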
Tools for EDA
• Python Libraries
• Pandas: Provides functions for data manipulation and analysis, including data structure handling and time series functionality.
• Matplotlib: A plotting library for creating static, interactive, and animated visualizations in Python.
• Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive and informative statistical graphics.
• Plotly: An interactive graphing library for making interactive plots; it offers more sophisticated visualization capabilities.
• R Packages
• ggplot2: Part of the tidyverse, it's a powerful tool for making complex plots from data in a data frame.
• dplyr: A grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
• tidyr: Helps to tidy your data. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored.
Steps in EDA
• Data Collection: involves gathering relevant data for analysis. Data can be
collected from various sources, including public datasets, surveys, and databases.
• Data Cleaning: This step involves checking for missing data, errors, and
outliers. The data is cleaned by removing duplicates, correcting data entry
errors, and filling in missing values.
• Data Visualization: This step involves creating visualizations to identify
patterns and relationships in the data. Common visualization techniques include
scatter plots, histograms, and box plots.
• Data Transformation: This step involves transforming the data to make it more
suitable for analysis. This can include normalization, scaling, and
standardization.
• Data Modeling: This step involves creating models to describe the relationships
between variables. Models can be simple, such as linear regression, or complex,
such as decision trees or neural networks.
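• The Data Transformation step above (normalization, scaling, standardization) can be sketched with plain Pandas on a made-up DataFrame:

import pandas as pd

df = pd.DataFrame({"price": [12000, 18000, 26000, 55000],
                   "hp":    [90, 120, 200, 300]})

# Min-max normalization: rescale each column to the range [0, 1]
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): zero mean, unit standard deviation per column
standardized = (df - df.mean()) / df.std()

print(normalized)
print(standardized)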
EDA Process
• STEP 1: Import libraries

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

• STEP 2: Read .csv file

df = pd.read_csv("../input/cardataset/data.csv")

• STEP 3: Display first 5 rows and last 5 rows of dataset

# To display the top 5 rows
df.head(5)
# To display the bottom 5 rows
df.tail(5)
EDA Process
#import libraries
import pandas as pd
import numpy as np
import seaborn as sns                 #visualisation
import matplotlib.pyplot as plt       #visualisation
%matplotlib inline
sns.set(color_codes=True)

#Load dataset
df = pd.read_csv("C:/Users/input/cardataset/data.csv")

# To display the top 5 rows
df.head(5)
df.tail(5)    # To display the bottom 5 rows

#check types of data
df.dtypes

#Renaming the columns
df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders",
                        "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode",
                        "highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price"})
df.head(5)

#Dropping the duplicate rows
df.shape
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)
EDA Process
#Now let us remove the duplicate data because it's ok to remove them.
df.count()      # Used to count the number of rows
#So seen above there are 11914 rows and we are removing 989 rows of duplicate data.
df = df.drop_duplicates()
df.head(5)
df.count()

#Detecting Outliers
sns.boxplot(x=df['Price'])
sns.boxplot(x=df['HP'])
sns.boxplot(x=df['Cylinders'])

#Removing outliers with the IQR rule
#(On recent pandas versions these quantile/comparison operations require numeric columns; select them with df.select_dtypes('number') first.)
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape
EDA Process
#Plot different features against one another (scatter), against frequency (histogram)

#Histogram
#A histogram shows the frequency of occurrence of a variable's values within intervals.
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');

#Heatmaps
#A heat map is a type of plot which is useful when we need to find how the variables depend on one another.
#One of the best ways to find the relationship between the features is a heat map of the correlation matrix.
#In the heat map below we can see that the price feature depends mainly on the Engine Size, Horsepower, and Cylinders.
plt.figure(figsize=(10,5))
c = df.corr()    # on recent pandas versions use df.corr(numeric_only=True) if non-numeric columns remain
sns.heatmap(c, cmap="BrBG", annot=True)
c
EDA Process
#Scatterplot
#We generally use scatter plots to find the correlation between two variables
# Here the scatter plots are plotted between Horsepower and Price and we can see the plot below.
#With the plot given below, we can easily draw a trend line.
#These features provide a good scattering of points.

fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()

Data Analytics Conclusion & Prediction

Knowledge Check
• Data is
• A. A set of values of qualitative or quantitative variables
• B. factual information or raw facts
• C. Both A & B
• D. None of the above
Knowledge Check
• Select one of the following where data is being collected
• A. Education
• B. Business
• C. Healthcare
• D. All of the above

Knowledge Check
Information is
• A. processed, organized, and structured data
• B. sufficient for decision making
• C. data that carries logical meaning
• D. All of the above
Knowledge Check
• Identify the types of analytics
• A. Descriptive
• B. Diagnostic
• C. Predictive
• D. All of the above

Knowledge Check
EDA is
• A. used to discover patterns or trends
• B. to check assumptions with statistical summaries & graphical tools
• C. analyzing data using visual techniques
• D. All of the above
References
• https://www.rudderstack.com/learn/data-analytics/data-analytics-processes/
• https://dev.to/yankho817/exploratory-data-analysis-edaultimate-guide-174d
