IoT Domain

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate datasets and summarize their main characteristics. EDA helps determine how best to manipulate data sources to get the answers needed, making it easier to discover patterns, spot anomalies, and check assumptions. EDA techniques include univariate and multivariate graphical and non-graphical analysis. Common EDA tools are used to perform statistical functions like clustering, dimension reduction, and predictive modeling. EDA is applied by loading email data from Gmail into a pandas dataframe, then visualizing the data and gaining insights through techniques like univariate analysis and predictive modeling.


Module: 6 Data Analytics

for IoT Solutions


Module: 6 Data Analytics for IoT Solutions

• Data generation, data gathering, data pre-processing, data analysis, application of analytics, Exploratory Data Analysis, vertical-specific algorithms.
Application of Exploratory Data Analysis
(EDA)
• What is EDA ?
• Aim of EDA
• EDA Tools
• EDA Techniques
• EDA vs CDA
• Steps of EDA
• Application of EDA with personal email
What is Exploratory Data Analysis
• Exploratory data analysis (EDA) is used by data scientists to analyze and
investigate data sets and summarize their main characteristics, often
employing data visualization methods

• It helps determine how best to manipulate data sources to get the answers
you need, making it easier for data scientists to discover patterns, spot
anomalies, test a hypothesis, or check assumptions

Ref: https://www.ibm.com/cloud/learn/exploratory-data-analysis
What is Exploratory Data Analysis
• EDA is primarily used to see what data can reveal beyond the formal modeling
or hypothesis testing task and provides a better understanding of data set
variables and the relationships between them

• It can also help determine if the statistical techniques you are considering for
data analysis are appropriate

• Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.
Aim of EDA
• Maximize insight into a dataset
• Uncover underlying structure (relationship)
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)
Exploratory data analysis tools
• Specific statistical functions and techniques that can be performed with EDA
tools include:
• Clustering and dimension reduction techniques, which help create graphical displays of
high-dimensional data containing many variables
• Univariate, Multivariate, Bivariate visualization of each field in the raw dataset, with
summary statistics.
• Predictive models, such as linear regression, use statistics and data to predict
outcomes.
• K-means Clustering 
Exploratory Data Analysis Techniques
• There are four exploratory data analysis techniques that data experts use,
which include:

• Univariate Non-Graphical
• This is the simplest type of EDA, where data has a single variable. Since there is only one
variable, data professionals do not have to deal with relationships. 
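A minimal sketch of univariate non-graphical EDA, using Python's standard statistics module; the readings below are a made-up sample from a hypothetical IoT temperature sensor:

```python
import statistics

# Hypothetical readings from a single IoT temperature sensor
readings = [21.5, 22.0, 21.8, 22.4, 21.9, 35.0, 22.1]

mean = statistics.mean(readings)      # pulled upward by the 35.0 reading
median = statistics.median(readings)  # robust to that outlier
stdev = statistics.stdev(readings)    # sample standard deviation

print(median)  # → 22.0
```

Comparing the mean against the median is itself a simple non-graphical check: a mean well above the median hints at high outliers in the single variable.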

• Univariate Graphical
• Non-graphical techniques do not present the complete picture of data. Therefore, for comprehensive EDA, data specialists implement graphical methods, such as stem-and-leaf plots, box plots, and histograms.
Exploratory Data Analysis Techniques
• Multivariate Non-Graphical
• Multivariate data consists of several variables. Non-graphical multivariate EDA methods illustrate relationships between two or more data variables using statistics or cross-tabulation.
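Cross-tabulation can be sketched with pandas on a small made-up table (the device/alert columns and values are hypothetical):

```python
import pandas as pd

# Hypothetical categorical data: device type vs. whether it raised an alert
df = pd.DataFrame({
    "device": ["sensor", "sensor", "gateway", "gateway", "sensor"],
    "alert":  ["yes", "no", "yes", "yes", "no"],
})

# Cross-tabulation: count of rows for each (device, alert) combination
table = pd.crosstab(df["device"], df["alert"])
print(table)
```

The resulting contingency table shows, non-graphically, how the two categorical variables relate.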

• Multivariate Graphical
• This EDA technique makes use of graphics to show relationships between 2 or more
datasets. The widely-used multivariate graphics include bar chart, bar plot, heat map,
bubble chart, run chart, multivariate chart, and scatter plot. 
Exploratory vs Confirmatory Data Analysis

EDA                                  CDA
No hypothesis at first               Start with a hypothesis
Generate hypotheses                  Test the null hypothesis
Uses (mostly) graphical methods      Uses statistical models
Steps of EDA
• Generate good research questions

• Data restructuring: You may need to make new variables from the existing
ones
• Instead of using two variables, obtaining rates or percentages of them
• Creating dummy variables for categorical variables
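The dummy-variable step can be sketched with pandas get_dummies; the conn column here is a made-up categorical variable:

```python
import pandas as pd

# Hypothetical categorical variable: connection type of each device
df = pd.DataFrame({"conn": ["wifi", "lora", "wifi", "cellular"]})

# One 0/1 indicator column per category
dummies = pd.get_dummies(df["conn"], prefix="conn")
print(dummies.columns.tolist())  # → ['conn_cellular', 'conn_lora', 'conn_wifi']
```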

• Based on the research questions, use appropriate graphical tools and obtain
descriptive statistics. Try to understand the data structure, relationships,
anomalies, unexpected behaviors
Steps of EDA
• Try to identify confounding variables, interaction relations, and multicollinearity, if any.

• Handle missing observations

• Decide on the need for transformation (on response and/or explanatory variables)

• Decide on the hypothesis based on your research questions


After EDA
• Confirmatory Data Analysis: Verify the hypothesis by statistical analysis

• Draw conclusions and present the results using various graphical representations
Visual Aids for EDA
• Line chart
• Bar chart
• Scatter plot
• Area plot and
• stacked plot
• Pie chart
• Table chart
• Polar chart
• Histogram
• Lollipop chart
Line Chart
A line chart is used to illustrate the relationship between two or more
continuous variables.
Scatter Plot
Scatter plots can be constructed in the following two situations:
1) When one continuous variable is dependent on another variable, which is under
the control of the observer
2) When both continuous variables are independent
There are two important concepts—independent variable and dependent variable. In
statistical modeling or mathematical modeling, the values of dependent variables rely
on the values of independent variables. The dependent variable is the outcome
variable being studied. The independent variables are also referred to as regressors.
The takeaway message here is that scatter plots are used when we need to show the
relationship between two variables, and hence are sometimes referred to as
correlation plots.
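The relationship a scatter (correlation) plot visualizes can be quantified with NumPy's Pearson correlation coefficient; the paired measurements below are invented for illustration:

```python
import numpy as np

# Hypothetical paired measurements: ambient temperature vs. device power draw
temp = np.array([18.0, 20.0, 22.0, 24.0, 26.0])
power = np.array([5.1, 5.6, 6.0, 6.6, 7.0])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(temp, power)[0, 1]
print(round(r, 3))
```

A value of r close to +1 (as here) corresponds to points tightly clustered around an upward-sloping line on the scatter plot.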
A bubble plot is a manifestation of the scatter plot where each data point on the graph is shown as a
bubble. Each bubble can be illustrated with a different colour, size, and appearance.
Bubble plot for predictive analysis
Lollipop Chart
• The lollipop chart is a composite
chart with bars and circles. It is a
variant of the bar chart with a
circle at the end, to highlight the
data value. Like a bar chart, a
lollipop chart is used to compare
categorical data. For this kind of
composite chart, we are able to
use more visual elements to
convey information.
• The stacked plot can be useful when we want to visualize the cumulative effect of multiple variables being plotted on the y axis.
• The purpose of the pie chart is to communicate proportions, and it is widely accepted.
• Histogram plots are used to depict the distribution of any continuous variable. These types of plots are very popular in statistical analysis.
• A lollipop chart can be used to display ranking in the data. It is similar to an ordered bar chart.
Lollipop chart for predictive analysis
• Case study
Box plot
• Boxplots are a standardized way of displaying the distribution of data
based on a five number summary (“minimum”, first quartile (Q1),
median, third quartile (Q3), and “maximum”).
What is an outlier in a data plot?
• For a scatter plot, an outlier is the point or points that are farthest from the regression line.
• For a box plot, outliers are the points that fall beyond the whiskers, that is, below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
• median (Q2/50th Percentile): the middle value of the dataset.
• first quartile (Q1/25th Percentile): the middle number between the smallest
number (not the “minimum”) and the median of the dataset.
• third quartile (Q3/75th Percentile): the middle value between the median
and the highest value (not the “maximum”) of the dataset.
• interquartile range (IQR): 25th to the 75th percentile.
• whiskers (shown in blue)
• outliers (shown as green circles)
• “maximum”: Q3 + 1.5*IQR
• “minimum”: Q1 -1.5*IQR
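The five-number summary and the 1.5×IQR fences above can be computed directly with NumPy; the data array is a made-up sample containing one extreme value:

```python
import numpy as np

# Hypothetical sample with one extreme value
data = np.array([10, 12, 11, 13, 12, 14, 11, 40])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # the boxplot "minimum"
upper_fence = q3 + 1.5 * iqr  # the boxplot "maximum"

# Points beyond the fences are drawn as individual outliers
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers.tolist())  # → [40]
```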
Choosing the best chart
Application of EDA
Application of EDA Hands-
on
Topic covered:
• Loading the dataset
• Data transformation
• Data analysis

Outcome:
You will learn how to export all your emails as a dataset, how to import them into a pandas dataframe, how to visualize them, and the different types of insights you can gain.

Ref: Book: Hands-On Exploratory Data Analysis with Python by Suresh Kumar Mukhiya Usman Ahmed. Chapter 3
EDA with Personal Email- Step 1
1. Here are the steps to follow (data generation and collection):
a) Log in to your personal Gmail account.
b) Go to the following link: https://takeout.google.com/settings/takeout
c) Deselect all the items but Gmail, as shown in the following screenshot:
EDA with Personal Email-Step 1
d) Select the archive format, as shown in the following screenshot.

• Note that I selected Send download link by email, One-time archive, .zip, and the maximum allowed size.
• You can customize the format. Once done, hit Create archive.
• You will get an email archive that is ready for download. You can use the path to the mbox file for further analysis, which will be discussed further.
EDA with Personal Email-Step 2
Loading the dataset
• I loaded my own personal email from Google Mail. For privacy reasons, I won't share the dataset. However, I will show you different EDA operations that you can perform to analyze several aspects of your email behavior:

1. Let's load the required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

• Note that this analysis also uses the mailbox module, which is part of Python's standard library, so no separate installation is needed.
EDA with Personal Email-Step 2
Loading the dataset

2. When you have loaded the libraries, load the dataset:

import mailbox
mboxfile = "PATH TO DOWNLOADED MBOX FILE"
mbox = mailbox.mbox(mboxfile)
mbox

• Note that it is essential that you replace the mbox file path with your own path.
• The output of the preceding code is as follows:

<mailbox.mbox at 0x7f124763f5c0>

• The output indicates that the mailbox has been successfully created.
EDA with Personal Email-Step 2
Loading the dataset
3. Next, let's see the list of available keys:
for key in mbox[0].keys():
print(key)

• The output of the preceding code is as follows:

• The preceding output shows the list of keys that are present in
the extracted dataset.
EDA with Personal Email-Step 3
A. Data transformation
• Although there are a lot of objects returned by the extracted data, we do not need all the items. We will only extract the required fields.
• For our analysis, all we need is data for the following: subject, from, date, to, label, and thread.

B. Data cleansing
• Data cleansing is one of the essential steps in the data analysis phase.
Let's create a CSV file with only the required fields. Let's start with the following steps:
1. Import the csv package:
import csv
2. Create a CSV file with only the required attributes:
with open('mailbox.csv', 'w') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['subject', 'from', 'date', 'to', 'label', 'thread'])
    for message in mbox:
        writer.writerow([
            message['subject'],
            message['from'],
            message['date'],
            message['to'],
            message['X-Gmail-Labels'],
            message['X-GM-THRID'],
        ])
• The preceding output is a CSV file named mailbox.csv. Next, instead of loading the mbox file, we can use the CSV file for loading, which will be smaller than the original dataset.
EDA with Personal Email-Step 3
C. Loading the CSV file
• We will load the CSV file. Refer to the following code block:
dfs = pd.read_csv('mailbox.csv', names=['subject', 'from', 'date', 'to', 'label', 'thread'])
• The preceding code will generate a pandas data frame with only the required fields stored in the CSV file.

D. Converting the date
• Next, we will convert the date. Check the datatypes of each column as shown here:
dfs.dtypes
• The output of the preceding code is as follows:

• Note that the date field is an object, so we need to convert it into a DateTime argument.
• In the next step, we are going to convert the date field into an actual DateTime argument. We can do this by using the pandas to_datetime() method. See the following code:
dfs['date'] = dfs['date'].apply(lambda x: pd.to_datetime(x, errors='coerce', utc=True))
EDA with Personal Email-Step 3
E. Removing NaN values
• Next, we are going to remove NaN values from the field.

• We can do this as follows:


• dfs = dfs[dfs['date'].notna()]

• Next, it is good to save the preprocessed file into a separate CSV file in case we need it again.

• We can save the data frame into a separate CSV file as follows:
• dfs.to_csv('gmail.csv')
EDA with Personal Email-Step 4
• Applying descriptive statistics
• Having preprocessed the dataset, let's do some sanity checking using descriptive statistics techniques. We can implement this as shown here:
dfs.info()
• The output of the preceding code is as follows:

• Let's check the first few entries of the email dataset:
dfs.head(10)
• The output of the preceding code is as follows:

• Note that our data frame so far contains six different columns. Take a look at the from field: it contains both the name and the email. For our analysis, we only need the email address. We can use a regular expression to refactor the column.
EDA with Personal Email-Step 5
• Data refactoring
1. First of all, import the regular expression package:
import re
2. Next, let's create a function that takes an entire string from any column and extracts an email address:
def extract_email_ID(string):
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan
3. Next, let's apply the function to the from column:
dfs['from'] = dfs['from'].apply(lambda x: extract_email_ID(x))
4. Next, we are going to refactor the label field. The logic is simple: if an email is from your email address, then it is a sent email; otherwise, it is a received email, that is, an inbox email:
myemail = '[email protected]'
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x == myemail else 'inbox')
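The extraction logic can be exercised on a couple of sample header strings (the names and addresses here are hypothetical); it is restated below as a self-contained sketch:

```python
import re
import numpy as np

def extract_email_ID(string):
    # Prefer an address wrapped in angle brackets: "Name <user@host>"
    email = re.findall(r'<(.+?)>', string)
    if not email:
        # Otherwise fall back to any whitespace-separated token containing '@'
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan

print(extract_email_ID('Jane Doe <jane@example.com>'))  # → jane@example.com
print(extract_email_ID('bob@example.org'))              # → bob@example.org
```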
EDA with Personal Email-Step 6
• Dropping columns
1. Note that the to column only contains your own email, so we can drop this column:
dfs.drop(columns='to', inplace=True)

2. This drops the to column from the data frame. Let's display the first 10 entries now:
dfs.head(10)

• The output of the preceding code is as follows. Check the preceding output: the fields are cleaned, and the data is transformed into the correct format.
EDA with Personal Email-Step 7

• Refactoring timezones
1. We can refactor timezones by using the method given here:
import datetime
import pytz

def refactor_timezone(x):
    est = pytz.timezone('US/Eastern')
    return x.astimezone(est)
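An equivalent sketch using the standard-library zoneinfo module (available since Python 3.9) instead of pytz; the timestamp below is invented for illustration:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def refactor_timezone(x):
    # Convert a timezone-aware datetime to US Eastern time
    return x.astimezone(ZoneInfo('US/Eastern'))

utc_noon = datetime(2020, 1, 15, 12, 0, tzinfo=timezone.utc)
print(refactor_timezone(utc_noon).hour)  # → 7 (EST is UTC-5 in January)
```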
EDA with Personal Email-Step 8
Data analysis
• This is the most important part of EDA. This is the part where we gain insights from
the data that we have.

• Let's answer the following questions one by one:


1. How many emails did I send during a given timeframe?
2. At what times of the day do I send and receive emails with Gmail?
3. What is the average number of emails per day?
4. What is the average number of emails per hour?
5. Whom do I communicate with most frequently?
6. What are the most active emailing days?
7. What am I mostly emailing about?

• In the following sections, we will answer the preceding questions.


1. Number of emails
2. Time of day
3. Average emails per day and hour

• The average emails per day and per hour are illustrated in the preceding graphs.
• In my case, most email communication happened between 2018 and 2020.
Number of emails per day
• Let's find the busiest day of the week in terms of emails:
• counts = dfs.dayofweek.value_counts(sort=False)
• counts.plot(kind='bar')

• The output of the preceding code is as follows:


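Note that the dayofweek column counted above has to be derived from the parsed date column first; a minimal sketch with made-up dates:

```python
import pandas as pd

# Hypothetical stand-in for the email data frame's parsed 'date' column
dfs = pd.DataFrame({'date': pd.to_datetime(
    ['2020-01-06', '2020-01-07', '2020-01-07',
     '2020-01-08', '2020-01-14'], utc=True)})

# Derive the weekday name, then count emails per weekday
dfs['dayofweek'] = dfs['date'].dt.day_name()
counts = dfs['dayofweek'].value_counts()
print(counts.idxmax())  # → Tuesday
```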
Summary
• We imported data from our own Gmail accounts in mbox format.
• We loaded the dataset and performed some primitive EDA
techniques, including data loading, data transformation, and data
analysis.
• We also tried to answer some basic questions about email
communication.
Vertical vs. Horizontal Data Scientists

Vertical data scientists have very deep knowledge in some narrow field. They might be:
• A computer scientist very familiar with the computational complexity of all sorting algorithms
• A software engineer with years of experience writing Python code (including graphics libraries) applied to API development and web-crawling technology
• A database expert with strong data modeling, data warehousing, graph databases, Hadoop, and NoSQL expertise
• A predictive modeler who is an expert in Bayesian networks, SAS, and SVMs
Horizontal data scientists
• They are a blend of business analysts, statisticians, computer scientists
and domain experts. They combine vision with technical knowledge

• They know about more modern, data-driven techniques applicable to unstructured, streaming, and big data

• They can design robust, efficient, simple, replicable and scalable code
and algorithms.
Horizontal data scientists also come with the following features:
• They have some familiarity with six sigma concepts. In essence, speed is more important than perfection for these analytic practitioners.
• They have experience in producing success stories out of large, complicated, messy data sets, including in measuring the success.
• They have experience in identifying the real problem to be solved, the data sets (external and internal) they need, the database structures they need, and the metrics they need, rather than being passive consumers of data sets produced or gathered by third parties lacking the skills to collect or create the right data.
• They know rules of thumb and pitfalls to avoid, more than theoretical concepts. However, they have a bit more than just basic knowledge of computational complexity, good sampling and design of experiments, robust statistics and cross-validation, and modern database design and programming languages (R, scripting languages, MapReduce concepts, SQL).
• They have advanced Excel and visualization skills.
• They can help produce useful dashboards (the ones that people really use on a daily basis to make decisions) or alternate tools to communicate insights found in data (orally, by email, or automatically, sometimes in real-time machine-to-machine mode).
• They think outside the box.
• They are innovators who create truly useful stuff.


Vertical data scientists are the by-product of our rigid university system, which trains people to become either a computer scientist, a statistician, an operations researcher, or an MBA, but not all four at the same time.

This is one of the reasons for offering a data science program, and why recruiters can't find data scientists.

Mostly they find and recruit vertical data scientists. Companies are not yet used to identifying horizontal data scientists, the true money makers and ROI generators among analytic professionals.
Vertical-specific algorithms (ML workflow)
Detailed Classification of ML Techniques
• Part 1: Data Pre-processing (before ML)
• Part 2: Regression
 Simple Linear Regression
 Multiple Linear Regression
 Polynomial Regression
 Support Vector Regression (SVR)
 Decision Tree Regression
 Random Forest Regression
 Evaluating Regression Model Performance (KPIs): R², RMSE, k-fold cross-validation score
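The regression KPIs listed above can be computed by hand with NumPy; y_true and y_pred below are made-up values standing in for a model's output:

```python
import numpy as np

# Hypothetical actual vs. predicted values from a regression model
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.8])

# Root mean squared error: typical size of a prediction error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: fraction of the variance in y_true explained by the model
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(r2, 3))  # → 0.991
```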

05/01/2023 Dr. Geetha Mani,SELECT 65


Cont..
Part 3: Classification
 Logistic Regression
 K-Nearest Neighbors (K-NN)
 Support Vector Machine (SVM)
 Kernel SVM
 Naive Bayes
 Decision Tree Classification
 Random Forest Classification
 Evaluating Classification Models Performance: Confusion matrix, Recall, Precision,
Sensitivity / specificity
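The classification KPIs above all follow directly from the four confusion-matrix counts; the counts below are made up for illustration:

```python
# Hypothetical binary confusion-matrix counts
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)    # of predicted positives, how many were correct
recall = tp / (tp + fn)       # of actual positives, how many were found (sensitivity)
specificity = tn / (tn + fp)  # of actual negatives, how many were found
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(precision, accuracy)  # → 0.8 0.85
```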



Cont..
Part 4: Clustering
 K-Means Clustering
 Hierarchical Clustering

Part 5: Association Rule Learning

Part 6: Reinforcement Learning

Part 7: Natural Language Processing

Part 8: Deep Learning:


 Artificial Neural Networks (ANN)
 Convolutional Neural Networks (CNN)



Cont..
Part 9: Dimensionality Reduction
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Kernel PCA

Part 10: Model Selection and Deployment



Reference
• Book: Hands-On Exploratory Data Analysis with Python by Suresh Kumar Mukhiya and Usman Ahmed, Chapters 3 & 11.
• https://www.datasciencecentral.com/profiles/blogs/vertical-vs-horizontal-data-scientists
• https://towardsdatascience.com/vertical-vs-horizontal-ai-startups-e2bdec23aa16
