IoT Domain
• It helps determine how best to manipulate data sources to get the answers
you need, making it easier for data scientists to discover patterns, spot
anomalies, test a hypothesis, or check assumptions.
Ref: https://www.ibm.com/cloud/learn/exploratory-data-analysis
What is Exploratory Data Analysis?
• EDA is primarily used to see what data can reveal beyond the formal modeling
or hypothesis testing task and provides a better understanding of data set
variables and the relationships between them
• It can also help determine if the statistical techniques you are considering for
data analysis are appropriate
• Univariate Non-Graphical
• This is the simplest type of EDA, where data has a single variable. Since there is only one
variable, data professionals do not have to deal with relationships.
• Univariate Graphical
• Non-graphical techniques do not present the complete picture of data. Therefore, for
comprehensive EDA, data specialists implement graphical methods, such as
stem-and-leaf plots, box plots, and histograms.
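As a sketch of the univariate case, here are summary statistics on a single made-up variable (the sensor readings below are illustrative only, not from the book):

```python
import pandas as pd

# Hypothetical single-variable sample, e.g. temperature readings from one sensor.
temps = pd.Series([21.5, 22.0, 21.8, 35.0, 22.1, 21.9])

# Non-graphical univariate EDA: summary statistics.
print(temps.describe())

# The 35.0 reading stands out against the median -- a candidate anomaly
# that a box plot or histogram (graphical EDA) would make visible.
print(temps.median())
```

A histogram of the same series (`temps.plot.hist()`) would be the graphical counterpart of this table.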
Exploratory Data Analysis Techniques
• Multivariate Non-Graphical
• Multivariate data consists of several variables. Non-graphical multivariate EDA methods
illustrate relationships between two or more data variables using statistics or
cross-tabulation.
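A minimal sketch of non-graphical multivariate EDA using cross-tabulation; the device/status sample below is invented for illustration:

```python
import pandas as pd

# Hypothetical IoT-flavoured sample: device type vs. reported status.
df = pd.DataFrame({
    'device': ['sensor', 'sensor', 'gateway', 'gateway', 'sensor'],
    'status': ['ok', 'fail', 'ok', 'ok', 'ok'],
})

# Cross-tabulation counts co-occurrences of the two categorical variables.
table = pd.crosstab(df['device'], df['status'])
print(table)
```

Each cell counts how often a device type and a status occur together, which already hints at which device category fails more often.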
• Multivariate Graphical
• This EDA technique makes use of graphics to show relationships between two or more
variables. Widely used multivariate graphics include the bar chart, bar plot, heat map,
bubble chart, run chart, multivariate chart, and scatter plot.
Exploratory vs Confirmatory Data Analysis
EDA: no hypothesis at first
CDA: starts with a hypothesis
Steps of EDA
• Data restructuring: you may need to make new variables from the existing
ones
• Instead of using two variables, obtain rates or percentages of them
• Create dummy variables for categorical variables
• Based on the research questions, use appropriate graphical tools and obtain
descriptive statistics. Try to understand the data structure, relationships,
anomalies, and unexpected behaviors
• Try to identify confounding variables, interaction relations, and
multicollinearity, if any
Outcome:
You will learn how to export all your emails as a dataset, how to import them
into a pandas data frame, how to visualize them, and the different types of insights
you can gain.
Ref: Book: Hands-On Exploratory Data Analysis with Python by Suresh Kumar Mukhiya and Usman Ahmed, Chapter 3
EDA with Personal Email- Step 1
1. Here are the steps to follow (data generation and collection):
a) Log in to your personal Gmail account.
b) Go to the following link: https://takeout.google.com/settings/takeout
c) Deselect all the items except Gmail, as shown in the following screenshot:
EDA with Personal Email-Step 1
d) Select the archive format, as shown in the following screenshot.
• Note that this analysis needs the mailbox package. In Python 3 it ships with the standard library; if it is
missing from your build, it can be added using the pip install mailbox instruction.
EDA with Personal Email-Step 2
Loading the dataset
• The output indicates that the mailbox has been successfully created.
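The loading step itself (shown on the slide as a screenshot) can be sketched as follows. The sample message is fabricated so the snippet runs stand-alone; with a real Takeout export you would point `mailbox.mbox()` at the downloaded file, typically named `All mail Including Spam and Trash.mbox`:

```python
import mailbox
from email.message import Message

# Build a tiny sample mbox so the snippet is self-contained; the address
# and subject below are invented for illustration.
msg = Message()
msg['Subject'] = 'Hello'
msg['From'] = 'Alice <alice@example.com>'
msg.set_payload('Hi there')

sample = mailbox.mbox('sample.mbox')
sample.add(msg)
sample.flush()

# Load the archive; len() reports the number of messages found.
mbox = mailbox.mbox('sample.mbox')
print(len(mbox))  # 1
```

If `len(mbox)` is positive, the mailbox has been created and loaded successfully.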
EDA with Personal Email-Step 2
Loading the dataset
3. Next, let's see the list of available keys:

for key in mbox[0].keys():
    print(key)
• The preceding output shows the list of keys that are present in
the extracted dataset.
EDA with Personal Email-Step 3
A. Data transformation
• Although the extracted data returns a lot of objects, we do not need all the items;
we will only extract the required fields.
• Data cleansing is one of the essential steps in the data analysis phase.
• For our analysis, all we need is data for the following fields: subject, from, date,
to, label, and thread.

B. Data cleansing
Let's create a CSV file with only the required fields, starting with the following steps:

1. Import the csv package:

import csv

2. Create a CSV file with only the required attributes:

with open('mailbox.csv', 'w') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['subject', 'from', 'date', 'to', 'label', 'thread'])
    for message in mbox:
        writer.writerow([
            message['subject'],
            message['from'],
            message['date'],
            message['to'],
            message['X-Gmail-Labels'],
            message['X-GM-THRID']
        ])

• The preceding code produces a CSV file named mailbox.csv. From now on, instead of
loading the mbox file, we can load this CSV file, which is much smaller than the
original dataset.
EDA with Personal Email-Step 3
C. Loading the CSV file
• We will load the CSV file. Refer to the following code block:

import pandas as pd

# header=0 skips the header row we wrote into mailbox.csv earlier
dfs = pd.read_csv('mailbox.csv', header=0,
                  names=['subject', 'from', 'date', 'to', 'label', 'thread'])

• The preceding code generates a pandas data frame with only the required fields
stored in the CSV file.

D. Converting the date
• Next, we will convert the date. Check the datatypes of each column as shown here:

dfs.dtypes

• The output of the preceding code is as follows:
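The conversion itself can be sketched with `pd.to_datetime`; the sample strings below are made up, and `errors='coerce'` turns unparseable dates into `NaT` instead of raising:

```python
import pandas as pd

# Hypothetical raw values in the shape of email Date headers.
dfs = pd.DataFrame({'date': ['Mon, 2 Jan 2023 10:00:00 +0000', 'not a date']})

# utc=True normalises mixed offsets into one timezone-aware dtype;
# errors='coerce' maps bad values to NaT.
dfs['date'] = pd.to_datetime(dfs['date'], errors='coerce', utc=True)
print(dfs['date'].dtype)
```

Running `dfs.dtypes` again after this step should show the date column as a datetime type rather than object.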
• Next, it is good to save the preprocessed file into a separate CSV file in case we need it again.
• We can save the data frame into a separate CSV file as follows:
• dfs.to_csv('gmail.csv')
EDA with Personal Email-Step 4
• Applying descriptive statistics
• Having preprocessed the dataset, let's do some sanity checking using
descriptive statistics techniques.
• Let's check the first few entries of the email dataset:

dfs.head(10)

• The output of the preceding code is as follows:
• Note that our data frame so far contains six different columns. Take a look
at the from field: it contains both the name and the email address. For our
analysis, we only need the email address, so we can use a regular expression
to refactor the column.
EDA with Personal Email-Step 5
• Data refactoring

1. First of all, import the regular expression package (NumPy is also needed for the NaN fallback):

import re
import numpy as np

2. Next, let's create a function that takes an entire string from any column and
extracts an email address:

def extract_email_ID(string):
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan

3. Next, let's apply the function to the from column:

dfs['from'] = dfs['from'].apply(lambda x: extract_email_ID(x))

4. Next, we are going to refactor the label field. The logic is simple: if an email
is from your own email address, then it is a sent email; otherwise, it is a
received email, that is, an inbox email:

myemail = 'itsmeskm99@gmail.com'
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x == myemail else 'inbox')
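The helper can be exercised on a couple of sample strings (the addresses below are made up):

```python
import re
import numpy as np

def extract_email_ID(string):
    # Prefer the address inside angle brackets; fall back to any '@' token.
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan

print(extract_email_ID('Alice Example <alice@example.com>'))  # alice@example.com
print(extract_email_ID('bob@example.org'))                    # bob@example.org
```

Both the "Name <address>" form and a bare address resolve to just the address, which is what the from column needs.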
EDA with Personal Email-Step 6
• Dropping columns
1. Note that the to column only contains your own email address, so we can drop it:

dfs.drop(columns='to', inplace=True)

2. This drops the to column from the data frame. Let's display the first 10 entries now:

dfs.head(10)

• The output of the preceding code is as follows:
• Check the preceding output: the fields are cleaned and the data is transformed into
the correct format.
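A quick check of the drop on a toy frame (the columns mirror the email dataset; the values are invented):

```python
import pandas as pd

dfs = pd.DataFrame({
    'subject': ['meeting'],
    'from': ['alice@example.com'],
    'to': ['me@example.com'],
})

# inplace=True mutates dfs instead of returning a new frame.
dfs.drop(columns='to', inplace=True)
print(list(dfs.columns))  # ['subject', 'from']
```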
EDA with Personal Email-Step 7
• Refactoring timezones
1. We can refactor timezones by using the method given here:

import pytz

def refactor_timezone(x):
    # Convert a timezone-aware datetime to US/Eastern time
    est = pytz.timezone('US/Eastern')
    return x.astimezone(est)
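Applying the helper can be sketched as follows (the timestamp is invented; note that `astimezone` requires a timezone-aware input):

```python
import pytz
import pandas as pd

def refactor_timezone(x):
    est = pytz.timezone('US/Eastern')
    return x.astimezone(est)

# 15:00 UTC on 2 Jan 2023 is 10:00 in US/Eastern (UTC-5 in winter).
ts = pd.Timestamp('2023-01-02 15:00:00', tz='UTC')
print(refactor_timezone(ts))
```

In the email pipeline this would typically be applied column-wide, e.g. `dfs['date'].apply(refactor_timezone)`, once the date column is timezone-aware.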
EDA with Personal Email-Step 7
Data analysis
• This is the most important part of EDA. This is the part where we gain insights from
the data that we have.
Horizontal data scientists come with the following features:
• They can design robust, efficient, simple, replicable, and scalable code
and algorithms.
• They have some familiarity with Six Sigma concepts; in essence, speed is more important than perfection for these analytic practitioners.
• They have experience in producing success stories out of large, complicated, messy data sets - including measuring that success.
• They have experience in identifying the real problem to be solved, the data sets (external and internal), the database structures, and
the metrics they need, rather than being passive consumers of data sets produced or gathered by third parties lacking the skills to collect /
• They know rules of thumb and pitfalls to avoid, more than theoretical concepts. However, they have a bit more than just basic knowledge of
computational complexity, good sampling and design of experiment, robust statistics and cross-validation, modern database design and
• They can help produce useful dashboards (the ones that people really use on a daily basis to make decisions) or alternate tools to
communicate insights found in data (orally, by email or automatically - and sometimes in real time machine-to-machine mode).
This is one of the reasons for offering a data science program, and why
recruiters can't find data scientists.
Mostly, they find and recruit vertical data scientists. Companies are not yet
used to identifying horizontal data scientists - the true money makers and
ROI generators among analytic professionals.
Vertical-specific algorithms (ML workflow)
Detailed Classification of ML Techniques
• Part 1: Data Pre-processing (before ML)
• Part 2: Regression
Simple Linear Regression
Multiple Linear Regression
Polynomial Regression
Support Vector Regression (SVR)
Decision Tree Regression
Random Forest Regression
Evaluating Regression Model Performance (KPIs): R^2, RMSE, k-fold cross-validation,
and score
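The KPIs listed above can be computed directly; here is a minimal NumPy sketch with made-up predictions:

```python
import numpy as np

# Hypothetical ground truth and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# R^2: fraction of target variance explained by the model.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# RMSE: root mean squared error, in the units of the target.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(round(r2, 4), round(rmse, 4))  # 0.975 0.3536
```

k-fold cross-validation repeats this evaluation over several train/test splits and averages the scores, which gives a less optimistic estimate than a single split.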