
Unit - 2

Data collection and management


Syllabus
1. Introduction,
2. Sources of data,
3. Data collection and APIs,
4. Exploring and fixing data,
5. Data storage and management,
6. Using multiple data sources
Introduction
Data Science is about data gathering, analysis and decision-making: finding patterns in data
through analysis and making future predictions.
By using Data Science, companies can make:
• Better decisions (should we choose A or B?)
• Predictive analyses (what will happen next?)
• Pattern discoveries (finding patterns, or maybe hidden information, in the data)

Data collection: is a process of gathering information from all the relevant sources.
Most of the organizations use data collection methods to make assumptions about future
probabilities and trends.
Once the data is collected, it is necessary to undergo the data organization process.
Data can be classified into two types:
1. Primary data
2. Secondary data
The primary importance of data collection in any research or business process is that it
helps to determine many important things about the company, particularly its
performance.
So, the data collection process plays an important role in all the streams.

Depending on the type of data, the data collection method is divided into two categories
namely,
1. Primary Data Collection methods
2. Secondary Data Collection methods
1. Primary Data Collection Methods
Primary data or raw data is a type of information that is obtained directly from the first-hand
source through experiments, surveys or observations.
The primary data collection method is further classified into two types.
They are:
1. Quantitative Data Collection Methods
2. Qualitative Data Collection Methods
Let us discuss the different methods performed to collect the data under these two data
collection methods.
1. Quantitative Data Collection Methods: These are based on mathematical calculations, using various
formats like closed-ended questions, correlation and regression methods, and mean, median or
mode measures.
2. Qualitative Data Collection Methods: These do not involve any mathematical calculations.
They are closely associated with elements that are not quantifiable. Qualitative
data collection methods include interviews, questionnaires, observations, case studies, etc.
There are several methods to collect this type of data. They are
1. Observation Method
Observation method is used when the study relates to behavioral science. This method is
planned systematically. It is subject to many controls and checks. The different types of
observations are:
• Structured and unstructured observation
• Controlled and uncontrolled observation
• Participant, non-participant and disguised observation

2. Interview Method
This method collects data in the form of oral or verbal responses. It is achieved in two ways:
• Personal Interview
• Telephonic Interview
3. Questionnaire Method: In this method, a set of questions is mailed to the respondents.
They should read, reply and subsequently return the questionnaire. The questions are printed
in a definite order on the form. A good survey should have the following features:
• Short and simple
• Should follow a logical sequence
• Provide adequate space for answers
• Avoid technical terms
4. Schedules
This method is similar to the questionnaire method, with a slight difference: the schedule is
filled in by an enumerator, who explains the aims and objectives of the investigation and can
remove any misunderstandings that come up.
Data Collection and APIs
We can programmatically collect data using:
1. Direct downloads / import of data.
2. Application Programming Interfaces (APIs).
We can download data from a website independently and then work with the data in Python.
Independently downloading and unzipping data each week is not efficient and does not
explicitly tie your data to your analysis.
You can automate the data download process using Python.
Automation is particularly useful when:

1. You want to download lots of data or particular subsets of data to support an
analysis.
2. There are programmatic ways to access and query the data online.
Link Data Access to Processing & Analysis
When you automate data access, download, or retrieval, and embed it in your code, you
are directly linking your analysis to your data.
Three Ways to Collect/Access Data
You can break up programmatic data collection into three general categories:
1. Data that you download by calling a specific URL and using the Pandas function
read_table, which takes in a url.
2. Data that you directly import into Python using the Pandas function read_csv.
3. Data that you download using an API, which makes a request to a data
repository and returns requested data.
Two Key Formats
The data that you access programmatically may be returned in one of two main formats:
1. Tabular Human-readable file: Files that are tabular, including CSV files
(Comma Separated Values) and even spreadsheets (Microsoft Excel, etc.). These
files are organized into columns and rows and are “flat” in structure rather than
hierarchical.
2. Structured Machine-readable files: Files that can be stored in a text format but
are hierarchical and structured in some way that optimizes machine readability.
JSON files are an example of structured machine-readable files.

Different ways of data collection


1. Download Files Programmatically
Pandas function read_csv()
Note that you can use the read_csv() function from Pandas to import data directly into Python
by providing a URL to the CSV file (e.g. read_csv(URL)).
When you programmatically read data into Python using read_csv(), you are not saving a
copy of your data locally on your computer - you are importing the data directly into Python.
If you want a copy of that data to use for future analysis without importing it again, you
will need to export the data to your working directory using the DataFrame method to_csv().
Below is an example of directly importing data into Python using read_csv(). The data are
average annual temperature in Canada from the World Bank.
import pandas as pd

# Import the CSV directly from the World Bank Climate Data API URL
url = "http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv"
df = pd.read_csv(url)
df.head()
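If you also want to keep a local copy for later reuse, a minimal sketch using the pandas to_csv() method (the output filename below is only an example):

# Save a local copy of the imported data for future analysis
# (the filename is illustrative)
df.to_csv("canada_annual_temperature.csv", index=False)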

2. Using Web-APIs
Web APIs allow you to access data available via an internet web interface.
Often you can access data from web APIs using a URL that contains a set of parameters that
specifies the type and particular subset of data that you are interested in.
If you have worked with a database such as Microsoft SQL Server or PostgreSQL, or if
you’ve ever queried data from a GIS system like ArcGIS, then you can compare the set of
parameters associated with a URL string to a SQL query.
Web APIs are a way to strip away all the extraneous visual interface that you don’t care about
and get the data that you want.
Why You Use Web APIs
Among other things, APIs allow us to:
• Get information that would be time-consuming to get otherwise.
• Get information that you can’t get otherwise.
• Automate analytical workflows that require continuously updated data.
• Access data using a more direct interface.
What is an API?
‘API’ stands for ‘Application Programming Interface’.
In simple words, an API is a contract between two pieces of software: if the user
software provides input in a pre-defined format, the other will extend its functionality and
provide the outcome to the user software.
Think of it like this: a graphical user interface (GUI) or command line interface (CLI) allows
humans to interact with code, whereas an Application Programming Interface (API) allows
one piece of code to interact with other code.
Basic elements of an API:
An API has three primary elements:
Access: who is allowed to ask for the data or services?
Request: the actual data or service being asked for (e.g., if I give you my current location from
my game (Pokemon Go), tell me the map around that place). A Request has two main parts:
Methods: i.e. the questions you can ask, assuming you have access (it also defines the
type of responses available).
Parameters: additional details you can include in the question or response.
Response: the data or service as a result of your request.
Over the past 15 years, we have seen tremendous advancements in data collection, data
storage, and analytic capabilities.
Businesses and governments now routinely analyse large amounts of data to improve
evidence-based decision-making.
Quandl provides a guided tour by archiving data in categories, in several curated data
collections.
The Response may give you one of two things:

1. Some data or
2. An explanation of why your request failed
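A minimal, hypothetical sketch of calling a web API from Python with the requests library (the endpoint URL and parameter names below are illustrative, not a real service):

import requests

# Hypothetical endpoint and query parameters - replace with a real API's
# documented URL and parameters
url = "https://api.example.com/v1/observations"
params = {"country": "CAN", "variable": "tas", "format": "json"}

response = requests.get(url, params=params, timeout=30)

if response.status_code == 200:   # the request succeeded
    data = response.json()        # parse the JSON response body
    print(data)
else:                             # the request failed: the API explains why
    print("Request failed:", response.status_code, response.text)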

Exploring and Fixing Data
1. Steps of Data Exploration and Preparation
2. Missing Value Treatment
o Why missing value treatment is required ?
o Why data has missing values?
o Which are the methods to treat missing value ?
3. Techniques of Outlier Detection and Treatment
o What is an outlier?
o What are the types of outliers ?
o What are the causes of outliers ?
o What is the impact of outliers on dataset ?
o How to detect outlier ?
o How to remove outlier ?
4. The Art of Feature Engineering
o What is Feature Engineering ?
o What is the process of Feature Engineering ?
o What is Variable Transformation ?
o When should we use variable transformation ?
o What are the common methods of variable transformation ?
o What is feature variable creation and its benefits ?

There are no shortcuts for data exploration.


If you think machine learning can sail you away from every data storm - it won't.
At some point, you'll realize that you are struggling to improve your model's accuracy.
In such situations, data exploration techniques will come to your rescue.

1. Steps of Data Exploration and Preparation


Remember: the quality of your inputs decides the quality of your output.
So, once you have got your business hypothesis ready, it makes sense to spend a lot of time and
effort here.
Data exploration, cleaning and preparation can take up to 70% of your total project time.
Below are the steps involved to understand, clean and prepare your data for building your
predictive model:

1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
Finally, we will need to iterate over steps 4 – 7 multiple times before we come up with our
refined model.
Let’s now study each stage in detail:-
1. Variable Identification
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and
category of the variables.
Let’s understand this step more clearly by taking an example.

Example:- Suppose we want to predict whether students will play cricket or not (refer to the
data set below). Here you need to identify the predictor variables, the target variable, the data
type of the variables and the category of the variables.

Below, the variables have been defined in their different categories:

2. Univariate Analysis
At this stage, we explore variables one by one. The method used to perform univariate analysis
depends on whether the variable is categorical or continuous.
Let’s look at these methods and statistical measures for categorical and continuous variables
individually:
Continuous Variables:- In case of continuous variables, we need to understand the central
tendency and spread of the variable. These are measured using various statistical metrics and
visualization methods, as shown below:

Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming
sections, we will look at methods to handle missing and outlier values. To know more
about these methods, you can refer to a descriptive statistics course such as the one from Udacity.
Categorical Variables:- For categorical variables, we’ll use a frequency table to understand the
distribution of each category. We can also read it as the percentage of values under each category. It
can be measured using two metrics, Count and Count%, against each category. A bar chart
can be used for visualization.
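A minimal pandas sketch of both cases (assuming df is a DataFrame that is already loaded; the column names "Age" and "Gender" are illustrative):

import pandas as pd

# Continuous variable: central tendency and spread
print(df["Age"].describe())     # count, mean, std, min, quartiles, max
print(df["Age"].skew())         # skewness of the distribution

# Categorical variable: frequency table (Count and Count%)
print(df["Gender"].value_counts())                   # Count
print(df["Gender"].value_counts(normalize=True))     # Count% (as fractions)
df["Gender"].value_counts().plot(kind="bar")         # bar chart visualization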

3. Bi-variate Analysis
Bi-variate Analysis finds out the relationship between two variables. Here, we look
for association and disassociation between variables at a pre-defined significance level. We can
perform bi-variate analysis for any combination of categorical and continuous variables. The
combination can be: Categorical & Categorical, Categorical & Continuous and Continuous &
Continuous. Different methods are used to tackle these combinations during analysis process.
Let’s understand the possible combinations in detail:
Continuous & Continuous: While doing bi-variate analysis between two continuous
variables, we should look at a scatter plot. It is a nifty way to find out the relationship between
two variables. The pattern of the scatter plot indicates the relationship between the variables. The
relationship can be linear or non-linear.

A scatter plot shows the relationship between two variables but does not indicate the strength
of the relationship between them. To find the strength of the relationship, we use Correlation.
Correlation varies between -1 and +1.

 -1: perfect negative linear correlation
 +1: perfect positive linear correlation
 0: no correlation
Correlation can be derived using following formula:
Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))
Various tools have functions or functionality to identify the correlation between variables. In Excel,
the function CORREL() is used to return the correlation between two variables, and SAS uses
the procedure PROC CORR to identify the correlation. These functions return the Pearson correlation
value to identify the relationship between two variables:

In the above example, we have a good positive relationship (0.65) between the two variables X and Y.
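A minimal pandas sketch of computing the Pearson correlation in Python (the column values are illustrative):

import pandas as pd

# Illustrative data for two continuous variables X and Y
df = pd.DataFrame({"X": [1, 2, 3, 4, 5, 6],
                   "Y": [2, 4, 5, 4, 6, 7]})

# Pearson correlation = Covariance(X,Y) / SQRT(Var(X) * Var(Y))
print(df["X"].corr(df["Y"]))   # correlation between one pair of variables
print(df.corr())               # full correlation matrix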

Categorical & Categorical: To find the relationship between two categorical variables, we
can use the following methods:

 Two-way table: We can start analyzing the relationship by creating a two-way table of
count and count%. The rows represent the categories of one variable and the columns
represent the categories of the other variable. We show the count or count% of observations
available in each combination of row and column categories.
 Stacked Column Chart: This method is more of a visual form of the two-way table.

 Chi-Square Test: This test is used to derive the statistical significance of the relationship
between the variables. It also tests whether the evidence in the sample is strong enough
to generalize the relationship to a larger population. Chi-square is based
on the difference between the expected and observed frequencies in one or more
categories in the two-way table. It returns a probability for the computed chi-square
statistic with its degrees of freedom.
Probability of 0: It indicates that both categorical variables are dependent.
Probability of 1: It shows that both variables are independent.
Probability less than 0.05: It indicates that the relationship between the variables is significant
at 95% confidence. The chi-square test statistic for a test of independence of two categorical
variables is found by:
Chi-square = SUM over all cells of (O - E)^2 / E
where O represents the observed frequency and E is the expected frequency under the null
hypothesis, computed for each cell by:
E = (row total * column total) / sample size

From the previous two-way table, the expected count for product category 1 to be of small size
is 0.22. It is derived by taking the row total for Size (9) times the column total for Product
category (2), then dividing by the sample size (81). This procedure is conducted for each cell.
Statistical measures used to analyze the power of the relationship are:

 Cramer’s V for Nominal Categorical Variable


 Mantel-Haenszel Chi-Square for ordinal categorical variables.
Different data science languages and tools have specific methods to perform the chi-square test. In
SAS, we can use Chisq as an option with Proc freq to perform this test.
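A minimal Python sketch using scipy (the counts in the two-way table are illustrative):

import numpy as np
from scipy.stats import chi2_contingency

# Illustrative two-way table of counts: rows are the categories of one
# variable, columns are the categories of the other variable
observed = np.array([[20, 15],
                     [30, 16]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)   # test statistic, probability, degrees of freedom
print(expected)             # expected frequencies under independence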
Categorical & Continuous: While exploring the relation between categorical and continuous
variables, we can draw box plots for each level of the categorical variable. If the number of levels
is small, this will not show the statistical significance. To look at the statistical significance we
can perform a Z-test, T-test or ANOVA.

 Z-Test/ T-Test:- Either test assesses whether the means of two groups are statistically
different from each other or not. If the probability of Z is small, then the difference of the two
averages is more significant. The T-test is very similar to the Z-test, but it is used when the
number of observations for both categories is less than 30.

 ANOVA:- It assesses whether the averages of more than two groups are statistically
different.
Example: Suppose, we want to test the effect of five different exercises. For this, we recruit
20 men and assign one type of exercise to 4 men (5 groups). Their weights are recorded after
a few weeks. We need to find out whether the effect of these exercises on them is
significantly different or not. This can be done by comparing the weights of the 5 groups of 4
men each.
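A minimal Python sketch of these tests with scipy (the group samples below are illustrative):

from scipy.stats import ttest_ind, f_oneway

# Illustrative weights (kg) recorded for groups following different exercises
group_a = [72, 75, 71, 74]
group_b = [78, 80, 79, 77]
group_c = [69, 70, 68, 71]

# T-test: are the means of two groups statistically different?
t_stat, p_value = ttest_ind(group_a, group_b)
print(t_stat, p_value)

# ANOVA: are the means of more than two groups statistically different?
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)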
Till here, we have understood the first three stages of Data Exploration: Variable Identification,
Univariate and Bivariate analysis. We also looked at various statistical and visual methods
to identify the relationship between variables.
Now, we will look at the methods of Missing Value Treatment. More importantly, we will also
look at why missing values occur in our data and why treating them is necessary.

2. Missing Value Treatment
Why is missing value treatment required?
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased
model because we have not analysed the behavior and relationship with other variables
correctly. It can lead to wrong prediction or classification.

Notice the missing values in the example above: in the first scenario, we have not treated the
missing values. The inference from this data set is that the chances of playing cricket for males
are higher than for females. On the other hand, if you look at the second table, which shows the data
after treatment of missing values (based on gender), we can see that females have higher
chances of playing cricket compared to males.
Why does my data have missing values?
We looked at the importance of treating missing values in a dataset. Now, let’s identify
the reasons for the occurrence of these missing values. They may occur at two stages:

1. Data Extraction: It is possible that there are problems with the extraction process. In such
cases, we should double-check for correct data with the data guardians. Some hashing
procedures can also be used to make sure the data extraction is correct. Errors at the data
extraction stage are typically easy to find and can be corrected easily as well.
2. Data collection: These errors occur at the time of data collection and are harder to correct.
They can be categorized into four types:
o Missing completely at random: This is the case when the probability of a value being
missing is the same for all observations. For example: respondents of a data
collection process decide that they will declare their earnings after tossing a fair
coin. If a head occurs, the respondent declares his / her earnings, and vice versa. Here
each observation has an equal chance of having a missing value.
o Missing at random: This is the case when a variable is missing at random and the
missing ratio varies for different values / levels of other input variables. For
example: we are collecting data for age, and females have a higher missing rate
compared to males.
o Missing that depends on unobserved predictors: This is the case when the
missing values are not random and are related to an unobserved input variable.
For example: in a medical study, if a particular diagnostic causes discomfort,
then there is a higher chance of dropping out of the study. This missing value is not
at random unless we have included “discomfort” as an input variable for all
patients.
o Missing that depends on the missing value itself: This is the case when the
probability of a missing value is directly correlated with the missing value itself. For
example: people with higher or lower incomes are likely to provide a non-response
about their earnings.
Which are the methods to treat missing values?

1. Deletion: It is of two types: list-wise deletion and pair-wise deletion.
o In list-wise deletion, we delete observations where any of the variables is missing.
Simplicity is one of the major advantages of this method, but it reduces
the power of the model because it reduces the sample size.
o In pair-wise deletion, we perform the analysis with all cases in which the variables
of interest are present. The advantage of this method is that it keeps as many cases as possible
available for analysis. One disadvantage is that it uses different
sample sizes for different variables.
o Deletion methods are used when the nature of the missing data is “missing
completely at random”; otherwise, non-random missing values can bias the model
output.
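A minimal pandas sketch of both deletion types (assuming df is a DataFrame that is already loaded):

# List-wise deletion: drop every row that has a missing value in any column
df_listwise = df.dropna()

# Pair-wise deletion: each analysis uses all rows where the variables of interest
# are present; for example, pandas' corr() already ignores missing values pairwise
corr_matrix = df.corr()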
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing values
with estimated ones. The objective is to employ known relationships that can be
identified in the valid values of the data set to assist in estimating the missing values.
Mean / Mode / Median imputation is one of the most frequently used methods. It
consists of replacing the missing data for a given attribute by the mean or median
(quantitative attribute) or mode (qualitative attribute) of all known values of that
variable. It can be of two types:-

1. Generalized Imputation: In this case, we calculate the mean or median of all
non-missing values of that variable and then replace the missing values with that mean or
median. As in the table above, the variable “Manpower” has missing values, so we take the
average of all non-missing values of “Manpower” (28.33) and then replace the
missing values with it.
2. Similar case Imputation: In this case, we calculate the average of the non-missing
values for gender “Male” (29.75) and “Female” (25) individually, and then
replace the missing values based on gender. For “Male“, we will replace missing
values of manpower with 29.75, and for “Female” with 25.
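A minimal pandas sketch of both approaches (assuming df is a DataFrame that is already loaded; the column names "Manpower" and "Gender" are illustrative):

# Generalized imputation: fill missing values with the overall mean
df["Manpower_overall"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: fill missing values with the mean of the same gender
df["Manpower_by_gender"] = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)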

3. Prediction Model: The prediction model is one of the more sophisticated methods for handling
missing data. Here, we create a predictive model to estimate values that will substitute for
the missing data. In this case, we divide our data set into two sets: one set with no
missing values for the variable and another with missing values. The first data set
becomes the training data set of the model, while the second data set, with missing values, is the test
data set, and the variable with missing values is treated as the target variable. Next, we create a
model to predict the target variable based on the other attributes of the training data set and
populate the missing values of the test data set. We can use regression, ANOVA, logistic
regression and various other modeling techniques to perform this. There are 2 drawbacks to
this approach:
o The model-estimated values are usually more well-behaved than the true values
o If there are no relationships between the other attributes in the data set and the attribute with
missing values, then the model will not be precise at estimating the missing values.
4. KNN Imputation: In this method of imputation, the missing values of an attribute are
imputed using a given number of instances (neighbours) that are most similar to the instance whose
values are missing. The similarity of two instances is determined using a distance
function.
o Advantages:
 k-nearest neighbours can predict both qualitative & quantitative attributes
 Creation of a predictive model for each attribute with missing data is not
required
 Attributes with multiple missing values can be easily treated
 The correlation structure of the data is taken into consideration
o Disadvantages:
 The KNN algorithm is very time-consuming when analyzing large databases. It
searches through the whole dataset looking for the most similar instances.
 The choice of k-value is very critical. A higher value of k would include
instances which are significantly different from the one we need, whereas a
lower value of k implies missing out on significant instances.
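A minimal scikit-learn sketch of KNN imputation (assuming df is a DataFrame of numeric columns that is already loaded):

import pandas as pd
from sklearn.impute import KNNImputer

# Impute each missing value from the 3 most similar rows (nearest neighbours)
imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)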

3. Techniques of Outlier Detection and Treatment
What is an Outlier?
Outlier is a term commonly used by analysts and data scientists, as outliers need close
attention, else they can result in wildly wrong estimations. Simply speaking, an outlier is an
observation that appears far away and diverges from the overall pattern in a sample.
Let’s take an example: we do customer profiling and find out that the average annual income
of customers is $0.8 million. But there are two customers having annual incomes of $4 and $4.2
million. These two customers’ annual incomes are much higher than the rest of the population’s.
These two observations will be seen as outliers.

What are the types of Outliers?


Outliers can be of two types: univariate and multivariate. Above, we have discussed an
example of a univariate outlier. These outliers can be found when we look at the distribution of a
single variable. Multivariate outliers are outliers in an n-dimensional space. In order to find
them, you have to look at distributions in multiple dimensions.
Let us understand this with an example. Let us say we are studying the relationship
between height and weight. Below, we have the univariate and bivariate distributions for Height and
Weight. Take a look at the box plot. We do not have any outliers (beyond 1.5*IQR above and below
the quartiles, the most common method). Now look at the scatter plot. Here, we have two values below
and one above the average in a specific segment of weight and height.

What causes Outliers?


Whenever we come across outliers, the ideal way to tackle them is to find out the reason for
having these outliers. The method to deal with them would then depend on the reason for their
occurrence. Causes of outliers can be classified into two broad categories:

1. Artificial (Error) / Non-natural
2. Natural.
Let’s understand various types of outliers in more detail:

 Data Entry Errors:- Human errors, such as errors caused during data collection,
recording, or entry, can cause outliers in data. For example: the annual income of a
customer is $100,000. Accidentally, the data entry operator puts an additional zero in
the figure. Now the income becomes $1,000,000, which is 10 times higher. Evidently,
this will be an outlier value when compared with the rest of the population.
 Measurement Error: It is the most common source of outliers. This is caused when
the measurement instrument used turns out to be faulty. For example: There are 10
weighing machines. 9 of them are correct, 1 is faulty. Weight measured by people on
the faulty machine will be higher / lower than the rest of people in the group. The
weights measured on faulty machine can lead to outliers.
 Experimental Error: Another cause of outliers is experimental error. For example: In
a 100m sprint of 7 runners, one runner missed out on concentrating on the ‘Go’
call which caused him to start late. Hence, this caused the runner’s run time to
be more than other runners. His total run time can be an outlier.
 Intentional Outlier: This is commonly found in self-reported measures that involve
sensitive data. For example: teens would typically under-report the amount of alcohol
that they consume. Only a fraction of them would report the actual value. Here the actual
values might look like outliers because the rest of the teens are under-reporting their
consumption.
 Data Processing Error: Whenever we perform data mining, we extract data from
multiple sources. It is possible that some manipulation or extraction errors may lead to
outliers in the dataset.
 Sampling error: For instance, we have to measure the height of athletes. By mistake,
we include a few basketball players in the sample. This inclusion is likely to cause
outliers in the dataset.
 Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier.
For instance: in my last assignment with one of the renowned insurance companies, I
noticed that the performance of the top 50 financial advisors was far higher than the rest of the
population. Surprisingly, it was not due to any error. Hence, whenever we performed any
data mining activity with advisors, we treated this segment separately.

What is the impact of Outliers on a dataset?


Outliers can drastically change the results of the data analysis and statistical modeling. There
are numerous unfavourable impacts of outliers in the data set:

 It increases the error variance and reduces the power of statistical tests
 If the outliers are non-randomly distributed, they can decrease normality
 They can bias or influence estimates that may be of substantive interest
 They can also violate the basic assumptions of regression, ANOVA and other statistical
models.
To understand the impact more deeply, let’s take an example to check what happens to a data set
with and without outliers.

Example:

As you can see, the data set with outliers has a significantly different mean and standard deviation.
In the first scenario, we would say that the average is 5.45. But with the outlier, the average soars to 30.
This would change the estimate completely.

How to detect Outliers?


The most commonly used method to detect outliers is visualization. We use various visualization
methods, like box plots, histograms and scatter plots (above, we have used a box plot and a scatter
plot for visualization). Some analysts also use various rules of thumb to detect outliers. Some of them
are:

 Any value that lies outside the range of Q1 - 1.5 x IQR to Q3 + 1.5 x IQR, where Q1 and Q3
are the first and third quartiles and IQR is the interquartile range
 Use capping methods. Any value which is out of the range of the 5th and 95th percentiles can be
considered an outlier
 Data points three or more standard deviations away from the mean are considered outliers
 Outlier detection is merely a special case of the examination of data for influential data
points, and it also depends on the business understanding
 Bivariate and multivariate outliers are typically measured using either an index of
influence or leverage, or a distance. Popular indices such as Mahalanobis’ distance and
Cook’s D are frequently used to detect outliers.
 In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and
influential observations, we also look at statistical measures like STUDENT, COOKD,
RSTUDENT and others.
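A minimal pandas sketch of the IQR, percentile and standard deviation rules (assuming s is a numeric pandas Series):

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Percentile rule: values outside the 5th and 95th percentiles
p05, p95 = s.quantile(0.05), s.quantile(0.95)
pct_outliers = s[(s < p05) | (s > p95)]

# 3-sigma rule: values three or more standard deviations from the mean
z_scores = (s - s.mean()) / s.std()
sigma_outliers = s[z_scores.abs() >= 3]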

How to remove Outliers?


Most of the ways to deal with outliers are similar to the methods for missing values: deleting
observations, transforming them, binning them, treating them as a separate group, imputing values
and other statistical methods. Here, we will discuss the common techniques used to deal with
outliers:
Deleting observations: We delete outlier values if they are due to data entry errors or data processing
errors, or if the outlier observations are very few in number. We can also use trimming at both ends
to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. The
natural log of a value reduces the variation caused by extreme values. Binning is also a form
of variable transformation. The Decision Tree algorithm deals with outliers well because it bins the
variable. We can also use the process of assigning weights to different observations.
Imputing: Like the imputation of missing values, we can also impute outliers. We can use mean,
median or mode imputation methods. Before imputing values, we should analyse whether it is a natural
or an artificial outlier. If it is artificial, we can go ahead with imputing values. We can also use a
statistical model to predict the values of outlier observations and then impute them with the predicted
values.
Treat separately: If there is a significant number of outliers, we should treat them separately in
the statistical model. One approach is to treat them as two different groups,
build individual models for both groups and then combine the output.
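A minimal pandas/NumPy sketch of two common treatments, capping at percentiles and a log transform (the "Income" column name is illustrative):

import numpy as np

# Capping: clip values to the 5th and 95th percentiles
lower, upper = df["Income"].quantile([0.05, 0.95])
df["Income_capped"] = df["Income"].clip(lower, upper)

# Transforming: the natural log reduces the variation caused by extreme values
# (log1p handles zeros; it cannot be applied to negative values)
df["Income_log"] = np.log1p(df["Income"])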

Till here, we have learnt about steps of data exploration, missing value treatment and
techniques of outlier detection and treatment. These 3 stages will make your raw data better
in terms of information availability and accuracy. Let’s now proceed to the final stage of data
exploration. It is Feature Engineering.

4. The Art of Feature Engineering
What is Feature Engineering?
Feature engineering is the science (and art) of extracting more information from existing data.
You are not adding any new data here, but you are actually making the data you already have
more useful.
For example, let’s say you are trying to predict footfall in a shopping mall based on dates. If
you try to use the dates directly, you may not be able to extract meaningful insights from the
data. This is because the footfall is less affected by the day of the month than it is by the day
of the week. Now this information about the day of the week is implicit in your data. You need to
bring it out to make your model better.
This exercise of bringing out information from data is known as feature engineering.

What is the process of Feature Engineering ?


You perform feature engineering once you have completed the first 5 steps in data exploration
– Variable Identification, Univariate, Bivariate Analysis, Missing Values
Imputation and Outliers Treatment. Feature engineering itself can be divided into 2 steps:

 Variable transformation.
 Variable / Feature creation.
These two techniques are vital in data exploration and have a remarkable impact on the power
of prediction. Let’s understand each of these steps in more detail.

What is Variable Transformation?


In data modelling, transformation refers to the replacement of a variable by a function of it. For
instance, replacing a variable x by its square / cube root or the logarithm of x is a transformation. In
other words, transformation is a process that changes the distribution of a variable or its relationship
with other variables.
Let’s look at the situations when variable transformation is useful.

When should we use Variable Transformation?


Below are the situations where variable transformation is a requisite:

 When we want to change the scale of a variable or standardize the values of a variable
for better understanding. While this transformation is a must if you have data in
different scales, this transformation does not change the shape of the variable
distribution

 When we can transform complex non-linear relationships into linear relationships.


The existence of a linear relationship between variables is easier to comprehend than
a non-linear or curved relation. Transformation helps us to convert a non-linear
relation into a linear relation. A scatter plot can be used to find the relationship between
two continuous variables. These transformations also improve the prediction. Log
transformation is one of the transformation techniques commonly used in these
situations.


 A symmetric distribution is preferred over a skewed distribution, as it is easier to
interpret and generate inferences from. Some modeling techniques require a normal
distribution of variables. So, whenever we have a skewed distribution, we can
use transformations which reduce skewness. For a right-skewed distribution, we take the
square / cube root or logarithm of the variable, and for a left-skewed one, we take the square / cube
or exponential of the variable.

 Variable Transformation is also done from an implementation point of view (Human


involvement). Let’s understand it more clearly. In one of my projects on employee
performance, I found that age has a direct correlation with the performance of the
employee, i.e. the higher the age, the better the performance. From an implementation stand
point, launching an age-based program might present an implementation challenge. However,
categorizing the sales agents in three age group buckets of <30 years, 30-45 years and
>45 years, and then formulating three different strategies for each group, is a judicious
approach. This categorization technique is known as Binning of Variables.

What are the common methods of Variable Transformation?


There are various methods used to transform variables. As discussed, some of them include
square root, cube root, logarithmic, binning, reciprocal and many others. Let’s look at these
methods in detail by highlighting the pros and cons of these transformation methods.

 Logarithm: Log of a variable is a common transformation method used to change the


shape of the distribution of the variable on a distribution plot. It is generally used for
reducing the right skewness of variables. However, it cannot be applied to zero or negative
values.

 Square / Cube root: The square and cube root of a variable has a sound effect on
variable distribution. However, it is not as significant as logarithmic transformation.
Cube root has its own advantage. It can be applied to negative values including zero.
Square root can be applied to positive values including zero.

 Binning: It is used to categorize variables. It is performed on original values, percentile


or frequency. Decision of categorization technique is based on business understanding.

For example, we can categorize income into three categories, namely High, Average and
Low. We can also perform co-variate binning, which depends on the values of more than
one variable.
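A minimal NumPy/pandas sketch of these transformation methods (the "Income" column name is illustrative):

import numpy as np
import pandas as pd

# Log transform: reduces right skewness (not applicable to negative values)
df["Income_log"] = np.log1p(df["Income"])

# Square / cube root: a milder effect; cube root also accepts zero and negative values
df["Income_sqrt"] = np.sqrt(df["Income"])
df["Income_cbrt"] = np.cbrt(df["Income"])

# Binning: categorize income into three buckets
df["Income_band"] = pd.cut(df["Income"], bins=3, labels=["Low", "Average", "High"])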

What is Feature / Variable Creation & its Benefits?


Feature / Variable creation is a process to generate new variables / features based on existing
variable(s). For example, say we have a date (dd-mm-yy) as an input variable in a data set. We
can generate new variables like day, month, year, week and weekday that may have a better
relationship with the target variable. This step is used to highlight the hidden relationships in a
variable:

There are various techniques to create new features. Let’s look at some of the commonly
used methods:

 Creating derived variables: This refers to creating new variables from existing
variable(s) using a set of functions or different methods. Let’s look at it through the “Titanic
– Kaggle competition”. In this data set, the variable age has missing values. To predict the
missing values, we used the salutation (Master, Mr, Miss, Mrs) from the name as a new
variable. How do we decide which variable to create? Honestly, this depends on the
business understanding of the analyst, their curiosity and the set of hypotheses they might
have about the problem. Methods such as taking the log of variables, binning variables and
other methods of variable transformation can also be used to create new variables.

 Creating dummy variables: One of the most common applications of dummy variables
is to convert a categorical variable into numerical variables. Dummy variables are also
called Indicator Variables. They are useful for taking a categorical variable as a predictor in
statistical models. A dummy variable can take the values 0 and 1. Let’s take a variable
‘gender’. We can produce two variables, namely “Var_Male” with values 1 (Male)
and 0 (No male) and “Var_Female” with values 1 (Female) and 0 (No Female). We
can also create dummy variables for more than two classes of a categorical variable
with n or n-1 dummy variables.
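A minimal pandas sketch of both techniques (the column names "date" and "gender" are illustrative):

import pandas as pd

# Derived variables: extract day, month, year and weekday from a dd-mm-yy date column
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%y")
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year
df["weekday"] = df["date"].dt.dayofweek

# Dummy (indicator) variables: one 0/1 column per category of 'gender',
# e.g. Var_Male and Var_Female
dummies = pd.get_dummies(df["gender"], prefix="Var")
df = pd.concat([df, dummies], axis=1)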

Data Storage
Data storing in a data science process refers to storing useful data which you may use in
your data science process to dig actionable insights out of it. Data storing in data science
is itself an orderly process in which many things need to be kept in consideration.

1. Identify your goals:


First of all, if you want to store data for doing data science, the foremost task for you is to
have a clear strategy for saving data.
As the identification of goals is the first step in the process of data storage, it should be prioritized
by you as a Data Scientist, because all the following steps will depend on it.
There’s a pattern for goal setting in the software industry known as OKR — Objectives and
Key Results.
It was introduced in 1999 by the famous venture capitalist John Doerr to Google’s founding team,
and the rest is history. Most tech giants, like Amazon, Zalando and Intel, use OKRs to
this day for goal setting.
OKRs give us a clear strategy to only measure those actionable things which matter most to
us.
Prioritizing things may depend on revenue, scope and your resources. According to the OKR
strategy, objectives should be clear, concise, concrete, action-oriented and ideally
inspirational.
The same goes for your goals in the data science process. You should always choose goals which
are actionable and practical.
Measuring KPIs — Key Performance Indicators
KPIs are measurable values which are decisive in achieving your business objectives.
For example, if you have a job application management website, you may be interested in
the number of candidates who applied in the last month, and how many got interviewed, hired or
rejected, etc. You may then allocate appropriate resources to handle this user base.

2. Big data or small data :


The next thing, after you have a clear-cut goal, is to decide which type of data you need. This
decision depends entirely on your goal and on your resources.
Big data is normally data which needs to be stored on different servers and which comes
from multiple sources. It may come from sources which are continuously generating huge amounts of
data. It usually has a lot of noise and is unstructured.
Small data is traditional data which is structured, usually stored in databases by us, and over which you
have full control.
For example: consider you are in a large organization which works on targeted digital
marketing. If you want to segment the users better, then it is obvious that you will need
big data. It may involve storing massive social media statistics, machine data, or
customer transactions every day.
On the other hand, if you are working on a small-scale project, or on a part of the big data to see the
overall behaviour of your data, then you will be dealing with small or traditional data.
Resource allocation is important in this step, because you may need additional servers to
store big data. Also, in the coming steps of the data science process you will need special tools to
deal with it. So, you should definitely keep all these things in mind before moving on.

3. Avoid data fatigue : Data fatigue refers to the over-storage or over-measurement of data which is
useless to you or doesn’t align with your data collection goals.
The most common problem for today’s Data Scientists is noisy or incorrect data.
To address this issue properly, one needs to focus on the needs of the specific problem he/she is
solving and then collect the data accordingly.
A Data Scientist can also use some tricks when storing data to reduce its size.
For instance, if you need the latitude and longitude of a place, then you can store this data in the
form of geocodes.
Geocodes can be decoded using some basic packages in Python or R. This can significantly
reduce the size of your data.
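As a hedged illustration of this idea (assuming the third-party pygeohash package; the package choice and coordinates are illustrative), a latitude/longitude pair can be stored as a short geohash string and decoded back when needed:

import pygeohash as pgh   # assumed third-party package: pip install pygeohash

# Encode a latitude/longitude pair as a compact geohash string
code = pgh.encode(28.6139, 77.2090, precision=7)
print(code)

# Decode the geohash back to an approximate latitude/longitude
lat, lon = pgh.decode(code)
print(lat, lon)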
The common steps for avoiding data fatigue are :
i. Don’t forget to ask around for existing processes :
Each company which works on data has some process for managing data. So, it is always
good to look out for the existing processes followed in the company. Starting from scratch is
difficult. Study those existing processes and find out ways to improve them.
ii. Stop thinking about objectives which are not actionable :
If it is obvious that a goal is not needed, then don’t waste your time working on
it. As a Data Scientist, it is your responsibility to find out ahead of time what your
objectives should be in terms of strategy, innovation and execution.
iii. Don’t expect your data storage mechanism to be perfect :
There is always room for improvement in any process. The data storage process follows the
same principle. Never blindly expect your process to do all the things for you. Keep a human
in the loop at times so that they may exercise their intuition, which can lead to further
improvements.
iv. Don’t work in isolation :
Never work in departmental silos. Keep your Database Administrators on board with you. They
may help you with architecture-related things. Plus, you can help them by letting them know
about your needs as a Data Scientist.
v. Learn the difference between intelligent filtering and correlation :
Statisticians say that correlation doesn’t imply causation, and they are not wrong. If you have
heard that a competitor improved its revenue performance by doing certain things, it does not mean
that the same process will work for you. In the end, you will need to use your own judgement to know
what your needs are and what data you will need to meet them.
4. Data management :
Decide which one to use — SQL or NoSQL
The final thing which you have to decide during the process of data storage is whether to use
SQL-based databases or NoSQL ones.
Both have their own advantages and disadvantages and are made to deal with particular
applications.
To make our decision easy, let us limit our discussion to MongoDB and MySQL, which are
the spearheads of the two types of DBMS.
First of all, be clear that neither MongoDB nor MySQL can suit all types of
applications in Data Science.
If we are collecting data which is semi-structured or unstructured, then MongoDB should be
used, because in this case complex queries like joins are slower in MySQL.

We mostly have this situation with Big Data, where the speed of processing is also a primary
concern.
However, if your data is highly structured or you already know the ins and outs of your data,
then MySQL may be the best option.
This is because you can manipulate the data very well, and changes can be made relatively
quickly on structured data using SQL compared to a NoSQL platform like MongoDB.
There are also other NoSQL and SQL alternatives which may be more suitable for the
applications you are working on. So, it is good to check which may cater to your needs.
For instance, MariaDB offers more and better storage engines compared to other relational
databases.
Using Cassandra, NoSQL support is also available in MariaDB, enabling you to run SQL and
NoSQL in a single database system.
MariaDB also supports TokuDB, which can handle big data for large organizations and
corporate users.

Data Management
Data management: is the practice of collecting, keeping, and using data securely, efficiently,
and cost-effectively.
The goal of data management is to help people, organizations, and connected things optimize
the use of data within the bounds of policy and regulation so that they can make decisions
and take actions that maximize the benefit to the organization.
A robust data management strategy is becoming more important than ever as organizations
increasingly rely on intangible assets to create value.

Managing digital data in an organization involves a broad range of tasks, policies,


procedures, and practices. The work of data management has a wide scope, covering factors
such as how to:

1. Create, access, and update data across a diverse data tier


2. Store data across multiple clouds and on premises
3. Provide high availability and disaster recovery
4. Use data in a growing variety of apps, analytics, and algorithms
5. Ensure data privacy and security
6. Archive and destroy data in accordance with retention schedules and compliance
requirements
A formal data management strategy addresses the activity of users and administrators, the
capabilities of data management technologies, the demands of regulatory requirements, and
the needs of the organization to obtain value from its data.

Data Management Systems Today


Today’s organizations need a data management solution that provides an efficient way to
manage data across a diverse but unified data tier. Data management systems are built on data
management platforms and can include databases, data lakes and warehouses, big data
management systems, data analytics, and more.

All these components work together as a “data utility” to deliver the data management
capabilities an organization needs for its apps, and the analytics and algorithms that use the
data originated by those apps. Although current tools help database administrators (DBAs)
automate many of the traditional management tasks, manual intervention is still often
required because of the size and complexity of most database deployments. Whenever
manual intervention is required, the chance for errors increases. Reducing the need for
manual data management is a key objective of a new data management technology, the
autonomous database.

Data management platform
A data management platform is the foundational system for collecting and analyzing large
volumes of data across an organization. Commercial data platforms typically include
software tools for management, developed by the database vendor or by third-party vendors.
These data management solutions help IT teams and DBAs perform typical tasks such as:

 Identifying, alerting, diagnosing, and resolving faults in the database system or


underlying infrastructure
 Allocating database memory and storage resources
 Making changes in the database design
 Optimizing responses to database queries for faster application performance
The increasingly popular cloud data platforms allow businesses to scale up or down quickly
and cost-effectively. Some are available as a service, allowing organizations to save even
more.
Based in the cloud, an autonomous database uses artificial intelligence (AI) and machine
learning to automate many data management tasks performed by DBAs, including managing
database backups, security, and performance tuning.
Also called a self-driving database, an autonomous database offers significant benefits for
data management, including:
 Reduced complexity

 Decreased potential for human error


 Higher database reliability and security
 Improved operational efficiency
 Lower costs

Big Data Management Systems


In some ways, big data is just what it sounds like—lots and lots of data. But big data also
comes in a wider variety of forms than traditional data, and it’s collected at a high rate of
speed. Think of all the data that comes in every day, or every minute, from a social media
source such as Facebook. The amount, variety, and speed of that data are what make it so
valuable to businesses, but they also make it very complex to manage.
As more and more data is collected from sources as disparate as video cameras, social media,
audio recordings, and Internet of Things (IoT) devices, big data management systems have
emerged. These systems specialize in three general areas.
Big data integration brings in different types of data—from batch to streaming—and
transforms it so that it can be consumed.
Big data management stores and processes data in a data lake or data warehouse efficiently,
securely, and reliably, often by using object storage.
Big data analysis uncovers new insights with analytics and uses machine learning and AI
visualization to build models.

Companies are using big data to improve and accelerate product development, predictive
maintenance, the customer experience, security, operational efficiency, and much more. As
big data gets bigger, so will the opportunities.
Data Management Challenges
Most of the challenges in data management today stem from the faster pace of business and
the increasing proliferation of data. The ever-expanding variety, velocity, and volume of data
available to organizations is pushing them to seek more-effective management tools to keep
up. Some of the top challenges organizations face include the following:
1. They don’t know what data they have
2. They must maintain performance levels as the data tier expands
3. They must meet constantly changing compliance requirements
4. They aren’t sure how to repurpose data to put it to new uses
5. They must keep up with changes in data storage

1. They don’t know what data they have: Data from an increasing number and variety of sources
such as sensors, smart devices, social media, and video cameras is being collected and stored. But
none of that data is useful if the organization doesn’t know what data it has, where it is, and how
to use it.

2. They must maintain performance levels as the data tier expands: Organizations are capturing,
storing, and using more data all the time. To maintain peak response times across this expanding
tier, organizations need to continuously monitor the type of questions the database is answering
and change the indexes as the queries change, without affecting performance.

3. They must meet constantly changing compliance requirements: Compliance regulations are
complex and multijurisdictional, and they change constantly. Organizations need to be able to
easily review their data and identify anything that falls under new or modified requirements. In
particular, personally identifiable information (PII) must be detected, tracked, and monitored for
compliance with increasingly strict global privacy regulations.

4. They aren’t sure how to repurpose data to put it to new uses: Collecting and identifying the data
itself doesn’t provide any value; the organization needs to process it. If it takes a lot of time and
effort to convert the data into what they need for analysis, that analysis won’t happen. As a result,
the potential value of that data is lost.

5. They must keep up with changes in data storage: In the new world of data management,
organizations store data in multiple systems, including data warehouses and unstructured data
lakes that store any data in any format in a single repository. An organization’s data scientists need
a way to quickly and easily transform data from its original format into the shape, format, or model
they need it to be in for a wide array of analyses.

Data Management Best Practices
Addressing data management challenges requires a comprehensive, well-thought-out set of
best practices. Although specific best practices vary depending on the type of data involved
and the industry, the following best practices address the major data management challenges
organizations face today:
1. Create a discovery layer to identify your data
2. Develop a data science environment to efficiently repurpose your data
3. Use autonomous technology to maintain performance levels across your expanding
data tier
4. Use discovery to stay on top of compliance requirements
5. Use a common query layer to manage multiple and diverse forms of data storage

Create a discovery layer to identify your data: A discovery layer on top of your organization's data tier allows analysts and data scientists to search and browse for datasets, making your data usable.

Develop a data science environment to efficiently repurpose your data: A data science environment automates as much of the data transformation work as possible, streamlining the creation and evaluation of data models. A set of tools that eliminates the need for manual transformation of data can expedite the hypothesizing and testing of new models.

Use autonomous technology to maintain performance levels across your expanding data tier: Autonomous data capabilities use AI and machine learning to continuously monitor database queries and optimize indexes as the queries change. This allows the database to maintain rapid response times and frees DBAs and data scientists from time-consuming manual tasks.

Use discovery to stay on top of compliance requirements: New tools use data discovery to review data and identify the chains of connection that need to be detected, tracked, and monitored for multijurisdictional compliance. As compliance demands increase globally, this capability is going to be increasingly important to risk and security officers.

Use a common query layer to manage multiple and diverse forms of data storage: New technologies are enabling data management repositories to work together, making the differences between them disappear. A common query layer that spans the many kinds of data storage enables data scientists, analysts, and applications to access data without needing to know where it is stored and without needing to manually transform it into a usable format.
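As an illustration of the last practice, the sketch below uses the open-source DuckDB engine as a common query layer: one SQL statement reads a flat file, a data-lake Parquet file, and an in-memory pandas DataFrame. The file names and columns are hypothetical, so treat this as a minimal sketch rather than a recommended architecture.

```python
# Sketch of a "common query layer": one SQL statement over a CSV file, a
# Parquet file, and an in-memory pandas DataFrame. The file names and columns
# are hypothetical; DuckDB resolves the DataFrame by its variable name.
import duckdb
import pandas as pd

# An in-memory DataFrame, e.g. freshly extracted from an API.
recent_orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount": [120.0, 75.5, 210.0],
})

result = duckdb.sql("""
    WITH all_orders AS (
        SELECT customer_id, amount FROM 'historical_orders.parquet'  -- data lake file
        UNION ALL
        SELECT customer_id, amount FROM recent_orders                -- pandas DataFrame
    )
    SELECT c.region, SUM(a.amount) AS total_amount
    FROM 'customers.csv' AS c                                        -- flat file
    JOIN all_orders AS a ON a.customer_id = c.customer_id
    GROUP BY c.region
""").df()

print(result)
```

The point of the sketch is that the analyst writes one query and never has to care which storage system each table lives in.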

Using Data from Multiple Sources


At the end of the lecture you will be able to understand:


1. What is Big Data?
2. What is ETL?
3. How To Extract Data from Multiple Sources
4. How to Combine and Merge Data from Multiple Sources
5. Challenges of Using Data from Multiple Sources
6. Problems with Merging Data

What is Big Data?


Big data is exactly what it sounds like: the use of extremely large and/or extremely complex
datasets that stretch the capabilities of standard BI and analytics tools.
While there’s no formal definition for what exactly makes a dataset “big,” the U.S. National
Institute of Standards and Technology defines big data as:
“extensive datasets—primarily in the characteristics of volume, variety, velocity, and/or
variability—that require a scalable architecture for efficient storage, manipulation, and
analysis.”
As this definition suggests, there are several qualities that make big data distinct from
traditional data analytics methods:
 Volume: The data may be intimidating due to its sheer size.
 Variety: The data may come in many different forms or file formats, making it harder
to integrate.
 Velocity: The data may arrive very rapidly in real-time, requiring you to constantly
process it.
 Variability: The data’s meaning may frequently change, or the data may have serious
flaws and errors.
Dealing with big data is one of the greatest challenges for modern BI and analytics
workflows. The good news is that when implemented correctly, ETL can help you collect and
process big data—no matter its size or location—to get better insights, monitor historical
trends, and make smarter data-driven decisions.

What is ETL?
ETL (extract, transform, load) is the dominant paradigm for efficiently getting data from
multiple sources into a single location, where it can be used for self-service queries and data
analytics. As the name suggests, ETL consists of three sub-processes:
 Extract: Data is first extracted from its source location(s). These sources may be—
but are not limited to—files, websites, software applications, and relational and non-
relational databases.
 Transform: The extracted data is then transformed to make it suitable for its new
purpose. Depending on the ETL workflow, the transformation stage
may include:
o Adding or removing data rows, columns, and/or fields.
o Deleting duplicate, out-of-date, and/or extraneous data.
o Joining multiple data sources together.
o Converting data in one format to another (e.g. date/time formats or
imperial/metric units).

 Load: Finally, the transformed data is loaded into the target location. This is usually
a data warehouse, a specialized system intended for real-time BI, analytics, and
reporting.
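To make the three sub-processes concrete, here is a minimal ETL sketch in Python using pandas, with SQLite standing in for the target warehouse. The file name, API URL, and column names are invented for illustration only.

```python
# Minimal ETL sketch with pandas; SQLite stands in for the target warehouse.
# The file name, API URL, and column names are illustrative only.
import sqlite3
import pandas as pd

# --- Extract: pull data from two hypothetical sources ----------------------
orders = pd.read_csv("regional_sales.csv")                        # file source
customers = pd.read_json("https://example.com/api/customers")     # web API source

# --- Transform: clean and reshape the extracted data -----------------------
orders = orders.drop_duplicates()                                 # remove duplicate rows
orders["order_date"] = pd.to_datetime(orders["order_date"])       # unify the date format
orders = orders.merge(customers, on="customer_id", how="left")    # join the two sources
orders = orders.drop(columns=["internal_notes"], errors="ignore") # drop extraneous fields

# --- Load: write the transformed data into the target store ----------------
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_sales", conn, if_exists="replace", index=False)
```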
What is ELT?
ELT stands for "Extract, Load, and Transform." In this process, raw data is loaded directly into
the target data warehouse, and the transformations are performed by the warehouse itself, so
there is no need for a separate staging area. ELT uses cloud-based data warehousing solutions
for all types of data, including structured, semi-structured, unstructured, and even raw data.
1. ELT is a relatively new technology, made possible because of modern, cloud-based
warehouse server technologies – endless storage and scalable processing power.
For example, platforms like Amazon Redshift and Google BigQuery make ELT pipelines
possible because of their incredible processing capabilities.
2. Ingest anything and everything as the data becomes available: ELT is paired with a
data lake which lets you ingest an ever-expanding pool of raw data immediately, as it
becomes available.
There's no requirement to transform the data into a special format before saving it in
the data lake.
3. Transforms only the data you need: ELT transforms only the data required for a
particular analysis.
Although it can slow down the process of analyzing the data, it offers more
flexibility—because you can transform the data in different ways on the fly to
produce different types of metrics, forecasts, and reports.
Conversely with ETL, the entire ETL pipeline—and the structure of the data in the
OLAP warehouse—may require modification if the previously-decided structure
doesn't allow for a new type of analysis.
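For contrast, here is a minimal ELT sketch: the raw extract is loaded as-is and the transformation runs as SQL inside the target system. SQLite again stands in for a cloud warehouse such as Redshift or BigQuery, and the file, table, and column names are assumptions.

```python
# Minimal ELT sketch: the raw extract is loaded as-is and the transformation
# runs as SQL inside the target system. SQLite stands in for a cloud warehouse;
# the file, table, and column names are assumptions.
import sqlite3
import pandas as pd

raw = pd.read_csv("clickstream_raw.csv")   # hypothetical raw extract

with sqlite3.connect("warehouse.db") as conn:
    # Load: land the raw data with no staging area and no pre-processing.
    raw.to_sql("raw_clickstream", conn, if_exists="replace", index=False)

    # Transform: only the slice needed for this analysis, inside the warehouse.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS daily_page_views AS
        SELECT date(event_time) AS day, page, COUNT(*) AS views
        FROM raw_clickstream
        WHERE event_type = 'page_view'
        GROUP BY date(event_time), page
    """)
```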

Differences between ETL and ELT.

1. ETL is the Extract, Transform, and Load process for data; ELT is the Extract, Load, and Transform process for data.
2. In ETL, data moves from the data source (operational databases or other sources) to a staging area and then into the data warehouse. In ELT, there is no separate staging area.
3. In ETL, transformations happen within a staging area outside the data warehouse. In ELT, transformations happen inside the target data system itself.
4. ETL can be used to structure unstructured data, but it can't be used to pass unstructured data into the target system. ELT is a solution for loading unstructured data into a data lake (a data store that accepts any kind of structured or unstructured data) and making it available to business intelligence systems.
5. ETL typically deals with structured data; ELT is best when dealing with massive amounts of structured and unstructured data.
6. ETL is considered more reliable than ELT; ELT is less reliable than ETL.
7. In ETL, the source and target databases are different (e.g., an Oracle source and an SAP target database). In ELT, data is loaded into and transformed in the same target database.
8. ETL suits small or moderate data volumes; ELT suits large data volumes.
9. In ETL, data transformations are complex; in ELT, transformations are less complex.
10. Advantages of ETL: the pre-structured nature of the OLAP data warehouse; after structuring/transforming the data, ETL allows for speedier, more efficient, more stable data analysis; ETL can perform sophisticated data transformations and can be more cost-effective. Advantages of ELT: flexibility and ease of storing new, unstructured data; high speed; low maintenance; quicker loading.
How to Extract Data from Multiple Sources?
Extracting data from multiple sources is an involved process that requires contemplation and
planning.
The steps required to extract data from multiple sources are:
Step 1: Decide Which Sources to Use. Identify which data you want to extract and decide which sources hold it.
Step 2: Choose the Extraction Method: ETL or ELT
Step 3: Estimate the Size of the Extraction
Step 4: Connect to the Data Sources
Each data source may have its own API (application programming interface) or connector to
help with the extraction process. If you can’t easily connect to a given data source, you may
have to build a custom integration.
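The sketch below shows what connecting to different kinds of sources can look like in Python: one source is reached through a web API, one through a database connector, and one as a flat file. The URL, database file, table, and CSV name are hypothetical placeholders.

```python
# Sketch of connecting to three different kinds of sources. The API URL,
# database file, table, and CSV name are hypothetical placeholders.
import sqlite3
import requests
import pandas as pd

# Source 1: a web API exposed by a SaaS application.
response = requests.get("https://example.com/api/v1/tickets", timeout=30)
response.raise_for_status()
tickets = pd.DataFrame(response.json())

# Source 2: a relational database reached through a standard connector.
with sqlite3.connect("crm.db") as conn:
    accounts = pd.read_sql_query("SELECT account_id, name, region FROM accounts", conn)

# Source 3: a flat file exported from a legacy system.
legacy_orders = pd.read_csv("legacy_orders.csv")
```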

How to Combine and Merge Data from Multiple Sources?


Once you’ve extracted data from multiple sources, you need to combine and merge it before
loading it into the target destination.
The most important steps of combining and merging data from multiple sources are:

Step 1: Data Cleansing


Data cleansing involves deleting information that is old, inaccurate, duplicate, or out-of-date,
as well as detecting and correcting errors and typos.
Step 2: Data Reconciliation
Data reconciliation is the identification, integration, and standardization of different data
records that refer to the same entity.
Step 3: Data Summarization
Data summarization creates new data records by performing operations on existing records.
For example, this step might involve adding up sales figures in different regions to come up
with the company’s total sales last quarter.
Step 4: Data Filtering
Data filtering ignores irrelevant information by selecting only certain rows, columns, and
fields from a larger dataset.
Step 5: Data Aggregation
Data aggregation combines data from multiple sources so that it can be presented in a more
digestible, understandable format.
The 5 steps above are just a small sample of how you can merge and transform data from
different sources during ETL.
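A condensed pandas sketch of the five steps is shown below; the two input files, their columns, and the region codes are invented purely for illustration.

```python
# Condensed sketch of the five merge steps with pandas. The input files and
# their columns are invented for illustration.
import pandas as pd

store_sales = pd.read_csv("store_sales.csv")    # order_id, region, amount, status
online_sales = pd.read_csv("online_sales.csv")  # order_id, region_code, amount, status

# 1. Cleansing: drop duplicate rows and records with missing amounts.
store_sales = store_sales.drop_duplicates(subset="order_id")
online_sales = online_sales.drop_duplicates(subset="order_id").dropna(subset=["amount"])

# 2. Reconciliation: standardize fields that describe the same thing differently.
region_map = {"N": "North", "S": "South", "E": "East", "W": "West"}
online_sales["region"] = online_sales["region_code"].map(region_map)
online_sales = online_sales.drop(columns="region_code")

combined = pd.concat([store_sales, online_sales], ignore_index=True)

# 3. Summarization: create new records by operating on existing ones.
regional_totals = combined.groupby("region", as_index=False)["amount"].sum()

# 4. Filtering: keep only the rows and columns the analysis needs.
completed = combined.loc[combined["status"] == "completed",
                         ["order_id", "region", "amount"]]

# 5. Aggregation: present the merged data in a more digestible form.
summary = completed.groupby("region").agg(orders=("order_id", "count"),
                                          revenue=("amount", "sum"))
print(regional_totals)
print(summary)
```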
What Are The Challenges with Using Data from Multiple Sources?
Using data from multiple sources is necessary for modern BI and analytics, but it can lead to
data quality issues if you’re not careful. The challenges associated with using data from
multiple sources include:
Problem 1: Heterogeneous Data
Different data sources may store data in different ways, using different data formats. This
issue is known as “heterogeneous data.” For example, you may need to take data from files,
web APIs, databases, CRM systems, and more. What's more, this information may be
structured, semi-structured, or unstructured data.
Solution 1: Increased Visibility
Solving the challenge of heterogeneous data requires you to know exactly which data sources
you'll be pulling from and how each data source stores information. These answers are crucial
for deciding how to treat each data source during the extract and transform phases
of ETL.
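One lightweight way to build this visibility is to keep an explicit registry of every source, its format, and how it is extracted. The sketch below is a hypothetical example of such a registry; the source names, connection strings, and extraction notes are assumptions, not a prescribed schema.

```python
# A small registry that makes source heterogeneity visible before ETL begins.
# All entries are hypothetical examples.
DATA_SOURCES = {
    "crm_contacts": {
        "kind": "relational database",
        "format": "structured (tables)",
        "connection": "postgresql://crm-host/crm",
        "extract": "nightly full export",
    },
    "support_tickets": {
        "kind": "web API",
        "format": "semi-structured (JSON)",
        "connection": "https://example.com/api/v1/tickets",
        "extract": "incremental, by updated_at",
    },
    "call_recordings": {
        "kind": "object storage",
        "format": "unstructured (audio files plus metadata)",
        "connection": "s3://example-bucket/recordings/",
        "extract": "metadata only",
    },
}

for name, source in DATA_SOURCES.items():
    print(f"{name}: {source['kind']} / {source['format']}")
```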

Problem 2: Data Integrations
Each data source you use needs to be integrated with the larger ETL workflow. Not only is
this a complex and technically demanding undertaking, but it can also break your ETL
process if the structure of the underlying data source changes.

Solution 2: Greater Connectivity


Since every data source is different, you may get lucky if the source already has an existing
API or connector—or in the worst case, you may have to build your own custom integrations,
which is very time- and labor-intensive.
Instead, it's better to have a robust solution for ETL data integration that can automatically
connect to a wide variety of data sources.
Problem 3: Scalability
As your business grows, you’ll likely want to integrate data from more data sources.
If you don’t plan for efficiency and scalability, however, this can majorly slow down your
ETL process—especially if you’re working with big data.
Solution 3: Good System Design
When it comes to scalability challenges, the good news is that you can use both horizontal
scaling (adding more machines) and vertical scaling (adding more resources to a machine) for
your ETL workflow.
For example, you can use techniques such as massively parallel processing (MPP) to
simultaneously extract information from many different sources.
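As a small-scale illustration of the same idea, the sketch below extracts several (hypothetical) files concurrently with a thread pool, so I/O-bound extractions overlap instead of running one after another.

```python
# Sketch of parallel extraction with a thread pool. The extractor function and
# source file names are illustrative.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def extract_csv(path: str) -> pd.DataFrame:
    # Each call is I/O-bound, so several can run concurrently.
    return pd.read_csv(path)

sources = ["sales_eu.csv", "sales_us.csv", "sales_apac.csv"]  # hypothetical files

with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(extract_csv, sources))

combined = pd.concat(frames, ignore_index=True)
```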

What Are Some Problems with Merging Data?


Even once you’ve collected data from multiple sources, the potential obstacles aren’t over.
When merging data, look out for the following challenges:
Problem 1: Duplicate and Conflicting Data
Multiple sources may have the same data, requiring you to detect and remove these
duplicates. Even worse, the sources may not agree with each other, forcing you to figure out
which of them is correct.
Solution 1: Clear Transformation Rules
Solving the problem of duplicate and conflicting data requires you to have well-defined,
robust transformation rules that can detect these problems. Data integration tools like Xplenty
come with features and components that help you detect and filter duplicate data.
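A minimal sketch of such a transformation rule is shown below, assuming hypothetical CRM and billing extracts that share an order_id and an updated_at column: exact duplicates are dropped, and conflicts are resolved in favor of the most recently updated record.

```python
# Sketch of a rule for duplicate and conflicting records: exact duplicates are
# dropped, and conflicts are resolved by keeping the newest row per order_id.
# File and column names are invented.
import pandas as pd

crm_orders = pd.read_csv("crm_orders.csv")          # order_id, amount, updated_at
billing_orders = pd.read_csv("billing_orders.csv")  # same columns, possibly conflicting

merged = pd.concat([crm_orders, billing_orders], ignore_index=True)
merged["updated_at"] = pd.to_datetime(merged["updated_at"])

deduplicated = (
    merged.drop_duplicates()                                # identical rows
          .sort_values("updated_at")
          .drop_duplicates(subset="order_id", keep="last")  # conflicts: keep newest
)
```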
Problem 2: Reconciling Information
Two different data sources may refer to the same entity differently. For example, one source
may record users’ gender as “male” and “female,” while another records gender as “M” and
“F.” And this is just a simple case—reconciling data consistency issues can be a lot more
complicated.
Solution 2: Clear Transformation Rules
Again, defining clear transformation rules will help you automate the vast majority of the
data reconciliation process. As you get more familiar with the ETL process, you'll get a better
sense of which kinds of reconciliations need to be performed over your data sources. It's a
good idea to use an ETL data integration solution that can perform many of these
reconciliations automatically.
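For the gender example above, a transformation rule can be as simple as mapping every source's codes onto one agreed set before the merge. The sketch below assumes a "gender" column and two small in-memory sources; real reconciliation rules are usually broader than this.

```python
# Sketch of a reconciliation rule for the gender example: both sources are
# normalized to one set of codes before the merge. Column names are assumed.
import pandas as pd

GENDER_MAP = {"male": "M", "female": "F", "m": "M", "f": "F"}

def normalize_gender(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["gender"] = df["gender"].astype(str).str.strip().str.lower().map(GENDER_MAP)
    return df

source_a = pd.DataFrame({"user_id": [1, 2], "gender": ["male", "Female"]})
source_b = pd.DataFrame({"user_id": [3, 4], "gender": ["M", "f"]})

users = pd.concat([normalize_gender(source_a), normalize_gender(source_b)],
                  ignore_index=True)
print(users)
```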
Problem 3: Slow Join Performance
Joining data is often notoriously slow. For example, left joins in SQL have a reputation for
being significantly slower than inner joins. The poor performance of joins can be attributed to
both poor ETL design and to the inherent slowness of the join operation.

Solution 3: Avoiding Joins (When Possible)
Avoid unnecessary joins when possible. This is especially true for cross joins, which take the
Cartesian product of two datasets, and nested loop joins, which can be inefficient on large
result sets. In addition, try to reduce your usage of in-memory joins and merges.
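Another practical way to tame join cost, sketched below with pandas, is to filter and pre-aggregate the large table before joining, so the join itself touches far fewer rows. The file names, the "year" column, and the filter value are assumptions.

```python
# Sketch of reducing join cost by filtering and pre-aggregating the large
# table before the join. File and column names are assumptions.
import pandas as pd

orders = pd.read_csv("orders.csv")        # order_id, customer_id, amount, year
customers = pd.read_csv("customers.csv")  # customer_id, region

# Shrink the fact table first...
recent_totals = (
    orders.loc[orders["year"] == 2023]
          .groupby("customer_id", as_index=False)["amount"].sum()
)

# ...then join the much smaller result to the dimension table.
report = recent_totals.merge(customers, on="customer_id", how="inner")
```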
