Data science-Unit-2
Data collection is the process of gathering information from all relevant sources.
Most organizations use data collection methods to make assumptions about future
probabilities and trends.
Once the data is collected, it must go through the data organization process.
Data can be classified into two types:
1. Primary data
2. Secondary data
The primary importance of data collection in any research or business process is that it
helps to determine many important things about the company, particularly its
performance. The data collection process therefore plays an important role in every stream.
Depending on the type of data, the data collection method is divided into two categories
namely,
1. Primary Data Collection methods
2. Secondary Data Collection methods
1. Primary Data Collection Methods
Primary data or raw data is a type of information that is obtained directly from the first-hand
source through experiments, surveys or observations.
The primary data collection method is further classified into two types.
They are:
1. Quantitative Data Collection Methods
2. Qualitative Data Collection Methods
Let us discuss the different methods performed to collect the data under these two data
collection methods.
1. Quantitative Data Collection Methods: these are based on mathematical calculations and use
formats such as closed-ended questions, correlation and regression methods, and measures
such as the mean, median or mode.
2. Qualitative Data Collection Methods: these do not involve any mathematical calculations.
They are closely associated with elements that are not quantifiable. Qualitative data
collection methods include interviews, questionnaires, observations, case studies, etc.
There are several methods to collect this type of data. They are
1. Observation Method
Observation method is used when the study relates to behavioral science. This method is
planned systematically. It is subject to many controls and checks. The different types of
observations are:
• Structured and unstructured observation
• Controlled and uncontrolled observation
• Participant, non-participant and disguised observation
2. Interview Method
This method collects data in the form of oral or verbal responses. It can be carried out in two
ways, namely:
• Personal Interview
• Telephonic Interview
3. Questionnaire Method: In this method, a set of questions is mailed to the respondents.
They should read, reply and subsequently return the questionnaire. The questions are printed
in a definite order on the form. A good survey should have the following features:
• Short and simple
• Should follow a logical sequence
• Provide adequate space for answers
• Avoid technical terms
4. Schedules
This method is similar to the questionnaire method, with a slight difference: the enumerator
who administers the schedule explains the aims and objectives of the investigation and can
remove any misunderstandings that come up.
Data Collection and APIs
We can programmatically collect data using:
1. Direct downloads / import of data.
2. Application Programming Interfaces (APIs).
We can download data from a website independently and then work with the data in Python.
Independently downloading and unzipping data each week is not efficient and does not
explicitly tie your data to your analysis.
You can automate the data download process using Python.
Automation is particularly useful when:
1. You want to download lots of data or particular subsets of data to support an
analysis.
2. There are programmatic ways to access and query the data online.
Link Data Access to Processing & Analysis
When you automate data access, download, or retrieval, and embed it in your code, you
are directly linking your analysis to your data.
Three Ways to Collect/Access Data
You can break up programmatic data collection into three general categories:
1. Data that you download by calling a specific URL and using the Pandas function
read_table, which accepts a URL (see the sketch below).
2. Data that you directly import into Python using the Pandas function read_csv.
3. Data that you download using an API, which makes a request to a data
repository and returns the requested data.
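The first two routes can be illustrated with a minimal pandas sketch. The URL and file name below are hypothetical placeholders, not real data sources:

import pandas as pd

# 1. Download tabular data by calling a specific URL
#    (pandas read_csv / read_table accept a URL directly).
url = "https://example.com/data/precipitation.csv"   # hypothetical URL
remote_df = pd.read_csv(url)

# 2. Directly import a file that already sits on disk.
local_df = pd.read_csv("precipitation.csv")          # hypothetical local file

print(remote_df.head())
print(local_df.head())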
Two Key Formats
The data that you access programmatically may be returned in one of two main formats:
1. Tabular Human-readable file: Files that are tabular, including CSV files
(Comma Separated Values) and even spreadsheets (Microsoft Excel, etc.). These
files are organized into columns and rows and are “flat” in structure rather than
hierarchical.
2. Structured Machine-readable files: Files that can be stored in a text format but
are hierarchical and structured in some way that optimizes machine readability.
JSON files are an example of structured machine-readable files.
2. Using Web-APIs
Web APIs allow you to access data available via an internet web interface.
Often you can access data from web APIs using a URL that contains sets of parameters that
specify the type and particular subset of data that you are interested in.
If you have worked with a database such as Microsoft SQL Server or PostgreSQL, or if
you’ve ever queried data from a GIS system like ArcGIS, then you can compare the set of
parameters associated with a URL string to a SQL query.
Web APIs are a way to strip away all the extraneous visual interface that you don’t care about
and get the data that you want.
Why You Use Web APIs
Among other things, APIs allow us to:
• Get information that would be time-consuming to get otherwise.
• Get information that you can’t get otherwise.
• Automate analytical workflows that require continuously updated data.
• Access data using a more direct interface.
What is an API?
‘API’ stands for ‘Application Programming Interface’.
In simple words, an API is a (hypothetical) contract between two pieces of software saying that
if the user software provides input in a pre-defined format, the other software will extend its
functionality and provide the outcome to the user software.
Think of it like this: a graphical user interface (GUI) or command line interface (CLI) allows
humans to interact with code, whereas an Application Programming Interface (API) allows
one piece of code to interact with other code.
Basic elements of an API:
An API has three primary elements:
Access: who (which user or application) is allowed to ask for data or services.
Request: the actual data or service being asked for (e.g., "given the current location from
my game (Pokemon Go), tell me the map around that place"). A Request has two main parts:
Methods: i.e. the questions you can ask, assuming you have access (it also defines the
type of responses available).
Parameters: additional details you can include in the question or response.
Response: the data or service as a result of your request.
Over the past 15 years, we have seen tremendous advancements in data collection, data
storage, and analytic capabilities.
Businesses and governments now routinely analyse large amounts of data to improve
evidence-based decision-making.
Quandl provides a guided tour by archiving data in categories, in several curated data
collections.
The Response may give you one of two things:
1. Some data or
2. An explanation of why your request failed
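A minimal sketch of the Access / Request / Response cycle using the Python requests library. The endpoint, parameter names, and API key below are hypothetical placeholders:

import requests

url = "https://api.example.com/v1/stations"          # hypothetical endpoint
params = {"city": "Hyderabad", "limit": 10}          # Request: parameters
headers = {"Authorization": "Bearer YOUR_API_KEY"}   # Access: credentials, if required

response = requests.get(url, params=params, headers=headers, timeout=30)

if response.status_code == 200:
    data = response.json()        # Response: the requested data (often JSON)
    print(data)
else:
    # Response: an explanation of why the request failed
    print("Request failed:", response.status_code, response.text)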
Exploring and Fixing Data
1. Steps of Data Exploration and Preparation
2. Missing Value Treatment
o Why is missing value treatment required?
o Why does data have missing values?
o Which methods can be used to treat missing values?
3. Techniques of Outlier Detection and Treatment
o What is an outlier?
o What are the types of outliers?
o What are the causes of outliers?
o What is the impact of outliers on a dataset?
o How can outliers be detected?
o How can outliers be removed?
4. The Art of Feature Engineering
o What is Feature Engineering?
o What is the process of Feature Engineering?
o What is Variable Transformation?
o When should we use variable transformation?
o What are the common methods of variable transformation?
o What is feature / variable creation and what are its benefits?
The overall process of data exploration and preparation consists of seven steps:
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
Finally, we will need to iterate over steps 4 - 7 multiple times before we come up with our
refined model.
Let’s now study each stage in detail:-
1. Variable Identification
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and
category of the variables.
Let’s understand this step more clearly by taking an example.
Example: Suppose we want to predict whether students will play cricket or not, given a data
set of student attributes. Here you need to identify the predictor variables, the target variable,
the data type of the variables and the category of the variables.
2. Univariate Analysis
At this stage, we explore variables one by one. The method used to perform univariate
analysis depends on whether the variable is categorical or continuous.
Let's look at these methods and statistical measures for categorical and continuous variables
individually:
Continuous Variables:- In the case of continuous variables, we need to understand the central
tendency and spread of the variable. These are measured using various statistical metrics and
visualization methods.
Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming
sections, we will look at methods to handle missing and outlier values. To know more about
these measures, you can refer to a descriptive statistics course such as the one from Udacity.
Categorical Variables:- For categorical variables, we'll use a frequency table to understand the
distribution of each category. We can also read it as the percentage of values under each
category. It can be measured using two metrics, Count and Count%, against each category. A
bar chart can be used as a visualization.
3. Bi-variate Analysis
Bi-variate Analysis finds out the relationship between two variables. Here, we look
for association and disassociation between variables at a pre-defined significance level. We can
perform bi-variate analysis for any combination of categorical and continuous variables. The
combination can be: Categorical & Categorical, Categorical & Continuous and Continuous &
Continuous. Different methods are used to tackle these combinations during analysis process.
Let’s understand the possible combinations in detail:
Continuous & Continuous: While doing bi-variate analysis between two continuous
variables, we should look at a scatter plot. It is a nifty way to find out the relationship between
two variables. The pattern of the scatter plot indicates the relationship between the variables,
which can be linear or non-linear. The strength of the relationship is measured by correlation,
which varies between -1 and +1: -1 indicates a perfect negative linear correlation, 0 indicates
no correlation, and +1 indicates a perfect positive linear correlation.
Correlation can be derived using the following formula:
Correlation = Covariance(X,Y) / SQRT( Var(X) * Var(Y) )
Various tools have functions to identify the correlation between variables. In Excel, the
function CORREL() returns the correlation between two variables, and SAS uses the
procedure PROC CORR. These functions return the Pearson correlation value, which
quantifies the relationship between two variables. For example, a correlation of 0.65 between
two variables X and Y indicates a good positive relationship.
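A minimal Python sketch of the same calculation, using made-up sample values:

import pandas as pd

df = pd.DataFrame({"X": [1, 2, 3, 4, 5, 6],
                   "Y": [2, 4, 5, 4, 6, 8]})

# Direct Pearson correlation (the Python counterpart of Excel's CORREL()
# or SAS PROC CORR)
print(df["X"].corr(df["Y"]))

# The same value built from the formula
# Correlation = Covariance(X, Y) / SQRT( Var(X) * Var(Y) )
cov_xy = df["X"].cov(df["Y"])
print(cov_xy / (df["X"].var() * df["Y"].var()) ** 0.5)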
Categorical & Categorical: To find the relationship between two categorical variables, we
can use following methods:
Two-way table: We can start analyzing the relationship by creating a two-way table of
count and count%. The rows represent the categories of one variable and the columns
represent the categories of the other variable. We show the count or count% of observations
available in each combination of row and column categories.
Stacked Column Chart: This method is more of a visual form of Two-way table.
Chi-Square Test: This test is used to derive the statistical significance of the relationship
between the variables. It also tests whether the evidence in the sample is strong enough
to generalize the relationship to a larger population. Chi-square is based
on the difference between the expected and observed frequencies in one or more
categories of the two-way table. It returns the probability for the computed chi-square
statistic with the appropriate degrees of freedom.
Probability of 0: It indicates that both categorical variables are dependent.
Probability of 1: It shows that both variables are independent.
Probability less than 0.05: It indicates that the relationship between the variables is significant
at 95% confidence. The chi-square test statistic for a test of independence of two categorical
variables is found by:
Chi-square = SUM( (O - E)^2 / E )
where O represents the observed frequency and E is the expected frequency under the null
hypothesis, computed by:
E = (row total * column total) / sample size
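In Python, the whole test can be run on a two-way table with SciPy. The counts below are made-up values for a gender vs. "plays cricket" example:

from scipy.stats import chi2_contingency

# Rows: Male, Female; Columns: Plays cricket, Does not play
observed = [[60, 40],
            [30, 70]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)          # < 0.05 => significant at 95% confidence
print("Expected frequencies:", expected)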
Z-Test/ T-Test:- Either test assesses whether the means of two groups are statistically
different from each other or not. If the probability of the test statistic is small, then the
difference between the two averages is more significant. The T-test is very
similar to the Z-test, but it is used when the number of observations in both categories is less
than 30.
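A minimal two-sample T-test sketch with SciPy, on made-up group values:

from scipy.stats import ttest_ind

group_a = [12, 15, 14, 10, 13, 16, 11]
group_b = [18, 21, 19, 17, 22, 20, 19]

t_stat, p_value = ttest_ind(group_a, group_b)
print("t-statistic:", t_stat)
print("p-value:", p_value)   # a small p-value => the group means differ significantly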
ANOVA:- It assesses whether the averages of more than two groups are statistically
different.
Example: Suppose, we want to test the effect of five different exercises. For this, we recruit
20 men and assign one type of exercise to 4 men (5 groups). Their weights are recorded after
a few weeks. We need to find out whether the effect of these exercises on them is
significantly different or not. This can be done by comparing the weights of the 5 groups of 4
men each.
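A minimal one-way ANOVA sketch for the five exercise groups described above, using SciPy; the weights are made-up illustrative values:

from scipy.stats import f_oneway

exercise_1 = [72, 75, 71, 74]
exercise_2 = [68, 70, 69, 71]
exercise_3 = [80, 78, 79, 82]
exercise_4 = [65, 67, 66, 68]
exercise_5 = [74, 73, 75, 76]

f_stat, p_value = f_oneway(exercise_1, exercise_2, exercise_3,
                           exercise_4, exercise_5)
print("F-statistic:", f_stat)
print("p-value:", p_value)   # a small p-value => at least one group mean differs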
So far, we have covered the first three stages of data exploration: variable identification,
univariate analysis and bi-variate analysis. We also looked at various statistical and visual
methods to identify the relationships between variables.
Now, we will look at the methods of missing value treatment. More importantly, we will also
look at why missing values occur in our data and why treating them is necessary.
2. Missing Value Treatment
Why is missing value treatment required?
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased
model because we have not analysed the behavior and relationship with other variables
correctly. It can lead to wrong predictions or classifications.
Consider an example data set of students, split by gender, with some values of "plays cricket"
missing. If we do not treat the missing values, the inference from the data set is that the chance
of playing cricket is higher for males than for females. However, after treating the missing
values (based on gender), we can see that females have a higher chance of playing cricket
compared to males.
Why does my data have missing values?
We looked at the importance of treating missing values in a dataset. Now, let's identify the
reasons why these missing values occur. They may occur at two stages:
1. Data Extraction: It is possible that there are problems with the extraction process. In such
cases, we should double-check for correct data with the data guardians. Some hashing
procedures can also be used to make sure the data extraction is correct. Errors at the data
extraction stage are typically easy to find and can be corrected easily as well.
2. Data collection: These errors occur at the time of data collection and are harder to correct.
They can be categorized into four types:
o Missing completely at random: This is a case when the probability of a value being
missing is the same for all observations. For example: respondents of a data
collection process decide that they will declare their earnings after tossing a fair
coin. If a head occurs, the respondent declares his / her earnings and vice versa. Here
each observation has an equal chance of having a missing value.
o Missing at random: This is a case when the variable is missing at random and the
missing ratio varies for different values / levels of the other input variables. For
example: we are collecting data on age, and females have a higher rate of missing
values compared to males.
o Missing that depends on unobserved predictors: This is a case when the
missing values are not random and are related to an unobserved input variable.
For example: in a medical study, if a particular diagnostic causes discomfort,
then there is a higher chance of dropping out of the study. This missing value is not
at random unless we have included "discomfort" as an input variable for all
patients.
o Missing that depends on the missing value itself: This is a case when the
probability of a missing value is directly correlated with the missing value itself. For
example: people with higher or lower incomes are likely not to respond to questions
about their earnings.
Which methods can be used to treat missing values?
1. Deletion: It is of two types: list-wise deletion and pair-wise deletion.
o In list-wise deletion, we delete observations where any of the variables is missing.
Simplicity is one of the major advantages of this method, but it reduces
the power of the model because it reduces the sample size.
o In pair-wise deletion, we perform the analysis with all cases in which the variables
of interest are present. The advantage of this method is that it keeps as many cases as
possible available for analysis. One disadvantage is that it uses different
sample sizes for different variables.
o Deletion methods are used when the nature of the missing data is "missing
completely at random"; otherwise, non-random missing values can bias the model
output.
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing values
with estimated ones. The objective is to employ known relationships that can be
identified in the valid values of the data set to assist in estimating the missing values.
Mean / Mode / Median imputation is one of the most frequently used methods. It
consists of replacing the missing data for a given attribute by the mean or median
(quantitative attribute) or mode (qualitative attribute) of all known values of that
variable. It can be of two types:
1. Generalized Imputation: In this case, we calculate the mean or median of all
non-missing values of that variable and then replace the missing values with it. For
example, if the variable "Manpower" has missing values, we take the average of all
non-missing values of "Manpower" (28.33 in the example data) and then replace the
missing values with it.
2. Similar case Imputation: In this case, we calculate the average of the non-missing
values for gender "Male" (29.75) and "Female" (25) individually, and then
replace the missing values based on gender. For "Male", we replace missing
values of Manpower with 29.75 and for "Female" with 25.
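A minimal pandas sketch of both approaches, using a made-up "Manpower" column with missing entries (the values are illustrative, not the ones from the example table):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Manpower": [30.0, 29.5, 25.0, np.nan, np.nan, 25.0],
})

# Generalized imputation: replace every missing value with the overall mean
df["Manpower_generalized"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: replace missing values with the mean of the same gender
df["Manpower_by_gender"] = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)
print(df)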
3. Prediction Model: A prediction model is one of the more sophisticated methods for handling
missing data. Here, we create a predictive model to estimate values that will substitute for
the missing data. In this case, we divide our data set into two sets: one set with no
missing values for the variable and another one with missing values. The first data set
becomes the training data set for the model, while the second data set (with missing values) is
the test data set, and the variable with missing values is treated as the target variable. Next, we
create a model to predict the target variable based on the other attributes of the training data
set and populate the missing values of the test data set. We can use regression, ANOVA,
logistic regression and various other modeling techniques to do this. There are two drawbacks
to this approach:
o The model-estimated values are usually better behaved than the true values.
o If there are no relationships between the attributes in the data set and the attribute with
missing values, then the model will not be precise in estimating the missing values.
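A minimal sketch of this idea with scikit-learn, where a linear regression trained on rows with a known "Age" fills in the rows where "Age" is missing (the data and the choice of LinearRegression are illustrative assumptions, not a prescribed recipe):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Fare":   [7.3, 71.8, 8.1, 53.1, 8.5, 26.0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Age":    [22.0, 38.0, np.nan, 35.0, np.nan, 30.0],
})

train = df[df["Age"].notna()]     # "training" set: Age is present
test = df[df["Age"].isna()]       # "test" set: Age is missing (target variable)

model = LinearRegression().fit(train[["Fare", "Pclass"]], train["Age"])
df.loc[df["Age"].isna(), "Age"] = model.predict(test[["Fare", "Pclass"]])
print(df)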
4. KNN Imputation: In this method of imputation, the missing values of an attribute are
imputed using the given number of attributes that are most similar to the attribute whose
values are missing. The similarity of two attributes is determined using a distance
function.
o Advantages:
k-nearest neighbour can predict both qualitative and quantitative attributes
Creation of a predictive model for each attribute with missing data is not
required
Attributes with multiple missing values can be easily treated
The correlation structure of the data is taken into consideration
o Disadvantages:
The KNN algorithm is very time-consuming when analyzing large databases. It
searches through the entire dataset looking for the most similar instances.
The choice of the k value is critical. A higher value of k would include
attributes which are significantly different from what we need, whereas a
lower value of k implies missing out on significant attributes.
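A minimal sketch with scikit-learn's KNNImputer on made-up numeric data; as noted above, the choice of n_neighbors is critical:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [25.0, 50000.0],
    [27.0, np.nan],
    [30.0, 61000.0],
    [np.nan, 52000.0],
    [26.0, 51000.0],
])

imputer = KNNImputer(n_neighbors=2)   # each missing value filled from its 2 nearest rows
print(imputer.fit_transform(X))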
3. Techniques of Outlier Detection and Treatment
What is an Outlier?
An outlier is a term commonly used by analysts and data scientists, as outliers need close
attention; otherwise they can result in wildly wrong estimations. Simply speaking, an outlier is
an observation that appears far away from, and diverges from, the overall pattern in a sample.
Let's take an example: we do customer profiling and find out that the average annual income
of customers is $0.8 million. But there are two customers with annual incomes of $4 million
and $4.2 million. These two customers' annual incomes are much higher than the rest of the
population's, so these two observations will be seen as outliers.
What are the types of Outliers?
Outliers can be of two types:
1. Artificial (Error) / Non-natural
2. Natural
What are the causes of Outliers? Let's understand the various causes of outliers in more detail:
Data Entry Errors:- Human errors such as errors caused during data collection,
recording, or entry can cause outliers in data. For example: Annual income of a
customer is $100,000. Accidentally, the data entry operator puts an additional zero in
the figure. Now the income becomes $1,000,000 which is 10 times higher. Evidently,
this will be the outlier value when compared with rest of the population.
Measurement Error: It is the most common source of outliers. This is caused when
the measurement instrument used turns out to be faulty. For example: There are 10
weighing machines. 9 of them are correct, 1 is faulty. Weight measured by people on
the faulty machine will be higher / lower than the rest of people in the group. The
weights measured on faulty machine can lead to outliers.
Experimental Error: Another cause of outliers is experimental error. For example: In
a 100m sprint of 7 runners, one runner missed out on concentrating on the ‘Go’
call which caused him to start late. Hence, this caused the runner’s run time to
be more than other runners. His total run time can be an outlier.
Intentional Outlier: This is commonly found in self-reported measures that involve
sensitive data. For example: teens typically under-report the amount of alcohol
that they consume. Only a fraction of them report the actual value. Here the actual
values might look like outliers because the rest of the teens are under-reporting their
consumption.
Data Processing Error: Whenever we perform data mining, we extract data from
multiple sources. It is possible that some manipulation or extraction errors may lead to
outliers in the dataset.
Sampling error: For instance, we have to measure the height of athletes. By mistake,
we include a few basketball players in the sample. This inclusion is likely to cause
outliers in the dataset.
Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier.
For instance: in my last assignment with a renowned insurance company, I
noticed that the performance of the top 50 financial advisors was far higher than the rest of the
population. Surprisingly, it was not due to any error. Hence, whenever we performed any
data mining activity on advisors, we treated this segment separately.
What is the impact of Outliers on a dataset?
• They increase the error variance and reduce the power of statistical tests.
• If the outliers are non-randomly distributed, they can decrease normality.
• They can bias or influence estimates that may be of substantive interest.
• They can also violate the assumptions of regression, ANOVA and other statistical
models.
To understand the impact more deeply, let's take an example and check what happens to a data
set with and without outliers.
Example:
As you can see, the data set with outliers has a significantly different mean and standard
deviation. In the first scenario (without the outlier), we would say that the average is 5.45; with
the outlier, the average soars to 30. This would change the estimate completely.
How can outliers be detected?
Outliers are most commonly detected through visualization (box plots, histograms, scatter
plots) and simple thumb rules such as the following (a Python sketch appears after this list):
• Any value beyond the range of Q1 - 1.5 x IQR to Q3 + 1.5 x IQR, where IQR is the
interquartile range, is considered an outlier.
• Use capping methods: any value outside the range of the 5th and 95th percentiles can
be considered an outlier.
• Data points three or more standard deviations away from the mean are considered
outliers.
• Outlier detection is merely a special case of the examination of data for influential data
points, and it also depends on the business understanding.
• Bivariate and multivariate outliers are typically measured using an index of influence,
leverage, or distance. Popular indices such as Mahalanobis' distance and Cook's D are
frequently used to detect outliers.
• In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and
influential observations, we also look at statistical measures like STUDENT, COOKD,
RSTUDENT and others.
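A minimal sketch of the first three thumb rules on made-up data (one extreme value of 30 among values near 5):

import numpy as np

values = np.array([4.2, 4.6, 4.8, 5.0, 5.1, 5.3, 5.5, 5.7, 5.9, 6.1] * 3 + [30.0])

# IQR rule
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
print("IQR rule:", values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])

# 5th / 95th percentile rule
p5, p95 = np.percentile(values, [5, 95])
print("Percentile rule:", values[(values < p5) | (values > p95)])

# Three-standard-deviation rule
z = (values - values.mean()) / values.std()
print("3-sigma rule:", values[np.abs(z) >= 3])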
How can outliers be removed?
Common approaches include deleting erroneous observations, transforming and binning
values, imputing, and treating outliers separately. Transforming variables can reduce the
influence of extreme values (for example, taking the natural log of a value reduces the variation
caused by extreme values), and binning is also a form of variable transformation. The Decision
Tree algorithm deals with outliers well because it bins the variable. We can also use the process
of assigning weights to different observations.
Imputing: Like the imputation of missing values, we can also impute outliers, using mean,
median or mode imputation methods. Before imputing values, we should analyse whether an
outlier is natural or artificial. If it is artificial, we can go ahead and impute it. We can also use a
statistical model to predict the values of outlier observations and then impute them with the
predicted values (see the sketch below).
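A minimal sketch of two of these treatments on made-up data: capping (winsorizing) values outside the 5th-95th percentile range, and imputing IQR-flagged outliers with the median:

import pandas as pd

s = pd.Series([4.2, 4.6, 4.8, 5.0, 5.1, 5.3, 5.5, 5.7, 5.9, 6.1, 30.0])

# Capping at the 5th and 95th percentiles
lower, upper = s.quantile([0.05, 0.95])
capped = s.clip(lower=lower, upper=upper)

# Median imputation of values flagged by the IQR rule
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
imputed = s.mask(is_outlier, s.median())

print(capped.tolist())
print(imputed.tolist())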
Treat separately: If there is a significant number of outliers, we should treat them separately in
the statistical model. One approach is to treat them as two different groups, build individual
models for both groups and then combine the output.
Till here, we have learnt about steps of data exploration, missing value treatment and
techniques of outlier detection and treatment. These 3 stages will make your raw data better
in terms of information availability and accuracy. Let’s now proceed to the final stage of data
exploration. It is Feature Engineering.
4. The Art of Feature Engineering
What is Feature Engineering?
Feature engineering is the science (and art) of extracting more information from existing data.
You are not adding any new data here, but you are actually making the data you already have
more useful.
For example, let’s say you are trying to predict foot fall in a shopping mall based on dates. If
you try and use the dates directly, you may not be able to extract meaningful insights from the
data. This is because the foot fall is less affected by the day of the month than it is by the day
of the week. Now this information about day of week is implicit in your data. You need to
bring it out to make your model better.
This exercise of bringing out information from data is known as feature engineering.
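A minimal pandas sketch of the footfall example: deriving the implicit day-of-week feature from a raw date column (the dates are made-up):

import pandas as pd

df = pd.DataFrame({"date": ["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"]})
df["date"] = pd.to_datetime(df["date"])

df["day_of_week"] = df["date"].dt.dayofweek        # 0 = Monday ... 6 = Sunday
df["is_weekend"] = df["day_of_week"].isin([5, 6])
print(df)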
What is the process of Feature Engineering?
Feature engineering can be divided into two steps:
• Variable transformation.
• Variable / Feature creation.
These two techniques are vital in data exploration and have a remarkable impact on the power
of prediction. Let's understand each of these steps in more detail.
When should we use Variable Transformation?
Below are some situations where variable transformation is needed:
• When we want to change the scale of a variable or standardize the values of a variable
for better understanding. While this transformation is a must if you have data in
different scales, it does not change the shape of the variable distribution.
• When a distribution is skewed: a symmetric distribution is preferred over a skewed
distribution as it is easier to interpret and generate inferences from, and some modeling
techniques require a normal distribution of variables. So, whenever we have a skewed
distribution, we can use transformations which reduce skewness. For a right-skewed
distribution, we take the square / cube root or logarithm of the variable, and for a
left-skewed distribution, we take the square / cube or exponential of the variable.
What are the common methods of Variable Transformation?
Common methods include taking the logarithm, square / cube root, or binning of a variable
(a short sketch follows this list):
• Square / Cube root: The square and cube root of a variable have a sound effect on its
distribution. However, the effect is not as significant as that of the logarithmic
transformation. The cube root has its own advantage: it can be applied to negative
values as well as zero. The square root can be applied to positive values and zero.
• Binning: Variables can also be transformed by grouping their values into bins or
categories. For example, we can categorize income into three categories, namely High,
Average and Low. We can also perform co-variate binning, which depends on the
values of more than one variable.
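A minimal sketch of both methods on a made-up, right-skewed income column: log / square-root / cube-root transformations, and binning into the three income categories with pandas.cut (the cut-off points are illustrative assumptions):

import numpy as np
import pandas as pd

income = pd.Series([12000, 15000, 18000, 22000, 30000, 45000, 250000])

# Transformations that reduce right skewness
transformed = pd.DataFrame({
    "log":  np.log(income),    # only valid for strictly positive values
    "sqrt": np.sqrt(income),   # valid for non-negative values
    "cbrt": np.cbrt(income),   # also valid for zero and negative values
})
print(transformed.skew())      # skewness shrinks after each transformation

# Binning the same variable into three categories
income_band = pd.cut(income,
                     bins=[0, 25000, 100000, float("inf")],
                     labels=["Low", "Average", "High"])
print(income_band.tolist())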
What is Feature / Variable Creation and what are its benefits?
Feature / variable creation is the process of generating new variables from the existing
variables in a data set. There are various techniques to create new features. Let's look at some
of the commonly used methods:
Creating derived variables: This refers to creating new variables from existing
variable(s) using a set of functions or different methods. Let's look at it through the "Titanic
- Kaggle competition". In this data set, the variable age has missing values. To predict the
missing values, we used the salutation (Master, Mr, Miss, Mrs) from the name as a new
variable. How do we decide which variable to create? Honestly, this depends on the
business understanding of the analyst, their curiosity and the set of hypotheses they might
have about the problem. Methods such as taking the log of variables, binning variables and
other methods of variable transformation can also be used to create new variables.
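A minimal sketch of the salutation example: deriving a "Title" variable from the passenger name with a regular expression (illustrative names in the Titanic format):

import pandas as pd

df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris",
                            "Cumings, Mrs. John Bradley",
                            "Heikkinen, Miss. Laina",
                            "Palsson, Master. Gosta Leonard"]})

# Capture the word between the comma and the following period
df["Title"] = df["Name"].str.extract(r",\s*([A-Za-z]+)\.", expand=False)
print(df["Title"].tolist())   # ['Mr', 'Mrs', 'Miss', 'Master']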
Creating dummy variables: One of the most common applications of dummy variables
is to convert a categorical variable into numerical variables. Dummy variables are also
called indicator variables, and they make it possible to use a categorical variable as a
predictor in statistical models. A dummy variable takes the values 0 and 1. Let's take the
variable "gender". We can produce two variables, namely "Var_Male" with values 1 (Male)
and 0 (not male) and "Var_Female" with values 1 (Female) and 0 (not female). We
can also create dummy variables for a categorical variable with more than two classes,
using n or n-1 dummy variables.
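A minimal sketch with pandas.get_dummies, showing both the n-dummy and the n-1-dummy variants:

import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# n indicator columns: Var_Female and Var_Male
print(pd.get_dummies(df["gender"], prefix="Var"))

# n-1 columns: the first category is dropped to avoid redundancy
print(pd.get_dummies(df["gender"], prefix="Var", drop_first=True))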
Data Storage
Data storing in a data science process refers to storing the useful data which you will use in
your data science process to dig actionable insights out of it. Data storing in data science
is itself an orderly process in which many things need to be kept in consideration.
3. Avoid data fatigue: data fatigue refers to the over-storage or over-measurement of data which
is useless for you or doesn't align with your data collection goals.
The most common problem for today's Data Scientists is noisy or incorrect data.
To address this issue properly, one needs to focus on the needs of the specific problem he/she is
solving and then collect the data accordingly.
A Data Scientist can also use some tricks when storing data to reduce its size.
For instance, if you need latitude and longitude of a place, then you can store this data in the
form of geocodes.
Geocodes can be decoded using some basic packages in Python or R. This can significantly
reduce the size of your data.
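A minimal sketch of the geocode idea. This assumes the third-party pygeohash package (pip install pygeohash); other geohash libraries expose similar encode/decode functions, and the coordinates are made-up:

import pygeohash as pgh

lat, lon = 17.3850, 78.4867                 # illustrative coordinates

code = pgh.encode(lat, lon, precision=7)    # one short string instead of two floats
print(code)
print(pgh.decode(code))                     # approximate latitude / longitude back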
The common steps for avoiding data fatigue are:
i. Don't forget to ask around for existing processes:
Each company which works on data has some process for managing data, so it is always
good to look out for the existing processes followed in the company. Starting from scratch is
difficult; study those existing processes and find ways to improve them.
ii. Stop thinking about objectives which are not actionable:
If it is obvious that a goal is not needed, then don't waste your time working on it. As a Data
Scientist, it is your responsibility to find out ahead of time what your objectives should be in
terms of strategy, innovation and execution.
iii. Don't expect your data storage mechanism to be perfect:
There is always room for improvement in any process, and the data storage process follows
the same principle. Never blindly expect your process to do all the things for you. Keep a
human in the loop at times so that they can exercise their intuition, which can lead to further
improvements.
iv. Don't work in isolation:
Never work in silos. Keep your Database Administrators on board with you: they may help
you with architecture-related matters, and you can help them by letting them know about
your needs as a Data Scientist.
v. Learn the difference between intelligent filtering and correlation:
Statisticians say that correlation doesn't imply causation, and they are not wrong. If you have
heard that a competitor improved its revenue by doing certain things, it does not follow that
the same process will work for you. Ultimately, you need to use your own judgment to know
what your needs are and what data you will need to meet them.
4. Data management:
Decide which one to use - SQL or NoSQL
The final thing which you have to deal with during the process of data storage is whether to
use SQL-based databases or NoSQL ones.
Both have their own advantages and disadvantages and are made to deal with particular
applications.
To make our decision easier, let us limit our discussion to MongoDB and MySQL, which are
the spearheads of the two types of DBMS.
First of all, be clear that neither MongoDB nor MySQL can suit all types of applications in
Data Science.
If we are collecting data which is semi-structured or unstructured, then MongoDB should be
used, because complex queries such as joins on such data are slower in MySQL.
We mostly face this situation in Big Data, where the speed of processing is also a primary
concern.
However, if your data is highly structured or you already know the ins and outs of your data,
then MySQL may be the best option, because you can manipulate the data very well and
changes can be made relatively quickly to structured data using SQL, compared to a NoSQL
platform like MongoDB.
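A minimal sketch of the same record stored both ways: as a document via pymongo (this assumes a MongoDB server running on localhost) and as rows in a relational table, with Python's built-in sqlite3 standing in for an SQL database such as MySQL:

import sqlite3
from pymongo import MongoClient

record = {"name": "Sensor-17", "readings": [21.4, 21.9, 22.3], "unit": "C"}

# Document store: nested / semi-structured data fits naturally in one document
mongo = MongoClient("mongodb://localhost:27017/")
mongo["demo"]["sensors"].insert_one(record)

# Relational store: the nested list must be flattened into a fixed schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (name TEXT, reading REAL, unit TEXT)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                 [(record["name"], r, record["unit"]) for r in record["readings"]])
print(conn.execute("SELECT * FROM readings").fetchall())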
There are also other NoSQL and SQL alternatives which may be more suitable for the
applications you are working on, so it is good to check which one caters to your needs.
For instance, MariaDB offers more and better storage engines compared to other relational
databases.
Using Cassandra, NoSQL support is also available in MariaDB, enabling you to run SQL and
NoSQL in a single database system.
MariaDB also supports TokuDB, which can handle big data for large organizations and
corporate users.
Data Management
Data management is the practice of collecting, keeping, and using data securely, efficiently,
and cost-effectively.
The goal of data management is to help people, organizations, and connected things optimize
the use of data within the bounds of policy and regulation so that they can make decisions
and take actions that maximize the benefit to the organization.
A robust data management strategy is becoming more important than ever as organizations
increasingly rely on intangible assets to create value.
The components of a data management system work together as a "data utility" to deliver the
data management capabilities an organization needs for its apps, and for the analytics and
algorithms that use the data originated by those apps. Although current tools help database
administrators (DBAs)
automate many of the traditional management tasks, manual intervention is still often
required because of the size and complexity of most database deployments. Whenever
manual intervention is required, the chance for errors increases. Reducing the need for
manual data management is a key objective of a new data management technology, the
autonomous database.
Data management platform
A data management platform is the foundational system for collecting and analyzing large
volumes of data across an organization. Commercial data platforms typically include
software tools for management, developed by the database vendor or by third-party vendors.
These data management solutions help IT teams and DBAs perform typical tasks such as:
Companies are using big data to improve and accelerate product development, predictive
maintenance, the customer experience, security, operational efficiency, and much more. As
big data gets bigger, so will the opportunities.
Data Management Challenges
Most of the challenges in data management today stem from the faster pace of business and
the increasing proliferation of data. The ever-expanding variety, velocity, and volume of data
available to organizations is pushing them to seek more-effective management tools to keep
up. Some of the top challenges organizations face include the following:
1. They don’t know what data they have
2. They must maintain performance levels as the data tier expands
3. They must meet constantly changing compliance requirements
4. They aren’t sure how to repurpose data to put it to new uses
5. They must keep up with changes in data storage
1. They don't know what data they have: Data from an increasing number and variety of
sources, such as sensors, smart devices, social media, and video cameras, is being collected
and stored. But none of that data is useful if the organization doesn't know what data it has,
where it is, and how to use it.
2. They must maintain performance levels as the data tier expands: Organizations are
capturing, storing, and using more data all the time. To maintain peak response times across
this expanding tier, organizations need to continuously monitor the type of questions the
database is answering and change the indexes as the queries change, without affecting
performance.
3. They must meet constantly changing compliance requirements: Compliance regulations are
complex and multijurisdictional, and they change constantly. Organizations need to be able
to easily review their data and identify anything that falls under new or modified
requirements. In particular, personally identifiable information (PII) must be detected,
tracked, and monitored for compliance with increasingly strict global privacy regulations.
4. They aren't sure how to repurpose data to put it to new uses: Collecting and identifying the
data itself doesn't provide any value; the organization needs to process it. If it takes a lot of
time and effort to convert the data into what is needed for analysis, that analysis won't
happen, and the potential value of that data is lost.
5. They must keep up with changes in data storage: In the new world of data management,
organizations store data in multiple systems, including data warehouses and unstructured
data lakes that store any data in any format in a single repository. An organization's data
scientists need a way to quickly and easily transform data from its original format into the
shape, format, or model they need for a wide array of analyses.
Data Management Best Practices
Addressing data management challenges requires a comprehensive, well-thought-out set of
best practices. Although specific best practices vary depending on the type of data involved
and the industry, the following best practices address the major data management challenges
organizations face today:
1. Create a discovery layer to identify your data
2. Develop a data science environment to efficiently repurpose your data
3. Use autonomous technology to maintain performance levels across your expanding
data tier
4. Use discovery to stay on top of compliance requirements
5. Use a common query layer to manage multiple and diverse forms of data storage
1. Create a discovery layer to identify your data: A discovery layer on top of your
organization's data tier allows analysts and data scientists to search and browse for datasets
to make your data usable.
2. Develop a data science environment to efficiently repurpose your data: A data science
environment automates as much of the data transformation work as possible, streamlining
the creation and evaluation of data models. A set of tools that eliminates the need for the
manual transformation of data can expedite the hypothesizing and testing of new models.
3. Use autonomous technology to maintain performance levels across your expanding data
tier: Autonomous data capabilities use AI and machine learning to continuously monitor
database queries and optimize indexes as the queries change. This allows the database to
maintain rapid response times and frees DBAs and data scientists from time-consuming
manual tasks.
4. Use discovery to stay on top of compliance requirements: New tools use data discovery to
review data and identify the chains of connection that need to be detected, tracked, and
monitored for multijurisdictional compliance. As compliance demands increase globally,
this capability is going to be increasingly important to risk and security officers.
5. Use a common query layer to manage multiple and diverse forms of data storage: New
technologies are enabling data management repositories to work together, making the
differences between them disappear. A common query layer that spans the many kinds of
data storage enables data scientists, analysts, and applications to access data without needing
to know where it is stored and without needing to manually transform it into a usable format.
Using Data from Multiple Sources
What is ETL?
ETL (extract, transform, load) is the dominant paradigm for efficiently getting data from
multiple sources into a single location, where it can be used for self-service queries and data
analytics. As the name suggests, ETL consists of three sub-processes (a minimal Python sketch
follows the three steps):
Extract: Data is first extracted from its source location(s). These sources may be—
but are not limited to—files, websites, software applications, and relational and non-
relational databases.
Transform: The extracted data is then transformed in order to make it suitable for
its new purpose. Depending on the ETL workflow, the transformation stage
may include:
o Adding or removing data rows, columns, and/or fields.
o Deleting duplicate, out-of-date, and/or extraneous data.
o Joining multiple data sources together.
o Converting data in one format to another (e.g. date/time formats or
imperial/metric units).
Load: Finally, the transformed data is loaded into the target location. This is usually
a data warehouse, a specialized system intended for real-time BI, analytics, and
reporting.
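A minimal ETL sketch in pandas: extract rows from a CSV file, transform them, and load them into a SQLite database standing in for the data warehouse. The file name, column names, table name, and exchange rate are hypothetical placeholders:

import sqlite3
import pandas as pd

# Extract
df = pd.read_csv("sales_raw.csv")                      # hypothetical source file

# Transform
df = df.drop_duplicates()                              # delete duplicate rows
df["order_date"] = pd.to_datetime(df["order_date"])    # convert date format
df["amount_usd"] = df["amount_inr"] / 83.0             # convert units / currency

# Load
warehouse = sqlite3.connect("warehouse.db")
df.to_sql("sales_clean", warehouse, if_exists="replace", index=False)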
What is ELT?
ELT stands for "Extract, Load, and Transform." In this process, data gets leveraged via a data
warehouse in order to do basic transformations. That means there's no need for data staging.
ELT uses cloud-based data warehousing solutions for all different types of data - including
structured, unstructured, semi-structured, and even raw data types.
1. ELT is a relatively new technology, made possible because of modern, cloud-based
warehouse server technologies – endless storage and scalable processing power.
For example, platforms like Amazon Redshift and Google BigQuery make ELT pipelines
possible because of their incredible processing capabilities.
2. Ingest anything and everything as the data becomes available: ELT is paired with a
data lake which lets you ingest an ever-expanding pool of raw data immediately, as it
becomes available.
There's no requirement to transform the data into a special format before saving it in
the data lake.
3. Transforms only the data you need: ELT transforms only the data required for a
particular analysis.
Although it can slow down the process of analyzing the data, it offers more
flexibility—because you can transform the data in different ways on the fly to
produce different types of metrics, forecasts, and reports.
Conversely with ETL, the entire ETL pipeline—and the structure of the data in the
OLAP warehouse—may require modification if the previously-decided structure
doesn't allow for a new type of analysis.
Differences between ETL and ELT:
1. ETL is the Extract, Transform, and Load process for data, whereas ELT is the Extract, Load,
and Transform process for data.
2. In ETL, data moves from the data source (operational databases or other sources) to a
staging area and then into the data warehouse; in ELT, there is no need for a data staging area.
3. In ETL, transformations happen within a staging area outside the data warehouse; in ELT,
transformations happen inside the data system itself, and no staging area is required.
4. ETL can be used to structure unstructured data, but it can't be used to pass unstructured data
into the target system; ELT is a solution for uploading unstructured data into a data lake (a
special kind of data store that accepts any kind of structured or unstructured data) and making
unstructured data available to business intelligence systems.
7. In ETL, the source and target databases are different (e.g., an Oracle source and an SAP
target database); in ELT, the source and target databases are the same.
How to Extract Data from Multiple Sources?
Extracting data from multiple sources is an involved process that requires contemplation and
planning. The steps required to extract data from multiple sources are:
Step 1: Decide Which Sources to Use, based on which data to extract.
Identify which data you want to extract and decide from which sources.
Step 2: Choose the Extraction Method: ETL or ELT
Step 3: Estimate the Size of the Extraction
Step 4: Connect to the Data Sources
Each data source may have its own API (application programming interface) or connector to
help with the extraction process. If you can’t easily connect to a given data source, you may
have to build a custom integration.
Problem 2: Data Integrations
Each data source you use needs to be integrated with the larger ETL workflow. Not only is
this a complex and technically demanding undertaking, but it can also break your ETL
process if the structure of the underlying data source changes.
Solution 3: Avoiding Joins (When Possible)
Avoid unnecessary joins when possible. This is especially true for cross joins, which take the
Cartesian product of two datasets, and nested loop joins, which can be inefficient on large
result sets. In addition, try to reduce your usage of in-memory joins and merges.