Week 04
• Data Cleaning
• Data Visualization
• Statistical Testing
Data Loading
• Data loading defines the LOAD component of the ETL process.
• ETL stands for Extraction, Transformation, and Load.
• Extraction deals with the retrieval and combining of data from
multiple sources.
• Transformation deals with cleaning and formatting of the Extracted
Data.
• Data Loading deals with data getting loaded into a storage system,
such as a cloud data warehouse.
• Data loading is quite simply the process of packing up your data and moving it to a
designated data warehouse.
• It is at the beginning of this transitory phase where you can begin planning a
roadmap, outlining where you would like to move forward with your data and how
you would like to use it.
• Data Loading is the ultimate step in the ETL process.
• In this step, the extracted data and the transformed data are loaded into the target
database.
• All three steps in the ETL process can run in parallel.
• Data extraction takes time, so the second phase, transformation, is executed simultaneously.
• This prepares the data for the third stage, data loading.
• As soon as some data is ready, data loading is done without waiting for the previous steps to be completed.
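As a rough illustration, the three ETL stages could look like the following minimal Python/pandas sketch. The file names, join key, table name, and SQLite target are hypothetical placeholders, not part of any specific tool.

import pandas as pd
import sqlite3

# Extract: retrieve and combine data from multiple (hypothetical) sources
orders = pd.read_csv("orders.csv")          # assumed source file
customers = pd.read_csv("customers.csv")    # assumed source file
raw = orders.merge(customers, on="customer_id", how="left")

# Transform: clean and format the extracted data
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.drop_duplicates().dropna(subset=["customer_id"])

# Load: write the transformed data into a target store (here a local SQLite file)
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders_clean", conn, if_exists="replace", index=False)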
Challenges in Data Loading
• Many ETL solutions are cloud-based, which accounts for their speed and scalability.
• But large enterprises with traditional, on-premises infrastructure and data management processes often use custom-built scripts and customized configurations to collect their own data and load it into storage systems.
• This can result in the following:
• Slow data analysis: Each time a data source is added or changed, the system has to be reconfigured, which takes time and hampers the ability to make quick decisions.
• Increased likelihood of errors: Changes and reconfigurations open the door to human error, duplicate or missing data, and other problems.
• Requires specialized knowledge: In-house IT teams often lack the skill (and bandwidth) needed to code and monitor ETL functions themselves.
• Requires costly equipment: Organizations have to purchase, house, and maintain hardware and other equipment to run the process on-site.
• Unorganized data: Loading your data can become unorganized very fast.
• Universal formatting: Before you begin loading your data, make sure that you identify where it is coming from and where you want it to go.
• Loss of data: Tracking the status of all data is critical for a smooth loading process.
• Speed: Although it’s exciting to be closer to your final destination, do not rush through this phase. Errors are most likely to occur during this time.
Dirty Data
Dirty data is a database record that contains errors. Dirty data can be caused by a number of
factors including duplicate records, incomplete or outdated data, and the improper parsing of
record fields from disparate systems.
Some types of dirty data:
• Duplicate Data
• Inaccurate Data
• Inconsistent Data
• Outdated Data
• Insecure Data
• Incomplete Data
Analysts should look for anomalies, verify the data with domain knowledge, and decide the most
appropriate approach to clean the data
Data Cleaning
• Data cleaning (sometimes also known as data cleansing or data wrangling) is an important
early step in the data analytics process.
• This crucial exercise, which involves preparing and validating data, usually takes place
before your core analysis.
• Data cleaning is not just a case of removing erroneous data, although that’s often part of
it.
• The majority of work goes into detecting rogue data and (wherever possible) correcting
it.
• ‘Rogue data’ includes things like incomplete, inaccurate, irrelevant, corrupt or incorrectly formatted data.
Data Cleaning
• Since data analysis is commonly used to inform business decisions, results need to be
accurate.
• Given this, it might seem safer simply to remove rogue or incomplete data.
• But this poses problems, too: an incomplete dataset will also impact the results of your
analysis.
• That’s why one of the main aims of data cleaning is to keep as much of a dataset intact as
possible.
• This helps improve the reliability of your insights
Key Benefits Of Data Cleaning
Data analysis requires effectively cleaned data to produce accurate and trustworthy insights. But
clean data has a range of other benefits, too
Staying organized: Today’s businesses collect lots of information from clients, customers,
product users, and so on. These details include everything from addresses and phone numbers to
bank details and more. Cleaning this data regularly means keeping it tidy. It can then be stored
more effectively and securely
Avoiding mistakes: Dirty data doesn’t just cause problems for data analytics. It also affects daily
operations. For instance, marketing teams usually have a customer database. If that database is in
good order, they’ll have access to helpful, accurate information. If it’s a mess, mistakes are bound
to happen, such as using the wrong name in personalized mail outs.
Key Benefits Of Data Cleaning
• Improving productivity: Regularly cleaning and updating data means rogue information is
quickly purged. This saves teams from having to wade through old databases or documents to find
what they’re looking for.
• Avoiding unnecessary costs: Making business decisions with bad data can lead to expensive
mistakes. But bad data can incur costs in other ways too. Simple things, like processing errors, can
quickly snowball into bigger problems. Regularly checking data allows you to detect blips sooner.
This gives you a chance to correct them before they require a more time-consuming (and costly)
fix.
• Improved mapping: Increasingly, organizations are looking to improve their internal data
infrastructures. For this, they often hire data analysts to carry out data modeling and to build new
applications. Having clean data from the start makes it far easier to collate and map, meaning that
a solid data hygiene plan is a sensible measure.
Data Cleaning Steps
• Get Rid of Unwanted Observations
The first stage in any data cleaning process is to remove the observations (or data points) you don’t
want. This includes irrelevant observations, i.e. those that don’t fit the problem you’re looking to
solve; for instance, observations from a population outside the scope of your analysis. This step of the
process also involves removing duplicate data. Duplicate data commonly occurs when you combine
multiple datasets, scrape data online, or receive it from third-party sources.
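A minimal pandas sketch of this step, using a hypothetical customer table; the column names and the ‘retail’ filter are illustrative assumptions only.

import pandas as pd

df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cara"],
    "segment":  ["retail", "retail", "wholesale", "retail"],
    "spend":    [120.0, 120.0, 800.0, 95.0],
})

# Drop exact duplicate rows (common after combining several sources)
df = df.drop_duplicates()

# Drop irrelevant observations, e.g. keep only the segment being analysed
df = df[df["segment"] == "retail"]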
• Fix Structural Errors
Structural errors usually emerge as a result of poor data housekeeping. They include things like typos
and inconsistent capitalization, which often occur during manual data entry.
Let’s say you have a dataset covering the properties of different metals. ‘Iron’ (uppercase) and ‘iron’
(lowercase) may appear as separate classes (or categories). Ensuring that capitalization is consistent
makes that data much cleaner and easier to use. You should also check for mislabeled categories. For
instance, ‘Iron’ and ‘Fe’ (iron’s chemical symbol) might be labeled as separate classes, even though
they’re the same. Other things to look out for are the use of underscores, dashes, and other rogue
punctuation!
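A short, hypothetical pandas sketch of fixing such structural errors in the metals example:

import pandas as pd

metals = pd.DataFrame({"metal": ["Iron", "iron", "Fe", "copper ", "Copper"]})

# Normalise capitalisation and stray whitespace
metals["metal"] = metals["metal"].str.strip().str.lower()

# Map mislabelled categories (e.g. the chemical symbol) onto one class
metals["metal"] = metals["metal"].replace({"fe": "iron"})

print(metals["metal"].value_counts())   # iron: 3, copper: 2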
Standardize Data
Standardizing your data is closely related to fixing structural errors, but it takes things a step further.
Standardizing also means ensuring that things like numerical data use the same unit of measurement.
For instance, combining miles and kilometers in the same dataset will cause problems. Even dates
have different conventions, with the US putting the month before the day, and Europe putting the day
before the month.
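A small pandas sketch of standardizing units and date formats; the columns, the miles-to-kilometres conversion, and the US date convention used here are illustrative assumptions.

import pandas as pd

trips = pd.DataFrame({
    "distance": [5.0, 12.0, 3.1],
    "unit":     ["miles", "km", "miles"],
    "date":     ["03/04/2022", "04/03/2022", "12/31/2022"],
})

# Convert everything to one unit of measurement (kilometres)
trips["distance_km"] = trips.apply(
    lambda r: r["distance"] * 1.60934 if r["unit"] == "miles" else r["distance"],
    axis=1,
)

# Parse dates with an explicit convention (here US-style month/day/year)
trips["date"] = pd.to_datetime(trips["date"], format="%m/%d/%Y")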
Remove Unwanted Outliers
Outliers are data points that dramatically differ from others in the set. They can cause problems with
certain types of data models and analysis. While outliers can affect the results of an analysis, you
should always approach removing them with caution. Only remove an outlier if you can prove that it
is erroneous
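One common (but not the only) way to flag candidate outliers is the interquartile-range rule, sketched below on made-up prices; flagged values should still be verified before removal.

import pandas as pd

prices = pd.Series([102, 98, 101, 99, 100, 97, 4500])   # 4500 looks suspicious

# Flag values outside 1.5 * IQR beyond the quartiles; inspect before removing
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(outliers)   # only drop these after confirming they are erroneous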
Fix Contradictory Data Errors
Contradictory (or cross-set) data errors are another common problem to look out for. Contradictory
errors are where you have a full record containing inconsistent or incompatible data. An example
might be a pupil’s grade score being associated with a field that only allows options for ‘pass’ and
‘fail’.
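A hypothetical pandas sketch of detecting such contradictions, assuming an illustrative pass mark of 50 (the threshold is not specified in the slides):

import pandas as pd

grades = pd.DataFrame({
    "score":  [72, 35, 90],
    "result": ["pass", "pass", "fail"],   # second and third rows contradict the score
})

# Flag records whose categorical result disagrees with the numeric score
# (assuming a hypothetical pass mark of 50)
expected = grades["score"].ge(50).map({True: "pass", False: "fail"})
contradictions = grades[grades["result"] != expected]
print(contradictions)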
Type Conversion and Syntax Errors
Once you’ve tackled other inconsistencies, the content of your spreadsheet or dataset might look
good to go. However, you need to check that everything is in order behind the scenes, too. Type
conversion refers to the data types used for the different kinds of data in your dataset. A simple
example is that plain numbers should be stored as a numerical type, whereas monetary amounts should
use a currency value.
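A small pandas sketch of type conversion on hypothetical columns, coercing text to proper numeric types:

import pandas as pd

df = pd.DataFrame({"price": ["$1,200", "$950", "n/a"], "quantity": ["3", "5", "2"]})

# Strip currency symbols and coerce to numeric types; invalid entries become NaN
df["price"] = pd.to_numeric(
    df["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)
df["quantity"] = df["quantity"].astype(int)

print(df.dtypes)   # price: float64, quantity: int64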
Deal with Missing Data
There are three common approaches to this problem. The first is to remove the entries associated with the
missing data. The second is to impute (or guess) the missing data, based on other, similar data. In most cases,
however, both of these options negatively impact your dataset in other ways. Removing data often means
losing other important information. Guessing data might reinforce existing patterns, which could be wrong. The
third option (and often the best one) is to flag the data as missing. To do this, ensure that empty fields have the
same value, e.g. ‘missing’ or ‘0’ (if it’s a numerical field). Then, when you carry out analysis, you’ll at least be
taking into account that data is missing, which in itself can be informative.
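A brief pandas sketch of the first and third options on a hypothetical table (dropping rows versus flagging missing values):

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [34, np.nan, 58], "city": ["Oslo", None, "Bergen"]})

# Option 1: drop rows with missing values (loses other information in those rows)
dropped = df.dropna()

# Option 3: flag missing values explicitly so the analysis can account for them
flagged = df.copy()
flagged["age_missing"] = flagged["age"].isna()
flagged["city"] = flagged["city"].fillna("missing")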
Validate the Dataset
This often involves using scripts that check whether or not the dataset agrees with validation rules (or
‘check routines’) that you have predefined. You can also carry out validation against existing, ‘gold
standard’ datasets. This all sounds a bit technical, but all you really need to know at this stage is that
validation means checking the data is ready for analysis. If there are still errors (which there usually
will be) you’ll need to go back and fix them…there’s a reason why data analysts spend so much of
their time cleaning data!
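A minimal sketch of rule-based validation in pandas; the rules and columns here are illustrative assumptions, not a standard set of check routines.

import pandas as pd

df = pd.DataFrame({"age": [34, -2, 58], "email": ["a@x.com", "b@x.com", "not-an-email"]})

# Predefined validation rules ("check routines") expressed as boolean checks
rules = {
    "age_in_range": df["age"].between(0, 120),
    "email_has_at": df["email"].str.contains("@"),
}

for name, passed in rules.items():
    bad = df[~passed]
    if not bad.empty:
        print(f"Rule '{name}' failed for {len(bad)} row(s)")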
Missing Data
• Almost every dataset contains missing data, and it should not be treated lightly, since its presence is one of the most important problems.
• Missing values are problematic because they can mislead the results obtained during calculations, and there is no single best way of dealing with them.
• A missing value is a value that is not stored in the dataset for an observation.
• The classes are as follows:
Missing Completely at Random (MCAR):
For this case, the missing values are unrelated to the observations. If the probability of being
missing is equal for all cases, then the data is missing completely at random. MCAR data,
nevertheless, is highly unusual in practice. For example, during the survey of a population, if some
responses are lost purely by chance, then they are missing completely at random.
Missing Data
Missing at Random (MAR):
This is the case when the missingness can be explained by another observed variable, not by the
missing values themselves. For instance, in a survey about depression levels between the two sexes,
males are less likely than females to respond to the questions on depression level.
Therefore, the missingness depends only on gender.
Mean/Median/Mode:
One method of imputation is to use the mean or median: the mean or median of the
particular column is calculated and then filled in place of the missing data. For
categorical data, however, the mode (the most frequent value) is used.
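A short pandas sketch of mean/median/mode imputation on a hypothetical table:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [42000, np.nan, 58000, 61000],
    "city":   ["Oslo", "Bergen", None, "Oslo"],
})

# Numerical column: fill with the median (less sensitive to outliers than the mean)
df["income"] = df["income"].fillna(df["income"].median())

# Categorical column: fill with the mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])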
Exploratory Data Analyses
A useful way to detect patterns and anomalies in the data is through the exploratory data
analysis with visualization.
Visualization gives a succinct, holistic view of the data that may be difficult to grasp from the
numbers and summaries alone
An important facet of the initial data exploration, visualization assesses data cleanliness and
suggests potentially important relationships in the data prior to the model planning and
building phases.
Exploratory data analysis is a data analysis approach to reveal the important characteristics
of a dataset, mainly through visualization
Visualization of Single Variable
Using visual representations of data is a hallmark of exploratory data analyses: letting
the data speak to its audience rather than imposing an interpretation on the data a
priori.
Dotchart and barplot portray continuous values with labels from a discrete variable.
The figure on the next slide shows (a) a dotchart and (b) a barplot based on the mtcars
dataset, which includes the fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles.
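The referenced figures appear to be R plots (dotchart and barplot); the matplotlib sketch below reproduces the same idea on a few illustrative MPG values in the spirit of mtcars.

import matplotlib.pyplot as plt

# A few cars and their fuel consumption (values here are illustrative only)
cars = {"Mazda RX4": 21.0, "Datsun 710": 22.8, "Hornet 4 Drive": 21.4, "Valiant": 18.1}
names, mpg = list(cars.keys()), list(cars.values())

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# (a) dot chart: one point per car, labelled on the y-axis
ax1.scatter(mpg, range(len(names)))
ax1.set_yticks(range(len(names)))
ax1.set_yticklabels(names)
ax1.set_xlabel("Miles per gallon")
ax1.set_title("Dot chart")

# (b) bar plot of the same values
ax2.bar(names, mpg)
ax2.set_ylabel("Miles per gallon")
ax2.set_title("Bar plot")
ax2.tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()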
Histogram and Density Plot
The figure on the next slide includes a histogram of household income.
The histogram shows a clear concentration of low household incomes on the left and
the long tail of the higher incomes on the right.
A further figure shows a density plot of the logarithm of household income
values, which emphasizes the shape of the distribution.
On this logarithmic scale, the income distribution is concentrated in the centre portion of the graph.
In the data preparation phase of the Data Analytics Lifecycle, the data range and
distribution can be obtained.
If the data is skewed, viewing the logarithm of the data (if it’s all positive) can help
detect structures that might otherwise be overlooked in a graph with a regular,
nonlogarithmic scale
Examining if the data is unimodal or multimodal will give an idea of how many
distinct populations with different behaviour patterns might be mixed into the overall
population.
Many modelling techniques assume that the data follows a normal distribution.
Therefore, it is important to know if the available dataset can match that assumption
before applying any of those modelling techniques
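A sketch of this idea using synthetic, log-normally distributed "income" values (purely illustrative), comparing a histogram of the raw data with a density plot of its logarithm:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Synthetic right-skewed "household income" data, for illustration only
rng = np.random.default_rng(0)
income = rng.lognormal(mean=11, sigma=0.6, size=5000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of raw incomes: concentration of low values with a long right tail
ax1.hist(income, bins=50)
ax1.set_xlabel("Household income")
ax1.set_title("Histogram")

# Density plot of log10(income): the skew largely disappears
log_income = np.log10(income)
xs = np.linspace(log_income.min(), log_income.max(), 200)
ax2.plot(xs, gaussian_kde(log_income)(xs))
ax2.set_xlabel("log10(household income)")
ax2.set_title("Density plot")

plt.tight_layout()
plt.show()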
Consider a density plot of diamond prices (in USD).
Figure contains two density plots for premium and ideal cuts of diamonds.
The group of premium cuts is shown in red, and the group of ideal cuts is shown in
blue.
The range of diamond prices is wide—in this case ranging from around $300 to
almost $20,000.
Extreme values are typical of monetary data such as income, customer value, tax
liabilities, and bank account sizes
Figure 3.12(b) shows more detail of the diamond prices than Figure 3.12(a) by taking
the logarithm.
The two humps in the premium cut represent two distinct groups of diamond prices on the log10 scale:
One group centers around 2.9 (where the price is about $794), and the other centers
around 3.7 (where the price is about $5,012).
The ideal cut contains three humps, centering around 2.9, 3.3, and 3.7 respectively
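A possible way to reproduce this kind of grouped density plot, assuming seaborn's sample 'diamonds' dataset can be downloaded on first use:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes seaborn's bundled sample 'diamonds' data can be fetched
diamonds = sns.load_dataset("diamonds")
subset = diamonds[diamonds["cut"].isin(["Premium", "Ideal"])].copy()
subset["log_price"] = np.log10(subset["price"])

# Overlaid density plots of log10(price) for the two cuts
sns.kdeplot(data=subset, x="log_price", hue="cut")
plt.xlabel("log10(price in USD)")
plt.show()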
Visualization of Multiple Variables
A scatterplot is a simple and widely used visualization for finding the relationship among
multiple variables.
A scatterplot can represent data with up to five variables using x-axis, y-axis, size, color, and
shape.
But usually only two to four variables are portrayed in a scatterplot to minimize confusion.
When examining a scatterplot, one needs to pay close attention to the possible relationship
between the variables.
If the functional relationship between the variables is somewhat pronounced, the data may
roughly lie along a straight line, a parabola, or an exponential curve.
If variable y is related exponentially to x, then the plot of x versus log(y) is approximately
linear.
If the plot looks more like a cluster without a pattern, the corresponding variables may have a
weak relationship
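A small synthetic example of this point: when y depends exponentially on x, plotting x against log(y) straightens the relationship.

import numpy as np
import matplotlib.pyplot as plt

# Synthetic exponential relationship with multiplicative noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = np.exp(0.5 * x) * rng.lognormal(0, 0.2, size=x.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y, s=10)
ax1.set_title("x versus y (curved)")

# Plotting x versus log(y) makes the relationship approximately linear
ax2.scatter(x, np.log(y), s=10)
ax2.set_title("x versus log(y) (roughly linear)")

plt.tight_layout()
plt.show()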
Dotchart and barplot can visualize multiple variables.
Both of them use color as an additional dimension for visualizing the data.
For the same mtcars dataset, the figure shows a dotchart that groups the vehicles by number of
cylinders on the y-axis and uses colors to distinguish the different cylinder counts.
The vehicles are sorted according to their MPG values.
The barplot in Figure visualizes the distribution of car cylinder counts and number of
gears.
The x-axis represents the number of cylinders, and the color represents the number of
gears.
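A matplotlib/pandas sketch of such a barplot; the counts below follow the spirit of the mtcars cylinder-by-gear table but should be treated as illustrative.

import pandas as pd
import matplotlib.pyplot as plt

# Counts of cars by cylinders and gears (illustrative values)
counts = pd.DataFrame(
    {"3 gears": [1, 2, 12], "4 gears": [8, 4, 0], "5 gears": [2, 1, 2]},
    index=["4 cyl", "6 cyl", "8 cyl"],
)

# Stacked bars: x-axis is the number of cylinders, colour encodes the number of gears
counts.plot(kind="bar", stacked=True)
plt.xlabel("Number of cylinders")
plt.ylabel("Number of cars")
plt.show()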
Box and Whisker Plot
Box-and-whisker plots show the distribution of a continuous variable for each value of
a discrete variable.
The box-and-whisker plot in Figure visualizes mean household incomes as a function
of region in the United States.
The first digit of the U.S. postal (“ZIP”) code corresponds to a geographical region in
the United States
In Figure, each data point corresponds to the mean household income from a particular
zip code.
The horizontal axis represents the first digit of a zip code, ranging from 0 to 9, where 0
corresponds to the northeast region of the United States (such as Maine, Vermont, and
Massachusetts), and 9 corresponds to the southwest region (such as California and
Hawaii).
The vertical axis represents the logarithm of mean household incomes. The logarithm
is taken to better visualize the distribution of the mean household incomes
The graph shows how household income varies by region.
The highest median incomes are in region 0 and region 9.
Region 0 is slightly higher, but the boxes for the two regions overlap enough
that the difference between the two regions probably is not significant.
The lowest household incomes tend to be in region 7, which includes states
such as Louisiana, Arkansas, and Oklahoma
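A sketch of a box-and-whisker plot of log income by region, using synthetic data as a stand-in for the ZIP-code figure:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for mean household income by ZIP-code region (0-9)
rng = np.random.default_rng(2)
regions = rng.integers(0, 10, size=2000)
income = rng.lognormal(mean=11 + 0.03 * (regions == 0) + 0.02 * (regions == 9), sigma=0.4)

df = pd.DataFrame({"region": regions, "log_income": np.log10(income)})

# One box per region: distribution of log10(mean household income)
df.boxplot(column="log_income", by="region")
plt.xlabel("First digit of ZIP code")
plt.ylabel("log10(mean household income)")
plt.show()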
HexBin Plot for Larger Dataset
If there is too much data, the structure of the data may become difficult to see in
a scatterplot.
Consider a case to compare the logarithm of household income against the
years of education
The cluster in the scatterplot on the left (a) suggests a somewhat linear
relationship of the two variables.
However, one cannot really see the structure of how the data is distributed
inside the cluster.
This is a Big Data type of problem.
Millions or billions of data points would require different approaches for
exploration, visualization, and analysis
A hexbinplot combines the ideas of a scatterplot and a histogram.
Similar to a scatterplot, a hexbinplot visualizes data on the x-axis and y-axis.
Data is placed into hexagonal bins (hexbins), and the third dimension uses shading to represent
the concentration of data in each hexbin.
In Figure 3.17(b), the same data is plotted using a hexbinplot.
The hexbinplot shows that the data is more densely clustered in a streak that
runs through the center of the cluster, roughly along the regression line.
The biggest concentration is around 12 years of education, extending to about
15 years
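A minimal matplotlib hexbin sketch on synthetic education/income data, illustrating how shading encodes point density:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: years of education versus log10 of household income
rng = np.random.default_rng(3)
education = np.clip(rng.normal(13, 2.5, size=100_000), 0, 22)
log_income = 3.5 + 0.08 * education + rng.normal(0, 0.4, size=education.size)

# Hexbin plot: shading shows how many points fall in each hexagonal bin
plt.hexbin(education, log_income, gridsize=40, cmap="Blues")
plt.colorbar(label="count")
plt.xlabel("Years of education")
plt.ylabel("log10(household income)")
plt.show()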
Data Exploration versus Presentation
Using visualization for data exploration is different from presenting results to
stakeholders.
Not every type of plot is suitable for all audiences.
Most of the plots presented earlier try to detail the data as clearly as possible for
data scientists to identify structures and relationships.
These graphs are more technical in nature and are better suited to technical
audiences such as data scientists.
Nontechnical stakeholders, however, generally prefer simple, clear graphics
that focus on the message rather than the data
Statistical Methods
Visualization is useful for data exploration and presentation, but statistics is also
crucial because it is used throughout the entire Data Analytics Lifecycle.
Statistical techniques are used during the initial data exploration and data
preparation, model building, evaluation of the final models, and assessment of
how the new models improve the situation when deployed in the field.
In particular, statistics can help answer many of the key questions that arise in data
analytics.
Hypothesis Testing
When comparing populations, such as testing or evaluating the difference of the means
from two samples of data, a common technique to assess the difference or the
significance of the difference is hypothesis testing.
The basic concept of hypothesis testing is to form an assertion and test it with data.
When performing hypothesis tests, the common assumption is that there is no
difference between two samples.
This assumption is used as the default position for building the test or conducting a
scientific experiment.
Statisticians refer to this as the null hypothesis (H0).
The alternative hypothesis (HA) is that there is a difference between two samples. For
example, if the task is to identify the effect of drug A compared to drug B on patients,
the null hypothesis and alternative hypothesis would be:
H0: Drug A and drug B have the same effect on patients.
HA: Drug A has a different effect on patients than drug B.
It is important to state the null hypothesis and alternative hypothesis, because misstating them
is likely to undermine the subsequent steps of the hypothesis testing process.
A hypothesis test leads to either rejecting the null hypothesis in favor of the alternative or not
rejecting the null hypothesis
Once a model is built over the training data, it needs to be evaluated over the testing data to see
if the proposed model predicts better than the existing model currently being used.
The null hypothesis is that the proposed model does not predict better than the existing model.
The alternative hypothesis is that the proposed model indeed predicts better than the existing
model.
In a sales forecast, for example, the null model could be that the sales of the next month are the
same as those of the prior month.
The hypothesis test needs to evaluate if the proposed model provides a better prediction.
Take a recommendation engine as an example. The null hypothesis could be that the new
algorithm does not produce better recommendations than the current algorithm being deployed.
The alternative hypothesis is that the new algorithm produces better recommendations than the
old algorithm
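The slides do not prescribe a specific test for comparing models; one common way to operationalize such a comparison is a paired t-test on the two models' errors over the same test cases, sketched here with made-up error values (the one-sided option requires SciPy 1.6 or later).

import numpy as np
from scipy import stats

# Hypothetical absolute forecast errors of the two models on the same test periods
errors_existing = np.array([12.0, 9.5, 14.2, 11.1, 13.4, 10.8, 12.9, 11.7])
errors_proposed = np.array([10.1, 9.0, 12.5, 10.2, 12.8, 10.1, 11.4, 10.9])

# H0: the proposed model's mean error is not smaller than the existing model's
# HA: the proposed model's mean error is smaller (i.e. it predicts better)
t_stat, p_value = stats.ttest_rel(errors_proposed, errors_existing, alternative="less")
print(t_stat, p_value)   # reject H0 only if p_value is below the chosen significance level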
Difference of Means
Hypothesis testing is a common approach to draw inferences on whether or not
two populations, denoted pop1 and pop2, are different from each other.
This section provides two hypothesis tests to compare the means of the
respective populations based on samples randomly drawn from each
population.
Specifically, the two hypothesis tests in this section consider the following null
and alternative hypotheses, where μ1 and μ2 denote the two population means:
H0: μ1 = μ2   versus   HA: μ1 ≠ μ2
The basic testing approach is to compare the observed sample means, X̄1 and X̄2,
corresponding to each population.
If the values of X̄1 and X̄2 are approximately equal to each other, the distributions of
X̄1 and X̄2 overlap substantially and the null hypothesis is supported.
A large observed difference between the sample means indicates that the null
hypothesis should be rejected.
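A minimal SciPy sketch of a difference-of-means test on two synthetic samples; Welch's t-test is used here, which does not assume equal population variances.

import numpy as np
from scipy import stats

# Two samples drawn from (possibly different) populations; values are synthetic
rng = np.random.default_rng(4)
sample1 = rng.normal(loc=100, scale=15, size=40)
sample2 = rng.normal(loc=108, scale=15, size=40)

# H0: mu1 = mu2   versus   HA: mu1 != mu2
t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. below 0.05) indicates the null hypothesis should be rejected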