Big Data Analytics (1): Definition
Definition: Big data is often described as anything you cannot fit on one machine. It is commonly characterized by the 4 Vs:
1. Volume: the sheer amount or scale of data.
2. Variety: different forms of data.
3. Velocity: A lot of data is produced at high speed.
4. Veracity: A lot of data is noisy.
Key Skills: Hacking skills, math and stats knowledge, and substantive expertise. Data science sits at the
intersection of math, computer science and the application domain.
Collecting Data
Usually the raw data is not usable right away, and it tends to be quite noisy. The goal of data cleaning is
to remove the noise before analyzing the data. We should always do the preprocessing/cleaning on a
copy, rather than on the original data to avoid any loss.
● Missing Values. Sometimes errors mean that data is irrevocably lost; sometimes the data simply does not exist (e.g. the year of death of a living person). In some cases we can compensate for missing values, but setting them to zero is not a good idea: it is usually too ambiguous (e.g. does a salary of zero mean that the person is unemployed, or that they did not answer the survey question?). The usual techniques are to drop all records with missing values (fine if we have enough data left and we know the values are not missing systematically, otherwise we might introduce bias), or, if we want to make use of records with missing values, to estimate or impute them (a short pandas sketch follows this list).
○ Heuristic-based Imputation. It can be used if we have sufficient domain knowledge.
○ Mean Value Imputation. Use the mean value of a variable as a proxy. Two advantages: adding more such values leaves the mean unchanged, so we do not bias certain statistics. If the values are missing for a systematic reason, however, this makes no sense, e.g. replacing the year of death of a living person with the mean death year on Wikipedia.
○ Random Value Imputation. We can also select a random value from the same column. This may produce strange results, but we can check the quality by repeating the selection: run the model many times with different imputations, and if the results vary widely, random value imputation should not be used.
○ Nearest Neighbor Imputation. We can find a complete record which matches most
closely, and infer missing values using that record. It should be more accurate than
simply using the mean value, especially if there are systematic reasons. However,
identifying nearest neighbors might be expensive.
○ Interpolation Imputation. We can predict missing values from the other fields of the record, e.g. by training a model (such as a linear regression) on the complete records and applying it to records with missing values. However, it may lead to outliers.
● Outliers. Outliers are data points lying far outside a distribution. If they are caused by mistakes, they will interfere with the analysis. Outliers are often created by data entry mistakes or by errors in scraping. To catch such errors, we usually run general sanity checks, i.e. looking at the largest and smallest values of a variable and visually inspecting whether the distribution looks plausible. How hard outliers are to detect depends on the distribution:
○ Normal distribution: the probability that a value lies k standard deviations from the mean decreases exponentially with k, so outliers stand out.
○ Power-law distribution: extreme values are expected, so it is much harder to detect outliers.
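A minimal pandas sketch of two of the steps above, mean value imputation and a simple standard-deviation sanity check for outliers; the column names and values are made up:

import pandas as pd

# Hypothetical toy data with one missing salary value.
df = pd.DataFrame({"age": [25, 31, 58, 47, 23],
                   "salary": [42000, 51000, None, 930000, 38000]})

# Always work on a copy so the original data is never modified.
clean = df.copy()

# Mean value imputation: replace missing salaries with the column mean.
clean["salary"] = clean["salary"].fillna(clean["salary"].mean())

# Sanity check for outliers: how many standard deviations is each value
# from the mean? (Reasonable for roughly normal data, not for power laws.)
z = (clean["salary"] - clean["salary"].mean()) / clean["salary"].std()
print(z.round(2))                   # unusually large |z| deserves a closer look
print(clean["salary"].nlargest(3))  # also inspect the largest raw values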
Crowdsourcing is another option for data processing. Note that it does not work for tasks that require advanced training, that we cannot specify clearly, or whose quality is hard to verify.
Summary: Data processing/cleaning is often more an art than a science. In most data science projects, it
takes up most of the time. If it is not done properly, it will impact the analysis negatively.
Traditional science is hypothesis-driven: researchers formulate a theory and then seek to support or reject the hypothesis.
Data-driven science looks different: the researcher assembles a data set and then hunts for patterns, which helps in formulating hypotheses for future analysis.
The motivation behind exploratory data analysis (EDA) is not to engage in data dredging or p-hacking, i.e. automatically testing huge numbers of hypotheses by exhaustively searching variable combinations until a statistically significant correlation turns up. If we test huge numbers of hypotheses, we are bound to find correlations with no underlying effect.
We use EDA to try to get a feel for the data. There are some application-independent steps:
● Basic Questions. Try to answer some basic questions.
○ Who constructed the data set, when and why? -> This tells us about its relevance and trustworthiness.
○ How big is it? -> It determines the tools we need to use. If it is too big, we may need to
start with a sample.
○ What do the fields mean? -> Try to understand what the different fields mean.
○ Which are numerical? Which are categorical?
● (Most common) Summary Statistics. Look at the basic statistics of each column. We can start with extreme values, the median and the quartile elements (min, max, mean, 25th and 75th percentiles); see the pandas sketch after this list.
● Pairwise Correlations. We can build a matrix of correlation coefficients, if it is possible between
all pairs of columns (otherwise, columns against dependent variables of interest). It will give first
ideas about which models could be successful. Ideally, some features will strongly correlate.
● Class Breakdowns. We can break things down by categories, e.g. gender, age or location, and then look for different distributions when conditioned on these categories, especially where we expect differences.
● (Most common) Plots of Distributions. There are limits to understanding the data without
visualizations. Humans are good at picking up visual patterns, and inspired by this, we can create
dot plots of different variables. We can then spot the general shape of the distribution, outliers
and other patterns.
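A minimal pandas sketch covering these EDA steps, assuming a hypothetical CSV file data.csv with a numeric column price and a categorical column region:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")          # hypothetical input file

# Basic questions: size and field types.
print(df.shape)
print(df.dtypes)

# Summary statistics: min, max, mean, median and quartiles per column.
print(df.describe())

# Pairwise correlations between the numeric columns.
print(df.corr(numeric_only=True))

# Class breakdowns: distribution of a numeric column per category.
print(df.groupby("region")["price"].describe())

# Plots of distributions: histogram to spot shape, outliers and other patterns.
df["price"].hist(bins=50)
plt.show()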
Building Models
So far, we have manipulated and interpreted the data, but we have not yet actually built the model. A
model is something that summarizes information into a tool that is a simplified representation of the real
world, and is good enough to forecast and make predictions.
Probability distributions are important building blocks for models. Examples include normal distribution,
uniform distribution, Gamma distribution, etc.
Once we have decided on a model, we then have to fit the model. We can estimate the parameters of the
model using the observed data. This process often includes optimization methods and algorithms to help
get the parameters.
Occam’s Razor: The simplest explanation is the best explanation. For statistical modeling, this means that we should use a simple model and minimize the parameter count. With complicated models, there is a danger of overfitting: the model memorizes the training data set instead of generalizing to the real world.
But simplicity is not the only goal: if a model performs poorly, it may be too simple. There is a trade-off
between accuracy and simplicity. Simpler models tend to be more robust and understandable, but we
need a model that gets the job done.
Bias-Variance Trade-offs: bias is error from incorrect assumptions built into a model, e.g. using linear functions instead of higher-order curves. Variance is error from sensitivity to fluctuations, e.g. noise from measurement errors. Errors of bias produce underfit models: the model does not fit the training data tightly enough and therefore performs badly on both training and test data. Errors of variance produce overfit models: the quest for accuracy leads the model to mistake noise for signal, so it adjusts too closely to the noise in the training set.
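A small numpy sketch of this trade-off: fitting polynomials of increasing degree to synthetic noisy data and comparing training and test error (all data here is made up):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth nonlinear signal plus measurement noise.
def f(x):
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)
x_test = np.linspace(0, 1, 300)
y_train = f(x_train) + rng.normal(scale=0.3, size=x_train.size)
y_test = f(x_test) + rng.normal(scale=0.3, size=x_test.size)

for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit a polynomial model
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Low degrees tend to underfit (high bias); high degrees tend to overfit
    # (high variance), i.e. low training error but higher test error.
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")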
Models come in many different shapes and sizes.
● Linear vs. Non-Linear. Linear models compute weighted sums of variables. Each feature
variable is multiplied with a coefficient (reflecting its importance). These values are summed up to
produce a score. Linear models are readily understandable, generally defensible and easy to
build by using linear regression. However, the world is usually not linear. We then need some
nonlinear functions, such as higher-order polynomials, logarithms and exponentials. These
functions might offer a much tighter fit. But it is also harder to find the best possible coefficients
for nonlinear models. On the other hand, we don’t necessarily need the best fit. If the linear model
is accurate enough, it is probably better.
● Black-Box vs. Descriptive. Black boxes do their job, but in an unknown manner. They can be
extremely effective, accomplish tasks that could not be done before but can also be completely
opaque as to the why. Descriptive models provide some insight into the decision process.
Theory-driven models are generally descriptive and are implemented based on well-developed
theory.
● First Principle vs. Data Driven. First principle models are based on a belief of how something works, e.g. calculus, algebra, physics or heuristics-based reasoning. Data driven models are based on observed correlations between input parameters and outcome variables; for example, machine learning builds models from training data sets, which can be done without understanding the domain. Domain knowledge can be used to build ad-hoc models, as it guides their structure and design, but such models can be vulnerable to changing conditions and are difficult to apply to new tasks. Data-driven machine learning is general and can be retrained on fresh data; with a different data set, we can do something completely different. Generally, mixing both is a good idea: we can use the domain knowledge to shape the model and use the data for the best fit and for evaluation.
● Stochastic vs. Deterministic. Stochastic basically means “randomly determined”. It builds some
notion of probability into the model. The motivation is that reality is often complex and hard to model exactly, so providing event probabilities is more honest (and practical). First
principle models are often deterministic. For example: Newton’s law of motion predicts an object’s
location exactly. Deterministic models are easier to implement and debug, as they return the
same result if we give the same input.
● Flat vs. Hierarchical. Problems often exist on several different levels, for example, predicting
stock prices will depend on the general state of the economy, a company’s balance sheet and
performance of other companies in the sector. Hierarchical models split into submodels, which
allows us to build a more accurate model in a more transparent way. The disadvantages of
hierarchical models include: it is usually more complex to build, and we will need the appropriate
data for each submodel.
Once our model is built, we then need to evaluate how effective the model is. To achieve this, we first
need to build a reasonable evaluation environment, and automate the evaluation. It should generate
reports with all the relevant plots and summary statistics.
Then we need to measure how well a model performs. There are three techniques: performance statistics, error distributions and confusion matrices. Different cases call for different metrics.
● Binary Classifier. We can distinguish four different cases.
○ Depending on predicted vs. actual class:

                     Predicted: Yes         Predicted: No
     Actual: Yes     True Positive (TP)     False Negative (FN)
     Actual: No      False Positive (FP)    True Negative (TN)
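A small sketch of the performance statistics commonly derived from these four counts; the counts themselves are made up:

# Hypothetical counts from a binary classifier's confusion matrix.
tp, fn = 80, 20   # actual Yes: predicted Yes / predicted No
fp, tn = 10, 890  # actual No:  predicted Yes / predicted No

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # how many predicted Yes are correct
recall = tp / (tp + fn)               # how many actual Yes are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")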
To evaluate our proposed model, we need to compare it against a baseline model. A baseline model is the simplest reasonable model that produces answers. There are several different types of baseline models:
● Classification Baselines. We can select a label uniformly at random as the output, choose the most common label, or use the most accurate single-feature model. Besides these, we can use somebody else's model, compare with the state of the art, or compare our model against an upper bound.
● Value Prediction Baselines. We can use the mean, median or most common value as the prediction. We can also use a linear regression as a baseline. For time-series forecasts, we can use the value of the previous point in time as the baseline (see the sketch after this list).
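A small Python sketch of two of these baselines, the most-common-label classifier and the previous-value forecast for a time series; the data is made up:

from collections import Counter

# Classification baseline: always predict the most common training label.
train_labels = ["spam", "ham", "ham", "ham", "spam", "ham"]
majority = Counter(train_labels).most_common(1)[0][0]
test_labels = ["ham", "spam", "ham"]
acc = sum(label == majority for label in test_labels) / len(test_labels)
print(f"majority-class baseline: predict '{majority}', accuracy {acc:.2f}")

# Time-series baseline: predict each point with the previous observed value.
series = [101.0, 103.5, 102.0, 104.2, 104.0]
preds = series[:-1]                 # forecast for time t is the value at t-1
actual = series[1:]
mae = sum(abs(p - a) for p, a in zip(preds, actual)) / len(actual)
print(f"previous-value baseline: MAE {mae:.2f}")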
There are lots of good techniques for visualizing data. Edward Tufte proposed several visualization
principles:
● Maximize the data-ink ratio. Visualization is supposed to show off the data; however, we often spend too much ink on other effects. We should focus on showing the data itself by maximizing the ratio of data-ink to the total ink used in the graphic.
● Minimize the lie factor. Sometimes data is reported accurately, but the graphic suggests a trend that is not there; the lie factor is the ratio between the effect shown in the graphic and the effect present in the data.
● Exploit the power of repetition. Do not put too much information into one plot; it is better to break it down into multiple plots/charts, since arranging them in an array facilitates comparisons. This also works well for multiple time series (see the sketch below).
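A minimal matplotlib sketch of the repetition idea: an array of small plots, one per series, rather than one crowded chart (the series here are synthetic random walks):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
t = np.arange(100)

# Four synthetic time series plotted as small multiples for easy comparison.
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 5))
for i, ax in enumerate(axes.flat):
    series = np.cumsum(rng.normal(size=t.size))   # random walk
    ax.plot(t, series, linewidth=1)
    ax.set_title(f"series {i + 1}")
fig.tight_layout()
plt.show()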
We can store everything in flat files. Files are simple to read, write and analyse sequentially. However,
they do not provide optimized access and there are also issues with inconsistency, concurrency and
integrity. Usually, during analysis we return to different parts of the dataset, and then we want efficient
access to subsets of the data. There are two types of databases: Relational Systems and NoSQL
systems.
Relational Systems: They are mature technology with well-defined interfaces and are well-organized and
structured (which provides meta-data in the form of a schema). It also supports multiple users with built-in
multi-user synchronization for data integrity. It is good at basic statistics and reporting. It also has
well-tuned query processing engines and thus the database can do efficient preprocessing. By using the
query engine, it will deliver the data in the way that we need it. However, it also has some disadvantages:
It only scales to medium-sized data sets and does not handle scale-out well. It only has limited
functionality when it comes to analytics and it has problems handling semistructured data. To summarize, if
the data is already in a relational system, then it can be processed very efficiently and we can assume
that databases will provide the required functionality. If the functionality is not there, most systems allow
developers to extend it via UDFs (user-defined function).
The dominance of relational systems is coming to an end, as NoSQL systems have started challenging
RDBMS. NoSQL systems are well-suited for scale-out. The disadvantages of NoSQL include:
● It is not as mature as relational systems
● There’s no standard interface, which results in overhead for development.
● Being schemaless is a blessing and a curse, and can lead to technical debt.
● NoSQL sacrifices data integrity for performance.
● NoSQL lacks a lot of analytics functionality. But it can be plugged into scalable architectures.
When implementing such distributed systems, nodes can join and leave, but we need to ensure
replication. One method is consistent hashing.
Consistent Hashing. In consistent hashing, the keys of records, and the IDs of nodes are hashed with
the same hash function. We can interpret the domain of hash values as a ring, and a node is responsible
for a certain range. When inserting a new value, it is clear who is responsible.
When a node joins, it splits the range of an existing node: values smaller than the hash value of the new
node are assigned to the new node. When a node leaves, its range is merged with the range of its
successor.
As hash values are distributed randomly, this can be unbalanced. We can fix this by introducing virtual
nodes and mapping them to physical nodes.
To gain resilience, we replicate every record n times. The n-1 copies are stored in the n-1 successor
nodes of the responsible node.
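A minimal Python sketch of a consistent-hashing ring with virtual nodes and successor-based replication; the node names, the number of virtual nodes and the choice of hash function are illustrative assumptions:

import bisect
import hashlib

def h(key: str) -> int:
    """Hash a string onto the ring (here: a 32-bit hash space)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Each physical node is mapped to several virtual nodes on the ring
        # so that the ranges are more balanced.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

    def nodes_for(self, key, replicas=3):
        """Responsible node plus the next distinct successors for replication."""
        idx = bisect.bisect(self.ring, (h(key), ""))
        owners = []
        for _hash, node in self.ring[idx:] + self.ring[:idx]:
            if node not in owners:
                owners.append(node)
            if len(owners) == replicas:
                break
        return owners

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.nodes_for("user:42"))   # e.g. ['node-c', 'node-a', 'node-b']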
Distributed RDBMS go for consistency, while many NoSQL systems relax consistency. For our application domain (data analysis), consistency is not that important.
In summary, as soon as we have multiple users and we are modifying parts of the data, flat files are not that great. For databases, there is no silver bullet: choosing between relational and NoSQL systems usually involves lots of trade-offs depending on the context.
Map Reduce
When actually processing the data, we don’t have a standardized declarative query language such as
SQL. Often there are interfaces such as APIs and RESTful (REpresentational State Transfer via HTTP) interfaces.
MapReduce provides an abstract framework for parallelization. The map phase iterates over a large number of records and extracts something of interest from each. The intermediate results are then shuffled and sorted. The reduce phase aggregates the intermediate results and generates the final output.
When using MapReduce, programmers have to specify two functions: map and reduce.
● map: (k1,v1) -> list(k2,v2). This processes key/value pairs and produces set of intermediate pairs.
● reduce: (k2, list(v2)) -> list(v2). It combines all intermediate values for a particular key, and
produces a set of merged output values.
Example:
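The original example is not reproduced here. As a stand-in, a minimal word-count sketch in plain Python that follows the map/reduce signatures above (the shuffle step is simulated with a dictionary):

from collections import defaultdict

def map_fn(doc_id, text):
    """map: (k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    """reduce: (k2, list(v2)) -> list(v2): sum the counts for one word."""
    return [sum(counts)]

documents = {"d1": "big data is big", "d2": "data is everywhere"}

# Map phase.
intermediate = []
for doc_id, text in documents.items():
    intermediate.extend(map_fn(doc_id, text))

# Shuffle and sort: group intermediate values by key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce phase.
result = {key: reduce_fn(key, values)[0] for key, values in sorted(groups.items())}
print(result)   # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}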
One actual system is Hadoop, which replaced many proprietary parallel processing systems. It has
several advantages:
● Runs on cheap hardware.
● No licensing fees.
● Scalable.
It is suited for many common data processing tasks, including batch processing and ETL.
HDFS. It is based on the Google File System and has a master-slave architecture. The master is a NameNode that manages the metadata and maps blocks of files to DataNodes. The slaves are one or more DataNodes that store the data and process the read and write operations.
There are two major versions: version 1 (and earlier) and version 2 (and later).
Version 1. In version 1, Hadoop used JobTrackers. 1) A user specifies a job with the location of the
input and output files, the implementation of the map and reduce functions and various parameters
describing the job. 2) Then Hadoop sends this information to the JobTracker, which configures the slaves
and plans and supervises the individual tasks. 3) Then the TaskTrackers will execute the tasks on the
individual nodes and write the result of the reduce step into the output files. The JobTracker takes care of:
● Job Scheduling (matching tasks with TaskTrackers)
● Task Progress Monitoring. (keeping track of tasks; restarting failed or slow tasks; doing task
bookkeeping).
The disadvantages are that this design couples MapReduce tasks very tightly to HDFS, and that it has scalability and availability issues.
Version 2. In version 2, Hadoop splits the functionality into a ResourceManager (RM) and an ApplicationMaster (AM). The ResourceManager is responsible for allocating resources and scheduling applications. An ApplicationMaster handles the application lifecycle (including fault handling). The split in functionality adds a layer of abstraction to resource management. There is one RM per cluster and one AM per application.
● ResourceManager (RM). The RM runs as the master; additionally, each node runs a NodeManager (NM), which handles the local resources on a single node. The RM and NMs run in a master/slave setup. Resources on local nodes are bundled up in containers (logical representations of resources, including memory and cores).
● ApplicationMaster (AM). An AM is started as the first container of an application. It registers with
the RM and can then update the RM about its status (heartbeat), request resources and run the
application. An AM runs user code, which might be malicious. Thus, an AM does not run as a
privileged service.
Example:
● Naive Bayes. Given a training data set, we want to compute P(x | y) where y ∈ {0, 1}. We only have to count the number of occurrences of the events x_j = k && y = 0, x_j = k && y = 1, y = 0 and y = 1. We can therefore divide up the training data set and assign the parts to mappers. Every mapper counts the number of appearances, and these counts are passed on to reducers (in this case, every count gets its own reducer). Every reducer sums up the values that the mappers send to it (a small sketch follows this list).
● Logistic Regression. It is a regression model with a binary dependent variable.
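A small Python sketch of the Naive Bayes counting step in the map/reduce style described above; the training records are made up, and the framework (shuffle and distribution to reducers) is emulated locally:

from collections import defaultdict

# Hypothetical training records: (feature vector x, label y).
training = [([1, 0], 0), ([1, 1], 1), ([0, 1], 1), ([1, 0], 0)]

def map_fn(record):
    """Emit one count per event we need: (x_j = k and y), and y alone."""
    x, y = record
    pairs = [(("y", y), 1)]
    for j, k in enumerate(x):
        pairs.append((("x", j, k, y), 1))
    return pairs

def reduce_fn(key, counts):
    """Sum all counts that arrive for one key."""
    return sum(counts)

# Emulate map, shuffle and reduce on a single machine.
groups = defaultdict(list)
for record in training:
    for key, value in map_fn(record):
        groups[key].append(value)
counts = {key: reduce_fn(key, values) for key, values in groups.items()}

# Estimate a conditional probability from the counts.
p = counts[("x", 0, 1, 0)] / counts[("y", 0)]   # P(x_0 = 1 | y = 0)
print(p)                                        # 1.0 for this toy data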
Spark