

Big Data Analytics (1) 

Given by Prof. Dr. Sven Helmer

Definition: Big data is anything you can’t fit on one machine. It is often characterized by the 4 V’s:
1. Volume: the sheer amount or scale of data.
2. Variety: data comes in many different forms.
3. Velocity: a lot of data is produced at high speed.
4. Veracity: a lot of data is noisy.

Challenges: There are several challenges when analyzing big data:


1. Heterogeneity: Humans can cope with the nuances of data, while algorithms are rather bad at this; they mostly need structured data.
2. Inconsistency and incompleteness.
3. Scale.
4. Timeliness.
5. Privacy and Data Ownership.
6. Interpreting the Data.

Key Skills​: Hacking skills, math and stats knowledge, and substantive expertise. Data science sits at the
intersection of math, computer science and the application domain.

Process of Data Science​:


● Data Acquisition​.
● Information Extraction​. Data in many different formats then has to be: integrated, aggregated
and represented.
● Data Cleaning. Most data sources are notoriously unreliable.
● Exploratory Data Analysis.​ Plots, Graphs, Summaries. We are trying to get a feel for the data.
● Modeling and Analysis.
● Interpretation.

Collecting Data

There are some possible sources of data:


● Company​: Many companies release certain datasets via rate-limited APIs.
● Government​: Governments tend to collect a lot of data as well, and many countries have passed
laws to grant access.
● Academic: Research often involves the creation of datasets, and many publication venues publish datasets with papers. These datasets have been well-studied and might still be interesting for interdisciplinary studies.

Techniques for collecting data include:


● Scraping: Scraping is the technique used to extract information from webpages (a minimal sketch follows this list). It includes two steps:
○ Spidering: downloading the right set of pages.
○ Scraping: Stripping content from each page.


It’s important to remember that it is bad form to hit a site more than once per second, and
we should be careful not to violate the terms of service.
● Logging​: If we are actually running a system, we own a potential data source. There are lots of
ways to log the data: web service, communication devices, laboratory instruments, and various
IOT devices.
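A minimal sketch of the two scraping steps in Python, assuming the third-party requests and BeautifulSoup (bs4) packages are available; the seed URLs, User-Agent string, and one-second delay are illustrative choices, not part of the original notes.

```python
import time
import requests
from bs4 import BeautifulSoup

# Hypothetical seed pages; in practice these come from a site map or a link crawl.
SEED_URLS = ["https://example.org/page1", "https://example.org/page2"]

def spider(urls):
    """Spidering: download the right set of pages, at most one request per second."""
    pages = {}
    for url in urls:
        response = requests.get(url, headers={"User-Agent": "course-example-bot"})
        if response.status_code == 200:
            pages[url] = response.text
        time.sleep(1)  # be polite: do not hit a site more than once per second
    return pages

def scrape(html):
    """Scraping: strip the content of interest from a single page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else ""
    text = soup.get_text(separator=" ", strip=True)
    return {"title": title, "text": text}

for url, html in spider(SEED_URLS).items():
    print(url, scrape(html)["title"])
```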

Data Processing and Cleaning

Usually the raw data is not usable right away, and it tends to be quite noisy. The goal of data cleaning is
to remove the noise before analyzing the data. We should always do the preprocessing/cleaning on a
copy, rather than on the original data to avoid any loss.

There are two types of noise we can find in raw data:


● Errors: Information that is fundamentally lost during collection. For example: lacking resolution in sensor data (lost precision); broken-down sensors sending completely wrong measurements; missing logs due to a crashed server. This information cannot be reconstructed.
● Artifacts: Systematic problems arising from data collection. Often we can correct these if we have the original raw data set, but first we need to detect the artifacts. To do so, we have to be aware of the meaning of each field:

● Units of measurement (we can use unit conversion to avoid mismatches).


● Numerical Representation. Numerical data with common units may still cause issues, depending on how we represent it (integers, decimals, floats, doubles, or even text). Integers are used for counting discrete entities; measurements should be reported as real numbers.
● Entity Resolution​. When integrating records from different data sets, we will need to find the
same objects in different datasets. It is easy when they share the same key field. However, often
names are used as keys. In that case, we need to handle accents, umlauts, and translations. The
general technique is ​unification​: Reduce each name to a single canonical version.
○ There are a couple of different steps: convert all strings to lowercase; eliminate middle
names (or use initials); make sure that datasets use the same encoding.
○ There might be some problems: different names may suddenly match, so we need to tune the unification process.
○ For Time/Date Unification, there are some issues: there are different time zones and a date line; there is daylight saving time; there are leap years and leap seconds. Usually, we use a standard like UTC (Coordinated Universal Time).
○ For Financial Unification, there are also some issues:
■ Currency Conversion. Representing international prices via standardized units.
Exchange rates can vary during a day. Different markets can have different rates
and spreads.
■ Correction for inflation. Inflation changes the value of a currency over time. The
problem is that there might be some unadjusted prices over longer time periods
(then we cannot compare it directly).
■ Stock Prices: There is a dip in value when dividends are paid out (so prices before and after cannot be compared directly).

● Missing Values. Errors mean that some data is irrevocably lost. Sometimes the data is just not there (e.g. the year of death of a living person). Sometimes we can compensate for missing data, but setting missing values to zero is not a good idea: it is usually too ambiguous (e.g. does a salary of zero mean that a person is unemployed, or that they did not answer a survey question?). The usual techniques include dropping all records with missing values (this is fine if we have enough data left and we know the missingness is not systematic, otherwise we might introduce bias). If we want to make use of records with missing values, we can estimate or impute the missing values (a sketch follows this list).
○ Heuristic-based Imputation​. It can be used if we have sufficient domain knowledge.
○ Mean Value Imputation. Using the mean value of a variable as a proxy has two advantages: adding more such values leaves the mean unchanged, and we do not bias certain statistics. If there is a systematic reason for the missing values, it does not make sense, e.g. replacing the year of death of a living person with the mean death year in Wikipedia.
○ Random Value Imputation. We can also select a random value from the same column. With this approach, we may end up with strange results, but we can check the quality by repeating the selection: we run the model many times with different imputations, and if we get widely varying results, we shouldn't use random value imputation.
○ Nearest Neighbor Imputation​. We can find a complete record which matches most
closely, and infer missing values using that record. It should be more accurate than
simply using the mean value, especially if there are systematic reasons. However,
identifying nearest neighbors might be expensive.
○ Interpolation Imputation. We can predict values by looking at other fields of the record: we train a model on complete records and then apply it to the records with missing values, for example using linear regression. However, this may produce outliers.
● Outliers. Outliers are data points lying outside of a distribution. If they are caused by mistakes, they will interfere with the analysis. Outliers are often created by data entry mistakes or errors in scraping. To find such errors, we usually use general sanity checks, i.e. looking at the largest and smallest values of a variable, and visual inspection to confirm that the distribution looks OK. Depending on the distribution, outliers can be hard to detect.
○ For a normal distribution: the probability that a value is k standard deviations from the mean decreases exponentially with k.
○ For a power law distribution: it is much harder to detect outliers.
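A minimal sketch of dropping records, mean and random value imputation, and a simple sanity check for outliers, assuming the pandas and numpy packages; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one missing salary and one implausible outlier (hypothetical values).
df = pd.DataFrame({
    "name":   ["Alice", "Bob", "Carol", "Dave"],
    "salary": [52000.0, np.nan, 48000.0, 5_000_000.0],
})

# Option 1: drop all records with missing values
# (fine if enough data is left and the missingness is not systematic).
dropped = df.dropna(subset=["salary"])

# Option 2: mean value imputation -- adding such values leaves the mean unchanged.
mean_salary = df["salary"].mean()
imputed = df.assign(salary=df["salary"].fillna(mean_salary))

# Option 3: random value imputation -- draw from observed values of the same column.
rng = np.random.default_rng(0)
observed = df["salary"].dropna().to_numpy()
random_filled = df["salary"].copy()
missing = random_filled.isna()
random_filled[missing] = rng.choice(observed, size=missing.sum())

# Simple sanity check for outliers: flag values more than k standard deviations
# from the mean (normal-distribution reasoning; harder for power-law data).
k = 3
z = (imputed["salary"] - imputed["salary"].mean()) / imputed["salary"].std()
print(imputed[np.abs(z) > k])
```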

Crowdsourcing is another option for data processing. Note that it does not work for tasks that require advanced training, that we cannot specify clearly, or whose quality level is hard to verify.

Summary​: Data processing/cleaning is often more an art than a science. In most data science projects, it
takes up most of the time. If it is not done properly, it will impact the analysis negatively.

Exploratory Data Analysis

Traditional science is hypothesis-driven: researchers formulate a theory and then seek to support or reject the hypothesis.
Data-driven science looks different: researchers assemble a data set and then hunt for patterns. This helps in formulating hypotheses for future analysis.

The motivation behind exploratory data analysis (EDA) is that we shouldn’t engage in data dredging or p-hacking, i.e. we shouldn’t automatically test huge numbers of hypotheses by exhaustively searching variable combinations. Instead, we should first find a statistically significant correlation. If we test huge numbers of hypotheses, we may find correlations with no underlying effect.

We use EDA to try to get a feel for the data. There are some application-independent steps:
● Basic Questions​. Try to answer some basic questions.
○ Who constructed the data set, when and why? -> This tells us about its relevance and trustworthiness.
○ How big is it? -> It determines the tools we need to use. If it is too big, we may need to
start with a sample.
○ What do the fields mean? -> Try to understand what the different fields mean.
○ Which are numerical? Which are categorical?
● (Most common) Summary Statistics. Look at the basic statistics of each column. We can start with: extreme values; the mean and median; quartile elements (max, min, mean, 25%, 75%). A pandas sketch follows this list.
● Pairwise Correlations. ​We can build a matrix of correlation coefficients, if it is possible between
all pairs of columns (otherwise, columns against dependent variables of interest). It will give first
ideas about which models could be successful. Ideally, some features will strongly correlate.
● Class Breakdowns. We can break things down by categories, e.g. gender, age, location. We can then look for different distributions when conditioned on categories, especially when we think there should be differences.
● (Most common) Plots of Distributions.​ There are limits to understanding the data without
visualizations. Humans are good at picking up visual patterns, and inspired by this, we can create
dot plots of different variables. We can then spot the general shape of the distribution, outliers
and other patterns.
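A minimal pandas sketch of the application-independent EDA steps above (size and field types, summary statistics, pairwise correlations, class breakdowns, and distribution plots); the synthetic data and column names are made up for illustration, assuming pandas, numpy and matplotlib are available.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data standing in for a freshly acquired dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "category": rng.choice(["A", "B", "C"], size=200),
    "value": rng.normal(loc=50, scale=10, size=200),
    "size": rng.integers(1, 100, size=200),
})

# Basic questions: how big is it, which fields are numerical / categorical?
print(df.shape)
print(df.dtypes)

# Summary statistics: max, min, mean, median (50%), 25% and 75% quartiles per column.
print(df.describe())

# Pairwise correlations between the numerical columns.
print(df.select_dtypes("number").corr())

# Class breakdowns: a numerical column conditioned on a category.
print(df.groupby("category")["value"].describe())

# Plots of distributions: histogram and scatter plot to spot shape and outliers.
df["value"].hist(bins=30)
df.plot.scatter(x="size", y="value")
plt.show()
```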

Building Models

So far, we have manipulated and interpreted the data, but we have not yet actually built the model. A
model is something that summarizes information into a tool that is a simplified representation of the real
world, and is good enough to forecast and make predictions.

Probability distributions are important building blocks for models. Examples include normal distribution,
uniform distribution, Gamma distribution, etc.

Once we have decided on a model, we then have to fit the model. We can estimate the parameters of the
model using the observed data. This process often includes optimization methods and algorithms to help
get the parameters.

Occam’s Razor:​ The simplest explanation is the best explanation. For statistical modeling, this means
that we should use a simple model and minimize the parameter count. With complicated models, there is a danger of overfitting: the model memorizes the training dataset but does not generalize to the real world.
But simplicity is not the only goal​: if a model performs poorly, it may be too simple. There is a trade-off
between accuracy and simplicity. Simpler models tend to be more robust and understandable, but we
need a model that gets the job done.

Bias-Variance Trade-offs​: bias is error from incorrect assumptions built into a model, e.g. using linear
functions instead of higher-order curves. Variance is error from sensitivity to fluctuations, e.g. noise from
measurement errors introduces variance. Errors of bias produce underfit models: the model does not fit the training data as tightly as possible and therefore performs badly on both training and testing data. Errors of variance produce overfit models: the quest for accuracy causes the model to mistake noise for signal, so it adjusts too closely to the noise in the training set.

There are many different types of models, coming in different shapes and sizes.
● Linear vs. Non-Linear​. Linear models compute weighted sums of variables. Each feature
variable is multiplied with a coefficient (reflecting its importance). These values are summed up to
produce a score. Linear models are readily understandable, generally defensible and easy to
build using linear regression. However, the world is usually not linear, and we may need nonlinear functions, such as higher-order polynomials, logarithms and exponentials. These functions might offer a much tighter fit, but it is also harder to find the best possible coefficients for nonlinear models. On the other hand, we don’t necessarily need the best fit: if the linear model is accurate enough, it is probably the better choice (see the sketch after this list).
● Black-Box vs. Descriptive​. Black boxes do their job, but in an unknown manner. They can be
extremely effective, accomplish tasks that could not be done before but can also be completely
opaque as to the why. Descriptive models provide some insight into the decision process.
Theory-driven models are generally descriptive and are implemented based on well-developed
theory.
● First Principle vs. Data Driven​. First principle models are based on a belief on how something
works, like calculus, algebra, physics or heuristics-based reasoning. Data driven models are
based on observed correlations between input parameters and outcome variables. For example,
machine learning builds models from training data sets. It can be done without understanding the
domain. Domain knowledge can be used to build ad-hoc models, as it guides their structure and design. But such models can also be vulnerable to changing conditions and are difficult to apply to new tasks. Data-driven machine learning is general, and can be retrained on fresh data. If we use a
different data set, we can do something completely different. Generally, mixing both of them is a
good idea because we can make use of the domain knowledge to tune the model and use the
data for best fit and evaluation.
● Stochastic vs. Deterministic​. Stochastic basically means “randomly determined”. It builds some
notion of probability into the model. The motivation behind this is that reality is often complex and hard to model accurately, thus providing event probabilities is more honest (and practical). First
principle models are often deterministic. For example: Newton’s law of motion predicts an object’s
location exactly. Deterministic models are easier to implement and debug, as they return the
same result if we give the same input.
● Flat vs. Hierarchical​. Problems often exist on several different levels, for example, predicting
stock prices will depend on the general state of the economy, a company’s balance sheet and
performance of other companies in the sector. Hierarchical models split into submodels, which
allows us to build a more accurate model in a more transparent way. The disadvantages of
hierarchical models include: it is usually more complex to build, and we will need the appropriate
data for each submodel.
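A minimal sketch contrasting a linear fit with a higher-order polynomial fit using numpy, to illustrate the linear vs. non-linear (and overfitting) discussion above; the synthetic data and the chosen degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)   # noisy, essentially linear data

# Linear model: a weighted sum of the feature (degree-1 polynomial fit).
linear_coeffs = np.polyfit(x, y, deg=1)

# Non-linear model: higher-order polynomial -- tighter fit, but risk of overfitting.
poly_coeffs = np.polyfit(x, y, deg=9)

def training_mse(coeffs):
    """Mean squared error of the fitted polynomial on the training data."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

print("linear training MSE:  ", training_mse(linear_coeffs))
print("degree-9 training MSE:", training_mse(poly_coeffs))  # lower on training data, less robust
```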

Once our model is built, we then need to evaluate how effective the model is. To achieve this, we first
need to build a reasonable evaluation environment, and automate the evaluation. It should generate
reports with all the relevant plots and summary statistics.

There are three different kinds of datasets:


● Training dataset: it is used to study the domain and fit the model.
● Validation dataset: it is used for evaluating a model while fitting it.
● Testing dataset: it is used to evaluate the final model. It is often a carefully selected, well-curated
set.
We usually split the original data into 60%, 20%, and 20% of the total data set, taking care when sampling the data. If the dataset is rather small, we can use cross validation (a sketch follows this list). That is:
● Data is partitioned into k equal-sized blocks.
● These blocks are used to train k distinct models.
● Model i is trained on k-1 blocks, i.e. all blocks except block i.
● Block i is then used to validate model i.
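A minimal sketch of k-fold cross validation, assuming scikit-learn and numpy are available; the data and the linear regression model are placeholders, not part of the original notes.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Placeholder data: 100 records with 3 features and a numeric target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # Model i is trained on k-1 blocks (all blocks except block i) ...
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # ... and block i is used to validate it.
    scores.append(model.score(X[val_idx], y[val_idx]))

print("validation R^2 per fold:", scores)
```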

Then we need to measure how well a model performs. There are three techniques: performance statistics, error distributions, and confusion matrices. In different cases, we use different metrics.
● Binary Classifier​. We can distinguish four different cases.
○ Depending on predicted vs. actual class.

                        Predicted: Yes          Predicted: No
Actual: Yes             True Positive (TP)      False Negative (FN)
Actual: No              False Positive (FP)     True Negative (TN)

Based on this, we can define different quality measures (see the sketch after this list):


● Accuracy: (TP+TN)/(TP+TN+FN+FP)
● Precision: (TP)/(TP+FP)
● Recall: TP/(TP+FN)
● F-score: 2 * (precision * recall)/(precision+recall)
● Multiclass Classifier​.
○ We use a confusion matrix: assume there are d labels; we can create a d*d matrix, where C(x,y) = fraction of instances of class x labeled as y. Ideally, most instances show up on the main diagonal.

● Value Prediction (regression)​.


○ Classifiers deal with discrete values. For continuous values, we need error statistics,
which measure the differences between forecast and actual result.
○ Absolute error: Δ = y′ − y. It is simple to implement and understand, and we can distinguish y′ > y from y′ < y. However, we need to process the signed values, as they offset each other when aggregating this error.
○ Relative error: ε = (y − y′)/y. Absolute errors are meaningless without units, while relative errors produce a unit-less quantity.
○ Squared error: Δ² = (y′ − y)². It is always positive and errors don’t cancel each other out, thus it is good for aggregation. However, large error values contribute disproportionately, as outliers have a bigger impact.
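A minimal sketch computing the quality measures above from the four confusion-matrix counts, plus the three error statistics for value prediction; plain Python with made-up example numbers.

```python
def classification_metrics(tp, tn, fp, fn):
    """Quality measures for a binary classifier, as defined above."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f_score   = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

def error_statistics(y_pred, y_true):
    """Absolute, relative and squared error for a single prediction."""
    absolute = y_pred - y_true                 # Δ = y' − y (signed)
    relative = (y_true - y_pred) / y_true      # ε = (y − y') / y (unit-less)
    squared  = (y_pred - y_true) ** 2          # Δ² = (y' − y)² (always positive)
    return absolute, relative, squared

# Example with made-up counts and values.
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
print(error_statistics(y_pred=11.0, y_true=10.0))
```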

To evaluate our proposed model, we will need to compare it against a baseline model, i.e. the simplest reasonable model that produces answers. There are different types of baseline models:
● Classification Baselines. We can select a label uniformly at random as output, or we can choose the most common label. We can also use the most accurate single-feature model. Besides these, we can use somebody else’s model and compare with the state of the art, or compare our model against an upper bound.
● Value Prediction Baselines. We can use the mean, median or most common value as the prediction. We can also use a linear regression method as a baseline. For time-series forecasts, we can use the value of the previous point in time as the baseline (see the sketch after this list).
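A minimal sketch of two of the baselines mentioned above: the most-common-label baseline for classification and the previous-value (persistence) baseline for time-series forecasts; plain Python, with made-up labels and values.

```python
from collections import Counter

# Classification baseline: always predict the most common label in the training data.
train_labels = ["spam", "ham", "ham", "ham", "spam"]          # made-up labels
most_common_label = Counter(train_labels).most_common(1)[0][0]

def predict_baseline(record):
    """Ignore the record and return the majority class."""
    return most_common_label

# Value-prediction baseline for a time series: forecast the previous point in time.
series = [10.2, 10.4, 10.1, 10.8]                             # made-up values
persistence_forecast = series[-1]      # prediction for the next time step

print(predict_baseline({"subject": "hello"}), persistence_forecast)
```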

Communicating / Using the Results

There are lots of good techniques for visualizing data. Edward Tufte proposed several visualization
principles:
● Maximize data-ink ratio. Visualization is supposed to show off data, however, we often spend too
much on other effects. We should focus on showing the data itself by maximizing the ratio:
data-ink/total ink used in graphics.
● Minimize the lie factor. Sometimes data is reported accurately, but the graphics suggest a trend that isn’t there. Some common mistakes include:

○ Presenting means without variance.


○ Presenting interpolations without actual data.
○ Distortion of scales.
○ Eliminating tick labels from axes.
○ Hiding the origin point of the plot.
● Minimize chart junk. Don’t add cool visual effects just for the sake of it. The data should tell the
story, not the chart junk. Possible improvements include:
○ Get rid of the grid (jailbreak your data).
○ Remove colored background.
○ Bounding box does not contribute information.
○ Remove lines instead of adding them.
● Use proper scales and clear labeling. Labels need to report the proper magnitude of numbers.
Scaling needs to show the right resolution, otherwise it becomes difficult to compare/interpret the
data.
● Make effective use of color. Usually, colors play two major roles in charts: distinguishing between different classes and encoding numerical values (e.g. heat maps). The general rule is to show large areas with unsaturated (less colorful) colors and small regions with saturated (more colorful) colors. Use a well-established scale instead of inventing one.

● Exploit the power of repetition. Do not put too much information into one plot. It is better to break it down into multiple plots/charts; arranging them in an array facilitates comparisons. This is also good for multiple time series.

Files, Relational Systems and NoSQL

We can store everything in flat files. Files are simple to read, write and analyse sequentially. However,
they do not provide optimized access and there are also issues with inconsistency, concurrency and
integrity. Usually, during analysis we return to different parts of the dataset, and then we want efficient
access to subsets of the data. There are two types of databases: Relational Systems and NoSQL
systems.

Relational Systems​: They are mature technology with well-defined interfaces and are well-organized and
structured (which provides meta-data in the form of a schema). It also supports multiple users with built-in
multi-user synchronization for data integrity. It is good at basic statistics and reporting. It also has
well-tuned query processing engines and thus the database can do efficient preprocessing. By using the
query engine, it will deliver the data in the way that we need it. However, it also has some disadvantages:
It only scales to medium-sized data sets and does not handle scale-out well. It only has limited
functionality when it comes to analytics and it has problems handling semistructured data. To summarize, if
the data is already in a relational system, then it can be processed very efficiently and we can assume
that databases will provide the required functionality. If the functionality is not there, most systems allow
developers to extend it via UDFs (user-defined function).

The dominance of relational systems is coming to an end, as NoSQL systems have started challenging
RDBMS. NoSQL systems are well-suited for scale-out. The disadvantages of NoSQL include:
● It is not as mature as relational systems
● There’s no standard interface, which results in overhead for development.
● Being schemaless is a blessing and a curse, which can lead to technical debt.
● NoSQL sacrifices data integrity for performance.
● NoSQL lacks a lot of analytics functionality. But it can be plugged into scalable architectures.

We have several ways to store the data in our cluster:


● Sharding​. Each data record is assigned to exactly one node. Ideally, the data resides close to
the node where the user is accessing it. It is also used for load balancing: we distribute the data
according to workload (in earlier times this was often done manually, but now a lot of NoSQL systems offer auto-sharding). It is nice for efficiency, but not so great for resilience.
● Master-Slave replication.​ One data record is stored on multiple nodes. A master node is
responsible for updates that have to be synchronized with other nodes. It is nice for workloads
with lots of read operations as these operations can be parallelized now. It provides higher
resilience for read operations. The disadvantage is that the master node can become a bottleneck for updates (and can also be a single point of failure). At the same time, there might be some inconsistency.

● Peer-to-peer replication. Peer-to-peer replication solves the problem of master-slave replication by not having a master node: every node is allowed to do updates. We can easily scale the system by adding new nodes. The biggest problem for such a system is inconsistency.
● Sharding with replication.​ We can combine different techniques, for example, master-slave with
sharding. In such systems, there is more than one master but every data record has only one
master. A node can be a master for one record and a slave for another. In this way, we can do
better load balancing. It is also possible to combine peer-to-peer replication with sharding. To summarize, sharding with replication is very flexible, but still has problems with consistency.

When implementing such distributed systems, nodes can join and leave, but we need to ensure
replication. One method is consistent hashing.

Consistent Hashing.​ In consistent hashing, the keys of records, and the IDs of nodes are hashed with
the same hash function. We can interpret the domain of hash values as a ring, and a node is responsible
for a certain range. When inserting a new value, it is clear who is responsible.
When a node joins, it splits the range of an existing node: values smaller than the hash value of the new
node are assigned to the new node. When a node leaves, its range is merged with the range of its
successor.
As hash values are distributed randomly, this can be unbalanced. We can fix this by introducing virtual
nodes and mapping them to physical nodes.
To gain resilience, we replicate every record ​n​ times. The ​n-1​ copies are stored in the ​n-1​ successor
nodes of the responsible node.
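A minimal sketch of consistent hashing with virtual nodes and successor replication, using Python’s hashlib and bisect; the hash function, node names and the number of virtual nodes are illustrative choices, not part of the original notes.

```python
import bisect
import hashlib

def h(key):
    """Hash record keys and node IDs with the same hash function."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, virtual=100):
        # Map several virtual nodes per physical node to balance the ring.
        self.ring = sorted((h(f"{node}#{i}"), node)
                           for node in nodes for i in range(virtual))
        self.hashes = [hv for hv, _ in self.ring]

    def node_for(self, key):
        """The first node clockwise from the key's position is responsible."""
        idx = bisect.bisect(self.hashes, h(key)) % len(self.ring)
        return self.ring[idx][1]

    def replicas_for(self, key, n=3):
        """Store n copies: the responsible node plus its n-1 distinct successor nodes."""
        distinct = len(set(node for _, node in self.ring))
        idx = bisect.bisect(self.hashes, h(key)) % len(self.ring)
        nodes = []
        while len(nodes) < min(n, distinct):
            node = self.ring[idx % len(self.ring)][1]
            if node not in nodes:
                nodes.append(node)
            idx += 1
        return nodes

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.node_for("user:42"), ring.replicas_for("user:42", n=3))
```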

Distributed RDBMS go for consistency, while many NoSQL systems relax consistency. For our application
domain (data analysis), consistency is not that important.

In summary, as soon as we have multiple users and we are modifying parts of the data, flat files are not that great. For databases, there is no silver bullet: choosing between relational and NoSQL usually involves lots of trade-offs depending on the context.

Map Reduce

When actually processing the data, we don’t have a standardized declarative query language such as SQL. Often there are interfaces such as APIs and RESTful (Representational State Transfer via HTTP) interfaces.

Map reduce provides an abstract framework for parallelization. The map phase iterates over a large number of records and extracts something of interest from each. The intermediate results are then shuffled and sorted. The reduce phase aggregates the intermediate results and generates the final output.

When using map reduce, programmers have to specify two functions: map and reduce.
● map: (k1, v1) -> list(k2, v2). This processes key/value pairs and produces a set of intermediate pairs.
● reduce: (k2, list(v2)) -> list(v2). It combines all intermediate values for a particular key, and produces a set of merged output values.
Example:
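The original example (a figure) is not reproduced here; instead, a minimal word-count sketch in Python illustrating the map and reduce signatures above, with a tiny in-memory driver standing in for the framework’s shuffle-and-sort phase. The documents and function names are made up for illustration.

```python
from collections import defaultdict

def map_fn(key, value):
    """map: (k1, v1) -> list(k2, v2); here k1 = document name, v1 = its text."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """reduce: (k2, list(v2)) -> list(v2); here it sums the counts per word."""
    return [sum(values)]

# Tiny in-memory driver: map, shuffle/sort the intermediate results, then reduce.
documents = {"doc1": "big data is big", "doc2": "data is noisy"}
intermediate = defaultdict(list)
for k1, v1 in documents.items():
    for k2, v2 in map_fn(k1, v1):
        intermediate[k2].append(v2)

output = {k2: reduce_fn(k2, values) for k2, values in sorted(intermediate.items())}
print(output)   # {'big': [2], 'data': [2], 'is': [2], 'noisy': [1]}
```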

One actual system is Hadoop, which replaced many proprietary parallel processing systems. It has
several advantages:
● Runs on cheap hardware.
● No licensing fees.
● Scalable.
It is suited for many common data processing tasks, including batch processing and ETL.

The hadoop framework consists of four modules:


● Common Utilities: Java libraries and utilities for the other modules.
● YARN (Yet Another Resource Negotiator): the framework for job scheduling and cluster
management.
● HDFS: Hadoop Distributed File System.
● Map Reduce: YARN-based framework for parallel processing.

HDFS. It is based on the Google File System and has a master-slave architecture. The master is a NameNode that manages the metadata and maps blocks of files to DataNodes. The slaves are one or more DataNodes that store the data and process the read and write operations.

There are two different versions of the processing framework: version 1 (and earlier) and version 2 (and later).

Version 1​. In version 1, Hadoop used ​JobTrackers​. ​1)​ A user specifies a job with the location of the
input and output files, the implementation of the map and reduce functions and various parameters
describing the job. ​2)​ Then Hadoop sends this information to the JobTracker, which configures the slaves
and plans and supervises the individual tasks. ​3)​ Then the TaskTrackers will execute the tasks on the
individual nodes and write the result of the reduce step into the output files. The JobTracker takes care of:
● Job Scheduling (matching tasks with TaskTrackers)
● Task Progress Monitoring. (keeping track of tasks; restarting failed or slow tasks; doing task
bookkeeping).
The disadvantage is that this couples map reduce tasks very tightly to HDFS, and it has scalability and availability issues.

Version 2. In version 2, Hadoop splits the functionality into a ResourceManager (RM) and an ApplicationMaster (AM). The ResourceManager is responsible for allocating resources and scheduling applications. An ApplicationMaster handles the application lifecycle (including fault handling). The split in functionality
adds a layer of abstraction to resource management. There is one RM per cluster and one AM per
application.
● ResourceManager (RM). It runs as a master; additionally, each node runs a NodeManager, which handles local resources on that single node. The RM and NMs run in a master/slave setup. Resources on local nodes are bundled up in containers (a logical representation of resources, including memory and cores).
● ApplicationMaster (AM). ​An AM is started as the first container of an application. It registers with
the RM and can then update the RM about its status (heartbeat), request resources and run the
application. An AM runs user code, which might be malicious. Thus, an AM does not run as a
privileged service.

Example:

● Naive Bayes. Given a training data set, we want to compute P(x|y) where y ∈ {0, 1}. We only have to count the number of occurrences, i.e. x_j = k && y = 0, x_j = k && y = 1, y = 0, and y = 1. Afterwards, we only need to divide up the training data set and assign the parts to mappers. Every mapper counts the number of appearances, and then these counts are passed on to reducers (in this case, every instance gets its own reducer). Every reducer sums up the values that the mappers send to it (see the sketch after this list).
● Logistic Regression. It is a regression with a binary dependent variable.
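A minimal sketch of the Naive Bayes counting step in the same map/reduce style as the word-count example above; the mapper emits one count per (feature index, feature value, class) combination plus one per class, and each reducer sums the counts for its key. The records and helper names are made up for illustration.

```python
from collections import defaultdict

def nb_map(record):
    """Emit ((j, x_j, y), 1) for every feature j, plus (('class', y), 1)."""
    x, y = record
    pairs = [(("class", y), 1)]
    pairs += [((j, xj, y), 1) for j, xj in enumerate(x)]
    return pairs

def nb_reduce(key, counts):
    """Sum up the counts the mappers send for this key."""
    return sum(counts)

# Made-up training records: (feature vector, binary label y).
training = [((1, 0), 0), ((1, 1), 1), ((0, 1), 1)]

intermediate = defaultdict(list)
for record in training:
    for key, count in nb_map(record):
        intermediate[key].append(count)

counts = {key: nb_reduce(key, values) for key, values in intermediate.items()}
# The conditional probabilities P(x_j = k | y) follow by dividing the counts.
print(counts)
```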

Map Reduce is losing popularity for several reasons:


● It is not efficient for iterative algorithms.
● In each step of an iteration, mappers read data from the disk, and reducers write data to
the disk. Then disk access becomes a major bottleneck.
● For each iteration, new mappers and reducers have to be initialized, which can be a lot of
overhead for short-lived tasks.

Spark

Spark has several important features:


● In-memory cluster computing framework. It avoids writing to disk between operations and caches data that is read multiple times in memory (see the sketch after this list).
● Advanced execution engine. Allows optimization of complex sequences of execution steps, and
reduces overhead by reusing application containers.
● Runs on commodity hardware.
● In-built fault tolerance.
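A minimal PySpark sketch, assuming the pyspark package and a local Spark installation, showing the in-memory caching that distinguishes Spark from disk-bound MapReduce; the input path and application name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-example").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///path/to/input.txt")   # placeholder path
words = lines.flatMap(lambda line: line.split())

# Cache the intermediate RDD in memory, so repeated use avoids re-reading from disk.
words.cache()

counts = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))     # repeated actions reuse the cached data
print(words.count())

spark.stop()
```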
