Week-6 - Lecture Notes
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
We will also discuss tools and algorithms that you can use to create machine learning models that learn from data, and to scale those models up to big data problems.
In machine learning, computer programs called models can learn to perform a specific task by analyzing lots of examples for a particular problem. For example, a machine learning model can learn to recognize an image of a cat by being shown lots and lots of images of cats.
Notably, the machine learning model is not given step-by-step instructions on how to recognize the image of a cat. Instead, the model learns what features are important in determining whether a picture contains a cat from the data it has analyzed. Because the model learns to perform this task from data, it is good to know that the amount and quality of data available for building the model are important factors in how well the model learns the task.
Data-driven decisions: These patterns and trends lead to valuable insights into the data. Thus the use of machine learning allows for data-driven decisions to be made for a particular problem.
To summarize, the field of machine learning focuses on the study and construction of computer systems that can learn from data without being explicitly programmed. Machine learning algorithms and techniques are used to build models to discover hidden patterns and trends in the data, allowing for data-driven decisions to be made.
Machine Learning (ML) is an Interdisciplinary Field
In applying machine learning to a problem, domain knowledge is essential to the success of the end results. By domain knowledge we mean an understanding of the application or business domain. Understanding the application, the data related to the application, and how the outcomes will be used are crucial to driving the process of building the machine learning model. So domain knowledge is also an integral part of a machine learning solution.
Credit card fraud detection: Your past purchases are used to determine if the current purchase is a legitimate transaction or a potentially fraudulent one. If the purchase is very different from your past purchases, such as for a big-ticket item in a category that you had never shown an interest in, or when the point-of-sale location is from another country, then it will be flagged as a suspicious activity.
Handwritten digit recognition: Handwritten digits are harder to decipher than typed digits due to the many variations in people's handwriting.
Product recommendations: These related items have been associated with the item you purchased by a machine learning model, and are now being shown to you since you may also be interested in them.
Sentiment analysis
Climate monitoring
Crime pattern detection
Drug effectiveness analysis
Data mining refers to activities related to finding patterns in databases and data warehouses. There are some practical data management aspects to data mining related to accessing data from databases. But the process of finding patterns in data is similar, and can use the same algorithms and techniques as machine learning.
Predictive analytics refers to analyzing data in order to predict future outcomes. This term is usually used in the business context to describe activities such as sales forecasting or predicting the purchasing behavior of a customer.
Data science is a new term that is used to describe processing and analyzing data to extract meaning. Again, machine learning techniques can also be used here. Because the term data science became popular at the same time that big data began appearing, data science usually refers to extracting meaning from big data, and so includes approaches for collecting, storing and managing big data.
Machine Learning Models
Learn from data
Allow for data-driven decisions
Used in many different applications
Classification
Regression
Cluster Analysis
Association Analysis
Predicting whether it will rain the next day.
Determining if a loan application is high-risk, medium-risk or low-risk.
Identifying the sentiment of a tweet or review as being positive, negative, or neutral.
Predicting the price of a stock is a regression task instead of a classification task. If you were to predict whether the stock price will rise or fall, then that would be a classification problem. But if you're predicting the actual price of the stock, then that is a regression problem.
That is the main difference between classification and regression: in classification, you're predicting a category, and in regression, you're predicting a numeric value.
Predicting a score on a test.
Determining the likelihood of how effective a drug will be for a particular patient.
Predicting the amount of rain for a region.
For example, it would be very beneficial to segment your customers into seniors, adults and teenagers. These groups have different likes and dislikes and have different purchasing behaviors. By segmenting your customers into different groups, you can more effectively provide marketing ads targeted for each group's particular interests. Note that cluster analysis is also referred to as clustering.
Categorizing different types of tissues from medical images.
Determining different groups of weather patterns, such as snowy, dry, and monsoon.
Discovering hot spots for different types of crime from police reports.
Association analysis finds rules that capture associations between items or events, for example in purchasing behavior. Association analysis can reveal that banking customers who have CDs, or Certificates of Deposit, also tend to be interested in other investment vehicles such as money market accounts.
This information can be used for cross-selling: if you advertise money market accounts to your customers with CDs, they are likely to open such an account.
Finding items that are often purchased together, such as garden hose and potting soil, allows a store to offer sales on these related items at the same time to drive sales of both items.
Referring back to our example of predicting a weather category of sunny, windy, rainy or cloudy, every sample in the data set is labeled as being one of these four categories. So the data is labeled, and predicting the weather categories is a supervised task. In general, classification and regression are supervised approaches.
Remember the cluster analysis example of segmenting customers into different groups. The samples in your data are not labeled with the correct group. Instead, the segmentation is performed using a clustering technique to group items based on characteristics that they have in common.
Thus, the data is unlabeled, and the task of grouping customers into different segments is an unsupervised one. In general, cluster analysis and association analysis are unsupervised approaches.
Classification of Machine Learning Techniques
The machine learning process consists of five steps: Acquire, Prepare, Analyze, Report, and Act. The process must be carried out with a clear purpose in mind. That is, the problem or opportunity that is being addressed must be defined with clearly stated goals and objectives.
For example, the purpose of a project may be to study customer purchasing behavior to come up with a more effective marketing strategy in order to increase sales revenue. The purpose behind the project will drive the machine learning process.
The first part of the prepare step is to explore the data to understand the nature of the data that we have to work with. Things we want to understand about the data are its characteristics, format, and quality. A good understanding of the data leads to a more informed analysis and a more successful outcome.
Once we know more about the data through exploratory analysis, the next part is pre-processing of the data for analysis. This includes cleaning data, selecting the variables to use, and transforming data to make the data more suitable for analysis in the next step.
Determining actions from insights gained from analysis is the main focus of the act step.
For example, during the prepare step, we may find some data quality issues that may require us to go back to the acquire step to address some issues with data collection or to get additional data that we didn't include in the first go-around.
Step-1: Acquire
The first step is to identify all related data and data sources, such as files, databases, the internet, and mobile devices. So remember to include all data related to the problem you are addressing.
After you've identified your data and data sources, the next step is to collect the data and integrate data from the different sources. This may require conversion, as data can come in different formats. And it may also require aligning the data, as data from different sources may have different time or spatial resolutions. Once you've collected and integrated your data, you now have a coherent data set for your analysis.
Step-2-A: Explore
The prepare step has two parts: explore data and preprocess data.
Correlations provide information about the relationship between variables in your data. Trends in your data will reveal if a variable is moving in a certain direction, such as transaction volume increasing throughout the year.
Outliers indicate potential problems with the data, or may indicate an interesting data point that needs further examination. Without this data exploration activity, you will not be able to use your data effectively.
Describe your Data
One way to explore your data is to calculate summary statistics to numerically describe the data.
Some basic summary statistics that you should compute for your data set are mean, median, mode, range and standard deviation. Mean and median are measures of the location of a set of values. Mode is the value that occurs most frequently in your data set, and range and standard deviation are measures of spread in your data.
Looking at these measures will give you an idea of the nature of your data. They can tell you if there's something wrong with your data. For example, if the range of the values for age in your data includes negative numbers, or a number much greater than a hundred, there's something suspicious in the data that needs to be examined.
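As a minimal sketch (assuming the data has been loaded with pandas, and using a hypothetical numeric column named age), these statistics can be computed as:

    import pandas as pd

    df = pd.read_csv("data.csv")                # hypothetical input file
    print(df["age"].mean())                     # mean: measure of location
    print(df["age"].median())                   # median: measure of location
    print(df["age"].mode().iloc[0])             # mode: most frequent value
    print(df["age"].max() - df["age"].min())    # range: measure of spread
    print(df["age"].std())                      # standard deviation: measure of spread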
Visualize Your Data
Visualization techniques also provide quick and effective ways to explore your data. Some examples: a histogram, such as the plot shown here, shows the distribution of the data and can reveal skewness, unusual dispersion, and outliers.
A line plot, like the one in the lower left, can be used to look at trends in the data, such as the change in the price of a stock. A heat map can give you an idea of where the hot spots are.
A scatter plot effectively shows correlation between two variables. Overall, there are many types of plots to visualize data. They are very useful in helping you understand the data you have.
Step-2-B: Pre-Process
The second part of the prepare step is preprocess. So, after we've explored the data, we need to preprocess the data to prepare it for analysis.
The goal here is to create the data that will be used for analysis.
The main activities in this part are to clean the data, select the appropriate variables to use, and transform the data as needed.
Real-world data can have many quality issues: missing values, such as income not provided in a survey; duplicate data, such as two different records for the same customer with different addresses; inconsistent or invalid data, such as a six-digit zip code; noise in the collection of data that distorts the true values; and outliers, such as a number much larger than 100 for someone's age. It is essential to detect and address these issues that can negatively affect the quality of the data.
Feature selection can remove redundant or irrelevant features, for example when two features are very correlated. In that case, one of these features can be removed without negatively affecting the analysis results.
For example, the purchase price of a product and the amount of sales tax paid are very likely to be correlated. Eliminating the sales tax amount, then, will be beneficial. Removing redundant or irrelevant features will make the subsequent analysis simpler.
You may also want to combine features or create new ones. For example, adding the applicant's education level as a feature to a loan approval application would make sense. There are also algorithms to automatically determine the most relevant features based on various mathematical properties.
Feature Transformation
Feature transformation maps the data from one format to another. Various transformation operations exist. For example, scaling maps the data values to a specified range to prevent any one feature from dominating the analysis results.
Filtering or aggregation can be used to reduce noise and variability in the data.
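A minimal sketch of one such transformation, min-max scaling, in plain Python/NumPy (the function name and range defaults are illustrative, not from the lecture):

    import numpy as np

    def min_max_scale(x, new_min=0.0, new_max=1.0):
        # map the values of x onto [new_min, new_max] so no feature dominates
        x = np.asarray(x, dtype=float)
        return new_min + (x - x.min()) * (new_max - new_min) / (x.max() - x.min())

    print(min_max_scale([2, 5, 10]))    # -> [0.    0.375 1.   ]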
Step-3: Analyze
In the analyze step, you select an analytical technique, build a model, and evaluate the results that you get from the model.
The analyze step starts with determining the type of problem you have. You begin by selecting appropriate machine learning techniques to analyze the data.
Then you construct the model using the data that you've prepared. Once the model is built, you will want to apply it to new data samples to evaluate how well the model performs. Thus data analysis involves selecting the appropriate technique for your problem, building the model, then evaluating the results.
Step-4: Report
The next step in the machine learning process is reporting results from your analysis. In reporting your results, it is important to communicate your insights and make a case for what actions should follow.
In reporting your results, you will want to think about what to present, as well as how to present it. In deciding what to present, you should consider what the main results are, what insights were gained from your analysis, and what added value these insights bring to the application.
Keep in mind that even negative results are valuable lessons learned, and suggest further avenues for additional analysis. Remember that all findings must be presented so that informed decisions can be made for next steps.
In deciding how to present, remember that visualization is an important tool in presenting your results.
Plots and summary statistics discussed in the explore step can be used effectively here as well. You should also have tables with details from your analysis as backup, if someone wants to take a deeper dive into the results.
In summary, you want to report your findings by presenting your results and the value added, with graphs using visualization tools.
Step-5: Act
The final step in the machine learning process is to determine what action should be taken based on the insights gained.
What action should be taken based on the results of your analysis? Should you market certain products to a specific customer segment to increase sales? What inefficiencies can be removed from your process? What incentives would be effective in attracting new customers?
Once a specific action has been determined, the next step is to implement the action. Things to consider here include: how can the action be added to your application? How will end users be affected?
If the predicted class label matches the true class label for a sample, that is a success. If the predicted class label is different from the true class label, then that is an error.
The error rate, then, is the percentage of errors made over the entire data set. That is, it is the number of errors divided by the total number of samples in the data set.
Error rate is also known as misclassification rate, or simply error.
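A minimal sketch of this computation in Python (hypothetical function and variable names):

    def error_rate(true_labels, predicted_labels):
        # number of errors divided by the total number of samples
        errors = sum(t != p for t, p in zip(true_labels, predicted_labels))
        return errors / len(true_labels)

    print(error_rate([1, 0, 1, 1], [1, 1, 1, 0]))   # 2 errors / 4 samples = 0.5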
Error rate, or simply error, on the training data is referred to as training error, and the error on test data is referred to as test error. The error on the test data is an indication of how well the classifier will perform on new data.
You want your model to generalize well to new data. If your model generalizes well, then it will perform well on data sets that are similar in structure to the training data, but do not contain exactly the same samples as the training set.
Since the test error indicates how well your model generalizes to new data, the test error is also called generalization error.
A model overfits when the training error is low but the generalization error is high. This means that the model has learned to model the noise in the training data, instead of learning the underlying structure of the data.
The plot on the right, however, shows that the model has learned to model the noise in the data set. The model tries to capture every sample point, instead of the general trend of the samples together. The training error and the generalization error are plotted together during model training.
A classifier that performs well on just the training data set will not be very useful. So it is essential that the goal of good generalization performance is kept in mind when building a model.
A model can also underfit, when it is too simple to capture the underlying structure of the data. Both overfitting and underfitting are undesirable, since both mean that the model will not generalize well to new data. Overfitting generally occurs when a model is too complex, that is, it has too many parameters relative to the number of training samples. So to avoid overfitting, the model needs to be kept as simple as possible, and yet still solve the input/output mapping for the given data set.
In summary, you want to avoid overfitting so that your model will generalize well to new data.
This is because the tree has partitioned the input space according to the noise in the data instead of the true structure of the data. In other words, it has overfit.
With pre-pruning, the tree stops growing before the full tree is built; a stopping condition must be used. For example, a node stops expanding if the number of samples in the node is less than some minimum threshold.
Another example is to stop expanding a node if the improvement in the impurity measure falls below a certain threshold.
With post-pruning, the tree is grown to its maximum size and then trimmed back. That is, the tree is trimmed starting with the leaf nodes. The pruning is done by replacing a subtree with a leaf node if this improves the generalization error, or if there is no change to the generalization error with this replacement.
Some subtrees capture noise rather than structure, so those nodes should be removed. In practice, post-pruning tends to give better results. This is because pruning decisions are based on information from the full tree. Pre-pruning, on the other hand, may stop the tree-growing process prematurely. However, post-pruning is more computationally expensive, since the tree has to be expanded to its full size.
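As an illustrative sketch (using scikit-learn, which is not the lecture's own tooling): its DecisionTreeClassifier exposes pre-pruning through stopping thresholds and post-pruning through cost-complexity pruning; the threshold values below are arbitrary:

    from sklearn.tree import DecisionTreeClassifier

    # Pre-pruning: stop a node from expanding early via thresholds
    pre_pruned = DecisionTreeClassifier(
        min_samples_split=20,          # do not split nodes with fewer samples
        min_impurity_decrease=0.01)    # stop if impurity improvement is too small

    # Post-pruning: grow the full tree, then trim it back
    # (scikit-learn implements this as cost-complexity pruning)
    post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)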
Recall also that overfitting generally occurs when a model is too complex. So to have a model with good generalization performance, model training has to stop before the model gets too complex.
Recall that a data set can be divided into a training set and a test set. The training set is used to build a model, and the test set is used to see how the model performs on new data.
To determine when to stop training, the data can further be divided to create a validation set. The training set is used to train the model as before, and the validation set is used to determine when to stop training the model to avoid overfitting, in order to get the best generalization performance.
As model complexity increases, the training error decreases. On the other hand, the validation error initially decreases but then starts to increase. When the validation error increases, this indicates that the model is overfitting, resulting in decreased generalization performance.
This is illustrated here for a decision tree classifier, but the same method can be applied to any type of machine learning model.
Holdout method
Random subsampling
K-fold cross-validation
Leave-one-out cross-validation
One drawback of the holdout method is that, since some samples are held out, the training set now has less data than it originally started out with.
Secondly, if the training and holdout sets do not have the same data distributions, then the results will be misleading; for example, if the training data has many more samples of one class and the holdout dataset has many more samples of another class.
With random subsampling, the holdout method is repeated several times with different random partitions, and the validation errors are averaged over all repetitions.
In k-fold cross-validation, the data is divided into k partitions, and each partition is used for validation exactly once. This is illustrated in this figure. In the first iteration, the first partition is used for validation. In the second iteration, the second partition is used for validation, and so on.
The validation errors from the iterations are averaged, and the best model is selected. The process we just described is referred to as k-fold cross-validation. This is a very commonly used approach to model selection in practice.
This approach gives you a more structured way to divide available data up between training and validation datasets, and provides a way to overcome the variability in performance that you can get when using a single partitioning of the data.
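A minimal sketch of 5-fold cross-validation (using scikit-learn and randomly generated stand-in data; the model choice is arbitrary):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X = np.random.rand(100, 4)                 # hypothetical features
    y = np.random.randint(0, 2, 100)           # hypothetical labels

    errors = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True).split(X):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        errors.append(1 - model.score(X[val_idx], y[val_idx]))  # validation error
    print(sum(errors) / len(errors))           # average error over the 5 folds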
In leave-one-out cross-validation, for each iteration the validation set has exactly one sample. So the model is trained using N-1 samples and is validated on the remaining sample.
The rest of the process works the same way as regular k-fold cross-validation.
Note that cross-validation is often abbreviated CV, and leave-one-out cross-validation is abbreviated LOOCV.
With a validation set, the data is divided into three parts, so that we can still get an unbiased estimate of the error on the test set.
The training dataset is used to train the model, that is, to adjust the parameters of the model to learn the input-to-output mapping.
The validation dataset is used to determine when training should stop in order to avoid overfitting.
The test dataset is used to evaluate the performance of the model on new data.
The test dataset must always remain independent from model training and remain untouched until the very end, when all training has been completed. Note that in sampling the original dataset to create the training, validation, and test sets, all datasets must contain the same distribution of the target classes.
For example, if in the original dataset 70% of the samples belong to one class and 30% to the other class, then this same distribution should approximately be present in each of the training, validation, and test sets. Otherwise, analysis results will be misleading.
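A minimal sketch of such a stratified three-way split (using scikit-learn's train_test_split with stand-in data; the 60/20/20 proportions are illustrative):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(100, 4)                 # hypothetical features
    y = np.random.randint(0, 2, 100)           # hypothetical labels

    # stratify keeps the class distribution approximately the same in every set
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest)   # 60/20/20 overall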
We learned how a validation set can be used to avoid overfitting and, in the process, provide an estimate of generalization performance.
And we covered different ways to create and use a validation set, such as k-fold cross-validation.
Performance
Recall is considered a measure of completeness, because it calculates the percentage of positive samples that the model correctly identified.
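A minimal sketch computing recall, together with its companion measure precision, in plain Python (hypothetical function name; the positive label is assumed to be 1):

    def precision_recall(true, pred, positive=1):
        tp = sum(t == positive and p == positive for t, p in zip(true, pred))
        fp = sum(t != positive and p == positive for t, p in zip(true, pred))
        fn = sum(t == positive and p != positive for t, p in zip(true, pred))
        precision = tp / (tp + fp)   # exactness: correct fraction of predicted positives
        recall = tp / (tp + fn)      # completeness: fraction of true positives found
        return precision, recall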
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
3. Create a categorical variable for low-humidity days.
4. Aggregate features used to make predictions.
5. Split the data into training and test sets.
6. Create and train the decision tree.
7. Save the predictions to a CSV file.
A Binarizer is created with the input column set to the humidity measurement's column name and outputCol = "label". A new DataFrame is created with this categorical variable: binarizedDF = binarizer.transform(df).
The features are aggregated into a single vector column by running assembled = assembler.transform(binarizedDF).
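Putting the steps together, a hedged end-to-end sketch in PySpark might look as follows; the file name, column names, and the humidity threshold are hypothetical stand-ins, not values from the lecture:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Binarizer, VectorAssembler
    from pyspark.ml.classification import DecisionTreeClassifier

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("weather.csv", header=True, inferSchema=True)  # hypothetical file

    # Step 3: categorical label for low-humidity days (threshold is hypothetical)
    binarizer = Binarizer(threshold=25.0, inputCol="relative_humidity",
                          outputCol="label")
    binarizedDF = binarizer.transform(df)

    # Step 4: aggregate feature columns into a single vector column
    assembler = VectorAssembler(inputCols=["air_temp", "air_pressure"],  # hypothetical
                                outputCol="features")
    assembled = assembler.transform(binarizedDF)

    # Steps 5-6: split the data, then create and train the decision tree
    train, test = assembled.randomSplit([0.8, 0.2])
    dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
    model = dt.fit(train)

    # Step 7: save the predictions to CSV
    model.transform(test).select("prediction", "label") \
         .write.csv("predictions", header=True)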
We have also discussed tools and algorithms that you can use to create machine learning models that learn from data, and to scale those models up to big data problems.
Machine Learning Algorithm K-means Using MapReduce for Big Data Analytics
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss the machine learning classification algorithm k-means using MapReduce for big data analytics.
The goal of cluster analysis is to segment a data set into groups, or clusters. By segmenting the given data into clusters, we can analyze each cluster more carefully. Note that cluster analysis is also referred to as clustering.
Segment customers into groups: For example, a bookstore can segment its customers by the types of books they buy. This way, you can provide more targeted suggestions to each different group.
Characterize different weather patterns for a region: Some other examples of cluster analysis are characterizing different weather patterns for a region.
Group news articles into topics: Grouping the latest news articles into topics to identify the trending topics of the day.
Discover crime hot spots: Discovering hot spots for different types of crime from police reports in order to provide sufficient police presence for problem areas.
The goal of cluster analysis is to segment data so that differences between samples in the same cluster are minimized, as shown by the yellow arrow, and differences between samples of different clusters are maximized, as shown by the orange arrow. Visually, we can think of this as getting samples in each cluster to be as close together as possible, and the samples from different clusters to be as far apart as possible.
Manhattan distance measures the distance between two points along a strictly horizontal and vertical path, as shown in the right plot. To go from point A to point B, you can only step along either the x-axis or the y-axis in a two-dimensional case. So the path to calculate the Manhattan distance consists of segments along the axes, instead of along a diagonal path as with Euclidean distance.
Cosine similarity measures the cosine of the angle between points A and B, as shown in the bottom plot.
Since distance measures such as Euclidean distance are often used to measure similarity between samples in clustering algorithms, note that it may be necessary to normalize the input variables so that no one value dominates the similarity calculation.
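A minimal sketch of the three measures in NumPy (the two example points are arbitrary):

    import numpy as np

    a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
    euclidean = np.sqrt(np.sum((a - b) ** 2))    # straight-line distance: 5.0
    manhattan = np.sum(np.abs(a - b))            # axis-aligned path: 7.0
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based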
Cosine similarity is another natural inner-product measure:
similarity(A, B) = (A · B) / (||A|| ||B||)
It is not a proper distance metric, but it is efficient to compute for sparse vectors.
Normalize
In general, -1 < similarity < 1.
For positive features (like tf-idf): 0 < similarity < 1.
There is no one 'correct' clustering, and clusters don't come with labels. You may end up with five different clusters at the end of a cluster analysis process, but you don't know what each cluster represents. Only by analyzing the samples in each cluster can you come up with reasonable labels for your clusters. Given all this, it is important to keep in mind that interpretation and analysis of the clusters are required to make sense of and make use of the results of cluster analysis.
By segmenting customers into different types of readers, the resulting insights can be used to provide more effective marketing to the different customer groups based on their preferences. For example, analyzing each segment separately can provide valuable insights into each group's likes, dislikes and purchasing behavior, just like we see science fiction, non-fiction and children's books preferences here.
Cluster results can be used to classify new samples: assign a new sample to the closest cluster. The label of that cluster, manually determined through analysis, is then used to classify the new sample. In our book buyers' preferences example, a new customer can be classified as being either a science fiction, non-fiction or children's books customer, depending on which cluster the new customer is most similar to.
The label of the closest cluster would then be the target class for each sample. This process can be used to provide much-needed labeled data for classification.
A sample that is very far from all clusters can be flagged as an anomaly. However, these anomalies require further analysis. Depending on the application, these anomalies can be considered noise and should be removed from the data set. An example of this would be a sample with a value of 150 for age.
Analyzing clusters often leads to useful insights about data.
Clusters require analysis and interpretation.
Repeat
    Assign each sample to the closest centroid
    Calculate the mean of each cluster to determine the new centroid
Until some stopping criterion is reached
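A minimal NumPy sketch of this loop (illustrative only; it uses random initial centroids, assumes no cluster becomes empty, and stops when centroids no longer change):

    import numpy as np

    def kmeans(X, k, max_iter=100):
        # pick k distinct samples as the initial centroids
        centroids = X[np.random.choice(len(X), k, replace=False)]
        for _ in range(max_iter):
            # assign each sample to the closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # calculate the mean of each cluster to determine the new centroid
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):   # stop: centroids unchanged
                break
            centroids = new_centroids
        return centroids, labels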
[Figure: k-means iterations: original samples; initial centroids; assign samples; re-calculate centroids; assign samples; re-calculate centroids]
The final clusters are sensitive to the choice of initial centroids.
Solution: Run k-means multiple times with different random initial centroids, and choose the best results.
Error is the distance between a sample and its centroid; squared error = error². The within-cluster sum of squared error (WSSE) adds up these squared errors over all samples.
Caveats:
A smaller WSSE does not mean that cluster set 1 is more 'correct' than cluster set 2.
Larger values for k will always reduce WSSE.
• Visualization: Visualize the dataset to see if there are natural groupings of the samples. Scatter plots and the use of dimensionality reduction are useful here, to visualize the data.
• Application-Dependent: A good value for k can also be determined by the requirements of the application.
• Data-Driven: There are also data-driven methods for determining the value of k. These methods calculate metrics for different values of k to determine the best selection of k. One such method is the elbow method.
Elbow Method for Choosing k
The "elbow" suggests the value for k should be 3.
The elbow method for determining the value of k is shown on this plot. As we saw in the previous slide, WSSE, or within-cluster sum of squared error, measures how much data samples deviate from their respective centroids in a set of clustering results. If we plot WSSE for different values of k, we can see how this error measure changes as the value of k changes, as seen in the plot. The bend in this error curve indicates a drop in gain by adding more clusters. So this elbow in the curve provides a suggestion for a good value of k.
Note that the elbow cannot always be unambiguously determined, especially for complex data. In many cases, the error curve will not have a clear suggestion for one value, but for multiple values. This can be used as a guideline for the range of values to try for k.
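A minimal sketch of generating the numbers behind such an elbow plot (using scikit-learn, whose inertia_ attribute is the WSSE; the data is a random stand-in):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 2)                  # hypothetical data
    for k in range(1, 10):
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        print(k, km.inertia_)                   # inertia_ is the WSSE for this k
    # plot k against WSSE and look for the bend (the "elbow") in the curve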
Stopping Criteria
When to stop iterating?
• No changes to centroids: How do you know when to stop iterating when using k-means? One obvious stopping criterion is when there are no changes to the centroids. This means that no samples would change cluster assignments, and recalculating the centroids will not result in any changes. So additional iterations will not bring about any more changes to the cluster results.
• Number of samples changing clusters is below threshold: The stopping criterion can be relaxed to a second stopping criterion: when the number of samples changing clusters is below a certain threshold, say 1% for example. At this point, the clusters are changing by only a few samples, resulting in only minimal changes to the final cluster results. So the algorithm can be stopped here.
Clusters can be interpreted by examining their centroids. Comparing the values of the variables between the centroids will reveal how different or alike clusters are, and provide insights into what each cluster represents. For example, if the value for age is different for different customer clusters, this indicates that the clusters are encoding different customer segments by age, among other variables.
Simple to understand and implement, and is efficient.
Value of k must be specified.
Final clusters are sensitive to initial centroids.
Recenter: Revise cluster centers as the mean of assigned observations.
Reduce: Average over all points in cluster j (those with z_i = j).
reduce(j, x_in_cluster_j : [x1, x3, …])
    sum = 0
    count = 0
    for x in x_in_cluster_j
        sum += x
        count += 1
    emit(j, sum/count)
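For completeness, a minimal Python sketch of the matching map step (hypothetical names; emitting is modeled by returning the key-value pair, and squared Euclidean distance is assumed):

    def map_step(x, centers):
        # classify: find the index of the nearest center to the data point x
        distances = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
        j = distances.index(min(distances))
        return j, x     # key = cluster index, value = the data point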
Map: Find the nearest center; the key is the center, the value is the movie.
Reduce: Average the ratings.
Reduce: Recompute means; data-parallel over centers.
The mapper needs to get each data point and all centers, which is a lot of data! A better implementation: each mapper gets many data points.
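A hedged sketch of this better implementation (hypothetical names; centers is assumed to be a NumPy float array): each mapper processes a block of points and emits per-cluster partial sums and counts, so the reducer combines partials instead of raw points:

    import numpy as np

    def mapper(points, centers):
        # each mapper receives many data points and emits one partial
        # (sum, count) pair per cluster instead of one record per point
        sums = np.zeros_like(centers)
        counts = np.zeros(len(centers), dtype=int)
        for x in points:
            j = int(np.argmin(np.linalg.norm(centers - x, axis=1)))  # nearest center
            sums[j] += x
            counts[j] += 1
        return [(j, (sums[j], counts[j]))
                for j in range(len(centers)) if counts[j] > 0]

    def reducer(j, partials):
        # combine partial sums and counts into the new center for cluster j
        total = sum(s for s, _ in partials)
        n = sum(c for _, c in partials)
        return j, total / n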
In this lecture, we have discussed the machine learning classification algorithm k-means using MapReduce for big data analytics.