BDA Unit-5
Supervised learning is a type of machine learning in which machines are trained using
well-labelled training data and, on the basis of that data, predict the output. Labelled
data means that the input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor
that teaches the machines to predict the output correctly. It applies the same concept as
a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to
the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the output variable(y).
In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, spam filtering, etc.
The working of supervised learning can be easily understood through the following example and
diagram:
Suppose we have a dataset of different types of shapes, which includes squares, rectangles,
triangles, and other polygons. The first step is to train the model on each shape.
o If the given shape has four sides, and all the sides are equal, then it will be
labelled as a square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
After training, we test our model using the test set, and the task of the model is to
identify the shape.
The machine has already been trained on all types of shapes, so when it finds a new shape, it
classifies the shape on the basis of its number of sides and predicts the output.
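The train-then-test flow described above can be sketched in a few lines of Python. This is only an illustration: the feature encoding (number of sides, whether all sides are equal), the rectangle example, and the use of a 1-nearest-neighbour rule are assumptions made for the sketch, not something the notes prescribe.

```python
# Labelled training data: (number_of_sides, all_sides_equal) -> shape label.
# The encoding and the rectangle row are assumptions made for this sketch.
TRAINING_DATA = [
    ((4, 1), "square"),
    ((4, 0), "rectangle"),
    ((3, 1), "triangle"),
    ((3, 0), "triangle"),
    ((6, 1), "hexagon"),
]


def predict_shape(features, training_data=TRAINING_DATA):
    """Return the label of the closest labelled example (1-nearest neighbour)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, nearest_label = min(
        training_data,
        key=lambda example: squared_distance(example[0], features),
    )
    return nearest_label


# Test set: shapes the model is asked to identify after training.
test_set = [(4, 1), (3, 0), (6, 1)]
predictions = [predict_shape(shape) for shape in test_set]
print(predictions)
```

The model never sees the correct labels of the test set; it predicts them from the labelled examples it was given.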
1. Regression
Regression algorithms are used when there is a relationship between the input variable and
the output variable and the output is continuous. They are used to predict continuous
variables, as in weather forecasting, market trends, etc. Below are some popular
regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
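As a small illustration of the first algorithm in the list, the sketch below fits a straight line y = slope * x + intercept by ordinary least squares. The data values and the function name `fit_line` are hypothetical and chosen so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# Predict a continuous output for an unseen input.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```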
2. Classification
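Classification algorithms are used when the output variable is categorical, which means
it takes one of a fixed set of classes, such as Yes-No, Male-Female, True-False, etc.
Spam filtering is a typical example. Below are some popular classification algorithms
which come under supervised learning: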
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines
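To make the idea of a categorical output concrete, here is a minimal sketch of a decision stump, that is, a decision tree with a single split, applied to a toy spam-filtering task. The feature (count of suspicious words), the data, and the assumption that larger counts indicate spam are all hypothetical.

```python
def fit_stump(values, labels):
    """Pick the threshold on one feature that misclassifies the fewest examples.

    Assumes examples above the threshold belong to class 1 (spam).
    """
    candidates = sorted(set(values))
    best_threshold, best_errors = None, len(values) + 1
    # Try a threshold halfway between each pair of neighbouring values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def classify(threshold, value):
    """Return a categorical output for one email."""
    return "spam" if value > threshold else "not spam"


# Hypothetical training data: suspicious-word counts and spam labels (1 = spam).
suspicious_words = [0, 1, 2, 5, 6, 7]
is_spam = [0, 0, 0, 1, 1, 1]

threshold = fit_stump(suspicious_words, is_spam)
print(threshold, classify(threshold, 4), classify(threshold, 1))
```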
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to the way a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes
it all the more important.
o In the real world, we do not always have input data with the corresponding output, so
to solve such cases, we need unsupervised learning.
Once a suitable algorithm is applied, it divides the data objects into groups
according to the similarities and differences between the objects.
o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities
with the objects of another group. Cluster analysis finds the commonalities
between the data objects and categorizes them according to the presence or absence
of those commonalities.
o Association: Association rule learning is an unsupervised learning method which is
used for finding relationships between variables in a large database. It
determines the sets of items that occur together in the dataset. Association rules
make marketing strategy more effective. For example, people who buy item X
(say, bread) also tend to purchase item Y (butter or jam). A typical
example of association rule learning is Market Basket Analysis.
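The bread-and-butter example can be quantified with the two standard measures of an association rule, support and confidence. The sketch below computes both for the rule "bread -> butter"; the basket contents are made up for illustration.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Bread and butter appear together in 3 of 5 baskets;
# 3 of the 4 baskets with bread also contain butter.
rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(rule_support, rule_confidence)
```

Algorithms such as Apriori search for all rules whose support and confidence exceed chosen thresholds rather than checking one rule at a time.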
o K-means clustering
o KNN (k-nearest neighbors; usually a supervised method, though nearest-neighbor search also serves unsupervised tasks)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
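As an illustration of the first algorithm in this list, the following is a minimal k-means sketch. It assumes 2-D points, a fixed number of iterations, and a deterministic initialisation that takes the first k points as starting centroids; real implementations usually initialise randomly and stop on convergence.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=10):
    """Group unlabelled points into k clusters."""
    centroids = list(points[:k])  # assumption: first k points as initial centroids
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabelled data with two visible groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(POINTS, k=2)
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the grouping comes only from the distances between points.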
Supervised learning | Unsupervised learning
A supervised learning model takes direct feedback to check whether it is predicting the correct output. | An unsupervised learning model does not take any feedback.
A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs. | Unsupervised learning can be used for cases where we have only input data and no corresponding output data.
A supervised learning model generally produces more accurate results. | An unsupervised learning model may give less accurate results than supervised learning.
Supervised learning is not close to true artificial intelligence, because the model must first be trained on labelled data before it can predict the correct output. | Unsupervised learning is closer to true artificial intelligence, as it learns in the way a child learns daily routines from experience.
Collaborative Filtering
Step 1: Data Collection
Gather user-item interaction data. This can be explicit, like ratings and reviews, or
implicit, like clicks and view times.
Step 3: Similarity Calculation
Compute the similarity between users or between items using a measure such as:
Cosine Similarity: Measures the cosine of the angle between two vectors.
Pearson Correlation: Measures the linear correlation between two sets of
data.
Euclidean Distance: Measures the straight-line distance between two points
in Euclidean space.
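The three measures can be written directly from their definitions. The sketch below uses plain Python lists as rating vectors and assumes the vectors have equal length and are not constant or all-zero, since the cosine and Pearson formulas would otherwise divide by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Linear correlation: the cosine similarity of the mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centred_a = [x - mean_a for x in a]
    centred_b = [y - mean_b for y in b]
    return cosine_similarity(centred_a, centred_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical rating vectors: the second is exactly twice the first.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]
print(cosine_similarity(u, v), pearson_correlation(u, v), euclidean_distance(u, v))
```

Note that cosine and Pearson are similarities (higher means more alike), whereas Euclidean distance is a distance (lower means more alike).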
Step 4: Prediction
Use the similarity scores to predict the ratings or preferences for a user-item pair.
This can be done using techniques like weighted average or k-nearest neighbors.
Step 5: Recommendation
Generate a list of recommendations for each user based on the predicted ratings.
This can be done by selecting the top-N items with the highest predicted ratings.
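Prediction by weighted average and top-N selection can be sketched as follows. The user names, ratings, and similarity scores are hypothetical, and the similarities are assumed to have been computed already with one of the measures listed earlier.

```python
def predict_rating(target_user, item, ratings, similarities):
    """Similarity-weighted average of other users' ratings for one item."""
    numerator = 0.0
    denominator = 0.0
    for other_user, other_ratings in ratings.items():
        if other_user == target_user or item not in other_ratings:
            continue
        weight = similarities[(target_user, other_user)]
        numerator += weight * other_ratings[item]
        denominator += abs(weight)
    return numerator / denominator if denominator else None


def top_n(target_user, ratings, similarities, n=2):
    """Return the n unseen items with the highest predicted ratings."""
    seen = set(ratings[target_user])
    all_items = {item for user_ratings in ratings.values() for item in user_ratings}
    scored = []
    for item in all_items - seen:
        score = predict_rating(target_user, item, ratings, similarities)
        if score is not None:
            scored.append((score, item))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:n]]


# Hypothetical explicit ratings and precomputed user-user similarities.
ratings = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 4, "B": 2, "C": 5, "D": 1},
    "carol": {"A": 1, "C": 2, "D": 5},
}
similarities = {("alice", "bob"): 0.9, ("alice", "carol"): 0.1}

predicted_c = predict_rating("alice", "C", ratings, similarities)
recommendations = top_n("alice", ratings, similarities, n=2)
print(predicted_c, recommendations)
```

Because bob is much more similar to alice than carol is, bob's high rating for item C dominates the prediction.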
Step 6: Evaluation
Evaluate the performance of the collaborative filtering algorithm using metrics like
precision, recall, and F1-score. This helps in fine-tuning the model and improving its
accuracy.
Here, false negatives are products that the system fails to recommend, although the
consumer would like them.
False positives are products that are recommended, but which the consumer does
not like.
False positives are less desirable because they can annoy or anger consumers.
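The evaluation metrics follow directly from these counts. In the sketch below, `recommended` is the set of items the system suggested and `liked` is the set the consumer actually likes; both sets are invented for the example.

```python
def precision_recall_f1(recommended, liked):
    """Compute precision, recall, and F1-score for one user's recommendations."""
    recommended, liked = set(recommended), set(liked)
    true_positives = len(recommended & liked)    # recommended and liked
    false_positives = len(recommended - liked)   # recommended but not liked
    false_negatives = len(liked - recommended)   # liked but not recommended

    predicted = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: 2 true positives, 2 false positives, 1 false negative.
precision, recall, f1 = precision_recall_f1({"A", "B", "C", "D"}, {"A", "B", "E"})
print(precision, recall, f1)
```

Precision falls as false positives increase, which is why a system that wants to avoid annoying consumers would watch precision closely.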
Again, collaborative filtering is the generic term for algorithms that use explicit
and implicit ratings and compute similarities between ratings. Explicit ratings are given
directly by users, such as ratings and reviews; implicit ratings are inferred from
behavior, such as clicks and view times.
In practice, collaborative filtering uses the user-item rating matrix that we can
build from this data. The figures below are examples of user-item rating matrices.
As you can see, the user-item matrix based on explicit rating has numerical values.
On the other hand, the user-item matrix based on implicit rating has binary values
instead.
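A small sketch of how the two kinds of matrix might be built from raw data is given below. The user IDs, item IDs, and events are hypothetical, and unrated cells in the explicit matrix are represented as `None`, which is one possible convention among several.

```python
USERS = ["u1", "u2", "u3"]
ITEMS = ["i1", "i2", "i3"]

# Explicit feedback: numerical ratings given directly by users (hypothetical).
explicit_ratings = {("u1", "i1"): 5, ("u1", "i3"): 2, ("u2", "i2"): 4, ("u3", "i1"): 3}

# Implicit feedback: user-item pairs with an observed click or view (hypothetical).
implicit_interactions = {("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u3", "i3")}


def explicit_matrix(users, items, ratings):
    """Rows are users, columns are items, cells hold the rating or None."""
    return [[ratings.get((user, item)) for item in items] for user in users]


def implicit_matrix(users, items, interactions):
    """Rows are users, columns are items, cells are 1 if the user interacted."""
    return [[1 if (user, item) in interactions else 0 for item in items] for user in users]


explicit = explicit_matrix(USERS, ITEMS, explicit_ratings)
implicit = implicit_matrix(USERS, ITEMS, implicit_interactions)
print(explicit)
print(implicit)
```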
Once we understand the user-item rating matrix, the input data for
collaborative filtering, there are two ways to compare rating vectors.
When we focus on the users (rows in the user-item rating matrix), we compare the rating
vectors of two users, and this is called user-user collaborative filtering.
On the other hand, when we focus on the items (columns in the user-item rating matrix), we
compare the rating vectors of two items, and this is called item-item
collaborative filtering.
Intuitively, if the rating vectors of two users are similar, it means that the users'
preferences are similar.
Likewise, if the rating vectors of two items are similar, it tells us the items are liked
by similar users.
Although you can use either approach, depending on the data you have, you
should consider the amount of computation in a practical setting, because the number of
pairwise comparisons grows with the square of the number of vectors compared. You should
use item-item collaborative filtering if you have more users than items.
In comparison, you should use user-user collaborative filtering if you have more
items than users.
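The difference between the two approaches, and its effect on the amount of computation, can be seen in a short sketch. The rating matrix is hypothetical, 0 is assumed to mean "not rated", and cosine similarity is used as the comparison measure.

```python
import math
from itertools import combinations

# Hypothetical user-item matrix: 4 users (rows) x 3 items (columns), 0 = not rated.
RATINGS = [
    [5, 3, 0],
    [4, 2, 0],
    [0, 0, 5],
    [1, 0, 4],
]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pairwise_similarities(vectors):
    """Cosine similarity for every unordered pair of vectors."""
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


# User-user: compare rows. Item-item: compare columns (the transposed matrix).
user_user = pairwise_similarities(RATINGS)
item_columns = [list(column) for column in zip(*RATINGS)]
item_item = pairwise_similarities(item_columns)

print(len(user_user), len(item_item))
```

With 4 users and 3 items, the user-user approach needs 6 comparisons and the item-item approach only 3, which is the reason for preferring item-item filtering when users outnumber items.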
Explore how big data is harnessed and utilized in social media and how it can be a
game-changer for your online endeavors.
Utilizing Big Data in Social Media begins with data collection and aggregation.
Social media platforms are designed to gather extensive information about user
behavior.
Every click, like, share, comment, and post generates valuable data points.
This data includes user demographics, content preferences, interaction patterns, and
the sentiment expressed in the language of posts and comments.
Once the data is collected, the real magic of big data happens through analysis and
deriving actionable insights.
Advanced algorithms and machine learning techniques are used to process and
make sense of this vast sea of information.
User Behavior Analysis: Data analytics tools dissect user behavior patterns.
They determine what content users engage with the most, when they are
most active, and how they navigate the platform.
Personalization: Social media platforms use big data to create highly
personalized experiences. By analyzing a user’s past interactions and
preferences, algorithms suggest friends, groups, and content tailored to the
individual’s interests.
Sentiment Analysis: Natural language processing algorithms are employed
to understand the sentiment in posts and comments. This can help gauge
public opinion on various topics, products, or brands.
Content Optimization: Data analytics provides insights into content
performance for businesses and content creators. Metrics such as click-
through rates, conversion rates, and audience demographics guide the
creation of content that resonates with the target audience.
Predictive Analysis: Predictive analytics uses historical data to forecast
future trends, helping businesses anticipate customer needs and market
fluctuations.
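As a toy illustration of user behavior analysis, the sketch below aggregates a handful of interaction events to find the most engaging content type and the most active hour. The event fields and values are hypothetical; real platforms work with far larger logs and distributed tooling.

```python
from collections import Counter

# Hypothetical interaction events collected by a social media platform.
EVENTS = [
    {"user": "u1", "action": "like", "content": "video", "hour": 20},
    {"user": "u2", "action": "share", "content": "video", "hour": 20},
    {"user": "u1", "action": "comment", "content": "photo", "hour": 9},
    {"user": "u3", "action": "like", "content": "video", "hour": 21},
    {"user": "u2", "action": "like", "content": "article", "hour": 20},
]


def most_engaging_content(events):
    """Content type with the largest number of interactions."""
    return Counter(event["content"] for event in events).most_common(1)[0][0]


def most_active_hour(events):
    """Hour of the day with the largest number of interactions."""
    return Counter(event["hour"] for event in events).most_common(1)[0][0]


print(most_engaging_content(EVENTS), most_active_hour(EVENTS))
```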
Mobile Analytics
Marketers want to know what their customers want to see and do on their mobile
devices so that they can target those customers.
Similar to the process of analytics used to study the behaviour of users on the web
or social media, mobile analytics is the process of analysing the behaviour of mobile
users.
1. New users: Users who have just started using a mobile service. Users are
identified by unique device IDs. The growth and popularity of the service greatly
depend on the number of new users it is able to attract.
2. Active users: Users who use a mobile service at least once in a specified
period. If the period is one day, an active user is one who uses the service at least
once during that day. The number of active users in any specific period of time shows the
popularity of a service during that period.
3. Percentage of new users: The percentage of new users among the total
active users of a mobile service.
4. Session: When a user opens an app, it is counted as one session. The session
starts with the launch of the app and ends with the app's termination.
5. Average usage duration: The average length of time for which a mobile user uses the service.
6. User retention: The total number of new users still using an app after a certain
period of time is known as the user retention of that app.
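These metrics can be computed from a simple session log, as the sketch below shows. The log format (device ID, day number, session duration in minutes) and its values are hypothetical, and the average usage duration is taken per session, which is one possible reading of the definition above.

```python
# Hypothetical session log: (device_id, day, duration_in_minutes).
SESSIONS = [
    ("d1", 1, 10),
    ("d2", 1, 5),
    ("d1", 2, 20),
    ("d3", 2, 15),
    ("d1", 2, 6),
]


def active_users(sessions, day):
    """Devices with at least one session on the given day."""
    return {device for device, session_day, _ in sessions if session_day == day}


def new_users(sessions, day):
    """Devices active on the given day that were never seen before it."""
    seen_before = {device for device, session_day, _ in sessions if session_day < day}
    return active_users(sessions, day) - seen_before


def percentage_new_users(sessions, day):
    active = active_users(sessions, day)
    return 100 * len(new_users(sessions, day)) / len(active) if active else 0.0


def average_session_duration(sessions):
    """Average duration per session (assumption: averaged over sessions, not users)."""
    return sum(duration for _, _, duration in sessions) / len(sessions)


def retained_users(sessions, first_day, later_day):
    """New users from first_day who are still active on later_day."""
    return new_users(sessions, first_day) & active_users(sessions, later_day)


print(len(active_users(SESSIONS, 2)), percentage_new_users(SESSIONS, 2))
print(average_session_duration(SESSIONS), retained_users(SESSIONS, 1, 2))
```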
Analyze screen flow: Mobile analytics can track how and where a user touches the
screen. This is used to make interactive GUIs and also to decide the placement of mobile
ads.
Keep customers engaged: Mobile analytics studies the behaviour of users or
customers and displays ads and other screens to keep them engaged.
Analyse the preferences of visitors: On the basis of users' touches, taps, and other
behaviour on the screen, mobile analytics can analyse their preferences.
Mobile analytics collects data and turns it into useful information, keeping
track of the following:
1. Total time spent: shows the total time spent by the user with
an application.
2. Visitors' location: shows the location of users using any
particular application.
3. Number of total visitors: indicates the popularity of the application.
4. Click paths of visitors: keeps track of the activities of a user visiting
the pages of any application.
5. Pages viewed by the visitor: tracks the pages of the application visited by the
user, which again reflects the popular sections of the application.
6. Downloading choices of users: keeps track of files downloaded by the
user, which helps in understanding the type of data users like to download.
7. Types of mobile device and network used: helps mobile service
providers and mobile phone sellers understand the popularity of mobile devices
and networks.
8. Screen resolution of the mobile phone used: any information that appears on
the mobile device is laid out according to the screen size of the device. This aspect
ensures that the content fits a particular device screen.
9. Performance of advertising campaigns: mobile analytics is used to keep track of the
performance of advertising campaigns by analysing the number of visitors and the
time spent by them.
R's flexibility and extensibility make it an excellent choice for big data
analytics.
Data Visualization
o sparklyr: Interface between R and Apache Spark. Enables large-scale data processing
and machine learning on distributed data.
o RHadoop: Integrates R with Hadoop. Includes packages like:
    o rhdfs: For interacting with the Hadoop Distributed File System (HDFS).
    o rmr2: For writing MapReduce jobs in R.
    o plyrmr: For manipulating structured data on Hadoop.
o bigmemory: Handles large matrices by storing them in shared memory or in
memory-mapped files on disk, so they can exceed available RAM.
o ff: Manages datasets too large for memory by storing them on disk.
o data.table: Optimized for fast manipulation of large datasets in R.
o dplyr with databases: Can work with SQL databases for large data.
4. Machine Learning
Real-World Applications
Finance and Banking − Big data analytics in finance and banking can
help in fraud detection, risk modeling, portfolio optimization, and
customer segmentation. R's capabilities in data analysis and modeling
make it a valuable tool in this domain.
1. Data Loading:
Use tools like rhdfs to load data from HDFS or connect to
cloud data sources (e.g., AWS S3).
Load data into distributed memory frameworks like Spark
using sparklyr.
2. Data Preprocessing:
Use dplyr or data.table for cleaning, transformation, and
summarization.
For distributed data, leverage Spark's built-in capabilities.
3. Exploratory Data Analysis (EDA):
Perform summary statistics and visualize data
using ggplot2 or plotly.
Use scalable methods to handle subsets or aggregated data.
4. Model Building:
For distributed machine learning:
Use Spark MLlib via sparklyr for linear regression,
decision trees, clustering, etc.
For large local datasets:
Use packages like biglm for linear models.
5. Result Visualization:
Use visualization libraries like ggplot2, shiny, or plotly to
present findings.
6. Export and Deployment:
Save results to HDFS or a database for further use.
Deploy models using APIs or tools like R Shiny.