BDA Unit-5

Uploaded by Jaya Prakash

Supervised Machine Learning

Supervised learning is a type of machine learning in which machines are trained on well "labelled" training data and, on the basis of that data, predict the output. Labelled data means the input data is already tagged with the correct output.

In supervised learning, the training data provided to the machine acts as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.

Supervised learning is the process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).

In the real world, supervised learning is used for risk assessment, image classification, fraud detection, spam filtering, etc.

How Supervised Learning Works?


In supervised learning, models are trained on a labelled dataset, where the model learns about each type of data. Once the training process is complete, the model is evaluated on test data (data held out from training), and then it predicts the output.

The working of supervised learning can be easily understood with the following example:

Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and other polygons. The first step is to train the model on each shape:

o If the given shape has four sides, and all the sides are equal, it is labelled as a square.
o If the given shape has three sides, it is labelled as a triangle.
o If the given shape has six equal sides, it is labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify the shape.

The machine is already trained on all types of shapes, so when it encounters a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
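The shape example above can be sketched as a tiny rule-based classifier in Python. This is a toy illustration of the learned rules, not an actual trained model; the function name and interface are our own invention:

```python
def classify_shape(num_sides, all_sides_equal=False):
    """Label a shape from its side count, mirroring the training rules above."""
    if num_sides == 4 and all_sides_equal:
        return "Square"
    if num_sides == 3:
        return "Triangle"
    if num_sides == 6 and all_sides_equal:
        return "Hexagon"
    return "Unknown"

# A new shape is classified on the basis of its number of sides:
print(classify_shape(3))                        # Triangle
print(classify_shape(4, all_sides_equal=True))  # Square
```

A real supervised model would learn such rules from labelled examples rather than have them written by hand.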

Steps Involved in Supervised Learning:


o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into training, test, and validation sets.
o Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as a support vector machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes a validation set (a subset of the training data) is needed to tune control parameters.
o Evaluate the accuracy of the model using the test set. If the model predicts the correct outputs, the model is accurate.
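The steps above can be sketched end to end on a toy dataset with a simple 1-nearest-neighbour classifier. This is a pure-Python illustration with invented data; a real project would use a library such as scikit-learn for the split, training, and evaluation:

```python
# Labelled data: (feature vector, label) -- the collected training data
data = [((1.0, 1.1), "A"), ((0.9, 1.0), "A"), ((1.2, 0.9), "A"),
        ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.9, 3.0), "B")]

# Split into training and held-out test sets
train, test = data[:2] + data[3:5], [data[2], data[5]]

def predict(x):
    """Execute the algorithm: 1-nearest-neighbour over the training set."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(train, key=lambda item: dist(item[0], x))[1]

# Evaluate accuracy on the test set
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)  # 1.0 on this easily separable toy data
```

In practice, libraries provide these steps directly (e.g. scikit-learn's `train_test_split`, model `fit`, and `accuracy_score`).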

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used when there is a relationship between the input variable and the output variable and the output is a continuous value, as in weather forecasting, market trend prediction, etc. Below are some popular regression algorithms that come under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
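As a minimal example of the first algorithm in the list, simple linear regression can be fitted in closed form with the standard least-squares formulas. A pure-Python sketch with invented data:

```python
# Toy data following y = 2x + 1 exactly
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Least-squares estimates: slope = cov(x, y) / var(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# The fitted line predicts continuous values, e.g. y(5) = slope * 5 + intercept
print(slope, intercept)  # 2.0 1.0
```

The same idea extends to the other algorithms in the list by changing the form of the fitted function (trees, polynomials, etc.).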

2. Classification

Classification algorithms are used when the output variable is categorical, meaning there are discrete classes such as Yes/No, Male/Female, or True/False; spam filtering is a typical example. Popular classification algorithms include:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Advantages of Supervised learning:


o With the help of supervised learning, the model can predict the output on the basis of prior experience.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us solve various real-world problems such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:


o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data differs substantially from the training data.
o Training requires a lot of computation time.
o In supervised learning, we need sufficient knowledge about the classes of objects.

Unsupervised Machine Learning


In the previous topic, we covered supervised machine learning, in which models are trained on labeled data. But there are many cases in which we do not have labeled data and need to find the hidden patterns in a given dataset. To solve such cases, we need unsupervised learning techniques.

What is Unsupervised Learning?


As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the models themselves find hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain when learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained on an unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it has no prior idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. It performs this task by clustering the image dataset into groups according to the similarities between images.

Why use Unsupervised Learning?


Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights in data.
o Unsupervised learning is much like the way a human learns to think through their own experiences, which brings it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
o In the real world, we do not always have input data with corresponding outputs, so to solve such cases we need unsupervised learning.

Working of Unsupervised Learning


Here, we have unlabeled input data, which means it is not categorized and no corresponding outputs are given. This unlabeled input data is fed to the machine learning model in order to train it. The model first interprets the raw data to find the hidden patterns in the data and then applies a suitable algorithm, such as k-means clustering, the Apriori algorithm, etc.

Once a suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.

Types of Unsupervised Learning Algorithm:


The unsupervised learning algorithm can be further categorized into two types of
problems:

o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between data objects and categorizes them according to the presence or absence of those commonalities.
o Association: Association rule mining is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical application of association rules is Market Basket Analysis.
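The bread/butter example above can be quantified with the standard support and confidence measures used in association rule mining. A pure-Python sketch over an invented transaction list:

```python
# Toy market-basket transactions (invented data)
transactions = [
    {"bread", "butter"}, {"bread", "jam"}, {"bread", "butter", "milk"},
    {"milk"}, {"bread", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: people who buy bread also tend to buy butter
print(support({"bread", "butter"}))       # 0.6
print(confidence({"bread"}, {"butter"}))  # ~0.75
```

Algorithms such as Apriori search efficiently for all rules whose support and confidence exceed chosen thresholds.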

Unsupervised Learning algorithms:


Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbours)
o Hierarchical clustering
o Anomaly detection
o Neural networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning


o Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.

Disadvantages of Unsupervised Learning


o Unsupervised learning is intrinsically more difficult than supervised learning because it has no corresponding output to learn from.
o The results of an unsupervised learning algorithm may be less accurate, since the input data is not labeled and the algorithm does not know the exact output in advance.

Supervised Learning vs. Unsupervised Learning

o Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
o A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
o A supervised learning model predicts the output; an unsupervised learning model finds hidden patterns in the data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided.
o The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find hidden patterns and useful insights in an unknown dataset.
o Supervised learning needs supervision to train the model; unsupervised learning does not.
o Supervised learning is categorized into classification and regression problems; unsupervised learning is classified into clustering and association problems.
o Supervised learning is used where we know the inputs as well as the corresponding outputs; unsupervised learning is used where we have only input data and no corresponding output data.
o A supervised learning model produces accurate results; an unsupervised learning model may give less accurate results by comparison.
o Supervised learning is not close to true artificial intelligence, since we must first train the model on each kind of data before it can predict the correct output; unsupervised learning is closer to true AI, as it learns much as a child learns everyday things from experience.
o Supervised learning includes algorithms such as linear regression, logistic regression, support vector machines, multi-class classification, decision trees, and Bayesian logic; unsupervised learning includes algorithms such as clustering, KNN, and the Apriori algorithm.

Collaborative Filtering

Collaborative filtering is a method used by recommender systems to make automatic predictions about a user's interests by collecting preferences from many users (collaborating).

The underlying assumption is that if person A has a similar opinion to person B on one issue, A is more likely to share B's opinion on a different issue than the opinion of a randomly chosen person.

Overview of Collaborative Filtering

Collaborative filtering is integral to the recommendation engines of many online services, including e-commerce websites, streaming services, and social media platforms.

It leverages the power of user data to provide personalized recommendations, enhancing user experience and engagement.

Types of Collaborative Filtering

Collaborative filtering can be broadly categorized into two types: user-based and item-based filtering.

 User-based Collaborative Filtering: This method finds similarities between users. For example, if user A and user B both rate several items similarly, they are considered similar. Future recommendations for user A will include items that user B liked but user A has not yet rated or seen.
 Item-based Collaborative Filtering: This approach finds similarities between items. If item X and item Y receive similar ratings from users, they are considered similar. If a user likes item X, the system will recommend item Y.
Benefits of Collaborative Filtering
Collaborative filtering offers several benefits:

 Personalization: Provides highly personalized recommendations based on user behavior.
 Scalability: Can handle large datasets effectively, making it suitable for large-scale applications.
 Implicit Feedback: Can work with implicit feedback, such as clicks or view times, not just explicit ratings.
Challenges and Limitations

Despite its advantages, collaborative filtering faces several challenges:

 Cold Start Problem: New users or items with no interactions pose a challenge, as the system lacks the data to make accurate predictions.
 Sparsity: In large datasets, users interact with only a small fraction of items, leading to sparse matrices that can hinder the effectiveness of the algorithm.
 Scalability Issues: As the number of users and items grows, computing similarities and recommendations becomes resource-intensive.
Applications of Collaborative Filtering

Collaborative filtering is widely used across various industries to enhance user experience and increase engagement. Some common applications include:

 E-commerce: Amazon uses collaborative filtering to recommend products based on users' purchase history and ratings.
 Streaming Services: Netflix and Spotify use collaborative filtering to suggest movies, TV shows, or music tracks that align with users' tastes.
 Social Media: Platforms like Facebook and Twitter use collaborative filtering to suggest friends or content that users might find interesting.
Implementing Collaborative Filtering

Implementing collaborative filtering involves several steps, from data collection to recommendation generation. Here is a step-by-step guide:

Step 1: Data Collection

Gather user-item interaction data. This can be explicit, like ratings and reviews, or
implicit, like clicks and view times.

Step 2: Data Preprocessing


Clean and preprocess the data. This may involve normalizing ratings, handling
missing values, and converting the data into a suitable format for analysis.

Step 3: Similarity Calculation

Choose a similarity metric to compute the similarities between users or items. Common metrics include:

 Cosine Similarity: Measures the cosine of the angle between two vectors.
 Pearson Correlation: Measures the linear correlation between two sets of
data.
 Euclidean Distance: Measures the straight-line distance between two points
in Euclidean space.
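Each of the three metrics above can be written in a few lines of pure Python. These are sketches over plain lists; production systems would typically use NumPy or SciPy:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pearson_correlation(a, b):
    """Linear correlation: cosine similarity of the mean-centred vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - ma for x in a], [y - mb for y in b])

def euclidean_distance(a, b):
    """Straight-line distance between two points in Euclidean space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(cosine_similarity([1, 1], [2, 2]))   # ~1.0 (same direction)
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

Note that cosine similarity ignores rating scale (how generous a rater is), while Pearson correlation additionally removes each rater's mean; Euclidean distance is a dissimilarity, so smaller means more alike.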
Step 4: Prediction

Use the similarity scores to predict the ratings or preferences for a user-item pair.
This can be done using techniques like weighted average or k-nearest neighbors.
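The weighted-average technique mentioned above predicts a user's rating for an item as the similarity-weighted mean of the neighbours' ratings. A sketch with invented similarity scores and ratings:

```python
# Hypothetical neighbours: (similarity to the target user, their rating of the item)
neighbours = [(0.9, 5.0), (0.7, 4.0), (0.4, 2.0)]

# Weighted average: similar neighbours influence the prediction more
predicted = sum(sim * rating for sim, rating in neighbours) \
            / sum(sim for sim, _ in neighbours)
print(round(predicted, 2))  # 4.05
```

The k-nearest-neighbours variant simply restricts this sum to the k most similar users or items.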

Step 5: Recommendation Generation

Generate a list of recommendations for each user based on the predicted ratings.
This can be done by selecting the top-N items with the highest predicted ratings.

Step 6: Evaluation

Evaluate the performance of the collaborative filtering algorithm using metrics like
precision, recall, and F1-score. This helps in fine-tuning the model and improving its
accuracy.
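The evaluation metrics named above can be computed directly from the sets of recommended and relevant items. A sketch with toy data:

```python
# Invented evaluation data for one user
recommended = {"A", "B", "C", "D"}  # items the system recommended
relevant = {"A", "B", "E"}          # items the user actually liked

true_positives = len(recommended & relevant)
precision = true_positives / len(recommended)  # recommended items that were relevant
recall = true_positives / len(relevant)        # relevant items that were recommended
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 3))  # 0.5, ~0.667, 0.571
```

In the terminology used below, the item "E" here is a false negative (relevant but not recommended) and "C" and "D" are false positives (recommended but not relevant).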

Advanced Techniques in Collaborative Filtering

To overcome some of these challenges and limitations, advanced techniques have been developed:

 Matrix Factorization: Techniques like Singular Value Decomposition (SVD) decompose the user-item interaction matrix into lower-dimensional matrices, capturing latent factors that influence user preferences.
 Hybrid Approaches: Combining collaborative filtering with content-based filtering or other techniques to improve recommendation accuracy and address the cold start problem.
 Deep Learning: Using neural networks to model complex interactions between users and items, enhancing the quality of recommendations.
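The matrix factorization idea can be sketched in pure Python: learn low-dimensional user and item factor matrices whose product approximates the observed ratings, then use that product to fill in the missing entries. This is a simple gradient-descent variant (the data, factor count, learning rate, and iteration budget are all invented for illustration), not the exact SVD computation:

```python
import random

# Toy user-item rating matrix; 0 marks an unobserved entry (invented data)
R = [[5, 3, 0],
     [4, 0, 2],
     [0, 1, 5]]
n_users, n_items, k = 3, 3, 2  # k = number of latent factors

random.seed(0)
P = [[random.random() for _ in range(k)] for _ in range(n_users)]  # user factors
Q = [[random.random() for _ in range(k)] for _ in range(n_items)]  # item factors

def sse():
    """Sum of squared errors over the observed entries."""
    return sum((R[u][i] - sum(P[u][f] * Q[i][f] for f in range(k))) ** 2
               for u in range(n_users) for i in range(n_items) if R[u][i])

initial_error = sse()
lr, reg = 0.01, 0.01
for _ in range(2000):  # stochastic gradient descent on observed ratings only
    for u in range(n_users):
        for i in range(n_items):
            if R[u][i] == 0:
                continue
            err = R[u][i] - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                puf = P[u][f]
                P[u][f] += lr * (err * Q[i][f] - reg * P[u][f])
                Q[i][f] += lr * (err * puf - reg * Q[i][f])
final_error = sse()

# The factor product now also fills in the unobserved entries with predictions
predicted = [[sum(P[u][f] * Q[i][f] for f in range(k)) for i in range(n_items)]
             for u in range(n_users)]
```

The regularization term `reg` keeps the factors small so the model generalizes to unseen entries instead of merely memorizing the observed ones.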
Collaborative recommender systems face two major challenges: scalability and ensuring quality recommendations to the consumer.

Scalability is important because e-commerce systems must be able to search through millions of potential neighbours in real time. If a site uses browsing patterns as indications of product preference, it may have thousands of data points for some of its customers.

Ensuring quality recommendations is essential in order to gain consumers' trust. If consumers follow a system recommendation but do not end up liking the product, they are less likely to use the recommender system again.

As with classification systems, recommender systems can make two types of errors: false negatives and false positives.

Here, false negatives are products that the system fails to recommend, although the
consumer would like them.

False positives are products that are recommended, but which the consumer does
not like.

False positives are less desirable because they can annoy or anger consumers.

Dimension reduction, association mining, clustering, and Bayesian learning are some of the techniques that have been adapted for collaborative recommender systems. While collaborative filtering explores the ratings of items provided by similar users, some recommender systems use a content-based method that provides recommendations based on the similarity of the contents of an item. Moreover, some systems integrate both content-based and user-based methods to achieve further improved recommendations.

Collaborative filtering — comprehensive understanding

Again, collaborative filtering is the generic term for algorithms that use explicit and implicit ratings and compute similarities between those ratings. What are explicit and implicit ratings? The terms can be explained as follows:

・Explicit — users specify (score) how much they liked a product, as with Amazon's product ratings or Netflix's movie ratings.

・Implicit — based on the user's behavior. For example, if a user buys something or watches a particular movie, we infer that the user is interested.

In practical terms, collaborative filtering uses the user-item rating matrix that can be built from the data above. The user-item matrix based on explicit ratings holds numerical values, whereas the matrix based on implicit ratings holds binary values instead.
After we understand the user-item rating matrix, the input data for collaborative filtering, we must understand how to use it to compute similarities. There are two ways to calculate similarities.

When we focus on the users (rows of the rating matrix), we compare the rating vectors of pairs of users; this is called user-user collaborative filtering. When we focus on the items (columns of the rating matrix), we compare the rating vectors of pairs of items; this is called item-item collaborative filtering.

Intuitively, if two users' rating vectors are similar, it means those users' preferences are similar. Likewise, if two items' rating vectors are similar, it tells us the items are liked by similar users.

Although you can use either approach depending on the data you have, you should consider the amount of computation in a practical setting. Use item-item collaborative filtering if you have more users than items; use user-user collaborative filtering if you have more items than users.
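Both directions can be demonstrated on a small explicit rating matrix: compare rows for user-user similarity and columns for item-item similarity. A pure-Python sketch with invented ratings, using cosine similarity:

```python
import math

# Rows = users, columns = items (invented explicit ratings)
R = [[5, 4, 1],
     [4, 5, 1],
     [1, 2, 5]]

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# User-user: compare rows of the matrix
user_sim = cosine(R[0], R[1])
# Item-item: compare columns of the matrix
col = lambda j: [row[j] for row in R]
item_sim = cosine(col(0), col(1))

print(round(user_sim, 3))  # high: users 0 and 1 rate items similarly
print(round(item_sim, 3))  # high: items 0 and 1 are liked by the same users
```

Note the cost asymmetry discussed above: user-user filtering compares all pairs of rows, item-item all pairs of columns, so the cheaper direction is the one with fewer vectors to pair up.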

Social Media Analytics

What is Social Media Analytics?


In this era of social media and networking, social media analytics is a process for extracting unseen and unknown insights from the abundant data available worldwide. It involves the identification, extraction, and evaluation of social media data using various tools and methods. It is also the art of interpreting the insights obtained in light of business goals and objectives. It focuses on seven layers of social media: text, networks, actions, hyperlinks, mobile, location, and search engines. Tools for social media analytics include DiscoverText, Lexalytics, Netlytic, Google Analytics, NodeXL, NetMiner, and many more.
Social media analytics is the process of collecting data from social media networks and gaining insights in order to improve the performance of social media campaigns. Social media analytics tools perform this analysis with the use of technologies such as data mining, data analytics, and big data. Social analytics is the key to ensuring high-level performance of your social media campaigns and to making data-driven decisions. It helps your social media team learn what works and what does not when it comes to content and strategy. This analysis is based on the levels of audience engagement, conversion, and outreach.
For example, the number of likes on your post on Facebook defines the level of
audience engagement, whereas the number of clicks on a LinkedIn post indicates
conversions. Based on the likes and clicks, you can measure the performance of
your content and tweak your future content strategy. Similarly, when you analyze
your competitor’s social networks, it gives you an idea of what content is performing
better.
Social media analytics is an essential function of marketing that helps marketers track, measure, and analyze the performance of their social campaigns. It helps in aligning the performance of social campaigns with marketing goals, and aids in justifying the investments made in these campaigns.

What is the Importance of Social Media Analytics?


Improve productivity of your organization:
By using various tools to analyze social media, companies can summarize
customer reviews and formulate strategies to improve the quality of products,
thereby increasing the productivity of the organization.
The profitability of the organization can be enhanced by identifying loopholes
and making improvements in the weaker sectors.
To analyze potential competition:
Utilizing various analytics tools also helps in identifying competitors in the
market. It assists in focusing on methodologies to achieve better results.
Comparison charts provide insights about the organization's brand and its
standing relative to competitors in the market.
To enhance customer reach:
Managing the customer journey through social analytics is crucial for retaining them.
Constantly engaging with your consumers enhances social presence and
understanding, leading to further improvements for your business. It considers
semi-structured and unstructured data, summarizing it to identify meaningful insights.
Engagement rate tracks how people are involved with your content and campaigns.
Improve product quality:
Customers often provide product reviews on social media platforms. Companies
analyze these reviews and feedback to enhance product quality. Non-positive
comments can be used by organizations to improve negative aspects, thus
enhancing the overall customer experience. Customer feedback and complaints
provide opportunities for improvements.
Strategic decision-making:
Social Media Analysis also aids in trend analysis and the identification of high-value
features for a brand. It gauges responses to social media and other
communications, facilitating meaningful decision-making for organizations to
improve productivity.
Sentiment analysis:
Comments and reviews about products and services are collected, extracted,
cleaned, and analyzed using various tools. Categorization of these comments
reveals the intention about the brand. Natural language processing
methodologies are employed to understand the intensity and group comments into
positive, negative, or neutral categories regarding a product or service.
Summarization charts about customer sentiment reveal future prospects for product
usage and guide corrective actions accordingly.
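The positive/negative/neutral categorization described above can be sketched with a tiny word-list approach. This is a toy stand-in for real natural language processing; the lexicon and example comments are invented for illustration:

```python
# Invented sentiment lexicon (real systems use large curated lexicons or NLP models)
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "broken"}

def sentiment(comment):
    """Categorize a comment as positive, negative, or neutral by word counts."""
    words = comment.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Great product, I love it"))  # positive
print(sentiment("This is broken and bad"))    # negative
```

Aggregating such labels over all comments about a product yields the summarization charts of customer sentiment mentioned above.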

How Do You Use Big Data in Social Media?

Billions of users across various platforms produce an enormous amount of data every day. This data, often called "big data," holds immense potential for businesses, marketers, and individuals seeking to make the most of their social media presence. Below we explore how big data is harnessed in social media and how it can be a game-changer for your online endeavors.

Data Collection and Aggregation

Utilizing Big Data in Social Media begins with data collection and aggregation.

Social media platforms are designed to gather extensive information about user
behavior.

Every click, like, share, comment, and post generates valuable data points.

This data includes user demographics, content preferences, interaction patterns, and
sentiment analysis based on the language used in posts and comments.

Social media companies employ advanced data collection methods to capture information, which is stored and organized in massive databases. The scale of data collected is astounding and grows with each passing second. This raw data forms the foundation on which the power of big data analytics is harnessed.

Data Analysis and Insights

Once the data is collected, the real magic of big data happens through analysis and
deriving actionable insights.
Advanced algorithms and machine learning techniques are used to process and
make sense of this vast sea of information.

Here’s how it’s done:

 User Behavior Analysis: Data analytics tools dissect user behavior patterns.
They determine what content users engage with the most when they are
most active and how they navigate the platform.
 Personalization: Social media platforms use big data to create highly
personalized experiences. By analyzing a user’s past interactions and
preferences, algorithms suggest friends, groups, and content tailored to the
individual’s interests.
 Sentiment Analysis: Natural language processing algorithms are employed
to understand the sentiment in posts and comments. This can help gauge
public opinion on various topics, products, or brands.
 Content Optimization: Data analytics provides insights into content
performance for businesses and content creators. Metrics such as click-
through rates, conversion rates, and audience demographics guide the
creation of content that resonates with the target audience.
 Predictive Analysis: Predictive analytics uses historical data to forecast
future trends, helping businesses anticipate customer needs and market
fluctuations.

Mobile Analytics
Marketers want to know what their customers want to see and do on their mobile devices so that they can target those customers. Similar to the analytics used to study the behaviour of users on the web or social media, mobile analytics is the process of analysing the behaviour of mobile users.
The primary goal of mobile analytics is to understand the following:

1. New users: These are users who have just started using a mobile service. Users are identified by unique device IDs. The growth and popularity of a service depend greatly on the number of new users it is able to attract.

2. Active users: These are users who use a mobile service at least once in a specified period. If the period is one day, an active user may use the service several times during that day. The number of active users in any given period shows the popularity of a service during that period.

3. Percentage of new users: This is the percentage of new users among the total active users of a mobile service.

4. Session: When a user opens an app, it is counted as one session. The session starts with the launch of the app and finishes with the app's termination.

5. Average usage duration: This is the average duration for which a mobile user uses the service.

6. User retention: The total number of new users still using an app after a certain period of time is known as the user retention of that app.
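The metrics defined above can be computed directly from event logs. A sketch with invented data, where device IDs stand in for users:

```python
# Hypothetical daily logs: the set of device IDs seen on each day
day1_users = {"d1", "d2", "d3", "d4"}
day2_users = {"d2", "d3", "d5"}
seen_before_day2 = day1_users  # all IDs known prior to day 2

# Active users on day 2: anyone who used the service that day
active = len(day2_users)

# New users on day 2: device IDs never seen before
new = len(day2_users - seen_before_day2)

# Percentage of new users over total active users
pct_new = 100 * new / active

# Retention: fraction of day-1 users still active on day 2
retention = len(day1_users & day2_users) / len(day1_users)

print(active, new, round(pct_new, 1), retention)  # 3 1 33.3 0.5
```

Real analytics pipelines compute the same quantities at scale from streaming event data rather than in-memory sets.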

Mobile analytics studies collected data in order to:

Track sales: Mobile analytics can track the sales of products.

Analyse screen flow: Mobile analytics can track how and where a user touches the screen. This is used to build interactive GUIs and to decide the placement of mobile ads.

Keep customers engaged: Mobile analytics studies the behaviour of users or customers and displays ads and other screens to keep them engaged.

Analyse the preferences of visitors: On the basis of users' touches, taps, and other on-screen behaviour, mobile analytics can analyse their preferences.

Analyse m-commerce activities of visitors: Mobile analytics can analyse the m-commerce activities of visitors and uncover useful information such as how often a user makes a purchase and how much they are willing to spend.

Results from mobile analytics:

Mobile analytics collects data and turns it into useful information that keeps track of the following:

1. Total time spent: the total time a user spends with an application.
2. Visitors' locations: the locations of the users of a particular application.
3. Number of total visitors: indicates the popularity of the application.
4. Click paths of visitors: tracks the activities of users as they visit the pages of an application.
5. Pages viewed by visitors: tracks the pages of the application users visit, which in turn reveals the popular sections of the application.
6. Downloading choices of users: keeps track of the files users download, which is helpful for understanding the type of content users like.
7. Types of mobile devices and networks used: helps mobile service providers and phone sellers understand the popularity of devices and networks.
8. Screen resolution of the mobile phone used: information appears on a device according to its screen size, and this aspect ensures that content fits a particular device's screen.
9. Performance of advertising campaigns: mobile analytics tracks the performance of advertising campaigns by analysing the number of visitors and the time they spend.

Mobile Analytical Tools


The top four mobile analytics tools are:
1. Localytics: a large marketing and analytics platform for mobile and web apps. It supports cross-platform and web-based applications.
2. Appsee: provides analytics services with features like conversion funnel analysis, heatmaps, etc.
3. Google Analytics: a free service provided by Google. It offers analytics services for mobile apps and mobile developers.
4. Mixpanel: can continually follow user and mobile web interactions.

Mobile analytics tools can be categorized as


1. Location-based tracking tools: These tools store information about the location of mobile devices. They are software applications for mobile devices that continuously monitor the location of a device and use the information in various ways; for example, they can display the locations of friends, ATMs, cafes, hotels, and other nearby places.
Ex: Geoloqi, Placed
2. Real-time analytics tools: These tools analyse and report data in real time.
Ex: Geckoboard, Mixpanel
3. User behaviour tracking tools: These tools track user behaviour within a particular mobile application. Their reports can help organizations improve their applications and services, and provide an excellent way for an organization to know how users use specific applications on their mobile phones.
Ex: TestFlight, Mobile App Tracking

Challenges of Mobile analytics


1. Unavailability of uniform technology: different mobile phones support
different technologies.
2. Random change in subscriber identity: the TMSI (Temporary Mobile
Subscriber Identity) identifies a mobile device and can be known by the
mobile network being used. This identity is randomly assigned by the
VLR (Visitor Location Register) to every mobile in the area as it is switched on.
This random change in the subscriber ID makes it difficult to gather important
information such as the location of the user.
3. Redirects: some mobile devices do not support redirects. The term redirect
describes the process in which the system automatically opens another
page.
4. Special characters in the URL: some mobile devices do not support certain
special characters in URLs.
5. Interrupted connections: the mobile connection with the tower is not
always dedicated. It can be interrupted when the user moves from one
tower to another, and this interruption breaks the request sent by the device.

R for Big Data Analytics


Big data analytics has become an integral part of decision-making and
business intelligence across various industries. With the exponential
growth of data, organizations need robust tools and techniques to
extract meaningful insights.

R, a powerful programming language and software environment, has
gained popularity for its extensive capabilities in data analysis and
statistical computing.

Understanding R for Big Data Analytics

R Programming Language: R is an open-source programming language
that provides a wide range of statistical and graphical techniques.

It offers a rich ecosystem of packages and libraries that support data
manipulation, visualization, and modeling.

R's flexibility and extensibility make it an excellent choice for big data
analytics.

R for Big Data: While R is traditionally known for its performance on
smaller datasets, it can also handle big data efficiently.

Several R packages have been developed specifically for big data
analytics, allowing users to process and analyze large datasets without
compromising performance.

Handling Big Data in R

R Packages for Big Data Analytics: R offers several packages that
facilitate big data analytics. Some popular packages include −

 dplyr − This package provides a grammar of data manipulation,
allowing users to perform various operations like filtering,
summarizing, and joining datasets efficiently.
 data.table − The data.table package enhances data manipulation
by implementing fast and memory-efficient data structures. It can
handle large datasets with millions or even billions of rows.
 SparkR − Built on Apache Spark, the SparkR package enables
distributed data processing with R. It leverages the power of
Spark's distributed computing capabilities to analyze big data
efficiently.
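To make the filter/summarise/join style of these packages concrete, here is a small base-R sketch of the same operations, with the corresponding dplyr verbs shown in comments so the example runs without any extra packages (it uses R's built-in mtcars dataset):

```r
# A base-R sketch of the operations dplyr's verbs express
# (dplyr equivalents shown in comments). Uses the built-in mtcars data.

# filter(mtcars, cyl == 4)  -- keep only 4-cylinder cars
four_cyl <- mtcars[mtcars$cyl == 4, ]

# summarise(four_cyl, avg_mpg = mean(mpg))  -- aggregate to a single number
avg_mpg <- mean(four_cyl$mpg)

# left_join(...)  -- joining is merge() in base R
labels <- data.frame(cyl = c(4, 6, 8), size = c("small", "medium", "large"))
joined <- merge(mtcars, labels, by = "cyl")

cat("4-cylinder cars:", nrow(four_cyl), "avg mpg:", round(avg_mpg, 1), "\n")
```

The dplyr versions of these calls are typically faster to write and, via data.table or a database backend, scale to much larger tables.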
Data Manipulation and Preprocessing

Data Cleaning − Data cleaning is a crucial step in big data analytics. R
provides a variety of functions and packages for data cleaning tasks,
including missing data imputation, outlier detection, and data
transformation.
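Two of these cleaning tasks, mean imputation of missing values and 1.5 × IQR outlier detection, can be sketched in base R as follows (the income column is invented for illustration):

```r
# Toy data with a missing value and an extreme value (hypothetical 'income')
df <- data.frame(income = c(30, 35, NA, 40, 32, 500))

# Missing data imputation: replace NA with the mean of the observed values
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)

# Outlier detection with the 1.5 * IQR rule
q <- quantile(df$income, c(0.25, 0.75))
iqr <- q[2] - q[1]
is_outlier <- df$income < q[1] - 1.5 * iqr | df$income > q[2] + 1.5 * iqr

print(df$income)          # NA replaced by the column mean
print(which(is_outlier))  # flags the extreme value (500)
```

In practice outliers are usually detected before imputing, since an extreme value (like the 500 here) skews the mean used to fill in the gaps.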

Data Transformation − R offers powerful functions for transforming
data, such as reshaping data from wide to long format (melt function),
creating new variables using calculated values (mutate function), and
splitting or combining variables (separate and unite functions).
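The same three transformations can be sketched in base R. Note that melt, mutate, and unite come from the reshape2/data.table, dplyr, and tidyr packages respectively; base-R analogues are used here so the sketch runs with no extra packages:

```r
wide <- data.frame(id = 1:2, score_math = c(90, 80), score_art = c(70, 85))

# Wide to long format (what melt() does)
long <- reshape(wide, direction = "long",
                varying = c("score_math", "score_art"),
                v.names = "score", timevar = "subject",
                times = c("math", "art"))

# Creating a new variable from calculated values (what mutate() does)
long$passed <- long$score >= 75

# Combining two variables into one (what unite() does)
long$key <- paste(long$id, long$subject, sep = "_")

print(long)
```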

Feature Engineering − Feature engineering involves creating new
features from existing data to improve model performance. R provides a
plethora of packages and functions for feature engineering, including
text mining, time series analysis, and dimensionality reduction
techniques.
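A tiny base-R illustration of feature engineering, deriving new predictors from existing columns (the customer columns are invented for the example):

```r
# Hypothetical customer data
cust <- data.frame(spend = c(200, 450, 120), visits = c(4, 9, 2))

# New feature: average spend per visit
cust$spend_per_visit <- cust$spend / cust$visits

# New feature: standardized spend (zero mean, unit variance)
cust$spend_z <- as.numeric(scale(cust$spend))

print(cust)
```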

Modeling and Analysis

Machine Learning with R − R is widely used for machine learning
tasks. It offers numerous packages for various machine learning
algorithms, including classification, regression, clustering, and ensemble
methods. Popular machine learning packages in R include caret,
randomForest, glmnet, and xgboost.
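The flavour of these tasks can be shown with base R's built-in stats functions alone; packages such as caret and randomForest wrap richer versions of the same workflow:

```r
set.seed(42)

# Regression: fit a linear model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(coef(fit))   # intercept and slopes

# Prediction for a hypothetical car (wt in 1000 lbs, hp = horsepower)
new_car <- data.frame(wt = 3.0, hp = 120)
print(predict(fit, new_car))

# Clustering: k-means with 3 clusters on two scaled features
km <- kmeans(scale(mtcars[, c("mpg", "hp")]), centers = 3)
print(table(km$cluster))
```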

Deep Learning with R − Deep learning has gained significant popularity
in recent years. R provides several packages for deep learning, such
as keras, tensorflow, and mxnet. These packages allow users to build
and train deep neural networks for tasks like image classification, natural
language processing, and time series analysis.

Data Visualization

Data Visualization Packages − R is renowned for its extensive data
visualization capabilities. It provides a wide range of packages for
creating visually appealing and informative plots and charts.

Some popular data visualization packages in R include −

 ggplot2 − ggplot2 is a highly flexible and powerful package for
creating elegant and customizable data visualizations. It follows
the grammar of graphics principles, allowing users to build
complex plots layer by layer.
 plotly − plotly is an interactive visualization package that enables
the creation of interactive and web-based plots. It offers a wide
range of options for creating interactive charts, maps, and
dashboards.
 lattice − lattice provides a comprehensive set of functions for
creating conditioned plots, such as trellis plots and multi-panel
plots. It is particularly useful for visualizing multivariate data.
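As a minimal sketch of ggplot2's layer-by-layer grammar (it assumes the ggplot2 package is installed):

```r
library(ggplot2)

# Build the plot layer by layer, per the grammar of graphics
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +                      # data layer
  geom_smooth(method = "lm", se = FALSE) +    # fitted-trend layer
  labs(title = "Fuel efficiency vs weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders")

print(p)   # render the plot
```

Each `+` adds one layer or annotation, which is what makes ggplot2 plots easy to build up and customize incrementally.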

Key Tools and Packages for Big Data Analytics with R

1. Integration with Distributed Systems

 sparklyr:
 Interface between R and Apache Spark.
 Enables large-scale data processing and machine learning
on distributed data.
 RHadoop:
 Integrates R with Hadoop.
 Includes packages like:
 rhdfs: For interacting with Hadoop Distributed File
System (HDFS).
 rmr2: For writing MapReduce jobs in R.
 plyrmr: For manipulating structured data on Hadoop.

2. Big Memory Management

 bigmemory:
 Handles datasets larger than RAM by storing them in shared
memory.
 ff:
 Manages datasets too large for memory by storing them on
disk.
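A hedged sketch of the bigmemory approach (it assumes the bigmemory package is installed; for data that must live on disk rather than in shared memory, the package's file-backed variant serves the same role, as ff does):

```r
library(bigmemory)

# Allocate a matrix outside R's normal heap; indexed access then
# works much like a regular matrix without copying the whole object
x <- big.matrix(nrow = 1e6, ncol = 3, type = "double", init = 0)
x[1, ] <- c(1.5, 2.5, 3.5)

print(dim(x))
```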

3. Data Manipulation and Analysis

 data.table:
 Optimized for fast manipulation of large datasets in R.
 dplyr with Databases:
 Can work with SQL databases for large data.

4. Machine Learning

 MLlib via Spark:
 Scalable machine learning using sparklyr.

Real-World Applications

Finance and Banking − Big data analytics in finance and banking can
help in fraud detection, risk modeling, portfolio optimization, and
customer segmentation. R's capabilities in data analysis and modeling
make it a valuable tool in this domain.

Healthcare − In the healthcare industry, big data analytics can
contribute to disease prediction, drug discovery, patient monitoring, and
personalized medicine. R's statistical and machine learning capabilities
are well-suited for analyzing healthcare data.

Marketing and Customer Analytics − R plays a significant role in
marketing and customer analytics by analyzing customer behavior,
sentiment analysis, market segmentation, and campaign optimization. It
helps organizations make data-driven marketing decisions.

Big Data Analytics with Big R refers to using R programming for
analyzing large-scale datasets, leveraging distributed computing
frameworks, cloud environments, and specialized R packages to perform
data processing and analysis. While R is traditionally known for handling
small to medium-sized datasets, tools and extensions
like bigmemory, sparklyr, and integration
with Hadoop and Spark enable R to manage and analyze big data
effectively.

Why Use R for Big Data Analytics?

1. Statistical Analysis: R provides a rich set of statistical and
machine-learning libraries.
2. Visualization: R offers advanced data visualization packages
like ggplot2 and plotly.
3. Extensibility: It integrates with big data platforms like Hadoop and
Spark.
4. Ease of Use: Its syntax and data manipulation packages
like dplyr simplify working with large datasets.

Challenges in Using R for Big Data

1. Memory Constraints: R loads entire datasets into memory,
making it unsuitable for very large datasets on single machines.
2. Performance: Without optimization, processing large datasets in
R can be slow.
3. Scalability: Requires distributed frameworks to handle datasets
beyond the system's memory.

Steps to Perform Big Data Analytics with Big R

1. Data Loading:
 Use tools like rhdfs to load data from HDFS or connect to
cloud data sources (e.g., AWS S3).
 Load data into distributed memory frameworks like Spark
using sparklyr.
2. Data Preprocessing:
 Use dplyr or data.table for cleaning, transformation, and
summarization.
 For distributed data, leverage Spark's in-built capabilities.
3. Exploratory Data Analysis (EDA):
 Perform summary statistics and visualize data
using ggplot2 or plotly.
 Use scalable methods to handle subsets or aggregated data.
4. Model Building:
 For distributed machine learning:
 Use Spark MLlib via sparklyr for linear regression,
decision trees, clustering, etc.
 For large local datasets:
 Use packages like biglm for linear models.
5. Result Visualization:
 Use visualization libraries like ggplot2, shiny, or plotly to
present findings.
6. Export and Deployment:
 Save results to HDFS or a database for further use.
 Deploy models using APIs or tools like R Shiny.

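The steps above can be sketched end to end with sparklyr and dplyr. This is a hedged sketch, not a definitive pipeline: it assumes sparklyr, dplyr, and a local Spark installation are available:

```r
library(sparklyr)
library(dplyr)

# 1. Data loading: connect to a local Spark instance and copy data in
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "cars", overwrite = TRUE)

# 2-3. Preprocessing and EDA with dplyr verbs, executed inside Spark
summary_tbl <- cars_tbl %>%
  filter(hp > 50) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()   # bring only the small aggregated result back into R
print(summary_tbl)

# 4. Model building with Spark MLlib
model <- ml_linear_regression(cars_tbl, mpg ~ wt + hp)
print(summary(model))

# 6. Clean up the Spark connection
spark_disconnect(sc)
```

The key pattern is that filtering, grouping, and model fitting all run inside Spark; only the small summarised results are collected back into the R session.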