Unit 3
Machine learning is a branch of artificial intelligence that allows us to make our applications
intelligent without programming them explicitly. Machine learning techniques enable
applications to make decisions based on the available datasets.
Many popular organizations use machine-learning algorithms to make their services or products
understand the needs of their users and serve them according to their behavior. Google uses
machine learning in its intelligent web search engine, for spam classification in Gmail, and for
news labeling in Google News, while Amazon uses it for its recommender systems. There are many
open source frameworks available for developing these types of applications, such as R, Python,
Apache Mahout, and Weka.
There are three different types of machine-learning algorithms for intelligent system
development:
• Supervised machine-learning algorithms
• Unsupervised machine-learning algorithms
• Recommender systems
Supervised learning
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a
teacher. In supervised learning, we teach or train the machine using data that is well
labelled, which means the data is already tagged with the correct answer. After that, the
machine is provided with a new set of examples (data) so that the supervised learning
algorithm, having analysed the training data (the set of training examples), produces a correct
outcome from the labelled data.
For instance, suppose you are given a basket filled with different kinds of fruits. Now the first
step is to train the machine with all the different fruits one by one like this:
If the shape of the object is rounded, has a depression at the top, and is red in color, then
it will be labeled as an Apple.
If the shape of the object is a long curving cylinder with a green-yellow color, then it will
be labeled as a Banana.
Now suppose that, after training, the machine is given a new fruit from the basket, say a
banana, and asked to identify it.
Since the machine has already learned from the previous data, it now uses that knowledge: it
first classifies the fruit by its shape and color, confirms the fruit name as banana, and puts
it in the banana category. Thus, the machine learns from the training data (the basket of
fruits) and then applies that knowledge to the test data (the new fruit).
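As a minimal illustrative sketch (not part of the original text), the following R code trains a decision tree on toy fruit data and then classifies a new fruit; the rpart package is assumed to be installed, and the feature names and values are invented purely for illustration.

```r
# A minimal sketch of supervised classification in R; the fruit features
# and values below are toy data invented for illustration.
library(rpart)

# Labelled training data: each row is a fruit described by simple features
train <- data.frame(
  shape = c("round", "round", "long", "long"),
  color = c("red", "red", "yellow", "yellow"),
  label = c("Apple", "Apple", "Banana", "Banana"),
  stringsAsFactors = TRUE
)

# Train a decision tree on the labelled examples
fit <- rpart(label ~ shape + color, data = train, method = "class",
             control = rpart.control(minsplit = 2))

# A new, unseen fruit from the basket
new_fruit <- data.frame(shape = factor("long", levels = levels(train$shape)),
                        color = factor("yellow", levels = levels(train$color)))

# The trained model predicts the category of the new fruit
predict(fit, new_fruit, type = "class")   # expected: Banana
```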
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as
"red" or "blue", or "disease" or "no disease".
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
Types:-
Regression
Logistic Regression
Classification
Naive Bayes Classifiers
K-NN (k nearest neighbors)
Decision Trees
Support Vector Machine
Advantages:-
Supervised learning allows you to collect data and produce a data output based on previous
experience.
Helps to optimize performance criteria with the help of experience.
Supervised machine learning helps to solve various types of real-world computation
problems.
Disadvantages:-
Classifying big data can be challenging.
Training a supervised model requires a lot of computation time.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified
nor labeled, allowing the algorithm to act on that information without guidance. Here, the
task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training is given to the
machine. The machine is therefore left to find the hidden structure in the unlabeled data by
itself.
For instance, suppose the machine is given images containing both dogs and cats that it has
never seen before. The machine has no idea about the features of dogs and cats, so it cannot
label them as 'dogs' and 'cats'. However, it can categorize them according to their
similarities, patterns, and differences; that is, it can easily split the images into two
groups, where the first group may contain all the pictures with dogs in them and the second
all the pictures with cats in them. Nothing was learned beforehand, which means there is no
training data and no examples. Unsupervised learning allows the model to work on its own to
discover patterns and information that were previously undetected. It mainly deals with
unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
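As a minimal illustrative sketch (not part of the original text), the following R code mines "people who buy X also tend to buy Y" rules with the arules package; the package is assumed to be installed, and the shopping baskets are toy data invented for illustration.

```r
# A minimal sketch of association rule learning in R using the arules package;
# the baskets below are toy data invented for illustration.
library(arules)

baskets <- list(
  c("bread", "butter"),
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk")
)
txns <- as(baskets, "transactions")

# Mine rules such as "people who buy bread also tend to buy butter"
rules <- apriori(txns, parameter = list(support = 0.5, confidence = 0.7))
inspect(rules)
```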
Types of Unsupervised Learning:-
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Common unsupervised learning algorithms (clustering and dimensionality reduction):-
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
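As a minimal illustrative sketch (not part of the original text), the following R code runs k-means clustering on the built-in iris dataset, ignoring the species labels so that the groups are discovered purely from the measurements.

```r
# A minimal sketch of clustering in R using the built-in kmeans() function
# and the built-in iris dataset; the class labels are deliberately ignored.
data(iris)
features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

set.seed(42)                      # for reproducible cluster assignments
km <- kmeans(features, centers = 3)

# Each observation is assigned to one of the three discovered groups
table(km$cluster)
```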
Supervised vs. Unsupervised Machine Learning
In terms of computational complexity, supervised learning generally uses simpler methods,
whereas unsupervised learning is computationally more complex.
Recommendation algorithms
Recommendation is a machine-learning technique used to predict what new items a user would like,
based on associations with the user's previous items. Recommendations are widely used in the
field of e-commerce applications. Through these flexible, data- and behavior-driven algorithms,
businesses can increase conversions by helping to ensure that relevant choices are
automatically suggested to the right customers at the right time, with cross-selling or up-selling.
For example, when a customer is looking for a Samsung Galaxy S IV/S4 mobile phone on
Amazon, the store will also suggest other mobile phones similar to this one, presented in the
Customers Who Bought This Item Also Bought window.
There are two different types of recommendations:
• User-based recommendations: In this type, users (customers) similar to the current user
(customer) are determined. Based on this user similarity, the items those users are interested
in or have used can be recommended to other users. Let's learn it through an example.
Assume there are two users named Wendell and James; both have similar interests because
both use an iPhone. Wendell has used two items, an iPad and an iPhone, so James will be
recommended the iPad. This is user-based recommendation.
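As a minimal illustrative sketch (not part of the original text), the following R code computes user similarity with cosine similarity over a toy user-item matrix; the user names, items, and values are invented for illustration.

```r
# A minimal sketch of user-based recommendation in R; the user-item matrix
# below is toy data (1 = the user has used the item).
users <- c("Wendell", "James", "Alice")
items <- c("iPhone", "iPad", "MacBook")
ratings <- matrix(c(1, 1, 0,    # Wendell: iPhone, iPad
                    1, 0, 0,    # James:   iPhone
                    0, 0, 1),   # Alice:   MacBook
                  nrow = 3, byrow = TRUE,
                  dimnames = list(users, items))

# Cosine similarity between James and every other user
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
sims <- apply(ratings, 1, cosine, b = ratings["James", ])
most_similar <- names(sort(sims[names(sims) != "James"], decreasing = TRUE))[1]

# Recommend items the most similar user has used that James has not
setdiff(items[ratings[most_similar, ] == 1],
        items[ratings["James", ] == 1])     # expected: "iPad"
```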
• Item-based recommendations: In this type, items similar to the items currently being used by
a user are determined. Based on the item-similarity scores, those similar items are presented
to the user for cross-selling and up-selling types of recommendations.
Let's learn it through an example.
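As a minimal illustrative sketch (not part of the original text), the following R code scores items by their similarity to the item a user is currently viewing, using the same kind of toy user-item matrix as above; all names and values are invented for illustration.

```r
# A minimal sketch of item-based recommendation in R; item similarity is
# computed column-wise over a toy user-item matrix (1 = used).
ratings <- matrix(c(1, 1, 0,
                    1, 0, 0,
                    0, 1, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3"),
                                  c("iPhone", "iPad", "MacBook")))

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity of every item to the item the user is currently viewing
current <- "iPhone"
item_sims <- apply(ratings, 2, cosine, b = ratings[, current])
sort(item_sims[names(item_sims) != current], decreasing = TRUE)
```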
Data analytics project life cycle
The data analytics processes defined for a project life cycle should be followed in sequence to
effectively achieve the goal using the input datasets. This process may include identifying the
data analytics problem, designing and collecting the datasets, performing the data analytics,
and visualizing the data.
Let's get some perspective on these stages of performing data analytics.
The stages of the data analytics project life cycle are described in the following sections:
Identifying the problem
Today, business analytics trends are changing as data analytics is performed over web datasets
to grow the business. Since data sizes are increasing gradually day by day, analytical
applications need to be scalable to collect insights from these datasets. With the help of web
analytics, we can solve such business analytics problems. Let's assume that we have a large e-
commerce website, and we want to know how to increase the business. We can identify the
important pages of our website by categorizing them by popularity as high, medium, or low.
Based on these popular pages, their types, their traffic sources, and their content, we will
be able to decide on a roadmap to improve the business by improving web traffic as well as content.
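As a minimal illustrative sketch (not part of the original text), the following R code buckets web pages into high, medium, and low popularity using the base cut() function; the page names, visit counts, and thresholds are invented for illustration.

```r
# A minimal sketch of categorizing web pages by popularity in R; the page
# names and visit counts are toy data invented for illustration.
pages <- data.frame(
  page   = c("/home", "/product/s4", "/checkout", "/blog", "/contact"),
  visits = c(120000, 45000, 30000, 900, 150)
)

# Bucket pages into low / medium / high popularity by visit count
pages$popularity <- cut(pages$visits,
                        breaks = c(-Inf, 1000, 50000, Inf),
                        labels = c("low", "medium", "high"))
pages
```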
Preprocessing data
In data analytics, we do not always use the same data sources, data attributes, data tools, and
algorithms, and they do not all consume data in the same format. This leads to data operations,
such as data cleansing, data aggregation, data augmentation, data sorting, and data formatting,
being performed to provide the data in a format supported by all the data tools and algorithms
that will be used in the data analytics.
In simple terms, preprocessing performs the data operations needed to translate data into a
fixed format before providing it to the algorithms or tools. The data analytics process is then
initiated with this formatted data as the input. In the case of Big Data, the datasets need to
be formatted and uploaded to the Hadoop Distributed File System (HDFS), where they are used
further by the various nodes running Mappers and Reducers in the Hadoop cluster.
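As a minimal illustrative sketch (not part of the original text), the following R code cleans and formats a raw file and then pushes the result to HDFS via the standard hadoop command-line client; the file names, column names, and HDFS path are assumptions made for illustration.

```r
# A minimal sketch of preprocessing in R followed by an HDFS upload;
# file names, column names, and the HDFS path are assumed for illustration.
raw <- read.csv("weblog_raw.csv", stringsAsFactors = FALSE)

# Data cleansing and formatting: drop incomplete rows, normalise a column,
# and sort before handing the data to downstream tools
clean <- raw[complete.cases(raw), ]
clean$page <- tolower(trimws(clean$page))
clean <- clean[order(clean$timestamp), ]

write.csv(clean, "weblog_clean.csv", row.names = FALSE)

# Upload the formatted file to HDFS so Hadoop Mappers/Reducers can use it
system("hadoop fs -put weblog_clean.csv /user/analytics/")
```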
Visualizing data
Data visualization is used for displaying the output of data analytics. Visualization is an
interactive way to represent data insights. This can be done with various data visualization
software tools as well as R packages. R has a variety of packages for the visualization of
datasets. Some popular examples of visualization with R are as follows:
Plots for facet scales (ggplot): The following figure shows the comparison of males and
females on different measures, namely education, income, life expectancy, and literacy,
using ggplot:
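As a minimal illustrative sketch (not part of the original text), the following R code builds a facetted ggplot comparison of males and females across several measures, each panel with its own scale; ggplot2 is assumed to be installed, and the values are toy data invented for illustration.

```r
# A minimal sketch of a facetted ggplot comparison; the measures and values
# below are toy data invented for illustration.
library(ggplot2)

df <- data.frame(
  gender  = rep(c("Male", "Female"), times = 4),
  measure = rep(c("Education", "Income", "Life expectancy", "Literacy"), each = 2),
  value   = c(12, 13, 52000, 48000, 76, 81, 88, 91)
)

# One panel per measure, each panel with its own y-axis scale (facet scales)
ggplot(df, aes(x = gender, y = value, fill = gender)) +
  geom_col() +
  facet_wrap(~ measure, scales = "free_y")
```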
Dashboard charts: This is an rCharts type of visualization. Using this, we can build
interactive, animated dashboards with R.
Understanding data analytics problems
These data analytics problem definitions are designed so that readers can understand how
Big Data analytics can be done with the analytical power of R's functions and packages and
the computational power of Hadoop.
The data analytics problem definitions are as follows:
• Exploring the categorization of web pages
• Computing the frequency of changes in the stock market
• Predicting the sale price of a blue book for bulldozers (case study)
Hand-coding
One of the most basic methods for integrating data is hand-coding, or manual data integration.
Realistically, this method is only feasible for integrating a small number of data sources. In this
case, it might be effective to write code to collect the data, transform it if necessary, and
consolidate it. While hand-coding may not require investing in any software, it can take a
considerable amount of time, and scaling the integration to include more data sources may be
difficult.
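As a minimal illustrative sketch (not part of the original text), the following R code hand-codes the integration of two small sources: it collects the data, transforms it where necessary, and consolidates it into one file; the file names and column names are assumptions made for illustration.

```r
# A minimal sketch of hand-coded (manual) data integration in R; file names
# and column names are assumed for illustration.
orders <- read.csv("orders.csv", stringsAsFactors = FALSE)          # source 1
crm    <- read.csv("crm_customers.csv", stringsAsFactors = FALSE)   # source 2

# Transform: align the join key and normalise a field before consolidating
names(crm)[names(crm) == "cust_id"] <- "customer_id"
orders$amount <- as.numeric(orders$amount)

# Consolidate the two sources into a single dataset
combined <- merge(orders, crm, by = "customer_id", all.x = TRUE)
write.csv(combined, "integrated_orders.csv", row.names = FALSE)
```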
Data warehousing
Data warehousing is a type of data integration that involves using a common storage area,
often a data warehouse, to cleanse, format, and store data. This type of data integration is
also sometimes referred to as common storage integration. Data from all of the different
applications throughout an organization is copied to the data warehouse, where it can be
queried by data analysts.
Querying data on the warehouse rather than on the source applications means that analysts
don’t have to worry about impacting application performance. Plus, analysts can view all of the
data from the entire organization in a single, central location, which means they can check for
data completeness, accuracy, and consistency.
Potential issues with data warehousing include the costs of storing data in multiple locations,
plus the maintenance costs required to create and maintain the data warehouse. This is why
warehousing data in the cloud can be much more cost-effective and simpler.
Middleware data integration
Middleware data integration is a data integration system that involves using a middleware
application as a go-between that moves data between source systems and a central data
repository. The middleware helps to format and validate data before sending it to the
repository, which could be a cloud data warehouse or a database.
This approach can be particularly helpful when integrating older systems with newer ones,
because the middleware can help with transforming the legacy data into a format that’s usable
by the newer systems.
Potential issues with middleware data integration include maintenance: the middleware must
be deployed and maintained by knowledgeable developers. Another potential issue is limited
functionality, since many middleware applications have limited compatibility with source
applications.
Data consolidation
Data consolidation involves combining data from multiple systems to create a single,
centralized data source, which can then be used for reporting or analytics. ETL software is often
used to support data consolidation. ETL applications can pull data from multiple sources,
transform it into the necessary format and then transfer it to the final data storage location.
There may be some latency involved in data consolidation, because it can take time to retrieve
the data from the source and transfer it to the central data source. The latency period can be
shortened by more frequent data transfers.
One of the benefits of data consolidation is that because the data is transformed before it is
consolidated, it is in a consistent format on the central data source. This can give data workers
the chance to improve data quality and integrity.
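As a minimal illustrative sketch (not part of the original text), the following R code performs a small extract-transform-load (ETL) style consolidation into a central SQLite store; the DBI and RSQLite packages are assumed to be available, and the file names, columns, and database name are assumptions made for illustration.

```r
# A minimal sketch of ETL-style data consolidation in R; file names, columns,
# and the target database are assumed for illustration.
library(DBI)
library(RSQLite)

# Extract: pull data from two different sources
sales_eu <- read.csv("sales_eu.csv", stringsAsFactors = FALSE)
sales_us <- read.csv("sales_us.csv", stringsAsFactors = FALSE)

# Transform: bring both sources into a single, consistent format
# (assumes both files share the same columns)
sales_eu$currency <- "EUR"
sales_us$currency <- "USD"
all_sales <- rbind(sales_eu, sales_us)
all_sales$sale_date <- as.Date(all_sales$sale_date)

# Load: write the consolidated data to a central store for reporting
con <- dbConnect(SQLite(), "central_store.sqlite")
dbWriteTable(con, "sales", all_sales, overwrite = TRUE)
dbDisconnect(con)
```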
Data virtualization
Data virtualization is interesting because, while all of the data remains in its separate
systems, users can still gain a unified view of it. Data virtualization is essentially a
logical layer that integrates data from all of the source systems and delivers it to business
users in real time.
A benefit to data virtualization is that you don’t actually have to move your data around. Data
stays in the source systems, so you don’t have to worry about the increased storage costs
associated with maintaining multiple copies of your data.
Data federation
Data federation involves creating a virtual database that consolidates data from disparate
sources. Users can then use the virtual database as a single source of truth for all of the data in
the organization. When a user queries the virtual database, the query is actually sent to the
relevant underlying data source, which then serves the data back. So essentially, data is served
on an on-demand basis, rather than being integrated before it can be queried, as it is with
other data integration techniques. With data federation, data is given a common data model,
even though the different data sources may have vastly different data models.
Data propagation
Data propagation entails using applications to copy data from one location to another on an
event-driven basis. Enterprise application integration (EAI) and enterprise data replication (EDR)
technologies can be used for data propagation. EAI can provide a link between two systems, for
purposes such as business transaction processing. EDR is more frequently used to transfer data
between two databases. Unlike ETL, EDR does not involve data transformation. The data is
simply extracted from one database and moved to another one.
With nearly all of the above data integration approaches, you’ll need a data integration tool,
such as an ETL application or a data loader, to support your efforts. Choose a tool that
can integrate with all of the applications you have now, or that allows you to easily create a
connector if a pre-built one doesn’t exist. Ideally, a data integration tool is also flexible enough
that it will support any applications that you adopt in the future as well.