Unit 3

Machine learning is a branch of artificial intelligence that enables applications to make decisions based on data without explicit programming, with key types including supervised, unsupervised, and recommender systems. Supervised learning uses labeled data for training, while unsupervised learning identifies patterns in unlabeled data. The document also discusses data analytics project life cycles, integration techniques, and the importance of data visualization in deriving insights from data.

Machine learning

Machine learning is a branch of artificial intelligence that allows us to make our applications
intelligent without being explicitly programmed. Machine learning concepts are used to enable
applications to take decisions from the available datasets.

Many popular organizations use machine-learning algorithms to make their services or products
understand the needs of their users and provide services based on their behavior. Google uses it
in its intelligent web search engine, for spam classification in Google Mail, and for news
labeling in Google News, while Amazon uses it for its recommender systems. There are many open
source frameworks available for developing these types of applications, such as R, Python,
Apache Mahout, and Weka.

There are three different types of machine-learning algorithms for intelligent system
development:
• Supervised machine-learning algorithms
• Unsupervised machine-learning algorithms
• Recommender systems

Supervised learning

Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
Basically, supervised learning is when we teach or train the machine using data that is well
labelled, which means some data is already tagged with the correct answer. After that, the
machine is provided with a new set of examples (data) so that the supervised learning
algorithm analyses the training data (the set of training examples) and produces a correct
outcome from the labelled data.
For instance, suppose you are given a basket filled with different kinds of fruits. The first
step is to train the machine on all the different fruits one by one, like this:
 If the shape of the object is rounded with a depression at the top and it is red in color, then
it will be labeled as Apple.
 If the shape of the object is a long curving cylinder with a green-yellow color, then it will
be labeled as Banana.
Now suppose that, after training, you are given a new fruit from the basket, say a banana, and
asked to identify it.

Since the machine has already learned from the previous data, it now has to use that knowledge.
It will first classify the fruit by its shape and color, confirm the fruit's name as BANANA, and
put it in the Banana category. Thus, the machine learns from the training data (the basket
containing fruits) and then applies that knowledge to the test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
 Classification: A classification problem is when the output variable is a category, such as
"red" or "blue", or "disease" or "no disease".
 Regression: A regression problem is when the output variable is a real value, such as
"dollars" or "weight".
Supervised learning deals with, or learns from, "labeled" data. This implies that some data is
already tagged with the correct answer.
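To make this workflow concrete, here is a minimal sketch using scikit-learn's k-nearest neighbours classifier (one of the supervised types listed below); the fruit features, their values, and the labels are invented purely for illustration.

```python
# A minimal sketch of the supervised workflow described above, using
# scikit-learn's k-nearest-neighbours classifier. The fruit "dataset"
# (shape/colour features and labels) is made up purely for illustration.
from sklearn.neighbors import KNeighborsClassifier

# Training data: [roundness, length_cm, redness] -- hypothetical features
X_train = [
    [0.9, 7.0, 0.8],   # apple
    [0.8, 8.0, 0.9],   # apple
    [0.2, 18.0, 0.1],  # banana
    [0.3, 20.0, 0.0],  # banana
]
y_train = ["Apple", "Apple", "Banana", "Banana"]

# Train (fit) the model on the labelled examples
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Test data: a new, unseen fruit from the basket
new_fruit = [[0.25, 19.0, 0.05]]
print(model.predict(new_fruit))  # expected output: ['Banana']
```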
Types:-
 Regression
 Logistic Regression
 Classification
 Naive Bayes Classifiers
 K-NN (k nearest neighbors)
 Decision Trees
 Support Vector Machine
Advantages:-
 Supervised learning allows you to collect data and produce outputs based on previous
experience.
 It helps to optimize performance criteria with the help of experience.
 Supervised machine learning helps to solve various types of real-world computation
problems.
Disadvantages:-
 Classifying big data can be challenging.
 Training a supervised learning model requires a lot of computation time.


Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified
nor labeled, allowing the algorithm to act on that information without guidance. Here, the
task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training will be given to the
machine. Therefore, the machine is restricted to finding the hidden structure in unlabeled data
by itself.
For instance, suppose the machine is given an image containing both dogs and cats that it has
never seen before.

The machine has no idea about the features of dogs and cats, so it cannot categorize the image as
'dogs and cats'. But it can categorize the animals according to their similarities, patterns, and
differences; that is, it can easily divide the picture into two parts. The first part may
contain all the pictures having dogs in them, and the second part may contain all the pictures
having cats in them. The machine did not learn anything beforehand, which means there is no
training data or examples.
Unsupervised learning allows the model to work on its own to discover patterns and information
that were previously undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
 Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior (a minimal k-means sketch follows
the lists below).
 Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
Types of Unsupervised Learning:-
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:-
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
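To make the clustering idea concrete, here is a minimal sketch using k-means from scikit-learn (listed above); the two-dimensional points are invented for the example and stand in for real customer or image features.

```python
# A minimal sketch of the clustering idea described above, using k-means
# from scikit-learn. The two-dimensional points are invented for the
# example; in practice they would be customer or image features.
from sklearn.cluster import KMeans

points = [
    [1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # one natural group
    [8.0, 8.5], [8.2, 7.9], [7.9, 8.1],   # another natural group
]

# No labels are supplied; the algorithm groups the points by similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 1] -- cluster ids, not meaningful names
```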
Supervised vs. Unsupervised Machine Learning

Input data: Supervised algorithms are trained using labeled data; unsupervised algorithms are
used against data that is not labeled.

Computational complexity: Supervised learning is the simpler method; unsupervised learning is
computationally complex.

Accuracy: Supervised learning is highly accurate; unsupervised learning is less accurate.

Number of classes: In supervised learning the number of classes is known; in unsupervised
learning it is not known.

Data analysis: Supervised learning uses offline analysis; unsupervised learning uses real-time
analysis of data.

Algorithms used: Supervised learning uses linear and logistic regression, random forest, support
vector machines, neural networks, etc.; unsupervised learning uses k-means clustering,
hierarchical clustering, the Apriori algorithm, etc.

Recommendation algorithms

Recommendation is a machine-learning technique to predict what new items a user would like
based on associations with the user's previous items. Recommendations are widely used in the
field of e-commerce applications. Through these flexible, data- and behavior-driven algorithms,
businesses can increase conversions by helping to ensure that relevant choices are
automatically suggested to the right customers at the right time for cross-selling or up-selling.
For example, when a customer is looking for a Samsung Galaxy S IV/S4 mobile phone on
Amazon, the store will also suggest other mobile phones similar to this one, presented in the
Customers Who Bought This Item Also Bought window.
There are two different types of recommendations:
• User-based recommendations: In this type, users (customers) similar to the current user
(customer) are determined. Based on this user similarity, the items they are interested in or
have used can be recommended to other users. Let's learn it through an example.
Assume there are two users named Wendell and James; both have a similar interest because
both use an iPhone. Wendell has used two items, an iPad and an iPhone, so James will be
recommended the iPad. This is user-based recommendation.
• Item-based recommendations: In this type, items similar to the items currently being used by
a user are determined. Based on the item-similarity score, the similar items are
presented to the users for cross-selling and up-selling types of recommendations.
Let's learn it through an example.
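Here is a minimal item-based sketch, assuming a small, invented user-item purchase matrix that mirrors the Wendell and James example; the items and values are made up for illustration.

```python
# A minimal item-based recommendation sketch, assuming a small user-item
# purchase matrix (rows = users, columns = items). The data and item
# names are invented for illustration.
import numpy as np

items = ["iPhone", "iPad", "Galaxy S4", "Headphones"]
# 1 = the user bought/used the item, 0 = they did not
ratings = np.array([
    [1, 1, 0, 1],   # Wendell
    [1, 0, 0, 0],   # James
    [0, 0, 1, 1],
])

def cosine_similarity(a, b):
    """Similarity between two item columns based on co-occurrence."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Score every item James has not used by its similarity to items he has used
james = ratings[1]
used = [i for i, v in enumerate(james) if v]
scores = {
    items[j]: sum(cosine_similarity(ratings[:, j], ratings[:, i]) for i in used)
    for j in range(len(items)) if not james[j]
}
print(max(scores, key=scores.get))  # likely recommends 'iPad'
```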

Understanding the data analytics project life cycle

The data analytics processes of a project life cycle should be followed in sequence to
effectively achieve the goal using the input datasets.

This process may include identifying the data analytics problem, designing and collecting
datasets, performing data analytics, and visualizing the data.

Let's get some perspective on these stages for performing data analytics.

The data analytics project life cycle consists of the following stages: identifying the problem,
designing the data requirement, preprocessing data, performing analytics over data, and
visualizing data.
Identifying the problem
Today, business analytics trends are changing, with data analytics being performed over web
datasets to grow the business. Since the size of this data is increasing gradually day by day,
analytical applications need to be scalable in order to collect insights from the datasets. With
the help of web analytics, we can solve business analytics problems. Let's assume that we have a
large e-commerce website, and we want to know how to increase the business. We can identify the
important pages of our website by categorizing them by popularity into high, medium, and
low. Based on these popular pages, their types, their traffic sources, and their content, we will
be able to decide on a roadmap to improve the business by improving web traffic as well as content.

Designing data requirement


To perform data analytics for a specific problem, we need datasets from related domains.
Based on the domain and problem specification, the data source can be decided, and based on
the problem definition, the data attributes of these datasets can be defined.
For example, if we are going to perform social media analytics (problem specification), we use
a data source such as Facebook or Twitter. For identifying user characteristics, we need user
profile information, likes, and posts as data attributes.

Preprocessing data
In data analytics, we do not use the same data sources, data attributes, data tools, and
algorithms all the time, as not all of them use data in the same format. This leads to
performing data operations, such as data cleansing, data aggregation, data augmentation,
data sorting, and data formatting, to provide the data in a format supported by all the data
tools as well as the algorithms that will be used in the data analytics.
In simple terms, preprocessing is used to perform data operations that translate data into a fixed
data format before providing the data to algorithms or tools. The data analytics process will then
be initiated with this formatted data as the input. In the case of Big Data, the datasets need to be
formatted and uploaded to the Hadoop Distributed File System (HDFS), and used further by various
nodes with Mappers and Reducers in Hadoop clusters.
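As a minimal illustration of these preprocessing operations, here is a sketch using pandas (assumed to be available); the column names and values are invented for the example.

```python
# A minimal preprocessing sketch using pandas: cleansing, formatting, and
# aggregating raw records into a fixed format before the analytics stage.
# The column names and values are invented for the example.
import pandas as pd

raw = pd.DataFrame({
    "user":   ["a", "a", "b", None],
    "amount": ["10.5", "3.0", "7.25", "1.0"],
    "date":   ["2014-01-01", "2014-01-02", "2014-01-01", "2014-01-03"],
})

clean = (
    raw.dropna(subset=["user"])                                # data cleansing: drop incomplete rows
       .assign(amount=lambda d: d["amount"].astype(float),     # data formatting
               date=lambda d: pd.to_datetime(d["date"]))
)

# Data aggregation: total amount per user, ready for the analytics stage
summary = clean.groupby("user")["amount"].sum().reset_index()
print(summary)
```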

Performing analytics over data


After data is available in the required format for the data analytics algorithms, the data
analytics operations will be performed. These operations are performed to discover
meaningful information from the data so that better business decisions can be taken, using
data mining concepts. They may use either descriptive or predictive
analytics for business intelligence. Analytics can be performed with various machine-learning as
well as custom algorithmic concepts, such as regression, classification, clustering, and model-
based recommendation. For Big Data, the same algorithms can be translated into MapReduce
algorithms and run on Hadoop clusters by translating their data analytics logic into
MapReduce jobs that are run over Hadoop clusters. These models then need to be
evaluated and improved through the various evaluation stages of machine learning.
Improved or optimized algorithms can provide better insights.
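As a minimal, single-machine illustration of how such an analytics computation maps onto the MapReduce model (plain Python, not Hadoop itself; the records are invented):

```python
# A minimal, single-machine illustration of the MapReduce idea mentioned
# above (not Hadoop itself): counting how often each page category occurs.
# On a real cluster the map and reduce functions would run on many nodes.
from collections import defaultdict

records = ["high", "low", "medium", "high", "low", "low"]

def mapper(category):
    # Emit a (key, value) pair for each input record
    return (category, 1)

def reducer(pairs):
    # Sum the values for each key
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

print(reducer(mapper(r) for r in records))  # {'high': 2, 'low': 3, 'medium': 1}
```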

Visualizing data
Data visualization is used for displaying the output of data analytics. Visualization is an
interactive way to represent data insights. This can be done with various data visualization
software packages as well as R packages. R has a variety of packages for the visualization of
datasets. Some popular examples of visualization with R are as follows:

Plots for facet scales (ggplot): a figure comparing males and females on different measures,
namely education, income, life expectancy, and literacy, using ggplot.
Dashboard charts: an rCharts type. Using this, we can build interactive animated dashboards
with R.
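The figures referred to above use R (ggplot2 and rCharts); purely as an analogous illustration in Python, here is a minimal matplotlib sketch that visualizes an invented analytics output.

```python
# A minimal visualization sketch in Python/matplotlib (the figures described
# above use R's ggplot2 and rCharts). The page-category counts are invented.
import matplotlib.pyplot as plt

categories = ["high", "medium", "low"]
page_counts = [12, 45, 230]

plt.bar(categories, page_counts)
plt.xlabel("Page popularity category")
plt.ylabel("Number of pages")
plt.title("Web pages by popularity category")
plt.savefig("page_categories.png")  # or plt.show() in an interactive session
```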
Understanding data analytics problems

These data analytics problem definitions are designed such that readers can understand how
Big Data analytics can be done with the analytical power of R's functions and packages, and the
computational power of Hadoop.
The data analytics problem definitions are as follows:
• Exploring the categorization of web pages
• Computing the frequency of changes in the stock market
• Predicting the sale price of a blue book for bulldozers (case study)

Exploring web pages categorization


This data analytics problem is designed to identify the category of a web page of a website,
which may be categorized, popularity-wise, as high, medium, or low (regular), based on the visit
count of the pages. While designing the data requirement stage of the data analytics life cycle,
we will see how to collect these types of data from Google Analytics.
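As a minimal illustration of this categorization, the following sketch bins pages by visit count; the page names, counts, and thresholds are invented, and in practice the counts would come from Google Analytics as described above.

```python
# A minimal sketch of categorising web pages as high/medium/low by visit
# count. The page names, counts, and thresholds are invented; in the life
# cycle described above the counts would come from Google Analytics.
visits = {
    "/home": 50000,
    "/products": 12000,
    "/blog/some-post": 800,
    "/careers": 150,
}

def popularity(count, high=10000, medium=1000):
    # Threshold values are arbitrary for the example
    if count >= high:
        return "high"
    if count >= medium:
        return "medium"
    return "low"

for page, count in visits.items():
    print(page, popularity(count))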

Computing the frequency of stock market change


This data analytics MapReduce problem is designed for calculating the frequency of stock
market changes. Since this is a typical stock market data analytics problem, it will calculate the
frequency of past changes for one particular stock market symbol, such as a Fourier
Transformation. Based on this information, the investor can get more insight into changes over
different time periods. So the goal of this analytics is to calculate the frequencies of percentage
change.
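A minimal single-machine sketch of the frequency calculation is shown below; the closing prices are invented, and in the MapReduce version the same counting logic would be distributed across Hadoop nodes.

```python
# A minimal sketch of computing the frequency of percentage changes for one
# stock symbol. The closing prices are invented; in the MapReduce version
# described above the same counting logic would run as map and reduce steps.
from collections import Counter

closes = [100.0, 101.5, 100.9, 102.0, 101.0, 101.0, 103.1]

# Day-over-day percentage change, rounded to whole-percent buckets
changes = [
    round((b - a) / a * 100)
    for a, b in zip(closes, closes[1:])
]
frequency = Counter(changes)
print(frequency)  # maps each percent-change bucket to how many days it occurred
```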
Seven Types of Data Integration Techniques

Hand-coding

One of the most basic methods for integrating data is hand-coding, or manual data integration.
Realistically, this method is only feasible for integrating a small number of data sources. In this
case, it might be effective to write code to collect the data, transform it if necessary, and
consolidate it. While hand-coding may not require investing in any software, it can take a
considerable amount of time, and scaling the integration to include more data sources may be
difficult.

Data warehousing

Data warehousing is a type of data integration that involves using a common storage area,
often a data warehouse, to cleanse, format, and store data. This type of data integration is also
sometimes referred to as common storage integration. Data from all of the different
applications throughout an organization is copied to the data warehouse, where it can be
queried by data analysts.

Querying data on the warehouse rather than on the source applications means that analysts
don’t have to worry about impacting application performance. Plus, analysts can view all of the
data from the entire organization in a single, central location, which means they can check for
data completeness, accuracy, and consistency.
Potential issues with data warehousing include the costs of storing data in multiple locations,
plus the maintenance costs required to create and maintain the data warehouse. This is why
warehousing data in the cloud can be much more cost-effective and simpler.

Middleware data integration

Middleware data integration is a data integration system that involves using a middleware
application as a go-between that moves data between source systems and a central data
repository. The middleware helps to format and validate data before sending it to the
repository, which could be a cloud data warehouse or a database.

This approach can be particularly helpful when integrating older systems with newer ones,
because the middleware can help with transforming the legacy data into a format that’s usable
by the newer systems.

Potential issues with middleware data integration include maintenance. The middleware must
be deployed and maintained by knowledgeable developers. Another potential issue is limited
functionality, since many middleware applications have limited compatibility with source
applications.

Data consolidation

Data consolidation involves combining data from multiple systems to create a single,
centralized data source, which can then be used for reporting or analytics. ETL software is often
used to support data consolidation. ETL applications can pull data from multiple sources,
transform it into the necessary format and then transfer it to the final data storage location.
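As a minimal sketch of the extract-transform-load pattern described above, the following uses Python's standard sqlite3 module as a stand-in for the central store; the source records and table are invented.

```python
# A minimal ETL-style consolidation sketch (extract, transform, load) in plain
# Python with sqlite3 from the standard library. The source records, column
# names, and database file are invented for the example.
import sqlite3

# Extract: records pulled from two hypothetical source systems
crm_rows  = [("alice", "alice@example.com"), ("bob", "bob@example.com")]
shop_rows = [("ALICE", 120.0), ("BOB", 75.5)]

# Transform: normalise keys so the two sources can be joined consistently
spend = {name.lower(): total for name, total in shop_rows}
merged = [(name, email, spend.get(name, 0.0)) for name, email in crm_rows]

# Load: write the consolidated data to a central store
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, email TEXT, total_spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", merged)
conn.commit()
conn.close()
```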

There may be some latency involved in data consolidation, because it can take time to retrieve
the data from the source and transfer it to the central data source. The latency period can be
shortened by more frequent data transfers.

One of the benefits of data consolidation is that because the data is transformed before it is
consolidated, it is in a consistent format on the central data source. This can give data workers
the chance to improve data quality and integrity.

Data virtualization

Data virtualization is interesting because while all of the data remains in its separate systems,
users can still gain a unified view of it. Data virtualization is essentially a logical layer that
integrates data from all of the source systems and delivers it to business users in real time.
A benefit to data virtualization is that you don’t actually have to move your data around. Data
stays in the source systems, so you don’t have to worry about the increased storage costs
associated with maintaining multiple copies of your data.

Data federation

Data federation involves creating a virtual database that consolidates data from disparate
sources. Users can then use the virtual database as a single source of truth for all of the data in
the organization. When a user queries the virtual database, the query is actually sent to the
relevant underlying data source, which then serves the data back. So essentially, data is served
on an on-demand basis, rather than being integrated before it can be queried, as with other
data integration techniques. With data federation, data is given a common data model,
even though the different data sources may have vastly different data models.
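Purely as an illustration of this on-demand idea (not any particular federation product), here is a sketch of a 'virtual view' that forwards each query to invented in-memory sources and returns the result in a common data model.

```python
# A minimal, purely illustrative sketch of the federation idea above: a
# "virtual database" object that forwards a query to the underlying sources
# on demand and merges the results into one common record format.
# The sources and fields are invented for the example.

crm_source   = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
sales_source = [{"customer_id": 1, "total": 120.0}]

class VirtualCustomerView:
    def query(self, customer_id):
        # Each lookup goes to the live sources; nothing is copied in advance.
        base = next((r for r in crm_source if r["id"] == customer_id), None)
        sale = next((r for r in sales_source if r["customer_id"] == customer_id), None)
        if base is None:
            return None
        # Common data model, regardless of how each source stores its data
        return {"id": base["id"], "name": base["name"],
                "total": sale["total"] if sale else 0.0}

print(VirtualCustomerView().query(1))  # {'id': 1, 'name': 'Alice', 'total': 120.0}
```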

Data propagation

Data propagation entails using applications to copy data from one location to another on an
event-driven basis. Enterprise application integration (EAI) and enterprise data replication (EDR)
technologies can be used for data propagation. EAI can provide a link between two systems, for
purposes such as business transaction processing. EDR is more frequently used to transfer data
between two databases. Unlike ETL, EDR does not involve data transformation. The data is
simply extracted from one database and moved to another one.

Data integration tools

With nearly all of the above data integration approaches, you’ll need a data integration tool,
such as an ETL application or a data loader, to support your efforts. Choose a tool that
can integrate with all of the applications you have now, or that allows you to easily create a
connector if a pre-built one doesn’t exist. Ideally, a data integration tool is also flexible enough
that it will support any applications that you adopt in the future as well.
