AIML
Machine learning is the field of study that gives computers the ability to learn without being
explicitly programmed. It is broadly categorized into the following types:
1. Supervised Learning: The model is trained on labeled data, meaning each input has a known
output. It includes:
• Classification: Assigning inputs to specific categories (e.g., spam or not spam).
• Regression: Predicting continuous values (e.g., house prices).
2. Unsupervised Learning: The model works with unlabeled data to identify patterns or
structure. It includes:
• Cluster Analysis: Grouping similar data points (e.g., customer segmentation).
• Association Mining: Finding relationships between items (e.g., items frequently
bought together).
• Dimension Reduction: Reducing the number of features for simplification.
3. Semi-supervised Learning: Combines a small amount of labeled data with a large amount
of unlabeled data, useful when labeling is expensive.
4. Reinforcement Learning: The model learns by interacting with an environment, receiving
feedback to maximize cumulative rewards over time (e.g., game AI, robotics).
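A minimal sketch contrasting these paradigms, assuming scikit-learn is available; the toy arrays, targets, and model choices below are invented for illustration and are not from the notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Supervised learning: each input has a known output (label).
y_class = np.array([0, 0, 0, 1, 1, 1])                 # classification target
y_reg = np.array([1.1, 2.0, 2.9, 10.2, 11.1, 12.3])    # regression target

clf = LogisticRegression().fit(X, y_class)             # classification
reg = LinearRegression().fit(X, y_reg)                 # regression

# Unsupervised learning: no labels; the model looks for structure on its own.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

print(clf.predict([[2.5]]), reg.predict([[2.5]]), clusters)
```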
The data mining process consists of the following steps:
1. Understanding the business – This step involves understanding the objectives and requirements
of the business organization. Generally, a single data mining algorithm is enough to provide the
solution. This step also involves formulating the problem statement for the data mining process.
2. Understanding the data – This step involves data collection, a study of the characteristics
of the data, formulation of hypotheses, and matching of patterns to the selected hypotheses.
3. Preparation of data – This step involves producing the final dataset by cleaning the raw data
and preparing it for the data mining process. Missing values may cause problems during both the
training and testing phases and can force classifiers to produce inaccurate results; this is a
perennial problem for classification models. Hence, suitable strategies should be adopted to
handle missing data (a small imputation sketch follows this list).
4. Modelling – This step applies a data mining algorithm to the prepared data to obtain a model
or pattern.
5. Evaluate – This step involves evaluating the data mining results using statistical analysis
and visualization methods. The performance of a classifier is determined by evaluating its
accuracy. Classification is often a fuzzy problem; for example, classifying emails requires
extensive domain knowledge and domain experts. Hence, the performance of the classifier is
crucial.
6. Deployment – This step involves deploying the results of the data mining algorithm to improve
an existing process or to apply them to a new situation.
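As a concrete illustration of step 3 (handling missing values) and the accuracy check in step 5, here is a hedged sketch using pandas and scikit-learn; the column names, toy values, and the mean-imputation strategy are assumptions made for the example.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy raw data with a missing value (step 3: preparation of data).
df = pd.DataFrame({
    "age":    [25, 32, None, 45, 51, 38],
    "income": [30, 48, 41, 60, 75, 52],
    "buys":   [0, 0, 1, 1, 1, 0],
})

# Fill the missing age with the column mean before modelling.
X = SimpleImputer(strategy="mean").fit_transform(df[["age", "income"]])
y = df["buys"]

# Step 4: modelling; step 5: evaluation via accuracy on held-out data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier().fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
```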
Q 4. List & explain the applications of machine learning.
1. Sentiment analysis – This is an application of natural language processing (NLP) in which the
words of documents are converted into sentiments such as happy, sad, and angry, which are often
captured effectively by emoticons. For movie or product reviews, ratings such as five stars or
one star are attached automatically by sentiment analysis programs (a tiny lexicon-based sketch
follows this list).
2. Recommendation systems – These systems make personalized recommendations possible. For
example, Amazon recommends related books or books bought by people with tastes similar to yours,
and Netflix suggests shows or movies that match your taste. Recommendation systems are based on
machine learning.
3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and
Google Assistant are all examples of voice assistants. They take speech commands and
perform tasks. These chatbots are the result of machine learning technologies.
4. Navigation systems – Technologies like Google Maps and those used by Uber are examples of
machine learning that locate and navigate the shortest paths to reduce travel time.
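To make the sentiment analysis item concrete, here is a tiny lexicon-based scorer sketched in plain Python; the word list and weights are made up for illustration, and real systems typically learn such weights with trained NLP models rather than a hand-written dictionary.

```python
# Hypothetical mini lexicon; real sentiment systems learn these weights from data.
LEXICON = {"great": 1, "happy": 1, "love": 1, "bad": -1, "sad": -1, "boring": -1}

def sentiment(review: str) -> str:
    # Sum the scores of known words; unknown words contribute nothing.
    score = sum(LEXICON.get(word, 0) for word in review.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this movie great acting"))   # positive
print(sentiment("boring plot and bad sound"))        # negative
```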
Q 5. List out important classification and clustering algorithms
The key algorithms of classification are:
• Decision Tree
• Random Forest
• Support Vector Machines
• Naïve Bayes
• Artificial Neural Network and Deep Learning networks like CNN
The key clustering algorithms are:
• k-means algorithm
• Hierarchical algorithms
Q 6. List out and briefly explain the classification algorithms.
1. Decision Tree: Splits data into branches based on feature values, creating a tree structure
for decision-making. Simple and interpretable.
2. Random Forest: Combines multiple decision trees, improving accuracy and reducing
overfitting by taking a majority vote across trees.
3. Support Vector Machine (SVM): Finds the optimal boundary that separates classes;
effective for high-dimensional and binary classification.
4. Naïve Bayes: Uses Bayes' Theorem with an independence assumption between features;
fast and effective, especially for text classification.
5. Artificial Neural Network (ANN): Mimics brain neurons to learn complex patterns,
suitable for various classification tasks.
6. Convolutional Neural Network (CNN): Specializes in image classification by using
convolutional layers to detect features in visual data.
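A hedged sketch comparing several of these classifiers on scikit-learn's built-in Iris data; the dataset choice and the default hyperparameters are illustrative assumptions, not prescriptions from the notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    # Fit on the training split and report accuracy on the held-out split.
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))
```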
Q 7. What is Data? Explain the different types of Data with Elements.
WHAT IS DATA?
• All facts are data. In computer systems, bits encode facts present in numbers, text, images,
audio, and video.
• Data can be directly human interpretable (such as numbers or texts) or diffused data such
as images or video that can be interpreted only by a computer.
• Data is available in different data sources like flat files, databases, or data warehouses. It
can be either operational or non-operational data.
• Operational data is data encountered in normal business procedures and processes; for
example, daily sales data is operational data. Non-operational data, on the other hand, is data
that is used for decision making.
• Data by itself is meaningless. It has to be processed to generate information. A string of
bytes is meaningless; only when a label is attached, such as 'heights of students of a class',
does the data become meaningful.
• Processed data is called information, and it includes patterns, associations, or relationships
among data. For example, sales data can be analyzed to extract information such as which product
sold the most in the last quarter of the year (illustrated in the sketch below).
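To make "processed data becomes information" concrete, a small pandas sketch answering a question like "which product sold the most in the last quarter"; the product names and sales figures are invented for the example.

```python
import pandas as pd

# Raw data: individual sales records are meaningless on their own.
sales = pd.DataFrame({
    "product": ["A", "B", "A", "C", "B", "A"],
    "quarter": ["Q4", "Q4", "Q3", "Q4", "Q3", "Q4"],
    "units":   [120, 300, 90, 210, 150, 150],
})

# Information: a pattern extracted from the data by processing it.
q4_totals = sales[sales.quarter == "Q4"].groupby("product")["units"].sum()
print("Best-selling product in Q4:", q4_totals.idxmax())  # B
```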
Elements of Big Data
Data whose volume is small enough to be stored and processed by a small-scale computer is called
'small data'. Such data are collected from several sources, then integrated and processed by a
small-scale computer.
Big data, on the other hand, has a volume much larger than that of 'small data' and is
characterized as follows:
1. Volume – Since the cost of storage devices has fallen, there has been tremendous growth in
the amount of data. Small, traditional data is measured in gigabytes (GB) and terabytes (TB),
whereas Big Data is measured in petabytes (PB) and exabytes (EB). One exabyte is 1 million
terabytes.
2. Velocity – Velocity refers to the speed at which data arrives and the resulting growth in data
volume. The availability of IoT devices and Internet connectivity ensures that data arrives at an
ever faster rate. Velocity helps to understand the relative growth of big data and its
accessibility by users, systems, and applications.
3. Variety – The variety of Big Data includes:
• Form – There are many forms of data. Data types range from text, graph, audio, video, to
maps. There can be composite data too, where one media can have many other sources of
data, for example, a video can have an audio song.
• Function – These are data from various sources like human conversations, transaction
records, and old archive data.
• Source of data – This is the third aspect of variety. There are many sources of data.
Broadly, the data source can be classified as open/public data, social media data and
multimodal data.
Some of the other Vs that are often quoted in the literature as characteristics of big data are:
4. Veracity of data – Veracity deals with aspects like conformity to the facts, truthfulness,
believability, and confidence in the data. There may be many sources of error such as technical
errors, typographical errors, and human errors. So, veracity is one of the most important aspects
of data.
5. Validity – Validity is the accuracy of the data for taking decisions or for any other goals that
are needed by the given problem.
6. Value – Value is the characteristic of big data that indicates the worth of the information
extracted from the data and its influence on the decisions that are taken based on it. Thus,
these 6 Vs are helpful to characterize big data.
The data quality of numeric attributes is determined by factors like precision, bias, and
accuracy.
• Precision is defined as the closeness of repeated measurements. Often, the standard deviation
is used to measure precision.
• Bias is a systematic error that results from erroneous assumptions of the algorithms or
procedures.
• Accuracy refers to the closeness of measurements to the true value of the quantity. Normally,
the number of significant digits used to store and manipulate a value indicates the accuracy of
the measurement.
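A short numeric sketch of precision, bias, and accuracy, assuming a true value of 10.0 and a set of repeated measurements invented for illustration.

```python
import statistics

true_value = 10.0
measurements = [10.4, 10.5, 10.3, 10.6, 10.4]   # repeated measurements of one quantity

precision = statistics.stdev(measurements)                        # closeness of repeats
bias = statistics.mean(measurements) - true_value                 # systematic offset
error = abs(statistics.mean(measurements) - true_value)           # closeness to the true value

print(f"precision (std dev): {precision:.3f}")
print(f"bias: {bias:+.3f}")
print(f"error relative to true value: {error:.3f}")
```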
Types of Data
In Big Data, there are three kinds of data: structured data, unstructured data, and
semi-structured data.
Structured Data
In structured data, data is stored in an organized manner such as a database where it is available
in the form of a table. The data can also be retrieved in an organized manner using tools like
SQL. The structured data frequently encountered in machine learning are listed below:
Record Data
A dataset is a collection of measurements taken from a process. A dataset contains a collection
of objects, and each object has a set of measurements. The measurements can be arranged in the
form of a matrix: each row of the matrix represents an object and can be called an entity, case,
or record, while the columns of the dataset are called attributes, features, or fields. The table
is filled with observed data. It is also useful to note the general terminology associated with a
dataset: 'label' is the term used to describe the individual observations.
Data Matrix
It is a variation of the record type because it consists of numeric attributes. The standard matrix
operations can be applied on these data. The data is thought of as points or vectors in the
multidimensional space where every attribute is a dimension describing the object.
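A minimal sketch of a data matrix as points in multidimensional space, using NumPy; the three objects and the two attributes (height, weight) are illustrative choices, not from the notes.

```python
import numpy as np

# Each row is an object, each column a numeric attribute (a dimension).
data_matrix = np.array([
    [170.0, 65.0],   # object 1: height (cm), weight (kg)
    [160.0, 55.0],   # object 2
    [180.0, 80.0],   # object 3
])

# Standard matrix operations apply, e.g. column means or pairwise distances.
print(data_matrix.mean(axis=0))                          # mean of each attribute
print(np.linalg.norm(data_matrix[0] - data_matrix[1]))   # distance between objects 1 and 2
```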
Graph Data
It involves the relationships among objects. For example, a web page can refer to
another web page. This can be modeled as a graph: the nodes are web pages and a hyperlink is an
edge that connects the nodes.
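A small sketch of web pages as graph data, with hyperlinks as directed edges; it is represented here with a plain adjacency list for simplicity, though dedicated graph libraries are commonly used in practice. The page names are invented.

```python
# Nodes are web pages; a directed edge means "links to".
web_graph = {
    "home.html":     ["about.html", "products.html"],
    "about.html":    ["home.html"],
    "products.html": ["home.html", "about.html"],
}

# Example queries: which pages does home.html link to, and which pages link back to it?
print(web_graph["home.html"])
print([page for page, links in web_graph.items() if "home.html" in links])
```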
Ordered Data
Ordered data objects involve attributes that have an implicit order among them.
The examples of ordered data are:
• Temporal data – This is data whose attributes are associated with time. For example, customer
purchasing patterns during festival time are sequential data. Time series data is a special type
of sequence data where the data is a series of measurements over time (a small pandas sketch
follows this list).
• Sequence data – It is like sequential data but does not have time stamps. This data
involves the sequence of words or letters. For example, DNA data is a sequence of four
characters – A T G C.
• Spatial data – It has attributes such as positions or areas. For example, maps are spatial
data where the points are related by location.
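A brief sketch of temporal (time series) data in pandas, where the implicit order of the time stamps matters; the dates and daily sales values are assumptions made for illustration.

```python
import pandas as pd

# Time series data: measurements ordered by a time stamp.
ts = pd.Series(
    [12, 15, 14, 20, 22],
    index=pd.date_range("2024-12-27", periods=5, freq="D"),
    name="daily_sales",
)

print(ts.resample("2D").sum())   # aggregate over 2-day windows
print(ts.diff())                 # day-over-day change; the ordering is essential
```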
Unstructured Data
Unstructured data includes video, image, and audio. It also includes textual documents,
programs, and blog data. It is estimated that 80% of the data are unstructured data.
Semi-Structured Data
Semi-structured data are partially structured and partially unstructured. These include data like
XML/JSON data, RSS feeds, and hierarchical data.
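A small sketch showing why JSON counts as semi-structured: keys give partial structure, but records need not share an identical schema. The record contents are invented for the example.

```python
import json

# Two records with overlapping but not identical fields (partial structure).
raw = '''
[
  {"id": 1, "name": "Alice", "tags": ["ml", "data"]},
  {"id": 2, "name": "Bob", "email": "bob@example.com"}
]
'''

for record in json.loads(raw):
    # Structured enough to query by key, flexible enough to omit fields.
    print(record["name"], record.get("email", "no email on file"))
```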
Q 8. Explain the different types of Data Analytics and Framework
Q 10. Explain data visualization with some forms of graphs.