Machine Learning Unit - 1
UNIT -1
INTRODUCTION
Unit I: Introduction: Towards Intelligent Machines, Well-Posed Problems, Examples of Applications in
diverse fields, Data Representation, Domain Knowledge for Productive Use of Machine Learning, Diversity
of Data: Structured / Unstructured, Forms of Learning, Machine Learning and Data Mining, Basic Linear
Algebra in Machine Learning Techniques.
Examples of Applications in Diverse Fields
Image Recognition
Image recognition is one of the reasons behind the boom in the field of Deep Learning. The task, which started
with classifying images of cats and dogs, has now evolved to Face Recognition and real-world use cases built on it,
such as employee attendance tracking.
Image recognition has also helped revolutionize the healthcare industry by powering smart systems for disease
recognition and diagnosis.
Speech Recognition
Most of us have come across Speech Recognition based smart systems like Alexa and Siri and used them to
communicate. In the backend, these systems are built on speech recognition: they are designed to convert voice
instructions into text.
Another application of speech recognition that we encounter in our day-to-day life is performing Google searches
simply by speaking to the device.
Recommender Systems
As the world becomes increasingly digital, almost every tech giant tries to provide customized services to its
users. This is possible because of recommender systems, which analyze a user’s preferences and search history
and, based on that, recommend content or services.
YouTube is a common example: it recommends new videos and content based on the user’s past search patterns.
Netflix recommends movies and series based on the interests a user provides when creating an account for the
very first time.
Fraud Detection
In today’s world, most things have been digitized: from buying a toothbrush to making transactions of millions of
dollars, everything is accessible and easy to use. With this digitization, however, cases of fraudulent transactions
and activities have increased. Identifying them is not easy, but machine learning systems are very efficient at these
tasks.
Thanks to such systems, whenever red flags are detected in a user’s activity, a notification is sent to the
administrator so that the case can be monitored for spam or fraudulent activity.
Self Driving Cars
Not long ago, a car driving itself without a driver would have seemed like something out of a fictional book, but
thanks to machine learning and deep learning this is possible in today’s world. Even though the algorithms and
tech stack behind these technologies are highly advanced, at the core it is machine learning that has made these
applications possible.
The most common example of this use case is Tesla’s cars, which are well tested and proven for autonomous
driving.
Medical Diagnosis
If you are a machine learning practitioner, or even a student, you have probably heard of projects like breast
cancer classification, Parkinson’s disease classification, pneumonia detection, and many other health-related tasks
performed by machine learning models with more than 90% accuracy.
Such models are not limited to diagnosing diseases in human beings; they also work well for plant-disease tasks,
whether predicting the type of disease or detecting whether a disease is going to occur in the future.
Stock Market Trading
The stock market has remained a hot topic among working professionals and even students, because with
sufficient knowledge of the markets and the forces that drive them, one can make a fortune in this domain.
Attempts have been made to create intelligent systems that can predict future price trends and market value as
well.
This can be considered an application of time series forecasting, because stock price data is sequential data in
which the time at which each observation was taken is of utmost importance.
Virtual Try On
Have you ever purchased spectacles or lenses from Lenskart? If so, you must have come across its feature that
lets you try different frames virtually without actually purchasing them or visiting the outlet. This has become
possible because of machine learning systems that identify certain landmarks on a person’s face and then place
the spectacles virtually on your face using those landmarks.
Data Representation:
Data representation in machine learning is crucial for effectively processing and learning from data. It involves
transforming raw data into a format that is suitable for a machine learning model to understand and make predictions.
Here are some common techniques used for data representation in machine learning:
Numerical Representation:
Most machine learning algorithms require data to be in numerical format. Categorical data is often converted into
numerical values through techniques like one-hot encoding, label encoding, or ordinal encoding.
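As a minimal sketch of these encodings (assuming the pandas library and a small made-up "color" column, which
are not part of the notes above), categorical values can be converted to numbers as follows:

import pandas as pd

# A small hypothetical dataset with one categorical column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to an integer code.
df["color_label"] = df["color"].astype("category").cat.codes

print(one_hot)
print(df)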
Feature Scaling:
Features in the dataset may have different scales, which can lead to issues for certain algorithms. Feature scaling
techniques like normalization (scaling features to a range) or standardization (scaling features to have a mean of 0 and
a standard deviation of 1) are used to address this.
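For illustration, a minimal sketch using scikit-learn's MinMaxScaler and StandardScaler on a small made-up
feature matrix (the library and the values are assumptions for the example):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

X_norm = MinMaxScaler().fit_transform(X)   # normalization: each column scaled to the range [0, 1]
X_std = StandardScaler().fit_transform(X)  # standardization: mean 0, standard deviation 1 per column

print(X_norm)
print(X_std)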
Feature Engineering:
This involves creating new features from existing ones to help the model learn better. For example, extracting date-
related features (day of the week, month, etc.) from a date column can be useful for time series analysis.
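A minimal sketch of extracting such date-related features, assuming pandas and a hypothetical "date" column:

import pandas as pd

# Hypothetical date column; calendar features are derived from it.
df = pd.DataFrame({"date": pd.to_datetime(["2024-01-05", "2024-02-14", "2024-03-21"])})

df["day_of_week"] = df["date"].dt.dayofweek   # Monday = 0, Sunday = 6
df["month"] = df["date"].dt.month
df["is_weekend"] = df["day_of_week"] >= 5

print(df)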
Text Representation:
Text data needs to be converted into numerical format for machine learning models. Techniques like bag-of-words,
TF-IDF (Term Frequency-Inverse Document Frequency), and word embeddings (like Word2Vec or GloVe) are
commonly used for this purpose.
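A minimal sketch of bag-of-words and TF-IDF features using scikit-learn (the two example sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["machine learning is fun", "learning from data is machine learning"]

bow = CountVectorizer().fit_transform(docs)    # bag-of-words: raw word counts per document
tfidf = TfidfVectorizer().fit_transform(docs)  # TF-IDF: counts re-weighted by how rare each word is

print(bow.toarray())
print(tfidf.toarray())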
Image Representation:
Images are represented as pixel values in a matrix format. Techniques like resizing, normalization, and data
augmentation are used to preprocess image data before feeding it to a model.
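A minimal preprocessing sketch, assuming the Pillow and NumPy libraries and a hypothetical file named cat.jpg:

import numpy as np
from PIL import Image

img = Image.open("cat.jpg")                          # hypothetical input image
img = img.resize((224, 224))                         # resize to the input size expected by the model
pixels = np.asarray(img, dtype=np.float32) / 255.0   # normalize pixel values to the range [0, 1]

print(pixels.shape)   # e.g. (224, 224, 3) for an RGB image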
Data representation plays a crucial role in the performance and interpretability of machine learning models. Choosing
the right representation technique depends on the nature of the data and the machine learning task at hand.
Domain Knowledge for Productive Use of Machine Learning:
To make the most of machine learning in a given domain, it is important to understand both the technical aspects and
the specific challenges and opportunities within that field. Here is a general overview:
Technical Understanding:
Gain a solid grasp of machine learning fundamentals such as supervised, unsupervised, and reinforcement learning.
Explore various algorithms like decision trees, neural networks, and clustering techniques. Understanding model
evaluation, optimization, and deployment is also crucial.
Ethical Considerations:
Machine learning models can sometimes reinforce biases present in the data. It's crucial to be aware of these biases
and take steps to mitigate them, especially in applications that impact individuals or communities.
By integrating these aspects into your machine learning practice, you can effectively leverage the technology to
address challenges and drive innovation in your domain.
Diversity of Data: Structured / Unstructured
Structured and unstructured data are two types of data that are commonly used in machine learning:
Structured Data:
Structured data refers to data that is organized in a well-defined manner, typically in tabular form with rows and
columns. Each column represents a different attribute or feature, and each row represents a single data point.
Examples of structured data include databases, spreadsheets, and CSV files. Structured data is easy to process and
analyze using traditional statistical methods and is well-suited for tasks like classification, regression, and clustering.
Unstructured Data:
Unstructured data refers to data that does not have a predefined format or organization. This type of data is often text-
heavy and includes things like emails, social media posts, images, videos, and audio recordings. Unstructured data is
more challenging to process and analyze compared to structured data because it lacks a clear structure.
However, advances in natural language processing (NLP), computer vision, and audio processing have enabled the
use of machine learning techniques to extract valuable insights from unstructured data.
In machine learning, both structured and unstructured data can be used depending on the nature of the problem and
the type of information available. For example, if you want to classify emails as spam or not spam, you would likely
use techniques that are suitable for processing unstructured text data. On the other hand, if you want to predict the
price of a house based on its features (e.g., number of bedrooms, square footage), you would use structured data and
techniques that are suitable for processing tabular data.
In practice, many real-world datasets contain a mix of structured and unstructured data, and the ability to effectively
handle both types of data is a valuable skill in machine learning.
Forms of Learning:
In machine learning, there are several forms of learning that algorithms can utilize to acquire knowledge and improve
performance. Here are some of the key forms:
Supervised Learning:
In supervised learning, the algorithm is trained on a labeled dataset, where each input is paired with the correct
output. The goal is to learn a mapping from inputs to outputs so that the algorithm can make predictions on new,
unseen data. Common tasks in supervised learning include classification (predicting a discrete label) and regression
(predicting a continuous value).
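As an illustration, a minimal supervised classification sketch using scikit-learn's built-in Iris dataset (the choice of
dataset and model is an assumption made for the example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled dataset: inputs X paired with correct outputs y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Learn a mapping from inputs to outputs, then check it on new, unseen data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # classification accuracy on the held-out set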
Unsupervised Learning:
Unsupervised learning involves training the algorithm on unlabeled data, and the goal is to find hidden patterns or
structures in the data. Clustering is a common unsupervised learning task, where the algorithm groups similar data
points together. Dimensionality reduction is another task, where the algorithm reduces the number of features in the
data while preserving important information.
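A minimal clustering sketch with scikit-learn's KMeans on a few made-up, unlabeled points:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data points; no correct outputs are provided.
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # group assignments discovered by the algorithm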
Semi-Supervised Learning:
Semi-supervised learning is a combination of supervised and unsupervised learning. It uses a small amount of labeled
data and a large amount of unlabeled data to improve learning accuracy. This approach is useful when labeling data is
expensive or time-consuming.
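A minimal semi-supervised sketch using scikit-learn's SelfTrainingClassifier, where most labels are hidden (marked
as -1) to mimic scarce labeled data; the dataset and base model are assumptions made for the example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)

# Hide roughly 70% of the labels: -1 marks an unlabeled sample.
y_partial = np.copy(y)
y_partial[rng.rand(len(y)) < 0.7] = -1

# Self-training uses its own confident predictions to label the unlabeled points.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000)).fit(X, y_partial)
print(model.score(X, y))   # compared against the full label set, for illustration only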
Reinforcement Learning:
Reinforcement learning is a type of learning where an agent learns to make decisions by interacting with an
environment. The agent receives feedback in the form of rewards or penalties based on its actions, and the goal is to
learn a policy that maximizes the cumulative reward over time. Reinforcement learning is commonly used in areas
such as robotics, gaming, and autonomous driving.
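To make the reward-feedback loop concrete, here is a toy epsilon-greedy multi-armed bandit written from scratch
(the reward probabilities are made up; real reinforcement learning systems are far more elaborate):

import numpy as np

rng = np.random.RandomState(0)
true_rewards = [0.2, 0.5, 0.8]   # hypothetical probability of reward for each action (arm)
estimates = np.zeros(3)          # the agent's current estimate of each action's value
counts = np.zeros(3)
epsilon = 0.1                    # exploration rate

for step in range(1000):
    # Explore with probability epsilon, otherwise exploit the best-looking action.
    if rng.rand() < epsilon:
        action = rng.randint(3)
    else:
        action = int(np.argmax(estimates))
    reward = float(rng.rand() < true_rewards[action])   # feedback from the environment (0 or 1)
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]   # running-average update

print(estimates)   # approaches the true reward probabilities over time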
Self-supervised Learning:
Self-supervised learning is a type of learning where the algorithm generates its own labels from the input data. For
example, in image inpainting, the algorithm is trained to predict missing parts of an image based on the surrounding
pixels. Self-supervised learning is useful for pretraining models on large amounts of unlabeled data before fine-tuning
them on a specific task.
Transfer Learning:
Transfer learning is a technique where a model trained on one task is reused or adapted for another related task. This
can help improve the performance of the model on the new task, especially when labeled data is limited.
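A minimal transfer-learning sketch, assuming PyTorch/torchvision and a hypothetical 5-class target task (the data
loading and training loop are omitted):

import torch.nn as nn
from torchvision import models

# Reuse a network pretrained on ImageNet as a feature extractor.
model = models.resnet18(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False   # freeze the pretrained layers

# Replace the final layer with a new head for the 5-class target task;
# only this layer is then trained on the small labeled dataset.
model.fc = nn.Linear(model.fc.in_features, 5)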
These forms of learning are not mutually exclusive, and many machine learning algorithms and techniques can
incorporate elements from multiple forms to achieve better performance.
Machine Learning and Data Mining (Diversity of Data):
Diversity of data in machine learning and data mining refers to the variety of data types, sources, and formats
that are used in the analysis process. Diversity in data is important because it can help improve the performance
and robustness of machine learning models by providing a more comprehensive view of the problem domain.
Here are some aspects of diversity in data:
1. Data Types: Data can be structured, semi-structured, or unstructured. Structured data, such as tabular data,
follows a strict format. Semi-structured data, like XML or JSON, has some organizational properties but is not
strictly formatted. Unstructured data, such as text, images, or videos, lacks a predefined format.
2. Data Sources: Data can come from various sources such as databases, files, APIs, sensors, social media, etc.
Each source may have its own characteristics and challenges, adding to the diversity of the data.
3. Data Formats: Data can be represented in different formats like CSV, JSON, XML, etc. Each format has its
own way of organizing and storing data, which can affect how the data is processed and analyzed.
4. Data Quality: Data quality refers to the accuracy, completeness, consistency, and reliability of the data.
Diverse data may have varying levels of quality, which can impact the performance of machine learning models.
5. Data Distribution: The distribution of data across different categories or classes can also affect the
performance of machine learning models. Imbalanced data, where one class is much more prevalent than others,
can lead to biased models.
6. Feature Representation: The representation of features in the dataset can vary, including numerical,
categorical, ordinal, and text features. Each type of feature requires different preprocessing techniques.
7. Temporal Aspects: Data collected over time may exhibit temporal patterns or trends, which need to be
considered in the analysis.
8. Geographical Aspects: Data collected from different geographical locations may have spatial dependencies
or variations, which can impact the analysis.
By considering the diversity of data in machine learning and data mining, researchers and practitioners can
develop more robust models that generalize well to new and unseen data.
Basic Linear Algebra in Machine Learning Techniques:
Linear algebra is a fundamental mathematical tool in machine learning, providing the basis for many techniques
and algorithms. Here are some basic concepts of linear algebra commonly used in machine learning (a short
NumPy sketch illustrating several of them follows the list):
1. Vectors: Vectors represent quantities that have both magnitude and direction. In machine learning, vectors
are often used to represent features or data points. A vector can be represented as an array of numbers, such as
[1, 2, 3].
2. Matrices: Matrices are rectangular arrays of numbers. They are used to represent datasets, transformations,
and linear equations. For example, a 2x3 matrix can be represented as:
[1, 2, 3]
[4, 5, 6]
3. Matrix Operations:
Addition and Subtraction: Matrices of the same size can be added or subtracted element-wise.
Scalar Multiplication: A matrix can be multiplied by a scalar, multiplying each element by the scalar.
Matrix Multiplication: The product of two matrices, defined when the number of columns in the first matrix
equals the number of rows in the second matrix.
4. Transpose: The transpose of a matrix flips it over its diagonal, switching its rows and columns. It is denoted
as A^T.
5. Inverse: The inverse of a square matrix A, denoted as A^-1, is a matrix such that A * A^-1 = I, where I is the
identity matrix.
6. Eigenvalues and Eigenvectors: For a square matrix A, an eigenvector is a non-zero vector v such that A * v
= λ * v, where λ is a scalar known as the eigenvalue.
7. Singular Value Decomposition (SVD): SVD is a factorization of a matrix A into the product of three
matrices U, Σ, and V^T, where Σ is a diagonal matrix of singular values.
8. Principal Component Analysis (PCA): PCA is a dimensionality reduction technique that uses SVD to
decompose a dataset into its principal components.
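The NumPy sketch below illustrates several of the concepts above on small made-up matrices (the values are
arbitrary and chosen only for illustration):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])   # a small square matrix
B = np.array([[5.0, 6.0], [7.0, 8.0]])

print(A + B)              # element-wise addition
print(2 * A)              # scalar multiplication
print(A @ B)              # matrix multiplication (columns of A must match rows of B)
print(A.T)                # transpose
print(np.linalg.inv(A))   # inverse: A @ np.linalg.inv(A) gives the identity matrix

eigvals, eigvecs = np.linalg.eig(A)   # A @ v = lambda * v for each eigenpair
print(eigvals)

U, S, Vt = np.linalg.svd(A)           # singular value decomposition: A = U @ diag(S) @ Vt
print(S)

# PCA via SVD of mean-centered data: the rows of Vt are the principal directions.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:1].T             # project each point onto the first principal component
print(X_reduced)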
Linear algebra is essential for understanding and implementing various machine learning algorithms, such as
linear regression, support vector machines, neural networks, and more. It provides a framework for
manipulating and analyzing data, making it a foundational concept in machine learning.