

Department of Master of Computer Applications


MC4301 - Machine Learning
Unit - 1
Introduction

Human Learning
Learning is the process of acquiring new understanding, knowledge, behaviors, skills, values, attitudes, and preferences.
Learning consists of complex information processing, problem solving, decision-making under uncertainty, and the drive to transfer knowledge and skills into new, unknown settings.

The process of learning is continuous: it starts at birth and continues until death. We all engage in learning endeavours in order to develop the adaptive capabilities that a changing environment requires.

For learning to occur, two things are important:

1. The presence of a stimulus in the environment, and
2. Innate dispositions, such as emotional and instinctual dispositions.

A person keeps on learning across all the stages of life, by constructing or reconstructing experiences under the influence of emotional and instinctual dispositions.

Psychologists in general define learning as a relatively permanent behavioural modification which takes place as a result of experience. This definition stresses three important elements of learning:

 Learning involves a behavioural change, which can be for better or worse.
 This behavioural change should take place as a result of practice and experience. Changes resulting from maturity or growth cannot be considered learning.
 This behavioural change must be relatively permanent and last for a relatively long time.

John B. Watson was among the first thinkers to show that behavioural changes occur as a result of learning. Watson is regarded as the founder of the behavioural school of thought, which gained prominence around the first half of the 20th century.

Gales defined learning as the behavioural modification which occurs as a result of experience as well as training.

Crow and Crow defined learning as the process of acquisition of knowledge, habits
and attitudes.


According to E.A. Peel, learning can be described as a change in the individual which takes place as a result of changes in the environment.

H.J. Klausmeier described learning as a process which leads to some behavioural change as a result of experience, training, observation, activity, etc.

The key characteristics of the learning process are:

1. Described in the simplest possible manner, learning is an experience acquisition process.
2. In its more complex form, learning can be described as a process of acquisition, retention and modification of experience.
3. It re-establishes the relationship between a stimulus and a response.
4. It is a method of problem solving and is concerned with making adjustments to the environment.
5. It involves the whole gamut of activities which may have a relatively permanent effect on the individual.
6. The process of learning is concerned with the acquisition of experience, the retention of experience, the step-by-step development of experience, and the synthesis of old and new experiences to create new patterns.
7. Learning is concerned with cognitive, conative and affective aspects: knowledge acquisition is cognitive, a change in the emotions is affective, and the acquisition of new habits or skills is conative.

Types

Types of Learning

1. Motor Learning: Our day-to-day activities like walking, running and driving must be learnt to ensure a good life. These activities largely involve muscular coordination.
2. Verbal Learning: This is related to the language we use to communicate and to various other forms of verbal communication such as symbols, words, languages, sounds, figures and signs.
3. Concept Learning: This form of learning is associated with higher-order cognitive processes like intelligence, thinking and reasoning, which we learn right from childhood. Concept learning involves the processes of abstraction and generalization, which are very useful for identifying or recognizing things.
4. Discrimination Learning: Learning to distinguish between various stimuli and to respond to each appropriately is regarded as discrimination learning.
5. Learning of Principles: Learning which is based on principles helps in managing work most effectively. Principle-based learning explains the relationships between various concepts.


6. Attitude Learning: Attitude shapes our behaviour to a very great extent, as our positive or negative behaviour is based on our attitudinal predisposition.

3 Types of Behavioural Learning

The behavioural school of thought, founded by John B. Watson and highlighted in his seminal work "Psychology as the Behaviorist Views It", stressed that psychology is an objective science; mental processes should therefore not be the focus of study, since such processes cannot be objectively measured or observed.

Watson tried to prove his theory with his famous Little Albert experiment, in which he conditioned a small child to be scared of a white rat. Behavioural psychology describes three types of learning: Classical Conditioning, Observational Learning and Operant Conditioning.

1. Classical Conditioning: In classical conditioning, the process of learning is described as a stimulus-response connection or association.

Classical conditioning theory is explained with the help of Pavlov's classic experiment, in which food was used as the natural stimulus and was paired with a previously neutral stimulus, in this case the sound of a bell. By establishing an association between the natural stimulus (food) and the neutral stimulus (the sound of the bell), the desired response can be elicited.

2. Operant Conditioning: First propounded by Edward Thorndike and later developed by B.F. Skinner, this theory stresses that the consequences of actions shape behaviour.

The theory explains that the intensity of a response is either increased or decreased as a result of reinforcement or punishment. Skinner showed how reinforcement can be used to strengthen behaviour and punishment to reduce or curb it. Behavioural change was also found to depend strongly on the schedule of reinforcement, with particular focus on the timing and rate of reinforcement.

3. Observational Learning: The observational learning process was propounded by Albert Bandura in his Social Learning Theory, which focused on learning by imitating or observing other people's behaviour. For observational learning to take place effectively, four elements are essential: motivation, attention, memory and motor skills.

Machine Learning

Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior. Artificial intelligence systems are used to perform complex tasks in a way that is similar to how humans solve problems.
Machine learning is used in internet search engines, email filters to sort out spam,
websites to make personalised recommendations, banking software to detect unusual
transactions, and lots of apps on our phones such as voice recognition.

Types

As with any method, there are different ways to train machine learning algorithms,
each with their own advantages and disadvantages. To understand the pros and cons
of each type of machine learning, we must first look at what kind of data they ingest.
In ML, there are two kinds of data — labeled data and unlabeled data.

Labeled data has both the input and output parameters in a completely machine-readable form, but labeling it requires a lot of human labor. Unlabeled data has only one or neither of the parameters in a machine-readable form. This removes the need for human labeling but requires more complex solutions.

There are also some types of machine learning algorithms that are used in very
specific use-cases, but three main methods are used today.
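As a minimal illustration of this distinction, the following Python sketch (with made-up toy values) shows how labeled data pairs each input with a known output, while unlabeled data contains inputs only:

    # Labeled data pairs each input with a known output label.
    labeled_data = [
        # (input features: [height_cm, weight_kg], output label)
        ([170, 65], "adult"),
        ([120, 25], "child"),
    ]

    # Unlabeled data has input features only -- no output label is available.
    unlabeled_data = [
        [165, 60],
        [110, 22],
    ]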


Supervised Learning
Supervised learning is one of the most basic types of machine learning. In this type,
the machine learning algorithm is trained on labeled data. Even though the data needs
to be labeled accurately for this method to work, supervised learning is extremely
powerful when used in the right circumstances.

In supervised learning, the ML algorithm is given a small training dataset to work with. This training dataset is a smaller part of the bigger dataset and serves to give the algorithm a basic idea of the problem, solution, and data points to be dealt with. The training dataset is also very similar to the final dataset in its characteristics and provides the algorithm with the labeled parameters required for the problem.

The algorithm then finds relationships between the parameters given, essentially
establishing a cause and effect relationship between the variables in the dataset. At the
end of the training, the algorithm has an idea of how the data works and the
relationship between the input and the output.

This solution is then deployed for use with the final dataset, which the algorithm learns from in the same way as the training dataset. This means that supervised machine learning algorithms can continue to improve even after being deployed, discovering new patterns and relationships as they train on new data.
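The following hedged sketch shows what this workflow could look like with scikit-learn (assumed to be installed); the numbers are toy values chosen only to illustrate the train/predict pattern described above:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Labeled data: inputs X with known outputs y (toy values)
    X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
    y = [0, 0, 0, 0, 1, 1, 1, 1]

    # A smaller training split gives the algorithm its basic idea of the problem
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LogisticRegression()
    model.fit(X_train, y_train)              # learn the input-output relationship
    print(model.score(X_test, y_test))       # evaluate on unseen labeled data
    print(model.predict([[1.2], [3.8]]))     # apply the trained model to new inputs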

Unsupervised Learning
Unsupervised machine learning holds the advantage of being able to work with
unlabeled data. This means that human labor is not required to make the dataset
machine-readable, allowing much larger datasets to be worked on by the program.

In supervised learning, the labels allow the algorithm to find the exact nature of the
relationship between any two data points. However, unsupervised learning does not
have labels to work off of, resulting in the creation of hidden structures. Relationships
between data points are perceived by the algorithm in an abstract manner, with no
input required from human beings.

The creation of these hidden structures is what makes unsupervised learning algorithms versatile. Instead of a defined and set problem statement, unsupervised learning algorithms can adapt to the data by dynamically changing hidden structures. This offers more post-deployment development than supervised learning algorithms.
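A minimal sketch of this idea, assuming scikit-learn is available, is k-means clustering: no labels are supplied, and the algorithm discovers its own grouping of the (toy) data:

    from sklearn.cluster import KMeans

    X = [[1, 2], [1, 4], [1, 0],      # unlabeled points near one region
         [10, 2], [10, 4], [10, 0]]   # unlabeled points near another region

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    kmeans.fit(X)                      # groups points without human-provided labels

    print(kmeans.labels_)              # cluster assignment discovered for each point
    print(kmeans.cluster_centers_)     # the "hidden structure" found in the data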

Reinforcement Learning

Reinforcement learning directly takes inspiration from how human beings learn from
data in their lives. It features an algorithm that improves upon itself and learns from
new situations using a trial-and-error method. Favorable outputs are encouraged or
‘reinforced’, and non-favorable outputs are discouraged or ‘punished’.


Based on the psychological concept of conditioning, reinforcement learning works by putting the algorithm in a work environment with an interpreter and a reward system. In every iteration of the algorithm, the output result is given to the interpreter, which decides whether the outcome is favorable or not.

If the program finds the correct solution, the interpreter reinforces the solution by providing a reward to the algorithm. If the outcome is not favorable, the algorithm is forced to iterate until it finds a better result. In most cases, the reward system is directly tied to the effectiveness of the result.

In typical reinforcement learning use-cases, such as finding the shortest route between
two points on a map, the solution is not an absolute value. Instead, it takes on a score
of effectiveness, expressed in a percentage value. The higher this percentage value is,
the more reward is given to the algorithm. Thus, the program is trained to give the
best possible solution for the best possible reward.
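A very small trial-and-error sketch in plain Python illustrates the reward loop described above; the two "routes" and their hidden reward probabilities are invented purely for illustration:

    import random

    reward_probability = {"route_A": 0.3, "route_B": 0.8}   # hidden from the learner
    value_estimate = {action: 0.0 for action in reward_probability}
    counts = {action: 0 for action in reward_probability}
    epsilon = 0.1                                            # how often to explore

    random.seed(0)
    for step in range(1000):
        # Explore occasionally, otherwise exploit the action that looks best so far
        if random.random() < epsilon:
            action = random.choice(list(reward_probability))
        else:
            action = max(value_estimate, key=value_estimate.get)

        # The "interpreter": favorable outcomes yield a reward of 1, otherwise 0
        reward = 1 if random.random() < reward_probability[action] else 0

        # Reinforce: update the running estimate of how good this action is
        counts[action] += 1
        value_estimate[action] += (reward - value_estimate[action]) / counts[action]

    print(value_estimate)   # estimates approach the true reward rates over many trials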

Problems not to be solved

We are always amazed at how machine learning has made such an impact on our lives. There is no doubt that ML will change the face of various industries and job profiles. While it offers a promising future, there are some inherent problems at the heart of ML and AI advancements that put these technologies at a disadvantage. While ML can solve a plethora of challenges, there are a few tasks it cannot handle.

1. Reasoning Power

One area where ML has not yet succeeded is reasoning power, a distinctly human trait. Algorithms available today are mainly oriented towards specific use cases and are narrow in their applicability. They cannot reason about why a particular result arises the way it does or 'introspect' on their own outcomes.

For instance, if an image recognition algorithm identifies apples and oranges in a given scene, it cannot say whether the apple (or orange) has gone bad, or explain why that fruit is an apple or an orange. Mathematically, we can explain the learning process, but from an algorithmic perspective the innate properties involved cannot be articulated by the algorithm, or even by us.

In other words, ML algorithms lack the ability to reason beyond their intended
application.

2. Contextual Limitation

In natural language processing (NLP), text and speech are the means through which algorithms understand language. They may learn letters, words, sentences or even syntax, but where they fall short is the context of the language: algorithms do not understand the context in which language is used. A classic example is the "Chinese room" argument given by philosopher John Searle, which holds that computer programs merely manipulate 'symbols' without grasping the context behind them.

So, ML does not have an overall idea of the situation. It is limited to symbolic interpretation rather than an understanding of what is actually going on.

3. Scalability

Although ML implementations are being deployed on a significant scale, everything depends on the data and its scalability. Data is growing at an enormous rate and in many forms, which greatly affects the scalability of an ML project. Algorithms cannot do much about this unless they are constantly updated to handle the new data. This is where ML regularly requires human intervention, and scalability remains largely an unsolved problem.

In addition, growing data has to be handled in the right way when shared on an ML platform, which again requires the kind of knowledge and intuition that current ML apparently lacks.

4. Regulatory Restriction For Data In ML

ML usually needs considerable (in fact, massive) amounts of data at stages such as training and cross-validation. Sometimes this data includes private as well as general information, and this is where things get complicated. Most tech companies hold data privately, and this data is exactly what is most useful for ML applications. But there is a risk of data being misused, especially in critical areas such as medical research and health insurance.

Even when data is anonymised, it can still be vulnerable. This is why heavy regulatory restrictions are imposed on the use of private data.

5. Internal Working Of Deep Learning

This sub-field of ML is largely responsible for today's AI growth. What was once just a theory has turned out to be the most powerful aspect of ML. Deep Learning (DL) now powers applications such as voice recognition and image recognition through artificial neural networks.

But the internal workings of DL are still not well understood. Advanced DL algorithms still baffle researchers in terms of how they work and why they are so effective. The millions of neurons that form the neural networks in DL increase abstraction at every level in ways that are difficult to comprehend. This is why deep learning is dubbed a 'black box': its internal workings are largely opaque.


Applications

Popular Machine Learning Applications and Examples

1. Social Media Features

Social media platforms use machine learning algorithms and approaches to create attractive and useful features. For instance, Facebook notices and records your activities, chats, likes, and comments, and the time you spend on specific kinds of posts. Machine learning learns from this behaviour and makes friend and page suggestions for your profile.

2. Product Recommendations

Product recommendation is one of the most popular and well-known applications of machine learning, and it is a standard feature of almost every e-commerce website today. Using machine learning and AI, websites track your behavior based on your previous purchases, search patterns, and cart history, and then make product recommendations.

3. Image Recognition

Image recognition, which is an approach for cataloging and detecting a feature or an object in a digital image, is one of the most significant and notable machine learning and AI techniques. This technique is being adopted for further analysis, such as pattern recognition, face detection, and face recognition.

4. Sentiment Analysis

Sentiment analysis is one of the most essential applications of machine learning. It is a real-time machine learning application that determines the emotion or opinion of the speaker or the writer. For instance, if someone has written a review or an email (or any form of document), a sentiment analyzer will instantly determine the actual sentiment and tone of the text. Sentiment analysis can be used to analyze review-based websites, decision-making applications, and more.
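As a toy illustration only (real sentiment analyzers are trained ML models, not fixed word lists), the sketch below scores text with hand-picked positive and negative keywords:

    # Illustrative keyword lists -- not a real trained sentiment model.
    POSITIVE = {"good", "great", "excellent", "love", "amazing"}
    NEGATIVE = {"bad", "poor", "terrible", "hate", "awful"}

    def sentiment(text: str) -> str:
        words = text.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        if score > 0:
            return "positive"
        if score < 0:
            return "negative"
        return "neutral"

    print(sentiment("The product is great and I love it"))   # positive
    print(sentiment("Terrible service, really bad"))          # negative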

5. Automating Employee Access Control

Organizations are actively implementing machine learning algorithms to determine the level of access employees would need in various areas, depending on their job profiles. This is one of the coolest applications of machine learning.

6. Marine Wildlife Preservation

Machine learning algorithms are used to develop behavior models for endangered
cetaceans and other marine species, helping scientists regulate and monitor their
populations.


7. Regulating Healthcare Efficiency and Medical Services

Significant parts of the healthcare sector are actively looking at using machine learning algorithms to manage operations better. They predict the waiting times of patients in the emergency waiting rooms across various departments of hospitals. The models use vital factors such as details of staff at various times of day, patient records, complete logs of department chats, and the layout of the emergency rooms. Machine learning algorithms also come into play when detecting a disease, planning therapy, and predicting how a disease will progress. This is one of the most important machine learning applications.

8. Predict Potential Heart Failure

An algorithm designed to scan a doctor's free-form e-notes and identify patterns in a patient's cardiovascular history is making waves in medicine. Instead of a physician digging through multiple health records to arrive at a sound diagnosis, redundancy is now reduced with computers making an analysis based on available information.

9. Banking Domain

Banks are now using the latest machine learning technology to help prevent fraud and protect accounts from hackers. The algorithms determine what factors to consider when creating a filter to keep harm at bay. Unauthentic sites are automatically filtered out and restricted from initiating transactions.

10. Language Translation

One of the most common machine learning applications is language translation. Machine learning plays a significant role in the translation of one language to another. We are amazed at how websites can translate from one language to another effortlessly and give contextual meaning as well. The technology behind the translation tool is called 'machine translation.' It has enabled people to interact with others from all around the world; without it, life would not be as easy as it is now. It has provided confidence to travelers and business associates to safely venture into foreign lands with the conviction that language will no longer be a barrier.

Your model will need to be taught what you want it to learn. Feeding it relevant data will help the machine draw patterns and act accordingly. It is imperative to provide relevant data and files to help the machine learn what is expected. With machine learning, the results you strive for depend on the contents of the files being used.

Languages/Tools

Regardless of individual preferences for a particular programming language, we have profiled the five best programming languages for machine learning:


1. Python Programming Language

With over 8.2 million developers across the world using Python for coding, Python ranks first in the latest annual ranking of popular programming languages by IEEE Spectrum with a score of 100. Stack Overflow's programming language trends clearly show that it is the only language whose popularity has kept rising over the last five years.

The increasing adoption of machine learning worldwide is a major factor contributing to its growing popularity. Around 69% of machine learning engineers use Python, and it has become the favourite choice for data analytics, data science, machine learning, and AI – all thanks to its vast library ecosystem that lets machine learning practitioners access, handle, transform, and process data with ease. Python wins the hearts of machine learning engineers for its platform independence, lower complexity, and better readability. "The Zen of Python", a set of aphorisms written by Tim Peters (it can be displayed by running "import this" in a Python interpreter), beautifully captures why Python is gaining popularity as the best language for machine learning.


Python is the preferred programming language for machine learning at some of the giants of the IT world, including Google, Instagram, Facebook, Dropbox, Netflix, Walt Disney, YouTube, Uber, Amazon, and Reddit. Python is an indisputable leader and by far the best language for machine learning today, and here's why:

 Extensive Collection of Libraries and Packages

Python's in-built libraries and packages provide base-level code so machine learning engineers don't have to start writing from scratch. Machine learning requires continuous data processing, and Python has libraries and packages for almost every task. This helps machine learning engineers reduce development time and improve productivity when working with complex machine learning applications. The best part of these libraries and packages is that there is practically no learning curve: once you know the basics of Python programming, you can start using them (a brief usage sketch follows the list below).

1. Working with textual data – use NLTK, scikit-learn, and NumPy
2. Working with images – use scikit-image and OpenCV
3. Working with audio – use Librosa
4. Implementing deep learning – use TensorFlow, Keras, or PyTorch
5. Implementing basic machine learning algorithms – use scikit-learn
6. Want to do scientific computing – use SciPy
7. Want to visualise the data clearly – use Matplotlib and Seaborn
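As a brief, hedged sketch of how a few of these libraries fit together (NumPy, scikit-learn and Matplotlib, all assumed to be installed), consider fitting and plotting a simple linear model on toy data:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    import matplotlib.pyplot as plt

    X = np.array([[1], [2], [3], [4], [5]])      # toy feature values
    y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])      # toy targets (roughly y = 2x)

    model = LinearRegression().fit(X, y)          # scikit-learn handles the fitting
    print(model.coef_, model.intercept_)          # learned slope and intercept

    plt.scatter(X.ravel(), y, label="data")       # Matplotlib for quick visualisation
    plt.plot(X.ravel(), model.predict(X), label="fit")
    plt.legend()
    plt.show()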

 Code Readability

The joy of coding in Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code — not in reams of trivial code that bores the reader to death. – Guido van Rossum

The math behind machine learning is usually complicated and unobvious. Code readability is therefore extremely important for successfully implementing complicated machine learning algorithms and versatile workflows. Python's simple syntax and its emphasis on code readability make it easy for machine learning engineers to focus on what to write instead of how to write it. Code readability also makes it easier for machine learning practitioners to exchange ideas, algorithms, and tools with their peers. Python is not only popular among machine learning engineers; it is also one of the most popular programming languages among data scientists.

 Flexibility

The multiparadigm and flexible nature of Python makes it easy for machine learning engineers to approach a problem in the simplest way possible. It supports procedural, functional, object-oriented, and imperative styles of programming, allowing machine learning experts to work comfortably with whichever approach fits best. The flexibility Python offers helps machine learning engineers choose a programming style based on the type of problem – sometimes it is beneficial to capture state in an object, while at other times the problem requires passing functions around as parameters. Python provides the flexibility to choose either approach and minimises the likelihood of errors. Beyond programming styles, Python also offers a lot of flexibility when implementing changes, as machine learning practitioners need not recompile the source code to see their changes.

2. R Programming Language

With more than 2 million R users, 12,000 packages in the CRAN open-source repository, close to 206 R Meetup groups, over 4,000 R programming questions asked every month, and 40K+ members in LinkedIn's R group – R is an incredible programming language for machine learning, written by a statistician for statisticians. The R language can also be used by non-programmers, including data miners, data analysts, and statisticians.

A critical part of a machine learning engineer's day-to-day job is understanding statistical principles so they can apply these principles to big data. The R programming language is a fantastic choice when it comes to crunching large numbers and is the preferred choice for machine learning applications that use a lot of statistical data. With user-friendly IDEs like RStudio and various tools to draw graphs and manage libraries, R is a must-have programming language in a machine learning engineer's toolkit. Here's what makes R one of the most effective machine learning languages for cracking business problems –

 Machine learning engineers need to train algorithms and bring in automation to make accurate predictions. The R language provides a variety of tools to train and evaluate machine learning algorithms for predicting future events, making machine learning easy and approachable. R has an exhaustive list of packages for machine learning –

1. MICE for dealing with missing values.
2. caret for working with classification and regression problems.
3. party and rpart for creating data partitions and decision trees.
4. randomForest for building random forest models.
5. dplyr and tidyr for data manipulation.
6. ggplot2 for creating beautiful visualisations.
7. R Markdown and Shiny for communicating insights through reports.

 R is an open-source programming language making it a highly cost-effective


choice for machine learning projects of any size.
 R supports the natural implementation of matrix arithmetic and other data structures like vectors, which Python does not. For a similar implementation in Python, machine learning engineers have to use the NumPy package, which is a clumsier implementation compared to R.
 R is considered a powerful choice for machine learning because of the breadth
of machine learning techniques it provides. Be it data visualisation, data
sampling, data analysis, model evaluation, supervised/unsupervised machine
learning – R has a diverse array of techniques to offer.
 The style of programming in the R language is quite easy.
 R is highly flexible and also offers cross-platform compatibility. R does not impose restrictions on how each task is performed, so machine learning practitioners can mix tools – choosing the best tool for each task while still enjoying the benefits of the other tools alongside R.

3. Java and JavaScript

Though Python and R continue to be the favourites of machine learning enthusiasts, Java is gaining popularity among machine learning engineers who come from a Java development background, as they don't need to learn a new programming language like Python or R to implement machine learning. Many organisations already have huge Java codebases, and most open-source tools for big data processing, like Hadoop and Spark, are written in Java. Using Java for machine learning projects makes it easier for machine learning engineers to integrate with existing code repositories. Features like ease of use, package services, better user interaction, easy debugging, and graphical representation of data make it a machine learning language of choice –

 Java has plenty of third-party libraries for machine learning. Java-ML, for instance, provides a collection of machine learning algorithms implemented in Java. You can also use the Arbiter Java library for hyperparameter tuning, which is an integral part of making ML algorithms run effectively, or the Deeplearning4j library, which supports popular machine learning algorithms such as k-nearest neighbors and lets you create neural networks, or Neuroph for neural networks.
 Scalability is an important feature that every machine learning engineer must
consider before beginning a project. Java makes application scaling easier for
machine learning engineers, making it a great choice for the development of
large and complex machine learning applications from scratch.
 Java Virtual Machine is one of the best platforms for machine learning as
engineers can write the same code on multiple platforms. JVM also helps
machine learning engineers create custom tools at a rapid pace and has various
IDE’s that help improve overall productivity. Java works best for speed-
critical machine learning projects as it is fast executing.

4. Julia

Julia is a high-performance, general-purpose dynamic programming language emerging as a potential competitor to Python and R, with many features designed specifically for machine learning. Although it is a general-purpose language and can be used for the development of all kinds of applications, it works best for high-performance numerical analysis and computational science. With support for all types of hardware, including TPUs and GPUs on every cloud, Julia is powering machine learning applications at big corporations like Apple, Disney, Oracle, and NASA.

Why use Julia for machine learning?

 Julia is particularly designed for implementing the basic mathematics and scientific computations that underlie most machine learning algorithms.
 Julia code is compiled just in time (at run time) using the LLVM framework. This gives machine learning engineers great speed without hand-crafted profiling or optimisation techniques, solving many performance problems.
 Julia code is universally executable: once a machine learning application is written, it can work natively with other languages like Python or R through wrappers such as PyCall or RCall.
 Scalability, as discussed, is crucial for machine learning engineers, and Julia makes it easier to deploy quickly on large clusters. With powerful tools like TensorFlow, MLBase.jl, Flux.jl, ScikitLearn.jl, and many others that utilise the scalability provided by Julia, it is an apt choice for machine learning applications.
 Julia offers support for editors like Emacs and Vim and also IDEs like Visual Studio Code and Juno.

5. LISP

Created in 1958 by John McCarthy, LISP (List Processing) is the second-oldest programming language still in use and was mainly developed for AI-centric applications. LISP is a dynamically typed programming language that has influenced the creation of many machine learning programming languages like Python, Julia, and Java. LISP works through a read-eval-print loop (REPL), which lets programmers write, compile, and run code interactively.

Lisp is a language for doing what you've been told is impossible. – Kent Pitman

LISP is considered one of the most efficient and flexible machine learning languages for solving specific problems, as it adapts to the solution the programmer is coding for. This is what makes LISP different from other machine learning languages. Today, it is particularly used for inductive logic problems and machine learning. The early AI chatbot ELIZA is often cited as having been developed in LISP, and even today machine learning practitioners can use it to create chatbots for e-commerce. LISP deserves a mention on the list of the best languages for machine learning because developers still rely on it for artificial intelligence projects that are heavy on machine learning, as LISP offers –

 Rapid prototyping capabilities


 Dynamic object creation
 Automatic garbage collection
 Flexibility
 Support for symbolic expressions

Despite being flexible for machine learning, LISP lacks the support of well-known machine learning libraries. LISP is neither a beginner-friendly machine learning language (it is difficult to learn) nor does it have a large user community like that of Python or R.

The best language for machine learning depends on the area in which it is going to be
applied, the scope of the machine learning project, which programming languages are
used in your industry/company, and several other factors. Experimentation, testing,
and experience help a machine learning practitioner decide on an optimal choice of
programming language for any given machine learning problem. Of course, the best thing would be to learn at least two programming languages for machine learning, as this will help you put your machine learning resume at the top of the stack. Once you are proficient in one machine learning language, learning another one is easy.

Machine Learning Tools

1. Microsoft Azure Machine Learning

Azure Machine Learning is a cloud platform that allows developers to build, train,
and deploy AI models. Microsoft is constantly making updates and improvements to
its machine learning tools and has recently announced changes to Azure Machine
Learning, retiring the Azure Machine Learning Workbench.

2. IBM Watson

No, IBM’s Watson Machine Learning isn’t something out of Sherlock Holmes.
Watson Machine Learning is an IBM cloud service that uses data to put machine
learning and deep learning models into production. This machine learning tool allows
users to perform training and scoring, two fundamental machine learning operations.
Keep in mind, IBM Watson is best suited for building machine learning applications
through API connections.

3. Google TensorFlow

TensorFlow, which is used for research and production at Google, is an open-source software library for dataflow programming. The bottom line: TensorFlow is a machine learning framework. This machine learning tool is relatively new to the market and is evolving quickly. TensorFlow's easy visualization of neural networks is likely the most attractive feature to developers.
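A minimal TensorFlow/Keras sketch, assuming TensorFlow is installed and using purely synthetic data, might look like this:

    import numpy as np
    import tensorflow as tf

    X = np.random.rand(100, 4).astype("float32")      # 100 samples, 4 features
    y = (X.sum(axis=1) > 2.0).astype("float32")       # synthetic binary labels

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, verbose=0)              # train the small neural network
    print(model.evaluate(X, y, verbose=0))            # [loss, accuracy]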

4. Amazon Machine Learning

It should come as no surprise that Amazon offers an impressive number of machine learning tools. According to the AWS website, Amazon Machine Learning is a managed service for building machine learning models and generating predictions. Amazon Machine Learning includes an automatic data transformation tool, simplifying the machine learning tool even further for the user. In addition, Amazon also offers other machine learning tools such as Amazon SageMaker, which is a fully managed platform that makes it easy for developers and data scientists to utilize machine learning models.

5. OpenNN

OpenNN, short for Open Neural Networks Library, is a software library that
implements neural networks. Written in C++ programming language, OpenNN offers
you the perk of downloading its entire library for free from GitHub or SourceForge.


Issues

Although machine learning is used in every industry and helps organizations make more informed, data-driven choices that are more effective than classical methodologies, it still has many problems that cannot be ignored. Here are some common issues in machine learning that professionals face while building ML skills and creating applications from scratch.

1. Inadequate Training Data

The major issue that arises when using machine learning algorithms is the lack of quality as well as quantity of data. Although data plays a vital role in machine learning, many data scientists report that inadequate, noisy, and unclean data severely hampers machine learning algorithms. For example, a simple task may require thousands of sample data points, while an advanced task such as speech or image recognition may need millions. Further, data quality is important for the algorithms to work well, yet poor data quality is common in machine learning applications. Data quality can be affected by factors such as the following:

 Noisy data – responsible for inaccurate predictions that affect decisions as well as accuracy in classification tasks.
 Incorrect data – also responsible for faulty programming and results obtained from machine learning models; hence, incorrect data may affect the accuracy of the results.
 Generalizing of output data – sometimes generalizing output data becomes complex, which results in comparatively poor future actions.

2. Poor quality of data

As we have discussed above, data plays a significant role in machine learning, and it
must be of good quality as well. Noisy data, incomplete data, inaccurate data, and
unclean data lead to less accuracy in classification and low-quality results. Hence,
data quality can also be considered as a major common problem while processing
machine learning algorithms.

3. Non-representative training data

To make sure our model generalizes well, we have to ensure that the training data is representative of the new cases we need to generalize to. The training data must cover all cases that have already occurred as well as those that are occurring.

Further, if we use non-representative training data in the model, it results in less accurate predictions. A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model, caused by a non-representative training set; its predictions will be inaccurate and biased towards one class or group.


Hence, we should use representative data in training to protect against being biased
and make accurate predictions without any drift.

4. Overfitting and Underfitting

Overfitting:

Overfitting is one of the most common issues faced by machine learning engineers and data scientists. Whenever a machine learning model is trained on a huge amount of data, it starts capturing the noise and inaccurate data present in the training data set, which negatively affects the performance of the model. Let's understand this with a simple example: suppose the training data contains 1,000 mangoes, 1,000 apples, 1,000 bananas, and 5,000 papayas. There is then a considerable probability of an apple being identified as a papaya, because we have a massive amount of biased data in the training data set; hence the predictions are negatively affected. A main reason behind overfitting is the use of non-linear methods in machine learning algorithms, as they can build unrealistic data models. Overfitting can be reduced by using linear and parametric algorithms in the machine learning models.

Methods to reduce overfitting:

 Increase training data in a dataset.
 Reduce model complexity by simplifying the model and selecting one with fewer parameters.
 Ridge regularization and lasso regularization (see the sketch after this list).
 Early stopping during the training phase.
 Reduce the noise.
 Reduce the number of attributes in training data.
 Constraining the model.
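The sketch below illustrates two of the remedies listed above, ridge and lasso regularization, using scikit-learn on synthetic data (assumed setup, purely illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 10))            # few samples, many features: overfitting risk
    y = 3 * X[:, 0] + rng.normal(scale=0.5, size=30)   # only the first feature matters

    plain = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)       # shrinks all coefficients
    lasso = Lasso(alpha=0.1).fit(X, y)       # drives irrelevant coefficients to zero

    print(np.round(plain.coef_, 2))          # spreads weight over noise features
    print(np.round(ridge.coef_, 2))          # smaller, more conservative weights
    print(np.round(lasso.coef_, 2))          # mostly zeros except the true feature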

Underfitting:

Underfitting is just the opposite of overfitting. When a machine learning model is trained with too little data, it produces incomplete and inaccurate outputs, which destroys the accuracy of the machine learning model.

Underfitting occurs when our model is too simple to capture the underlying structure of the data, just like an undersized pant. This generally happens when we have limited data in the dataset and try to build a linear model with non-linear data. In such scenarios, the model is not complex enough, its rules are too simple to be applied to the dataset, and it starts making wrong predictions.

Methods to reduce Underfitting:

 Increase model complexity.
 Remove noise from the data.
 Train on more and better features.
 Reduce the constraints.
 Increase the number of epochs to get better results.


5. Monitoring and maintenance

As we know, generalized output data is mandatory for any machine learning model; hence, regular monitoring and maintenance become compulsory. Different results for different actions require changes to the data; hence, editing the code and providing resources to monitor the model also become necessary.

6. Getting bad recommendations

A machine learning model operates in a specific context, and when that context shifts it can produce bad recommendations because of drift in the model. Let's understand with an example: at a specific time a customer is looking for some gadgets, but the customer's requirements change over time, while the machine learning model keeps showing the same recommendations even though the customer's expectations have changed. This phenomenon is called data drift. It generally occurs when new data is introduced or the interpretation of the data changes. We can overcome it by regularly updating and monitoring the data according to expectations.

7. Lack of skilled resources

Although machine learning and artificial intelligence are continuously growing in the market, these fields are still young in comparison to others. The absence of skilled resources in the form of manpower is also an issue. We need people with in-depth knowledge of mathematics, science, and technology to develop and manage the scientific substance of machine learning.

8. Customer Segmentation

Customer segmentation is also an important issue while developing a machine learning algorithm: it is difficult to identify which customers acted on the recommendations shown by the model and which did not even look at them. Hence, an algorithm is necessary to recognize customer behavior and trigger a relevant recommendation for the user based on past experience.

9. Process Complexity of Machine Learning

The machine learning process is very complex, which is another major issue faced by machine learning engineers and data scientists. Machine learning and artificial intelligence are very new technologies, still largely in an experimental phase and continuously changing over time. Much of the work proceeds by trial and error, so the probability of error is higher than expected. Further, the process also includes analyzing the data, removing data bias, training the data, and applying complex mathematical calculations, all of which make the procedure more complicated and quite tedious.

10. Data Bias

Data bias is also a big challenge in machine learning. These errors occur when certain elements of the dataset are heavily weighted or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this by determining where the dataset is actually biased and then taking the necessary steps to reduce the bias.

Methods to remove Data Bias:

 Research more for customer segmentation.
 Be aware of your general use cases and potential outliers.
 Combine inputs from multiple sources to ensure data diversity.
 Include bias testing in the development process.
 Analyze data regularly and keep tracking errors to resolve them easily.
 Review the collected and annotated data.
 Use multi-pass annotation such as sentiment analysis, content moderation, and intent recognition.

11. Lack of Explainability

This means that the outputs of a model cannot be easily understood, because the model is effectively programmed in specific ways to respond to certain conditions. This lack of explainability in machine learning algorithms reduces their credibility.

12. Slow implementations and results

This issue is also very common in machine learning models. Machine learning models can be highly effective at producing accurate results, but doing so is time-consuming. Slow programming, excessive requirements, and overloaded data mean it takes more time than expected to produce accurate results. This calls for continuous maintenance and monitoring of the model to deliver accurate results.

13. Irrelevant features

Although machine learning models are intended to give the best possible outcome, if we feed in garbage data as input, then the result will also be garbage. Hence, we should use relevant features in our training sample. A machine learning model is said to be good if the training data has a good set of features with few or no irrelevant ones.

Preparing to Model - Introduction

Getting the data right is the first step in any AI or machine learning project -- and it's
often more time-consuming and complex than crafting the machine learning
algorithms themselves. Advanced planning to help streamline and improve data
preparation in machine learning can save considerable work down the road. It can also
lead to more accurate and adaptable algorithms.

"Data preparation is the action of gathering the data you need, massaging it into a
format that's computer-readable and understandable, and asking hard questions of it to
check it for completeness and bias," said Eli Finkelshteyn, founder and CEO of
Constructor.io, which makes an AI-driven search engine for product websites.


It's tempting to focus only on the data itself, but it's a good idea to first consider the
problem you're trying to solve. That can help simplify considerations about what kind
of data to gather, how to ensure it fits the intended purpose and how to transform it
into the appropriate format for a specific type of algorithm.

Good data preparation can lead to more accurate and efficient algorithms, while
making it easier to pivot to new analytics problems, adapt when model accuracy drifts
and save data scientists and business users considerable time and effort down the line.

The importance of data preparation in machine learning

"Being a great data scientist is like being a great chef," surmised Donncha Carroll, a
partner at consultancy Axiom Consulting Partners. "To create an exceptional meal,
you must build a detailed understanding of each ingredient and think through how
they'll complement one another to produce a balanced and memorable dish. For a data
scientist, this process of discovery creates the knowledge needed to understand more
complex relationships, what matters and what doesn't, and how to tailor the data
preparation approach necessary to lay the groundwork for a great ML model."

Managers need to appreciate the ways in which data shapes machine learning
application development differently compared to customary application development.
"Unlike traditional rule-based programming, machine learning consists of two parts
that make up the final executable algorithm -- the ML algorithm itself and the data to
learn from," explained Felix Wick, corporate vice president of data science at supply
chain management platform provider Blue Yonder. "But raw data are often not ready
to be used in ML models. So, data preparation is at the heart of ML."

Data preparation consists of several steps, which consume more time than other
aspects of machine learning application development. A 2021 study by data science
platform vendor Anaconda found that data scientists spend an average of 22% of their
time on data preparation, which is more than the average time spent on other tasks
like deploying models, model training and creating data visualizations.

Although it is a time-intensive process, data scientists must pay attention to various considerations when preparing data for machine learning. Following are six key steps that are part of the process.

1. Problem formulation

Data preparation for building machine learning models is a lot more than just cleaning
and structuring data. In many cases, it's helpful to begin by stepping back from the
data to think about the underlying problem you're trying to solve. "To build a
successful ML model," Carroll advised, "you must develop a detailed understanding
of the problem to inform what you do and how you do it."

Start by spending time with the people that operate within the domain and have a
good understanding of the problem space, synthesizing what you learn through
conversations with them and using your experience to create a set of hypotheses that
describes the factors and forces involved. This simple step is often skipped or underinvested in, Carroll noted, even though it can make a significant difference in deciding what data to capture. It can also provide useful guidance on how the data should be transformed and prepared for the machine learning model.

An Axiom legal client, for example, wanted to know how different elements of
service delivery impact account retention and growth. Carroll's team collaborated with
the attorneys to develop a hypothesis that accounts served by legal professionals
experienced in their industry tend to be happier and continue as clients longer. To
provide that information as an input to a machine learning model, they looked back
over the course of each professional's career and used billing data to determine how
much time they spent serving clients in that industry.

"Ultimately," Carroll added, "it became one of the most important predictors of client
retention and something we would never have calculated without spending the time
upfront to understand what matters and how it matters."

2. Data collection and discovery

Once a data science team has formulated the machine learning problem to be solved,
it needs to inventory potential data sources within the enterprise and from external
third parties. The data collection process must consider not only what the data is
purported to represent, but also why it was collected and what it might mean,
particularly when used in a different context. It's also essential to consider factors that
may have biased the data.

"To reduce and mitigate bias in machine learning models," said Sophia Yang, a senior
data scientist at Anaconda, "data scientists need to ask themselves where and how the
data was collected to determine if there were significant biases that might have been
captured." To train a machine learning model that predicts customer behavior, for
example, look at the data and ensure the data set was collected from diverse people,
geographical areas and perspectives.

"The most important step often missed in data preparation for machine learning is
asking critical questions of data that otherwise looks technically correct,"
Finkelshteyn said. In addition to investigating bias, he recommended determining if
there's reason to believe that important missing data may lead to a partial picture of
the analysis being done. In some cases, analytics teams use data that works
technically but produces inaccurate or incomplete results, and people who use the
resulting models build on these faulty learnings without knowing something is wrong.

3. Data exploration

Data scientists need to fully understand the data they're working with early in the
process to cultivate insights into its meaning and applicability. "A common mistake is
to launch into model building without taking the time to really understand the data
you've wrangled," Carroll said.

Data exploration means reviewing such things as the type and distribution of data
contained within each variable, the relationships between variables and how they vary
relative to the outcome you're predicting or interested in achieving.


This step can highlight problems like collinearity -- variables that move together -- or
situations where standardization of data sets and other data transformations are
necessary. It can also surface opportunities to improve model performance, like
reducing the dimensionality of a data set.

Data visualizations can also help improve this process. "This might seem like an
added step that isn't needed," Yang conjectured, "but our brains are great at spotting
patterns along with data that doesn't match the pattern." Data scientists can easily see
trends and explore the data correctly by creating suitable visualizations before
drawing conclusions. Popular data visualization tools include Tableau, Microsoft
Power BI, D3.js and Python libraries such as Matplotlib, Bokeh and the HoloViz
stack.
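A minimal exploration sketch with pandas and Matplotlib might look like the following; the file name customers.csv and its columns are hypothetical placeholders:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("customers.csv")           # hypothetical data file

    print(df.dtypes)                            # type of each variable
    print(df.describe())                        # distribution summary per numeric variable
    print(df.select_dtypes("number").corr())    # pairwise relationships; highlights collinearity

    df.hist(figsize=(8, 6))                     # quick look at each variable's distribution
    plt.tight_layout()
    plt.show()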

4. Data cleansing and validation

Various data cleansing and validation techniques can help analytics teams identify
and rectify inconsistencies, outliers, anomalies, missing data and other issues. Missing
data values, for example, can often be addressed with imputation tools that fill empty
fields with statistically relevant substitutes.

But Blue Yonder's Wick cautioned that semantic meaning is an often overlooked
aspect of missing data. In many cases, creating a dedicated category for capturing the
significance of missing values can help. In others, teams may consider explicitly
setting missing values as neutral to minimize their impact on machine learning
models.
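A small sketch of both ideas, statistical imputation for a numeric gap and a dedicated "missing" category for a categorical one, using pandas and scikit-learn on a toy DataFrame:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({
        "age": [25, np.nan, 40, 33],
        "segment": ["retail", None, "wholesale", "retail"],
    })

    # Numeric gap: fill with a statistically relevant substitute (the median here)
    df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

    # Categorical gap: give "missing" its own category so its semantic meaning is preserved
    df["segment"] = df["segment"].fillna("missing")

    print(df)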

A wide range of commercial and open source tools can be used to cleanse and
validate data for machine learning and ensure good quality data. Open source
technologies such as Great Expectations and Pandera, for example, are designed to
validate the data frames commonly used to organize analytics data into two-
dimensional tables. Tools that validate code and data processing workflows are also
available. One of them is pytest, which, Yang said, data scientists can use to apply a
software development unit-test mindset and manually write tests of their workflows.
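As a hedged example of that unit-test mindset, the sketch below defines pytest-style checks on a toy data-loading step; the function and column names are hypothetical:

    # Save as test_data_quality.py and run with `pytest`.
    import pandas as pd

    def load_orders() -> pd.DataFrame:
        # Stand-in for the real data-loading step of the workflow
        return pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 15.00, 7.50]})

    def test_no_missing_order_ids():
        df = load_orders()
        assert df["order_id"].notna().all()

    def test_amounts_are_positive():
        df = load_orders()
        assert (df["amount"] > 0).all()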

5. Data structuring

Once data science teams are satisfied with their data, they need to consider the
machine learning algorithms being used. Most algorithms, for example, work better
when data is broken into categories, such as age ranges, rather than left as raw
numbers.

Two often-missed data preprocessing tricks, Wick said, are data binning and
smoothing continuous features. These data regularization methods can reduce a
machine learning model's variance by preventing it from being misled by minor
statistical fluctuations in a data set.

Binning data into different groups can be done either in an equidistant manner, with
the same "width" for each bin, or in an equi-statistical manner, with approximately the
same number of samples in each bin. Binning can also serve as a prerequisite for local
optimization of the data in each bin to help produce low-bias machine learning
models.

Smoothing continuous features can help in "denoising" raw data. It can also be used
to impose causal assumptions about the data-generating process by representing
relationships in ordered data sets as monotonic functions that preserve the order
among data elements.
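A minimal pandas sketch of both tricks on invented values: pd.cut produces equidistant bins, pd.qcut produces equi-statistical bins, and a rolling mean smooths a continuous feature to "denoise" it.

import pandas as pd

values = pd.Series([1.2, 3.4, 2.8, 7.9, 5.5, 9.1, 4.2, 6.6, 8.3, 2.1])

# Equidistant binning: every bin has the same width
equal_width = pd.cut(values, bins=3)
print(equal_width.value_counts())

# Equi-statistical binning: every bin holds roughly the same number of samples
equal_freq = pd.qcut(values, q=3)
print(equal_freq.value_counts())

# Smoothing a continuous feature with a rolling mean
smoothed = values.rolling(window=3, center=True, min_periods=1).mean()
print(smoothed)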

Other actions that data scientists often take in structuring data for machine learning
include the following (a short sketch follows this list):

• data reduction, through techniques such as attribute or record sampling and data aggregation;
• data normalization, which includes dimensionality reduction and data rescaling; and
• creating separate data sets for training and testing machine learning models.
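The sketch below illustrates those three actions with scikit-learn on randomly generated data; the array shapes, the sampling fraction and the number of principal components are arbitrary choices for the example.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))    # 200 records, 10 attributes
y = rng.integers(0, 2, size=200)  # binary labels

# Data reduction: random record sampling (keep half of the rows)
idx = rng.choice(len(X), size=len(X) // 2, replace=False)
X_sample, y_sample = X[idx], y[idx]

# Data normalization: rescaling followed by dimensionality reduction
X_scaled = StandardScaler().fit_transform(X_sample)
X_reduced = PCA(n_components=5).fit_transform(X_scaled)

# Separate data sets for training and testing machine learning models
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y_sample, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)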

6. Feature engineering and selection

The last stage in data preparation before developing a machine learning model is
feature engineering and feature selection.

Wick said feature engineering, which involves adding or creating new variables to
improve a model's output, is the main craft of data scientists and comes in various
forms. Examples include extracting the days of the week or other variables from a
data set, decomposing variables into separate features, aggregating variables and
transforming features based on probability distributions.

Data scientists also must address feature selection -- choosing relevant features to
analyze and eliminating nonrelevant ones. Many features may look promising but lead
to problems like extended model training and overfitting, which limits a model's
ability to accurately analyze new data. Methods such as lasso regression and
automatic relevance determination can help with feature selection.
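A small illustrative sketch, using an invented order_date column and synthetic features, shows one feature-engineering step (extracting the day of the week) and lasso regression shrinking the weights of irrelevant features toward zero.

import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Feature engineering: extract the day of the week from a timestamp column
orders = pd.DataFrame({"order_date": pd.to_datetime(
    ["2023-01-02", "2023-01-07", "2023-01-09", "2023-01-14"])})
orders["day_of_week"] = orders["order_date"].dt.dayofweek  # Monday = 0
print(orders)

# Feature selection: lasso drives the coefficients of irrelevant features
# toward zero, flagging them as candidates for removal
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only 2 matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)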

Machine Learning Activities

Machine Learning Technology is already a part of all of our lives. It is making
decisions both for us and about us. It is the technology behind:

• Facial recognition
• Targeted advertising
• Voice recognition
• Spam filters
• Machine translation
• Detecting credit card fraud
• Virtual personal assistants
• Self-driving cars
• … and lots more.


To fully understand the opportunities and consequences of a machine-learning-filled
future, everyone needs to be able to:

• Understand the basics of how machine learning works.
• Develop applications by training a machine learning engine.
• Use machine learning applications.
• Understand the ethical and societal issues.

What is Machine Learning?

Machine Learning is a technology that “allows computers to perform specific tasks
intelligently, by learning from examples”. Rather than crafting an algorithm to do a
job step by step, you craft an algorithm that learns how to do the job itself and then
train it on large amounts of data. It is all about spotting patterns in massive amounts
of data.

In practice, creating machine learning tools is done in several steps (a toy code
sketch follows this list).

1. First, create a machine learning engine. It is a program implementing an
algorithm for how to learn in general. (This step is for experts!)
2. Next, you train it on relevant data (e.g. images of animals). The more data it
sees, the better it gets at recognising things or making decisions (e.g.
identifying animals).
3. You package up the newly trained tool in a user interface to make it easy for
anyone to use.
4. Your users then use the new machine learning application by giving it new
data (e.g. you show it pictures of animals and it tells you what kind of animals
they are).
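The toy scikit-learn sketch below walks through steps 2 to 4 using the bundled iris flower measurements instead of animal photos; identify_flower() is just a made-up wrapper standing in for a real user interface.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Steps 1-2: a general-purpose learning algorithm, trained on labelled examples
iris = load_iris()
engine = KNeighborsClassifier(n_neighbors=3)
engine.fit(iris.data, iris.target)

# Step 3: wrap the trained model in a simple interface
def identify_flower(measurements):
    """Return the species name for one set of four measurements."""
    label = engine.predict([measurements])[0]
    return iris.target_names[label]

# Step 4: a user supplies new data and gets an answer back
print(identify_flower([5.1, 3.5, 1.4, 0.2]))  # e.g. 'setosa'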

Consider, as an example, a robot with a machine learning brain that reacts just to the
tone of voice – it doesn't understand the words. It learnt very much like a dog does: it
was 'rewarded' when it reacted in an appropriate way and 'punished' when it reacted
in an inappropriate way. Eventually it learnt to respond appropriately on its own.

Understanding how machine learning works

There are several ways to try to make a machine do tasks 'intelligently'. For example:

• Rule-based systems (writing rules explicitly)
• Neural networks (copying the way our brains learn)
• Genetic algorithms (copying the way evolution improves species to fit their environment)
• Bayesian networks (building in existing expert knowledge)


Types of data

Why is machine learning important?

Machine learning is a form of artificial intelligence (AI) that teaches computers to
think in a similar way to humans: learning and improving upon past experiences.
Almost any task that can be completed with a data-defined pattern or set of rules can
be automated with machine learning.

So, why is machine learning important? It allows companies to transform processes
that were previously only possible for humans to perform, such as responding to
customer service calls, bookkeeping, and reviewing resumes for everyday businesses.
Machine learning can also scale to handle larger problems and technical questions,
such as image detection for self-driving cars, predicting natural disaster locations and
timelines, and understanding the potential interactions of drugs with medical
conditions before clinical trials. That's why machine learning is important.

Why is data important for machine learning?

Machine learning models use algorithms to continuously improve themselves over
time, but quality data is necessary for these models to operate effectively.

What is a dataset in machine learning?

A single row of data is called an instance. Datasets are collections of instances that
all share a common set of attributes. A machine learning project will generally involve
a few different datasets, each used to fulfill a different role in the system.

For machine learning models to understand how to perform various actions, training
datasets must first be fed into the machine learning algorithm, followed by validation
datasets (or testing datasets) to ensure that the model is interpreting this data
accurately.

Once you feed these training and validation sets into the system, subsequent datasets
can then be used to sculpt your machine learning model going forward. The more data
you provide to the ML system, the faster that model can learn and improve.
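As a small illustration (using scikit-learn's bundled wine dataset rather than any particular project's data), the training portion teaches the model and the held-back validation set checks that it interprets unseen data accurately.

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Hold back a quarter of the data as a validation/testing set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)              # learn from the training dataset

print("Validation accuracy:", model.score(X_val, y_val))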

What type of data does machine learning need?

Data can come in many forms, but machine learning models rely on four primary data
types. These include numerical data, categorical data, time series data, and text data.

Numerical data

Numerical data, or quantitative data, is any form of measurable data such as your
height, weight, or the cost of your phone bill. You can determine whether a set of data
is numerical by attempting to average out the numbers or sort them in ascending or
descending order. Exact or whole numbers (e.g. 26 students in a class) are considered
discrete, while values that can fall anywhere within a given range (e.g. a 3.6 percent
interest rate) are considered continuous. While working with this type of data, keep in
mind that numerical data is not tied to any specific point in time; the values are simply
raw numbers.
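A tiny Python illustration of these points, with invented numbers:

# Numerical data can be averaged and sorted; discrete values are whole counts,
# while continuous values can fall anywhere within a range.
students_per_class = [26, 31, 24, 28]      # discrete
interest_rates = [3.6, 2.95, 4.125, 3.0]   # continuous

print(sum(students_per_class) / len(students_per_class))  # average
print(sorted(interest_rates))                             # ascending order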

Categorical data

Categorical data is sorted by defining characteristics. These can include gender, social
class, ethnicity, hometown, the industry you work in, or a variety of other labels.
While working with this data type, keep in mind that it is non-numerical, meaning you
are unable to add the values together, average them out, or sort them in any meaningful
numerical order. Categorical data is great for grouping individuals or ideas that share
similar attributes, helping your machine learning model streamline its data analysis.
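A short pandas sketch with made-up labels shows this kind of grouping, plus one common way (one-hot encoding) of turning categories into something a model can consume.

import pandas as pd

people = pd.DataFrame({
    "hometown": ["Chennai", "Madurai", "Chennai", "Salem", "Madurai"],
    "industry": ["IT", "Retail", "IT", "Healthcare", "IT"],
})

# Grouping rows that share the same label
print(people.groupby("industry").size())

# Categorical labels can't be averaged, so they are often one-hot encoded
# into indicator columns before being fed to a machine learning model
print(pd.get_dummies(people, columns=["hometown", "industry"]))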


Time series data

Time series data consists of data points that are indexed at specific points in time.
More often than not, this data is collected at consistent intervals. Learning and
utilizing time series data makes it easy to compare data from week to week, month to
month, year to year, or according to any other time-based metric you desire. The
distinct difference between time series data and numerical data is that time series
values are ordered and anchored to particular points in time, while plain numerical
data is simply a collection of numbers that isn't rooted in particular time periods.
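A brief pandas sketch with synthetic daily sales shows how a time-based index makes week-to-week and month-to-month comparisons straightforward.

import numpy as np
import pandas as pd

# Daily data points collected at consistent intervals
dates = pd.date_range(start="2023-01-01", periods=90, freq="D")
sales = pd.Series(np.random.default_rng(0).integers(80, 120, size=90),
                  index=dates)

# Because the index is time-based, comparisons across periods are easy
print(sales.resample("W").sum().head())   # week by week
print(sales.resample("MS").sum())         # month by month (month start)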

Text data

Text data is simply words, sentences, or paragraphs that can provide some level of
insight to your machine learning models. Since these words can be difficult for
models to interpret on their own, they are most often grouped together or analyzed
using various methods such as word frequency, text classification, or sentiment
analysis.
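As a minimal illustration, the snippet below counts word frequencies in a few invented product reviews; real projects would typically go further with text classification or sentiment analysis.

from collections import Counter

reviews = [
    "great phone great battery life",
    "battery drains too fast",
    "great value for the price",
]

# Word frequency: one simple way to turn raw text into numbers a model can use
counts = Counter(word for review in reviews for word in review.split())
print(counts.most_common(5))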

Where do engineers get datasets for machine learning?

There is an abundance of places where you can find datasets for machine learning,
from public repositories and open government data portals to competition platforms
and cloud providers' open data programs.


Exploring structure of data

The data structures used for machine learning are largely the same as those used in
other fields of software development. Machine learning is a subset of artificial
intelligence that relies on various complex algorithms to solve mathematical problems,
and data structures help to organize the data these algorithms work on and to reason
about such problems. Understanding data structures also helps you build ML models
and algorithms much more efficiently.

What is Data Structure?

A data structure is a basic building block of computer programming that helps us
organize, manage and store data for efficient search and retrieval.

In other words, a data structure is a collection of data values that are stored and
organized in such a way that they allow for efficient access and modification.

Types of Data Structure

A data structure imposes an ordered arrangement on data, and the data types it holds,
such as Integer, String or Boolean, tell the compiler or interpreter how a programmer
intends to use that data.

There are two different types of data structures: Linear and Non-linear data structures.

1. Linear Data structure:

A linear data structure organizes and manages data in a specific order, with elements
arranged sequentially so that each element is attached to its adjacent elements.

There are mainly four types of linear data structures: arrays, stacks, queues and linked lists.


Array:

An array is one of the most basic and common data structures used in machine
learning. It is also used in linear algebra to solve complex mathematical problems.
You will use arrays constantly in machine learning, for example:

• Converting a column of a data frame into a list format during pre-processing analysis.
• Ordering words by their frequency in a dataset.
• Using a list of tokenized words to begin clustering topics.
• Creating multi-dimensional matrices for word embeddings.

An array contains index numbers to represent an element starting from 0. The lowest
index is arr[0] and corresponds to the first element.

Let's take an example of a Python array used in machine learning. Although Python's
built-in array type is quite different from arrays in other programming languages, the
Python list is more popular because it is flexible in both the data types it can hold and
its length. If you are using Python for ML algorithms, it's a good idea to start your
journey with arrays and lists.
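A small sketch of both points, with made-up values: zero-based indexing on a plain list, and converting a data frame column into a list during pre-processing.

import pandas as pd

# Indexing starts at 0: arr[0] is the first element
arr = [4, 8, 15, 16, 23, 42]
print(arr[0])    # 4
print(arr[-1])   # 42 (last element)

# A common pre-processing step: convert a data frame column into a list
df = pd.DataFrame({"age": [23, 35, 47], "city": ["Chennai", "Salem", "Madurai"]})
ages = df["age"].tolist()
print(ages)      # [23, 35, 47]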

Python list methods:

Method      Description
append()    Adds an element at the end of the list.
clear()     Removes all elements from the list.
copy()      Returns a copy of the list.
count()     Returns the number of elements with a specified value.
extend()    Adds the elements of another list to the end of the current list.
index()     Returns the index of the first element with the specified value.
insert()    Adds an element at a specified position, given an index number.
pop()       Removes (and returns) the element at a specified position, given an index number.
remove()    Removes the first element with the specified value.
reverse()   Reverses the order of the elements in the list.
sort()      Sorts the list in place.
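The snippet below exercises several of these methods on a small made-up list; the comments note what each call does.

words = ["model", "data", "feature"]

words.append("label")           # add at the end
words.insert(1, "train")        # add at index 1
words.extend(["test", "data"])  # append the elements of another list
print(words.count("data"))      # 2 occurrences
print(words.index("feature"))   # index of the first match
words.remove("train")           # remove the first occurrence by value
last = words.pop()              # remove and return the last element
words.sort()                    # in-place ascending sort
words.reverse()                 # in-place reversal
print(words, last)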

Stacks:

Stacks are based on the concept of LIFO (Last In, First Out), also described as FILO
(First In, Last Out). They are used, for example, for binary classification in deep
learning. Although stacks are easy to learn and implement in ML models, having a
good grasp of them also helps with many other computer science tasks, such as
parsing grammars.

Stacks enable the undo and redo buttons on your computer, as they work like a stack
of items where you only ever add to or remove from the top; there is no sense in
adding an item at the bottom of the stack.
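A minimal sketch of the idea in Python, where a plain list acts as the undo stack and the action strings are invented:

# A list works as a stack: append() pushes onto the top, pop() removes the top
undo_stack = []

undo_stack.append("typed 'hello'")    # push actions as they happen
undo_stack.append("made text bold")
undo_stack.append("inserted image")

# Undo removes the most recent action first (LIFO)
print(undo_stack.pop())   # inserted image
print(undo_stack.pop())   # made text bold
print(undo_stack)         # ["typed 'hello'"]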
