MC4301 - ML Unit 1 (Introduction)
Human Learning
Learning is the process of acquiring new understanding, knowledge, behaviors, skills,
values, attitudes, and preferences.
Learning consists of complex information processing, problem-solving, decision-
making in uncertainty and the urge to transfer knowledge and skills into new,
unknown settings.
The process of learning is continuous which starts right from the time of birth of an
individual and continues till the death. We all are engaged in the learning endeavours
in order to develop our adaptive capabilities as per the requirements of the changing
environment.
John B. Watson was one of the first thinkers to demonstrate that behavioural
changes occur as a result of learning. Watson is regarded as the founder of the
behavioural school of thought, which gained prominence around the first half of
the 20th century.
Crow and Crow defined learning as the process of acquisition of knowledge, habits
and attitudes.
Types of Learning
The behavioural school of thought, founded by John B. Watson and set out in his
seminal work "Psychology as the Behaviorist Views It", stressed that psychology is
an objective science; mental processes should therefore not be emphasised, since
such processes cannot be objectively measured or observed.
Watson tried to prove his theory with his famous Little Albert Experiment,
in which he conditioned a small child to fear a white rat.
The behavioural psychology described three types of learning: Classical
Conditioning, Observational Learning and Operant Conditioning.
Classical Conditioning theory is explained by Pavlov's classic experiment, in
which food was used as the natural stimulus and was paired with a previously
neutral stimulus, in this case the sound of a bell. By establishing an association
between the natural stimulus (food) and the neutral stimulus (the sound of the
bell), the desired response can be elicited. This theory will be discussed in
detail in later sections.
Machine Learning
Machine learning is a branch of artificial intelligence; artificial intelligence
systems are used to perform complex tasks in a way that is similar to how humans
solve problems.
Machine learning is used in internet search engines, email filters to sort out spam,
websites to make personalised recommendations, banking software to detect unusual
transactions, and lots of apps on our phones such as voice recognition.
Types
As with any method, there are different ways to train machine learning algorithms,
each with their own advantages and disadvantages. To understand the pros and cons
of each type of machine learning, we must first look at what kind of data they ingest.
In ML, there are two kinds of data — labeled data and unlabeled data.
Labeled data has both the input and output parameters in a completely machine-
readable form, but a lot of human labour is required to label the data in the first place.
Unlabeled data has only one of the parameters, or none of them, in a machine-readable
form. This removes the need for human labour but requires more complex solutions.
There are also some types of machine learning algorithms that are used in very
specific use-cases, but three main methods are used today.
Supervised Learning
Supervised learning is one of the most basic types of machine learning. In this type,
the machine learning algorithm is trained on labeled data. Even though the data needs
to be labeled accurately for this method to work, supervised learning is extremely
powerful when used in the right circumstances.
The algorithm then finds relationships between the parameters given, essentially
establishing a cause and effect relationship between the variables in the dataset. At the
end of the training, the algorithm has an idea of how the data works and the
relationship between the input and the output.
This solution is then deployed for use with the final dataset, which it learns from in
the same way as the training dataset. This means that supervised machine learning
algorithms will continue to improve even after being deployed, discovering new
patterns and relationships as they train themselves on new data.
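As a rough illustration (not part of the original notes), here is a minimal supervised-learning sketch in Python using scikit-learn; the tiny labeled dataset and the choice of logistic regression are invented purely for demonstration.

```python
# Minimal supervised-learning sketch with scikit-learn; the toy dataset is invented.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labeled data: each row of X is an input (age, income in thousands),
# each entry of y is the known output label.
X = [[25, 32], [47, 88], [52, 120], [23, 30],
     [38, 75], [60, 99], [29, 41], [51, 105]]
y = [0, 1, 1, 0, 1, 1, 0, 1]   # e.g. 1 = "bought the product", 0 = "did not"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()         # learns the relationship between inputs and outputs
model.fit(X_train, y_train)          # training on the labeled examples

predictions = model.predict(X_test)  # apply the learned relationship to unseen inputs
print(accuracy_score(y_test, predictions))
```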
Unsupervised Learning
Unsupervised machine learning holds the advantage of being able to work with
unlabeled data. This means that human labor is not required to make the dataset
machine-readable, allowing much larger datasets to be worked on by the program.
In supervised learning, the labels allow the algorithm to find the exact nature of the
relationship between any two data points. Unsupervised learning, however, has no
labels to work from, so the algorithm instead uncovers hidden structures in the data.
Relationships between data points are perceived by the algorithm in an abstract
manner, with no input required from human beings.
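A minimal sketch of this idea, assuming scikit-learn's k-means implementation and an invented set of unlabeled points; the algorithm is given no labels and finds the grouping on its own.

```python
# Minimal unsupervised-learning sketch: clustering unlabeled points with k-means (toy data).
from sklearn.cluster import KMeans

# Unlabeled data: only inputs, no known outputs.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],     # one natural group
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]     # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # the algorithm discovers the hidden grouping itself

print(labels)                    # e.g. [0 0 0 1 1 1] -- group ids invented by the algorithm
print(kmeans.cluster_centers_)   # the centre of each discovered group
```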
Reinforcement Learning
Reinforcement learning directly takes inspiration from how human beings learn from
data in their lives. It features an algorithm that improves upon itself and learns from
new situations using a trial-and-error method. Favorable outputs are encouraged or
‘reinforced’, and non-favorable outputs are discouraged or ‘punished’.
If the program finds the correct solution, the interpreter reinforces it by providing
a reward to the algorithm. If the outcome is not favorable, the algorithm is forced
to reiterate until it finds a better result. In most cases, the reward system is
directly tied to the effectiveness of the result.
In typical reinforcement learning use-cases, such as finding the shortest route between
two points on a map, the solution is not an absolute value. Instead, it takes on a score
of effectiveness, expressed in a percentage value. The higher this percentage value is,
the more reward is given to the algorithm. Thus, the program is trained to give the
best possible solution for the best possible reward.
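A toy sketch of this trial-and-error loop, assuming a simple tabular Q-learning setup on an invented five-cell corridor (this scenario is not from the text; it only illustrates reward-driven learning).

```python
# Toy reinforcement-learning sketch: tabular Q-learning on an invented 5-cell corridor.
# The agent starts at cell 0 and is rewarded only for reaching cell 4.
import random

n_states, actions = 5, [-1, +1]          # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != 4:                                   # an episode ends at the goal cell
        if random.random() < epsilon:                   # trial and error: sometimes explore
            action = random.choice(actions)
        else:                                           # otherwise exploit what was learned
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), n_states - 1)
        reward = 1.0 if next_state == 4 else 0.0        # favourable outcome is 'reinforced'
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the learned policy moves right (+1) from every cell.
print([max(actions, key=lambda a: Q[(s, a)]) for s in range(4)])
```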
Limitations
We are always amazed at how machine learning has made such an impact on our lives.
There is no doubt that ML will completely change the face of various industries, as
well as job profiles. While it offers a promising future, there are some inherent
problems at the heart of ML and AI advancements that put these technologies at a
disadvantage. While ML can solve a plethora of challenges, there are a few tasks it
fails to handle.
1. Reasoning Power
One area ML has not yet mastered is reasoning power, a distinctly human trait.
Algorithms available today are mainly oriented towards specific use-cases and are
narrow in their applicability. They cannot reason about why a particular result
arises or 'introspect' their own outcomes. In other words, ML algorithms lack the
ability to reason beyond their intended application.
2. Contextual Limitation
If we consider the area of natural language processing (NLP), NLP algorithms
understand languages through text and speech information. They may learn letters,
words, sentences or even the syntax, but where they fall short is the context of the
language. Algorithms do not understand the context in which language is used. A
classic example of this is the "Chinese room" argument given by
philosopher John Searle, which says that computer programs or algorithms grasp ideas
merely through 'symbols' rather than through the context given.
So ML does not have an overall idea of the situation; it is limited to symbolic,
memorised interpretations rather than an understanding of what is actually going on.
3. Scalability
In addition, growing volumes of data have to be handled the right way when shared on
an ML platform, which again requires the kind of knowledge and intuition that current
ML apparently lacks.
ML usually needs considerable (in fact, massive) amounts of data at stages such as
training and cross-validation. Sometimes this data includes private as well as
general information, and this is where it gets complicated. Most tech companies keep
their data private, and such data is exactly what is most useful for ML applications.
But this brings the risk of data being misused, especially in critical areas such as
medical research and health insurance.
Even though data is sometimes anonymised, it can still be vulnerable. This is why
heavy regulatory rules are imposed on the use of private data.
Deep Learning (DL), a sub-field of ML, is actually responsible for today's AI growth.
What was once just a theory has turned out to be the most powerful aspect of ML, and
DL now powers applications such as voice recognition and image recognition through
artificial neural networks.
But the internal working of DL is still unknown and yet to be solved. Advanced DL
algorithms still baffle researchers in terms of their working and efficiency. The
millions of neurons that form the neural networks in DL add abstraction at every
level, which cannot be fully comprehended. This is why deep learning is dubbed a
'black box': its internal workings are unknown.
Applications
1. Social Media
Social media platforms use machine learning algorithms and approaches to create
some attractive and excellent features. For instance, Facebook notices and records
your activities, chats, likes, and comments, and the time you spend on specific kinds
of posts. Machine learning learns from this activity and makes friend and page
suggestions for your profile.
2. Product Recommendations
3. Image Recognition
4. Sentiment Analysis
Machine learning algorithms are used to develop behavior models for endangered
cetaceans and other marine species, helping scientists regulate and monitor their
populations.
9. Banking Domain
Banks now use the latest advances machine learning has to offer to help prevent
fraud and protect accounts from hackers. The algorithms determine which factors to
consider when creating a filter to keep harm at bay. Unauthentic sites are
automatically filtered out and restricted from initiating transactions.
Your model will need to be taught what you want it to learn. Feeding relevant data
back will help the machine draw patterns and act accordingly. It is imperative to
provide relevant data and feed files to help the machine learn what is expected. In
this case, with machine learning, the results you strive for depend on the contents
of the files being recorded.
Languages/Tools
1. Python
With over 8.2 million developers across the world using Python for coding, Python
ranks first in the latest annual ranking of popular programming languages by IEEE
Spectrum with a score of 100. Stack Overflow programming language trends clearly
show that it is the only language that has been consistently on the rise for the
last five years.
Python is the preferred programming language of choice for machine learning for
some of the giants in the IT world including Google, Instagram, Facebook, Dropbox,
Netflix, Walt Disney, YouTube, Uber, Amazon, and Reddit. Python is an indisputable
leader and by far the best language for machine learning today and here’s why:
Python’s in-built libraries and packages provide base-level code so machine learning
engineers don’t have to start writing from scratch. Machine learning requires
continuous data processing and Python has in-built libraries and packages for almost
every task. This helps machine learning engineers reduce development time and
improve productivity when working with complex machine learning applications. The
best part of these libraries and packages is that there is zero learning curve, once you
know the basics of Python programming, you can start using these libraries.
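For illustration, a short sketch of how a common preprocessing chore shrinks to a single library call; the scaling task and tiny dataset are invented, and scikit-learn and NumPy are assumed to be installed.

```python
# Illustrative sketch: a routine ML chore handled by Python's ecosystem in one call
# (the tiny dataset is made up).
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[170.0, 65.0], [182.0, 80.0], [160.0, 55.0]])  # height (cm), weight (kg)

# Without libraries you would hand-code the mean/std normalisation loop;
# with scikit-learn it is a single fit_transform call.
scaled = StandardScaler().fit_transform(data)
print(scaled.mean(axis=0))  # ~0 for each column
print(scaled.std(axis=0))   # ~1 for each column
```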
Code Readability
The math behind machine learning is usually complicated and unobvious. Thus, code
readability is extremely important to successfully implement complicated machine
learning algorithms and versatile workflows. Python’s simple syntax and the
importance it puts on code readability makes it easy for machine learning engineers to
focus on what to write instead of thinking about how to write. Code readability makes
it easier for machine learning practitioners to easily exchange ideas, algorithms, and
tools with their peers. Python is not only popular among machine learning engineers;
it is also one of the most popular programming languages among data scientists.
Flexibility
The multiparadigm and flexible nature of Python makes it easy for machine learning
engineers to approach a problem in the simplest way possible. It supports the
procedural, functional, object-oriented, and imperative style of programming allowing
machine learning experts to work comfortably with whichever approach fits best. The
flexibility Python offers helps machine learning engineers choose a programming style
based on the type of problem – sometimes it is beneficial to capture state in an
object, while at other times the problem might require passing functions around as
parameters. Python provides the flexibility to choose either approach and minimises
the likelihood of errors. Beyond programming styles, Python also offers a lot of
flexibility when implementing changes, as machine learning practitioners need not
recompile the source code to see their changes.
2. R Programming Language
With more than 2 million R users, 12,000 packages in the CRAN open-source
repository, close to 206 R Meetup groups, over 4,000 R programming questions asked
every month, and 40K+ members on LinkedIn's R group, R is an incredible programming
language for machine learning, written by a statistician for statisticians. The R
language can also be used by non-programmers, including data miners, data analysts,
and statisticians.
Machine learning practitioners can mix tools – choosing the best tool for each task
while also enjoying the benefits of other tools alongside R.
3. Java
Java has plenty of third-party libraries for machine learning. Java-ML is one such
machine learning library that provides a collection of machine learning algorithms
implemented in Java. You can also use the Arbiter Java library for hyperparameter
tuning, which is an integral part of making ML algorithms run effectively, or the
Deeplearning4j library, which supports popular machine learning algorithms such as
k-nearest neighbours and lets you create neural networks; Neuroph can also be used
for neural networks.
Scalability is an important feature that every machine learning engineer must
consider before beginning a project. Java makes application scaling easier for
machine learning engineers, making it a great choice for the development of
large and complex machine learning applications from scratch.
The Java Virtual Machine is one of the best platforms for machine learning, as
engineers can write the same code for multiple platforms. The JVM also helps machine
learning engineers create custom tools at a rapid pace and has various IDEs that help
improve overall productivity. Java works well for speed-critical machine learning
projects as it executes quickly.
4. Julia
5. LISP
Created in 1958 by John McCarthy, LISP (List Processing) is the second-oldest
programming language still in use and was developed mainly for AI-centric
applications. LISP is a dynamically typed programming language that has influenced
the creation of many machine learning programming languages like Python, Julia, and
Java. LISP works on a Read-Eval-Print Loop (REPL) and has the capability to code,
compile, and run code in 30+ programming languages.
LISP is considered the most efficient and flexible machine learning language for
solving specific problems, as it adapts to the solution a programmer is coding for.
This is what makes LISP different from other machine learning languages. Today, it is
particularly used for inductive logic problems and machine learning. The first AI
chatbot, ELIZA, was developed using LISP, and even today machine learning
practitioners can use it to create chatbots for eCommerce. LISP deserves a mention on
the list of best languages for machine learning because developers still rely on it
for artificial intelligence projects that are heavy on machine learning.
Despite being flexible for machine learning, LISP lacks the support of well-known
machine learning libraries. LISP is neither a beginner-friendly machine learning
language (it is difficult to learn) nor does it have a large user community like that
of Python or R.
The best language for machine learning depends on the area in which it is going to be
applied, the scope of the machine learning project, which programming languages are
used in your industry/company, and several other factors. Experimentation, testing,
and experience help a machine learning practitioner decide on an optimal choice of
programming language for any given machine learning problem. Of course, the best
thing would be to learn at least two programming languages for machine learning as
this will help you put your machine learning resume at the top of the stack. Once you
are proficient in one machine learning language, learning another one is easy.
1. Microsoft Azure Machine Learning
Azure Machine Learning is a cloud platform that allows developers to build, train,
and deploy AI models. Microsoft is constantly making updates and improvements to
its machine learning tools and has recently announced changes to Azure Machine
Learning, retiring the Azure Machine Learning Workbench.
2. IBM Watson
No, IBM’s Watson Machine Learning isn’t something out of Sherlock Holmes.
Watson Machine Learning is an IBM cloud service that uses data to put machine
learning and deep learning models into production. This machine learning tool allows
users to perform training and scoring, two fundamental machine learning operations.
Keep in mind, IBM Watson is best suited for building machine learning applications
through API connections.
3. Google TensorFlow
5. OpenNN
OpenNN, short for Open Neural Networks Library, is a software library that
implements neural networks. Written in the C++ programming language, OpenNN offers
the perk of being able to download its entire library for free from GitHub or
SourceForge.
Issues
Although machine learning is used in every industry and helps organizations make
more informed, data-driven choices that are more effective than classical
methodologies, it still has many problems that cannot be ignored. Here are some
common issues in machine learning that professionals face while building ML skills
and creating applications from scratch.
The major issue that arises when using machine learning algorithms is a lack of
quality as well as quantity of data. Although data plays a vital role in the
processing of machine learning algorithms, many data scientists report that
inadequate, noisy, and unclean data make machine learning algorithms extremely hard
to train. For example, a simple task may require thousands of sample data points,
while an advanced task such as speech or image recognition may need millions of
examples. Further, data quality is also important for the algorithms to work ideally,
but poor data quality is frequently found in machine learning applications. Data
quality can be affected by factors such as the following:
As discussed above, data plays a significant role in machine learning, and it must be
of good quality as well. Noisy, incomplete, inaccurate, and unclean data lead to
lower classification accuracy and low-quality results. Hence, poor data quality can be
considered another major problem when training machine learning algorithms.
To make sure our training model generalizes well, we have to ensure that the sample
training data is representative of the new cases we need to generalize to. The
training data must cover the cases that have already occurred as well as those that
are occurring now. If we use non-representative training data in the model, it
results in less accurate predictions. A machine learning model is considered ideal if
it predicts well for generalized cases and provides accurate decisions. If there is
too little training data, there will be sampling noise in the model, known as a
non-representative training set; it will not be accurate in its predictions and will
tend to be biased towards one class or group.
Hence, we should use representative data in training to protect against being biased
and make accurate predictions without any drift.
Overfitting:
Overfitting is one of the most common issues faced by machine learning engineers and
data scientists. Whenever a machine learning model is trained on a huge amount of
data, it starts capturing the noise and inaccuracies present in the training data
set, which negatively affects the performance of the model. Let's understand this
with a simple example: suppose our training data set contains 1,000 mangoes, 1,000
apples, 1,000 bananas, and 5,000 papayas. There is then a considerable probability
that an apple will be identified as a papaya, because we have a massive amount of
biased data in the training data set, so predictions are negatively affected. The
main reason behind overfitting is the use of non-linear methods in machine learning
algorithms, as they build unrealistic data models. We can reduce overfitting by using
linear and parametric algorithms in machine learning models.
Underfitting:
Underfitting occurs when our model is too simple to capture the underlying structure
of the data, just like an undersized pair of trousers. This generally happens when we
have limited data in the data set and try to build a linear model with non-linear
data. In such scenarios, the rules of the machine learning model are too simple for
the data set, and the model starts making wrong predictions.
Generalized output is mandatory for any machine learning model; hence, regular
monitoring and maintenance become compulsory. Different results for different actions
require changes to the data; hence editing the code, as well as allocating resources
to monitor the model, also becomes necessary.
A machine learning model operates within a specific context, and when that context
shifts it produces bad recommendations and suffers from concept drift. Let's
understand this with an example: at one point in time a customer is looking for some
gadgets, but the customer's requirements change over time, while the machine learning
model keeps showing the same recommendations even though the customer's expectations
have changed. This is called data drift. It generally occurs when new data is
introduced or the interpretation of the data changes. However, we can overcome this
by regularly updating and monitoring the data in line with expectations.
8. Customer Segmentation
The machine learning process is very complex, which is another major issue faced by
machine learning engineers and data scientists. Machine learning and artificial
intelligence are very new technologies, still in an experimental phase and
continuously changing over time. Much of the work proceeds through hit-and-trial
experiments, so the probability of error is higher than expected. Further, the
process also includes analyzing the data, removing data bias, training the data, and
applying complex mathematical calculations, which makes the procedure more
complicated and quite tedious.
Data bias is also a big challenge in machine learning. These errors occur when
certain elements of the dataset are weighted more heavily or given more importance
than others. Biased data leads to inaccurate results, skewed outcomes, and other
analytical errors. However, we can resolve this by determining where the data in the
dataset is actually biased and then taking the necessary steps to reduce it.
Slow implementation is another issue commonly seen in machine learning models.
Although machine learning models are highly efficient at producing accurate results,
they are time-consuming. Slow programs, excessive requirements, and overloaded data
take more time than expected to provide accurate results. This demands continuous
maintenance and monitoring of the model to deliver accurate results.
Although machine learning models are intended to give the best possible outcome, if
we feed garbage data as input, the result will also be garbage. Hence, we should use
relevant features in our training sample. A machine learning model is considered good
if the training data has a good set of features and few to no irrelevant ones.
Data Preparation
Getting the data right is the first step in any AI or machine learning project -- and it's
often more time-consuming and complex than crafting the machine learning
algorithms themselves. Advanced planning to help streamline and improve data
preparation in machine learning can save considerable work down the road. It can also
lead to more accurate and adaptable algorithms.
"Data preparation is the action of gathering the data you need, massaging it into a
format that's computer-readable and understandable, and asking hard questions of it to
check it for completeness and bias," said Eli Finkelshteyn, founder and CEO of
Constructor.io, which makes an AI-driven search engine for product websites.
It's tempting to focus only on the data itself, but it's a good idea to first consider the
problem you're trying to solve. That can help simplify considerations about what kind
of data to gather, how to ensure it fits the intended purpose and how to transform it
into the appropriate format for a specific type of algorithm.
Good data preparation can lead to more accurate and efficient algorithms, while
making it easier to pivot to new analytics problems, adapt when model accuracy drifts
and save data scientists and business users considerable time and effort down the line.
"Being a great data scientist is like being a great chef," surmised Donncha Carroll, a
partner at consultancy Axiom Consulting Partners. "To create an exceptional meal,
you must build a detailed understanding of each ingredient and think through how
they'll complement one another to produce a balanced and memorable dish. For a data
scientist, this process of discovery creates the knowledge needed to understand more
complex relationships, what matters and what doesn't, and how to tailor the data
preparation approach necessary to lay the groundwork for a great ML model."
Managers need to appreciate the ways in which data shapes machine learning
application development differently compared to customary application development.
"Unlike traditional rule-based programming, machine learning consists of two parts
that make up the final executable algorithm -- the ML algorithm itself and the data to
learn from," explained Felix Wick, corporate vice president of data science at supply
chain management platform provider Blue Yonder. "But raw data are often not ready
to be used in ML models. So, data preparation is at the heart of ML."
Data preparation consists of several steps, which consume more time than other
aspects of machine learning application development. A 2021 study by data science
platform vendor Anaconda found that data scientists spend an average of 22% of their
time on data preparation, which is more than the average time spent on other tasks
like deploying models, model training and creating data visualizations.
1. Problem formulation
Data preparation for building machine learning models is a lot more than just cleaning
and structuring data. In many cases, it's helpful to begin by stepping back from the
data to think about the underlying problem you're trying to solve. "To build a
successful ML model," Carroll advised, "you must develop a detailed understanding
of the problem to inform what you do and how you do it."
Start by spending time with the people that operate within the domain and have a
good understanding of the problem space, synthesizing what you learn through
conversations with them and using your experience to create a set of hypotheses that
describes the factors and forces involved. This simple step is often skipped or
underinvested in, Carroll noted, even though it can make a significant difference in
deciding what data to capture. It can also provide useful guidance on how the data
should be transformed and prepared for the machine learning model.
An Axiom legal client, for example, wanted to know how different elements of
service delivery impact account retention and growth. Carroll's team collaborated with
the attorneys to develop a hypothesis that accounts served by legal professionals
experienced in their industry tend to be happier and continue as clients longer. To
provide that information as an input to a machine learning model, they looked back
over the course of each professional's career and used billing data to determine how
much time they spent serving clients in that industry.
"Ultimately," Carroll added, "it became one of the most important predictors of client
retention and something we would never have calculated without spending the time
upfront to understand what matters and how it matters."
2. Data collection
Once a data science team has formulated the machine learning problem to be solved,
it needs to inventory potential data sources within the enterprise and from external
third parties. The data collection process must consider not only what the data is
purported to represent, but also why it was collected and what it might mean,
particularly when used in a different context. It's also essential to consider factors that
may have biased the data.
"To reduce and mitigate bias in machine learning models," said Sophia Yang, a senior
data scientist at Anaconda, "data scientists need to ask themselves where and how the
data was collected to determine if there were significant biases that might have been
captured." To train a machine learning model that predicts customer behavior, for
example, look at the data and ensure the data set was collected from diverse people,
geographical areas and perspectives.
"The most important step often missed in data preparation for machine learning is
asking critical questions of data that otherwise looks technically correct,"
Finkelshteyn said. In addition to investigating bias, he recommended determining if
there's reason to believe that important missing data may lead to a partial picture of
the analysis being done. In some cases, analytics teams use data that works
technically but produces inaccurate or incomplete results, and people who use the
resulting models build on these faulty learnings without knowing something is wrong.
3. Data exploration
Data scientists need to fully understand the data they're working with early in the
process to cultivate insights into its meaning and applicability. "A common mistake is
to launch into model building without taking the time to really understand the data
you've wrangled," Carroll said.
Data exploration means reviewing such things as the type and distribution of data
contained within each variable, the relationships between variables and how they vary
relative to the outcome you're predicting or interested in achieving.
This step can highlight problems like collinearity -- variables that move together -- or
situations where standardization of data sets and other data transformations are
necessary. It can also surface opportunities to improve model performance, like
reducing the dimensionality of a data set.
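A small sketch of such an exploration pass, assuming pandas and an invented three-column dataset; near-1.0 off-diagonal correlations would flag the kind of collinearity described above.

```python
# Sketch of a quick data-exploration pass with pandas (column names are hypothetical).
import pandas as pd

df = pd.DataFrame({
    "ad_spend":   [10, 20, 30, 40, 50],
    "page_views": [11, 19, 31, 39, 52],   # moves almost in lockstep with ad_spend
    "sales":      [ 3,  5,  6,  9, 10],
})

print(df.describe())   # type, spread and distribution of each variable
print(df.corr())       # near-1.0 off-diagonal values flag collinearity
```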
Data visualizations can also help improve this process. "This might seem like an
added step that isn't needed," Yang conjectured, "but our brains are great at spotting
patterns along with data that doesn't match the pattern." Data scientists can easily see
trends and explore the data correctly by creating suitable visualizations before
drawing conclusions. Popular data visualization tools include Tableau, Microsoft
Power BI, D3.js and Python libraries such as Matplotlib, Bokeh and the HoloViz
stack.
4. Data cleansing and validation
Various data cleansing and validation techniques can help analytics teams identify
and rectify inconsistencies, outliers, anomalies, missing data and other issues. Missing
data values, for example, can often be addressed with imputation tools that fill empty
fields with statistically relevant substitutes.
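As a hedged illustration, here is a sketch of this kind of imputation using scikit-learn's SimpleImputer on an invented array with missing entries.

```python
# Sketch of imputing missing values with scikit-learn's SimpleImputer (toy data).
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],   # missing age
              [40.0, np.nan],      # missing income
              [35.0, 58000.0]])

imputer = SimpleImputer(strategy="median")   # fill gaps with a statistically relevant substitute
X_filled = imputer.fit_transform(X)
print(X_filled)
```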
But Blue Yonder's Wick cautioned that semantic meaning is an often overlooked
aspect of missing data. In many cases, creating a dedicated category for capturing the
significance of missing values can help. In others, teams may consider explicitly
setting missing values as neutral to minimize their impact on machine learning
models.
A wide range of commercial and open source tools can be used to cleanse and
validate data for machine learning and ensure good quality data. Open source
technologies such as Great Expectations and Pandera, for example, are designed to
validate the data frames commonly used to organize analytics data into two-
dimensional tables. Tools that validate code and data processing workflows are also
available. One of them is pytest, which, Yang said, data scientists can use to apply a
software development unit-test mindset and manually write tests of their workflows.
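A sketch of that unit-test mindset using pytest; the file name, the load_orders helper and its columns are hypothetical stand-ins for a real data-loading step.

```python
# Sketch of a unit-test mindset applied to data, using pytest.
# Save as test_orders_data.py and run with:  pytest test_orders_data.py
import pandas as pd

def load_orders():
    # Stand-in for the real loading code; normally this would read from a file or database.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 20.00, 5.50]})

def test_no_missing_values():
    df = load_orders()
    assert not df.isnull().values.any()

def test_amounts_are_positive():
    df = load_orders()
    assert (df["amount"] > 0).all()

def test_order_ids_are_unique():
    df = load_orders()
    assert df["order_id"].is_unique
```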
5. Data structuring
Once data science teams are satisfied with their data, they need to consider the
machine learning algorithms being used. Most algorithms, for example, work better
when data is broken into categories, such as age ranges, rather than left as raw
numbers.
Two often-missed data preprocessing tricks, Wick said, are data binning and
smoothing continuous features. These data regularization methods can reduce a
machine learning model's variance by preventing it from being misled by minor
statistical fluctuations in a data set.
Binning data into different groups can be done either in an equidistant manner, with
the same "width" for each bin, or in an equi-statistical manner, with approximately
the same number of samples in each bin. Binning can also serve as a prerequisite for
local optimization of the data in each bin, helping to produce low-bias machine
learning models.
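A brief sketch of both binning styles, assuming pandas (pd.cut for equidistant bins, pd.qcut for equi-statistical bins) and an invented list of ages.

```python
# Sketch of the two binning styles with pandas (toy ages).
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

equal_width = pd.cut(ages, bins=4)    # equidistant: every bin has the same width
equal_freq  = pd.qcut(ages, q=4)      # equi-statistical: roughly the same count per bin

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```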
Smoothing continuous features can help in "denoising" raw data. It can also be used
to impose causal assumptions about the data-generating process by representing
relationships in ordered data sets as monotonic functions that preserve the order
among data elements.
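One simple way to smooth a continuous feature is a rolling mean; the sketch below assumes pandas and a synthetic noisy signal, and is only one of many possible smoothing choices.

```python
# Sketch of smoothing a noisy continuous feature with a rolling mean (synthetic signal).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = pd.Series(np.sin(np.linspace(0, 6, 60)) + rng.normal(0, 0.3, 60))

smoothed = raw.rolling(window=5, center=True).mean()   # denoised version of the raw feature
print(smoothed.head(10))
```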
Other actions that data scientists often take in structuring data for machine learning
include the following:
6. Feature engineering and selection
The last stage in data preparation before developing a machine learning model is
feature engineering and feature selection.
Wick said feature engineering, which involves adding or creating new variables to
improve a model's output, is the main craft of data scientists and comes in various
forms. Examples include extracting the days of the week or other variables from a
data set, decomposing variables into separate features, aggregating variables and
transforming features based on probability distributions.
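A small sketch of such feature engineering with pandas; the order_time column and the derived features are invented for illustration.

```python
# Sketch of simple feature engineering with pandas: deriving new variables from a timestamp
# (column names are made up for illustration).
import pandas as pd

df = pd.DataFrame({"order_time": pd.to_datetime(
    ["2023-01-02 09:15", "2023-01-07 18:40", "2023-01-08 11:05"])})

df["day_of_week"] = df["order_time"].dt.dayofweek      # 0 = Monday ... 6 = Sunday
df["hour"]        = df["order_time"].dt.hour           # decompose the timestamp into parts
df["is_weekend"]  = df["day_of_week"] >= 5             # aggregate into a coarser feature
print(df)
```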
Data scientists also must address feature selection -- choosing relevant features to
analyze and eliminating nonrelevant ones. Many features may look promising but lead
to problems like extended model training and overfitting, which limits a model's
ability to accurately analyze new data. Methods such as lasso regression and
automatic relevance determination can help with feature selection.
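A sketch of lasso-based feature selection with scikit-learn on synthetic data in which only two of the six candidate features actually drive the target.

```python
# Sketch of feature selection with lasso regression (scikit-learn) on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                                       # six candidate features
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)     # only features 0 and 3 matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # irrelevant features are driven to (near) zero and can be dropped
```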
Facial recognition
Targeted advertising
Voice recognition
SPAM filters
Machine translation
Detecting credit card fraud
Virtual Personal Assistants
Self-driving cars
… and lots more.
To fully understand the opportunities and consequences of the machine learning filled
future, everyone needs to be able to …
Here is an example of a robot with a machine learning brain. It reacts just to the tone
of voice – it doesn’t understand the words. It learnt very much like a dog does. It was
‘rewarded’ when it reacted in an appropriate way and was ‘punished’ when it reacted
in an inappropriate way. Eventually it learnt to behave like this.
There are several ways to try to make a machine do tasks ‘intelligently’. For example:
Genetic algorithms (copying the way evolution improves species to fit their
environment)
Bayesian Networks (building in existing expert knowledge)
Types of data
Machine learning data analysis uses algorithms to continuously improve itself over
time, but quality data is necessary for these models to operate efficiently.
A single row of data is called an instance. Datasets are a collection of instances that
all share a common attribute. Machine learning models will generally contain a few
different datasets, each used to fulfill various roles in the system.
For machine learning models to understand how to perform various actions, training
datasets must first be fed into the machine learning algorithm, followed by validation
datasets (or testing datasets) to ensure that the model is interpreting this data
accurately.
Once you feed these training and validation sets into the system, subsequent datasets
can then be used to sculpt your machine learning model going forward. The more data
you provide to the ML system, the faster that model can learn and improve.
Data can come in many forms, but machine learning models rely on four primary data
types. These include numerical data, categorical data, time series data, and text data.
Numerical data
Numerical data, or quantitative data, is any form of measurable data, such as your
height, weight, or the cost of your phone bill. You can determine whether a set of
data is numerical by attempting to average the numbers or sort them in ascending or
descending order. Exact or whole numbers (e.g. 26 students in a class) are considered
discrete, while values that fall within a range (e.g. a 3.6 percent interest rate)
are considered continuous. While working with this type of data, keep in mind that
numerical data is not tied to any specific point in time; it is simply raw numbers.
Categorical data
Categorical data is sorted by defining characteristics. This can include gender, social
class, ethnicity, hometown, the industry you work in, or a variety of other labels.
While learning this data type, keep in mind that it is non-numerical, meaning you are
unable to add them together, average them out, or sort them in any chronological
order. Categorical data is great for grouping individuals or ideas that share similar
attributes, helping your machine learning model streamline its data analysis.
Time series data
Time series data consists of data points that are indexed at specific points in time.
More often than not, this data is collected at consistent intervals. Learning and
utilizing time series data makes it easy to compare data from week to week, month to
month, year to year, or according to any other time-based metric you desire. The
distinct difference between time series data and numerical data is that time series data
has established starting and ending points, while numerical data is simply a collection
of numbers that aren’t rooted in particular time periods.
Text data
Text data is simply words, sentences, or paragraphs that can provide some level of
insight to your machine learning models. Since these words can be difficult for
models to interpret on their own, they are most often grouped together or analyzed
using various methods such as word frequency, text classification, or sentiment
analysis.
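To make the four data types concrete, here is a small sketch that places invented examples of each into one pandas DataFrame.

```python
# Sketch showing the four data types side by side in one pandas DataFrame (contents invented).
import pandas as pd

df = pd.DataFrame({
    "monthly_bill": [42.5, 61.0, 38.2],                               # numerical (continuous)
    "num_devices":  [2, 4, 1],                                        # numerical (discrete)
    "industry":     ["retail", "healthcare", "retail"],               # categorical
    "signup_date":  pd.to_datetime(["2022-01-05", "2022-03-17", "2022-06-01"]),  # time series
    "last_review":  ["great service", "too expensive", "ok overall"]  # text
})

print(df.dtypes)
print(df["monthly_bill"].mean())        # numerical data can be averaged
print(df["industry"].value_counts())    # categorical data can only be grouped and counted
```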
There is an abundance of places you can find machine learning data, but we have
compiled five of the most popular ML dataset resources to help get you started:
Data Structures
The data structures used for machine learning are quite similar to those used in
other fields of software development. Machine learning is a subset of artificial
intelligence that uses various complex algorithms to solve mathematical problems.
Data structures help in building and understanding these complex problems.
Understanding data structures also helps you build ML models and algorithms more
efficiently than other ML professionals.
The data structure is defined as the basic building block of computer programming
that helps us to organize, manage and store data for efficient search and retrieval.
In other words, the data structure is the collection of data type 'values' which are
stored and organized in such a way that it allows for efficient access and modification.
The data structure is the ordered sequence of data, and it tells the compiler how a
programmer is using the data such as Integer, String, Boolean, etc.
There are two different types of data structures: Linear and Non-linear data structures.
The linear data structure is a special type of data structure that helps to organize and
manage data in a specific order where the elements are attached adjacently.
Array:
An array is one of the most basic and common data structures used in Machine
Learning. It is also used in linear algebra to solve complex mathematical problems.
You will use arrays constantly in machine learning, whether it's:
An array contains index numbers to represent an element starting from 0. The lowest
index is arr[0] and corresponds to the first element.
Let's take the example of a Python array used in machine learning. Although the
Python array is quite different from arrays in other programming languages, the
Python list is more popular, as it offers flexibility in data types and length. If
you are using Python for ML algorithms, it is better to start your journey with
arrays.
Method Description
append() Adds an element at the end of the list.
clear() Removes all elements from the list.
copy() Returns a copy of the list.
count() Returns the number of occurrences of the specified value.
extend() Adds the elements of another list to the end of the current list.
index() Returns the index of the first element with the specified value.
insert() Adds an element at a specific position using an index number.
pop() Removes (and returns) an element from a specified position using an index number; by default, the last element.
remove() Removes the first element with the specified value.
reverse() Reverses the order of the list.
sort() Sorts the list (ascending by default).
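A quick demonstration of these list methods in plain Python (the fruit values are arbitrary).

```python
# Quick demonstration of the list methods from the table above.
fruits = ["apple", "banana", "mango"]

fruits.append("papaya")          # add to the end
fruits.insert(1, "cherry")       # add at index 1
fruits.extend(["kiwi", "apple"]) # add all elements of another list
print(fruits.count("apple"))     # 2 occurrences
print(fruits.index("mango"))     # index of the first "mango"
fruits.remove("banana")          # remove by value
last = fruits.pop()              # remove (and return) by position, default last
fruits.sort()                    # sort in place
fruits.reverse()                 # reverse in place
copy_of_fruits = fruits.copy()
fruits.clear()                   # empty the list
print(copy_of_fruits, fruits)
```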
Stacks:
Stacks are based on the concept of LIFO (Last In, First Out), also described as FILO
(First In, Last Out). Stacks are used, for example, in binary classification in deep
learning. Although stacks are easy to learn and implement in ML models, a good grasp
of them helps with many computer science tasks such as parsing grammars.
Stacks enable the undo and redo buttons on your computer, as they function like a
stack of blog posts: there is no sense in adding a new post at the bottom of the
stack.
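A minimal stack sketch using a Python list, echoing the undo/redo idea above (the recorded actions are invented).

```python
# Minimal stack (LIFO) sketch using a Python list.
undo_stack = []

undo_stack.append("typed 'hello'")     # push: the newest action goes on top
undo_stack.append("deleted a word")
undo_stack.append("pasted an image")

while undo_stack:
    print("undo:", undo_stack.pop())   # pop: the last action in is the first one undone
```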