Machine Learning in Farm Animal Behavior Using Python
Natasa Kleanthous
O&P Electronics & Robotics Ltd
Limassol, Cyprus
Abir Hussain
Department of Electrical Engineering,
University of Sharjah, Sharjah, UAE
Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of
their use. The authors and publishers have attempted to trace the copyright holders of all material
reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and
let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access
www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact
[email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
First and foremost, our gratitude goes to our families, who have been the
foundation of our strength and patience, supporting us through countless hours of
work, providing their love and encouragement.
We would also like to express our thanks to The Douglas Bomford Trust, whose
funding and support has been the foundation of our research. Special thanks to
Alan Plom and Nick August for their guidance, encouragement, and invaluable
assistance.
Our appreciation also goes to Dr. Jacob Kamminga for permitting us to incorporate
his dataset in our book. Alongside our own data, his contribution has enriched our
analysis and illustration of machine learning applications in farm animal behavior.
Our appreciation extends to the team at Science Publishers and CRC Press …
We thank our readers as well for joining us on this journey. We hope that our work
inspires, informs, and entertains you as much as it has challenged us.
Preface
In a world where technology and nature increasingly intersect, the potential for
machine learning to revolutionize various sectors has become undeniably evident.
This book's purpose is to bridge the worlds of farm animal behavior and the vast
capabilities of machine learning, using Python as our tool.
Our inspiration for writing this book originated from observing the challenges
faced by farmers and researchers in understanding animal behavior. With growing
agricultural practices, and the need for more humane livestock management, there
is an urgent requirement to combine our knowledge of animals with the predictive
and analytic powers of machine learning and deep learning.
This journey will take you through the fundamentals of animal behavior, the core
concepts of machine learning, and the usefulness of Python programming. From
sensors and data collection to machine learning and deep learning projects using
real-world data, this book aims to provide a comprehensive guide for anyone
looking to dive into this fascinating combination of both fields.
Contents
Acknowledgements
Preface
Who is This Book For?
This book is designed for:
Prerequisites and Notes
The Chapters
Code Examples
Conventions Used in This Book
Code Blocks
Tips and Insights
Warnings
Key Terminology
Figures & Diagrams
Contact Information
The Chapters
Chapter 1: Introduction to Machine Learning for Farm Animal Behavior:
This chapter begins by introducing the world of animal behavior and how machine
learning can be utilized to obtain insights from it. This chapter introduces the
types of machine learning and their applications in general giving examples in the
context of farm animals.
Chapter 2: Machine Learning Concepts and Challenges: This chapter provides
information related to key machine learning concepts, ensuring a comprehensive
understanding of the machine learning workflow.
Chapter 3: A Practical Example to Building a Simple Machine Learning
Model: In this chapter, a machine learning project is presented from scratch using
Python. This project is applied to real-world animal behavior data.
Chapter 4: Sensors, Data Collection and Annotation: An overview of the
various sensors employed in collecting data on animal behavior, data collection
practices, and data annotation is presented in this chapter.
Chapter 5: Preprocessing and Feature Extraction for Animal Behavior
Research: This chapter highlights the importance of data preprocessing, including
various techniques to clean, scale, and normalize the data. This chapter also
introduces methods to extract meaningful features from raw data, providing
hands-on Python examples.
Chapter 6: Feature Selection Techniques: This chapter explores various types
of feature selection methods, providing theoretical insights coupled with Python
implementation.
Chapter 7: Animal Research: Supervised and Unsupervised Learning
Algorithms: This chapter provides theoretical insights of supervised and
unsupervised learning methods. Python examples for the implementation of
classification, regression, and clustering techniques are presented.
Chapter 8: Evaluation, Model Selection and Hyperparameter Tuning: This
chapter introduces various evaluation metrics and techniques to enhance model
performance accompanied by practical examples using Python.
Chapter 9: Deep Learning Algorithms for Animal Activity Recognition: This
chapter explores the foundational concepts of deep learning algorithms providing
Python implementation and real-world applicability using farm animal data.
Code examples
All code examples provided in this book are written in Python, utilizing popular
libraries such as Scikit-learn and PyTorch. A GitHub repository has been set up
allowing readers to access, download, and run the code samples. This ensures a
hands-on, interactive learning experience. To access the repository, visit
https://github.com/nkcAna/WSDpython.
Code Blocks
Text in Consolas font is used to represent Python code scripts.
For example, this code defines a function to read a CSV file.

    import pandas as pd

    def read_csv_file(file_path):
        """
        Reads a CSV file and returns a DataFrame.

        Parameters:
        - file_path (str): Path to the CSV file.

        Returns:
        - DataFrame: Data from the CSV file.
        """
        return pd.read_csv(file_path)

    # Usage example:
    # df = read_csv_file('path_to_your_file.csv')

Warnings
Emphasized sections that call attention.
Key Terminology
Key terms or jargon are italicized upon their mention to draw attention and signify
their importance.
Contact Information
For queries, feedback, or further discussions, readers are encouraged to communicate
with us through our Book project on GitHub by opening an issue.
CHAPTER
1
Introduction to Machine Learning for
Farm Animal Behavior
Animal Behavior
The Essence of Animal Behavior
The study of animal behavior is a gateway to understanding the interactions,
survival strategies, and communication that unfold within the animal kingdom.
It is an expedition into interpreting how animals respond to their environment––
ranging from the lands they inhabit to the other animals they encounter. This
exploration examines patterns of feeding, mating, social dynamics, and even the
capacity of animals to learn and remember.
[Figure 1.1: Data Collection → Data Preprocessing → Train Machine Learning Algorithm → Evaluate Algorithm Performance → Apply to Real-life Problems]
Figure 1.1 illustrates the principle of machine learning in five steps. Starting with
‘Data Collection’, data is gathered from diverse sources. Moving on to ‘Data
Preprocessing’, the collected data is refined through cleaning, transformation, and
Supervised Learning
“yes” or “no”. On the other hand, in regression tasks, where the aim is to predict
a continuous value like the weight or height of an animal based on other features,
the label could be a numerical value.
The input space X constitutes all possible input values. Each dimension corresponds to a
distinct feature of the observed data. Features are distinct measurable properties
of the data. For example, a given feature x_i, where i is the feature’s index, might
stand for an animal’s age, heart rate, or its daily activity duration. An input
vector x is a specific instance within the input space X. It is a collection of values
representing multiple features of a single instance or observation. Each of its
elements, labeled x_i, denotes a particular
feature value. For instance, if x_1 stands for age, x_2 for daily activity duration, and
x_3 for average resting heart rate, a vector could be represented as x = [x_1, x_2, x_3].
Thus, the input vector provides a full description of an instance in terms of all its
features.
In machine learning, when we feed data into an algorithm, it often comes in the
form of many input vectors (i.e., many rows of data, with each row representing
one observation across multiple features).
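As a minimal sketch of this layout, the snippet below stores three input vectors as rows of a small table. The feature names and all values are made up for illustration and do not come from the book's datasets.

```python
# Hypothetical measurements for three animals; the values are made up.
feature_names = ["age_years", "daily_activity_hours", "resting_heart_rate_bpm"]

# Each row is one input vector (age, activity, heart rate);
# the rows together form the dataset passed to an algorithm.
X = [
    [2.0, 6.5, 72.0],
    [4.5, 5.0, 68.0],
    [1.0, 8.0, 80.0],
]

n_samples = len(X)        # number of observations (rows)
n_features = len(X[0])    # number of features (columns)

# A column collects one feature across all observations.
activity_column = [row[1] for row in X]

print(f"{n_samples} observations, {n_features} features")
print(dict(zip(feature_names, X[0])))
```

In practice the same structure is usually held in a pandas DataFrame or a NumPy array, with one row per observation and one column per feature.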
Classification and regression are the main categories of supervised learning tasks.
Each offers unique methods and applications adapted to different types of data
and desired outcomes.
Decision Boundary
Classification algorithms often work by establishing a decision boundary (or
boundaries) in the input space X. The decision boundary represents a surface in
the feature space that separates instances of different classes. Its complexity and
shape depend on the learning algorithm and the type of the data.
In a simple 2D space with two features, the decision boundary might literally be a
line. In higher dimensions (e.g., 3D), it might be a plane or even a more complex
shape. In general, for a linear classifier in an n-dimensional space, the decision boundary is an
(n − 1)-dimensional hyperplane.
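To make the idea concrete, here is a small sketch of a linear decision boundary in a 2D feature space. The weights `w` and bias `b` are hypothetical values chosen by hand, not learned from data; a real classifier would estimate them during training.

```python
# Hypothetical linear classifier in a 2D feature space.
# The weights and bias are illustrative values, not learned from real data.
w = [1.0, -1.0]   # one weight per feature
b = 0.5           # bias term

def decision_value(x):
    """Signed score; the decision boundary is the set of points where this is 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    """Assign class 1 on the positive side of the boundary, class 0 otherwise."""
    return 1 if decision_value(x) > 0 else 0

def boundary_x2(x1):
    """With two features, the boundary w1*x1 + w2*x2 + b = 0 is a line."""
    return -(w[0] * x1 + b) / w[1]

print(predict([2.0, 1.0]))   # lies on the positive side of the line
print(predict([0.0, 3.0]))   # lies on the negative side of the line
```

With three features the same equation describes a plane, and with n features a hyperplane of dimension n − 1.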
Figure 1.3: Training set and test instance for healthy sheep classification (labels: Healthy, Lameness; the unlabeled test instance is marked “?”).
• Collecting data from the motion sensors deployed on a pasture housing a group of cows.
• Experts observe the behaviour and label the data as 'threat' when there is presence of predators or thieves, and as 'no threat' otherwise.
• At the training phase, the algorithm learns the patterns in the data that indicate a threat.
• Assessing how accurately the algorithm distinguishes between regular behavior and potential threats by comparing its predictions to real outcomes.
• The algorithm analyzes real-time sensor data to categorize movements as safe or alarming, triggering alerts to the farmer.
Figure 1.5: Regression example: Predicting the weight gain of a pig based on food consumption (x-axis: Food Consumption; y-axis: Weight; legend: Data Points, Regression Line, New Data Point).
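A regression line like the one in Figure 1.5 can be sketched with ordinary least squares. The food and weight-gain numbers below are made up for illustration, and the helper `fit_line` is a hypothetical name rather than a function from the book's repository.

```python
# Made-up illustrative data: daily food consumption (kg) and weight gain (kg).
food = [1.0, 1.5, 2.0, 2.5, 3.0]
gain = [0.30, 0.42, 0.55, 0.63, 0.80]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(food, gain)

# Predict the continuous target for a new, unseen data point.
new_food = 2.2
predicted_gain = intercept + slope * new_food
print(f"predicted gain for {new_food} kg of food: {predicted_gain:.2f} kg")
```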
Unsupervised Learning
Unsupervised Learning (Abramson et al., 1963) is different from supervised
learning in its approach and objectives. While supervised learning depends on
labeled data to make predictions or classifications, unsupervised learning works
with unlabeled data, aiming to discover underlying structures or patterns within
the data itself.
Definition: In unsupervised learning, given an input space X, the aim is to discover
patterns, groupings, or structures in the data without any predefined labels. The
primary tasks within unsupervised learning are clustering and dimensionality
reduction.
Like supervised tasks, an input vector x in unsupervised learning represents the
features of a data instance. However, there is no associated label or target value.
Unsupervised algorithms work by measuring the similarity or difference between
data points, aiming to group similar data together or represent the data in a reduced
form while retaining maximum information.
Clustering Techniques
– K-Means Clustering (Hartigan & Wong, 1979): Partitions data into k
distinct, non-overlapping groups (clusters) according to their similarity.
– Hierarchical Clustering: Builds a tree of clusters, useful for understanding
hierarchical relationships in the data.
– DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
(Ester et al., 1996): Groups together closely packed points and marks points
in low-density regions as outliers.
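As a rough sketch of how K-Means partitions data, the snippet below implements the basic assign-and-update loop (Lloyd's algorithm) in plain Python. The data points are made up, and using the first k points as initial centroids is a simplifying assumption; library implementations such as scikit-learn's use more careful initialization.

```python
import math

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up (activity level, interaction intensity) pairs forming two groups.
data = [(1.0, 1.2), (5.0, 5.1), (0.8, 1.0), (5.2, 4.9), (1.1, 0.9), (4.9, 5.3)]
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

The cluster indices carry no meaning by themselves; interpreting what each group represents remains a task for the domain expert.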
Figure 1.6: Scatter plot of animal behaviors categorized by Activity Level and Interaction Intensity using K-Means clustering (plot title: Behavior Clustering; x-axis: Activity Level; y-axis: Interaction Intensity).
By examining the clusters, we can make informed assumptions about the nature of
these behaviors. It is crucial to understand that unsupervised learning, especially
clustering, provides an initial exploration of the data. It is up to domain experts,
in this case, animal behaviorists, to interpret the significance and implications of
these clusters.
Note that the labels assigned to clusters are empirical and based
on the context provided. In a real-world scenario, domain-specific
knowledge would be crucial in labeling and interpreting the clusters.
• Autoencoders (Y. Wang et al., 2015): These are neural network architectures
designed to compress data into a lower-dimensional form and then
reconstruct it.
Figure 1.7: Scatter plot showing animal behaviors based on the first two
principal components derived from PCA. Each point represents an individual animal’s
behavior pattern, plotted according to the two most significant
patterns of variance in the data.
Observing Figure 1.7, the x-axis might represent a pattern related to the overall
activity level of the animals (combining metrics like movement speed, heart rate,
and vocalizations), while the y-axis might capture another significant pattern,
perhaps related to the animal’s social behaviors (like duration and frequency of
interactions with other animals).
What is great about dimensionality reduction is that while the axes do not
correspond directly to any original measurement, they capture the primary modes
of variation within the dataset. So, animals that are close to each other on the
scatter plot have similar behaviors across all the metrics collected, while those far
apart have different behaviors.
• Association Rule Learning: Association rule learning (Kaur & Madan, 2015;
Telikani et al., 2020) is a technique primarily used to discover relationships
and patterns between items in large datasets. In the setting of animal behavior,
this can be applied to find patterns in sequences or combinations of behaviors
exhibited by animals under certain conditions or environments.
There are several key types of association rule learning techniques, including:
• Apriori Algorithm (Santosh Kumar & Balakrishnan, 2019): This is a widely
recognized method for mining frequent itemsets for Boolean association
rules. It progressively extends the size of the frequent itemsets by one item at
a time, checks a candidate set of itemsets for frequency, and then prunes those
candidate sets which are found infrequent.
• Eclat Algorithm (Hahsler et al., 2005): This is an alternative depth-first
technique which uses set intersection to identify frequent itemsets.
• FP-Growth (Borgelt, 2005): Differing from Apriori, this approach avoids
the step of candidate generation. It adopts a divide-and-conquer strategy
facilitated by a prefix tree representation of the database.
Imagine observing a group of animals over a period of time. Each specific
behavior they exhibit can be seen as an ‘item’. Whenever an animal exhibits a
sequence of behaviors, that sequence can be treated as a transaction. By analyzing
numerous sequences or transactions, association rule mining can help deduce
rules such as, “When an animal displays behavior A, it is likely to show behavior
B shortly after”.
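Rules of this kind are commonly scored with support (how often the behaviors occur together) and confidence (how often the consequent follows the antecedent). The sketch below computes both for a single rule; the transactions and behavior names are made up for illustration.

```python
# Made-up observation sessions; each set holds the behaviors seen together.
transactions = [
    {"increased_vocalization", "restlessness"},
    {"increased_vocalization", "restlessness", "reduced_feeding"},
    {"increased_vocalization"},
    {"restlessness", "reduced_feeding"},
    {"grazing"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Rule under test: increased_vocalization -> restlessness
rule_support = support({"increased_vocalization", "restlessness"})
rule_confidence = confidence({"increased_vocalization"}, {"restlessness"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Algorithms such as Apriori, Eclat, and FP-Growth differ mainly in how efficiently they find the frequent itemsets that these measures are computed on.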
[Figure 1.8: directed graph of association rules between cow behaviors; visible node labels: “Kicks When Milked”, “Restlessness”]
This directed graph (Figure 1.8) visualizes the association rules based on cow
behaviors. Nodes represent different behaviors, and directed edges signify an
association rule between them.
• Nodes: Represent behaviors. Antecedent behaviors (starting points of rules)
are shown in light grey, while consequent behaviors (end points of rules) are in
black.
• Edges: The grey arrows represent the directionality of the association rules,
from antecedent to consequent behavior.
The arrows show the direction of the association rule, for instance, there is an
arrow from “Increased Vocalization” to “Restlessness”, indicating the rule “If
and split values, with anomalies often isolated quicker, indicating fewer steps
signify abnormality.
Figure 1.9: Consistent running speed of a horse with a specific segment highlighting a potential anomaly in the speed (x-axis: Time; y-axis: Speed (km/h)).
Based on the graph, a possible anomaly can be detected in the horse’s running
speed between time points 65 and 75. Such cases, identified using anomaly
detection techniques, could suggest issues like health problems, environmental
disturbances, or other concerns that might impact the horse’s usual behavior.
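The sketch below shows one simple way such a segment could be flagged. It uses a robust z-score based on the median and the median absolute deviation, a deliberately simpler technique than the tree-based methods discussed in this section. The speed series is synthetic and the threshold of 3.5 is an assumed cutoff.

```python
import statistics

# Synthetic speed series (km/h): steady near 15, with a made-up dip at t = 65..75.
speeds = [15.0 + 0.2 * ((t * 7) % 5 - 2) for t in range(100)]
for t in range(65, 76):
    speeds[t] = 9.0

# The median and the median absolute deviation (MAD) are robust to the anomaly itself.
median = statistics.median(speeds)
mad = statistics.median(abs(s - median) for s in speeds)

# Flag points whose robust z-score exceeds the assumed threshold.
threshold = 3.5
anomalies = [
    t for t, s in enumerate(speeds)
    if 0.6745 * abs(s - median) / mad > threshold
]
print(anomalies)
```

A flagged segment is only a candidate: whether it reflects a health issue, a sensor fault, or a harmless pause still has to be judged in context.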
Semi-supervised Learning
Semi-supervised learning (van Engelen & Hoos, 2020) incorporates both labeled
and unlabeled data, particularly in scenarios where obtaining labeled data is
challenging or insufficient. By leveraging both types of data, semi-supervised
learning endeavors to enhance learning performance, achieving improved
accuracy and generalization compared to using either data type independently.
Definition: In semi-supervised learning, given an input space X, where some data
is labeled and some is not, the objective is to improve the performance of a model
by utilizing the information contained within the unlabeled data. This learning
paradigm often focuses on classification, regression, and sometimes clustering
tasks.
Figure 1.10: Visualization of Semi-supervised Learning with Hens.
Reinforcement Learning
Reinforcement learning (RL) (Kaelbling et al., 1996) is a specialized domain
within ML but is outside the scope of this book. However, we will briefly
introduce it in this section. Unlike supervised and unsupervised learning, which
rely on labeled data or characteristic patterns, RL revolves around agents making
sequential decisions to maximize some notion of cumulative reward in uncertain
environments. RL represents a machine learning paradigm where an agent learns
how to behave in an environment by performing certain actions and receiving
rewards or penalties in exchange. It’s akin to teaching a dog a new trick: the dog
represents the agent, the environment serves as the setting where the dog can
execute tricks, and the rewards (or penalties) are the treats (or lack thereof).
Core Concepts
• Agent: The decision maker or the learner.
• Environment: Everything that the agent interacts with.
• Action (A): What the agent can do. For example, in a game, actions might be
moving left or right, jumping, or some other activity.
• State (S): Current situation returned by the environment.
• Reward (R): Feedback from the environment. Can be immediate or delayed.
In RL, an agent takes an action according to its current state. The action influences the
environment, which returns the next state and a reward for the action taken.
The aim of the agent is to learn a policy that maximizes the expected cumulative
reward over time.
Key Components
• Policy (π): Strategy that defines the agent’s way of taking actions. It can be
deterministic or stochastic.
• Value Function: Estimation of future rewards. There are two types:
– State Value (V): Expected cumulative reward when starting from state s and
then following policy π.
– Action Value (Q): Expected cumulative reward when starting from state s,
taking action a, and then following policy π.
• Model of the Environment: This is optional. If the agent knows the model,
it can predict the next state and reward for a given action.
Deterministic:
Environment: If, for a given state s and action a, the resulting next
state s’ and reward r are always the same, the environment is said to
be deterministic. There is no uncertainty about the outcome of an
action.
Key Algorithms
• Q-learning (Sayed, 2023): Model-free, value-based method.
• Deep Q Networks (DQN) (Varga et al., 2023): Combines Q-learning with
deep neural networks.
• Policy Gradients (R. S. Sutton et al., 2000): Policy-based method.
• Proximal Policy Optimization (PPO): An optimization for policy gradient
methods.
• A3C (Asynchronous Advantage Actor-critic) (Liu et al., 2018): Combines
value-based and policy-based learning.
Q-learning
Q-learning is a value-based RL algorithm. In this method, the agent learns a
value function, or ‘Q-value’, which represents the expected future reward for
each possible action in each possible state. The agent is tasked with
learning the policy that will maximize this expected future reward.
The learning process commences with arbitrary Q-values. As the agent explores
the environment, taking actions and receiving rewards, it uses these experiences
to update the Q-values. With time, these updates should converge on the true
Q-values under the optimal policy.
An application of Q-learning in a farm setting might be to optimize the feeding
schedule of farm animals. The agent’s state could be the current time, the hunger
level of the animals, and the amount of available feed. The actions could be to
feed a certain amount of food or to wait. The reward could be a measure of the
animals’ health and productivity, with penalties for overfeeding or underfeeding.
The Q-learning algorithm would iteratively learn the best action to take in each
state to maximize the animals’ well-being and productivity, resulting in an
optimized feeding schedule.
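The feeding example can be sketched as tabular Q-learning on a toy problem. Everything in the environment below is a hand-made assumption for illustration: the state is reduced to three hunger levels, the rewards are invented, and the hyperparameters `alpha`, `gamma`, and `epsilon` are arbitrary choices.

```python
import random

random.seed(0)

N_STATES = 3          # hunger level: 0 = full, 1 = moderately hungry, 2 = very hungry
ACTIONS = [0, 1]      # 0 = wait, 1 = feed

def step(state, action):
    """Toy, hand-made dynamics: returns (next_state, reward)."""
    if action == 1:                         # feed
        if state == 0:
            return 0, -1.0                  # overfeeding penalty
        return state - 1, (1.0 if state == 1 else 0.0)
    if state == 2:                          # wait while very hungry
        return 2, -1.0                      # underfeeding penalty
    return state + 1, 0.0                   # wait, hunger grows

alpha, gamma, epsilon = 0.5, 0.9, 0.3
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # arbitrary initial Q-values

for episode in range(500):
    state = random.randrange(N_STATES)
    for t in range(20):
        if random.random() < epsilon:       # explore
            action = random.choice(ACTIONS)
        else:                               # exploit current estimates
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward = step(state, action)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        target = reward + gamma * max(Q[next_state])
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

# Greedy policy read off the learned table, listed by hunger level.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(policy)  # 0 = wait, 1 = feed
```

A realistic version would need a much richer state (time of day, feed stock, individual animals) and a reward grounded in measured health and productivity.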
Policy Gradients
Policy Gradients, on the other hand, are policy-based RL methods in which the agent directly
learns the policy, i.e., the mapping of states to actions. The policy is typically
represented as a probability distribution over actions, and the learning process
involves adjusting the parameters of this distribution to increase the expected
future reward.
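A minimal sketch of this idea is the REINFORCE update on a one-step task with a softmax policy. The two-action task and its rewards are made up, and the learning rate is an arbitrary choice; the point is only to show the policy parameters being nudged toward actions that earned reward.

```python
import math
import random

random.seed(0)

REWARDS = [0.0, 1.0]   # toy one-step task: action 1 is the rewarded one
theta = [0.0, 0.0]     # policy parameters, one preference per action

def softmax(prefs):
    """Turn preferences into a probability distribution over actions."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

learning_rate = 0.1
for _ in range(2000):
    probs = softmax(theta)
    action = random.choices([0, 1], weights=probs)[0]   # sample from the policy
    reward = REWARDS[action]
    # REINFORCE: theta += lr * reward * grad log pi(action), where for a softmax
    # policy d/d theta_k log pi(action) = 1[k == action] - pi(k).
    for k in range(2):
        grad_log = (1.0 if k == action else 0.0) - probs[k]
        theta[k] += learning_rate * reward * grad_log

final_probs = softmax(theta)
print([round(p, 3) for p in final_probs])
```

Unlike Q-learning, no value table is stored here; the policy's own parameters are what gets updated.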
Summary
This first chapter explored the concepts of animal behavior and how
machine learning can help us understand farm animal behavior. We learned about
animal behavior basics and how technology, especially machine learning, can
give us new insights. We also broke down the main types of machine learning
techniques: supervised, unsupervised, semi-supervised, and reinforcement learning,
and discussed how they can be used in studying animals.
CHAPTER
2
Foundational Concepts and
Challenges in Machine Learning
model’s predictions on new data and the true values for that data. Ideally, we
want our model’s generalization error to be low, indicating that it performs well
on new, unseen data. A model with a low test error can be considered to be well
generalized, whereas a model with a high test error is likely either overfitting or
underfitting the training data.
While training error gives us insight into how well our model has learned the
training data, it is the generalization (or test) error that truly matters, as it indicates
how our model will perform in real-world scenarios. We need to achieve a
balance, ensuring our models are neither too simple (leading to underfitting and
high bias) nor too complex (leading to overfitting and high variance), to achieve
good generalization.
Model Complexity
In machine learning, a model’s complexity often refers to its ability to fit a wide
variety of functions. A more complex model generally has a greater number
of parameters and, consequently, a higher capacity to adapt its shape to fit the
training data.
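Consider, for example, a polynomial regression model of degree n:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \varepsilon$$
where,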
• y is the predicted output.
• x is the input feature.
• β0, β1,…, βn are the model’s parameters.
• ɛ is the error term.
Now, the degree of the polynomial determines the capacity of the model:
• A low-degree polynomial (e.g., linear or 2nd-degree) might be too simple to
capture the structure of the data, leading to underfitting.
• A very high-degree polynomial might fit almost every data point perfectly,
including its noise. While this results in low training error, it can lead to high
error on unseen data, or overfitting.
• An optimal model would reach a balance, fitting the general trend of the data
without being influenced by noise.
We can visualize the effect of different polynomial degrees on a dataset:
x x x
Figure 2.1 shows three scatter plots to illustrate the concepts of underfitting,
optimal fit, and overfitting in the context of model complexity and data fitting.
In the first plot, labeled “Underfitting (Low Complexity) Degree 1”, a straight
line attempts to model the data points but fails to capture the underlying trend,
indicating that the model is too simple. The middle plot, “Optimal Fit Degree 4”,
shows a curve that fits the data points well, suggesting that the model’s complexity
is just right to capture the essential patterns without being overly simplistic or too
complex. The third plot, “Overfitting (High Complexity) Degree 15”, portrays a
highly wavy line that passes through a large number of points, which indicates a
complex model that fits the training data too closely. Such an overfitted
model may perform poorly on unseen data due to its excessive sensitivity to
noise or minor fluctuations in the training dataset.
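The effect of polynomial degree can be reproduced with a small experiment. The sketch below fits polynomials by least squares on synthetic data and reports training and test error for each degree. The generating curve, the noise level, and the degrees 1, 2, and 8 are assumptions chosen to keep the plain-Python solver numerically stable; they are not the settings behind Figure 2.1.

```python
import random

random.seed(1)

def true_trend(x):
    """Assumed ground-truth curve used to generate the synthetic data."""
    return 1.0 + 2.0 * x - 3.0 * x ** 2

def make_data(n):
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    ys = [true_trend(x) + random.gauss(0.0, 0.2) for x in xs]
    return xs, ys

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients beta_0..beta_degree via the normal equations."""
    powers = [[x ** j for j in range(degree + 1)] for x in xs]
    A = [[sum(p[i] * p[j] for p in powers) for j in range(degree + 1)]
         for i in range(degree + 1)]
    b = [sum(p[i] * y for p, y in zip(powers, ys)) for i in range(degree + 1)]
    return solve(A, b)

def mse(beta, xs, ys):
    preds = [sum(bj * x ** j for j, bj in enumerate(beta)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

x_train, y_train = make_data(12)
x_test, y_test = make_data(50)

errors = {}  # degree -> (training MSE, test MSE)
for degree in (1, 2, 8):
    beta = fit_polynomial(x_train, y_train, degree)
    errors[degree] = (mse(beta, x_train, y_train), mse(beta, x_test, y_test))
    print(degree, errors[degree])
```

Training error can only shrink as the degree grows, because each higher-degree model contains the lower-degree ones. The test error is the number to watch when judging whether added complexity actually helps.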
Variance
While bias is related to errors resulting from incorrect assumptions, variance is
associated with errors introduced by the model’s sensitivity to minor variations
in the training set. A high variance suggests that the algorithm shapes itself too
closely to the training data, capturing the noise along with the underlying pattern.
When this happens, our model might perform very well on the training data, but
it will have a harder time generalizing to new, unseen data.
In a more technical sense, variance captures how much the predictions $\hat{f}(x)$ for a
given point x would vary between different training sets.
Variance can be defined as:
$$\mathrm{Var}(\hat{f}(x)) = \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]$$
Beyond examining bias and variance individually, the relationship between the
two is foundational in machine learning. Achieving a balance between
them is essential to creating models that perform well on both the training data and new,
unseen data.
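The variance of a prediction can be estimated empirically by refitting the same model on many training sets and measuring how much its prediction at a fixed point moves. The sketch below does this for a straight-line fit; the data-generating process (y = 2x plus Gaussian noise), the sample sizes, and the query point `x0` are all assumptions made for illustration.

```python
import random
import statistics

random.seed(2)

def sample_training_set(n=20):
    """Draw a fresh synthetic training set from an assumed process y = 2x + noise."""
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]
    return xs, ys

def fit_and_predict(xs, ys, x0):
    """Fit a least-squares line and return its prediction f_hat(x0)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y + slope * (x0 - mean_x)

x0 = 0.5
predictions = [fit_and_predict(*sample_training_set(), x0) for _ in range(500)]

# Sample versions of E[f_hat(x0)] and Var(f_hat(x0)) over the training sets.
mean_prediction = statistics.fmean(predictions)
variance = statistics.fmean((p - mean_prediction) ** 2 for p in predictions)
print(f"E[f_hat] ~ {mean_prediction:.3f}, Var(f_hat) ~ {variance:.4f}")
```

Because the assumed true value at `x0` is 1.0, the gap between `mean_prediction` and 1.0 gives a matching estimate of the bias at that point.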
• Low Bias, Low Variance: The dots are closely clustered around the center.
– The model’s predictions are largely accurate and consistent across
different datasets. It means the model is well-calibrated to predict the
correct outcomes and does so consistently.
– This is an ideal scenario for a model, indicating that it generalizes well
to new data and captures the underlying patterns correctly without being
overly sensitive to small variations.
• High Bias, Low Variance: The dots are closely clustered but miss the center.
– The model’s predictions consistently miss the mark, but they do so in a
predictable manner across different datasets. It implies the model may be
too simple, missing important patterns in the data.
– Such a model is underfitting. While it is consistently off-target, its
predictions are stable. However, its simplicity means it is not capturing
the true patterns in the data, leading to systematic errors.
• Low Bias, High Variance: The dots are spread out but are aligned around the
center.
– While the model gets predictions right on average, those predictions can
vary widely depending on the specific dataset it is trained on. It is sensitive
to small changes or noise in the data.
– Such a model is overfitting to its training data. It captures the underlying
patterns and fits noise or random fluctuations in the training set. As a
result, its performance can fluctuate significantly on different datasets.
• High Bias, High Variance: The dots are spread out and miss the center.
– The model’s predictions are off-target on average and can be wildly
different depending on the specific dataset it is trained on.
– This is the least desirable scenario. The model neither captures the
underlying patterns of the data well (due to high bias) nor produces stable
predictions across different datasets (due to high variance). The model’s
poor calibration and inconsistency make it unreliable.
Figure 2.3 explains the fundamental trade-off between bias and variance as we
adjust the complexity of a machine learning model. As the model’s complexity
increases:
• Bias (shown in ––) decreases, indicating that the model becomes more
adaptable and starts fitting the training data more accurately.
• Variance (shown in – –) increases, suggesting that the model becomes more
sensitive to minor changes and noise in the training data, potentially capturing
patterns that don’t generalize well to new, unseen data.
[Figure 2.3: Bias-Variance Trade-off. x-axis: Model Complexity; y-axis: Error; curves: Bias, Variance, Generalization Error; marked point: Optimal Complexity; regions: Underfitting, Optimal, Overfitting]
What is Regularization?
Regularization is a method employed to deter overfitting by incorporating
a penalty into the model’s loss function during training. The objective is to
restrict overly complex models, which tend to overfit the dataset, by
introducing a term to the loss function that increases with complexity. In this
way, regularization attempts to keep the model as simple as possible while
still fitting the data reasonably well.
Loss:
• Refers to a scalar quantity used to evaluate the extent to which the
model’s predictions align with the true labels, typically averaged
over all instances.
• The loss function calculates the error to adjust the model’s weights
during training. The goal during training is to minimize this loss.
• The loss gives a general view of the model’s performance across all
data points.
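As a small sketch of a penalized loss, the snippet below adds an L2 penalty to the mean squared error of a one-weight model y = w · x. The L2 form is one common choice and is assumed here; the data and the penalty strengths `lam` are made up.

```python
# Made-up one-feature data for a model y = w * x (no intercept, for brevity).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.8]

def regularized_loss(w, lam):
    """Mean squared error plus an L2 penalty that grows with the weight's size."""
    mse = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + lam * w ** 2

def best_weight(lam):
    """Closed-form minimizer of regularized_loss for this one-weight model."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

for lam in (0.0, 1.0, 10.0):
    w = best_weight(lam)
    print(f"lambda={lam}: w={w:.3f}, loss={regularized_loss(w, lam):.3f}")
```

As the penalty strength grows, the learned weight shrinks toward zero: the model trades some fit to the training data for simplicity.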
Model Assumptions
• Every algorithm comes with its own set of assumptions about data. If the
data does not adhere to these assumptions, the model performance can be
compromised.
• Solution: Understand the assumptions of each algorithm and preprocess the
data to fit these assumptions, or choose an algorithm better aligned with the
data characteristics.
• Note: For instance, linear regression assumes a linear relationship between
the input and output variables. If this is not met, its predictions can be off
the mark.
Imagine a scenario where we are trying to predict the growth of plants based on
the amount of water they receive. Intuitively, one might think that the more water
a plant gets, the more it grows. But after a certain threshold, too much water can
drown the roots, causing the growth to decline. This relationship is not linear but
rather curvilinear.
If we try to fit a linear regression model to this data, it might perform poorly
because it will attempt to fit a straight line to a curve. The assumptions of linearity
are violated in this scenario. Figure 2.4 presents a plot where the true data follows
a curvilinear trend, and the linear regression model is trying to fit a straight line
to it. The linear model clearly is not capturing the degrees of the true relationship
between the amount of water and plant growth.
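The mismatch can be reproduced with a small sketch. The data below are synthetic and the quadratic shape is an assumption chosen to mimic the water and growth scenario; np.polyfit stands in for a regression model.

```python
import numpy as np

# Synthetic data (assumed): growth rises with water, then declines
water = np.linspace(0.0, 10.0, 50)
growth = -(water - 5.0) ** 2 + 25.0

def fit_mse(degree):
    """Fit a polynomial of the given degree and return its mean squared error."""
    coefficients = np.polyfit(water, growth, deg=degree)
    fitted = np.polyval(coefficients, water)
    return np.mean((fitted - growth) ** 2)

linear_mse = fit_mse(1)      # straight line
quadratic_mse = fit_mse(2)   # curve with one bend
print(f"Linear MSE: {linear_mse:.3f}, quadratic MSE: {quadratic_mse:.6f}")
```

The straight line leaves a large error on this data, while the quadratic fit, whose form matches the assumed relationship, leaves almost none.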
Hyperparameter Tuning
• Algorithms have hyperparameters that are not learned from the training
process but influence performance.
Scalability
• As data grows, algorithms may face difficulties in processing them within
reasonable time frames.
• Solution: Distributed computing, data sampling, dimensionality reduction,
online learning, and cloud-based ML platforms can help tackle the scalability
challenges.
• Note: Choosing the optimal method or a combination of methods will depend
upon the specific objectives of the study and the nature of the data collected.
Tracking animal behavior on a large scale can provide invaluable insights in
wildlife research and conservation efforts (Rast et al., 2020). With the introduction
of Internet of Things (IoT) devices and camera traps, it has become feasible to
collect large amounts of data on animal movements, behaviors, and interactions
(Marjani et al., 2017; Tran et al., 2021).
Imagine a nature reserve that has deployed thousands of camera traps and tracking
devices across extensive environments, aiming to monitor the behavior of a
particular endangered species. Over a year, these devices capture millions of hours
of footage and track data, generating terabytes of information. The reserve intends
Interpretability
• Complex models, especially deep learning ones, can be hard to interpret,
leading to a lack of trust or understanding.
As the algorithms become more complex, they often become analogous to a black
box, where the inner workings and decision-making processes are not directly
transparent to the users. While a highly complex model may offer impressive
predictive accuracy, understanding why it makes certain decisions can be
challenging. This lack of transparency, termed as the issue of interpretability, is
significant in many applications.
Interpretability is important for the user or stakeholder to trust the model, especially
in critical applications such as healthcare or finance. They must understand
how the model makes decisions in order to have confidence in its predictions.
Additionally, if the model is making inaccurate decisions, understanding the
decision-making process can help in diagnosing the underlying issues.
In certain sectors, there may be legal requirements for decisions made by
algorithms to be explainable. Beyond the legalities, ensuring that the algorithms
are not perpetuating biases or making unreasonable decisions is an ethical
obligation.
Consider a machine learning model designed to predict the health status of cattle
based on various behavioral and physiological indicators. If the model determines
a particular cow is likely to be unhealthy, the farmer or vet will want to understand
the reasons behind this prediction before taking action. Simply knowing the
model’s accuracy rate is not enough; understanding which specific behaviors
or indicators led to the prediction can guide more effective interventions and
treatments.
Potential Solutions
• Model-specific Tools: For models like decision trees or random forests,
feature importance scores can show which features most heavily influence
predictions.
• Model-agnostic Tools: Techniques like LIME (Local Interpretable Model-
agnostic Explanations) (Ribeiro et al., 2016) or SHAP (SHapley Additive
Sensitivity to Initialization
• Some algorithms, especially iterative ones, can be sensitive to initial parameter
values, affecting their convergence or final model quality.
This phenomenon is observed mostly in iterative optimization algorithms,
particularly in the context of training deep neural networks. It concerns the
influence of the initial values (starting points) of parameters on the outcome of
the optimization process.
In many machine learning algorithms, especially in neural networks, the choice of
initial weights can significantly influence the training process. The initial weights are
the starting values of these parameters before training begins. When the optimization
landscape consists of multiple local minima, saddles, or other complex structures,
different initializations can lead the optimization process towards different local
optima. The optimization landscape refers to a conceptual visualization of how the
possible solutions to an optimization problem are distributed with respect to their
performance or error. Imagine a surface where each point represents a possible set
of parameters (weights) of the model, and the height represents the error or cost
associated with those parameters. The landscape might contain valleys, peaks, and
plateaus, corresponding to low-error (optimal) and high-error regions.
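A one-dimensional sketch can make this concrete. The error surface f below is an assumption chosen only because it has two valleys; plain gradient descent is started from two different initial values.

```python
def f(w):
    """A non-convex error surface with two valleys (assumed for illustration)."""
    return w ** 4 - 3 * w ** 2 + w

def grad_f(w):
    """Derivative of f with respect to w."""
    return 4 * w ** 3 - 6 * w + 1

def gradient_descent(w_start, learning_rate=0.01, steps=2000):
    """Follow the negative gradient from w_start for a fixed number of steps."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate * grad_f(w)
    return w

w_left = gradient_descent(-2.0)   # initialization on the left side
w_right = gradient_descent(2.0)   # initialization on the right side

print(f"start=-2.0 -> w={w_left:.3f}, error={f(w_left):.3f}")
print(f"start=+2.0 -> w={w_right:.3f}, error={f(w_right):.3f}")
```

Both runs settle in a valley, but only the run started on the left reaches the deeper one, so the final error depends on the starting point.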
2 Convergence: A machine learning model converges when its training loss stabilizes, indicating that
additional training will not improve its performance.
While insufficient data causes genuine challenges in machine learning, with the
right techniques and considerations, one can still develop models that are effective
and reliable.
Noisy Data
In the real-world scenario of machine learning, especially when focusing on
specific domains like animal behavior, the data is often far from perfect. This
imperfection frequently manifests as noisy data, a type of data corrupted by
irrelevant or false information. Data that seems innocuous can significantly harm
the performance and reliability of ML models.
Various factors can lead to the generation of noisy data. Common causes include
errors during the data acquisition phase, such as malfunctioning sensors in animal
tracking devices. Outliers, which are data points that exhibit considerable variation
from the rest of the data, often add another layer of noise.
The presence of noisy data in a dataset can misdirect a machine learning model
in several ways:
• Distorted Patterns: A model trained on noisy data might misinterpret
the noise as valid patterns, leading it away from understanding the actual
underlying relationships in the data.
• Compromised Model Performance: As models try to fit noisy data, they
might end up aligning too closely to the dataset’s imperfections, making them
less effective on new data.
• Increased Model Complexity: To accommodate noise, models might
unnecessarily become more complex, making them harder to interpret.
Addressing the challenges posed by noisy data requires a mix of strategies and
methodologies:
• Data Cleaning: This primary step involves inspecting the dataset for errors
and correcting them. It might involve the removal of duplicates, addressing
outliers, or relabeling mislabeled data points.
• Noise-Resilient Algorithms: Certain algorithms, like Random Forests, have
inherent mechanisms that allow them to handle noise better than others.
• Domain Expertise: Particularly in niche domains, expertise can differentiate
between valid data points and noise. For instance, in animal behavior studies,
an understanding of the conditions under which data was collected can filter
out anomalies.
Fundamentally, while noisy data remains a persistent challenge in
machine learning, a combination of thorough data preprocessing and strategic
modeling can mitigate its adverse effects.
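A minimal data-cleaning sketch, assuming pandas and a toy table of accelerometer readings, might look as follows. The column values and the plausible range of 16 units are hypothetical and would in practice come from the sensor specification or domain expertise.

```python
import pandas as pd

# Toy accelerometer readings (assumed values); one duplicate row and one spike
readings = pd.DataFrame({
    "timestamp_ms": [0, 10, 10, 20, 30],
    "ax": [0.1, 0.2, 0.2, 95.0, 0.3],
})

# Step 1: remove exact duplicate rows
deduplicated = readings.drop_duplicates()

# Step 2: keep only readings inside a plausible range.
# The +/-16 limit is a hypothetical sensor range, not a value from the dataset.
plausible = deduplicated[deduplicated["ax"].abs() <= 16.0]

print(plausible)
```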
Imbalanced Datasets
In machine learning, we frequently encounter issues with imbalanced datasets, where
certain classes of data are significantly more common than others. For example, in
a dataset comprising animal behaviors, the examples of common behaviors like
grazing or walking might outnumber rare behaviors like fighting or mating.
The challenges presented by imbalanced datasets can influence the effectiveness
of an ML model in the following ways:
• Biased Model Predictions: A model trained on imbalanced data may
exhibit a strong bias towards the majority class. As a result, the model might
frequently misclassify instances from the minority class, simply because it
has not encountered them often enough during training.
• Compromised Model Performance Metrics: Traditional metrics, such
as accuracy, can be misleading for imbalanced datasets. A model could
achieve high accuracy by predicting the majority class, even if it consistently
misclassifies the minority class.
• Overlooking Significant Insights: In many situations, especially in animal
behavior studies, the minority class (like certain rare behaviors) might carry
significant importance. Imbalanced datasets can lead models to neglect these
critical insights.
Here are some methods and strategies that can be employed to tackle those
challenges:
• Resampling Techniques: This involves either oversampling the minority
class, undersampling the majority class, or a combination of both, to balance
out the class distribution. Techniques such as the Synthetic Minority Over-
sampling Technique (SMOTE) (Chawla et al., 2002) can be used to generate
artificial instances of the minority class.
• Cost-sensitive Training: By assigning different misclassification costs
to different classes, a model can be guided to treat each misclassification
instance based on its associated cost.
• Ensemble Methods: Techniques like bagging and boosting (C.D. Sutton,
2005) can often lead to better performance on imbalanced datasets.
It is important to note that while addressing dataset imbalance can improve model
performance, it is not always necessary to achieve a perfect balance. In some
real-world scenarios, certain classes are naturally less frequent, and achieving a
perfect balance might not be reflective of the true data distribution.
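The point about misleading accuracy can be checked with a few lines of plain Python. The label counts below are assumed toy values, and the "model" simply predicts the majority class every time.

```python
# Toy labels (assumed): 95 grazing windows and 5 fighting windows
y_true = ["grazing"] * 95 + ["fighting"] * 5
# A trivial model that always predicts the majority class
y_pred = ["grazing"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(label):
    """Fraction of true instances of `label` that were predicted correctly."""
    relevant = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(t == p for t, p in relevant) / len(relevant)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall (grazing): {recall('grazing'):.2f}")
print(f"Recall (fighting): {recall('fighting'):.2f}")
```

Accuracy looks high even though the minority class is never detected, which is why per-class metrics such as recall are worth inspecting alongside it.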
Missing Values
An issue that consistently emerges across various domains, including animal
behavior studies, is the presence of missing values in datasets. Whether due to
sensor malfunctions, human errors during data entry, or any other reasons, gaps in
data are a common occurrence.
Missing values in the dataset present distinct challenges, especially when it
comes to ensuring the reliability and accuracy of machine learning models:
• Compromised Model Integrity: A model trained on data with significant
missing values might not capture the underlying relationships effectively,
leading to inaccurate predictions.
• Reduction in Dataset Size: Simply removing rows or columns with missing
values can significantly reduce the available dataset size, limiting the amount
of information available for training the model.
As with other challenges, there are numerous techniques available to effectively
handle missing values, ensuring the resultant models remain robust:
• Imputation Techniques (Donders et al., 2006): Depending on the nature and
structure of the data, various imputation methods, such as mean, median, mode
imputation, or techniques like K-nearest Neighbors imputation, can be applied.
• Multiple Imputations (Li et al., 2015; Schafer & Olsen, 1998): Instead of
filling missing values once, multiple imputations involve creating several
datasets with different imputations and averaging the results, offering a more
robust solution to the missing data issue.
• Utilizing Model Algorithms that Handle Missing Values: Some implementations of
tree-based algorithms, such as decision trees or random forests, can handle missing values natively,
making them a good choice for datasets with such gaps.
• Using Domain Knowledge: Leveraging domain-specific knowledge can
assist in making educated guesses about missing values.
Handling missing values is crucial, as the chosen method can influence the machine
learning model’s outcomes and interpretations. As we progress through this book,
we will investigate some of these strategies, offering hands-on techniques and
considerations for dealing with missing data effectively.
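As a first taste of imputation, the sketch below fills gaps with each column's median using pandas. The column names echo the dataset used later in the book, but the values are assumed toy numbers.

```python
import numpy as np
import pandas as pd

# Toy sensor table (assumed values) with gaps marked as NaN
data = pd.DataFrame({
    "pressure": [1010.0, np.nan, 1012.0, 1050.0],
    "temp": [20.0, 21.0, np.nan, 23.0],
})

# Median imputation: fill each column's gaps with that column's median
imputed = data.fillna(data.median())
print(imputed)
```

The median is used here rather than the mean because it is less affected by extreme readings such as the 1050.0 entry; this is a design choice for the sketch, not a recommendation for every dataset.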
Non-representative Data
For the purpose of developing reliable and efficient ML models, ensuring that
the training set represents the broader population is of paramount importance.
Non-representative data arises when the dataset used to train a model does not
accurately reflect reality or does not capture the full spectrum of variations
present in real-world scenarios.
Relying on non-representative data can cause several issues:
• Skewed Predictions: If a model is trained on data that is not representative
of the larger population or context, its predictions can be biased towards the
data it has seen, resulting in inaccurate and unreliable outcomes.
Summary
In Chapter 2, we have covered the essentials of machine learning, focusing
on key concepts like how models generalize from data, the balance between
complexity and accuracy, and the common issues of overfitting and underfitting.
We have also highlighted the importance of evaluating model performance to
ensure it matches expectations closely. The challenges discussed fall into two
categories: those related to model functionality and those related to data quality
and availability. We examined the impact of the size of datasets, and problems
like noisy, incomplete, or biased data. Additionally, we walked through the steps
involved in the ML workflow, from defining the problem to deploying the solution.
CHAPTER 3
A Practical Example to Building a Simple Machine Learning Model
In this chapter, we will build a machine learning model from scratch. We will guide
you through each step of the process, providing practical examples and hands-
on experience. Upon completing this chapter, you will have a comprehensive
understanding of creating and evaluating machine learning models utilizing Python.
• Matlab Variable Files (.mat): These files contain sensor data for each animal
from the neck position in random orientations. Each animal is represented by
a unique .mat file.
Example: S1.mat, S2.mat, G1.mat, G2.mat, G3.mat, G4.mat
• CSV Files (.csv): The sensor data is also available in CSV file format for
each animal, capturing the same neck position with random orientations.
Example: S1.csv, S2.csv, G1.csv, G2.csv, G3.csv, G4.csv
Note that readers can download the data from its original source:
(https://lifesciences.datastations.nl/dataset.xhtml?persistentId=doi:10.17026/dans-zp6-fmna).
import glob
import os

import pandas as pd


def read_csv_files_from_folder(folder_path):
    """
    Reads all CSV files from a specified folder and merges them into a
    single DataFrame.

    Parameters:
    folder_path (str): The path to the folder containing CSV files.

    Returns:
    pd.DataFrame: A Pandas DataFrame containing the merged data, or
    None if no data is found.
    """
    try:
        # Create an empty list to store DataFrames
        dfs = []

        # NOTE: the statement that builds csv_files is missing from this
        # listing; the next line is an assumed reconstruction.
        csv_files = glob.glob(os.path.join(folder_path, "*.csv"))

        if not csv_files:
            print("No CSV files found in the specified folder.")
            return None

        # Assumed reconstruction: read each file and keep non-empty DataFrames
        for csv_file in csv_files:
            df = pd.read_csv(csv_file)
            if not df.empty:
                dfs.append(df)

        if not dfs:
            print("No valid data found in the CSV files.")
            return None

        # Merge all DataFrames vertically and reset the index
        merged_df = pd.concat(dfs, ignore_index=True)
        return merged_df

    except FileNotFoundError:
        print("The specified folder or CSV files were not found.")
        return None
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None
• Checking for Valid Data: After reading all CSV files, it checks if any valid
data was found. If no valid data is found (all DataFrames are empty), it prints
a message and returns None.
• Concatenating DataFrames: If valid data is found, it uses pd.concat to
concatenate all the individual DataFrames stored in the dfs list vertically,
creating a single merged DataFrame named merged_df. The ignore_index =
True argument ensures that the index is reset for the merged DataFrame.
• Returning the Merged DataFrame: Finally, the function returns the merged
DataFrame (merged_df) if data is successfully read and merged. If any errors
occur during the process, it prints appropriate error messages and returns
None.
• This function provides a convenient and robust way to read and merge
multiple CSV files from a specified folder into a single DataFrame.
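The effect of ignore_index = True can be seen on two tiny DataFrames. The frames below are assumed stand-ins for per-animal CSV files, not real data.

```python
import pandas as pd

# Two small frames standing in for per-animal CSV files (assumed contents)
sheep = pd.DataFrame({"animal_ID": ["S1", "S1"], "ax": [0.1, 0.2]})
goat = pd.DataFrame({"animal_ID": ["G1", "G1"], "ax": [0.3, 0.4]})

kept_index = pd.concat([sheep, goat])                      # original row labels are kept
reset_index = pd.concat([sheep, goat], ignore_index=True)  # rows are renumbered

print(list(kept_index.index))   # index labels repeat
print(list(reset_index.index))  # index labels run from 0 upward
```

Without ignore_index, row labels from each file repeat, which can cause surprises when rows are later selected by label.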
df = read_csv_files_from_folder("GSdata/Data")
df.head()
There are 18 features (12 are visible in Figure 3.1): ‘label’, ‘animal_ID’,
‘segment_ID’, ‘timestamp_ms’, ‘ax’, ‘ay’, ‘az’, ‘axhg’, ‘ayhg’, ‘azhg’, ‘cx’,
‘cy’, ‘cz’, ‘gx’, ‘gy’, ‘gz’, ‘pressure’, ‘temp’.
Below a function is created to inspect the dataset:
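import pandas as pd

def inspect_dataset(dataset):
    """
    Inspects a dataset and prints various statistics and
    information.
    Parameters:
    dataset (pd.DataFrame): The dataset to inspect.
    Returns:
    None
    """
    # Basic information about the dataset
    print("Dataset Information:")
    print(dataset.info())
    # NOTE: the steps between this point and the final else branch are
    # missing from this listing. Lines marked "assumed" are a
    # reconstruction based on the output described in the text.
    print("\nSummary Statistics:")                      # assumed
    print(dataset.describe())                           # assumed
    missing = dataset.columns[dataset.isnull().any()]   # assumed
    print("\nMissing Values:", list(missing))           # assumed
    print("\nData Types:")                              # assumed
    print(dataset.dtypes)                               # assumed
    if "label" in dataset.columns:                      # assumed target column
        print("\nClass Counts:")                        # assumed
        print(dataset["label"].value_counts())          # assumed
    else:
        print("\nNo target column found for class counting.")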
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13778153 entries, 0 to 13778152
This section provides information about the columns (features) in the dataset,
their names, and their respective data types.
Data columns (total 18 columns): This line indicates that the following information
refers to the dataset’s columns, and there are a total of 18 columns in the dataset.
This part of the output is a table that describes each column. It consists of three
columns:
• #: This is the column index, starting from 0 for the first column and
incrementing by 1 for each subsequent column.
• Column: This is the name of the column.
• Dtype: This is the data type of the column.
dtypes: This line provides a summary of the data types present in the dataset. It
indicates that there are: 14 columns with the data type float64, 2 columns with the
data type int64, 2 columns with the data type object.
Memory usage: This line tells us about the memory usage of the dataset. In this
case, it indicates that the dataset consumes approximately 1.8+ gigabytes (GB) of
memory.
None: This is printed because info() writes its report directly to the console and
returns None; wrapping the call in print() therefore displays None after the report.
The above output provides summary statistics for the numerical columns in the
dataset.
1. Summary Statistics: This line indicates that the following information pertains
to the summary statistics of the dataset.
2. Count: The count row under each column shows how many non-null values
exist for each numerical feature. This tells us how many data points are
available for each feature.
3. Mean: The mean row displays the mean value for each feature.
4. Std: The std row represents the standard deviation, which measures the spread
or variability of the data points around the mean.
5. Min: The “min” row shows the minimum value observed for each feature.
6. 25%: This row corresponds to the 25th percentile, indicating the value below
which 25% of the data falls.
7. 50%: The 50th percentile, also known as the median, represents the middle
value of the data.
8. 75%: The 75th percentile indicates the value below which 75% of the data falls.
9. Max: The “max” row shows the maximum value observed for each feature.
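These rows can be reproduced on a column small enough to check by hand. The five values below are assumed toy numbers; describe() is the pandas method that produces the summary.

```python
import pandas as pd

# Toy feature column (assumed values)
sample = pd.DataFrame({"ax": [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = sample.describe()
print(summary)
```

With these values the count is 5, the mean and median are both 3, and the quartiles fall on 2 and 4. Note that pandas reports the sample standard deviation (dividing by n - 1).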
The last section of the function’s output is the following:
Figure 3.5: Missing values, features, data types, and class counts information.
The above output provides information related to the missing values of the
dataset, feature names, data types for each feature, and the count of each class in
the dataset.
• Missing Values: It provides the names of the columns that have missing
values. We have identified missing values in several columns, including cx,
cy, cz, and pressure.
• Class Counts: It shows how many instances belong to each behavior class.
[Figure: three histogram panels; x-axis: Value, y-axis: Frequency]
Code Breakdown
• import matplotlib.pyplot as plt: This line imports the pyplot module from the
matplotlib library.
• selected_columns: This is a list holding the column names from the dataset
that you want to create histograms for. These columns represent numerical
features from the dataset, such as sensor readings and measurements.
• df[selected_columns].hist(bins = 50, figsize = (30, 20)):
– df[selected_columns]: This part selects the subset of the DataFrame df
that includes only the columns specified in selected_columns.
– .hist(bins = 50, figsize = (30, 20)): This part applies the .hist() method to
the selected subset of the DataFrame.
– bins = 50: This parameter specifies the number of bins or intervals into
which the data will be divided for creating the histogram. In this case, it is
set to 50, meaning that the histogram will have 50 bars.
– figsize = (30, 20): This parameter sets the size of the figure (the plot) in
inches. It determines the dimensions of the histogram plot.
• plt.show(): This line is used to display the histogram plot.
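What the bins parameter does can be sketched without drawing anything. The example below uses np.histogram rather than the .hist() method, and the normally distributed values are a synthetic stand-in for a sensor column; both choices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
values = rng.normal(loc=0.0, scale=1.0, size=1000)  # synthetic stand-in for a sensor column

# Divide the value range into 50 equal-width intervals and count the values in each
counts, bin_edges = np.histogram(values, bins=50)

print(len(counts))      # one count per bar
print(len(bin_edges))   # one more edge than bars
print(counts.sum())     # every value falls into exactly one bin
```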
The Output of df.hist()
return df_cleaned
# Call the function to remove the columns that contain missing values
df_cleaned = remove_columns_with_missing_values(df)
Function Definition
• def remove_columns_with_missing_values(df): This line begins the
definition of a function named remove_columns_with_missing_values. The
function takes one argument, df, which is expected to be a pandas DataFrame.
• Docstring
– """: The function definition contains a docstring (enclosed within triple
quotes) that describes what the function does, its parameters, and what it
returns.
that column. Finally, df.columns[...] selects the names of the columns that
have missing values.
Function Usage
After defining the function, it is called with the following line:
• df_cleaned = remove_columns_with_missing_values(df): This line applies
the function to a DataFrame df. The result is a cleaned version of df that has
had all columns with any missing values removed.
The cleaned dataframe is displayed as below:
• df_cleaned.head(): The .head() shows the first 5 rows of the df_cleaned by
default. If we want to see the first 20 rows of the dataset we can call the .head(20).
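Because only part of the function listing appears above, the following is a minimal sketch of how such a function could be written, not necessarily the exact original. The body is an assumption consistent with the description (df.columns[...] selecting columns with missing values), and the toy DataFrame is invented for the demonstration.

```python
import numpy as np
import pandas as pd

def remove_columns_with_missing_values(df):
    """Return a copy of df without any column that contains a missing value."""
    # Boolean mask over columns: True where the column has at least one NaN
    columns_with_missing = df.columns[df.isnull().any()]
    df_cleaned = df.drop(columns=columns_with_missing)
    return df_cleaned

# Toy frame (assumed values): 'cx' and 'pressure' contain gaps
toy = pd.DataFrame({
    "ax": [0.1, 0.2, 0.3],
    "cx": [1.0, np.nan, 3.0],
    "pressure": [np.nan, 1010.0, 1011.0],
    "label": ["grazing", "grazing", "walking"],
})

toy_cleaned = remove_columns_with_missing_values(toy)
print(list(toy_cleaned.columns))
```

The input DataFrame is left unchanged, since drop returns a new object.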
Class Distribution
We will begin by inspecting the distribution of animal behaviors in our dataset.
Understanding the balance or imbalance between behavior classes is essential for
model training and evaluation.
To examine the class distribution of behaviors in our dataset, we can use the
following Python code:
The code visualizes the distribution of classes in a dataset using matplotlib and
seaborn libraries.
Key steps include:
1. Importing libraries: matplotlib.pyplot as plt for plotting and seaborn as sns for
easier plot creation.
2. Calculating class frequencies: Using value_counts() on the label column of
df_cleaned to get frequencies.
3. Plotting: Setting the figure size, using sns.barplot for the bar plot, labelling
axes as ‘Behavior’ and ‘Count’, titling the plot ‘Class Distribution’, and
rotating x-axis labels for better readability.
4. Saving and displaying the plot: The plot is saved with a high resolution (300
dpi) and then displayed.
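Step 2 can be sketched on its own with pandas; the plotting steps are left out here. The short label series is an assumed stand-in for the label column of df_cleaned.

```python
import pandas as pd

# Toy label column (assumed values) standing in for df_cleaned['label']
labels = pd.Series(["standing", "grazing", "standing", "walking", "standing", "grazing"])

class_counts = labels.value_counts()                # sorted from most to least frequent
class_share = labels.value_counts(normalize=True)   # same information as proportions

print(class_counts)
print(class_share.round(2))
```

Looking at proportions as well as raw counts makes it easier to judge how severe an imbalance is.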
[Bar plot; x-axis: Behavior, y-axis: Count]
Figure 3.8: Barplot showing the distribution of animal behaviors (classes) in the dataset.
Some classes are more prevalent than others, indicating a class imbalance.
In our dataset, we observe (Figure 3.8) varying class distribution among the
different animal behaviors.
Correlation Analysis
The correlation coefficient, ranging from –1 to 1, is a statistical metric that
expresses the extent of a linear association between two quantitative variables.
Interpreting Correlation
• Positive Correlation: When the correlation coefficient is positive (closer to
+1), it indicates a positive linear relationship. This indicates a tendency for
one variable to increase in conjunction with an increase in the other variable.
• Negative Correlation: A negative correlation coefficient (closer to –1)
indicates a negative linear relationship. As one feature increases, the other
tends to decrease.
• No Correlation: A correlation coefficient near 0 suggests no linear relationship
between the features. Changes in one feature are not linearly associated with
changes in the other.
Limitations of Correlation
While correlation is a valuable tool for identifying linear relationships, it has
some limitations:
• Linearity Assumption: Correlation measures only linear relationships. It
may not capture non-linear associations between variables. For instance, two
variables might have a strong quadratic relationship that correlation cannot
detect.
• Outliers: Outliers influence correlation, where even a single outlier can
significantly impact the correlation coefficient, potentially leading to
misleading results.
• Other Factors: Correlation does not account for other factors that might
influence the relationship between variables. It cannot establish causation,
and false correlations can occur when a third variable influences both.
• Limited to Numerical Data: Correlation works only for numerical data. It
cannot quantify relationships between categorical variables.
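The linearity limitation is easy to demonstrate. In the sketch below, which uses assumed synthetic values, one variable is fully determined by the other through a quadratic relationship, yet the Pearson coefficient computed with np.corrcoef is essentially zero.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 61)   # symmetric around zero
y_quadratic = x ** 2             # fully determined by x, but not linearly
y_linear = 2.0 * x + 1.0         # exact linear relationship, for comparison

r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
r_linear = np.corrcoef(x, y_linear)[0, 1]

print(f"quadratic: {abs(r_quadratic):.3f}")
print(f"linear:    {r_linear:.3f}")
```

A near-zero coefficient therefore rules out a linear association only; plotting the variables remains a useful complementary check.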
import pandas as pd
import numpy as np
Output:
ax ay az gx gy gz
ax 1.000000 0.113400 0.243339 0.038723 -0.012482 -0.027463
ay 0.113400 1.000000 -0.132366 0.117453 -0.043101 -0.018925
az 0.243339 -0.132366 1.000000 0.025681 -0.024909 -0.033320
gx 0.038723 0.117453 0.025681 1.000000 -0.065109 -0.035213
gy -0.012482 -0.043101 -0.024909 -0.065109 1.000000 0.076284
gz -0.027463 -0.018925 -0.033320 -0.035213 0.076284 1.000000
• import numpy as np: This line imports the numpy library and aliases it as np.
• numerical_features = ['ax', 'ay', 'az', 'gx', 'gy', 'gz']: This line creates a list
of strings where each string represents the name of a numerical feature in the
dataset.
• correlation_matrix = df_cleaned[numerical_features].corr():
– df_cleaned[numerical_features] is indexing into a pandas DataFrame
called df_cleaned, selecting only the columns listed in numerical_features.
– .corr() is a pandas DataFrame method that calculates the correlation
coefficients between the columns in the DataFrame.
• correlation_matrix: This line displays the resulting correlation matrix (pandas
DataFrame).
• sns.heatmap(correlation_matrix, annot = True, fmt = ".2f", cmap =
'coolwarm', cbar = True): This code creates a heatmap using Seaborn
to visualize correlation_matrix, with annotations
displaying the correlation values formatted to two decimal places, using a
'coolwarm' color map, and including a color bar for reference (Figure 3.9).
• The output is a correlation matrix displaying the linear relationship between
pairs of variables (ax, ay, az, gx, gy, gz). Each cell shows the correlation
coefficient for a pair, ranging from –1 to 1. Diagonal cells, which compare
each variable with itself, are all 1, indicating a perfect positive correlation.
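import matplotlib.pyplot as plt
import seaborn as sns

# Assumed reconstruction: the loop header is missing from this listing
for feature in numerical_features:
    plt.subplot(3, 2, numerical_features.index(feature) + 1)
    sns.boxplot(x=df[feature])
    plt.xlabel(feature)
plt.tight_layout()
plt.show()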
The code uses Seaborn’s boxplot function. This function calculates the quartiles
and plots the box and whiskers, accordingly, showing any outliers as individual
points. Figure 3.10 is the result from the code and illustrates the distribution of
six numerical features (ax, ay, az, gx, gy, and gz) using boxplots. The boxplots for
‘ax’, ‘ay’, and ‘az’ suggest a central clustering of data points with a few outliers,
whereas the ‘gx’, ‘gy’, and ‘gz’ boxplots reveal a more dispersed distribution with
numerous outliers.
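The quantities behind a boxplot can be computed directly. The sketch below uses NumPy on assumed toy values and applies the common rule that points beyond 1.5 times the interquartile range (IQR) from the box are drawn as outliers; this whisker length is assumed to be the plotting default.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])  # toy data (assumed)

# Quartiles define the box; the IQR defines how far the whiskers may reach
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones shown as individual markers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers)
```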
[Figure 3.10: boxplots of the numerical features; recovered panel labels: az, gx, gy, gz]
Now, we proceed with calculating the magnitude and extracting the desired
features for each window of data. We will create a function extract_features_
from_window to perform these operations. Before that, we will create two
functions; one to drop the columns that we do not need for the project, and the
second function that calculates the magnitude.
import numpy as np


def drop_unneeded_columns(df):
    """
    Drop columns that are not needed for feature extraction.

    Parameters:
    df (pd.DataFrame): DataFrame containing accelerometer data.

    Returns:
    pd.DataFrame: DataFrame with unnecessary columns dropped.
    """
    # List of columns to keep (accelerometer columns)
    # Include 'label' column to keep classes
    # NOTE: the list itself is missing from this listing; the two lines
    # below are an assumed reconstruction.
    columns_to_keep = ['ax', 'ay', 'az', 'label']
    df = df[columns_to_keep].copy()
    return df


def calculate_magnitude(df):
    """
    Calculate the magnitude of accelerometer readings and add it
    as a new column.

    Parameters:
    df (pd.DataFrame): DataFrame containing accelerometer data.

    Returns:
    pd.DataFrame: DataFrame with magnitude column added.
    """
    # Calculate magnitude
    df['magnitude'] = np.sqrt(df['ax']**2 + df['ay']**2 + df['az']**2)
    return df
# Output
Window size: 1000 samples
Overlap size: 500 samples
Now, let’s create the function to extract the desired features using the sliding
window.
features.append(window_features)
labels.append(majority_label)
extracted_features = extract_features_with_windows(df_mag,
window_size, overlap_percent)
• We store the extracted features and the majority label in the features and
labels lists, respectively, for each window.
• We then create pandas DataFrames for the extracted features and labels.
• Finally, the function returns a pandas DataFrame containing the extracted
features for each window, along with their corresponding majority labels.
• The extract_features_with_windows function processes the DataFrame df_mag
using the specified window size and overlap percentage, and its result is
assigned to the variable extracted_features.
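Since only fragments of the windowing function appear above, the following is a simplified sketch of the idea rather than the original implementation. The helper name sliding_window_features, the two features it computes, and the interpretation of overlap_percent as a value between 0 and 100 are all assumptions.

```python
from collections import Counter

import numpy as np
import pandas as pd

def sliding_window_features(df, window_size, overlap_percent):
    """Hypothetical helper: mean/std of 'magnitude' and majority label per window."""
    # The step between window starts shrinks as the overlap grows
    step = int(window_size * (1 - overlap_percent / 100))
    features, labels = [], []
    for start in range(0, len(df) - window_size + 1, step):
        window = df.iloc[start:start + window_size]
        window_features = {
            "magnitude_mean": window["magnitude"].mean(),
            "magnitude_std": window["magnitude"].std(),
        }
        # The most frequent label inside the window becomes the window's label
        majority_label = Counter(window["label"]).most_common(1)[0][0]
        features.append(window_features)
        labels.append(majority_label)
    result = pd.DataFrame(features)
    result["label"] = labels
    return result

# Toy data (assumed): 10 samples, windows of 4 samples with 50% overlap
toy = pd.DataFrame({
    "magnitude": np.arange(10, dtype=float),
    "label": ["standing"] * 5 + ["walking"] * 5,
})

windows = sliding_window_features(toy, window_size=4, overlap_percent=50)
print(windows)
```

With a window of 4 samples and 50% overlap, consecutive windows start 2 samples apart, in the same way that the 1000-sample windows above advance by 500 samples.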
The following code snippet will help us to understand the balance or imbalance
of the different activities within our dataset. By using the .value_counts() method
on the ‘label’ column of the extracted_features DataFrame the code evaluates how
many instances of each unique class label are present.
# Output
label
standing 12063
grazing 7472
walking 4220
lying 1968
trotting 814
running 649
scratch_biting 186
fighting 150
shaking 33
From the result, it is clear that we have a highly imbalanced dataset, with
‘standing’ and ‘grazing’ behaviors significantly more represented than activities
like ‘shaking’ or ‘fighting’. This imbalance could lead to a model that is overly
proficient at recognizing the majority classes but performs poorly on the minority
classes, which might be equally or more important for the predictive task at hand.
To address this, one might consider implementing techniques such as resampling
the underrepresented classes, applying class weights during model training, or
choosing evaluation metrics that are sensitive to class imbalance, ensuring the
model’s predictive performance is balanced across all classes.
For the scope of our project, while acknowledging the presence of class imbalance
in our dataset, we will not apply any specific techniques to address this issue.
Instead, we will proceed with the analysis as is, which will allow us to maintain
the natural distribution of behaviors as they occur.
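Although the project does not apply them, the class-weight idea mentioned above can be sketched from the window counts already printed. The "balanced" heuristic used here, n_samples / (n_classes * class_count), is the formula scikit-learn applies for class_weight='balanced'; using it in this context is an illustration, not part of the original workflow.

```python
# Window counts per class, copied from the value_counts() output above
class_counts = {
    "standing": 12063, "grazing": 7472, "walking": 4220, "lying": 1968,
    "trotting": 814, "running": 649, "scratch_biting": 186,
    "fighting": 150, "shaking": 33,
}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"{label:15s} {weight:8.3f}")
```

Under this heuristic a misclassified 'shaking' window would count far more heavily during training than a misclassified 'standing' window.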
# Perform the initial split into training (70%) and temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
# Further split the temp set into validation (50%) and test (50%)
X_validation, X_test, y_validation, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp)
Code Breakdown
• We import the train_test_split function from sklearn.model_selection.
• We define our feature matrix X as all columns in the dataset except for the
‘label’ column, and our target vector y as the ‘label’ column.
• We perform the initial split using train_test_split:
– X_train and y_train contain 70% of the data for training.
– X_temp and y_temp contain the remaining 30% temporarily.
• We further split the X_temp and y_temp sets into validation and test sets
using another train_test_split:
– X_validation and y_validation contain 50% of the temporary set (15% of
the full dataset) for validation.
– X_test and y_test contain the remaining 50% of the temporary set (15% of
the full dataset) for testing.
• We use the stratify parameter to ensure that the class proportions are
maintained in both the training/temp and validation/test splits.
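The effect of the two-stage split can be checked on synthetic data. In the sketch below, the arrays X and y are made-up stand-ins (an assumption), chosen so that the resulting 70/15/15 sizes and the preserved class share are easy to verify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 samples, 4 features, imbalanced labels
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.array(["standing"] * 700 + ["grazing"] * 200 + ["walking"] * 100)

# 70% training, 30% held out
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Split the held-out 30% in half: 15% validation and 15% test overall
X_validation, X_test, y_validation, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp)

for name, labels in [("train", y_train), ("validation", y_validation), ("test", y_test)]:
    share = np.mean(labels == "walking")
    print(f"{name:10s} n={len(labels):4d} walking share={share:.2f}")
```

Because both calls use stratify, the share of the rarest class stays close to 10% in every subset, which would not be guaranteed with a purely random split.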
Feature Scaling
Feature scaling helps each feature contribute to the model training process on a
comparable footing and can prevent certain features from dominating others simply
because of their scale. In this section, we will explore the importance of feature scaling
and demonstrate how to perform it using Python for our machine learning project.
Why feature scaling is essential:
• Equal Contribution: Scaling ensures that all features have a similar influence
on the learning algorithm. Without scaling, features with larger scales can
dominate those with smaller scales.
This code snippet shows that we import two preprocessing classes from scikit-
learn, MinMaxScaler and StandardScaler. We then initialize Min-Max scaling and
standardization processes, perform scaling on the datasets, and then we create
new data frames with the scaled features.
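A minimal sketch of these steps is given below. The toy frames and column names are made up, and fitting each scaler on the training split only is an assumption about the workflow, although it is the usual practice to keep validation data out of the fitted parameters.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy stand-in frames (assumption): two features on very different scales
X_train = pd.DataFrame({"mean_x": [0.1, 0.4, 0.9, 0.5],
                        "energy": [120.0, 300.0, 80.0, 500.0]})
X_validation = pd.DataFrame({"mean_x": [0.3, 1.1], "energy": [200.0, 90.0]})

min_max_scaler = MinMaxScaler()
standard_scaler = StandardScaler()

# Fit on the training data only, then reuse the fitted parameters elsewhere
X_train_n = pd.DataFrame(min_max_scaler.fit_transform(X_train),
                         columns=X_train.columns, index=X_train.index)
X_validation_n = pd.DataFrame(min_max_scaler.transform(X_validation),
                              columns=X_validation.columns, index=X_validation.index)

X_train_standardized = pd.DataFrame(standard_scaler.fit_transform(X_train),
                                    columns=X_train.columns, index=X_train.index)
X_validation_standardized = pd.DataFrame(standard_scaler.transform(X_validation),
                                         columns=X_validation.columns,
                                         index=X_validation.index)

print(X_train_n.round(2))
print(X_validation_n.round(2))  # values can fall outside [0, 1] on unseen data
```

Wrapping the scaled arrays back into DataFrames keeps the column names, which makes later steps such as selecting features by name easier.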
confusion_matrix_encoded = confusion_matrix(y_validation_encoded, y_pred_encoded)
results_dict[model_name] = {"Confusion Matrix": confusion_matrix_encoded,
                            "Accuracy": accuracy,
                            "Classification Report": classification_rep,
                            "Scaler": scaler_name}
print(f"Classification Report:\n{metrics['Classification Report']}\n")
# Output
Results for Standardized Dataset:
Model: Random Forest (Scaler: StandardScaler)
Accuracy: 0.92
Classification Report:
"""
Train Random Forest on the provided training set and evaluate
it on validation set.
Parameters:
- X_train: Training data
- y_train: Training data labels
- X_val: Validation data
- y_val: Validation data labels
- n_estimators: Number of trees in the Random Forest
- max_depth: Maximum depth of the trees
- random_state: Seed for reproducibility
Returns:
- rf_model: Trained Random Forest model
- performance_metrics: Classification report for the model on
validation data
"""
# Output
Random Forest Model Performance on Validation Data:
precision recall f1-score support
fighting 0.91 0.87 0.89 23
grazing 0.92 0.94 0.93 1121
lying 0.93 0.85 0.89 295
running 0.98 0.98 0.98 97
scratch_biting 0.89 0.29 0.43 28
shaking 1.00 1.00 1.00 5
standing 0.94 0.96 0.95 1809
trotting 0.97 0.93 0.95 122
walking 0.95 0.94 0.95 633
accuracy 0.94 4133
macro avg 0.94 0.86 0.88 4133
weighted avg 0.94 0.94 0.94 4133
Accuracy: 93.90%
and recall across most classes, indicating strong predictive capabilities, with
perfect scores for shaking, although that class has only 5 validation samples.
However, scratch_biting shows a significant drop in recall (0.29), meaning that most
instances of this behavior are missed. Overall, the model achieves an accuracy of 93.90%.
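The gap between the macro-average recall (0.86) and the accuracy (0.94) in the report follows from how the two averages are formed. The sketch below recomputes both from the rounded per-class recall and support values in the table above, so the results agree with the report only to about two decimals.

```python
# (recall, support) per class, copied from the classification report above
report = {
    "fighting": (0.87, 23), "grazing": (0.94, 1121), "lying": (0.85, 295),
    "running": (0.98, 97), "scratch_biting": (0.29, 28), "shaking": (1.00, 5),
    "standing": (0.96, 1809), "trotting": (0.93, 122), "walking": (0.94, 633),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, however rare it is
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support
weighted_recall = sum(recall * support
                      for recall, support in report.values()) / total_support

print(f"total support   : {total_support}")
print(f"macro recall    : {macro_recall:.2f}")
print(f"weighted recall : {weighted_recall:.2f}")
```

The low recall of scratch_biting pulls the macro average down because it has the same say as standing, whereas in the weighted average its 28 samples barely register. Support-weighted recall is also, by construction, the same quantity as overall accuracy.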
Feature Selection
At this point of our project, we want to examine the features that influence our
model’s predictions. To achieve this, we employ code to extract and rank the
features according to their importance as determined by the Random Forest
classifier.
The reasoning for performing feature selection is twofold. Firstly, it enhances
model interpretability, allowing us to understand which factors are pivotal in
distinguishing between behaviors like grazing, running, or lying. Secondly, by
minimizing the quantity of features, we might improve the model’s performance
and reduce possible overfitting, as the classifier concentrates on the most relevant
information.
While the current approach directly utilizes the built-in feature importance of the
Random Forest model, it is essential to note that this is just one way to perform
feature selection. There are numerous other techniques, each with its advantages
and applications, ranging from univariate statistical tests to model-based methods
and iterative selectors. These methods will be comprehensively explored in
Chapter 6, which is dedicated to the subject of feature selection.
The following code is designed to explain the significance of each feature used in
training our Random Forest classifier:
# Output
Feature rankings based on importance:
1. Feature 7 - Importance: 0.1588
2. Feature 6 - Importance: 0.1404
3. Feature 3 - Importance: 0.0923
4. Feature 4 - Importance: 0.0821
5. Feature 5 - Importance: 0.0632
6. Feature 16 - Importance: 0.0519
7. Feature 18 - Importance: 0.0499
8. Feature 0 - Importance: 0.0479
9. Feature 1 - Importance: 0.0461
10. Feature 14 - Importance: 0.0436
11. Feature 2 - Importance: 0.0367
12. Feature 17 - Importance: 0.0362
13. Feature 15 - Importance: 0.0303
14. Feature 12 - Importance: 0.0264
15. Feature 13 - Importance: 0.0232
16. Feature 10 - Importance: 0.0178
17. Feature 11 - Importance: 0.0158
18. Feature 8 - Importance: 0.0151
19. Feature 9 - Importance: 0.0147
20. Feature 19 - Importance: 0.0079
Here is a breakdown:
1. We begin by extracting the importance of each feature as determined by the
Random Forest model (rf_model.feature_importances_).
2. Then, we sort these importances in descending order (argsort()[::-1]) to
identify which features are most influential.
3. We then iterate over the sorted indices, printing out the rank, the feature
index, and its corresponding importance score (refer to the output).
4. For demonstration purposes, we choose the top 10 features according to their
importance scores. This selection is arbitrary and can be adjusted based on
the desired model complexity or specific domain knowledge. The selected
features are then used to create a subset of the original training data
(X_train_n[top_features]).
5. Finally, we retrain the Random Forest with only the selected top features. By
doing so, we can see how well the model performs when it is focused solely
on the most significant features, potentially leading to improved efficiency
and, possibly, performance due to reduced noise and complexity.
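The five steps above can be sketched as follows. The data here is synthetic (generated with make_classification), the column names f0 to f19 are made up, and the forest settings are arbitrary, so this is an illustration of the procedure rather than a reproduction of the project's code or results.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the normalized training data (assumption)
X_arr, y_train = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=42)
X_train_n = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
rf_model.fit(X_train_n, y_train)

# Steps 1-2: importances and indices sorted from most to least important
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]

# Step 3: print the ranking
for rank, idx in enumerate(indices, start=1):
    print(f"{rank}. Feature {idx} - Importance: {importances[idx]:.4f}")

# Step 4: keep the 10 highest-ranked columns
top_features = X_train_n.columns[indices[:10]]

# Step 5: retrain on the reduced feature set
rf_top = RandomForestClassifier(n_estimators=50, random_state=42)
rf_top.fit(X_train_n[top_features], y_train)
```

Since argsort returns indices in ascending order, the [::-1] slice is what turns the result into a most-important-first ranking.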
Now that we have trained our model on the top 10 important features, we evaluate
its performance on the validation set using the following code:
# Output
Random Forest Model Performance on Validation Data (Top Features):
Accuracy: 0.9363658359545125
Classification Report:
precision recall f1-score support
fighting 0.95 0.78 0.86 23
grazing 0.93 0.93 0.93 1121
lying 0.93 0.84 0.88 295
running 0.97 0.97 0.97 97
scratch_biting 0.75 0.32 0.45 28
shaking 1.00 0.60 0.75 5
standing 0.94 0.96 0.95 1809
trotting 0.92 0.92 0.92 122
walking 0.94 0.94 0.94 633
accuracy 0.94 4133
macro avg 0.93 0.81 0.85 4133
weighted avg 0.94 0.94 0.94 4133
Parameters:
- X_train: Training features.
- y_train: Training labels.
- X_validation: Validation features.
- y_validation: Validation labels.
- n_components: Number of PCA components or explained variance.
Returns:
- RandomForest classifier trained on the transformed data.
- Classification report on the validation data.
"""
# Apply PCA
pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train_standardized)
X_validation_pca = pca.transform(X_validation_standardized)
# You can call the function and print out the results as follows:
rf_model, acc, class_rep = apply_pca(X_train, y_train,
X_validation, y_validation)
# Output
Accuracy: 0.9029760464553593
Classification Report:
precision recall f1-score support
fighting 0.88 0.61 0.72 23
grazing 0.91 0.89 0.90 1121
lying 0.84 0.70 0.77 295
running 0.95 0.97 0.96 97
scratch_biting 1.00 0.18 0.30 28
shaking 1.00 0.80 0.89 5
standing 0.88 0.94 0.91 1809
trotting 0.96 0.93 0.95 122
walking 0.96 0.92 0.94 633
accuracy 0.90 4133
macro avg 0.93 0.77 0.81 4133
weighted avg 0.90 0.90 0.90 4133
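The fit-on-training, transform-on-validation pattern used with PCA can be illustrated on synthetic data. In this sketch the data, the 0.95 variance target, and the explicit svd_solver setting are all assumptions chosen for the example; it stops before the classifier is trained.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 20 features mixed from 3 hidden signals
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 20))
X_train, X_validation = X[:150], X[150:]

# Standardize with statistics from the training split only
scaler = StandardScaler().fit(X_train)
X_train_standardized = scaler.transform(X_train)
X_validation_standardized = scaler.transform(X_validation)

# A float n_components keeps the fewest components reaching that variance share
pca = PCA(n_components=0.95, svd_solver="full")
X_train_pca = pca.fit_transform(X_train_standardized)
X_validation_pca = pca.transform(X_validation_standardized)

print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))
```

The validation data is only ever passed through transform, so the principal directions are learned from the training split alone.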
import optuna
from sklearn.model_selection import cross_val_score
def objective(trial):
    # Define hyperparameter space
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 10, 40, log=True)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 15)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)
    # Note: 'auto' is accepted only by scikit-learn versions before 1.3
    max_features = trial.suggest_categorical('max_features', ['auto', 'sqrt'])
    bootstrap = trial.suggest_categorical('bootstrap', [True, False])
This code defines an objective function for use with Optuna focusing on tuning
a Random Forest classifier. Here is how it works and why specific numbers are
chosen for the hyperparameters:
1. Hyperparameter Space Definition:
• n_estimators: The suggested number of estimators is between 50 and 300.
This range is chosen to give enough variability to see how increasing the
number of trees affects performance without going to extremes that might
only offer diminishing returns or excessive computation.
• max_depth: The suggested max depth is between 10 and 40, with a
logarithmic scale (log = True). The log scale is used to explore the lower
end of the range more finely, where smaller changes can influence the
model complexity and performance.
• min_samples_split: The suggested min samples split is between 2 and
15. This range enables the model to investigate both small and relatively
larger node sizes before deciding on a split.
• min_samples_leaf: The suggested min samples leaf is between 1 and 10.
This parameter helps restrict overfitting by preventing the model from
learning overly intricate patterns.
• max_features: This sets how many features are evaluated during the search
for the optimal split. Two options are offered, ‘auto’ and ‘sqrt’. Note that
for scikit-learn’s RandomForestClassifier, ‘auto’ was an alias for ‘sqrt’
(the square root of the number of features) and was removed in version 1.3,
so the two options behave identically; passing None would use all features.
This choice affects how diverse each tree in the forest is and can
influence model accuracy and overfitting.
• bootstrap: The options of bootstrap are True (bootstrap sampling) or False
(the entire dataset is used to build each tree). Bootstrapping introduces
randomness into the model, which can help improve robustness and
accuracy.
2. Random Forest Classifier Initialization and Training: The classifier is
initialized with the suggested hyperparameters and set to use all available
CPU cores (n_jobs = -1) for faster training.
3. Model Evaluation with Cross-Validation: The cross_val_score function
evaluates the classifier’s performance using cross-validation with 5 folds (cv
= 5), meaning the training set is divided into five parts, with the model trained
on four and validated on the fifth, rotating until each part has been used for
validation. The objective function returns the mean accuracy across all folds,
which serves as the objective value that Optuna seeks to maximize.
This approach allows Optuna to systematically explore the hyperparameter
space, using the objective function’s return value to guide the search towards
combinations that yield the best cross-validation accuracy.
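The fold rotation behind 5-fold cross-validation can be made explicit with a few lines of NumPy. This sketch uses plain contiguous folds on 20 sample indices and made-up fold accuracies; it ignores stratification, which scikit-learn applies by default when cross_val_score is given a classifier and an integer cv.

```python
import numpy as np

n_samples, n_folds = 20, 5
indices = np.arange(n_samples)

# Split the sample indices into five equal parts
folds = np.array_split(indices, n_folds)

validated = []
for k, val_idx in enumerate(folds):
    # Train on the other four parts, validate on part k
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    validated.extend(val_idx.tolist())
    print(f"fold {k}: train on {len(train_idx)} samples, validate on {len(val_idx)}")

# The objective value is the mean of the five fold accuracies;
# the scores below are made-up placeholders
fold_scores = np.array([0.93, 0.95, 0.94, 0.94, 0.95])
objective_value = fold_scores.mean()
print(f"objective value: {objective_value:.3f}")
```

Every sample is used for validation exactly once, which is why the averaged score is a steadier guide for the search than a single train/validation split.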
best_params = study.best_params
best_score = study.best_value
# Output:
best_params:
{'n_estimators': 223,
'max_depth': 24,
'min_samples_split': 2,
'min_samples_leaf': 1,
'max_features': 'auto',
'bootstrap': False}
best_score: 0.941725499462175
predictions = best_rf.predict(X_validation_n)
# Output:
Accuracy: 94.82%
Classification Report:
precision recall f1-score support
fighting 0.91 0.87 0.89 23
grazing 0.93 0.95 0.94 1121
lying 0.95 0.88 0.91 295
running 0.98 0.99 0.98 97
scratch_biting 1.00 0.32 0.49 28
shaking 1.00 1.00 1.00 5
standing 0.95 0.97 0.96 1809
trotting 0.97 0.94 0.96 122
walking 0.96 0.95 0.95 633
accuracy 0.95 4133
macro avg 0.96 0.87 0.90 4133
weighted avg 0.95 0.95 0.95 4133
Code breakdown:
1. best_rf = RandomForestClassifier(**best_params, n_jobs = -1) and
best_rf.fit(X_train_n, y_train): the best_params are applied to construct and train a
new model, best_rf, on our normalized training dataset (X_train_n).
# Output:
Accuracy on Test Set: 95.11%
Classification Report on Test Set:
precision recall f1-score support
fighting 0.85 0.77 0.81 22
grazing 0.95 0.95 0.95 1121
lying 0.95 0.85 0.89 295
running 0.97 0.96 0.96 98
A Practical Example to Building a Simple Machine Learning Model | 99
Code breakdown:
1. Initialize the Final Model: The RandomForestClassifier is initialized with the
optimal hyperparameters discovered earlier: n_estimators = 223, max_depth
= 24, min_samples_split = 2, min_samples_leaf = 1, max_features = ‘auto’,
and bootstrap = False. The n_jobs = -1 parameter allows the classifier to
leverage all available CPU cores to accelerate the training process.
2. Model Training: The final model, final_rf, is then trained (fit) on the
normalized dataset (X_train_n and y_train).
3. Making Predictions: After training the model, predictions are generated on
the normalized test set (X_test_n), which simulates how the model would
perform when making predictions on new, unseen data.
4. Model Evaluation: accuracy_score and classification_report are called to give
insights into the overall accuracy and detailed performance metrics for each
class.
From the output we can see that the model achieves an accuracy of 95.11% on the
test set, indicating a high predictive performance level.
This marks the conclusion of our journey through applying machine learning to
accelerometer data of sheep and goats. From preprocessing and exploratory data
analysis to feature selection, hyperparameter tuning, and final evaluation, we have
seen how systematic approaches can refine a model to achieve high accuracy. This
example highlights the practical steps in developing an ML model.
Summary
In Chapter 3, we tackled the practical side of machine learning by developing a
Python-based project to classify the behaviors of sheep and goats. Our approach
is hands-on, focusing on the actual steps needed to carry out a machine learning
project. We started with the basics: loading and preprocessing data, followed by
exploratory data analysis using histograms, correlation matrix, and boxplots to
understand the dataset better. We then introduced the feature extraction technique
using windowing methods, which is crucial for preparing the data for modeling.
The subsequent steps involved selecting features and tuning the model’s
hyperparameters to optimize performance. These stages are key to refining the
model and enhancing its accuracy. Finally, we demonstrated how to evaluate the
model’s effectiveness in a real-world scenario, emphasizing the importance of a
systematic assessment to ensure reliability. Throughout this chapter, our goal was
to illustrate the essential components of a machine learning pipeline, rather than
delving into every technical detail. By simplifying some of the content, we aimed
to maintain a clear and coherent narrative that makes the core processes accessible
and understandable, avoiding the potential for confusion that might arise from a
more complex presentation.
CHAPTER 4
Sensors, Data Collection and Annotation
Over the years, methods for studying and interpreting animal behavior have
become increasingly sophisticated, especially with advancements in technology.
One of the crucial aspects is the evolution of data collection techniques, which
have significantly transformed the way we approach this task.
Animal behavior studies center on the accurate collection, preservation,
and analysis of multifaceted data. This includes an animal’s movements, its
vocalizations, interactions, and various other behavioral attributes. The goal is to
develop an understanding of the animal’s physical activity, as well as its emotional
state and its interactions within its environment and social groups.
The field of data collection is vast. A plethora of sensors, each designed for
specific observational needs, is available to researchers and professionals. The choice
of a particular data collection method depends on many variables. The species
of the animal under observation, the behavior targeted for study, and the
conditions during data collection are instrumental in guiding this choice.
In the early stages of Precision Livestock Farming (PLF), data collection was
mostly manual. Researchers and farmers would engage in direct observations,
making use of basic tools and their own senses to evaluate animal behavior.
Charts, simple logs, and visual cues served as the means to record observed
behaviors. As time progressed, the need for more accurate and granular data
drove innovations in data collection methodologies. Manual observations began
to be complemented, and in many cases replaced, by mechanical and eventually
electronic sensors.
The modern era of PLF has witnessed the smooth integration of technology. The
arrival of microelectronics, wireless communication, and advanced software
algorithms has revolutionized the way we collect and analyze animal behavior
data. These advancements enable real-time monitoring and offer capabilities like
remote monitoring, predictive analytics, and automated interventions.
while monitoring the heartbeat of a small bird could require highly sensitive
equipment.
3. Equipment Availability and Suitability: The availability of specific
equipment and its compatibility with the study’s needs is vital. Researchers
must weigh the pros and cons of available tools, deciding on aspects like
battery life, data storage capacities, range, and durability.
4. Ethical Considerations: Ensuring the welfare of the animals being monitored
is crucial. Any equipment or sensors used should not harm, stress, or unduly
disturb the animals. This requires choosing lightweight, non-invasive tools,
and regularly monitoring the animals for signs of discomfort.
Gyroscopes
Gyroscopes are devices used to measure angular velocity. In simpler terms, they
detect how fast something is spinning around an axis.
Working Principle: Gyroscopes rely on the principles of angular momentum.
When an external torque is applied to a spinning rotor, it produces a behavior
known as precession, which is the change in the orientation of the rotational axis.
Data Types: Gyroscopes provide the rate of rotation around the device’s X, Y,
and Z axes.
Capture Methods: Same as accelerometers.
Use Cases: In animal behavior monitoring, gyroscopes can help determine
rotational movements, like when an animal is shaking its head or rolling.
Figure 4.1 illustrates the acceleration data captured from an animal over the span
of one minute. The data is separated into X, Y, and Z directions. The distinct lines
represent acceleration along the X, Y, and Z axes, respectively. The variations
in the graph portray the dynamic movements of the animal, with each axis
representing a different spatial direction.
Displayed in Figure 4.2 is the gyroscope data showcasing the angular velocity of
the animal over a 60-second interval. The lines represent the angular rotational
movements about each respective axis. This provides insights into the orientation
and turning behaviors of the animal during the observation period.
[Figure: “Accelerometer Movements”. Three stacked panels plot acceleration along the X, Y, and Z axes against time (0 to 60 seconds).]
Figure 4.1: Acceleration data showing movements in X, Y, and Z directions over one minute.
[Figure: “Gyroscope Movements”. Three stacked panels plot angular velocity about the X, Y, and Z axes against time (0 to 60 seconds).]
Figure 4.2: Gyroscope data showing angular velocity in X, Y, and Z directions over one
minute.
Sensors, Data Collection and Annotation | 107
Cameras
Cameras, ranging from traditional to thermal and UAV-based, play a pivotal role
in capturing visual data from the field.
Types
1. Traditional Cameras: These devices capture visual details by receiving
light through a lens, translating it into video or still imagery. They are most
effective in well-lit conditions, making them suitable for
daytime observations.
2. Thermal Cameras: These take advantage of the principle of infrared
radiation detection. By sensing and visualizing temperature differences, they
produce images, making nocturnal observations feasible. Thermal variations
can indicate potential illnesses, positioning these cameras as vital tools for
preventive health checks.
3. UAV-based Cameras: Drones or UAVs, equipped with cameras, offer aerial
visuals, ideal for monitoring large animal gatherings or expansive terrains.
Their primary advantage is mobility, capturing imagery from diverse altitudes
and perspectives.
Working Principle: Cameras capture light through a lens, which hits a sensor
that converts this light into an electronic signal to produce an image.
Data Types: Video footage, still images.
Use Cases: Monitoring herd movement, detecting heat signatures of animals
(especially useful during the night or in dense forests), aerial surveillance of large
herds.
GPS
The Global Positioning System (GPS) is a satellite-based navigation system that
plays an essential role in animal behavior studies. GPS technology is instrumental
in tracking and understanding the spatial movements of animals, offering
invaluable insights into their grazing patterns, territory utilization, and migratory
habits.
Operational Mechanism: GPS determines position by trilateration, using signals
from a constellation of satellites orbiting the Earth. Each satellite transmits a signal
that includes its location and the exact time the signal was transmitted. A GPS
receiver, such as one fitted to a collar worn by an animal, calculates its location by
measuring the time delay between the transmission and reception of signals from
multiple satellites and converting each delay into a distance.
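The link between time delay and distance is simple arithmetic: distance equals the signal speed multiplied by the travel time. The sketch below uses hypothetical travel times and a helper name of our own choosing (delay_to_distance_km); real receivers additionally correct for their own clock offset, which is one reason signals from at least four satellites are used.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # radio signals travel at the speed of light

def delay_to_distance_km(delay_seconds):
    """Convert a signal travel time into a receiver-to-satellite distance in km."""
    return SPEED_OF_LIGHT_M_PER_S * delay_seconds / 1000.0

# Hypothetical travel times (seconds) from four satellites to one collar
delays = [0.0672, 0.0701, 0.0735, 0.0689]
for sat_id, delay in enumerate(delays, start=1):
    print(f"satellite {sat_id}: {delay_to_distance_km(delay):,.0f} km")

# A timing error of one microsecond already shifts the distance by about 300 m
print(f"1 microsecond error: {delay_to_distance_km(1e-6) * 1000:.0f} m")
```

The last line shows why precise timing matters: tiny errors in the measured delay translate into large errors in the estimated distance.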
Data Types: Coordinates (Longitude, Latitude), Altitude, Time stamp.
Adjustment Period
Once the equipment is attached, animals need time to adjust. Introducing foreign
objects, like sensors or collars, might initially alter an animal’s behavior. It is vital
to provide an adjustment period, allowing the animals to return to their natural
behaviors before actual data collection begins. This period can range from half an
hour to several hours, depending on the animal and the nature of the equipment.
Attention to Detail
Given that this stage lays the groundwork for all subsequent analysis, researchers
must be meticulous. A missed detail here could lead to hours of wasted effort later.
For example, sensors that are not calibrated correctly might produce data that is
consistently off, leading to inaccurate machine learning predictions.
Setting up Equipment
Once the right sensors are chosen, the next step is setting them up.
Placement: Depending on the type of animal and the behavior you are studying,
the sensor’s placement is critical. For instance, accelerometers placed on an
animal’s leg will provide data different from those placed on its back. Cameras
should be positioned to capture the full range of the animal’s activities without
obstructions.
Calibration: All sensors require calibration to ensure the data they produce
is accurate. This step is especially crucial for sensors like accelerometers and
gyroscopes. Calibration processes may vary across sensors, but they often involve
setting them up in a known state or condition and adjusting them until their outputs
match the expected values.
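As one simple illustration of calibrating against a known state, the sketch below estimates a constant per-axis offset for an accelerometer that is lying flat and still, where the expected reading is gravity on the Z axis only. The readings are simulated and the offset-only error model is an assumption; real calibration procedures may also correct scale and axis misalignment.

```python
import numpy as np

G = 9.81  # expected gravity reading in m/s^2 (approximate)

# Simulated readings from a sensor lying flat and still (assumption):
# true signal is (0, 0, G) plus a constant per-axis offset and small noise
rng = np.random.default_rng(1)
true_offset = np.array([0.15, -0.08, 0.30])
readings = np.array([0.0, 0.0, G]) + true_offset + 0.02 * rng.normal(size=(500, 3))

# Estimate the offset as the gap between the mean reading and the expected value
expected = np.array([0.0, 0.0, G])
estimated_offset = readings.mean(axis=0) - expected

# Apply the correction to the readings
calibrated = readings - estimated_offset

print("estimated offset:", estimated_offset.round(3))
print("mean after calibration:", calibrated.mean(axis=0).round(3))
```

Averaging over many stationary samples suppresses the random noise, so the remaining gap to the expected value is attributed to the sensor's systematic offset.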
Data Acquisition
With everything set up, the next phase is data acquisition, which involves the
actual process of gathering data from the sensors.
Data Storage
Given the large volume of data these sensors can produce, especially in continuous
collection mode, having an effective storage solution is critical.
1. On-device Storage: Many sensors come with built-in storage capabilities.
This is convenient but is often limited in capacity.
2. Cloud Storage: Some modern sensors can transmit data in real-time to cloud
storage solutions. This provides virtually unlimited storage capacity but
requires the sensors to be within the network coverage.
3. Local Storage Solutions: In scenarios where real-time data transmission is
not possible or when on-device storage is insufficient, data can be periodically
offloaded to local storage devices like laptops or external hard drives.
In essence, the planning and setup phase is arguably the most labor-intensive
and critical stage of the entire process. It is not purely about attaching sensors
to animals; it is about ensuring the data that will be collected is as accurate and
representative as possible. A well-laid foundation at this stage will undoubtedly
benefit the subsequent phases, especially when it is time to analyze the data and
draw meaningful conclusions.
Preliminary Analysis
After the initial cleaning and formatting, it is beneficial to conduct a preliminary
analysis of the data. This does not involve in-depth modeling or prediction tasks
but serves to provide an initial understanding of the data’s nature. During this
stage, researchers may choose to plot the data to visualize trends, examine the
distribution of various readings, or simply check for patterns or anomalies that
were not caught during the cleaning phase. This analysis helps in understanding
the general characteristics of the data, which, in turn, can inform more advanced
analyses later on.
To conclude, post-collection data processing is a preparatory step, ensuring that
data is in its best form before looking into analysis or model training. It is a
demonstration of the fact that in machine learning and data science, the quality
of input (data) often dictates the quality of the output (results or predictions).
A thorough approach here sets the stage for optimal results in subsequent phases.
Data Annotation
Data annotation is a vital step in preparing data for machine learning models,
especially for supervised learning. By labeling or annotating the data, we provide
context to the raw data, transforming it into informative samples from which the
machine learning model can learn.
[Figure 4.3: accelerometer reading plotted against time (0 to 60 seconds), with annotated behavior segments.]
Figure 4.3 offers an example of how accelerometer data can be annotated to gain
insights into specific animal behaviors. The graph showcases raw accelerometer
readings over a span of one minute. In the first 35 seconds, marked by vertical
dashed lines, the pronounced spikes in the data suggest that the animal was
actively walking. This is further confirmed by the annotated label ‘walking’.
Following this active phase, the accelerometer readings become more uniform
and moderated, indicating a period of inactivity for the animal, as denoted by the
second annotated segment.
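In code, an annotation of this kind amounts to attaching a label to every sample according to its timestamp. The sketch below assumes a 10 Hz sampling rate and uses ‘inactive’ as a placeholder name for the second segment, since only the ‘walking’ label is named in the description.

```python
import numpy as np

sampling_rate = 10  # samples per second (assumption)
time_s = np.arange(60 * sampling_rate) / sampling_rate

# Annotation taken from the figure description: walking for the first 35 s.
# The label for the second segment is a placeholder name.
labels = np.where(time_s < 35, "walking", "inactive")

# Summarize how many seconds each label covers
for label in ("walking", "inactive"):
    seconds = np.sum(labels == label) / sampling_rate
    print(f"{label:9s} {seconds:.0f} s")
```

The resulting labels array has one entry per sample and can be stored next to the raw readings as the target column for supervised learning.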
Methods of Annotation
There are various methods through which data can be annotated:
Manual: Manual annotation is a labor-intensive process where human annotators
label data points individually. This method is highly accurate, as it relies on
human judgment and understanding. It is particularly useful for complex tasks
where detailed understanding is crucial, such as image segmentation or sentiment
Step 3: Labeling
Once segments have been identified, they can be labeled with the corresponding
behavior. This can be done manually, where a researcher reviews the segments
and assigns labels, or semi-automatically, where thresholds are set to categorize
the data based on signal magnitude or frequency.
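A semi-automatic rule of this kind can be sketched with a single threshold on the variability of each window. The signal below is simulated, and the window length, threshold value, and label names are all hypothetical choices that would need tuning and manual checking on real data.

```python
import numpy as np

sampling_rate = 10   # samples per second (assumption)
window_size = 50     # 5-second windows (assumption)
threshold = 0.2      # hypothetical cut-off on the window's standard deviation

# Simulated magnitude signal: 30 s of strong movement, then 30 s of near rest
rng = np.random.default_rng(3)
active = 1.0 + 0.6 * rng.normal(size=30 * sampling_rate)
resting = 1.0 + 0.03 * rng.normal(size=30 * sampling_rate)
magnitude = np.concatenate([active, resting])

# Cut the signal into non-overlapping windows and label each by its variability
windows = magnitude.reshape(-1, window_size)
window_std = windows.std(axis=1)
window_labels = np.where(window_std > threshold, "active", "inactive")

for i, (std, label) in enumerate(zip(window_std, window_labels)):
    print(f"window {i:2d}: std={std:.2f} -> {label}")
```

Labels produced this way are best treated as a first pass that a researcher then reviews, especially near transitions between behaviors.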
Summary
In Chapter 4, we looked at sensor technologies that are revolutionizing the study
of animal behavior. We discussed tools, including accelerometers, gyroscopes,
GPS trackers, and video cameras, which open up new avenues for understanding
the complex lives of animals. We talked about how these devices capture the
physical movements of animals and offer information on their social interactions,
emotional experiences, and adaptations to their environments. We addressed the
critical aspects of data collection, emphasizing the significance of selecting the
right sensors for our research objectives to ensure the collection of high-quality,
relevant data. Through this process, we introduced challenges of working in
diverse field conditions and the need for a thorough process of sensor calibration.
Our discussion extended to the involved task of data annotation, where we looked
at techniques necessary to transform raw sensor data into valuable insights.
CHAPTER 5
Preprocessing and Feature Extraction for Animal Behavior Research
[Figure: two panels, (a) and (b), each plotted against time (0 to 10 seconds).]
Figure 5.1: Low sampling rate vs high sampling rate accelerometer data.
Preprocessing and Feature Extraction for Animal Behavior Research | 117
Data Preprocessing
Data preprocessing stands as a critical phase where raw data is converted into
an appropriate format for detailed analysis. This section covers the essential
aspects of data preprocessing, including strategies for data cleaning, scaling,
normalization, and filtering techniques. Each topic is explored, providing insights
into their importance and practical application.
Data Cleaning
Data cleaning is a preliminary step in preparing accelerometer data for analysis.
It involves identifying and correcting (or removing) errors and inconsistencies in
the data to improve its quality.
Handling Missing Data:
1. Identification: The first step is identifying missing values in the dataset.
This can occur due to various reasons like sensor malfunction or interruptions
during data transmission.
2. Imputation Techniques: Depending on the context, different imputation
methods can be used to fill in missing data. Simple methods include using
the mean or median of nearby points, while more complex approaches might
involve predictive modelling.
3. Deletion: In cases where imputation may not be appropriate or feasible,
especially if the missing data is substantial, deletion might be the chosen
strategy.
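The three options above can be tried on a short toy series with pandas. In this sketch, NaN is assumed to mark a missing sample, and the 5-point window for the nearby median is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Toy x-axis readings with gaps (assumption: NaN marks a missing sample)
acc_x = pd.Series([0.10, 0.12, np.nan, 0.15, 0.14, np.nan, np.nan, 0.20, 0.18, 0.17])

# 1. Identification: count the missing values
print("missing samples:", acc_x.isna().sum())

# 2. Imputation: replace each gap with the median of nearby points
nearby_median = acc_x.rolling(window=5, center=True, min_periods=1).median()
acc_x_imputed = acc_x.fillna(nearby_median)

# Alternative imputation: linear interpolation between valid neighbors
acc_x_interpolated = acc_x.interpolate(method="linear")

# 3. Deletion: drop the samples that are missing
acc_x_deleted = acc_x.dropna()

print(acc_x_imputed.round(3).tolist())
print(acc_x_interpolated.round(3).tolist())
print("samples left after deletion:", len(acc_x_deleted))
```

Deletion shortens the series and breaks the regular sampling interval, which matters for later windowing, so imputation is often preferred when the gaps are short.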
Figure 5.2 illustrates the accelerometer readings along the x-axis for the first
10 seconds. The original data is shown in dark gray, exhibiting the raw sensor
readings with natural variations. The standardized data, displayed in light gray,
provides a standardized view that is essential for certain analytical comparisons.
The time is marked on the x-axis, and the acceleration values are on the
y-axis, offering a clear visual representation of the effect of standardization on
accelerometer data.
2. Min-Max Scaling (also known as Normalization): Transforms the data
to fit within a specific range, typically 0 to 1, which can be useful in
analyses where relative differences are more important than absolute
values.
[Figure: x-axis accelerometer readings plotted against time (0 to 10 seconds).]
Formula:
$$X_{\text{minmax}} = \frac{X - X_{\min}}{\left| X_{\max} - X_{\min} \right|}$$
In this formula, Xmin and Xmax are the minimum and maximum values in the
data, respectively.
Use Case: Min-Max scaling is beneficial when the dataset needs to be
normalized within a specific range while preserving the distinctions in value
ranges.
3. Loss of Information:
• In scenarios where the maximum and minimum values carry important
information (e.g., extreme but meaningful fluctuations), Min-Max scaling
might lead to a loss of this information.
• Carefully consider the context of your data. If the extreme values
are meaningful and should be preserved, explore alternative scaling
techniques, or modify the range of Min-Max scaling accordingly.
[Figure: value plotted against time (seconds).]
Figure 5.3 illustrates a 10-second block from a large dataset alongside its
min-max scaled version. Notably, the scaled data is tightly clustered within a small
range, indicating a possible influence of outliers on the normalization process.
Such clustering highlights the necessity of addressing outliers before applying
Min-Max scaling to ensure a more evenly distributed range of transformed data.
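The clustering effect can be reproduced on synthetic data. The sketch below is one possible way of limiting outlier influence, namely clipping to the 1st and 99th percentiles before scaling; the percentile limits, the random signal, and the single injected outlier are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(0.0, 1.0, size=1000)
signal[10] = 50.0  # one artificial outlier

def min_max(x):
    return (x - x.min()) / (x.max() - x.min())

def interquartile_range(x):
    q25, q75 = np.percentile(x, [25, 75])
    return q75 - q25

# Scaling the raw signal: the outlier defines the maximum,
# so almost all samples are squeezed into a narrow band.
scaled_raw = min_max(signal)

# Clipping to the 1st-99th percentile range first (assumed limits).
lo, hi = np.percentile(signal, [1, 99])
scaled_clipped = min_max(np.clip(signal, lo, hi))

spread_raw = interquartile_range(scaled_raw)
spread_clipped = interquartile_range(scaled_clipped)
print(spread_raw, spread_clipped)
```

Whether clipping is appropriate depends on whether the extreme values are sensor artifacts or genuine behavior.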
Filtering Techniques
Filtering plays a crucial role in data preprocessing, particularly with accelerometer
data, where it is vital to isolate the frequencies of interest and reduce noise.
Different types of filters serve distinct purposes, depending on the nature of the
data and the analysis goals.
It is important to note that filtering is not always necessary; its application depends
on the specific requirements and nature of the data being analyzed. The decision
to use filters should be based on the scientist’s assessment of the data quality and
the objectives of the study. When deemed appropriate, filtering can significantly
enhance the signal quality by reducing noise, thereby facilitating more accurate
activity recognition.
# 1. Low-pass filtering
from scipy.signal import butter, filtfilt
Code Explanation:
• from scipy.signal import butter, filtfilt: This line imports the
butter and filtfilt functions from the scipy.signal module. butter is used
for creating the filter coefficients, and filtfilt is a function that applies the
filter to a data sequence.
• def apply_low_pass_filter(data, cutoff_frequency, sampling_rate, order =
5): This function, apply_low_pass_filter, is designed to apply a low-pass
filter to the input data. It takes four parameters: the data to be filtered,
the cutoff frequency (cutoff_frequency), the sampling rate of the data
(sampling_rate), and the order of the filter (order), which defaults to 5 if
not specified.
• nyquist = 0.5 * sampling_rate: The Nyquist frequency is calculated as
half of the sampling rate. It represents the highest frequency that can be
effectively captured by the sampling process.
• normal_cutoff = cutoff_frequency / nyquist: The cutoff frequency is
normalized by dividing it by the Nyquist frequency. This normalization is
necessary because the butter function expects the cutoff frequency in units
relative to the Nyquist frequency.
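Putting the steps described above together, a low-pass filter function might look like the following sketch. It mirrors the structure of the high-pass and band-pass functions in this section; the synthetic test signal (a 1 Hz movement plus 15 Hz noise, sampled at 50 Hz) and the 5 Hz cutoff are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def apply_low_pass_filter(data, cutoff_frequency, sampling_rate, order=5):
    nyquist = 0.5 * sampling_rate                  # highest representable frequency
    normal_cutoff = cutoff_frequency / nyquist     # cutoff as a fraction of Nyquist
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return filtfilt(b, a, data)                    # forward-backward, zero phase

# Synthetic signal standing in for accelerometer data.
sampling_rate = 50
t = np.arange(500) / sampling_rate                 # 10 seconds
slow = np.sin(2 * np.pi * 1.0 * t)                 # component to keep
fast = 0.5 * np.sin(2 * np.pi * 15.0 * t)          # component to remove
filtered = apply_low_pass_filter(slow + fast, cutoff_frequency=5.0,
                                 sampling_rate=sampling_rate)

# Compare away from the edges, where filter transients can occur.
inner = slice(50, -50)
rms_error = np.sqrt(np.mean((filtered[inner] - slow[inner]) ** 2))
print(rms_error)
```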
# 2. High-pass filter
def apply_high_pass_filter(data, cutoff_frequency, sampling_rate,
order=5):
nyquist = 0.5 * sampling_rate
normal_cutoff = cutoff_frequency / nyquist
b, a = butter(order, normal_cutoff, btype='high', analog=False)
return filtfilt(b, a, data)
Code Explanation:
• def apply_high_pass_filter(data, cutoff_frequency, sampling_rate, order=5):
Similar to the low-pass filter function, this function applies a high-pass
filter to the input data. The parameters are the same: the data to be
filtered, the cutoff frequency, the sampling rate, and an optional filter
order.
• The Nyquist frequency calculation (nyquist = 0.5 * sampling_rate) and the
normalization of the cutoff frequency (normal_cutoff = cutoff_frequency / nyquist)
are identical to the low-pass filter code. These steps are standard
in preparing the parameters for filter design in digital signal processing.
• b, a = butter(order, normal_cutoff, btype='high', analog=False): This
line uses the butter function to design a high-pass filter. The only change
from the low-pass filter is the btype='high' argument, indicating that
it is a high-pass filter. The function returns the filter coefficients b and a.
• The application of the filter with filtfilt(b, a, data) is the same as in the
low-pass filter code. filtfilt is used for zero-phase filtering, ensuring no
phase shift in the output signal.
• cutoff_freq = 0.5 (cutoff frequency in Hz) and filtered_data_hpf =
apply_high_pass_filter(selected_data['ax'], cutoff_freq, sampling_rate):
The high-pass filter is applied to the data (selected_data['ax']) with a
specified cutoff frequency (0.5 Hz in this case) and the sampling rate of the data.
# 3. Band-pass filter
def apply_band_pass_filter(data, lowcut, highcut, sampling_rate,
order=5):
nyquist = 0.5 * sampling_rate
low = lowcut / nyquist
high = highcut / nyquist
b, a = butter(order, [low, high], btype='band', analog=False)
return filtfilt(b, a, data)
Code Explanation:
• apply_band_pass_filter:
– The Nyquist frequency is calculated as half of the sampling rate, and the
low and high cutoff frequencies are normalized relative to it.
– The butter function designs a band-pass filter (btype='band') with the
specified order and frequency range, returning the filter coefficients b
and a.
• The filtfilt function applies the filter to the data, ensuring zero phase
distortion.
• The band-pass filter is applied to a segment of accelerometer data
(selected_data['ax']) with defined lowcut and highcut frequencies.
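The SciPy documentation notes that the (b, a) transfer-function form can become numerically sensitive for higher filter orders and recommends second-order sections instead. The following variant is a sketch of that alternative, not the function used in this chapter; the name `apply_band_pass_filter_sos` and the synthetic signal (0.1 Hz drift, 2 Hz movement, 15 Hz noise at a 50 Hz sampling rate) are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def apply_band_pass_filter_sos(data, lowcut, highcut, sampling_rate, order=5):
    # Passing fs lets butter accept the cutoffs directly in Hz.
    sos = butter(order, [lowcut, highcut], btype='band',
                 fs=sampling_rate, output='sos')
    return sosfiltfilt(sos, data)   # zero-phase filtering, like filtfilt

sampling_rate = 50
t = np.arange(1000) / sampling_rate                # 20 seconds
drift = np.sin(2 * np.pi * 0.1 * t)                # below the band
movement = np.sin(2 * np.pi * 2.0 * t)             # inside the band
noise = 0.5 * np.sin(2 * np.pi * 15.0 * t)         # above the band

filtered = apply_band_pass_filter_sos(drift + movement + noise,
                                      lowcut=0.5, highcut=5.0,
                                      sampling_rate=sampling_rate)

# Evaluate in the middle of the record to stay clear of edge transients.
middle = slice(250, 750)
rms_error = np.sqrt(np.mean((filtered[middle] - movement[middle]) ** 2))
print(rms_error)
```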
4. Moving Average Filter: Smoothens data by averaging subsets of data points,
effective for reducing random noise.
Python example:
Code Explanation:
• apply_moving_average_filter:
– The function takes in a dataset (data) and a window_size, indicating
the number of samples to be averaged over.
– The default window size is set to 20, meaning that by default, each
point in the filtered data will be the average of 20 consecutive points
in the original data.
• Applying the Filter:
– The function uses np.convolve to apply a moving average filter.
This function convolves the data with a window of specified size,
effectively computing a running average.
– The window is created using np.ones(window_size) / window_size,
which generates an array of ones (of length window_size) divided by
the window size, ensuring that the sum of the window’s elements is 1.
– The mode='valid' argument ensures that the output is only computed
for points where the window fits entirely within the signal boundaries,
so the output is shorter than the input by window_size - 1 samples.
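A minimal sketch consistent with this explanation is shown below. The function name and default window size follow the description in the text; the short ramp signal and the window size of 4 are assumptions used to make the arithmetic easy to check.

```python
import numpy as np

def apply_moving_average_filter(data, window_size=20):
    # Equal weights that sum to 1, so the output is a running mean.
    window = np.ones(window_size) / window_size
    # 'valid' keeps only positions where the window lies fully inside the data.
    return np.convolve(data, window, mode='valid')

data = np.arange(10, dtype=float)          # 0, 1, ..., 9
smoothed = apply_moving_average_filter(data, window_size=4)

print(len(data), len(smoothed))            # 10 and 10 - 4 + 1 = 7
print(smoothed)                            # means of 4 consecutive values
```

If the output must keep the input length, mode='same' is an option, at the cost of edge values that are averaged against implicit zeros.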
Code Explanation:
• apply_savitzky_golay_filter:
– This function applies the Savitzky-Golay filter to the provided data.
– Parameters: the data, the size of the smoothing window (window_size),
and the order of the polynomial used in the filter (polynomial_order).
• Applying the Savitzky-Golay Filter:
– The savgol_filter function from scipy.signal is used to apply the filter.
It performs a polynomial regression on a sliding window of data
points, effectively smoothing the data.
– The window_size must be odd because the filter works by fitting a
polynomial to a symmetric window of data around each point. This
ensures that each point has an equal number of neighbouring data
points on both sides, which is essential for maintaining the symmetry
of the regression fit.
• Filter Parameters:
– window_size is set to 29, which must be an odd number for the reasons
explained above.
– polynomial_order is set to 2, indicating that a quadratic polynomial is
used for the fit within each window.
• Finally, the function is used to apply the Savitzky-Golay filter to a segment
of accelerometer data (selected_data[‘ax’]).
This technique smooths the data while preserving key features like peaks
and troughs, making it easier to detect meaningful patterns in the animal’s
movement.
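The following sketch shows what such a function could look like, using the parameter names and values mentioned in the explanation (window_size = 29, polynomial_order = 2). The explicit odd-window check and the synthetic noisy signal are assumptions added for the example.

```python
import numpy as np
from scipy.signal import savgol_filter

def apply_savitzky_golay_filter(data, window_size=29, polynomial_order=2):
    if window_size % 2 == 0:
        # Enforce the symmetric-window requirement described in the text.
        raise ValueError("window_size must be odd")
    return savgol_filter(data, window_length=window_size,
                         polyorder=polynomial_order)

# Synthetic signal: a slow oscillation plus random noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 500)
clean = np.sin(2 * np.pi * 0.3 * t)
noisy = clean + rng.normal(0.0, 0.2, size=t.size)

smoothed = apply_savitzky_golay_filter(noisy)

error_before = np.sqrt(np.mean((noisy - clean) ** 2))
error_after = np.sqrt(np.mean((smoothed - clean) ** 2))
print(error_before, error_after)
```

Because each window is fitted with a quadratic, a signal that is itself quadratic passes through the filter unchanged, which is one way to see why peaks are distorted less than with a plain moving average.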
6. Kalman Filtering: A sophisticated technique that combines measurements
over time to produce estimates of unknown variables.
• It is useful in settings where the data is affected by uncertainty and various
types of noise.
• The Kalman filter is an iterative filter that estimates the state of a dynamic
system from a series of incomplete and noisy measurements.
• It is widely used in applications like tracking and navigation systems,
where accuracy and efficiency are crucial.
• The Kalman filter works in two steps: prediction and update. In the
prediction step, the filter produces estimates of the current state variables,
along with their uncertainties. During the update step, the filter refines its
predictions based on new measurements.
Python Example:
# 6. Kalman filter
• Filter Parameters:
– The effectiveness of a Kalman filter greatly depends on the tuning
of its parameters, including the process noise covariance (kf.Q) and
measurement noise covariance (kf.R).
– If the measurement noise covariance (kf.R) is set too low, the filter might
place too much trust in the measurements and not enough in its own
model predictions, leading to filtered data that closely follows the original
measurements.
– Equally, if the process noise covariance (kf.Q) is set too low, it implies
that the system is expected to change very little, leading the filter to make
only minor adjustments to the data.
• Simple State Model:
– The example provided uses a very basic 1D state model, represented by
the state transition matrix (kf.F) and measurement matrix (kf.H). This
simple model might not be adequate to capture the complexities of animal
movement, leading to minimal adjustments by the filter.
– In real-world applications, especially with complex movements, a more
sophisticated model that includes aspects like velocity and possibly
acceleration would be more effective.
• Nature of Animal Movements:
– If the animal’s movements are relatively smooth and predictable, with
minimal sharp changes or erratic behavior, the Kalman filter’s impact
might be less noticeable, especially if it is configured with a simple state
model.
Key Considerations:
• Kalman filtering is best suited for scenarios with significant noise or where
integrating data over time is crucial for accurate estimation.
• The effectiveness of the filter depends on careful tuning and, often, a deep
understanding of the system being modelled. This tuning involves setting the
right parameters and possibly designing a more complex system model.
• In cases where the original data is already of high quality, or if the system
dynamics do not align well with the model of the filter, the impact of Kalman
filtering may be subtle.
Practical Advice:
• Experiment with the filter parameters and consider more complex models for
the state and measurement processes.
• Understand the characteristics of the accelerometer data and the nature of the
movements being captured.
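To make the roles of the process noise (Q) and measurement noise (R) concrete, here is a hand-written scalar Kalman filter in plain NumPy. It is a simplified stand-in, not the implementation discussed above: it assumes a constant-state model (F = H = 1), and the function name, the noise values, and the synthetic measurements are all illustrative assumptions.

```python
import numpy as np

def kalman_filter_1d(measurements, process_var=1e-3, measurement_var=0.1):
    """Scalar Kalman filter with F = H = 1 (state assumed roughly constant)."""
    measurements = np.asarray(measurements, dtype=float)
    estimates = np.empty_like(measurements)
    x = measurements[0]    # initial state estimate
    p = 1.0                # initial estimate variance (assumed)
    for i, z in enumerate(measurements):
        # Prediction: state carried over, uncertainty grows by Q.
        p = p + process_var
        # Update: the Kalman gain weighs the measurement against the prediction.
        k = p / (p + measurement_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates[i] = x
    return estimates

rng = np.random.default_rng(0)
true_value = 1.0
noisy = true_value + rng.normal(0.0, 0.3, size=500)

smooth = kalman_filter_1d(noisy, process_var=1e-3, measurement_var=0.09)
# Very small R: the filter trusts the measurements and barely changes them.
trusting = kalman_filter_1d(noisy, process_var=1e-3, measurement_var=1e-6)

mse_noisy = np.mean((noisy - true_value) ** 2)
mse_smooth = np.mean((smooth - true_value) ** 2)
print(mse_noisy, mse_smooth)
```

The second call illustrates the tuning issue described earlier: when R is set far too low, the filtered output follows the raw measurements almost exactly.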
Summing up
Applying filters to accelerometer data for animal activity recognition requires
careful consideration of several factors, including the nature of the signal, the
characteristics of the specific behaviors you are interested in, and the sampling
rate of the accelerometer. Here is a simple guide on how to apply these filters and
what to keep in mind:
Understanding Your Signal
• Characteristics of Animal Movement:
– Identify the typical frequencies associated with the animal behaviors
of interest. For instance, the frequency range of walking might differ
significantly from running or resting.
• Noise Characteristics:
– Understand the source of noise in your data. Is it high-frequency sensor
noise, low-frequency drifts, or transient disturbances?
Sampling Rate Considerations
• Nyquist Theorem: Ensure your sampling rate is at least twice the highest
frequency component you wish to capture (Nyquist rate). For instance, if
the maximum frequency of interest is 50 Hz, you should sample at least at
100 Hz.
• Aliasing: If the sampling rate is too low, you risk aliasing, where high-
frequency components are misrepresented in your data.
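Aliasing can be demonstrated numerically. In this sketch a 7 Hz sine wave is sampled once at 50 Hz (above the Nyquist rate of 14 Hz) and once at 10 Hz (below it); the frequencies, durations, and helper names are assumptions chosen so that the spectral peaks fall on exact FFT bins.

```python
import numpy as np

TRUE_FREQ = 7.0  # Hz, the movement frequency we want to capture

def sample_sine(sampling_rate, duration=10.0):
    n = int(round(duration * sampling_rate))
    t = np.arange(n) / sampling_rate
    return np.sin(2 * np.pi * TRUE_FREQ * t)

def strongest_frequency(signal, sampling_rate):
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sampling_rate)
    spectrum[0] = 0.0                       # ignore the DC component
    return freqs[np.argmax(spectrum)]

f_adequate = strongest_frequency(sample_sine(50.0), 50.0)
f_aliased = strongest_frequency(sample_sine(10.0), 10.0)

print(f_adequate)   # the true 7 Hz component
print(f_aliased)    # appears at |7 - 10| = 3 Hz instead
```

Once the data has been recorded, the aliased 3 Hz component cannot be distinguished from a genuine 3 Hz movement, which is why the sampling rate must be chosen before data collection.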
Applying Filters
• Low-Pass Filter:
– When to Use: If your data contains high-frequency noise that is not
characteristic of the animal’s movement.
– Setting Cutoff Frequency: The cutoff frequency should be set just above
the highest frequency of interest for the animal behavior.
• High-Pass Filter:
– When to Use: To remove bias or drift, especially if the accelerometer
captures gravitational effects.
– Cutoff Frequency: Set it below the typical frequency range of the behaviors
but high enough to eliminate drift.
• Band-Pass Filter:
– When to Use: To isolate frequencies that correspond to specific animal
Feature Extraction
Feature extraction is another important phase in the analysis of accelerometer data.
It involves transforming raw accelerometer signals into a set of representative and
informative metrics that can be used for further analysis and interpretation. In this
section, we will discuss time-domain and frequency-domain features.
Time-domain features are extracted directly from the raw accelerometer data
without any transformation to the frequency domain. These features often include
statistical measures that capture the central tendency, dispersion, and shape of
the signal over time. They are intuitive and often straightforward to compute,
making them a common option in numerous applications. Examples include
mean, variance, standard deviation, and other statistical moments.
Frequency-domain features, on the other hand, involve transforming the time-
series data into the frequency domain using techniques such as Fourier and
Wavelet Transforms [ref]. These features are crucial in identifying the dominant
frequencies and periodicities in the data, which can be particularly informative
in understanding periodic or repetitive movements in animal behavior. Spectral
density, power spectral density, and specific frequency bands’ energy are typical
examples of frequency-domain features.
Time-domain Features
Following is a list of time-domain features extracted from accelerometer signals
(x, y, z) in animal behavior studies:
Mean
The mean feature of accelerometer data is a fundamental statistical measure used
to describe the central tendency of the acceleration values along each axis (x, y,
z) of the accelerometer. This feature is particularly useful in understanding the
average behavior over a period of time.
If we have a set of accelerometer readings along a single axis (let’s say the x-axis)
recorded over a period of time, the mean (average) of these readings can be
calculated as follows:
Let $x_1, x_2, \ldots, x_n$ be the accelerometer readings along the x-axis over $n$ samples.
The mean ($\bar{x}$) of these readings is given by:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
where,
• $\bar{x}$ is the mean of the accelerometer readings along the x-axis.
• $x_i$ is the ith accelerometer reading.
• $n$ is the total number of samples.
The formula sums up all the accelerometer readings along the x-axis and then
divides this total by the number of readings to find the average value.
Application to Accelerometer Data:
• The same formula can be applied independently to the y-axis and z-axis
readings to get the mean values for those axes.
• In the context of accelerometer data, calculating the mean for each axis helps
in understanding the average acceleration or movement in each direction.
• This can be particularly important in scenarios where you are interested in the
overall trend or bias in the movement.
We can also calculate the average of the three axis means:
$$\text{combined mean} = \frac{\bar{x} + \bar{y} + \bar{z}}{3}$$
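In code, the per-axis means and the combined mean reduce to two NumPy calls. The small window of readings below is made up for illustration, with columns ordered x, y, z.

```python
import numpy as np

# Hypothetical window of four samples; columns are the x, y, and z axes.
window = np.array([
    [0.1, 0.9, 0.2],
    [0.3, 1.1, 0.0],
    [0.2, 1.0, 0.1],
    [0.2, 1.0, 0.1],
])

axis_means = window.mean(axis=0)      # one mean per axis
combined_mean = axis_means.mean()     # average of the three axis means

print(axis_means)
print(combined_mean)
```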
Standard Deviation (SD)
The standard deviation is a statistical measure that quantifies the amount
$$\text{Variance}_{\text{axis}} = SD_{\text{axis}}^{2}$$
The absolute value ensures that the distance is always a non-negative number.
Similarly, you can calculate the mean distances for other pairs of axes, Mean
Distancexz, Mean Distanceyz.
Context in Accelerometer Data Analysis:
• Comparing Movement Patterns: This measure can be particularly useful in
studies where the relationship or disparity between movements in different
planes is of interest. For example, in gait analysis, comparing vertical and
horizontal movements could provide valuable insights.
• Movement Synchronization: The mean distance can help assess the
synchronization or divergence in movement between different axes. A smaller
mean distance might indicate synchronized movements, while a larger mean
distance could suggest divergent or independent movements across axes.
• Activity Characterization: Different activities may exhibit characteristic
patterns in the mean distances between axes. For instance, activities like
walking or running might show distinct mean distance patterns compared to
stationary activities.
Signal Magnitude (SM)
Signal Magnitude (SM) is a key metric used in accelerometer data analysis that
represents the overall magnitude of the acceleration vector at each point in time.
It is a comprehensive measure that combines the acceleration data from all three
axes (x, y, and z) to provide a singular value representing the total acceleration.
Calculating SM:
The SM is calculated by determining the magnitude of the acceleration vector
at each individual data point and then aggregating these values as needed (e.g.,
averaging over time).
The formula for the magnitude of the acceleration vector at the ith data point is:
$$SM_i = \sqrt{x_i^2 + y_i^2 + z_i^2}$$
where $x_i$, $y_i$ and $z_i$ are the acceleration readings at the ith point in time for the
x, y, z axes, respectively. Unlike the Average Signal Magnitude, which involves
averaging these magnitudes, SM is often considered for each individual data point
or used in other forms of aggregation.
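A short sketch of the per-sample calculation, followed by one possible aggregation (the mean over the window). The function name and the three sample readings are assumptions chosen so the results are easy to verify by hand.

```python
import numpy as np

def signal_magnitude(x, y, z):
    """Magnitude of the acceleration vector at each sample."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    return np.sqrt(x ** 2 + y ** 2 + z ** 2)

x = [3.0, 0.0, 1.0]
y = [4.0, 0.0, 2.0]
z = [0.0, 1.0, 2.0]

sm = signal_magnitude(x, y, z)     # one value per sample
average_sm = sm.mean()             # aggregated over the window

print(sm)
print(average_sm)
```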
Context in Accelerometer Data Analysis:
• Overall Acceleration: SM provides a direct measure of the total acceleration
experienced by the sensor at each point in time, accounting for movements in
all directions.
• Activity Intensity: In applications such as physical activity monitoring or gait
analysis, SM is a valuable tool for assessing the intensity of movements. It
High kurtosis in accelerometer data could imply more frequent extreme movements,
while low kurtosis could suggest a flatter distribution of movements. This can
be particularly relevant in studies where the presence of extreme movements or
outliers is of interest.
Application in Accelerometer Data Analysis:
Skewness and Kurtosis offer a nuanced view of the accelerometer data’s
distribution, providing information that goes beyond basic measures like mean
and standard deviation. Analyzing these metrics can help in understanding the
underlying characteristics of the movement, such as the presence of sporadic,
intense activities or a tendency towards certain types of motion.
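The effect of sporadic, intense movements on these two metrics can be seen with `scipy.stats`. In this sketch the "calm" and "spiky" signals are synthetic, and kurtosis is reported as excess kurtosis (SciPy's default, where a normal distribution scores 0), which may differ from the convention used elsewhere by a constant of 3.

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)

# Calm signal: small, roughly symmetric fluctuations.
calm = rng.normal(0.0, 1.0, size=1000)

# Spiky signal: the same fluctuations plus a few large positive bursts.
spiky = calm.copy()
spiky[::100] = 8.0

for name, signal in (("calm", calm), ("spiky", spiky)):
    print(name, skew(signal), kurtosis(signal))
```

The bursts pull the distribution's tail to the right (positive skewness) and make extreme values more frequent than a normal distribution would predict (high kurtosis).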
Root Mean Square (RMS)
RMS is a statistical measure used to quantify the magnitude of a varying quantity.
In the context of accelerometer data, RMS is used to represent the average
magnitude of the acceleration readings, providing an overall measure of the
intensity of motion.
Calculating RMS:
The RMS value for accelerometer data is calculated by squaring each reading,
averaging these squared values, and then taking the square root of the average.
The formula for RMS along a single axis (say, the x-axis) is:
$$RMS_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}$$
This calculation is typically done separately for each axis (x, y, and z) to obtain the
RMS values for each directional component of the acceleration.
Context in Accelerometer Data Analysis:
• Overall Intensity: RMS provides a comprehensive measure of the overall
intensity of the movement captured by the accelerometer. It effectively
combines the amplitude and frequency components of the acceleration signal
into a single value.
• Steady-state and Vibrational Analysis: In applications such as vibration
analysis or monitoring steady-state activities, RMS is particularly valuable
as it reflects both the amplitude and the consistency of the movements or
vibrations.
• Comparison Across Axes: By calculating RMS for each axis separately, it
is possible to compare the intensity of movements in different directions,
which can be insightful in understanding the nature of physical activities or
behaviors.
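The RMS calculation can be checked against a signal with a known answer: a sine wave of amplitude A has an RMS of A divided by the square root of 2. The sampling rate, frequency, and amplitude below are assumptions for the example.

```python
import numpy as np

def rms(values):
    values = np.asarray(values, dtype=float)
    # Square each reading, average the squares, take the square root.
    return np.sqrt(np.mean(values ** 2))

sampling_rate = 100
t = np.arange(200) / sampling_rate             # 2 seconds, 10 full cycles
x = 2.0 * np.sin(2 * np.pi * 5.0 * t)          # amplitude 2, zero mean

rms_x = rms(x)
mean_x = x.mean()

print(mean_x)   # close to 0: the mean hides the oscillation
print(rms_x)    # close to 2 / sqrt(2)
```

This contrast is why RMS is often preferred over the mean as an intensity measure for oscillating signals.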
where,
• max(x) is the maximum value of the accelerometer readings in the window.
• min(x) is the minimum value of the accelerometer readings in the window.
Context in Accelerometer Data Analysis:
• Range of Motion: The Peak-to-Peak feature is useful for understanding the range
of motion or the extent of movement variability within each window of data.
• Activity Characterization: This measure can help distinguish between
different types of animal activities or behaviors, especially those that involve
varying degrees of movement and amplitude.
• Signal Fluctuation: It provides insights into the fluctuation levels of the
signal, which can be crucial in applications like gait assessment for
medical diagnostics in animals, or activity classification.
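As a small sketch, the peak-to-peak value of a window is the difference between its maximum and minimum. The two windows below are made-up readings meant to resemble a resting and a walking animal.

```python
import numpy as np

def peak_to_peak(window):
    window = np.asarray(window, dtype=float)
    return window.max() - window.min()

resting = [0.02, -0.01, 0.00, 0.01]
walking = [0.80, -0.60, 0.90, -0.70]

p2p_resting = peak_to_peak(resting)
p2p_walking = peak_to_peak(walking)

print(p2p_resting, p2p_walking)
```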
Integrals and Squared Integrals
In the analysis of accelerometer data, the concepts of integrals and squared
integrals play a crucial role in extracting meaningful information from raw
acceleration readings. These measures are particularly useful in understanding the
overall movement dynamics and the energy characteristics of the motion being
measured.
Integrals in Accelerometer Data:
The integral of accelerometer data refers to the cumulative sum of acceleration
readings over time. This measure provides insight into the overall movement
magnitude or displacement (when double-integrated) of an object.
Formula:
For discrete accelerometer data, where readings are taken at regular intervals, the
integral over a time period can be approximated as the sum of the absolute values
of the accelerometer readings across all three axes (x, y, and z), multiplied by the
time interval between these readings.
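Integrals can be defined:
$$\text{Integrals} = \int_{t=0}^{T} |x(t)|\,dt + \int_{t=0}^{T} |y(t)|\,dt + \int_{t=0}^{T} |z(t)|\,dt$$
Squared integrals can be defined:
$$\text{Squared Integrals} = \left(\int_{t=0}^{T} |x(t)|\,dt\right)^{2} + \left(\int_{t=0}^{T} |y(t)|\,dt\right)^{2} + \left(\int_{t=0}^{T} |z(t)|\,dt\right)^{2}$$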
Both the integral and squared integral provide distinct yet complementary views
of accelerometer data.
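For discrete data, the description above corresponds to summing absolute readings and multiplying by the sampling interval. The sketch below assumes this rectangle-rule approximation and takes the squared integral as the sum of the squared per-axis integrals; the function name and sample values are illustrative, while `delta_t` denotes the time between samples.

```python
import numpy as np

def integral_features(x, y, z, delta_t):
    # Rectangle-rule approximation of the integral of |a(t)| for each axis.
    per_axis = [np.sum(np.abs(np.asarray(a, dtype=float))) * delta_t
                for a in (x, y, z)]
    integrals = sum(per_axis)
    squared_integrals = sum(value ** 2 for value in per_axis)
    return integrals, squared_integrals

x = [1.0, -1.0, 1.0, -1.0]
y = [2.0, 2.0, 2.0, 2.0]
z = [0.0, 0.0, 0.0, 0.0]

integrals, squared_integrals = integral_features(x, y, z, delta_t=0.5)
print(integrals, squared_integrals)
```

With these values the per-axis integrals are 2, 4, and 0, giving 6 for the integral feature and 20 for the squared integral feature.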
Energy
The Energy feature typically refers to a measure that captures the intensity or
power of movements recorded by the accelerometer. This feature is often derived
from the magnitude of the acceleration signal and is indicative of the vibrational
or dynamic energy present in the motion.
Calculation of Energy:
Energy in accelerometer data can be quantified in several ways, but a common
approach is to calculate the sum of the squared accelerometer readings, which is
analogous to computing the signal’s power. For discrete accelerometer data, this
can be expressed as:
$$\text{Energy} = \sum_{i=1}^{n} \left(x_i^2 + y_i^2 + z_i^2\right)$$
Frequency-domain Features
As we move from time-domain features to frequency-domain features in
accelerometer data analysis, we examine a different aspect of understanding
animal behavior and movement. While time-domain features provide insights
based on the raw accelerometer readings over time, frequency-domain features
offer a perspective based on the frequency content of these signals. This shift
allows us to uncover patterns and characteristics that are not immediately apparent
in the time-domain.
Frequency-domain analysis involves transforming the time-based accelerometer
signals into the frequency domain. This transformation reveals the signal’s
frequency components, highlighting how the signal’s power is distributed across
different frequencies.
Why Frequency-domain Features?
• Capturing Repetitive Patterns: Many animal activities, such as walking,
running, or specific behavioral patterns, exhibit repetitive motions. These
patterns are more apparent in the frequency domain, where they manifest as
distinct peaks or characteristic distributions.
Fourier Transform
The Fourier Transform is a mathematical transformation used to analyze the
frequencies contained in a signal. It decomposes a time function into its basic
frequencies.
Mathematical Formula:
The continuous Fourier Transform F(ω) of a continuous time-domain signal f (t) is
given by:
$$F(\omega) = \int_{-\infty}^{+\infty} f(t)\, e^{-j\omega t}\, dt$$
where,
• F(ω) is the Fourier Transform of f (t).
• ω is the angular frequency (in radians per second).
• t is time.
• $e^{-j\omega t}$ is a complex exponential function, where j is the imaginary unit.
The result F(ω) is a function of frequency with complex values. The magnitude
of F(ω) gives the amplitude of each frequency component in a signal, while the
phase of F(ω) provides the phase shift of each frequency component.
Discrete Fourier Transform (DFT)
In digital signal processing, we often work with discrete signals. The DFT is the
version of the Fourier Transform used for sequences of discrete values.
Mathematical Formula:
For a discrete signal x[n] with N samples, the DFT X[k] is defined as:
$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j \frac{2\pi}{N} k n}$$
where,
• X[k] is the kth element of the frequency-domain representation of the
sequence, where k corresponds to the frequency bin.
• x[n] is the nth sample in the time-domain sequence.
• N is the total number of samples.
• j is the imaginary unit ($\sqrt{-1}$).
The DFT decomposes the signal into a sum of sinusoids of different frequencies,
each with its own amplitude and phase.
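The DFT definition can be implemented directly and compared with NumPy's FFT, which computes the same quantity much more efficiently. The direct version below is a teaching sketch (it needs on the order of N squared operations); the function name and the test signal are assumptions.

```python
import numpy as np

def dft(x):
    """Direct evaluation of X[k] = sum_n x[n] * exp(-j * 2*pi/N * k * n)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)                       # column of frequency bins
    basis = np.exp(-2j * np.pi * k * n / N)    # N x N matrix of exponentials
    return basis @ x

# A cosine completing exactly 4 cycles over 64 samples.
N = 64
signal = np.cos(2 * np.pi * 4 * np.arange(N) / N)

X_direct = dft(signal)
X_fft = np.fft.fft(signal)

print(np.allclose(X_direct, X_fft))
print(np.abs(X_direct[4]))   # energy concentrated in bin 4 (magnitude N / 2)
```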
where $PSD(f_i)$ is the normalised PSD at frequency $f_i$, and N is the total
number of frequency bins.
Application in Activity Recognition:
– Complexity Analysis: Higher Spectral Entropy indicates a more complex
or less predictable frequency pattern, which might be characteristic of
certain types of activities.
– Distinguishing Between Activities: Different activities may exhibit
distinct patterns of Spectral Entropy, aiding in their classification and
recognition.
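A minimal sketch of Spectral Entropy is given below. It assumes the common Shannon form, the negative sum of PSD(f_i) times log2 of PSD(f_i) over the normalised PSD, without further scaling; other conventions (for example dividing by log2 of N) exist, so absolute values may differ between implementations. The two test signals are synthetic.

```python
import numpy as np

def spectral_entropy(signal):
    signal = np.asarray(signal, dtype=float)
    # Power spectrum of the mean-removed signal (positive frequencies only).
    psd = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    total = psd.sum()
    if total == 0:
        return 0.0                      # flat signal: no spectral content
    p = psd / total                     # normalise so the bins sum to 1
    p = p[p > 0]                        # skip empty bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
t = np.arange(500) / 50.0                        # 10 s at 50 Hz
periodic = np.sin(2 * np.pi * 2.0 * t)           # regular, repetitive motion
irregular = rng.normal(size=t.size)              # unpredictable motion

entropy_periodic = spectral_entropy(periodic)
entropy_irregular = spectral_entropy(irregular)
print(entropy_periodic, entropy_irregular)
```

The periodic signal concentrates its power in one bin and scores close to zero, while the irregular signal spreads power across all bins and scores much higher.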
• Peak Frequency
Peak Frequency is a specific frequency-domain feature used in accelerometer
data analysis. It refers to the frequency within a given window of data that has
the highest amplitude in the frequency spectrum.
Difference from Dominant Frequency:
Peak Frequency vs. Dominant Frequency: Although “Peak Frequency” and
“Dominant Frequency” sound similar and are sometimes used
interchangeably, there can be subtle differences depending on the context
or specific implementation:
– Peak Frequency typically refers to the frequency with the highest peak
(maximum amplitude) in the spectrum, regardless of the total energy
content across the spectrum.
– Dominant Frequency often implies the frequency or frequencies that
contribute most significantly to the signal, which can be interpreted in
terms of energy (spectral energy) rather than just amplitude.
Context-Dependent Interpretation: In some cases, especially when the
spectrum has a clear single peak, both Peak Frequency and Dominant
Frequency might refer to the same frequency. However, in a more complex
spectrum with multiple peaks, the “dominant” aspect might consider
additional factors like the width or energy of the peak.
Calculation of Peak Frequency:
FFT Analysis: First, perform an FFT on the accelerometer data.
Identifying the Peak: The Peak Frequency is determined by finding the
frequency at which the FFT amplitude (magnitude) is maximum.
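The two steps can be sketched as follows. The function name, the decision to skip the zero-frequency (DC) bin, and the synthetic signal (a strong 2 Hz component and a weaker 6 Hz component at a 50 Hz sampling rate) are assumptions for illustration.

```python
import numpy as np

def peak_frequency(signal, sampling_rate):
    signal = np.asarray(signal, dtype=float)
    magnitudes = np.abs(np.fft.rfft(signal))            # FFT amplitude spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sampling_rate)
    # Skip bin 0 so a constant offset (e.g. gravity) is not reported as the peak.
    peak_index = 1 + np.argmax(magnitudes[1:])
    return freqs[peak_index]

sampling_rate = 50
t = np.arange(500) / sampling_rate
signal = (1.0 * np.sin(2 * np.pi * 2.0 * t)
          + 0.4 * np.sin(2 * np.pi * 6.0 * t)
          + 0.8)                                        # constant offset

print(peak_frequency(signal, sampling_rate))
```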
• Spectral Centroid
The Spectral Centroid is a measure that indicates the center of mass of the
frequency spectrum. It shows where the bulk of the energy in the spectrum is
concentrated, often used to characterize the sharpness of a signal.
where $f_i$ is the frequency and $a_i$ is the amplitude of the ith bin in the FFT.
# Output
x y z label
0 0.548814 0.715189 0.602763 walking
1 0.544883 0.423655 0.645894 walking
2 0.437587 0.891773 0.963663 walking
3 0.383442 0.791725 0.528895 walking
4 0.568045 0.925597 0.071036 walking
.. ... ... ... ...
995 0.698630 0.503697 0.025738 walking
996 0.774353 0.560374 0.082494 walking
997 0.475214 0.287293 0.879682 walking
998 0.284927 0.941687 0.546133 walking
999 0.323614 0.813545 0.697400 walking
# Spectral Centroid
spectral_centroid = np.sum(fft_freq * psd_normalized)
# Spectral Spread
spectral_spread = np.sqrt(np.sum(((fft_freq - spectral_centroid)**2) * psd_normalized))
# Spectral Flatness
spectral_flatness = np.exp(np.mean(np.log(psd + 1e-10))) / np.mean(psd + 1e-10)
The following function is created to extract time and frequency domain features
from the dataset:
# Pairwise correlation
corr_matrix = data.corr().values
for i, col1 in enumerate(data.columns):
    for j, col2 in enumerate(data.columns):
        if i < j:  # to avoid duplicate pairs
            features[f'correlation_{col1}_{col2}'] = corr_matrix[i, j]
# Energy calculation
energy = np.sum(data['x']**2 + data['y']**2 + data['z']**2)
features['energy'] = energy
# Combined entropy
combined_data = np.hstack((data['x'], data['y'], data['z']))
features['combined_entropy'] = calculate_entropy(combined_data)
# Spectral Energy
features[f'{col}_spectral_energy'] = calculate_spectral_energy(fft_vals)
# Dominant Frequency
features[f'{col}_dominant_frequency'] = calculate_dominant_frequency(positive_fft_vals, positive_fft_freq)
spectral_entropy = calculate_spectral_entropy(psd)
features[f'{col}_spectral_entropy'] = spectral_entropy
(spectral_centroid, spectral_spread, spectral_skewness,
 spectral_kurtosis, spectral_flatness) = calculate_spectral_features(
    positive_fft_vals, positive_fft_freq)
features[f'{col}_spectral_centroid'] = spectral_centroid
features[f'{col}_spectral_spread'] = spectral_spread
features[f'{col}_spectral_skewness'] = spectral_skewness
features[f'{col}_spectral_kurtosis'] = spectral_kurtosis
features[f'{col}_spectral_flatness'] = spectral_flatness
return features
• Frequency-domain Feature Calculations Loop: Iterates over ‘x’, ‘y’, and ‘z’
columns.
– Performs FFT and calculates frequency using np.fft.fft and np.fft.fftfreq.
– Filters to keep only positive frequencies.
– Calculates spectral energy, dominant frequency, PSD, spectral entropy,
peak frequency, and various spectral features (centroid, spread, skewness,
kurtosis, flatness) using custom functions.
• Return Statement: Returns the features dictionary containing all the calculated
features.
This function is comprehensive, covering a wide range of time-domain and
frequency-domain features, making it highly valuable for detailed signal analysis.
Now, the following function is defined to extract the features from the
accelerometer data using sliding windows:
window_features['label'] = window_label
features_list.append(window_features)
return pd.DataFrame(features_list)
# Example
preprocessed_data = feature_extraction_with_windows(df,
window_size,
step_size,
calculate_features, delta_t)
# Check the first five rows of the data
preprocessed_data.head()
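Since only the tail of the windowing function is shown above, the following self-contained sketch illustrates the general idea of sliding-window feature extraction. It is not the chapter's function: `sliding_window_features` and `simple_features` are hypothetical names, the feature set is reduced to per-axis mean and standard deviation, and labelling each window by its most frequent label is an assumed rule.

```python
import numpy as np
import pandas as pd

def simple_features(window):
    """Hypothetical stand-in for a fuller feature function."""
    features = {}
    for col in ['x', 'y', 'z']:
        features[f'{col}_mean'] = window[col].mean()
        features[f'{col}_std'] = window[col].std()
    return features

def sliding_window_features(df, window_size, step_size, feature_fn):
    rows = []
    # Slide over the data; windows overlap when step_size < window_size.
    for start in range(0, len(df) - window_size + 1, step_size):
        window = df.iloc[start:start + window_size]
        row = feature_fn(window[['x', 'y', 'z']])
        # Assumed rule: the window takes its most frequent label.
        row['label'] = window['label'].mode().iloc[0]
        rows.append(row)
    return pd.DataFrame(rows)

# Synthetic data shaped like the example output shown earlier.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=['x', 'y', 'z'])
df['label'] = 'walking'

features_df = sliding_window_features(df, window_size=100, step_size=50,
                                      feature_fn=simple_features)
print(features_df.shape)
```

With 1000 samples, a window of 100, and a step of 50 (50% overlap), the loop produces 19 windows, each summarised as one row.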
Summary
In Chapter 5 we looked into the critical stages of preprocessing and feature
extraction in the context of animal behavior research using accelerometer data.
We talked about the fundamental principles underlying accelerometer data, setting
a foundation for understanding how these datasets can be interpreted and utilized.
Then we introduced the various steps of data preprocessing. Starting with data
cleaning, followed by a thorough discussion on data scaling and normalization,
highlighting the importance of transforming data into a standard format for
consistency and comparability across different datasets.
We further discussed filtering techniques. Filtering helps in reducing noise
and improving the signal quality, which is crucial for accurate analysis and
interpretation of animal movements and behaviors. The section provided a
detailed overview of different filtering methods and their practical applications.
Moving into the core of the chapter, we covered feature extraction extensively. We
discussed windowing in feature extraction, a technique that involves dividing the
continuous stream of data into manageable segments or ‘windows’. This approach
is significant in capturing the temporal dynamics of animal behavior.
We then discussed time-domain and frequency-domain
features, providing an understanding of how these features can be extracted
and their relevance in animal behavior studies. The time-domain features offer
insights into the basic statistical properties of the data, while the frequency-
domain features shed light on the periodic nature of the animal movements.
Next, we presented a practical Python example for feature extraction, offering a
hands-on approach to applying the theoretical concepts discussed. The section is
particularly valuable for readers looking to implement these techniques in real-
world scenarios.
CHAPTER 6
Feature Selection Techniques
Filter Methods
Filter methods represent a fundamental class of feature selection techniques used in
machine learning and data preprocessing. These methods are characterized by
their use of various statistical measures to evaluate the importance of different
168 | Machine Learning in Farm Animal Behavior using Python
Information Gain
Information Gain is a technique used in feature selection to determine the
importance of features by measuring the reduction in entropy. It is particularly
useful in scenarios where we need to understand how much information a feature
provides about the class.
Feature Selection Techniques | 169
Chi-square Test
The Chi-square test is a statistical method used in feature selection to evaluate
the independence between two variables. In machine learning, it is often used to
determine the relevance of categorical features with respect to categorical labels.
The essence of the Chi-square test in feature selection is to assess the dependency
between each feature and the target variable: the higher the Chi-square value, the
more likely the feature is dependent on the target, making it a potentially good
predictor.
that the feature has a significant distinction between classes, suggesting its
potential importance in predictive modeling.
Limitations
• Assumption of Normality: ANOVA assumes that the data for each class is
normally distributed, which may not always be the case in real-world datasets.
• Equal Variance: ANOVA assumes homogeneity of variance (equal variance
across groups), which can be a limitation in certain datasets.
Correlation Coefficient
The correlation coefficient is a statistical metric indicating the degree of linear
relationship between two variables. In the context of feature selection, it is used
to identify features that have a strong linear relationship among them.
Limitations
• Only Linear Relationships: Pearson correlation only captures linear relationships.
Non-linear but strong relationships might be overlooked.
• Sensitive to Outliers: Correlation can be heavily influenced by outliers.
Relief Algorithm
The Relief algorithm is designed to assign a weight to each feature based on
how well the feature can distinguish between instances that are near each other.
It works by randomly selecting an instance and then finding its nearest neighbor
from the same and opposite classes. If a feature value is different for neighbors
of different classes (indicating it is a good feature), its weight is increased; if
it is different for neighbors of the same class, its weight is decreased. Relief is
good for tasks where interactions between attributes are important. However, it is
generally limited to binary classification tasks.
ReliefF Algorithm
ReliefF is an extension of the Relief algorithm that generalizes it to multiclass
problems and is more robust to noisy and incomplete datasets. Unlike Relief,
ReliefF considers multiple nearest neighbors rather than just one. It averages
the feature weights across multiple neighbors, which makes it more reliable,
especially in datasets with more noise and outliers. ReliefF can handle a variety
of data types (discrete, continuous, among others) and is suitable for both binary
and multiclass classification problems.
Variance Threshold
The Variance Threshold is another method within filter-based feature selection
techniques. It operates on a simple principle: Features with low variance are less
likely to be informative or discriminative. In other words, if a feature does not
vary much within the dataset, it is less likely to impact the predictive power of
the model.
The Variance Threshold method involves setting a specific threshold for variance,
and only the features that exceed this threshold are retained. This approach is
particularly useful for removing constant or quasi-constant features¹, which do
not contribute to the model’s learning process. By doing so, it helps in reducing
the dimensionality of the dataset, making data processing and model training
more efficient.
This method is especially effective in scenarios where data contains many
redundant or irrelevant features. However, it is crucial to choose the variance
threshold wisely, as setting it too high might lead to the loss of potentially useful
features, while setting it too low might not effectively reduce dimensionality. In
essence, the Variance Threshold is a powerful tool in the feature selection process,
especially in the initial stages of data preprocessing to discard features that offer
little to no variability and hence, informational value.
1 Quasi-constant features refer to those variables within a dataset that exhibit little to no variance
among the observations.
Wrapper Methods
Wrapper methods in feature selection are a set of techniques that select features
based on the performance of a predictive or classification model. Unlike filter
methods, which rely on general characteristics of the data, wrapper methods use a
specific machine learning model to evaluate the effectiveness of different subsets
of features.
The process usually involves training a model on various combinations of features
and assessing the model’s performance to determine the best set of features. This
evaluation can be based on different criteria, such as accuracy, precision, recall,
or any relevant performance metric. The key steps in wrapper methods typically
include:
• Subset Generation: This step involves creating different combinations of
features. These combinations could be generated through different strategies,
such as forward selection, backward elimination, or recursive feature
elimination.
• Model Training and Evaluation: For each feature subset, a model is trained and
evaluated. Commonly used machine learning models in this context include
Naïve Bayes, Support Vector Machines (SVM), and Random Forests. The
choice of model can significantly impact the selection process, as different
models have varying sensitivities to different types of features.
• Performance Assessment: The performance of the model with each subset
of features is assessed using a chosen metric. The subset that yields the best
performance is considered the optimal set of features.
While wrapper methods can be effective in finding the best subset of features
for a given model, they have some limitations. The most significant one is
computational expense. Evaluating every possible combination of features can
be impractical, especially with large datasets and a high number of features. This
often restricts the use of wrapper methods to datasets with a smaller number of
features or necessitates the use of more efficient subset generation strategies.
Despite these challenges, wrapper methods are popular due to their model-specific
approach, which can lead to better feature selection for the specific predictive task
at hand, compared to the more general approach of filter methods.
Forward Selection
Forward Selection is a type of wrapper method used in feature selection. It is an
iterative method that starts with no features in the model. In each iteration,
it adds the feature that provides the most significant improvement to the model
until a specified criterion is reached.
Backward Elimination
Backward elimination is another wrapper method. Unlike forward selection,
backward elimination starts with all features included and iteratively removes the
least significant feature until a specified stopping criterion is met.
The general process of backward elimination:
• Start with All Features: Initially, the model includes all available features.
• Evaluate and Remove: In each iteration, the model is trained, and its
performance is evaluated. The least significant feature (the one whose
removal most improves model performance or least deteriorates it) is then
removed from the set.
• Iterate Until Criteria Met: This process is repeated until a stopping criterion
is met, which could be a set number of features, a performance threshold, or
a statistical significance level.
• Final Model: The final model contains the subset of features that provides the
best performance according to the chosen metric.
Backward elimination can be more efficient than forward selection when the
number of features is not excessively high, as it begins with a full model and
removes redundant or non-informative features.
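The backward elimination loop described above can be sketched with scikit-learn's SequentialFeatureSelector, which is swapped in here as an assumption for illustration; it is not necessarily the tool used elsewhere in this chapter. The synthetic dataset, the logistic regression estimator, and the target of four features are also illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 are informative
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

estimator = LogisticRegression(max_iter=1000)

# direction="backward" starts from all features and removes one per iteration,
# dropping the feature whose removal hurts cross-validated accuracy the least
backward_selector = SequentialFeatureSelector(estimator,
                                              n_features_to_select=4,
                                              direction="backward",
                                              scoring="accuracy",
                                              cv=3)
backward_selector.fit(X_demo, y_demo)

kept_indices = backward_selector.get_support(indices=True)
X_reduced = backward_selector.transform(X_demo)
print("Kept feature indices:", kept_indices)
```

Here the stopping criterion is a fixed number of features; a performance threshold could be used instead, as noted in the list above.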
Boruta
Boruta is a specific feature selection algorithm that falls under the category of
wrapper methods. It is known for its effectiveness in identifying the most relevant
features for a predictive model, particularly when using random forests.
An overview of how Boruta works:
• Shadow Features Creation: Boruta starts by duplicating every feature in
the dataset to create shadow features. These shadow features are shuffled
versions of the real features, which means they should not contain any useful
information about the target variable.
• Random Forest Classifier: A random forest classifier is then trained on the
extended dataset, including both the real and shadow features. Random
forests are chosen for their ability to handle a large number of features and
their robustness to overfitting.
• Feature Importance: After training, the algorithm evaluates the importance of
each feature.
• Comparison with Shadow Features: The importance of each real feature
is compared to the maximum importance among the shadow features. If a
real feature is more important than the most important shadow feature, it is
deemed relevant.
• Iterative Process: This process is repeated several times. In each iteration,
the shadow features are reshuffled, and the model is retrained. Features that
consistently outperform the shadow features are kept as relevant, while those
that do not are progressively discarded.
• Final Selection: The algorithm ends when all features are either confirmed or
rejected, or after a specified number of iterations is reached.
Boruta is useful for datasets where the relevance of features is uncertain, and it
provides a robust method to ensure that only features that have a genuine impact
on the model’s predictive power are selected. It is especially favored in biological
and medical datasets where the interpretability of the model and the understanding
of which features are truly important is crucial.
One of the key benefits of Boruta is that it takes into account the interaction
between variables, which can be missed by simpler filter methods. However, like
other wrapper methods, Boruta can be computationally intensive, especially for
datasets with a large number of features.
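To make the shadow-feature idea concrete, here is a deliberately simplified sketch of the steps listed above. It is not the Boruta package and omits Boruta's statistical test: as an assumption for illustration, a feature is kept if it beats the best shadow feature in more than half of the rounds. The dataset and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# shuffle=False keeps the 3 informative features in columns 0 to 2
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     shuffle=False, random_state=0)
n_features = X_demo.shape[1]
n_rounds = 10
hits = np.zeros(n_features, dtype=int)

for round_idx in range(n_rounds):
    # Shadow features: every real column permuted independently, which
    # destroys any relationship with the target
    shadows = np.column_stack(
        [rng.permutation(X_demo[:, j]) for j in range(n_features)]
    )
    X_extended = np.hstack([X_demo, shadows])

    forest = RandomForestClassifier(n_estimators=100, random_state=round_idx)
    forest.fit(X_extended, y_demo)

    importances = forest.feature_importances_
    real_importance = importances[:n_features]
    shadow_max = importances[n_features:].max()

    # A real feature scores a hit when it beats the best shadow feature
    hits += real_importance > shadow_max

# Simplification: majority vote instead of Boruta's statistical test
tentatively_relevant = np.where(hits > n_rounds / 2)[0]
print("Hits per feature:", hits)
print("Tentatively relevant features:", tentatively_relevant)
```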
efficiently. They are useful in situations where traditional methods might fail to
capture the global structure of the problem or when there is a need to balance
exploration and exploitation in search strategy.
Embedded Methods
Embedded methods stand out due to their integration of feature selection directly
into the model training process. This approach ensures that feature selection is
specifically tailored to the model being trained.
How They Work: These methods often utilize learning algorithms that have
built-in mechanisms for feature selection. Classical examples are regularization
techniques in Lasso and Ridge regression, where the former can shrink the
coefficients of certain features to zero, effectively removing them from the model.
Advantages: The primary benefit of embedded methods is their efficiency. By
combining feature selection with model training, they often yield more effective
models. This is because the selection process is optimized for the specific model,
potentially leading to improved performance.
L1 (Lasso) Regularization
Lasso, or L1 regularization, plays a pivotal role in embedded methods by adding
a penalty proportional to the sum of the absolute values of the coefficients. This
approach encourages sparsity in the model’s coefficients, effectively performing
feature selection by reducing some coefficients to zero. Lasso’s utility extends to
both regression and classification tasks, aiding in high-dimensional data by
simplifying model complexity and mitigating overfitting.
L2 (Ridge) Regularization
Ridge regularization, or L2, adds a penalty based on the squared magnitude of the
coefficients. This technique is valuable in scenarios of multicollinearity or when
the number of predictors exceeds the number of observations, helping to maintain
all features in the model but with shrunken coefficients.
Elastic Net
Combining the strengths of L1 and L2 regularization, Elastic Net applies both
penalties to the loss function. This hybrid approach is particularly effective in
dealing with correlated features, offering a versatile solution that adjusts the
balance between Lasso and Ridge benefits through its control parameters, α and λ.
Hybrid Methods
Hybrid methods are a strategic combination of filter and wrapper or embedded
methods, designed to leverage the strengths of both to optimize feature selection.
How They Work: Typically, a hybrid method begins with a filter approach to
reduce the feature space by removing less relevant features. This is followed by
a wrapper or embedded method that further refines the selection in the reduced
space.
Advantages: Hybrid methods strike a balance between the computational
efficiency of filter methods and the effectiveness of wrapper methods. They are
particularly useful for large feature sets, where they can reduce overall complexity
and execution time.
Common Examples: A typical example would be using a variance threshold filter
followed by recursive feature elimination. This two-step process first reduces the
number of features to a manageable size and then applies a more computationally
intensive wrapper method to select the most effective features.
In conclusion, the choice between embedded, hybrid, or other feature selection
methods depends on the specific requirements of the dataset and the problem at
hand. Embedded methods are integral to the learning algorithm and are efficient
for models with built-in feature selection capabilities. In contrast, hybrid methods
combine the initial simplicity of filter methods with the targeted effectiveness of
wrapper methods, offering a balanced approach suitable for large datasets.
such as shaking, fighting, and scratch biting due to insufficient data representation.
The script for loading and preparing the dataset for feature selection can be found
in our GitHub repository under the name Chapter_6_feature_selection_methods-gsdata.ipynb.
An important step in our workflow is splitting the dataset into training and testing
sets. We have partitioned our dataset with a test size of 30%. Now we will explore
various feature selection methods using Python, applying them to our dataset. We
will also introduce some methods that may not be directly applicable to our dataset
but are important in the broader context of machine learning. These methods
will be demonstrated through Python code examples, providing the reader with
practical insights into their implementation and usage.
In the Python example below, we demonstrate how to use Information Gain for
feature selection. We use the already split dataset: X_train, X_test, y_train, and
y_test.
Information Gain can be calculated using the mutual_info_classif function from
sklearn.feature_selection. This function measures the dependency between
variables. Based on the calculated Information Gain, we can then select a subset
of features that contribute the most to predicting the output.
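As a minimal sketch of this step, assuming a synthetic dataset in place of the accelerometer features (the names X_demo, y_demo, and the choice of the top three features are illustrative), Information Gain can be estimated and ranked as follows:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for the training split
X_arr, y_demo = make_classification(n_samples=300, n_features=8,
                                    n_informative=3, n_redundant=0,
                                    shuffle=False, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# Estimate the mutual information between each feature and the class label
info_gain = pd.Series(
    mutual_info_classif(X_demo, y_demo, random_state=0),
    index=X_demo.columns,
).sort_values(ascending=False)

# Keep the highest-scoring features (3 is an arbitrary choice here)
top_features = info_gain.head(3).index.tolist()
print(info_gain)
print("Top features:", top_features)
```

The scores are estimates based on nearest-neighbor statistics, so fixing random_state keeps the ranking reproducible between runs.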
Figure 6.1: Bar chart of information gain for the top 30 features (y-axis: Information Gain; x-axis: feature name).
Figure 6.1 displays a bar chart where each bar represents a feature from the
dataset used for animal behavior analysis. The height of each bar indicates the
Information Gain associated with that feature, providing a visual representation
of feature importance.
Chi-square Test
For datasets like ours, which predominantly consist of numerical data, the direct
application of the Chi-square test is not straightforward. Numerical data would
first need to be discretized (binned) into categories, which could lead to loss of
information and potentially misleading results. Hence, while we can technically
apply the Chi-square test, it might not be the most suitable method for our dataset.
Nevertheless, for the purpose of demonstration and learning, we can briefly go
through how one would apply the Chi-square test in Python, bearing in mind the
limitations of its applicability to our dataset (this example is not included in our
Python script since it is not suitable for our dataset).
In the above code, we import SelectKBest and chi2. The Chi-square test requires
non-negative values, and hence we scale the dataset using MinMaxScaler. We then
use SelectKBest to select features based on the Chi-square scores.
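The steps just described can be sketched as follows. This is an illustration on synthetic data, not the code referred to above; the dataset and the choice of k = 4 are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic numerical data standing in for the real features
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# chi2 requires non-negative inputs, so rescale every feature to [0, 1]
X_scaled = MinMaxScaler().fit_transform(X_demo)

# Score each feature against the labels and keep the 4 best
chi2_selector = SelectKBest(score_func=chi2, k=4)
X_selected = chi2_selector.fit_transform(X_scaled, y_demo)

print("Chi-square scores:", chi2_selector.scores_)
print("Selected feature indices:", chi2_selector.get_support(indices=True))
```

Note that scaling only satisfies the non-negativity requirement; it does not turn continuous features into the count or category data for which the test was designed, which is the limitation discussed above.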
ANOVA F-Value
To utilize ANOVA F-Value for feature selection in Python, the scikit-learn library
provides a straightforward implementation. Here is an example:
# Output
Feature Scores:
z_std_dev 35172.453652
z_iqr 32214.437313
svm 25555.910370
average_svm 25555.910370
z_peak_to_peak 23178.704585
...
x_entropy 66.373729
y_kurtosis 55.642534
z_kurtosis 53.889757
y_skewness 10.330706
x_skewness 9.027106
Length: 85, dtype: float64
[Figure: bar chart of ANOVA F-values by feature (y-axis: F-Value; x-axis: feature name).]
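As a minimal sketch of ANOVA F-value scoring, assuming synthetic data rather than the dataset that produced the output listed above (so the scores will differ), the procedure can be written as:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for X_train and y_train
X_arr, y_demo = make_classification(n_samples=300, n_features=8,
                                    n_informative=3, n_redundant=0,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# k="all" keeps every feature, so the selector is used only for scoring
anova_selector = SelectKBest(score_func=f_classif, k="all")
anova_selector.fit(X_demo, y_demo)

# Pair each F-value with its feature name and sort from high to low
feature_scores = pd.Series(anova_selector.scores_,
                           index=X_demo.columns).sort_values(ascending=False)
print("Feature Scores:")
print(feature_scores)
```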
This code snippet demonstrates the process of feature selection using ANOVA
F-Value, which is a statistical method for finding the most significant features that
contribute to classifying the data. Here is what each part of the code does:
1. Importing Libraries: We first import SelectKBest and f_classif. SelectKBest
selects features according to the k highest scores, and f_classif computes the
ANOVA F-Value for provided features.
2. Applying ANOVA F-Value: We configure the SelectKBest function to use
the f_classif scoring function and set it to select all features (k = 'all'). We then
Correlation Coefficient
In the following Python code, we use Pearson’s correlation coefficients for feature
selection:
In the above code, the objective is to identify features in our training dataset
(X_train) that are highly correlated with each other. Here is a brief explanation of
the code:
• Calculating Correlation Matrix: The first line computes the correlation
matrix for the X_train dataset using the corr() function. This matrix shows
the correlation coefficients between every pair of features in your dataset.
• Identifying Highly Correlated Features:
– A threshold value of 0.6 is set. This value is used to determine what level
of correlation is considered high.
– A list comprehension is then used to iterate through each column (feature)
in the correlation matrix. For each column, it checks if any of its values
are greater than the threshold (0.6 in this case). This means we are looking
for features that have a high correlation with at least one other feature.
• The features that meet this criterion are added to the list highly_correlated_features.
• Finally, the code prints out the list of highly correlated features. These are the
features in our dataset that have a correlation coefficient greater than 0.6 with
at least one other feature.
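The procedure described in these bullets can be sketched as follows, on a small synthetic dataset. One detail is an assumption added here: the diagonal of the correlation matrix is masked, because every feature has a correlation of 1.0 with itself and would otherwise always exceed the threshold. The sketch also compares absolute values so that strong negative correlations are flagged too.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                    # unrelated to "a" and "b"
})

# Absolute Pearson correlation between every pair of features
correlation_matrix = X_demo.corr().abs()

# Replace the diagonal (self-correlation) with NaN so it is ignored
diagonal = np.eye(len(correlation_matrix), dtype=bool)
off_diagonal = correlation_matrix.mask(diagonal)

threshold = 0.6
highly_correlated_features = [
    column for column in off_diagonal.columns
    if (off_diagonal[column] > threshold).any()
]
print("Highly correlated features:", highly_correlated_features)
```

In practice, one feature from each highly correlated pair is usually dropped rather than both, since the pair carries largely the same information.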
Mean Absolute Difference (MAD)
The calculation of MAD is straightforward.
# Output
x_mean 4.420863
x_std_dev 0.958864
x_variance 5.853508
x_icv 17.396949
x_median 4.388141
...
z_spectral_skewness 2.301578
z_spectral_kurtosis 48.428439
z_spectral_flatness 0.019177
z_spectral_slope 0.498021
z_spectral_rolloff 3.398551
Length: 85, dtype: float64
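A minimal sketch of the MAD computation is shown below, assuming that MAD here means the mean absolute deviation of each feature from its own mean. The tiny data frame is illustrative and unrelated to the output listed above.

```python
import pandas as pd

X_demo = pd.DataFrame({
    "varying": [1.0, 2.0, 3.0, 4.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
})

# For every feature: average absolute distance of its values from its mean
mad_scores = (X_demo - X_demo.mean()).abs().mean()
print(mad_scores)
```

A constant feature has a MAD of zero, while features with more spread receive higher scores, so low-MAD features are candidates for removal. Because MAD depends on the scale of each feature, scores are only comparable across features measured in similar units or after scaling.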
Variance Threshold
To implement this in Python, we can use the VarianceThreshold class from the
sklearn.feature_selection module.
Here is how we can use it:
Code explanation:
• We import the VarianceThreshold class from sklearn.feature_selection.
• We then set the variance threshold.
• We apply the fit_transform method to our data.
• Finally, we identify which features have been retained by printing them out.
The selector.get_support(indices=True) call returns the indices of the features
that are above the threshold. Using these indices, we can then identify the names
of the features that have been retained after applying the threshold. The
selected_features variable will contain the names of these features. Remember, the
choice of the variance threshold (threshold) is crucial. It depends on the nature of
your dataset and the specific requirements of your analysis.
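The four steps in the code explanation can be sketched as follows. The toy data frame and the threshold of 0.01 are assumptions chosen so that the effect is easy to see.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

X_demo = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "quasi_constant": [0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
    "informative": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
})

# Keep only features whose variance exceeds the threshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_demo)

# Map the retained column indices back to feature names
selected_features = X_demo.columns[selector.get_support(indices=True)].tolist()
print("Variances:", selector.variances_)
print("Selected features:", selected_features)
```

Variance depends on the scale of each feature, so a single threshold is most meaningful when the features share comparable units or have been scaled consistently.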
Forward Selection
To implement forward selection in Python, we can use SequentialFeatureSelector
from the mlxtend.feature_selection module.
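# Forward Selection
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.ensemble import RandomForestClassifier
# Assumption: clf is a random forest, as in the other examples of this chapter
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
sfs = SFS(clf,
          k_features=10,  # Number of features to select
          forward=True,
          floating=False,
          scoring='accuracy',
          cv=4)
# Fit the selector and print the names of the selected features
sfs = sfs.fit(X_train, y_train)
print(X_train.columns[list(sfs.k_feature_idx_)])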
# Output
Selected features:
Index(['x_maximum', 'y_minimum', 'y_maximum', 'z_std_dev',
       'z_minimum', 'z_iqr', 'svm', 'squared_integral',
       'combined_entropy', 'x_spectral_entropy'],
      dtype='object')
Backward Elimination
# Backward Elimination
sfs = SFS(clf,
# Output
Selected features:
Index(['x_minimum', 'y_minimum', 'y_maximum', 'z_minimum', 'svm',
'average_svm', 'squared_integral', 'combined_entropy',
'x_spectral_entropy', 'z_spectral_energy'],
dtype='object')
# Output
Selected features: [ 2 4 12 13 15 16 21 22 27 28]
Explanation:
• RFE Constructor: RFE is initialized with the classifier (RandomForestClassifier
in this case) and the number of features to select (n_features_to_select).
• Step: The step parameter determines how many features should be eliminated
at each iteration.
• Fit the Model: selector.fit trains the model and performs feature elimination.
• Selected Features: selector.get_support(indices = True) provides the indices
of the features that are selected as most important.
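The points above can be sketched in a few lines. The synthetic dataset, the forest size, and the target of five features are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for X_train and y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=4, n_redundant=0,
                                     random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# step=1 removes the single least important feature at each iteration
selector = RFE(clf, n_features_to_select=5, step=1)
selector.fit(X_demo, y_demo)

selected_indices = selector.get_support(indices=True)
print("Selected features:", selected_indices)
print("Ranking (1 = selected):", selector.ranking_)
```

A larger step speeds up the search at the cost of a coarser elimination order.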
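from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.ensemble import RandomForestClassifier

# Assuming X_train and y_train are the training data and labels
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
efs = EFS(clf,
          min_features=3,
          max_features=10,
          scoring='accuracy',
          print_progress=True,
          cv=5)
# Evaluate every feature subset of size 3 to 10 (slow for many features)
efs = efs.fit(X_train, y_train)
print('Best subset (indices):', efs.best_idx_)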
# Using a pipeline
lasso_pipeline = make_pipeline(scaler, lasso_clf)
Code Breakdown:
• Label Encoding:
– The LabelEncoder from sklearn.preprocessing is initialized and used to
convert the categorical target labels (y_train and y_test) into a numerical
form.
• Standardization and Logistic Regression with L1 Regularization:
– StandardScaler is used for standardizing the features.
– LogisticRegression is set up with L1 regularization (penalty='l1').
The solver liblinear is chosen as it supports L1 regularization. The
regularization strength is controlled by C (inverse of regularization
strength; smaller values specify stronger regularization).
• Pipeline Creation:
– A pipeline is created using make_pipeline, combining the scaler and
logistic regression model. This ensures that operations are sequentially
applied: first scaling, then model fitting.
• Model Fitting:
– The pipeline is fitted to the training data (X_train, y_train_encoded).
• Feature Selection:
– SelectFromModel is used to select features based on the importance
weights derived from the logistic regression model. The model is already
fitted (prefit = True), so it directly accesses the feature importance.
– It then transforms the training and test datasets to include only the selected
features, effectively reducing the feature space.
• Extracting Selected Features:
– The indices of the selected features are extracted
(model.get_support(indices=True)).
– These indices are used to retrieve the feature names from X_train columns.
• Output:
– The features deemed important by the Lasso logistic regression model are
printed.
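Since only one line of the original code survives, the breakdown above can be sketched end to end as follows. This is a reconstruction under assumptions, not the authors' script: the data are synthetic, the string labels are invented, and C=0.1 is an arbitrary regularization strength.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in for X_train, with string labels for y_train
X_arr, y_int = make_classification(n_samples=300, n_features=10,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)
X_train = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(10)])
y_train = np.where(y_int == 1, "active", "inactive")

# Label encoding: convert the categorical labels to integers
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

# Standardization and logistic regression with an L1 penalty
scaler = StandardScaler()
lasso_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# Using a pipeline: scaling first, then model fitting
lasso_pipeline = make_pipeline(scaler, lasso_clf)
lasso_pipeline.fit(X_train, y_train_encoded)

# Feature selection from the already fitted model (prefit=True)
fitted_clf = lasso_pipeline.named_steps["logisticregression"]
model = SelectFromModel(fitted_clf, prefit=True)

# Map the indices of non-zero coefficients back to feature names
selected_idx = model.get_support(indices=True)
selected_features = X_train.columns[selected_idx].tolist()
print("Selected features:", selected_features)
```

Lowering C strengthens the penalty and typically drives more coefficients to zero, leaving fewer selected features.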
# Using a pipeline
ridge_pipeline = make_pipeline(scaler, ridge)
Code Breakdown:
• We standardize the features using the StandardScaler.
• Ridge Regression:
– Ridge regression is initialized with alpha = 1.0, which is the regularization
strength.
• Creating a Pipeline:
– The make_pipeline combines the scaler and Ridge regression model.
• Model Fitting:
– The pipeline is fitted to the X_train, and y_train_encoded.
• Feature Selection:
– SelectFromModel is employed for feature selection, using the fitted Ridge
regression model within the pipeline (ridge_pipeline.named_steps['ridge']).
• Transforming Data:
– The transform method of SelectFromModel is then used to reduce X_train
to only the features deemed important by the Ridge model.
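The Ridge variant of this workflow can be sketched in the same way, again on synthetic data and with integer labels already encoded (both assumptions). Because Ridge does not set coefficients exactly to zero, SelectFromModel falls back to its default rule of keeping features whose absolute coefficient is at least the mean absolute coefficient.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train and the encoded labels
X_arr, y_train_encoded = make_classification(n_samples=300, n_features=10,
                                             n_informative=3, n_redundant=0,
                                             random_state=0)
X_train = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(10)])

scaler = StandardScaler()
ridge = Ridge(alpha=1.0)  # alpha is the regularization strength

# Using a pipeline: scaling first, then model fitting
ridge_pipeline = make_pipeline(scaler, ridge)
ridge_pipeline.fit(X_train, y_train_encoded)

# Select features from the fitted Ridge model inside the pipeline
ridge_selector = SelectFromModel(ridge_pipeline.named_steps["ridge"],
                                 prefit=True)
selected_idx = ridge_selector.get_support(indices=True)

# Reduce X_train to the selected columns
X_train_selected = X_train.iloc[:, selected_idx]
print("Selected features:", X_train_selected.columns.tolist())
```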
# Using a pipeline
elastic_net_pipeline = make_pipeline(scaler, elastic_net)
for f in range(X_train.shape[1]):
    print(f"{f + 1}. feature {indices[f]} ({importances[indices[f]]}) – {X_train.columns[indices[f]]}")
2 Gini impurity measures how often a randomly chosen element of a set would be
incorrectly labeled if it were labeled at random according to the distribution of
labels in the set.
# Sort the feature importances in descending order and get the indices
indices = np.argsort(importances)[::-1]
The code follows the same structure as the code we used for Random Forest
and Decision trees. The only difference is the model, which in this case is the
GradientBoostingClassifier.
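As a minimal sketch of that shared structure, assuming synthetic data in place of the real training set, feature importances from a GradientBoostingClassifier can be ranked as follows:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for X_train and y_train
X_arr, y_train = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_train = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

gb_model = GradientBoostingClassifier(n_estimators=50, random_state=0)
gb_model.fit(X_train, y_train)

# Impurity-based importances, normalized to sum to 1
importances = gb_model.feature_importances_

# Sort the feature importances in descending order and get the indices
indices = np.argsort(importances)[::-1]

for rank, idx in enumerate(indices, start=1):
    print(f"{rank}. {X_train.columns[idx]} ({importances[idx]:.4f})")
```

Swapping in another tree-based classifier only changes the model line; the ranking code stays the same.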
Note on LightGBM
LightGBM works similarly to XGBoost and can be used in the same way for
feature selection. The primary difference is in the underlying algorithms
and optimizations, with LightGBM being designed for speed and efficiency,
particularly on large datasets. You can replace xgb.XGBClassifier with
lightgbm.LGBMClassifier and follow the same steps for feature selection using LightGBM.
Hybrid Method
Variance Threshold and Feature Selection Using RandomForest
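The code for this section is not present in the text, so the following is only a sketch of how such a two-stage hybrid could look. The synthetic data, the zero-variance threshold, and the median importance cut-off are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold

# Synthetic data plus one constant column that carries no information
X_arr, y_demo = make_classification(n_samples=300, n_features=8,
                                    n_informative=3, n_redundant=0,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])
X_demo["constant"] = 1.0

# Stage 1 (filter): drop features whose variance is zero
variance_filter = VarianceThreshold(threshold=0.0)
X_filtered = variance_filter.fit_transform(X_demo)
filtered_names = X_demo.columns[variance_filter.get_support()]

# Stage 2 (embedded): keep features with at least median forest importance
forest = RandomForestClassifier(n_estimators=100, random_state=0)
rf_selector = SelectFromModel(forest, threshold="median")
rf_selector.fit(X_filtered, y_demo)

final_features = filtered_names[rf_selector.get_support()].tolist()
print("After variance filter:", filtered_names.tolist())
print("Final selection:", final_features)
```

The inexpensive filter runs first so that the more costly model-based step only has to rank the features that survive it.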
Summary
In this chapter, we looked at feature selection techniques, a crucial aspect of
the data science process. We began by discussing the significance of feature
selection in enhancing model performance, reducing complexity, and increasing
interpretability. Our discussion covered a range of methods, each tailored to
different needs and scenarios in the data science workflow. Key highlights of the
chapter included a look at filter methods, known for their speed and effectiveness
in preliminary feature reduction. These methods rely on statistical measures and
operate independently of the machine learning models. We also examined wrapper
methods, which, despite their computational intensity, provide a high accuracy
Figure 7.1: Supervised ML block diagram.
[Figure: unsupervised learning diagram in which unlabeled input is passed to an ML model that groups it into Cluster 1, Cluster 2, and Cluster 3.]
[Figure: a single neuron with inputs x1, x2, ..., xn and weights w1, ..., wn.]
Consider an MLP with two layers, the hidden layer and the output layer. The index k
denotes the input units, the index i denotes the output units, and the index j denotes
the hidden neurons. Figure 7.3 illustrates the design framework of the MLP.
Let the numbers of inputs, outputs, and hidden units be M, N, and S, respectively.
Let y and x represent the N-tuple outputs and the M-tuple inputs of the network,
respectively. The matrix of weights linking the input layer to the hidden layer is
denoted by $w^1_{jk}$, with S × M elements. The weight matrix connecting the hidden
and the output layers is denoted by $w^2_{ij}$, with N × S elements. The biases can be
represented separately in the MLP or by adding an extra input value of one to each
layer of the network.
When an input pattern p is presented to the MLP, the hidden neuron j
receives a net input $n_j^p$ defined as:
$$n_j^p = \sum_{k=1}^{M} w^1_{jk} x_k^p.$$
The output of this unit is:
$$V_j^p = f(n_j^p) = f\left(\sum_{k=1}^{M} w^1_{jk} x_k^p\right).$$
The transfer functions used at the output and the hidden layers can be
different. However, for simplicity we will assume that the same transfer function
is used in all the layers of the MLP.
The error generated at the output unit i is:
$$e_i^p = d_i^p - y_i^p$$
where $d_i^p$ represents the target value and p denotes the input pattern.
The total error, or cost function, per pattern is defined as:
$$J = \frac{1}{2}\sum_{i=1}^{N} [e_i^p]^2 = \frac{1}{2}\sum_{i=1}^{N} [d_i^p - y_i^p]^2.$$
The change in the weight between a hidden and an output unit is defined as:
$$\Delta w^2_{ij} = -\eta \frac{\partial J}{\partial w^2_{ij}}$$
where η represents the learning rate, and, with $n_i^p = \sum_{j=1}^{S} w^2_{ij} V_j^p$ the net
input of output unit i and $y_i^p = f(n_i^p)$, the chain rule gives
$$\frac{\partial J}{\partial w^2_{ij}} = \frac{\partial J}{\partial y_i^p}\,\frac{\partial y_i^p}{\partial n_i^p}\,\frac{\partial n_i^p}{\partial w^2_{ij}} = -e_i^p f'(n_i^p) V_j^p.$$
Animal Research: Supervised and Unsupervised Learning Algorithms | 209
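Let
$$\delta_i^p = e_i^p f'(n_i^p),$$
then
$$\Delta w^2_{ij} = \eta\, \delta_i^p V_j^p.$$
The chain rule can be used to update the weights between the input and the hidden
layers as follows:
$$\Delta w^1_{jk} = -\eta \frac{\partial J}{\partial w^1_{jk}} = -\eta \frac{\partial J}{\partial V_j^p}\,\frac{\partial V_j^p}{\partial w^1_{jk}}$$
where we have:
$$\frac{\partial J}{\partial V_j^p} = -\sum_{i=1}^{N} e_i^p f'(n_i^p) w^2_{ij}$$
and
$$\frac{\partial V_j^p}{\partial w^1_{jk}} = f'(n_j^p)\, x_k^p.$$
Substituting both expressions gives
$$\Delta w^1_{jk} = \eta\, f'(n_j^p)\, x_k^p \sum_{i=1}^{N} \delta_i^p w^2_{ij}.$$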
The value of the momentum is usually selected between zero and one and a typical
value is 0.9.
The purpose of the momentum term is to give inertia to every weight connection
of the network such that it will move in the direction of decreasing the cost
function, rather than oscillating widely with every update.
The momentum term can be used for both online learning and batch learning,
although it was first used in online learning (Hertz et al., 1991).
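The update rules derived above can be checked numerically with a small NumPy sketch. Several choices here are assumptions made for illustration: a sigmoid transfer function in both layers, no bias terms, online (per-pattern) updates, and a commonly used form of the momentum update in which each weight change is the gradient step plus α times the previous change. The data and all variable names are hypothetical.

```python
import numpy as np


def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))


rng = np.random.default_rng(0)
M, S, N = 3, 5, 1                      # inputs, hidden units, outputs
X = rng.normal(size=(50, M))           # 50 input patterns x^p
D = (X.sum(axis=1, keepdims=True) > 0).astype(float)  # targets d^p

W1 = rng.normal(scale=0.5, size=(S, M))  # input-to-hidden weights (S x M)
W2 = rng.normal(scale=0.5, size=(N, S))  # hidden-to-output weights (N x S)
eta, alpha = 0.05, 0.9                   # learning rate and momentum
dW1_prev = np.zeros_like(W1)
dW2_prev = np.zeros_like(W2)


def total_cost(W1, W2):
    """Sum over all patterns of J = 0.5 * sum_i (d_i - y_i)^2."""
    V = sigmoid(X @ W1.T)
    Y = sigmoid(V @ W2.T)
    return 0.5 * np.sum((D - Y) ** 2)


initial_cost = total_cost(W1, W2)

for epoch in range(100):
    for x, d in zip(X, D):               # online learning: one pattern at a time
        V = sigmoid(W1 @ x)              # hidden outputs V_j = f(n_j)
        y = sigmoid(W2 @ V)              # network outputs y_i = f(n_i)
        e = d - y                        # output errors e_i
        # For the sigmoid, f'(n) = f(n) * (1 - f(n))
        delta_out = e * y * (1 - y)                      # delta_i
        delta_hidden = (W2.T @ delta_out) * V * (1 - V)  # f'(n_j) sum_i delta_i w2_ij
        # Gradient step plus momentum term
        dW2 = eta * np.outer(delta_out, V) + alpha * dW2_prev
        dW1 = eta * np.outer(delta_hidden, x) + alpha * dW1_prev
        W2 += dW2
        W1 += dW1
        dW2_prev, dW1_prev = dW2, dW1

final_cost = total_cost(W1, W2)
print(f"Cost before training: {initial_cost:.3f}, after: {final_cost:.3f}")
```

Note that delta_hidden is computed before W2 is updated, so both layers use the weights that produced the current forward pass.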
K-nearest Neighbors
K-nearest neighbors (KNN) is a widely recognized machine learning algorithm
applicable to both classification and regression tasks. In classification tasks,
KNN determines the predicted class for a new data point by adopting the most
frequent class among its k nearest neighbors. In the case of regression, the mean
or weighted mean of the target values of the k closest neighbors is used as the
predicted value. In its simplest form (k = 1), the KNN algorithm (Géron & Russell,
2019) takes into account only the single nearest neighbor, which is the training
data point closest to the point being predicted, as shown in Figure 7.4. The
prediction is then the known output for this specific training data point. A KNN
classifier typically relies on the Euclidean distance calculated between a test
sample and each training sample.
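As a minimal sketch of KNN classification with Euclidean distance, assuming two well-separated synthetic clusters and k = 5 (both illustrative choices):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two synthetic classes centered at (0, 0) and (6, 6)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                  cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Each prediction is the majority class among the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
new_points = [[0.0, 0.0], [6.0, 6.0]]
print("Test accuracy:", accuracy)
print("Predicted classes for new points:", knn.predict(new_points))
```

Setting n_neighbors=1 gives the single-nearest-neighbor case, in which the prediction is simply the label of the closest training point.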
Figure 7.4: KNN plot (axes X and Y, showing Class 1, Class 2, and a new data point).
Logistic Regression
Logistic regression (LR) is one of the most important analytic tools in the social and
natural sciences (Jurafsky & Martin, 2014, 2009). In natural language processing,
LR is the baseline supervised machine learning algorithm for classification
and has a very close relationship with neural networks. LR is used to classify
observations into two classes, such as whether or not an animal has a disease,
as shown in Figure 7.5. It can also be used to classify observations into
multiple classes, which is known as multinomial logistic regression.
[Figure: Positive and Negative classes plotted on the X and Y axes.]
Figure 7.5: Logistic regression plot.
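A minimal sketch of binary logistic regression is given below. It assumes the standard formulation (a sigmoid applied to a linear score, fitted by gradient descent on the mean log-loss); the one-feature dataset, the learning rate, and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b by gradient descent on the mean log-loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)               # predicted probability of the positive class
        w -= lr * (X.T @ (p - y)) / len(y)   # gradient with respect to the weights
        b -= lr * np.mean(p - y)             # gradient with respect to the bias
    return w, b

# Hypothetical data: one standardized feature; 1 = disease, 0 = no disease
X = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

w, b = fit_logistic(X, y)
probs = sigmoid(X @ w + b)
preds = (probs >= 0.5).astype(int)           # threshold the probability at 0.5
print(preds)
```

The 0.5 threshold is a convention; in a disease-screening setting a lower threshold may be preferred to reduce missed cases.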
Support Vector Machine
A support vector machine (SVM) separates the classes with a decision boundary,
which is a hyperplane in the original feature space (for linearly separable data)
or a hyperplane in the transformed feature space (with kernel functions). For
non-linear data, SVM can use kernel functions like polynomial, radial basis function
(RBF), or sigmoid to project the data into a higher-dimensional space, where
it becomes linearly separable. SVM is a traditional machine-learning method
based on classification and can handle both linear and non-linear data. SVM’s
robustness in high-dimensional spaces and its capacity to control the trade-off
between margin and classification errors make it a valuable tool in ML and data
analysis. These are some SVM concepts and variables that are important to know:
• Margin is defined as the distance between the hyperplane and the nearest data
point from each class. The SVM aims to maximize this margin, as it indicates
a more robust separation between classes.
• Support Vectors are the data points that are closest to the hyperplane and have
the smallest margin. These are critical in defining the position and orientation
of the hyperplane.
• Kernel Trick: SVM can handle data that is not linearly separable by transforming
it into a higher-dimensional space through the application of a kernel function
(e.g., polynomial, radial basis, or sigmoid). This transformation often makes
it possible to identify a hyperplane that distinguishes the classes.
• Regularization: SVM employs a regularization parameter (C) to manage
the balance between maximizing the margin and reducing the classification
error. A small C value creates a larger margin but may allow some
misclassifications, while a larger C value reduces the margin but aims to
minimize misclassifications.
Figure 7.6 shows the SVM plot, which includes the margin and the support vectors.
[Figure 7.6: SVM plot showing the maximum margin and the support vectors between
Class 1 and Class 2.]
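The effect of the kernel and of the regularization parameter C can be explored with scikit-learn's SVC. The sketch below uses a synthetic, non-linearly separable dataset (make_moons); the dataset, the split, and the two C values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data that a straight line cannot separate
X, y = make_moons(n_samples=400, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

scores = {}
for C in (0.1, 10.0):
    # Scale the features first, then fit an RBF-kernel SVM
    model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=C))
    model.fit(X_train, y_train)
    n_sv = model.named_steps['svc'].support_vectors_.shape[0]
    scores[C] = model.score(X_test, y_test)
    print(f'C={C}: test accuracy={scores[C]:.3f}, support vectors={n_sv}')
```

A smaller C usually leaves more points inside or on the margin, so the number of support vectors tends to be larger than with a large C.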
Decision Tree
These are some decision tree (DT) concepts that are important to know:
• Tree Structure: The top node is called the root, and it branches into various
nodes below it. Nodes in the tree represent decisions or test conditions based
on input features. Terminal nodes at the bottom of the tree are called leaf
nodes. These nodes represent the final predicted output.
• Splitting Criteria: At each node, DT chooses the best feature and condition for
splitting the dataset into smaller groups. The goal is to create splits that result
in the most homogeneous subsets in terms of the target variable.
• Entropy and Information Gain: In classification, DTs often utilize metrics
such as entropy and information gain to evaluate the impurity or disorder of
a dataset. The goal is to reduce entropy and increase information gain when
making splits.
• Gini Impurity: Another common metric for classification trees is Gini impurity,
which measures the probability of misclassifying a randomly chosen element
from the dataset. Decision trees aim to minimize Gini impurity.
• Pruning: DTs can be prone to overfitting, where they become too complex
and fit the noise in the training data. Pruning techniques simplify the tree
by eliminating branches that contribute little to the model’s performance.
• Decision Rules: Each path from the root to a leaf node represents a decision
rule. These rules can be easily interpreted, which makes DTs useful when the
reasoning behind a prediction must be explained.
[Figure: decision tree diagram. The root node tests Feature 1; Conditions 1 and 2
lead to nodes testing Feature 2 and Feature 3, whose Conditions 3 to 6 lead to
Yes/No leaf nodes.]
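The impurity measures mentioned above can be computed directly. The sketch below uses the standard definitions, entropy H = −Σ p log2 p and Gini = 1 − Σ p², together with information gain as the drop in weighted entropy after a split; the Yes/No labels and the function names are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Entropy (in bits) of the class distribution in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini impurity of the class distribution in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(parent, left, right):
    """Reduction in entropy obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array(['Yes'] * 5 + ['No'] * 5)
left, right = parent[:5], parent[5:]           # a split that separates the classes perfectly
print(entropy(parent), gini(parent))           # 1.0 0.5
print(information_gain(parent, left, right))   # 1.0
```

A balanced two-class node has the maximum entropy of 1 bit and a Gini impurity of 0.5, and a split that yields pure children recovers all of that entropy as information gain.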
Random Forest
Random Forest (RF) consists of a number of decision trees, each of which
differs slightly from the others. The idea behind RF is that while each tree
can classify the data reasonably well, it is prone to overfitting certain
portions of the dataset. Ensemble learning involves building multiple DTs, each
of which may overfit the data differently. By averaging the predictions of these
trees, the ensemble can reduce overfitting.
[Figure: random forest diagram. The training dataset is passed to several decision
trees, and their predictions are combined by voting to give the final class.]
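The ensemble idea can be tried out by fitting a single decision tree and a random forest on the same data. The synthetic dataset from make_classification and all parameter values below are assumptions made for this sketch only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for a real feature table
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# One fully grown tree versus an ensemble of 100 trees
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f'Single tree accuracy: {tree_acc:.3f}')
print(f'Random forest accuracy: {forest_acc:.3f}')
print(f'Number of trees in the forest: {len(forest.estimators_)}')
```

On data like this the forest often scores higher on the test set than the single tree, although the size of the gap depends on the dataset.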
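import pandas as pd

def load_and_inspect_data():
    # Paths to the saved datasets
    file_paths = {
        'X_train': 'X_train.csv',
        'y_train': 'y_train.csv',
        'X_test': 'X_test.csv',
        'y_test': 'y_test.csv'
    }
    datasets = {}
    # Assumed loading step: read each CSV into a DataFrame and print its shape
    for name, path in file_paths.items():
        datasets[name] = pd.read_csv(path)
        print(f'{name}: {datasets[name].shape}')
    return datasets

# Load datasets
datasets = load_and_inspect_data()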
To check the distribution of the labels in both y_train and y_test, we can
plot the label counts of each set using a bar plot. An example using matplotlib
and seaborn is shown below:
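import matplotlib.pyplot as plt
import seaborn as sns
# Plotting
fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
fig.suptitle('Label Distribution in Training and Testing Sets')
# Assumed step: one count plot per label set (labels taken from the first column)
for ax, name in zip(axes, ['y_train', 'y_test']):
    sns.countplot(x=datasets[name].iloc[:, 0], ax=ax)
    ax.set(title=name, xlabel='Labels')
plt.tight_layout()
plt.savefig('datasets_distribution')
plt.show()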
[Figure: bar plots of the label counts (Count against Labels) in the training and
testing sets for the classes Grazing, Standing, Lying, Walking, Trotting, and Running.]
• Feature Scaling: Scaling matters for algorithms such as MLP, which are sensitive
to feature scaling. The StandardScaler from scikit-learn is used for this purpose.
• Label Encoding: Since our labels are categorical, they should be encoded into
numerical values. This can be done using LabelEncoder from scikit-learn.
This step is essential for all algorithms.
• Handling Missing Values: We will ensure there are no missing values in our
datasets.
• Feature Encoding: If we have categorical features, we should encode them
using techniques like one-hot encoding, especially for algorithms that do not
inherently handle categorical variables well (e.g., SVM, KNN).
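The preprocessing steps listed above can be sketched as follows. The tiny DataFrames, the column names acc_mean and breed, and the label values are hypothetical and only stand in for a real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical mini-dataset; column names are illustrative only
X_train = pd.DataFrame({'acc_mean': [0.1, 0.4, 0.9, 0.3],
                        'breed': ['A', 'B', 'A', 'B']})
X_test = pd.DataFrame({'acc_mean': [0.2, 0.8], 'breed': ['B', 'A']})
y_train = pd.Series(['Grazing', 'Lying', 'Walking', 'Grazing'])

# Handling missing values: count them before going further
print('Missing values:', int(X_train.isnull().sum().sum()))

# Feature encoding: one-hot encode the categorical column, then align the
# test columns with the training columns
X_train_enc = pd.get_dummies(X_train, columns=['breed'], dtype=float)
X_test_enc = pd.get_dummies(X_test, columns=['breed'], dtype=float)
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0.0)

# Feature scaling: fit the scaler on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_enc)
X_test_scaled = scaler.transform(X_test_enc)

# Label encoding: map class names to integers
encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)
print(encoder.classes_, y_train_enc)
```

Fitting the scaler and the encoder on the training data and only applying them to the test data keeps information from the test set out of the model.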
In the following code we introduce a comprehensive function that evaluates
various classifiers on our training and test sets. This function is useful for ML
practitioners who are investigating different ML models to find the best fit for
their data.
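from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from tqdm import tqdm
import itertools

# The function name, its signature, and the preprocessing and evaluation steps
# are assumed here; they follow the Code Breakdown given after the output.
def evaluate_classifiers(X_train, y_train, X_test, y_test):
    # Convert the y DataFrames to 1D arrays and label-encode them
    encoder = LabelEncoder()
    y_train = encoder.fit_transform(y_train.values.ravel())
    y_test = encoder.transform(y_test.values.ravel())
    # Scale the features (the scaler is fitted on the training set only)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    classifiers = {
        # Entries preceding 'Random Forest' are not shown in this listing
        'Random Forest': {
            'model': RandomForestClassifier(),
            'params': {'n_estimators': [50, 100, 200],
                       'max_features': ['sqrt']}
        },
        'AdaBoost': {
            'model': AdaBoostClassifier(),
            'params': {}
        },
        'XGBoost': {
            'model': XGBClassifier(eval_metric='mlogloss'),
            'params': {}
        },
        'GBM': {
            'model': GradientBoostingClassifier(),
            'params': {}
        },
        'LightGBM': {
            'model': LGBMClassifier(),
            'params': {}
        },
        'Naive Bayes': {
            'model': GaussianNB(),
            'params': {}
        },
        'MLP': {
            'model': MLPClassifier(),
            'params': {'hidden_layer_sizes': [(50, 50, 50),
                                              (50, 100, 50), (100,)],
                       'activation': ['tanh', 'relu'],
                       'solver': ['sgd', 'adam'],
                       'learning_rate': ['constant', 'adaptive'],
                       'alpha': [0.0001, 0.001, 0.5],
                       'max_iter': [1000, 2000],
                       'random_state': [42]}
        },
        'Linear Discriminant Analysis': {
            'model': LinearDiscriminantAnalysis(),
            'params': {}
        },
        'Quadratic Discriminant Analysis': {
            'model': QuadraticDiscriminantAnalysis(),
            'params': {}
        }
        # We can add more classifiers and parameters here
    }

    results = []
    for name, spec in tqdm(classifiers.items()):
        keys = list(spec['params'])
        # Try every combination of the listed parameter values
        for values in itertools.product(*(spec['params'][k] for k in keys)):
            params = dict(zip(keys, values))
            model = spec['model'].set_params(**params)
            model.fit(X_train, y_train)
            # Accuracy on the test set (assumed scoring choice)
            results.append((name, params, model.score(X_test, y_test)))
    return results

results = evaluate_classifiers(datasets['X_train'], datasets['y_train'],
                               datasets['X_test'], datasets['y_test'])
# Find the highest score and its corresponding model and parameters
best_model, best_params, best_score = max(results, key=lambda x: x[2])
print('Best Model:', best_model)
print('Best Parameters:', best_params)
print('Best Score:', best_score)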
# Output
Best Model: XGBoost
Best Parameters: {}
Best Score: 0.9617422012948793
Code Breakdown:
• Converting y Datasets to 1D Arrays: The target variables are converted from
DataFrames into 1-dimensional numpy arrays to fit the requirements of
scikit-learn’s models.
• Label Encoding: The target variables are label-encoded to transform
categorical labels into a numerical format.
• Feature Scaling: The StandardScaler is applied to scale the features in X_train
and X_test.
Linear Regression
Linear regression serves as a foundational approach that models the
relationship between two variables (explanatory variable x and dependent variable
y) by fitting a linear equation to the gathered data.
Formula:
y = w0 + w1 x + ε
where,
• y is the dependent variable (predicted variable e.g., weight gain).
• x is the independent variable (predictor e.g., food intake).
• w1 is the slope (regression coefficient).
• w0 is the y-intercept of the line. This parameter represents the value of y when
x is 0.
• ε is the random error term.
The objective of simple linear regression is to establish a line (often referred
to as the optimal fit line) that passes through the data points in such a way as
to minimize the total of the squared residuals. Here, a residual represents the
discrepancy between the actual value and the value predicted by the linear model.
To estimate the parameters within a linear regression framework, the Ordinary
Least Squares (OLS) technique is employed. This method aims to reduce the total
squared deviations between the observed dependent variable in the dataset and the
predictions made by the linear equation.
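from sklearn.preprocessing import add_dummy_feature

# Prepend a column of ones so that the intercept w0 is estimated with the slope w1
X_intercept = add_dummy_feature(food_intake)
# Output (estimated w0 and w1)
array([[52.12061003],
       [ 0.49769395]])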
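For reference, the OLS estimate has the closed form w = (XᵀX)⁻¹Xᵀy, often called the normal equation. The sketch below applies it to synthetic data; the generated food_intake and weight_gain values are assumptions and are not the dataset used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: weight gain grows roughly linearly with food intake
food_intake = rng.uniform(0, 120, size=(50, 1))
weight_gain = 50.0 + 0.5 * food_intake + rng.normal(0, 2.0, size=(50, 1))

# Design matrix with a leading column of ones for the intercept w0
X_b = np.hstack([np.ones((50, 1)), food_intake])

# Solve (X^T X) w = X^T y, which is numerically safer than forming the inverse
w_hat = np.linalg.solve(X_b.T @ X_b, X_b.T @ weight_gain)

residuals = weight_gain - X_b @ w_hat
print('w0, w1:', w_hat.ravel())
print('Sum of squared residuals:', float((residuals ** 2).sum()))
```

The estimates should land close to the intercept of 50 and slope of 0.5 used to generate the data, with the remaining gap caused by the added noise.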
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Output
Intercept: 52.120610025626576
Coefficient for Food Intake: 0.4976939463543877
In this code:
• Data Preparation: The dataset is defined as X and y, representing the
independent variable (Food Intake) and dependent variable (Weight Gain),
respectively.
• Model Creation and Fitting: A linear regression model is created using
LinearRegression() and then fitted to the data with model.fit(X, y). This step
involves internally calculating the best-fit line.
• Displaying Model Parameters: After fitting the model, the intercept and
coefficient for the Food Intake variable are printed. These values represent
the y-intercept and slope of the regression line, respectively.
• Visualization: The code then plots the original data points (X, y) as green dots
and the regression line as a red line. This visualization helps in understanding
the fit of the model and the relationship between Food Intake and Weight
Gain.
• Plot Formatting: The plot is formatted with labels, grid, legend, and title for
better readability and understanding (refer to Figure 7.10).
[Figure 7.10: Regression line fitted to the data, with Weight Gain on the y-axis
and Food Intake on the x-axis.]
where,
• y is the output variable.
• x1, x2, …, xn are the predictors.
• w1, w2, …, wn are the coefficients that represent the weight of each predictor.
• w0 is the intercept.
• ε is the random error.
# Plotting
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
# Output
Intercept: 49.72927664354033
Coefficients: [0.50184912 0.10654829]
In this example, LinearRegression is used, but this time with multiple features
to predict Weight Gain. The model’s coefficients indicate the influence of each
feature. We also create a meshgrid using np.meshgrid (Figure 7.11). This meshgrid
is used to predict values across the entire grid, providing data for the surface plot.
Then, multi_model.predict(onlyX) is used to predict the Weight_Gain over the
meshgrid. The ax.scatter plots the actual data points in the 3D space. Additionally,
the ax.plot_surface plots the plane of best fit using the predicted values. This
plane represents the multiple linear regression model.
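The meshgrid step can be sketched without the plotting code. The two synthetic predictors and the target below are assumptions; only the pattern of fitting, building a grid, and predicting over it mirrors the description above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Two hypothetical predictors and a target that depends linearly on both
X = rng.uniform(0, 100, size=(80, 2))
y = 50.0 + 0.5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 1.0, size=80)

multi_model = LinearRegression().fit(X, y)

# Build a regular grid over both predictors and predict on every grid point
g1, g2 = np.meshgrid(np.linspace(0, 100, 20), np.linspace(0, 100, 20))
grid_points = np.column_stack([g1.ravel(), g2.ravel()])
surface = multi_model.predict(grid_points).reshape(g1.shape)

print('Intercept:', multi_model.intercept_)
print('Coefficients:', multi_model.coef_)
print('Surface shape:', surface.shape)   # (20, 20)
```

The arrays g1, g2, and surface have matching shapes, which is the form expected by ax.plot_surface when the plane is drawn.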
Polynomial Regression
In polynomial regression, the relationship between the predictor x and the
dependent variable y is modelled as an nth-degree polynomial.
Advantages:
• It can model a wider range of relationships between variables.
• Provides a better fit for datasets that are not well-described by a linear
relationship.
Disadvantages:
• Prone to overfitting, especially with high-degree polynomials.
• The interpretability of the model can be challenging compared to simple
linear regression.
Parameters in Polynomial Regression:
• Degree of the Polynomial (n): Determines how flexible (curved) the fit can be.
• Regularization: Techniques like Ridge or Lasso can help to avoid overfitting,
especially for higher-degree polynomials.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
poly_reg_model = LinearRegression()
poly_reg_model.fit(X_poly, y)
# Plot formatting
plt.xlabel('Food Intake')
plt.ylabel('Weight Gain')
plt.legend()
plt.title('Polynomial Regression with Different Degrees')
plt.show()
15.0
12.5
Weight Gain
10.0
7.5
5.0
2.5
0.0
–4 –3 –2 –1 0 1 2 3 4
Food Intake
Figure 7.12: Polynomial regression of various degree values.
Figure 7.12 illustrates how each polynomial degree fits the data. The low-degree
polynomial will appear as a smooth curve, the moderate degree as a more flexible
curve, and the high degree might show extreme bends and turns, indicating
overfitting. This demonstration highlights the importance of choosing the right
degree for polynomial regression.
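The influence of the degree can also be checked numerically. In the sketch below, the quadratic synthetic data and the three degrees are assumptions; the pipeline of PolynomialFeatures followed by LinearRegression is one common way to fit a polynomial with scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
# Assumed data: a noisy quadratic relationship
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2.0 + rng.normal(0, 1.0, size=60)

train_mse = {}
for degree in (1, 2, 10):
    # Expand x into [x, x^2, ..., x^degree] and fit a linear model on top
    model = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                          LinearRegression())
    model.fit(X, y)
    train_mse[degree] = mean_squared_error(y, model.predict(X))
    print(f'degree={degree}: training MSE={train_mse[degree]:.3f}')
```

The training error keeps falling as the degree grows, which is why it cannot be used on its own to choose the degree; a held-out set or cross-validation is needed to detect overfitting.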
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
X = data.drop('Weight_Gain', axis=1)
y = data['Weight_Gain']
# Ridge Regression
# Evaluation
y_pred = ridge_reg.predict(X_test)
# Output
Mean Squared Error (MSE): 15.48
R-squared (R2) Score: 0.4744
• Lasso Regression (L1): Effective for models where some features are
irrelevant, using L1 regularization to promote sparsity.
– Common Hyperparameters: alpha (regularization strength).
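from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
# Lasso regression
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train, y_train)
# Evaluation
y_pred = lasso_reg.predict(X_test)
print(f'Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred):.2f}')
print(f'R-squared (R2) Score: {r2_score(y_test, y_pred):.4f}')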
# Output
Mean Squared Error (MSE): 15.41
R-squared (R2) Score: 0.4765
# Elastic net
# Evaluation
y_pred = elastic_net.predict(X_test)
# Output
Mean Squared Error (MSE): 14.74
R-squared (R2) Score: 0.4992
• Support Vector Regression (SVR): Suitable for both linear and non-linear
relationships, especially effective in high-dimensional spaces.
– Common Hyperparameters: C (regularization parameter), kernel (type of
kernel used), epsilon (the width of the tube around the prediction within
which errors incur no penalty).
# Evaluation
y_pred = svr_reg.predict(X_test)
# Output
Mean Squared Error (MSE): 2.45
R-squared (R2) Score: 0.9166
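from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
tree_reg = DecisionTreeRegressor(max_depth=3)
tree_reg.fit(X_train, y_train)
# Evaluation
y_pred = tree_reg.predict(X_test)
print(f'Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred):.2f}')
print(f'R-squared (R2) Score: {r2_score(y_test, y_pred):.4f}')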
# Output
Mean Squared Error (MSE): 2.63
R-squared (R2) Score: 0.9107
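from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
forest_reg = RandomForestRegressor(n_estimators=100)
forest_reg.fit(X_train, y_train)
# Evaluation
y_pred = forest_reg.predict(X_test)
print(f'Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred):.2f}')
print(f'R-squared (R2) Score: {r2_score(y_test, y_pred):.4f}')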
# Output
Mean Squared Error (MSE): 1.95
R-squared (R2) Score: 0.9339
# Evaluation
y_pred = gbrt.predict(X_test)
# Output
Mean Squared Error (MSE): 1.75
R-squared (R2) Score: 0.9404
Unsupervised Learning
Unsupervised Competitive Learning and Self-organizing Feature Maps
Unsupervised learning requires no target values. This means that the network uses
the information available from the input patterns to classify its outputs. Therefore,
the input patterns must have some redundant information for the network to extract
the features, regularities, correlations, or categories embedded in the input data.
In a competitive learning algorithm only one output or only one output per group
is activated. The activated output unit is called the winner-take-all unit or the
grandmother cell (Hertz et al., 1991). A network that uses such a learning rule
aims at categorizing or clustering the input data into separate groups. An input
pattern recognized as belonging to a particular group should fire the same output
unit as the other members of the same group.
Self-organizing feature maps (SOFMs) are a class of neural networks that use the
unsupervised competitive learning rule. The network needs only one output for
each group involved, which means that for N groups, the network requires N output
units. Furthermore, the output units are placed in a line or plane geometrical
structure. Hence, the position of the winning unit gives some information about
its neighboring units. The locations of cells represent a particular domain of their
input patterns.
SOFMs have been used as an alternative to the traditional structures of neural
networks. An analysis of these networks has been carried out from the technical
rather than the biological viewpoint; however, the learning results look natural,
which indicates that the adaptive processes may be like those encountered in the
brain. Like BP networks, SOFMs have various applications; they can be applied to
pattern recognition, robotics, and process control.
[Figure 7.13: The two common SOFM structures: (a) inputs x1 and x2 connected to a
two-dimensional plane of output units; (b) inputs arranged in a two-dimensional plane.]
There are two common types of SOFM structure, as shown in Figure 7.13.
In both cases, the outputs are organized in a two-dimensional plane. In the first
type, the inputs are presented to the network as continuous values, while in the
second type the inputs themselves are arranged in a two-dimensional plane.
For the first type of SOFM (refer to Figure 7.13(a)), the inputs x1 and x2 are
connected to each output unit via the weights and are presented directly to the
network. The organization of the output does not have to be in the square plane
and the inputs can have higher dimensions. The second type (illustrated in Figure
7.13(b)) has more biological importance; however, it is less studied than the first
type. In this case, the inputs are organized in a two-dimensional plane that defines
the input space. In the simplest form, one of the input patterns defined in the input
space is turned on. The aim of the training process is to find a suitable mapping
from the input to the output spaces. The importance of this type of feature map
comes from the fact that such mapping frequently occurs in the brain such as in
the connections of the sensing organs including the eye, ear and skin to the cortex
and the connections of the different areas of the cortex.
The input vector, X, is compared to the weights of each element of the output units.
A simple way of comparison is using the Euclidean distance, where we have:
$$\lVert X - W_j \rVert = \sqrt{\sum_{i=1}^{M} (x_i - w_{ji})^2}.$$
The best-matching node is defined as the output node in which the Euclidean
distance between the input pattern and the weight vector is minimum, i.e.,
where, α(t) represents the learning rate, while σ(t) defines the width of the kernel
function. Both parameters α(t) and σ(t) are monotonically decreasing functions
of time. The values of rc and ri are the coordinates of cells c and i, respectively.
The initial value of α(t) can be selected to be near to one and then its value is
decreased monotonically with time. The function that controls the value of σ(t)
can be selected as linear, exponential, or inversely proportional to t. One possible
choice for α(t) is 0.9(1 – t/1000).
The number of steps used to train the network has great influence on the final
accuracy of the mapping. The learning process is stochastic, meaning that the
accuracy of the mapping depends on the number of steps in the final convergence
phase. A proper selection for the number of steps can be around 500 times the
number of units in the network.
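The training procedure can be sketched as follows. The weight update used here, w_j ← w_j + α(t) exp(−‖r_c − r_j‖² / (2σ(t)²)) (x − w_j), is the commonly used Gaussian-neighbourhood form and is an assumption, as are the 5 × 5 grid, the exponential schedule for σ(t), and the uniform random inputs. The number of steps is kept at 1000 for brevity, far fewer than the 500 steps per unit suggested above.

```python
import numpy as np

def train_sofm(data, grid_shape=(5, 5), n_steps=1000, sigma0=2.0, seed=0):
    """Train a small self-organizing feature map and return its weight vectors."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Grid coordinates of every output unit
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    weights = rng.random((rows * cols, data.shape[1]))
    for t in range(n_steps):
        x = data[rng.integers(len(data))]               # pick one input pattern at random
        alpha = 0.9 * (1.0 - t / n_steps)               # linearly decreasing learning rate
        sigma = sigma0 * np.exp(-3.0 * t / n_steps)     # shrinking kernel width (assumed schedule)
        # Best-matching unit: smallest Euclidean distance between x and the weights
        c = np.argmin(np.linalg.norm(x - weights, axis=1))
        # Gaussian neighbourhood around the winner, measured on the output grid
        h = np.exp(-np.sum((coords - coords[c]) ** 2, axis=1) / (2.0 * sigma ** 2))
        weights += alpha * h[:, None] * (x - weights)
    return weights

rng = np.random.default_rng(1)
data = rng.random((500, 2))        # hypothetical inputs x1, x2 in the unit square
weights = train_sofm(data)
print(weights.shape)               # (25, 2)
```

Because units close to the winner on the grid are pulled towards the same input, neighbouring units end up with similar weight vectors, which is the topology-preserving property of the map.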
What is make_blobs?
make_blobs is a utility from Scikit-learn’s datasets module. It generates isotropic
Gaussian blobs for clustering, meaning it creates clusters of data points centered
around specified locations, with a certain degree of variance and number of
centers.
• n_features = 2: This sets the number of features (or dimensions) for each
sample to 2. This means every data point will have two coordinates, making
it easy to visualize on a 2-dimensional plot.
• cluster_std = [0.7, 1.0, 0.4, 0.7, 1.0]: This is a list specifying the standard
deviation of the clusters. The standard deviation controls how much the data
points in each cluster are spread out around the cluster center.
• random_state = 42: This sets the seed for the random number generator that
make_blobs uses to create the dataset, so the same data are generated on every run.
The result of this function call is two variables, X and y:
• X will be a NumPy array of shape (1000, 2), containing the coordinates for the
1000 generated samples across 2 features.
• y will be a NumPy array of shape (1000,), containing the cluster labels for
each sample, ranging from 0 to 4 (since there are 5 centers).
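A call consistent with the description above is sketched below. The values n_samples=1000 and centers=5 are inferred from the stated array shapes and the number of centers; the remaining arguments are the ones listed.

```python
import numpy as np
from sklearn.datasets import make_blobs

# n_samples and centers are inferred from the shapes described in the text
X, y = make_blobs(n_samples=1000, centers=5, n_features=2,
                  cluster_std=[0.7, 1.0, 0.4, 0.7, 1.0], random_state=42)

print(X.shape, y.shape)   # (1000, 2) (1000,)
print(np.unique(y))       # cluster labels 0 to 4
```

Because cluster_std is given as a list, each of the five clusters receives its own spread.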
[Figure: scatter plot of the generated dataset, with Feature 2 plotted against
Feature 1.]
We are using the GaussianMixture class to fit the model to our data. We specify
n_components = 5, corresponding to the number of clusters we know our data to
have. The GaussianMixture object uses the Expectation-Maximization algorithm
to find the parameters of the Gaussians that best fit our data. After fitting the
model, we use it to predict labels for our dataset, which we then plot to visualize
the resulting clusters.
[Figure: scatter plot of the clustered data points, with Feature 2 plotted against
Feature 1.]
Figure 7.15: Effective cluster identification using Gaussian Mixture model: A close match
with predefined groups in the synthetic dataset.
The application of GMM on our synthetic dataset has yielded good results. The
clusters identified by the GMM align closely with the five groups we initially
defined using the make_blobs function.
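The agreement between the GMM clusters and the generated groups can also be quantified. The sketch below regenerates the data with an assumed make_blobs call (n_samples=1000 and centers=5 are inferred from the text) and uses the adjusted Rand index, which is an added illustration and not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

X, y = make_blobs(n_samples=1000, centers=5, n_features=2,
                  cluster_std=[0.7, 1.0, 0.4, 0.7, 1.0], random_state=42)

# Fit a five-component mixture with Expectation-Maximization and assign labels
gmm = GaussianMixture(n_components=5, random_state=42)
labels = gmm.fit_predict(X)

# The adjusted Rand index is 1.0 for identical groupings and near 0 for random ones
ari = adjusted_rand_score(y, labels)
print(f'Adjusted Rand index: {ari:.3f}')
print('Estimated means:\n', gmm.means_)
```

The cluster numbers assigned by the GMM are arbitrary, so a permutation-invariant score such as the adjusted Rand index is needed when comparing them with the original labels.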
Hierarchical Clustering
Hierarchical clustering is an approach to cluster analysis that aims to create a
hierarchy of clusters. The results of hierarchical clustering are usually presented
in a dendrogram, which illustrates the arrangement of the clusters produced by the
analysis. In the context of farm animal activity recognition, hierarchical clustering
can be particularly useful for understanding behaviors that operate at different
scales or granularities.
# Agglomerative Clustering
[Figure 7.16: Dendrogram of the agglomerative clustering.]
Code Explanation:
• Agglomerative Clustering: AgglomerativeClustering from scikit-learn is
used with 5 clusters. The model is fitted to X, and a cluster label is given to
each data point.
• Dendrogram Visualization (Figure 7.16): The dendrogram is plotted using
sch.dendrogram with Ward’s method as the linkage criterion. This visualization
helps in understanding the cluster formations and their hierarchical
relationships.
• Cluster Plotting: The data points are plotted with colours indicating their
cluster membership, showing how they are grouped into clusters.
The flexibility and specific constraints of metric choices in scikit-learn are key
factors when exploring distance metrics for linkage computation in agglomerative
clustering:
• Metric Options: The metric used to compute the linkage in agglomerative
clustering can vary. Acceptable metrics include euclidean, l1, l2, manhattan,
cosine, or precomputed. This range of metrics allows the algorithm to
be tailored to different types of data and different notions of distance or
similarity.
• Constraint for Ward Linkage: If the linkage method chosen is ward, the only
accepted metric is euclidean. This is because Ward’s method inherently relies
on the minimization of variance within clusters, which is directly tied to the
Euclidean distance.
• Precomputed Distance Matrix: When using a precomputed metric, the input
for the fit method needs to be a distance matrix rather than the raw data points.
[Figure: scatter plot of the clustered data points, with Feature 2 plotted against
Feature 1.]
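A compact sketch of the agglomerative workflow is shown below. The data are regenerated with an assumed make_blobs call, and the dendrogram is computed with no_plot=True and truncated to 12 leaves so that the example runs without a plotting backend; these choices are illustrative assumptions.

```python
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, n_features=2,
                  cluster_std=[0.7, 1.0, 0.4, 0.7, 1.0], random_state=42)

# Ward linkage merges the pair of clusters that least increases the
# within-cluster variance, which implies Euclidean distances
agg = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = agg.fit_predict(X)

# Merge history for the dendrogram: one row per merge
Z = sch.linkage(X, method='ward')
# no_plot=True returns the dendrogram layout without drawing it
dendro = sch.dendrogram(Z, truncate_mode='lastp', p=12, no_plot=True)

print('Clusters found:', len(set(labels)))
print('Linkage matrix shape:', Z.shape)   # (999, 4)
```

Dropping no_plot=True draws the dendrogram on the current matplotlib axes instead of only returning its layout.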
Partitional Clustering
Unlike hierarchical clustering, partitional clustering divides the dataset into disjoint
clusters without any explicit structure that would relate clusters to each other.
Characteristics:
• Disjoint Clusters: Each object belongs to one and only one cluster.
• No Hierarchical Relationship: There is no concept of child or parent clusters;
all clusters are on the same level.
K-Means
K-Means clustering is a straightforward and widely used algorithm for grouping
data. Its objective is to divide a collection of observations into a predefined
number of clusters, minimizing the within-cluster variances.
How K-Means Works:
1. Initialization: K-Means starts by randomly selecting ‘k’ centroids, where
‘k’ represents the total count of clusters you choose. These centroids are the
initial guesses for the locations of the cluster centers.
2. Assignment Step: Every data point is allocated to the closest centroid,
determined by the squared Euclidean distance. This forms ‘k’ clusters.
3. Update Step: The centroids of the clusters are recalculated. Each new centroid
is the average of all data points assigned to that cluster.
4. Iterative Optimization: Steps 2 and 3 continue until there is minimal to no
movement in the centroids, indicating that convergence has been achieved.
Formula:
The objective function of K-Means is defined as:
$$J = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where,
• J is the cost function to be minimized.
• k is the number of clusters.
• Si is the set of data points in the ith cluster.
• x is a data point in cluster Si.
• μi is the centroid of the ith cluster.
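The four steps and the objective J can be written out directly in NumPy. This is a from-scratch sketch for illustration, not the scikit-learn implementation; the function name k_means, the handling of empty clusters, and the two-blob test data are assumptions.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain K-Means: returns labels, centroids, and the objective J."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assignment: squared Euclidean distance to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 3. Update: mean of the assigned points (an empty cluster keeps its centroid)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # 4. Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Objective J: sum of squared distances of points to their assigned centroid
    J = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, J

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
labels, centroids, J = k_means(X, k=2)
print(centroids)
print('J =', J)
```

The returned J corresponds to the inertia_ attribute reported by scikit-learn's KMeans, which makes it possible to cross-check the two implementations on the same data.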
Python Example for K-Means:
# K-Means Clustering
Figure 7.18 illustrates the results of the K-Means clustering analysis above,
showing five distinct groups across two features with their respective centroids
marked by black crosses.
[Figure: scatter plot of the five K-Means clusters and their centroids, with
Feature 2 plotted against Feature 1.]
Figure 7.18: K-Means clustering plot.
for each value, calculate the sum of squared distances from each point to its
assigned center (the within-cluster sum of squares, WCSS).
• Plotting the Results: The values of WCSS are then plotted against the number
of clusters. As the number of clusters increases, the WCSS will typically
decrease (since the points are closer to their respective centers).
• Identifying the Elbow: The key is to find the point where the rate of decrease
changes, which typically represents a situation where adding more clusters
does not provide better data modelling. This point is often referred to as the
elbow, similar to the angle in the human arm.
The Benefits of Using the Elbow Method:
• Trade-off Between Simplicity and Accuracy: This method helps to strike a
balance between maximum data compression (a single cluster) and maximum
accuracy (each data point allocated to its own cluster).
• Intuitive and Easy to Implement: It provides a simple way to visually assess
the optimal number of clusters.
Limitations of the Elbow Method:
• Subjectivity: At times, the elbow point may not be distinctly visible or
pronounced, leading to subjective interpretations.
• Not Applicable to All Datasets: Some datasets might not demonstrate a clear
elbow, or the elbow method may not be appropriate for datasets with structures
that do not align well with spherical clusters, as assumed by K-Means.
Example of Elbow Method in Python:
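import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):  # experimenting on different numbers of clusters
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for each model

plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()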
In the above script, the WCSS (inertia) is calculated for different values of k and
plotted. The elbow of the plot is where you would typically choose the optimal
number of clusters to use for your K-Means clustering.
The plot generated by the Elbow Method provides a visual way to determine
the optimal number of clusters for K-Means. In this graph, the x-axis represents
the number of clusters tested, while the y-axis shows the WCSS. The WCSS is a
measure of variance within each cluster; lower values indicate that the data points
are closer to their respective cluster centroids.
In Figure 7.19, we can notice that as the number of clusters increases, the WCSS
initially decreases rapidly. This decrease slows down after reaching a certain
number of clusters, beyond which adding more clusters does not significantly
improve the compactness of the clusters. This point, where the rate of decrease
changes and the plot starts to level off, is the elbow.
[Figure 7.19: Elbow Method plot of WCSS against the Number of Clusters.]
$$S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
where,
• a(i) represents the mean distance of the ith point from the other points within
the same cluster, assessing the cluster’s tightness (measures cohesion).
• b(i) denotes the smallest mean distance from the ith point to the points of
any other cluster, i.e., the mean distance to the nearest neighbouring cluster,
indicating how well-separated the point is from other clusters (measures separation).
The silhouette score, which can vary between –1 and +1, serves as a measure
of how well an object fits into its own cluster compared to neighbouring clusters.
A higher score suggests a good fit within its own cluster and a poor fit with adjacent
clusters. A clustering setup is considered suitable if most objects score highly.
Conversely, a prevalence of low or negative scores may indicate an inappropriate
number of clusters.
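The definitions of a(i) and b(i) can be turned into code almost literally. The sketch below is a from-scratch illustration on four hypothetical points; the function name and the convention of scoring singleton clusters as 0 are assumptions.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette value S(i) of every point, computed from its definition."""
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster: score left at 0
            continue
        a = dist[i, same].mean()             # cohesion: mean distance within own cluster
        b = min(dist[i, labels == other].mean()   # separation: nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Four hypothetical points forming two tight, well-separated pairs
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])
scores = silhouette_values(X, labels)
print(scores)
print('Average silhouette score:', scores.mean())   # about 0.80
```

Averaging the per-point values gives the overall score; for larger datasets scikit-learn's silhouette_score is the more efficient choice.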
Python Example:
# Silhouette Score
from sklearn.metrics import silhouette_score
# Output
For n_clusters = 2 the average silhouette_score is : 0.6104719594142752
For n_clusters = 3 the average silhouette_score is : 0.7281412339755425
For n_clusters = 4 the average silhouette_score is : 0.7749732690936251
For n_clusters = 5 the average silhouette_score is : 0.7286939542403489
For n_clusters = 6 the average silhouette_score is : 0.6616726780183116
For n_clusters = 7 the average silhouette_score is : 0.6420350016075017
For n_clusters = 8 the average silhouette_score is : 0.6438986259781332
For n_clusters = 9 the average silhouette_score is : 0.5477904966277551
For n_clusters = 10 the average silhouette_score is : 0.45689258731284543
Code Explanation:
• The K-Means clustering algorithm is applied to this dataset with a varying
number of clusters (n_clusters).
• After clustering, the silhouette score for each number of clusters is calculated
using silhouette_score from sklearn.metrics.
• Finally, we plot the silhouette scores against the number of clusters to visually
determine the best number of clusters.
[Figure: line plot of the Silhouette Score against the Number of Clusters (2 to 10).]
Figure 7.20: Silhouette scores for K-Means clustering with different numbers of clusters.
Figure 7.20 shows silhouette scores for different numbers of clusters, ranging
from 2 to 10. The key observations are:
• The silhouette score is highest when the number of clusters is 4, with an
average score of approximately 0.775. This indicates a very good structure,
as the score is close to 1, suggesting that the clusters are well separated and
distinct.
• As the number of clusters increases beyond 4, the silhouette score starts to
decline. This suggests that adding more clusters does not contribute to better-
defined or more distinct clusters. Particularly, there is a notable decrease in
the score when moving from 4 to 7 clusters and beyond.
• With 2 or 3 clusters, the scores are lower than with 4 clusters, indicating that
the data points are not as appropriately grouped into distinct clusters as they
are with 4 clusters.
• For the highest numbers of clusters (9 and 10), the silhouette scores drop
sharply to about 0.55 and 0.46, suggesting that the data are being split into
too many clusters that are not very meaningful.
Note to the Reader: Impact of Initial Centroid Selection on Model Accuracy in
K-Means Clustering.
When applying the K-Means clustering algorithm, it is important to note that the
selection of initial centroids can significantly influence the model’s accuracy. The
initial placement of these centroids essentially sets the starting condition for the
iterative process of K-Means, which seeks to optimize the positions of centroids
to minimize within-cluster variances. Different initialization methods can lead to
different clustering outcomes, and in some cases, affect the convergence speed of
the algorithm.
Initialization Methods in Scikit-learn’s K-Means:
Scikit-learn offers several options for initializing centroids in K-Means, each with
its own advantages:
• K-Means++ (Default): This method improves upon simple random
initialization by spacing out the initial centroids. It does so by selecting
centroids that are likely to be distant from each other, reducing the chances
of poor cluster formation and speeding up convergence. The process involves
multiple sampling steps where the most suitable centroid is chosen from
several candidates.
• Random: In this straightforward approach, a set number of data points are
randomly chosen from the dataset to serve as the initial centroids. While
simple, this method can sometimes lead to less optimal clustering, particularly
if the randomly chosen centroids are not well distributed.
• Custom Array: If you have prior knowledge or a specific strategy, you can
directly pass an array of shape (n_clusters, n_features) that specifies the exact
initial positions for the centroids.
• Custom Function: For even greater control, a callable function can be
provided. This function should accept the data (X), the number of clusters
(n_clusters), and a random state as arguments and return the initial centroid
coordinates. This approach allows for a custom initialization strategy tailored
to specific data characteristics or requirements.
Selecting the right initialization method depends on the nature of your data and
the specific requirements of your clustering task. While the K-Means++ method
is generally a good starting point due to its balanced approach, exploring other
methods can be beneficial in certain scenarios, such as when dealing with datasets
that have unusual distributions or when prior knowledge about the data can be
leveraged for more informed centroid placement.
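The four initialization options listed above can be compared side by side. The sketch below is illustrative only: the data come from make_blobs with assumed settings, manual_centers simply reuses the first four samples, and farthest_point_init is a hypothetical custom initializer written for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with assumed settings
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)


def farthest_point_init(X, n_clusters, random_state):
    """Hypothetical custom initializer: greedy farthest-point selection."""
    first = random_state.randint(X.shape[0])
    centers = [X[first]]
    for _ in range(1, n_clusters):
        # Distance from every sample to its nearest already-chosen center
        dists = np.min(
            [np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    return np.array(centers)


# Custom array of shape (n_clusters, n_features); here just the first 4 samples
manual_centers = X[:4]

inits = {
    "k-means++": dict(init="k-means++", n_init=10),
    "random": dict(init="random", n_init=10),
    "custom array": dict(init=manual_centers, n_init=1),
    "custom function": dict(init=farthest_point_init, n_init=1),
}

inertias = {}
for name, params in inits.items():
    model = KMeans(n_clusters=4, random_state=42, **params).fit(X)
    inertias[name] = model.inertia_
    print(f"{name:>15}: inertia = {model.inertia_:.2f}, "
          f"iterations = {model.n_iter_}")
```

Comparing the final inertia and the number of iterations gives a quick sense of whether an initialization strategy leads to a worse local minimum or slower convergence on a given dataset.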
• Core, Border, and Noise Points: DBSCAN categorizes points into core points,
border points, and noise. A core point has a minimum number of points
(minPts) within a given radius (ε), a border point is within the radius of a core
point but with fewer neighbors than minPts, and a noise point is neither a core
nor a border.
• Forming Clusters: Clusters are formed by connecting core points that are
within a distance ε of each other and including any border points that are
within this radius of core points.
Parameters:
Scikit-learn's DBSCAN implementation has two primary parameters:
• eps (ε): This is the maximum distance permitted for one sample to be
considered within the neighborhood of another.
• min_samples (minPts): This is the number of points, including the point
itself, required within the eps-neighborhood for a point to count as a core
point (the minimum density needed to form a cluster).
Python Example:
We will use DBSCAN on a synthetic dataset to illustrate its clustering capabilities,
especially in identifying noise and handling arbitrary cluster shapes.
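# DBSCAN
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Generate two half circles (n_samples, noise and random_state are assumed values)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan.fit(X)
labels = dbscan.labels_
# Create subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
# Plot original dataset
axs[0].scatter(X[:, 0], X[:, 1], s=20, alpha=.3)
axs[0].set_title('Original Dataset')
axs[0].set_xlabel('Feature 1')
axs[0].set_ylabel('Feature 2')
# Plot DBSCAN clustering result (noise points carry the label -1)
axs[1].scatter(X[:, 0], X[:, 1], c=labels, s=20, alpha=.5)
axs[1].set_title('DBSCAN Clustering')
axs[1].set_xlabel('Feature 1')
axs[1].set_ylabel('Feature 2')
plt.show()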
In this example, make_moons is used to create a dataset with two half circles.
DBSCAN is then applied to this dataset. The algorithm is capable of identifying
the two moon-shaped clusters while marking any outliers as noise (refer to Figure
7.21).
Applying DBSCAN:
• DBSCAN is imported from Scikit-learn and applied to the dataset with eps =
0.2 and min_samples = 5.
• dbscan.fit(X) fits the DBSCAN model to the data X.
• labels contains the cluster labels for each data point. Noise points are labeled
as –1.
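The core, border, and noise categories can be recovered from a fitted model. The following is a minimal sketch under assumed dataset settings (only eps = 0.2 and min_samples = 5 come from the text); it relies on the core_sample_indices_ attribute of Scikit-learn's DBSCAN.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Assumed dataset settings; only eps and min_samples come from the text
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = dbscan.labels_

# Core points are reported by index; noise points carry the label -1
core_mask = np.zeros(len(X), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True
noise_mask = labels == -1
# Border points belong to a cluster but are not core points
border_mask = ~core_mask & ~noise_mask

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}")
print(f"Core points:    {core_mask.sum()}")
print(f"Border points:  {border_mask.sum()}")
print(f"Noise points:   {noise_mask.sum()}")
```

If most points end up as noise, eps is probably too small or min_samples too large for the density of the data.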
The world of supervised and unsupervised algorithms extends beyond what
we have covered. We encourage readers who are interested in expanding their
knowledge to explore further resources and literature. The field of machine
learning is rapidly evolving, and staying updated with the latest developments
and research can provide deeper insights and more sophisticated tools for data
analysis.
To classify the behaviors of walking, grazing, and resting in cattle, data from
the Global Positioning System (GPS) was analyzed using linear discriminant
analysis, which identified 71% of the behaviors (Schlecht et al., 2004). Similarly,
another research effort implemented collars equipped with both accelerometer
and GPS sensors on cattle to differentiate behaviors such as foraging,
ruminating, resting, and other active states. By applying a method based on
threshold decision trees, this study claimed to achieve an accurate classification
of 90.5% of the collected data points (González et al., 2015). Furthermore,
classification tree analysis and K-Means clustering were employed on GPS data
for the behavioral categorization of cattle (Ungar et al., 2005; Schwager et al.,
2007).
In the domain of equine monitoring, activities of horses were examined using
sensors for acceleration, gyroscope, and magnetometry, with data processing
conducted through an embedded multilayer perceptron algorithm (Gutierrez-
Galan et al., 2018). This approach resulted in an 81% accuracy rate in recognizing
horse behaviors in real-world conditions. Additionally, accelerometer data was
utilized in conjunction with threshold-based statistical methods to monitor
standing and feeding behaviors in cows (Arcidiacono et al., 2017).
Another study combined the Boruta feature selection technique with several
machine learning algorithms, including multilayer perceptron, random
forests, extreme gradient boosting, and K-nearest neighbors. Among these, the
random forests algorithm stood out, delivering an accuracy of 96.47%
and a kappa value of 95.41% (Kleanthous et al., 2018).
The recognition of animal activities holds considerable significance for the
agricultural community, animal behaviorists, and conservationists, as it serves as
a crucial indicator of an animal’s health and nutritional intake, especially when
observations are made throughout their daily cycles. Leveraging machine learning
techniques offers a sophisticated means to discern the activities of livestock,
facilitating the differentiation of complex behavioral patterns that are challenging
and laborious to identify through human observation alone.
Together, these studies underscore the transformative potential of machine learning
and IoT technologies in advancing the field of animal behavior monitoring. By
leveraging data science approaches, researchers are enhancing our understanding
of animal welfare as well as paving the way for more sustainable and efficient
agricultural practices.
Summary
In this chapter, we took a look at supervised and unsupervised learning algorithms
and their applications within animal research. We started with a foundational
overview of supervised machine learning models, highlighting the significance
Confusion Matrix
The Confusion Matrix is a fundamental tool in evaluating the performance of
classification models. It is especially valuable when dealing with multiple
classes, but its utility is also evident in binary classification scenarios. Essentially,
a Confusion Matrix is a tabular representation that illustrates the accuracy of a
model by comparing the actual and predicted classifications.
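# Confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Compute the matrix (rows: actual, columns: predicted) and draw it as a heatmap
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plt.show()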
Confusion Matrix (heatmap; rows: Actual, columns: Predicted)
Actual \ Predicted   Grazing   Resting   Scratching   Standing   Walking
Grazing                 4824         0            0          1         5
Resting                    1      8182            0         89         0
Scratching                29         0          129          0         7
Standing                   4        98            0       4150         2
Walking                   14         0            0          0      1853
Figure 8.1: Confusion matrix visualization for multiclass sheep behavior classification.
This code uses seaborn and matplotlib for visualization, providing a heatmap
representation of the confusion matrix. The confusion matrix is generated
by calling confusion_matrix(y_test, y_pred). This matrix offers a visual and
quantitative insight into the performance of our classification model, making it
easier to identify the areas where the model excels or needs improvement. Figure
8.1 shows the generated confusion matrix.
In the generated confusion matrix, each cell represents the count of predictions for
the actual labels (rows) versus the predicted labels (columns).
Here is how to interpret the TP, TN, FP, FN values from this matrix:
• TP: The diagonal cell for a class, where the predicted label matches the
actual label (e.g., ‘grazing’ predicted as ‘grazing’).
• TN: For a specific class, these are all the cells that lie in neither the row
nor the column of that class, that is, every instance that does not belong to
the class and was not predicted as it.
• FP: Off-diagonal cells in the column of the class in question, where the
predicted label is the class and the actual label is not (e.g., actual ‘grazing’
predicted as ‘walking’ is an FP for ‘walking’).
• FN: Off-diagonal cells in the row of the class in question, where the actual
label is the class and the predicted label is not (e.g., actual ‘resting’
predicted as ‘grazing’ is an FN for ‘resting’).
Please note that for multiclass confusion matrices, the concept of TN is less
straightforward than in binary classification because it must be computed
separately for each class from all the cells outside that class’s row and column.
The heatmap makes it easy to read the TP and FN values directly along a row.
However, calculating the TN and FP for each class requires considering the
other classes as well.
• Main Diagonal (True Predictions):
– ‘Grazing’: 4824 instances were correctly predicted as grazing.
– ‘Resting’: 8182 instances were correctly predicted as resting.
– ‘Scratching’: 129 instances were correctly predicted as scratching.
– ‘Standing’: 4150 instances were correctly predicted as standing.
– ‘Walking’: 1853 instances were correctly predicted as walking.
• Off-Diagonal (Incorrect Predictions):
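– ‘Grazing’ was once incorrectly predicted as ‘standing’ and 5 times as
‘walking’.
– ‘Resting’ was once incorrectly predicted as ‘grazing’ and 89 times as
‘standing’.
– ‘Scratching’ had 29 instances incorrectly predicted as ‘grazing’ and 7 as
‘walking’.
– ‘Standing’ had 4 instances incorrectly predicted as ‘grazing’, 98 as
‘resting’, and 2 as ‘walking’.
– ‘Walking’ had 14 instances incorrectly predicted as ‘grazing’.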
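With rows as actual labels and columns as predicted labels, the per-class TP, FP, FN, and TN counts can be derived directly from the matrix. The sketch below uses the values shown in Figure 8.1; the variable names are our own.

```python
import numpy as np

classes = ["grazing", "resting", "scratching", "standing", "walking"]
# Rows: actual class, columns: predicted class (values from Figure 8.1)
cm = np.array([
    [4824,    0,   0,    1,    5],
    [   1, 8182,   0,   89,    0],
    [  29,    0, 129,    0,    7],
    [   4,   98,   0, 4150,    2],
    [  14,    0,   0,    0, 1853],
])

tp = np.diag(cm)                # correct predictions on the diagonal
fp = cm.sum(axis=0) - tp        # predicted as the class, actually another
fn = cm.sum(axis=1) - tp        # actually the class, predicted as another
tn = cm.sum() - (tp + fp + fn)  # everything outside the class row and column

for name, tp_c, fp_c, fn_c, tn_c in zip(classes, tp, fp, fn, tn):
    print(f"{name:<11} TP={tp_c:5d} FP={fp_c:3d} FN={fn_c:3d} TN={tn_c:5d}")
```

Column sums give everything predicted as a class and row sums give everything that actually belongs to it, which is why FP and FN fall out as simple differences from the diagonal.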
Evaluation, Model Selection and Hyperparameter Tuning | 265
Accuracy
This is one of the most straightforward metrics. It calculates the proportion of
correct predictions in the overall predictions. In essence, it measures the overall
correctness of the model by answering the question: “Of all the predictions made,
how many were correct?”
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Accuracy is particularly useful when the class distribution is even, meaning each
class has a roughly equal number of instances. However, its usefulness weakens
when dealing with imbalanced datasets where one class significantly outnumbers
the other(s), as it can give a misleading impression of the model’s performance.
Limitations of Accuracy
While accuracy is an excellent measure for giving a quick overview of model
performance, it has its limitations:
• Imbalanced Classes: In datasets where some classes are underrepresented,
accuracy can be skewed, as the model may bias towards the majority class.
• Misleading Interpretation: A model with high accuracy might still be
performing poorly in one or more classes, which is why it is essential to
look at other metrics like precision and recall for a more comprehensive
evaluation.
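The effect of class imbalance on accuracy can be illustrated with a classifier that always predicts the majority class. The labels below are hypothetical (95 samples of one behavior and 5 of another), and DummyClassifier stands in for a model that has learned nothing useful.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 'resting' (0) and 5 'scratching' (1)
y_true = np.array([0] * 95 + [1] * 5)
X_dummy = np.zeros((len(y_true), 1))  # features are irrelevant here

# A classifier that always predicts the most frequent class
majority = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = majority.predict(X_dummy)

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=1)

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall for the minority class: {minority_recall:.3f}")
```

The accuracy looks strong even though not a single minority-class sample is detected, which is why per-class metrics should accompany accuracy on imbalanced data.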
# Accuracy
# Output
Accuracy: 0.987
Precision
Precision, also referred to as the positive predictive value, is a metric that estimates
the accuracy of the positive predictions made by a classification model. It answers
the question: “Of all the instances classified as positive, how many are actually
positive?” Precision is a critical measure when the costs of False Positives are
high.
The formula to calculate precision for a binary or multiclass classification task is:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Limitations of Precision
Precision alone does not tell the complete story of a model’s performance. It does
not take into account the false negatives, or the positive instances that the model
incorrectly classified as negative. Therefore, it is often used in conjunction with
recall (also known as sensitivity) to provide a more comprehensive evaluation of
a classifier’s performance.
# Precision
# Output
Multiclass Precision (Macro-average): 0.990
Multiclass Precision (Weighted-average): 0.987
In the above example, replace ‘positive_class_name’ with the actual name or label
of the positive class in your dataset. The average parameter defines the averaging
method used for multiclass classification. If it is not specified, the default is
average='binary', which applies only to binary targets and reports precision for
the class given by pos_label; multiclass tasks require an explicit setting such as
'macro' or 'weighted'. The printed results show the precision of the
model, formatted to three decimal places.
Recall
Recall, also known as sensitivity, addresses the question: “Of all the actual
positives, how many were identified correctly?” Recall is especially important in
situations where missing a positive instance is costly, such as in animal medical
diagnosis.
The formula for recall is:
\[ \text{Recall} = \frac{TP}{TP + FN} \]
Limitations of Recall
While recall is an essential metric, it does not consider False Positives. A model
with a high recall rate might also have a high number of False Positives, which
would not be ideal in every situation. As highlighted before, recall is often used in
conjunction with precision and other metrics to get a balanced view of the model’s
performance.
# Recall
from sklearn.metrics import recall_score
The average parameter is used to specify how the recall should be calculated in
a multiclass scenario. The printed results show the recall metric for the model,
formatted to three decimal places.
F1-score
The F1-score is a metric that combines precision and recall into a single value,
providing a balance between the two. It is particularly useful in scenarios with
an uneven class distribution, where accuracy alone can be misleading and both
false positives and false negatives matter.
The F1-score is the harmonic mean of precision and recall, giving both metrics
equal weight.
The formula for the F1-score is:
\[ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
This formula ensures that the F1-score takes both false positives and false
negatives into account. Consequently, a high F1-score indicates that the model
has a robust balance of precision and recall.
• Weighted-average F1-Score: Computing the F1-score for each class and then
taking the average, giving weights to each class according to their presence in
the dataset.
# F1-score
from sklearn.metrics import f1_score
# Output
Multiclass F1-Score (Macro-average): 0.966
Multiclass F1-Score (Weighted-average): 0.987
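To see how the macro and weighted averages differ, the metrics can be recomputed from the confusion matrix of Figure 8.1. The sketch below expands the matrix back into label vectors, which is our own device for making the example self-contained, and then calls precision_score, recall_score, and f1_score with both averaging modes. Rounded to three decimals, it should reproduce the precision and F1 values printed in this section.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rows: actual class, columns: predicted class (values from Figure 8.1)
cm = np.array([
    [4824,    0,   0,    1,    5],
    [   1, 8182,   0,   89,    0],
    [  29,    0, 129,    0,    7],
    [   4,   98,   0, 4150,    2],
    [  14,    0,   0,    0, 1853],
])

# Expand the matrix into label vectors: cell (i, j) contributes cm[i, j]
# samples with actual class i and predicted class j
actual_idx, predicted_idx = np.indices(cm.shape)
y_true = np.repeat(actual_idx.ravel(), cm.ravel())
y_pred = np.repeat(predicted_idx.ravel(), cm.ravel())

results = {}
for average in ("macro", "weighted"):
    results[average] = {
        "precision": precision_score(y_true, y_pred, average=average),
        "recall": recall_score(y_true, y_pred, average=average),
        "f1": f1_score(y_true, y_pred, average=average),
    }
    for metric, value in results[average].items():
        print(f"{metric:>9} ({average}-average): {value:.3f}")
```

The macro F1-score is noticeably lower than the weighted one because the small ‘scratching’ class, with its weaker recall, counts as much as the large classes in the macro average.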
Python Code
# Classification report
from sklearn.metrics import classification_report
# Output
precision recall f1-score support
The ROC curve is created by plotting the TPR against the FPR at several settings
of the threshold. The TPR is plotted on the y-axis, while the FPR is plotted on the
x-axis.
The formulas for these are:
\[ TPR = \frac{TP}{TP + FN} \]
\[ FPR = \frac{FP}{FP + TN} \]
The AUC-ROC metric indicates the model’s proficiency in differentiating between
classes. A higher AUC signifies a superior model, while an AUC of 0.5 implies an
absence of discriminatory power, which is equivalent to random guessing.
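A small worked example helps connect the rates to the curve. Taking TPR = TP / (TP + FN) and FPR = FP / (FP + TN), the sketch below computes both at a single threshold for four hypothetical samples and then lets roc_curve and roc_auc_score sweep all thresholds. The labels, scores, and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted scores for four samples
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# TPR and FPR at one chosen threshold
threshold = 0.5
y_hat = (y_score >= threshold).astype(int)
tp = np.sum((y_hat == 1) & (y_true == 1))
fn = np.sum((y_hat == 0) & (y_true == 1))
fp = np.sum((y_hat == 1) & (y_true == 0))
tn = np.sum((y_hat == 0) & (y_true == 0))
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print(f"Threshold {threshold}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")

# The ROC curve repeats this computation for every distinct threshold
fpr_curve, tpr_curve, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"ROC AUC = {auc:.2f}")
```

Each threshold yields one (FPR, TPR) point; the AUC summarizes the whole set of points in a single number.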
The ROC AUC is more straightforward and traditionally used for binary
classification problems. In binary classification, the ROC AUC provides a clear
and interpretable measure of a model’s ability to distinguish between the two
classes.
The ROC AUC is well suited to binary classification for the following
reasons:
• Clear Interpretation: In binary classification, the ROC curve plots the TPR
against the FPR at various threshold settings. This gives a clear and intuitive
understanding of the trade-off between correctly identifying positive cases
and incorrectly identifying negative cases as positive.
• Threshold-Independent: The ROC AUC provides a summary of model
performance across all possible classification thresholds, making it a robust
measure that is not dependent on a particular threshold value.
• Balanced View of Performance: It considers both classes (positive and
negative) equally, which is particularly useful when the classes are balanced
or when both types of classification errors (false positives and false negatives)
are equally important.
Limitations of AUC-ROC
While AUC-ROC is a powerful metric, it has its limitations:
• Class Imbalance: AUC-ROC may be overly optimistic with imbalanced
datasets, as it averages over all possible thresholds.
• Not Sensitive to Threshold Selection: The ROC curve does not reflect the
impact of threshold selection on the number of false positives.
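# AUC-ROC
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
# Predict probabilities and keep the positive-class column
rf_probs = rf_model.predict_proba(X_test_b)[:, 1]
# Baseline: the same score for every sample (assumed no-skill reference)
baseline_probs = [0] * len(y_test_b)  # y_test_b: binary test labels (assumed name)
# Compute the AUC scores
baseline_auc = roc_auc_score(y_test_b, baseline_probs)
rf_auc = roc_auc_score(y_test_b, rf_probs)
# Display the AUC scores for the baseline and Random Forest classifiers
print('Baseline: ROC AUC=%.4f' % (baseline_auc))
print('Random Forest: ROC AUC=%.4f' % (rf_auc))
# Output
Baseline: ROC AUC=0.5000
Random Forest: ROC AUC=1.0000
# Compute the ROC curves
baseline_fpr, baseline_tpr, _ = roc_curve(y_test_b, baseline_probs)
rf_fpr, rf_tpr, _ = roc_curve(y_test_b, rf_probs)
# Plot the ROC curves
plt.figure(figsize=(8, 6))
plt.plot(baseline_fpr, baseline_tpr, linestyle='--', label='Baseline',
         color='gray')
plt.plot(rf_fpr, rf_tpr, marker='.', lw=2, label='Random Forest',
         color='purple')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) - Binary Classification')
plt.legend()
plt.show()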
• The ROC curve is calculated for both the baseline classifier and the Random
Forest model using the roc_curve function.
• Plot ROC Curves (Figure 8.2): The ROC curves are plotted with matplotlib.
pyplot. The baseline classifier’s curve is a dashed line, while the Random
Forest’s curve is marked with points. The x-axis represents the false positive
rate, and the y-axis represents the true positive rate.
Figure 8.2: ROC curve comparison between baseline classifier and Random Forest model.
Log Loss
Log Loss is another metric in classification models that measures the accuracy
by considering the predicted probabilities. It focuses on the reliability of the
predictions by penalizing false classifications based on the predicted probabilities.
Definition: Log Loss is the negative average of the logarithm of corrected predicted
probabilities for each instance in the dataset. It emphasizes the probability estimates
of the true class for each observation.
Formula:
\[ \text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right) \]
where, N is the number of samples, yi is the actual label, and pi is the predicted
probability for the ith sample.
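The formula can be checked numerically against Scikit-learn's log_loss. The labels and predicted probabilities below are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical binary labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.1, 0.8, 0.3, 0.4])

# Direct implementation of the Log Loss formula
manual = -np.mean(
    y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Library implementation for comparison
library = log_loss(y_true, p_pred)

print(f"Manual Log Loss:  {manual:.4f}")
print(f"sklearn Log Loss: {library:.4f}")
```

The fourth sample, a positive case predicted with probability 0.3, contributes the largest penalty, which shows how Log Loss punishes confident mistakes more than cautious ones.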
# Log Loss
from sklearn.metrics import log_loss
# Output
Log Loss: 0.0668
Kolmogorov-Smirnov (K-S)
The Kolmogorov-Smirnov (K-S) statistic is a valuable tool in assessing the predictive
power of classification models, particularly in binary classification scenarios.
The K-S statistic measures the degree of separation between the distributions
of positive and negative cases. It evaluates how well the model distinguishes
between these two classes.
# Kolmogorov-Smirnov
import numpy as np
# Output
Maximum K-S Value: 0.9996 at threshold 0.5800
The provided Python code executes a sequence of steps to compute the K-S
statistic for a binary classification task using the RF algorithm.
• The code converts the string labels in our dataset to binary format, with
‘inactive’ as the positive class (represented by 1) and the other class as the
negative class (represented by 0).
• Splits the dataset into training and test subsets.
• Trains an RF classifier on the training subset.
• Predicts probabilities for the test subset.
• Calculates the ROC, which provides the TPR and FPR at various threshold
levels.
• Calculates the K-S statistic as the maximum difference between the TPR and
FPR; the threshold at which this maximum occurs is the one that best
separates the two classes.
• Finally, it plots the cumulative distributions of the classes to visualize the
model’s discriminatory power.
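The steps listed above can be sketched as follows. This is not the chapter's original listing: the data are generated with make_classification as a stand-in for the binary activity dataset, and all dataset and model settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data as a stand-in for the active/inactive labels
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Train the classifier and predict positive-class probabilities
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
probs = rf.predict_proba(X_test)[:, 1]

# K-S statistic: the largest gap between TPR and FPR over all thresholds
fpr, tpr, thresholds = roc_curve(y_test, probs)
ks_values = tpr - fpr
best = np.argmax(ks_values)
ks_stat = ks_values[best]
ks_threshold = thresholds[best]

print(f"Maximum K-S Value: {ks_stat:.4f} at threshold {ks_threshold:.4f}")
```

Because TPR and FPR are the cumulative rates of positives and negatives scored above a threshold, their maximum gap measures how far apart the two score distributions are.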
For this purpose, we will generate a dataset using the make_regression() function.
This approach allows us to tailor the dataset’s characteristics to effectively
illustrate various regression evaluation metrics.
# Make predictions
y_pred = model.predict(X_test)
Key Points:
• n_samples and n_features determine the size and dimensionality of the
dataset.
• noise_level adds a specified amount of noise to the output, simulating real-
world data imperfections.
• Splitting the dataset using train_test_split().
• Applying linear regression to get the y_pred results.
This synthetic dataset provides a controlled environment to understand how
different metrics respond to various regression outcomes and model behaviors.
For our examples, we first apply the linear regression model and predict the
results.
In regression analysis, accuracy is not defined the same way as it is for
classification tasks. Unlike classification, where accuracy refers to the
percentage of correctly predicted instances, regression tasks involve predicting
continuous outcomes, making the concept of exact accuracy less straightforward.
Instead, we use metrics like MAE, MSE, and RMSE to evaluate the model’s
performance.
\[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \]
where N is the number of samples, y_i is the actual value of the ith sample, and
ŷ_i is the predicted value.
Interpretation:
• Scale-dependent: MAE is scale-dependent and should be used to compare
models on the same dataset. Its value indicates how close the predictions are
to the actual values, on average.
• Robust to Outliers: MAE is less sensitive to outliers than MSE or RMSE.
Each error contributes in linear proportion to its size, so a few large errors
do not dominate the metric.
# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.4f}")
# Output
Mean Absolute Error (MAE): 1.6051
RMSE
RMSE is the square root of the MSE. It is a measure of the average error and
is widely used in regression analysis. Compared to MAE, RMSE tends to give
higher weights to larger errors, punishing models with larger deviations more.
Formula:
\[ \text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \]
# Output
Mean Squared Error (MSE): 4.1272
Root Mean Squared Error (RMSE): 2.0316
In this example, we calculate both MSE and RMSE on our synthetic dataset.
These metrics provide different perspectives on the model’s error magnitude, with
RMSE often being more representative due to its scale alignment with the target
variable.
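A small hand-checkable example shows how the three metrics react to a single large error. The values are hypothetical; the second set of predictions differs from the first only in its last entry, which is off by 10.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_small = np.array([2.0, 5.0, 4.0, 7.0])     # absolute errors: 1, 0, 2, 0
y_pred_outlier = np.array([2.0, 5.0, 4.0, 17.0])  # absolute errors: 1, 0, 2, 10


def regression_errors(y_true, y_pred):
    """Return MAE, MSE and RMSE for one set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    return mae, mse, rmse


small = regression_errors(y_true, y_pred_small)
outlier = regression_errors(y_true, y_pred_outlier)

for name, (mae, mse, rmse) in (("small errors", small),
                               ("one outlier", outlier)):
    print(f"{name:>12}: MAE = {mae:.4f}, MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

The single large error raises MAE from 0.75 to 3.25 but raises MSE from 1.25 to 26.25, which illustrates why the squared metrics weigh large deviations more heavily.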
# RMSLE
import numpy as np
from sklearn.metrics import mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor
# RMSLE requires non-negative targets, so take absolute values
y_train = np.abs(y_train).to_numpy().ravel()
y_test = np.abs(y_test).to_numpy().ravel()
# Fit a Random Forest regressor (hyperparameters here are assumed)
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate RMSLE
rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
print(f"Root Mean Squared Logarithmic Error (RMSLE): {rmsle:.4f}")
# Output
Root Mean Squared Logarithmic Error (RMSLE): 0.2256
Adjusted R-Squared
• While R² is a useful metric, it has limitations: in particular, it tends to
increase as more predictors are added to a model, regardless of their usefulness.
Adjusted R-Squared addresses this issue by penalizing the addition of
# Assuming df_X and df_y are features and target of our synthetic dataset
from sklearn.metrics import r2_score
# Make predictions
y_pred = model.predict(X_test)
# Calculate R-Squared
r_squared = r2_score(y_test, y_pred)
print(f"R-Squared: {r_squared:.4f}")
# Calculate Adjusted R-Squared
# n: number of samples, k: number of predictors (taken here from the test set, an assumption)
n, k = X_test.shape
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(f"Adjusted R-Squared: {adjusted_r_squared:.4f}")
# Output
R-Squared: 0.9974
Adjusted R-Squared: 0.9973
In this example, both R-Squared and Adjusted R-Squared are calculated for a
linear regression model. These metrics provide insights into how well the model
is capturing the variance in the data, with the adjusted version offering a more
nuanced view that accounts for the number of predictors used.
The results obtained for R-Squared and Adjusted R-Squared are 0.9974 and
0.9973, respectively. These values are exceptionally high, indicating that our
regression model fits the data very well. These results are indicative of a strong
predictive performance by our regression model. However, it is always beneficial
to complement these metrics with other forms of validation, such as cross-
validation, to confirm the model’s effectiveness across different subsets of the
data.
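The penalty applied by Adjusted R-Squared can be made concrete with a small helper. The function name and the input values (an R-Squared of 0.90 on 100 samples) are hypothetical and serve only to illustrate the formula used in the code above.

```python
def adjusted_r2(r_squared, n_samples, n_predictors):
    """Adjusted R-Squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (
        n_samples - n_predictors - 1)


# Same R-Squared and sample size, but a different number of predictors
few = adjusted_r2(0.90, n_samples=100, n_predictors=5)
many = adjusted_r2(0.90, n_samples=100, n_predictors=20)

print(f"Adjusted R-Squared with  5 predictors: {few:.4f}")
print(f"Adjusted R-Squared with 20 predictors: {many:.4f}")
```

For the same raw R-Squared, the model that needs more predictors receives the lower adjusted value, so extra predictors must earn their place by explaining additional variance.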
Cross-Validation
Given the limitations of the Holdout Method, Cross-Validation emerges as a more
robust technique for estimating how well a model generalizes.
Cross-Validation Explained
Cross-Validation is a resampling procedure used to evaluate machine learning
models on a limited data sample. The data is divided into ‘K’ folds, and the model
is trained and validated ‘K’ times, each time using a different fold as the test set
while training on the remaining folds. This process helps in mitigating the issues
associated with the Holdout Method.
• Comprehensive Data Utilization: Each data point gets to be in a test set
exactly once, and in a training set ‘K–1’ times. This approach is beneficial
when dealing with limited datasets.
• Reduced Bias: Cross-Validation reduces the bias associated with the random
sampling of the Holdout Method.
Types of Cross-Validation
• K-Fold Cross-Validation: This is the most common type of cross-validation.
The data is divided into ‘K’ folds, and the model is trained and tested ‘K’
times, using a different fold as the test set each time.
• Leave-One-Out Cross-Validation (LOOCV): In this type of cross-
validation, ‘K’ equals the number of data points in the dataset. This means
that for each iteration, the model is trained on all data points except one,
which is used as the test set. While LOOCV is exhaustive and gives a nearly
unbiased estimate, it is computationally expensive and its estimates can have
high variance.
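The splitting behavior of both schemes can be verified directly. The sketch below uses 20 hypothetical samples and counts how often each one lands in a training or test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(-1, 1)  # 20 hypothetical samples
K = 5

# Count how often each sample appears in the training and test folds
test_counts = np.zeros(len(X), dtype=int)
train_counts = np.zeros(len(X), dtype=int)
kfold = KFold(n_splits=K, shuffle=True, random_state=42)
for train_idx, test_idx in kfold.split(X):
    train_counts[train_idx] += 1
    test_counts[test_idx] += 1

print("Times in a test set:    ", test_counts)
print("Times in a training set:", train_counts)

# LOOCV uses as many splits as there are samples
loo_splits = LeaveOneOut().get_n_splits(X)
print(f"LOOCV splits: {loo_splits}")
```

Every sample is tested exactly once and used for training K-1 times, while LOOCV requires one model fit per sample, which is what makes it costly on large datasets.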
The above code initializes a dictionary named models, where key-value pairs
represent different algorithms such as Random Forests, LightGBM, Decision
Trees, XGBoost, SVM Radial, and MLP. Each key is a string denoting the name
of the model, and the corresponding value is the model object itself from scikit-
learn or other libraries like XGBoost and LightGBM.
A StratifiedKFold object is created with 5 splits. shuffle = True shuffles the data
before splitting it into folds, adding randomness which helps in reducing bias.
In the loop, each model is evaluated using Stratified K-Fold Cross-Validation.
The cross_val_score computes the model’s accuracy for each fold of the cross-
validation process and returns a list of scores. The np.mean(scores) calculates
the average accuracy across all folds for each model, and the results are stored in
the model_scores dictionary. The n_jobs = –1 parameter enables the function to
use all available CPU cores for parallel computation, speeding up the evaluation
process.
After evaluating all models, the one with the highest average accuracy is
determined using the max function with model_scores.get as the key function.
This model is identified as the best model for the given task.
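The comparison procedure described above can be sketched in a self-contained form. This is an illustration under assumptions, not the original listing: the data are synthetic (make_classification), only Scikit-learn models are included so that no extra libraries are needed, and n_jobs is left at its default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data as a stand-in for the sheep behavior dataset
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

# Dictionary of candidate models: name -> estimator
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM Radial": SVC(kernel="rbf"),
}

# Stratified folds keep the class proportions similar in every split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")
    model_scores[name] = np.mean(scores)
    print(f"{name}: Average Accuracy = {model_scores[name]:.4f}")

# Select the model name with the highest mean accuracy
best_model_name = max(model_scores, key=model_scores.get)
print(f"Best model: {best_model_name}")
```

Reusing the same StratifiedKFold object for every model means all candidates are scored on identical folds, which keeps the comparison fair.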
The stratified cross-validation results are as follows:
• Random Forest: Average Accuracy = 0.9878
• LightGBM: Average Accuracy = 0.9891
• Decision Tree: Average Accuracy = 0.9663
• XGBoost: Average Accuracy = 0.9918
• SVM Radial: Average Accuracy = 0.7568
• MLP: Average Accuracy = 0.8979.
Based on these results, the model that stands out with the highest average
accuracy is XGBoost, with an accuracy of 0.9918. This performance indicates
that XGBoost is highly effective at classifying the different behaviors in the sheep
dataset. Given the high accuracy of the XGBoost model in our cross-validation
process, the next steps would involve fine-tuning this model, followed by
a thorough evaluation on a separate test set to confirm its generalizability and
effectiveness in practical scenarios.
Key Features:
• Exhaustive Search: Tests every combination of hyperparameters in the grid.
• Precision: Can identify the optimal parameters accurately if they are included
in the grid.
• Resource-Intensive: This method can be computationally expensive,
especially for large datasets and complex models.
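A minimal Grid Search can be sketched as follows. To keep the example free of extra dependencies, Scikit-learn's GradientBoostingClassifier is swapped in for XGBoost (it accepts the same three parameter names), the data are synthetic, and the grid is deliberately small; all of these are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data with assumed settings
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# 2 x 2 x 2 = 8 candidate combinations
param_grid = {
    "learning_rate": [0.1, 0.2],
    "max_depth": [3, 5],
    "n_estimators": [50, 100],
}

grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
grid_search.fit(X, y)

n_candidates = len(grid_search.cv_results_["params"])
print(f"Candidates evaluated: {n_candidates}")
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
```

The number of fits grows as the product of the grid sizes times the number of folds, which is the source of the computational cost noted above.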
# Output
Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Parameters: {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 300}
Best Score: 0.9902958046871252
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.savefig('cm.png', dpi=300)
plt.show()
# Output
Classification Report:
              precision    recall  f1-score   support
     grazing       1.00      1.00      1.00      4830
     resting       0.99      0.99      0.99      8272
  scratching       0.99      0.90      0.94       165
    standing       0.99      0.99      0.99      4254
     walking       1.00      0.99      1.00      1867
    accuracy                           0.99     19388
   macro avg       0.99      0.98      0.98     19388
weighted avg       0.99      0.99      0.99     19388
[Figure 8.4: Confusion matrix heatmap for the tuned XGBoost model. Rows: Actual; columns: Predicted (Grazing, Resting, Scratching, Standing, Walking).]
The code evaluates the performance of the XGBoost model optimized via
Grid Search, on a test dataset. It involves generating a classification report
and a confusion matrix (Figure 8.4) to understand the model’s effectiveness in
classification.
Breakdown of the code and its purpose:
• Retrieve and Fit Best Model: best_model = grid_search.best_estimator_
retrieves the XGBoost model with the best hyperparameters identified during
Grid Search. best_model.fit(X_train, y_train) fits this model to the training
data.
• Prediction: y_pred = best_model.predict(X_test) uses the fitted model to
make predictions on the test dataset.
• Classification Report: This part of the code generates a classification report,
which includes key metrics like precision, recall, and F1-score for each class
in the test dataset.
• target_names = label_encoder.classes_ translates the encoded class labels
back to their original names for better interpretability in the report.
• Confusion Matrix: confusion_matrix(y_test, y_pred) creates a confusion matrix
from the actual and predicted labels.
• Visualization: The confusion matrix is visualized as a heatmap using seaborn
and matplotlib.
• Class Labels: The matrix includes class names on both axes for clarity,
facilitating a quick understanding of how predictions compare against actual
values.
• Final Output: Finally, the classification report is printed, providing a detailed
account of the model’s performance metrics for each class.
# RandomizedSearchCV
# Output
Best Parameters: {'max_depth': 9, 'max_features': 'auto', 'n_estimators': 370}
Best Score: 0.9508820574195646
298 | Machine Learning in Farm Animal Behavior using Python
Successive Halving
Successive Halving operates on the principle of iteratively allocating resources to
a subset of configurations based on their performance.
Process:
• Initial Screening: Begins with a large number of randomly selected
hyperparameter configurations and a minimal resource budget.
• Performance Evaluation: Each configuration is evaluated using the allocated
budget.
• Pruning: The configurations are ranked by performance, and only the top half
(or another predefined fraction) is retained for the next round.
• Resource Doubling: The budget for each remaining configuration is doubled,
and steps 2-3 are repeated until a satisfactory configuration is found or the
resource limit is reached.
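The steps above can be sketched in plain Python. This is only a toy illustration of the idea, not scikit-learn's implementation: the evaluate function, the candidate values of max_depth, and the budget numbers are all hypothetical and were chosen so that the example runs quickly.

```python
import random


def evaluate(config, budget):
    """Hypothetical scoring function: the score peaks at max_depth = 9."""
    quality = 1.0 - abs(config["max_depth"] - 9) / 20.0
    noise = random.uniform(-0.05, 0.05) / budget  # more budget -> less noisy estimate
    return quality + noise


def successive_halving(configs, min_budget=1, max_budget=16, keep_fraction=0.5):
    """Score all survivors, keep the best fraction, double the budget, repeat."""
    budget = min_budget
    survivors = list(configs)
    history = []
    while len(survivors) > 1 and budget <= max_budget:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        n_keep = max(1, int(len(scored) * keep_fraction))
        history.append((budget, len(survivors), n_keep))
        survivors = scored[:n_keep]
        budget *= 2
    return survivors[0], history


random.seed(0)
candidates = [{"max_depth": d} for d in range(2, 18)]  # 16 candidate configurations
best, history = successive_halving(candidates)
for budget, n_in, n_out in history:
    print(f"budget={budget:>2}: evaluated {n_in} configurations, kept {n_out}")
print("Selected configuration:", best)
```

In a real search, the budget would typically be the number of training samples or the number of estimators, and the score would come from cross-validation rather than from a synthetic function.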
Hyperband
Hyperband improves upon Successive Halving by dynamically adjusting the
number of configurations and the allocated resources through multiple brackets,
each with a different starting budget. It effectively creates a balance between
Evaluation, Model Selection and Hyperparameter Tuning | 299
# HalvingGridSearchCV
# Output
Best Parameters: {'max_depth': 15, 'n_estimators': 300}
Best Score: 0.9768948804996072
Test Set Accuracy: 0.9802
Why Optuna?
• Efficient Search: Optuna can find better hyperparameters with fewer trials
compared to grid or random search.
Summary
In this chapter, we have presented model evaluation, model selection, and
hyperparameter tuning in machine learning. We started with evaluation metrics,
describing accuracy, precision, recall, F1-score and other metrics for classification,
and MSE, RMSE, and MAE for regression, to provide a multifaceted understanding
of model performance. We then shifted our focus to model selection, emphasizing
the importance of cross-validation methods like K-Fold and Stratified K-Fold in
ensuring model robustness and reliability across diverse data sets.
In the last part of the chapter, we focused on hyperparameter tuning, comparing
traditional approaches like Grid Search and Randomized Search with advanced
techniques such as Halving methods and Bayesian Optimization via Optuna.
These methods offer significant improvements in efficiency and effectiveness,
particularly in handling complex models with large parameter spaces. Practical
Python examples were integrated to demonstrate these concepts in action,
equipping readers with the skills to evaluate, select, and fine-tune machine
learning models efficiently and effectively.
CHAPTER
9
Deep Learning Algorithms for
Animal Activity Recognition
In ML, the focus shifts from programming specific rules to designing algorithms
that can learn these rules from data. As we discussed in the previous chapters,
ML algorithms, from simple linear regressions to complex ensemble methods,
provide the flexibility and adaptability to learn from data, making them well-
suited for a wide range of applications.
like random forests and gradient boosting machines. These models often
require feature engineering, where domain knowledge is used to create input
features that are fed into the model. The complexity of ML models varies, but
they are generally less complex than deep learning models.
• Deep Learning Models: DL models, primarily based on neural networks,
are characterized by their depth, which refers to the number of layers in the
network. These layers enable DL models to learn increasingly abstract features
from raw input data, eliminating the need for manual feature engineering.
DL models are fundamentally more complex, capable of handling vast and
intricate architectures, making them suitable for more complex tasks.
Data Requirements
• ML Data Requirements: Traditional ML models can often achieve high
performance with relatively smaller datasets. They are skilled at capturing
relationships in data where the underlying patterns can be observed with
fewer examples. This characteristic makes ML models useful in scenarios
where data collection is challenging or expensive.
• DL Data Requirements: DL models perform best with large datasets. Their ability to
learn and generalize improves significantly with the amount of data fed into
them. This data-hungry nature of DL allows them to excel in complex tasks
like image and speech recognition, but it also poses challenges in scenarios
where data is scarce or expensive to acquire.
Computational Resources
• Resources for ML: In general, ML models are less computationally intensive
compared to DL models. They can often be trained on standard computing
resources, making them more accessible for a wider range of users and
applications.
• Resources for DL: DL models, particularly large ones, require significant
computational power. They often need specialized hardware such as
GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units).
This requirement for high computational resources makes DL models more
expensive and less accessible for casual users or small-scale applications.
In this chapter, our primary focus will be on understanding the key types of neural
networks and their roles in deep learning. It is important to note, however, that
the application of deep learning for wearable sensor data, especially in the context
of animal activity recognition, is not as extensively explored as in other domains
like computer vision. The deep learning techniques most commonly applied to
wearable sensor data include:
• Multilayer Perceptron Neural Networks (MLPs): Often referred to as fully
connected or dense networks, these represent the simplest form of neural
networks and consist of layers where each neuron is connected to all neurons
in the preceding and subsequent layers. Their fundamental structure is crucial
for understanding the basic workings of neural network architecture.
• Convolutional Neural Networks (CNNs): Although widely recognized for
their role in processing image data, CNNs can also be adapted for analyzing
wearable sensor data, particularly when there is a need to capture spatial
hierarchies or patterns in the data.
• Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM),
and Gated Recurrent Units (GRUs): These networks are designed to handle
sequential data, making them suitable for wearable sensor data that often has
a time-series nature. RNNs and LSTMs can capture temporal dynamics,
which is valuable for analyzing sequences of sensor readings over time.
GRUs, like LSTMs, are pivotal for their efficiency in sequence modelling
and their suitability for time-series data. Although GRUs share similarities
with LSTMs, their simpler structure can offer computational advantages in
certain applications.
While this chapter will focus on these specific techniques due to their relevance
to wearable sensor data and their applications in animal activity recognition, it is
important to acknowledge that the field of deep learning is vast and diverse. There
are numerous other techniques and models within deep learning, such as Deep
Transfer Learning, Autoencoders, Generative Adversarial Networks (GANs), and
more advanced models for Natural Language Processing (NLP). However, these
are beyond the scope of this book. Readers interested in these advanced topics are
encouraged to seek specialized resources for further study.
Our objective is to provide a comprehensive understanding of the neural network
architectures most pertinent to our focus on wearable sensor data for animal
activity recognition, guiding readers in applying these techniques effectively
within this specific context.
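[Figure: a single neuron receiving inputs x1, x2, ..., xn through weights w1, w2, ..., wn, together with a bias, and producing the output y]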
A neuron receives one or more inputs, processes these inputs, and generates
an output. The inputs to a neuron can be features from a dataset or outputs
from preceding neurons in the network. Each input xi (independent variable) is
associated with a weight wi, signifying the strength or importance of that input.
The neuron also includes a bias term b, which adjusts the output (y) independently
of the inputs. The output (y) can be a continuous, binary, or categorical variable.
Neuron Computation
The output of a neuron is calculated by first computing the weighted sum of its
inputs:
$$\text{Weighted Sum} = \sum_{i=1}^{n} w_i x_i + b$$
where, n is the number of inputs, wi is the weight associated with the ith input (xi),
and b is the bias.
After calculating the weighted sum, the neuron applies an activation function
(ϕ) to this sum. A nonlinear activation function introduces nonlinearity into the
model, enabling the network to learn complex patterns.
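The computation of a single neuron can be sketched in a few lines of Python. The input values, weights, and bias below are hypothetical, and the sigmoid is used only as one possible choice of activation function.

```python
import math


def sigmoid(z):
    """Sigmoid activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hypothetical, already-scaled features for one observation
x = [0.5, -1.2, 0.3]
w = [0.8, 0.1, -0.4]
b = 0.2

z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum before activation
y = neuron_output(x, w, b, sigmoid)
print(f"weighted sum = {z:.2f}, output = {y:.3f}")
```

Swapping sigmoid for another function changes only the last step; the weighted sum is computed in the same way for every activation.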
Note that the input variables in a dataset are not isolated entities;
together they represent a single observation. For example, temperature,
age of the animal, and food intake are distinct attributes, but they
describe one single observation (the target). When these
values pass through the neuron, they are transformed
into output values on the other side of the network.
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Here is a breakdown of its characteristics and usage:
– Range: The tanh function outputs values in the range of –1 to 1. This is
one of its primary differences from the sigmoid function, which outputs
values between 0 and 1.
– Zero-centered: Since its output ranges from –1 to 1, the tanh function is
zero-centered. This means that its outputs have a mean close to 0, which
can help with the convergence of the neural network during training, as it
avoids bias in the gradients.
– Non-linearity: Just like the sigmoid function, tanh is also non-linear. This
allows neural networks using tanh to learn complex data patterns and
solve classification problems that are not linearly separable.
– Usage: tanh is commonly used in hidden layers of a neural network. Its
zero-centered nature makes it more efficient than the sigmoid function.
– Vanishing Gradient Problem: Similar to the sigmoid function, tanh suffers
from the vanishing gradient problem. When the inputs are very high or
very low, the gradient of the tanh function becomes very small. This small
gradient can slow down the training process significantly, as it causes very
small updates to the weights during backpropagation.
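The range and the vanishing gradient of tanh can be checked numerically. The sketch below evaluates the function from its definition and uses the standard derivative 1 − tanh(x)²; the sample points are arbitrary.

```python
import math


def tanh_from_definition(x):
    """tanh computed directly from the exponential definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def tanh_gradient(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"x={x:+.1f}  tanh={tanh_from_definition(x):+.4f}  "
          f"gradient={tanh_gradient(x):.6f}")
```

The gradient is largest at x = 0 and shrinks toward zero as the input moves far from zero in either direction, which is the behavior behind the vanishing gradient problem described above.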
$$\mathrm{LeakyReLU}(x) = \max(ax, x) = \begin{cases} x, & \text{if } x > 0 \\ ax, & \text{if } x \le 0 \end{cases}$$
Key Characteristics
– Non-zero Gradient for Negative Inputs: Unlike ReLU, which outputs
zero for all negative inputs, Leaky ReLU allows a small, non-zero output.
This ensures that neurons continue to learn and adapt during the training
process, even if they are activated infrequently.
– Reduced Risk of Neurons Dying: By maintaining a small gradient for
negative inputs, Leaky ReLU reduces the risk of neurons becoming
inactive during training. This makes it particularly useful in deeper
networks where the dying ReLU problem is more pronounced.
– Computational Efficiency: Similar to ReLU, leaky ReLU is computationally
efficient. The additional operation required to implement the “leak” is
minimal and does not significantly impact the overall computational cost.
Usage and Applications
Leaky ReLU is useful in training models where the data is not well normalized, and the
inputs to neurons in the network have a significant negative component.
Limitations
While Leaky ReLU addresses some issues of ReLU, it also has limitations:
– Tuning the a Parameter: The effectiveness of Leaky ReLU can depend on
the choice of a. This parameter may need tuning specific to the application
and dataset.
– Not a Universal Solution: Leaky ReLU does not always outperform ReLU
and needs to be tested empirically for each specific application.
Here, a is a small constant (typically around 0.01) that gives the function a slight
slope for negative values, ensuring that the gradient is never entirely zero.
This modification addresses the dying ReLU problem by allowing the flow
of gradients even for negative input values.
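A short sketch makes the difference between ReLU and Leaky ReLU concrete. The default a = 0.01 follows the typical value mentioned above, and the function names are our own.

```python
def relu(x):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)


def leaky_relu(x, a=0.01):
    """Leaky ReLU: x for positive inputs, a * x otherwise (assumes 0 < a < 1)."""
    return x if x > 0 else a * x


def leaky_relu_gradient(x, a=0.01):
    """Slope of Leaky ReLU: 1 for positive inputs, a otherwise."""
    return 1.0 if x > 0 else a


for x in [-3.0, -0.5, 0.0, 2.0]:
    print(f"x={x:+.1f}  relu={relu(x):.3f}  leaky_relu={leaky_relu(x):+.3f}  "
          f"gradient={leaky_relu_gradient(x)}")
```

For negative inputs ReLU returns zero while Leaky ReLU keeps a small negative output with slope a, so some gradient still flows during backpropagation.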
• Exponential Linear Unit (ELU)
The ELU activation function is another variant that aims to improve upon the
ReLU function. It was introduced to address some of the limitations of ReLU
and Leaky ReLU.
Formula:
$$\mathrm{ELU}(x) = \begin{cases} x, & \text{if } x > 0 \\ a\,(e^{x} - 1), & \text{if } x \le 0 \end{cases}$$
For positive values of x, ELU behaves just like ReLU, but for negative values, it
behaves differently. Instead of a constant slope (as in Leaky ReLU),
ELU follows an exponential curve, which allows it to push the mean activations closer
to zero. This property helps speed up learning by bringing the
average output of the neurons closer to zero, similar to the tanh function.
A key advantage of ELU over ReLU is its non-zero gradient for negative values,
which helps alleviate the dying ReLU problem. This characteristic ensures that all
neurons in the network continue to learn and adjust during training.
The parameter a in the ELU formula is a constant that defines the value to which
an ELU saturates for negative net inputs. It is usually set to 1 but can be tuned
based on specific requirements of the neural network.
However, ELU can be computationally more expensive than ReLU and its other
variants due to the exponential function, particularly during the backward pass in
training.
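The behavior of ELU for negative inputs, including its saturation at −a, can be illustrated as follows. This is a minimal sketch in which a = 1 is the usual default mentioned above and the sample inputs are arbitrary.

```python
import math


def elu(x, a=1.0):
    """ELU: x for positive inputs, a * (exp(x) - 1) otherwise."""
    return x if x > 0 else a * (math.exp(x) - 1.0)


def elu_gradient(x, a=1.0):
    """Slope of ELU: 1 for positive inputs, a * exp(x) otherwise."""
    return 1.0 if x > 0 else a * math.exp(x)


for x in [-10.0, -2.0, -0.5, 0.0, 1.5]:
    print(f"x={x:+5.1f}  elu={elu(x):+.4f}  gradient={elu_gradient(x):.4f}")
```

For strongly negative inputs the output approaches −a, while the gradient stays positive, so the neuron can still be updated.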
• Softmax Activation Function
The Softmax activation function stands out among activation functions, especially
in the context of classification problems. It is typically used in the output
layer of a neural network to transform raw output scores, often called logits,
into probabilities, which are easier to interpret.
The Softmax function is defined as:
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
where,
– zi is the raw score (logit) for the ith class.
– K is the total number of classes.
The Softmax function exponentiates each logit and then normalizes these
values by dividing each by the sum of all the exponentiated logits. This
ensures that the output values (probabilities) are between 0 and 1 and sum up
to 1.
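The exponentiate-and-normalize procedure can be written directly in Python. The logits below are hypothetical scores for five behavior classes; subtracting the maximum logit before exponentiating is a common numerical safeguard that does not change the result.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for five behavior classes
classes = ["grazing", "resting", "scratching", "standing", "walking"]
logits = [2.0, 1.0, -1.0, 0.5, 0.1]
probs = softmax(logits)

for name, p in zip(classes, probs):
    print(f"{name:<10} {p:.3f}")
print("sum =", round(sum(probs), 6))
```

The class with the largest logit also receives the largest probability, so the ranking of the classes is preserved.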
Key Characteristics
– Conversion to Probabilities: The primary role of the Softmax function
is to convert logits into probabilities, which are more interpretable and
useful for classification tasks.
– Multi-Class Classification: Softmax is ideal for multi-class classification
• Output Layer: This produces the final output, representing the network’s
prediction or decision.
The process of calculating the output of a neural network is known as forward
propagation. It involves the following steps:
Combining Inputs with Weights: Each neuron receives inputs, multiplies them
by their weights, and adds a bias term. Mathematically, this is represented as:
$$z = \sum_{i=1}^{n} (w_i \cdot x_i) + b$$
where, wi is the weight, xi is the input, b is the bias, and n is the number of inputs
to the neuron.
Applying Activation Function: The combined value z is then passed through an
activation function ϕ:
a = ϕ(z)
where, a is the output of the neuron.
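The two steps of forward propagation can be chained across layers. The sketch below passes one observation through a tiny network with hypothetical weights: three inputs, two hidden neurons with ReLU, and one output neuron with a sigmoid.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` belongs to one neuron."""
    outputs = []
    for neuron_weights, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + b  # combine inputs
        outputs.append(activation(z))  # apply the activation function
    return outputs


# Hypothetical network: 3 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid)
x = [1.0, 2.0, -1.0]
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.8]]
hidden_b = [0.1, -0.1]
output_w = [[1.0, -1.5]]
output_b = [0.05]

hidden = dense_layer(x, hidden_w, hidden_b, relu)
prediction = dense_layer(hidden, output_w, output_b, sigmoid)
print("hidden activations:", hidden)
print("network output:", prediction)
```

The outputs of one layer become the inputs of the next, which is all that forward propagation involves; frameworks such as PyTorch perform the same arithmetic with matrix operations.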
Mechanism of Backpropagation
As indicated in Chapter 7, backpropagation works by calculating the gradient
(rate of change) of the loss function (error measure) with respect to each weight
in the neural network. This is done in a backward manner, starting from the output
layer, and moving towards the input layer.
(or derivative) of the cost function with respect to each parameter indicates the
direction in which the parameter should be adjusted to decrease the cost.
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
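To see how the gradient of a cost function guides a parameter update, consider a minimal sketch with a one-parameter model ŷ = w·x and the MSE as the cost. The data, the starting weight, and the learning rate are hypothetical, and the gradient is derived by hand for this simple model rather than by backpropagation through a network.

```python
def mse(y_true, y_pred):
    """Mean squared error over N observations."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


def mse_gradient_wrt_weight(x, y_true, w):
    """Gradient of the MSE with respect to w for the model y_hat = w * x."""
    n = len(x)
    return sum(-2.0 * xi * (yi - w * xi) for xi, yi in zip(x, y_true)) / n


# Hypothetical data generated by y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

w = 0.0
learning_rate = 0.05
for step in range(5):
    loss = mse(y, [w * xi for xi in x])
    grad = mse_gradient_wrt_weight(x, y, w)
    print(f"step {step}: w={w:.4f}  loss={loss:.4f}  gradient={grad:.4f}")
    w -= learning_rate * grad  # move against the gradient to reduce the cost
```

The gradient is negative while w is too small, so each update increases w and the loss falls toward zero as w approaches 2.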
[Figure: two network diagrams, a single-layer network and a network with an additional layer, in which inputs x1, x2, ..., xp pass through summation (Σ) and activation (f) units to produce outputs y1, y2, ..., ys]
Hyperparameters
• Input Layer Shape (in_features):
– Binary Classification: The number of input features corresponds to the
number of variables in your dataset. For instance, in a lameness disease
prediction model this could be the number of animal characteristics like
age, breed, weight, and activity levels.
– Multiclass Classification: The input layer shape is identical to that in
binary classification, as it is determined by the number of features in your
dataset, not the number of classes.
• Hidden Layer(s):
– Both Binary and Multiclass: The number of hidden layers is determined based
on the complexity of the problem. At least one hidden layer is typical, but
more complex problems may require multiple layers. There is no strict upper
limit on the number of layers, but adding too many can lead to overfitting.
• Neurons per Hidden Layer:
– Both Binary and Multiclass: The number of neurons in each hidden layer
is problem-specific and generally ranges from 10 to 512. This parameter
should be tuned based on the complexity of the task and the amount of
available data.
• Output Layer Shape (out_features):
– Binary Classification: The output layer typically has a single neuron,
representing the probability of belonging to one of the two classes.
– Multiclass Classification: The number of neurons in the output layer
equals the number of classes.
• Hidden Layer Activation Function:
– Both Binary and Multiclass: ReLU (Rectified Linear Unit) is a common
choice for the activation function in hidden layers, though other functions
like tanh or Leaky ReLU can also be used.
• Output Activation Function:
– Binary Classification: The sigmoid function is used to output a probability
between 0 and 1, indicating the likelihood of the instance belonging to one
class.
– Multiclass Classification: The softmax function is used to output a
probability distribution across all classes.
• Loss Function:
– Binary Classification: Binary crossentropy loss is suitable as it compares
the predicted probabilities with the actual binary labels.
import torch
import torch.nn as nn
import torch.optim as optim
# Model
model = nn.Sequential(
    nn.Linear(in_features=10, out_features=20),
    nn.ReLU(),
    nn.Linear(20, 10),
    nn.ReLU(),
    nn.Linear(10, 3)  # Assume 3 classes for multiclass classification
    # No activation function here
)
# Optimizer (defined after the model so that model.parameters() exists)
optimizer = optim.Adam(model.parameters(), lr=0.001)
For multiclass classification, the primary differences would be in the output layer
and the loss function:
1. The output layer should have as many neurons as there are classes. For
example, if there are 3 classes, you would have nn.Linear(10, 3).
2. Instead of using nn.Sigmoid(), you would not apply an activation function in
the output layer because nn.CrossEntropyLoss() applies the softmax function
internally.
3. Use nn.CrossEntropyLoss() as the loss function.
import torch
print(torch.__version__)
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
df = pd.read_csv('binary_data.csv')
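X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
# Encode the categorical labels as integers
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# Split into training and testing sets
# (test_size and random_state are assumed values, not taken from the original)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Fit the scaler on the training data and transform both training
# and testing data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)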
– We load our dataset using Pandas and split it into features (X) and labels
(y).
– We use LabelEncoder to encode categorical labels into numerical format,
which is necessary for neural network training.
– We split the dataset into training and testing sets.
– StandardScaler is used to standardize the features.
– We have to convert the NumPy arrays into PyTorch tensors and ensure
they have the correct data types (float for features and long for labels).
• Defining the Neural Network Model
import torch.nn as nn
binary_model = BinaryClassifier().to(device)
binary_model
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(params=binary_model.parameters(),
lr=0.01)
torch.manual_seed(42)
epochs = 5000
# Calculate loss
loss_value = criterion(y_logits, y_train)
# Calculate accuracy
correct_train = (y_pred == y_train).sum().item()
train_accuracy = 100 * correct_train / len(y_train)
loss_value.backward()
optimizer.step()
# Testing Phase
binary_model.eval()
with torch.no_grad():
    test_logits = binary_model(X_test).squeeze()
    test_pred = torch.round(torch.sigmoid(test_logits))
    test_loss = criterion(test_logits, y_test)
binary_model.train()
import numpy as np
with torch.no_grad():
    y_preds = torch.round(torch.sigmoid(binary_model(X_test))).squeeze()
print(f"Accuracy: {accuracy_percentage:.2f}%")
# Output
Accuracy: 99.98%
import torch.nn as nn
class MulticlassClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(MulticlassClassifier, self).__init__()
        # Define the layers of the neural network
        self.fc1 = nn.Linear(input_size, 64)  # First linear layer
        self.fc2 = nn.Linear(64, 128)  # Second linear layer
        self.fc3 = nn.Linear(128, 64)  # Third linear layer
        self.fc4 = nn.Linear(64, num_classes)  # Output layer
        # Activation function
        self.relu = nn.ReLU()
num_features = 82
num_classes = 5
• num_features = 82: This specifies that the input data has 82 features.
• num_classes = 5: This indicates that there are 5 distinct classes that the model
needs to classify.
• MulticlassClassifier(num_features, num_classes): This creates an instance
of the MulticlassClassifier with the specified number of input features and
classes.
• criterion = nn.CrossEntropyLoss(): This line sets up the loss function for
the model. CrossEntropyLoss is a standard loss function for multiclass
classification tasks in PyTorch.
• optimizer = torch.optim.Adam(multiclass_model.parameters(), lr = 0.001):
This line defines the optimizer for the training process. The learning rate (lr)
is set to 0.001, a common starting point that can be adjusted based on the
specific requirements of your training process.
import torch
torch.manual_seed(42)
# Number of epochs
# You can adjust this number
epochs = 1000
# Forward pass
y_log = multiclass_model(X_train)
# Convert logits to prediction labels for accuracy calculation
y_pred = torch.argmax(y_log, dim=1)
# Zero gradients
optimizer.zero_grad()
# Perform backpropagation
loss.backward()
# Optimizer step
optimizer.step()
Since we have already worked with a binary classifier and this process is similar,
we will focus on highlighting the differences that apply to multiclass classification:
• For multiclass classification, the model’s output (y_log) consists of logits
for each class. To get the predicted class labels, torch.argmax is used,
which selects the class with the highest logit value. This is different from
the binary case, where the output is typically a single probability score per
instance.
• The criterion here is CrossEntropyLoss, but in the binary case it was
BCEWithLogitsLoss(). CrossEntropyLoss is appropriate for multiclass
problems as it calculates the loss by applying a softmax to the logits to convert
them into probabilities.
• Similar to the training phase, during the test phase, the model outputs logits
for each class, and torch.argmax is used to determine the predicted class.
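The differences listed above can be illustrated without PyTorch. The sketch below is a plain-Python approximation of what the text describes: a softmax is applied to the logits, the loss is the mean negative log-probability of the correct class, and the predicted class is the index of the largest logit. The logits and targets are hypothetical, and the helper names are our own.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    shift = max(logits)
    log_total = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_total for z in logits]


def cross_entropy_from_logits(batch_logits, targets):
    """Mean negative log-probability assigned to the correct class."""
    losses = [-log_softmax(logits)[t] for logits, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


def argmax(values):
    """Index of the largest value, analogous to taking the argmax over classes."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for two instances and three classes
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
targets = [0, 2]

predictions = [argmax(logits) for logits in batch_logits]
loss = cross_entropy_from_logits(batch_logits, targets)
print("predicted classes:", predictions)
print(f"cross-entropy loss: {loss:.4f}")

# Binary case for comparison: one logit per instance, sigmoid, then rounding
binary_logit = 1.2
binary_prediction = round(1.0 / (1.0 + math.exp(-binary_logit)))
print("binary prediction:", binary_prediction)
```

Because the softmax is part of the loss computation, the model itself only needs to return raw logits, which is why no activation is applied in the output layer.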
The final step in evaluating our MulticlassClassifier involves generating a detailed
classification report, which provides insights beyond mere accuracy.
print(report)
# Output
              precision    recall  f1-score   support
     grazing       1.00      1.00      1.00      3186
     resting       0.97      0.97      0.97      5568
  scratching       0.92      0.90      0.91       116
    standing       0.95      0.94      0.95      2800
     walking       0.99      1.00      0.99      1256
    accuracy                           0.97     12926
   macro avg       0.97      0.96      0.96     12926
weighted avg       0.97      0.97      0.97     12926
• F1-score: torchmetrics.F1Score().
• Confusion matrix: torchmetrics.ConfusionMatrix().
This completes our introduction to neural networks and their application to both
binary and multiclass classification problems using PyTorch. Through these
sections, we have seen how to structure neural network models for different types
of classification tasks, how to train, and finally evaluate these models.
Introduction to CNNs
CNNs (Bengio, 2009; LeCun et al., 2015) are a class of deep neural networks,
primarily used in analyzing visual imagery, and other data-intensive applications
like time-series analysis. CNNs are known for their ability to learn spatial
hierarchies of features automatically and adaptively from input data.
Convolutional Operation: The core of a CNN is the convolutional operation.
This involves sliding a filter or kernel over the input data and computing the
dot product of the filter with the input data at each position. A convolutional
layer consists of several such filters, and each filter generates a feature map.
Mathematically, this operation for a single filter can be expressed as:
$$F(I)_{x,y} = \sum_{i=-k}^{k} \sum_{j=-k}^{k} I_{x+i,\,y+j} \cdot K_{i,j}$$
where, F(I)x,y is the feature map, I is the input data, K is the kernel, and k sets the
kernel size: the indices run from −k to k, so the kernel covers (2k + 1) × (2k + 1) entries.
Role of Convolution: The convolutional layer acts as a feature extractor that
learns the spatial hierarchies in the data. Lower layers might learn basic features
like edges, while deeper layers can learn more complex features like shapes or
specific objects in the case of image data.
Key Components
• Convolutional Layers: These layers perform the convolution operation,
extracting features from the input data.
• Pooling Layers: Following convolutional layers, pooling layers (like max
pooling) are used to reduce the spatial dimensions (width and height) of
the input volume for the next convolutional layer. This reduces the
computational load and memory usage, and also helps make the detection of
features invariant to scale and orientation changes.
• AlexNet (2012):
– Developer: Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.
– Features: A deeper and wider architecture compared to LeNet. It
introduced the use of the ReLU activation function, implemented dropout
layers, and utilized GPUs for training.
– Applications: Large-scale image recognition tasks.
• ZFNet (2013):
– Developer: Matthew Zeiler and Rob Fergus.
– Features: This is built on the architecture of AlexNet but with different
filter sizes and numbers of filters. Additionally, visualization techniques
are introduced for understanding the network.
– Applications: ImageNet classification.
• VGGNet (2014):
– Developer: Karen Simonyan and Andrew Zisserman.
– Features: It is characterized by deeper networks with smaller filters
(3×3), and standardized depth across all convolutional layers. It comes in
multiple configurations like VGG16 and VGG19.
– Applications: Large-scale image recognition.
• ResNet (2015):
– Developer: Kaiming He et al.
– Features: It introduces “skip connections” or “shortcuts” to enable the
training of deeper networks. It is available in multiple configurations like
ResNet-50, ResNet-101, and ResNet-152.
– Applications: Large-scale image recognition; notably won the 1st place
in the ILSVRC 2015.
• GoogLeNet (2014):
– Developer: Christian Szegedy et al. at Google.
– Features: Introduced the Inception module, allowing for more efficient
computation and enabling deeper networks. Evolved through multiple
versions including Inception v1, v2, v3, and v4.
– Applications: Large-scale image recognition; won 1st place in the
ILSVRC 2014.
• MobileNets (2017):
– Developer: Andrew Howard et al.
– Features: Designed for mobile and embedded vision applications. Uses
depthwise separable convolutions to reduce the model size and complexity.
255 255 255 0 0
0 255 255 255 0
0 0 255 255 255
0 0 255 255 0
0 255 255 0 0
This filter is a simple edge detection filter, often used in image processing for
highlighting vertical edges. Let’s break down how it works:
• Positive Weights (1s on the Left Side): The positive values on the left side
of the filter are designed to react strongly to light areas of the image that are
aligned with these weights.
• Negative Weights (–1s on the Right Side): The negative values on the right
side of the filter react to dark areas of the image.
• Zero Weights (0s in the Center): The zeros in the middle column of the filter
do not influence the result.
When this filter is convolved over an image, it computes the difference in intensity
between the left and right sides of the filter. This makes it effective at detecting
vertical edges, where there is a significant intensity change from left to right. In
areas of the image with a strong vertical edge, this filter produces high values in
the feature map, highlighting those edges.
In summary, this filter is a basic example of an edge detection filter in image
processing, specifically tuned to highlight vertical edges due to its pattern of
positive, zero, and negative weights.
Figure 9.6 illustrates the convolution process in a CNN using a simple 5×5
grayscale image, a 3×3 edge-detection filter, and the resulting feature map. The
leftmost image represents the input matrix (or image), the middle image shows
the applied 3×3 vertical edge-detection filter, and the rightmost image displays
the resulting feature map after convolution, highlighting detected edges.
Transformation with Stride and Padding
Let’s assume a stride of 1 and no padding (valid padding). Here is how the first
convolution operation would look:
• Position the filter over the top-left corner of the image.
• Perform element-wise multiplication and sum the results.
• Move the filter one pixel to the right (stride = 1) and repeat the process.
After completing the operation across the entire image, you get a feature map. If
the stride is 1 and no padding is used, the output feature map will have a smaller
dimension than the input image. In our case, the output feature map will be a 3×3
matrix, as the 3×3 filter can only move three steps horizontally and vertically
across the original 5×5 input.
Result with Padding
If we apply padding, for instance, a padding of 1 (adding a border of zeros around
the input image), the input matrix becomes a 7×7 matrix, with the original image
in the middle. The convolution operation with the same filter and a stride of 1
would then produce a 5×5 feature map, preserving the original input size.
By manipulating the stride and padding parameters, you can control the size and
resolution of the feature maps produced by the convolutional layers. This plays a
significant role in the architecture of the CNN and the level of feature extraction
it can perform.
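The effect of stride and padding on the output size can be verified with a small plain-Python convolution. The 5×5 image is the one shown earlier; the 3×3 filter is an assumption reconstructed from the description of the vertical edge filter (1s on the left, 0s in the center, –1s on the right), and the function names are our own.

```python
def pad_with_zeros(image, padding):
    """Surround a 2D list with `padding` rows and columns of zeros."""
    if padding == 0:
        return [row[:] for row in image]
    width = len(image[0]) + 2 * padding
    top = [[0] * width for _ in range(padding)]
    middle = [[0] * padding + row + [0] * padding for row in image]
    bottom = [[0] * width for _ in range(padding)]
    return top + middle + bottom


def convolve2d(image, kernel, stride=1, padding=0):
    """Slide the kernel over the image and sum the element-wise products."""
    padded = pad_with_zeros(image, padding)
    k = len(kernel)
    out_rows = (len(padded) - k) // stride + 1
    out_cols = (len(padded[0]) - k) // stride + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += padded[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [255, 255, 255, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 0],
    [0, 255, 255, 0, 0],
]
# Assumed vertical edge filter: 1s on the left, 0s in the center, -1s on the right
vertical_edge = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

valid = convolve2d(image, vertical_edge, stride=1, padding=0)
same = convolve2d(image, vertical_edge, stride=1, padding=1)
print("no padding:", len(valid), "x", len(valid[0]))
print("padding=1: ", len(same), "x", len(same[0]))
for row in valid:
    print(row)
```

Without padding the 5×5 input yields a 3×3 feature map, and with a padding of 1 the output is 5×5 again. Increasing the stride to 2 without padding would shrink the output further, to 2×2.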
Pooling Layers
Pooling layers are another critical component in CNNs positioned typically after
convolutional layers. Their primary role is to reduce the spatial dimensions,
specifically, the width and height of the input feature maps. This dimensionality
reduction is crucial for several reasons:
• Reducing Computational Complexity: By downsizing the input’s spatial
dimensions, pooling layers decrease the number of parameters and
computations required in the network. This efficiency is vital in handling
large and complex inputs, like high-resolution images, and contributes to
faster training times.
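A minimal sketch of max pooling shows how the spatial dimensions shrink. The 4×4 feature map is hypothetical, and a 2×2 window with a stride of 2 is assumed, which is a common configuration.

```python
def max_pool2d(feature_map, pool_size=2, stride=2):
    """Keep the largest value in each pool_size x pool_size window."""
    out_rows = (len(feature_map) - pool_size) // stride + 1
    out_cols = (len(feature_map[0]) - pool_size) // stride + 1
    pooled = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            window = [
                feature_map[r * stride + i][c * stride + j]
                for i in range(pool_size)
                for j in range(pool_size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [0, 2, 9, 7],
    [1, 1, 3, 5],
]
pooled = max_pool2d(feature_map)
print(pooled)
```

The 4×4 input is reduced to a 2×2 output, so the next layer processes a quarter of the values while the strongest response in each region is retained.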
in an image classification task, these layers might take the features extracted from
previous layers (such as edges, textures, and so on) and use them to identify the
presence of an object like a sheep or a pig in the image.
Hierarchy of Feature Learning: The sequence of convolutional layers followed
by pooling layers in a CNN establishes a hierarchical pattern learning mechanism.
The initial layers tend to capture basic patterns like edges, and as we go deeper
into the network, subsequent layers build upon these to recognize more intricate
and complex patterns. The FC layers, being at the end of this hierarchy, play a
pivotal role in interpreting these patterns in the context of the specific task at hand.
Output Layer
The output layer in a CNN is the final layer, responsible for producing the final
result or prediction of the network. Its design and function are directly linked to
the specific task that the CNN is intended to perform.
Considerations in Designing the Output Layer
Number of Neurons: The number of neurons in the output layer corresponds to
the number of classes in a classification task or the dimensionality of the output
in a regression task.
Activation Function: The choice of activation function in the output layer is
crucial and depends on the specific type of problem. For classification, softmax is
common, while for regression tasks, linear or other suitable activation functions
are used.
Loss Function Association: The design of the output layer is closely linked to the
choice of the loss function during the training process. For instance, a softmax
output layer is often paired with a cross-entropy loss function in classification tasks.
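To make the pairing of a softmax output with cross-entropy loss concrete, the following sketch computes both by hand for one sample with five classes. The logit values are made up for illustration, and the helper names are ours.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_index])


# Illustrative raw outputs (logits) of a 5-neuron output layer
logits = [2.0, 0.5, -1.0, 0.0, 0.1]
probs = softmax(logits)

loss_if_class_0 = cross_entropy(probs, 0)  # true class has the highest score
loss_if_class_2 = cross_entropy(probs, 2)  # true class has the lowest score
print(probs)
print(loss_if_class_0, loss_if_class_2)
```

The loss is small when the network assigns a high probability to the true class and large otherwise, which is what drives the weights toward correct predictions during training.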
L1 Regularization (Lasso)
Mechanism: L1 regularization adds a penalty proportional to the sum of the absolute
values of the weights to the loss function. Mathematically, the L1 loss for a
neural network is given by:
$$L_{L1} = L_{original} + \lambda \sum_i |w_i|$$
where $L_{original}$ is the original loss function (such as cross-entropy loss in
classification tasks), $w_i$ represents each weight in the network, and $\lambda$ is the
regularization strength, a hyperparameter that controls the degree of regularization.
Effect: L1 regularization encourages the model weights to be sparse, meaning
many of the weights will be zero or close to zero.
L2 Regularization (Ridge)
Mechanism: L2 regularization adds a penalty proportional to the sum of the squared
weights. The L2 loss for a neural network is defined as:
$$L_{L2} = L_{original} + \lambda \sum_i w_i^2$$
Effect: Unlike L1, L2 regularization shrinks all weights and tends to drive them
to small, non-zero values. This technique is effective in limiting overfitting
because it penalizes weights with larger magnitudes most heavily.
Application in CNNs
Incorporation in Loss Function: In CNNs, these regularization terms are added
to the loss function during training. As the network learns from the data, the
regularization term discourages the model from fitting too closely to the training
data by penalizing large weights.
Impact on Learning: Both L1 and L2 regularization make training pursue two goals
at once: minimizing the loss to fit the data and keeping the model weights as small
as possible. This trade-off helps the model generalize better to unseen data.
Choice Between L1 and L2
• L1 regularization is often chosen when we want to impose sparsity on the
weights (i.e., making irrelevant weights equal to zero).
• L2 regularization is usually preferred in most scenarios as it tends to perform
better in minimizing overfitting without increasing sparsity.
Hyperparameter Tuning: The regularization strength λ is a hyperparameter that
requires tuning. Depending on its value, the regularization can have a more or less
significant impact on the weights of the network.
L1 vs. L2 Regularization
• L1 Regularization: Leads to sparsity and is useful when we have a high
number of features, some of which may be irrelevant for prediction.
• L2 Regularization: Tends to give better prediction accuracy as it does not
enforce sparse weights.
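The different behavior of the two penalties can be seen by taking one gradient-descent step on the penalty term alone. The sketch below is illustrative only: the weights, the regularization strength and the learning rate are arbitrary values, and the function names are ours.

```python
def l1_penalty(weights, lam):
    """lambda * sum(|w_i|)"""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lambda * sum(w_i^2)"""
    return lam * sum(w * w for w in weights)


def sign(w):
    return (w > 0) - (w < 0)


def penalty_step(weights, lam, lr, kind):
    """One gradient-descent step on the penalty term only."""
    if kind == "l1":
        # d|w|/dw = sign(w): every weight is pushed toward zero by the same amount
        return [w - lr * lam * sign(w) for w in weights]
    # d(w^2)/dw = 2w: the push is proportional to the weight itself
    return [w - lr * lam * 2 * w for w in weights]


weights = [0.05, -0.5, 2.0]
lam, lr = 0.1, 0.5

after_l1 = penalty_step(weights, lam, lr, "l1")
after_l2 = penalty_step(weights, lam, lr, "l2")
print(after_l1)  # the smallest weight reaches zero
print(after_l2)  # every weight shrinks by the same factor, none reaches zero
```

In this example the L1 step removes the smallest weight entirely, while the L2 step shrinks the largest weight the most in absolute terms but leaves all weights non-zero.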
Early Stopping
Early stopping is a regularization technique used during the training of a deep
learning model to prevent overfitting. It involves monitoring the model’s
performance on a validation set and stopping the training process once the model’s
performance ceases to improve.
Concept and Mechanism
• Monitoring Validation Loss: Early stopping focuses on tracking the loss on
a validation dataset, which is separate from the training data. The idea is
to identify the point during training when the model’s performance on the
validation set starts to degrade, indicating overfitting on the training data.
• Stopping Criterion: Training is stopped when the validation loss has not
improved for a predetermined number of epochs. This number is often referred
to as patience. The model weights associated with the lowest validation loss
are typically saved and used for future predictions.
Implementation
• Training Loop Adjustment: In the training loop, after each epoch, the model
is evaluated on the validation set. The loss on this validation set is then
compared to the best loss observed in previous epochs.
• Patience Hyperparameter: If the validation loss fails to improve for a
consecutive number of epochs (defined by the patience parameter), the
training process is halted.
• Model Checkpointing: Often, early stopping is used in conjunction with
model checkpointing, where the model weights are saved whenever an
improvement in validation loss is observed. This ensures that the best
performing model is retained, even if the model’s performance degrades in
subsequent epochs.
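The logic described above can be sketched without any deep learning framework by feeding a list of simulated validation losses to a small helper. The class name EarlyStopper, its step method and the loss values are assumptions made for this illustration.

```python
class EarlyStopper:
    """Hypothetical helper: stop when validation loss stalls for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = None
        self.best_epoch = None
        self.counter = 0

    def step(self, epoch, val_loss):
        """Record one epoch and return True when training should stop."""
        if self.best_loss is None or val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch  # a real loop would save the model weights here
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience


# Simulated validation losses: they improve, then degrade as overfitting sets in
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.65, 0.70]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)  # prints: 6 3 0.58
```

Training halts three epochs after the best validation loss, and the checkpoint taken at the best epoch is the one that would be kept.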
Benefits
• Prevents Overfitting: By stopping training early at the right moment, it is
ensured that the model does not overfit to the training data, enhancing its
ability to generalize to new, unseen data.
• Saves Time and Resources: It reduces unnecessary training time and
computational resources by stopping the training process once further training
ceases to yield better results on the validation set.
Considerations
• Validation Set Selection: The effectiveness of early stopping heavily relies on
having a representative validation set. If the validation data does not reflect
the distribution of the unseen test set or real-world data, early stopping might
halt training too early or too late.
• Patience Setting: Setting the right patience is crucial. Too little patience may
stop training prematurely, while too much patience might delay the stopping,
leading to overfitting. This hyperparameter often requires tuning based on the
specific dataset and model architecture.
Data Augmentation
Data augmentation is a widely used regularization technique in deep learning,
particularly effective in tasks like image and speech recognition. It involves
artificially expanding the training dataset by applying various transformations
to the existing data, thereby creating additional, altered versions of the data
points.
Concept and Purpose
• Expanding Dataset: Data augmentation increases the size and diversity of
the training dataset without the need to collect new data. This is achieved by
applying a series of transformations that alter the data in realistic ways.
• Preventing Overfitting: By introducing a variety of modified versions of the
original data, data augmentation helps the model learn to generalize better.
Implementation and Considerations
• Automated Augmentation: Many deep learning frameworks provide tools
to automate the process of data augmentation, applying transformations
randomly during training.
• Careful Design: The choice and extent of augmentations should be relevant to
the problem domain. For instance, flipping images horizontally might not be
appropriate for digit recognition, as it would change the meaning of numbers
like 6 and 9.
• Impact on Training: While data augmentation can significantly improve
model robustness and generalization, it also increases the computational
workload during training, as it effectively enlarges the training dataset.
Benefits
• Enhanced Generalization: Models trained on augmented data are less likely
to overfit and usually perform better on unseen data.
• Improved Robustness: Augmentation can make models more robust to
variations and distortions in input data, which is crucial for real-world
applications.
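For accelerometer windows, two simple transformations often applied to sensor time series are jittering (adding small random noise) and scaling (multiplying each axis by a random factor). The sketch below assumes windows shaped (time steps, axes); the function names, noise level and scaling range are illustrative choices, not values used in this chapter.

```python
import numpy as np


def jitter(window, sigma, rng):
    """Add small Gaussian noise to every sample."""
    return window + rng.normal(0.0, sigma, size=window.shape)


def scale(window, low, high, rng):
    """Multiply each axis by its own random factor drawn from [low, high]."""
    factors = rng.uniform(low, high, size=(1, window.shape[1]))
    return window * factors


rng = np.random.default_rng(0)
# Synthetic stand-in for one window of (time steps, axes)
window = rng.normal(size=(1000, 3))

augmented = scale(jitter(window, 0.05, rng), 0.9, 1.1, rng)
print(window.shape, augmented.shape)
```

Whether a given transformation is appropriate depends on the behavior being recognized: a transformation that changes the meaning of the signal would corrupt the labels rather than enrich the dataset.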
CNNs in Practice
In this section, we will look into how to prepare accelerometer data for analysis
using CNNs with PyTorch (Refer to Chapter_9_CNN.ipynb in the GitHub
repository).
To process the accelerometer data from a CSV file for use in a CNN with PyTorch,
we will follow these steps:
• Load the dataset from the CSV file.
• Normalize the feature data.
• Encode the categorical labels.
• Reshape the data for CNN input. CNNs expect input data in a certain shape.
For PyTorch, the data should be shaped as [batch_size, channels, length],
where batch_size is the number of samples in a batch, channels corresponds
to the number of sensor axes (usually 3 for accelerometer data), and length is
the number of time steps in each segment.
• Convert the data to PyTorch tensors. PyTorch works with tensors, so the
preprocessed data needs to be converted into tensor format. This can be done
using PyTorch’s tensor utilities.
• Create a dataset and a DataLoader for training. In PyTorch, datasets and
dataloaders are used to efficiently load data during the training and testing
process. The TensorDataset and DataLoader classes are particularly useful for
handling batches of data and simplifying the training loop.
Let’s start with the code:
Step 1: Load the Dataset
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import TensorDataset, DataLoader
# Output
(64626, 83)
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
label_names = encoder.classes_
label_names, y_encoded
# Output
(array(['grazing', 'resting', 'scratching', 'standing', 'walking'],
dtype=object),
array([0, 0, 0, ..., 4, 4, 4]))
X_reshaped = X_normalized.reshape(-1,1,82)
X_reshaped.shape
# Output
(64626, 1, 82)
# Output
(torch.Size([64626, 1, 82]), torch.Size([64626]))
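The preparation steps listed above can be sketched end to end on synthetic data. This is a minimal illustration rather than the notebook code: the random data stand in for the CSV file, NumPy operations stand in for StandardScaler and LabelEncoder, and the batch size is an arbitrary choice. The 82 feature columns are treated as a single channel of length 82, matching the reshape used in this section.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

rng = np.random.default_rng(0)
# Synthetic stand-in for the CSV contents: 100 rows, 82 feature columns
X = rng.normal(loc=5.0, scale=2.0, size=(100, 82))
y = rng.choice(["grazing", "resting", "walking"], size=100)

# Normalize each feature column to zero mean and unit variance
X_normalized = (X - X.mean(axis=0)) / X.std(axis=0)

# Encode string labels as integers (classes in sorted order)
label_names, y_encoded = np.unique(y, return_inverse=True)

# Reshape to [batch_size, channels, length]
X_reshaped = X_normalized.reshape(-1, 1, 82)

# Convert to tensors: float32 features, int64 class indices
X_tensor = torch.tensor(X_reshaped, dtype=torch.float32)
y_tensor = torch.tensor(y_encoded, dtype=torch.long)

# Wrap in a dataset and a loader that yields shuffled mini-batches
dataset = TensorDataset(X_tensor, y_tensor)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

batch_X, batch_y = next(iter(loader))
print(batch_X.shape, batch_y.shape)
```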
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class AccelerometerCNN(nn.Module):
    def __init__(self):
        super(AccelerometerCNN, self).__init__()
        # Assuming input shape [batch_size, 1, 82]
        self.conv1 = nn.Conv1d(1, 32, kernel_size=3, padding=1)
        # Output: [batch_size, 32, 82]
        self.pool = nn.MaxPool1d(2)  # Output: [batch_size, 32, 41]
        self.conv2 = nn.Conv1d(32, 64, kernel_size=3, padding=1)
        # Output: [batch_size, 64, 41]
        # Apply pooling again: [batch_size, 64, 20]
        self.dropout = nn.Dropout(p=0.5)
        self.fc1 = nn.Linear(64 * 20, 128)  # Fully connected layers
        self.fc2 = nn.Linear(128, 5)  # Output layer, 5 classes
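The shape comments in the layer definitions can be checked by passing a dummy batch through a complete model. The sketch below reuses the same layers under a different class name (SketchCNN); its forward method, including where ReLU and dropout are applied, is our assumption about one reasonable ordering and not necessarily the one used in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchCNN(nn.Module):
    """Hypothetical completion of the layer stack, used only to check shapes."""

    def __init__(self, num_classes=5):
        super().__init__()
        self.conv1 = nn.Conv1d(1, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(2)
        self.conv2 = nn.Conv1d(32, 64, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(p=0.5)
        self.fc1 = nn.Linear(64 * 20, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # [B, 32, 82] -> [B, 32, 41]
        x = self.pool(F.relu(self.conv2(x)))   # [B, 64, 41] -> [B, 64, 20]
        x = torch.flatten(x, start_dim=1)      # [B, 64 * 20]
        x = self.dropout(F.relu(self.fc1(x)))  # [B, 128]
        return self.fc2(x)                     # raw class scores (logits)


model = SketchCNN()
model.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    logits = model(torch.randn(8, 1, 82))
print(logits.shape)  # torch.Size([8, 5])
```

Pooling a length of 41 with a window of 2 discards the last element, which is why the flattened size is 64 * 20 rather than 64 * 20.5.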
class EarlyStopping:
    def __init__(self, patience=10, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
        self.early_stop = False
# Initialize model
model = AccelerometerCNN()
# Early stopping
early_stopping = EarlyStopping(patience=20, min_delta=0.01)
In this step, we set up the foundational elements required for training our CNN
model. This involves initializing the model, setting up the loss function and
optimizer, and ensuring the model utilizes the GPU for faster computation, if
available.
• Model Creation: An instance of the AccelerometerCNN class is created. This
instance, model, encapsulates our CNN architecture and is the object we will
train and use for predictions.
• Device Selection: This step checks if a GPU with CUDA support is available.
If it is, device is set to use the GPU; otherwise, it falls back to the CPU.
• Model to Device: The model.to(device) command moves our model to the
chosen device (GPU or CPU), ensuring all computations of the model are
performed on that device.
• Loss Function: nn.CrossEntropyLoss is used as the loss function, appropriate
for multi-class classification tasks. It applies log-softmax and the negative
log-likelihood loss in a single class, so the model outputs raw scores (logits)
without a final softmax layer, simplifying the model architecture and the
training loop.
• Optimizer: The Adam optimizer is chosen for its efficiency in handling sparse
gradients and adaptive learning rates. The learning rate (lr) is set to 0.001, a
common choice for many tasks.
• L2 Regularization: weight_decay in the Adam optimizer is used for L2
regularization. It adds a penalty on the size of the weights, helping to prevent
overfitting by discouraging overly complex models.
• Early Stopping Setup: An instance of the EarlyStopping class is created with
a patience of 20 epochs and a minimum delta (min_delta) of 0.01. This means
the training will stop if the validation loss does not improve by at least 0.01
for 20 consecutive epochs.
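The setup described in these bullets can be sketched as follows. A single linear layer stands in for the CNN so that the example stays short, and the weight_decay value of 1e-4 is an assumed placeholder, since the value used in the notebook is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(82, 5).to(device)  # stand-in for the CNN
criterion = nn.CrossEntropyLoss()
# weight_decay adds the L2 penalty; 1e-4 is an assumed value
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

inputs = torch.randn(16, 82, device=device)
targets = torch.randint(0, 5, (16,), device=device)

# One optimization step
optimizer.zero_grad()
logits = model(inputs)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()

# CrossEntropyLoss on logits equals log-softmax followed by the NLL loss
manual_loss = F.nll_loss(F.log_softmax(logits.detach(), dim=1), targets)
print(loss.item(), manual_loss.item())
```

The two printed values agree, which is why no softmax layer is needed at the end of the model when nn.CrossEntropyLoss is used.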
Step 10: Model Training and Evaluation on Validation Set
This step covers the training of the AccelerometerCNN model and its evaluation
using the validation set. The goal is to optimize the model parameters based on
the training data and to gauge its performance on unseen data (validation set) to
prevent overfitting.
max_epochs = 1000
for epoch in range(max_epochs):
    # Training phase
    model.train()
    train_loss = 0.0
    correct = 0
    total = 0
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        train_loss += loss.item()
        _, predicted = torch.max(output.data, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()
        loss.backward()
        optimizer.step()
    train_loss /= len(train_loader)
    train_accuracy = 100 * correct / total
    # Validation phase
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss = criterion(output, target)
            val_loss += loss.item()
            _, predicted = torch.max(output.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
    val_loss /= len(val_loader)
    val_accuracy = 100 * correct / total
    # Stop when the validation loss has stalled for `patience` epochs
    early_stopping(val_loss)
    if early_stopping.early_stop:
        print("Early stopping triggered")
        break
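from sklearn.metrics import classification_report, accuracy_score
model.eval()
y_true = []
y_pred = []
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        _, predicted = torch.max(output.data, 1)
        y_true.extend(target.cpu().numpy())
        y_pred.extend(predicted.cpu().numpy())
print("Classification Report:")
print(classification_report(y_true, y_pred, target_names=label_names))
# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)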
# Output
Classification Report:
              precision    recall  f1-score   support
     grazing       0.99      1.00      1.00      2416
     resting       0.98      0.97      0.98      4166
  scratching       0.96      0.89      0.92        88
    standing       0.95      0.96      0.96      2058
     walking       1.00      1.00      1.00       966
    accuracy                           0.98      9694
   macro avg       0.98      0.96      0.97      9694
weighted avg       0.98      0.98      0.98      9694
Accuracy: 0.9789560552919332
– The true and predicted labels are appended to y_true and y_pred lists,
respectively, after moving them to the CPU (cpu().numpy()).
• Generate Classification Report:
– classification_report from Scikit-learn provides a detailed analysis of the
model’s performance, including metrics like precision, recall, and F1-
score for each class, as well as overall accuracy.
– target_names are the names of the classes obtained from the LabelEncoder
used during data preprocessing.
• Calculate Overall Accuracy:
– accuracy_score computes the overall accuracy of the model on the test
dataset.
• Output and Interpretation:
– The classification report and accuracy score are printed, providing a
detailed overview of the model’s performance.
– In the provided example, the model achieves an accuracy of approximately
97.9%. The precision, recall, and F1-score for each class (such as grazing
and resting) indicate how well the model performs for each specific
category. For instance, scratching, the rarest class with only 88 test
samples, has the lowest recall (0.89).
This final evaluation step is a definitive measure of the model’s ability to handle
new data and is crucial for assessing its practical utility. The detailed metrics
provided in the classification report offer insights into the strengths and weaknesses
of the model, guiding potential improvements or adjustments for future iterations
or similar projects.
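The macro and weighted averages in the report can be reproduced from the per-class rows, which clarifies why they differ: the macro average treats every class equally, whereas the weighted average gives each class a weight proportional to its support. The sketch below uses the rounded values printed in the report, so the results are approximate.

```python
# Per-class values copied from the classification report above
report = {
    "grazing":    {"precision": 0.99, "recall": 1.00, "f1": 1.00, "support": 2416},
    "resting":    {"precision": 0.98, "recall": 0.97, "f1": 0.98, "support": 4166},
    "scratching": {"precision": 0.96, "recall": 0.89, "f1": 0.92, "support": 88},
    "standing":   {"precision": 0.95, "recall": 0.96, "f1": 0.96, "support": 2058},
    "walking":    {"precision": 1.00, "recall": 1.00, "f1": 1.00, "support": 966},
}


def macro_avg(metric):
    """Unweighted mean over classes."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean over classes weighted by the number of test samples per class."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


total_support = sum(row["support"] for row in report.values())
print(total_support)  # prints: 9694
for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_avg(metric), 2), round(weighted_avg(metric), 2))
```

The macro recall (0.96) is lower than the weighted recall (0.98) because the weaker result on the small scratching class counts as much as any other class in the macro average.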
In the context of RNNs, this narrative structure allows the network to learn from
the entire sequence of data, rather than viewing each data point in isolation.
At each timestep of processing, an RNN combines the current input vector with the
previous timestep’s hidden state to compute a new hidden state. This process is
repeated across the sequence, enabling the network to effectively remember and
incorporate past information to inform current and future decisions (Figure 9.7).
It is this ability to capture temporal dependencies that makes RNNs particularly
suited to tasks involving time series data, such as monitoring animal activities
through sensors.
However, traditional RNNs are not without their challenges. They can be difficult
to train and are prone to vanishing or exploding gradients and to convergence
problems, which complicates their application in tasks requiring modeling of
long-term dependencies. To address these limitations, advancements in RNN
architecture have been introduced, including LSTM and Gated Recurrent Units
(GRUs), which will be explored in subsequent sections.
Figure 9.7: Unfolding of a Recurrent Neural Network (RNN) over time steps,
illustrating the flow of information from one unit to the next with the incorporation
of the previous hidden state into the current input.
RNNs are designed with the capability to remember past information, integrating
it with current inputs to produce outputs. This memory component allows RNNs
to maintain a form of internal state that captures information about the sequence it
has processed so far. In contrast to feedforward neural networks that handle inputs
independently, RNNs utilize feedback loops, enabling them to perform both
prediction and classification tasks by leveraging static and temporal information
within the input sequence.
Mathematically, this behavior can be described as follows:
State Update: The new state $h_t$ is a function $f$ of the current input $x_t$ and the
previous state $h_{t-1}$:
$$h_t = f(h_{t-1}, x_t)$$
Output Generation: The output at each step, $y_t$, is calculated from the current
state $h_t$:
$$y_t = g(h_t)$$
where,
• $h_t$ is the state at time $t$.
• $x_t$ is the input at time $t$.
• $y_t$ is the output at time $t$.
• $f$ and $g$ are the functions learned during training.
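The state update and output equations can be sketched in NumPy. The concrete forms chosen here, a tanh of a linear combination for f and a linear map for g, are a common textbook choice and an assumption on our part; the weight names and sizes are illustrative.

```python
import numpy as np


def rnn_forward(xs, W_x, W_h, b_h, W_y, b_y):
    """Run a simple RNN over a sequence xs of shape (time steps, input size)."""
    h = np.zeros(W_h.shape[0])  # initial state h_0
    states, outputs = [], []
    for x_t in xs:
        # State update: h_t = f(h_{t-1}, x_t)
        h = np.tanh(W_h @ h + W_x @ x_t + b_h)
        # Output generation: y_t = g(h_t)
        y_t = W_y @ h + b_y
        states.append(h)
        outputs.append(y_t)
    return np.stack(states), np.stack(outputs)


rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_steps = 3, 4, 2, 6

hs, ys = rnn_forward(
    rng.normal(size=(n_steps, n_in)),
    W_x=rng.normal(size=(n_hidden, n_in)),
    W_h=rng.normal(size=(n_hidden, n_hidden)),
    b_h=np.zeros(n_hidden),
    W_y=rng.normal(size=(n_out, n_hidden)),
    b_y=np.zeros(n_out),
)
print(hs.shape, ys.shape)  # prints: (6, 4) (6, 2)
```

The same weights are reused at every time step; only the state h carries information from earlier inputs forward.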
Exploding Gradients: Conversely, the exploding gradients issue arises from the
accumulation of large error gradients, leading to markedly large updates to the neural
network’s weights during the training process. This can result in an unstable model, with
the model weights diverging and possibly resulting in NaN values due to numerical
instability. In RNNs, this is also more likely to occur with long sequences, where
gradients can compound over many time steps and grow exponentially.
Both of these issues make it difficult for the network to learn, as they lead to either
very slow learning or to divergence of the model weights. Several techniques have
been devised to mitigate these problems:
• Gradient clipping: This involves scaling down gradients when they exceed a
certain threshold to prevent them from exploding.
• Weight Initialization: Careful initialization of weights can prevent gradients
from vanishing or exploding at the start of training.
• Using LSTM/GRU cells: These variants of RNNs include gating mechanisms
that help to control the flow of gradients and are less susceptible to the
vanishing gradients problem.
• Batch Normalization: Though less common in RNNs, this technique can help
maintain stable gradients throughout the network.
Understanding and mitigating vanishing and exploding gradients is crucial for
training deep neural networks effectively, especially when dealing with sequential
data that requires capturing long-range dependencies.
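Gradient clipping by global norm, the first technique in the list above, can be sketched in a few lines of NumPy. The function name and the example gradients are ours; in PyTorch the same operation is provided by torch.nn.utils.clip_grad_norm_, which is called between loss.backward() and optimizer.step().

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads, total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# Two illustrative gradient arrays whose joint norm is sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

All gradients are scaled by the same factor, so the direction of the update is preserved and only its length is limited.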
Architecture
LSTMs are centered around the concept of a cell state, which flows directly
through the network’s entire chain, experiencing minimal linear interactions. This
design allows information to flow along it unchanged if necessary. LSTMs possess
the capability to modify the cell state by either adding or removing information, a
process thoroughly controlled by mechanisms known as gates.
Figure 9.8 illustrates the architecture of an LSTM unit. It shows the input $x_t$ and
the previous hidden state $h_{t-1}$ feeding into three gates: the forget gate $f_t$, the
input gate $i_t$, and the output gate $o_t$, represented by boxes with the sigmoid
activation function $\sigma$. Additionally, the tanh function creates a new candidate
cell state $\tilde{C}_t$.
These gates regulate the update and output of the new cell state Ct and the new
hidden state ht, utilizing operations like element-wise multiplication and addition,
leading to the final cell and hidden state outputs.
Gates
Gates serve as selective channels for information flow, consisting of a sigmoid
neural network layer and a pointwise multiplication operation to either permit
or block information. The sigmoid layer outputs numbers between zero and one,
describing how much of each component should be let through. The LSTM has
three of these gates to protect and control the cell state:
• Forget Gate ($f_t$):
This decides what information should be discarded from the cell state. It looks
at the previous hidden state $h_{t-1}$ and the current input $x_t$ and generates a value
ranging from 0 to 1 for each element in the cell state $C_{t-1}$, where 1 signifies fully
retaining the element, and 0 indicates entirely discarding it.
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
• Input Gate ($i_t$):
This decides what new information will be stored in the cell state. It involves two
parts: a sigmoid layer that decides which values to update, and a tanh layer that
creates a vector of new candidate values $\tilde{C}_t$ that could be added to the state.
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
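A single LSTM time step can be sketched in NumPy. The forget gate, input gate and candidate lines mirror the equations above; the cell-state update and the output gate follow the standard LSTM formulation. The parameter names, sizes and random values are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM time step; p is a dict of weight matrices and bias vectors."""
    concat = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ concat + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_i"] @ concat + p["b_i"])      # input gate
    C_tilde = np.tanh(p["W_C"] @ concat + p["b_C"])  # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde               # keep part of the old state, add new
    o_t = sigmoid(p["W_o"] @ concat + p["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                         # new hidden state
    return h_t, C_t


rng = np.random.default_rng(0)
n_hidden, n_in = 4, 3
params = {}
for gate in ("f", "i", "C", "o"):
    params["W_" + gate] = rng.normal(size=(n_hidden, n_hidden + n_in))
    params["b_" + gate] = np.zeros(n_hidden)

h = np.zeros(n_hidden)
C = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):  # a sequence of 5 time steps
    h, C = lstm_step(x_t, h, C, params)
print(h.shape, C.shape)  # prints: (4,) (4,)
```

Because every gate value lies between 0 and 1, the products with $C_{t-1}$ and $\tilde{C}_t$ act as soft switches that decide how much is kept and how much is written.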
Figure 9.9: GRU architecture.
determine the level of influence the previous state should have on the current
state. Mathematically, this can be represented as:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$
Here, $z_t$ is the update gate vector, $\sigma$ is the sigmoid function, $W_z$ is the weight
matrix for the update gate, $h_{t-1}$ is the previous hidden state, $x_t$ is the input at
time $t$, and $b_z$ is the bias for the update gate.
• Reset Gate: This gate controls the amount of past information to be discarded,
deciding on the extent of previous data that should be forgotten. The reset
gate can be expressed as:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
where $r_t$ is the reset gate vector, and $W_r$ and $b_r$ are the weight matrix and bias
for the reset gate, respectively.
The current memory content utilizes the reset gate to store the relevant information
from the past. It is then combined with the update gate to form the final output of
the GRU.
The new hidden state $h_t$ is a combination of the old hidden state $h_{t-1}$ and the
candidate hidden state $\tilde{h}_t$ and is computed as:
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$
The candidate hidden state $\tilde{h}_t$ is calculated using the current input and the reset
gate:
$$\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t] + b)$$
In this equation, $W$ and $b$ are the weights and biases applied to the candidate
hidden state, and $*$ denotes an element-wise multiplication.
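The GRU equations translate directly into a short NumPy function. The sketch below follows the update gate, reset gate, candidate state and interpolation formulas given above; the parameter names, sizes and random values are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, p):
    """One GRU time step; p is a dict of weight matrices and bias vectors."""
    concat = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    z_t = sigmoid(p["W_z"] @ concat + p["b_z"])        # update gate
    r_t = sigmoid(p["W_r"] @ concat + p["b_r"])        # reset gate
    reset_concat = np.concatenate([r_t * h_prev, x_t])  # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(p["W"] @ reset_concat + p["b"])  # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde        # new hidden state


rng = np.random.default_rng(0)
n_hidden, n_in = 4, 3
params = {
    "W_z": rng.normal(size=(n_hidden, n_hidden + n_in)), "b_z": np.zeros(n_hidden),
    "W_r": rng.normal(size=(n_hidden, n_hidden + n_in)), "b_r": np.zeros(n_hidden),
    "W": rng.normal(size=(n_hidden, n_hidden + n_in)), "b": np.zeros(n_hidden),
}

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):  # a sequence of 5 time steps
    h = gru_step(x_t, h, params)

# With the update gate forced close to 0, the previous state passes through unchanged
frozen = dict(params, W_z=np.zeros((n_hidden, n_hidden + n_in)),
              b_z=np.full(n_hidden, -50.0))
h_frozen = gru_step(rng.normal(size=n_in), h, frozen)
print(np.allclose(h_frozen, h))  # prints: True
```

The final check illustrates the role of the update gate: when $z_t$ is near 0 the GRU simply copies its previous state, and when it is near 1 the state is replaced by the candidate.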
Advantages of GRUs
• Simplicity: GRUs have fewer parameters than LSTMs because they lack an
output gate.
• Efficiency: GRUs are generally faster to compute and train due to their
simpler structure.
• Flexibility: They have shown competitive performance with LSTMs on a
variety of tasks.
import pandas as pd
import os
import glob

def read_csv_files_from_folder(folder_path):
    """
    Reads all CSV files from a specified folder and merges them into
    a single DataFrame.
    Parameters:
    folder_path (str): The path to the folder containing CSV files.
    Returns:
    pd.DataFrame: A Pandas DataFrame containing the merged data or
    None if no data is found.
    """
    try:
        # Create an empty list to store DataFrames
        dfs = []
        # Find every CSV file in the folder
        csv_files = glob.glob(os.path.join(folder_path, "*.csv"))
        if not csv_files:
            print("No CSV files found in the specified folder.")
            return None
        # Read each file, skipping files without rows
        for csv_file in csv_files:
            df = pd.read_csv(csv_file)
            if not df.empty:
                dfs.append(df)
        if not dfs:
            print("No valid data found in the CSV files.")
            return None
        # Merge all DataFrames into one
        merged_df = pd.concat(dfs, ignore_index=True)
        return merged_df
    except FileNotFoundError:
        print("The specified folder or CSV files were not found.")
        return None
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None
new_df.head()
# Output:
label animal_ID timestamp_ms ax ay az
0 walking G1 1 1.57538 4.34787 -9.27514
1 walking G1 6 1.47962 4.30477 -9.31105
2 walking G1 11 1.36469 4.24492 -9.42118
3 walking G1 16 1.21386 4.22816 -9.59835
4 walking G1 21 1.07021 4.29520 -9.67257
# Import libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Output:
(13778153, 6)
# Now, the labels in the DataFrame have been updated, and the
# specified labels are assigned as "other"
df['label'].value_counts()
# Output:
standing 6031069
grazing 3735835
walking 2109794
lying 984396
trotting 408051
running 324775
other 184233
Name: label, dtype: int64
The core of our data preparation involves segmenting it into smaller windows
that will be fed into the LSTM model. We iterate through the DataFrame, creating
segments of 1000 time steps each, based on the ‘ax’, ‘ay’, and ‘az’ features.
The label for each segment is determined by finding the mode of the ‘label’
column within that segment. Finally, we reshape the segments and encode the
categorical labels.
segments = []
labels = []
# Iterate over the data with a moving window of 1000 time steps
segments.append(segment)
labels.append(label)
reshaped_segments.shape, encoded_labels.shape
# Output
((68886, 1000, 3), (68886,))
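The segmentation described above can be sketched on a small synthetic signal. The function name is ours, and the step of 200 samples between window starts is an assumption, chosen because it is consistent with the row and segment counts reported in this section. The label of each window is its most frequent label (the mode).

```python
from collections import Counter

import numpy as np


def segment_signal(values, labels, window=1000, step=200):
    """Cut (samples, axes) data into overlapping windows labeled by majority vote."""
    segments, segment_labels = [], []
    for start in range(0, len(values) - window + 1, step):
        end = start + window
        segments.append(values[start:end])
        # Mode of the labels inside the window
        segment_labels.append(Counter(labels[start:end]).most_common(1)[0][0])
    return np.asarray(segments), np.asarray(segment_labels)


rng = np.random.default_rng(0)
# Synthetic stand-in: 3000 samples of ('ax', 'ay', 'az') with two activities
values = rng.normal(size=(3000, 3))
labels = np.array(["walking"] * 1500 + ["grazing"] * 1500)

segments, segment_labels = segment_signal(values, labels)

# Encode the string labels as integers (classes in sorted order)
classes, encoded_labels = np.unique(segment_labels, return_inverse=True)
print(segments.shape, encoded_labels.shape)  # prints: (11, 1000, 3) (11,)
```

Windows that straddle a change of activity receive the label of whichever activity covers most of the window, so a small amount of label noise near transitions is to be expected.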
Following these preprocessing steps, our data is now in the form of reshaped
segments, with the shape of (68886, 1000, 3). This means we have approximately
69,000 segments, each containing 1000 time steps and 3 features (‘ax’,
‘ay’, ‘az’). Additionally, we have label-encoded labels ready for classification.
In the following code we split and prepare our data for use in PyTorch:
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Create TensorDatasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
The above Python example involves splitting the data into training, validation,
and test sets, normalizing it, and preparing it for use in PyTorch.
In our code:
• train_test_split(reshaped_segments, encoded_labels, test_size = 0.3, random_
state = random_seed):
– We use the train_test_split function to split our preprocessed data into
training, validation, and test sets. reshaped_segments contains the sensor
data and encoded_labels contains the corresponding encoded labels.
– test_size = 0.3: We have specified that 30% of the data should be allocated
for testing, leaving 70% for training and validation.
– random_state = random_seed: Setting the random seed ensures
reproducibility, meaning that every time you run the code with the same
seed, you’ll get the same data split.
• Data Normalization: We created an instance of StandardScaler() to standardize
our data.
• Converting to PyTorch Tensors: We convert our normalized data from NumPy
arrays into PyTorch tensors using torch.tensor(). We specify the data type for
the tensors, such as torch.float32 for the sensor data and torch.int64 for the
labels. The data type matters for the operations performed by the model.
• Creating Data Loaders: Machine learning models are often trained in batches,
not on entire datasets. Data loaders are used to efficiently load and batch the
data during training.
– We create TensorDataset objects, which combine the input data tensors
and their corresponding label tensors.
– We use DataLoader objects to batch the data from the datasets. These data
loaders automatically divide our data into batches of the specified size,
making it suitable for training deep learning models.
By performing these steps, we have organized our data into training, validation,
and test sets, normalized the input data, and converted it into a format suitable for
PyTorch. This prepares the data for efficient model training and evaluation.
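One detail worth illustrating is that StandardScaler expects two-dimensional input, whereas the segments are three-dimensional. A minimal sketch on synthetic data follows; the seed value, the even split of the held-out 30% into validation and test sets, and the small array sizes are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

random_seed = 42  # assumed value
rng = np.random.default_rng(random_seed)

# Small synthetic stand-in for (segments, time steps, features)
X = rng.normal(loc=-9.0, scale=3.0, size=(200, 50, 3))
y = rng.integers(0, 4, size=200)

# 70% training, 30% held out; the held-out part is halved into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, random_state=random_seed)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=random_seed)

n_features = X.shape[-1]
scaler = StandardScaler()
# Flatten (segments, time steps) into rows, fit on the training data only,
# then restore the original 3-D shape
X_train_scaled = scaler.fit_transform(
    X_train.reshape(-1, n_features)).reshape(X_train.shape)
X_val_scaled = scaler.transform(
    X_val.reshape(-1, n_features)).reshape(X_val.shape)
X_test_scaled = scaler.transform(
    X_test.reshape(-1, n_features)).reshape(X_test.shape)

print(X_train_scaled.shape, X_val_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training split alone keeps information about the validation and test data from leaking into the preprocessing.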
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes,
                 seq_length, dropout_prob=0.5):
        super(LSTMModel, self).__init__()
        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.seq_length = seq_length
The LSTM (Long Short-Term Memory) model, defined here, is a type of recurrent
neural network (RNN) commonly used for sequence prediction tasks. It consists
of an input layer, an LSTM layer, and two fully connected (dense) layers. The
input layer accepts data in the format of (batch_size, sequence_length, input_size),
where batch_size represents the number of samples in each batch,
sequence_length denotes the length of the input sequence, and input_size
represents the number of features in each time step.
The LSTM layer processes the input sequence, capturing temporal dependencies,
and producing hidden states. The hidden states are then fed into a fully connected
layer with a ReLU activation function, followed by another fully connected layer,
which outputs the final predictions. The model is parameterized by input_size,
hidden_size, num_layers, and num_classes, where hidden_size determines the
number of units in the LSTM layer, num_layers specifies the number of LSTM
layers stacked on top of each other, and num_classes represents the number of
output classes for classification tasks. In this specific implementation, the model
is designed to operate on sequences of length 1000 (5-second windows, given the
5 ms spacing between consecutive timestamps).
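The architecture described above (an LSTM layer followed by two fully connected layers) can be sketched as a complete module. The class name SketchLSTM, the width of the first dense layer, the use of the last time step as the sequence summary, the placement of dropout and the hyperparameter values below are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SketchLSTM(nn.Module):
    """Hypothetical LSTM classifier: LSTM -> dense + ReLU -> dense."""

    def __init__(self, input_size, hidden_size, num_layers, num_classes,
                 dropout_prob=0.5):
        super().__init__()
        # batch_first=True makes the input shape (batch, seq_len, input_size)
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, 128)  # width of 128 is assumed
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_prob)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        out, _ = self.lstm(x)   # out: (batch, seq_len, hidden_size)
        last = out[:, -1, :]    # hidden state at the final time step
        hidden = self.dropout(self.relu(self.fc1(last)))
        return self.fc2(hidden)  # raw class scores (logits)


# 3 input features ('ax', 'ay', 'az') and the 7 activity labels listed earlier
model = SketchLSTM(input_size=3, hidden_size=32, num_layers=2, num_classes=7)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(4, 1000, 3))  # 4 windows of 1000 time steps
print(logits.shape)  # torch.Size([4, 7])
```

Using only the last time step keeps the classifier small; other summaries of the sequence, such as the mean over all time steps, are equally possible.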
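# Lists for tracking the metrics across epochs
train_losses, val_losses = [], []
train_accuracies, val_accuracies = [], []
# Training loop
for epoch in range(n_epochs):
    model.train()
    total_loss = 0.0
    correct_train = 0
    total_train = 0
    for inputs, labels in train_loader:
        # Move data to GPU
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total_train += labels.size(0)
        correct_train += (predicted == labels).sum().item()
    # Store the average training loss and accuracy for this epoch
    train_losses.append(total_loss / len(train_loader))
    train_accuracies.append(correct_train / total_train)
    # Validation
    model.eval()
    total_loss = 0.0
    correct_val = 0
    total_val = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            # Move data to GPU
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total_val += labels.size(0)
            correct_val += (predicted == labels).sum().item()
    # Store the average validation loss and accuracy for this epoch
    val_losses.append(total_loss / len(val_loader))
    val_accuracies.append(correct_val / total_val)
    print(f"Epoch {epoch + 1}/{n_epochs}: "
          f"train loss {train_losses[-1]:.4f}, train acc {train_accuracies[-1]:.4f}, "
          f"val loss {val_losses[-1]:.4f}, val acc {val_accuracies[-1]:.4f}")
print('Training finished!')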
In the above code, we prepare our model for training on a GPU if available, by
checking for its presence and moving the model to the GPU using PyTorch. We
define our loss function, which is the Cross Entropy Loss, and our optimizer,
which is the Adam optimizer. Additionally, we initialize lists to keep track of
training and validation losses, as well as accuracies, to monitor the model’s
performance during training. The code then enters a training loop, where the
model iterates over the specified number of epochs. Within each epoch, the
model is set to training mode, and we calculate the loss and accuracy on the
training dataset. We also evaluate the model’s performance on the validation
dataset to detect overfitting. Finally, the training and validation losses and
accuracies are printed for each epoch, and these values are stored for later
visualization. Once training ends, a message indicating the end of training is
displayed.
# Plotting losses
plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Train Loss')
plt.plot(val_losses, label='Validation Loss')
plt.title('Training and Validation Losses')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
# Plotting accuracies
plt.figure(figsize=(10, 5))
plt.plot(train_accuracies, label='Train Accuracy')
plt.plot(val_accuracies, label='Validation Accuracy')
plt.title('Training and Validation Accuracies')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
# Display both figures (needed when running as a script)
plt.show()
In our code, we plotted the results to visually track the training and validation
accuracies and losses across 50 epochs. By plotting these metrics, we can easily
observe how our machine learning model performed over time, providing valuable
insights into its learning dynamics.
The first plot of Figure 9.10 illustrates the path of training and validation losses
over epochs. Initially, both exhibit a downward trend, indicating the model’s
ability to minimize errors on both training and validation data. The second plot
reveals the progression of training and validation accuracies over epochs.
[Figure 9.10 panels: the first plots Loss against Epoch (0 to 50); the second plots Accuracy against Epoch (0 to 50).]
Figure 9.10: Training and validation losses and accuracies plotted over 50 epochs.
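Reading the curves can be complemented by a simple numerical check. The sketch below is an illustrative assumption rather than part of the book's code: the helper name best_epoch and the example loss values are hypothetical. It reports the epoch at which the validation loss was lowest and the gap between validation and training loss at that point. A validation loss that rises after this epoch while the training loss keeps falling is a common sign of overfitting.

```python
def best_epoch(train_losses, val_losses):
    """Return (epoch_index, val_loss, gap) at the minimum validation loss.

    epoch_index is zero-based; gap is val_loss - train_loss at that epoch.
    """
    if not val_losses or len(train_losses) != len(val_losses):
        raise ValueError("loss lists must be non-empty and of equal length")
    idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return idx, val_losses[idx], val_losses[idx] - train_losses[idx]


# Hypothetical loss histories for five epochs (not the book's results).
example_train = [1.10, 0.85, 0.70, 0.60, 0.52]
example_val = [1.15, 0.90, 0.78, 0.80, 0.86]
epoch_idx, val_at_best, gap = best_epoch(example_train, example_val)
print(f"Lowest validation loss {val_at_best:.2f} at epoch {epoch_idx + 1}")
```

In this made-up history the validation loss bottoms out at the third epoch and then climbs, while the training loss continues to decrease.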
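# Evaluate the trained model on the test set
model.eval()
total_test = 0
correct_test = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        # Move data to the GPU (if available)
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total_test += labels.size(0)
        correct_test += (predicted == labels).sum().item()
# Fraction of correctly classified test samples
test_accuracy = correct_test / total_test
print(f'Test Accuracy: {test_accuracy:.4f}')
# Output
Test Accuracy: 0.8619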
Finally, we evaluate the trained model on a separate test dataset to assess its
performance on unseen data. The model is put into evaluation mode using
model.eval(), which switches layers such as dropout from their training behavior
to their inference behavior. We then initialize variables to keep track of the
total number of test samples (total_test) and the number of correctly predicted
samples (correct_test). Within a torch.no_grad() block, which disables gradient
calculation, we iterate over batches of test data. For each batch, we move the
data to the GPU (if available), pass it through the model to obtain predictions,
and count the correct predictions. After processing all test data, we calculate
the test accuracy by dividing the number of correctly predicted samples by the
total number of test samples. The printed test accuracy shows that the model
correctly predicted approximately 86.19% of the test samples.
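The accuracy computation described above can be traced by hand on a tiny example. The following framework-free sketch uses plain Python lists in place of tensors; the function names and the scores are hypothetical. Taking the index of the largest score in each row plays the role of torch.max(outputs, 1), and the running totals mirror correct_test and total_test.

```python
def argmax(scores):
    """Index of the largest value in a list of class scores."""
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy_over_batches(batches):
    """Accumulate correct predictions over (outputs, labels) batches."""
    correct_test = 0
    total_test = 0
    for outputs, labels in batches:
        # Predicted class = position of the highest score in each row.
        predicted = [argmax(row) for row in outputs]
        total_test += len(labels)
        correct_test += sum(p == y for p, y in zip(predicted, labels))
    return correct_test / total_test


# Two hypothetical batches of three-class scores with their true labels.
batches = [
    ([[2.1, 0.3, -1.0], [0.2, 1.7, 0.1]], [0, 1]),
    ([[0.5, 0.4, 0.9], [1.2, 1.1, 0.3]], [2, 1]),
]
test_accuracy = accuracy_over_batches(batches)
print(f"Test Accuracy: {test_accuracy:.4f}")
```

Here three of the four samples are classified correctly, so the sketch prints a test accuracy of 0.7500.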
As we wrap up our discussion on using LSTM models for analyzing accelerometer
data to recognize farm animal activities, it is worth noting that we have covered
the foundational aspects needed to get started with LSTMs in PyTorch. From data
preprocessing and model architecture setup to training and evaluation, we have
provided a step-by-step guide to introduce you to the process.
for Search and Rescue (SaR) of dogs, incorporating wearable devices and cloud
infrastructure to monitor and analyze the dog’s activity, audio signals, and
location, using CNNs for activity and sound recognition. The system, validated
in two SaR scenarios, achieved an F1-score of over 99% in detecting victims and
providing real-time alerts to rescuers (Kasnesis et al., 2022). Hussain et al. utilized
a one-dimensional CNN with raw sensor data features, obtaining
96.85% accuracy in detecting dog behaviors (Hussain et al., 2022). A recent study
has extended the application of CNNs to monitoring hens’ activities. Shahbazi et al.
reported nearly 100% accuracy in classifying hens’ activity levels
using body-worn inertial measurement unit sensors, focusing on broader activity
categories (Shahbazi et al., 2023).
Recent studies have also underscored the efficacy of RNNs, particularly LSTM
and GRU variants, in the domain of AAR using wearable sensors. These studies
have used RNNs to recognize and classify complex patterns of behavior from
sensor data, achieving high accuracy in identifying various animal activities.
For instance, Peng et al. (2019) pioneered the application of LSTMs to cattle
behavior recognition, using nine-axis motion data from collar-attached
inertial measurement unit sensors. Their work demonstrated the potential of
LSTMs in distinguishing between multiple cattle behaviors with high accuracy,
thereby offering insights into cattle health and welfare. In one study, the LSTM
model achieved 88.7% accuracy in identifying activities such as feeding, licking
salt, and headbutting.
Further studies have expanded on this foundation, exploring the use of deep
residual bidirectional LSTMs for early identification of diseases in cattle, with
one notable study reporting a classification accuracy of 94.9% (Wu et al., 2022).
Such high levels of accuracy highlight the potential of RNNs in early disease
detection and prevention, significantly impacting animal welfare and farm
management practices.
Comparative analyses have also shown that RNNs can outperform conventional
CNN models in classifying cattle behaviors. These RNN models have been praised
for their efficiency, requiring fewer computational resources while still achieving
comparable or superior accuracy. A two-layer bidirectional GRU model, for
example, achieved accuracy rates of 89.5% and 80% on collar- and ear-attached
sensor datasets, respectively (Wang et al., 2023).
Moreover, the application of RNNs extends beyond cattle to include other
species, such as dogs, where LSTM-based methods have been employed to detect
activities using motion data from accelerometers (Chambers et al., 2021). These
methods have successfully classified various dog activities with a high degree of
accuracy, further illustrating the versatility and effectiveness of RNNs in animal
activity recognition.
This overview of deep learning applications in the field of farm animals demonstrates
the significant advances made in leveraging neural networks, including MLPs,
CNNs, and RNNs, for accurate and efficient behavior classification across various
species. These advances set a solid foundation for the practical implementation of
deep learning models in animal welfare and management.
Summary
In Chapter 9, we delve into the critical role that various neural network architectures
play in the domain of deep learning, with a specific focus on their application to
wearable sensor data for the purpose of animal activity recognition.
We start by introducing the foundational Multilayer Perceptron Neural Networks,
explaining their basic structure and operational principles as the groundwork
for more complex neural network architectures. The discussion then extends to
Convolutional Neural Networks, showing how they can be adeptly applied to
analyze wearable sensor data to identify spatial patterns. Further, we explore
Recurrent Neural Networks and Long Short-Term Memory networks, emphasizing
their ability to handle time-series data, which is often generated by wearable
sensors. Recognizing the importance of efficiency in processing sequences, we
also introduce Gated Recurrent Units, presenting them as a streamlined alternative
to LSTMs.
A significant portion of this chapter is dedicated to demonstrating practical
applications using PyTorch. We show how neural networks can be employed
for both binary and multiclass classification tasks, using NNs, CNNs, and LSTMs,
thus providing readers with hands-on examples. These examples focus on
classifying animal activities based on accelerometer data, aiming to
equip readers with the knowledge and skills to apply deep learning techniques
effectively in the context of wearable sensor data analysis.
Through this analysis and practical demonstration, our goal is
to bridge the gap between theoretical understanding and practical application of
deep learning in wearable sensor data analysis, enhancing our collective ability to
interpret and understand animal behavior through technology.
Final Remarks
As we conclude our exploration into machine learning and deep learning for
sensor data analysis in farm animal activity recognition, we would like to express
our appreciation to you for engaging with this book. Throughout this book, we
have covered a broad spectrum of topics, from the basics of animal behavior
and machine learning to the intricacies of data collection, preprocessing,
feature selection, and various learning techniques, concluding with an insightful
discussion on deep learning. Practical applications in Python have been a
foundation of each chapter, aiming to provide a hands-on understanding of the
concepts discussed.
We recognize the dynamic nature of the field and the possibility of updates or
corrections to the code examples provided. Your feedback is invaluable to us, and
we welcome any corrections or suggestions you may have. Feel free to contact us
with your feedback or questions at https://github.com/nkcAna/WSDpython.
Given the introductory nature of this book, we encourage you to delve deeper
into the latest literature for the most current advancements in the field. The fast-
paced evolution of machine learning and deep learning technologies means there
is always something new to learn. Practicing with Python and building upon
the foundational knowledge acquired here will further enhance your skills and
understanding.
This book is intended to serve as a stepping stone into the vast and exciting world
of machine learning and deep learning in animal behavior analysis. We hope it has
sparked your curiosity and equipped you with the tools to continue your learning
journey. Thank you for joining us on this journey.
References
Abramson, N., Braverman, D., & Sebestyen, G. (1963). Pattern recognition and machine
learning. IEEE Transactions on Information Theory, 9(4), 257–261.
https://doi.org/10.1109/TIT.1963.1057854
Arablouei, R., Currie, L., Kusy, B., Ingham, A., Greenwood, P. L., & Bishop-Hurley, G. (2021).
In-situ classification of cattle behavior using accelerometry data. Computers and Electronics
in Agriculture, 183. https://doi.org/10.1016/j.compag.2021.106045
Arablouei, R., Wang, L., Currie, L., Yates, J., Alvarenga, F. A. P., & Bishop-Hurley, G. J. (2023).
Animal behavior classification via deep learning on embedded systems. Computers and
Electronics in Agriculture, 207. https://doi.org/10.1016/j.compag.2023.107707
Arablouei, R., Wang, Z., Bishop-Hurley, G. J., & Liu, J. (2023). Multimodal sensor data fusion
for in-situ classification of animal behavior using accelerometry and GNSS data. Smart
Agricultural Technology, 4. https://doi.org/10.1016/j.atech.2022.100163
Arcidiacono, C., Porto, S. M. C. C., Mancino, M., & Cascone, G. (2017). Development of a
threshold-based classifier for real-time recognition of cow feeding and standing behavioural
activities from accelerometer data. Computers and Electronics in Agriculture, 134, 124–134.
https://doi.org/10.1016/j.compag.2017.01.021
Barwick, J., Lamb, D., Dobos, R., Schneider, D., Welch, M., & Trotter, M. (2018). Predicting
lameness in sheep activity using tri-axial acceleration signals. Animals, 8(1), 1–16.
https://doi.org/10.3390/ani8010012
Belkina, A. C., Ciccolella, C. O., Anno, R., Halpert, R., Spidlen, J., & Snyder-Cappione, J. E.
(n.d.). Automated optimized parameters for T-distributed stochastic neighbor embedding
improve visualization and analysis of large datasets.
https://doi.org/10.1038/s41467-019-13055-y
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2(1), 1–27. https://doi.org/10.1561/2200000006
Benos, L., Tagarakis, A. C., Dolias, G., Berruto, R., Kateris, D., & Bochtis, D. (2021). Machine
learning in agriculture: A comprehensive updated review. In Sensors (Vol. 21, Issue 11,
p. 3758). Multidisciplinary Digital Publishing Institute. https://doi.org/10.3390/s21113758
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford University Press.
Bishop, C. M. (2006). Pattern recognition and machine learning (1st ed., Vol. 1). Springer New
York, NY.
Borgelt, C. (2005). An implementation of the FP-growth algorithm. Proceedings of the ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining.
https://doi.org/10.1145/1133905.1133907
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
https://doi.org/10.1023/A:1010933404324
Broom, D. M. (2010). Animal welfare: An aspect of care, sustainability, and food quality
required by the public. Journal of Veterinary Medical Education, 37(1), 83–88.
https://doi.org/10.3138/JVME.37.1.83
Chambers, R. D., Yoder, N. C., Carson, A. B., Junge, C., Allen, D. E., Prescott, L. M., Bradley, S.,
Wymore, G., Lloyd, K., & Lyle, S. (2021). Deep learning classification of canine behavior
using a single collar-mounted accelerometer: Real-world validation. Animals, 11(6).
https://doi.org/10.3390/ani11061549
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. In ACM
Computing Surveys (Vol. 41, Issue 3). https://doi.org/10.1145/1541880.1541882
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic
Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.
https://doi.org/10.1613/JAIR.953
Cho, K., van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of
neural machine translation: Encoder–decoder approaches. Proceedings of SSST 2014 -
8th Workshop on Syntax, Semantics and Structure in Statistical Translation.
https://doi.org/10.3115/v1/w14-4012
Cichocki, A., & Unbehauen, R. (1993). Neural Networks for Optimization and Signal Processing.
Wiley and Sons Ltd.
Cunningham, P., & Delany, S. J. (2021). k-Nearest Neighbour Classifiers - A Tutorial. ACM
Computing Surveys (CSUR), 54(6). https://doi.org/10.1145/3459665
David B. Parker. (1985). Learning-logic: Casting the cortex of the human brain in silicon. Center
for Computational Research in Economics and Management Science, MIT.
Dietterich, T. G. (2000). Ensemble methods in machine learning. Lecture Notes in Computer
Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 1857 LNCS. https://doi.org/10.1007/3-540-45014-9_1
Donders, A. R. T., van der Heijden, G. J. M. G., Stijnen, T., & Moons, K. G. M. (2006). Review:
A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology,
59(10). https://doi.org/10.1016/j.jclinepi.2006.01.014
Eerdekens, A., Callaert, A., Deruyck, M., Martens, L., & Joseph, W. (2022). Dog’s Behaviour
Classification Based on Wearable Sensor Accelerometer Data. 5th Conference on Cloud and
Internet of Things, CIoT 2022. https://doi.org/10.1109/CIoT53061.2022.9766553
Eerdekens, A., Deruyck, M., Fontaine, J., Martens, L., De Poorter, E., Plets, D., & Joseph,
W. (2021). A framework for energy-efficient equine activity recognition with leg
accelerometers. Computers and Electronics in Agriculture, 183.
https://doi.org/10.1016/j.compag.2021.106020
Eerdekens, A., Deruyck, M., Fontaine, J., Martens, L., Poorter, E. De, Plets, D., & Joseph, W.
(2020). Resampling and Data Augmentation for Equines’ Behaviour Classification Based
on Wearable Sensor Accelerometer Data Using a Convolutional Neural Network. 2020
International Conference on Omni-Layer Intelligent Systems, COINS 2020.
https://doi.org/10.1109/COINS49042.2020.9191639
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering
Clusters in Large Spatial Databases with Noise. Proceedings of the Second International
Conference on Knowledge Discovery and Data Mining, 226–231.
Fogarty, E. S., Swain, D. L., Cronin, G. M., & Trotter, M. (2019). A systematic review of
the potential uses of on-animal sensors to monitor the welfare of sheep evaluated using
the five domains model as a framework. Animal Welfare, 28(4), 407–420.
https://doi.org/10.7120/09627286.28.4.407
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33(1).
https://doi.org/10.18637/jss.v033.i01
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4).
https://doi.org/10.1007/BF00344251
García, R., Aguilar, J., Toro, M., Pinto, A., & Rodríguez, P. (2020). A systematic literature review
on the use of machine learning in precision livestock farming. Computers and Electronics in
Agriculture, 179, 105826. https://doi.org/10.1016/J.COMPAG.2020.105826
Gaye, B., Zhang, D., & Wulamu, A. (2021). Improvement of Support Vector Machine Algorithm
in Big Data Background. Mathematical Problems in Engineering, 2021.
https://doi.org/10.1155/2021/5594899
Géron, A., & Russell, Rudolph. (2019). Machine learning step-by-step guide to implement
machine learning algorithms with Python. O’Reilly Media, Inc.
Goldberg, X. (2009). Introduction to semi-supervised learning. Synthesis Lectures
on Artificial Intelligence and Machine Learning, 6.
https://doi.org/10.2200/S00196ED1V01Y200906AIM006
González, L. A., Bishop-Hurley, G. J., Handcock, R. N., & Crossman, C. (2015). Behavioral
classification of data from collars containing motion sensors in grazing cattle. Computers
and Electronics in Agriculture, 110, 91–102. https://doi.org/10.1016/j.compag.2014.10.018
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information
Processing Systems, 3(January), 2672–2680. https://doi.org/10.3156/jsoft.29.5_177_2
Gutierrez-Galan, D., Dominguez-Morales, J. P., Cerezuela-Escudero, E., Rios-Navarro, A.,
Tapiador-Morales, R., Rivas-Perez, M., Dominguez-Morales, M., Jimenez-Fernandez, A.,
& Linares-Barranco, A. (2018). Embedded neural network for real-time animal behavior
classification. Neurocomputing, 272, 17–26. https://doi.org/10.1016/j.neucom.2017.03.090
Hahsler, M., Grün, B., & Hornik, K. (2005). Arules - A computational environment for mining
association rules and frequent item sets. Journal of Statistical Software, 14.
https://doi.org/10.18637/jss.v014.i15
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm.
Applied Statistics, 28(1), 100. https://doi.org/10.2307/2346830
Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector
machines. IEEE Intelligent Systems and Their Applications, 13(4), 18–28.
https://doi.org/10.1109/5254.708428
Hertz, J., Krogh, A., Palmer, R. G., & Horner, H. (1991). Introduction to the Theory of Neural
Computation. Physics Today, 44(12). https://doi.org/10.1063/1.2810360
Hilbe, J. M. (2009). Logistic regression models. CRC Press.
https://www.routledge.com/Logistic-Regression-Models/Hilbe/p/book/9781138106710
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation,
9(8). https://doi.org/10.1162/neco.1997.9.8.1735
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (1989). Applied Logistic Regression, 3rd
Edition. Wiley Series in Probability and Statistics, 528.
https://www.wiley.com/en-us/Applied+Logistic+Regression%2C+3rd+Edition-p-9780470582473
Hosseininoorbin, S., Layeghy, S., Kusy, B., Jurdak, R., Bishop-Hurley, G. J., Greenwood, P.
L., & Portmann, M. (2021). Deep learning-based cattle behaviour classification using joint
time-frequency data representation. Computers and Electronics in Agriculture, 187.
https://doi.org/10.1016/j.compag.2021.106241
Hussain, A., Ali, S., Abdullah, & Kim, H. C. (2022). Activity Detection for the Wellbeing of
Dogs Using Wearable Sensors Based on Deep Learning. IEEE Access, 10.
https://doi.org/10.1109/ACCESS.2022.3174813
Iliyasu, R., & Etikan, I. (2021). Comparison of quota sampling and stratified random sampling.
Biometrics & Biostatistics International Journal, 10(1).
https://doi.org/10.15406/bbij.2021.10.00326
Johnson, A. A., Ott, M. Q., & Dogucu, M. (2022). Bayes Rules!: An Introduction to Applied
Bayesian Modeling. https://doi.org/10.1201/9780429288340
Jurafsky, D., & Martin, J. (2014). Speech and Language Processing. In Speech and Language
Processing. (Vol. 3).
Jurafsky, D., & Martin, J. H. (2009). Book Review Speech and Language Processing (second
edition). Computational Linguistics.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement Learning: A Survey.
Journal of Artificial Intelligence Research, 4, 237–285. https://doi.org/10.1613/JAIR.301
Kaler, J., Mitsch, J., Vázquez-Diosdado, J. A., Bollard, N., Dottorini, T., & Ellis, K. A. (2020).
Automated detection of lameness in sheep using machine learning approaches: novel
insights into behavioural differences among lame and non-lame sheep. Royal Society Open
Science, 7(1), 190824. https://doi.org/10.1098/rsos.190824
Kamminga, J. W., Bisby, H. C., Le, D. V., Meratnia, N., & Havinga, P. J. M. (2017). Generic
Online Animal Activity Recognition on Collar Tags. Proceedings of the 2017 ACM
International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings
of the 2017 ACM International Symposium on Wearable Computers on - UbiComp ’17,
October, 597–606. https://doi.org/10.1145/3123024.3124407
Kasnesis, P., Doulgerakis, V., Uzunidis, D., Kogias, D. G., Funcia, S. I., González, M. B.,
Giannousis, C., & Patrikakis, C. Z. (2022). Deep Learning Empowered Wearable-Based
Behavior Recognition for Search and Rescue Dogs. Sensors, 22(3).
https://doi.org/10.3390/s22030993
Kaur, J., & Madan, N. (2015). Association Rule Mining: A Survey. International Journal of
Hybrid Information Technology, 8(7). https://doi.org/10.14257/ijhit.2015.8.7.22
Kleanthous, N., Hussain, A., Khan, W., Sneddon, J., & Liatsis, P. (2022). Deep transfer learning
in sheep activity recognition using accelerometer data. Expert Systems with Applications,
207, 117925. https://doi.org/10.1016/j.eswa.2022.117925
Kleanthous, N., Hussain, A., Mason, A., & Sneddon, J. (2019). Data Science Approaches for the
Analysis of Animal Behaviours. In Lecture Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Vol. 11645
LNAI (pp. 411–422). https://doi.org/10.1007/978-3-030-26766-7_38
Kleanthous, N., Hussain, A., Mason, A., Sneddon, J., Shaw, A., Fergus, P., Chalmers, C., &
Al-Jumeily, D. (2018). Machine Learning Techniques for Classification of Livestock
Behavior. In Lecture Notes in Computer Science (including subseries Lecture Notes in
Artificial Intelligence and Lecture Notes in Bioinformatics): Vol. 11304 LNCS (pp. 304–315).
https://doi.org/10.1007/978-3-030-04212-7_26
Kohonen, T. (1989). Self-Organization and Associative Memory (3rd ed., Vol. 8). Springer
Berlin Heidelberg. https://doi.org/10.1007/978-3-642-88163-3
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
https://doi.org/10.1038/nature14539
Lehner, P. N. (1996). Handbook of ethological methods. Cambridge University Press.
Li, P., Stuart, E. A., & Allison, D. B. (2015). Multiple imputation: A flexible tool for handling
missing data. In JAMA - Journal of the American Medical Association (Vol. 314, Issue 18).
https://doi.org/10.1001/jama.2015.15281
Liakos, K. G., Busato, P., Moshou, D., Pearson, S., & Bochtis, D. (2018). Machine learning in
agriculture: A review. In Sensors (Switzerland) (Vol. 18, Issue 8, p. 2674). Multidisciplinary
Digital Publishing Institute. https://doi.org/10.3390/s18082674
Liu, Q., Zhai, J. W., Zhang, Z. Z., Zhong, S., Zhou, Q., Zhang, P., & Xu, J. (2018). A Survey on
Deep Reinforcement Learning. In Jisuanji Xuebao/Chinese Journal of Computers (Vol. 41,
Issue 1). https://doi.org/10.11897/SP.J.1016.2018.00001
Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions.
Advances in Neural Information Processing Systems, 2017-December.
machinelearningmastery. (2019). Overfitting and Underfitting With Machine Learning
Algorithms. Machinelearningmastery.
Mao, A., Huang, E., Gan, H., & Liu, K. (2022). FedAAR: A Novel Federated Learning
Framework for Animal Activity Recognition with Wearable Sensors. Animals, 12(16).
https://doi.org/10.3390/ani12162142
Marjani, M., Nasaruddin, F., Gani, A., Karim, A., Hashem, I. A. T., Siddiqa, A., & Yaqoob, I.
(2017). Big IoT Data Analytics: Architecture, Opportunities, and Open Research Challenges.
IEEE Access, 5, 5247–5261. https://doi.org/10.1109/ACCESS.2017.2689040
Neethirajan, S. (2020). The role of sensors, big data and machine learning in modern animal
farming. In Sensing and Bio-Sensing Research (Vol. 29, p. 100367). Elsevier.
https://doi.org/10.1016/j.sbsr.2020.100367
Ostertagová, E. (2012). Modelling using Polynomial Regression. Procedia Engineering, 48,
500–506. https://doi.org/10.1016/J.PROENG.2012.09.545
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. In IEEE Transactions on Knowledge
and Data Engineering (Vol. 22, Issue 10, pp. 1345–1359).
https://doi.org/10.1109/TKDE.2009.191
Paschalidis, I. C., & Chen, Y. (2010). Statistical anomaly detection with sensor networks. ACM
Transactions on Sensor Networks, 7(2). https://doi.org/10.1145/1824766.1824773
Peng, Y., Kondo, N., Fujiura, T., Suzuki, T., Wulandari, Yoshioka, H., & Itoyama, E. (2019).
Classification of multiple cattle behavior patterns using a recurrent neural network with
long short-term memory and inertial measurement units. Computers and Electronics in
Agriculture, 157. https://doi.org/10.1016/j.compag.2018.12.023
Pu, G., Wang, L., Shen, J., & Dong, F. (2021). A hybrid unsupervised clustering-based anomaly
detection method. Tsinghua Science and Technology, 26(2).
https://doi.org/10.26599/TST.2019.9010051
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
https://doi.org/10.1007/bf00116251
Rast, W., Kimmig, S. E., Giese, L., & Berger, A. (2020). Machine learning goes wild: Using
data from captive individuals to infer wildlife behaviours. PLoS ONE, 15(5).
https://doi.org/10.1371/journal.pone.0227317
Reddy, G. T., Reddy, M. P. K., Lakshmanna, K., Kaluri, R., Rajput, D. S., Srivastava, G., &
Baker, T. (2020). Analysis of Dimensionality Reduction Techniques on Big Data. IEEE
Access, 8, 54776–54788. https://doi.org/10.1109/ACCESS.2020.2980942
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the
Predictions of Any Classifier. NAACL-HLT 2016 - 2016 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies,
Proceedings of the Demonstrations Session, 97–101. https://doi.org/10.18653/v1/n16-3020
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by
back-propagating errors. Nature, 323(6088). https://doi.org/10.1038/323533a0
Safavian, S. R., & Landgrebe, D. (1991). A Survey of Decision Tree Classifier Methodology.
IEEE Transactions on Systems, Man and Cybernetics, 21(3), 660–674.
https://doi.org/10.1109/21.97458
Samariya, D., & Thakkar, A. (2023). A Comprehensive Survey of Anomaly Detection
Algorithms. In Annals of Data Science (Vol. 10, Issue 3).
https://doi.org/10.1007/s40745-021-00362-9
Santosh Kumar, M. B., & Balakrishnan, K. (2019). Development of a model recommender
system for agriculture using apriori algorithm. Advances in Intelligent Systems and
Computing, 768. https://doi.org/10.1007/978-981-13-0617-4_15
Sayed, A. H. (2023). Q-Learning. In Inference and Learning from Data.
https://doi.org/10.1017/9781009218245.022
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data
problems: A data analyst’s perspective. In Multivariate Behavioral Research (Vol. 33, Issue
4). https://doi.org/10.1207/s15327906mbr3304_5
Schlecht, E., Hülsebusch, C., Mahler, F., & Becker, K. (2004). The use of differentially corrected
global positioning system to monitor activities of cattle at pasture. Applied Animal Behaviour
Science, 85(3), 185–202. https://doi.org/10.1016/j.applanim.2003.11.003
Schwager, M., Anderson, D. M., Butler, Z., & Rus, D. (2007). Robust classification of animal
tracking data. Computers and Electronics in Agriculture, 56(1), 46–59. https://doi.
org/10.1016/j.compag.2007.01.002
Seber, G. A. F. (George A. F., & Lee, A. J. (2003). Linear regression analysis. 557. https://www.
wiley.com/en-ie/Linear+Regression+Analysis%2C+2nd+Edition-p-9780471415404
Shahbazi, M., Mohammadi, K., Derakhshani, S. M., & Groot Koerkamp, P. W. G. (2023).
Deep Learning for Laying Hen Activity Recognition Using Wearable Sensors. Agriculture
(Switzerland), 13(3). https://doi.org/10.3390/agriculture13030738
Song, Y. Y., & Lu, Y. (2015). Decision tree methods: applications for classification and
prediction. Shanghai Archives of Psychiatry, 27(2), 130. https://doi.org/10.11919/J.
ISSN.1002-0829.215044
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout:
A simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, 15.
Sutton, C. D. (2005). Classification and Regression Trees, Bagging, and Boosting. Handbook of
Statistics, 24, 303–329. https://doi.org/10.1016/S0169-7161(04)24011-1
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for
reinforcement learning with function approximation. Advances in Neural Information
Processing Systems.
Telikani, A., Gandomi, A. H., & Shahbahrami, A. (2020). A survey of evolutionary computation
for association rule mining. Information Sciences, 524. https://doi.org/10.1016/j.
ins.2020.02.073
The Nobel Prize in Physiology or Medicine 1973 - NobelPrize.org. (n.d.). Retrieved August 29,
2023, from https://www.nobelprize.org/prizes/medicine/1973/summary/
Tran, D. N., Nguyen, T. N., Khanh, P. C. P., & Tran, D. T. (2021). An IoT-based Design Using
Accelerometers in Animal Behavior Recognition Systems. IEEE Sensors Journal, 1–1.
https://doi.org/10.1109/JSEN.2021.3051194
Ungar, E. D., Henkin, Z., Gutman, M., Dolev, A., Genizi, A., & Ganskopp, D. (2005). Inference
of Animal Activity From GPS Collar Data on Free-Ranging Cattle. Rangeland Ecology &
Management, 58(3), 256–266.
https://doi.org/10.2111/1551-5028(2005)58[256:IOAAFG]2.0.CO;2
Valletta, J. J., Torney, C., Kings, M., Thornton, A., & Madden, J. (2017). Applications of
machine learning in animal behaviour studies. Animal Behaviour, 124, 203–220.
https://doi.org/10.1016/j.anbehav.2016.12.005
van Engelen, J. E., & Hoos, H. H. (2020). A survey on semi-supervised learning. Machine
Learning, 109(2). https://doi.org/10.1007/s10994-019-05855-6
Varga, B., Kulcsár, B., & Chehreghani, M. H. (2023). Deep Q-learning: A robust control
approach. International Journal of Robust and Nonlinear Control, 33(1).
https://doi.org/10.1002/rnc.6457
Walker, R. T., & Hill, H. M. (2020). Behavioral Ecology. Encyclopedia of Personality and
Individual Differences, 406–408. https://doi.org/10.1007/978-3-319-24612-3_1610
Wang, L., Arablouei, R., Alvarenga, F. A. P., & Bishop-Hurley, G. J. (2023). Classifying animal
behavior from accelerometry data via recurrent neural networks. Computers and Electronics
in Agriculture, 206. https://doi.org/10.1016/j.compag.2023.107647
Wang, Y., Yao, H., & Zhao, S. (2015). Auto-encoder based dimensionality reduction.
https://doi.org/10.1016/j.neucom.2015.08.104
Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences. Ph.D. thesis, Applied Mathematics, Harvard University.
Wu, Y., Liu, M., Peng, Z., Liu, M., Wang, M., & Peng, Y. (2022). Recognising Cattle Behaviour
with Deep Residual Bidirectional LSTM Model Using a Wearable Movement Monitoring
Collar. Agriculture (Switzerland), 12(8). https://doi.org/10.3390/agriculture12081237
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society. Series B: Statistical Methodology, 67(2).
https://doi.org/10.1111/j.1467-9868.2005.00503.x
Index
A
Accelerometer Data 113, 115, 117, 121, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 382
Accuracy 10, 11, 19, 37, 42, 43, 46, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 88, 89, 90, 91, 92, 95, 96, 97, 98, 99, 100, 103, 113, 114, 127, 167, 173, 174, 175, 190, 191, 192, 203, 207, 222, 223, 236, 239, 240, 250, 255, 258, 259, 261, 262, 265, 266, 270, 271, 274, 276, 279, 281, 287, 291, 292, 295, 299, 300, 301, 305, 318, 327, 328, 329, 330, 332, 333, 334, 345, 353, 354, 355, 356, 357, 372, 373, 374, 376, 377, 378
Activation functions 309, 310, 340
Activity recognition 68, 104, 108, 121, 131, 135, 145, 146, 148, 149, 167, 180, 210, 224, 240, 244, 303, 304, 307, 308, 364, 377, 378, 379, 380, 382, 384
AdaBoost 213, 220, 222, 258
Adam optimizer 327, 353, 374
Algorithms 23, 35, 41, 43, 76, 177, 205, 210, 222, 232, 252, 303, 309, 385, 386
Animal behavior 1, 2, 3, 4, 5, 13, 15, 16, 18, 21, 24, 25, 36, 37, 41, 42, 44, 47, 50, 54, 62, 101, 102, 103, 104, 105, 107, 111, 114, 115, 117, 131, 132, 133, 134, 136, 137, 140, 146, 147, 166, 182, 206, 241, 258, 259, 261, 379, 380, 381, 383, 388
Anomaly detection 10, 19, 386
Apriori algorithm 16
Association Rule Learning 16
AUC-ROC 271, 272, 273
B
Backpropagation 205, 207, 210, 316
Bagging 212, 387
Bayesian optimization 93, 300, 301
behavioral ecology 3
behavioral patterns 146, 259
Biased model 42
Bias-variance trade-off 31, 35
Binary classification 6, 8, 172, 202, 262, 267, 268, 269, 270, 271, 272, 273, 276, 278, 319, 320, 321, 322, 324, 326, 327
Body temperature 15
C
Classification model 173, 258, 264, 266, 267, 268, 323, 330
Classification report 78, 79, 80, 81, 82, 88, 91, 97, 98, 270, 295, 296, 356, 357
Class imbalance 63, 73, 265, 270
Clustering 13, 18, 240, 241, 243, 244, 245, 246, 247, 248, 249, 252, 255, 257, 383
Compass 49
Computational complexity 226
confusion 77, 78, 83, 100, 261, 262, 263, 264, 265, 294, 296
Confusion matrix 263, 295, 335
Constraints 45, 117, 246, 288
Continuous model updating 44
Convolutional Neural Networks (CNNs) 311
Correlation coefficient 64, 65, 138, 170, 186
Cross-Validation 94, 95, 96, 97, 175, 190, 191, 287, 289, 290, 291, 292, 294, 300, 301, 343
Poor initialization 39
Precision 79, 80, 81, 82, 85, 86, 88, 89, 91, 92, 97, 98, 102, 103, 173, 261, 266, 267, 268, 269, 270, 271, 287, 295, 296, 301, 318, 334, 356, 357, 383
Prediction 11, 30, 34, 37, 111, 127, 129, 179, 202, 211, 212, 213, 222, 225, 258, 274, 279, 316, 319, 320, 322, 332, 342, 345, 359, 372, 377, 387
Predictive models 4, 38, 167, 171, 212, 213, 279
Predictive performance 73, 99, 287
Preprocessing 41, 50, 66, 76, 77, 83, 99, 111, 115, 117, 119, 120, 132, 133, 166, 167, 171, 172, 183, 193, 194, 195, 215, 217, 218, 225, 231, 235, 290, 324, 325, 330, 347, 357, 365, 368, 369, 376, 380
PyTorch 307, 308, 321, 322, 323, 324, 325, 326, 330, 331, 334, 335, 347, 348, 349, 350, 364, 369, 370, 371, 374, 376, 379
Q
Q-learning 23, 24, 387
R
Random forests 9, 41, 173, 179, 198, 200, 291
Real-time 8, 11, 102, 110, 338, 378, 381, 383
Real-time monitoring 102
Recall 79, 80, 81, 82, 83, 85, 86, 88, 89, 91, 92, 97, 98, 173, 261, 266, 267, 268, 269, 270, 271, 287, 295, 296, 301, 334, 356, 357
Recurrent Neural Networks (RNNs) 307
Regression models 10, 175, 237, 279, 284, 285, 287, 383
Regularization 12, 20, 33, 34, 39, 178, 179, 193, 194, 196, 197, 212, 222, 223, 226, 232, 233, 234, 344, 345, 346, 351, 352, 353
Reinforcement Learning 5, 21, 23, 384, 385
Resource 45, 47, 115, 180, 240, 287, 298, 377
Root Mean Squared Error (RMSE) 261, 283
R-squared (R²) 233, 234, 235, 236, 237
S
Sample 27, 69, 71, 93, 112, 117, 131, 148, 169, 211, 242, 256, 276, 281, 288, 297, 298, 317, 322
Scalability 1, 36, 37
Scalability 36, 170, 299
Scarcity 40
Scikit-learn 77, 83, 183, 187, 191, 218, 221, 223, 226, 228, 231, 232, 237, 246, 266, 267, 270, 291, 334
Scipy 70, 121, 122, 126, 153, 154, 245, 297, 368
Scratch-biting 48
Seaborn 62, 65, 67
Segmentation 112
Semi-Supervised Learning 5
Sensor data 8, 48, 49, 60, 68, 70, 72, 109, 114, 307, 358, 364, 370, 377, 378, 379, 380, 381
Shaking 48, 73, 85, 86, 88, 89, 91, 97, 99, 105, 181, 364, 367
SHAP (SHapley Additive exPlanations) 37
Signal 3, 45, 68, 71, 72, 107, 108, 113, 121, 122, 123, 124, 125, 126, 131, 132, 133, 136, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 155, 157, 158, 159, 162, 163, 166, 171, 180, 205, 377
Signal processing 45, 123, 143, 147, 171
Sliding window 70, 126, 165
Social dynamics 2
Softmax 313, 314
Sparse 40, 344, 345, 353
Sparse data 40
Sparsity 178, 233, 344, 345
Standardization 76, 77, 83, 84, 118, 119, 197, 324
StandardScaler 76, 77, 78, 79, 80, 90, 91, 193, 194, 195, 196, 197, 218, 219, 221, 325, 347, 348, 369, 370
Statistics 53, 54, 56, 153, 154
Supervised learning 6, 9, 12, 19, 20, 21, 112, 205, 206, 383, 387
Support Vector Machines (SVM) 9, 173, 211
Synthetic data 40, 241, 256
T
Techniques 5, 12, 14, 16, 19, 20, 25, 34, 41, 43, 47, 73, 76, 86, 89, 92, 93, 101, 114, 115, 117, 119, 120, 121, 132, 133, 166, 167, 172, 173, 178, 179, 193, 203, 205, 212, 214, 218, 224, 232, 259, 260, 287, 301, 305, 306, 307, 308, 337, 342, 343, 360, 379, 380
Tensor 306
Thresholding 151
Time-series analysis 164, 335
tqdm 290
Transfer Learning 40, 307, 377
Trotting 19, 48, 73, 85, 88, 91, 97, 99, 151, 180, 367
U
Underfitting 27, 28, 29, 32, 35, 40, 46
Unsupervised Learning 5, 12, 13, 18, 205, 237
V
Validation set 84, 88, 93, 97, 288, 345, 346, 353
Variance 14, 15, 27, 28, 30, 31, 32, 33, 35, 89, 90, 91, 128, 129, 132, 135, 160, 162, 168, 169, 170, 172, 180, 186, 189, 202, 203, 215, 241, 245, 246, 251, 285, 287, 289, 317
Vocalizations 15, 16, 101
W
Walking 42, 48, 73, 85, 88, 91, 97, 99, 112, 116, 117, 124, 131, 139, 146, 154, 180, 258, 259, 264, 265, 271, 295, 334, 348, 356, 364, 366, 367
Weight decay 34
Wildlife 36, 108, 386
Workflow 27, 44, 45, 46, 181, 203, 262
X
XGBoost 179, 201, 202, 220, 221, 222, 290, 291, 292, 293, 294, 295, 296