Gradient Boosting Machine and SHAP for Biogas Production

The document discusses using gradient boosting machine learning to identify correlations between key variables in biogas production. Gradient boosting is well suited for this task, as it has been successful in many machine learning competitions and is often used by winning teams. The approach can optimize biogas production processes and improve efficiency by uncovering relationships between production parameters such as temperature, pH, volatile fatty acids, and microorganisms, and the amount of biogas produced.

Uncovering Correlations Among

Key Variables in Biogas Production


Approaches and insights aided by gradient boosting machine

Thiago Ribas Bella, PhD student


Advisor: Marcelo Falsarella Carazzolle
Why Machine Learning?

Identifying correlations between variables in biogas production can optimize the process and improve biogas production efficiency.

[Figure: diagram relating the input variables to the output level]
Why Gradient Boosting?

Competitions:

"Among the 29 challenge winning solutions, 17 solutions used XGBoost."

"…the go-to method and often the key component in winning solutions for a range of problems in machine learning competitions."

"[…] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10."
- CHEN, T.; GUESTRIN, C. XGBoost: A Scalable Tree Boosting System. KDD, 2016.
- https://machinelearningmastery.com/extreme-gradient-boosting-ensemble-in-python/
- GRINSZTAJN, L.; OYALLON, E.; VAROQUAUX, G. Why do tree-based models still outperform deep learning on typical tabular data? 36th Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, Oct. 16, 2022.
Artificial Intelligence
Some concepts

● Types of learning
● The two basic tasks of supervised learning
● Decision trees
Types of learning algorithms

● Supervised learning
○ Function → maps input to output
○ Labeled data
○ In this work: input = process parameters (temp., pH, VFA, microorganisms); output = biogas production

● Unsupervised learning
○ Unlabeled data
○ No previous knowledge of the data
○ Mainly used in pattern detection

● Reinforcement learning
Some concepts

● Types of learning
● The two basic tasks of supervised learning
● Decision trees
The 2 basic tasks of supervised learning

Regression: predict a continuous value based on the input variables; the output can take infinitely many values within an interval (0 ← 0.001, 0.15, …, 0.871, 0.999 → 1).
Examples:
- Temperature prediction (ºC)
- Rain amount prediction (mm/h)
- Blood pressure prediction (mmHg)

Classification: predict a discrete value (labels or categories) based on the input variables; the output can only take a finite set of labels (0, 1, 2, …).
Examples:
- Plant species prediction (A, B or C)
- Will it rain? (yes/no)
- Blood pressure level (low/high)
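The contrast between the two tasks can be made concrete with a toy sketch in Python. The rain "model" below and its threshold are made-up illustrations, not fitted models:

```python
# The same inputs can feed a regression or a classification task,
# depending on the target we define. Toy rule, purely illustrative.

def predict_rain_mm(temp_c, humidity):
    """Regression: return a continuous value (rain in mm/h)."""
    return max(0.0, 0.1 * humidity - 0.05 * temp_c)

def predict_will_rain(temp_c, humidity, threshold_mm=1.0):
    """Classification: turn the continuous prediction into a
    discrete label ('yes'/'no') by thresholding."""
    return "yes" if predict_rain_mm(temp_c, humidity) >= threshold_mm else "no"

print(predict_rain_mm(25, 80))   # a continuous value
print(predict_will_rain(25, 80)) # a discrete label
```

Note how the classifier is just the regressor plus a decision rule; the difference between the tasks lies in the target, not the inputs.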
Some concepts

● Types of learning
● The two basic tasks of supervised learning
● Decision trees
Example to get the intuition

Decision Tree - recursive binary splitting

[Figure: a decision tree trained for classification, built from a table of observations with input variables and a target (output) variable]
Decision Tree - Regression

[Figure: a regression tree; 100% of the observations enter at the root, and each YES/NO split routes a fraction of them (e.g. 52.8%, 4.2%, 2.5%) down a branch]
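The split search behind such trees can be sketched in plain Python. This is a minimal illustration assuming a single input feature and a single split (a depth-1 tree); the data and numbers are made up:

```python
# "Recursive binary splitting", one step: try every split point on one
# feature and keep the one that minimizes the squared error of
# predicting each side's mean.

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing total SSE."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must leave observations on both sides
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

# Toy example: biogas output (made-up numbers) vs. temperature
xs = [25, 26, 27, 34, 35, 36]
ys = [1.0, 1.1, 0.9, 2.0, 2.1, 1.9]
t, lm, rm = best_split(xs, ys)
print(t, lm, rm)  # the split separating the low- and high-output groups
```

A full tree repeats this search recursively on each side of the chosen split until a stopping rule is met.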
Some concepts

● Types of learning
● The two basic tasks of supervised learning
● Decision trees
: A gradient boosting algorithm

A machine learning algorithm based on decision trees.

Decision Trees + Bagging → Random Forest
Decision Trees + Boosting → Gradient Boosting
(both are forms of Ensemble Learning*)

*"Comitê de Máquinas" in Portuguese.
Ensemble learning: United we stand, divided we fall

Bagging: bootstrap the data, train a tree on each sample, and aggregate their votes to make a decision; many such trees → random forest.

Boosting: each tree uses knowledge from the previous tree.
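The boosting loop can be sketched in a few lines of Python. This is an illustrative toy, not the XGBoost algorithm: the "weak learner" here is just the mean of the current residuals, and the data, learning rate, and round count are made up:

```python
# The boosting idea: each new (very weak) model is fit to the residuals
# of the ensemble so far, so every round "uses knowledge from the
# previous" one. The weak learner here is a single constant (the mean
# residual), which is enough to show the loop.

def boost(ys, n_rounds=50, learning_rate=0.5):
    pred = [0.0] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        step = sum(residuals) / len(residuals)      # weak learner's fit
        pred = [p + learning_rate * step for p in pred]
    return pred

ys = [1.0, 2.0, 3.0]
print(boost(ys))  # predictions converge toward the mean, 2.0
```

Real gradient boosting replaces the constant with a small decision tree at every round, so different inputs get different corrections, but the fit-the-residuals loop is the same.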
SHAP: SHapley Additive exPlanations

SHAP is a game-theoretic approach to explain the output of any machine learning model.
SHAP: How does it work?

How much does a specific feature value contribute to the output?

"... compute the difference between the model's output with and without each feature…"

[Figure: the model is evaluated repeatedly, hiding one feature, then two, then three, and so on for every subset of features]
SHAP: How does it work?

Assume Ann, Bob, and Cindy together were hammering a wood log, 38 meters, to the ground. How much did each contribute? Take, for each person, the average of their contribution over all the permutations (orders) in which the three could have hammered.
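This permutation averaging can be computed directly. In the sketch below, only the 38 m total comes from the slide; the per-coalition values (how far each subgroup would have driven the log on its own) are made-up numbers for illustration:

```python
# Shapley values by brute force: each player's credit is their marginal
# contribution, averaged over every order in which the group could have
# formed. v(coalition) = meters of log hammered by that subgroup
# (hypothetical numbers; only v(ABC) = 38 is from the slide).
from itertools import permutations

v = {
    frozenset(): 0, frozenset("A"): 10, frozenset("B"): 12, frozenset("C"): 6,
    frozenset("AB"): 26, frozenset("AC"): 18, frozenset("BC"): 20,
    frozenset("ABC"): 38,
}

def shapley(players):
    credit = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        seen = frozenset()
        for p in order:
            # marginal contribution of p, given who already arrived
            credit[p] += v[seen | {p}] - v[seen]
            seen = seen | {p}
    return {p: c / len(orders) for p, c in credit.items()}

print(shapley("ABC"))  # the three credits sum to v(ABC) = 38
```

For a model, the "players" are features and v(S) is the model's output using only the features in S; the credits always add up to the full prediction, which is the "Additive" part of the name.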
SHAP: SHapley Additive exPlanations

How much does a specific feature value contribute to the output? The Shapley value of feature i averages, over the subsets S of the other features, the model's output with and without that specific feature:

φᵢ = Σ_{S ⊆ F∖{i}} [ |S|! (M − |S| − 1)! / M! ] · [ f(S ∪ {i}) − f(S) ]

where M is the total number of features and f(S) is the model's output using only the features in S. Example of just one subset → Age and BMI.
Correlation vs causation → Attention!

Making correlations transparent does not make them causal.

[Figure: the classic spurious correlation between divorce rate and margarine consumed]

Real causation vs. model correlation: the relationship predicted by the model (XGBoost) can differ from the real relationship.*

*There are unmeasured variables affecting the measured ones and the output.

- Eleanor Dillon (Microsoft), Jacob LaRiviere (Amazon), Scott Lundberg (Microsoft), Jonathan Roth (Microsoft), Vasilis Syrgkanis (Stanford University)

Separate the wheat from the tares: specialist knowledge is what distinguishes the real relationship (wheat) from the merely predicted one (tares).
Uncovering Correlations Among
Key Variables in Biogas Production
Approaches and insights aided by gradient boosting machine
Approach

Literature data → Pre-process → Machine learning → Interpretability
Approach (2022)

Literature data

● Operational data: 164 points over 567 days → 1 point every 7 days
● 3,000 m³ full-scale anaerobic digesters at a wastewater treatment plant
● Sequencing data (16S), from an Ion Torrent PGM sequencer
● Only 14 samples out of the 164 data points → 1 point every 70 days!!
Approach

Reanalysis / Pre-process: 16S pipeline with QIIME2

[Figure: taxonomy profiles for Reactor A and Reactor B, with the operating temperatures (ºC) per sample: 35, 35, 35, 35, 35, 32, 35, 35, 35, 30, 25, 25, 25, 35]
Approach

Pre-process: create 3 datasets from the METADATA (no taxonomy):
● Time independent
● Time series - no lags
● Time series - lags

Machine Learning: one model per dataset:
● Time independent
● Time series - no lags
● Time series - lags
Splitting - Time independent, ~(70/30): shuffled data.

Splitting - Time dependent (lags and no lags), ~(70/30): no shuffle; Reactor A and Reactor B are split in chronological order.
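The two splitting strategies can be sketched in Python. The 70/30 ratio is from the slides; the data and function names are illustrative:

```python
# ~70/30 splits: shuffled for the time-independent dataset; strictly
# chronological (no shuffle) for the time series, so the test set is
# always "the future" relative to the training set.
import random

def split_time_independent(rows, train_frac=0.7, seed=0):
    rows = rows[:]
    random.Random(seed).shuffle(rows)   # order carries no information
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def split_time_dependent(rows, train_frac=0.7):
    cut = int(len(rows) * train_frac)   # keep chronological order
    return rows[:cut], rows[cut:]

days = list(range(10))
train, test = split_time_dependent(days)
print(train, test)  # earliest 7 points train, latest 3 test
```

Shuffling a time series before splitting would leak future information into training, which is why the time-dependent datasets are split without shuffling.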
Error Metrics

Root Mean Squared Error (RMSE):
RMSE = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² )

Mean Absolute Percentage Error (MAPE):
MAPE = (100/n) Σᵢ |yᵢ − ŷᵢ| / |yᵢ|

where n is the sample size, yᵢ the real value, and ŷᵢ the predicted value.
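The two metrics in plain Python (the function names are ours; the definitions are the standard ones):

```python
# n = sample size, y = real values, yhat = predicted values.
import math

def rmse(y, yhat):
    """Root Mean Squared Error: sqrt of the mean squared error."""
    n = len(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / n)

def mape(y, yhat):
    """Mean Absolute Percentage Error, in % (assumes no real value is 0)."""
    n = len(y)
    return 100.0 / n * sum(abs(a - b) / abs(a) for a, b in zip(y, yhat))

print(rmse([100, 200], [110, 190]))  # 10.0
print(mape([100, 200], [110, 190]))  # 7.5
```

RMSE is in the units of the target and penalizes large errors more heavily; MAPE is scale-free, which makes it convenient for comparing models across reactors.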
Results - Time independent

[Figure: predicted vs. real values on the train and test sets]

Results - Time dependent (no lags)

[Figure: predicted vs. real values per time step, Reactor A and Reactor B]

Results - Time dependent (with lags)

[Figure: predicted vs. real values per time step, Reactor A and Reactor B]
Summary

● The model was 6.9% better than a naive prediction.

● Ranking of the most relevant variables:
1. Ammonia
2. Volatile fatty acids
3. Temperature
followed by chemical oxygen demand, total alkalinity, total solids, volatile solids, and pH.

● Variables and units: Temperature (ºC), pH, Ammonia (mg/L), Volatile fatty acids (mg/L), Total alkalinity (g/L), Total solids (g/L), Volatile solids (g/L), Chemical oxygen demand (g/L).
Model with taxonomy

● Data with taxonomy: 14 data points
● 3 different models, one per taxonomic rank: Order, Family, Genus

[Figures: results for each of the three taxonomy models]
Model with taxonomy: Genus

Most relevant features: Chemical oxygen demand (g/L), Family Prolixibacteraceae (%), Family Spirochaetaceae (%), Genus Blvii28 wastewater-sludge-group (%), Genus Gracilibacter (%).
Model with taxonomy: Genus

Prolixibacteraceae:
- ability to break down complex carbohydrates

Spirochaetaceae:
- production of propionic and butyric acid
- establishes a syntrophic relationship with hydrogenotrophic methanogens

Gracilibacter:
- involved in different fermentative pathways
- involved in the acidogenesis stage of biogas production
- its specific function in the process is not well understood

Blvii28 wastewater-sludge group:
- limited information is available about this genus
Conclusions

● The time-independent model presented the lowest error
● Higher taxonomic resolution presented lower errors (total DNA)
● Most relevant variables:
○ 1 - Ammonia
○ 2 - Volatile fatty acids
○ 3 - Temperature
● Most relevant microbes:
○ Family: Prolixibacteraceae
○ Family: Spirochaetaceae
○ Genus: Blvii28 wastewater-sludge-group
○ Genus: Gracilibacter
Attention points / Next steps

● The larger the dataset, the better
● Higher resolution of VFA measurements
● Include viruses and fungi