Gradient Boosting Machine and SHAP for Biogas Production

The document discusses using gradient boosting machine learning to identify correlations between key variables in biogas production. Gradient boosting is well suited for this task, as it has been successful in many machine learning competitions and is often used by winning teams. The approach can optimize biogas production processes and improve efficiency by uncovering relationships between production parameters such as temperature, pH, volatile fatty acids, and microorganisms, and the amount of biogas produced.

Uncovering Correlations Among

Key Variables in Biogas Production


Approaches and insights aided by gradient boosting machine

Thiago Ribas Bella, PhD student


Advisor: Marcelo Falsarella Carazzolle
Why Machine Learning?

Identifying correlations between variables in biogas production can optimize the process and improve biogas production efficiency.

[Figure: diagram relating the input variables to the output level]
Why Gradient Boosting?

Competitions:

"Among the 29 challenge winning solutions, 17 solutions used XGBoost."

"…the go-to method and often the key component in winning solutions for a range of problems in machine learning competitions."

"[…] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10."
- CHEN, T.; GUESTRIN, C. XGBoost: A Scalable Tree Boosting System. KDD, 2016.
- https://machinelearningmastery.com/extreme-gradient-boosting-ensemble-in-python/
- GRINSZTAJN, L.; OYALLON, E.; VAROQUAUX, G. Why do tree-based models still outperform deep learning on typical tabular data? 36th Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, Oct. 16, 2022.
Artificial Intelligence
Some concepts

● Types of learning
● The two basic tasks of supervised learning
● Decision trees
Types of learning algorithms

● Supervised learning
○ Function → maps input to output
○ Labeled data
○ In this work: input = process parameters (temp., pH, VFA, microorganisms); output = biogas production

● Unsupervised learning
○ Unlabeled data
○ No previous knowledge of the data
○ Mainly used in pattern detection

● Reinforcement learning
Some concepts

● Types of learning
● The two basic tasks of supervised learning
● Decision trees
The 2 basic tasks of supervised learning

Regression: predict a continuous value based on the input variables; the output can take infinitely many values within an interval (0 ← 0.001, 0.15, …, 0.871, 0.999 → 1).
Examples:
- Temperature prediction (ºC)
- Rain amount prediction (mm/h)
- Blood pressure prediction (mmHg)

Classification: predict a discrete value (labels or categories) based on the input variables; the output can only take a finite set of labels (0, 1, 2, …).
Examples:
- Plant species prediction (A, B or C)
- Will it rain? (yes/no)
- Blood pressure level (low/high)
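The contrast between the two tasks can be made concrete with a toy sketch in Python. The rain "model" below and its threshold are made-up illustrations, not fitted models:

```python
# The same inputs can feed a regression or a classification task,
# depending on the target we define. Toy rule, purely illustrative.

def predict_rain_mm(temp_c, humidity):
    """Regression: return a continuous value (rain in mm/h)."""
    return max(0.0, 0.1 * humidity - 0.05 * temp_c)

def predict_will_rain(temp_c, humidity, threshold_mm=1.0):
    """Classification: turn the continuous prediction into a
    discrete label ('yes'/'no') by thresholding."""
    return "yes" if predict_rain_mm(temp_c, humidity) >= threshold_mm else "no"

print(predict_rain_mm(25, 80))   # a continuous value
print(predict_will_rain(25, 80)) # a discrete label
```

Note how the classifier is just the regressor plus a decision rule; the difference between the tasks lies in the target, not the inputs.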
Some concepts

● Types of learning
● The two basic tasks of supervised learning
● Decision trees
Example to get the intuition

Decision Tree - recursive binary splitting

[Figure: a decision tree trained for classification, built from a table of observations with input variables and a target (output) variable]
Decision Tree - Regression

[Figure: a regression tree; 100% of the observations enter at the root, and each YES/NO split routes a fraction of them (e.g. 52.8%, 4.2%, 2.5%) down a branch]
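The split search behind such trees can be sketched in plain Python. This is a minimal illustration assuming a single input feature and a single split (a depth-1 tree); the data and numbers are made up:

```python
# "Recursive binary splitting", one step: try every split point on one
# feature and keep the one that minimizes the squared error of
# predicting each side's mean.

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing total SSE."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must leave observations on both sides
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

# Toy example: biogas output (made-up numbers) vs. temperature
xs = [25, 26, 27, 34, 35, 36]
ys = [1.0, 1.1, 0.9, 2.0, 2.1, 1.9]
t, lm, rm = best_split(xs, ys)
print(t, lm, rm)  # the split separating the low- and high-output groups
```

A full tree repeats this search recursively on each side of the chosen split until a stopping rule is met.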
Some concepts

● Types of learning
● The two basic tasks of supervised learning
● Decision trees
: A gradient boosting algorithm

A machine learning algorithm based on decision trees.

Decision Trees + Bagging → Random Forest
Decision Trees + Boosting → Gradient Boosting
(both are forms of Ensemble Learning*)

*"Comitê de Máquinas" in Portuguese.
Ensemble learning: United we stand, divided we fall

Bagging: bootstrap the data, train a tree on each sample, and aggregate their votes to make a decision; many such trees → random forest.

Boosting: each tree uses knowledge from the previous tree.
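The boosting loop can be sketched in a few lines of Python. This is an illustrative toy, not the XGBoost algorithm: the "weak learner" here is just the mean of the current residuals, and the data, learning rate, and round count are made up:

```python
# The boosting idea: each new (very weak) model is fit to the residuals
# of the ensemble so far, so every round "uses knowledge from the
# previous" one. The weak learner here is a single constant (the mean
# residual), which is enough to show the loop.

def boost(ys, n_rounds=50, learning_rate=0.5):
    pred = [0.0] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        step = sum(residuals) / len(residuals)      # weak learner's fit
        pred = [p + learning_rate * step for p in pred]
    return pred

ys = [1.0, 2.0, 3.0]
print(boost(ys))  # predictions converge toward the mean, 2.0
```

Real gradient boosting replaces the constant with a small decision tree at every round, so different inputs get different corrections, but the fit-the-residuals loop is the same.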
SHAP: SHapley Additive exPlanations

SHAP is a game-theoretic approach to explain the output of any machine learning model.
SHAP: How does it work?

How much does a specific feature value contribute to the output?

"... compute the difference between the model's output with and without each feature…"

[Figure: the model is evaluated repeatedly, hiding one feature, then two, then three, and so on for every subset of features]
SHAP: How does it work?

Assume Ann, Bob, and Cindy together were hammering a wood log, 38 meters, to the ground. How much did each contribute? Take, for each person, the average of their contribution over all the permutations (orders) in which the three could have hammered.
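This permutation averaging can be computed directly. In the sketch below, only the 38 m total comes from the slide; the per-coalition values (how far each subgroup would have driven the log on its own) are made-up numbers for illustration:

```python
# Shapley values by brute force: each player's credit is their marginal
# contribution, averaged over every order in which the group could have
# formed. v(coalition) = meters of log hammered by that subgroup
# (hypothetical numbers; only v(ABC) = 38 is from the slide).
from itertools import permutations

v = {
    frozenset(): 0, frozenset("A"): 10, frozenset("B"): 12, frozenset("C"): 6,
    frozenset("AB"): 26, frozenset("AC"): 18, frozenset("BC"): 20,
    frozenset("ABC"): 38,
}

def shapley(players):
    credit = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        seen = frozenset()
        for p in order:
            # marginal contribution of p, given who already arrived
            credit[p] += v[seen | {p}] - v[seen]
            seen = seen | {p}
    return {p: c / len(orders) for p, c in credit.items()}

print(shapley("ABC"))  # the three credits sum to v(ABC) = 38
```

For a model, the "players" are features and v(S) is the model's output using only the features in S; the credits always add up to the full prediction, which is the "Additive" part of the name.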
SHAP: SHapley Additive exPlanations

How much does a specific feature value contribute to the output? The Shapley value of feature i averages, over the subsets S of the other features, the model's output with and without that specific feature:

φᵢ = Σ_{S ⊆ F∖{i}} [ |S|! (M − |S| − 1)! / M! ] · [ f(S ∪ {i}) − f(S) ]

where M is the total number of features and f(S) is the model's output using only the features in S. Example of just one subset → Age and BMI.
Correlation vs causation → Attention!

Making correlations transparent does not make them causal.

[Figure: the classic spurious correlation between divorce rate and margarine consumed]

Real causation vs. model correlation: the relationship predicted by the model (XGBoost) can differ from the real relationship.*

*There are unmeasured variables affecting the measured ones and the output.

- Eleanor Dillon (Microsoft), Jacob LaRiviere (Amazon), Scott Lundberg (Microsoft), Jonathan Roth (Microsoft), Vasilis Syrgkanis (Stanford University)

Separate the wheat from the tares: specialist knowledge is what distinguishes the real relationship (wheat) from the merely predicted one (tares).
Uncovering Correlations Among
Key Variables in Biogas Production
Approaches and insights aided by gradient boosting machine
Approach

Literature data → Pre-process → Machine learning → Interpretability
Approach (2022)

Literature data

● Operational data: 164 points over 567 days → 1 point every 7 days
● 3,000 m³ full-scale anaerobic digesters at a wastewater treatment plant
● Sequencing data (16S), from an Ion Torrent PGM sequencer
● Only 14 samples out of the 164 data points → 1 point every 70 days!!
Approach

Reanalysis / Pre-process: 16S pipeline with QIIME2

[Figure: taxonomy profiles for Reactor A and Reactor B, with the operating temperatures (ºC) per sample: 35, 35, 35, 35, 35, 32, 35, 35, 35, 30, 25, 25, 25, 35]
Approach

Pre-process: create 3 datasets from the METADATA (no taxonomy):
● Time independent
● Time series - no lags
● Time series - lags

Machine Learning: one model per dataset:
● Time independent
● Time series - no lags
● Time series - lags
Splitting - Time independent, ~(70/30): shuffled data.

Splitting - Time dependent (lags and no lags), ~(70/30): no shuffle; Reactor A and Reactor B are split in chronological order.
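The two splitting strategies can be sketched in Python. The 70/30 ratio is from the slides; the data and function names are illustrative:

```python
# ~70/30 splits: shuffled for the time-independent dataset; strictly
# chronological (no shuffle) for the time series, so the test set is
# always "the future" relative to the training set.
import random

def split_time_independent(rows, train_frac=0.7, seed=0):
    rows = rows[:]
    random.Random(seed).shuffle(rows)   # order carries no information
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def split_time_dependent(rows, train_frac=0.7):
    cut = int(len(rows) * train_frac)   # keep chronological order
    return rows[:cut], rows[cut:]

days = list(range(10))
train, test = split_time_dependent(days)
print(train, test)  # earliest 7 points train, latest 3 test
```

Shuffling a time series before splitting would leak future information into training, which is why the time-dependent datasets are split without shuffling.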
Error Metrics

Root Mean Squared Error (RMSE):
RMSE = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² )

Mean Absolute Percentage Error (MAPE):
MAPE = (100/n) Σᵢ |yᵢ − ŷᵢ| / |yᵢ|

where n is the sample size, yᵢ the real value, and ŷᵢ the predicted value.
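The two metrics in plain Python (the function names are ours; the definitions are the standard ones):

```python
# n = sample size, y = real values, yhat = predicted values.
import math

def rmse(y, yhat):
    """Root Mean Squared Error: sqrt of the mean squared error."""
    n = len(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / n)

def mape(y, yhat):
    """Mean Absolute Percentage Error, in % (assumes no real value is 0)."""
    n = len(y)
    return 100.0 / n * sum(abs(a - b) / abs(a) for a, b in zip(y, yhat))

print(rmse([100, 200], [110, 190]))  # 10.0
print(mape([100, 200], [110, 190]))  # 7.5
```

RMSE is in the units of the target and penalizes large errors more heavily; MAPE is scale-free, which makes it convenient for comparing models across reactors.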
Results - Time independent

[Figure: predicted vs. real values on the train and test sets]

Results - Time dependent (no lags)

[Figure: predicted vs. real values per time step, Reactor A and Reactor B]

Results - Time dependent (with lags)

[Figure: predicted vs. real values per time step, Reactor A and Reactor B]
Summary

● The model was 6.9% better than a naive prediction.

● Ranking of the most relevant variables:
1. Ammonia
2. Volatile fatty acids
3. Temperature
followed by chemical oxygen demand, total alkalinity, total solids, volatile solids, and pH.

● Variables and units: Temperature (ºC), pH, Ammonia (mg/L), Volatile fatty acids (mg/L), Total alkalinity (g/L), Total solids (g/L), Volatile solids (g/L), Chemical oxygen demand (g/L).
Model with taxonomy

● Data with taxonomy: 14 data points
● 3 different models, one per taxonomic rank: Order, Family, Genus

[Figures: results for each of the three taxonomy models]
Model with taxonomy: Genus

Most relevant features: Chemical oxygen demand (g/L), Family Prolixibacteraceae (%), Family Spirochaetaceae (%), Genus Blvii28 wastewater-sludge-group (%), Genus Gracilibacter (%).
Model with taxonomy: Genus

Prolixibacteraceae:
- ability to break down complex carbohydrates

Spirochaetaceae:
- production of propionic and butyric acid
- establishes a syntrophic relationship with hydrogenotrophic methanogens

Gracilibacter:
- involved in different fermentative pathways
- involved in the acidogenesis stage of biogas production
- its specific function in the process is not well understood

Blvii28 wastewater-sludge group:
- limited information is available about this genus
Conclusions

● The time-independent model presented the lowest error
● Higher taxonomic resolution presented lower errors (total DNA)
● Most relevant variables:
○ 1 - Ammonia
○ 2 - Volatile fatty acids
○ 3 - Temperature
● Most relevant microbes:
○ Family: Prolixibacteraceae
○ Family: Spirochaetaceae
○ Genus: Blvii28 wastewater-sludge-group
○ Genus: Gracilibacter
Attention points / Next steps

● The larger the dataset, the better
● Higher resolution of VFA measurements
● Include viruses and fungi