Gradient Boosting Machine and SHAP For Biogas Production
Why Gradient Boosting? Competitions:
Among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost.
● Types of learning
● The two basic tasks of supervised learning
● Decision trees
Types of learning algorithms
● Supervised learning
○ Learns a function that maps inputs to outputs
○ Labeled data
● Unsupervised learning
○ Unlabeled data
○ No previous knowledge of the data
○ Mainly used in pattern detection
● Reinforcement learning
The 2 basic tasks of supervised learning
● Regression
○ Predicts a continuous value based on the input variables
○ Infinitely many values within a given interval
○ Example: 0 ← 0.001, 0.15, …, 0.871, 0.999 → 1
● Classification
○ Predicts a discrete value (labels or categories) based on the input variables
○ Can only take on whole numbers
○ Example: 0, 1, 2, …
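To make the contrast between the two tasks concrete, here is a minimal sketch in which the same inputs feed a regression predictor (continuous output) and a classification predictor (discrete label). The k-nearest-neighbour rule is used only because it is short; it is not part of the talk, and the temperature, yield, and label values are hypothetical.

```python
from collections import Counter

def k_nearest(train, x, k=3):
    """Return the targets of the k training rows whose input is closest to x."""
    ranked = sorted(train, key=lambda row: abs(row[0] - x))
    return [target for _, target in ranked[:k]]

def predict_regression(train, x, k=3):
    """Regression: average the neighbours' continuous targets."""
    neighbours = k_nearest(train, x, k)
    return sum(neighbours) / len(neighbours)

def predict_classification(train, x, k=3):
    """Classification: majority vote over the neighbours' labels."""
    neighbours = k_nearest(train, x, k)
    return Counter(neighbours).most_common(1)[0][0]

# Hypothetical data: (temperature in ºC, target)
yield_data = [(25, 0.30), (28, 0.42), (30, 0.55), (33, 0.71), (35, 0.80)]
label_data = [(25, 0), (28, 0), (30, 1), (33, 1), (35, 1)]  # 0 = low, 1 = high

reg_pred = predict_regression(yield_data, 31)      # a value on a continuous scale
cls_pred = predict_classification(label_data, 31)  # one of the labels 0 or 1
print(reg_pred, cls_pred)
```

The regression output can be any value inside the range of the training targets, while the classification output is always one of the labels seen in training.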
Example to get the intuition
[Figure: decision tree with YES/NO branches at each split; values shown: 52.8%, 4.2%, 2.5%]
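The YES/NO structure of a decision tree can be sketched with a single split. The code below is an illustrative assumption, not the example from the slide: it searches for the threshold on one variable that minimises the squared error of a one-split regression tree, then predicts with the resulting YES/NO rule. The pH and methane values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Try a threshold between each pair of neighbouring x values and
    keep the one with the lowest total squared error."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, threshold, mean(left), mean(right))
    return best[1:]

# Hypothetical data: pH and methane content (%)
ph = [6.2, 6.5, 6.8, 7.1, 7.4, 7.6]
methane = [40.0, 42.0, 41.0, 58.0, 60.0, 59.0]

threshold, left_value, right_value = best_split(ph, methane)

def predict(x):
    # YES branch: x <= threshold; NO branch: x > threshold
    return left_value if x <= threshold else right_value

print(threshold, predict(6.0), predict(7.5))
```

A full tree repeats the same search inside each branch until a stopping rule is met.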
Some concepts
: A gradient boosting algorithm
Ensemble learning (*committee of machines)
● Decision trees + bagging → Random Forest
● Decision trees + boosting → Gradient Boosting
Ensemble learning: United we stand, divided we fall
● Bagging: bootstrapping the data, then using the aggregate to make a decision
● Boosting: each tree uses knowledge from the previous tree
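The difference between the two strategies can be sketched with one-split regression trees (stumps). This is a simplified illustration under stated assumptions, not XGBoost or any library implementation: `bagging` averages stumps fitted on bootstrap samples, while `boosting` fits each new stump to the residuals left by the previous ones (the negative gradient of the squared error). The function names, the learning rate, and the toy data are all hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree; return a function x -> prediction."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x identical: fall back to the mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagging(xs, ys, n_trees=25, seed=0):
    """Independent trees on bootstrap samples; the prediction is their average."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

def boosting(xs, ys, n_trees=25, learning_rate=0.5):
    """Sequential trees; each one is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 2.9, 3.1, 5.0, 5.2, 7.1, 6.9]

single = fit_stump(xs, ys)
bagged = bagging(xs, ys)
boosted = boosting(xs, ys)
print(mse(single, xs, ys), mse(bagged, xs, ys), mse(boosted, xs, ys))
```

In bagging the trees never see each other's errors; in boosting the order matters, because every tree corrects what the earlier trees left unexplained.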
SHAP: SHapley Additive exPlanations
SHapley Additive exPlanations is a game theoretic approach to explain the output of any machine learning model.
SHAP: How it works?
“... compute the difference between the model's output with and without each feature…”
● Output with and without the specific feature
● Interactions
● How much does a specific feature value contribute to the output?
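The "with and without each feature" idea can be sketched as an exact Shapley value computation over every subset of the other features. This is a brute-force illustration for a handful of features, not the algorithm used by the SHAP library. It assumes that a feature "without" its value is replaced by a baseline value; the toy model, the instance, and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values for one instance x.

    Assumption of this sketch: an absent feature takes its baseline value.
    """
    n = len(x)

    def value(subset):
        # Model output when only the features in `subset` take their real values.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                phi[i] += weight * (with_i - without_i)
    return phi

# Hypothetical toy model: features 0 and 1 interact, feature 2 is ignored
def model(z):
    return 2.0 * z[0] + 1.0 * z[1] + 3.0 * z[0] * z[1]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)
```

The contributions add up to the difference between the model output for the instance and for the baseline, and the interaction term is shared equally between the two features involved. The cost grows exponentially with the number of features, which is why practical tools rely on faster model-specific or approximate methods.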
Correlation vs causation → Attention!
*There are unmeasured variables affecting the measured ones and the output
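A small simulation can illustrate the warning. In this sketch, which uses synthetic data and hypothetical variable names, a hidden variable drives both a measured variable and the output. The measured variable has no effect on the output, yet the two are strongly correlated; once the measured variable is set independently of the hidden one, the correlation disappears.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(42)
n = 2000

hidden = [rng.gauss(0, 1) for _ in range(n)]             # unmeasured variable
measured = [h + rng.gauss(0, 0.3) for h in hidden]       # driven by the hidden one
output = [2 * h + rng.gauss(0, 0.3) for h in hidden]     # never uses `measured`

corr_observed = pearson(measured, output)

# Intervention: set the measured variable ourselves, independently of `hidden`.
intervened = [rng.gauss(0, 1) for _ in range(n)]
corr_intervened = pearson(intervened, output)

print(round(corr_observed, 2), round(corr_intervened, 2))
```

A model trained on `measured` would predict `output` well and an attribution method would credit `measured` for it, even though changing `measured` would not change `output`.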
Separate the wheat from the chaff
● Predicted relationship vs. real relationship
● Specialist knowledge
Uncovering Correlations Among
Key Variables in Biogas Production
Approaches and insights aided by gradient boosting machine
Approach
Variables
Approach 2022
● Literature data
● We obtained 16S sequencing data (Ion Torrent PGM sequencer)
● Only 14 of the 164 data points have sequencing data
● 1 point every 70 days!!
● Literature data + 16S data → reanalysis / pre-processing
● Reactor A and Reactor B
● Taxonomy
● Temperatures (ºC), Reactors A and B: 35 35 35 35 35 32 35 35 35 30 25 25 25 35
Approach
● Pre-process: create 3 datasets (no taxonomy, metadata)
● Splitting (~70/30)
○ Time independent: shuffled data
○ Time dependent (with and without lags): Reactor A and Reactor B, no shuffle
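The splitting strategies listed above can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: `add_lags`, `split_time_ordered`, and `split_shuffled` are hypothetical helper names, the series is made up, and the number of lags is arbitrary.

```python
import random

def add_lags(series, n_lags):
    """Turn a series into rows of (previous n_lags values, current value)."""
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows

def split_time_ordered(rows, train_fraction=0.7):
    """Time dependent: no shuffle, the test set is the most recent part."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def split_shuffled(rows, train_fraction=0.7, seed=0):
    """Time independent: shuffle first, then cut."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical biogas production series for one reactor
biogas = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 20, 22]

rows = add_lags(biogas, n_lags=2)
train, test = split_time_ordered(rows)
train_s, test_s = split_shuffled(rows)

print(len(train), len(test))
print(rows[0])
```

Without shuffling, every test row comes after every training row, so the model is evaluated on data from a later period than the one it was trained on.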
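Error Metrics
● Root Mean Squared Error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$
● Mean Absolute Percentage Error (MAPE): $\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$
○ $n$: sample size
○ $y_i$: real value
○ $\hat{y}_i$: predicted value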
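A minimal sketch of the two metrics, using their standard definitions in terms of the sample size, the real values, and the predicted values. The numbers are hypothetical, and MAPE is reported as a percentage, which is an assumption about the convention used in the talk.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more strongly."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Undefined if a real value is 0."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

# Hypothetical real and predicted values
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

rmse_value = rmse(actual, predicted)
mape_value = mape(actual, predicted)
print(round(rmse_value, 2), round(mape_value, 2))
```

RMSE is expressed in the units of the target, while MAPE is scale-free, which makes it easier to compare across reactors with different production levels.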
Results - Time independent
[Figure: train and test panels]
Results - Time dependent (no lags)
[Figure: Reactor A and Reactor B panels; x-axis: time step]
Results - Time dependent (with lags)
[Figure: Reactor A and Reactor B panels; x-axis: time step]
Summary
● Ranks shown: 1 Ammonia, 3 Temperature
● Variables and units:
○ Temperature (ºC)
○ pH
○ Ammonia (mg/L)
○ Volatile fatty acids (mg/L)
○ Total alkalinity (mg/L)
○ Total solids (g/L)
○ Volatile solids (g/L)
○ Chemical oxygen demand (g/L)
Model with taxonomy
● Data with taxonomy: 14 data points
● 3 different models:
○ Order
○ Family
○ Genus
Model with taxonomy: Genus
Prolixibacteraceae
- able to break down complex carbohydrates
Spirochaetaceae
- produces propionic and butyric acids
- establishes a syntrophic relationship with hydrogenotrophic methanogens
Gracilibacter
- involved in different fermentative pathways
- involved in the acidogenesis stage of biogas production
- specific function in the process is not well understood