
Session 6 - Machine Learning Fundamentals and Orange Introduction

The document provides an overview of supervised machine learning. It discusses how supervised learning algorithms are trained using historical labeled data that contains known input and output pairs. The data is partitioned into training, validation, and test sets. The training set is used to train the model by finding patterns in the relationship between inputs and outputs. The validation set is used to evaluate how well the model is learning during training. The test set evaluates the performance of the fully trained model. Supervised learning algorithms learn by iteratively updating internal parameters to reduce error rates when making predictions on the training data.

Uploaded by

Shishir Gupta
Copyright
© All Rights Reserved

Advanced HR Analytics

Session 6
DELIVERED BY
DR. PRATYUSH BANERJEE
A BRIEF INTRO ABOUT MYSELF

Course Instructor: Dr. Pratyush Banerjee [PhD (Management), PGDM (HR), B.Tech (Electronics & Telecomm.)]

• Present designation: Associate Professor, IMI Bhubaneshwar
• Earlier associated with TAPMI, XIMB, the Management Department at BITS-Pilani, and IBS Hyderabad
• Teaching HR Analytics since 2015; involved in curriculum development at several institutes and executive education programs
• Certified HR Analytics Professional from Aon-Hewitt Learning Center, 2017
• Certified Business Analytics Professional from Carlton Advanced Management Institute (UK), 2015
• Conducted FDPs and workshops on HR and Business Analytics with Indian Oil Corporation, CGI and CCIL; guest faculty at IIM Amritsar's executive certification program with HPCL
• Delivered guest lectures for CII's certificate program on People Analytics in 2019 and 2020
• Visiting faculty at IIM Raipur, IIM Indore and MDI
• Authored a book on HR Analytics titled "Practical Applications of HR Analytics: A Step by Step Guide" (Sage Publishers, 2019); currently developing a second book on data analytics with Python
Learning objectives

• Introduction to Artificial Intelligence
• Introduction to machine learning
• Supervised machine learning
• Understanding the application of logistic regression as a supervised learning algorithm
• Interpreting output using Orange, a Python-based GUI tool
The hidden wall of Analytics Maturity

Why can't most firms go beyond level 2?
HOW TO BREAK THE ANALYTICS MATURITY BARRIER?

How to break this wall? Knowledge needed:
a) Statistics (correlation, regression, decision tree, cluster analysis, factor analysis)
b) Machine learning (logistic regression, decision tree, neural networks, random forests), NLP (text mining, sentiment mining)
c) Linear programming
Some myths which hold us back

Myth 1: Software tools are difficult to learn; lots of coding is involved.
Reality: most applications can be executed through GUI tools.
Myth 2: GUI tools are expensive.
Reality: there are free GUI interfaces for both R and Python.
Myth 3: Statistics and machine learning concepts are very difficult.
Reality: they can be approached in a simpler way.
So, what do I mean by making analytics easier to digest?

What can be ignored at the foundational phase:
a) Coding in R and Python
b) Deep knowledge of statistics
c) Deep knowledge of linear algebra or calculus

What cannot be ignored:
a) Fundamentals of statistics
b) Fundamentals of machine learning
c) Learning to interpret results correctly
d) Understanding R / Python in a code-agnostic manner (though the choice of code-agnostic software is yours)
Definition of Artificial Intelligence

AI stands for the discipline of training a machine (computer programs / algorithms) to mimic the human brain and its thinking capabilities.
The term was coined by John McCarthy, an American computer scientist, in 1956.
AI focuses on three major aspects: learning, reasoning and self-correction, to obtain maximum possible efficiency.
AI mimics the neural network mechanism of the human brain.
TYPES OF AI

Based on the functionality of AI-based systems, AI can be categorized into the following types:
• Reactive Machines AI – IBM Watson, Google's DeepMind, Deep Blue, AlphaGo
• Limited Memory AI – self-driving cars
• Theory of Mind AI – AI-based emotion recognition
• Self-aware AI – still at a hypothetical stage
WHAT TYPE OF AI IS SEEN IN
ACTION IN THE PIC?
THE EVOLUTION OF AI COMPLEXITY
How are AI, ML and Deep Learning related?
TYPES OF MACHINE LEARNING

So what does the ML algorithm do exactly?
THE ASTROLOGER ANALOGY

(Cartoon) Client: "Hold on Yogi ji, can you predict my past first? Just to check your genuineness." Astrologer: "Don't worry, I will predict your future."
Supervised learning

Here the ML algorithm is provided historical / past data containing the outcome, which is treated as an output label (target variable) associated with each instance in the dataset.
This output can be discrete / categorical (Yes, No, True, False, Success, Failure, Jaguar, Ford, Mustang, STOP sign, spam/ham, etc.) or continuous / real-valued (satisfaction score, revenue, TRP rating, BMI).
In supervised learning, the supervisor first trains the model with historical data by partitioning it into separate chunks called training, validation and testing data sets.
Then, based on its learning, the model can predict outcomes on new / future data.
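The "learning" here is typically an iterative update of internal parameters to reduce prediction error on the training data. A minimal sketch with a one-parameter model fitted by gradient descent (the (input, output) pairs are made up purely for illustration):

```python
# Fit y = w * x by gradient descent: the "internal parameter" w is
# nudged each iteration to reduce the mean squared error on the
# labeled training pairs.
pairs = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]  # (input, output) labels

w = 0.0                      # initial guess for the parameter
lr = 0.01                    # learning rate (a hyperparameter)
for _ in range(500):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
    w -= lr * grad           # update the parameter to reduce error

print(round(w, 2))           # close to 2.0: the model "learned" y ≈ 2x
```

The same idea, with many more parameters and a different error measure, underlies the classifiers discussed later in this session.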
Labeled vs. Unlabeled Data

1. For supervised learning, we need historical data containing the target class or variable. This is known as labeled data.
2. Say you want to classify employees into two categories, Good and Poor. You can then use labeled training data. Labeled data carry a label (in our case, Good or Poor), and the algorithm tries to predict the type of employee using the labeled data for training.
3. In the case of unlabeled data, the employees are not classified into specific classes; the algorithm tries to do that for you.
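The distinction can be made concrete with a few hypothetical employee records (the field names and values are invented for illustration):

```python
# Labeled data: each employee record carries a known target ("label").
# The fields "tenure" and "rating" are hypothetical example features.
labeled = [
    {"tenure": 5, "rating": 4.5, "label": "Good"},
    {"tenure": 1, "rating": 2.0, "label": "Poor"},
    {"tenure": 3, "rating": 3.8, "label": "Good"},
]

# Unlabeled data: same features, but no target column. A clustering
# algorithm would have to group these employees on its own.
unlabeled = [
    {"tenure": 4, "rating": 4.1},
    {"tenure": 2, "rating": 2.5},
]

targets = sorted({row["label"] for row in labeled})
print(targets)  # ['Good', 'Poor']
```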
Example of LABELED DATA FOR CLASSIFICATION (screenshot)
Example of LABELED DATA FOR REGRESSION (screenshot)
Supervised vs. Unsupervised

Supervised: the ML algorithm learns through the analyst's intervention; that is, the analyst decides when to stop the training, the ratio of training vs. holdout / validation data, and so on. Mostly used for classification problems such as predicting outcomes.

Unsupervised: the algorithm finds patterns in the data on its own; no specific data partitioning takes place. Used for clustering problems, factor analysis and association rule mining.
Examples of SL and USL

Supervised learning – for prediction-type problems. E.g. logistic regression, neural networks, Naïve Bayes, k-nearest neighbors, decision trees, random forest, AdaBoost, SVM.

Unsupervised learning – for clustering-type problems. E.g. association rules, cluster analysis, self-organizing feature maps, PCA, manifold learning, DBSCAN, Louvain clustering.
Partitioning the data for supervising the learning

Training data (70% of the whole data) is used by the ML algorithm to understand the relation between the IVs and the DV.
Holdout / validation data (15% of the whole data) is used to test how well the ML algorithm can predict the outcome.
Testing data (15% of the whole data): a separate data set, known as testing data, is finally used to check whether the algorithm gives more accurate predictions than those obtained on the training and validation data.
K-fold cross-validation: the historical data set is partitioned k times for multiple iterations (used when a large data set is available).
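The 70:15:15 partition and the k-fold idea can be sketched in plain Python (a conceptual illustration only; Orange and similar tools provide their own splitting options):

```python
import random

def split(rows, train=0.70, valid=0.15, seed=42):
    """Shuffle and partition rows into train / validation / test chunks
    (the 70:15:15 scheme above; the ratios are a convention, not a rule)."""
    rows = rows[:]                      # copy so the caller's list survives
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_valid = round(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])

def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation:
    each of the k chunks serves once as the held-out fold."""
    idx = list(range(n))
    fold = n // k
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        train = [j for j in idx if j not in test]
        yield train, test

train_set, valid_set, test_set = split(list(range(100)))
print(len(train_set), len(valid_set), len(test_set))  # 70 15 15
```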
How to train the data?

Data is split into three parts, either using some transform function or by default:
◦ Training (60% of total sample) – training data/sample
◦ Validation (20% of total sample) – validation / holdout data/sample
◦ Testing (20% of total sample) – testing data/sample

◦ The analyst may choose otherwise, e.g. 70:20:10
◦ In some software, like RATTLE, the default partition is 70:15:15
◦ In Orange, multiple options are available for data splitting
The Three Data Partitions and Their Role in the Data Mining Process

(Diagram) E.g., say the data for 2015-2020 on a target variable such as Attrition / Sales is the historical data; the validation partition verifies how much the model has learnt. The new data, say for 2021-22, where the target variable is to be predicted, is in industry also sometimes referred to as test data.
In a nutshell: How Does a Supervised ML Model Learn to Predict?

Step 1: Provide the AI/ML model with the historical data. This is also known as training data; this step involves data cleaning, data preprocessing, feature selection and feature engineering.
Step 2: Specify the input features and the target.
Step 3: The model tries to understand the pattern of relationship between the input and the output. Data splitting takes place here, and hyperparameters such as the learning rate are specified.
Step 4: Observe the model performance. Metrics for evaluation: precision, recall, classification accuracy, ROC curve.
SOFTWARE REQUIREMENT FOR CONDUCTING AI & ML APPLICATIONS

PROPRIETARY SOFTWARE:
- SAS ENTERPRISE MINER
- SPSS MODELER
- RAPIDMINER PRO

OPEN-SOURCE SOFTWARE:
- R (CODE & GUI)
- PYTHON (CODE BASED)
- ANACONDA (CODE BASED, E.G. JUPYTER NOTEBOOK)
- WEKA (GUI)
- KNIME (GUI)
- ORANGE (CODE & GUI)
Usual route of data mining through Anaconda

Jupyter Notebook

Jupyter Notebook has its origins in Project Jupyter, a non-profit organization committed to developing open-source software, open standards, and services for interactive computing. Project Jupyter is actually a spin-off from IPython, another Python environment, developed by Fernando Perez. The name Jupyter is an amalgamation of the three mother languages which form the core of the software: Julia, Python and R. The name also pays homage to the discovery of the moons of the planet Jupiter, which Galileo recorded in a notebook.

The Jupyter Notebook has become a popular user interface for cloud computing, and major cloud providers have adopted the Jupyter Notebook or a derivative as a frontend interface for their users. For example, Microsoft's Azure Notebook is built on Jupyter.

Source: Shen, H. (2014). Interactive notebooks: Sharing the code. Nature, 515, 151–152 (06 November 2014). doi:10.1038/515151a
Jupyter Notebook Console (screenshot)
A typical Python script written in Jupyter Notebook (screenshot)
A way out of the coding conundrum – Orange Data Mining Suite

Orange is an open-source data visualization, machine learning and data mining toolkit.
It features a visual programming front-end for exploratory data analysis and interactive data visualization, and can also be used as a Python library.
It was developed by Demšar and colleagues (2013) at the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.
Orange can help in executing a wide array of applications, including machine-learning-based predictive modelling, text mining and sentiment analysis.
How to install Orange?

Download Anaconda Python from the following link:
https://www.anaconda.com/distribution/

Alternatively, Orange may be downloaded from https://orangedatamining.com/download/#windows
The Orange Canvas

Some essential facts about Orange

Widgets hold the key to any operation in the Orange environment.
Workflows are created, very similar to RapidMiner or SAS Enterprise Miner workflows.
Viewing output requires separate widgets.
Add-on widgets can be downloaded for specific applications.
Application of the File Widget

Drag and drop the File widget icon on the Orange canvas, then double-click on the icon to import your data.
Importing other file formats and connecting to SQL Server

In Orange, apart from MS Excel, there are widgets to import datasets in Comma Separated Values (CSV) format, stored as an Orange dataset, or in the form of a Structured Query Language (SQL) table. For SQL data, some prior setup is required.

The SQL Table widget can connect to either PostgreSQL (the psycopg2 module needs to be installed) or Microsoft SQL Server (the pymssql module needs to be installed). If the necessary modules are not installed, the widget will not work: clicking on the SQL Table icon without this prior setup makes Orange show an error.
Instructions for enabling the SQL Widget

To enable the PostgreSQL connection, install the psycopg2 driver: pip install psycopg2

To enable the Microsoft SQL Server connection, install the pymssql driver: pip install pymssql

Source: Ajda Pretnar (2018). How to enable SQL widget in Orange. Feb 16, available at: https://orange.biolab.si/blog/2018/02/16/how-to-enable-sql-widget-in-orange/
Key data widgets (each shown as a screenshot in the original slides):

• Select Columns widget – used for feature selection and feature engineering
• Data Table widget – used for exploratory data analysis
• Impute widget – used for missing value analysis
• Outliers widget – to detect outliers
• Discretize and Continuize widgets – for feature engineering
• Preprocess widget – a one-stop shop for all types of data re-engineering
• Feature Statistics widget – descriptive stats
• Correlations widget
• Python Script widget – for coding in Orange
• Save Data widget
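Conceptually, what the Discretize and Continuize widgets do can be sketched in plain Python: equal-width binning for discretization and one-hot encoding for continuization (a hand-rolled illustration with made-up values, not Orange's implementation):

```python
# Discretize: equal-width binning of a continuous feature into 3 bins
scores = [2.1, 3.4, 4.8, 1.0, 4.9, 3.3]          # hypothetical scores
lo, hi, bins = min(scores), max(scores), 3
width = (hi - lo) / bins
discretized = [min(int((s - lo) / width), bins - 1) for s in scores]
print(discretized)  # bin index per score

# Continuize: one-hot encode a categorical feature into 0/1 columns
departments = ["HR", "Sales", "HR", "IT"]        # hypothetical categories
levels = sorted(set(departments))                # ['HR', 'IT', 'Sales']
one_hot = [[1 if d == lvl else 0 for lvl in levels] for d in departments]
print(one_hot[0])  # [1, 0, 0] -> HR
```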
Data Widget Tab

The Data widget tab contains all functionalities for importing, preprocessing and defining the data.
The circled widgets in the diagram are of importance for conducting data imputation.
The Visualize Widget Tab

This tab holds the key to data visualization.
Some of these widgets are also useful for viewing the output of ML algorithm models.
The Model Widget Tab

The model widgets are key to predictive modeling and analytics.
As can be seen, almost all major predictive modeling tools are provided by Orange.
We will be covering the circled widgets in subsequent lessons.
The Evaluate Widget Tab

This widget tab helps in estimating the predictive power of models.
The Confusion Matrix, Test and Score, and ROC Analysis widgets will be covered in subsequent discussions.
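What a confusion matrix widget reports can be computed by hand from vectors of actual and predicted labels (toy numbers, for illustration only):

```python
# Confusion matrix counts for a binary classifier
actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # true labels (made up)
predicted = [1, 0, 0, 1, 0, 1, 1, 0]   # model's predictions (made up)

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

precision = tp / (tp + fp)           # 3 / 4 = 0.75
recall    = tp / (tp + fn)           # 3 / 4 = 0.75
accuracy  = (tp + tn) / len(actual)  # 6 / 8 = 0.75
```

These are exactly the quantities that the Confusion Matrix and Test and Score widgets surface graphically.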
Unsupervised Widget Tab

This widget tab includes all necessary models related to unsupervised machine learning.
We will be covering hierarchical and k-means clustering in this section.
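As a preview of k-means clustering, here is a minimal one-dimensional version (a conceptual sketch with made-up values, not Orange's implementation):

```python
import random

def kmeans_1d(values, k=2, iters=20, seed=1):
    """Minimal 1-D k-means: assign each value to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    centroids = random.Random(seed).sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, around 2 and around 10:
print(kmeans_1d([1, 2, 3, 9, 10, 11]))  # [2.0, 10.0]
```

Real data have many features, so distances are computed in several dimensions, but the assign-then-recompute loop is the same idea the k-Means widget runs.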
Conducting Text Analytics in Orange – Text Widget Add-on

This widget tab includes all necessary applications of text mining.
We will be covering sentiment analysis with this add-on.
Thank you
