KNIME Guide - Week 2

Uploaded by

amitkinwar
Week 2: Data Visualization

Univariate Analysis: When one (1) variable is analysed at a time.
Bivariate Analysis: Analysing the relationship between two (2) variables.
EDA: Refers to Exploratory Data Analysis. Univariate and Bivariate analysis are part of it.
Data Preprocessing: Refers to preparing the data for analysis, such as imputing missing values, correcting anomalous values, etc.
Numeric Variable: Variables that are numeric in nature, such as 'age', 'annual income', 'price', etc.
Nominal (Categorical) Variable: Variables that are described by categories or labels. For example, 'Marital Status' may contain the categories 'Married', 'Single', 'Separated'. In KNIME, the term Nominal variable also includes Ordinal variables.

Sl. No. | KNIME Node | Usage/Application | Description | Important 'Configuration Window' Options | Output (after executing the node) | Type of insights

1. Data Explorer
Usage/Application: Univariate Analysis - Both Numeric & Nominal variables.
Description: It takes the input data and computes different statistics on it. We can view the statistics by going to the "Interactive View: Data Explorer View" option after executing the node.
Important 'Configuration Window' Options:
(a) We can change the number of bins in the histogram by going to the option "Number of numeric histogram bars" and typing in a number. By default it uses 10 bins.
Output (after executing the node):
Numeric Variables: Minimum, maximum, mean, standard deviation, number of missing values, histogram plot, etc.
Nominal Variables: Number of unique values, all nominal values (categories), number of missing values, frequency bar chart (category counts).
Type of insights:
(a) We can observe the distribution of the numeric variables by looking at descriptive statistics such as mean, minimum, maximum, variance, and the histogram. It also gives a sense of anomalous data.
(b) We can observe the distribution of the nominal variables by looking at the frequency count for each category of the nominal variable.
(c) We can observe the number of missing values and the number of zeros, which gives us a sense of the kind of data preprocessing that would be required.

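To make these statistics concrete, here is a minimal Python sketch of the kind of per-column summary the Data Explorer reports. It is not KNIME code: the sample columns, the function names, and the choice of the sample standard deviation are assumptions made only for illustration.

```python
import statistics
from collections import Counter

# Hypothetical sample columns; None marks a missing value.
ages = [23, 35, 41, None, 29, 52, 35]
marital_status = ["Married", "Single", None, "Married",
                  "Separated", "Married", "Single"]


def numeric_summary(values):
    """Summary statistics for a numeric column, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation (assumed)
        "missing": len(values) - len(present),
    }


def nominal_summary(values):
    """Unique-value count, category frequencies and missing count for a nominal column."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "unique": len(counts),
        "counts": dict(counts),
        "missing": len(values) - len(present),
    }


print(numeric_summary(ages))
print(nominal_summary(marital_status))
```

The category counts returned by `nominal_summary` are the same numbers a frequency bar chart would display as bar heights.
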
2. Histogram
Usage/Application: Univariate Analysis - Numeric variables.
Description: A Histogram is created from the data. We can observe the plot by going to the "Interactive View: Histogram View" option.
Important 'Configuration Window' Options:
(a) We can change the number of bins of the histogram by going to the "Binning" tab in the configuration window and entering a value in the "Number of bins" option. By default it is 5.
Output (after executing the node): A Histogram with the chosen number of bins is created.
Type of insights:
(a) This gives us an idea of whether the numeric variable is normally distributed, right-skewed, left-skewed, or uniformly distributed.

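The effect of the "Number of bins" setting can be sketched in a few lines of Python. This is only an illustration that assumes equal-width bins between the minimum and maximum; the function name and sample values are hypothetical, and KNIME's own binning may differ in detail.

```python
def histogram_counts(values, n_bins=5):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical: everything lands in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


prices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical numeric column
print(histogram_counts(prices, n_bins=5))
print(histogram_counts(prices, n_bins=2))
```

Fewer bins smooth out the shape of the distribution, while more bins reveal finer detail, so it is usually worth trying a few values before commenting on skewness.
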

3. Box Plot
Usage/Application: Univariate Analysis - Numeric variables.
Description: A box plot displays five statistical parameters: minimum, lower quartile, median, upper quartile, and maximum. It also shows outliers (if any). We can observe the plot by going to the "Interactive View: Box Plot" option.
Important 'Configuration Window' Options: It includes only the numeric data type variables. We can either
(a) take a look at all the numeric variables at a time by checking the option "Plot multiple boxes". This is checked by default.
(b) take a look at them individually by unchecking the above option. This is preferred, as the scale of one variable may affect the display of the other variables.
Output (after executing the node):
(a) Shows the box plots of all the included variables at a time if "Plot multiple boxes" is checked.
(b) Shows the box plot of one variable at a time if "Plot multiple boxes" is unchecked.
Note: We can observe the variables one by one in the output window itself by changing the selected column. In the output, go to the settings option in the right-hand corner (shown by three parallel lines) and change the variable in the 'Selected Column' option.
Type of insights:
(a) With the help of the 5 point summary we can comment on the distribution of the numeric variable: approximately what percentage of the data lies in what range, and the median value.
(b) We can also look at the presence or absence of outliers.

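The 5 point summary behind a box plot can be computed by hand as a quick check. The sketch below uses Python's `statistics.quantiles` with the "inclusive" method; quartile conventions differ between tools, so the values may not match KNIME's plot exactly, and the function name and sample data are assumptions for illustration.

```python
import statistics


def five_point_summary(values):
    """Return minimum, lower quartile, median, upper quartile and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical numeric column
print(five_point_summary(sample))
```

Roughly half of the data lies between the lower and upper quartile, which is the range drawn as the box.
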
4. Bar Chart
Usage/Application: Univariate Analysis - Nominal (categorical) variables.
Description: The Bar Chart node can be used for both Univariate & Bivariate analysis. Here we are looking at Univariate analysis only. A bar chart showing the occurrence count of a Nominal variable is created. We can observe the plot by going to the "Interactive View: Bar Chart" option.
Important 'Configuration Window' Options:
(a) In the 'Category column' we can select the Nominal variable for which we want to do the analysis.
(b) For univariate analysis, we keep the default option, i.e. Aggregation method as 'Occurence count'.
Output (after executing the node): For a given nominal variable, the number of data points (occurrence count) of each unique category is shown as a bar. The height of the bar corresponds to the count of a category present in the nominal variable.
Type of insights:
(a) We can have a comparative understanding of the frequency count for each category of a nominal variable. We can observe this in the Data Explorer output as well, but it can be observed more clearly here.

5. Conditional Box Plot
Usage/Application: Bivariate Analysis - One Numeric and One Nominal variable.
Description: It is used to observe the distribution of a numeric variable with respect to the different categories of a Nominal variable. It plots box plots of a numeric variable corresponding to each category of a nominal variable. We can observe the plot by going to the "Interactive View: Conditional Box Plot" option.
Important 'Configuration Window' Options:
(a) We select the Nominal variable from the 'Category column' option.
(b) We select the Numeric variable from the 'Selected column' option.
Output (after executing the node): Shows box plots of the selected numeric variable for the different categories of the Category column.
Note: We can change the numeric variable in the output window itself by changing the selected column. In the output, go to the settings option in the right-hand corner (shown by three parallel lines) and change the variable in the 'Selected Column' option. However, to change the nominal variable we need to make changes in the configuration window.
Type of insights:
(a) With the help of the 5 point summary we can comment on the distribution of the numeric variable for each category of the nominal variable. This can help us in understanding how the numeric variable varies across the different categories of the nominal variable.
For example, suppose we have a nominal variable holding the education level of a bank's customers and a numeric variable representing their credit card spending. Using the box plots we can compare whether there is any difference in credit card spending for customers having different levels of education.

6. Bar Chart
Usage/Application: Bivariate Analysis - One Numeric and One Nominal variable.
Description: It can be used to create a bar chart showing either the sum or the average of a numeric variable corresponding to each category of a nominal variable. We can observe the plot by going to the "Interactive View: Bar Chart" option.
Important 'Configuration Window' Options:
(a) In the 'Category column' we can select the Nominal variable for which we want to do the analysis.
(b) We can choose either of the Aggregation methods - 'Sum' or 'Average' - to plot the bar charts.
(c) All the numeric variables included in the configuration window will be plotted in one go.
Output (after executing the node): The output shows bar charts corresponding to either the sum or the average of all the numeric variables included in the configuration window.
Type of insights:
(a) We can compare the 'average' or 'sum' of the numeric variables corresponding to the categories of a nominal variable.
For example, consider a retail shop with different product categories, say Electronics, Beauty, Health, Grocery. Now we want to analyse the sales of these products by gender of customers (say 'male' and 'female'). We can plot the 'sum' or 'average' sales of these products for the 'male' and 'female' customers and see if a particular product category attracts a particular gender more.

7. GroupBy
Usage/Application: Bivariate Analysis - One Numeric and One Nominal variable.
Description: It can be used to observe different statistical measures such as mean, median, minimum, maximum, variance, etc. of a numeric variable corresponding to each category of a nominal variable. We can observe the table by going to the "Group table" option.
Important 'Configuration Window' Options:
(a) In the 'Groups' tab, we include the nominal variable.
(b) In the 'Manual Aggregation' tab, the numeric variables are shown in the left panel. From here we can select a particular variable and click on 'add'. After adding the numeric variable to the right panel we can decide the type of statistic we want to observe. By default Mean is selected; we can click on the 'Mean' option and a drop-down will appear from which we can select the measure of our choice.
Output (after executing the node): The result table has one row for each existing value combination of the selected group columns.
For example, suppose we have selected the Nominal variable Gender having categories 'male' and 'female', and the numeric columns 'age' and 'annual income' with the measures 'mean', 'minimum' and 'maximum' for both numeric variables. Then the output will contain 2 rows, one for each category of Gender, and 6 aggregated columns (in addition to the Gender column), as we have selected 3 measures for each of the two numeric variables.
Type of insights:
(a) Just like the 'Conditional Box Plot' and 'Bar Chart' nodes, we can use the 'GroupBy' node to compare the statistics of numeric variables for different categories of nominal variables. Here we get the output in tabular form, with measures such as mean, median, variance, sum, minimum, maximum, etc.

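The shape of a grouped result is easiest to see on a tiny table. The following Python sketch imitates the idea of grouping by one nominal column and aggregating two numeric columns; it is not how KNIME implements GroupBy, and the rows, column names and result labels are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical input table.
rows = [
    {"gender": "male", "age": 30, "annual_income": 50000},
    {"gender": "female", "age": 40, "annual_income": 65000},
    {"gender": "male", "age": 50, "annual_income": 70000},
    {"gender": "female", "age": 20, "annual_income": 45000},
]


def group_by(rows, group_col, value_cols):
    """Compute mean, min and max of each value column per category of group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row)

    result = {}
    for category, members in groups.items():
        stats = {}
        for col in value_cols:
            values = [m[col] for m in members]
            stats[f"mean({col})"] = statistics.mean(values)
            stats[f"min({col})"] = min(values)
            stats[f"max({col})"] = max(values)
        result[category] = stats  # one result row per category
    return result


table = group_by(rows, "gender", ["age", "annual_income"])
for category, stats in table.items():
    print(category, stats)
```

Each category of the group column yields one result row, and each (measure, numeric column) pair yields one aggregated column.
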

This file is meant for personal use by [email protected] only.


Sharing or publishing the contents in part or full is liable for legal action.
8. Crosstab (local)
Usage/Application: Bivariate Analysis - Two Nominal variables.
Description: It can be used to compare the percentage/frequency distribution across the categories of two Nominal variables. We can observe the table by going to the "View: Cross tabulation" option.
Important 'Configuration Window' Options:
(a) Row variable: The selected nominal column will have its categories represented along the rows.
(b) Column variable: The selected nominal column will have its categories represented along the columns.
(c) Weight column: This can be set to <none>. It applies a numeric weight to each record in the input, causing the Crosstab node to treat each record as if it were repeated WEIGHT number of times.
Output (after executing the node): The "View: Cross tabulation" output gives us the following options that can be used for Bivariate analysis:
Frequency: The cell frequency, i.e. the number of data points that fall in the particular combination.
Row Percent: Computed as Frequency / row total.
Column Percent: Computed as Frequency / column total.
Type of insights:
(a) We can observe the distribution of the categories of a nominal variable across the categories of another nominal variable.
For example, say in the telecom sector we want to observe attrition among customers. We have information on whether the customer had subscribed to any special plan. So we have two nominal variables, 'attrition' ('Attrited' & 'Not attrited') and 'special plan' ('Subscribed' & 'Not Subscribed'). If we want to compare the rate of attrition among customers who have subscribed to special plans and those who have not, we can use the Crosstab node for this purpose.

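The three Crosstab measures can be reproduced on a handful of records. The Python sketch below is only an illustration of the formulas given above (Frequency, Frequency / row total, Frequency / column total); the records and names are hypothetical, and the percentages are scaled by 100 as an assumption about presentation.

```python
from collections import Counter

# Hypothetical (special plan, attrition) pairs, one per customer.
records = [
    ("Subscribed", "Attrited"),
    ("Subscribed", "Not attrited"),
    ("Subscribed", "Not attrited"),
    ("Subscribed", "Not attrited"),
    ("Not Subscribed", "Attrited"),
    ("Not Subscribed", "Attrited"),
    ("Not Subscribed", "Attrited"),
    ("Not Subscribed", "Not attrited"),
]


def crosstab(pairs):
    """Frequency, row percent and column percent for each (row, column) combination."""
    freq = Counter(pairs)
    row_totals = Counter(r for r, _ in pairs)
    col_totals = Counter(c for _, c in pairs)
    return {
        (r, c): {
            "frequency": n,
            "row_percent": 100 * n / row_totals[r],
            "column_percent": 100 * n / col_totals[c],
        }
        for (r, c), n in freq.items()
    }


for cell, measures in crosstab(records).items():
    print(cell, measures)
```

With the plan as the row variable, the row percent of the 'Attrited' cells is the attrition rate within each plan group, which is the comparison the example asks for.
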
9. Scatter Plot
Usage/Application: Bivariate Analysis - Two Numeric variables.
Description: It can be used to observe how one numeric variable varies with the other. For each row, it plots one numeric variable along the X-axis and the other along the Y-axis and shows the result as a dot. We can observe the plot by going to the "Interactive View: Scatter Plot" option.
Important 'Configuration Window' Options:
(a) We can select the maximum number of rows that will be considered for the plot. By default it is 2,500.
(b) We can choose the columns for the X-axis and the Y-axis.
Output (after executing the node): The output shows a scatter plot of the two selected variables.
Note: We can change the numeric variables in the output window itself by changing the X column or Y column.
Type of insights:
(a) We can observe the relation of one numeric variable with the other by looking at the scatter plot.
For example, let us consider a bank which wants to observe how credit card spending relates to the age of customers. We can plot a scatter plot and see if credit card spending increases, decreases, or has no relation with age.

10. Duplicate Row Filter
Usage/Application: Data Preprocessing.
Description: It can be used to remove duplicate rows in the data. This is especially useful when we are not expecting two rows to be identical across all the columns.
Important 'Configuration Window' Options:
(a) We can select the columns we want to include for this purpose.
Output (after executing the node): The output is the input data with the duplicate rows removed.
Type of insights: Let us consider a retail shop's data which identifies each customer with a unique customer id. If we observe duplicate rows in the data, we can be sure that the data is anomalous, because two rows with the same customer id are not expected. We can remove the duplicate rows in such a situation.

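As a rough illustration of duplicate removal based on selected columns, the Python sketch below keeps the first row seen for each combination of values in the chosen columns. Keeping the first occurrence is an assumption made for this example, and the rows and names are hypothetical.

```python
def remove_duplicates(rows, key_cols):
    """Keep the first row for each distinct combination of values in key_cols."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[col] for col in key_cols)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical customer table with one repeated customer id.
customers = [
    {"customer_id": 101, "city": "Pune"},
    {"customer_id": 102, "city": "Delhi"},
    {"customer_id": 101, "city": "Pune"},
]

print(remove_duplicates(customers, ["customer_id"]))
```

Choosing only the identifier column treats any repeated customer id as a duplicate, whereas choosing all columns removes only rows that are identical everywhere.
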
11. Missing Value
Usage/Application: Data Preprocessing - Both Numeric & Nominal variables.
Description: This node helps to handle missing values found in cells of the input table. The missing values can be imputed by the mean, median, most frequent value, or a customized value; alternatively, rows that have missing values can be removed, etc.
Important 'Configuration Window' Options:
(a) Default Settings: Here we can fill the missing values present in the different columns by their data type. We have three data types: Number (integer), Number (double), and String. We can set the imputation method, which will affect all the columns that fall in any of these data types.
(b) Column Settings: Here we can fill the missing values specifically for a particular column. It overrides the Default settings for the selected columns.
Output (after executing the node): The output fills in the missing values according to the method chosen.
Type of insights: Let us consider a retail shop's data that has a column capturing the age of the customers. If we observe that there are certain missing values in the data, we can use the average age of the customers to fill in the missing values.

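Two of the imputation methods mentioned here, mean and most frequent value, can be sketched in plain Python as follows. This only illustrates the idea and is not KNIME's implementation; `None` stands in for a missing cell and the function names and sample values are hypothetical.

```python
import statistics


def impute_mean(values):
    """Replace missing numeric values with the mean of the observed values."""
    present = [v for v in values if v is not None]
    fill = statistics.mean(present)
    return [fill if v is None else v for v in values]


def impute_most_frequent(values):
    """Replace missing values with the most frequent observed value."""
    present = [v for v in values if v is not None]
    fill = statistics.mode(present)
    return [fill if v is None else v for v in values]


ages = [20, None, 40]                       # hypothetical numeric column
cities = ["Pune", None, "Delhi", "Pune"]    # hypothetical string column

print(impute_mean(ages))
print(impute_most_frequent(cities))
```

Mean imputation is sensitive to extreme values, so for a skewed column the median may be a safer choice among the options listed above.
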
12. Numeric Outliers
Usage/Application: Data Preprocessing - Numeric variables.
Description: This node detects and treats the outliers for each of the selected columns individually by means of the interquartile range (IQR), the same measure on which the box plot is based.
To detect the outliers, the IQR is computed using the first quartile (Q1) and third quartile (Q3):
IQR = Q3 - Q1
An observation is flagged as an outlier if it lies outside the range
R = [ Q1 - k * IQR, Q3 + k * IQR ]
k can be any value >= 0; by default k = 1.5. The smallest value in R typically corresponds to the lower end of a box plot's whisker and the largest value to its upper end.
Important 'Configuration Window' Options:
(a) Select the columns where we want to treat the outliers by including them.
(b) We can choose to apply the outlier treatment to outliers at the higher end or at the lower end. This is set in the 'Apply to' option of the Outlier Treatment.
(c) We can choose from the following Treatment options - replace the outliers or remove the outlier rows.
(d) If we choose the 'replace the outliers' treatment option, we can apply either of two treatment strategies - 'replace them with the closest permissible value' or 'convert them to missing values' (which can then be treated separately).
Output (after executing the node): There are two outputs:
(a) 1st Output (Treated Table): The node applies the chosen outlier treatment strategy to the input table and gives an output table.
(b) 2nd Output (Summary): We can look at the summary of the outlier treatment, which shows the number of outliers in each column and the lower and upper bounds after treatment.
Type of insights: Let us consider a retail shop's data that has a column capturing the age of the customers. Suppose we observe that certain customers have an age greater than 120 years. This is highly unlikely, so these values can be considered outliers and treated accordingly.

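The IQR rule described above can be worked through in a short Python sketch. It assumes the "inclusive" quartile method of `statistics.quantiles` and simplifies 'closest permissible value' to clamping at the bounds of R; KNIME's quartile estimation and replacement details may differ, and all names and sample ages are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) = (Q1 - k * IQR, Q3 + k * IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def treat_outliers(values, k=1.5, strategy="closest"):
    """Treat values outside the IQR range.

    strategy: "closest" clamps to the nearest bound (a simplification),
              "missing" replaces outliers with None,
              "remove" drops them.
    """
    lower, upper = iqr_bounds(values, k)
    treated = []
    for v in values:
        if lower <= v <= upper:
            treated.append(v)
        elif strategy == "closest":
            treated.append(lower if v < lower else upper)
        elif strategy == "missing":
            treated.append(None)
        elif strategy != "remove":
            raise ValueError(f"unknown strategy: {strategy}")
    return treated


ages = [22, 25, 27, 30, 31, 33, 35, 38, 130]  # hypothetical column with one implausible age
print(iqr_bounds(ages))
print(treat_outliers(ages, strategy="closest"))
print(treat_outliers(ages, strategy="remove"))
```

For these sample ages Q1 = 27 and Q3 = 35, so IQR = 8 and R = [15, 47] with k = 1.5; the age of 130 falls outside R and is the only value treated.
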
13. Color Manager
Usage/Application: EDA - Both Numeric & Nominal variables.
Description: We can use it as a third dimension for visualization. Colors can be assigned for either nominal columns (by possible values) or numeric columns (with lower and upper bounds).
Important 'Configuration Window' Options: Select the column and set the color codes.
Output (after executing the node): It color codes the output according to the set configuration.
Type of insights: Let us consider a retail company's data where we want to observe the scatter plot between age and monthly spending on electronics. If at the same time we want to observe whether the gender of the customers gives us any extra insight, we can color code the gender and then observe the scatter plot.

14. Row Filter
Usage/Application: EDA/Data Preprocessing.
Description: The node allows for filtering the rows according to certain criteria. It can be used to either include or exclude rows.
Important 'Configuration Window' Options: We have three options for filtering:
(a) Include or exclude rows by attribute (column) value: We can select the column in the 'Column to test' option. In the matching criteria, we can either put in a matching pattern, use numeric ranges, or decide based on the presence of missing values in the particular column.
(b) Include or exclude rows by number: We need to specify the first row number to in/exclude. The end of the range can either be specified by row number, or set to the end of the table, causing all remaining rows to be in/excluded.
(c) Include or exclude rows by row ID: We need to specify a regular expression, which is matched against the row ID of each row. A checkmark can be set if a case sensitive match should be performed and if the row ID should start with the specified pattern.
Output (after executing the node): We get the filtered table.
Type of insights: Let us consider data collected on the share market. The data contains a column that has values corresponding to the sector the share belongs to. As the share market metrics of one sector's stocks could be quite different from another's, we can filter the rows by sector using the 'Row Filter' and then perform the analysis.

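The three filtering modes can be imitated in Python to see what each one keeps. This is a sketch under assumptions, not the node's implementation: rows are modelled as (row ID, values) pairs, row numbers are taken as 1-based, the row ID pattern must match the whole ID, and every name and value below is hypothetical.

```python
import re

# Hypothetical table: (row ID, column values); None marks a missing value.
rows = [
    ("Row0", {"sector": "Banking", "price": 120.0}),
    ("Row1", {"sector": "Pharma", "price": 310.5}),
    ("Row2", {"sector": "Banking", "price": None}),
    ("Row3", {"sector": "IT", "price": 95.0}),
]


def filter_by_value(rows, column, matches, include=True):
    """Include (or exclude) rows whose value in `column` satisfies `matches`."""
    return [(rid, r) for rid, r in rows if bool(matches(r[column])) == include]


def filter_by_number(rows, first, last=None, include=True):
    """Include (or exclude) rows numbered first..last; last=None means end of table."""
    end = len(rows) if last is None else last
    return [row for i, row in enumerate(rows, start=1)
            if (first <= i <= end) == include]


def filter_by_row_id(rows, pattern, include=True, case_sensitive=True):
    """Include (or exclude) rows whose row ID fully matches the regular expression."""
    flags = 0 if case_sensitive else re.IGNORECASE
    regex = re.compile(pattern, flags)
    return [(rid, r) for rid, r in rows
            if (regex.fullmatch(rid) is not None) == include]


banking = filter_by_value(rows, "sector", lambda s: s == "Banking")
priced = filter_by_value(rows, "price", lambda p: p is None, include=False)
tail = filter_by_number(rows, first=2)
by_id = filter_by_row_id(rows, r"row[01]", case_sensitive=False)
print([rid for rid, _ in banking], [rid for rid, _ in priced])
print([rid for rid, _ in tail], [rid for rid, _ in by_id])
```

The `include` flag mirrors the include/exclude choice available in each mode, so the same criterion can be used either to keep the matching rows or to drop them.
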
