Computer in Economics

The document provides an overview of logical functions in Excel, focusing on the IF, AND, and OR functions for decision-making. It also explains basic formulas and functions like SUM, AVERAGE, COUNT, and others, detailing their syntax and usage. Additionally, it introduces advanced functions such as SUBTOTAL, MOD, POWER, CEILING, and FLOOR, emphasizing their applications in data analysis.


ECO 108

INTRO TO COMPUTER PROGRAMMING FOR ECONOMICS II


Session 1
UNDERSTANDING LOGICAL FUNCTIONS
Logical functions provide decision-making tools for information in a spreadsheet. They
allow you to look at the contents of a cell, or to perform a calculation, and then test that result
against a required figure or value. You can then use the IF logical function to determine
which calculation to perform or action to take depending on the outcome of the test.
The IF Function
The IF function is the key logical function used for decision making. It takes the format:
=IF(condition, value_if_true, value_if_false)
For example, you could use the following formula:
=IF(B2 > 400, "High", "Low")
where:
B2 > 400 is the condition being tested (this could be translated as "Is the value in cell B2 greater than 400?")
"High" is the text to display if B2 is greater than 400 (the result of the test is yes, or TRUE)
"Low" is the text to display if B2 is less than or equal to 400 (the result of the test is no, or FALSE)

The AND Function
The AND function is used to compare more than one condition. It returns TRUE only if all of the conditions are met, and takes the format:
=AND(condition1, condition2, …)
For example, you could use the following formula:
=AND(B2 > 400, C2 < 300)
where:
B2 > 400 is the first condition being tested
C2 < 300 is the second condition being tested
This will only return TRUE if the value in cell B2 is greater than 400 and the value in cell C2 is less than 300. In all other situations, the result will be FALSE.

The OR Function
The OR function is also used to compare more than one condition. It returns TRUE if any of the conditions are met, and takes the format:
=OR(condition1, condition2, …)
For example, you could use the following formula:
=OR(B2 > 400, C2 < 300)
where:
B2 > 400 is the first condition being tested
C2 < 300 is the second condition being tested
This will return TRUE if either the value in cell B2 is greater than 400 or the value in cell C2 is less than 300. The result will be FALSE only if neither of the conditions is met.
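The decision logic of these three functions can be sketched outside Excel as well. Below is a minimal Python sketch; the variables b2 and c2 are hypothetical stand-ins for the spreadsheet cells B2 and C2:

```python
# Python sketch of the Excel logical functions above.
# b2 and c2 stand in for the spreadsheet cells B2 and C2.
b2, c2 = 450, 250

# =IF(B2 > 400, "High", "Low")
if_result = "High" if b2 > 400 else "Low"

# =AND(B2 > 400, C2 < 300): TRUE only if all conditions hold
and_result = (b2 > 400) and (c2 < 300)

# =OR(B2 > 400, C2 < 300): TRUE if any condition holds
or_result = (b2 > 400) or (c2 < 300)

print(if_result, and_result, or_result)  # High True True
```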

Formulas & Functions in Excel


What is an Excel Formula?
In Microsoft Excel, a formula is an expression that operates on values in a range of cells.
These formulas return a result, even when that result is an error. Excel formulas enable you to perform calculations such as addition, subtraction, multiplication, and division. In addition to these, you can find averages and calculate percentages in Excel for a range of cells, manipulate date and time values, and do a lot more.
Formulae in Excel
Another term closely related to "formula" is "function". The two words are sometimes used interchangeably. They are closely related, yet different: a formula is any expression that begins with an equal sign, while functions are pre-defined formulas used to perform complex calculations that would be tedious to build manually. Functions in Excel have names that reflect their intended use.

In Excel, formulas are used to perform calculations or manipulate data, while functions are
pre-defined formulas that simplify complex tasks. Here's a breakdown of some common
formulas and functions:
Basic Formulas & Functions:
SUM: Adds numbers in a range of cells (e.g., =SUM(A1:A10))
AVERAGE: Calculates the average of numbers in a range (e.g., =AVERAGE(B1:B10))
CONCATENATE: Combines text from multiple cells into one (e.g., =CONCATENATE(A1,
" ", B1))
IF: Performs a logical test and returns one value if true, and another if false (e.g., =IF(A1>10,
"Yes", "No"))
VLOOKUP: Searches for a value in a column and returns a value from the same row
(e.g., =VLOOKUP(A1,Sheet2!A:B,2,FALSE))
COUNT: Counts cells in a range that contain numbers (e.g., =COUNT(A1:A10))
COUNTIF: Counts cells in a range that meet a specific condition
(e.g., =COUNTIF(A1:A10,">10"))
MAX: Finds the largest value in a range (e.g., =MAX(A1:A10))
MIN: Finds the smallest value in a range (e.g., =MIN(A1:A10))
DATE: Creates a date value (e.g., =DATE(2025,3,19))
TRIM: Removes extra spaces from text (e.g., =TRIM(A1))
COUNTA: Counts cells that are not empty (e.g., =COUNTA(A1:A10))
The example below shows how we have used the multiplication formula manually with the
‘*’ operator.
Sample Formula: "=A2*B2"

Fig: Microsoft Excel Formula
This example below shows how we have used the function - ‘PRODUCT’ to perform
multiplication. As you can see, we didn’t use the mathematical operator here.
Sample Formula: "=PRODUCT(A2,B2)"

Excel formulas and functions help you perform your tasks efficiently and save time.
Let's proceed and learn the different types of functions available in Excel and use the relevant formulas as and when required.

Top Excel Formulas and Functions


There are plenty of Excel formulas and functions, depending on what kind of operation you want to perform on the dataset. We will look into formulas and functions for mathematical operations, character-text functions, date and time, SUMIF-COUNTIF, and a few lookup functions.
1. SUM
The SUM() function, as the name suggests, gives the total of the selected range of cell values.
It performs the mathematical operation which is addition. Here’s an example of it below:
=SUM(C2:C4)

Fig: Sum function in Excel
As you can see above, to find the total amount of sales for every unit, we had to simply type
in the function “=SUM(C2:C4)”. This automatically adds up 300, 385, and 480. The result is
stored in C5.
2. AVERAGE
The AVERAGE() function calculates the average of the selected range of cell values. As seen in the example below, to find the average of the total sales, you simply type in:
=AVERAGE(C2, C3, C4)

Fig: Average function in Excel


It automatically calculates the average, and you can store the result in your desired location.
3. COUNT
The function COUNT() counts the total number of cells in a range that contain a number. It does not count blank cells or cells that hold data in any format other than numeric.
=COUNT(C1:C4)

Fig: Microsoft Excel Function - Count
As seen above, here we are counting from C1 to C4, that is, four cells. But since the COUNT function takes only cells with numerical values into consideration, the answer is 3, as the cell containing "Total Sales" is omitted here.
If you are required to count all the cells with numerical values, text, and any other data
format, you must use the function ‘COUNTA()’. However, COUNTA() does not count any
blank cells.
To count the number of blank cells present in a range of cells, COUNTBLANK() is used.

4. SUBTOTAL
Moving ahead, let's now understand how the subtotal function works. The SUBTOTAL() function returns a subtotal in a database. Depending on what you want, you can select average, count, sum, min, max, and others. Let's have a look at two such examples.

Fig: Subtotal function in Excel


In the example above, we have performed the subtotal calculation on cells ranging from A2 to
A4. As you can see, the function used is

=SUBTOTAL(1, A2:A4)
In the subtotal list, "1" refers to AVERAGE. Hence, the above function will give the average of A2:A4, and the answer is 11, which is stored in C5. Similarly,
=SUBTOTAL(4, A2:A4)
This selects the cell with the maximum value from A2 to A4, which is 12. Incorporating "4" in the function returns the maximum.

Fig: Subtotal function in Excel


5. MODULUS
The MOD() function works on returning the remainder when a particular number is divided
by a divisor. Let’s now have a look at the examples below for better understanding.
In the first example, we have divided 10 by 3. The remainder is calculated using the function, and the result is stored in B2. We could also directly type "=MOD(10,3)", as it gives the same answer.
=MOD(A2,3)

Fig: Modulus function in Excel


Similarly, here, we have divided 12 by 4. The remainder is 0, which is stored in B3.

Fig: Modulus function in Excel


6. POWER
The function POWER() returns the result of a number raised to a certain power. Let's have a look at the examples shown below:

Fig: Power function in Excel


As you can see above, to raise the value 10 stored in A2 to the power 3, we have to type:
=POWER(A2,3)
This is how power function works in Excel.
7. CEILING
Next, we have the ceiling function. The CEILING() function rounds a number up to its
nearest multiple of significance.

Fig: Ceiling function in Excel


=CEILING(A2,5)
The nearest multiple of 5 above 35.316 is 40.
8. FLOOR
Contrary to the Ceiling function, the floor function rounds a number down to the nearest
multiple of significance.

Fig: Floor function in Excel
=FLOOR(A2,5)
The nearest multiple of 5 below 35.316 is 35.
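These four functions map directly onto Python's arithmetic operators and math module; the sketch below mirrors the example values used above (35.316 rounded to a multiple of 5):

```python
import math

# =MOD(10,3) -> remainder of 10 divided by 3
print(10 % 3)            # 1

# =POWER(10,3) -> 10 raised to the power 3
print(10 ** 3)           # 1000

# =CEILING(35.316, 5) -> round up to the nearest multiple of 5
print(math.ceil(35.316 / 5) * 5)   # 40

# =FLOOR(35.316, 5) -> round down to the nearest multiple of 5
print(math.floor(35.316 / 5) * 5)  # 35
```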
SUM
The first Excel function you should be familiar with is the one that performs the basic
arithmetic operation of addition:
SUM(number1, [number2], …)
In the syntax of all Excel functions, an argument enclosed in [square brackets] is optional; other arguments are required. Meaning, your Sum formula should include at least one argument: a number, a reference to a cell, or a range of cells. For example:
=SUM(B2:B6) - adds up values in cells B2 through B6.
=SUM(B2, B6) - adds up values in cells B2 and B6.
If necessary, you can perform other calculations within a single formula, for example, add up
values in cells B2 through B6, and then divide the sum by 5:
=SUM(B2:B6)/5
To sum with conditions, use the SUMIF function: in the 1st argument, you enter the range of
cells to be tested against the criteria (A2:A6), in the 2nd argument - the criteria itself (D2),
and in the last argument - the cells to sum (B2:B6):
=SUMIF(A2:A6, D2, B2:B6)
In your Excel worksheets, the formulas may look something similar to this:
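The SUMIF pattern, summing one range conditionally on a matching criteria range, can be sketched in Python as well. The item names and amounts below are hypothetical stand-ins for the cells A2:A6, B2:B6 and D2:

```python
# Sketch of =SUMIF(A2:A6, D2, B2:B6): sum B-values whose A-value matches a criterion.
criteria_range = ["apples", "bananas", "apples", "cherries", "apples"]  # A2:A6
sum_range      = [120, 90, 75, 30, 60]                                  # B2:B6
criterion      = "apples"                                               # D2

total = sum(b for a, b in zip(criteria_range, sum_range) if a == criterion)
print(total)  # 255
```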

Tip. The fastest way to sum a column or row of numbers is to select a cell next to the numbers you want to sum (the cell immediately below the last value in the column or to the right of the last number in the row), and click the AutoSum button on the Home tab, in the Editing group. Excel will insert a SUM formula for you automatically.

Useful resources:
Excel Sum formula examples - formulas to total a column, rows, only filtered (visible) cells,
or sum across sheets.
Excel AutoSum - the fastest way to sum a column or row of numbers.
SUMIF in Excel - formula examples to conditionally sum cells.
SUMIFS in Excel - formula examples to sum cells based on multiple criteria.
AVERAGE
The Excel AVERAGE function does exactly what its name suggests, i.e. finds an average, or
arithmetic mean, of numbers. Its syntax is similar to SUM's:
AVERAGE(number1, [number2], …)
Having a closer look at the formula from the previous section (=SUM(B2:B6)/5), what does
it actually do? Sums values in cells B2 through B6, and then divides the result by 5. And what
do you call adding up a group of numbers and then dividing the sum by the count of those
numbers? Yep, an average!
The Excel AVERAGE function performs these calculations behind the scenes. So, instead of
dividing sum by count, you can simply put this formula in a cell:
=AVERAGE(B2:B6)
To average cells based on a condition, use the following AVERAGEIF formula, where A2:A6 is the criteria range, D3 is the criterion, and B2:B6 are the cells to average:
=AVERAGEIF(A2:A6, D3, B2:B6)

Useful resources:
Excel AVERAGE - average cells with numbers.
Excel AVERAGEA - find an average of cells with any data (numbers, Boolean and text
values).
Excel AVERAGEIF - average cells based on one criterion.
Excel AVERAGEIFS - average cells based on multiple criteria.
How to calculate weighted average in Excel
How to find moving average in Excel

MAX & MIN
The MAX and MIN formulas in Excel get the largest and smallest value in a set of numbers,
respectively. For our sample data set, the formulas will be as simple as:
=MAX(B2:B6)
=MIN(B2:B6)

Useful resources:
MAX function - find the highest value.
MAX IF formula - get the highest number with conditions.
MAXIFS function - get the largest value based on multiple criteria.
MIN function - return the smallest value in a data set.
MINIFS function - find the smallest number based on one or several conditions.
COUNT & COUNTA
If you are curious to know how many cells in a given range contain numeric
values (numbers or dates), don't waste your time counting them by hand. The Excel COUNT
function will bring you the count in a heartbeat:
COUNT(value1, [value2], …)
While the COUNT function deals only with those cells that contain numbers, the COUNTA
function counts all cells that are not blank, whether they contain numbers, dates, times, text,
logical values of TRUE and FALSE, errors or empty text strings (""):
COUNTA(value1, [value2], …)
For example, to find out how many cells in column B contain numbers, use this formula:
=COUNT(B:B)
To count all non-empty cells in column B, go with this one:
=COUNTA(B:B)
In both formulas, you use the so-called "whole column reference" (B:B) that refers to all the
cells within column B.
The following screenshot shows the difference: while COUNT processes only numbers, COUNTA outputs the total number of non-blank cells in column B, including the text value in the column header.
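The COUNT/COUNTA distinction can be sketched in Python, treating a column as a list in which None stands for a blank cell (the column values are hypothetical):

```python
# Column B as a Python list: None stands for a blank cell.
column_b = ["Sales", 300, 385, None, 480, ""]

# =COUNT(B:B): only numeric values are counted
count = sum(1 for v in column_b
            if isinstance(v, (int, float)) and not isinstance(v, bool))

# =COUNTA(B:B): every non-blank cell is counted (numbers, text, even "")
counta = sum(1 for v in column_b if v is not None)

print(count, counta)  # 3 5
```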

Useful resources:
Excel COUNT function - a quick way to count cells with numbers.
Excel COUNTA function - count cells with any values (non-empty cells).
Excel COUNTIF function - count cells that meet one condition.
Excel COUNTIFS function - count cells with several criteria.
IF
Judging by the number of IF-related comments on our blog, it's the most popular function in
Excel. In simple terms, you use an IF formula to ask Excel to test a certain condition and
return one value or perform one calculation if the condition is met, and another value or
calculation if the condition is not met:
IF(logical_test, [value_if_true], [value_if_false])
For example, the following IF statement checks whether the order is completed (i.e. there is a value in column C) or not. To test if a cell is not blank, you use the "not equal to" operator (<>) in combination with an empty string (""). As a result, if cell C2 is not empty, the formula returns "Yes", otherwise "No":
=IF(C2<>"", "Yes", "No")

Useful resources:
IF function in Excel with formula examples
How to use nested IFs in Excel
IF formulas with multiple AND/OR conditions

TRIM
If your obviously correct Excel formulas return just a bunch of errors, one of the first things to check is extra spaces in the referenced cells. (You may be surprised to know how many leading, trailing and in-between spaces lurk unnoticed in your sheets until something goes wrong!)
There are several ways to remove unwanted spaces in Excel, with the TRIM function being
the easiest one:
TRIM(text)
For example, to trim extra spaces in column A, enter the following formula in cell B1, and then copy it down the column:
=TRIM(A1)
It will eliminate all extra spaces in cells except a single space character between words:
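Excel's TRIM strips leading and trailing spaces and collapses runs of spaces between words to one. A Python sketch of the same idea (note: this sketch also collapses tabs and newlines, which Excel's TRIM does not):

```python
def trim(text):
    # Mirrors =TRIM(A1): strip outer whitespace, collapse inner runs to one space.
    return " ".join(text.split())

print(trim("  too   many    spaces  "))  # "too many spaces"
```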

Useful resources:
Excel TRIM function with formula examples
How to delete line breaks and non-printing characters
How to remove non-breaking spaces (&nbsp;)
How to delete a specific non-printing character
LEN
Whenever you want to know the number of characters in a certain cell, LEN is the function to
use:
LEN(text)
Wish to find out how many characters are in cell A2? Just type the below formula into
another cell:
=LEN(A2)

Please keep in mind that the Excel LEN function counts absolutely all characters including
spaces:
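Python's built-in len mirrors Excel's LEN, counting every character including spaces (the cell content below is hypothetical):

```python
# =LEN(A2) counts all characters, spaces included.
cell_a2 = "Hello world "
print(len(cell_a2))  # 12 (11 letters plus one trailing space... and one inner space)
```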

Want to get the total count of characters in a range of cells, or count only specific characters? Please check out the following resources.
Useful resources:
Excel LEN formulas to count characters in a cell
Count the number of characters in cells and ranges
AND & OR
These are the two most popular logical functions to check multiple criteria. The difference is
how they do this:
AND returns TRUE if all conditions are met, FALSE otherwise.
OR returns TRUE if any condition is met, FALSE otherwise.
While rarely used on their own, these functions come in very handy as part of bigger
formulas.
For example, to check the test results in columns B and C and return "Pass" if both are greater than 60, "Fail" otherwise, use the following IF formula with an embedded AND statement:
=IF(AND(B2>60, C2>60), "Pass", "Fail")
If it's sufficient to have just one test score greater than 60 (either test 1 or test 2), embed the OR statement:
=IF(OR(B2>60, C2>60), "Pass", "Fail")

Useful resources:
Excel AND function with formula examples
Excel OR function with formula examples
CONCATENATE
In case you want to take values from two or more cells and combine them into one cell, use
the concatenate operator (&) or the CONCATENATE function:
CONCATENATE(text1, [text2], …)
For example, to combine the values from cells A2 and B2, just enter the following formula in
a different cell:
=CONCATENATE(A2, B2)
To separate the combined values with a space, type the space character (" ") in the arguments
list:
=CONCATENATE(A2, " ", B2)
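String concatenation in Python mirrors both the & operator and CONCATENATE; the cell contents below are hypothetical:

```python
a2, b2 = "John", "Smith"  # hypothetical contents of cells A2 and B2

# =CONCATENATE(A2, B2) or =A2&B2
print(a2 + b2)        # JohnSmith

# =CONCATENATE(A2, " ", B2): a space typed into the argument list
print(a2 + " " + b2)  # John Smith
```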

Useful resources:
How to concatenate in Excel - formula examples to combine text strings, cells and columns.
CONCAT function - newer and improved function to combine the contents of multiple cells
into one cell.
TODAY & NOW

To see the current date and time whenever you open your worksheet without having to
manually update it on a daily basis, use either:
=TODAY() to insert today's date in a cell.
=NOW() to insert the current date and time in a cell.
The beauty of these functions is that they don't require any arguments at all, you type the
formulas exactly as written above.

Useful resources:
Excel NOW function - how to insert the current date and time as a dynamic value.
How to insert today's date in Excel - different ways to enter the current date in Excel: as an
unchangeable time stamp or automatically updatable date and time.
Excel date functions with formula examples - formulas to convert date to text and vice versa,
extract a day, month or year from a date, calculate the difference between two dates, and a lot
more.
Best practices for writing Excel formulas
Now that you are familiar with the basic Excel formulas, these tips will give you some
guidance on how to use them most effectively and avoid common formula errors.
Do not enclose numbers in double quotes
Any text included in your Excel formulas should be enclosed in "quotation marks". However,
you should never do that to numbers, unless you want Excel to treat them as text values.
For example, to check the value in cell B2 and return 1 for "Passed", 0 otherwise, you put the
following formula, say, in C2:
=IF(B2="pass", 1, 0)
Copy the formula down to other cells and you will have a column of 1's and 0's that can be
calculated without a hitch.
Now, see what happens if you double quote the numbers:
=IF(B2="pass", "1", "0")
At first sight, the output looks normal: the same column of 1's and 0's. Upon a closer look, however, you will notice that the resulting values are left-aligned in cells by default, meaning those are numeric strings, not numbers! If someone later tries to calculate those 1's and 0's, they might end up pulling their hair out trying to figure out why a 100% correct Sum or Count formula returns nothing but zero.

Don't format numbers in Excel formulas


Please remember this simple rule: numbers supplied to your Excel formulas should be
entered without any formatting like decimal separator or dollar sign. In North America and
some other countries, comma is the default argument separator, and the dollar sign ($) is used
to make absolute cell references. Using those characters in numbers may just drive your
Excel crazy :) So, instead of typing $2,000, simply type 2000, and then format the output
value to your liking by setting up a custom Excel number format.
Match all opening and closing parentheses
When creating a complex Excel formula with one or more nested functions, you will have to
use more than one set of parentheses to define the order of calculations. In such formulas, be
sure to pair the parentheses properly so that there is a closing parenthesis for every opening
parenthesis. To make the job easier for you, Excel shades parenthesis pairs in different colors
when you enter or edit a formula.
Copy the same formula to other cells instead of re-typing it
Once you have typed a formula into a cell, there is no need to re-type it over and over again.
Simply copy the formula to adjacent cells by dragging the fill handle (a small square at the
lower right-hand corner of the cell). To copy the formula down the whole column, position the mouse pointer over the fill handle and double-click when it turns into a plus sign.

Note. After copying the formula, make sure that all cell references are correct. Cell
references may change depending on whether they are absolute (do not change)
or relative (change).
For the detailed step-by-step instructions, please see How to copy formulas in Excel.
How to delete formula, but keep calculated value
When you remove a formula by pressing the Delete key, a calculated value is also deleted.
However, you can delete only the formula and keep the resulting value in the cell. Here's
how:
Select all cells with your formulas.
Press Ctrl + C to copy the selected cells.
Right-click the selection, and then click Paste Special > Values to paste the calculated values back to the selected cells. Or, press the Paste Special shortcut: Shift + F10 and then V.
For the detailed steps with screenshots, please see How to replace formulas with their values
in Excel.
Excel formulas are essential for several reasons:
Efficiency: They automate repetitive tasks, saving time and reducing manual errors.
Data analysis: Excel's range of formulas enables sophisticated data analysis, crucial for
informed decision-making.
Accuracy: Formulas ensure consistent and accurate results, essential in fields like finance
and accounting.
Data manipulation: They allow for efficient sorting, filtering, and manipulation of large
datasets.
Accessibility: Excel provides a user-friendly platform, making complex data analysis
accessible to non-technical users.
Versatility: Widely used across various industries, proficiency in Excel formulas enhances
employability and career advancement.
Customization: Excel offers customizable formula options to meet specific data handling
needs.

Session 2
DESCRIPTIVE ANALYSIS

Statistical analysis usually begins with a descriptive analysis, also known as descriptive analytics or descriptive statistics. It helps you think about how to utilize your data, helps you identify exceptions and mistakes, and shows how variables are related, putting you in a position to plan future statistical research.
Descriptive analysis keeps raw data in a format that makes it easy to understand and analyze, i.e., rearranging, sorting, and transforming data so that it can tell you something useful about what it contains.
Descriptive analysis is one of the most crucial phases of statistical data analysis. It provides
you with a conclusion about the distribution of your data and aids in detecting errors and
outliers. It lets you spot patterns between variables, preparing you for future statistical
analysis.
What is Descriptive Analysis?
Descriptive analysis is a type of data analysis that helps describe, show, or summarize data points in a constructive way, so that patterns may emerge that satisfy all of the conditions of the data.
It is the technique of identifying patterns and links by utilizing recent and historical data.
Because it identifies patterns and associations without going any further, it is frequently
referred to as the most basic data analysis.
This analysis is beneficial when describing change over time. It uses patterns as a jumping-off point for further research to inform decision-making. When done systematically, descriptive analysis is neither tricky nor tiresome.
Data aggregation and mining are two methods used in descriptive analysis to generate
historical data. Information is gathered and sorted in data aggregation to simplify large
datasets. Data mining is the next analytical stage, which entails searching the data for patterns
and significance. Data analytics and data analysis are closely related processes that involve
extracting insights from data to make informed decisions.

Types of Descriptive Analysis


The four types of descriptive analysis methods are:
01. Measurements of Frequency
Understanding how often a particular event or reaction is likely to occur is crucial for
descriptive analysis. The main goal of frequency measurements is to provide something like a
count or a percentage.
02. Measures of Central Tendency
Finding the central (or average) tendency or response is crucial in descriptive analysis. Three
standards—mean, median, and mode—are used to calculate central tendency.
03. Measures of Dispersion
At times, understanding how data is distributed throughout a range is crucial. This kind of
distribution may be measured using dispersion metrics like range or standard deviation.
04. Measures of Position
Another aspect of descriptive analysis is finding a value's or response's position relative to other values. In this area, metrics like quartiles and percentiles are beneficial.
Measures of central tendency
Measures of central tendency estimate the center, or average, of a data set. The mean, median, and mode are three ways of finding the average.
Here we will demonstrate how to calculate the mean, median, and mode using the first six responses of our survey.
The mean, or M, is the most commonly used method for finding the average.
To find the mean, simply add up all response values and divide the sum by the total number of responses. The total number of responses or observations is called N.
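Using the six survey responses that appear later in this section (15, 3, 12, 0, 24, 3), the three measures can be computed with Python's statistics module:

```python
import statistics

responses = [15, 3, 12, 0, 24, 3]  # first six survey responses
n = len(responses)                 # N = 6

print(statistics.mean(responses))    # 9.5
print(statistics.median(responses))  # 7.5 (average of the two middle values, 3 and 12)
print(statistics.mode(responses))    # 3   (the most frequent value)
```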

Measures of variability
Measures of variability give you a sense of how spread out the response values are. The
range, standard deviation and variance each reflect different aspects of spread.
Range
The range gives you an idea of how far apart the most extreme response scores are. To find
the range, simply subtract the lowest value from the highest value.
Range of visits to the library in the past year
Ordered data set: 0, 3, 3, 12, 15, 24
Range: 24 – 0 = 24
Standard deviation
The standard deviation (s or SD) is the average amount of variability in your dataset. It tells
you, on average, how far each score lies from the mean. The larger the standard deviation, the
more variable the data set is.
In other words, the standard deviation (σ) is a measure of how dispersed the data are in relation to the mean. A low, or small, standard deviation indicates data are clustered tightly around the mean, and a high, or large, standard deviation indicates data are more spread out. A standard deviation close to zero indicates that data points are very close to the mean, whereas a larger standard deviation indicates data points are spread further away from the mean.

In the image, the curve on top is more spread out and therefore has a higher standard deviation, while the curve below is more clustered around the mean and therefore has a lower standard deviation.

There are six steps for finding the standard deviation:

1. List each score and find the mean.
2. Subtract the mean from each score to get the deviation from the mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by N – 1.
6. Find the square root of the number you found.

In the table below, Steps 1 through 4 are completed.

Raw data    Deviation from mean    Squared deviation
15          15 – 9.5 = 5.5         30.25
3           3 – 9.5 = -6.5         42.25
12          12 – 9.5 = 2.5         6.25
0           0 – 9.5 = -9.5         90.25
24          24 – 9.5 = 14.5        210.25
3           3 – 9.5 = -6.5         42.25
M = 9.5     Sum = 0                Sum of squares = 421.5

Step 5: 421.5/5 = 84.3
Step 6: √84.3 = 9.18
From learning that s = 9.18, you can say that on average, each score deviates from the mean
by 9.18 points.
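The six steps above can be sketched in Python; note the N − 1 divisor (the sample standard deviation), which reproduces the worked example:

```python
import math

data = [15, 3, 12, 0, 24, 3]
n = len(data)

mean = sum(data) / n                     # Step 1: 9.5
deviations = [x - mean for x in data]    # Step 2: deviations from the mean
squared = [d ** 2 for d in deviations]   # Step 3: squared deviations
sum_sq = sum(squared)                    # Step 4: 421.5
variance = sum_sq / (n - 1)              # Step 5: 421.5 / 5 = 84.3
sd = math.sqrt(variance)                 # Step 6: square root

print(round(sd, 2))  # 9.18
```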
Variance
The variance is the average of squared deviations from the mean. Variance reflects the degree
of spread in the data set. The more spread the data, the larger the variance is in relation to the
mean.
To find the variance, simply square the standard deviation. The symbol for variance is s².
Variance of visits to the library in the past year
Data set: 15, 3, 12, 0, 24, 3
s = 9.18
s² = 84.3
Standard Deviation

Example Standard Deviation Curves


To calculate the (population) standard deviation, use the following formula:
σ = √( Σ (x_i − µ)² / N )
In this formula, σ is the standard deviation, x_i is each individual data point in the set, µ is the mean, and N is the total number of data points. For each data point, subtract the mean from x_i and square the result, starting with x_1 (the first data point); this process is continued all the way through x_10 (the last data point, if you have 10 of them).

Skewness:
Skewness describes the asymmetry of a distribution (leaning left or right), while kurtosis measures the "tailedness" or peakedness of a distribution compared to a normal distribution.

 Symmetry: A distribution with zero skewness is perfectly symmetrical, meaning
the left and right sides of the distribution are mirror images.
 Positive Skewness: Indicates a longer or fatter right tail, suggesting a tendency
towards higher values.
 Negative Skewness: Indicates a longer or fatter left tail, suggesting a tendency
towards lower values.
 Interpretation: Skewness between -0.5 and 0.5 indicates a nearly symmetrical
distribution.
 Examine the spread of your data to determine whether your data appear to be
skewed
 When data are skewed, the majority of the data are located on the high or low side
of the graph. Often, skewness is easiest to detect with a histogram or boxplot.


 Right-skewed


 Left-skewed
 The histogram with right-skewed data shows wait times. Most of the wait times are
relatively short, and only a few wait times are long. The histogram with left-skewed
data shows failure time data. A few items fail immediately, and many more items
fail later.
 Identify outliers
 Outliers, which are data values that are far away from other data values, can
strongly affect the results of your analysis. Often, outliers are easiest to identify on
a boxplot.
 On a boxplot, asterisks (*) denote outliers.
Kurtosis:
 Tailedness: Kurtosis measures the "tailedness" of a distribution, focusing on the
distribution's peak and the weight of its tails.
 Leptokurtic (High Kurtosis): Heavy tails with more outliers; a distribution that is
more peaked than a normal distribution.
 Platykurtic (Low Kurtosis): Light tails with fewer outliers; a distribution that is
flatter than a normal distribution.
 Mesokurtic (Near-zero Excess Kurtosis): Close to a normal distribution.
 Interpretation:
An excess kurtosis greater than +2 suggests a distribution that is too peaked, while a value
less than -2 indicates one that is too flat.
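Both shape measures can be computed from a data set's central moments; a minimal Python sketch (moment-based formulas; the data and function names are illustrative):

```python
def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (zero for symmetric data)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3 (a normal distribution gives ~0)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

right_skewed = [1, 2, 2, 3, 10]    # long right tail
print(skewness(right_skewed) > 0)  # True: positive skew
```

A positive result here indicates a longer right tail, exactly as described in the bullets above.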
How to Conduct a Descriptive Analysis

USING E-VIEWS
Step 1. Open a new workfile in EViews
Step 2. Load the data
Step 3. Highlight the variables
Step 4. Click View
Step 5. Go to Descriptive Statistics
Step 6. Go to Individual Samples
Results are shown thus:
Sample summary of descriptive statistics

              REW        GDP        IND        FDI        OPN       CO2
Mean          57.24032   20.79657   37.40677   19.93721   6.314136  5.840115
Median        58.74000   20.81679   37.37491   19.31145   6.335495  5.797576
Maximum       63.45000   21.24648   39.09078   22.07906   7.285695  6.415751
Minimum       47.32000   20.35905   32.75358   16.51014   5.272603  5.270432
Std. Dev.     5.039758   0.263450   1.218966   1.806833   0.610753  0.396611
Skewness      -0.730468  -0.073151  -1.721139  -0.225644  0.019655  0.035982
Kurtosis      2.268185   1.838233   8.172105   1.651418   1.567781  1.580057
Jarque-Bera   20.69164   10.62608   299.1496   15.67310   15.90918  15.66597
Probability   0.000032   0.004927   0.000000   0.000395   0.000351  0.000396
Sum           10646.70   3868.163   6957.659   3708.320   1174.429  1086.261
Sum Sq. Dev.  4698.844   12.84006   274.8876   603.9595   69.00858  29.10061
Observations  186        186        186        186        186       186

Jarque-Bera (JB)
The Jarque-Bera (JB) test is a goodness-of-fit test used to determine if sample data has
skewness and kurtosis matching a normal distribution; a higher JB statistic indicates a greater
deviation from normality. The JB test assesses whether a dataset deviates significantly from a
normal distribution by examining its skewness and kurtosis.
The null hypothesis assumes that the data comes from a normal distribution, meaning the
skewness and excess kurtosis are zero. The JB statistic is always non-negative. A value far
from zero suggests the data is not normally distributed.
 Interpreting the Results:
 JB Statistic: A higher JB statistic indicates a greater deviation from
normality.
 P-value: The p-value represents the probability of observing the test statistic,
assuming the data is normally distributed. A lower p-value suggests stronger
evidence against the null hypothesis (i.e., the data is not normally distributed).
 Significance Level: Typically, a significance level of 0.05 (or 5%) is used. If
the p-value is less than 0.05, the null hypothesis is rejected, implying the data
is not normally distributed.
Limitations:
 The JB test is most appropriate for large sample sizes.
 It is sensitive to outliers.
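The JB statistic combines the two shape measures as JB = n/6 × (S² + (K − 3)²/4). As a quick check, the REW column of the table above can be reproduced (a sketch; the helper name is illustrative):

```python
# Jarque-Bera statistic from sample skewness (S) and kurtosis (K):
# JB = n/6 * (S**2 + (K - 3)**2 / 4)
def jarque_bera(n, skew, kurt):
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

# Skewness and kurtosis reported for REW in the table above (n = 186)
jb = jarque_bera(n=186, skew=-0.730468, kurt=2.268185)
print(round(jb, 2))  # 20.69, matching the table's Jarque-Bera value for REW
```

Because REW's skewness is clearly nonzero and its kurtosis is below 3, the statistic is far from zero and the reported p-value (0.000032) rejects normality.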

Session 3
DATA TRANSFORMATION TECHNIQUES
What is the data transformation process?
Data transformation is the process of converting data structures, formats, and/or types to
make it more accessible and usable. For example, a business might extract raw data from
various systems, transform it using a method such as normalization, and then load it into a
centralized repository for storage. Individual projects will use different data transformation
steps depending on their goals.
Data transformation allows raw data to be harnessed and used in data analytics and data
science. This essential step enhances the quality, reliability, and accessibility of data, making
it suitable for analysis, visualization, and training machine learning models.
There are numerous data transformation techniques available, each catering to different
project requirements and dataset characteristics. To unify, standardize, and refine raw data,
the data transformation process includes cleansing, formatting, and removing unnecessary
data.
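As an illustration of the normalization mentioned above, a minimal min-max rescaling sketch in Python (the data and function name are illustrative):

```python
def min_max_normalize(values):
    """Rescale values to the [0, 1] range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

raw = [50, 20, 30, 90, 10]
print(min_max_normalize(raw))  # [0.5, 0.125, 0.25, 1.0, 0.0]
```

After this transformation, variables measured on very different scales become directly comparable, which is a common requirement before analysis or modelling.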
Five Steps of Standard Data Transformation Process:
1. Data discovery: The data transformation process starts with data experts using data
profiling tools to identify the original data sources' structures, formats, and
characteristics. This helps them choose the best data transformation techniques to use
for the next steps.
2. Data mapping: The experts then map out how to transform the data. They define the
relationships between data elements from different sources and construct a plan for
aggregating and transforming them. Data mapping also helps professionals maintain a
clear record of changes through metadata.
3. Code generation: Data analysts use the data map to guide the generation of
transformation code. The generation of these codes is often automated by using
scripts, algorithms, and data transformation platforms.
4. Code execution: The data transformation is now executed. During this phase, the data
is gathered from sources, transformed according to the rules defined in the mapping
stage, and sent to a storage location. Batch or real-time processing pipelines are used
for this step.
5. Review: The last step is reviewing the transformed data to ensure it is accurate,
complete, and meets the previously defined transformation requirements. If there are
any discrepancies, anomalies, or errors, data experts will use further data
transformations to correct or remove them.
Types of data transformation
1. Constructive Transformations
Constructive transformations create new data attributes or features within the dataset. They
enhance existing features or generate new ones to improve the quality and effectiveness of
data analysis or machine learning models. Examples include feature engineering, aggregating
data points, or deriving new metrics. These transformations add value to the dataset by
generating additional information or providing a better representation of existing data,
making it more suitable for analysis.
2. Destructive Transformations
Destructive transformations remove unnecessary or irrelevant data from the dataset. This
streamlines the information, making it more focused and efficient. Destructive data
transformation types include data cleaning (removing duplicates or correcting errors), dealing
with missing values (imputation or deletion), and feature selection (eliminating redundant or
irrelevant features). By reducing noise and distractions, destructive transformations
contribute to more accurate insights and improved model performance.
3. Formatting transformations
Formatting transformations deal with the presentation and organization of data, ensuring it
adheres to a common format. These transformations include data standardization (converting
data to a common format), sorting, and formatting. While formatting transformations may not
directly affect the analytical or predictive power of the data, they play a vital role in
facilitating efficient data exploration, visualization, and effective communication of insights.
4. Structural Transformations
Structural transformations involve modifying the overall structure and organization of the
dataset, making it more suitable for analysis or machine learning models. This includes
reshaping data, normalizing or denormalizing databases, and integrating data from multiple
sources. These transformations are useful for time series analysis, multi-source data
integration, preparing data for machine learning, data warehousing, and data visualization.
Benefits of data transformation
Transforming data increases its usability, results in more effective data analysis, and offers a
host of other benefits.
1. Improved data quality: Data transformation helps identify and correct inconsistencies,
errors, and missing values, leading to cleaner and more accurate data for analysis.
2. Enhanced data integration: Data transformation converts data into a standardized format,
enabling seamless integration from multiple sources. This allows for greater collaboration
and data sharing across different systems.
3. Better decision-making and business intelligence: Data transformation cleans and
integrates data, providing a solid foundation for analysis and the generation of business
insights. Organizations improve efficiency and competitiveness by using these insights to
inform decisions.
4. Easier scalability: Data transformation helps teams manage increasing volumes of data,
allowing organizations to scale their data processing and analytics capabilities as needed.
5. Stronger data privacy: Data transformation techniques like anonymization,
pseudonymization, or encryption help businesses protect data privacy and comply with data
protection regulations.
6. Improved data visualization: Data transformation makes it easier to create engaging and
insightful data visualizations by supplying data in appropriate formats or aggregating it.
7. Easier machine learning: Data transformation prepares data for machine learning
algorithms by converting it into a suitable format and addressing issues like missing values,
outliers, or class imbalance.
8. Greater time and cost savings: Data transformation automation helps organizations
reduce the time and effort needed for data preparation, allowing data scientists and analysts to
focus on higher-value tasks.
Challenges of data transformation
Alongside its numerous benefits, the data transformation process does present some
challenges.
1. Complexity: Modern businesses collect huge volumes of raw data from multiple sources.
The more data there is, the more complex the data transformation process becomes,
especially for companies with convoluted data infrastructures.
2. Expertise: Data professionals must possess a deep understanding of data context, different
types of data transformations, and the ability to use common automation tools. If businesses
don’t retain employees with this high-level expertise, they will be at risk of critical
misinterpretations and errors.
3. Resources: Data transformation demands extensive computing, processing, memory, and
manual effort. For businesses with high data volumes and complexity, the widespread
resource strain of transforming data can slow down operations.
4. Cost: The cost of the tools, infrastructure, and expertise needed for data transformation can
be a significant budget constraint. This can be a particular difficulty for smaller companies.

Session 4
INFERENTIAL STATISTICS

Inferential Statistics Definition


Inferential statistics can be defined as a field of statistics that uses analytical tools for drawing
conclusions about a population by examining random samples. The goal of inferential
statistics is to make generalizations about a population. In inferential statistics, a statistic is
taken from the sample data (e.g., the sample mean) and used to make inferences about the
population parameter (e.g., the population mean).
Inferential statistics helps to develop a good understanding of the population data by
analyzing the samples obtained from it. It helps in making generalizations about the
population by using various analytical tests and tools. In order to pick out random samples
that will represent the population accurately, many sampling techniques are used. Some of the
important methods are simple random sampling, stratified sampling, cluster sampling, and
systematic sampling techniques.
First, a null hypothesis, H0, is proposed, which takes the form of a written statement or a
mathematical expression. The most commonly proposed null hypothesis is ‘no difference(s)
exist between the groups, they all come from one population’.
Second, an alternative hypothesis, H1, is proposed that will be accepted if there is good
evidence against the null hypothesis. If we are unconcerned about the direction of any
difference between groups, H1 will simply be ‘the two populations are different’ and we will
use a two-tailed test.
 A p value of 0.5 suggests that there is a 50% chance of observing the data if the null
hypothesis is true, i.e. a one-in-two chance,
 whereas a p value of 0.05 suggests that this probability is only 5%, i.e. a 1 in 20
chance.
A one-in-two chance is not low enough for us to be sure the null hypothesis is incorrect,
whereas a 1 in 20 chance makes it much more likely. At this latter level we might agree that
the null hypothesis is incorrect, so a p value of 0.05 is usually taken as the ‘cut-off’
probability. We say that p values less than 0.05 provide good evidence against the null
hypothesis, whereas values greater than this do not. Statistically speaking, we always talk
about evidence against the null hypothesis, never for it; our study is usually designed to reject
the null hypothesis, not support it.
Before undertaking an inferential test it is important to understand the type of data being
analyzed and whether the data, or transformed data, are normally distributed. If they are,
then parametric tests can be used to analyze the data; if not, then nonparametric tests can be
used. Nonparametric tests involve the ranks of the observations rather than the observations
themselves, so no assumptions need be made as to the actual distribution of data.

Types of Inferential Statistics


Inferential statistics can be classified into hypothesis testing and regression analysis.
Hypothesis testing also includes the use of confidence intervals to test the parameters of a
population. Given below are the different types of inferential statistics.
Importance of Statistical Inference
 Inferential Statistics is important to examine the data properly.
 To make an accurate conclusion, proper data analysis is important to interpret the
research results.
 It is widely used for future prediction of various observations in different fields.
 It helps us to make inferences about the data.
 The statistical inference has a wide range of application in different fields, such as:
Business Analysis, Artificial Intelligence, Financial Analysis, Fraud Detection,
Machine Learning, Share Market

T-tests
Independent samples

Introduction to t-Tests
A t test is a statistical test that is used to compare the means of two groups. It is often used
in hypothesis testing to determine whether a process or treatment actually has an effect on
the population of interest, or whether two groups are different from one another.
Example: You want to know whether the mean petal length of iris flowers differs according
to their species. You collect samples from two different species. You can test the difference
between these two groups using a t test and null and alternative hypotheses.

• The null hypothesis (H0) is that the true difference between these group means is
zero.
Or There is no difference between the means of these groups
• The alternate hypothesis (Ha) is that the true difference is different from zero.
Or the mean of these groups differ
What type of t test should I use?
When choosing a t test approach, you will need to consider two things:
1. Whether the groups being compared come from a single population or two different
populations.
2. Whether you want to test the difference in a specific direction.
• If the groups come from a single population (e.g., measuring before and after an
experimental treatment), perform a paired t-test. This is a within-subjects design.
• If the groups come from two different populations (e.g., two different species, or
people from two separate cities), perform a two-sample t test (a.k.a. independent t test).
This is a between-subjects design.
Standard deviation formula
i. Population standard deviation
When you have collected data from every member of the population that you’re interested
in, you can get an exact value for population standard deviation.

The population standard deviation formula looks like this:

σ = √( Σ(x − µ)² / N )

Formula Explanation:
σ = population standard deviation
Σ = sum of…
x = each value
µ = population mean
N = number of values in the population

ii. Sample standard deviation


When you collect data from a sample, the sample standard deviation is used to make
estimates or inferences about the population standard deviation.
The sample standard deviation formula looks like this:

s = √( Σ(x − x̄)² / (n − 1) )

Formula Explanation:
s = sample standard deviation
Σ = sum of…
x = each value
x̄ = sample mean
n = number of values in the sample
With samples, we use n – 1 in the formula because using n would give us a biased estimate
that consistently underestimates variability. The sample standard deviation would tend to
be lower than the real standard deviation of the population.

1. The Independent Samples t-test


The Independent Samples t-Test compares the means of two independent groups in order to determine whether there
is statistical evidence that the associated population means are significantly different. The
Independent Samples t-test is a parametric test.
Use an independent samples t-test when you want to compare the means of precisely
two groups—no more and no less! The samples are randomly selected from the population
(the raw data). The independent samples t-test is also known as the two-sample t-test.

Independent Samples T-Tests Hypotheses


Independent samples t-tests have the following hypotheses:
Null hypothesis: There is no difference between the mean of the two populations.
Alternative hypothesis: The means for the two populations are not equal.

Decision making
The null hypothesis is rejected when the calculated value of the test statistic is greater than
the critical (or table) value at the chosen significance level.

Independent Samples T-Test Assumptions


For reliable independent samples t-test results, your data should satisfy the following
assumptions:
1. You have a random sample
Drawing a random sample from the population you are studying helps ensure that your data
represent the population. Representative samples are vital when you want to make
inferences about the population. If your data do not represent the population, your analysis
results will not be valid for that population.
2. Your data must be continuous
T-tests require continuous data. Continuous variables can take on any numeric value, and
the scale can be meaningfully divided into smaller increments, including fractional and
decimal values. There are an infinite number of possible values between any two values.
Typically, you measure continuous variables on a scale. For example, when you measure
temperature, weight, and height, you have continuous data.
3. Your sample data should follow a normal distribution or each group has more than
15 observations
All t-tests assume that your data follow the normal distribution. However, your group
distributions can be skewed if your sample size is large enough.
4. The groups are independent
Independent samples contain different sets of items in each sample. Independent samples t
tests compare two distinct samples. Hence, it’s a two-sample t-test. If you have the same
people or items in both groups, you can use the paired t-test.
5. Groups can have equal or unequal variances but use the correct form of the test
Variance, and the closely related standard deviation, are measures of variability. Because the
two sample t-test uses two independent samples, each sample has its own variance.
Consequently, the independent samples t-test has two methods. One method assumes that
the two groups have equal variances while the other does not assume they are equal. The
form that does not assume equal variances is known as Welch’s t-test.

WORKING STEPS – independent sample


Step 1 - Count the number of objects in each column = N
Step 2 - Sum each column and find the means = Mean X1 and Mean X2
Step 3 - Subtract the mean from each object and square the result
Step 4 - Sum the squared deviations for each column = SS
Step 5 - Input the values into the formula = calculated t
Step 6 - Find the critical region/cut-off value: compute df = N1 + N2 - 2 and find the
corresponding value from the Student's t-table under the prescribed p-value
Step 7 - Take the decision by comparing the calculated t value with the critical value

Example1 -Independent Samples T Test


Given the following data from two different groups, are the group means different?
Group A: 5, 8, 7, 8 and 7
Group B: 3, 5, 2 and 3
Solution
X1    X2    (X1 − Mean)² = (X1 − 7)²    (X2 − Mean)² = (X2 − 3.25)²
5     3     (5 − 7)² = 4                (3 − 3.25)² = 0.0625
8     5     (8 − 7)² = 1                (5 − 3.25)² = 3.0625
7     2     (7 − 7)² = 0                (2 − 3.25)² = 1.5625
8     3     (8 − 7)² = 1                (3 − 3.25)² = 0.0625
7           (7 − 7)² = 0
ΣX1 = 35    ΣX2 = 13    SS1 = 6    SS2 = 4.75
N1 = 5      N2 = 4
Mean X1 = 7    Mean X2 = 3.25
Df = 5+4-2 = 7

Compute the value of t using the pooled-variance formula:

t = (Mean X1 − Mean X2) / √( Sp² × (1/N1 + 1/N2) ),  where Sp² = (SS1 + SS2) / (N1 + N2 − 2)

Computed t = 4.52

Hypothesis:
H0= Means not different
H1- Means different
Next step- check the critical value from the Student's t Table (alpha = 0.05 and df = N1 + N2 - 2
= 7).
critical value = 2.365
Interpretation
Because the computed value of t (4.52) exceeds the critical value (2.365),
Therefore, we reject the null hypothesis and conclude that the two populations from which
the samples are drawn do have different means.
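The same calculation can be scripted; a sketch of the pooled-variance t statistic in Python (the function name is illustrative; small rounding differences aside, it reproduces the worked value):

```python
import math

def pooled_t(sample1, sample2):
    """Independent-samples t with pooled variance (equal variances assumed)."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    ss1 = sum((x - m1) ** 2 for x in sample1)
    ss2 = sum((x - m2) ** 2 for x in sample2)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)  # pooled variance
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

group_a = [5, 8, 7, 8, 7]
group_b = [3, 5, 2, 3]
t = pooled_t(group_a, group_b)
print(round(t, 2))  # ~4.51 (the notes round to 4.52); well above the critical value 2.365
```

Either way, the statistic comfortably exceeds the critical value at df = 7, so the null hypothesis is rejected.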

Example2 - Independent Samples T-Test


Calculate a t-test for the following data on the number of times people prefer
coffee or tea in five time intervals:
Preference for coffee: 4, 5, 7, 6, 9
Preference for tea: 3, 8, 6, 4, 7

Solution:
let x1 be the sample of data that prefers coffee and x2 be the sample of data
that prefers tea.

X1    X2    X1 − 6.2    (X1 − 6.2)²    X2 − 5.6    (X2 − 5.6)²
4     3     −2.2        4.84           −2.6        6.76
5     8     −1.2        1.44           2.4         5.76
7     6     0.8         0.64           0.4         0.16
6     4     −0.2        0.04           −1.6        2.56
9     7     2.8         7.84           1.4         1.96
ΣX1 = 31    ΣX2 = 28    SS1 = 14.8    SS2 = 17.2
N1 = 5      N2 = 5
Mean X1 = 6.2    Mean X2 = 5.6

Df1 = 5 − 1 = 4
Df2 = 5 − 1 = 4
S1² = Variance1 = SS1/Df1 = 14.8/4 = 3.7
S2² = Variance2 = SS2/Df2 = 17.2/4 = 4.3

t = (Mean X1 − Mean X2) / √( S1²/n1 + S2²/n2 )
  = (6.2 − 5.6) / √( 3.7/5 + 4.3/5 )
  = 0.6 / √1.6
  = 0.6 / 1.26
  = 0.47
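This example divides each sample variance by its own n; a quick check in Python (the function name is illustrative):

```python
import math

def unpooled_t(sample1, sample2):
    """Independent-samples t using each sample's own variance (n - 1 divisor)."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

coffee = [4, 5, 7, 6, 9]
tea = [3, 8, 6, 4, 7]
print(round(unpooled_t(coffee, tea), 2))  # 0.47
```

The small statistic suggests no evidence of a difference between the coffee and tea preference means.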

What is the t-test syntax?


While one Excel t-test method uses basic menu commands, the second uses an Excel
formula with unique syntax and arguments. Understanding this syntax may help you when
performing formula-based t-tests by letting you properly define your data sets. Use this
syntax when running a t-test:

=T.TEST(array1,array2,tails,type)

Understanding these terms can help you know which data your formula needs:
 Array1: This term shows your first data set by its starting and ending cell values and
connects each value with a colon. For example, a properly inputted "Array1" value
for a data set may read "A2:A11", which shows all cells within that range.
 Array2: "Array2" is the second data set in your t-test comparison. Like with "Array1",
you input your data set by including the first and last cell values in your set, such as
"B2:B11", connecting them with a colon.
 Tails: The "Tails" value shows your tail distribution number. Input "1" for a one-tailed
distribution test and "2" for a two-tailed distribution test, based on your test's needs.
 Type: Excel can perform three t-test types: paired, two-sample equal
variance, and two-sample unequal variance. Paired tests use the same subjects
measured twice, two-sample equal variance assumes both populations have the
same variance, and two-sample unequal variance does not.
When finished, your t-test syntax may read something like this:

=T.TEST(A2:A11,B2:B11,1,2)

This syntax shows a one-tailed distribution t-test that compares a data set from A2 to A11
with a second from B2 to B11, using two-sample equal variance. You may adjust your
syntax as needed to meet your data needs.
How to perform a t-test
Follow these five steps to add your t-test syntax to your Excel spreadsheet:
1. Select the data sets
Identify the data sets you want to test and note their starting and ending cell values. A
typical data set fills rows in a vertical spreadsheet column, though it may include multiple
columns. For example, your first data set may fill 10 rows, from two to 11, in the "A" column,
while your second fills the same rows in the B column. In this situation, your data sets
include A2 through A11 for the first and B2 through B11 for the second.
2. Type your starting syntax
Select a spreadsheet cell and type "=T.Test(" here. This value tells Excel to run a t-test
following the terms inputted later on. These include the data set values chosen in step one,
as well as the tails and test type. Your syntax should read: =T.TEST(
3. Input your data arrays
After your starting syntax, type your first data range, add a comma and type the second data
range followed by a comma. Create a data range by typing the first cell value, a colon and
the last cell value. For example, you may type "A2:A11" for your first data array, a comma
and then "B2:B11" for your second data array. Your syntax should
read: =T.TEST(A2:A11,B2:B11,
4. Add the tail distribution value
Input the tail distribution value based on your data's relationship. For example, you type a
"1" for a one-tailed distribution test or a "2" for a two-tailed distribution test. Add a comma
after this single number to separate it from the last value. Adding commas helps this formula
run smoothly and minimizes error messages. After adding this value, your syntax should
read: =T.TEST(A2:A11,B2:B11,1,
5. Choose the test type
Add your test type value after the comma and end your syntax with a closing parenthesis.
Input a "1" for a paired test, "2" for a two-sample equal variance test or "3" for a two-
sample unequal variance test. When uncertain about the appropriate test type, add a "3" for
this value. This option may return accurate difference percentages for many test types. Your
test syntax should now read: =T.TEST(A2:A11,B2:B11,1,3)

z-test
A z-test is a statistical test to determine whether two population means are different or to
compare one mean to a hypothesized value when the variances are known and the sample
size is large. A z-test is a hypothesis test for data that follows a normal distribution.

Z-tests and t-tests are both statistical tools used to compare means, but they are used in
different situations. Z-tests are generally used for large sample sizes (n > 30) and when the
population standard deviation is known, while t-tests are used for smaller sample sizes (n <
30) or when the population standard deviation is unknown. Both tests assume a normal
distribution of the data.
Z Test = (x̄ – μ) / (σ / √n)
Where
 x̄ = Mean of Sample
 μ = Mean of Population
 σ = Standard Deviation of Population
 n = Number of Observation

Z Test Statistics Formula – Example #1


Suppose a person wants to check or test whether tea and coffee are equally popular in the
city. In that case, he can use the z test statistic to obtain results by taking a sample of,
say, 500 people from the city, out of which suppose 280 are tea drinkers. To test this
hypothesis, he can use the z test method.
Principal at school claims that students in his school are above average intelligence and a
random sample of 30 students IQ scores have a mean score of 112.5 and mean population
IQ is 100 with a standard deviation of 15. Is there sufficient evidence to support the
principal claim?
Solution:
Z Test Statistics is calculated using the formula given below
Z Test = (x̄ – μ) / (σ / √n)

 Z Test = (112.5 – 100) / (15 / √30)


 Z Test = 4.56
Compare the z test results with z test standard table and you can come to the conclusion in
this example null hypothesis is rejected and the principal claim is right.
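The principal's claim can be checked with the same formula; a minimal Python sketch (the function name is illustrative):

```python
import math

def z_test(sample_mean, pop_mean, pop_sd, n):
    """Z statistic: (x-bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

# Principal's claim: sample of 30 students, mean IQ 112.5,
# population mean 100, population standard deviation 15
z = z_test(sample_mean=112.5, pop_mean=100, pop_sd=15, n=30)
print(round(z, 2))  # 4.56
```

Because 4.56 exceeds the usual critical values (for example 1.96 for a two-tailed test at alpha = 0.05), the null hypothesis is rejected, in line with the worked conclusion.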

What is F Test in Statistics?


The F test is a statistical test that is performed on an F distribution. A two-tailed F test is
used to check whether the variances of two given samples (or populations) are equal or not.
However, if an F test checks whether one population variance is either greater than or less
than the other, it becomes a one-tailed F test.
F Test Definition
F test can be defined as a test that uses the f test statistic to check whether the variances of
two samples (or populations) are equal to the same value. To conduct an f test, the
population should follow an f distribution and the samples must be independent events. On
conducting the hypothesis test, if the results of the f test are statistically significant then the
null hypothesis can be rejected otherwise it cannot be rejected.
F Test Formula
The f test is used to check the equality of variances using hypothesis testing. The f test
formula for different hypothesis tests is given as follows:
Left-Tailed Test:
Null Hypothesis: H0: σ1² = σ2²
Alternate Hypothesis: H1: σ1² < σ2²
Decision Criteria: If the F statistic < F critical value, then reject the null hypothesis.
Right-Tailed Test:
Null Hypothesis: H0: σ1² = σ2²
Alternate Hypothesis: H1: σ1² > σ2²
Decision Criteria: If the F statistic > F critical value, then reject the null hypothesis.
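The F statistic itself is simply the ratio of the two sample variances; an illustrative sketch, using the variances from the earlier coffee/tea example (3.7 and 4.3):

```python
def f_statistic(var1, var2):
    """F ratio for comparing two sample variances (larger variance on top)."""
    return max(var1, var2) / min(var1, var2)

# Sample variances from the coffee/tea t-test example
f = f_statistic(3.7, 4.3)
print(round(f, 2))  # 1.16
```

The ratio is then compared with the F critical value at (df1, df2) = (4, 4); a ratio close to 1 is consistent with equal variances.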

Hypothesis Testing
Hypothesis testing is a statistical method used to determine if there's enough evidence in a
sample to draw conclusions about a population. It involves formulating a null hypothesis
(H0) and an alternative hypothesis (Ha), collecting data, and using statistical tests to
determine if the evidence supports rejecting the null hypothesis.
1. The Main Idea:
 Hypothesis: A hypothesis is a testable statement about a population.
 Null Hypothesis (H0): This is a statement of no effect or no difference. It represents
the status quo.
 Alternative Hypothesis (Ha): This is a statement that contradicts the null hypothesis. It
represents the effect or difference we are trying to prove.
 Sample Data: Data collected from a sample is used to evaluate the evidence for or
against the null hypothesis.
 Statistical Test: A test is performed to determine the probability of obtaining the
observed data (or more extreme data) if the null hypothesis were true.
 Decision: Based on the test results, a decision is made to either reject or fail to reject
the null hypothesis.
2. Key Concepts:
 Significance Level (alpha): The probability of incorrectly rejecting the null hypothesis
when it is true (Type I error).
 P-value: The probability of obtaining the observed data (or more extreme data) if the
null hypothesis were true.
 Type I Error: Rejecting the null hypothesis when it is actually true.
 Type II Error: Failing to reject the null hypothesis when it is false.
CORRELATION AND REGRESSION
Correlation and regression are statistical measurements that are used to give a relationship
between two variables. For example, if a person is driving an expensive car, then it may be
assumed that she is financially well off. To numerically quantify this relationship,
correlation and regression are used.
Correlation Definition
Correlation can be defined as a measurement that is used to quantify the relationship between
variables. If an increase (or decrease) in one variable causes a corresponding increase (or
decrease) in another then the two variables are said to be directly correlated. Similarly, if an
increase in one causes a decrease in another or vice versa, then the variables are said to be
indirectly correlated. If a change in an independent variable does not cause a change in the
dependent variable then they are uncorrelated. Thus, correlation can be positive (direct
correlation), negative (indirect correlation), or zero. This relationship is given by
the correlation coefficient.
Regression Definition
Regression can be defined as a measurement that is used to quantify how the change in one
variable will affect another variable. Regression is used to find the cause and effect between
two variables. Linear regression is the most commonly used type of regression because it is
easier to analyze as compared to the rest. Linear regression is used to find the line that is the
best fit to establish a relationship between variables.
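Both measures can be computed from sums of deviations; a minimal Python sketch with illustrative data (function names are assumptions, not a library API):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(x, y):
    """Intercept and slope of the best-fitted regression line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b  # (intercept, slope)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 2))  # 0.77 -> positive (direct) correlation
a, b = least_squares_line(x, y)
print(round(a, 2), round(b, 2))   # 2.2 0.6 -> fitted line y = 2.2 + 0.6x
```

Note that the sign of r matches the sign of the regression slope, a point made again in the comparison table below.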
Correlation and Regression Analysis
Both correlation and regression analysis are done to quantify the strength of the relationship
between two variables by using numbers. Graphically, correlation and regression analysis can
be visualized using scatter plots.
Correlation analysis is done so as to determine whether there is a relationship between the
variables that are being tested. Furthermore, a correlation coefficient such as Pearson's
correlation coefficient is used to give a signed numeric value that depicts the strength as well
as the direction of the correlation. A scatter plot shows the correlation between two variables
x and y for individual data points.
Regression analysis is used to determine the relationship between two variables such that the
value of the unknown variable can be estimated using the knowledge of the known variables.
The goal of linear regression is to find the best-fitted line through the data points. For two
variables x and y, regression analysis can be visualized as the best-fit line drawn through the
plotted points.
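For simple linear regression the best-fitted line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal sketch (the helper name is our own):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)      # slope = cov(x, y) / var(x)
    a = my - b * mx                          # line passes through the point of means
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data lies exactly on y = 1 + 2x
predicted = a + b * 5                        # estimate y for an unseen x = 5
```

This is exactly the estimation use described above: once a and b are fitted, the unknown variable y can be estimated from any known x.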
Difference between Correlation and Regression
Correlation and regression are both used as statistical measurements to get a good
understanding of the relationship between variables. If the correlation coefficient is negative
(or positive) then the slope of the regression line will also be negative (or positive). The table
given below highlights the key difference between correlation and regression.
1. Correlation is used to determine whether variables are related or not, whereas
regression is used to numerically describe how a dependent variable changes with a
change in an independent variable.
2. Correlation tries to establish a linear relationship between variables, whereas
regression finds the best-fitted regression line to estimate an unknown variable on the
basis of the known variable.
3. In correlation the variables can be used interchangeably, whereas in regression the
variables cannot be interchanged.
4. Correlation uses a signed numerical value to estimate the strength of the relationship
between the variables, whereas regression shows the impact of a unit change in the
independent variable on the dependent variable.
CORRELATION
Types of Correlation:
Correlation is described or classified in several different ways. Three of the most
important are:
I. Positive, Negative and Zero
II. Simple, Partial and Multiple
III. Linear and Non-linear
I. Positive, Negative and Zero Correlation: Whether correlation is positive (direct)
or negative (inverse) would depend upon the direction of change of the variables.
Positive Correlation: If both the variables vary in the same direction, correlation
is said to be positive. It means if one variable is increasing, the other on an
average is also increasing or if one variable is decreasing, the other on an average
is also decreasing, then the correlation is said to be positive. For
example, the correlation between heights and weights of a group of persons is a
positive correlation.
Height (cm) : X 158 160 163 166 168 171 174 176
Weight (kg) : Y 60 62 64 65 67 69 71 72
Negative Correlation: If both the variables vary in opposite direction, the
correlation is said to be negative. It means if one variable increases, but the other
variable decreases or if one variable decreases, but the other variable increases,
then the correlation is said to be negative correlation. For example, the correlation
between the price of a product and its demand is a negative correlation.
Price of Product (Rs. Per Unit) : X 6 5 4 3 2 1
Demand (In Units) : Y 75 120 175 250 215 400
Zero Correlation: Strictly, this is not a type of correlation, but it is still called zero
or no correlation. When we find no relationship between the variables, the correlation
is said to be zero. It means a change in the value of one variable does not
influence or change the value of the other variable. For example, the correlation between
the weight of a person and intelligence is a zero or no correlation.
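These sign conventions can be checked numerically against the two example datasets above: the height/weight pairs should give a coefficient close to +1, and the price/demand pairs one close to -1 (a sketch; `pearson_r` is our own helper, not a library function):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

# figures from the examples above
height = [158, 160, 163, 166, 168, 171, 174, 176]
weight = [60, 62, 64, 65, 67, 69, 71, 72]
price  = [6, 5, 4, 3, 2, 1]
demand = [75, 120, 175, 250, 215, 400]

r_pos = pearson_r(height, weight)   # strongly positive: direct correlation
r_neg = pearson_r(price, demand)    # strongly negative: indirect correlation
```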
II. Simple, Partial and Multiple Correlation: The distinction between simple, partial and
multiple correlation is based upon the number of variables studied. Simple Correlation: When
only two variables are studied, it is a case of simple correlation. For example, when one
studies the relationship between the marks secured by a student and the student's attendance in
class, it is a problem of simple correlation. Partial Correlation: In partial correlation,
one studies three or more variables but considers only two variables to be influencing each
other, with the effect of the other influencing variables held constant. For example, in the
above example of the relationship between student marks and attendance, the other
influencing variables, such as the effectiveness of the teacher and the use of teaching aids
like computers and smart boards, are assumed to be constant.
Multiple Correlation: When three or more variables are studied, it is a case of multiple
correlation. For example, continuing the above example, if the study covers the relationships
among student marks, attendance of students, effectiveness of the teacher, use of teaching
aids, etc., it is a case of multiple correlation.
III. Linear and Non-linear Correlation: Depending upon the constancy of the ratio
of change between the variables, the correlation may be Linear or Non-linear
Correlation. Linear Correlation: If the amount of change in one variable bears a
constant ratio to the amount of change in the other variable, then correlation is
said to be linear. If such variables are plotted on a graph paper all the plotted
points would fall on a straight line. For example, if it is assumed that to produce
one unit of finished product we need 10 units of raw materials, then to produce 2 units
of finished product we need exactly double that amount:
Raw Material : X 10 20 30 40 50 60
Finished Product : Y 1 2 3 4 5 6
Non-linear Correlation: If the amount of change in one variable does not bear a
constant ratio to the amount of change to the other variable, then correlation is
said to be non-linear. If such variables are plotted on a graph, the points would fall
on a curve and not on a straight line. For example, if we double the amount of
advertisement expenditure, the sales volume would not necessarily be doubled:
Advertisement Expenses : X 10 20 30 40 50 60
Sales Volume : Y 2 4 5 6 6.5 7
Scatter Diagram: This is graphic method of measurement of correlation. It is a
diagrammatic representation of bivariate data to ascertain the relationship between two
variables. Under this method the given data are plotted on graph paper in the form of dots,
i.e., for each pair of X and Y values we put a dot and thus obtain as many points as the number
of observations. Usually an independent variable is shown on the X-axis whereas the
dependent variable is shown on the Y-axis. Once the values are plotted on the graph it reveals
the type of the correlation between variable X and Y. A scatter diagram reveals whether the
movements in one series are associated with those in the other series.
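In a spreadsheet or charting tool the scatter diagram is a built-in chart type; as a purely illustrative sketch of the plotting logic, the dots can be rendered on a text grid, with the independent variable on the X-axis and the dependent variable on the Y-axis (the helper name is our own):

```python
def scatter_text(xs, ys, width=20, height=10):
    """Render (x, y) pairs as '*' marks on a text grid, origin at bottom-left."""
    grid = [[' '] * width for _ in range(height)]
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    for x, y in zip(xs, ys):
        # scale each coordinate onto the grid
        col = round((x - xmin) / (xmax - xmin) * (width - 1))
        row = round((y - ymin) / (ymax - ymin) * (height - 1))
        grid[height - 1 - row][col] = '*'   # flip rows so y increases upward
    return '\n'.join(''.join(row) for row in grid)

# positively correlated pairs drift from bottom-left to top-right
plot = scatter_text([1, 2, 3, 4, 5], [10, 14, 15, 19, 22])
print(plot)
```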
CHAPTER FIVE
INTER-PROGRAM COMMUNICATION (IPC)
Inter-program communication refers to the mechanisms and techniques that allow different programs or processes running
on a computer to communicate and share data with each other. It enables programs to
exchange information, coordinate their activities, and work together to accomplish a specific
task. IPC facilitates the sharing of data and resources, and the synchronization of actions
between different programs. IPC is crucial for modern operating systems that support
multitasking, as it enables different programs to cooperate and share resources effectively.
Methods in Inter program/process Communication
Inter-process communication (IPC) refers to the mechanisms and techniques used by
operating systems to allow different processes to communicate with each other while
running concurrently.
The two fundamental models of Inter-Program Communication are: Shared Memory and
Message Passing
Shared Memory
IPC through Shared Memory is a method where multiple processes are given access to the
same region of memory. This shared memory allows the processes to communicate with each
other by reading and writing data directly to that memory area.
Shared memory in IPC can be loosely visualized as global variables, which are shared
across an entire program; shared memory goes beyond global variables, however, allowing
multiple processes to share data through a common memory space, whereas global variables
are restricted to a single process.
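Python exposes this model through the standard library's `multiprocessing.shared_memory` (Python 3.8+). The sketch below stays in a single process for brevity: it attaches a second handle to the same named block, exactly as a separate process would after receiving the block's name:

```python
from multiprocessing import shared_memory

# "writer" side: create a named block of shared memory and put bytes into it
writer = shared_memory.SharedMemory(create=True, size=32)
writer.buf[:5] = b'hello'

# "reader" side: attach to the same region by name (another process would
# receive this name through some channel and do exactly the same)
reader = shared_memory.SharedMemory(name=writer.name)
data = bytes(reader.buf[:5])

reader.close()
writer.close()
writer.unlink()   # free the OS-level segment once everyone is done
```

Because both handles map the same memory, no copying or message exchange is needed, which is why shared memory is the fastest IPC method for processes on one host.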
Message Passing
IPC through Message Passing is a method where processes communicate by sending and
receiving messages to exchange data. In this method, one process sends a message, and the
other process receives it, allowing them to share information. Message Passing can be
achieved through different methods like Sockets, Message Queues or Pipes.
Sockets provide an endpoint for communication, allowing processes to send and receive
messages over a network. In this method, one process (the server) opens a socket and listens
for incoming connections, while the other process (the client) connects to the server and
sends data. Sockets can use different communication protocols, such
as TCP (Transmission Control Protocol) for reliable, connection-oriented
communication or UDP (User Datagram Protocol) for faster, connectionless
communication.
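A minimal sketch of this client/server pattern over loopback TCP (the server runs in a thread so the whole exchange fits in one script; binding to port 0 lets the OS pick a free port):

```python
import socket
import threading

def serve(server_sock):
    conn, _ = server_sock.accept()             # wait for one client to connect
    with conn:
        conn.sendall(conn.recv(1024).upper())  # echo the message back, uppercased

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))                  # loopback address, OS-chosen port
server.listen(1)
threading.Thread(target=serve, args=(server,), daemon=True).start()

client = socket.create_connection(server.getsockname())
client.sendall(b'ping')
reply = client.recv(1024)                      # b'PING' comes back from the server
client.close()
server.close()
```

In a real distributed setup the client and server would be separate programs, possibly on different hosts, but the socket API calls are the same.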
Methods of Inter-program Communication (IPC)
1. Pipes – A pipe is a unidirectional communication channel used for IPC between two
related processes. One process writes to the pipe, and the other process reads from it.
The two types of pipes are Anonymous Pipes and Named Pipes (FIFOs).
2. Sockets – Sockets are used for network communication between processes running on
different hosts. They provide a standard interface for communication, which can be
used across different platforms and programming languages.
3. Shared memory – In shared memory IPC, multiple processes are given access to a
common memory space. Processes can read and write data to this memory, enabling
fast communication between them.
4. Semaphores – Semaphores are used for controlling access to shared resources. They
are used to prevent multiple processes from accessing the same resource
simultaneously, which can lead to data corruption.
5. Message Queuing – This allows messages to be passed between processes using
either a single queue or several message queues. The queues are managed by the system
kernel, and the messages are coordinated through an API.
6. Remote Procedure Calls (RPC) – RPC allows a program to call a procedure (or
function) on another machine in a network as though it were a local call, abstracting
the details of communication and making distributed systems easier to use. It is a
standard technique for distributed computing, letting processes running on different
hosts call procedures on each other as if they were running on the same host.
Remote Method Invocation (RMI) is a Java-based technique used for Inter-Process
Communication (IPC) across systems, specifically for calling methods on objects
located on remote machines. It allows a program running on one computer (the client)
to execute a method on an object residing on another computer (the server), as if it
were a local method call.
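Item 1 (pipes) can be sketched with Python's `os.pipe`, which returns a read end and a write end. In practice the pipe is created before spawning a child so each related process holds one end; the byte flow is the same in this single-process sketch:

```python
import os

r, w = os.pipe()                 # unidirectional channel: bytes flow from w to r
os.write(w, b'report ready')     # the producer writes to the write end
os.close(w)                      # closing the write end signals end-of-data
message = os.read(r, 1024)       # the consumer reads from the read end
os.close(r)
```

Because a pipe is unidirectional, two-way communication requires a second pipe (or a bidirectional mechanism such as a socket).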
Each method of IPC has its own advantages and disadvantages, and the choice of which
method to use depends on the specific requirements of the application. For example, if high-
speed communication is required between processes running on the same host, shared
memory may be the best choice. On the other hand, if communication is required between
processes running on different hosts, sockets or RPC may be more appropriate.
Inter-Process Communication (IPC) enables processes to share data and work together
through methods like Shared Memory, Message Passing, Pipes, Sockets, Semaphores,
Remote Procedure Calls (RPC) and Remote Method Invocation (RMI). Each method has
unique strengths suited for different scenarios, such as local or distributed systems. This
section provides an overview of IPC methods; each one can be explored in more detail as
needed.
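Item 4 (semaphores) is easiest to demonstrate with threads, though the same counting idea applies across processes. In this illustrative sketch, a semaphore initialized to 2 caps how many workers can use the shared "resource" at once:

```python
import threading
import time

slots = threading.Semaphore(2)   # at most two workers inside the resource at a time
lock = threading.Lock()          # protects the bookkeeping counters below
active = 0
peak = 0

def worker():
    global active, peak
    with slots:                  # blocks while two workers are already inside
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.02)         # simulate using the shared resource
        with lock:
            active -= 1

threads = [threading.Thread(target=worker) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak never exceeds the semaphore's count of 2, even with 6 competing workers
```

This is precisely the data-corruption guard described above: the semaphore serializes access so no more than the permitted number of workers touch the resource simultaneously.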
Dashboards
Dashboards are effective data visualization tools for tracking and visualizing data from
multiple data sources, providing visibility into how the actions of one team (or an adjacent
one) affect performance.
Data visualization
Data visualization is the graphical representation of information and data.
By using visual elements like charts, graphs, and maps, data visualization
tools provide an accessible way to see and understand trends, outliers,
and patterns in data. Additionally, it provides an excellent way for
employees or business owners to present data to non-technical audiences
without confusion. In the world of Big Data, data visualization tools and
technologies are essential to analyze massive amounts of information and
make data-driven decisions.
Data visualization is a critical step in the data science process, helping teams and individuals
convey data more effectively to colleagues and decision makers. Teams that manage
reporting systems typically leverage defined template views to monitor performance.
However, data visualization isn’t limited to performance dashboards. For example, while text
mining, an analyst may use a word cloud to capture key concepts, trends, and hidden
relationships within this unstructured data. Alternatively, they may utilize a graph structure to
illustrate relationships between entities in a knowledge graph. There are a number of ways to
represent different types of data, and it’s important to remember that it is a skillset that should
extend beyond your core analytics team.
Types of data visualizations
Dashboards include common visualization techniques, such as:
 Tables: This consists of rows and columns used to compare variables. Tables can
show a great deal of information in a structured way, but they can also overwhelm
users who are simply looking for high-level trends.
 Pie charts and stacked bar charts: These graphs are divided into sections that
represent parts of a whole. They provide a simple way to organize data and compare
the size of each component to one another.
 Line charts and area charts: These visuals show change in one or more quantities
by plotting a series of data points over time and are frequently used within predictive
analytics. Line graphs utilize lines to demonstrate these changes while area charts
connect data points with line segments, stacking variables on top of one another and
using color to distinguish between variables.
 Histograms: This graph plots a distribution of numbers using a bar chart (with no
spaces between the bars), representing the quantity of data that falls within a
particular range. This visual makes it easy for an end user to identify outliers within a
given dataset.
 Scatter plots: These visuals are beneficial in revealing the relationship between two
variables, and they are commonly used within regression data analysis. However,
these can sometimes be confused with bubble charts, which are used to visualize three
variables via the x-axis, the y-axis, and the size of the bubble.
 Heat maps: These graphical representations are helpful in visualizing
behavioral data by location. This can be a location on a map, or even a webpage.
 Tree maps: These display hierarchical data as a set of nested shapes, typically
rectangles. Treemaps are great for comparing the proportions between categories via
their area size.
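The histogram bullet above reduces to counting how many values fall within each fixed-width range. A standard-library sketch (the data values are made up for illustration):

```python
from collections import Counter

data = [2, 3, 5, 7, 8, 11, 14, 15, 16, 18]
bin_width = 5

# map each value to the lower edge of its bin, then count values per bin
counts = Counter((x // bin_width) * bin_width for x in data)

for start in sorted(counts):
    bar = '#' * counts[start]     # text stand-in for the histogram bar
    print(f'{start:2d}-{start + bin_width - 1:2d} | {bar}')
```

A bin with an unusually small or large count relative to its neighbors is exactly the kind of outlier the prose describes.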
Advantages
1. Quick recognition of data trends and spread: when we see a chart, we quickly
spot trends and outliers.
2. Interactively explore opportunities.
3. Visualize patterns and relationships.
Disadvantages
1. It’s easy to make an inaccurate assumption, or the visualization may simply be
designed badly so that it’s biased or confusing.
2. Biased or inaccurate information.
3. Correlation doesn’t always mean causation.
4. Core messages can get lost in translation.