Computer in Economics
You can find averages, calculate percentages for a range of cells, manipulate date and time values, and do a lot more in Excel.
Formulae in Excel
There is another term closely associated with Excel formulas: "function". The two words, "formula" and "function", are sometimes used interchangeably. They are closely related, yet different. A formula is any expression that begins with an equal sign, while a function is a predefined operation that carries out a calculation that would be tedious to build by hand. Functions in Excel have names that reflect their intended use.
In Excel, formulas are used to perform calculations or manipulate data, while functions are
pre-defined formulas that simplify complex tasks. Here's a breakdown of some common
formulas and functions:
Basic Formulas & Functions:
SUM: Adds numbers in a range of cells (e.g., =SUM(A1:A10))
AVERAGE: Calculates the average of numbers in a range (e.g., =AVERAGE(B1:B10))
CONCATENATE: Combines text from multiple cells into one (e.g., =CONCATENATE(A1,
" ", B1))
IF: Performs a logical test and returns one value if true, and another if false (e.g., =IF(A1>10,
"Yes", "No"))
VLOOKUP: Searches for a value in a column and returns a value from the same row
(e.g., =VLOOKUP(A1,Sheet2!A:B,2,FALSE))
COUNT: Counts cells in a range that contain numbers (e.g., =COUNT(A1:A10))
COUNTIF: Counts cells in a range that meet a specific condition
(e.g., =COUNTIF(A1:A10,">10"))
MAX: Finds the largest value in a range (e.g., =MAX(A1:A10))
MIN: Finds the smallest value in a range (e.g., =MIN(A1:A10))
DATE: Creates a date value (e.g., =DATE(2025,3,19))
TRIM: Removes extra spaces from text (e.g., =TRIM(A1))
COUNTA: Counts cells that are not empty (e.g., =COUNTA(A1:A10))
The example below shows how we have used the multiplication formula manually with the
‘*’ operator.
Sample Formula: "=A2*B2"
Fig: Microsoft Excel Formula
This example below shows how we have used the function - ‘PRODUCT’ to perform
multiplication. As you can see, we didn’t use the mathematical operator here.
Sample Formula: "=PRODUCT(A2,B2)"
Excel formulas and functions help you perform tasks efficiently and save time. Let's proceed and learn the different types of functions available in Excel, using relevant formulas as and when required.
1. SUM
Fig: Sum function in Excel
As you can see above, to find the total amount of sales for every unit, we had to simply type
in the function “=SUM(C2:C4)”. This automatically adds up 300, 385, and 480. The result is
stored in C5.
2. AVERAGE
The AVERAGE() function calculates the average of the selected range of cell values. As seen from the example below, to find the average of the total sales, you simply type in:
=AVERAGE(C2, C3, C4)
3. COUNT
Fig: Microsoft Excel Function - Count
As seen above, we are counting from C1 to C4, i.e., four cells. But since the COUNT function takes only cells with numerical values into consideration, the answer is 3, as the cell containing "Total Sales" is omitted here.
If you are required to count all the cells containing numerical values, text, or any other data format, you must use the function COUNTA(). However, COUNTA() does not count blank cells.
To count the number of blank cells present in a range of cells, COUNTBLANK() is used.
4. SUBTOTAL
Moving ahead, let's now understand how the subtotal function works. The SUBTOTAL() function returns a subtotal for a range in a database. Depending on what you want, you can select average, count, sum, max, min, or other operations. Let's have a look at two such examples.
=SUBTOTAL(1, A2:A4)
In the subtotal function list, "1" refers to AVERAGE. Hence, the above function returns the average of A2:A4, which is 11, stored in C5. Similarly,
=SUBTOTAL(4, A2:A4)
selects the cell with the maximum value from A2 to A4, which is 12. Passing "4" as the first argument tells SUBTOTAL to return the maximum.
6. POWER
The function “Power()” returns the result of a number raised to a certain power. Let’s have a
look at the examples shown below:
FLOOR
The FLOOR() function rounds a number down to the nearest multiple of a given significance.
Fig: Floor function in Excel
=FLOOR(A2,5)
The nearest lower multiple of 5 for 35.316 is 35.
SUM
The first Excel function you should be familiar with is the one that performs the basic
arithmetic operation of addition:
SUM(number1, [number2], …)
In the syntax of all Excel functions, an argument enclosed in [square brackets] is optional; other arguments are required. Meaning, your SUM formula should include at least one number, or a reference to a cell or range of cells. For example:
=SUM(B2:B6) - adds up values in cells B2 through B6.
=SUM(B2, B6) - adds up values in cells B2 and B6.
If necessary, you can perform other calculations within a single formula, for example, add up
values in cells B2 through B6, and then divide the sum by 5:
=SUM(B2:B6)/5
To sum with conditions, use the SUMIF function: in the 1st argument, you enter the range of
cells to be tested against the criteria (A2:A6), in the 2nd argument - the criteria itself (D2),
and in the last argument - the cells to sum (B2:B6):
=SUMIF(A2:A6, D2, B2:B6)
Tip. The fastest way to sum a column or row of numbers is to select a cell next to the numbers you want to sum (the cell immediately below the last value in the column or to the right of the last number in the row), and click the AutoSum button on the Home tab, in the Editing group. Excel will insert a SUM formula for you automatically.
Useful resources:
Excel Sum formula examples - formulas to total a column, rows, only filtered (visible) cells,
or sum across sheets.
Excel AutoSum - the fastest way to sum a column or row of numbers.
SUMIF in Excel - formula examples to conditionally sum cells.
SUMIFS in Excel - formula examples to sum cells based on multiple criteria.
AVERAGE
The Excel AVERAGE function does exactly what its name suggests, i.e. finds an average, or
arithmetic mean, of numbers. Its syntax is similar to SUM's:
AVERAGE(number1, [number2], …)
Having a closer look at the formula from the previous section (=SUM(B2:B6)/5), what does
it actually do? Sums values in cells B2 through B6, and then divides the result by 5. And what
do you call adding up a group of numbers and then dividing the sum by the count of those
numbers? Yep, an average!
The Excel AVERAGE function performs these calculations behind the scenes. So, instead of
dividing sum by count, you can simply put this formula in a cell:
=AVERAGE(B2:B6)
To average cells based on a condition, use the following AVERAGEIF formula, where A2:A6 is the criteria range, D3 is the criteria, and B2:B6 are the cells to average:
=AVERAGEIF(A2:A6, D3, B2:B6)
Useful resources:
Excel AVERAGE - average cells with numbers.
Excel AVERAGEA - find an average of cells with any data (numbers, Boolean and text
values).
Excel AVERAGEIF - average cells based on one criterion.
Excel AVERAGEIFS - average cells based on multiple criteria.
How to calculate weighted average in Excel
How to find moving average in Excel
MAX & MIN
The MAX and MIN formulas in Excel get the largest and smallest value in a set of numbers,
respectively. For our sample data set, the formulas will be as simple as:
=MAX(B2:B6)
=MIN(B2:B6)
Useful resources:
MAX function - find the highest value.
MAX IF formula - get the highest number with conditions.
MAXIFS function - get the largest value based on multiple criteria.
MIN function - return the smallest value in a data set.
MINIFS function - find the smallest number based on one or several conditions.
COUNT & COUNTA
If you are curious to know how many cells in a given range contain numeric
values (numbers or dates), don't waste your time counting them by hand. The Excel COUNT
function will bring you the count in a heartbeat:
COUNT(value1, [value2], …)
While the COUNT function deals only with those cells that contain numbers, the COUNTA
function counts all cells that are not blank, whether they contain numbers, dates, times, text,
logical values of TRUE and FALSE, errors or empty text strings (""):
COUNTA(value1, [value2], …)
For example, to find out how many cells in column B contain numbers, use this formula:
=COUNT(B:B)
To count all non-empty cells in column B, go with this one:
=COUNTA(B:B)
In both formulas, you use the so-called "whole column reference" (B:B) that refers to all the
cells within column B.
The following screenshot shows the difference: while COUNT processes only numbers, COUNTA outputs the total number of non-blank cells in column B, including the text value in the column header.
Useful resources:
Excel COUNT function - a quick way to count cells with numbers.
Excel COUNTA function - count cells with any values (non-empty cells).
Excel COUNTIF function - count cells that meet one condition.
Excel COUNTIFS function - count cells with several criteria.
IF
Judging by the number of IF-related comments on our blog, it's the most popular function in
Excel. In simple terms, you use an IF formula to ask Excel to test a certain condition and
return one value or perform one calculation if the condition is met, and another value or
calculation if the condition is not met:
IF(logical_test, [value_if_true], [value_if_false])
For example, the following IF statement checks if the order is completed (i.e. there is a value in column C) or not. To test if a cell is not blank, you use the "not equal to" operator (<>) in combination with an empty string (""). As a result, if cell C2 is not empty, the formula returns "Yes", otherwise "No":
=IF(C2<>"", "Yes", "No")
Useful resources:
IF function in Excel with formula examples
How to use nested IFs in Excel
IF formulas with multiple AND/OR conditions
TRIM
If your obviously correct Excel formulas return nothing but errors, one of the first things to check is extra spaces in the referenced cells. (You may be surprised to know how many leading, trailing, and in-between spaces lurk unnoticed in your sheets until something goes wrong!)
There are several ways to remove unwanted spaces in Excel, with the TRIM function being
the easiest one:
TRIM(text)
For example, to trim extra spaces in column A, enter the following formula in cell B1, and then copy it down the column:
=TRIM(A1)
It will eliminate all extra spaces in cells but a single space character between words:
Useful resources:
Excel TRIM function with formula examples
How to delete line breaks and non-printing characters
How to remove non-breaking spaces ( )
How to delete a specific non-printing character
LEN
Whenever you want to know the number of characters in a certain cell, LEN is the function to
use:
LEN(text)
Wish to find out how many characters are in cell A2? Just type the below formula into
another cell:
=LEN(A2)
Please keep in mind that the Excel LEN function counts absolutely all characters including
spaces:
Want to get the total count of characters in a range of cells, or count only specific characters? Please check out the following resources.
Useful resources:
Excel LEN formulas to count characters in a cell
Count the number of characters in cells and ranges
AND & OR
These are the two most popular logical functions to check multiple criteria. The difference is
how they do this:
AND returns TRUE if all conditions are met, FALSE otherwise.
OR returns TRUE if any condition is met, FALSE otherwise.
While rarely used on their own, these functions come in very handy as part of bigger
formulas.
For example, to check the test results in columns B and C and return "Pass" if both are greater than 60, "Fail" otherwise, use the following IF formula with an embedded AND statement:
=IF(AND(B2>60, C2>60), "Pass", "Fail")
If it's sufficient to have just one test score greater than 60 (either test 1 or test 2), embed the OR statement:
=IF(OR(B2>60, C2>60), "Pass", "Fail")
Useful resources:
Excel AND function with formula examples
Excel OR function with formula examples
CONCATENATE
In case you want to take values from two or more cells and combine them into one cell, use
the concatenate operator (&) or the CONCATENATE function:
CONCATENATE(text1, [text2], …)
For example, to combine the values from cells A2 and B2, just enter the following formula in
a different cell:
=CONCATENATE(A2, B2)
To separate the combined values with a space, type the space character (" ") in the arguments
list:
=CONCATENATE(A2, " ", B2)
Useful resources:
How to concatenate in Excel - formula examples to combine text strings, cells and columns.
CONCAT function - newer and improved function to combine the contents of multiple cells
into one cell.
TODAY & NOW
To see the current date and time whenever you open your worksheet without having to
manually update it on a daily basis, use either:
=TODAY() to insert today's date in a cell.
=NOW() to insert the current date and time in a cell.
The beauty of these functions is that they don't require any arguments at all; you type the formulas exactly as written above.
Useful resources:
Excel NOW function - how to insert the current date and time as a dynamic value.
How to insert today's date in Excel - different ways to enter the current date in Excel: as an
unchangeable time stamp or automatically updatable date and time.
Excel date functions with formula examples - formulas to convert date to text and vice versa,
extract a day, month or year from a date, calculate the difference between two dates, and a lot
more.
Best practices for writing Excel formulas
Now that you are familiar with the basic Excel formulas, these tips will give you some
guidance on how to use them most effectively and avoid common formula errors.
Do not enclose numbers in double quotes
Any text included in your Excel formulas should be enclosed in "quotation marks". However,
you should never do that to numbers, unless you want Excel to treat them as text values.
For example, to check the value in cell B2 and return 1 for "Passed", 0 otherwise, you put the
following formula, say, in C2:
=IF(B2="pass", 1, 0)
Copy the formula down to other cells and you will have a column of 1's and 0's that can be
calculated without a hitch.
Now, see what happens if you double quote the numbers:
=IF(B2="pass", "1", "0")
At first sight, the output looks normal - the same column of 1's and 0's. Upon a closer look, however, you will notice that the resulting values are left-aligned in cells by default, meaning those are numeric strings, not numbers! If later someone tries to calculate those 1's and 0's, they might end up pulling their hair out trying to figure out why a 100% correct SUM or COUNT formula returns nothing but zero.
Note. After copying the formula, make sure that all cell references are correct. Cell
references may change depending on whether they are absolute (do not change)
or relative (change).
For the detailed step-by-step instructions, please see How to copy formulas in Excel.
How to delete formula, but keep calculated value
When you remove a formula by pressing the Delete key, a calculated value is also deleted.
However, you can delete only the formula and keep the resulting value in the cell. Here's
how:
Select all cells with your formulas.
Press Ctrl + C to copy the selected cells.
Right-click the selection, and then click Paste Options > Values to paste the calculated values back to the selected cells. Or, press the Paste Special shortcut: Shift+F10 and then V.
For the detailed steps with screenshots, please see How to replace formulas with their values
in Excel.
Excel formulas are essential for several reasons:
Efficiency: They automate repetitive tasks, saving time and reducing manual errors.
Data analysis: Excel's range of formulas enables sophisticated data analysis, crucial for
informed decision-making.
Accuracy: Formulas ensure consistent and accurate results, essential in fields like finance
and accounting.
Data manipulation: They allow for efficient sorting, filtering, and manipulation of large
datasets.
Accessibility: Excel provides a user-friendly platform, making complex data analysis
accessible to non-technical users.
Versatility: Widely used across various industries, proficiency in Excel formulas enhances
employability and career advancement.
Customization: Excel offers customizable formula options to meet specific data handling
needs.
Session 2
DESCRIPTIVE ANALYSIS
Statistical analysis usually begins with a descriptive analysis, also known as descriptive analytics or descriptive statistics. It helps you think about how to utilize your data, helps you identify exceptions and mistakes, and shows how variables are related, putting you in a position to conduct further statistical research.
Descriptive analysis means keeping raw data in a format that makes it easy to understand and analyze, i.e., rearranging, sorting, and changing data so that it can tell you something useful.
Descriptive analysis is one of the most crucial phases of statistical data analysis. It provides
you with a conclusion about the distribution of your data and aids in detecting errors and
outliers. It lets you spot patterns between variables, preparing you for future statistical
analysis.
What is Descriptive Analysis?
Descriptive analysis is a type of data analysis that helps describe, show, or summarize data points in a constructive way so that patterns may emerge from the data.
It is the technique of identifying patterns and links by utilizing recent and historical data. Because it identifies patterns and associations without going any further, it is frequently referred to as the most basic form of data analysis.
This analysis is especially beneficial when describing change over time, and it uses the patterns it finds as a jumping-off point for further research to inform decision-making. When done systematically, descriptive analyses are not tricky or tiresome.
Data aggregation and mining are two methods used in descriptive analysis to generate
historical data. Information is gathered and sorted in data aggregation to simplify large
datasets. Data mining is the next analytical stage, which entails searching the data for patterns
and significance. Data analytics and data analysis are closely related processes that involve
extracting insights from data to make informed decisions.
Measures of variability
Measures of variability give you a sense of how spread out the response values are. The
range, standard deviation and variance each reflect different aspects of spread.
Range
The range gives you an idea of how far apart the most extreme response scores are. To find
the range, simply subtract the lowest value from the highest value.
Range of visits to the library in the past year
Ordered data set: 0, 3, 3, 12, 15, 24
Range: 24 – 0 = 24
Standard deviation
The standard deviation (s or SD) is the average amount of variability in your dataset. It tells
you, on average, how far each score lies from the mean. The larger the standard deviation, the
more variable the data set is.
In other words, a standard deviation (or σ) is a measure of how dispersed the data are in relation to the mean. A low standard deviation indicates that the data are clustered tightly around the mean, while a high standard deviation indicates that the data are more spread out. A standard deviation close to zero indicates that data points are very close to the mean, whereas a larger standard deviation indicates that data points are spread further away from the mean.
Graphically, a curve that is more spread out has a higher standard deviation, while a curve that is more clustered around the mean has a lower standard deviation.
Using the library data set (15, 3, 12, 0, 24, 3), the sample standard deviation is computed as follows:
Step 1: Find the mean: (15 + 3 + 12 + 0 + 24 + 3)/6 = 9.5
Step 2: Subtract the mean from each score.
Step 3: Square each deviation.
Step 4: Add the squared deviations: 421.5
Step 5: Divide by n – 1: 421.5/5 = 84.3
Step 6: Take the square root: √84.3 = 9.18
From learning that s = 9.18, you can say that on average, each score deviates from the mean
by 9.18 points.
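To check these figures outside Excel, here is a minimal Python sketch using the standard-library statistics module; the variable names are ours, not part of the example:

import statistics

visits = [15, 3, 12, 0, 24, 3]           # library visits data set from the example
mean = statistics.mean(visits)           # 9.5
sd = statistics.stdev(visits)            # sample standard deviation (divides by n - 1)
print(round(mean, 2), round(sd, 2))      # 9.5 9.18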
Variance
The variance is the average of squared deviations from the mean. Variance reflects the degree
of spread in the data set. The more spread the data, the larger the variance is in relation to the
mean.
To find the variance, simply square the standard deviation. The symbol for variance is s².
Variance of visits to the library in the past year
Data set: 15, 3, 12, 0, 24, 3
s = 9.18
s² = 84.3
Standard Deviation
The population standard deviation is given by:
σ = √( Σ(xᵢ – µ)² / N )
In this formula, σ is the standard deviation, xᵢ is each individual data point in the set, µ is the mean, and N is the total number of data points. So if you have 10 data points, take the difference between x₁ (the first data point) and the mean and square it; this process is continued all the way through x₁₀ (the last data point).
Skewness:
Skewness describes the asymmetry of a distribution (leaning left or right), while kurtosis
measures the "tailedness" or peakedness of a distribution compared to a normal distribution.
Symmetry: A distribution with zero skewness is perfectly symmetrical, meaning
the left and right sides of the distribution are mirror images.
Positive Skewness: Indicates a longer or fatter right tail, suggesting a tendency
towards higher values.
Negative Skewness: Indicates a longer or fatter left tail, suggesting a tendency
towards lower values.
Interpretation: Skewness between -0.5 and 0.5 indicates a nearly symmetrical
distribution.
Examine the spread of your data to determine whether your data appear to be
skewed
When data are skewed, the majority of the data are located on the high or low side
of the graph. Often, skewness is easiest to detect with a histogram or boxplot.
Fig: Right-skewed and left-skewed histograms
The histogram with right-skewed data shows wait times. Most of the wait times are
relatively short, and only a few wait times are long. The histogram with left-skewed
data shows failure time data. A few items fail immediately, and many more items
fail later.
Identify outliers
Outliers, which are data values that are far away from other data values, can
strongly affect the results of your analysis. Often, outliers are easiest to identify on
a boxplot.
On a boxplot, asterisks (*) denote outliers.
Kurtosis:
Tailedness: Kurtosis measures the "tailedness" of a distribution, focusing on the distribution's peak and the weight of its tails.
Leptokurtic (High Kurtosis): Heavy tails with more outliers; a distribution that is more peaked than a normal distribution.
Platykurtic (Low Kurtosis): Light tails with fewer outliers; a distribution that is flatter than a normal distribution.
Mesokurtic (Near-zero Excess Kurtosis): Close to a normal distribution.
Interpretation:
An excess kurtosis greater than +2 suggests a distribution that is too peaked, while a value less than -2 indicates one that is too flat.
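As an illustration, skewness and excess kurtosis can be computed in Python with SciPy (assuming scipy is installed; the sample data is the library data set used earlier):

from scipy.stats import skew, kurtosis

visits = [15, 3, 12, 0, 24, 3]
print(round(skew(visits), 3))       # positive value: longer right tail
print(round(kurtosis(visits), 3))   # excess kurtosis; 0 for a normal distribution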
Viewing the individual samples in a statistical package produces results like the following:
Sample summary of descriptive statistics

              REW         GDP         IND         FDI         OPN         CO2
Mean          57.24032    20.79657    37.40677    19.93721    6.314136    5.840115
Median        58.74000    20.81679    37.37491    19.31145    6.335495    5.797576
Maximum       63.45000    21.24648    39.09078    22.07906    7.285695    6.415751
Minimum       47.32000    20.35905    32.75358    16.51014    5.272603    5.270432
Std. Dev.     5.039758    0.263450    1.218966    1.806833    0.610753    0.396611
Skewness      -0.730468   -0.073151   -1.721139   -0.225644   0.019655    0.035982
Kurtosis      2.268185    1.838233    8.172105    1.651418    1.567781    1.580057
Jarque-Bera   20.69164    10.62608    299.1496    15.67310    15.90918    15.66597
Probability   0.000032    0.004927    0.000000    0.000395    0.000351    0.000396
Sum           10646.70    3868.163    6957.659    3708.320    1174.429    1086.261
Sum Sq. Dev.  4698.844    12.84006    274.8876    603.9595    69.00858    29.10061
Observations  186         186         186         186         186         186
Jarque-Bera (JB)
The Jarque-Bera (JB) test is a goodness-of-fit test used to determine if sample data has
skewness and kurtosis matching a normal distribution; a higher JB statistic indicates a greater
deviation from normality. The JB test assesses whether a dataset deviates significantly from a
normal distribution by examining its skewness and kurtosis.
The null hypothesis assumes that the data comes from a normal distribution, meaning the
skewness and excess kurtosis are zero. The JB statistic is always non-negative. A value far
from zero suggests the data is not normally distributed.
Interpreting the Results:
JB Statistic: A higher JB statistic indicates a greater deviation from
normality.
P-value: The p-value represents the probability of observing the test statistic,
assuming the data is normally distributed. A lower p-value suggests stronger
evidence against the null hypothesis (i.e., the data is not normally distributed).
Significance Level: Typically, a significance level of 0.05 (or 5%) is used. If
the p-value is less than 0.05, the null hypothesis is rejected, implying the data
is not normally distributed.
Limitations:
The JB test is most appropriate for large sample sizes.
It is sensitive to outliers.
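A quick way to run the JB test yourself is SciPy's jarque_bera function; the sketch below uses simulated data of our own (for illustration only) and the 0.05 significance level described above:

import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=500)   # data drawn from a normal distribution

stat, p = jarque_bera(sample)
print(round(stat, 3), round(p, 3))
if p < 0.05:
    print("Reject H0: the data is not normally distributed")
else:
    print("Fail to reject H0: no evidence against normality")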
Session 3
DATA TRANSFORMATION TECHNIQUES
What is the data transformation process?
Data transformation is the process of converting data structures, formats, and/or types to
make it more accessible and usable. For example, a business might extract raw data from
various systems, transform it using a method such as normalization, and then load it into a
centralized repository for storage. Individual projects will use different data transformation
steps depending on their goals.
Data transformation allows raw data to be harnessed and used in data analytics and data
science. This essential step enhances the quality, reliability, and accessibility of data, making
it suitable for analysis, visualization, and training machine learning models.
There are numerous data transformation techniques available, each catering to different
project requirements and dataset characteristics. To unify, standardize, and refine raw data,
the data transformation process includes cleansing, formatting, and removing unnecessary
data.
Five Steps of Standard Data Transformation Process:
1. Data discovery: The data transformation process starts with data experts using data
profiling tools to identify the original data sources' structures, formats, and
characteristics. This helps them choose the best data transformation techniques to use
for the next steps.
2. Data mapping: The experts then map out how to transform the data. They define the
relationships between data elements from different sources and construct a plan for
aggregating and transforming them. Data mapping also helps professionals maintain a
clear record of changes through metadata.
3. Code generation: Data analysts use the data map to guide the generation of
transformation code. The generation of these codes is often automated by using
scripts, algorithms, and data transformation platforms.
4. Code execution: The data transformation is now executed. During this phase, the data
is gathered from sources, transformed according to the rules defined in the mapping
stage, and sent to a storage location. Batch or real-time processing pipelines are used
for this step.
5. Review: The last step is reviewing the transformed data to ensure it is accurate,
complete, and meets the previously defined transformation requirements. If there are
any discrepancies, anomalies, or errors, data experts will use further data
transformations to correct or remove them.
Types of data transformation
1. Constructive Transformations
Constructive transformations create new data attributes or features within the dataset. They
enhance existing features or generate new ones to improve the quality and effectiveness of
data analysis or machine learning models. Examples include feature engineering, aggregating
data points, or deriving new metrics. These transformations add value to the dataset by
generating additional information or providing a better representation of existing data,
making it more suitable for analysis.
2. Destructive Transformations
Destructive transformations remove unnecessary or irrelevant data from the dataset. This
streamlines the information, making it more focused and efficient. Destructive data
transformation types include data cleaning (removing duplicates or correcting errors), dealing
with missing values (imputation or deletion), and feature selection (eliminating redundant or
irrelevant features). By reducing noise and distractions, destructive transformations
contribute to more accurate insights and improved model performance.
3. Formatting transformations
Formatting transformations deal with the presentation and organization of data, ensuring it
adheres to a common format. These transformations include data standardization (converting
data to a common format), sorting, and formatting. While formatting transformations may not
directly affect the analytical or predictive power of the data, they play a vital role in
facilitating efficient data exploration, visualization, and effective communication of insights.
4. Structural Transformations
Structural transformations involve modifying the overall structure and organization of the
dataset, making it more suitable for analysis or machine learning models. This includes
reshaping data, normalizing or denormalizing databases, and integrating data from multiple
sources. These transformations are useful for time series analysis, multi-source data
integration, preparing data for machine learning, data warehousing, and data visualization.
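As a concrete sketch of these transformation types, the following Python/pandas example (the table and column names are invented for illustration) applies destructive, formatting, and constructive steps to a small raw dataset:

import pandas as pd

# Invented sample data: a duplicate row, inconsistent text, and a missing value
raw = pd.DataFrame({
    "date":   ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04"],
    "region": [" east ", " east ", "west", "west"],
    "sales":  [100.0, 100.0, None, 250.0],
})

df = raw.drop_duplicates().copy()                     # destructive: remove the duplicate row
df["sales"] = df["sales"].fillna(df["sales"].mean())  # destructive: impute the missing value
df["region"] = df["region"].str.strip().str.upper()   # formatting: standardize text
df["date"] = pd.to_datetime(df["date"])               # formatting: standardize dates
df["sales_z"] = (df["sales"] - df["sales"].mean()) / df["sales"].std()  # constructive: derive a new feature
print(df)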
Benefits of data transformation
Transforming data increases its usability, results in more effective data analysis, and offers a
host of other benefits.
1. Improved data quality: Data transformation helps identify and correct inconsistencies,
errors, and missing values, leading to cleaner and more accurate data for analysis.
2. Enhanced data integration: Data transformation converts data into a standardized format,
enabling seamless integration from multiple sources. This allows for greater collaboration
and data sharing across different systems.
3. Better decision-making and business intelligence: Data transformation cleans and
integrates data, providing a solid foundation for analysis and the generation of business
insights. Organizations improve efficiency and competitiveness by using these insights to
inform decisions.
4. Easier scalability: Data transformation helps teams manage increasing volumes of data,
allowing organizations to scale their data processing and analytics capabilities as needed.
5. Stronger data privacy: Data transformation techniques like anonymization,
pseudonymization, or encryption help businesses protect data privacy and comply with data
protection regulations.
6. Improved data visualization: Data transformation makes it easier to create engaging and
insightful data visualizations by supplying data in appropriate formats or aggregating it.
7. Easier machine learning: Data transformation prepares data for machine learning
algorithms by converting it into a suitable format and addressing issues like missing values,
outliers, or class imbalance.
8. Greater time and cost savings: Data transformation automation helps organizations
reduce the time and effort needed for data preparation, allowing data scientists and analysts to
focus on higher-value tasks.
Challenges of data transformation
Alongside its numerous benefits, the data transformation process does present some
challenges.
1. Complexity: Modern businesses collect huge volumes of raw data from multiple sources.
The more data there is, the more complex the data transformation process becomes,
especially for companies with convoluted data infrastructures.
2. Expertise: Data professionals must possess a deep understanding of data context, different
types of data transformations, and the ability to use common automation tools. If businesses
don’t retain employees with this high-level expertise, they will be at risk of critical
misinterpretations and errors.
3. Resources: Data transformation demands extensive computing, processing, memory, and
manual effort. For businesses with high data volumes and complexity, the widespread
resource strain of transforming data can slow down operations.
4. Cost: The cost of the tools, infrastructure, and expertise needed for data transformation can
be a significant budget constraint. This can be a particular difficulty for smaller companies.
Session 4
INFERENTIAL STATISTICS
Importance of Statistical Inference
Inferential statistics is important for examining data properly.
To reach accurate conclusions, proper data analysis is needed to interpret research results.
It is widely used for making predictions about future observations in different fields.
It helps us draw inferences about the data.
Statistical inference has a wide range of applications in different fields, such as business analysis, artificial intelligence, financial analysis, fraud detection, machine learning, and the share market.
T-tests
Independent samples
Introduction to t-Tests
A t test is a statistical test that is used to compare the means of two groups. It is often used
in hypothesis testing to determine whether a process or treatment actually has an effect on
the population of interest, or whether two groups are different from one another.
Example: You want to know whether the mean petal length of iris flowers differs according to their species. You find two different groups of irises, one of each species, and measure the petal lengths in each group. You can test the difference between these two groups using a t-test with null and alternative hypotheses.
• The null hypothesis (H0) is that the true difference between these group means is zero, i.e., there is no difference between the means of these groups.
• The alternative hypothesis (Ha) is that the true difference is different from zero, i.e., the means of these groups differ.
What type of t test should I use?
When choosing a t test approach, you will need to consider two things:
1. Whether the groups being compared come from a single population or two different
populations.
2. Whether you want to test the difference in a specific direction.
• If the groups come from a single population (e.g., measuring before and after an
experimental treatment), perform a paired t-test. This is a within-subjects design.
• If the groups come from two different populations (e.g., two different species, or
people from two separate cities), perform a two-sample t test (a.k.a. independent t test).
This is a between-subjects design.
Standard deviation formula
i. Population standard deviation
When you have collected data from every member of the population that you’re interested
in, you can get an exact value for population standard deviation.
With samples, we use n – 1 in the formula because using n would give us a biased estimate
that consistently underestimates variability. The sample standard deviation would tend to
be lower than the real standard deviation of the population.
Decision making
The null hypothesis is rejected when the calculated value of the test statistic is greater than the critical (or table) value at the chosen significance level.
4. The groups are independent
Independent samples contain different sets of items in each sample. Independent samples t
tests compare two distinct samples. Hence, it’s a two-sample t-test. If you have the same
people or items in both groups, you can use the paired t-test.
5. Groups can have equal or unequal variances but use the correct form of the test
Variance, and the closely related standard deviation, are measures of variability. Because the
two sample t-test uses two independent samples, each sample has its own variance.
Consequently, the independent samples t-test has two methods. One method assumes that
the two groups have equal variances while the other does not assume they are equal. The
form that does not assume equal variances is known as Welch’s t-test.
Computed t = 4.52
Hypothesis:
H0: the means are not different
H1: the means are different
Next step: check the critical value from the Student's t table (alpha = 0.05 and df = N1 + N2 – 2 = 7).
critical value = 2.365
Interpretation
Because the computed value of t (4.52) exceeds the critical value (2.365), we reject the null hypothesis and conclude that the two populations from which the samples are drawn do have different means.
Solution: Let X1 be the sample of data that prefers coffee and X2 be the sample of data that prefers tea.

X1   X2   (X1 – 6.2)   (X1 – 6.2)²   (X2 – 5.6)   (X2 – 5.6)²
4    3    -2.2         4.84          -2.6         6.76
5    8    -1.2         1.44          2.4          5.76
7    6    0.8          0.64          0.4          0.16
6    4    -0.2         0.04          -1.6         2.56
9    7    2.8          7.84          1.4          1.96

ΣX1 = 31, ΣX2 = 28; SS1 = Σ(X1 – 6.2)² = 14.8, SS2 = Σ(X2 – 5.6)² = 17.2
n1 = 5, n2 = 5; Mean X̄1 = 31/5 = 6.2, Mean X̄2 = 28/5 = 5.6
df1 = 5 – 1 = 4, df2 = 5 – 1 = 4

S1² = SS1/df1 = 14.8/4 = 3.7
S2² = SS2/df2 = 17.2/4 = 4.3

t = (X̄1 – X̄2) / √(S1²/n1 + S2²/n2)
  = (6.2 – 5.6) / √(3.7/5 + 4.3/5)
  = 0.6 / √1.6
  = 0.6 / 1.26
  ≈ 0.47
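The same result can be reproduced in Python with SciPy's independent-samples t-test (assuming equal variances, as in the hand calculation above):

from scipy.stats import ttest_ind

coffee = [4, 5, 7, 6, 9]   # X1
tea    = [3, 8, 6, 4, 7]   # X2

t, p = ttest_ind(coffee, tea, equal_var=True)   # pooled-variance t-test
print(round(t, 2), round(p, 3))                 # t ≈ 0.47; p > 0.05, so do not reject H0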
Excel's built-in T.TEST function can run this test for you:
1. Start the function
Type "=T.TEST(" into an empty cell where you want the result to appear.
2. Add the data arrays
Enter "A2:A11" for your first data array and then "B2:B11" for your second data array. Your syntax should read: =T.TEST(A2:A11,B2:B11,
3. Add the tail distribution value
Input the tail distribution value based on your data's relationship. For example, you type a
"1" for a one-tailed distribution test or a "2" for a two-tailed distribution test. Add a comma
after this single number to separate it from the last value. Adding commas helps this formula
run smoothly and minimizes error messages. After adding this value, your syntax should read: =T.TEST(A2:A11,B2:B11,1
4. Choose the test type
Add your test type value after the comma and end your syntax with a closing parenthesis. Input a "1" for a paired test, "2" for a two-sample equal variance test, or "3" for a two-sample unequal variance test. When uncertain about the appropriate test type, use "3", since not assuming equal variances is the safer choice. Your test syntax should now read: =T.TEST(A2:A11,B2:B11,1,3)
z-test
A z-test is a statistical test to determine whether two population means are different or to
compare one mean to a hypothesized value when the variances are known and the sample
size is large. A z-test is a hypothesis test for data that follows a normal distribution.
Z-tests and t-tests are both statistical tools used to compare means, but they are used in
different situations. Z-tests are generally used for large sample sizes (n > 30) and when the
population standard deviation is known, while t-tests are used for smaller sample sizes (n <
30) or when the population standard deviation is unknown. Both tests assume a normal
distribution of the data.
Z Test = (x̄ – μ) / (σ / √n)
Where
x̄ = Mean of Sample
μ = Mean of Population
σ = Standard Deviation of Population
n = Number of Observations
Solution:
The Z Test statistic is calculated using the formula given below:
Z Test = (x̄ – μ) / (σ / √n)
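Since the worked figures for this example are not reproduced here, the following Python sketch applies the z-test formula to illustrative numbers of our own (x̄ = 112, μ = 100, σ = 15, n = 36):

import math
from scipy.stats import norm

x_bar, mu, sigma, n = 112, 100, 15, 36        # hypothetical sample mean, population mean, SD, size
z = (x_bar - mu) / (sigma / math.sqrt(n))     # (112 - 100) / (15 / 6) = 4.8
p = 2 * (1 - norm.cdf(abs(z)))                # two-tailed p-value
print(round(z, 2), p)                         # z = 4.8; p is far below 0.05, so reject H0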
f-test
The f test is used to check the equality of variances using hypothesis testing. The f test
formula for different hypothesis tests is given as follows:
Left Tailed Test:
Null Hypothesis: H0: σ1² = σ2²
Alternate Hypothesis: H1: σ1² < σ2²
Decision Criteria: If the f statistic < f critical value, then reject the null hypothesis.
Right Tailed Test:
Null Hypothesis: H0: σ1² = σ2²
Alternate Hypothesis: H1: σ1² > σ2²
Hypothesis Testing
Hypothesis testing is a statistical method used to determine if there's enough evidence in a
sample to draw conclusions about a population. It involves formulating a null hypothesis
(H0) and an alternative hypothesis (Ha), collecting data, and using statistical tests to
determine if the evidence supports rejecting the null hypothesis.
1. The Main Idea:
Hypothesis: A hypothesis is a testable statement about a population.
Null Hypothesis (H0): This is a statement of no effect or no difference. It represents the status quo.
Alternative Hypothesis (Ha): This is a statement that contradicts the null hypothesis. It represents the effect or difference we are trying to prove.
Sample Data: Data collected from a sample is used to evaluate the evidence for or
against the null hypothesis.
Statistical Test: A test is performed to determine the probability of obtaining the
observed data (or more extreme data) if the null hypothesis were true.
Decision: Based on the test results, a decision is made to either reject or fail to reject
the null hypothesis.
3. Key Concepts:
Significance Level (alpha): The probability of incorrectly rejecting the null hypothesis
when it is true (Type I error).
P-value: The probability of obtaining the observed data (or more extreme data) if the
null hypothesis were true.
Type I Error: Rejecting the null hypothesis when it is actually true.
Type II Error: Failing to reject the null hypothesis when it is false.
CORRELATION AND REGRESSION
Correlation and regression are statistical measures used to describe a relationship between two variables. For example, if a person drives an expensive car, it might be assumed that she is financially well off. Correlation and regression are used to quantify such a relationship numerically.
Correlation Definition
Correlation can be defined as a measurement that is used to quantify the relationship between
variables. If an increase (or decrease) in one variable causes a corresponding increase (or
decrease) in another then the two variables are said to be directly correlated. Similarly, if an
increase in one causes a decrease in another or vice versa, then the variables are said to be
indirectly correlated. If a change in an independent variable does not cause a change in the
dependent variable then they are uncorrelated. Thus, correlation can be positive (direct
correlation), negative (indirect correlation), or zero. This relationship is given by
the correlation coefficient.
Regression Definition
Regression can be defined as a measurement that is used to quantify how the change in one
variable will affect another variable. Regression is used to find the cause and effect between
two variables. Linear regression is the most commonly used type of regression because it is
easier to analyze as compared to the rest. Linear regression is used to find the line that is the
best fit to establish a relationship between variables.
Correlation and Regression Analysis
Both correlation and regression analysis are done to quantify the strength of the relationship
between two variables by using numbers. Graphically, correlation and regression analysis can
be visualized using scatter plots.
Correlation analysis is done so as to determine whether there is a relationship between the
variables that are being tested. Furthermore, a correlation coefficient such as Pearson's
correlation coefficient is used to give a signed numeric value that depicts the strength as well
as the direction of the correlation. The scatter plot gives the correlation between two variables
x and y for individual data points as shown below.
Regression analysis is used to determine the relationship between two variables such that the
value of the unknown variable can be estimated using the knowledge of the known variables.
The goal of linear regression is to find the best-fitted line through the data points. For two
variables x and y, the regression analysis can be visualized as a best-fit line drawn through the scatter of data points.
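As a brief sketch in Python (using NumPy and SciPy, which we assume are available), both the correlation coefficient and the regression line can be computed directly; the data comes from the height/weight example shown later in this chapter:

import numpy as np
from scipy.stats import pearsonr

height = np.array([158, 160, 163, 166, 168, 171, 174, 176])   # X
weight = np.array([60, 62, 64, 65, 67, 69, 71, 72])           # Y

r, p = pearsonr(height, weight)                    # strength and direction of the correlation
slope, intercept = np.polyfit(height, weight, 1)   # least-squares line: y = slope*x + intercept
print(round(r, 3), round(slope, 3), round(intercept, 2))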
CORRELATION
Types of Correlation:
Correlation is described or classified in several different ways. Three of the most important are:
I. Positive, Negative and Zero
II. Simple, Partial and Multiple
III. Linear and Non-linear
I. Positive, Negative and Zero Correlation: Whether correlation is positive (direct) or negative (inverse) depends upon the direction of change of the variables.
Positive Correlation: If both variables vary in the same direction, correlation is said to be positive. It means if one variable is increasing, the other on average is also increasing, or if one variable is decreasing, the other on average is also decreasing; then the correlation is said to be positive. For example, the correlation between the heights and weights of a group of persons is a positive correlation.
Height (cm): X 158 160 163 166 168 171 174 176
Weight (kg): Y 60 62 64 65 67 69 71 72
Negative Correlation: If the variables vary in opposite directions, the correlation is said to be negative. It means if one variable increases but the other variable decreases, or if one variable decreases but the other variable increases, then the correlation is said to be negative. For example, the correlation between the price of a product and its demand is a negative correlation.
Price of Product (Rs. Per Unit): X 6 5 4 3 2 1
Demand (In Units): Y 75 120 175 250 215 400
Zero Correlation: Strictly speaking, this is not a type of correlation, but it is still called zero or no correlation. When we don't find any relationship between the variables, it is said to be zero correlation. It means a change in the value of one variable doesn't influence or change the value of the other variable. For example, the correlation between the weight of a person and intelligence is a zero or no correlation.
II. Simple, Partial and Multiple Correlation: The distinction between simple, partial and multiple correlation is based upon the number of variables studied.
Simple Correlation: When only two variables are studied, it is a case of simple correlation. For example, studying the relationship between the marks secured by a student and the attendance of the student in class is a problem of simple correlation.
Partial Correlation: In partial correlation, one studies three or more variables but considers only two variables to be influencing each other, with the effect of the other influencing variables held constant. For example, in the above example of the relationship between student marks and attendance, other influencing variables such as effective teaching and the use of teaching aids like computers and smart boards are assumed to be constant.
Multiple Correlation: When three or more variables are studied together, it is a case of multiple correlation. For example, in the above example, if the study covers the relationship between student marks, attendance of students, effectiveness of teacher, and use of teaching aids, it is a case of multiple correlation.
III. Linear and Non-linear Correlation: Depending upon the constancy of the ratio of change between the variables, the correlation may be linear or non-linear.
Linear Correlation: If the amount of change in one variable bears a constant ratio to the amount of change in the other variable, the correlation is said to be linear. If such variables are plotted on graph paper, all the plotted points would fall on a straight line. For example, if every 5 units of raw material yield one unit of finished product, then doubling the raw material doubles the output:
Raw material: X 10 20 30 40 50 60
Finished Product: Y 2 4 6 8 10 12
Non-linear Correlation: If the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable, the correlation is said to be non-linear. If such variables are plotted on a graph, the points would fall on a curve and not on a straight line. For example, if we double the amount of advertisement expenditure, the sales volume would not necessarily double.
Scatter Diagram: This is a graphic method of measuring correlation. It is a diagrammatic representation of bivariate data used to ascertain the relationship between two variables. Under this method, the given data are plotted on graph paper in the form of dots, i.e., for each pair of X and Y values we put a dot, thus obtaining as many points as the number of observations. Usually the independent variable is shown on the X-axis, whereas the dependent variable is shown on the Y-axis. Once the values are plotted, the pattern reveals the type of correlation between variables X and Y. A scatter diagram reveals whether the movements in one series are associated with those in the other series.
CHAPTER FIVE
INTER-PROGRAM COMMUNICATION (IPC)
IPC refers to the mechanisms and techniques that allow different programs or processes running
on a computer to communicate and share data with each other. It enables programs to
exchange information, coordinate their activities, and work together to accomplish a specific
task. IPC facilitates the sharing of data and resources, and the synchronization of actions
between different programs. IPC is crucial for modern operating systems that support
multitasking, as it enables different programs to cooperate and share resources effectively.
Inter-Process Communication (IPC) refers to the mechanisms and techniques used by operating systems to allow different processes to communicate with each other. This enables programs running concurrently in an operating system to cooperate.
The two fundamental models of Inter-Program Communication are: Shared Memory and
Message Passing
Shared Memory
IPC through Shared Memory is a method where multiple processes are given access to the
same region of memory. This shared memory allows the processes to communicate with each
other by reading and writing data directly to that memory area.
Shared memory in IPC can be visualized as being like global variables in a program, which are shared across the entire program. However, shared memory in IPC goes beyond global variables: it allows multiple processes to share data through a common memory space, whereas global variables are restricted to a single process.
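A minimal sketch of this idea, using Python's multiprocessing.shared_memory module (Python 3.8+; the block name "demo_shm" is arbitrary). For brevity, both the writer and the reader run in one script, but the reader could just as well be a separate process attaching by name:

from multiprocessing import shared_memory

# Writer: create a named shared block and write bytes into it
shm = shared_memory.SharedMemory(create=True, size=16, name="demo_shm")
shm.buf[:5] = b"hello"

# Reader: attach to the same block by name and read the data
reader = shared_memory.SharedMemory(name="demo_shm")
print(bytes(reader.buf[:5]))   # b'hello'

reader.close()
shm.close()
shm.unlink()   # release the block once all processes are done with it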
Message Passing
IPC through Message Passing is a method where processes communicate by sending and
receiving messages to exchange data. In this method, one process sends a message, and the
other process receives it, allowing them to share information. Message Passing can be
achieved through different methods like Sockets, Message Queues or Pipes.
Sockets provide an endpoint for communication, allowing processes to send and receive
messages over a network. In this method, one process (the server) opens a socket and listens
for incoming connections, while the other process (the client) connects to the server and
sends data. Sockets can use different communication protocols, such
as TCP (Transmission Control Protocol) for reliable, connection-oriented
communication or UDP (User Datagram Protocol) for faster, connectionless
communication.
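The client/server pattern described above can be sketched with Python's standard socket module; the port number 50007 is arbitrary, and the server runs in a background thread only so the example is self-contained:

import socket
import threading

# Server side: open a listening TCP socket
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("localhost", 50007))
srv.listen(1)

def serve():
    conn, _ = srv.accept()                 # block until a client connects
    with conn:
        print("server received:", conn.recv(1024).decode())

t = threading.Thread(target=serve)
t.start()

# Client side: connect to the server and send a message
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
    client.connect(("localhost", 50007))
    client.sendall(b"ping")

t.join()
srv.close()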
Methods of Inter-Program Communication (IPC)
1. Pipes – A pipe is a unidirectional communication channel used for IPC between two related processes. One process writes to the pipe, and the other process reads from it (see the sketch after this list). Types of pipes are anonymous pipes and named pipes (FIFOs).
2. Sockets – Sockets are used for network communication between processes running on
different hosts. They provide a standard interface for communication, which can be
used across different platforms and programming languages.
3. Shared memory – In shared memory IPC, multiple processes are given access to a
common memory space. Processes can read and write data to this memory, enabling
fast communication between them.
4. Semaphores – Semaphores are used for controlling access to shared resources. They
are used to prevent multiple processes from accessing the same resource
simultaneously, which can lead to data corruption.
5. Message Queuing – This allows messages to be passed between processes using either a single queue or several message queues. Queues are managed by the system kernel, and messages are coordinated using an API.
6. Remote Procedure Calls (RPC) – RPC allows a program to call a procedure (or function) on another machine in a network as though it were a local call, abstracting away the details of communication and making distributed systems easier to build. It is a technique used for distributed computing: processes running on different hosts can call procedures on each other as if they were running on the same host.
Remote Method Invocation (RMI) is a Java-based technique used for Inter-Process
Communication (IPC) across systems, specifically for calling methods on objects
located on remote machines. It allows a program running on one computer (the client)
to execute a method on an object residing on another computer (the server), as if it
were a local method call.
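Here is a minimal sketch of the pipe mechanism from item 1, using Python's multiprocessing.Pipe (duplex by default; duplex=False gives the one-way channel described above):

from multiprocessing import Process, Pipe

def child(conn):
    conn.send("hello from the child process")   # write into the pipe
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe(duplex=False)  # one-way: parent reads, child writes
    p = Process(target=child, args=(child_conn,))
    p.start()
    print(parent_conn.recv())                     # read from the pipe (blocks until data arrives)
    p.join()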
Each method of IPC has its own advantages and disadvantages, and the choice of which
method to use depends on the specific requirements of the application. For example, if high-
speed communication is required between processes running on the same host, shared
memory may be the best choice. On the other hand, if communication is required between
processes running on different hosts, sockets or RPC may be more appropriate.
Inter-Process Communication (IPC) enables processes to share data and work together through methods like Shared Memory, Message Passing, Pipes, Sockets, Semaphores, Remote Procedure Calls (RPC) and Remote Method Invocation (RMI). Each method has unique strengths suited to different scenarios, such as local or distributed systems.
Dashboards
Dashboards are effective data visualization tools for tracking and visualizing data from multiple data sources, providing visibility into how the specific behaviors of a team, or an adjacent one, affect performance.
Data visualization
Data visualization is the graphical representation of information and data.
By using visual elements like charts, graphs, and maps, data visualization
tools provide an accessible way to see and understand trends, outliers,
and patterns in data. Additionally, it provides an excellent way for
employees or business owners to present data to non-technical audiences
without confusion. In the world of Big Data, data visualization tools and
technologies are essential to analyze massive amounts of information and
make data-driven decisions.
Data visualization is a critical step in the data science process, helping teams and individuals
convey data more effectively to colleagues and decision makers. Teams that manage
reporting systems typically leverage defined template views to monitor performance.
However, data visualization isn't limited to performance dashboards. For example, while text mining, an analyst may use a word cloud to capture key concepts, trends, and hidden relationships within unstructured data. Alternatively, they may utilize a graph structure to illustrate relationships between entities in a knowledge graph.
represent different types of data, and it’s important to remember that it is a skillset that should
extend beyond your core analytics team.
Types of data visualizations
Line charts and area charts: These visuals show changes in one or more quantities by plotting data points over time, and they are frequently used within predictive analytics. Line graphs utilize lines to demonstrate these changes, while area charts connect data points with line segments, stacking variables on top of one another and using color to distinguish between variables.
Histograms: This graph plots a distribution of numbers using a bar chart (with no
spaces between the bars), representing the quantity of data that falls within a
particular range. This visual makes it easy for an end user to identify outliers within a
given dataset.
Scatter plots: These visuals are beneficial in revealing the relationship between two variables, and they are commonly used within regression data analysis. However, these can sometimes be confused with bubble charts, which are used to visualize three variables via the x-axis, the y-axis, and the size of the bubble.
Heat maps: These graphical representations are helpful in visualizing behavioral data by location. This can be a location on a map, or even a webpage.
Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles. Treemaps are great for comparing the proportions between categories via their area size.
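As an illustration of two of these chart types, the following Python sketch (using matplotlib and numpy, with simulated data of our own) draws a histogram and a scatter plot side by side:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=10, size=200)   # simulated measurements

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.hist(values, bins=20)                 # histogram: distribution of one variable
ax1.set_title("Histogram")

x = rng.uniform(0, 10, size=50)
y = 2 * x + rng.normal(0, 2, size=50)     # two related variables
ax2.scatter(x, y)                         # scatter plot: relationship between x and y
ax2.set_title("Scatter plot")

plt.tight_layout()
plt.show()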
Advantages
1. Quick recognition of data trends and spreads: when we see a chart, we quickly see trends and outliers.
2. Interactively explore opportunities.
3. Visualize patterns and relationships.
Disadvantages
1. It's easy to make an inaccurate assumption, or sometimes the visualization is just designed wrong, making it biased or confusing.
2. Biased or inaccurate information.
3. Correlation doesn’t always mean causation.
4. Core messages can get lost in translation.