Lesson 3 Notes
Data might not be useful on its own. However, when it is analyzed, the information gained from data
can help the user make better decisions. Data analysis includes collecting, cleaning, transforming,
and processing data. This analysis is done to obtain information that is useful in decision-making
and that helps individuals and organizations make sense of data.
In this lesson, you will learn about various data analysis methods and how to use them to draw
insights from collected data. Before completing this lesson, you should have the following knowledge:
Understand how basic data variable types, structures, and categories are used in data
analysis.
Know how to import, clean, organize, and aggregate data using various statistical tools.
Data analysts use various methods and tools to make sense of data. Different kinds of data and
information need different methods of analysis.
For example,
one analyst may be trying to understand how an ancient group of people migrated across
countries in the distant past for a historian.
Another might be predicting modern travel patterns for an airline or airport. These two
analyses would require different data collection methods and analysis tools.
Descriptive analysis
Descriptive analysis is used to find out what happened. It uses statistical tools with data to produce
summaries and conclusions. Descriptive analysis is an important first step in making sense of a data
set.
For example, a list of blood pressure readings of patients in a trial study might not be easily
understood. Descriptive analysis can be used to pull together the overall information about individual
measurements. This overall information is more easily understood, and the next steps can be settled
on easily.
The following are examples of business questions that descriptive analysis addresses:
What were the sales during each quarter of the year 2022?
How do sales during the first quarter of the year 2022 compare with sales during the first
quarter of 2021?
Which product had the most sales?
Measures of frequency
Measures of central tendency
Measures of dispersion
Measures of position
Measures Of Frequency
Measures of frequency provide information on the number of times a given variable occurs. They
include frequencies, ratios, and proportions. In our “which product had the most sales” question, the
analyst could provide a breakdown of sale amounts by product, as in Figure 3-1:
Measures Of Central Tendency
Measures of central tendency describe the center of a dataset; they include the mean, median, and
mode. Graphical examples of measures of central tendency are shown in Figure 3-2, with instructions
for creating them provided in later sections.
Measures Of Dispersion
Measures of dispersion are used to describe the spread of data. They help analysts see whether there is
wide variation in the data, or whether it is clustered around one or more specific values. Measures of
dispersion include the range, variance, and standard deviation. Someone reporting on product sales
could give the range of sales, or the difference between the highest and the lowest sale numbers, as
in Figure 3-3.
Measures Of Position
Measures of position are used to determine where each data point exists in a given dataset. These
measures include percentiles and quartiles. An analyst looking at product sales could arrange the
items from least to most popular and then give a percentile (ranking out of a hundred) or a quartile
(ranking out of four) to each product. Quartiles and percentiles can also be shown using the previous
chart, Figure 3-3.
Diagnostic analysis
Diagnostic analysis is used to help explain why data behaves the way it does. It helps explain the
relationships between variables. This analysis is often done after descriptive analysis because it
uses the results from descriptive analysis and looks for a cause. For example, a business owner can
use diagnostic analysis to explain the reason for a sudden increase in sales. The following are
examples of business questions addressed by this type of analysis:
Why were sales in the first quarter of 2022 lower than sales in the first quarter of 2021?
Step 1 Identify the events or anomalies worth investigating. These are often surprises in the results
of a descriptive analysis, such as the drop in Q1 sales between 2021 and 2022.
Step 2 Identify data that can be useful in investigating the events identified in Step 1. After the
analyst determines that Q1 in 2021 had greater sales than Q1 in 2022, they will look for internal and
external reasons in related data. Was there a change in product or in overall shopping patterns?
What external factors can be included or eliminated (weather, economic changes, and social media
trends to name a few)?
Step 3 Use the data identified in Step 2 to discover hidden relationships that may have led to the
events identified in Step 1. An analyst can perform a number of statistical tests to find relationships
between data sets. Correlations, or specific relationship patterns, can be tested. For example, the
analyst might find that average daily temperature is correlated, or related in a predictable way, to
sales.
Predictive analysis
Predictive analysis uses current and historical data to determine what might happen in the future.
It can be used to answer the following business questions:
How many new competitors are expected to enter our target market in the next 16 months?
Prescriptive analysis
Prescriptive analysis uses data to recommend the best course of action. It is used to help decide
what should be done. Prescriptive analysis tools help analyze data, determine possible action points,
and make recommendations on the next steps to follow. Prescriptive analysis can be used to answer the
following business questions:
Hypothesis testing
Hypothesis testing is a data analysis tool that uses data from a sample, then applies the test results
to the whole group, or population.
A population is a group where every member has something in common. Examples of populations
include all registered voters in the US or all startups in Canada that failed before three years of
operation.
Often, analysts want to know things about a population but cannot collect data on all members of the
population. Instead, they choose a part of the population, called a sample, and use it to draw
conclusions about the population. A sample must be chosen carefully to make sure it doesn’t
misrepresent the whole population. One key feature of a sample is that it must be random. In a
random sample, every member of the population has an equal chance of being selected.
Table 3-1
A hypothesis is a yes/no statement about a characteristic of a population. In the following example,
the analyst is interested in a population average and writes a hypothesis about the average.
Light bulbs slowly get dimmer as they are used. A company that makes light bulbs claims that the mean
(average) time its bulbs take to reach a specific level of dimness (called a lumen test) is 1000 hours. A
consumer advocate would like to determine if the mean lumen test of the bulbs is actually less than
1000 hours.
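In standard notation (implied by the scenario above, though not written out in this excerpt), the
hypotheses are: null hypothesis H0: μ = 1000 hours, and alternative hypothesis Ha: μ < 1000 hours,
where μ is the true mean lumen-test time for the population of bulbs.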
Hypothesis testing using the critical value approach
Hypothesis testing using the p-value approach
In the p-value approach, the analyst is still concerned with where her data falls on the bell curve, and
whether it is extreme enough to show (at her desired significance level) that the population mean must
be less than 1000. However, in this approach, when she reaches Step 4, she won't find the critical
region. Instead, she calculates the probability of observing a value at least as extreme as hers, given a
true population mean of 1000, and compares that to her chosen significance level of α = 0.05, or 5%.
If the probability she calculates, called the p-value of her test, is less than or equal to her chosen α-
level, the null hypothesis is rejected. If it is greater than the chosen α-level, the null hypothesis is
not rejected. For example, a sample with a calculated p-value of 0.15 or 15% would not allow the
analyst to reject the null hypothesis and conclude the true population mean is less than 1000 hours. A
sample with a calculated p-value of 0.02 or 2%, on the other hand, would allow her to reject the null
hypothesis and conclude the true population mean is lower than 1000.
To calculate the p-value for the test in this problem, the analyst can run a bit of code in R. She
calculates the test statistic as before to be -5.22. The degrees of freedom for this one-sample t-test
equal the size of the sample minus 1, and this one-tailed test looks at the left, or lower, tail. In this
code, she uses the pt function with the syntax:
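The exact call is not shown in this excerpt. A minimal sketch, assuming the test statistic above and a
hypothetical sample size n (the study's actual n is not given here; the printed result on the next line
is the lesson's reported value, which depends on the actual n):

t_stat <- -5.22        # test statistic computed earlier
n <- 25                # hypothetical sample size; substitute the study's actual n
pt(t_stat, df = n - 1) # pt() returns the lower-tail probability, matching this left-tailed test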
[1] 2.437634e-05
Because the p-value is less than 0.05, she can reject the null hypothesis and conclude that there is
evidence at the 5% significance level that the population mean is less than 1000 hours.
Video: Data Description
Data aggregation is another important part of the data exploration process. It helps analysts find
trends in the data, make comparisons, and discover information that might have gotten lost in all the
individual data points.
Data interpretation is the process of giving meaning to processed and analyzed data.
Data processing techniques, such as data filtering and searching, are essential in the process of data
interpretation.
Data filtering involves splitting up the sample into groups to create new subsets to be analyzed.
Data searching helps to find specific records, e.g., unique values or specific strings, in a dataset.
Data is often aggregated using descriptive statistics, e.g., measures of frequency, central tendency,
dispersion, and position.
In the following topics, we’ll examine some of the common statistical measures used to aggregate
data.
Count
The count or frequency of an item is the number of times the item occurs in a dataset.
Example
Find the number of cars of each color in the following dealership data:
Color Count
black 1
blue 2
green 2
red 3
The table gives the count or frequency of each car color in the data.
Sum
Example
Find the sum of all the red cars in stock at all locations in the dataset.
Mean
The mean of a variable in a dataset is calculated by adding all of the values of the variable in the
dataset and dividing their sum by the number of observations.
In this section, you will use the chickwts dataset to aggregate data in R.
The chickwts dataset is a built-in dataset in R. The data comes from an experiment in which newly
hatched chickens were randomly divided into six groups with each of the groups receiving different
feed supplements. The weights of the chickens were measured in grams after six weeks.
The dataset contains 71 observations on two variables, namely, weight and feed. In this
dataset, weight denotes the weight of chickens in grams, while feed denotes the feed supplement
type.
You can explore the dataset using the dim(), head(), and str() functions, which provide information on
the dimensions, the first few rows, and the internal structure of the data frame, respectively.
df = chickwts
dim(df)
[1] 71 2
head(df)
weight feed
1 179 horsebean
2 160 horsebean
3 136 horsebean
4 227 horsebean
5 217 horsebean
6 168 horsebean
str(df)
'data.frame': 71 obs. of 2 variables:
 $ weight: num 179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
You can save the chickwts dataset as chickwts.csv on your desktop to use later in the Excel and SQL
environments using the following steps on Windows:
Open RStudio or any other R environment that supports access to the local file system on
your computer.
df = chickwts
Click the "Run" button, or use the keyboard shortcut (Ctrl + Enter on Windows) to execute
the code.
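The export command itself does not appear in this excerpt. A minimal sketch using write.csv(),
assuming a Desktop location (the path is hypothetical; adjust it for your machine):

df = chickwts
# Hypothetical path: replace <username> with your Windows user name
write.csv(df, "C:/Users/<username>/Desktop/chickwts.csv", row.names = FALSE)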
You may want to answer the following questions using the chickwts dataset:
1. How many chickens were fed each type of feed supplement?
2. What is the total weight of chickens for each group of feed supplements?
3. What is the mean weight of chickens for each group of feed supplements?
To answer these questions in R, you can use the aggregate() function. The syntax of
the aggregate() function is as follows:
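The syntax line itself is missing from this excerpt; based on the solutions below, it likely has the
following form:

aggregate(dataframe$variable, by = list(dataframe$group), FUN = function)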
In this case, dataframe is df (i.e., the name of the dataset you are using) and function is the
function you want to apply to the values in the grouped data (e.g., sum, mean, or min).
Code implementation
Solution to Q. 1 (number of chickens fed each feed type): Use the aggregate() function to group the
data by feed. In this case, the function argument takes the value length as follows:
df = chickwts
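The aggregate() call itself is missing from this excerpt; a sketch consistent with the output below
(the by list sets the column label "feed type"):

aggregate(df$weight, by = list("feed type" = df$feed), FUN = length)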
feed type x
1 casein 12
2 horsebean 10
3 linseed 12
4 meatmeal 11
5 soybean 14
6 sunflower 12
Alternatively, you can use the table() function to obtain the counts for
each group of feed supplements.
table(df$feed)
   casein horsebean   linseed  meatmeal   soybean sunflower
       12        10        12        11        14        12
Solution to Q. 2 (total weight for each group of feed): Use the aggregate() function to group the
dataset by feed and find the sum of weights of the chickens for each group of feed supplements:
df = chickwts
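Again, the call itself is missing here; a sketch consistent with the output below:

aggregate(df$weight, by = list("feed type" = df$feed), FUN = sum)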
feed type x
1 casein 3883
2 horsebean 1602
3 linseed 2625
4 meatmeal 3046
5 soybean 3450
6 sunflower 3947
Solution to Q. 3 (mean weight by feed type): Use the aggregate() function in either of the following
ways to group the dataset by feed and find the mean weight of each group of chicks based on the feed
type they received:
df = chickwts
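The first form of the call is missing here; a sketch consistent with the output below:

aggregate(df$weight, by = list("feed type" = df$feed), FUN = mean)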
feed type x
1 casein 323.5833
2 horsebean 160.2000
3 linseed 218.7500
4 meatmeal 276.9091
5 soybean 246.4286
6 sunflower 328.9167
OR
df = chickwts
aggregate(weight ~ feed, df, mean)
feed weight
1 casein 323.5833
2 horsebean 160.2000
3 linseed 218.7500
4 meatmeal 276.9091
5 soybean 246.4286
6 sunflower 328.9167
You will use the chickwts dataset to aggregate data in this section. Recall that you have saved this
dataset in the folder named DATA as the file chickwts.csv.
Aggregate functions, commonly performed with the GROUP BY command, output a single value
computed from a set of data. Examples of commonly used aggregate functions in SQL
are COUNT(), SUM(), AVG(), MIN(), and MAX(). All aggregate functions in SQL ignore null values
except for COUNT().
The GROUP BY clause groups rows with similar values into summary rows.
You can use these functions to answer the following questions:
1. How many chickens were fed each type of feed supplement?
2. What is the mean weight of chickens for each group of feed supplements?
3. What is the variance of the weight of chickens for each group of feed supplements?
1. Open Microsoft SQL Server Management Studio and connect to your SQL Server instance.
2. In the Object Explorer, expand the database where you want to import the dataset.
3. Right-click on the database, choose Tasks > Import Flat File and follow the instructions on
screen to import the file.
Implementation In SQL
Code solution to Q. 1 (number of chickens fed each feed type): Use the COUNT() function and
the GROUP BY clause to calculate the number of chickens by groups of feed as follows:

SELECT feed, COUNT(*) AS count
FROM chickwts
GROUP BY feed;
Code solution to Q. 2 (mean weight by feed type): Use the AVG() function and the GROUP BY clause
to calculate the mean weight of the chickens in each group of feed as follows:

SELECT feed, AVG(weight) AS mean_weight
FROM chickwts
GROUP BY feed;
Code solution to Q. 3 (variance by feed type): Use the VAR() function and the GROUP BY clause to
calculate the variance by groups of feed as follows:

SELECT feed, VAR(weight) AS variance_weight
FROM chickwts
GROUP BY feed;
In this section, you will use Power Query in Excel to aggregate data. You will also use chickwts.csv in
the DATA folder.
First, you should import chickwts.csv into Power Query using the following steps.
o In Excel 2016: Click Data > New Query > From File > From CSV
Navigate to the chickwts.csv file. Select it and click Transform Data in the window that pops
up.
What is the mean weight of chickens for each group of feed supplements?
To calculate mean weight by groups of feed, click Group By in the Home tab.
In the Group By dialog box, choose the variables to group your data by, as shown in Figure 3-9.
Figure 3-9
Group your data by feed, name the new column that will contain the aggregates mean_weight, and
apply the operation Average on the column weight.
Click OK.
The actions above produce the aggregate mean table shown in Figure 3-10, answering the
question posed in this section.
Figure 3-10
To move this result to the worksheet, click Close & Load To … in the Home tab of the query. Use
the Load To dialog box that appears to choose how you would like to import the data and click Load, as
shown in Figure 3-11.
Figure 3-11
Data interpretation
While working with the chickwts dataset, you discovered that the mean weight of chickens
fed casein was 323.5833. The mean weight of those fed soybeans was 246.4286. What do these
numbers mean?
Data interpretation is how an analyst finds meaning in data and requires the analyst to judge the
results of the analysis. The analyst relates the processed data to the research questions, explores
relationships between the measurements, and draws inferences. They ask the question: What is the
meaning of the pattern that the data displays?
The chickwts dataset statistics can help answer the following research questions:
Does the mean weight of chickens fed casein at six weeks differ significantly from that of
chickens fed soybean?
To answer these questions, the chickwts dataset must be analyzed using more advanced methods.
Exploratory data analysis (EDA) is a critical first step in analyzing data for interpretation. It
summarizes datasets by their main characteristics. Often data charting methods and summary
statistics are used. EDA is useful in finding errors in the data, detecting outliers (unusual data points),
and analyzing relationships between variables.
Two EDA-related techniques covered later in this lesson are to:
Drill a dataset
Mine a dataset
A critical part of EDA is finding relationships between variables. The analyst must understand what
situation the data describes, the types of variables in the data, and if some variables influence the
values of other variables.
Lots of methods can be used to find relationships between variables. The method chosen will depend on
what kind of data an analyst has. Is it numerical, where each value is represented by a number? Is it
categorical, where each value is a category label (for example, favorite color or hair color)? Is it
mixed data, where some entries are numbers and some are categories? Different kinds of datasets
require different analysis methods.
Correlation
Correlation is a statistical measure that explains certain kinds of relationships between two
variables. Specifically, correlation measures a straight-line relationship: if one variable goes up, does
another consistently go up or down? Correlation analysis is important because it helps an analyst pick
factors for more investigation, and to include in mathematical models. It can be measured using
the correlation coefficient (denoted by r). The value of r ranges from -1 to +1.
There are several types of correlation coefficients, depending on the data type. The most commonly
used correlation coefficient is called Pearson’s correlation (also called Pearson’s r).
Pearson’s r measures the strength and direction of the linear relationship between two variables.
Correlation Values
Positive correlations (an r-value between 0 and 1) show that the two variables change in the
same direction. When one increases, so does the other. When one decreases, so does the
second.
Negative correlations (an r-value between -1 and 0) show two variables move in opposite
directions. When one variable increases, the other variable decreases. When one decreases,
the second increases.
A zero correlation exists when there is no linear relationship between two variables.
Generally, you can use Table 3-6 to help you interpret the correlation value:
Table 3-6
It is important to know that a strong linear correlation between two variables does not mean that
one of the variables is causing a change in the other (causation). In other words, correlation does
not imply causation. There is a positive correlation between the number of firefighters at a fire
scene and the amount of damage caused by the fire. However, this does not mean that having more
firefighters at the scene causes more damage. Rather, it is more likely that fires that are more
severe and likely to cause more damage require more firefighters to respond to the scene. (Causation
can only be determined from a properly designed, randomized, and controlled experiment and further
analysis.)
When performing correlation analysis, analysts investigate how the data looks in a graph, in addition
to looking at correlation values. Charting data points helps you better visualize and interpret
correlation values. For example, it is easier to identify non-linear or curved relationships between
variables by looking at a graph like a scatter plot.
Correlations are best visualized using scatter plots. A scatter plot shows the relationship between
two numerical variables by plotting a dot for every observation. It allows you to identify overall
patterns, directions, and strength of association.
In this section, you will use the marketing dataset from the datarium package in R.
The marketing dataset contains data on the amount of money (in thousands of dollars) that a
company is willing to set aside for advertising on three different media platforms (YouTube, Facebook,
and newspaper) and the corresponding effect on sales. It has 200 rows and 4 columns.
To obtain the marketing dataset, first install the datarium package using the following code:
Note: In some environments or installations of R, the datarium package may already be included by
default. It is recommended that you check the installed packages in your specific R environment
before installing it. You can do this by using the installed.packages() function or checking the
package list in your IDE. If the datarium package is already present, there is no need to install it
again.
install.packages("datarium")
Load the marketing dataset into the variable md using the following code:
require(datarium)
md <- marketing
To better understand the dataset, use the function dim() to determine the dimensions of the
dataset, str() to obtain information about the rows and columns of the dataset, and head() to view
the first few rows of the dataset. The code to view the dataset dimensions is as follows:
dim(md)
This code produces the following output:
[1] 200 4
To display information about the rows and columns in the dataset, run the following code:
str(md)
Run the following code to view the first six rows of the dataset:
head(md)
Missing values can be found in R using the function is.na(). Missing values in an R dataset are fields
that contain NA. The is.na function checks each data point in a dataset and returns TRUE if it
contains NA and FALSE if it does not contain NA. Tabulate these values using the table() function as
follows:
table(is.na(md))
FALSE
800
The code output above shows that none of the 800 data points in the dataset are missing values.
From the results received so far, the following statements can be made about the dataset.
The dataset has 200 observations and 4 variables, namely, YouTube, Facebook, newspaper,
and sales.
Descriptive statistics (e.g., averages) help an analyst better understand the data.
The summary() function in R is used to compute summary statistics of data and models.
summary(md)
[summary(md) output abbreviated: five-number summaries plus the mean for youtube, facebook,
newspaper, and sales; for example, the first-quartile values are 89.25, 11.97, 15.30, and 12.45,
respectively]
From the code output above, note that the company spends the most on YouTube advertising (mean
budget is $176,450) and the least on Facebook advertising (mean budget is $27,920).
Scatter plots also help you visualize the relationship between sales and each explanatory variable
(i.e., YouTube, Facebook, or newspaper). The function par(mfrow=c(nrows, ncols)) allows you to
combine many plots in a single graph in R, i.e., a matrix of nrows by ncols plots. The mfrow argument
specifies the dimensions of the grid, indicating the number of rows (nrows) and columns (ncols) of
plots you want to arrange.
par(mfrow=c(1,3))
plot(md$youtube, md$sales, xlab = "youtube", ylab = "sales")
plot(md$facebook, md$sales, xlab = "facebook", ylab = "sales")
plot(md$newspaper,md$sales,xlab = "newspaper",ylab="sales")
Figure 3-12
From Figure 3-12, the relationships between Facebook and sales, and YouTube and sales appear both
positive and linear. The data is clumped in a line shape, rising from left to right. However, the
relationship between YouTube and sales appears to be stronger than that
between Facebook and sales because the line is clearer and the data is more tightly clustered. The
relationship between newspaper and sales does not appear to be linear.
Computing the numerical value of the correlation between sales and each advertising medium can be
done using the cor() function as follows:
cor(md)
From the last column of the R output, the correlation between sales and YouTube (0.78 to 2 decimal
places) is shown to be stronger than that between sales and Facebook (0.58 to 2 decimal places). The
correlation between sales and newspapers is weak (0.23 to 2 decimal places).
After the correlation analysis, further analysis of the relationship between sales and the
explanatory variable YouTube is the next step because these two variables have a linear relationship
and the strongest correlation.
Save the marketing data in the previously created DATA folder of your desktop as marketing.csv to
use in later exercises of the lesson:
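The save steps are not shown in this excerpt. A minimal R sketch, assuming the DATA folder sits on
your desktop (the path is hypothetical; adjust it for your machine):

md <- marketing
write.csv(md, "C:/Users/<username>/Desktop/DATA/marketing.csv", row.names = FALSE)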
Create a scatter plot (or chart) to investigate the relationship between sales and YouTube using the
following steps:
Click on the new worksheet anywhere far away from the data.
Click Insert > Scatter (under the charts section) to create an empty scatter chart.
Click Chart Design Tools > Select Data to open the Select Data Source dialog box.
Click Add in Legend Entries (Series) to open the Edit Series dialog box.
Use the Edit Series dialog box to choose the appropriate data ranges for each axis. X values
should correspond to YouTube values (in the range A2:A201), and Y values should correspond
to sales values (in the range D2:D201); refer to Figure 3-13.
Figure 3-13
Click OK. You can then add a chart title and axes titles to the graph.
Figure 3-14
To calculate the Pearson correlation coefficient in Excel, use the function CORREL(range1, range2),
where range1 and range2 are the cell references containing the data.
Correlation formulas can be added in the cells I2, I3, and I4 of the worksheet for correlations
between YouTube and sales, i.e., =CORREL(A2:A201,D2:D201), Facebook and sales,
i.e., =CORREL(B2:B201,D2:D201), and newspaper and sales, i.e., =CORREL(C2:C201,D2:D201),
respectively.
Figure 3-15
Correlation and scatter charts in Excel Part 1
Cross tabulation
Cross tabulation is a method used to analyze the relationship between two or more non-numerical
variables. Cross tabulation involves grouping variables to determine the correlation between them.
The process uses a table called a crosstab (also called a contingency table or a two-way table) to
allow you to discover how often a combination of two variable values occurs.
Cross Tabulation In R
A food service worker at a local university wants to better understand the food preferences of the
students served in the cafeteria. He does a brief survey of lunch students, asking them what food
they would like to see added to the menu. He creates the dataset called ct containing two categorical
variables food and gender. ct records the favorite foods and genders of 28 college students.
head(ct,5)
gender food
1 Female sushi
2 Female sushi
3 Female sushi
4 Female sushi
5 Female sushi
Next, use ct and the function table() to create a crosstab containing frequencies based
on gender and food as follows:
ct_table <- table(ct$gender, ct$food)
ct_table
Female 5 3 6
Male 3 4 7
From the table, the frequencies of various pairs of characteristics are shown. For example, note
that 6 females would like to see sushi added to the menu, while 5 females would like to see ice
cream added to the menu.
A proportions table can also be created where the cell values are proportions of the total number of
entries in the table (i.e., 28) using the following R code:
p_table <- prop.table(ct_table)
p_table
In this example, 25 percent of students surveyed are males who would like sushi added to the menu
and 21 percent of students surveyed are females who would like sushi added to the menu.
Save the dataset ct as ct.csv in the previously created DATA folder to use in later exercises:
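The save command is not shown in this excerpt. A minimal R sketch, assuming the same DATA folder
used earlier (hypothetical path):

write.csv(ct, "C:/Users/<username>/Desktop/DATA/ct.csv", row.names = FALSE)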
3. Click Data > Get Data > From File > From Text/CSV.
6. At the next window, select Transform Data. This opens the Power Query Editor.
7. In the Power Query Editor, you will see a preview of your data in a table format.
8. Click on the Load button in the Power Query Editor. This will load the transformed data into
your Excel worksheet. The data will be displayed in Excel with the first row as headers, as
specified in the Power Query Editor.
The PivotTable From Table/Range dialog box appears. Select the range that contains the data and a
cell in the Existing Worksheet in which to place the crosstab (e.g., cell H3), as shown in Figure 3-16.
Click OK.
Figure 3-16
In the PivotTable Fields window that appears, drag the variable gender to the Rows area, the
variable food to the Columns area, and the variable food again to the Values area, as shown in Figure
3-17.
Figure 3-17
The crosstab in Figure 3-18 should appear after completing the steps outlined above.
Figure 3-18
An outlier is a data point that is far from other points. In other words, it differs significantly from
other data points.
Outliers in a dataset can be the result of measurement errors, data entry errors, or sampling
problems. For example, a height of 155 m in a dataset containing human heights is obviously an error,
most likely introduced during data entry.
Outliers can heavily influence statistical results, like the mean and standard deviation, resulting in
misleading interpretations. Therefore, an analyst should identify any outliers present in a dataset
and then decide what to do with them.
Outliers are best detected using the interquartile range (IQR) and visualizations, such as histograms.
A histogram is a diagram used to study the distribution of numerical data. It is a type of bar plot in
which every bar represents a class of data. The heights of the bars are the frequencies, the number
of occurrences for each category, of the data classes. The bars have equal widths and touch each
other. An example of a histogram is shown in Figure 3-19.
Figure 3-19
Typically, outliers will be found at the far left (extremely small values) or far right (extremely large
values) of the histogram.
The interquartile range measures the spread of the middle half of a dataset. It is calculated using
the formula IQR = Q3 – Q1, where Q3 is the third quartile (or upper quartile) and Q1 is the first
quartile (or lower quartile).
Quartiles are values that divide a dataset into four equal parts. As such, there are three quartiles,
namely, Q1, Q2, and Q3. Q2 is also the median of the data. An example of how quartiles can be plotted
in a box and whiskers plot is shown in Figure 3-20.
Figure 3-20
To find quartiles from a dataset, the dataset must be arranged in ascending order.
A data point is considered an outlier if it is less than Q1 – 1.5(IQR) or more than Q3 + 1.5(IQR).
Consider the following data containing the heights of 10 individuals in centimeters. One of the
measurements is an outlier.
155, 167, 300, 168, 188, 170, 180, 177, 165, 175
The methods outlined above help to determine whether the data contains an outlier.
Code implementation
heights <- c(155, 167, 300, 168, 188, 170, 180, 177, 165, 175)
hist(heights, breaks = 6)
Note: By setting breaks = 6, you are specifying that you want the histogram to
have 6 bins. The number of bins determines the granularity or level of detail in the
histogram.
Figure 3-21
3. Obtain threshold values for the outliers, i.e., Tmin and Tmax, denoting the
minimum and maximum thresholds, respectively (Tmin = Q1 – 1.5(IQR)
and Tmax = Q3 + 1.5(IQR)).
4. Outliers are all points greater than Tmax and all points less
than Tmin.

heights <- c(155, 167, 300, 168, 188, 170, 180, 177, 165, 175)
IQR(heights)
[1] 12
summary(heights)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  155.0   167.2   172.5   184.5   179.2   300.0
Run the following code to determine the minimum and maximum threshold:
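The code itself is missing from this excerpt. A minimal sketch that computes the thresholds from the
quartiles and filters the data (quantile() defaults give Q1 = 167.25 and Q3 = 179.25 for this data):

Q1 <- quantile(heights, 0.25)
Q3 <- quantile(heights, 0.75)
Tmin <- Q1 - 1.5 * IQR(heights)  # 149.25
Tmax <- Q3 + 1.5 * IQR(heights)  # 197.25
heights[heights < Tmin | heights > Tmax]  # points outside the thresholds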
[1] 300
From the R output, you can see that the outlier is 300.
You can use formulas or a histogram to detect outliers in Excel. Let's look at both of these techniques.
1. Enter the data in cells A2:A11 in a new Excel worksheet. Compute Q1 by entering the
formula =QUARTILE(A2:A11,1) in cell F1.
2. Compute Q3 by entering the formula =QUARTILE(A2:A11,3) in cell F2.
3. Compute the IQR by entering the formula =F2-F1 in cell F3.
4. Compute Tmin by entering the formula =F1-1.5*F3 in cell F4.
5. Compute Tmax by entering the formula =F2+1.5*F3 in cell F5.
6. To determine whether a data point is greater than Tmax or less than Tmin, enter the
formula =OR(A2>$F$5, A2<$F$4) in cell B2. Copy this formula to the remaining cells in column
B by double-clicking on the fill handle of the cell. A TRUE value should appear for all outliers in
the data, as shown in Figure 3-22.
Figure 3-22
OR
1. Select the data: Select the range of cells that contain your data.
2. Insert a histogram: Go to the Insert tab in the Excel ribbon and click on the Recommended
Charts or Insert Statistic Chart button (the exact location may vary depending on your
Excel version).
3. Choose the histogram: In the Recommended Charts or Insert Chart window, select
the Histogram option. It is typically found under the Column or Bar chart category.
4. Customize the histogram: After inserting the histogram, you can customize its appearance
and settings. Right-click on the chart and select Format Chart Area or use the various
formatting options available in the ribbon to modify the chart's appearance, labels, axes, etc.
5. Adjust the bin size: By default, Excel automatically determines the bin size for the histogram.
However, you can adjust the bin size to fit your needs. Right-click on the horizontal axis of
the histogram and select Format Axis to open the axis formatting options. In the Axis
Options panel, you can specify the bin width or number of bins under the Bounds or Axis
Options section.
Data drilling
Data drilling is a method of analyzing data by pulling out statistically interesting subsets or
subcategories. It involves an in-depth investigation of the underlying data to allow an analyst to
understand it better and enhance the decision-making process. One key component of data drilling
is granularity. Granularity is how separately and distinctly you view each data point. Sometimes
looking closer at data (more granular), you can notice features that aren’t apparent when looking at
the dataset as a whole. Sometimes looking at the dataset from a broader perspective (less granular)
allows you to see overall trends or patterns. For example, if a sales report shows a decline in sales,
an analyst can drill down into the report to find the products or departments that contributed to this
decline.
Drill down
Drill up
Data Drill Down
Data drill down starts from the calculated summary statistics and digs into the underlying data for
details. It helps the analyst shift from an overall view of the data to a more detailed and specific view.
Data Drill Up
Data drill up is the opposite of data drill down. It enables the data analyst to shift from a detailed
view of the data to a more overall view of the situation.
Data mining
Data mining is the process of extracting information from large datasets. It involves analyzing large
datasets to find odd entries, patterns, and correlations that can help solve business problems. For
example, businesses can use data mining techniques to develop customer profiles from customer data.
These profiles help businesses find their best customers and tailor marketing strategies to attract
others with similar behaviors.
Although data mining is often used interchangeably with machine learning, the two terms are
different. Machine learning involves using data and algorithms to develop methods that learn and
change in response to reinforcement or new data, similar to the way humans learn.
Anomaly Detection
This technique is also referred to as outlier analysis. It helps identify suspicious and rare events that
differ significantly from most of the data.
An anomaly is an event or item that does not follow the expected pattern. An example of an anomaly
is a spike in unsuccessful login attempts in an online banking system.
Clustering
This technique finds clusters or groupings of data points that are similar to one another in a dataset.
It aims to make groups that are similar on the inside of the cluster, while making sure the clusters
are as different as possible from one another. For example, business owners can use clustering to
identify distinct groups of customers and develop marketing strategies specific to each group.
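Although the lesson does not include a clustering exercise, a minimal k-means sketch in R illustrates
the idea on hypothetical customer data (all names and values here are invented for illustration):

set.seed(1)
# Hypothetical data: annual spend and monthly visits for 60 customers
customers <- data.frame(
  spend  = c(rnorm(30, 100, 10), rnorm(30, 300, 20)),
  visits = c(rnorm(30, 2, 0.5), rnorm(30, 8, 1))
)
km <- kmeans(scale(customers), centers = 2)  # scale first so both variables weigh equally
table(km$cluster)  # sizes of the two customer groups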
The primary goal of data analysis is to get fair and honest information from data. These insights
should be supported by the data and have practical value. In other words, analysts should be able to
use the results of their analysis to make recommendations that can cause real change in an
organization. Exploratory data analysis helps uncover patterns in data, and provides the context
needed for these patterns. Exploratory data analysis identifies the most important variables in your
dataset. Analysts can then perform more in-depth tests to obtain detailed predictions and insights
from the data.
Scatter plots and correlations are methods used to explore the relationship between two numerical
variables. If a scatter plot and a correlation coefficient reveal that the relationship between two
variables is linear and strong, this relationship can be explored further using techniques such as
linear regression.
Simple linear regression is a statistical method used to study the relationship between two
continuous variables.
The variable that is changing in response to another factor is called the response variable.
The variable used to predict the response is called the explanatory or predictor variable.
For example, if a dataset is comparing the height of basketball players and their shooting percentage,
the predictor variable would be the player height and the response variable would be their shooting
percentage.
Typically, data points do not exactly fit a straight line. One method for finding the line that best fits
the data as a whole is called least squares regression. This method is similar to the variance
calculations from earlier, in that it compares how far each data point is from a line and finds the line
that makes these distances the smallest possible overall. A regression line is a line that best fits the
data.
Simple linear regression using a real dataset in R
Earlier, the marketing dataset from the datarium package was used in R to explore relationships in
data. A strong linear relationship was found to exist between the variables YouTube and sales. In this
section, the relationship between these two variables will be explored more using simple linear
regression.
Load the marketing dataset into a variable in R using the following code:
require(datarium)
md <- marketing
Use the function lm() to perform linear regression with the variables sales and YouTube. In this
case, sales is the response variable, and YouTube is the explanatory variable.
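The model-fitting line does not appear in this excerpt; given the variable roles above and the
summary(model) call that follows, the standard form is:

model <- lm(sales ~ youtube, data = md)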
summary(model)
[summary(model) output abbreviated: the Call, Residuals, and Coefficients sections, with significance
codes; the key values used below are an intercept estimate of 8.439 and a youtube slope estimate of
0.048, with a residual standard error of 3.91 and an R-squared of 0.6119]
Interpretation Of R Output
o The estimated regression line equation is sales = 8.439 + 0.048 · youtube. The value
of the y-intercept is 8.439 (to 3 decimal places) and the slope is 0.048 (to 3 decimal
places). Note: the coefficient labeled as "Estimate" under the "Coefficients" section
represents the slope of the YouTube estimated regression line.
o Alternatively, compute the slope of the regression line using the formula
b = r(s_sales / s_youtube) (to 3 decimal places), where r is the correlation coefficient,
s_sales is the standard deviation of the dependent variable (sales), and s_youtube is the
standard deviation of the independent variable (YouTube). Calculate this value in R using the
following code:
o cor(md$youtube,md$sales)*(sd(md$sales)/sd(md$youtube))
[1] 0.04753664
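The code for the intercept is not shown in this excerpt; based on the output below and the rounding
note that follows, it was presumably computed from the rounded slope as:

mean(md$sales) - 0.048 * mean(md$youtube)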
[1] 8.357352
Note: The slope and y-intercept calculated here are slightly different from the results of the R linear
regression due to rounding the slope to two significant figures.
o When the YouTube advertising budget is 0, sales are expected to amount to 8.439 units, or
8,439 dollars (recall that sales and YouTube units are in thousands of dollars).
o A 1 unit increase in the YouTube budget should result in a 0.048 unit increase in sales.
As sales and YouTube units are given in thousands of dollars, it means that a 1000
dollar increase in the YouTube budget should result in a 48-dollar increase in sales.
Import the marketing.csv dataset in the folder DATA to a new Excel worksheet.
Figure 3-23
The table in Figure 3-24 provides statistical measures that indicate how well the model fits the data.
Figure 3-24
R-square is a statistical measure that explains how much of the variance in the response variable
(sales) is explained by the explanatory variable (YouTube). Often, the larger the value of R-square,
the better the regression model fits your observations. The R-squared value of 0.6119 indicates that
the YouTube predictor accounts for approximately 61% of the variance in sales. The Multiple R is the
correlation coefficient that we computed earlier.
The standard error of the regression summarizes how far the observed values fall from the
regression line. In this example, the standard error is 3.91. A low value is better, as it indicates
that the distances between the data points and the fitted values are small.
An analyst can show the relationship between sales and YouTube using a linear regression chart.
To create this chart, first, create a scatter plot of sales against YouTube using the method from
earlier in the lesson.
Now, draw the least squares regression line. Right click on any point and select Add Trendline from
the context menu.
On the right pane that appears, select Linear under Trendline Options and check Display Equation on
Chart.
Use the Fill & Line tab to customize the line, e.g., change the line to a solid line, the color of the line
to red, and the dash type to unbroken line. Figure 3-27 shows the resulting linear regression chart.
Figure 3-27
You will see that the regression line in Figure 3-27 has the same coefficients as those obtained from
the R and Excel regression outputs.
Simple linear regression helps you to determine the relationship between a response and an
explanatory variable. A simple linear regression model can also be used to predict the values of new
observations.
In this section, you will use the marketing dataset from the datarium package in R. In the previous
section, you developed a simple linear model using the variables YouTube and sales. In this section,
you will use this model to answer the following question:
What do you predict the sales will be when YouTube is (1) 200 and (2) 340?
Code implementation
1. Load the marketing dataset.
2. Perform simple linear regression using the variables sales and YouTube.
3. Use the function predict() to predict sales when YouTube is (i) 200 and (ii) 340:

require(datarium)
md <- marketing
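The fitting and prediction calls themselves are missing from this excerpt; a sketch consistent with
the output below:

model <- lm(sales ~ youtube, data = md)
predict(model, newdata = data.frame(youtube = c(200, 340)))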
1 2
17.94644 24.60157
Excel
1. Import marketing.csv to a new worksheet. The sales data are in the range
D2:D201, and YouTube data are in the range A2:A201.
2. Create a column called newYT (cell H1), denoting new YouTube values.
3. Enter the values 200 and 340 in the column newYT (cells H2 and H3).
5. Use the FORECAST function to predict the sales for the two new YouTube values.
Enter the formulas:
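The formulas themselves are not shown in this excerpt. A likely form, assuming the ranges above and
(hypothetically) placing the results in cells I2 and I3:

In cell I2: =FORECAST(H2, $D$2:$D$201, $A$2:$A$201)
In cell I3: =FORECAST(H3, $D$2:$D$201, $A$2:$A$201)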
From the analysis above, you can see that sales are predicted to be 17.964 and 24.602
when YouTube spend is 200 and 340, respectively. All values are in thousands of dollars.
Artificial intelligence (AI) studies how to make machines perform tasks commonly associated with
human beings. It looks at how the human brain works and how human beings learn and make decisions
when solving problems. AI then uses these results to develop systems that are adaptive and can learn
progressively, much like humans.
Today, the volume, velocity, and variety of data generated in the world are massive. In other words,
the data is too big, moves too fast, and data sources are numerous. These characteristics of big data
require artificially intelligent systems that can help human beings in handling data.
Artificial intelligence (AI) is a broad field of science concerned with building machines that can
perform tasks requiring human intelligence. It aims to design machines with human-like intelligence.
An algorithm is a step-by-step procedure used for solving a problem. Algorithms are crucial
components of AI systems. These systems use algorithms to perform calculations, process data,
analyze data, detect anomalies in data, etc.
Machine learning is a branch of AI that aims to develop systems that can learn from the data they
receive. Machine learning algorithms use sample data to build models that can perform laid-out tasks.
For example, models can help organizations to predict possible outcomes based on historical data. A
simple example of a machine learning algorithm is a simple linear regression model.
Deep learning is a branch of AI that uses algorithms called artificial neural networks to learn from
data. Artificial neural networks are designed to think and learn like humans. Examples of systems
based on deep learning are self-driving cars, virtual assistants, and facial recognition.
Machine learning and deep learning are two well-known subsets of AI.
AI is typically used to analyze big data to find patterns that can be used to derive insights to improve
work processes.
Big data can be defined as data that is too large in scope for desktop software or calculators to
process and analyze. The three features of big data are volume, velocity, and variety.
Big data and AI complement each other. AI requires considerable data to learn, and big data analytics
requires AI technologies for efficient data analysis.
AI technologies can assist analysts in tasks such as:
Solving common data problems, e.g., detecting outliers and missing values, de-duplicating
data, or reducing dimensions of data
Performing various types of data analyses, from simple descriptive and diagnostic analyses to
complex predictive and prescriptive analyses
Summary
Data analysis is the process of collecting, cleaning, transforming, and processing data to
yield information that can be useful in decision-making.
The main methods for data analysis are descriptive analysis, diagnostic analysis, predictive
analysis, prescriptive analysis, and hypothesis testing.
Predictive analysis uses current and historical data to answer the question “What might
happen in the future?”.
Hypothesis testing is a method of data analysis that uses data from a sample to draw
conclusions about the overall population.
Data aggregation is the process of collecting data from multiple sources and working it into a
summarized form.
Data interpretation is how an analyst attaches meaning to processed and analyzed data.
Correlation is a statistical measure that explains how much two variables are linearly related.
Cross tabulation is a method used to analyze the relationship between two or more
categorical variables.
Data drilling is a method of analyzing data by providing different perspectives of the data in
reports or spreadsheets.
Simple linear regression is a statistical method used to study the relationship between two
numerical variables.
Artificial intelligence works to make machines perform tasks commonly associated with
human thinking and decision-making abilities.
Machine learning focuses on developing systems that learn from the data they receive.