
Unitwise Imp Notes

Data analytics.

Uploaded by

Nikin Cheruvelil

UNIT 1- INTRODUCTION TO DATA ANALYTICS

1. Define data. List the sources of data in day-to-day activities

Data refers to raw facts, figures and statistics that are collected and stored for reference.
Data can take various forms like text, numbers, images, audio and video.
Different sources of data are:
a. Social media data
b. Ecommerce transaction data
c. IoT device data
d. Healthcare records
e. Financial transactions
f. Web traffic & search queries
g. Mobile devices
h. Sensors & surveillance systems
2. Explain the types of data with examples
The types are:
I. Qualitative or Categorical Data
a. Nominal Data
Nominal data is a type of qualitative information that labels variables
without assigning a numerical value. It is also called the nominal scale.
It cannot be ordered or measured.
Example: Eye colour or blood group
b. Ordinal Data
Ordinal data is a type of data that follows a natural order. The
significant feature of ordinal data is that the difference between the data
values is not determined.
Example: Customer satisfaction ratings (poor, average, good)
II. Quantitative or Numerical Data
Quantitative data is also known as numerical data which represents the numerical
value (i.e., how much, how often, how many).
a. Discrete Data
Discrete data can take only distinct, separate values, typically counts.
Example: Number of students in the class
b. Continuous Data
Continuous data is data that can be measured. It has an infinite number of
possible values within a given range.
Example: Temperature range
III. Time series data
Time-series data or temporal data is a sequence of data points collected over time
intervals, allowing us to track changes over time. Time-series data can track
changes over milliseconds, days, or even years.
IV. Spatial data
Spatial data can be referred to as geographic data or geospatial data. Spatial data
provides the information that identifies the location of features and boundaries on
Earth.
V. Textual data
Textual data is information that is stored and written in a text format. It can be
anything from emails to blog posts to social media posts and online forum
comments. In short, it's any data that has been expressed in words.
VI. Multivariate data
Multivariate data is a dataset where each observation or sample point contains
multiple variables or features. These variables can represent different aspects,
characteristics, or measurements related to the observed phenomenon.
VII. Structured data
Structured data is data that has been organized into a formatted repository,
typically a database.
VIII. Unstructured data
Unstructured data is information that isn't stored in a specific format. It can
contain images, audio, or documents.
3. Define big data
Big data is data that contains greater variety, arriving in increasing volumes and with
more velocity. These three properties are known as the "three Vs":
Volume: The amount of information
Velocity: The speed at which the information is created and collected
Variety: The scope of the data points being covered
4. Define data analytics
Data analytics is a method to investigate, analyse, and present data in order to
find useful information and support decisions. It involves extraction, cleaning,
transformation, analysis, modelling and visualization of data with the objective of extracting
useful information from which conclusions can be drawn and decisions made.
Some examples of Data Analytics in various fields:
i. Game companies can use data analytics to recommend new games to players based on
their past gaming behaviour. This can help to increase player engagement and retention.
ii. Data analytics can be used to balance game mechanics and difficulty levels to ensure
that the game is fun and challenging for all kinds of players.
iii. Game companies can use data analytics to detect and prevent fraud, such as cheating
and account hacking.
5. Briefly explain the evolution of data analytics
The evolution of data analytics can be broadly divided into four eras:
1. Era 1 (1960s to 1980s): This era was dominated by early data processing technologies,
such as punch cards and mainframe computers. Data analytics was largely limited to
descriptive analytics, which involved using simple statistical techniques to analyze
historical data.
2. Era 2 (1990s to early 2000s): The rise of relational databases and business intelligence
tools made it possible to analyse larger and more complex datasets. This led to the
development of more sophisticated data analytics techniques, such as diagnostic and
predictive analytics.
3. Era 3 (mid-2000s to early 2010s): The emergence of big data and cloud computing
concepts made it possible to analyse unprecedented volumes of data. This led to the
development of new data analytics techniques, such as machine learning and deep
learning.
4. Era 4 (present day): Data analytics is now becoming increasingly pervasive and
accessible. AI-powered data analytics tools are enabling businesses of all sizes to extract
insights from their data and make better decisions.
6. Explain the key features / importance/benefits of data analytics
1. Improved decision-making: Data analytics can help businesses make better decisions
by providing them with insights into their data. For example, a company can use data
analytics to identify which marketing campaigns are most effective or which products are
most popular with customers. This information can then be used to make better
decisions about how to allocate resources and improve business operations.
2. Improved efficiency: Data analytics can help businesses automate tasks and streamline
processes. For example, a company can use data analytics to automate customer service
tasks or to optimize production schedules. This can free up employees to focus on more
strategic initiatives.
3. Reduced costs: Data analytics can help businesses identify and reduce costs. For
example, a company can use data analytics to identify areas where they are wasting
money or to identify opportunities to negotiate better deals with suppliers.
4. Improved customer satisfaction: Data analytics can help businesses improve
customer satisfaction by providing them with deeper insights into their customers' needs and
preferences. For example, a company can use data analytics to identify which products or
services are most popular with customers or to identify areas where they can improve
the customer experience.
5. New product development: Data analytics can help businesses to develop new
products and services that meet the needs of their customers. For example, a technology company
can use data analytics to identify which features are most important to their customers
and to prioritize the development of new features. This information can then be used to
develop new products and services that are more likely to be successful.
7. Explain the steps involved in data analytics
Data analytics is the process of collecting, cleaning, and analyzing data to extract
meaningful insights. It is a broad field that encompasses a variety of techniques and
tools, and it is used in a wide range of industries. The data analytics process can be
broadly divided into the following steps:
i. Data collection: The first step is to collect the data that will be analysed. This data can
come from a variety of sources, such as internal databases, customer surveys, and social
media.
ii. Data cleaning: Once the data has been collected, it needs to be cleaned to remove any
errors or inconsistencies. This may involve correcting typos, filling in missing values,
and removing outliers.
iii. Data preparation: Once the data has been cleaned, it needs to be prepared for analysis.
This may involve converting the data to a different format or splitting the data into
different subsets.
iv. Data analysis: This is the step where the data is actually analysed to extract
meaningful insights. This can be done using a variety of statistical and machine learning
techniques.
v. Data visualization: Once the data has been analysed, the insights need to be
communicated to the relevant people in a clear and concise way. This can be done using
data visualization tools to create charts, graphs, and other visuals.
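The five steps above can be sketched as a small pipeline. This is only an illustrative sketch using the Python standard library: the records, the function names and the choice to drop (rather than fill in) invalid values are all assumptions made for the example.

```python
import statistics

# Step i. Data collection: hypothetical raw survey records (made-up values).
raw_records = [
    {"customer": "A", "spend": "120.5"},
    {"customer": "B", "spend": ""},       # missing value
    {"customer": "C", "spend": "98.0"},
    {"customer": "D", "spend": "n/a"},    # invalid entry
    {"customer": "E", "spend": "143.5"},
]

def clean(records):
    """Step ii. Data cleaning: drop rows whose 'spend' is missing or not numeric."""
    cleaned = []
    for row in records:
        try:
            cleaned.append({"customer": row["customer"], "spend": float(row["spend"])})
        except ValueError:
            continue  # assumption: invalid rows are discarded, not repaired
    return cleaned

def prepare(records):
    """Step iii. Data preparation: keep only the numeric column needed for analysis."""
    return [row["spend"] for row in records]

def analyse(values):
    """Step iv. Data analysis: a simple descriptive summary."""
    return {"count": len(values), "mean": statistics.mean(values), "max": max(values)}

def visualise(records):
    """Step v. Data visualization: a text bar chart, one '#' per 10 units of spend."""
    return [f"{row['customer']}: {'#' * int(row['spend'] // 10)}" for row in records]

cleaned = clean(raw_records)
summary = analyse(prepare(cleaned))
chart = visualise(cleaned)
print(summary)
print("\n".join(chart))
```

In practice each step would use dedicated tools (databases, spreadsheets, charting libraries); the point here is only the order of the steps.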
8. Differentiate between data analytics & data science
i. Data Science: Deals with exploration and new innovations.
   Data Analytics: Makes use of existing resources.
ii. Data Science: A multidisciplinary field including data engineering, computer science, statistics, machine learning, and predictive analytics, in addition to presentation of findings.
   Data Analytics: A broad field which includes data integration, data analysis and data presentation.
iii. Data Science: Produces both broad insights by exploring the data and actionable insights (building data models) that answer specific questions.
   Data Analytics: More focused on producing insights that answer specific questions and can be put into action.
iv. Data Science: Data scientists prepare, manage and explore large data sets and then develop custom analytical models and algorithms to produce the required business insights.
   Data Analytics: Data analysts prepare, manage and analyse well-defined datasets to identify trends and create visual presentations to help organizations make better, data-driven decisions.
v. Data Science: Python is the most commonly used language, along with other languages such as C++, Java, Perl, etc.
   Data Analytics: Knowledge of Python and the R language is essential.
9. Write the differences between data analytics & data mining
i. Data Mining: A process of extracting useful information, patterns, and trends from raw data.
   Data Analytics: A method used to investigate, analyse, and present data to find useful information and support decisions.
ii. Data Mining: The output gives the data pattern.
   Data Analytics: The output is a verified hypothesis or insights based on the data.
iii. Data Mining: Includes the intersection of databases, machine learning, and statistics.
   Data Analytics: Requires expertise in computer science, mathematics, statistics and AI.
iv. Data Mining: Also known as Knowledge Discovery in Databases (KDD).
   Data Analytics: Also known as a Data-Driven Decision Making strategy.
v. Data Mining: The data sets are generally large and structured.
   Data Analytics: The dataset can be large, medium or small, and structured, semi-structured or unstructured.
vi. Data Mining: Generally does not require visualization.
   Data Analytics: Requires data visualization.
vii. Data Mining: The prime goal is to make data usable.
   Data Analytics: The goal is to make data-driven decisions.

10. What are the differences between data analytics & business analytics

11. Explain the different types of data analytics


There are four main types of data analytics:
1. Descriptive analytics:
Descriptive analytics is the simplest type of data analysis and the foundation the
other types are built on. It pulls trends from raw data (for example, classifying
customers into groups based on their product-choice patterns) to describe what happened
or is currently happening. It mines historical data to understand past successes
and failures. Hence descriptive analytics deals with what happened in the past or is
happening currently. Management reports (sales, marketing, operations, finance),
data queries, data dashboards and descriptive statistics commonly use this kind of
analysis.
Examples:
i. A retail company uses descriptive analytics to track sales data and identify which
products are selling well and which products are not.
ii. By tracking the cases and deaths in a COVID-19 dataset, descriptive analysis can
identify the infected population of a country.
2. Diagnostic analytics: Diagnostic analytics takes descriptive analytics a step further
by trying to understand why something happened. It uses a variety of statistical
techniques to identify patterns, relationships and dependencies in the data of a
particular problem. Hence diagnostic analytics deals with why it happened in the
past.
Examples:
i. A marketing company uses diagnostic analytics to identify which marketing
campaigns or promotions are most effective at driving sales (for example, a particular
promotion month or a theme tied to a region).
ii. A footwear company uses diagnostic analytics to find out why April has the
highest sales. It finds that children's beach footwear has the highest reviews
because April is a vacation month for children.
3. Predictive analytics: Predictive analytics uses historical data to predict future
outcomes. It uses a variety of machine learning techniques to develop models that can
predict things like customer churn, product demand, and fraud risk. Hence we say
predictive analytics deals with what will happen in the future.
Examples:
i. An e-commerce company uses predictive analytics to recommend products to
customers based on their past purchase history.
ii. In elections, predictive analytics is used to predict the winning candidate (this requires
historical polling data and current polling data).
4. Prescriptive analytics: Prescriptive analytics takes predictive analytics a step further
by recommending actions that can be taken to improve outcomes. It uses optimization
techniques to identify the best way to achieve a desired outcome. Hence we say
prescriptive analytics deals with how we can make it happen.
Examples:
i. A retailer uses prescriptive analytics to optimize its inventory levels and pricing
strategies.
ii. A manufacturing company uses prescriptive analytics to optimize its production
processes and supply chain management.

12. What are the different tools used for data analytics? Explain
Tools used in Descriptive Analytics
i. Statistical Summary: It provides statistical descriptions for a given business metric, e.g.
mean, median, standard deviation, percentile, interquartile range, etc.
ii. Z-Score: The Z-score tells us how far (in terms of standard deviations) a particular value
of x is from its mean.
iii. Coefficient of Variation: It is the ratio of the standard deviation to the mean.
iv. Interquartile Range: It is an important measure to gauge the variation in the dataset.
v. Dashboard: A tool used to track, organise, visualize and analyse data. Its overall purpose is
to make it easier for data analysts, decision makers and average users to understand their
data, gain deeper insights and make better data-driven decisions.
vi. Descriptive Statistics: Includes the central tendency, variability, and frequency
distribution of the dataset. The frequency distribution records how often data occurs,
central tendency records the centre point of the data's distribution, and variability
records its degree of dispersion.
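The descriptive measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch: the sales figures are made up, and `z_score` is a helper name introduced only for this example.

```python
import statistics

# Hypothetical monthly sales figures (made-up values for illustration).
sales = [12, 15, 11, 18, 20, 15, 14]

mean = statistics.mean(sales)
median = statistics.median(sales)
std_dev = statistics.stdev(sales)   # sample standard deviation

def z_score(x, mean, std_dev):
    """How far x lies from the mean, measured in standard deviations."""
    return (x - mean) / std_dev

# Coefficient of variation: standard deviation divided by the mean.
coefficient_of_variation = std_dev / mean

# Interquartile range: third quartile minus first quartile.
q1, q2, q3 = statistics.quantiles(sales, n=4)
iqr = q3 - q1

print(mean, median, round(std_dev, 3), round(coefficient_of_variation, 3), iqr)
print(round(z_score(20, mean, std_dev), 3))
```

Note that `statistics.quantiles` supports more than one interpolation method, so other tools may report slightly different quartiles for the same data.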

Tools used in Diagnostic Analytics


i. Correlation Analysis: It is a statistical measure that indicates the strength of the
relationship between two variables.
ii. 5 Whys Analysis: It is a very structured approach where we try to dig into a
problem and peel it layer by layer to reach the root cause of the problem.
iii. Cause and Effect Analysis: Here, we identify all possible reasons for one problem,
then we treat each reason as a problem in turn and try to find other causes
for that problem.
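As an illustration of correlation analysis, the sketch below computes the Pearson correlation coefficient from its definition. The advertising and sales figures are hypothetical, and `pearson_correlation` is a name chosen for this example.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: advertising spend and resulting sales.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 103]

r = pearson_correlation(ad_spend, sales)
print(round(r, 4))  # close to +1 means a strong positive linear relationship
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 a weak one. Correlation alone does not show that one variable causes the other.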

Tools used in Predictive Analytics

i. Regression Analysis: It establishes the mathematical relationship between input variables and output
variables, which means we can calculate the future value of the output for any given
input, e.g. the sales forecast for next month.
ii. Logistic Regression: It is a classification technique that can
predict the output class for any given set of inputs. E.g. given customer
demographics, logistic regression can indicate whether the customer will default on a bank
loan in the future or not.
iii. Decision Tree: Most of the time, we use a decision tree as a classification
technique; it tells us the probability of each output class for various
combinations of our input variables. It can also be used for continuous output
variables.
iv. Clustering Techniques: These techniques segregate our customers into a few
logical segments so that we can create tailored offers for different types of customers
as per their needs and interests.
v. Random Forest: It is another very popular technique that uses an
ensemble approach, solving the problem by generating a large number of
predictive models (decision trees) and combining their results. Its accuracy is
generally better than that of a single tree.
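To make the regression idea concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares and used for a next-month sales forecast. The data and the name `fit_line` are assumptions for this example; real projects would normally use a statistics or machine learning library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [110, 120, 130, 140, 150]

slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept
print(slope, intercept, forecast_month_6)
```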

Tools used in Prescriptive Analytics


i. Linear Programming: In linear programming, we optimize an objective function
like revenue, market share or customer feedback ratings while keeping constraints
like budget, number of people deployed, etc. in the model as linear functions.
ii. Analytical Hierarchy Process: We apply this technique in scenarios where we
have to identify the best solution among various available options against a list
of selection criteria, e.g. selecting the best cloud service provider among the top 5
organizations by taking into consideration multiple factors like budget, customer
service, flexibility to upgrade, backup services, maintenance cost, etc.
iii. Combinatorial Optimization: It involves identifying the optimal solution from a
large but finite number of candidate solutions, e.g. the travelling salesman problem, vehicle
routing problem, etc.
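A tiny instance of the travelling salesman problem shows what "choosing the best of a finite number of solutions" means. The sketch below simply tries every possible tour, which is only feasible for a handful of cities; the distance matrix is made up and the function names are chosen for this example.

```python
from itertools import permutations

# Hypothetical symmetric distance matrix between four cities (made-up values).
distances = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]

def tour_length(tour, distances):
    """Total length of a round trip that returns to the starting city."""
    return sum(distances[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def shortest_tour(distances):
    """Brute force: start at city 0 and try every ordering of the other cities."""
    other_cities = range(1, len(distances))
    candidates = ((0,) + p for p in permutations(other_cities))
    best = min(candidates, key=lambda t: tour_length(t, distances))
    return best, tour_length(best, distances)

best_tour, best_length = shortest_tour(distances)
print(best_tour, best_length)
```

The number of tours grows factorially with the number of cities, which is why practical solvers rely on heuristics or specialised optimization algorithms instead of enumeration.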
13.Write the importance & benefits of data analytics
Refer to the answer to question 6.
14. Briefly explain text analytics in detail
Text analytics is a type of data analytics that focuses on extracting insights from
unstructured text data. Unstructured text data can come from a variety of sources, such as
social media posts, customer reviews, and product descriptions.
Text analytics can be used to:
i. Understand customer sentiment: Text analytics can be used to identify the overall
sentiment of customer feedback, as well as the specific topics that customers are most
concerned about.
ii. Identify emerging trends: Text analytics can be used to identify emerging trends in the
market, such as new products or services that customers are interested in.
iii. Improve customer service: Text analytics can be used to identify customer support
issues and to develop targeted solutions.
iv. Improve marketing campaigns: Text analytics
can be used to improve the effectiveness of marketing campaigns by identifying the
keywords and phrases that are most likely to resonate with customers.
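A very simple form of text analytics can be sketched with word counting and a keyword-based sentiment score. The word lists and reviews below are invented for illustration, and a lexicon this small is an assumption of the example; real sentiment analysis uses far richer models.

```python
import re
from collections import Counter

# Hypothetical sentiment lexicons (assumed for this example only).
POSITIVE = {"great", "love", "fast", "good"}
NEGATIVE = {"slow", "bad", "broken", "poor"}

def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Positive word count minus negative word count."""
    words = tokenize(text)
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "Great phone, love the fast delivery",
    "Battery is bad and the screen arrived broken",
    "Good value but delivery was slow",
]

scores = [sentiment_score(r) for r in reviews]
word_counts = Counter(w for r in reviews for w in tokenize(r))

print(scores)
print(word_counts.most_common(3))
```

The scores point to overall sentiment per review, while the word counts hint at the topics customers mention most often.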
15. Explain web analytics in detail
Web analytics is a type of data analytics that focuses on extracting insights from website
data. Website data can include things like page views, visitor demographics, and traffic
sources.
Web analytics can be used to:
i. Understand website traffic: Web analytics can be used to identify which pages are most
visited, which pages are leading to conversions (turning targeted visitors into actual
customers), and where visitors are coming from.
ii. Improve website performance: Web analytics can be used to identify areas where the
website can be improved, such as pages that are loading slowly or pages that have a high
bounce rate.
iii. Optimize marketing campaigns: Web analytics can be used to optimize marketing
campaigns by tracking the performance of different campaigns and identifying the
campaigns that are driving the most traffic to the website.
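The web metrics mentioned above can be computed from a simple session log. In this sketch the log is made up, a "bounce" is assumed to mean a session with a single page view, and a "conversion" is assumed to mean a session that reaches a hypothetical `/checkout` page.

```python
from collections import Counter

# Hypothetical session log: each session is the list of pages one visitor viewed.
sessions = [
    ["/home", "/products", "/checkout"],
    ["/home"],
    ["/blog"],
    ["/home", "/products"],
]

# Most visited pages.
page_views = Counter(page for session in sessions for page in session)

# Bounce rate: share of sessions that viewed only one page (assumed definition).
bounce_rate = sum(len(s) == 1 for s in sessions) / len(sessions)

# Conversion rate: share of sessions that reached the checkout page (assumed goal).
conversion_rate = sum("/checkout" in s for s in sessions) / len(sessions)

print(page_views.most_common(2), bounce_rate, conversion_rate)
```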
16. What are the skills required for a business analyst?
Technical skills: Business analysts need to have strong technical skills, including
knowledge of statistical analysis, machine learning, and data visualization tools.
Problem-solving skills: Business analysts need to be able to identify and solve complex
business problems.
Communication skills: Business analysts need to be able to communicate their findings to
both technical and non-technical audiences.
Business knowledge: Business analysts need to have a good understanding of business
principles and practices.
17.Explain the different applications of analytics in business
Healthcare Analytics
Improve patient outcomes: By tracking the effectiveness of new treatment methods,
healthcare providers can improve their treatment plans.
Improve efficiency: Businesses can use data to predict the busiest times for patient intake,
schedule their staff and manage medical supplies accordingly.
Improve the claims filing process: Insurance claims analysis can streamline the filing
process and help providers with fraud detection.
Financial Analytics
Financial analytics is used to predict financial trends and identify investment
opportunities. Experts in financial analytics might work with fund managers or as
corporate accountants.
Retail Analytics
Business owners use retail analytics to predict demand, measure customer loyalty and
even optimize their store layout. This can be helpful in improving decision-making and
reducing risk.
E-Commerce Analytics
E-commerce analytics is similar to retail analytics, but it's specifically geared toward
online businesses.

• Returning visitors: The number of visitors who return to your site after an initial visit
may indicate whether your web design and marketing efforts are effective.
• Customer lifetime value: This measures the value you'll gain from a repeat customer
and can help you decide whether you should prioritize new customer acquisition or
retention.
Marketing Analytics
Marketing analytics helps business owners gain insight into their customers' preferences
and track the effectiveness of their marketing campaigns.
18. What is NLP?
Natural Language Processing (NLP) is a subfield of artificial intelligence that studies the
interaction between computers and human languages. The goals of NLP are to find new methods of
communication between humans and computers, and to understand human speech as it is
spoken.
UNIT 2- PROBABILITY & STATISTICAL METHODS
1. Define sample space
A sample space is a collection or a set of possible outcomes of a random
experiment. The sample space is represented using the symbol, “S”.
Ex: For rolling a die, we will get the sample space, S as {1, 2, 3, 4, 5, 6 }
2. Define probability?
The probability of an event is a measure of how likely the event is to occur
when we run the experiment. Mathematically, probability is a function on
the collection of events that satisfies certain axioms.
3. What is event? Explain the types
A collection of some outcomes of an experiment is called an event.
Probability measures how likely a random event is to occur and is applied
in various fields.
The probability of an event E is defined as
P(E) = [Number of favorable outcomes of E]/[Total number of possible
outcomes].
The different types of events in probability are:
1. Sure event
2. Impossible event
3. Independent event
4. Dependent event
5. Mutually exclusive event
6. Complementary event
7. Compound event
8. Exhaustive event
9. Simple event

Sure Event

It is an event that always occurs when an experiment is conducted. For example,
getting a head or a tail when a coin is tossed. The probability of a sure event is 1.

Example:

The probability of an event that has all outcomes of the experiment, i.e., the sample
space, is 1.
Impossible Event

If the probability of occurrence of an event is zero, then it is an impossible event.

Example: The event of getting 7 when a die is thrown is impossible. This is
because the outcomes of throwing a die include only {1, 2, 3, 4, 5, 6}.

Independent Event

When the outcome of the first event does not influence the outcome of the second
event, those events are known as independent events.

Example: The event of getting a tail after tossing a coin and the event of getting a
head when tossing another coin.

Dependent Event

When the outcome of the first event influences the outcome of the second event,
those events are called dependent events.

Example: If we draw two colored marbles from a bag and the first marble is not
replaced before we draw the second marble, then the outcome of the second draw
will depend on the outcome of the first draw.

Mutually Exclusive Event

These events cannot occur at the same time.

Example: The events of getting head and tail are mutually exclusive while tossing
a coin.

Complementary Event

For any event A, another event, A', shows the remaining elements of the sample
space S. A' = S – A.

Example: Suppose the set of the first 10 natural numbers is a sample space, S = {1,
2, 3, 4, 5, 6, 7, 8, 9, 10} and A is the event of choosing an even number less than
10. So, A = {2, 4, 6, 8}

Thus, A' = S – A = {1, 3, 5, 7, 9, 10}


Compound Event

If an event has more than one sample point, it is termed a compound event.

Example: If S = {1, 2, 3, 4, 5, 6} such that E1 = {1, 3, 6}, E2 = {2, 6}. Thus, E1
and E2 represent compound events.

Exhaustive Event

The events E1, E2, ..., En are exhaustive if E1 ⋃ E2 ⋃ ... ⋃ En = S, where S is
the sample space.

Example: Suppose E1 is the event of getting an even number and E2 is the event
of getting an odd number when throwing a die.

Here, E1 = {2, 4, 6}, E2 = {1, 3, 5}

E1 ⋃ E2 = {1, 2, 3, 4, 5, 6} = S (sample space)

Thus, E1 and E2 are exhaustive events.

Simple event

An event that has a single point of the sample space is known as a simple event in
probability.

Example: If S = {1, 2, 3, 4} and E = {3} then E is a simple event.
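The event types above map naturally onto set operations. The sketch below uses the die-rolling sample space and the classical formula P(E) = favourable outcomes / total outcomes, which assumes all outcomes are equally likely; the variable and function names are chosen for this example.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a fair die

def probability(event, sample_space=S):
    """P(E) = favourable outcomes / total outcomes (equally likely outcomes assumed)."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
odd = {1, 3, 5}

is_sure = probability(S) == 1                 # sure event: the whole sample space
is_impossible = probability({7}) == 0         # impossible event: 7 is not in S
mutually_exclusive = even.isdisjoint(odd)     # no outcome belongs to both events
exhaustive = (even | odd) == S                # together they cover the sample space
complement_of_even = S - even                 # A' = S - A
is_simple = len({3}) == 1                     # simple event: a single sample point

print(probability(even), complement_of_even)
```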

4. Define Bayes' theorem

Bayes' theorem is one of the most important concepts in analytics since
several problems are solved using Bayesian statistics. Consider two events A
and B. We can write the following two conditional probabilities:

P(A|B) = P(A ∩ B) / P(B)
P(B|A) = P(A ∩ B) / P(A)

Using the two equations, we can show that

P(B|A) = P(A|B) × P(B) / P(A)

Bayes' theorem helps data scientists to update the probability of an event
(B) when any additional information is provided.
The following terminologies are used to describe the various components:
1. P(B) is called the prior probability (estimate of the probability without any
additional information).
2. P(B|A) is called the posterior probability (that is, given that the event A
has occurred, what is the probability of occurrence of event B). That is, post
the additional information (or additional evidence) that A has occurred, what
is estimated probability of occurrence of B.
3. P(A|B) is called the likelihood of observing evidence A if B is true.
4. P(A) is the prior probability of A.
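A short numeric sketch shows how the prior is updated into a posterior. The scenario (a screening test) and all probabilities are hypothetical, and P(A) is obtained from the law of total probability, which is an addition not spelled out in the notes above.

```python
def posterior(prior_b, p_a_given_b, p_a_given_not_b):
    """Return P(B|A) from the prior P(B) and the likelihoods P(A|B), P(A|not B)."""
    # P(A) = P(A|B) P(B) + P(A|not B) P(not B)   (law of total probability)
    p_a = p_a_given_b * prior_b + p_a_given_not_b * (1 - prior_b)
    # Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
    return p_a_given_b * prior_b / p_a

# Hypothetical numbers: B = "person has the condition", A = "test is positive".
p_b_given_a = posterior(prior_b=0.01, p_a_given_b=0.95, p_a_given_not_b=0.05)
print(round(p_b_given_a, 4))
```

With these made-up numbers the posterior is about 16%, far above the 1% prior but still well below the 95% likelihood, which is the kind of update the theorem describes.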

5. Define conditional probability

6. Explain chi-square distribution

A chi-squared test (symbolically represented as χ²) is a statistical test
based on observations of a random set of variables. This test
was introduced by Karl Pearson in 1900.
The chi-square test is used to estimate how likely the observations that are
made would be, assuming the null hypothesis is true.
The chi-squared test helps to determine whether there is a notable difference
between the expected frequencies and the observed frequencies in one or more
classes or categories. It can also test whether two categorical variables are independent.
The chi-squared test is applicable only to categorical data, such as gender
(men and women), or age and height once they are grouped into categories.
Properties
The following are the important properties of the chi-square distribution:
• The variance (a measurement of the spread between numbers in a data set) is
equal to two times the number of degrees of freedom (the maximum number of
logically independent values).
• The mean of the distribution is equal to the number of degrees of freedom.
• The chi-square distribution curve approaches the normal distribution as
the number of degrees of freedom increases.
Formula
χ² = ∑ (Oi – Ei)² / Ei
where χ² is the chi-square test statistic, Oi is the observed frequency and Ei is
the expected frequency of category i.

Types of chi-square tests


The two types of Pearson's chi-square tests are:
I. Chi-square goodness of fit test
You can use this test when you have one categorical variable.
Ex: Null hypothesis (H0): The bird species visit the bird feeder
in equal proportions.
Alternative hypothesis (HA): The bird species visit the bird feeder
in different proportions.
II. Chi-square test of independence
You can use this test when you have two categorical variables. It allows you
to test whether the two variables are related to each other.
Ex: Null hypothesis (H0): The proportion of people who are left-handed
is the same for Americans and Canadians.
Alternative hypothesis (HA): The proportion of people who are left-handed
differs between nationalities.
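The goodness of fit statistic can be computed directly from the formula χ² = ∑ (Oi – Ei)² / Ei. The sketch below uses the bird feeder example with made-up visit counts; under the null hypothesis of equal proportions every species has the same expected count.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical visit counts for three bird species at the feeder.
observed = [30, 50, 40]
total = sum(observed)

# Under H0 (equal proportions) each species is expected total / 3 times.
expected = [total / len(observed)] * len(observed)

chi2 = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
print(chi2, degrees_of_freedom)
```

Here χ² = 5.0 with 2 degrees of freedom. The critical value at the 5% significance level for 2 degrees of freedom is about 5.991, so with these made-up counts the null hypothesis of equal proportions would not be rejected.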

7. Explain normal distribution


Normal distribution, also known as the Gaussian distribution, is a probability
distribution that is symmetric about the mean, showing that data near the
mean are more frequent in occurrence than data far from the mean.
In graphical form, the normal distribution appears as a "bell curve".

The normal distribution has two parameters: the mean and the
standard deviation.
In the standard normal distribution the mean is zero and the standard deviation is 1.
Properties of the Normal Distribution
First, its mean (average), median (midpoint), and mode (most frequent
observation) are all equal to one another. Moreover, these values all
represent the peak, or highest point, of the distribution. The distribution then
falls symmetrically around the mean, the width of which is defined by the
standard deviation.
The Formula for the Normal Distribution
f(x) = (1 / (σ√(2π))) × e^(–(x – μ)² / (2σ²))
where:
x = value of the variable or data being examined and f(x) the probability
density function
μ = the mean
σ = the standard deviation
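The normal density can be evaluated directly from μ and σ, and checked against `statistics.NormalDist` from the Python standard library. The test-marks distribution (mean 70, standard deviation 10) is a hypothetical example, and `normal_pdf` is a helper name introduced here.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical test marks: mean 70, standard deviation 10.
marks = NormalDist(mu=70, sigma=10)

# Share of students expected within one standard deviation of the mean.
share_within_one_sd = marks.cdf(80) - marks.cdf(60)

print(round(normal_pdf(75, 70, 10), 5), round(marks.pdf(75), 5))
print(round(share_within_one_sd, 4))
```

The last value is about 0.68, the familiar rule that roughly 68% of observations fall within one standard deviation of the mean.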

Applications
• Marks scored on a test
• Heights of different persons
• Size of objects produced by a machine
• Blood pressure and so on.

8. Explain t-test & its types


A t test is a statistical hypothesis test that is used to compare the means of
two groups. It is often used in hypothesis testing to determine whether a
process or treatment actually has an effect on the population of interest, or
whether two groups are different from one another.
A t test can only be used when comparing the means of two groups (a.k.a.
pairwise comparison). If you want to compare more than two groups, or if
you want to do multiple pairwise comparisons, use an ANOVA test or a
post-hoc test.
The t test is a parametric test of difference, meaning that it makes the same
assumptions about your data as other parametric tests.
The t test assumes your data:
• are independent
• are (approximately) normally distributed
• have a similar amount of variance within each group being compared (a.k.a.
homogeneity of variance)
Types of t-test
There are three types of t-tests we can perform based on the data at hand:
• One sample t-test
• Independent two-sample t-test
• Paired sample t-test
One-Sample t-test
• In a one-sample t-test, we compare the average (or mean parameter) of one
group against the set average (or mean). This set average can be any
theoretical value (or it can be the population mean).
Consider the following example: a research scholar wants to determine if
the average eating time for a (standard size) burger differs from a set value.
Let's say this value is 10 minutes.
Follow the steps below:
• Select a group of people
• Record the individual eating time of a standard-size burger
• Calculate the average eating time for the group
• Finally, compare that average value with the set value of 10
Formula
t = (m – µ) / (s / √n)
where,
• t = t-statistic
• m = mean of the group
• µ = theoretical value or population mean
• n = sample size
• s = standard deviation of the group
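The one-sample t-statistic for the burger example can be computed as follows. The eating times are made-up values, and `one_sample_t` is a name chosen for this sketch; it returns only the statistic, not a p-value.

```python
import math
import statistics

def one_sample_t(sample, mu):
    """t = (m - mu) / (s / sqrt(n)) with m the sample mean and s the sample std dev."""
    n = len(sample)
    m = statistics.mean(sample)
    s = statistics.stdev(sample)
    return (m - mu) / (s / math.sqrt(n))

# Hypothetical eating times in minutes for five people.
eating_times = [9, 11, 10, 12, 13]

t_one_sample = one_sample_t(eating_times, mu=10)
print(round(t_one_sample, 4))
```

The statistic is then compared with the t distribution with n – 1 degrees of freedom to decide whether the difference from 10 minutes is significant.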

Independent Two-Sample t-test


The two-sample t-test is used to compare the means of two different,
independent samples.
Let's say we want to compare the average height of the male employees to
the average height of the female employees. The two groups do not need to
be the same size for this comparison. This is where a two-sample t-test is
used.
Here's the formula to calculate the t-statistic for a two-sample t-test:
t = (mA − mB) / √(s²/nA + s²/nB)
where,
• mA, mB = means of the two samples
• nA, nB = sizes of the two samples
• s² = pooled estimate of the common variance of the two groups
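As a rough illustration of the height comparison, the sketch below computes a two-sample t-statistic in plain Python. It assumes the pooled-variance form of the test (consistent with the homogeneity-of-variance assumption listed earlier); the heights are invented and `two_sample_t` is a hypothetical helper name.

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled-variance t-statistic for two independent samples."""
    n_a, n_b = len(a), len(b)
    m_a, m_b = statistics.mean(a), statistics.mean(b)
    # Pooled estimate of the common variance, weighted by degrees of freedom
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (m_a - m_b) / math.sqrt(pooled_var / n_a + pooled_var / n_b)

# Hypothetical heights in cm; the groups are deliberately of different sizes
male_heights = [172, 175, 168, 180, 177]
female_heights = [160, 165, 158, 162]
t_stat = two_sample_t(male_heights, female_heights)
print(f"t = {t_stat:.3f}, degrees of freedom = {len(male_heights) + len(female_heights) - 2}")
```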
Paired Sample t-test
Here, we measure one group at two different times. We compare the means
of the same group at two different times or under two different conditions.
Ex: A manager realized that the productivity level of his employees
was trending significantly downwards. The manager decided to conduct a
training program for all his employees with the aim of increasing their
productivity levels.
To check whether the training helped, compare the productivity level of
each employee before versus after the training program.
The formula to calculate the t-statistic for a paired t-test is:
t = m / (s / √n)
where,
t = t-statistic
m = mean of the paired differences (after − before)
s = standard deviation of the paired differences
n = number of pairs (group size or sample size)
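The before/after training comparison can be sketched as follows. This is only an illustration under stated assumptions: the productivity scores are invented, `paired_t` is a hypothetical helper, and the statistic is computed on the per-employee differences (after minus before), which is the usual way a paired t-test is set up.

```python
import math
import statistics

def paired_t(before, after):
    """t-statistic for paired measurements, based on per-subject differences."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]   # after - before
    n = len(diffs)
    m = statistics.mean(diffs)     # mean of the differences
    s = statistics.stdev(diffs)    # standard deviation of the differences
    return m / (s / math.sqrt(n))

# Hypothetical productivity scores for six employees
before_training = [50, 55, 48, 60, 52, 58]
after_training = [54, 58, 50, 66, 53, 63]
t_stat = paired_t(before_training, after_training)
print(f"t = {t_stat:.3f}, degrees of freedom = {len(before_training) - 1}")
```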

9. Explain ANOVA
ANOVA stands for Analysis of Variance. It is a statistical method used to
analyze the differences between the means of two or more groups or
treatments. ANOVA is also called the Fisher analysis of variance, and it is
an extension of the t-test.
The ANOVA test allows a comparison of more than two groups at the same
time to determine whether their means differ. The result of
the ANOVA formula, the F statistic (also called the F-ratio), compares the
variability between samples with the variability within samples.
Types of ANOVA
1. One-way ANOVA – has just one independent variable. For example,
difference in IQ can be assessed by country, and country can have 2, 20, or
more different categories to compare.
2. Two-way ANOVA – assesses two independent variables.
3. Multivariate ANOVA (MANOVA) – assesses more than one dependent
variable at the same time.
Formula of ANOVA
F = Mean sum of squares between the groups (MSB) / Mean squares of
errors (MSE)
Therefore F = MSB / MSE
where,
Mean squares between groups, MSB = SSB / (k – 1)
Mean squares of errors, MSE = SSE / (N – k)
Degrees of freedom between groups, k – 1 = df1; degrees of freedom of
errors, N – k = df2. Here, N is the total number of observations across the
k groups.
SSB = ∑ nj (x̄j – x̄)²        SSE = ∑∑ (xij – x̄j)²
Here nj is the size of group j, x̄j is the mean of group j, x̄ is the overall
mean, and xij is the i-th observation in group j.
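To make the F = MSB / MSE calculation concrete, here is a small sketch of a one-way ANOVA in plain Python. The three groups are toy numbers chosen so the arithmetic is easy to follow by hand, and `one_way_anova_f` is a hypothetical helper name; SSB is taken as the size-weighted squared distance of each group mean from the overall mean, and SSE as the squared distance of each observation from its own group mean.

```python
import statistics

def one_way_anova_f(groups):
    """Return the F statistic (MSB / MSE) for a list of groups."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)             # N, total observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: group sizes times squared mean offsets
    ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares
    sse = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

    msb = ssb / (k - 1)          # df1 = k - 1
    mse = sse / (n_total - k)    # df2 = N - k
    return msb / mse

# Toy data: group means are 5, 7 and 9, overall mean is 7
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")  # SSB = 24, SSE = 6, so F = 12 / 1 = 12.00
```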

10. Differentiate between hypothesis testing & Estimation analysis

| Criteria | Hypothesis Testing | Estimation |
| --- | --- | --- |
| Definition | Hypothesis testing is a method that tests an assumption regarding a population parameter. | Estimation is the process of inferring the value of a population parameter based on a sample. |
| Purpose | The purpose is to decide whether a claim about a population parameter is supported by sample data. | The aim is to obtain an approximate value or range of values for a population parameter. |
| Types | The null hypothesis (H0) and the alternative hypothesis (H1) are the two hypotheses tested against each other. | Point estimation and interval estimation are the two kinds of estimation. |
| Statistical values | In hypothesis testing, we calculate a test statistic and p-value. | In estimation, we calculate an estimate (a single value or range). |
| Decision | A decision has to be made to either reject or fail to reject the null hypothesis. | No explicit decision is made; an estimate is provided. |
| Example | If a beverage company claims that its juice pack contains 300 ml of purified water, we can apply hypothesis testing to find out if this is true based on a sample. | If we want to guess the average height of all students in a class, we could use estimation based on a small sample of pupils. |
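The difference between the two approaches can be seen by running both on the same sample. The sketch below reuses the 300 ml juice-pack claim; the measured volumes are invented, and for simplicity it assumes a normal (z) approximation for both the interval and the p-value, whereas for a sample this small a t-distribution would normally be preferred.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical measured volumes (ml) of ten juice packs
volumes = [298, 301, 299, 297, 300, 302, 296, 299, 298, 300]
claimed_mean = 300  # the company's claim (null hypothesis value)

n = len(volumes)
point_estimate = mean(volumes)                  # estimation: single value
std_error = stdev(volumes) / math.sqrt(n)

# Estimation: 95% interval estimate (normal approximation, an assumption)
z_crit = NormalDist().inv_cdf(0.975)
ci_low = point_estimate - z_crit * std_error
ci_high = point_estimate + z_crit * std_error

# Hypothesis testing: test statistic and two-sided p-value
z_stat = (point_estimate - claimed_mean) / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"point estimate = {point_estimate:.1f} ml")
print(f"95% interval   = ({ci_low:.2f}, {ci_high:.2f}) ml")
print(f"test statistic = {z_stat:.3f}, p-value = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

Estimation stops at the value and the interval; hypothesis testing goes one step further and turns the same numbers into a reject / fail-to-reject decision.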
11. Define correlation analysis & explain its types
Correlation analysis is a statistical method that is used to discover if there is a
relationship between two variables/datasets, and how strong that relationship
may be.

Correlation types

Simple, Partial and Multiple Correlation:

Whether the correlation is simple, partial or multiple depends on the number of
variables studied.

The correlation is said to be simple when only two variables are studied. The
correlation is either multiple or partial when three or more variables are studied.

Positive Correlation: The value of one variable increases with an increase in
another variable, so both variables move in the same direction. The
correlation coefficient is positive in this case (between 0 and +1, with +1
for a perfect linear relationship).

Negative Correlation: The values of one variable decrease with an
increase in the values of another variable. The correlation coefficient is
negative in this case (between 0 and −1).

Zero Correlation or No Correlation: There is no specific relation between
the two variables, and the correlation coefficient is close to 0.
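The three cases can be checked numerically with the Pearson correlation coefficient. The sketch below is a plain-Python illustration with made-up data; `pearson_r` is a hypothetical helper, and Pearson's r is only one of several correlation measures.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
rises_with_x = [2, 4, 6, 8, 10]     # positive correlation
falls_with_x = [10, 8, 6, 4, 2]     # negative correlation
unrelated = [2, 1, 3, 1, 2]         # no linear relation with x

r_pos = pearson_r(x, rises_with_x)
r_neg = pearson_r(x, falls_with_x)
r_zero = pearson_r(x, unrelated)
print(f"positive: {r_pos:.2f}, negative: {r_neg:.2f}, zero: {r_zero:.2f}")
```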

12. Define scatterplot


The scatter diagram graphs pairs of numerical data, with one variable on
each axis, to look for a relationship between them. If the variables are
correlated, the points will fall along a line or curve. The better the
correlation, the tighter the points will hug the line.
13. Define regression. Explain its types
Regression is defined as a statistical method that helps us to analyze and
understand the relationship between two or more variables of interest.

The two types are: linear regression and logistic regression.

Linear regression is a supervised machine learning algorithm that is used to
predict a continuous value, such as price, profit, or weight, based on a set of
independent variables.

The basic concept of linear regression is to find a line that best fits the data points.

There are three main types of linear regression:

• Simple linear regression: It has only one independent variable. This means
that the dependent variable is modeled as a linear function of the
independent variable. It is important to understand, but real-life problems
rarely depend on just one variable.

• Multiple linear regression: Multiple linear regression has multiple
independent variables. This means that the dependent variable is modeled as
a linear function of multiple independent variables.

• Polynomial linear regression: Polynomial linear regression is a special
case of multiple linear regression where the independent variables are
raised to different powers. This allows the model to fit non-linear
relationships between the dependent and independent variables.
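As an illustration of "finding the line that best fits the data points", the sketch below fits a simple linear regression by ordinary least squares in plain Python. The advertising and sales numbers are invented, and `fit_simple_linear` is a hypothetical helper; real projects would typically use a library instead.

```python
def fit_simple_linear(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend (x) and sales (y)
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_simple_linear(ad_spend, sales)
predicted_sales = slope * 6 + intercept   # predict a continuous value for x = 6
print(f"sales = {slope:.2f} * spend + {intercept:.2f}; prediction at 6: {predicted_sales:.2f}")
```

Multiple and polynomial regression follow the same least-squares idea, only with more columns of predictors (extra variables, or powers of x).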

Logistic regression

Logistic regression analysis is a popular and widely used analysis that is
similar to linear regression analysis except that the outcome is
dichotomous (binary) (e.g., success/failure or yes/no or died/lived).

Simple logistic regression analysis refers to the regression application with
one dichotomous (binary) outcome and one independent variable;

multiple logistic regression analysis applies when there is a single
dichotomous (binary) outcome and more than one independent variable.
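A simple logistic regression (one binary outcome, one independent variable) can be sketched as below. This is a teaching sketch under stated assumptions: the study-hours data are invented, the model is fitted with plain gradient descent on the log-loss, and the learning rate and number of epochs are arbitrary choices rather than recommended settings.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=20000):
    """Fit P(y = 1) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: hours studied and whether the student passed (1) or failed (0)
hours = [0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1.0 + b)    # predicted pass probability after 1 hour
p_high = sigmoid(w * 4.5 + b)   # predicted pass probability after 4.5 hours
print(f"P(pass | 1 h) = {p_low:.2f}, P(pass | 4.5 h) = {p_high:.2f}")
```

Unlike linear regression, the output is a probability, which is then turned into a yes/no prediction with a cut-off such as 0.5.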
14. Differentiate between linear regression & logistic regression
UNIT 3- DATA VISUALIZATION

1. What is data visualization? Write the importance of it


Data visualization is the graphical representation of information and data.
By using visual elements like charts, graphs, and maps, data visualization
tools provide an accessible way to see and understand trends, outliers, and
patterns in data.
Types: Charts, Graphs, Dashboards, Reports.

Here are some benefits of data visualization:

i. Communication
Data visualization is a fast and useful communication tool that can bring
employees, decision-makers, and other parties together on information and
data.
ii. Comprehension
Data visualization lets you comprehend vast amounts of data at a glance and
in a better way.
iii. Problem-solving
Data visualization allows for more innovation, creativity, and better
problem-solving and teamwork.
iv. Decision-making
Data visualization can help identify areas that need attention or
improvement. It can also aid in decision making.

2. Explain the different types of data visualization


TYPES OF DATA VISUALIZATION
1. **Temporal Data Visualization:**
- **Definition:** Temporal data visualization focuses on representing
information that changes over time.
- **Examples:** Time series charts, line graphs, Gantt charts, and calendar
heatmaps.
- **Use Cases:** Analyzing trends, patterns, and seasonality in data over
specific time intervals.
2. **Hierarchical Data Visualization:**
- **Definition:** Hierarchical data visualization is used to represent data in
a hierarchical structure, where elements are organized in levels or tiers.
- **Examples:** Tree diagrams, sunburst charts, and dendrogram
visualizations.
- **Use Cases:** Displaying relationships and structures within nested
categories or organizational hierarchies
3. **Network Data Visualization:**
- **Definition:** Network data visualization focuses on representing
relationships and connections between different entities.
- **Examples:** Network graphs, force-directed graphs, and social network
visualizations.
- **Use Cases:** Analyzing relationships in social networks, organizational
structures, or any interconnected systems.
4. **Multidimensional Data Visualization:**
- **Definition:** Multidimensional data visualization deals with datasets
that have more than three dimensions.
- **Examples:** Parallel coordinate plots, radar charts, and 3D scatter
plots.
- **Use Cases:** Visualizing and understanding relationships in complex
datasets with multiple variables.
5. **Geospatial Data Visualization:**
- **Definition:** Geospatial data visualization is used to represent
information on maps or geographical spaces.
- **Examples:** Choropleth maps, heatmaps, and point maps.
- **Use Cases:** Analyzing regional patterns, spatial distribution, and
relationships between data and geographical locations.

3. List advantages & disadvantages of data visualization?

Here are some advantages of data visualization:


i. Simplifies complex data

Data visualization can make it easier to understand large amounts of data.

ii. Helps identify patterns and trends

Data visualization can help users recognize new patterns and errors in the data.

iii. Improves communication

Data visualization can be a faster and more effective communication tool than
reports and spreadsheets.

iv. Saves time

Data visualization tools can simplify the data analysis process and present results
attractively.

v. Improves insights

Data visualization can help users make informed decisions.

vi. Increases accessibility

Data visualization can benefit everyone from stakeholders to executives and
decision-makers.

DISADVANTAGES OF DATA VISUALIZATION:

Improper visualization

Many of the other issues and disadvantages stem from this main one. If you're
not careful in how you build your visualizations, you may end up with
visualizations that don't properly convey your data. This can lead to confusion and
issues down the line if you use that improper visualization to do analysis and draw
conclusions.

Incorrect conclusions

As noted above, a risk of using data visualization is that your audience may
draw incorrect conclusions. And that's not just because of improper visualizations.
Sometimes a visual medium can lead to confusion in the viewer, so different
people in your audience may walk away with drastically different conclusions after
viewing the same visualization.

Inexact

If you're creating a visual representation of numerical data, there is an
inherent risk of creating an inexact perception of the data in the mind of the
viewer.

3. Explain features of power BI


Microsoft Power BI is a business intelligence (BI) platform that provides
tools for business users to analyze, visualize, and share data. Business
intelligence collects, stores, and analyzes data from a company's
activities.

Features are:
i. data connectivity
ii. data transformation
iii. data modeling
iv. visualization
v. collaboration
vi. mobile access
vii. natural language queries
viii. AI-powered insights
ix. real-time data streaming
x. custom visualization
xi. integration with other Microsoft products

4. Explain power BI components


1) Power Query Editor:- The Power Query Editor is a graphical user
interface (GUI) that allows users to prepare data. It connects to a variety
of data sources and allows users to apply data transformations by
previewing data and selecting transformations from the UI.
Go to -> Transform data -> the Power Query Editor opens.
2) Power BI Desktop:- Power BI Desktop is a free application that can be
installed on a local computer. It allows users to connect to data, transform it,
and visualize it. Power BI Desktop is the core development application used
to develop Power BI components.
3) Power BI Service:- The Power BI Service is a cloud-based service that
allows users to visualize and analyze data. It is a collection of software
services, apps, and connectors that can turn unrelated data sources into
interactive insights. Power BI can convert data from different sources to
create interactive dashboards and Business Intelligence reports. It can also
support report editing and collaboration for teams and organizations.
Go to -> Publish -> if you have a license, you can upload the report to the
cloud and share it with team members to build real-time projects.
4) Power View:- Power View is a data visualization tool that allows users to
create interactive charts, graphs, maps, and other visuals. It is available in
Excel, SharePoint, SQL Server, and Power BI. (Visual part only)
5) Power Pivot:- Power Pivot is an Excel add-in that allows users to perform
data analysis and create data models. It can import, manipulate, and analyze
large amounts of data without losing speed or functionality. It also covers
the calculation part, i.e. the New measure option.
Go to -> Model view (third option in the left-hand corner).
6) Power Q&A (Question & Answer):-
The Q&A feature in Power BI lets you explore your data in your own words.
Q&A is available on dashboards and on reports. The AI and ML in Power BI
automatically answer the questions using the datasets.
On the dataset/charts page -> double tap -> a dialogue box opens -> ask a
question.
7) Power Map and Power BI Mobile App:-
Power Map is a 3D data visualization tool in Microsoft's Power BI and
Excel. It allows users to map and plot data from Excel tables or Data Models
in Excel on Bing maps in 3D format.
5. Explain different variants of Power BI
Business Intelligence (BI) tools encompass a variety of functionalities
designed to help organizations make informed decisions based on their data.
Here are explanations of different types of BI tools:
I. **Mobile BI:**
- **Definition:** Mobile BI tools allow users to access and interact with
business intelligence data on mobile devices such as smartphones and
tablets.
- **Use Cases:** Enables on-the-go decision-making, real-time updates,
and responsiveness. E.g.: Zoho Analytics
II. **Real-time BI:**
- **Definition:** Real-time BI tools provide insights and analytics on data
as it is generated, allowing for immediate decision-making.
- **Use Cases:** Critical for industries where timely decisions are crucial,
such as finance, stock trading, and emergency response.
III. **Operational BI:**
- **Definition:** Operational BI focuses on providing real-time data to
support day-to-day operational activities and processes.
- **Use Cases:** Helps monitor and optimize ongoing business processes,
improving efficiency and responsiveness.
IV. **Collaborative BI:**
- **Definition:** Collaborative BI tools promote sharing and collaboration
on business intelligence insights among team members.
- **Use Cases:** Facilitates teamwork, knowledge sharing, and collective
decision-making within organizations.
V. **Location Intelligence:**
- **Definition:** Location Intelligence, or spatial intelligence, involves
analyzing and visualizing data in the context of geographic locations.
- **Use Cases:** Useful for understanding spatial patterns, optimizing
logistics, and making location-based decisions.
VI. **SaaS BI (Software as a Service BI):**
- **Definition:** SaaS BI tools are cloud-based business intelligence
solutions that users access over the internet without needing to install
software locally.
- **Use Cases:** Offers flexibility, scalability, and cost-effectiveness for
organizations with varying data needs.
VII. **OLAP (Online Analytical Processing):**
- **Definition:** Analyzing data from different views. OLAP tools allow
users to analyze multidimensional data interactively.
- **Use Cases:** Ideal for complex data analysis, data mining, and
generating business insights from multi-dimensional datasets.

VIII. **Ad hoc Analytics:**
- **Definition:** Identifies patterns in the data. Ad hoc analytics tools enable
users to create reports and perform analyses on-the-fly without relying on
predefined reports.
- **Use Cases:** Provides flexibility for users to explore and analyze data
according to their specific needs without requiring IT support.

6. Explain visualization elements/ building blocks in PowerBI

5 building blocks of Power BI:-

1) Tile:- In Power BI, a tile is a snapshot of data that is pinned to a dashboard. Tiles
can be created from a variety of sources, including reports, dashboards, the Q&A
box, Excel, and SQL Server Reporting Services (SSRS) reports.

2) Report:- A Power BI report is a multi-perspective view of a data model. It can
contain a single visualization or multiple pages of visualizations. Reports are used
for in-depth analysis and exploration of data to answer complex business
questions.

3) Dashboard:- A Power BI dashboard is a single page, often called a canvas, that
uses visualizations to tell a story. The visualizations on a dashboard are tiles
pinned from reports, so a dashboard gives an at-a-glance summary while the
underlying reports hold the detail.

4) Visual:- A visual in Power BI is a visual representation of data. Visuals are a key
part of any Power BI report, as they help users identify and understand patterns in
the data. They are found in the Visualizations pane in Power BI.

5) Datasets:- A dataset in Power BI is a collection of data that can be imported,
connected to, and used for reporting and visualizations. Datasets include data,
tables, relationships, calculations, and a connection to the data source. You can
load one from the Get data option -> Excel, Web, Text/CSV (comma-separated
values). Plenty of datasets are available for download on kaggle.com.

6. Write the advantages & disadvantages of power bi


Advantages are
i. Data Visualization and Analysis:
Organizations generate vast amounts of data, and there is a need to
convert this data into meaningful insights. Power BI allows users to create
interactive and visually appealing dashboards and reports to analyze data
effectively.
ii. Ease of Use:
Many business users may not have a technical background, and there is a
need for a user-friendly tool that allows non-technical users to create reports
and visualizations without extensive training. Power BI's intuitive interface
makes it accessible to a broad audience.
iii. Data Integration:
Businesses often have data stored in various sources, such as databases,
spreadsheets, and cloud services. Power BI provides robust data connectivity
options, enabling users to connect to diverse data sources and create a
unified view of their data.
iv. Real-time Insights:
In today's fast-paced business environment, real-time insights are crucial for
making informed decisions. Power BI supports real-time data updates,
allowing organizations to monitor and respond to changes as they happen.
v. Collaboration and Sharing:
Effective collaboration is essential for decision-making. Power BI facilitates
collaboration by allowing users to share reports and dashboards with
colleagues. The cloud-based Power BI Service enables real-time
collaboration and sharing of insights across the organization.

Disadvantages are

• Power BI does not accept file sizes larger than 1 GB and doesn't mix
imported data with data accessed from real-time connections.
• There are very few data sources that allow real-time connections to
Power BI reports and dashboards.
• It only shares dashboards and reports with users logged in with the
same email address.
• Dashboards don't accept or pass user, account, or other entity
parameters.

(refer notes for below questions)

7. Explain Power BI architecture with a neat diagram
8. Explain different types of charts
9. Explain visualization techniques for spatial data in Power BI
10. Explain visualization techniques for geospatial data in Power BI
11. Explain visualization techniques for time-oriented data in Power BI
12. What is Power Query?
13. What is M language? Explain its importance & features
14. Explain text & document visualization technique
UNIT4- CASE STUDY

1. List the types of case studies
2. Explain business analytics use cases of Uber
3. Discuss the role of data analytics in tracking & managing the COVID-19
pandemic
4. Write a case study on YouTube data with an example
5. Define the term "survey"
6. Explain various stages involved in a case study
7. Write a case study on Amazon data analytics with an example
8. Write a case study on Uber with an example
9. Write a case study on Twitter with an example
10. Explain benefits & limitations of a case study

Survey definition
A survey is a research method that collects data from an entire population
or a very large sample in order to understand their opinions on a
particular matter.
A case study is a detailed study of a specific subject, such as a person,
group, or situation.
