2 Marks With Answers

AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS

UNIT I- INTRODUCTION TO DATA SCIENCE

Part A
1. What are the characteristics of quality data?

• Validity - The degree to which your data conforms to defined business rules or constraints.
• Accuracy - Ensure your data is close to the true values.
• Completeness - The degree to which all required data is known.
• Consistency - Ensure your data is consistent within the same data set and/or across multiple data
sets.
• Uniformity - The degree to which the data is specified using the same unit of measure.

2. What do you mean by Data Science? Or Define Data Science.

• Data science is the domain of study that deals with vast volumes of data using modern tools and
techniques to find hidden patterns, derive meaningful information, and make business decisions.
• Data science can be explained as the entire process of gathering actionable insights from raw data
that involves concepts like pre-processing of data, data modelling, statistical analysis, data analysis,
machine learning algorithms, etc.
• The main purpose of data science is to enable better decision making.
3. List out at least five applications of data science.

• Finance and Fraud & Risk Detection.


• Healthcare.
• Internet Search and Website Recommendations.
• Retail Marketing and Targeted Advertising.
• Advanced Image Recognition.
• Speech Recognition.
• Airline Route Planning.
4. Write a short note on outlier detection and state its real-time applications.

• In statistics, an outlier is a data point that differs significantly from other observations.
• An outlier detection technique (ODT) is used to detect anomalous observations/samples that do not
fit the typical/normal statistical distribution of a dataset.
• Applications of Outlier Detection are SPAM Detection, Credit Card Fraudulent Activity detection,
intrusion detection in cyber security
5. What contents should be included in a project charter?
A project charter requires teamwork, and your input covers at least the following:
i. A clear research goal
ii. The project mission and context
iii. How you’re going to perform your analysis
iv. What resources you expect to use
v. Proof that it’s an achievable project (proof of concept)
vi. Deliverables and a measure of success
vii. A timeline
6. Define Data Cleansing.

• Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
• When combining multiple data sources, there are many opportunities for data to be duplicated or
mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look
correct.
• Data cleansing is also referred to as data cleaning or data scrubbing.
7. What is brushing and linking in exploratory data analysis? (April/May 2023)
Brushing and Linking is the connection of two or more views of the same data, such that a change to the
representation in one view affects the representation in the other.
Brushing and linking is also an important technique in interactive visual analysis, a method for
performing visual exploration and analysis of large, structured data sets.
Linking and brushing is one of the most powerful interactive tools for doing exploratory data analysis
using visualization.
8. How does confusion matrix define the performance of classification algorithm? (April/May 2023)
A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set of
test data. It is often used to measure the performance of classification models, which aim to predict a
categorical label for each input instance. The matrix displays the number of true positives (TP), true
negatives (TN), false positives (FP), and false negatives (FN) produced by the model on the test data.
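The four cells of a confusion matrix can be counted directly. Below is a minimal pure-Python sketch with made-up labels, assuming binary classification where 1 marks the positive class:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for binary labels."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            counts["TP"] += 1          # correctly predicted positive
        elif t != positive and p != positive:
            counts["TN"] += 1          # correctly predicted negative
        elif t != positive and p == positive:
            counts["FP"] += 1          # predicted positive, actually negative
        else:
            counts["FN"] += 1          # predicted negative, actually positive
    return counts

y_true = [1, 0, 1, 1, 0, 0, 1, 0]      # illustrative test labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]      # illustrative model predictions
m = confusion_matrix(y_true, y_pred)
print(m["TP"], m["TN"], m["FP"], m["FN"])  # 3 3 1 1
```

Metrics such as accuracy follow from these counts, e.g. accuracy = (TP + TN) / (TP + TN + FP + FN).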
9. Specify the Facets of data with an example for each and how Benchmarking tools and Scheduling
tools support data science process.
The various facets of data are:

• Structured — data that depends on a data model and resides in fixed fields; SQL (Structured Query Language) is the preferred way to manage and query data that resides in databases. Example: a relational database table.
• Unstructured — data whose content is not organized into fixed fields. Example: e-mail.
• Natural language — a special, challenging type of unstructured data. Example: free text in documents.
• Machine-generated — data created automatically by a computer or process. Example: web server logs.
• Graph-based — data that emphasizes the relationships between objects. Example: friend connections in a social network.
• Audio, video, and images — Example: surveillance footage.
• Streaming — data that flows into the system as events happen. Example: live sensor readings.
Scheduling tools support the data science process by automating the repeated execution of its steps, while benchmarking tools make it possible to compare the performance of different tools and solutions.
10. Outline the data cleansing techniques. What are the types of errors, and what are the solutions for them?
Data cleansing is a subprocess of the data science process that focuses on removing errors in the data, so that the data becomes a true and consistent representation of the processes it originates from.
11. Define Outlier and show the distribution with an example. How does it differ from a sanity check?
An outlier is an observation that lies an abnormal distance from other values in a random sample from a
population. A machine learning model sanity check is a set of tests performed in a pre-production
environment to detect these sorts of systematic errors and biases, so you can ensure models work as expected
before deploying them to production.
12. Define Bigdata.
Big data refers to huge, voluminous data, information or relevant statistics acquired by large organizations and ventures. Much specialized software and data storage has been created because it is difficult to process big data manually.
13. Compare Data Science vs Big Data

• Data Science is an area/domain of study; Big Data is a technique to collect, maintain and process huge volumes of information.
• Data Science is about the collection, processing, analysis and utilization of data in various operations; Big Data is about extracting the vital and valuable information from huge amounts of data.
• Data Science is a field of study, just like Computer Science, Applied Statistics or Applied Mathematics; Big Data is a technique for tracking and discovering trends in complex data sets.
• Tools mainly used in Data Science include SAS, R, Python, etc.; tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
• Data Science is a superset of Big Data, since data science spans data scraping, cleaning, visualization, statistics and many more techniques; Big Data is a subset of Data Science, as mining activities form one stage of the data science pipeline.
• Data Science is mainly used for scientific purposes; Big Data is mainly used for business purposes and customer satisfaction.

14. List out the characteristics of big data.

• Volume
• Variety
• Velocity
• Veracity
• Value
15. Give the various challenges of big data.

• Data Capture
• Curation
• Storage
• Search
• Sharing
• Transfer
• Visualization
16. State the need of Data Science.

• Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and manufacturing.
• Examples of where Data Science is needed:
• For route planning: to discover the best routes to ship
• To foresee delays for flight/ship/train etc. (through predictive analysis)
• To create promotional offers
• To find the best suited time to deliver goods
• To forecast the next year's revenue for a company
• To analyze the health benefits of training
17. What are the advantages/ benefits of data science?

• Commercial companies in every business wish to analyze and gain insights into their customers, processes, staff, competition, and products.
• Many companies use data science to offer customers a better user experience, cross-sell, up-sell, and personalize their offerings.
• Human resource professionals use people analytics and text mining to screen candidates, monitor the mood of employees, and study informal networks among coworkers.
• Financial institutions use data science to predict stock markets, determine the risk of lending money, and learn how to attract new clients for their services.
• Many governmental organizations not only rely on internal data scientists to discover valuable information, but also share their data with the public.
• You can use this data to gain insights or build data-driven applications.
• Nongovernmental organizations (NGOs) can use it as a source for getting funding. Many data scientists devote part of their time to helping NGOs, because NGOs often lack the resources to collect data and employ data scientists.
• Universities use data science in their research but also to enhance the study experience of their students.
• The rise of massive open online courses (MOOC) produces a lot of data, which allows universities to
study how this type of learning can complement traditional classes.
Part B

1. Discuss in detail about step-by-step process in Data Science with neat diagram.
2. Discuss briefly about:
i.Life cycle of Data Science
ii. Machine Learning in Data Science
3. Exemplify in detail about different facets of data with examples. (April/May 2023)
4. Sketch and outline the step-by-step activities in the data science process.(April/May 2023)
5. Explain in detail about cleansing, integrating, and transforming data with example. (April/May
2023)
6. Discuss a linear prediction model execution on semi-random data and give the Python code for the same with model diagnostics and comparison.
7. Give a detailed view of the methodologies of transforming data with examples.
8. Discuss in detail about the characteristics of data, benefits, applications.
9. Discuss a K- Nearest neighbour model execution with confusion matrix on a semi random data
and give the python code for the same with model diagnostic and comparison.
10. Give a detailed case study of building a recommender system inside a database
with all required steps for a data science model.
11. Give a detailed case study of predicting malicious URLs from the set of URLs data with all the
required steps of data science process.
UNIT II- DESCRIPTIVE ANALYTICS
Part A

1. During their first swim through a water maze, 15 laboratory rats made
the following number of errors (blind alleyway entrances): 2, 17, 5, 3, 28, 7, 5,
8, 5, 6, 2,
12, 10, 4, 3. Find the mode and median for these data.
Mode: The mode of a data set is the number that occurs most frequently in the set.
Value Frequency
2     2
3     2
4     1
5     3
6     1
7     1
8     1
10    1
12    1
17    1
28    1

Here the value 5 occurs most frequently (3 times), so the mode of these data is 5.
Median: The median M is the midpoint of the distribution.
Ordered list of the given data: 2, 2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 10, 12, 17, 28
Total number of observations: n = 15, which is odd, so the median is located in the (15 + 1)/2 = 8th spot in the ordered list.
So the median is 5.
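The answer above can be checked with the standard library, using the error counts from the question:

```python
import statistics

# Number of maze errors for the 15 laboratory rats
errors = [2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]

print(statistics.mode(errors))    # 5  (occurs 3 times)
print(statistics.median(errors))  # 5  (8th value of the 15 ordered values)
```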
2. Mention the essential and optional guidelines to be followed for frequency distributions. Or: State the “Guidelines for frequency distribution”.
• Each observation should be included in one, and only one, class.
• List all classes, even those with zero frequencies.
• All classes should have equal intervals.
• All classes should have both an upper boundary and a lower boundary.
• Select the class interval from convenient numbers, particularly 5 and
10 ormultiples of 5 and 10.
• The lower boundary of each class should be a multiple of the class interval.
• Aim for a total of approximately 10 classes.
3. GRE scores for a group of graduate school applicants are distributed as follows:
Convert to a relative frequency distribution. When calculating proportions,
round numbers to two digits to the right of the decimal point, using the
rounding procedure
(a) Convert the distribution of GRE scores shown in above table to a cumulative
frequency distribution.
(b) Convert the distribution of GRE scores obtained in above table to a
cumulative percent frequency distribution
Solution
(a) The relative frequency distribution divides each frequency by the total N = 200:

GRE      f    Relative frequency
725-749  1    1/200 = 0.005
700-724  3    3/200 = 0.015
675-699  14   14/200 = 0.07
650-674  30   30/200 = 0.15
625-649  34   34/200 = 0.17
600-624  42   42/200 = 0.21
575-599  30   30/200 = 0.15
550-574  27   27/200 = 0.135
525-549  13   13/200 = 0.065
500-524  4    4/200 = 0.02
475-499  2    2/200 = 0.01

(b) Multiplying each relative frequency by 100 gives the percent frequency distribution:

GRE      f    Percent frequency
725-749  1    0.5%
700-724  3    1.5%
675-699  14   7%
650-674  30   15%
625-649  34   17%
600-624  42   21%
575-599  30   15%
550-574  27   13.5%
525-549  13   6.5%
500-524  4    2%
475-499  2    1%
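The same conversion can be done programmatically. A short sketch using the GRE class frequencies from the table:

```python
# Frequencies per GRE class interval (from the table above)
freqs = {"725-749": 1, "700-724": 3, "675-699": 14, "650-674": 30,
         "625-649": 34, "600-624": 42, "575-599": 30, "550-574": 27,
         "525-549": 13, "500-524": 4, "475-499": 2}

n = sum(freqs.values())                        # total observations, N = 200
rel = {k: f / n for k, f in freqs.items()}     # relative frequencies f/N
pct = {k: 100 * f / n for k, f in freqs.items()}  # percent frequencies

print(n, rel["600-624"], pct["475-499"])       # 200 0.21 1.0
```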
4. Write a short note on the stem-and-leaf display. Represent the following data in a stem-and-leaf display: 67, 74, 63, 88, 82, 97, 65, 79
• A stem-and-leaf display is used to present quantitative data in a graphical format, similar to a histogram, to assist in visualizing the shape of a distribution.
• A stem-and-leaf plot displays data by splitting up each value in a dataset into a “stem” and a “leaf.”
Raw data: 67, 74, 63, 88, 82, 97, 65, 79

Stem | Leaf
6    | 7 3 5
7    | 4 9
8    | 8 2
9    | 7
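The splitting rule above can be sketched in a few lines of Python, assuming two-digit values so the tens digit is the stem and the units digit the leaf:

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Split each two-digit value into a tens 'stem' and units 'leaf'."""
    plot = defaultdict(list)
    for value in data:
        plot[value // 10].append(value % 10)   # stem = tens, leaf = units
    return {stem: leaves for stem, leaves in sorted(plot.items())}

data = [67, 74, 63, 88, 82, 97, 65, 79]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(map(str, leaves)))
# 6 | 7 3 5
# 7 | 4 9
# 8 | 8 2
# 9 | 7
```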

5. Why Frequency Distribution is important in Data Science?


• Frequency distribution is an organized tabulation/graphical representation of
the number of individuals in each category on the scale of measurement.
• The reasons for constructing a frequency distribution are as follows:
o To organize the data in a meaningful, intelligible way
o To determine the shape of the distribution
o To facilitate computational procedures for measures of average and
spread
o To draw charts and graphs for the presentation of data
o To enable the reader to make comparisons among different data sets
6. How can the skewness of a data distribution be identified?
If the mean is greater than the median, the distribution is positively skewed; if the mean is less than the median, it is negatively skewed; if the mean and median coincide, the distribution is symmetric.
7. Define frequency distribution.
A frequency distribution is a collection of observations produced by sorting
observations into classes and showing their frequency (f ) of occurrence in each class.
8. The IQ scores for a group of 35 high school dropouts are given in the table:
a) Construct a frequency distribution for grouped data.
(b) Specify the real limits for the lowest class interval in this frequency distribution.

Solution
(a) Calculating the class width: (123 − 69)/10 = 54/10 = 5.4
Round off to a convenient number, such as 5.

b) 64.5–69.5

9. What are some possible poor features of the following frequency distribution?

Solution:
• Not all observations can be assigned to one and only one class (because of
gap between 20–22 and 25–30 and overlap between 25–30 and 30–34).
• All classes are not equal in width (25–30 versus 30–34).
• All classes do not have both boundaries (35–above).
10. Define Outlier.
• An outlier is an observation that lies an abnormal distance from other values
in a random sample from a population.
• It will be considered as abnormal.
• i.e., the appearance of one or more very extreme scores, or outliers.
11. Identify any outliers in each of the following sets of data collected from nine
college students.
Solution:
Outliers are a summer income of $25,700; an age of 61; and a family size of
18. No outliers for GPA.
12. List out the typical shapes of smoothed frequency distribution.
• Positively skewed
• Normal
• Bimodal
• Negatively skewed
13. Define mean.
• The mean is the average of a set of observations, i.e., the sum of the observations divided by the number of observations.
• If the n observations are written as x1, x2, …, xn, their mean can be written mathematically as:
  x̄ = (x1 + x2 + … + xn) / n
• We read the symbol x̄ as “x-bar.”
• The bar notation is commonly used to represent the sample mean, i.e. the mean of the sample.

14. Find the sample mean value for the best actress Oscar winner data
set: 34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43
29 33 35 45 49 39 34 26 25 35 33.
Solution: x̄ = (34 + 34 + … + 33) / 32 = 1233 / 32 ≈ 38.53
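The computation can be verified with the standard library, using the 32 ages listed in the question:

```python
import statistics

# Best actress Oscar winner ages from the question
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

mean = statistics.mean(ages)
print(sum(ages), len(ages), round(mean, 2))  # 1233 32 38.53
```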

15. Define Median.


• The median M is the midpoint of the distribution.
• It is the number such that half of the observations fall above, and half
fall below.
16. State the steps to find the median value.
• Order the data from smallest to largest.
• Consider whether n, the number of observations, is even or odd.
• If n is odd, the median M is the center observation in the ordered list.
• This observation is the one “sitting” in the (n + 1) / 2 spot in the ordered
list.
• If n is even, the median M is the mean of the two center
observations in the ordered list.
• These two observations are the ones “sitting” in the (n / 2) and (n / 2)
+ 1 spots in
the ordered list.
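The odd/even rule above translates directly into code; a minimal sketch:

```python
def median(values):
    """Median following the ordered-list rule: centre value if n is odd,
    mean of the two centre values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                    # odd n: (n + 1)/2-th spot
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2      # even n: mean of n/2 and n/2 + 1 spots

print(median([2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]))  # 5
print(median([1, 3, 5, 7]))                                        # 4.0
```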

17. Compare mean and median

• The mean and the median are the most common measures of center.
• Each describes the center of a distribution of values in a different way.
• The mean describes the center as an average value, in which the actual
values of the data points play an important role.
• The median, on the other hand, locates the middle value as the
center, and the order of the data is the key.
18. Define mode.
• The mode of a data set is the number that occurs most frequently in the
set.
o If no value appears more than once in the data set, the
data set has no mode.
o If there are two values that appear in the data set an equal number of times, they both will be modes, etc.
19. When to use mean/ median?

• Use the sample mean as a measure of center for symmetric


distributions with no outliers.
• Otherwise, the median will be a more appropriate measure of the
center of our data.
20. What do you mean by range?
• A range measures the spread of a data inside the limits of a data set.
• It is calculated as a difference between the highest and lowest
values in the data set.
• The larger the range, the greater the spread of the data.
• The range covered by the data is the most intuitive measure of
variability.
• The range is exactly the distance between the smallest data point
(min) and the largest one (Max).
• Range = Max – min
21. Define standard deviation.
• The standard deviation quantifies the spread of a distribution by measuring how far the observations are from their mean.
• The standard deviation gives the average (or typical) distance between a data point and the mean.
• Standard deviation is the measure of the overall spread
(variability) of a data set values from the mean.
• The more spread out a data set is, the greater are the distances
from the mean and the standard deviation.
• There are many notations for the standard deviation: SD, s, Sd, StDev.

22. Compute the standard deviation of the sample data 3, 5, 7, with a sample mean of 5.
s = √[((3 − 5)² + (5 − 5)² + (7 − 5)²) / (3 − 1)] = √(8/2) = √4 = 2

23. What do you meant by degree of freedom.

• Degrees of freedom (df) refers to the number of values that are free to vary, given one or more mathematical restrictions, in a sample being used to estimate a population characteristic.
• For the sample standard deviation, df = n – 1.
24. Define Inter-Quartile Range (IQR).
• The Inter-Quartile Range or IQR measures the variability of a
distribution by giving us the range covered by the MIDDLE 50% of the
data.
• To find the interquartile range (IQR), first find the median
(middle value) of the lower and upper half of the data.
• These values are quartile 1 (Q1) and quartile 3 (Q3).
• The IQR is the difference between Q3 and Q1
• IQR = Q3 – Q1
• Q3 = 3rd Quartile = 75th Percentile
• Q1 = 1st Quartile = 25th Percentile
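The median-of-halves rule above can be sketched as follows (note that other quartile conventions exist and can give slightly different values; the data used here is illustrative):

```python
import statistics

def iqr(values):
    """IQR via the median-of-halves rule: Q1 and Q3 are the medians of the
    lower and upper halves of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]               # lower half
    upper = ordered[half + n % 2:]       # upper half (middle value skipped if n is odd)
    return statistics.median(upper) - statistics.median(lower)

print(iqr([5, 7, 8, 12, 13, 14, 18, 21, 23, 27]))  # Q3 - Q1 = 21 - 8 = 13
```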
25. How to measure/interpret the strength of a relationship based on the
absolute value
of ‘r’?
Absolute value of r    Strength of relationship
|r| < 0.3              None or very weak
0.3 ≤ |r| < 0.5        Weak
0.5 ≤ |r| < 0.7        Moderate
|r| ≥ 0.7              Strong

26. Define Correlation.


• Correlation is a statistical term describing the degree to which
two variables move in coordination with one another.
• If the two variables move in the same direction, then those
variables are said to have a positive correlation.
• If they move in opposite directions, then they have a negative
correlation.
27. Define correlation coefficient.
• It is a number between -1 to 1.
• It tells you the strength and direction of a relationship between variables.
• i.e., it reflects how similar the measurements of two or more variables are across a dataset.
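The Pearson correlation coefficient can be computed from its definition (covariance divided by the product of the deviations' magnitudes); a pure-Python sketch on illustrative data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))      # co-deviation sum
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))             # spread of x
    sy = math.sqrt(sum((b - my) ** 2 for b in y))             # spread of y
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]          # perfect positive linear relationship
print(round(pearson_r(x, y), 4))  # 1.0
```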

28. What are the 4 things to describe the relationship between the
variables?

•Strength
o Strength of the relationship is given by the correlation
coefficient
• Direction
o It can be +ve or –ve based on the sign of the correlation
coefficient
• Shape
o The relationship must be linear to compute a Pearson correlation coefficient
• Statistically significant
o It is based on p-value.
29. What does the correlation coefficient tell you?
• It summarizes the data
• It helps you to compare the results between studies.
30. State the guidelines for interpreting correlation strength.
31. List out the types of correlation coefficients
• Pearson’s r Correlation coefficient
• Spearman’s rho Correlation coefficient

32. What is a Linear Regression?


In simple terms, linear regression adopts a linear approach to modeling the relationship between a dependent variable (scalar response) and one or more independent variables (explanatory variables). If you have one explanatory variable, it is called simple linear regression. If you have more than one independent variable, the process is referred to as multiple linear regression.
33. What are the disadvantages of the linear regression model?
One of the most significant demerits of the linear model is that it is sensitive to and dependent on outliers, which can affect the overall result. Another notable demerit of the linear model is overfitting; similarly, underfitting is also a significant disadvantage of the linear model.
34. What are the different types of least squares?
Least squares problems fall into two categories: linear or ordinary least
squares and nonlinear least squares, depending on whether or not the
residuals are linear in all unknowns. The linear least-squares problem
occurs in statistical regression analysis; it has a closed-form solution.
35. What is the difference between least squares regression and multiple regression?
The goal of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable. In essence, multiple regression is the extension of ordinary least-squares (OLS) regression because it involves more than one explanatory variable.
36. What is the principle of least squares?
The “Principle of Least Squares” states that the most probable values of a system of unknown quantities, upon which observations have been made, are obtained by making the sum of the squares of the errors a minimum.
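For simple linear regression the least-squares line has a closed-form solution. A sketch using the T-shirt price data from Part B question 3 below (slope and intercept values are what this formula yields, not stated in the source):

```python
def least_squares(x, y):
    """Closed-form OLS slope and intercept minimizing the sum of squared errors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope = sum of co-deviations / sum of squared x-deviations
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx        # line passes through (x̄, ȳ)
    return slope, intercept

x = [2, 3, 5, 7, 9]        # price of T-shirt
y = [4, 5, 7, 10, 15]      # number of T-shirts sold
b, a = least_squares(x, y)
print(round(b, 3), round(a, 3))  # 1.518 0.305  ->  y ≈ 1.518x + 0.305
```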
Part B
1. Explain the step by step procedure to construct the frequency
distribution with an example of data set of the following table
2. In a survey, a question was asked: “During your lifetime, how often have you changed your permanent residence?” A group of 18 college students replied as follows: 1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 3. Find the mode, median, and standard deviation. [April/May 2023]
3. Consider an example. Tom who is the owner of a retail shop, found the
price of different T-shirts vs the number of T-shirts sold at his shop
over a period of one week. He tabulated this like shown below:
Price of T-Shirt Number of T-Shirt Sold
2 4
3 5
5 7
7 10
9 15
Explain the concept of least squares regression to find the line of best fit for the
above data
4. The following frequency distribution shows the annual incomes in
dollars for a group of college graduates.

(a) Construct a histogram.


(b) Construct a frequency polygon.
(c) Is this distribution balanced or lopsided?
5. Consider the best actress Oscar winners dataset given below,
construct the stem plot for the above dataset.
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33
35 45 49 39
34 26 25 35 33

6. Explain multiple linear regression model with the prediction of


sales through the various attributes like budget for TV advertisement,
Radio Advertisement and News paper Advertisement using statistical
model
7. Consider the following x and y set of values, create least square linear
regression and check the result of model fitting to know whether the
model is satisfactory
8. Discuss in detail the various typical shapes of frequency distribution.
Analyze its characteristics with an example
9. The following are the number of customers who entered a video store
in 8 consecutive hours: 7,9,5,13,3,11,15,9. Find the standard deviation
of the number of hourly customers. Summarize about the
aforementioned data with the help of standard deviation
10. Explain the steps to calculate IQR with an example of best actress Oscar
winners
11. For each of the following pairs of distributions, first decide whether
their standard deviations are about the same or different. If their
standard deviations are different, indicate which distribution should
have the larger standard deviation. Note that the distribution with the
more dissimilar set of scores or individuals should produce the larger
standard deviation regardless of whether, on average, scores or
individuals in one distribution differ from those in other distribution.
12. The IQ scores for a group of 35 high school dropouts are as follows:

i) Construct a frequency distribution for grouped data (4)


ii) Relative Frequency distribution (3)
iii) Cumulative Frequency distribution (3)
13. Discuss in detail about “Measures of Central Tendency” and calculate
each measure for the following retirement ages data:
60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
Is it possible to calculate “Mean” for qualitative data? Justify your answer.
Is the above data following “Bimodal”? Justify your answer.
14. Discuss about following measures and calculate them with
given“residence changes” data.
1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4
i. Range
ii. Variance
iii. Standard Deviation
iv. Inter Quartile Range (IQR)
v. Z-Score
UNIT III- INFERENTIAL STATISTICS
Part A
1. What is population?
In statistics, population is the entire set of items from which you draw data for a statistical
study. It can be a group of individuals, a set of items, etc. It makes up the data pool for a
study.
2. What is a sample?
A sample represents the group of interest from the population, which you will use to
represent the data. The sample is an unbiased subset of the population that best
represents the whole data.
3. When are samples used?
• The population is too large to collect data.
• The data collected is not reliable.
• The population is hypothetical and is unlimited in size. Take the example of a study
that documents the results of a new medical procedure. It is unknown how the procedure
will affect people across the globe, so a test group is used to find out how people react to
it.
4. Difference Between Population and Sample? (April/May 2023)
• Population: All residents of a country. Sample: All residents who live above the poverty line.
• Population: All residents above the poverty line in a country. Sample: All residents who are millionaires.
• Population: All employees in an office. Sample: Out of all the employees, all managers in the office.

5. Define Hypothetical Population


A population containing a finite number of individuals, members or units is called an existent population; for example, all 400 students of the 10th class of a particular school. The population of heads and tails obtained by tossing a coin an infinite number of times is an example of a hypothetical population.
6. What is Random Sampling?
Random sampling occurs if, at each stage of sampling, the selection process guarantees
that all potential observations in the population have an equal chance of being included
in the sample
7. What is a Sampling Distribution?
The sampling distribution of the mean refers to the probability distribution of means for
all possible random samples of a given size from some population.
8. What are the types of Sampling Distribution?
• Sampling distribution of mean
• Sampling distribution of proportion
• T-distribution
9. Define Sampling distribution of mean
The most common type of sampling distribution is the mean. It focuses on calculating the
mean of every sample group chosen from the population and plotting the data points. The
graph shows a normal distribution where the center is the mean of the sampling
distribution, which represents the mean of the entire population.
10. What is meant by the sampling distribution of a proportion?
This sampling distribution focuses on proportions in a population. Samples are selected and their proportions are calculated. The mean of the sample proportions from each group represents the proportion of the entire population.
11. Define T-distribution
T-distribution is a sampling distribution that involves a small population or one where
not much is known about it. It is used to estimate the mean of the population and other
statistics such as confidence intervals, statistical differences and linear regression. The T-
distribution uses a t- score to evaluate data that wouldn't be appropriate for a normal
distribution
The formula for the t-score is:
t = (x̄ − μ) / (s / √n)
In the formula, x̄ is the sample mean, μ is the population mean, s is the sample standard deviation, and n is the sample size.
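The t-score formula above is easily computed; the sample values here are made up for illustration:

```python
import math

def t_score(sample_mean, pop_mean, s, n):
    """t = (x̄ − μ) / (s / √n), the standardized distance of the sample mean
    from the hypothesized population mean in estimated-standard-error units."""
    return (sample_mean - pop_mean) / (s / math.sqrt(n))

# Illustrative numbers: x̄ = 52, μ = 50, s = 4, n = 16
print(t_score(sample_mean=52.0, pop_mean=50.0, s=4.0, n=16))  # 2.0
```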
12. Define the mean of all sample means.
The mean of the sampling distribution of the mean always equals the mean of the
population
13. Define the standard error of the mean.
The standard error of the mean equals the standard deviation of the population divided by the square root of the sample size: σx̄ = σ/√n.
14. What is the special type of standard deviation?
You might find it helpful to think of the standard error of the mean as a rough measure of
the average amount by which sample means deviate from the mean of the sampling
distribution or from the population mean
15. What is Hypothesis Testing?
Hypothesis testing is a form of statistical inference that uses data from a sample to draw
conclusions about a population parameter or a population probability distribution. First,
a tentative assumption is made about the parameter or distribution. This assumption is
called the null hypothesis and is denoted by H0.
16. Define Hypothesized Sampling Distribution
When you perform a hypothesis test of a single population mean μ using a normal distribution (often called a z-test), you take a simple random sample from the population. The binomial distribution of a sample (estimated) proportion can then be approximated by the normal distribution with μ = p and σ = √(pq/n).
17. Define Decision Rule
A decision rule specifies precisely when H0 should be rejected (because the observed z
qualifies as a rare outcome). There are many possible decision rules, as will be seen in
Section 11.3. A very common one, already introduced in Figure 10.3, specifies that H0
should be rejected if the observed z equals or is more positive than 1.96 or if the observed
z equals or is more negative than –1.96. Conversely, H0 should be retained if the observed
z falls between ± 1.96.
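The common decision rule described above can be sketched as a small Python function (the ±1.96 critical value corresponds to a two-tailed test at the .05 level):

```python
def decision(z, critical=1.96):
    """Two-tailed decision rule: reject H0 when |z| >= critical, else retain it."""
    return "reject H0" if abs(z) >= critical else "retain H0"

print(decision(2.30))   # beyond +1.96, a rare outcome
print(decision(-1.50))  # between -1.96 and +1.96, a common outcome
```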
18. Define Null Hypothesis
The null hypothesis is the standard statistical assumption that no statistically
significant relationship exists between two sets of observed data or measured
phenomena.
19. What is the Level of Significance?
The level of significance is the total area of the sampling distribution identified with
rare outcomes. This proportion is symbolized by the Greek letter α (alpha). In the
present example, the level of significance, α, equals .05.
20. Define One-Tailed And Two-Tailed Tests
A one-tailed test is a statistical test in which the critical area of a distribution is one-sided
so that it is either greater than or less than a certain value, but not both. If the sample
being tested falls into the one-sided critical area, the alternative hypothesis will be
accepted instead of the null hypothesis.
In statistics, a two-tailed test is a method in which the critical area of a distribution is two-
sided and tests whether a sample is greater or less than a range of values. It is used in
null-hypothesis testing and testing for statistical significance.
21. State Addition Rule and Multiplication Rule
The addition rule states that you add together the separate probabilities of several
mutually exclusive events to find the probability that any one of these events will occur.
The multiplication rule states that you multiply together the separate probabilities of
several independent events to find the probability that these events will occur together.
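The two rules can be illustrated with a short Python sketch; the card-deck and coin-toss probabilities below are standard textbook examples, not taken from this document:

```python
# Addition rule (mutually exclusive events): P(A or B) = P(A) + P(B)
p_king, p_queen = 4 / 52, 4 / 52            # drawing a king or a queen from one card
p_king_or_queen = p_king + p_queen          # = 8/52, about 0.154

# Multiplication rule (independent events): P(A and B) = P(A) * P(B)
p_two_heads = 0.5 * 0.5                     # heads on two independent coin tosses

print(round(p_king_or_queen, 3), p_two_heads)  # 0.154 0.25
```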

22. Imagine a very simple population consisting of only four observations: 2, 4, 6, 8.
List all possible samples of size two.
For a sample size of 2, sampling with replacement, the 16 possible samples are:
(2,2), (2,4), (2,6), (2,8), (4,2), (4,4), (4,6), (4,8),
(6,2), (6,4), (6,6), (6,8), (8,2), (8,4), (8,6), (8,8)
23. What is a One-Tailed Test (Lower Tail Critical)?
Now let’s assume that the research hypothesis for the investigation of SAT math scores
was based on complaints from instructors about the poor preparation of local freshmen.
Assume also that if the investigation supports these complaints, a remedial program will
be instituted. Under these circumstances, the investigator might prefer a hypothesis test
that is specially designed to detect only whether the population mean math score for all
local freshmen is less than the national average. This alternative hypothesis reads:
H1: μ < 500
24. What are four possible outcomes for any hypothesis test?
• If H0 really is true, it is a correct decision to retain the true H0.
• If H0 really is true, it is a type I error to reject the true H0.
• If H0 really is false, it is a type II error to retain the false H0.
• If H0 really is false, it is a correct decision to reject the false H0.
25. Define Point Estimate
A point estimate for μ uses a single value to represent the unknown population mean
26. What is meant by a confidence interval (CI) for μ?
A confidence interval for μ uses a range of values that, with a known degree of certainty,
includes the unknown population mean.
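As an illustration, a 95 percent confidence interval can be computed in Python when σ is known; the numbers below come from the reading-achievement exercise in Part B (x̄ = 3.82, σ = 0.4, n = 64):

```python
import math

def ci_mean(xbar, sigma, n, z=1.96):
    """Confidence interval for mu when sigma is known: xbar +/- z * sigma / sqrt(n)."""
    se = sigma / math.sqrt(n)          # standard error of the mean
    return (xbar - z * se, xbar + z * se)

low, high = ci_mean(3.82, 0.4, 64)     # reading-achievement numbers from Part B
print(round(low, 3), round(high, 3))   # 3.722 3.918
```

Because the whole interval lies below 4.0, the sample would be consistent evidence of underachievement in that exercise.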
27. What do you mean by Hypothesis? Name at least 4 of its types.
Hypothesis is a statement about the nature of a population. It is often stated in terms of a
population parameter. Hypothesis testing is a form of statistical inference that uses data
from a sample to draw conclusions about a population parameter or a population
probability distribution. Some types of hypothesis statements are
• Directional Hypothesis,
• Non-Directional Hypothesis
• Null hypothesis,
• Alternative hypothesis,
• Associative Hypothesis
28. State Central Limit Theorem (April/May 2023)
Central Limit Theorem states that regardless of the population shape, the shape of the
sampling distribution of the mean approximates a normal curve if the sample size is
sufficiently large.
According to this theorem, it doesn’t matter whether the shape of the parent population
is normal, positively skewed or negatively skewed, as long as the sample size is sufficiently
large.
If the shape of the parent population is normal, then any sample size will be sufficiently
large. Otherwise, depending on the degree of non-normality in the parent population, a
sample size between 25 and 100 is sufficiently large
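A small simulation sketches the theorem: even for a strongly skewed (exponential) parent population, the sample means cluster around the population mean with a spread near σ/√n. The population mean, sample size, and number of trials below are arbitrary choices for illustration.

```python
import random
from statistics import mean, pstdev

random.seed(42)
mu, n, trials = 10.0, 50, 2000         # illustration values only

# Draw many samples from a right-skewed exponential population with mean mu
sample_means = [
    mean(random.expovariate(1 / mu) for _ in range(n))
    for _ in range(trials)
]

# Mean of the sample means is close to mu; their spread is close to
# sigma / sqrt(n) = 10 / sqrt(50), about 1.41
print(round(mean(sample_means), 1), round(pstdev(sample_means), 1))
```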
29. Indicate whether the following statements are True or False with proper
justification.
The mean of all sample means,
(a) always equals the value of a particular sample mean.(b) equals 100 if, in fact,
the population mean equals 100.(c) usually equals the value of a particular sample
mean.(d) is interchangeable with the population mean
• (a) FALSE. The mean of all sample means need not equal any particular sample mean.
• (b) TRUE. The mean of all sample means equals the population mean.
• (c) FALSE. Individual sample means vary around the mean of all sample means.
• (d) TRUE. The mean of all sample means is interchangeable with the population mean.
30. Indicate what’s wrong with each of the following statistical hypothesis:
(a) A null hypothesis and its alternative hypothesis cannot have different anchor-point
values. In the given scenario, the two hypotheses fail to cover any values between 155
and 160 exclusively.
(b) Any hypothesis statement must refer to a population parameter, but the sample
mean X̄ is referred to in the given scenario.
31. Define Effect of Sample Size
The larger the sample size, the smaller the standard error and, hence, the more precise
(narrower) the confidence interval will be. Indeed, as the sample size grows larger, the
standard error will approach zero and the confidence interval will shrink to a point
estimate. Given this perspective, the sample size for a confidence interval, unlike that for
a hypothesis test, never can be too large.
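This effect is easy to see numerically. The sketch below uses the Wechsler IQ example's σ = 15 and shows that quadrupling the sample size halves the interval's half-width:

```python
import math

sigma, z = 15, 1.96            # sigma from the Wechsler IQ example; 95% critical z
for n in (25, 100, 400):
    half_width = z * sigma / math.sqrt(n)
    print(n, round(half_width, 2))   # 5.88, then 2.94, then 1.47
```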
Part B
1. Explain population and sample. What are the differences between them?
2. Describe random sampling
3. Explain sampling distribution and types
4. Describe null hypothesis test in detail
5. Explain in detail hypothesis testing and examples
6. Does the mean SAT math score for all local freshmen differ from the national average
of 500? (z test for a population mean)
7. Explain one-tailed and two-tailed tests.
8. Define estimation. Explain in detail about point estimation.
9. Discuss about the following with suitable example:
i. Random Sampling vs Random Assignments
ii. Independent vs Dependent Events
iii. Independent vs Mutually Exclusive Events
iv. Conditional Probability
v. Sampling Distribution of the Mean
10. Imagine a very simple population consisting of only four observations: 2, 3, 4, 5
i. Explain the process of constructing relative frequency table showing
the sampling distribution of the mean.
ii. Construct a relative frequency table showing the sampling distribution of the
mean for the above observations.
11. Define Hypothesis. Discuss in detail about at least 5 types of hypothesis statement
with an example.
12. Calculate the value of the z test for each of the following situations. Also, given
critical z score of +/- 1.96, calculate the critical confidence level.
i. X = 12; σ = 9; n = 25; µhyp = 15
ii. X = 3600; σ = 4000; n = 100; µhyp = 3500
iii. X = 0.25; σ = 0.10; n = 36; µhyp = 0.22
13. Reading achievement scores are obtained for a group of fourth graders. A score
of 4.0 indicates a level of achievement appropriate for fourth grade, a score below 4.0
indicates underachievement, and a score above 4.0 indicates overachievement. Assume
that the population standard deviation equals 0.4. A random sample of 64 fourth graders
reveals a mean achievement score of 3.82. Construct a 95% confidence interval for the
unknown population mean. (Remember to convert the standard deviation to a standard
error). Interpret this confidence interval; that is, do you find any consistent evidence
either of overachievement or of underachievement?
14. Illustrate in detail about estimation method and confidence interval.
15. For the population at large, the Wechsler Adult Intelligence Scale is designed to
yield a normal distribution of test score with a mean of 100 and a standard deviation of
15. School district officials wonder whether, on the average, an IQ score different from
100 describes the intellectual aptitudes of all students in their district. Wechsler IQ scores
are obtained for random sample of 25 of their students, and the mean IQ is found to equal
105. Using the step-by-step procedure, test the null hypothesis at the .05 level of
significance.
16. Imagine a simple population consisting of only 5 observations: 2 4 6 8 10. List all
possible sample of size two. Construct relative frequency table showing the sample
distribution of the mean.
17. According to the American Psychological Association, members with a doctorate
and a full-time teaching appointment earn, on the average, $82,500 per year, with a
standard deviation of $6,000. An investigator wishes to determine whether $82,500 is
also the mean salary for all female members with a doctorate and a full-time teaching
appointment. Salaries are obtained for a random sample of 100 women from this
population, and the mean salary equals $80,100.
i. Someone claims that the observed difference between $80,100 and $82,500 is
large enough by itself to support the conclusion that female members earn less than male
members. Explain why it is important to conduct a hypothesis test.
ii. The investigator wishes to conduct a hypothesis test for what population?
iii. What is the null hypothesis, H0?
iv. What is the alternative hypothesis, H1?
v. Specify the decision rule, using the .05 level of significance.
vi. Calculate the value of z. (Remember to convert the standard deviation to a
standard error.)
vii. What is your decision about H0?
viii. Using words, interpret this decision in terms of the original problem.
18. According to the California Educational Code
(http://www.cde.ca.gov/ls/fa/sf/peguidemidhi.asp), students in grades 7 through 12
should receive 400 minutes of physical education every 10 school days. A random sample
of 48 students has a mean of 385 minutes and a standard deviation of 53 minutes. Test
the hypothesis at the .05 level of significance that the sampled population satisfies the
requirement.
19. According to a 2009 survey based on the United States census
(http://www.census. gov/prod/2011pubs/acs-15.pdf), the daily one-way commute time
of U.S. workers averages 25 minutes with, we’ll assume, a standard deviation of 13
minutes. An investigator wishes to determine whether the national average describes the
mean commute time for all workers in the Chicago area. Commute times are obtained for
a random sample of 169 workers from this area, and the mean time is found to be 22.5
minutes. Test the null hypothesis at the .05 level of significance.
20. Each of the following statements could represent the point of departure for a
hypothesis test. Given only the information in each statement, would you use a two- tailed
(or nondirectional) test, a one-tailed (or directional) test with the lower tail critical, or a
one-tailed (or directional) test with the upper tail critical? Indicate your decision by
specifying the appropriate H0 and H1. Furthermore, whenever you conclude that the test
is one-tailed, indicate the precise word (or words) in the statement that justifies the one-
tailed test.
i. An investigator wishes to determine whether, for a sample of drug addicts, the
mean score on the depression scale of a personality test differs from a score of 60, which,
according to the test documentation, represents the mean score for the general
population.
ii. To increase rainfall, extensive cloud-seeding experiments are to be conducted, and
the results are to be compared with a baseline figure of 0.54 inch of rainfall (for
comparable periods when cloud seeding was not done).
iii. Public health statistics indicate, we will assume, that American males gain an
average of 23 lbs during the 20-year period after age 40. An ambitious weight- reduction
program, spanning 20 years, is being tested with a sample of 40-year-old men.
iv. When untreated during their lifetimes, cancer-susceptible mice have an average
life span of 134 days. To determine the effects of a potentially life-prolonging (and cancer-
retarding) drug, the average life span is determined for a group of mice that receives this
drug.
21. For each of the following situations, indicate whether H0 should be retained or
rejected. Given a one-tailed test, lower tail critical with α = .01, and
(a) z = – 2.34 (b) z = – 5.13 (c) z = 4.04
Given a one-tailed test, upper tail critical with α = .05, and
(d) z = 2.00 (e) z = – 1.80 (f) z = 1.61
22. Specify the decision rule for each of the following situations (referring to Table
11.1 to find critical z values):
24. For each of the following situations, indicate whether H0 should be retained or
rejected. Given a one-tailed test, lower tail critical with α = .01, and
(a) z = – 2.34 (b) z = – 5.13 (c) z = 4.04
Given a one-tailed test, upper tail critical with α = .05, and
(d) z = 2.00 (e) z = – 1.80 (f) z = 1.61
25. Reading achievement scores are obtained for a group of fourth graders. A score of
4.0 indicates a level of achievement appropriate for fourth grade, a score below 4.0
indicates underachievement, and a score above 4.0 indicates overachievement. Assume
that the population standard deviation equals 0.4. A random sample of 64 fourth graders
reveals a mean achievement score of 3.82.
i. Construct a 95 percent confidence interval for the unknown population mean.
(Remember to convert the standard deviation to a standard error.)
ii. Interpret this confidence interval; that is, do you find any consistent evidence
either of overachievement or of underachievement?
UNIT IV- ANALYSIS OF VARIANCE
Part A
1. Define T-Test?
A t-test is a statistical method for comparing the means of two groups drawn from
normally distributed populations.
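As a hedged sketch of the two-group comparison, the pooled-variance form of the t ratio for two independent samples can be computed with the Python standard library; the scores below are made up for illustration:

```python
import math
from statistics import mean, variance

def two_sample_t(x, y):
    """Pooled-variance t ratio for two independent samples (equal-variance form)."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

# Made-up scores for two independent groups
print(round(two_sample_t([16, 15, 14, 15, 14, 15], [14, 14, 13, 12, 14, 13]), 2))  # 3.31
```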
2. Define F-Test?
An F-test is any statistical test in which the test statistic has an F-distribution under the
null hypothesis. It is most often used when comparing statistical models that have been
fitted to a data set, in order to identify the model that best fits the population from which
the data were sampled.
3. What is analysis of variance?
Analysis of variance is a collection of statistical models and their associated estimation
procedures used to analyze the differences among means. ANOVA was developed by the
statistician Ronald Fisher
4. Define effect size estimation
• Effect size estimates provide important information about the impact of a treatment
on the outcome of interest or on the association between variables.
• Effect size estimates provide a common metric to compare the direction and strength
of the relationship between variables across studies.
5. What is meant by multiple comparisons, multiplicity or multiple testing?
The multiple comparisons, multiplicity or multiple testing problem occurs when one
considers a set of statistical inferences simultaneously or infers a subset of parameters
selected based on the observed values.
6. Define ANOVA.
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed
aggregate variability found inside a data set into two parts: systematic factors and random
factors. The systematic factors have a statistical influence on the given data set, while the
random factors do not. Analysts use the ANOVA test to determine the influence that
independent variables have on the dependent variable in a regression study.
7. Write the formula for calculating the F-score value.
F = MSbetween / MSwithin = (variance between groups) / (variance within groups)
8. Compare one-way vs two-way ANOVA.
There are two main types of analysis of variance: one-way (or unidirectional) and
two-way (bidirectional). One-way or two-way refers to the number of independent variables
in your analysis of variance test. A one-way ANOVA evaluates the impact of a sole factor
on a sole response variable. It determines whether the observed differences between the
means of independent (unrelated) groups are explainable by chance alone, or whether
there are any statistically significant differences between groups.
A two-way ANOVA is an extension of the one-way ANOVA. With a one-way, you have one
independent variable affecting a dependent variable. With a two-way ANOVA, there are
two independent variables. For example, a two-way ANOVA allows a company to compare worker
productivity based on two independent variables, such as department and gender. It is
utilized to observe the interaction between the two factors. It tests the effect of two factors
at the same time.
A three-way ANOVA, also known as three-factor ANOVA, is a statistical means of
determining the effect of three factors on an outcome
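The one-way case described above can be sketched by hand with the Python standard library; the three groups of scores below are made up for illustration:

```python
from statistics import mean

def one_way_f(*groups):
    """One-way ANOVA F ratio: F = MS_between / MS_within."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Made-up scores for three treatment groups
print(one_way_f([3, 5, 4], [6, 8, 7], [9, 11, 10]))  # 27.0
```

A large F like this (far above 1) signals that the between-group variability dwarfs the within-group variability.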
9. What do you mean by two-factor factorial design?
A two-factor factorial design is an experimental design in which data is collected for all
possible combinations of the levels of the two factors of interest. If equal sample sizes are
taken for each of the possible factor combinations then the design is a balanced two-factor
factorial design.
10. Define statistical test in F-test
An F-test is any statistical test in which the test statistic has an F-distribution under the
null hypothesis. It is most often used when comparing statistical models that have been
fitted to a data set, in order to identify the model that best fits the population from which
the data were sampled.
11. What are the two- way analyses of variance?
The two-way analysis of variance is an extension of the one-way ANOVA that examines
the influence of two different categorical independent variables on one continuous
dependent variable.
12. What are the types of ANOVA?
There are two main types of ANOVA: one-way (or unidirectional) and two-way. There are
also variations of ANOVA. For example, MANOVA (multivariate ANOVA) differs from ANOVA as
the former tests for multiple dependent variables simultaneously while the latter
assesses only one dependent variable at a time.
13. Define chi-square test.
The Chi-Square test is a statistical procedure used by researchers to examine the
differences between categorical variables in the same population. For example, imagine
that a research group is interested in whether or not education level and marital status
are related for all people in the U.S.
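The chi-square statistic can be computed by hand from a contingency table. The 2×2 counts below are hypothetical, standing in for the education-level-by-marital-status example:

```python
def chi_square(observed):
    """Chi-square = sum of (O - E)^2 / E, with E taken from row/column totals."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand
            chi2 += (o - e) ** 2 / e
    return chi2

# Hypothetical 2x2 table: education level (rows) by marital status (columns)
print(chi_square([[30, 20], [20, 30]]))  # 4.0
```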
14. What Does the Analysis of Variance Reveal?
The ANOVA test is the initial step in analyzing factors that affect a given data set. Once the
test is finished, an analyst performs additional testing on the methodical factors that
measurably contribute to the data set's inconsistency. The analyst utilizes the ANOVA test
results in an F-test to generate additional data that aligns with the proposed regression
models.
The ANOVA test allows a comparison of more than two groups at the same time to
determine whether a relationship exists between them. The result of the ANOVA formula,
the F statistic (also called the F-ratio), allows for the analysis of multiple groups of data to
determine the variability between samples and within samples.
If no real difference exists between the tested groups, which is called the null hypothesis,
the result of the ANOVA's F-ratio statistic will be close to 1. The distribution of all possible
values of the F statistic is the F-distribution. This is actually a group of distribution
functions, with two characteristic numbers, called the numerator degrees of freedom and
the denominator degrees of freedom.
15. How to Use ANOVA?
A researcher might, for example, test students from multiple colleges to see if students
from one of the colleges consistently outperform students from the other colleges. In a
business application, an R&D researcher might test two different processes of creating a
product to see if one process is better than the other in terms of cost efficiency.
The type of ANOVA test used depends on a number of factors. It is applied when data
needs to be experimental. Analysis of variance is employed if there is no access to
statistical software resulting in computing ANOVA by hand. It is simple to use and best
suited for small samples. With many experimental designs, the sample sizes have to be
the same for the various factor level combinations.
ANOVA is helpful for testing three or more variables. It is similar to multiple two-sample
t-tests. However, it results in fewer type I errors and is appropriate for a range of issues.
ANOVA groups differences by comparing the means of each group and includes spreading
out the variance into diverse sources. It is employed with subjects, test groups, between
groups and within groups.
16. What is the Analysis of Variance in Other Applications?
In addition to its applications in the finance industry, ANOVA is also used in a wide variety
of contexts and applications to test hypotheses in reviewing clinical trial data. For
example, to compare the effects of different treatment protocols on patient outcomes; in
social science research (for instance to assess the effects of gender and class on specified
variables), in software engineering (for instance to evaluate database management
systems), in manufacturing (to assess product and process quality metrics), and
industrial design among other fields.
17. What is a Test?
In technical analysis and trading, a test is when a stock’s price approaches an established
support or resistance level set by the market. If the stock stays within the support and
resistance levels, the test passes. However, if the stock price reaches new lows and/or
new highs, the test fails. In other words, for technical analysis, price levels are tested to
see if patterns or signals are accurate.
A test may also refer to one or more statistical techniques used to evaluate differences or
similarities between estimated values from models or variables found in data. Examples
include the t-test and z-test
18. Define Range-Bound Market Test.
When a stock is range-bound, price frequently tests the trading range’s upper and lower
boundaries. If traders are using a strategy that buys support and sells resistance, they
should wait for several tests of these boundaries to confirm price respects them before
entering a trade.
Once in a position, traders should place a stop-loss order in case the next test of support
or resistance fails.
19. What is the Trending Market Test?
In an up-trending market, previous resistance becomes support, while in a down-
trending market, past support becomes resistance. Once price breaks out to a new high
or low, it often retraces to test these levels before resuming in the direction of the trend.
Momentum traders can use the test of a previous swing high or swing low to enter a
position at a more favorable price than if they would have chased the initial breakout. A
stop-loss order should be placed directly below the test area to close the trade if the trend
unexpectedly reverses.
20. Define Statistical Tests.
Inferential statistics uses the properties of data to test hypotheses and draw conclusions.
Hypothesis testing allows one to test an idea using a data sample with regard to a
population parameter. The methodology employed by the analyst depends on the nature
of the data used and the reason for the analysis. In particular, one seeks to reject the null
hypothesis, or the notion that one or more random variables have no effect on another. If
this can be rejected, the variables are likely to be associated with one another
21. What is Alpha Risk?
Alpha risk is the risk that in a statistical test a null hypothesis will be rejected when it is
actually true. This is also known as a type I error, or a false positive. The term "risk" refers
to the chance or likelihood of making an incorrect decision. The primary determinant of
the amount of alpha risk is the sample size used for the test. Specifically, the larger the
sample tested, the lower the alpha risk becomes. Alpha risk can be contrasted with beta
risk, or the risk of committing a type II error (i.e., a false negative).
22. What is Range-Bound Trading?
Range-bound trading is a trading strategy that seeks to identify and capitalize on
securities, like stocks, trading in price channels. After finding major support and
resistance levels and connecting them with horizontal trend lines, a trader can buy a
security at the lower trend line support (bottom of the channel) and sell it at the upper
trend line resistance (top of the channel).
23. What is a One-Tailed Test?
A one-tailed test is a statistical test in which the critical area of a distribution is one-sided
so that it is either greater than or less than a certain value, but not both. If the sample
being tested falls into the one-sided critical area, the alternative hypothesis will be
accepted instead of the null hypothesis.
24. Give the four Possible Outcomes of the Vitamin C Experiment and also do
hypothesis testing
If vitamin C has an effect on IQ scores, it makes sense to estimate, with a 95 percent
confidence interval, that the interval between 102 and 112 describes the possible size of
that effect, namely, an increase (above 100) of between 2 and 12 IQ points.
25. Distinguish between dependent variables and explanatory variables
• A dependent variable is a variable whose value depends on another variable; an
explanatory (independent) variable is a variable whose value does not depend on
another variable, only on the researcher.
• The dependent variable is the presumed effect; the independent variable is the
presumed cause.
• A change in the dependent variable does not affect the independent variable, whereas
any change in the independent variable also affects the dependent variable.
• Dependent variables are often referred to as predicted variables; independent
variables are the predictors or regressors.
• Dependent variables are obtained from longitudinal research or by solving complex
mathematical equations; independent variables are easily obtainable and need no
complex mathematical procedures or observations.
• Dependent variables cannot be manipulated by the researcher or any other external
factor; independent variables can be manipulated by the researcher, which can
introduce bias and affect the results of the research.
• On a graph, the dependent variable is positioned vertically (y-axis) and the
independent variable horizontally (x-axis).
26. What is the significance of the p-value in hypothesis testing? (APRIL/MAY 2023)
The p-value is a number, calculated from a statistical test, that describes how likely you
are to have found a particular set of observations if the null hypothesis were true.
P-values are used in hypothesis testing to help decide whether to reject the null
hypothesis.
27. Comparison between t-test and ANOVA. (APRIL/MAY 2023)
The t-test is a method that determines whether two populations are statistically different
from each other, whereas ANOVA determines whether three or more populations are
statistically different from each other.
28. Compare the various test statistics (z-score, t-statistic, F-statistic, chi-squared)
with their associated tests.
• z-score: used in the z-test, which compares a sample mean with a population mean
when the population standard deviation is known and the sample is large.
• t-statistic: used in the t-test, which compares means when the population standard
deviation is unknown or the sample is small.
• F-statistic: used in the F-test and ANOVA, which compare variances across two or
more groups.
• Chi-squared statistic: used in the chi-square test, which examines relationships
between categorical variables.
Part B
1. A library system lends books for periods of 21 days. This policy is being
reevaluated in view of a possible new loan period that could be either longer or shorter
than 21 days. To aid in making this decision, book-lending records were consulted to
determine the loan period actually used by the patrons. A random sample of 8 records
revealed the following loan periods in days: 21,15,12,24,20,21,13 and 16. Test the null
hypothesis with t-test, using the .05 level of significance. (APRIL/MAY 2023)
2. A consumers’ group randomly samples 10 “one-pound” packages of ground wheat
sold by a supermarket. Calculate the mean and the estimated standard error of the mean
for this sample, given the following weights in ounces: 16, 15, 14, 15, 14, 15, 16, 14, 14, 14
3. Illustrate in detail about one factor ANOVA with example. (APRIL/MAY 2023)
4. A random sample of 90 college students indicates whether they most desire love,
wealth, power, health, fame, or family happiness.
i. Using the .05 level of significance and the following results, test the null hypothesis
that, in the underlying population, the various desires are equally popular.
ii. Specify the approximate p-value for this test result. (APRIL/MAY 2023)

5. Estimate the calculations for the t test for gas mileage investigation. Showcase the
hypothesis analysis, t ratio calculation with three panels along with confidence interval
6. Estimate the calculations for the t test using two independent samples for EPO
experiment. Showcase the hypothesis analysis, sampling distribution, t ratio calculation
with three panels, p value estimation along with confidence interval
7. State the use of counterbalancing and explain the EPO experiment with repeated
measures. Give the detailed table of summary of t tests for population MEANS for one
sample, two independent samples and two related samples
8. Suggest the hypothesis test summary for t test for a population correlation
coefficient for the case study on Greeting Card Exchange
9. Suggest the hypothesis test summary using One-Factor F Test for Sleep
Deprivation Experiment and also the variance estimates, mean squares, sum of squares
with degree of freedom
10. Blood pressures of 8 patients are recorded before and after treatment:
Before: 180, 200, 230, 240, 170, 190, 200 and 165
After: 140, 145, 150, 155, 120, 130, 140 and 130.
Find whether there is any significant difference between the BP readings before and
after by applying a two-sample t-test.
11. Marks of students are 10.5, 9, 7, 12, 8.5, 7.5, 6.5, 8, 11 and 9.5. The mean population
score is 12 and the standard deviation is 1.80. Does the mean value for the students
differ significantly from the population mean?
12. Estimate the calculations for the t test for gas mileage investigation. Showcase the
hypothesis analysis, t ratio calculation with three panels along with confidence interval.
13. Odds ratios can be calculated for larger cross-classification tables, and one way of
doing this is by reconfiguring into a smaller 2 × 2 table. The 2 × 3 table for the lost letter
study, could be reconfigured into a 2 × 2 table if, for example, the investigator is primarily
interested in comparing return rates of lost letters only for campus and off-campus
locations (both suburbia and downtown), that is
(i) Given χ²(1, n = 200) = 7.42, p < .01, φ²c = .037 for these data, calculate and interpret
the odds ratio for a returned letter from campus.
(ii) Calculate and interpret the odds ratio for a returned letter from off-campus.
14. Estimate the calculation of Sum of Squares (Two-Factor ANOVA) with an example
15. Explain in detail about the chi-square test with an example.
UNIT V- PREDICTIVE ANALYTICS
Part A
1. What is Predictive Analytics?
The term predictive analytics refers to the use of statistics and modeling techniques to
make predictions about future outcomes and performance. Predictive analytics looks at
current and historical data patterns to determine if those patterns are likely to emerge
again. This allows businesses and investors to adjust where they use their resources to
take advantage of possible future events. Predictive analysis can also be used to improve
operational efficiencies and reduce risk
2. Define Predictive Analytics.
Predictive analytics is a form of technology that makes predictions about certain
unknowns in the future. It draws on a series of techniques to make these determinations,
including artificial intelligence (AI), data mining, machine learning, modeling, and
statistics. For instance, data mining involves the analysis of large sets of data to detect
patterns within it; text analysis does the same for large blocks of text
3. What are the areas of applications can predictive models be applied?
• Weather forecasts
• Creating video games
• Translating voice to text for mobile phone messaging
• Customer service
• Investment portfolio development
4. What is meant by Forecasting?
Forecasting is essential in manufacturing because it ensures the optimal utilization of
resources in a supply chain. Critical spokes of the supply chain wheel, whether it is
inventory management or the shop floor, require accurate forecasts for functioning.
Predictive modelling is often used to clean and optimize the quality of data used for such
forecasts. Modelling ensures that more data can be ingested by the system, including from
customer-facing operations, to ensure a more accurate forecast
5. Define Credit
Credit scoring makes extensive use of predictive analytics. When a consumer or business
applies for credit, data on the applicant's credit history and the credit record of borrowers
with similar characteristics are used to predict the risk that the applicant might fail to
perform on any credit extended.
6. Define Underwriting
Data and predictive analytics play an important role in underwriting. Insurance
companies examine policy applicants to determine the likelihood of having to pay out for
a future claim based on the current risk pool of similar policyholders, as well as past
events that have resulted in pay-outs. Predictive models that consider characteristics in
comparison to data about past policyholders and claims are routinely used by actuaries.
7. What is meant by Marketing?
Individuals who work in this field look at how consumers have reacted to the overall
economy when planning a new campaign. They can use these shifts in demographics
to determine if the current mix of products will entice consumers to make a purchase.
Active traders, meanwhile, look at a variety of metrics based on past events when deciding
whether to buy or sell a security. Moving averages, bands, and breakpoints are based on
historical data and are used to forecast future price movements
8. Compare Predictive Analytics vs. Machine Learning
A common misconception is that predictive analytics and machine learning are the same
things. Predictive analytics help us understand possible future occurrences by analyzing
the past. At its core, predictive analytics includes a series of statistical techniques
(including machine learning, predictive modelling, and data mining) and uses statistics
(both historical and current) to estimate, or predict, future outcomes
9. What is the Decision Trees?
If you want to understand what leads to someone's decisions, then you may find decision
trees useful. This type of model places data into different sections based on certain
variables, such as price or market capitalization. Just as the name implies, it looks like a
tree with individual branches and leaves. Branches indicate the choices available while
individual leaves represent a particular decision
Decision trees are the simplest models because they're easy to understand and dissect.
They're also very useful when you need to make a decision in a short period of time
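As an illustration only, a small decision tree can be sketched as nested branch tests in code. The features (price-to-earnings ratio, market capitalization) and thresholds below are hypothetical, chosen just to show how branches lead to leaves:

```python
# A toy decision tree for a buy/hold/avoid call on a stock.
# The features and thresholds are hypothetical illustration values.

def classify(price_to_earnings, market_cap_billions):
    """Walk the tree: each 'if' is a branch, each return is a leaf."""
    if price_to_earnings < 15:          # branch on valuation
        if market_cap_billions > 10:    # branch on company size
            return "buy"                # leaf
        return "hold"                   # leaf
    return "avoid"                      # leaf

print(classify(12, 50))   # cheap large-cap  -> buy
print(classify(12, 2))    # cheap small-cap  -> hold
print(classify(30, 50))   # expensive        -> avoid
```

In practice such trees are learned from data rather than hand-written, but the branch-and-leaf structure is the same.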
10. Define Regression
This is the model that is used the most in statistical analysis. Use it when you want to
determine patterns in large sets of data and when there's a linear relationship between
the inputs. This method works by figuring out a formula, which represents the
relationship between all the inputs found in the dataset. For example, you can use
regression to figure out how price and other key factors can shape the performance of a
security
11. Define Neural Networks
Neural networks were developed as a form of predictive analytics by imitating the way
the human brain works. This model can deal with complex data relationships using
artificial intelligence and pattern recognition. Use it if you have several hurdles that you
need to overcome like when you have too much data on hand, when you don't have the
formula you need to help you find a relationship between the inputs and outputs in your
dataset, or when you need to make predictions rather than come up with explanations
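The core idea, weights adjusted from prediction errors, can be sketched with a single artificial neuron (a perceptron). This is only a minimal illustration; real neural networks stack many such units with nonlinear activations:

```python
# Minimal single-neuron "network" (a perceptron) learning the OR function.
# Illustration only: real networks have many layers of such units.

def step(z):
    return 1 if z > 0 else 0

samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.1

for _ in range(10):                      # a few passes over the data
    for (x1, x2), target in samples:
        pred = step(w[0] * x1 + w[1] * x2 + b)
        err = target - pred              # learn from the mistake
        w[0] += lr * err * x1
        w[1] += lr * err * x2
        b += lr * err

for (x1, x2), target in samples:
    print((x1, x2), "->", step(w[0] * x1 + w[1] * x2 + b))
# after training, each prediction matches OR(x1, x2)
```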
12. What are the Benefits of Predictive Analytics?
There are numerous benefits to using predictive analysis. As mentioned above, using this
type of analysis can help entities when you need to make predictions about outcomes
when there are no other (and obvious) answers available
Investors, financial professionals, and business leaders are able to use models to help
reduce risk. For instance, an investor and their advisor can use certain models to help
craft an investment portfolio with minimal risk to the investor by taking certain factors
into consideration, such as age, capital, and goals.
There is a significant impact to cost reduction when models are used. Businesses can
determine the likelihood of success or failure of a product before it launches. Or they can
set aside capital for production improvements by using predictive techniques before the
manufacturing process begins
13. What are the criticisms of Predictive Analysis.
The use of predictive analytics has been criticized and, in some cases, legally restricted
due to perceived inequities in its outcomes. Most commonly, this involves predictive
models that result in statistical discrimination against racial or ethnic groups in areas
such as credit scoring, home lending, employment, or risk of criminal behaviour.
A famous example of this is the (now illegal) practice of redlining in home lending by
banks. Regardless of whether the predictions drawn from the use of such analytics are
accurate, their use is generally frowned upon, and data that explicitly include information
such as a person's race are now often excluded from predictive analytics
14. How Does Netflix Use Predictive Analytics?
Data collection is very important to a company like Netflix. It collects data from its
customers based on their behaviour and past viewing patterns, then uses that
information to make predictions and recommendations matched to their preferences.
This is the basis behind the "Because you watched..." lists you'll find on your subscription.
15. What is Data Analytics?
Data analytics is the science of analysing raw data to make conclusions about that
information. Many of the techniques and processes of data analytics have been automated
into mechanical processes and algorithms that work over raw data for human
consumption
16. Why do we need Goodness of Fit? (APRIL/MAY 2023)
Goodness-of-Fit is a statistical hypothesis test used to see how closely observed data
mirrors expected data. Goodness-of-Fit tests can help determine if a sample follows a
normal distribution, if categorical variables are related, or if random samples are from
the same distribution.
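A small sketch of the idea, using made-up counts for 100 rolls of a die: the statistic sums (O − E)²/E over categories and is compared against a chi-square table value.

```python
# Chi-square goodness-of-fit: do 100 die rolls fit a fair-die model?
# The observed counts below are made-up illustration data.

observed = [16, 18, 16, 14, 12, 24]
expected = [100 / 6] * 6                 # fair die: equal expected counts

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))              # 5.12

# With df = 6 - 1 = 5, the 5% critical value from a chi-square table
# is 11.07; since 5.12 < 11.07, the fair-die model is not rejected.
print(chi_square < 11.07)                # True
```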
17. What is survival analysis? (APRIL/MAY 2023)
Survival analysis is a collection of statistical procedures for data analysis where the
outcome variable of interest is time until an event occurs. Because of censoring (the
non-observation of the event of interest after a period of follow-up), a proportion of the
survival times of interest will often be unknown.
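As a minimal sketch, the Kaplan-Meier estimator handles censoring by stepping the survival curve down only at event times, using the number still at risk. The (time, event) pairs below are hypothetical follow-up times, with event = 0 marking a censored observation:

```python
# Kaplan-Meier sketch: estimate survival from hypothetical (time, event)
# pairs, where event=1 means the event occurred and event=0 means censored.

data = [(2, 1), (3, 1), (3, 0), (5, 1), (8, 0)]

at_risk = len(data)
survival = 1.0
for t in sorted({t for t, e in data}):
    deaths = sum(1 for time, e in data if time == t and e == 1)
    if deaths:
        survival *= 1 - deaths / at_risk   # step down at each event time
        print(f"t={t}: S(t)={survival:.3f}")
    at_risk -= sum(1 for time, e in data if time == t)  # leave the risk set
```

Censored subjects still count in the risk set up to their censoring time, which is how the estimator uses partial information.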
18. Specify the importance of exponentially weighted moving average
The Exponentially Weighted Moving Average (EWMA) is a statistic for monitoring the
process that averages the data in a way that gives less and less weight to data as they are
further removed in time. An EMA does serve to alleviate the negative impact of lags to
some extent. Because the EMA calculation places more weight on the latest data, it “hugs”
the price action a bit more tightly and reacts more quickly. This is desirable when an EMA
is used to derive a trading entry signal.
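The recursion behind an EWMA is simple: each smoothed value mixes the newest observation (weight alpha) with the previous smoothed value. A minimal sketch over hypothetical prices:

```python
# EWMA sketch: recent observations get weight alpha, and the weight on
# older values decays geometrically. Prices are hypothetical.

def ewma(values, alpha):
    smoothed = [values[0]]               # seed with the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

prices = [10, 11, 12, 11, 15]
print([round(s, 2) for s in ewma(prices, alpha=0.5)])
```

A larger alpha makes the average "hug" the latest prices more tightly, which is why it reacts faster than a simple moving average.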
19. What is Time series analysis?
Time series analysis is a specific way of analyzing a sequence of data points collected over
an interval of time. In time series analysis, analysts record data points at consistent
intervals over a set period of time rather than just recording the data points intermittently
or randomly.
20. State the use of auto correlation in time series analysis
Autocorrelation represents the degree of similarity between a given time series and a
lagged version of itself over successive time intervals. Autocorrelation measures the
relationship between a variable's current value and its past values.
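A minimal sketch of the lag-k autocorrelation calculation: the covariance between the series and a lagged copy of itself, divided by the series variance. A steadily rising illustration series gives a strongly positive lag-1 value:

```python
# Lag-k autocorrelation sketch: correlation of a series with a shifted
# copy of itself. Values near +1 at lag 1 indicate strong persistence.

def autocorrelation(series, lag):
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var

trend = [1, 2, 3, 4, 5, 6, 7, 8]         # a steadily rising series
print(round(autocorrelation(trend, 1), 3))   # 0.625
```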
21. State the difference between Exponentially Weighted Moving average and
Moving average in Time series analysis
A simple moving average (SMA) calculates the average price over a specific period, while
a weighted moving average (WMA) gives more weight to recent data. An exponential
moving average (EMA) is also weighted toward the most recent prices, but the decrease
in weight from one price to its preceding price is not constant; it is exponential.
22. What are the various steps of Data Analysis?
The process involved in data analysis involves several different steps:
• The first step is to determine the data requirements or how the data is grouped.
Data may be separated by age, demographic, income, or gender. Data values may be
numerical or be divided by category.
• The second step in data analytics is the process of collecting it. This can be done
through a variety of sources such as computers, online sources, cameras, environmental
sources, or through personnel.
• Once the data is collected, it must be organized so it can be analyzed. This may take
place on a spreadsheet or other form of software that can take statistical data.
•	The data is then cleaned up before analysis. This means it is scrubbed and checked to
ensure there is no duplication or error, and that it is not incomplete. This step helps
correct any errors before it goes on to a data analyst to be analyzed
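The cleaning step above can be sketched in a few lines: drop exact duplicates and discard incomplete records before analysis. The records below are hypothetical:

```python
# Sketch of the cleaning step: deduplicate records and drop incomplete
# ones before analysis. The records are hypothetical illustration data.

raw = [
    {"age": 34, "income": 52000},
    {"age": 34, "income": 52000},        # exact duplicate
    {"age": 29, "income": None},         # incomplete record
    {"age": 41, "income": 61000},
]

seen, clean = set(), []
for record in raw:
    key = tuple(sorted(record.items()))
    if key in seen:                       # drop duplicates
        continue
    seen.add(key)
    if any(v is None for v in record.values()):
        continue                          # drop incomplete rows
    clean.append(record)

print(clean)   # two usable records remain
```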
Part B
1. How do you solve the least squares problem in Python? What is the least squares
method in Python?
2. What is the goodness-of-fit test?
3. One study indicates that the number of televisions that American families have is
distributed (this is the given distribution for the American population) as in the table.

Number of Televisions    Percent
0                        10
1                        16
2                        55
3                        11
4+                       8

The table contains expected (E) percents.

A random sample of 600 families in the far western United States resulted in the data in
this table.

Number of Televisions    Frequency
0                        66
1                        119
2                        340
3                        60
4+                       15
Total                    600

The table contains observed (O) frequency values.

At the 1% significance level, does it appear that the distribution "number of televisions"
of far western United States families is different from the distribution for the American
population as a whole?
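A sketch of the computation for this problem: convert the expected percents to expected counts for n = 600, then form the chi-square statistic and compare it with the tabled 1% critical value for df = 4.

```python
# Chi-square computation for the televisions problem above.

percents = [10, 16, 55, 11, 8]           # expected distribution (%)
observed = [66, 119, 340, 60, 15]        # far-western sample, n = 600

expected = [p / 100 * 600 for p in percents]   # 60, 96, 330, 66, 48
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))              # 29.65

# df = 5 - 1 = 4; the 1% critical value from a chi-square table is
# 13.277, so the statistic exceeds it: the distributions differ.
print(chi_square > 13.277)               # True
```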
4. Explain in detail about time series analysis with example.
5. Describe Regression using Stats Models
6. Explain multiple regression with an example
7. What are nonlinear relationships and their types? Differentiate between linear and
non-linear relationships
8. Describe logistic regression in detail
9. Explain in detail serial correlation and auto correlation
10. Describe in detail Introduction to survival analysis
11. Consider an example: Sam recorded how many hours of sunshine there were versus
how many ice creams were sold at the shop from Monday to Friday, as given in the
following table.

Hours of Sunshine    Ice Creams Sold
3                    5
5                    7
7                    10
9                    15

Explain the concept of least squares regression to find the line of best fit for the above
data. Sam would like to find how many ice creams would be sold if he hears a weather
forecast which says "We expect 8 hours of sun tomorrow", using a linear regression
model
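The least squares fit for this problem can be worked out directly from the normal-equation formulas for the slope and intercept:

```python
# Worked sketch of least squares for Sam's data: fit y = a + b*x and
# predict ice cream sales for 8 hours of sunshine.

xs = [3, 5, 7, 9]                        # hours of sunshine
ys = [5, 7, 10, 15]                      # ice creams sold

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
a = (sum_y - b * sum_x) / n                                    # intercept

print(f"y = {a:.2f} + {b:.2f}x")         # y = -0.65 + 1.65x
print(round(a + b * 8, 2))               # 12.55 -> about 12-13 ice creams
```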
12. Describe in detail about logistic regression model in predictive analysis.
13. Exemplify in detail about multiple regression models with example
14. Explain in depth about time series analysis and its techniques with relevant
examples
15. Explain the multiple linear regression model with the prediction of sales through
various attributes like budget for TV advertisement, Radio advertisement and
Newspaper advertisement using a statistical model.
16. How do you test a linear model? Explain in detail the role of weighted resampling
in linear model testing
17. Explain linear least square predictive analysis with an example.
18. Explain in detail about time series analysis (TSA) with an example