
Ad3491 QB

The document outlines a comprehensive curriculum on data science, covering topics such as data characteristics, data cleaning, outlier detection, and the differences between data science and big data. It includes detailed questions and exercises across multiple units, focusing on statistical methods, hypothesis testing, sampling distributions, and measures of central tendency. Additionally, it emphasizes practical applications and case studies related to data science methodologies.

Uploaded by

Vetri Vijayan

UNIT-1

1. What are the characteristics of quality data?


2. What do you mean by Data Science? Or Define Data Science.
3. List out at least five applications of data science.
4. Write a short note on outlier detection and state its real-time applications.
5. What contents should be included in a project charter?
6. Define Data Cleaning.
7. What is brushing and linking in exploratory data analysis? (April/May 2023)
8. How does confusion matrix define the performance of classification algorithm?
(April/May 2023)
9. Specify the facets of data with an example for each, and explain how benchmarking tools
and scheduling tools support the data science process.
10. Outline the data cleansing techniques, describe the types of errors, and give a solution
for each.
11. Define outlier and show the distribution with an example. How does it differ from a
sanity check?
12. Define Big Data.
13. Compare Data Science vs Big Data
14. List out the characteristics of big data.
15. Give the various challenges of big data.
16. State the need for data science.
17. What are the advantages/benefits of data science?
18. Give an overview of the techniques that handle missing data and mention the pros and
cons of each.

PART-B

1. Discuss in detail the step-by-step process in data science with a neat diagram.
2. Discuss briefly: (Analyze)
i. Life cycle of data science
ii. Machine learning in data science
3. Exemplify in detail the different facets of data with examples. (Analyze) (April/May 2023)
4. Sketch and outline the step-by-step activities in the data science process. (Remember) (April/May 2023)
5. Explain in detail cleansing, integrating, and transforming data with examples. (Analyze) (April/May 2023)
6. Discuss the execution of a linear prediction model on semi-random data and give the Python code for it, with model diagnostics and comparison. (Analyze)
7. Give a detailed view of the methodologies for transforming data, with examples. (Understand)
8. Discuss in detail the characteristics of data, its benefits, and its applications. (Understand)
9. Discuss the execution of a K-nearest neighbour model with a confusion matrix on semi-random data and give the Python code for it, with model diagnostics and comparison. (Analyze)
10. Give a detailed case study of building a recommender system inside a database, with all the required steps for a data science model. (Analyze)
11. Give a detailed case study of predicting malicious URLs from a set of URL data, with all the required steps of the data science process. (Analyze)
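
Several questions in this unit ask how a confusion matrix describes the performance of a classification algorithm. The following is a minimal sketch of that idea, not a model answer: the label lists are hypothetical values made up for illustration, and the helper name `confusion_matrix` is an assumption rather than a function from any particular library.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical labels, invented purely for illustration.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)

# Common performance measures derived from the four cells.
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
precision = cm["TP"] / (cm["TP"] + cm["FP"])
recall = cm["TP"] / (cm["TP"] + cm["FN"])

print(dict(cm), accuracy, precision, recall)
```

The four cells are the raw material; accuracy, precision, and recall are just different ratios of them, which is why the matrix is usually reported alongside any single score.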

UNIT-2

1. During their first swim through a water maze, 15 laboratory rats made the
following number of errors (blind alleyway entrances): 2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2,
12, 10, 4, 3. Find the mode and median for these data.
2. Mention the essential and optional guidelines to be followed for frequency
distributions. Or: State the “Guidelines for frequency distribution”.
3. Write a short note on the stem-and-leaf display. Represent the following data in a
stem-and-leaf display: 67, 74, 63, 88, 82, 97, 65, 79
4. Why is frequency distribution important in data science?
5. How can the skewness of a data distribution be identified?
6. Define frequency distribution.
7. What are some possible poor features of the following frequency distribution?
Define Outlier.
8. Identify any outliers in each of the following sets of data collected from nine college
students.
9. List out the typical shapes of smoothed frequency distribution.
10. Define mean.
11. Find the sample mean value for the best actress Oscar winner data set: 34 34 26 37 42
41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33.
12. Define Median.
13. State the steps to find the median value.
14. Compare mean and median
15. Define mode.
16. When should the mean or the median be used?
17. What do you mean by range?
18. Compute the standard deviation of the sample data 3, 5, 7, which has a sample mean of 5.
19. What do you mean by degrees of freedom?
20. Define Inter-Quartile Range (IQR).
21. How to measure/interpret the strength of a relationship based on the absolute value of
‘r’?
22. Define Correlation.
23. Define correlation coefficient.
24. What are the four things that describe the relationship between variables?
25. What is linear regression?
26. List out the types of correlation coefficients
27. What are the different types of least squares?
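
As a way of checking hand calculations for questions 1 and 18 above, the mode, median, and sample standard deviation can be computed with Python's standard `statistics` module. This is a small sketch rather than a prescribed solution; the variable names are illustrative.

```python
import statistics

# Question 1: errors made by 15 rats in the water maze.
errors = [2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]

mode_errors = statistics.mode(errors)      # most frequent value
median_errors = statistics.median(errors)  # middle value of the sorted data

# Question 18: sample data 3, 5, 7 (sample mean 5).
sample = [3, 5, 7]

# stdev() divides the sum of squared deviations by n - 1 (sample formula).
sample_sd = statistics.stdev(sample)

print(mode_errors, median_errors, sample_sd)
```

Note that `statistics.stdev` uses the n - 1 denominator; `statistics.pstdev` would give the population version instead.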

PART-B

1. Explain the step-by-step procedure to construct a frequency distribution, using the data set in the following table as an example.

2. In a survey, the question “During your lifetime, how often have you changed your permanent residence?” was asked. A group of 18 college students replied as follows: 1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 3. Find the mode, median, and standard deviation. [April/May 2023]

3. Consider an example. Tom, who is the owner of a retail shop, recorded the price of different T-shirts against the number of T-shirts sold at his shop over a period of one week. He tabulated the data as shown below:

Price of T-Shirt    Number of T-Shirts Sold
2                   4
3                   5
5                   7
7                   10
9                   15

Explain the concept of least squares regression to find the line of best fit for the above data.

4. The following frequency distribution shows the annual incomes in dollars for a group of college graduates.
(a) Construct a histogram.
(b) Construct a frequency polygon.
(c) Is this distribution balanced or lopsided?

5. Consider the best actress Oscar winners dataset given below and construct a stem plot for it.
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

6. Explain the multiple linear regression model by predicting sales from attributes such as the budgets for TV, radio, and newspaper advertisements, using a statistical model.

7. Consider the following set of x and y values. Fit a least squares linear regression and check the model fit to determine whether the model is satisfactory.

8. Discuss in detail the various typical shapes of frequency distributions. Analyze their characteristics with an example.

9. The following are the numbers of customers who entered a video store in 8 consecutive hours: 7, 9, 5, 13, 3, 11, 15, 9. Find the standard deviation of the number of hourly customers. Summarize the data with the help of the standard deviation.

10. Explain the steps to calculate the IQR, using the best actress Oscar winners dataset as an example.

11. For each of the following pairs of distributions, first decide whether their standard deviations are about the same or different. If their standard deviations are different, indicate which distribution should have the larger standard deviation. Note that the distribution with the more dissimilar set of scores or individuals should produce the larger standard deviation regardless of whether, on average, scores or individuals in one distribution differ from those in the other distribution.

12. The IQ scores for a group of 35 high school dropouts are as follows:
i) Construct a frequency distribution for grouped data (4)
ii) Relative frequency distribution (3)
iii) Cumulative frequency distribution (3)

13. Discuss in detail “Measures of Central Tendency” and calculate each measure for the following retirement ages data:
60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
Is it possible to calculate the mean for qualitative data? Justify your answer.
Is the above data bimodal? Justify your answer.

14. Discuss the following measures and calculate them for the given “residence changes” data:
1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4
i. Range
ii. Variance
iii. Standard Deviation
iv. Inter Quartile Range (IQR)
v. Z-Score
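
The least squares question about Tom's T-shirt data (question 3 above) can be illustrated with a short sketch. It assumes the table reads as the five (price, sold) pairs (2, 4), (3, 5), (5, 7), (7, 10), (9, 15), which is an interpretation of the flattened table, and it uses the usual closed-form formulas for the slope and intercept of the line y = mx + b. The function name `least_squares_line` is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Closed-form solution of the normal equations for one predictor.
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Assumed reading of the T-shirt table: price vs number sold.
price = [2, 3, 5, 7, 9]
sold = [4, 5, 7, 10, 15]

m, b = least_squares_line(price, sold)
print(f"y = {m:.3f}x + {b:.3f}")

# Residuals show how far each observed point lies from the fitted line.
residuals = [y - (m * x + b) for x, y in zip(price, sold)]
```

With an intercept in the model the residuals sum to zero, which is a quick sanity check on the arithmetic.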

UNIT-3

1. What is population?
2. What is a sample?
3. When are samples used?
4. What is the difference between a population and a sample? (April/May 2023)
5. Define hypothetical population.
6. What is random sampling?
7. What is a sampling distribution?
8. What are the types of sampling distribution?
9. Define the sampling distribution of the mean.
10. What is meant by the sampling distribution of a proportion?
11. Define t-distribution.
12. Define the mean of all sample means.
13. What is the special type of standard deviation?
14. What is hypothesis testing?
15. Define hypothesized sampling distribution.
16. Define decision rule.
17. Define null hypothesis.
18. What is the level of significance?
19. Define one-tailed and two-tailed tests.
20. State Addition Rule and Multiplication Rule
21. Imagine a very simple population consisting of only four observations: 2, 4, 6, 8.
List all possible samples of size two.
22. What is a one-tailed test (lower tail critical)?
23. What are the four possible outcomes for any hypothesis test?
24. Define point estimate.
25. What is meant by a confidence interval (CI) for μ?
26. What do you mean by hypothesis? Name at least 4 of its types.
27. State the Central Limit Theorem. (April/May 2023)
28. Indicate whether the following statements are True or False with proper
justification. The mean of all sample means
(a) always equals the value of a particular sample mean.
(b) equals 100 if, in fact, the population mean equals 100.
(c) usually equals the value of a particular sample mean.
(d) is interchangeable with the population mean.
29. Define the effect of sample size.
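
The four-observation population 2, 4, 6, 8 from the list above is small enough to enumerate completely. The sketch below lists every sample of size two and builds the sampling distribution of the mean as a relative frequency table. It assumes sampling with replacement and treats order as significant, which gives 16 equally likely samples; if the question intends sampling without replacement, the enumeration would need `itertools.permutations` or `combinations` instead.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]

# All ordered samples of size two, drawn with replacement (an assumption).
samples = list(product(population, repeat=2))

# Mean of each sample, kept exact with Fraction.
means = [Fraction(a + b, 2) for a, b in samples]

# Relative frequency of each possible sample mean.
counts = Counter(means)
distribution = {
    mean: Fraction(count, len(samples))
    for mean, count in sorted(counts.items())
}

for mean, rel_freq in distribution.items():
    print(mean, rel_freq)

# The mean of all sample means equals the population mean.
mean_of_means = sum(mean * p for mean, p in distribution.items())
population_mean = Fraction(sum(population), len(population))
```

The last two lines illustrate the property behind the True/False question on the mean of all sample means: it coincides with the population mean, here 5.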

PART-B

1. Explain populations and samples, and the difference between them.


2. Describe random sampling
3. Explain sampling distribution and types
4. Describe null hypothesis test in detail
5. Explain in detail hypothesis testing and examples
6. Does the mean SAT math score for all local freshmen differ from the average of
500? (z test for a population mean)
7. Explain one-tailed and two-tailed tests.
8. Define estimation. Explain in detail point estimation.
9. Discuss the following with suitable examples:
i. Random Sampling vs Random Assignment
ii. Independent vs Dependent Events
iii. Independent vs Mutually Exclusive Events
iv. Conditional Probability
v. Sampling Distribution of the Mean
10. Imagine a very simple population consisting of only four observations: 2, 3, 4, 5.
i. Explain the process of constructing a relative frequency table showing
the sampling distribution of the mean.
ii. Construct a relative frequency table showing the sampling distribution of the
mean for the above observations.
11. Define hypothesis. Discuss in detail at least 5 types of hypothesis
statement, with an example of each.
12. Calculate the value of the z test for each of the following situations. Also, given a
critical z score of ±1.96, calculate the critical confidence level.
i. X̄ = 12; σ = 9; n = 25; μhyp = 15
ii. X̄ = 3600; σ = 4000; n = 100; μhyp = 3500
iii. X̄ = 0.25; σ = 0.10; n = 36; μhyp = 0.22
13. Reading achievement scores are obtained for a group of fourth graders. A score
of 4.0 indicates a level of achievement appropriate for fourth grade, a score
below 4.0 indicates underachievement, and a score above 4.0 indicates
overachievement. Assume that the population standard deviation equals 0.4. A
random sample of 64 fourth graders reveals a mean achievement score of 3.82.
Construct a 95% confidence interval for the unknown population mean.
(Remember to convert the standard deviation to a standard error.) Interpret this
confidence interval; that is, do you find any consistent evidence either of
overachievement or of underachievement?
14. Illustrate in detail estimation methods and confidence intervals.
15. For the population at large, the Wechsler Adult Intelligence Scale is designed to
yield a normal distribution of test scores with a mean of 100 and a standard
deviation of 15. School district officials wonder whether, on the average, an IQ
score different from 100 describes the intellectual aptitudes of all students in
their district. Wechsler IQ scores are obtained for a random sample of 25 of their
students, and the mean IQ is found to equal 105. Using the step-by-step
procedure, test the null hypothesis at the .05 level of significance.
16. Imagine a simple population consisting of only 5 observations: 2, 4, 6, 8, 10. List
all possible samples of size two. Construct a relative frequency table showing the
sampling distribution of the mean.
17. According to the American Psychological Association, members with a
doctorate and a full-time teaching appointment earn, on the average, $82,500
per year, with a standard deviation of $6,000. An investigator wishes to
determine whether $82,500 is also the mean salary for all female members with
a doctorate and a full-time teaching appointment. Salaries are obtained for a
random sample of 100 women from this population, and the mean salary equals
$80,100.
i. Someone claims that the observed difference between $80,100 and $82,500 is large
enough by itself to support the conclusion that female members earn less than male
members. Explain why it is important to conduct a hypothesis test.
ii. The investigator wishes to conduct a hypothesis test for what population?
iii. What is the null hypothesis, H0?
iv. What is the alternative hypothesis, H1?
v. Specify the decision rule, using the .05 level of significance.
vi. Calculate the value of z. (Remember to convert the standard deviation to a
standard error.)
vii. What is your decision about H0?
viii. Using words, interpret this decision in terms of the original problem.
18. According to the California Educational Code
(http://www.cde.ca.gov/ls/fa/sf/peguidemidhi.asp), students in grades 7 through 12
should receive 400 minutes of physical education every 10 school days. A random
sample of 48 students has a mean of 385 minutes and a standard deviation of 53
minutes. Test the hypothesis at the .05 level of significance that the sampled
population satisfies the requirement.
19. According to a 2009 survey based on the United States census
(http://www.census.gov/prod/2011pubs/acs-15.pdf), the daily one-way commute time of U.S. workers
averages 25 minutes with, we’ll assume, a standard deviation of 13 minutes. An
investigator wishes to determine whether the national average describes the mean
commute time for all workers in the Chicago area. Commute times are obtained for a
random sample of 169 workers from this area, and the mean time is found to be 22.5
minutes. Test the null hypothesis at the .05 level of significance.
20. Each of the following statements could represent the point of departure for a
hypothesis test. Given only the information in each statement, would you use a two-
tailed (or nondirectional) test, a one-tailed (or directional) test with the lower tail
critical, or a one-tailed (or directional) test with the upper tail critical? Indicate your
decision by specifying the appropriate H0 and H1. Furthermore, whenever you
conclude that the test is one-tailed, indicate the precise word (or words) in the
statement that justifies the one-tailed test.
i. An investigator wishes to determine whether, for a sample of drug addicts, the mean
score on the depression scale of a personality test differs from a score of 60, which,
according to the test documentation, represents the mean score for the general
population.
ii. To increase rainfall, extensive cloud-seeding experiments are to be conducted, and
the results are to be compared with a baseline figure of 0.54 inch of rainfall (for
comparable periods when cloud seeding was not done).
iii. Public health statistics indicate, we will assume, that American males gain an
average of 23 lbs during the 20-year period after age 40. An ambitious weight-
reduction program, spanning 20 years, is being tested with a sample of 40-year-old
men.
iv. When untreated during their lifetimes, cancer-susceptible mice have an average
life span of 134 days. To determine the effects of a potentially life-prolonging (and
cancer-retarding) drug, the average life span is determined for a group of mice that
receives this drug.
21. Each of the following statements could represent the point of departure for a
hypothesis test. Given only the information in each statement, would you use a two-
tailed (or nondirectional) test, a one-tailed (or directional) test with the lower tail
critical, or a one-tailed (or directional) test with the upper tail critical? Indicate your
decision by specifying the appropriate H0 and H1. Furthermore, whenever you
conclude that the test is one-tailed, indicate the precise word (or words) in the
statement that justifies the one-tailed test.
i. An investigator wishes to determine whether, for a sample of drug addicts, the
mean score on the depression scale of a personality test differs from a score of 60,
which, according to the test documentation, represents the mean score for the
general population.
ii. To increase rainfall, extensive cloud-seeding experiments are to be conducted,
and the results are to be compared with a baseline figure of 0.54 inch of rainfall
(for comparable periods when cloud seeding was not done).
iii. Public health statistics indicate, we will assume, that American males gain an
average of 23 lbs during the 20-year period after age 40. An ambitious weight-
reduction program, spanning 20 years, is being tested with a sample of 40-year-old
men.
iv. When untreated during their lifetimes, cancer-susceptible mice have an
average life span of 134 days. To determine the effects of a potentially life-
prolonging (and cancer-retarding) drug, the average life span is determined for a
group of mice that receives this drug.
22. For each of the following situations, indicate whether H0 should be retained or
rejected. Given a one-tailed test, lower tail critical with α = .01, and
(a) z = – 2.34 (b) z = – 5.13 (c) z = 4.04
Given a one-tailed test, upper tail critical with α = .05, and
(d) z = 2.00 (e) z = – 1.80 (f) z = 1.61
23. Reading achievement scores are obtained for a group of fourth graders. A score of
4.0 indicates a level of achievement appropriate for fourth grade, a score below 4.0
indicates underachievement, and a score above 4.0 indicates overachievement.
Assume that the population standard deviation equals 0.4. A random sample of 64
fourth graders reveals a mean achievement score of 3.82.
i. Construct a 95 percent confidence interval for the unknown population mean.
(Remember to convert the standard deviation to a standard error.)
ii. Interpret this confidence interval; that is, do you find any consistent evidence
either of overachievement or of underachievement?
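
Many of the questions above reduce to the same two calculations: a z ratio for a hypothesized population mean, and a confidence interval built from the standard error. The sketch below works through the Wechsler IQ question (question 15) and the reading achievement question (question 13) with figures taken from those questions. The function names are illustrative, and 1.96 is the standard two-tailed critical z for the .05 level.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)


def z_ratio(sample_mean, mu_hyp, sigma, n):
    """z = (sample mean - hypothesized mean) / standard error."""
    return (sample_mean - mu_hyp) / standard_error(sigma, n)


def confidence_interval(sample_mean, sigma, n, z_conf=1.96):
    """Interval for the population mean: sample mean +/- z_conf * SE."""
    margin = z_conf * standard_error(sigma, n)
    return sample_mean - margin, sample_mean + margin


# Question 15: IQ scores, mu_hyp = 100, sigma = 15, n = 25, sample mean = 105.
z_iq = z_ratio(105, 100, 15, 25)
reject_h0_iq = abs(z_iq) >= 1.96  # two-tailed test at the .05 level

# Question 13: reading scores, sigma = 0.4, n = 64, sample mean = 3.82.
low, high = confidence_interval(3.82, 0.4, 64)

print(round(z_iq, 2), reject_h0_iq)
print(round(low, 3), round(high, 3))
```

For the IQ data the z ratio falls short of 1.96, so the null hypothesis is retained at the .05 level. For the reading data the whole interval lies below 4.0, which is the kind of evidence the interpretation part of the question asks about.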

UNIT-4
1. Define t-test.
2. Define F-test.
3. What is analysis of variance?
4. Define effect size estimation.
5. What is meant by multiple comparisons, multiplicity, or multiple testing?
6. Define ANOVA.
7. Write the formula for calculating F-score value.
8. Compare one-way vs two-way ANOVA.
9. What do you mean by two-factor factorial design?
10. Define statistical test in F-test
11. What is two-way analysis of variance?
12. What are the types of ANOVA?
13. Define chi-square test.
14. What Does the Analysis of Variance Reveal?
15. How to Use ANOVA?
16. What is the Analysis of Variance in Other Applications?
17. What is a Test?
18. Define Range-Bound Market Test.
19. What is the Trending Market Test?
20. Define Statistical Tests.
21. What is Alpha Risk?
22. What is Range-Bound Trading?
23. What is a One-Tailed Test?
24. Give the four possible outcomes of the Vitamin C experiment and also perform hypothesis
testing.
25. Distinguish between dependent variables and explanatory variables.
26. What is the significance of the p-value in hypothesis testing? (APRIL/MAY 2023)
27. Compare the t-test and ANOVA. (APRIL/MAY 2023)
28. Compare the various test statistics, such as the z-score, t-statistic, F-statistic, and
chi-squared, with their associated tests.

PART-B
1. A library system lends books for periods of 21 days. This policy is being
reevaluated in view of a possible new loan period that could be either longer or shorter
than 21 days. To aid in making this decision, book-lending records were consulted to
determine the loan periods actually used by the patrons. A random sample of 8 records
revealed the following loan periods in days: 21, 15, 12, 24, 20, 21, 13, and 16. Test the null
hypothesis with a t-test, using the .05 level of significance. (APRIL/MAY 2023)
2. A consumers’ group randomly samples 10 “one-pound” packages of ground wheat sold
by a supermarket. Calculate the mean and the estimated standard error of the mean for
this sample, given the following weights in ounces: 16, 15, 14, 15, 14, 15, 16, 14, 14, 14.

3. Illustrate in detail about one factor ANOVA with example. (APRIL/MAY 2023)

4. A random sample of 90 college students indicates whether they most desire
love, wealth, power, health, fame, or family happiness.
i. Using the .05 level of significance and the following results, test the null hypothesis
that, in the underlying population, the various desires are equally popular.
ii. Specify the approximate p-value for this test result. (APRIL/MAY 2023)

5. Estimate the calculations for the t test for gas mileage investigation. Showcase
the hypothesis analysis, t ratio calculation with three panels along with confidence
interval.

6. Estimate the calculations for the t test using two independent samples for EPO
experiment. Showcase the hypothesis analysis, sampling distribution, t ratio
calculation with three panels, p value estimation along with confidence interval.

7. State the use of counterbalancing and explain the EPO experiment with repeated
measures. Give a detailed summary table of t tests for population means for one
sample, two independent samples, and two related samples.

8. Suggest the hypothesis test summary for t test for a population correlation
coefficient for the case study on Greeting Card Exchange

9. Suggest the hypothesis test summary using One-Factor F Test for Sleep
Deprivation Experiment and also the variance estimates, mean squares, sum of
squares with degree of freedom

10. The blood pressure of 8 patients is recorded before and after:
Before: 180, 200, 230, 240, 170, 190, 200, and 165
After: 140, 145, 150, 155, 120, 130, 140, and 130
Determine whether there is any significant difference between the BP readings before and
after by applying the two-sample t-test.

11. The marks of a group of students are 10.5, 9, 7, 12, 8.5, 7.5, 6.5, 8, 11, and 9.5. The
population mean score is 12 and the standard deviation is 1.80. Does the students' mean
differ significantly from the population mean?

12. Estimate the calculations for the t test for gas mileage investigation. Showcase
the hypothesis analysis, t ratio calculation with three panels along with confidence
interval.

13. Odds ratios can be calculated for larger cross-classification tables, and one way of
doing this is by reconfiguring into a smaller 2 × 2 table. The 2 × 3 table for the lost letter
study, could be reconfigured into a 2 × 2 table if, for example, the investigator is primarily
interested in comparing return rates of lost letters only for campus and off-campus
locations (both suburbia and downtown), that is
(i) Given χ²(1, n = 200) = 7.42, p < .01, φ²c = .037 for these data, calculate and
interpret the odds ratio for a returned letter from campus.
(ii) Calculate and interpret the odds ratio for a returned letter from off-campus.
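
The library loan question (question 1 above) is a one-sample t test, and its arithmetic can be sketched with the standard library alone. The hypothesized mean of 21 days and the eight loan periods come from the question; the critical value 2.365 is the tabled two-tailed t for the .05 level with 7 degrees of freedom, hard-coded here instead of being looked up programmatically.

```python
import math
import statistics

loans = [21, 15, 12, 24, 20, 21, 13, 16]  # loan periods in days
mu_hyp = 21                               # hypothesized population mean

n = len(loans)
sample_mean = statistics.mean(loans)

# Sample standard deviation (n - 1 denominator) and estimated standard error.
sample_sd = statistics.stdev(loans)
standard_error = sample_sd / math.sqrt(n)

# t ratio: how many standard errors the sample mean lies from mu_hyp.
t_ratio = (sample_mean - mu_hyp) / standard_error

df = n - 1
critical_t = 2.365  # two-tailed, alpha = .05, df = 7 (from a t table)
reject_h0 = abs(t_ratio) >= critical_t

print(sample_mean, round(standard_error, 3), round(t_ratio, 2), reject_h0)
```

Because the absolute t ratio is smaller than the critical value, this sketch retains the null hypothesis that the mean loan period is 21 days.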

UNIT-5
1. What is Predictive Analytics?
2. Define Predictive Analytics.
3. In which application areas can predictive models be applied?
4. What is meant by forecasting?
5. Define Credit
6. Define Underwriting
7. What is meant by marketing?
8. Compare Predictive Analytics vs. Machine Learning
9. What are decision trees?
10. Define Regression
11. Define Neural Networks
12. What are the Benefits of Predictive Analytics?
13. What are the criticisms of Predictive Analysis?
14. How Does Netflix Use Predictive Analytics?
15. What is Data Analytics?
16. Why do we need Goodness of Fit? (APRIL/MAY 2023)
17. What is survival analysis? (APRIL/MAY 2023)
18. Specify the importance of the exponentially weighted moving average.
19. What is time series analysis?
20. State the use of autocorrelation in time series analysis.
21. State the difference between the exponentially weighted moving average and the
moving average in time series analysis.
22. What are the various steps of data analysis?

PART-B

1. How do you solve the least squares problem in Python? What is the least squares method in Python?

2. What is the goodness-of-fit test?

3. One study indicates that the number of televisions that American families have is distributed (this is the given distribution for the American population) as in the table.

Number of Televisions    Percent
0                        10
1                        16
2                        55
3                        11
4+                       8

The table contains expected (E) percents.

A random sample of 600 families in the far western United States resulted in the data in this table.

Number of Televisions    Frequency
0                        66
1                        119
2                        340
3                        60
4+                       15

The table contains observed (O) frequency values.

At the 1% significance level, does it appear that the distribution “number of televisions” of far western United States families is different from the distribution for the American population as a whole?

4. Explain in detail time series analysis with an example.

5. Describe regression using StatsModels.

6. Explain multiple regression with an example.

7. What are nonlinear relationships and their types? Explain the difference between linear and nonlinear relationships.

8. Describe logistic regression in detail.

9. Explain in detail serial correlation and autocorrelation.

10. Describe in detail the introduction to survival analysis.

11. Consider an example: Sam recorded the hours of sunshine and the number of ice creams sold at the shop from Monday to Friday, as given in the following table.

Hours of Sunshine    Ice Creams Sold
3                    5
5                    7
7                    10
9                    15

12. Describe in detail the logistic regression model in predictive analysis.

13. Exemplify in detail multiple regression models with an example.

14. Explain in depth time series analysis and its techniques with relevant examples.

15. Explain the multiple linear regression model by predicting sales from attributes such as the budgets for TV, radio, and newspaper advertisements, using a statistical model.

16. How is a linear model tested? Explain in detail the role of weighted resampling in linear model testing.

17. Explain linear least squares predictive analysis with an example.

18. Explain in detail TSA with an example.
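
The television question (question 3 above) is a chi-square goodness-of-fit test. The sketch below converts the expected percents into expected frequencies for the sample of 600 families and computes the test statistic from the observed frequencies given in the question. The critical value 13.277 is the tabled chi-square value for the .01 level with 4 degrees of freedom, hard-coded here as an assumption in place of a library lookup.

```python
# Observed frequencies for 0, 1, 2, 3, and 4+ televisions (sample of 600).
observed = [66, 119, 340, 60, 15]

# Expected percents for the same categories in the American population.
expected_percent = [10, 16, 55, 11, 8]

n = sum(observed)

# Expected frequency for each category = sample size * expected proportion.
expected = [n * p / 100 for p in expected_percent]

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1
critical_value = 13.277  # chi-square table, alpha = .01, df = 4
reject_h0 = chi_square >= critical_value

print(expected)
print(round(chi_square, 2), df, reject_h0)
```

The statistic is far above the critical value, so at the 1% level this sketch rejects the hypothesis that far western families follow the same distribution as the American population as a whole. Most of the statistic comes from the 4+ category, where far fewer families were observed than expected.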
