
AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS

QUESTION BANK

UNIT I
INTRODUCTION TO DATA SCIENCE
Part A
1. Define Data Science.
Data science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge from noisy, structured, or unstructured
data, and to apply that knowledge across a wide range of application domains.
Data science is related to data mining, machine learning, and big data.
2. What is big data?
Big data is a comprehensive term for any collection of large or complex data sets
that is difficult to process using traditional data management techniques such as
relational database management systems (RDBMS).
3. What is machine learning?
Machine learning is a branch of artificial intelligence (AI) and computer science
that focuses on the use of data and algorithms to imitate the way that humans learn,
gradually improving in accuracy.
4. What is data mining?
Data mining is the process of reviewing large data sets to identify patterns and
relationships that can solve business problems through data analysis. Data mining
techniques and tools enable enterprises to predict future trends and make
more informed business decisions.
5. What are the characteristics of big data?
The characteristics of big data are:
- Volume - Quantity of data
- Variety - Different types of data
- Velocity - Speed at which new data is generated
- These characteristics are complemented with a fourth V, Veracity - Accuracy of
data
6. List the categories of data.
The main categories of data are:
a) Structured Data
b) Unstructured Data
c) Natural language Data
d) Machine-generated Data
e) Graph-based Data
f) Audio, video and image data
g) Streaming Data
7. List some of the application domains of data science.
The application domains of data science are:
- Health care applications
- Transportation
- Education
- Government Organization
- Commercial applications
8. What is structured data? Give examples.
Structured data is data that depends on a data model and resides in a fixed field
within a record.
Examples: Tables within databases or Excel files.
9. What is unstructured data? Give an example.
Unstructured data is data that is not easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is the regular
email.
10. What is machine generated data?
Machine-generated data is information that is automatically created by a
computer, process, application, or other machine without human intervention.
Machine-generated data is becoming a major data resource. The analysis of machine data
relies on highly scalable tools, due to its high volume and speed.
Examples: Web server logs, call detail records, network event logs, and telemetry.
11. State the importance of setting the research goal.
The goal of a data science project is to fulfill a precise and measurable objective
that is clearly connected to the purpose, workflows, and decision-making processes of
the business. This step defines the scope of the project.
12. List the phases involved in the data science process.
- Setting the research goal
- Retrieving data
- Data Preparation
- Data Exploration
- Data Modeling
- Presentation and automation
13. What is meant by data cleaning?
Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset. When combining
multiple data sources, there are many opportunities for data to be duplicated or
mislabeled.
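Two of the cleaning steps named above, removing duplicates and dropping incomplete records, can be sketched in plain Python (the records below are made up for the example):

```python
# Hypothetical raw records with a duplicate row and a missing value
records = [
    {"id": 1, "name": "Asha", "age": 23},
    {"id": 1, "name": "Asha", "age": 23},   # exact duplicate
    {"id": 2, "name": "Ravi", "age": None}, # incomplete record
]

# Remove exact duplicates while preserving the original order
seen, deduped = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Drop records that still contain missing (None) values
clean = [r for r in deduped if all(v is not None for v in r.values())]
print(len(clean))  # 1
```

In practice this is usually done with a library such as pandas, but the logic is the same: detect duplicates, then decide how to treat incomplete rows.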
14. What is project charter?
Project charter is a document that lays out the project vision, scope, objectives,
project team, and their responsibilities, key stakeholders, and how it will be carried out
or the implementation plan.
15. Identify the important contents of a project charter.
Project charter must include the following:
- Clear research goal
- Project mission and context
- Performing the analysis
- Resources to be used
- Proof that it is an achievable project, or proof of concepts
- Deliverables and a measure of success
- Timeline
16. List some of the visualization techniques.
- Line graphs
- Histograms
- Bar graphs
- Boxplots
- Sankey and network graphs.
17. List the problems associated with real world data.
The problems with the real world data are:
- Incomplete data: Missing attribute values.
- Noisy: Contains errors.
- Inconsistent: Contain discrepancies in codes and names.
18. Define data warehouse, data mart and data lake.
- Data warehouse is constructed by integrating data from multiple
heterogeneous sources that support analytical reporting, structured and/or ad
hoc queries, and decision making.
- Data mart supplies subject-oriented data necessary to support a specific
business unit.
- A data lake stores an organization's raw and processed (unstructured and
structured) data at both large and small scales.
19. List some of the factors involved in selecting the modeling technique.
- Model performance
- Moving the model to a production environment for easy implementation
- Maintenance of the model
20. What is dummy variable?
Dummy variables can only take two values: true (1) or false (0). They are used to
indicate the presence or absence of a categorical effect that may explain the observation.
21. What do you mean by Exploratory Data Analysis?
Exploratory Data Analysis is the critical process of performing initial
investigations on data so as to discover patterns, spot anomalies, test hypotheses,
and check assumptions with the help of summary statistics and graphical
representations. It is used to discover trends and patterns in the dataset.
22. List out the methods for combining data from different tables.
- Joining tables
- Appending tables
- Using views to simulate data joins and appends
- Enriching aggregated measures
Part B & C
1. Discuss the applications of data science and big data with suitable examples.
2. Illustrate the overview of the data science process.
3. Elaborate any 5 application domains of data science.
4. Describe the categories of data for data mining.
5. Discuss the significance of setting the research goal for the data science project.
6. Discuss the strategies involved in retrieving relevant data from different sources of
data.
7. Explain the different stages of data preparation phase.
8. Elucidate the techniques involved in data cleansing.
9. Illustrate the steps involved in combining data from different data sources.
10. Explain impact of variable reduction on data science project highlighting its pros
and cons.
11. Elaborate on the steps involved in model building with suitable diagrams.
UNIT – II
DESCRIPTIVE ANALYTICS
PART - A
1. What is meant by frequency distribution?
A frequency distribution is a collection of observations produced by sorting
observations into classes and showing their frequency (f) of occurrence in each
class.
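As a quick illustration (a minimal sketch using Python's standard library and the data from Part B question 5), an ungrouped frequency distribution simply counts how often each value occurs:

```python
from collections import Counter

# Scores from Part B question 5
scores = [139, 145, 150, 145, 136, 150, 152, 144, 138, 138]

# Ungrouped frequency distribution: each distinct value is its own class
freq = Counter(scores)
for value in sorted(freq):
    print(value, freq[value])  # class value and its frequency f
```

A grouped distribution would first bin the values into class intervals before counting.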
2. What is Qualitative data? Give example.
Qualitative or Categorical data is a set of observations where any single
observation is a word, letter, or numerical code that represents a class or a category.
Example: Words - Yes or No, Letters - Y or N, Numerical code - 0 or 1
3. What is quantitative data? Give an example
Quantitative Data is a set of observations where any single observation is a
number that represents an amount or a count. It can be expressed in numerical values,
which make it countable and includes statistical data analysis. It is also known as
numerical data.
Example: Weights: 35, 56, …, 70 kg
4. Compare Discrete and Continuous Variables.
Discrete Variable is a variable that consists of isolated numbers separated by
gaps.
- Example: Number of students in a class
Continuous Variable is a variable that consists of numbers whose values, at least
in theory, have no restrictions.
- Example: Height of students in a class
5. State the differences between Nominal and Ordinal data.
Nominal Data:
- Cannot be quantified and there is no intrinsic ordering.
- It is qualitative (categorical) data.
- Does not provide any quantitative value; arithmetical operations cannot be performed.
- Cannot be used to compare one item with another.
- Examples: Gender, Nationality.

Ordinal Data:
- There is a sequential order by position on the scale.
- It is "in-between" qualitative and quantitative data.
- Provides sequence; numbers can be assigned to ordinal data, but arithmetical operations cannot be performed.
- Items can be compared with one another by ranking or ordering.
- Examples: Economic status, customer satisfaction.
6. What are the types of Frequency Distribution?
- Grouped frequency distribution
- Ungrouped frequency distribution
- Cumulative frequency distribution
- Relative frequency distribution
- Relative cumulative frequency distribution
7. Define an outlier.
An outlier is a data point that differs significantly from other observations. An
outlier can occur due to variability in the measurement or it may indicate an
experimental error.
8. What is percentile rank?
Percentile Rank (PR) of an observation is the percentage of scores in the entire
distribution with equal or smaller values than that score. Its mathematical formula is
given as:

PR = (number of scores at or below the given score / total number of scores) × 100
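The percentile rank calculation can be sketched directly from its definition (the score values below are hypothetical):

```python
def percentile_rank(score, scores):
    """Percentage of scores in the distribution at or below the given score."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100.0 * at_or_below / len(scores)

data = [40, 50, 60, 70, 80]          # hypothetical scores
print(percentile_rank(60, data))     # 60.0 (3 of the 5 scores are <= 60)
```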
9. State the differences between a histogram and bar graph.


Histogram:
- A graphical representation that displays data by way of bars to show the frequency of numerical data.
- Shows the distribution of non-discrete variables.
- Used for quantitative data.
- Bars touch each other; there are no spaces between bars.
- Elements are grouped together so that they are considered as ranges.

Bar graph:
- A pictorial representation of data that uses bars to compare different categories of data.
- Shows a comparison of discrete variables.
- Used for categorical data.
- Bars do not touch each other; there are spaces between bars.
- Elements are taken as individual entities.

10. What are the measures of central tendency?


Mean, Median and Mode
11. Define Mode.
The mode represents the value of the most frequently occurring score.
12. Define Median.
Median represents the middle value when observations are ordered from least
to most.
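The three measures of central tendency can be computed with Python's standard library, here using the first data set from Part B question 7:

```python
import statistics

# First data set from Part B question 7
data = [45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70]

print(round(statistics.mean(data), 2))  # arithmetic average
print(statistics.median(data))          # middle value of the ordered data
print(statistics.mode(data))            # most frequently occurring value
```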
13. What is a Positively Skewed Distribution?
A Positively Skewed Distribution is a distribution that includes a few extreme
observations in the positive direction (to the right of the majority of observations).
14. What is a Negatively Skewed Distribution?
Negatively Skewed Distribution is a distribution that includes a few extreme
observations in the negative direction (to the left of the majority of observations).
15. What is Variance?
Variance is the mean of all squared deviation scores.

16. What is standard deviation?


Standard Deviation is a rough measure of the average (or standard) amount by
which scores deviate on either side of their mean.
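Both definitions above (variance as the mean of squared deviations, standard deviation as its square root) can be checked with the standard library, using the first data set from Part B question 9:

```python
import statistics

scores = [1, 3, 7, 2, 0, 4, 3, 7]   # from Part B question 9

# Variance: the mean of all squared deviation scores
pop_var = statistics.pvariance(scores)
# Standard deviation: the square root of the variance
pop_sd = statistics.pstdev(scores)      # population version (divides by n)
sample_sd = statistics.stdev(scores)    # sample version (divides by n - 1)
print(pop_var, round(pop_sd, 3), round(sample_sd, 3))
```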

17. What is a Normal Curve?


A theoretical curve noted for its symmetrical bell-shaped form.
18. What is z Score?
A z score is a unit-free, standardized score that, regardless of the original units of
measurement, indicates how many standard deviations a score is above or below the
mean of its distribution.

Where, X is the original score and i and σ are the mean and the standard deviation,
respectively, for the normal distribution of the original scores.
19. How will you convert a z score to original score?

X = μ + (z)(σ)
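Both conversions are one-line formulas; a minimal sketch using the normal-curve parameters from Part B question 10 (mean 500, SD 100; the raw score 650 is a made-up example):

```python
def z_score(x, mu, sigma):
    """Standardize a raw score: how many SDs x lies above/below the mean."""
    return (x - mu) / sigma

def to_raw_score(z, mu, sigma):
    """Convert a z score back to the original units of measurement."""
    return mu + z * sigma

z = z_score(650, 500, 100)       # 1.5 (650 is 1.5 SDs above the mean)
x = to_raw_score(1.5, 500, 100)  # 650.0
print(z, x)
```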
20. What is Correlation?


Correlation measures the relationship between two variables.
Example: The relationship between the computer skills and GPA of the student
21. What are the types of correlation?
- Positive correlation
- Negative correlation
- No correlation
22. What is a scatterplot?
A scatterplot is a graph containing a cluster of dots that represents all pairs of
scores. Scatter plots are used to observe relationships between variables.
23. What is a curvilinear relationship?
Curvilinear Relationship is a relationship that can be described best with a
curved line.
24. What are the Key Properties of correlation coefficient r ?
The two properties are:
- The sign of r indicates the type of linear relationship, whether positive or
negative.
- The numerical value of r, without regard to sign, indicates the strength of the
linear relationship.
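Both properties can be seen by computing r directly from its definition, here with the paired data from Part B question 14:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

study_hours    = [2, 4, 6, 8, 10]   # from Part B question 14
sleeping_hours = [10, 9, 8, 7, 6]
# The sign (negative) gives the type, the magnitude (1) gives the strength
print(pearson_r(study_hours, sleeping_hours))  # close to -1: perfect negative
```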
25. What is regression?
Regression is a statistical method to determine the relationship between one
dependent variable and a series of other variables known as independent (explanatory)
variables.
26. What are the types of regression models?
- Linear model
- Non Linear model
27. What is restricted range?
Restricted range refers to a range of values that has been condensed, or
shortened. Example: The entire range of GPA scores is 0 to 10.0. A restricted range
could be 6.0 to 10.0 or 8.0 to 10.0 .
28. What is a regression line?
A regression line is a line which is used to describe the behavior of a set of data.
It displays the connection between scattered data points in any set. It gives the best
trend of the given data.
29. What is the Interpretation of r²?
The squared correlation coefficient, r², provides not only a key
interpretation of the correlation coefficient but also a measure of predictive accuracy
that supplements the standard error of estimate, sy|x.
30. What is Standard Error of Estimate?
Standard Error of Estimate sy|x is a rough measure of the average amount of
predictive error i.e., as a rough measure of the average amount by which known Y
values deviate from their predicted Y values.
31. Give the Least Squares Regression Equation.

Y′ = bX + a

where b is the slope and a is the Y-intercept of the regression line.
32. State the desirable property of least square regression?
The desirable property is that it automatically minimizes the total of all squared
predictive errors for known Y scores in the original correlation analysis.
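The least squares line can be computed directly for the data in Part B question 17 (a minimal sketch using the standard slope and intercept estimates):

```python
def least_squares(xs, ys):
    """Slope b and intercept a that minimize the sum of squared errors."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return b, a

# Data from Part B question 17
X = [1, 2, 3, 4, 5, 6, 7]
Y = [9, 8, 10, 12, 11, 13, 14]
b, a = least_squares(X, Y)
y_pred = b * 8 + a  # prediction Y' = bX + a for a new X = 8
print(round(b, 3), round(a, 3))  # slope ~0.929, intercept ~7.286
```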
33. State the Multiple Regression Equation.

Y′ = a + b₁X₁ + b₂X₂ + b₃X₃ + …
34. When does Regression Fallacy occur?


Regression fallacy occurs whenever regression towards the mean is interpreted
as a real, rather than a chance, effect. The regression fallacy can be avoided by splitting
the subset of extreme observations into two groups.

Part B & C
1. Explain the different types of frequency distribution with suitable examples and
diagrams.
2. Elaborate the different ways to describe or represent data using tables with suitable
examples.
3. Explain the various ways by which data can be represented or described using graphs
with suitable examples and diagrams.
4. Explain the different measures of central tendency and describe the suitable
measures for the different types of data distribution.
5. Construct the frequency table and draw bar graph and stem and leaf displays for the
following data: 139,145,150,145,136,150,152,144,138,138
6. Construct the histogram and convert it to a frequency polygon for the following data:
138, 139, 139, 145, 145, 150, 145, 136, 150, 152, 144, 138, 138, 150, 149, 133, 134, 152,
155, 151.
7. Compute the mean, median and mode for the following data sets:
- 45,55,60,60,63,63,63,63,65,65,70
- 26.9,26.3,28.7,27.4,26.6,27.4,26.9,26.9
8. Explain the various measures of variability with suitable examples.
9. Using the computation formula for the sum of squares, calculate the population
standard deviation and the sample standard deviation for the scores:
- 1,3,7,2,0,4,3,7
- 10,8,5,0,1,7,9,2,1
10. Consider the test scores approximating a normal curve with a mean of 500 and a
standard deviation of 100. Sketch a normal curve and shade in the target area
described by the following:
- more than 550
- less than 525
- between 520 and 540
Plan solutions for the target areas. Convert to z scores and find proportions that
correspond to the target areas.
11. Elaborate in detail the significance of correlation and the various types of
correlation.
12. What are scatterplots? Illustrate on the various types with suitable examples.
13. Elaborate on the correlation coefficient r. Compare the various correlation
coefficients.
14. Calculate and analyze the correlation coefficient between the number of study hours
and the number of sleeping hours of different students.
Number of Study Hours:    2, 4, 6, 8, 10
Number of Sleeping Hours: 10, 9, 8, 7, 6

15. What is the significance of r²? Give a detailed interpretation of r².


16. Discuss the importance of regression. Elaborate on the types of regression.
17. Calculate the regression coefficient and obtain the lines of regression for the
following data.
X: 1 2 3 4 5 6 7
Y: 9 8 10 12 11 13 14

18. Explain the significance of regression line and Least squares regression equation.
19. Find the standard error of the estimate of the mean weight of high school football
player using the data given of weights of the players.
Player Number Weight in Pounds
1 150
2 203
3 176
4 190
5 168
6 193
7 189
8 178
9 197
10 172

20. Elaborate on multiple regression equations.


21. Elucidate regression towards the mean. Explain regression fallacy and state how it
can be avoided.
UNIT III

INFERENTIAL STATISTICS

Part A

1. Define Population. Give an example.

A population is characterized by any complete set of observations (or potential


observations). It includes all the elements from the data set and the measurable
characteristics of the population such as mean and standard deviation are known as
parameters.

Example: All students in a college, or all people living in India (the population of
India).

2. What is real population?

A real population is one in which all potential observations are accessible at the time of
sampling.

Examples of real populations: The ages of all visitors to a Park on a given day, the
ethnicity of all employees in an organization.

3. List the different types of population.

The different types of population are:

 Finite Population
 Infinite Population
 Existent Population
 Hypothetical Population

4. What is Hypothetical Population?

A population whose units are not available in concrete form is known as a
hypothetical population. A population consists of sets of observations, objects, etc., that
all have something in common. In some situations, the population is only hypothetical.

Examples: An outcome of rolling the dice, outcome of tossing a coin.

5. Define Sample.

Any subset of observations from a population may be characterized as a sample. In


typical applications of inferential statistics, the sample size is small relative to the
population size.
6. List the categories of sampling.

There are two categories of sampling generally used:

 Probability sampling
 Non-probability sampling

7. What is random sampling?

Probability sampling, also known as random sampling, is a kind of sample selection


where randomization is used instead of deliberate choice.

8. List the types of random sampling.

Types of Probability/Random sampling:

 Simple random sampling


 Systematic sampling
 Stratified random sampling
 Cluster sampling
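The first two techniques above can be sketched with the standard library (the population and seed below are made up for the example):

```python
import random

random.seed(42)  # reproducible example
population = list(range(1, 101))  # hypothetical population of 100 units

# Simple random sampling: every unit has an equal chance of selection
simple = random.sample(population, 10)

# Systematic sampling: random start, then every nth unit thereafter
n = 10
start = random.randrange(n)
systematic = population[start::n]

print(len(simple), len(systematic))
```

Stratified and cluster sampling build on the same primitives: partition the population first, then sample within (or among) the partitions.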

9. Differentiate population and sample.

Population:
 A measurable quantity is called a parameter.
 It is a complete set.
 It is the true representation of opinion.
 Data collection is by complete enumeration or census.

Sample:
 A measurable quantity is called a statistic.
 It is a subset of the population.
 It has a margin of error and a confidence interval.
 Data collection is by means of sampling or a sample survey.

10. List the types of non-probability sampling.

Types of Non-probability sampling:

 Convenience sampling
 Quota sampling
 Purposive sampling
 Snowball sampling

11. What is Snowball sampling?

Snowball sampling is also known as referral sampling. This technique helps researchers
find a sample when subjects are difficult to locate. Researchers use this technique when
the sample size is small and not easily available.
12. What is the difference between non-probability sampling and probability
sampling?

Non-probability sampling:
 Sample selection is based on the subjective judgment of the researcher.
 Not everyone has an equal chance to participate.
 The researcher does not consider sampling bias.
 Useful when the population has similar traits.
 The sample does not accurately represent the population.
 Finding respondents is easy.

Probability sampling:
 The sample is selected at random.
 Everyone in the population has an equal chance of getting selected.
 Used when sampling bias has to be reduced.
 Useful when the population is diverse.
 Used to create an accurate sample.
 Finding the right respondents is not easy.

13. What is the Optimal Sample Size?

The sample sizes are in hundreds or thousands for surveys, but less than 100 for most
experiments. Optimal sample size depends on the estimated variability among
observations and the acceptable amount of error in the conclusion.

14. What is Systematic sampling?

Systematic sampling is also known as systematic clustering. In this method, random


selection applies only to the first item chosen. A rule is then applied so that every nth
item or person after that is picked.

15. What is Cluster sampling?

Groups rather than individual units of the target population are selected at random.
Cluster sampling is similar to stratified sampling, except that the population is divided
into a large number of subgroups (for example, hundreds of thousands of strata or
subgroups). Some of these subgroups are then chosen at random, and simple random
samples are gathered within them. These subgroups are known as clusters. Cluster
sampling is mainly used to reduce the cost of data compilation.

16. What are the advantages of random sampling?

Advantages of random sampling are:

 It helps to reduce the bias involved in the sample compared to other methods of
sampling and it is considered as a fair method of sampling.
 This method does not require any technical knowledge, as it is a fundamental
method of collecting the data.
 The data collected through this method is well informed.
 As the population size is large in the simple random sampling method,
researchers can create the sample size that they want.
 It is easy to pick the smaller sample size from the existing larger population.

17. What is Consecutive sampling?

This non-probability sampling method is very similar to convenience sampling, with a


slight variation. In this case, the researcher picks a single person or a group as a sample,
conducts research over a period, analyzes the results, and then moves on to another
subject or group if needed. Consecutive sampling technique gives the researcher a
chance to work with many topics and fine-tune their research by collecting results that
have vital insights.

18. What is the Standard Error of the Mean?

The standard error of the mean equals the standard deviation of the population divided
by the square root of the sample size. It is a rough measure of the average amount by
which sample means deviate from the mean of the sampling distribution or from the
population mean.
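The definition above translates directly into a one-line formula (the population SD and sample size below are hypothetical):

```python
import math

def standard_error_of_mean(sigma, n):
    """SEM = population standard deviation / sqrt(sample size)."""
    return sigma / math.sqrt(n)

# Hypothetical: population SD of 100, samples of size 25
print(standard_error_of_mean(100, 25))  # 20.0
```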

19. What is hypothesis testing?

Hypothesis testing is an act in statistics whereby an analyst tests an assumption


regarding a population parameter. The methodology employed by the analyst depends
on the nature of the data used and the reason for the analysis.

20. What is one tailed test?

A one-tailed test is a statistical test in which the critical area of a distribution is one-
sided so that it is either greater than or less than a certain value, but not both. If the
sample being tested falls into the one-sided critical area, the alternative hypothesis will
be accepted instead of the null hypothesis.

21. What is two tailed test?

A two-tailed test, in statistics, is a method in which the critical area of a distribution is


two-sided and tests whether a sample is greater than or less than a certain range of
values. It is used in null-hypothesis testing and testing for statistical significance. If the
sample being tested falls into either of the critical areas, the alternative hypothesis is
accepted instead of the null hypothesis.
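The one-tailed procedure can be sketched for the claim in Part B question 6 (a minimal Python sketch; 1.645 is the standard one-tailed critical z at the 0.05 significance level):

```python
import math

def one_sample_z(sample_mean, mu0, sigma, n):
    """Test statistic for a one-sample z test."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

# Part B question 6: claim mu > 80, sigma = 20, n = 75, sample mean = 90
z = one_sample_z(90, 80, 20, 75)
critical = 1.645  # one-tailed critical z at the 0.05 significance level
print(round(z, 3), z > critical)  # z ~ 4.33 > 1.645: reject the null hypothesis
```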

22. State the Central Limit Theorem.

The Central Limit Theorem states that the distribution of a sample variable
approximates a normal distribution (i.e., a bell curve) as the sample size becomes larger,
assuming that all samples are identical in size, and regardless of the population's actual
distribution shape.
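The theorem can be illustrated by simulation (a minimal sketch; the die-roll population and sample sizes are made up for the example):

```python
import random
import statistics

random.seed(0)

# Population: uniform integers 1..6 (a fair die), clearly not bell-shaped
def sample_mean(n):
    return statistics.mean(random.randint(1, 6) for _ in range(n))

# Means of many size-30 samples cluster symmetrically around the
# population mean of 3.5, approximating a normal distribution
means = [sample_mean(30) for _ in range(2000)]
print(round(statistics.mean(means), 2))  # close to 3.5
```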
23. What is confidence interval?

Confidence intervals measure the degree of uncertainty or certainty in a sampling


method. They can take any number of probability limits, with the most common being a
95% or 99% confidence level. Confidence intervals are constructed using statistical
methods, such as a t-test.
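A z-based confidence interval for the mean can be sketched as follows (the sample mean, population SD, and sample size are hypothetical; 1.96 is the standard z for a 95% confidence level):

```python
import math

def confidence_interval(sample_mean, sigma, n, z_conf=1.96):
    """CI for the population mean: sample mean +/- z * standard error.
    z_conf = 1.96 corresponds to a 95% confidence level."""
    margin = z_conf * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical: sample mean 530, population SD 100, n = 25
low, high = confidence_interval(530, 100, 25)
print(round(low, 1), round(high, 1))  # 490.8 569.2
```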

24. What is the formula for the confidence interval for µ (based on z)?

X̄ ± (z_conf)(σ_X̄)

where X̄ is the sample mean, z_conf is the z score for the desired level of confidence,
and σ_X̄ = σ/√n is the standard error of the mean.

25. What is point estimate?

A point estimate is defined as a calculation where a sample statistic is used to estimate


or approximate an unknown population parameter.

26. List the methods to calculate point estimates.

 Method of Moments
 Maximum Likelihood
 Bayes Estimators
 Best Unbiased Estimators

27. What are the properties of point estimators?

 Bias
 Consistency
 Most efficient or unbiased

28. What is the drawback of point estimates? How can it be resolved?

Point estimates convey no information about the degree of inaccuracy due to sampling
variability. Statisticians supplement point estimates with another, more realistic type of
estimate, known as interval estimates or confidence intervals.

29. What is interval estimator?

Interval estimator uses sample data to calculate the interval of the possible values of an
unknown parameter of a population. It gives the range of values for the parameter.
Interval estimates are intervals within which the parameter is expected to fall, with a
certain degree of confidence. The interval of the parameter is selected in a way that it
falls within a 95 % or higher probability, also known as the confidence interval.

The level of confidence indicates the percent of time that a series of confidence intervals
includes the unknown population characteristic, such as the population mean.

Part B & C

1. Discuss on population and samples with suitable examples.

2. Discuss the different types of random sampling techniques.

3. Elaborate on the different types of non-probability based sampling techniques.

4. Illustrate the hypothesis testing with an example.

5. Explain the procedure of z-test with an example.

6. A teacher claims that the mean score of students in the class is greater than 80 with a
standard deviation of 20. If a sample of 75 students was selected with a mean score of
90 then check if there is enough evidence to support this claim at a 0.05 significance
level.

7. An online food delivery company claims that the mean delivery time is less than 30
minutes with a standard deviation of 10 minutes. Is there enough evidence to support
this claim at a 0.05 significance level if 49 orders were examined with a mean of 20
minutes?

8. A company wants to improve the quality of products by reducing defects and


monitoring the efficiency of assembly lines. In assembly line A, there were 9 defects
reported out of 100 samples and in line B, 25 defects out of 600 samples were
identified. Check if there is a difference in the procedures at a 0.05 alpha level?

9. Explain in detail about Estimation and the significance of point estimates.

10. Elaborate on Confidence interval and level of confidence.
