01.ad3491 Fdsa QB
QUESTION BANK
UNIT I
INTRODUCTION TO DATA SCIENCE
Part A
1. Define Data Science.
Data science is an interdisciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge from noisy, structured or unstructured
data and apply knowledge from data across a wide range of application domains.
Data science is related to Data mining, Machine learning and Big data.
2. What is big data?
Big data is a comprehensive term for any collection of large or complex data sets
which is difficult to process using traditional data management techniques such as, for
example, the RDBMS (relational database management systems).
3. What is machine learning?
Machine learning is a branch of artificial intelligence (AI) and computer science
which focuses on the use of data and algorithms to imitate the way that humans learn,
gradually improving its accuracy.
4. What is data mining?
Data mining is the process of reviewing large data sets to identify patterns and
relationships that can solve business problems through data analysis. Data mining
techniques and tools enable enterprises to predict future trends and make
more informed business decisions.
5. What are the characteristics of big data?
The characteristics of big data are:
- Volume - Quantity of data
- Variety - Different types of data
- Velocity - Speed at which new data is generated
- Veracity - Accuracy of data (a fourth V that complements the first three)
6. List the categories of data.
The main categories of data are:
a) Structured Data
b) Unstructured Data
c) Natural language Data
d) Machine-generated Data
e) Graph-based Data
f) Audio, video and images Data
g) Streaming Data
7. List some of the application domains of data science.
The application domains of data science are:
- Health care applications
- Transportation
- Education
- Government Organization
- Commercial applications
8. What is structured data? Give examples.
Structured data is data that depends on a data model and resides in a fixed field
within a record.
Examples: Tables within databases or Excel files.
9. What is unstructured data? Give an example.
Unstructured data is data that is not easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is the regular
email.
10. What is machine generated data?
Machine-generated data is information that is automatically created by a
computer, process, application, or other machine without human intervention.
Machine-generated data is becoming a major data resource. The analysis of machine data
relies on highly scalable tools, due to its high volume and speed.
Examples: Web server logs, call detail records, network event logs, and telemetry.
11. State the importance of setting the research goal.
The goal of a data science project is to fulfill a precise and measurable objective
that is clearly connected to the purpose, workflows, and decision-making processes of
the business. This step defines the scope of the project.
12. List the phases involved in the data science process.
- Setting the research goal
- Retrieving data
- Data Preparation
- Data Exploration
- Data Modeling
- Presentation and automation
13. What is meant by data cleaning?
Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset. When combining
multiple data sources, there are many opportunities for data to be duplicated or
mislabeled.
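The cleaning steps named above can be sketched in a few lines of plain Python. This is only an illustration: the records, the field names and the `clean` helper are hypothetical and not part of the syllabus material.

```python
# Hypothetical records merged from two sources; field names are illustrative only.
records = [
    {"id": 1, "name": "Asha", "age": 21},
    {"id": 2, "name": "Ravi", "age": None},   # incomplete: age is missing
    {"id": 1, "name": "Asha", "age": 21},     # duplicate of the first record
    {"id": 3, "name": " meena ", "age": 23},  # incorrectly formatted name
]


def clean(rows):
    """Drop incomplete and duplicate rows and normalise the name field."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # remove incomplete data
        fixed = {**row, "name": row["name"].strip().title()}  # fix formatting
        key = (fixed["id"], fixed["name"], fixed["age"])
        if key in seen:
            continue  # remove duplicates
        seen.add(key)
        cleaned.append(fixed)
    return cleaned


cleaned_records = clean(records)
print(cleaned_records)
```

In practice, missing values are often imputed rather than dropped; dropping is used here only to keep the sketch short.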
14. What is project charter?
Project charter is a document that lays out the project vision, scope, objectives,
project team, and their responsibilities, key stakeholders, and how it will be carried out
or the implementation plan.
15. Identify the important contents of a project charter.
Project charter must include the following:
- Clear research goal
- Project mission and context
- Performing the analysis
- Resources to be used
- Proof that it is an achievable project, or proof of concepts
- Deliverables and a measure of success
- Timeline
16. List some of the visualization techniques.
- Line graphs
- Histograms
- Bar graphs
- Boxplots
- Sankey and network graphs.
17. List the problems associated with real world data.
The problems with real-world data are:
- Incomplete: Missing attribute values.
- Noisy: Contains errors.
- Inconsistent: Contains discrepancies in codes and names.
18. Define data warehouse, data mart and data lake.
- Data warehouse is constructed by integrating data from multiple
heterogeneous sources that support analytical reporting, structured and/or ad
hoc queries, and decision making.
- Data mart supplies subject-oriented data necessary to support a specific
business unit.
- A data lake stores an organization's raw and processed (unstructured and
structured) data at both large and small scales.
19. List some of the factors involved in selecting the modeling technique.
- Model performance
- Moving the model to a production environment for easy implementation
- Maintenance of the model
20. What is dummy variable?
Dummy variables can only take two values: true (1) or false (0). They are used to
indicate the presence or absence of a categorical effect that may explain the observation.
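As a minimal sketch of the idea, a categorical column can be expanded into one 0/1 dummy column per category. The `dummy_encode` helper and the weekday values are hypothetical names chosen for illustration.

```python
def dummy_encode(values):
    """Turn a categorical column into one 0/1 dummy column per category."""
    categories = sorted(set(values))
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical categorical observations.
weekday = ["Mon", "Tue", "Mon", "Wed"]
dummies = dummy_encode(weekday)
print(dummies)
```

Each observation has exactly one 1 across the dummy columns, marking the category it belongs to.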
21. What do you mean by Exploratory Data Analysis?
Exploratory Data Analysis is the critical process of performing initial
investigations on data so as to discover patterns, to spot anomalies, to test hypotheses
and to check assumptions with the help of summary statistics and graphical
representations. It is used to discover trends and patterns from the dataset.
22. List out the methods for combining data from different tables.
- Joining tables
- Appending tables
- Using views to simulate data joins and appends
- Enriching aggregated measures
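The first two methods in this list, joining and appending, can be sketched with tables held as lists of dictionaries. The table contents and the `join` and `append` helpers are hypothetical; a real project would more likely use a database or a data-frame library.

```python
# Hypothetical tables represented as lists of dictionaries.
clients = [
    {"client_id": 1, "name": "Asha"},
    {"client_id": 2, "name": "Ravi"},
]
purchases = [
    {"client_id": 1, "item": "book"},
    {"client_id": 1, "item": "pen"},
    {"client_id": 2, "item": "bag"},
]
purchases_feb = [{"client_id": 2, "item": "lamp"}]


def join(left, right, key):
    """Inner join: enrich each right-hand row with the matching left-hand row."""
    index = {row[key]: row for row in left}
    return [{**index[row[key]], **row} for row in right if row[key] in index]


def append(first, second):
    """Append (stack) the rows of one table under another with the same columns."""
    return first + second


all_purchases = append(purchases, purchases_feb)
joined = join(clients, all_purchases, "client_id")
for row in joined:
    print(row)
```

Joining adds columns by matching a key; appending adds rows from a table that has the same structure.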
Part B & C
1. Discuss the applications of data science and big data with suitable examples.
2. Illustrate the overview of the data science process.
3. Elaborate any 5 application domains of data science.
4. Describe the categories of data for data mining.
5. Discuss the significance of setting the research goal for the data science project.
6. Discuss the strategies involved in retrieving relevant data from different sources of
data.
7. Explain the different stages of data preparation phase.
8. Elucidate the techniques involved in data cleansing.
9. Illustrate the steps involved in combining data from different data sources.
10. Explain the impact of variable reduction on a data science project, highlighting its pros
and cons.
11. Elaborate on the steps involved in model building with suitable diagrams.
UNIT – II
DESCRIPTIVE ANALYTICS
PART - A
1. What is meant by frequency distribution?
A frequency distribution is a collection of observations produced by sorting
observations into classes and showing their frequency ($f$) of occurrence in each
class.
2. What is Qualitative data? Give example.
Qualitative or Categorical data is a set of observations where any single
observation is a word, letter, or numerical code that represents a class or a category.
Example: Words - Yes or No, Letters - Y or N, Numerical code - 0 or 1
3. What is quantitative data? Give an example.
Quantitative Data is a set of observations where any single observation is a
number that represents an amount or a count. It can be expressed in numerical values,
which make it countable and includes statistical data analysis. It is also known as
numerical data.
Example: Weights: 35, 56, ..., 70 kg
4. Compare Discrete and Continuous Variables.
Discrete Variable is a variable that consists of isolated numbers separated by
gaps.
- Example: Number of students in a class
Continuous Variable is a variable that consists of numbers whose values, at least
in theory, have no restrictions.
- Example: Height of students in a class
5. State the differences between Nominal and Ordinal data.
Nominal Data | Ordinal Data
Cannot be quantified and there is no intrinsic ordering | There is a sequential order by their position on the scale
Qualitative data or categorical data | It is "in-between" qualitative data and quantitative data
They do not provide any quantitative value and cannot perform any arithmetical operation | They provide sequence and can assign numbers to ordinal data but cannot perform arithmetical operation
Cannot be used to compare with one another | Can compare one item with another by ranking or ordering
Examples: Gender, Nationality | Examples: Economic status, customer satisfaction
6. What are the types of Frequency Distribution?
- Grouped frequency distribution
- Ungrouped frequency distribution
- Cumulative frequency distribution
- Relative frequency distribution
- Relative cumulative frequency distribution
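As an illustration of these types, the following sketch builds ungrouped, relative, cumulative and grouped frequency distributions for the ten scores used in Part B, question 5. The class width of 5 for the grouped distribution is an assumption made only for this example.

```python
from collections import Counter
from itertools import accumulate

scores = [139, 145, 150, 145, 136, 150, 152, 144, 138, 138]

# Ungrouped frequency distribution: one class per distinct value.
frequency = dict(sorted(Counter(scores).items()))

# Relative frequency: each class frequency as a proportion of the total.
total = len(scores)
relative = {value: f / total for value, f in frequency.items()}

# Cumulative frequency: running total of frequencies from the lowest class upward.
cumulative = dict(zip(frequency, accumulate(frequency.values())))

# Grouped frequency distribution with an assumed class width of 5.
width = 5
grouped_counts = Counter((score // width) * width for score in scores)
grouped = {
    f"{low}-{low + width - 1}": f for low, f in sorted(grouped_counts.items())
}

print(frequency)
print(relative)
print(cumulative)
print(grouped)
```

Dividing each cumulative frequency by the total would give the relative cumulative frequency distribution.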
7. Define an outlier.
An outlier is a data point that differs significantly from other observations. An
outlier can occur due to variability in the measurement or it may indicate an
experimental error.
8. What is percentile rank?
Percentile Rank (PR) of an observation is the percentage of scores in the entire
distribution with equal or smaller values than that score. Its mathematical formula is
given as:
Where X is the original score and μ and σ are the mean and the standard deviation,
respectively, for the normal distribution of the original scores.
19. How will you convert a z score to original score?
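A short worked derivation, assuming the standard z-score definition in which $X$ is the original score, $\mu$ is the mean and $\sigma$ is the standard deviation of the original scores:

$$
z = \frac{X - \mu}{\sigma}
\quad\Longrightarrow\quad
X = \mu + z\,\sigma
$$

For instance, with an assumed mean of 500 and standard deviation of 100, a z score of 0.5 corresponds to an original score of $500 + (0.5)(100) = 550$.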
Part B & C
1. Explain the different types of frequency distribution with suitable examples and
diagrams.
2. Elaborate the different ways to describe or represent data using tables with suitable
examples.
3. Explain the various ways by which data can be represented or described using graphs
with suitable examples and diagrams.
4. Explain the different measures of central tendency and describe the suitable
measures for the different types of data distribution.
5. Construct the frequency table and draw bar graph and stem and leaf displays for the
following data: 139,145,150,145,136,150,152,144,138,138
6. Construct the histogram and convert it to a frequency polygon for the following data:
138,139,139,145,145,150,145,136,150,152,144,138,138,150,149,133,134,152,
155,151.
7. Compute the mean, median and mode for the following data sets:
- 45,55,60,60,63,63,63,63,65,65,70
- 26.9,26.3,28.7,27.4,26.6,27.4,26.9,26.9
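One possible way to check hand calculations for this exercise is Python's standard `statistics` module. The labels "a" and "b" for the two data sets are chosen here only for illustration.

```python
from statistics import mean, median, multimode

data_sets = {
    "a": [45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70],
    "b": [26.9, 26.3, 28.7, 27.4, 26.6, 27.4, 26.9, 26.9],
}

summary = {
    name: {
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),  # a list, so tied modes are all reported
    }
    for name, values in data_sets.items()
}

for name, stats in summary.items():
    print(name, stats)
```

`multimode` is used instead of `mode` so that a distribution with more than one mode is reported in full.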
8. Explain the various measures of variability with suitable examples.
9. Using the computation formula for the sum of squares, calculate the population
standard deviation and the sample standard deviation for the scores:
- 1,3,7,2,0,4,3,7
- 10,8,5,0,1,7,9,2,1
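The computation formula mentioned in this question is $SS = \sum X^2 - (\sum X)^2 / N$. The sketch below applies it to both sets of scores; the function name `standard_deviations` is hypothetical. The population standard deviation divides $SS$ by $N$, and the sample standard deviation divides it by $n - 1$.

```python
from math import sqrt


def standard_deviations(scores):
    """Return (SS, population SD, sample SD) using the computation formula
    SS = sum(X^2) - (sum(X))^2 / N."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x * x for x in scores)
    ss = sum_x_squared - sum_x ** 2 / n
    population_sd = sqrt(ss / n)    # population: divide SS by N
    sample_sd = sqrt(ss / (n - 1))  # sample: divide SS by n - 1
    return ss, population_sd, sample_sd


for scores in ([1, 3, 7, 2, 0, 4, 3, 7], [10, 8, 5, 0, 1, 7, 9, 2, 1]):
    ss, sigma, s = standard_deviations(scores)
    print(f"SS={ss:.3f}  population SD={sigma:.3f}  sample SD={s:.3f}")
```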
10. Consider the test scores approximating a normal curve with a mean of 500 and a
standard deviation of 100. Sketch a normal curve and shade in the target area
described by the following:
- more than 550
- less than 525
- between 520 and 540
Plan solutions for the target areas. Convert to z scores and find proportions that
correspond to the target areas.
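The proportions for these target areas can be cross-checked with `statistics.NormalDist` from the Python standard library instead of a printed z table. This is a sketch under the stated assumption of a normal curve with mean 500 and standard deviation 100; the `z_score` helper is a hypothetical name.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)


def z_score(x):
    """Convert an original score to a z score for this distribution."""
    return (x - scores.mean) / scores.stdev


# Proportion above 550: area to the right of z = 0.5.
more_than_550 = 1 - scores.cdf(550)

# Proportion below 525: area to the left of z = 0.25.
less_than_525 = scores.cdf(525)

# Proportion between 520 and 540: area between z = 0.2 and z = 0.4.
between_520_and_540 = scores.cdf(540) - scores.cdf(520)

print(z_score(550), round(more_than_550, 4))
print(z_score(525), round(less_than_525, 4))
print(z_score(520), z_score(540), round(between_520_and_540, 4))
```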
11. Elaborate in detail the significance of correlation and the various types of
correlation.
12. What are scatterplots? Illustrate on the various types with suitable examples.
13. Elaborate on the correlation coefficient r. Compare the various correlation
coefficients.
14. Calculate and analyze the correlation coefficient between the number of study hours
and the number of sleeping hours of different students.
Number of Study Hours 2 4 6 8 10
Number of Sleeping Hours 10 9 8 7 6
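For the table above, the Pearson correlation coefficient can be computed from the sum of products and the two sums of squares. The sketch assumes that Pearson's r is the coefficient intended; `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

study_hours = [2, 4, 6, 8, 10]
sleeping_hours = [10, 9, 8, 7, 6]


def pearson_r(x, y):
    """Pearson r = SP_xy / sqrt(SS_x * SS_y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp_xy / sqrt(ss_x * ss_y)


r = pearson_r(study_hours, sleeping_hours)
print(r)
```

Every additional 2 hours of study in this table goes with exactly 1 hour less sleep, so the points lie on a straight line with negative slope and the relationship is a perfect negative correlation.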
18. Explain the significance of regression line and Least squares regression equation.
19. Find the standard error of the estimate of the mean weight of high school football
players using the given data on the weights of the players.
Player Number Weight in Pounds
1 150
2 203
3 176
4 190
5 168
6 193
7 189
8 178
9 197
10 172
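A possible way to check the answer for this exercise, assuming the population standard deviation is unknown and is therefore estimated by the sample standard deviation (with $n - 1$ in the denominator):

```python
from math import sqrt
from statistics import mean, stdev

weights = [150, 203, 176, 190, 168, 193, 189, 178, 197, 172]

sample_mean = mean(weights)
sample_sd = stdev(weights)  # sample standard deviation, n - 1 in the denominator

# Estimated standard error of the mean: s / sqrt(n).
standard_error = sample_sd / sqrt(len(weights))

print(sample_mean, round(sample_sd, 3), round(standard_error, 3))
```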
INFERENTIAL STATISTICS
Part A
Example: All students in a college form a population; similarly, all people living in
India form the population of India.
A real population is one in which all potential observations are accessible at the time of
sampling.
Examples of real populations: The ages of all visitors to a Park on a given day, the
ethnicity of all employees in an organization.
Finite Population
Infinite Population
Existent Population
Hypothetical Population
A population whose units are not available in concrete form is known as a
hypothetical population. A population consists of sets of observations, objects, etc. that
all have something in common. In some situations, the populations are only hypothetical.
5. Define Sample.
Probability sampling
Non-probability sampling
Population | Sample
Measurable quantity is called parameter | Measurable quantity is called statistic
It is a complete set | It is a subset of the population
It is the true representation of opinion | It has margin of error and confidence interval
Data collection is by complete enumeration or census | Data collection is by means of sampling or sample survey
Convenience sampling
Quota sampling
Purposive sampling
Snowball sampling
Snowball sampling is also known as referral sampling. This technique helps researchers
find sample members when they are difficult to locate. Researchers use this technique when
the sample size is small and members are not easily available.
12. What is the difference between non-probability sampling and probability
sampling?
The sample sizes are in hundreds or thousands for surveys, but less than 100 for most
experiments. Optimal sample size depends on the estimated variability among
observations and the acceptable amount of error in the conclusion.
Groups rather than individual units of the target population are selected at random.
Cluster sampling is similar to stratified sampling, except that the population is divided into
a large number of subgroups (for example, hundreds of thousands of strata or
subgroups). After that, some of these subgroups are chosen at random and simple
random samples are then gathered within these subgroups. These subgroups are
known as clusters. It is mainly used to reduce the cost of data compilation.
It helps to reduce the bias involved in the sample compared to other methods of
sampling and it is considered as a fair method of sampling.
This method does not require any technical knowledge, as it is a fundamental
method of collecting the data.
The data collected through this method is well informed.
As the population size is large in the simple random sampling method,
researchers can create the sample size that they want.
It is easy to pick the smaller sample size from the existing larger population.
The standard error of the mean equals the standard deviation of the population divided
by the square root of the sample size. It is a rough measure of the average amount by
which sample means deviate from the mean of the sampling distribution or from the
population mean.
A one-tailed test is a statistical test in which the critical area of a distribution is one-
sided so that it is either greater than or less than a certain value, but not both. If the
sample being tested falls into the one-sided critical area, the alternative hypothesis will
be accepted instead of the null hypothesis.
The Central Limit Theorem states that the sampling distribution of the mean
approximates a normal distribution (i.e., a bell curve) as the sample size becomes larger,
assuming that all samples are identical in size, and regardless of the shape of the
population's actual distribution.
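A small simulation can illustrate the theorem. The sketch below is only a demonstration under assumed settings: the population is uniform on [0, 1), each sample has 30 observations, and 2000 samples are drawn with a fixed random seed. None of these values come from the question bank.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable

# A clearly non-normal population: uniform on [0, 1), mean 0.5, SD = sqrt(1/12).
population_mean = 0.5
population_sd = sqrt(1 / 12)
sample_size = 30
number_of_samples = 2000

# Draw many samples of identical size and record the mean of each one.
sample_means = [
    mean([random.random() for _ in range(sample_size)])
    for _ in range(number_of_samples)
]

observed_center = mean(sample_means)
observed_se = pstdev(sample_means)
expected_se = population_sd / sqrt(sample_size)  # standard error of the mean

print(round(observed_center, 3), round(observed_se, 4), round(expected_se, 4))
```

The sample means cluster around the population mean, and their spread is close to the population standard deviation divided by the square root of the sample size, even though the population itself is not bell shaped.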
23. What is confidence interval?
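24. What is the formula for the confidence interval for µ (based on z )?
$\bar{X} \pm z_{\text{conf}} \, \sigma_{\bar{X}}$
Where $\bar{X}$ is the sample mean, $z_{\text{conf}}$ is the z value for the chosen level of
confidence (for example, 1.96 for 95 %), and $\sigma_{\bar{X}} = \sigma / \sqrt{n}$ is the standard
error of the mean.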
Method of Moments
Maximum Likelihood
Bayes Estimators
Best Unbiased Estimators
Bias
Consistency
Most efficient or unbiased
Point estimates convey no information about the degree of inaccuracy due to sampling
variability. Statisticians supplement point estimates with another, more realistic type of
estimate, known as interval estimates or confidence intervals.
An interval estimator uses sample data to calculate an interval of possible values of an
unknown parameter of a population. It gives a range of values for the parameter.
Interval estimates are intervals within which the parameter is expected to fall, with a
certain degree of confidence. The interval is constructed so that it contains the
parameter with a stated probability, typically 95 % or higher; such an interval is known
as a confidence interval.
The level of confidence indicates the percent of time that a series of confidence intervals
includes the unknown population characteristic, such as the population mean.
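As a hedged sketch, a z-based confidence interval for the population mean can be computed as the sample mean plus or minus a z value times the standard error of the mean. The function name `z_confidence_interval` and the numbers in the usage example are hypothetical, and the population standard deviation is assumed to be known.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, population_sd, n, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval for the mean."""
    # Two-sided interval: split the remaining probability equally between both tails.
    z_conf = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_conf * population_sd / sqrt(n)  # z value times the standard error
    return sample_mean - margin, sample_mean + margin


# Hypothetical numbers: sample mean 533, known sigma 110, sample size 100.
lower, upper = z_confidence_interval(sample_mean=533, population_sd=110, n=100)
print(round(lower, 2), round(upper, 2))
```

Raising the level of confidence (for example, from 95 % to 99 %) widens the interval, while a larger sample size narrows it.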
Part B & C
6. A teacher claims that the mean score of students in the class is greater than 80 with a
standard deviation of 20. If a sample of 75 students was selected with a mean score of
90 then check if there is enough evidence to support this claim at a 0.05 significance
level.
7. An online food delivery company claims that the mean delivery time is less than 30
minutes with a standard deviation of 10 minutes. Is there enough evidence to support
this claim at a 0.05 significance level if 49 orders were examined with a mean of 20
minutes?
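Questions 6 and 7 can both be approached with a one-tailed z test. The sketch below is one possible setup, not a model answer: it assumes the stated standard deviations are population values, takes the null hypothesis mean to be the claimed boundary value (80 and 30), and uses a hypothetical helper named `one_tailed_z_test`.

```python
from math import sqrt
from statistics import NormalDist


def one_tailed_z_test(sample_mean, claimed_mean, population_sd, n, alpha, tail):
    """Return (z, critical z, reject_null) for a one-tailed z test of a mean."""
    z = (sample_mean - claimed_mean) / (population_sd / sqrt(n))
    standard_normal = NormalDist()
    if tail == "right":
        # Alternative hypothesis: the mean is greater than the claimed value.
        critical = standard_normal.inv_cdf(1 - alpha)
        reject_null = z > critical
    else:
        # Alternative hypothesis: the mean is less than the claimed value.
        critical = standard_normal.inv_cdf(alpha)
        reject_null = z < critical
    return z, critical, reject_null


# Question 6: claim that the mean score is greater than 80 (right-tailed).
teacher = one_tailed_z_test(90, 80, 20, 75, 0.05, "right")

# Question 7: claim that the mean delivery time is less than 30 minutes (left-tailed).
delivery = one_tailed_z_test(20, 30, 10, 49, 0.05, "left")

print(teacher)
print(delivery)
```

In both cases the observed z falls in the critical region at the 0.05 level, so under these assumptions the null hypothesis is rejected and the sample supports the claim.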