
UNIT – 1

1. What is Data Science?


Data science is the area of study that involves methods for analyzing massive amounts of data and extracting knowledge from the
data that are gathered.
2. Explain the benefits of using statistics in data science.
Statistics helps data scientists get a better idea of customers' expectations. Using statistical methods, data scientists
can gain knowledge about consumer interest, behavior, engagement, retention, etc. Statistics also helps build powerful data models to
validate certain inferences and predictions.
3. What are the needs of data science?
The basic needs of data science are:
● Better Decision Making
● Predictive Analysis
● Pattern Discovery
4. List out the various fields where data science is used.
Data science is used almost everywhere, in both commercial and non-commercial settings. Some of the fields where data science is used
are
⮚ Healthcare industry

⮚ Retailers

⮚ Financial sectors

⮚ Transportation

⮚ Government sectors

⮚ Universities
5. What are the three sub-phases of data preparation?
The three sub-phases of data preparation are:
● Data cleaning
● Data Integration.
● Data Transformation.
6. What is data cleaning?
Data cleaning is the removal of missing values, false values, and inconsistencies across data sources.
7. Define Streaming data.
Data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in
small sizes. Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
8. What is a Pareto chart?
• It is a graph that indicates the frequency of defects, as well as their cumulative impact.
• Pareto charts are useful to find the defects to prioritize in order to observe the greatest overall improvement.
• It is a combination of a bar graph and line graph.
9. What are recommender systems?
Recommender systems are a subclass of information filtering systems, used to predict how a user would rate or score particular objects
(movies, music, merchandise, etc.). Recommender systems filter large volumes of information based on the data provided by a user
and other factors.
Recommender systems utilize algorithms that optimize the analysis of the data to build the recommendations.
10. What are various forms of data used in data science?
• The main categories of data are:
⮚ Structured data

⮚ Unstructured data

⮚ Natural language

⮚ Machine-generated

⮚ Graph-based

⮚ Audio, video and images

⮚ Streaming
11. Explain how a recommender system works.
A recommender system is a system that many consumer-facing, content-driven, online platforms employ to generate
recommendations for users from a library of available content.
These systems generate recommendations based on what they know about the user's tastes from their activities on the platform.
12. List out the steps involved in data science process.
1. Setting the research goal
2. Retrieving data
3. Data Preparation
4. Data Exploration
5. Data Mining
6. Presentation and automation
13. Mention the inputs covered inside the project charter.
● A clear research goal
● The project mission and context
● How you’re going to perform your analysis
● What resources you expect to use
● Proof that it’s an achievable project, or proof of concepts
● Deliverables and a measure of success
● A timeline
14. List out the various open data site providers.
● Data.gov: the home of the US Government's open data
● https://open-data.europa.eu/: the home of the European Commission's open data
● Freebase.org: an open database that retrieves its information from sites like Wikipedia, MusicBrainz, and the SEC archive
● Data.worldbank.org: open data initiative from the World Bank
● Aiddata.org: open data for international development
● Open.fda.gov: open data from the US Food and Drug Administration

15. Define the techniques to handle missing data.


● Omit the values
● Set value to null
● Impute a static value such as 0 or the mean
● Impute a value from an estimated or theoretical distribution.
● Modeling the value (nondependent)
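
A minimal sketch of some of these options using pandas; the column name "age" and the tiny DataFrame are made up for illustration:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"age": [25, np.nan, 31, np.nan, 40]})
    dropped = df.dropna(subset=["age"])                  # omit the rows with missing values
    zero_filled = df.fillna({"age": 0})                  # impute a static value such as 0
    mean_filled = df.fillna({"age": df["age"].mean()})   # impute the mean
    # Imputing from an estimated distribution, e.g. a normal fit to the observed values
    mu, sigma = df["age"].mean(), df["age"].std()
    sampled = df["age"].fillna(pd.Series(np.random.normal(mu, sigma, len(df))))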

16. What is an outlier?


• An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a
different logic or generative process than the other observations.
• The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.

17.What are the different types of Recommender systems?


The two types of recommender systems are
Collaborative filtering – Collaborative filtering is a method of making automatic predictions by using the recommendations of other people.
Content-Based filtering – It is based on the description of an item and a user's choice. As the name suggests, it uses content (keywords)
to describe the items, and the user profile is built to state the type of item this user likes.
18. What are the steps involved in model building?
The model building consists of the following steps such as
a. Selection of a modeling technique and variables to enter in the model
b. Execution of the model
c. Diagnosis and model comparison.
19. What are the various operations involved in combining data?
There are two operations to combine information from different data sets:
• The first operation is joining: enriching an observation from one table with information from another table.
• The second operation is appending or stacking: adding the observation of one table to those of another table.
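
A hedged illustration of both operations with pandas; the table and column names are invented for the example:

    import pandas as pd

    customers = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
    orders = pd.DataFrame({"id": [1, 2], "amount": [250, 120]})
    more_customers = pd.DataFrame({"id": [3], "name": ["Cara"]})

    joined = customers.merge(orders, on="id")          # joining: enrich one table with another
    appended = pd.concat([customers, more_customers])  # appending/stacking: add observations
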
20. What is the difference between a bar graph and a histogram.
Bar charts and histograms can both be used to compare the sizes of different groups. A bar chart is made up of bars plotted on a
chart. A histogram is a graph that represents a frequency distribution; the heights of the bars represent observed frequencies.
21. How does data cleaning play a vital role in the analysis?
Data cleaning can help in analysis because:
● Cleaning data from multiple sources helps transform it into a format that data analysts or data scientists can work with.
● Data cleaning helps to increase the accuracy of the model in machine learning.
● It is a cumbersome process because as the number of data sources increases, the time taken to clean the data increases exponentially
due to the number of sources and the volume of data generated by these sources.
22. What is Presentation and automation?
● Presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other
tools.

23. Difference between Data Science and Machine Learning.


● Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge from
structured and unstructured data; Machine Learning is the scientific study of algorithms and statistical models.
● Data Science helps you create insights from data while dealing with real-world complexities; Machine Learning helps you predict
outcomes for new data points from historical data with the help of mathematical models.
● Data Science is a complete process; Machine Learning is a single step in the entire data science process.
● Data Science is not a subset of Artificial Intelligence (AI); Machine Learning technology is a subset of AI.
● In Data Science, high RAM and SSDs are used to overcome I/O bottleneck problems; in Machine Learning, GPUs are used for
intensive vector operations.
24. Define Data Preparation.
• Data preparation is the process of cleaning and transforming raw data prior to processing and analysis.
• It is an important step prior to processing and often involves reformatting data, making corrections to data and the combining of
data sets to enrich data.
25. Define Box plots.
• Box plots are a standardized way of displaying the distribution of data based on a five-number summary ("minimum",
first quartile (Q1), median, third quartile (Q3) and "maximum").
• Median: the middle value of a data set.
• First quartile: the middle number between the smallest number and the median.
• Third quartile: the middle number between the median and the highest value of the data set.
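
A small sketch computing the five-number summary behind a box plot; the data values are arbitrary:

    import numpy as np

    data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
    minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
    print(minimum, q1, median, q3, maximum)
    # matplotlib's plt.boxplot(data) would draw the corresponding box plot
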
26. What is brushing and linking?
• With brushing and linking you combine and link different graphs and tables (or views) so changes in one graph are automatically
transferred to the other graphs.

UNIT -2
1.Discuss the differences between the frequency table and the frequency distribution table?
The frequency table is said to be a tabular method where each part of the data is assigned to its corresponding frequency. Whereas,
a frequency distribution is generally the graphical representation of the frequency table.
2.What are the numerous types of frequency distributions?
Different types of frequency distributions are as follows:
1. Grouped frequency distribution.
2. Ungrouped frequency distribution.
3. Cumulative frequency distribution.
4. Relative frequency distribution.
5. Relative cumulative frequency distribution, etc.
3. What are some characteristics of the frequency distribution?
Some major characteristics of the frequency distribution are given as follows:
1. Measures of central tendency and location i.e. mean, median, and mode.
2. Measures of dispersion i.e. range, variance, and the standard deviation.
3. The extent of the symmetry or asymmetry i.e. skewness.
4. The flatness or the peakedness i.e. kurtosis.
4.What is the importance of frequency distribution?
The value of the frequency distributions in statistics is excessive. A well-formed frequency distribution creates the possibility of a
detailed
analysis of the structure of the population. So, the groups where the population breaks down are determinable.
5.What is frequency distribution?
A frequency distribution is a collection of observations produced by sorting observations
into classes and showing their frequency (f ) of occurrence in each class.
6.Essential guidelines for frequency distribution.
● Each observation should be included in one, and only one, class.
● List all classes, even those with zero frequencies.
● All classes should have equal intervals.
7. What are real limits?
The real limits are located at the midpoint of the gap between adjacent tabled boundaries; that is, one-half of one unit of
measurement below the lower tabled boundary and one-half of one unit of measurement above the upper tabled boundary.
8.Define Relative frequency distributions.
Relative frequency distributions show the frequency of each class as a part or fraction of the total frequency for the entire
distribution.
9. How do you convert a frequency distribution into a relative frequency distribution?
To convert a frequency distribution into a relative frequency distribution, divide the frequency for each class by the total frequency
for the entire distribution.
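
For example, assuming the class frequencies are stored in a dictionary (the classes and counts are hypothetical), the conversion is a single division:

    freqs = {"0-9": 4, "10-19": 10, "20-29": 6}      # hypothetical class frequencies
    total = sum(freqs.values())
    relative = {cls: f / total for cls, f in freqs.items()}
    print(relative)   # {'0-9': 0.2, '10-19': 0.5, '20-29': 0.3}
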
10. Define Cumulative frequency distribution.
Cumulative frequency distributions show the total number of observations in each class and in all lower-ranked classes.
11. What are percentile ranks?
The percentile rank of a score indicates the percentage of scores in the entire distribution with similar or smaller values than that
score.
Thus a weight has a percentile rank of 80 if equal or lighter weights constitute 80 percent of the entire distribution.
12.List some of the features of histogram.
● Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class intervals of the frequency distribution.
● Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency. (The units along the vertical axis do not
have to be the same width as those along the horizontal axis.)
● The intersection of the two axes defines the origin at which both numerical scales equal 0.
● Numerical scales always increase from left to right along the horizontal axis and from bottom to top along the vertical axis.
● The body of the histogram consists of a series of bars whose heights reflect the frequencies for the various classes.
13. Define Stem and leaf display.
A device for sorting quantitative data on the basis of leading and trailing digits.

14. Difference between positively skewed and negatively skewed distribution.


Positively Skewed Distribution
A distribution that includes a few extreme observations in the positive direction (to the right of the majority of observations).
Negatively Skewed Distribution
A distribution that includes a few extreme observations in the negative direction (to the left of the majority of observations).
15. Define mode.
The mode reflects the value of the most frequently occurring score.
16. Define multi mode.
Distributions can have more than one mode (or no mode at all). Distributions with two obvious peaks, even though they are not
exactly the same height, are referred to as bimodal. Distributions with more than two peaks are referred to as multimodal.
17. Determine the mode for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63.
Answer: 63
18. Define median.
The median reflects the middle value when observations are ordered from least to most.
19. List out the steps to find the median values.
1. Order scores from least to most.
2. Find the middle position by adding one to the total number of scores and dividing by 2.
3. If the middle position is a whole number, use this number to count into the set of ordered scores.
4. The value of the median equals the value of the score located at the middle position.
5. If the middle position is not a whole number, use the two nearest whole numbers to count into the set of ordered scores.
6. The value of the median equals the value midway between those of the two middlemost scores; to find the midway value, add the
two middlemost values and divide by 2.
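
A direct translation of these steps into Python, using the retirement ages from the next question as sample data:

    import math

    ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
    ordered = sorted(ages)                    # step 1: order scores from least to most
    middle = (len(ordered) + 1) / 2           # step 2: middle position = (11 + 1) / 2 = 6.0
    if middle.is_integer():                   # whole-number position: take that score
        median = ordered[int(middle) - 1]
    else:                                     # otherwise average the two middlemost scores
        lo, hi = math.floor(middle), math.ceil(middle)
        median = (ordered[lo - 1] + ordered[hi - 1]) / 2
    print(median)                             # 63
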
20. Find the median for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63.
median = 63
21. When do we use the median value.
The median can be used whenever it is possible to order qualitative data from least to most because the level of measurement is
ordinal.
22. What is Range.
The range is the difference between the largest and smallest scores.
23.What is Degrees of freedom (df).
Degrees of freedom (df) refers to the number of values that are free to vary, given one or more mathematical restrictions, in a
sample being used to estimate a population characteristic.
24.How do you calculate IQR.
1. Order scores from least to most.
2. To determine how far to penetrate the set of ordered scores, begin at either end, then add 1 to the total number of scores and
divide by 4. If necessary, round the result to the nearest whole number.
3. Beginning with the largest score, count the requisite number of steps (calculated in step 2) into the ordered scores to find the
location of the third quartile.
4. The third quartile equals the value of the score at this location.
5. Beginning with the smallest score, again count the requisite number of steps into the ordered scores to find the location of the
first quartile.
6. The first quartile equals the value of the score at this location.
7. The IQR equals the third quartile minus the first quartile.
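
In practice the IQR is usually computed directly from the quartiles, as in this sketch with arbitrary scores (note that np.percentile interpolates, so its result can differ slightly from the hand-counting rule above):

    import numpy as np

    scores = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])  # arbitrary example scores
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    print(q1, q3, iqr)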

25. Difference between qualitative and quantitative data?


● Definition: Quantitative data are the result of counting or measuring attributes of a population; qualitative data are the results of
categorizing or describing attributes of a population.
● Data that you will see: Quantitative data are always numbers; qualitative data are generally described by words or letters.
● Examples: quantitative (amount of money you have, height, weight, number of people living in your town) versus qualitative
(hair color, blood type, ethnic group).

26. Why n-1 is used in calculating SD for sample?


● Using n − 1 in the denominator of the sample SD and variance calculation solves a problem in inferential statistics
associated with generalizations from samples to populations.
● The adequacy of these generalizations usually depends on accurately estimating unknown variability in the population with
known variability in the sample. But if we were to use n rather than n − 1 in the denominator of our estimates, they would
tend to underestimate variability in the population because n is too large.
● This tendency would compromise any subsequent generalizations, such as whether observed mean differences are real or
merely transitory.
● On the other hand, when the denominator is made smaller by using n − 1, variability in the population is estimated more
accurately, and subsequent generalizations are more likely to be valid.
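
NumPy makes the distinction explicit through the ddof argument; a quick sketch with made-up sample values:

    import numpy as np

    sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])
    population_style = sample.std(ddof=0)  # divides by n: tends to underestimate the population SD
    sample_style = sample.std(ddof=1)      # divides by n - 1: the usual estimate from a sample
    print(population_style, sample_style)  # 2.0  ~2.14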

27. Employees of Corporation A earn annual salaries described by a mean of $90,000 and a standard deviation of $10,000.
(a) The majority of all salaries fall between what two values?
(b) A small minority of all salaries are less than what value?
(c) A small minority of all salaries are more than what value?
(d) Answer parts (a), (b), and (c) for Corporation B’s employees, who earn annual salaries described by a mean of $90,000 and a standard
deviation of $2,000.
Answer:
(a) $80,000 to $100,000
(b) $70,000
(c) $110,000
(d) $88,000 to $92,000; $86,000; $94,000
28.Define Normal curve.
A theoretical curve noted for its symmetrical bell-shaped form.
29. List some of the properties of the normal curve.
● Obtained from a mathematical equation, the normal curve is a theoretical curve defined for a continuous variable.
● Because the normal curve is symmetrical, its lower half is the mirror image of its upper half.
● Being bell shaped, the normal curve peaks above a point midway along the horizontal spread and then tapers off gradually in either
direction from the peak (without actually touching the horizontal axis, since, in theory, the tails of a normal curve extend infinitely
far).
● The values of the mean, median (or 50th percentile), and mode, located at a point midway along the horizontal spread, are the same
for the normal curve.

30. Define z scores


A z score is a unit-free, standardized score that, regardless of the original units of measurement, indicates how many standard
deviations a score is above or below the mean of its distribution.
31.How to find the proportion for one score.
● Sketch a normal curve and shade in the target area.
● Plan your solution according to the normal table.
● Convert X to z.
● Find the target area.
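
A hedged sketch of these steps with scipy; the mean, standard deviation, and target score are assumed values for illustration:

    from scipy.stats import norm

    mu, sigma, x = 500, 100, 650        # assumed distribution parameters and target score
    z = (x - mu) / sigma                # convert X to z
    proportion_below = norm.cdf(z)      # area to the left of z under the standard normal curve
    proportion_above = 1 - norm.cdf(z)  # area to the right (the target area for an upper tail)
    print(z, proportion_below, proportion_above)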

32.What is Standard score.


Any unit-free scores expressed relative to a known mean and a known standard deviation are referred to as standard scores.
Although z scores qualify as standard scores because they are unit-free and expressed relative to a known mean of 0 and a known
standard deviation of 1, other scores also qualify as standard scores.

33.What is Transformed Standard Score.


Particularly when reporting test results to a wide audience, z scores can be changed to transformed standard scores, other types of
unit-free standard scores that lack negative signs and decimal points. These transformations change neither the shape of the
original distribution nor the relative standing of any test score within the distribution.

34.Define Scatterplots.
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores. With a little training, you can use any dot
cluster as a preview of a fully measured relationship.
35.Define correlation coefficient.
A correlation coefficient is a number between –1 and 1 that describes the relationship between pairs of variables.
36. Specify the properties of correlation coefficient.
Two properties are:
1. The sign of r indicates the type of linear relationship, whether positive or negative.
2. The numerical value of r, without regard to sign, indicates the strength of the linear relationship.
37. Define least square regression equation.
The equation that minimizes the total of all squared prediction errors for known Y scores in the original correlation analysis.
Example: Assume that an r of .30 describes the relationship between educational level (highest grade completed) and estimated number of
hours spent reading each week. More specifically:
educational level (X): mean = 13, SSx = 25
weekly reading time (Y): mean = 8, SSy = 50
r = .30

(a) Determine the least squares equation for predicting weekly reading time from educational
level.
Answer:
b = r√(SSy/SSx) = (.30)√(50/25) = .42; a = 8 – (.42)(13) = 2.54
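
The same computation in Python, following the formulas b = r·√(SSy/SSx) and a = Ȳ − b·X̄ (keeping b unrounded gives a ≈ 2.48; rounding b to .42 first reproduces the 2.54 above):

    import math

    x_mean, y_mean = 13, 8
    ss_x, ss_y = 25, 50
    r = 0.30

    b = r * math.sqrt(ss_y / ss_x)   # slope: 0.30 * sqrt(2) ≈ 0.42
    a = y_mean - b * x_mean          # intercept
    predict = lambda x: a + b * x    # least squares equation Y' = a + bX
    print(round(b, 2), round(a, 2), round(predict(15), 2))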

38. What is Standard Error of Estimate.


A rough measure of the average amount of predictive error.

39. What is the squared correlation coefficient (r²)?


The square of the correlation coefficient, r2, always indicates the proportion of total variability in one variable that is
predictable from its relationship with the other variable.
40. What is Multiple Regression Equation
A least squares equation that contains more than one predictor or X variable.

UNIT -3

1. Where have you used hypothesis testing in your machine learning solution?
Hypothesis testing is a statistical analysis in which we test an assumption made about a particular situation.
When testing an assumption that was claimed to be true, we perform hypothesis testing where the null hypothesis
states that the claim is true and the alternative hypothesis states that the claim is false.

2. Which type of error is the more severe error, Type 1 or Type 2? And why, with an example.
It depends on the problem statement we are looking into.
For example:
In a confusion matrix for disease versus treatment, a false negative is fatal (when the patient has the disease but the
model predicts that the patient does not have it); in that case the patient will not get the treatment and might lose his/her life.
Similarly, in a criminal guilty-or-innocent case, a false positive is much worse (when the person is innocent but the model predicts
the person is guilty), as we will end up punishing an innocent person.
3. What is the most significant benefit of hypothesis testing?
The most significant benefit of hypothesis testing is that it allows you to evaluate the strength of your claim or assumption before
implementing it in your data set. Also, hypothesis testing is the only valid method to prove that something "is or is not".

4. List the benefits of hypothesis testing.


● Hypothesis testing provides a reliable framework for making any data decisions for your population of interest.
● It helps the researcher to successfully extrapolate data from the sample to the larger population.
● Hypothesis testing allows the researcher to determine whether the data from the sample is statistically significant.
● Hypothesis testing is one of the most important processes for measuring the validity and reliability of outcomes in any
systematic investigation.
● It helps to provide links to the underlying theory and specific research questions.

5. Mention some of the Limitations of Hypothesis Testing.


● The interpretation of a p-value for an observation depends on the stopping rule and definition of multiple comparisons. This makes it
difficult to calculate, since the stopping rule is subject to numerous interpretations, plus "multiple comparisons" are unavoidably
ambiguous.
● Conceptual issues often arise in hypothesis testing, especially if the researcher merges Fisher and Neyman-Pearson's methods,
which are conceptually distinct.
● In an attempt to focus on the statistical significance of the data, the researcher might ignore the estimation and confirmation by
repeated experiments.
● Hypothesis testing can trigger publication bias, especially when it requires statistical significance as a criterion for publication.
● When used to detect whether a difference exists between groups, hypothesis testing can trigger absurd assumptions that affect the
reliability of your observation.

6. How Hypothesis Testing Works?


The basis of hypothesis testing is to examine and analyze the null hypothesis and alternative hypothesis to know which one is
the most plausible assumption. Since both assumptions are mutually exclusive, only one can be true. In other words, the occurrence
of a null hypothesis destroys the chances of the alternative coming to life, and vice-versa.

7.Define One-Tailed Test.


A one-tailed test is a statistical hypothesis test in which the critical area of a distribution is one-sided so that it is either greater than or
less than a certain value, but not both. If the sample being tested falls into the one-sided critical area, the alternative hypothesis will
be
accepted instead of the null hypothesis. A one-tailed test is also known as a directional hypothesis or directional test.

8. Define Two-Tailed Test.


A two-tailed test is a method in which the critical area of a distribution is two-sided and tests whether a sample is greater than or
less than a certain range of values. If the sample being tested falls into either of the critical areas, the alternative hypothesis is
accepted
instead of the null hypothesis.

9.What is test statistic ?


The test statistic measures how close the sample has come to the null hypothesis. Its observed value changes randomly from one
random
sample to a different sample. A test statistic contains information about the data that is relevant for deciding whether to reject the null
hypothesis or not.

10.What is meant by level of significance?


The significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05
indicates a 5% risk of concluding that a difference exists when there is no actual difference.

11. What is sample size?


Sample size refers to the number of participants or observations included in a study. This number is usually represented by n.

The size of a sample influences two statistical properties: 1) the precision of our estimates and 2) the power of the study to draw
conclusions.

12. When do you reject the null hypothesis?


We reject the null hypothesis, including all its assumptions, when it is inconsistent with the observed data. For example this
inconsistency may be determined through statistical analysis and modeling. Typically, if a statistical analysis produces a p-value
that is below the significance level (α) cut-off value we have set (e.g., generally either 0.05 or 0.01), we reject the null hypothesis.
If the p-value is above the statistical level cut-off value, we fail to reject the null hypothesis.
Note that if the result is not statistically significant this does not prove that the null hypothesis is true. Data can suggest that

the null hypothesis is false but just may not be strong enough to make a sufficiently convincing case that the null hypothesis is
false.

Also note that rejecting the null hypothesis is not the same as showing real-world significance.
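
A minimal sketch of this decision rule using a one-sample t-test from scipy; the sample values and hypothesized mean are invented for illustration:

    from scipy import stats

    sample = [5.1, 4.9, 5.6, 5.3, 5.8, 5.0, 5.4]   # hypothetical measurements
    alpha = 0.05                                   # chosen significance level
    t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

    if p_value < alpha:
        print("Reject the null hypothesis")
    else:
        print("Fail to reject the null hypothesis")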

13. What is power?


Power is defined as the probability of rejecting the null hypothesis given that it is false. Power requires the specification of an exact
alternative hypothesis value.
The power of a hypothesis test equals the probability (1 − β) of detecting a particular effect when the null hypothesis (H0) is false.
Power is simply the complement (1 − β) of the probability (β) of failing to detect the effect, that is, the complement of the probability
of a type II error.

14. Explain Power with an example.


In clinical trials, it is the probability that the trial will be able to detect a true effect of the treatment of a specified size or larger.
Practically, it is the likelihood that a test will detect the specified effect or greater when that effect exists to be detected i.e. achieve a
significant
p-value in the presence of a real effect.

15. Why does power increase with sample size?


As the sample size increases, so does the power of the significance test. This is because a larger sample size constricts the
distribution of the test statistic. This means that the standard error of the distribution is reduced and the acceptance region is reduced
which in turn increases the level of power. Acceptance region here refers to the range of values in which the test statistic may fall
where one fails to reject the Null Hypothesis. As this region becomes smaller, the probability that a false Null hypothesis will be
rejected increases.
Sample size also strongly influences the P-value of a test. An effect that fails to be significant at a specified level of significance in

a small sample can be significant in a larger sample.

16. What is the effect size?


The effect size is the difference in the primary outcome value that the clinical trial or study is designed to detect. In general, the
greater
the true effect of the treatment the easier it is to detect this difference using the sample selected for your trial. As a result, a larger
effect size will also increase your power i.e. you will have more power to detect a larger effect size and have smaller power to detect
a
smaller effect size.
There are two types of effect sizes: standardized effect sizes and unstandardized effect sizes.

17. What is the relationship between effect size and power?


To determine the required sample size to achieve the desired study power, or to determine the expected power obtainable with a
proposed sample size, one must specify the difference that is to be detected. Statistical power is affected significantly by the size of
the
effect as well as the size of the sample used to detect it. In general, bigger effect sizes are easier to detect than smaller effect sizes,
while large samples offer greater test sensitivity than small samples.

18. What is a standardized effect size?


A standardized effect size measures the magnitude of the treatment effect on a unit free scale. This scale will usually be in
magnitudes
related to the variance. This allows a more direct and comparable measure of the expected degree of effect across different studies.
There are many standardized effect sizes. Some of the more common examples are Cohen’s d, partial Eta-squared, Glass’ delta and
Hedges’ g.

19. What is a confidence interval and confidence level?


A confidence interval is an interval which contains a population parameter (e.g. mean) with a given level of probability.
This level of probability is often denoted as the confidence level.
A confidence level refers to the percentage of all possible samples that are expected to include the true population parameter.

20. What is the confidence interval width?


The confidence interval width is the distance from the lower to the upper limit of the confidence interval. The smaller the confidence
interval the more confidence we will tend to have in our point estimate and more precise the estimate will be considered to
be. The usefulness of the confidence interval depends on its width/precision. The width depends on the chosen confidence level and
on the standard deviation of the quantity being estimated as well as on the sample size.
For a two-sided interval the width of a confidence interval is defined as the distance between the two interval limits.
However, in the one-sided cases the width of the confidence interval is often defined as the distance from the parameter estimate to the
limit of the interval, although technically an upper or lower one-sided interval extends from the lowest possible value up to its upper
limit, or from its lower limit up to the highest possible value, respectively.

21. What effect does increasing sample size have on the confidence interval?
A larger sample will tend to produce a better estimate of the population parameter, when all other factors are equal.
Increasing the sample size decreases the width of confidence intervals, because it decreases the standard error. This can also be
phrased as increasing the sample size will increase the precision of the confidence interval.

22. What are the benefits of using an interval or precision based approach to sample size determination?
In a study in which the researcher is more interested in the precision of the estimate rather than in testing a specific
hypothesis about the estimate, the confidence interval approach is more informative about the observed results than the significance
testing approach. Sample size which targets the precision of the estimate uses the confidence interval as a method to define the
specific precision of interest to the researcher. Common cases where this may be true include survey design and early-stage research.

23. Specify the decision rule for each of the following situations.
(a) a two-tailed test with α = .05
(b) a one-tailed test, upper tail critical, with α = .01
Answer:
(a) Reject H0 at the .05 level of significance if z equals or is more positive than 1.96 or if z equals or is more negative than –1.96.
(b) Reject H0 at the .01 level of significance if z equals or is more positive than 2.33.
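
The critical values 1.96 and 2.33 can be recovered from the standard normal distribution, for example:

    from scipy.stats import norm

    two_tailed_critical = norm.ppf(1 - 0.05 / 2)   # ≈ 1.96 for a two-tailed test at α = .05
    one_tailed_critical = norm.ppf(1 - 0.01)       # ≈ 2.33 for an upper-tail test at α = .01
    print(round(two_tailed_critical, 2), round(one_tailed_critical, 2))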

24. Define Alpha (𝜶).


The probability of a type I error, that is, the probability of rejecting a true null hypothesis.

25. Define Beta ( β).


The probability of a type II error, that is, the probability of retaining a false null hypothesis.

26. Reading achievement scores are obtained for a group of fourth graders. A score of 4.0 indicates a level of achievement
appropriate for fourth grade, a score below 4.0 indicates underachievement, and a score above 4.0 indicates overachievement.
Assume that the population standard deviation equals 0.4. A random sample of 64 fourth graders reveals a mean achievement
score of 3.82.
(a) Construct a 95 percent confidence interval for the unknown population mean. (Remember to convert the standard
deviation to a standard error.)
(b) Interpret this confidence interval; that is, do you find any consistent evidence either of overachievement or of
underachievement?

ANSWER:
(a) Standard error = 0.4/√64 = 0.05, so the 95 percent confidence interval is 3.82 ± (1.96)(0.05) = 3.82 ± 0.098, that is, from 3.72 to 3.92.
(b) Since the entire interval falls below 4.0, there is consistent evidence of underachievement.
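
A quick check of this interval in Python (a sketch using scipy only for the 1.96 critical value):

    import math
    from scipy.stats import norm

    mean, sigma, n = 3.82, 0.4, 64
    se = sigma / math.sqrt(n)                 # standard error = 0.4 / 8 = 0.05
    z = norm.ppf(0.975)                       # ≈ 1.96 for a 95% interval
    lower, upper = mean - z * se, mean + z * se
    print(round(lower, 2), round(upper, 2))   # 3.72  3.92
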
27. What is Point Estimate?
A single value that represents some unknown population characteristic, such as the population mean.

28. What is meant by Margin of Error?


Margin of error is the degree of error in results received from random sampling surveys. A higher margin of error in
statistics indicates less likelihood of relying on the results of a survey or poll, i.e. the confidence that the results
represent the population will be lower.

29. What is a Margin of Error in confidence interval?


A margin of error tells you how many percentage points your results will differ from the real population value. For example,
a 95% confidence interval with a 4 percent margin of error means that your statistic will be within 4 percentage points of the real
population value 95% of the time.
30. What is population?
In statistics, population is the entire set of items from which you draw data for a statistical study. It can be a group of individuals, a set
of items, etc. It makes up the data pool for a study.
31. What is a sample?
A sample represents the group of interest from the population, which you will use to represent the data. The
sample is an unbiased subset of the population that best represents the whole data.
32. When are samples used?
• The population is too large to collect data.
• The data collected is not reliable.
• The population is hypothetical and is unlimited in size.
Take the example of a study that documents the results of a new medical procedure. It is unknown how the procedure will affect
people across the globe, so a test group is used to find out how people react to it.
33. Difference Between Population and Sample.
A population is the entire set of items or individuals under study, while a sample is a subset drawn from the population and used to
represent it.

34. Define Hypothetical Population


A population containing a finite number of individuals, members or units is a class. ... All 400 students of the 10th class of a particular
school are an example of an existent type of population, and the population of heads and tails obtained by tossing a coin an infinite
number of times is an example of a hypothetical population.
35. What is random sampling?
Random sampling occurs if, at each stage of sampling, the selection process guarantees that all potential observations in the
population have an equal chance of being included in the sample.
37. What is a sampling distribution?
The sampling distribution of the mean refers to the probability distribution of means for all possible random samples of a given size
from some population.
38. What are the types of sampling distributions?
• Sampling distribution of the mean
• Sampling distribution of proportion
• T-distribution
39.Define Sampling distribution of mean
The most common type of sampling distribution is of the mean. It focuses on calculating the mean of every sample group chosen from
the population and plotting the data points. The graph shows a normal distribution where the center is the mean of the sampling
distribution, which represents the mean of the entire population.
UNIT -4
1. Draw Stream Processing – Architecture

2 . Write the Issues & Constraints of stream Processing

• Data stream processing differs from other Big Data processing because it is mostly real-time, not batch processing. Data need to
be processed on the fly; that is, store-and-process is not possible. If the data is not processed in the stream, then it is lost for good.
• The speed of a data stream could be very high, in the sense that there is not enough processing capability to process each and
every element in the stream. The volume of traffic could also be very high, in the sense that there is not enough storage to store and
process it. Every other issue in this area can be traced to the speed and volume of the data.
• There should be provision to handle both ad-hoc and pre-defined queries.
• Reporting need not be real time.
3. Give the Examples of Stream Processing.
Sensor based data collection, Internet traffic targeting a server, Routed packets in a back-bone router
4. What are the Problems in Filtering Streams?
Filtering requires matching some key and data values in the streaming data with stored keys. This requires some table
lookup; consequently this makes it difficult to scale filtering.

5. What does Bloom filter consist of?

A Bloom filter consists of: a bit-array of n bits (n buckets), initially with all bits set to 0; a collection of hash functions
h1, h2, . . . , hk, where each hash function maps a "key" value to one of the n buckets, corresponding to the n bits of the bit-array;
and a set S of m key values. The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while
rejecting most of the stream elements whose keys are not in S.
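
A toy Bloom filter sketch, assuming two simple hash functions built from Python's hashlib; a real implementation would tune n and k to the expected set size:

    import hashlib

    n = 64                                       # number of bits (buckets), initially all 0
    bits = [0] * n

    def hashes(key, k=2):
        # Map a key to k bucket indices using salted MD5 digests
        for i in range(k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % n

    S = {"alice@example.com", "bob@example.com"}  # allowed keys (e.g. non-spam addresses)
    for key in S:
        for idx in hashes(key):
            bits[idx] = 1                         # set the bits for every key in S

    def maybe_in_S(key):
        # All bits set -> probably in S (false positives possible); any bit 0 -> definitely not in S
        return all(bits[idx] for idx in hashes(key))

    print(maybe_in_S("alice@example.com"), maybe_in_S("eve@example.com"))
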
5.Write an Application of Bloom Filtering
Spam filtering in email

6. State Flajolet-Martin Algorithm


The Flajolet-Martin algorithm approximates the number of unique objects in a stream or a database in one pass. If the
stream contains n elements with m of them unique, this algorithm runs in O(n) time and needs O(log(m)) memory. So
the real innovation here is the memory usage, in that an exact, brute-force algorithm would need O(m) memory. This is
an approximate algorithm: it gives an approximation for the number of unique objects, along with a standard deviation,
which can then be used to determine bounds on the approximation with a desired maximum error, if needed.
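
A simplified single-hash sketch of the Flajolet-Martin idea; real implementations combine many hash functions by averaging and taking medians, as described in the next question:

    import hashlib

    def trailing_zeros(x):
        # Number of trailing zero bits in x (defined as 0 for x == 0)
        if x == 0:
            return 0
        count = 0
        while x & 1 == 0:
            x >>= 1
            count += 1
        return count

    def fm_estimate(stream):
        max_r = 0
        for element in stream:
            h = int(hashlib.md5(str(element).encode()).hexdigest(), 16)
            max_r = max(max_r, trailing_zeros(h))
        return 2 ** max_r            # estimate of the number of distinct elements

    print(fm_estimate(["a", "b", "a", "c", "b", "d"]))
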
7. How can the Accuracy of counting be improved?

• Averaging: Use multiple hash functions and use the average R instead.
• Bucketing: Averages are susceptible to large fluctuations, so use multiple buckets of hash functions from the previous
step and use the median of the averaged R values. This gives good accuracy.
8. List some types of Simple Moments.

0th Moment simply calculates the number of distinct elements in a stream

1st Moment simply calculates the frequencies of distinct elements in a stream

2nd Moment calculates the sum of the squares of the frequencies of distinct elements in a stream. The second moment
is sometimes called the surprise number, since it measures how uneven the distribution of elements in the stream is.
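
For a small stream held in memory these moments can be computed exactly from a frequency count; a streaming algorithm such as AMS would only approximate the second moment:

    from collections import Counter

    stream = ["a", "b", "a", "c", "b", "a"]
    freqs = Counter(stream)

    m0 = len(freqs)                              # 0th moment: number of distinct elements (3)
    m1 = sum(freqs.values())                     # 1st moment: total stream length (6)
    m2 = sum(f ** 2 for f in freqs.values())     # 2nd moment / surprise number (9 + 4 + 1 = 14)
    print(m0, m1, m2)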

9. List the steps to be followed when a new element arrives at the stream window for a decaying window.
• Multiply the current sum by 1-c

• Add a(t+1)
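
The update rule in code, with an assumed decay constant c and a toy stream:

    c = 1e-6                      # assumed small time constant
    decayed_sum = 0.0

    def update(new_element):
        # Apply the two steps when a new element arrives
        global decayed_sum
        decayed_sum = decayed_sum * (1 - c)   # multiply the current sum by 1 - c
        decayed_sum += new_element            # add a(t+1)

    for value in [3, 1, 4, 1, 5]:             # toy stream
        update(value)
    print(decayed_sum)
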
10. Write the rules to be followed when representing a stream by buckets

• The right end of a bucket is always a position with a 1

• No position is in more than one bucket

• There are one or two buckets of any given size up to some maximum size

• All sizes must be a power of 2

• Bucket sizes cannot decrease as we move to the left (back in time)


11. What is Stream Processing?

A stream is a sequence of data elements made available over time , it can be thought of as a conveyor belt that allows
items to be processed one at a time rather than in large batches. Streams are processed differently from batch data –
normal functions cannot operate on streams as a whole, as they have potentially unlimited data, and formally, streams
are co-data, not data.

12. What is random walk hypothesis?


When applied to a particular financial instrument, the random walk hypothesis states that the price of this instrument is
governed by a random walk and hence is unpredictable. If the random walk hypothesis is false then there will exist
some correlation

between the instrument price and some other indicators such as trading volume or the previous day’s instrument closing price. If
the correlation can be determined then a potential profit can be made.
13. List the types of stock market prediction methods
• Fundamental analysis

• Technical methods

• Internet based data sources


14. What is Stock Market prediction?
It is an act of trying to determine the future value of a company stock or other financial instrument traded on a financial exchange.
The successful prediction of a stock’s future price could yield significant profit.
15. What do you mean by Fundamental analysis?
Fundamental analysts are concerned with the company that underlies the stock itself. They evaluate company’s past performance
as well as the credibility of its accounts. Many performance ratios are created that aid the fundamental analyst with assessing the
validity of a stock, such as the P/E ratio.
16. What is Technical analysis?
Technical analysts or chartists are not concerned with any of the company’s fundamentals. They seek to determine the future price
of a stock based solely on the trends of the past price.
17. What do you mean by Internet-based data sources in stock market prediction?
Tobias Preis et al. introduced a method to identify online precursors for stock market moves, using trading strategies based on
search volume data provided by Google Trends. Their analysis of Google search volume for 98 terms of varying financial
relevance, published in Scientific Reports, suggests that increases in search volume for financially relevant search terms tend to
precede large losses in financial markets.
18. How to estimate the number of 1’s in a window?
We can estimate the number of 1’s in a window of 0’s and 1’s by grouping the 1’s into buckets. Each bucket has a number of 1’s
that is a power of 2; there are one or two buckets of each size, and sizes never decrease as we go back in time. If we record only
the position and size of the buckets, we can represent the contents of a window of size N with O(log² N) space.
19. What is Exponentially Decaying Window?
Rather than fixing a window size, we can imagine that the window consists of all the elements that ever arrived in the stream, but
with the element that arrived t time units ago weighted by e^(−ct) for some time-constant c. Doing so allows us to maintain certain
summaries of an exponentially decaying window easily. For instance, the weighted sum of elements can be recomputed, when a new
element arrives, by multiplying the old sum by 1 − c and then adding the new element.
20. Define F-Test?
An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when
comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from
which the data were sampled
21. What is analysis of variance?
Analysis of variance is a collection of statistical models and their associated estimation procedures
used to analyze the differences among means. ANOVA was developed by the statistician Ronald Fisher.
22. Define effect size estimation.
• Effect size estimates provide important information about the impact of a treatment on the outcome of interest or on the
association between variables.
• Effect size estimates provide a common metric to compare the direction and strength of the relationship between variables
across studies.

23. What is meant by multiple comparisons, multiplicity or multiple testing?


The multiple comparisons, multiplicity or multiple testing problem occurs when one considers a set of statistical inferences
simultaneously or infers a subset of parameters selected based on the observed values. The more inferences are made, the more
likely erroneous inferences become.
24. Define ANOVA.
The repeated measures analysis of variance (ANOVA) is an omnibus test that is an extension of the dependent samples t test. The
test is used to determine whether there are any significant differences between the means of three or more variables.

25. Mention a two-factor factorial design

A two-factor factorial design is an experimental design in which data is collected for all possible combinations of the levels of the
two factors of interest. If equal sample sizes are taken for each of the possible factor combinations then the design is a balanced
two-factor factorial design.

26. Define statistical test in F-test.


An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when
comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from
which the data were sampled.

27. What are the two- way analysis of variance?


The two-way analysis of variance is an extension of the one-way ANOVA that examines the influence of two different categorical
independent variables on one continuous dependent variable.

28. What are the types of ANOVA?


There are two main types of ANOVA: one-way (or unidirectional) and two-way. There are also variations of ANOVA. For example,
MANOVA (multivariate ANOVA) differs from ANOVA as the former tests for multiple dependent variables simultaneously
while the latter assesses only one dependent variable at a time.

29. Define chi-square test.


The Chi-Square test is a statistical procedure used by researchers to examine the differences between categorical variables in the
same population. For example, imagine that a research group is interested in whether or not education level and marital status
are related for all people in the U.S

30. What Does the Analysis of Variance Reveal?


● The ANOVA test is the initial step in analyzing factors that affect a given data set. Once the test is finished, an analyst
performs additional testing on the methodical factors that measurably contribute to the data set's inconsistency. The
analyst utilizes the ANOVA test results in an F-test to generate additional data that aligns with the proposed
regression models.
● The ANOVA test allows a comparison of more than two groups at the same time to determine whether a relationship
exists between them. The result of the ANOVA formula, the F statistic (also called the F-ratio), allows for the analysis
of multiple groups of data to determine the variability between samples and within samples.
● If no real difference exists between the tested groups, which is called the null hypothesis, the result of the ANOVA's
F-ratio statistic will be close to 1. The distribution of all possible values of the F statistic is the F-distribution. This is
actually a group of distribution functions, with two characteristic numbers, called the numerator degrees of freedom
and the denominator degrees of freedom.
31. How to Use ANOVA?
● A researcher might, for example, test students from multiple colleges to see if students from one of the colleges
consistently outperform students from the other colleges. In a business application, an R&D researcher might test two
different processes of creating a product to see if one process is better than the other in terms of cost efficiency.

● The type of ANOVA test used depends on a number of factors. It is applied when data needs to be experimental. Analysis
of variance is employed if there is no access to statistical software resulting in computing ANOVA by hand. It is simple to
use and best suited for small samples. With many experimental designs, the sample sizes have to be the same for the
various factor level combinations.

● ANOVA is helpful for testing three or more variables. It is similar to multiple two-sample t- tests. However, it results in
fewer type I errors and is appropriate for a range of issues. ANOVA groups differences by comparing the means of each
group and includes spreading out the variance into diverse sources. It is employed with subjects, test groups, between
groups and within groups

32. What is the analysis of variance used for in other applications?


In addition to its applications in the finance industry, ANOVA is also used in a wide variety of contexts and applications to test
hypotheses in reviewing clinical trial data. For example, to compare the effects of different treatment protocols on patient
outcomes; in social science research (for instance to assess the effects of gender and class on specified variables), in software
engineering (for instance to evaluate database management systems), in manufacturing (to assess product and process quality
metrics), and industrial design among other fields.

33. What Is a Test?

● In technical analysis and trading, a test is when a stock’s price approaches an established support or resistance level set by the
market. If the stock stays within the support and resistance levels, the test passes. However, if the stock price reaches new lows
and/or new highs, the test fails. In other words, for technical analysis, price levels are tested to see if patterns or signals are
accurate.

● A test may also refer to one or more statistical techniques used to evaluate differences or similarities between estimated
values from models or variables found in data. Examples include the t-test and z-test

34. Define Range-Bound Market Test.


● When a stock is range-bound, price frequently tests the trading range’s upper and lower boundaries. If traders are using a
strategy that buys support and sells resistance, they should wait for several tests of these boundaries to confirm price
respects them before entering a trade.

● Once in a position, traders should place a stop-loss order in case the next test of support or resistance fails.
35. What is the Trending Market Test.
In an up-trending market, previous resistance becomes support, while in a down-trending market, past support becomes
resistance. Once price breaks out to a new high or low, it often retraces to test these levels before resuming in the direction of
the trend. Momentum traders can use the test of a previous swing high or swing low to enter a position at a more favorable price
than if they would have chased the initial breakout.
A stop-loss order should be placed directly below the test area to close the trade if the trend unexpectedly reverses.

36. Define Statistical Tests

Inferential statistics uses the properties of data to test hypotheses and draw
conclusions. Hypothesis testing allows one to test an idea using a data sample with regard to a population parameter. The methodology
employed by the analyst depends on the nature of the data used and the reason for the analysis. In particular, one seeks to reject the null
hypothesis, or the notion that one or more random variables have no effect on another. If this can be rejected, the variables are likely to be
associated with one another

37. What Is Alpha Risk?

● Alpha risk is the risk that in a statistical test a null hypothesis will be rejected when it is actually true. This is also known as a type
I error, or a false positive. The term "risk" refers to the chance or likelihood of making an incorrect decision. The primary
determinant of the amount of alpha risk is the sample size used for the test. Specifically, the larger the sample tested, the lower the
alpha risk becomes.
● Alpha risk can be contrasted with beta risk, or the risk of committing a type II error (i.e., a false negative).
● Alpha risk, in this context, is unrelated to the investment risk associated with an actively managed portfolio that seeks alpha, or
excess returns above the market

38. What Is Range-Bound Trading?


Range-bound trading is a trading strategy that seeks to identify and capitalize on securities, like stocks, trading in price channels.

After finding major support and resistance levels and connecting them with horizontal trendlines, a trader can buy a security at the lower
trendline support (bottom of the channel) and sell it at the upper trendline resistance (top of the channel)

UNIT V
1. What Is Predictive Analytics?

The term predictive analytics refers to the use of statistics and modeling techniques to make predictions about future outcomes and
performance. Predictive analytics looks at current and historical data patterns to determine if those patterns are likely to emerge again. This
allows businesses and investors to adjust where they use their resources to take advantage of possible future events. Predictive analysis can
also be used to improve operational efficiencies and reduce risk.
2. Understanding Predictive Analytics
Predictive analytics is a form of technology that makes predictions about certain unknowns in the future. It draws on a series of techniques
to make these determinations, including artificial intelligence (AI), data mining, machine learning, modeling, and statistics. For instance,
data mining involves the analysis of large sets of data to detect patterns from it. Text analysis does the same, except for large blocks of text.
3. Predictive models are used for all kinds of applications, including:
● Weather forecasts
● Creating video games
● Translating voice to text for mobile phone messaging

● Customer service
● Investment portfolio development
4. What is mean by Forecasting
Forecasting is essential in manufacturing because it ensures the optimal utilization of resources in a supply chain. Critical spokes of the
supply chain wheel, whether it is inventory management or the shop floor, require accurate forecasts for functioning. Predictive modelling is
often used to clean and optimize the quality of data used for such forecasts. Modelling ensures that more data can be ingested by the system,
including from customer-facing operations, to ensure a more accurate forecast.
5. Define Credit
Credit scoring makes extensive use of predictive analytics. When a consumer or business applies for credit, data on the applicant's credit
history and the credit record of borrowers with similar characteristics are used to predict the risk that the applicant might fail to perform on
any credit extended.
6. Define Underwriting
Data and predictive analytics play an important role in underwriting. Insurance companies examine policy applicants to determine the
likelihood of having to pay out for a future claim based on the current risk pool of similar policyholders, as well as past events that
have resulted in pay-outs. Predictive models that consider characteristics in comparison to data about past policyholders and claims are
routinely used by actuaries
7. What is mean by Marketing
Individuals who work in this field look at how consumers have reacted to the overall economy when planning on a new campaign. They can
use these shifts in demographics to determine if the current mix of products will entice consumers to make a purchase.
Active traders, meanwhile, look at a variety of metrics based on past events when deciding whether to buy or sell a security. Moving
averages, bands, and breakpoints are based on historical data and are used to forecast future price movements
8. Predictive Analytics vs. Machine Learning
A common misconception is that predictive analytics and machine learning are the same things. Predictive analytics help us understand
possible future occurrences by analyzing the past. At its core, predictive analytics includes a series of statistical techniques (including
machine learning, predictive modelling, and data mining) and uses statistics (both historical and current) to estimate, or predict, future
outcomes
9. What is the Decision Trees
● If you want to understand what leads to someone's decisions, then you may find decision trees useful. This type of model places
data into different sections based on certain variables, such as price or market capitalization. Just as the name implies, it looks like
a tree with individual branches and leaves. Branches indicate the choices available while individual leaves represent a particular
decision.
● Decision trees are the simplest models because they're easy to understand and dissect. They're also very useful when you need to
make a decision in a short period of time.
10. Define Regression
This is the model that is used the most in statistical analysis. Use it when you want to determine patterns in large sets of data and when
there's a linear relationship between the inputs. This method works by figuring out a formula, which represents the relationship between
all the inputs found in the dataset. For example, you can use regression to figure out how price and other key factors can shape the
performance of a security
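
A hedged sketch of fitting such a linear relationship with scikit-learn; the feature and target arrays are invented for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])   # e.g. a single pricing factor
    y = np.array([2.1, 3.9, 6.2, 8.1])           # e.g. observed performance of the security

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)         # the fitted formula relating input to output
    print(model.predict([[5.0]]))                # prediction for a new input value
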
11. Define Neural Networks
Neural networks were developed as a form of predictive analytics by imitating the way the human brain works. This model can deal
with complex data relationships using artificial intelligence and pattern recognition. Use it if you have several hurdles that you need to
overcome like when you have too much data on hand, when you don't have the formula you need to help you find a relationship
between the inputs and outputs in your dataset, or when you need to make predictions rather than come up with explanations.
12. What are the Benefits of Predictive Analytics
● There are numerous benefits to using predictive analysis. As mentioned above, using this type of analysis can help entities
when you need to make predictions about outcomes when there are no other (and obvious) answers available.
● Investors, financial professionals, and business leaders are able to use models to help reduce risk. For instance, an investor
and their advisor can use certain models to help craft an investment portfolio with minimal risk to the investor by taking
certain factors into consideration, such as age, capital, and goals.
● There is a significant impact to cost reduction when models are used. Businesses can determine the likelihood of success or
failure of a product before it launches. Or they can set aside capital for production improvements by using predictive
techniques before the manufacturing process begins
13. Criticism of Predictive Analytics
● The use of predictive analytics has been criticized and, in some cases, legally restricted due to perceived inequities in its
outcomes. Most commonly, this involves predictive models that result in statistical discrimination against racial or ethnic
groups in areas such as credit scoring, home lending, employment, or risk of criminal behaviour.
● A famous example of this is the (now illegal) practice of redlining in home lending by banks. Regardless of whether the
predictions drawn from the use of such analytics are accurate, their use is generally frowned upon, and data that explicitly
include information such as a person's race are now often excluded from predictive analytics.
14. How Does Netflix Use Predictive Analytics?
Data collection is very important to a company like Netflix. It collects data from its customers based on their behaviour and past
viewing patterns. It uses this information to make predictions and recommendations based on their preferences. This is the basis
behind the "Because you watched..." lists you'll find on your subscription.

15. What Is Data Analytics?


Data analytics is the science of analysing raw data to make conclusions about that information. Many of the techniques and processes
of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption

16. What are the various steps of Data Analysis.


The process involved in data analysis involves several different steps:

1. The first step is to determine the data requirements or how the data is grouped. Data may be separated by age, demographic,
income, or gender. Data values may be numerical or be divided by category.
2. The second step in data analytics is the process of collecting it. This can be done through a variety of sources such as
computers, online sources, cameras, environmental sources, or through personnel.
3. Once the data is collected, it must be organized so it can be analyzed. This may take place on a spreadsheet or other form of
software that can take statistical data.
4. The data is then cleaned up before analysis. This means it is scrubbed and checked to ensure there is no duplication or error, and that it
is not incomplete. This step helps correct any errors before it goes on to a data analyst to be analyzed.
