⮚ Retailers
⮚ Financial sectors
⮚ Transportation
⮚ Government sectors
⮚ Universities
5. What are the three sub-phases of data preparation?
The three sub-phases of data preparation are:
● Data cleaning
● Data integration
● Data transformation
6. What is data cleaning?
Removing missing values, false values, and inconsistencies across data sources.
7. Define Streaming data.
Data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in
small sizes. Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
8. What is a Pareto chart?
• It is a graph that indicates the frequency of defects as well as their cumulative impact.
• Pareto charts are useful for finding the defects to prioritize in order to achieve the greatest overall improvement.
• It is a combination of a bar graph and a line graph.
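A minimal matplotlib sketch of this bar-plus-cumulative-line layout, using hypothetical defect counts (all names and numbers below are illustrative only):

```python
import matplotlib.pyplot as plt

# Hypothetical defect counts, already sorted in descending order
defects = {"Scratch": 42, "Dent": 27, "Misalignment": 15, "Crack": 9, "Other": 7}
labels = list(defects.keys())
counts = list(defects.values())
total = sum(counts)

# Cumulative percentage for the line component of the Pareto chart
cumulative = []
running = 0
for c in counts:
    running += c
    cumulative.append(100 * running / total)

fig, ax = plt.subplots()
ax.bar(labels, counts)                    # bar graph: frequency of each defect
ax2 = ax.twinx()
ax2.plot(labels, cumulative, marker="o")  # line graph: cumulative impact (%)
ax2.set_ylim(0, 110)
ax.set_ylabel("Defect frequency")
ax2.set_ylabel("Cumulative %")
plt.title("Pareto chart of defects")
plt.show()
```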
9. What are recommender systems?
Recommender systems are a subclass of information filtering systems, used to predict how a user would rate or score particular objects (movies, music, merchandise, etc.). Recommender systems filter large volumes of information based on the data provided by a user and other factors.
Recommender systems utilize algorithms that optimize the analysis of the data to build the recommendations.
10. What are various forms of data used in data science?
• The main categories of data are:
⮚ Structured data
⮚ Unstructured data
⮚ Natural language
⮚ Machine-generated
⮚ Graph-based
⮚ Streaming
11. Explain how a recommender system works.
A recommender system is a system that many consumer-facing, content-driven, online platforms employ to generate recommendations for users from a library of available content.
These systems generate recommendations based on what they know about the users' tastes from their activities on the platform.
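A minimal sketch of one common approach, user-based collaborative filtering, assuming a tiny hypothetical rating matrix (users, items, and ratings are made up for illustration):

```python
import numpy as np

# Hypothetical user-item rating matrix (rows = users, columns = movies); 0 = not rated
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def predict(user, item):
    """Predict a rating as a similarity-weighted average of other users' ratings for that item."""
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        sim = cosine(ratings[user], ratings[other])
        num += sim * ratings[other, item]
        den += abs(sim)
    return num / den if den else 0.0

print(predict(user=0, item=2))  # estimate how user 0 would rate movie 2
```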
12. List out the steps involved in the data science process.
1. Setting the research goal
2. Retrieving data
3. Data Preparation
4. Data Exploration
5. Data modeling (model building)
6. Presentation and automation
13. Mention the inputs covered inside the project charter.
● A clear research goal
● The project mission and context
● How you’re going to perform your analysis
● What resources you expect to use
● Proof that it’s an achievable project, or proof of concepts
● Deliverables and a measure of success
● A timeline
14. List the various open data site providers.
Open data site - Description
Data.gov - The home of the US Government's open data
https://open-data.europa.eu/ - The home of the European Commission's open data
Freebase.org - An open database that retrieves its information from sites like Wikipedia, MusicBrainz, and the SEC archive
Data.worldbank.org - Open data initiative from the World Bank
Aiddata.org - Open data for international development
Open.fda.gov - Open data from the US Food and Drug Administration
UNIT -2
1. Discuss the differences between the frequency table and the frequency distribution table.
A frequency table is a tabular method where each part of the data is assigned its corresponding frequency, whereas a frequency distribution is generally the graphical representation of the frequency table.
2. What are the various types of frequency distributions?
Different types of frequency distributions are as follows:
1. Grouped frequency distribution.
2. Ungrouped frequency distribution.
3. Cumulative frequency distribution.
4. Relative frequency distribution.
5. Relative cumulative frequency distribution, etc.
3. What are some characteristics of the frequency distribution?
Some major characteristics of the frequency distribution are given as follows:
1. Measures of central tendency and location i.e. mean, median, and mode.
2. Measures of dispersion i.e. range, variance, and the standard deviation.
3. The extent of the symmetry or asymmetry i.e. skewness.
4. The flatness or the peakedness i.e. kurtosis.
4.What is the importance of frequency distribution?
The value of frequency distributions in statistics is immense. A well-formed frequency distribution creates the possibility of a detailed analysis of the structure of the population, so the groups into which the population breaks down can be determined.
5.What is frequency distribution?
A frequency distribution is a collection of observations produced by sorting observations
into classes and showing their frequency (f ) of occurrence in each class.
6. List the essential guidelines for a frequency distribution.
● Each observation should be included in one, and only one, class.
● List all classes, even those with zero frequencies.
● All classes should have equal intervals.
7. What are real limits?
The real limits are located at the midpoint of the gap between adjacent tabled boundaries; that is, one-half of one unit of
measurement below the lower tabled boundary and one-half of one unit of measurement above the upper tabled boundary.
8.Define Relative frequency distributions.
Relative frequency distributions show the frequency of each class as a part or fraction of the total frequency for the entire
distribution.
9. How do you convert a frequency distribution into a relative frequency distribution?
To convert a frequency distribution into a relative frequency distribution, divide the frequency for each class by the total frequency
for the entire distribution.
10. Define Cumulative frequency distribution.
Cumulative frequency distributions show the total number of observations in each class and in all lower-ranked classes.
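A tiny pandas sketch showing both conversions from the two previous answers, using a hypothetical frequency distribution:

```python
import pandas as pd

# Hypothetical frequency distribution: class interval -> frequency
freq = pd.Series({"0-9": 3, "10-19": 7, "20-29": 12, "30-39": 5, "40-49": 3})

relative = freq / freq.sum()    # each frequency as a fraction of the total
cumulative = freq.cumsum()      # observations in this class and all lower-ranked classes

print(pd.DataFrame({"f": freq, "relative f": relative, "cumulative f": cumulative}))
```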
11. What are percentile ranks?
The percentile rank of a score indicates the percentage of scores in the entire distribution with similar or smaller values than that
score.
Thus a weight has a percentile rank of 80 if equal or lighter weights constitute 80 percent of the entire distribution.
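A quick illustration with scipy, using hypothetical weights (kind="weak" counts values equal to or below the score, matching the definition above):

```python
from scipy import stats

# Hypothetical weights; a percentile rank of 80 means equal or lighter weights
# constitute 80 percent of the entire distribution.
weights = [52, 55, 58, 60, 61, 63, 65, 68, 70, 75]
print(stats.percentileofscore(weights, 68, kind="weak"))  # -> 80.0
```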
12.List some of the features of histogram.
● Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class intervals of the frequency distribution.
● Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency. (The units along the vertical axis do not have to be the same width as those along the horizontal axis.)
● The intersection of the two axes defines the origin at which both numerical scales equal 0.
● Numerical scales always increase from left to right along the horizontal axis and from bottom to top along the vertical axis.
● The body of the histogram consists of a series of bars whose heights reflect the frequencies for the various classes.
13. Define Stem and leaf display.
A device for sorting quantitative data on the basis of leading and trailing digits.
27. Employees of Corporation A earn annual salaries described by a mean of $90,000 and a standard deviation of $10,000.
(a) The majority of all salaries fall between what two values?
(b) A small minority of all salaries are less than what value?
(c) A small minority of all salaries are more than what value?
(d) Answer parts (a), (b), and (c) for Corporation B’s employees, who earn annual salaries described by a mean of $90,000 and a standard
deviation of $2,000.
Answer:
(a) $80,000 to $100,000
(b) $70,000
(c) $110,000
(d) $88,000 to $92,000; $86,000; $94,000
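A small sketch of the rule of thumb behind these answers (the majority of values fall within about 1 standard deviation of the mean, and only small minorities lie more than about 2 standard deviations below or above it):

```python
def salary_ranges(mean, sd):
    """Rule-of-thumb ranges: majority within 1 SD, small minorities beyond 2 SDs."""
    return {
        "majority between": (mean - sd, mean + sd),   # roughly the middle 68%
        "small minority below": mean - 2 * sd,        # lower tail
        "small minority above": mean + 2 * sd,        # upper tail
    }

print(salary_ranges(90_000, 10_000))  # Corporation A
print(salary_ranges(90_000, 2_000))   # Corporation B
```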
28.Define Normal curve.
A theoretical curve noted for its symmetrical bell-shaped form.
29. List some of the properties of the normal curve.
● Obtained from a mathematical equation, the normal curve is a theoretical curve defined for a continuous variable.
● Because the normal curve is symmetrical, its lower half is the mirror image of its upper half.
● Being bell shaped, the normal curve peaks above a point midway along the horizontal spread and then tapers off gradually in either direction from the peak (without actually touching the horizontal axis, since, in theory, the tails of a normal curve extend infinitely far).
● The values of the mean, median (or 50th percentile), and mode, located at a point midway along the horizontal spread, are the same for the normal curve.
34.Define Scatterplots.
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores. With a little training, you can use any dot
cluster as a preview of a fully measured relationship.
35.Define correlation coefficient.
A correlation coefficient is a number between –1 and 1 that describes the relationship between pairs of variables.
36. Specify the properties of correlation coefficient.
Two properties are:
1. The sign of r indicates the type of linear relationship, whether positive or negative.
2. The numerical value of r, without regard to sign, indicates the strength of the linear relationship.
37. Define least square regression equation.
The equation that minimizes the total of all squared prediction errors for known Y scores in the original correlation analysis.
11. Assume that an r of .30 describes the relationship between educational level (highest grade completed) and estimated number of hours spent
reading each week. More specifically:
educational level (X): mean = 13, SSx = 25
weekly reading time (Y): mean = 8, SSy = 50
r = .30
(a)Determine the least squares equation for predicting weekly reading time from educational
level.
Answer:
b = (.30)√(50/25) = .42; a = 8 – (.42)(13) = 2.54; so the least squares equation is Y′ = 2.54 + .42X
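A short check of this arithmetic, computed directly from the summary statistics above (the comments note the small difference that comes from rounding b before computing a):

```python
import math

# Summary statistics from the problem
mean_x, mean_y = 13, 8
ss_x, ss_y = 25, 50
r = 0.30

b = r * math.sqrt(ss_y / ss_x)   # slope: b = r * sqrt(SSy / SSx) ≈ 0.42
a = mean_y - b * mean_x          # intercept: a = Ybar - b*Xbar (≈ 2.54 if b is rounded to .42 first)

print(round(b, 2), round(a, 2))
print("Predicted reading time at X = 15:", round(a + b * 15, 2))
```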
UNIT -3
1. Where have you used hypothesis testing in your machine learning solution?
Hypothesis testing is a statistical analysis in which we test an assumption made about a particular situation.
While testing an assumption that was claimed to be true, we performed hypothesis testing where the null hypothesis was that the claim was true and the alternative hypothesis was that the claim was false.
2. Which type of error is the more severe error, Type 1 or Type 2? Why, with an example?
It depends on the problem statement we are looking into.
For example:
In a disease-versus-treatment setting, a false negative is fatal (the patient has the disease but the model predicts that the patient does not have it); in that case the patient won't get the treatment and might lose his/her life.
Similarly, in a criminal guilty-or-innocent case, a false positive is much worse (the person is innocent but the model predicts the person is guilty), as we would end up punishing an innocent person.
3. What is the most significant benefit of hypothesis testing?
The most significant benefit of hypothesis testing is that it allows you to evaluate the strength of your claim or assumption before implementing it in your data set. Also, hypothesis testing is the only valid method to prove that something "is or is not".
The size of a sample influences two statistical properties: 1) the precision of our estimates and 2) the power of the study to draw conclusions. With too small a sample, the evidence may suggest that the null hypothesis is false but may not be strong enough to make a sufficiently convincing case that the null hypothesis is false.
Also note that rejecting the null hypothesis is not the same as showing real-world significance.
21. What effect does increasing sample size have on the confidence interval?
A larger sample will tend to produce a better estimate of the population parameter, when all other factors are equal.
Increasing the sample size decreases the width of confidence intervals, because it decreases the standard error. This can also be
phrased as increasing the sample size will increase the precision of the confidence interval.
22. What are the benefits of using an interval or precision based approach to sample size determination?
In a study in which the researcher is more interested in the precision of the estimate rather than in testing a specific
hypothesis about the estimate, the confidence interval approach is more informative about the observed results than the significance
testing approach. Sample size which targets the precision of the estimate uses the confidence interval as a method to define the
specific precision of interest to the researcher. Common cases where this may be true include survey design and early-stage research.
23. Specify the decision rule for each of the following situations.
(a) a two-tailed test with α = .05
(b) a one-tailed test, upper tail critical, with α = .01
Answer:
(a) Reject H0 at the .05 level of significance if z equals or is more positive than 1.96 or if z equals or is more negative than –1.96.
(b) Reject H0 at the .01 level of significance if z equals or is more positive than 2.33.
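A quick way to reproduce these critical values (a sketch using scipy's standard normal quantile function):

```python
from scipy.stats import norm

# Two-tailed test at alpha = .05: split alpha across both tails
print(norm.ppf(1 - 0.05 / 2))   # ≈ 1.96; reject if z >= 1.96 or z <= -1.96

# One-tailed test, upper tail critical, at alpha = .01
print(norm.ppf(1 - 0.01))       # ≈ 2.33; reject if z >= 2.33
```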
26. Reading achievement scores are obtained for a group of fourth graders. A score of 4.0 indicates a level of achievement
appropriate for fourth grade, a score below 4.0 indicates underachievement, and a score above 4.0 indicates overachievement.
Assume that the population standard deviation equals 0.4. A random sample of 64 fourth graders reveals a mean achievement
score of 3.82.
(a) Construct a 95 percent confidence interval for the unknown population mean. (Remember to convert the standard
deviation to a standard error.)
(b) Interpret this confidence interval; that is, do you find any consistent evidence either of overachievement or of
underachievement?
ANSWER:
(a) The standard error equals 0.4/√64 = 0.05, so the 95 percent confidence interval is 3.82 ± (1.96)(0.05) = 3.82 ± 0.098, that is, from 3.72 to 3.92.
(b) Because the entire interval falls below 4.0, there is consistent evidence of underachievement.
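A brief numerical check of this interval (a sketch; the values simply restate the computation above):

```python
import math
from scipy.stats import norm

sigma, n, xbar = 0.4, 64, 3.82
se = sigma / math.sqrt(n)                  # standard error = 0.05
z = norm.ppf(0.975)                        # ≈ 1.96 for a 95 percent interval
lower, upper = xbar - z * se, xbar + z * se
print(round(lower, 2), round(upper, 2))    # ≈ 3.72 to 3.92, entirely below 4.0
```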
27. What is Point Estimate?
A single value that represents some unknown population characteristic, such as the population mean.
UNIT -4
• Data stream processing differs from other Big Data processing because it is mostly real time, not batch processing. Data needs to be processed in flight; store-and-process is not possible. If the data is not processed in the stream, then it is lost for good.
• The speed of a data stream could be "very high", in the sense that there is not enough processing capability to process each and every element in the stream. The volume of traffic could also be "very high", in the sense that there is not enough storage to store and process it. Every other issue in this area can be traced to the speed and volume of the data.
• There should be provision to handle both ad-hoc and pre-defined queries.
• Reporting need not be real time.
3. Give the Examples of Stream Processing.
Sensor based data collection, Internet traffic targeting a server, Routed packets in a back-bone router
4. What are the Problems in Filtering Streams?
Filtering requires matching some key and data values in the streaming data with stored keys. This requires some table lookup; consequently, this makes it difficult to scale filtering.
A Bloom filter consists of: a bit-array of n bits (n buckets), with all bits initially set to 0; a collection of hash functions h1, h2, . . . , hk, where each hash function maps a "key" value to one of the n buckets, corresponding to the n bits of the bit-array; and a set S of m key values. The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while rejecting most of the stream elements whose keys are not in S.
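A minimal Python sketch of this idea (the bit-array size, number of hash functions, and example keys are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an n-bit array plus k hash functions."""

    def __init__(self, n_bits=1024, k=3):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits

    def _hashes(self, key):
        # Derive k bucket indices from salted SHA-256 digests of the key
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for h in self._hashes(key):
            self.bits[h] = 1

    def might_contain(self, key):
        # True for every key in S; occasionally True for keys not in S (false positives)
        return all(self.bits[h] for h in self._hashes(key))

allowed = BloomFilter()
for email in ["alice@example.com", "bob@example.com"]:   # the set S of allowed keys
    allowed.add(email)

print(allowed.might_contain("alice@example.com"))  # True
print(allowed.might_contain("spammer@junk.com"))   # almost certainly False
```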
5.Write an Application of Bloom Filtering
Spam filtering in email
• Averaging: Use multiple hash functions and use the average R instead.
• Bucketing: Averages are susceptible to large fluctuations, so use multiple groups (buckets) of hash functions from the previous step and take the median of the average R values. This gives good accuracy.
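A small sketch of how these two ideas combine in a Flajolet-Martin style distinct-count estimate (the hash construction, group sizes, and sample stream are illustrative assumptions, and the result is only a rough estimate):

```python
import random
import statistics

def trailing_zeros(x):
    """Number of trailing zero bits in x (0 is treated as having 0 trailing zeros)."""
    if x == 0:
        return 0
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream, groups=3, hashes_per_group=4, seed=42):
    """Average R = 2^max(trailing zeros) within each group of hash functions,
    then take the median of the group averages."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1
    # Each "hash function" is a random (a, b) pair: h(x) = (a*x + b) mod prime
    params = [[(rng.randrange(1, prime), rng.randrange(prime))
               for _ in range(hashes_per_group)] for _ in range(groups)]

    averages = []
    for group in params:
        estimates = []
        for a, b in group:
            max_tz = 0
            for item in stream:
                h = (a * hash(item) + b) % prime
                max_tz = max(max_tz, trailing_zeros(h))
            estimates.append(2 ** max_tz)
        averages.append(sum(estimates) / len(estimates))
    return statistics.median(averages)

stream = [f"user{i % 500}" for i in range(10_000)]   # 500 distinct elements
print(fm_estimate(stream))                            # rough estimate of the 500 distinct elements
```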
8. List some types of Simple Moments.
2nd Moment calculates the sum of the squares of the frequencies of distinct elements in a stream. The second moment
is sometimes called the surprise number, since it measures how uneven the distribution of elements in the stream is.
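For a stream small enough to store, the second moment can be computed exactly as below; streaming algorithms such as AMS only estimate this quantity when the stream cannot be stored (the sample stream is hypothetical):

```python
from collections import Counter

# Exact second moment: sum of the squares of the frequencies of distinct elements
stream = ["a", "b", "a", "c", "b", "a", "a", "d"]
counts = Counter(stream)                      # a:4, b:2, c:1, d:1
second_moment = sum(f * f for f in counts.values())
print(second_moment)                          # 16 + 4 + 1 + 1 = 22
```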
9. List the steps to be followed when a new element arrives at the stream window for a decaying window.
• Multiply the current sum by (1 - c).
• Add the new element a(t+1).
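A tiny sketch of these two steps as a running decaying-window sum (the decay constant and the arriving values are illustrative):

```python
class DecayingWindowSum:
    """Exponentially decaying window: one running sum per stream."""

    def __init__(self, c=1e-6):
        self.c = c          # small decay constant
        self.total = 0.0

    def add(self, value):
        # Step 1: multiply the current sum by (1 - c); Step 2: add the new element
        self.total = self.total * (1 - self.c) + value
        return self.total

window = DecayingWindowSum(c=0.01)
for x in [3.0, 5.0, 2.0, 7.0]:      # hypothetical arriving elements a(t+1)
    window.add(x)
print(window.total)
```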
10. Write the rules to be followed when representing a stream by buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket, and no position is in more than one bucket.
• There are one or two buckets of any given size, up to some maximum size.
• All bucket sizes are powers of 2.
• Bucket sizes do not decrease as we move back in time (to the left).
A stream is a sequence of data elements made available over time. It can be thought of as a conveyor belt that allows items to be processed one at a time rather than in large batches. Streams are processed differently from batch data: normal functions cannot operate on streams as a whole, as they have potentially unlimited data, and formally, streams are co-data, not data.
Stock market prediction typically looks for a correlation between the instrument price and some other indicators such as trading volume or the previous day's instrument closing price. If the correlation can be determined, then a potential profit can be made.
13. List the types of stock market prediction methods
• Fundamental analysis
• Technical methods
A two-factor factorial design is an experimental design in which data is collected for all possible combinations of the levels of the
two factors of interest. If equal sample sizes are taken for each of the possible factor combinations then the design is a balanced
two-factor factorial design.
● The type of ANOVA test used depends on a number of factors. It is applied when the data are experimental. Analysis of variance is also employed when there is no access to statistical software, so ANOVA must be computed by hand. It is simple to use and best suited for small samples. With many experimental designs, the sample sizes have to be the same for the various factor-level combinations.
● ANOVA is helpful for testing three or more variables. It is similar to multiple two-sample t-tests. However, it results in fewer Type I errors and is appropriate for a range of issues. ANOVA groups differences by comparing the means of each group and includes spreading out the variance into diverse sources. It is employed with subjects, test groups, between groups, and within groups.
● In technical analysis and trading, a test is when a stock’s price approaches an established support or resistance level set by the
market. If the stock stays within the support and resistance levels, the test passes. However, if the stock price reaches new lows
and/or new highs, the test fails. In other words, for technical analysis, price levels are tested to see if patterns or signals are
accurate.
● A test may also refer to one or more statistical techniques used to evaluate differences or similarities between estimated
values from models or variables found in data. Examples include the t-test and z-test
● Once in a position, traders should place a stop-loss order in case the next test of support or resistance fails.
35. What is the trending market test?
In an up-trending market, previous resistance becomes support, while in a down-trending market, past support becomes
resistance. Once price breaks out to a new high or low, it often retraces to test these levels before resuming in the direction of
the trend. Momentum traders can use the test of a previous swing high or swing low to enter a position at a more favorable price
than if they would have chased the initial breakout.
A stop-loss order should be placed directly below the test area to close the trade if the trend unexpectedly reverses.
Inferential statistics uses the properties of data to test hypotheses and draw
conclusions. Hypothesis testing allows one to test an idea using a data sample with regard to a population parameter. The methodology
employed by the analyst depends on the nature of the data used and the reason for the analysis. In particular, one seeks to reject the null
hypothesis, or the notion that one or more random variables have no effect on another. If this can be rejected, the variables are likely to be
associated with one another
● Alpha risk is the risk that in a statistical test a null hypothesis will be rejected when it is actually true. This is also known as a type
I error, or a false positive. The term "risk" refers to the chance or likelihood of making an incorrect decision. The primary
determinant of the amount of alpha risk is the sample size used for the test. Specifically, the larger the sample tested, the lower the
alpha risk becomes.
● Alpha risk can be contrasted with beta risk, or the risk of committing a type II error (i.e., a false negative).
● Alpha risk, in this context, is unrelated to the investment risk associated with an actively managed portfolio that seeks alpha, or
excess returns above the market
After finding major support and resistance levels and connecting them with horizontal trendlines, a trader can buy a security at the lower
trendline support (bottom of the channel) and sell it at the upper trendline resistance (top of the channel)
UNIT -5
1. What Is Predictive Analytics?
The term predictive analytics refers to the use of statistics and modeling techniques to make predictions about future outcomes and
performance. Predictive analytics looks at current and historical data patterns to determine if those patterns are likely to emerge again. This
allows businesses and investors to adjust where they use their resources to take advantage of possible future events. Predictive analysis can
also be used to improve operational efficiencies and reduce risk.
2. Understanding Predictive Analytics
Predictive analytics is a form of technology that makes predictions about certain unknowns in the future. It draws on a series of techniques
to make these determinations, including artificial intelligence (AI), data mining, machine learning, modeling, and statistics. For instance,
data mining involves the analysis of large sets of data to detect patterns from it. Text analysis does the same, except for large blocks of text.
3. Predictive models are used for all kinds of applications, including:
● Weather forecasts
● Creating video games
● Translating voice to text for mobile phone messaging
● Customer service
● Investment portfolio development
4. What is meant by forecasting?
Forecasting is essential in manufacturing because it ensures the optimal utilization of resources in a supply chain. Critical spokes of the
supply chain wheel, whether it is inventory management or the shop floor, require accurate forecasts for functioning. Predictive modelling is
often used to clean and optimize the quality of data used for such forecasts. Modelling ensures that more data can be ingested by the system,
including from customer-facing operations, to ensure a more accurate forecast.
5. Define Credit
Credit scoring makes extensive use of predictive analytics. When a consumer or business applies for credit, data on the applicant's credit
history and the credit record of borrowers with similar characteristics are used to predict the risk that the applicant might fail to perform on
any credit extended.
6. Define Underwriting
Data and predictive analytics play an important role in underwriting. Insurance companies examine policy applicants to determine the
likelihood of having to pay out for a future claim based on the current risk pool of similar policyholders, as well as past events that
have resulted in pay-outs. Predictive models that consider characteristics in comparison to data about past policyholders and claims are
routinely used by actuaries
7. What is meant by marketing?
Individuals who work in this field look at how consumers have reacted to the overall economy when planning on a new campaign. They can
use these shifts in demographics to determine if the current mix of products will entice consumers to make a purchase.
Active traders, meanwhile, look at a variety of metrics based on past events when deciding whether to buy or sell a security. Moving
averages, bands, and breakpoints are based on historical data and are used to forecast future price movements
8. Predictive Analytics vs. Machine Learning
A common misconception is that predictive analytics and machine learning are the same things. Predictive analytics help us understand
possible future occurrences by analyzing the past. At its core, predictive analytics includes a series of statistical techniques (including
machine learning, predictive modelling, and data mining) and uses statistics (both historical and current) to estimate, or predict, future
outcomes
9. What are decision trees?
● If you want to understand what leads to someone's decisions, then you may find decision trees useful. This type of model places
data into different sections based on certain variables, such as price or market capitalization. Just as the name implies, it looks like
a tree with individual branches and leaves. Branches indicate the choices available while individual leaves represent a particular
decision.
● Decision trees are the simplest models because they're easy to understand and dissect. They're also very useful when you need to
make a decision in a short period of time.
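A short illustrative example of the idea, fitting a shallow decision tree with scikit-learn so the branches (choices) and leaves (decisions) stay easy to read (the dataset and depth are illustrative choices, not part of the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree so the branch/leaf structure is easy to inspect
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each branch is a test on one variable; each leaf is a predicted decision
print(export_text(tree, feature_names=iris.feature_names))
```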
10. Define Regression
This is the model that is used the most in statistical analysis. Use it when you want to determine patterns in large sets of data and when
there's a linear relationship between the inputs. This method works by figuring out a formula, which represents the relationship between
all the inputs found in the dataset. For example, you can use regression to figure out how price and other key factors can shape the
performance of a security
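A small illustrative sketch of this kind of regression with scikit-learn, assuming made-up trading volume and prior-close inputs for a security (all values are synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: trading volume and prior-day close used to model today's price
rng = np.random.default_rng(0)
volume = rng.uniform(1e5, 1e6, size=200)
prev_close = rng.uniform(50, 150, size=200)
price = 0.00001 * volume + 0.95 * prev_close + rng.normal(0, 2, size=200)

X = np.column_stack([volume, prev_close])
model = LinearRegression().fit(X, price)

print(model.coef_, model.intercept_)   # recovered relationship between the inputs and price
print(model.predict([[5e5, 100.0]]))   # predicted price for new inputs
```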
11. Define Neural Networks
Neural networks were developed as a form of predictive analytics by imitating the way the human brain works. This model can deal
with complex data relationships using artificial intelligence and pattern recognition. Use it if you have several hurdles that you need to
overcome like when you have too much data on hand, when you don't have the formula you need to help you find a relationship
between the inputs and outputs in your dataset, or when you need to make predictions rather than come up with explanations.
12. What are the benefits of predictive analytics?
● There are numerous benefits to using predictive analysis. As mentioned above, using this type of analysis can help entities
when you need to make predictions about outcomes when there are no other (and obvious) answers available.
● Investors, financial professionals, and business leaders are able to use models to help reduce risk. For instance, an investor
and their advisor can use certain models to help craft an investment portfolio with minimal risk to the investor by taking
certain factors into consideration, such as age, capital, and goals.
● There is a significant impact to cost reduction when models are used. Businesses can determine the likelihood of success or
failure of a product before it launches. Or they can set aside capital for production improvements by using predictive
techniques before the manufacturing process begins
13. Criticism of Predictive Analytics
● The use of predictive analytics has been criticized and, in some cases, legally restricted due to perceived inequities in its
outcomes. Most commonly, this involves predictive models that result in statistical discrimination against racial or ethnic
groups in areas such as credit scoring, home lending, employment, or risk of criminal behaviour.
● A famous example of this is the (now illegal) practice of redlining in home lending by banks. Regardless of whether the
predictions drawn from the use of such analytics are accurate, their use is generally frowned upon, and data that explicitly
include information such as a person's race are now often excluded from predictive analytics.
14. How Does Netflix Use Predictive Analytics?
Data collection is very important to a company like Netflix. It collects data from its customers based on their behaviour and past
viewing patterns. It uses that information to make predictions and recommendations based on their preferences. This is the basis behind the "Because you watched..." lists you'll find on
your subscription.
1. The first step is to determine the data requirements or how the data is grouped. Data may be separated by age, demographic,
income, or gender. Data values may be numerical or be divided by category.
2. The second step in data analytics is the process of collecting it. This can be done through a variety of sources such as
computers, online sources, cameras, environmental sources, or through personnel.
3. Once the data is collected, it must be organized so it can be analyzed. This may take place on a spreadsheet or other form of
software that can take statistical data.
4. The data is then cleaned up before analysis. This means it is scrubbed and checked to ensure there is no duplication or error and that it is not incomplete. This step helps correct any errors before the data goes on to a data analyst to be analyzed.
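A minimal pandas sketch of this cleaning step (the columns and values are hypothetical; real pipelines may impute rather than drop missing values):

```python
import pandas as pd

# Hypothetical collected data with a duplicated record and missing values
df = pd.DataFrame({
    "age": [25, 25, 31, None, 40],
    "income": [52000, 52000, 61000, 47000, None],
    "gender": ["F", "F", "M", "M", "F"],
})

clean = (
    df.drop_duplicates()    # remove duplicated records
      .dropna()             # drop incomplete rows (or impute instead)
      .reset_index(drop=True)
)
print(clean)
```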