P&s Theory

(U1) Applications of data science

Data science has numerous applications across various domains, including:
1. **Business and Finance:**
   - Predictive analytics for stock market trends.
   - Fraud detection in financial transactions.
   - Customer segmentation for targeted marketing.
2. **Healthcare:**
   - Disease prediction and early diagnosis.
   - Drug discovery and development.
   - Patient outcome analysis.
3. **E-commerce:**
   - Recommender systems for personalized product suggestions.
   - Demand forecasting for inventory management.
   - Price optimization strategies.
4. **Marketing and Advertising:**
   - Customer sentiment analysis.
   - Ad targeting and optimization.
   - Marketing campaign effectiveness assessment.
5. **Manufacturing and Operations:**
   - Predictive maintenance for machinery.
   - Quality control and defect detection.
   - Supply chain optimization.
6. **Social Media and Entertainment:**
   - Sentiment analysis for user feedback.
   - Content recommendation algorithms.
   - User behavior prediction.
7. **Energy and Environment:**
   - Predictive maintenance for equipment in energy production.
   - Environmental impact assessment.
   - Energy consumption optimization.
8. **Government and Public Policy:**
   - Crime prediction and analysis for law enforcement.
   - Public health monitoring and response.
   - Traffic flow optimization.
9. **Education:**
   - Personalized learning recommendations.
   - Student performance prediction.
   - Educational program optimization.
10. **Sports Analytics:**
   - Player performance analysis.
   - Injury prediction and prevention.
   - Game strategy optimization.
These applications demonstrate the versatility of data science in extracting meaningful insights, making informed decisions, and solving complex problems across diverse industries.

(U1) Dependent, independent, categorical and continuous variables

1. **Dependent Variable:**
   - Definition: The variable in a study that you are trying to predict or explain. It is dependent on other variables.
   - Example: In a study examining the impact of study time on exam scores, the exam score is the dependent variable. It depends on the amount of time spent studying.
2. **Independent Variable:**
   - Definition: The variable that is manipulated or changed in an experiment to observe its effect on the dependent variable.
   - Example: In the same study on exam scores, the study time is the independent variable. Researchers manipulate the study time to see how it influences exam scores.
3. **Categorical Variable:**
   - Definition: A variable that can take on categories or labels and represents qualitative data.
   - Example: Gender is a categorical variable. It can be categorized as "Male" or "Female," and it represents a characteristic rather than a numerical value.
4. **Continuous Variable:**
   - Definition: A variable that can take on any value within a given range and represents quantitative data.
   - Example: Age is a continuous variable. It can take on any value within a range (e.g., 25.5 years), and it represents a measurable quantity on a numerical scale.
These definitions and examples help distinguish between types of variables commonly encountered in statistical analysis and research.
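
As a rough illustration of the four variable types, the sketch below uses a small made-up dataset in the spirit of the study-time example. The records, the column layout, and the helper `least_squares_slope` are assumptions introduced here, not part of the notes. It treats study hours as the independent variable, exam score as the dependent variable, gender as categorical, and age as continuous.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: (study_hours, exam_score, gender, age_years)
records = [
    (2.0, 55.0, "Female", 19.5),
    (4.0, 65.0, "Male", 20.25),
    (6.0, 75.0, "Female", 21.0),
    (8.0, 85.0, "Male", 22.25),
]

study_hours = [r[0] for r in records]  # independent variable
exam_scores = [r[1] for r in records]  # dependent variable
genders = [r[2] for r in records]      # categorical variable
ages = [r[3] for r in records]         # continuous variable


def least_squares_slope(x, y):
    """Slope of the least-squares line of y on x (change in y per unit of x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# How the dependent variable moves with the independent one.
points_per_hour = least_squares_slope(study_hours, exam_scores)

# Categorical data is summarised by counts, continuous data by numeric summaries.
gender_counts = Counter(genders)
mean_age = mean(ages)

print(points_per_hour, dict(gender_counts), mean_age)
```

Counting is the natural summary for the categorical column, while the continuous column supports arithmetic such as a mean.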
(U1) Big data visualization categories

Big data visualization encompasses various types of visual representations to make complex datasets more understandable and actionable.
1. **Charts and Graphs:**
   - Line charts, bar graphs, scatter plots, and pie charts are widely used for visualizing patterns, trends, and relationships in data.
2. **Maps and Geospatial Visualization:**
   - Mapping data onto geographical representations helps analyze spatial patterns and trends. It includes choropleth maps, heat maps, and interactive maps.
3. **Dashboards:**
   - Interactive dashboards provide a comprehensive view of key performance indicators and metrics. They often incorporate multiple visualizations for holistic data analysis.
4. **Infographics:**
   - Infographics combine visual elements, text, and graphics to convey complex information in a visually appealing and easily digestible format.
5. **Network Diagrams:**
   - Visualizing relationships and connections in complex networks, such as social networks or data flows, is done through network diagrams.
6. **Word Clouds:**
   - Word clouds represent the frequency of words in a dataset, with more frequently occurring words displayed in larger fonts.
7. **Tree Maps:**
   - Tree maps visualize hierarchical data structures, displaying nested rectangles to represent categories and subcategories.
8. **Parallel Coordinates:**
   - Suitable for multidimensional datasets, parallel coordinates visualize relationships between multiple variables using parallel axes.

(U1) Descriptive Statistics

**Descriptive Statistics:**
Descriptive statistics involve the analysis and summary of data to provide a clear and concise overview of its main characteristics. These statistics help in organizing, simplifying, and presenting data in a meaningful way.
**Applications in Computer Science and Engineering:**
1. **Performance Analysis:**
   - Descriptive statistics are used to analyze and summarize the performance metrics of computer systems, such as response time, throughput, and resource utilization. This is crucial for optimizing system efficiency.
2. **Network Traffic Monitoring:**
   - Descriptive statistics help in understanding and visualizing patterns in network traffic, including data transfer rates, packet loss, and network latency. This information aids in network optimization and troubleshooting.
3. **Software Debugging:**
   - Descriptive statistics can be employed to analyze and summarize debugging data, such as the frequency of bugs, their types, and the time taken to fix them. This helps in identifying and addressing software issues efficiently.
4. **Code Complexity Analysis:**
   - Metrics like cyclomatic complexity and code churn are analyzed using descriptive statistics to understand the complexity and maintainability of software code. This information guides software development practices.
5. **User Interaction Analytics:**
   - Descriptive statistics play a role in analyzing user interactions with software applications, websites, or mobile apps. This includes user engagement metrics, click-through rates, and feature usage patterns, helping in user experience optimization.
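
The response-time use case under Performance Analysis can be sketched with Python's standard `statistics` module. The measurements below are invented for illustration, and the function name `describe` is an assumption of this sketch. One slow request is included on purpose to show why the mean and the median can tell different stories.

```python
from statistics import mean, median, stdev

# Hypothetical response times in milliseconds; 500 is a deliberate outlier.
response_times_ms = [120, 135, 150, 110, 500, 125, 140, 130]


def describe(values):
    """Return a few common descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "stdev": stdev(values),  # sample standard deviation
    }


summary = describe(response_times_ms)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the mean (176.25 ms) is pulled well above the median (132.5 ms) by the single 500 ms request, which is the kind of pattern a summary like this is meant to surface.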
(U2) Properties of correlation coefficients

1. **Range:** Correlation coefficients always fall within the range of -1 to 1.
2. **Directionality:** The sign of the correlation indicates the direction of the relationship:
   - Positive correlation (r > 0): As one variable increases, the other tends to increase.
   - Negative correlation (r < 0): As one variable increases, the other tends to decrease.
3. **Strength:** The absolute value of the correlation coefficient indicates the strength of the relationship. A value closer to 1 or -1 implies a stronger correlation.
4. **Independence of Scale:** Correlation is not affected by changes in the scale of measurement of the variables.
5. **No Causation:** Correlation does not imply causation. Even if two variables are correlated, it does not mean that one causes the other.
6. **Sensitive to Outliers:** Correlation coefficients can be influenced by outliers in the data.
7. **Symmetry:** The correlation between X and Y is the same as the correlation between Y and X.
8. **Non-Linear Relationships:** Correlation measures linear relationships; it may not accurately represent non-linear associations.

(U2) Properties of rank correlation coefficients

Rank correlation coefficient measures the strength and direction of the monotonic relationship between two variables. Unlike the Pearson correlation coefficient, rank correlation focuses on the order of values rather than their specific values. One common rank correlation coefficient is the Spearman rank correlation coefficient.
**Spearman Rank Correlation Coefficient:**
1. **Range:** The Spearman rank correlation coefficient (\( \rho \)) lies between -1 and 1, inclusive.
2. **Directionality:** The sign of \( \rho \) indicates the direction of the monotonic relationship:
   - Positive \( \rho \) implies a positive monotonic relationship.
   - Negative \( \rho \) implies a negative monotonic relationship.
3. **Interpretation:** The absolute value of \( \rho \) represents the strength of the monotonic relationship. A value closer to 1 or -1 indicates a stronger relationship.
4. **Sensitivity to Ties:** Spearman correlation handles tied ranks, giving it an advantage when dealing with datasets where some values are repeated.
5. **Invariant to Monotonic Transformations:** Spearman correlation is unaffected by monotonically increasing transformations of the variables, because such transformations leave the ranks unchanged.
6. **Applicability:** Suitable for both continuous and discrete ordinal data.
7. **Non-Parametric:** Rank correlation does not assume normality in the data and is less sensitive to outliers.
8. **No Assumption of Linearity:** Unlike Pearson correlation, Spearman does not assume a linear relationship between variables.
Remember that rank correlation assesses monotonic relationships but may not capture non-monotonic associations or provide information about the specific form of the relationship.
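
To make the contrast between the two coefficients concrete, here is a minimal pure-Python sketch. The function names `pearson`, `average_ranks`, and `spearman` and the data are choices made for this example. Spearman is computed here as the Pearson correlation of the ranks, with tied values given the average of their rank positions, which is one common way to handle ties.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but clearly non-linear

r = pearson(x, y)
rho = spearman(x, y)
print(f"Pearson r = {r:.4f}, Spearman rho = {rho:.4f}")
```

Because `y` always increases with `x`, the ranks agree perfectly and \( \rho \) is 1, while Pearson's r stays below 1 because the relationship is not a straight line.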
(U3) Sample space, Mutually exclusive events, Compound event, Dependent and independent events.

1. **Sample Space:**
   - *Definition:* The sample space, often denoted by \(S\), is the set of all possible outcomes of a random experiment or process. It represents the total set of outcomes that could occur.
2. **Mutually Exclusive Events:**
   - *Definition:* Two events are mutually exclusive (or disjoint) if they cannot occur at the same time. In other words, the occurrence of one event means that the other event cannot happen simultaneously.
3. **Compound Event:**
   - *Definition:* A compound event is an event that consists of two or more simple events. Simple events are individual outcomes in the sample space, and a compound event involves combinations of these outcomes.
4. **Dependent and Independent Events:**
   - *Dependent Events:* Two events are dependent if the occurrence or non-occurrence of one event affects the probability of the other event. The outcome of the first event influences the probability of the second event.
   - *Independent Events:* Two events are independent if the occurrence or non-occurrence of one event does not affect the probability of the other event. The outcomes of the events are not influenced by each other.
Understanding these concepts is fundamental in probability theory and helps in analyzing and calculating probabilities associated with various events in a given scenario.

(U3) Properties of discrete and continuous random variables

**Discrete Random Variables:**
1. **Countable Outcomes:** Discrete random variables have a countable number of possible outcomes, often associated with counting.
2. **Probability Mass Function (PMF):** Described by a probability mass function, which gives the probability of each possible outcome.
3. **Probability Distribution:** The probabilities sum up to 1, representing the entire sample space.
4. **Expected Value (Mean):** The mean of a discrete random variable is calculated as the sum of each outcome multiplied by its probability.
5. **Variance and Standard Deviation:** Measures of variability for discrete random variables are calculated based on the probabilities associated with each outcome.
**Continuous Random Variables:**
1. **Uncountable Outcomes:** Continuous random variables have uncountably infinite possible outcomes, typically associated with measurements.
2. **Probability Density Function (PDF):** Described by a probability density function, which gives the probability density at each point in the range. The probability of any single exact value is zero; the probability that the variable falls within an interval is obtained by integrating the density over that interval.
3. **Probability Distribution:** The area under the probability density function over the entire range is equal to 1.
4. **Expected Value (Mean):** The mean of a continuous random variable is calculated as the integral of the product of each outcome and its probability density function.
5. **Variance and Standard Deviation:** Measures of variability for continuous random variables are calculated using integrals and are associated with the squared differences between each outcome and the mean.
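
The sum-based and integral-based definitions can be checked numerically. The sketch below is an illustration under stated assumptions: the discrete variable is a fair six-sided die, the continuous variable has the made-up density f(x) = 2x on [0, 1], and `integrate` is a simple midpoint-rule helper written for this example rather than a library routine.

```python
from fractions import Fraction

# Discrete case: PMF of a fair six-sided die (assumed example).
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

total_probability = sum(pmf.values())                       # must equal 1
die_mean = sum(x * p for x, p in pmf.items())               # sum of x * P(x)
die_variance = sum((x - die_mean) ** 2 * p for x, p in pmf.items())


def integrate(f, a, b, steps=1000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h


# Continuous case: density f(x) = 2x on [0, 1] (assumed example).
def density(x):
    return 2 * x


area = integrate(density, 0.0, 1.0)                         # should be close to 1
cont_mean = integrate(lambda x: x * density(x), 0.0, 1.0)   # integral of x * f(x)
prob_interval = integrate(density, 0.0, 0.5)                # P(0 <= X <= 0.5)

print(total_probability, die_mean, die_variance)
print(round(area, 6), round(cont_mean, 6), round(prob_interval, 6))
```

The die gives a mean of 7/2 and a variance of 35/12 exactly, while the continuous quantities are numerical approximations of 1, 2/3, and 1/4.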
(U3) Bayes theorem

Bayes' Theorem relates the conditional and marginal probabilities of random events. It is expressed as follows:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

Where
• \(P(A \mid B)\) is the probability of event A occurring given that event B has occurred.
• \(P(B \mid A)\) is the probability of event B occurring given that event A has occurred.
• \(P(A)\) and \(P(B)\) are the probabilities of events A and B, respectively.
Proof:
The proof of Bayes' Theorem can be derived from the definition of conditional probability. Starting with the definition:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

By the multiplication rule:

\[ P(A \cap B) = P(B \mid A) \cdot P(A) \]

Substituting this back into the conditional probability definition:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

Thus, Bayes' Theorem is proven.

This theorem is widely used in statistics and probability theory, particularly in Bayesian statistics, to update probabilities based on new evidence or information.

(U5) One tailed and two tailed tests

**One-Tailed Test:**
In a one-tailed test, the critical region for evaluating the statistical significance is located entirely in one tail of the probability distribution. This type of test is used when the hypothesis specifies a direction of the effect.
*Example:*
Suppose you are testing a new drug's effectiveness and your null hypothesis (\(H_0\)) is that the drug has no effect. Your alternative hypothesis (\(H_1\)) is that the drug has a positive effect. In this case, you are only interested in whether the drug improves the condition, so the critical region is in the right tail of the distribution. If the test statistic falls in this region, you would reject the null hypothesis in favor of the alternative, concluding that the drug has a positive effect.

**Two-Tailed Test:**
In a two-tailed test, the critical region is split between both tails of the probability distribution. This is used when the hypothesis does not specify the direction of the effect, and you want to determine if there is a significant difference in either direction.
*Example:*
Consider a scenario where you are investigating whether a coin is fair. The null hypothesis (\(H_0\)) is that the coin is fair (equal chance of heads or tails), and the alternative hypothesis (\(H_1\)) is that the coin is not fair. In a two-tailed test, you're interested in deviations from fairness in either direction. If the test statistic falls into the critical regions in either tail, you would reject the null hypothesis, concluding that the coin is not fair.
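
The practical difference between the two kinds of test shows up in the p-value. The following sketch assumes the test statistic follows a standard normal distribution under the null hypothesis (a z-test), which the notes do not specify. The value z = 1.8, the 0.05 significance level, and the function names are hypothetical choices for this example.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_value_right_tailed(z):
    """Probability of a statistic at least as large as z (right tail only)."""
    return 1.0 - STANDARD_NORMAL.cdf(z)


def p_value_two_tailed(z):
    """Probability of a statistic at least as extreme as z in either tail."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))


alpha = 0.05  # assumed significance level
z = 1.8       # hypothetical observed test statistic

p_one = p_value_right_tailed(z)
p_two = p_value_two_tailed(z)

reject_one_tailed = p_one < alpha
reject_two_tailed = p_two < alpha

print(f"one-tailed p = {p_one:.4f}, reject H0: {reject_one_tailed}")
print(f"two-tailed p = {p_two:.4f}, reject H0: {reject_two_tailed}")
```

With these numbers the same statistic is significant in the one-tailed test but not in the two-tailed test, because the two-tailed p-value is twice as large. This is why the direction of the alternative hypothesis should be fixed before looking at the data.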
(U4) t-Test, F-test, chi-square test

1. **t-Test:**
   - *Definition:* A t-test is a statistical test used to compare the means of two groups and determine if there is a significant difference between them. It is commonly employed when working with small sample sizes and assumes that the data follows a normal distribution.
2. **F-Test:**
   - *Definition:* The F-test is a statistical test that is often used to compare variances or test the equality of means across multiple groups. There are different variations of the F-test, such as the one-way analysis of variance (ANOVA) F-test, which assesses if there are any statistically significant differences between the means of three or more independent groups.
3. **Chi-Square Test:**
   - *Definition:* The chi-square test is a statistical test used to determine if there is a significant association between two categorical variables. It is often applied to contingency tables to assess whether the observed distribution of frequencies differs from the expected distribution. There are different types of chi-square tests, including the chi-square test for independence and the chi-square goodness-of-fit test.
These tests are fundamental in statistics and are used for various purposes, such as hypothesis testing, comparing groups, and examining relationships between variables. The specific test chosen depends on the nature of the data and the research question being addressed.

(U5) Critical region, critical value, Level of significance

**Critical Region:**
   - *Definition:* The critical region (or rejection region) in hypothesis testing is the range of values for a test statistic that leads to the rejection of the null hypothesis. This region is determined based on the chosen significance level and the distribution of the test statistic. If the calculated test statistic falls within the critical region, it provides evidence to reject the null hypothesis in favor of the alternative hypothesis.
**Level of Significance:**
   - *Definition:* The level of significance, denoted by \( \alpha \), is the probability of rejecting the null hypothesis when it is actually true. In hypothesis testing, it represents the threshold for determining whether the obtained results are statistically significant. Commonly used significance levels include 0.05, 0.01, and 0.10. A lower significance level corresponds to a more stringent criterion for rejecting the null hypothesis.
**Critical Value:**
   - *Definition:* The critical value is the specific value (or values) of a test statistic that separates the critical region from the non-critical region. These values are determined based on the chosen level of significance and the distribution of the test statistic (e.g., t-distribution, F-distribution). If the calculated test statistic is more extreme than the critical value, that is, it lies beyond the critical value on the side of the critical region, the null hypothesis is rejected.
In summary, the critical region is where the decision to reject the null hypothesis is made, the level of significance sets the threshold for making this decision, and critical values are the boundary values that define the critical region.
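
How the level of significance fixes the critical values can be sketched for one specific case. The example assumes a test statistic with a standard normal distribution, whereas the notes mention t and F distributions, which would need their own quantile functions. The helpers `critical_values` and `in_critical_region` and the `tails` argument are names invented for this illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def critical_values(alpha, tails="two"):
    """Boundary value(s) of the critical region for a standard normal statistic."""
    if tails == "two":
        upper = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)  # alpha is split across both tails
        return (-upper, upper)
    if tails == "right":
        return (STANDARD_NORMAL.inv_cdf(1 - alpha),)
    if tails == "left":
        return (STANDARD_NORMAL.inv_cdf(alpha),)
    raise ValueError("tails must be 'two', 'right', or 'left'")


def in_critical_region(statistic, alpha, tails="two"):
    """True when the statistic lies in the critical (rejection) region."""
    values = critical_values(alpha, tails)
    if tails == "two":
        return statistic <= values[0] or statistic >= values[1]
    if tails == "right":
        return statistic >= values[0]
    return statistic <= values[0]


for level in (0.10, 0.05, 0.01):
    lower, upper = critical_values(level, "two")
    print(f"alpha = {level}: reject H0 if z <= {lower:.3f} or z >= {upper:.3f}")
```

Lowering \( \alpha \) pushes the critical values further from zero, which is the sense in which a lower significance level is a more stringent criterion.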
(U5) Types of errors in sampling

In the context of sampling, there are two main types of errors: sampling errors and non-sampling errors.
1. **Sampling Errors:**
   - *Definition:* Sampling errors arise due to the variability that occurs by chance when a sample, rather than an entire population, is surveyed. These errors can be minimized by using proper sampling techniques.
   - *Examples:*
     - **Random Sampling Error:** Results from the natural variability inherent in any sampling process. It can be reduced by increasing the sample size.
     - **Systematic Sampling Error:** Results from a systematic flaw in the sampling process. For example, if a researcher consistently chooses samples from a specific subgroup, it introduces bias.
2. **Non-Sampling Errors:**
   - *Definition:* Non-sampling errors are errors that can occur at any stage of the research process, not just during the sampling phase. They are not related to chance and are often the result of human error, data collection problems, or faulty measurement instruments.
   - *Examples:*
     - **Selection Bias:** Occurs when the sample chosen is not representative of the entire population. This can happen if certain groups are excluded from the sample.
     - **Measurement Error:** Arises when there are inaccuracies in the measurement instruments or data collection methods.
     - **Non-Response Bias:** Results from the failure to obtain responses from some sampled units, leading to an incomplete or biased sample.

(U4) Point and interval estimations.

**Point Estimation:**
   - *Definition:* Point estimation involves using a single value, typically the sample statistic, to estimate a population parameter. The point estimate serves as the best guess or the most likely value for the parameter based on the information available from the sample.
   - *Example:* If you measure the average height of a sample of individuals and use that average as an estimate for the average height of the entire population, the sample mean serves as a point estimate for the population mean.
**Interval Estimation:**
   - *Definition:* Interval estimation provides a range of values, known as a confidence interval, within which the true population parameter is likely to fall. Unlike point estimation, it acknowledges the uncertainty associated with estimating a parameter from a sample.
   - *Example:* Suppose you calculate a 95% confidence interval for the average income of a population. This interval might be, for example, $40,000 to $50,000. This means that you are 95% confident that the true average income of the population falls within this range.
In summary, point estimation gives a specific value as the estimate, while interval estimation provides a range of values to express the uncertainty associated with the estimate. Interval estimation is often preferred in inferential statistics because it provides a measure of the precision and reliability of the estimate.
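
A point estimate and an interval estimate for a population mean can be computed side by side. The heights below are invented, and the function `mean_confidence_interval` is a name chosen for this sketch. As a simplification it uses a normal critical value; for a sample this small a t critical value would normally be used and would give a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return (point_estimate, (lower, upper)) for the population mean.

    Uses the normal approximation: mean +/- z * s / sqrt(n).
    """
    n = len(sample)
    point_estimate = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * standard_error
    return point_estimate, (point_estimate - margin, point_estimate + margin)


# Hypothetical sample of heights in centimetres.
heights_cm = [170, 165, 180, 175, 160, 172, 168, 178, 174, 169]

point, (low, high) = mean_confidence_interval(heights_cm)
print(f"point estimate: {point:.1f} cm")
print(f"95% interval:   {low:.1f} cm to {high:.1f} cm")
```

The point estimate is a single number (171.1 cm here), while the interval adds a margin on both sides whose width reflects the sample's variability and size.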
(U4) Population, samples, Sampling distribution, F distribution

**(i) Population and Sample:**
- **Population:**
   - *Definition:* A population is the entire group of individuals, items, or units about which the researcher wants to draw conclusions. It includes all possible elements that share a common characteristic.
- **Sample:**
   - *Definition:* A sample is a subset of the population selected for study. The goal of sampling is to gather information about the population by examining a representative portion of it, as studying the entire population might be impractical or impossible.
**(ii) Sampling Distribution:**
   - *Definition:* A sampling distribution is the probability distribution of a statistic (such as the mean or standard deviation) obtained from multiple samples of a population. It describes the variability of the statistic across different samples and provides information about how much the sample statistic is likely to differ from the population parameter.
**(iii) F Distribution:**
   - *Definition:* The F distribution is a probability distribution that arises in the context of analysis of variance (ANOVA) and regression analysis. It is a continuous probability distribution that is positively skewed and has two parameters, degrees of freedom in the numerator and degrees of freedom in the denominator. The F distribution is commonly used to test hypotheses about variances or ratios of variances.

(U5) Procedure followed in testing of hypotheses

The procedure for testing hypotheses involves several key steps. Here is a general outline of the process:
1. **Formulate Hypotheses:**
   - Define the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_1\) or \(H_a\)). The null hypothesis typically represents a statement of no effect or no difference, while the alternative hypothesis represents the opposite.
2. **Set Significance Level (\( \alpha \)):**
   - Choose a significance level (\( \alpha \)), which is the probability of rejecting the null hypothesis when it is actually true. Common choices include 0.05, 0.01, and 0.10.
3. **Collect Data:**
   - Collect relevant data through experiments, surveys, or observations.
4. **Choose a Statistical Test:**
   - Select an appropriate statistical test based on the nature of the data and the hypotheses being tested. Common tests include t-tests, chi-square tests, ANOVA, regression analysis, etc.
5. **Determine the Critical Region:**
   - Define the critical region(s) in the distribution of the test statistic. The critical region is where you would reject the null hypothesis.
6. **Calculate the Test Statistic:**
   - Use the collected data to calculate the test statistic based on the chosen statistical test.
7. **Make a Decision:**
   - Compare the calculated test statistic to the critical value(s) or use p-values to determine whether to reject the null hypothesis. If the test statistic falls in the critical region or the p-value is less than \( \alpha \), reject \(H_0\); otherwise, fail to reject \(H_0\).
8. **Draw Conclusions:**
   - State the conclusion in the context of the problem, considering the evidence from the statistical test. If \(H_0\) is rejected, interpret the results in terms of the alternative hypothesis.
9. **Check Assumptions and Limitations:**
   - Verify that the assumptions of the chosen statistical test are met. Consider any limitations in the study and discuss the implications of the findings.
10. **Report Results:**
   - Present the results, including the test statistic, p-value, and any relevant confidence intervals.
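
The steps above can be walked through on a small case. This sketch tests whether a coin is fair using an exact binomial test, which is one possible choice of test and not one prescribed by the notes. The data (16 heads in 20 flips), the 0.05 significance level, and the function name `binomial_two_sided_p_value` are assumptions made for the example.

```python
from math import comb


def binomial_two_sided_p_value(heads, flips):
    """Exact two-sided p-value under H0: P(heads) = 0.5.

    The null distribution is symmetric, so the smaller tail is doubled
    and the result is capped at 1.
    """
    outcomes = 2 ** flips
    upper_tail = sum(comb(flips, k) for k in range(heads, flips + 1)) / outcomes
    lower_tail = sum(comb(flips, k) for k in range(0, heads + 1)) / outcomes
    return min(1.0, 2.0 * min(upper_tail, lower_tail))


# Step 1: H0: the coin is fair; H1: the coin is not fair (two-tailed).
# Step 2: choose the significance level.
alpha = 0.05
# Step 3: collect data (hypothetical observations).
flips, heads = 20, 16
# Steps 4 to 6: exact binomial test; the number of heads is the test statistic.
p_value = binomial_two_sided_p_value(heads, flips)
# Step 7: make a decision by comparing the p-value with alpha.
reject_h0 = p_value < alpha
# Steps 8 and 10: state and report the conclusion.
decision = "reject H0" if reject_h0 else "fail to reject H0"
print(f"heads = {heads}/{flips}, p-value = {p_value:.4f}, decision: {decision}")
```

With 16 heads in 20 flips the p-value is about 0.0118, so the null hypothesis of a fair coin is rejected at the 0.05 level. An even split of 10 heads would give a p-value of 1 and no evidence against fairness.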
