Data Science Notes
1. Data Cleaning: EDA involves examining the data for errors, missing
values, and inconsistencies. It includes techniques such as data
imputation, handling missing data, and identifying and removing
outliers.
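As a minimal sketch of these cleaning steps (the readings here are made up for illustration), missing values can be filled with the median and outliers dropped with the 1.5 × IQR rule:

```python
import statistics

# Made-up sensor readings with one missing value and one outlier.
raw = [10, 12, None, 11, 300, 13]

# Impute: replace missing entries with the median of the observed values.
observed = [v for v in raw if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in raw]

# Outlier removal: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(imputed)  # default n=4 gives the quartiles
iqr = q3 - q1
cleaned = [v for v in imputed if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]
```

The median is used for imputation rather than the mean because it is not pulled upward by the outlier still present in the data.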
Conceptual Model: The conceptual data model is a view of the data that
is required to support business processes. It also keeps track of business events
and the related performance measures. The conceptual model defines what
the system contains, and focuses on finding the data used in a
business rather than on processing flow.
2. Logical Model: In the logical data model, the map of rules and data
structures includes the data required, such as tables and columns. Data
architects and business analysts create the logical model, which can then be
transformed into a database design. The logical model is always present
in the root package object.
β = (X^T X + λI)^{-1} X^T y
(1/N) Σ_{i=1}^{N} f(x_i, y_i, α, β)
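The first formula is the closed-form ridge regression solution. A small sketch of it for a single feature with an intercept, on made-up data (note this version adds λ to every diagonal entry, penalizing the intercept too, following the formula literally):

```python
# Ridge closed form beta = (X^T X + lam*I)^(-1) X^T y, solved here as an
# explicit 2x2 linear system where each row of X is [1, x].
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]  # roughly y = 2x + 1, with small noise
lam = 0.1                       # regularization strength (hyperparameter)

n = len(xs)
a11 = n + lam                          # (X^T X + lam*I)[0][0]
a12 = sum(xs)                          # (X^T X)[0][1] = (X^T X)[1][0]
a22 = sum(x * x for x in xs) + lam     # (X^T X + lam*I)[1][1]
b1 = sum(ys)                           # (X^T y)[0]
b2 = sum(x * y for x, y in zip(xs, ys))  # (X^T y)[1]

# Invert the symmetric 2x2 matrix analytically.
det = a11 * a22 - a12 * a12
intercept = (a22 * b1 - a12 * b2) / det
slope = (a11 * b2 - a12 * b1) / det
```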
Clustering is the task of dividing unlabeled data points into groups such that
similar data points fall into the same cluster, while points that differ from
them fall into other clusters.
How it works
Mean: The mean is the sum of all the values in a dataset divided by the
number of values.
Median: The median is the middle value in a dataset when the values
are arranged in order from smallest to largest.
Mode: The mode is the value that occurs most frequently in a dataset.
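These three measures can be computed directly with Python's standard library (toy data for illustration):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of values / number of values
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value
```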
A population is the entire set of items from which you draw data for a
statistical study. It can be a group of individuals, a set of items, etc. It makes
up the data pool for a study.
For the above situation, it is easy to collect data. The population is small and
willing to provide data and can be contacted. The data collected will be
complete and reliable.
Parameters can be categorized into two main types: model parameters and
hyperparameters.
Model parameters are the internal variables of a model that are directly
estimated from the training data. They are the "knobs" that the model can
adjust to fit the data and make accurate predictions. For instance, in linear
regression, the slope and intercept coefficients are model parameters that
determine the relationship between the input variable and the output
variable.
Hyperparameters, on the other hand, are not directly estimated from the
training data. Instead, they are set before the model is trained and control the
learning process itself. Hyperparameters influence how the model learns
from the data and can significantly impact its performance. Examples of
hyperparameters include the learning rate in gradient descent optimization, the
number of hidden layers in a neural network, and the regularization
parameter used to prevent overfitting.
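A small sketch of the distinction, using gradient descent on made-up linear data: the slope and intercept are model parameters estimated from the data, while the learning rate and epoch count are hyperparameters fixed in advance:

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1

lr = 0.05      # hyperparameter: gradient-descent step size
epochs = 2000  # hyperparameter: number of passes over the data

w, b = 0.0, 0.0  # model parameters, learned from the data below
n = len(xs)
for _ in range(epochs):
    # Gradients of mean squared error with respect to w and b.
    grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    w -= lr * grad_w
    b -= lr * grad_b
```

Changing `lr` or `epochs` changes how (and how well) `w` and `b` are learned, which is exactly the sense in which hyperparameters control the learning process rather than being outputs of it.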
Types of Estimation
Applications of Estimation
The sampling distribution depends on multiple factors – the statistic, sample size, sampling
process, and the overall population. It is used to help calculate statistics such as means,
ranges, variances, and standard deviations for the given sample.
SE = S/√(n) where,
S is the sample standard deviation
n is the number of observations
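A quick sketch of the formula with made-up observations:

```python
import math
import statistics

# Standard error of the mean: SE = S / sqrt(n), where S is the sample
# standard deviation and n is the number of observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]
s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
n = len(data)
se = s / math.sqrt(n)
```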
For example, the expected value for Male Republicans is: 240*200/440 ≈ 109
Similarly, you can calculate the expected value for each of the cells.
The Chi-Square test's application in machine learning attracts constant
interest. Feature selection is a critical topic
in machine learning, as you will have multiple features in line and must choose
the best ones to build the model. By examining the relationship between
features, the chi-square test aids in solving feature selection problems. In
this tutorial, you will learn about the chi-square test and its application.
What Is a Chi-Square Test?
The Chi-Square test is a statistical procedure for determining the difference
between observed and expected data. It can also be used to determine
whether the categorical variables in our data are correlated. It helps to find out
whether a difference between two categorical variables is due to chance or to a
relationship between them.
Chi-Square Test Definition
A chi-square test is a statistical test that is used to compare observed and
expected results. The goal of this test is to identify whether a disparity between
actual and predicted data is due to chance or to a link between the variables
under consideration. As a result, the chi-square test is an ideal choice for aiding in
our understanding and interpretation of the connection between our two
categorical variables.
A chi-square test or comparable nonparametric test is required to test a
hypothesis regarding the distribution of a categorical variable. Categorical
variables, which indicate categories such as animals or countries, can be nominal
or ordinal. They cannot have a normal distribution since they can only have a few
particular values.
For example, a meal delivery firm in India wants to investigate the link between
gender, geography, and people's food preferences.
It is used to determine whether a difference between two categorical variables is:
As a result of chance, or
Because of a relationship between them
Formula For Chi-Square Test
χ²_c = Σ (O_i − E_i)² / E_i
Where
c = Degrees of freedom
O = Observed value
E = Expected value
The degrees of freedom in a statistical calculation represent the number of
variables that can vary in a calculation. The degrees of freedom can be calculated
to ensure that chi-square tests are statistically valid. These tests are frequently
used to compare observed data with data that would be expected to be obtained
if a particular hypothesis were true.
The observed values are those you gather yourself.
The expected values are the frequencies expected, based on the null hypothesis.
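Putting the formula together with hypothetical observed and expected counts:

```python
# Chi-square statistic: sum over cells of (O - E)^2 / E.
# The counts here are made up for illustration; the expected values
# correspond to a null hypothesis of equal frequencies.
observed = [10, 20, 30, 40]
expected = [25, 25, 25, 25]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```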
Fundamentals of Hypothesis Testing
Hypothesis testing is a technique for interpreting and drawing inferences about a
population based on sample data. It aids in determining which sample data best
support mutually exclusive population claims.
Null Hypothesis (H0) - The Null Hypothesis is the assumption that the event will
not occur. A null hypothesis has no bearing on the study's outcome unless it is
rejected.
H0 is the symbol for it, and it is pronounced H-naught.
Alternate Hypothesis(H1 or Ha) - The Alternate Hypothesis is the logical opposite
of the null hypothesis. The acceptance of the alternative hypothesis follows the
rejection of the null hypothesis. H1 is the symbol for it.
What Are Categorical Variables?
Categorical variables belong to a subset of variables that can be divided into
discrete categories. Names or labels are the most common categories. These
variables are also known as qualitative variables because they depict the
variable's quality or characteristics.
Categorical variables can be divided into two categories:
1. Nominal Variable: A nominal variable's categories have no natural ordering.
Example: Gender, Blood groups
2. Ordinal Variable: A variable whose categories can be ordered is an
ordinal variable. Customer satisfaction (Excellent, Very Good, Good,
Average, Bad, and so on) is an example.
Why Do You Use the Chi-Square Test?
Chi-square is a statistical test that examines the differences between categorical
variables from a random sample in order to determine whether the expected and
observed results are well-fitting.
Here are some of the uses of the Chi-Squared test:
The Chi-squared test can be used to see if your data follows a well-known
theoretical probability distribution like the Normal or Poisson distribution.
The Chi-squared test allows you to assess your trained regression model's
goodness of fit on the training, validation, and test data sets.
What Does A Chi-Square Statistic Test Tell You?
A Chi-Square test (symbolically represented as χ²) is fundamentally a data
analysis based on the observations of a random set of variables.
a model equates to actual observed data. A Chi-Square statistic test is calculated
based on the data, which must be raw, random, drawn from independent
variables, drawn from a wide-ranging sample and mutually exclusive. In simple
terms, two sets of statistical data are compared -for instance, the results of
tossing a fair coin. Karl Pearson introduced this test in 1900 for categorical data
analysis and distribution. This test is also known as ‘Pearson’s Chi-Squared Test’.
Chi-Squared Tests are most commonly used in hypothesis testing. A hypothesis is
an assumption that any given condition might be true, which can be tested
afterwards. The Chi-Square test estimates the size of the inconsistency between
the expected results and the actual results, given the sample size and the
number of variables in the relationship.
These tests use degrees of freedom to determine if a particular null hypothesis
can be rejected based on the total number of observations made in the
experiments. The larger the sample size, the more reliable the result.
There are two main types of Chi-Square tests namely -
1. Independence
2. Goodness-of-Fit
Independence
The Chi-Square Test of Independence is an inferential statistical test which
examines whether two sets of variables are likely to be
related to each other or not. This test is used when we have counts of values
for two nominal or categorical variables and is considered a non-parametric test.
A relatively large sample size and independence of observations are the required
criteria for conducting this test.
For Example-
In a movie theatre, suppose we made a list of movie genres. Let us consider this
as the first variable. The second variable is whether or not the people who came
to watch those genres of movies have bought snacks at the theatre. Here the null
hypothesis is that the genre of the film and whether people bought snacks or not
are unrelated. If this is true, the movie genres don’t impact snack sales.
Goodness-Of-Fit
In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines
whether a variable is likely to come from a given distribution or not. We must
have a set of data values and the idea of the distribution of this data. We can use
this test when we have value counts for categorical variables. This test
demonstrates a way of deciding whether the data values have a “good enough” fit
to our idea, or whether they are a representative sample of the entire population.
For Example-
Suppose we have bags of balls with five different colours in each bag. The given
condition is that the bag should contain an equal number of balls of each colour.
The idea we would like to test here is that the proportions of the five colours of
balls in each bag are exactly equal.
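A sketch of this goodness-of-fit test with made-up counts for one bag, under the null hypothesis that all five colours are equally likely:

```python
# Hypothetical counts of the five ball colours drawn from one bag.
observed = [22, 18, 20, 25, 15]
total = sum(observed)             # 100 balls in total
expected = total / len(observed)  # equal proportions: 20 per colour

chi_square = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1            # degrees of freedom: categories - 1
```

The statistic would then be compared against the chi-square distribution with `df` degrees of freedom to decide whether to reject the equal-proportions hypothesis.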
Who Uses Chi-Square Analysis?
Chi-square is most commonly used by researchers who are studying survey
response data because it applies to categorical variables. Demography, consumer
and marketing research, political science, and economics are all examples of this
type of research.
Example
Let's say you want to know if gender has anything to do with political party
preference. You poll 440 voters in a simple random sample to find out which
political party they prefer. The results of the survey are shown in the table below:
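The survey table itself is not reproduced here, so the counts below are hypothetical, chosen only to be consistent with the totals mentioned in the text (240 men, 200 Republicans, 440 voters overall). The expected count for each cell is row total × column total / grand total, as in the earlier Male Republicans example:

```python
# Hypothetical 2x2 gender-by-party counts (not the actual survey data).
observed = {
    ("Male", "Republican"): 120, ("Male", "Democrat"): 120,
    ("Female", "Republican"): 80, ("Female", "Democrat"): 120,
}
total = sum(observed.values())

# Row and column totals of the contingency table.
row_totals = {}
col_totals = {}
for (row, col), count in observed.items():
    row_totals[row] = row_totals.get(row, 0) + count
    col_totals[col] = col_totals.get(col, 0) + count

# Chi-square statistic: sum of (O - E)^2 / E over every cell,
# where E = row total * column total / grand total.
chi_square = 0.0
for (row, col), o in observed.items():
    e = row_totals[row] * col_totals[col] / total
    chi_square += (o - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)  # degrees of freedom
```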
Open File – enables the user to select the file from the local machine
Open URL – enables the user to select the data file from different
locations
Open Database – enables users to retrieve a data file from a database
source
Supplied training set: Evaluation is based on how well it can predict the class
of a set of instances loaded from a file.
Cross-validation: Evaluation is based on cross-validation by using the
number of folds entered in the ‘Folds’ text field.
To classify the data set based on the characteristics of attributes, Weka uses
classifiers.
Clustering: The cluster tab enables the user to identify similarities or groups
of occurrences within the data set. Clustering can provide data for the user to
analyse. The training set, percentage split, supplied test set and classes are
used for clustering, for which the user can ignore some attributes from the
data set, based on the requirements. Available clustering schemes in Weka are
k-Means, EM, Cobweb, X-means and FarthestFirst.
Association: The only available scheme for association in Weka is the Apriori
algorithm. It identifies statistical dependencies between clusters of attributes,
and only works with discrete data. The Apriori algorithm computes all the
rules having minimum support and exceeding a given confidence level.
Visualisation: The user can see the final piece of the puzzle, derived
throughout the process. It allows users to visualise a 2D representation of
data, and is used to determine the difficulty of the learning problem. We can
visualise single attributes (1D) and pairs of attributes (2D), and rotate 3D
visualisations in Weka. It has the Jitter option to deal with nominal attributes
and to detect ‘hidden’ data points.
Choosing the Right K Value: Choosing the right k value is crucial to getting
good results with K-NN algorithms. If k is too small, the algorithm will be
sensitive to noise and outliers. If k is too large, the algorithm will oversimplify
the problem and may miss important details. The value of k depends on the
size of the dataset and the complexity of the problem. A common approach is
to try different k values and choose the one that gives the best results. In some
cases, cross-validation can be used to estimate the optimal k value.
Distance Metrics in K-NN Algorithms: The distance metric used to evaluate
the similarity between data points is an essential component of K-NN
algorithms. Euclidean distance is a common choice, but other metrics such as
Manhattan distance, Minkowski distance, and cosine similarity can be used
depending on the type of data and the problem being addressed. Some metrics
are sensitive to the scale and units of the data, while others are not. Choosing
an appropriate distance metric is critical for obtaining accurate results with K-
NN algorithms
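A minimal K-NN classifier sketch on toy 2-D points, with the distance metric passed in as a function so that Euclidean and Manhattan distance can be swapped:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k, metric):
    # Sort training points by distance to the query, then take a
    # majority vote among the k nearest labels.
    neighbors = sorted(train, key=lambda item: metric(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
```

Both `k` and `metric` are set before any prediction is made, which is why the text treats them as choices that must be tuned rather than quantities learned from the data.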
Handling Categorical Features in K-NN Algorithms: K-NN algorithms can
handle both continuous and categorical features, but categorical features
require special handling. One-hot encoding is a common approach, where
each category is converted into a binary variable. This technique allows the
distance metric to evaluate the similarity between the categories. Another
approach is to use a distance metric specifically designed for categorical data,
such as Gower distance and categorical cosine similarity.
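A sketch of one-hot encoding in plain Python (toy colour labels): each category becomes its own 0/1 column, so a distance metric can compare categorical values numerically.

```python
# Toy categorical feature to encode.
colors = ["red", "green", "blue", "green"]

# One binary column per distinct category, in a fixed (sorted) order.
categories = sorted(set(colors))  # ['blue', 'green', 'red']
encoded = [[1 if c == cat else 0 for cat in categories] for c in colors]
```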
Advantages:
Simple and intuitive
Non-parametric: no assumptions about the distribution of the data
Can handle both categorical and continuous data
Can be used for both classification and regression tasks
Disadvantages:
Requires a large amount of memory to store the training data
Computationally expensive for large datasets
Sensitive to the choice of k value, distance metric, and data preprocessing
Not suitable for high-dimensional data
Applications of K-NN Algorithms: K-NN algorithms have a wide range of
applications in various fields, including healthcare, finance, engineering, and
social sciences. Some common applications include:
Advantages of K-means:
Simple and easy to understand and implement.
Fast and efficient for large datasets.
Works well with numerical data.
Disadvantages of K-means:
Sensitive to the initial choice of centroids.
Does not work well with non-spherical clusters.
Requires the number of clusters (k) to be predefined.
Limitations of the k-means algorithm:
Assumes Spherical Clusters: K-means assumes that the clusters have a
spherical shape with equal variance.
Sensitive to Initial Placement: The algorithm's performance can be
influenced by the initial placement of cluster centers.
Requires Predefined Number of Clusters: The number of clusters (k) needs
to be specified in advance
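A minimal sketch of the k-means loop (Lloyd's algorithm) on toy 2-D points; the centroids are initialized by hand here since, as noted above, the algorithm is sensitive to initial placement, and k is chosen in advance:

```python
# Toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids = [(1.0, 1.0), (8.0, 8.0)]  # k = 2, hand-picked initial centers

def closest(point, centroids):
    # Index of the nearest centroid by squared Euclidean distance.
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

for _ in range(10):  # a few iterations suffice for this toy data
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        clusters[closest(p, centroids)].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [tuple(sum(dim) / len(c) for dim in zip(*c))
                 for c in clusters]
```

With a poor initialization the same loop can settle into a worse partition, which is the sensitivity the limitations above describe.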
Applications of the k-means algorithm
Customer Segmentation: Identify customer groups with similar behavior and
preferences for targeted marketing campaigns.
Image Compression: Reduce the size of an image by grouping similar pixels
together, preserving visual quality.
Anomaly Detection: Flag unusual patterns or outliers in data by identifying
clusters with few data points.