Data Science Notes

Data science is a multidisciplinary field focused on extracting meaningful insights from large datasets through techniques from mathematics, statistics, and computer engineering. Its importance lies in enabling informed decision-making, predictive capabilities, and fostering innovation across various industries. The data science process involves stages such as problem definition, data collection, cleaning, analysis, modeling, and deployment, with applications spanning healthcare, finance, e-commerce, and more.

What is data science: Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data.
Why is data science important?: Data science is important because it combines tools, methods, and technology to generate meaning from data.
Modern organizations are inundated with data; there is a proliferation of
devices that can automatically collect and store information. Online systems
and payment portals capture more data in the fields of ecommerce, medicine,
finance, and every other aspect of human life. We have text, audio, video, and
image data available in vast quantities.
History of data science: The term "Data Science" was coined in the early 1960s to describe a new profession that would support the understanding and interpretation of the large amounts of data being amassed at the time. (At the time, there was no way of predicting the truly massive volumes of data that would accumulate over the next fifty years.)
Need for data science? Data science is crucial because of the exponential
growth in data generation across various sectors. Here are some key reasons
for its significance:
Insights and Decision-making: Data science helps extract valuable insights
from large and complex datasets, enabling informed and strategic decision-
making in businesses and organizations.
Predictive Capabilities: Through machine learning and statistical models,
data science can predict trends, behaviors, and outcomes, aiding in forecasting
and planning.
Efficiency and Innovation: It fosters innovation by optimizing processes,
improving efficiency, and identifying opportunities for growth and
development.
Competitive Advantage: Companies that effectively utilize data science gain
a competitive edge by understanding customer needs better, improving
products/services, and optimizing operations.
Problem Solving: Data science methods help in addressing challenges,
detecting anomalies, and finding solutions across various domains like
healthcare, finance, marketing, and more.
Components of data science---Data science consists of many algorithms, theories, and components. Before studying data science in detail, we need to understand them. The basic components of data science are discussed here.
Data science consists of several key components:
1. Data Collection: Gathering data from various sources, including databases,
sensors, social media, and more.
2. Data Cleaning and Preprocessing: Refining and preparing data by
handling missing values, removing outliers, and structuring it for analysis.
3. Exploratory Data Analysis (EDA): Understanding the data through
visualization and summary statistics to identify patterns, trends, and
relationships.
4. Statistical Analysis: Applying statistical methods to derive insights, make
predictions, and validate hypotheses.
5. Machine Learning: Building and training models that can learn from data
to make predictions, classifications, or optimizations.
6. Data Visualization: Presenting data in visual formats like charts, graphs,
and dashboards to communicate findings effectively.
7. Domain Expertise: Understanding the specific industry or field of
application to interpret results in context and derive meaningful conclusions.
8. Communication and Interpretation: Articulating findings and insights to
stakeholders, making data-driven recommendations and decisions.
Big data: Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time. These datasets are so huge and complex in volume, velocity, and variety that traditional data management systems cannot store, process, and analyze them.
Characteristics of big data: These characteristics were first described as the "3 Vs of big data" by Gartner in 2001 and are now commonly extended to the 5 Vs, which encapsulate big data's defining attributes:
1. Volume: Refers to the vast amount of data generated continuously from various sources like social media, sensors, transactions, etc. It is typically measured in petabytes or exabytes.
2. Velocity: Denotes the speed at which data is generated and processed. It
emphasizes the real-time or near-real-time nature of data streams and
the need to handle them swiftly.
3. Variety: Indicates the diverse types of data—structured, semi-
structured, and unstructured. These can include text, images, videos,
logs, and more, making it challenging to manage and analyze.
4. Veracity: Represents the reliability and accuracy of the data. Given the
diversity and sources, big data often has issues with inconsistency,
incompleteness, and uncertainty.
5. Value: Although not part of the original three Vs, value is increasingly considered essential. It highlights the need to derive insights, patterns, and meaningful information from big data to support informed decisions and create business value.
Facets of Data:
Structured data:
1. It concerns all data that can be stored in a SQL database, in tables with rows and columns.
2. Structured data has relational keys and can easily be mapped into pre-designed fields.
3. Today, structured data is the most processed form in development and the simplest way to manage information.
4. However, structured data represents only 5 to 10% of all informatics data.
Unstructured Data:
1. Unstructured data represents around 80% of all data.
2. It often includes text and multimedia content.
3. Examples include e-mail messages, word processing documents, videos,
photos, audio files, presentations, web pages and many other kinds of
business documents
4. Unstructured data is everywhere.
5. In fact, most individuals and organizations conduct their lives around
unstructured data.
6. Just as with structured data, unstructured data is either machine
generated or human generated.
Example of machine-generated unstructured data: Satellite images,
Photographs and video etc.
Example of human-generated unstructured data: Social media data, mobile
data etc.
machine-generated data:
1. Satellite images: This includes weather data or the data that the
government captures in its satellite surveillance imagery. Just think
about Google Earth, and you get the picture.
2. Photographs and video: This includes security, surveillance, and traffic video.
3. Radar or sonar data: This includes vehicular, meteorological, seismic, and oceanographic data.
Graph based or Network Data: In graph theory, a graph is a mathematical
structure to model pair-wise relationships between objects.
Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects.
The graph structures use nodes, edges, and properties to represent and store
graphical data. Graph-based data is a natural way to represent social
networks.
Audio, Image & Video: Audio, image, and video are data types that pose
specific challenges to a data scientist.
MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video capture to approximately 7 TB per game for the purpose of live, in-game analytics. High-speed cameras at stadiums capture ball and athlete movements to calculate in real time, for example, the path taken by a defender relative to two baselines.
Streaming Data: Streaming data is data that is generated continuously by
thousands of data sources, which typically send in the data records
simultaneously, and in small sizes (order of Kilobytes).
The need for business analytics: Business analytics plays a crucial role in
modern organizations for several reasons:
1. Informed Decision Making: It helps businesses make data-driven
decisions by analyzing historical and current data to predict future
trends, understand customer behavior, optimize processes, and identify
opportunities.
2. Performance Improvement: Analyzing various business metrics aids
in identifying inefficiencies, improving processes, and maximizing
performance across departments, leading to better productivity and
profitability.
3. Understanding Customers: Through analytics, businesses gain
insights into customer preferences, buying patterns, and behavior,
allowing for targeted marketing strategies, personalized offerings, and
improved customer experiences.
4. Risk Management: Analytics helps in identifying potential risks and
opportunities early on, enabling proactive strategies to mitigate risks
and leverage opportunities effectively.
5. Operational Efficiency: By analyzing operational data, businesses can
streamline processes, reduce costs, optimize resource allocation, and
enhance overall efficiency.
6. Competitive Advantage: Leveraging data and analytics can provide a
competitive edge by enabling companies to innovate, adapt quickly to
market changes, and stay ahead of competitors.
The data science life cycle typically proceeds through the following stages:
1. Business Understanding: This initial stage involves comprehending the business problem, goals, and objectives. Data scientists need a clear understanding of what the organization wants to achieve and how data can support those goals.
2. Data Understanding: Once the business problem is defined, the next
step involves understanding the available data. This includes data
sourcing, exploring data sources, assessing data quality, and
understanding the characteristics and limitations of the data.
3. Data Preparation: This phase involves cleaning and preprocessing the
data to make it suitable for analysis. Tasks include handling missing
values, dealing with outliers, normalizing or scaling features, encoding
categorical variables, and splitting the data into training and testing sets
4. Exploratory Data Analysis (EDA): Exploring and understanding the
dataset through statistical and visual methods to identify patterns,
trends, and relationships within the data.
5. Data Modeling: Once the data is prepared, the modeling phase begins.
This step involves selecting appropriate algorithms or models based on
the nature of the problem and the characteristics of the data
6. Model Evaluation: Assessing model performance using various metrics to ensure its accuracy, reliability, and generalizability on new data.
7. Model Deployment: Integrating the chosen model into the business
process or application to make predictions or recommendations based
on new data.
Application of data science: Data science finds application across various
industries and domains:
1. Healthcare: Predictive analytics for disease diagnosis, patient outcome
prediction, drug discovery, personalized medicine, and optimizing
hospital operations.
2. Finance: Risk assessment, fraud detection, algorithmic trading,
customer segmentation, credit scoring, and portfolio management.
3. E-commerce and Retail: Recommendation systems, demand
forecasting, pricing optimization, customer behavior analysis, and
inventory management.
4. Marketing and Advertising: Customer segmentation, targeted
advertising, sentiment analysis, campaign optimization, and social
media analytics.
5. Telecommunications: Network optimization, customer churn
prediction, resource allocation, and improving service quality.
6. Manufacturing: Predictive maintenance, supply chain optimization,
quality control, and process optimization.
7. Energy and Utilities: Predictive maintenance of equipment, energy
consumption optimization, grid management, and renewable energy
forecasting.
8. Transportation and Logistics: Route optimization, fleet management,
demand forecasting, and supply chain optimization.
9. Government and Public Sector: Crime pattern analysis, urban
planning, healthcare management, and policy decision-making.
The data science process is a systematic approach to extracting insights and
knowledge from data. While it can be flexible and iterative, it generally
involves the following key steps:
1. Problem Definition: Clearly define the problem you want to solve or the
question you want to answer. This step involves understanding business
objectives, identifying the scope of the project, and setting measurable goals.
2. Data Collection: Gather relevant data from various sources. This could
include databases, APIs, files, or even manual data entry. Ensuring data quality
is crucial at this stage.
3. Data Cleaning and Preprocessing: Raw data often contains errors,
missing values, outliers, or inconsistencies. Cleaning involves handling these
issues by imputing missing values, removing outliers, standardizing formats,
and transforming data for analysis.
4. Exploratory Data Analysis (EDA): Explore the data to understand its
characteristics, identify patterns, correlations, and gain insights. Visualization
techniques and statistical methods are used here to discover trends and
relationships within the data.
5. Model Selection and Training: Choose appropriate machine learning or
statistical models based on the problem at hand. Train these models on the
prepared data using techniques like regression, classification, clustering, etc.
6. Model Evaluation: Assess the performance of the trained models using
evaluation metrics. This step involves validating the model against test data to
ensure its predictive power and generalizability.
7. Model Tuning: Fine-tune the model parameters or hyperparameters to improve performance. Techniques like cross-validation or grid search are often used for this purpose.
Here's a breakdown of setting the research goal, including defining the
research goal and creating a project charter:
Define Research Goal: The first step in setting a research goal is to define
the specific problem or question that the researcher wants to address. This
involves stating the research goal in terms that are clear, specific, and
achievable.
Create a Project Charter: A project charter is a written document that
outlines the objectives, scope, timeline, and resources needed for the research
project. The project charter serves as a roadmap for the researcher, outlining
the steps that need to be taken to achieve the research goal.
Retrieving data: Data retrieval can refer to both internal and external data.
Internal data refers to data that is generated and stored within a company or
organization. Examples of internal data include customer transaction data,
employee performance data, and financial data.
External data, on the other hand, refers to data that is sourced from outside the company or organization. Examples of external data include publicly available data streams, social media data, and government data.
Both internal and external data are important for the data science process,
and a successful data science project often depends on a combination of both
types of data. However, the process for retrieving internal and external data is
often different.

Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing incorrect, incomplete, duplicate or otherwise erroneous data in a data set. It involves identifying data errors and then changing, updating or removing data to correct them. It is a core part of data preparation that involves several key steps:
1. Handling Missing Values: Identify missing data and decide how to handle
it—either by removing rows/columns or imputing values using statistical
methods.
2. Dealing with Duplicates: Identify and remove duplicate entries in the
dataset to maintain data integrity.
3. Fixing Structural Errors: Correct inconsistencies in data formats, typos, or
erroneous entries that don't adhere to the expected structure.
4. Normalization and Standardization: Scale numerical features to a
common scale and standardize them to have a mean of 0 and a standard
deviation of 1, if required.
5. Handling Outliers: Identify and address outliers that can skew analysis or
machine learning models.
6. Dealing with Categorical Variables: Encode categorical variables into
numerical form suitable for analysis or modeling (one-hot encoding, label
encoding).
7. Feature Engineering: Create new features or transform existing ones to
improve model performance.
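A minimal sketch of several of these steps using pandas; the DataFrame, its columns (age, city), and the chosen imputation and clipping values are hypothetical:

```python
import pandas as pd

# Hypothetical raw data with a missing value, a duplicate row, and an outlier.
df = pd.DataFrame({
    "age":  [25, 32, None, 32, 120],
    "city": ["Pune", "Delhi", "Delhi", "Delhi", "Pune"],
})

# 1. Handle missing values: impute the numeric column with its median.
df["age"] = df["age"].fillna(df["age"].median())

# 2. Remove duplicate rows.
df = df.drop_duplicates()

# 5. Handle outliers: clip ages to a plausible range.
df["age"] = df["age"].clip(lower=0, upper=100)

# 4. Standardize the numeric column (mean 0, standard deviation 1).
df["age_scaled"] = (df["age"] - df["age"].mean()) / df["age"].std()

# 6. Encode the categorical column with one-hot encoding.
df = pd.get_dummies(df, columns=["city"])

print(df)
```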

Data integration is the process of combining data from different sources to help data managers and executives analyze it and make smarter business decisions. This process involves a person or system locating, retrieving, cleaning, and presenting the data.
This can be achieved through a variety of data integration techniques,
including:
Extract, Transform and Load: copies of datasets from disparate sources are
gathered together, harmonized, and loaded into a data warehouse or database
Extract, Load and Transform: data is loaded as is into a big data system and
transformed at a later time for particular analytics uses
Change Data Capture: identifies data changes in databases in real-time and
applies them to a data warehouse or other repositories
Data Replication: data in one database is replicated to other databases to keep the information synchronized for operational use and for backup
Data Virtualization: data from different systems are virtually combined to
create a unified view rather than loading data into a new repository
Streaming Data Integration: a real time data integration method in which
different streams of data are continuously integrated and fed into analytics
systems and data stores
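As a small illustration of the extract, transform and load pattern, here is a minimal sketch using pandas and SQLite; the file name sales.csv, its columns, and the table name sales_clean are hypothetical:

```python
import sqlite3
import pandas as pd

# Extract: read a source dataset (hypothetical CSV of sales records).
df = pd.read_csv("sales.csv")  # columns assumed: order_id, amount, country

# Transform: harmonize the data before loading.
df["country"] = df["country"].str.upper()  # standardize country codes
df = df.dropna(subset=["amount"])          # drop incomplete records
df["amount"] = df["amount"].astype(float)

# Load: write the harmonized data into a warehouse-like SQLite database.
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("sales_clean", conn, if_exists="replace", index=False)
```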

Data transformation is the process of converting, cleansing, and structuring data into a usable format that can be analyzed to support decision-making processes, and to propel the growth of an organization.
The data transformation process is carried out in five stages.
1. Discovery: The first step is to identify and understand the data in its original source format with the help of data profiling tools, finding all the sources and data types that need to be transformed. This step helps in understanding how the data needs to be transformed to fit into the desired format.
2. Mapping: The transformation is planned during the data mapping phase.
This includes determining the current structure, and the consequent
transformation that is required, then mapping the data to understand at a
basic level, the way individual fields would be modified, joined or aggregated.
3. Code Generation: The code, which is required to run the transformation
process, is created in this step using a data transformation platform or tool.
4. Execution: The data is finally converted into the selected format with the
help of the code. The data is extracted from the source(s), which can vary from
structured to streaming, telemetry to log files. Next, transformations are
carried out on data, such as aggregation, format conversion or merging, as
planned in the mapping stage.
5. Review: The transformed data is evaluated to ensure the conversion has had the desired results in terms of the format of the data.
It must also be noted that not all data will need transformation; at times it can be used as is.

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. Common EDA techniques include:
1. Univariate Analysis: Analyzing individual variables to understand their distributions through histograms, kernel density plots, box plots, and summary statistics.
2. Bivariate Analysis: Exploring relationships between two variables
using scatter plots, joint plots, heatmaps, or correlation matrices to
identify correlations or patterns.
3. Multivariate Analysis: Extending analysis to multiple variables
simultaneously, often using techniques like dimensionality reduction
(PCA, t-SNE) or parallel coordinates plots.
4. Visualization Techniques: Employing various visualization methods
such as bar charts, pie charts, line graphs, and geospatial plots to
present data patterns or trends.
5. Missing Values and Outlier Analysis: Identifying missing values and
outliers through visualizations or statistical methods to decide on
appropriate handling strategies.
6. Time Series Analysis: Examining time-dependent data for trends,
seasonality, or cyclic patterns using methods like time series plots,
autocorrelation plots, etc.
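A minimal EDA sketch with pandas and matplotlib, using a small made-up dataset with two numeric columns (area and price):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset.
df = pd.DataFrame({
    "area":  [50, 65, 80, 100, 120, 150],
    "price": [30, 42, 55, 70, 85, 110],
})

# Univariate analysis: summary statistics and a histogram of one variable.
print(df["price"].describe())
df["price"].plot(kind="hist", title="Distribution of price")
plt.show()

# Bivariate analysis: scatter plot and correlation matrix.
df.plot(kind="scatter", x="area", y="price", title="Price vs. area")
plt.show()
print(df.corr())
```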

The Foremost Goals of EDA

1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes techniques such as data imputation, handling missing data, and identifying and removing outliers.

2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and distribution of variables. Measures like mean, median, mode, standard deviation, range, and percentiles are commonly used.

3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts assist in identifying patterns, trends, and relationships within the data.

Data modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures.

There are three types of Data Models:

1. Conceptual Data Model: The conceptual data model is a view of the data that is required to support business processes. It also keeps track of business events and related performance measures. The conceptual model defines what the system contains; it focuses on finding the data used in a business rather than on processing flow.

2. Logical Model: In the logical data model, the map of rules and data structures includes the data required, such as tables, columns, etc. Data architects and business analysts create the logical model, which can then be transformed into a database. The logical model is always present in the root package object.

3. Physical Data Model: In a physical data model, the implementation is described using a specific database system. It defines all the components and services that are required to build a database. It is created using the database language and queries. The physical data model represents each table, column, and constraint such as primary key, foreign key, NOT NULL, etc. The main purpose of the physical data model is to create a database. This model is created by the Database Administrator (DBA) and developers.

Other common types of data models include:
1. Hierarchical Model: The hierarchical model is a tree-like structure. There is one root (parent) node, and the other child nodes are sorted in a particular order. The hierarchical model is very rarely used now. This model can be used for real-world model relationships.
2. Object-oriented Model: The object-oriented approach is the creation of objects that contain stored values. The object-oriented model communicates while supporting data abstraction, inheritance, and encapsulation.
3. Network Model: The network model provides us with a flexible way of
representing objects and relationships between these entities. It has a feature
known as a schema representing the data in the form of a graph. An object is
represented inside a node and the relation between them as an edge, enabling
them to maintain multiple parent and child records in a generalized manner.
4. Entity-relationship Model: The ER model (entity-relationship model) is a high-level relational model which is used to define data elements and relationships for the entities in a system. This conceptual design provides a better view of the data that is easy to understand. In this model, the entire database is represented in a diagram called an entity-relationship diagram, consisting of entities, attributes, and relationships.
5. Relational Model: The relational model is used to describe the different relationships between the entities, such as one to one, one to many, and many to many.
Presentation and automation play significant roles in data science, aiding in
both the communication of insights and the efficiency of processes. Here's a
breakdown of their importance in the field:
Presentation in Data Science:
1. Data Visualization: Creating visual representations of data helps in
conveying complex information more intuitively. Tools like Matplotlib,
Seaborn, Tableau, and Power BI are used to generate graphs, charts, and
dashboards.
2. Storytelling: Presenting data effectively involves crafting a narrative
around the insights discovered. This helps in making the findings more
relatable and understandable to stakeholders.
3. Communication Skills: Data scientists need to effectively communicate
their findings to non-technical stakeholders. This involves translating
technical jargon into layman's terms without losing the essence of the
analysis.
4. Reports and Presentations: Generating comprehensive reports and
presentations that encapsulate the analysis, methodology, and
recommendations is crucial in conveying the value of the data-driven insights.

Automation in Data Science:
1. Data Collection and Cleaning: Automation tools help in streamlining the process of gathering and preprocessing data, saving time and reducing errors. Libraries like Pandas in Python or dplyr in R are used extensively for this purpose.
2. Model Building and Evaluation: Automating the creation and evaluation of models using techniques like AutoML (Automated Machine Learning) can speed up the process of finding the best model for a given problem.
3. Deployment of Models: Automation assists in deploying machine learning models into production environments, enabling real-time predictions and decision-making.
4. Task Scheduling: Automating repetitive tasks like model training, data updates, or report generation through schedulers or scripts helps in maintaining a smooth workflow.
5. Optimization and Hyperparameter Tuning: Tools and techniques automate the process of optimizing models by fine-tuning hyperparameters, which leads to improved model performance.
Both presentation and automation are essential components in the lifecycle of
a data science project. The ability to convey insights effectively and efficiently
automate processes contributes significantly to the success of data-driven
initiatives.
Data analytics (DA) is the process of examining data sets to find trends and
draw conclusions about the information they contain. Increasingly, data
analytics is done with the aid of specialized systems and software.

4 KEY TYPES OF DATA ANALYTICS
1. Descriptive Analytics: Descriptive analytics is the simplest type of analytics and the foundation the other types are built on. It allows you to pull trends from raw data and succinctly describe what happened or is currently happening.
Descriptive analytics answers the question, “What happened?”
Data visualization is a natural fit for communicating descriptive analysis
because charts, graphs, and maps can show trends in data—as well as dips and
spikes—in a clear, easily understandable way.
2. Diagnostic Analytics: Diagnostic analytics addresses the next logical
question, “Why did this happen?”
Taking the analysis a step further, this type includes comparing coexisting
trends or movement, uncovering correlations between variables, and
determining causal relationships where possible.
Diagnostic analytics is useful for getting at the root of an organizational issue.
3. Predictive Analytics: Predictive analytics is used to make predictions
about future trends or events and answers the question, “What might happen
in the future?”
By analyzing historical data in tandem with industry trends, you can make
informed predictions about what the future could hold for your company.
Making predictions for the future can help your organization formulate
strategies based on likely scenarios.
4. Prescriptive Analytics: Finally, prescriptive analytics answers the
question, “What should we do next?”
Prescriptive analytics takes into account all possible factors in a scenario and
suggests actionable takeaways. This type of analytics can be especially useful
when making data-driven decisions.

Data analytics lifecycle:
The life cycle of data analytics consists of the following phases, each described in detail below.
Phase 1: Discovery –
1. The data science team learns about the business domain and researches the problem.
2. Create context and gain understanding.
3. Learn about the data sources that are needed and accessible to the
project.
4. The team comes up with an initial hypothesis, which can be later
confirmed with evidence.
Phase 2: Data Preparation -
1. The team explores ways to pre-process, analyse, and prepare the data before analysis and modelling.
2. An analytic sandbox is required; the team extracts, loads, and transforms data to bring it into the sandbox.
3. Data preparation tasks can be repeated and are not performed in a predetermined sequence.
4. Some of the tools commonly used for this process include Hadoop, Alpine Miner, OpenRefine, etc.
Phase 3: Model Planning -
1. The team studies the data to discover the connections between variables. It then selects the most significant variables as well as the most effective models.
2. In this phase, the data science team plans the data sets to be used for training, testing, and production purposes.
3. The team builds and implements models based on the work completed in the model planning phase.
4. Some of the tools commonly used for this stage are MATLAB and STATISTICA.
Phase 4: Model Building -
1. The team creates datasets for training, testing, and production use.
2. The team also evaluates whether its current tools are sufficient to run the models or whether a more robust environment is required.
3. Free or open-source tools: R and PL/R, Octave, WEKA.
4. Commercial tools: MATLAB, STATISTICA.
Phase 5: Communicate Results -
1. Following the execution of the model, team members will need to
evaluate the outcomes of the model to establish criteria for the success
or failure of the model.
2. The team is considering how best to present findings and outcomes to
the various members of the team and other stakeholders while taking
into consideration cautionary tales and assumptions.
Phase 6: Operationalize -
1. This technique allows the team to gain insight into the performance and
constraints related to the model within a production setting at a small
scale and then make necessary adjustments before full deployment.
2. The team produces the last reports, presentations, and codes.
3. Open-source or free tools used here include WEKA, SQL, MADlib, and Octave.

Regression analysis is a statistical technique for measuring the relationship between variables. It estimates the values of the dependent variable from the value of an independent variable. The main use of regression analysis is to determine the strength of predictors, forecast an effect, a trend, etc.
In order to understand regression analysis fully, it’s essential to comprehend the
following terms:
 Dependent Variable: This is the main factor that you’re trying to
understand or predict.
 Independent Variables: These are the factors that you hypothesize
have an impact on your dependent variable

Type of Regression analysis
1. Linear Regression
Linear regression is one of the most basic types of regression in machine learning. The linear regression model consists of a predictor variable and a dependent variable related linearly to each other. If the data involves more than one independent variable, the model is called multiple linear regression.
The equation below is used to denote the linear regression model:
y = mx + c + e
where m is the slope of the line, c is the intercept, and e is the error term.
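As an illustration, a minimal sketch of fitting a simple linear regression with scikit-learn; the toy data is invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2x + 1 plus a small error term e.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

model = LinearRegression().fit(X, y)
print("slope m:", model.coef_[0])        # estimated m
print("intercept c:", model.intercept_)  # estimated c
print("prediction for x = 6:", model.predict([[6]])[0])
```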
Logistic regression is a type of regression analysis used when the dependent variable is discrete. Example: 0 or 1, true or false, etc. In binary logistic regression the target variable can have only two values, and a sigmoid curve denotes the relation between the target variable and the independent variables. There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
Below is the equation that denotes logistic regression:
logit(p) = ln(p / (1 − p)) = b0 + b1X1 + b2X2 + b3X3 + … + bkXk
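A minimal binary logistic regression sketch with scikit-learn; the pass/fail data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied vs. pass (1) / fail (0).
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print("coefficient b1:", clf.coef_[0][0])
print("intercept b0:", clf.intercept_[0])
print("P(pass | 3.5 hours):", clf.predict_proba([[3.5]])[0][1])
```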

Ridge Regression: This is another type of regression in machine learning, usually used when there is a high correlation between the independent variables. In the case of multicollinear data, the least squares estimates are unbiased but their variances are large, which makes them unreliable. Therefore, a bias term (the penalty λ) is introduced in the ridge regression equation. This is a powerful regression method in which the model is less susceptible to overfitting.
Below is the equation used to denote ridge regression:
β = (X^T X + λI)^{-1} X^T y

Lasso Regression: Lasso regression is a type of regression in machine learning that performs regularization along with feature selection. It penalizes the absolute size of the regression coefficients. As a result, coefficient values can shrink all the way to zero, which does not happen in the case of ridge regression. Below is the objective minimized by the lasso regression method:
N^{-1} Σ_{i=1}^{N} (y_i − α − x_i^T β)² + λ Σ_j |β_j|
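A minimal sketch contrasting the two penalized methods with scikit-learn; the alpha argument plays the role of λ, and the toy data with two nearly collinear features is invented:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Toy data with two highly correlated features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)   # almost identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks coefficients but keeps both features
lasso = Lasso(alpha=0.1).fit(X, y)   # can drive some coefficients exactly to zero

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)
```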

Polynomial Regression: Polynomial regression is another type of regression analysis technique in machine learning, which is the same as multiple linear regression with a small modification. In polynomial regression, the relationship between the independent variable X and the dependent variable Y is modeled as an n-th degree polynomial. The equation below represents polynomial regression:
Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ
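A minimal polynomial regression sketch with scikit-learn, fitting a degree-2 model to made-up data that follows a roughly quadratic pattern:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data: y grows roughly like x squared.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.1])

# Expand x into [x, x^2] features, then fit an ordinary linear regression on them.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("prediction for x = 6:", model.predict([[6]])[0])
```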

The classification algorithm is a supervised learning technique that is used to identify the category of new observations on the basis of training data. In classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.

K-Nearest Neighbors: KNN (K-Nearest Neighbors) is one of many algorithms used in data mining and machine learning. KNN is a classifier algorithm in which the learning is based on how similar a data point (a vector) is to the others. It stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).

Decision Tree: The decision tree algorithm belongs to the family of supervised learning algorithms. It can be used to solve both regression and classification problems. A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The purpose of using the decision tree algorithm is to predict the class or value of the target variable by learning simple decision rules inferred from prior data.

Random Forest: Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing multiple decision trees at training time. For a classification task, the output of the random forest is the class selected by the most trees. For a regression task, the mean or average prediction of the individual trees is returned. Random forests generally outperform single decision trees but have lower accuracy than gradient-boosted trees. However, the characteristics of the data can affect their performance.

Naïve Bayes: Naive Bayes is a classification technique based on Bayes' theorem with the assumption of independence between predictors. In simple terms, the Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. It updates its knowledge step by step as new information arrives.
Conclusion: Classification algorithms in machine learning use input training data to predict the likelihood that subsequent data will fall into one of the predetermined categories. Five classification algorithms widely used in data science are Neural Network, K-Nearest Neighbors, Decision Tree, Random Forest, and Naïve Bayes.
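A minimal sketch comparing four of these classifiers on scikit-learn's built-in Iris dataset; accuracy values will vary with the random split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Load a small labelled dataset and hold out 30% for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```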

Neural Network: A neural network is a set of algorithms that attempt to identify the underlying relationships in a data set through a process that mimics how the human brain operates. In data science, neural networks help to cluster and classify complex relationships. Neural networks can be used to group unlabelled data according to similarities among the example inputs and to classify data when a labelled dataset is available to train on.

Clustering is the task of dividing unlabeled data or data points into different clusters such that similar data points fall in the same cluster, while data points that differ from one another fall in different clusters.

Clustering algorithms work by iteratively assigning objects to clusters. The assignment of objects to clusters is based on a measure of similarity between objects. The most common measures of similarity are Euclidean distance and Manhattan distance.
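A minimal sketch of partitional clustering with k-means in scikit-learn, which iteratively assigns points to the nearest cluster centre using Euclidean distance; the 2-D toy data is invented:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data with two visibly separated groups.
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_)
print("cluster centres:", kmeans.cluster_centers_)
```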

Clustering can be divided into two subgroups:
 Hierarchical clustering: Hierarchical clustering algorithms create a hierarchy of clusters, where each cluster is a subset of its parent cluster.
 Partitional clustering: Partitional clustering algorithms partition the data into a fixed number of clusters.
 Example: A marketing company uses clustering to identify groups of customers with similar purchasing habits.
 A bank uses clustering to identify patterns in transaction data that are indicative of fraud.

Clustering can provide a number of benefits, including:
 Improved decision-making: Clustering can help you make better decisions by providing insights into the data.
 Reduced costs: Clustering can help you reduce costs by identifying patterns in the data that can be used to improve efficiency.
 Increased revenue: Clustering can help you increase revenue by identifying new opportunities and targeting your marketing efforts more effectively.

Association rule analysis (ARA) is a data mining technique used to discover interesting associations between items in a large set of data. It is a popular technique for market basket analysis, which is used to identify products that are frequently purchased together.

How it works
ARA works by identifying itemsets that occur frequently together in a dataset. An itemset is a set of one or more items. The frequency of an itemset is the number of times it appears in the dataset.
To identify interesting associations, ARA uses two measures: support and confidence.
 Support is the proportion of transactions that contain the itemset. For example, if the itemset {milk, bread} has a support of 0.2, this means that 20% of transactions contain both milk and bread.
 Confidence is the proportion of transactions that contain the consequent itemset given that they contain the antecedent itemset. For example, if the rule {milk} → {bread} has a confidence of 0.6, this means that 60% of transactions that contain milk also contain bread.
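A minimal sketch computing support and confidence for the rule {milk} → {bread} from a hypothetical list of transactions:

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]

n = len(transactions)
milk = sum(1 for t in transactions if "milk" in t)
milk_and_bread = sum(1 for t in transactions if {"milk", "bread"} <= t)

support = milk_and_bread / n          # share of all transactions containing both items
confidence = milk_and_bread / milk    # share of milk transactions that also contain bread

print(f"support({{milk, bread}}) = {support:.2f}")        # 0.60
print(f"confidence(milk -> bread) = {confidence:.2f}")    # 0.75
```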

ARA can be used to identify a wide variety of interesting associations, including:
 Market basket relationships: For example, you might discover that diapers are frequently purchased together with baby wipes.
 Cross-selling opportunities: For example, you might discover that customers who buy a particular product are also likely to buy another product.
 Customer segmentation: For example, you might discover that there are two distinct groups of customers, one that is likely to buy diapers and baby wipes and the other that is likely to buy electronics and home appliances.

Benefits of association rule analysis
ARA can provide a number of benefits, including:
 Improved marketing: ARA can be used to identify effective marketing campaigns and target specific customer segments.
 Increased sales: ARA can be used to identify cross-selling opportunities and recommend products to customers.
 Reduced costs: ARA can be used to identify unnecessary product inventory and reduce costs.

Applications of association rule analysis
ARA is used in a wide variety of industries, including:
 Retail: ARA is used to identify market basket relationships and cross-selling opportunities.
 Finance: ARA is used to identify fraud patterns and risk factors.
 Healthcare: ARA is used to identify patterns in patient data and improve diagnoses.
 Manufacturing: ARA is used to identify quality control issues and improve production efficiency.

Examples of association rule analysis
Here are some examples of association rule analysis:
 A grocery store discovers that diapers are frequently purchased together with baby wipes.
 An online retailer discovers that customers who buy a laptop are also likely to buy a mouse and a keyboard.
 A bank discovers that customers who make large deposits are also likely to make large withdrawals.
 A hospital discovers that patients who have pneumonia are also likely to have a fever.
 A manufacturing plant discovers that machines that are frequently used together are also likely to break down together.

Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing numerical data. It plays a vital role in various fields such as business, research, and decision-making. Understanding statistics enables us to make informed decisions and draw meaningful conclusions from data. Here are some of the most common basic terminologies in statistics:

Population: A population is the entire group of individuals or objects that a researcher is interested in studying. For example, the population of all adults in India would be all adults who are citizens or residents of India.
Sample: A sample is a subset of the population that is selected to represent the entire population. For example, a sample of 1,000 adults in India would be a small subset of the entire population of adults in India.
Parameter: A parameter is a numerical characteristic of a population. For example, the mean height of all adults in India is a parameter.
Statistic: A statistic is a numerical characteristic of a sample. For example, the mean height of a sample of 1,000 adults in India is a statistic.
Variable: A variable is a characteristic of an individual or object that can take on different values. For example, height, weight, and age are all variables.
Frequency Distribution: A frequency distribution shows how often each value of a variable occurs in a dataset. For example, a frequency distribution of the heights of a sample of 1,000 adults would show how many adults are of each height.

Measures of Central Tendency: Measures of central tendency are used to describe the "middle" of a dataset. The three most common measures of central tendency are the mean, median, and mode.

 Mean: The mean is the sum of all the values in a dataset divided by the
number of values.
 Median: The median is the middle value in a dataset when the values
are arranged in order from smallest to largest.
 Mode: The mode is the value that occurs most frequently in a dataset.

Measures of Dispersion: Measures of dispersion are used to describe the spread of a dataset. The two most common measures of dispersion are the variance and standard deviation.
 Variance: The variance is the average squared deviation from the
mean of a dataset.
 Standard Deviation: The standard deviation is the square root of the
variance.

Normal Distribution: A symmetrical bell-shaped curve that represents the distribution of many types of data, where the mean, median, and mode are all equal.

Population is the entire set of items from which you draw data for a statistical study. It can be a group of individuals, a set of items, etc. It makes up the data pool for a study.
Generally, population refers to the people who live in a particular area at a specific time. But in statistics, population refers to the data relevant to your study of interest. It can be a group of individuals, objects, events, organizations, etc. You use populations to draw conclusions.
An example of a population would be the entire student body at a school. It would contain all the students who study in that school at the time of data collection. Depending on the problem statement, data from each of these students is collected. An example is the students who speak Hindi among the students of a school.
For the above situation, it is easy to collect data: the population is small, willing to provide data, and can be contacted. The data collected will be complete and reliable.

Population Parameter:
Mean: μ = (ΣX) / N, where ΣX is the sum of all values in the population and N is the size of the population.
Standard Deviation: σ = √[Σ(X − μ)² / N], where X is a value in the population, μ is the population mean, and N is the size of the population.

A sample is a subset of a population. It is used to represent the entire population and is selected in a way that is meant to be unbiased. This means that every member of the population has an equal chance of being selected for the sample.

Common sampling techniques used in data science include:
1. Simple random sampling (SRS): In SRS, every data point in the population has the same probability of being selected. This method is easy to implement and provides a representative sample of the population. However, it can be expensive when the population is large.
2. Systematic sampling: In systematic sampling, the first data point is
selected at random and then every nth data point is selected
afterwards, where n is a predetermined number. This method is
efficient and cost-effective, but it does not provide a perfectly random
sample.
3. Stratified sampling: In stratified sampling, the population is divided
into subgroups (strata) based on one or more characteristic(s).
Samples are then selected from each stratum in proportion to its size.
This method can be used to ensure that different subgroups of the
population are equally represented in the sample.
4. Cluster sampling: In cluster sampling, the population is divided into groups (clusters) based on a predetermined criterion, such as geographic location or demographic characteristics. A random selection of clusters is then taken, and data is collected from the members of the selected clusters. This method is cost-effective when the population is spread across many locations.
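A minimal sketch of simple random and stratified sampling with pandas; the population DataFrame and its segment column are hypothetical:

```python
import pandas as pd

# Hypothetical population of 1,000 customers split into three segments.
population = pd.DataFrame({
    "customer_id": range(1000),
    "segment": ["retail"] * 600 + ["corporate"] * 300 + ["sme"] * 100,
})

# Simple random sampling: every row has the same chance of being selected.
srs = population.sample(n=100, random_state=42)

# Stratified sampling: draw 10% from each segment so proportions are preserved.
stratified = population.groupby("segment").sample(frac=0.10, random_state=42)

print(srs["segment"].value_counts())
print(stratified["segment"].value_counts())
```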

Parameters play a crucial role in statistical models and machine learning algorithms. They are numerical values that define the specific characteristics of a model and influence its ability to make predictions or inferences from data.
Parameters can be categorized into two main types: model parameters and hyperparameters.

Model parameters are the internal variables of a model that are directly
estimated from the training data. They are the "knobs" that the model can
adjust to fit the data and make accurate predictions. For instance, in linear
regression, the slope and intercept coefficients are model parameters that
determine the relationship between the input variable and the output
variable.
Hyperparameters, on the other hand, are not directly estimated from the training data. Instead, they are set before the model is trained and control the learning process itself. Hyperparameters influence how the model learns from the data and can significantly impact its performance. Examples of hyperparameters include the learning rate in gradient descent optimization, the number of hidden layers in a neural network, and the regularization parameter used to prevent overfitting.

Parameter importance measures the relative contribution of each parameter to the model's predictions. Understanding parameter importance can provide insights into the underlying relationships within the data and help identify the most influential factors for the target outcome.

Estimation in data science refers to the process of obtaining approximate values for unknown parameters of a population using a sample of data. This is in contrast to descriptive statistics, which summarize the data directly without making inferences about the population.

Types of Estimation

There are two main types of estimation in data science:

1. Point estimation: This involves using a single value to estimate an unknown population parameter. For example, the sample mean is a point estimator for the population mean.
2. Interval estimation: This involves providing a range of values within which the true population parameter is likely to lie. For example, a confidence interval is an interval estimator for the population mean.

Estimation Methods: There are many different estimation methods in data science, depending on the type of parameter being estimated and the characteristics of the sample data. Some common methods include:
1. Method of moments: This involves equating sample moments (such as the sample mean and variance) with their theoretical counterparts and solving the resulting equations for the parameter values.
2. Maximum likelihood estimation (MLE): This involves finding the parameter value that maximizes the likelihood function, which represents the probability of observing the sample data given the parameter value.
3. Bayesian estimation: This involves using Bayesian inference to calculate the posterior distribution of the parameter, given the sample data and prior information about the parameter.
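As an illustration of the two types of estimation described earlier, a minimal sketch computing a point estimate (the sample mean) and an approximate 95% confidence interval using the normal-based ±1.96 standard errors rule; the sample values are made up:

```python
import math

# Hypothetical sample of daily sales figures.
sample = [102, 98, 110, 95, 105, 99, 108, 101, 97, 104]

n = len(sample)
mean = sum(sample) / n                                  # point estimate of the population mean
var = sum((x - mean) ** 2 for x in sample) / (n - 1)    # sample variance
se = math.sqrt(var / n)                                 # standard error of the mean

# Interval estimate: approximate 95% confidence interval.
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"point estimate: {mean:.2f}")
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```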

Applications of Estimation
1. Predictive modeling: Estimating unknown parameters of a prediction model, such as the coefficients in a linear regression model.
2. Hypothesis testing: Estimating the standard deviation of a sample to determine the power of a hypothesis test.
3. Sampling: Estimating the size of a population based on a sample.
4. Risk assessment: Estimating the probability of a particular event occurring, such as credit card fraud or loan defaults.

An Estimator is a rule or a formula applied to sample data to estimate an unknown quantity or parameter related to a population. Estimators are used to make predictions, infer characteristics, or estimate values of parameters based on the information available from a sample.

There are two main types of estimators:
1. Point Estimators: These provide a single value as an estimate for the unknown population parameter. For example, the sample mean, sample variance, or sample proportion are all point estimators. If you want to estimate the average height of a population, you might use the sample mean height as an estimator.
2. Interval Estimators: These provide a range of values within which the true parameter is likely to lie, along with a level of confidence. Confidence intervals are a common example of interval estimators. They indicate a range of values that likely contains the true population parameter, based on the sample data and a specified level of confidence.

The quality of an estimator is measured by its properties, such as bias, efficiency, and consistency:
 Bias: An estimator is unbiased if, on average, it provides estimates that are equal to the true population parameter. An estimator with no systematic tendency to overestimate or underestimate is desirable.
 Efficiency: An efficient estimator has a smaller variance, meaning it produces estimates that are more tightly clustered around the true parameter value.
 Consistency: A consistent estimator tends to converge to the true population parameter as the sample size increases.

A sampling distribution refers to the distribution of a statistic calculated from multiple samples of the same size taken from the same population. It helps in understanding the behavior of statistics derived from various samples and provides insights into the variability of these statistics.

The sampling distribution depends on multiple factors – the statistic, sample size, sampling
process, and the overall population. It is used to help calculate statistics such as means,
ranges, variances, and standard deviations for the given sample.

Types of Sampling Distribution
1. Sampling distribution of the mean: You can calculate the mean of every sample group chosen from the population and plot all the resulting data points. The graph will show a normal distribution, and its center will be the mean of the sampling distribution, which equals the mean of the entire population.
2. Sampling distribution of proportion: This gives you information about proportions in a population. You would select samples from the population and compute the sample proportion for each. The mean of all the sample proportions calculated from the sample groups becomes an estimate of the proportion of the entire population.
3. T-distribution: The T-distribution is used when the sample size is very small or not much is known about the population. It is used to estimate the mean of the population, confidence intervals, statistical differences, and linear regression.

Standard error is a mathematical tool used in statistics to measure the variability or precision of a sample statistic (such as the mean or proportion) used to estimate a population parameter. It quantifies how much the sample statistic tends to vary from the true population parameter.
Standard Error (SE) Formula: The SE formula is used to determine the reliability of a sample that represents a population. It describes how much the sample mean differs from the population mean and is expressed as:

SE = S/√(n), where
 S is the standard deviation of the data
 n is the number of observations
Standard Error of the Mean (SEM): The Standard Error of the Mean (SEM), also known as the standard deviation of the mean, is the standard deviation of the sample mean used as an estimate of the population mean.
SEM = S/√(n), where
 S is the standard deviation
 n is the number of observations

The Standard Error of Estimate (SEE) is used to find the accuracy of a prediction. It is also called the Sum of Squares Error, and it is the square root of the average squared deviation:
SEE = √[Σ(xi − μ)² / (n − 2)], where
 xi are the data values
 μ is the mean value of the data
 n is the sample size
Example: Find the standard error for the sample data: 5, 8, 10, 12.
Solution: Mean of the given data (μ) = (5 + 8 + 10 + 12)/4 = 8.75
Standard deviation S = √[((5 − 8.75)² + (8 − 8.75)² + (10 − 8.75)² + (12 − 8.75)²)/(4 − 1)] = √(26.75/3) ≈ 2.99
SE = S/√n = 2.99/√4 ≈ 1.49
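The same calculation can be verified with a short Python snippet (statistics.stdev uses the sample standard deviation, i.e. division by n − 1):

```python
import math
import statistics

data = [5, 8, 10, 12]
s = statistics.stdev(data)        # sample standard deviation ≈ 2.99
se = s / math.sqrt(len(data))     # SE = S / √n ≈ 1.49
print(round(s, 2), round(se, 2))
```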
A good estimator is one that provides accurate and reliable estimates of the true
values of the population parameters. Here are some key properties of good
estimator:
1. Unbiasedness: An estimator is unbiased if its expected value equals the true
population parameter. In other words, on average, the estimator should
estimate the true value accurately without any systematic bias. Bias can arise
from various factors, such as sampling errors, model misspecification, or
measurement errors.
2. Consistency: An estimator is consistent if it converges to the true population
parameter as the sample size increases. This means that as we collect more data,
our estimates should become increasingly accurate and reliable. Consistency is a
crucial property for long-term analysis and inferences about the population.
3. Efficiency: An estimator is efficient if it has the smallest possible variance
among all unbiased estimators of the same parameter. In other words, an
efficient estimator provides the most precise estimates with the least amount of
uncertainty. Efficiency is important because it allows us to make more accurate
inferences with smaller sample sizes.
4. Sufficiency: An estimator is sufficient if it captures all the information in
the sample that is relevant to the parameter being estimated. This means that
no other statistic based on the sample can provide additional information
about the parameter. Sufficiency ensures that we extract the maximum
amount of information from the data for accurate estimation.
measures of central tendency, often referred to as measures of center, are
statistical metrics used to describe the central or average value of a dataset.
These measures provide a way to summarize and understand where the data
tends to cluster or center around. The main measures of center include:
Mean in general terms refers to the arithmetic mean of the data, but besides the arithmetic mean there are also the geometric mean and the harmonic mean, which are calculated using different formulas. All three are given below, starting with the arithmetic mean.
Mean for Ungrouped Data:
x̄ = Σ xi /n or Mean = Sum of all Observations ÷ Total Number of Observations
For example, for the observations 27, 11, 17, 19, 21:
x̄ = (27 + 11 + 17 + 19 + 21) ÷ 5 = 95 ÷ 5 = 19
Mean for grouped Data: Mean = ∑xi fi / ∑fi
Arithmetic Mean: The formula for Arithmetic Mean is given by
x̄ = Σ xi /N Where,
 x1, x2, x3, . . ., xn are the observations, and
 N is the number of observations.
Geometric Mean: The formula for Geometric Mean is given by
G.M. = (x1 · x2 · x3 · … · xn)^(1/n) Where,
 x1, x2, x3, . . ., xn are the observations, and
 n is the number of observations.
Harmonic Mean: The formula for Harmonic Mean is given by
H.M. = n/(1/x1 + 1/x2 + … + 1/xn) or H.M. = n/Σ(1/xi) Where,
 x1, x2, . . ., xn are the observations, and
 n is the number of observations.
The Median of any distribution is that value that divides the distribution into
two equal parts such that the number of observations above it is equal to the
number of observations below it. Thus, the median is called the central value of
any given data either grouped or ungrouped.
Median of Ungrouped Data: To calculate the Median, the observations must be
arranged in ascending or descending order. If the total number of observations
is N then there are two cases
N is Odd : Median = Value of observation at [(n + 1) ÷ 2]th Position
N is Even: Median = Arithmetic mean of Values of observations at (n ÷ 2)th and
[(n ÷ 2) + 1]th Position
Median of Grouped Data: Median = l + [(n/2 − cf)/f] × h Where,
 l is the lower limit of the median class,
 n is the total number of observations,
 cf is the cumulative frequency of the class preceding the median class,
 f is the frequency of the median class, and
 h is the class size.
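For ungrouped data, both the mean and the median can be computed directly; below is a minimal Python sketch (standard library only) applied to the observations used in the mean example above:

import statistics

observations = [27, 11, 17, 19, 21]

mean = statistics.mean(observations)       # (27 + 11 + 17 + 19 + 21) / 5 = 19
median = statistics.median(observations)   # middle of 11, 17, 19, 21, 27 = 19

print(mean, median)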
Measures of spread, also known as measures of dispersion, are crucial for understanding how "spread out" or "variable" your data is. They quantify how data points deviate from the central tendency (e.g., mean or median). Knowing the spread helps you compare datasets, judge how reliable the central value is, and spot outliers. The main measures are:
1. Range:
 Formula: Range = Maximum Value - Minimum Value
 Example: Consider a dataset of exam scores: {60, 70, 75, 80, 90}.
The maximum value is 90, and the minimum value is 60. Therefore,
the range = 90 - 60 = 30.
2. Variance:
 Formula: Variance (σ²) = Σ [(x - μ)²] / N Where:
 x = Each individual data point
 μ = Mean of the dataset
 N = Total number of data points
 Example: Using the same exam scores dataset, first, find the mean:
(60 + 70 + 75 + 80 + 90) / 5 = 75. The deviations from the mean are:
{-15, -5, 0, 5, 15}. Squaring and averaging these deviations: [(225 +
25 + 0 + 25 + 225) / 5] = 100. Therefore, the variance = 100.
3. Standard Deviation:
 Formula: Standard Deviation (σ) = √Variance
 Example: Using the variance calculated above (Variance = 100),
the standard deviation = √100 = 10.
4. Inter quartile Range (IQR):
 Formula: IQR = Q3 - Q1 Where:
 Q1 = First quartile (25th percentile)
 Q3 = Third quartile (75th percentile)
 Example: For the dataset {10, 20, 30, 40, 50, 60, 70, 80, 90}, Q1 =
25th percentile = 30 and Q3 = 75th percentile = 70. Therefore, IQR =
70 - 30 = 40.
5. Mean Absolute Deviation (MAD):
 Formula: MAD = Σ |x - μ| / N
 Example: Revisiting the exam scores, the mean (μ) was calculated
as 75. The absolute deviations from the mean are: {15, 5, 0, 5, 15}.
The average of these absolute deviations = (15 + 5 + 0 + 5 + 15) /
5 = 8.
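The five measures above can be reproduced with a short Python sketch (illustrative; it assumes numpy is available and uses the population variance with N in the denominator, matching the formula above):

import numpy as np

scores = np.array([60, 70, 75, 80, 90])

data_range = scores.max() - scores.min()           # 90 - 60 = 30
variance = np.mean((scores - scores.mean()) ** 2)  # population variance = 100
std_dev = np.sqrt(variance)                        # 10
mad = np.mean(np.abs(scores - scores.mean()))      # 8

data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
q1, q3 = np.percentile(data, [25, 75])             # 30 and 70
iqr = q3 - q1                                      # 40

print(data_range, variance, std_dev, mad, iqr)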
Probability is the measure of the likelihood of an event/something happening.
It is an important element in predictive analysis allowing you to explore the
computational math behind your outcome.
Types of Probability:
 Theoretical Probability: this focuses on how likely an event is to occur and is based on reasoning. Using theory, the outcome is the expected value. Using the heads and tails example, the theoretical probability of landing on heads is 0.5, or 50%.
 Experimental Probability: this focuses on how frequently an event occurs during an experiment. Using the heads and tails example, if we were to toss a coin 10 times and it landed on heads 6 times, the experimental probability of the coin landing on heads would be 6/10, or 60%.
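A tiny simulation illustrates the difference: the theoretical probability of heads stays fixed at 0.5, while the experimental probability varies from run to run (a minimal sketch using Python's random module; the number of tosses is arbitrary):

import random

tosses = 10
heads = sum(1 for _ in range(tosses) if random.random() < 0.5)

theoretical = 0.5
experimental = heads / tosses   # e.g. 6 heads out of 10 gives 0.6

print("theoretical:", theoretical, "experimental:", experimental)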
Importance of probability: It is very useful for data scientists to know and
understand the chances of an event occurring and can be very effective in the
decision-making process.
You will be constantly working with data and you need to learn more about it
before performing any form of analysis.

What Is a Normal Distribution?: Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean.
In graphical form, the normal distribution appears as a "bell curve".
KEY TAKEAWAYS
The normal distribution is the proper term for a probability bell curve. In the standard normal distribution the mean is zero and the standard deviation is 1; it has zero skew and a kurtosis of 3. Normal distributions are symmetrical, but not all symmetrical distributions are normal. Many naturally occurring phenomena tend to approximate the normal distribution. In finance, however, most pricing distributions are not perfectly normal.
Understanding Normal Distribution: The normal distribution is the most common type of distribution assumed in technical stock market analysis and in other types of statistical analyses. The normal distribution has two parameters: the mean and the standard deviation.
The normal distribution model is important in statistics and is key to the Central Limit Theorem (CLT). This theorem states that averages calculated from independent, identically distributed random variables have approximately normal distributions, regardless of the type of distribution from which the variables are sampled (provided it has finite variance).
Properties of the Normal Distribution
The normal distribution has several key features and properties that define it.
First, its mean (average), median (midpoint), and mode (most frequent
observation) are all equal to one another. Moreover, these values all represent
the peak, or highest point, of the distribution. The distribution then falls
symmetrically around the mean, the width of which is defined by the standard
deviation.
The Empirical Rule
For all normal distributions, 68.2% of the observations will appear within plus
or minus one standard deviation of the mean; 95.4% of the observations will fall
within +/- two standard deviations; and 99.7% within +/- three standard
deviations. This fact is sometimes referred to as the "empirical rule," a heuristic
that describes where most of the data in a normal distribution will appear.
This means that data falling outside of three standard deviations ("3-sigma")
would signify rare occurrences.

The Formula for the Normal Distribution
The normal distribution follows the formula below. Note that only the values of the mean (μ) and standard deviation (σ) are necessary:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
where: x = value of the variable or data being examined and f(x) the probability density function,
μ = the mean,
σ = the standard deviation
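The density and the empirical rule above can be checked numerically with a minimal Python sketch (standard library only; the cumulative probability is obtained from the error function):

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def normal_cdf(x, mu=0.0, sigma=1.0):
    # P(X <= x) expressed via the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Empirical rule: probability of falling within k standard deviations of the mean
for k in (1, 2, 3):
    p = normal_cdf(k) - normal_cdf(-k)
    print(f"within +/- {k} sigma: {p:.1%}")   # ~68.3%, ~95.4%, ~99.7%

print(normal_pdf(0))   # peak of the standard normal curve, ~0.3989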
The term "binary distribution" typically refers to the Bernoulli distribution
or the binomial distribution, which are closely related and commonly used to
model discrete random variables that have two possible outcomes—often
referred to as "success" and "failure" or coded as 0 and 1.
Bernoulli Distribution: The Bernoulli distribution is the simplest form and
represents a single trial with two possible outcomes, usually denoted as 1
(success) and 0 (failure). It's characterized by a single parameter, p, which
represents the probability of success. The probability mass function of the
Bernoulli distribution is:

P(X = x) = p^x · (1 − p)^(1 − x), for x = 0 or 1, where
 P(X = x) is the probability of the random variable X taking on the value x (0 or 1),
 p is the probability of success (the event X = 1), and
 1 − p is the probability of failure (the event X = 0).
Example: Suppose you are flipping a fair coin, and you are interested in
modeling the probability of getting heads (success) in a single flip.
Here, the random variable X represents the outcome of the coin flip:
 X = 1 if it lands heads (success).
 X = 0 if it lands tails (failure).
In this case, since it's a fair coin, the probability of getting heads (p) is 0.5.
Using the Bernoulli distribution formula, the probability of getting heads (X = 1) in a single coin flip is:
P(X = 1) = p^1 · (1 − p)^0 = 0.5
This means that according to the Bernoulli distribution, the probability of getting heads in a single flip of a fair coin is 0.5.
Binomial Distribution: The binomial distribution extends the Bernoulli
distribution to multiple independent trials, each with the same probability of
success p. It represents the number of successes in a fixed number of trials n.
The probability mass function of the binomial distribution is:

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), where
 X is the random variable representing the number of successes,
 k is the number of successes (it can range from 0 to n),
 C(n, k) denotes the binomial coefficient, which counts the number of ways to choose k successes out of n trials,
 p is the probability of success in a single trial, and
 1 − p is the probability of failure in a single trial.
Example: Suppose you are conducting an experiment where you toss a fair coin 10 times, and you want to model the probability of getting exactly 4 heads.
In this scenario:
 n = 10 (number of trials, i.e., coin tosses)
 p = 0.5 (probability of success, i.e., getting heads on a fair coin)
Using the binomial distribution formula, the probability of getting exactly 4 heads in 10 coin tosses can be calculated as:
P(X = 4) = C(10, 4) · (0.5)^4 · (0.5)^6
Calculating the binomial coefficient (the number of ways to choose 4 successes out of 10 trials): C(10, 4) = 210
Substituting the values into the formula:
P(X = 4) = 210 · (0.5)^10 = 210/1024 ≈ 0.205
Therefore, according to the binomial distribution, the probability of getting exactly 4 heads in 10 coin tosses with a fair coin is approximately 0.205, or 20.5%.
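Both distributions are easy to evaluate programmatically; a minimal Python sketch (standard library only) that reproduces the two examples above:

from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

# Binomial example: exactly 4 heads in 10 fair coin tosses
print(round(binomial_pmf(4, 10, 0.5), 3))   # 0.205

# Bernoulli is the special case n = 1: probability of heads in one toss
print(binomial_pmf(1, 1, 0.5))              # 0.5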
Hypothesis testing is a statistical method used to determine if there is
enough evidence in a sample data to draw conclusions about a population. It
involves formulating two competing hypotheses, the null hypothesis (H0) and
the alternative hypothesis (Ha), and then collecting data to assess the
evidence.
Formulating Hypotheses:
 Null Hypothesis (H0): Represents the default assumption, often stating
no effect or no difference.
 Alternative Hypothesis (Ha): Contradicts the null hypothesis,
suggesting an effect or difference.
Types of Hypothesis Testing

Z Test: To determine whether a discovery or relationship is statistically significant, hypothesis testing uses a z-test. It usually checks whether two means are the same (the null hypothesis). A z-test can be applied only when the population standard deviation is known and the sample size is 30 data points or more.

T Test: A t-test is a statistical test employed to compare the means of two groups. It is frequently used in hypothesis testing to determine whether two groups differ or whether a procedure or treatment affects the population of interest.

Chi-Square: You use a chi-square test for hypothesis testing concerning whether your data are as predicted. To determine whether the expected and observed results are well-fitted, the chi-square test analyzes the differences between categorical variables from a random sample. The test's fundamental premise is that the observed values in your data should be compared to the predicted values that would be present if the null hypothesis were true.
The corresponding test statistics are:
Z-test: z = (x̄ − μ) / (σ/√n), where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation and n is the size of the sample.
T-test: t = (x̄ − μ) / (s/√n), where s is the sample standard deviation.
Chi-square: χ² = Σ (Oi − Ei)² / Ei, where Oi is the observed value and Ei is the expected value.

Example 2: The average score on a test is 80 with a standard deviation of 10. With a new teaching curriculum introduced, it is believed that this score will change. On random testing of 38 students, the mean score was found to be 88. At a 0.05 significance level, is there any evidence to support this claim?
Solution: This is an example of two-tailed hypothesis testing, and the z-test is used (σ is known and n > 30).
H0: μ = 80; H1: μ ≠ 80
z = (x̄ − μ) / (σ/√n) = (88 − 80) / (10/√38) ≈ 4.93
At the 0.05 significance level, the two-tailed critical value is 1.96. Since 4.93 > 1.96, the null hypothesis is rejected: there is evidence that the new curriculum changes the average score.
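The same z-test can be checked with a short Python sketch (illustrative; the two-tailed p-value is obtained from the standard normal distribution via the error function):

import math

def z_test(sample_mean, pop_mean, pop_sd, n, alpha=0.05):
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    # two-tailed p-value from the standard normal distribution
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value, p_value < alpha

z, p, reject = z_test(sample_mean=88, pop_mean=80, pop_sd=10, n=38)
print(round(z, 2), p, reject)   # z ~ 4.93, p well below 0.05, reject = True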

The chi-square test is a statistical method used to determine if there's a significant association between categorical variables. It's particularly useful when you have data that fits into categories and you want to assess if there's a relationship between them.
There are different types of chi-square tests, but two of the most common
ones are:
1. Chi-Square Test for Independence (or Contingency Table Test):
This test is used to determine whether there is a significant association
between two categorical variables. It works with a contingency table
(also known as a cross-tabulation table) that displays the frequency
counts of the variables.
2. Chi-Square Goodness of Fit Test: This test compares the observed
categorical data to the expected data to determine if they differ
significantly. It's often used to assess how well the observed frequencies
fit an expected distribution.

The Chi-Square test also has an important application in machine learning: feature selection. Feature selection is a critical topic in machine learning, as you will have multiple features in line and must choose the best ones to build the model. By examining the relationship between the features, the chi-square test aids in solving feature selection problems. The following sections describe the chi-square test and its application.
What Is a Chi-Square Test?
The Chi-Square test is a statistical procedure for determining the difference between observed and expected data. It can also be used to determine whether there is a correlation between the categorical variables in our data. It helps to find out whether a difference between two categorical variables is due to chance or to a relationship between them.
Chi-Square Test Definition
A chi-square test is a statistical test that is used to compare observed and
expected results. The goal of this test is to identify whether a disparity between
actual and predicted data is due to chance or to a link between the variables
under consideration. As a result, the chi-square test is an ideal choice for aiding in
our understanding and interpretation of the connection between our two
categorical variables.
A chi-square test or comparable nonparametric test is required to test a
hypothesis regarding the distribution of a categorical variable. Categorical
variables, which indicate categories such as animals or countries, can be nominal
or ordinal. They cannot have a normal distribution since they can only have a few
particular values.
For example, a meal delivery firm in India wants to investigate the link between
gender, geography, and people's food preferences.
It is used to determine whether the difference between two categorical variables is:
 a result of chance, or
 due to a relationship between them.
Formula For Chi-Square Test
χ²c = Σ (O − E)² / E
Where
c = Degrees of freedom
O = Observed Value
E = Expected Value
The degrees of freedom in a statistical calculation represent the number of
variables that can vary in a calculation. The degrees of freedom can be calculated
to ensure that chi-square tests are statistically valid. These tests are frequently
used to compare observed data with data that would be expected to be obtained
if a particular hypothesis were true.
The observed values are the frequencies you actually gather from your data.
The expected values are the frequencies expected under the null hypothesis.
Fundamentals of Hypothesis Testing
Hypothesis testing is a technique for interpreting and drawing inferences about a
population based on sample data. It aids in determining which sample data best
support mutually exclusive population claims.
Null Hypothesis (H0) - The Null Hypothesis is the assumption that the event will
not occur. A null hypothesis has no bearing on the study's outcome unless it is
rejected.
H0 is the symbol for it, and it is pronounced H-naught.
Alternate Hypothesis(H1 or Ha) - The Alternate Hypothesis is the logical opposite
of the null hypothesis. The acceptance of the alternative hypothesis follows the
rejection of the null hypothesis. H1 is the symbol for it.
What Are Categorical Variables?
Categorical variables belong to a subset of variables that can be divided into
discrete categories. Names or labels are the most common categories. These
variables are also known as qualitative variables because they depict the
variable's quality or characteristics.
Categorical variables can be divided into two categories:
1. Nominal Variable: A nominal variable's categories have no natural ordering.
Example: Gender, Blood groups
2. Ordinal Variable: An ordinal variable is one whose categories can be ordered. Customer satisfaction (Excellent, Very Good, Good, Average, Bad, and so on) is an example.
Why Do You Use the Chi-Square Test?
Chi-square is a statistical test that examines the differences between categorical
variables from a random sample in order to determine whether the expected and
observed results are well-fitting.
Here are some of the uses of the Chi-Squared test:
 The Chi-squared test can be used to see if your data follows a well-known
theoretical probability distribution like the Normal or Poisson distribution.
 The Chi-squared test allows you to assess your trained regression model's
goodness of fit on the training, validation, and test data sets.
What Does A Chi-Square Statistic Test Tell You?
A Chi-Square test (symbolically represented as χ²) is fundamentally a data
analysis based on the observations of a random set of variables. It computes how
a model equates to actual observed data. A Chi-Square statistic test is calculated
based on the data, which must be raw, random, drawn from independent
variables, drawn from a wide-ranging sample and mutually exclusive. In simple
terms, two sets of statistical data are compared -for instance, the results of
tossing a fair coin. Karl Pearson introduced this test in 1900 for categorical data
analysis and distribution. This test is also known as ‘Pearson’s Chi-Squared Test’.
Chi-Squared Tests are most commonly used in hypothesis testing. A hypothesis is
an assumption that any given condition might be true, which can be tested
afterwards. The Chi-Square test estimates the size of inconsistency between the
expected results and the actual results when the size of the sample and the
number of variables in the relationship is mentioned.
These tests use degrees of freedom to determine whether a particular null hypothesis can be rejected based on the total number of observations made in the experiment. The larger the sample size, the more reliable the result.
There are two main types of Chi-Square tests namely -
1. Independence
2. Goodness-of-Fit
Independence
The Chi-Square Test of Independence is a derivable ( also known as inferential )
statistical test which examines whether the two sets of variables are likely to be
related with each other or not. This test is used when we have counts of values
for two nominal or categorical variables and is considered a non-parametric test. A relatively large sample size and independence of observations are the required criteria for conducting this test.
For Example-
In a movie theatre, suppose we made a list of movie genres. Let us consider this
as the first variable. The second variable is whether or not the people who came
to watch those genres of movies have bought snacks at the theatre. Here the null hypothesis is that the genre of the film and whether people bought snacks are unrelated. If this is true, the movie genres don't impact snack sales.
Goodness-Of-Fit
In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines
whether a variable is likely to come from a given distribution or not. We must
have a set of data values and the idea of the distribution of this data. We can use
this test when we have value counts for categorical variables. This test demonstrates a way of deciding whether the data values are a "good enough" fit for our idea, or whether they form a representative sample of the entire population.
For Example-
Suppose we have bags of balls with five different colours in each bag. The given
condition is that the bag should contain an equal number of balls of each colour.
The idea we would like to test here is that the proportions of the five colours of balls in each bag are equal.
Who Uses Chi-Square Analysis?
Chi-square is most commonly used by researchers who are studying survey
response data because it applies to categorical variables. Demography, consumer
and marketing research, political science, and economics are all examples of this
type of research.
Example
Let's say you want to know if gender has anything to do with political party
preference. You poll 440 voters in a simple random sample to find out which
political party they prefer. The results of the survey are shown in the table below:

           Republican   Democrat   Independent   Total
Male           100          70          30        200
Female         140          60          20        220
Total          240         130          50        440
To see if gender is linked to political party preference, perform a Chi-Square test of independence using the steps below.
Step 1: Define the Hypothesis
H0: There is no link between gender and political party preference.
H1: There is a link between gender and political party preference.
Step 2: Calculate the Expected Values
Now you will calculate the expected frequency for each cell: Expected Value = (Row Total × Column Total) / Grand Total.

For example, the expected value for Male Republicans is: (200 × 240) / 440 ≈ 109.09
Similarly, you can calculate the expected value for each of the cells.

Step 3: Calculate (O − E)² / E for Each Cell in the Table
Now you will calculate (O − E)² / E for each cell in the table, where O = Observed Value and E = Expected Value.

Step 4: Calculate the Test Statistic X²
X² is the sum of all the values in the last table:
X² = 0.743 + 2.05 + 2.33 + 3.33 + 0.384 + 1 = 9.837
The degrees of freedom = (r − 1)(c − 1) = (2 − 1)(3 − 1) = 2. Finally, you compare the obtained statistic to the critical statistic found in the chi-square table. For an alpha level of 0.05 and two degrees of freedom, the critical statistic is 5.991, which is less than the obtained statistic of 9.837. You can therefore reject the null hypothesis, because the obtained statistic is higher than the critical statistic.
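For reference, the same test can be reproduced programmatically; the sketch below assumes scipy is installed and uses scipy.stats.chi2_contingency on the observed counts (the result differs slightly from the hand calculation because no intermediate rounding is done):

from scipy.stats import chi2_contingency

# Observed counts: rows = gender (Male, Female), columns = Republican, Democrat, Independent
observed = [
    [100, 70, 30],
    [140, 60, 20],
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), dof, round(p_value, 4))   # ~9.82, 2, ~0.0074
# p < 0.05, so the null hypothesis of independence is rejected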
Weka stands for Waikato Environment for Knowledge Analysis. It is a
collection of open-source machine learning tools used in data mining. It is
designed to assist in the application of machine learning to real-world
datasets and allows you to apply various algorithms, models and
classification techniques to your data. Weka is a collection of tools for:
 Regression
 Clustering
 Association
 Data pre-processing
 Classification
 Visualisation

Weka Explorer: 1. Preprocessing: Data preprocessing is a must. There are three ways to load the data for preprocessing:
 Open File – enables the user to select the file from the local machine
 Open URL – enables the user to select the data file from a different location
 Open Database – enables the user to retrieve a data file from a database source

2. Classification: To predict nominal or numeric quantities, we have classifiers in Weka. Available learning schemes are decision trees and lists, support vector machines, instance-based classifiers, logistic regression and Bayes nets. Once the data has been loaded, all the tabs are enabled. Based on the requirements, and by trial and error, we can find the most suitable algorithm to produce an easily understandable representation of the data.

Before running any classification algorithm, we need to set the test options. The available test options are listed below.
Use training set: Evaluation is based on how well the classifier can predict the class of the instances it was trained on.
Supplied test set: Evaluation is based on how well the classifier can predict the class of a set of instances loaded from a file.
Cross-validation: Evaluation is based on cross-validation, using the number of folds entered in the 'Folds' text field.
Percentage split: Evaluation is based on how well the classifier can predict a certain percentage of the data, held out for testing, using the value entered in the '%' field.

To classify the data set based on the characteristics of attributes, Weka uses
classifiers.

Clustering: The cluster tab enables the user to identify similarities or groups
of occurrences within the data set. Clustering can provide data for the user to
analyse. The training set, percentage split, supplied test set and classes are
used for clustering, for which the user can ignore some attributes from the
data set, based on the requirements. Available clustering schemes in Weka are
k-Means, EM, Cobweb, X-means and FarthestFirst.

Association: The only available scheme for association in Weka is the Apriori
algorithm. It identifies statistical dependencies between clusters of attributes,
and only works with discrete data. The Apriori algorithm computes all the
rules having minimum support and exceeding a given confidence level.

Attribute selection: Attribute selection crawls through all possible combinations of attributes in the data to decide which of these will best fit the desired calculation, that is, which subset of attributes works best for prediction. The attribute selection method contains two parts.
 Search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking algorithm
 Evaluation method: correlation-based, wrapper, information gain, chi-squared
All the available attributes are used in the evaluation of the data set by
default. But it enables users to exclude some of them if they want to.

Visualisation: The user can see the final piece of the puzzle, derived
throughout the process. It allows users to visualise a 2D representation of
data, and is used to determine the difficulty of the learning problem. We can
visualise single attributes (1D) and pairs of attributes (2D), and rotate 3D
visualisations in Weka. It has the Jitter option to deal with nominal attributes
and to detect ‘hidden’ data points.

Regression algorithms are a type of machine learning algorithm used to predict numerical values based on input data. Regression algorithms attempt to find a relationship between the input variables and the output variable by fitting a mathematical model to the data. The goal of regression is to find a mathematical relationship between the input features and the target variable that can be used to make accurate predictions on new, unseen data.

1. Linear regression: Linear regression is a simple and widely used algorithm. It assumes a linear relationship between the independent variables and the target variable. The algorithm estimates the coefficients of the linear equation that best fits the data. The equation can be of the form y = m x + c, where y is the target variable, x is the input feature, m is the slope, and c is the intercept.
Example applications include predicting housing prices based on features like square footage and number of bedrooms, or estimating sales based on advertising expenditure (a short code sketch for both regression algorithms follows this list).

2. Logistic regression: Logistic regression is a popular algorithm used for binary classification problems where the target variable has two possible outcomes (e.g., yes/no, true/false, 0/1). Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It models the relationship between the independent variables (input features) and the binary target variable using the logistic function, also known as the sigmoid function.
Example: predicting whether a customer will churn (i.e., stop doing business with a company) based on their demographic information and purchase history.
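A hedged sketch of both algorithms using scikit-learn, assuming it is installed; the data below is synthetic and the feature meanings (square footage, churn) are purely illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# --- Linear regression: predict a numeric target, y = m*x + c + noise ---
X = rng.uniform(500, 3000, size=(100, 1))                  # e.g. square footage
y = 50 * X.ravel() + 20_000 + rng.normal(0, 10_000, 100)   # e.g. house price

lin = LinearRegression().fit(X, y)
print("slope m:", lin.coef_[0], "intercept c:", lin.intercept_)
print("predicted price for 1500 sq ft:", lin.predict([[1500]])[0])

# --- Logistic regression: predict a binary target, e.g. churn yes/no ---
X_cls = rng.normal(size=(200, 2))                    # two synthetic features
y_cls = (X_cls[:, 0] + X_cls[:, 1] > 0).astype(int)  # synthetic 0/1 labels

log_reg = LogisticRegression().fit(X_cls, y_cls)
print("churn probability:", log_reg.predict_proba([[0.5, 0.2]])[0, 1])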

K-Nearest Neighbors (K-NN) is a popular machine learning algorithm used in data science for classification and regression analysis. It works by finding the k nearest data points to a given observation and assigning the observation to the most common class (classification) or the average value of its neighbors (regression). This section looks at K-NN algorithms in depth, including how they work, selection of the k value, distance metrics, handling categorical features, advantages, disadvantages, and applications.

Choosing the Right K Value: Choosing the right k value is crucial to getting
good results with K-NN algorithms. If k is too small, the algorithm will be
sensitive to noise and outliers. If k is too large, the algorithm will oversimplify
the problem and may miss important details. The value of k depends on the
size of the dataset and the complexity of the problem. A common approach is
to try different k values and choose the one that gives the best results. In some
cases, cross-validation can be used to estimate the optimal k value.
Distance Metrics in K-NN Algorithms: The distance metric used to evaluate
the similarity between data points is an essential component of K-NN
algorithms. Euclidean distance is a common choice, but other metrics such as
Manhattan distance, Minkowski distance, and cosine similarity can be used
depending on the type of data and the problem being addressed. Some metrics
are sensitive to the scale and units of the data, while others are not. Choosing
an appropriate distance metric is critical for obtaining accurate results with K-
NN algorithms
Handling Categorical Features in K-NN Algorithms: K-NN algorithms can
handle both continuous and categorical features, but categorical features
require special handling. One-hot encoding is a common approach, where
each category is converted into a binary variable. This technique allows the
distance metric to evaluate the similarity between the categories. Another
approach is to use a distance metric specifically designed for categorical data,
such as Gower distance and categorical cosine similarity.
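A hedged sketch of K-NN classification with a one-hot-encoded categorical feature, assuming scikit-learn is available; the dataset and column meanings are made up for illustration (in practice the numeric column should also be scaled so it does not dominate the distance):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

# Toy data: [age] plus a categorical blood group, predicting a 0/1 label (illustrative only)
ages = np.array([[25], [47], [52], [33], [61], [40]])
blood_groups = np.array([["A"], ["B"], ["A"], ["O"], ["B"], ["O"]])
labels = np.array([0, 1, 1, 0, 1, 0])

# One-hot encode the categorical column so the distance metric can compare categories
encoder = OneHotEncoder()
blood_encoded = encoder.fit_transform(blood_groups).toarray()
X = np.hstack([ages, blood_encoded])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, labels)

new_patient = np.hstack([[[45]], encoder.transform([["A"]]).toarray()])
print(knn.predict(new_patient))   # predicted class for the new observation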
Advantages:
 Simple and intuitive
 Non-parametric: no assumptions about the distribution of the data
 Can handle both categorical and continuous data
 Can be used for both classification and regression tasks
Disadvantages:
 Requires a large amount of memory to store the training data
 Computationally expensive for large datasets
 Sensitive to the choice of k value, distance metric, and data preprocessing
 Not suitable for high-dimensional data
Applications of K-NN Algorithms: K-NN algorithms have a wide range of
applications in various fields, including healthcare, finance, engineering, and
social sciences. Some common applications include:

 Medical Diagnosis: K-NN algorithms can be used to classify patients based on their symptoms and medical history. This can help doctors to diagnose diseases and recommend appropriate treatments.
 Credit Scoring: K-NN algorithms can be used to classify loan applicants based on their financial history and credit rating. This can help banks and other financial institutions to make loan decisions.
 Image Recognition: K-NN algorithms can be used to classify images based on their features, such as pixel intensity and texture. This can help in tasks such as face recognition and object detection.
 Sentiment Analysis: K-NN algorithms can be used to classify text data based on sentiment, such as positive, negative, or neutral. This can help in tasks such as customer feedback analysis and social media monitoring.
The k-means algorithm is a popular unsupervised clustering algorithm used
in data science to group unlabeled data points into a specific number of
clusters (k). It is an iterative algorithm that aims to minimize the within-
cluster variance, meaning it tries to find clusters where data points within a
cluster are similar to each other and dissimilar to data points in other clusters.
The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, then go to Step-4; otherwise, go to FINISH.
Step-7: The model is ready.
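The steps above map almost line for line onto a minimal from-scratch implementation; the Python sketch below (assuming numpy) is illustrative and omits practical details such as handling empty clusters or multiple random restarts:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop once the reassignments no longer move the centroids
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of points for illustration
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)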

Advantages of K-means:
 Simple and easy to understand and implement.
 Fast and efficient for large datasets.
 Works well with numerical data.
Disadvantages of K-means:
 Sensitive to the initial choice of centroids.
 Does not work well with non-spherical clusters.
 Requires the number of clusters (k) to be predefined.
Limitations of the k-means algorithm:
Assumes Spherical Clusters: K-means assumes that the clusters have a
spherical shape with equal variance.
Sensitive to Initial Placement: The algorithm's performance can be
influenced by the initial placement of cluster centers.
Requires Predefined Number of Clusters: The number of clusters (k) needs
to be specified in advance
Applications of the k-means algorithm:
 Customer Segmentation: Identify customer groups with similar behavior and preferences for targeted marketing campaigns.
 Image Compression: Reduce the size of an image by grouping similar pixels together, preserving visual quality.
 Anomaly Detection: Flag unusual patterns or outliers in data by identifying clusters with few data points.
