ADBMS
Q] Explain DBMS
Before DBMSs, data was organized in simple file formats (like text files
or spreadsheets). But these old methods had many problems. A DBMS
was created to overcome these problems, and here’s why it’s important
to learn about DBMS:
1. Real-world Entities: A DBMS models real-world entities (such as students or employees) along with their attributes and behaviour.
2. Relation-based Tables: Data is organized into tables that relate entities to their attributes, which makes the database structure easy to understand.
3. Isolation of Data and Application:
The database is separate from the applications that use the data.
The database is a passive entity that stores data, and the application is an active entity that interacts with the database.
This separation makes it easier to manage and update data without affecting the applications using it.
4. Less Redundancy: A DBMS follows normalization rules that split data into smaller pieces and remove duplication, so the same information isn't stored in several places.
5. Consistency:
Consistency means ensuring that the data is always accurate and follows
the rules.
DBMSs help maintain data integrity by ensuring that the data stays
consistent even when changes are made, so there are no contradictions
or errors in the data.
6. Query Language:
A DBMS uses a special query language (like SQL) to ask for data and
manipulate it.
For example, with a simple query, you can ask the database to show all
the students who have a grade above 80.
This is much more powerful and efficient than the old methods, where
you had to search through files manually.
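For instance, a minimal SQL sketch (assuming a hypothetical students table with name and grade columns):
SELECT name, grade
FROM students
WHERE grade > 80; -- only students with a grade above 80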
Q] Explain Different Types of Databases
Types of Databases :
Databases come in different types, depending on how they store and
manage data. Here are some common types:
1) Centralized Database :
What it is: All the data is stored in one central system, and users can
access it from different locations using applications.
Advantages:
Easier to manage and secure: Because everything is kept in one place, the data is simpler to maintain, back up, and protect.
Disadvantages:
Slower response times: As the database grows, fetching data can take
longer.
Hard to update: Making changes can be more complex.
Risk of data loss: If the central server fails, all data could be lost.
2) Distributed Database :
What it is: The data is spread across multiple systems (computers), but
users can still access it easily through connections between those
systems.
Types:
Homogeneous DDB: All systems use the same operating system and
software.
Heterogeneous DDB: Systems use different operating systems or
software.
Advantages:
Reliability: If one site fails, the other sites can keep working.
Modular growth: New sites can be added without disturbing the existing ones.
3) Relational Database :
What it is: This is the most common type, where data is stored in tables
(rows and columns). It uses SQL (Structured Query Language) to manage
the data.
4) NoSQL Database :
What it is: A newer type of database designed to store large amounts of
unstructured or semi-structured data. It doesn’t use tables like relational
databases.
Example: MongoDB.
Types: Document databases (like MongoDB), key-value stores, column-oriented databases, and graph databases.
5) Cloud Database :
What it is: A database that runs on a cloud platform (remote servers)
instead of a local server.
Advantages:
Scalability: Storage and computing power can grow on demand.
Lower upfront cost: There is no need to buy and maintain your own servers.
6) Object-oriented Database :
What it is: Uses an object-oriented approach (like programming) to store
data as objects (which are instances of classes).
Example: Realm, ObjectBox.
7) Hierarchical Database
What it is: Organizes data in a tree-like structure, where each record has
a single parent and can have many children.
Advantages:
Simple and fast to navigate when the data naturally forms a one-to-many hierarchy (like folders and files).
8) Network Database :
What it is: Data is stored in a network structure where records (nodes)
are connected by links.
Advantages:
More flexible than hierarchical databases because each record can have
multiple parents.
9) Personal Database :
What it is: A database designed for a single user, usually for personal use,
like tracking personal data.
Advantages:
Simple to use.
Small and doesn’t require much storage.
Types of Distributed Databases :
1) Homogeneous Distributed Database :
What it is: All the sites in a homogeneous distributed database use the
same DBMS (database management system) and operating system. It
means all the systems involved are similar and work together smoothly.
Properties:
Same software: All sites use the same database software.
Identical DBMS: Every site uses the same version of the DBMS from
the same vendor (e.g., all sites using Oracle).
Cooperation: Each site knows about the other sites, and they all
work together to process user requests.
Single interface: If it's a single database, it can be accessed through a
single user interface.
2) Heterogeneous Distributed Database :
Properties:
Different software: Each site might use different database software
(e.g., one site might use MySQL, another might use Oracle, etc.).
Different schemas: The way data is organized can vary between sites,
which makes it harder to query and process.
Complex query and transaction processing: Since the sites use
different systems and data models, it’s more difficult to run queries
or process transactions across them.
Limited cooperation: Each site may not know what the other sites
are doing, so the sites don’t work as closely together as in
homogeneous systems.
While they have many advantages, distributed databases also come with
some challenges:
1. Complexity: Setting up and managing a distributed system is more
complex than managing a single database.
2. Cost: More resources and technology are needed to manage multiple
sites.
3. Security: Managing security across multiple sites is harder, as you
have to ensure each site is protected.
4. Integrity Control: Ensuring data remains consistent and correct across
all sites can be difficult.
5. Lack of Standards: Different systems might use different standards for
data storage and communication, making integration harder.
6. Lack of Experience: Distributed databases are complex, so they may
require specialized knowledge to manage.
7. Database Design Complexity: Designing a database that works
efficiently across multiple sites is more complicated than designing a
single-site database.
1) Client-Server Architecture :
Client: The client interacts with the user. It provides the interface that users use to access the database, such as forms, reports, or dashboards. Clients are responsible for sending requests (queries) to the server and presenting the returned results to the user.
Single Server, Multiple Clients:
A single server serves many clients and handles requests from multiple clients at the same time.
Example: A university database where many students (clients) send
requests to a single server that manages their records.
Multiple Servers, Multiple Clients:
There are multiple servers and multiple clients. Each client can connect to any of the servers, and the servers work together to handle requests from the clients.
Example: A large e-commerce site where users (clients) can connect to
different servers that manage inventory, orders, and payments.
3) Multi-DBMS Architecture :
Key Features:
Autonomous Systems: Each database system operates independently
but is linked together to form a larger system.
Data Integration: Even though the systems are autonomous, they work
together as one unified system for the users.
Data Distribution Alternatives :
1. Non-Replicated :
The database is kept at a single site, and all other sites access that one copy over the network.
2. Fully Replicated :
In this approach, every site has a complete copy of the database.
This allows faster access to data because any site can provide all the
data needed.
However, updating data becomes costly and complex because any
change must be made on every copy.
Example: In a company, every office location stores a full copy of the
employee database, so they can access it quickly. But if someone
changes a phone number, it needs to be updated everywhere.
3. Partially Replicated :
Here, only some tables or parts of the database are copied to different
sites.
This replication is based on how often certain data is accessed. The more
often data is used, the more copies are made.
This helps reduce unnecessary data replication and saves storage.
Example: A retail company may replicate product inventory data in
branches where the most popular products are sold more frequently,
but not replicate every product.
4. Fragmented :
In this case, a table is divided into smaller pieces, called fragments, and
each fragment is stored at different sites.
This helps in parallel processing and improves disaster recovery.
There are three types of fragmentation:
Vertical Fragmentation (dividing columns)
Horizontal Fragmentation (dividing rows)
Hybrid Fragmentation (a mix of both)
Example: A hospital database may split patient records (rows) into
separate fragments based on department (e.g., cardiology, neurology)
and store them at the relevant departments.
5. Mixed Distribution :
This is a combination of fragmented and partially replicated data.
Some parts of the data are fragmented, and then the fragments are
replicated at sites based on how often they are accessed.
This approach balances the need for distribution and replication.
Example: A global company may fragment sales data by region, and then
replicate frequently accessed sales data across multiple regional offices.
Data Replication :
Data replication means storing copies of the same data at multiple sites of a distributed database.
Advantages of Replication:
Reliability: If one site fails, others still have copies of the data.
Reduced Network Load: With copies of the data at different locations,
requests don’t need to go over the network to distant servers.
Faster Response: Users can access data from nearby sites, reducing wait
times.
Simpler Transactions: Transactions (like data updates) can be simpler
when data is available at multiple places.
Disadvantages of Replication:
Increased Storage: Storing multiple copies requires more storage space.
Cost and Complexity: Keeping copies synchronized and managing
updates can be complex and expensive.
Tight Coupling: Changes in the database structure may cause issues
across applications using different copies.
Fragmentation :
Fragmentation is the process of dividing a table into smaller, more
manageable pieces called fragments. Fragmentation is done to improve
performance, security, and reliability. There are three types of
fragmentation:
Types of Fragmentation:
1. Vertical Fragmentation :
In vertical fragmentation, columns of a table are divided into different
fragments.
Each fragment contains a set of columns from the original table.
It is useful when you want to protect sensitive data or when different
users need access to different columns of data.
Example: A university table with student information (name, address,
grades, etc.) can be split, with the grades column stored separately for
privacy.
2. Horizontal Fragmentation :
In horizontal fragmentation, rows of a table are divided into different fragments, usually based on a condition (for example, students grouped by their city). Every fragment keeps all the columns of the original table.
3. Hybrid Fragmentation:
Hybrid fragmentation combines both approaches: the table is first fragmented one way (say, horizontally) and the resulting fragments are fragmented again the other way. A small SQL sketch of the two basic kinds follows.
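A hedged sketch of the two basic kinds in SQL, assuming a hypothetical student table with a city column:
-- Vertical fragmentation: split certain columns into their own table
CREATE TABLE student_grades AS
SELECT student_id, grades
FROM student;
-- Horizontal fragmentation: split rows by a condition
CREATE TABLE student_vengurla AS
SELECT *
FROM student
WHERE city = 'Vengurla';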
Once we’ve created these ADTs, we can use them in our tables and query the attributes stored inside them.
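For example, a minimal Oracle-style sketch, assuming a hypothetical address_type ADT that has a city attribute:
CREATE TABLE student
(
Student_id NUMBER(5),
Name VARCHAR2(30),
Address address_type -- column whose type is the ADT
);

SELECT s.Name, s.Address.city
FROM student s
WHERE s.Address.city = 'Vengurla'; -- the alias is needed to reach the ADT's attributes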
This query fetches the details of all the students belonging to the city Vengurla.
A database object is any item or entity that is created and used to store, manage, or refer to data. Common database objects include:
Tables
Views
Sequences
Indexes
Synonyms
1. Table :
A table is the most basic and important database object. It is where the
actual data is stored in the database. A table is made up of rows (also
called records) and columns (also called fields or attributes). Each
column in a table has a specific data type (e.g., numbers, text, dates).
Ex.
CREATE TABLE dept
(
Deptno NUMBER(2),
Dname VARCHAR2(20),
Location VARCHAR2(20)
);
2. View :
A view is a virtual table. It doesn't actually store data on its own but
rather shows data from one or more tables based on a query. You can
think of a view as a window through which you can see and sometimes
change data.
Ex.
CREATE VIEW student_vu AS
SELECT student_id, last_name, salary
FROM Student
WHERE department_id = 111;
This creates a view called student_vu (VIEW itself is a reserved word, so it cannot be used as the name) that shows the student ID, last name, and salary for students in department 111. You can use the view just like a table in queries.
3. Sequence :
A sequence is used to automatically generate unique numbers, often for
use as primary keys (unique identifiers) for records in a table. This helps
ensure that each record in a table has a unique identifier.
Ex.
CREATE SEQUENCE dept_deptid_seq
INCREMENT BY 10
START WITH 120
MAXVALUE 9999;
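A brief usage sketch: the sequence's NEXTVAL supplies the next unique number, for example when inserting a row (the departments table here is hypothetical):
INSERT INTO departments (department_id, department_name)
VALUES (dept_deptid_seq.NEXTVAL, 'Support'); -- generates 120, then 130, 140, ...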
4. Index :
An index is like a shortcut that helps speed up data retrieval. It improves
the performance of queries, especially when you are searching for rows
based on specific columns. Indexes are created on one or more columns
of a table.
Ex.
CREATE INDEX emp_last_name_idx
ON employees(last_name);
5. Synonym :
A synonym is an alternative name (an alias) for another database object, such as a table or a view. It lets you refer to the object by a shorter or simpler name.
Ex.
CREATE SYNONYM d_sum FOR dept_sum_vu;
This creates a synonym d_sum for the view dept_sum_vu. Now, instead
of referring to dept_sum_vu, you can simply use d_sum in your queries.
Dimensional Modelling is a way to organize data so that it’s easy and fast
to look up information in a data warehouse (a big storage system for
business data). It is useful when businesses need to analyze things like
sales, profits, or customer data.
Advantages :
Simple Design:
The structure of the database is easy to understand, so business users
don’t need special training.
Disadvantages :
Hard to Combine Data:
Adding data from different sources (like sales software and inventory
software) is difficult.
Not Flexible:
If the way the business works changes, updating the data warehouse can
be hard.
Elements of Dimensional Modelling :
Facts:
Facts are the numbers or measurements you care about.
Example: Total sales, number of products sold, or revenue.
Dimensions:
Dimensions are details that describe facts, like who, what, or where.
Example:
Who: Customer name
What: Product name
Where: Store location
Attributes:
Attributes are extra details about dimensions.
Example:
For a location dimension, the attributes might be:
State
Country
Zipcode
Fact Table:
This table stores the numbers (facts) and links to dimensions.
Example: A sales fact table might include:
Total sales amount
A link to the customer, product, and location.
Dimension Table:
This table stores details (attributes) about dimensions.
Example: A product dimension table might include:
Product name
Brand
Category
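A minimal star-schema sketch in SQL (all table and column names here are illustrative):
CREATE TABLE product_dim
(
product_id NUMBER PRIMARY KEY,
product_name VARCHAR2(50),
brand VARCHAR2(30),
category VARCHAR2(30)
);

CREATE TABLE sales_fact
(
product_id NUMBER REFERENCES product_dim(product_id), -- link to the dimension
sale_date DATE,
total_sales NUMBER -- the fact (measure)
);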
Data Warehouse Architectures :
1. Single-Tier Architecture :
What it is: Every component sits in a single layer; the goal is to store as little data as possible by removing redundancy. It is rarely used in practice.
2. Two-Tier Architecture :
What it is: The process is split into different stages to make
things more organized and efficient.
Steps:
Source Layer: Data comes from many sources (like databases or
files).
Data Staging Layer: The data is cleaned and transformed so it
can be used.
Data Warehouse Layer: The cleaned data is stored in a central
place.
Analysis Layer: This is where data is analyzed, reports are
created, and business decisions are made.
3. Three-Tier Architecture :
What it is: A more complex setup with an extra step that helps
standardize and clean the data even more before storing it in
the data warehouse.
Steps:
Source Layer: Collects data from multiple sources.
Reconciled Layer: Cleans and integrates the data to make sure
everything is consistent.
Data Warehouse Layer: Stores the cleaned data, and smaller
databases (called data marts) may also be created for specific
departments.
Advantages:
Great for large companies with lots of data from many sources.
Helps the business see all their data in one place.
Disadvantages:
It uses more storage space because of the extra layer.
ETL Process :
ETL stands for Extraction, Transformation, and Loading, the three steps that move data from source systems into the data warehouse:
1. Extraction :
What it is: In this first step, data is pulled out from different
sources (like databases, files, applications) and moved to an
area called the staging area.
Why it's important: The data may come in different formats, so
it can't be directly loaded into the data warehouse. The staging
area acts like a waiting room where data is temporarily stored
and cleaned before it’s transformed.
Challenges: Some data might be unstructured (like emails or
web pages), making it harder to extract. The right tools are
needed to handle this.
2. Transformation :
What it is: In this step, the extracted data is cleaned, changed,
and converted into a consistent format that fits the needs of
the data warehouse.
How it's done:
Filtering: Removing unnecessary data or only selecting specific
parts of the data.
Cleaning: Fixing issues like missing values (e.g., replacing blanks
with default values), or standardizing terms (e.g., changing
"United States" to "USA").
Joining: Combining data from different sources into one piece
of data.
Splitting: Breaking one piece of data into smaller pieces if
needed.
Sorting: Organizing data in a particular order, often by key
attributes (e.g., date or location).
This step makes sure that the data is high-quality and
standardized, which helps with consistent analysis.
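A hedged sketch of such a transformation in SQL, assuming hypothetical staging_customers and warehouse_customers tables:
INSERT INTO warehouse_customers (customer_id, country)
SELECT customer_id,
       CASE WHEN country = 'United States' THEN 'USA' -- cleaning: standardize terms
            ELSE country
       END
FROM staging_customers
WHERE customer_id IS NOT NULL; -- filtering: drop incomplete rows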
3. Loading :
What it is: In this final step, the transformed data is moved into
the data warehouse.
How it’s done: Data can be loaded into the warehouse all at
once or in small batches, depending on what the business
needs.
How often: Data can be loaded regularly or less often, based on
the requirements of the organization.
Pipelining: Sometimes, the process doesn’t wait for one step to
finish before starting another. For example, as soon as some
data is extracted, it can be transformed while new data is
extracted, and while transformed data is being loaded into the
warehouse, more can be processed.
Types of Data Marts :
1. Dependent Data Marts:
What it is: A dependent data mart is created from a data warehouse. It’s
like a "subsection" of the warehouse.
How it works: First, a data warehouse is built, and then data marts are
created by extracting data from it. The data mart relies on the data
warehouse for its data.
Method: This is called a top-down approach, meaning the data
warehouse is built first.
2. Independent Data Marts:
What it is: In this case, data marts are built independently, and later,
they are integrated to form a data warehouse.
How it works: Multiple data marts are created first, each serving a
specific purpose. Afterward, the data from all these marts is combined
to create a data warehouse.
Method: This is called a bottom-up approach, as data marts are created
first, and they come together to form a warehouse.
3. Hybrid Data Marts:
What it is: A hybrid data mart combines data from different sources, not
just from a data warehouse.
Why it's used: It’s useful when businesses need to integrate data from
various sources, especially when new groups or products are added to
the organization.
Q] What is OLAP ?
OLAP (Online Analytical Processing) is a technology that lets users analyze large volumes of data from multiple perspectives (dimensions), typically on top of a data warehouse. The main OLAP operations are:
1. Roll-up:
This is like grouping data together. For example, if you have sales for
individual cities, you can "roll them up" to see the total sales for the
whole region.
It’s like going from more details (cities) to less details (regions).
2. Drill-down:
This is the opposite of roll-up. It means breaking down the data into
smaller details. For example, if you see total sales for a quarter, you
can "drill down" to see sales for each month.
It’s like zooming in to see more details.
3. Slice:
This means picking one part of the data to focus on. For example,
you could choose just data from Q1 (the first quarter of the year)
and create a smaller dataset just for that time period.
4. Dice:
This is like slice, but instead of just one part, you pick two or more
parts of the data. For example, you could focus on both Q1 and Q2
sales in a specific region.
5. Pivot:
This means rotating the data view, for example swapping rows and columns, so the same numbers can be examined from a different perspective. (A small SQL sketch of some of these operations follows this list.)
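A hedged SQL sketch, assuming a hypothetical sales table with region, city, quarter, and amount columns:
-- Roll-up: city-level rows are grouped up to region and grand totals
SELECT region, city, SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP (region, city);

-- Slice: keep only Q1
SELECT * FROM sales WHERE quarter = 'Q1';

-- Dice: restrict two dimensions at once
SELECT * FROM sales WHERE quarter IN ('Q1', 'Q2') AND region = 'West';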
Characteristics of OLAP :
FASMI Characteristics:
Fast :
OLAP systems give quick results. Simple queries should take only
seconds to process.
Analysis :
OLAP can handle complex business analysis, like forecasting or
comparing different sets of data.
Shared :
It allows multiple people to use it at the same time, and it makes sure
the data is safe.
Multidimensional :
OLAP shows data in multiple ways, like by time, region, or product,
making it easier to analyze.
Information :
OLAP is good at handling large amounts of data, and it deals with
missing or incomplete data correctly.
1. Retail:
Retailers use data mining to understand what products are popular, how
much people are willing to pay, and what kind of promotions will attract
customers.
It helps companies design better sales campaigns, like discounts or
loyalty bonuses, and measure how well these campaigns work to
increase profits.
2. Finance:
Banks and financial institutions use data mining to predict stock prices,
assess risks for loans, and find new customers.
Credit card companies use data mining to track spending habits and
detect fraudulent purchases by spotting unusual patterns in customer
transactions.
4. Social Media:
Social media platforms use data mining to understand users' interests and behaviour, recommend content, and target advertisements.
The KDD (Knowledge Discovery in Databases) process turns raw data into useful knowledge through the following steps:
Data Selection:
What’s this step about? You choose the data that’s important for
solving the problem you’re working on.
Example: If you’re looking at sales in a store, you would only pick
data related to customer purchases, not stuff like employee records
unless you need them.
Data Preprocessing:
What’s this step about? Data isn’t always perfect. Sometimes it has
errors, missing information, or other issues. So, you clean and
organize the data so it’s ready for analysis.
Example: If some customer records are missing their ages, you might
fill in the missing values or remove those records.
Data Transformation:
What’s this step about? You change the data to make it easier to
work with.
Example: You might group ages into categories like "young,"
"middle-aged," and "old" to make the data easier to analyze.
Data Mining:
What’s this step about? This is the core of the KDD process. You
apply algorithms (computer programs) to look for patterns or
relationships in the data. It's like digging for treasure using special
tools.
Example: You might use a machine learning program to predict
whether a customer will buy something based on their past
purchases. Or you might find that people who buy milk also tend to
buy bread.
Evaluation/Interpretation:
What’s this step about? After finding patterns, you need to check if
they make sense and are actually useful. You also need to
understand what these patterns mean in real life.
Example: If you find that younger customers tend to buy more
gadgets, you check if this pattern holds true in different regions, or if
it’s just a coincidence.
Knowledge Presentation:
What’s this step about? Once you’ve found useful patterns, you
share the results in an easy-to-understand way, usually with charts
or reports.
Example: You might show the marketing team a graph that shows
which age groups are more likely to buy a product, helping them
decide who to target for their next campaign.
In the real world, data comes from many different sources, like sensors,
websites, databases, and more. But this data often has problems. It can
be incomplete, noisy (contains errors), or inconsistent. If we don’t fix
these problems before analyzing the data, it could lead to wrong
conclusions, biased results, or bad decisions. This is where data
preprocessing comes in.
When we collect data, it often has the following issues that need to be
fixed:
1. Incomplete Data:
Incomplete data means some values are missing, for example a customer record without an age or address. Missing values have to be filled in, or the affected records removed, before analysis.
2. Noisy Data:
Noisy data means the data has errors, or contains extreme values
that are very different from the rest of the data, known as outliers.
Outliers are data points that don't fit the general pattern and might
be due to errors in measurement, broken sensors, or software bugs.
If these errors aren’t fixed, they can mess up analysis and predictions.
Example: If most people in a dataset are between 20 and 50 years
old, but one data point shows someone is 200 years old, that's an
outlier and should be fixed.
3. Inconsistent Data:
Inconsistent data contains contradictions or mismatched formats, for example the same city recorded as both "NY" and "New York", or an age that doesn't match the recorded date of birth.
The following techniques are commonly used to smooth noisy data:
1. Binning :
Binning is a way of grouping similar data values into bins or buckets and
smoothing the data within those bins.
For example, you have a list of numbers: [4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34].
You divide these numbers into groups (bins):
Bin 1: [4, 8, 9, 15]
Bin 2: [21, 21, 24, 25]
Bin 3: [26, 28, 29, 34]
Then, you "smooth" the values in each bin, replacing all numbers in the
bin with a single value like:
Mean (average) of the numbers,
Median (middle value) of the numbers,
Closest boundary (the lowest or highest value in the bin).
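For instance, smoothing by bin means replaces every value with its bin's average:
Bin 1: mean of [4, 8, 9, 15] is 9, so the bin becomes [9, 9, 9, 9]
Bin 2: mean of [21, 21, 24, 25] is 22.75, so the bin becomes [22.75, 22.75, 22.75, 22.75]
Bin 3: mean of [26, 28, 29, 34] is 29.25, so the bin becomes [29.25, 29.25, 29.25, 29.25]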
2. Regression :
Regression is a method used to find patterns or relationships in data. It's
like drawing a line or curve through data points to see how they are
connected.
3. Clustering :
Clustering is a technique where we group similar data points together
into clusters (like forming teams based on similar interests).
Outliers: If there are data points that don’t belong to any group and
are far away from the others, these are called outliers (similar to odd
ones out in a group).
How Clustering Works: Clustering looks for groups of data points that
are similar to each other. It then puts those similar points in the
same group (or cluster). Data points that are very different from the
rest get placed in a separate group, and we might treat them as
noise or errors.
Data reduction is about making a large amount of data smaller and more
manageable, while keeping the important information.
1. Data Compression :
Encoding the data so that it occupies less space, either losslessly (nothing is lost) or lossily (minor detail may be lost).
2. Dimensionality Reduction :
Removing attributes (columns) that add little information, so each record is described by fewer features.
3. Data Sampling :
Working with a representative subset of the records instead of the entire dataset.
5. Discretization :
Replacing continuous values with a small number of intervals or labels (for example, turning exact ages into "young", "middle-aged", and "old").
7. Data Filtering :
Removing records that are irrelevant to the analysis at hand.
1. Simple Random Sampling Without Replacement (SRSWOR):
What it is: You randomly pick a certain number of data points (or tuples) from the dataset, but once a data point is selected, it cannot be chosen again.
How it works: Imagine you have 100 data points, and you need to select
10 of them. Every data point has an equal chance (1 in 100) of being
selected, but once a data point is chosen, it’s out of the pool for future
picks.
Example: If you're selecting 5 students from a class of 30, each student
has an equal chance of being selected, but once a student is chosen,
they can't be selected again.
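A hedged PostgreSQL-style sketch, assuming a hypothetical students table: shuffling the rows and keeping the first few gives a sample without replacement:
SELECT *
FROM students
ORDER BY random() -- shuffle the rows
LIMIT 5;          -- each student can appear at most once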
2. Simple Random Sampling With Replacement (SRSWR):
What it is: Like the previous method, but this time, once a data point is selected, it is put back into the dataset. This means it can be selected again.
How it works: If you have 100 data points and you need to select 10,
after each pick, you put the data point back, so it could be picked again.
Example: If you randomly select a student from the class, they are put
back in the pool, meaning the same student could be chosen again.
3. Cluster Sampling:
What it is: Imagine you have a big dataset, and you want to make it
smaller and easier to handle. Instead of picking data randomly from the
entire dataset, you divide it into smaller groups, called clusters. Then,
you randomly choose some of these clusters and use all the data from
the selected clusters.
How it works:
First, you divide your data into groups (clusters).
Then, you randomly pick a few of these clusters.
Finally, you use all the data from the chosen clusters.
4. Stratified Sampling:
What it is: In stratified sampling, you divide your data into smaller,
distinct groups called strata. These groups are based on something
important, like age, gender, or department. Then, you randomly pick
data points from each group to make sure all groups are fairly
represented.
How it works:
First, you divide your dataset into different groups (called strata). Each
group shares a common characteristic.
Then, you randomly select data from each group to make sure every
group is included in the final sample.
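A hedged PostgreSQL-style sketch, assuming a hypothetical students table with a department column as the stratum:
SELECT *
FROM (
  SELECT s.*,
         ROW_NUMBER() OVER (PARTITION BY department ORDER BY random()) AS rn
  FROM students s
) t
WHERE rn <= 2; -- two randomly chosen students from every department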
In some retail stores, data showed that when men bought diapers, they
were also likely to buy beer.
This is an unexpected and interesting relationship! It’s not something
you would have guessed just by looking at what people usually buy.
This kind of insight helps stores design better product placements and
promotions. If the store knows that diapers and beer are often bought
together, they might place the two items near each other in the store to
encourage more sales.
1. Data Collection:
Whenever a customer buys items, the store records what was purchased.
This could be from a transactional database, which stores all customer
purchases.
Nowadays, most products come with a barcode, which makes it easier
for stores to track exactly what customers buy.
2. Finding Patterns:
Once we know which items are bought together often, we can use that
information to:
Promote combos (like coffee and sugar together).
Change the store layout to place related items near each other, making
it easier for customers to buy them together.
Create special sales or discounts for combinations of products.
How the Apriori Algorithm Works :
1. Find Frequent Itemsets:
First, the algorithm looks at individual items (like bread, milk, butter) and
sees how often they appear together in the data.
Then, it combines those items into pairs (like bread and butter), then
triples (bread, butter, milk), and so on.
It keeps only the itemsets (like pairs or triples) that appear often enough
(this is called minimum support).
Step 1: Look at single items and count how often they show up in the
data.
Step 2: Find pairs of items (like bread and butter) that appear together
often enough.
Step 3: Then, find triples (bread, butter, milk) and keep repeating until
no more frequent sets are found.
2. Create Rules:
After finding frequent items or itemsets, the algorithm makes rules. For
example, if people buy bread, they are likely to also buy butter. This is
called an association rule.
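A hedged SQL sketch of the counting idea, assuming a hypothetical purchases table with transaction_id and item columns: a self-join counts how often two items appear in the same transaction:
SELECT a.item AS item1,
       b.item AS item2,
       COUNT(*) AS support_count
FROM purchases a
JOIN purchases b
  ON a.transaction_id = b.transaction_id
 AND a.item < b.item            -- count each pair only once
GROUP BY a.item, b.item
HAVING COUNT(*) >= 50;          -- keep pairs that meet the minimum support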
Advantages:
Easy to understand and use.
Helps businesses make better decisions on product placement and
promotions.
Disadvantages:
It can be slow if you have a lot of data because it checks many
combinations.
It needs to look through the data multiple times.
Example:
Imagine a store selling food items like bread, butter, milk, and coffee.
The Apriori algorithm might find that customers who buy bread and butter often buy milk as well, so the store could shelve or promote the three items together.
1. Training (Learning):
First, we give the model some training data (examples with known labels) to learn from. This step is like teaching the model what the categories are, so it understands how to classify new data.
2. Prediction (Classifying):
The trained model is then used to assign labels (categories) to new, unseen data.
Discrete: Data that can take only separate, countable values (like whole numbers).
Example: Number of students (1, 2, 3).
Continuous: Data that can have any value in a range (like decimal numbers).
Example: Height (50.5, 60.2, 70.3).
2. Weather Forecasting:
The weather changes over time based on factors like temperature,
humidity, and wind. By analyzing past weather data, we can predict
future weather conditions.
Advantages of Classification :
Cost-Effective: It can help businesses make decisions without
needing to spend a lot of resources.
Helps in Decision-Making: For example, banks can use classification
to predict if someone is a high risk for loan approval.
Crime Prediction: It can help identify potential criminal suspects by
analyzing patterns.
Medical Risk Prediction: It helps in identifying patients who are at
risk of certain diseases based on past data.
Disadvantages of Classification
Privacy Issues: When using personal data, there’s always a risk that
companies might misuse the data.
Accuracy Problems: The model might not always give perfect results,
especially if the data is not well-selected or the model is not the best
choice for the problem.
Q] Explain Regression .
Regression is a technique for predicting a continuous (numeric) value, such as a price or a sales figure, by modelling the relationship between a dependent variable and one or more independent variables.
3. Outliers: Outliers are extreme values (either very high or very low)
compared to the rest of the data. These can distort regression results
and should be handled carefully.
Types of Regression :
1. Linear Regression:
Simple Linear Regression: Involves one independent variable and
predicts a continuous dependent variable using a straight line.
Multiple Linear Regression: Uses more than one independent variable to
predict the dependent variable.
Example: A company may use linear regression to understand how
different marketing campaigns affect sales. If a company runs ads on TV,
radio, and online, linear regression can help understand the combined
impact of these ads.
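A hedged sketch: several databases (Oracle, PostgreSQL) ship least-squares aggregate functions, so a simple linear fit of sales against ad spend (hypothetical ads table) can be computed directly:
SELECT REGR_SLOPE(sales, ad_spend)     AS slope,     -- change in sales per unit of spend
       REGR_INTERCEPT(sales, ad_spend) AS intercept  -- predicted sales at zero spend
FROM ads;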
2. Logistic Regression:
Used when the dependent variable is binary (e.g., yes/no, true/false,
0/1).
Example: Logistic regression is used to predict whether an email is spam
or not spam, based on factors like the presence of certain keywords.
3. Polynomial Regression:
Used when the relationship between the independent and dependent
variables is curved rather than linear.
Example: Polynomial regression can help in predicting complex data
trends, like the relationship between age and income, where the curve
isn’t straight.
7. Ridge Regression:
A regularized version of linear regression that helps to handle
multicollinearity (when independent variables are highly correlated).
Adds a penalty term to the model to prevent it from becoming too
complex.
Example: Ridge regression can be used when predicting house prices
where many features like square footage, number of bedrooms, and
neighborhood are highly correlated.
8. Lasso Regression:
Another regularized version of linear regression that uses a penalty term
to shrink the coefficients of less important features to zero. This helps in
feature selection.
Example: Lasso regression can help predict sales while eliminating less
relevant marketing activities from the model.
Naive Bayes is a classification algorithm based on Bayes' Theorem, with the "naive" assumption that all features are independent of each other.
For example, Naive Bayes can help us predict whether or not a player
will play based on certain weather conditions (like Sunny, Rainy, or
Overcast).
Understanding Bayes' Theorem :
Before jumping into Naive Bayes, you need to understand Bayes' Theorem, which helps us calculate probabilities based on prior knowledge. Here's the formula for Bayes' Theorem:
P(A | B) = [ P(B | A) × P(A) ] / P(B)
where P(A | B) is the probability of A given that B has happened, P(B | A) is the probability of B given A, and P(A) and P(B) are the individual probabilities.
Dataset Example:
We have this dataset of weather conditions and whether or not a player
plays:
Outlook    | Play
-----------|-----
Rainy      | Yes
Sunny      | Yes
Overcast   | Yes
Overcast   | Yes
Sunny      | No
Rainy      | Yes
Sunny      | Yes
Overcast   | Yes
Rainy      | No
Sunny      | No
Sunny      | Yes
Rainy      | No
Overcast   | Yes
Overcast   | Yes
We need to predict if the player will play on a Sunny day.
Now, calculate the probabilities for each condition, i.e., the likelihood of
playing or not playing based on the weather.
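From the table: the player played on 10 of the 14 days and did not play on 4. Of the 5 Sunny days, the player played on 3 and did not play on 2. So:
P(Yes) = 10/14, P(Sunny | Yes) = 3/10, P(Sunny | No) = 2/4, P(Sunny) = 5/14
P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny) = (3/10 × 10/14) / (5/14) = 3/5 = 0.6
P(No | Sunny) = (2/4 × 4/14) / (5/14) = 2/5 = 0.4
Since 0.6 > 0.4, Naive Bayes predicts that the player will play on a Sunny day.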
Disadvantages:
Independence assumption: It assumes that all features are independent,
which isn't always true in real-world situations.
K-Nearest Neighbors (KNN) is a classification algorithm that labels a new data point by looking at the K closest points in the training data and taking a majority vote among them.
The Problem:
We have a dataset of fruits, and each fruit has two features: weight and
color (we'll use numbers for color). We want to classify a new fruit based
on its weight and color.
Here’s our dataset:
Now, let's say we have a new fruit with the following features:
Weight = 165g
Color = 1 (Red)
We want to classify this new fruit as either Apple or Orange using the
KNN algorithm.
We’ll use K = 3. This means we will look at the 3 nearest neighbors (the 3
closest fruits) and decide what the new fruit is based on them.
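A hedged PostgreSQL-style sketch, assuming a hypothetical fruits table with weight, color, and label columns:
SELECT label,
       POWER(weight - 165, 2) + POWER(color - 1, 2) AS dist -- squared Euclidean distance
FROM fruits
ORDER BY dist
LIMIT 3; -- the K = 3 nearest neighbours
The majority label among these three rows becomes the predicted class for the new fruit.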
Decision Tree Terminology :
Root Node: The starting point of the tree where all data is
considered.
Leaf Node: The end point of the tree, which gives the final decision
or outcome.
Splitting: The process of dividing data at each node based on certain
rules.
Branch/Sub-tree: The tree structure formed from splitting.
Pruning: Cutting off unnecessary branches to make the tree simpler
and more accurate.
Parent/Child Nodes: The parent node is the starting point (root), and
the child nodes are the branches that split off from it.
How Does the Decision Tree Work?
A Decision Tree learns from training data to make predictions. Starting at the root node, the algorithm chooses the attribute that best splits the data, divides the records into branches according to that attribute's values, and repeats the splitting on each branch until it reaches leaf nodes that give the final decision.
Data Used: Logs from web servers (recording user visits and
interactions) or browser logs are analyzed to understand how users
behave on the web.