
UNIT-2

#Graphs, maps, Map Searching


Graph theory is a mathematical field used to model relationships between objects. It plays a
central role in solving problems like map searching, route optimization, and network
analysis. Here's a detailed breakdown:

Graphs: Basics
A graph G = (V, E) consists of:
 Vertices (Nodes): Points in the graph (V).
 Edges: Connections between vertices (E).
o Directed: Edges have a direction (e.g., A → B).
o Undirected: Edges are bidirectional (e.g., A—B).
o Weighted: Edges have weights or costs (e.g., distance, time).
Types of Graphs
1. Undirected Graph: E contains pairs of vertices without direction.
2. Directed Graph (Digraph): E contains ordered pairs of vertices.
3. Weighted Graph: Each edge has an associated weight.
4. Tree: A connected graph with no cycles.
5. Cyclic/Acyclic Graph: Graphs with or without cycles.

Graph Representation
1. Adjacency Matrix:
o An n × n matrix where M[i][j] = 1 if there is an edge between i and j.
o Efficient for dense graphs.
2. Adjacency List:
o List of neighbors for each vertex.
o Space-efficient for sparse graphs.
3. Edge List:
o A list of all edges and their weights.
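As a small illustration of these representations, here is a sketch in Python (the vertex labels are made up for the example):

# Tiny undirected graph: A—B, A—C, B—C, C—D (example data only)
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
vertices = ["A", "B", "C", "D"]

# Adjacency list: each vertex maps to a list of its neighbours
adj_list = {v: [] for v in vertices}
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)

# Adjacency matrix: n x n grid with 1 wherever an edge exists
index = {v: i for i, v in enumerate(vertices)}
n = len(vertices)
adj_matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    adj_matrix[index[u]][index[v]] = 1
    adj_matrix[index[v]][index[u]] = 1

print(adj_list)     # {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B', 'D'], 'D': ['C']}
print(adj_matrix)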
Graph Algorithms
Graphs are used to solve map searching and navigation problems. Below are key algorithms:

1. Breadth-First Search (BFS)


 Use: Finds the shortest path in unweighted graphs.
 Approach:
o Starts at a source vertex and explores neighbors level by level.
o Uses a queue.
 Time Complexity: O(V + E).
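A minimal BFS sketch, assuming the graph is given as a dict of neighbour lists (as in the adjacency-list example above):

from collections import deque

def bfs_shortest_path(graph, source):
    # Returns the shortest distance (in edges) from source to every reachable vertex
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:            # first visit to v is along a shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist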

2. Depth-First Search (DFS)


 Use: Explores all paths from a source, used in cycle detection or topological sorting.
 Approach:
o Uses a stack or recursion to explore as deeply as possible before backtracking.
 Time Complexity: O(V + E).

3. Dijkstra’s Algorithm
 Use: Finds the shortest path in weighted graphs (non-negative weights).
 Approach:
o Maintains a priority queue to iteratively find the shortest distance.

 Time Complexity: O((V + E) log V) with a priority queue.
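A sketch of Dijkstra's algorithm using Python's heapq as the priority queue; the graph is assumed to be a dict mapping each vertex to a list of (neighbour, weight) pairs with non-negative weights:

import heapq

def dijkstra(graph, source):
    dist = {source: 0}                    # best known distance from source
    heap = [(0, source)]                  # (distance, vertex) entries
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale entry; a shorter path was already found
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist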

4. Bellman-Ford Algorithm
 Use: Finds the shortest path in graphs with negative weights.
 Approach:
o Iteratively relaxes all edges V − 1 times.
 Time Complexity: O(VE).

5. A* Search
 Use: Optimized pathfinding with a heuristic (e.g., maps).
 Approach:
o Combines Dijkstra’s algorithm with a heuristic function to guide the search.
 Key Formula: f(n) = g(n) + h(n)
o g(n): Cost from the start to the current node.
o h(n): Heuristic estimate of the cost to the goal.

 Time Complexity: Depends on the heuristic; typically O((V + E) log V).
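A minimal A* sketch in the same style, assuming the caller supplies a heuristic function h (for road maps this is often the straight-line distance to the goal):

import heapq

def a_star(graph, start, goal, h):
    g = {start: 0}                         # g(n): cost from start to n
    came_from = {}
    heap = [(h(start), start)]             # ordered by f(n) = g(n) + h(n)
    while heap:
        _, u = heapq.heappop(heap)
        if u == goal:
            path = [u]                     # reconstruct the path by walking backwards
            while u in came_from:
                u = came_from[u]
                path.append(u)
            return list(reversed(path))
        for v, w in graph[u]:
            tentative = g[u] + w
            if tentative < g.get(v, float("inf")):
                g[v] = tentative
                came_from[v] = u
                heapq.heappush(heap, (tentative + h(v), v))
    return None                            # goal unreachable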

6. Floyd-Warshall Algorithm
 Use: Finds shortest paths between all pairs of vertices.
 Approach:
o Uses dynamic programming to iteratively improve path estimates.
 Time Complexity: O(V³).

7. Minimum Spanning Tree (MST)


 Finds a tree connecting all vertices with the minimum total edge weight.
 Algorithms:

1. Prim’s Algorithm: Greedy algorithm using a priority queue; O((V + E) log V).
2. Kruskal’s Algorithm: Sorts edges and uses union-find; O(E log E).
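A short Kruskal sketch with a simple union-find structure; edges are assumed to be (weight, u, v) tuples:

def kruskal(vertices, edges):
    parent = {v: v for v in vertices}

    def find(x):                           # root of x's component, with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):          # consider edges in increasing weight order
        ru, rv = find(u), find(v)
        if ru != rv:                       # adding this edge creates no cycle
            parent[ru] = rv
            mst.append((u, v, w))
    return mst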

Graph Use in Map Searching


Graph-based algorithms are extensively used in map-related tasks such as route
optimization and navigation. Below are key applications:

1. Road Networks
 Modeled as weighted graphs where:
o Nodes represent intersections or locations.
o Edges represent roads, weighted by distance, time, or cost.
2. Shortest Path Problems
 Example: Google Maps uses Dijkstra’s or A* to find the fastest route.
 Challenges: Incorporating real-time traffic data.

3. Navigation Systems
 Graph Representation:
o Use adjacency lists for sparse road networks.
o Heuristic algorithms (e.g., A*) prioritize user-defined metrics like distance or
traffic.
 Tools:
o OpenStreetMap: Free geographic data.
o GraphHopper: Open-source routing engine.

4. Location-Based Services
 Geospatial Graphs:
o Nodes represent geolocations (latitude, longitude).
o Edges are weighted by metrics like travel time.

5. Delivery Optimization
 Traveling Salesman Problem (TSP):
o Find the shortest route visiting all locations exactly once.
o Solved using:
 Approximation algorithms (e.g., nearest neighbor).
 Genetic algorithms.

Visualization Tools
1. Gephi: For interactive graph visualization.
2. NetworkX (Python): Easy graph modeling and visualization.
3. Google Maps API: For integrating graph-based map searching.
4. Neo4j: A graph database for large-scale geographic queries.
What is Map Data Structure?
Map data structure (also known as a dictionary, associative array, or hash map) is defined
as a data structure that stores a collection of key-value pairs, where each key is associated
with a single value.
Maps provide an efficient way to store and retrieve data based on a unique identifier (the
key).
Need for Map Data Structure
Map data structures are important because they allow for efficient storage and retrieval of
key-value pairs. Maps provide the following benefits:
 Fast Lookup: Unordered maps allow for constant-time (O(1)) average-case lookup of
elements based on their unique keys.
 Efficient Insertion and Deletion: Maps support fast insertion and deletion of key-
value pairs, typically with logarithmic (O(log n)) or constant-time (O(1)) average-case
complexity.
 Unique Keys: Maps ensure that each key is unique, allowing for efficient association
of data with specific identifiers.
 Flexible Data Storage: Maps can store a wide variety of data types as both keys and
values, providing a flexible and versatile data storage solution.
 Intuitive Representation: The key-value pair structure of maps offers an intuitive way
to model and represent real-world data relationships.
Properties of Map Data Structure:
A map data structure possesses several key properties that make it a valuable tool for
various applications:
 Key-Value Association: Maps allow you to associate arbitrary values with unique
keys. This enables efficient data retrieval and manipulation based on keys.
 Unordered (except for specific implementations): In most map implementations,
elements are not stored in any specific order. This means that iteration over a map
will yield elements in an arbitrary order. However, some map implementations, such
as TreeMap in Java, maintain order based on keys.
 Dynamic Size: Maps can grow and shrink dynamically as you add or remove
elements. This flexibility allows them to adapt to changing data requirements
without the need for manual resizing.
 Efficient Lookup: Maps provide efficient lookup operations based on keys. You can
quickly find the value associated with a specific key using methods
like get() or [] with an average time complexity of O(1) for hash-based
implementations and O(log n) for tree-based implementations.
 Duplicate Key Handling: Most map implementations do not allow duplicate keys.
Attempting to insert a key that already exists will typically overwrite the existing
value associated with that key. However, some map implementations,
like multimap in C++, allow storing multiple values for the same key.
 Space Complexity: The space complexity of a map depends on its implementation, but
both hash-based and tree-based maps store one entry per element, giving a space
complexity of O(n), where n is the number of elements.
 Time Complexity: The time complexity of operations like insertion, deletion, and
lookup varies depending on the implementation. Hash-based maps typically have an
average time complexity of O(1) for these operations but can degrade to O(n) in the
worst case (e.g., with many collisions), while balanced tree-based maps guarantee
O(log n) even in the worst case, making them more predictable and reliable for
performance-critical applications.
Ordered vs. Unordered Map Data Structures
Both ordered and unordered maps are associative containers that store key-value pairs.
However, they differ in how they store and access these pairs, leading to different
performance characteristics and use cases.
Ordered Map:
An ordered map maintains the order in which key-value pairs are inserted. This means that
iterating over the map will return the pairs in the order they were added.
 Implementation: Typically implemented using a self-balancing binary search
tree (e.g., red-black tree ) or a skip list.
 Access: Accessing elements by key is efficient (typically O(log n) time complexity),
though slightly slower than the average O(1) of an unordered map.
 Iteration: Iterating over the map is efficient (typically O(n) time complexity) and
preserves the insertion order.
 Use Cases: When the order of elements is important, such as:
o Maintaining a chronological log of events.
o Representing a sequence of operations.
o Implementing a cache with a least-recently-used (LRU) eviction policy.
Unordered Map:
An unordered map does not maintain the order of key-value pairs. The order in which
elements are returned during iteration is not guaranteed and may vary across different
implementations or executions.
 Implementation: Typically implemented using a hash table.
 Access: Accessing elements by key is very efficient (typically O(1) average time
complexity), making it faster than an ordered map in most cases.
 Iteration: Iterating over the map is less efficient than an ordered map (typically O(n)
time complexity) and does not preserve the insertion order.
 Use Cases: When the order of elements is not important and fast access by key is
crucial, such as:
o Implementing a dictionary or symbol table.
o Storing configuration settings.
o Caching frequently accessed data.
Summary Table:

Feature        | Ordered Map                        | Unordered Map
Order          | Maintains key (or insertion) order | No guaranteed order
Implementation | Self-balancing tree, skip list     | Hash table
Access by key  | O(log n)                           | O(1) average
Iteration      | O(n), in order                     | O(n), arbitrary order
Use cases      | Order matters, LRU cache           | Fast access, dictionaries

#Application of algorithms: stable marriages example, Dictionaries and hashing, search trees, Dynamic programming
Stable Marriages
The Stable Marriage Problem states that given N men and N women, where each person has
ranked all members of the opposite sex in order of preference, marry the men and women
together such that there are no two people of opposite sex who would both rather have
each other than their current partners. If there are no such people, all the marriages are
“stable”. The idea is to iterate through the free men while any free man remains.
Each free man proposes to the women in his preference list in order. For every woman he
proposes to, he checks whether she is free; if so, they become engaged. If the woman is
already engaged, she consults her own preference list and either rejects him or dumps her
current partner for him. So an engagement, once made, can be broken if the woman gets a
better option (see the sketch below).
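A compact sketch of this proposal-and-rejection procedure (the Gale-Shapley algorithm); men_prefs and women_prefs are assumed to be dicts mapping each person to an ordered preference list:

def stable_marriage(men_prefs, women_prefs):
    # rank[w][m]: position of man m in woman w's list (lower is better)
    rank = {w: {m: i for i, m in enumerate(p)} for w, p in women_prefs.items()}
    free_men = list(men_prefs)
    next_choice = {m: 0 for m in men_prefs}   # next woman each man will propose to
    engaged = {}                              # woman -> current partner

    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if w not in engaged:
            engaged[w] = m                    # she is free, so they get engaged
        elif rank[w][m] < rank[w][engaged[w]]:
            free_men.append(engaged[w])       # she dumps her current partner
            engaged[w] = m
        else:
            free_men.append(m)                # she says no; he stays free
    return engaged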
An important and large-scale application of stable marriage is in assigning users to servers in
a large distributed Internet service.[3] Billions of users access web pages, videos, and other
services on the Internet, requiring each user to be matched to one of (potentially) hundreds
of thousands of servers around the world that offer that service. A user prefers servers that
are proximal enough to provide a faster response time for the requested service, resulting in
a (partial) preferential ordering of the servers for each user. Each server prefers to serve
users that it can serve at a lower cost, resulting in a (partial) preferential ordering of users
for each server. Content delivery networks that distribute much of the world's content and
services solve this large and complex stable marriage problem between users and servers
every few tens of seconds to enable billions of users to be matched up with their respective
servers that can provide the requested web pages, videos, or other services.[3]
Dictionaries and Hashing
Dictionaries and hashing are foundational tools in Machine Learning (ML) for efficient data
management, feature representation, and optimization. They enable rapid data access,
transformation, and mapping, especially in high-dimensional or large-scale problems.

Applications of Dictionaries and Hashing in ML


1. Feature Hashing
 What is Feature Hashing?
A technique used to transform high-dimensional feature spaces into a lower-
dimensional representation using hash functions. Also known as the "hashing trick."
 How It Works:
o Maps input features to a fixed-size vector using a hash function.
o Collisions (two features hashing to the same index) are tolerated but
minimized.
 Advantages:
o Memory efficiency: No need to explicitly store feature names.
o Scalability: Handles high-dimensional data (e.g., text data in NLP).
 Use Cases:
o Text classification and sentiment analysis (e.g., hashing words or n-grams).
o Large-scale sparse data (e.g., clickstream data for recommender systems).
2. Word Embeddings
 Role of Dictionaries:
o Mapping between words (or tokens) and their embeddings (vectors).
o Example: { "cat": [0.1, 0.5, ...], "dog": [0.2, 0.6, ...] }.
 Hashing:
o Efficiently look up or store embeddings for millions of words.
o Used in libraries like TensorFlow or PyTorch for embedding layers.

3. K-Nearest Neighbors (KNN) Optimization


 Problem:
o Storing distances for large datasets is memory-intensive.
 Solution:
o Use dictionaries to store distances for quick retrieval.
o Hash points or features for efficient indexing in high-dimensional spaces.

4. Sparse Representations
 Sparse Matrices in ML:
o Many datasets are sparse (e.g., one-hot encoding, text vectors).
o Dictionaries efficiently store non-zero entries.
 Example:
# Sparse representation of a one-hot encoded vector
sparse_vector = {2: 1, 5: 1} # Non-zero entries at indices 2 and 5

5. Data Deduplication
 Challenge:
o Identify and remove duplicate samples in large datasets.
 Solution:
o Use hashing to create a unique hash for each data point.
o Store hashes in a dictionary for quick lookup.
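A minimal deduplication sketch; it hashes each sample with Python's built-in hash (a real pipeline might instead use hashlib.sha256 over a serialized row):

def deduplicate(rows):
    # rows: list of hashable samples (e.g., tuples); keeps the first occurrence of each
    seen = set()
    unique = []
    for row in rows:
        key = hash(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

data = [(1.0, 2.0), (3.0, 4.0), (1.0, 2.0)]   # toy data with one duplicate
print(deduplicate(data))                      # [(1.0, 2.0), (3.0, 4.0)]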

Hashing in ML Frameworks
Hash Table Implementation
 Python Example:
data = {"feature1": 0.5, "feature2": 0.8}
print(data["feature1"]) # Access time: O(1)
Libraries Utilizing Hashing
 Scikit-learn:
o Implements feature hashing via sklearn.feature_extraction.FeatureHasher.
o Example:
from sklearn.feature_extraction import FeatureHasher

# input_type='dict' lets the hasher consume feature-name -> value mappings
hasher = FeatureHasher(n_features=10, input_type='dict')
features = [{'feature1': 1, 'feature2': 1}, {'feature1': 2}]
hashed_features = hasher.transform(features)
print(hashed_features)
 TensorFlow and PyTorch:
o Use hashing internally in embedding layers for word and feature lookups.

Benefits of Hashing in ML
1. Speed: Quick lookups and insertions for high-volume data.
2. Scalability: Efficient handling of sparse and large datasets.
3. Memory Efficiency: Reduces memory overhead by avoiding explicit storage of
feature names.

Challenges
1. Collisions: Hash functions may produce the same value for different inputs, leading
to information loss.
o Mitigation: Use higher-dimensional hash spaces or advanced hash functions.
2. Interpretability: Collisions make interpreting features harder.
Search Trees
In machine learning, a “search tree” most commonly refers to a decision tree, a tree-based
algorithm used for classification and regression tasks, in which data is organized
hierarchically with nodes representing decisions based on features, allowing for efficient
prediction by traversing the tree structure to reach a final outcome.
Key points about search trees in machine learning:
 Structure:
A decision tree consists of a root node, internal nodes (decision points), and leaf nodes (final
predictions), with each branch representing a possible decision based on a feature value.
 Decision making:
At each internal node, an attribute is selected, and a comparison is made to split the data
into subsets based on the attribute value, guiding the search towards the most likely class.
 Benefits:
 Interpretability: Decision trees are considered easy to understand as the
decision-making process can be visualized through the tree structure.
 Handling mixed data types: They can handle both numerical and categorical
features.
 Fast prediction: Once trained, predictions are made quickly by traversing the
tree.
 Drawbacks:
 Overfitting potential: Decision trees can easily overfit to training data,
especially if they are too deep.
 Instability: Small changes in the training data can produce a very different
tree, which reduces reproducibility.
Different types of search trees:
 Binary search tree (BST): A basic search tree where each node has at most two
children, with the left child having a smaller value than the parent and the right child
having a larger value.
 k-ary tree: A search tree where each node can have up to k children.
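A minimal scikit-learn decision-tree sketch on a toy dataset (the feature values and labels below are invented for illustration):

from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [1, 0], [0, 1]]        # two numeric features per sample
y = [0, 1, 1, 0]                            # binary labels

clf = DecisionTreeClassifier(max_depth=3)   # limiting depth guards against overfitting
clf.fit(X, y)
print(clf.predict([[1, 1], [0, 0]]))        # predictions made by traversing the tree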
Dynamic Programming
Dynamic Programming (DP) is introduced below, from the core idea to representative
problems.
 DP is an algorithmic technique used in computer science and mathematics to solve
complex problems by breaking them down into smaller overlapping subproblems.
 The core idea behind DP is to store solutions to subproblems so that each is solved
only once.
 To solve DP problems, we first write a recursive solution in a way that there are
overlapping subproblems in the recursion tree (the recursive function is called with
the same parameters multiple times).
 To make sure that a recursive value is computed only once (to improve time taken by
algorithm), we store results of the recursive calls.
 There are two ways to store the results, one is top down (or memoization) and other
is bottom up (or tabulation).
 Some popular problems solved using DP are Fibonacci Numbers, Diff Utility (Longest
Common Subsequence), Bellman–Ford Shortest Path, Floyd Warshall, Edit
Distance and Matrix Chain Multiplication.
Dynamic Programming (DP)
Dynamic programming is a method for solving problems by breaking them into smaller
overlapping subproblems and solving each subproblem only once. It is particularly effective
for problems with optimal substructure and overlapping subproblems.

Key Concepts
1. Optimal Substructure:
o A problem exhibits optimal substructure if its solution can be constructed
from the solutions of its subproblems.
o Example: Shortest path in a graph can be solved by combining shortest paths
to intermediate vertices.
2. Overlapping Subproblems:
o A problem has overlapping subproblems if the same subproblem is solved
multiple times.
o Example: Fibonacci sequence calculation.
3. Memoization vs Tabulation:
o Memoization: Top-down approach where solutions to subproblems are
stored in a cache to avoid redundant computations.
o Tabulation: Bottom-up approach where subproblems are solved iteratively
and stored in a table.

Steps to Solve a DP Problem


1. Define the State:
o Decide the parameters to represent subproblems.
o Example: For the Fibonacci sequence, the state can be F(n), representing the
nth Fibonacci number.
2. Formulate the Recurrence Relation:
o Identify how the solution to a problem depends on its subproblems.
o Example: F(n) = F(n−1) + F(n−2).
3. Base Cases:
o Define the trivial cases where no further subproblems are needed.
o Example: F(0) = 0, F(1) = 1.
4. Compute the Solution:
o Use either memoization or tabulation to compute the result efficiently.
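The Fibonacci example above can be written both ways; a small sketch of memoization (top-down) versus tabulation (bottom-up):

from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):
    # Top-down: recursion plus caching, so each subproblem is solved once
    if n < 2:
        return n                          # base cases F(0) = 0, F(1) = 1
    return fib_memo(n - 1) + fib_memo(n - 2)

def fib_tab(n):
    # Bottom-up: fill a table from the base cases upward
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_memo(10), fib_tab(10))          # both print 55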
Approaches of dynamic programming
There are two approaches to dynamic programming:
o Top-down approach
o Bottom-up approach
Top-down approach
The top-down approach follows the memoization technique, while the bottom-up approach
follows the tabulation method. Here memoization equals recursion plus caching: recursion
means calling the function itself, while caching means storing the intermediate results.
Advantages
o It is very easy to understand and implement.
o It solves the subproblems only when it is required.
o It is easy to debug.
Disadvantages
It uses recursion, which occupies memory in the call stack; when the recursion is too deep,
a stack overflow can occur.
This extra memory use can degrade the overall performance.

Decision Boundaries

Definition
A decision boundary is the surface that separates different classes in the feature space. For
a linear classifier, the decision boundary is a straight line (in 2D), a plane (in 3D), or a
hyperplane (in higher dimensions).
Mathematical Form
The decision boundary is defined by the equation:
w1x1 + w2x2 + … + wnxn + b = 0
Role in Classification
1. Separation of Classes:
o Points on one side of the boundary belong to one class, and points on the
other side belong to another.
2. Interpretability:
o The weights (w) determine the orientation of the boundary.
o The bias (b) shifts the boundary.
3. Relationship with Data:
o A perfectly separable dataset allows the boundary to completely divide the
classes.
o For overlapping datasets, the boundary represents the best attempt at
separation based on the classifier's objective.

Linear Classifiers and Decision Boundaries


Examples of Linear Classifiers
1. Logistic Regression:
o Uses the logistic function to model probabilities.
o Decision boundary: Line where the predicted probability is 0.5.
2. Support Vector Machines (SVMs):
o Maximizes the margin (distance) between the boundary and the nearest data
points of each class.
o Decision boundary: Line that lies equidistant from support vectors.
3. Perceptron:
o A simple algorithm for binary classification.
o Decision boundary: Defined by the learned weights and bias.
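A tiny sketch of the shared idea behind these classifiers: the sign of the linear score w·x + b decides the class, and the decision boundary is where that score equals zero (the weights and bias below are made up, not learned):

def predict(x, w, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0         # boundary: score == 0

w, b = [2.0, -1.0], -0.5
print(predict([1.0, 1.0], w, b))          # score = 2 - 1 - 0.5 = 0.5  -> class 1
print(predict([0.0, 1.0], w, b))          # score = -1 - 0.5 = -1.5    -> class 0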

Advantages of Linear Classifiers


1. Simplicity:
o Easy to interpret and implement.
2. Efficiency:
o Computationally less expensive, especially for high-dimensional data.
3. Robustness:
o Works well for linearly separable data.

Limitations
1. Non-Linearity:
o Linear classifiers cannot handle datasets where classes are not linearly
separable.
o Example: XOR problem.
2. Feature Engineering:
o Often requires transforming features (e.g., adding polynomial terms or using
kernels) to make data linearly separable.

Illustration of Relationship Between Decision Boundaries and Linear Classifiers


 The decision boundary is the geometric representation of the classifier’s learned
decision rule.
 A change in weights or bias will alter the decision boundary, affecting how the data is
classified.
 For linearly separable datasets, a well-trained linear classifier will place the boundary
to perfectly divide the classes.

Cross validation is a technique used in machine learning to evaluate the performance of a
model on unseen data. It involves dividing the available data into multiple folds or subsets,
using one of these folds as a validation set, and training the model on the remaining folds.
This process is repeated multiple times, each time using a different fold as the validation set.
Finally, the results from each validation step are averaged to produce a more robust
estimate of the model’s performance. Cross validation is an important step in the machine
learning process and helps to ensure that the model selected for deployment is robust and
generalizes well to new data.
What is cross-validation used for?
The main purpose of cross validation is to prevent overfitting, which occurs when a model is
trained too well on the training data and performs poorly on new, unseen data. By
evaluating the model on multiple validation sets, cross validation provides a more realistic
estimate of the model’s generalization performance, i.e., its ability to perform well on new,
unseen data.
Types of Cross-Validation
There are several types of cross validation techniques, including k-fold cross validation,
leave-one-out cross validation, and Holdout validation, Stratified Cross-Validation. The
choice of technique depends on the size and nature of the data, as well as the specific
requirements of the modeling problem.
1. Holdout Validation
In Holdout Validation, we train on 50% of the given dataset and use the remaining 50% for
testing. It is a simple and quick way to evaluate a model. The major drawback is that,
because training uses only half of the data, the held-out half may contain important
information that the model never sees, which can lead to higher bias.
2. LOOCV (Leave One Out Cross Validation)
In this method, we train on the whole dataset except a single left-out data point and iterate
over every data point. In LOOCV, the model is trained on n − 1 samples and tested on the
one omitted sample, repeating this process for each data point in the dataset. It has both
advantages and disadvantages.
An advantage of this method is that it makes use of all data points, so the estimate has low
bias.
The major drawback is that it leads to higher variance in the estimate, since each test set
contains only one data point; if that point is an outlier, the variation can be large. Another
drawback is execution time, as the procedure must run once for every data point.
3. Stratified Cross-Validation
It is a technique used in machine learning to ensure that each fold of the cross-validation
process maintains the same class distribution as the entire dataset. This is particularly
important when dealing with imbalanced datasets, where certain classes may be
underrepresented. In this method,
1. The dataset is divided into k folds while maintaining the proportion of classes in each
fold.
2. During each iteration, one-fold is used for testing, and the remaining folds are used
for training.
3. The process is repeated k times, with each fold serving as the test set exactly once.
Stratified Cross-Validation is essential when dealing with classification problems where
maintaining the balance of class distribution is crucial for the model to generalize well to
unseen data.
4. K-Fold Cross Validation
In K-Fold Cross Validation, we split the dataset into k subsets (known as folds), train on
k − 1 of them, and hold out the remaining fold for evaluating the trained model. We iterate
k times, reserving a different fold for testing each time.
Note: k = 10 is a commonly suggested value; a very low k approaches simple holdout
validation, while a very high k approaches the LOOCV method.
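A short scikit-learn sketch of 5-fold cross-validation; the dataset and model here are placeholders for whatever is being evaluated:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cross_val_score trains on 4 folds and tests on the held-out fold, 5 times
scores = cross_val_score(model, X, y, cv=5)
print(scores)           # one accuracy score per fold
print(scores.mean())    # averaged estimate of generalization performance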
