ML 2
Graphs: Basics
A graph G = (V, E) consists of:
Vertices (Nodes): Points in the graph (V).
Edges: Connections between vertices (E).
o Directed: Edges have a direction (e.g., A → B).
o Undirected: Edges are bidirectional (e.g., A—B).
o Weighted: Edges have weights or costs (e.g., distance, time).
Types of Graphs
1. Undirected Graph: E contains pairs of vertices without direction.
2. Directed Graph (Digraph): E contains ordered pairs of vertices.
3. Weighted Graph: Each edge has an associated weight.
4. Tree: A connected graph with no cycles.
5. Cyclic/Acyclic Graph: Graphs with or without cycles.
Graph Representation
1. Adjacency Matrix:
o n × n matrix where M[i][j] = 1 if there's an edge between i and j.
o Efficient for dense graphs.
2. Adjacency List:
o List of neighbors for each vertex.
o Space-efficient for sparse graphs.
3. Edge List:
o A list of all edges and their weights.
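To make these concrete, here is one small directed graph (A→B with weight 2, A→C with weight 5, B→C with weight 1) in all three representations, as a minimal Python sketch:
# Vertices indexed 0 = A, 1 = B, 2 = C
adj_matrix = [
    [0, 2, 5],  # edges out of A (0 means no edge)
    [0, 0, 1],  # edges out of B
    [0, 0, 0],  # edges out of C
]
adj_list = {"A": [("B", 2), ("C", 5)], "B": [("C", 1)], "C": []}
edge_list = [("A", "B", 2), ("A", "C", 5), ("B", "C", 1)]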
Graph Algorithms
Graphs are used to solve map searching and navigation problems. Below are key algorithms:
1. Dijkstra’s Algorithm
Use: Finds the shortest path in weighted graphs (non-negative weights).
Approach:
o Maintains a priority queue to iteratively find the shortest distance.
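A minimal sketch of this approach in Python, using heapq as the priority queue (the adjacency-list format is illustrative):
import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbor, weight), ...]}, non-negative weights
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist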
2. Bellman-Ford Algorithm
Use: Finds the shortest path in graphs with negative weights.
Approach:
o Iteratively relaxes all edges V − 1 times.
Time Complexity: O(VE).
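A minimal sketch, assuming the graph is given as an edge list over vertices 0..V−1:
def bellman_ford(edges, num_vertices, source):
    # edges: list of (u, v, weight) tuples; negative weights allowed
    dist = [float("inf")] * num_vertices
    dist[source] = 0
    for _ in range(num_vertices - 1):  # relax all edges V - 1 times
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist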
3. A* Search
Use: Optimized pathfinding with a heuristic (e.g., maps).
Approach:
o Combines Dijkstra’s algorithm with a heuristic function to guide the search.
Key Formula: f(n) = g(n) + h(n)
o g(n): Cost from the start node to the current node.
o h(n): Heuristic estimate of the cost from the current node to the goal.
4. Floyd-Warshall Algorithm
Use: Finds shortest paths between all pairs of vertices.
Approach:
o Uses dynamic programming to iteratively improve path estimates.
Time Complexity: O(V³).
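A minimal sketch of the dynamic-programming update (the input matrix format is illustrative):
def floyd_warshall(dist):
    # dist: n x n matrix; dist[i][j] is the weight of edge i -> j,
    # float("inf") if absent, and dist[i][i] = 0
    n = len(dist)
    for k in range(n):  # allow k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist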
Applications: Map Searching and Navigation
1. Road Networks
Modeled as weighted graphs where:
o Nodes represent intersections or locations.
o Edges represent roads, weighted by distance, time, or cost.
2. Shortest Path Problems
Example: Google Maps uses Dijkstra’s or A* to find the fastest route.
Challenges: Incorporating real-time traffic data.
3. Navigation Systems
Graph Representation:
o Use adjacency lists for sparse road networks.
o Heuristic algorithms (e.g., A*) prioritize user-defined metrics like distance or
traffic.
Tools:
o OpenStreetMap: Free geographic data.
o GraphHopper: Open-source routing engine.
4. Location-Based Services
Geospatial Graphs:
o Nodes represent geolocations (latitude, longitude).
o Edges are weighted by metrics like travel time.
5. Delivery Optimization
Traveling Salesman Problem (TSP):
o Find the shortest route visiting all locations exactly once.
o Solved using:
Approximation algorithms (e.g., nearest neighbor).
Genetic algorithms.
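A minimal sketch of the nearest-neighbor approximation (greedy; it does not guarantee the optimal tour):
def nearest_neighbor_tsp(dist, start=0):
    # dist: n x n matrix of pairwise distances between locations
    n = len(dist)
    unvisited = set(range(n)) - {start}
    tour = [start]
    while unvisited:
        nearest = min(unvisited, key=lambda j: dist[tour[-1]][j])
        tour.append(nearest)
        unvisited.remove(nearest)
    return tour + [start]  # return to the starting location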
Visualization Tools
1. Gephi: For interactive graph visualization.
2. NetworkX (Python): Easy graph modeling and visualization.
3. Google Maps API: For integrating graph-based map searching.
4. Neo4j: A graph database for large-scale geographic queries.
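As a small illustration of tool 2, NetworkX can model a weighted road network and find a shortest route in a few lines (edge weights are illustrative):
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("A", "B", 2.0), ("B", "C", 1.0), ("A", "C", 5.0)])
print(nx.shortest_path(G, "A", "C", weight="weight"))  # ['A', 'B', 'C'] (uses Dijkstra)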
What is a Map Data Structure?
A map data structure (also known as a dictionary, associative array, or hash map) is a data structure that stores a collection of key-value pairs, where each key is associated with a single value.
Maps provide an efficient way to store and retrieve data based on a unique identifier (the key).
Need for Map Data Structure
Map data structures are important because they allow for efficient storage and retrieval of
key-value pairs. Maps provide the following benefits:
Fast Lookup: Unordered maps allow for constant-time (O(1)) average-case lookup of
elements based on their unique keys.
Efficient Insertion and Deletion: Maps support fast insertion and deletion of key-
value pairs, typically with logarithmic (O(log n)) or constant-time (O(1)) average-case
complexity.
Unique Keys: Maps ensure that each key is unique, allowing for efficient association
of data with specific identifiers.
Flexible Data Storage: Maps can store a wide variety of data types as both keys and
values, providing a flexible and versatile data storage solution.
Intuitive Representation: The key-value pair structure of maps offers an intuitive way
to model and represent real-world data relationships.
Properties of Map Data Structure:
A map data structure possesses several key properties that make it a valuable tool for
various applications:
Key-Value Association: Maps allow you to associate arbitrary values with unique
keys. This enables efficient data retrieval and manipulation based on keys.
Unordered (except for specific implementations): In most map implementations,
elements are not stored in any specific order. This means that iteration over a map
will yield elements in an arbitrary order. However, some map implementations, such
as TreeMap in Java, maintain order based on keys.
Dynamic Size: Maps can grow and shrink dynamically as you add or remove
elements. This flexibility allows them to adapt to changing data requirements
without the need for manual resizing.
Efficient Lookup: Maps provide efficient lookup operations based on keys. You can
quickly find the value associated with a specific key using methods
like get() or [] with an average time complexity of O(1) for hash-based
implementations and O(log n) for tree-based implementations.
Duplicate Key Handling: Most map implementations do not allow duplicate keys.
Attempting to insert a key that already exists will typically overwrite the existing
value associated with that key. However, some map implementations,
like multimap in C++, allow storing multiple values for the same key.
Space Complexity: The space complexity of a map depends on its implementation, but both hash-based and tree-based maps store one entry per element, giving O(n) space, where n is the number of elements (tree-based maps add per-node pointer overhead).
Time Complexity: The time complexity of operations like insertion, deletion, and lookup varies depending on the implementation. Hash-based maps typically have an average time complexity of O(1) for these operations but can degrade to O(n) in the worst case, while tree-based maps guarantee O(log n) even in the worst case, making them more predictable and reliable for performance-critical applications.
Ordered vs. Unordered Map Data Structures
Both ordered and unordered maps are associative containers that store key-value pairs.
However, they differ in how they store and access these pairs, leading to different
performance characteristics and use cases.
Ordered Map:
An ordered map maintains its keys in a defined order, typically sorted by key (tree-based implementations) or, in some variants such as Java's LinkedHashMap, in insertion order. Iterating over the map returns the pairs in that order.
Implementation: Typically implemented using a self-balancing binary search tree (e.g., a red-black tree) or a skip list.
Access: Accessing elements by key is efficient (typically O(log n) time complexity), though slower than the O(1) average of an unordered map.
Iteration: Iterating over the map is efficient (typically O(n) time complexity) and yields elements in the map's defined order.
Use Cases: When the order of elements is important, such as:
o Maintaining a chronological log of events.
o Representing a sequence of operations.
o Implementing a cache with a least-recently-used (LRU) eviction policy.
Unordered Map:
An unordered map does not maintain the order of key-value pairs. The order in which
elements are returned during iteration is not guaranteed and may vary across different
implementations or executions.
Implementation: Typically implemented using a hash table.
Access: Accessing elements by key is very efficient (typically O(1) average time
complexity), making it faster than an ordered map in most cases.
Iteration: Iterating over the map is O(n), the same as an ordered map, but the element order is arbitrary and insertion order is not preserved.
Use Cases: When the order of elements is not important and fast access by key is
crucial, such as:
o Implementing a dictionary or symbol table.
o Storing configuration settings.
o Caching frequently accessed data.
Summary Table:

| Aspect | Ordered Map | Unordered Map |
| --- | --- | --- |
| Implementation | Self-balancing BST or skip list | Hash table |
| Lookup by key | O(log n) | O(1) average |
| Iteration | O(n), in the map's defined order | O(n), arbitrary order |
| Typical use cases | Ordered logs, sequences, LRU caches | Dictionaries, config storage, caches |
Applications of Maps in ML
1. Sparse Representations
Sparse Matrices in ML:
o Many datasets are sparse (e.g., one-hot encoding, text vectors).
o Dictionaries efficiently store non-zero entries.
Example:
# Sparse representation of a one-hot encoded vector
sparse_vector = {2: 1, 5: 1} # Non-zero entries at indices 2 and 5
2. Data Deduplication
Challenge:
o Identify and remove duplicate samples in large datasets.
Solution:
o Use hashing to create a unique hash for each data point.
o Store hashes in a dictionary for quick lookup.
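A minimal sketch of this idea, hashing each sample's serialized form (the serialization choice is illustrative):
import hashlib

def deduplicate(samples):
    seen = {}  # hash -> first sample with that hash
    for s in samples:
        h = hashlib.sha256(repr(s).encode()).hexdigest()
        if h not in seen:
            seen[h] = s
    return list(seen.values())

print(deduplicate([(1, 2), (3, 4), (1, 2)]))  # [(1, 2), (3, 4)]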
Hashing in ML Frameworks
Hash Table Implementation
Python Example:
data = {"feature1": 0.5, "feature2": 0.8}
print(data["feature1"]) # Access time: O(1)
Libraries Utilizing Hashing
Scikit-learn:
o Implements feature hashing via sklearn.feature_extraction.FeatureHasher.
o Example:
from sklearn.feature_extraction import FeatureHasher
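Continuing the example, a minimal usage sketch (the n_features value is illustrative):
hasher = FeatureHasher(n_features=16, input_type="dict")
X = hasher.transform([{"dog": 1, "cat": 2}, {"run": 5}])
print(X.shape)  # (2, 16): a sparse matrix of hashed features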
Benefits of Hashing in ML
1. Speed: Quick lookups and insertions for high-volume data.
2. Scalability: Efficient handling of sparse and large datasets.
3. Memory Efficiency: Reduces memory overhead by avoiding explicit storage of
feature names.
Challenges
1. Collisions: Hash functions may produce the same value for different inputs, leading
to information loss.
o Mitigation: Use higher-dimensional hash spaces or advanced hash functions.
2. Interpretability: Collisions make interpreting features harder.
Search Trees
In machine learning, a "search tree" most commonly refers to a decision tree: a tree-based algorithm used for classification and regression tasks. Data is organized hierarchically, with nodes representing decisions based on features, and a prediction is reached efficiently by traversing the tree structure to a final outcome.
Key points about search trees in machine learning:
Structure:
A decision tree consists of a root node, internal nodes (decision points), and leaf nodes (final
predictions), with each branch representing a possible decision based on a feature value.
Decision making:
At each internal node, an attribute is selected, and a comparison is made to split the data
into subsets based on the attribute value, guiding the search towards the most likely class.
Benefits:
Interpretability: Decision trees are considered easy to understand as the
decision-making process can be visualized through the tree structure.
Handling mixed data types: They can handle both numerical and categorical
features.
Fast prediction: Once trained, predictions are made quickly by traversing the
tree.
Drawbacks:
Overfitting potential: Decision trees can easily overfit the training data,
especially if they are grown too deep.
Instability: Small changes in the training data can produce a very
different tree structure.
Different types of search trees:
Binary search tree (BST): A basic search tree where each node has at most two
children, with the left child having a smaller value than the parent and the right child
having a larger value.
k-ary tree: A search tree where each node can have up to k children.
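A minimal sketch of training and using a decision tree with scikit-learn (the dataset and depth limit are illustrative):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # limiting depth curbs overfitting
clf.fit(X, y)
print(clf.predict(X[:2]))  # predictions made by traversing the tree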
Dynamic Programming
DP is an algorithmic technique used in computer science and mathematics to solve
complex problems by breaking them down into smaller overlapping subproblems.
The core idea behind DP is to store solutions to subproblems so that each is solved
only once.
To solve DP problems, we first write a recursive solution such that the recursion tree contains overlapping subproblems (the recursive function is called with the same parameters multiple times).
To make sure each recursive value is computed only once (improving the algorithm's running time), we store the results of the recursive calls.
There are two ways to store the results: top-down (memoization) and bottom-up (tabulation).
Some popular problems solved using DP are Fibonacci Numbers, Diff Utility (Longest
Common Subsequence), Bellman–Ford Shortest Path, Floyd Warshall, Edit
Distance and Matrix Chain Multiplication.
Dynamic Programming (DP)
Dynamic programming is a method for solving problems by breaking them into smaller
overlapping subproblems and solving each subproblem only once. It is particularly effective
for problems with optimal substructure and overlapping subproblems.
Key Concepts
1. Optimal Substructure:
o A problem exhibits optimal substructure if its solution can be constructed
from the solutions of its subproblems.
o Example: Shortest path in a graph can be solved by combining shortest paths
to intermediate vertices.
2. Overlapping Subproblems:
o A problem has overlapping subproblems if the same subproblem is solved
multiple times.
o Example: Fibonacci sequence calculation.
3. Memoization vs Tabulation:
o Memoization: Top-down approach where solutions to subproblems are
stored in a cache to avoid redundant computations.
o Tabulation: Bottom-up approach where subproblems are solved iteratively
and stored in a table.
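A minimal sketch of both styles on the Fibonacci example mentioned above:
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):  # top-down: cache the results of recursive calls
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

def fib_tab(n):  # bottom-up: fill a table iteratively
    if n < 2:
        return n
    table = [0] * (n + 1)
    table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_memo(30), fib_tab(30))  # 832040 832040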
Decision Boundaries
Definition
A decision boundary is the surface that separates different classes in the feature space. For
a linear classifier, the decision boundary is a straight line (in 2D), a plane (in 3D), or a
hyperplane (in higher dimensions).
Mathematical Form
The decision boundary is defined by the equation:
w₁x₁ + w₂x₂ + … + wₙxₙ + b = 0
Role in Classification
1. Separation of Classes:
o Points on one side of the boundary belong to one class, and points on the
other side belong to another.
2. Interpretability:
o The weights (w) determine the orientation of the boundary.
o The bias (b) shifts the boundary.
3. Relationship with Data:
o A perfectly separable dataset allows the boundary to completely divide the
classes.
o For overlapping datasets, the boundary represents the best attempt at
separation based on the classifier's objective.
Limitations
1. Non-Linearity:
o Linear classifiers cannot handle datasets where classes are not linearly
separable.
o Example: XOR problem.
2. Feature Engineering:
o Often requires transforming features (e.g., adding polynomial terms or using
kernels) to make data linearly separable.
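As a small illustration of the boundary equation above, a minimal sketch classifying 2-D points with hypothetical weights and bias:
import numpy as np

w = np.array([1.0, -1.0])  # hypothetical weights
b = 0.5                    # hypothetical bias
points = np.array([[2.0, 1.0], [0.0, 2.0]])
# The sign of w·x + b tells which side of the line w₁x₁ + w₂x₂ + b = 0 a point lies on
print(np.sign(points @ w + b))  # [ 1. -1.]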