RTIT Notes
The field of Information Technology (IT) is in constant flux, with new technologies and
approaches emerging at a rapid pace. For BCA students, understanding these trends is crucial
for building a successful career. This section provides an overview of some of the most
impactful recent trends, which will be explored in greater detail in subsequent sections. We will
focus on Artificial Intelligence, Data Warehousing, Data Mining, and Spark. These areas are
transforming industries and creating new opportunities for skilled IT professionals.
1.1 Artificial Intelligence (AI)
● What it is: AI refers to the simulation of human intelligence in machines that are
programmed to think, learn, and solve problems. This involves developing algorithms and
systems that can perform tasks that typically require human intelligence, such as visual
perception, speech recognition, decision-making, and language translation.
● Why it's important: AI is rapidly changing the way we live and work. It's being used in
everything from self-driving cars to medical diagnosis to customer service chatbots.
Understanding AI concepts and techniques is essential for anyone pursuing a career in
IT.
● Key areas: Machine Learning (ML), Deep Learning (DL), Natural Language Processing
(NLP), Computer Vision, Robotics.
● Examples:
○ Machine Learning: Algorithms that allow computers to learn from data without
being explicitly programmed. Used in recommendation systems (Netflix,
Amazon), fraud detection, and predictive analytics.
○ Deep Learning: A subfield of ML that uses artificial neural networks with multiple
layers to analyze data with complex structures and patterns. Used in image
recognition, speech recognition, and natural language processing.
○ Natural Language Processing: Enables computers to understand, interpret,
and generate human language. Used in chatbots, language translation, and
sentiment analysis.
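To make "learning from data without being explicitly programmed" concrete, here is a minimal
supervised-learning sketch in Python. It assumes the scikit-learn library is installed; the data and
feature names are made up for illustration.

from sklearn.tree import DecisionTreeClassifier

# Toy training data: [hours_studied, classes_attended] -> result (1 = pass, 0 = fail)
X_train = [[2, 10], [8, 30], [1, 5], [9, 28], [4, 15], [7, 25]]
y_train = [0, 1, 0, 1, 0, 1]

model = DecisionTreeClassifier()   # the rule is never written by hand...
model.fit(X_train, y_train)        # ...the algorithm learns it from the labelled examples

print(model.predict([[6, 22]]))    # predict the outcome for a new, unseen student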
1.2 Data Warehousing
● What it is: A data warehouse is a central repository of integrated data from multiple
sources. It stores current and historical data in a single place that is used to create
analytical reports for workers throughout the enterprise. The data is cleaned,
transformed, and cataloged for analysis and reporting.
● Why it's important: Data warehouses enable businesses to gain valuable insights from
their data, leading to better decision-making. They support Business Intelligence (BI) and
analytics by providing a consolidated view of data from across the organization.
● Key characteristics: Subject-oriented, integrated, time-variant, and non-volatile.
● Use cases:
○ Business Intelligence: Providing a foundation for reporting, dashboards, and
data visualization.
○ Decision Support: Enabling data-driven decision-making at all levels of the
organization.
○ Customer Relationship Management (CRM): Analyzing customer data to
improve customer service and personalize marketing efforts.
○ Supply Chain Management: Optimizing supply chain operations by analyzing
data on inventory, logistics, and demand.
1.3 Data Mining
● What it is: Data mining is the process of discovering patterns, trends, and insights from
large datasets. It involves using various techniques, such as statistical analysis, machine
learning, and database technology, to extract valuable information from raw data.
● Why it's important: Data mining helps organizations uncover hidden patterns and
relationships in their data, which can be used to improve business performance, identify
new opportunities, and mitigate risks.
● Key techniques:
○ Classification: Categorizing data into predefined classes.
○ Regression: Predicting a continuous value based on input variables.
○ Clustering: Grouping similar data points together.
○ Association Rule Mining: Discovering relationships between items in a dataset.
● Applications:
○ Market Basket Analysis: Identifying products that are frequently purchased
together.
○ Fraud Detection: Detecting fraudulent transactions by identifying unusual
patterns.
○ Customer Segmentation: Grouping customers based on their characteristics
and behaviors.
○ Risk Management: Assessing and mitigating risks by analyzing historical data.
1.4 Spark
● What it is: Apache Spark is a fast and general-purpose distributed computing system. It
provides high-level APIs in Java, Scala, Python and R, and an optimized engine that
supports general execution graphs. It also supports a rich set of higher-level tools
including Spark SQL for SQL and structured data processing, MLlib for machine learning,
GraphX for graph processing, and Spark Streaming.
● Why it's important: Spark is designed for speed, ease of use, and sophisticated
analytics. It excels at processing large datasets in parallel, making it ideal for big data
applications.
● Key features:
○ In-memory processing: Spark can process data in memory, which significantly
improves performance compared to disk-based processing systems.
○ Real-time data processing: Spark Streaming enables real-time analysis of data
streams.
○ Fault tolerance: Spark provides fault tolerance through its Resilient Distributed
Datasets (RDDs).
○ Ease of use: Spark's high-level APIs make it easy to develop and deploy big data
applications.
● Use cases:
○ Big data analytics: Processing and analyzing large datasets from various
sources.
○ Real-time data streaming: Analyzing real-time data streams from sensors,
social media, and other sources.
○ Machine learning: Building and deploying machine learning models at scale.
○ Data integration: Integrating data from different sources into a unified view.
2. Artificial Intelligence
2.2 Applications of AI
AI has found applications in nearly every business sector and is becoming increasingly common
in everyday life. Key applications include recommendation systems, medical diagnosis, virtual
assistants and chatbots, fraud detection, and self-driving cars. The milestones below trace how
the field developed:
● Alan Turing: British logician and computer pioneer. In 1935, he introduced the concept of
the "Universal Turing Machine". In 1950, he published "Computing Machinery and
Intelligence," proposing the Turing Test.
● Early AI Programs:
○ Christopher Strachey (1951): Created one of the earliest successful AI programs.
○ Arthur Samuel (1952): Developed a checkers program that learned from
experience.
● Key Concepts & Developments:
○ Machine Learning: Arthur Samuel coined the term in 1959.
○ Expert Systems: The first "expert system" was created in 1965 by Edward
Feigenbaum and Joshua Lederberg.
○ Chatterbots: Joseph Weizenbaum created ELIZA, the first chatterbot, in 1966.
○ Deep Learning: Soviet mathematician Alexey Ivakhnenko proposed a new
approach to AI that would later become "Deep Learning" in 1968.
● State Space: The set of all possible states or configurations that a problem can assume.
● State: A specific configuration of the problem.
● Search Space: The set of all paths or operations that can be used to transition between
states within the problem space.
● Initial State: The starting point of the search.
● Goal State: The desired end configuration.
● Transition: An action that changes one state to another.
● State Space Search: A process used in AI to explore potential configurations or states
of an instance until a goal state with the desired property is found.
● Components of State Space Representation:
○ States: The different possible configurations of the problem.
○ Initial State: The starting configuration.
○ Goal State(s): The desired configuration(s).
○ Actions: The operations through which the system changes states.
○ Transition Model: Describes what happens when actions are applied to states.
○ Path Cost: The cost of moving from the initial state to a given state.
● Search Strategy: A technique that tells us which rule has to be applied next while
searching for the solution of a problem within the problem space.
● Control Strategy: Control strategies are adopted for applying the rules and searching
the problem solution in search space.
● Key Requirements of a Good Control Strategy:
○ It should cause motion: Each rule or strategy applied should cause motion (a change
of state); otherwise the control strategy will never lead to a solution.
○ It should be systematic: A non-systematic strategy may explore useless sequences of
operators several times before reaching a solution.
● Types of Search Strategies:
○ Breadth-First Search: Searches level by level (along the breadth) and uses a
first-in-first-out (FIFO) queue.
○ Depth-First Search: Searches along the depth first and uses a stack (LIFO) approach.
● Problem characteristics define the fundamental aspects that influence how AI processes
and solves problems.
● Core Characteristics of AI Problems:
○ Complexity
○ Uncertainty
○ Ambiguity
○ Lack of clear problem definition
○ Non-linearity
○ Dynamism
○ Subjectivity
○ Interactivity
○ Context sensitivity
○ Ethical considerations
● Key Aspects to Consider in Tackling AI Challenges:
○ Complexity and Uncertainty
○ Multi-disciplinary Approach
○ Goal-oriented Design
2.8 AI Problems: Water Jug Problem, Tower of Hanoi, Missionaries & Cannibals Problem
These are classic AI problems used to illustrate search algorithms and problem-solving
techniques; the Tower of Hanoi sketch below and the worked Water Jug solution in the Exam
Paper section show how such problems are formalized and solved.
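For example, the Tower of Hanoi has a very compact recursive solution. A minimal Python
sketch (peg names are arbitrary):

def hanoi(n, source, target, auxiliary):
    # Move n disks from the source peg to the target peg using one auxiliary peg.
    if n == 1:
        print(f"Move disk 1 from {source} to {target}")
        return
    hanoi(n - 1, source, auxiliary, target)        # clear the way for the largest disk
    print(f"Move disk {n} from {source} to {target}")
    hanoi(n - 1, auxiliary, target, source)        # re-stack the smaller disks on top

hanoi(3, "A", "C", "B")   # 3 disks need 2**3 - 1 = 7 moves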
3. AI Search Techniques
Search algorithms are fundamental to AI, enabling systems to navigate through problem spaces
to find solutions. These algorithms can be classified into uninformed (blind) and informed
(heuristic) searches.
Uninformed search algorithms, also known as blind search algorithms, explore the search space
without any prior knowledge about the goal or the cost of reaching the goal. These algorithms
rely solely on the information provided in the problem definition, such as the initial state, actions
available in each state, and the goal state.
● Breadth-First Search (BFS): Explores all the neighbor nodes at the present depth prior
to moving on to the nodes at the next depth level. BFS is implemented using a FIFO
queue data structure.
○ Advantages: BFS will provide a solution if any solution exists, and BFS will
provide the minimal solution which requires the least number of steps.
○ Disadvantages: It requires lots of memory since each level of the tree must be
saved into memory to expand the next level, and BFS needs lots of time if the
solution is far away from the root node.
● Depth-First Search (DFS): Explores as far as possible along each branch before
backtracking. It uses a stack data structure to keep track of the nodes to be explored.
● Depth-Limited Search (DLS): A variant of DFS where the depth of the search is limited
to a certain level.
● Iterative Deepening Search (IDS): A general strategy, often used in combination with
DFS, that finds the best depth limit. It combines the benefits of BFS (guaranteed shortest
path) and DFS (less memory consumption) by gradually increasing the depth limit.
○ Advantages: Combines the benefits of BFS and DFS search algorithm in terms
of fast search and memory efficiency.
○ Disadvantages: The main drawback of IDDFS is that it repeats all the work of
the previous phase.
● Bidirectional Search: Runs two simultaneous searches, one forward from the initial
state and the other backward from the goal, stopping when the two searches meet in the
middle.
● Uniform Cost Search (UCS): Expands nodes according to their path cost from the
root node. It can be used to solve any graph/tree where the optimal cost is in demand.
○ Advantages: Uniform cost search is optimal because at every state the path with
the least cost is chosen.
○ Uniform cost search is equivalent to BFS algorithm if the path cost of all edges is
the same.
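The practical difference between BFS and DFS comes down to the frontier data structure
(FIFO queue vs. stack). A minimal Python sketch on a made-up graph; swapping popleft() for
pop() turns the BFS into a DFS:

from collections import deque

def bfs(graph, start, goal):
    # Breadth-first search: FIFO queue, returns the path with the fewest steps.
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()          # FIFO -> shallowest node first
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None

# A small, made-up state graph
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"], "E": ["F"]}
print(bfs(graph, "A", "F"))   # ['A', 'B', 'D', 'F']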
Informed search algorithms use problem-specific heuristic functions to guide the search through
the search space, reducing the amount of time spent searching.
● Generate and Test: Generate possible solutions and test them until a solution is found.
● Hill Climbing: A heuristic search used for mathematical optimization problems. It tries to
find a sufficiently good solution to the problem, which may not be the global optimum.
○ Steepest-Ascent Hill climbing: It first examines all the neighboring nodes and then
selects the node closest to the solution state as next node.
○ Stochastic hill climbing: It does not examine all the neighboring nodes before
deciding which node to select.
● Best-First Search: A search algorithm which explores a graph by expanding the most
promising node chosen according to a specified rule.
● A*: A best-first search algorithm that uses the evaluation function f(n) = g(n) + h(n), where
g(n) is the cost so far and h(n) is a heuristic estimate of the cost to reach the goal.
● AO*: A search algorithm for AND-OR graphs, used for solving problems that can be
broken down into subproblems.
● Constraint Satisfaction: A search technique where solutions are found that satisfy
certain constraints.
● Means-Ends Analysis: Involves reducing the difference between the current state and the
goal state.
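A minimal Python sketch of A* on a hypothetical weighted graph with an admissible heuristic
h(n):

import heapq

def a_star(graph, h, start, goal):
    # A* search: expands the node with the lowest f(n) = g(n) + h(n).
    frontier = [(h[start], 0, start, [start])]     # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for neighbour, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbour, float("inf")):
                best_g[neighbour] = new_g
                heapq.heappush(frontier,
                               (new_g + h[neighbour], new_g, neighbour, path + [neighbour]))
    return None, float("inf")

# Hypothetical weighted graph and admissible heuristic values
graph = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 2)]}
h = {"S": 4, "A": 3, "B": 2, "G": 0}
print(a_star(graph, h, "S", "G"))   # (['S', 'A', 'B', 'G'], 5)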
4. Data Warehousing
● Definition: A data warehouse (DW) is a system that aggregates data from multiple
sources into a single, central, and consistent data store. It's a subject-oriented,
integrated, time-variant, and non-volatile collection of data in support of management's
decision-making process.
● Purpose: To feed business intelligence (BI), reporting, and analytics, and support
regulatory requirements – so companies can turn their data into insight and make smart,
data-driven decisions.
● Key Characteristics (according to Bill Inmon):
○ Subject-oriented: Data is organized around subjects or topics (e.g., customers,
products) rather than applications.
○ Integrated: Data from different sources is brought together and made consistent.
○ Time-variant: Data is maintained over time, allowing for trend analysis.
○ Non-volatile: Data is not altered or removed once it is placed into the data
warehouse.
● Source Layer: The logical layer of all systems of record, operational databases (CRM,
ERP, etc).
● Staging Layer: Where data is extracted, transformed, and loaded (ETL).
● Warehouse Layer: Where all of the data is stored. The warehouse data is
subject-oriented, integrated, time-variant, and non-volatile.
● Consumption Layer: Used for reporting, analysis, AI/ML, and distribution.
● Single-Tier Architecture: Minimizes data storage by deduplicating data. Best suited for
smaller organizations.
● Two-Tier Architecture: Data is extracted, transformed, and loaded into a centralized
data warehouse. Includes data marts for specific business user applications.
● Three-Tier Architecture: The most common approach, consisting of the source layer,
staging area layer, and analytics layer.
4.5 Multidimensional Data Model
● Definition: A data storage schema that organizes data along more than two dimensions,
extending the familiar rows-and-columns view with additional categories.
● Purpose: To solve complex queries in real-time.
● Key Components:
○ Measures: Numerical data that can be analyzed and compared (e.g., sales,
revenue).
○ Dimensions: Attributes that describe the measures (e.g., time, location, product).
○ Cubes: Structures that represent the multidimensional relationships between
measures and dimensions.
● Common Schemas:
○ Star Schema: A fact table joined to dimension tables. The simplest and most
common type of schema.
○ Snowflake Schema: The fact table is connected to several normalized
dimension tables containing descriptive data. More complex.
○ Fact Constellation Schema: Multiple fact tables.
OLTP vs. OLAP (selected comparison points):
● Users:
○ OLTP: Frontline workers (e.g., store clerks, online shoppers).
○ OLAP: Data scientists, analysts, and business users.
● Emphasis:
○ OLTP: Fast response times for transactions.
○ OLAP: Query performance and flexibility for analysis.
OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives. Basic analytical operations include:
● Roll-up (Consolidation): Aggregates data by climbing up a concept hierarchy (e.g.,
from city to country).
● Drill-down: Navigates through the details, from less detailed data to highly detailed data
(e.g., from region's sales to sales by individual products).
● Slice: Selects a single dimension from the OLAP cube, creating a sub-cube.
● Dice: Selects a sub-cube from the OLAP cube by selecting two or more dimensions.
● Pivot (Rotation): Rotates the current view to get a new view of the representation.
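On a small scale these operations can be imitated with ordinary group-by, filter, and pivot logic.
A minimal sketch, assuming the pandas library and made-up sales data:

import pandas as pd

# Hypothetical sales data: dimensions (region, city, product) and one measure (sales)
df = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North"],
    "city":    ["Delhi", "Jaipur", "Pune", "Mumbai", "Delhi"],
    "product": ["Pen", "Book", "Pen", "Book", "Book"],
    "sales":   [100, 250, 150, 300, 200],
})

roll_up = df.groupby("region")["sales"].sum()             # Roll-up: city -> region
drill   = df.groupby(["region", "city"])["sales"].sum()   # Drill-down: region -> city
slice_  = df[df["product"] == "Pen"]                      # Slice: fix one dimension
dice    = df[(df["product"] == "Book") & (df["region"] == "North")]   # Dice: two dimensions
pivot   = df.pivot_table(index="region", columns="product",
                         values="sales", aggfunc="sum")   # Pivot: rotate the view

print(roll_up)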
5. Data Mining
● Definition: Data mining is the process of discovering patterns, trends, and useful
information from large datasets.
● Alternative Names: Knowledge discovery, knowledge extraction, data/pattern analysis,
information harvesting, business intelligence, etc.
● Goal: Transforming raw data into understandable structures for later use in machine
learning or analytical activities.
● Key Steps: Data cleaning, data transformation, pattern discovery, and knowledge
representation.
Data mining tasks are generally divided into two categories: descriptive and predictive.
Key challenges in data mining include:
● Data Quality: Incomplete, noisy, and inconsistent data can affect the accuracy of data
mining results.
● Scalability: Data mining algorithms need to be scalable to handle large datasets.
● Complexity: Data mining techniques can be complex and require specialized
knowledge.
● Privacy: Data mining can raise privacy concerns, especially when dealing with sensitive
personal data.
● Interpretability: The patterns discovered by data mining algorithms should be
understandable and actionable.
● KDD: The overall process of turning raw data into useful knowledge. Includes data
cleaning, data integration, data selection, data transformation, data mining, pattern
evaluation, and knowledge representation.
● Data Mining: A specific step within the KDD process focused on extracting patterns
from data.
● Relationship: Data mining is an essential part of the KDD process.
5.12 Introduction to Text Mining, Web Mining, Spatial Mining, Temporal Mining
● Text Mining: Extracting useful patterns and knowledge from unstructured text documents.
● Web Mining: Applying data mining techniques to web content, web structure (links), and
web usage (logs).
● Spatial Mining: Discovering patterns in spatial or geographic data.
● Temporal Mining: Discovering patterns in time-related or time-series data.
6. Spark
● Driver Program: The main program that launches the Spark application and manages
the execution of tasks.
● Cluster Manager: Allocates resources (e.g., memory, CPU) to the Spark application.
● Worker Nodes: Execute the tasks assigned by the driver program.
● Executor: A process running on each worker node that executes the tasks.
● Definition: Resilient Distributed Datasets (RDDs) are the fundamental data abstraction
in Spark.
● Key Features:
○ Immutable
○ Distributed
○ Fault-tolerant
○ Support parallel processing
● Transformation: Creates a new RDD from an existing RDD (e.g., map, filter,
reduceByKey).
● Action: Performs a computation on an RDD and returns a value (e.g., count, collect,
saveAsTextFile).
● Spark SQL: A component for working with structured data using SQL.
● DataFrames: A distributed collection of data organized into named columns.
● Benefits:
○ Easy to use
○ Optimized for performance
○ Support for various data sources
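A minimal PySpark sketch of DataFrames and Spark SQL (assumes a local Spark installation;
the data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# A tiny, made-up DataFrame with named columns
df = spark.createDataFrame(
    [("Alice", "North", 250), ("Bob", "South", 300), ("Carol", "North", 150)],
    ["name", "region", "sales"],
)

df.createOrReplaceTempView("sales")   # expose the DataFrame to Spark SQL
spark.sql("SELECT region, SUM(sales) AS total FROM sales GROUP BY region").show()

spark.stop()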
● Kafka: A distributed streaming platform for building real-time data pipelines and
streaming applications.
● Integration with Spark Streaming: Spark Streaming can consume data from Kafka
topics in real-time.
● Use Cases:
○ Real-time analytics
○ Fraud detection
○ Personalization
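A minimal sketch of consuming a Kafka topic from Spark. It uses the newer Structured
Streaming Kafka source rather than the DStream API, and assumes the spark-sql-kafka
connector package is on the classpath, a broker at localhost:9092, and a hypothetical topic
named "events":

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

# Read a stream of records from the hypothetical Kafka topic "events"
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers key/value as bytes; cast the value to a string for processing
messages = stream.selectExpr("CAST(value AS STRING) AS message")

# Write the streaming results to the console (for demonstration only)
query = messages.writeStream.outputMode("append").format("console").start()
query.awaitTermination()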
Exam Paper
● Extract: Retrieving data from various source systems (databases, files, APIs).
● Transform: Cleaning, validating, standardizing, and applying business rules to the
extracted data.
● Load: Writing the transformed data into a target system, typically a data warehouse or
data mart.
Q2) Attempt any FOUR of the following (Out of FIVE) [4x4=16]
ROLAP vs. MOLAP (comparison):
● Performance:
○ ROLAP: Generally slower for complex queries, as calculations are often done on-the-fly
using SQL.
○ MOLAP: Typically faster for slicing, dicing, and aggregation due to pre-calculated
summaries in the cube.
● Scalability:
○ ROLAP: More scalable in terms of data volume, leveraging the scalability of the
underlying RDBMS.
○ MOLAP: Scalability can be limited by the cube size ("cube explosion"); larger cubes
require more memory/disk.
● Flexibility:
○ ROLAP: More flexible; can handle detailed transactional data easily and does not require
pre-computation for all dimensions.
○ MOLAP: Less flexible; analysis is limited to the dimensions and aggregations defined in
the cube.
● Disk Space:
○ ROLAP: Can be more efficient if data is sparse; stores detailed data.
○ MOLAP: Can require significant disk space for storing pre-aggregated data, especially
for dense cubes.
Working Principle of the FP-Growth Algorithm:
1. First Pass - Frequency Count: Scan the transaction database once to determine the
support count for each individual item. Discard items that do not meet the minimum
support threshold (min_sup). Sort the frequent items in descending order of their support
count.
2. Second Pass - FP-Tree Construction: Scan the database again. For each transaction,
select only the frequent items (identified in the first pass) and sort them according to the
descending frequency order. Insert these sorted frequent items into the FP-Tree
structure.
○ FP-Tree Structure: The FP-Tree is a compact, prefix-tree-like structure. Each
node represents an item, stores its count, and has links to its children nodes.
Transactions sharing common prefixes share the same path in the tree. A header
table is maintained, listing each frequent item and pointing to its first occurrence
in the tree (nodes for the same item are linked using node-links).
3. Mining Frequent Itemsets: Mine the FP-Tree recursively to find frequent itemsets. This
is done by starting from the least frequent items in the header table and generating their
"conditional pattern bases" (sub-databases consisting of prefixes of paths ending in that
item) and recursively building and mining "conditional FP-Trees" for these bases.
Advantages:
● Efficiency: Usually much faster than Apriori, especially for dense datasets or low support
thresholds.
● No Candidate Generation: Avoids the computationally expensive step of generating
and testing candidate itemsets.
● Compact Structure: The FP-Tree often compresses the database information
effectively.
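A minimal sketch of FP-Growth using Spark MLlib's FPGrowth (assumes PySpark; the
transactions are made up):

from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fpgrowth-demo").getOrCreate()

# Made-up market-basket transactions, one row per transaction
df = spark.createDataFrame(
    [(0, ["bread", "milk"]),
     (1, ["bread", "butter", "milk"]),
     (2, ["milk", "eggs"]),
     (3, ["bread", "butter"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)

model.freqItemsets.show()        # frequent itemsets with their counts
model.associationRules.show()    # association rules derived from the frequent itemsets

spark.stop()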
(c) Explain the working of Spark with the help of its Architecture?
Ans: Apache Spark processes large datasets in a distributed manner using a master-slave
architecture.
Core Components:
1. Driver Program: The process running the main() function of the application and creating
the SparkContext. It coordinates the execution of the job.
2. SparkContext: The main entry point for Spark functionality. It connects to the Cluster
Manager and coordinates the execution of tasks on the cluster.
3. Cluster Manager: An external service responsible for acquiring resources (CPU,
memory) on the cluster for Spark applications. Examples include Spark Standalone,
Apache YARN, Apache Mesos, or Kubernetes.
4. Worker Nodes: Nodes in the cluster that host Executors.
5. Executor: A process launched on a worker node that runs tasks and keeps data in
memory or disk storage. Each application has its own executors. Executors
communicate directly with the Driver Program.
6. Task: A unit of work sent by the Driver Program to be executed on an Executor.
7. RDDs/DataFrames/Datasets: Spark's core data abstractions representing distributed
collections of data that can be processed in parallel. They are immutable and resilient
(can be recomputed if lost).
Working Flow:
1. Application Submission: The user submits a Spark application (code) to the Driver
Program.
2. SparkContext Initialization: The Driver Program creates a SparkContext (or
SparkSession).
3. Resource Acquisition: The SparkContext connects to the Cluster Manager, requesting
resources (Executors) on Worker Nodes.
4. Executor Launch: The Cluster Manager allocates resources and launches Executors
on the Worker Nodes.
5. Task Scheduling: The Driver Program analyzes the application code, breaking it down
into stages and tasks based on transformations and actions on RDDs/DataFrames. It
sends these tasks to the Executors.
6. Task Execution: Executors run the assigned tasks on their portion of the data. They can
cache data in memory for faster access and report results or status back to the Driver
Program.
7. Result Collection: Actions trigger computation. Once all tasks are completed, the
results are either returned to the Driver Program (e.g., collect()) or written to an external
storage system (e.g., saveAsTextFile()).
8. Termination: Once the application completes, the SparkContext is stopped, and the
Cluster Manager releases the resources used by the Executors.
(A simple diagram showing Driver -> Cluster Manager -> Worker Nodes (with Executors) would
enhance this explanation visually.)
Drawbacks of the Hill Climbing algorithm:
1. Local Maxima/Minima: The algorithm can get stuck on a peak (local maximum) that is
not the overall best solution (global maximum). Since it only looks at immediate
neighboring states and accepts only improvements, it has no way to backtrack or explore
other parts of the search space once it reaches a local optimum where all neighbors are
worse or equal.
2. Plateaus: The search can encounter a flat region where several neighboring states have
the same objective function value. The algorithm might wander aimlessly on the plateau
or terminate prematurely if it cannot find a state with a better value.
3. Ridges: Ridges are areas in the search space where the optimal path is very narrow. Hill
climbing might oscillate back and forth along the sides of the ridge, making slow progress
or getting stuck because the operators available might not allow movement directly along
the top of the ridge.
4. Incompleteness: It does not guarantee finding the global optimum solution. It only finds a
local optimum relative to its starting point.
5. Starting Point Dependency: The solution found heavily depends on the initial starting
state. Different starting points can lead to different local optima.
Key Concepts of the Multidimensional Data Model:
1. Data Cube: The central metaphor for the model. It's a logical structure representing data
across multiple dimensions. While visualized as a 3D cube, it can have many more
dimensions (hypercube).
2. Dimensions: These represent the perspectives or categories along which data is
analyzed. Examples include Time, Product, Location, Customer. Dimensions often have
hierarchies (e.g., Location: City -> State -> Country; Time: Day -> Month -> Quarter ->
Year).
3. Measures: These are the quantitative values or metrics being analyzed. They are
typically numeric and additive (though semi-additive and non-additive measures exist).
Examples include Sales Amount, Profit, Quantity Sold, Customer Count.
4. Facts: These represent the business events or transactions being measured. A fact
typically contains the measures and foreign keys linking to the dimension tables.
Common Schemas:
● Star Schema: The simplest structure. It consists of a central fact table containing
measures and keys, surrounded by dimension tables (one for each dimension),
resembling a star. Dimension tables are usually denormalized.
● Snowflake Schema: An extension of the star schema where dimension tables are
normalized into multiple related tables. This reduces redundancy but can increase query
complexity.
This model facilitates OLAP operations like slicing (selecting a subset based on one dimension
value), dicing (selecting a subcube based on multiple dimension values), drill-down (moving
down a hierarchy), roll-up (moving up a hierarchy), and pivoting (rotating the cube axes).
Effective preprocessing significantly improves the quality, accuracy, and efficiency of subsequent
data mining tasks.
(c) Explain the various search and control strategies in artificial intelligence.
Ans: Search strategies are fundamental to problem-solving in AI. They define systematic ways to
explore a state space (the set of all possible states reachable from an initial state) to find a goal
state. Control strategies determine the order in which nodes (states) in the search space are
expanded.
1. Uninformed Search (Blind Search): These strategies do not use any domain-specific
knowledge about the problem beyond the problem definition itself (states, operators, goal
test). They explore the search space systematically.
○ Breadth-First Search (BFS): Explores the search tree level by level. It expands
all nodes at depth 'd' before moving to depth 'd+1'. It is complete and optimal
(finds the shallowest goal) if edge costs are uniform. Uses a FIFO queue.
○ Depth-First Search (DFS): Explores the deepest branch first. It expands nodes
along one path until a leaf or goal is reached, then backtracks. It is not guaranteed
to be complete or optimal. Uses a LIFO stack. More memory efficient than BFS
for deep trees.
○ Uniform Cost Search (UCS): Expands the node with the lowest path cost (g(n))
from the start node. It is complete and optimal if edge costs are non-negative.
Uses a priority queue. Similar to Dijkstra's algorithm.
2. Informed Search (Heuristic Search): These strategies use domain-specific knowledge
in the form of a heuristic function h(n) which estimates the cost from the current node n
to the nearest goal state. This guides the search towards more promising states.
○ Greedy Best-First Search: Expands the node that appears closest to the goal
according to the heuristic function h(n) alone. It is often fast but is not complete or
optimal.
○ A* Search: Expands the node with the lowest evaluation function value f(n) = g(n)
+ h(n), where g(n) is the actual cost from the start to node n, and h(n) is the
estimated cost from n to the goal. A* is complete and optimal if the heuristic h(n)
is admissible (never overestimates the true cost) and, for graph search,
consistent. Uses a priority queue.
Control Strategy: The control strategy essentially implements the chosen search algorithm. It
manages the frontier (the set of nodes waiting to be expanded) and decides which node to
expand next based on the specific search algorithm's criteria (e.g., FIFO for BFS, LIFO for DFS,
priority queue based on cost/heuristic for UCS, Greedy, A*).
OLTP vs. OLAP (further comparison points):
● Workload:
○ OLTP: Many concurrent users and a high volume of simple transactions.
○ OLAP: Fewer users and a lower volume of complex, long-running queries.
● Data Updates:
○ OLTP: Frequent, real-time updates.
○ OLAP: Periodic batch updates (e.g., nightly ETL); data is relatively static.
1. Transformations:
○ Definition: Transformations create a new RDD from an existing one. They define
how to compute a new dataset based on the source dataset.
○ Laziness: Transformations are lazy, meaning Spark does not execute them
immediately. Instead, it builds up a lineage graph (a DAG - Directed Acyclic
Graph) of transformations. The actual computation happens only when an Action
is called.
○ Immutability: RDDs are immutable; transformations always produce a new RDD
without modifying the original one.
○ Examples:
■ map(func): Returns a new RDD by applying a function func to each
element of the source RDD.
■ filter(func): Returns a new RDD containing only the elements that satisfy
the function func.
■ flatMap(func): Similar to map, but each input item can be mapped to 0 or
more output items (the function should return a sequence).
■ union(otherRDD): Returns a new RDD containing all elements from the
source RDD and the argument RDD.
■ groupByKey(): Groups values for each key in an RDD of key-value pairs
into a single sequence.
■ reduceByKey(func): Aggregates values for each key using a specified
associative and commutative reduce function.
■ join(otherRDD): Performs an inner join between two RDDs based on their
keys.
2. Actions:
○ Definition: Actions trigger the execution of the transformations defined in the
DAG and return a result to the driver program or write data to an external storage
system.
○ Execution Trigger: Actions are the operations that cause Spark to perform the
computations planned by the transformations.
○ Examples:
■ collect(): Returns all elements of the RDD as an array to the driver
program. (Use with caution on large RDDs).
■ count(): Returns the number of elements in the RDD.
■ take(n): Returns the first n elements of the RDD as an array.
■ first(): Returns the first element of the RDD (equivalent to take(1)).
■ reduce(func): Aggregates the elements of the RDD using a specified
associative and commutative function and returns the final result to the
driver.
■ foreach(func): Executes a function func on each element of the RDD
(often used for side effects like writing to external systems).
■ saveAsTextFile(path): Writes the elements of the RDD as text files to a
specified directory.
Understanding the difference between lazy transformations and eager actions is crucial for
writing efficient Spark applications.
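A minimal PySpark sketch of the lazy/eager split (assumes a local Spark installation):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations: lazily build the lineage (DAG); nothing executes yet
evens   = rdd.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)

# Actions: trigger execution of the whole lineage and return results to the driver
print(squared.collect())   # [4, 16, 36]
print(squared.count())     # 3

spark.stop()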
Executor memory in Spark:
1. Purpose: Executor memory is used by the Executor JVM for various purposes, including:
○ Task Execution: Memory needed to run the actual task code and hold data being
processed by tasks.
○ Data Storage: Storing partitions of RDDs, DataFrames, or Datasets that are
cached or persisted in memory (Storage Memory).
○ Shuffle Operations: Buffering data during shuffle operations (when data needs
to be redistributed across executors). (Shuffle Memory).
2. Configuration: The amount is configured via the spark.executor.memory setting when
submitting a Spark application.
3. Unified Memory Management (Spark 1.6+): Modern Spark versions use a unified
memory management system. A large portion of the executor heap space is managed
jointly for both execution and storage. Spark can dynamically borrow memory between
storage and execution regions based on demand, making memory usage more flexible
and robust.
4. Impact on Performance: Sufficient executor memory is crucial for performance. Too
little memory can lead to excessive garbage collection, spilling data to disk frequently
(which slows down processing significantly), or even OutOfMemoryErrors. Caching data
in memory relies heavily on having adequate executor memory.
5. Overhead: An additional amount of memory (spark.executor.memoryOverhead or
spark.executor.memoryOverheadFactor) is usually allocated off-heap for JVM overheads,
string interning, and other native overheads.
Properly configuring executor memory is vital for optimizing Spark job performance and stability.
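A minimal PySpark sketch of setting these properties when building a session (the values are
placeholders, not tuning recommendations):

from pyspark.sql import SparkSession

# The memory settings below are the ones named in the notes; the values are examples only.
spark = (SparkSession.builder
         .appName("memory-config-demo")
         .config("spark.executor.memory", "4g")             # heap per executor
         .config("spark.executor.memoryOverhead", "512m")   # off-heap/JVM overhead
         .getOrCreate())

print(spark.sparkContext.getConf().get("spark.executor.memory"))
spark.stop()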
(c) What are the two advantages of Depth First Search (DFS)?
Ans: Depth First Search (DFS) is an uninformed search algorithm that explores as far as
possible along each branch before backtracking. Its main advantages compared to algorithms
like Breadth-First Search (BFS) are:
1. Memory Efficiency: DFS requires significantly less memory than BFS, especially for
search trees with a large branching factor (b) and depth (d). DFS only needs to store the
current path being explored from the root to the current node, plus the unexplored sibling
nodes at each level along that path. In the worst case, its space complexity is O(b*d),
representing the stack depth. In contrast, BFS needs to store all nodes at the current
depth level, which can grow exponentially (O(b^d)), potentially leading to memory
exhaustion for large search spaces.
2. Potential for Quick Solution Finding (in some cases): If the goal state happens to lie
deep within the search tree along one of the initial paths explored by DFS, the algorithm
might find a solution much faster than BFS. BFS explores level by level and would only
find a deep solution after exploring all shallower nodes. However, it's important to note
that DFS does not guarantee finding the optimal (e.g., shortest) solution first, and it can
get stuck exploring very deep or infinite paths if not implemented carefully (e.g., with
depth limits or visited checks).
Important techniques of Artificial Intelligence:
1. Machine Learning (ML): This is arguably the most prominent AI technique today. ML
algorithms enable systems to learn patterns and make predictions or decisions from data
without being explicitly programmed for the task.
○ Types: Includes Supervised Learning (learning from labeled data, e.g.,
classification, regression), Unsupervised Learning (finding patterns in unlabeled
data, e.g., clustering, dimensionality reduction), and Reinforcement Learning
(learning through trial and error by receiving rewards or penalties).
○ Applications: Recommendation systems, image recognition, spam filtering,
medical diagnosis, financial forecasting.
2. Natural Language Processing (NLP): NLP focuses on enabling computers to
understand, interpret, generate, and interact with human language (text and speech) in a
meaningful way.
○ Tasks: Includes machine translation, sentiment analysis, text summarization,
question answering, chatbot development, speech recognition, and text
generation.
○ Techniques: Combines computational linguistics with statistical models,
machine learning (especially deep learning models like Transformers).
○ Applications: Virtual assistants (Siri, Alexa), automated customer service,
language translation services (Google Translate), social media monitoring.
3. Search Algorithms and Problem Solving: This is a classical AI technique focused on
finding solutions to problems by systematically exploring a space of possible states.
○ Scope: Covers finding paths (e.g., route planning), solving puzzles (e.g., Rubik's
cube, Sudoku), game playing (e.g., chess, Go), and constraint satisfaction
problems.
○ Strategies: Includes Uninformed Search (BFS, DFS) and Informed Search (A*,
Greedy Best-First) using heuristics to guide the exploration efficiently.
○ Applications: Robotics (path planning), logistics optimization, game AI,
automated theorem proving.
(Other important techniques could include Computer Vision, Expert Systems, Planning, etc.)
(e) What are the major steps involved in the ETL process?
Ans: ETL (Extract, Transform, Load) is a core process used to collect data from various
sources, clean and modify it, and store it in a target database, typically a data warehouse, for
analysis and reporting.
1. Extract:
○ Goal: Retrieve data from one or more source systems.
○ Sources: Can include relational databases (SQL Server, Oracle), NoSQL
databases, flat files (CSV, XML, JSON), APIs, web services, legacy systems,
spreadsheets, etc.
○ Activities: Connecting to sources, querying or reading data, potentially
performing initial validation (e.g., checking data types, record counts). Data can
be extracted entirely (full extraction) or incrementally (only changes since the last
extraction). The extracted data is often moved to a staging area.
2. Transform:
○ Goal: Apply rules and functions to the extracted data to convert it into the desired
format and structure for the target system and analysis. This is often the most
complex step.
○ Activities:
■ Cleaning: Correcting typos, handling missing values, standardizing
formats (e.g., dates, addresses).
■ Filtering: Selecting only certain rows or columns.
■ Enrichment: Combining data from multiple sources, deriving new
attributes (e.g., calculating age from birthdate).
■ Aggregation: Summarizing data (e.g., calculating total sales per region).
■ Splitting/Merging: Dividing columns or combining multiple columns.
■ Joining: Linking data from different sources based on common keys.
■ Validation: Applying business rules to ensure data quality and integrity.
■ Format Conversion: Changing data types or encoding.
3. Load:
○ Goal: Write the transformed data into the target system.
○ Target: Usually a data warehouse, data mart, or operational data store.
○ Activities: Inserting the processed data into the target tables.
○ Methods:
■ Full Load: Wiping existing data in the target table and loading all the
transformed data (used for initial loads or small tables).
■ Incremental Load (Delta Load): Loading only the new or modified records
since the last load, often based on timestamps or change flags. This is
more efficient for large datasets. Load processes often involve managing
indexes, constraints, and logging for auditing and recovery.
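A minimal Python sketch of the three steps, using pandas and SQLite as stand-ins for a real
source system and warehouse (file, column, and table names are hypothetical):

import pandas as pd
import sqlite3

# --- Extract: read raw data from a hypothetical CSV source ---
raw = pd.read_csv("sales_source.csv")

# --- Transform: clean, standardize, and aggregate ---
raw = raw.dropna(subset=["amount"])                       # cleaning: drop incomplete rows
raw["order_date"] = pd.to_datetime(raw["order_date"])     # standardize date formats
summary = raw.groupby("region", as_index=False)["amount"].sum()   # aggregation

# --- Load: write the transformed data into a target table ---
with sqlite3.connect("warehouse.db") as conn:             # stand-in for a data warehouse
    summary.to_sql("sales_by_region", conn, if_exists="replace", index=False)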
Q5) Write a short note on any TWO of the following (Out of THREE) [2x3=6]
(b) Explain the 'Water Jug Problem' in artificial intelligence with the help of diagrams and
propose a solution to the problem.
Ans: The Water Jug Problem is a classic AI puzzle used to illustrate state-space search. A
typical version is: "You have two unmarked jugs, one holds 5 gallons (J5) and the other holds 3
gallons (J3). You have an unlimited supply of water. How can you measure out exactly 4
gallons?"
Problem Formalization:
● States: Represented by (x, y), where x is the water in J5 (0≤x≤5) and y is the water in J3
(0≤y≤3). The initial state is (0, 0).
● Goal State: Any state where x=4, i.e., (4, y).
● Operators (Actions):
1. Fill J5 completely: (x, y) -> (5, y) if x<5
2. Fill J3 completely: (x, y) -> (x, 3) if y<3
3. Empty J5: (x, y) -> (0, y) if x>0
4. Empty J3: (x, y) -> (x, 0) if y>0
5. Pour J5 into J3 until J3 is full: (x, y) -> (x - (3-y), 3) if x+y≥3, x>0
6. Pour J3 into J5 until J5 is full: (x, y) -> (5, y - (5-x)) if x+y≥5, y>0
7. Pour all from J5 into J3: (x, y) -> (0, x+y) if x+y≤3, x>0
8. Pour all from J3 into J5: (x, y) -> (x+y, 0) if x+y≤5, y>0
Solution Path (one possibility using diagrams as state representations):
A search algorithm like BFS can find the shortest sequence. One solution is:
1. (0, 0) - Start
2. (5, 0) - Fill J5 (Operator 1)
3. (2, 3) - Pour J5 into J3 until J3 is full (Operator 5)
4. (2, 0) - Empty J3 (Operator 4)
5. (0, 2) - Pour all from J5 into J3 (Operator 7)
6. (5, 2) - Fill J5 (Operator 1)
7. (4, 3) - Pour J5 into J3 until J3 is full (Operator 5) -> Goal Reached! (4 gallons in J5)
This sequence shows one way to achieve the goal state by applying the defined operators.
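A minimal Python sketch that finds this shortest sequence automatically using breadth-first
search over the (x, y) state space:

from collections import deque

def water_jug_bfs(cap_a=5, cap_b=3, target=4):
    # BFS over (x, y) states; returns the shortest sequence of states reaching x == target.
    start = (0, 0)
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        x, y = path[-1]
        if x == target:
            return path
        successors = {
            (cap_a, y), (x, cap_b),                          # fill a jug
            (0, y), (x, 0),                                  # empty a jug
            (x - min(x, cap_b - y), y + min(x, cap_b - y)),  # pour J5 into J3
            (x + min(y, cap_a - x), y - min(y, cap_a - x)),  # pour J3 into J5
        }
        for state in successors:
            if state not in visited:
                visited.add(state)
                frontier.append(path + [state])
    return None

print(water_jug_bfs())   # e.g. [(0, 0), (5, 0), (2, 3), (2, 0), (0, 2), (5, 2), (4, 3)]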
Data warehouses use multidimensional data models (like star or snowflake schemas) and are
queried using OLAP tools. They provide a "single source of truth" for analytical purposes across
an enterprise.