Debre Markos University
Institute of Technology
Department of Software Engineering
Fundamentals of Business Intelligence and Big Data Analysis
Group Assignment
Name        ID        Signature
Fundamentals of Big Data Analytics and Business Intelligence Group Assignment
Table of Contents
1. Define distributed data mining, why distributed data mining, why parallel data mining, and benefits of parallel and distributed data mining
   Distributed Data Mining (DDM)
   Why Distributed Data Mining?
   Why Parallel Data Mining?
   Benefits of Parallel and Distributed Data Mining
2. Discuss popular distributed platforms (Hadoop and Spark)
   Hadoop
   Key Components of Hadoop
   Strengths of Hadoop
   Limitations of Hadoop
   Apache Spark
   Key Features of Apache Spark
   Strengths of Spark
   Limitations of Spark
   Conclusion
3. Why distributed computing for data analytics?
   Key Reasons for Distributed Computing in Data Analytics
   Benefits of Distributed Computing in Data Analytics
   Use Cases in Data Analytics
   Conclusion
4. Discuss the Hadoop Distributed File System (HDFS)
   Key Features of HDFS
   Key Components of HDFS
   HDFS Operations
   Advantages of HDFS
   Limitations of HDFS
   Conclusion
1. Define distributed data mining, why distributed data mining, why parallel data mining, and benefits of parallel and distributed data mining
Distributed Data Mining (DDM):
Distributed Data Mining is the process of extracting useful patterns and knowledge from
large datasets spread across multiple locations or systems. It deals with the challenges of
mining data that is partitioned across different sites or stored in a distributed environment,
like a network of computers or databases.
Why Distributed Data Mining?
Scalability: Large datasets may not be manageable on a single machine due to resource
constraints like memory and processing power. Distributing the workload across multiple
machines helps in scaling up.
Data Localization: In many cases, data is generated or stored across geographically
distant locations. Instead of moving the data to a central location for mining, it's often
more efficient to mine data locally and share the results.
Efficiency: In distributed systems, processing the data where it resides (local
computation) can reduce the time and cost associated with transferring large datasets.
Heterogeneous Data: Organizations often have data in different formats or structures
spread across different systems. Distributed data mining enables these diverse datasets to
be mined collectively.
Why Parallel Data Mining?
Parallel Data Mining involves performing data mining tasks simultaneously across multiple
processors or machines to improve efficiency and speed. This approach becomes crucial when
dealing with large datasets that need significant computational resources.
Speed and Performance: By parallelizing the tasks, mining can be done much faster
since multiple processors work on the problem simultaneously.
Handling Large Datasets: When datasets are too large for a single machine, parallel
processing distributes the workload, making it manageable.
Enhanced Computational Power: Using multiple processors or systems allows for
complex models or algorithms that would be too slow or impossible to run on a single
machine.
Benefits of Parallel and Distributed Data Mining
1. Improved Speed and Efficiency: Both approaches reduce the time required to analyze
large datasets by distributing tasks across multiple machines or processors.
2. Resource Utilization: These methods allow for better use of computational resources,
balancing the load across different systems, and ensuring that no single system is
overwhelmed.
3. Scalability: As datasets grow, parallel and distributed mining systems can scale up to
handle increased volumes without a significant drop in performance.
4. Cost-Effective: Instead of investing in a single powerful machine, distributed systems
allow organizations to use multiple less-expensive machines.
5. Fault Tolerance: In a distributed environment, if one machine or node fails, the others
can continue processing, making the system more resilient to failures.
6. Data Privacy and Security: In scenarios where data is sensitive and cannot be moved,
distributed data mining enables local computation, minimizing the need to share sensitive
data between systems.
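To make the "mine locally, merge globally" idea behind parallel and distributed data mining concrete, the following is a minimal Python sketch (not part of the assignment brief). Each worker process counts item frequencies in its own data partition in parallel, and the partial counts are then merged into a global result; the partitions and item names are invented for illustration.

    from collections import Counter
    from multiprocessing import Pool

    def mine_partition(transactions):
        """Local mining step: count item occurrences within one data partition."""
        counts = Counter()
        for transaction in transactions:
            counts.update(transaction)
        return counts

    if __name__ == "__main__":
        # Hypothetical partitions; in a real deployment each partition would
        # reside on a different machine or site.
        partitions = [
            [["bread", "milk"], ["bread", "butter"]],
            [["milk", "butter"], ["bread", "milk", "butter"]],
        ]
        # Run the local mining step in parallel, one worker per partition.
        with Pool(processes=len(partitions)) as pool:
            local_counts = pool.map(mine_partition, partitions)
        # Merge step: combine the local results into a global frequency table.
        global_counts = sum(local_counts, Counter())
        print(global_counts.most_common(3))

The same two-step pattern, local computation followed by a cheap merge of summaries, is what lets distributed mining avoid shipping raw data between sites.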
2. Discuss popular distributed platforms (Hadoop and Spark)
Popular Distributed Platforms: Hadoop and Spark
Both Hadoop and Apache Spark are widely used distributed platforms designed to handle large-
scale data processing across distributed environments. They are highly valued for their
scalability, fault tolerance, and ability to manage massive datasets, but they differ in terms of
architecture, processing models, and performance.
Hadoop
Overview:
Hadoop is an open-source framework developed by the Apache Software Foundation for
distributed storage and processing of large datasets. It is designed to handle huge volumes of data
by distributing the data and computational workload across clusters of commodity hardware.
Key Components of Hadoop:
Hadoop Distributed File System (HDFS): A distributed file system that stores large
files across multiple machines. It breaks data into blocks and distributes them across the
nodes in a cluster, ensuring fault tolerance by replicating blocks.
MapReduce: The core data processing engine in Hadoop, where tasks are divided into
two phases:
o Map: Breaks down the data into key-value pairs and processes them in parallel.
o Reduce: Aggregates the output of the Map function to generate the final result.
YARN (Yet Another Resource Negotiator): Manages and schedules resources across
the Hadoop cluster. It helps in task coordination and job execution across nodes.
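As a concrete illustration of the two MapReduce phases, here is a small word-count sketch written in plain Python that simulates the Map and Reduce steps locally (a real job would be written against Hadoop's MapReduce API or submitted through Hadoop Streaming; the input lines are invented).

    from collections import defaultdict

    def map_phase(lines):
        """Map: emit a (word, 1) key-value pair for every word in the input."""
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reduce_phase(pairs):
        """Reduce: aggregate the values for each key to produce the final counts."""
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return dict(totals)

    lines = ["Hadoop stores data in HDFS", "Spark and Hadoop process data"]
    print(reduce_phase(map_phase(lines)))
    # {'hadoop': 2, 'stores': 1, 'data': 2, 'in': 1, 'hdfs': 1, 'spark': 1, 'and': 1, 'process': 1}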
Strengths of Hadoop:
Scalability: Can scale horizontally by adding more nodes to the cluster to handle larger
datasets.
Fault Tolerance: Automatically replicates data blocks across multiple nodes. If one node
fails, the system retrieves the data from another node.
Batch Processing: Best suited for long-running, large-scale batch processing jobs, where
tasks can be divided into chunks and processed independently.
Cost-Effectiveness: Runs on commodity hardware, reducing costs compared to
traditional data storage solutions.
Limitations of Hadoop:
Slow Processing: Because MapReduce writes intermediate results to disk between
processing stages, Hadoop is relatively slow, especially for iterative tasks or real-time
processing.
Complexity: Hadoop requires complex setup and configuration, and writing efficient
MapReduce jobs can be challenging for developers.
Apache Spark
Overview:
Apache Spark is an open-source, lightning-fast unified analytics engine for large-scale data
processing. It is also developed by the Apache Software Foundation and provides distributed
data processing like Hadoop, but with a focus on in-memory computation for faster processing.
Key Features of Apache Spark:
In-Memory Processing: Unlike Hadoop, Spark processes data in-memory (RAM)
instead of writing intermediate results to disk. This makes it much faster, especially for
iterative algorithms (e.g., machine learning).
RDD (Resilient Distributed Dataset): The core data structure in Spark that allows fault-
tolerant, parallel processing of distributed data. It keeps track of how to rebuild data in
case of failure, ensuring resilience.
Supports Multiple APIs: Spark has APIs for Java, Scala, Python, and R, making it
accessible to a broader range of developers.
Supports Multiple Workloads:
Batch Processing: Like Hadoop, Spark can handle batch processing jobs efficiently.
Real-Time Processing: Spark supports real-time data processing through Spark
Streaming.
Machine Learning: Spark integrates with MLlib for large-scale machine learning
algorithms.
Graph Processing: Spark also has GraphX for graph processing and analysis.
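To show how these workloads look in code, below is a minimal PySpark word-count sketch. It assumes a running Spark installation; the HDFS input path is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    lines = spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical path
    counts = (lines.flatMap(lambda line: line.split())   # split each line into words
                   .map(lambda word: (word, 1))          # emit (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))     # aggregate counts per word

    print(counts.take(10))
    spark.stop()

Because Spark keeps working data in memory (and RDDs can be cached explicitly), iterative workloads avoid the repeated disk writes that slow down MapReduce.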
Strengths of Spark:
Speed: Spark can be up to 100 times faster than Hadoop for some tasks due to its in-
memory processing model.
Flexibility: Spark supports both batch and real-time processing, making it suitable for a
wide variety of use cases, from ETL (Extract, Transform, Load) to real-time analytics.
Ease of Use: Spark provides higher-level APIs and abstractions, making it easier for
developers to write efficient data processing pipelines compared to Hadoop's
MapReduce.
Advanced Analytics: Spark includes built-in libraries for machine learning (MLlib),
graph processing (GraphX), and SQL queries (Spark SQL), making it a more complete
ecosystem for data analytics.
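The "ease of use" and "advanced analytics" points are easiest to see through Spark's DataFrame and SQL APIs. The sketch below is illustrative only: the file path and the column names (region, amount) are assumptions, not part of the assignment.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("SalesSummary").getOrCreate()

    # Load a structured dataset (hypothetical path and schema).
    sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

    # Total revenue per region using the higher-level DataFrame API.
    summary = (sales.groupBy("region")
                    .agg(F.sum("amount").alias("total_revenue"))
                    .orderBy(F.desc("total_revenue")))
    summary.show()

    # The same query expressed in Spark SQL.
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS total_revenue "
              "FROM sales GROUP BY region ORDER BY total_revenue DESC").show()

    spark.stop()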
Limitations of Spark:
Memory Requirements: Spark’s in-memory processing can require significant amounts
of memory, which can be expensive.
Complex Setup for Real-Time Processing: While Spark supports real-time processing,
setting up a fault-tolerant streaming architecture can be complex.
Comparison: Hadoop vs Spark

Feature             | Hadoop                                          | Apache Spark
Processing Model    | Batch processing                                | Batch and real-time processing
Speed               | Slower (disk-based MapReduce)                   | Faster (in-memory processing)
Ease of Use         | Complex MapReduce programming                   | Higher-level APIs (easier for developers)
Data Storage        | HDFS                                            | Can use HDFS, but also works with other data sources (e.g., S3)
Fault Tolerance     | Data replication in HDFS                        | RDD lineage-based fault tolerance
Workloads Supported | Mainly batch processing                         | Batch, real-time, machine learning, graph processing
Memory Usage        | Low memory requirements (disk-based)            | Higher memory usage due to in-memory computation
Cost Efficiency     | Cost-effective for very large datasets on disk  | Can be more costly due to higher memory demands
Conclusion
Hadoop: Ideal for long-running, large-scale batch processing jobs where speed is not
critical. It is reliable, fault-tolerant, and can handle extremely large datasets at a lower
cost.
Spark: Best suited for situations where speed is important, such as iterative algorithms,
real-time analytics, and machine learning. It’s more flexible than Hadoop but may require
more memory resources.
Both platforms can coexist in a big data ecosystem. In fact, many organizations use Hadoop for
data storage (HDFS) and Spark for faster data processing, leveraging the strengths of both
platforms.
3. Why distributed computing for data analytics?
Distributed computing has become essential for modern data analytics due to the exponential
growth of data (Big Data) and the increasing complexity of analytical models. In a distributed
computing environment, tasks are broken down into smaller sub-tasks and processed
concurrently across multiple machines or nodes. This paradigm offers significant advantages for
data analytics, especially when dealing with large-scale datasets and resource-intensive
computations.
Key Reasons for Distributed Computing in Data Analytics:
Handling Large Datasets (Big Data):
The sheer volume of data generated today exceeds the capacity of single machines to
store and process efficiently. Distributed computing allows datasets to be partitioned and
stored across multiple machines or nodes, enabling large-scale data analytics that can
handle terabytes or petabytes of data.
Scalability:
Distributed computing allows organizations to scale their data processing capabilities
horizontally by adding more machines to a cluster. As data grows, the system can scale
without significant performance degradation, making it easier to handle increasing
workloads without requiring expensive, high-end hardware.
Improved Performance and Efficiency:
By distributing tasks across multiple nodes, distributed computing accelerates data
processing. Each node can work on a subset of the data, performing parallel computations
that dramatically reduce processing time compared to single-machine architectures. This
is especially useful for time-sensitive analytics such as real-time data processing and
interactive querying.
Resource Optimization:
Distributed computing leverages the collective computing power of many machines to
perform complex calculations, optimizing the use of available resources. This enables
efficient data processing and analytics while reducing costs associated with overloading a
single machine.
Fault Tolerance:
Distributed systems are designed to handle failures gracefully. If one machine in the
cluster fails, the system can continue operating by redistributing the workload to other
machines. This fault tolerance ensures reliability and resilience, which is crucial for
processing critical data and maintaining analytics workflows.
Distributed Data Sources:
Often, data is stored across multiple locations or data centers due to geographic,
organizational, or security requirements. Distributed computing enables local data
processing without requiring all data to be transferred to a central location, reducing data
transfer times and improving efficiency.
Real-Time and Streaming Data Processing:
In scenarios where real-time data processing is necessary (e.g., financial transactions,
IoT, social media analytics), distributed computing allows for streaming analytics by
distributing the data and processing it in real time across multiple nodes. This supports
applications where low-latency decision-making is crucial (a minimal streaming sketch
follows this list).
Complex Analytical Models:
Modern data analytics involves complex models such as machine learning, deep learning,
and graph processing, which require significant computational power. Distributed
computing enables these models to run efficiently by distributing computations across
nodes, allowing for quicker model training and testing, especially when dealing with
large feature sets or datasets.
Cost Efficiency:
Distributed computing uses clusters of commodity hardware, which is much more cost-
effective than purchasing high-end, monolithic machines. Cloud platforms offer scalable
and flexible distributed computing environments, where organizations can pay for what
they use and scale their resources on demand, reducing infrastructure costs.
Collaboration Across Teams and Locations:
Distributed systems can enable collaboration across different teams and locations by
allowing data to be processed in different geographic locations while maintaining
consistency and efficiency. This supports global analytics operations and can improve
productivity for teams working in distributed environments.
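As noted in the real-time item above, here is a minimal Spark Structured Streaming sketch of distributed stream processing. It is illustrative only: it reads text from a local socket (a source intended for testing) and keeps a running word count; the host and port are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

    # Read an unbounded stream of lines from a socket (test source only).
    lines = (spark.readStream.format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    # Maintain a continuously updated word count over the stream.
    words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Print the updated counts to the console as new data arrives.
    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())
    query.awaitTermination()

In production the socket source would typically be replaced by a distributed source such as Kafka, so that data ingestion itself scales across nodes.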
Benefits of Distributed Computing in Data Analytics:
Faster Data Processing:
Parallel processing across multiple nodes reduces the time needed to perform analytics on
large datasets.
Scalable Infrastructure:
Distributed computing infrastructures can grow to meet increasing demands, making
them adaptable to future data and computational needs.
Cost Savings:
Organizations can use cheaper hardware and cloud services, reducing the need for large
capital investments in high-end hardware.
High Availability and Reliability:
The redundancy built into distributed systems ensures that analytics operations can
continue even in the case of node or system failures.
Real-Time Analytics:
Distributed computing allows for real-time data processing, enabling businesses to
respond quickly to changing conditions or events, such as customer behaviors or market
trends.
Diverse Data Processing:
Distributed computing supports the analysis of structured, semi-structured, and
unstructured data (text, images, videos, etc.) across multiple sources, helping
organizations gain deeper insights from a wide variety of data types.
Use Cases in Data Analytics:
Web Analytics: Distributed computing processes massive amounts of user data
generated by websites and apps to provide insights into user behavior, improving the
effectiveness of digital marketing strategies.
Fraud Detection: Real-time distributed processing helps detect fraudulent activities in
large-scale financial transactions by analyzing patterns across distributed data sources.
Recommendation Systems: Distributed data analytics is used in e-commerce and
streaming platforms to process user interactions and provide real-time recommendations
based on preferences and behavior.
Scientific Research: Large-scale data simulations, such as climate modeling and
genomics research, leverage distributed computing to process massive datasets and run
complex models.
Machine Learning and AI: Distributed computing is essential for training machine
learning models on big datasets, enabling faster experimentation and deployment of AI
systems.
Conclusion:
Distributed computing is essential for data analytics in today’s world due to the explosion of
data, the need for real-time insights, and the complexity of modern analytical models. It offers
scalability, faster processing, and the ability to handle large, distributed datasets efficiently,
making it a crucial component for organizations that aim to derive actionable insights from vast
and varied data sources.
4. Discuss the Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a core component of the Hadoop ecosystem,
specifically designed to handle large-scale data storage in a distributed environment. It is a
scalable, fault-tolerant, and distributed file system that enables the storage of vast amounts of
data across multiple machines while ensuring reliability and accessibility. HDFS is modeled
on the Google File System (GFS) and plays a crucial role in the Hadoop framework by
facilitating high-throughput access to large datasets.
Key Features of HDFS
Distributed Architecture:
HDFS stores data across a cluster of machines, known as nodes, by breaking down files
into smaller blocks (chunks) and distributing these blocks across multiple nodes. This
allows HDFS to store datasets that are far too large to fit on a single machine.
Fault Tolerance:
HDFS automatically replicates data blocks across multiple nodes in the cluster. This
replication ensures data availability and fault tolerance in case of hardware failures. If
one node becomes unavailable, the system can retrieve the replicated data from another
node, ensuring that data loss is prevented.
High Throughput:
HDFS is optimized for delivering high-throughput data access, making it ideal for
applications that process large datasets with a focus on reading and analyzing large files
rather than performing low-latency operations. It is designed for batch processing rather
than real-time access.
Large Block Size:
By default, HDFS uses a large block size (typically 128 MB or 256 MB), which is much
larger than block sizes in traditional file systems. This reduces the overhead of managing
and processing files, improving overall efficiency when dealing with large datasets.
Write Once, Read Many:
HDFS follows a write-once, read-many access pattern. Once data is written to HDFS, it
cannot be modified (except through appends). This simplifies data management, making
it well-suited for workloads that require high read throughput.
Master-Slave Architecture:
HDFS follows a master-slave architecture consisting of a NameNode (master) and
multiple DataNodes (slaves).
o NameNode: Manages the file system metadata and keeps track of where each
block of a file is stored. It handles operations like file creation, deletion, and
replication.
o DataNodes: Store the actual data blocks and are responsible for reading and
writing operations at the request of the NameNode.
Scalability:
HDFS is designed to scale horizontally. As data volumes grow, new nodes can be added
to the cluster without major changes to the system, allowing it to handle increasing
amounts of data.
Data Locality:
HDFS minimizes data movement by processing data locally where it resides (on the same
DataNode), improving performance and reducing network overhead. Hadoop’s
MapReduce framework exploits data locality by sending computation to the nodes where
data is stored, thus reducing data transfer times.
Replication Factor:
Each data block in HDFS is replicated across multiple DataNodes based on a
configurable replication factor (default is 3). This ensures data reliability and availability,
even in the event of node failures.
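A short worked example ties the block-size and replication-factor defaults together. The file size is hypothetical; the defaults (128 MB blocks, replication factor 3) follow the figures above.

    import math

    file_size_mb = 1024          # a hypothetical 1 GB file
    block_size_mb = 128          # default HDFS block size (see above)
    replication_factor = 3       # default replication factor (see above)

    blocks = math.ceil(file_size_mb / block_size_mb)
    block_copies = blocks * replication_factor
    raw_storage_mb = file_size_mb * replication_factor

    print(f"logical blocks:       {blocks}")             # 8
    print(f"block copies stored:  {block_copies}")       # 24, spread across DataNodes
    print(f"raw storage consumed: {raw_storage_mb} MB")  # 3072 MB for a 1024 MB file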
Key Components of HDFS
NameNode:
o The NameNode is the master node responsible for managing the file system
metadata, which includes information about which DataNodes hold the data
blocks of a file.
o It maintains the namespace of the file system and the mapping of file blocks to
DataNodes.
o The NameNode does not store the actual data but the information about where the
data is stored across the DataNodes.
o Since the NameNode is critical to the system, it can become a single point of
failure (SPOF), although newer Hadoop versions mitigate this risk with
high-availability configurations that run a standby NameNode.
DataNodes:
o DataNodes are the worker nodes in the HDFS architecture. They store and
manage the actual data blocks.
o DataNodes periodically send heartbeat messages to the NameNode to signal their
status. If the NameNode does not receive a heartbeat from a DataNode, it assumes
the DataNode has failed and initiates the replication of the data to other available
nodes.
o Each DataNode can store multiple blocks, and these blocks are distributed across
different DataNodes.
Secondary NameNode:
o The secondary NameNode is not a backup NameNode but assists with
checkpointing. It periodically merges the NameNode's edit log with the file
system image (fsimage) to produce an up-to-date metadata checkpoint.
o In the event of a NameNode failure, the secondary NameNode helps recover the
metadata and bring the system back online.
Blocks:
o A file stored in HDFS is split into blocks, and each block is stored across
DataNodes. The default block size is large (e.g., 128 MB), enabling efficient
storage of large files.
o Blocks are replicated based on the replication factor, which helps ensure data
redundancy and fault tolerance.
HDFS Client:
o The HDFS client is responsible for interacting with the NameNode and
DataNodes to read or write data. It requests file locations from the NameNode and
communicates directly with DataNodes for data retrieval or storage.
HDFS Operations
File Write:
o When a file is written to HDFS, it is broken down into blocks.
o The NameNode assigns a block ID and provides the client with a list of
DataNodes where the block will be stored.
o The client sends the block to the first DataNode, which then replicates it to the
second and third DataNodes as per the replication factor.
o Once all blocks are written, the file is considered successfully stored.
File Read:
o To read a file, the HDFS client contacts the NameNode to get the block locations
(the list of DataNodes where the blocks are stored).
o The client retrieves the data directly from the closest DataNode for optimal
performance, prioritizing data locality.
Replication:
o HDFS maintains a configurable replication factor for each block to ensure fault
tolerance.
o If a DataNode goes down, the NameNode automatically initiates block replication
to other DataNodes to maintain the desired replication factor.
Heartbeat:
o DataNodes regularly send heartbeats to the NameNode to report their status. If a
DataNode fails to send a heartbeat within a certain period, the NameNode marks
it as unavailable and replicates the blocks stored on that DataNode to other nodes.
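The read flow above can also be exercised through HDFS's WebHDFS REST interface, which makes the NameNode-then-DataNode hand-off visible. The sketch below is illustrative only: the NameNode host, port, file path, and user name are assumptions, and WebHDFS must be enabled on the cluster.

    import requests

    NAMENODE = "http://namenode.example.com:9870"  # hypothetical host; port is an assumption
    PATH = "/data/input.txt"                       # hypothetical HDFS path

    # The OPEN operation first contacts the NameNode, which redirects the client
    # to a DataNode holding the block, mirroring the read flow described above.
    resp = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}",
                        params={"op": "OPEN", "user.name": "hdfs"},
                        allow_redirects=True)
    resp.raise_for_status()
    print(resp.text[:200])  # first 200 characters of the file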
Advantages of HDFS
Fault Tolerance:
HDFS provides high fault tolerance through block replication. If one node fails, another
node can provide access to the replicated data.
Scalability:
HDFS can easily scale by adding more nodes to the cluster. It can store and process
enormous datasets that cannot fit on a single machine.
Cost-Effectiveness:
HDFS is designed to run on commodity hardware, allowing organizations to use low-cost
machines in their clusters rather than investing in expensive high-performance servers.
High Throughput:
HDFS is optimized for high-throughput access to large datasets, making it ideal for
applications that require analyzing massive amounts of data (e.g., batch processing and
data-intensive applications).
Limitations of HDFS
Latency:
HDFS is not optimized for low-latency access to small files or real-time applications. It
excels in batch processing but is not suitable for use cases where quick response times are
critical.
Small Files:
HDFS is inefficient when dealing with many small files because each file, regardless of
its size, requires metadata storage in the NameNode. This can overwhelm the
NameNode's memory (a rough estimate appears after this list).
Single Point of Failure:
The NameNode can be a single point of failure (SPOF) in HDFS, though newer versions
of Hadoop mitigate this risk through high-availability features like multiple NameNodes.
Write Once, Read Many:
HDFS follows a write-once, read-many pattern, which means files cannot be modified
after being written, limiting flexibility for certain use cases.
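The small-files limitation can be made concrete with a back-of-the-envelope estimate. The commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file or block object is only an approximation, and the file counts below are hypothetical.

    BYTES_PER_OBJECT = 150  # rough rule of thumb per file/block object in NameNode memory

    def namenode_metadata_mb(num_files, blocks_per_file=1):
        """Approximate NameNode heap needed for file and block metadata."""
        objects = num_files * (1 + blocks_per_file)  # one file object plus its block objects
        return objects * BYTES_PER_OBJECT / (1024 ** 2)

    # Ten million tiny files (one block each) versus the same data packed
    # into one hundred thousand larger files (eight blocks each).
    print(f"10,000,000 small files: ~{namenode_metadata_mb(10_000_000):,.0f} MB of NameNode heap")
    print(f"   100,000 large files: ~{namenode_metadata_mb(100_000, blocks_per_file=8):,.0f} MB of NameNode heap")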
Conclusion
HDFS is a key enabler of Big Data storage and processing in distributed environments. Its ability
to store and manage massive datasets, along with its fault tolerance, scalability, and high
throughput, makes it a fundamental component of Hadoop and other Big Data technologies.
However, its limitations regarding latency and small file handling mean it is best suited for large-
scale batch processing tasks rather than real-time or transactional applications.