Unit 4 BDTT
APACHE OOZIE
Definition
Apache Oozie is an open-source workflow scheduling tool that helps manage and organize data processing tasks across Hadoop-based infrastructure.
Users can create, schedule, and control workflows that contain a coordinated series of Hadoop jobs, Pig scripts, Hive queries, and other operations.
Oozie handles task dependencies, manages retry mechanisms, and supports a variety of workflow types, from simple to sophisticated processes.
Evolution of Oozie
Yahoo initially created Apache Oozie in 2008 as an internal tool for managing Hadoop operations. In 2011 it was released as an open-source project under the Apache Software Foundation.
Oozie is a critical Hadoop ecosystem component for managing and scheduling massive data processing workloads and is frequently used in production settings. Its community has expanded, with developers contributing to its continual development and advancement.
The Oozie Workflow Manager and the Oozie Coordinator are the two main workflow management components of Apache Oozie.
The Oozie Workflow Manager manages and executes workflows. A workflow is defined in a workflow definition language (WDL) that outlines the order in which actions must be carried out, the input and output data required by each action, and how errors are handled. The Workflow Manager parses the WDL and carries out the actions in the defined order.
The Oozie Coordinator schedules workflows. A coordinator definition specifies when workflows should run, the data input for each instance of the workflow, and the dependencies between instances of the workflow. The Coordinator operates periodically, triggering workflow instances once their time and data dependencies are satisfied.
Oozie allows users to create, organize, and carry out collections of tasks or actions, ensuring that activities are executed in the proper order and that their dependencies are respected.
Oozie is highly scalable and designed for large-scale data processing tasks in Hadoop clusters.
Oozie offers a web-based graphical user interface and a RESTful API for submitting, monitoring, and managing workflows.
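As an illustration of the RESTful API mentioned above, the sketch below submits and starts a workflow through Oozie's web-services endpoint from Python. It is a minimal sketch only: the host/port, the user name, and the HDFS application path are placeholder assumptions, not values from this text.

import requests

# Placeholder values; adjust for your own cluster.
OOZIE_URL = "http://localhost:11000/oozie/v1/jobs"
APP_PATH = "hdfs://namenode:8020/user/alice/my-workflow"

# Oozie expects the job properties as an XML <configuration> document.
config = (
    "<configuration>"
    "<property><name>user.name</name><value>alice</value></property>"
    "<property><name>oozie.wf.application.path</name>"
    "<value>" + APP_PATH + "</value></property>"
    "</configuration>"
)

# action=start submits the workflow and starts it immediately;
# the JSON response contains the id of the new workflow job.
resp = requests.post(OOZIE_URL, params={"action": "start"}, data=config,
                     headers={"Content-Type": "application/xml;charset=UTF-8"})
print(resp.json())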
APACHE SPARK
Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing.
Spark uses Hadoop in two ways – one is storage and the second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purposes only.
The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark was donated to the Apache Software Foundation in 2013 and became a top-level Apache project in February 2014.
Spark can be built with Hadoop components in three ways: as a standalone deployment, on Hadoop YARN, or as Spark in MapReduce (SIMR).
Components of Spark
Apache Spark Core
Spark Core is the underlying general execution engine of the Spark platform upon which all other functionality is built.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and semi-
structured data.
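In current Spark releases the SchemaRDD abstraction has evolved into the DataFrame API. The minimal PySpark sketch below shows the idea; the file name people.json is an assumed example input, not something from this text.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Read semi-structured JSON into a DataFrame (the successor of SchemaRDD).
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")          # expose it to SQL

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()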
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is roughly nine times as fast as the Hadoop disk-based version of Apache Mahout.
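To make the ALS reference concrete, here is a minimal PySpark sketch of training an ALS recommender; the tiny in-line ratings and the column names are made-up illustrations, not benchmark data.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ALSExample").getOrCreate()

# A few toy (user, item, rating) rows purely for illustration.
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
    ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=5)
model = als.fit(ratings)
model.transform(ratings).show()   # predictions next to the original ratings

spark.stop()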
GraphX
It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API.
LIMITATIONS OF HADOOP
1. Issue with small files
Solution: Merge the small files to create bigger files and then copy the bigger files to HDFS. Sequence files work very well in practice to overcome the 'small file problem'; we use the filename as the key and the file contents as the value. Storing files in HBase is another very common design pattern to overcome the small-file problem with HDFS.
2. Slow processing speed
Solution: In-memory processing is faster, as no time is spent moving the data/processes in and out of the disk. Spark is up to 100 times faster than MapReduce because it processes everything in memory. Flink processes even faster than Spark because of its streaming architecture, and Flink is instructed to process only the parts of the data that have actually changed.
3. Latency
Solution: Apache Spark is yet another batch system, but it is relatively faster since it caches much of the input data in memory using RDDs (Resilient Distributed Datasets) and keeps intermediate data in memory itself.
4. Security
Solution: Spark provides a security bonus to overcome these limitations of Hadoop. If we run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN, giving it the capability of using Kerberos authentication.
5. No real-time data processing
Solution: Apache Spark supports stream processing, which involves continuous input and output of data. Apache Flink provides a single run-time for streaming as well as batch processing, so one common run-time is used for both data streaming applications and batch processing applications.
6. Support for batch processing only
Solution: Flink improves overall performance as it provides a single run-time for streaming as well as batch processing. Flink uses native closed-loop iteration operators, which make machine learning and graph processing faster.
7. Uncertainty
Hadoop only ensures that the data job is complete, but it is unable to guarantee when the job will be complete.
8. Lengthy lines of code
Solution: Spark and Flink are written in Scala and Java, but the implementation is in Scala, so the number of lines of code is smaller than in Hadoop. It therefore also takes less time to execute a program, overcoming Hadoop's lengthy-lines-of-code limitation.
9. No caching
Solution: Spark and Flink overcome this limitation of Hadoop, as both cache data in memory for further iterations, which enhances overall performance (see the caching sketch after this list).
10. Not easy to use
Solution: Spark has an interactive mode, so developers and users alike can get intermediate feedback for queries and other activities. Spark is easy to program as it has tons of high-level operators.
11. No delta iteration
Solution: Spark iterates its data in batches; for iterative processing in Spark, we schedule and execute each iteration separately.
12. Vulnerable by nature
Hadoop is entirely written in Java, one of the most widely used languages; Java has therefore been heavily exploited by cyber criminals and, as a result, implicated in numerous security breaches.
13. No abstraction
Solution: Spark provides the RDD abstraction for batch processing, while Flink has the DataSet abstraction.
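The caching point in item 9 can be illustrated with a short PySpark sketch; the file name events.csv and the status column are assumed examples.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingExample").getOrCreate()

events = spark.read.csv("events.csv", header=True)
events.cache()                                   # keep the data in memory after first use

print(events.count())                            # first action reads from disk and fills the cache
print(events.filter("status = 'ok'").count())    # later actions reuse the cached data

spark.stop()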
APACHE FLINK
Apache Flink is a real-time processing framework which can process streaming data.
It has a true streaming model and does not take input data as batches or micro-batches.
Apache Flink was founded by the company Data Artisans and is now developed under the Apache License by the Apache Flink community.
Storage
Apache Flink has multiple options from which it can read/write data, including HDFS, the local file system, S3, relational databases, MongoDB, HBase, Apache Kafka, and Apache Flume.
Deploy
You can deploy Apache Flink in local mode, cluster mode, or on the cloud. Cluster mode can be standalone, YARN, or Mesos.
Kernel
This is the runtime layer, which provides distributed processing, fault tolerance,
reliability, native iterative processing capability and more.
APIs & Libraries
This is the top layer and the most important layer of Apache Flink.
It has the DataSet API, which takes care of batch processing, and the DataStream API, which takes care of stream processing.
There are other libraries like FlinkML (for machine learning), Gelly (for graph processing), and Table for SQL.
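The Scala/Java DataSet and DataStream APIs above also have a Python counterpart (PyFlink) in recent Flink releases; the following is a minimal DataStream sketch under that assumption.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A small bounded collection used as a source, mapped to (word, 1) pairs.
words = env.from_collection(["flink", "spark", "flink"])
pairs = words.map(lambda w: (w, 1))
pairs.print()                     # writes each record to the task output

env.execute("word-pairs")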
To set up a local Flink cluster, download the binary release, extract it, and start the cluster:
wget http://mirrors.estointernet.in/apache/flink/flink-1.7.1/flink-1.7.1-bin-scala_2.11.tgz
tar -xzf flink-1.7.1-bin-scala_2.11.tgz
cd flink-1.7.1/
./bin/start-cluster.sh
Open a web browser and go to the URL below; it will open the Flink Web Dashboard.
http://localhost:8081
This is how the user interface of the Apache Flink Dashboard looks.
BATCH ANALYTICS USING FLINK
Apache Flink’s unified approach to stream and batch processing means that a DataStream
application executed over bounded input will produce the same final results regardless of the
configured execution mode.
It is important to note what final means here: a job executing in STREAMING mode might
produce incremental updates (think upserts in a database) while a BATCH job would only
produce one final result at the end. The final result will be the same if interpreted correctly but
the way to get there can be different.
As a rule of thumb, you should be using BATCH execution mode when your program is
bounded because this will be more efficient.
You have to use STREAMING execution mode when your program is unbounded because
only this mode is general enough to be able to deal with continuous data streams.
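In recent Flink releases the execution mode can be chosen when submitting the job (for example on the command line) or, as in the PyFlink sketch below, directly on the environment; the sketch assumes a PyFlink version that exposes RuntimeExecutionMode.

from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode

env = StreamExecutionEnvironment.get_execution_environment()
env.set_runtime_mode(RuntimeExecutionMode.BATCH)   # or STREAMING / AUTOMATIC

# A bounded source, so BATCH execution is a valid and more efficient choice.
data = env.from_collection([1, 2, 3, 4])
data.map(lambda x: x * 2).print()

env.execute("bounded-job")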
In BATCH execution mode, the tasks of a job can be separated into stages that
can be executed one after another.
We can do this because the input is bounded and Flink can therefore fully process
one stage of the pipeline before moving on to the next.
For example, a job whose tasks are separated by two shuffle barriers would have three stages that are executed one after another.
Because Flink materializes the intermediate results at these stage boundaries, this brings several benefits. For one, it allows Flink to backtrack to the latest available results when a failure happens instead of restarting the whole job.
Another side effect is that BATCH jobs can execute on fewer resources (in terms
of available slots at TaskManagers) because the system can execute tasks
sequentially one after the other.
Task Managers will keep intermediate results at least as long as downstream tasks
have not consumed them.
After that, they will be kept for as long as space allows in order to allow the
aforementioned backtracking to earlier results in case of a failure.
BIG DATA MINING WITH NO SQL
Datasets that are difficult to store and analyze by any software database tool are
referred to as big data. Due to the growth of data, an issue arises that based on
recent fads in the IT region, how the data will be effectively processed. A
requirement for ideas, techniques, tools, and technologies is been set for handling
and transforming a lot of data into business value and knowledge. The major
features of NoSQL solutions are stated below that help us to handle a large amount
of data.
1. The queries should be moved to the data rather than moving the data to the queries:
When a client needs to send a general query to all nodes holding data, it is more efficient to send the query to every node than to move a large set of data to a central processor. This simple rule helps explain why NoSQL databases have dramatic performance advantages over systems that were not designed to distribute queries to nodes. The data is kept inside each node in document form, which means only the query and the result need to move over the network, keeping big data queries fast.
2. Hash rings should be used for the even distribution of data:
Figuring out a reliable way to assign a record to a processing node is perhaps the most difficult issue with distributed databases. With the help of a randomly generated 40-character key, the hash ring technique distributes a large amount of data evenly across numerous servers, and this is a good way to spread the network load uniformly.
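The hash ring idea can be sketched in a few lines of plain Python; this toy version hashes node names and document keys onto the same ring and is only an illustration of the principle, not any particular database's implementation.

import hashlib
from bisect import bisect

def ring_position(key):
    # 160-bit SHA-1 value used as a position on the ring
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_position(n), n) for n in nodes)

def node_for(doc_key):
    # walk clockwise to the first node at or after the key's position
    positions = [p for p, _ in ring]
    idx = bisect(positions, ring_position(doc_key)) % len(ring)
    return ring[idx][1]

print(node_for("order-12345"))   # the node that should store this document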
3. For scaling read requests, replication should be used:
Databases use replication to make real-time backup copies of data. Read requests can be scaled horizontally with the help of replication. This replication strategy works well in most cases.
4. Distribution of queries to nodes should be done by the database:
Separating the concerns of query evaluation from query execution is important for getting higher performance from queries traversing many nodes. The NoSQL database moves the query to the data instead of moving the data to the query.
NOSQL DATABASES
A database is a collection of structured data or information which is stored in a
computer system and can be accessed easily. A database is usually managed by a
Database Management System (DBMS).
NoSQL is a non-relational database that is used to store data in non-tabular form. NoSQL stands for Not only SQL. The main types are document, key-value, wide-column, and graph databases.
Document-based databases
Key-value stores
Column-oriented databases
Graph-based databases
Document-Based Databases:
Document-based databases store data in documents similar to JSON objects; each document contains fields and values, and documents in the same collection can have different structures.
Key-Value Stores:
Key-value stores are the simplest type of NoSQL database; each item is stored as a key together with its value, and data is retrieved by looking up the key.
Graph-Based Databases:
Graph-based databases focus on the relationship between the elements. It stores the
data in the form of nodes in the database. The connections between the nodes are
called links or relationships.
Key features of graph databases:
In a graph-based database, it is easy to identify the relationships between data items by following the links.
Query output is available in real time.
The speed depends upon the number of relationships among the database elements.
Updating data is also easy, as adding a new node or edge to a graph database is a straightforward task that does not require significant schema changes.
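The node-and-link model can be illustrated with a tiny Python sketch; the nodes, relationship names, and traversal helper are made up for illustration and do not correspond to any particular graph database.

# Nodes with properties, and edges stored as (source, relationship, target) links.
nodes = {"alice": {"type": "person"},
         "bob": {"type": "person"},
         "mongodb": {"type": "database"}}
edges = [("alice", "KNOWS", "bob"), ("alice", "USES", "mongodb")]

def neighbours(node, relation):
    # follow links of one kind starting from a node
    return [dst for src, rel, dst in edges if src == node and rel == relation]

print(neighbours("alice", "KNOWS"))   # -> ['bob']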
MONGODB
MongoDB, the most popular NoSQL database, is an open-source document-
oriented database.
It means that MongoDB isn’t based on the table-like relational database structure
but provides an altogether different mechanism for storage and retrieval of data.
For example, a record in MongoDB is a document that might look like this:
{
  title: 'Geeksforgeeks',
  url: 'https://www.geeksforgeeks.org',
  type: 'NoSQL'
}
In relational databases, by contrast, data is stored in a predefined data model (tables), which is not very flexible for today's real-world, fast-growing applications.
Modern applications are more networked, social and interactive than ever.
Applications are storing more and more data and are accessing it at higher rates.
If the database runs on a single server, then it will reach a scaling limit. NoSQL
databases are more scalable and provide superior performance.
MongoDB is such a NoSQL database that scales by adding more and more servers
and increases productivity with its flexible document model.
Where do we use MongoDB?
Installing MongoDB:
For Windows, a few options for 64-bit operating systems drop down.
If you are running Windows 7, 8 or newer versions, select Windows 64-bit 2008 R2+. If you are using Windows XP or Vista, select Windows 64-bit 2008 R2+ legacy.
BASIC QUERIES IN MONGODB
Creating/switching to a database:
Syntax
use DATABASE_NAME
Dropping a database:
Syntax
db.dropDatabase()
This will delete the selected database. If you have not selected any database, then it will delete the default 'test' database.
Creating a collection:
Syntax
db.createCollection(name, options)
Dropping a collection:
Syntax
db.COLLECTION_NAME.drop()
Inserting a document:
Syntax
db.COLLECTION_NAME.insert(document)
Querying documents in a collection:
Syntax
db.COLLECTION_NAME.find()
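The same operations can be issued from Python with the PyMongo driver; the sketch below assumes a MongoDB server on localhost:27017, and the database and collection names are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["mydb"]                                   # use DATABASE_NAME

# insert(document)
db["sites"].insert_one({"title": "Geeksforgeeks",
                        "url": "https://www.geeksforgeeks.org",
                        "type": "NoSQL"})

# find()
for doc in db["sites"].find():
    print(doc)

db["sites"].drop()                                    # drop the collection
client.drop_database("mydb")                          # dropDatabase()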
APACHE CASSANDRA
Apache Cassandra was originally developed at Facebook; it was open-sourced in 2008 and became one of the top-level Apache projects in 2010.
Features of Cassandra:
1. It is scalable.
2. It is flexible (can accept structured , semi-structured and unstructured data).
3. It has limited transaction support: it offers atomicity, isolation, and durability at the row level and lightweight transactions, but not full ACID semantics.
4. It is highly available and fault tolerant.
5. It is open source.
Apache Cassandra is a highly scalable, distributed database designed around the trade-offs described by the CAP (Consistency, Availability, and Partition tolerance) theorem: it favors availability and partition tolerance while offering tunable consistency.
In Apache Cassandra, there is no master-client architecture. It has a peer-to-peer
architecture.
In Apache Cassandra, we can create multiple copies of data at the time of keyspace
creation.
Example:
CREATE KEYSPACE Example
WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': '3'};
In this example, we define the RF (Replication Factor) as 3, which simply means that we are creating 3 copies of the data across multiple nodes, placed clockwise around the ring.
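For completeness, the keyspace above can also be created from Python with the DataStax Cassandra driver; this is a minimal sketch that assumes a Cassandra node on localhost (and a release recent enough to accept the 'replication_factor' shorthand with NetworkTopologyStrategy), with the table and column names made up for illustration.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Same keyspace definition as in the example above (three replicas).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS example
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': '3'}
""")

session.execute("CREATE TABLE IF NOT EXISTS example.users (id int PRIMARY KEY, name text)")
session.execute("INSERT INTO example.users (id, name) VALUES (%s, %s)", (1, "alice"))

for row in session.execute("SELECT id, name FROM example.users"):
    print(row.id, row.name)

cluster.shutdown()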