
MBA – DATA ANALYTICS

SEMESTER-I

DATA SCIENCE AND BUSINESS


ANALYTICS
MB-DA-104
All rights reserved. No Part of this book may be reproduced or transmitted, in any form or by
any means, without permission in writing from Mizoram University. Any person who does
any unauthorized act in relation to this book may be liable to criminal prosecution and civil
claims for damages. This book is meant for educational and learning purposes. The authors
of the book has/have taken all reasonable care to ensure that the contents of the book do not
violate any existing copyright or other intellectual property rights of any person in any
manner whatsoever. In the event the Authors has/ have been unable to track any source and if
any copyright has been inadvertently infringed, please notify the publisher in writing for
corrective action.

© Team Lease Edtech Pvt. Ltd.

All rights reserved. No Part of this book may be reproduced in any form without permission
in writing from Team Lease Edtech Pvt. Ltd.
CONTENT

UNIT - 5: Relational Databases


UNIT - 5: RELATIONAL DATABASES

STRUCTURE

5.1 Learning objective

5.2 Introduction

5.3 Relational Database Management System

5.3.1 Different types of RDBMS

5.4 Structured Query Language (SQL)

5.4.1 SQL statements

5.4.2 Important SQL commands and their syntax

5.5 Big data storage and retrieval

5.5.1 Dealing with the volume and velocity challenge

5.6 NoSQL database

5.6.1 Difference between RDBMS and NoSQL

5.6.2 Types of NoSQL systems

5.6.3 NoSQL databases queries

5.6.4 A common design choice in NoSQL: Shared-nothing architecture:

5.6.5 Benefits of NoSQL

5.6.6 Drawbacks of NoSQL

5.6.7 When should we opt for NoSQL?

5.7 Big data distributed computing

5.7.1 Need for distributed computing for big data:

5.7.2 The issue with latency for Big data:

5.8 MapReduce

5.8.1 MapReduce in Hadoop framework

5.8.2 Terms related to MapReduce model

5.8.3 Working of MapReduce model


5.8.4 MapReduce algorithm

5.8.5 Advantages of MapReduce

5.8.6 Disadvantages of MapReduce

5.9 Spark RDD

5.9.1 Reasons for using RDD

5.9.2 When should RDD be used?

5.9.3 Features of Spark RDD

5.9.4 Demerits of Spark RDD

5.10 Artificial neural network

5.10.1 Working of Artificial neural networks

5.10.2 An artificial neural network's process of learning

5.10.3 Supervised and unsupervised learning in artificial neural networks

5.10.4 Types of neural networks

5.10.5 Comparing neural networks with classical AI

5.10.6 Drawbacks of Artificial neural networks

5.11 Deep learning

5.11.1 Practical examples of deep learning

5.12 Summary

5.13 Self-Assessment questions

5.14 Suggested Readings

5.1 LEARNING OBJECTIVES

After studying this unit, you will be able to:

● Understand the basics of RDBMS.

● Learn the various commands in SQL.

● Know the difference between RDBMS and NoSQL.

● Understand the uses, benefits, and drawbacks of NoSQL.

● Know about MapReduce, its advantages and limitations.

● Learn about Spark RDD and its usage.

● Understand various aspects of artificial neural networks and deep learning in relation to artificial intelligence.

5.2 INTRODUCTION

A set of data stored in a computer is termed a database. Databases are generally maintained in a
structured way to ensure ease of accessibility. A relational database is a type of database that
uses a structure which lets users identify and access data in relation to other pieces of data
in the database. It stores data in the form of tables.

A table is made up of several rows and columns. The rows are referred to as records, and
the columns have a descriptive name along with a certain data type. For example, if a column
holds the age of people, it will have an integer data type, while name and country will have a
string data type.

Name Age Country

Natalia 32 Russia

John 34 USA

Rustom 37 India

The table given above has 3 columns for the name, age, and country of the people.
Here, name and country are of string data type; age is of integer data type. Each of the 3
rows holds the record of one person.

5.3. RELATIONAL DATABASE MANAGEMENT SYSTEM

RDBMS is a program meant for creating, updating, and administering a relational database.
Generally, SQL language is used by the relational database management systems for
accessing the database. Let us learn more about SQL in the upcoming section.

5.3.1. Different types of RDBMS

The SQL syntax is not the same for all types of RDBMS. Let us see some of the popular ones
in the following section:
i. MySQL:

It is an open-source SQL database used for web application development. PHP is used for
accessing it. There are some merits of using MySQL, namely, reliability, ease of use,
inexpensive, and it has a huge developers' community who can readily answer the queries.

It has some demerits too. This includes a lack of advanced features that developers may need
and poor performance while scaling. Open-source development has slowed down since
Oracle took over MySQL.

ii. PostgreSQL:

It is an open-source SQL database that is not controlled by any corporation and is often used for


developing web applications. Its advantages are similar to MySQL's, but it comes with some
additional features, such as support for foreign keys without the need for complex
configuration.

PostgreSQL's main drawback is that it performs more slowly than some other databases and is
also less popular.

iii. Oracle DB:

This database is owned by the Oracle Corporation, and it is not open source. It is used by larger
applications, mainly by the top banks in the banking industry. This is because it uses
powerful technology that is comprehensive and pre-integrated with business applications and
functionalities built mainly for banks.

Since it is not open source, it is not available for free and can be very expensive. This is
its only drawback.

iv. SQL Server:

It is owned by Microsoft and is closed source. Generally, larger enterprise applications
use it. Express is its entry-level version offered by Microsoft, but when an application is
scaled up, it becomes quite expensive.

v. SQLite:

This is an open-source SQL database that can store a whole database in one file. Since all the
data can be stored locally, you don't have to connect your database to a server. It is
generally used as the embedded database in MP3 players, phones, PDAs, and other similar
electronic devices.
5.4. STRUCTURED QUERY LANGUAGE (SQL)

It is a programming language that helps us to communicate with the data stored in an
RDBMS. In 1986, it became a standard of the American National Standards Institute (ANSI).
SQL has a syntax that bears a resemblance to the English language, making it easy to
write, read, and interpret.

Many RDBMSs use their own variants of SQL for accessing data in the tables.
SQLite, for instance, supports a comparatively minimal set of SQL commands.

Syntax of SQL:

You can find one or more tables in a database, each of which has a name for identification
purposes. All tables contain records, which are the rows of data.

5.4.1. SQL Statements

The actions performed on the database are done using SQL statements.

Let us see an example of the Customer table:

Customer ID Customer name Contact name City

1 Tanya Pushpa Kolkata

2 Meena Hema Delhi

3 Sangita Malti Mumbai

4 Rita Angela Chennai

Table 5.3. Customer table

For collecting all the records in the "Customers" table, we have to write:

SELECT * FROM Customers;

Note:

● SQL keywords are not case sensitive, but we will be mentioning all keywords in the
upper case for better identification.

● In the case of certain databases, a semicolon might be needed at the end of a SQL
statement for separating multiple statements that are written to be executed in the
single call to the server.
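
Because SQLite (mentioned in section 5.3.1) can hold a whole database in a single file or in memory, the statement above can be tried directly from Python's built-in sqlite3 module. The following is a minimal sketch, not tied to any particular RDBMS product; the table and rows simply mirror the Customers example above:

import sqlite3

# An in-memory SQLite database, so the sketch needs no server or file.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE Customers (CustomerID INTEGER, CustomerName TEXT, ContactName TEXT, City TEXT);")
cur.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?, ?);",
    [(1, "Tanya", "Pushpa", "Kolkata"),
     (2, "Meena", "Hema", "Delhi"),
     (3, "Sangita", "Malti", "Mumbai"),
     (4, "Rita", "Angela", "Chennai")],
)

# Collect all the records in the Customers table, as in the statement above.
for row in cur.execute("SELECT * FROM Customers;"):
    print(row)

conn.close()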
5.4.2. Important SQL commands and their syntax

i. SELECT:

It is used to get data from a database; the data returned is stored in a result table called the result set.

Syntax:

SELECT col1, col2, ..., col n

FROM table_name;

Here, col1, col2, ... are the field names (column names) in the table from which you
want to select data.

For selecting all the fields, follow the syntax mentioned below:

SELECT * FROM table_name;

Example:

From Table 5.3. Customer table,

SELECT CustomerName, City FROM Customers;

This will return the Customer name and City columns for every record in the Customers table.

ii. UPDATE:

It is for modifying the current records in a database.

Syntax:

UPDATE table_name

SET col1 = val1, col2 = val2, ..., col n= val n

WHERE condition;

Here, "WHERE" indicates the records which need to be updated, and it is an optional part of
this syntax. If this part is ignored in the syntax then the UPDATE command will update all
records in the specified table.

iii. DELETE:

It is for deleting the records in a database.

Syntax:

DELETE FROM Table_name;


This will delete all rows in the specified table. You can also add a WHERE clause to delete
specific records from the table.

iv. INSERT INTO:

It is for inserting new data into a database.

Syntax:

INSERT INTO table_name (col1, col2, col3, ..., col n)

VALUES (val1, val2, val3, ..., val n);

The above syntax mentions the names of the columns along with the values which need to be
inserted.

In case you are adding values for all the columns, it is not required to specify the column
name in the syntax. But the order of the values should be as per the order of columns in the
table as mentioned below:

INSERT INTO table_name

VALUES (val1, val2, val3, ...);

If you need to insert data in specific columns only, then specify the target columns along
with the corresponding values.

v. CREATE DATABASE:

It is for creating a new database.

Syntax:

CREATE DATABASE database_name;

Note:

It is important to have admin privileges prior to the creation of the database. After it is
created, type:

SHOW DATABASES;

This will give you a list of all existing databases.

vi. ALTER TABLE:

It is used for modifying an existing table. The actions performed by this keyword include the
addition and deletion of columns, besides modification of their data types.
The syntax for adding a column:

ALTER TABLE table_name

ADD column_name datatype;

Syntax for removing a column:

ALTER TABLE table_name

DROP COLUMN column_name;

For changing a column's data type, follow the syntax given below:

ALTER TABLE table_name

ALTER COLUMN column_name datatype;

vii. CREATE TABLE:

It is for creating a new table in a database.

Syntax:

CREATE TABLE table_name (col1 datatype, col2 datatype, col3 datatype, ..., col n datatype);

Here, col1, col2,...col n indicate the names of the columns in the table, and the data type
indicates what type of data the column will hold, such as varchar, integer, and others.

For creating the copy of an existing table, use the following syntax:

CREATE TABLE new_table_name AS

SELECT col1, col2,... FROM existing_table_name

WHERE ....;

In this case, the new table will have the same column definitions. You can select some or all
columns. The creation of a new table from the existing table will fill up the new table with
values from the existing table.


viii. DROP TABLE:

It is for deleting a table; all the information stored in it will be lost.

Syntax:

DROP TABLE table_name;

But if you just want to remove some data in the table instead of the whole table, follow this
syntax:

TRUNCATE TABLE table_name;

ix. CREATE INDEX:

It is for index creation, i.e., search key. Data is retrieved quickly using an index. The index
won't be visible to the users, but it will simply speed up their searches.

You should note that if a table has indexes, then updating it takes longer. So it is
recommended to create indexes on columns that you will frequently be searching.

Syntax:

CREATE INDEX index_name

ON table_name (column1, column2, ...);

If you want to avoid duplicate values, then follow the given Syntax:

CREATE UNIQUE INDEX index_name

ON table_name (column1, column2, ...);

x. DROP INDEX:

It is for deleting an index.

In SQL Server, it is written as:

DROP INDEX table_name.index_name;

In MySQL, it is written as:

ALTER TABLE table_name


DROP INDEX index_name;
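
Several of the commands above can be exercised together in a short sketch using Python's built-in sqlite3 module. This is only an illustration: SQLite supports just a subset of the statements in this section (it accepts ALTER TABLE ... ADD but not ALTER COLUMN), and the table, column, and index names below are made up for the example:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE with column names and data types.
cur.execute("CREATE TABLE Customers (CustomerID INTEGER, CustomerName TEXT, City TEXT);")

# INSERT INTO with explicit column names and values.
cur.execute("INSERT INTO Customers (CustomerID, CustomerName, City) VALUES (1, 'Rita', 'Chennai');")
cur.execute("INSERT INTO Customers (CustomerID, CustomerName, City) VALUES (2, 'Meena', 'Delhi');")

# UPDATE with a WHERE clause, so only the matching record changes.
cur.execute("UPDATE Customers SET City = 'Kolkata' WHERE CustomerID = 2;")

# ALTER TABLE ... ADD puts a new column on an existing table.
cur.execute("ALTER TABLE Customers ADD Contact TEXT;")

# CREATE INDEX builds a search key to speed up queries on the column.
cur.execute("CREATE INDEX idx_city ON Customers (City);")

# DELETE with a WHERE clause removes specific records only.
cur.execute("DELETE FROM Customers WHERE CustomerID = 1;")

print(cur.execute("SELECT * FROM Customers;").fetchall())
conn.close()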

5.5. BIG DATA STORAGE AND RETRIEVAL

Big data storage deals with storing and managing data in a scalable manner, ensuring
that the requirements of applications that need access to the data are effectively met. An
ideal big data store would allow an unlimited amount of data storage, cope efficiently with
high rates of random read and write access, flexibly deal with various data models, support
structured as well as unstructured data, and, for privacy reasons, work only on encrypted
data. Though it is difficult to meet all these requirements in practice, the newly developed
storage systems at least partially address the challenges of volume, velocity, and variety.
They are not categorized as relational database management systems, but that doesn't imply
that RDBMSs don't address these challenges at all. In fact, alternative storage technologies
such as the Hadoop Distributed File System (HDFS) are an efficient and less expensive option
for this purpose.

5.5.1. Dealing with the volume and velocity challenge

Volume challenge:

Big data storage systems use distributed, shared-nothing architectures for addressing
higher storage requirements. They scale out to new nodes to provide additional computational
power and storage. It is possible to seamlessly add new machines to a storage cluster, after
which the storage system transparently distributes the data between the individual nodes.

Velocity challenge:

Velocity refers to the time needed to get a response to a query, which matters most
when there is a large amount of incoming data. Similarly, variety refers to the effort needed
to integrate and work with data that originate from numerous sources. Graph databases, for
example, help address these challenges.

5.6. NOSQL DATABASE

There are constraints on data type and consistency in the case of SQL databases; in the
case of NoSQL, these constraints have been relaxed in favour of speed, scaling, and flexibility.
When an application is developed, it is quite essential to decide whether SQL or NoSQL
databases should be used for data storage.
There are different trade-offs offered by SQL and NoSQL databases, making each suitable
for different use cases. This will be clarified in the following points:

5.6.1. Difference between RDBMS and NoSQL

i. Flexibility and operational efficiency:

SQL or relational databases ensure reliable transactions and respond well to ad-hoc queries, but
they have certain restrictions, such as a rigid schema, that make them unfit for some applications.

But in the case of a NoSQL database, data is stored and managed in a way that ensures
high operational speed and better flexibility for developers. Horizontal scaling across
hundreds or thousands of servers is possible with NoSQL databases, unlike with SQL
databases.

ii. Data consistency:

NoSQL does not provide the same data consistency guarantees as SQL databases. SQL
databases sacrifice some performance and scalability to abide by the ACID properties for
reliable transactions, whereas NoSQL databases prioritize speed and scalability by relaxing
the ACID guarantees.

iii. Data structure:

All data has an inherent structure in SQL databases. For example, a column may hold
integers only, and the data is typically normalized to a high degree. Hence operations like JOIN can
be easily performed on the data in a SQL database.

But in the case of NoSQL, data is stored in free form, i.e., you can store any data in any
record. This has resulted in 4 types of NoSQL databases, namely key-value stores, wide column
(column-oriented) stores, graph databases, and document databases, which we will discuss in
the upcoming section.

Scenarios when schema-less data is useful:

Schema-less data storage is useful in the following scenarios:

a. Quickly accessing the data, i.e., when speed and simplicity of access is the primary
concern rather than consistency or reliable transactions.

b. While storing data of large volume, you want to avoid getting locked into a schema as
changing it later would be difficult.
c. You want to retain the originality of the unstructured data obtained from multiple sources
for better flexibility.

d. When you need the hierarchical form of the data to be defined by the data itself instead of
by an external schema. With NoSQL, the data can be self-describing, whereas such hierarchies
are difficult to emulate in SQL databases.

5.6.2. Types of NoSQL systems

NoSQL systems are classified into 4 types, each having its data model. The types are as
explained below:

i. Document databases:

CouchDB and MongoDB are examples of document databases, in which the data is stored in
free form, typically as JSON (JavaScript Object Notation) documents. The values can be integers,
strings, Booleans, arrays, objects, or free-form text. Their structure aligns with the objects
that developers work with in code.

You can use document databases as general-purpose databases. It is possible to scale them
out horizontally to accommodate large volumes of data.

ii. Key-value stores:

In this type of database, every item is a key paired with a value. Values are free form, from
simple integers and strings to complex JSON documents, all of which are accessed by their
keys. Hence it is easy to learn how to query for a certain key. This type is useful
when large amounts of data need to be stored but no complex queries are needed for
retrieving them. Storing user preferences or caching are common uses. Redis,
DynamoDB, and Riak are examples of this type.
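
As a rough illustration of the key-value idea, a plain Python dictionary can stand in for a store such as Redis: the store only understands "set this value for this key" and "get the value for this key", with no query language over the values. The key names below are purely illustrative:

# A dictionary standing in for a key-value store.
user_prefs = {}

def set_value(key, value):
    user_prefs[key] = value

def get_value(key):
    return user_prefs.get(key)

# Store user preferences under simple string keys (a common use of this type).
set_value("user:42:theme", "dark")
set_value("user:42:language", "en")

print(get_value("user:42:theme"))      # dark
print(get_value("user:42:language"))   # en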

iii. Wide column stores:

Here, data is stored in dynamic columns and rows. Unlike conventional SQL databases,
they provide better flexibility, as each row doesn't need to have the same columns. Hence this
type is also referred to as a 2-dimensional key-value database. It is mainly useful when there
is a large amount of data to be stored and you can predict the query patterns. It works very well
for user profile data and data related to the Internet of Things. HBase and
Cassandra are examples of this type.

iv. Graph databases:


In this case, the entities and their relationships are represented in the form of a network or
graph. Every node of the graph is a free-form chunk of data; in other words, this type is
represented by nodes and edges. It is very useful when you need to traverse relationships to
find patterns, as in social networks, fraud detection, and recommendation engines.
Neo4j and JanusGraph are common examples of graph databases.

5.6.3. NoSQL databases queries

SQL has a standardized query language, so the basics remain the same across databases even
if certain operations are handled differently. Each NoSQL database, in contrast, has its own
syntax for managing data and making queries.

For example, CouchDB uses requests in the form of JSON sent
over HTTP for creating or retrieving documents from its database, while MongoDB uses a
command-line interface or a language library for sending JSON objects over a binary
protocol. Even where you can use SQL-like syntax for working with data, it will
be very limited. For example, in Cassandra you can use the SELECT or INSERT keywords
just as in SQL, but there is no way of using the JOIN keyword, as it doesn't exist
there.
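
The sketch below illustrates the document model in plain Python, using only the standard json module; it does not reproduce the query syntax of any particular NoSQL product, and the field names are made up. A "query" here amounts to filtering documents by a field, and the documents need not share the same fields:

import json

# Three free-form documents, as a document database might store them.
documents = [
    json.loads('{"name": "Rita",  "city": "Chennai", "orders": 3}'),
    json.loads('{"name": "Meena", "city": "Delhi"}'),
    json.loads('{"name": "Tanya", "city": "Kolkata", "vip": true}'),
]

# Filter documents whose "city" field matches a value.
matches = [doc for doc in documents if doc.get("city") == "Chennai"]
print(matches)    # [{'name': 'Rita', 'city': 'Chennai', 'orders': 3}]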

5.6.4. A common design choice in NoSQL: Shared-nothing architecture

In this type of design, every server node in the cluster operates independently, i.e., it doesn't
depend on any other node. For example, to return a piece of data to the client, a node does not
need consensus from every single other node. This shared-nothing design is used by various
conventional SQL systems as well, but NoSQL systems typically sacrifice consistency across
the cluster to ensure better performance.

The closest node responds to queries, which makes the process very fast. Resiliency and
scaling out are the other advantages of shared-nothing architecture. Scaling out implies
spinning up new nodes in the cluster and waiting for them to synchronize with the others. If a
node in a NoSQL database goes down, the other servers in the cluster will chug along. Even
when fewer nodes are available to cater to requests, all the data will still be available.

5.6.5. Benefits of NoSQL

The advantages of NoSQL databases are as follows:

i. Scalability is high:

Scalability lets NoSQL handle a large amount of data efficiently. Sharding means the data is
partitioned and placed on multiple machines in a way that preserves the data order.
Vertical scaling implies adding more resources to the existing machine, while
horizontal scaling implies adding more machines to handle the data. Implementing
horizontal scaling is easier than vertical scaling. Cassandra and MongoDB are
examples of horizontally scaling databases.

ii. Availability is high:

NoSQL has an auto-replication feature, which makes it highly available. This is because,
whenever there is a failure, data replicates to the last consistent state.

5.6.6. Drawbacks of NoSQL

NoSQL has the following demerits:

i. Narrow focus:

NoSQL databases are designed mainly for storage and offer less functionality. In the field of
transaction management, relational databases remain a better choice.

ii. Open source product:

NoSQL doesn't have a single reliable standard, which means that two NoSQL database
systems are very likely to be incompatible with each other.

iii. Data management is challenging:

Data management is not an easy task, even though the big data tools are meant for this
purpose. In NoSQL, data management is more complex than in a relational database, as
installing the system and managing data on a daily basis can be quite hectic.

iv. Lack of GUI:

This makes accessing the database somewhat difficult.

v. Lack of approach for data backup in a consistent manner:

In databases like MongoDB, this problem exists, which is a huge drawback for NoSQL.

vi. Large document sizes:

This is true of database systems like MongoDB and CouchDB, where data is stored in JSON
format. Documents can become very large, requiring higher speed and more network
bandwidth. Descriptive key names are also problematic, as they increase the size of each
document.
5.6.7. When should we opt for NoSQL?

We should use NoSQL in the following cases:

i. When the amount of data that needs to be stored and retrieved is large.

ii. The relationship between the stored data is unimportant.

iii. Data changes with time and is unstructured.

iv. At the database level, there is no need to support Constraints and Joins.

v. Data growth is continuous and needs regular scaling for efficiently handling them.

5.7. BIG DATA DISTRIBUTED COMPUTING

Distributed Computing:

Distributed computing is the study of distributed systems in the field of computer science. A
distributed system has many nodes communicating through a network, and a shared goal is
accomplished through the interaction of the computers at each node.
Fig 5.6
Source: https://en.wikipedia.org/wiki/Distributed_computing

When a firm is considering a big data project, it is essential to understand the basics of
distributed computing first. Since the computing resources can be distributed in many ways,
there is not a single distributed computing model. For example, a set of programs can be
distributed on the same physical server, and for communicating and passing information,
messaging services can be used. Also, you can have multiple systems or servers, each having
its own memory and working in a coordinating manner for resolving an issue.
5.7.1. Need for distributed computing for big data

Not all problems need distributed computing. When there is no tight time constraint, complex
processing can be done remotely through a specialized service. But when companies need to
analyse complex data, they generally move the data to an external entity with plenty of spare
resources for processing it. For a long time it was not economically feasible to purchase
enough computing resources to handle the emerging requirements, so companies had to wait
to get the intended results. In some cases, companies captured only certain sections of data
instead of capturing everything, due to the cost factor. Even though all the data was needed,
analysts had to make do with snapshots, hoping to capture the right data at the right moment.

Gradually, a breakthrough in the hardware and software sectors brought a revolution in the
data management industry. With the increase in innovation and demand, power increased,
and the hardware cost decreased. Besides this, new software was developed for automating
processes like load balancing and optimization across large node clusters. This software
could also understand the performance level needed for certain workloads. The nodes were
treated as a single pool for computing storage and networking the assets. This enabled the use
of virtualization technology for the movement of processes to another node with no
interruption even when a node failed.

Computing and big data's changing economics:

There has been a decrease in the cost of resources for computing and storage. The economics
of computing changed because of virtualization: commodity servers could be clustered, and
blades could be networked in a rack. Innovation in software solutions coincided with this
change, resulting in a significant improvement in these systems' manageability.

A significant reduction in latency occurred because of the capability to leverage distributed


computing and parallel processing methods. In certain specific cases, like High-Frequency
Trading (HFT), physically locating servers close to the source of the data is necessary for
achieving low latency.

5.7.2. The issue with latency for Big data

Managing large quantities of data comes with a perennial problem: the effect of
latency. Latency is the delay in a system caused by delays in task execution, and it is a
problem in every aspect of computing.

Suppliers, customers, and partners can experience a notable difference in latency because of
distributed computing and parallel processing. Since speed, volume, and variety are core big
data requirements, various big data applications depend on low latency. When high
performance is needed in a high-latency environment, it may not be possible to construct a
big data application at all. Besides this, latency also matters when working with data in near
real time; a high level of latency while dealing with real-time data can make the difference
between success and failure.

5.8. MAPREDUCE

Today, algorithms and applications collect data about people, processes, and organizations
24/7. This results in huge data volumes, which pose a major challenge: how to
process them quickly and efficiently without losing meaningful insights. This is
the point where the MapReduce programming model becomes useful. Google initially used it
for analysing its search engine results. Its ability to split terabytes of data and process them in
parallel to provide quick results made it very popular.

5.8.1. MapReduce in Hadoop framework

In the Hadoop framework, MapReduce is a programming model used for accessing
big data stored in HDFS (the Hadoop Distributed File System). It is vital to the Hadoop
framework's functioning. Petabytes of data are split into smaller chunks, which are processed
in parallel on Hadoop commodity servers. The data from the multiple servers is then aggregated
at the end to return a consolidated output to the application.

Example:

A Hadoop cluster with 20,000 commodity servers, each holding 256 MB blocks of data, can
process about 5 TB of data at a time. Processing time is thus reduced compared with the
sequential processing of such a large data set.

With MapReduce, the logic is executed on the server where the data already resides, instead
of sending the data to the location where the application or logic resides. This makes
processing faster. The input and output are stored in the form of files.

Initially, MapReduce was just a way of retrieving the data stored in HDFS, but today there
are query-based systems for data retrieval from HDFS through the use of SQL-like
statements.

However, these usually run along with jobs that are written using the MapReduce model.
That's because MapReduce has unique advantages.
5.8.2. Terms related to MapReduce model

Map:

It is a user-defined function that generates zero or more key-value pairs by taking a series of
key-value pairs and processing each of them.

Intermediate Keys:

These are the pairs of key-values that are generated by the mapper.

Input Phase:

In this phase, a Record Reader translates each record in the input file and sends the parsed
data to the mapper in the form of key-value pairs.

Output Phase:

This phase has an output formatter for translating the Reducer function's final key-value
pairs. These are written onto a file using a record writer.

Combiner:

It is a type of local Reducer that groups similar kinds of data from the map phase into
identifiable sets. It takes the intermediate keys from the mapper as its input, and user-defined
code is then applied to aggregate the values within one mapper's small scope. This doesn't
form a part of the primary MapReduce algorithm and is optional.

Reducer:

The grouped key-value paired data is taken as input by the Reducer, and a Reducer function
is run on each group. There are various ways of aggregating, filtering, and combining data
in this step, and it can require a wide range of processing. Upon completion of execution,
zero or more key-value pairs are passed to the final step.

Shuffle and Sort:

This is the step where the Reducer task starts. The grouped key-value pairs are downloaded
onto the local machine where the Reducer is running. The individual key-value pairs are
sorted by key into a larger data list; the data list groups equivalent keys together so that their
values can be iterated over easily in the Reducer task.
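
One common way to supply the Map and Reduce functions in Python is Hadoop Streaming, where the mapper and reducer read and write key-value pairs as tab-separated lines on standard input and output. The word-count sketch below is a minimal illustration under that assumption; in a real job the two functions would normally live in separate scripts passed to the streaming utility:

import sys

# Mapper: emits one "word<TAB>1" pair per word (the Map phase).
def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

# Reducer: its input arrives sorted by key, so equal words are adjacent
# (the Shuffle and Sort step is done by the framework); it sums the
# counts for each word and emits one total per word.
def reducer():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")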

5.8.3. Working of MapReduce model

Map and Reduce are the two vital tasks of the MapReduce algorithm.
Map task: It takes a data set and converts it into another data set in which individual elements
are broken down into tuples (key-value pairs).

Reduce task: It is performed after the map job; the output of the Map task is treated as its
input, and it combines those data tuples into a smaller set of tuples.

Fig. 5.8.3
Source: https://www.tutorialspoint.com/map_reduce/images/phases.jpg

5.8.4. MapReduce algorithm

Input to the Mapper class is tokenized, mapped, and sorted. Its output is used as input for the
Reducer class, which then searches for matching pairs and reduces them.

Figure 5.8.4
Source: https://www.tutorialspoint.com/map_reduce/images/mapper_reducer_class.jpg

The mathematical algorithms used for dividing a task into small parts and assigning them to
multiple systems are as follows:

a. Sorting

This is a basic algorithm that processes and analyses the data. The output key-value pairs are
sorted automatically from the mapper by their keys. Implementation of the sorting methods is
done in the mapper class itself.

In the Shuffle and Sort phase, once the values in the mapper class are tokenized, the
matching valued keys are collected by the Context class as a collection. The RawComparator
class is used for collecting and sorting similar key-value pairs. Hadoop automatically sorts
the set of intermediate key-value pairs destined for a given Reducer by key before they are
presented to the Reducer.

b. Searching:

It is helpful in the combiner and Reducer phase.

c. Indexing

This is used for pointing to a particular data and its address. Batch indexing is performed on
input files for a certain mapper. We call the indexing technique an inverted index in
MapReduce.

d.TF-IDF (Term-Frequency − Inverse Document Frequency)

It is a web analysis algorithm for processing text. Term frequency refers to the number of
times a term appears in a document.

Term frequency (TF) is calculated by dividing the number of times a word appears in a
document by the total number of words in that document.

Inverse Document Frequency (IDF) is calculated by dividing the number of documents in a


text database by the number of documents in which a specific term appears; in practice, a
logarithm is usually applied to this ratio.
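
The two formulas can be checked with a few lines of Python. The tiny corpus and the term below are made up for illustration, and a logarithm is applied to the IDF ratio, which is the usual refinement of the plain division described above:

import math

docs = [
    "big data needs big storage",
    "spark processes big data",
    "neural networks learn from data",
]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)        # term frequency

def idf(term, corpus):
    containing = sum(1 for d in corpus if term in d.split())
    return math.log(len(corpus) / containing)    # inverse document frequency

term = "big"
print(tf(term, docs[0]))                    # 2 / 5 = 0.4
print(idf(term, docs))                      # log(3 / 2), about 0.405
print(tf(term, docs[0]) * idf(term, docs))  # TF-IDF, about 0.162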

Example:

Let us consider an example in which Twitter receives 500 million tweets in a day, which is
nearly 6,000 tweets every second. With the help of the MapReduce algorithm, the
following actions are taken:

a. Tokenize:
The tweets are tokenized into maps of tokens and written as key-value pairs.

b. Filter:

Unwanted words from the maps of tokens are filtered and written as key-value pairs.

c. Count:

A token counter is generated for each word.

d. Aggregate Counters:

Similar counter values are aggregated to form small manageable units.
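
The four steps above can be imitated in a few lines of plain Python (with no Hadoop involved); the tweets and stop words below are hypothetical stand-ins for real input:

from collections import Counter

tweets = [
    "MapReduce makes big data processing simple",
    "Spark and MapReduce process big data",
]
stop_words = {"and", "the", "a"}   # unwanted words to filter out

# a. Tokenize: break each tweet into (token, 1) key-value pairs.
pairs = [(word.lower(), 1) for tweet in tweets for word in tweet.split()]

# b. Filter: drop unwanted words from the maps of tokens.
pairs = [(word, one) for (word, one) in pairs if word not in stop_words]

# c./d. Count and aggregate: sum the counters for each token (the Reduce step).
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(counts["mapreduce"])   # 2
print(counts["big"])         # 2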

5.8.5. Advantages of MapReduce

Some benefits of MapReduce are as follows:

i. Scalable:

Hadoop is a scalable platform that stores and distributes large sets of data across various
servers. Inexpensive servers are used here, which can work in parallel, and more servers can
be added to enhance the system's processing power. In contrast, traditional RDBMSs cannot
easily be scaled out to process very large sets of data.

ii. Flexible:

Structured and unstructured data can be processed with the MapReduce programming model
to generate business value out of them. Various languages are supported by Hadoop for data
processing. It also has various applications, such as recommendation systems, marketing
analysis, data warehousing, and fraud detection.

iii. Secured:

When an outsider gets access to an organization's data, they can manipulate it to harm the
business operation. This risk is mitigated by the MapReduce programming model, as it works
with HDFS and HBase, which allow only approved users to operate on the data stored in
the system.

iv. Cost-efficient:

Since the system is highly scalable, it is a cost-efficient option for current-day requirements.
Businesses do not need to downsize their data in this case, as they often must with
traditional RDBMSs.

v. Faster data processing:


Hadoop's key feature is HDFS, which is a mapping system used for locating data in a cluster.
MapReduce processes large volumes of structured and semi-structured data on the same
servers where the data is located, ensuring faster processing speeds.

vi. Uses simple programming language:

Simple Java programming forms the basis of MapReduce, which enables programmers to
create programs that handle many tasks easily and efficiently. People can learn it easily and
design data processing models that meet their business requirements.

vii. Parallel processing:

MapReduce ensures parallel processing by dividing a task into independent tasks. This
makes the process easier, and also less time is needed for running the program.

viii. Fault tolerance:

Data is processed in the MapReduce programming model by sending them to individual


nodes and then forwarding the same to other nodes in the network. This ensures that, in case
of failure of a node, the data can still be accessed from other nodes. This makes Hadoop
fault-tolerant. Also, in case of a fault, it is recognized quickly, and a quick fix is applied for
an automatic recovery solution.

5.8.6. Disadvantages of MapReduce

i. Rigid:

MapReduce has a rigid framework. In its flow of execution, there can be 1 or more mappers
and 0 or more reducers. You can do a job using MapReduce only when it can be executed in
this framework.

ii. Too much manual coding:

This is needed for the common operations like join, aggregate, sorting, distinct, filter, and
others.

iii. Hidden semantics:

Inside the Map and Reduce functions, the semantics have been hidden, which makes
maintenance, extension, and optimization quite difficult.
5.9. SPARK RDD

RDD or Resilient Distributed Dataset is a distributed collection of data elements, which are
partitioned across nodes in the cluster. It is Apache Spark's fundamental data structure. All
the datasets in Spark RDDs are logically partitioned across various servers for computing
them on different nodes of the cluster. RDDs can be created in Spark in 3 ways:

a. Data in stable storage

b. Other RDDs

c. Parallelizing an existing collection in the driver program.

You can cache a Spark RDD and partition it manually. Caching is useful when an RDD is used
multiple times, while manual partitioning helps balance the partitions correctly.
With smaller partitions, an RDD can be distributed more evenly among the
executors, which makes the work of each executor lighter.

When programmers want to indicate which RDDs are to be reused in future operations,
they can call a persist method. Persisted RDDs are kept in memory by Spark by default,
but in case of insufficient RAM they are spilled to disk. Other persistence strategies
include storing the RDD only on disk or replicating it across machines, chosen through flags
passed to persist.

Spark has overtaken Hadoop MapReduce for many workloads owing to the benefits it provides,
such as quicker execution of iterative and interactive processing algorithms.
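
A minimal PySpark sketch of these ideas is shown below, assuming a local Spark installation. It creates an RDD by parallelizing a collection in the driver program, applies transformations that are only recorded at first, persists the result for reuse, and finally computes it with an action; the input lines and application name are made up for the example:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")   # run Spark on all local cores

# Create an RDD by parallelizing an existing collection in the driver program.
lines = sc.parallelize([
    "spark keeps intermediate results in memory",
    "rdds are immutable and partitioned",
])

# Transformations are only recorded here, not executed yet.
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

counts.persist()   # ask Spark to keep this RDD in memory for reuse

# The transformations run only when an action such as collect() needs a result.
print(counts.collect())

sc.stop()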

5.9.1. Reasons for using RDD

In computations like the PageRank algorithm, logistic regression, and k-means clustering,
data needs to be processed over many jobs, so reusing or sharing data among multiple jobs is
quite common. Users may also need to run many ad hoc queries over a shared set of data. In
earlier distributed computing systems like MapReduce, there is an underlying problem of
storing intermediate results in a stable distributed store like Amazon S3 or HDFS. This
results in slower job computations, as many I/O operations, replications, and serializations are
involved in the procedure.

5.9.2. When should RDDs be used?

Here are the instances when you should use RDDs:

i. When you need low-level transformations and actions on your dataset.


ii. You have unstructured data like media streams or text streams.

iii. For manipulating data with functional programming constructs rather than expressions
that are domain-specific.

iv. When you are ready to let go of some optimization and performance benefits for
structured and semi-structured data that comes with DataFrames and Datasets

v. Imposing a schema is not something you are concerned about.

5.9.3. Features of Spark RDD

i. In-memory computation:

This implies that Spark RDDs store intermediate results in RAM (distributed memory)
rather than in stable disk storage.

ii. Results are not computed right away:

Apache Spark just remembers the transformations applied to a base data set instead of
computing the results instantly. Only when an action needs a result for the driver program
are the transformations computed by Spark; this is known as lazy evaluation.

iii. Tolerance to a fault:

In case of a failure, Spark RDDs can automatically track data lineage information for
rebuilding the lost data.

iv. Immutability:

Since RDDs cannot be modified after creation, sharing data across processes is safe. Data can
be recreated or retrieved at any time, which makes caching, sharing, and replication simpler
and ensures consistency in computations.

v. Partitioning:

Every partition is a logical division of data and is immutable. New partitions can be created
by applying transformations to existing partitions.

vi. Persistence:

Users have the choice to state the RDDs that they want to reuse and also specify a storage
strategy for the same.

vii. Coarse-grained Operations:

Operations such as map, filter, or group-by are applied to all elements of the dataset rather
than to individual elements.
viii. Defining placement preference of computation partitions:

RDDs are capable of doing this, which is termed location-stickiness. Information regarding
an RDD's location is referred to as its placement preference. In order to enhance computation
speed, the DAGScheduler places the computation of a partition as close to the data as possible.

5.9.4. Demerits of Spark RDD

Here are some of the drawbacks of Spark RDD:

i. Lack of inbuilt optimization engine:

Developers have to optimize each RDD on the basis of its attributes while working with
structured data because, in this scenario, RDDs cannot take advantage of Spark's advanced
optimizers, such as the Catalyst optimizer and the Tungsten execution engine.

ii. Users have to specify the schema:

RDDs can't infer the schema of ingested data, so users have to specify it themselves.

iii. Limited performance:

RDDs are in-memory JVM objects, so as the data grows there are overheads like garbage
collection and Java serialization, which limit performance.

5.10 ARTIFICIAL NEURAL NETWORK

This is a model for processing information inspired by biological nervous systems such as the
brain, which process information in a similar way. These networks are used for various
tasks, one of which is classification. For example, you can take images of various types of
birds and train a neural network with these pictures so that, when a new image is presented,
it can identify which bird it is and show the degree of resemblance.

In the same way, artificial neural networks have applications in character recognition, self-
driving cars, compression of images, predicting the stock market movements, and many
more.

5.10.1 Working of artificial neural networks

An artificial neural network makes sense of its input information through several layers of
mathematical processing. The neurons, or units, can range from dozens to millions and are
arranged in a series of layers. Data moves from the input units to the hidden
units, which transform it into something the output units can use.

Most artificial neural networks are completely connected from one layer to another,
and weights are assigned to the connections; the higher the weight, the greater the influence
of one unit on another. As the data moves through each unit, the network learns
more about it.

5.10.2 An artificial neural network's process of learning

Artificial neural networks are able to learn quickly, which makes them so powerful. While
training the model, information patterns are fed from the data set into the network through
the input neurons. These trigger the hidden neurons, which in turn reach the output neurons;
this is the feedforward pass of the network.

Every neuron receives inputs from the neurons to its left, and as the signals travel along, they
are multiplied by the weights of the corresponding connections. In the simplest networks,
each neuron adds up the inputs it receives; once the sum crosses a threshold value, the neuron
"fires" and triggers the neurons connected to its right.

In order to learn, the network needs feedback on what it got right and wrong; this feedback
process is termed backpropagation, or 'backprop'. Our brain learns in a similar manner. Over
time, it helps the network learn by reducing the gap between the actual output and the
intended output until both of them match.
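
Backpropagation itself involves calculus, so as a simplified stand-in the sketch below trains a single neuron with the classic perceptron learning rule: a feedforward pass (weighted sum plus threshold) followed by feedback that nudges the weights until the actual output matches the intended output. The dataset (the logical AND function), the learning rate, and the threshold are all illustrative choices:

# A single artificial neuron: weighted sum of the inputs, then a threshold.
def fire(inputs, weights, threshold=0.5):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Tiny labelled dataset: the neuron should learn the logical AND function.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

weights = [0.0, 0.0]
learning_rate = 0.1

# Feedback loop: adjust each weight in proportion to the output error.
for _ in range(20):
    for inputs, target in data:
        error = target - fire(inputs, weights)
        weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]

print(weights)                                 # the learned weights
print([fire(x, weights) for x, _ in data])     # expected: [0, 0, 0, 1]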

5.10.3 Supervised and unsupervised learning in artificial neural networks

In the supervised type of learning, the data in the dataset is labelled. The training data contains
preset training examples, each consisting of a pair: an input object (a vector) and the desired
output value (the supervisory signal).

In the case of unsupervised learning, machine learning algorithms are used for drawing
inferences from datasets that have unlabelled inputs. Cluster analysis is the most common
unsupervised learning method used in exploratory data analysis for finding the hidden
patterns or grouping in data.

5.10.4 Types of neural networks

i. Perceptron:
This is a single-layer neural model with an input layer and an output layer but no hidden
layers. The input is taken, the weighted input is computed for each node, and an activation
function is then applied for classification.

Fig 5.10.4.1
Source: https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/05/26143602/Blog-images_21_5_2020-01-630x420.jpg

ii. Feedforward neural network:

Here, the nodes never form a cycle. The perceptrons are arranged in layers: input is taken by
the input layer, and output is generated by the output layer. The hidden layers have no link
with the outside world. Every perceptron in this model is linked to each node in the next
layer, so all the nodes are fully connected, and there is no association between nodes within
the same layer. It doesn't have any back loops. To reduce the chance of error while making a
prediction, the backpropagation algorithm is used for updating the weight values.

Some practical implementations include data compression, pattern recognition, speech
recognition, and others.
Fig: 5.10.4.2.
Source: https://d1m75rqqgidzqn.cloudfront.net/2019/11/feed-foward-nn-infograph1-300x226.jpg

iii. Radial Basis Functions Neural Network:

This type is used in function approximation problems and has a faster learning rate compared
to other neural networks. A Radial Basis Function is used as the activation function. A logistic
function gives an output of 0 or 1, so when continuous output values are needed this type of
neural network can't be used, which is its main demerit.

Some applications of this type include function approximation, classification, system control,
and others.

Fig: 5.10.4.3.
Source: https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/05/26145133/Blog-images_21_5_2020-02-630x420.jpg

iv. Kohonen Self-organizing Neural Network:

This makes use of an unsupervised learning algorithm and is also referred to as self-
organizing maps, which is beneficial when our data is scattered in various dimensions. It is a
dimensionality reduction method that is used for the visualization of high dimensional data.
Here, competitive learning is used instead of error correction learning.

There are two types of topologies in this, namely rectangular topology and hexagonal grid
topology.
Some practical applications include the management of coastal waters and the assessment and
prediction of water quality.

Fig 5.10.4.4.
Source: https://analyticsindiamag.com/wp-content/uploads/2018/01/SOM.png

v. Recurrent Neural Network

It is a variation of the feedforward network in which the neurons in the hidden layers receive
their input with a certain time delay. It comes into use when previous information is needed in
the current iteration. Historical information is considered in this type of model, and its size
doesn't increase with the size of the input. Slow computation is one of its drawbacks; besides
this, it doesn't take any future input into account for the current state and can't remember
information for a long time.

Some practical applications include rhythm learning, speech synthesis, robot control, and
others.
Fig. 5.10.4.5.
Source: https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/05/26145050/Blog-images_21_5_2020-03-630x420.jpg

vi. Convolution Neural Network

They are primarily used for classifying images, clustering them, and recognizing objects.

Some practical applications of this type include Video analysis, NLP, drug discovery, and
others.

Fig. 5.10.4.6.
Source: https://d1m75rqqgidzqn.cloudfront.net/wp-data/2019/11/07200605/convolutional-nn.jpg

5.10.5 Comparing neural networks with classical AI

AI programs are based on classical software principles. The programs have a logical
sequence; they do operations on the data stored in the memory locations and store the results
in a different memory location. They are deterministic and follow rules that are clearly
defined.
But in the case of neural networks, the operations are not sequential or deterministic. There is
only the underlying hardware, with no central processor controlling the logic as in classical
AI. The logic is instead dispersed across a large number of small artificial neurons, each
performing mathematical operations on the input it receives.

5.10.6 Drawbacks of Artificial neural networks

Though the artificial neural network has been designed to mimic the biological neural
network, the two are quite different. In no way do artificial neural networks have the
intelligence of a human brain. Let us see some of their demerits:

a. Lot of input is needed:

Human brains can work efficiently with fewer examples, but in the case of artificial neural
networks, thousands or millions of examples are needed to attain a standard level of
accuracy.

b. They can't generalize:

A neural network will perform accurately only on the task it has been trained for; we can't
expect good performance even on tasks quite similar to it. For example, if an ANN has been
trained on pictures of cats, it cannot identify dogs unless it is also trained on thousands of
dog images.

c. They are opaque:

A network's behaviour is expressed only through neuron weights and activations, so it is
difficult to understand the logic behind its decisions. This is why neural networks are
referred to as black boxes.

5.11 DEEP LEARNING

Artificial intelligence is a field that involves an effort to make machines perform actions like
those of humans. It involves machine learning in which machines are trained to learn by
experience and acquire skills so that they can act without any intervention from humans.
Deep learning is a subset of machine learning based on artificial neural networks, in which
algorithms inspired by the human brain learn from large amounts of available data. Just as we
do, a deep learning algorithm performs a task repeatedly, tweaking it each time to improve
the outcome.
We solve a problem by thinking about it to figure out a solution. In the same way, deep
learning can enable machines to resolve issues by using the roughly 2.5 quintillion bytes of
data generated every day. Since machines require a lot of data to learn, deep learning has
grown with the rise in data creation today; stronger computing power is another major reason
for this growth. With deep learning, machines can solve complex problems while using
diverse, interconnected, and unstructured data sets.

5.11.1 Practical examples of deep learning

i. Translation:

Deep learning algorithms are used for automatically translating languages, which is a huge
advantage for businesses, governments, and travellers.

Example:

Input: The cat likes to eat pizza.

Its Spanish translation will be:

Al gato le gusta comer pizza.

ii. Virtual assistants:

The virtual assistants of online service providers use deep learning to understand the speech
and language humans use when interacting with them.

Example: Alexa, Siri, Cortana are some such virtual assistants.

iii. Autonomous vehicles:

This includes driverless delivery cars, drones, and other similar vehicles, for which a deep
learning algorithm acts as their vision, guiding them on how to move safely on roads by
following street signs and traffic rules.

Examples:

Some cars with self-driving features are the Tesla Model S and the Cadillac CT6.

iv. Facial recognition:

With deep learning, the facial recognition feature is used for security purposes besides
tagging people in social media posts. But it comes with a demerit: it may fail to recognize
people when their hairstyle changes, when they have shaved or grown a beard, when the
image was taken in low light, and in similar instances.
Example:

Facial recognition is used for identification of people on social media such as Facebook,
aiding in forensic investigations, unlocking phones, preventing retail crimes and helping the
blind to understand social situations in a better way.

v. Chatbots:

Many companies use chatbots to provide services to people. These help in responding to
people in an intelligent way that customers find useful. In this case, deep learning occurs
through a large amount of auditory and text data.

vi. Colorizing an image:

Colouring a black-and-white image used to be a cumbersome job, but with the help of a deep
learning algorithm, the objects in the image can be recoloured accurately, with impressive
results.

vii. Pharmaceuticals:

Deep learning has a major role to play in the medical field in diagnosing diseases and
creation of various types of medicines.

Example:

Supervised learning is useful in personalised treatment and behavioural modification.


Machine learning can be used for screening the drug compounds for prediction of their
success rate on the basis of biological parameters. Smart health records can also be
maintained with it.

viii. Personalization in shopping:

Whether it is entertainment or shopping, customers always love a personalized experience.
Deep


learning helps such platforms come up with personalized suggestions, so you see relevant
ads and recommendations while shopping or searching for a movie or TV show.

5.12 SUMMARY

● A database is an integrated collection of related files. A DBMS provides a convenient
environment for information storage and retrieval with the help of a query language. It
prevents unauthorized access to the data, thus ensuring security.

● The relational database model is based on E. F. Codd's relational model.


● RDBMS forms the basis for SQL as well as other modern database systems such as
MySQL, Microsoft Access, IBM DB2, and others.

● SQL is a computer language that can be used for storing, manipulating, and retrieving
data from a relational database.

● The collection of large data sets that can't be processed using traditional computing
techniques is known as Big Data. These data can be structured, unstructured, or semi-
structured.

● MapReduce is a programming pattern in the Hadoop framework. It is the core component


that is very important for the functioning of this framework and is used for accessing big
data in the HDFS.

● RDD is Spark's fundamental data-structure that is an immutable distributed collection of


objects. The datasets are divided into logical partitions.

● The artificial neural network has been built in an effort to create functionality similar to
the biological neural system. It forms the basis for Artificial intelligence and helps in
solving problems that can't be easily resolved by humans.

● Deep learning is a function of Artificial intelligence that tries to mimic the functioning of
a human brain for data processing and creating patterns that can be used to make
decisions.

5.13. SELF-ASSESSMENT QUESTIONS

A. Descriptive Type Questions

1. What is the MapReduce algorithm? State its merits and demerits.

2. What are the different types of NoSQL databases? When should we use NoSQL?

3. What is the need for distributed computing for big data?

4. How is RDBMS different from NoSQL?

5. Write short notes on the single-layer feedforward network.

B. Multiple Choice Questions

1. What does "Velocity" in Big Data mean?

a) Speed of input data generation


b) Speed of individual machine processors

c) Speed of ONLY storing data

d) Speed of storing and processing data

2. Sliding window operations typically fall in the category of__________________.

a) OLTP Transactions

b) Big Data Batch Processing

c) Big Data Real-Time Processing

d) Small Batch Processing

3. What is HBase used as?

a) Tool for Random and Fast Read/Write operations in Hadoop

b) Faster Read-only query engine in Hadoop

c) MapReduce alternative in Hadoop

d) Fast MapReduce layer in Hadoop

4. Which of the following are the core components of Hadoop?

a) HDFS

b) Map Reduce

c) HBase

d) Both (a) and (b)

5. What is the default HDFS block size?

a) 32 MB

b) 64 KB

c) 128 KB

d) 64 MB

6. What is the default HDFS replication factor?

a) 4

b) 1

c) 3
d) 2

7. The mechanism used to create replicas in HDFS is____________.

a. Gossip Protocol

b. Replicate Protocol

c. HDFS Protocol

d. Store and Forward Protocol

8. Where is the HDFS replication factor controlled?

a. mapred-site.xml

b. yarn-site.xml

c. core-site.xml

d. hdfs-site.xml

9. Which of the following is the correct sequence of MapReduce flow?

a. Map > Reduce > Combine

b. Combine > Reduce > Map

c. Map > Combine > Reduce

d. Reduce > Combine > Map

10. Which of the following types of joins can be performed in the Reduce side join
operation?

a. Equi Join

b. Left Outer Join

c. Right Outer Join

d. Full Outer Join

e. All of the above

11. Fuzzy logic is a form of

a) Two-valued logic

b) Crisp set logic

c) Many-valued logic
d) Binary set logic

12. Traditional set theory is also known as Crisp Set theory.

a) True

b) False

(Explanation: In traditional set theory, set membership is fixed or exact: a member is either in
the set or not, so there are only two crisp values, true or false. In fuzzy logic there are many
values; a member belongs to the set with some weight, say x.)

13. The values of the set membership is represented by

a) Discrete Set

b) Degree of truth

c) Probabilities

d) Both b & c

14. Where does the Bayes rule can be used?

a) Solving queries

b) Increasing complexity

c) Decreasing complexity

d) Answering a probabilistic query

15. Which condition is used to influence a variable directly by all the others?

a) Partially connected

b) Fully connected

c) Local connected

d) None of the mentioned

16. A perceptron is:

a) a single layer feed-forward neural network with pre-processing

b) an auto-associative neural network

c) a double layer auto-associative neural network

d) a neural network that contains feedback


17. What are the advantages of neural networks over conventional computers?

(i) They have the ability to learn by example

(ii) They are more fault-tolerant

(iii)They are more suited for real-time operation due to their high 'computational' rates

a) (i) and (ii) are true

b) (i) and (iii) are true

c) Only (i)

d) All of the mentioned

18. Which is true for neural networks?

a) It has a set of nodes and connections

b) Each node computes its weighted input

c) Node could be in an excited state or non-excited state

d) All of the mentioned

19. What is backpropagation?

a) It is another name given to the curvy function in the perceptron

b) It is the transmission of error back through the network to adjust the inputs

c) It is the transmission of error back through the network to allow weights to be adjusted so
that the network can learn.

d) None of the mentioned

Answers:

1.d, 2.c, 3.a, 4.d, 5.d, 6.c, 7.c, 8.d, 9.c, 10.e, 11.c, 12.a, 13.b, 14.c, 15.b, 16. a, 17.d, 18.d, 19.c

5.14 SUGGESTED READINGS

Textbook References

1. Data Science, Classification, and Related Methods. Studies in Classification, Data


Analysis, and Knowledge Organization. Springer Japan.

2. Tony Hey; Stewart Tansley; Kristin Michele Tolle. The Fourth Paradigm: Data-intensive
Scientific Discovery. Microsoft Research.
3. Bell, G.; Hey, T.; Szalay, A. "COMPUTER SCIENCE: Beyond the Data Deluge."
Science.
Reference Books

1. Simon S Haykin and Simon Haykin, (1998), Neural Networks: A Comprehensive


Foundation, Pearson Education.

2. S. Rajasekaran and G.A. Vijayalakshmi Pai, (2010), Neural Networks: Fuzzy Logic, and
Genetic Algorithms.

3. D.K. Pratihar, (2008), Soft Computing

Websites

● Marr, B. (n.d.). Deep learning. Forbes.


https://www.forbes.com/sites/bernardmarr/2018/10/01/what-is-deep-learning-ai-a-simple-guide-with-8-practical-examples/?sh=4e5854958d4b

● Yegulalp, S. (2017, December 7). NoSQL. InfoWorld.


https://www.infoworld.com/article/3240644/what-is-nosql-databases-for-a-cloud-scale-future.html

● Spark RDD. (n.d.). Spark RDD. https://data-flair.training/blogs/spark-rdd-tutorial/
