MBA-DATA ANALYTICS - Data Science and Business Analysis - Unit 5
SEMESTER-I
All rights reserved. No Part of this book may be reproduced in any form without permission
in writing from Team Lease Edtech Pvt. Ltd.
CONTENT
STRUCTURE
5.2 Introduction
5.8 MapReduce
5.12 Summary
5.2 INTRODUCTION
A set of data stored in a computer is termed a database. Databases are generally maintained in
a structured way to ensure ease of accessibility. A relational database is a type of database that
uses a structure which lets users identify and access data in relation to other pieces of data
in the database. It is organized in the form of tables.
A table is made up of several rows and columns. The rows are referred to as records, and
each column has a descriptive name along with a certain data type. For example, a column
holding the age of people will have an integer data type, while name and country will have a
string data type.
Name Age Country
Natalia 32 Russia
John 34 USA
Rustom 37 India
The table given above has 3 columns for the name of the people, their age, and their country.
Here, name and country are of string data type; age is of integer data type. Each of the 3 rows
is a record for one person.
RDBMS is a program meant for creating, updating, and administering a relational database.
Generally, SQL language is used by the relational database management systems for
accessing the database. Let us learn more about SQL in the upcoming section.
The SQL syntax is not the same for all types of RDBMS. Let us see some of the popular ones
in the following section:
i. MySQL:
It is an open-source SQL database used for web application development. PHP is used for
accessing it. There are some merits of using MySQL, namely, reliability, ease of use,
inexpensive, and it has a huge developers' community who can readily answer the queries.
It has some demerits too. This includes a lack of advanced features that developers may need
and poor performance while scaling. Open-source development has slowed down since
Oracle took over MySQL.
ii. PostgreSQL:
It is an open-source SQL database that is not controlled by any corporation and is also
commonly used for web application development, with merits similar to those of MySQL.
PostgreSQL's main drawback is that it performs slower compared to some other databases and
is also less popular.
iii. Oracle DB:
This database is owned by Oracle Corporation, and it is not open source. Larger
applications use it, mainly the top banks in the banking industry. This is because it uses
powerful technology that is comprehensive and comes pre-integrated with business applications
and functionalities built mainly for banks.
Since it is not open source, it is not available for free and can even be very expensive. This is
its only drawback.
iv. SQL Server:
It is owned by Microsoft and is closed source. Generally, larger enterprise applications
use it. Express is its entry-level version offered by Microsoft, but when an application is
scaled up, it becomes quite expensive.
v. SQLite:
This is an open-source SQL database that can store a whole database in a single file. Since all the
data can be stored locally, you don't have to connect your database to a server. It is
generally used as the database in MP3 players, phones, PDAs, and other similar electronic
equipment.
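Because SQLite ships with Python, the idea of a single-file database is easy to try out. The following is an illustrative sketch, not part of the original text; the file name people.db and the table layout simply mirror the sample table shown earlier:

import sqlite3

# the whole database lives in this one file on disk; no server is involved
conn = sqlite3.connect("people.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, age INTEGER, country TEXT)")
cur.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Natalia", 32, "Russia"), ("John", 34, "USA"), ("Rustom", 37, "India")],
)
conn.commit()

# query the local file directly
cur.execute("SELECT name, country FROM customers WHERE age > ?", (33,))
print(cur.fetchall())        # -> [('John', 'USA'), ('Rustom', 'India')]
conn.close()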
5.4. STRUCTURED QUERY LANGUAGE (SQL)
It is a programming language that helps us to communicate with the data stored in the
RDBMS. In 1986, it became a standard of the American National Standards Institute (ANSI).
SQL has a syntax that bears a resemblance to the English language, making it easy for
writing, reading, and interpreting it.
Other variants of SQL are also used by many RDBMSs for accessing data in the tables.
SQLite is one of them, having a minimal number of SQL commands.
Syntax of SQL:
You can find one or more tables in a database, each of which has a name for identification
purposes. All tables contain records, which are the rows of data.
5.4.1. SQL Statements
The actions performed on the database are done using SQL statements.
For collecting all the records in the "Customers" table, we have to write:
SELECT * FROM Customers;
Note:
● SQL keywords are not case sensitive, but we will be mentioning all keywords in the
upper case for better identification.
● In the case of certain databases, a semicolon might be needed at the end of a SQL
statement for separating multiple statements that are written to be executed in a
single call to the server.
5.4.2. Important SQL commands and their syntax
i. SELECT:
It is used to get data from a database; the rows returned are stored in a result table called the result-set.
Syntax:
SELECT col1, col2, ...
FROM table_name;
Here, col1, col2, ... are the field names (column names) of the table from which you want to
select data.
For selecting all the fields, follow the syntax mentioned below:
SELECT * FROM table_name;
Example:
SELECT * FROM Customers WHERE Name = 'Rita';
This will give the details of Rita from the Customers table.
ii. UPDATE:
It is used to modify the existing records in a table.
Syntax:
UPDATE table_name
SET col1 = value1, col2 = value2, ...
WHERE condition;
Here, "WHERE" indicates which records need to be updated, and it is an optional part of
this syntax. If this part is omitted, then the UPDATE command will update all
records in the specified table.
iii. DELETE:
It is used to delete existing records from a table.
Syntax:
DELETE FROM table_name WHERE condition;
As with UPDATE, omitting the WHERE clause deletes all records in the table.
iv. INSERT INTO:
It is used to insert new records into a table.
Syntax:
INSERT INTO table_name (col1, col2, col3, ...)
VALUES (value1, value2, value3, ...);
The above syntax mentions the names of the columns along with the values which need to be
inserted.
In case you are adding values for all the columns, it is not required to specify the column
names in the syntax, but the order of the values should be as per the order of the columns in the
table, as mentioned below:
INSERT INTO table_name
VALUES (value1, value2, value3, ...);
If you need to insert data in specific columns only, then specify the target columns along
with the corresponding values.
v. CREATE DATABASE:
It is used to create a new SQL database.
Syntax:
CREATE DATABASE database_name;
Note:
It is important to have admin privileges prior to the creation of the database. After it is
created, type:
SHOW DATABASES;
vi. ALTER TABLE:
It is used for modifying an existing table. The actions performed by this keyword include the
addition and deletion of columns besides the modification of their data types.
The syntax for adding a column:
ALTER TABLE table_name
ADD column_name datatype;
For changing a column's data type, follow the syntax given below (the exact keyword varies by
RDBMS; MySQL uses MODIFY COLUMN, while SQL Server uses ALTER COLUMN):
ALTER TABLE table_name
MODIFY COLUMN column_name datatype;
vii. CREATE TABLE:
It is used for creating a new table in a database.
Syntax:
CREATE TABLE table_name (col1 datatype, col2 datatype, col3 datatype, ..., coln datatype);
Here, col1, col2,...col n indicate the names of the columns in the table, and the data type
indicates what type of data the column will hold, such as varchar, integer, and others.
For creating a copy of an existing table, use the following syntax:
CREATE TABLE new_table_name AS
SELECT col1, col2, ...
FROM existing_table_name
WHERE ....;
In this case, the new table will have the same column definitions. You can select some or all
columns. The creation of a new table from the existing table will fill up the new table with
values from the existing table.
viii. DROP TABLE:
It is for deleting a table, in which case all the information stored in it will be lost.
Syntax:
DROP TABLE table_name;
But if you just want to remove the data stored in the table instead of deleting the whole table,
follow this syntax:
TRUNCATE TABLE table_name;
ix. CREATE INDEX:
It is for index creation, i.e., creating a search key. Data is retrieved more quickly using an
index. The index won't be visible to the users, but it will simply speed up their searches.
You should note that if a table has indexes, then updating it takes longer, because the indexes
also need to be updated. So it is recommended to create indexes only on columns that will
frequently be searched.
Syntax:
CREATE INDEX index_name
ON table_name (col1, col2, ...);
If you want to avoid duplicate values, then follow the given syntax:
CREATE UNIQUE INDEX index_name
ON table_name (col1, col2, ...);
x. DROP INDEX:
It is used for deleting an index in a table. The exact statement differs between RDBMSs; in
MySQL, for example, it is written as:
ALTER TABLE table_name
DROP INDEX index_name;
5.5. BIG DATA STORAGE
Big data storage deals with the storage and management of data in a scalable manner, ensuring
that the requirements of applications that need access to the data are effectively met. An ideal
big data store allows a virtually unlimited amount of data storage, copes efficiently with high
rates of random read and write access, flexibly deals with various data models, supports
structured as well as unstructured data, and, for privacy reasons, works only on encrypted data.
Though it is difficult to meet all these requirements in practical scenarios, the newly developed
storage systems at least partially address most of the challenges of volume, velocity, and
variety. They are generally not categorized as relational database management systems, but that
doesn't imply that RDBMSs don't address the challenges at all. In fact, alternative storage
technologies such as the Hadoop Distributed File System (HDFS) are simply an efficient and
less expensive option for many of these workloads.
Volume challenge:
Big data storage systems use distributed, shared-nothing architectures for addressing
higher storage requirements. They scale out to new nodes to provide additional computational
power and storage. It is possible to seamlessly add new machines to a storage cluster, after
which the storage system transparently distributes the data between the individual nodes.
Velocity challenge:
Velocity implies the time needed for getting a response to a query, which matters most
when there is a large amount of incoming data. Similarly, variety of data refers to the effort
needed for integrating and working with data that originates from numerous sources. Graph
databases, among other systems, address these challenges.
5.6. NOSQL
There are constraints on data types and consistency in the case of SQL databases, but in the
case of NoSQL, these constraints have been removed in favour of speed, scaling, and flexibility.
When an application is developed, it is quite essential to decide whether SQL databases
or NoSQL databases should be used for data storage.
There are different trade-offs offered by SQL and NoSQL databases, making each suitable
for different use cases. This will be clarified in the following points:
SQL or Relational Database ensures reliable transactions and responds to ad-hoc queries, but
they have certain restrictions, such as rigid schema that makes them unfit for some apps.
But in the case of the NoSQL database, data is stored and managed in a way that ensures
high operational speed and better flexibility for the developers. Horizontal scaling across
100s or 1000s of servers is possible in the case of NoSQL databases, unlike the SQL
databases.
The data consistency provided by NoSQL is not like that of SQL databases. This is because
performance and scalability have been sacrificed in the case of the SQL databases to abide by
the ACID properties for reliable transactions. But the NoSQL databases have prioritized
speed and scalability by ditching the ACID guarantees.
All data has an inherent structure in SQL databases. For example, a column may just have
integers only, resulting in a high degree of normalization. Hence aggregations like JOIN can
be easily performed on the data in a SQL database.
But in the case of NoSQL, data is stored in free form, i.e., you can store any data in any
record. This results in 4 types of databases, namely key-value based, column-oriented,
graph-based, and document-oriented, which we will be discussing in the upcoming
section.
NoSQL databases are a good fit in cases such as the following:
a. Quickly accessing the data, i.e., when speed and simplicity of access are the primary
concern rather than consistency or reliable transactions.
b. While storing data of large volume, you want to avoid getting locked into a schema as
changing it later would be difficult.
c. You want to retain the originality of the unstructured data obtained from multiple sources
for better flexibility.
d. When you need a hierarchical form of data that is defined by the data itself instead of by an
external schema. With NoSQL, the data can be self-describing, which is difficult to emulate
in SQL databases.
NoSQL systems are classified into 4 types, each having its data model. The types are as
explained below:
i. Document databases:
CouchDB and MongoDB are examples of document databases, in which the data is stored in
free form, much like JSON (JavaScript Object Notation). The stored values can be integers,
strings, Booleans, arrays, objects, or free-form text. Their structure is in alignment with the
objects that developers work with in code.
You can use document databases as general-purpose databases. It is possible to scale them
out horizontally to accommodate large volumes of data.
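As an illustration, not part of the original text, here is a minimal sketch of working with a document database using the pymongo driver; it assumes a MongoDB server is running locally on the default port, and the database, collection, and field names are invented for the example:

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
db = client["shop"]                      # databases and collections are created lazily

# each document is free-form, JSON-like data; no schema has to be declared up front
db.customers.insert_one({"name": "Natalia", "age": 32, "country": "Russia",
                         "interests": ["chess", "cycling"]})

# query by matching fields inside the documents
doc = db.customers.find_one({"country": "Russia"})
print(doc["name"], doc["interests"])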
ii. Key-value databases:
In this type of database, every item has a key and a value. The values can be anything from
simple integers or strings to complex JSON documents, all of which are accessed by using
keys. Hence it is easy to learn how to query a certain key-value pair. This type is useful
when large amounts of data need to be stored, but no complex queries are needed for
retrieving them. Storing the preferences of users or caching are its common uses. Redis,
DynamoDB, and Riak are examples of this type.
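A small sketch, not from the original text, of the key-value style using the redis Python client; it assumes a Redis server is running locally, and the key names are made up:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# store and read back a user preference by key -- no tables, no schema
r.set("user:42:theme", "dark")
print(r.get("user:42:theme"))            # -> dark

# a typical caching pattern: the value expires automatically after 60 seconds
r.setex("cache:homepage", 60, "<rendered html>")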
iii. Column-oriented databases:
Here, data is stored in dynamic columns and rows. But unlike conventional SQL databases,
they provide better flexibility, as each row doesn't need to have the same columns. Hence this
type is also referred to as a two-dimensional key-value database. It is mainly useful when there
is a large amount of data to be stored and you can predict the query patterns. It is very useful
for user profile data and data related to the Internet of Things. HBase and Cassandra are
examples of this type.
iv. Graph databases:
Here, data is stored as nodes, and the relationships between the data are stored as edges. This
type is useful when the relationships between data points are as important as the data itself,
for example in social networks or recommendation engines. Neo4j is an example of this type.
SQL has a standardized query structure, which ensures that the basics remain the same while
handling certain operations differently. But in the case of the NoSQL database, it has its own
syntax when you want to manage data or make a query.
For example, CouchDB uses requests in the form of JSON sent
through HTTP for creating or retrieving documents from its database, while MongoDB uses a
command-line interface or language libraries that send JSON-like objects over a binary
protocol. Even where you can use SQL-like syntax for working with the data, it will
be very limited. For example, in Cassandra you can use the SELECT or INSERT keywords
just like in SQL, but there is no way of using the JOIN keyword, as that keyword doesn't exist
there.
In this type of design, every server node in the cluster operates independently, i.e., it doesn't
depend on any other node. For example, for returning a piece of data to the client, a node
doesn't need to consult any other node; consensus is not required from every single node. This
shared-nothing design is used by various conventional SQL systems also, but NoSQL systems
typically sacrifice consistency across the cluster to ensure better performance.
The closest node responds to the queries, which makes the process very fast. Resiliency and
scaling out are the other advantages of a shared-nothing architecture. Scaling out implies
spinning up new nodes in the cluster and waiting for them to synchronize with the others. In
case a node in a NoSQL database goes down, the other servers in the cluster will keep chugging
along. Even when fewer nodes are available to cater to the requests, all the data will still be
available.
Advantages of NoSQL:
i. Scalability is high:
Scalability lets NoSQL handle a large amount of data efficiently. Sharding means the data is
partitioned and placed on multiple machines in a way that preserves the order of the data (see
the sketch after this list). Vertical scaling implies the addition of more resources to the existing
machine, while horizontal scaling implies the addition of more machines for handling the data.
Horizontal scaling is easier to implement than vertical scaling. Cassandra and MongoDB are
examples of horizontally scaling databases.
ii. High availability:
NoSQL has an auto-replication feature, which makes it highly available: whenever there is a
failure, the data is replicated back to its last consistent state.
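To make the idea of sharding concrete, here is an illustrative sketch, not from the original text, of hash-based partitioning in Python; the keys, the number of shards, and the helper names are invented for the example:

import hashlib

NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}   # pretend each dict is a separate machine

def shard_for(key: str) -> int:
    # a stable hash of the key decides which machine stores the record
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:1", {"name": "Natalia"})
put("user:2", {"name": "John"})
put("user:3", {"name": "Rustom"})
print(get("user:2"))                          # looked up on the right shard
print({i: len(s) for i, s in shards.items()}) # records spread across the shards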
Disadvantages of NoSQL:
i. No wider focus:
This is because NoSQL is designed mainly for storage and has less functionality. In the field of
transaction management, relational databases are a better choice.
ii. Lack of standardization:
NoSQL doesn't have a reliable standard, which implies that it is very likely that two database
systems will be unequal.
iii. Complex data management:
Data management is not an easy task, even though the big data tools are meant for this
purpose. In NoSQL, data management is complex compared with a relational database, as
installing it and managing the data on a daily basis is quite hectic.
iv. Backup:
Backup is a weak point for some NoSQL databases; in databases like MongoDB, this problem
exists, which is a huge drawback for NoSQL.
v. Large document size:
This is true in database systems like MongoDB and CouchDB, where data is stored in JSON
format. This implies that documents can be very large, which calls for higher speed and high
network bandwidth. Also, descriptive key names are problematic, as they increase the size of
the document.
5.6.7. When should we opt for NoSQL?
i. When the amount of data that needs to be stored and retrieved is large.
iv. At the database level, there is no need to support Constraints and Joins.
v. Data growth is continuous and needs regular scaling for efficiently handling them.
5.7. DISTRIBUTED COMPUTING
This deals with the study of distributed systems in the field of computer science. There are
many nodes communicating through the network in a distributed system. A shared goal is
accomplished through the interaction of computers in each node.
Fig 5.6
Source: https://en.wikipedia.org/wiki/Distributed_computing
When a firm is considering a big data project, it is essential to understand the basics of
distributed computing first. Since the computing resources can be distributed in many ways,
there is not a single distributed computing model. For example, a set of programs can be
distributed on the same physical server, and for communicating and passing information,
messaging services can be used. Also, you can have multiple systems or servers, each having
its own memory and working in a coordinating manner for resolving an issue.
5.7.1. Need for distributed computing for big data
Not all issues need distributed computing. When there is no huge time constraint, you
can process a complex situation remotely through some specialized service. But when
companies needed to analyse complex data, they generally moved the data to an external entity
with a lot of spare resources for processing it. Since it was not
economically feasible to purchase enough computing resources for handling the emerging
requirements, companies had to wait to get the intended results. In some cases, companies
used to capture only certain sections of data instead of capturing everything due to the cost
factor. Even though all the data was needed, analysts had to make do with snapshots, hoping
to capture the right data at the right moment.
Gradually, a breakthrough in the hardware and software sectors brought a revolution in the
data management industry. With the increase in innovation and demand, power increased,
and the hardware cost decreased. Besides this, new software was developed for automating
processes like load balancing and optimization across large node clusters. This software
could also understand the performance level needed for certain workloads. The nodes were
treated as a single pool for computing storage and networking the assets. This enabled the use
of virtualization technology for the movement of processes to another node with no
interruption even when a node failed.
There has been a decrease in the cost of resources for computing and storage. The economics
of computing changed because of virtualization, as commodity servers could be clustered and
blades could be networked in a rack. Innovation in software solutions coincided with this
change, resulting in a significant improvement in the manageability of these systems.
Managing large quantities of data comes with a perennial problem: the effect of
latency. Latency is the delay within a system caused by delays in task execution, and it is a
problem in every aspect of computing.
Suppliers, customers, and partners can experience a notable difference in latency because of
distributed computing and parallel processing. Since speed, volume, and variety are the big
data requirements, various big data applications depend on low latency. When high
performance is needed, it may not be possible to construct a big data application in a
high-latency environment. Besides this, latency also matters when data must be handled in near
real time: while dealing with real-time data, a high level of latency can be the difference
between success and failure.
5.8. MAPREDUCE
Today, data is collected about people, processes, and organizations by the algorithms and
applications 24/7. This results in huge data volumes, which has a major challenge of how to
process them quickly in an efficient manner without the loss of meaningful insights. This is
the point where the MapReduce programming model becomes useful. Google used it initially
for analysis of its search engine results. Its potential to split and process terabytes of data in a
parallel manner for providing quick results made it very popular.
In the Hadoop framework, MapReduce is a programming model that is used for accessing
big data stored in HDFS (the Hadoop Distributed File System). It is vital for the Hadoop framework's
functioning. The petabytes of data are split into small chunks, which are processed in parallel
on the Hadoop commodity servers. The data from multiple servers are then aggregated at the
end for returning a consolidated output to the application.
Example:
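The original example is not reproduced here, so as an illustration, not from the original text, the classic word-count job can be sketched in plain Python; the map, shuffle-and-sort, and reduce steps below only simulate on one machine what Hadoop runs in parallel across many servers:

from collections import defaultdict

lines = ["big data needs big storage", "map reduce splits big jobs"]

# Map: emit a (key, value) pair of (word, 1) for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: group all values belonging to the same key together
grouped = defaultdict(list)
for word, count in sorted(mapped):
    grouped[word].append(count)

# Reduce: aggregate the grouped values into one result per key
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)   # -> {'big': 3, 'data': 1, 'jobs': 1, 'map': 1, ...}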
With MapReduce, there is the execution of the logic on the server where data already resides
instead of sending it to the location where the application or logic resides. This makes the
processing faster. The input and output are stored in the form of files.
Initially, MapReduce was just a way of retrieving the data stored in HDFS, but today there
are query-based systems for data retrieval from HDFS through the use of SQL-like
statements.
However, these usually run along with jobs that are written using the MapReduce model.
That's because MapReduce has unique advantages.
5.8.2. Terms related to MapReduce model
Map:
It is a user-defined function that generates zero or more key-value pairs by taking a series of
key-value pairs and processing each of them.
Intermediate Keys:
These are the pairs of key-values that are generated by the mapper.
Input Phase:
It is a phase of having a Record Reader for translation of each record in the input file. The
parsed data is sent to the mapper in the form of key-value pairs
Output Phase:
This phase has an output formatter for translating the Reducer function's final key-value
pairs. These are written onto a file using a record writer.
Combiner:
It is a type of local Reducer that groups similar kinds of data from the Map phase into
identifiable sets. It takes the intermediate keys from the mapper as its input, and then user-
defined code is applied for aggregating the values within the small scope of one mapper. It
doesn't form a part of the primary MapReduce algorithm and is optional.
Reducer:
The grouped key-value paired data is taken as input by the Reducer, and a Reducer function
is run on each of them. There are various ways of aggregating, filtering, and combining data
in this case, and this needs a wide range of processing. Upon the completion of execution,
zero or more key-value pairs are given to the final step.
Shuffle and Sort:
This is the step where the Reducer task starts. The grouped key-value pairs are downloaded
onto the local machine where the Reducer is running. A larger data set is formed by sorting the
individual key-value pairs by key. The data list groups the equivalent keys together so that
their values can be iterated easily in the Reducer task.
Map and Reduce are the two vital tasks of the MapReduce algorithm.
Map task: It takes a data set for converting it into another data set where tuples(key-value
pairs) are formed by breaking down individual elements.
Reduce task: It is performed after the map job, and the output from the Map task is treated
as an input in the Reduce task. It combines the data tuples to form a smaller set.
Fig. 5.8.3
Source: https://www.tutorialspoint.com/map_reduce/images/phases.jpg
Input to the Mapper class is tokenized, mapped, and sorted. Its output is used as input for the
Reducer class, which then searches the matching pairs and reduces them.
Figure 5.8.4
Source: https://www.tutorialspoint.com/map_reduce/images/mapper_reducer_class.jpg
The mathematical algorithms used for dividing a task into small parts and assigning them to
multiple systems are as follows:
a. Sorting
This is a basic algorithm that processes and analyses the data. The output key-value pairs are
sorted automatically from the mapper by their keys. Implementation of the sorting methods is
done in the mapper class itself.
In the Shuffle and Sort phase, once the values in the mapper class are tokenized, the
matching valued keys are collected by the Context class as a collection. The RawComparator
class is used for collecting and sorting similar key-value pairs. Hadoop automatically sorts
the set of intermediate key-value pairs for a given Reducer for forming key values before
being presented to the Reducer.
b. Searching:
Searching plays an important role in the MapReduce algorithm; it is used, for example, in the
optional combiner phase and in the Reducer phase to find records that match a given key.
c. Indexing
This is used for pointing to a particular data and its address. Batch indexing is performed on
input files for a certain mapper. We call the indexing technique an inverted index in
MapReduce.
d. TF-IDF:
It is a web analysis algorithm for processing text; the name is short for Term Frequency -
Inverse Document Frequency. Term frequency implies the number of times a term appears in a
document.
Term frequency is calculated by dividing the number of times a word appears in a document
by the total number of words in it.
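For instance, if the word "data" appears 5 times in a document of 100 words, its term frequency is 5 / 100 = 0.05. A small illustrative helper, not from the original text, makes the same calculation:

def term_frequency(term: str, document: str) -> float:
    words = document.lower().split()
    # occurrences of the term divided by the total word count
    return words.count(term.lower()) / len(words)

doc = "big data tools store data and move data quickly"
print(term_frequency("data", doc))   # 3 occurrences out of 9 words -> 0.333...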
Example:
Let us consider an example where Twitter receives 500 million tweets in a day, which works out
to several thousand tweets every second. With the help of the MapReduce algorithm, the
following actions are taken:
a. Tokenize:
The tweets are tokenized into maps of tokens and written as key-value pairs.
b. Filter:
Unwanted words from the maps of tokens are filtered and written as key-value pairs.
c. Count:
A counter is generated for every word, i.e., a count is emitted for each token.
d. Aggregate Counters:
The counts of similar counter values are aggregated into small, manageable units to produce the
final word frequencies.
Advantages of the MapReduce programming model:
i. Scalable:
Hadoop is a scalable platform that stores and distributes large sets of data across various
servers. Inexpensive servers are used here, which can work in parallel. For enhancing the
system's processing power, more servers can be added. Unlike this, RDBMSs can't be scaled
to process large sets of data.
ii. Flexible:
Structured and unstructured data can be processed with the MapReduce programming model
to generate business value out of them. Various languages are supported by Hadoop for data
processing. It also has various applications like recommendation system marketing analysis,
data warehousing, and fraud detection.
iii. Secured:
When an outsider gets access to an organization's data, he can manipulate them to harm the
business operation. This risk is mitigated by the MapReduce programming model as it works
with HDFS and HBase, which allows the approved users only to operate on the data stored in
the system.
iv. Cost-efficient:
Since the system is highly scalable, it is a cost-efficient option, as seen from the perspective
of current day requirements. Businesses do not need to be downsized in this case as in the
traditional RDBMSs.
v. Simple:
Simple Java programming forms the basis of MapReduce, which enables programmers to
create programs that can handle many tasks easily and in an efficient manner. People can
learn it easily and design data processing models to meet their business requirements.
vi. Parallel processing:
MapReduce ensures parallel processing by dividing a task into independent parts. This
makes the process easier, and less time is needed for running the program.
Disadvantages of the MapReduce programming model:
i. Rigid:
MapReduce has a rigid framework. In its flow of execution, there can be 1 or more mappers
and 0 or more reducers. You can do a job using MapReduce only when it can be executed in
this framework.
ii. Manual coding of common operations:
Hand-written code is needed for common operations like join, aggregate, sorting, distinct,
filter, and others.
iii. Hidden semantics:
Inside the Map and Reduce functions, the semantics are hidden, which makes
maintenance, extension, and optimization quite difficult.
5.9. SPARK RDD
RDD or Resilient Distributed Dataset is a distributed collection of data elements, which are
partitioned across nodes in the cluster. It is Apache Spark's fundamental data structure. All
the datasets in Spark RDDs are logically partitioned across various servers for computing
them on different nodes of the cluster. RDDs can be created in Spark in 3 ways:
a. Parallelized collections, i.e., by calling the parallelize method on an existing collection in the driver program
b. Other RDDs, i.e., by applying transformations to an existing RDD
c. External datasets, i.e., by referencing data in external storage such as HDFS, HBase, or a shared file system
You can cache a Spark RDD and also partition it manually. Caching is useful when an RDD is
used multiple times, while manual partitioning is better for balancing the partitions correctly.
With smaller partitions, an RDD can be distributed more equally among several
executors, and the work is spread more evenly.
When programmers want to indicate which RDDs have to be reused in future operations,
they can call the persist method. Spark keeps persistent RDDs in memory by default,
but in case of insufficient RAM, they will be spilled to disk. Other persistence strategies
that users can choose include storing the RDD only on disk or replicating it across machines,
through flags passed to persist.
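A minimal PySpark sketch, not from the original text, showing the three creation routes and persistence; it assumes pyspark is installed and runs in local mode, and the HDFS path is purely hypothetical:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[2]", "rdd-demo")

# (a) parallelize an existing collection, split into 4 partitions
nums = sc.parallelize(range(1, 11), numSlices=4)

# (b) derive a new RDD from an existing one; nothing runs yet (lazy evaluation)
squares = nums.map(lambda x: x * x)
squares.persist(StorageLevel.MEMORY_ONLY)    # keep it in RAM for reuse

print(squares.getNumPartitions())            # -> 4
print(squares.sum())                         # the first action triggers the computation
print(squares.filter(lambda x: x % 2 == 0).collect())

# (c) reference an external dataset (hypothetical HDFS path, also evaluated lazily)
logs = sc.textFile("hdfs:///data/logs.txt")

sc.stop()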
Spark has overtaken Hadoop MapReduce owing to the benefits it provides, such as quicker
execution of iterative algorithms and interactive processing.
When data needs to be processed over many jobs in computations like Page rank algorithm,
Logistic Regression, K-means clustering, reusing or sharing the data among multiple jobs is
quite common. It might be needed to do many ad hoc queries over a shared set of data. In
distributed computing systems like MapReduce, there is an underlying problem of having to
store intermediate data in some stable distributed store like Amazon S3 or HDFS. This
results in slower job computations, as many IO operations, replications, and serializations are
involved in the procedure.
Spark RDDs are a good choice in situations such as the following:
i. When you want low-level control over your dataset through basic transformations and actions
ii. When your data is unstructured, such as media streams or streams of text
iii. For manipulating data with functional programming constructs rather than expressions
that are domain-specific
iv. When you are ready to let go of some of the optimization and performance benefits for
structured and semi-structured data that come with DataFrames and Datasets
Features of Spark RDD:
i. In-memory computation:
This implies that Spark RDDs store intermediate results in RAM (distributed memory)
rather than in stable disk storage.
ii. Lazy evaluation:
Apache Spark just remembers the transformations applied to some base data set instead of
computing the results instantly. Only when an action needs a result for the driver program are
the transformations computed by Spark.
iii. Fault tolerance:
In case of a failure, Spark RDDs can use the tracked data lineage information to automatically
rebuild the lost data.
iv. Immutability:
Sharing data across processes is safe. The data can be created or retrieved anytime, because
of which caching, sharing, and replication becomes simpler. This ensures consistency in
computations.
v. Partitioning:
Every partition is a logical division of data and is mutable. The creation of a partition is
possible by transforming existing partitions.
vii. Persistence:
Users have the choice to state the RDDs that they want to reuse and also specify a storage
strategy for the same.
viii. Coarse-grained operations:
Operations are applied to all elements in the dataset through map, filter, or group-by operations.
ix. Defining placement preference of computation partition:
RDDs are capable of doing this, which is termed location-stickiness. Information regarding
the RDD's location is referred to as its placement preference. In order to enhance the
computation speed, the DAGScheduler places the partitions close to the data.
Limitations of Spark RDD:
i. Developers have to optimize each RDD on the basis of its attributes while working with
structured data, because in this scenario RDDs can't avail the benefits of Spark's advanced
optimizers, such as the Catalyst optimizer and the Tungsten execution engine.
ii. RDDs can't infer the schema of ingested data, because of which users have to specify it.
iii. RDDs are in-memory JVM objects, so when the data grows, there are overheads like
garbage collection and Java serialization, which limit the performance.
5.10. ARTIFICIAL NEURAL NETWORK (ANN)
This is a model for processing information that has been inspired by the biological nervous
system, such as the brain, which works in a similar way. These networks are used for various
tasks, one of which is classification. For example, you can get images of various types of
birds and train a neural network with these pictures such that it can identify when a new
image is presented to it, to show the percentage of resemblance as well as identify which bird
it is.
In the same way, artificial neural networks have applications in character recognition, self-
driving cars, compression of images, predicting the stock market movements, and many
more.
Various layers of mathematical processing are used for sensing the input information in the
Artificial neural network. The neurons or units can range from dozens to millions and are
arranged in the form of a series of layers. The data moves from the input unit to the hidden
units, which then transports it into something that can be used by the output unit.
Most of the artificial neural networks are completely connected from one layer to another,
and weights are assigned to the connections. The influence of one unit on another is more
when the number is higher. With the movement of data through each unit, the network learns
more about it.
Artificial neural networks are able to learn quickly, which makes them so powerful. While
training the model, information patterns are fed from the data set into the network through
the input neurons. This triggers the hidden neurons, which then arrive at the output neurons.
It forms the feedforward network.
Every neuron receives inputs from the neurons present on its left. When they travel along,
weights of the corresponding connections are multiplied with these inputs. This forms the
simplest neural network, in which each neuron adds up the inputs it receives. Once the sum
reaches a threshold value, the neuron "fires" and triggers the neurons it is connected to on
its right.
In order to learn, the network has to know what it got right and wrong, which is done through a
feedback process termed backpropagation ('backprop'). Our brain also learns in a similar
manner. Over time, this helps the network to learn by reducing the gap between the actual
output and the intended output until the two match.
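As a rough illustration of this feedback idea, not from the original text and greatly simplified, the snippet below nudges the weights of a single sigmoid neuron so that its output moves toward a target value:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.8])           # inputs to the neuron
target = 1.0                       # the intended output
w = np.array([0.1, -0.2])          # initial connection weights
b = 0.0                            # bias
lr = 0.5                           # learning rate

for _ in range(50):
    out = sigmoid(w @ x + b)       # forward pass: weighted sum, then activation
    error = out - target           # gap between actual and intended output
    grad = error * out * (1 - out) # how the error changes with the weighted sum
    w -= lr * grad * x             # adjust weights to shrink the gap
    b -= lr * grad

print(round(float(sigmoid(w @ x + b)), 3))   # output has moved much closer to the target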
In the supervised type of learning, the data in the dataset is labelled. There are preset training
examples in the training data: each is a pair consisting of an input object (vector) and the
desired output value (supervisory signal).
In the case of unsupervised learning, machine learning algorithms are used for drawing
inferences from datasets that have unlabelled inputs. Cluster analysis is the most common
unsupervised learning method; it is used in exploratory data analysis for finding hidden
patterns or groupings in data.
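To make the contrast concrete, here is an illustrative sketch, not from the original text, using scikit-learn; the tiny dataset and the choice of models are invented for the example:

import numpy as np
from sklearn.linear_model import LogisticRegression   # supervised: needs labels
from sklearn.cluster import KMeans                     # unsupervised: no labels

X = np.array([[1.0], [1.2], [0.9], [8.0], [8.3], [7.9]])
y = np.array([0, 0, 0, 1, 1, 1])        # supervisory signal: the desired outputs

clf = LogisticRegression().fit(X, y)    # learns from input/label pairs
print(clf.predict([[1.1], [8.1]]))      # -> [0 1]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                        # groups found without any labels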
5.10.4. Types of neural networks
i. Perceptron:
This is a single-layer neural model that includes only the input and the output layer; it doesn't
have any hidden layers. The input is taken, and the weighted input is calculated for each node.
Then an activation function is used for classification.
Fig 5.10.4.1
Source: https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/05/26143602/Blog-images_21_5_2020-01-630x420.jpg
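An illustrative sketch, not from the original text, of the weighted-input-plus-activation idea; the weights and threshold below are hand-picked so that a single perceptron reproduces a logical AND:

import numpy as np

def step(z):
    # activation function: output 1 when the weighted input reaches the threshold
    return (z >= 0).astype(int)

w = np.array([1.0, 1.0])                        # one weight per input node
b = -1.5                                        # bias sets the firing threshold
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # all possible binary inputs

weighted = X @ w + b                            # weighted input for each example
print(step(weighted))                           # -> [0 0 0 1], i.e. logical AND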
ii. Feedforward neural network:
Here, the nodes never form a cycle. The perceptrons are arranged in the form of layers: input
is taken by the input layer, and output is generated by the output layer. The hidden layers have
no link with the outer world. Every perceptron in this model is linked with each node in the
next layer, because of which all the nodes are fully connected. Also, there is no visible or
invisible association between nodes present in the same layer, and there are no back loops. For
reducing the chance of error while making a prediction, the backpropagation algorithm is used
for updating the weight values.
iii. Radial basis function neural network:
This type is used in function approximation problems and has a faster learning rate compared
to other neural networks. A radial basis function is used as the activation function. A logistic
function gives 0 or 1 as output; when we have continuous values, we can't use this type of
neural network, which is its main demerit.
Some applications of this type include function approximation, classification, system control,
and others.
Fig:5.10.4.3.
Source: https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/05/26145133/Blog-images_21_5_2020-02-630x420.jpg
iv. Kohonen self-organizing neural network:
This makes use of an unsupervised learning algorithm and is also referred to as a self-
organizing map, which is beneficial when our data is scattered across many dimensions. It is a
dimensionality reduction method that is used for the visualization of high-dimensional data.
Here, competitive learning is used instead of error-correction learning.
There are two types of topologies in this, namely rectangular topology and hexagonal grid
topology.
Some practical applications include the management of coastal waters and assessing and
predicting water quality.
Fig 5.10.4.4.
Source: https://analyticsindiamag.com/wp-content/uploads/2018/01/SOM.png
v. Recurrent neural network:
It is a variation of the feedforward network in which the neurons in the hidden layers receive
input with a certain delay in time. It comes into use when previous information is needed in the
current iteration. Historical information is considered in this type of model, and its size doesn't
increase with an increase in the input size. Slow computation is one of its drawbacks. Besides
this, it doesn't take into account any future input for the current state and can't remember
information for a long time.
Some practical applications include rhythm learning, speech synthesis, robot control, and
others.
Fig. 5.10.4.5.
Source: https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/05/26145050/Blog-images_21_5_2020-03-630x420.jpg
vi. Convolutional neural network:
This type contains one or more convolutional layers that apply filters across the input. These
networks are primarily used for classifying images, clustering them, and recognizing objects.
Some practical applications of this type include video analysis, NLP, drug discovery, and
others.
Fig.5.10.4.6.
Source: https://d1m75rqqgidzqn.cloudfront.net/wp-data/2019/11/07200605/convolutional-nn.jpg
AI programs are based on classical software principles. The programs have a logical
sequence; they do operations on the data stored in the memory locations and store the results
in a different memory location. They are deterministic and follow rules that are clearly
defined.
But in the case of neural networks, the operations are not sequential or deterministic. It just
has the underlying hardware but no central processor for controlling the logic, as in the case
of the classical AI. The logic is rather dispersed across a large number of small artificial
neurons. They perform mathematical operations on the input received.
Though the artificial neural network has been designed to mimic the biological neural
network, the two are quite different. In no way do they have the intelligence of a
human brain. Let us see some of their demerits:
Human brains can work efficiently with fewer examples, but in the case of artificial neural
networks, thousands or millions of examples are needed to attain a standard level of
accuracy.
A neural network will perform accurately only on the task it has been trained for; apart
from that, we cannot expect good performance even on something quite similar. For
example, if an ANN has been trained on pictures of cats, it cannot identify dogs; you would
need to train it with thousands of dog images for it to be able to do so.
A neuron's behaviour is expressed only through its weights and activations, so
it is difficult to understand the logic behind the network's decisions. This is the reason neural
networks are referred to as black boxes.
5.11. DEEP LEARNING
Artificial intelligence is a field that involves an effort to make machines perform actions like
those of humans. It involves machine learning in which machines are trained to learn by
experience and acquire skills so that they can act without any intervention from humans.
Deep learning is a subset of machine learning in which artificial neural networks, algorithms
inspired by the human brain, learn from large amounts of available data. Just like us,
a deep learning algorithm performs a task repeatedly, tweaking it a little each time to
enhance the outcome.
We solve a problem by thinking about it to figure out a solution. In the same way, deep
learning enables machines to resolve issues by using the roughly 2.6 quintillion bytes of data
generated every day. Since machines require a lot of data to learn, the use of deep learning
has grown with the rise in data creation today; stronger computing power is another
major reason for this. With deep learning, machines can solve complex problems
even while using diverse, interconnected, and unstructured data sets.
i. Translation:
Deep learning algorithms are used for automatically translating languages, which is a huge
advantage for businesses, governments, and travellers.
ii. Virtual assistants:
Virtual assistants of online service providers use deep learning to help them understand the
speech and language humans use when interacting with them.
iii. Driverless vehicles:
This includes driverless delivery cars, drones, and other similar vehicles, for which a deep
learning algorithm acts as the vision, guiding them on how to move on roads safely by
following the signs on the streets along with the traffic rules.
iv. Facial recognition:
With deep learning, the facial recognition feature is used for security purposes besides
tagging people on social media posts. But it comes with a demerit, i.e., it may not recognize
people when their hairstyle changes, when they have shaved or grown a beard, or when the
image is taken in low light, and similar instances.
Example:
Facial recognition is used for identification of people on social media such as Facebook,
aiding in forensic investigations, unlocking phones, preventing retail crimes and helping the
blind to understand social situations in a better way.
v. Chatbots:
Many companies use chatbots to provide services to people. These help in responding to
people in an intelligent way that customers find useful. In this case, deep learning occurs
through a large amount of auditory and text data.
vi. Adding colour to black-and-white images:
Colouring a black-and-white image was a cumbersome job earlier, but with the help of a deep
learning algorithm, the objects can be recreated with the right colours in an accurate way,
which looks impressive.
vii. Pharmaceuticals:
Deep learning has a major role to play in the medical field in diagnosing diseases and
creation of various types of medicines.
5.12 SUMMARY
● SQL is a computer language that can be used for storing, manipulating, and retrieving
data from a relational database.
● The collection of large data sets that can't be processed using traditional computing
techniques is known as Big Data. These data can be structured, unstructured, or semi-
structured.
● The artificial neural network has been built in an effort to create functionality similar to
the biological neural system. It forms the basis for Artificial intelligence and helps in
solving problems that can't be easily resolved by humans.
● Deep learning is a function of Artificial intelligence that tries to mimic the functioning of
a human brain for data processing and creating patterns that can be used to make
decisions.
2. What are the different types of NoSQL? When should we use NoSQL?
a) OLTP Transactions
a) HDFS
b) Map Reduce
c) HBase
a) 32 MB
b) 64 KB
c) 128 KB
d) 64 MB
a) 4
b) 1
c) 3
d) 2
a. Gossip Protocol
b. Replicate Protocol
c. HDFS Protocol
a. mapred-site.xml
b. yarn-site.xml
c. core-site.xml
d. hdfs-site.xml
10. Which of the following types of joins can be performed in the Reduce side join
operation?
a. Equi Join
a) Two-valued logic
c) Many-valued logic
d) Binary set logic
a) True
b) False
(Explanation: Traditional set theory set membership is fixed or exact whether the member is
in the set or not. There are only two crisp values, true or false. In the case of fuzzy logic,
there are many values. With weight say x the member is in the set)
a) Discrete Set
b) Degree of truth
c) Probabilities
d) Both b & c
a) Solving queries
b) Increasing complexity
c) Decreasing complexity
15. Which condition is used to influence a variable directly by all the others?
a) Partially connected
b) Fully connected
c) Local connected
(iii)They are more suited for real-time operation due to their high 'computational' rates
c) Only (i)
b) It is the transmission of error back through the network to adjust the inputs
c) It is the transmission of error back through the network to allow weights to be adjusted so
that the network can learn.
Answers:
1.d, 2.c, 3.a, 4.d, 5.d, 6.c, 7.c, 8.d, 9.c, 10.e, 11.c, 12.a, 13.b, 14.c, 15.b, 16. a, 17.d, 18.d, 19.c