
UNIT II

Origins: Introducing the MapReduce Framework for Big Data

Why MapReduce?

• The Big Data world has immense information to be processed and requires data to be distributed among a number of systems or nodes of a cluster so that the data can be handled efficiently.
• To handle computations on data stored in connected systems rather than a single source, a new approach to programming had to be explored.
• The advent of Local Area Networks and other networking technologies made it possible to combine the computing and storage capacities of the systems on a network.

Contd.

• Some years ago, Google developed and started using a programming model that they called MapReduce.
• It was a new style of data processing designed to manage big data using distributed and parallel computing on a cluster.
• This model was inspired by the combination of map and reduce operations commonly used in existing programming languages.
• The MapReduce model had a huge impact on Google's ability to handle huge amounts of data in a reasonable time.
• MapReduce was the pioneering attempt at processing big data, and later technologies such as Hadoop still provide software utilities that use the MapReduce model.

MapReduce Framework: about it, its features and the way it works!

About MapReduce:
• MapReduce is a software framework and programming model used for processing huge amounts of data.
• MapReduce-based programs work in two phases, namely Map and Reduce.
• Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
• Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
• The input to each phase is key-value pairs.
• In addition, every programmer needs to specify two functions: the map function and the reduce function.
Understanding MapReduce in Hadoop
● MapReduce is a Hadoop framework used for writing applications that can process vast amounts of data on large clusters.
● It allows data to be stored in a distributed form and simplifies the handling of enormous volumes of data and large-scale computing.
● There are two primary tasks in MapReduce: map and reduce.
● We perform the former task before the latter. In the map job, we split the input dataset into chunks.
● The map tasks process these chunks in parallel.
● The outputs of the map tasks are used as inputs for the reduce tasks. Reducers process the intermediate data from the maps into smaller tuples, leading to the final output of the framework.

What is MapReduce
● MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.
● MapReduce consists of two distinct tasks: Map and Reduce.
● As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
● So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
● The output of a Mapper or map job (key-value pairs) is the input to the Reducer.
● The reducer receives key-value pairs from multiple map jobs.
● Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
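To make the key-value flow concrete, the sketch below (illustrative Python, not part of the original slides) defines a map function and a reduce function for a hypothetical "maximum temperature per year" job and runs them on a tiny in-memory dataset; the function names and sample records are assumptions for illustration only.

    # Illustrative sketch (not from the slides): map and reduce functions for a
    # hypothetical "maximum temperature per year" job, run on an in-memory list.
    records = [("2020", 31), ("2020", 38), ("2021", 29), ("2021", 35)]

    def map_fn(year, temperature):
        # The map phase emits intermediate key-value pairs.
        return [(year, temperature)]

    def reduce_fn(year, temperatures):
        # The reduce phase aggregates all values that share a key.
        return (year, max(temperatures))

    intermediate = [pair for (year, temp) in records for pair in map_fn(year, temp)]

    grouped = {}
    for year, temp in intermediate:          # a stand-in for the shuffle step
        grouped.setdefault(year, []).append(temp)

    results = [reduce_fn(year, temps) for year, temps in grouped.items()]
    print(results)   # [('2020', 38), ('2021', 35)]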

What is this style of MapReduce programming? (MapReduce style of processing)

• Example: Excel (normal style of processing).
Important functions in a MapReduce program

1. Map and reduce functions
• The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• The map task is done by means of the Mapper class.
• The reduce task is done by means of the Reducer class.
• The Mapper class takes the input, tokenizes it (i.e. converts it to key-value pairs), maps and sorts it. The output of the Mapper class is used as input by the Reducer class, which in turn searches for matching pairs and reduces them.

2. Combine function / Merging / Shuffle & Sort function
• It is the second step in the MapReduce algorithm. The Shuffle function is also known as the "Combine function".
• It performs the following two sub-steps:
1. Merging
2. Sorting
• It takes the list of outputs coming from the Map function and performs these two sub-steps on each and every key-value pair.
• The Merging step combines all key-value pairs which have the same key (that is, grouping key-value pairs by comparing the Key). This step returns <Key, List<Value>>.
• The Sorting step takes the input from the Merging step and sorts all key-value pairs by their keys. This step also returns <Key, List<Value>> output, but with sorted key-value pairs.
• Finally, the Shuffle function returns a list of <Key, List<Value>> sorted pairs to the next step.

Illustration of the steps of the MapReduce algorithm using key-value pairs: the output of the Map function is a set of key and value pairs <Key, Value>, as shown in the diagram.
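A minimal sketch of the merging and sorting sub-steps described above, using Python's sorted and itertools.groupby on assumed sample pairs; it turns a list of <Key, Value> pairs into sorted <Key, List<Value>> pairs.

    # Illustrative shuffle sketch (assumed sample data): group <Key, Value> pairs
    # into sorted <Key, List<Value>> pairs, as the Shuffle/Combine step does.
    from itertools import groupby
    from operator import itemgetter

    map_output = [("river", 1), ("car", 1), ("river", 1), ("deer", 1)]

    sorted_pairs = sorted(map_output, key=itemgetter(0))          # sorting step
    shuffled = [(key, [v for _, v in group])                      # merging step
                for key, group in groupby(sorted_pairs, key=itemgetter(0))]

    print(shuffled)   # [('car', [1]), ('deer', [1]), ('river', [1, 1])]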

Combine / Shuffle function (see diagram)

Reduce function
• It is the final step in the MapReduce algorithm. It performs only one step: the Reduce step.
• It takes the list of <Key, List<Value>> sorted pairs from the Shuffle function and performs the reduce operation as shown below.

Let us MapReduce:
• MapReduce Example – Word Count.
• In this assignment, revise and practice how the MapReduce algorithm solves the WordCount problem theoretically.
• Problem statement: count the number of occurrences of each word available in a dataset.
• Input dataset: please find our example input paragraph. Just for simplicity, we are going to use a simple, small dataset.
• However, real-time applications use very huge amounts of data.
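As a sketch of how the word-count assignment above could be run on Hadoop, the two small scripts below follow the Hadoop Streaming convention (the mapper reads lines from standard input and emits word<TAB>1; the reducer receives the pairs sorted by key and sums them). The file names and the sample invocation are assumptions, not part of the original slides.

    # mapper.py - emits one <word, 1> pair per word (Hadoop Streaming style sketch)
    import sys
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - input arrives sorted by key, so counts can be summed per word
    import sys
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

Locally, the same pipeline can be simulated with "cat input.txt | python mapper.py | sort | python reducer.py"; on a cluster the scripts would typically be passed to the hadoop-streaming jar as the -mapper and -reducer options (exact paths depend on the installation).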

Features of MapReduce:

1. Scheduling: MapReduce involves two operations: map and reduce.
• These are executed on large data sets that are divided into smaller subsets and stored separately in different computing resources.
• The operation of breaking tasks into subtasks and running these subtasks independently in parallel is called mapping, and it is performed ahead of the reduce operation.
• (Figure: during a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.)

Contd.
• In case nodes are fewer than tasks, the tasks are executed on a priority basis.
• The mapping operation requires task prioritization based on the number of nodes in the cluster.
• The reduction operation cannot be performed until the entire mapping operation is completed.
• The reduction operation then merges independent results on the basis of priority.
• Hence, the MapReduce programming model requires scheduling of tasks.
2. Synchronization
• Execution of several concurrent processes requires synchronization.
• The MapReduce program execution framework should be aware of all the map and reduce jobs that are to take place in the program.
• It should track all the tasks along with their timings and ensure that the reduction process starts only after all mapping is completed.

3. Data Locality (co-location of code and data):
• The effectiveness of a data processing mechanism depends largely on the location of the code and the data required for the code to execute.
• The best result is obtained when both the code and the data reside on the same machine.
• This means that co-location of code and data produces the most effective processing outcome.
• This is called data locality.

4. Handling of errors / faults:
• MapReduce engines usually provide a high level of fault tolerance and robustness in handling errors.
• The reason for providing robustness to these engines is the high tendency for errors or faults to occur.
• There are high chances of failure in the clustered nodes on which different parts of a program are running.
• Therefore the engine must have the capability of recognizing a fault and rectifying it.
• Moreover, the engine design involves the ability to find out which tasks are incomplete and eventually assign them to different nodes.

5. Scale-out architecture:
• MapReduce engines are built in such a way that they can accommodate more machines as and when required.
• This possibility of introducing more computing resources to the architecture makes the MapReduce programming model more suitable for the higher computational demands of Big Data.
HADOOP VS MAPREDUCE
(Comparison slide: Hadoop and MapReduce are contrasted in terms of Meaning, Concept, Framework and Language; the contents of the comparison table are in the slide figure.)
How it works: stages of MapReduce
• The data goes through the following phases of the MapReduce algorithm:
• Input splits: an input to a MapReduce program is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map.
• Mapping: this is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce suitable key-value pairs. In our word count example, the job of the mapping phase is to count the number of occurrences of each word from the input splits and prepare a list in the form <word, frequency>, i.e. the key is the word and the frequency is its value.
• Shuffling: this phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping phase output. In our example, the same words are clubbed together along with their respective frequencies.
• Reducing: in this phase, the output values from the shuffling phase are aggregated. This phase combines the values from the shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset. In our example, it aggregates the values from the shuffling phase, i.e. calculates the total occurrences of each word.

Working of MapReduce:
• Applications to handle data are designed by software professionals on the basis of algorithms, which are stepwise processes to solve a problem or achieve a goal.
• The MapReduce model also works on an algorithm to execute the above stages.
• This algorithm can be depicted as follows:
1. Take a large dataset or set of records.
2. Perform iteration over the data.
3. Extract some interesting patterns to prepare an output list by using the map function.
4. Arrange/sort the output list properly to enable optimization for further processing.
5. Compute the set of results by using the reduce function.
6. Provide the final output.
The working of the MapReduce approach is shown below:

Working of the MapReduce Approach

Description of the model in the figure:
• The framework in the figure is a combination of a master and three slaves. The master monitors the entire job assigned to the MapReduce algorithm and is given the name JobTracker. It is also called the master node.
• The slaves, on the other hand, are responsible for keeping track of individual tasks and are called TaskTrackers.
• First, the given job is divided into a number of tasks by the master, i.e. the JobTracker distributes these tasks to the slave nodes.
• It is the responsibility of the JobTracker to further keep an eye on the processing activities and on the re-execution of failed tasks. The slaves coordinate with the master by executing the tasks they are given by the master.
• The JobTracker receives jobs from client applications to process large information.
• These jobs are assigned in the form of individual tasks (after a job is divided into smaller parts) to various TaskTrackers.
• The data, after being processed by the TaskTrackers, is transmitted to the reduce function so that the final integrated output, which is an aggregate of the data processed by the map function, can be provided.

Revise
1. How does MapReduce organize work? Hadoop divides the job into tasks.
2. Name the two types of tasks in the MapReduce approach:
• Map tasks
• Reduce tasks
3. The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
• JobTracker: acts like a master (responsible for complete execution of the submitted job).
• Multiple TaskTrackers: act like slaves, each of them performing the job.

Techniques to optimize MapReduce Jobs:


• An analysis of MapReduce program execution shows that it involves a series of steps, each of which has its own set of resource requirements.
• In addition, you must avoid any resource bottlenecks in order to draw the maximum benefit from MapReduce resources.
• These resources, if utilized to the fullest, can help you reduce the response time of MapReduce jobs to a minimum.
• Encountering a deadlock for even a single resource during the execution of the program slows down the execution process.

Contd.
• The performance of MapReduce jobs, and their reliability, can be optimized by using some techniques.
• The techniques are organized in the following categories:
1. Hardware / Network Topology
2. Synchronization
3. File System
1. Hardware and network topology:
• MapReduce makes it possible to run MapReduce tasks on inexpensive clusters of commodity computers.
• These computers can be connected through standard networks.
• The performance and fault tolerance required for Big Data operations are also influenced by the physical locations of servers.
• Usually, the data center arranges the hardware in racks.

Contd.
• The performance offered by hardware systems located in the same rack where the data is stored will be higher than that of systems located in a different rack than the one containing the data.
• The reason for the lower performance in the latter case is the requirement to move the data or application code.
• You can minimize latency in MapReduce processing by keeping the hardware elements close to each other.

2. Synchronization
• The completion of map processing enables the reduce function to combine the various outputs and provide the final results.
• However, the performance will degrade if the results of mapping remain contained within the same nodes where data processing began.
• In order to improve the performance, we should copy the results from the mapping nodes to the reducing nodes, which will then start their processing tasks immediately.

3. File system:
• MapReduce operations are best suited to distributed file systems.
• Distributed file systems differ from local file systems in that local file systems have less capability for storing and arranging the data.
• All the metadata and access rights, apart from mapping, block and file locations, are stored in the master node. On the other hand, the data on which the application code will run is kept on the slave nodes.
• The master node receives all the requests, which are forwarded to the appropriate slaves for performing the required actions.
Contd.
• You need to keep the following considerations in mind while designing a file system for a MapReduce program:
1. The master node handles various operations, which may lead to a risk of it being overworked. If it fails, then you will not be able to access the entire file system and code until it becomes active again. In order to optimise the file system, you can develop a standby master node.
2. In a Big Data environment, files of less than 100 MB are not preferred, so you need to avoid them. The best results are obtained when the distributed file system is loaded with a small number of large-sized files.
3. MapReduce handles the workload by keeping large jobs in small data batches. Hence MapReduce needs network bandwidth that remains available for a long time, rather than quick execution times of mappers and reducers.
4. An increasing number of security layers hampers the performance of distributed file systems. It is always advisable to allow only authorized users to access the data center environment and to protect the distributed file system.

Role of HBase in Big Data processing

Introducing: Databases
• A database management system (DBMS) is a software solution that helps users view, query, and manage databases.
• Both RDBMS and HBase are database management systems. An RDBMS uses tables to represent data and their relationships. HBase is a column-oriented DBMS and it works on top of the Hadoop Distributed File System (HDFS).
• RDBMS: A relational database is a type of database that stores and provides access to data points that are related to one another. In a relational database, each row in a table is a record with a unique ID called the key. The columns of the table hold attributes of the data; thus it is row oriented.
• HBase, however, is column-oriented.

What is HBase
• HBase is an open source, non-relational, distributed, column-oriented database developed as a part of the Apache Software Foundation's Hadoop project.
• It is beneficial when large amounts of data are required to be stored, updated and processed at a fast speed.
• Because of the vast size of Big Data, its storage and processing are challenging tasks.
• Just as MapReduce enhances Big Data processing, HBase takes care of its storage and access requirements.
Read@Home: Differentiate between a relational database and HBase.

Role of HBase in MapReduce
• HBase helps programmers to store large quantities of data in such a way that it can be accessed easily and quickly as and when required.
• It stores data in a compressed format and thus occupies less space in memory.
• Relational databases are row-oriented, meaning the data in each row of a table is saved together. HBase follows a columnar way of saving data.
• In case you have a large volume and variety of data, you can use a columnar database.
• HBase is suitable in cases where data changes gradually and rapidly. Examples of data that use HBase are demographic data, IP addresses, geolocation lookup tables, and product dimensions.
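For a feel of HBase's column-oriented, key-value style of access, the fragment below is a rough sketch using the third-party happybase Python client against a hypothetical "demographics" table; it assumes a running HBase Thrift service, and the table, column-family and row-key names are made up purely for illustration.

    # Rough sketch only: assumes an HBase Thrift server on localhost and the
    # third-party happybase client; table/column names are hypothetical.
    import happybase

    connection = happybase.Connection('localhost')
    table = connection.table('demographics')

    # Data is addressed by row key and column-family:qualifier, not by joins.
    table.put(b'user#1001', {b'info:city': b'Hyderabad', b'info:age': b'34'})

    row = table.row(b'user#1001')
    print(row[b'info:city'])   # b'Hyderabad'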

Thank you
UNIT II (Part B)

Case Examples of MapReduce
MapReduce is used to process various types of data obtained from various sectors. Some of the fields benefitted by the use of MapReduce are:

1. Web page visits: Suppose a researcher wants to know the number of times the website of a particular newspaper was accessed. The map task would be to read the logs of the web page requests and make a complete list.
The map output may look similar to the following:

<emailURL,1>
<newspaperURL,1>
<socialmediaURL,1>
<sportsnewsURL,1>
<newspaperURL,1>
<emailURL,1>
<newspaperURL,1>

The reduce function would find the results for the newspaper URL and add them. The output of the preceding step is:

<newspaperURL,3>

Contd.
2. Word frequency: A researcher wishes to read articles about floods, but he does not want those articles in which flood is discussed as a minor topic. Therefore he decides that an article basically dealing with earthquakes and floods should have the word 'tectonic plate' in it more than 10 times.
● The map function will count the number of times the specified word occurred in each document and provide the result as <document, frequency>.
● The reduce function will then count and select only the results that have a frequency of more than 10 (see the small sketch after case 3 below).

Contd.
3. Word count: Suppose the researcher wishes to determine the number of times celebrities talk about the present bestseller book.
The data to be analysed comprises written blogs, posts and tweets of the celebrities.
The map function will make a list of all the words. This list will be in the form of the following key-value pairs (where the key is the word and the value is 1 for every appearance of the word).
The output of the map function:
<global warming,1>
<food,1>
<global warming,1>
<bestseller,1>
<afghanistan,1>
<bestseller,1>
The preceding output will be converted into the following form by the reduce function:
<global warming,2>
<food,1>
<bestseller,2>
<afghanistan,1>
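A minimal sketch of the case-2 reduce step described above: given assumed <document, frequency> pairs from the map phase, it keeps only the documents in which the specified word appears more than 10 times.

    # Illustrative sketch with assumed map output: keep documents where the
    # term "tectonic plate" occurred more than 10 times.
    map_output = [("article_01.txt", 14), ("article_02.txt", 3), ("article_03.txt", 11)]

    selected = [(doc, freq) for doc, freq in map_output if freq > 10]
    print(selected)   # [('article_01.txt', 14), ('article_03.txt', 11)]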
Parts of the Big Data Architecture / Big Data Stack:
(Figure: human body analogy)
● As it deals with huge volumes of a variety of data, Big Data analysis requires the use of the best technologies at every stage, be it collecting data, cleaning it, sorting and organizing it, integrating it or analysing it.
● Thus, the technologies associated with Big Data analysis are a bit complex in nature, and so to understand them we create a model template / architecture, commonly known as the Big Data Architecture, before designing the systems.
● The configuration of this model varies depending on the specific needs of the organization.
● However, the basic layers and components remain more or less the same.
● The model should give a complete view of all the required elements.
● Although initially creating a model or even viewing it may seem time-consuming, it can save a significant amount of time, effort and rework during subsequent stages of implementation.

Principles of Big Data Implementation:
● Performance: High-end infrastructure should be built to deliver high performance with low latency. Performance is measured end to end, on the basis of a single transaction or query. The total time taken by a data packet to travel from one node to another is described as latency.
● Availability: The infrastructure setup must be available at all times to ensure a nearly 100% uptime guarantee of service. It is obvious that businesses cannot wait in case of a service interruption or failure; therefore an alternative to the main system must also be maintained.
● Scalability: Big Data systems must be scalable enough to accommodate varying storage and computing requirements.
● Flexibility: Flexible infrastructures facilitate adding more resources to the setup and promote failure recovery. It should be noted that flexible infrastructure is also costly; however, costs can be controlled with the use of cloud services, where you pay only for what you actually use.
● Cost: You must select the infrastructure that you can afford. This includes all the hardware, storage and networking requirements.

Big Data Architecture / Big Data Stack
● Big Data analysis also needs the creation of a model / architecture, commonly known as the Big Data Architecture or Big Data Stack.
● While creating this model, we must take into consideration all the hardware, infrastructure software, operational software, management software, Application Programming Interfaces (APIs) and software development tools.
● In short, we can say that the architecture of the Big Data environment must fulfil all the principles of Big Data implementation described above and be able to perform the following functions:
✔ Capture data from different sources.
✔ Clean and integrate data of different types and formats.
✔ Sort and organize data.
✔ Analyse data.
✔ Identify relationships / patterns in data.
✔ Derive conclusions.
Big Data Architecture / Big Data Stack: layers of the Big Data handling technologies architecture
● The figure shows a sample illustration of the Big Data Architecture, comprising the following layers and components:
1. Data Sources layer
2. Ingestion layer
3. Storage layer
4. Physical Infrastructure layer
5. Platform Management layer
6. Security layer
7. Monitoring layer
8. Analytics layer
9. Visualisation layer
10. Big Data Applications

1. Data Sources layer:
● Organisations generate huge amounts of data on a daily basis.
● The basic function of the Data Sources layer is to absorb and integrate the data coming from various sources, at varying velocity and in different formats.
● Before this data is considered for the Big Data Stack, we have to differentiate between the noise and the relevant information.
● In communication systems, noise is an error or undesired random disturbance of a useful information signal; it is a summation of unwanted or disturbing energy from natural and sometimes man-made sources.
● Example: take the telecom industry and identify its sources of data.
Ingestion layer:
● The task of validating, sorting and cleaning data is done by the Ingestion layer. The removal of noise from the data also takes place in the Ingestion layer.
● In other words, it validates, cleanses, transforms, reduces and integrates the unstructured data into the Big Data Stack for further processing.
● In the Ingestion layer, the data passes through the following stages (a small illustrative sketch follows this list):

Stages in the Ingestion layer:
● Identification: At this stage, data is categorized into various known data formats, or unstructured data is assigned default formats.
● Filtration: Relevant information or data is filtered.
● Validation: Filtered data is analysed.
● Noise reduction: Data is cleaned by removing noise and minimizing related disturbances.
● Transformation: Data is split or combined on the basis of its type and content.
● Compression: The size of the data is reduced without affecting the content.
● Integration: Now the refined dataset is integrated with the Hadoop Storage layer, which consists of the Hadoop Distributed File System (HDFS) and NoSQL databases.
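The stages above can be pictured as a simple pipeline. The sketch below is illustrative only: each stage is a stub function chained over an assumed raw record, standing in for the real identification-to-integration flow of the Ingestion layer.

    # Illustrative ingestion pipeline sketch; every function is a stub standing in
    # for a real stage (identification, filtration, validation, noise reduction /
    # transformation, compression) before integration with the storage layer.
    import json, zlib

    def identify(record):        # assign a default format to unstructured input
        return {"format": "json", "payload": record}

    def filter_relevant(record): # keep only the fields of interest
        record["payload"] = {k: v for k, v in record["payload"].items() if k in {"id", "msg"}}
        return record

    def validate(record):        # minimal sanity check on the filtered data
        assert "id" in record["payload"], "record must carry an id"
        return record

    def reduce_noise(record):    # strip stray whitespace ("noise") from values
        record["payload"] = {k: v.strip() if isinstance(v, str) else v
                             for k, v in record["payload"].items()}
        return record

    def compress(record):        # shrink the record before handing it to storage
        return zlib.compress(json.dumps(record["payload"]).encode())

    raw = {"id": 7, "msg": "  sensor reading ok  ", "debug": "ignore me"}
    ready = compress(reduce_noise(validate(filter_relevant(identify(raw)))))
    print(len(ready), "bytes ready for the storage layer")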

Storage layer
(Diagram: different NoSQL databases for different business applications)
● Hadoop is an open source framework used to store large volumes of data in a distributed manner across multiple machines.
● The Hadoop storage layer supports fault tolerance and parallelisation, which enable high-speed distributed processing algorithms to execute at a large scale.
● HDFS is the file system used to store huge volumes of data across a large number of commodity machines in a cluster.
● Files stored in HDFS are operated upon by many complex programs.
● HDFS follows the Write Once, Read Many model. (Write once, read many (WORM) describes a data storage device in which information, once written, cannot be modified. Why? HDFS is designed not merely to store the data, but to make the data fast to retrieve while analyzing.)
● HDFS can be implemented in an organization at comparatively low cost and can easily handle the continuous streaming of data.
● The Storage layer supports NoSQL databases such as HBase, MongoDB, InfiniteGraph, etc.
● Example: consider Big Data handling in a hospital.
Physical Infrastructure layer:
● This layer takes care of hardware and network requirements.
● It can provide a virtualized cloud environment or a distributed grid of commodity servers over a fast gigabit network.
● This layer is based on a distributed computing model, which allows the physical storage of data in many different locations by linking them through networks and the distributed file system.

Contd.
● In the Big Data environment, networks that are capable of accommodating the anticipated volume and velocity of the inbound and outbound data in case of heavy network traffic are called physically redundant networks.
● Similar to redundant networks, hardware resources for storage and servers must also have sufficient speed and capacity to handle all expected types of Big Data.
● If slow servers are connected to high-speed networks, the slow performance of the servers will be of little use and can become a bottleneck at any time.

Platform Management layer:
● The role of this layer is to provide tools and query languages for accessing NoSQL databases.
● This layer uses HDFS, which lies on top of the Hadoop infrastructure layer.

Security layer:
● The Security layer handles the basic security principles that a Big Data architecture should follow.
● Big Data projects are full of security issues because of the use of a distributed architecture, a simple programming model and an open framework of services.
● Therefore, the following security checks must be considered while designing a Big Data Stack:
1. It must authenticate nodes by the use of protocols.
2. It must enable file-layer encryption.
3. It must subscribe to a key management service for trusted keys and certificates.
4. It must maintain logs of the communication that occurs between nodes and trace any anomalies across layers.
5. It must ensure safe communication between nodes by using Secure Sockets Layer (SSL).
Monitoring layer
● This layer consists of a number of monitoring systems.
● These systems remain automatically aware of all the configurations and functions of different operating systems and hardware.
● They provide machine/node communication through a high-level protocol such as XML (eXtensible Markup Language).
● Some examples of tools for monitoring Big Data stacks are Ganglia and Nagios.

Analytics Engine:
● The role of an analytics engine is to analyse huge amounts of unstructured data. This type of analysis is related to text analytics and statistical analytics.
● Some examples of the different types of unstructured data that are available as large datasets include the following:
❖ Documents containing textual patterns.
❖ Text and symbols generated by customers or users on social media forums such as Yammer, Twitter and Facebook.
❖ Machine-generated data, such as Radio Frequency Identification (RFID) feeds and weather data.

Different statistical / numerical methods for analyzing Big Data:
(See the slide figure for the list of methods.)

Visualisation layer:
● This layer handles the task of interpreting and visualizing Big Data.
● Visualisation of data is done by data analysts and scientists to have a look at the different aspects of the data in various visual modes.
● It can be described as viewing a piece of information from different perspectives, interpreting it in different manners, trying to fit it into different types of situations and deriving different types of conclusions from it.
● The Visualisation layer works on aggregated data stored in traditional operational data stores, data warehouses and data marts. These data stores get the data from the data sources.
● Some examples of visualization tools are Tableau, R, MapR, Revolution R, QlikView and Spotfire.
Big data applications
● Different types of tools and applications are used to implement the Big Data Stack architecture.
● The applications can be horizontal or vertical.
● Horizontal applications are used to address problems that are common across industries, whereas vertical applications are used to solve industry-specific problems.
Virtualisation and Big Data
● Big Data virtualization is a process that focuses on creating virtual structures / machines for Big Data handling systems.
● It is the process of abstracting the different data sources involved in handling Big Data so that a single data access layer can deliver integrated information as data services to users and applications in real time or near real time.
● A virtual machine is basically a software representation of a physical machine that can execute or perform the same functions as the physical machine.

Hypervisor / Virtual Machine Manager
● It is a program that allows multiple operating systems to share a single hardware host.
● It controls the host processor and resources, allocating what the guest operating systems need.
Contd.
● Virtual machines are provided by virtualisation tools / software packages such as Actifio Sky, Denodo Platform, IBM Cloud Pak and Informatica PowerCenter, which are tools for creating virtual machine software.

Why virtualization is needed for Big Data
● Virtualization is ideal for Big Data because in Big Data analysis the data has high volume, high variety and a high velocity of arrival.
● We need to separate resources and services from the underlying physical delivery environment, enabling us to create many virtual systems within a single physical system.
● One of the primary reasons that companies have implemented virtualization is to improve the performance and efficiency of processing a diverse mix of workloads.

Virtualisation Environment
● Rather than assigning a dedicated set of physical resources to each set of tasks, a pooled set of virtual resources can be quickly allocated as needed across all workloads.
● Reliance on the pool of virtual resources allows companies to improve latency.

Basic Features of Virtualisation
● Partitioning: Multiple applications and operating systems are supported by a single physical system by partitioning (separating) the available resources.
● Isolation: Each virtual machine runs in a manner isolated from its host physical system and from other virtual machines. The benefit of this isolation is that if any one virtual instance crashes, the other virtual machines and the host system are not affected.
● Encapsulation: Each virtual machine encapsulates its state as a file system. Like a simple file on a computer system, a virtual machine can also be moved or copied. It works like an independent guest software configuration.
● Interposition: Generally, in a virtual machine, all new guest actions are performed through the monitor. A monitor can inspect, modify or deny operations such as compression, encryption, profiling, and translation.

Benefits
● Virtualisation is implemented to increase the performance and efficiency of processing a variety of workloads.
● Using virtual resources provides the following benefits:
❑ Enhance service delivery speed by decreasing latency.
❑ Enable better utilization of resources and services.
❑ Provide a foundation for implementing cloud computing.
❑ Improve productivity, implement scalability and save costs.
❑ Provide a level of automation and standardization for optimizing the computing environment.

Types / Approaches of Virtualisation:
● In the Big Data environment, you can virtualize almost every element, such as servers, storage, applications, data, networks, processors, etc.

What is a server?
● A server is a machine or computer program that provides data or functionality for other machines or programs. We call the other devices or programs 'clients'.
● Most commonly, the term refers to a computer that provides data to other computers.
● Servers are the lifeblood of any network.
● They provide the shared resources that the network users need, such as e-mail, Web services, databases, file storage, etc.

Server Virtualisation:
● In server virtualization, a single physical server is partitioned into multiple virtual servers.
● Each virtual server has its own hardware and related resources, such as Random Access Memory (RAM), CPU, hard drive and network controller.
● The process of creating virtual machines involves installing a lightweight software component called a hypervisor (i.e. a program designed to have a small memory footprint) onto a physical server.
● The hypervisor's job is to make the shared resources of the physical server, such as CPU time, memory, storage and network bandwidth, available to one or more virtual machines.
● In Big Data analysis, server virtualization can ensure the scalability of the platform as per the volume of the data.
● Server virtualization also provides the foundation for using cloud services as data sources.

Application Virtualisation
● Application virtualization means encapsulating applications in a way that they are not dependent on the underlying physical computer system.
● It improves the manageability and portability of applications.
● It can be used along with server virtualization.
● Application virtualization ensures that Big Data applications can access resources on the basis of their relative priority with each other.
● Big Data applications have significant IT resource requirements, and application virtualization can help them access resources at low cost.

What is a virtual network?
● A virtual network is a network in which all the connected devices, servers, virtual machines and data centers are connected through software and wireless technology. This allows the reach of the network to be expanded as far as it needs to go for peak efficiency.
● A local area network, or LAN, is a kind of wired network that can usually only reach within the domain of a single building.
● A wide area network, or WAN, is another kind of wired network, but the computers and devices connected to the network can stretch over half a mile in some cases.
● Conversely, a virtual network does not follow the conventional rules of networking because it is not wired at all; instead, specialized internet technology is used to access it.

Network Virtualisation
● Network virtualization means using virtual networking as a pool of connection resources.
● While implementing network virtualization, you do not need to rely on the physical network for managing traffic between connections.
● You can create as many virtual networks as you need from a single physical implementation.
● In the Big Data environment, network virtualization helps in defining different networks with different sets of performance and capacities to manage the large distributed data required for Big Data analysis.
Processor and Memory Virtualisation
● Processor virtualization optimizes the power of the processor and maximizes its performance.
● Memory virtualization separates memory from the servers.
● Big Data analysis needs systems with high processing power (CPU) and memory (RAM) for performing complex computations.
● These computations can take a lot of time if CPU and memory resources are not sufficient.
● Processor and memory virtualization can thus increase the speed of processing and deliver your analysis results sooner.

Data and Storage Virtualization
● Data virtualization provides an abstract service that delivers data continuously in a consistent form, without requiring knowledge of the underlying physical database.
● It is used to create a platform that can provide dynamic linked data services.
● The benefits of data virtualization for companies include quickly combining different sources of data, improving productivity, accelerating time to value, eliminating latency, maintaining the data warehouse, and reducing the need for multiple copies of data as well as for hardware.
● Storage virtualisation, on the other hand, combines physical storage resources so that they can be shared in a more effective way.

Storing Data in Databases and Data Warehouses

RDBMS with an example:
● Relational database systems use a model that organizes data into tables of rows (also called records or tuples) and columns (also called attributes or fields).
● Generally, columns represent categories of data, while rows represent individual instances.
● For example, imagine your company maintains a customer table that contains company data about each customer account, and one or more transaction tables that contain data describing individual transactions.
● The columns (or fields) for the customer table might be Customer ID, Company Name, Company Address, etc.
● The columns for a transaction table might be Transaction Date, Customer ID, Transaction Amount, Payment Method, etc.
● The tables can be related based on the common Customer ID field. You can, therefore, query the tables to produce valuable reports, such as a consolidated customer statement.
(Illustration of the above example: see the slide figure.)
Contd.
● These tables can be linked or related using keys. Each row in a table is identified using a unique key, called a primary key.
● This primary key can be added to another table, becoming a foreign key.
● The primary/foreign key relationship forms the basis of the way relational databases work.
● Returning to our example, if we have a table representing product orders, one of the columns might contain customer information.
● Here, we can import a primary key that links to a row with the information for a specific customer.

Contd.
● An RDBMS consists of several tables, and the relationships between those tables help in classifying the information contained in them.
● Each table in an RDBMS has a pre-set schema.
● These schemas are linked using the values in specific columns of each table (primary key / foreign key).
● The data to be stored or transacted in an RDBMS needs to adhere to the ACID standards.
● ACID is a concept that refers to the four properties of a transaction in a database system: Atomicity, Consistency, Isolation and Durability.

ACID:
● These properties ensure the accuracy and integrity of the data in the database, ensuring that the data does not become corrupt as a result of some failure and guaranteeing the validity of the data even when errors or failures occur.
Atomicity: Ensures full completion of a database operation. A transaction must be an atomic unit of work, which means that either all of the modifications are performed or none of them are. The transaction should execute completely or fail completely; if one part of the transaction fails, the whole transaction fails. This provides reliability, because if there is a failure in the middle of a transaction, none of the changes in that transaction will be committed.
Consistency: Ensures that data abides by the schema (table) standards, such as correct data type entry, constraints and keys.
Isolation: Refers to the encapsulation of information, i.e. makes only necessary information visible.
Durability: Ensures that transactions stay valid even after a power failure or errors.
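To ground the customer/transaction example and the ACID discussion above in something runnable, here is a small sketch using Python's built-in sqlite3 module; the table and column names follow the example in these slides, while the sample rows and the deliberately failed transaction are assumptions for illustration.

    # Sketch of the customer/transaction example with a primary/foreign key and a
    # rolled-back transaction (atomicity). Sample data is made up for illustration.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("""CREATE TABLE customer (
                        customer_id     INTEGER PRIMARY KEY,
                        company_name    TEXT,
                        company_address TEXT)""")
    conn.execute("""CREATE TABLE txn (
                        txn_date       TEXT,
                        customer_id    INTEGER REFERENCES customer(customer_id),
                        amount         REAL,
                        payment_method TEXT)""")

    conn.execute("INSERT INTO customer VALUES (1, 'ABC Ltd', 'Hyderabad')")
    conn.execute("INSERT INTO txn VALUES ('2024-01-05', 1, 2500.0, 'card')")
    conn.execute("INSERT INTO txn VALUES ('2024-02-11', 1, 1200.0, 'cash')")
    conn.commit()

    # Consolidated customer statement via the common Customer ID field (a join).
    for row in conn.execute("""SELECT c.company_name, SUM(t.amount)
                               FROM customer c JOIN txn t USING (customer_id)
                               GROUP BY c.customer_id"""):
        print(row)                      # ('ABC Ltd', 3700.0)

    # Atomicity: both inserts below are undone together when the second one fails.
    try:
        with conn:                      # the 'with' block is one transaction
            conn.execute("INSERT INTO txn VALUES ('2024-03-01', 1, 900.0, 'card')")
            conn.execute("INSERT INTO txn VALUES ('2024-03-01', 99, 10.0, 'card')")  # bad FK
    except sqlite3.IntegrityError:
        pass
    print(conn.execute("SELECT COUNT(*) FROM txn").fetchone())   # (2,) - unchanged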
RDBMS and Big Data
● Like other databases, the main purpose of an RDBMS is to provide a solution for storing and retrieving information in a more convenient and efficient manner.

● The most common way of fetching data from these tables is by using Structured Query Language (SQL).
● As you know, data is stored in tables in the form of rows and columns; the size of the file increases as new data / records are added, resulting in an increase in the size of the database.
● Big Data solutions are designed for storing and managing enormous amounts of data using a simple file structure and format and a highly distributed storage mechanism.
● One of the biggest difficulties with RDBMS is that it is not yet near the demand levels of Big Data. The volume of data being handled today is rising at a fast rate.
● For example: Facebook stores 1.5 petabytes of photos. Google processes 20 PB each day. Every minute, over 168 million emails are sent and received, and 11 million searches are made on Google.
● Big Data primarily comprises semi-structured data, such as social media sentiment analysis and text mining data, while RDBMSs are more suitable for structured data such as weblogs, financial data, etc.

Differences between RDBMS and Big Data (Hadoop) systems:
● RDBMS: a traditional row-column based database, basically used for data storage, manipulation and retrieval. Big Data Hadoop: open-source software used for storing data and running applications or processes concurrently.
● RDBMS: mostly structured data is processed. Hadoop: both structured and unstructured data are processed.
● RDBMS: less scalable than Hadoop. Hadoop: highly scalable.
● RDBMS: the data schema is static. Hadoop: the data schema is dynamic.
● RDBMS: cost applies for licensed software. Hadoop: free of cost, as it is open source software.
RDBMS and Big Data link
● Big Data solutions provide a way to avoid storage limitations and reduce the cost of processing and storage for immense data volumes.
● Nowadays, systems based on RDBMS are also able to store huge amounts of data with advanced technology and developed software and hardware. Example: Analytics Platform System (APS) from Microsoft.
● In fact, relational database systems and Big Data batch processing solutions are seen as complementary rather than competitive mechanisms.
● Batch processing solutions of Big Data are very unlikely ever to replace RDBMS.
● In most cases, they balance and enhance capabilities for managing data and generating business intelligence.
● Results / output of Big Data systems can still be stored in an RDBMS, as shown in the next diagram.

Conclusion:
● In the data-tsunami kind of environment, where data inflow is beyond usual conventions and rationales, Big Data systems act as a dam to contain the water (here, data) and then utilize RDBMS cleverly to make channels in order to distribute data specifically to hydroelectric stations, irrigation canals and other places where the water is most required.
● Thus Big Data systems happen to be non-relational when it comes to storing and handling incoming data, and then they abide by conventional RDBMS mechanisms to disseminate the results in meaningful formats.

CAP THEOREM: How to understand it?
● The CAP Theorem is also called Brewer's Theorem.
● It states that any distributed data store can only provide two of the following three guarantees:
❑ Consistency: the same data is visible to all the nodes.
❑ Availability: every request receives a response, whether it succeeds or fails.
❑ Partition tolerance: despite network failures, the system continues to operate.
● The CAP Theorem is useful in decision making in the design of database servers / systems.
● In the theorem, partition tolerance is a must. The assumption is that the system operates on a distributed data store, so the system, by nature, operates with network partitions.
● Network failures will happen, so to offer any kind of reliable service, partition tolerance is necessary: the P of CAP.
● Consistency in CAP is different from that of ACID. Consistency in CAP means having the most up-to-date information.
Technical background of a query
● The moment in question is the user query. We assume that a user makes a query to a database, and the networked database is to return a value.
● That leaves a decision between the other two, C and A.
● When a network failure happens, one can choose to guarantee consistency or availability:
❖ High consistency comes at the cost of lower availability.
❖ High availability comes at the cost of lower consistency.
(Illustration: Alice from London and Ramesh from Hyderabad searching for a room of the same hotel on the same date.)
● Whichever value is returned from the database depends on our choice to provide consistency or availability. Here's how this choice could play out:
● On a query, we can respond to the user with the current value on the server, offering a highly available service.
● If we do this, there is no guarantee that the value is the most recent value submitted to the database.
● It is possible a recent write could be stuck in transit somewhere.
● If we want to guarantee high consistency, then we have to wait for the new write or return an error to the query.
● Thus, we sacrifice availability to ensure the data returned by the query is consistent.

Applications of the CAP Theorem: design of peer-to-peer systems
● Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the application. They are said to form a peer-to-peer network of nodes.
● Peers make a portion of their resources, such as processing power, disk storage or network bandwidth, directly available to other network participants, without the need for central coordination by servers or stable hosts. Peers are both suppliers and consumers of resources, in contrast to the traditional client-server model in which the consumption and supply of resources is divided.

Disadvantages of peer-to-peer:
1) In this network, the whole system is decentralised, thus it is difficult to administer. That is, one person cannot determine the accessibility settings of the whole network.
2) Data recovery or backup is very difficult. Each computer should have its own backup system.

Non-relational Databases
● A database that does not use the table/key model of RDBMS is a non-relational database.
● Such databases have effective data operation techniques and processes that are custom designed to provide solutions to Big Data problems.
● NoSQL (Not Only SQL) is one such example of a popular emerging non-relational database.
● Most non-relational databases are associated with websites such as Google, Amazon, Yahoo! and Facebook.
● These websites introduce new applications almost every day with millions of users.
● So they require non-relational databases to handle unexpected traffic spikes, since RDBMS cannot withstand such fluctuations.

Important characteristics of non-relational databases:
● Scalability: refers to the capability to write data across multiple data clusters simultaneously, irrespective of physical hardware or infrastructure limitations.
● Seamlessness: another important aspect that ensures the resiliency of non-relational databases is their capability to expand or contract to accommodate varying degrees of increasing or decreasing data flows without affecting the end user experience.
● Data and query model: instead of the traditional row/column, key-value structure, non-relational databases use special frameworks to store data.
● Persistence design: persistence is an important element in non-relational databases, ensuring faster throughput of huge amounts of data by making use of dynamic memory rather than conventional reading and writing from disks.
● Eventual consistency: while RDBMS uses ACID (Atomicity, Consistency, Isolation, Durability) for ensuring data consistency, non-relational databases use BASE (Basically Available, Soft state, Eventual consistency) to ensure that inconsistencies are resolved when data is midway between the nodes in a distributed system.

Polyglot persistence:
● A lot of corporations still use relational databases for some data, but the increasing persistence requirements of dynamic applications are growing from predominantly relational to a mixture of data sources.
● Polyglot applications are the ones that make use of several core database technologies.
● Such databases are often used to solve a complex problem by breaking it into fragments and applying different database models.
● Then the results of the different sets are aggregated into a data storage and analysis solution. It means picking the right non-relational DB for the right application.
● For example, Disney, in addition to RDBMS, also uses Cassandra and MongoDB. Netflix uses Cassandra, HBase and SimpleDB.

Integrating Big Data in Traditional Data Warehouses
Summarise: Data Warehouse
• A group of methods and software.
• Incorporated or used in big organisations; provides a dashboard-based interface.
• Data is collected from functional systems that are heterogeneous (data sources and types being different).
• Synchronized into a centralized database.
• Analytical visualization can be done.
• Acts as a single point of reference.

Big Data Handling Technology / Solution:
● Big Data technology is a medium to store and operate on huge amounts of heterogeneous data, holding data in low-cost storage devices.
● It is designed for keeping data in a raw or unstructured format while processing is in progress.
● It is preferred because there is a lot of data that has to be manually and relationally handled.
● If this data is potentially used, it can provide much valuable information leading to superior decision making.

Summarise: Big Data Handling Technology
• Medium to store and operate on huge amounts of data.
• Incorporated or used to store Big Data and process it.
• Best used when data is heterogeneous (data sources and types being different).
• Keeps data in an unstructured format while processing goes on.
• Increases performance because of optimized storage.
• Also enhances analytical abilities.

Thus...
● Organisations require a data warehouse in order to make rational decisions.
● In order to have good knowledge of what is actually going on in your company, you need your data to be reliable, credible and available to everyone.
● Big Data technology is just a medium to store and operate on huge amounts of data, whereas a data warehouse is a way of organizing data.

Illustration through a case:
● Consider the case of ABC company.
● It has to analyse the data of 100,000 employees across the world.
● Assessing the performance of each employee manually is a huge task for the administrative department before rewarding bonuses and increasing salaries based on each employee's awards list / contribution to the company.
● The company sets up a data warehouse in which information related to each employee is stored, and it provides useful reports and results.

Employee Data Warehouse (see figure)

Options with an Organisation:
● Can an organization:
Have a Big Data solution and no data warehouse, or vice versa? YES
Have both? YES
● Thus there is hardly any correlation between a Big Data technology and a data warehouse.
● It is a misunderstood conviction that once a Big Data solution is implemented, existing relational data warehousing becomes redundant and is not required anymore.
● Organisations that use data warehousing technology will continue to do so, and those that use both are future-proof against further technological advancements.
● Big Data systems are normally used to understand strategic issues, for example inventory maintenance or target-based individual performance reports.
● Data warehousing is used for reports and visualizations for management purposes at a pan-company level.
● Data warehousing is a proven concept and thus will continue to provide crucial database support to many enterprises.

Note :
●Data Availability is a well-known challenge for any system related to transforming
Integrating Big data in Traditional Data and processing data for use by end-users and Big data is no different.
●HADOOP is beneficial in mitigating this risk and make data available for analysis
warehouses immediately upon acquisition.
●Organisations are beginning to realise that they have an inevitable business requirement ●The challenge here however , is to sort and load data that is unstructured and in
of combining traditional Data warehouses (based on structured formats) to less varied formats.
structured Big data systems.
●Also context – sensitive data involving several different domains may require another
● The main challenges confronting the physical architecture of the integration between the two include data availability, loading, storage, performance, data volume, scalability, varying query demands against the data, and the operational costs of maintaining the environment.
● To cope with these issues, which can hamper the overall implementation and integration process, the following challenges need to be addressed.

1. Data Availability:
● Data brought in from source systems needs some level of availability check, especially in the case of big documents, images or videos.
● Ingestion tools such as Sqoop and Flume come in handy in this scenario.

2. Pattern Study:
● Pattern study is nothing but the centralization and localization of data according to demand.
● For example, in Amazon, results are combined based on end-user location (i.e. destination pin code), so as to return only meaningful, contextual knowledge rather than imparting the entire data to the user.
● Trending topics in news channels / e-papers are also an example of pattern study: keywords, the popularity of links as per the hits they receive, etc. are conjoined to reveal the pattern (a minimal hit-counting sketch appears at the end of this section).

3. Data Incorporation and Integration:
● The data incorporation process for Big data systems becomes complex when file formats are heterogeneous.
● Continuous data processing on a platform can create a conflict for resources over a given period of time, often leading to deadlocks.

4. Data Volumes and Exploration:
● Data exploration and mining is an activity associated with Big data systems, and it yields large datasets as processing output.
● These datasets need to be preserved in the system through occasional optimization of intermediary datasets; negligence in this aspect can cause a performance drain over a period of time.
● Traffic spikes and volatile surges in data volumes can easily dislocate the functional systems of the firm, so this must be managed across the whole data cycle (Acquisition → Transformation → Processing → Results).

5. Compliance and Localised Legal Requirements:
● Various compliance standards such as Safe Harbor and PCI regulations can have an impact on data security and storage.
● For example, transactional data may need to be kept available online as required by courts of law; Big data infrastructure can be used to meet such requirements.
● Large volumes of data must be handled carefully to ensure that all standards relevant to the data are complied with and that security measures are carried out.

6. Storage Performance:
● Processors, memory and core disks are the traditional building blocks of storage, and they have proven beneficial and successful in the working of organisations.
● Distributed storage is a newer storage technology that competes with the above.
● The exchange of data and the persistence of data across different storage layers need to be taken care of while handling Big data projects.

Changing Deployment Models in the Big Data Era
● Data management deployment models have been shifting to different levels ever since the inception of Big data systems alongside the data warehouse.
● The following necessities must be taken care of while handling Big data systems with a data warehouse:
1. Scalability and Speed: The platform should support parallel processing, optimized storage and dynamic query optimization.
2. Agility and Elasticity: Agility means the platform should be flexible and respond rapidly to changing trends; elasticity means the platform can be expanded or shrunk as per the demands of the user.
3. Affordability and Manageability: One must address issues such as flexible pricing, licensed software, customization and cloud-based techniques for managing and controlling data.
4. Appliance Model / Commodity Hardware: Create clusters from commodity hardware.
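To make the pattern-study idea above concrete, here is a minimal, illustrative Python sketch (not from the original slides): it counts hits per link to surface trending topics and filters results by a user's pin code. The click log, link names and pin codes are invented for the example.

```python
# Hypothetical sketch of "pattern study": conjoin hit counts per link to
# expose the overall pattern, and localize results for one location.
from collections import Counter

# Assume each record is (user_pin_code, link) taken from click logs.
click_log = [
    ("411001", "news/sports/final"),
    ("411001", "news/elections"),
    ("560001", "news/elections"),
    ("560001", "news/elections"),
    ("110001", "news/weather"),
]

def trending(clicks, top_n=3):
    """Global pattern: rank links by the number of hits they receive."""
    hits = Counter(link for _, link in clicks)
    return hits.most_common(top_n)

def localized(clicks, pin_code):
    """Localization of data: return only the links seen for one pin code."""
    return [link for pin, link in clicks if pin == pin_code]

if __name__ == "__main__":
    print("Trending:", trending(click_log))                # centralized view
    print("For 560001:", localized(click_log, "560001"))   # localized view
```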

Thank You...
Conceptualizing Data Analysis as a Process
● The “Problem” with Data Analysis
● Data Analysis as a Linear Process
● Data Analysis as a Cycle

UNIT II

The “Problem” with Data Analysis

● Problem: What does ‘data analysis’ mean? Does it refer to one method or many? A collection of different procedures? Is it a process? If so, what does that mean?
● More important, can employees of a company – without a background in math or statistics – learn to identify and use data analysis in their work?
● The answer to the last question is Yes! – assuming a minimum investment of time, effort, and practice.
● Solution: Data analysis should be carried out systematically, following a set of specific procedures and methods.
● However, before programs can effectively use these procedures and methods, we believe it is important to see data analysis as part of a process.
● By this, we mean that data analysis involves goals, relationships, decision making, and ideas, in addition to working with the actual data itself.
● Simply put, data analysis includes ways of working with information (data) to support the work, goals and plans of your program or agency.

Data Analysis as a Process

● From this perspective, we present a data analysis process that includes the following key components:
• Purpose
• Questions
• Data Collection
• Data Analysis Procedures and Methods
• Interpretation/Identification of Findings
• Writing, Reporting, and Dissemination
• Evaluation
Data Analysis as a Linear Process
Linear process:
● A strictly linear approach to data analysis is to work through the
components in order, from beginning to end.
● A possible advantage of this approach is that it is structured and
organized, as the steps of the process are arranged in a fixed order.
● In addition, this linear conceptualization of the process may make it
easier to learn.
● A possible disadvantage is that the step-by-step nature of the decision
making may obscure or limit the power of the analyses – in other
words, the structured nature of the process limits its effectiveness.

Data Analysis as a Cycle
Cyclical process:


● A cyclical approach to data analysis provides much more flexibility to the nature
of the decision making and also includes more and different kinds of decisions to
be made.
● In this approach, different components of the process can be worked on at
different times and in different sequences – as long as everything comes
“together” at the end.
● A possible advantage of this approach is that program staff are not “bound” to
work on each step in order.
● The potential exists for program staff to “learn by doing” and to make
improvements to the process before it is completed.
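As a rough illustration only (not part of the original material), the following Python sketch contrasts the two views just described: a strictly linear pass over the components versus a cyclical run in which components can be revisited. The component names follow the list earlier in this unit; `work_on` and `needs_revision` are placeholder callables introduced here.

```python
# Illustrative-only contrast of the linear and cyclical process views.
COMPONENTS = [
    "Purpose", "Questions", "Data Collection", "Data Analysis",
    "Interpretation", "Writing/Reporting/Dissemination", "Evaluation",
]

def run_linear(work_on):
    # Strictly linear: each component is handled once, in a fixed order.
    for step in COMPONENTS:
        work_on(step)

def run_cyclical(work_on, needs_revision, max_rounds=3):
    # Cyclical: after a full pass, components can be revisited until
    # everything "comes together" or a round limit is reached.
    for _ in range(max_rounds):
        for step in COMPONENTS:
            work_on(step)
        if not needs_revision():
            break

if __name__ == "__main__":
    run_linear(print)                    # one fixed-order pass
    run_cyclical(print, lambda: False)   # one pass; nothing flagged for rework
```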
Process Component #1. Purpose(s):

● What Do We Do? & Why?
● An effective data analysis process is based upon the nature and mission of the organization, as well as upon the skills of the team that is charged with the task of collecting and using data for program purposes.
● Above all, an effective data analysis process is functional – i.e., it is useful and adds value to organizational services and individual practices.
● Therefore, a preliminary step in the data analysis process is to select and train a team to carry out the process.
● More specifically, these standards are the basis for the first step in the data analysis process – forming one or more specific questions to be examined.

Process Component #2. Question(s):

● What Do We Want To Know?
● Before effective data collection or analytical procedures can proceed, one or more specific questions should be formulated.
● These questions serve as the basis for an organized approach to making decisions: first, about what data to collect; and second, about which types of analysis to use with the data.
● Some questions are close-ended and therefore relatively straightforward, e.g., “Did our program meet the 10% mandate for serving children with disabilities last year?” (see the sketch below).
● Other questions are highly open-ended, such as: “How could we do a better job of parent involvement?”
In the first case, there are only two possible answers to the question: “Yes” or “No.” In the second case, a possible answer to the question could include many relevant pieces of information.

CONT...

● Different types of questions require different types of data – which makes a difference in collecting data.
● In any case, the selection of one or more specific questions allows the process of data collection and analysis to proceed.
● Based on the nature and scope of the questions (i.e., what is included in the question), programs can then create a plan to manage and organize the next step in the process – data collection.
● Finally, by formulating specific questions at the beginning of the process, programs are also in a position to develop skills in evaluating their data analysis process in the future.

Process Component #3. Data Collection:

● What Information Can Help Us Answer Our Question(s)?
● Data collection is a process in and of itself, in addition to being a part of the larger whole. Data come in many different types and can be collected from a variety of sources, including:
• Observations
• Questionnaires
• Interviews
• Documents
• Tests
• Others
The value of carefully selecting the questions to be examined is therefore of major importance: the way that the question is worded is the foundation for an effective data collection plan.
We urge programs to develop a specific planning process for data collection (no matter how brief) in order to avoid the common pitfalls of the collection process, which include having:
• Too little data to answer the question;
• More data than is necessary to answer the question; and/or
• Data that is not relevant to answering the question.
CONT...

● In order to successfully manage the data collection process, programs need a plan that addresses the following:
✔ What types of data are most appropriate to answer the questions?
✔ How much data are necessary?
✔ Who will do the collection?
✔ When and where will the data be collected?
✔ How will the data be compiled and later stored?
● By creating a data collection plan, programs can proceed to the next step of the overall process (see the plan sketch below).
● In addition, once a particular round of data analysis is completed, a program can then step back and reflect upon the contents of the data collection plan and identify “lessons learned” to inform the next round.

Process Component #4. Data Analysis:

● What Are Our Results?
● Once data have been collected, the next step is to look at the data and identify what is going on – in other words, to analyze the data. Here, we refer to “data analysis” in a narrower sense: as a set of procedures or methods that can be applied to the collected data in order to obtain one or more sets of results.
● Because there are different types of data, the analysis of data can proceed on different levels.
● The wording of the questions, in combination with the actual data collected, has an influence on which procedure(s) can be used – and to what effect.
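The data collection plan checklist above can be captured directly in code. The following Python sketch is illustrative only: the `DataCollectionPlan` class and its field names are assumptions introduced here, not part of any real library, and the sample values are invented.

```python
# Hypothetical record type mirroring the plan questions: data types, amount,
# who collects, when/where, and how the data will be compiled and stored.
from dataclasses import dataclass, field

@dataclass
class DataCollectionPlan:
    question: str                                       # the question this collection supports
    data_types: list = field(default_factory=list)      # most appropriate data types
    sample_size: str = ""                               # how much data is necessary
    collector: str = ""                                 # who will do the collection
    when_where: str = ""                                # when and where data will be collected
    storage: str = ""                                   # how data will be compiled and stored

plan = DataCollectionPlan(
    question="How could we do a better job of parent involvement?",
    data_types=["questionnaires", "interviews"],
    sample_size="all enrolled families, one survey each",
    collector="family services staff",
    when_where="fall enrollment meetings, at each centre",
    storage="compiled into a spreadsheet on the program share",
)
print(plan)
```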

Process Component #5. Interpretation:

● What Do The Results Mean?
● Once a set of results has been obtained from the data, we can then turn to the interpretation of the results.
● In some cases, the results of the data analysis speak for themselves. For example, if a program’s teaching staff all have bachelor’s degrees, the program can report that 100% of their teachers are credentialed. In this case, the results and the interpretation of the data are (almost) identical.
● However, there are many other cases in which the results of the data analysis and the interpretation of those results are not identical. For example, if a program reports that 30% of its teaching staff has an AA degree, the interpretation of this result is not so clear-cut.
● In this case, interpretation of the data involves two parts: 1) presenting the result(s) of the analysis; and 2) providing additional information that will allow others to understand the meaning of the results.
● In other words, we are placing the results in a context of relevant information.
● Obviously, interpretation involves both decision making and the use of good judgment! We use the term results to refer to any information obtained from using analysis procedures. We use the term findings to refer to results which are agreed upon by the data analysis team as best representing their work.
● In other words, the team may generate a large number of results, but a smaller number of findings will be written up, reported, and disseminated.

CONT...

● On a final note, it is important to state that two observers may legitimately make different interpretations of the same set of data and its results. While there is no easy answer to this issue, the best approach seems to be to anticipate that disagreements can and do occur in the data analysis process.
● As programs develop their skills in data analysis, they are encouraged to create a process that can accomplish dual goals:
● 1) to obtain a variety of perspectives on how to interpret a given set of results; and
● 2) to develop procedures or methods to resolve disputes or disagreements over interpretation.
Process Component #6. Writing, Reporting & Dissemination:

● What Do We Have To Say? How Do We Tell The Story of Our Data?
● Once data have been analyzed and an interpretation has been developed, programs face the next tasks of deciding how to write, report, and/or disseminate the findings.
● First, good writing is structured to provide information in a logical sequence. In turn, good writers are strategic – they use a variety of strategies to structure their writing.
● One strategy is to have the purpose for the written work clearly and explicitly laid out. This helps to frame the presentation and development of the structure of the writing. Second, good writing takes its audience into account.
● Therefore, good writers often specify who their audience is in order to shape their writing.
● A final thought is to look upon the writing/reporting tasks as opportunities to tell the story of the data you have collected, analyzed, and interpreted.

Process Component #7. Evaluation:

● What Did We Learn About Our Data Analysis Process?
● The final step of the data analysis process is evaluation. Here, we do not refer to conducting a program evaluation, but rather an evaluation of the preceding steps of the data analysis process. Here, program staff can review and reflect upon:
● Purpose: Was the data analysis process consistent with federal standards and other relevant regulations?
● Questions: Were the questions worded in a way that was consistent with federal standards, other regulations, and organizational purposes? Were the questions effective in guiding the collection and analysis of data?
● Data Collection: How well did the data collection plan work? Was there enough time allotted to obtain the necessary information? Were data sources used that were not effective? Do additional data sources exist that were not utilized? Did the team collect too little data or too much?
● Data Analysis Procedures or Methods: Which procedures or methods were chosen? Did these conform to the purposes and questions? Were there additional procedures or methods that could be used in the future?
● Interpretation/Identification of Findings: How well did the interpretation process work? What information was used to provide a context for the interpretation of the results? Was additional relevant information not utilized for interpretation? Did team members disagree over the interpretation of the data, or was there consensus?
● Writing, Reporting, and Dissemination: How well did the writing tell the story of the data? Did the intended audience find the presentation of information effective?

Thank You
