
Big Data Technology

(NoSQL and Hadoop)

NoSQL (Not Only SQL)

It is a lightweight, open-source, non-relational database that does not expose the
standard SQL interface.

NoSQL databases are widely used in big data and other real-time web applications.

Features of NoSQL:

1. NoSQL databases are non-relational

2. Distributed

3. No full support for ACID properties (consistency is typically relaxed)

4. No fixed table schema

Types of NoSQL Databases

1. Key-value

2. Document

3. Column

4. Graph
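
To make these four models concrete, here is a minimal sketch in plain Python (no real database client; all names and records are illustrative) of how the same user record might be shaped under each model:

```python
# Illustrative only: one "user" record modeled in each of the four
# NoSQL styles, using plain Python structures.

# 1. Key-value: an opaque value addressed by a single key.
kv_store = {
    "user:1001": '{"name": "Asha", "city": "Pune"}',  # value is just a blob
}

# 2. Document: a self-describing, schema-free document per record.
doc_store = {
    "users": [
        {"_id": 1001, "name": "Asha", "city": "Pune",
         "orders": [{"item": "book", "qty": 2}]},  # nesting is allowed
    ]
}

# 3. Column (wide-column): row keys map to columns; rows may carry
# different columns, so there is no fixed table schema.
column_store = {
    "users": {
        1001: {"name": "Asha", "city": "Pune"},
        1002: {"name": "Ravi"},  # missing columns are simply absent
    }
}

# 4. Graph: nodes and relationships, both of which can carry properties.
graph_store = {
    "nodes": {1001: {"label": "User", "name": "Asha"},
              2001: {"label": "City", "name": "Pune"}},
    "edges": [(1001, "LIVES_IN", 2001)],
}
```

Each shape favors a different access pattern: key-value for fast lookups, document for nested records, column for sparse wide rows, and graph for traversing relationships.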
Hadoop

Hadoop is an open-source platform for storing and processing diverse data types that
enables data-driven enterprises to rapidly derive full value from all their data.

History of Hadoop

The original creators of Hadoop are Doug Cutting (formerly at Yahoo!, now at
Cloudera) and Mike Cafarella (now teaching at the University of Michigan in
Ann Arbor). Doug and Mike were building a project called “Nutch” with the
goal of creating a large Web index. They saw the MapReduce and GFS papers
from Google, which were highly relevant to the problem Nutch was trying to
solve. They integrated the concepts from MapReduce and GFS into Nutch;
later, these two components were pulled out to form the genesis of the
Hadoop project.

The name “Hadoop” itself comes from a yellow plush elephant toy belonging to
Doug’s son.

The scalability and elasticity of free, open-source Hadoop running on
standard hardware allow organizations to hold onto more data than ever
before.

Hadoop handles a variety of workloads, including search, log processing,
recommendation systems, data warehousing, and video/image analysis.

Apache Hadoop is an open-source project. Hadoop is able to store any
kind of data in its native format and to perform a wide variety of analyses and
transformations on that data. Hadoop stores terabytes, and even petabytes, of
data inexpensively. It is robust and reliable, and it handles hardware and system
failures automatically, without losing data or interrupting data analyses.

Hadoop runs on clusters of commodity servers and each of those servers
has local CPUs and disk storage that can be leveraged by the system.
The two critical components of Hadoop are:
1. The Hadoop Distributed File System (HDFS). HDFS is the storage
system for a Hadoop cluster. When data lands in the cluster, HDFS breaks
it into pieces and distributes those pieces among the different servers
participating in the cluster. Each server stores just a small fragment of the
complete data set, and each piece of data is replicated on more than one
server.

2. MapReduce. Because Hadoop stores the entire dataset in small pieces across a
collection of servers, analytical jobs can be distributed, in parallel, to each of the
servers storing part of the data. Each server evaluates the question against its
local fragment simultaneously and reports its results back for collation into a
comprehensive answer. MapReduce is the agent that distributes the work and
collects the results. (Both components are sketched in toy form below.)
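
First, a toy model of the HDFS block-splitting and replication idea from point 1, in Python. This models only the placement logic, not real HDFS; the 128 MB block size and replication factor of 3 are HDFS defaults, and the server names are invented:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size (128 MB)
REPLICATION = 3                 # HDFS default replication factor

def place_blocks(file_size, servers):
    """Toy model of HDFS placement: split a file into fixed-size blocks
    and assign each block to REPLICATION distinct servers."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Rotate through the cluster so replicas land on distinct servers.
        placement[block_id] = [servers[(block_id + r) % len(servers)]
                               for r in range(REPLICATION)]
    return placement

# A 300 MB file on a 5-node cluster -> 3 blocks, each stored on 3 servers.
print(place_blocks(300 * 1024 * 1024, ["s1", "s2", "s3", "s4", "s5"]))
```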
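
Second, the classic word-count illustration of the MapReduce pattern from point 2, again in plain Python. No Hadoop APIs are used here; a real job would be written against Hadoop's Mapper/Reducer interfaces or Hadoop Streaming:

```python
from collections import defaultdict

def map_phase(fragment):
    """Map: emit (word, 1) for every word in this server's local fragment."""
    for line in fragment:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: collate the partial results into a comprehensive answer."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Two "servers", each holding one fragment of the data set.
fragments = [["big data big ideas"], ["data moves fast"]]
all_pairs = [pair for frag in fragments for pair in map_phase(frag)]
print(reduce_phase(all_pairs))
# -> {'big': 2, 'data': 2, 'ideas': 1, 'moves': 1, 'fast': 1}
```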


Both HDFS and MapReduce are designed to continue to work in
the face of system failures.

Because of the way that HDFS and MapReduce work, Hadoop
provides scalable, reliable, and fault-tolerant services for data storage and
analysis at very low cost.

Compute Cluster

[Figure: a compute cluster. The input data set is split into DFS blocks (Block 1, Block 2, Block 3), each replicated on more than one server; a Map task runs on each server against its local blocks, and a Reduce task collates the Map outputs into the final result.]

Old vs. New Approaches



The old way was a data and analytics technology stack with different layers
“cross-communicating data” and running on expensive “scale-up” hardware.

The new way is a data and analytics platform that does all the data
processing and analytics in one “layer,” without moving data back and
forth.

Summary

1. The technology stack has changed. New proprietary technologies and
open-source inventions enable different approaches that make it easier and
more affordable to store, manage, and analyze data.

2. Hardware and storage are affordable and continue to get cheaper, enabling
massive parallel processing.

3. The variety of data is on the rise, as is the ability to handle unstructured
data.

Data Discovery: Work the Way People’s Minds Work


Data discovery tools let users explore and visualize data interactively; leading vendors include Tableau Software and QlikTech International (QlikView).

Open-Source Technology for Big Data Analytics

 Open-source software is computer software that is available in
source code form under an open-source license that permits users to
study, change, and improve it, and at times also to distribute the
software.

 Although the source code is released, there are still governing bodies
and agreements in place. The most prominent and popular example is
the GNU General Public License (GPL), which “allows free
distribution under the condition that further developments and
applications are put under the same license.” This ensures that the
products keep improving over time for the greater population of users.

 Some other open-source projects are managed and supported by
commercial companies, such as Cloudera, that provide extra
capabilities, training, and professional services for open-source
projects such as Hadoop.

 “You can make it into what you want and what you need. If you
come up with an idea, you can put it to work immediately. That’s the
advantage of the open-source stack: flexibility, extensibility, and
lower cost.”

 “One of the great benefits of open source lies in the flexibility of the
adoption model: you download and deploy it when you need it.”

 The pace of software development has accelerated dramatically because of
open-source software.

 The old model was top-down, slow, inflexible, and expensive.
The new software development model is bottom-up, fast,
flexible, and considerably less costly.

 A traditional proprietary stack is defined and controlled by a
single vendor, or by a small group of vendors. It reflects the old
command-and-control mentality of the traditional corporate
world and the old economic order.

 An open-source stack is defined by its community of users and
contributors. No one “controls” an open-source stack, and no one
can predict exactly how it will evolve. The open-source stack
reflects the new realities of the networked global economy, which is
increasingly dependent on big data.

The Cloud and Big Data

With a cloud model, you pay on a subscription basis with no upfront capital
expense. You don’t incur the typical 30 percent annual maintenance fees, and all
updates to the platform are automatically available.
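
The arithmetic behind that claim can be sketched as follows. Every figure here is hypothetical; only the 30 percent maintenance rate comes from the text:

```python
# Hypothetical figures, for illustration only: cumulative cost of an
# upfront license plus ~30% annual maintenance vs. a subscription.
LICENSE = 100_000          # one-time on-premises license fee (assumed)
MAINTENANCE_RATE = 0.30    # the "typical 30 percent" annual maintenance fee
SUBSCRIPTION = 40_000      # annual cloud subscription (assumed)

for years in (1, 3, 5):
    on_prem = LICENSE + LICENSE * MAINTENANCE_RATE * years
    cloud = SUBSCRIPTION * years
    print(f"{years} yr: on-premises ${on_prem:,.0f} vs. cloud ${cloud:,.0f}")
```

Whether the subscription wins depends entirely on the assumed prices and time horizon; the point of the model is only that the cloud shifts cost from upfront capital to pay-as-you-go expense.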

The ability to build massively scalable platforms, where you have the option to
keep adding new products and services at zero additional cost, is giving rise to
business models that weren’t possible before.
