Big Data Technology
NoSQL is a lightweight, open-source, non-relational database technology that does not expose the
standard SQL interface.
NoSQL databases are widely used in big data and other real-time web applications.
Features of NoSQL:
■ Distributed
Types of NoSQL databases:
1. Key-value
2. Document
3. Column
4. Graph
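As an illustrative sketch (not tied to any particular NoSQL product; the store names and records below are hypothetical), the key-value and document models can be mimicked with plain Python dictionaries:

```python
# Illustrative sketch of two NoSQL data models; not a real database.

# Key-value model: an opaque value looked up by a unique key.
kv_store = {
    "user:1001": "Alice",
    "user:1002": "Bob",
}
print(kv_store["user:1001"])  # Alice

# Document model: each value is a self-describing, schema-free document,
# so different documents may carry different fields.
doc_store = {
    "user:1001": {"name": "Alice", "email": "alice@example.com"},
    "user:1002": {"name": "Bob", "interests": ["hadoop", "nosql"]},
}

# Query by inspecting document contents rather than a fixed schema.
users_with_email = [doc["name"] for doc in doc_store.values() if "email" in doc]
print(users_with_email)  # ['Alice']
```

The column and graph models differ mainly in how they organize these values on disk and how they link records; the schema-free lookup idea is the same.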
Hadoop is an open-source platform for the storage and processing of diverse data types that
enables data-driven enterprises to rapidly derive full value from all their
data.
History of Hadoop
The name “Hadoop” comes from a yellow plush elephant toy belonging to Doug
Cutting’s son.
■
The scalability and elasticity of free, open-source Hadoop running on
standard hardware allow organizations to hold onto more data than ever
before.
■
Hadoop handles a variety of workloads, including search, log processing,
recommendation systems, data warehousing, and video/image analysis.
■
Apache Hadoop is an open-source project. Hadoop is able to store any
kind of data in its native format and to perform a wide variety of analyses and
transformations on that data. Hadoop stores terabytes, and even petabytes, of
data inexpensively. It is robust and reliable and handles hardware and system
failures automatically, without losing data or interrupting data analyses.
■
Hadoop runs on clusters of commodity servers and each of those servers
has local CPUs and disk storage that can be leveraged by the system.
The two critical components of Hadoop are:
1. The Hadoop Distributed File System (HDFS). HDFS is the storage
system for a Hadoop cluster. When data lands in the cluster, HDFS breaks
it into pieces and distributes those pieces among the different servers
participating in the cluster. Each server stores just a small fragment of the
complete data set, and each piece of data is replicated on more than one
server.
2. MapReduce. Because Hadoop stores the entire dataset in small pieces across a
collection of servers, analytical jobs can be distributed, in parallel, to each of the
servers storing part of the data. Each server evaluates the question against its
local fragment simultaneously and reports its results back for collation into a
comprehensive answer. MapReduce is the agent that distributes the work and
collects the results.
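The two components above can be illustrated with a toy in-process simulation (purely illustrative; real HDFS blocks are, e.g., 128 MB and real MapReduce jobs run distributed across a cluster, whereas everything here happens in one Python process with made-up names):

```python
# Toy simulation of HDFS-style block placement and a MapReduce-style
# word count. All names and sizes are illustrative; not the Hadoop API.
from collections import defaultdict

text = "big data needs big storage and big processing"
words = text.split()

# HDFS-like step: break the data into blocks and replicate each block
# onto more than one "server" in a small 4-server cluster.
block_size = 3  # words per block (a stand-in for a real block size)
blocks = [words[i:i + block_size] for i in range(0, len(words), block_size)]

replication = 2
servers = defaultdict(list)
for i, block in enumerate(blocks):
    for r in range(replication):
        servers[(i + r) % 4].append(block)  # two distinct servers per block

# MapReduce-like step: a map task runs against one replica of each
# block, results are shuffled by key, then reduced into one answer.
def map_phase(block):
    return [(word, 1) for word in block]

shuffled = defaultdict(list)
for block in blocks:  # the scheduler picks one replica per block
    for key, value in map_phase(block):
        shuffled[key].append(value)

counts = {word: sum(ones) for word, ones in shuffled.items()}
print(counts)  # e.g. {'big': 3, 'data': 1, ...}
```

Each map call only ever sees its local fragment; the shuffle step is what gathers the per-fragment results so the reduce step can collate them into a comprehensive answer.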
■
Both HDFS and MapReduce are designed to continue to work in
the face of system failures.
■
Because of the way that HDFS and MapReduce work, Hadoop
provides scalable, reliable, and fault-tolerant services for data storage and
analysis at very low cost.
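To see why replication makes the storage fault-tolerant, a small illustrative calculation (hypothetical 4-server cluster with replication factor 2; the placement scheme is a simplified stand-in for what HDFS actually does):

```python
# Illustrative sketch: with each block replicated on two distinct
# servers, any single server failure leaves every block still readable.
blocks = ["B0", "B1", "B2", "B3"]
num_servers = 4
replication = 2

# Place each block on `replication` distinct servers (round-robin).
placement = {b: {(i + r) % num_servers for r in range(replication)}
             for i, b in enumerate(blocks)}

def available_blocks(failed_server):
    """Blocks still reachable after one server fails."""
    return [b for b, locs in placement.items() if locs - {failed_server}]

# Every block survives the loss of any single server.
for s in range(num_servers):
    assert len(available_blocks(s)) == len(blocks)
print("all blocks survive any single-server failure")
```

With replication factor 3 (a common HDFS default), the cluster can even tolerate two simultaneous server failures without data loss, which is why failures can be handled automatically without interrupting analyses.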
[Figure: a dataset broken into DFS blocks (Block 1, Block 2, Block 3) and distributed across a compute cluster, with a Map task running against each block.]
Summary
Although the source code is released, there are still governing bodies
and agreements in place. The most prominent and popular example is
the GNU General Public License (GPL), which “allows free
distribution under the condition that further developments and
applications are put under the same license.” This ensures that the
products keep improving over time for the greater population of users.
You can make it into what you want and what you need. If you
come up with an idea, you can put it to work immediately. That’s the
advantage of the open-source stack: flexibility, extensibility, and
lower cost.
“One of the great benefits of open source lies in the flexibility of the
adoption model: you download and deploy it when you need it”.
With a cloud model, you pay on a subscription basis with no upfront capital
expense. You don’t incur the typical 30 percent maintenance fees, and all
updates to the platform are automatically available.