Big Data Analytics
“Big Data are high-volume, high-velocity, or high-variety information assets that
require new forms of processing to enable enhanced decision making, insight
discovery, and process optimization.”
Big data analytics is the often complex process of examining large and
varied data sets, or big data, to uncover information such as hidden patterns,
unknown correlations, market trends, and customer preferences that can help
organizations make informed business decisions.
Big Data Issues
Volume: huge volumes of data that cannot be processed with a traditional approach within a given time frame.
Variety: data may be structured, semi-structured, or unstructured.
Velocity: the speed at which data is generated, and the speed at which it moves around and is analyzed; often data must be analyzed while it is being generated, without even putting it into databases.
Value: getting value out of Big Data.
Veracity: data in doubt; uncertainty about the quality and trustworthiness of the data.
Variability: how spread out a group of data is; the common measures of variability are the range, variance, and standard deviation.
[Image: smart meter]
Top Big Data Tools
The Apache Hadoop software library is a big data framework. It allows
distributed processing of large data sets across clusters of computers. It is
designed to scale up from single servers to thousands of machines.
Features:
•Authentication improvements when using HTTP proxy server
•Specification for Hadoop Compatible Filesystem effort
•Support for POSIX-style filesystem extended attributes
•It offers a robust ecosystem that is well suited to meet the analytical needs of developers
•It brings flexibility in data processing
•It allows for faster data processing
HPCC is a big data tool developed by LexisNexis Risk Solutions.
It delivers, on a single platform, a single architecture and a single programming
language for data processing.
Features:
•Highly efficient: it accomplishes big data tasks with far less code
•It can be used both for complex data processing on a Thor cluster and for online query delivery on a Roxie cluster
Storm is a free and open-source big data computation system. It offers a
distributed, real-time, fault-tolerant processing system with real-time
computation capabilities.
Features:
•It has been benchmarked at processing one million 100-byte messages per second per
node
•It uses parallel calculations that run across a cluster of machines
Qubole Data Service is an autonomous big data management platform. It is a self-
managed, self-optimizing tool which allows the data team to focus on
business outcomes.
Used by: AMAZON
Features:
•Single Platform for every use case
•Open-source Engines, optimized for the Cloud
•Comprehensive Security, Governance, and Compliance
The Apache Cassandra database is widely used today to provide effective
management of large amounts of data.
Used by: NETFLIX, WALMART LABS, CERN
Features:
•Support for replication across multiple data centers, providing lower
latency for users
•Data is automatically replicated to multiple nodes for fault-tolerance
Statwing is an easy-to-use statistical tool. It was built by and
for big data analysts. Its modern interface chooses statistical
tests automatically.
Features:
•Explore any data in seconds
•Statwing helps to clean data, explore relationships, and create
charts in minutes
CouchDB stores data in JSON documents that can be accessed from the web or queried
using JavaScript. It offers distributed scaling with fault-tolerant storage. It
allows data to be accessed and synchronized via the Couch Replication Protocol.
Used by: ADP, REFINITIV, APPLE
Features:
•CouchDB is a single-node database that works like any other database
•It allows running a single logical database server on any number of servers
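To illustrate this web/JSON access model, here is a minimal sketch using curl against a local CouchDB instance; the database name "meters" and the sample document are purely illustrative, and recent CouchDB releases additionally require admin credentials (for example -u admin:password):
Command: curl -X PUT http://127.0.0.1:5984/meters
Command: curl -X POST http://127.0.0.1:5984/meters -H "Content-Type: application/json" -d '{"reading": 42.5, "unit": "kWh"}'
Command: curl 'http://127.0.0.1:5984/meters/_all_docs?include_docs=true'
The first command creates a database, the second stores a JSON document in it, and the third reads every document back over plain HTTP.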
Pentaho provides big data tools to extract, prepare, and blend data. It offers
visualizations and analytics that change the way any business is run. This
big data tool allows turning big data into big insights.
Used by: RELIANCE JIO, JPMORGAN CHASE
Features:
•Data access and integration for effective data visualization
•It empowers users to architect big data at the source and stream it for
accurate analytics
Apache Flink is an open-source stream-processing big data tool. It supports distributed,
high-performing, always-available, and accurate data streaming applications.
Used by: XING, SALECYCLE
Features:
•Provides results that are accurate, even for out-of-order or late-arriving data
•This big data tool supports stream processing and windowing with event time
semantics
•It supports flexible windowing based on time, count, or sessions, as well as data-driven
windows
Cloudera is a fast, easy, and highly secure modern big data platform. It allows
anyone to get any data across any environment within a single, scalable platform.
Used by: GLOBAL SYSTEM, EXCEL GLOBAL SYSTEM
Features:
•High-performance analytics
•It offers provision for multi-cloud
•Deploy and manage Cloudera Enterprise across AWS, Microsoft Azure and Google
Cloud Platform
OpenRefine is a powerful big data tool. It helps you work with messy data, cleaning it
and transforming it from one format into another. It also allows extending it with
web services and external data.
Used by: FASTLIX, APICRAFTER
Features:
•The OpenRefine tool helps you explore large data sets with ease
•It can be used to link and extend your dataset with various web services
•Import data in various formats
RapidMiner is an open-source big data tool. It is used for data preparation,
machine learning, and model deployment. It offers a suite of products to
build new data mining processes and set up predictive analytics.
Apache SAMOA is a big data tool for mining data streams; it provides a collection of
distributed streaming algorithms for common data mining and machine learning tasks.
Basic Hadoop Architecture
[Diagram: a Hadoop cluster consisting of master nodes and slave nodes]
At its core, Hadoop has two major layers, namely:
(a) Processing/computation layer (MapReduce)
(b) Storage layer (Hadoop Distributed File System, HDFS)
How Does Hadoop Work?
Data is initially divided into directories and files. Files
are divided into uniform-sized blocks.
These files are then distributed across various cluster
nodes for further processing.
HDFS, being on top of the local file system,
supervises the processing.
Blocks are replicated for handling hardware failure.
In addition, the framework performs the following core tasks:
Checking that the code was executed successfully.
Performing the sort that takes place between the map
and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
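As a rough sketch of this flow on a running single-node cluster (the file and directory names here are purely illustrative), you can load a file into HDFS, inspect how it was split into replicated blocks, and then run the word-count example job that ships with Hadoop:
Command: hdfs dfs -mkdir -p /user/demo/input
Command: hdfs dfs -put sample.txt /user/demo/input/
Command: hdfs fsck /user/demo/input/sample.txt -files -blocks -locations
Command: hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/demo/input /user/demo/output
The fsck report shows the blocks the file was divided into and where each replica is stored; the final command submits a MapReduce job whose map, sort, and reduce stages and per-job logging are handled by the framework as described above.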
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient, and it automatically
distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault tolerance
and high availability (FTHA); rather, the Hadoop library itself has been
designed to detect and handle failures at the application layer.
Servers can be added or removed from the cluster
dynamically and Hadoop continues to operate without
interruption.
Another big advantage of Hadoop is that, apart from being open
source, it is compatible with all platforms since it is Java-based.
HADOOP ─ ENVIRONMENT SETUP
Prerequisites
VIRTUALBOX: used to run the operating system in a virtual machine.
OPERATING SYSTEM: You can install Hadoop on Linux-based
operating systems; Ubuntu and CentOS are very commonly used.
JAVA: You need to install the Java 8 package on your system.
HADOOP: You require Hadoop 2.7.3 package.
Install Hadoop
Step 1: Download the Java 8 Package. Save this file in your home
directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz
Step 3: Download the Hadoop 2.7.3 Package.
Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Step 4: Extract the Hadoop tar File.
Command: tar -xvf hadoop-2.7.3.tar.gz
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc
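For example, assuming the Java and Hadoop archives from the earlier steps were extracted directly into your home directory (adjust the directory names to match your actual versions and locations), the entries to add would look like this:
export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin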
Then, save the bash file and close it.
To apply all these changes to the current terminal,
execute the source command.
Command: source .bashrc
To make sure that Java and Hadoop have been properly
installed on your system and can be accessed through the
terminal, execute the java -version and hadoop version commands.
Command: java -version
Command: hadoop version
Step 6: Edit the Hadoop Configuration files.
Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
All the Hadoop configuration files are located in
hadoop-2.7.3/etc/hadoop.
Step 7: Open core-site.xml and edit the property shown below inside the
configuration tag.
core-site.xml informs Hadoop daemons where the NameNode runs in the
cluster. It contains configuration settings of Hadoop core, such as I/O
settings that are common to HDFS and MapReduce.
Command: vi core-site.xml
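A typical entry for a single-node setup, assuming the NameNode will run on localhost at port 9000 (adjust the host name and port to match your own cluster), looks like this:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>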