Big Data Analytics

This document provides an overview of big data analytics and related topics. It discusses the evolution of technology, defines big data and its common issues, and covers big data analytics vs. data science analytics, the six V's of big data, the sources of big data, predictive-analysis processing, popular big data tools in 2020, an introduction to Hadoop and its architecture, and big data job opportunities.

FACULTY DEVELOPMENT PROGRAM
ANALYTICS

JAI PRATAP DIXIT
Head, Information Technology
Ambalika Institute of Management and Technology, Lucknow
• Evolution of Technology
• What is Big Data: issues, issues with the existing system, challenges
• Use Cases of Big Data
• Benefits of Big Data
• Big Data Analytics vs. Data Science Analytics
• 6 V's of Big Data
• The Sources of Big Data
• Big Data Predictive Analysis Processing
• Big Data Tools in 2020
• Best Big Data Analytics Tools in 2020
• Big Data Hadoop: Introduction and Architecture
• How Does Hadoop Work? and Its Advantages
• Hadoop Environment Setup
• Hadoop HDFS, MapReduce, and YARN Operations
• Big Data Job Opportunities
Units of data
• The bit
• The nibble
• The byte
• Kilobyte (1,024 bytes)
• Megabyte (1,024 kilobytes)
• Gigabyte (1,024 megabytes)
• Terabyte (1,024 gigabytes)
• Petabyte (1,024 terabytes)
• Exabyte (1,024 petabytes)
• Zettabyte (1,024 exabytes)
• Yottabyte (1,024 zettabytes)

Then there is the hypothetical "googolbyte," which would be a number of bytes equal to a 10 followed by 100 zeroes.
What is Big Data?
(A large amount of data)

"Big Data are high-volume, high-velocity, or high-variety information assets that
require new forms of processing to enable enhanced decision making, insight
discovery, and process optimization."

Big data analytics is the often complex process of examining large and
varied data sets, or big data, to uncover information -- such as hidden patterns,
unknown correlations, market trends, and customer preferences -- that can help
organizations make informed business decisions.

In short: a huge volume of data that cannot be processed
with a traditional approach within a given time frame.
Big Data Issues

• Technology moves too fast
• Inaccurate data
• Lack of skilled workers
• Complexity of storage size
• Backup and recovery
• The challenge of search and optimizing time
• Variations in tools and technologies
• The dynamism of change and scalability
• Resistance in building the workforce
• The speed of applying research
• The variety of data formats that arise
• The combination of multiple skill sets demanded
• Etc.
Issues with the Existing System (Big Data Challenges)
• Capturing data
• Storage
• Searching
• Sharing
• Transfer
• Analysis
Use Cases of Big Data
 Recommendation Engines
 Analyzing Call Detail Records
 Fraud Detection
 Market Basket Analysis
 Sentiment Analysis
Big Data Analytics vs. Data Science Analytics

Big Data Analytics deals with unstructured and structured data, and
comprises everything related to data cleansing, preparation, and analysis.

Data Science deals with the combination of statistics, mathematics,
programming, problem-solving, capturing data in ingenious
ways, the ability to look at things differently, and the activity of
cleansing, preparing, and aligning the data.
6 V's of Big Data

Volume: the quantity of data; data sets too large to store and analyze using traditional databases.

Variety: the different types of data that we can use:
 Structured
 Semi-structured
 Unstructured

Velocity: the speed at which data is generated, and the speed at which data is moving around and analyzed; analyzing data while it is being generated, without even putting it into databases.

Value: getting value out of Big Data!

Veracity: data in doubt.

Variability: how spread out a group of data is. The common
measures of variability are the range, variance, and standard
deviation. Data sets with similar values are said to have
little variability, while data sets with values that are
spread out have high variability.
The Sources of Big Data
 Black Box Data 
 Social Media Data
 Stock Exchange Data
 Power Grid Data
 Transport Data
 Search Engine Data
 Banking Data
 Website Data
 Retail Data
 Etc….
APPLICATION OF BIG DATA
 Banking and Securities
 Communications, Media and Entertainment
 Healthcare Providers
 Education
 Manufacturing and Natural Resources
 Government
 Insurance
 Retail and Wholesale trade
 Transportation
 Energy and Utilities
Companies using data and analytics to gain a
competitive edge (big data is big business):
 Amazon
 American Express
 Netflix
 Next Big Sound
 Starbucks
 IBM
 Enterprise
 SAP
 Microsoft
 Amazon Web Services
 Google
 TIBCO
 BigPanda
 Splice Machine
 Etc.
Big Data Analysis
(Slide figure: smart-meter data analysis)
Top Big Data Tools
The Apache Hadoop software library is a big data framework. It allows
distributed processing of large data sets across clusters of computers. It is
designed to scale up from single servers to thousands of machines.
Features:
•Authentication improvements when using HTTP proxy server
•Specification for Hadoop Compatible Filesystem effort
•Support for POSIX-style filesystem extended attributes
•It offers a robust ecosystem that is well suited to meet the analytical needs
of developers
•It brings flexibility in data processing
•It allows for faster data processing
HPCC is a big data tool developed by LexisNexis Risk Solutions.
It delivers a single platform, a single architecture, and a single programming
language for data processing.

Used by: 3 LQ Labs, ThreatMetrix, Elsevier

Features:
•Highly efficient; it accomplishes big data tasks with far less code
•Offers high redundancy and availability
•It can be used for complex data processing on a Thor cluster
Storm is a free and open-source big data computation system. It offers a
distributed, real-time, fault-tolerant processing system with real-time
computation capabilities.

Used by: FullContact Inc., Lookout

Features:
•It has been benchmarked as processing one million 100-byte messages per second per
node
•It uses parallel calculations that run across a cluster of machines
Qubole Data is an autonomous big data management platform. It is a self-
managed, self-optimizing tool that allows the data team to focus on
business outcomes.

Used by: Amazon
Features:
•Single Platform for every use case
•Open-source Engines, optimized for the Cloud
•Comprehensive Security, Governance, and Compliance
The Apache Cassandra database is widely used today to provide
effective management of large amounts of data.

Used by: Netflix, Walmart Labs, CERN
Features:
•Support for replication across multiple data centers, providing lower
latency for users
•Data is automatically replicated to multiple nodes for fault-tolerance
Statwing is an easy-to-use statistical tool. It was built by and
for big data analysts. Its modern interface chooses statistical
tests automatically.
Features:
•Explore any data in seconds
•Statwing helps to clean data, explore relationships, and create
charts in minutes
CouchDB stores data in JSON documents that can be accessed from the web or
queried using JavaScript. It offers distributed scaling with fault-tolerant storage. It
allows accessing data by defining the Couch Replication Protocol.

Used by: ADP, Refinitiv, Apple

Features:
•CouchDB is a single-node database that works like any other database
•It allows running a single logical database server on any number of servers
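As a quick illustration of the JSON-over-HTTP interface, here is a minimal sketch assuming a default local CouchDB on port 5984, no authentication configured, and a hypothetical database named demo_db:
Command: curl -X PUT http://127.0.0.1:5984/demo_db
Command: curl -X POST http://127.0.0.1:5984/demo_db -H "Content-Type: application/json" -d '{"tool":"CouchDB"}'
Command: curl -X GET http://127.0.0.1:5984/demo_db/_all_docs
The first call creates the database, the second stores a JSON document in it, and the third lists the stored documents.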
Pentaho provides big data tools to extract, prepare and blend data. It offers
visualizations and analytics that change the way to run any business. This
Big data tool allows turning big data into big insights.
Used by: Reliance Jio, JPMorgan Chase
Features:
•Data access and integration for effective data visualization
•It empowers users to architect big data at the source and stream it for
accurate analytics
Apache Flink is an open-source stream-processing big data tool. It is a distributed,
high-performing, always-available, and accurate engine for data streaming applications.

Used by: XING, SaleCycle
Features:
•Provides results that are accurate, even for out-of-order or late-arriving data
•This big data tool supports stream processing and windowing with event time
semantics
•It supports flexible windowing based on time, count, or sessions, as well as data-driven
windows
Cloudera is a fast, easy, and highly secure modern big data platform. It allows
anyone to get any data across any environment within a single, scalable platform.

Used by: Global System, Excel Global System

Features:
•High-performance analytics
•It offers provision for multi-cloud
•Deploy and manage Cloudera Enterprise across AWS, Microsoft Azure and Google
Cloud Platform
OpenRefine is a powerful big data tool. It helps to work with messy data, cleaning it
and transforming it from one format into another. It also allows extending it with
web services and external data.

Used by: Fastlix, APICrafter

Features:
•The OpenRefine tool helps you explore large data sets with ease
•It can be used to link and extend your dataset with various web services
•Import data in various formats
RapidMiner is an open-source big data tool. It is used for data prep,
machine learning, and model deployment. It offers a suite of products to
build new data mining processes and set up predictive analysis.

Features:
•Allows multiple data management methods
•GUI or batch processing
•Integrates with in-house databases
DataCleaner is a data quality analysis application and a solution platform. It
has a strong data profiling engine. It is extensible and thereby adds data
cleansing, transformation, matching, and merging.

Features:
•Interactive and explorative data profiling
•Fuzzy duplicate record detection
•Data transformation and standardization
Hive is an open-source big data tool. It allows programmers to
analyze large data sets on Hadoop. It helps with querying and managing
large datasets quickly.

Used by: Microsoft, Apple

Features:
•It supports an SQL-like query language for interaction and data modeling
•It compiles queries into two main tasks: map and reduce
Kaggle is the world's largest big data community. It helps
organizations and researchers to post their data & statistics. It
is the best place to analyze data seamlessly.
Features:
•The best place to discover and seamlessly analyze open data
SAP
SAP's main Big Data tool is its HANA in-memory relational
database that works with Hadoop. HANA is a traditional row-
and-column database, but it can perform advanced analytics,
like predictive analytics, spatial data processing, text analytics,
text search, streaming analytics, and graph data processing and
has ETL (Extract, Transform, and Load) capabilities. SAP also
offers data warehousing to manage all of your data from a
single platform, cloud services, as well as data management
tools for governance, orchestration, cleansing, and storage.
Best Big Data Analytics Tools in 2020
Xplenty is a cloud-based ETL solution providing simple visualized data pipelines for
automated data flows across a wide range of sources and destinations. Xplenty's
powerful on-platform transformation tools allow you to clean, normalize, and
transform data while also adhering to compliance best practices.
Features:
•Powerful, code-free, on-platform data transformation offering
•REST API connector: pull in data from any source that has a REST API
Azure HDInsight is a Spark and Hadoop service in the cloud. It provides big
data cloud offerings in two categories, Standard and Premium. It provides
an enterprise-scale cluster for the organization to run their big data
workloads.
Features:
•Reliable analytics with an industry-leading SLA
•It offers enterprise-grade security and monitoring
•Protect data assets and extend on-premises security and governance
controls to the cloud
Skytree is a big data analytics tool that empowers data scientists to build
more accurate models faster. It offers accurate predictive machine learning
models that are easy to use.
Features:
•Highly Scalable Algorithms
•Artificial Intelligence for Data Scientists
•It allows data scientists to visualize and understand the logic behind ML
decisions
Talend is a big data tool that simplifies and automates big data integration.
Its graphical wizard generates native code. It also allows big data
integration, master data management and checks data quality.
Features:
•Accelerate time to value for big data projects
•Simplify ETL & ELT for big data
•Talend Big Data Platform simplifies using MapReduce and Spark by
generating native code
Splice Machine is a big data analytics tool. Its architecture is portable
across public clouds such as AWS, Azure, and Google.
Features:
•It can dynamically scale from a few to thousands of nodes to enable
applications at every scale
•The Splice Machine optimizer automatically evaluates every query to the
distributed HBase regions
Apache Spark is a powerful open source big data analytics tool.
It offers over 80 high-level operators that make it easy to build
parallel apps. It is used at a wide range of organizations to
process large datasets.
Features:
•It helps to run an application in a Hadoop cluster, up to 100
times faster in memory, and ten times faster on disk
•It offers lightning-fast processing
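A quick way to see Spark in action is a minimal sketch assuming a local Spark download, using the SparkPi example bundled with every Spark distribution:
Command: ./bin/run-example SparkPi 10
This submits a small job that estimates the value of pi in parallel across ten tasks and prints the result to the console.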
Plotly is an analytics tool that lets users create charts and dashboards to share
online.
Features:
•Easily turn any data into eye-catching and informative graphics
•It provides audited industries with fine-grained information on data provenance

Apache SAMOA:

Apache SAMOA is a big data analytics tool. It enables development of new ML


algorithms. It provides a collection of distributed algorithms for common data
mining and machine learning tasks.
Lumify is a big data fusion, analysis, and visualization platform. It helps
users to discover connections and explore relationships in their data via a
suite of analytic options.
Features:
•It provides both 2D and 3D graph visualizations with a variety of automatic
layouts
•It provides a variety of options for analyzing the links between entities on
the graph
Elasticsearch is a JSON-based big data search and analytics engine. It is a
distributed, RESTful search and analytics engine for solving a number of use
cases. It offers horizontal scalability, maximum reliability, and easy
management.

Features:
•It allows combining many types of searches, such as structured, unstructured,
geo, metric, etc.
•Intuitive APIs for monitoring and management give complete visibility and
control
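To illustrate the RESTful interface, here is a minimal sketch assuming a default local node on port 9200 and a hypothetical index named products:
Command: curl -X PUT "localhost:9200/products/_doc/1" -H "Content-Type: application/json" -d '{"name":"laptop","price":999}'
Command: curl -X GET "localhost:9200/products/_search?q=name:laptop"
The first call indexes a JSON document; the second runs a full-text query against it.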
R is a language for statistical computing and graphics. It is also
used for big data analysis. It provides a wide variety of
statistical tests.
Features:
•Effective data handling and storage facility,
•It provides a suite of operators for calculations on arrays, in
particular, matrices,
•It provides a coherent, integrated collection of big data tools for
data analysis
Kafka
Kafka is often used in real-time streaming data architectures
to provide real-time analytics. Since Kafka is a fast, scalable,
durable, and fault-tolerant publish-subscribe messaging
system, Kafka is used in use cases where JMS, RabbitMQ,
and AMQP may not even be considered due to volume and
responsiveness.
Kafka has higher throughput, reliability, and replication
characteristics, which makes it applicable for things like
tracking service calls (tracks every call) or tracking IoT sensor
data where a traditional MOM might not be considered.
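To see the publish-subscribe flow, here is a minimal sketch using the console tools that ship with Kafka, assuming a local broker on port 9092 and a hypothetical topic named sensor-data (the --bootstrap-server flag applies to recent Kafka releases):
Command: bin/kafka-topics.sh --create --topic sensor-data --bootstrap-server localhost:9092
Command: bin/kafka-console-producer.sh --topic sensor-data --bootstrap-server localhost:9092
Command: bin/kafka-console-consumer.sh --topic sensor-data --from-beginning --bootstrap-server localhost:9092
The producer reads lines from stdin and publishes them to the topic; the consumer replays the topic from the beginning.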
What is Hadoop?
Using the solution provided by Google, Doug Cutting and his
team developed an open-source framework project called
HADOOP.
4 September 2007: release 0.14.1
31 May 2018: release 3.0.3
Hadoop runs applications using the MapReduce algorithm,
where the data is processed in parallel across nodes. In short,
Hadoop is used to develop applications that can perform
complete statistical analysis on huge amounts of data.
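To make the MapReduce flow concrete, here is a minimal sketch of running the WordCount example that ships with Hadoop, assuming the environment setup described later, a running cluster, and a hypothetical local file words.txt:
Command: hdfs dfs -mkdir -p /input
Command: hdfs dfs -put words.txt /input
Command: hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output
Command: hdfs dfs -cat /output/part-r-00000
The mappers emit a (word, 1) pair for every word in the input blocks, and the reducers sum the counts for each word to produce the final tally.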
Contd.
 Hadoop Distributed File System (HDFS)
 MapReduce

An open-source framework that allows distributed processing of large data sets across a
cluster of commodity hardware.

Hadoop Characteristics
• Every day we generate 2.3 trillion GBs of data
• Hadoop handles huge volumes of data efficiently
• Hadoop uses the power of distributed computing
• HDFS & YARN are two main components of Hadoop
• It is highly fault tolerant, reliable & available

Hadoop Components
Hadoop Nodes
 Master nodes
 Slave nodes
Basic Hadoop Architecture
At its core, Hadoop has two major layers, namely:
(a) Processing/computation layer (MapReduce)
(b) Storage layer (Hadoop Distributed File System, HDFS)
How Does Hadoop Work?
 Data is initially divided into directories and files. Files
are divided into uniformly sized blocks.
 These files are then distributed across various cluster
nodes for further processing.
 HDFS, being on top of the local file system,
supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map
and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.
Advantages of Hadoop
 The Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient, and it automatically
distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault tolerance
and high availability (FTHA); rather, the Hadoop library itself has been
designed to detect and handle failures at the application layer.
 Servers can be added or removed from the cluster
dynamically, and Hadoop continues to operate without
interruption.
 Another big advantage of Hadoop is that, apart from being open
source, it is compatible with all platforms since it is Java-based.
HADOOP ─ ENVIRONMENT SETUP
Prerequisites
VIRTUAL BOX: used to install the operating system.
OPERATING SYSTEM: You can install Hadoop on Linux-based
operating systems; Ubuntu and CentOS are very commonly used.
JAVA: You need to install the Java 8 package on your system.
HADOOP: You require the Hadoop 2.7.3 package.

Install Hadoop
Step 1: Download the Java 8 Package. Save this file in your home
directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz
Step 3: Download the Hadoop 2.7.3 Package.
Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Step 4: Extract the Hadoop tar File.
Command: tar -xvf hadoop-2.7.3.tar.gz
HADOOP ─ ENVIRONMENT SETUP
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file. Now, add the Hadoop and Java paths as shown below.
Command: vi .bashrc
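For example (a minimal sketch assuming Java and Hadoop were extracted into the home directory, matching the tar files from Steps 2 and 4; adjust the paths to your actual install locations):
export JAVA_HOME=$HOME/jdk1.8.0_101
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin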
Then, save the bash file and close it.
To apply all these changes to the current Terminal,
execute the source command.
Command: source .bashrc
To make sure that Java and Hadoop have been properly
installed on your system and can be accessed through the
Terminal, execute the java -version and hadoop version commands.
Command: java -version
Command: hadoop version
Step 6: Edit the Hadoop Configuration files.
Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
All the Hadoop configuration files are located in hadoop-2.7.3/etc/hadoop.
HADOOP ─ ENVIRONMENT SETUP
Step 7: Open core-site.xml and edit the property mentioned below inside the
configuration tag:
core-site.xml informs Hadoop daemon where NameNode runs in the
cluster. It contains configuration settings of Hadoop core such as I/O
settings that are common to HDFS & MapReduce.
Command: vi core-site.xml
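For example (a minimal single-node sketch; hdfs://localhost:9000 assumes the NameNode runs locally on the port most tutorials use):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>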

Step 8: Edit hdfs-site.xml and edit the property mentioned below inside the
configuration tag:
hdfs-site.xml contains configuration settings of HDFS daemons (i.e.
NameNode, DataNode, Secondary NameNode). It also includes the
replication factor and block size of HDFS.
Command: vi hdfs-site.xml
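For example (a minimal single-node sketch; a replication factor of 1 suits a single machine, and the two directory paths are placeholders to adjust):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/datanode</value>
  </property>
</configuration>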
HADOOP ─ ENVIRONMENT SETUP
Step 9: Edit the mapred-site.xml file and edit the property mentioned
below inside the configuration tag:
mapred-site.xml contains configuration settings for MapReduce
applications, such as the number of JVMs that can run in parallel, the size of
the mapper and the reducer processes, the CPU cores available for a process,
etc.
In some cases, the mapred-site.xml file is not available, so we have to
create the mapred-site.xml file from the mapred-site.xml template.
Command: cp mapred-site.xml.template mapred-site.xml
Command: vi mapred-site.xml
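For example (a minimal sketch that simply points MapReduce at YARN, the usual setting for Hadoop 2.x):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>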
Step 10: Edit yarn-site.xml and edit the property mentioned below inside the
configuration tag:
yarn-site.xml contains configuration settings for the ResourceManager and
NodeManager, such as application memory management size, the operations
needed on programs and algorithms, etc.
Command: vi yarn-site.xml
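For example (a minimal sketch enabling the shuffle service that MapReduce jobs need on each NodeManager):
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>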
Job Opportunities
•Big Data Manager
•Big Data Analyst
•Big Data Developer
•Big Data Consultants
•Big Data Architect
•Big Data Programmer
……..
