Big Data Analytics Notes
UNIT ‐ I:
Introduction to big data: Data, Characteristics of data and Types of digital data:
Unstructured, Semi- structured and Structured - Sources of data. Big Data Evolution
-Definition of big data-Characteristics and Need of big data-Challenges of big data.
Big data analytics, Overview of business intelligence.
Data is a collection of raw facts and figures, such as numbers, text, measurements, or observations, that carries little meaning until it is processed. For example, data might include individual prices, weights, addresses, ages, names, temperatures, dates, or distances.
Characteristics of Data:
1. Accuracy
Data should be sufficiently accurate for the intended use and should be captured only
once, although it may have multiple uses. Data should be captured at the point of
activity.
2. Validity
Data should be recorded and used in compliance with relevant requirements, including
the correct application of any rules or definitions. This will ensure consistency between
periods and with similar organizations, measuring what is intended to be measured.
3. Reliability
Data should reflect stable and consistent data collection processes across collection
points and over time. Progress toward performance targets should reflect real changes
rather than variations in data collection approaches or methods. Source data should be
clearly identified and readily available from manual, automated, or other systems and records.
4. Timeliness
Data should be captured as quickly as possible after the event or activity and must be
available for the intended use within a reasonable time period. Data must be available
quickly and frequently enough to support information needs and to influence service
or management decisions.
5. Relevance
Data captured should be relevant to the purposes for which it is to be used. This will
require a periodic review of requirements to reflect changing needs.
6. Completeness
Data requirements should be clearly specified based on the information needs of the
organization, and data collection processes should be matched to these requirements.
Structured Data:
Structured data refers to any data that resides in a fixed field within a record or file.
It has a particular data model.
It is meaningful data.
Data is arranged in rows and columns.
Structured data has the advantage of being easily entered, stored, queried, and analyzed.
E.g.: relational databases, spreadsheets.
Structured data is often managed using Structured Query Language (SQL).
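As a minimal sketch of how structured data is queried with SQL from a program, the Java/JDBC example below reads rows from a hypothetical employees table (the connection URL, credentials, and table are illustrative assumptions, not part of these notes):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StructuredDataQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; any relational database works the same way.
        String url = "jdbc:mysql://localhost:3306/company";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             // Structured data: fixed fields (columns) inside records (rows).
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT name, age, salary FROM employees WHERE age > ?")) {
            stmt.setInt(1, 30);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " " + rs.getInt("age")
                            + " " + rs.getDouble("salary"));
                }
            }
        }
    }
}

Because every record has the same fixed fields, the query can refer to columns by name, which is what makes structured data so easy to store and analyze.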
Unstructured Data:
Unstructured data cannot be readily classified to fit into a neat box.
It is also called unclassified data.
It does not conform to any data model.
Business rules are not applied.
Indexing is not required.
Sources of Big Data:
New York Stock Exchange: The New York Stock Exchange is an example of Big Data; it
generates about one terabyte of new trade data per day.
Social Media: Statistics show that 500+ terabytes of new data get ingested into the
databases of the social media site Facebook every day. This data is mainly generated
from photo and video uploads, message exchanges, comments, etc.
Jet engine: A single jet engine can generate 10+ terabytes of data in 30 minutes of
flight time. With many thousands of flights per day, data generation reaches many
petabytes (for example, 5,000 such half-hour flights at 10 terabytes each would already
produce about 50 petabytes in a day).
Characteristics of Big Data (the 5 Vs):
Volume:
The name Big Data itself is related to enormous size. Big Data is a vast 'volume' of
data generated daily from many sources, such as business processes, machines,
social media platforms, networks, human interactions, and many more.
Variety:
Big Data can be structured, unstructured, or semi-structured, collected from many
different sources. In the past, data was collected only from databases and
spreadsheets, but today it arrives in an array of forms: PDFs, emails, audio, social
media posts, photos, videos, etc.
Veracity
Veracity means how reliable the data is. Because big data comes from many sources of
varying quality, it must be filtered and translated so that it can be handled and
managed efficiently. Trustworthy data is essential for business development.
Value
Value is an essential characteristic of big data. What matters is not simply the data
we process or store, but that the data we store, process, and analyze is valuable and
reliable.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers
to the speed at which data is created in real time. It covers the speed of incoming
data sets, their rate of change, and bursts of activity. A primary aspect of Big Data
is providing demanded data rapidly.
Big data velocity deals with the speed at which data flows in from sources like
application logs, business processes, networks, social media sites, sensors, mobile
devices, etc.
Need of Big Data:
Companies are using Big Data to learn what their customers want, who their best
customers are, and why people choose different products. The more a company knows
about its customers, the more competitive it becomes.
We can use it with Machine Learning for creating market strategies based on
predictions about customers. Leveraging big data makes companies customer-centric.
Companies can use historical and real-time data to assess evolving consumer
preferences. This enables businesses to improve and update their marketing strategies,
which makes companies more responsive to customer needs.
The importance of Big Data doesn't revolve around the amount of data a company has;
it lies in how the company utilizes the gathered data. Every company uses its
collected data in its own way, and the more effectively a company uses its data, the
more rapidly it grows.
Companies in the present market need to collect and analyze data because:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to
businesses when they have to store large amounts of data. These tools help
organizations in identifying more effective ways of doing business.
2. Time-Saving
Real-time in-memory analytics helps companies collect data from various sources.
Tools like Hadoop help them analyze data immediately, thus helping them make quick
decisions based on the learnings.
3. Understanding customers
If we don't know what our customers want, it will degrade the company's success and
result in the loss of clientele, which has an adverse effect on business growth.
Big data analytics helps businesses identify customer-related trends and patterns.
Customer behavior analysis leads to a profitable business.
Challenges of Big Data:
Integrating data from a variety of sources. To deal with this challenge, businesses
use data integration software, ETL software, and business intelligence software to
map disparate data sources into a common structure and combine them so they can
generate accurate reports.
Maintaining data quality also means validating data sources against what you expect
them to be and cleaning up corrupted and incomplete data sets. Data quality software
can also be used specifically for the task of validating and cleaning your data before
it is processed.
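A minimal sketch of such validation and clean-up in Java, assuming simple
comma-separated records with three expected fields (the layout and the validity rules
here are illustrative; real data quality software applies far richer checks):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DataCleaner {
    // Keep only records that have exactly 3 fields and a non-empty, numeric id.
    static boolean isValid(String record) {
        String[] fields = record.split(",");
        return fields.length == 3
                && !fields[0].trim().isEmpty()
                && fields[0].trim().matches("\\d+");
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "101,Alice,NY",
                ",Bob,LA",       // corrupted: missing id
                "103,Carol");    // incomplete: missing a field
        List<String> clean = raw.stream()
                .filter(DataCleaner::isValid)
                .collect(Collectors.toList());
        clean.forEach(System.out::println);  // only "101,Alice,NY" survives
    }
}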
When your business begins a data project, start with goals in mind and strategies
for how you will use the data you have available to reach those goals. The team
involved in implementing a solution needs to plan the type of data they need and
the schemas they will use before they start building the system so the project
doesn't go in the wrong direction. They also need to create policies for purging
old data from the system once it is no longer useful.
Another challenge is a shortage of big data skills. There are a few ways to solve this
problem. One is to hire a big data specialist and have that specialist manage and
train your data team until they are up to speed. The specialist can either be hired as
a full-time employee or as a consultant who trains your team and moves on, depending
on your budget.
Another option, if you have time to prepare ahead, is to offer training to your
current team members so they will have the skills once your big data project is in
motion.
8. Organizational resistance
Another way people can be a challenge to a data project is when they resist
change. The bigger an organization is, the more resistant it is to change. Leaders
may not see the value in big data, analytics, or machine learning. Or they may
simply not want to spend the time and money on a new project.
This can be a hard challenge to tackle, but it can be done. You can start with a
smaller project and a small team and let the results of that project prove the value
of big data to other leaders and gradually become a data-driven business. Another
option is placing big data experts in leadership roles so they can guide your
business towards transformation.
Overview of Business Intelligence (BI):
Business Intelligence (BI) refers to the technologies and processes that transform raw
data into meaningful information for decision making. BI supports fact-based decision
making using historical data rather than assumptions and gut feeling.
BI tools perform data analysis and create reports, summaries, dashboards, maps,
graphs, and charts to provide users with detailed intelligence about the nature of the
business.
Why is BI important?
A BI system typically works in the following steps:
Step 1) Raw data is extracted from corporate databases and other operational sources.
Step 2) The data is cleaned and transformed into the data warehouse. Tables can be
linked, and data cubes are formed.
Step 3) Using the BI system, the user can ask queries, request ad-hoc reports, or
conduct any other analysis.
BI System Advantages
1. Boost productivity
With a BI program, it is possible for businesses to create reports with a single
click, saving lots of time and resources. It also allows employees to be more
productive on their tasks.
2. To improve visibility
BI also helps to improve the visibility of business processes and makes it possible
to identify any areas that need attention.
3. Fix Accountability
A BI system assigns accountability within the organization, as there must be someone
who owns accountability and ownership for the organization's performance against its
set goals.
BI System Disadvantages
1. Cost:
Business intelligence can prove costly for small as well as medium-sized enterprises.
The use of such a system may be too expensive for routine business transactions.
2. Complexity:
Another drawback of BI is the complexity of implementing the data warehouse. It can
be so complex that it makes business techniques rigid to deal with.
3. Limited use
Like all improved technologies, BI was first established keeping in mind the buying
capacity of rich firms. Therefore, a BI system is not yet affordable for many small
and medium-sized companies.
UNIT ‐ II:
Big data technologies and Databases: Hadoop – Requirement of Hadoop
Framework - Design principle of Hadoop –Comparison with other system SQL and
RDBMS- Hadoop Components – Architecture -Hadoop 1 vs Hadoop 2.
There are three core components of Hadoop as mentioned earlier. They are HDFS,
MapReduce, and YARN. These together form the Hadoop framework architecture.
1. HDFS (Hadoop Distributed File System):
HDFS is the storage unit of Hadoop. Files are split into blocks, and the blocks are
distributed and replicated across the nodes of the cluster.
Features:
The storage is distributed to handle a large data pool
Distribution increases data security
It is fault-tolerant: if one copy of a block fails, a replica on another node can
pick up the work (see the sketch below)
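As a small sketch of how a client stores and reads a file on HDFS through Hadoop's
Java FileSystem API (the NameNode address and file path are placeholder assumptions):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");
        // Write: HDFS splits the file into blocks and replicates them across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }
        // Read the file back from the cluster.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}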
2. MapReduce:
The MapReduce framework is the processing unit. All data is distributed and processed
in parallel. A MasterNode distributes data amongst the SlaveNodes. The SlaveNodes do
the processing and send the results back to the MasterNode.
Features:
Consists of two phases, Map Phase and Reduce Phase.
Processes big data faster with multiple nodes working in parallel (see the sketch below)
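The two phases can be illustrated without a cluster. The plain-Java sketch below
simulates a word count: the 'map' step emits one word at a time, and the
grouping-and-counting step plays the role of the shuffle and Reduce phases (on a real
cluster this work is spread across the SlaveNodes):

import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MapReduceIdea {
    public static void main(String[] args) {
        String[] lines = { "big data is big", "data is data" };

        // Map phase: split each line into individual words.
        // Shuffle + Reduce phase: group the words by key and sum the occurrences.
        Map<String, Long> counts = Arrays.stream(lines)
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        counts.forEach((word, count) -> System.out.println(word + " -> " + count));
    }
}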
3. YARN (Yet Another Resource Negotiator):
YARN is the resource management unit of Hadoop.
Features:
It acts as an Operating System for the data stored on HDFS
It helps to schedule tasks so as to avoid overloading any system
Hadoop Architecture:
A Hadoop cluster consists of a single master and multiple slave nodes. The master node
includes the Job Tracker, Task Tracker, NameNode, and DataNode, whereas the slave node
includes the DataNode and TaskTracker.
NameNode
o It is the single master server that exists in the HDFS cluster.
o As it is a single node, it may become the reason for a single point of failure.
o It manages the file system namespace by executing operations like opening,
renaming, and closing files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the
file system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process
the data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file.
This process can also be called a Mapper.
MapReduce Layer
The MapReduce comes into existence when the client application submits the
MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to
the appropriate Task Trackers. Sometimes, a TaskTracker fails or times out. In such a
case, that part of the job is rescheduled.
Hadoop 1 vs Hadoop 2
1. Components:
Hadoop 1: HDFS (storage) + MapReduce (resource management and data processing)
Hadoop 2: HDFS (storage) + YARN (resource management) + MapReduce v2 (data processing)
2. Daemons:
Hadoop 1: Namenode, Datanode, Secondary Namenode, Job Tracker, Task Tracker
Hadoop 2: Namenode, Datanode, Secondary Namenode, Resource Manager, Node Manager
3. Working:
In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce works as both
Resource Management and Data Processing. Because of this double workload, MapReduce's
performance is affected.
In Hadoop 2, HDFS is again used for storage, and on top of HDFS there is YARN, which
works as Resource Management. It allocates the resources and keeps everything running.
4. Limitations:
Hadoop 1 is a Master-Slave architecture with a single master, so the master is a
single point of failure. Hadoop 2 is also a Master-Slave architecture, but it consists
of multiple masters (i.e., active NameNodes and standby NameNodes) and multiple
slaves. If the active master node crashes, a standby master node takes over. You can
configure multiple combinations of active and standby NameNodes.
5. Ecosystem
Oozie is basically a workflow scheduler. It decides the particular time at which jobs
should execute according to their dependencies.
Pig, Hive, and Mahout are data processing tools that work on top of Hadoop.
Sqoop is used to import and export structured data. You can directly import and
export data between HDFS and SQL databases.
Flume is used to import and export unstructured data and streaming data.
UNIT ‐ III:
MapReduce and YARN framework: Introduction to MapReduce , Processing data
with Hadoop using MapReduce, Introduction to YARN, Architecture, Managing
Resources and Applications with Hadoop YARN.
Big data technologies and Databases: NoSQL: Introduction to NoSQL - Features
and Types- Advantages & Disadvantages -Application of NoSQL.
3.1 Introduction to MapReduce in Hadoop
MapReduce is a programming model for processing large data sets in parallel across a
Hadoop cluster. A MapReduce job works in two phases: the Map phase and the Reduce
phase. The input to each phase is key-value pairs. In addition, every programmer needs
to specify two functions: a map function and a reduce function.
Let us understand more about MapReduce and its components. MapReduce majorly
has the following three Classes. They are,
Mapper Class
The first stage in data processing using MapReduce is the Mapper Class. Here, the
RecordReader processes each input record and generates the respective key-value pair.
Hadoop stores this intermediate data of the mapper on the local disk.
Input Split
It is the logical representation of data: a chunk of the input that is consumed by a
single map task.
RecordReader
It interacts with the Input Split and converts the obtained data into Key-Value Pairs.
Reducer Class
The Intermediate output generated from the mapper is fed to the reducer which
processes it and generates the final output which is then saved in the HDFS.
Driver Class
The Driver class configures and submits the MapReduce job: it sets the job name, the
Mapper and Reducer classes, the output key-value types, and the input and output paths.
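Putting the three classes together, below is the classic word-count job written
against Hadoop's Java MapReduce API; this is the standard illustration rather than
anything specific to these notes, and the input and output HDFS paths are supplied on
the command line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper class: the RecordReader hands each line to map() as an (offset, line)
    // pair; the mapper emits an intermediate (word, 1) pair for every word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer class: receives each word together with all of its 1s and sums them;
    // the final (word, total) pairs are written to HDFS.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

On a cluster, the class would be packaged into a jar and submitted with, for example,
hadoop jar wordcount.jar WordCount /input /output (the paths here are placeholders).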
Introduction to YARN
YARN (Yet Another Resource Negotiator) takes Hadoop programming beyond Java and
MapReduce, letting other applications such as HBase and Spark work on the data stored
in HDFS. Different YARN applications can co-exist on the same cluster, so MapReduce,
HBase, and Spark can all run at the same time, bringing great benefits for
manageability and cluster utilization.
Components Of YARN
The JobTracker and TaskTracker were used in previous versions of Hadoop and were
responsible for handling resources and tracking progress. Hadoop 2.0 has the
ResourceManager and NodeManager to overcome the shortfalls of the JobTracker and
TaskTracker. The ResourceManager has two main components:
Scheduler
Application Manager
a) Scheduler
The scheduler is responsible for allocating resources to the running applications. It
is a pure scheduler, meaning that it performs no monitoring or tracking of
applications, and it offers no guarantees about restarting failed tasks, whether they
fail due to application failure or hardware failure.
b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible for starting
application masters and for monitoring and restarting them on different nodes in case
of failures.
c) Application Master
One Application Master runs per application. It negotiates resources from the
ResourceManager and works with the NodeManagers. It manages the application life
cycle.
The AM acquires containers from the RM's Scheduler before contacting the
corresponding NMs to start the application's individual tasks.
3.5 NoSQL