Data Analytics CSE704 Module-1
Module-1
Introduction to Big Data
Data Analytics CSE704
Syllabus
• Course objectives:
– To know the fundamental concepts of big data and analytics.
– To explore tools and practices for working with big data
– To learn about stream computing.
– To know about the research that requires the integration of large
amounts of data
• Module I: Introduction to Big Data: (8 Hours)
– Evolution of Big Data – Best Practices for Big Data Analytics –
Big Data Characteristics – Validating the Promotion of the Value
of Big Data – Big Data Use Cases – Characteristics of Big Data
Applications – Perception and Quantification of Value –
Understanding Big Data Storage – A General Overview of High-
Performance Architecture – HDFS – MapReduce and YARN –
MapReduce Programming Model.
Course Outcomes:
• Upon completion of the course, the
students will be able to:
– Work with big data tools and their analysis
techniques
– Analyze data by utilizing clustering and
classification algorithms
– Learn and apply different mining algorithms
and recommendation systems for large
volumes of data
– Perform analytics on data streams
– Learn NoSQL databases and their management.
• Text Book:
– Anand Rajaraman and Jeffrey David Ullman, “Mining of Massive
Datasets”, Cambridge University Press, 2012.
– David Loshin, “Big Data Analytics: From Strategic Planning to
Enterprise Integration with Tools, Techniques, NoSQL, and Graph”,
Morgan Kaufmann/Elsevier Publishers, 2013.
• References:
– EMC Education Services, “Data Science and Big Data Analytics:
Discovering, Analyzing, Visualizing and Presenting Data”, Wiley
publishers, 2015.
– Bart Baesens, “Analytics in a Big Data World: The Essential Guide to
Data Science and its Applications”, Wiley Publishers, 2015.
– Dietmar Jannach and Markus Zanker, “Recommender Systems: An
Introduction”, Cambridge University Press, 2010.
– Kim H. Pries and Robert Dunnigan, “Big Data Analytics: A Practical
Guide for Managers”, CRC Press, 2015.
– Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with
MapReduce”, Synthesis Lectures on Human Language Technologies,
Vol. 3, No. 1, Pages 1–177, Morgan & Claypool Publishers, 2010.
Big data
• Big Data refers to the massive amounts of
data produced by different sources such as
social media platforms, web logs, sensors,
IoT devices, and many more. It can be
structured (like tables in a DBMS),
semi-structured (like XML files), or
unstructured (like audio, video, and images).
• Traditional database management
systems are not able to handle this vast
amount of data; big data technologies help
companies store, process, and analyze it
to generate valuable insights.
Definitions
• According to Gartner, “Big Data is a High Volume, High
Velocity and High Variety Information Assets that
demand Cost Effective and innovative forms of
Information Processing that enable enhanced insight,
Decision Making and Process Automation”.
• According to Ernst and Young, “Big Data refers to the
Dynamic, Large and Disparate Volumes of Data being
created by People, Tools and Machines. It requires New,
Innovative and Scalable Technology to Collect, Host and
Analytically Process the vast amount of Data gathered in
order to derive Real Time Business Insights that relate to
Customers, Risk, Profit, Performance, Productivity
Management and Shareholder Value”.
• Data Streaming:
Data streaming technology has emerged
as a solution to process large volumes of
data in real time, as each record arrives
(a minimal sketch follows below).
• Edge Computing:
Edge computing is a distributed
computing paradigm that allows data
processing to be done at the edge of the
network, closer to the source of the data.
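As a minimal sketch of stream-style processing in Python (not tied to any particular streaming engine), the snippet below handles records one at a time as they arrive and keeps only a small running state; the sensor feed and its readings are invented for illustration.

# Illustrative only; real deployments use engines such as Kafka Streams,
# Flink, or Spark Streaming.

def sensor_stream():
    """Hypothetical source yielding (sensor_id, temperature) readings."""
    readings = [("s1", 21.5), ("s2", 30.2), ("s1", 22.1), ("s2", 29.8)]
    for reading in readings:
        yield reading

running_sum = {}
running_count = {}

for sensor_id, temperature in sensor_stream():
    # Update running aggregates incrementally, one event at a time,
    # instead of loading the whole data set into memory.
    running_sum[sensor_id] = running_sum.get(sensor_id, 0.0) + temperature
    running_count[sensor_id] = running_count.get(sensor_id, 0) + 1
    average = running_sum[sensor_id] / running_count[sensor_id]
    print(f"{sensor_id}: running average = {average:.2f}")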
A Brief History
• 1940s to 1989 – Data Warehousing and
Personal Desktop Computers
• 1989 to 1999 – Emergence of the World
Wide Web
• 2000s to 2010s – Controlling Data
Volume, Social Media and Cloud
Computing
• 2010s to now – Optimization Techniques,
Mobile Devices and IoT
Volume
• Volume refers to the vast amounts of data generated
from many sources daily, such as business
processes, machines, social media
platforms, networks, human interactions, and
many more.
• Facebook, for example, generates approximately
a billion messages, records around 4.5 billion clicks of the
"Like" button, and receives more than 350
million new posts each day. Big data
technologies are built to handle such large amounts of
data.
Veracity
• Veracity refers to how reliable and
trustworthy the data is.
• Because data comes from many sources with
varying quality, it often needs to be filtered,
cleaned, or translated before analysis.
• Veracity is therefore about being able to
handle and manage uncertain data efficiently.
Variety
• Big Data can be structured,
semi-structured, or unstructured, and it is
collected from many different sources.
In the past, data was collected only
from databases and spreadsheets;
today it arrives in a wide array of forms,
such as PDFs, emails, audio, social media
posts, photos, and videos (illustrated below).
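To make the three forms concrete, here is a small illustrative Python sketch; the records shown are invented purely for illustration.

# Structured: fixed schema, like a row in a relational table.
structured_row = {"id": 101, "name": "Asha", "city": "Delhi", "amount": 2500.0}

# Semi-structured: self-describing but flexible, e.g. XML or JSON.
semi_structured_xml = """
<order id="101">
  <customer>Asha</customer>
  <amount currency="INR">2500.0</amount>
</order>
"""

# Unstructured: free-form content with no predefined schema
# (free text, or raw bytes for audio/video/images).
unstructured_text = "Customer tweeted: great service at the Delhi store today!"
unstructured_bytes = b"\x89PNG\r\n..."  # e.g. the leading bytes of an image file

print(type(structured_row), type(semi_structured_xml), type(unstructured_text))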
Value
– Value is an essential characteristic of big data:
it is not the sheer amount of data we store or
process that matters, but whether the data we
store, process, and analyze yields valuable and
reliable insights.
Velocity
– Velocity refers to the speed at which data is
created, often in real time. It covers the rate
at which incoming data sets arrive, their rate
of change, and bursts of activity. A primary
requirement of big data systems is to make this
fast-arriving data available on demand.
Big Data Use Cases
• 1) Creating transparency
– Big Data is analyzed across different boundaries and
can identify a variety of inefficiencies. In
manufacturing organizations, for example, Big Data
can help identify improvement opportunities across
R&D, engineering and production departments in
order to bring new products faster to market.
• 2) Data driven discovery
– Big Data can provide tremendous new insights that
might not have been identified previously by finding
patterns or trends in data sets. In the insurance
industry for example, Big Data can help to determine
profitable products and provide improved ways to
calculate insurance premiums.
High-Performance Computing
• HPC environments are designed for high-speed floating-
point processing, and much of the computation is done
in memory, which delivers the highest computational
performance possible.
• Cray Computers and IBM Blue Gene are examples of
HPC environments.
• HPC environments are predominantly used by research
organizations and by business units that demand very
high scalability and computational performance, where
the value being created is so huge and strategic that
cost is not the most important consideration.
• While HPC environments have been around for quite
some time, they are used for specialty applications and
primarily provide a programming environment for custom
application development.
Hadoop
• Hadoop is an open-source software framework for
storing data and running applications on clusters of
commodity hardware. It provides massive storage for
any kind of data, enormous processing power and the
ability to handle virtually limitless concurrent tasks or
jobs.
Why is Hadoop important?
• Ability to store and process huge amounts of any
kind of data, quickly. With data volumes and varieties
constantly increasing, especially from social media and
the Internet of Things (IoT), that's a key consideration.
• Computing power. Hadoop's distributed computing
model processes big data fast. The more computing
nodes you use, the more processing power you have.
Hadoop Architecture
Hadoop Components
• The Hadoop architecture mainly consists
of four components:
• MapReduce
• HDFS(Hadoop Distributed File System)
• YARN(Yet Another Resource Negotiator)
• Common Utilities or Hadoop Common
Data Replication
• HDFS is designed to reliably store very large
files across machines in a large cluster.
• It stores each file as a sequence of blocks. The
blocks of a file are replicated for fault tolerance.
• The block size and replication factor are
configurable per file.
• All blocks in a file except the last are the same
size. Since support for variable-length blocks was
added to append and hsync, users can start a new
block without filling the last block to the
configured block size. (A small arithmetic sketch
of block splitting and replication follows below.)
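The arithmetic behind block splitting and replication is simple. The Python sketch below assumes a hypothetical 1000 MB file and the common defaults of a 128 MB block size and a replication factor of 3; in HDFS both values are configurable per file.

import math

file_size_mb = 1000          # a 1000 MB file (assumed example)
block_size_mb = 128          # common default HDFS block size (configurable)
replication_factor = 3       # common default replication factor (configurable)

num_blocks = math.ceil(file_size_mb / block_size_mb)
# All blocks are block_size_mb except (usually) the last, which holds the remainder.
last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
raw_storage_mb = file_size_mb * replication_factor

print(f"blocks: {num_blocks} (last block: {last_block_mb} MB)")
print(f"total raw storage with replication: {raw_storage_mb} MB")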
MapReduce
• MapReduce is a programming model and processing
framework that runs on top of the YARN
framework.
• Its major feature is that it performs
distributed processing in parallel across a
Hadoop cluster, which is what makes Hadoop
so fast.
• MapReduce consists of two tasks, executed
phase-wise:
• In the first phase Map is applied, and in the next
phase Reduce is applied (a minimal sketch follows below).
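As a minimal sketch of the two phases, the toy Python word count below runs entirely in memory; real MapReduce distributes the same steps across a cluster, and the sample documents are invented for illustration.

from collections import defaultdict

documents = ["big data big insights", "data drives decisions"]

# Map phase: each input record is turned into (key, value) pairs.
mapped = []
for line in documents:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle: group all values by key (done automatically by the framework in Hadoop).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 2, 'insights': 1, 'drives': 1, 'decisions': 1}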
Map Task
• RecordReader: The purpose of the RecordReader is to break the input into
records. It is responsible for providing key-value pairs to the Map() function.
The key is typically the record's positional (offset) information and the value
is the data associated with it.
• Map: The map is a user-defined function whose job is to process the tuples
obtained from the RecordReader. The Map() function may emit zero, one, or
many key-value pairs for each input tuple.
• Combiner: The combiner is used for grouping data within the Map
workflow. It acts like a local reducer: the intermediate key-value pairs
generated by the Map are combined with the help of this combiner. Using a
combiner is optional.
• Partitioner: The partitioner is responsible for taking the key-value pairs
generated in the Map phase and assigning them to the shards corresponding
to each reducer. It computes the hash code of each key and takes its
modulus with the number of reducers (key.hashCode() % numberOfReducers),
as shown in the sketch after this list.
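The following Python sketch imitates the hash-partitioning rule described above; Python's built-in hash() stands in for Java's hashCode(), and the keys and reducer count are assumptions for illustration.

NUM_REDUCERS = 3

def partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Return the index of the reducer that should receive this key."""
    # Masking keeps the hash non-negative before the modulus, mirroring
    # Hadoop's default HashPartitioner behaviour.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

for key in ["big", "data", "insights", "decisions"]:
    print(f"key={key!r} -> reducer {partition(key)}")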
Reduce Task
• Shuffle and Sort: The reducer's task starts with this step. The
process in which the intermediate key-value pairs generated by the
Mapper are transferred to the Reducer task is known as shuffling.
During shuffling the system sorts the data by key. Shuffling begins
as soon as some of the map tasks are done; it does not wait for all
mappers to finish, which makes the overall job faster.
• Reduce: The main task of the Reduce function is to gather the
tuples generated by the Map phase and then perform aggregation
on those key-value pairs, grouped by their key.
• OutputFormat: Once all the operations are performed, the
key-value pairs are written to the output file with the help of a
record writer, each record on a new line, with the key and value
separated by a space (see the sketch after this list).
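The Python sketch below imitates the Reduce side described above: intermediate pairs are sorted and grouped by key (shuffle and sort), each group is aggregated (reduce), and the results are written one record per line. The output file name and the input pairs are assumptions, and a space separator is used to match the description (Hadoop's default TextOutputFormat actually uses a tab).

from itertools import groupby

intermediate = [("data", 1), ("big", 1), ("data", 1), ("big", 1), ("insights", 1)]

# Shuffle and sort: order the pairs by key so equal keys become adjacent groups.
intermediate.sort(key=lambda pair: pair[0])

with open("part-r-00000.txt", "w") as out:
    for key, group in groupby(intermediate, key=lambda pair: pair[0]):
        total = sum(value for _, value in group)  # reduce: aggregate per key
        out.write(f"{key} {total}\n")             # output: one record per line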
Features of YARN
• Multi-Tenancy (multiple users and workloads sharing a single cluster)
• Scalability
• Cluster-Utilization
• Compatibility
Phases of MapReduce
Code
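Below is a hedged sketch of the classic word-count job written for Hadoop Streaming, where the mapper and reducer are plain Python scripts that read standard input and write standard output; the script names (mapper.py, reducer.py) are assumptions for illustration.

#!/usr/bin/env python3
# mapper.py -- emits one "word<TAB>1" line per word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sums the counts for each word; Hadoop Streaming delivers the
# mapper output to the reducer sorted by key, so equal words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

Locally, the pipeline can be tested with: cat input.txt | python3 mapper.py | sort | python3 reducer.py. On a cluster, the same scripts would be submitted through the hadoop-streaming jar (exact paths depend on the installation).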
Further Study
• EMC Education Services, “Data Science and Big Data
Analytics: Discovering, Analyzing, Visualizing and Presenting
Data”, Wiley publishers, 2015.
• Bart Baesens, “Analytics in a Big Data World: The Essential
Guide to Data Science and its Applications”, Wiley Publishers,
2015.
• https://www.projectpro.io/article/5-big-data-use-cases-how-companies-use-big-data/155
• https://www.geeksforgeeks.org/mapreduce-architecture/
• https://www.techtarget.com/searchbusinessanalytics/definition/big-data-analytics