B.M.S. College of Engineering
UNIT 1
Introduction to Big Data Analytics
• The rise in technology has led to the production and storage of voluminous amounts of data.
• Earlier, data was measured in megabytes; today, petabytes of data are processed and analysed to discover new facts and generate new knowledge.
• Conventional systems for storage, processing and analysis face challenges from the large growth in data volume, the variety of forms and formats, increasing complexity, faster generation of data, and the need for quick processing, analysis and usage.
Note: As size and complexity increase, the proportion of unstructured data types also increases.
Department of Computer Science Engineering (Data Science)
Need of Big Data
• An example of a traditional tool for structured data storage and querying is an RDBMS.
• Big Data requires new tools for processing and analysing large volumes of data, for example, NoSQL (Not only SQL) data stores or Hadoop-compatible systems for unstructured data.
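The contrast between a fixed-schema RDBMS and a schemaless NoSQL-style document store can be sketched in a few lines of Python. This is a minimal illustration using only the standard library's sqlite3 module; the document-store side is simulated with plain dicts, and the table and field names are hypothetical.

```python
import sqlite3

# RDBMS: a fixed schema must be declared before any row is inserted.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Asha', 'Bengaluru')")
row = db.execute("SELECT name FROM customers WHERE id = 1").fetchone()

# NoSQL-style document store (simulated): each document may have its own
# shape, so semi-structured data fits without schema migrations.
documents = [
    {"id": 1, "name": "Asha", "city": "Bengaluru"},
    {"id": 2, "name": "Ravi", "clicks": ["home", "cart"], "device": "mobile"},
]
by_id = {doc["id"]: doc for doc in documents}

print(row[0])                  # value retrieved through SQL
print(by_id[2].get("clicks"))  # field a rigid relational schema would not hold
```

The second document carries fields the first one lacks; in an RDBMS this would require altering the table, while the document store simply accepts it.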
3. Multi-Structured Data
• Multi-structured data refers to data consisting of multiple formats of data, viz. structured, semi-structured and/or
unstructured data.
• For example, streaming data on customer interactions, data from multiple sensors, data at a web or enterprise server, or data-warehouse data in multiple formats.
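Multi-structured data can be illustrated by ingesting three formats of the same kind of customer-interaction feed in one pass. The sketch below is a hypothetical toy example using only the Python standard library; the field names and the tokenising rule for the unstructured text are assumptions, not part of any real feed.

```python
import csv
import io
import json

records = []

# Structured: rows with a fixed schema (e.g. exported from an RDBMS).
structured = io.StringIO("user,action\nasha,purchase\nravi,view")
records.extend(dict(r) for r in csv.DictReader(structured))

# Semi-structured: JSON with optional, nested fields (e.g. web/sensor logs).
semi = '{"user": "meera", "action": "view", "meta": {"device": "mobile"}}'
records.append(json.loads(semi))

# Unstructured: free text; minimal structure is extracted by tokenising.
text = "ravi asked about a refund for his order"
records.append({"user": text.split()[0], "action": "support", "raw": text})

actions = [r["action"] for r in records]
print(actions)  # all three formats end up as uniform records
```

All three sources land in one list of dicts, which is the essential point: a multi-structured pipeline normalises differently shaped inputs into a common record form for downstream analysis.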
• Big Data architecture is the logical and/ or physical layout/structure of how Big Data will be stored,
accessed and managed within a Big Data or IT environment.
• The architecture logically defines how a Big Data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more.
• Characteristics of Big Data make designing Big Data architecture a complex process.
• The requirement to offer competitive products at lower costs in the market makes the design task more challenging for a Big Data architect.
• Five vertically aligned textboxes on the left of Figure 1.2 show the layers. Horizontal textboxes show the
functions in each layer.
Figure 1.2 Design of logical layers in a data processing architecture, and functions in the layers
LAYER 1
• Considers the amount of data needed at the ingestion layer (Layer 2) and whether data is pushed from Layer 1 or pulled by Layer 2, as per the usage mechanisms
• Source data types: databases, files, internal or external sources
• Source formats: structured, semi-structured or unstructured
LAYER 2
• Considers ingestion and ETL processes, either in real time (storing and using the data as it is generated) or in batches.
LAYER 3
• Data storage type, format, compression, incoming data frequency, querying patterns
• Data storage using HDFS or NoSQL data stores (HBase, Cassandra, MongoDB)
LAYER 4
• Data processing software such as MapReduce, Hive, Spark
• Processing in scheduled batches, in real time, or hybrid
LAYER 5
• Data integration layer
• Data usage for reports, visualization, knowledge discovery.
• Export of datasets to cloud, web, etc.
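The five layers can be read as stages of a single pipeline, which the sketch below mimics in plain Python. Layer 4 is illustrated by a tiny in-process map-then-reduce word count in the style of MapReduce; the function names are illustrative and not part of any Big Data framework, and an in-memory list stands in for HDFS or a NoSQL store.

```python
from collections import Counter

def ingest(sources):
    # Layers 1-2: identify sources and ingest their records into one stream.
    return [line for src in sources for line in src]

def store(records):
    # Layer 3: persist the data (here, an in-memory list standing in for
    # HDFS or a NoSQL store such as HBase, Cassandra or MongoDB).
    return list(records)

def process(records):
    # Layer 4: map (tokenise each record into words), then reduce (count
    # occurrences), the same shape of computation MapReduce performs.
    mapped = (word for rec in records for word in rec.split())
    return Counter(mapped)

def consume(result, top=2):
    # Layer 5: consumption - report the most frequent terms.
    return result.most_common(top)

sources = [["big data needs new tools"], ["new tools process big data"]]
report = consume(process(store(ingest(sources))))
print(report)
```

Each function corresponds to one or two layers, and the composed call at the end shows the flow of data from sources through storage and processing to consumption.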
• Data management means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.
• Data management functions include:
1. Data assets creation, maintenance and protection
2. Data governance, which includes establishing the processes for ensuring the availability, usability,
integrity, security and high-quality of data. The processes enable trustworthy data availability for analytics,
followed by the decision making at the enterprise.
3. Data architecture creation, modelling and analysis
4. Database maintenance, administration and management. For example, an RDBMS (relational database management system) or a NoSQL data store
5. Managing data security, access control, deletion and privacy
6. Managing the data quality
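Function 6, managing data quality, typically means checking records for completeness, uniqueness and validity before they reach analytics. The sketch below is a minimal illustration; the rule names, required fields and age range are assumptions for the example, not a standard.

```python
REQUIRED = {"id", "name"}  # assumed mandatory fields for this example

def quality_report(records):
    """Count completeness, uniqueness and validity violations."""
    issues = {"incomplete": 0, "duplicate_id": 0, "invalid_age": 0}
    seen_ids = set()
    for rec in records:
        if not REQUIRED <= rec.keys():      # completeness: required fields present
            issues["incomplete"] += 1
        rid = rec.get("id")
        if rid in seen_ids:                 # uniqueness: no repeated primary key
            issues["duplicate_id"] += 1
        seen_ids.add(rid)
        age = rec.get("age")
        if age is not None and not 0 <= age <= 130:  # validity: value in range
            issues["invalid_age"] += 1
    return issues

records = [
    {"id": 1, "name": "Asha", "age": 34},
    {"id": 1, "name": "Ravi", "age": 29},   # duplicate id
    {"id": 2, "age": 200},                  # missing name, out-of-range age
]
print(quality_report(records))
```

A report like this feeds the governance processes above: only data that passes such checks is made available as trustworthy input for analytics and decision making.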
• A stack consists of a set of software components and data store units. Applications, machine-
learning algorithms, analytics and visualization tools use Big Data Stack (BDS) at a cloud service, such
as Amazon EC2, Azure or private cloud. The stack uses cluster of high performance machines.
• The Berkeley Data Analytics Stack (BDAS) consists of data processing, data management and resource management layers. The following list describes these:
1. Applications: the data processing software component provides in-memory processing, which processes data efficiently across the frameworks.
2. Data processing combines batch, streaming and interactive computations.
3. Resource management software component provides for sharing the infrastructure across various
frameworks.
Figure 1.10 shows a four-layer architecture for the Big Data Stack, which consists of Hadoop, MapReduce, Spark core and Spark SQL, Streaming, R, GraphX, MLlib, Mahout, Arrow and Kafka.
• Big Data drives digital transformation by enabling the prediction of trends in datasets that go far beyond the capabilities of legacy analytic tools in terms of volume, velocity, variety and variability.
• Organizations require a process management framework to reap the benefits of big data analytics by
ensuring that different functional groups and roles within an organization interplay with each other with
the appropriate processes, purposes and outcomes.
• A new international standard, ISO/IEC 24668: Process Management Framework for Big Data
Analytics, provides practical guidance, based on best practices, on managing and overseeing big data
analytics.
• It describes processes for the acquisition, description, storage and processing of data, irrespective of the
industry or sector in which the organization operates.
• ISO/IEC 24668 takes the various process categories into account, along with their interconnectivities.
These process categories include organization stakeholder processes, competency development
processes, data management processes, analytics development processes and technology integration
processes.
Big Data Analytics standards
• This framework can be used not only for managing processes but also for enabling risk determination
and process improvements. It will help organizations to develop competitive advantages, as well as to
improve sales and customer experiences.
The following are the five application areas, in order of popularity of Big Data use cases:
• Customer value analytics (CVA) using the inputs of evaluated purchase patterns, preferences, quality, price and post-sales servicing requirements
• Operational analytics for optimizing company operations
• Fraud detection and compliance
• New products and innovations in service
• Enterprise data warehouse optimization.
2. Big Data and Healthcare
3. Big Data in Medicine
4. Big Data in Advertising
5. Big Data in Sports
6. Big Data for Real-Time Inventory Management
7. Big Data in Finance
8. Big Data in Education