B.M.S. College of Engineering
UNIT 1
Introduction to Big Data Analytics

Prof. Lavanya Naik


Need of Big Data

• The rise in technology has led to the production and storage of voluminous amounts of data.
• Earlier, data was measured in megabytes; nowadays petabytes are processed and analyzed to discover new facts and generate new knowledge.
• Conventional systems for storage, processing and analysis face challenges from the rapid growth in the volume of data, the variety of its forms and formats, increasing complexity, the faster generation of data and the need to process, analyze and use it quickly.

Note: As size and complexity increase, the proportion of unstructured data types also increases.

• An example of a traditional tool for structured data storage and querying is an RDBMS.
• Big Data requires new tools for processing and analyzing large volumes of data, for example, NoSQL (Not Only SQL) data stores or Hadoop-compatible systems for unstructured data.



Selected key terms and their meanings
• Application means application software or a collection of software components. For example, software for
acquiring, storing, visualizing and analyzing data. An application performs a group of coordinated activities,
functions and tasks.
• Application Programming Interface (API) refers to a software component which enables a user to access an
application, service or software that runs on a local or remote computing platform.
• Data Model refers to a map or schema, which represents the inherent properties of the data.
• Data Repository refers to a collection of data.
• Data Store refers to a data repository of a set of objects.
• Distributed Data Store refers to a data store distributed over multiple nodes. Apache Cassandra is one
example of a distributed data store.

• Database (DB) refers to a grouping of tables for the collection of data.
• Table refers to a presentation which consists of row fields and column fields.
• Flat File means a file in which data cannot be picked from in between; it must be read from the beginning to be interpreted.
• Flat File Database refers to a database in which each record is in a separate row unrelated to each other.
• Name-Value Pair refers to a construct in which a field consists of a name and the corresponding value after it, as sketched after this list. For example, the name-value pairs date, "Oct. 20, 2018" and chocolates_sold, 178.
• Hash Key-Value Pair refers to the construct in which a hash function computes a key for indexing and search,
and distributing the entries (key/value pairs) across an array of slots (also called buckets).
• Stream Analytics refers to a method of computing continuously, i.e., computing on the data even as events take place and the data flows through the system.
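
The following is a minimal sketch, in plain Python, of the last three constructs above: name-value pairs, hash key-value pairs and stream analytics. Only the field names (date, chocolates_sold) come from the slide's example; the bucket count and the running-total computation are illustrative assumptions.

# Name-value pairs: each field is a name followed by its value.
record = {"date": "Oct. 20, 2018", "chocolates_sold": 178}

# Hash key-value pairs: a hash function computes a key that picks one
# of an array of slots (buckets) for indexing, search and distribution.
buckets = [[] for _ in range(8)]

def put(key, value):
    slot = hash(key) % len(buckets)   # the hash function picks the bucket
    buckets[slot].append((key, value))

put("date", "Oct. 20, 2018")
put("chocolates_sold", 178)

# Stream analytics: compute continuously as events flow through the
# system; here, a running total is updated per event as it arrives.
def running_total(events):
    total = 0
    for qty in events:
        total += qty
        yield total                   # a result is available while data flows

print(list(running_total([178, 40, 12])))   # [178, 218, 230]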

• Database Maintenance (DBM) refers to a set of tasks which improves a database. DBM uses functions for
improving performance (such as by query planning and optimization), freeing-up storage space,
updating internal statistics, checking data errors and hardware faults.
• Database Administration (DBA) refers to the function of managing and maintaining Database Management System (DBMS) software regularly, such as installation, configuration, database design, implementation, upgrading and evaluation of database features.
• Data Warehouse refers to sharable data, data stores and databases in an enterprise. It consists of integrated,
subject oriented (such as finance, human resources and business) and non-volatile data stores, which update
regularly.
• Data Mart is a subset of data warehouse.

Big Data
• Data
Information, usually in the form of facts or statistics, that one can analyze or use for further calculations.
• Web Data
Web data is the data present on web servers (or enterprise servers) in the form of text, images, videos, audios and multimedia files for web users.
Some examples of web data are Wikipedia, Google Maps and YouTube.

Classification of Data
1. Structured Data
• Structured data conform to and associate with data schemas and data models.
• Structured data are found in tables (rows and columns).
• Structured data enable the following (see the sketch below):
i. Data insert, delete, update and append
ii. Indexing, to enable faster data retrieval
iii. Scalability, which enables increasing or decreasing capacities and data processing operations such as storing, processing and analytics
iv. Transaction processing, which follows ACID rules (Atomicity, Consistency, Isolation and Durability)
v. Encryption and decryption for data security
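
As a minimal sketch of features (i)-(iv), the snippet below uses Python's built-in sqlite3 RDBMS; the sales table and its rows are illustrative assumptions, not part of the slides.

import sqlite3

conn = sqlite3.connect(":memory:")   # an in-memory relational database
cur = conn.cursor()
cur.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")

# (i) insert, update (delete and append work the same way)
cur.execute("INSERT INTO sales (item, qty) VALUES (?, ?)", ("chocolates", 178))
cur.execute("UPDATE sales SET qty = 180 WHERE item = ?", ("chocolates",))

# (ii) indexing to enable faster data retrieval
cur.execute("CREATE INDEX idx_item ON sales(item)")

# (iv) an ACID transaction: both inserts commit together or not at all
with conn:   # commits on success, rolls back on an exception
    conn.execute("INSERT INTO sales (item, qty) VALUES (?, ?)", ("candies", 40))
    conn.execute("INSERT INTO sales (item, qty) VALUES (?, ?)", ("mints", 12))

print(cur.execute("SELECT item, qty FROM sales").fetchall())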

2. Semi-Structured Data
• Examples of semi-structured data are XML and JSON documents.
• Semi-structured data contain tags or other markers, which separate elements and enforce hierarchies of
records and fields within the data.
• Semi-structured data do not conform to and associate with formal data model structures.
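
A minimal sketch using Python's standard library, parsing the same illustrative record as a JSON and as an XML document; the tags and markers supply the hierarchy of records and fields.

import json
import xml.etree.ElementTree as ET

json_doc = '{"sale": {"date": "Oct. 20, 2018", "chocolates_sold": 178}}'
sale = json.loads(json_doc)["sale"]
print(sale["chocolates_sold"])             # 178

xml_doc = "<sale><date>Oct. 20, 2018</date><chocolates_sold>178</chocolates_sold></sale>"
root = ET.fromstring(xml_doc)
print(root.find("chocolates_sold").text)   # '178'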

3. Multi-Structured Data
• Multi-structured data refers to data consisting of multiple formats of data, viz. structured, semi-structured and/or
unstructured data.
• For example, streaming data on customer interactions, data of multiple sensors, data at web or enterprise
server or the data- warehouse data in multiple formats.

4. Unstructured Data
• Unstructured data do not possess data features such as a table or a database.
• Unstructured data are found in file types such as .TXT and .CSV. Data may be stored as key-value pairs, such as hash key-value pairs.
• Mobile data: Text messages, chat messages, tweets, blogs and comments
• Website content data: YouTube videos, browsing data, e-payments

Big Data Definition
Big Data is high-volume, high-velocity and/or high-variety information assets that require new forms of processing for enhanced decision making, insight discovery and process optimization.
Big Data Characteristics
Characteristics of Big Data, called the 3Vs (a 4th V is also used), are:
1. Volume - The phrase 'Big Data' contains the term big, which relates to the size of the data, and hence this characteristic.
2. Velocity - The term velocity refers to the speed of generation of data. Velocity is a measure of how fast the data is generated and processed.
3. Variety - Big Data comprises a variety of data. Data is generated from multiple sources in a system. This introduces variety in the data and therefore 'complexity'. The variety is due to the availability of a large number of heterogeneous platforms in the industry.
4. Veracity - Veracity refers to inconsistencies and uncertainty in data: the available data can sometimes get messy, and its quality and accuracy are difficult to control. Example: data in bulk could create confusion, whereas too little data could convey only half or incomplete information.
Phases in Analytics


1. Business Problem Definition - In this stage, the problem is identified, and assumptions are made about how much potential gain the company will make after carrying out the analysis.
2. Data Definition - Find the appropriate datasets to work with. Depending on the business case and the scope of analysis of the project being addressed, the sources of datasets can be either external or internal to the company.
3. Data Acquisition and Filtration - Once the sources of data are identified, the data is gathered from those sources.
4. Data Extraction - Some of the entries in the data might be incompatible with the analysis. To rectify this, a separate data extraction phase is created, in which the data that do not match the underlying scope of the analysis are extracted and transformed into a compatible form.
5. Data Munging - The data might have constraints or unsuitable values that can lead to false results. Hence there is a need to clean and validate the data (see the sketch below).
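
A minimal data-munging sketch for phase 5, assuming pandas is installed; the dataset and the validation rule (non-negative amounts) are illustrative assumptions.

import pandas as pd

raw = pd.DataFrame({
    "customer": ["A", "B", "B", None],
    "amount":   [120.0, -5.0, -5.0, 60.0],
})

clean = (raw
         .dropna(subset=["customer"])   # remove incomplete records
         .drop_duplicates()             # remove repeated entries
         .query("amount >= 0"))         # validate against a constraint

print(clean)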

6. Data Aggregation & Representation - The data is cleansed and validated against rules set by the enterprise. But the data might be spread across multiple datasets, and it is not advisable to work with multiple datasets. Hence, the datasets are joined together.
7. Exploratory Data Analysis - Depending on the nature of the big data problem, analysis is carried out.
8. Data Visualization - We now have answers to some questions, using the information from the data in the datasets, but these answers are still in a form that cannot be presented to business users. Some representation is required to obtain value or conclusions from the analysis. Hence, various tools are used to visualize the data in graphic form, which business users can easily interpret (see the sketch below).
9. Utilization of Analysis Results - It is now time for business users to make decisions utilizing the results, which can be used for optimization and to refine the business process.
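
A minimal sketch of phases 6-8, again assuming pandas is installed; the two datasets and the join key (region) are illustrative assumptions.

import pandas as pd

sales = pd.DataFrame({"region": ["N", "S", "N"], "amount": [100, 80, 40]})
targets = pd.DataFrame({"region": ["N", "S"], "target": [120, 90]})

# Phase 6: join the separate datasets into one
merged = sales.merge(targets, on="region")

# Phase 7: exploratory summary of the joined data
summary = merged.groupby("region").agg(total=("amount", "sum"),
                                       target=("target", "first"))
print(summary)

# Phase 8: a chart a business user can read (uncomment if matplotlib is installed)
# summary.plot.bar()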

• The analytics process typically goes through several key phases, each building upon the previous one to derive
insights and drive decision-making.
• These phases help transform raw data into actionable information. Below are the main phases in analytics:
1. Descriptive Analytics:
Purpose: To describe or summarize historical data and identify patterns or trends.
Key Activities: Data aggregation and data mining.
Reporting and visualization of past events (e.g., dashboards, charts).
Example: A retail company examining last quarter’s sales figures to understand overall performance.
Outcome: Insights about what has happened.
2. Diagnostic Analytics:
Purpose: To determine the reasons behind past events and patterns.
Key Activities: Drill-down and data discovery techniques.
Statistical analysis to find correlations, anomalies, and outliers.

Example: Identifying why a particular product experienced a sales drop by analyzing customer segments and
competitor activities.
Outcome: Understanding why something happened.
3. Predictive Analytics:
Purpose: To predict future outcomes based on historical data and patterns.
Key Activities: Data modeling and machine learning.
Forecasting future trends using statistical models, regression analysis, and algorithms.
Example: Forecasting future customer demand for a product using past sales data and external factors like market
trends.
Outcome: Insights into what might happen.
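
A minimal predictive-analytics sketch, assuming NumPy is installed; the quarterly sales series is illustrative. A straight-line regression model is fit to past quarters and extrapolated to forecast the next one.

import numpy as np

quarters = np.array([1, 2, 3, 4])
sales = np.array([100.0, 110.0, 125.0, 140.0])

slope, intercept = np.polyfit(quarters, sales, deg=1)   # fit y = a*x + b
forecast = slope * 5 + intercept                        # predict quarter 5
print(f"Forecast for quarter 5: {forecast:.1f}")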

4. Prescriptive Analytics:
Purpose: To suggest actions or decisions that should be taken to achieve a desired outcome.
Key Activities: Optimization algorithms, decision trees, and simulations.
Generating recommendations based on predictive models.
Example: Recommending optimal pricing strategies based on market demand predictions.
Outcome: Guidance on what actions to take to achieve specific objectives.
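
A minimal prescriptive-analytics sketch in plain Python; the linear demand model and its parameters are assumptions for illustration, standing in for a predictive model's output.

def demand(price):
    return max(0.0, 500 - 10 * price)    # predicted units sold at a price

def revenue(price):
    return price * demand(price)

# Search candidate prices and recommend the one that maximizes revenue.
candidates = [p / 2 for p in range(1, 101)]             # 0.5 ... 50.0
best = max(candidates, key=revenue)
print(f"Recommended price: {best:.2f}, revenue: {revenue(best):.0f}")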
5. Cognitive Analytics:
Purpose: To mimic human-like reasoning and decision-making by learning from data.
Key Activities: Artificial Intelligence (AI) techniques, such as natural language processing (NLP) and machine
learning.
Understanding unstructured data (text, voice, images).
Example: Chatbots understanding and responding to customer inquiries, or AI systems making medical diagnoses
based on patient data.
Outcome: A system that learns and provides deeper insights, often for more complex, unstructured problems.
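
A minimal sketch of a chatbot-style intent match in plain Python; real cognitive analytics would use NLP and machine-learning libraries, so the keyword table and replies here are purely illustrative.

INTENTS = {
    "refund":   "I can help you start a refund request.",
    "delivery": "Your order status is available under 'My Orders'.",
}

def reply(message):
    text = message.lower()
    for keyword, answer in INTENTS.items():
        if keyword in text:              # crude stand-in for NLP understanding
            return answer
    return "Sorry, could you rephrase that?"

print(reply("Where is my delivery?"))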
Designing Data Architecture

• Big Data architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment.
• The architecture logically defines how the Big Data solution will work, the core components used (hardware, database, software, storage), the flow of information, security and more.
• Characteristics of Big Data make designing Big Data architecture a complex process.
• The requirements for offering competing products at lower costs in the market make the designing task
more challenging for a Big Data architect.
• Five vertically aligned textboxes on the left of Figure 1.2 show the layers. Horizontal textboxes show the
functions in each layer.


Figure 1.2 Design of logical layers in a data processing architecture, and functions in the layers


LAYER 1
• Considers the amount of data needed at the ingestion layer (L2), and whether data is pushed from L1 or pulled by L2 as per the mechanisms for the usage
• Source data types: databases, files, internal or external sources
• Source formats: semi-structured, structured or unstructured
LAYER 2
• Considers ingestion and ETL processes, either in real time, which means storing and using the data as generated, or in batches
LAYER 3
• Data storage type, format, compression, incoming data frequency, querying patterns
• Data storage using HDFS or NoSQL data stores - HBase, Cassandra, MongoDB


LAYER 4
• Data processing software such as MapReduce (sketched below), Hive, Spark
• Processing in scheduled batches, in real time, or hybrid
LAYER 5
• Data integration layer
• Data usage for reports, visualization, knowledge discovery
• Export of datasets to the cloud, web, etc.
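
A minimal sketch of the MapReduce model named in Layer 4, written in plain Python for illustration (a real deployment would run on Hadoop or Spark); the input documents are made up.

from collections import defaultdict

documents = ["big data needs new tools", "big data big insights"]

# Map: emit (word, 1) pairs from each input split
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group into a final (word, total) pair
counts = {word: sum(values) for word, values in groups.items()}
print(counts)   # {'big': 3, 'data': 2, ...}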

Managing Data for Analysis

• Managing data means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.
• Data management functions include:
1. Data assets creation, maintenance and protection
2. Data governance, which includes establishing the processes for ensuring the availability, usability,
integrity, security and high-quality of data. The processes enable trustworthy data availability for analytics,
followed by the decision making at the enterprise.
3. Data architecture creation, modelling and analysis
4. Database maintenance, administration and management system. For example, RDBMS (relational
database management system), NoSQL
5. Managing data security: data access control, deletion and privacy
6. Managing the data quality

7. Data collection using the ETL process (see the sketch after this list)
8. Managing documents, records and contents
9. Creation of reference and master data, and data control and supervision
10. Data and application integration
11. Integrated data management, enterprise-ready data creation, fast access and analysis,
automation and simplification of operations on the data
12. Data warehouse management
13. Maintenance of business intelligence
14. Data mining and analytics algorithms.
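
A minimal sketch of the ETL process named in item 7, in plain Python; the source records and the in-memory warehouse stand-in are illustrative assumptions.

import json

source = ['{"item": "chocolates", "qty": "178"}',
          '{"item": "candies", "qty": "40"}']

def extract(rows):
    return [json.loads(r) for r in rows]       # Extract: read the raw records

def transform(records):
    return [{"item": r["item"], "qty": int(r["qty"])}   # Transform: fix types
            for r in records]

warehouse = []                                 # Load target (a stand-in store)

def load(records):
    warehouse.extend(records)

load(transform(extract(source)))
print(warehouse)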

Big Data Stack

• A stack consists of a set of software components and data store units. Applications, machine-learning algorithms, analytics and visualization tools use a Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud. The stack uses a cluster of high-performance machines.

Berkeley Data Analytics Stack (BDAS)


• The importance of Big Data lies in what one does with it, rather than in how big or large it is.
• Identify whether the gathered data can help in obtaining the following findings:
1) cost reduction
2) time reduction
3) new product planning and development
4) smart decision making using predictive analytics
5) knowledge discovery.


• The Berkeley Data Analytics Stack (BDAS) consists of data processing, data management and resource management layers. The following lists these:
1. Applications: the data processing software component provides in-memory processing, which processes the data efficiently across the frameworks.
2. Data processing combines batch, streaming and interactive computations.
3. Resource management: this software component provides for sharing the infrastructure across various frameworks.
Figure 1.10 shows a four-layer architecture for the Big Data Stack that consists of Hadoop, MapReduce, Spark Core and Spark SQL, Streaming, R, GraphX, MLlib, Mahout, Arrow and Kafka.
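
As a minimal sketch of the Spark core/Spark SQL layer, the snippet below assumes PySpark is installed and a local Spark runtime is available; the sales rows are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdas-sketch").getOrCreate()

# A distributed dataset exposed to Spark SQL as a temporary view
df = spark.createDataFrame([("chocolates", 178), ("candies", 40)],
                           ["item", "qty"])
df.createOrReplaceTempView("sales")

# Interactive Spark SQL query over the dataset
spark.sql("SELECT item, qty FROM sales WHERE qty > 100").show()
spark.stop()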



Big Data Analytics standards

• Big data drives digital transformation by enabling the prediction of trends in datasets that go far beyond the capabilities of legacy analytic tools in terms of volume, velocity, variety and variability.
• Organizations require a process management framework to reap the benefits of big data analytics by
ensuring that different functional groups and roles within an organization interplay with each other with
the appropriate processes, purposes and outcomes.
• A new international standard, ISO/IEC 24668: Process Management Framework for Big Data
Analytics, provides practical guidance, based on best practices, on managing and overseeing big data
analytics.
• It describes processes for the acquisition, description, storage and processing of data, irrespective of the
industry or sector in which the organization operates.
• ISO/IEC 24668 takes the various process categories into account, along with their interconnectivities. These process categories include organization stakeholder processes, competency development processes, data management processes, analytics development processes and technology integration processes.

• This framework can be used not only for managing processes but also for enabling risk determination
and process improvements. It will help organizations to develop competitive advantages, as well as to
improve sales and customer experiences.

Case study on Business Analytics

1. Big Data in Marketing and Sales


• Data are important for most aspects of marketing, sales and advertising.
• Customer Value (CV) depends on three factors - quality, service and price.
• Big Data analytics deploys large volumes of data to identify and derive intelligence about individuals using predictive models.
• The facts enable marketing companies to decide what products to sell.
• Customer (desired) value means what a customer desires from a product.
• Customer (perceived) value means what the customer believes to have received from a product after
purchase of the product.
• Customer value analytics (CVA) means analyzing what a customer really needs.
• CVA makes it possible for leading marketers, such as Amazon, to deliver consistent customer experiences.


Following are the five application areas in order of the popularity of Big Data use cases:
• CVA using the inputs of evaluated purchase patterns, preferences, quality, price and post sales
servicing requirements
• Operational analytics for optimizing company operations
• Detection of frauds and compliances
• New products and innovations in service
• Enterprise data warehouse optimization.
2. Big Data and Healthcare
3. Big Data in Medicine
4. Big Data in Advertising
5. Big Data in Sports
6. Big Data for real-time inventory management
7. Big Data in Finance
8. Big Data in Education
