Unit - 1 (Big Data)
1. Structured Data:
Structured data is created using a fixed schema and is maintained in a tabular
format. Each element in structured data is individually addressable, which makes
the data effective to analyze. It covers all data that can be stored in a SQL
database as tables of rows and columns. Because tables are such a simple way to
manage information, a large share of data is still produced and processed in
this form.
Examples –
Relational data, geo-location data, credit card numbers, addresses, etc.
Consider relational data as an example: suppose a university has to maintain a
record of its students, including each student's name, ID, address, and email.
A relational schema and table such as the following can store these records.
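Since the original schema figure is not reproduced here, the following is a
minimal sketch of such a table using Python's built-in sqlite3 module (the
table and column names are assumptions for illustration):

    import sqlite3

    # In-memory database illustrating a fixed, tabular schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE student (
            student_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            address    TEXT,
            email      TEXT
        )
    """)
    conn.execute(
        "INSERT INTO student (student_id, name, address, email) VALUES (?, ?, ?, ?)",
        (1, "A. Kumar", "Hostel Block B", "a.kumar@university.example"),
    )
    # Every row conforms to the same schema, so each element is addressable.
    for row in conn.execute("SELECT student_id, name, email FROM student"):
        print(row)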
2. Unstructured Data:
Unstructured data is data that does not follow a pre-defined schema or any
organized format. This kind of data does not fit neatly into tables, which
makes it hard to store and analyze with traditional relational tools.
Examples – text documents, PDFs, images, audio, and video.
3. Semi-Structured Data:
Semi-structured data is information that does not reside in a relational
database but has some organizational properties (such as tags or key-value
pairs) that make it easier to analyze. With some processing you can store it
in a relational database, though this is very hard for some kinds of
semi-structured data; the format is kept because it stores such information
more compactly than forcing it into rigid tables.
Examples – XML, JSON.
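As a small illustration, the same student record can be carried as XML or JSON,
where tags and keys supply the organizational properties without a fixed
relational schema (the element and field names are assumptions):

    import json
    import xml.etree.ElementTree as ET

    # The record as XML: tags give it structure, but no schema is enforced.
    xml_record = "<student><id>1</id><name>A. Kumar</name></student>"
    root = ET.fromstring(xml_record)
    print(root.find("name").text)  # A. Kumar

    # JSON is another common semi-structured format; fields may vary per record.
    json_record = '{"id": 1, "name": "A. Kumar", "phone": "+91-0000000000"}'
    print(json.loads(json_record)["phone"])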
Most big data architectures include some or all of the following components:
Data sources: All big data solutions start with one or more data sources. Examples
include:
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.
Data storage: Data for batch processing operations is typically stored in a
distributed file store that can hold high volumes of large files in various formats.
This kind of store is often called a data lake. Options for implementing this storage
include Azure Data Lake Store or blob containers in Azure Storage.
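As a hedged sketch of working with such a store, the azure-storage-blob Python
SDK can list the files landed in a blob container (the connection string and
container name below are placeholders):

    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string and container name; substitute real values.
    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    container = service.get_container_client("datalake-raw")

    # Enumerate the raw files landed in the container for later batch processing.
    for blob in container.list_blobs():
        print(blob.name, blob.size)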
Batch processing: Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
otherwise prepare the data for analysis. Usually these jobs involve reading source
files, processing them, and writing the output to new files. Options include running
U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce
jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in
an HDInsight Spark cluster.
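The following is a minimal PySpark sketch of such a batch job, assuming a Spark
environment is available (the input and output paths and column names are
illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("batch-prep").getOrCreate()

    # Read raw source files, filter and aggregate, then write the prepared
    # result as new files for downstream analysis.
    logs = spark.read.json("/data/raw/weblogs/")
    daily_errors = (
        logs.filter(F.col("status") >= 500)
            .groupBy("date")
            .agg(F.count("*").alias("error_count"))
    )
    daily_errors.write.mode("overwrite").parquet("/data/prepared/daily_errors/")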
Real-time message ingestion: If the solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for
stream processing. This might be a simple data store, where incoming messages
are dropped into a folder for processing. However, many solutions need a message
ingestion store to act as a buffer for messages, and to support scale-out
processing, reliable delivery, and other message queuing semantics. Options
include Azure Event Hubs, Azure IoT Hub, and Kafka.
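On the producing side, a sketch using the kafka-python package might buffer
device readings in a Kafka topic (the broker address, topic name, and message
fields are assumptions):

    import json
    from kafka import KafkaProducer

    # Broker address and topic name are placeholders for illustration.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Each IoT reading is buffered in the topic until stream processors consume it.
    producer.send("iot-readings", {"device_id": "sensor-42", "temp_c": 21.7})
    producer.flush()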
Stream processing: After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data for analysis. The
processed stream data is then written to an output sink. Azure Stream Analytics
provides a managed stream processing service based on perpetually running SQL
queries that operate on unbounded streams. You can also use open source Apache
streaming technologies like Spark Streaming in an HDInsight cluster.
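Below is a minimal Spark Structured Streaming sketch in Python, assuming a
Spark environment (the socket source, port, and one-minute window are purely
illustrative), that filters and aggregates an unbounded stream and writes it
to a sink:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stream-prep").getOrCreate()

    # Read an unbounded stream (a local socket here, purely for illustration).
    lines = (
        spark.readStream.format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load()
    )

    # Filter and aggregate: count non-empty lines per one-minute window.
    events = lines.withColumn("ts", F.current_timestamp())
    counts = (
        events.filter(F.length("value") > 0)
              .groupBy(F.window("ts", "1 minute"))
              .count()
    )

    # Write the processed stream to an output sink (the console, for the sketch).
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()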
Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using
analytical tools. The analytical data store used to serve these queries can be a
Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a
low-latency NoSQL technology such as HBase, or an interactive Hive database that
provides a metadata abstraction over data files in the distributed data store. Azure
Synapse Analytics provides a managed service for large-scale, cloud-based data
warehousing.
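The "metadata abstraction over data files" idea can be sketched with Spark SQL,
assuming a Spark build with Hive support: the table below is only metadata over
Parquet files that stay in the distributed store (the path and names reuse the
hypothetical batch output above):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("serving-layer")
        .enableHiveSupport()
        .getOrCreate()
    )

    # The table is only metadata; the Parquet files in the data lake stay put.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS daily_errors
        USING PARQUET
        LOCATION '/data/prepared/daily_errors/'
    """)

    # Analytical tools can now query the files through the table abstraction.
    spark.sql(
        "SELECT date, error_count FROM daily_errors ORDER BY error_count DESC"
    ).show()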
(3) Big Data Auditing and Protection: With big data and analytics, financial
reporting risks, fraud, and operational business risks can be identified,
detected, and examined more efficiently and effectively.
Big data privacy involves properly managing big data to minimize risk and
protect sensitive data. Because big data comprises large and complex data
sets, many traditional privacy processes cannot handle the required scale and
velocity.
Classification of analytics
1) Descriptive analytics
Descriptive analytics is a statistical method that is used to search and
summarize historical data in order to identify patterns or meaning.
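A tiny descriptive sketch with pandas (the revenue figures are made up),
summarizing historical data to surface patterns:

    import pandas as pd

    # Hypothetical historical sales data.
    sales = pd.DataFrame({
        "month": ["Jan", "Feb", "Mar", "Apr"],
        "revenue": [120, 135, 128, 160],
    })

    # Descriptive analytics: summarize what already happened.
    print(sales["revenue"].describe())  # count, mean, std, min, max, ...
    print(sales["revenue"].diff())      # month-over-month change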
2) Predictive analytics
Predictive Analytics is a statistical method that utilizes algorithms and
machine learning to identify trends in data and predict future
behaviors.
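A minimal predictive sketch with scikit-learn (the training data is
fabricated): fit a trend on past months and forecast the next one:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical history: month index -> revenue.
    X = np.array([[1], [2], [3], [4]])
    y = np.array([120, 135, 128, 160])

    # Predictive analytics: learn the trend, then forecast future behavior.
    model = LinearRegression().fit(X, y)
    print(model.predict(np.array([[5]])))  # forecast for month 5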
3) Prescriptive analytics
Prescriptive analytics is a statistical method used to generate
recommendations and make decisions based on the computational
findings of algorithmic models.
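A toy prescriptive step (the revenue target is invented): turn a model's
forecast into a concrete recommendation:

    def recommend(forecast_revenue: float, target: float = 150.0) -> str:
        # Prescriptive analytics: turn a computational finding into an action.
        if forecast_revenue >= target:
            return "Maintain current marketing spend."
        shortfall = target - forecast_revenue
        return f"Increase marketing spend; forecast misses target by {shortfall:.1f}."

    print(recommend(142.3))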