
United Institute of Management

Department of Computer Application


Subject Name: Big Data                Subject Code: KCA022 (Elective-2)

Unit I (8 hours)
Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction to Big Data platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of Big Data, Big Data technology components, Big Data importance and applications, Big Data features – security, compliance, auditing and protection, Big Data privacy and ethics, Big Data Analytics, challenges of conventional systems, intelligent data analysis, nature of data, analytic processes and tools, analysis vs reporting, modern data analytic tools.

Unit II (8 hours)
Hadoop: History of Hadoop, Apache Hadoop, the Hadoop Distributed File System, components of Hadoop, data format, analyzing data with Hadoop, scaling out, Hadoop streaming, Hadoop pipes, Hadoop ecosystem.
Map-Reduce: Map-Reduce framework and basics, how MapReduce works, developing a MapReduce application, unit tests with MRUnit, test data and local tests, anatomy of a MapReduce job run, failures, job scheduling, shuffle and sort, task execution, MapReduce types, input formats, output formats, MapReduce features, real-world MapReduce.

Unit III (8 hours)
HDFS (Hadoop Distributed File System): Design of HDFS, HDFS concepts, benefits and challenges, file sizes, block sizes and block abstraction in HDFS, data replication, how HDFS stores, reads, and writes files, Java interfaces to HDFS, command-line interface, Hadoop file system interfaces, data flow, data ingest with Flume and Sqoop, Hadoop archives, Hadoop I/O: compression, serialization, Avro and file-based data structures.
Hadoop Environment: Setting up a Hadoop cluster, cluster specification, cluster setup and installation, Hadoop configuration, security in Hadoop, administering Hadoop, HDFS monitoring and maintenance, Hadoop benchmarks, Hadoop in the cloud.

Unit IV (8 hours)
Hadoop Ecosystem and YARN: Hadoop ecosystem components, schedulers, fair and capacity, Hadoop 2.0 new features – NameNode high availability, HDFS federation, MRv2, YARN, running MRv1 in YARN.
NoSQL Databases: Introduction to NoSQL. MongoDB: Introduction, data types, creating, updating and deleting documents, querying, introduction to indexing, capped collections.
Spark: Installing Spark, Spark applications, jobs, stages and tasks, Resilient Distributed Datasets, anatomy of a Spark job run, Spark on YARN.
Scala: Introduction, classes and objects, basic types and operators, built-in control structures, functions and closures, inheritance.

Unit V (8 hours)
Hadoop Ecosystem Frameworks: Applications on Big Data using Pig, Hive and HBase.
Pig: Introduction to Pig, execution modes of Pig, comparison of Pig with databases, Grunt, Pig Latin, user-defined functions, data processing operators.
Hive: Apache Hive architecture and installation, Hive shell, Hive services, Hive metastore, comparison with traditional databases, HiveQL, tables, querying data and user-defined functions, sorting and aggregating, MapReduce scripts, joins and subqueries.
HBase: HBase concepts, clients, example, HBase vs RDBMS, advanced usage, schema design, advanced indexing.
Zookeeper: How it helps in monitoring a cluster, how to build applications with Zookeeper. IBM Big Data strategy, introduction to InfoSphere, BigInsights and BigSheets, introduction to Big SQL.

INDEX
United Institute of Management
Types of digital data
History of Big Data innovation
Big Data architecture and characteristics
Big Data Tools and Techniques
Big Data features – security, compliance, auditing and protection
Big Data privacy and ethics
Big Data Analytics
Key components of analytics
Big Data Analytics Process
Challenges of conventional systems
ANALYSIS vs REPORTING
MODERN DATA ANALYTIC TOOLS
Key Points



UNIT 1

Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction
to Big Data platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of
Big Data, Big Data technology components, Big Data importance and applications, Big Data
features – security, compliance, auditing and protection, Big Data privacy and ethics, Big
Data Analytics, Challenges of conventional systems, intelligent data analysis, nature of data,
analytic processes and tools, analysis vs reporting, modern data analytic tools.

Types of digital data


Digital data can be categorized into several types based on its characteristics and sources. These types are often classified into three main categories: structured data, semi-structured data, and unstructured data. Here's a breakdown of these types:

Structured Data:

Structured data is highly organized and follows a specific format or schema.


It is typically stored in relational databases or spreadsheets.
Examples include:
Numerical data: Numbers, dates, currency values.
Text data: Names, addresses, product IDs.
Categorical data: Labels, codes, categories.
Structured data is easy to analyze and query, making it a fundamental component of many
traditional data analytics processes.

Semi-Structured Data:
Semi-structured data is less organized than structured data but still has some level of
structure.
It may not conform to a rigid schema but often includes tags, labels, or hierarchies that
provide some meaning.
Examples include:
JSON (JavaScript Object Notation) data.
XML (eXtensible Markup Language) data.
NoSQL databases like MongoDB, which store data in semi-structured formats.
Semi-structured data is commonly used in web applications and data interchange formats.
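To make the idea concrete, the short Python sketch below (with purely illustrative field names and values) builds two JSON records of the same logical type; one carries an optional field the other lacks, which is exactly the kind of loose, tag-based structure that makes JSON semi-structured.

import json

# Two records of the same logical type; the second omits an optional field.
orders = [
    {"order_id": 1, "customer": "Alice", "items": ["laptop"], "coupon": "SAVE10"},
    {"order_id": 2, "customer": "Bob", "items": ["mouse", "keyboard"]},
]

payload = json.dumps(orders, indent=2)   # serialize for storage or interchange
parsed = json.loads(payload)             # parse back into Python objects

print(parsed[0]["customer"])             # -> Alice
print(parsed[1].get("coupon", "none"))   # missing optional field handled gracefully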

Unstructured Data:
Unstructured data lacks a predefined structure and does not fit neatly into tables or rows.
It includes a wide range of content types, making it the most challenging type of data to work
with.
Examples include:
Text documents: Email messages, social media posts, articles.
Multimedia content: Images, videos, audio recordings.
Sensor data: Data from IoT (Internet of Things) devices.
Weblogs and clickstream data.



Analyzing unstructured data often requires advanced techniques such as natural language
processing (NLP), image recognition, and machine learning.

These types of digital data play a crucial role in big data analytics and inform the choice of
tools, techniques, and technologies used for data storage, processing, and analysis in the
world of big data.

History of Big Data innovation


Early Beginnings:

The concept of Big Data can be traced back to the 1960s and 1970s when the world was
generating data at an increasing rate, primarily in scientific research and government projects.
Relational Databases:

In the 1970s, Edgar F. Codd introduced the concept of relational databases, which allowed for
structured data storage and querying. While not initially designed for Big Data, relational
databases played a crucial role in data management.
Data Warehousing:

In the 1980s, data warehousing emerged as a method to consolidate and manage large
volumes of data for analytical purposes. This laid the foundation for the analysis of Big Data.
Internet Era:

The rise of the internet in the late 20th century led to an explosion of data generation. Search
engines like Google and e-commerce platforms like Amazon were early pioneers in dealing
with large datasets.
Hadoop and MapReduce:

In the mid-2000s, Google published the MapReduce programming model, and Doug Cutting and Mike Cafarella created Hadoop, an open-source implementation that Yahoo adopted at scale and that later became a top-level Apache project. This framework allowed distributed processing of large datasets across clusters of commodity computers.

Introduction to Big Data platform


Big data platforms are tools and technologies designed to address the challenges of working with data at this scale. They provide the infrastructure and tools for collecting, storing, processing, and analyzing big data. Here are some key components of big data platforms:

Data Ingestion: This phase involves collecting data from various sources, including
databases, social media, IoT devices, and more. Technologies like Apache Kafka and Flume
are used for data ingestion.
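As an illustration of the ingestion step, here is a minimal Python sketch using the kafka-python client; the broker address and the topic name "clickstream" are hypothetical placeholders, not part of the original material.

import json
from kafka import KafkaProducer   # pip install kafka-python

# Connect to an assumed local broker and serialize each event as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Publish one illustrative event into the ingestion pipeline.
producer.send("clickstream", {"user_id": 42, "action": "page_view"})
producer.flush()   # make sure the message actually leaves the client buffer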

Data Storage: Big data platforms use distributed storage systems like Hadoop Distributed
File System (HDFS) or cloud-based storage solutions such as Amazon S3 or Azure Blob
Storage.



Data Processing: For data processing, platforms like Apache Hadoop and Apache Spark are
popular choices. They allow for distributed processing of large datasets.
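A minimal PySpark sketch of distributed processing follows; the input path, column names, and output location are assumptions made for illustration only.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-summary").getOrCreate()

# Read a (hypothetical) directory of JSON event logs from HDFS.
events = spark.read.json("hdfs:///data/events/")

# Aggregate in parallel across the cluster: events per date and type.
daily = events.groupBy("event_date", "event_type").agg(F.count("*").alias("events"))

# Write the summarized result back to distributed storage.
daily.write.mode("overwrite").parquet("hdfs:///data/summaries/daily_events")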

Data Analysis: Machine learning and data analytics tools are integrated into big data
platforms. Libraries such as TensorFlow, scikit-learn, and Spark MLlib are used for analysis.
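As a small, self-contained example of the analysis step, the sketch below trains a scikit-learn classifier on the bundled Iris dataset; it is only a toy illustration of the kind of model such libraries provide.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small sample dataset and split it into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a simple predictive model and evaluate it on held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))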

Data Visualization: Data visualization tools like Tableau, Power BI, and open-source options
like D3.js help present insights in a meaningful way.

Common Big Data Technologies


Hadoop: A distributed storage and processing framework for big data.

Apache Spark: A fast, in-memory data processing engine for large-scale data processing.

NoSQL Databases: Such as MongoDB, Cassandra, and HBase, which are designed for
unstructured and semi-structured data.

Machine Learning Libraries: Including TensorFlow, scikit-learn, and PyTorch for building
predictive models.

Cloud-Based Solutions: Like AWS, Azure, and Google Cloud, which offer managed big data
services.

Drivers for Big Data


The adoption of big data is driven by various factors that have transformed the way
organizations collect, process, and leverage data. These drivers have played a significant role
in the growing importance of big data analytics across various industries. Here are the key
drivers for big data adoption:

Data Explosion: The digital age has seen an exponential increase in data generation. The proliferation of smartphones, IoT devices, social media, and online transactions has led to an enormous volume of data that organizations can harness for insights.

Competitive Advantage: Companies recognize that by effectively utilizing big data, they can
gain a competitive edge. Analyzing data allows for better decision-making, improved
products and services, and enhanced customer experiences.

Cost Reduction: Big data technologies enable organizations to store and process data more
efficiently, often at a lower cost compared to traditional methods. This cost-effectiveness is a
strong incentive for adoption.

Personalization: Customers expect highly personalized experiences. Big data analytics empowers organizations to understand customer preferences and deliver tailored content, products, and services.

Operational Efficiency: Big data can optimize operational processes, supply chain
management, and resource allocation. It helps organizations make data-driven decisions to
enhance efficiency and reduce waste.
Risk Management: Big data analytics can identify and mitigate risks in real time. This is
crucial in industries such as finance and insurance to prevent fraud and minimize losses.

Innovation: Big data is a catalyst for innovation. By analyzing data, organizations can
discover new insights, create innovative products, and adapt to changing market conditions.

Customer Insights: Understanding customer behavior, sentiment, and feedback is critical for
businesses. Big data provides a means to gain deep insights into customer preferences and
needs.

Regulatory Compliance: Many industries face stringent data regulations. Big data platforms
help organizations manage and report data in compliance with these regulations, reducing
legal risks.

Healthcare Advancements: In the healthcare sector, big data is driving research and
enabling personalized medicine. It assists in diagnosing diseases, predicting outbreaks, and
improving patient care.

Real-Time Decision-Making: Big data technologies can process data in real time, allowing
organizations to make immediate decisions and respond swiftly to changing circumstances.

Supply Chain Optimization: Big data analytics can optimize the supply chain by predicting
demand, reducing lead times, and ensuring efficient inventory management.

Smart Cities: The concept of smart cities relies on big data to enhance urban planning, traffic
management, energy consumption, and public services.

Scientific Research: Big data is instrumental in scientific research, enabling data-driven discoveries and breakthroughs in fields like genomics, climate science, and particle physics.

Social Media and Sentiment Analysis: Organizations can gauge public sentiment and
opinions through social media data, which is valuable for marketing, reputation management,
and product development.

Predictive Maintenance: Big data analytics helps in predicting when machinery and
equipment will require maintenance, reducing downtime and operational disruptions.

Environmental Monitoring: Big data is used in environmental sciences to monitor climate change, track wildlife, and manage natural resources sustainably.

In conclusion, the drivers for big data adoption are multifaceted, ranging from the need to
manage data growth to gaining a competitive advantage and improving operational
efficiency. As the volume and complexity of data continue to grow, big data technologies and
analytics will remain central to the success of organizations in various industries.



Big Data architecture and characteristics
Big Data architecture is the structured framework that outlines how organizations collect,
store, process, and analyze vast and complex data sets known as "Big Data." This
architectural framework is designed to address the specific challenges posed by the sheer
volume, velocity, variety, and veracity of data in the Big Data domain. Below is an overview
of the key components and characteristics of Big Data architecture.

Big data architecture is a comprehensive solution to deal with an enormous amount of data. It
details the blueprint for providing solutions and infrastructure for dealing with big data based
on a company’s demands. It clearly defines the components, layers, and methods of
communication. The reference point is the ingestion, processing, storing, managing,
accessing, and analysing of the data.

• Data Sources: All of the sources that feed the data extraction pipeline fall under this definition, so this is the starting point of the big data pipeline. Data sources, both open and third-party, play a significant role in the architecture. They include relational databases, data warehouses, cloud-based data warehouses, SaaS applications, real-time data from company servers and sensors such as IoT devices, third-party data providers, and static files such as Windows logs. The data they supply can be handled with either batch processing or real-time processing.

• Data Storage: Data is held in distributed file stores that can hold large files in a variety of formats; a data lake serves the same purpose. Data managed for batch operations is saved in these file stores. Common choices include HDFS and blob or object storage on Microsoft Azure, AWS, and GCP.

• Batch Processing: Long-running jobs filter, aggregate, and otherwise prepare the data for analysis. These jobs typically read from source files, process them, and write the results to new output files. Common approaches include Hive jobs, U-SQL jobs, Sqoop or Pig, and custom MapReduce jobs written in Java, Scala, or other languages such as Python.
• Real Time-Based Message Ingestion: Unlike batch processing, real-time ingestion captures data at the moment it is generated. In the simplest case, the incoming messages are dropped into a folder or store for later processing. When message-based processing is required, a message ingestion store such as Apache Kafka, Apache Flume, or Azure Event Hubs must be used instead; these provide reliable delivery and other message-queuing semantics.

• Stream Processing: Message ingestion and stream processing are different stages. Ingestion captures the data, often through a publish-subscribe mechanism, whereas stream processing continuously consumes that data, handles it in the form of windows or streams, and writes the results to a sink. Engines include Apache Spark, Flink, and Storm; a minimal sketch appears after this list.

• Analytics-Based Datastore: To analyze and present already-processed data, analytical tools query a serving data store based on HBase or another NoSQL data warehouse technology. Apache Hive can sit on top of this store to provide metadata abstraction and interactive, SQL-like access; Spark SQL and NoSQL databases such as HBase are also commonly used.

• Reporting and Analysis: The generated insights must finally be presented, which is accomplished by reporting and analysis tools that produce useful graphs, analyses, and dashboards for the business. Examples include Cognos, Hyperion, and others.

• Orchestration: Big data solutions consist largely of repetitive data-related tasks arranged in workflow chains that transform source data, move data between sources and sinks, and load it into serving stores. Sqoop, Oozie, Azure Data Factory, and others are examples of orchestration tools.
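As a sketch of the stream-processing component mentioned above, the following PySpark Structured Streaming job reads events from a Kafka topic and counts them in one-minute windows. The broker address and topic name are placeholders, and running it requires the Spark Kafka connector package; treat it as an illustration rather than a complete deployment.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-windows").getOrCreate()

# Subscribe to an assumed Kafka topic; requires the spark-sql-kafka connector package.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "clickstream")
          .load())

# Count messages per one-minute event-time window.
counts = events.groupBy(F.window(F.col("timestamp"), "1 minute")).count()

# Write the rolling counts to the console sink (any other sink works the same way).
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()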

Big Data Tools and Techniques

Big data tools can be classified into the four buckets listed below, based on how they are used in practice.

1. Massively Parallel Processing (MPP)


2. No-SQL Databases
3. Distributed Storage and Processing Tools
4. Cloud Computing Tools



Massively Parallel Processing (MPP)

A massively parallel processing (MPP) system, also called a loosely coupled or "shared nothing" system, divides a large computing job across many independent processors that work in parallel. Each processor works on a separate part of the task, runs its own instance of the operating system, and does not share memory with the others. Hundreds of processors can cooperate on a single application over a high-speed network, and MPP systems coordinate work by passing messages between processes rather than by sharing memory.

MPP-based databases include IBM Netezza, Oracle Exadata, Teradata, SAP HANA, and EMC Greenplum.

No-SQL Databases

Relational (structured) databases require data to be converted to a predefined schema before it can be stored. NoSQL ("not only SQL") databases relax this requirement and can hold unstructured and heterogeneous data belonging to the same domain without a rigid structure. NoSQL databases offer flexible configuration, versatility, and scalability in handling large quantities of data, and they typically distribute storage so that data is available locally or remotely.

NoSQL databases include the following categories:

1. Key-value pair based
2. Graph based
3. Column-oriented
4. Document-oriented



Key-value model: Much like the hash tables behind dictionaries, collections, and associative arrays, this type of database stores each item as a unique key-value pair. The key is required to access the data and is used to retrieve and update it, while the value records the information and can be a string, char, JSON, or BLOB. Because no schema is needed, data can be stored without a predefined structure. Redis, Dynamo, and Riak are examples of key-value store databases.
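A short Python sketch of the key-value model using the redis-py client against an assumed local Redis server (the key names and values are illustrative):

import redis   # pip install redis

# Connect to an assumed Redis server on localhost.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("session:42", "alice")                                  # simple key -> value
r.hset("user:42", mapping={"name": "Alice", "plan": "pro"})   # key -> hash of fields

print(r.get("session:42"))    # -> alice
print(r.hgetall("user:42"))   # -> {'name': 'Alice', 'plan': 'pro'}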

Graph-based model: Graph databases store both entities and the relationships between them, so they are inherently multi-relational. Entities are stored as nodes of the graph, and the relationships between them are represented by edges. Graph databases are employed for mapping, transportation, social networks, and spatial data applications, and they can also be used to discover patterns in semi-structured and unstructured data. Neo4j, FlockDB, and OrientDB are examples of graph databases.

Column-based NoSQL database: Columnar databases work on columns. Compared to relational databases, they are organized around sets of columns rather than tables of rows. Each column is significant and is handled independently; the values of a column are stored contiguously, and a row may simply omit columns for which it has no value. Because individual columns are easy to scan, columnar databases are efficient at summarisation jobs such as SUM, COUNT, AVG, MIN, and MAX.

Column-family databases are also known as wide-column stores or column stores. They are used for data warehousing, business intelligence, CRM, and library card catalogues.

Cassandra, HBase, and Hypertable are NoSQL databases that use columnar storage.



Document-Oriented NoSQL database: A document-oriented database stores data as documents rather than as rows, making it document-oriented rather than purely data-oriented. Each document is a set of key-value pairs expressed in a format such as JSON or XML. E-commerce applications, blogging platforms, real-time analytics, and content management systems (CMS) are among the applications that benefit from these databases.

NoSQL document databases include MongoDB, CouchDB, Amazon SimpleDB, Riak, and Lotus Notes.
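The sketch below shows the document model with pymongo against an assumed local MongoDB instance; the database name, collection name, and document fields are illustrative only.

from pymongo import MongoClient   # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
posts = client.blog.posts   # hypothetical database "blog", collection "posts"

# Insert a schemaless JSON-like document, then modify one of its fields.
posts.insert_one({"title": "Hello Big Data", "tags": ["intro"], "views": 0})
posts.update_one({"title": "Hello Big Data"}, {"$inc": {"views": 1}})

# Query by a field inside the document.
for doc in posts.find({"tags": "intro"}):
    print(doc["title"], doc["views"])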



The 5 Vs of Big Data
The 5 Vs of Big Data are a set of characteristics that describe the key challenges and attributes
of large and complex datasets. These characteristics are crucial for understanding the nature of
Big Data and for designing appropriate data processing and analysis strategies. Here are the 5
Vs of Big Data:

Volume:

Definition: Volume refers to the vast amount of data generated and collected. Big Data involves
extremely large datasets, often measured in terabytes, petabytes, or even exabytes.
Significance: Managing and processing such immense volumes of data requires specialized
infrastructure and technologies designed for storage, retrieval, and analysis at scale.
Velocity:

Definition: Velocity represents the speed at which data is generated and how quickly it must be
processed and analyzed. Data can be generated in real-time or near-real-time.
Significance: Real-time processing capabilities are essential for handling data streams that are
generated rapidly, such as social media updates, sensor data, financial transactions, and more.
Variety:

Definition: Variety encompasses the diverse types of data that make up Big Data, including
structured, semi-structured, and unstructured data. This data can include text, images, videos,
sensor readings, and more.
Significance: Big Data systems must be able to handle and analyze different data formats and
structures. This requires flexibility in storage and processing to accommodate a wide range of
data sources.
Veracity:

Definition: Veracity refers to the trustworthiness and quality of data. Big Data can be noisy,
with errors, duplications, inconsistencies, and inaccuracies.
Significance: To derive meaningful insights, it's essential to address data quality issues through
data cleansing, validation, and integration processes. Trustworthy data leads to more reliable
analysis.
Value:

Definition: Value signifies the goal of Big Data, which is to extract valuable insights, patterns,
and knowledge from the data.
Significance: The ultimate purpose of handling Big Data is to make data-driven decisions, gain
competitive advantages, improve processes, and create value for organizations. Extracting
actionable insights from large datasets is at the heart of Big Data analytics.
Understanding the 5 Vs of Big Data is essential for organizations to develop effective strategies for data collection, storage, processing, and analysis. By addressing these characteristics, businesses can harness the power of Big Data to make informed decisions, discover new insights, and create value.

Big Data technology components

1. Ingestion:
The ingestion layer is the very first step of pulling in raw data. Data comes from internal sources, relational databases, non-relational databases, social media, emails, phone calls, and so on.
There are two kinds of ingestion:
Batch, in which large groups of data are gathered and delivered together.
Streaming, a continuous flow of data, which is necessary for real-time data analytics.
2. Storage :



Storage is where the converted data is stored in a data lake or warehouse and eventually
processed.
The data lake/warehouse is the most essential component of a big data ecosystem.
It needs to contain only thorough, relevant data to make insights as valuable as possible.
It must be efficient with as little redundancy as possible to allow for quicker processing.
3. Analysis :
In the analysis layer, data gets passed through several tools, shaping it into actionable
insights.
There are four types of analytics on big data:

• Diagnostic: Explains why a problem is happening.
• Descriptive: Describes the current state of a business through historical data.
• Predictive: Projects future results based on historical data.
• Prescriptive: Takes predictive analytics a step further by recommending the best course of action.
4. Consumption :
The final big data component is presenting the information in a format digestible to the end-
user.
This can be in the form of tables, advanced visualizations, and even single numbers if requested.
The most important thing in this layer is making sure the intent and meaning of the output is
understandable.

Big Data features – security, compliance, auditing and protection


Security:

Definition: Security in Big Data involves safeguarding data against unauthorized access,
breaches, and cyber threats. It encompasses data encryption, authentication, authorization,
and network security.
Importance: Protecting sensitive information is paramount, as data breaches can have severe
consequences. Security measures, including encryption of data at rest and in transit, secure
access controls, and intrusion detection systems, are vital to ensure data confidentiality and
integrity.
Compliance:

Definition: Compliance refers to adhering to legal and regulatory requirements that pertain to
data handling, privacy, and security. Depending on the industry, this might involve
regulations like GDPR, HIPAA, or industry-specific standards.



Importance: Non-compliance can result in legal actions and fines. Big Data solutions must be
designed to meet these compliance standards. Data governance, access controls, and audit
trails are often implemented to maintain compliance.
Auditing:

Definition: Auditing involves tracking and monitoring all activities related to data, including
access, modification, and usage. Audit logs capture details such as who accessed data, what
actions were taken, and when.
Importance: Auditing provides transparency and accountability, aiding in compliance efforts
and identifying any suspicious or unauthorized activities. It is an essential part of ensuring
data security and governance in Big Data systems.
Data Protection:

Definition: Data protection focuses on safeguarding data from loss, corruption, and damage.
This includes backup and disaster recovery strategies.
Importance: Big Data systems often deal with irreplaceable data. Data protection measures
ensure data availability in the face of hardware failures, disasters, or other unforeseen events.
Key Measures for These Features in Big Data:

Access Control: Implement role-based access control (RBAC) to restrict data access to
authorized users. This ensures that only those with proper permissions can interact with
sensitive data.

Encryption: Use encryption techniques for data at rest and data in transit to protect data from
unauthorized access during storage and transmission.
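As a minimal illustration of encrypting data at rest, the sketch below uses the Python cryptography library's Fernet API (symmetric encryption). In practice the key would live in a key-management service rather than in the script, and the record shown is made up.

from cryptography.fernet import Fernet   # pip install cryptography

key = Fernet.generate_key()   # in production, fetch this from a key-management service
cipher = Fernet(key)

record = b'{"patient_id": 101, "diagnosis": "confidential"}'   # illustrative sensitive record
token = cipher.encrypt(record)      # ciphertext is safe to store at rest
restored = cipher.decrypt(token)    # readable only by holders of the key

assert restored == record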

Authentication and Authorization: Enforce strong authentication mechanisms to verify the identity of users and systems. Authorization rules define what actions users can perform on data.

Data Masking: For non-production environments, sensitive data can be replaced with
fictional data or obscured to prevent exposure of actual sensitive information.
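A simple data-masking sketch in Python follows; the masking rules (keep the first letter of an email's local part, expose only the last four digits of a card number) are common conventions chosen here for illustration.

def mask_email(email: str) -> str:
    # Keep the first character of the local part and the full domain.
    local, _, domain = email.partition("@")
    return f"{local[:1]}{'*' * max(len(local) - 1, 1)}@{domain}"

def mask_card(card_number: str) -> str:
    # Expose only the last four digits.
    digits = card_number.replace(" ", "")
    return "*" * (len(digits) - 4) + digits[-4:]

print(mask_email("alice@example.com"))    # -> a****@example.com
print(mask_card("4111 1111 1111 1111"))   # -> ************1111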

Data Governance: Establish data governance policies and practices to ensure data quality,
compliance, and ethical data use.

Audit Trails: Implement detailed audit logs to track who accessed the data, what they did, and
when they did it. Regularly review audit logs for anomalies.

Data Classification: Categorize data according to its sensitivity and define appropriate
security and protection measures based on the classification.

Incident Response: Develop an incident response plan to address security breaches or data
loss effectively. This includes strategies for data recovery and system restoration.



Big Data privacy and ethics

Big Data technologies and practices have raised important concerns related to privacy and
ethics. The massive volume of data, advanced analytics, and data-driven decision-making
have the potential to impact individuals, society, and organizations in significant ways.
Therefore, addressing privacy and ethical considerations is crucial. Here are key aspects of
Big Data privacy and ethics:

1. Data Privacy:

Definition: Data privacy involves the protection of individuals' personal and sensitive
information, ensuring that it is collected, stored, processed, and shared in a way that respects
their rights and choices.
Importance: With Big Data, there's an increased risk of privacy violations as vast amounts of
personal data are collected from various sources. Ensuring data privacy is essential to
maintain trust with customers and users.
2. Informed Consent:

Definition: Informed consent requires individuals to be aware of what data is being collected,
how it will be used, and to have the choice to opt in or out of data collection.
Importance: Informed consent is crucial in data collection. It empowers individuals to make
choices about their data and promotes transparency.
3. Data Anonymization:

Definition: Data anonymization involves removing or disguising personally identifiable information (PII) from datasets to protect individuals' identities.
Importance: Anonymization allows for data analysis while reducing the risk of privacy
breaches. However, it's becoming more challenging as de-anonymization techniques advance.
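To illustrate one common anonymization technique, the sketch below pseudonymizes a direct identifier with a salted one-way hash and generalizes a quasi-identifier (age into an age band). Real anonymization needs more care (k-anonymity-style checks, secret salt management); this is only a sketch with made-up data.

import hashlib

SALT = b"keep-this-secret-and-rotate-it"   # illustrative; a real salt is stored securely

def pseudonymize(identifier: str) -> str:
    # Replace a direct identifier with a salted, one-way hash.
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()[:16]

record = {"email": "alice@example.com", "age": 34, "city": "Prayagraj"}

anonymized = {
    "user_ref": pseudonymize(record["email"]),   # no longer directly identifying
    "age_band": "30-39",                         # generalize the quasi-identifier
    "city": record["city"],
}
print(anonymized)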
4. Fairness and Bias:

Definition: Fairness refers to ensuring that Big Data analysis and algorithms are free from
bias and do not discriminate against individuals or groups based on race, gender, age, or other
characteristics.
Importance: Unintentional biases can lead to unfair treatment. Ethical considerations include
identifying and mitigating biases in data and algorithms.

5. Data Ownership and Control:

Definition: Data ownership concerns the question of who owns the data. Individuals should
have control over their own data, including the right to access, correct, or delete it.
Importance: Clarifying data ownership and control is essential to uphold individuals' rights
and privacy.
6. Transparency:

Definition: Transparency involves making data collection and processing practices clear to
individuals. Organizations should be open about their data practices.
Importance: Transparency builds trust, allows individuals to make informed decisions about
their data, and enhances accountability.
7. Data Security:

Definition: Data security measures protect data from unauthorized access, breaches, and
cyber threats.
Importance: Protecting data against security threats is fundamental to safeguarding privacy
and maintaining data integrity.
8. Ethical Use of Data:

Definition: Ethical use of data involves using data in ways that align with moral and social
values, respecting human rights and avoiding harm.
Importance: Adhering to ethical principles in data analysis and decision-making is essential to
avoid potential harm or misuse of data.
9. Social Implications:

Definition: Big Data can have profound societal impacts, such as influencing public opinion,
political decisions, and employment practices.
Importance: Understanding the social implications of Big Data is critical to address the
ethical and societal consequences.
10. Compliance with Regulations:

Definition: Compliance involves following legal and regulatory frameworks, such as GDPR
(General Data Protection Regulation) or HIPAA (Health Insurance Portability and
Accountability Act).
Importance: Organizations must comply with data protection regulations to avoid legal
consequences and protect individual privacy.

Big Data Analytics


Analytics is the systematic process of examining, interpreting, and transforming data into
meaningful insights and knowledge. It involves the use of various techniques, tools, and
methodologies to analyze data, discover patterns, trends, and correlations, and ultimately make
informed decisions or recommendations based on the findings.

Key components of analytics include


Data Collection: The first step in analytics is gathering relevant data from various sources,
which can include structured data from databases, unstructured data from text or images, and
real-time data from sensors or other devices.

Data Processing: Once collected, data often requires preprocessing, which involves cleaning,
transforming, and organizing it to ensure accuracy and consistency. This step is crucial for
accurate analysis.
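A small pandas sketch of the preprocessing step is shown below; the DataFrame and its quality problems (a duplicate row, missing values, string dates) are invented for illustration.

import pandas as pd

# An illustrative raw extract with typical quality problems.
raw = pd.DataFrame({
    "customer": ["Alice", "Bob", "Bob", "Cara", None],
    "amount":   [120.0, 80.0, 80.0, None, 45.0],
    "date":     ["2024-01-03", "2024-01-04", "2024-01-04", "2024-01-05", "2024-01-05"],
})

clean = (raw
         .drop_duplicates()                        # remove exact duplicate rows
         .dropna(subset=["customer"])              # drop rows missing the key field
         .assign(amount=lambda d: d["amount"].fillna(d["amount"].median()),
                 date=lambda d: pd.to_datetime(d["date"])))

print(clean.dtypes)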

Data Analysis: Analytics involves the application of statistical, mathematical, or computational techniques to the prepared data. This analysis can take various forms, including descriptive, diagnostic, predictive, and prescriptive analytics.
Descriptive Analytics: Describes past data to understand what has happened. It includes
summary statistics, data visualization, and reporting.
Diagnostic Analytics: Aims to discover why a particular event occurred by exploring the causes
and factors contributing to the observed outcomes.
Predictive Analytics: Utilizes historical data and statistical models to forecast future events or
trends.
Prescriptive Analytics: Suggests specific actions or decisions based on predictive analytics to
optimize outcomes.
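To ground the distinction, the pandas sketch below (with invented sales figures) performs descriptive analytics with summary statistics and a group-by, and hints at a predictive step by computing month-over-month growth.

import pandas as pd

# Invented sales data for illustration.
sales = pd.DataFrame({
    "region":  ["North", "South", "North", "South", "East"],
    "month":   ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "revenue": [1200, 900, 1500, 950, 700],
})

# Descriptive analytics: what happened?
print(sales["revenue"].describe())                # summary statistics
print(sales.groupby("region")["revenue"].sum())   # revenue per region

# A first step toward predictive analytics: a simple month-over-month trend.
monthly = sales.groupby("month", sort=False)["revenue"].sum()
print(monthly.pct_change())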
Data Visualization: Presenting the results of data analysis through charts, graphs, dashboards,
and other visual aids to make complex information more understandable and actionable.

Decision-Making: The ultimate goal of analytics is to support informed decision-making. Insights derived from data analysis guide individuals, organizations, or systems in making choices or taking actions that are expected to lead to more favorable outcomes.

Continuous Improvement: Analytics is an iterative process. As more data is collected and analyzed, organizations can refine their strategies and approaches to improve their operations and decision-making.

Big Data Analytics Process


Data Collection: Gather data from various sources, including internal databases, external
sources, and sensors.
Data Preprocessing: Clean, transform, and structure data for analysis, addressing missing
values and outliers.
Data Storage: Store data efficiently, often using distributed storage systems.
Data Analysis: Apply analytical techniques, including statistical analysis and machine
learning, to extract insights.
Data Visualization: Present results through graphs, charts, and dashboards for better
understanding.
Decision Making: Use insights to make informed business decisions.

Challenges of conventional systems

• Lack of Scalability
• Limited Flexibility
• Obsolete Technology
• High Maintenance Costs
• Security Vulnerabilities
• Inefficient Processes
• Poor User Experience
• Limited Analytical Capabilities
• Integration Challenges



ANALYSIS vs REPORTING
Purpose:

Analysis: The primary purpose of analysis is to examine data in-depth, identify patterns,
trends, anomalies, and relationships within the data. It aims to extract insights and answers to
specific questions, often leading to a deeper understanding of the underlying factors.

Reporting: Reporting is focused on summarizing and presenting data in a structured format. Its main purpose is to convey information clearly and concisely, typically to inform stakeholders about the current state of affairs or to provide historical data.

Nature:

Analysis: Analysis involves exploration, data mining, and hypothesis testing. It requires
critical thinking, data modeling, and the application of statistical or analytical techniques to
uncover meaningful insights.

Reporting: Reporting is descriptive and static in nature. It involves creating standardized reports or dashboards that present data in a pre-defined format for easy consumption.

Level of Detail:

Analysis: Analysis goes into granular details, often looking at individual data points or
specific subsets of data. It aims to answer specific questions and requires a more thorough
examination of the data.

Reporting: Reporting provides a high-level overview of data, usually presenting aggregated or summarized information. It doesn't delve deeply into the details but offers a broad picture of the data.

Interactivity:

Analysis: Analysis is interactive and iterative. Analysts explore data, adjust models or
hypotheses, and perform multiple iterations to gain insights. It often involves ad-hoc queries
and exploratory data analysis.

Reporting: Reporting is typically static, offering a fixed view of data. While some reports and
dashboards may allow limited interaction (e.g., filtering or drilling down), it's not as flexible
as analysis.

Timing:
Analysis: Analysis can be an ongoing process and is often performed as needed or in
response to specific questions or problems. It doesn't have a fixed schedule.

Reporting: Reporting is usually periodic and scheduled. Reports are generated at regular
intervals (daily, weekly, monthly) and provide a snapshot of data at that moment.

Audience:

Analysis: The audience for analysis is often data analysts, scientists, or subject-matter experts
who need to deep dive into data to make informed decisions or recommendations.

Reporting: Reporting caters to a broader audience, including executives, managers, and other
stakeholders who require concise, high-level information for decision-making.

MODERN DATA ANALYTIC TOOLS


Apache Hadoop: An open-source framework for distributed storage and processing of big
data. Hadoop includes the Hadoop Distributed File System (HDFS) for data storage and
MapReduce for data processing. It's widely used for batch processing and large-scale data
analysis.

Apache Spark: Another open-source, distributed computing framework, Spark is known for
its speed and versatility. It supports real-time and batch data processing, machine learning,
and graph processing, making it a versatile choice for various analytics tasks.

Python and R: Both Python and R are popular programming languages for data analysis and
machine learning. They offer extensive libraries and frameworks, such as Pandas, NumPy,
SciPy, Scikit-Learn (Python), and Tidyverse (R), that provide powerful data manipulation
and statistical analysis capabilities.

Tableau: A data visualization tool that allows users to create interactive and shareable
dashboards. Tableau is known for its user-friendly interface and the ability to connect to a
wide range of data sources.



Power BI: A business analytics service by Microsoft that provides interactive visualizations
and business intelligence capabilities. Power BI supports integration with various data
sources, including cloud services and on-premises databases.

QlikView and Qlik Sense: Qlik's products offer associative data modeling and in-memory
analytics. They allow users to explore data, create dynamic visualizations, and share insights
with others.

Google Data Studio: A free data visualization and reporting tool from Google. It allows
users to create custom reports and dashboards, connect to various data sources, and share
them with others.

SAS (Statistical Analysis System): SAS provides a comprehensive suite of analytics solutions for advanced analytics, data management, and business intelligence. It is commonly used in industries such as healthcare and finance.

KNIME: An open-source data analytics, reporting, and integration platform. KNIME allows
users to build data pipelines and automate data processing and analysis workflows.

Databricks: A unified analytics platform that combines the power of Apache Spark with
collaborative data science tools. It's designed for big data analytics, machine learning, and
artificial intelligence.

Alteryx: A data analytics platform that offers data blending, data preparation, and advanced
analytics capabilities. It's known for its user-friendly workflow-based interface.

Sisense: A business intelligence platform that enables data integration, analysis, and
visualization. It's designed to handle complex data and provide insights to non-technical
users.

Looker: A data exploration and business intelligence platform that allows organizations to
create and share reports and dashboards.



Redash: An open-source tool for querying, visualizing, and collaborating on data. It supports
various data sources and enables users to create custom dashboards and visualizations.

Big Data features – security, compliance, auditing and protection, Big Data privacy and
ethics

Security:

Volume and Velocity: The massive volume and high velocity of Big Data make it challenging
to secure. Security measures must keep pace with the continuous influx of data.

Data Breach Risks: Big Data often contains sensitive and personally identifiable information.
The risk of data breaches and unauthorized access is a significant concern.

Authentication and Authorization: Implementing robust authentication and authorization mechanisms is crucial to ensure that only authorized users have access to specific data.

Encryption: Data encryption is essential for protecting data both in transit and at rest. It
ensures that even if data is intercepted, it remains unreadable without the appropriate
decryption keys.

Compliance, Auditing, and Protection:

Regulatory Compliance: Big Data often contains data subject to various regulations, such as
GDPR, HIPAA, and industry-specific compliance requirements. Organizations must ensure
compliance with these regulations.
Data Auditing: Auditing data access and usage is vital for compliance and security.
Organizations need to maintain comprehensive audit logs and monitor data access and
modifications.
Data Governance: Establishing robust data governance practices helps organizations manage
data quality, privacy, and compliance effectively. This includes data lineage, metadata
management, and policy enforcement.
Protection Against Insider Threats: Insider threats pose significant risks. Organizations must
implement strategies to detect and mitigate threats from within their own workforce.

Big Data Privacy and Ethics:

Data Anonymization: Anonymizing data is a common privacy practice. It involves removing or encrypting personally identifiable information to protect individual privacy while retaining data utility.
Ethical Data Use: Big Data analytics raise ethical concerns regarding how data is collected,
stored, and used. Organizations should establish ethical guidelines to ensure responsible data
usage.
Bias and Fairness: Analyzing Big Data may reveal biased results or reinforce existing biases.
It's crucial to address these issues and strive for fairness and equity in data-driven decision-
making.
Informed Consent: When collecting and using data, organizations should obtain informed
consent from individuals, making them aware of how their data will be used.
Transparency: Being transparent about data practices and sharing information about data
collection, processing, and usage builds trust with data subjects and stakeholders.



Key Points:
Types of Digital Data:

Digital data encompasses a wide variety of types, including text, images, audio, video,
structured databases, sensor data, and more.
Each type of data requires different processing and storage methods.
History of Big Data Innovation:

Big Data innovation has its roots in the early days of computing but gained prominence in the
21st century with the explosion of digital data.
Technologies like Hadoop, NoSQL databases, and advanced analytics tools have driven this
innovation.
Introduction to Big Data Platform:

A Big Data platform is an integrated technology framework for collecting, storing, processing,
and analyzing large and complex datasets.
It enables organizations to derive insights and value from data.
Drivers for Big Data:

The key drivers for Big Data adoption include the need for real-time decision-making, data-
driven insights, and competitive advantages.
The exponential growth of data and the Internet of Things (IoT) are also significant drivers.
Big Data Architecture and Characteristics:

Big Data architecture typically includes data sources, data storage, data processing, and data
analytics layers.
Key characteristics are scalability, fault tolerance, and parallel processing.
5 Vs of Big Data:

The 5 Vs represent Volume (data size), Velocity (data speed), Variety (data types), Veracity
(data quality), and Value (data usefulness).
These dimensions define the challenges and opportunities in Big Data.
Big Data Technology Components:

Components may include storage (Hadoop HDFS, NoSQL databases), processing (Map-
Reduce, Spark), and analytics tools (Tableau, R, Python).
Data integration, ETL (Extract, Transform, Load), and visualization tools are also part of the
ecosystem.
Big Data Importance and Applications:

Big Data is crucial for enhancing decision-making, improving customer experiences, optimizing operations, and driving innovation.
Applications range from healthcare and finance to marketing, logistics, and scientific research.
Big Data Features - Security, Compliance, Auditing, and Protection:

Security measures protect data from unauthorized access and breaches.


Compliance ensures data adheres to legal and regulatory requirements.
Auditing tracks data access and usage, and protection safeguards against data loss.
Big Data Privacy and Ethics:



Privacy concerns focus on protecting individuals' data and ensuring responsible data handling.
Ethical considerations relate to the ethical use of data and the consequences of data analysis on
privacy and society.
Big Data Analytics:

Big Data analytics involves processing and analyzing large and complex datasets to extract
valuable insights.
Techniques include descriptive, diagnostic, predictive, and prescriptive analytics.
Challenges of Conventional Systems:

Traditional systems struggle to handle Big Data's volume, velocity, and variety.
They may lack scalability, real-time processing, and advanced analytics capabilities.
Intelligent Data Analysis:

Intelligent data analysis uses AI and machine learning to derive deeper insights from data.
It includes techniques like clustering, regression, and neural networks.
Nature of Data:

Data can be structured (organized), semi-structured (partly organized), or unstructured (not organized).
Understanding data's nature is crucial for its effective processing.
Analytic Processes and Tools:

Analytic processes involve data collection, data preparation, data analysis, and interpretation.
Tools encompass software and algorithms used in data analytics.
Analysis vs. Reporting:

Reporting presents data in a static format.


Analysis involves exploring data, identifying trends, and making predictions or decisions based
on data insights.
Modern Data Analytic Tools:

Modern tools include open-source platforms like Apache Spark, data warehouses like
Snowflake, and cloud-based analytics services.
They offer advanced analytics and visualization capabilities.
