Module 2
Overview of Data Repositories
A data repository is a general term used to refer to data that has
been collected, organized, and isolated so that it can be used for
business operations or mined for reporting and data analysis.
Databases.
Let’s begin with databases.
A database is a collection of data, or information, designed for the
input, storage, search and retrieval, and modification of data.
Even though a database and a database management system (DBMS)
mean different things, the terms are often used interchangeably.
There are different types of databases. Several factors influence the
choice of database, such as:
Data type and structure,
Querying mechanisms,
Latency requirements,
Transaction speeds, and
Intended use of the data.
It’s important to mention two main types of databases here:
1. Relational databases
2. Non-relational databases
Relational databases
Relational databases, also referred to as RDBMSes, build on the
organizational principles of flat files,
with data organized into a tabular format with rows and columns
following a
well-defined structure and schema.
However, unlike flat files, RDBMSes are optimized for data
operations and querying involving many tables and much larger
data volumes.
Non-relational databases
Non-relational databases made it possible to store data in a
schema-less or free-form fashion. NoSQL is widely used for
processing big data.
Data Warehouse.
A data warehouse works as a central repository that merges
information coming from disparate sources and consolidates it
through the extract, transform, and load process, also known as the
ETL process, into one comprehensive database for analytics and
business intelligence.
Summary
Overall, data repositories help to isolate data and make reporting
and analytics more efficient and credible while also serving as a data
archive.
A relational database is a collection of data organized into a table
structure, where the tables can be linked, or related, based on data
common to each. Tables are made of rows and columns, where rows
are the “records”, and the columns the “attributes”.
Let’s take the example of a customer table that maintains data about
each customer in a company. The columns, or attributes, in the
customer table are the Company ID, Company Name, Company
Address, and Company Primary Phone; and Each row is a customer
record.
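As a minimal sketch of such a table using Python's built-in sqlite3
module (the exact column names here are illustrative, based on the
description above, not a prescribed schema):

import sqlite3

# Create the customer table described above; each row is one customer record.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        company_id INTEGER PRIMARY KEY,
        company_name TEXT NOT NULL,
        company_address TEXT,
        company_primary_phone TEXT
    )
""")
conn.execute(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    (1, "Acme Corp", "100 Main St", "555-0100"),
)
for row in conn.execute("SELECT company_name FROM customer"):
    print(row)  # ('Acme Corp',)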
This capability of relating tables based on common data enables you
to retrieve an entirely new table from data in one or more tables
with a single query.
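As a sketch of that idea, the following snippet, again using Python's
built-in sqlite3 module with hypothetical customer and orders tables,
retrieves a brand-new result table by joining on the common
company_id column:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (company_id INTEGER PRIMARY KEY, company_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, company_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Acme Corp');
    INSERT INTO orders VALUES (100, 1, 250.0);
""")
# One query relates the two tables on their common column.
rows = conn.execute("""
    SELECT c.company_name, o.order_id, o.total
    FROM customer AS c
    JOIN orders AS o ON o.company_id = c.company_id
""").fetchall()
print(rows)  # [('Acme Corp', 100, 250.0)]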
Relational database tables may look similar to spreadsheets, but this
is where the similarity ends.
Relational databases, by design, are ideal for the optimized
storage, retrieval, and processing of data for large volumes of
data, unlike spreadsheets that have a limited number of rows and
columns.
Each table in a relational database has a unique set of rows and
columns, and relationships can be defined between tables, which
minimizes data redundancy.
Moreover, you can restrict database fields to specific data types
and values, which minimizes irregularities and leads to greater
consistency and data integrity.
Relational databases use SQL for querying data, which gives you
the advantage of processing millions of records and retrieving
large amounts of data in a matter of seconds.
Moreover, the security architecture of relational databases
provides controlled access to data and also ensures that the
standards and policies for governing data can be enforced.
Relational databases range from small desktop systems to massive
cloud-based systems. They can be either:
open-source and internally supported,
open-source with commercial support, or
commercial closed-source systems.
IBM DB2, Microsoft SQL Server, MySQL, Oracle Database, and
PostgreSQL are some of the popular relational databases.
Cloud-based relational databases, also referred to as Database-as-a-
Service, are gaining wide use as they have access to the limitless
compute and storage capabilities offered by the cloud.
Some of the popular cloud relational databases include Amazon
Relational Database Service (RDS), Google Cloud SQL, IBM DB2 on
Cloud, Oracle Cloud, and SQL Azure.
RDBMS is a mature and well-documented technology, making it easy
to learn and find qualified talent.
One of the most significant advantages of the relational database
approach is its ability to create meaningful information by joining
tables.
Some of its other advantages include:
Flexibility: Using SQL, you can add new columns, add new
tables, rename relations, and make other changes while the
database is running and queries are happening.
Reduced redundancy: Relational databases minimize data
redundancy. For example, the information of a customer
appears in a single entry in the customer table, and the
transaction table pertaining to the customer stores a link to the
customer table.
Ease of backup and disaster recovery: Relational databases
offer easy export and import options, making backup and
restore easy. Exports can happen while the database is running,
making restore on failure easy.
Cloud-based relational databases do continuous mirroring, which
means the loss of data on restore can be measured in seconds or
less.
ACID-compliance: ACID stands for Atomicity, Consistency,
Isolation, and Durability. And ACID compliance implies that
the data in the database remains accurate and
consistent despite failures, and database transactions are
processed reliably.
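As a minimal sketch of atomicity using Python's built-in sqlite3
module (the account table and balances are hypothetical): either both
updates in the transaction commit, or neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 50 "
                     "WHERE name = 'alice'")
        conn.execute("UPDATE account SET balance = balance + 50 "
                     "WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on any failure, neither update is visible
print(conn.execute("SELECT * FROM account ORDER BY name").fetchall())
# -> [('alice', 50), ('bob', 50)]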
NoSQL
NoSQL, which stands for “not only SQL,” or sometimes “non-SQL” is a
non-relational database design that provides flexible schemas for the
storage and retrieval of data. NoSQL databases have existed for
many years but have only recently become more popular in the era
of cloud, big data, and high-volume web and mobile
applications. They are chosen today for their attributes around scale,
performance, and ease of use.
It's important to emphasize that the "No" in "NoSQL" is an
abbreviation for "not only" and not the actual word "No."
NoSQL databases are built for specific data models and have flexible
schemas that allow programmers to create and manage modern
applications. They do not use a traditional row/column/table
database design with fixed schemas, and typically do not use the
structured query language (or SQL) to query data, although some
may support SQL or SQL-like interfaces.
Based on the model being used for storing data, there are four
common types of NoSQL databases:
1. Key-value store,
2. Document-based,
3. Column-based, and
4. Graph-based.
Key-value store.
Data in a key-value database is stored as a collection of key-value
pairs.
The key represents an attribute of the data and is a unique
identifier.
Both keys and values can be anything from simple integers or
strings to complex JSON documents.
Key-value stores are great for storing user session data and user
preferences, making real-time recommendations and targeted
advertising, and in-memory data caching.
However, if you want to be able to query the data on specific data
values, need relationships between data values, or need to have
multiple unique keys, a key-value store may not be the best fit.
Key-value store tools.
Redis, Memcached, and DynamoDB are some well-known examples
in this category.
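As a quick sketch, assuming the redis-py client package and a Redis
server running locally (the key name and value here are made up):

import redis  # requires the redis-py package and a running Redis server

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store user session data under a unique key; the value here is a plain
# string, but values can also be serialized JSON.
r.set("session:user:42", "logged_in")
r.expire("session:user:42", 3600)  # let the session lapse after an hour
print(r.get("session:user:42"))    # -> "logged_in"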
Document-based:
Document databases store each record and its associated data
within a single document.
They enable flexible indexing, powerful ad hoc queries, and
analytics over collections of documents.
Document databases are preferable for eCommerce platforms,
medical records storage, CRM platforms, and analytics platforms.
However, if you’re looking to run complex search queries and
multi-operation transactions, a document-based database may not be
the best option for you.
MongoDB, DocumentDB, CouchDB, and Cloudant are some of the
popular document-based databases.
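A minimal sketch with the pymongo client, assuming a MongoDB server on
localhost (the database, collection, and fields are hypothetical):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
products = client["shop"]["products"]

# The whole record lives in one document; fields can vary per document.
products.insert_one({
    "name": "espresso machine",
    "price": 299.99,
    "tags": ["kitchen", "coffee"],
})
print(products.find_one({"name": "espresso machine"}))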
Column-based:
Column-based models store data in cells grouped as columns of
data instead of rows.
A logical grouping of columns, that is, columns that are usually
accessed together, is called a column family. For example, a
customer’s name and profile information will most likely be
accessed together but not their purchase history. So, customer
name and profile information data can be grouped into a column
family.
Since column databases store all cells corresponding to a column
as a continuous disk entry, accessing and searching the data
becomes very fast.
Column databases can be great for systems that require heavy
write requests, storing time-series data, weather data, and IoT
data.
However, if you need to use complex queries or change your querying
patterns frequently, this may not be the best option for you.
The most popular column databases are Cassandra and HBase.
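As a toy illustration in plain Python (not a real column store),
grouping values by column family keeps columns that are read together
in one contiguous structure:

# Row storage: every read touches whole row dictionaries.
row_store = [
    {"name": "Ada", "city": "London", "purchases": 12},
    {"name": "Lin", "city": "Taipei", "purchases": 7},
]

# Column-family storage: the "profile" family groups columns that are
# usually accessed together, separate from activity data.
column_store = {
    "profile": {
        "name": ["Ada", "Lin"],
        "city": ["London", "Taipei"],
    },
    "activity": {
        "purchases": [12, 7],
    },
}
# Reading every city scans one contiguous list instead of every row.
print(column_store["profile"]["city"])  # ['London', 'Taipei']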
Graph-based:
Graph-based databases use a graphical model to represent and
store data.
They are particularly useful for visualizing, analyzing, and finding
connections between different pieces of data.
In the graph model, circles are nodes that contain the data, and the
arrows between them represent relationships. Graph databases are an
excellent choice for working with connected data, which is data that
contains lots of interconnected relationships.
Graph databases are great for social networks, real-time product
recommendations, network diagrams, fraud detection, and access
management.
However, if you want to process high volumes of transactions, a graph
database may not be the best choice for you, because graph databases
are not optimized for large-volume analytics queries.
Neo4J and Cosmos DB are some of the more popular graph
databases.
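A small sketch using the networkx Python library (illustrative only:
networkx is a graph-analysis library, not a graph database, and the
names are made up):

import networkx as nx

# Nodes hold the data; edges are the relationships between them.
social = nx.Graph()
social.add_edge("Maria", "Ahmed", relation="follows")
social.add_edge("Maria", "Chen", relation="follows")
social.add_edge("Chen", "Ahmed", relation="colleague")

# Finding connections is a natural graph traversal.
print(list(social.neighbors("Maria")))  # ['Ahmed', 'Chen']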
Advantages of NoSQL
NoSQL was created in response to the limitations of traditional
relational database technology.
The primary advantage of NoSQL is its ability to handle large
volumes of structured, semi-structured, and unstructured data.
Some of its other advantages include:
The ability to run as distributed systems scaled across multiple
data centres, which enables them to take advantage of cloud
computing infrastructure;
An efficient and cost-effective scale-out architecture that provides
additional capacity and performance with the addition of new
nodes;
Simpler design, better control over availability, and improved
scalability that enables you to be more agile, more flexible, and to
iterate more quickly.
To summarize the key differences between relational and non-
relational databases: relational databases follow a rigid,
well-defined schema, use SQL for querying, and work primarily with
structured data, while non-relational databases provide flexible or
schema-less designs, typically do not use SQL, and can handle
structured, semi-structured, and unstructured data.
Data Warehouses.
A data warehouse works like a multi-purpose storage for different
use cases. By the time the data comes into the warehouse, it has
already been modelled and structured for a specific purpose,
meaning it is analysis ready. As an organization, you would opt for a
data warehouse when you have massive amounts of data from your
operational systems that needs to be readily available for reporting
and analysis.
Data warehouses serve as the single source of truth—storing current
and historical data that has been cleansed, conformed, and
categorized.
A data warehouse is a multi-purpose enabler of operational and
performance analytics.
Data Marts.
A data mart is a sub-section of the data
warehouse, built specifically for a particular business function,
purpose, or community of users. The idea is to provide stakeholders
data that is most relevant to them, when they need it. For example,
the sales or finance teams accessing data for their quarterly
reporting and projections.
Since a data mart offers analytical capabilities for a restricted area
of the data warehouse,
it offers isolated security and isolated performance.
The most important role of a data mart is business-specific
reporting and analytics.
Data Lakes
A Data Lake is a storage repository that can store large amounts of
structured, semi-structured, and unstructured data in their native
format, classified and tagged with metadata. So, while a data
warehouse stores data processed for a specific need, a data lake is
a pool of raw data where each data element is given a unique
identifier and is tagged with metatags for further use.
You would opt for a data lake if you generate, or have access to,
large volumes of data on an ongoing basis, but don’t want to be
restricted to specific or pre-defined use cases. Unlike data
warehouses, a data lake retains all source data, without any
exclusions.
The data could include all types and sources of data.
Data lakes are sometimes also used as a staging area of a data
warehouse.
The most important role of a data lake is in predictive and
advanced analytics.
Tools for batch processing include Stitch and Blendo. ETL tools are
increasingly being used for real-time streaming event data as well.
Data Pipeline
It’s common to see the terms ETL and data pipeline used
interchangeably. And although both move data from source to
destination, data pipeline is a broader term that encompasses the
entire journey of moving data from the source to a destination, with
the ETL process as one possible part of it.
Foundations of Big Data
Big Data
In this digital world, everyone leaves a trace. From our travel habits
to our workouts and entertainment, the increasing number of
internet-connected devices that we interact with on a daily basis
record vast amounts of data about us. There’s even a name for it: Big
Data.
Ernst and Young offers the following definition:
“Big data refers to the dynamic, large, and disparate volumes of data
being created by people, tools, and machines. It requires new,
innovative and scalable technology to collect, host, and analytically
process the vast amount of data gathered in order to drive real-time
business insights that relate to consumers, risk, profit, performance,
productivity management, and enhanced shareholder value.”
There is no one definition of big data but there are certain elements
that are common across the different definitions, such as
velocity, volume, variety, veracity, and value. These are the V's of big
data.
Velocity
Velocity is the speed at which data accumulates. Data is being
generated extremely fast in a process that never stops. Near or real-
time streaming, local, and cloud-based technologies can process
information very quickly.
Volume
Volume is the scale of the data or the increase in the amount of data
stored. Drivers of volume are the increase in data sources, higher
resolution sensors, and scalable infrastructure.
Variety
Variety is the diversity of the data. Structured data fits neatly into
rows and columns in relational databases, while unstructured data is
not organized in a predefined way like tweets, blog posts, pictures,
numbers, and video. Variety also reflects that data comes from
different sources; machines, people, and processes, both internal
and external to organizations. Drivers are mobile technologies,
social media, wearable technologies, geo technologies, video, and
many, many more.
Veracity
Veracity is the quality and origin of data and its
conformity to facts and accuracy. Attributes include consistency,
completeness, integrity, and ambiguity. Drivers include cost and the
need for traceability. With the large amount of data available, the
debate rages on about the accuracy of data in the digital age. Is the
information real or is it false?
Value
Value is our ability and need to turn data into value. Value isn't just
profit. It may have medical or social benefits, as well as customer,
employee or personal satisfaction. The main reason that people
invest time to understand big data is to derive value from it.
Let's look at some examples of the V's in action.
Velocity.
Every 60 seconds, hours of footage are uploaded to
YouTube, which is generating data. Think about how quickly data
accumulates over hours, days, and years.
Volume.
The world population is approximately 7 billion people and
the vast majority are now using digital devices. Mobile phones,
desktop and laptop computers, wearable devices, and so on. These
devices all generate, capture, and store data approximately 2.5
quintillion bytes every day. That's the equivalent of 10 million Blu-ray
DVDs.
Variety.
Variety. Let's think about the different types of data. Text, pictures,
film, sound, health data from wearable devices, and many different
types of data from devices connected to the internet of things.
Veracity
Eighty percent of data is considered to be unstructured and
we must devise ways to produce reliable and accurate insights. The
data must be categorized, analyzed, and visualized.
Data Scientists
Data scientists today derive insights from big data and cope with
the challenges that these massive data sets present. The scale of the
data being collected means that it’s not feasible to use conventional
data analysis tools. However, alternative tools that
leverage distributed computing power can overcome this
problem. Tools such as Apache Spark, Hadoop, and its
ecosystem provide ways to extract, load, analyze, and process the
data across distributed compute resources, providing new
insights and knowledge.
This gives organizations more ways to connect with their customers
and enrich the services they offer. So next time you strap on your
smartwatch, unlock your smartphone, or track your workout,
remember your data is starting a journey that might take it all the
way around the world, through big data analysis and back to you.
In this video, we are going to talk about three open-source
technologies and the role they play in big data analytics:
1. Apache Hadoop,
2. Apache Hive, and
3. Apache Spark.
Apache Hadoop
Hadoop is a collection of tools that provides distributed storage and
processing of big data.
Apache Hive,
Hive is a data warehouse for data query and analysis built on top of
Hadoop.
Apache Spark.
Spark is a distributed data analytics framework designed to perform
complex data analytics in real-time.
Hadoop
Hadoop, a Java-based open-source framework, allows distributed
storage and processing of large datasets across clusters of
computers. In a Hadoop distributed system, a node is a single
computer, and a collection of nodes forms a cluster. Hadoop can
scale up from a single node to any number of nodes, each offering
local storage and computation. Hadoop provides a reliable, scalable,
and cost-effective solution for storing data with no format
requirements.
Benefits include the following. Using Hadoop, you can:
Incorporate emerging data formats, such
as streaming audio, video, social media sentiment, and
clickstream data, along with structured, semi-structured, and
unstructured data not traditionally used in a data warehouse.
Provide real-time, self-service access for all stakeholders.
Optimize and streamline costs in your enterprise data warehouse
by consolidating data across the organization and moving “cold”
data, that is, data that is not in frequent use, to a Hadoop-based
system.
HDFS
The Hadoop Distributed File System (HDFS) provides scalable and
reliable big data storage by partitioning files over multiple nodes.
It splits large files across multiple computers, allowing parallel
access to them. Computations can, therefore, run in parallel on
each node where data is stored.
It also replicates file blocks on different nodes to prevent data
loss, making it fault-tolerant.
This means a computation does not need the blocks from every
server in the cluster.
HDFS also replicates these smaller pieces onto two additional servers
by default, ensuring availability when a server fails. In addition to
higher availability, this offers multiple benefits. It allows the Hadoop
cluster to break up work into smaller chunks and run those jobs
on all servers in the cluster for better scalability. Finally, you gain the
benefit of data locality, which is the process of moving the
computation closer to the node on which the data resides. This is
critical when working with large data sets because it minimizes
network congestion and increases throughput.
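As a toy illustration in plain Python (not the real HDFS
implementation; the block size and node names are made up), splitting
a file into blocks and replicating each block across several nodes
looks conceptually like this:

# Split a file into fixed-size blocks and place each block on
# several nodes, mimicking HDFS-style replication.
BLOCK_SIZE = 4   # bytes here; HDFS uses much larger blocks (e.g. 128 MB)
REPLICATION = 3  # HDFS keeps three copies of each block by default

data = b"abcdefghijklmnop"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

nodes = ["node1", "node2", "node3", "node4"]
placement = {
    idx: [nodes[(idx + r) % len(nodes)] for r in range(REPLICATION)]
    for idx in range(len(blocks))
}
print(placement)  # each block lives on three different nodes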
Some of the other benefits that come from using HDFS include:
Fast recovery from hardware failures, because HDFS is built to
detect faults and automatically recover.
Access to streaming data, because HDFS supports high data
throughput rates.
Accommodation of large data sets, because HDFS can scale to
hundreds of nodes, or computers, in a single cluster.
Portability, because HDFS is portable across multiple hardware
platforms and compatible with a variety of underlying operating
systems.
Hive
Hive is an open-source data warehouse software for reading, writing,
and managing large data set files that are stored directly in either
HDFS or other data storage systems such as Apache HBase.
Hadoop is intended for long sequential scans and, because Hive is
based on Hadoop, queries have very high latency—which means Hive
is less appropriate for applications that need very fast response
times.
Hive is better suited for data warehousing tasks such as ETL,
reporting, and data analysis and includes tools that enable easy
access to data via SQL.
Apache Spark
This brings us to Spark, a general-purpose data processing engine
designed to extract and process large volumes of data for a wide
range of applications,
including
Interactive Analytics,
Streams Processing,
Machine Learning,
Data Integration, and
ETL.
Key attributes:
It takes advantage of in-memory processing to significantly
increase the speed of computations, spilling to disk only when
memory is constrained.
Spark has interfaces for major programming languages, including
Java, Scala, Python, R, and SQL.
It can run using its standalone clustering technology as well as on
top of other infrastructures such as Hadoop. And
it can access data in a large variety of data sources, including HDFS
and Hive, making it highly versatile.
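A minimal PySpark sketch, assuming the pyspark package is installed
(the app name and sensor data are hypothetical):

from pyspark.sql import SparkSession

# Spark distributes this computation across whatever cluster
# (or local cores) the session is attached to.
spark = SparkSession.builder.appName("sketch").getOrCreate()

df = spark.createDataFrame(
    [("sensor-1", 21.5), ("sensor-2", 19.0), ("sensor-1", 22.1)],
    ["device", "temperature"],
)
# A declarative query; Spark plans and runs it in parallel, in memory.
df.groupBy("device").avg("temperature").show()

spark.stop()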
The ability to process streaming data fast and perform complex
analytics in real-time is the key use case for Apache Spark.
Data Marts, which are sub-sections of the data warehouse, built
for a specific business function or use case.
Data Lakes, that serve as storage repositories for large
amounts of structured, semi-structured, and unstructured data
in their native format.
Big Data Stores, that provide distributed computational and
storage infrastructure to store, scale, and process very large
data sets.
ETL, or Extract, Transform, and Load, is an automated process that
converts raw data into analysis-ready data by the following steps
(a toy sketch follows the list):
Extracting data from source locations.
Transforming raw data by cleaning, enriching, standardizing,
and validating it.
Loading the processed data into a destination system or data
repository.
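A toy end-to-end run of those three steps in Python, where the
"source" is an in-memory CSV string and the "destination" is a plain
list standing in for a data repository:

import csv
import io

raw = "name,amount\nada,100\nlin,\n"

def extract(text):
    # Extract: pull raw records out of the source.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: standardize names and drop records that fail validation.
    return [
        {"name": r["name"].title(), "amount": int(r["amount"])}
        for r in rows
        if r["amount"]
    ]

def load(rows, destination):
    # Load: write the processed records into the destination.
    destination.extend(rows)

warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)  # [{'name': 'Ada', 'amount': 100}]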
Data Pipeline, sometimes used interchangeably with
ETL, encompasses the entire journey of moving data from the source
to a destination data lake or application, using the ETL process.
Big Data refers to the vast amounts of data that is being produced
each moment of every day, by people, tools, and machines. The
sheer velocity, volume, and variety of data challenge the tools and
systems used for conventional data. These challenges led to the
emergence of processing tools and platforms designed specifically
for Big Data, such as Apache Hadoop, Apache Hive, and Apache
Spark.
Practice Quiz
Question 1: Structured Query Language, or SQL, is the standard
querying language for what type of data repository?
Answer: RDBMS
SQL is the standard querying language for RDBMSs.
Question 2: In use cases for RDBMS, what is one of the reasons that
relational databases are so well suited for OLTP applications?
Answer: Support for the ability to insert, update, or delete small
amounts of data.
This is one of the abilities of RDBMSs that make them very well suited
for OLTP applications.
Question 3: Which NoSQL database type stores each record and its
associated data within a single document and also works well
with analytics platforms?
Answer: Document-based
Document-based NoSQL databases store each record and its
associated data within a single document and work well with
Analytics platforms.
Question 4: What type of data repository is used to isolate a subset
of data for a particular business function, purpose, or community of
users?
Answer: Data Mart
A data mart is a sub-section of the data warehouse used to isolate a
subset of data for a particular business function, purpose, or
community of users.
Question 5: What does the attribute “Velocity” imply in the context
of Big Data?
Answer: The speed at which data accumulates.
Velocity, in the context of Big Data, is the speed at which data
accumulates.
Question 6: Which of the Big Data processing tools provides
distributed storage and processing of Big Data?
Answer: Hadoop
Hadoop, a Java-based open-source framework, allows distributed
storage and processing of large datasets across clusters of
computers.
Graded Quiz
Question 1: Data Marts and Data Warehouses have typically been
relational, but the emergence of what technology has helped to let
these be used for non-relational data?
Answer: NoSQL
The emergence of NoSQL technology has made it possible for data
marts and data warehouses to be used for both relational and non-
relational data.
Question 3: Which NoSQL database type uses a graphical model to
represent and store data?
Answer: Graph-based.
Graph-based NoSQL databases use a graphical model to represent
and store data and are used for visualizing, analyzing, and finding
connections between different pieces of data.
Question 4: Which of the data repositories serves as a pool of raw
data and stores large amounts of structured, semi-structured, and
unstructured data in their native formats?
Answer: Data Lakes.
A Data Lake can store large amounts of structured, semi-structured,
and unstructured data in their native format, classified and tagged
with metadata.
Question 5: What does the attribute “Veracity” imply in the context
of Big Data?
Answer: Accuracy and conformity of data to facts.
Veracity, in the context of Big Data, refers to the accuracy and
conformity of data to facts.
Question 6: Apache Spark is a general-purpose data
processing engine designed to extract and process Big Data for a
wide range of applications. What is one of its key use cases?
Answer: Perform complex analytics in real-time.
Spark is a general-purpose data processing engine used for
performing complex data analytics in real-time.