
Module 2

Overview of Data Repositories
A data repository is a general term used to refer to data that has
been collected, organized, and isolated so that it can be used for
business operations or mined for reporting and data analysis.


It can be a small or large database infrastructure with one or more databases that collect, manage, and store data sets.
In this video, we will provide an overview of the different types of repositories your data might reside in, such as:
1. Databases,
2. Data warehouses, and
3. Big data stores.

Databases.
Let’s begin with databases.
A database is a collection of data, or information, designed for the
input, storage, search and retrieval, and modification of data.

And a Database Management System, or DBMS, is a set of programs that creates and maintains the database. It allows you to store, modify, and extract information from the database using a function called querying.
For example, if you want to find customers who have been inactive for six months or more, the query function of the database management system will retrieve the records of all customers in the database that have been inactive for six months or more.
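The following is a minimal sketch of how such a query might look, using Python's built-in sqlite3 module; the customers table, its columns, and the sample rows are illustrative, not from the course.

```python
import sqlite3

# Hypothetical "customers" table with a "last_active" date column;
# all names and values here are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER, name TEXT, last_active TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Acme Co", "2020-01-15"),
    (2, "Binary Ltd", "2024-11-01"),
])

# Ask the DBMS for customers inactive for six months or more.
cur.execute(
    "SELECT id, name FROM customers "
    "WHERE last_active <= date('now', '-6 months')"
)
print(cur.fetchall())
conn.close()
```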

Even though a database and DBMS mean different things, the terms are often used interchangeably.
There are different types of databases.
Several factors influence the choice of database, such as the
 Data type and structure,
 Querying mechanisms,
 Latency requirements,
 Transaction speeds, and
 Intended use of the data.

It’s important to mention two main types of databases here:
1. Relational databases, and
2. Non-relational databases.
Relational databases
 Relational databases, also referred to as RDBMSes, build on the organizational principles of flat files, with data organized into a tabular format with rows and columns following a well-defined structure and schema.
 However, unlike flat files, RDBMSes are optimized for data operations and querying involving many tables and much larger data volumes.

 Structured Query Language, or SQL, is the standard querying language for relational databases.
Non-relational databases
Then we have non-relational databases, also known as NoSQL, or
“Not Only SQL”.
 Non-relational databases emerged in response to the volume,
diversity, and speed at which data is being generated today,
mainly influenced by advances in cloud computing, the Internet of
Things, and social media proliferation.
 Built for speed, flexibility, and scale, non-relational databases made it possible to store data in a schema-less or free-form fashion.
 NoSQL is widely used for processing big data.

Data Warehouse.
A data warehouse works as a central repository that merges
information coming from disparate sources and consolidates it
through the extract, transform, and load process, also known as the
ETL process, into one comprehensive database for analytics and
business intelligence.

At a very high level, the ETL process helps you to:
 extract data from different data sources,
 transform the data into a clean and usable state, and
 load the data into the enterprise’s data repository.
Related to Data Warehouses are the concepts of Data Marts and
Data Lakes, which we will cover later. Data
Marts and Data Warehouses have historically been relational, since
much of the traditional enterprise data has resided in RDBMSes.
However, with the emergence of NoSQL technologies and new
sources of data, non-relational data repositories are also now being
used for Data Warehousing.

Big Data Stores
Another category of data repositories is Big Data Stores, which include distributed computational and storage infrastructure to store, scale, and process very large data sets.

Summary
Overall, data repositories help to isolate data and make reporting
and analytics more efficient and credible while also serving as a data
archive.

RDBMS (Relational Database Management Systems)

A relational database is a collection of data organized into a table
structure, where the tables can be linked, or related, based on data
common to each. Tables are made of rows and columns, where rows
are the “records”, and the columns the “attributes”.

Let’s take the example of a customer table that maintains data about
each customer in a company. The columns, or attributes, in the
customer table are the Company ID, Company Name, Company
Address, and Company Primary Phone; and Each row is a customer
record.

Now let’s understand what we mean by tables being linked, or related, based on data common to each.
Along with the customer table, the company also maintains
transaction tables that contain data describing multiple individual
transactions pertaining to each customer. The columns for the
transaction table might include the Transaction Date, Customer
ID, Transaction Amount, and Payment Method. The customer table
and the transaction tables can be related based on the common
Customer ID field. You can query the customer table to produce
reports such as a customer statement that consolidates all
transactions in a given period.
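As a minimal sketch of this idea, assuming hypothetical customer and transaction tables like the ones described above (all names and values are illustrative), a single SQL join can produce the consolidated statement:

```python
import sqlite3

# Two related tables sharing a common Customer ID field.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, company_name TEXT)")
cur.execute("CREATE TABLE transactions (txn_date TEXT, customer_id INTEGER, amount REAL, payment_method TEXT)")
cur.execute("INSERT INTO customer VALUES (101, 'Acme Co')")
cur.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", [
    ("2024-01-05", 101, 250.00, "card"),
    ("2024-02-11", 101, 90.50, "cash"),
])

# One query relates both tables on Customer ID and consolidates all
# transactions for a given period into a "customer statement".
cur.execute("""
    SELECT c.company_name, t.txn_date, t.amount
    FROM customer c
    JOIN transactions t ON t.customer_id = c.customer_id
    WHERE t.txn_date BETWEEN '2024-01-01' AND '2024-03-31'
""")
print(cur.fetchall())
```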

This capability of relating tables based on common data enables you
to retrieve an entirely new table from data in one or more tables
with a single query.

It also allows you to understand the relationships among all available data and gain new insights for making better decisions.

Relational databases build on the organizational principles of flat files such as spreadsheets, with data organized into rows and columns following a well-defined structure and schema.

But this is where the similarity ends.
 Relational databases, by design, are ideal for the optimized storage, retrieval, and processing of large volumes of data, unlike spreadsheets, which have a limited number of rows and columns.
 Each table in a relational database has a unique set of rows and columns, and relationships can be defined between tables, which minimizes data redundancy.
 Moreover, you can restrict database fields to specific data types and values, which minimizes irregularities and leads to greater consistency and data integrity.
 Relational databases use SQL for querying data, which gives you
the advantage of processing millions of records and retrieving
large amounts of data in a matter of seconds.
 Moreover, the security architecture of relational databases
provides controlled access to data and also ensures that the
standards and policies for governing data can be enforced.

Relational databases range from small desktop systems to massive
cloud-based systems. They can be either:
 open-source and internally supported,
 open-source with commercial support, or
 commercial closed-source systems.
IBM DB2, Microsoft SQL Server, MySQL, Oracle Database, and
PostgreSQL are some of the popular relational databases.
Cloud-based relational databases, also referred to as Database-as-a-
Service, are gaining wide use as they have access to the limitless
compute and storage capabilities offered by the cloud.
Some of the popular cloud relational databases include Amazon
Relational Database Service (RDS), Google Cloud SQL, IBM DB2 on
Cloud, Oracle Cloud, and SQL Azure.
RDBMS is a mature and well-documented technology, making it easy
to learn and find qualified talent.

One of the most significant advantages of the relational database approach is its ability to create meaningful information by joining tables.
Some of its other advantages include:
 Flexibility: Using SQL, you can add new columns, add new
tables, rename relations, and make other changes while the
database is running and queries are happening.
 Reduced redundancy: Relational databases minimize data
redundancy. For example, the information of a customer
appears in a single entry in the customer table, and the
transaction table pertaining to the customer stores a link to the
customer table.
 Ease of backup and disaster recovery: Relational databases
offer easy export and import options, making backup and
restore easy. Exports can happen while the database is running,
making restore on failure easy.
Cloud-based relational databases do continuous mirroring, which
means the loss of data on restore can be measured in seconds or
less.

 ACID compliance: ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID compliance implies that the data in the database remains accurate and consistent despite failures, and that database transactions are processed reliably, as the sketch below illustrates.
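Below is a minimal sketch of atomicity, the "A" in ACID, again using Python's sqlite3 module; the accounts table and the simulated failure are illustrative:

```python
import sqlite3

# Illustrative "accounts" table for a money transfer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE id = 1")
        raise RuntimeError("simulated failure mid-transfer")
        # the matching credit never runs:
        # conn.execute("UPDATE accounts SET balance = balance + 40 WHERE id = 2")
except RuntimeError:
    pass

# The debit was rolled back, so the data remains accurate and consistent.
print(conn.execute("SELECT id, balance FROM accounts").fetchall())
# -> [(1, 100.0), (2, 0.0)]
```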

Now we’ll look at some use cases for relational databases:


 Online Transaction Processing: OLTP applications are focused
on transaction-oriented tasks that run at high rates.
Relational databases are well suited for OLTP applications
because
 they can accommodate a large number of users;
 they support the ability to insert, update, or delete
small amounts of data; and
 they also support frequent queries and updates as
well as fast response times.
 Data warehouses: In a data warehousing environment, relational
databases can be optimized for online analytical processing (or
OLAP), where historical data is analyzed for business intelligence.
 IoT solutions: Internet of Things (IoT) solutions require speed as
well as the ability to collect and process data from edge devices,
which need a lightweight database solution.
This brings us to the limitations of RDBMS:
 RDBMS does not work well with semi-structured and unstructured
data and is, therefore, not suitable for extensive analytics on such
data.
 For migration between two RDBMSs, the schemas and data types need to be identical between the source and destination tables.
 Relational databases have a limit on the length of data fields,
which means if you try to enter more information into a field than
it can accommodate, the information will not be stored.
Despite the limitations and the evolution of data in these times of big
data, cloud computing, IoT devices, and social media, RDBMS
continues to be the predominant technology for working with
structured data.

NoSQL
NoSQL, which stands for “not only SQL,” or sometimes “non-SQL,” is a
non-relational database design that provides flexible schemas for the
storage and retrieval of data. NoSQL databases have existed for
many years but have only recently become more popular in the era
of cloud, big data, and high-volume web and mobile
applications. They are chosen today for their attributes around scale,
performance, and ease of use.
It's important to emphasize that the "No" in "NoSQL" is an
abbreviation for "not only" and not the actual word "No."

NoSQL databases are built for specific data models and have flexible schemas that allow programmers to create and manage modern applications. They do not use a traditional row/column/table database design with fixed schemas, and they typically do not use the structured query language (or SQL) to query data, although some may support SQL or SQL-like interfaces.

NoSQL allows data to be stored in a schema-less or free-form fashion. Any data, be it structured, semi-structured, or unstructured, can be stored in any record.

Based on the model being used for storing data, there are four common types of NoSQL databases:
1. Key-value store,
2. Document-based,
3. Column-based, and
4. Graph-based.
Key-value store.
 Data in a key-value database is stored as a collection of key-value
pairs.
 The key represents an attribute of the data and is a unique
identifier.
 Both keys and values can be anything from simple integers or
strings to complex JSON documents.
 Key-value stores are great for storing user session data and user
preferences, making real-time recommendations and targeted
advertising, and in-memory data caching.
However, a key-value store may not be the best fit if you:
 want to be able to query the data on specific data values,
 need relationships between data values, or
 need to have multiple unique keys.
Key-value store tools:
Redis, Memcached, and DynamoDB are some well-known examples in this category.
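A brief sketch of the key-value pattern, assuming the redis-py client and a Redis server running locally (key names and values are illustrative):

```python
import redis  # assumes the redis-py package and a local Redis server

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store user session data and preferences under unique keys.
r.set("session:1001", "logged_in")
r.set("prefs:1001:theme", "dark")

# Expire the session key after 30 minutes: typical in-memory caching.
r.expire("session:1001", 1800)

print(r.get("prefs:1001:theme"))  # -> "dark"
```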

Document-based:
 Document databases store each record and its associated data
within a single document.

 They enable flexible indexing, powerful ad hoc queries, and
analytics over collections of documents.
 Document databases are preferable for eCommerce platforms,
medical records storage, CRM platforms, and analytics platforms.
However, if you’re looking to run complex search queries and multi-operation transactions, a document-based database may not be the best option for you.
MongoDB, DocumentDB, CouchDB, and Cloudant are some of the popular document-based databases.
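A brief sketch of the document pattern, assuming the pymongo client and a MongoDB server running locally (the database, collection, and fields are illustrative):

```python
from pymongo import MongoClient  # assumes pymongo and a local MongoDB server

client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]

# Each record and its associated data live within a single document.
db.orders.insert_one({
    "customer": "Acme Co",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    "status": "shipped",
})

# Flexible ad hoc query: shipped orders that contain SKU "A1".
for order in db.orders.find({"status": "shipped", "items.sku": "A1"}):
    print(order["customer"])
```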

Column-based:
 Column-based models store data in cells grouped as columns of
data instead of rows.
 A logical grouping of columns, that is, columns that are usually
accessed together, is called a column family. For example, a
customer’s name and profile information will most likely be
accessed together but not their purchase history. So, customer
name and profile information data can be grouped into a column
family.
 Since column databases store all cells corresponding to a column
as a continuous disk entry, accessing and searching the data
becomes very fast.
 Column databases can be great for systems that require heavy
write requests, storing time-series data, weather data, and IoT
data.
However, if you need to use complex queries or change your querying patterns frequently, this may not be the best option for you.
The most popular column databases are Cassandra and HBase.
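A brief sketch of the column-family pattern, assuming the cassandra-driver package, a local Cassandra node, and a pre-created keyspace (all names here are illustrative):

```python
from datetime import datetime
from cassandra.cluster import Cluster  # assumes cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("iot")  # keyspace "iot" assumed to exist

# Time-series style table: all readings for one sensor are stored and
# clustered together, so scans over that column family are fast.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text, ts timestamp, temperature double,
        PRIMARY KEY (sensor_id, ts))
""")
session.execute(
    "INSERT INTO readings (sensor_id, ts, temperature) VALUES (%s, %s, %s)",
    ("sensor-42", datetime(2024, 3, 1, 12, 0), 21.5),
)
for row in session.execute(
    "SELECT ts, temperature FROM readings WHERE sensor_id = %s", ("sensor-42",)
):
    print(row.ts, row.temperature)
```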

Graph-based:
 Graph-based databases use a graphical model to represent and
store data.
 They are particularly useful for visualizing, analyzing, and finding
connections between different pieces of data.

In a graph model, the circles are nodes, and they contain the data. The arrows represent relationships. Graph databases are an excellent choice for working with connected data, which is data that contains lots of interconnected relationships.
Graph databases are great for social networks, real-time product
recommendations, network diagrams, fraud detection, and access
management.
However, if you want to process high volumes of transactions, a graph database may not be the best choice for you, because graph databases are not optimized for large-volume analytics queries.
Neo4J and Cosmos DB are some of the more popular graph
databases.
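To make "finding connections" concrete, here is a library-free Python sketch of the graph idea: nodes hold data, edges hold relationships, and a query traverses connections (in a graph database such as Neo4j this kind of question would be expressed as a query; the names below are illustrative):

```python
from collections import deque

# Nodes and their outgoing relationships, e.g. "follows" in a social network.
edges = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave"],
    "Dave": ["Erin"],
    "Erin": [],
}

def connection_path(start, goal):
    """Breadth-first search for the shortest chain of relationships."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in edges.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(connection_path("Alice", "Erin"))  # -> ['Alice', 'Bob', 'Dave', 'Erin']
```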

Advantages of NoSQL
NoSQL was created in response to the limitations of traditional
relational database technology.
 The primary advantage of NoSQL is its ability to handle large
volumes of structured, semi-structured, and unstructured data.
Some of its other advantages include:
 The ability to run as distributed systems scaled across multiple
data centres, which enables them to take advantage of cloud
computing infrastructure;
 An efficient and cost-effective scale-out architecture that provides
additional capacity and performance with the addition of new
nodes;
 Simpler design, better control over availability, and improved scalability that enable you to be more agile, more flexible, and to iterate more quickly.
To summarize the key differences between relational and non-relational databases:

Relational databases:
 RDBMS schemas rigidly define how all data inserted into the database must be typed and composed.
 Maintaining high-end, commercial relational database management systems can be expensive.
 Support ACID compliance, which ensures reliability of transactions and crash recovery.
 A mature and well-documented technology, which means the risks are more or less perceivable.

Non-relational databases:
 NoSQL databases can be schema-agnostic, allowing unstructured and semi-structured data to be stored and manipulated.
 Specifically designed for low-cost commodity hardware.
 Most NoSQL databases are not ACID compliant.
 A relatively newer technology.

Data Marts, Data Lakes, ETL, and Data Pipelines
Earlier in the course, we examined databases, data warehouses, and
big data stores.
Now we’ll go a little deeper in our exploration of data warehouses,
data marts, and data lakes; and also learn about the ETL process and
data pipelines.

Data Warehouses.
A data warehouse works like a multi-purpose storage for different
use cases. By the time the data comes into the warehouse, it has
already been modelled and structured for a specific purpose,
meaning it is analysis ready. As an organization, you would opt for a
data warehouse when you have massive amounts of data from your
operational systems that needs to be readily available for reporting
and analysis.
Data warehouses serve as the single source of truth—storing current
and historical data that has been cleansed, conformed, and
categorized.
A data warehouse is a multi-purpose enabler of operational and
performance analytics.

Data Marts.
A data mart is a sub-section of the data warehouse, built specifically for a particular business function, purpose, or community of users. The idea is to provide stakeholders data that is most relevant to them, when they need it. For example, the sales or finance teams accessing data for their quarterly reporting and projections.
 Since a data mart offers analytical capabilities for a restricted area
of the data warehouse,
 it offers isolated security and isolated performance.
 The most important role of a data mart is business-specific
reporting and analytics.

Data Lakes
A Data Lake is a storage repository that can store large amounts of
structured, semi-structured, and unstructured data in their native
format, classified and tagged with metadata. So, while a data
warehouse stores data processed for a specific need, a data lake is a pool of raw data where each data element is given a unique identifier and is tagged with metatags for further use.
 You would opt for a data lake if you generate, or have access to,
large volumes of data on an ongoing basis, but don’t want to be
restricted to specific or pre-defined use cases.
 Unlike data warehouses, a data lake retains all source data, without any exclusions, and the data could include all types of data sources and types.
Data lakes are sometimes also used as a staging area of a data
warehouse.
 The most important role of a data lake is in predictive and
advanced analytics.


Now we come to the process that is at the heart of gaining value from data: the Extract, Transform, and Load process, or ETL.
ETL is how raw data is converted into analysis-ready data. It is an
automated process in which you
 gather raw data from identified sources,
 extract the information that aligns with your reporting and
analysis needs,
 clean, standardize, and transform that data into a format that is
usable in the context of your organization;
 and load it into a data repository.
While ETL is a generic process, the actual job can be very different in
usage, utility, and complexity.
Extract is the step where data from source locations is collected for
transformation.
Data extraction could be through:
 Batch processing, meaning source data is moved in large chunks from the source to the target system at scheduled intervals.

 Tools for batch processing include Stitch and Blendo.

 Stream processing, which means source data is pulled in real-time from the source and transformed while it is in transit and before it is loaded into the data repository.
 Tools for stream processing include Apache Samza, Apache
Storm, and Apache Kafka.
Transform involves the execution of rules and functions that convert raw data into data that can be used for analysis.
For example,
 making date formats and units of measurement consistent across
all sourced data,
 removing duplicate data,
 filtering out data that you do not need,
 enriching data, for example, splitting a full name into first, middle, and last names,
 establishing key relationships across tables,
 applying business rules and data validations.
Load is the step where processed data is transported to a destination system or data repository. It could be:
 Initial loading, that is, populating all the data in the repository;
 Incremental loading, that is, applying ongoing updates and modifications as needed periodically; or
 Full refresh, that is, erasing the contents of one or more tables and reloading them with fresh data.
Load verification is an important part of this process step and includes checks for:
 missing or null values,
 server performance, and
 load failures.
It is vital to keep an eye on load failures and ensure the right
recovery mechanisms are in place.
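The following toy end-to-end run pulls the extract, transform, and load steps together in plain Python; the file name, columns, and transformation rules are all illustrative, not from the course:

```python
import csv
import sqlite3
from pathlib import Path

# Illustrative raw source: inconsistent dates, a duplicate, full names.
Path("sales.csv").write_text(
    "date,amount,name\n"
    "01/02/2024,100,Ada Lovelace\n"
    "01/02/2024,100,Ada Lovelace\n"
    "03/04/2024,250,Grace Hopper\n"
)

# Extract: gather raw records from the source.
with open("sales.csv", newline="") as f:
    raw = list(csv.DictReader(f))

# Transform: standardize the date format, remove duplicates, enrich by
# splitting the full name into first and last names.
seen, clean = set(), []
for row in raw:
    key = tuple(row.values())
    if key in seen:
        continue
    seen.add(key)
    day, month, year = row["date"].split("/")
    first, last = row["name"].split(" ", 1)
    clean.append((f"{year}-{month}-{day}", float(row["amount"]), first, last))

# Load: move the processed rows into the target repository.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (date TEXT, amount REAL, first TEXT, last TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", clean)
conn.commit()

# Load verification: a simple check for missing values.
assert conn.execute("SELECT COUNT(*) FROM sales WHERE date IS NULL").fetchone()[0] == 0
```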

ETL has historically been used for batch workloads on a large scale. However, with the emergence of streaming ETL tools, they are increasingly being used for real-time streaming event data as well.

Data Pipeline
It’s common to see the terms ETL and data pipelines used interchangeably. And although both move data from source to destination, data pipeline is a broader term that encompasses the entire journey of moving data from one system to another, of which ETL is a subset.
 Data pipelines can be architected for batch processing, for streaming data, or for a combination of batch and streaming data.
In the case of streaming data, data processing, or transformation, happens in a continuous flow. This is particularly useful for data that needs constant updating, such as data from a sensor monitoring traffic. A data pipeline is a high-performing system that supports both long-running batch queries and smaller interactive queries.
 The destination for a data pipeline is typically a data lake,
although the data may also be loaded to different target
destinations, such as another application or a visualization tool.
 There are a number of data pipeline solutions available, most
popular among them being Apache Beam and Dataflow.
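As a minimal sketch of the pipeline idea, not tied to any particular tool, Python generators can model a stream in which transformation happens in a continuous flow between source and destination (the sensor, fields, and threshold are illustrative):

```python
import random
import time

def sensor_source(n):
    """Simulated streaming source, e.g. a sensor monitoring traffic."""
    for _ in range(n):
        yield {"vehicles": random.randint(0, 20), "ts": time.time()}

def transform(events):
    """Transformation happens record by record, while data is in transit."""
    for event in events:
        event["congested"] = event["vehicles"] > 15
        yield event

def load(events, sink):
    """Destination could be a data lake, another application, or a dashboard."""
    for event in events:
        sink.append(event)

lake = []
load(transform(sensor_source(5)), lake)
print(len(lake), "events landed in the data lake")
```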

Foundations of Big Data
Big Data
In this digital world, everyone leaves a trace. From our travel habits to our workouts and entertainment, the increasing number of internet-connected devices that we interact with on a daily basis record vast amounts of data about us. There's even a name for it: Big Data.

Ernst and Young offers the following definition:
“Big data refers to the dynamic, large, and disparate volumes of data being created by people, tools, and machines. It requires new, innovative and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to drive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.” (Ernst and Young)

There is no one definition of big data, but there are certain elements that are common across the different definitions, such as velocity, volume, variety, veracity, and value. These are the V's of big data.

Velocity
Velocity is the speed at which data accumulates. Data is being
generated extremely fast in a process that never stops. Near or real-
time streaming, local, and cloud-based technologies can process
information very quickly.
Volume

Volume is the scale of the data or the increase in the amount of data
stored. Drivers of volume are the increase in data sources, higher
resolution sensors, and scalable infrastructure.
Variety
Variety is the diversity of the data. Structured data fits neatly into rows and columns in relational databases, while unstructured data is not organized in a predefined way, like tweets, blog posts, pictures, numbers, and video. Variety also reflects that data comes from different sources: machines, people, and processes, both internal and external to organizations. Drivers are mobile technologies, social media, wearable technologies, geo technologies, video, and many, many more.
Veracity
Veracity is the quality and origin of data and its conformity to facts and accuracy. Attributes include consistency, completeness, integrity, and ambiguity. Drivers include cost and the need for traceability. With the large amount of data available, the debate rages on about the accuracy of data in the digital age. Is the information real, or is it false?
Value
Value is our ability and need to turn data into value. Value isn't just
profit. It may have medical or social benefits, as well as customer,
employee or personal satisfaction. The main reason that people
invest time to understand big data is to derive value from it.

Let's look at some examples of the V's in action.
Velocity
Every 60 seconds, hours of footage are uploaded to YouTube, which is generating data. Think about how quickly data accumulates over hours, days, and years.
Volume
The world population is approximately 7 billion people and the vast majority are now using digital devices: mobile phones, desktop and laptop computers, wearable devices, and so on. These devices all generate, capture, and store data, approximately 2.5 quintillion bytes every day. That's the equivalent of 10 million Blu-ray discs.
Variety
Let's think about the different types of data: text, pictures, film, sound, health data from wearable devices, and many different types of data from devices connected to the Internet of Things.
Veracity
Eighty percent of data is considered to be unstructured, and we must devise ways to produce reliable and accurate insights. The data must be categorized, analyzed, and visualized.

Data Scientists
Data scientists, today, derive insights from big data and cope with
the challenges that these massive data sets present. The scale of the
data being collected means that it's not feasible to use conventional data analysis tools; however, alternative tools that leverage distributed computing power can overcome this problem. Tools such as Apache Spark, Hadoop, and its ecosystem provide ways to extract, load, analyze, and process the data across distributed compute resources, providing new insights and knowledge.

This gives organizations more ways to connect with their customers
and enrich the services they offer. So next time you strap on your
smartwatch, unlock your smartphone, or track your workout,
remember your data is starting a journey that might take it all the
way around the world, through big data analysis and back to you.

Big Data Processing Tools
The Big Data processing technologies provide ways to work with
large sets of structured, semi-structured, and unstructured data so
that value can be derived from big data.
In some of the other videos, we discussed Big Data technologies such as NoSQL databases and Data Lakes.

In this video, we are going to talk about three open-source technologies and the role they play in big data analytics:
1. Apache Hadoop,
2. Apache Hive, and
3. Apache Spark.

Apache Hadoop
Hadoop is a collection of tools that provides distributed storage and
processing of big data.
Apache Hive
Hive is a data warehouse for data query and analysis built on top of
Hadoop.
Apache Spark.
Spark is a distributed data analytics framework designed to perform
complex data analytics in real-time.
Hadoop
Hadoop, a java-based open-source framework, allows distributed
storage and processing of large datasets across clusters of
computers. In a Hadoop distributed system, a node is a single computer, and a collection of nodes forms a cluster. Hadoop can
scale up from a single node to any number of nodes, each offering
local storage and computation. Hadoop provides a reliable, scalable,
and cost-effective solution for storing data with no format
requirements.

Using Hadoop, you can:
 Incorporate emerging data formats, such as streaming audio, video, social media sentiment, and clickstream data, along with structured, semi-structured, and unstructured data not traditionally used in a data warehouse.
 Provide real-time, self-service access for all stakeholders.
 Optimize and streamline costs in your enterprise data warehouse
by consolidating data across the organization and moving “cold”
data, that is, data that is not in frequent use, to a Hadoop-based
system.

Data offload and consolidation: Optimizes and streamlines costs by consolidating data, including cold data, across the organization.

One of the four main components of Hadoop is Hadoop Distributed File System, or HDFS, which is a storage system for big data that runs on multiple commodity hardware connected through a network.

 HDFS provides scalable and reliable big data storage by
partitioning files over multiple nodes.
 It splits large files across multiple computers, allowing parallel
access to them. Computations can, therefore, run in parallel on
each node where data is stored.
 It also replicates file blocks on different nodes to prevent data
loss, making it fault-tolerant.

Let’s understand this through an example. Consider a file that includes phone numbers for everyone in the United States; the numbers for people with last name starting with A might be stored on server 1, B on server 2, and so on.

With Hadoop, pieces of this phonebook would be stored across the cluster. To reconstruct the entire phonebook, your program would need the blocks from every server in the cluster.

HDFS also replicates these smaller pieces onto two additional servers by default, ensuring availability when a server fails. In addition to higher availability, this offers multiple benefits. It allows the Hadoop
cluster to break up work into smaller chunks and run those jobs
on all servers in the cluster for better scalability. Finally, you gain the
benefit of data locality, which is the process of moving the
computation closer to the node on which the data resides. This is
critical when working with large data sets because it minimizes
network congestion and increases throughput.
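Here is a toy model of that behavior in plain Python: a file is split into blocks and each block is placed on three different nodes, mirroring HDFS's default replication factor (the block size and node names are illustrative; real HDFS blocks default to 128 MB):

```python
import itertools

NODES = ["node1", "node2", "node3", "node4", "node5"]
BLOCK_SIZE = 16          # bytes here, purely for illustration
REPLICATION_FACTOR = 3   # HDFS default: one copy plus two replicas

data = b"phonebook entries for everyone in the United States..."
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Place each block on three distinct nodes, so losing a server
# does not lose the block.
node_cycle = itertools.cycle(NODES)
placement = {
    block_id: [next(node_cycle) for _ in range(REPLICATION_FACTOR)]
    for block_id in range(len(blocks))
}

for block_id, replicas in placement.items():
    print(f"block {block_id}: stored on {replicas}")
```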
Some of the other benefits that come from using HDFS include:
 Fast recovery from hardware failures, because HDFS is built to
detect faults and automatically recover.
 Access to streaming data, because HDFS supports high data
throughput rates.
 Accommodation of large data sets, because HDFS can scale to
hundreds of nodes, or computers, in a single cluster.
 Portability, because HDFS is portable across multiple hardware
platforms and compatible with a variety of underlying operating
systems.

Hive
Hive is an open-source data warehouse software for reading, writing,
and managing large data set files that are stored directly in either
HDFS or other data storage systems such as Apache HBase.
Hadoop is intended for long sequential scans and, because Hive is
based on Hadoop, queries have very high latency—which means Hive
is less appropriate for applications that need very fast response
times.

 Hive is not suitable for transaction processing that typically involves a high percentage of write operations.
 Hive is better suited for data warehousing tasks such as ETL,
reporting, and data analysis and includes tools that enable easy
access to data via SQL.

Apache Spark
This brings us to Spark, a general-purpose data processing engine
designed to extract and process large volumes of data for a wide
range of applications,
including
 Interactive Analytics,
 Streams Processing,
 Machine Learning,
 Data Integration, and
 ETL.
Key attributes:
 It takes advantage of in-memory processing to significantly increase the speed of computations, spilling to disk only when memory is constrained.
 Spark has interfaces for major programming languages, including
Java, Scala, Python, R, and SQL.
 It can run using its standalone clustering technology as well as on
top of other infrastructures such as Hadoop. And
 it can access data in a large variety of data sources, including HDFS
and Hive, making it highly versatile.

 The ability to process streaming data fast and perform complex
analytics in real-time is the key use case for Apache Spark.
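A minimal PySpark sketch of that use case, assuming pyspark is installed; the CSV path and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a (possibly very large) dataset; Spark distributes the work across
# the cluster and keeps data in memory where possible.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Analytics expressed through the SQL-like DataFrame API.
(df.groupBy("name")
   .agg(F.sum("amount").alias("total"))
   .orderBy(F.desc("total"))
   .show())

spark.stop()
```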

Summary and Highlights
In this lesson, you have learned the following information:
A Data Repository is a general term that refers to data that has been
collected, organized, and isolated so that it can be used for reporting,
analytics, and also for archival purposes.
The different types of Data Repositories include:
 Databases, which can be relational or non-relational, each differing in the organizational principles they follow, the types of data they can store, and the tools that can be used to query, organize, and retrieve data.
 Data Warehouses, that consolidate incoming data into one
comprehensive storehouse.
 Data Marts, that are essentially sub-sections of a data warehouse, built to isolate data for a particular business function or use case.
 Data Lakes, that serve as storage repositories for large
amounts of structured, semi-structured, and unstructured data
in their native format.
 Big Data Stores, that provide distributed computational and
storage infrastructure to store, scale, and process very large
data sets.
ETL, or Extract, Transform, and Load, is an automated process that converts raw data into analysis-ready data by:
 Extracting data from source locations.
 Transforming raw data by cleaning, enriching, standardizing,
and validating it.
 Loading the processed data into a destination system or data
repository.
Data Pipeline, sometimes used interchangeably with
ETL, encompasses the entire journey of moving data from the source
to a destination data lake or application, using the ETL process.
Big Data refers to the vast amounts of data that is being produced
each moment of every day, by people, tools, and machines. The
sheer velocity, volume, and variety of data challenge the tools and
systems used for conventional data. These challenges led to the
emergence of processing tools and platforms designed specifically
for Big Data, such as Apache Hadoop, Apache Hive, and Apache
Spark.

Practice Quiz
Question 1 :- Structured Query Language, or SQL, is the standard
querying language for what type of data repository?
Answer:- RDBMS

SQL is the standard querying language for RDBMSs.

Question 2 :-In use cases for RDBMS, what is one of the reasons that
relational databases are so well suited for OLTP applications?
Answer:- Support the ability to insert, update, or delete small amounts of data.
This is one of the abilities of RDBMSs that make them very well suited
for OLTP applications.

Question 3:- Which NoSQL database type stores each record and its
associated data within a single document and also works well
with Analytics platforms?
Answer:- Document-based
Document-based NoSQL databases store each record and its
associated data within a single document and work well with
Analytics platforms.
Question 4:-What type of data repository is used to isolate a subset
of data for a particular business function, purpose, or community of
users?
Answer:- Data Mart
A data mart is a sub-section of the data warehouse used to isolate a
subset of data for a particular business function, purpose, or
community of users.
Question 5:-What does the attribute “Velocity” imply in the context
of Big Data?
Answer:- The speed at which data accumulates.

Velocity, in the context of Big Data, is the speed at which data
accumulates.
Question 6:- Which of the Big Data processing tools provides
distributed storage and processing of Big Data?
Answer:-Hadoop
Hadoop, a java-based open-source framework, allows distributed
storage and processing of large datasets across clusters of
computers.

Graded Quiz
Question 1:- Data Marts and Data Warehouses have typically been
relational, but the emergence of what technology has helped to let
these be used for non-relational data?
Answer:- NoSQL
The emergence of NoSQL technology has made it possible for data
marts and data warehouses to be used for both relational and non-
relational data.

Question 2:- What is one of the most significant advantages of an RDBMS?
Answer:- It is ACID-compliant.
ACID compliance is one of the significant advantages of an RDBMS.

Question 3:- Which one of the NoSQL database types uses a graphical model to represent and store data, and is particularly useful for visualizing, analyzing, and finding connections between different pieces of data?

Answer:- Graph-based.
Graph-based NoSQL databases use a graphical model to represent
and store data and are used for visualizing, analyzing, and finding
connections between different pieces of data.
Question 4 :-Which of the data repositories serves as a pool of raw
data and stores large amounts of structured, semi-structured, and
unstructured data in their native formats?
Answer:- Data Lakes.
A Data Lake can store large amounts of structured, semi-structured,
and unstructured data in their native format, classified and tagged
with metadata.
Question 5:- What does the attribute “Veracity” imply in the context
of Big Data?
Answer:- Accuracy and conformity of Data to facts.
Veracity, in the context of Big Data, refers to the accuracy and
conformity of data to facts.
Question 6:- Apache Spark is a general-purpose data
processing engine designed to extract and process Big Data for a
wide range of applications. What is one of its key use cases?
Answer:- Perform Complex analytics in real-time
Spark is a general-purpose data processing engine used for
performing complex data analytics in real-time.
