
Saudi Electronic University

26/12/2021
College of Computing and Informatics
Bachelor of Data Science
DS-231
Introduction to Data Science Programming
DS-231
Introduction to Data Science Programming
Week 3
Tapping into Critical Aspects of Data Engineering
Contents

1. Defining Big Data and the Three Vs
2. Identifying Important Data Sources
3. Grasping the Differences among Data Approaches
4. Storing and Processing Data for Data Science
Weekly Learning Outcomes

1. Unraveling the big data story
2. Looking at important data sources
3. Differentiating data science from data engineering
4. Storing data on-premise or in a cloud
5. Exploring other data engineering solutions
Required Reading

1. Chapter 2: Tapping into Critical Aspects of Data Engineering (Lillian Pierson, Data Science, 3rd Edition, 2021)

Videos

• Sources of Data
https://youtu.be/bbgDaRtKRQE
• What Is Cloud Computing And How It Is Enabling The Big Data Economy
https://youtu.be/SdvF9ohvmpw
Introduction

• Most people aren't aware of what data really is or how it's used to improve people's lives.
• This chapter tells the full story of big data: where it comes from and how it's used.
• We will outline the roles that machine learning engineers, data engineers, and data scientists play in the modern data ecosystem, and how data science is used to improve business performance.
• Fundamental concepts related to storing and processing data for data science will be introduced.
1. Defining Big Data and the Three Vs

Defining Big Data and the Three Vs

• Big data — a term that characterizes data that exceeds the processing capacity of conventional database systems because it's too big, it moves too fast, or it lacks the structural requirements of traditional database architectures.
• To use big data, you need big data storage and processing capabilities, such as those provided by a Hadoop cluster.
Defining Big Data and the Three Vs

• Hadoop is a data processing platform designed to reduce big data into smaller datasets that are more manageable for data scientists to analyze.
• Hadoop is, and always has been, powerful at satisfying one core requirement: batch-processing and storing large volumes of data.
Defining Big Data and the Three Vs

• If companies want to stay competitive, they must be proficient at infusing data insights into their processes and products, as well as their growth and management strategies.
• This is especially true in light of the explosion in digital adoption that occurred as a direct result of the COVID-19 pandemic.
• Whether your data volumes rank on the terabyte or petabyte scale, data-engineered solutions must be designed to meet the requirements of the data's intended destination and use.
Defining Big Data and the Three Vs

• Three characteristics — also called "the three Vs" — define big data: volume, velocity, and variety.
• Because the three Vs of big data are continually expanding, newer, more innovative data technologies must continuously be developed to manage big data problems.
• If you're required to adopt a big data solution to overcome a problem caused by your data's velocity, volume, or variety, you have moved out of the domain of regular data — you have a big data problem on your hands.
Defining Big Data and the Three Vs

Dealing with data Volume

• The lower limit of big data volume starts as low as 1 terabyte, and it has no upper limit. If your organization owns at least 1 terabyte of data, that data technically qualifies as big data.
• Big data is composed of huge numbers of very small transactions that come in a variety of formats.
• Data engineers have the job of aggregating it, and data scientists have the job of analyzing it.
Defining Big Data and the Three Vs

Handling data Velocity

• A lot of big data is created by automated processes.
• In engineering terms, data velocity is data volume per unit time. Big data enters an average system at velocities ranging from about 30 kilobytes (KB) per second to as much as 30 gigabytes (GB) per second (a quick calculation sketch follows).
• High-velocity, real-time moving data presents an obstacle to timely decision-making. The capabilities of data handling and data processing technologies often limit data velocities.
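To make "volume per unit time" concrete, here is a minimal Python sketch that converts an assumed stream rate into the daily volume it implies; the 30 MB/s figure is an illustrative assumption within the range quoted above, not a measurement from a real system.

```python
# A minimal sketch: estimating daily data volume from a stream's velocity.
# The rate below is an illustrative assumption, not a measured figure.

bytes_per_second = 30 * 10**6          # assume roughly 30 megabytes per second
seconds_per_day = 60 * 60 * 24

daily_volume_tb = bytes_per_second * seconds_per_day / 10**12
print(f"Approximate daily intake: {daily_volume_tb:.2f} TB")   # about 2.59 TB per day
```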
Defining Big Data and the Three Vs

Dealing with data Variety

• Big data gets even more complicated when you add unstructured and semistructured data to structured data sources.
• Heterogeneous, high-variety data is often composed of any combination of graph data, JSON files, XML files, social media data, structured tabular data, weblog data, and data generated from user clicks on a web page, known as clickstreams.
Defining Big Data and the Three Vs

Dealing with data Variety

• Structured data can be stored, processed, and manipulated in a traditional relational database management system (RDBMS).
• Unstructured data comes completely unstructured: it's commonly generated from human activities and doesn't fit into a structured database format.
• Semistructured data doesn't fit into a structured database system, but is nonetheless structured by tags that create a form of order and hierarchy in the data, as the sketch below illustrates.
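As a concrete contrast between structured and semistructured data, the minimal Python sketch below stores a structured record in an SQL table and reads a semistructured JSON record by its tags; the table, columns, and JSON fields are hypothetical examples invented for illustration.

```python
# A minimal sketch contrasting structured and semistructured data handling.
# The table, columns, and JSON fields are hypothetical examples.
import sqlite3
import json

# Structured data: fits cleanly into an RDBMS table and is queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('widget', 19.99)")
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print("Total sales:", total)

# Semistructured data: no fixed schema, but keys/tags impose order and hierarchy.
record = json.loads('{"user": "abdu", "clicks": [{"page": "/home", "ts": 1}]}')
print("First click:", record["clicks"][0]["page"])
```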
2. Identifying Important Data Sources

Identifying Important Data Sources

• Vast volumes of data are continually generated by humans, machines, and sensors everywhere.
• Typical sources include data from social media, financial transactions, health records, clickstreams, log files, and the Internet of Things.

FIGURE 2-1: Popular sources of big data.
3. Grasping the Differences among Data Approaches

Grasping the Differences among Data Approaches

• Data science, machine learning engineering, and data engineering cover different functions within the big data paradigm.
• Huge velocities, varieties, and volumes of structured, unstructured, and semistructured data are being captured, processed, stored, and analyzed using a set of techniques and technologies.
• Although the terms data science and data engineering are often used interchangeably, they're distinct domains of expertise. In the following, we introduce concepts that are fundamental to data science and data engineering, as well as the hybrid machine learning engineering role.
Grasping the Differences among Data Approaches

Defining Data Science

• Data science is the scientific domain dedicated to knowledge discovery via data analysis.
• Data science is a vast and multidisciplinary field. To call yourself a true data scientist, you need to have expertise in math and statistics, computer programming, and your own domain-specific subject matter.
• Data science methods can provide more robust decision-making capabilities in business and in science.
Grasping the Differences among Data Approaches

Defining Data Science

• Using data science skills, you can do cool things like the following:
- Use machine learning to optimize energy usage.
- Optimize tactical strategies to achieve goals in business and science.
- Predict unknown contaminant levels from sparse environmental datasets.
- Design automated theft and fraud prevention systems to detect anomalies.
Grasping the Differences among Data Approaches

Defining machine learning engineering

• Machine learning is the practice of applying algorithms to learn from data and make automated predictions (a minimal example is sketched below).
• A machine learning engineer is essentially a software engineer who is skilled enough in data science to deploy advanced data science models within the applications they build.
• This person doesn't need to know as much data science as a data scientist, but should know much more about computer science and software development than a typical data scientist.
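The following minimal Python sketch illustrates machine learning in this narrow sense, fitting an algorithm to data and producing an automated prediction, using scikit-learn; the tiny dataset is invented for illustration.

```python
# A minimal sketch of "learning from data and making automated predictions".
# The tiny dataset below is made up for illustration only.
from sklearn.linear_model import LinearRegression

# Hypothetical training data: hours of machine runtime vs. energy used (kWh).
X_train = [[1], [2], [3], [4]]
y_train = [2.1, 4.0, 6.2, 7.9]

model = LinearRegression()
model.fit(X_train, y_train)        # learn from the data
print(model.predict([[5]]))        # automated prediction for 5 hours (about 9.95)
```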
Grasping the Differences among Data Approaches

Defining data engineering

• Data engineering is dedicated to building and maintaining data systems that overcome the data processing bottlenecks and data handling problems that arise from managing the high volume, velocity, and variety of big data.
• Data engineers often have experience working with real-time processing frameworks and massively parallel processing (MPP) platforms, as well as with RDBMSs. They generally code in Java, C++, Scala, or Python.
• They know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big data.
Grasping the Differences among Data Approaches

Comparing machine learning engineers, data scientists, and data engineers

• The roles of data scientist, machine learning engineer, and data engineer are frequently conflated by hiring managers.
• Data scientists are sometimes stuck having to learn to do the job of a data engineer, and vice versa.
• To summarize, hire (i) a data engineer to store, migrate, and process your data; (ii) a data scientist to make sense of it for you; and (iii) a machine learning engineer to bring your machine learning models into production.
4. Storing and Processing Data for Data Science

Storing and Processing Data for Data Science

Storing data and doing data science directly in the cloud

• Storing data in a cloud environment offers serious business advantages, such as faster time-to-market, enhanced flexibility, and better security.
• A lot of different technologies have emerged from the cloud computing revolution, many of which are of interest to those trying to leverage big data. The following slides examine a few of these new technologies.
Storing and Processing Data for Data Science

Storing data and doing data science directly in the cloud

• Using serverless computing to execute data science:
- Serverless computing refers to computing that's executed in a cloud environment rather than on your desktop or on-premise at your company.
- Serverless computing decreases the downtime that data scientists spend preparing data and infrastructure for their predictive models.
- With serverless computing, your data science model runs directly within its container. Your cloud service provider handles all the adjustments that need to be made to the infrastructure to support your functions, as the sketch below illustrates.
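As one hypothetical illustration of this idea, the sketch below wraps a predictive model in a serverless-style function handler; the handler signature follows AWS Lambda's Python convention, but the model file, input fields, and response shape are assumptions. The point is that the function only contains the data science logic, while the cloud provider manages the servers, scaling, and runtime.

```python
# A minimal sketch of running a predictive model as a serverless function.
# The model file name and input/output shapes are hypothetical assumptions;
# the cloud provider manages the underlying infrastructure and scaling.
import json
import pickle

with open("model.pkl", "rb") as f:    # hypothetical pre-trained model artifact
    MODEL = pickle.load(f)            # loaded once per container, reused across invocations

def handler(event, context):
    # Assumed input: a JSON body containing a "features" list.
    features = json.loads(event["body"])["features"]
    prediction = MODEL.predict([features])[0]
    return {"statusCode": 200, "body": json.dumps({"prediction": float(prediction)})}
```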
Storing and Processing Data for Data Science

Storing data and doing data science directly in the cloud

• Containerizing predictive applications within Kubernetes:
- Kubernetes is an open-source software suite that orchestrates and coordinates the deployment, scaling, and management of containerized applications across clusters of worker nodes.
- One particularly attractive feature of Kubernetes is that you can run it on data that sits in on-premise clusters, in the cloud, or in a hybrid cloud environment.
Storing and Processing Data for Data Science

Storing data and doing data science directly in the cloud

• Sizing up popular cloud-warehouse solutions: You have several products to choose from when it comes to cloud-warehouse solutions. The following list looks at the most popular options (a small querying sketch follows the list):
- Amazon Redshift
- Snowflake
- Google BigQuery
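As a hedged example of how such a warehouse is queried from Python, the sketch below uses Google BigQuery's client library; the project, dataset, and table names are hypothetical, and credentials are assumed to be already configured in the environment. Redshift and Snowflake offer comparable SQL interfaces through their own drivers.

```python
# A minimal sketch of querying a cloud data warehouse (BigQuery shown here).
# The project, dataset, and table names are hypothetical; credentials are
# assumed to be configured via the environment.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")   # hypothetical project
query = """
    SELECT product, SUM(amount) AS total_sales
    FROM `my-analytics-project.sales.transactions`
    GROUP BY product
"""
for row in client.query(query).result():    # the SQL runs inside the warehouse
    print(row.product, row.total_sales)
```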
Storing and Processing Data for Data Science

Storing data and doing data science directly in the cloud

• Introducing NoSQL databases:
- A traditional RDBMS isn't equipped to handle big data demands.
- It's designed to handle only relational datasets constructed of data stored in clean rows and columns, and thus capable of being queried via SQL.
- RDBMSs are incapable of handling unstructured and semistructured data.
- RDBMSs simply lack the capabilities needed to meet big data volume and velocity requirements.
Storing and Processing Data for Data Science

Storing data and doing data science directly in the cloud

• Introducing NoSQL databases:
- This is where NoSQL comes in.
- NoSQL databases are nonrelational, distributed database systems that were designed to rise to the challenges involved in storing and processing big data. They can be run on-premise or in a cloud environment.
- NoSQL systems facilitate non-SQL querying of nonrelational or schema-free, semistructured and unstructured data, as the sketch below illustrates.
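Below is a minimal sketch of that schema-free, non-SQL querying style, using MongoDB through the pymongo driver as one example; the connection string, database, and collection names are hypothetical.

```python
# A minimal sketch of non-SQL querying of schema-free documents (MongoDB via
# pymongo shown as one example). Connection string, database, and collection
# names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")     # assumed local instance
events = client["analytics"]["click_events"]

# Documents in the same collection need not share a schema.
events.insert_one({"user": "abdu", "page": "/home"})
events.insert_one({"user": "sara", "page": "/cart", "referrer": "email"})

# Query by key/value filters instead of SQL.
for doc in events.find({"page": "/cart"}):
    print(doc)
```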
Storing and Processing Data for Data Science

Storing big data on-premise

• Although cloud storage and cloud processing of big data are widely accepted as safe, reliable, and cost-effective, companies have a multitude of reasons for using on-premise solutions instead.
• Reminiscing about Hadoop: To work around the limitations of relational systems, data engineers originally turned to the Hadoop data processing platform to boil down big data into smaller datasets that are more manageable for data scientists to analyze.
Storing and Processing Data for Data Science

Storing big data on-premise

• When people refer to Hadoop, they're generally referring to an on-premise Hadoop storage environment that includes the HDFS (for data storage), MapReduce (for bulk data processing), Spark (for real-time data processing), and YARN (for resource management).
• Incorporating MapReduce, the HDFS, and YARN: MapReduce is a parallel distributed processing framework that can process tremendous volumes of data in batch. MapReduce works by converting raw data down to sets of tuples and then combining and reducing those tuples into smaller sets of tuples; the sketch below mimics this flow on a single machine.
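The single-machine Python sketch below mimics that map-then-reduce flow on a toy word-count problem; a real Hadoop job would distribute the same steps across a cluster, so this is an illustration of the idea, not Hadoop's API.

```python
# A minimal sketch of the MapReduce idea in plain Python: map raw records to
# (key, value) tuples, then group and reduce them into a smaller set of tuples.
# A real Hadoop job would distribute these steps across a cluster.
from itertools import groupby
from operator import itemgetter

lines = ["big data moves fast", "big data is big"]

# Map: emit one (word, 1) tuple per word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group the tuples by key.
mapped.sort(key=itemgetter(0))

# Reduce: combine each group into a single (word, count) tuple.
reduced = [(word, sum(count for _, count in group))
           for word, group in groupby(mapped, key=itemgetter(0))]
print(reduced)   # [('big', 3), ('data', 2), ('fast', 1), ('is', 1), ('moves', 1)]
```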
Storing and Processing Data for Data Science

Storing big data on-premise

• Storing data on the Hadoop distributed file system (HDFS): The HDFS uses clusters of commodity hardware for storing data. The HDFS is characterized by three key features: HDFS blocks, redundancy, and fault-tolerance.
• Putting it all together on the Hadoop platform: The Hadoop platform was designed for large-scale data processing, storage, and management. This open-source platform is generally composed of the HDFS, MapReduce, Spark, and YARN (a resource manager), all working together.
Storing and Processing Data for Data Science

Storing big data on-premise

• Introducing massively parallel processing (MPP) platforms: Massively parallel processing (MPP) platforms can be used instead of MapReduce as an alternative approach for distributed data processing. MPP runs parallel computing tasks on costly custom hardware, whereas MapReduce runs them on inexpensive commodity servers.
Storing and Processing Data for Data Science

Processing big data in real-time

• A real-time processing framework is a framework that processes data in real time (or near-real time) as the data streams and flows into the system.
• Real-time processing frameworks do one of the following:
- Increase the overall time efficiency of the system.
- Deploy innovative querying methods to facilitate the real-time querying of big data.
Storing and Processing Data for Data Science

Processing big data in real-time

• In-memory refers to processing data within the computer's memory, without reading and writing intermediate computational results to disk. In-memory computing provides results a lot faster, but cannot process as much data per processing interval (see the sketch below).
• Regardless of the industry in which you work, if your business is impacted by real-time data streams generated by humans, machines, or sensors, a real-time processing framework would help you optimize and generate value for your organization.
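The following is a minimal sketch of in-memory processing using Spark from Python: the DataFrame is cached in memory so repeated computations avoid going back to disk. It assumes a local Spark installation, and the sensor readings are invented for illustration.

```python
# A minimal sketch of in-memory processing with Spark: caching keeps the data
# in memory so repeated actions avoid re-reading from disk. Assumes a local
# Spark installation; the readings below are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

df = spark.createDataFrame(
    [("sensor-1", 21.5), ("sensor-2", 19.8), ("sensor-1", 22.1)],
    ["device", "reading"],
)

df.cache()                                     # keep the data in memory across actions
df.count()                                     # first action materializes the cache
df.groupBy("device").avg("reading").show()     # served from memory, not disk
```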
Thank You
