LO2a) - Introduction To Data Engineering

The document discusses data engineering and its importance. It explains that data engineering involves sourcing, transforming, and managing data from various systems to ensure it is useful and accessible. It also covers key concepts like structured, unstructured and semi-structured data, as well as data science and the growing demand for data science jobs.

Uploaded by

Ali Azgar Katha
Copyright
© © All Rights Reserved

Data Engineering

Himanshu Patel

Artificial Intelligence and Data Analytics


LO2: Explain the value of business data
“Learning from data is virtually universally useful. Master it and
you will be welcomed anywhere.”
– John Elder, founder of Elder Research

• Today data science is everywhere.
• A recent study shows that demand for data scientists and analysts is projected to grow by 28 percent by 2021. This is on top of the current market need.
• According to the U.S. Bureau of Labor Statistics, data science jobs are projected to grow about 28% through 2026.

Objective
After attending this session, you should be able to:
• Explain the importance of data science
• Know the basics of data science
• Understand data engineering
• Understand the various data types

A few key concepts...

Data
Data is a collection of facts, such as numbers, words, measurements,
observations or just descriptions of things.
• Qualitative vs Quantitative

https://www.mathsisfun.com/data/data.html

Structured data

• Well-organized data in the form of tables that can easily be operated on is known as structured data.
• Searching for and accessing information in this type of data is very easy.
• For example, data stored in a relational database (queried with SQL) takes the form of tables with multiple rows and columns.
• A spreadsheet is another good example of structured data.
• Structured data represents only 5% to 10% of all data in the world.
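
To make the point about easy searching concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and values are hypothetical and invented purely for illustration; they do not come from the slides.

```python
import sqlite3

# Hypothetical table and values, used only to illustrate structured data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("South", 80.5), ("North", 45.0)],
)

# Every row has the same columns, so a search is a single SQL query.
north_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = ?", ("North",)
).fetchone()[0]
print(north_total)
conn.close()
```

Because the schema is fixed in advance, the query does not need to inspect each record to find out which fields it has.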

Unstructured data

• Unstructured data requires advanced tools and software to access the information it contains.
• For example, images and graphics, PDF files, Word documents, audio, video, emails, PowerPoint presentations, webpages and web content, wikis, streaming data, location coordinates, etc., fall under the unstructured data category.
• Unstructured data represents around 80% of all data.

Semi-structured data

• Semi-structured data carries some organizing structure, such as tags, keys, or delimiters, but is not held in a strict database schema. JSON (JavaScript Object Notation) files, BibTeX files, CSV files, tab-delimited text files, XML, and other markup languages are examples of semi-structured data found on the web.
• Semi-structured data represents only 5% to 10% of all data in the world.
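
As an illustration of what semi-structured means in practice, the sketch below parses a small JSON document in which each record carries different keys, then flattens it into uniform rows. The records and field names are hypothetical and chosen only for this example.

```python
import json

# Hypothetical records: each one has a "name", but the other keys vary.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ben", "phones": ["555-0100", "555-0101"]},
  {"name": "Chen", "address": {"city": "Saskatoon"}}
]
"""
records = json.loads(raw)

# Flatten into rows with the same columns, filling gaps with defaults.
rows = [
    {
        "name": r["name"],
        "email": r.get("email"),
        "phone_count": len(r.get("phones", [])),
        "city": r.get("address", {}).get("city"),
    }
    for r in records
]
for row in rows:
    print(row)
```

The keys give the data some structure, but the code has to cope with missing and nested fields before the result can sit in a table.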

Data engineering

• Data engineering is a critical field where data is concerned, but not many
people can accurately describe what data engineers do.
• Data drives the operations of businesses small and large. Businesses use data
to provide answers to relevant inquiries that range from consumer interest to
product viability.
• Without a doubt, data is an important part of scaling your business and gaining
valuable insights. And this makes data engineering just as important.
• In March 2019, about 6,500 LinkedIn users listed their title as “data engineer”. They reported a wide variety of skill sets, including knowledge of Python, SQL, and Java.

• Data engineering, sometimes called information engineering, is a
software approach to developing information systems.
• To be clear, data engineering encompasses sourcing, transforming,
and managing data from various systems.
• This process ensures that data is useful and accessible. Above all, data
engineering emphasizes the practical applications of data collection and
analysis.
• It should come as no surprise that investigating the inquiries mentioned
above requires complex solutions.

• As such, data engineering employs intricate methodologies for gathering and authenticating data, ranging from data integration tools to artificial intelligence.
• Similarly, data engineering relies on special mechanisms to apply the data it finds to real-world scenarios, usually by designing and monitoring sophisticated processing systems.

Why Data Engineering?
• Data engineering is important because it allows businesses to make their data more usable. For example, data engineering plays a large role in the following pursuits:
  - Finding the best practices for refining your software development life cycle
  - Tightening information security and protecting your business from cyber attacks
  - Increasing your understanding of business domain knowledge
  - Bringing data together into one place via data integration tools

• Whether business teams are dealing with sales data or analyzing their lead life cycles, data is present every step of the way.
• Over the years, technological innovation has greatly increased the importance of data. These innovations include cloud technology, open-source projects, and the growth of data in scale.

Data Science
• It has become a universal truth that modern businesses are awash with data.
• Last year, McKinsey estimated that Big Data initiatives in the US healthcare system could account for $300 billion to $450 billion in reduced healthcare spending, or 12-17 percent of the $2.6 trillion baseline in US healthcare costs. On the other hand, bad or unstructured data is estimated to cost the US roughly $3.1 trillion a year.
• Data-driven decision making is increasing in popularity.
• Accessing and finding information in unstructured data is complex and cannot be done easily with BI tools alone; this is where data science comes into the picture.
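
The 12-17 percent range quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the figures as stated in the slide; the variable names are illustrative.

```python
# Figures as quoted in the slide, in US dollars.
baseline = 2.6e12               # $2.6 trillion baseline healthcare cost
low_saving, high_saving = 300e9, 450e9

low_share = low_saving / baseline * 100
high_share = high_saving / baseline * 100
print(f"{low_share:.1f}% to {high_share:.1f}%")
```

Rounded to whole numbers, the two shares match the 12 and 17 percent stated in the bullet.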

• Data science is a field that extracts knowledge and insights from raw data.
• To do so, it uses mathematics, statistics, computer science, and programming.
• A person who has all these skills is known as a data scientist. A data scientist is curious, self-driven, and passionate about finding answers.

source: https://www.youtube.com/watch?v=X3paOmcrTjQ&feature=emb_title

Importance of data science
• Today data science is everywhere.
• The explosive growth of the digital world requires professionals with not just strong skills, but also adaptability and a passion for staying at the forefront of technology.
• A recent study shows that demand for data scientists and analysts is projected to grow by 28 percent by 2021. This is on top of the current market need.
• According to the U.S. Bureau of Labor Statistics, data science jobs are projected to grow about 28% through 2026.

The Data Science Hierarchy of Needs

Think of Artificial Intelligence as the top of a pyramid of needs. Yes, self-actualization (AI) is great, but you first need food, water, and shelter (data literacy, collection, and infrastructure).
Monica Rogati’s call-out in “The AI Hierarchy of Needs”

Data Warehouses

Data warehouses and data lakes refer to the large, complex datasets that organizations store for business intelligence.

Data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.
– Maxime Beauchemin

Cloud vs On-premise data storage

• A SaaS-based application, broadly, is any software you use that does not run on your premises.
• Examples of SaaS-based applications include Google G Suite, Office 365, Salesforce, and Cisco Webex.
• A cloud-based product or service is anything running in the cloud. This includes SaaS-based applications, as well as PaaS- and IaaS-based offerings.
• The primary difference: SaaS offerings are fully formed end-user applications, whereas cloud computing is computing infrastructure and services that you can rent.

ETL: Extract, Transform, and Load

1. Extract: sensors wait for upstream data sources to generate data (e.g. an upstream source could be machine or user-generated logs, a relational database copy, an external dataset, etc.).
2. Transform: apply business logic and perform actions such as filtering, grouping, and aggregation to translate raw data into analysis-ready datasets.
3. Load: load the processed data and transport it to a final destination. Often, this dataset can be either
   a. consumed directly by end-users, or
   b. treated as yet another upstream dependency to another ETL job, forming the so-called data lineage.
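
The three steps above can be sketched as a tiny pipeline using only the Python standard library. This is a minimal illustration under stated assumptions, not a production design: the log contents, the `user_spend` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the final destination.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical user-generated log acting as the upstream source.
RAW_LOG = """username,action,amount
alice,purchase,30.0
bob,view,0
alice,purchase,12.5
bob,purchase,7.5
"""


def extract(source):
    """Extract: read raw rows from the upstream source."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows):
    """Transform: filter to purchases, then group and sum by user."""
    totals = defaultdict(float)
    for row in rows:
        if row["action"] == "purchase":
            totals[row["username"]] += float(row["amount"])
    return dict(totals)


def load(totals, conn):
    """Load: write the analysis-ready dataset to its destination."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_spend "
        "(username TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO user_spend VALUES (?, ?)", totals.items()
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_LOG)), conn)
result = conn.execute(
    "SELECT username, total FROM user_spend ORDER BY username"
).fetchall()
print(result)
```

The `user_spend` table could be read directly by an analyst or used as the input to a further ETL job, which corresponds to options a and b in the list.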

source: https://www.youtube.com/watch?v=qWru-b6m030&feature=emb_title

Summary
• Data and its various types
• Data Engineering and its importance
• Data Science and its importance
• Databases, data warehouses, and data lakes
• Cloud vs On-premise data storage
• ETL

Himanshu Patel, Instructor
Saskatchewan Polytechnic
email: [email protected]
Faculty office, Mining building, Saskatoon
