Prep Doc – Azure

DATA ARCHITECTURE

By default a storage account is created as Blob storage – to create a data lake, open the Advanced tab during creation and tick Enable hierarchical namespace.
Azure Data Factory Studio hubs: Home, Author, Monitor, Manage, Learning center
Containers – where the data lake data lives

File shares – one-stop solution to store all the files throughout the org

Queues – can hold JSON data – helps with streaming data, like messages

Tables – NoSQL databases – for semi-structured data


Connections are called Linked Services.
Hit Debug to test the pipeline.

Then click the Publish all button.

Up till now we have built a static pipeline. Since we have 8 different files, one option is to copy and duplicate the same pipeline, but instead we will build a dynamic pipeline.

1. The relative URL will change – the base URL stays the same
2. The sink folder will be different
3. The sink file will change

We will create 3 different parameters.


Do not put in a value, as we will be using the ForEach loop to pass the values.

Do the same with the sink dataset.

We need to pass a JSON array here in Items – each element carries the relative URL, the sink folder and the sink file.

Create an array of dictionaries with these 3 parameters in VS Code to pass into the loop (see the sample below).
Uncheck the 'First row only' box on the Lookup activity, otherwise it will pass only one row into the loop.
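A minimal sketch of what that JSON array could look like – the key names and file names here are made up, so use your own parameter names and all 8 entries:

[
  {
    "p_rel_url": "2023-orders.csv",
    "p_sink_folder": "orders",
    "p_sink_file": "orders.csv"
  },
  {
    "p_rel_url": "2023-customers.csv",
    "p_sink_folder": "customers",
    "p_sink_file": "customers.csv"
  }
]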

Deactivate the other activities and run only the Lookup activity – the value key in its output is what we need to pass into the parameters.
Cut the Copy Data activity and paste it inside the ForEach activity.
Add the value of each parameter as an @item() expression (see below).

Do the same with the sink.
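For illustration (with the same made-up key names as in the sample array), the dynamic content on the source and sink dataset parameters references the current loop item:

relative_url : @item().p_rel_url
sink_folder  : @item().p_sink_folder
sink_file    : @item().p_sink_file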


Launch Databricks.
A service principal (app registration) is the key to giving Databricks access to the data lake.
App registration: save the application (client) ID and the directory (tenant) ID, then create a client secret – these are used in Databricks.

Copy the secret's value (and the secret ID).

Now assign a role to this application so it can access the data lake.


Creating a notebook in Databricks

Add a heading, then in the configuration cell replace the few placeholders in < >.
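That configuration cell is usually the standard service-principal (OAuth) setup – a sketch, with the < > placeholders to be replaced by your storage account name, application (client) ID, client secret and directory (tenant) ID:

# service principal authentication for ADLS Gen2 (abfss)
spark.conf.set("fs.azure.account.auth.type.<storage_acc_name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage_acc_name>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage_acc_name>.dfs.core.windows.net", "<application_id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage_acc_name>.dfs.core.windows.net", "<client_secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage_acc_name>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant_id>/oauth2/token")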

APACHE SPARK
spark.read.format('csv') --- specifies the format of the file

.option('header', True) --- tells Spark that the file has a header row

.option('inferSchema', True) --- by default, when we save data as CSV, Spark reads every column as a text column, so we want Spark to infer the schema, i.e. work out the data types on its own

.load('abfss://<container_name>@<storage_acc_name>.dfs.core.windows.net/<folder_name>')

abfss --- Azure Blob File System Secure

df.display() --- to display the data
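Putting those options together, the full read is typically chained like this (same placeholders as in the load() above):

df = (spark.read.format('csv')
      .option('header', True)
      .option('inferSchema', True)
      .load('abfss://<container_name>@<storage_acc_name>.dfs.core.windows.net/<folder_name>'))

df.display()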

PYSPARK – TRANSFORMATIONS

df.withColumn() --- create a new column or modify an existing one

Now push the data to the silver layer.

There are 2 file formats – parquet and delta; delta is built on top of the parquet format.

There are 4 write modes – append, overwrite, error, ignore (see the sketch below).
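A minimal sketch of the silver-layer write – the silver container and the orders folder are assumptions:

(df.write.format('delta')        # or 'parquet'
   .mode('overwrite')            # append / overwrite / error / ignore
   .save('abfss://silver@<storage_acc_name>.dfs.core.windows.net/orders'))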

concat() function
The lit() function is used to pass constant (literal) values.

1st way – concat()

2nd way – concat_ws(), which works like COMBINEVALUES in DAX (both shown below)
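A sketch of both ways, assuming hypothetical first_name and last_name columns:

from pyspark.sql.functions import concat, concat_ws, lit

# 1st way – concat() with a literal space as the separator
df = df.withColumn('full_name', concat('first_name', lit(' '), 'last_name'))

# 2nd way – concat_ws() takes the separator first (like COMBINEVALUES in DAX)
df = df.withColumn('full_name', concat_ws(' ', 'first_name', 'last_name'))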

split() function – this time transforming the same column rather than creating a new one
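For example, overwriting the same (hypothetical) column with the array of its parts:

from pyspark.sql.functions import split

df = df.withColumn('full_name', split('full_name', ' '))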

Converting a date column from string format to timestamp format
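A sketch, assuming a hypothetical order_date column stored as text in yyyy-MM-dd format:

from pyspark.sql.functions import to_timestamp

df = df.withColumn('order_date', to_timestamp('order_date', 'yyyy-MM-dd'))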

Replacing a specific letter with another ----- the regexp_replace() function
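For example, replacing one letter with another in a hypothetical category column:

from pyspark.sql.functions import regexp_replace

df = df.withColumn('category', regexp_replace('category', 'A', 'B'))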

Multiplication – deriving a column as the product of two others
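For example, deriving a total from hypothetical quantity and unit_price columns:

from pyspark.sql.functions import col

df = df.withColumn('total_amount', col('quantity') * col('unit_price'))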

Aggregate the data using groupBy() & agg() to get the number of orders per day
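A sketch, assuming hypothetical order_date and order_id columns:

from pyspark.sql.functions import count

orders_per_day = df.groupBy('order_date').agg(count('order_id').alias('total_orders'))
orders_per_day.display()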

Azure Synapse
Unified platform – Synapse Analytics

It includes an Apache Spark pool – with the same functionality as Databricks – both manage Spark clusters.
It is also a data warehousing solution – you can create tables.

Allowing Synapse Analytics to access data in the data lake – no application (key) is required since both are Azure products; use the SYSTEM-ASSIGNED MANAGED IDENTITY.
Select Members.

As the next step, go back to Synapse Analytics.

Dedicated SQL pool – the traditional way of storing – the data actually resides in the database (like MS SQL Server) – it is a traditional database but on the cloud, optimized for query reads, big data and data warehousing.

Serverless SQL pool – the data lake & lakehouse concept – the data resides in the data lake, not in a database (to save cost).

One step in between – assigning role to yourself


Using OPENROWSET() – helps us apply an abstraction layer over the data residing in the data lake – returns the result in tabular format.

Change blob to dfs in the URL --- blob is written by default.
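A sketch of a serverless SQL query over the data lake – the storage account, silver container and orders folder are assumptions:

SELECT *
FROM OPENROWSET(
        BULK 'https://<storage_acc_name>.dfs.core.windows.net/silver/orders/',
        FORMAT = 'PARQUET'
    ) AS result;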


GOLD layer – creating a schema

Create VIEWS for all the other tables and click Publish all.
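A sketch of the gold schema and one of the views – the schema, view and folder names are assumptions:

CREATE SCHEMA gold;
GO

CREATE VIEW gold.orders AS
SELECT *
FROM OPENROWSET(
        BULK 'https://<storage_acc_name>.dfs.core.windows.net/silver/orders/',
        FORMAT = 'PARQUET'
    ) AS result;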

Creating external tables --- there are 3 steps:

1. Create credentials – we tell Synapse Analytics to access the data using the managed identity
2. Create an external data source
3. Create an external file format

Step 1 – credentials
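A sketch of the credential – a master key may need to be created first, and the credential name is an assumption:

-- required once per database before a scoped credential can be created
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong_password>';

CREATE DATABASE SCOPED CREDENTIAL cred_managed_identity
WITH IDENTITY = 'Managed Identity';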

Step 2 – creating the external data source
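A sketch, pointing at an assumed gold container and reusing the credential above:

CREATE EXTERNAL DATA SOURCE source_gold
WITH (
    LOCATION = 'https://<storage_acc_name>.dfs.core.windows.net/gold/',
    CREDENTIAL = cred_managed_identity
);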

Step 3 – creating external file format
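A sketch of a Parquet file format (the format name is an assumption):

CREATE EXTERNAL FILE FORMAT format_parquet
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);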

CETAS – Create External Table As Select ---- directly using VIEWS
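A sketch of a CETAS that materialises one of the gold views – the table and folder names are assumptions:

CREATE EXTERNAL TABLE gold.ext_orders
WITH (
    LOCATION = 'ext_orders/',
    DATA_SOURCE = source_gold,
    FILE_FORMAT = format_parquet
)
AS
SELECT * FROM gold.orders;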


External tables save the data, but a VIEW doesn't.

Connecting Synapse to Power BI using the SQL endpoint

Get Data --> Azure Synapse Analytics
