DATA ARCHITECTURE
By default, a Blob Storage account is created – to create a data lake (ADLS Gen2), go to the Advanced tab and tick Enable hierarchical namespace
Data Factory Studio tabs – Home, Author, Monitor, Manage, Learning center
Containers – the data lake storage
File Share – one-stop solution to store all the files throughout the org
Queues – can hold JSON data – helps with streaming data, like messages
Tables – NoSQL database – for semi-structured data
Connections are called 'Linked Services'
Hit Debug to check the pipeline
Then click the Publish all button
Up to now we have built a static pipeline. Since we have 8 different files, one option is to copy and duplicate the same pipeline, but instead we will build a dynamic pipeline:
1. Relative URL will change – the base URL stays the same
2. Folder will be different
3. File will change
We will create 3 different parameters.
Do not set default values, as the ForEach loop will pass them in
Do the same with the sink
We need to pass a JSON array here in Items – in order: relative URL, sink folder, and sink file
In VS Code, create an array of dictionaries, each listing the 3 parameters to pass in the loop (see the sketch below)
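A minimal sketch of that JSON array – the parameter keys (p_rel_url, p_sink_folder, p_sink_file) and the file names are hypothetical and only need to match what the pipeline expects:

[
    {
        "p_rel_url": "Sales_2025.csv",
        "p_sink_folder": "sales",
        "p_sink_file": "sales_2025.csv"
    },
    {
        "p_rel_url": "Products.csv",
        "p_sink_folder": "products",
        "p_sink_file": "products.csv"
    }
]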
Uncheck the 'First row only' box, otherwise only one row will be passed to the loop
Deactivate the other activities and run only the Lookup activity – the 'value' key in the output below is what we need to pass into the parameters (see the sketch below)
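Assuming the Lookup activity is named 'LookupJson' (a hypothetical name), the Items field of the ForEach would reference its output array roughly like this:

@activity('LookupJson').output.value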
Cut the Copy Data activity and paste it within the ForEach activity
Add the value of each parameter using dynamic content
Do the same with the sink (see the sketch below)
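Inside the ForEach, each dataset parameter picks its value from the current item – a sketch, assuming the hypothetical JSON keys above:

Relative URL : @item().p_rel_url
Sink folder  : @item().p_sink_folder
Sink file    : @item().p_sink_file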
Launch Databricks
A Service Principal application is the key
App Registration
Save the Application (client) ID and Directory (tenant) ID and create a secret – used in Databricks
Copy the Value (and Secret ID) of the secret
Now assign a role to this application so it can access the data lake (typically Storage Blob Data Contributor)
Creating a notebook in Databricks
Heading
Replace the placeholders in < > (see the sketch below)
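A sketch of the Spark configuration that authenticates to the data lake with the service principal – the placeholders in < > come from the app registration above, and the config keys are the standard ABFS OAuth settings:

# Authenticate to ADLS Gen2 with the service principal (fill in the placeholders)
spark.conf.set("fs.azure.account.auth.type.<storage_acc_name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage_acc_name>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage_acc_name>.dfs.core.windows.net", "<application_client_id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage_acc_name>.dfs.core.windows.net", "<secret_value>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage_acc_name>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant_id>/oauth2/token")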
APACHE SPARK
spark.read.format('csv') --- gives the format of the file
option('header', True) --- tells Spark that we have a header row
option('inferSchema', True) --- by default when data is saved as CSV, Spark reads all columns as text, so we want Spark to infer the schema, i.e. decide the data types on its own
load('abfss://<container_name>@<storage_acc_name>.dfs.core.windows.net/<folder_name>')
abfss --- Azure Blob File System Secure
df.display() --- to display the data
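Putting the pieces above together, the full read looks roughly like this (placeholders as before):

df = (spark.read.format('csv')
      .option('header', True)
      .option('inferSchema', True)
      .load('abfss://<container_name>@<storage_acc_name>.dfs.core.windows.net/<folder_name>'))
df.display()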
PYSPARK – TRANSFORMATIONS
df.withColumn() --- create a new column or modify an existing one
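For example, with a hypothetical 'country' column:

from pyspark.sql.functions import upper, col

df = df.withColumn('country_upper', upper(col('country')))   # create a new column
df = df.withColumn('country', upper(col('country')))         # modify the existing one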
Now push the data to the silver layer
There are 2 file formats – Parquet and Delta; Delta is built on top of the Parquet format
There are 4 write modes – append, overwrite, error (errorifexists), ignore – passed to .mode()
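A sketch of the push to the silver layer, assuming a 'silver' container and Parquet format:

(df.write.format('parquet')
   .mode('overwrite')     # or 'append', 'error' / 'errorifexists', 'ignore'
   .save('abfss://silver@<storage_acc_name>.dfs.core.windows.net/orders'))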
Concat function
lit() function is used to define constants
1st way
2nd way – using concat_ws(), which works like COMBINEVALUES in DAX (see the sketch below)
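Both ways, sketched with hypothetical name columns:

from pyspark.sql.functions import concat, concat_ws, lit, col

# 1st way – concat() with lit(' ') as the constant separator
df = df.withColumn('full_name', concat(col('first_name'), lit(' '), col('last_name')))

# 2nd way – concat_ws() takes the separator first (like COMBINEVALUES in DAX)
df = df.withColumn('full_name', concat_ws(' ', col('first_name'), col('last_name')))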
split() function – this time transforming the same column rather than creating a new one
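A sketch, assuming a hyphen-separated 'category' column:

from pyspark.sql.functions import split, col

# Overwrite the same column with the array produced by split()
df = df.withColumn('category', split(col('category'), '-'))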
Converting a date column to timestamp format
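For example, with a hypothetical 'order_date' column:

from pyspark.sql.functions import to_timestamp, col

df = df.withColumn('order_date', to_timestamp(col('order_date')))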
Replacing a specific character (or pattern) with another ----- regexp_replace() function
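A sketch replacing hyphens with underscores in a hypothetical 'product_code' column:

from pyspark.sql.functions import regexp_replace, col

df = df.withColumn('product_code', regexp_replace(col('product_code'), '-', '_'))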
Multiplication
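For example, multiplying two hypothetical columns:

from pyspark.sql.functions import col

df = df.withColumn('total_amount', col('quantity') * col('unit_price'))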
Aggregate the data using groupBy() & agg() to get the number of orders per day
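A sketch of the aggregation, with hypothetical column names:

from pyspark.sql.functions import count

orders_per_day = (df.groupBy('order_date')
                    .agg(count('order_id').alias('num_orders')))
orders_per_day.display()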
Azure Synapse
Unified Platform – Synapse analytics
Below is the Apache Spark pool – with the same functionality as Databricks – both manage Spark clusters
Data warehousing solution – create tables
Allowing Synapse Analytics to access data in the data lake – no app registration (key) is required, as both are Azure products; access is granted via the SYSTEM-ASSIGNED MANAGED IDENTITY (managed identity)
Select Members
Next step – go back to Synapse Analytics
Dedicated SQL pool – traditional way of storing – the data actually resides in the database (like MS SQL Server) – a traditional database but on the cloud – optimized for query reads, big data, and data warehousing
Serverless SQL pool – data lake & lakehouse concept – the data resides in the data lake, not in a database (to save cost)
One step in between – assign a role to yourself
Using OPENROWSET() – helps us apply an abstraction layer over data residing in the data lake – returns the result in tabular format
Change 'blob' to 'dfs' in the URL --- by default 'blob' is written (see the sketch below)
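A sketch of an OPENROWSET query against the silver layer (account name and path are placeholders):

SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://<storage_acc_name>.dfs.core.windows.net/silver/orders/',
        FORMAT = 'PARQUET'
     ) AS result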
GOLD layer – creating a schema
Creating VIEWS for all the other tables and clicking Publish all (see the sketch below)
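A sketch of the gold schema and one of the views (names and paths are placeholders):

CREATE SCHEMA gold;
GO

CREATE VIEW gold.orders AS
SELECT *
FROM OPENROWSET(
        BULK 'https://<storage_acc_name>.dfs.core.windows.net/silver/orders/',
        FORMAT = 'PARQUET'
     ) AS result;
GO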
Creating external tables --- there are 3 steps:
1. Create credentials – we tell Synapse Analytics to pick up the data using the managed identity
2. Create external data source
3. Create external file format
Step 1 – creating the credential
Step 2 – creating the external data source
Step 3 – creating the external file format (see the sketch below for all three steps)
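A sketch of the three steps in T-SQL – the object names are hypothetical, and a master key is usually needed before the credential can be created:

-- Step 1: credential based on the managed identity
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong_password>';

CREATE DATABASE SCOPED CREDENTIAL cred_managed_identity
WITH IDENTITY = 'Managed Identity';

-- Step 2: external data source pointing at the gold container
CREATE EXTERNAL DATA SOURCE source_gold
WITH (
    LOCATION = 'https://<storage_acc_name>.dfs.core.windows.net/gold',
    CREDENTIAL = cred_managed_identity
);

-- Step 3: external file format (Parquet with Snappy compression)
CREATE EXTERNAL FILE FORMAT format_parquet
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);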
CETAS – Create External Table As Select ---- directly using VIEWS
External tables save the data but a VIEW doesn't (see the sketch below)
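A CETAS sketch that materialises one of the views as an external table (names are hypothetical):

CREATE EXTERNAL TABLE gold.ext_orders
WITH (
    LOCATION    = 'ext_orders',
    DATA_SOURCE = source_gold,
    FILE_FORMAT = format_parquet
)
AS
SELECT * FROM gold.orders;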
Connecting Synapse to Power BI using SQL Endpoints
Get Data → Azure Synapse Analytics