AWS Databricks Workspace Setup and Autoloader

Uploaded by Anish Sebastian

AWS Roles and S3 Setup:

Create a role to establish a cross-account connection from Databricks to AWS:

AWS Role Creation - retail-dev-cross-policy-role

Step 1 - Create the role - retail-dev-cross-policy-role:


Step 2 - Click retail-dev-cross-policy-role and navigate to the Trust
Relationships tab.

Add this JSON to the trust relationship:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::414351767826:root",
          "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
        ]
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "25818cb3-0b4b-4ea9-aa93-6b87cbffbb75"
        }
      }
    }
  ]
}

 Here 414351767826 is Databricks' own AWS account ID, not your AWS account ID.
 unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL is the Databricks-managed role used to
create the Unity Catalog in the S3 bucket.
 For 25818cb3-0b4b-4ea9-aa93-6b87cbffbb75, provide your Databricks account ID as the
external ID.
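The trust relationship above can also be generated programmatically, so the external ID is never copy-pasted by hand. A minimal sketch using only the standard library; the account IDs are the sample values from this guide:

```python
import json

# Databricks' own AWS account ID (fixed, not yours) and the Unity Catalog
# master role, exactly as used in the trust relationship above.
DATABRICKS_AWS_ACCOUNT = "414351767826"
UC_MASTER_ROLE = (
    f"arn:aws:iam::{DATABRICKS_AWS_ACCOUNT}:role/"
    "unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
)

def build_trust_policy(databricks_account_id: str) -> str:
    """Return the cross-account trust policy JSON, with your own
    Databricks account ID plugged in as the sts:ExternalId."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "AWS": [f"arn:aws:iam::{DATABRICKS_AWS_ACCOUNT}:root",
                        UC_MASTER_ROLE],
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"sts:ExternalId": databricks_account_id},
            },
        }],
    }
    return json.dumps(policy, indent=2)

doc = build_trust_policy("25818cb3-0b4b-4ea9-aa93-6b87cbffbb75")
```

The returned string can be pasted directly into the Trust Relationships tab.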
Step 3 - Click retail-dev-cross-policy-role, navigate to the Permissions
tab, and create a new inline policy - s3-unity-catelog-policy.

Add this JSON in the JSON tab:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetLifecycleConfiguration",
        "s3:PutLifecycleConfiguration"
      ],
      "Resource": [
        "arn:aws:s3:::retail-dev-datalabs/*",
        "arn:aws:s3:::retail-dev-datalabs"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "sts:AssumeRole"
      ],
      "Resource": [
        "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
      ],
      "Effect": "Allow"
    }
  ]
}

 Here retail-dev-datalabs is the dedicated S3 bucket that holds the media files and the
Unity Catalog metadata.
 Provide a policy name and create the inline policy - s3-unity-catelog-policy.

Step 4 - Click retail-dev-cross-policy-role, navigate to the Permissions
tab, and create a new inline policy - databricks-cross-role-policy.

Add this JSON in the JSON tab:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1403287045000",
      "Effect": "Allow",
      "Action": [
        "ec2:AssociateDhcpOptions",
        "ec2:AssociateIamInstanceProfile",
        "ec2:AssociateRouteTable",
        "ec2:AttachInternetGateway",
        "ec2:AttachVolume",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateDhcpOptions",
        "ec2:CreateInternetGateway",
        "ec2:CreateRoute",
        "ec2:CreateSecurityGroup",
        "ec2:CreateSubnet",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateVpc",
        "ec2:CreateVpcPeeringConnection",
        "ec2:DeleteInternetGateway",
        "ec2:DeleteRoute",
        "ec2:DeleteRouteTable",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteSubnet",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DeleteVpc",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeIamInstanceProfileAssociations",
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:DescribePrefixLists",
        "ec2:DescribeReservedInstancesOfferings",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSpotInstanceRequests",
        "ec2:DescribeSpotPriceHistory",
        "ec2:DescribeSubnets",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpcs",
        "ec2:DetachInternetGateway",
        "ec2:DisassociateIamInstanceProfile",
        "ec2:ModifyVpcAttribute",
        "ec2:ReplaceIamInstanceProfileAssociation",
        "ec2:RequestSpotInstances",
        "ec2:RevokeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::350476742415:role/ec2-instance-profile-role"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iam:CreateServiceLinkedRole",
        "iam:PutRolePolicy"
      ],
      "Resource": "arn:aws:iam::*:role/aws-service-role/spot.amazonaws.com/AWSServiceRoleForEC2Spot",
      "Condition": {
        "StringLike": {
          "iam:AWSServiceName": "spot.amazonaws.com"
        }
      }
    }
  ]
}

 Here 350476742415 is my own AWS account ID.
 arn:aws:iam::350476742415:role/ec2-instance-profile-role - a new role with the name
ec2-instance-profile-role will be created later for this ARN.
 Provide a policy name and create the inline policy - databricks-cross-role-policy.
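The inline policies from Steps 3 and 4 can also be attached from a script instead of the console. A minimal sketch, assuming boto3 is installed and your AWS credentials allow iam:PutRolePolicy; the role and policy names match this guide, and the function is only defined here (calling it requires real AWS access):

```python
import json

ROLE_NAME = "retail-dev-cross-policy-role"

def put_inline_policy(role_name: str, policy_name: str, policy_doc: dict) -> None:
    """Attach (or overwrite) an inline policy on an IAM role."""
    import boto3  # assumed installed; needs valid AWS credentials
    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy_doc),
    )

# Example call, with policy_doc being one of the JSON documents above:
# put_inline_policy(ROLE_NAME, "databricks-cross-role-policy", policy_doc)
```

put_role_policy is idempotent for a given policy name, so re-running the script simply overwrites the inline policy with the latest document.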

Step 5 - Click retail-dev-cross-policy-role and verify that all the
policies and permissions have been created.
AWS Role Creation - ec2-instance-profile-role

Step 1 - Click Roles and create a new role.

Step 2 - Click ec2-instance-profile-role and create an inline policy.

Policy name - retail-dev-ec2-with-event-queue-policy

Add this JSON in the JSON tab:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DatabricksAutoLoaderSetup",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketNotification",
        "s3:PutBucketNotification",
        "sns:ListSubscriptionsByTopic",
        "sns:GetTopicAttributes",
        "sns:SetTopicAttributes",
        "sns:CreateTopic",
        "sns:TagResource",
        "sns:Publish",
        "sns:Subscribe",
        "sqs:CreateQueue",
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:SendMessage",
        "sqs:GetQueueUrl",
        "sqs:GetQueueAttributes",
        "sqs:SetQueueAttributes",
        "sqs:TagQueue",
        "sqs:ChangeMessageVisibility"
      ],
      "Resource": [
        "arn:aws:s3:::retail-dev-datalabs",
        "arn:aws:sqs:ap-southeast-1:350476742415:databricks-auto-ingest-*",
        "arn:aws:sns:ap-southeast-1:350476742415:databricks-auto-ingest-*"
      ]
    },
    {
      "Sid": "DatabricksAutoLoaderList",
      "Effect": "Allow",
      "Action": [
        "sqs:ListQueues",
        "sqs:ListQueueTags",
        "sns:ListTopics"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DatabricksAutoLoaderTeardown",
      "Effect": "Allow",
      "Action": [
        "sns:Unsubscribe",
        "sns:DeleteTopic",
        "sqs:DeleteQueue"
      ],
      "Resource": [
        "arn:aws:sqs:ap-southeast-1:350476742415:databricks-auto-ingest-*",
        "arn:aws:sns:ap-southeast-1:350476742415:databricks-auto-ingest-*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::350476742415:role/ec2-instance-profile-role"
    }
  ]
}

 Here retail-dev-datalabs is the AWS S3 bucket name.
 ap-southeast-1 is the bucket region; the same region must be used for the Databricks
workspace creation as well.
 350476742415 is the AWS account ID.
 arn:aws:iam::350476742415:role/ec2-instance-profile-role is the ARN of the current role.
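The SNS/SQS permissions above are exactly what Auto Loader's file notification mode needs: with notifications enabled, Databricks creates the databricks-auto-ingest-* topic and queue itself. A minimal sketch of the reader options that exercise this policy (the option names are the standard cloudFiles options; the input format is an assumption):

```python
# Auto Loader options for file-notification mode. Setting
# "cloudFiles.useNotifications" to "true" makes Databricks create the
# databricks-auto-ingest-* SNS topic and SQS queue covered by the
# retail-dev-ec2-with-event-queue-policy above.
autoloader_options = {
    "cloudFiles.format": "csv",             # assumed input file format
    "cloudFiles.useNotifications": "true",  # SNS/SQS instead of directory listing
    "cloudFiles.region": "ap-southeast-1",  # must match the bucket region
}

# Inside a notebook on a cluster with the instance profile attached:
# df = (spark.readStream.format("cloudFiles")
#         .options(**autoloader_options)
#         .option("cloudFiles.schemaLocation",
#                 "s3://retail-dev-datalabs/_schemas/")  # hypothetical path
#         .load("s3://retail-dev-datalabs/input/"))      # hypothetical path
```

Without "cloudFiles.useNotifications", Auto Loader falls back to directory listing and none of the SNS/SQS permissions are exercised.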

Step 3 - Click ec2-instance-profile-role and verify that all the policies
and permissions have been created.
AWS S3 Bucket Creation - retail-dev-datalabs

Add S3 bucket policy:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::350476742415:role/ec2-instance-profile-role"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::retail-dev-datalabs"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::350476742415:role/ec2-instance-profile-role"
      },
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl"
      ],
      "Resource": "arn:aws:s3:::retail-dev-datalabs/*"
    },
    {
      "Sid": "Grant Databricks Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::414351767826:root"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::retail-dev-datalabs/*",
        "arn:aws:s3:::retail-dev-datalabs"
      ],
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/DatabricksAccountId": "25818cb3-0b4b-4ea9-aa93-6b87cbffbb75"
        }
      }
    }
  ]
}

 Here retail-dev-datalabs is the AWS S3 bucket name.
 arn:aws:iam::350476742415:role/ec2-instance-profile-role is the EC2 instance-profile
IAM role mapped to the event queues.
 arn:aws:iam::414351767826:root grants Databricks access from Databricks' AWS account ID.
Databricks Workspace Creation:
Cloud Resources:

Databricks Create Credential Configuration - cross-role-credentials

Step-1:

 Here 25818cb3-0b4b-4ea9-aa93-6b87cbffbb75 is your Databricks external ID.
 arn:aws:iam::350476742415:role/retail-dev-cross-policy-role is the cross-account role
retail-dev-cross-policy-role you created in the first step.
Databricks Create Storage Configuration - retail-dev-datalabs-storage
Step-2:
Databricks Data -> Metastores - retail-devin-unity-catelog
Step-3:
Note: one region, one metastore.

Don't assign the metastore to any workspace yet; once you have created a
workspace, you can assign it.
Workspaces:
Databricks Create new workspace - retail-dev-datalabs

Step-1: Create the workspace with the cloud resources created above.

Step-2: Assign the created metastore to the created workspace:
Data -> Metastores -> retail-devin-unity-catelog -> Workspaces tab ->
Assign to Workspace

Databricks Workspace Login

Step-1: Log in to the created workspace as admin by providing your email
and Databricks partner-portal password.
Step-2: As admin, create an instance profile to attach to the cluster by
navigating to the admin console.
Step-3: Provide the ARN value of the ec2-instance-profile-role role.
Step-4: Create a cluster with the instance profile attached.

Note: the attached instance profile lets the cluster access S3 directly,
without providing an AWS access key and secret key.

Step-5: Attach the created cluster to a notebook and test whether the S3
files load without mounting.

S3 Folder with file:

List all the files from S3 using the cluster, without providing access
keys or secret keys and without mounting:
Read the file data using spark.read:
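The batch read can be sketched as follows. Because the cluster's instance profile grants S3 access, no keys and no mount are needed; the s3:// path and the header option are assumptions for illustration, and the function is only defined here (it needs a live SparkSession to call):

```python
# Direct S3 read via the cluster's instance profile: no access key,
# no secret key, no dbutils.fs.mount.
S3_PATH = "s3://retail-dev-datalabs/input/"  # hypothetical folder in the bucket

def read_from_s3(spark, path: str = S3_PATH):
    """Batch-read CSV files straight from S3 using the instance profile."""
    return (spark.read
            .option("header", "true")  # assumes the files have a header row
            .csv(path))

# In a notebook: df = read_from_s3(spark); display(df)
```

If the read fails with an access-denied error, re-check the bucket policy and the instance profile attached to the cluster rather than reaching for access keys.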

Databricks Autoloader Demo


Read the file data using spark.readStream:
Upload one more file into S3 and wait for the stream to process it:
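The end-to-end demo above can be sketched as a single streaming pipeline. All three s3:// paths and the target table name are hypothetical, and the function is only defined here (it needs a live SparkSession on the cluster to run); newly uploaded files under INPUT_PATH are picked up incrementally by the running stream:

```python
# Hypothetical locations inside the retail-dev-datalabs bucket.
INPUT_PATH = "s3://retail-dev-datalabs/input/"
SCHEMA_LOC = "s3://retail-dev-datalabs/_schemas/demo/"
CHECKPOINT = "s3://retail-dev-datalabs/_checkpoints/demo/"

def start_autoloader_stream(spark):
    """Start an Auto Loader stream that ingests new S3 files into a table."""
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")            # assumed input format
          .option("cloudFiles.schemaLocation", SCHEMA_LOC)
          .load(INPUT_PATH))
    # The checkpoint records which files were already ingested, so each
    # uploaded file is processed exactly once across restarts.
    return (df.writeStream
            .option("checkpointLocation", CHECKPOINT)
            .toTable("retail_dev.autoloader_demo"))      # hypothetical target table

# In a notebook: query = start_autoloader_stream(spark)
# then upload another file to INPUT_PATH and watch the stream pick it up.
```

Stopping and restarting the query is safe: thanks to the checkpoint, already-ingested files are not re-processed.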
