Medica Interview Answers


Over a period of time I have used both Jenkins and Terraform.

So currently we are using Jenkins for the deployment of the code into the higher environments, and Terraform we are using for creating resources.
Not that much, not extensively. Basically, I know it is a database schema change management tool, meaning it supports tracking and managing any changes to the database schemas. So essentially it is a kind of version control for the database schema.
So most of my data pipelines were written in PySpark: ingesting the data, cleaning the data, and loading it into Snowflake. We are actually deploying the code using CI/CD. We push the changes into the feature branch, and then our CI, basically Jenkins, will run all the unit test cases which we have written, merge the code into the master branch, and from there deploy the code into the higher environments. That's the Jenkins process which we are doing through the Git repo.
Absolutely, yes. So the non-feature branches will be there for the dev environment and the stage environment, and if everything is working fine, then we merge the code into the master branch, which is for the prod environment.
Yes, yes. So all the credentials we are storing in the key vaults, so nothing is hard-coded. And for Jenkins-related credentials, we can store those keys in the Jenkins secrets manager.
So there is a tab in Jenkins itself where you can actually store all the keys,
and in the Jenkins code you can refer to those keys basically.
okay.
Then I think we need to change the code again, basically, and we need to redeploy the code in the QA environment again. Ideally we will write the code in a modular fashion, so we don't have to make a lot of changes; we can just comment or uncomment those three features and redeploy.
So, okay, so in that case, even if it is present in the main branch,
my approach in these conditions will be like this: even if it's present in the main branch, we can always go back to the previous version; that's the advantage of GitHub. We can go to the previous versions, check them out, and redeploy those into the main branch. So normally, in the data engineering world, we create these pipelines, we do the testing, and only if everything is fine do we merge into the master. If we don't get the approval, we will not deploy into the main branch, right? And even if something goes into the main branch which is not correct, we can go back to the previous version of the code and we can do the changes.
Yeah, because those three features are not required anymore, right.
Or we can pull the existing code which is there, sorry.
Yes, that's what I can do.
So see, I mean, these are kind of application-based scenarios, right? These are not daily use cases which we face in the data engineering world; these are things we normally face from the application point of view. And for CI/CD we also get the template from the DevOps team, so we make the changes in those templates and then we merge those into the feature branch. So as a data engineer, we don't have to worry too much about how many features are going out or not; we have the DevOps team for that, right?
Yes, I have been working in Snowflake for around one and a half to two years, right. So I am very good at writing SQL queries, and I am also well aware of the Snowflake architecture, so yeah.
So we can store JSON data in a VARIANT column type. VARIANT is the data type which can store flexible semi-structured data.
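For reference, a minimal sketch of storing and querying JSON in a VARIANT column; the table and field names here are just illustrative:

-- Table with a VARIANT column to hold semi-structured JSON
CREATE OR REPLACE TABLE raw_events (
    event_id NUMBER,
    payload  VARIANT   -- stores the JSON document as-is
);

-- Load a JSON string into the VARIANT column
INSERT INTO raw_events
SELECT 1, PARSE_JSON('{"user": "alice", "action": "login", "ts": "2024-01-01T10:00:00"}');

-- Query individual JSON attributes with the colon path notation
SELECT payload:user::STRING   AS user_name,
       payload:action::STRING AS action
FROM raw_events;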
Definitely. So first of all, see, Snowflake is a distributed data warehouse system, right, with a multi-cluster shared data architecture. Snowflake follows a multi-cluster shared data architecture, which means that compute and storage are separate, they are decoupled. So data is stored in a centralised storage layer, and that storage layer is separate from the compute resources. Compute resources in Snowflake are organised into virtual warehouses, which are clusters of compute resources used for executing the queries. And we also have a metadata layer, which stores the metadata about the table schemas, users, roles and other things.
So basically yes, you can process data using Snowpipe. We are mostly using batch processing, but if you know...
No, so I was talking about how we can process the data in real time in Snowflake. The streams are also useful for capturing and processing changes made to the data in a table. So they actually provide a way to track modifications of rows within a table, and they also enable real-time data processing and analysis based on those changes. For example, if you want to implement Change Data Capture, a Snowflake stream can act as a mechanism for Change Data Capture; it can capture the insert, update and delete operations performed on rows within a table.
So see, whenever a view is queried, ultimately the underlying SQL queries are executed, but as far as I know, streams are not directly applicable to views; they can be implemented only on tables.
So if you want to check the status, basically we can check the status using SQL queries to retrieve information about the stream metadata. So if you want to see the list of streams, there is a command called SHOW STREAMS.
If you want to see a particular stream, like...
whether the stream captures changes or not, okay.
Okay. So basically, yes, I think we can do that.
I mean, from whatever I remember of using streams, I was using SHOW STREAMS, and we can give a LIKE clause with the stream name. This query will return the metadata about the stream, including its current status, whether the stream is active and not in an error state; those things I can check.
And also it will show you the size, the stream size and so on, the bytes of the data,
in bytes.
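For reference, a sketch of that kind of status check; the stream name here is just a placeholder:

-- List all streams in the current schema
SHOW STREAMS;

-- Filter to a particular stream by name pattern
SHOW STREAMS LIKE 'orders_stream';

-- Check whether the stream still has unconsumed change data
SELECT SYSTEM$STREAM_HAS_DATA('orders_stream');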
So, okay, no, we are not using streams much because, as I said, we are using Snowflake for batch processing. But you can query the contents of a stream directly with a SELECT statement, just like a table, and for the stream metadata you can also query the INFORMATION_SCHEMA views.
Yeah, so Snowpipe basically enables us to continuously auto-ingest the data from the source in near real time. So basically it is automatically scalable, it can load streaming data, and it can even load semi-structured data like log files, clickstream data or any real-time data.
No, no, we are not. We don't have access to install anything; that is part of the DevOps team's work. As developers we are just working on the code.
So see, it means basically Snowpipe doesn't support truncate operations directly. TRUNCATE is an operation which deletes all the rows from a table without logging individual row deletions, right? So Snowpipe doesn't directly support the truncate. So if you want to perform a truncate along with the loading using Snowpipe, we have to handle it as a separate step, or as part of our data loading process.
So basically, what I understood from your question is that you want to load a CSV file from the blob location and then load the data into the table using Snowpipe, right? So what I think is, first of all you have to create an external stage
which will contain the blob location where the CSV file is present. Then we can define the Snowpipe, and then we have to start the Snowpipe basically.
So basically, when we define the Snowpipe we have to...
Yeah, so see, first of all you have to create a Snowpipe and you have to ingest the data from that external stage. That command will be CREATE OR REPLACE PIPE with the pipe name, and you have to give AUTO_INGEST as TRUE, as COPY INTO the tbl_department table from the stage we have created, with the file format where TYPE is equal to CSV. That is the way we will define the Snowpipe, and then we have to start this pipe using the ALTER PIPE command,
I can run the command
with the pipe name, basically refreshing or resuming it so it starts loading.
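A rough sketch of the steps just described; the stage URL, pipe, table and file-format details are placeholders, and the credentials / storage integration and cloud notification setup needed for auto-ingest are omitted:

-- 1. External stage pointing at the blob location that holds the CSV files
CREATE OR REPLACE STAGE dept_stage
  URL = 'azure://myaccount.blob.core.windows.net/mycontainer/departments/'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
  -- (credentials / storage integration omitted for brevity)

-- 2. Snowpipe that auto-ingests new files from the stage into the table
CREATE OR REPLACE PIPE dept_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO tbl_department
  FROM @dept_stage
  FILE_FORMAT = (TYPE = CSV);

-- 3. Queue any files already sitting in the stage, or un-pause the pipe if needed
ALTER PIPE dept_pipe REFRESH;
-- ALTER PIPE dept_pipe SET PIPE_EXECUTION_PAUSED = FALSE;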
sorry
So I think we can query the INFORMATION_SCHEMA pipe history: if you do SELECT * FROM TABLE and, inside the brackets, the INFORMATION_SCHEMA pipe usage history function with the pipe name, it will allow us to query the history of the Snowpipe executions.
So this query will return a history of all the executions, including details such as the number of files loaded and the start and end times.
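Something along these lines; the COPY_HISTORY / PIPE_USAGE_HISTORY table functions are my reading of what was meant here, and the pipe and table names are placeholders:

-- Load history for the target table over the last day: files, rows, errors per file
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'TBL_DEPARTMENT',
       START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())));

-- Aggregated usage of the pipe itself: files inserted, bytes, time window
SELECT *
FROM TABLE(INFORMATION_SCHEMA.PIPE_USAGE_HISTORY(
       DATE_RANGE_START => DATEADD(hour, -24, CURRENT_TIMESTAMP()),
       PIPE_NAME        => 'DEPT_PIPE'));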
Sorry, sorry, what?
MERGE statement. So the MERGE statement is used for upsert logic. If you want to do update-and-insert (or delete) logic, you can use the MERGE statement. Suppose your requirement is to update the old records and insert the new records; you can easily do it using a MERGE statement.
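A minimal upsert sketch with made-up table and column names:

-- Upsert: update existing employees, insert the new ones
MERGE INTO employees AS tgt
USING staging_employees AS src
  ON tgt.employee_id = src.employee_id
WHEN MATCHED THEN
  UPDATE SET tgt.salary = src.salary,
             tgt.department = src.department
WHEN NOT MATCHED THEN
  INSERT (employee_id, salary, department)
  VALUES (src.employee_id, src.salary, src.department);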
SQL you're talking about?
Yeah, so we have inner join, left join, right join, cross join, full outer join.
left join.
So in the LEFT JOIN, basically you will get all the records from the left table and only the matching records from the right table.
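For example, on two illustrative tables:

-- Every employee, with the department name where one exists, NULL otherwise
SELECT e.employee_id, e.employee_name, d.department_name
FROM employees e
LEFT JOIN departments d
  ON e.department_id = d.department_id;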
SQL functions, aggregate functions and window functions we are talking about, right? Yeah, definitely.
Yeah, so if you want I can write the query quickly
Yeah, so first of all I will create a common table expression and I will apply a window function, ROW_NUMBER, partitioned by the columns on which I want to find the duplicate records, and I will assign a row number. Then, outside the CTE, I will just write a DELETE statement where the row number is greater than one.
Not greater than or equal to one, just greater than one.
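A sketch of that de-duplication, assuming an employees table where duplicates are defined by first_name, last_name and email; the numbering logic described above is written here as an inline subquery:

-- Keep one row per duplicate group and delete the rest
DELETE FROM employees
WHERE employee_id IN (
    SELECT employee_id
    FROM (
        SELECT employee_id,
               ROW_NUMBER() OVER (
                   PARTITION BY first_name, last_name, email
                   ORDER BY employee_id) AS rn
        FROM employees
    ) dup
    WHERE dup.rn > 1
);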
So see, yeah, there are multiple ways to transform the data.
So basically, since the distinct values are fixed, we can easily apply the CASE WHEN statement,
and there is also another way, which is to use the PIVOT function.
So see, there are two ways: either I can apply the CASE WHEN statement, or the other simple way is to use the PIVOT function.
Use a PIVOT function.
Yeah, so I have to write either five CASE WHEN statements, or simply I can write SELECT MAX(CASE WHEN id = 1 THEN salary END) AS salary_1; similarly the second segment will be MAX(CASE WHEN id = 2 THEN salary END) AS salary_2, and like this I can write the five different CASE WHEN statements.
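A sketch of both approaches on a hypothetical table with id and salary columns:

-- Conditional aggregation with CASE WHEN, one column per id value
SELECT
    MAX(CASE WHEN id = 1 THEN salary END) AS salary_1,
    MAX(CASE WHEN id = 2 THEN salary END) AS salary_2,
    MAX(CASE WHEN id = 3 THEN salary END) AS salary_3
FROM salaries;

-- Equivalent using Snowflake's PIVOT clause
SELECT *
FROM salaries
PIVOT (MAX(salary) FOR id IN (1, 2, 3)) AS p;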
So mostly I am working in the big data environment, so I'm using the data engineering ETL services and coding environments like ADF and Databricks. But yes, I have limited experience in Informatica, not that much.
See
Okay, so basically a CTE is nothing but a temporary result set within a query, which can then be referenced multiple times within the same query. So suppose you have multiple subqueries and you are using the same subquery multiple times in a query; you can use a CTE. Basically, it improves the readability and maintainability; that's the advantage of a CTE, whereas subqueries are like inline queries nested within another query.
So a CTE we will use when we need to reference the same temporary result set multiple times.
Yes, you can call this a recursive CTE. Yes, yes.
Yes, I will give you an example also. So the basic example of a recursive CTE is: suppose you have an employee table and you have the manager ID in that table, and if you want to find the level of each employee,
then you have to use a recursive CTE.
That is one of the use cases.
So that is my core competency: all the data engineering stuff is my core competency. I'm very good at writing SQL queries, Python scripts, and ETL pipelines. DevOps and all these things are additional stuff which we are doing apart from the data engineering work. So I'll say that 80% of my competency is in writing SQL queries, creating data pipelines, and doing automations through scripts.
Yes, absolutely.
Yes, definitely. So.
Yes, I have written stored procedures.
Mostly I have written stored procedures with SQL queries inside, sorry. So I will give you an example also, a real-time use case which we have deployed.
Yeah, so basically Snowflake...
Snowflake actually does.
Basically, Snowflake does have built-in support for creating stored procedures using procedural languages like SQL or JavaScript.
Sorry
No, I think it uses JavaScript and Python both, and Java also you can use.
Yes, yes.
But Python, I think, you can use through Snowpark.
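As a reference point, a small sketch of a Snowflake stored procedure written in SQL (Snowflake Scripting); the procedure name, table and logic are purely illustrative:

CREATE OR REPLACE PROCEDURE purge_old_rows(days_to_keep NUMBER)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    rows_deleted NUMBER DEFAULT 0;
BEGIN
    -- delete audit rows older than the requested retention window
    DELETE FROM audit_log
    WHERE event_ts < DATEADD(day, -1 * :days_to_keep, CURRENT_TIMESTAMP());
    rows_deleted := SQLROWCOUNT;
    RETURN 'Deleted ' || rows_deleted || ' rows';
END;
$$;

-- CALL purge_old_rows(90);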
Now, yes, I want to know the next steps of this interview and also the roles and responsibilities.
And to highlight here, yes, my expertise is in

Hey bro, can you please play a YouTube video


not clear bro
it's not very clear but yes just to repeat if
it's better, it's better but if I asked you to repeat try to repeat from your side
yeah check, check, check check my daughter is working or not please check it's not
working.
Okay fine, fine, fine fine
check whether it's working or not
it's better it's better
No, I didn't get your question. Can you please repeat once more?
No, I No no. Can you tell me the expected output
can you give me
So you want, I mean, these are the two columns you needed, right?
These are the two columns you needed.
And can you tell me the expected output the expected output will be
no no
expected?
This can be the output
Yeah it will be
okay
so, okay so in that case the output will be like something
so, you want employee ID Manager ID
so can Can I
Basically it's a self join.
So what I'm trying to do is: the manager name will represent the name of the manager for each employee, and if an employee does not have a manager,
then it will be represented as NULL, and the employee name column represents the name of each employee
who is under that manager.
Okay, so I think I have to write this using a recursive CTE.
Right, yeah, yeah, let me write it.
Yes, yes, yes, I mean, I got your point, so
yes, basically I mean I have to create a CTE.
So basically I have to define a CTE that consists of two parts: the anchor member selects the top-level managers, those without a manager, and the recursive member joins the recursive CTE with the employee table to find the employees of each manager, and the final query selects the manager-employee pairs from that recursive CTE.
Let me write it, let me write it. So if you want, I can write it.
This is a little longer, so just bear with me.
It's a difficult one.
Sorry, I was typing.
I have to write the outer query,
the final body.
Yeah, so basically the first part is where I'm selecting the top-level managers, who don't have a manager,
and after that I'm doing a recursive join with the employees table to find out their managers and the level,
and the level column.
Got it.
I got your point, I got your point.
Yeah, I got your point.
This is the query.
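A sketch of the recursive CTE being described here, assuming an employees table with employee_id, employee_name and manager_id columns (all names illustrative):

WITH RECURSIVE emp_hierarchy AS (
    -- anchor member: top-level managers (no manager)
    SELECT employee_id, employee_name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- recursive member: employees reporting to someone already in the hierarchy
    SELECT e.employee_id, e.employee_name, e.manager_id, h.level + 1
    FROM employees e
    JOIN emp_hierarchy h
      ON e.manager_id = h.employee_id
)
SELECT m.employee_name AS manager_name,   -- NULL for top-level managers
       h.employee_name,
       h.level
FROM emp_hierarchy h
LEFT JOIN employees m
  ON h.manager_id = m.employee_id
ORDER BY h.level;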
So basically the key here is to use DENSE_RANK rather than RANK.
No, basically you just need the salary in the ORDER BY; you have not asked me to use any department-wise salary,
so a PARTITION BY is not required.
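For example, to get the second-highest salary with DENSE_RANK (so that ties don't skip ranks); the employees table is illustrative:

SELECT salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
) ranked
WHERE rnk = 2;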
The previous one was the difficult one, this was the easy one, tell him.
No, so let me tell you what type of experience I have in Snowflake. So basically, as a data engineer, I have ingested data into Snowflake from external sources, and within Snowflake I have done a lot of data analysis by writing SQL queries, and I have written stored procedures.
Yes, definitely.
Yes, yes. Yeah, I mean, I have used SQL mostly and I have also used JavaScript.
can you please repeat
No, I'm not able to understand you, please repeat.
please repeat properly. Now, please repeat properly.
Yes, no.
Yes, yes, definitely. But I mean I was not able to hear the problem.
So, okay, so
Right, right. So let me take an example first. So basically, I will create a table, and whatever changes happen in that table, I will capture those changes using a Snowflake stream. So I will have to first create a stream on that particular table, and then we can
Okay, so you're saying that means even a table
So the normal use case is: first of all I have to create a table, and then I have to create a stream on that particular table.
so
Okay, so first of all, I have to create the stream
on that particular table.
So, so see, once the table, yeah, the stream is there, suppose I'm inserting records into table one.
So this is the stream, basically, that I've created on table one.
Then I have to basically query the stream,
and this query will show the changes captured by the stream. So initially it will return all the existing rows in that table, and after that it will show any rows added to the table in real time. Now if you want to continuously capture the changes from the stream and put those changes into another table, then this is the command.
So this is a complete task.
So basically this task will run every five minutes and insert new rows from the stream into the target table.
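Putting the pieces just described into one sketch; the table, stream, task and warehouse names are placeholders:

-- Source table and a stream that records changes made to it
CREATE OR REPLACE TABLE table_one (id NUMBER, val STRING);
CREATE OR REPLACE STREAM table_one_stream ON TABLE table_one;

-- Querying the stream shows the captured changes
SELECT * FROM table_one_stream;

-- Task that wakes up every five minutes and moves new rows into a target table
CREATE OR REPLACE TASK load_from_stream
  WAREHOUSE = my_wh
  SCHEDULE  = '5 MINUTE'
AS
  INSERT INTO target_table (id, val)
  SELECT id, val
  FROM table_one_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended, so start it explicitly
ALTER TASK load_from_stream RESUME;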
Let me think
So see, basically when I'm creating a stream, I am creating it on top of a certain table. So basically a stream is associated with a single table and captures changes made to that table. So you cannot directly create a stream on multiple tables using a join.
Sorry
No, no, I didn't get you, can you repeat?
So by status, what exactly do you want to check?
So basically, yeah, so see, if you want to check this.
So basically, if you want to check the status, normally we do SHOW STREAMS,
like SHOW STREAMS LIKE and the stream name,
and SELECT * FROM the stream is also possible.
So it will give you all the records; the query will show the changes captured by the stream.
No, I mean, whatever changes are there.
No, there will be a few metadata columns, like the METADATA$ACTION column, which indicates the type of operation, like insert, update or delete. There are also other metadata columns, like METADATA$ISUPDATE, which indicates whether it's an updated record or not, so it will give you true or false, and it will also give you a row ID and timestamp.
So these are the additional columns we'll get.
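For instance, selecting those metadata columns from the stream sketched earlier (table_one_stream is hypothetical):

SELECT id,
       val,
       METADATA$ACTION   AS change_action,   -- INSERT or DELETE
       METADATA$ISUPDATE AS is_update,       -- TRUE when the row is part of an update
       METADATA$ROW_ID   AS row_id           -- unique, immutable id of the changed row
FROM table_one_stream;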
sorry.
So you can write a MERGE statement. It will take the delta data from table two, and it will do an update on the old records and insert the new records, because you have a primary key.
Yeah, so when you write a MERGE statement, you match on the primary key column: if it exists, then it will update, and if it does not exist, then it will do an insert.
So if you want, I can write the exact query for you.
So you're saying the primary key itself has been updated, right?
So how to identify whether it's... no, no, then there is no way I can identify whether it is a new record or an old record.
So that means in that case I need some identifier to identify whether it's an old record or a new record.
Maybe, yes, that's what I'm saying, maybe we can use some composite key,
means a combination of a few keys to identify whether it's an old record or not,
like first name, last name and all those things,
email ID.
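A sketch of the MERGE being discussed, using a composite key of first_name, last_name and email to decide between update and insert; all table and column names are illustrative:

MERGE INTO customers AS tgt
USING customers_delta AS src
  ON  tgt.first_name = src.first_name
  AND tgt.last_name  = src.last_name
  AND tgt.email      = src.email
WHEN MATCHED THEN
  UPDATE SET tgt.phone = src.phone,
             tgt.city  = src.city
WHEN NOT MATCHED THEN
  INSERT (first_name, last_name, email, phone, city)
  VALUES (src.first_name, src.last_name, src.email, src.phone, src.city);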
Yes, I have worked with CI/CD.
Yes, I have used Terraform. Basically we are using Terraform if you want to create any resources. For example, if you want to create a bucket or if you want to create a new table, then we change our existing Terraform scripts and push those changes into GitHub, and the GitHub Actions will trigger the script which will deploy all those changes.
Yeah, we are using GitHub; I mean it's not like Azure, it's GitHub.
So I mean, I think I can go to the previous version, and then I will deploy all the changes and I will incorporate his changes as well. So if I encounter a case where our local branch is behind some feature branch, then I will fetch the changes from the remote.
So what I was thinking is, I will fetch the changes from the remote repository using git fetch origin, and then I will check out our changes in the local branch, and then I will merge the changes from the remote branch.
That way it will resolve any merge conflicts.
So basically I will use git fetch origin, and then I will use git checkout...
git fetch origin, and then I will use...
I will not remove the files.
I will not untrack the files.
Absolutely, let me tell you the steps which I will do. Bro, please speak up.
First, I will fetch the changes from the remote repository using git fetch origin.
then is it possible?
