Medica Interview Answers
So currently we are
using Jenkins for the deployment of the code into the higher environments. And
Terraform you're using for creating resources?
Not that much, not extensively. Basically, I know it as a database schema change
management tool, meaning it supports tracking and managing any changes to the
database schemas. So it is essentially a kind of version control for the database
schema.
So most of my data pipelines were written in PySpark: ingesting the data, cleaning
the data and loading it into Snowflake. We are deploying the code using CI/CD, so we
push the changes into the feature branch.
And then our CI, basically Jenkins, will run all the unit test cases which we have
written, then it will merge the code into the master branch, and from there it will
deploy the code into the higher environments. So that's the Jenkins process which we
are doing through the Git repo.
Absolutely, yes. So the non-feature branch will be there for the dev environment and
the stage environment. And then, if everything is working fine, we merge the code
into the master branch, which is for the prod environment.
Yes, yes. So all the credentials we are storing in the key vaults, so nothing is
hard-coded. And for Jenkins-related credentials, we can store those keys in the
Jenkins secrets manager, basically.
So there is a tab in Jenkins itself where you can store all the keys, and in the
Jenkins code you can refer to those keys.
okay.
Then I think we need to change the code again, basically, and we need to redeploy
the code in the QA environment again. Maybe we will write the code in a modular
fashion, so we don't have to make lots of changes; we can just comment out or
uncomment those three features and deploy.
So, okay, so in that case, if it is already present in the main branch, my approach
will be: even if it is present in the main branch, we can always go back to the
previous version; that's the advantage of GitHub. We can go to the previous
versions, make the changes and redeploy into the main branch. Normally, in the data
engineering world, we create these pipelines, we do the testing, and only when
everything is fine do we merge into the master. If we don't get the approval, we
will not deploy into the main branch. And even if something which is not correct
goes into the main branch, we can go back to the previous version of the code and
make the changes.
Yeah, because those three features are not required anymore, right.
Or we can pull the existing code which is there, sorry.
Yes, that's what I can do.
So see, these are kind of application-based scenarios, right? These are not daily
use cases which we face in the data engineering world; these are things we normally
face from the application point of view. And for this CI/CD also, we get the
template from the DevOps team, so we make the changes in those templates and then
merge them into the feature branch. So as a data engineer we don't have to worry too
much about how many features are going in or not; we have the DevOps team for that,
right?
Yes, I have been working in Snowflake for around one and a half to two years. I am
very good at writing SQL queries, and I am also well aware of the Snowflake
architecture, so yeah.
So we can store JSON data in a VARIANT column. VARIANT is the data type which can
store flexible semi-structured data.
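For reference, a minimal Snowflake SQL sketch of that idea; the table and column
names here are made-up examples, not from the project:

    -- Create a table with a VARIANT column to hold semi-structured JSON
    CREATE OR REPLACE TABLE raw_events (
        event_id NUMBER,
        payload  VARIANT        -- stores the JSON document as-is
    );

    -- Load a JSON document by parsing a string into VARIANT
    INSERT INTO raw_events
    SELECT 1, PARSE_JSON('{"user": "alice", "action": "login", "device": {"os": "ios"}}');

    -- Query nested JSON attributes with the colon / dot notation
    SELECT
        payload:user::STRING      AS user_name,
        payload:device.os::STRING AS device_os
    FROM raw_events;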
Definitely. So first of all, Snowflake is a distributed data warehouse system with a
multi-cluster shared data architecture. Snowflake following a multi-cluster shared
data architecture means that compute and storage are separated, decoupled. Data is
stored in a centralised storage layer managed by the Snowflake platform, and that
storage layer is separate from the compute resources. Compute resources in Snowflake
are organised into virtual warehouses, which are clusters of compute resources used
for executing the queries. We also have a metadata layer, which stores the metadata
about the table schemas, users, roles and other things.
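As a small illustration of the compute side of that architecture, a hedged Snowflake
SQL sketch; the warehouse name and sizing are illustrative assumptions only:

    -- A virtual warehouse is a named cluster of compute, independent of storage
    CREATE OR REPLACE WAREHOUSE etl_wh
        WAREHOUSE_SIZE = 'SMALL'
        AUTO_SUSPEND   = 300      -- suspend after 5 minutes idle to stop compute cost
        AUTO_RESUME    = TRUE;    -- resume automatically when a query arrives

    -- Queries run on whichever warehouse the session points at; the data is shared
    USE WAREHOUSE etl_wh;
    SELECT COUNT(*) FROM raw_events;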
So basically yes, you can process data using Snowpipe. We are mostly using batch
processing, but no, so I was talking about how we can process the data in real time
in Snowflake. The streams are also useful for capturing and processing changes made
to the data in a table. They provide a way to track modifications to rows within a
table, and they enable real-time data processing and analysis based on those
changes. For example, if you want to implement Change Data Capture, a Snowflake
stream can act as the mechanism for Change Data Capture; it can capture the insert,
update and delete operations performed on rows within a table.
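A rough sketch of that CDC pattern in Snowflake SQL, assuming hypothetical orders
and orders_history tables with made-up columns:

    -- Create a stream that records inserts, updates and deletes on the source table
    CREATE OR REPLACE STREAM orders_stream ON TABLE orders;

    -- The stream exposes changed rows plus metadata columns describing each change
    SELECT * FROM orders_stream;
    -- includes METADATA$ACTION (INSERT/DELETE) and METADATA$ISUPDATE

    -- Consuming the stream in a DML statement advances its offset,
    -- so the next read only sees changes made after this point
    INSERT INTO orders_history (order_id, status, changed_at)
    SELECT order_id, status, CURRENT_TIMESTAMP()
    FROM orders_stream
    WHERE METADATA$ACTION = 'INSERT';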
So see, whenever a view is created, ultimately the SQL queries are executed, but as
far as I know, streams are not directly applicable to views; they can be implemented
only on tables.
So if you want to check the status, basically we can check it using SQL queries that
retrieve information about the stream metadata. If you want to see the list of
streams, there is a command called SHOW STREAMS. If you want to see a particular
stream, like whether the stream captures changes or not, okay.
Okay. So basically, I think we can do that. From what I remember when using streams,
I was using SHOW STREAMS, and we can give a LIKE clause with the stream name. This
query will return the metadata about the stream, including its current status, so I
can check whether the stream is active and not in an error state. It will also show
you the stream size and everything, the bytes of data, in bytes.
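Roughly the commands being described, sketched in Snowflake SQL; the stream name is
a placeholder:

    -- List all streams visible to the current role, or filter by name pattern
    SHOW STREAMS;
    SHOW STREAMS LIKE 'orders_stream';
    -- The output includes the stream's owner, source table, mode and stale flag

    -- Check whether a stream currently has unconsumed change records
    SELECT SYSTEM$STREAM_HAS_DATA('orders_stream');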
So, okay, no, we are not using streams because, as I said, we are using Snowflake
for batch processing. But you cannot directly query the contents of a stream using a
SELECT * statement; you have to query the information schema views, basically, for
something like a stream.
Yeah, so Snowpipe basically enables us to continuously auto-ingest the data from the
source in near real time. It is automatically scalable, it can load the streaming
data, and it can even load semi-structured data like log files, clickstream data or
any real-time data.
No, no, we are not. We don't have access to install anything; that is the part of
the DevOps team itself. As developers, we are just doing the development work.
So see, basically it doesn't support the truncate operation directly. Truncate is an
operation which deletes all the rows from a table without logging individual row
deletions, right? So Snowpipe doesn't directly support truncate. If you want to
perform a truncate along with the loading using Snowpipe, we have to handle it as a
separate step, or as part of our data loading process.
So basically, what I understood from your question is that you want to take a CSV
file from the blob location and load the data into the table using Snowpipe, right?
So what I think is, first of all, you have to create an external stage which will
contain the blob location where the CSV file is present. Then we can define the
Snowpipe, and then we have to start the Snowpipe, basically.
So basically, when we define the Snowpipe, we have to... Yeah, so see, first of all,
you have to create a Snowpipe and you have to ingest the data from that external
stage. The command will be CREATE OR REPLACE PIPE with the pipe name, and you have
to give AUTO_INGEST as true, AS COPY INTO this tbl_department table from the stage
we have created, with the file format where the type is equal to CSV. That is the
way we will define the Snowpipe, and then we have to start this pipe using the ALTER
command, like ALTER PIPE with the pipe name and resume.
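Putting those steps together, a hedged Snowflake SQL sketch; the stage URL, storage
integration, pipe and table names are assumptions, the Azure event notification
setup for auto-ingest is assumed to exist already, and the resume step here uses the
PIPE_EXECUTION_PAUSED parameter rather than a literal RESUME keyword:

    -- 1. External stage pointing at the blob container holding the CSV files
    CREATE OR REPLACE STAGE dept_stage
        URL = 'azure://myaccount.blob.core.windows.net/department-files'
        STORAGE_INTEGRATION = my_azure_int      -- assumed pre-created integration
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- 2. Snowpipe that auto-ingests new files landing on the stage
    CREATE OR REPLACE PIPE dept_pipe
        AUTO_INGEST = TRUE
    AS
        COPY INTO tbl_department
        FROM @dept_stage
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- 3. Make sure the pipe is running (i.e. not paused)
    ALTER PIPE dept_pipe SET PIPE_EXECUTION_PAUSED = FALSE;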
So I think we can query the information schema pipe usage history. If you do a
SELECT * FROM TABLE, and in the brackets INFORMATION_SCHEMA dot PIPE_USAGE_HISTORY,
and give the pipe name, it will allow us to query the history of the Snowpipe
executions. This query will return a history of the executions, including details
such as the number of files loaded and the start and end times.
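A sketch of that check, reusing the pipe name assumed above; note the table function
is PIPE_USAGE_HISTORY in INFORMATION_SCHEMA (COPY_HISTORY is the other common option
for per-file detail):

    -- Load activity for a specific pipe over the last 24 hours
    SELECT *
    FROM TABLE(
        INFORMATION_SCHEMA.PIPE_USAGE_HISTORY(
            DATE_RANGE_START => DATEADD('hour', -24, CURRENT_TIMESTAMP()),
            PIPE_NAME        => 'dept_pipe'
        )
    );
    -- Columns include START_TIME, END_TIME, FILES_INSERTED and BYTES_INSERTED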
Sorry, sorry, what?
MERGE statement. So the MERGE statement is used for upsert logic. If you want to do
update, insert or delete logic, you can use the MERGE statement. Suppose your
requirement is to update the old records and insert the new records; you can easily
do it using a MERGE statement.
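A small upsert sketch with MERGE, assuming hypothetical employees and
employees_staging tables:

    -- Update existing employees and insert the new ones in a single statement
    MERGE INTO employees AS tgt
    USING employees_staging AS src
        ON tgt.emp_id = src.emp_id
    WHEN MATCHED THEN
        UPDATE SET tgt.salary = src.salary,
                   tgt.dept   = src.dept
    WHEN NOT MATCHED THEN
        INSERT (emp_id, salary, dept)
        VALUES (src.emp_id, src.salary, src.dept);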
SQL you're talking about?
Yeah, so we have inner join, left join, right join, cross join and full outer join.
left join.
So in a LEFT JOIN, basically you will get all the records from the left table and
only the matching records from the right table.
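For example, a simple sketch assuming employees and departments tables:

    -- Every employee is returned; dept_name is NULL where no department matches
    SELECT e.emp_id,
           e.emp_name,
           d.dept_name
    FROM employees e
    LEFT JOIN departments d
           ON e.dept_id = d.dept_id;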
SQL functions, aggregate functions and window functions we are talking about, right?
Yeah, definitely.
Yeah, so if you want I can write the query quickly
Yeah, so first of all, I will create a common table expression and I will apply a
window function, ROW_NUMBER, partitioned by the columns on which I want to find the
duplicate records, and I will assign a rank. Then, outside the CTE, I will just
write a DELETE statement where the rank is greater than one. Not greater than or
equal to one, just greater than one.
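A sketch of that de-duplication in Snowflake SQL, assuming the table has a unique
emp_id and that duplicates are defined by name and department; here the ranked
subquery is joined through DELETE ... USING:

    -- Keep one row per (emp_name, dept_id); delete every extra copy
    DELETE FROM employees
    USING (
        SELECT emp_id,
               ROW_NUMBER() OVER (
                   PARTITION BY emp_name, dept_id
                   ORDER BY emp_id
               ) AS rn
        FROM employees
    ) dup
    WHERE employees.emp_id = dup.emp_id
      AND dup.rn > 1;        -- rank 1 is kept, anything greater is a duplicate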
So see, yeah, there are multiple ways to transform the data. Basically, since the
distinct values are fixed, we can easily apply a CASE WHEN statement, and there is
also another way, which is to use the PIVOT function. So there are two ways: either
I can apply the CASE WHEN statement, or a simpler way is to use the PIVOT function.
Yeah, so I have to write either five CASE WHEN statements, or simply I can write it
as SELECT MAX(CASE WHEN id is equal to 1 THEN salary END) AS salary_1. Similarly,
the second expression will be MAX(CASE WHEN id is equal to 2 THEN salary END) AS
salary_2, and like this I can write the five different CASE WHEN statements.
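A sketch of both options, assuming a hypothetical salaries table with an id column
holding the values 1 to 5:

    -- Option 1: conditional aggregation with CASE WHEN, one expression per id value
    SELECT
        MAX(CASE WHEN id = 1 THEN salary END) AS salary_1,
        MAX(CASE WHEN id = 2 THEN salary END) AS salary_2,
        MAX(CASE WHEN id = 3 THEN salary END) AS salary_3,
        MAX(CASE WHEN id = 4 THEN salary END) AS salary_4,
        MAX(CASE WHEN id = 5 THEN salary END) AS salary_5
    FROM salaries;

    -- Option 2: the built-in PIVOT clause does the same thing more compactly
    SELECT *
    FROM salaries
        PIVOT (MAX(salary) FOR id IN (1, 2, 3, 4, 5));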
So mostly I am working in the big data environment, so I'm using data engineering
ETL services and coding environments like ADF and Databricks. But yes, I have
limited experience in Informatica, not that much.
See
Okay, so basically a CTE is nothing but a temporary result set within a query, which
can then be referenced multiple times within the same query. So suppose you have
multiple subqueries and you are using the same subquery multiple times in a query;
you can use a CTE. Basically, it improves the readability and maintainability, and
that's the advantage of a CTE, whereas subqueries are like inline queries nested
within another query. So a CTE is what we will use when we need to reference the
same temporary result set.
Yes, you can call this a recursive CTE. Yes, yes.
Yes, I will give you an example also. A basic example of a recursive CTE is: suppose
you have an employee table and you have the manager ID in that table, and you want
to find the level of each employee; then you have to use a recursive CTE. That is
one of the use cases.
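A sketch of that hierarchy query, assuming an employees table with emp_id, emp_name
and manager_id (NULL for the top-level manager):

    WITH RECURSIVE emp_levels AS (
        -- Anchor: employees with no manager sit at level 1
        SELECT emp_id, emp_name, manager_id, 1 AS emp_level
        FROM employees
        WHERE manager_id IS NULL

        UNION ALL

        -- Recursive step: each report is one level below their manager
        SELECT e.emp_id, e.emp_name, e.manager_id, m.emp_level + 1
        FROM employees e
        JOIN emp_levels m
          ON e.manager_id = m.emp_id
    )
    SELECT emp_id, emp_name, emp_level
    FROM emp_levels
    ORDER BY emp_level, emp_id;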
So data, so that is my core competency; all the data engineering work is my core
competency. I'm very good at writing SQL queries, Python scripts and ETL pipelines.
DevOps and all these things are the additional pieces which we are doing apart from
the data engineering work. So I'll say that 80% of my competency is in writing SQL
queries, creating data pipelines and doing automation through scripts.
Yes, absolutely.
Yes, definitely. So.
Yes, I have written stored procedures.
Mostly I have written stored procedures with SQL queries inside. Sorry, so I will
give you an example also, and a real-time use case which we have deployed. Yeah, so
basically Snowflake, Snowflake actually does. Basically, Snowflake does have
built-in support for creating stored procedures using a procedural language like SQL
or JavaScript.
Sorry
No, I think it uses both JavaScript and Python, and you can use Java also.
Yes, yes.
But I think Python you can use through Snowpark.
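A minimal sketch of a Snowflake Scripting (SQL) stored procedure, just to show the
shape; the procedure name, table, column and retention logic here are all made-up
assumptions, not the procedures from the project:

    -- Delete staging rows older than a given number of days and report the result
    CREATE OR REPLACE PROCEDURE purge_stale_staging(days_to_keep INTEGER)
    RETURNS VARCHAR
    LANGUAGE SQL
    AS
    $$
    DECLARE
        cutoff DATE;
    BEGIN
        cutoff := DATEADD('day', -days_to_keep, CURRENT_DATE());

        DELETE FROM sales_staging
        WHERE load_date < :cutoff;   -- bind the scripting variable into the SQL

        RETURN 'Purged rows older than ' || days_to_keep || ' days';
    END;
    $$;

    -- Invoke it like any other procedure
    CALL purge_stale_staging(30);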
Now, yes, I want to know the next steps of this interview process and also the roles
and responsibilities. And to highlight here, yes, my expertise is in