Ankit Data Engineer Resume
[email protected]
+91 7703802034
PROFESSIONAL SUMMARY
Technical Skills:
Work Experience:
• Collaborating with business stakeholders to gather requirements for the data warehouse & reporting.
• Extracting, transforming, and loading data from different sources to Azure Data Storage services using Azure Data Factory and T-SQL to perform data lake analytics.
• Working on data transformations for MLOps: adding calculated columns, managing relationships, creating measures, merging & appending queries, replacing values, splitting columns, grouping, and handling date & time columns.
• Data Ingestion to Azure Services - Azure Data Lake, Azure Storage, Azure SQL,
Azure DW, and processing the data in Azure Databricks.
• Creating batches and sessions to move data at specific intervals and on-demand
using Server Manager.
• Blending multiple data connections and creating joins across the various data sources for data preparation.
• Extracting data from data lakes and the EDW to relational databases for analysis and more meaningful insights using SQL queries and PySpark.
• Developing PL/SQL scripts to extract data from multiple data sources and transform it into a format that can be easily analyzed.
• Developing Python scripts for file validations in Databricks and automating the process using ADF (see the first sketch at the end of this section).
• Developing JSON scripts for deploying data-processing pipelines in Azure Data Factory (ADF) using the SQL activity.
• Supporting production data pipelines, including performance tuning and troubleshooting of SQL, Spark, and Python scripts.
• Developing an audit, balance, and control framework using SQL DB audit tables to control the ingestion, transformation, and load process in Azure (see the second sketch at the end of this section).
• Creating tables in Azure SQL DW for data reporting and visualization for business
requirements.
• Creating visualization reports, dashboards, and KPI scorecards using Power BI
desktop.
• Designing, developing, and deploying ETL solutions using SQL Server
Integration Services (SSIS).
• Connecting various applications to the existing database and creating databases and schema objects, including indexes and tables, by writing various functions, stored procedures, and triggers.
• Performing normalization and de-normalization of existing tables, with effective use of joins & indexes, for query optimization and fast query retrieval.
• Creating and monitoring alerts on data integration events (success/failure).
• Collaborating with product managers, scrum masters, and engineers on Agile practices and documentation initiatives, contributing to retrospectives, backlog grooming, and meetings.
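
The file-validation work above (Python scripts in Databricks triggered from ADF) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the ADLS Gen2 path, the expected schema, and the validation rules are placeholders, not actual project values.

    # Hypothetical file-validation step run as a Databricks job and triggered by an
    # Azure Data Factory pipeline; the path and schema below are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("adls_file_validation").getOrCreate()

    expected_schema = StructType([
        StructField("order_id", StringType(), False),
        StructField("order_date", StringType(), True),
        StructField("amount", DoubleType(), True),
    ])

    # Assumed ADLS Gen2 landing path, normally passed in as an ADF pipeline parameter.
    landing_path = "abfss://landing@examplestorage.dfs.core.windows.net/orders/"

    df = spark.read.schema(expected_schema).option("header", "true").csv(landing_path)

    # Basic checks: non-empty file and no null business keys.
    row_count = df.count()
    null_keys = df.filter(df.order_id.isNull()).count()

    if row_count == 0 or null_keys > 0:
        # Raising makes the Databricks activity fail so ADF can alert or retry.
        raise ValueError(f"Validation failed: rows={row_count}, null keys={null_keys}")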
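The audit, balance, and control point above: a rough sketch of logging run metrics to a SQL DB audit table with pyodbc. The server, credentials, and the audit.pipeline_run table and its columns are hypothetical placeholders.

    # Hedged sketch: record one audit row per pipeline step in an Azure SQL audit table.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=example-server.database.windows.net;"
        "DATABASE=etl_control;UID=etl_user;PWD=<secret>"
    )

    def log_pipeline_run(pipeline_name, rows_read, rows_loaded, status):
        # A downstream balance check compares rows_read vs rows_loaded per run.
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO audit.pipeline_run "
            "(pipeline_name, rows_read, rows_loaded, status, run_ts) "
            "VALUES (?, ?, ?, ?, SYSUTCDATETIME())",
            pipeline_name, rows_read, rows_loaded, status,
        )
        conn.commit()
        cur.close()

    log_pipeline_run("adf_orders_ingest", rows_read=1000, rows_loaded=1000, status="SUCCEEDED")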
Responsibilities:
• Designed a data pipeline to automate the ingestion, processing, and delivery of batch and streaming data using Spark, AWS EMR clusters, Lambda, and Databricks.
• Developed Airflow automation and Python scripts for batch data processing,
ETL, and data warehouse ingestion using AWS Lambda Python functions, Elastic
Kubernetes Service (EKS), and S3.
• Ingested data into a data lake (S3) and used AWS Glue to expose the data to Redshift.
• Configured EMR cluster for data ingestion and used dbt (data build tool) to
transform the data in Redshift.
• Ran batch processing to calculate associated risk and generate feeds to other systems such as Discounted Cash Flow (DCF), PnL, and the Europe credit platform for pricing strategy.
• Wrote & tested SQL code for transformations using the data build tool.
• Designed and developed a data architecture to load data from AWS S3 to Snowflake via Airflow by creating DAGs, and processed data for data visualization tools (see the first sketch at the end of this section).
• Worked on creating data pipelines with Airflow to schedule AWS jobs for
performing incremental loads and used Flume for weblog server data.
• Scheduled Apache Airflow jobs on a cluster to automate the ingestion process into the data lake.
• Evaluated Snowflake design considerations for any change in the application.
• Developed PL/SQL procedure to load data into a data warehouse.
• Wrote Python scripts and used Airflow DAGs to automate the process of
extracting weblogs.
• Developed and implemented Hive Bucketing and Partitioning.
• Loaded data into S3 buckets using AWS Glue and PySpark; filtered data stored in S3 buckets using Elasticsearch and loaded it into Snowflake and Hive external tables.
• Worked on financial spreading by developing scalable applications for real-time ingestion into various databases using AWS Kinesis, performing the necessary transformations and aggregations to build the common learner data model, and storing the data in HBase.
• Orchestrated multiple ETL jobs using AWS Step Functions and Lambda, and used AWS Glue to load and prepare data for customer analytics.
• Worked with AWS Lambda to run code without managing servers, with runs triggered by S3 and SNS events (see the second sketch at the end of this section).
• Developed data transition programs from DynamoDB to AWS Redshift (ETL
Process) using AWS Lambda by creating functions in Python for certain events
based on use cases.
• Implemented the AWS cloud computing platform by using RDS, Python,
DynamoDB, S3, and Redshift.
• Worked on developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the third sketch at the end of this section).
• Worked with various formats of files like delimited text files, clickstream log
files, Apache log files, Avro files, JSON files, and XML Files.
• Proficient in using different columnar file formats such as RC, ORC, and Parquet.
• Performed database activities such as indexing and performance tuning.
• Collected data using Spark Streaming from AWS S3 bucket in near-real-time and
performed necessary transformations on the fly to build the common learner data
model.
• Responsible for loading and transforming huge sets of structured, semi-
structured, and unstructured data.
• Used AWS EMR to create Hadoop and Spark clusters, which were used to submit and execute Python applications in production.
• Designed and developed end-to-end ETL processing from Oracle to AWS using
Amazon S3, EMR, and Spark.
• Worked on a CI/CD solution using Git, Jenkins, Docker, and Kubernetes to set up and configure the big data architecture on the AWS cloud platform.
• Wrote SQL and PL/SQL scripts to extract data from the database to meet business requirements and for testing purposes.
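
For the S3-to-Snowflake load via Airflow mentioned above, a minimal DAG sketch follows. The DAG id, schedule, connection details, stage, and table names are assumptions; credentials would normally come from Airflow connections or a secrets backend rather than literals.

    # Illustrative Airflow DAG: copy files from an S3-backed Snowflake stage into a table.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    import snowflake.connector

    def copy_s3_to_snowflake(**_):
        # Assumes an external stage (@raw_stage) already points at the S3 bucket.
        conn = snowflake.connector.connect(
            account="example_account", user="etl_user", password="<secret>",
            warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
        )
        try:
            conn.cursor().execute(
                "COPY INTO raw.orders FROM @raw_stage/orders/ "
                "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1) ON_ERROR = 'ABORT_STATEMENT'"
            )
        finally:
            conn.close()

    with DAG(
        dag_id="s3_to_snowflake_orders",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="copy_orders", python_callable=copy_s3_to_snowflake)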
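For the event-driven Lambda work above, a hedged sketch of a handler that reacts to an S3 put notification and issues a Redshift COPY through the Redshift Data API; the cluster, database, table, and IAM role are illustrative assumptions.

    # Hypothetical Lambda handler: an S3 object-created event triggers a COPY into Redshift.
    import boto3

    redshift_data = boto3.client("redshift-data")

    def handler(event, context):
        # Pull the newly landed object out of the S3 event payload.
        record = event["Records"][0]["s3"]
        bucket, key = record["bucket"]["name"], record["object"]["key"]

        redshift_data.execute_statement(
            ClusterIdentifier="example-cluster",
            Database="analytics",
            DbUser="etl_user",
            Sql=(
                f"COPY staging.events FROM 's3://{bucket}/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
                "FORMAT AS JSON 'auto'"
            ),
        )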
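The Spark SQL/Databricks bullet above follows the multi-format extract-and-aggregate pattern; this sketch assumes hypothetical bucket paths, join keys, and column names.

    # Rough sketch: read JSON, CSV, and Parquet sources, join them, and aggregate usage metrics.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("usage_aggregation").getOrCreate()

    clicks = spark.read.json("s3://example-bucket/raw/clickstream/")               # JSON logs
    accounts = spark.read.option("header", "true").csv("s3://example-bucket/raw/accounts/")
    events = spark.read.parquet("s3://example-bucket/curated/events/")             # columnar

    usage = (
        clicks.join(accounts, "account_id")
              .join(events, "event_id")
              .groupBy("account_id", F.to_date("event_ts").alias("event_date"))
              .agg(F.count("*").alias("events"),
                   F.countDistinct("session_id").alias("sessions"))
    )

    usage.write.mode("overwrite").parquet("s3://example-bucket/marts/customer_usage/")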
Client: Intuit
Location: Plano, Texas
Role: Big Data Developer
June 2019 - Aug 2020
Responsibilities:
• Wrote MapReduce code to process all the log files against rules defined in HDFS (log files generated by different devices follow different XML rules).
• Involved in migrating the existing on-premises Hive code to GCP (Google Cloud Platform) BigQuery.
• Involved in migrating an Oracle SQL ETL to run on Google Cloud Platform using cloud data processing services & BigQuery, with Cloud Pub/Sub triggering the Apache Airflow jobs.
• Developed and designed application to process data using Spark.
• Developed MapReduce jobs and Hive & Pig scripts for the data warehouse migration project.
• Developed stored procedures/views in Snowflake and used them in Talend for loading dimensions and facts.
• Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark (see the first sketch at the end of this section).
• Developed MapReduce jobs and Hive & Pig scripts for the Risk & Fraud Analytics platform.
• Developed a data ingestion platform using Sqoop and Flume to ingest Twitter and Facebook data for the Marketing & Offers platform.
• Designed and developed automated processes using shell scripting for data movement and purging.
• Developed programs in Java and Scala (Spark) to reformat data extracted from HDFS for analysis.
• Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL.
• Participated in the development, improvement, and maintenance of the Snowflake database application.
• Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data (see the second sketch at the end of this section).
• Imported and exported data into Impala, HDFS, and Hive using Sqoop.
• Responsible for managing data coming from different sources.
• Implemented Partitioning, Dynamic Partitions and Buckets in HIVE for efficient
data access.
• Developed Hive tables to transform, analyze the data in HDFS.
• Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
• Developed simple to complex MapReduce jobs using Hive and Pig.
• Involved in running Hadoop Jobs for processing millions of records of text data.
• Developed the application by using the Struts framework.
• Created connection through JDBC and used JDBC statements to call stored
procedures.
• Developed Pig Latin scripts to extract the data from the web server output files
to load into HDFS.
• Developed Pig UDFs to pre-process the data for analysis.
• Implemented multiple MapReduce jobs in Java for data cleansing and pre-processing.
• Moved all RDBMS data, via flat files generated from various channels, into HDFS for further processing.
• Developed job workflows in Oozie to automate the tasks of loading the data
into HDFS.
• Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Teradata into HDFS using Sqoop.
• Wrote script files for processing data and loading it into HDFS.
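
For the Kafka-to-Spark collection system above, a minimal Structured Streaming sketch; the broker addresses, topic, event schema, and output paths are assumptions, and the spark-sql-kafka connector is assumed to be on the classpath.

    # Hedged sketch: consume portal events from Kafka and land them as Parquet.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("portal_events_stream").getOrCreate()

    event_schema = StructType([
        StructField("portal", StringType()),
        StructField("user_id", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
             .option("subscribe", "portal-events")
             .load()
    )

    events = (
        raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
           .select("e.*")
    )

    (events.writeStream.format("parquet")
           .option("path", "hdfs:///data/portal_events/")
           .option("checkpointLocation", "hdfs:///checkpoints/portal_events/")
           .start()
           .awaitTermination())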
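For the log-parsing Hive jobs above, a sketch of the same idea expressed with PySpark and saved as a Hive table; the log layout, regex patterns, and table names are assumptions.

    # Rough sketch: structure raw device log lines into a queryable Hive table.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("log_parsing").enableHiveSupport().getOrCreate()

    raw = spark.read.text("hdfs:///data/raw/device_logs/")   # one log line per row

    parsed = raw.select(
        F.regexp_extract("value", r"^(\S+ \S+)", 1).alias("event_ts"),
        F.regexp_extract("value", r"device=(\S+)", 1).alias("device_id"),
        F.regexp_extract("value", r'msg="([^"]*)"', 1).alias("message"),
    )

    # Assumes a 'logs' database already exists in the Hive metastore.
    parsed.write.mode("overwrite").format("orc").saveAsTable("logs.device_events")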
Client: Novartis
Location: Parsippany, New Jersey
Role: Hadoop Developer
Responsibilities:
• Wrote script files for processing data and loading it into HDFS.
• Developed solutions to process data into HDFS.
• Analyzed the data using MapReduce, Pig, and Hive, and produced summary results from Hadoop for downstream systems.
• Developed a data pipeline using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
• Used Pig as an ETL tool for transformations, event joins, and some pre-aggregations before storing the data in HDFS.
• Built pipelines using Unix to connect different tools, extracting data from a database, transforming it, and loading it into a data warehouse.
• Used Sqoop to import and export data from HDFS to RDBMS and vice-versa.
• Exported the analyzed data to the relational database MySQL using Sqoop for
visualization and to generate reports.
• Created HBase tables to load large sets of structured data.
• Managed and reviewed Hadoop log files.
• Involved in providing inputs for estimate preparation for the new proposal.
• Worked extensively with Hive DDLs and Hive Query Language (HQL).
• Developed UDF, UDAF, and UDTF functions and implemented them in Hive queries.
• Implemented Sqoop for large dataset transfers between Hadoop and RDBMSs.
• Created MapReduce jobs to convert periodic XML messages into partitioned Avro data.
• Used Sqoop widely to import data from various systems/sources (like MySQL)
into HDFS.
• Created components like Hive UDFs for missing functionality in HIVE for
analytics.
• Used different file formats like Text files, Sequence Files, Avro.
• Provided cluster coordination services through ZooKeeper.
• Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
• Assisted in cluster maintenance, cluster monitoring, adding and removing cluster nodes, and troubleshooting.
• Installed and configured Hadoop, MapReduce, and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing (see the sketch at the end of this section).
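
The MapReduce cleansing jobs above were written in Java; purely as an illustration, the same cleansing step is sketched here as a Hadoop Streaming mapper in Python. The field count and validity rule are assumptions.

    # mapper.py - hypothetical Hadoop Streaming mapper for record cleansing.
    import sys

    def clean(line):
        # Trim whitespace, normalise the delimiter, and drop malformed rows.
        parts = [p.strip() for p in line.rstrip("\n").split(",")]
        if len(parts) != 5 or not parts[0]:
            return None          # skip rows with a missing key or wrong column count
        return "\t".join(parts)

    if __name__ == "__main__":
        for raw_line in sys.stdin:
            cleaned = clean(raw_line)
            if cleaned is not None:
                print(cleaned)

A job like this would typically be submitted with the Hadoop Streaming jar, passing the script via -mapper along with HDFS -input and -output paths.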
Responsibilities:
• Designed & built reports, processes, and analyses with a variety of business intelligence tools & technologies.
• Transformed data into meaningful insights from various data sources to support
the development of global strategy and initiatives.
• Involved in requirements gathering, source data analysis, identified business
rules for data migration, and for developing data warehouse/data mart.
• Collected data using SQL Script, created reports using SSRS and used Tableau for
data visualization and custom reports analysis.
• Created reports in tab
• Performed Exploratory Data analysis (EDA) to find and understand interactions
between different fields in the dataset, handling missing values, detecting
outliers, data distribution, and extracting important variables graphically.
• Worked with Python libraries (NumPy, Pandas, SciPy) for data wrangling and analysis, and used Matplotlib for plotting graphs (see the sketch at the end of this section).
• Performed data collection, cleaning, wrangling, analysis, and building machine
learning models on the data sets in both R and Python.
• Used Agile methodologies to emphasize face-to-face communication and ensure that each iteration passes through the full SDLC.
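
A small illustrative pass over the EDA and Python-library work described above; the input file and column names are placeholders, not actual project data.

    # Hypothetical EDA sketch with Pandas, NumPy, and Matplotlib.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("sales.csv")                      # assumed input file

    # Missing values and basic distribution summary.
    print(df.isna().sum())
    print(df.describe(include="all"))

    # Simple IQR-based outlier flag on an assumed numeric 'amount' column.
    q1, q3 = df["amount"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
    print(f"{len(outliers)} potential outliers")

    # Log transform to reduce skew, then a quick histogram for a visual check.
    df["log_amount"] = np.log1p(df["amount"])
    df["log_amount"].plot(kind="hist", bins=30, title="log(amount) distribution")
    plt.savefig("log_amount_hist.png")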