Data Engineer
---------------------------------------------------------------------------------------------------------------------
Professional Summary:
Extensive experience in Information Technology, with 10+ years of Hadoop/Big Data
processing.
Comprehensive working experience in implementing Big Data projects using
Apache Hadoop, Pig, Hive, HBase, Spark, Sqoop, Flume, Zookeeper, Oozie.
Experience working on Hortonworks / Cloudera / MapR.
Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics,
Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and
granting database access; and migrating on-premises databases to Azure Data Lake Store
using Azure Data Factory.
Hands-on experience in designing and implementing data engineering pipelines and
analyzing data using the AWS stack, including AWS EMR, AWS Glue, EC2, AWS Lambda, Athena,
Redshift, Sqoop, and Hive.
Comprehensive experience in importing and exporting data using Sqoop between RDBMS
and HDFS.
Good understanding of and hands-on experience with Spark abstractions such as RDDs,
DataFrames, Datasets, and Spark SQL (see the sketch below).
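For illustration, a minimal PySpark sketch of the same aggregation expressed through the RDD, DataFrame, and Spark SQL APIs; the data and column names are made up for the example.

    # Minimal PySpark sketch contrasting the RDD, DataFrame, and Spark SQL APIs.
    # The sample data and column names are illustrative, not from any specific project.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("abstractions-demo").getOrCreate()

    # RDD API: low-level transformations on raw tuples
    rdd = spark.sparkContext.parallelize([("web", 120), ("mobile", 80), ("web", 45)])
    totals_rdd = rdd.reduceByKey(lambda a, b: a + b)

    # DataFrame API: the same aggregation with a declarative, optimized plan
    df = spark.createDataFrame(rdd, ["channel", "amount"])
    totals_df = df.groupBy("channel").sum("amount")

    # Spark SQL: identical logic expressed as SQL over a temporary view
    df.createOrReplaceTempView("orders")
    totals_sql = spark.sql("SELECT channel, SUM(amount) AS total FROM orders GROUP BY channel")

    totals_sql.show()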
Excellent working knowledge of the HDFS filesystem and Hadoop daemons such as
ResourceManager, NodeManager, NameNode, DataNode, Secondary NameNode,
containers, etc.
In-depth understanding of Apache Spark job execution components such as DAGs, lineage
graphs, the DAG scheduler, the task scheduler, stages, and tasks.
Strong experience in implementing data warehouse solutions in Confidential Redshift.
Worked on various projects to migrate data from on-premises databases to Confidential
Redshift, RDS, and S3.
Experience with developing User Defined Functions (UDFs) in Apache Hive using Java,
Scala, and Python.
Skilled in Hadoop Architecture and ecosystem which includes HDFS, Job Tracker, Task
Tracker, Name Node, Data Node, YARN.
Experience working on Spark and Spark Streaming.
Experience with the Hadoop 2.0 YARN architecture and developing YARN applications on it.
Worked on performance tuning to ensure that assigned systems were patched, configured,
and optimized for maximum functionality and availability; implemented solutions that
reduced single points of failure and improved system uptime to 99.9% availability.
Experience with distributed systems, large-scale non-relational data stores and
multi-terabyte data warehouses.
Worked extensively in Python and built a custom ingestion framework.
Firm grip on data modeling, data marts, database performance tuning, and NoSQL/MapReduce
systems.
Extensive Snowflake cloud data warehouse implementation on AWS.
Experience in managing and reviewing Hadoop log files
Real-time experience with Hadoop/Big Data technologies for the storage, querying,
processing, and analysis of data.
Worked with data serialization formats for converting complex objects into sequences of bits,
using Avro, Parquet, JSON, and CSV formats.
Expertise in extending Hive and Pig core functionality by writing custom UDFs and
UDAFs.
Designed and created Hive external tables using a shared metastore (instead of embedded
Derby), with partitioning, dynamic partitioning, and buckets (see the sketch below).
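A hedged sketch of that pattern, assuming a Spark session with Hive support; the database, path, column names, and the staging source table are illustrative assumptions.

    # Partitioned, bucketed Hive external table created through Spark SQL against a
    # shared metastore; names and locations are placeholders for illustration only.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-ddl-demo")
             .enableHiveSupport()          # use the shared Hive metastore, not embedded Derby
             .getOrCreate())

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_logs (
            user_id STRING,
            url     STRING,
            status  INT
        )
        PARTITIONED BY (event_date STRING)
        CLUSTERED BY (user_id) INTO 16 BUCKETS
        STORED AS ORC
        LOCATION '/data/external/web_logs'
    """)

    # Enable dynamic partitioning so inserts derive event_date from the data itself
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT INTO analytics.web_logs PARTITION (event_date)
        SELECT user_id, url, status, event_date FROM staging.web_logs_raw
    """)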
Worked with different file formats such as TEXTFILE, SEQUENCEFILE, AVRO, ORC, and
PARQUET for Hive querying and processing.
Proficient in NoSQL databases like HBase.
Experience in importing and exporting data using Sqoop between HDFS and Relational
Database Systems.
Knowledge in Kafka installation & integration with Spark Streaming.
Demonstrated a full understanding of the Fact/Dimension data warehouse design model,
including star and snowflake design methods.
Hands-on experience building data pipelines using Hadoop components Sqoop, Hive,
Pig, MapReduce, Spark, Spark SQL.
Loaded and transformed large sets of structured, semi structured and unstructured data in
various formats like text, zip, XML and JSON.
Experience in designing both time driven and data driven automated workflows using Oozie.
Good understanding of Zookeeper for monitoring and managing Hadoop jobs.
Monitoring Map Reduce Jobs and YARN Applications.
Strong Experience in installing and working on NoSQL databases like HBase, Cassandra.
Work experience with cloud infrastructure such as Amazon Web Services (AWS) EC2 and
S3.
Used Git for source code and version control management.
Experience with RDBMS and writing SQL and PL/SQL scripts used in stored procedures.
Proficient in Java, J2EE, JDBC, Collection Framework, JSON, XML, REST, SOAP Web services.
Strong understanding in Agile and Waterfall SDLC methodologies.
Experience working both independently and collaboratively to solve problems and deliver
high-quality results in a fast-paced, unstructured environment.
Education Details:
Certifications:
DP 203: Data Engineering on Microsoft Azure
AZ-900: Microsoft Azure Fundamentals
Responsibilities:
Worked closely with stakeholders to understand business requirements and design quality
technical solutions that align with business and IT strategies and comply with the
organization's architectural standards.
Developed multiple applications required for transforming data across multiple layers of the
Enterprise Analytics Platform and implemented Big Data solutions to support distributed
processing using Big Data technologies.
Responsible for data identification and extraction using third-party ETL and
data-transformation tools or scripts (e.g., SQL, Python).
Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse
Analytics (DW) and Azure SQL DB).
Installed and configured Apache Airflow for Azure storage containers and the Snowflake data
warehouse, and created DAGs to orchestrate the loads (see the sketch below).
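A minimal Airflow DAG sketch for that orchestration; the task callables are placeholders and the DAG id and schedule are assumptions, not the production configuration.

    # Minimal Airflow DAG sketch: pull new files from an Azure storage container, then
    # load them into Snowflake. Callables are placeholders; names are illustrative.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_from_blob(**context):
        # Placeholder: list/download new files from an Azure storage container.
        pass

    def load_into_snowflake(**context):
        # Placeholder: run a COPY INTO statement against the Snowflake warehouse.
        pass

    with DAG(
        dag_id="azure_blob_to_snowflake",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_from_blob", python_callable=extract_from_blob)
        load = PythonOperator(task_id="load_into_snowflake", python_callable=load_into_snowflake)
        extract >> load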
Worked on Shell scripting in Linux environment.
Built an ETL framework for data migration from on-premises data sources such as Hadoop and
Oracle to the Azure cloud using Apache Airflow, Apache Sqoop, and Apache Spark.
Part of the databridge development team, using Python as the programming language.
Worked on flattening JSON data so that it can be consumed by downstream teams (see the sketch below).
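A hedged PySpark sketch of the flattening pattern; the input path, output path, and field names are illustrative assumptions.

    # Nested structs are promoted to top-level columns and arrays are exploded into rows.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("flatten-json-demo").getOrCreate()

    raw = spark.read.json("/landing/events/*.json")   # hypothetical landing path

    flat = (raw
            .withColumn("item", explode(col("order.items")))   # one row per array element
            .select(
                col("order.id").alias("order_id"),
                col("customer.name").alias("customer_name"),
                col("item.sku").alias("sku"),
                col("item.qty").alias("qty"),
            ))

    flat.write.mode("overwrite").parquet("/curated/orders_flat")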
Developed and managed Azure Data Factory pipelines that extracted data from various data
sources, transformed it according to business rules using Python scripts that utilized PySpark,
and consumed APIs to move data into an Azure SQL database.
Created a new data quality check framework project in Python that utilized pandas (see the sketch below).
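An illustrative sketch in the spirit of that framework; the rule names, columns, and sample data are assumptions, not the actual project code.

    # Small pandas-based data quality check: each rule returns a human-readable violation.
    import pandas as pd

    def run_quality_checks(df: pd.DataFrame) -> list:
        """Return a list of rule violations for a batch of records."""
        failures = []
        if df["order_id"].isnull().any():
            failures.append("order_id contains nulls")
        if df["order_id"].duplicated().any():
            failures.append("order_id contains duplicates")
        if (df["amount"] < 0).any():
            failures.append("amount contains negative values")
        return failures

    batch = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, 7.5]})
    for problem in run_quality_checks(batch):
        print("FAILED:", problem)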
Implemented source control and development environments for Azure Data Factory pipelines
utilizing Azure Repos.
Created Hive/Spark external tables for each source table in the Data Lake and wrote HiveQL
and Spark SQL to parse the logs and structure them in tabular format to facilitate effective
querying of the log data.
Designed and developed ETL and ELT frameworks using Azure Data Factory and Azure
Databricks.
Flattened and transformed large volumes of nested data in Parquet and Delta formats using
Spark SQL and current join-optimization techniques, then loaded the results into Hive,
Delta Lake, and Snowflake tables.
Created generic Databricks notebooks for performing data cleansing.
Refactored on-prem SSIS packages into Azure Data Factory pipelines.
Worked with Azure Blob and Data Lake Storage for loading data into Azure Synapse (SQL DW).
Ingested and transformed source data using Azure Data flows and Azure HDInsight.
Created Azure Functions to ingest data at regular intervals.
Created Databricks notebooks for performing complex transformations and integrated them
as activities in ADF pipelines.
Loaded data into Snowflake tables from an internal stage using SnowSQL (see the sketch below).
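A hedged sketch of that load; the bullet above used the SnowSQL CLI, while here the equivalent PUT/COPY statements are issued through the Snowflake Python connector purely for illustration, with placeholder credentials and object names.

    # Upload a local file to a named internal stage, then copy it into the target table.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",
        warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
    )
    cur = conn.cursor()
    cur.execute("CREATE STAGE IF NOT EXISTS ORDERS_STAGE")               # internal stage
    cur.execute("PUT file:///tmp/orders.csv @ORDERS_STAGE AUTO_COMPRESS=TRUE")
    cur.execute("""
        COPY INTO ORDERS
        FROM @ORDERS_STAGE
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
    cur.close()
    conn.close()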
Wrote complex SQL queries for data analysis and extraction of data in the required format.
Created Power BI datamarts and reports for various stakeholders in the business.
Created CI/CD pipelines using Azure DevOps.
Enhanced the functionality of existing ADF pipeline by adding new logic to transform the data.
Worked on Spark jobs for data preprocessing, validation, normalization, and transmission.
Optimized code and configurations for performance tuning of Spark jobs.
Worked with unstructured and semi structured data sets to aggregate and build analytics on
the data.
Worked independently with business stakeholders, with a strong emphasis on influencing and
collaboration.
Daily participation in an Agile-based Scrum team with tight deadlines.
Environment: Azure Synapse Analytics, Azure Data Factory, Azure Databricks, Hadoop, SQL
Server, Delta Lake, Power BI, SnowSQL, Snowflake, Oracle 12c/11g, SQL scripting, PL/SQL, Python,
Unix Shell, Jira, Confluence.
Client: First Republic Bank, San Francisco, CA    Aug 2022 – July 2023
Sr. Data Engineer
Responsibilities:
Developed Scala Spark pipelines that transform raw data from several formats into
Parquet files for consumption by downstream systems.
Developed Spark scripts used to load data from Hive into Amazon RDS (Aurora) at a faster
rate (see the sketch below).
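A PySpark equivalent sketch of that load (the original scripts were written in Scala); the JDBC URL, table names, and tuning options are assumptions.

    # Read a Hive table and write it to Aurora over JDBC, partitioned for parallel writes.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-to-aurora").enableHiveSupport().getOrCreate()

    orders = spark.table("analytics.orders")   # Hive table registered in the metastore

    (orders
     .repartition(16)                          # write in parallel across JDBC connections
     .write
     .format("jdbc")
     .option("url", "jdbc:mysql://aurora-cluster.example.us-east-1.rds.amazonaws.com:3306/reporting")
     .option("dbtable", "orders")
     .option("user", "etl_user")
     .option("password", "***")
     .option("batchsize", 10000)               # larger batches reduce round trips
     .mode("append")
     .save())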
Experience developing Scala applications for loading/streaming data into NoSQL databases (MongoDB)
and HDFS.
Used AWS Glue services such as crawlers and ETL jobs to catalog all the Parquet files and
transform the data according to business needs (see the sketch below).
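An illustrative boto3 sketch of that Glue workflow; the crawler and job names are assumptions.

    # Run a crawler to refresh the Data Catalog, then kick off an ETL job.
    import boto3

    glue = boto3.client("glue")

    glue.start_crawler(Name="parquet-landing-crawler")        # refresh the Data Catalog
    # (in practice, poll get_crawler until the run completes before starting the job)
    run = glue.start_job_run(
        JobName="curate-orders",                              # hypothetical Glue ETL job
        Arguments={"--source_table": "landing.orders"},
    )
    print("Started job run:", run["JobRunId"])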
Developed and managed pipelines that extracted data from various data sources,
transformed it according to business rules using Scala scripts that utilized Scala Spark, and
consumed APIs to move data into an AWS SQL database.
Worked with AWS services like S3, Glue, EMR, SNS, SQS, Lambda, EC2, RDS and Athena to
process data for the downstream customers.
Created libraries and SDKs that make JDBC connections to the Hive database and query
the data using the Play Framework and various AWS services.
Created views on top of data in Hive which will be used by the application using Spark SQL.
Applied security on data using Apache Ranger to set row level filters and group level policies
on data.
Experience building reusable ETL components using Postgres and snowflake.
Normalized the data according to business needs, including data cleansing, datatype
changes, and various transformations using Spark, Scala, and AWS EMR.
Worked on creating CI/CD pipelines using tools like Jenkins and Rundeck, which are
responsible for scheduling the daily jobs.
Developed Sqoop jobs responsible for importing data from Oracle to AWS S3.
Developed a utility that transforms and exports data from AWS S3 to AWS Glue and sends
alerts and notifications to downstream systems (AI and Data Analytics) once the data is
ready for use.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs,
Python and Scala.
Performed imports and exports between the internal stage (Snowflake) and the external stage (AWS S3).
Developed pipelines for auditing the metrics of all applications using AWS Lambda and
Kinesis Data Firehose (see the sketch below).
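A hedged sketch of that auditing pattern: a Lambda handler pushing one metric record per invocation to a Kinesis Data Firehose delivery stream; the stream name and payload fields are assumptions.

    # Lambda handler that writes an audit-metric record to a Firehose delivery stream.
    import json
    import time
    import boto3

    firehose = boto3.client("firehose")

    def lambda_handler(event, context):
        record = {
            "application": event.get("application", "unknown"),
            "metric": event.get("metric", "rows_processed"),
            "value": event.get("value", 0),
            "emitted_at": int(time.time()),
        }
        firehose.put_record(
            DeliveryStreamName="app-audit-metrics",      # hypothetical delivery stream
            Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
        )
        return {"status": "ok"}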
Worked extensively on writing Snowpipe definitions that trigger Snowflake data loads automatically
using Amazon SQS (Simple Queue Service) notifications on an S3 bucket (see the sketch below).
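A hedged sketch of such a pipe definition, issued through the Snowflake Python connector for illustration; the pipe, stage, and table names are placeholders. With AUTO_INGEST = TRUE, Snowflake exposes an SQS queue ARN (via SHOW PIPES) that the S3 bucket's event notifications point at.

    # Define an auto-ingest Snowpipe over an external stage that points at the S3 bucket.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",
        warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
    )
    cur = conn.cursor()
    cur.execute("""
        CREATE OR REPLACE PIPE RAW_EVENTS_PIPE
        AUTO_INGEST = TRUE
        AS
        COPY INTO RAW_EVENTS
        FROM @RAW_EVENTS_STAGE          -- external stage pointing at the S3 bucket
        FILE_FORMAT = (TYPE = PARQUET)
    """)
    cur.close()
    conn.close()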
Developed end to end pipeline which exports the data from parquet files in S3 to Amazon
RDS.
Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks
running on Amazon SageMaker.
Worked on optimizing performance of Hive queries using Hive LLAP and various other
techniques.
Environment: Spark, Scala, Hadoop, Hive, Sqoop, Redshift, Lambda, RDS, Play Framework,
Apache Ranger, S3, EMR, EC2, SNS, SQS, SageMaker, Zeppelin, Snowflake, Kinesis,
Athena, Jenkins, Rundeck, and AWS Glue.
Environment: Hadoop (Cloudera), HDFS, MapReduce, Hive, Scala, Snowflake, Pig,
Sqoop, Azure, DB2, UNIX Shell Scripting, JDBC.
Environment: Hadoop, MapReduce, HDFS, Hive, HBase, Sqoop, Pig, Flume, Oracle 11/10g,
DB2, Teradata, MySQL, Eclipse, PL/SQL, Java, Linux, Shell Scripting, SQL Developer, SOLR.
Responsibilities:
Experience developing Scala applications for loading/streaming data into NoSQL databases
(MongoDB) and HDFS.
Performed T-SQL tuning and query optimization for SSIS packages.
Designed distributed algorithms for identifying trends in data and processing them
effectively.
Created an SSIS package to import data from SQL tables into different sheets in Excel.
Used Spark and Scala for developing machine learning algorithms that analyze clickstream
data.
Used Spark SQL for data pre-processing, cleaning, and joining very large data sets.
Performed data validation with Redshift and constructed pipelines designed to process over
100 TB per day.
Co-developed the SQL server database system to maximize performance benefits for clients.
Assisted senior-level data scientists in the design of ETL processes, including SSIS packages.
Performed database migrations from traditional data warehouses to Spark clusters.
Ensured the data warehouse was populated only with quality entries by performing regular
cleaning and integrity checks.
Used Oracle relational tables in process design.
Developed SQL queries to perform data extraction from existing sources to check format
accuracy.
Developed automated tools and dashboards to capture and display dynamic data.
Installed a Linux-operated Cisco server, performed regular updates and backups, and used
MS Excel functions for data validation.
Coordinated data security issues and instructed other departments about secure data
transmission and encryption.
Environment: T-SQL, MongoDB, HDFS, Scala, Relational Databases, SSIS, SQL, Linux, Data
Validation, MS Excel, Agile Methodology.