Senior Data Engineer

PROFESSIONAL SUMMARY
● Over 9 years of experience as a Data Engineer, with a strong traditional data engineering background and expertise in Apache Spark, PySpark, Kafka, Spark Streaming, Spark SQL, Hadoop, HDFS, Hive, Sqoop, Pig, MapReduce, Flume, and Beam.
● Extensive experience with relational databases including Microsoft SQL Server, Teradata, Oracle, and Postgres, and NoSQL databases including MongoDB, HBase, Azure Cosmos DB, AWS DynamoDB, and Cassandra.
● Hands-on experience with data modeling, physical data warehouse design, and cloud data warehousing technologies including Snowflake, Redshift, BigQuery, and Synapse.
● Experience with major cloud providers & cloud data engineering services including AWS, Azure, GCP
& Databricks.
● Designed & orchestrated data processing layer & ETL pipelines using Airflow, Azure Data Factory,
Oozie, Autosys, Cron & Control-M.
● Hands on experience with AWS services including EMR, EC2, Redshift, Glue, Lambda, SNS, SQS,
CloudWatch, Kinesis, Step functions, Managed Airflow instances, Storage & Compute.
● Hands on experience with Azure services including Synapse, Azure Data Factory, Azure functions,
EventHub, Stream Analytics, Key Vault, Storage & Compute.
● Hands on experience with GCP services including Dataproc, VM, BigQuery, Dataflow, Cloud Functions, Pub/Sub, Composer, Secrets, Storage & Compute.
● Hands on experience with Databricks services including Notebooks, Delta Tables, SQL Endpoints,
Unity Catalog, Secrets, Clusters.
● Extensive experience in IT data analytics projects, with hands-on experience migrating on-premises data and data processing pipelines to the cloud (AWS, Azure, and GCP).
● Experienced in dimensional fact modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD); a brief SCD merge sketch follows this list.
● Strong expertise in working with multiple databases, including DB2, Oracle, SQL Server, Netezza, and
Cassandra, for data storage and retrieval in ETL workflows.
● Experienced in working with ETL tools such as Informatica, DataStage, and SSIS (SQL Server Integration Services).
● Good knowledge in Database Creation and maintenance of physical data models with Oracle, Teradata,
Netezza, DB2, MongoDB, HBase and SQL Server databases.
● Experienced in writing complex SQL, including stored procedures, triggers, joins, and subqueries.
● Extensive experience in loading and analyzing large datasets with Hadoop framework (MapReduce,
HDFS, PIG, HIVE, Flume, Sqoop, SPARK, Impala, Scala), NoSQL databases like MongoDB, HBase,
Cassandra.
● Expert in migrating SQL databases to Azure Data Lake Storage, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, including controlling and granting database access and migrating on-premises databases to the Azure Data Lake Store using Azure Data Factory.
● Strong experience in the analysis, design, development, testing, and implementation of Business Intelligence solutions using data warehouse/data mart design, ETL, BI, and client/server applications, and in writing ETL scripts with regular expressions and custom tools (Informatica, Pentaho, and SyncSort).
● Hands-on experience with ETL, Hadoop, and data governance tools such as Tableau and Informatica Enterprise Data Catalog.
● Experience in efficiently performing ETL using Spark in-memory processing, Spark SQL, and Spark Streaming with the Kafka distributed messaging system.
● Extensive experience developing Bash, T-SQL, and PL/SQL scripts.
● Understanding of structured data sets, data pipelines, ETL tools, and data reduction, transformation, and aggregation techniques; knowledge of tools such as dbt and DataStage.
● Good knowledge of job orchestration tools such as Oozie, ZooKeeper, and Airflow.
● Skilled in building and publishing customized interactive reports and dashboards with custom parameters and user filters, including tables, graphs, and listings, using Tableau and Power BI.
● Practical understanding of the Data modeling (Dimensional & Relational) concepts like Star-Schema
Modeling, Snowflake Schema Modeling, Fact and Dimension tables.

TECHNICAL SKILLS

Big Data Technologies: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Zookeeper, Apache Flume, Apache Airflow, Cloudera, HBase

Programming Languages: Python, PL/SQL, SQL, Scala, C, C#, C++, T-SQL, PowerShell scripting, JavaScript, Perl

Cloud Technologies: AWS, Microsoft Azure, GCP, Databricks, Snowflake

Cloud Services: Azure Data Lake Storage Gen 2, Azure Data Factory, Blob Storage, Azure SQL DB, Databricks, Azure Event Hubs, AWS RDS, Amazon SQS, Amazon S3, AWS EMR, Lambda, AWS SNS, Dataflow, BigQuery, VM, Delta Tables, Cloud Functions, Clusters

Databases: MySQL, SQL Server, IBM DB2, Postgres, Oracle, MS Access, Teradata, Snowflake

NoSQL Databases: MongoDB, Cassandra, HBase

Development Strategies: Agile, Lean Agile, Pair Programming, Waterfall, Test-Driven Development

ETL, Visualization & Reporting: Tableau, DataStage, Informatica, Talend, SSIS, SSRS

Frameworks: Django, Pandas, NumPy, Matplotlib, TensorFlow, PyTorch

Version Control & Containerization Tools: Jenkins, Git, CircleCI, SVN

Monitoring Tools: Apache Airflow, Control-M

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman

PROFESSIONAL EXPERIENCE
Client: Discover Financial Feb 2022 - Present
Role: Senior Data Engineer

Responsibilities:
● Performed multiple MapReduce jobs in Hive for data cleaning and pre-processing. Loaded the data from
Teradata tables into Hive Tables.
● Imported and exported data between HDFS and RDBMS using Sqoop and migrated it according to client requirements.
● Used Flume to collect, aggregate, and store the web log data from different sources like web servers and
pushed to HDFS.
● Implemented data lineage solutions using Ab Initio Metadata Hub for end-to-end visibility across data
pipelines.
● Integrated Ab Initio Metadata Hub with AWS and Azure to enhance data traceability in cloud
environments.
● Designed and implemented real-time data streaming pipelines using Apache Kafka, ensuring high availability and scalability for processing millions of messages per second (a brief streaming sketch follows this list).
● Developed and deployed RESTful APIs using industry best practices, enabling seamless integration
between microservices and external systems.
● Built microservices using Spring Boot/Play Framework, improving application modularity and
performance.
● Optimized Kafka producers and consumers by fine-tuning partitioning, replication, and retention
strategies, improving system resilience and throughput.
● Secured RESTful endpoints with authentication and authorization mechanisms, ensuring data protection
and compliance.
● Engineered event-driven architectures for financial clients, leveraging Kafka to streamline transaction
processing and real-time analytics.
● Designed scalable backend services with Play Framework/Spring Boot, enhancing the performance and
maintainability of financial applications.
● Collaborated with stakeholders in the finance sector to modernize legacy systems, integrating RESTful
services and improving operational efficiency.
● Managed and tracked metadata across large-scale data integration projects in Big Data ecosystems.
● Streamlined data governance by leveraging Ab Initio Metadata Hub to ensure accurate lineage and
compliance.
● Designed data lineage reporting and metadata management processes to improve data quality and
transparency.
● Collaborated with cloud architects to integrate Ab Initio Metadata Hub with cloud ETL tools like AWS
Glue and Azure Data Factory.
● Enhanced operational efficiency by centralizing metadata storage in AWS S3, Azure Data Lake, and
GCP Cloud Storage.
● Supported regulatory compliance by providing clear data lineage documentation and audit trails for
stakeholders.
● Developed Big Data solutions focused on pattern matching and predictive modeling.
● Involved in Agile methodologies, Scrum meetings and Sprint planning.
● Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.
● Managed Hadoop cluster resources, including adding and removing cluster nodes for maintenance and capacity needs.
● Involved in loading data from UNIX file system to HDFS.
● Partitioned fact tables and materialized views to enhance performance; implemented Hive partitioning and bucketing on the collected data in HDFS (a brief sketch appears after the environment summary below).
● Involved in integrating hive queries into spark environment using Spark SQL.
● Used Hive to analyze the partitioned and bucketed data to compute various metrics for reporting.
● Created data models for customer data using Cassandra Query Language (CQL).
● Developed and ran Map-Reduce Jobs on YARN and Hadoop clusters to produce daily and monthly
reports as per user's need.
● Addressed performance tuning of Hadoop ETL processes against very large data sets and worked directly with statisticians on implementing solutions involving predictive analytics.
● Performed Linux operations on the HDFS server for data lookups, job changes if any commits were
disabled, and rescheduling data storage jobs.
● Created data processing pipelines for data transformation and analysis by developing Spark jobs in Scala. Tested and validated database tables in relational databases with SQL queries, performing data validation and data integration. Worked on visualizing the aggregated datasets in Tableau.
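To illustrate the Kafka-backed streaming work described above, here is a minimal PySpark Structured Streaming sketch; the broker address, topic name, schema, and output paths are assumptions rather than the client's actual configuration, and the Spark-Kafka connector package must be available on the cluster.

```python
# Hedged sketch: consume a Kafka topic with Spark Structured Streaming (broker, topic, schema assumed).
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

schema = StructType([
    StructField("txn_id", StringType()),
    StructField("amount", DoubleType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
          .option("subscribe", "transactions")               # assumed topic
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers bytes; cast the value and parse the JSON payload.
parsed = (events
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "/data/landing/transactions")             # assumed landing path
         .option("checkpointLocation", "/checkpoints/transactions")
         .start())
query.awaitTermination()
```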

Environment: Hadoop, Spark, Kafka, MapReduce, Hive, HDFS, YARN, Linux, Cassandra, NoSQL databases, Python, Spark SQL, Spring Boot/Play Framework, Tableau, RDBMS, Flume, Spark Streaming.
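For the Hive partitioning and bucketing mentioned in this role, a minimal Spark SQL sketch; the database, table, and column names are hypothetical.

```python
# Hedged sketch: create and load a partitioned, bucketed Hive table via Spark SQL (names assumed).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-sketch")
         .enableHiveSupport()   # requires a configured Hive metastore
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.web_events (
        user_id  STRING,
        url      STRING,
        duration DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Dynamic-partition insert from an assumed staging table.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE analytics.web_events PARTITION (event_date)
    SELECT user_id, url, duration, event_date
    FROM staging.web_events_raw
""")
```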

Client: Apex Health Aug 2020 - Jan 2022
Role: Senior Data Engineer

Responsibilities:
● Involved in full Software Development Life Cycle (SDLC) - Business Requirements Analysis,
preparation of Technical Design documents, Data Analysis, Logical and Physical database design,
Coding, Testing, Implementing, and deploying to business users.
● Designed and implemented end-to-end data pipelines on AWS using services such as AWS Glue, AWS
Lambda, and AWS EMR.
● Wrote complex SQL using joins, subqueries, and correlated subqueries; expertise in SQL queries for cross-verification of data.
● Created ingestion framework for creating Data Lake from heterogeneous sources like Flat files, Oracle
Db, mainframe, and SQL server Databases.
● Design and Develop ETL Processes in AWS Glue to load data from external sources like S3, glue
catalog, and AWS Redshift.
● Developed complex ETL mappings for Stage, Dimensions, Facts, and Data marts load. Involved in Data
Extraction for various Databases & Files using Talend.
● Optimized Spark jobs in Databricks to enhance performance and reduce processing time, utilizing AWS
EMR integration where necessary.
● Integrated Databricks with AWS services such as Amazon S3, Amazon Redshift, Amazon RDS, and
AWS Glue.
● Ingested large files of around 600 GB into S3 in an efficient way.
● Used Glue jobs to read data from S3 and load it into Redshift tables, reading metadata from the Data Catalog in JSON format (see the Glue sketch after this list).
● Developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/text files) into Snowflake.
● Worked on end-to-end deployment of the project that involved Data Analysis, Data Pipelining, Data
Modelling, Data Reporting and Data documentations as per the business needs.
● Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks.
● Design and Develop ETL Processes in AWS Glue to migrate Campaign data from external sources like
S3, ORC/Parquet/Text Files into AWS Redshift
● Developed Python scripts to transfer data from on-premises systems to AWS S3 and to call REST APIs and extract data to AWS S3 (a brief sketch follows the environment summary below).
● Worked on Ingesting data by going through cleansing and transformations and leveraging AWS
Lambda, AWS Glue and Step Functions.
● Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and
aggregation from multiple file formats.
● Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores
and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
● Performed benchmark tests reading data from databases and object stores using pandas and PySpark APIs to compare results, identify potential improvement areas, and provide recommendations.
● Read and wrote Parquet and JSON files from S3 buckets using Spark and pandas DataFrames with various configurations.
● Designed, developed, and maintained complex data pipelines using Apache Airflow for data extraction,
transformation, and loading (ETL) processes.
● Orchestrated and scheduled data workflows in Apache Airflow to ensure timely and automated
execution of data tasks.
● Designed and implemented data visualizations and charts in Tableau to effectively communicate
complex data insights and trends to non-technical users.
● Worked closely with application customers to resolve JIRA tickets related to API issues, data issues, consumption latencies, onboarding, and publishing data.
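A minimal AWS Glue job sketch along the lines described above, reading a Glue Data Catalog table backed by S3 and writing to Redshift; the database, table, connection, and bucket names are assumptions, and the script only runs inside the Glue job runtime.

```python
# Hedged AWS Glue job sketch (catalog database, table, connection, and bucket names are assumptions).
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog; the catalog metadata points at the S3 data.
source = glue_context.create_dynamic_frame.from_catalog(
    database="campaign_db", table_name="raw_campaigns")

mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("campaign_id", "string", "campaign_id", "string"),
              ("spend", "double", "spend", "double")])

# Load into Redshift through a pre-defined Glue connection, staging via S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "analytics.campaigns", "database": "dev"},
    redshift_tmp_dir="s3://example-temp-bucket/glue/")

job.commit()
```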

Environment: Python, Spark, AWS EC2, AWS S3, AWS EMR, AWS Redshift, AWS Glue, AWS RDS, AWS Kinesis Data Firehose, Kinesis Data Streams, AWS SNS, AWS SQS, AWS Athena, Snowflake, SQL, Tableau, Git, Jira.
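For the Python transfer and REST extraction scripts mentioned in this role, a hedged sketch using boto3 and requests; the API endpoint, bucket, and key are placeholders.

```python
# Hedged sketch: pull records from a REST API and land them in S3 (endpoint, bucket, and key assumed).
import json

import boto3
import requests

API_URL = "https://api.example.com/v1/orders"   # placeholder endpoint
S3_BUCKET = "example-landing-bucket"            # placeholder bucket

def extract_to_s3() -> str:
    """Fetch one batch of records from the API and write it to S3 as a JSON object."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()

    key = "raw/orders/orders.json"
    boto3.client("s3").put_object(
        Bucket=S3_BUCKET,
        Key=key,
        Body=json.dumps(response.json()).encode("utf-8"))
    return key

if __name__ == "__main__":
    print(f"Wrote s3://{S3_BUCKET}/{extract_to_s3()}")
```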

Client: BCBS Oct 2019 - July 2020
Role: SQL Server Developer

Responsibilities:
● Designed DTS/SSIS packages to transfer data between servers, load data into databases, and archive data files from different DBMSs using SQL Enterprise Manager/SSMS in a SQL Server 2008 environment, and deployed the packages.
● Worked with business users, business analysts, IT leads, and developers to analyze business requirements and translate them into functional and technical design specifications.
● Created SSIS packages to capture the daily maintenance-plan scheduled job status (success/failure) with a daily status report; also created an SSIS package to list server configuration, database sizing, non-DBO-owned objects, and public role privileges required for the monthly audit report.
● Worked with SQL Server and T-SQL in constructing DDL/DML triggers, tables, user-defined functions, views, indexes, stored procedures, user profiles, relational database models, cursors, Common Table Expressions (CTEs), data dictionaries, and data integrity constraints (an illustrative sketch follows this list).
● Worked closely with the team in designing, developing, and implementing the logical and physical model for the data mart.
● Identified and resolved database performance issues, database capacity issues, and other distributed data issues.
● Designed the ETL (Extract, Transform, Load) strategy to transfer data from source to stage and stage to target tables in the data warehouse and OLAP database from heterogeneous databases using SSIS and DTS (Data Transformation Services).
● Performed ongoing delivery, migrating client mini data warehouses and functional data marts from different environments to MS SQL Server.
● Involved in the creation of dimension and fact tables based on business requirements.
● Developed SSIS packages to export data from Excel spreadsheets to SQL Server, automated all the SSIS packages, and monitored errors using a daily SQL job.
● Prepared reports using SSRS (SQL Server Reporting Services) to highlight discrepancies between customer expectations and customer service efforts, including scheduling subscription reports via the subscription report wizard.
● Involved in generating and deploying various reports using global variables, expressions, and functions in SSRS.
● Migrated data from legacy systems (text-based files, Excel spreadsheets, and Access) to SQL Server databases using DTS and SQL Server Integration Services (SSIS) to overcome transformation constraints.
● Automated the SSIS jobs in the SQL scheduler as SQL Server Agent jobs for daily, weekly, and monthly loads.
● Designed reports using SQL Server Reporting Services (SSRS) based on OLAP cubes, making use of multiple-value selection in parameter pick lists, cascading prompts, matrix dynamic reports, and other reporting features.
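The T-SQL objects above were built directly in SQL Server; purely as an illustration, and keeping this document's code examples in Python, here is a hedged pyodbc sketch that calls a hypothetical stored procedure (the server, database, procedure, and column names are invented for the example).

```python
# Illustration only: invoking a hypothetical T-SQL stored procedure over pyodbc (all names assumed).
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlserver01;DATABASE=DataMart;Trusted_Connection=yes;")  # assumed server/database

cursor = conn.cursor()
# usp_DailyJobStatus is an assumed procedure summarizing daily SQL Agent job outcomes.
cursor.execute("EXEC dbo.usp_DailyJobStatus @RunDate = ?", "2020-06-01")
for row in cursor.fetchall():
    print(row.JobName, row.Status)   # column names assumed to match the procedure's result set

cursor.close()
conn.close()
```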

Environment: T-SQL, SQL Server 2008/ 2008R2, SSRS, SSIS, SSAS, MS Visio, BIDS, Agile.

Client: TCS Aug 2016 - Sep 2019
Role: Data Engineer

Responsibilities:
● Develop deep understanding of the data sources, implement data standards, and maintain data quality
and master data management.
● Used T-SQL in constructing User Functions, Views, Indexes, User Profiles, Relational Database
Models, Data Dictionaries, and Data Integrity.
● Built Databricks notebooks to extract data from source systems such as DB2 and Teradata and to perform data cleansing, data wrangling, and ETL processing before loading to Azure SQL DB.
● Developed JSON definitions for deploying Azure Data Factory (ADF) pipelines that process the data.
● Developed ETL processes in Azure Databricks to extract, transform, and load data from various sources
into Azure data lakes and data warehouses.
● Collaborated with data scientists to deploy machine learning models using Azure Databricks for real-
time scoring and inference.
● Integrated Azure Databricks with Azure services such as Azure Synapse Analytics, Azure SQL
Database, and Azure Data Factory.
● Performed ETL operations in Azure Databricks by connecting to different relational database source
systems using job connectors.
● Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
● Worked on reading and writing multiple data formats such as JSON, Parquet, and Delta from various sources using PySpark (see the sketch after this list).
● Developed an automated process in Azure cloud, which can ingest data daily from web service and load
into Azure SQL DB.
● Designed and maintained ADF pipelines with activities such as Copy, Lookup, ForEach, Get Metadata, Execute Pipeline, Stored Procedure, If Condition, Web, Wait, and Delete.
● Worked on Ingestion of data from source (On prem SQL) to target (ADLS) using Azure Synapse
Analytics (DW) & Azure SQL DB.
● Expert in building the Azure Notebooks functions by using Python, Scala, and Spark.
● Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
● Created, provisioned different Databricks clusters needed for batch and continuous streaming data
processing and installed the required libraries for the clusters. Integrated Azure Active Directory
authentication to every Cosmos DB request sent and demoed feature to Stakeholders.
● Created numerous pipelines in Azure using Azure Data Factory v2 to get the data from disparate source
systems by using different Azure Activities like Move &Transform, Copy, filter, for each, Databricks
etc.
● Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
● Working with complex SQL, Stored Procedures, Triggers, and packages in large databases from various
servers.
● Wrote fully parameterized Databricks code and ADF pipelines for efficient code management.
● Strong skills in visualization tools: Power BI and Excel (formulas, pivot tables, charts) and DAX commands.
● Implemented CI/CD (Continuous Integration/Continuous Deployment) pipelines for Azure Databricks
using tools like Azure DevOps or GitHub Actions.
● Created several Power BI dashboard reports and heat map charts, and supported numerous dashboards, pie charts, and heat map charts.
● Helped team members resolve technical issues; handled troubleshooting, project risk and issue identification and management, resource issues, monthly one-on-ones, and weekly meetings.
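To illustrate the multi-format reads and Azure SQL DB loads described in this role, a minimal Databricks PySpark sketch; the ADLS paths, JDBC URL, table name, join key, and credential handling are assumptions.

```python
# Hedged Databricks sketch: read Parquet/JSON/Delta from ADLS Gen2 and load Azure SQL DB (names assumed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-to-azure-sql-sketch").getOrCreate()

base = "abfss://raw@examplestorage.dfs.core.windows.net"   # assumed ADLS Gen2 container

orders  = spark.read.parquet(f"{base}/orders/")
events  = spark.read.json(f"{base}/events/")
history = spark.read.format("delta").load(f"{base}/history/")

daily = orders.join(events, "order_id", "left")            # assumed join key

(daily.write.format("jdbc")
      .option("url", "jdbc:sqlserver://exampleserver.database.windows.net:1433;database=sales")
      .option("dbtable", "dbo.daily_orders")
      .option("user", "etl_user")        # in practice, read from a Databricks secret scope / Key Vault
      .option("password", "<secret>")
      .mode("overwrite")
      .save())
```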

Environment: Azure SQL Database, Azure Data Lake, Azure Data Factory (ADF), Azure SQL Data
Warehouse, Azure Analysis Service (AAS), Azure Blob Storage, Azure Search, Azure App Service, Azure
Database Migration Service (DMS), GIT, PySpark, Python, JSON, ETL Tools, SQL Azure.

Client: LTI Mindtree Jun 2015 - Aug 2016
Role: Hadoop Developer / Big Data Engineer

Responsibilities:
● Demonstrated proficiency in Agile methodologies, working within cross-functional Agile teams to
deliver data engineering solutions on schedule and within scope.
● Actively participated in Agile ceremonies such as daily stand-ups, sprint planning, sprint reviews, and
retrospectives to ensure effective collaboration and communication within the team.
● Designed and implemented end-to-end data solutions on the Azure cloud platform, including Azure
Databricks, Azure Synapse Pipeline, and Azure Blob Storage.
● Developed and managed Azure Data Lake and Azure Blob Storage accounts to store, manage, and
secure data assets.
● Created and maintained ETL pipelines using Azure Data Factory to orchestrate data workflows.
● Worked with Azure Postgres SQL and Azure SQL databases to store and retrieve data for various
applications and analytics.
● Designed and developed complex data models and database schemas in Azure SQL Database,
optimizing data storage, retrieval, and organization.
● Engineered end-to-end data pipelines using SQL for data extraction, transformation, and loading (ETL)
processes, ensuring high-quality data for analytics and reporting.
● Implemented data versioning, change tracking, and data lineage for enhanced data governance and
auditing in Azure environments.
● Developed end-to-end data pipelines in Azure Databricks, encompassing the bronze, silver, and gold stages for comprehensive data processing (see the sketch after this list).
● Implemented the Bronze stage in data pipelines, focusing on raw data ingestion, storage, and initial data
quality checks.
● Enhanced data quality and usability by transitioning data through the Silver stage, performing data
transformations, normalization, and schema changes.
● Orchestrated data cleaning and transformation processes within Azure Databricks, ensuring the silver
data was structured and ready for analysis.
● Proficient in designing, developing, and maintaining data pipelines and ETL processes using Azure
Databricks.
● Skilled in data ingestion from various sources such as Azure Blob Storage, Azure Data Lake Storage,
and Azure SQL Database into Azure Databricks.
● Leveraged Databricks for advanced data transformations, including aggregations, joins, and feature
engineering, to prepare data for analytical purposes in the gold stage.
● Stored and managed gold data in Azure data warehousing solutions, optimizing data structures for high-
performance querying and reporting.
● Implemented Spark Streaming for real-time data processing and analytics, enabling immediate insights
from streaming data sources.
● Developed data archiving and retention strategies in Snowflake to store historical data while optimizing
storage costs.
● Utilized Snowflake's data-sharing capabilities to securely share historical data with external partners,
enabling collaborative analysis and reporting.
● Conducted performance tuning and optimization of data processing workflows to reduce processing
times and costs.
● Managed and optimized Snowflake data warehousing solutions for data storage and retrieval.
● Developed and maintained PySpark and Pandas-based data processing scripts and notebooks for data
transformation and analysis.
● Developed Python scripts to transfer data from on-premises systems and to call REST APIs and extract data.
● Strong understanding of data warehousing concepts and experience in building data warehouses on
Azure using Databricks.
● Proficient in working with structured, semi-structured, and unstructured data formats in Azure
Databricks.
● Implemented data security and governance practices in Azure Databricks environments.
● Worked on continuous integration and continuous deployment (CI/CD) pipelines for automated testing
and deployment of data solutions.
● Collaborated with Azure DevOps teams to ensure high availability, scalability, and resource
optimization for data systems.
● Successfully completed data migration projects to Azure, ensuring data consistency and integrity during
the transition.
● Ensured compliance with data privacy regulations and company policies, including GDPR and HIPAA,
by implementing data access controls and encryption.
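A compact sketch of the bronze/silver/gold (medallion) flow described above, using PySpark and Delta; the lake paths, column names, and cleansing rules are assumptions.

```python
# Hedged medallion-architecture sketch (lake paths, columns, and rules are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, sum as _sum

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

lake = "abfss://lake@examplestorage.dfs.core.windows.net"   # assumed ADLS Gen2 container

# Bronze: raw ingestion, stored as-is for replayability.
bronze = spark.read.json(f"{lake}/landing/claims/")
bronze.write.format("delta").mode("append").save(f"{lake}/bronze/claims")

# Silver: typing, de-duplication, and basic quality filters.
silver = (spark.read.format("delta").load(f"{lake}/bronze/claims")
          .dropDuplicates(["claim_id"])
          .withColumn("claim_date", to_date(col("claim_date")))
          .filter(col("amount") > 0))
silver.write.format("delta").mode("overwrite").save(f"{lake}/silver/claims")

# Gold: aggregated, analysis-ready table registered in the metastore.
spark.sql("CREATE DATABASE IF NOT EXISTS gold")
gold = (silver.groupBy("member_id", "claim_date")
              .agg(_sum("amount").alias("total_amount")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.claims_daily")
```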

Environment: Python, SQL, Azure Databricks, Azure Synapse pipelines, Azure Blob Storage, Azure Data Lake, Terraform, Azure PostgreSQL, Azure SQL, Spark Streaming, GitHub, PyCharm, Snowflake, PySpark, Pandas.
