
Anisha

Senior Data Engineer

[email protected]
Ph: 314-717-0287

Accomplished Data Engineer with over 9 years of expertise in machine learning, data mining, and the
management of large datasets encompassing both structured and unstructured data. Proficient in data
acquisition, validation, predictive modeling, data visualization, web crawling, and web scraping.
Demonstrated proficiency in Big Data technologies, including Hadoop and Hive, coupled with
expertise in statistical programming languages such as R and Python. Experienced in navigating the
Hadoop Ecosystem and leveraging key Big Data components, including Apache Spark, Scala,
Python, HDFS, MapReduce, and Kafka.

• Extensive proficiency in the Software Development Life Cycle (SDLC), encompassing
Requirement Analysis, Design, Development, Database Design, Deployment, Testing, and Debugging.
• Extensive hands-on experience in developing enterprise-level solutions utilizing Hadoop
components like Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume,
NiFi, Kafka, Zookeeper, and YARN.
• Proficient in data wrangling, utilizing Spark vectorized pandas user-defined functions (UDFs) for data
manipulation (a minimal sketch appears after this list). Uses Python modules such as NumPy, Matplotlib,
Pickle, Pandas, urllib, Beautiful Soup, PySide, and PyTables for graphical data creation, histograms,
data analysis, and numerical computations.
• In-depth understanding of Hadoop architecture, encompassing components such as YARN, HDFS,
Name Node, Data Node, Job Tracker, Application Master, Resource Manager, Task Tracker, and the
MapReduce programming paradigm.
• Handling database issues and connections with both SQL and NoSQL databases such as MongoDB,
HBase, Cassandra, SQL Server, and PostgreSQL.
• Designing RDBMS tables, views, user-created data types, indexes, stored procedures, cursors,
triggers, and transactions.
• Designing Parallel jobs using various stages for data processing.
• Worked with Informatica Developer tools, including Informatica Data Analyst and Informatica Developer.
• Experienced Data Warehousing Specialist with a strong background in designing and implementing
data warehousing solutions, developing ETL pipelines, and managing data integration processes.
• Configuring batch jobs in Denodo scheduler, implementing cluster settings, and load balancing for
improved performance.
• Architecting and maintaining CI/CD pipelines, applying automation with tools like Git,
Terraform, and Ansible.
• In-depth knowledge of Fixed Income Securities, Trading, Trade Lifecycle, and Market/Reference
Data.
• Proficient in Data Ingestion, Data Processing (Transformations, enrichment, and aggregations),
and a strong grasp of Distributed Systems and Parallel Processing concepts.
• Skilled in SQL querying on Relational Database systems (MySQL, MSSQL, Oracle, Postgres,
Teradata) and competent in NoSQL databases (Cassandra, DynamoDB, Cosmos DB, HBase).
• Hands-on experience with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob
Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
• Skilled in creating Tableau dashboards using various features for data visualization.
• Designing ETL data flows, creating mappings/workflows for data extraction from SQL Server,
and migration/transformation from Oracle/Access/Excel Sheets using SQL Server SSIS.
• Proficient in troubleshooting and maintaining ETL/ELT jobs using Matillion, with expertise in
building ETL scripts in tools and languages such as PL/SQL, Informatica, Hive, Pig, and PySpark.
Demonstrates proficiency in creating, debugging, scheduling, and monitoring jobs using Airflow and
Oozie.
• Experience with AWS Cloud Computing, involving configuration, deployment of instances, and
automation in cloud environments, utilizing services like EC2, S3, EBS, VPC, ELB, IAM, Glue,
Glue Crawlers, Redshift Spectrum, SNS, Auto Scaling, Lambda, CloudWatch, CloudTrail, and
CloudFormation.
• Experience in GCP services, including Cloud Storage, Dataflow, BigQuery, Pub/Sub, and Cloud
Composer.
• Hands-on experience with Snowflake cloud data warehouse and AWS S3 bucket for integrating
data from multiple source systems.
• Experienced at using PySpark for data analysis and possesses a good understanding of data modeling
concepts (Dimensional & Relational).
• Work extensively with US healthcare data, specifically EPIC, to derive actionable insights and
support data-driven decision-making.
• Managing data ingestion from diverse sources into HDFS using Sqoop and Flume, and performing
transformations on the ingested data with Hive and MapReduce. Proficient in Sqoop jobs
with incremental load for populating Hive external tables.
• Proficient in Oozie for managing Hadoop jobs through Directed Acyclic Graphs (DAGs) of actions
with control flows.
• Implemented security measures for Hadoop, integrating with Kerberos authentication
infrastructure.
• Importing and exporting databases using SQL Server Integration Services (SSIS) and Data
Transformation Services (DTS Packages).
• Experience with Hive partitions and bucketing concepts, designing Managed and External
tables, and working with various file formats like Avro, Parquet, ORC, JSON, and XML.
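
A minimal PySpark sketch of the Spark vectorized pandas UDF usage referenced in the list above; the column names and conversion logic are illustrative assumptions, not code from any of the engagements below.

```python
# Minimal sketch of a Spark vectorized (pandas) UDF; names and logic are illustrative only.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()

@pandas_udf(DoubleType())
def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
    # Vectorized: each call receives a whole pandas Series (one Arrow batch) at a time.
    return (temp_f - 32.0) * 5.0 / 9.0

df = spark.createDataFrame([(1, 98.6), (2, 212.0)], ["id", "temp_f"])
df.withColumn("temp_c", fahrenheit_to_celsius("temp_f")).show()
```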

Technical Skills:

Languages: Python, SQL, Java, Unix/Shell Scripting, Bash, Scala, R, SAS
Big Data Tools: Hadoop MapReduce, Impala, HDFS, Hive, Pig, HBase, Flume, Storm, Sqoop, Oozie, Spark, Hue, Kafka, Zookeeper
Databases: DB2, MySQL, Oracle, SQL Server, Snowflake
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB, Cosmos DB
Data Visualization Tools: Tableau, Power BI
ETL/BI: Informatica, SSIS, SSRS, SSAS, QlikView, Arcadia
Cloud: AWS (EC2, VPC, EBS, SNS, RDS, S3, Auto Scaling, Lambda, Redshift, CloudWatch), Azure (Data Factory, Databricks, Data Lake, Blob Storage, Cosmos DB)
DevOps Tools: Jenkins, Docker, Maven

Professional Experience:
Client: MONTEFIORE, Elmsford, NY June 2021 - Present

Role: Sr Data Engineer

• Responsible for maintaining high-quality reference data in the source by executing operations like
cleaning, transformation, and ensuring integrity in a relational environment through close
collaboration with stakeholders and solution architects.
• Developed PySpark code for AWS Glue jobs and EMR for streamlined data processing.
• Experience in resolving ongoing maintenance issues and bug fixes; monitoring Informatica
sessions as well as performance tuning of mappings and sessions.
• Conducted data blending and preparation using Alteryx and SQL for Tableau consumption,
publishing data sources to the Tableau server.
• Created Lambda functions with Boto3 to deregister unused AMIs in all application regions,
reducing costs for EC2 resources (see the sketch following this list).
• Designed and implemented a Security Framework for precise access control to AWS S3 objects
through AWS Lambda and DynamoDB.
• Imported data from various sources like HDFS/HBase into Spark RDD and performed
computations using PySpark to generate output responses.
• Utilized Lambda to configure DynamoDB Autoscaling and developed a Data Access Layer for
accessing AWS DynamoDB data.
• Managed data extraction into HDFS using Sqoop commands and scheduled Map/Reduce jobs
within Hadoop.
• Leveraged Data Integration, Apache Spark Engine, and AWS Databricks for efficient data
management with speed and scalability.
• Assisted in developing and reviewing technical documentation, including ETL workflows,
research, and data analysis.
• Gained experience with healthcare data and EPIC systems.
• Designed and implemented data warehousing solutions to support analytics and reporting needs,
optimizing for performance and scalability.
• Designed and implemented dimensional data models to support analytics and reporting,
optimizing for performance and scalability.
• Orchestrated job scheduling using Airflow scripts in Python, integrating various tasks into DAGs
along with AWS Lambda. Developed database tables, indexes, constraints, and triggers to ensure data
integrity.
• Developed a reusable framework for future migrations, automating ETL from RDBMS systems
to the Data Lake using Spark Data Sources and Hive data objects.
• Tuned Informatica mappings and sessions for optimum performance.
• Designed, developed, and tested ETL Processes in AWS Glue to migrate Campaign data from
external sources like S3, ORC/Parquet/Text Files into AWS Redshift.
• Oversaw migration of large datasets to Databricks (Spark), administered clusters, configured data
pipelines, and orchestrated data loading from Oracle to Databricks.
• Created Databricks notebooks for efficient data curation across diverse business use cases.
• Engaged in the migration of on-premises applications to AWS Redshift, utilizing services like
EC2 and S3 for processing and storage.
• Imported and exported databases using SQL Server Integration Services (SSIS) and Data
Transformation Services (DTS Packages).
• Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks
such as publishing data to S3, training ML models, and deploying them for prediction.
• Configured Kerberos authentication principles to establish secure network communication on the
cluster, conducting testing on HDFS, Hive, Pig, and MapReduce for new user access.
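
An illustrative sketch of the Boto3-based Lambda AMI cleanup described in the list above; the single-region scope, pagination, and return shape are simplifying assumptions, not the production code.

```python
# Hedged sketch: deregister AMIs owned by this account that no instance currently references.
import boto3

def lambda_handler(event, context):
    ec2 = boto3.client("ec2")

    # Collect AMI IDs referenced by instances in this region.
    in_use = set()
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                in_use.add(instance["ImageId"])

    # Deregister self-owned AMIs that are not in use (snapshot cleanup omitted here).
    deregistered = []
    for image in ec2.describe_images(Owners=["self"])["Images"]:
        if image["ImageId"] not in in_use:
            ec2.deregister_image(ImageId=image["ImageId"])
            deregistered.append(image["ImageId"])

    return {"deregistered": deregistered}
```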

Client: ILLUMINA, San Diego, CA Nov 2020 - June 2021


Role: Senior ETL Data Engineer

• Developed a framework to perform data profiling, cleansing, automated pipeline restart, and
managed rollback strategy for ETL processes using Azure Data Factory and SSIS.
• Orchestrated and enforced ETL and data solutions with Azure Data Factory and SSIS.
• Developed and maintained ETL pipelines to support the data warehousing and reporting
infrastructure.
• Translated business requirements into operational and application requirements.
• Involved in Migration of Informatica mappings from Oracle to SQL Server.
• Created Azure Data Factory (ADF) pipelines using Azure Blob.
• Designed and implemented data models to optimize data retrieval and reporting processes.
• Utilized Python scripting for automation, executing data curation with Azure Databricks.
• Managed data ingestion and processing across multiple Azure services (Azure Data Lake, Azure
Storage, Azure SQL, Azure DW) using Azure Databricks.
• Developed mapping files to map source columns to target columns.
• Created dashboards and visualizations using SQL Server Reporting Services (SSRS) and Power
BI for business analysis and upper management insights.
• Utilized Azure Logic Apps to automate batch jobs by integrating apps, ADF pipelines, and
services like HTTP requests and email triggers.
• Ingested data in mini-batches and performed RDD transformations using Spark Streaming for
streaming analytics in Databricks.
• Designed schemas for drilling data and created PySpark procedures, functions, and programs for
data loading.
• Developed multi-cloud strategies, optimizing for the respective strengths of GCP (PaaS) and Azure (SaaS).
• Managed, configured, and scheduled resources across the cluster using Azure Kubernetes
Service.
• Responsible for estimating the cluster size, monitoring, and troubleshooting of the Spark
Databricks cluster.
• Created Databricks notebooks using SQL, Python, and automated notebooks using jobs.
• Created/modified Informatica ETL mappings that map the source data from the various sources to
the target database and the data warehouse based on requirement.
• Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed
up data processing.
• Developed PySpark code for AWS Glue jobs and EMR for streamlined data processing.
• Conducted end-to-end Architecture & implementation assessment of various AWS services like
Amazon EMR, Redshift, S3.
• Coded Teradata BTEQ scripts for loading and transforming data, addressing defects like SCD 2
date chaining and cleaning up duplicates.
• Responsible for maintaining high-quality reference data in the source by executing operations like
cleaning, transformation, and ensuring integrity in a relational environment through close
collaboration with stakeholders and solution architects.
• Developed a reusable framework for future migrations, automating ETL from RDBMS systems
to the Data Lake using Spark Data Sources and Hive data objects.
• Extracted, transformed, and loaded data from Source Systems to Azure Data Storage using Azure
Data Factory and HDInsight.
• Created various pipelines to load data from Azure Data Lake into a staging SQL DB and then
into Azure SQL DB (see the sketch following this list).
• Worked extensively on Azure Data Factory including data transformations, Integration Runtimes,
Azure Key Vaults, Triggers, and migrating data factory pipelines using ARM Templates.
• Utilized Spark SQL through its Scala and Python interfaces, automatically converting RDDs of case
classes into schema RDDs.
• Implemented database solutions in Azure SQL Data Warehouse and Azure SQL, leading a team
of six developers through the migration process.
• Assisted in the development and maintenance of ETL pipelines and data models.
• Leveraged Azure Data Lake as a source and retrieved data using Azure Blob.
• Conducted Extract, Transform, and Load operations using a mix of Azure Data Factory, T-SQL,
Spark SQL, and U-SQL in Azure Data Lake Analytics.
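
A hedged PySpark sketch of the Data Lake-to-Azure SQL DB loading pattern mentioned in the list above; the storage account, container, column names, staging table, and credential handling are placeholders (a real pipeline would pull secrets from Key Vault or a Databricks secret scope).

```python
# Illustrative sketch: read curated Parquet from ADLS Gen2, cleanse, and stage into Azure SQL DB.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls-to-azure-sql-sketch").getOrCreate()

# Hypothetical ABFS path on Azure Data Lake Storage Gen2.
raw = spark.read.parquet("abfss://curated@examplelake.dfs.core.windows.net/claims/")

# Example transformation: de-duplicate and add a load-date audit column.
staged = (raw.dropDuplicates(["claim_id"])
             .withColumn("load_date", F.current_date()))

# Write to a staging table in Azure SQL DB over JDBC (connection details are placeholders).
jdbc_url = "jdbc:sqlserver://example-server.database.windows.net:1433;database=staging"
(staged.write
       .format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "dbo.stg_claims")
       .option("user", "etl_user")
       .option("password", "<from-key-vault>")
       .mode("overwrite")
       .save())
```
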
Client: UPS, Louisville, Kentucky Sep 2018 – Oct 2020
Role: Data Engineer

• Developed AWS Lambda functions in Python, enabling the invocation of Python scripts for
extensive transformations and analytics on large datasets in EMR clusters.
• Utilized Java API, Pig, and Hive for writing MapReduce Jobs for data extraction, transformation,
and aggregation from various file formats.
• Collaborated with business partners, Business Analysts, and product owners to comprehend
requirements and construct scalable distributed data solutions using the Hadoop ecosystem.
• Leveraged Spark for diverse transformations and actions, saving result data back to HDFS before
final storage in the Snowflake database.
• Worked with healthcare data, particularly EPIC, to support various analytics and reporting
requirements.
• Developed Spark scripts and UDFs for data aggregation, querying, and writing data back into
RDBMS through Sqoop.
• Demonstrated expertise in partitioning and bucketing concepts in Hive, designing Managed and
External tables for performance optimization.
• Created Hive tables on HDFS, developed Hive Queries for data analysis, and connected Tableau
with Spark clusters to build dashboards.
• Implemented Oozie scripts for managing and scheduling Hadoop jobs.
• Demonstrated proficiency in working with Azure Blob and Data Lake storage, loading data into
Azure Synapse Analytics (SQL DW).
• Developed Spark Streaming programs for near real-time data processing from Kafka,
incorporating both stateless and stateful transformations (see the streaming sketch following this list).
• Orchestrated Build and Release for multiple projects in a production environment using Visual
Studio Team Services (VSTS).
• Developed ETL processes (Data Stage Open Studio) using Flume and Sqoop to load data
from multiple sources into HDFS, with structural modifications using MapReduce and Hive.
• Utilized DataStax Spark connector for storing or retrieving data from Cassandra databases.
• Transformed data using AWS Glue dynamic frames with PySpark, cataloged the transformed
data using Crawlers, and scheduled jobs and crawlers using the workflow feature (see the Glue sketch
following this list).
• Managed cluster installation, data node commissioning and decommissioning, name node
recovery, capacity planning, and slots configuration.
• Utilized Hive for analyzing data ingested into HBase, computing metrics for reporting on
dashboards.
• Developed a log producer in Scala for application log transformation and integration with Kafka
and Zookeeper-based log collection platforms.
• Utilized Terraform scripts to automate instances, enhancing efficiency in comparison to manually
launched instances.
• Validated target data in the Data Warehouse, ensuring transformation and loading via Hadoop
Big Data.
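
A hedged sketch of the near real-time Kafka processing noted in the list above, shown here with PySpark Structured Streaming; the broker address, topic, and event schema are assumptions, and the console sink stands in for the real target system.

```python
# Illustrative sketch: Kafka source with one stateless and one stateful (windowed) transformation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

schema = StructType([
    StructField("package_id", StringType()),
    StructField("status", StringType()),
    StructField("weight_kg", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Source: a hypothetical Kafka topic (requires the spark-sql-kafka connector on the classpath).
events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker1:9092")
               .option("subscribe", "package-events")
               .load()
               .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
               .select("e.*"))

# Stateless transformation: simple filter/projection.
delivered = events.filter(F.col("status") == "DELIVERED")

# Stateful transformation: windowed counts per status with a watermark for late data.
counts = (events.withWatermark("event_time", "10 minutes")
                .groupBy(F.window("event_time", "5 minutes"), "status")
                .count())

query = (counts.writeStream
               .outputMode("update")
               .format("console")
               .start())
query.awaitTermination()
```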
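
A hedged sketch of an AWS Glue job using dynamic frames with PySpark, matching the Glue bullet above; the catalog database, table name, column mappings, and S3 output path are hypothetical.

```python
# Illustrative Glue job: read a crawled catalog table, remap columns, write Parquet to S3.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a Glue Crawler has already cataloged (names are placeholders).
source = glue_context.create_dynamic_frame.from_catalog(
    database="shipments_db", table_name="raw_shipments")

# Rename and cast columns with ApplyMapping.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("tracking_no", "string", "tracking_number", "string"),
              ("wt", "double", "weight_kg", "double")])

# Write the curated output back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/shipments/"},
    format="parquet")

job.commit()
```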

Thomson Reuters India Pvt Ltd, Hyderabad June 2014 – Dec 2017
Data Engineer

Responsibilities:

• Developed end-to-end QlikView report apps over user streaming data, performing operations on
the data using Python/Spark.
• Developed multi-cloud strategies, making better use of GCP (for its PaaS) and Azure (for its SaaS).
• Built scalable distributed Hadoop cluster running Hortonworks Data Platform.
• Gained working experience with data streaming processes using Kafka, Apache Spark, Hive, Pig, etc.
• Imported and exported data into HDFS using Sqoop, Flume, and Kafka.
• Used Spark Streaming APIs to perform the necessary transformations and actions on data received
from Kafka.
• Built data pipelines in Airflow on GCP for ETL-related jobs, using a mix of older and newer
Airflow operators (see the sketch following this list).
• Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of
data. Analyzed the SQL scripts and designed the solution to implement using Scala.
• Designed and developed automated processes using shell scripting for data movement.
• Involved in loading data from UNIX file system to HDFS using Shell Scripting.
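
A hedged sketch of an Airflow DAG on GCP (Cloud Composer) for the ETL pipelines described in the list above; the bucket, dataset and table names, and choice of operators are assumptions rather than the original project code.

```python
# Illustrative Airflow DAG: load daily JSON files from GCS into BigQuery, then validate the load.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.operators.python import PythonOperator

def validate_load(**context):
    # Placeholder validation step; real logic would check row counts or partitions.
    print("validating load for", context["ds"])

with DAG(
    dag_id="daily_gcs_to_bigquery",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = GCSToBigQueryOperator(
        task_id="load_events",
        bucket="example-landing-bucket",
        source_objects=["events/{{ ds }}/*.json"],
        destination_project_dataset_table="example_project.analytics.events",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
    )
    validate = PythonOperator(task_id="validate_load", python_callable=validate_load)

    load >> validate
```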
