Anisha
ETL Data Engineer
[email protected]
Ph: 314-717-0287
Accomplished Data Engineer with 9+ years of expertise in machine learning, data mining, and the
management of large structured and unstructured datasets. Proficient in data acquisition, validation,
predictive modeling, data visualization, web crawling, and web scraping. Demonstrated proficiency in
Big Data technologies, including Hadoop and Hive, coupled with expertise in statistical programming
languages such as R and Python. Experienced in navigating the Hadoop ecosystem and leveraging key
Big Data components, including Apache Spark, Scala, Python, HDFS, MapReduce, and Kafka.
Technical Skills:
Professional Experience:
Client: Montefiore, Elmsford, NY June 2021 - Present
• Responsible for maintaining high-quality reference data in the source by executing operations like
cleaning, transformation, and ensuring integrity in a relational environment through close
collaboration with stakeholders and solution architects.
• Developed PySpark code for AWS Glue jobs and EMR for streamlined data processing.
• Resolved ongoing maintenance issues and bug fixes, monitored Informatica sessions, and performed performance tuning of mappings and sessions.
• Conducted data blending and preparation using Alteryx and SQL for Tableau consumption,
publishing data sources to the Tableau server.
• Created Lambda functions with Boto3 to deregister unused AMIs in all application regions, reducing costs for EC2 resources (see the Python sketch at the end of this section).
• Designed and implemented a Security Framework for precise access control to AWS S3 objects
through AWS Lambda and DynamoDB.
• Imported data from various sources like HDFS/HBase into Spark RDD and performed
computations using PySpark to generate output responses.
• Utilized Lambda to configure DynamoDB Autoscaling and developed a Data Access Layer for
accessing AWS DynamoDB data.
• Managed data extraction into HDFS using Sqoop commands and scheduled Map/Reduce jobs
within Hadoop.
• Leveraged Data Integration, Apache Spark Engine, and AWS Databricks for efficient data
management with speed and scalability.
• Assisted in developing and reviewing technical documentation, including ETL workflows,
research, and data analysis.
• Gained experience with healthcare data and EPIC systems.
• Designed and implemented data warehousing solutions and dimensional data models to support analytics and reporting needs, optimizing for performance and scalability.
• Orchestrated job scheduling using Airflow scripts in Python, integrating tasks into DAGs alongside AWS Lambda functions (see the DAG sketch at the end of this section). Developed database tables, indexes, constraints, and triggers to ensure data integrity.
• Developed a reusable framework for future migrations, automating ETL from RDBMS systems
to the Data Lake using Spark Data Sources and Hive data objects.
• Designed, developed, and tested ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift (see the Glue job sketch at the end of this section).
• Oversaw migration of large datasets to Databricks (Spark), administered clusters, configured data
pipelines, and orchestrated data loading from Oracle to Databricks.
• Created Databricks notebooks for efficient data curation across diverse business use cases.
• Engaged in the migration of on-premises applications to AWS Redshift, utilizing services like
EC2 and S3 for processing and storage.
• Imported and exported databases using SQL Server Integration Services (SSIS) and Data
Transformation Services (DTS Packages).
• Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks
such as publishing data to S3, training ML models, and deploying them for prediction.
• Configured Kerberos authentication principals to establish secure network communication on the cluster, conducting testing on HDFS, Hive, Pig, and MapReduce for new user access.
• Developed a framework to perform data profiling, cleansing, automated pipeline restart, and
managed rollback strategy for ETL processes using Azure Data Factory and SSIS.
• Orchestrated and enforced ETL and data solutions with Azure Data Factory and SSIS.
• Developed and maintained ETL pipelines to support the data warehousing and reporting
infrastructure.
• Translated business requirements into operational and application requirements.
• Involved in Migration of Informatica mappings from Oracle to SQL Server.
• Created Azure Data Factory (ADF) pipelines using Azure Blob.
• Designed and implemented data models to optimize data retrieval and reporting processes.
• Utilized Python scripting for automation, executing data curation with Azure Databricks (see the notebook sketch at the end of this section).
• Managed data ingestion and processing across multiple Azure services (Azure Data Lake, Azure
Storage, Azure SQL, Azure DW) using Azure Databricks.
• Developed mapping files to map source columns to target columns.
• Created dashboards and visualizations using SQL Server Reporting Services (SSRS) and Power
BI for business analysis and upper management insights.
• Utilized Azure Logic Apps to automate batch jobs by integrating apps, ADF pipelines, and
services like HTTP requests and email triggers.
• Ingested data in mini-batches and performed RDD transformations using Spark Streaming for
streaming analytics in Databricks.
• Designed schemas for drilling data and created PySpark procedures, functions, and programs for
data loading.
• Developed multi-cloud strategies, leveraging the respective strengths of GCP (PaaS) and Azure (SaaS).
• Managed, configured, and scheduled resources across the cluster using Azure Kubernetes
Service.
• Estimated cluster size and handled monitoring and troubleshooting of the Spark Databricks cluster.
• Created Databricks notebooks using SQL, Python, and automated notebooks using jobs.
• Created/modified Informatica ETL mappings that map the source data from the various sources to
the target database and the data warehouse based on requirement.
• Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up data processing.
• Conducted end-to-end architecture and implementation assessments of AWS services such as Amazon EMR, Redshift, and S3.
• Coded Teradata BTEQ scripts for loading and transforming data, addressing defects like SCD 2
date chaining and cleaning up duplicates.
• Extracted, transformed, and loaded data from Source Systems to Azure Data Storage using Azure
Data Factory and HDInsight.
• Created various pipelines to load data from Azure Data Lake into a staging SQL DB and then into Azure SQL DB.
• Worked extensively on Azure Data Factory including data transformations, Integration Runtimes,
Azure Key Vaults, Triggers, and migrating data factory pipelines using ARM Templates.
• Utilized Spark SQL through its Scala and Python interfaces, automatically converting case-class RDDs to schema RDDs.
• Implemented database solutions in Azure SQL Data Warehouse and Azure SQL, leading a team
of six developers through the migration process.
• Assisted in the development and maintenance of ETL pipelines and data models.
• Leveraged Azure Data Lake as a source and retrieved data using Azure Blob.
• Conducted Extract, Transform, and Load operations using a mix of Azure Data Factory, T-SQL,
Spark SQL, and U-SQL in Azure Data Lake Analytics.
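The Python sketches below illustrate, in simplified and hypothetical form, a few of the patterns described in the bullets above. This one covers the Boto3 Lambda for deregistering unused AMIs; the region list and the "unused" criterion (an AMI not referenced by any instance in the account) are assumptions for illustration.

import boto3

REGIONS = ["us-east-1", "us-west-2"]  # assumed application regions

def lambda_handler(event, context):
    """Deregister account-owned AMIs that no instance currently references."""
    deregistered = []
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)

        # AMIs owned by this account
        images = ec2.describe_images(Owners=["self"])["Images"]

        # AMIs still referenced by instances (pagination omitted for brevity)
        in_use = {
            inst["ImageId"]
            for reservation in ec2.describe_instances()["Reservations"]
            for inst in reservation["Instances"]
        }

        for image in images:
            if image["ImageId"] not in in_use:
                ec2.deregister_image(ImageId=image["ImageId"])
                deregistered.append(image["ImageId"])
    return {"deregistered": deregistered}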
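A minimal Airflow DAG sketch for the job-scheduling bullet above, chaining a Python extract task with an AWS Lambda invocation; the DAG id, schedule, region, and function name are placeholders.

from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_reference_data(**context):
    # Placeholder for the extract/cleanse step described in the bullets above
    print("extracting reference data")


def invoke_cleanup_lambda(**context):
    # Fire-and-forget invocation of the (assumed) AMI-cleanup Lambda
    client = boto3.client("lambda", region_name="us-east-1")
    client.invoke(FunctionName="ami-cleanup", InvocationType="Event")


with DAG(
    dag_id="reference_data_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_reference_data)
    cleanup = PythonOperator(task_id="invoke_lambda", python_callable=invoke_cleanup_lambda)

    extract >> cleanup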
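A simplified outline of the AWS Glue ETL described above, reading Parquet campaign files from S3 and loading them into Redshift; the bucket, Glue connection, column mappings, and target table are assumptions.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw campaign data from S3 (Parquet here; ORC/text work the same way)
campaigns = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/campaigns/"]},
    format="parquet",
)

# Rename and cast columns before loading
mapped = ApplyMapping.apply(
    frame=campaigns,
    mappings=[
        ("campaign_id", "string", "campaign_id", "string"),
        ("spend", "double", "spend_usd", "double"),
    ],
)

# Write to Redshift through a Glue JDBC connection
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",  # assumed Glue connection name
    connection_options={"dbtable": "analytics.campaigns", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/tmp/",
)

job.commit()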
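A hypothetical fragment of a Databricks notebook curating Azure Data Lake data with PySpark; the storage path, column names, and target table are assumptions.

from pyspark.sql import SparkSession, functions as F

# In a Databricks notebook this returns the notebook's built-in session
spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://raw@examplelake.dfs.core.windows.net/claims/"
raw = spark.read.format("parquet").load(raw_path)

# Basic curation: de-duplicate, stamp the ingest date, drop incomplete rows
curated = (
    raw.dropDuplicates(["claim_id"])
       .withColumn("ingest_date", F.current_date())
       .filter(F.col("claim_amount").isNotNull())
)

# Persist as a Delta table (assumed existing "curated" database) for SQL and Power BI use
(curated.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable("curated.claims"))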
Client: UPS, Louisville, Kentucky Sep 2018 – Oct 2020
Role: Data Engineer
• Developed AWS Lambda functions in Python, enabling the invocation of Python scripts for extensive transformations and analytics on large datasets in EMR clusters.
• Utilized the Java API, Pig, and Hive to write MapReduce jobs for data extraction, transformation, and aggregation across various file formats.
• Collaborated with business partners, Business Analysts, and product owners to comprehend
requirements and construct scalable distributed data solutions using the Hadoop ecosystem.
• Leveraged Spark for diverse transformations and actions, saving result data back to HDFS before
final storage in the Snowflake database.
• Worked with healthcare data, particularly EPIC, to support various analytics and reporting
requirements.
• Developed Spark scripts and UDFs for data aggregation, querying, and writing data back into RDBMS through Sqoop.
• Demonstrated expertise in partitioning and bucketing concepts in Hive, designing managed and external tables for performance optimization (see the Hive DDL sketch at the end of this section).
• Created Hive tables on HDFS, developed Hive Queries for data analysis, and connected Tableau
with Spark clusters to build dashboards.
• Implemented Oozie scripts for managing and scheduling Hadoop jobs.
• Demonstrated proficiency in working with Azure Blob and Data Lake storage, loading data into Azure Synapse Analytics (SQL DW).
• Developed Spark Streaming programs for near real-time data processing from Kafka, incorporating both stateless and stateful transformations (see the streaming sketch at the end of this section).
• Orchestrated Build and Release for multiple projects in a production environment using Visual
Studio Team Services (VSTS).
• Developed ETL processes (Data Stage Open Studio) using Flume and Sqoop to load data from multiple sources into HDFS, with structural modifications applied using MapReduce and Hive.
• Utilized DataStax Spark connector for storing or retrieving data from Cassandra databases.
• Transformed data using AWS Glue dynamic frames with PySpark, cataloged the transformed
data using Crawlers, and scheduled jobs and crawlers using the workflow feature.
• Managed cluster installation, data node commissioning and decommissioning, name node
recovery, capacity planning, and slots configuration.
• Utilized Hive for analyzing data ingested into HBase, computing metrics for reporting on
dashboards.
• Developed a log producer in Scala for application log transformation and integration with Kafka
and Zookeeper-based log collection platforms.
• Utilized Terraform scripts to automate instances, enhancing efficiency in comparison to manually
launched instances.
• Validated target data in the data warehouse, ensuring correct transformation and loading through the Hadoop big data stack.
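A hypothetical Spark SQL sketch of the Hive partitioning and bucketing pattern from the bullets above; the database, tables, columns, and bucket count are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-tables").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS logistics")

# External table partitioned by load date, so partition pruning limits scans
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS logistics.shipments_raw (
        tracking_no STRING,
        origin      STRING,
        destination STRING,
        weight_kg   DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION 'hdfs:///data/logistics/shipments_raw'
""")

# Managed table bucketed by tracking number to help joins and sampling
spark.sql("""
    CREATE TABLE IF NOT EXISTS logistics.shipments_bucketed (
        tracking_no STRING,
        origin      STRING,
        destination STRING,
        weight_kg   DOUBLE
    )
    CLUSTERED BY (tracking_no) INTO 32 BUCKETS
    STORED AS ORC
""")

# Load one day's data into the partitioned external table from an assumed staging table
spark.sql("""
    INSERT OVERWRITE TABLE logistics.shipments_raw PARTITION (load_date = '2020-01-15')
    SELECT tracking_no, origin, destination, weight_kg
    FROM logistics.shipments_staging
    WHERE load_date = '2020-01-15'
""")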
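A simplified Structured Streaming counterpart of the Kafka processing described above, combining a stateless filter with a stateful windowed count; the broker, topic, and window sizes are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-near-real-time").getOrCreate()

# Stream package-scan events from Kafka
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "package-scans")
         .load()
         .selectExpr("CAST(value AS STRING) AS value", "timestamp")
)

# Stateless transformation: keep only non-empty messages
valid = events.filter(F.length("value") > 0)

# Stateful transformation: event counts per 5-minute window with a watermark
counts = (
    valid.withWatermark("timestamp", "10 minutes")
         .groupBy(F.window("timestamp", "5 minutes"))
         .count()
)

query = (
    counts.writeStream
          .outputMode("update")
          .format("console")  # sink kept simple for the sketch
          .start()
)
query.awaitTermination()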
Thomson Reuters India Pvt Ltd, Hyderabad June 2014 – Dec 2017
Data Engineer
Responsibilities:
• Developed end-to-end QlikView report applications over streaming user data, performing operations on the data using Python/Spark.
• Developed multi-cloud strategies to better leverage GCP (for its PaaS) and Azure (for its SaaS).
• Built scalable distributed Hadoop cluster running Hortonworks Data Platform.
• Gained working experience with data streaming processes using Kafka, Apache Spark, Hive, and Pig.
• Imported and exported data into HDFS using Sqoop, Flume, and Kafka.
• Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka.
• Built data pipelines in Airflow on GCP for ETL-related jobs using a range of Airflow operators, both legacy and newer provider operators (see the DAG sketch at the end of this section).
• Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data; analyzed SQL scripts and designed solutions implemented in Scala.
• Designed and developed automated processes for data movement using shell scripting.
• Involved in loading data from UNIX file system to HDFS using Shell Scripting.
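A hypothetical Cloud Composer-style Airflow DAG for the GCP ETL pipelines mentioned above, loading GCS files into BigQuery and then running a transform; the project, bucket, dataset, and table names are assumptions.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="gcs_to_bq_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load the day's raw JSON files from GCS into a staging table
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw",
        bucket="example-raw-bucket",
        source_objects=["events/{{ ds }}/*.json"],
        destination_project_dataset_table="example-project.staging.events",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
    )

    # Aggregate the staging data into an analytics table
    transform = BigQueryInsertJobOperator(
        task_id="transform",
        configuration={
            "query": {
                "query": (
                    "SELECT user_id, COUNT(*) AS events "
                    "FROM `example-project.staging.events` GROUP BY user_id"
                ),
                "destinationTable": {
                    "projectId": "example-project",
                    "datasetId": "analytics",
                    "tableId": "daily_user_events",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_raw >> transform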