Vijay - Data Engineer Re
[email protected]
PH: 9725651721
Sr. Data Engineer
PROFESSIONAL SUMMARY
Around 10 years of IT experience in Analysis, Design, and Development in Big Data technologies like Spark,
MapReduce, Hive, Kafka, and HDFS including programming languages like Java, Scala, and Python.
Strong experience building data pipelines and performing large-scale data transformations.
In-depth knowledge in working with Distributed Computing Systems and parallel processing techniques to
efficiently deal with Big Data.
Experience with different cloud-based storage systems like S3, Azure Blob Storage, and Azure Data Lake Storage Gen1 & Gen2.
Hands-on experience in MS SQL Server with Business Intelligence in SQL Server Integration Services (SSIS), SQL Server Analysis Services (SSAS), and SQL Server Reporting Services (SSRS), plus Azure cloud technologies including Azure Database, Azure SQL, Azure SQL Data Warehouse, Azure Data Factory (ADF), Azure Data Lake (ADL), and Azure Databricks (ADB).
Implemented scalable microservices applications using Spring Boot for building REST endpoints.
Experience in Generative AI, language models, and related techniques.
Hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery.
Design, build, and maintain scalable data pipelines using Snowflake and DBT.
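A minimal sketch of this kind of Snowflake-plus-DBT pipeline step in Python; the account, credentials, stage, table, and dbt model names are placeholder assumptions for illustration only.

    # Load raw data into Snowflake, then run dbt models over it (illustrative only).
    import subprocess
    import snowflake.connector

    def load_and_transform():
        # Hypothetical connection details; in practice these come from a secrets store.
        conn = snowflake.connector.connect(
            account="my_account", user="etl_user", password="***",
            warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
        )
        try:
            # Copy a staged CSV drop into a landing table.
            conn.cursor().execute(
                "COPY INTO RAW.ORDERS FROM @ORDERS_STAGE "
                "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
            )
        finally:
            conn.close()
        # Build the curated layer with dbt models selected by name.
        subprocess.run(["dbt", "run", "--select", "marts_orders"], check=True)

    if __name__ == "__main__":
        load_and_transform()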
Extensively used Docker containers for testing Cassandra.
Experience with GCP Looker (Google reporting tool).
Developed MapReduce jobs in Java for data cleaning and preprocessing.
Setting up the Qlik Replicate server, including configuring necessary components such as the management
console, data sources, and targets.
Analyzing replication performance and making necessary adjustments to optimize throughput and reduce latency.
Understanding and utilizing features like parallel processing, batching, and partitioning
Ensuring proper installation of any required drivers for the source and target databases
Worked with Databricks and Jupyter notebooks, storing data in secure and efficient storage such as Azure Blob Storage, Azure Data Lake Storage (ADLS Gen1 & Gen2), and Azure BW.
Creating and configuring replication tasks to define what data to replicate, including full loads and change data capture (CDC).
Monitoring task performance, analyzing logs, and troubleshooting issues as they arise.
Orchestrated data integration pipelines in ADF using various Activities like Get Metadata, Lookup, For Each,
Wait, Execute Pipeline, Set Variable, Filter, until, etc.
Working knowledge of Message Queue (MQ), MQ Trigger, and embedding XML tags in RPG Programs
Experience with real-time data processing, real-time programming, and Generative AI.
Create and maintain data pipelines using Matillion ETL and Fivetran.
Create and modify PowerShell and XML files.
In-depth understanding of data warehousing concepts, cloud platforms (AWS, Azure, GCP)
Evaluated Fivetran and Matillion for streaming and batch data ingestion into Snowflake
Good experience in Cassandra in the AWS environment.
Ensure robustness, reliability, and scalability of AI/ML solutions in production environments.
Able to identify and recommend solutions to PostgreSQL query and stored procedure performance issues.
Built data pipelines using Azure Databricks.
Experience creating, designing, and deploying Logic Apps to automate business processes using templates and APIs.
Experience developing enterprise-level solutions using batch processing (Apache Pig) and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).
Loaded the data to Azure Data Lake, Azure SQL Database.
Strong Teamcenter core concepts, Data Modelling & BMIDE Skills
Built data pipelines and ETL processes using Snowflake, AWS services, Python, and DBT.
Ability to develop SSIS packages to augment ingestion of data and automate processing of SSAS Cubes.
Expertise in Azure ADF, Azure Databricks, and Azure Data Lake with Azure Synapse database.
Used Azure SQL Data Warehouse to control and grant database access.
Migrated packages and stored procedures and optimized them; able to identify and recommend solutions for PostgreSQL query performance issues.
Keen on learning the newer technology stack that Google Cloud Platform (GCP) adds.
Builds data transformations with SSIS, including importing data from files and moving data from one database platform to another.
Contributed to strategic company objectives of embedding AI into the product portfolio.
Knowledge of data modeling, interface management applications and tools.
Analyze live ADF and SQL data and fine-tune performance.
Worked with Azure services such as HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage
Explorer, SQL DB, SQL DWH, Cosmos DB.
Understand and address the challenges faced during the Power BI rollout, including licensing terms and
compliance.
Also have 4+ years of experience working on traditional DWH solutions using Netezza, Teradata, DataStage, and the IBM CDC replication tool.
Advanced working knowledge of various software applications (Fivetran, Airbyte, HVR, Apache Flink, SQL).
Build robust data pipelines on the Cloud using AWS Glue, Aurora Postgres, EKS, Redshift, PySpark, Lambda, and
Snowflake.
Excellent knowledge of working with dynamic data (JSON & XML) through various interface types, such as REST
API
Proficient in creating Terraform scripts.
Hands on experience in setting up workflow using Apache Airflow and Oozie workflow engine for managing and
scheduling Hadoop jobs.
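As an illustration of that kind of scheduling, below is a minimal Airflow DAG sketch that submits a daily PySpark job to YARN; the script path, schedule, and cluster settings are assumptions.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_spark_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Submit a PySpark job to the YARN cluster once per day.
        run_spark_etl = BashOperator(
            task_id="run_spark_etl",
            bash_command=(
                "spark-submit --master yarn --deploy-mode cluster "
                "/opt/etl/jobs/daily_etl.py --run-date {{ ds }}"
            ),
        )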
Experience building highly scalable real-time Data Pipelines using Apache Kafka and Hudi.
Extensive knowledge and experience on ingestion tools like HVR, ATTUNITY.
Experience with Snowflake, DBT, Fivetran, and HVR replication tools.
Firm understanding of Hadoop architecture and various components including HDFS, Yarn, MapReduce, Hive,
Pig, HBase, Kafka, Oozie etc.
Proficient in configuring Apache Spark clusters for optimal performance and resource utilization
Experience with tuning Spark configuration parameters, including memory settings, core allocation, and
dynamic allocation.
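For illustration, a hedged PySpark snippet showing the kind of configuration involved; the specific memory, core, and executor values are assumptions that would be sized to the actual cluster and workload.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tuned-etl-job")
        .config("spark.executor.memory", "8g")          # per-executor heap
        .config("spark.executor.cores", "4")            # cores per executor
        .config("spark.sql.shuffle.partitions", "400")  # sized to data volume
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .getOrCreate()
    )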
Spearheaded design and implementation of modern data solutions on Azure PaaS services, including Azure Data Factory (ADF), Azure Synapse Analytics, and Azure Databricks; designed and implemented microservices architecture using Microsoft Azure Service Fabric.
Strong experience building Spark applications using Scala and Python as programming languages.
Good experience troubleshooting and fine-tuning long-running Spark applications.
Strong experience using the Spark RDD API, Spark DataFrame/Dataset API, Spark SQL, and Spark ML frameworks for building end-to-end data pipelines.
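A short PySpark sketch combining the DataFrame API and Spark SQL in one pipeline; the S3 paths and column names are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders-pipeline").getOrCreate()

    # DataFrame API: read, deduplicate, and derive a date column.
    orders = (
        spark.read.json("s3://raw-bucket/orders/")
        .dropDuplicates(["order_id"])
        .withColumn("order_date", F.to_date("order_ts"))
    )

    # Spark SQL: aggregate through a temporary view.
    orders.createOrReplaceTempView("orders")
    daily_revenue = spark.sql(
        "SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date"
    )

    daily_revenue.write.mode("overwrite").parquet("s3://curated-bucket/daily_revenue/")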
Extensive hands-on experience tuning Spark jobs.
Experienced in working with structured data using HiveQL and optimizing Hive queries.
Good experience working with real time streaming pipelines using Kafka and Spark-Streaming.
Strong experience working with Hive for performing various data analysis.
Detailed exposure to various Hive concepts like partitioning, bucketing, join optimizations, SerDes, built-in UDFs, and custom UDFs.
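A brief sketch of those Hive patterns (partitioning, bucketing, and a custom UDF), issued here through Spark SQL with Hive support; the table and column names are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("hive-patterns").enableHiveSupport().getOrCreate()

    # Partitioned and bucketed Hive table.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_bucketed (
            order_id STRING, amount DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        CLUSTERED BY (order_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Custom UDF registered for use inside SQL queries.
    spark.udf.register(
        "normalize_id", lambda s: s.strip().upper() if s else None, StringType()
    )
    spark.sql("SELECT normalize_id(order_id) AS order_id, amount FROM sales_bucketed").show()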
Good experience in automating end to end data pipelines using Oozie workflow orchestrator.
Good experience working with Cloudera, Hortonworks, Snowflake and AWS big data services.
Working on AWS services such as IAM, EC2, VPC, AMI, SNS, RDS, SQS, EMR, Lambda, Glue, Athena, DynamoDB, Kinesis, CloudWatch, Auto Scaling, S3, and Route 53.
Implemented Lambda to configure the DynamoDB Auto Scaling feature and implemented a Data Access Layer to access AWS DynamoDB data.
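A minimal data-access-layer sketch for DynamoDB using boto3; the table name, region, and key schema are placeholder assumptions.

    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    table = dynamodb.Table("customer_profiles")

    def get_customer(customer_id):
        # Fetch a single item by its partition key; returns None if absent.
        response = table.get_item(Key={"customer_id": customer_id})
        return response.get("Item")

    def save_customer(item):
        # Insert or overwrite a customer record.
        table.put_item(Item=item)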
Performed PowerShell scripting for Active Directory and Exchange.
Developed and deployed various Lambda functions in AWS with in-built AWS Lambda Libraries and deployed
Lambda Functions in Scala with custom Libraries.
Experience with developing and maintaining Applications written for AWS S3, AWS EMR (Elastic Map Reduce),
and AWS Cloud Watch.
Experience in analyzing, designing, and developing ETL Strategies and processes, writing ETL specifications.
Excellent understanding of NoSQL databases like HBASE, Cassandra, MongoDB.
Proficient knowledge and hands-on experience in writing shell scripts in Linux.
Develop PowerShell scripts to automate locating and correcting accounts with provisioning issues.
Experienced in requirement analysis, application development, application migration and maintenance using
Software Development Lifecycle (SDLC) and Python/Java technologies.
Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
Adequate knowledge and working experience in Agile and Waterfall Methodologies.
Defining user stories and driving the agile board in JIRA during project execution, participating in sprint demos and retrospectives.
Good interpersonal and communication skills, strong problem-solving skills, the ability to explore and adopt new technologies with ease, and a good team player.
In-depth knowledge of Snowflake Database, Schema and Table structures.
Experience with gathering and analyzing the system requirements.
TECHNICAL SKILLS:
Hadoop/Big Data Technologies: Hadoop, Apache Spark, HDFS, MapReduce, Sqoop, Hive, Oozie, Zookeeper, Kafka, Flume
Programming & Scripting: Python, SQL, PySpark
Databases: MySQL, Oracle, MS SQL Server, Star Schema, Teradata
NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB
Big Data Distributions: Hortonworks, Cloudera, Spark
Version Control: Git, Bitbucket
Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7
Cloud Computing: Azure Data Lake, Blob Storage, Data Factory, Azure Databricks, Azure SQL Database, Azure Synapse Analytics; AWS S3, EMR, Amazon RDS, Athena, Glue, Kinesis, Redshift, DynamoDB, Lambda
Visualization Tools: Power BI, Tableau, Matplotlib, Seaborn, QuickSight
PROFESSIONAL EXPERIENCE:
Role: Data Engineer April 2023 to Present
Client: DTCC, NJ
Responsibilities:
Responsible for the complete management of ETL data pipelines for the customer, emphasizing the
improvement of pricing plans and customer analytics via Snowflake and Azure services.
Experience working with Airbyte and Apache Flink.
Experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (DW).
Accessed Datamart datasets via Snowflake and Databricks for further data iteration.
Worked on scaling microservices applications using Kubernetes and Docker.
Automated jobs using different triggers like Event, Schedule, and Tumbling Window in ADF.
Implemented advanced query optimization techniques and strategic indexing strategies to enhance data fetching efficiency by 30%.
Led the integration of diverse data sources, including transactional data and customer demographics, using
Azure Data Factory to efficiently collect and aggregate information.
Managed and enhanced ETL data pipelines with success, managing 50 million records a day on average, and
guaranteeing scalability and continuous operation.
Experienced in Microsoft Azure Cloud technologies including Azure Data Factory (ADF), Azure Data Lake Storage
(ADLS)
Developed UDFs in Java as and when necessary for use in Hive queries.
Implemented multiple modules in microservices to expose data through RESTful APIs.
Built pipelines to move hashed and un-hashed data from Azure Blob to Data Lake.
Develop PowerShell scripts to perform pre-migration assessments of Active Directory and server states.
Collaborate with the DS team to develop a self-service internal developer Generative AI platform.
Real-time data processing from Azure Service Bus using Spark Streaming, Airbyte, and Flink.
Deploy, maintain and automate Azure and other services using PowerShell and Azure CLI.
Designed SSIS packages to extract, transfer, and load (ETL) existing data into SQL Server from different environments.
Experience in data conversion and data migration using SSIS across different databases like Oracle, MS Access, and flat files.
Design and build fault-tolerant infrastructure to support the Generative AI reference architecture.
Debug and tune SSIS or other ETL processes to ensure accurate and efficient movement of data
Worked with relational databases such as MySQL and PostgreSQL, and NoSQL databases such as DynamoDB.
Experience with Infrastructure as Code using CloudFormation, Terraform, or similar tools.
Hands-on experience with AI frameworks and libraries such as TensorFlow, PyTorch, or similar.
Creating Databricks notebooks using SQL, Python and automated notebooks using jobs.
Creating Spark clusters and configuring high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
Experience in generating data pipelines in Azure ADF.
Experience in ingesting data from different sources using Azure ADF.
Designed and developed various SSIS packages (ETL) to extract and transform data and was involved in scheduling SSIS packages.
Created, provisioned different Databricks clusters needed for batch and continuous streaming data processing
and installed the required libraries for the clusters.
Created Linked services to connect the external resources to ADF.
Actively collaborated on ETL activities and put in place reliable error-handling mechanisms, contributing to a 25% improvement in the reliability and integrity of the data pipeline.
Made use of Snowflake to manage and store a variety of data formats, enabling scalability and effective data
extraction for additional processing.
Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in
production.
Experience automating processes using runbooks and automating configuration management using Desired State Configuration.
Involved in the development of real time streaming applications using PySpark, Apache Flink, Kafka, Hive on
distributed Hadoop Cluster.
Expert in migrating SQL databases to Azure Data Lake Storage, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory (ADF).
Validating DSE Graph database. Implemented TDE for data at rest on DSE Cassandra.
Experience with Terraform and/or native Azure automation.
Handled complete IBM CDC replication implementation for Confidential.
Working with REST APIs, code packages, and deployment tools.
The ODS replication layer consisted of IBM CDC for Oracle, and deltas were generated using flat files.
Developed a common Flink module for serializing and deserializing AVRO data by applying schemas.
Employed SQL queries for data retrieval and manipulation, encompassing DDL, DML, and diverse database
objects.
Used Azure Data Factory's built-in features to implement data cleansing and transformation, solving problems
like duplicates and guaranteeing data integrity.
Provide training and support to users to maximize the adoption and effective use of Power BI.
Designed and implemented data loading and aggregation frameworks and jobs that will be able to handle
hundreds of GBs of json files, using Spark, Airflow and Snowflake.
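A hedged PySpark sketch of that pattern, aggregating large JSON drops and writing the result to Snowflake through the Spark-Snowflake connector (assumed to be installed on the cluster); the paths, connection options, and table name are placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("json-aggregation").getOrCreate()

    events = spark.read.json("abfss://raw@storageaccount.dfs.core.windows.net/events/")

    daily_counts = (
        events.withColumn("event_date", F.to_date("event_ts"))
        .groupBy("event_date", "event_type")
        .count()
    )

    sf_options = {
        "sfURL": "my_account.snowflakecomputing.com",
        "sfUser": "etl_user", "sfPassword": "***",
        "sfDatabase": "ANALYTICS", "sfSchema": "CURATED", "sfWarehouse": "ETL_WH",
    }

    (daily_counts.write
        .format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", "DAILY_EVENT_COUNTS")
        .mode("overwrite")
        .save())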
Working on Data migration from Oracle to Cassandra.
Expert in building Databricks notebooks to extract data from various source systems like DB2 and Teradata and perform data cleansing, data wrangling, ETL processing, and loading to Azure SQL DB.
Employing Azure Data Factory to integrate on-premises databases (MySQL, Cassandra) with cloud-based
solutions (Blob storage, Azure SQL DB), apply transformations, and load data into Snowflake.
Analyzed client behavior using Snowflake and Azure Machine Learning, finding trends in age, behavior, and
hobbies to help with decision-making.
Developing applications using Microsoft Visual Studio/Azure DevOps/C#/.NET/MVC/JavaScript.
Environment: Azure Databricks, Azure Data Factory, Azure Logic Apps, Functional App, Snowflake,
Snowflake Schema, MySQL, Azure SQL Database, HDFS, MapReduce, YARN, Apache Spark, Apache Hive, SQL,
Responsibilities:
Responsible for the design, implementation, and architecture of very large-scale data intelligence solutions
around big data platforms.
Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop and Zookeeper.
Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
Developed different modules in micro services to collect stats of application for visualization.
Use Amazon Elastic Cloud Compute (EC2) infrastructure for computational tasks and Simple Storage Service (S3)
as storage mechanism.
Building the pipelines to copy the data from source to destination in Azure Data Factory (ADF V1).
Working Experience on Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory,
Storage Explorer, SQL DB, SQL DWH, Cosmos DB).
Proficient in Terraform and cloud infrastructure automation tools.
Involved in building an Enterprise DataLake using Data Factory and Blob storage, enabling other teams to work
with more complex scenarios and ML solutions.
Expert in developing JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data.
Extensive experience working on HVR as a real-time database replication software. Ability to create table and
load using HVR software and configure HVR for integration.
Design and build highly performant function-based API's.
Capable of using AWS utilities such as EMR, S3 and Cloud Watch to run and monitor Hadoop and Spark jobs on
AWS.
Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data
Analysis and Data Profiling using Complex SQL queries on various systems.
Troubleshoot and resolve data processing issues and proactively engaged in data modelling discussions.
Worked on RDD architecture, implementing Spark operations on RDDs and optimizing transformations and actions in Spark.
Written programs in Spark using Python (PySpark) packages for performance tuning, optimization, and data
quality validations.
Hands-on experience implementing performance tuning of Spark and Hive jobs.
Worked on developing Kafka Producers and Kafka Consumers for streaming millions of events per second
on streaming data.
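An illustrative Kafka producer/consumer pair using the kafka-python client; the broker addresses, topic, and consumer group are assumptions.

    import json
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092", "broker2:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("clickstream-events", {"user_id": 42, "action": "page_view"})
    producer.flush()

    consumer = KafkaConsumer(
        "clickstream-events",
        bootstrap_servers=["broker1:9092", "broker2:9092"],
        group_id="analytics-consumers",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.value)  # downstream processing would happen here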
Led the installation of Qlik Replicate across multiple environments, ensuring seamless integration with existing
data architectures.
Configured the Qlik Replicate Management Console for optimized performance and user access management.
Managed and maintained connections to various source and target systems, including relational databases and
cloud platforms, ensuring secure and efficient data transfer.
Implemented best practices for connection pooling and optimization to enhance data replication speed and
reliability.
Developed and managed replication tasks, including full-load and incremental-load tasks, ensuring data accuracy
and timeliness.
Utilized advanced scheduling features to automate task execution and minimize system resource contention.
Designed and implemented complex data transformation rules to align with business logic and compliance
requirements.
Created and maintained mapping templates to standardize data replication processes across projects.
Conducted performance assessments of data replication tasks, identifying bottlenecks and implementing tuning
strategies to optimize throughput and reduce latency.
Leveraged Qlik Replicate's monitoring tools to analyze replication performance metrics and adjust configurations.
Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka.
Hands on experience on fetching the live stream data from UDB into HBase table using PySpark streaming and
Apache Kafka.
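A sketch of that kind of PySpark structured-streaming read from Kafka; the topic, schema, and the HBase write step (represented here by a generic foreachBatch sink, since the actual HBase connector depends on the cluster setup) are assumptions.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

    schema = StructType().add("record_id", StringType()).add("value", DoubleType())

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "udb-changes")
           .load())

    parsed = raw.select(
        F.from_json(F.col("value").cast("string"), schema).alias("r")
    ).select("r.*")

    def write_batch(batch_df, batch_id):
        # In the real pipeline each micro-batch would be written to the HBase
        # table via the connector configured on the cluster; staged here as files.
        batch_df.write.mode("append").parquet("/tmp/udb_changes_staging")

    query = (parsed.writeStream
             .foreachBatch(write_batch)
             .option("checkpointLocation", "/tmp/chk/udb")
             .start())
    query.awaitTermination()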
Expert in using Databricks with Azure Data Factory (ADF) to compute large volumes of data.
Evaluate Snowflake Design considerations for any change in the application.
Environment: HDFS, Python, SQL, Spark, Scala, Kafka, Hive, Yarn, Sqoop, Snowflake, Tableau, AWS Cloud,
GitHub, Shell Scripting.
Utilized AWS services like EMR, S3, the Glue metastore, and Athena extensively for building data applications.
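A hedged boto3 example of querying S3 data registered in the Glue metastore through Athena; the database, table, and result bucket are placeholders.

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    execution = athena.start_query_execution(
        QueryString="SELECT event_type, COUNT(*) AS n FROM analytics.events GROUP BY event_type",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes, then fetch the result rows.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]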
Worked on building input adapters for data dumps from FTP servers using Apache Spark.
Wrote Spark applications to perform operations like data inspection, cleaning, loading, and transforming large sets of structured and semi-structured data.
Strong understanding of IaC, CI/CD Pipelines, Terraform, CloudFormation, Bitbucket, Bamboo etc.
Setting up the HVR channel in the Linux environment as well as on the Confidential AWS server.
Developed Spark with Scala and Spark-SQL for testing and processing of data.
Reporting of Spark job stats, monitoring, and data quality checks are made available for each dataset.
Environment: AWS Cloud Services, Apache Spark, Spark-SQL, Snowflake, Unix, Kafka, Scala, SQL Server.
Responsibilities:
Collaborated with the client to understand their requirements and formulated a detailed design plan in
conjunction with the team.
Engaged with the client team to validate the design and made necessary modifications based on their feedback
and evolving requirements.
Extracted data from DB2 and transferred it to AWS for subsequent analysis, visualization, and report generation.
Established HBase tables and configured columns to efficiently store user event data.
Utilized Hive and Impala for querying the data stored in HBase, ensuring efficient retrieval and analysis.
Developed and deployed core API services using Scala and Spark, enabling seamless interaction with the data.
Leveraged Spark's capabilities to manage data frames, perform migrations between AWS and MySQL, and
execute complex ETL processes.
Implemented a continuous ETL pipeline using Kafka, Spark Streaming, and HDFS, ensuring smooth data flow and
processing.
Configured Qlik Replicate to streamline data integration between healthcare systems, supporting data-driven
decision-making for patient care and operational efficiency.
Designed replication tasks to handle complex data structures specific to healthcare, including claims data,
provider details, and patient records, ensuring compliance with HIPAA regulations.
Implemented robust data transformation rules to standardize formats across varied healthcare data sources,
improving data usability for reporting and analytics.
Established secure, encrypted data pipelines in Qlik Replicate for transmitting sensitive patient information
between systems, safeguarding against unauthorized access.
Conducted performance tuning on Qlik Replicate tasks for high-volume healthcare data transfers, minimizing
latency and ensuring real-time data availability for critical care systems.
Developed Qlik Replicate monitoring and alerting protocols tailored for healthcare applications, enabling
proactive management of data workflows to prevent disruptions in patient services.
Created and maintained replication mappings for data exchange in healthcare ecosystems, facilitating
interoperability between EHR, claims processing, and reporting systems.
Documented healthcare-specific replication workflows and compliance processes within Qlik Replicate to
support audits and ensure alignment with regulatory standards.
Conducted ETL operations on data sourced from various formats, including JSON, Parquet, and databases,
employing Scala within Spark for complex transformations.
Translated SQL queries into Spark transformations using Spark RDDs and Scala, optimizing data processing
workflows.
Integrated real-time data streams into Hadoop using Kafka and orchestrated the process with Oozie for efficient
job scheduling and execution.
Gathered log data from web servers and stored it in HDFS for subsequent analysis and monitoring.
Workflow Management with Oozie: Configured Oozie workflows to concurrently execute Spark and Pig jobs,
streamlining data processing tasks.
Established Hive tables to organize and store data in a structured format, facilitating easier querying and
analysis
Environment: Spark, Scala, HDFS, SQL, Oozie, SQOOP, Zookeeper, MySQL, HBase.
EDUCATIONAL QUALIFICATIONS:
Bachelor’s in Computer Science Engineering, CR Reddy College, Andhra University, India, 2014.
Master’s in IT Management, Concordia University St. Paul, US, 2023.