Prashanth - Data Engineer
Summary
8+ years of professional experience in information technology as a Data Engineer, with expertise in
Database Development, ETL Development, Data Modelling, Report Development, and Big Data
Technologies.
Experience in Data Integration and Data Warehousing using ETL tools such as Informatica
PowerCenter, AWS Glue, SQL Server Integration Services (SSIS), and Talend.
Experience in Designing Business Intelligence Solutions with Microsoft SQL Server and using MS SQL
Server Integration Services (SSIS), MS SQL Server Reporting Services (SSRS) and SQL Server
Analysis Services (SSAS).
Extensively used Informatica PowerCenter and Informatica Data Quality (IDQ) as ETL tools for extracting,
transforming, loading, and cleansing data from various source inputs to various targets, in batch and
real time.
Experience working with the Amazon Web Services (AWS) cloud and its services, including Snowflake, EC2, S3,
RDS, EMR, VPC, IAM, Elastic Load Balancing, Lambda, Redshift, ElastiCache, Auto Scaling, CloudFront,
CloudWatch, Data Pipeline, DMS, Aurora, and other AWS services.
Strong expertise in relational database systems like Oracle, MS SQL Server, Teradata, MS Access, and
DB2, covering design and database development using SQL, PL/SQL, SQL*Plus, TOAD, and SQL*Loader.
Highly proficient in writing, testing, and implementing triggers, stored procedures, functions,
packages, and cursors using PL/SQL.
Hands-on experience with the AWS Snowflake cloud data warehouse and AWS S3 buckets for integrating
data from multiple source systems, including loading nested JSON-formatted data into Snowflake
tables.
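A minimal sketch of this kind of load, using the Snowflake Python connector; the account, stage, and table names below are hypothetical placeholders:

# Minimal sketch: load nested JSON from an S3 stage into a Snowflake VARIANT column.
# Connection parameters, stage, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()
try:
    # Target table with a single VARIANT column to hold the nested JSON as-is
    cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")
    # COPY from an external stage that points at the S3 bucket
    cur.execute("""
        COPY INTO raw_events
        FROM @my_s3_stage/events/
        FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)
    """)
finally:
    cur.close()
    conn.close()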
Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for
data ingestion and transformation in GCP.
Extensive experience in integration of Informatica Data Quality (IDQ) with Informatica PowerCenter.
Extensive experience in applying Data Mining solutions to various business problems and generating data
visualizations using Tableau, PowerBI, and Alteryx.
Solid knowledge of and experience with the Cloudera ecosystem, including HDFS, Hive, Sqoop, HBase, and Kafka,
plus data pipelines, data analysis, and processing with Hive SQL, Impala, Spark, and Spark SQL.
Worked with different scheduling tools like Talend Administrator Console (TAC), UC4/Automic, Tidal,
Control-M, Autosys, crontab, and TWS (Tivoli Workload Scheduler).
Experienced in design, development, unit testing, integration, debugging, implementation, and
production support, as well as client interaction and understanding business applications, business data flow,
and data relations.
Used Flume, Kafka, and Spark Streaming to ingest real-time or near-real-time data into HDFS; analyzed
data and provided insights with Python Pandas.
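A minimal PySpark Structured Streaming sketch of this kind of Kafka-to-HDFS ingestion; the broker, topic, and HDFS paths are hypothetical:

# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and land it on HDFS.
# Requires the spark-sql-kafka connector package; broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "clickstream")
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers key/value as binary; cast the value to a string before writing
events = raw.selectExpr("CAST(value AS STRING) AS json_event", "timestamp")

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/clickstream/")
         .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()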
Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
Worked on Data Migration from Teradata to AWS Snowflake Environment using Python and BI tools
like Alteryx.
Experience in moving data between GCP and Azure using Azure Data Factory.
Developed Python scripts to parse flat files, CSV, XML, and JSON files, extract the data from
various sources, and load it into the data warehouse.
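A minimal sketch of such parsing logic with Pandas; the file paths and XML record tag are hypothetical:

# Minimal sketch: normalize CSV, JSON, and XML inputs into DataFrames before staging
# them for the warehouse load. File paths and the record tag are placeholders.
import json
import xml.etree.ElementTree as ET

import pandas as pd

def read_csv(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def read_json(path: str) -> pd.DataFrame:
    with open(path) as fh:
        records = json.load(fh)          # expects a list of objects
    return pd.json_normalize(records)    # flatten nested structures

def read_xml(path: str, record_tag: str) -> pd.DataFrame:
    root = ET.parse(path).getroot()
    rows = [{child.tag: child.text for child in rec} for rec in root.iter(record_tag)]
    return pd.DataFrame(rows)

if __name__ == "__main__":
    frames = [
        read_csv("input/orders.csv"),
        read_json("input/orders.json"),
        read_xml("input/orders.xml", record_tag="order"),
    ]
    staged = pd.concat(frames, ignore_index=True)
    staged.to_parquet("staging/orders.parquet")   # handed off to the warehouse loader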
Developed Automated scripts to do the migration using Unix shell scripting, Python, Oracle/TD SQL,
TD Macros and Procedures.
Good knowledge of NoSQL databases like HBase and Cassandra.
Expert-level mastery in designing and developing complex mappings to extract data from diverse
sources including RDBMS tables, legacy system files, XML files, Applications, COBOL Sources &
Teradata.
Worked on JIRA for defect/issues logging & tracking and documented all my work using
CONFLUENCE.
Experience with ETL workflow management tools like Apache Airflow, with significant experience
writing Python scripts to implement the workflows.
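A minimal Airflow DAG sketch of this kind of Python-driven workflow; the task bodies and schedule are placeholders:

# Minimal sketch of an Airflow DAG wiring extract/transform/load Python callables.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def transform():
    print("apply business rules")

def load():
    print("write to the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load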
Experience in identifying Bottlenecks in ETL Processes and Performance tuning of the production
applications using Database Tuning, Partitioning, Index Usage, Aggregate Tables, Session partitioning,
Load strategies, commit intervals and transformation tuning.
Worked on performance tuning of user queries by analyzing explain plans, recreating user driver
tables with the right Primary Index, scheduling collection of statistics, and using secondary and various join indexes.
Experience with scripting languages like PowerShell, Perl, Shell, etc.
Expert knowledge and experience in dimensional modelling (Star schema, Snowflake schema),
transactional modelling, and SCDs (Slowly Changing Dimensions).
Created clusters in Google Cloud and managed them using Kubernetes (k8s).
Used Jenkins to deploy code to Google Cloud, create new namespaces, and build Docker images and
push them to the Google Cloud container registry.
Excellent interpersonal and communication skills, experienced in working with senior level managers,
businesspeople and developers across multiple disciplines.
Strong problem-solving and analytical skills, with the ability to work both independently and as part of a team.
Highly enthusiastic, self-motivated, and quick to assimilate new concepts and technologies.
Skills
Informatica Power Center 10.x/9.6/9.1, AWS Glue, Talend 5.6, Teradata 15/14, Oracle 11g/10g, SQL Assistant,
Erwin 8/9, ER Studio
Cloud Environment: AWS Snowflake, AWS RDS, AWS Aurora, Redshift, EC2, EMR, S3, Lambda, Glue, Data
Pipeline, Athena, Data Migration Services, SQS, SNS, ELB, VPC, EBS, RDS, Route53, CloudWatch, AWS
Auto Scaling, Git, AWS CLI, Jenkins, Microsoft Azure, Google Cloud Platform (GCP)
Reporting Tools: Tableau, PowerBI
Big Data Ecosystem: HDFS, MapReduce, Hive/Impala, Pig, Sqoop, HBase, Spark, Scala, Kafka
Programming languages: Unix Shell Scripting, SQL, PL/SQL, Perl, Python, T-SQL
Data Warehousing & BI: Star Schema, Snowflake schema, Facts and Dimensions tables, SAS, SSIS, Splunk
Experience
Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for
data ingestion and transformation.
Developed Data Pipeline with Kafka and Spark.
Developed Informatica design mappings using various transformations.
Used AWS Lambda to perform data validation, filtering, sorting, or other transformations for every data
change in a DynamoDB table and load the transformed data into another data store.
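A minimal sketch of such a Lambda handler driven by a DynamoDB Stream; the target table name and validation rule are hypothetical:

# Minimal sketch of a Lambda handler fired by a DynamoDB Stream: validate/reshape each
# changed item and write it to a second table. Table name and fields are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb")
target = dynamodb.Table("transformed_items")

def handler(event, context):
    for record in event.get("Records", []):
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"]["NewImage"]
        # DynamoDB Stream images are typed ({"S": "..."}), so unwrap the values we need
        item = {
            "id": new_image["id"]["S"],
            "amount": float(new_image["amount"]["N"]),
        }
        if item["amount"] < 0:           # simple validation rule
            continue
        target.put_item(Item=item)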
Performed data analysis and design; created and maintained large, complex logical and physical data
models and metadata repositories using Erwin and MB MDR.
Developed Python scripts to parse XML, JSON files and load the data in AWS Snowflake Data
warehouse.
Programmed ETL functions between Oracle and Amazon Redshift.
Used a Kafka producer to ingest the raw data into Kafka topics and ran the Spark Streaming app to process
clickstream events.
Performed data analysis and predictive data modeling.
Explored clickstream event data with Spark SQL.
Architected and carried out hands-on production implementation of a big data MapR Hadoop solution for
Digital Media Marketing using telecom data, shipment data, Point of Sale (POS), exposure, and
advertising data related to Consumer Product Goods.
Prepared ETL design document which consists of the database structure, change data capture, Error
handling, restart and refresh strategies.
Used Spark SQL as part of the Apache Spark big data framework to process structured data: Shipment, POS,
Consumer, Household, Individual digital impressions, and Household TV impressions.
Created Data Frames from different data sources like Existing RDDs, Structured data, JSON Datasets,
Hive tables, External databases.
Loaded terabytes of raw data at different levels into Spark RDDs for computation to generate the output
response.
Led a major new initiative focused on Media Analytics and Forecasting to deliver the sales lift
associated with customer marketing campaign initiatives.
Worked on an end-to-end machine learning workflow: wrote Python code for gathering data from
AWS Snowflake, data preprocessing, feature extraction, feature engineering, modeling, model evaluation,
and deployment. Wrote Python code for exploratory data analysis using Scikit-learn.
Developed various machine learning models such as Logistic regression, KNN, and Gradient Boosting
with Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python.
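A minimal Scikit-learn sketch comparing the model families listed above; the feature file and label column are hypothetical:

# Minimal sketch: fit and compare the model families mentioned above on a feature matrix
# already pulled from the warehouse. The data file and label column are placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_parquet("features.parquet")          # hypothetical feature extract
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=15),
    "gradient_boosting": GradientBoostingClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC={auc:.3f}")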
Responsibility includes platform specification and redesign of load processes as well as projections of
future platform growth.
Coordinating the QA, PROD environments deployments.
Used Python to automate Hive and to read configuration files.
Involved in Spark for fast processing of data. Used both Spark Shell and Spark Standalone cluster.
Using Hive to analyze the partitioned data and compute various metrics for reporting.
Environment: MapReduce, HDFS, Hive, Python, Scala, Kafka, Spark, Spark SQL, Oracle, Informatica 9.6,
SQL, MapR, Sqoop, Zookeeper, AWS EMR, AWS S3, Data Pipeline, Jenkins, GIT, JIRA, Unix/Linux, Agile
Methodology, Scrum.
Performed Informatica Cloud Services and Informatica PowerCenter administration, defined ETL strategies,
and built Informatica ETL mappings.
Set up the Secure Agent and connected different applications and their data connectors for processing
different kinds of data, including unstructured (logs, clickstreams, shares, likes, topics, etc.), semi-structured
(XML, JSON), and structured (RDBMS) data.
Worked extensively with AWS services like EC2, S3, VPC, ELB, Auto Scaling Groups, Route 53, IAM,
CloudTrail, CloudWatch, CloudFormation, CloudFront, SNS, and RDS.
Developed Python scripts to parse XML, Json files and load the data in AWS Snowflake Data
warehouse.
Moved data from HDFS to Azure SQL Data Warehouse by building ETL pipelines; worked on
various methods including data fusion and machine learning, and improved the accuracy of distinguishing the
right rules from potential rules.
Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like
S3 (Parquet/text files) into AWS Redshift.
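A minimal AWS Glue PySpark job sketch for this kind of S3-to-Redshift migration; the bucket, Glue connection, and table names are hypothetical:

# Minimal sketch of a Glue PySpark job reading campaign files from S3 and writing to Redshift.
# The bucket, Glue connection, database, and table names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read Parquet campaign extracts dropped in S3
campaigns = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/campaigns/"]},
    format="parquet",
)

# Write to Redshift through a Glue connection, staging via S3
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=campaigns,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "stage.campaigns", "database": "analytics"},
    redshift_tmp_dir="s3://my-bucket/temp/",
)

job.commit()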
Built a program with Python and Apache Beam and executed it on Cloud Dataflow to run data validation
between the raw source and BigQuery tables.
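A minimal Apache Beam sketch of such a row-count validation run on Cloud Dataflow; the project, bucket, and table names are hypothetical:

# Minimal sketch: a Beam pipeline that counts rows in the raw source and in the BigQuery
# table so the two can be reconciled downstream. Project, bucket, and table are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp/",
)

with beam.Pipeline(options=options) as p:
    raw_count = (
        p
        | "ReadRaw" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv", skip_header_lines=1)
        | "CountRaw" >> beam.combiners.Count.Globally()
    )
    bq_count = (
        p
        | "ReadBQ" >> beam.io.ReadFromBigQuery(
            query="SELECT 1 FROM `my-gcp-project.dataset.table`", use_standard_sql=True)
        | "CountBQ" >> beam.combiners.Count.Globally()
    )
    # Log both counts; a downstream check compares them
    _ = raw_count | "LogRaw" >> beam.Map(lambda n: print(f"raw rows: {n}"))
    _ = bq_count | "LogBQ" >> beam.Map(lambda n: print(f"bq rows: {n}"))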
Strong background in Data Warehousing, Business Intelligence, and ETL processes (Informatica, AWS
Glue), with expertise in working on and analyzing large data sets.
Built a Scala- and Spark-based configurable framework to connect to common data sources like
MySQL, Oracle, Postgres, SQL Server, Salesforce, and BigQuery and load the data into BigQuery.
Extensive knowledge and hands-on experience implementing PaaS, IaaS, and SaaS delivery models
inside the enterprise (data center) and in public clouds using AWS, Google Cloud, Kubernetes, etc.
Designed, developed, validated, and deployed the Talend ETL processes for the Data Warehouse team
using Pig and Hive.
Applied required transformation using AWS Glue and loaded data back to Redshift and S3.
Extensively worked on making REST API (application program interface) calls to get the data as JSON
response and parse it.
Experience in analyzing and writing SQL queries to extract data in JSON format through REST API
calls with API keys, admin keys, and query keys, and loading the data into the data warehouse.
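A minimal sketch of pulling paginated JSON through a keyed REST API with the requests library; the endpoint, header name, and pagination fields are hypothetical:

# Minimal sketch: page through a REST API with an API key, collect the JSON response,
# and flatten it for the warehouse load. Endpoint and header names are placeholders.
import pandas as pd
import requests

BASE_URL = "https://api.example.com/v1/orders"   # hypothetical endpoint
HEADERS = {"api-key": "********"}                # hypothetical API-key header

def fetch_all() -> pd.DataFrame:
    rows, page = [], 1
    while True:
        resp = requests.get(BASE_URL, headers=HEADERS, params={"page": page}, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        rows.extend(payload.get("results", []))
        if not payload.get("next"):              # no more pages
            break
        page += 1
    return pd.json_normalize(rows)

if __name__ == "__main__":
    df = fetch_all()
    df.to_csv("staging/orders_api.csv", index=False)   # staged for the warehouse loader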
Extensively worked on Informatica tools like Source Analyzer, Mapping Designer, Workflow Manager,
Workflow Monitor, Mapplets, Worklets, and Repository Manager.
Building data pipeline ETLs for data movement to S3, then to Redshift.
Designed and implemented ETL pipelines between various relational databases and the Data Warehouse
using Apache Airflow.
Worked on Data Extraction, aggregation and consolidation of Adobe data within AWS Glue using
PySpark.
Developed SSIS packages to Extract, Transform and Load ETL data into the SQL Server database from
legacy mainframe data sources.
Worked on building data pipelines in Airflow on GCP for ETL-related jobs using different Airflow
operators.
Worked on Postman using HTTP requests to GET the data from RESTful API and validate the API calls.
Hands-on experience with Informatica PowerCenter and PowerExchange in integrating with different
applications and relational databases.
Prepared dashboards using Tableau for summarizing Configuration, Quotes, Orders and other e-
commerce data.
Developed the Pyspark code for AWS Glue jobs and for EMR.
Created custom T-SQL procedures to read data from flat files and load it into the SQL Server database
using the SQL Server Import and Export Data Wizard.
Designed and architected the various layers of a Data Lake.
Developed ETL python scripts for ingestion pipelines which run on AWS infrastructure setup of EMR,
S3, Redshift and Lambda.
Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver for all environments.
Configured EC2 instances, IAM users, and IAM roles, and created an S3 data pipe using the Boto API
to load data from internal data sources.
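A minimal boto3 sketch of loading files from an internal source into S3; the bucket and directory are hypothetical:

# Minimal sketch: push extracts from an internal file share to S3 with boto3.
# Bucket and paths are placeholders.
from pathlib import Path

import boto3

s3 = boto3.client("s3")
BUCKET = "internal-landing-zone"          # hypothetical bucket

def upload_directory(local_dir: str, prefix: str) -> None:
    for path in Path(local_dir).rglob("*.csv"):
        key = f"{prefix}/{path.name}"
        s3.upload_file(str(path), BUCKET, key)
        print(f"uploaded {path} -> s3://{BUCKET}/{key}")

if __name__ == "__main__":
    upload_directory("/data/exports", "daily")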
Hands-on experience with Alteryx software for ETL, data preparation for EDA and performing spatial
and predictive analytics.
Provided Best Practice document for Docker, Jenkins, Puppet and GIT.
Expertise in implementing a DevOps culture through CI/CD tools like Repos, CodeDeploy, CodePipeline,
and GitHub.
Installed and configured the Splunk Enterprise environment on Linux; configured Universal and Heavy
Forwarders.
Developed various Shell Scripts for scheduling various data cleansing scripts and loading process and
maintained the batch processes using Unix Shell Scripts.
Backed up AWS Postgres to S3 in a daily job run on EMR using DataFrames.
Developed a server-based web traffic statistical analysis tool using RESTful APIs, Flask, and Pandas.
Analyzed various types of raw files, such as JSON, CSV, and XML, with Python using Pandas, NumPy, etc.
Environment: Informatica Power Center 10.x/9.x, IDQ, AWS Redshift, Snowflake, S3, Postgres, Google
Cloud Platform (GCP), MS SQL Server, BigQuery, Salesforce, SQL, Python, Postman, Tableau, Unix Shell
Scripting, EMR, GitHub.
Developed and maintained a robust data pipeline architecture for optimal data processing.
Loaded data from internal servers and the Snowflake data warehouse into Amazon S3 buckets.
Designed and implemented an ETL system to extract, transform, and load data from multiple sources.
Configured Amazon EC2 instances for application-specific needs using AWS (Linux/Ubuntu).
Migrated data from Snowflake to Amazon S3 for TMCOMP/ESD feeds.
Utilized Amazon EMR for big data processing on a Hadoop cluster of EC2 instances, S3, and Redshift.
Supported AWS continuous storage solutions using Elastic Block Storage, S3, and Glacier; configured
EC2 volumes and snapshots.
Automated table creation in S3 using AWS Lambda and AWS Glue with Python and PySpark.
Leveraged the Oozie workflow engine to automate Hive and Python jobs for hourly data ingestion.
Wrote JSON-based data pipeline definitions for production use.
Used AWS Athena extensively to import structured data from S3 into Redshift and other systems, with
Spark-Streaming APIs for real-time data transformations from Kinesis.
Created Snowflake views for data loading/unloading to and from S3 and managed code deployments.
Modeled data in Snowflake with expertise in data warehousing techniques like data cleansing, Slowly
Changing Dimensions, surrogate keys, and change data capture.
Consulted on Snowflake data platform architecture, design, and deployment to establish a data-driven
culture within enterprises.
Developed Python and SQL-based ETL processes, converting and loading data into CSV files.
Using dbt (Data Build Tool), created SQL queries for data transformations to generate models in
Snowflake.
Built ETL workflows in Snowflake Data Factory to manage and store data from various sources.
Automated workflows in Airflow for scheduling tasks with Python.
Developed data integration applications for traditional and NoSQL data sources in Hadoop and RDBMS
environments, using Spark’s in-memory computing for advanced analytics.
Analyzed Hive data using Spark API on EMR Hadoop YARN, enhancing algorithms with Spark-SQL,
Data Frames, and Pair RDDs.
Provided production support for EMR, troubleshooting memory and Spark job issues.
Developed AWS Lambda functions to monitor EMR cluster statuses and job completion.
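A minimal sketch of such a monitoring Lambda using boto3; the cluster ID and SNS topic are hypothetical:

# Minimal sketch of a Lambda that checks an EMR cluster's state and recent step results,
# publishing an alert if anything failed. Cluster ID and SNS topic are placeholders.
import boto3

emr = boto3.client("emr")
sns = boto3.client("sns")
CLUSTER_ID = "j-XXXXXXXXXXXXX"                                # hypothetical cluster id
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:emr-alerts"   # hypothetical topic

def handler(event, context):
    cluster = emr.describe_cluster(ClusterId=CLUSTER_ID)["Cluster"]
    state = cluster["Status"]["State"]

    failed_steps = [
        s["Name"]
        for s in emr.list_steps(ClusterId=CLUSTER_ID)["Steps"]
        if s["Status"]["State"] == "FAILED"
    ]

    if state == "TERMINATED_WITH_ERRORS" or failed_steps:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="EMR job alert",
            Message=f"cluster state={state}, failed steps={failed_steps}",
        )
    return {"state": state, "failed_steps": failed_steps}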
Created and managed Hive tables, performing data loading and analysis with HiveQL.
Designed Jenkins jobs for CI/CD pipelines, integrating and executing processes.
Built applications in a data lake to transform data for business analytics.
Conducted exploratory data analysis and visualization using Python libraries such as Matplotlib,
NumPy, Pandas, and Seaborn.
Created and maintained Tableau reports to display the status and performance of deployed models and
algorithms.
Environment: AWS S3, EC2, EMR, Redshift, Snowflake, Data Build Tool, Hadoop YARN, SQL Server,
Spark, Spark Streaming, Scala, Kinesis, Python, Hive, Linux, Sqoop, Tableau, Cassandra, Oozie,
Control-M, RDS, DynamoDB, Oracle 12c.
Wipro | Hyderabad
Data Engineer | June 2018 – Nov 2019
Involved in full Software Development Life Cycle (SDLC) - Business Requirements, Analysis,
preparation of Technical Design documents, Data Analysis, Logical and Physical database design,
Coding, Testing, Implementing, and deploying to business users.
Developed complex mappings using Informatica Power Center Designer to transform and load the data
from various source systems like Oracle, Teradata, and Sybase into the target database.
Analyzed source data coming from different sources like SQL Server tables, XML files, and flat files, then
transformed it according to business rules using Informatica and loaded the data into target tables.
Designed and developed a number of complex mappings using various transformations like Source
Qualifier, Aggregator, Router, Joiner, Union, Expression, Lookup, Filter, Update Strategy, Stored
Procedure, Sequence Generator, etc.
Involved in creating the Tables in Greenplum and loading the data through Alteryx for Global Audit
Tracker.
Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, PIG, Sqoop and Zookeeper.
Data Extraction, aggregations, and consolidation of Adobe data within AWS Glue using PySpark.
Developed Python scripts to automate the ETL process using Apache Airflow, as well as CRON scripts on the
UNIX operating system.
Worked on Google Cloud Platform (GCP) services like compute engine, cloud load balancing, cloud
storage and cloud SQL.
Developed data engineering and ETL python scripts for ingestion pipelines which run on AWS
infrastructure setup of EMR, S3, Glue and Lambda.
Changed existing data models using Erwin for enhancements to the existing data warehouse
projects.
Used Talend connectors integrated to Redshift - BI Development for multiple technical projects running
in parallel.
Performed query optimization with the help of explain plans, collect statistics, and Primary and
Secondary indexes.
Used volatile table and derived queries for breaking up complex queries into simpler queries.
Streamlined the scripts and shell scripts migration process on the UNIX box.
Used Google Cloud Functions with Python to load data into BigQuery for CSV files arriving in a GCS bucket.
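A minimal sketch of such a Cloud Function triggered on GCS object arrival; the destination table is hypothetical:

# Minimal sketch of a background Cloud Function triggered on GCS object finalize that loads
# the arriving CSV into BigQuery. The destination dataset/table is a placeholder.
from google.cloud import bigquery

TABLE_ID = "my-gcp-project.staging.arrivals"     # hypothetical destination table

def load_csv_to_bq(event, context):
    """Cloud Function entry point; `event` describes the new GCS object."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    if not event["name"].endswith(".csv"):
        return

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()    # wait for completion so failures surface in the function logs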
Created an iterative macro in Alteryx to send JSON requests to a web service, download the JSON responses,
and analyze the response data.
Migrated data from Transactional source systems to Redshift data warehouse using spark and AWS
EMR.
Experience in Google Cloud components, Google container builders and GCP client libraries.
Supported various business teams with Data Mining and Reporting by writing complex SQLs using
Basic and Advanced SQL including OLAP functions like Ranking, partitioning and windowing
functions, Etc.
Expertise in writing scripts for data extraction, transformation, and loading of data from legacy
systems to the target data warehouse using BTEQ, FastLoad, MultiLoad, and TPump.
Worked with EMR, S3, and EC2 services in the AWS cloud and migrated servers, databases, and
applications from on-premise to AWS.
Tuned SQL queries using Explain, analyzing the data distribution among AMPs and index usage, collecting
statistics, defining indexes, revising correlated subqueries, using hash functions, etc.
Developed shell scripts for job automation that generate a log file for every job. Extensively
used Spark SQL and the DataFrames API in building Spark applications.
Wrote complex SQL using joins, subqueries, and correlated subqueries.
Expertise in SQL Queries for cross verification of data.
Extensively worked on performance tuning of Informatica and IDQ mappings.
Created, maintained, supported, repaired, and customized system and Splunk applications, search queries,
and dashboards.
Environment: MS SQL Server 2016, ETL, SSIS, SSRS, Cassandra, AWS Redshift, AWS S3, Oracle 12c,
Oracle Enterprise Linux, Teradata, Jenkins, PowerBI, Autosys, Unix Shell Scripting.
Involved in gathering business requirements, logical modeling, physical database design, data sourcing
and data transformation, data loading, SQL, and performance tuning.
Used SSIS to populate data from various data sources, creating packages for different data loading
operations for applications.
Created various types of reports such as complex drill-down reports, drill through reports, parameterized
reports, matrix reports, Sub reports, non-parameterized reports and charts using reporting services based
on relational and OLAP databases.
Experience in developing Spark applications using Spark SQL in Databricks for data extraction,
transformation, and aggregation from multiple file formats.
Extracted data from various sources like SQL Server 2016, CSV, Microsoft Excel, and text files from
client servers.
Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS
Redshift.
Built S3 buckets and managed policies for S3 buckets and used S3 bucket and Glacier for storage and
backup on AWS.
Developed Spark scripts using Python on AWS EMR for data aggregation, validation, and ad-hoc
querying. Performed data analytics on the Data Lake using PySpark on the Databricks platform.
Involved in creation/review of functional requirement specifications and supporting documents for
business systems, experience in database design process and data modeling process.
Designed and documented the entire Architecture of Power BI POC.
Implementation and delivery of MSBI platform solutions to develop and deploy ETL, analytical,
reporting and scorecard / dashboards on SQL Server using SSIS, SSRS.
Extensively worked with SSIS tool suite, designed and created mapping using various SSIS
transformations like OLEDB command, Conditional Split, Lookup, Aggregator, Multicast and Derived
Column.
Scheduled and executed SSIS Packages using SQL Server Agent and Development of automated daily,
weekly and monthly system maintenance tasks such as database backup, Database Integrity verification,
indexing and statistics updates.
Designed and developed new Power BI solutions and migrated reports from SSRS.
Developed and executed a migration strategy to move Data Warehouse from Greenplum to Oracle
platform.
Loaded data into Amazon Redshift and used AWS CloudWatch to collect and monitor AWS RDS instances
within Confidential.
Worked extensively on SQL, PL/SQL, and UNIX shell scripting.
Expertise in creating PL/ SQL Procedures, Functions, Triggers and cursors.
Loaded data into NoSQL databases (HBase, Cassandra). Expert-level knowledge of complex SQL using
Teradata functions, macros, and stored procedures. Developed under Scrum methodology and in a
CI/CD environment using Jenkins.
Developed UNIX shell scripts to run batch jobs in Autosys and loads into production.
Participated in the architecture council for database architecture recommendations.
Utilized Unix shell scripts to add the header to the flat file targets.
Used the Teradata utilities FastLoad, MultiLoad, and TPump to load data.
Preparation of the Test Cases and involvement in Unit Testing and System Integration Testing.
Performed deep analysis of SQL execution plans and recommended hints, restructuring, new indexes, or
materialized views for better performance.
Deployed EC2 instances for Oracle databases.
Utilized Power Query in Power BI to Pivot and Un-pivot the data model for data cleansing.
Environment: MS SQL Server 2016, ETL, SSIS, SSRS, SSMS, Cassandra, AWS Redshift, AWS S3, Oracle
12c, Oracle Enterprise Linux, Teradata, Databricks, Jenkins, PowerBI, Autosys, Unix Shell Scripting.