Venkata Sai (Sr. GCP Data Engineer)
[email protected]
(985) 402-1710
Sr. GCP DEVELOPER / DATA ENGINEER / AZURE / AWS / BIG DATA / ETL / HADOOP
PROFESSIONAL SUMMARY:
Results-oriented IT professional with 10+ years of expertise spanning Business Intelligence/Data Analytics
Solutions Architecture, Data Warehouse Development, ETL Design & Development, Migration and Cloud Data
Analytics. Proven track record in Business & System Analysis, Data Quality and Governance, and Test Case
Development, showcasing strong Team Management capabilities.
Extensive experience in Big Data Engineering and Cloud Data Solutions, particularly Hadoop with hands-on skills in
HDFS, Hive, Pig, HBase, Sqoop, and Kafka. Proficient in Machine Learning and Data Mining for both Structured and
Unstructured Data, with a deep understanding of Predictive Modeling and Data Acquisition. Known for delivering
projects within budget and timeline, effectively managing large-scale initiatives with a focus on Data Quality and
Governance.
Hands-on expertise in Data Warehouse Architecture including Star and Snowflake Schema Design and OLTP/OLAP
Analysis. Skilled in developing and implementing complex Hadoop Ecosystems, leveraging tools like MapReduce,
Impala, MongoDB, Oozie, and Spark Streaming. Proficient in Spark SQL, Spark Core, DataFrame API, and RDD
architectures for real-time data processing. Experienced with Airflow for scheduling complex workflows, Data
Pipelines, and ETL orchestration.
Advanced capabilities in Data Integration, Data Transformation, Data Mapping, and Data Cleansing with an in-depth
command over SQL and Python, including libraries like NumPy, SciPy, Pandas, and Scikit-Learn. Skilled in Azure Data
Lake, Databricks, and AWS (EC2, S3, DynamoDB, Redshift) for big data solutions, with strong experience in Data
Migration to cloud ecosystems, including Snowflake.
Proficient with Machine Learning Models and Statistical Analysis, as well as Text Analytics and Data Visualization tools
such as Tableau and Power BI. Adept at building and automating data workflows using Kafka, Apache Spark, and
Streaming Analytics. Expertise in Big Data Ingestion Tools like Flume and Sqoop.
Highly skilled in ETL Tools (Informatica PowerCenter, AWS Glue, Talend, SQL Server Integration Services), with a solid
grounding in Data Warehousing, Data Mart and Data Modeling. Extensive hands-on experience in Hadoop
Architecture, MapReduce Programs, Distributed Data Storage Solutions, and Software Development Lifecycle (SDLC)
methodologies, including Agile and Scrum.
Known for strong problem-solving, analytical, and organizational skills, with a deep understanding of Business Logic and
Workflow Implementation in Distributed Application Environments. Collaborative, self-motivated, and adaptable to
evolving technologies, bringing a robust knowledge of Data Engineering, Big Data Technologies, ETL Development,
and Machine Learning for comprehensive data solutions.
TECHNICAL SKILLS:
Project & Team Leadership: Project Management, Strategic Planning, Team Leadership, and Agile Methodologies
Data Management & Analytics: Business Analysis, Data Architecture, Data Modeling, Business Intelligence, Data
Warehouse Development, Data Governance, Master Data Management, Data Profiling, Data Migration, and Data
Standardization
Process Optimization: ETL/ELT Processes, Process Re-engineering, DevOps (CI/CD), Technical Documentation, and User
Documentation
Data Engineering & Cloud Computing: Big Data Analysis, Cloud Platforms (Azure, AWS, Google Cloud Platform), Data
Lake, and Data Pipeline
Programming & Scripting: SQL, PL/SQL, T-SQL, Unix Shell Scripting, Python (Pandas), Scala, and Perl
Big Data Tools & Frameworks: Hadoop, HDFS, Hive, HiveQL, MapReduce, Spark, Sqoop, Kafka, and Impala
ETL & Integration Tools: Informatica PowerCenter, Talend, AWS Glue, Azure Data Factory, SSIS, and Jupyter Notebooks
Database Platforms: SQL Server, Oracle, DB2, Teradata, Netezza, AWS RDS, AWS Redshift, Snowflake, and Azure
HDInsight
Business Intelligence & Reporting: Tableau, Power BI, QlikView, Crystal Reports, and SSRS
Data Modeling & Warehousing: Star Schema, Snowflake Schema, OLAP, Cubes, Facts and Dimensions, SAS, SSAS, and
Splunk
Version Control & Collaboration: Git, Bitbucket, TFS, JIRA, Confluence, and Azure DevOps
Additional Tools & Platforms: EBX-TIBCO, NAS Server, Jenkins, AWS CLI, Erwin, Visual Studio, SharePoint, and Microsoft
Visio
PROFESSIONAL EXPERIENCE:
Shift4 Payments, MD June 2023 to Present
Role: Sr. GCP Developer / Lead Data Engineer
Responsibilities:
Worked as a Data Engineer to review business requirements and create source-to-target data mapping documents.
Participated actively in Agile development methodology as a Scrum team member.
Engaged in Data Profiling and merged data from multiple sources.
Performed Big Data requirement analysis and developed solutions for ETL and Business Intelligence platforms.
Designed 3NF data models for ODS and OLTP systems, as well as dimensional models using Star and Snowflake
Schemas.
Worked on the Snowflake environment, managing real-time data loading into HDFS via Kafka.
Developed a data warehouse model in Snowflake for over 100 datasets.
Designed and implemented large-scale data solutions on Snowflake Data Warehouse.
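A minimal sketch of this kind of Snowflake load using the Python connector; the account, warehouse, table, and stage names below are illustrative placeholders, not actual project objects.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account",   # placeholder account locator
        user="etl_user",             # placeholder credentials
        password="***",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="STAGING",
    )
    cur = conn.cursor()
    try:
        # Create one target table, then bulk-load it from a named stage.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS ORDERS_RAW (
                ORDER_ID NUMBER,
                ORDER_TS TIMESTAMP_NTZ,
                AMOUNT   NUMBER(12, 2)
            )
        """)
        cur.execute("""
            COPY INTO ORDERS_RAW
            FROM @ORDERS_STAGE
            FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
        """)
    finally:
        cur.close()
        conn.close()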
Managed structured and semi-structured data ingestion and processing on AWS using S3 and Python; migrated on-
premises Big Data workloads to AWS.
Designed data aggregations on Hive for ETL processing on Amazon EMR.
Migrated data from RDBMS to Hadoop using Sqoop for performance evaluations.
Implemented Data Validation using MapReduce for data quality checks before loading into Hive tables.
Developed Hive tables and queries for data processing, generating data cubes for visualization.
Extracted data from HDFS using Hive and Presto, analyzed data using Spark with Scala and PySpark, and created
nonparametric models in Spark.
Handled data import and transformation using Hive, MapReduce, and loading into HDFS.
Configured and used Kafka clusters for real-time data processing with Spark Streaming, persisting RDDs in Parquet
format on HDFS.
Implemented Kafka-Spark pipelines, including the use of Kafka brokers for high-throughput message processing.
Developed Spark Streaming solutions for reading data from Kafka and applying Change Data Capture (CDC) before
loading into Hive.
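Illustrative sketch of such a Kafka-to-HDFS Structured Streaming job with a simple CDC-style filter; the broker, topic, event schema, and HDFS paths are assumptions made for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-cdc-to-hdfs").getOrCreate()

    event_schema = StructType([
        StructField("op", StringType()),          # assumed I/U/D change indicator
        StructField("order_id", StringType()),
        StructField("updated_at", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
           .option("subscribe", "orders_cdc")                   # placeholder topic
           .load())

    changes = (raw.selectExpr("CAST(value AS STRING) AS json")
               .select(F.from_json("json", event_schema).alias("e"))
               .select("e.*")
               .filter(F.col("op").isin("I", "U")))             # keep inserts/updates

    query = (changes.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/orders_cdc")          # placeholder path
             .option("checkpointLocation", "hdfs:///chk/orders_cdc")
             .start())
    query.awaitTermination()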
Integrated AWS Kinesis with Kafka clusters for event log data aggregation and analysis.
Created and managed StreamSets pipelines for event log processing using Spark Streaming.
Automated ETL processes with Python scripts using Apache Airflow and CRON.
Utilized Apache Airflow and Genie for job automation on EMR and AWS S3.
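A minimal Airflow DAG sketch of the scheduling pattern described above; the DAG id, schedule, and task callables are placeholders rather than the actual jobs.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull source data")   # placeholder for the real extraction step

    def load():
        print("load to warehouse")  # placeholder for the real load step

    with DAG(
        dag_id="daily_etl_example",            # placeholder DAG id
        start_date=datetime(2023, 6, 1),
        schedule_interval="0 2 * * *",         # nightly run, similar to a CRON entry
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_load = PythonOperator(task_id="load", python_callable=load)
        t_extract >> t_load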
Developed Databricks notebooks using SQL and Python, configuring high concurrency clusters on Azure.
Managed Azure data product migration from Oracle to Azure Databricks.
Utilized Apache Spark, MapReduce, and Hadoop ecosystem tools on HDInsight for analytics.
Processed data on Azure with Data Factory, Spark SQL, and U-SQL in Data Lake and SQL DW.
Coordinated with Data Governance, Data Quality, and Data Architecture teams.
Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.
Built machine learning models in Python with Spark ML, MLlib, Scikit-learn, NLTK, and Pandas.
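Hedged example of a Spark ML training pipeline of this type; the input table and feature/label columns are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()
    df = spark.table("features.customer_training")        # placeholder Hive table

    assembler = VectorAssembler(
        inputCols=["recency", "frequency", "monetary"],    # assumed feature columns
        outputCol="features",
    )
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[assembler, lr]).fit(train)
    model.transform(test).select("label", "prediction").show(5)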
Created Oozie workflows and maintained effective client and business communications.
Developed ETL processes in AWS Glue for data ingestion into Redshift.
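Illustrative AWS Glue job skeleton (PySpark) for catalog-to-Redshift ingestion; the catalog database, table, Glue connection, and S3 temp directory are placeholders.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from the Glue Data Catalog (placeholder database/table).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="orders"
    )

    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("order_id", "string", "order_id", "string"),
                  ("amount", "double", "amount", "double")],
    )

    # Write to Redshift through a catalog connection (placeholder names).
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection="redshift-conn",
        connection_options={"dbtable": "public.orders", "database": "analytics"},
        redshift_tmp_dir="s3://example-bucket/tmp/",
    )
    job.commit()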
Built data validation frameworks in Google Cloud Dataflow with Python.
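A compact sketch of a Beam validation step runnable on Dataflow; the required-field rules and bucket paths are invented for the example.

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    REQUIRED_FIELDS = ("order_id", "amount", "order_ts")   # assumed schema rules

    def validate(record):
        # Tag each JSON record as valid or invalid based on required fields.
        row = json.loads(record)
        ok = all(row.get(f) not in (None, "") for f in REQUIRED_FIELDS)
        yield beam.pvalue.TaggedOutput("valid" if ok else "invalid", row)

    with beam.Pipeline(options=PipelineOptions()) as p:
        tagged = (
            p
            | "Read" >> beam.io.ReadFromText("gs://example-bucket/raw/orders.json")
            | "Validate" >> beam.FlatMap(validate).with_outputs("valid", "invalid")
        )
        tagged.valid | "DumpValid" >> beam.Map(json.dumps) \
                     | "WriteValid" >> beam.io.WriteToText("gs://example-bucket/clean/orders")
        tagged.invalid | "DumpInvalid" >> beam.Map(json.dumps) \
                       | "WriteInvalid" >> beam.io.WriteToText("gs://example-bucket/rejects/orders")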
Configured AWS EC2, IAM, and S3 data pipelines for internal data sources using Boto API.
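Sketch of the Boto-based plumbing implied above; the bucket, prefix, and instance tags are illustrative only, not actual project resources.

    import boto3

    s3 = boto3.client("s3")
    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Stage a local extract into the landing bucket for downstream pipeline steps.
    s3.upload_file("daily_extract.csv", "example-landing-bucket", "incoming/daily_extract.csv")

    # List the objects queued for processing.
    listing = s3.list_objects_v2(Bucket="example-landing-bucket", Prefix="incoming/")
    for obj in listing.get("Contents", []):
        print(obj["Key"], obj["Size"])

    # Check the worker instances that run the ingestion jobs (placeholder tag).
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:Role", "Values": ["data-pipeline"]}]
    )["Reservations"]
    for r in reservations:
        for inst in r["Instances"]:
            print(inst["InstanceId"], inst["State"]["Name"])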
Performed data warehousing and ETL with tools like Informatica, AWS Glue, and Azure Data Factory.
Designed RESTful APIs for web traffic analysis, utilizing Flask, Pandas, and NumPy.
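Minimal sketch of such a Flask endpoint serving Pandas/NumPy aggregates over web-traffic data; the input file and columns are assumptions.

    import numpy as np
    import pandas as pd
    from flask import Flask, jsonify

    app = Flask(__name__)
    traffic = pd.read_csv("web_traffic.csv", parse_dates=["ts"])   # placeholder input

    @app.route("/api/traffic/summary")
    def traffic_summary():
        # Aggregate hits per day and expose a few summary metrics.
        daily = traffic.set_index("ts").resample("D")["hits"].sum()
        return jsonify({
            "days": len(daily),
            "total_hits": int(daily.sum()),
            "p95_daily_hits": float(np.percentile(daily, 95)),
        })

    if __name__ == "__main__":
        app.run(port=5000)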
Developed SQL and T-SQL procedures for data extraction and transformation.
Managed AWS services like EC2, VPC, CloudTrail, CloudWatch, CloudFormation, SNS, and RDS.
Worked extensively on Informatica PowerCenter and IDQ mappings for batch and real-time processing.
Automated workflows and ETL processes in GCP using Apache Airflow.
Developed SSIS packages and SQL Server imports for legacy data sources.
Created dashboards with Tableau for summarizing e-commerce data.
Extensively used AWS Redshift for ETL processes and Python with Apache Beam on Cloud Dataflow for data validation.
Documented best practices for Docker, Jenkins, Puppet, and GIT.
Installed and configured Splunk and developed Shell scripts for data processing.
Managed BigQuery, Dataproc, and Cloud Dataflow jobs, using Stackdriver for monitoring.
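Illustrative BigQuery check of the kind run alongside these jobs; the project, dataset, and table names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")   # placeholder project

    query = """
        SELECT DATE(event_ts) AS day, COUNT(*) AS events
        FROM `example-project.analytics.events`
        WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
        GROUP BY day
        ORDER BY day
    """
    for row in client.query(query).result():
        print(row.day, row.events)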
Delivered data solutions within PaaS, IaaS, SaaS environments using AWS, GCP, and Kubernetes.
Extensively used REST API for JSON data handling and API integration with SQL databases.
Prepared and developed Informatica workflows and ETL processes for data integration with RDBMS.
Designed ETL pipelines using Talend, Pig, Hive, and AWS Glue for comprehensive data processing.
Environment: Data Analysis, MySQL, HBase, Hive, Impala, Flume, NIFI, Agile, Neo4j, KeyLines, Cypher, Shell Scripting,
Python, SQL, XML, Oracle, JSON, Cassandra, Tableau, Git, Jenkins, AWS Redshift, PostgreSQL, Google Cloud Platform
(GCP), MS SQL Server, BigQuery, Salesforce SQL, Postman, Unix Shell Scripting, EMR, GitHub.
IBing Software Solutions Private Limited Hyderabad, India September 2015 to July 2017
Role: Big Data Engineer / ETL
Responsibilities:
Utilized Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python for machine learning
development, applying algorithms like linear regression, multivariate regression, naive Bayes, Random Forests, K-
means, and KNN for data analysis.
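A compact sketch of this Pandas/scikit-learn workflow using a Random Forest classifier; the dataset and column names are assumed for illustration.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("customers.csv")                         # placeholder dataset
    X = df[["age", "tenure_months", "monthly_spend"]]         # assumed features
    y = df["churned"]                                         # assumed label

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))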
Extensive experience in designing and implementing statistical models, predictive models, enterprise data models,
metadata solutions, and data lifecycle management across RDBMS and Big Data environments.
Applied domain knowledge and application portfolio knowledge to shape the future of large-scale business technology
programs.
Created and modified SQL and PL/SQL database objects, including Tables, Views, Indexes, Constraints, Stored
Procedures, Packages, Functions, and Triggers.
Created and manipulated large datasets through SQL joins, dataset sorting, and merging.
Designed ecosystem models (conceptual, logical, physical, canonical) to support enterprise data architecture for services
across the ecosystem.
Developed Linux Shell scripts using NZSQL/NZLOAD utilities for data loading into Netezza. Designed a system
architecture for an Amazon EC2-based cloud solution.
Tested complex ETL mappings and sessions to meet business requirements, loading data from flat files and RDBMS
tables to target tables.
Hands-on experience with database design, relational integrity constraints, OLAP, OLTP, cubes, normalization (3NF),
and denormalization.
Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.
Implemented customer segmentation using unsupervised learning techniques like clustering.
Used Teradata 15 tools and utilities, including Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ, and
other Teradata utilities.
Followed J2EE standards for module architecture, covering Presentation-tier, Business-tier, and Persistence-tier.
Wrote refined SQL queries for extracting attacker records and used Agile/SCRUM for project workflow.
Leveraged Spring Inversion of Control and Transaction Management, along with designing the front-end/user interface
(UI) using HTML 4.0/5.0, CSS3, JavaScript, jQuery, Bootstrap, and AJAX.
Managed JavaScript events and functions, implemented AJAX/jQuery for asynchronous data retrieval, and updated CSS
for new component layouts.
Conducted web service testing with SoapUI and logging with Log4j; performed test-driven development with JUnit and
used Maven for code builds.
Deployed applications on WebSphere Application Server and managed source code with MKS.
Conducted data analysis, data migration, data cleansing, data integration, and ETL design using Talend for Data
Warehouse population.
Developed PL/SQL stored procedures, functions, triggers, views, and packages, using indexing, aggregation, and
materialized views to optimize query performance.
Created logistic regression models in R and Python for predicting subscription response rates based on customer
variables.
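Hypothetical sketch of the subscription-response model in Python with scikit-learn; the customer variables and data source are illustrative only.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    campaign = pd.read_csv("campaign_responses.csv")          # placeholder dataset
    X = campaign[["age", "income", "prior_purchases"]]        # assumed customer variables
    y = campaign["responded"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))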
Developed Tableau dashboards for data visualization, reporting, and analysis, and presented insights to business
stakeholders.
Conducted FTP operations using Talend Studio for file transfers with tFileCopy, tFileArchive, tFileDelete, and other
components.
Designed and developed Spark jobs with Scala for batch processing data pipelines.
Managed Tableau Server for configuration, user management, license administration, and data connections, embedding
views for operational dashboards.
Collaborated with senior management on dashboard goals and communicated project status daily to management and
internal teams.
Environment: Hadoop Ecosystem (HDFS), Talend, SQL, Tableau, Hive, Sqoop, Kafka, Impala, Spark, Unix Shell Scripting,
Java, J2EE, DB2, JavaScript, XML, Eclipse, AJAX/jQuery, MKS, SoapUI, Erwin, Python, SQL Server, Informatica, SSRS,
PL/SQL, T-SQL, MLlib, MongoDB, logistic regression, Hadoop, OLAP, Azure, MariaDB, SAP CRM, SVM, JSON, AWS.
Yana Software Private Limited Hyderabad, India June 2014 to August 2015
Role: Hadoop Developer
Responsibilities:
As a Senior Data Engineer, delivered expertise in Hadoop technologies to support analytics development. Implemented
data pipelines with Python and adhered to SDLC methodologies.
Participated in JAD sessions for optimizing data structures and ETL processes. Loaded and transformed extensive
structured, semi-structured, and unstructured datasets using Hadoop/Big Data principles.
Leveraged Windows Azure SQL Reporting to create dynamic reports with tables, charts, and maps.
Designed a data model (star schema) for the sales data mart using Erwin and extracted data with Sqoop into Hive
tables.
Developed SQL scripts for table creation, sequences, triggers, and views, and conducted ad-hoc analysis on Azure
Databricks using a KANBAN approach.
Utilized Azure Reporting Services for report management, and debugged production issues in SSIS packages, loading
real-time data from various sources into HDFS using Kafka.
Created MapReduce jobs for data cleanup, defined ETL/ELT processes, and integrated MDM with data warehouses.
Created Pig scripts for data movement into MongoDB and developed MapReduce tasks with Hive and Pig.
Set up Data Marts with star and snowflake schemas and worked with multiple data formats in HDFS using Python.
Built Oracle PL/SQL functions, procedures, and workflows, and managed Hadoop jobs with Oozie.
Prepared Tableau dashboards and reports, and translated business requirements into SAS code.
Migrated ETL processes from RDBMS to Hive and set up an Enterprise Data Lake for storing, processing, and analytics
using AWS.
Leveraged AWS S3, EC2, Glue, Athena, Redshift, EMR, SNS, SQS, DMS, and Kinesis for diverse data management and
processing tasks.
Created Glue Crawlers and ETL jobs, performed PySpark transformations, and used CloudWatch for monitoring.
Employed AWS Athena for data queries and QuickSight for BI reports.
Used DMS to migrate databases and Kinesis Data Streams for real-time data processing.
Built Lambda functions to automate processes, used Agile methods for project management, and conducted complex
SQL data analysis.
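A minimal Lambda handler sketch of this automation pattern (an S3 event triggering a downstream Glue job); all resource names are placeholders.

    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        # S3 put events deliver bucket/key pairs in the Records list.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            glue.start_job_run(
                JobName="ingest-orders",                      # placeholder Glue job
                Arguments={"--source_path": f"s3://{bucket}/{key}"},
            )
        return {"status": "submitted"}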
Collaborated with MDM teams, created HBase tables, and worked on normalization for OLTP and OLAP systems.
Developed SSIS packages and SQL scripts, managed Hive tables, and used Informatica for ETL workflows. Designed Data
Marts and applied data governance and cleansing rules.
Built Hive queries for visualization, repopulated data warehouse tables using PL/SQL, and designed XML schemas.
Delivered customized reports and interfaces in Tableau and created the data model for the Enterprise Data Warehouse
(EDW).
Utilized SQL Server Reporting Services (SSRS) to create reports and used Pivot tables for business insights.
Environment: Erwin 9.7, Erwin 9.8, Redshift, Agile, MDM, Oracle 12c, SQL, HBase 1.1, HBase 1.2, UNIX, NoSQL, OLAP,
OLTP, SSIS, Informatica, HDFS, Hive, XML, PL/SQL, RDS, Apache Spark, Kinesis, Athena, Sqoop, Python, Big Data 3.0,
Hadoop 3.0, Azure, Sqoop 1.4, ETL, Kafka 1.1, MapReduce, MOM, Pig 0.17, MongoDB, Hive 2.3, Oozie 4.3, SAS.