Ravi Teja
AWS Data Engineer
E-Mail ID: [email protected]
Contact No: (816) 579-2762
SUMMARY:
Technical software professional with 8 years of experience in data engineering and in data pipeline design, development, and implementation using Python, SQL, and Spark. Seeking a challenging Data Engineer position to leverage 6+ years of AWS project experience in designing, developing, and maintaining data infrastructure solutions that drive business insights and growth.
Profile Summary:
Strong experience in programming languages like Java, Scala and Python.
Experience working with Hadoop components such as HDFS, MapReduce, Hive, HBase, Sqoop, Oozie, Spark, and Kafka.
Strong understanding of Distributed systems design, HDFS architecture, internal working
details of MapReduce and Spark processing frameworks.
Solid experience developing Spark applications for highly scalable data transformations using RDDs, DataFrames, and Spark SQL.
Strong experience troubleshooting failures in Spark applications and fine-tuning Spark applications and Hive queries for better performance.
Good experience utilizing various Spark optimizations such as broadcast joins, caching (persisting), sizing executors appropriately, and reducing shuffle stages (a minimal PySpark sketch of these optimizations follows this list).
Worked extensively on Hive for building complex data analytical applications.
Strong experience writing complex MapReduce jobs, including the development of custom InputFormats and RecordReaders.
Sound knowledge of map-side joins, reduce-side joins, shuffle and sort, distributed cache, compression techniques, and multiple Hadoop input and output formats.
Good experience working with AWS Cloud services such as S3, EMR, EC2, Redshift, Athena, IAM, Glue Metastore, Lambda, CloudWatch, and EventBridge.
Proficient in monitoring Step Function executions using AWS CloudWatch, enabling
proactive troubleshooting and performance optimization.
Developed automated workflows where ECS tasks are triggered by specific events via EventBridge, enhancing data pipeline responsiveness and reducing manual intervention.
Used SQL, Hive, Python, and PySpark to cope with the increasing volume of data.
Authored AWS Glue scripts to transfer data and used Glue to run ETL jobs and perform aggregations in PySpark code.
Proficient in monitoring event patterns and diagnosing issues in real-time using
EventBridge and integrated AWS monitoring tools like CloudWatch.
Experience with the implementation of the Snowflake cloud data warehouse and the operational deployment of Snowflake DW solutions into production.
Migrated legacy applications to Snowflake.
Knowledge of Snowflake and other peripheral data warehousing tools.
Deep understanding of performance tuning and partitioning for optimizing Spark applications.
Worked on building real-time data workflows using Kafka, Spark Streaming, and HBase.
Extensive knowledge of NoSQL databases such as HBase, Cassandra, and MongoDB.
Solid experience working with CSV, text, Avro, Parquet, ORC, and JSON data formats.
Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow.
Developed Python code for task definitions, dependencies, SLA monitoring, and time sensors for each job to manage and automate workflows in Airflow (an illustrative DAG sketch follows this list).
Designed and implemented Hive and Pig UDFs using Java for evaluating, filtering, loading, and storing data.
Strong understanding of Data Modelling and experience with Data Cleansing, Data
Profiling and Data analysis.
Experience in writing test cases in Java Environment using JUnit.
Proficiency in programming with different IDEs such as Eclipse and NetBeans.
Good knowledge of scalable, secure cloud architecture based on Amazon Web Services such as EC2, CloudFormation, VPC, S3, EMR, Redshift, Athena, and the Glue Metastore.
Integrated GitLab CI/CD pipelines with AWS Data Pipeline services for seamless automation
and continuous integration of data processes.
Good knowledge in the core concepts of programming such as algorithms, data structures,
and collections.
Excellent communication and interpersonal skills; flexible and adaptive to new environments, self-motivated, a team player, and a positive thinker who enjoys working in multicultural environments.
Analytical, organized, and enthusiastic about working in a fast-paced, team-oriented environment.
Expertise in interacting with business users, understanding their requirements, and providing solutions that match those requirements.
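A minimal PySpark sketch of the optimization patterns listed above (broadcast join, caching, and explicit executor/shuffle sizing); the bucket paths, table names, and tuning values are illustrative assumptions rather than settings from any specific project:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Illustrative session config: executor sizing and shuffle partitions are tuned per workload.
spark = (SparkSession.builder
         .appName("optimization-sketch")
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())

events = spark.read.parquet("s3://example-bucket/events/")      # large fact table (hypothetical path)
lookup = spark.read.parquet("s3://example-bucket/dim_lookup/")  # small dimension table

# Broadcast the small table to avoid a shuffle-heavy join.
joined = events.join(broadcast(lookup), on="lookup_id", how="left")

# Cache (persist) the result because it feeds multiple downstream aggregations.
joined.cache()

daily_counts = joined.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("s3://example-bucket/output/daily_counts/")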
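An illustrative Airflow DAG sketch of the workflow-automation pattern described above (task dependencies, an SLA, and a time sensor); the DAG id, schedule, and callables are hypothetical placeholders:

from datetime import datetime, time, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.time_sensor import TimeSensor

def extract():
    pass  # placeholder extract logic

def load():
    pass  # placeholder load logic

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=2),  # SLA watcher: Airflow records an SLA miss if a task runs past this
}

with DAG(
    dag_id="example_daily_etl",        # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    wait_until_6am = TimeSensor(task_id="wait_until_6am", target_time=time(6, 0))
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    wait_until_6am >> extract_task >> load_task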
TECHNICAL SKILLS:
Hadoop/Big Data: Spark, Hive, HDFS, MapReduce, Sqoop, Oozie, Kafka, Impala, Zookeeper, Kinesis, Ambari, YARN
Programming Languages: Java, Scala, Python, PySpark
Cloud: AWS (EC2, S3, EMR, RDS, Lambda, SNS, CloudWatch, Aurora, Redshift, IAM, Athena, Glue Metastore, EventBridge)
Databases: NoSQL (HBase, Cassandra, MongoDB), Teradata, Oracle, DB2, MySQL, Postgres
Infrastructure as Code: Terraform
IDE Tools: Eclipse, IntelliJ, PyCharm
Development Approach: Agile, Waterfall
Version Control/Build Tools: CVS, SVN, Git, SBT, Maven
Reporting Tools: Tableau, QlikView, QlikSense
PROFESSIONAL EXPERIENCE:
• Developed a series of data ingestion jobs in Scala for collecting data from multiple channels and external applications.
• Worked on both batch and streaming ingestion of the data.
• Built Python-based data pipelines from multiple data sources by performing the necessary ETL tasks.
• Imported clickstream log data from FTP servers and performed various data transformations using the Spark DataFrame and Spark SQL APIs (a brief PySpark sketch follows this list).
• Designed and implemented ETL jobs in AWS Glue, automating data transformations and
integrations between diverse data sources such as RDS, S3, and third-party APIs.
• Implemented Java-based Kafka producer applications for streaming messages to Kafka topics.
• Built Spark Streaming applications for consuming messages and writing to HBase.
• Worked on troubleshooting and optimizing Spark Applications.
• Worked on ingesting data from SQL Server to S3 using Sqoop within AWS EMR.
• Migrated MapReduce jobs to Spark applications built in Scala and integrated them with Apache Phoenix and HBase.
• Orchestrated and scaled Spark and Hadoop clusters using EMR, facilitating distributed
data processing tasks and advanced analytics.
• Designed automated data workflows with AWS Data Pipeline, ensuring consistent,
timely, and fault-tolerant data processing.
• Developed interactive dashboards and reports in QuickSight, providing stakeholders
with actionable business insights.
• Developed daily and monthly ETL processes which were automated using custom UNIX
shell scripts & Python.
• Worked on building ETL pipelines using Python scripting, pandas DataFrames, and PySpark.
• Involved in loading and transforming large sets of data and analyzed them using Hive
Scripts.
• Implemented SQL queries on AWS using platforms such as Athena and Redshift.
• Queried alert data arriving in S3 buckets with AWS Athena to find the time-interval differences between Kafka and Kinesis clusters.
• Loaded portion of processed data into Redshift tables and automated the process.
• Worked on various Spark performance optimizations such as broadcast variables, dynamic allocation, and partitioning, and built custom Spark UDFs (a minimal UDF sketch follows this list).
• Worked on fine-tuning long-running Hive queries by applying proven techniques such as the Parquet columnar format, partitioning, and vectorized execution.
• Analyzed the data using Spark Data Frames and series of Hive Scripts to produce
summarized results to downstream systems.
• Worked with Data Science team in developing Spark ML applications to develop various
predictive models.
• Expertise in interacting with the project team to organize timelines, responsibilities, and deliverables, providing all aspects of technical support.
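A brief PySpark sketch of the clickstream transformation pattern referenced above, combining the DataFrame API with Spark SQL; the S3 paths and column names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("clickstream-sketch").getOrCreate()

# Hypothetical clickstream layout: user_id, url, event_ts
clicks = spark.read.json("s3://example-bucket/clickstream/raw/")

# DataFrame API transformation: derive a date column from the event timestamp.
daily = clicks.withColumn("event_date", to_date(col("event_ts")))

# Equivalent aggregation expressed through Spark SQL.
daily.createOrReplaceTempView("clicks_daily")
page_views = spark.sql("""
    SELECT event_date, url, COUNT(*) AS views
    FROM clicks_daily
    GROUP BY event_date, url
""")

page_views.write.mode("overwrite").parquet("s3://example-bucket/clickstream/page_views/")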
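A minimal sketch of a custom PySpark UDF of the kind mentioned above; the device-categorization logic, input column, and paths are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

# Hypothetical helper: normalize free-form user-agent strings into coarse categories.
def device_category(user_agent):
    if user_agent is None:
        return "unknown"
    ua = user_agent.lower()
    if "mobile" in ua or "android" in ua or "iphone" in ua:
        return "mobile"
    return "desktop"

device_category_udf = udf(device_category, StringType())

# Assumed input dataset containing a user_agent column.
events = spark.read.parquet("s3://example-bucket/events/")
events.withColumn("device_category", device_category_udf("user_agent")) \
      .write.mode("overwrite").parquet("s3://example-bucket/events_enriched/")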
Environment: Hadoop, Spark, Scala, Hive, Sqoop, Python, Oozie, Kafka, AWS EMR, Redshift, S3,
Kinesis, Spark Streaming, Athena, HBase, YARN, JIRA, Shell Scripting, Maven, Git
Vanguard, PA
Sr Data Engineer Nov 2021 - Dec 2023
Responsibilities:
Utilized analytical, statistical, and programming skills to collect, analyze, and interpret large data sets, developing data-driven technical solutions to difficult business problems with tools such as SQL and Python.
Created Lambda jobs to trigger EMR clusters for running Spark applications (a boto3 sketch follows this list).
Worked on designing AWS EC2 instance architecture to meet high-availability and security requirements for the application architecture.
Experience in optimizing data pipelines through the effective use of AWS EventBridge,
improving data flow efficiency, and reducing latency.
Integrated APIs into ETL processes for extracting data from diverse sources, transforming it
into a standardized format, and loading it into target data stores.
Created AWS S3 buckets, managed bucket policies, and utilized S3 and Glacier for storage and backup.
Worked with different file formats such as Parquet, accessing data with PySpark and Impala, and performed Spark Streaming with RDDs and DataFrames.
Performed the aggregation of log data from different servers and used them in downstream
systems for analytics using Apache Kafka.
Worked on Data Integration for extracting, transforming, and loading processes for the
designed packages.
Designed and deployed automated ETL workflows using AWS Lambda, organized and cleansed the data in S3 buckets using AWS Glue, and processed the data using Amazon Redshift (a Glue job sketch follows this list).
Utilized CloudWatch metrics and logs for in-depth performance analysis and
troubleshooting of data pipelines and applications, leading to optimized resource utilization
and reduced downtime.
Worked on ETL architecture enhancements to increase performance using the query optimizer.
Processed extracted data using Spark and Hive and handled large data sets in HDFS.
Worked on streaming data transfer from different data sources into HDFS and NoSQL databases.
Worked on scripting with Python and PySpark to transform data from various file formats such as text, CSV, and JSON.
Worked on data processing and testing using Spark SQL and on real-time processing with Spark Streaming and Kafka in Python.
Scripted in Python and PowerShell to set up baselines, branching, merging, and automation processes using Git.
Worked on the implementation of the ETL architecture to enhance the data and optimized workflows by building DAGs in Apache Airflow to schedule ETL jobs, leveraging additional Airflow components such as pools, executors, and multi-node functionality.
Explored new features and updates in Amazon SNS, implementing innovative solutions to improve system efficiency and user experience.
Utilized CloudFormation to define and provision infrastructure components required for
deploying and running data pipelines.
Involved in continuous integration of applications using Jenkins.
Worked on creating SSIS packages for Data Conversion using data conversion
transformation and producing advanced extensible reports using SQL Server Reporting
Services.
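A boto3 sketch of a Lambda handler that submits a Spark step to an EMR cluster, as referenced above; the region, cluster id, and job script path are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

def lambda_handler(event, context):
    """Triggered (e.g. by an S3 or EventBridge event) to submit a Spark step to a running EMR cluster."""
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
        Steps=[{
            "Name": "run-spark-application",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-bucket/jobs/etl_job.py",  # hypothetical job script
                ],
            },
        }],
    )
    return {"step_ids": response["StepIds"]}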
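A condensed AWS Glue (PySpark) job sketch of the Lambda/Glue/Redshift workflow described above; the catalog database, table, Redshift connection, and S3 temp path are placeholder assumptions:

import sys
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw data registered in the Glue Data Catalog (placeholder database/table names).
raw = glue_context.create_dynamic_frame.from_catalog(database="raw_db", table_name="orders")

# Simple cleansing step: drop null fields before loading downstream.
cleaned = DropNullFields.apply(frame=raw)

# Write to Redshift through a Glue connection (placeholder connection and table).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "analytics.orders", "database": "warehouse"},
    redshift_tmp_dir="s3://example-bucket/temp/",
)

job.commit()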
Environment: Python, SQL, AWS EC2, AWS S3 buckets, SNS, Cloudwatch, PySpark, AWS lambda,
AWS Glue, Amazon Redshift, Spark Streaming, Apache Kafka, SSIS, ETL, Hive, HDFS, NoSQL, MySQL,
Teradata, PowerShell, GIT, Apache Airflow.
Responsibilities:
Environment: AWS, S3, EMR, Spark, Kafka, Hive, Athena, Glue, Redshift, Teradata, Tableau, Step Functions.
AT&T, IL
Data Engineer Oct 2019 - Feb 2021
Responsibilities:
Environment: Spark, PySpark, Hive, AWS EMR, S3, JIRA, Bamboo, Bitbucket, Control M, Presto
Aetna, India
Big Data Developer Aug 2018 – July 2019
Responsibilities:
Worked on building centralized Data Lake on AWS Cloud utilizing primary services like S3,
EMR, Redshift and Athena.
Worked on migrating datasets and ETL workloads in Scala from on-prem to AWS Cloud services.
Extensive experience utilizing ETL processes to design and build very large-scale data pipelines using Apache Spark.
Migrated data from a local Teradata data warehouse to AWS S3 data lakes.
Built series of Spark Applications and Hive scripts to produce various analytical datasets
needed for digital marketing teams.
Worked extensively on building and automating data ingestion pipelines, moving terabytes of data from existing data warehouses to the cloud.
Responsible for data ingestion projects injecting data into the data lake from multiple source systems using Talend Big Data.
Worked extensively on fine-tuning Spark applications and providing production support to various pipelines running in production.
Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation in PySpark.
Developed and optimized Python-based ETL pipelines in both legacy and distributed environments.
Developed Spark-with-Python pipelines using Spark DataFrame operations to load data to the EDL, using EMR for job execution and AWS S3 as the storage layer.
Worked closely with business teams and data science teams and ensured all the
requirements are translated accurately into our data pipelines.
Worked on the full spectrum of data engineering pipelines with Python: data ingestion, data transformation, and data analysis/consumption.
Extracted data from AWS Aurora databases for big data processing.
Developed AWS Lambdas in Python and used Step Functions to orchestrate data pipelines (a short orchestration sketch follows this list).
Worked on automating infrastructure setup, including launching and terminating EMR clusters.
Created Hive external tables on top of datasets loaded in AWS S3 buckets and wrote various Hive scripts to produce a series of aggregated datasets for downstream analysis (a brief external-table sketch follows this list).
Used Scala data pipelines to perform transformations on EMR clusters, loading the transformed data into S3 and from S3 into Redshift.
Worked on creating Kafka producers with the Kafka Java Producer API to connect to an external REST live-stream application and produce messages to Kafka topics.
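A short boto3 sketch of the Lambda-plus-Step-Functions orchestration mentioned above; the region, state machine ARN, and input payload are placeholders:

import json
import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")  # region is an assumption

def lambda_handler(event, context):
    """Kick off a Step Functions state machine that orchestrates the downstream ETL stages."""
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:example-etl",  # placeholder ARN
        input=json.dumps({"run_date": event.get("run_date", "2023-01-01")}),
    )
    return {"executionArn": response["executionArn"]}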
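A brief sketch of the external-table pattern described above, expressed here through Spark SQL with Hive support rather than a standalone Hive script; the database, schema, and S3 location are illustrative and assume the target database already exists:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-external-table-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Register an external table over data already landed in S3 (placeholder schema and path).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.page_views (
        url STRING,
        views BIGINT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 's3://example-bucket/clickstream/page_views/'
""")

# Aggregate into a downstream summary dataset.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.page_views_monthly
    STORED AS PARQUET AS
    SELECT substr(event_date, 1, 7) AS event_month, url, SUM(views) AS views
    FROM analytics.page_views
    GROUP BY substr(event_date, 1, 7), url
""")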
Environment: AWS S3, EMR, Redshift, Aurora, Athena, Glue, Talend, Spark, Python, Java, Hive,
Kafka
Responsibilities:
Environment: Java, JSP, J2EE, Servlets, Java Beans, HTML, JavaScript, JDeveloper, Tomcat
Webserver, Oracle, JDBC, XML.