Mourya K Data Engineer
Professional Summary:
Seasoned data engineer with 10 years of experience crafting data-intensive applications using Big Data
Ecosystems, Cloud Computing Services, Data Warehousing, Visualization, and Reporting Tools.
Expertly navigated the Hadoop framework, steering analysis, design, development, documentation, deployment,
and SQL integration within big data technologies.
Showcased mastery across pivotal Hadoop ecosystem components (HDFS, YARN, MapReduce, Apache Spark,
Apache Sqoop, and Apache Hive) critical to fostering robust data engineering.
Proficiently handled data integration from various sources, encompassing RDBMS, NoSQL databases,
spreadsheets, text files, JSON files, and delimited files.
Experienced with Apache Spark, improving the performance and optimization of existing jobs using SparkContext,
Spark-SQL, and the DataFrame API, working specifically with PySpark and Scala.
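A minimal PySpark sketch of the kind of DataFrame-API tuning referred to above; the paths, table layouts, and column names are illustrative, not taken from any specific project.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Assumed session; in practice the cluster supplies most configuration.
    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    # Illustrative inputs: a large fact table and a small dimension table.
    events = spark.read.parquet("s3://example-bucket/events/")
    channels = spark.read.parquet("s3://example-bucket/channels/")

    # Broadcast the small dimension to avoid a shuffle-heavy join,
    # then cache the joined frame because it is reused downstream.
    joined = events.join(F.broadcast(channels), "channel_id").cache()

    daily = (joined
             .groupBy("channel_id", "event_date")
             .agg(F.count("*").alias("events"),
                  F.sum("watch_seconds").alias("watch_seconds")))

    daily.write.mode("overwrite").parquet("s3://example-bucket/aggregates/daily/")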
Successfully designed and implemented a scalable data processing solution on Google Cloud Platform (GCP) by
utilizing Google Cloud Storage for data storage, provisioning Google Dataproc clusters for distributed data
processing, orchestrating tasks with Google Dataproc Workflow Templates, and leveraging Google BigQuery for
high-speed data analytics. This streamlined workflow improved data analysis efficiency and actionable insights,
ultimately enhancing business operations.
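A short sketch of the BigQuery analytics step of such a GCP pipeline using the google-cloud-bigquery client; the project, dataset, table, and column names are placeholders.

    from google.cloud import bigquery

    # Assumes application-default credentials; the project ID is a placeholder.
    client = bigquery.Client(project="example-project")

    sql = """
        SELECT channel_id, SUM(watch_seconds) AS total_watch_seconds
        FROM `example-project.analytics.daily_usage`
        WHERE event_date = @event_date
        GROUP BY channel_id
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("event_date", "DATE", "2023-01-01")]
    )

    # Run the query and iterate over the result rows.
    for row in client.query(sql, job_config=job_config).result():
        print(row["channel_id"], row["total_watch_seconds"])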
Guided the complete design and implementation of diverse projects, leveraging ETL/Visualization tools and
showcasing proficiency in big data, Cloud Computing, and in-memory applications.
Implemented and scheduled data pipelines using Apache Airflow to automate data ingestion, transformation,
and loading (ETL) processes, which included designing Directed Acyclic Graphs (DAGs) to define workflows,
utilizing operators for various data processing tasks, and configuring the Airflow scheduler to ensure timely
execution of pipelines.
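A minimal sketch of an Airflow DAG along the lines described above (Airflow 2.x style); the task callables, IDs, and schedule are illustrative.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder callables standing in for real ingestion/transformation/load logic.
    def extract():
        pass

    def transform():
        pass

    def load():
        pass

    with DAG(
        dag_id="example_daily_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        # DAG edges define the ETL ordering; the scheduler triggers a run each day.
        extract_task >> transform_task >> load_task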
Experience in Dimensional Modeling using Snowflake schema methodologies of Data Warehouse and Integration
projects.
Automated workflow processes using Control-M for scheduling Batch jobs, setting up dependencies, and
generating reports using custom scripts, enhancing efficiency and data accessibility.
Experience in data modeling for Data Mart/Data Warehouse development, including conceptual, logical, and
physical model design, developing Entity Relationship Diagrams (ERDs), and reverse/forward engineering ERDs with
CA Erwin Data Modeler.
Experience in extracting, transforming and loading (ETL) data from spreadsheets, database tables and other
sources using Microsoft SSIS.
Experience in building and optimizing AWS data pipelines, architectures, and data sets.
Experience in managing and reviewing Hadoop log files.
Strong Experience in Data Migration from RDBMS to Snowflake cloud data warehouse.
Excellent knowledge in designing and developing dashboards using QlikView by extracting the data from multiple
sources.
Experience in data transformation and data mapping from source to target database schemas, as well as data
cleansing.
Provided critical production support, diligently identifying root causes, resolving bugs, and promptly updating
stakeholders on production issues.
Highly skilled in AWS, Snowflake Database, Python, Oracle, Exadata, Informatica, SQL, PL/SQL, bash scripting,
Hadoop, Hive, Databricks.
Experience in data stream processing using Kafka and ZooKeeper for developing data pipelines with PySpark.
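A minimal PySpark Structured Streaming sketch of the Kafka pipeline pattern mentioned above; the broker address and topic are placeholders, and the spark-sql-kafka package is assumed to be available on the cluster.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

    # Read a Kafka topic as a streaming DataFrame (broker/topic are placeholders).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "claims-events")
              .load())

    # Kafka delivers key/value as binary; cast the payload to a string for parsing.
    payload = events.selectExpr("CAST(value AS STRING) AS json_value")

    query = (payload.writeStream
             .format("console")   # stand-in sink for illustration
             .outputMode("append")
             .start())
    query.awaitTermination()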
Experience in developing logical data models, reverse engineering, and physical data models of a CRM system
using ER-WIN and Infosphere.
Expertise in all aspects of the Agile SDLC, from requirement analysis and design through development, coding,
testing, implementation, and maintenance.
Experience with Airflow to schedule ETL jobs, and with AWS Glue and Athena to extract data from the AWS data
warehouse.
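An illustrative boto3 sketch of running an Athena query over Glue-cataloged data; the region, database, query, and result location are placeholders.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Database, query, and result bucket below are placeholders.
    response = athena.start_query_execution(
        QueryString="SELECT event_date, COUNT(*) FROM usage_events GROUP BY event_date",
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )

    # Check the status of the submitted query.
    query_id = response["QueryExecutionId"]
    status = athena.get_query_execution(QueryExecutionId=query_id)
    print(status["QueryExecution"]["Status"]["State"])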
Designed NoSQL and Google BigQuery solutions for transforming unstructured data into structured data sets.
Experience with container-based deployments using Docker, working with Docker images and Docker registries.
Hands-on experience analyzing SAS ETL and implementing data integration in Informatica using XML,
web services, SAP ABAP, and SAP IDoc.
Created various types of reports, such as complex drill-down reports, drill-through reports, parameterized
reports, matrix reports, sub-reports, non-parameterized reports, and charts, using Reporting Services based on
relational and OLAP databases.
Experience in developing Spark applications using Spark-SQL in Databricks for data extraction, transformation,
and aggregation from multiple file formats.
Experienced with Teradata utilities (FastLoad, MultiLoad, BTEQ scripting, FastExport, SQL Assistant) and tuning
of Teradata queries using Explain plans.
Established end-to-end CI/CD pipelines for data workflows using tools like Jenkins and GitLab, ensuring version
control, automated testing, validation, and deployment of data pipelines.
Possess critical communication, analytical, and leadership skills, adeptly navigating independent and
collaborative work settings.
Dedicated to ensuring data accuracy and integrity through validation frameworks, automated testing, and
anomaly detection. Recognized for enhancing data reliability, which directly improves downstream analytics.
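A small PySpark sketch of the kind of validation check referred to above; the column names and thresholds are illustrative and not taken from any specific framework.

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def validate(df: DataFrame, required_columns, max_null_ratio=0.01):
        """Raise if the frame is empty or any required column has too many nulls."""
        total = df.count()
        if total == 0:
            raise ValueError("validation failed: empty DataFrame")
        for column in required_columns:
            nulls = df.filter(F.col(column).isNull()).count()
            if nulls / total > max_null_ratio:
                raise ValueError(f"validation failed: {column} null ratio {nulls / total:.2%}")
        return True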
Exhibited a robust grasp of the Software Development Life Cycle (SDLC), showcasing adeptness in testing
methodologies, task execution, resource management, scheduling, and related disciplines, and implemented
various data warehouse projects in Agile Scrum/Waterfall methodologies.
Skill Set: Pyspark, Shell Scripting, Spark-SQL, Apache Hive, Apache Pig, Google BigQuery, Apache Airflow, Apache
Oozie, Google Cloud Storage, Google Dataproc Clusters, GitLab, Google Workflow Templates, Screwdriver.
visualization and performed Gap analysis.
Created impactful Big Data projects utilizing Spark in conjunction with Scala, seamlessly integrating with essential
tools from the Hadoop Ecosystem, including YARN, MapReduce, Apache Hive, Apache Sqoop, and Control-M.
Constructed efficient dataflow pipelines to migrate HEDIS medical data from diverse origins like SQL Server, DB2,
and files. Employed Spark-Scala and an Ingestion Framework to enforce transformation rules and validation,
optimizing data movement to the target platform.
Operated within dynamic environments encompassing Apache Hive, S3 buckets, and frameworks related to data
ingestion. Facilitated seamless integration with dashboards and orchestrated workflows using Netflix Conductor.
Developed ETL (Extract, Transform, Load) jobs to automate data integration processes, improving data accuracy
and reducing processing time.
Performed ad-hoc SQL queries on data stored in Cloud Storage buckets, enabling rapid data analysis and
providing actionable insights to stakeholders, resulting in data-driven decision-making and improved business
processes.
Orchestrated and managed data processing clusters, leveraging Apache Spark for data analysis, leading to
actionable insights for business stakeholders.
Optimized database performance by implementing best practices, resulting in a reduction in query execution
time and enhanced data retrieval capabilities.
Managed the scheduling of numerous scalable, independent end-to-end data migration jobs, expertly employing
scheduling tools such as Control-M.
Managed data migration from SQL Server and Teradata to Amazon S3 and structured a data service layer in Hive.
Worked with databases like Oracle, MySQL, DB2, and Postgres.
Spearheaded CD/CI initiatives with Jenkins and Shell Scripting, streamlining deployments and optimizing code
reuse.
Identified patterns in fraudulent claims using text mining in R and Hive.
Exported the required information to an RDBMS using Sqoop, making the data available to the claims
processing team to assist in processing claims.
Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in
partitioned tables in the EDW.
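A Hadoop Streaming style sketch in Python of the parse-and-aggregate pattern described above (the production MapReduce jobs may have been written differently); the delimiter and field positions are illustrative, and in practice the two functions would ship as separate mapper.py and reducer.py scripts.

    import sys

    def mapper():
        # Parse raw pipe-delimited records and emit claim_id <tab> amount.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("|")
            if len(fields) < 3:
                continue  # skip malformed records
            claim_id, amount = fields[0], fields[2]
            print(f"{claim_id}\t{amount}")

    def reducer():
        # Sum amounts per claim_id; Hadoop delivers keys already sorted.
        current_key, total = None, 0.0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if key != current_key and current_key is not None:
                print(f"{current_key}\t{total}")
                total = 0.0
            current_key = key
            total += float(value)
        if current_key is not None:
            print(f"{current_key}\t{total}")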
Managed seamless data migrations, especially Oracle to S3 and Teradata to Snowflake, ensuring smooth on-
prem to cloud shifts.
Developed PySpark-based Spark applications, extracting vital customer usage patterns through data analysis.
Performed advanced SQL queries to pull data from AWS Redshift, track KPIs, and automate monthly report
refreshing using Power BI dashboards along with ad-hoc analysis and reporting.
Applied diverse transformations and actions to Spark DataFrames, aligning them with specific business
requisites.
Deployed Spark-SQL to load Parquet data efficiently, created Case class-defined Datasets, and effectively
managed structured data. This culminated in storing data in Hive tables for downstream utilization.
Collaborated with cross-functional teams, including data scientists and analysts, to understand data
requirements, ensuring the provisioning of clean, accurate, and structured data for analysis and reporting.
Expertly managed extensive datasets through strategic partitioning, harnessed Spark's in-memory capabilities,
optimized performance with efficient joins, and adeptly employed transformations during the data ingestion.
Utilized version control systems and collaborative platforms to streamline code management, fostering
enhanced team productivity, seamless code review, and effective knowledge sharing.
Created Scala scripts and UDFs utilizing DataFrames in Apache Spark to aggregate and process data.
Ensured smooth operations by monitoring production jobs, identifying and resolving errors, reprocessing failed
batch jobs, and communicating issues to stakeholders.
Enhanced product performance through judicious selection of file formats (Avro, ORC), resource allocation,
optimized joins, and efficient transformations.
Skill Set: HDFS, S3, Apache Hive, Apache Hue, DB2, Microsoft SQL Server, YARN, HBase, MapReduce, Scala, Apache
Sqoop, Control-M, Spark-SQL, Netflix Conductor, Shell Script, UDFs, Ingestion Framework, GitLab, Jenkins.
Client: Star Network, India
Jun 2013 - Dec 2014
Role: Data Engineer
Horizon and Bloom, internal products within the organization, provide revenue and viewership KPIs to diverse
teams. Collaborating with the backend team, I contributed to the development of scalable Data Lake solutions and
applications supporting the functionalities of Horizon and Bloom.
Responsibilities:
Contributed to seamlessly transferring and transforming extensive volumes of structured, semi-structured, and
unstructured data from relational databases into HDFS using Apache Sqoop imports.
Constructed robust distributed data solutions on the Hadoop framework, ensuring scalability and optimal data
handling.
Formulated Apache Sqoop Jobs and Hive Scripts to extract data from relational databases, comparing it against
historical data for insightful analysis. Utilized reporting and visualization tools like Matplotlib to present data
insights effectively.
Built ETL data pipelines to move data to S3 and then to Redshift.
Designed and implemented ETL pipelines from various relational databases to the data warehouse using
Apache Airflow.
Worked on data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
Developed SSIS packages to extract, transform, and load (ETL) data into the SQL Server database from legacy
mainframe data sources.
Worked on the design, development, and documentation of the ETL strategy to populate the data warehouse from
various source systems using the Talend ETL tool.
Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation
and Materialized views to optimize query performance.
Developed logistic regression models (using R programming and Python) to predict subscription response rate
based on customer’s variables like past transactions, response to prior mailings, promotions, demographics,
interests and hobbies, etc.
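A compact scikit-learn sketch of the response-rate modeling described above; the input file, feature names, and target column are purely illustrative.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Illustrative customer-level features; the real model used past transactions,
    # prior mailing responses, promotions, demographics, and interests.
    df = pd.read_csv("customers.csv")  # placeholder input
    features = ["past_transactions", "prior_response", "promotions", "age"]
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["subscribed"], test_size=0.2, random_state=42
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))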
Created Tableau dashboards and reports for data visualization, reporting, and analysis, and presented them to the business.
Assumed a pivotal role in implementing dynamic partitioning and bucketing techniques to enhance data
organization within Hive metadata.
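A PySpark sketch of the partitioning and bucketing approach mentioned above; the paths, database, and column names are placeholders, and note that Spark's bucketing layout written via saveAsTable differs from native Hive bucketing.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning")
             .enableHiveSupport()
             .getOrCreate())

    viewership = spark.read.parquet("/data/staging/viewership/")  # placeholder input

    # Partition by date and bucket by channel to prune scans and speed up joins.
    (viewership.write
        .partitionBy("view_date")
        .bucketBy(16, "channel_id")
        .sortBy("channel_id")
        .format("orc")
        .mode("overwrite")
        .saveAsTable("analytics.viewership_daily"))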
Mastery in converting extensive structured and semi-structured data, harnessing state-of-the-art methodologies.
Created and managed multiple Apache Hive tables, populating them with data and crafting Hive Queries to drive
internal processes.
Spearheaded the maintenance and monitoring of reporting jobs, employing Jenkins for continuous integration
and deployment.
Facilitated smooth data flow to downstream consumption teams through proactive meetings, ensuring
consistent and reliable access for end users.
Created custom T-SQL procedures to read data from flat files and load it into the SQL Server database using the
SQL Server Import and Export Data Wizard.
Designed and architected the various layers of the Data Lake.
Developed ETL Python scripts for ingestion pipelines that run on AWS infrastructure comprising EMR, S3, Redshift,
and Lambda.
Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments.
Configured EC2 instances, IAM users, and IAM roles, and created an S3 data pipeline using the Boto API to load
data from internal data sources.
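An illustrative Boto3 sketch of the S3 load step described above; the bucket names, prefixes, and file paths are placeholders.

    import boto3

    s3 = boto3.client("s3")

    # Upload an extract produced by an internal source system (paths are placeholders).
    s3.upload_file(
        Filename="/tmp/internal_extract.csv",
        Bucket="example-data-lake",
        Key="raw/internal/internal_extract.csv",
    )

    # List what landed under the raw prefix as a quick sanity check.
    listing = s3.list_objects_v2(Bucket="example-data-lake", Prefix="raw/internal/")
    for obj in listing.get("Contents", []):
        print(obj["Key"], obj["Size"])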
Developed DataStage jobs to cleanse, transform, and load data to the data warehouse, complemented by
sequencers to encapsulate the job flow.
Conducted comprehensive Data Analysis and Profiling, generating both scheduled and ad-hoc reports for users.
Engineered Bash scripts to enable seamless integration of big data tools and to manage error handling and
notifications.
Skill Set: Apache Hadoop, Apache Sqoop, RDBMS, HDFS, Python, Matplotlib, Bash, GitLab, Jenkins.
Education:
Master of Science in Computer Science, Florida State University, Tallahassee, FL. 2017
Bachelor of Technology in Computer Science & Engineering, Jawaharlal Nehru Technological University,
Anantapur, India. 2013