Project Name
Scala 2.12
Maven 3.6
Java 8
Job Duties:
Participate in the technical analysis, design, development, and deployment of highly
complex Internet/Intranet application projects.
Gather and clarify requirements with the business architect to feed into the
high-level customization design, development, and installation phases.
Participate in daily Scrum meetings, sprint planning, and task estimation for user
stories; take part in retrospectives and present a demo at the end of each sprint.
Conduct design and code reviews to ensure adherence to design specifications;
oversee preparation of test data and the testing and debugging of applications.
Develop a logging framework based on Log4j.
Participate in the agile development process; document and communicate issues and
bugs related to data standards.
Produce unit tests for Spark transformations and helper methods.
Use sbt to develop Scala-based Spark projects and execute them with spark-submit.
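As an illustration, a minimal ScalaTest sketch of such a unit test; the transformation under test (withFullName) and the column names are hypothetical, and a local-mode SparkSession is assumed:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.concat_ws
    import org.scalatest.funsuite.AnyFunSuite

    // Hypothetical helper under test: derives a full_name column.
    object NameTransforms {
      def withFullName(df: DataFrame): DataFrame =
        df.withColumn("full_name", concat_ws(" ", df("first_name"), df("last_name")))
    }

    class NameTransformsSuite extends AnyFunSuite {
      private val spark = SparkSession.builder()
        .master("local[2]")   // local mode keeps the test self-contained
        .appName("unit-tests")
        .getOrCreate()
      import spark.implicits._

      test("withFullName concatenates first and last name") {
        val input  = Seq(("Ada", "Lovelace")).toDF("first_name", "last_name")
        val result = NameTransforms.withFullName(input)
        assert(result.select("full_name").as[String].head() == "Ada Lovelace")
      }
    }

Suites like this run under sbt test, while the packaged job itself is launched with spark-submit.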
Design and develop scalable, reliable, and efficient data pipelines using Apache
Spark.
Create ETL (Extract, Transform, Load) processes to extract data from various
sources, transform it into the desired format, and load it into target systems.
Implement data workflows that accommodate data from different sources, such as
databases, APIs, flat files, and streaming sources.
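As a hedged sketch of one such flow (the JDBC URL, paths, credentials, and column names below are placeholders, not a real system):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object OrdersEtl {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("orders-etl").getOrCreate()

        // Extract: a relational table and flat files.
        val customers = spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/sales")
          .option("dbtable", "public.customers")
          .option("user", "etl_user")
          .option("password", sys.env("DB_PASSWORD"))
          .load()
        val orders = spark.read.option("header", "true").csv("/data/landing/orders/*.csv")

        // Transform: keep completed orders and enrich with customer attributes.
        val completed = orders
          .filter(col("status") === "COMPLETED")
          .join(customers, Seq("customer_id"))

        // Load: persist to the curated zone as Parquet.
        completed.write.mode("overwrite").parquet("/data/curated/completed_orders")
        spark.stop()
      }
    }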
Data Transformation:
Perform data transformation tasks to clean, validate, enrich, and aggregate raw
data into a structured and usable format.
Write Spark transformations and SQL queries to process and manipulate data
according to business requirements.
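For instance, a minimal sketch mixing DataFrame transformations with a SQL aggregation (the events file and its columns are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_date, trim}

    val spark = SparkSession.builder().appName("transform-demo").getOrCreate()
    val raw = spark.read.option("header", "true").csv("/data/raw/events.csv")

    // Clean, validate, and enrich with DataFrame transformations.
    val cleaned = raw
      .withColumn("event_date", to_date(col("event_ts")))
      .withColumn("country", trim(col("country")))
      .filter(col("user_id").isNotNull)

    // Aggregate with SQL against a temporary view.
    cleaned.createOrReplaceTempView("events")
    val daily = spark.sql(
      """SELECT event_date, country, COUNT(*) AS event_count
        |FROM events
        |GROUP BY event_date, country""".stripMargin)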
Data Storage and Management:
Choose appropriate data storage solutions based on the requirements, such as HDFS,
Apache Parquet, Apache ORC, or cloud-based storage systems like Amazon S3 or Azure
Data Lake Storage.
Optimize data storage formats to improve query performance and minimize storage
costs.
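A small sketch of the write side, assuming hypothetical paths and an order_date column; columnar Parquet with Snappy compression and date partitioning is one common choice:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("storage-demo").getOrCreate()
    val df = spark.read.parquet("/data/curated/completed_orders")

    // Columnar format plus compression; partition columns enable pruning at read time.
    df.write
      .mode("overwrite")
      .option("compression", "snappy")
      .partitionBy("order_date")
      .parquet("s3a://analytics-bucket/curated/orders_by_date")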
Cluster Management:
Set up and manage Spark clusters to ensure efficient utilization of resources for
processing large-scale data.
Monitor cluster performance, diagnose and troubleshoot issues, and implement
optimizations to enhance cluster efficiency.
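Resource sizing is usually expressed through Spark configuration; the numbers below are illustrative only and depend on cluster capacity and workload:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("pipeline")
      .config("spark.executor.instances", "8")
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "8g")
      .config("spark.sql.shuffle.partitions", "256")  // match parallelism to data volume
      .getOrCreate()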
Performance Optimization:
Tune Spark jobs to improve query performance, reduce execution times, and optimize
resource allocation.
Implement partitioning and bucketing strategies to optimize data distribution and
parallel processing.
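A sketch of both techniques under assumed table and column names; note that bucketBy requires writing through saveAsTable:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("layout-demo").enableHiveSupport().getOrCreate()
    val orders = spark.read.parquet("/data/curated/completed_orders")

    // Repartition on the join key so downstream joins shuffle less.
    val byCustomer = orders.repartition(200, orders("customer_id"))

    // Partitioned, bucketed managed table for repeatable, co-located reads.
    byCustomer.write
      .mode("overwrite")
      .partitionBy("order_date")
      .bucketBy(32, "customer_id")
      .sortBy("customer_id")
      .saveAsTable("curated.orders_bucketed")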
Implement data validation checks and quality controls to ensure the accuracy and
integrity of data throughout the pipeline.
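For example, a minimal fail-fast validation helper; the rules on order_id are illustrative:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    def validate(df: DataFrame): Unit = {
      val total    = df.count()
      val nullKeys = df.filter(col("order_id").isNull).count()
      val dupKeys  = total - df.dropDuplicates("order_id").count()
      require(nullKeys == 0, s"$nullKeys rows with null order_id")
      require(dupKeys == 0, s"$dupKeys duplicate order_id values")
    }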
Collaborate with data analysts and data scientists to define data quality rules and
standards.
Maintain metadata and documentation for data pipelines, transformations, and data
sources.
Ensure that the data catalog is up-to-date and accessible for the wider data team.
Use version control systems (e.g., Git) to manage code changes and collaborate with
team members.
Deploy data pipelines and code changes to production environments in a controlled
and repeatable manner.
Implement security measures to protect sensitive data and ensure compliance with
data protection regulations.
Monitor and apply security updates to the Spark cluster and associated components.
Work closely with cross-functional teams, including data scientists, data analysts,
and business stakeholders, to understand requirements and deliver solutions.
Communicate effectively to provide updates on project status, challenges, and
solutions.
Continuous Learning:
Stay updated with the latest developments in the Spark ecosystem and data
engineering best practices.
Overall, a Spark Data Engineer plays a pivotal role in enabling organizations to
leverage the power of big data processing and analytics by building robust,
scalable, and performant data pipelines.
HIVE:
============
Design and implement Hive queries to transform raw data into structured formats for
analysis and reporting.
Write HiveQL (Hive Query Language) statements to perform data manipulation tasks
such as filtering, aggregating, and joining datasets.
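One hypothetical example of such a statement, combining a filter, a join, and an aggregation; it is shown through a Hive-enabled SparkSession, and the sales tables are assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hive-demo").enableHiveSupport().getOrCreate()

    val topRegions = spark.sql(
      """SELECT r.region_name, SUM(o.amount) AS revenue
        |FROM sales.orders o
        |JOIN sales.regions r ON o.region_id = r.region_id
        |WHERE o.order_date >= '2023-01-01'
        |GROUP BY r.region_name
        |ORDER BY revenue DESC
        |LIMIT 10""".stripMargin)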
Create and maintain data models and schemas using Hive's Data Definition Language
(DDL).
Design optimal table structures, partitions, and bucketing strategies to enhance
query performance.
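A DDL sketch along those lines (schema, partition column, and bucket count are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hive-ddl").enableHiveSupport().getOrCreate()

    spark.sql(
      """CREATE TABLE IF NOT EXISTS sales.orders (
        |  order_id    BIGINT,
        |  customer_id BIGINT,
        |  amount      DECIMAL(12,2)
        |)
        |PARTITIONED BY (order_date STRING)
        |CLUSTERED BY (customer_id) INTO 32 BUCKETS
        |STORED AS ORC""".stripMargin)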
Develop ETL processes using Hive to extract data from various sources, transform it
as needed, and load it into target data stores.
Integrate Hive with other data processing tools and frameworks as required.
Performance Optimization:
Tune Hive queries and optimize query execution plans to improve performance and
reduce query processing time.
Implement partitioning, bucketing, and indexing strategies to optimize data storage
and access.
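For instance, a partition-pruning predicate plus a look at the resulting plan (the table and date are the assumed ones from the DDL sketch above):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hive-tuning").enableHiveSupport().getOrCreate()

    val recent = spark.sql(
      """SELECT customer_id, SUM(amount) AS spend
        |FROM sales.orders
        |WHERE order_date = '2023-06-01'   -- prunes the scan to a single partition
        |GROUP BY customer_id""".stripMargin)
    recent.explain()   // inspect the physical plan for scans and shuffles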
Implement data quality checks and validation mechanisms within Hive queries to
ensure data accuracy and integrity.
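One possible shape for such a check, with an illustrative rule on the assumed amount column:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hive-dq").enableHiveSupport().getOrCreate()

    val badRows = spark.sql(
      """SELECT COUNT(*) AS bad
        |FROM sales.orders
        |WHERE amount IS NULL OR amount < 0""".stripMargin)
      .first().getLong(0)
    require(badRows == 0, s"$badRows rows failed the amount check")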
Collaborate with data governance teams to establish data quality standards and
rules.
Metadata and Documentation:
Maintain metadata and documentation for Hive tables, views, and other objects.
Ensure that the data catalog is up-to-date and accessible to the broader data team.
Use version control systems to manage Hive scripts and code changes.
Collaborate with data engineers, data scientists, and analysts to understand
requirements and deliver solutions.
Diagnose and resolve issues related to query failures, performance degradation, and
data inconsistencies.
Provide timely support for production incidents and outages.
Monitor Hive clusters and query performance using relevant tools and metrics.
Proactively identify areas for improvement and implement optimizations to enhance
the overall system performance.
Provide training and knowledge sharing sessions for less experienced team members
or users who interact with Hive data.
Hive Developers play a crucial role in enabling organizations to leverage the power
of distributed data processing and analytics by creating efficient and scalable
data pipelines for querying and analyzing large datasets.
=========================
Kafka
==================
Collaborate with architects and other team members to design Kafka-based solutions
that meet the business requirements.
Design data pipeline architectures, including topics, partitions, replication
factors, and data serialization formats.
Determine the appropriate Kafka components to use, such as producers, consumers,
brokers, and connectors.
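A minimal sketch of topic provisioning with the Kafka AdminClient; the broker address, topic name, partition count, and replication factor are all illustrative:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")

    val admin = AdminClient.create(props)
    val topic = new NewTopic("orders-events", 12, 3.toShort)  // 12 partitions, replication factor 3
    admin.createTopics(Collections.singletonList(topic)).all().get()
    admin.close()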
Development:
Implement Kafka producers and consumers, following best practices to ensure high
throughput and low latency.
Develop custom Kafka connectors or use existing connectors to integrate Kafka with
various data sources and sinks.
Write code to handle message serialization and deserialization using formats like
Avro, JSON, or others.
Implement error handling, retries, and fault tolerance mechanisms to ensure data
reliability.
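A compact producer/consumer sketch; JSON-as-string serialization keeps it short (Avro with a schema registry is the richer alternative), and the broker, topic, and group names are placeholders:

    import java.time.Duration
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
    import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

    val producerProps = new Properties()
    producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
    producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    producerProps.put(ProducerConfig.ACKS_CONFIG, "all")   // durability over latency
    producerProps.put(ProducerConfig.RETRIES_CONFIG, "5")  // retry transient failures

    val producer = new KafkaProducer[String, String](producerProps)
    producer.send(
      new ProducerRecord("orders-events", "order-42", """{"order_id":42,"status":"COMPLETED"}"""),
      (meta, err) => if (err != null) err.printStackTrace() // error-handling callback
    )
    producer.close()

    val consumerProps = new Properties()
    consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
    consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor")
    consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](consumerProps)
    consumer.subscribe(Collections.singletonList("orders-events"))
    consumer.poll(Duration.ofSeconds(1)).forEach(r => println(s"${r.key} -> ${r.value}"))
    consumer.close()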
Configuration and Performance Tuning:
Configure Kafka broker settings, such as heap size, retention policies, and
partition configurations, to optimize performance.
Fine-tune producer and consumer configurations to balance factors like message
delivery guarantees, throughput, and latency.
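An illustrative set of throughput-leaning overrides; the exact values are workload-dependent guesses, not recommendations:

    import java.util.Properties
    import org.apache.kafka.clients.consumer.ConsumerConfig
    import org.apache.kafka.clients.producer.ProducerConfig

    val tuned = new Properties()
    // Producer: trade a little latency for bigger, compressed batches.
    tuned.put(ProducerConfig.LINGER_MS_CONFIG, "20")
    tuned.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536")     // 64 KB batches
    tuned.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4")
    tuned.put(ProducerConfig.ACKS_CONFIG, "1")               // weaker delivery guarantee
    // Consumer: larger fetches, bounded work per poll.
    tuned.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1048576")
    tuned.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500")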
Monitoring and Troubleshooting:
Set up monitoring and alerting for Kafka clusters using tools like Prometheus,
Grafana, or Kafka-specific monitoring solutions.
Monitor key Kafka metrics, such as throughput, latency, broker health, and consumer
lag.
Diagnose and resolve performance issues, bottlenecks, and failures within the Kafka
ecosystem.
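A rough consumer-lag check with the AdminClient (the group and broker names are assumptions): committed group offsets are compared against the latest offset of each partition:

    import java.util.Properties
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, OffsetSpec}
    import scala.collection.JavaConverters._

    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
    val admin = AdminClient.create(props)

    val committed = admin.listConsumerGroupOffsets("orders-processor")
      .partitionsToOffsetAndMetadata().get().asScala

    val latest = admin.listOffsets(
      committed.keys.map(tp => tp -> OffsetSpec.latest()).toMap.asJava).all().get().asScala

    committed.foreach { case (tp, meta) =>
      println(s"$tp lag=${latest(tp).offset() - meta.offset()}")
    }
    admin.close()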
Maintenance and Upgrades:
Stay up to date with the latest Kafka releases and updates, evaluating the impact
on existing systems and planning upgrades.
Perform regular maintenance tasks such as Kafka cluster rebalancing, topic
management, and data retention policies.
Performance Testing and Benchmarking:
Conduct load tests and benchmarks on Kafka clusters to verify that throughput and
latency targets hold under expected workloads.