Eguide of Cloud Data Engineering
Data engineering requires a wide range of skills, including technical, analytical, and
problem-solving capabilities.
Technical Skills
Data engineers must have a strong knowledge of programming
languages, databases, and other technical tools. Examples of
commonly used programming languages for data engineering include
Python, Java, and SQL. Data engineers also need to be well-versed in
data modeling and data warehousing principles.
Analytical Skills
Data engineers must be able to identify patterns and trends within large
data sets. They must be able to analyze the data to identify insights and
develop solutions to complex problems.
Problem-solving Skills
Data engineers must be able to think logically and creatively in order to
solve complex problems. They should have a knack for troubleshooting
and a deep understanding of data-related issues.
Organizational Skills
Data engineers must be organized and able to keep
track of large amounts of data and multiple
projects. They must be able to prioritize tasks and
stay on top of deadlines.
Communication Skills
Data engineers must be able to effectively communicate their findings
to other team members and stakeholders. They must be able to explain
technical concepts in simple terms and provide clear and actionable
recommendations.
Curiosity
Curiosity is a key attribute for any data engineer, as it allows them to explore data
sets and uncover hidden insights. Curiosity enables data engineers to ask the right
questions and think outside of the box when it comes to data analysis. It also allows
them to think critically and develop creative solutions to data-related problems. By
being curious, data engineers can better understand the data they are working with
and come up with innovative solutions that can help their organization achieve its
goals.
Technical Skills:
● Python
● SQL
● Linux & Shell
Python
Python is a popular programming language that is widely used for web
development, software engineering, data analysis, and many other
applications. It is a high-level, interpreted, object-oriented
language that is easy to learn and use.
Python is known for its simple syntax and readability, making it easy for
developers to write clean and maintainable code. It also features
powerful libraries for scientific computing, data science, and machine
learning.
Getting Started with Python
The official website for Python provides a comprehensive guide to
getting started with the language. This guide includes instructions on
how to install Python, the basics of the language, and how to use the
Python interpreter. It also provides links to other resources and
tutorials to help you get up and running quickly.
https://www.python.org/about/gettingstarted/
Python Installation
Python 3 Installation for Windows
● Download the latest version of Python 3 from the Python official website:
https://www.python.org/downloads/
● Run the installer. Make sure to check the box that says "Add Python 3.X to
PATH"
● Once the installation is complete, open the command prompt and type
“python --version” to confirm you have the correct version installed.
Python 3 Installation for macOS
● Download the latest version of Python 3 from the Python official website:
https://www.python.org/downloads/
● Double-click the downloaded .pkg file and follow the instructions to install.
● Once the installation is complete, open the terminal and type “python3
--version” to confirm you have the correct version installed. (On macOS the
installer provides the “python3” command; plain “python” may not be available.)
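The version check can also be done from inside Python itself, which is handy in notebooks where no terminal is at hand. The following is a minimal sketch; the minimum version `(3, 8)` and the helper name `is_supported` are assumptions chosen for illustration, not requirements stated in this guide.

```python
import sys

# Assumed minimum version for illustration; adjust to your project's needs.
MIN_VERSION = (3, 8)


def is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the given interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Print the version of the interpreter that is running this code.
print("Running Python", sys.version.split()[0])
print("Meets assumed minimum:", is_supported())
```

If the second line prints False, the interpreter on your PATH is older than the assumed minimum, which may mean a different installation is being picked up.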
Notebooks / Platforms
Google Colab
Google Colab is a free cloud service provided by Google that allows users to create and share
notebooks that contain live code, equations, visualizations, and other rich content. Because the code
runs on Google's servers, you can explore data, build machine learning models, and prototype
analyses without installing anything locally.
To get started with Google Colab, visit the official Google Colab documentation. This page includes
instructions on how to access the service and what features are available. Colab only requires a
Google account; once you are signed in, you can start creating notebooks.
Google Colab notebooks are written primarily in Python, and R is also available as a runtime. You
can also add equations, visualizations, and other rich content to your notebooks.
Google Colab also provides several tutorials and sample notebooks to help you learn how to use the
service. You can also find community resources for working with Google Colab, where users ask
questions and share tips.
Jupyter Notebook
Jupyter Notebook is an open-source web application that allows you to
create and share documents that contain live code, equations,
visualizations and narrative text. It is an interactive computational
environment, in which you can combine code execution, rich text,
mathematics, plots and rich media.
Getting Started with Jupyter Notebook
1. Install the Jupyter Notebook: You will need Python installed on your system. With Python
in place, Jupyter Notebook can be installed by typing the following command in the command line:
pip install notebook
2. Start the Notebook Server: After the installation of Jupyter Notebook, you can start the
Notebook server by typing the following command in the command line:
jupyter notebook
This will start the Jupyter Notebook server and open the dashboard in your default browser.
3. Create a New Notebook: To create a new notebook, click on the "New" button at the top
right corner of the dashboard and select the kernel of your choice (for example, Python 3).
4. Write Code: Once a new notebook is created, you can start writing code in the cells. The
code can be executed by pressing the "Run" button or by pressing "Shift + Enter".
5. Save and Share: You can save the notebook by clicking on the "Save" button or by pressing
"Ctrl + S". The notebook is stored as an .ipynb file, which you can share with others directly
or export to another format from the "File" menu.
These are the basic steps to get started with Jupyter Notebook. For more information, you
can refer to the official documentation.
https://jupyter-notebook.readthedocs.io/en/stable/
Some of the most used Python keywords
● True: The boolean value true.
● False: The boolean value false.
● None: Represents the absence of a value (a null value).
● and: Combines two conditions; the result is true only if both are true.
● as: Creates an alias, for example for an imported module or the object returned by a context manager.
● break: Breaks out of the enclosing loop.
● class: Defines a class.
● continue: Skips the rest of the current iteration of a loop.
● def: Defines a function.
● del: Deletes a name, or an item from a collection such as a list or dictionary.
● elif: Adds an else-if branch to a conditional statement.
● else: Adds the fallback branch of a conditional statement.
● except: Handles an exception raised inside a try block.
● finally: Runs code after a try block whether or not an exception occurred.
● for: Creates a loop over the items of an iterable.
● from: Imports specific names from a module or package.
● global: Declares that a name refers to a global variable.
● if: Creates a conditional statement.
● import: Imports a module or package.
● in: Checks whether an item is in a collection such as a list, tuple, or dictionary; also used in for loops.
● is: Compares the identity of two objects.
● lambda: Creates an anonymous function.
● nonlocal: Declares that a name refers to a variable in an enclosing function.
● not: Negates a condition.
● or: Combines two conditions; the result is true if at least one is true.
● pass: A null statement that does nothing.
● raise: Raises an exception.
● return: Returns a value from a function.
● try: Starts a block whose exceptions can be handled with except and finally.
● while: Creates a loop that repeats while a condition is true.
● with: Runs a block of code inside a context manager.
● yield: Produces a value from a generator function.
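Several of the keywords above tend to appear together in everyday data-cleaning code. The following is a small illustrative sketch; the function names `classify_readings` and `running_total`, the sample values, and the assumed valid range of 0 to 100 are all invented for this example.

```python
def classify_readings(readings):
    """Split raw readings into valid numbers and rejected values."""
    valid, rejected = [], []
    for value in readings:
        if value is None:
            continue  # skip missing values entirely
        try:
            number = float(value)
        except (TypeError, ValueError):
            rejected.append(value)  # could not be converted to a number
        else:
            # Assumed valid range for this example: 0 to 100 inclusive.
            if number < 0 or number > 100:
                rejected.append(value)
            else:
                valid.append(number)
    return valid, rejected


def running_total(numbers):
    """Generator that yields the cumulative sum after each number."""
    total = 0
    for n in numbers:
        total += n
        yield total


valid, rejected = classify_readings([10, "20.5", None, "n/a", 150, -3])
totals = list(running_total(valid))
print(valid)     # [10.0, 20.5]
print(rejected)  # ['n/a', 150, -3]
print(totals)    # [10.0, 30.5]
```

The example uses def, for, in, if, is, None, continue, try, except, else, or, return, and yield in one place.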
Some of the most used Python libraries in data engineering
● Pandas: Pandas is an open source library providing high-performance,
easy-to-use data structures and data analysis tools for the Python
programming language. It is widely used for data wrangling and preparation
(also called data munging), for aggregation and analysis, and for creating
basic data visualizations.
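As a rough illustration of the wrangling and aggregation described above, the sketch below cleans a tiny table and totals it per customer. It assumes pandas is installed (for example with "pip install pandas"); the column names and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
orders = pd.DataFrame(
    {
        "customer": ["ana", "ben", "ana", "cal", "ben"],
        "amount": [120.0, 80.0, None, 45.5, 20.0],
    }
)

# Data wrangling: drop rows where the amount is missing.
clean = orders.dropna(subset=["amount"])

# Aggregation: total amount per customer, largest total first.
totals = (
    clean.groupby("customer", as_index=False)["amount"]
    .sum()
    .sort_values("amount", ascending=False)
    .reset_index(drop=True)
)

print(totals)
```

Dropping incomplete rows is only one possible cleaning choice; depending on the data, filling missing values with `fillna` may be more appropriate.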
SQL
SQL stands for Structured Query Language. It is the language used to interact with
relational databases, and it is the most widely used language for data manipulation,
data definition, and data control. SQL is used to perform operations such as inserting
data, deleting data, updating data, and retrieving data from a database.
To run SQL against a database management system (DBMS), you can use MySQL Workbench
from Oracle. Installation instructions for Windows:
https://dev.mysql.com/doc/workbench/en/wb-installing-windows.html
Some common commands and keywords used in SQL are:
● SELECT - retrieve data from a database by selecting the columns you wish to retrieve from a table
● UPDATE - update data in a database
● DELETE - delete data from a database
● INSERT INTO - insert new data into a database
● CREATE TABLE - create a new table
● ALTER TABLE - modify a table’s structure
● DROP TABLE - delete a table
● CREATE DATABASE - create a new database
● ALTER DATABASE - modify a database’s properties
● DROP DATABASE - delete a database
● TRUNCATE TABLE - remove all records from a table
● USE - select a database for use
● SHOW - list objects such as databases or tables (for example, SHOW TABLES lists the tables in a database)
● FROM - specify which table the data is being selected from
● WHERE - specify a condition or set of conditions that must be met in order for the data to be retrieved
● ORDER BY - sort the retrieved data in either ascending or descending order
● GROUP BY - group the retrieved data by one or more columns
● INNER JOIN - combine two or more tables by matching values in specified columns, returning only the rows that match
● LEFT JOIN - combine two or more tables by matching values in specified columns and return all rows from the left table, even if there are no matches in the right table
● RIGHT JOIN - combine two or more tables by matching values in specified columns and return all rows from the right table, even if there are no matches in the left table
● LIKE - search for a specific pattern within a column
● AND - combine conditions so that all of them must be met in order for the data to be retrieved
● OR - combine conditions so that at least one of them must be met in order for the data to be retrieved
● LIMIT - limit the number of rows returned by a query
● COUNT - aggregate function that counts the number of rows returned by a query
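To see several of these commands working together, the sketch below runs them from Python. Note the substitution: it uses SQLite through Python's built-in sqlite3 module rather than MySQL, purely so that the example is self-contained and needs no server. The table names, column names, and rows are hypothetical, and some statements in the list above (such as USE and SHOW) are MySQL-specific and do not exist in SQLite.

```python
import sqlite3

# An in-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: define two small, hypothetical tables.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)

# INSERT INTO: add sample rows using parameter placeholders.
cur.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ana"), (2, "Ben"), (3, "Cal")],
)
cur.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 120.0), (1, 30.0), (2, 80.0)],
)

# UPDATE with WHERE and AND: change one specific order.
cur.execute("UPDATE orders SET amount = 35.0 WHERE customer_id = 1 AND amount = 30.0")

# SELECT with LEFT JOIN, GROUP BY, COUNT, ORDER BY, and LIMIT.
cur.execute(
    """
    SELECT c.name,
           COUNT(o.id) AS order_count,
           COALESCE(SUM(o.amount), 0) AS total_amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total_amount DESC
    LIMIT 10
    """
)
rows = cur.fetchall()
for row in rows:
    print(row)

conn.close()
```

Because the query uses LEFT JOIN, the customer with no orders still appears in the result, with an order count of 0. An INNER JOIN would leave that customer out.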
Linux and shell scripting
Shell scripts can be used to automate system administration tasks, such as creating
user accounts, installing software, and managing services. They can also be used to
create custom scripts for specific tasks, such as backing up data or creating reports.
Shell scripts are an essential tool for power users of Linux, and can make
repetitive tasks easier and faster.
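The backup task mentioned above can be sketched as follows. To stay consistent with the other examples in this guide, the sketch is written in Python rather than as a shell script; a shell version would typically rely on a tool such as tar. The function name `backup_directory`, the folder names, and the sample file are hypothetical.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_directory(source_dir, backup_dir):
    """Create a timestamped .tar.gz archive of source_dir inside backup_dir."""
    source = Path(source_dir)
    target = Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    base_name = target / f"{source.name}-{stamp}"
    # make_archive appends the .tar.gz extension and returns the archive path.
    archive = shutil.make_archive(str(base_name), "gztar", root_dir=str(source))
    return Path(archive)


# Demonstration in a temporary workspace so no real data is touched.
with tempfile.TemporaryDirectory() as workspace:
    data_dir = Path(workspace) / "data"
    data_dir.mkdir()
    (data_dir / "report.csv").write_text("id,value\n1,42\n")

    archive_path = backup_directory(data_dir, Path(workspace) / "backups")
    archive_name = archive_path.name
    archive_created = archive_path.exists()
    print("Created", archive_name)
```

A script like this can then be scheduled, for example with cron on Linux, so that the backup runs without manual intervention.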