Beginning Linux Command Line for Data Engineers and Analysts

Effective Data Pipelines

Douglas Eadline

© Copyright 2020, Basement Supercomputing, All rights Reserved.


Presenter

Douglas Eadline
[email protected]
@thedeadline

• HPC/Hadoop Consultant/Writer
• http://www.basement-supercomputing.com


Outline

• Segment 1: Introduction and Course Goals (15 mins)
• Segment 2: The Linux Hadoop Minimal VM (20 mins)
• Segment 3: Basic Linux Commands (50 mins)
• Break (10 mins)
• Segment 4: Editing/Viewing Text Files (25 mins)
• Segment 5: Moving Data to/from Local File System (20 mins)
• Segment 6: Course Wrap-Up (10 mins)


Courses In Scalable Data Pipeline

1. Apache Hadoop, Spark, and Kafka Foundations (3 hours-1 day)
2. Beginning Linux Command Line for Data Engineers and Analysts (3 hours-1 day)
3. Intermediate Linux Command Line for Data Engineers and Analysts (3 hours-1 day)
4. Hands-on Introduction to Apache Hadoop, Spark, and Kafka Programming (6 hours-2 days)
5. Data Engineering at Scale with Apache Hadoop and Spark (3 hours-1 day)
6. Scalable Analytics with Apache Hadoop, Spark, and Kafka (3 hours-1 day)


Course Webpage

https://www.clustermonkey.net/scalable-analytics


Segment 1

Introduction and Course Goals


Course Goals

1. Provide enough background (and worked examples) so students can begin to use the Linux command line right away
2. Continue learning and practicing with the LHM-VM


Recommended Approach To Class
• Course covers a lot of material!
• Designed to get you started (“hello.c” approach)
• Sit back and watch the examples
• All examples are provided in a NOTES.txt file for each segment
• I will refer to these files throughout the class (cut and paste)
• The notes files are available for download along with some help on installing software


NOTES.txt Files


Why The Command Line?
It is 2019. Why do we still need the Linux/Unix command line?
Surprise:
– flexibility
– speed
– control
– convenience


But, I Use Windows/Mac

Can you spot the trend?

• Windows has PowerShell (similar to a Unix shell)
– https://en.wikipedia.org/wiki/PowerShell
• macOS is based on BSD Unix
• Linux is very similar to Unix (almost identical in many cases)


What is ssh ?
• Linux and Unix are multi-user operating systems
• This design allows multiple users to log in at the same time
• Users can be local or come from the LAN/Internet
• How do we keep remote sessions secure?
• ssh (secure shell) provides encrypted remote sessions (its unencrypted predecessor is telnet)
• Systems (servers) often provide ssh login capability on default port 22 (can be reassigned if needed)


ssh Command
• Remote users need a TEXT TERMINAL to run the “ssh client” on their computer
• macOS and Linux: ssh client already installed
• Windows (free solutions)
– Good: ssh can be installed in PowerShell
– Better: PuTTY: http://www.putty.org
– Best: MobaXterm: http://mobaxterm.mobatek.net (allows remote X Windows graphical applications to run on the local Windows machine)
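
The ssh client takes a user name, a host, and (optionally) a port. The lines below are only a sketch of how such a command is put together: the user `student`, the host `server.example.com`, and the port value are hypothetical placeholders, not details from the course. The command is printed rather than executed so it can be checked first.

```shell
# All three values are hypothetical placeholders (assumptions for illustration).
remote_user="student"
remote_host="server.example.com"
remote_port=22   # default ssh port; change it if the server uses another one

# Assemble the ssh command as a string so it can be inspected before use.
# -p selects the port on the remote server.
ssh_command="ssh -p $remote_port $remote_user@$remote_host"
echo "$ssh_command"
```

Typing the printed command into a terminal would start the encrypted session and prompt for a password or use a key, depending on how the server is set up.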


ssh Out to LAN/Internet


Segment 2

The Linux Hadoop Minimal VM


LHM: Your Personal Linux Server
• Use a virtual machine (VM) to create a real Linux environment
• Runs on a laptop or desktop
• The Linux Hadoop Minimal VM is designed to be a small, single-server Hadoop/Spark system (used in other courses)
• Requires a notebook or laptop with at least 2 cores, 4 GB of memory, and 70 GB of disk space
• Linux CentOS 6 (Red Hat rebuild) image
• Walks and talks like a separate computer


Connect to the LHM


Segment 3

Basic Linux Commands


Command Overview

• What is a *nix shell?
• Basic Linux commands
• Basic shell commands
• Input/Output and pipes
• File permissions
• Process management
• Commands to access system information
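
A few of the topics above can be tried in one short session. This is a minimal sketch, not the course's NOTES.txt material: the file name `tools.txt` and its contents are made up for illustration, and everything runs in a temporary directory.

```shell
# Work in a throwaway directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Output redirection: write a small (made-up) data file.
printf 'spark\nhadoop\nkafka\nspark\nhadoop\nspark\n' > tools.txt

# A pipe: sort the lines, count duplicates, sort by count (largest first),
# keep the first line, and print only the name column.
top_tool=$(sort tools.txt | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "most frequent: $top_tool"

# File permissions: make the file read-only for everyone, then show the mode.
chmod 444 tools.txt
perms=$(ls -l tools.txt | cut -c1-10)
echo "permissions: $perms"

# Process management: start a background job, then stop it by its PID.
sleep 30 &
bg_pid=$!
kill "$bg_pid"
wait "$bg_pid" 2>/dev/null || true

# System information: print the kernel name.
uname -s
```

Each stage of the pipe reads the previous stage's output, which is why small single-purpose tools can be chained without temporary files.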


Questions?


Segment 4

Editing/Viewing Text Files


File Editing Overview

• Only the basics, enough to get started
• Use the vi (visual editor) available in all Unix/Linux systems
• Basic modes and navigation
• Insert/delete, copy/paste
• Search/replace


Questions?


Segment 5

Moving Data to/from Local File System


Moving Data Overview

• Compressing and archiving using tar and zip
• Secure copy (scp)
• Web get (wget)
• Data integrity
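
Archiving and data integrity fit together naturally: pack the data, record a checksum, move the archive, then verify it. The sketch below assumes a Linux system with GNU `tar` and `sha256sum`; the directory `data` and the file `sample.csv` are made-up examples, and the actual transfer step is left out so the example runs locally.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a small (made-up) data set.
mkdir data
printf 'id,value\n1,10\n2,20\n' > data/sample.csv

# Archive and compress in one step (c = create, z = gzip, f = archive file).
tar czf data.tar.gz data

# Record a checksum before the archive is moved anywhere.
sha256sum data.tar.gz > data.tar.gz.sha256

# After the transfer, verify the archive against the recorded checksum.
if sha256sum -c data.tar.gz.sha256 >/dev/null 2>&1; then
    verify_status=OK
else
    verify_status=FAILED
fi
echo "checksum: $verify_status"

# Unpack into a separate directory and compare with the original file.
mkdir restored
tar xzf data.tar.gz -C restored
if cmp -s data/sample.csv restored/data/sample.csv; then
    roundtrip=match
else
    roundtrip=differ
fi
echo "round trip: $roundtrip"
```

In practice the archive and its `.sha256` file would be copied between the two checksum steps, for example with scp to a remote host or fetched with wget from a web server, and the verification would run on the receiving side.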


Segment 6

Course Wrap-Up


Course Takeaways

• Linux command line basics are not that hard
• 80% of what you need to do can be done with 20% of the command line tool capability
• Unix/Linux command line users are basically lazy: the tools are designed to make things easy!
• The command line provides maximum control and flexibility with minimum resources (a laptop and an ssh client)
• Moving data between cloud, web, and local systems is quick and easy


Next: Intermediate Linux Command Line


Intermediate Command Line Coverage

• Linux Analytics
• Moving Data into Hadoop HDFS
• Running Command Line Analytics Tools
• Bash Scripting Basics
• Creating Bash Scripts


Questions?
Thank you
