Module 1 Session 2 Part 2 Linux

The document provides an overview of AWK, a scripting language used for text processing and data manipulation, particularly in bioinformatics. It details AWK's syntax, features, and commonly used commands, along with practical examples of how to extract and manipulate data from files. The document also explains how to perform arithmetic operations and calculate statistics using AWK.

Uploaded by

jackson.sembera

Genomics Sequencing

Bioinformatics Africa Course 2023

Introduction to Linux
Session 2 – Part 2 – AWK

Current Attribution: https://github.com/WCSCourses/GSBAfrica2023
Original Attribution: https://github.com/WTAC-NGS
AWK
● Scripting language with text processing capabilities for data extraction,
comparison, transformation
● Like sed, AWK is available on most Unix operating systems
● Named after the initials of its inventors, Alfred Aho, Peter
Weinberger, and Brian Kernighan, who created it in the 1970s
● Used to extract fields, make comparisons, filter data,
and do general data wrangling

Some features of AWK
● AWK works well with delimited data
● Like sed, it reads files line by line
● Unlike sed, it splits each line into fields, which lets you work with columns
● Many data formats in bioinformatics are delimited, with a tab (\t) being
a common field separator
● AWK has built-in functions for manipulating these fields, so you can
work with individual columns of a dataset

● E.g., awk -F "\t" '{ print $1 }' genes.gff

chr1
chr1
chr1
chr2
chr2
chr3
chr4
chr10
chr10
chrX
• The -F flag sets the field delimiter, in this case a tab
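
The field splitting described above can be tried without the course file. This is a minimal sketch: the three tab-separated records are a hypothetical excerpt standing in for genes.gff, and the variable name first_col is only for illustration.

```shell
# Hypothetical three-record, tab-separated excerpt standing in for genes.gff.
first_col=$(
  printf 'chr1\tsource1\tgene\t100\t300\nchr2\tsource1\tgene\t50\t900\nchrX\tsource1\tgene\t123\t456\n' |
    awk -F '\t' '{ print $1 }'   # -F '\t' splits each line on tabs; $1 is the first field
)
echo "$first_col"
```

Each input line yields one output line holding only its first column.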

Commonly used AWK statements, functions, and built-in variables:

• print: Outputs text or variables to the screen or a file.
• printf: Provides formatted output, similar to C's printf function.
• if/else: Implements conditional statements.
• for: Sets up a loop.
• split: Divides a string into an array based on a delimiter.
• length: Calculates the length of a string or the number of elements in an array.
• gsub: Global substitution for a specific pattern in a string.
• NR: Built-in variable holding the number of the current record (by default, the current line number).
• NF: Built-in variable holding the number of fields in the current line.
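
Several of the items above can be combined in one short program. The sketch below is illustrative only: it uses record 2 of the sample genes.gff as it appears in these slides, and the names attrs, strand, and summary are made up for the example.

```shell
# One tab-separated record (record 2 of the sample genes.gff shown in these slides).
summary=$(
  printf 'chr1\tsource2\tgene\t1000\t1100\t0.9\t-\t0\tname=recA;product=RecA protein\n' |
    awk -F '\t' '{
      n = split($9, attrs, ";")        # split the annotation field into an array
      gsub(/name=/, "", attrs[1])      # remove the "name=" prefix from the first element
      if ($7 == "+") {
        strand = "forward"
      } else {
        strand = "reverse"
      }
      # NR is the record number, NF the number of fields in this record
      printf "record %d has %d fields; %s (%d chars) is on the %s strand\n", NR, NF, attrs[1], length(attrs[1]), strand
      for (i = 1; i <= n; i++) {
        print i ": " attrs[i]          # list every element produced by split
      }
    }'
)
echo "$summary"
```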

Practical examples:

● awk '{print $1, $3}' filename (print specific fields of a file)
● awk '{sum += $2} END {print "Total:", sum}' filename (calculate the total of the second column)
● awk '/pattern/ {print}' filename (print lines that match a pattern)
● awk -F',' '{print $2}' filename.csv (use a delimiter other than whitespace)
● awk -f script.awk filename (run a program saved in a script file)
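
As a rough illustration of the last example, the column total from the second example can be saved in a script file and run with -f. The file names data.txt and script.awk and their contents are assumptions made for this sketch.

```shell
# Work in a temporary directory so nothing in the current directory is touched.
tmp=$(mktemp -d)

# Hypothetical input: a label and a number per line, separated by a space.
printf 'a 1\nb 2\nc 3\n' > "$tmp/data.txt"

# Save the awk program in a file instead of typing it on the command line.
cat > "$tmp/script.awk" <<'EOF'
{ sum += $2 }                 # add the second column of every line
END { print "Total:", sum }   # runs once, after the last line
EOF

total=$(awk -f "$tmp/script.awk" "$tmp/data.txt")
echo "$total"

rm -r "$tmp"
```

Keeping the program in a file avoids shell quoting problems once the program grows beyond one line.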

Basic AWK syntax
● Like sed, awk prints its output to the screen; to save the output,
redirect it to a file
● Unlike sed, awk has built-in variables called $1, $2, $3, ... that
map to the fields of the current line; fields are split on whitespace by
default, or on the delimiter given with -F (e.g. \t)
● It is usually useful to first determine how many fields a file has
● E.g. awk '{print NF}' genes.gff
● Number of Fields (NF) is a built-in awk variable that is set each
time awk reads in a line

Basic AWK syntax
● E.g., awk '{print NF}' genes.gff
9
10
9
8
10
9
9
9
9
9
• Strange: 2 records (lines 2 and 5) have 10 fields, one (line 4) has 8 fields, and the rest have 9. Any thoughts as
to why?

● The annotation column for record 4 is empty

● The file is tab separated, but the previous command did not tell awk which delimiter to split on, so it split on
any whitespace, including the spaces inside a field: awk '{print NF}' genes.gff

● E.g. awk -F "\t" '{print NF}' genes.gff

9
9
9
8
9
9
9
9
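
The difference between the two field counts can be reproduced with just two records. This sketch uses records 2 and 4 of genes.gff as reconstructed from the outputs in these slides, which is an assumption about the real course file.

```shell
# Records 2 and 4 of the sample file (tab-separated). Record 2 has a space inside
# its last field ("RecA protein"); record 4 has no annotation column at all.
records=$(printf 'chr1\tsource2\tgene\t1000\t1100\t0.9\t-\t0\tname=recA;product=RecA protein\nchr2\tsource2\tgene\t10000\t1200\t0.95\t+\t0\n')

# Default splitting: any run of spaces or tabs separates fields.
default_nf=$(printf '%s\n' "$records" | awk '{ print NF }')

# Tab splitting: only a tab separates fields, so "RecA protein" stays one field.
tab_nf=$(printf '%s\n' "$records" | awk -F '\t' '{ print NF }')

echo "default: $default_nf"
echo "tab:     $tab_nf"
```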

AWK usage
● awk [options] 'optional_selection_criteria { action }' input_file
● Let’s use the optional_selection_criteria to do some filtering on the genes.gff file
● awk -F "\t" '$1 > "chr1" {print $0}' genes.gff
chr2 source2 gene 10000 1200 0.95 + 0
chr2 source1 gene 50 900 0.4 - 0 name=gene2;product=gene2 protein
chr3 source1 gene 200 210 0.8 . 0 name=gene3
chr4 source3 repeat 300 400 1 + . name=ALU
chr10 source2 repeat 60 70 0.78 + . name=LINE1
chr10 source2 repeat 150 166 0.84 + . name=LINE2
chrX source1 gene 123 456 0.6 + 0 name=gene4;product=unknown

● awk recognizes mathematical operators such as the greater than sign

● The construct above does two things:

● The optional_selection_criteria takes field 1 of the line being read in and checks whether it is greater than chr1
● As awk reads in the file line by line, it prints the line ($0) only when the condition is met

● Useful for extracting lines from a file based on a field, e.g. all entries for chromosome 2 only

● awk -F "\t" '$1 == "chr2" {print $0}' genes.gff
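
A small sketch of both kinds of filter. One point worth checking when you use > on chromosome names: comparing a field against a quoted string is a string comparison, done character by character, so chr10 sorts before chr2. The input below is a hypothetical single-column list of chromosome names, and LC_ALL=C is set only to make the ordering predictable.

```shell
# Only column 1 matters for this illustration, so the other columns are omitted.
chroms=$(printf 'chr1\nchr2\nchr3\nchr10\nchrX\n')

# Exact match: keeps only the chr2 line.
exact=$(printf '%s\n' "$chroms" | awk '$1 == "chr2" { print $0 }')

# String comparison: "chr10" is NOT greater than "chr2" because "1" sorts before "2".
after_chr2=$(printf '%s\n' "$chroms" | LC_ALL=C awk '$1 > "chr2" { print $0 }')

echo "exact match:      $exact"
echo "greater than chr2:" $after_chr2
```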

● sed -n '1,6p' genes.gff

chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown

chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein

chr1 source5 repeat 10000 14000 1 + . name=ALU

chr2 source2 gene 10000 1200 0.95 + 0

chr2 source1 gene 50 900 0.4 - 0 name=gene2;product=gene2 protein

chr3 source1 gene 200 210 0.8 . 0 name=gene3

● Looks like records 2 and 5 have a space in the annotation field between the product name and protein, e.g. RecA protein

● The annotation column for record 4 is empty

● E.g., awk -F "\t" '{print $1,$3,$7}' genes.gff

chr1 gene +
chr1 gene -
chr1 repeat +
chr2 gene +
chr2 gene -
chr3 gene .
chr4 repeat +
chr10 repeat +
chr10 repeat +
chrX gene +

● Printed out the columns I wanted, and I can send the output to a new file
● Problem: the output does not seem to be tab delimited?

● To get tab-delimited output, change awk’s default behaviour by setting the Output
Field Separator (OFS)

● E.g., awk -F "\t" 'BEGIN {OFS="\t"} {print $1,$3,$7}' genes.gff
chr1 gene +
chr1 gene -
chr1 repeat +
chr2 gene +
chr2 gene -
chr3 gene .
chr4 repeat +
chr10 repeat +
chr10 repeat +
chrX gene +
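
Because tabs and spaces look alike on screen, the effect of OFS is easier to see by comparing the outputs directly. A minimal sketch using record 1 of the sample file; the helper function rec and the variable names are made up for the example, and -v OFS='\t' is shown as an equivalent alternative to the BEGIN block.

```shell
# Helper that prints record 1 of the sample genes.gff (tab-separated).
rec() { printf 'chr1\tsource1\tgene\t100\t300\t0.5\t+\t0\tname=gene1;product=unknown\n'; }

# Default OFS is a single space, so the commas in print produce spaces.
default_out=$(rec | awk -F '\t' '{ print $1, $3, $7 }')

# Setting OFS in a BEGIN block makes the commas produce tabs instead.
tab_out=$(rec | awk -F '\t' 'BEGIN { OFS = "\t" } { print $1, $3, $7 }')

# The same thing using a command-line variable assignment.
tab_out_v=$(rec | awk -F '\t' -v OFS='\t' '{ print $1, $3, $7 }')

printf '%s\n' "$default_out" "$tab_out" "$tab_out_v"
```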

● BEGIN is a special awk pattern: its action (the first set of {}) is executed once, before any input is
read

● In this case, it sets the Output Field Separator variable to \t

● awk can also be used to replace every value in a specified field

● E.g., awk -F"\t" 'BEGIN {OFS="\t"} {$2="H_sapiens"; print $0}' genes.gff

chr1 H_sapiens gene 100 300 0.5 + 0 name=gene1;product=unknown
chr1 H_sapiens gene 1000 1100 0.9 - 0 name=recA;product=RecA protein
chr1 H_sapiens repeat 10000 14000 1 + . name=ALU
chr2 H_sapiens gene 10000 1200 0.95 + 0
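
Setting OFS matters when a field is replaced: assigning to a field makes awk rebuild the whole line ($0) using OFS as the separator. The sketch below shows this on a shortened, hypothetical five-column record.

```shell
# Hypothetical shortened record (first five columns only), tab-separated.
rec() { printf 'chr1\tsource1\tgene\t100\t300\n'; }

# Without OFS, the rebuilt line is joined with single spaces and the tabs are lost.
no_ofs=$(rec | awk -F '\t' '{ $2 = "H_sapiens"; print $0 }')

# With OFS set to a tab, the rebuilt line stays tab-separated.
with_ofs=$(rec | awk -F '\t' 'BEGIN { OFS = "\t" } { $2 = "H_sapiens"; print $0 }')

printf '%s\n' "$no_ofs" "$with_ofs"
```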

● E.g. awk -F"\t" '$1=="chr1" && $3=="gene"' genes.gff

chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein

• Can also require criteria 1 “and” criteria 2 “and” criteria 3
• E.g. awk -F"\t" '$1=="chr1" && $3=="gene" && $7=="+"' genes.gff

chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown

AWK usage
● E.g., awk -F"\t" '$1=="chr1" && $3=="gene"' genes.gff
chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein

● Can use || as an “or” condition, to mean do if it meets criteria 1 “or” criteria 2
● E.g. awk -F"\t" '$1=="chr1" || $3=="gene"' genes.gff
chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein
chr1 source5 repeat 10000 14000 1 + . name=ALU
chr2 source2 gene 10000 1200 0.95 + 0
chr2 source1 gene 50 900 0.4 - 0 name=gene2;product=gene2 protein
chr3 source1 gene 200 210 0.8 . 0 name=gene3
chrX source1 gene 123 456 0.6 + 0 name=gene4;product=unknown

● E.g. awk -F"\t" '$1=="chr1" && $3=="gene" && $4 < 1100' genes.gff
chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein
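
The "and" and "or" filters can be checked by counting matches rather than reading the output by eye. This is a sketch under one stated assumption: the ten records below are genes.gff as reconstructed from the outputs shown in these slides, and the real course file may differ.

```shell
# Rebuild the ten-record sample file (tab-separated) in a temporary file.
gff=$(mktemp)
{
  printf 'chr1\tsource1\tgene\t100\t300\t0.5\t+\t0\tname=gene1;product=unknown\n'
  printf 'chr1\tsource2\tgene\t1000\t1100\t0.9\t-\t0\tname=recA;product=RecA protein\n'
  printf 'chr1\tsource5\trepeat\t10000\t14000\t1\t+\t.\tname=ALU\n'
  printf 'chr2\tsource2\tgene\t10000\t1200\t0.95\t+\t0\n'
  printf 'chr2\tsource1\tgene\t50\t900\t0.4\t-\t0\tname=gene2;product=gene2 protein\n'
  printf 'chr3\tsource1\tgene\t200\t210\t0.8\t.\t0\tname=gene3\n'
  printf 'chr4\tsource3\trepeat\t300\t400\t1\t+\t.\tname=ALU\n'
  printf 'chr10\tsource2\trepeat\t60\t70\t0.78\t+\t.\tname=LINE1\n'
  printf 'chr10\tsource2\trepeat\t150\t166\t0.84\t+\t.\tname=LINE2\n'
  printf 'chrX\tsource1\tgene\t123\t456\t0.6\t+\t0\tname=gene4;product=unknown\n'
} > "$gff"

# "and": both conditions must hold. n + 0 prints 0 instead of an empty string if nothing matches.
and_hits=$(awk -F '\t' '$1 == "chr1" && $3 == "gene" { n++ } END { print n + 0 }' "$gff")

# "or": either condition is enough.
or_hits=$(awk -F '\t' '$1 == "chr1" || $3 == "gene" { n++ } END { print n + 0 }' "$gff")

# Three conditions chained with &&; print only the annotation column of the match.
plus_gene=$(awk -F '\t' '$1 == "chr1" && $3 == "gene" && $7 == "+" { print $9 }' "$gff")

echo "and: $and_hits, or: $or_hits, chr1 + strand gene: $plus_gene"
rm -f "$gff"
```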

AWK basic arithmetic

● As awk recognizes mathematical operators, it can be used to perform basic calculations based on some criteria

● E.g. to find the length of repeats in the genes.gff file:

awk -F"\t" '$3=="repeat" {print $5 - $4 + 1}' genes.gff
4001
101
11
17
AWK basic arithmetic
● E.g. to find the length of repeats in the genes.gff file: awk -F"\t" '$3=="repeat" {print $5 - $4 + 1}' genes.gff

● The +1 is needed because in the General Feature Format (GFF) the
sequence numbering starts at 1 and both the start and end positions belong to the feature
(https://www.ensembl.org/info/website/upload/gff.html)

● This differs from the BED file format, where the sequence numbering starts at
0 and the end position is not part of the feature, so the length is simply end - start
(https://m.ensembl.org/info/website/upload/bed.html)
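
To make the coordinate difference concrete, the sketch below converts one GFF record to BED-style coordinates and computes the length both ways. It uses record 3 of the sample file; the variable names are illustrative, and the two extra length columns are appended only for comparison and are not part of the BED format.

```shell
# Record 3 of the sample genes.gff: a repeat from 10000 to 14000 (GFF coordinates).
bed=$(
  printf 'chr1\tsource5\trepeat\t10000\t14000\t1\t+\t.\tname=ALU\n' |
    awk -F '\t' 'BEGIN { OFS = "\t" } {
      gff_len   = $5 - $4 + 1          # GFF: 1-based, start and end both included
      bed_start = $4 - 1               # BED: start is 0-based
      bed_end   = $5                   # BED: end is not included in the feature
      bed_len   = bed_end - bed_start  # so no +1 is needed
      # First three columns are BED coordinates; the last two are the lengths.
      print $1, bed_start, bed_end, gff_len, bed_len
    }'
)
echo "$bed"
```

Both calculations give the same length; only the convention for writing the start position differs.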
AWK basic arithmetic
● Can use awk to add up the total length of the repeats by using a variable

● E.g. awk -F"\t" 'BEGIN{sum=0} $3=="repeat" {sum = sum + $5 - $4 + 1} END{print sum}' genes.gff → 4130

● A variable called “sum” is set to zero before awk reads in the file

● Each time a line with repeat in field 3 is found, the calculated length of the repeat is added to the variable sum

● The END statement tells awk what to do once all the lines in the file have been read, in this instance to
print the final value of the variable sum

● Can also use awk’s += operator to accumulate the total, e.g. awk -F"\t" 'BEGIN{sum=0} $3=="repeat" {sum += $5 - $4
+ 1} END{print sum}' genes.gff → 4130

● E.g., to find the mean score (field 6) of the genes: awk -F"\t" 'BEGIN{sum=0; count=0} $3=="gene" {sum += $6; count++} END{print sum/count}'
genes.gff → 0.691667

● A second variable called count is set to zero and increases by 1 each time the term gene is
matched; this keeps track of the number of matches to gene

● The END statement tells awk to divide the total value of sum (4.15) by the number of matches to gene
(6) = 0.691667
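
Both accumulations can be run end to end on the sample data. This is a sketch assuming the ten records below match the course file (they are reconstructed from the outputs in these slides). Note that the accumulating operator is +=; writing =+ would assign the latest value instead of adding it. The count > 0 guard is an addition that avoids dividing by zero when nothing matches.

```shell
# Rebuild the ten-record sample file (tab-separated) in a temporary file.
gff=$(mktemp)
{
  printf 'chr1\tsource1\tgene\t100\t300\t0.5\t+\t0\tname=gene1;product=unknown\n'
  printf 'chr1\tsource2\tgene\t1000\t1100\t0.9\t-\t0\tname=recA;product=RecA protein\n'
  printf 'chr1\tsource5\trepeat\t10000\t14000\t1\t+\t.\tname=ALU\n'
  printf 'chr2\tsource2\tgene\t10000\t1200\t0.95\t+\t0\n'
  printf 'chr2\tsource1\tgene\t50\t900\t0.4\t-\t0\tname=gene2;product=gene2 protein\n'
  printf 'chr3\tsource1\tgene\t200\t210\t0.8\t.\t0\tname=gene3\n'
  printf 'chr4\tsource3\trepeat\t300\t400\t1\t+\t.\tname=ALU\n'
  printf 'chr10\tsource2\trepeat\t60\t70\t0.78\t+\t.\tname=LINE1\n'
  printf 'chr10\tsource2\trepeat\t150\t166\t0.84\t+\t.\tname=LINE2\n'
  printf 'chrX\tsource1\tgene\t123\t456\t0.6\t+\t0\tname=gene4;product=unknown\n'
} > "$gff"

# Total length of all repeats: add each repeat length to sum, print once at the end.
total=$(awk -F '\t' '$3 == "repeat" { sum += $5 - $4 + 1 } END { print sum }' "$gff")

# Mean score of all genes: sum the scores, count the matches, divide in END.
mean=$(awk -F '\t' '$3 == "gene" { sum += $6; count++ }
                    END { if (count > 0) printf "%.3f\n", sum / count }' "$gff")

echo "total repeat length: $total"
echo "mean gene score:     $mean"
rm -f "$gff"
```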

● /home/manager/course_data/unix/practical -> unix.pdf
● https://www.tutorialspoint.com/awk/index.htm
● https://bioinformatics.cvr.ac.uk/category/awk/
● https://linuxhint.com/category/awk/
● https://www.shortcutfoo.com/app/dojos/awk/cheatsheet
