
UnixCommands Day1

Uploaded by

byron7cueva
Unix commands for data editing

Daniela Lourenco

BLUPF90 TEAM, 02/2023


Hands on…getting some data

cp -r /home/guest002/course/labs/lab1linux_une .

curl "http://nce.ads.uga.edu/wiki/lib/exe/fetch.php?media=lab1Linux_une.zip" -o lab1linux.zip
Popular commands
head file         prints the first 10 lines
head -20 file     prints the first 20 lines
tail file         prints the last 10 lines
less file         shows a file line by line or page by page
less -S file      same as less, without wrapping long lines
wc -l file        counts the number of lines
grep text file    finds lines that contain text
cat file1 file2   concatenates files
sort              sorts a file
cut               cuts specific columns
join              joins lines of two files on specific columns
paste             pastes lines of two files side by side
expand            replaces TABs with spaces
uniq              retains unique lines of a sorted file
head / tail
head pedigree.txt
head -20 pedigree.txt

tail pedigree.txt
UGA42011 UGA41101 UGA34199
UGA42012 UGA41101 UGA38407
UGA42013 UGA41101 UGA39798
UGA42014 UGA41101 UGA37367
UGA42015 UGA41101 UGA40507
UGA42016 UGA41101 UGA34449
UGA42017 UGA41101 UGA37465
UGA42018 UGA41101 UGA40205
UGA42019 UGA41101 UGA37513
UGA42020 UGA41101 UGA34836
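A minimal sketch of the same commands in their `-n` form, assuming a hypothetical four-line pedigree with a header line (the file name and IDs are made up for illustration). `tail -n +2` starts printing at line 2, which is a common way to skip a header.

```shell
# Hypothetical demo file: one header line plus three animals
tmpdir=$(mktemp -d)
printf 'id sire dam\nA1 S1 D1\nA2 S1 D2\nA3 S2 D3\n' > "$tmpdir/ped.txt"

first2=$(head -n 2 "$tmpdir/ped.txt")   # first two lines (header + A1)
last1=$(tail -n 1 "$tmpdir/ped.txt")    # last line only
body=$(tail -n +2 "$tmpdir/ped.txt")    # everything from line 2 onward

printf '%s\n' "$body"
rm -r "$tmpdir"
```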
Genomics - huge volume of information
• Example 50kv2 (54609 SNP)
• For 104 individuals
• Illumina final report file:
• 5,679,346 records
• 302 MB

• Not efficient to read/edit with regular editors (vi, vim, gedit…)


less command
• Lets you view the contents of a file and move forward and backward
• For files with long lines, use option -S (disables line wrapping)
less -S genotypes.txt
Counting lines/characters inside files
• Command wc counts the number of lines/words/bytes
wc genotypes.txt
2024 4048 91108336 genotypes.txt

• Number of lines in one or more files


wc -l genotypes.txt pedigree.txt
2024 genotypes.txt
10000 pedigree.txt
12024 total
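When a line count is needed inside a script, a common approach is to feed the file to wc on standard input so that only the number is printed. A small sketch, assuming a hypothetical three-line file created on the spot:

```shell
# Hypothetical demo file with three records (not part of the lab data)
tmpdir=$(mktemp -d)
printf 'a 1\nb 2\nc 3\n' > "$tmpdir/demo.txt"

# Reading from standard input keeps the file name out of the output
nlines=$(wc -l < "$tmpdir/demo.txt")
echo "demo.txt has $nlines lines"

rm -r "$tmpdir"
```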
Concatenating files
Put the contents of file1 and file2 into output_file (> creates or overwrites the file)
cat file1 file2 > output_file

Append the contents of file3 to the end of output_file using the >> redirection
cat file3 >> output_file
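The difference between > and >> can be checked with a few throwaway files. This is only a sketch; the file contents are invented for the demonstration.

```shell
# Hypothetical demo files in a scratch directory
tmpdir=$(mktemp -d)
printf '1\n2\n' > "$tmpdir/file1"
printf '3\n'    > "$tmpdir/file2"
printf '4\n'    > "$tmpdir/file3"

cat "$tmpdir/file1" "$tmpdir/file2" > "$tmpdir/output_file"   # creates or overwrites
cat "$tmpdir/file3" >> "$tmpdir/output_file"                  # appends at the end

combined=$(cat "$tmpdir/output_file")
printf '%s\n' "$combined"
rm -r "$tmpdir"
```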


paste / expand
paste merges files line by line with a TAB delimiter
expand replaces TAB with spaces
paste -d " " merges files line by line with a space delimiter

head file1 file2
==> file1 <==
1
2
3

==> file2 <==
a
b
c

paste file1 file2 | head
1	a
2	b
3	c

paste -d " " file1 file2 | head
1 a
2 b
3 c
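The sketch below recreates file1 and file2 from the slide in a scratch directory and combines paste with expand. The tab stop of 4 passed to `expand -t` is an arbitrary choice for the illustration.

```shell
tmpdir=$(mktemp -d)
printf '1\n2\n3\n' > "$tmpdir/file1"
printf 'a\nb\nc\n' > "$tmpdir/file2"

# Default delimiter is a TAB; expand -t 4 pads each TAB with spaces
# up to the next multiple of 4 columns
tabbed=$(paste "$tmpdir/file1" "$tmpdir/file2" | expand -t 4)

# -d " " joins the columns with a single space instead
spaced=$(paste -d " " "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$tabbed" "$spaced"
rm -r "$tmpdir"
```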
sort
• Sorts a file in alphanumeric order
• Option -k specifies which column is used as the sort key
sort -k 2,2 file4 > a or sort +1 -2 file4 > a
sort -k 1,1 file4 > b or sort +0 -1 file4 > b
(the +POS1 -POS2 form is obsolete syntax that current versions of sort may reject; prefer -k)

• Sorts a file in numeric order
sort -nk 2,2 file4 > a or sort -n +1 -2 file4 > a
sort -nk 1,1 file4 > b or sort -n +0 -1 file4 > b

• Sorts a file in reverse numeric order
sort -nrk 2,2 file4 > a or sort -nr +1 -2 file4 > a

• Sorts based on column 1, then column 2
sort -k1,1 -k2,2 file4 > ab
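To see why the -n option matters, the sketch below sorts the same column both ways. The contents of file4 are invented here, and `LC_ALL=C` is set only so the alphanumeric result does not depend on the locale of the machine.

```shell
# Hypothetical file4: a label in column 1, a number in column 2
tmpdir=$(mktemp -d)
printf 'c 10\na 9\nb 100\n' > "$tmpdir/file4"

# Alphanumeric: keys are compared character by character, so 100 sorts before 9
alpha=$(LC_ALL=C sort -k 2,2 "$tmpdir/file4")

# Numeric: keys are compared as numbers
num=$(LC_ALL=C sort -n -k 2,2 "$tmpdir/file4")

printf 'alphanumeric:\n%s\nnumeric:\n%s\n' "$alpha" "$num"
rm -r "$tmpdir"
```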
join
• Merges two files by column 1 in both (both files must already be sorted on that column)
join -1 1 -2 1 phenotypes.txt pedigree.txt > new_file

• Merges two files by column 1 in both (sorting at the same time)
join -1 1 -2 1 <(sort -k1,1 phenotypes.txt) <(sort -k1,1 pedigree.txt) > new_file
OR
join -1 1 -2 1 <(sort +0 -1 phenotypes.txt) <(sort +0 -1 pedigree.txt) > new_file

• Prints only the lines of file 1 that have no match in file 2 (the joined lines are suppressed)
join -v1 phenotypes.txt pedigree.txt > new_file
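A small sketch of both join modes, using invented animal IDs and temporary sorted copies instead of process substitution so that it also runs in a plain POSIX shell. The file names and values are hypothetical.

```shell
export LC_ALL=C   # sort and join must agree on the collation order
tmpdir=$(mktemp -d)
printf 'A2 5.1\nA1 4.2\nA9 3.3\n' > "$tmpdir/phen.txt"         # animal, record
printf 'A1 S1 D1\nA2 S1 D2\nA3 S2 D3\n' > "$tmpdir/ped.txt"    # animal, sire, dam

sort -k1,1 "$tmpdir/phen.txt" > "$tmpdir/phen.srt"
sort -k1,1 "$tmpdir/ped.txt"  > "$tmpdir/ped.srt"

# Lines whose column 1 appears in both files
matched=$(join -1 1 -2 1 "$tmpdir/phen.srt" "$tmpdir/ped.srt")

# Lines of the first file with no partner in the second
unmatched=$(join -v 1 "$tmpdir/phen.srt" "$tmpdir/ped.srt")

printf 'matched:\n%s\nunmatched:\n%s\n' "$matched" "$unmatched"
rm -r "$tmpdir"
```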
grep
• grep finds patterns within a file and lists all lines that match the pattern
grep UGA42014 pedigree.txt

• grep -v shows all lines that do not match the pattern


grep -v UGA pedigree.txt

• For a pattern with spaces, put it in quotes (-e marks the next argument as the pattern)

grep -e "pattern with spaces" file1
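The following sketch applies these grep forms to a three-line hypothetical pedigree. It also uses `-c`, which prints the number of matching lines instead of the lines themselves, and `^`, which anchors the pattern to the start of the line.

```shell
tmpdir=$(mktemp -d)
printf 'UGA1 UGA7 UGA8\nUGA2 0 0\nXYZ3 UGA7 UGA9\n' > "$tmpdir/ped.txt"

n_hits=$(grep -c 'UGA7' "$tmpdir/ped.txt")       # lines containing UGA7
no_hits=$(grep -v 'UGA7' "$tmpdir/ped.txt")      # lines NOT containing UGA7
phrase=$(grep 'UGA7 UGA8' "$tmpdir/ped.txt")     # quoted pattern with a space
n_start=$(grep -c '^UGA' "$tmpdir/ped.txt")      # lines that start with UGA

echo "$n_hits lines contain UGA7; $n_start lines start with UGA"
rm -r "$tmpdir"
```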
sed
• sed is a stream editor
It reads the input file and applies commands to the lines that match a pattern
• Substitution of a pattern (the delimiter can be / or another character such as :)
sed 's/pattern1/new pattern/g' file > newfile
sed 's:pattern1:new pattern:g' file > newfile
sed 's:UGA:DL:g' pedigree.txt > dl.temp

• Substitution of a pattern in the same file (GNU sed; BSD/macOS sed needs -i '')
sed -i 's/pattern1/new pattern/g' file

• Substitution of a pattern in a specific line (e.g., line 24)
sed '24s/pattern1/new pattern/' file > newfile

• Deletes lines that contain "pattern to match"
sed '/pattern to match/d' file
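A short sketch of the three sed forms on a hypothetical three-line pedigree, writing to variables rather than editing the file in place. The deletion pattern ` 0 0$` (a line ending in two zero parents) is an assumption made up for the example.

```shell
tmpdir=$(mktemp -d)
printf 'UGA1 UGA7 UGA8\nUGA2 0 0\nUGA3 UGA7 UGA9\n' > "$tmpdir/ped.txt"

# g replaces every occurrence on every line
all=$(sed 's/UGA/DL/g' "$tmpdir/ped.txt")

# An address (2) restricts the command to line 2; without g only the
# first occurrence on that line is replaced
line2=$(sed '2s/UGA/DL/' "$tmpdir/ped.txt")

# d deletes the lines matching the pattern
kept=$(sed '/ 0 0$/d' "$tmpdir/ped.txt")

printf '%s\n' "$all"
rm -r "$tmpdir"
```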
awk
AWK is a text-processing language typically used as a data extraction and reporting tool

The name comes from its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan
awk

• Interpreted programming language that processes the data stream of a file line by line

• Very useful and fast command to work with text files

• Can be used as a database query program


• Selects specific columns or creates new ones
• Selects specific rows matching some criteria

• Can be used with if/else and for structures


awk implicit variables
NF - number of fields
NR - record number
FS - input field separator
OFS - output field separator

• Print column 1 and the last column of the pedigree file:
awk '{print $1,$NF}' pedigree.txt > anim_dam.temp
• Print all columns:
awk '{print $0}' phenotypes.txt > all_phen.temp
• Print column 1 when column 3 equals 2:
awk '{if ($3==2) print $1}' phenotypes.txt > fem.temp
• Print columns 3 and 4 skipping the first 1000 lines:
awk '{if (NR>1000) print $3,$4}' phenotypes.txt > part.temp
• Print length of column 2 from line 1:
awk '{if (NR==1) print length($2)}' genotypes.txt

• Process CSV files
awk 'BEGIN {FS=","} {print $1,$2,$3}' pedigree.txt > ped_out.temp
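As an illustration of FS, OFS, and NR working together, here is a sketch on a hypothetical comma-separated phenotype file (id, sex, weight). The column layout and values are assumptions for the example only.

```shell
tmpdir=$(mktemp -d)
printf '1,F,250\n2,M,310\n3,F,275\n' > "$tmpdir/phen.csv"   # id,sex,weight

# -F, is the command-line equivalent of BEGIN {FS=","}
females=$(awk -F, '$2 == "F" {print $1}' "$tmpdir/phen.csv")

# OFS controls what print puts between its comma-separated arguments;
# NR > 1 skips the first record
pairs=$(awk 'BEGIN {FS=","; OFS="\t"} NR > 1 {print $1, $3}' "$tmpdir/phen.csv")

printf '%s\n' "$females" "$pairs"
rm -r "$tmpdir"
```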
awk hash tables
• Arrays can be indexed by alphanumeric variables in an efficient way

• awk version to count progeny by sire


• sire id is column 2

awk '{ sire[$2]+=1 } END { for (i in sire) {print "Sire " i, sire[i]} }' pedigree.txt
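One detail worth knowing is that `for (i in array)` visits the keys in an unspecified order, so the output is usually piped through sort. The sketch below uses a hypothetical four-animal pedigree and assumes that an unknown sire is coded as 0.

```shell
tmpdir=$(mktemp -d)
printf 'A1 S1 D1\nA2 S1 D2\nA3 S2 D3\nA4 0 0\n' > "$tmpdir/ped.txt"   # animal, sire, dam

# n[] is indexed by sire ID; records with sire "0" (assumed unknown) are skipped
counts=$(awk '$2 != "0" { n[$2]++ } END { for (s in n) print s, n[s] }' \
         "$tmpdir/ped.txt" | sort -k1,1)

printf '%s\n' "$counts"
rm -r "$tmpdir"
```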
awk
• awk can be used for pretty much anything related to data processing in Unix
• Sum of elements in column 1 (file1 holds the values 1, 2, and 3, one per line)
awk '{ sumf += $1 } END { print sumf}' file1
6

• Sum of squares of elements in column 1


awk '{ sumf += $1*$1 } END { print sumf}' file1
14

• Average of elements in column 1


awk '{ sumf += $1 } END { print sumf/NR}' file1
2
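The three accumulators above can be combined in a single pass. As a sketch, the command below computes the mean and the sample variance of column 1, using the identity variance = (sum of squares - n * mean^2) / (n - 1); file1 is recreated with the values 1, 2, 3.

```shell
tmpdir=$(mktemp -d)
printf '1\n2\n3\n' > "$tmpdir/file1"

# s accumulates the sum, ss the sum of squares; NR in END is the record count
stats=$(awk '{ s += $1; ss += $1 * $1 }
             END { m = s / NR; print m, (ss - NR * m * m) / (NR - 1) }' "$tmpdir/file1")

echo "mean and variance: $stats"
rm -r "$tmpdir"
```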
uniq
• Command uniq lists the unique lines of a file; it only collapses adjacent duplicates, so sort the input first
• Option -c counts the number of times each level occurs in a file

Example: counting progeny by sire in a pedigree file

awk '$2>0{print $2}' pedigree.txt | sort | uniq -c > s.temp

awk '{if ($2>0) print $2}' pedigree.txt | sort | uniq -c > s.temp
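The need for sorted input is easy to verify with a list where a duplicate is not adjacent. This sketch uses an invented list of sire IDs; the final awk step only swaps the two columns so that the padding uniq -c adds in front of the count disappears.

```shell
tmpdir=$(mktemp -d)
printf 'S1\nS2\nS1\n' > "$tmpdir/sires.txt"

# Without sorting, the two S1 lines are not adjacent and both survive
n_unsorted=$(uniq "$tmpdir/sires.txt" | wc -l)

# After sorting, duplicates are adjacent and collapse to one line each
n_sorted=$(sort "$tmpdir/sires.txt" | uniq | wc -l)

# Count per sire, printed as "sire count"
per_sire=$(sort "$tmpdir/sires.txt" | uniq -c | awk '{print $2, $1}')

printf '%s\n' "$per_sire"
rm -r "$tmpdir"
```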
Useful commands for Linux
• Several tutorials are available on the web

• unixcombined.pdf from the Misztal web site
• http://nce.ads.uga.edu/~ignacy/ads8200/unixcombined.pdf

• Online
• https://tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf