4 Managing data

This document provides tips for managing bioinformatics data and compression tools in Linux. It discusses that bioinformatics data has historically been stored in plain text files that can be multiple gigabytes in size. It recommends compressing data when moving it to its smallest form using tools like gzip and bzip2, and using symbolic links to point to data in different folders without moving the data itself. The document also covers checking available storage space on disks using the df command, formatting disks with different file systems like ext4, and using wildcards to select multiple files with commands like du and ls.

Uploaded by

Parisha Singh
Copyright
© Attribution Non-Commercial (BY-NC)

Managing data

Joachim Jacob 8 and 15 November 2013

Bioinformatics data
Historically, bioinformatics has always used text files to store data.
PDB file excerpt

GenBank record

HMM profile

NGS data
NGS machines produce a lot of data, stored in plain text files. These files are multiple gigabytes in size.

Tips for managing NGS data

1. When you move the data, do it in its smallest form. Compress the data.
2. When you unpack the data, leave it where it is. Symbolic links point to the data in different folders.
3. Provide enough storage for your data. Choose your file system type wisely.

Compression: tools in Linux

And some more exist...

http://www.linuxlinks.com/article/20110220091109939/CompressionTools.html

Tips
Widely used compression tools:
- GNU zip (gzip)
- Block Sorting compression (bzip2)
Typically, compression tools work on one file. How to compress directories and their contents?

Tar without compression

Tar (Tape Archive) is a tool for bundling a set of files or directories into a single archive. The resulting file is called a tarball.
Syntax to create a tarball:
$ tar -cf archive.tar file1 file2
Syntax to extract:
$ tar -xvf /path/to/archive.tar
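The create and extract syntax above can be tried safely in a scratch directory. This is a minimal sketch: the file contents and the `unpacked` directory are made up for illustration, and the `-t` (list) and `-C` (extract into a directory) options are additions not shown on the slide.

```shell
# Work in a throwaway directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two small example files (names follow the slide, contents are made up).
printf 'ACGT\n' > file1
printf 'TTGA\n' > file2

# c = create, f = name of the archive file to write.
tar -cf archive.tar file1 file2

# t = list the archive contents without extracting anything.
tar -tf archive.tar

# x = extract, v = verbose, -C = directory to extract into.
mkdir unpacked
tar -xvf archive.tar -C unpacked
```

A plain tarball is not compressed: it is roughly as large as the files it bundles.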

Compression: a typical case


Archiving and compression mostly occur together. The most used formats are tar.gz and tar.bz2. These files are the result of two processes:

1. Archiving (tar)
2. Compressing (gzip or bzip2)

Compression: on your desktop

Compression: on the command line


Tar is the tool for creating .tar archives, but it can also compress in one go, with the z (gzip) or j (bzip2) option.
Creating a compressed tar archive (c = create, v = verbose, f = archive file name, z or j = compression technique):
$ tar cvfz mytararchive.tar.gz docs/
$ tar cvfj mytararchive.tar.bz2 docs/

Decompressing a compressed tar archive (x = extract files, v = verbose):
$ tar xvfz mytararchive.tar.gz
$ tar xvfj mytararchive.tar.bz2
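A compressed tar archive can be created, inspected and restored in a scratch directory as sketched below. The `docs/regions.bed` file and the `restore` directory are hypothetical. The dashed option form `-czvf` is used here, which is equivalent to the `cvfz` spelling as long as `f` comes last, directly before the archive name. The bzip2 variant only runs if `bzip2` is installed.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# A small directory to archive (hypothetical content).
mkdir docs
printf 'chr1\t100\t200\n' > docs/regions.bed

# Archive and gzip-compress in one go (c = create, z = gzip, v = verbose, f = file).
tar -czvf mytararchive.tar.gz docs/

# The same with bzip2 compression (j), only if bzip2 is available.
if command -v bzip2 >/dev/null 2>&1; then
    tar -cjvf mytararchive.tar.bz2 docs/
fi

# List the contents, then extract into a separate directory.
tar -tzf mytararchive.tar.gz
mkdir restore
tar -xzvf mytararchive.tar.gz -C restore
```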

De-/compression
To compress one or more files:
$ gzip [options] file
$ bzip2 [options] file
To decompress one or more files:
$ gunzip [options] file(s)
$ bunzip2 [options] file(s)
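A round trip with gzip shows what happens to the file on disk. This is a sketch on generated data, and the file names are hypothetical. Note that `gzip` replaces the original file with a `.gz` file, and `gunzip` does the reverse.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# Generate a repetitive text file, which compresses well.
awk 'BEGIN { for (i = 0; i < 1000; i++) print "ACGTACGTACGTACGT" }' > reads.txt
cp reads.txt original.txt

# Compress: reads.txt disappears, reads.txt.gz takes its place.
gzip reads.txt
ls -lh

# Decompress: reads.txt.gz disappears, reads.txt is restored.
gunzip reads.txt.gz

# Verify that the restored file is identical to the copy kept aside.
cmp original.txt reads.txt && echo "round trip OK"
```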

Tips
Many compression tools on the command line can read compressed files directly (instead of first unpacking, then reading).

$ zcat file(s)
$ bzcat file(s)
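Reading a compressed file through a pipe could look like the sketch below. The FASTQ content is made up, and a Linux system with GNU gzip is assumed (on some other systems `zcat` expects `.Z` files). The file on disk stays compressed the whole time.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# A tiny made-up FASTQ file with two reads, compressed with gzip.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > reads.fastq
gzip reads.fastq

# zcat writes the decompressed content to standard output.
zcat reads.fastq.gz | head -n 4

# Count lines without ever creating an uncompressed copy on disk.
line_count=$(zcat reads.fastq.gz | wc -l)
echo "lines: $line_count"
```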


Compression is always a balance between time and compression ratio. Gzip is faster, bzip2 compresses harder. If compression is important to you: benchmark!

Exercise
A little compression exercise.
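The advice to benchmark can be followed with a few lines of shell. The sketch below only compares sizes on generated data, so the numbers say little about real NGS files. The file names are hypothetical, and the bzip2 part is skipped when `bzip2` is not installed.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# Generate some test data (made up; real data will behave differently).
awk 'BEGIN { for (i = 0; i < 20000; i++) print "read_" i " ACGTTGCAACGTTGCA" }' > sample.txt

# -c writes to standard output, so the original file is kept.
gzip -c sample.txt > sample.txt.gz

size_orig=$(wc -c < sample.txt)
size_gz=$(wc -c < sample.txt.gz)
echo "original: $size_orig bytes"
echo "gzip:     $size_gz bytes"

# Same for bzip2, if it is available on this system.
if command -v bzip2 >/dev/null 2>&1; then
    bzip2 -c sample.txt > sample.txt.bz2
    size_bz2=$(wc -c < sample.txt.bz2)
    echo "bzip2:    $size_bz2 bytes"
fi
```

To include the time side of the balance, prefix the compression commands with `time` in an interactive bash session and run them on a representative file of your own.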

Symlinks
Pay attention. Something very convenient! A symbolic link (or symlink) is a file which points to the location of the linked-to file. You can do anything with the symlink that you can do with the original file. If you move the original file from its location, the symlink is "dead".

~/
    Downloads/
    Projects/
        Annotation/
        Rice/
            Sequences/
                alignment.sam
        Butterfly/

Symlinks
To create a symlink, move to the folder where the symlink must be created, and execute ln.

~/Projects $ cd Butterfly
~/Projects/Butterfly $ ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam


Symlinks
The symlink is created. You can check with ls. To delete a symlink, use unlink.

~/Projects/Butterfly $ ls -lh Link_to_alignment.sam
lrwxrwxrwx 1 joachim joachim 44 Oct 22 14:47 Link_to_alignment.sam -> ../Sequences/alignment.sam
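The whole life cycle of a symlink (create, check, break, delete) can be replayed in a scratch copy of the folder layout. This is a sketch: the directory tree is rebuilt under a temporary directory, the SAM content is a made-up one-line header, and `readlink` is an addition not mentioned on the slides.

```shell
# Rebuild the example layout in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p Projects/Rice/Sequences Projects/Butterfly
printf '@HD\tVN:1.0\n' > Projects/Rice/Sequences/alignment.sam

# Create the symlink from inside the folder where it should live.
cd Projects/Butterfly
ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam

# Check it: ls shows the arrow, readlink prints only the target.
ls -lh Link_to_alignment.sam
link_target=$(readlink Link_to_alignment.sam)

# Reading through the link gives the content of the original file.
cat Link_to_alignment.sam

# Moving the original breaks the link: it still exists, but points nowhere.
mv ../Rice/Sequences/alignment.sam ../Rice/alignment.sam
if [ -L Link_to_alignment.sam ] && [ ! -e Link_to_alignment.sam ]; then
    echo "the symlink is dead"
fi

# Deleting the symlink never touches the original data.
unlink Link_to_alignment.sam
```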

~/
    Downloads/
    Projects/
        Annotation/
        Rice/
            Sequences/
                alignment.sam
        Butterfly/
            Link_to_alignment.sam

Exercise
A little symlink exercise.

Disks and storage


If you dive into bioinformatics, you will have to manage disks and storage. Two types of disks:
- solid state disks: low capacity, high speed, random writes
- spinning hard disks: high capacity, "normal" speed, sequential writes
http://en.wikipedia.org/wiki/Solid-state_drive
http://en.wikipedia.org/wiki/Hard_disk

A disk is a device
Via the terminal, show the disks using

$ sudo fdisk -l
[sudo] password for joachim:
Disk /dev/sda: 13.4 GB, 13408141312 bytes
...
Disk /dev/sdb: 3997 MB, 3997171712 bytes
...

A disk is divided into partitions


A disk can be divided into parts, called partitions. An internal disk which runs an operating system is usually divided into partitions, one for each function. An external disk is usually not divided into partitions.

Check out the disk utility tool

The system disk

Name of the disk
Name of the currently highlighted partition
Place in the directory structure where the partition can be accessed

An example of a USB disk

Place in the directory structure where the partition can be accessed
The USB disk is "mounted" automatically on the directory tree under /media.
This is the type of file system on the partition. The partition is said to be formatted in FAT32 (in this case).

File system formats


By default, many USB flash disks are formatted in FAT32. Other types are NTFS, ext4, XFS.
FAT32: max 4 GB files
NTFS: maximum portability (also for use under Windows)
Ext4: default file system in Linux

http://en.wikipedia.org/wiki/File_system#File_systems_and_operating_systems

An example of a USB disk


First unmount the device. Next, choose to format the device.

Format disks with disk utility

Choose the type of file system you want to be on that device.


You don't want to know all the commands that work behind the gnome-disk-utility for you. But if you do:
- mount
- umount
- fdisk
- mkfs

You can read the man pages and search for guides on the internet if you want to get to know these (out of scope for this course).

Checking storage space


By default: "disk usage analyzer".
Bonus: K4DirStat. Not installed by default.

K4DirStat is a KDE package


Rehearsal: what is KDE?

Bonus: what happens when you install this package on our system?

Space left on disks with df

To check the storage that is used on the different disks:

~/ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        12G  5.3G  5.7G  49% /
udev            490M  4.0K  490M   1% /dev
tmpfs           200M  920K  199M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            498M   76K  498M   1% /run/shm
/dev/sdb1       3.8G   20M  3.7G   1% /media/test

To report only the file system that holds the current directory:

~/ $ df -h .
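When the free space is needed inside a script, for example before unpacking a large archive, the human-readable table is awkward to parse. The sketch below uses the `-P` (POSIX output format) and `-k` (1024-byte blocks) options, which are not on the slide, together with `awk` to pick out single columns. The variable names are made up, and the parsing assumes a file system name without spaces.

```shell
# Human-readable report for the file system holding the current directory.
df -h .

# Script-friendly form: -P keeps each file system on one line,
# -k reports sizes in blocks of 1024 bytes.
# Line 2 holds the values; column 4 is available space, column 5 is use%.
avail_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')
use_pct=$(df -Pk . | awk 'NR == 2 { print $5 }')

echo "available: $avail_kb KB"
echo "in use:    $use_pct"
```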

The size of directories


To check the size of files or directories:
~/ $ du -sh *
520K    bin
281M    Compression_exercise
4.0K    Desktop
4.0K    Documents
5.0M    Downloads
4.0K    Music
4.0K    Pictures
4.0K    Public
373M    Rice Example
4.0K    Templates
4.0K    test
17M     test.img
114M    ugene-1.12.2
4.0K    Videos
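To find out which directory takes the most space, the du output can be sorted. This is a sketch on two made-up directories. It uses `-k` (sizes in kilobytes) instead of `-h`, because plain numbers sort reliably with `sort -n`.

```shell
# Work in a throwaway directory with one small and one larger directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir small big
printf 'x\n' > small/a.txt
# Write about 2 MB of zero bytes into the larger directory.
dd if=/dev/zero of=big/b.bin bs=1024 count=2048 2>/dev/null

# Human-readable summary per entry (s = summarize, h = human readable).
du -sh *

# Sizes in kilobytes, sorted from smallest to largest.
du -sk * | sort -n

# The last line of the sorted list is the largest entry.
largest=$(du -sk * | sort -n | tail -n 1 | awk '{ print $2 }')
echo "largest entry: $largest"
```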

Wildcards on the command line


Wildcards are used to describe the names of files and directories.
[ ]  On that position, the character may be one of the characters between [ ], e.g. saniti[sz]ation matches: sanitisation and sanitization
?    On that position, any single character is allowed, e.g. saniti?ation matches: sanitisation, sanitiration, ...
*    On that position, a string of any length is allowed, e.g. s* matches: san, sdd, sanitisation, sam.alignment, ...
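The three wildcards can be tried out on empty files named after the examples above. In this sketch the extra file `other` is made up, to show a name that none of the patterns match. The shell expands each pattern into the matching names before `echo` runs.

```shell
# Work in a throwaway directory with empty files named after the examples.
workdir=$(mktemp -d)
cd "$workdir"
touch sanitisation sanitization sanitiration san sdd sam.alignment other

# [sz]: exactly one character, which must be s or z.
echo saniti[sz]ation

# ?: exactly one character, whatever it is.
echo saniti?ation

# *: any string, including an empty one.
echo s*

# Keep the expansions to inspect them afterwards.
bracket_matches=$(echo saniti[sz]ation)
question_matches=$(echo saniti?ation)
```

A pattern that matches nothing is passed to the command unchanged, which is why a mistyped wildcard often ends in a "No such file or directory" error.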

Many tools that require an argument pointing to files or directories accept these wildcards.
~/ $ du -sh Do*
4.0K    Documents
20G     Downloads

~/ $ ls *.fastq
ERR148552_1.fastq  ERR148552_1_prinseq_good_zzwD.fastq  ERR148552_2.fastq  test.fastq  testout.fastq

Keywords
Compression
Archive
Symbolic link
mounting
File system format
partition
Recursively
df
du
unlink
Write in your own words what the terms mean.

Break
