Tutorial Bash Data Handling
(1) Department of Informatics and Business Information Systems, University of Applied Sciences Karlsruhe, Germany
(2) Institute for Applied Computer Sciences, Karlsruhe Institute of Technology, Germany
http://www.smiffy.de/dbkda-2017/
• Slideset
• Exercises
• Command refcard
• Example datasets
• Overview
• Search and Inspect
• File operations
• Excursus Regular Expressions
• sed & awk
• Emulating SQL with the Shell
• Summary
+ 3 hands-on exercises:
• First contact
• Analyzing text
• sed & awk
[Figure: pipes and filters. A command feeds a chain of filters that are connected by pipes.]
Aachen,D,"Nordrhein Westfalen",247113,NULL,NULL
Aalborg,DK,Denmark,113865,10,57
Aarau,CH,AG,NULL,NULL,NULL
Aarhus,DK,Denmark,194345,10.1,56.1
Aarri,WAN,Nigeria,111000,NULL,NULL
Aba,WAN,Nigeria,264000,NULL,NULL
Abakan,R,"Rep. of Khakassiya",161000,NULL,NULL
Abancay,PE,Apurimac,NULL,NULL,NULL
Abeokuta,WAN,Nigeria,377000,NULL,NULL
Aberdeen,GB,Grampian,219100,NULL,NULL
Aberystwyth,GB,Ceredigion,NULL,NULL,NULL
Abidjan,CI,"Cote dIvoire",NULL,-3.6,5.3
Abilene,USA,Texas,108476,-99.6833,32.4167
"Abu Dhabi",UAE,"United Arab Emirates",363432,54.36,24.27
Abuja,WAN,Nigeria,NULL,NULL,NULL
Acapulco,MEX,Guerrero,515374,NULL,NULL
Austria,A,Vienna,Vienna,83850,8023244
Afghanistan,AFG,Kabul,Afghanistan,647500,22664136
"Antigua and Barbuda",AG,"Saint Johns","Antigua and Barbuda",440,65647
Albania,AL,Tirane,Albania,28750,3249136
Andorra,AND,"Andorra la Vella",Andorra,450,72766
Angola,ANG,Luanda,Luanda,1246700,10342899
Armenia,ARM,Yerevan,Armenia,29800,3463574
Australia,AUS,Canberra,"Australia Capital Territory",7686850,18260863
Azerbaijan,AZ,Baku,Azerbaijan,86600,7676953
Belgium,B,Brussels,Brabant,30510,10170241
Bangladesh,BD,Dhaka,Bangladesh,144000,123062800
Barbados,BDS,Bridgetown,Barbados,430,257030
Benin,BEN,Porto-Novo,Benin,112620,5709529
"Burkina Faso",BF,Ouagadougou,"Burkina Faso",274200,10623323
Bulgaria,BG,Sofia,Bulgaria,110910,8612757
Bhutan,BHT,Thimphu,Bhutan,47000,1822625
• Most commands accept input from a file or from STDIN. If no (or not
enough) input files are given, the input is read from STDIN
head -n4 my-file.txt
cat -n my-file.txt | head -n4
• Most commands have many options that cannot be explained in detail here.
To get an overview of what a command can do, simply type
man command
• Example:
man head
• Print all numbers between 1000 and 9999 that contain two consecutive 5s
seq 1000 9999| grep 55
• Print only matching part (i.e. ’Salzburg’ instead of whole line)
grep -E -o 'S[a-z]+g' city.csv
• Count the number of lines in which the word „Karlsruhe“ appears
grep -c Karlsruhe famous-cities.txt
• Look for lines containing words from file
grep -f grep-strings.txt -E newsCorpora.csv
• file: grep-strings.txt
Obama
Climate
• Runtime:
grep something big.txt | wc -l # ~ 20sec.
zgrep something big.txt.gz | wc -l # ~ 80 sec.
bzgrep something big.txt.bz2 | wc -l # ~ 380 sec.
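The difference between counting lines and counting matches is easy to miss, so here is a minimal sketch of the grep options shown above. The sample file and its content are hypothetical and only created for this illustration; -o is supported by GNU and BSD grep but is not part of POSIX.

```shell
# Hypothetical sample file, created only for this sketch.
tmp=$(mktemp -d)
printf '%s\n' 'Karlsruhe and Karlsruhe again' 'Salzburg' 'Strasbourg' \
  > "$tmp/famous-cities.txt"

# -c counts matching LINES, not individual occurrences.
lines=$(grep -c Karlsruhe "$tmp/famous-cities.txt")

# -o prints every match on its own line, so wc -l counts occurrences.
hits=$(grep -o Karlsruhe "$tmp/famous-cities.txt" | wc -l | tr -d ' ')

# -E -o prints only the part matched by the extended regular expression.
parts=$(grep -E -o 'S[a-z]+g' "$tmp/famous-cities.txt")

echo "lines=$lines hits=$hits"
echo "$parts"
rm -rf "$tmp"
```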
[Figure: file operations cat, paste, cut, and split]
• tr: translate, squeeze, and/or delete characters from standard input, writing to
standard output.
• Translate: mapping between characters, e.g.
• {A->a, B->b, ...}
• {A->*, E->*, I->*, O->*, U->*}
• Delete:
• {0,1,2,3,4,5,6,7,8,9}
• Squeeze:
• {aa...a -> a, xx...x -> x, \n\n...\n->\n}
• Predefined character classes / ASCII codes:
• [:punct:], [:alnum:], [:alpha:], [:blank:], [:upper:], [:lower:]
• \xxx : octal ASCII number (e.g. <space> -> \040)
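The three modes of tr listed above can be tried with a few one-liners. This is a small sketch with made-up input strings; the variable names are only for illustration.

```shell
# Translate: map upper-case letters to lower-case.
lower=$(printf 'Hello World\n' | tr 'A-Z' 'a-z')

# Translate: replace every upper-case vowel with '*'
# (the second set is written out in full to stay portable).
masked=$(printf 'AUDIO BOOK\n' | tr 'AEIOU' '*****')

# Delete: remove all digits.
nodigits=$(printf 'a1b22c333\n' | tr -d '0-9')

# Squeeze: collapse runs of 'a' and runs of spaces into one character.
squeezed=$(printf 'aaa  bbb\n' | tr -s 'a ')

# Predefined character classes.
upper=$(printf 'karlsruhe\n' | tr '[:lower:]' '[:upper:]')

# Octal ASCII code: \040 is the space character.
under=$(printf 'New York\n' | tr '\040' '_')

printf '%s\n' "$lower" "$masked" "$nodigits" "$squeezed" "$upper" "$under"
```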
• city.csv
Aachen,D,"Nordrhein Westfalen",247113,NULL,NULL
Aalborg,DK,Denmark,113865,10,57
Aarau,CH,AG,NULL,NULL,NULL
Aarhus,DK,Denmark,194345,10.1,56.1
Aarri,WAN,Nigeria,111000,NULL,NULL
...
• country.csv
...
Germany,D,Berlin,Berlin,356910,83536115
Djibouti,DJI,Djibouti,Djibouti,22000,42764
Denmark,DK,Copenhagen,Denmark,43070,524963
Algeria,DZ,Algiers,Algeria,2381740,2918303
Spain,E,Madrid,Madrid,504750,39181114
...
• Result (country code replaced by the country name):
Aachen,Germany,"Nordrhein Westfalen",247113
Aalborg,Denmark,Denmark,113865
Aarau,Switzerland,AG,NULL
Aarhus,Denmark,Denmark,194345
Aarri,Nigeria,Nigeria,111000
Aba,Nigeria,Nigeria,264000
Abakan,Russia,"Rep. of Khakassiya",161000
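One possible way to produce such a result is the join command. The following is a sketch under stated assumptions, not necessarily the command used on the slide: the files are shortened copies of the samples, the temporary file names are made up, and quoted fields must not contain commas because join splits on every comma.

```shell
tmp=$(mktemp -d)
cat > "$tmp/city.csv" <<'EOF'
Aachen,D,"Nordrhein Westfalen",247113,NULL,NULL
Aalborg,DK,Denmark,113865,10,57
Aarhus,DK,Denmark,194345,10.1,56.1
EOF
cat > "$tmp/country.csv" <<'EOF'
Germany,D,Berlin,Berlin,356910,83536115
Denmark,DK,Copenhagen,Denmark,43070,524963
Spain,E,Madrid,Madrid,504750,39181114
EOF

# join needs both inputs sorted on the join field (column 2 = country code).
LC_ALL=C sort -t, -k2,2 "$tmp/city.csv"    > "$tmp/city.sorted"
LC_ALL=C sort -t, -k2,2 "$tmp/country.csv" > "$tmp/country.sorted"

# -1 2 -2 2: join on field 2 of each file.
# -o: city name, country name, province, population.
joined=$(LC_ALL=C join -t, -1 2 -2 2 -o 1.1,2.1,1.3,1.4 \
  "$tmp/city.sorted" "$tmp/country.sorted")

echo "$joined"
rm -rf "$tmp"
```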
Andreas Schmidt, DBKDA 2017
Compare Operator (comm)
• Options:
• -1: suppress column 1 (lines only in file 1)
• -2: suppress column 2 (lines only in file 2)
• -3: suppress column 3 (lines in both files)
• --total: output a summary
• file1.txt
Barcelona
Bern
Chamonix
Karlsruhe
Pisa
Porto
Rio
• file2.txt
Andorra
Barcelona
Berlin
Pisa
Porto
• Intersection:
$ cat file*.txt | sort | uniq -d
Barcelona
Pisa
Porto
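The same set operations can be written with comm and the column options. A minimal sketch, using the two sample lists from above recreated in a temporary directory; comm requires sorted input, which these lists already are.

```shell
tmp=$(mktemp -d)
printf '%s\n' Barcelona Bern Chamonix Karlsruhe Pisa Porto Rio > "$tmp/file1.txt"
printf '%s\n' Andorra Barcelona Berlin Pisa Porto > "$tmp/file2.txt"

# Intersection: suppress columns 1 and 2, keep lines common to both files.
both=$(LC_ALL=C comm -12 "$tmp/file1.txt" "$tmp/file2.txt")

# Difference: lines that appear only in file1.txt.
only1=$(LC_ALL=C comm -23 "$tmp/file1.txt" "$tmp/file2.txt")

# Difference: lines that appear only in file2.txt.
only2=$(LC_ALL=C comm -13 "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$both" "--" "$only1" "--" "$only2"
rm -rf "$tmp"
```

Unlike the uniq -d variant, comm does not report a line as common just because it occurs twice within one file.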
• Character classes:
grep '[0-9]' city.csv # print all lines with a digit in it
grep -v '[0-9]' city.csv # print all lines without a digit in it
grep '[A-Za-z]' numeric.data # all lines with at least one letter
grep '[^AEIOUaeiou]' city.csv # lines with at least one character that is not a vowel
• Special characters:
• [ ] : definition of a character class
• . : matches any character
• ^ : matches the beginning of a line, or negates inside a character class
• $ : matches end of line
• \b : represents a word boundary
grep -a '^The' The-Adventures-of-Tom-Sawyer.txt
grep -a -v '^$' The-Adventures-of-Tom-Sawyer.txt
grep -a -i '\bhouse' The-Adventures-of-Tom-Sawyer.txt
• Examples:
egrep ',(USA|TR),' city.csv
egrep 'St\.' city.csv # matches e.g. St. Louis, but not Stanford
egrep 'E([a-z])\1' city.csv # matches e.g. Essex
• Repetition
• ? : optional
• * : zero or more times
• + : one or more times
• {n} : exactly n times
• {n,m} : between n and m times
• Examples:
egrep -a -i '\bhouses?\b' The-Adventures-of-Tom-Sawyer.txt # matches house, houses
egrep -a 'X{1,3}' The-Adventures-of-Tom-Sawyer.txt
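The operators above can be checked against a tiny data set. This sketch uses grep -E (the modern spelling of egrep) and a hypothetical four-line file; back-references in extended regular expressions work in GNU and BSD grep but are not guaranteed by POSIX.

```shell
tmp=$(mktemp -d)
cat > "$tmp/demo.csv" <<'EOF'
St. Louis,USA,Missouri
Stanford,USA,California
Essex,GB,Essex
Istanbul,TR,Istanbul
EOF

# Alternation: count lines whose country code is USA or TR.
alt=$(grep -E -c ',(USA|TR),' "$tmp/demo.csv")

# Escaped dot: a literal '.', so Stanford does not match.
dot=$(grep -E 'St\.' "$tmp/demo.csv")

# Back-reference: E followed by the same lower-case letter twice.
back=$(grep -E 'E([a-z])\1' "$tmp/demo.csv")

# Optional character: matches house and houses, but not housed.
opt=$(printf 'house\nhouses\nhoused\n' | grep -E -c '^houses?$')

# Bounded repetition: lines containing between one and three X in a row.
rep=$(printf 'X\nXXXX\nY\n' | grep -E -c 'X{1,3}')

echo "$alt / $dot / $back / $opt / $rep"
rm -rf "$tmp"
```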
• s: substitute
• Replace all occurrences of the word D with GER
sed 's/\bD\b/GER/g' city.csv > city2.csv
• Replace „Stuttgart“ with „Stuttgart am Neckar“ (extended regexp)
sed -r '/Stuttgart/ s/^([A-Za-z]+)/\1 am Neckar/' city.csv
• Replace all occurrences of NULL in a line with \N (in-place substitution)
sed -i 's/\bNULL\b/\\N/g' city.csv
• p: print (typically used with the default printing behaviour switched off by the -n option)
• print lines 10 to 20 (resp. lines 5-10, 23, and 56-71)
sed -n 10,20p city.csv
sed -n '5,10p;23p;56,71p' city.csv
• print the lines from the record for ’Sapporo’ up to and including the record for ’Seattle’
sed -n '/^Sapporo/,/^Seattle/p' city.csv
• c: change
• Replace entry of Biel
sed '/^Biel\b/ c Biel,CH,BE,53308,47.8,7.14' city.csv
• a: append
• Underline each CHAPTER
sed '/^CHAPTER/ a ------------' The-Adventures-of-Tom-Sawyer.txt
• ...
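The s and p commands can be tried on a few lines of sample data. This is a sketch with a shortened, temporary copy of city.csv; to stay portable beyond GNU sed it anchors the country code with commas instead of \b, which is a deliberate deviation from the commands above.

```shell
tmp=$(mktemp -d)
cat > "$tmp/city.csv" <<'EOF'
Aachen,D,"Nordrhein Westfalen",247113,NULL,NULL
Aalborg,DK,Denmark,113865,10,57
Aarau,CH,AG,NULL,NULL,NULL
Aarhus,DK,Denmark,194345,10.1,56.1
EOF

# s: replace the country code D (delimited by commas) with GER.
subst=$(sed 's/,D,/,GER,/' "$tmp/city.csv" | head -n1)

# s with the g flag: replace every NULL in line 3 with \N.
nulls=$(sed -n '3p' "$tmp/city.csv" | sed 's/NULL/\\N/g')

# p with -n: print only lines 2 to 3.
range=$(sed -n '2,3p' "$tmp/city.csv")

# p with a pattern range: from the Aalborg record up to the Aarau record.
pat=$(sed -n '/^Aalborg/,/^Aarau/p' "$tmp/city.csv")

printf '%s\n' "$subst" "$nulls" "$range"
rm -rf "$tmp"
```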
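# script: average.awk
# usage (assumed): awk -F, -f average.awk city.csv
BEGIN { sum = 0
num = 0
}
# pattern: only records whose 4th field (population) is not NULL
$4!="NULL" {
sum += $4
num++
}
# END block added here as an assumption, so that the script prints the average
END { if (num > 0) print sum / num }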
• File: address.txt
Andreas Schmidt
KIT
Germany

Petre Dini
IARIA
USA

Fritz Laux
University Reutlingen
Germany
• Script: demo2.awk
BEGIN {
FS="\n"
RS="\n\n"
ORS=""
}
{
for (i=1; i<NF; i++) {
print $i","
}
print $NF"\n"
}
• Command:
awk -f demo2.awk address.txt
• Output:
Andreas Schmidt,KIT,Germany
Petre Dini,IARIA,USA
Fritz Laux,University Reutlingen,Germany
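As a possible variant of the multi-line record example, awk's paragraph mode (RS set to the empty string) also treats blank lines as record separators and works in POSIX awk. This is a sketch, not the script from the slide; the address file is recreated in a temporary directory.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Andreas Schmidt' KIT Germany '' \
  'Petre Dini' IARIA USA '' \
  'Fritz Laux' 'University Reutlingen' Germany > "$tmp/address.txt"

# RS=""   : paragraph mode, records are separated by blank lines.
# FS="\n" : every line of a record becomes one field.
# $1 = $1 : forces awk to rebuild the record using OFS (a comma).
csv=$(awk 'BEGIN { RS = ""; FS = "\n"; OFS = "," } { $1 = $1; print }' \
  "$tmp/address.txt")

echo "$csv"
rm -rf "$tmp"
```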
• Example:
• File:
12,45,Test, 123.56
13,21,"Test without comma", 345.2
14,71,"Test, with comma", 0.7
• Command:
awk -F, '{print $1" : "$3}' fpat-demo.txt
• Output:
12 : Test
13 : "Test without comma"
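14 : "Test
WRONG: the comma inside the quoted third field is treated as a field separator.
• Correct output (fields defined by their content, e.g. with gawk's FPAT):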
12 : Test
13 : "Test without comma"
14 : "Test, with comma"
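A minimal sketch of how the correct output can be obtained with FPAT, which describes what a field looks like instead of what separates fields. FPAT is a gawk extension, so the second part only runs when gawk is installed; the pattern is the one commonly given for simple CSV data and does not handle embedded quotes or empty fields.

```shell
tmp=$(mktemp -d)
cat > "$tmp/fpat-demo.txt" <<'EOF'
12,45,Test, 123.56
13,21,"Test without comma", 345.2
14,71,"Test, with comma", 0.7
EOF

# Splitting on commas breaks the quoted field in the last line.
naive=$(awk -F, '{print $1" : "$3}' "$tmp/fpat-demo.txt" | tail -n1)
echo "$naive"

# A field is either a run of non-comma characters or a quoted string.
if command -v gawk >/dev/null 2>&1; then
  fixed=$(gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
                { print $1" : "$3 }' "$tmp/fpat-demo.txt" | tail -n1)
  echo "$fixed"
fi
rm -rf "$tmp"
```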
• If you look at the files downloaded from Project Gutenberg, you can
identify some boilerplate text at the beginning and the end of each book. Which
lines separate the literary text from the boilerplate text?
• Write a command that removes the boilerplate text (Hint: use sed, head, tail)
• Unfortunately, most commands use a different option letter to specify the field
separator ... and the default separator differs as well
Table 1:
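A short sketch of the separator options of some common commands (cut -d, sort -t, awk -F, paste -d). The sample file and the intermediate file names are hypothetical.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'b,2' 'a,10' 'c,1' > "$tmp/data.csv"

# cut: -d sets the field delimiter (default: TAB).
first=$(cut -d, -f1 "$tmp/data.csv" | head -n1)

# sort: -t sets the field separator (default: blank-separated fields).
smallest=$(sort -t, -k2,2n "$tmp/data.csv" | head -n1)

# awk: -F sets the field separator (default: runs of whitespace).
sum=$(awk -F, '{ s += $2 } END { print s }' "$tmp/data.csv")

# paste: -d sets the OUTPUT delimiter (default: TAB).
cut -d, -f1 "$tmp/data.csv" > "$tmp/names.txt"
cut -d, -f2 "$tmp/data.csv" > "$tmp/values.txt"
pasted=$(paste -d';' "$tmp/names.txt" "$tmp/values.txt" | head -n1)

echo "$first $smallest $sum $pasted"
rm -rf "$tmp"
```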
update city
set population=308135
where name='Karlsruhe' and country='D';
update city
set population=round(population*1.05)
where country='D';
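One way the two update statements might be emulated with awk is sketched below. The Karlsruhe row (province BW, population 100000) is a made-up sample record, and int(x + 0.5) is used as a simple stand-in for SQL's round() on non-negative numbers.

```shell
tmp=$(mktemp -d)
cat > "$tmp/city.csv" <<'EOF'
Aachen,D,"Nordrhein Westfalen",247113,NULL,NULL
Karlsruhe,D,BW,100000,NULL,NULL
Aalborg,DK,Denmark,113865,10,57
EOF

# update city set population=308135 where name='Karlsruhe' and country='D';
awk -F, -v OFS=, '$1 == "Karlsruhe" && $2 == "D" { $4 = 308135 } { print }' \
  "$tmp/city.csv" > "$tmp/step1.csv"

# update city set population=round(population*1.05) where country='D';
# Rows with a NULL population are left unchanged.
updated=$(awk -F, -v OFS=, \
  '$2 == "D" && $4 != "NULL" { $4 = int($4 * 1.05 + 0.5) } { print }' \
  "$tmp/step1.csv")

echo "$updated"
rm -rf "$tmp"
```

The where clause becomes the awk pattern and the set clause becomes the action; the trailing { print } passes every row through, whether it was modified or not.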
172d
776d
839d
1094d
1749d
1756d
1904d
2189d
2921d
awk -F"," ' $4 { print $4, $1 }' city.csv | sort -nr | head -n30 \
> biggest-cities.data
• file biggest-cities.gpt
set terminal postscript
set output "city-population.ps"
set title "City Population"
set style fill solid
set style data histogram
set xtic nomirror rotate by -60
plot "biggest-cities.data" \
using 1:xtic(2) title ''
• Generate:
$ gnuplot biggest-cities.gpt
$ gnuplot.exe temperature.gpt
• http://www.theunixschool.com/p/awk-sed.html
• Dale Dougherty, Arnold Robbins. sed & awk, 2nd Edition. UNIX Power Tools.
O’Reilly, 1997
• Arnold Robbins. sed and awk Pocket Reference, 2nd Edition. O’Reilly, 2002
• Ramesh Natarajan. sed and awk 101 hacks. http://www.thegeekstuff.com/sed-awk-101-hacks-ebook/
• gnuplot homepage: http://www.gnuplot.info/
ENTITY=New_York
• Input:
Innsbruck,Austria,Tyrol,118000
Vienna,Austria,Vienna,1583000
Bregenz,Austria,Vorarlberg,NULL
Kabul,Afghanistan,Afghanistan,892000
"Saint Johns","Antigua and Barbuda",...
Tirane,Albania,Albania,192000
Korce,Albania,Albania,52000
Elbasan,Albania,Albania,53000
Vlore,Albania,Albania,56000
• Output:
Innsbruck,1,Tyrol,118000
Vienna,1,Vienna,1583000
Bregenz,1,Vorarlberg,NULL
Kabul,2,Afghanistan,892000
"Saint Johns",3,"Antigua and Barbuda",36000
Tirane,4,Albania,192000
Korce,4,Albania,52000
Elbasan,4,Albania,53000
Vlore,4,Albania,56000
• lookup_table.csv:
1,Austria
2,Afghanistan
3,"Antigua and Barbuda"
4,Albania
5,Andorra
6,Angola
7,Armenia
8,Australia
# assumption: called with -F, so that input fields are split on commas
BEGIN { # initialisation stuff
key=0;
OFS=",";
}
{
if (has_value[$2]>0) # reuse existing key
out=has_value[$2];
else {
has_value[$2] = ++key; # generate next key value
out=key;
print key","$2 > "lookup_table.csv" # write key/value pair in file
}
$2=out;
print $0;
}