The Ninth International Conference on Advances in Databases, Knowledge, and Data Applications

May 21 - 25, 2017 - Barcelona, Spain

Data Manipulation and Data Transformation using the Shell

Andreas Schmidt (1,2) and Steffen G. Scholz (2)

(1) Department of Informatics and Business Information Systems, University of Applied Sciences Karlsruhe, Germany
(2) Institute for Applied Computer Sciences, Karlsruhe Institute of Technology, Germany

Andreas Schmidt DBKDA - 2017 1/64

Resources available

http://www.smiffy.de/dbkda-2017/ (1)
• Slideset
• Exercises
• Command refcard
• Example datasets

(1) All materials copyright 2017 by Andreas Schmidt


Outlook

• Overview
• Search and Inspect
• File operations
• Excursus Regular Expressions
• sed & awk
• Emulating SQL with the Shell
• Summary

+ 3 hands-on exercises:
• First contact
• Analyzing text
• sed & awk


Data Processing with the Shell

• Architectural Pattern: Pipes and Filters (Douglas McIlroy, 1973)
• Data exchange between processes
• Loose coupling
• POSIX Standard
• Filters represent data sources and data sinks

(Figure: a chain of filters connected by pipes; in the second chain the first stage is a command.)


Shell commands in the Linux/Unix/Cygwin Environment

• Input/output channels
• Standard input (STDIN)
• Standard output (STDOUT)
• Standard error (STDERR)
• Input/output redirection
• > : Redirect standard output (into file)
• < : Redirect standard input (from file)
• 2> : Redirect standard error (into file)
• >> : Redirect standard output (append to file)
• | : Pipe operator: connects standard output of a command with standard input of the next command
• Example:
cut -d, -f1 city.csv | sort | uniq -c | sort -nr | awk '$1>1' > result.txt
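
The operators above can be tried on a few lines of throwaway data. The following is only a sketch: the scratch directory and the file sample.csv are assumptions made for illustration and are not part of the tutorial datasets.

```shell
# Scratch directory and sample data are made up for this illustration.
workdir=$(mktemp -d)
printf 'Paris,F\nLyon,F\nBerlin,D\nParis,TX\n' > "$workdir/sample.csv"   # > creates or overwrites
printf 'Rome,I\n' >> "$workdir/sample.csv"                               # >> appends

# Pipe: count each city name and keep the names that occur more than once.
cut -d, -f1 "$workdir/sample.csv" | sort | uniq -c | sort -nr |
  awk '$1>1 {print $2}' > "$workdir/result.txt"

# 2> sends the error message of a failing command into a file.
ls "$workdir/does-not-exist" 2> "$workdir/errors.txt" || true

# < feeds a file to a command via STDIN.
line_count=$(wc -l < "$workdir/sample.csv" | tr -d ' ')
duplicates=$(cat "$workdir/result.txt")
echo "lines: $line_count, duplicated names: $duplicates"
```

Reading with < instead of passing a file name makes wc print only the number, which is convenient when the value is stored in a variable.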


Overview of Operations

• File inspection
• Column/Row extraction
• Filtering
• Searching
• String substitution
• Splitting and Merging files
• Sorting
• Counting
• Insert/Append/Delete lines
• Join-Operations
• Aggregation
• Set Operations
• Compression
• Operations on compressed data


Example File city.csv

Aachen,D,"Nordrhein Westfalen",247113,NULL,NULL
Aalborg,DK,Denmark,113865,10,57
Aarau,CH,AG,NULL,NULL,NULL
Aarhus,DK,Denmark,194345,10.1,56.1
Aarri,WAN,Nigeria,111000,NULL,NULL
Aba,WAN,Nigeria,264000,NULL,NULL
Abakan,R,"Rep. of Khakassiya",161000,NULL,NULL
Abancay,PE,Apurimac,NULL,NULL,NULL
Abeokuta,WAN,Nigeria,377000,NULL,NULL
Aberdeen,GB,Grampian,219100,NULL,NULL
Aberystwyth,GB,Ceredigion,NULL,NULL,NULL
Abidjan,CI,"Cote dIvoire",NULL,-3.6,5.3
Abilene,USA,Texas,108476,-99.6833,32.4167
"Abu Dhabi",UAE,"United Arab Emirates",363432,54.36,24.27
Abuja,WAN,Nigeria,NULL,NULL,NULL
Acapulco,MEX,Guerrero,515374,NULL,NULL


Example File country.csv

Austria,A,Vienna,Vienna,83850,8023244
Afghanistan,AFG,Kabul,Afghanistan,647500,22664136
"Antigua and Barbuda",AG,"Saint Johns","Antigua and Barbuda",440,65647
Albania,AL,Tirane,Albania,28750,3249136
Andorra,AND,"Andorra la Vella",Andorra,450,72766
Angola,ANG,Luanda,Luanda,1246700,10342899
Armenia,ARM,Yerevan,Armenia,29800,3463574
Australia,AUS,Canberra,"Australia Capital Territory",7686850,18260863
Azerbaijan,AZ,Baku,Azerbaijan,86600,7676953
Belgium,B,Brussels,Brabant,30510,10170241
Bangladesh,BD,Dhaka,Bangladesh,144000,123062800
Barbados,BDS,Bridgetown,Barbados,430,257030
Benin,BEN,Porto-Novo,Benin,112620,5709529
"Burkina Faso",BF,Ouagadougou,"Burkina Faso",274200,10623323
Bulgaria,BG,Sofia,Bulgaria,110910,8612757
Bhutan,BHT,Thimphu,Bhutan,47000,1822625


General comment

• Most commands accept input from a file or from STDIN. If no (or not enough) input files are given, the input is expected to come from STDIN:
head -n4 my-file.txt
cat -n my-file.txt | head -n4

• Most commands have many options that cannot be explained in detail here. To get an overview of what a command can do, simply type

man command

• Example:

man head

File Inspection

• Show content of a file

cat HelloWorld.java
• Concatenate files and print them to STDOUT
cat german_cities.csv french_cities.csv > cities.csv
cat *_cities.csv > cities.csv
• Add line numbers to each line in file(s)
cat -n city.csv
• Create a file with input from STDIN:
cat > grep-strings.txt
Obama
Climate
CTRL-D


File Inspection

• View first 5 lines from file:

head -n5 city.csv
• View last 4 lines of a file with line numbers:
cat -n city.csv| tail -n4
• View content of file, starting from line 40 (useful to remove header lines):

tail -n +40 city.csv

• Print all but the last 2 lines (useful to remove trailing lines):
head -n -2 city.csv
• Count the number of lines, words and bytes
wc city.csv
• Count the number of lines
wc -l city.csv


less command

• Page-by-page scrolling through a file or STDIN (also with search capability)

• Examples:
less city.csv
ls -l | less

man head # man pages are displayed with less

• Commands:
• q : quit less
• > : go to end of file
• < : go to beginning of file
• f : scroll forward one page
• b : scroll backward one page
• e, RETURN, ↓ : scroll forward one line
• y, ↑ : scroll backward one line
• nd : scroll forward n lines (e.g. 20d)
• mb : scroll backward m lines
• ng : go to line n


less commands (2)

• /pattern : search forward for the next line containing pattern

• ?pattern : search backward for the previous line containing pattern
• n : repeat previous search
• N : repeat previous search in reverse direction
• &pattern : display only lines containing the pattern (type &<ret> to quit)
• !command : execute a shell command
• v : invoke the standard editor for the file (at current position, if supported)

Type man less for the complete reference.


• Print lines matching pattern (case sensitive)

grep USA city.csv
• Print matching lines in a binary file
grep -a USA kddNuggests.data
• Print lines matching pattern (case insensitive)
grep -i town city.csv
• Print lines containing the regular expression (City starting with ’S’, ending with ’g’)
grep -E 'S[a-z]+g,' city.csv # same as egrep
• Print only lines, not containing the String NULL
grep -v NULL city.csv
• Prefix each line of output with the line number
grep -n NULL city.csv


• Print all numbers between 1000 and 9999 which contain two consecutive 5s
seq 1000 9999 | grep 55
• Print only the matching part (e.g. 'Salzburg' instead of the whole line)
grep -E -o 'S[a-z]+g' city.csv
• Count the number of lines in which the word "Karlsruhe" appears
grep -c Karlsruhe famous-cities.txt
• Look for lines containing words from a file
grep -f grep-strings.txt -E newsCorpora.csv

• file: grep-strings.txt
Obama
Climate


Compression

• gzip compresses files based on LZ77 coding (typ. 60%-70% reduction in size)

• bzip2 compresses files based on the Burrows-Wheeler transform and Huffman coding
• zcat, bzcat, zgrep, bzgrep work on compressed files
• Example:
• Size:
big.txt: 8.9 GB
big.txt.gz: 2.4 GB (gzip -c big.txt > big.txt.gz)
big.txt.bz2: 2.0 GB (bzip2 -c big.txt > big.txt.bz2)

• Runtime:
grep something big.txt | wc -l # ~ 20sec.
zgrep something big.txt.gz | wc -l # ~ 80 sec.
bzgrep something big.txt.bz2 | wc -l # ~ 380 sec.
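
The same workflow can be tried at small scale. The sketch below assumes a generated file numbers.txt (not part of the tutorial data) and uses gzip -dc, which decompresses to STDOUT and can be piped into grep just like the output of zcat.

```shell
# numbers.txt is generated here only for illustration.
workdir=$(mktemp -d)
seq 1 20000 > "$workdir/numbers.txt"

# -c writes the compressed data to STDOUT, so the original file is kept.
gzip -c "$workdir/numbers.txt" > "$workdir/numbers.txt.gz"

# Search the plain file and the compressed copy; both must give the same count.
plain_hits=$(grep -c '777' "$workdir/numbers.txt")
gz_hits=$(gzip -dc "$workdir/numbers.txt.gz" | grep -c '777')

# Compare the sizes in bytes.
plain_size=$(wc -c < "$workdir/numbers.txt" | tr -d ' ')
gz_size=$(wc -c < "$workdir/numbers.txt.gz" | tr -d ' ')
echo "hits: $plain_hits / $gz_hits, bytes: $plain_size -> $gz_size"
```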


Exercise I

• Download the book "The Adventures of Tom Sawyer" from
http://www.gutenberg.org/ebooks/74 (utf-8 format).
• For Cygwin users: convert the file to Unix format with the command:
dos2unix.exe <file>
• Browse (using less) through the pages of the book and use some of the commands explained before (see "less command")
• Go to line 1234 of the file. What is the third word?
• How many chapters does the book have? (try also the -a option for grep)
• Count the number of empty lines
• Execute grep Tom <file> and grep -o Tom <file>. What is the difference?
• How often do the names "Tom" and "Huck" appear in the book?
• How often do they appear together in one line?


File operations

• Split file by row (here, after each 10 lines)

split --lines=10 city.csv
• Print selected parts of lines from each file to standard output (-d: column separator, -f: output columns)
cut -d',' -f1,4 city.csv
• Output bytes 10 to 20 from each line
cut -b10-20 data.fixed
• Merge lines of files (-d: output delimiter)
paste -d'\t' city_name.txt city_pop.txt > city_name_pop.csv


Summary File operations

(Figure: overview of the file operations cat, paste, cut and split.)


tr command

• Translate, squeeze, and/or delete characters from standard input, writing to standard output.
• Translate: mapping between characters, e.g.
• {A->a, B->b, ...}
• {A->*, E->*, I->*, O->*, U->*}
• Delete:
• {0,1,2,3,4,5,6,7,8,9}
• Squeeze:
• {aa...a -> a, xx...x -> x, \n\n...\n->\n}
• Predefined character classes/ASCII codes:
• [:punct:], [:alnum:], [:alpha:], [:blank:], [:upper:], [:lower:]
• \xxx : octal ASCII number (e.g. <space> -> \040)


• Examples
• Translate to lowercase:
tr 'A-Z' 'a-z' < The-Adventures-of-Tom-Sawyer.txt
• Replace <newline> with <space>
tr '\n' ' ' < short-story.txt > one-liner.txt
• Delete all (") characters
tr -d '"' < city.csv
• Delete all non-alphanumeric and non-whitespace characters (-c: complement of the set, -d: delete operation)
tr -c -d '[:alnum:][:space:]' < The-Adventures-of-Tom-Sawyer.txt
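
The examples above cover translate and delete; the squeeze operation and the vowel mapping from the previous slide can be sketched as follows. All input strings are made up for illustration.

```shell
# Squeeze: collapse runs of spaces and runs of newlines into a single character each.
squeezed=$(printf 'a   b\n\n\nc\n' | tr -s ' \n')

# Delete: remove all digits.
digits_removed=$(printf 'Room 101, Floor 3\n' | tr -d '0-9')

# Translate: map every uppercase vowel to '*'.
masked=$(printf 'HELLO\n' | tr 'AEIOU' '*****')

printf '%s\n%s\n%s\n' "$squeezed" "$digits_removed" "$masked"
```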


sort

• Sort lines of text files
• Write sorted concatenation of all FILE(s) to standard output.
• With no FILE, or when FILE is -, read standard input.
• Sorting can be alphabetic or numeric, ascending or descending, case sensitive or insensitive
• Column(s)/bytes to be sorted can be specified
• Random sort option (-R)
• Removal of identical lines (-u)
• Examples:
• sort file city.csv starting with the second column (field delimiter: ,)
sort -k2 -t',' city.csv
• merge content of file1.txt and file2.txt and sort the result
sort file1.txt file2.txt


sort - examples

• sort file by country code and, as a second criterion, by population (numeric, descending)

sort -t, -k2,2 -k4,4nr city.csv

• -t, : field separator is ,
• -k2,2 : first sort criterion, from column 2 to column 2
• -k4,4nr : second sort criterion, from column 4 to column 4, numeric (-n), descending (-r)


sort - examples

• Sort by the second and third character of the first column
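
The command for this slide is not preserved in the text. One possible form, assuming the FIELD.CHAR key syntax of sort and a few made-up sample lines, is sketched below; it is not necessarily the command the slide showed.

```shell
# -k1.2,1.3 : the key starts at character 2 of field 1 and ends at character 3 of field 1.
# LC_ALL=C makes the comparison byte-wise and therefore reproducible.
sorted=$(printf 'Aba,WAN\nAbuja,WAN\nAachen,D\n' | LC_ALL=C sort -t, -k1.2,1.3)

first_line=$(printf '%s\n' "$sorted" | head -n1)
last_line=$(printf '%s\n' "$sorted" | tail -n1)
printf '%s\n' "$sorted"
```

For city.csv the corresponding call would be sort -t, -k1.2,1.3 city.csv.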


Further File operations

• join - join lines of two files on a common field

• Both inputs must be sorted on the join field (alphabetic, not numeric)
• Output fields can be specified (-o FILE.FIELD,...)
• Example (join field: column 2 of both files; - stands for STDIN):

sort -t, -k2,2 city.csv | join -t, -1 2 -2 2 \
    -o 1.1,2.1,1.3,1.4 - country.csv


Join Operation

• city.csv
Aachen,D,"Nordrhein Westfalen",247113,NULL,NULL
Aalborg,DK,Denmark,113865,10,57
Aarau,CH,AG,NULL,NULL,NULL
Aarhus,DK,Denmark,194345,10.1,56.1
Aarri,WAN,Nigeria,111000,NULL,NULL
...

• country.csv
...
Germany,D,Berlin,Berlin,356910,83536115
Djibouti,DJI,Djibouti,Djibouti,22000,42764
Denmark,DK,Copenhagen,Denmark,43070,524963
Algeria,DZ,Algiers,Algeria,2381740,2918303
Spain,E,Madrid,Madrid,504750,39181114
...

• Command:
sort -t, -k2,2 city.csv | join -t, -1 2 -2 2 \
    -o 1.1,2.1,1.3,1.4 - country.csv

• Output:
Aachen,Germany,"Nordrhein Westfalen",247113
Aalborg,Denmark,Denmark,113865
Aarau,Switzerland,AG,NULL
Aarhus,Denmark,Denmark,194345
Aarri,Nigeria,Nigeria,111000
Aba,Nigeria,Nigeria,264000
Abakan,Russia,"Rep. of Khakassiya",161000

Compare Operator

• comm - compare two sorted files line by line

• file1: Barcelona, Bern, Chamonix, Karlsruhe, Pisa, Porto, Rio
• file2: Andorra, Barcelona, Berlin, Pisa, Porto

• Output of comm file1 file2 (three columns):

only in file1   only in file2   in file1 and file2
                Andorra
                                Barcelona
                Berlin
Bern
Chamonix
Karlsruhe
                                Pisa
                                Porto
Rio

• Options:
• -1: suppress column 1
• -2: suppress column 2
• -3: suppress column 3
• --total: output a summary


uniq

• report or omit repeated lines
• Filters adjacent matching lines from INPUT, so set semantics require sorted input!
• Range of comparison can be specified (first n chars, skip first m chars)
• Options:
• -c: count number of occurrences
• -d: only print duplicate lines
• -u: only print unique lines
• -i: ignore case


uniq - example

• file1.txt
Barcelona
Bern
Chamonix
Karlsruhe
Pisa
Porto
Rio

• file2.txt
Andorra
Barcelona
Berlin
Pisa
Porto

• Intersection:
$ cat file*.txt | sort | uniq -d
Barcelona
Pisa
Porto


• Counting:
cat file*.txt | sort | uniq -c
1 Andorra
2 Barcelona
1 Berlin
1 Bern
1 Chamonix
1 Karlsruhe
2 Pisa
2 Porto
1 Rio
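
The remaining set operations follow the same pattern as intersection and counting. The sketch below recreates the two example files in a scratch directory (an assumption for illustration) and derives union, symmetric difference and difference.

```shell
# Recreate file1.txt and file2.txt from the slides in a scratch directory.
workdir=$(mktemp -d)
printf 'Barcelona\nBern\nChamonix\nKarlsruhe\nPisa\nPorto\nRio\n' > "$workdir/file1.txt"
printf 'Andorra\nBarcelona\nBerlin\nPisa\nPorto\n' > "$workdir/file2.txt"

# Union: every city once (sort -u is equivalent to sort | uniq).
union_count=$(cat "$workdir"/file*.txt | LC_ALL=C sort -u | wc -l | tr -d ' ')

# Symmetric difference: cities that appear in exactly one of the two files.
only_once_count=$(cat "$workdir"/file*.txt | LC_ALL=C sort | uniq -u | wc -l | tr -d ' ')

# Difference file1 minus file2: comm with columns 2 and 3 suppressed.
only_in_file1=$(LC_ALL=C comm -23 "$workdir/file1.txt" "$workdir/file2.txt" | paste -s -d ' ' -)

echo "union: $union_count, in exactly one file: $only_once_count, only in file1: $only_in_file1"
```

The uniq-based variants only behave like set operations if each input file contains no duplicate lines itself.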


Exercise II - (Duration 15 min.)

• Count the words and lines in the book 'The-Adventures-of-Tom-Sawyer.txt'

• What does the following command do?
grep -o -E '[A-Za-z]+' The-Adventures-of-Tom-Sawyer.txt
• Translate all words of 'The-Adventures-of-Tom-Sawyer.txt' into lowercase using tr
• Count how often each word appears in this book (hint: use uniq)
• Order the result, starting with the word with the highest frequency. Which word is it?
• Write all the above steps in one statement (using pipes)
• Compare the result with the result from the following book: http://www.gutenberg.org/files/2701/2701-0.txt. At which position do the first book-specific words appear?
• Compare the 20 most frequent words of each book. How many do they have in common? (hint: use head, cut, comm)


Excursus Regular Expressions

• Character classes:
grep '[0-9]' city.csv # print all lines with a digit in it
grep -v '[0-9]' city.csv # print all lines without a digit in it
grep '[A-Za-z]' numeric.data # all lines with at least one letter
grep '[^AEIOUaeiou]' city.csv # lines with at least one non-vowel character
• Special characters:
• [ ] : definition of a character class
• . : matches any character
• ^ : matches beginning of line, or negates inside a character class
• $ : matches end of line
• \b : represents a word boundary
grep -a '^The' The-Adventures-of-Tom-Sawyer.txt
grep -a -v '^$' The-Adventures-of-Tom-Sawyer.txt
grep -a -i '\bhouse' The-Adventures-of-Tom-Sawyer.txt


Excursus Regular Expressions (2)

• Special characters (continued):

• | : alternative
• ( ) : grouping, also used for back referencing
• \n : back reference to the n-th (...) group (n numeric)

• Examples:
egrep ',(USA|TR),' city.csv
egrep 'St\.' city.csv # e.g. St. Louis, but not Stanford
egrep 'E([a-z])\1' city.csv # e.g. Essex

To match one of the characters . ( ) [ ] { } + \ ^ $ literally, precede it with a backslash \


Excursus Regular Expressions (3)

• Repetition
• ? : optional
• * : zero or more times
• + : one or more times
• {n} : exactly n times
• {n,m} : between n and m times

• Examples:
egrep -a -i '\bhouses?\b' The-Adventures-of-Tom-Sawyer.txt # matches house, houses
egrep -a 'X{1,3}' The-Adventures-of-Tom-Sawyer.txt # matches X, XX, XXX
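
The patterns from this excursus can be checked against a handful of invented words. This is only a sketch: grep -E is used in place of egrep, and -w (match whole words) stands in for the \b word boundaries, which not every grep implementation supports.

```shell
# Back reference: 'E' followed by the same lowercase letter twice.
doubled=$(printf 'Essex\nExeter\nEllesmere\n' | grep -E 'E([a-z])\1' | paste -s -d ' ' -)

# Optional character: the whole words "house" and "houses", but not "household".
house_lines=$(printf 'house\nhouses\nhousehold\n' | grep -E -c -w 'houses?')

# Bounded repetition, anchored to the whole line: X, XX or XXX only.
roman_lines=$(printf 'X\nXXXX\nIV\n' | grep -E -c '^X{1,3}$')

echo "doubled: $doubled, house lines: $house_lines, roman lines: $roman_lines"
```

Without the ^ and $ anchors, X{1,3} would also match the line XXXX, because the pattern then only has to match somewhere within the line.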


String Substitution with sed

• sed - Stream Editor

• non-interactive, controlled by a script
• line-oriented text processing
• short scripts are typically given as a parameter (-e option), longer scripts as files (-f option)
• Possible operations: Insert, Substitute, Delete, Append, Change, Print
• Commands in the script can take an optional address, specifying the line(s) on which they are performed
• An address can be a single line number or a regular expression
• An address can be an interval (start, stop)
• A loop executes the script commands on each input line
• Default behavior: print each processed line to stdout (suppress with: -n)


sed commands

• s: substitute
• Replace all occurrences of D with GER
sed 's/\bD\b/GER/g' city.csv > city2.csv
• Replace "Stuttgart" with "Stuttgart am Neckar" (extended regexp)
sed -r '/Stuttgart/ s/^([A-Za-z]+)/\1 am Neckar/' city.csv
• Replace all occurrences of NULL in a line with \N (in-place substitution)
sed -i 's/\bNULL\b/\\N/g' city.csv

• p: print (typically used with default printing behaviour off (-n option))
• print lines 10 to 20 (and, in the second command, lines 5-10, 23 and 56-71)
sed -n 10,20p city.csv
sed -n '5,10p;23p;56,71p' city.csv
• print lines from the dataset about 'Sapporo' up to and including the dataset about 'Seattle'
sed -n '/^Sapporo/,/^Seattle/p' city.csv


• i: insert
• Insert dataset about Karlsruhe at line 2
sed '2i Karlsruhe,D,"Baden Wuerttemberg",301452,49.0,6.8' city.csv
• d: delete
• delete Aachen (inplace)
sed -i '/Aachen/ d' city.csv
• delete all empty lines
sed '/^ *$/d' The-Adventures-of-Tom-Sawyer.txt
• delete lines 2-10
sed '2,10d' city.csv
• delete all <script>..</script> sections in a file
sed -Ei '/<script>/,/<\/script>/d' jaccard.html
• delete from <h2>Navigation menu</h2> to end of file
sed -i '/<h2>Navigation menu<\/h2>/,$d' jaccard.html


sed Examples

• c: change
• Replace entry of Biel
sed '/^Biel\b/ c Biel,CH,BE,53308,47.8,7.14' city.csv

• a: append
• Underline each CHAPTER
sed '/^CHAPTER/ a ------------' The-Adventures-of-Tom-Sawyer.txt

• ...
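
Several sed commands can be combined in one call with repeated -e options. The following sketch runs on made-up lines and deliberately sticks to constructs that should behave the same in GNU and BSD sed (no -i, no \b).

```shell
# Three commands applied to every input line, in this order:
#   delete empty lines, replace NULL by \N, delete the entry for Biel.
cleaned=$(printf 'Aachen,D,NULL\n\nBiel,CH,53308\nStuttgart,D,NULL\n' |
  sed -e '/^ *$/d' -e 's/NULL/\\N/g' -e '/^Biel/d')

# Address range with -n and p: print only lines 2 to 3.
middle=$(printf 'a\nb\nc\nd\n' | sed -n '2,3p')

printf '%s\n---\n%s\n' "$cleaned" "$middle"
```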


awk

• like sed, but with a powerful programming language

• filter and report writer
• ideal for processing rows and columns
• support for associative arrays
• structure: pattern { action statements }
• the special patterns BEGIN and END match before the first line is read and after the last line is read
• Access to column values via $1, $2, ... variables ($0: whole line)
• Examples:
awk -F, '$3=="Bayern" && $4 < 1000000 { print $1", "$4 }' city.csv


awk

• Calculating average population

awk -F, -f average.awk city.csv

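# script: average.awk
BEGIN {                  # runs once before the first line is read
    sum = 0
    num = 0
}

$4 != "NULL" {           # pattern: only lines with a known population
    sum += $4
    num++
}

END {                    # runs once after the last line is read
    print "Average population: " sum/num
}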


• predefined variables
• NF: number of fields in the current record
• NR: number of records read so far
• FS: field separator (default: " ", same as -F from command line)
• RS: record separator (default: \n)
• ORS: output record separator
• FPAT: field pattern (gawk only; alternative way to specify a field instead of using FS)
• FILENAME: name of the current input file


awk example: multi-line input

• Input file address.txt:
Andreas Schmidt
KIT
Germany

Petre Dini
IARIA
USA

Fritz Laux
University Reutlingen
Germany

• Command:
awk -f demo2.awk address.txt

• Script demo2.awk:
BEGIN {
    FS="\n"
    RS="\n\n"
    ORS=""
}
{
    for (i=1; i<NF; i++) {
        print $i","
    }
    print $NF"\n"
}

• Output:
Andreas Schmidt,KIT,Germany
Petre Dini,IARIA,USA
Fritz Laux,University Reutlingen,Germany
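
A multi-character RS such as "\n\n" is an extension of some awk implementations. A variant that relies only on POSIX awk is paragraph mode, RS = "", in which blank lines separate the records. The sketch below is an alternative to the script above, not the author's version, and uses two of the three example addresses as inline input.

```shell
# RS = ""   : paragraph mode, records are separated by blank lines
# FS = "\n" : every line of a paragraph becomes one field
# $1 = $1   : forces awk to rebuild the record using OFS
records=$(printf 'Andreas Schmidt\nKIT\nGermany\n\nPetre Dini\nIARIA\nUSA\n' |
  awk 'BEGIN { RS = ""; FS = "\n"; OFS = "," } { $1 = $1; print }')

printf '%s\n' "$records"
```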


awk

• FPAT: Split a line by pattern, rather than by delimiter

• Example:
• File:
12,45,Test, 123.56
13,21,"Test without comma", 345.2
14,71,"Test, with comma", 0.7
• Command:
awk -F, '{print $1" : "$3}' fpat-demo.txt
• Output:
12 : Test
13 : "Test without comma"
14 : "Test
WRONG!!!!!


• Example with FPAT:
• Command:
awk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
     { print $1" : "$3 }' fpat-demo.txt

• ([^,]+) matches strings containing no comma
• (\"[^\"]+\") matches "..." strings (more specific rule)

• Output:
12 : Test
13 : "Test without comma"
14 : "Test, with comma"


Exercise III

• Create a working copy of your file city.csv (as a safety measure)

• Replace all occurrences of the province "Amazonas" in Peru (code PE) with "Province of Amazonas" using sed (in place).
• Look for entries with the string "ce of Amazonas" - there should be only 1!
• Perform the same operations using awk.
• Print all cities which have no population given.
• Print the line numbers of the cities in Great Britain (code: GB)
• Delete the records 5-12 and 31-34 from city.csv and store the result in city2.csv using awk.
• Combine the commands from the last two tasks and write a bash script (sequence of commands) which deletes all British cities from the file city.csv (Hint: generate with awk the commands for sed to delete the corresponding lines)
• Count the datasets (lines) in city.csv - it should be 2880


Exercise III (continued)

• If you take a look at the files downloaded from Project Gutenberg, you can identify some boilerplate text at the beginning and the end of the book. Which lines separate the literary text from the boilerplate text?
• Write a command which removes the boilerplate text (Hint: use sed, head, tail)


Specifying the field-separator

• Unfortunately, the commands use different options to specify the field separator, and the default separators differ as well

Table 1:

command   option   default separator
cut       -d       <Tab>
sort      -t       <Blank>
awk       -F       <Blank>, <Tab>
join      -t       <Blank>
uniq      (none)   <Blank>, <Tab>


Emulation of SQL-Statements (1)

SQL:   select * from city
Shell: cat city.csv

SQL:   select name, population from city
Shell: cut -d, -f1,4 city.csv

SQL:   select name, population from city where country='F'
Shell: grep ',F,' city.csv | cut -d, -f1,4
       awk -F, '$2=="F" {print $1,$4}' city.csv

SQL:   select count(*) from city
Shell: wc -l city.csv

SQL:   select count(*) from city where country='F'
Shell: grep ',F,' city.csv | cut -d, -f1,4 | wc -l


Emulation of SQL-Statements (2)

SQL:   select max(population) from city where country='F'
Shell: grep ',F,' city.csv | cut -d, -f4 | \
       sort -nr | head -n1

SQL:   select country, count(*) from city
       group by country
       order by count(*) desc
Shell: cut -d, -f2 city.csv | sort -t, | \
       uniq -c | sort -nr

SQL:   select ci.name, co.name, ci.population
       from city ci join country co on ci.country=co.code
       order by ci.population desc
Shell: sort -k2 -t, city.csv | \
       join -t, -12 -22 - country.csv -o1.1,2.1,1.4 | \
       sort -t, -k3,3 -nr


Emulation of SQL-Statements (3)

SQL:   select country, count(*) from city
       group by country
       having count(*) > 100
       order by count(*) desc
Shell: cut -d, -f2 city.csv | sort -t, | \
       uniq -c | sort -nr | \
       awk -F' ' '$1>100 {print}'

SQL:   select country, count(*) from city c
       where (select area
              from country co
              where co.code=c.country) > 5000000
       group by country
       having count(*) > 100
       order by count(*) desc
Shell: ?????
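
The slide leaves the shell counterpart of the query with the subquery open. One possible answer is sketched below; it is an assumption, not the author's solution. It joins each city with the area of its country (column 5 of country.csv), filters on the area and then groups as before. The sample rows are made up, the count threshold is lowered to 1 so that the tiny sample produces output, and fields containing quoted commas are not handled.

```shell
# Tiny stand-ins for city.csv and country.csv (invented rows, same column layout).
workdir=$(mktemp -d)
printf 'Moscow,R,Moskva,8717000\nAbakan,R,Khakassiya,161000\nAachen,D,NRW,247113\n' > "$workdir/city.csv"
printf 'Russia,R,Moscow,Moskva,17075200,148178487\nGermany,D,Berlin,Berlin,356910,83536115\n' > "$workdir/country.csv"

# join needs both inputs sorted on the join field (column 2 = country code).
LC_ALL=C sort -t, -k2,2 "$workdir/city.csv" > "$workdir/city.sorted"
LC_ALL=C sort -t, -k2,2 "$workdir/country.csv" > "$workdir/country.sorted"

# 1. join: one line per city with "country code,area"
# 2. awk:  keep cities whose country is larger than min_area (the where clause)
# 3. sort | uniq -c: group by country and count
# 4. awk:  having count(*) > min_count
# 5. sort: order by count descending
big_countries=$(LC_ALL=C join -t, -1 2 -2 2 -o 1.2,2.5 "$workdir/city.sorted" "$workdir/country.sorted" |
  awk -F, -v min_area=5000000 '$2 > min_area {print $1}' |
  sort | uniq -c |
  awk -v min_count=1 '$1 > min_count {print $2, $1}' |
  sort -k2,2nr)

echo "$big_countries"
```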


update Statement

update city
set population=308135
where name='Karlsruhe' and country='D';

awk -F, -v OFS=, '{ if ($1=="Karlsruhe" && $2=="D") $4=308135; print $0 }' \
    city.csv

update city
set population=round(population*1.05)
where country='D';

awk -F, -v OFS=, '{ if ($2=="D" && $4!="NULL") $4=int($4*1.05+0.5); print $0 }' city.csv


delete statement

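delete
from city
where province='Bayern'

• Step 1: generate a sed script with one delete command per matching line
awk -F, '$3=="Bayern" {print NR"d"}' \
    city.csv > delete-bayern.sed

• Generated file delete-bayern.sed:
172d
776d
839d
1094d
1749d
1756d
1904d
2189d
2921d

• Step 2: apply the script in place
sed -i -f delete-bayern.sed city.csv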
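
The two-step delete can be tried on a small stand-in file. In the sketch below the four sample rows are made up, and the result is written to a new file instead of using sed -i, since the in-place option differs between sed implementations.

```shell
# Four invented rows with the province in column 3.
workdir=$(mktemp -d)
printf 'Munich,D,Bayern\nBerlin,D,Berlin\nNuremberg,D,Bayern\nHamburg,D,Hamburg\n' > "$workdir/city.csv"

# Step 1: awk writes one "<line number>d" command per matching row.
awk -F, '$3=="Bayern" {print NR"d"}' "$workdir/city.csv" > "$workdir/delete-bayern.sed"

# Step 2: sed executes the generated script.
sed -f "$workdir/delete-bayern.sed" "$workdir/city.csv" > "$workdir/city-without-bayern.csv"

generated_script=$(cat "$workdir/delete-bayern.sed")
remaining=$(cat "$workdir/city-without-bayern.csv")
printf '%s\n---\n%s\n' "$generated_script" "$remaining"
```

For this particular query a single awk call with the negated condition, awk -F, '$3!="Bayern"' city.csv, gives the same rows; the generated script mainly illustrates how one tool can produce commands for another.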


Visualization with Gnuplot

awk -F"," ' $4 { print $4, $1 }' city.csv | sort -nr | head -n30 \
> biggest-cities.data

• biggest-cities.data:

10229262 Seoul
9925891 Mumbai
9863000 Karachi
9815795 "Mexico City"
9811776 "Sao Paulo"
8717000 Moscow
8259266 Jakarta
7843000 Tokyo
7830000 Shanghai


Visualization with Gnuplot

• file biggest-cities.gpt
set terminal postscript
set output "city-population.ps"
set title "City Population"
set style fill solid
set style data histogram
set xtic nomirror rotate by -60
plot "biggest-cities.data" \
using 1:xtic(2) title ''

• Generate:
$ gnuplot biggest-cities.gpt


Visualization with Gnuplot

• file temp.dat

Month "Min. Temperature" "Max. Temperature"
Jan -1.4 3.5
Feb -1.2 4.4
Mar 1.1 8.0
Apr 3.3 12.3
May 7.4 17.5
Jun 10.5 19.9
Jul 12.7 22.1
Aug 12.5 22.2
Sep 9.6 17.9
Oct 6.0 13.0
Nov 2.4 7.5
Dec 0.0 4.6

• temperature.gpt

set term jpeg
set output "temperatur.jpeg"
set title "Temperature, Karlsruhe"
set style line 1 lt 2 lc rgb "blue" lw 3
set style line 2 lt 5 lc rgb "red" lw 2
set multiplot
plot "temp.dat" using 2:xtic(1) \
ls 1 with lines \
title columnheader(2), \
"temp.dat" using 3:xtic(1) \
ls 2 with lines \
title columnheader(3);
unset multiplot


Visualization with Gnuplot

$ gnuplot.exe temperature.gpt


Further examples ...


Jaccard Example

ENTITY=New_York

wget -O data/$(ENTITY).html https://en.wikipedia.org/wiki/$(ENTITY)

tr < data/$(ENTITY).html > data/$(ENTITY).txt '[A-Z]' '[a-z]'
sed -ri '/<script>/,/<\/script>/d' data/$(ENTITY).txt
sed -ri 's// /g' data/$(ENTITY).txt
sed -ri '//d' data/$(ENTITY).txt
sed -i '/<h2>navigation menu<\/h2>/,$d' data/$(ENTITY).txt
sed -ri 's/<\/?[a-z]+[^>]*>/ /g' data/$(ENTITY).txt
egrep -o -e '[a-z]+' data/$(ENTITY).txt | sort | uniq | \
$(PHP) porter.php > data/$(ENTITY).set

comm --total data/$(E1).set data/$(E2).set | tail -n1 | \
sed -r 's/([\t0-9]+)[A-Za-z]+/.\/jaccard.pl \1/g' > run.sh
sh run.sh
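
The script jaccard.pl is not shown on the slide. As a hedged alternative, the Jaccard coefficient |A ∩ B| / |A ∪ B| of two sorted word sets can also be computed directly with comm, sort and awk. The file names a.set and b.set and their contents are made up for this sketch.

```shell
# Two small sorted word sets standing in for the .set files.
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/a.set"
printf 'banana\ncherry\ndate\nelder\n' > "$workdir/b.set"

# Intersection size: lines common to both files (comm column 3 only).
common=$(LC_ALL=C comm -12 "$workdir/a.set" "$workdir/b.set" | wc -l | tr -d ' ')

# Union size: all distinct lines of both files.
total=$(LC_ALL=C sort -u "$workdir/a.set" "$workdir/b.set" | wc -l | tr -d ' ')

# The shell has no floating-point arithmetic, so awk does the division.
jaccard=$(awk -v c="$common" -v t="$total" 'BEGIN { printf "%.2f", c / t }')
echo "Jaccard coefficient: $jaccard"
```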


Create Lookup-Table

• Input (city_with_countryname.csv):
Innsbruck,Austria,Tyrol,118000
Vienna,Austria,Vienna,1583000
Bregenz,Austria,Vorarlberg,NULL
Kabul,Afghanistan,Afghanistan,892000
"Saint Johns","Antigua and Barbuda",...
Tirane,Albania,Albania,192000
Korce,Albania,Albania,52000
Elbasan,Albania,Albania,53000
Vlore,Albania,Albania,56000

• Output:
Innsbruck,1,Tyrol,118000
Vienna,1,Vienna,1583000
Bregenz,1,Vorarlberg,NULL
Kabul,2,Afghanistan,892000
"Saint Johns",3,"Antigua and Barbuda",36000
Tirane,4,Albania,192000
Korce,4,Albania,52000
Elbasan,4,Albania,53000
Vlore,4,Albania,56000

• lookup_table.csv:
1,Austria
2,Afghanistan
3,"Antigua and Barbuda"
4,Albania
5,Andorra
6,Angola
7,Armenia
8,Australia


Create Lookup-Table

awk -F, -f make_dictionary.awk city_with_countryname.csv

BEGIN {                                   # initialisation stuff
    key=0;
    OFS=",";
}
{
    if (has_value[$2]>0)                  # reuse existing key
        out=has_value[$2];
    else {
        has_value[$2] = ++key;            # generate next key value
        out=key;
        print key","$2 > "lookup_table.csv"   # write key/value pair in file
    }
    $2=out;
    print $0;
}


Looking for doubled paragraphs

$ grep -E '.{20,}' diss-tobi.txt| sort | uniq -d > double.grep

$ grep -a -n -f double.grep diss-tobi.txt

3487:Bei einigen Verfahren nivellieren Walzen oder eine Fräseinrichtung das abgeschiedene
3496:Bei einigen Verfahren nivellieren Walzen oder eine Fräseinrichtung das abgeschiedene
10352:Design Engineering Technical Conferences & Computers and Information in
10529:Design Engineering Technical Conferences & Computers and Information in
1079:Diese Ausrichtung bleibt auch, wenn das Feld entfernt wird, und verleiht dem
4659:Diese Ausrichtung bleibt auch, wenn das Feld entfernt wird, und verleiht de


Simple encryption (like rot13)

$ tr 'A-Za-z' 'D-Za-zABC' < top-secret.plain > top-secret.enc

$ tr 'D-Za-zABC' 'A-Za-z' < top-secret.enc
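
A quick round trip on a made-up string shows that the second command undoes the first. This is a sketch with inline input instead of the files top-secret.plain and top-secret.enc; LC_ALL=C keeps the character ranges limited to plain ASCII letters.

```shell
# Encrypt: every letter is shifted three positions within the sequence A-Za-z.
encrypted=$(printf 'Hello\n' | LC_ALL=C tr 'A-Za-z' 'D-Za-zABC')

# Decrypt: swap the two character sets.
decrypted=$(printf '%s\n' "$encrypted" | LC_ALL=C tr 'D-Za-zABC' 'A-Za-z')

echo "$encrypted -> $decrypted"
```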

14 pages
Finacle Pre Bid Queries
No ratings yet
Finacle Pre Bid Queries
41 pages
Architectural Patterns: Massimo Felici Conrad Hughes
No ratings yet
Architectural Patterns: Massimo Felici Conrad Hughes
49 pages
Excel Sorting, Filtering & Advanced Filtering of Data: IT Training St. George's, University of London
No ratings yet
Excel Sorting, Filtering & Advanced Filtering of Data: IT Training St. George's, University of London
18 pages
All Workday Accounts - Service Centre
No ratings yet
All Workday Accounts - Service Centre
30 pages
Rsa NW 11.6 Logstash Guide
No ratings yet
Rsa NW 11.6 Logstash Guide
29 pages
MS DOS COmmands
No ratings yet
MS DOS COmmands
15 pages
Pipes and Filters
No ratings yet
Pipes and Filters
3 pages
Lab 2
No ratings yet
Lab 2
3 pages
Limited Time Discount Offer! 15% Off - Ends in 01:37:22 - Use Discount Coupon Code A4T2023
No ratings yet
Limited Time Discount Offer! 15% Off - Ends in 01:37:22 - Use Discount Coupon Code A4T2023
2 pages
Acronym Generator SRS
No ratings yet
Acronym Generator SRS
17 pages
Validitas Dan Reliabilitas Instrument
No ratings yet
Validitas Dan Reliabilitas Instrument
8 pages
CS3411 Project 5 - Filter I/O and Run A Program
No ratings yet
CS3411 Project 5 - Filter I/O and Run A Program
3 pages