0% found this document useful (0 votes)

18 views37 pages

Lec 3

The document outlines the course announcements and project details for a Software Security Lab class at National Chengchi University. It includes information about a make-up course, homework reviews, and a multi-stage team project focused on developing a search engine that outperforms Google. Additionally, it covers text processing concepts, Java string manipulation, and pattern matching algorithms, specifically the Boyer-Moore algorithm.

Uploaded by

ims

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

18 views37 pages

Lec 3

Uploaded by

ims

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 37

Fall 2017

Fang Yu

Software Security Lab.

Data Structures
Dept. Management Information
Systems,
National Chengchi University
Lecture 3
Announcement
A Make Up Course on Sep. 30
¡ We will have a make-up course on Sep. 30.
¡ 9-12am (Mandarin) and 1-4pm (English)
¡ Room 312, College of Commerce.
¡ We will not have courses on Oct 5 and Oct. 12.
¡ Lab on Oct. 2 for HW3 (Submit your team members)
¡ Lab on Oct. 16 for HW4 (and tutorial on SVN)
HWs Review – What you
should have learned?
¡ Calculate your BMI
¡ Java Class Library

¡ Generic Geometric Progression

¡ Inheritance
¡ Generics
¡ Exceptions
Project Announcement
¡ A Team Project: 30%
¡ 3-5 students as a team
¡ Send the team list (name and contact) to your TAs before the
end of this week
¡ Develop your application using Eclipse with SVN
¡ TAs will help you set up SVN
¡ You will get extra points for having constant code update
Lets Beat Google!
¡ Goal: On the top of a giant’s shoulder, achieve advanced
information searching with your expertise!

¡ Select a topic that you/your team members have interests.

¡ Make sure your search engine gets better results than a

general search engine such as Google.

¡ Stage 0 (HW3): Keyword Counting

¡ Given an URL and a keyword
¡ Return how many times the keyword appears in the contents of
the URL
Lets Beat Google!
¡ Stage 1 (30%+): Page Ranking
¡ Given a set of keywords and URLs
¡ Rank the URLs based on their score
¡ Define a score formula based on keyword appearances
¡ For each URL (a web page), return its rank, score, and the
count on appearance of each keyword
Lets Beat Google!
¡ Stage 2 (50%+) Site Ranking
¡ Multiple level keyword search
¡ Given a set of Web sites (URLs) and Keywords
¡ Rank the Web sites with their keyword appearances (including
all its sub URLs)
¡ Define a score formula based on keyword appearances in the
URL and all its sub URLs
¡ For each URL (a web site), return its rank, score, and a tree
structure for its sub URLs along with the number of appearance
of each keyword in each node
Lets Beat Google!
¡ Stage 3 (70%+) Refine the rank of Google
¡ Given a set of Keywords (No URLs)
¡ Use search engines to find potential URLs
¡ Apply the ranking on Stage 2 to these Web sites

¡ Stage 4 (80%+) Semantics Analysis

¡ Derive relative keywords from the discovered Web sites
¡ Iteratively do the same analysis on Stage 3

¡ Stage 5 (90%+) Publish Your Work Online

¡ Build a web site/service for your searching engine

¡ Stage 6 (100%+) Export Your Work to App

¡ Wrap your search engine as an iOS/android mobile application
Important Date
Each team needs to
1. Submit the project proposal (4-8 pages) on Nov. 2
2. Give a Demo on Jan. 11
3. Upload the source code before Jan. 18
Text Processing
Strings and Pattern matching
Text Processing
¡ Due to internet, social networks, web and mobile
applications, a lot of documents and contents are online and
public available

¡ Text processing becomes one of the dominant functions of

computers

¡ HTML and XML

¡ Primary text formats with added tags for multimedia content
¡ Java Applet (embedded Java bytecode in the HTML)
Strings
¡ A string is a sequence of characters

¡ An alphabet Σ is the set of possible characters for a family

of strings

¡ Example of alphabets:
¡ ASCII
¡ Unicode
¡ {0, 1}
¡ {A, C, G, T}
Strings
¡ Let P be a string of size m
¡ A substring P[i .. j] of P is the subsequence of P consisting
of the characters with ranks between i and j
¡ A prefix of P is a substring of the type P[0 .. i]
¡ “Fan” is a prefix of “Fang Yu, NCCU”

¡ A suffix of P is a substring of the type P[i ..m - 1]

¡ “CCU” is a suffix of “Fang Yu, NCCU”
Java String Class
String S;
¡ Immutable strings: operations simply return information about strings (no
modification)

length() Return the length of S

charAt(i) Return the ith character
startsWith(Q) True if Q is a prefix of S
endsWith(Q) True is Q is a suffix of S
substring(i,j) Return the substring S[i,j]
concat(Q) Return S+Q
equals(Q) True is Q is equal to S
indexOf(Q) If Q is a substring of S, returns the index of the beginning
of the first occurrence of Q in S
Java String Class
String a = “Hello World!”;

Operation Output
a.length()
a.charAt(1)
a.startsWith(“Hell”)
a.endsWith(“rld”)
a.substring(1,2)
a.concat(“rld”)
a.substring(1,2).equals(“e”)
indexOf(“rld”)
Java String Class
String a = “Hello World!”;

Operation Output
a.length() 12
a.charAt(1) e
a.startsWith(“Hell”) true
a.endsWith(“rld”) false
a.substring(1,2) e
a.concat(“rld”) Hello World!rld
a.substring(1,2).equals(“e”) true
a.indexOf(“rld”) 8
Java StringBuffer Class
StringBuffer S;
¡ Mutable strings: operations modify the strings

append(Q) Replace S with S+Q. Return S.

Insert(i,Q) Insert Q in S starting at index i. Return S
reverse() Reverse S. Return S.
setCharAt(i, ch) Set the character at index i in S to ch
charAt(i) Return the character at index i in S
toString() Return a String version of S
Java StringBuffer Class
StringBuffer a = new StringBuffer();

Operation a
a.append(“Hello World!”)
a.reverse()
a.reverse()
a.insert(6,”Fang and the ”)
a.setCharAt(4, ‘!’)
Java StringBuffer Class
StringBuffer a = new StringBuffer();

Operation a
a.append(“Hello World!”) Hello World!
a.reverse() !dlroW olleH
a.reverse() Hello World!
a.insert(6,”Fang and the ”) Hello Fang and the World!
a.setCharAt(4, ‘!’) Hell! Fang and the World!
Pattern Matching
¡ Given a text string T of length n and a pattern string P of
length m

¡ Find whether P is a substring of T

¡ If so, return the starting index in T of a substring matching

¡ The implementation of T.indexOf(P)

¡ Applications:
¡ Text editors, Search engines, Biological research
Brute-Force Pattern Matching
The idea:
¡ Compare the pattern P with the text T for each possible
shift of P relative to T, until
¡ either a match is found, or

¡ all placements of the pattern have been tried

Algorithm BruteForceMatch(T, P)
Input text T of size n and pattern
P of size m
Output starting index of a
substring of T equal to P or -1
if no such substring exists
for i ← 0 to n – m // test shift i of the pattern
j←0
while j < m ∧ T[i + j] = P[j]
j←j+1
if j = m
return i //match at i
else
break while loop //mismatch
return -1 //no match anywhere
Brute-Force Pattern Matching
¡ Time Complexity:
¡ O(mn), where m is the length of T and n is the length of P

¡ A worst case example:

¡ T = aaaaaaaaaaaaaab
¡ P = aab
¡ Need 39 comparisons to find a match
¡ may occur in images and DNA sequences
¡ unlikely in English text
Can we do better?
Here are two Heuristics.

1. Backward comparison

¡ Compare T and P from the end of P and move backward to

the front of P

2. Shift as far as you can

¡ When there is a mismatch of P[j] and T[i]=c, if c does not

appear in P, shift P[0] to T[i+1]
The Boyer-Moore Algorithm
¡ The Boyer-Moore’s pattern matching algorithm is based on
these two heuristics:

¡ The looking-glass heuristic: Compare P with a subsequence

of T moving backwards

¡ The character-jump heuristic: When a mismatch occurs at

T[i] = c
¡ If P contains c, shift P to align the last occurrence of c in P with
T[i]
¡ Else, shift P to align P[0] with T[i + 1]
An Example

T a p a t t e r n m a t c h i n g a l g o r i t h m

1 3 5 11 10 9 8 7
P r i t h m r i t h m r i t h m r i t h m

2 4 6
r i t h m r i t h m r i t h m

t appears in P. e does not appears in P.

Shift to t align P[0] and T[i+1]
Last Occurrence Function
¡ Boyer-Moore’s algorithm preprocesses the pattern P and
the alphabet Σ to build the last-occurrence function L
mapping Σ to integers
¡ L(c) is defined as (c is a character)
¡ the largest index i such that P[i] = c or
¡ -1 if no such index exists

¡ Example:
¡ Σ = {a, b, c, d} c a b c d
¡ P = abacab L(c) 4 5 3 -1
Last Occurrence Function
¡ The last-occurrence function can be represented by an array
indexed by the numeric codes of the characters

¡ The last-occurrence function can be computed in time O(m

+ s), where m is the size of P and s is the size of Σ

The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, Σ)
L ← lastOccurenceFunction(P, Σ )
i ← m – 1 //backward
j←m–1
repeat
if T[i] = P[j]
if j = 0
return i // match at i
else
i←i-1
j←j-1
else
// character-jump
l ← L[T[i]]
i ← i + m – min(j, 1 + l) How to shift i?
j←m-1
until i > n - 1
return -1 { no match }
How to shift i after
mismatching characters?
¡ i ← i + m – min(j, 1 + l)

¡ Don’t shift back!

Case 1: j ≤ 1 + l (a appears after Case 2: 1 + l ≤ j (a appears
b) before b, jump!)

. . . . . . a . . . . . . . . . . . . a . . . . . .

i i

. . . . b a . a . . b .
j l l j
m−j m − (1 + l)

. . . . b a . a . . b .

j 1+l
Another Example
a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
Case 2 a b a c a b a b a c a b
5 7
Case 1 a b a c a b a b a c a b
6
a b a c a b
Is it a better algorithm?
¡ Boyer-Moore’s algorithm runs in time O(nm + s)

¡ An example of the worst case:

¡ T = aaa … a
¡ P = baaa

¡ The worst case may occur in images and DNA sequences

but it is unlikely happened in English text

¡ It has been shown that in practice Boyer-Moore’s algorithm

is significantly faster than the brute-force algorithm on
English text
The Worst-case Example
a a a a a a a a a
6 5 4 3 2 1
b a a a a a
12 11 10 9 8 7
b a a a a a
18 17 16 15 14 13
b a a a a a
24 23 22 21 20 19
b a a a a a
HW3 (Due on 10/5)
(Lab on 10/2)
Count A Keyword in a Web Page!

¡ Get a URL and a keyword from user inputs

¡ Return how many times the keyword appears in the

contents of the URL

¡ For example:
¡ Enter URL: http://soslab.nccu.edu.tw
¡ Enter Keyword: Fang
¡ Output: Fang appears X times
Hints
Count A Keyword in a Web Page!

¡ Implement indexOf() with Boyer-Moore’s algorithm

¡ Use looking-glass and character-jump heuristics

Coming up…
¡ We will start to discuss fundamental data structures such as
arrays and linked lists on Sep. 30 (this Saturday)

¡ We have NO classes on Oct. 5 and Oct. 12

¡ We will then continue the basic data structure discussion on

Oct. 19.

¡ Read TB Chapter 3

Chapter 3 - String Processing
0% (1)
Chapter 3 - String Processing
28 pages
Unit-V DS Pattern Matching and Tries
No ratings yet
Unit-V DS Pattern Matching and Tries
26 pages
4.2 Sorting and Searching
No ratings yet
4.2 Sorting and Searching
43 pages
Unit 5 DS
No ratings yet
Unit 5 DS
53 pages
DAA - Unit IV - Space and Time Tradeoffs - Lecture Slides
No ratings yet
DAA - Unit IV - Space and Time Tradeoffs - Lecture Slides
41 pages
Lecture 40 Boyer Moore Algorithm
100% (1)
Lecture 40 Boyer Moore Algorithm
13 pages
DS V Unit Notes
No ratings yet
DS V Unit Notes
33 pages
Unit 2 Daa PDF
No ratings yet
Unit 2 Daa PDF
99 pages
Week 9 String Algorithms, Approximation
No ratings yet
Week 9 String Algorithms, Approximation
22 pages
String Matching Class
No ratings yet
String Matching Class
31 pages
Ecommerce Marketing - How To Get Traffic That BUYS To Your Website
67% (3)
Ecommerce Marketing - How To Get Traffic That BUYS To Your Website
241 pages
SplitPDFFile 346 To 402
No ratings yet
SplitPDFFile 346 To 402
57 pages
資料工程 Data Engineering: Pattern Matching 張賢宗
No ratings yet
資料工程 Data Engineering: Pattern Matching 張賢宗
38 pages
Algoritmen & Datastructuren 2012 - 2013 Substring Search (Slides by Sedgewick)
No ratings yet
Algoritmen & Datastructuren 2012 - 2013 Substring Search (Slides by Sedgewick)
32 pages
Text Processing (Complete)
No ratings yet
Text Processing (Complete)
100 pages
Lec 6-String Processing
100% (1)
Lec 6-String Processing
25 pages
MADF Unit 4
No ratings yet
MADF Unit 4
144 pages
Unit 5
No ratings yet
Unit 5
42 pages
Notes 5
No ratings yet
Notes 5
23 pages
Pattern Matching
No ratings yet
Pattern Matching
46 pages
M4 - Chapter 7 1
No ratings yet
M4 - Chapter 7 1
30 pages
Inf715 11
No ratings yet
Inf715 11
57 pages
04 03-PatternMatchingAndTries
No ratings yet
04 03-PatternMatchingAndTries
28 pages
U3 - SpaceAndTimeTradeoff
No ratings yet
U3 - SpaceAndTimeTradeoff
30 pages
CCCS314 - DAA - 22!23!3rd 05 Space and Time Tradeoffs - Modified
No ratings yet
CCCS314 - DAA - 22!23!3rd 05 Space and Time Tradeoffs - Modified
30 pages
UNIT 5.3 (String Mactching)
No ratings yet
UNIT 5.3 (String Mactching)
23 pages
4string Matching Kmprabin Karp and Naive
No ratings yet
4string Matching Kmprabin Karp and Naive
57 pages
Pattern Matching 2
No ratings yet
Pattern Matching 2
46 pages
Question Bank
No ratings yet
Question Bank
123 pages
DS Unit-V
No ratings yet
DS Unit-V
35 pages
Efficient Name Generation Using The Boyer-Moore Algorithm For Meaningful Combinations
No ratings yet
Efficient Name Generation Using The Boyer-Moore Algorithm For Meaningful Combinations
6 pages
Ads Unit5
No ratings yet
Ads Unit5
26 pages
Pattern Matching Algorithms
No ratings yet
Pattern Matching Algorithms
17 pages
Unit-4 Ads
100% (1)
Unit-4 Ads
31 pages
UNIT-4 PPT New
No ratings yet
UNIT-4 PPT New
47 pages
28 - Text Processing
No ratings yet
28 - Text Processing
7 pages
Chapter 2 - String Processing
No ratings yet
Chapter 2 - String Processing
26 pages
IRS Unit-5
No ratings yet
IRS Unit-5
62 pages
A Fast String Matching Algorithm: H N Verma, Ravendra Singh M.Tech (CSE-0104cs09mt16) RKDF IST Bhopal, India
No ratings yet
A Fast String Matching Algorithm: H N Verma, Ravendra Singh M.Tech (CSE-0104cs09mt16) RKDF IST Bhopal, India
7 pages
ALo 2
No ratings yet
ALo 2
23 pages
Co 4 (Lo 2)
No ratings yet
Co 4 (Lo 2)
12 pages
M269 - Lec8 Fall 1819
No ratings yet
M269 - Lec8 Fall 1819
24 pages
String Matching Algorithm
100% (1)
String Matching Algorithm
14 pages
Approximate String
No ratings yet
Approximate String
36 pages
Pattern Matching
No ratings yet
Pattern Matching
3 pages
Ir Asnment
No ratings yet
Ir Asnment
6 pages
Survey Paper On String Matching
No ratings yet
Survey Paper On String Matching
4 pages
String Processing
No ratings yet
String Processing
19 pages
Algo Lecture 7
No ratings yet
Algo Lecture 7
52 pages
Pattren Matching
No ratings yet
Pattren Matching
3 pages
Unit 5
No ratings yet
Unit 5
14 pages
Outline and Reading: Strings ( 9.1.1) Pattern Matching Algorithms
No ratings yet
Outline and Reading: Strings ( 9.1.1) Pattern Matching Algorithms
3 pages
MADFL 2025 Expt8
No ratings yet
MADFL 2025 Expt8
8 pages
Data Structures Unit 5
No ratings yet
Data Structures Unit 5
20 pages
String Search Algorithm
No ratings yet
String Search Algorithm
6 pages
DSA String Matching - Part 3
No ratings yet
DSA String Matching - Part 3
6 pages
DS Unit V
No ratings yet
DS Unit V
12 pages
Presented To: Dr. Rasha Waheeb
50% (2)
Presented To: Dr. Rasha Waheeb
72 pages
A Two Way Pattern Matching Algorithm Using Sliding Patterns
No ratings yet
A Two Way Pattern Matching Algorithm Using Sliding Patterns
5 pages
String Matching Algorithms: 1 Brute Force
No ratings yet
String Matching Algorithms: 1 Brute Force
5 pages
Market Analysis of Zomato
No ratings yet
Market Analysis of Zomato
39 pages
My Link Building Recommendations
100% (1)
My Link Building Recommendations
2 pages
How To Make Money Online Ebook
No ratings yet
How To Make Money Online Ebook
10 pages
SEO Spider Configuration - Screaming Frog
No ratings yet
SEO Spider Configuration - Screaming Frog
55 pages
SEO Interview Questions and Answers
No ratings yet
SEO Interview Questions and Answers
17 pages
Marketing Consultant For Small Business
No ratings yet
Marketing Consultant For Small Business
2 pages
Snsjvu
No ratings yet
Snsjvu
112 pages
Final 12sp Solution
No ratings yet
Final 12sp Solution
12 pages
22.00 - FD - An2 - s1 - CCIA-eng - Numerical Methods - 23-24
No ratings yet
22.00 - FD - An2 - s1 - CCIA-eng - Numerical Methods - 23-24
4 pages
cs170 Fa2013 mt1 Rao Soln
No ratings yet
cs170 Fa2013 mt1 Rao Soln
10 pages
ComponentSpecification2014 Prez
No ratings yet
ComponentSpecification2014 Prez
64 pages
Celebal Technology I Financial Proposal
No ratings yet
Celebal Technology I Financial Proposal
33 pages
Search Engines & Search Engine Optimization (SEO)
No ratings yet
Search Engines & Search Engine Optimization (SEO)
16 pages
The Beginner'S Guide To Using Elementor in Wordpress
No ratings yet
The Beginner'S Guide To Using Elementor in Wordpress
15 pages
A Study On Digital Marketing Strategy On Sales Performance at IMR Softech
No ratings yet
A Study On Digital Marketing Strategy On Sales Performance at IMR Softech
61 pages
Best Online Marketing Courses
No ratings yet
Best Online Marketing Courses
20 pages
23.00 FD An2 s1 Civil - Eng. Topography 23-24
No ratings yet
23.00 FD An2 s1 Civil - Eng. Topography 23-24
5 pages
Lec 1
No ratings yet
Lec 1
21 pages
20.0 - FD - An2 - s1 - CCIA-eng - Mechanics II - 23-24
No ratings yet
20.0 - FD - An2 - s1 - CCIA-eng - Mechanics II - 23-24
4 pages
PC 1
No ratings yet
PC 1
53 pages
Shreyas Kolhe CV SD
No ratings yet
Shreyas Kolhe CV SD
5 pages
Lec 4
No ratings yet
Lec 4
37 pages
Graph Definition
No ratings yet
Graph Definition
18 pages
Help Wanted Lesson Plan
No ratings yet
Help Wanted Lesson Plan
11 pages
Maryam Tavakkoli Master Thesis TUDelft
No ratings yet
Maryam Tavakkoli Master Thesis TUDelft
90 pages
MMPC 06 Answer Self Generated
No ratings yet
MMPC 06 Answer Self Generated
25 pages
Content Management SystemCMS
No ratings yet
Content Management SystemCMS
23 pages
Lipika Report Initial
No ratings yet
Lipika Report Initial
8 pages
8 Services and IPC Parts 14 15 and 16
No ratings yet
8 Services and IPC Parts 14 15 and 16
89 pages
Components Intro 2022
No ratings yet
Components Intro 2022
40 pages
8 Services and IPC Parts 22 23 24
No ratings yet
8 Services and IPC Parts 22 23 24
76 pages
Lecture 11 - Enhancing The User Experience (UX)
No ratings yet
Lecture 11 - Enhancing The User Experience (UX)
29 pages
E Business
No ratings yet
E Business
8 pages
8 Services and IPC Part 17
No ratings yet
8 Services and IPC Part 17
37 pages
5.00 FD An1 s1 CE Chemistry 23-24
No ratings yet
5.00 FD An1 s1 CE Chemistry 23-24
4 pages
Digital Marketing Plansbook
No ratings yet
Digital Marketing Plansbook
22 pages
Method of Coupling Metrics For Object-Oriented Sof
No ratings yet
Method of Coupling Metrics For Object-Oriented Sof
20 pages
Midterm Pc1201
No ratings yet
Midterm Pc1201
25 pages
Review DS
No ratings yet
Review DS
7 pages
Oops Project
No ratings yet
Oops Project
28 pages
Lab 6
No ratings yet
Lab 6
2 pages
Lab 2
No ratings yet
Lab 2
2 pages
Marketing and Real Estate
No ratings yet
Marketing and Real Estate
3 pages
Lab 5
No ratings yet
Lab 5
1 page
Search Engine Marketing Jargon Buster
No ratings yet
Search Engine Marketing Jargon Buster
4 pages
Get Your MBA in Content Marketing
No ratings yet
Get Your MBA in Content Marketing
4 pages
Rendy Mahendra
No ratings yet
Rendy Mahendra
5 pages
HospitalityMarketing Lesson12 LessonPlan 020716
No ratings yet
HospitalityMarketing Lesson12 LessonPlan 020716
8 pages
Learn Python through Nursery Rhymes and Fairy Tales: Classic Stories Translated into Python Programs (Coding for Kids and Beginners)
From Everand
Learn Python through Nursery Rhymes and Fairy Tales: Classic Stories Translated into Python Programs (Coding for Kids and Beginners)
Shari Eskenas
5/5 (1)
Lisp Interpreter in Rust
From Everand
Lisp Interpreter in Rust
Vishal Patil
1/5 (1)
Introduction to PHP, Part 2, Second Edition
From Everand
Introduction to PHP, Part 2, Second Edition
Adam Majczak
No ratings yet