Unit 5 Strings

Topic in ADA

Uploaded by

Anshul Gupta
Copyright
© All Rights Reserved

Unit-5

Strings
Data Compression

Data compression is a process of reducing the size of a file or dataset to save storage space or
transmission time. The goal of compression is to represent the same information in a more
efficient way, using fewer bits or bytes than the original representation. With lossless
methods, the compressed data can be decompressed to reconstruct the original data exactly;
lossy methods recover only an approximation. Many kinds of data can be compressed, including
numbers, text, video, images, audio, and even programs. Suppose the original file size is 200
bytes and, after compression, the file size is reduced to 120 bytes. The compressed file is then
60% of the size of the original file, a saving of 40%.
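
The arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch; the function name compression_ratio is an assumption made for this example and does not come from the notes.

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Return the compressed size as a fraction of the original size."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return compressed_size / original_size


ratio = compression_ratio(200, 120)
print(f"compressed file is {ratio:.0%} of the original")  # 60%
print(f"space saved: {1 - ratio:.0%}")                    # 40%
```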

Some data compression algorithms are:

Huffman coding, Run-Length Encoding (RLE), LZW, etc.

There are mainly two types of data compression techniques –

1. Lossless Data Compression


2. Lossy Data Compression

Fig. 1. Types of Data Compression

What is Lossless data compression?

Lossless data compression is used to compress the files without losing an original file's
quality and data. Simply, we can say that in lossless data compression, file size is reduced,
but the quality of data remains the same.

The main advantage of lossless data compression is that we can restore the original data in its
original form after the decompression.

Lossless compression employs various algorithms to eliminate redundancy and reduce data
size. Some popular algorithms include Huffman coding, Run-Length Encoding (RLE),
Lempel-Ziv-Welch (LZW), and DEFLATE, etc.
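
As an illustration of a lossless scheme, the following minimal sketch implements Run-Length Encoding in Python. The function names rle_encode and rle_decode and the (character, count) pair representation are assumptions chosen for this example; real RLE formats differ in how they store the runs.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into the original text."""
    return "".join(ch * count for ch, count in pairs)


encoded = rle_encode("aaabccdddd")
print(encoded)  # [('a', 3), ('b', 1), ('c', 2), ('d', 4)]

# Lossless: decoding gives back exactly the original input.
assert rle_decode(encoded) == "aaabccdddd"
```

RLE only shrinks data that contains long runs; on text without repeated characters the pair list is larger than the input.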

Fig. 2. Lossless Data Compression

What is Lossy data compression?

Lossy data compression reduces the size of a file by permanently discarding part of the
original data. The discarded data, and with it some quality, cannot be recovered on
decompression. The result takes less space than the original file precisely because of this loss.
This technique is generally useful when the exact quality of the data is not the first priority.

Lossy data compression is most widely used in JPEG images, MPEG video, and MP3 audio
formats.

Fig. 3. Lossy Data Compression

Some important lossy data compression techniques are:

• Transform coding
• Discrete Cosine Transform (DCT)
• Discrete Wavelet Transform (DWT)

Benefits or advantages of Data Compression

• Compression minimizes the amount of storage space needed, leading to cost savings
and efficient resource utilization.
• Compressed data requires less bandwidth, enabling faster data transfer over networks
and reducing communication costs.
• Compressed files can be downloaded more quickly, improving user experience,
especially in web-based applications.
• Reduced storage costs and optimized use of resources contribute to overall cost
savings.

Disadvantages of Data Compression

• Lossy compression techniques sacrifice some data quality for higher compression
ratios, which may be unacceptable in certain applications.
• Compression and decompression processes can introduce computational overhead,
affecting system performance.
• Implementing and maintaining compression algorithms can be complex, requiring
careful consideration of trade-offs.
• Different compression formats may not be universally supported, leading to
compatibility issues across systems and applications.

String Sort

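String sort refers to the process of arranging a collection of strings in a specific order. Sorting
can be done in ascending or descending order based on certain criteria. The same term is also
used for arranging the characters within a single string.

Example (sorting the characters of one string in ascending order):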

• Input: geeksforgeeks

• Output: eeeefggkkorss

Here are a few common ways strings can be sorted:

• Lexicographical Order: This is the default alphabetical (dictionary) order. Strings are
compared character by character from the left, and the first differing character decides
the order. For example, "apple" comes before "banana" because 'a' comes before 'b'.
• Length Order: Sorting based on the length of the strings. Shorter strings come before
longer ones or vice versa.
• Custom Order: Sorting based on a custom criterion defined by the user or the
program. For instance, sorting a list of names based on the number of vowels in each
name.
• Case-Insensitive Order: Sorting strings while ignoring the case of the characters.
This ensures that uppercase and lowercase versions of the same letter are treated as
equal.
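
The four orderings above can be sketched with Python's built-in sorted function and its key parameter. The word list and the helper vowel_count are assumptions made up for this illustration.

```python
words = ["banana", "apple", "Cherry", "fig"]


def vowel_count(s: str) -> int:
    """Custom criterion: number of vowels (a, e, i, o, u) in the string."""
    return sum(ch in "aeiou" for ch in s.lower())


# Default lexicographical order compares character codes, so 'C' sorts before 'a'.
print(sorted(words))                   # ['Cherry', 'apple', 'banana', 'fig']
# Length order (shorter strings first).
print(sorted(words, key=len))          # ['fig', 'apple', 'banana', 'Cherry']
# Custom order: fewest vowels first.
print(sorted(words, key=vowel_count))  # ['Cherry', 'fig', 'apple', 'banana']
# Case-insensitive order.
print(sorted(words, key=str.lower))    # ['apple', 'banana', 'Cherry', 'fig']
```

Python's sort is stable, so strings that tie on the key keep their original relative order.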

Some common sorting algorithms can be applied to strings:

• Quicksort: Quicksort is a comparison-based, divide-and-conquer sorting algorithm. It
chooses a pivot, partitions the input into the elements that come before the pivot and
those that come after it, and then sorts each partition recursively; no separate merge step
is needed. When sorting strings, two strings are compared character by character. When
sorting the characters of a single string, the pivot is a character.
• Mergesort: Mergesort is a divide-and-conquer algorithm that divides the input into
smaller parts, sorts them, and then merges the sorted parts. Strings can be divided into
smaller segments, sorted individually, and then merged to obtain the final sorted
order.
• Bucket Sort: Bucket sort divides the input into a number of buckets, each responsible
for a specific range of values. The buckets are then individually sorted, and the results
are combined. Strings can be assigned to buckets based on certain characteristics (e.g.,
length or a specific character), and each bucket can be sorted individually.
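
As one concrete case, the mergesort description above can be sketched for a list of strings as follows. The function name merge_sort is an assumption for this example, and the strings are compared with Python's default lexicographical comparison.

```python
def merge_sort(items: list[str]) -> list[str]:
    """Return a new list with the strings in ascending lexicographical order."""
    if len(items) <= 1:
        return items[:]
    mid = len(items) // 2
    left = merge_sort(items[:mid])    # sort each half recursively
    right = merge_sort(items[mid:])

    # Merge the two sorted halves into one sorted list.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort(["pear", "apple", "fig", "banana"]))
# ['apple', 'banana', 'fig', 'pear']
```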

Here is a simple algorithm to sort a string in ascending order:

1. Convert the string to an array of characters.
2. Use a sorting algorithm such as quick sort or merge sort to sort the array in ascending
order.
3. Convert the sorted array back to a string.
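
The three steps can be written directly in Python. This is a minimal sketch: the function name sort_string is an assumption, and the built-in list sort stands in for the quick sort or merge sort mentioned in step 2.

```python
def sort_string(s: str) -> str:
    chars = list(s)        # step 1: string -> array of characters
    chars.sort()           # step 2: sort the array in ascending order
    return "".join(chars)  # step 3: array -> string


print(sort_string("geeksforgeeks"))  # eeeefggkkorss
```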

Regular Expression

A regular expression, often abbreviated as regex or regexp, is a sequence of characters that
defines a search pattern. It is a powerful tool used in computer science and programming for
matching patterns within strings. Regular expressions are widely employed in various
applications, including text processing, searching, and data validation.

To write a regular expression, one must understand the special characters used in regex, such
as “.”, “*”, “+”, “?”, and more. The pattern is then written using these special characters and
literal characters. The appropriate function or method is used to search for the pattern in a
string. Here are some examples of how to use regex:

• To match a sequence of literal characters, simply write those characters in the pattern.
• To match a single character from a set of possibilities, use square brackets, e.g.
[0123456789] matches any digit.
• To match zero or more occurrences of the preceding expression, use the star (*)
symbol.
• To match one or more occurrences of the preceding expression, use the plus (+)
symbol.
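
These basic rules can be tried out with Python's standard re module. The sample text below is made up for this illustration.

```python
import re

text = "order 66 shipped, order 7 pending"

# Literal characters match themselves.
print(re.findall(r"order", text))          # ['order', 'order']

# A bracketed set matches exactly one character from the set.
print(re.findall(r"[0123456789]", text))   # ['6', '6', '7']

# '+' means one or more occurrences of the preceding expression.
print(re.findall(r"[0123456789]+", text))  # ['66', '7']

# '*' means zero or more, so "ab*" accepts "a", "ab", "abb", ...
print(bool(re.fullmatch(r"ab*", "a")))     # True
print(bool(re.fullmatch(r"ab*", "abbb")))  # True
```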

A regular expression is built from the following kinds of elements:

1. Metacharacters
2. Quantifiers
3. Groups and Ranges
4. Escape Characters or Character Classes

Metacharacters

• Caret (^): Matches the expression to its right at the start of a string.
Example: ^a matches strings that start with 'a', such as "aab", "a9c", "apr", "aaaaab", etc.

• Dollar ($): Matches the expression to its left at the end of a string.
Example: r$ matches strings that end with 'r', such as "aaabr", "ar", "r", "aannn9r", etc.

• Dot Symbol (.): Matches any single character in a string except the line terminator,
i.e. \n.
Example: b.x matches strings such as "bax", "b9x", "b-x". It does not match "bar", which
does not end in 'x'.

• Vertical Bar (|): Matches the expression on either its left or its right side. The
alternatives are tried from left to right; if the left side matches, the right side is not tried.
Example: A|b matches strings that contain either 'A' or 'b'.

• Backslash (\): Escapes the special character that follows it, so that the character is
matched literally. Example: \. matches a literal dot.

• A: A literal character matches itself; this pattern matches the character 'A' in the string.
Example: This expression matches strings in which 'A' is present at least once, such as
"Amcx", "mnAr", "mnopAx4".

• Ab: A sequence of literal characters matches the substring 'Ab' in the string.
Example: This expression matches strings in which 'Ab' is present at least once, such as
"Abcx", "mnAb", "mnopAbx4".
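
The behaviour of these metacharacters can be checked with Python's re module. In this sketch the helper found is an assumed name; it uses re.search, which looks for the pattern anywhere in the string.

```python
import re


def found(pattern: str, s: str) -> bool:
    """True if the pattern matches somewhere in s (re.search semantics)."""
    return re.search(pattern, s) is not None


print(found(r"^a", "aab"), found(r"^a", "baa"))        # True False
print(found(r"r$", "aaabr"), found(r"r$", "ra"))       # True False
print(found(r"b.x", "b9x"), found(r"b.x", "bar"))      # True False
print(found(r"b.x", "b\nx"))                           # False: '.' skips newline
print(found(r"A|b", "xAz"), found(r"A|b", "xyz"))      # True False
print(found(r"\.", "a.b"), found(r"\.", "ab"))         # True False
print(found(r"Ab", "mnopAbx4"), found(r"Ab", "mnab"))  # True False
```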

Quantifiers
The quantifiers are used in the regular expression for specifying the number of occurrences of
a character or expression.

• + (Plus Symbol): Matches the expression to its left one or more times.
Example: s+ matches "s", "ss", "sss", and so on.

• ? (Question Mark): Matches the expression to its left 0 (zero) or 1 (one) times.
Example: aS? matches either "a" or "aS" as a whole string, but not "aSS".

• * (Asterisk Symbol): Matches the expression to its left 0 or more times.
Example: Br* matches "B", "Br", "Brr", "Brrr", and so on.

• {x}: Matches the expression to its left exactly x times.
Example: Mab{5} matches the following string, which contains 5 b's: "Mabbbbb"

• {x,}: Matches the expression to its left x or more times. The braces must not contain a
space.
Example: Xb{3,} matches strings containing at least 3 b's after the X, such as "Xbbb",
"Xbbbb", and so on.

• {x,y}: Matches the expression to its left at least x times and at most y times.
Example: Pr{3,6}a matches exactly four strings: "Prrra", "Prrrra", "Prrrrra", and
"Prrrrrra".
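
The quantifier examples can be verified with re.fullmatch, which requires the whole string to match the pattern. The helper name whole and the table of checks are assumptions written for this sketch.

```python
import re


def whole(pattern: str, s: str) -> bool:
    """True if the entire string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None


checks = [
    (r"s+", "sss", True), (r"s+", "", False),
    (r"aS?", "a", True), (r"aS?", "aS", True), (r"aS?", "aSS", False),
    (r"Br*", "B", True), (r"Br*", "Brrr", True),
    (r"Mab{5}", "Ma" + "b" * 5, True), (r"Mab{5}", "Ma" + "b" * 4, False),
    (r"Xb{3,}", "X" + "b" * 2, False), (r"Xb{3,}", "X" + "b" * 4, True),
    (r"Pr{3,6}a", "P" + "r" * 2 + "a", False),
    (r"Pr{3,6}a", "P" + "r" * 3 + "a", True),
    (r"Pr{3,6}a", "P" + "r" * 6 + "a", True),
    (r"Pr{3,6}a", "P" + "r" * 7 + "a", False),
]

for pattern, s, expected in checks:
    # Each row states whether the pattern should match the whole string.
    assert whole(pattern, s) == expected
    print(f"{pattern!r:12} {s!r:12} -> {expected}")
```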

Groups and Ranges


The groups and ranges in a regular expression define collections of characters enclosed
in brackets.

• ( ): Groups the enclosed expression and captures the text it matches. Example: A(xy) is
an expression which matches the following string: "Axy"

• { }: Matches the expression to its left the number of times given in the curly
brackets. Example: xz{4,6} is an expression which matches the following strings:
"xzzzz", "xzzzzz", and "xzzzzzz"

• [ ]: Matches any single character from the set or range defined in the square
brackets. Example: xz[atp]r is an expression which matches the following strings:
"xzar", "xztr", and "xzpr"

• [pqr]: It matches p, q, or r individually. The following strings are matched by this
expression: "p", "q", and "r".

• [pqr][xy]: It matches p, q, or r, followed by either x or y. Example: The following
strings are matched by this expression: "px", "qx", "rx", "py", "qy", and "ry".

• (?: …): A non-capturing group, which groups an expression without capturing the text
it matches. Example: A(?:nt|pple) is an expression which matches the following strings:
"Ant" and "Apple"

• [^…]: It matches a single character which is not listed in the square brackets. Example:
Ab[^pqr] matches "Ab" followed by any one character other than p, q, or r, such as "Abx"
or "Ab9". It does not match "Abp", and it does not match "Ab" on its own.

• [a-z]: It matches a single lowercase letter from a to z. Strings made up entirely of
such letters, such as "a", "python", "good", are matched by [a-z]+.

• [0-9]: It matches a single digit from 0 to 9. Example: Strings made up entirely of digits,
such as "9845", "54455", are matched by [0-9]+.

• ab[^4-9]: It matches "ab" followed by one character which is not a digit from 4 to 9.
Example: This expression matches "ab3" and "abc", but not "ab5".

• [A-Z]: It matches a single uppercase letter from A to Z. Example: Strings made up
entirely of such letters, such as "EXCELLENT", "NATURE", are matched by [A-Z]+.

• ^[a-zA-Z]: It matches strings which start with either a lowercase or an
uppercase letter. This expression matches strings such as: "A854xb", "pv4fv",
"cdux".

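The group and range patterns above can be exercised with Python's re module, as in the following sketch. The helper name whole is an assumption; it tests the entire string with re.fullmatch.

```python
import re


def whole(pattern: str, s: str) -> bool:
    """True if the entire string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None


# Capturing group: the text matched inside ( ) can be read back.
print(re.fullmatch(r"A(xy)", "Axy").group(1))           # xy
# Non-capturing group: it groups, but stores nothing.
print(re.fullmatch(r"A(?:nt|pple)", "Apple").groups())  # ()
print(whole(r"A(?:nt|pple)", "Ant"))                    # True

# Character sets and ranges match exactly one character each.
print(whole(r"xz[atp]r", "xztr"), whole(r"xz[atp]r", "xzbr"))  # True False
print(whole(r"[pqr][xy]", "qy"))                               # True
print(whole(r"[a-z]", "python"), whole(r"[a-z]+", "python"))   # False True

# Negated sets need a character to be present, just not a listed one.
print(whole(r"Ab[^pqr]", "Abx"), whole(r"Ab[^pqr]", "Abp"))    # True False
print(whole(r"Ab[^pqr]", "Ab"))                                # False
print(whole(r"ab[^4-9]", "ab3"), whole(r"ab[^4-9]", "ab5"))    # True False

# Anchored range: only the first character is checked.
print(re.search(r"^[a-zA-Z]", "A854xb") is not None)           # True
print(re.search(r"^[a-zA-Z]", "9abc") is not None)             # False
```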
Escape Characters or Character Classes

Tries
A trie, also known as a digital tree or prefix tree, is a tree-like data structure used for
efficiently storing and searching a dynamic set or associative array where keys are usually
strings. The term "trie" comes from the word "retrieval."
In a trie, each node of the tree represents a single character of a key or a portion of a key. The
root of the trie represents an empty string or the null key. The structure of a trie allows for
efficient insertion, deletion, and search operations for keys.

Properties of a Trie Data Structure

• There is one root node in each trie.
• Each node of a trie represents a string (the prefix spelled out by the path from the
root to that node) and each edge represents a character.
• Every node holds a hashmap or an array of child pointers, with each index representing
a character, plus a flag that indicates whether any string ends at the current node.
• A trie can store any characters, including letters, digits, and special characters. These
notes consider only strings over the characters a-z, so each node needs only 26 pointers,
where the 0th index represents ‘a’ and the 25th index represents ‘z’.
• Each path from the root to a node represents a prefix of one or more stored strings; it
is a complete word when the end flag of that node is set.

Fig. 4. Trie Data Structure


Insertion Operation in Trie
In a trie data structure, the insertion operation involves adding a new key into the trie. Tries
are tree-like structures where each node represents a character of the key, and the path from
the root to a node forms the key associated with that node. Here is a step-by-step explanation
of how the insertion operation works in a trie:

1. Start at the Root:
Begin at the root of the trie.

2. Traverse the Trie:
Traverse the trie based on the characters of the key you want to insert. For each character
in the key:
• If the current node has a child node with the current character, move to that
child node.
• If there is no child node with the current character, create a new node and link
it as a child to the current node, then move to it.

3. Repeat Until End of Key:
Continue this process for each character in the key until you reach the end of the key.

4. Mark the Last Node:
Once you reach the end of the key, mark the last node as the end of a word (or set a flag
indicating the completion of a word).
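
The insertion steps can be sketched in Python as follows. This is one possible layout, not the only one: the names TrieNode, children, is_end, and insert are assumptions for this example, and a dictionary of children is used in place of the 26-element pointer array.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.is_end = False  # True if an inserted key ends at this node


def insert(root: TrieNode, key: str) -> None:
    node = root                                # step 1: start at the root
    for ch in key:                             # steps 2-3: walk the key
        if ch not in node.children:
            node.children[ch] = TrieNode()     # create the missing child
        node = node.children[ch]               # move to the child
    node.is_end = True                         # step 4: mark end of word


root = TrieNode()
for word in ["and", "ant", "an"]:
    insert(root, word)

# "an", "and", and "ant" share the nodes for 'a' and 'n'.
n_node = root.children["a"].children["n"]
print(sorted(n_node.children), n_node.is_end)  # ['d', 't'] True
```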

The following picture explains the construction of trie using keys given in the example
below.

Fig. 5. Insertion Operation in Trie

Searching Operation in Trie
In a trie data structure, the search operation involves determining whether a given key exists
in the trie. Here's how you can perform the search operation in a trie:

1. Start at the Root:
Begin at the root of the trie.

2. Traverse the Trie:
Traverse the trie based on the characters of the key you want to search for. For each
character in the key:
• If the current node has a child node with the current character, move to that child
node.
• If there is no child node with the current character, the key is not in the trie, and
you can stop the search.

3. Check for End of Key:
Once you have traversed all the characters of the key, check whether the last node is
marked as the end of a word. If it is, the key is present in the trie. Otherwise, the key is
not in the trie.
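
A minimal Python sketch of the search steps is given below. The names TrieNode, insert, and search are assumptions for this example; a small insert helper is included only so that the block runs on its own.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.is_end = False  # True if an inserted key ends at this node


def insert(root: TrieNode, key: str) -> None:
    node = root
    for ch in key:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        node = node.children[ch]
    node.is_end = True


def search(root: TrieNode, key: str) -> bool:
    node = root                          # step 1: start at the root
    for ch in key:                       # step 2: follow the key's characters
        node = node.children.get(ch)
        if node is None:
            return False                 # no such child: key is absent
    return node.is_end                   # step 3: check the end-of-word flag


root = TrieNode()
for word in ["and", "ant"]:
    insert(root, word)

print(search(root, "ant"))   # True
print(search(root, "an"))    # False: a prefix only, not a stored key
print(search(root, "ante"))  # False: runs past the stored nodes
```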

Fig.6. Searching Operation in Trie

Deletion Operation in Trie
Deletion in a trie data structure involves removing a key from the trie. Deleting a key may
lead to the removal of some nodes in the trie, but it is important to retain the structure for
other keys. Here's a step-by-step explanation of the deletion operation in a trie:

1. Start at the Root:
Begin at the root of the trie.

2. Traverse the Trie:
Traverse the trie based on the characters of the key you want to delete. For each character
in the key:
• If the current node has a child node with the current character, move to that child
node.
• If there is no child node with the current character, the key is not in the trie, and
no deletion is needed.

3. Unmark the End of Key:
Once you have traversed all the characters of the key, check the end-of-word flag of the
last node. If it is set, clear it so that the key is no longer reported as present. If it is not
set, the key was never stored and nothing is deleted. Nodes shared with other keys stay in
place, which is important for other keys that might share the same prefix.

4. Delete Nodes Bottom-Up:
Start from the node for the last character of the key and move upwards. For each node:
• If the node has no children and is not marked as the end of another word, delete it
and continue with its parent.
• If the node has other children or is marked as the end of another word, stop the
deletion process.
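
The deletion steps can be sketched recursively in Python. The names used here (TrieNode, insert, search, delete) are assumptions for this example, and the recursive formulation is one of several ways to prune nodes bottom-up.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.is_end = False  # True if an inserted key ends at this node


def insert(root: TrieNode, key: str) -> None:
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True


def search(root: TrieNode, key: str) -> bool:
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end


def delete(node: TrieNode, key: str, depth: int = 0) -> bool:
    """Remove key below node; return True if node may be pruned by its parent."""
    if depth == len(key):
        if not node.is_end:
            return False             # key was never stored: nothing to do
        node.is_end = False          # step 3: clear the end-of-word flag
        return not node.children     # prunable only if no key continues here
    child = node.children.get(key[depth])
    if child is None:
        return False                 # key is not in the trie
    if delete(child, key, depth + 1):
        del node.children[key[depth]]                 # step 4: prune bottom-up
        return not node.children and not node.is_end  # stop at shared nodes
    return False


root = TrieNode()
for word in ["an", "and", "ant", "geek"]:
    insert(root, word)

delete(root, "an")    # prefix of other words: only the flag is cleared
delete(root, "and")   # shares "an" with "ant": only the 'd' node is removed
delete(root, "geek")  # shares nothing: the whole branch is removed
print(search(root, "an"), search(root, "and"), search(root, "ant"))  # False False True
print(sorted(root.children))  # ['a']
```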

There are three cases when deleting a word from a trie:

1. The deleted word is a prefix of other words in Trie.
2. The deleted word shares a common prefix with other words in Trie.
3. The deleted word does not share any common prefix with other words in Trie.

1. The deleted word is a prefix of other words in Trie.

As shown in the following figure, the deleted word “an” is a complete prefix of the other
words “and” and “ant”. Only its end-of-word flag is cleared; no nodes are removed.

Fig.7. Deletion of word which is a prefix of other words in Trie

2. The deleted word shares a common prefix with other words in Trie.
As shown in the following figure, the deleted word “and” shares the common prefix ‘an’
with the other word ‘ant’. Only the nodes that are not shared (here, the node for ‘d’) are removed.

Fig. 8. Deletion of word which shares a common prefix with other words in Trie

3. The deleted word does not share any common prefix with other words in Trie.
As shown in the following figure, the word “geek” does not share any common prefix
with any other word, so all of its nodes are removed.

Fig. 9. Deletion of a word that does not share any common prefix with other words in Trie

Advantages of Trie Data Structure

1. Looking up a key takes time proportional to the length of the key, regardless of how
many keys are stored, because the search follows the characters of the key from the root.
In a binary search tree, by contrast, the lookup depends on the height of the tree.
2. Tries can take less space when they contain a large number of short strings with
common prefixes, as the nodes for a shared prefix are shared between the keys.
3. Tries help with longest prefix matching, that is, finding the longest stored key that is
a prefix of a given string.

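Disadvantages of Trie Data Structure

1. It can require more memory than storing the strings directly, because every node
holds a set of child pointers (for example, 26 per node for a-z), many of which are unused.
2. For exact-match lookups it is typically slower than a hash table.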

Applications of Trie Data Structure

• Dictionary Implementation:
Tries are commonly used to implement dictionaries or associative arrays, providing
fast and efficient lookup operations for words or keys.
• Spell Checkers:
Tries are employed in spell checkers to suggest corrections for misspelled words by
efficiently traversing the trie structure.
• Auto-Complete Systems:
Trie data structures are utilized in auto-complete systems to offer suggestions for
partially typed words, enhancing user experience in search fields and text editors.
• Prefix Matching:
Tries excel at prefix matching operations, making them useful in applications like
searching for contacts, finding files, or implementing efficient autocomplete features.
• Browser History:
Web browsers keep track of the history of websites visited by the user. When the
prefix of a previously visited URL is typed in the address bar, the user is given
suggestions of websites to visit. A trie supports this by using each visited URL as a key
and storing the number of visits as the value at the node where the URL ends.

Substring Search
Substring search, also known as substring matching, is the process of
finding the occurrences of a smaller string (substring) within a larger string (text). The
objective is to identify the positions or indices in the larger string where the specified
substring occurs. Substring search is a fundamental operation in string processing and has
applications in various fields, including text processing, data analysis, bioinformatics, and
information retrieval.
For example, consider the following:
Text: "This is an example text."
Substring: "example"
A substring search on the given text for the substring "example" would return the position
where the substring starts, which is index 11 in this case (counting from 0).
There are several algorithms and approaches to perform substring search efficiently, each
with its own advantages and use cases. Some well-known substring search algorithms
include:
• Brute Force Algorithm:
The simplest approach involves checking each position in the text for a match with
the substring.
• Knuth-Morris-Pratt (KMP) Algorithm:
An efficient algorithm that avoids unnecessary character comparisons by utilizing
information about the substring's structure.
• Boyer-Moore Algorithm:
Another efficient algorithm that skips characters based on a heuristic approach,
leading to fewer character comparisons.
• Rabin-Karp Algorithm:
Uses hashing to compare substrings, allowing for faster identification of potential
matches.

Algorithm for Substring Search:

1. Let string be the given string and pattern be the substring to be searched.
2. Let n be the length of the string and m be the length of the pattern.
3. For i = 0 to n-m:
   a. For j = 0 to m-1:
      i. If string[i+j] != pattern[j], break the inner loop.
   b. If the inner loop completed without breaking, return the index i.
4. If the pattern is not found, return -1.
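
The brute-force algorithm above translates almost line by line into Python. In this sketch the function name find_substring is an assumption; Python's for/else is used so that the else branch runs only when the inner loop finishes without a break.

```python
def find_substring(text: str, pattern: str) -> int:
    """Return the 0-based index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):            # step 3: i = 0 .. n-m
        for j in range(m):                # step 3a: j = 0 .. m-1
            if text[i + j] != pattern[j]:
                break                     # mismatch: try the next i
        else:
            return i                      # step 3b: inner loop completed
    return -1                             # step 4: pattern not found


print(find_substring("This is an example text.", "example"))  # 11
print(find_substring("This is an example text.", "sample"))   # -1
```

In the worst case this performs on the order of n*m character comparisons, which is what KMP, Boyer-Moore, and Rabin-Karp are designed to improve on.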
***
