Unit 5 Strings
Data Compression
Data compression is the process of reducing the size of a file or dataset to save storage space or
transmission time. The goal of compression is to represent the same information in a more
efficient way, using fewer bits or bytes than the original representation. Compressed data can
be decompressed to reconstruct the original data, either exactly (lossless compression) or
approximately (lossy compression). Many kinds of data can be compressed, including numbers,
text, video, images, audio, and even programs and software. Suppose the original file size is
200 bytes and, after compression, the file size is reduced to 120 bytes. The compressed file is
then 120 / 200 = 60% of the size of the original file, a saving of 40%.
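The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name compression_ratio and the sample data are illustrative assumptions, and zlib (from the Python standard library) stands in for any lossless compressor.

```python
import zlib

def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Return the compressed size as a fraction of the original size."""
    return compressed_size / original_size

# The worked example from the text: 200 bytes reduced to 120 bytes.
ratio = compression_ratio(200, 120)
print(f"{ratio:.0%} of the original size, {1 - ratio:.0%} saved")

# A real round trip on illustrative, highly repetitive data.
data = b"abcabcabc" * 100
packed = zlib.compress(data)
restored = zlib.decompress(packed)
print(len(data), "bytes ->", len(packed), "bytes")
print("restored exactly:", restored == data)
```

The exact compressed size depends on the data and on the compressor, so no particular figure should be expected from the second part.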
Lossless data compression reduces file size without losing any of the original file's data or
quality. The main advantage of lossless data compression is that the original data can be
restored exactly after decompression.
Lossless compression employs various algorithms to eliminate redundancy and reduce data
size. Popular algorithms include Huffman coding, Run-Length Encoding (RLE),
Lempel-Ziv-Welch (LZW), and DEFLATE.
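Of the algorithms listed, Run-Length Encoding is the simplest to illustrate. The sketch below is one possible form, not a standard file format: it replaces each run of repeated characters with a (character, count) pair. The function names rle_encode and rle_decode are illustrative.

```python
from itertools import groupby

def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (char, run_length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (char, run_length) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("AAAABBBCCD")
print(encoded)               # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded))   # AAAABBBCCD
```

RLE only helps when the input contains long runs; on text with few repeats the encoded form can be larger than the input.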
Lossy data compression reduces file size by permanently discarding some of the original data,
which lowers quality. The compressed file takes less space than the original because of this
loss, and the original cannot be restored exactly. This technique is useful when exact fidelity
is not the first priority. Lossy data compression is most widely used in JPEG images, MPEG
video, and MP3 audio formats.
Lossy compression commonly relies on transform coding, using transforms such as:
• Discrete Cosine Transform (DCT)
• Discrete Wavelet Transform (DWT)
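The idea behind transform coding can be sketched with a one-dimensional DCT. The code below is a simplified illustration, not the procedure of any particular codec: it transforms a short signal, keeps only the first few coefficients, and reconstructs an approximation. The function names, the sample values, and the choice of keep=3 are assumptions made for the example.

```python
import math

def dct(signal: list[float]) -> list[float]:
    """Unnormalised DCT-II of a real-valued signal."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]

def idct(coeffs: list[float]) -> list[float]:
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coeffs)
    return [
        coeffs[0] / n
        + (2 / n) * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
                        for k in range(1, n))
        for i in range(n)
    ]

def lossy_compress(signal: list[float], keep: int) -> list[float]:
    """Keep the first `keep` coefficients and zero the rest (this is the lossy step)."""
    coeffs = dct(signal)
    return coeffs[:keep] + [0.0] * (len(coeffs) - keep)

samples = [10.0, 11.0, 12.0, 13.0, 13.0, 12.0, 11.0, 10.0]
approx = idct(lossy_compress(samples, keep=3))
print([round(v, 2) for v in approx])
```

Only three numbers need to be stored instead of eight, and the reconstruction is close to, but not identical to, the original samples.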
Benefits or advantages of Data Compression
Compression minimizes the amount of storage space needed, leading to cost savings
and efficient resource utilization.
Compressed data requires less bandwidth, enabling faster data transfer over networks
and reducing communication costs.
Compressed files can be downloaded more quickly, improving user experience,
especially in web-based applications.
Reduced storage costs and optimized use of resources contribute to overall cost
savings.
Limitations or disadvantages of Data Compression
Lossy compression techniques sacrifice some data quality for higher compression
ratios, which may be unacceptable in certain applications.
Compression and decompression processes can introduce computational overhead,
affecting system performance.
Implementing and maintaining compression algorithms can be complex, requiring
careful consideration of trade-offs.
Different compression formats may not be universally supported, leading to
compatibility issues across systems and applications.
String Sort
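String sort refers to the process of arranging a collection of strings, or the characters within
a single string, in a specific order. Sorting can be done in ascending or descending order,
typically by comparing characters one by one (lexicographic order).
Example (sorting the characters of one string in ascending order):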
• Input: geeksforgeeks
• Output: eeeefggkkorss
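The example above can be reproduced with Python's built-in sorted(). This is a small sketch; the helper name sort_characters and the word list are illustrative.

```python
def sort_characters(text: str, descending: bool = False) -> str:
    """Return the characters of `text` arranged in sorted order."""
    return "".join(sorted(text, reverse=descending))

print(sort_characters("geeksforgeeks"))                   # eeeefggkkorss
print(sort_characters("geeksforgeeks", descending=True))  # ssrokkggfeeee

# The same built-in sorts a collection of strings lexicographically.
print(sorted(["pear", "apple", "fig"]))                   # ['apple', 'fig', 'pear']
```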
Case-Insensitive Order: Sorting strings while ignoring the case of the characters.
This ensures that uppercase and lowercase versions of the same letter are treated as
equal.
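A short sketch of the difference, assuming Python: by default uppercase letters sort before lowercase ones, while passing key=str.casefold compares the strings with case ignored. The helper name sort_ignoring_case and the word list are illustrative.

```python
def sort_ignoring_case(strings: list[str]) -> list[str]:
    """Sort strings so that case differences do not affect the order."""
    return sorted(strings, key=str.casefold)

words = ["banana", "Apple", "cherry", "apple", "Banana"]

# Default (case-sensitive) order: all uppercase-initial words come first.
print(sorted(words))              # ['Apple', 'Banana', 'apple', 'banana', 'cherry']

# Case-insensitive order: "Apple" and "apple" are treated as equal.
print(sort_ignoring_case(words))  # ['Apple', 'apple', 'banana', 'Banana', 'cherry']
```

Because Python's sort is stable, strings that compare equal after case folding keep their original relative order.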
Regular Expression
To write a regular expression, one must understand the special characters used in regex, such
as “.”, “*”, “+”, “?”, and more. The pattern is then written using these special characters and
literal characters. The appropriate function or method is used to search for the pattern in a
string. Here are some examples of how to use regex:
To match a sequence of literal characters, simply write those characters in the pattern.
To match a single character from a set of possibilities, use square brackets, e.g.
[0123456789] matches any digit.
To match zero or more occurrences of the preceding expression, use the star (*)
symbol.
To match one or more occurrences of the preceding expression, use the plus (+)
symbol.
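These four rules can be tried out with Python's standard re module. The patterns and test strings below are illustrative; re.search looks for the pattern anywhere in the string, while re.fullmatch requires the whole string to match.

```python
import re

# Literal characters match themselves.
print(re.search(r"cat", "concatenate") is not None)   # True

# Square brackets match one character from the set.
print(re.findall(r"[0123456789]", "a1b22"))           # ['1', '2', '2']

# Star: zero or more of the preceding expression.
print(re.fullmatch(r"ab*", "a") is not None)          # True
print(re.fullmatch(r"ab*", "abbb") is not None)       # True

# Plus: one or more of the preceding expression.
print(re.fullmatch(r"ab+", "a") is not None)          # False
print(re.fullmatch(r"ab+", "abbb") is not None)       # True
```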
Regular expression syntax can be grouped into the following categories:
1. Metacharacters
2. Quantifiers
3. Groups and Ranges
4. Escape characters or character classes
Metacharacters
Caret (^): This character matches the expression to its right at the start of a
string.
Example: ^a matches any string that starts with 'a', such as "aab",
"a9c", "apr", "aaaaab", etc.
Dollar ($): The $ sign matches the expression to its left at the end of a string.
Example: r$ matches any string that ends with 'r', such as "aaabr",
"ar", "r", "aannn9r", etc.
Dot (.): This character matches any single character in a string except
the line terminator, i.e. \n.
Example: b.x matches strings such as "bax", "b9x", "bzx" (but not "bar", which does not end in 'x').
Backslash (\): It escapes the special character that follows it, so that the character is matched literally.
Example: a\.b matches "a.b" but not "axb".
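The metacharacters can be checked with Python's re module, as in the sketch below. The helper name matches is illustrative; it reports whether the pattern is found anywhere in the text.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if `pattern` is found anywhere in `text`."""
    return re.search(pattern, text) is not None

# Caret anchors the match at the start of the string.
print(matches(r"^a", "aab"), matches(r"^a", "baa"))      # True False

# Dollar anchors the match at the end of the string.
print(matches(r"r$", "aaabr"), matches(r"r$", "rat"))    # True False

# Dot matches any single character except a newline.
print(matches(r"b.x", "b9x"), matches(r"b.x", "bar"))    # True False
print(matches(r"b.x", "b\nx"))                           # False

# Backslash makes the following special character literal.
print(matches(r"a\.b", "a.b"), matches(r"a\.b", "axb"))  # True False
print(matches(r"a.b", "axb"))                            # True (unescaped dot)
```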
Quantifiers
The quantifiers are used in a regular expression to specify how many times the preceding
character or group may occur.
+ (Plus Symbol): The expression to its left must occur one or more times.
Example: s+ matches "s", "ss", "sss", and so on.
? (Question Mark): The expression to its left must occur 0 (zero) or 1 (one) times.
Example: as? matches either "a" or "as", but not "ass".
* (Asterisk Symbol): The expression to its left must occur 0 or more times.
Example: Br* matches "B", "Br", "Brr", "Brrr", and so on.
{x,y}: The expression to its left must occur at least x times and at most y times.
Example: Pr{3,6}a matches exactly four strings: "Prrra", "Prrrra", "Prrrrra",
and "Prrrrrra".
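The quantifiers can be verified with Python's re module. In the sketch below the helper name full is illustrative; it uses re.fullmatch so that the entire string must match the pattern. Note that in Python the bounds of {x,y} are inclusive at both ends.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# + : one or more
print(full(r"s+", "sss"), full(r"s+", ""))               # True False

# ? : zero or one
print(full(r"as?", "a"), full(r"as?", "as"), full(r"as?", "ass"))  # True True False

# * : zero or more
print(full(r"Br*", "B"), full(r"Br*", "Brrr"))           # True True

# {3,6} : between 3 and 6 repetitions, inclusive
print(full(r"Pr{3,6}a", "Prra"))       # False (only 2 r's)
print(full(r"Pr{3,6}a", "Prrra"))      # True  (3 r's)
print(full(r"Pr{3,6}a", "Prrrrrra"))   # True  (6 r's)
print(full(r"Pr{3,6}a", "Prrrrrrra"))  # False (7 r's)
```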
Groups and Ranges
[ ]: It matches any one character from the set of characters listed in the square
brackets. Example: xz[atp]r matches the following strings:
"xzar", "xztr", and "xzpr"
(?: …): It defines a non-capturing group. Example: A(?:nt|pple) matches the
following strings: "Ant" and "Apple"
[^…]: It matches any one character that is not listed in the square brackets. Example:
Ab[^pqr] matches "Ab" followed by any single character other than p, q, or r, such as
"Abc" or "Ab9", but not "Abp" and not "Ab" alone.
[a-z]: It matches any one lowercase letter from a to z. Repeated as [a-z]+, it matches
strings such as: "a", "python", "good".
[0-9]: It matches any one digit from 0 to 9. Repeated as [0-9]+, it matches strings such
as: "9845", "54455"
ab[^4-9]: It matches "ab" followed by any one character that is not a digit from 4 to 9.
Example: This expression matches "ab1" and "abc", but not "ab5" or "ab8".
^[a-zA-Z]: It matches a string that starts with either a lowercase or an
uppercase letter. This expression matches strings such as: "A854xb", "pv4fv",
"cdux".
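The group and range patterns in this part can be checked with Python's re module. The sketch below uses an illustrative helper, full, which requires the whole string to match; the last example uses re.search because ^[a-zA-Z] only constrains the first character.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# Character set: one of a, t, p in the third position.
print(full(r"xz[atp]r", "xztr"), full(r"xz[atp]r", "xzbr"))      # True False

# Non-capturing group with alternation.
print(full(r"A(?:nt|pple)", "Ant"), full(r"A(?:nt|pple)", "Apple"))  # True True

# Negated set: exactly one character that is not p, q, or r.
print(full(r"Ab[^pqr]", "Abc"), full(r"Ab[^pqr]", "Abp"), full(r"Ab[^pqr]", "Ab"))  # True False False

# Ranges describe one character; add + to match a whole run.
print(full(r"[a-z]+", "python"), full(r"[a-z]+", "Python"))      # True False
print(full(r"[0-9]+", "9845"))                                   # True

# Negated range: "ab" followed by anything except a digit 4 to 9.
print(full(r"ab[^4-9]", "ab1"), full(r"ab[^4-9]", "ab5"))        # True False

# Anchored range: the string must start with a letter.
print(re.search(r"^[a-zA-Z]", "A854xb") is not None)             # True
print(re.search(r"^[a-zA-Z]", "9abc") is not None)               # False
```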
Tries
A trie, also known as a digital tree or prefix tree, is a tree-like data structure used for
efficiently storing and searching a dynamic set or associative array where keys are usually
strings. The term "trie" comes from the word "retrieval."
In a trie, each node of the tree represents a single character of a key or a portion of a key. The
root of the trie represents an empty string or the null key. The structure of a trie allows for
efficient insertion, deletion, and search operations for keys.
Insertion Operation in Trie
The path from the root to a node forms the key associated with that node.
Here is a step-by-step explanation of the insertion operation in a trie:
The following picture explains the construction of a trie using the keys given in the example
below.
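Insertion can be sketched in Python as follows. This is one common way to implement it, not necessarily the one the figure depicts: walk down from the root one character at a time, create any missing child node, and mark the last node as the end of a word. The class names TrieNode and Trie, the field is_end_of_word, the helper count_nodes, and the sample words are assumptions made for the example.

```python
class TrieNode:
    """One node of the trie, with one outgoing edge per character."""
    def __init__(self):
        self.children = {}           # maps a character to the child TrieNode
        self.is_end_of_word = False  # True if a stored key ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()       # the root represents the empty string

    def insert(self, key: str) -> None:
        node = self.root
        for ch in key:
            # Reuse the existing child for this character, or create it.
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end_of_word = True   # the path root..node now spells `key`

def count_nodes(node: TrieNode) -> int:
    """Count this node and all nodes below it."""
    return 1 + sum(count_nodes(child) for child in node.children.values())

trie = Trie()
for word in ["an", "and", "ant"]:
    trie.insert(word)

# The three words share the prefix "an", so only five nodes exist:
# root, a, n, d, t.
print(count_nodes(trie.root))  # 5
```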
Searching Operation in Trie
In a trie data structure, the search operation involves determining whether a given key exists
in the trie. Here's how you can perform the search operation in a trie:
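A search can be sketched in Python as below, assuming the same illustrative node layout as a typical dictionary-based trie (the names TrieNode, Trie, is_end_of_word, and starts_with are not from the source). The search follows one child per character and succeeds only if it ends on a node marked as the end of a word.

```python
class TrieNode:
    def __init__(self):
        self.children = {}           # maps a character to the child TrieNode
        self.is_end_of_word = False  # True if a stored key ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key: str) -> None:
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end_of_word = True

    def _walk(self, text: str):
        """Follow `text` from the root; return the final node or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None          # no edge for this character
        return node

    def search(self, key: str) -> bool:
        """True only if `key` was inserted as a complete word."""
        node = self._walk(key)
        return node is not None and node.is_end_of_word

    def starts_with(self, prefix: str) -> bool:
        """True if any stored key begins with `prefix`."""
        return self._walk(prefix) is not None

trie = Trie()
for word in ["and", "ant"]:
    trie.insert(word)

print(trie.search("and"))      # True
print(trie.search("an"))       # False: "an" is only a prefix, not a stored key
print(trie.starts_with("an"))  # True
print(trie.search("ants"))     # False
```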
Deletion Operation in Trie
Deletion in a trie data structure involves removing a key from the trie. Deleting a key may
lead to the removal of some nodes in the trie, but it is important to retain the nodes that
other keys still use. There are three cases when deleting a word from a trie.
1. The deleted word is a prefix of other words in Trie.
As shown in the following figure, the deleted word “an” is a complete prefix of the other
words “and” and “ant”. Only the end-of-word marker on the last node of “an” is cleared; no
nodes are removed.
2. The deleted word shares a common prefix with other words in Trie.
As shown in the following figure, the deleted word “and” shares the prefix “an” with the
other word “ant”. Only the nodes after the shared prefix (here, the node for “d”) are removed.
Fig. 8. Deletion of word which shares a common prefix with other words in Trie
3. The deleted word does not share any common prefix with other words in Trie.
As shown in the following figure, the word “geek” does not share any common prefix
with any other word, so all of its nodes are removed.
Fig. 9. Deletion of a word that does not share any common prefix with other words in Trie
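The three cases can be handled by a single routine, sketched below in Python under the assumption of a dictionary-based node layout (the names TrieNode, Trie, and is_end_of_word are illustrative). The routine clears the end-of-word marker and then walks back toward the root, removing nodes until it reaches one that another key still needs.

```python
class TrieNode:
    def __init__(self):
        self.children = {}           # maps a character to the child TrieNode
        self.is_end_of_word = False  # True if a stored key ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key: str) -> None:
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end_of_word = True

    def search(self, key: str) -> bool:
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end_of_word

    def delete(self, key: str) -> bool:
        """Remove `key`; return False if it was not stored."""
        path = [self.root]                       # nodes visited, root first
        for ch in key:
            nxt = path[-1].children.get(ch)
            if nxt is None:
                return False
            path.append(nxt)
        if not path[-1].is_end_of_word:
            return False
        path[-1].is_end_of_word = False          # case 1 stops here in effect
        # Walk back toward the root, pruning nodes no other key uses.
        for depth in range(len(key), 0, -1):
            node = path[depth]
            if node.children or node.is_end_of_word:
                break                            # still needed by another key
            del path[depth - 1].children[key[depth - 1]]
        return True

trie = Trie()
for word in ["an", "and", "ant", "geek"]:
    trie.insert(word)

trie.delete("an")    # case 1: prefix of other words, no node removed
trie.delete("and")   # case 2: only the node for "d" is removed
trie.delete("geek")  # case 3: the whole branch g-e-e-k is removed
print(trie.search("ant"), trie.search("and"), "g" in trie.root.children)
# True False False
```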
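Advantages of Tries
1. In a trie, keys are searched character by character along shared prefixes, so lookup time
depends on the length of the key rather than on the number of keys stored. In a binary
search tree, by contrast, lookup depends on the height of the tree.
2. Tries can take less space when they contain a large number of short strings with common
prefixes, because nodes are shared between the keys.
3. Tries help with longest prefix matching, that is, finding the stored key that is the
longest prefix of a given string.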
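Longest prefix matching can be illustrated with a compact trie built from nested dictionaries. This is a sketch under stated assumptions: the marker END, the function names build_trie and longest_prefix, and the sample keys are all hypothetical choices for the example.

```python
END = "$end"  # marker key; multi-character, so it cannot clash with a single character

def build_trie(keys: list[str]) -> dict:
    """Build a nested-dictionary trie from the given keys."""
    root: dict = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[END] = True             # a stored key ends at this node
    return root

def longest_prefix(root: dict, text: str):
    """Return the longest stored key that is a prefix of `text`, or None."""
    node = root
    best = None
    for i, ch in enumerate(text):
        if ch not in node:
            break                    # cannot follow `text` any further
        node = node[ch]
        if END in node:
            best = text[: i + 1]     # remember the longest key seen so far
    return best

trie = build_trie(["an", "and", "ant", "geek"])
print(longest_prefix(trie, "android"))  # and
print(longest_prefix(trie, "anchor"))   # an
print(longest_prefix(trie, "xyz"))      # None
```

The walk visits at most one node per character of the query, regardless of how many keys are stored.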
Substring Search
Substring search, also known as substring matching, is the process of
finding the occurrences of a smaller string (substring) within a larger string (text). The
objective is to identify the positions or indices in the larger string where the specified
substring occurs. Substring search is a fundamental operation in string processing and has
applications in various fields, including text processing, data analysis, bioinformatics, and
information retrieval.
For example, consider the following:
Text: "This is an example text."
Substring: "example"
A substring search on the given text for the substring "example" would return the position
where the substring starts, which is 11 in this case (counting the first character as position 0).
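The example can be confirmed with Python's built-in str.find, which returns the 0-based index of the first occurrence or -1 if there is none. The helper find_all below is an illustrative addition that collects every starting position.

```python
text = "This is an example text."

# First occurrence, as a 0-based index.
print(text.find("example"))  # 11

def find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of `pattern` in `text`."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)  # resume just past the last hit
    return positions

print(find_all(text, "is"))      # [2, 5]
print(find_all(text, "absent"))  # []
```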
There are several algorithms and approaches to perform substring search efficiently, each
with its own advantages and use cases. Some well-known substring search algorithms
include:
Brute Force Algorithm:
The simplest approach involves checking each position in the text for a match with
the substring.
Knuth-Morris-Pratt (KMP) Algorithm:
An efficient algorithm that avoids unnecessary character comparisons by utilizing
information about the substring's structure.
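Boyer-Moore Algorithm:
Another efficient algorithm that compares the substring from right to left and uses
heuristics (the bad-character and good-suffix rules) to skip ahead in the text,
leading to fewer character comparisons.
Rabin-Karp Algorithm:
Uses a rolling hash to compare the substring against each window of the text, so that full
character comparisons are needed only where the hash values match.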
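Three of the algorithms above can be sketched compactly in Python (Boyer-Moore is omitted to keep the example short). These are minimal illustrations, not tuned implementations; the function names and the hash parameters base and modulus are assumptions. Each function returns the 0-based index of the first match, or -1 if there is none.

```python
def brute_force_search(text: str, pattern: str) -> int:
    """Try every starting position in turn."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1

def build_failure_table(pattern: str) -> list[int]:
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_search(text: str, pattern: str) -> int:
    """Knuth-Morris-Pratt: never re-reads text characters after a mismatch."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]           # fall back using the pattern's structure
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

def rabin_karp_search(text: str, pattern: str,
                      base: int = 256, modulus: int = 1_000_000_007) -> int:
    """Rabin-Karp: compare rolling hashes, verify characters on a hash match."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, modulus)   # weight of the window's leading character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        w_hash = (w_hash * base + ord(text[i])) % modulus
    for i in range(n - m + 1):
        if p_hash == w_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Slide the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % modulus
    return -1

text = "This is an example text."
for search in (brute_force_search, kmp_search, rabin_karp_search):
    print(search.__name__, search(text, "example"))  # each prints 11
```

The brute-force version may compare up to len(text) * len(pattern) characters, while KMP does work proportional to len(text) + len(pattern); Rabin-Karp is typically fast in practice but depends on how often hash values collide.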