Exploring A Self-Replication Algorithm To Flexibly Match Patterns

This document summarizes a research paper that explores a self-replication algorithm called Matcher Cells for flexibly matching patterns. Matcher Cells is inspired by how biological cells self-replicate. The paper describes Matcher Cells using a functional programming language to provide a generic implementation, and an object-oriented architecture for languages like Java. It also presents two case studies that use Matcher Cells for applications in social media analysis and adaptive education. An evaluation with students found Matcher Cells to have good usability. The paper discusses tradeoffs of Matcher Cells and other pattern matching algorithms.


Received 1 December 2023, accepted 5 January 2024, date of publication 17 January 2024, date of current version 29 January 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3355319

Exploring a Self-Replication Algorithm to Flexibly Match Patterns

PAUL LEGER 1, (Member, IEEE), HIROAKI FUKUDA 2, NICOLÁS CARDOZO 3, AND DANIEL SAN MARTÍN 1

1 Escuela de Ingeniería, Universidad Católica del Norte, Coquimbo 1781421, Chile
2 Shibaura Institute of Technology, Tokyo 135-8548, Japan
3 Systems and Computing Engineering Department, Universidad de los Andes, Bogotá 111711, Colombia
Corresponding author: Paul Leger ([email protected])

ABSTRACT Pattern matching algorithms have been studied on numerous occasions, mainly focusing on
performance because of the large amount of data used in a matching process. However, a strong focus on
performance can entail particular issues like the lack of flexibility to match patterns. As a consequence,
programming developers need to tweak matching algorithms in contortive ways or create new specialized
ones altogether if their specific needs are not supported. Inspired by the self-replication behavior of cells
in biology, we explore and evaluate the design and implementation of an algorithm to flexibly match
patterns, named Matcher Cells. Through the composition of simple rules applied to cells, developers can
adjust the matching semantics of this algorithm to different needs. We describe this algorithm using a
pure functional language as a recipe for any Turing-complete programming language and then offer an
object-oriented architecture for languages like Java. To show the flexibility of our proposal, we use a
concrete implementation in TypeScript to describe two applications, from different domains, that use pattern
matching in a stream of tokens. Additionally, we carry out performance and developer experience empirical
evaluations with undergraduate students using Matcher Cells. Finally, we discuss the pros and cons of using a
biological-based algorithm, exploiting the compositions of rules, to match patterns.

INDEX TERMS Pattern matching, self-replication algorithms, string matching, context-aware systems.

The associate editor coordinating the review of this manuscript and approving it for publication was Fabrizio Messina.
© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 12, 2024

I. INTRODUCTION
Pattern matching algorithms [1] check the occurrences of a pattern in a sequence of tokens. Such patterns are usually expressed using abstractions (e.g., automata [2]) or languages (e.g., regular expressions [3]). Although these algorithms have been studied extensively, they continue to attract attention. This interest is attributed to their wide-ranging applications across several domains, including but not limited to spam filters, digital libraries, natural language processing, word processors, web search engines, parsers, computational molecular biology, and screen scrapers [4], [5]. A common characteristic among these applications is the abundant availability of large datasets that require filtration, extraction, and processing to uncover valuable data for researchers and practitioners. Thus, pattern-matching techniques should demonstrate their efficiency by identifying one or more patterns within datasets in a relatively short timeframe [5]. They should also be flexible and user-friendly enough to accommodate pattern matching without requiring developers to understand pattern matching algorithms in depth or to fine-tune existing algorithms to meet their specific requirements [6].

One specific context that exemplifies the need for flexible and extensible pattern matching algorithms is web scraping, which involves retrieving content from websites to store in repositories like databases or CSV files [7], [8]. Within web scraping, a diverse array of pattern matching techniques is employed, including regular expressions (Regex), the HTML Document Object Model (DOM), and XPATH. Nonetheless, these methodologies have been perceived as inflexible, complex to implement, and dependent on the structure of the data source [9]. For instance, websites often employ similar yet inconsistent templates for creating pages of the same category. Moreover, over time, the inner structure of a webpage can change without prior notification due to periodic updates in the layout, which may imply rewriting the pattern matching algorithm to get the desired data. Consequently, these changes could impact the time, effort, and cost associated with web scraping and data extraction tasks [9]. A web scraping tool that uses an approximate or flexible pattern matching algorithm can help address the issue of website evolution.

In this paper, we present an algorithm founded on the principles of Biologically Inspired Computing (BIC) [10], which provides researchers with the basis to create flexible algorithms for pattern matching. This algorithm centers around a self-replication algorithm called Matcher Cells, which takes inspiration from the self-replicating behavior of biological cells to articulate a broad spectrum of matching semantics [11]. This work extends an initial proposal of Matcher Cells [12], where it was mainly employed for matching program execution traces in the context of aspect-oriented programming [13]. It is worth noting that, to the best of our knowledge, the application of BIC concepts in the realm of pattern matching remains relatively unexplored. This paper extends our work in the following aspects:

1) A mature description of Matcher Cells using the Scheme programming language [14] to provide a generic implementation of our proposal that works on Turing-complete programming languages. We selected Scheme as a functional language that provides few and simple constructs to formally describe, as much as possible, a generic implementation.
2) An architecture to realize Matcher Cells in object-oriented languages like Java. We exemplify this architecture with a concrete implementation in TypeScript [15] for NodeJS (v16) [16], available at: https://github.com/pragmaticslaboratory/match-cell-base (rev. e4c556d).
3) Two case studies validating Matcher Cells. First, a pattern matching tool to analyze streaming services such as social network sites (e.g., Twitter [17]). Second, a context-aware system [18] that adapts the difficulty of addition exercises tasked to students, according to their performance (i.e., context). Both case studies are available at: https://pragmaticslaboratory.github.io/matcher-cells-study-cases [19].
4) An experience evaluation of our proposal involving 23 undergraduate students from Universidad Católica del Norte (Chile). The usability of Matcher Cells is evaluated using the System Usability Scale (SUS) [20] approach.
5) A preliminary performance evaluation with a comparison to two other pattern matching algorithms, brute-force and KMP.
6) A discussion about the trade-off between programming abstractions and expressiveness of Matcher Cells.
7) A deep reference frame that contains proposals related to Matcher Cells, that is, flexible pattern matching algorithms.

Roadmap: Fig. 1 shows the methodology followed in this article. Section II presents two flexibility issues in pattern matching algorithms with their consequences, focused mainly on performance. After the motivation, Section III presents Matcher Cells, our self-replication algorithm to flexibly match patterns, through a conceptual design. The design is followed by a concrete implementation of Matcher Cells in TypeScript, named MCJs. Section IV validates our implementation with two applications: a Twitter analyzer and a context identifier for context-aware systems. Additionally, we present our user experience evaluation from three perspectives: (1) developer experience, (2) performance, and (3) abstractions and expressiveness. Finally, the paper discusses different algorithms for pattern matching in perspective of our proposal in Section VI, leading to Section VII with the conclusion and avenues of future work.

FIGURE 1. Methodology used in this study.

II. FLEXIBILITY IN PATTERN MATCHING
With a large number of pattern matching algorithms available in the body of literature [21], pattern matching is currently
used in several fields such as string matching [1], execution trace matching in aspect-oriented programming [22], and intrusion detection [23]. However, as most of these algorithms mainly focus on their performance or algorithmic complexity, some issues can appear when more flexibility is needed. We now present existing pattern matching algorithms, followed by the issues detected in existing approaches.

A. MATCHING ALGORITHMS AND RESTRICTIONS
There are eleven main algorithms describing the families of pattern matching algorithms, as described in TABLE 1. For the evaluation of the algorithms, we assume a data set of n tokens and a pattern of m tokens. Naïve or brute-force algorithms enumerate all possible matchings of a pattern by checking the satisfiability of each potential matching of the pattern in the complete dataset. This algorithm implies comparing the pattern with every data point, leading to a worst-case performance (i.e., time complexity) of O(n ∗ m). Brute-force algorithms have proven useful when matching over small datasets, due to their simplicity. However, the algorithm loses efficiency whenever the matching string has too many prefixes that match the pattern (e.g., pattern p = “ddde” to match string s = “ddddddddddddde”).

TABLE 1. Description of pattern matching algorithms.

The Rabin-Karp algorithm [24] offers an optimization over the naïve approach, and a general framework for other string matching algorithms by preprocessing the pattern string. The algorithm uses modulo equivalences and rolling hashes to process the string: it takes the modulus of the pattern elements and of each window of size m in the search string, with a time complexity of Θ(m) for this preprocessing. Given this, the complexity of the matching step reduces to Θ(n − m + 1). Given the use of hashed data, the Rabin-Karp algorithm is applicable to situations where there might be many matches in the string, processing them faster. Additionally, the algorithm is used to match bitmap objects (e.g., images) more easily.

The Knuth-Morris-Pratt (KMP) algorithm [1] uses sequences of pre-processed tokens. The matching time complexity of KMP is O(n) in its best case. At first glance, KMP shows a much better performance than brute-force algorithms; however, KMP's performance can degrade to O(n + m) if a sequence of tokens does not allow KMP to appropriately reuse information of previous partial matchings of the patterns. For example, consider a pattern p = “aaa” and a sequence s = “aabaab”; here, the matching fails every time the pattern is about to match, i.e., when a “b” is found. KMP is useful for matching large data sets, as in DNA sequence analysis or image processing (i.e., its parallel version), with a better performance than naïve or Rabin-Karp.

The Boyer-Moore (BM) algorithm [25] reverses the matching direction, matching from the tail end of the pattern. The algorithm uses a bad match heuristic to move forward in the search of the pattern, skipping over the characters before the bad match, until the first match in the pattern with the bad character. The time complexity of the algorithm is O(n ∗ m) in the worst case, given its dependence on the token sequence. The BM algorithm is used for text editors, command substitution, or intrusion detection.

BNDM (Backward Non-deterministic DAWG Matching) [26] is a bit-parallel algorithm based on suffix automata, extending BDM for faster execution. The algorithm uses a table to store bitmasks (a sequence of bits used to keep or clear bits from another sequence) for each token to search. The algorithm shifts the search window according to the word size (ω), detecting matches whenever the last word bit is 1. The time complexity of the algorithm is O(n ∗ m²/ω) for general patterns, but can be O(n ∗ m) (as in BDM) if the pattern used is smaller than the word size. BNDM is applicable in general matching scenarios but most efficient for searching patterns smaller than the memory word size.

The BOM (Backward Oracle Matching) algorithm [27] expands on the Boyer-Moore suffix matching by matching prefixes (i.e., using a suffix oracle on the reverse pattern). As a consequence, BOM optimizes search window shifts, obtaining a better performance. On average, the performance of BOM is O(n ∗ log_|Σ|(m)/m), but degrades to O(m ∗ n) in the worst case. BOM is used for DNA sequence matching and general string matching.

So far, the described algorithms have gained linear speedups proportional to the size of the data, i.e., O(n). In order to improve such execution time, it is possible to preprocess the data. Suffix data structures are used for such a purpose, keeping track of suffixes from the data set. Suffix trees [28] use common sequences in edges going down the tree. Tree leaves then contain the index at which the suffix, going down to that leaf, starts in the data set. Building the suffix tree takes O(n) time, as it requires going through the complete data set. However, once built, it is possible to directly match any possible pattern in O(m) time. Given this property, suffix trees are of special interest for matching problems, for example DNA sequence search. Similar to suffix trees, suffix arrays [29] propose an alternative of matching algorithms that compromise search time performance to increase the flexibility of matching patterns. Finding all k matches on a dataset takes O(m ∗ log(n) + k ∗ m) time, with an additional best-case time of O(n) to build the array. Suffix arrays are built by extracting all possible suffixes of the dataset, keeping their start index, and then sorting said array. Suffix automata [30], [33], similar to the previous two cases, are used to pre-process the dataset to search patterns. The difference between the automaton and the array and tree structures lies in the memory used for its construction, with suffix automata optimizing the space used.

The Aho-Corasick algorithm [31], [34] builds a finite state machine with additional links between internal nodes to speed up transitions between failed matches. In particular, Aho-Corasick is used to match multiple patterns in the data set, with a time complexity of O(n + l + z), with l as an upper bound on the patterns' size and z the total number of appearances of the patterns, and a preprocessing time of O(n). This algorithm is used mainly for string search, for example finding the smallest string of a given length that contains k strings.

The bit-vector parallel algorithm for string matching [32] is built from the ideas of the dynamic programming algorithm for string matching, taking advantage of a bit-mask representation of the matching dynamic programming matrix. The matrix captures the bit representation of the difference between the data and the pattern, so that the matrix can be manipulated in parallel using bit-wise operations. This approach constitutes a significant matching speedup with a time complexity of O(n ∗ m/w), where w is the word size. This algorithm is of special interest in bioinformatics and DNA sequencing, as it is able to detect pattern matches with a bounded error between the pattern and its match string in the data set, for example a maximum error of 3 molecules.

Taking into account existing matching algorithms, we note that when considering a large amount of data as the sequence of tokens, some pattern matching algorithms like BM or brute-force, in their worst case, are not usable in practice (cf., Section V-B). From this analysis, we can conclude that it is necessary to know the features of many pattern matching algorithms, as a specific strategy may be used to boost performance, depending on the sequence of tokens and the patterns used.

B. UNKNOWN SEQUENCES OF TOKENS AND PATTERNS
To exemplify the shortcomings of existing matching algorithms, with respect to their flexibility, consider a scenario where a web application has a filtering policy to prevent malicious requests that can affect its availability, compromise security, or consume excessive resources (e.g., application firewalls for Amazon Web Services [35]). Here, the use of algorithms like KMP or BM may not work appropriately because the token sequence and the pattern to search may be unknown beforehand. More precisely, when it is necessary to filter for malicious requests, we may not know exactly which patterns, representing malicious requests, should be used. For example, we may need to extend the filtering policy with new kinds of patterns if new variants of security attacks are observed. In this example, if we are using exact pattern matching (i.e., a key-value table), changing the pattern model to use regular expressions may be necessary; that is, a completely different implementation is needed (e.g., using a deterministic finite automaton).

In application domains such as stream processing over the Internet, patterns and their conditions to be matched (e.g., a period of time) can vary on the fly, meaning that the semantics of pattern matching algorithms should be flexibly adaptable without tweaking or changing these algorithms.

III. MATCHER CELLS
Through small entities with simple rules, self-replication algorithms [11] allow developers to flexibly express the semantics of a process. This is because each rule defines a portion of the semantics, and their composition defines the full semantics. The composition process allows developers to easily adjust or create new semantics (on the fly), bringing a flexible expressiveness to define matching semantics.

This section defines a self-replication algorithm, named Matcher Cells, to flexibly match patterns, bringing a variety of matching semantics that allows developers to express different pattern matching semantics (Section I). This section is organized as follows. We first introduce self-replication behaviors in cells to describe how to use these behaviors to match patterns. The section then describes how to express a wide range of pattern matching semantics through compositions of cell behavior rules. Using the language Scheme [36], in Section III-D, we describe a generic recipe to implement our proposal in different paradigms and programming languages like JavaScript. We finally present a concrete implementation of Matcher Cells in TypeScript, a typed version of JavaScript.

A. SELF-REPLICATION ALGORITHMS
Self-replication algorithms are inspired by cellular behavior [11]. Concretely, these algorithms emulate the reactions of a set of biological cells to a sequence of reagents in a solution. Fig. 2 shows the different possible reactions of a cell to a reagent, which can be:
1) the creation of an identical copy of the cell, or of a copy with a small variation, to persist in the solution,
2) nothing,
3) death, or
4) a combination of the previous ones.

FIGURE 2. Different reactions of a cell to a reagent.

An algorithm that follows self-replicating behavior is defined by a pair (Seeds, R), where Seeds is the first set of cells in a solution, and R is the set of rules that govern the evolution of the solution using combinations of reactions. Additionally, if we consider that a solution is an autopoietic system [37], this solution can add or remove cells to maintain itself after all the current cells react to a reagent.

B. MATCHING PATTERNS
To match patterns, Matcher Cells' algorithm borrows concepts from self-replication algorithms, giving flexibility to match patterns. For this proposal, we consider reagents as tokens that must be matched, and cells to contain patterns of tokens and metainformation. We use the notation c(P) to represent a cell c that must match a pattern P, and c($) for a matched cell, which is the match of a pattern. When a cell creates a new cell, the new cell can gather metainformation like the link to the parent or the time when the cell was created. Fig. 3 illustrates the reaction to reagent a of two solutions:
Solution 1 - c(a): a cell that must match the pattern a (i.e., only one token), creating a cell c($) when a is matched.
Solution 2 - c(a→b): a cell that must match a pattern that is composed of the sequence a→b (i.e., a and then b). When this cell matches a token a, it will create a new cell, c(b), which must match the token b.
In both solutions shown in Fig. 3, the cell reaction is creation, and the links between the cells are stored in the metainformation of the created cell. Additionally, Solution 2 shows that the seed cell, c(a→b), gathers its creation time information, which is passed to the new cell when it is created.

FIGURE 3. Reactions of two solutions to a reagent a. Solution 1 contains a cell that must match a. The cell in solution 2 must match a→b.

Fig. 4 shows three different evolutions of the same solution with a cell c(a→b) and a sequence of tokens a→a→b→b. The first evolution ends with only one match (i.e., c($)) because a cell dies when it creates new cells; for this reason, c(a→b) dies when it matches a token a, and c(b) when it matches b. The second evolution ends with four matches because no cell dies when there is a match. In the third evolution, the solution evolves with two matches because the seed (i.e., c(a→b)) never dies, but other cells die when they match a token. Which is the correct evolution? The answer will depend on the semantics required by programmers, opening up the flexibility for programmers to choose the most convenient option for their case.

C. MATCHING PATTERNS WITH FLEXIBLE SEMANTICS
As Fig. 4 shows, matching semantics strongly affects how a pattern is matched as well as how many times each pattern will be matched with a specific sequence of tokens. In our self-replication algorithm, Matcher Cells, different matching semantics can be expressed through compositions of simple rules, which can be per-cell or per-solution:
1) per-cell rules: They are applied to each cell in a concurrent manner using the same token.
2) per-solution rules: They are applied after all per-cell rules. Aiming to emulate the behavior of an autopoietic system, per-solution rules aim to keep a solution useful to continue the developer-defined matching process.
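The three evolutions of Fig. 4 can be reproduced with a small simulation. The following TypeScript sketch is illustrative only: the names MatcherCell, MetaInfo, DeathPolicy, and evolve are assumptions made for this example and do not come from the paper's implementation. A pattern is modeled as the list of tokens still to match, so an empty list stands for c($), and the three policies encode which cells die after creating a new cell.

```typescript
// Hypothetical sketch: names and types are illustrative, not the paper's API.
type Tok = string;

interface MetaInfo {
  parent: MatcherCell | null; // link to the cell that created this one
  createdAt: number;          // creation time, inherited from the seed
}

interface MatcherCell {
  pattern: Tok[];             // tokens still to match; [] stands for c($)
  meta: MetaInfo;
}

// Which cells die after creating a new cell.
type DeathPolicy = "creator-dies" | "nobody-dies" | "seed-survives";

function evolve(seed: MatcherCell, tokens: Tok[], policy: DeathPolicy): number {
  let solution: MatcherCell[] = [seed];
  for (const token of tokens) {
    const created: MatcherCell[] = [];
    const dead: MatcherCell[] = [];
    for (const cell of solution) {
      // A matched cell, c($), has an empty pattern and no longer reacts.
      if (cell.pattern.length === 0 || cell.pattern[0] !== token) continue;
      // Reaction: create a cell for the rest of the pattern, keeping the
      // parent link and the creation time in its metainformation.
      created.push({
        pattern: cell.pattern.slice(1),
        meta: { parent: cell, createdAt: cell.meta.createdAt },
      });
      const dies =
        policy === "creator-dies" ||
        (policy === "seed-survives" && cell !== seed);
      if (dies) dead.push(cell);
    }
    solution = solution.filter((c) => dead.indexOf(c) === -1).concat(created);
  }
  // Count the matched cells c($) left in the solution.
  return solution.filter((c) => c.pattern.length === 0).length;
}

const makeSeed = (): MatcherCell => ({
  pattern: ["a", "b"],
  meta: { parent: null, createdAt: 0 },
});
const input: Tok[] = ["a", "a", "b", "b"];

const firstEvolution = evolve(makeSeed(), input, "creator-dies");   // 1 match
const secondEvolution = evolve(makeSeed(), input, "nobody-dies");   // 4 matches
const thirdEvolution = evolve(makeSeed(), input, "seed-survives");  // 2 matches
```

Under these assumptions, the same seed c(a→b) and the same sequence a→a→b→b yield one, four, and two matches, which agrees with the three evolutions described for Fig. 4.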

FIGURE 4. Three different evolution scenarios of a solution to the same sequence of tokens and pattern. The first evolution ends with a match, the second one ends with four matches, and the last one ends up with two matches.

FIGURE 5. Using the same pattern and sequence of tokens, five different results because of matching semantics.

An example is the per-cell rule named match token, which executes the reaction of each cell to a token as shown in Fig. 2. Using this and other rules, Fig. 5 illustrates five different potential matching semantics of our algorithm:

1) MULTIPLE MATCHES
Using only the match token rule, Matcher Cells provides multiple matching semantics. For example, Fig. 5.1 shows that c(a→b) generates two c($)s when it meets the sequence a→a→b.

2) SINGLE MATCH
This semantics is provided by the composition of the match token and kill creator rules. This new rule kills the parent cell when a new cell is created. For example, c(a→b) creates only one c($) because c(a→b) is killed by the rule when c(b) is created, as shown in Fig. 5.2.

3) ONE MATCH AT A TIME
Programmers often need to execute an action every time that the same pattern occurs, e.g., when a security flaw occurs [38]. However, the composition of match token and kill creator will kill the seed cell, such as c(a→b), resulting in a single match. By composing match token and kill creator with add seed, a per-solution rule that is applied after the previous rules, it is possible to match the same pattern every time it occurs (Fig. 5.3). The add seed rule creates a copy of the seed cell whenever there is no cell in the solution, allowing the matcher to start a new matching process (i.e., with a new cell and solution).

4) ALWAYS START A MATCH
The same pattern might start simultaneously, making simultaneous processes of matching work at the same time. For example, this semantics can be useful to capture simultaneous multi-intruders [39]. However, the previous compositions of rules do not allow Matcher Cells to start a new match if there is a matching process already executing. If we replace the add seed rule with always add, the algorithm will always be able to start the process of a new matching (Fig. 5.4). Note that this semantics does not keep a link between the seed and its child cell because the kill creator rule kills the seed, and the always add rule inserts a new seed. Although this matching semantics and multiple matches lead to the same number of matches, both semantics are different. This is because the multiple matches semantics allows the matcher to continue a match from any part of a pattern already matched, whereas the always start a match semantics can only start a match from the beginning of a pattern. To illustrate this difference, consider that the sequence of tokens has b as suffix, i.e., a→a→b→b. With this extended sequence, multiple matches would have two new matches, while the always start a match semantics would not have any new match when the last b token is received by the solution.

5) MATCH PER TIME FRAME
Suppose the scenario presented in Section II-B (Unknown Sequences of Tokens and Patterns), where malicious requests occur in a short period of time (e.g., seconds). Here, patterns should be matched before Δtime elapses. Using the trace time-life rule, all cells are killed when the time period elapsed from the first token match is greater than Δtime. Fig. 5.5 shows an example of such situations.

D. EXPRESSING CELLS AND RULES
Cells and rules can be expressed using the abstractions of different programming paradigms (e.g., objects). To illustrate our proposal with one of the most straightforward programming abstractions, we use functional programming, as realized in Scheme [36]. As such, cells and rules are entirely expressed using only plain functions. The benefit of using a pure functional abstraction is that this description can be used as a recipe for any Turing-complete programming language.

As token and pattern definitions strongly depend on the application domain (e.g., string matching), they are not considered as part of Matcher Cells' core. In the validation section, we discuss two applications, where concrete implementations for tokens and patterns are presented.

1) CELLS
A cell is a function composed of a pattern and its metainformation, which may create other cells when it reacts to a token (Fig. 2). In our proposal, the signature and implementation of a cell are:

    ;; Signature
    ;; Cell: Pattern x MetaInf -> (Token) -> MetaInf U List<Cell>
    (define (Cell pattern meta-inf)
      (lambda ([token null])             ; token is optional and defaults to null
        (if (null? token)
            meta-inf                     ; called without a token: metainformation
            (let ([result (react token pattern)])
              (if (pattern-matched? result)
                  (return-list-with-new-cells result)
                  (return-empty-list))))))

A Cell function returns its metainformation whenever it is called without a token; otherwise it returns the reaction to the token. The result of a cell reaction is a (possibly empty) list of matching cells.

2) RULES
We define per-cell and per-solution rules, which correspond to the functions applied to a list of cells (Section III-C). Application of a rule may add cells to or remove cells from the cells list. A per-cell rule is applied to each cell in a solution, which consumes a token of its sequence to match. A per-solution rule is applied to the resultant cells after applying all per-cell rules. The most basic example of a per-cell rule is identity, which returns the given cells untouched. An example of a per-solution rule is remove-match-cells, which removes all match cells that are in a solution. Both rule examples are presented in the following snippets with their corresponding signatures.

    ;; Signature
    ;; Per-Cell Rule: Token x List<Cell> -> List<Cell>
    (define (identity token cells)
      cells)

    ;; Signature
    ;; Per-Solution Rule: List<Cell> x Pattern -> List<Cell>
    ;; matched-cell? is an assumed helper predicate (true for c($) cells).
    (define (remove-match-cells cells pattern)
      (filter (lambda (cell) (not (matched-cell? cell))) cells))

VOLUME 12, 2024 13559


P. Leger et al.: Exploring a Self-Replication Algorithm to Flexibly Match Patterns

Additionally, we introduce composable rules. A composable rule is a function that takes a rule (say rule 1) as parameter and returns a new rule (rule 2), implying that a rule is applied first and then its composition rule. We posit kill creator as a composable rule:
;; Signature
;; Composable-Rule: Rule -> Rule
(define (kill-creator rule)
  (lambda (cells token)
    (let ([new-cells (rule cells token)])
      (remove* (map get-creator new-cells) new-cells))))
Using composable rules, we can create different semantics like the ones shown in Fig. 5. In the code snippet below, we present the multiple match and single-match semantics using composable rules, both starting with the identity rule.
(define multiple-match (match-token identity))
(define single-match (kill-creator (match-token identity)))

E. AN OBJECT-ORIENTED ARCHITECTURE
This section now shows an object-oriented architecture for Matcher Cells that can be used in languages like Java, JavaScript, and TypeScript. Fig. 6 shows the core components defined in our proposal. A solution carries out the matching process and is created with seeds and the composition of per-cell and per-solution rules. Both types of rules evolve the cells in the solution. In addition, a developer can use the Composite design pattern [40] to create composable rules, where the identity rule is the leaf component of this design pattern. A cell contains the pattern that must match, and metainformation that is cloned with potential mutations when passed to children cells. With the react method, a cell reacts to a token, potentially returning the creation of new cells.
We provide a concrete implementation of our object-oriented architecture, named MCJs, using TypeScript, a typed version of JavaScript (one of the most used programming languages to develop Web applications [41], [42]). In the following code snippet, we illustrate the matching of pattern abc using this object-oriented implementation. Lines 1 to 5 define the pattern abc. Lines 4 and 5 use an object composition to set the sequence. The solution, which is composed of a cell and two rules, is defined in Line 9. Line 13 uses the defined solution to match the pattern with the input “abc”.
 1 let a: Token = new Token("a"); //match a
 2 let b: Token = new Token("b"); //match b
 3 let c: Token = new Token("c"); //match c
 4 let ab: Sequence = new Sequence(a, b); //match ab
 5 let abc: Sequence = new Sequence(ab, c); //match abc
 6
 7 let seed: Cell = new Cell(abc, new MetaInformation());
 8
 9 let sol: Solution = new Solution([seed], // list of seeds
10     new OnlyOneMatch(), // cell rule
11     new AddSeed()); //solution rule
12
13 sol.match("abc"); //find one match

FIGURE 6. Main components of Matcher Cells’ architecture.

IV. VALIDATION
This section validates the usability and applicability of Matcher Cells through two applications, from different domains, that focus on matching stream sequences of tokens, a distinguishing feature of our proposal. The first application is the identification of tweets in the Twitter feed. The second application is the implementation of context identifications in a context-aware system [18]. Both implementations are available online [19].

A. APPLICATION 1: IDENTIFYING TWEETS
Fig. 7 shows our emulated Twitter environment using a real data set of 100,000 tweets related to the video game subject during 2020. Every 60 seconds (a customizable parameter), these tweets appear in the Web page’s feed (the central panel in the figure). A user can identify specific tweets through the matching of a pattern in one tweet, as the figure shows with a red background. In this application, the token sequence corresponds to appearing tweets as time passes; this behavior is similar to that of streaming services.
As tweets are freely written, the same concept in a tweet might be written in different ways; for example, “play”, “play station”, or “ps1” all refer to the same concept. Additionally, a concept can be expressed more times than others, identifying potentially more enthusiastic tweets. Taking into account the previous two observations, this Web application exhibits two features of Matcher Cells: regular expressions and multiple matching semantics. We highlight that, although the definition of a specific pattern language is not part of our proposal, it is not difficult to use a pattern language specification.
In this application, the multiple matches semantics is used to identify the intensity of a pattern. Fig. 7 shows that users can select this semantics in the Web page. As regular expressions (regex) are used to match different strings that can represent the same concept because these strings follow a similar structured form, users can enable the use of regexes to match similar tweets in this application. To implement regex in our proposal, we define cells that, if they do (or do not) match a token, create other cells that expect to match the following term in a regex. For example, Fig. 8 shows the matching process of a+b. When token a is matched, the


cell creates a new cell with the pattern a*b. Additionally, the following code snippet shows the use of functions that sketch the a+ regex operator implementation for Matcher Cells in TypeScript.1 While the star function returns a function that continues matching the same pattern (a*) until a different token appears, the plus function applies star when there is a match (a+ = aa*).
function star(op: function) {
  return function inner(token: string): function {
    let result: function = op(token);
    if (/* result does not match */) return result; //a match cell
    if (/* result matches */) return inner;
  }
}
function plus(op: function) {
  return function inner(token: string) {
    let result: function = op(token);
    if (/* result does not match */) return inner; //end of the match
    if (/* result matches */) return star(token); //continue with star
  }
}

let a_plus: function = plus("a"); // this function matches "a+"

1 Full implementations of this and other operators are available on https://github.com/pragmaticslaboratory/match-cell-base. In this implementation, the functions are exchanged with objects.

FIGURE 7. Web application that uses Matcher Cells to match tweets.

FIGURE 8. Using regular expressions in Matcher Cells.

Given that Matcher Cells supports regular expressions, programmers can express more complex regular expression-based patterns to match URLs, hashtags (#), or mentions (@). For example, a programmer might need to match all tweets containing a given URL.

B. APPLICATION 2: IDENTIFYING CONTEXTS
A context-aware system adapts its behavior according to the current identified context [18] from its surrounding execution environment. Fig. 9 shows a context-aware system that adapts the difficulty of addition exercises tasked to students, according to their performance (i.e., context). In this system, we identify three contexts:
1) Good performance. If a student answers a (parameterized) number of exercises correctly in a row, the system increases by one the number of digits of both terms in the following exercises.
2) Bad performance. If a student answers a (parameterized) number of exercises wrong in a row, the system decreases by one the number of digits of both terms in the following exercises.
3) No performance evaluated. If a student skips a (parameterized) number of exercises, the system starts from the first level (i.e., additions with one digit).
The context-aware system can use the events associated with correct, wrong, and skip exercises to identify the previous contexts. Given that the Matcher Cells algorithm matches a sequence of tokens, we modified the callbacks of these events to add the creation of the tokens correct, wrong, and skip. These tokens represent the respective events, and the sequence of these tokens represents a stream of events. To implement this context-aware system, we added three Matcher Cells instances, where each instance is used to identify a particular context. When one Matcher Cells instance matches a pattern, the system executes its associated adaptation, e.g., adding one digit in the addition terms in the Good Performance context.

V. EMPIRICAL EVALUATION
In addition to the usability and applicability validation of Matcher Cells in the previous section, we now turn our attention to the evaluation of the developer experience, algorithm performance, and programming abstraction expressiveness of self-replication algorithms for pattern matching.
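Before moving on to the evaluation, the three contexts of Application 2 (Section IV-B) can be read as streaks of identical tokens in the event stream. The sketch below is a plain streak detector written only for illustration; it is not the Matcher Cells implementation. The names `StreakDetector`, `feed`, and `onExerciseEvent`, the threshold of 3, and the lower bound of one digit are all assumptions.

```typescript
type ExerciseToken = "correct" | "wrong" | "skip";

// Detects `threshold` consecutive occurrences of one token (hypothetical helper).
class StreakDetector {
  private run = 0;
  private target: ExerciseToken;
  private threshold: number;

  constructor(target: ExerciseToken, threshold: number) {
    this.target = target;
    this.threshold = threshold;
  }

  // Feeds one token; returns true when the streak completes, then starts over.
  feed(token: ExerciseToken): boolean {
    this.run = token === this.target ? this.run + 1 : 0;
    if (this.run < this.threshold) return false;
    this.run = 0;
    return true;
  }
}

// One detector per context, mirroring the three Matcher Cells instances.
const goodDetector = new StreakDetector("correct", 3);
const badDetector = new StreakDetector("wrong", 3);
const skipDetector = new StreakDetector("skip", 3);

let digitCount = 1; // difficulty: number of digits of both terms

// Called from the callbacks of the correct, wrong, and skip events.
function onExerciseEvent(token: ExerciseToken): number {
  if (goodDetector.feed(token)) digitCount += 1;
  if (badDetector.feed(token)) digitCount = Math.max(1, digitCount - 1); // assumed floor
  if (skipDetector.feed(token)) digitCount = 1;
  return digitCount;
}
```

Keeping one detector per context follows the one-instance-per-context design described for the application, so each adaptation stays independent of the others.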


FIGURE 9. Web application that uses Matcher Cells to identify the context associated with the student performance to resolve additions.

A. DEVELOPER EXPERIENCE
Throughout this paper, we claim that Matcher Cells is simple for developers to use because of its rule composition, in contrast to other pattern matching approaches like regex, which is perceived as difficult by both students and professional programmers [43], [44]. This section presents the results of four evaluations related to developer experience. These evaluations were carried out by 23 undergraduate students in the third-year computer science program at the Universidad Católica del Norte (Chile). After a 40-minute session teaching Matcher Cells, the students were asked to complete five pattern matching tasks, where they had to express the correct pattern and composition of rules in Matcher Cells, specifically using a Web interface [19] for MCJs. Table 2 contains a brief description of the five tasks, ordered by increasing complexity.

1) EXPERIENCE RESULTS
As a first result, we highlight that 100% of the developers recommended Matcher Cells for the development of pattern matching algorithms. The usability of Matcher Cells is evaluated using the System Usability Scale (SUS) [20] approach, which has been widely used for years in different contexts [45], [46], [47]. This usability evaluation works as a proxy to measure how error-prone it is to use a pattern matching algorithm that requires composing a set of rules before using it. The data used to create the charts and SUS evaluation are anonymized and available on http://pleger.cl/sites/matchercells/results.html (responses in Spanish and translated to English).

a: PERCENTAGE OF DEVELOPERS WHO SOLVED A TASK
Fig. 10 compares the percentage of participants who solved a task. For the first two tasks, which are the easiest ones, over 95% of participants solved these tasks. The remaining tasks had a lower success rate than the previous ones. The last task had the lowest percentage (close to 80%), given that this task is the most challenging one, as it requires the use of a time constraint to match a pattern.

FIGURE 10. Percentage of students that finished the task.

b: AVERAGE TIME TO SOLVE A TASK
Fig. 11 shows the average solution time per task. The participants solved the first four tasks significantly faster than the last task, which took almost double the time. As in the previous evaluation, we think this task took more time because it requires an extra configuration to be solved: the time to match a pattern.

c: HOW EASY A TASK WAS RESOLVED
Using a Likert scale [48] of five levels (from “Strongly agree” to “Strongly disagree”), we asked the participants the question: “How easy was the task?”. The answers to the question are shown in Fig. 12. Although the “Strongly agree” option is not the most chosen in all tasks, there is a clear preference towards the “Agree” option; in fact, no participant


chose the “Strongly disagree” option. Using this evaluation, we might claim that the use of Matcher Cells is not complex.

TABLE 2. Pattern matching tasks that developers were asked to solve.

FIGURE 11. Average time per task.

FIGURE 12. Using a Likert scale: how easy was the task?

d: USABILITY
To evaluate Matcher Cells in terms of usability, we used the SUS approach [20]. To use SUS, the participants who use a product (e.g., software) are asked to score the following ten sentences using the five-level Likert scale:
1) “I think that I would like to use this system frequently”
2) “I found the system unnecessarily complex”
3) “I thought the system was easy to use”
4) “I think that I would need the support of a technical person to be able to use this system”
5) “I found the various functions in this system were well integrated”
6) “I thought there was too much inconsistency in this system”
7) “I would imagine that most people would learn to use this system very quickly”
8) “I found the system very cumbersome to use”
9) “I felt very confident using the system”
10) “I needed to learn a lot of things before I could get going with this system”
To calculate a final score, we follow a three-step procedure:
1) Add up the scores of all odd-numbered questions, then subtract 5 from the total to get final-odd.
2) Add up the scores of all even-numbered questions, then subtract the total from 25 to get final-even.
3) Add final-odd and final-even, then multiply the result by 2.5.
For example, answering 4 to every odd-numbered sentence and 2 to every even-numbered one yields (20 − 5) + (25 − 10) = 30, i.e., a final score of 75.
The final score is in the range of 0-100, which determines a tool’s usability, shown in Fig. 13 for Matcher Cells.2 Fig. 14 shows, in ascending order, the score of the usability evaluation using SUS for each participant. At first glance, we observe that most participants (52.2%) find the use of Matcher Cells “Acceptable”. A percentage of 39.1% of students find its use “Good” or “Excellent”, and 8.7% (equivalent to two students) find its use “Poor”. The average score is 66.19, meaning that the usability of Matcher Cells is “Acceptable”, close to “Good”.

FIGURE 13. Using SUS, the value range determines how usable an interface is.

2 Subtle variations in the ranges can be found on the Web.

e: CONCLUSION
Considering the results of our four empirical evaluations, we can observe that even if this pattern matching algorithm requires configuring a set of rules before using it, developers are able to use it without a steep learning curve. The trade-off between a pre-configuration of the rule composition and the flexibility to express matching semantics can affect the preference for Matcher Cells. Nevertheless, if developers can use an external library for rule compositions, it might tip the preference towards Matcher Cells.

B. PERFORMANCE
The main goal of this study focuses on defining a self-replication algorithm to flexibly express matching semantics; hence, we have yet to sacrifice any potentially valuable feature based on its expected cost. Nonetheless, any pattern matching algorithm must exhibit a performance


evaluation against large amounts of data. Hence, we carry out a preliminary performance evaluation using the TypeScript implementation of Matcher Cells.

FIGURE 14. In ascending order, the SUS score for 23 students.

FIGURE 15. Scenario 1: Sequence of tokens a^n x and pattern x.

FIGURE 16. Scenario 2: Sequence of tokens (ax)^n and pattern ax.

In our proposal, we evaluated Matcher Cells with three scenarios with different effects, where each one increases the number of cells or rules to manage. The first scenario


only manages one cell in the solution. The second scenario manages a limited number (n) of cells in the solution. The last one creates cells for every new token that appears in the sequence, meaning that Matcher Cells has to manage a massive number of cells simultaneously in the solution. To execute these three scenarios, we used NodeJS (v16) [16] on a Macbook Pro (2020) with a 2 GHz Quad-Core Intel Core i5 and 16GB of RAM running macOS Big Sur, and Matcher Cells’ GitHub revision e4c556d (April 7, 2021). Fig. 15, Fig. 16, and Fig. 17 show these three scenarios of the performance evaluation comparing three algorithms (TABLE 1): brute-force (baseline), KMP [1] (efficient algorithm), and Matcher Cells.
For KMP, we use an implementation available on the NPM repository [49]. Each scenario’s sequence length ranges from 50,000 tokens to 2,500,000 tokens.
Fig. 15 shows the evaluation of the simplest scenario of a pattern consisting only of one character (x), which appears at the end of the sequence. For example, if the sequence length is 50,000, the sequence is defined by a^49,999 x, and the pattern is x. Given that this scenario is simple, we can observe that the brute-force algorithm has the best performance, and our proposal has the worst performance. This is because Matcher Cells and KMP require overhead to work; for example, our proposal must keep cells and execute a set of rules that are composed.
Fig. 16 shows the multiple matches of the pattern ax in a sequence defined by the regex (ax)^n. For example, if n = 3, the sequence is axaxax, and the pattern is ax. In this case, we can see that KMP has the best performance, while our proposal has a performance similar to that of the brute-force algorithm. Compared to the previous Matcher Cells evaluation, the performance in this case is 3 times better for large sequences. This is because every time there is a match, all cells except the seed are removed from the solution.
Fig. 17 shows the performance evaluation of a complex pattern, a^n, in a sequence of a^n x. For example, if n = 3, the sequence is aaax and the pattern is aaa. First, note that the figure does not display the brute-force algorithm. This is because its performance is orders of magnitude slower than the performance of the other algorithms. Therefore, we excluded it to be able to observe the difference between Matcher Cells and KMP. Second, although KMP is better than Matcher Cells, both algorithms have a similar trend, indicating that our proposal might be scalable to more complex sequences and patterns.

FIGURE 17. Scenario 3: Sequence of tokens a^n x and pattern a^n. Note the brute-force algorithm is not in the chart because its performance evaluation is out of the chart range, i.e., too slow.

Conclusion: With this preliminary evaluation, we can observe that our proposal is not as efficient as existing pattern matching algorithms. Indeed, if we carry out a preliminary time complexity analysis of the current version of Matcher Cells, we might estimate:
• Worst case. If no rule removes cells, the time complexity is O(c^r ∗ n), where c is the number of cells, r is the number of rules, and n is the length of the input. Although this result is clearly much slower than the existing algorithms, we can observe that the use of Matcher Cells with regular expressions is better than that of the brute-force algorithm (Fig. 17). This case shows that our algorithm is exponentially time-consuming, meaning that this algorithm can be extremely slow.
• Best case. If a rule like kill creator is used, the time complexity might improve to O(1^r ∗ n) → O(n) because only one cell is alive during the matching process. This time complexity means that Matcher Cells is linear, making it work fast. However, this efficient result only happens when a programmer wants to match the first match in an input, and not all possible matches.
Although this paper explores how we might use self-replication algorithms to flexibly match patterns, we think, as a future step, that it is possible to explore efficient ways to process units like cells. For example, as our current implementation evaluates all cells within a solution for every new token, we might index or classify cells to prevent evaluations when specific tokens do not affect some cells.
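One way to read the indexing idea above is to group cells by the token they are waiting for, so that a new token only touches the cells it can affect. The sketch below is an illustration under strong assumptions and is not part of the current implementation: `WaitingCell`, `CellIndex`, and the premise that each cell expects exactly one next token are hypothetical simplifications.

```typescript
// Hypothetical cell that waits for exactly one next token (simplification).
interface WaitingCell {
  id: number;
  expects: string;
}

// Groups cells by expected token so a lookup skips unaffected cells.
class CellIndex {
  private buckets: { [token: string]: WaitingCell[] } = {};
  private total = 0;

  add(cell: WaitingCell): void {
    if (!Object.prototype.hasOwnProperty.call(this.buckets, cell.expects)) {
      this.buckets[cell.expects] = [];
    }
    this.buckets[cell.expects].push(cell);
    this.total += 1;
  }

  // Cells that can react to `token`; all others are never evaluated.
  affectedBy(token: string): WaitingCell[] {
    return Object.prototype.hasOwnProperty.call(this.buckets, token)
      ? this.buckets[token]
      : [];
  }

  size(): number {
    return this.total;
  }
}

const cellIndex = new CellIndex();
for (let i = 0; i < 1000; i++) {
  // 999 cells wait for "a", a single one waits for "x".
  cellIndex.add({ id: i, expects: i === 0 ? "x" : "a" });
}

const evaluatedNaively = cellIndex.size();                 // every cell is visited
const evaluatedIndexed = cellIndex.affectedBy("x").length; // only the "x" bucket
```

With such an index, the per-token cost depends on the number of cells waiting for that token rather than on all cells in the solution; whether this pays off for patterns that accept several tokens at once remains an open design question.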


C. PROGRAMMING ABSTRACTIONS AND EXPRESSIVENESS
A distinguishing feature of our proposal is the composition of simple abstractions, i.e., rules, to flexibly express matching semantics. By simple, we mean a rule that only targets one concern in an isolated manner, where compositions of these rules are able to express advanced semantics. The use of simple abstractions boosts modularity [50], which enables the reuse of abstractions (components) by separating concerns. However, the effect of tyranny of the dominant decomposition [51] raises the following issue: a concern that does not fit into the initial view of a system ends up being tangled and scattered with other concerns, implying that this concern cannot be defined in an isolated manner. This issue appears in the rule compositions of our proposal as well.
Consider as an example a mobile context-aware system scenario where the system must match malicious patterns if and only if the Internet connection context is unsafe. For this scenario, an intuitive composition of the per-solution rules is to use the add-seed and then the kill-all-on-safe rules, as presented in the code snippet below. In this composition, add-seed creates new cells to match malicious patterns, while the kill-all-on-safe rule kills all cells that match malicious patterns when the Internet connection is safe, because these patterns are only relevant in an unsafe context. Unfortunately, note that the composition will not match all desired malicious patterns. In particular, any pattern that starts in a safe context and ends the matching process in an unsafe context will not be matched. This is because the rule kill-all-on-safe kills any cell in a safe context, including the cell seeds added by the add-seed rule. The erroneous behavior arises due to the composition of two isolated rules whose impacts affect each other, e.g., kill-all-on-safe impacts add-seed.

(define one-match-at-a-time-on-unsafe
  (kill-all-on-safe (add-seed identity)))

Regarding Matcher Cells’ expressiveness, we claim the composition of simple rules in our proposal allows for an expressive definition of the matching semantics. To affirm that our proposal can express and execute any matching semantics that a Turing-complete language can express, we only need to simulate one of these languages with our rules. The λ-calculus [52] is a Turing-complete and functional programming language whose abstractions consist of functions that take one function as parameter and return a function as a result. Like the λ-calculus, Matcher Cells’ rules are also functions that take functions (cells) as parameters and return functions (cells) as a result. Using the currying design pattern [53] to remove the need for a second parameter in rules, we can say our proposal simulates the λ-calculus; therefore, Matcher Cells is Turing-complete.
Conclusion: We can affirm that Matcher Cells users will have to understand how to compose rules; indeed, we used a 40-minute session to teach developers how to compose rules before evaluating Matcher Cells. Likewise, if functions can use all the power of a Turing-complete language, we can also affirm that these rules are expressive enough. However, the use of high-level programming abstractions and expressiveness presents the following trade-off:
• Programming abstractions. Using simple abstractions like rules, developers can enhance the flexibility to express different matching semantics in one algorithm, provided that they are careful in the composition of rules. However, if a rule has to tangle concerns, modularity and reuse for the composition of rules might be affected.
• Expressiveness. To define rules, developers can use a Turing-complete language. However, the use of the full power of a language can break the spirit of rules that are simple enough to implement only one concern of a particular matching semantics.
A potential solution is to set a boundary between abstractions and expressiveness. This boundary can be addressed by the use of a domain-specific language [54] to define language-level rules that enforce the simple spirit of matching rules that are expressive enough.

VI. RELATED WORK
String pattern matching has been a subject of study since the late 1970s and remains a vibrant research field due to its diverse applications that encompass a broad spectrum of domains, including intrusion detection systems, bioinformatics, web search engines, spam filters, natural language processing, and web scraping. According to [6], string matching algorithms fall into two main categories: exact string matching algorithms and approximate string matching algorithms. The former aims to identify a complete match, whereas the latter is designed to locate a substring that closely resembles a specified pattern string.
Exact string matching algorithms can be further categorized into single-pattern and multiple-pattern exact matching approaches. In single-pattern matching algorithms, the algorithm is designed to work with a single pattern as input, directing its efforts to locating that specific pattern within the target database (i.e., sequence of tokens). In contrast, multiple-pattern matching algorithms are equipped to handle a single input, tasked with searching for multiple instances of that input throughout the target database. Moreover, software-based exact matching algorithms can be divided into character comparison, hashing, bit-parallel, and hybrid approaches [6], [55].
Unlike exact string matching algorithms, approximate algorithms can be classified as filtration-based algorithms and backtracking-based algorithms. The first one follows a two-stage process. In the initial stage, these algorithms pinpoint potential occurrences of patterns within the text. In the subsequent stage, all of these identified locations undergo comprehensive verification. On the other hand, the second one builds upon the foundations of exact string matching algorithms, which are adapted to facilitate approximate searching through edit distance operations [56], [57], [58].
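The edit distance operations mentioned above can be made concrete with the textbook dynamic program for the Levenshtein distance, which counts unit-cost insertions, deletions, and substitutions. This is a generic sketch for illustration, not code from any of the cited algorithms; the function name `editDistance` is an arbitrary choice.

```typescript
// Edit distance with unit-cost insertions, deletions, and substitutions
// (Wagner-Fischer dynamic program, keeping only two rows).
function editDistance(source: string, target: string): number {
  // Row for the empty prefix of `source`: j insertions build target[0..j).
  let previous: number[] = [];
  for (let j = 0; j <= target.length; j++) previous.push(j);

  for (let i = 1; i <= source.length; i++) {
    const current: number[] = [i]; // i deletions reach the empty target
    for (let j = 1; j <= target.length; j++) {
      const same = source.charAt(i - 1) === target.charAt(j - 1);
      const substitution = previous[j - 1] + (same ? 0 : 1);
      const deletion = previous[j] + 1;
      const insertion = current[j - 1] + 1;
      current.push(Math.min(substitution, deletion, insertion));
    }
    previous = current;
  }
  return previous[target.length];
}

const kittenToSitting = editDistance("kitten", "sitting"); // 3
```

The two-row variant uses memory proportional to the target length, at the price of not recording which edits were applied.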


For instance, the Levenshtein distance [59], also known as edit distance, is a measure of the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another. For example, let’s consider two words: “kitten” and “sitting”. In total, we need three operations to transform “kitten” into “sitting”: substituting k with s, substituting e with i, and inserting g at the end. Thus, the Levenshtein distance between these two words is 3. The smaller the Levenshtein distance, the more similar the words are in terms of their spelling or structure. Such approaches often promote the utilization of compact data and data structures based on suffix indexing [60], [61].
To evaluate Matcher Cells against the most relevant approaches in terms of flexibility, we categorize them into three sets: classical general-purpose algorithms, domain-specific algorithms, and nature-inspired models.

A. CLASSICAL GENERAL-PURPOSE ALGORITHMS
Classical general-purpose algorithms for string pattern matching are foundational techniques designed to efficiently locate occurrences of a specific pattern within a given text or a sequence of tokens. These algorithms encompass character-based, hashing, suffix automata, bit-parallel, and hybrid approaches, as categorized by [62]. A concise overview of each follows.
The character-based approach is a classical method that addresses string matching problems by directly comparing individual characters. This method does not entail any preprocessing and typically involves two essential stages: the searching phase and the shift phase. Previous research has sought enhancements for both stages. Significantly, among various character-based approaches, the BM algorithm [25] stands as a benchmark and standard method in the field.
The suffix automaton is a composite structure involving two interconnected yet separate automaton constructors: the deterministic acyclic finite state automaton, which serves as a data structure representing a finite set of strings, and the suffix automaton, a finite automaton functioning as a suffix index for matching purposes [63]. This strategy is suitable for long patterns and performs very well because it provides a pre-generated prefix table, which allows the procedure to skip certain comparisons during matching.
KMP [1] and BM [25] are examples of algorithms that use the concept of automata and mainly focus on performance. The Matcher Cells algorithm, on the contrary, focuses on the runtime flexibility that enables developers to customize matching semantics, inspired by the self-replicating behavior of cells. Therefore, we should not directly compare Matcher Cells with existing proposals in terms of performance; rather, we should compare them regarding the flexibility to match patterns in different ways.
In hashing-based strategies, characters are represented by hash values rather than being compared individually, significantly reducing computational overhead through the comparison of integer values instead of characters [64]. For instance, the Karp-Rabin algorithm [24] employs this method to address string matching challenges, conducting comparisons from left to right. However, the approach is constrained by hash collisions, where two distinct strings may map to the same numerical value. While these methods accelerate pattern matching, they ultimately rely on character-based comparisons and lack the runtime flexibility offered by Matcher Cells.
Other classical algorithms are bit-parallel and hybrid approaches. The first one relies on the principles of parallel computing, reducing the number of operations within the algorithm to match the number of bits in a computer word [65]. This algorithm demonstrates speed and efficiency, particularly when the length of the provided pattern p is shorter than the word length [64]. The second one combines the advantages of different algorithms and performs better than individual algorithms [66]. These approaches can combine one or more character-based methods, one or more automata-based methods, or both character- and automata-based methods [6]. In terms of flexibility, each of them lacks the option for semantic customization.

B. DOMAIN-SPECIFIC ALGORITHMS
Domain-specific algorithms for pattern matching are tailored methods designed to address specific application domains or types of data. Unlike general-purpose algorithms that aim for versatility across various scenarios, domain-specific algorithms are optimized to excel in particular contexts. While Matcher Cells can operate in the same domains as classical algorithms, it could be especially advantageous in domains where temporal information is crucial for pattern detection, particularly in highly dynamic environments.
In the realm of information security, specifically concerning spambots, algorithms play a crucial role in protecting digital systems and user data from the actions of automated programs engineered for spam distribution [67]. Spambots, also known as spam robots, are automated scripts or software applications designed to create and propagate unsolicited and potentially harmful content, including unwanted advertisements, phishing schemes, and malware [68]. According to [69], there are four types of spam detection techniques: content-based, link-based, machine learning, and string pattern matching-based. Subsequently, we compare Matcher Cells with techniques based on string pattern matching in the domain of spambot detection.
Alamro et al. [69] propose an algorithm that can detect one or more sequences of indeterminate actions in text T in linear time. The algorithm takes into account temporal information, because it considers time-annotated sequences and requires a match to occur within a time window t. The authors state that some spambots might attempt to disguise their actions by varying certain actions. For example, a spambot takes the actions ABCDEF, then ACCDEF, then ABDDEF, etc. Thus, the sequence can be described as A[BC][CD]DEF. Spambots try to deceive by changing the second and third actions, so actions [BC] and [CD] are variations of the same sequence. Additionally, spambots can execute these variations across different time frames, adding complexity to their detection.
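To make the disguised sequence A[BC][CD]DEF and the time window t tangible, the following sketch scans a time-annotated log with a naive quadratic search. It is deliberately not the linear-time algorithm of Alamro et al. [69]; the names `TimedAction`, `findDisguised`, and `toLog`, as well as the assumption that every action is a single character, are hypothetical.

```typescript
// One logged action with its timestamp in seconds (hypothetical shape).
interface TimedAction {
  action: string; // assumed to be a single character
  at: number;
}

// Each pattern position lists the actions accepted there,
// e.g. A[BC][CD]DEF becomes ["A", "BC", "CD", "D", "E", "F"].
function findDisguised(
  log: TimedAction[],
  positions: string[],
  windowSeconds: number
): number[] {
  const starts: number[] = [];
  if (positions.length === 0) return starts;
  for (let s = 0; s + positions.length <= log.length; s++) {
    let ok = true;
    for (let k = 0; k < positions.length && ok; k++) {
      ok = positions[k].indexOf(log[s + k].action) !== -1;
    }
    // The whole occurrence must fit inside the time window.
    const elapsed = log[s + positions.length - 1].at - log[s].at;
    if (ok && elapsed <= windowSeconds) starts.push(s);
  }
  return starts;
}

// Builds a log where consecutive actions are `stepSeconds` apart.
function toLog(actions: string, stepSeconds: number): TimedAction[] {
  const out: TimedAction[] = [];
  for (let i = 0; i < actions.length; i++) {
    out.push({ action: actions.charAt(i), at: i * stepSeconds });
  }
  return out;
}

const spamPattern = ["A", "BC", "CD", "D", "E", "F"];
const fastHits = findDisguised(toLog("ABCDEFACCDEFABDDEF", 1), spamPattern, 10); // [0, 6, 12]
const slowHits = findDisguised(toLog("ABCDEFACCDEFABDDEF", 3), spamPattern, 10); // []
```

The second call finds nothing because each six-action occurrence then spans 15 seconds, which exceeds the 10-second window.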


In reference to the research conducted by Alamro et al. [69], Matcher Cells exhibits comparable functionality by amalgamating multiple rules to identify variations in actions. Employing the One Match at a Time and Match per Time Frame semantic rules, Matcher Cells seamlessly incorporates temporal information, enabling it to effectively detect spambots that disguise their actions. Furthermore, a pivotal feature distinguishing Matcher Cells could be the ability to identify multiple spambots through the application of the Always Start a Match rule. This rule facilitates a concurrent process of matching sequences, leading to the simultaneous identification of multiple spambots.
To detect the spambot sequences, the algorithm of Alamro et al. requires as input sequences of temporally annotated actions from user logs. Specifically, each request in a user log is mapped to a predefined index key in the sequence, and the date timestamp for the request in the user log is mapped to a time point in the sequence. Then, by using the Manber and Myers algorithm [29] and bit masking operations, the algorithm can detect one or more indeterminate sequences in a Web user log.
Ghanaei et al. [70] present a technique for identifying Web spambots, addressing spam-related issues on the Web. This method relies on analyzing Web usage behavior, extracting discriminative features known as action strings from user logs to distinguish between spambot and human actions. An action is defined as a set of user efforts aimed at achieving specific purposes, while action strings represent sequences of actions for a particular user in a transaction. To implement a real-time, on-the-fly classification method, the authors construct a trie data structure based on action strings. Within this structure, each trie edge includes an action key index, and each node incorporates the probability of a given action string being associated with either human or spambot behavior. Consequently, new actions can be classified into two categories: Match and NotMatched.
Hayati et al. [68] introduce a method for detecting web spambots. The authors propose a rule-based approach that analyzes action strings of web usage behavior using trie data structures. These action strings are indicative of spambot activity. The system is designed for on-the-fly classification, meaning it can quickly and effectively identify web spambots in real time.
In light of the research conducted by Ghanaei et al. [70] and Hayati et al. [68], it is anticipated that Matcher Cells would exhibit superior performance in spambot detection. This expectation arises from the observation that these studies do not incorporate temporal information. Therefore, in the context of evaluating the most pertinent works on string-based approaches to spambot detection, Matcher Cells is expected to showcase enhanced flexibility in rule composition, thereby boosting its effectiveness in identifying spambots.

C. NATURE-INSPIRED MODELS
Nature-inspired models are computational or mathematical models that draw inspiration from natural processes, phenomena, or systems observed in the natural world. We have identified two models for comparison with Matcher Cells: cellular automaton and chemical abstract machine. We elaborate on these comparisons below.
A cellular automaton [71] is a collection of cells that evolves through a number of discrete time steps according to a set of simple rules based on the states of neighboring cells. Despite their simplicity, cellular automata can model complex behavior in various areas such as physics, engineering, and theoretical biology. For example, it is known that pores of leaves in plants can be represented using a cellular automaton [72]. Similarly, in Matcher Cells, a programmer can use a composition of simple rules to define their own matching semantics. Although cellular automata can be Turing-complete and can therefore be applied to pattern matching, to the best of our knowledge there is no research proposing them as concrete interfaces for this subject. In contrast, our proposal provides a concrete interface of rules that are composed and applied for pattern matching.
A chemical abstract machine [73] is a model for asynchronous concurrent computations. This model borrows the idea of a chemical solution in which floating molecules can interact with each other according to reaction rules, allowing for contact among molecules. This model gives expressive power to proposals such as Petri Nets [74] in concurrent programming. Indeed, it is possible to represent the full Calculus of Communicating Systems (CCS) [75] using elements such as agents, molecules, and rules defined in chemical abstract machines. Matcher Cells also adopts a set of reaction rules, which gives programmers the expressiveness to define their own semantics. It might also be possible to improve the performance of Matcher Cells by introducing the concept of a membrane that encapsulates molecule evolutions locally [73].

VII. CONCLUSION
The field of pattern matching algorithms has been extensively studied, with solid contributions from the research community. Most of the contributions in the field are related to efficiency and performance, leaving the flexibility to express different matching semantics aside. As a consequence, developers of these algorithms need to learn many algorithmic techniques, tweak them in contorted ways, or create new specialized techniques altogether if their specific needs are not supported off-the-shelf. This paper explores the use of self-replication algorithms to express different matching semantics flexibly. As a result of this exploration, we propose Matcher Cells, an algorithm inspired by the self-replication behavior of cells that allows developers to match patterns flexibly. The matching semantics of Matcher Cells is expressed by the composition of simple match rules. We provide a functional description of our proposal to implement it in any Turing-complete language that provides function-like abstractions. We also provide a concrete implementation for TypeScript used to evaluate our proposal by means of two applications for streaming data sequences. Additionally, we evaluate the performance of our approach
with an empirical evaluation to assess the usability aspects of Matcher Cells.
Considering this paper as a first step to propose self-replication algorithms to match patterns, there are still some open issues to address. For example, although performance is beyond the scope of our evaluation, we are aware that the current implementation needs to improve its performance. We plan to explore and evaluate indexing strategies in cells and rules to solve this issue.

ACKNOWLEDGMENT
The authors want to thank Marcelo Lazo (marcelo.lazo@alumnos.ucn.cl), an undergraduate student from Universidad Católica del Norte, Chile, who implemented the TypeScript version of Matcher Cells. Additionally, they thank Éric Tanter ([email protected]) for providing initial ideas.

REFERENCES
[1] D. E. Knuth, J. H. Morris, and V. R. Pratt, “Fast pattern matching in strings,” SIAM J. Comput., vol. 6, no. 2, pp. 323–350, Jun. 1977.
[2] J. Sakarovitch, Elements of Automata Theory. Cambridge, U.K.: Cambridge Univ. Press, Oct. 2009.
[3] K. Thompson, “Programming techniques: Regular expression search algorithm,” Commun. ACM, vol. 11, no. 6, pp. 419–422, Jun. 1968.
[4] L. Chen, S. Lu, and J. Ram, “Compressed pattern matching in DNA sequences,” in Proc. IEEE Comput. Syst. Bioinf. Conf., Stanford, CA, USA, Aug. 2004, pp. 62–68.
[5] B. A. Hamed, O. A. S. Ibrahim, and T. Abd El-Hafeez, “A survey on improving pattern matching algorithms for biological sequences,” Concurrency Comput., Pract. Exper., vol. 34, no. 26, p. e7292, Nov. 2022.
[6] S. I. Hakak, A. Kamsin, P. Shivakumara, G. A. Gilkar, W. Z. Khan, and M. Imran, “Exact string matching algorithms: Survey, issues, and future research directions,” IEEE Access, vol. 7, pp. 69614–69637, 2019.
[7] G. Barbera, L. Araujo, and S. Fernandes, “The value of web data scraping: An application to TripAdvisor,” Big Data Cognit. Comput., vol. 7, no. 3, p. 121, Jun. 2023.
[8] M. Khder, “Web scraping or web crawling: State of art, techniques, approaches and application,” Int. J. Adv. Soft Comput. Appl., vol. 13, no. 3, pp. 145–168, Dec. 2021.
[9] P. Gao, M. Saeki, J. Guo, and H. Han, “Stable web scraping: An approach based on neighbour zone and path similarity of page elements,” Int. J. Web Eng. Technol., vol. 13, no. 4, pp. 301–333, 2018.
[10] A. K. Kar, “Bio-inspired computing: A review of algorithms and scope of applications,” Exp. Syst. Appl., vol. 59, pp. 20–32, Oct. 2016.
[11] J. V. Neumann, Theory of Self-Reproducing Automata. Champaign, IL, USA: Univ. Illinois Press, 1966.
[12] P. Leger and É. Tanter, “A self-replication algorithm to flexibly match execution traces,” in Proc. 11th Workshop Found. Aspect-Oriented Lang., Potsdam, Germany, Mar. 2012, pp. 27–32.
[13] G. Kiczales, J. Irwin, J. Lamping, J. Loingtier, C. Lopes, C. Maeda, and A. Mendhekar, “Aspect-oriented programming,” in Special Issues in Object-Oriented Programming. Berlin, Germany: Springer, 1996.
[14] R. A. Kelsey and J. A. Rees, “A tractable scheme implementation,” LISP Symbolic Comput., vol. 7, no. 4, pp. 315–335, 1994.
[15] TypeScript. (Oct. 1, 2023). JavaScript With Syntax for Types. [Online]. Available: https://www.typescriptlang.org
[16] The OpenJS Foundation. (Oct. 1, 2023). NodeJS: A Javascript Runtime Built for the Server Side. [Online]. Available: https://nodejs.org
[17] Twitter. (Oct. 1, 2023). A Microblogging and Social Networking Service. [Online]. Available: http://twitter.com
[18] M. Satyanarayanan, “Pervasive computing: Vision and challenges,” IEEE Pers. Commun., vol. 8, no. 4, pp. 10–17, Aug. 2001.
[19] P. Leger and M. Lazo. (Oct. 1, 2023). Case Studies of Matcher Cells. [Online]. Available: http://pragmaticslaboratory.github.io/matcher-cells-study-cases
[20] J. Brooke, Usability Evaluation in Industry. Boca Raton, FL, USA: CRC Press, 1996.
[21] A. Apostolico and Z. Galil, Pattern Matching Algorithms. Oxford, U.K.: Oxford Univ. Press, 1997.
[22] P. Leger, É. Tanter, and H. Fukuda, “An expressive stateful aspect language,” Sci. Comput. Program., vol. 102, pp. 108–141, May 2015.
[23] S. Kumar and E. H. Spafford, “A pattern matching model for misuse intrusion detection,” in Proc. 17th Nat. Comput. Secur. Conf., Oct. 1994, pp. 11–21.
[24] R. M. Karp and M. O. Rabin, “Efficient randomized pattern-matching algorithms,” IBM J. Res. Develop., vol. 31, no. 2, pp. 249–260, Mar. 1987.
[25] R. S. Boyer and J. S. Moore, “A fast string searching algorithm,” Commun. ACM, vol. 20, no. 10, pp. 762–772, Oct. 1977.
[26] G. Navarro and M. Raffinot, “A bit-parallel approach to suffix automata: Fast extended string matching,” in Proc. Annu. Symp. Combinat. Pattern Matching. Cham, Switzerland: Springer, 1998, pp. 14–33.
[27] C. Allauzen, M. Crochemore, and M. Raffinot, “Factor oracle: A new structure for pattern matching,” in Proc. Conf. Current Trends Theory Pract. Informat. Cham, Switzerland: Springer, Nov. 1999, pp. 295–310.
[28] P. Weiner, “Linear pattern matching algorithms,” in Proc. 14th Annu. Symp. Switching Automata Theory, Oct. 1973, pp. 1–11.
[29] U. Manber and G. Myers, “Suffix arrays: A new method for on-line string searches,” SIAM J. Comput., vol. 22, no. 5, pp. 935–948, Oct. 1993.
[30] M. Crochemore and D. Perrin, “Two-way string-matching,” J. ACM, vol. 38, no. 3, pp. 650–674, Jul. 1991.
[31] A. V. Aho and M. J. Corasick, “Efficient string matching: An aid to bibliographic search,” Commun. ACM, vol. 18, no. 6, pp. 333–340, Jun. 1975.
[32] G. Myers, “A fast bit-vector algorithm for approximate string matching based on dynamic programming,” J. ACM, vol. 46, no. 3, pp. 395–415, May 1999.
[33] M. Rubinchik and A. M. Shur, “EERTREE: An efficient data structure for processing palindromes in strings,” Eur. J. Combinatorics, vol. 68, pp. 249–265, Feb. 2018.
[34] B. Meyer, “Incremental string matching,” Inf. Process. Lett., vol. 21, no. 5, pp. 219–227, Nov. 1985.
[35] Amazon. (Oct. 1, 2023). AWS WAF and AWS Shield Documentation. [Online]. Available: https://aws.amazon.com/documentation/waf
[36] G. Springer and D. P. Friedman, Scheme and the Art of Programming. Cambridge, MA, USA: MIT Press, 1989.
[37] H. R. Maturana and F. J. Varela, Autopoiesis and Cognition: The Realization of the Living, vol. 42. Berlin, Germany: Springer, 2012.
[38] M. Martin, B. Livshits, and M. S. Lam, “Finding application errors and security flaws using PQL: A program query language,” in Proc. 20th ACM SIGPLAN Conf. Object-Oriented Program. Syst., Lang. Appl., San Diego, CA, USA, Oct. 2005, pp. 365–383.
[39] T. Shoji, M. Takimoto, and Y. Kambayashi, “Capture of multi intruders by cooperative multiple robots using mobile agents,” in Proc. 12th Int. Conf. Agents Artif. Intell., Valletta, Malta, 2020, pp. 370–377.
[40] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software (Professional Computing Series). Reading, MA, USA: Addison-Wesley, Oct. 1994.
[41] W3 Techs. (Oct. 1, 2023). Usage of Client-side Programming Languages. [Online]. Available: https://w3techs.com/technologies/history_overview/client_side_language/all
[42] Stack Overflow. (Sep. 1, 2022). Developer Survey Results. [Online]. Available: https://insights.stackoverflow.com/survey/2021
[43] C. W. Brown and E. A. Hardisty, “RegeXeX: An interactive system providing regular expression exercises,” in Proc. 38th SIGCSE Tech. Symp. Comput. Sci. Educ. New York, NY, USA: Association for Computing Machinery, Mar. 2007, pp. 445–449.
[44] L. G. Michael, J. Donohue, J. C. Davis, D. Lee, and F. Servant, “Regexes are hard: Decision-making, difficulties, and risks in programming regular expressions,” in Proc. 34th IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), Nov. 2019, pp. 415–426.
[45] A. Bangor, P. T. Kortum, and J. T. Miller, “An empirical evaluation of the system usability scale,” Int. J. Hum.-Comput. Interact., vol. 24, no. 6, pp. 574–594, Jul. 2008.
[46] D. Derisma, “The usability analysis online learning site for supporting computer programming course using system usability scale (SUS) in a university,” Int. Assoc. Online Eng., Austria, Tech. Rep., Jun. 2020.
[47] P. Vlachogianni and N. Tselios, “Perceived usability evaluation of educational technology using the system usability scale (SUS): A systematic review,” J. Res. Technol. Educ., vol. 54, no. 3, pp. 392–409, May 2022.
[48] G. Albaum, “The Likert scale revisited,” Market Res. Soc. J., vol. 39, no. 2, pp. 1–21, Mar. 1997.
[49] M. Mota. (2022). A Concrete Implementation of KMP Available on the NPM Repository. Accessed: Oct. 1, 2023. [Online]. Available: https://www.npmjs.com/package/kmp
[50] D. Parnas, “On the criteria for decomposing systems into modules,” Commun. ACM, vol. 15, no. 12, pp. 1053–1058, Dec. 1972.
[51] P. Tarr, H. Ossher, W. Harrison, and S. M. Sutton, “N degrees of separation: Multi-dimensional separation of concerns,” in Proc. Int. Conf. Softw. Eng., Los Angeles, CA, USA, May 1999, pp. 107–119.
[52] H. P. Barendregt, The Lambda Calculus: Its Syntax and Semantics. North-Holland, 1984.
[53] H. B. Curry, “Some philosophical aspects of combinatory logic,” in Proc. The Kleene Symp., vol. 101, J. Barwise, H. J. Keisler, and K. Kunen, Eds. 1980, pp. 85–101.
[54] A. Van Deursen and P. Klint, “Domain-specific language design requires feature descriptions,” J. Comput. Inf. Technol., vol. 10, no. 1, pp. 1–17, 2002.
[55] K. Al-Khamaiseh and S. Al Shagarin, “A survey of string matching algorithms,” Int. J. Eng. Res. Appl., vol. 4, pp. 144–156, Aug. 2014.
[56] M. Farach-Colton, G. M. Landau, S. C. Sahinalp, and D. Tsur, “Optimal spaced seeds for faster approximate string matching,” J. Comput. Syst. Sci., vol. 73, no. 7, pp. 1035–1044, Nov. 2007.
[57] J. Kärkkäinen and J. C. Na, “Faster filters for approximate string matching,” in Proc. Meeting Algorithm Eng. Experiments, 2007, pp. 84–90.
[58] G. Kucherov, L. Noé, and M. Roytberg, “Multi-seed lossless filtration,” in Combinatorial Pattern Matching, S. C. Sahinalp, S. Muthukrishnan, and U. Dogrusoz, Eds. Berlin, Germany: Springer, 2004, pp. 297–310.
[59] F. P. Miller, A. F. Vandome, and J. McBrewster, Levenshtein Distance: Information Theory, Computer Science, String (Computer Science), String Metric, Damerau-Levenshtein Distance, Spell Checker, Hamming Distance. Alpha Press, 2009.
[60] G. Kucherov, K. Salikhov, and D. Tsur, “Approximate string matching using a bidirectional index,” Theor. Comput. Sci., vol. 638, pp. 145–158, Jul. 2016.
[61] G. Navarro and R. Baeza-Yates, “A hybrid indexing method for approximate string matching,” J. Discrete Algorithms, vol. 1, pp. 205–239, Jan. 2001.
[62] A. N. M. E. Rafiq, M. W. El-Kharashi, and F. Gebali, “A fast string search algorithm for deep packet classification,” Comput. Commun., vol. 27, no. 15, pp. 1524–1538, Sep. 2004.
[63] W. Yang, “Mealy machines are a better model of lexical analyzers,” Comput. Lang., vol. 22, no. 1, pp. 27–38, Apr. 1996.
[64] A. Akram Abdulrazzaq, N. Abdul Rashid, A. Hasan, and M. Abu-Hashem, “The exact string matching algorithms efficiency review,” Global J. Technol., vol. 4, pp. 576–589, Jan. 2013.
[65] S. Faro and T. Lecroq, “The exact online string matching problem: A review of the most recent results,” ACM Comput. Surv., vol. 45, no. 2, pp. 1–42, Mar. 2013.
[66] F. Franek, C. G. Jennings, and W. F. Smyth, “A simple fast hybrid pattern-matching algorithm,” J. Discrete Algorithms, vol. 5, no. 4, pp. 682–695, Dec. 2007.
[67] P. Heymann, G. Koutrika, and H. Garcia-Molina, “Fighting spam on social web sites: A survey of approaches and future challenges,” IEEE Internet Comput., vol. 11, no. 6, pp. 36–45, Nov. 2007.
[68] P. Hayati, V. Potdar, A. Talevski, and W. Smyth, “Rule-based on-the-fly web spambot detection using action strings,” in Proc. Annu. Collaboration, Electron. Messaging, Anti-Abuse Spam Conf., 2010, pp. 1–7.
[69] H. Alamro, C. S. Iliopoulos, and G. Loukides, “Efficiently detecting web spambots in a temporally annotated sequence,” in Advanced Information Networking and Applications. Cham, Switzerland: Springer, 2020, pp. 1007–1019.
[70] V. Ghanaei, C. S. Iliopoulos, and S. P. Pissis, “Detection of web spambot in the presence of decoy actions,” in Proc. IEEE 4th Int. Conf. Big Data Cloud Comput., Dec. 2014, pp. 277–279.
[71] P. Sarkar, “A brief history of cellular automata,” ACM Comput. Surv., vol. 32, no. 1, pp. 80–107, Mar. 2000.
[72] D. Peak, J. D. West, S. M. Messinger, and K. A. Mott, “Evidence for complex, collective dynamics and emergent, distributed computation in plants,” Proc. Nat. Acad. Sci. USA, vol. 101, no. 4, pp. 918–922, Jan. 2004.
[73] G. Berry and G. Boudol, “The chemical abstract machine,” Theor. Comput. Sci., vol. 96, no. 1, pp. 217–248, Apr. 1992.
[74] W. Reisig, Petri Nets: An Introduction, vol. 4. Berlin, Germany: Springer, 2012.
[75] R. Milner, A Calculus of Communicating Systems. Berlin, Germany: Springer, 1982.

PAUL LEGER (Member, IEEE) received the Ph.D. degree in computer science from the University of Chile. He is currently an Associate Professor with Universidad Católica del Norte, Chile. His research interests include issues related to programming languages, software engineering, and different programming approaches.

HIROAKI FUKUDA received the Ph.D. degree in computer science from Keio University. He is currently an Associate Professor with the Shibaura Institute of Technology, Japan. His research interests include software engineering and distributed programming.

NICOLÁS CARDOZO received the joint Doctoral Diploma degree from Université Catholique de Louvain and Vrije Universiteit Brussel, Belgium. He was a Postdoctoral Fellow with Trinity College Dublin and Vrije Universiteit Brussel. He is currently an Associate Professor with Universidad de los Andes, Colombia. His research interests include the design and implementation of programming languages for distributed adaptive software systems. He has worked on the implementation of dynamic distributed adaptations in the smart cities domain from different perspectives, such as automated personalized assistants and evolutionary models for dynamic adaptations.

DANIEL SAN MARTÍN received the B.S. degree in engineering science and computer engineering from Universidad Católica del Norte, Coquimbo, Chile, and the M.Sc. and Ph.D. degrees in computer science from Universidade Federal de São Carlos, Brazil. He is currently an Assistant Professor with the School of Engineering, Universidad Católica del Norte. Also, he has held pivotal positions as a Chief Information Security Officer (CISO), Project Manager (PM), and Information Analyst in both public and private sectors. His current research interests include software engineering, software architecture, programming languages, and models.
