Exploring A Self-Replication Algorithm To Flexibly Match Patterns

This document summarizes a research paper that explores a self-replication algorithm called Matcher Cells for flexibly matching patterns. Matcher Cells is inspired by how biological cells self-replicate. The paper describes Matcher Cells using a functional programming language to provide a generic implementation, and an object-oriented architecture for languages like Java. It also presents two case studies that use Matcher Cells for applications in social media analysis and adaptive education. An evaluation with students found Matcher Cells to have good usability. The paper discusses tradeoffs of Matcher Cells and other pattern matching algorithms.

Received 1 December 2023, accepted 5 January 2024, date of publication 17 January 2024, date of current version 29 January 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3355319

Exploring a Self-Replication Algorithm to Flexibly Match Patterns

PAUL LEGER 1 (Member, IEEE), HIROAKI FUKUDA 2, NICOLÁS CARDOZO 3, AND DANIEL SAN MARTÍN 1

1 Escuela de Ingeniería, Universidad Católica del Norte, Coquimbo 1781421, Chile
2 Shibaura Institute of Technology, Tokyo 135-8548, Japan
3 Systems and Computing Engineering Department, Universidad de los Andes, Bogotá 111711, Colombia
Corresponding author: Paul Leger ([email protected])

ABSTRACT Pattern matching algorithms have been studied on numerous occasions, mainly focusing on
performance because of the large amount of data used in a matching process. However, a strong focus on
performance can entail particular issues like the lack of flexibility to match patterns. As a consequence,
developers need to tweak matching algorithms in contorted ways or create new specialized
ones altogether if their specific needs are not supported. Inspired by the self-replication behavior of cells
in biology, we explore and evaluate the design and implementation of an algorithm to flexibly match
patterns, named Matcher Cells. Through the composition of simple rules applied to cells, developers can
adjust the matching semantics of this algorithm to different needs. We describe this algorithm using a
pure functional language as a recipe for any Turing-complete programming language and then offer an
object-oriented architecture for languages like Java. To show the flexibility of our proposal, we use a
concrete implementation in TypeScript to describe two applications, from different domains, that use pattern
matching in a stream of tokens. Additionally, we carry out performance and developer experience empirical
evaluations with undergraduate students using Matcher Cells. Finally, we discuss the pros and cons of using a
biological-based algorithm, exploiting the compositions of rules, to match patterns.

INDEX TERMS Pattern matching, self-replication algorithms, string matching, context-aware systems.

The associate editor coordinating the review of this manuscript and approving it for publication was Fabrizio Messina.

© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

VOLUME 12, 2024, page 13553

I. INTRODUCTION

Pattern matching algorithms [1] check the occurrences of a pattern in a sequence of tokens. Such patterns are usually expressed using abstractions (e.g., automata [2]), or languages (e.g., regular expressions [3]). Although these algorithms have undergone extensive historical study, they continue to be a focal point of attention in contemporary times. This interest is attributed to their wide-ranging applications across several domains, including but not limited to spam filters, digital libraries, natural language processing, word processors, web search engines, parsers, computational molecular biology, and screen scrapers [4], [5]. A common characteristic among these applications is the abundant availability of large datasets that require filtration, extraction, and processing to uncover valuable data for researchers and practitioners. Thus, pattern-matching techniques should demonstrate their efficiency by identifying one or more patterns within datasets in a relatively short timeframe [5]. They should also possess the necessary flexibility and user-friendliness to accommodate pattern matching without requiring developers to possess an in-depth understanding of pattern matching algorithms or the need to fine-tune existing algorithms to meet their specific requirements [6].

One specific context that exemplifies the need for flexible and extensible pattern matching algorithms is web scraping, which involves the practice of retrieving content from websites to store in repositories like databases or CSV files [7], [8]. Within the sphere of web scraping, a diverse array of pattern matching techniques are employed, including regular expressions (Regex), HTML Document Object Model (DOM), and XPATH. Nonetheless, these methodologies have been perceived as inflexible, complex to implement, and dependent on the structure of the data source [9]. For instance, websites often employ similar yet inconsistent templates for creating pages of the same category. Therefore, over time, the inner structure of a webpage can change without prior notification due to periodic updates in the layout, which may imply rewriting the pattern matching algorithm to get the desired data. Consequently, these changes could impact the time, effort, and cost associated with web scraping and data extraction tasks [9]. A web scraping tool that uses an approximate or flexible pattern matching algorithm can help address the issue of website evolution.

In this paper, we present an algorithm founded on the principles of Biologically Inspired Computing (BIC) [10], which provides researchers with the basis to create flexible algorithms for pattern matching. This algorithm centers around a self-replication algorithm called Matcher Cells, which takes inspiration from the self-replicating behavior of biological cells to articulate a broad spectrum of matching semantics [11]. This work extends an initial proposal of Matcher Cells [12], where it was mainly employed for matching program execution traces in the context of aspect-oriented programming [13]. It is worth noting that, to the best of our knowledge, the application of BIC concepts in the realm of pattern matching remains relatively unexplored.

This paper extends our work in the following aspects:

1) A mature description of Matcher Cells using the Scheme programming language [14] to provide a generic implementation of our proposal that works on Turing-complete programming languages. We selected Scheme as a functional language that provides few and simple constructs to formally describe, as much as possible, a generic implementation.
2) An architecture to realize Matcher Cells in object-oriented languages like Java. We exemplify this architecture with a concrete implementation in TypeScript [15] for NodeJS (v16) [16], available at: https://github.com/pragmaticslaboratory/match-cell-base (rev. e4c556d).
3) Two case studies validating Matcher Cells. First, a pattern matching tool to analyze streaming services such as social network sites (e.g., Twitter [17]). Second, a context-aware system [18] that adapts the difficulty of addition exercises tasked to students, according to their performance (i.e., context). Both case studies are available at: https://pragmaticslaboratory.github.io/matcher-cells-study-cases [19].
4) An experience evaluation of our proposal incorporating 23 undergraduate students from Universidad Católica del Norte (Chile). The usability of Matcher Cells is evaluated using the System Usability Scale (SUS) [20] approach.
5) A preliminary performance evaluation with a comparison against two other pattern matching algorithms: brute force and KMP.
6) A discussion about the trade-off between programming abstractions and expressiveness of Matcher Cells.
7) A deep reference frame that contains proposals related to Matcher Cells, that is, flexible pattern matching algorithms.

Roadmap: Fig. 1 shows the methodology followed in this article. Section II presents two flexibility issues in pattern matching algorithms with their consequences, focused mainly on performance. After the motivation, Section III presents Matcher Cells, our self-replication algorithm to flexibly match patterns through a conceptual design. The design is followed by a concrete implementation of Matcher Cells in TypeScript, named MCJs. Section IV validates our implementation with two applications: a Twitter analyzer and a context identifier for context-aware systems. Additionally, we present our user experience evaluation from three perspectives: (1) developer experience, (2) performance, and (3) abstractions and expressiveness. Finally, the paper discusses different algorithms for pattern matching from the perspective of our proposal in Section VI, leading to Section VII with the conclusion and avenues of future work.

FIGURE 1. Methodology used in this study.

II. FLEXIBILITY IN PATTERN MATCHING

With a large number of pattern matching algorithms available in the body of literature [21], pattern matching is currently