Simplify tokenization logic in diffWords #494

ExplodingCabbage · 2024-02-19T19:01:32Z

This eliminates all the logic about rejoining boundary splits. It was dumb; we were doing an initial split into tokens based on a Unicode-naive notion of "word characters", and then bandaging some words back together with a Unicode-aware regex, when we could instead just use the Unicode-aware regex to start with.
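
For illustration, tokenizing directly with a Unicode-aware regex might look roughly like the sketch below. This is not the code from this PR: `TOKEN_RE` and `tokenize` are hypothetical names, and the exact character classes (including treating digits and underscores as word characters) are assumptions made for the example.

```javascript
// Hypothetical sketch, not jsdiff's actual implementation.
// Token kinds, tried in this order:
//   1. a run of Unicode letters, combining marks, numbers or underscores (a "word")
//   2. a Windows newline (\r\n) as ONE token
//   3. a Unix newline (\n)
//   4. a run of whitespace other than \r and \n
//   5. any single remaining non-whitespace character (punctuation, symbols)
//   6. any single remaining whitespace character (e.g. a lone \r)
const TOKEN_RE = /[\p{L}\p{M}\p{N}_]+|\r\n|\n|[^\S\r\n]+|[^\p{L}\p{M}\p{N}_\s]|\s/gu;

function tokenize(text) {
  // String.prototype.match with a global regex returns every match, or null if none.
  return text.match(TOKEN_RE) || [];
}

// \u00e1 is "á" written as an escape so the example does not depend on file encoding.
console.log(tokenize('  anim\u00e1-los\r\n\r\n(wibbly wobbly'));
// [ '  ', 'animá', '-', 'los', '\r\n', '\r\n', '(', 'wibbly', ' ', 'wobbly' ]
```

Because the word pattern is Unicode-aware from the start, there is no second pass needed to glue accented words back together.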

I actually wrote this intending it to just be a refactor that would make it easier to make further fixes to diffWords, but I think I inadvertently fixed two bugs along the way! Namely, those bugs are:

Bug on diff words with accent #311
that when tokenizing text containing Windows-style newlines, the carriage return and line feed characters would each get their own token instead of being grouped into a single newline token

The test failure below is what you get if you run the new test on master (with a removeEmpty call added, like there used to be), and demonstrates both behaviour changes:

      AssertionError: expected [ Array(35) ] to deeply equal [ Array(33) ]
      + expected - actual

         "\n"
         "\n"
         "\n"
         "  "
      -  "anim"
      -  "á-"
      +  "animá"
      +  "-"
         "los"
      -  "\r"
      -  "\n"
      -  "\r"
      -  "\n"
      +  "\r\n"
      +  "\r\n"
         "("
         "wibbly"
         " "
         "wobbly"
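
The "anim" / "á-" split in the failure above is the kind of result a Unicode-naive word boundary produces. The snippet below is a minimal demonstration of that effect in plain JavaScript, assuming a bare `\b` split; it is not the old diffWords code, which did more than this.

```javascript
// \b is defined in terms of \w, which in JavaScript is ASCII-only: [A-Za-z0-9_].
// "á" (\u00e1) is therefore a non-word character, so a boundary appears
// between "m" and "á", but not between "á" and "-".
const naive = 'anim\u00e1-los'.split(/\b/);
console.log(naive); // [ 'anim', 'á-', 'los' ]

// With the u flag, \p{L} matches any Unicode letter, so "animá" stays whole.
const aware = 'anim\u00e1-los'.match(/\p{L}+|[^\p{L}]/gu);
console.log(aware); // [ 'animá', '-', 'los' ]
```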

Resolves #311.

ExplodingCabbage added 3 commits February 19, 2024 18:38

Refactor tokenization logic in diffWords

3ddc8ce

Add test

9495564

Tweak handling of Unix newlines for clarity

37ba04b

Previously the behaviour was kinda correct by coincidence; a \n without a \r was treated as a punctuation mark by the regex

ExplodingCabbage marked this pull request as ready for review February 19, 2024 19:34

ExplodingCabbage merged commit 3da78c2 into master Feb 19, 2024

ExplodingCabbage deleted the refactor-diffwords branch February 19, 2024 19:35

This was referenced Sep 5, 2024

Combinations of numbers and letters no longer considered a single token in v6 #553

Closed

Fix diffWords treating numbers and underscores as not being word characters #554

Merged
