Simplify tokenization logic in diffWords #494

Merged 3 commits on Feb 19, 2024

ExplodingCabbage (Collaborator) commented Feb 19, 2024

This eliminates all the logic about rejoining boundary splits. It was dumb; we were doing an initial split into tokens based on a Unicode-naive notion of "word characters", and then bandaging some words back together with a Unicode-aware regex, when we could instead just use the Unicode-aware regex to start with.
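To make the difference concrete, here is a minimal sketch of the two approaches. Both regexes are assumptions written for illustration (using `\p{L}` and `\p{N}` property escapes), not the patterns jsdiff actually uses, and `tokenize` is a hypothetical helper.

```javascript
// Illustrative sketch only: neither regex is copied from jsdiff.

// Unicode-naive: \w means [A-Za-z0-9_], so "á" counts as a non-word character.
const NAIVE_TOKEN = /\w+|\s+|[^\w\s]+/g;

// Unicode-aware (assumed pattern): \p{L} and \p{N} cover letters and digits in
// any script, so accented words stay whole with no rejoining pass afterwards.
const UNICODE_TOKEN = /[\p{L}\p{N}_]+|\s+|[^\p{L}\p{N}_\s]/gu;

// Hypothetical helper: return every match of `pattern` in `text`, in order.
function tokenize(text, pattern) {
  return text.match(pattern) ?? [];
}

const word = 'anim\u00e1-los'; // "animá-los", with a precomposed á

console.log(tokenize(word, NAIVE_TOKEN));   // [ 'anim', 'á-', 'los' ]
console.log(tokenize(word, UNICODE_TOKEN)); // [ 'animá', '-', 'los' ]
```

With the naive pattern, the accented letter ends up glued to the following hyphen, which is the kind of split the old code then had to repair. Starting from a Unicode-aware pattern avoids needing the repair step at all.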

I actually wrote this intending it to just be a refactor that would make it easier to make further fixes to diffWords, but I think I inadvertently fixed two bugs along the way! Namely, those bugs are:

  • Bug on diff words with accent #311
  • Windows-style newlines were split when tokenizing: the carriage return and line feed characters would each get their own token instead of being grouped into a single newline token
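The newline behaviour in the second bullet can be sketched in isolation. The two patterns below are assumptions chosen to show the effect, not jsdiff's regexes; the only point is that trying `\r\n` as a unit before the single-character alternatives keeps a Windows-style newline in one token.

```javascript
// Illustrative only: these patterns are assumptions, not jsdiff's regexes.

// Treats \r and \n as independent single-character tokens.
const PER_CHAR_NEWLINE = /[\r\n]|[^\r\n]+/g;

// Tries the two-character sequence \r\n first, then falls back to a lone \r or \n.
const GROUPED_NEWLINE = /\r\n|[\r\n]|[^\r\n]+/g;

const sample = 'los\r\n\r\n(wibbly';

const perChar = sample.match(PER_CHAR_NEWLINE);
const grouped = sample.match(GROUPED_NEWLINE);

console.log(perChar); // [ 'los', '\r', '\n', '\r', '\n', '(wibbly' ]
console.log(grouped); // [ 'los', '\r\n', '\r\n', '(wibbly' ]
```

The first pattern yields two more tokens than the second for this sample, matching the 35 versus 33 token counts in the assertion failure for the newline part of the change.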

The test failure below is what you get if you run the new test on master (with a removeEmpty call added, like there used to be), and demonstrates both behaviour changes:

      AssertionError: expected [ Array(35) ] to deeply equal [ Array(33) ]
      + expected - actual

         "\n"
         "\n"
         "\n"
         "  "
      -  "anim"
      -  "á-"
      +  "animá"
      +  "-"
         "los"
      -  "\r"
      -  "\n"
      -  "\r"
      -  "\n"
      +  "\r\n"
      +  "\r\n"
         "("
         "wibbly"
         " "
         "wobbly"

Resolves #311.

Previously the behaviour was kinda correct by coincidence; a \n without a \r was treated as a punctuation mark by the regex