chore: add pgroonga MeCab tokenizer support #462

Merged
merged 9 commits into from
Jan 11, 2023

Member

pcnc commented Dec 23, 2022

What kind of change does this PR introduce?

  • Installs the libraries Groonga needs to enable MeCab tokenizer support; MeCab is used instead of the Bigram tokenizer when matching Japanese words
  • Parallelizes Groonga's build step, cutting its duration from ~9m to ~2.5m
  • Adds parallelization support to the Docker build
    • This requires a newer version of Ansible's make task type, provided by installing the general Ansible collection
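
As a rough illustration of the parallelized build step described above, here is a minimal shell sketch. It is not the Ansible task from this PR: GROONGA_SRC_DIR is a hypothetical path, and the job count is simply taken from nproc, which is an assumption about how the number of jobs is chosen.

```shell
# Sketch only: GROONGA_SRC_DIR is a hypothetical path, not taken from the PR.
GROONGA_SRC_DIR="${GROONGA_SRC_DIR:-/tmp/groonga}"

# Use one make job per available core; fall back to 1 if nproc is missing.
JOBS="$(nproc 2>/dev/null || echo 1)"

# Only run make when the source tree has actually been configured.
if [ -f "${GROONGA_SRC_DIR}/Makefile" ]; then
  make -C "${GROONGA_SRC_DIR}" -j"${JOBS}"
else
  echo "no Makefile in ${GROONGA_SRC_DIR}; would run: make -j${JOBS}"
fi
```

Running make with one job per core is what typically accounts for this kind of speedup on multi-core build machines; the exact gain depends on the host.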

pcnc requested a review from a team as a code owner December 23, 2022 17:50
- name: groonga - download & install dependencies
  apt:
    pkg:
      - libmecab2

I think that libmecab-dev is needed to build PGroonga's MeCab support.
libmecab2 doesn't provide header files.

Member Author

Good catch - thanks!

I picked those libraries using Ubuntu 20's groonga-tokenizer-mecab's dependency list so that PGroonga's buildchain detects those and MeCab support is turned on during compilation.

This did enable MeCab support in the resulting build. I agree, though, that including libmecab-dev to increase the stability and consistency of future builds is desirable.

I've just added libmecab-dev. This also works because libmecab2 is its only dependency.

Ah, we already installed libmecab-dev in ansible/tasks/docker/setup.yml.
So we don't need libmecab-dev here. (We don't need libmecab2 either, because it's installed automatically via libmecab-dev.)

We may want to move the remaining mecab-naist-jdic to ansible/tasks/docker/setup.yml.
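
To see which of the MeCab-related packages mentioned in this thread are actually present in a Debian-based image, a small check along these lines may help. This is a sketch that assumes dpkg-query is available; the is_installed helper is a hypothetical name introduced for illustration.

```shell
# is_installed PKG: succeed if dpkg reports PKG as fully installed.
# Returns 2 when dpkg-query itself is unavailable (non-Debian systems).
is_installed() {
  command -v dpkg-query >/dev/null 2>&1 || return 2
  dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q 'install ok installed'
}

# Package names are the ones discussed in this review thread.
for pkg in libmecab-dev libmecab2 mecab-naist-jdic; do
  if is_installed "$pkg"; then
    echo "$pkg: installed"
  else
    echo "$pkg: not installed (or dpkg-query unavailable)"
  fi
done
```

Running `apt-cache depends libmecab-dev` in the same image should list libmecab2 among its dependencies, which is why installing libmecab-dev alone is enough.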

Member Author

Ah, I see! Was not aware of the MeCab-related packages also being defined in ansible/tasks/docker/setup.yml - thanks again!

I'll move them to ansible/tasks/postgres-extensions/24-pgroonga.yml instead. Ideally, all of an extension's dependencies should live in that extension's Ansible task, because we build Docker images for local development and self-hosting as well as Amazon AMIs for the Supabase platform.
I'm also removing libmecab2 from the task file.

pcnc requested a review from kou January 5, 2023 17:35
- name: groonga - download & install dependencies
  apt:
    pkg:
      - groonga-normalizer-mysql

Oh, wait. Can we use the groonga-normalizer-mysql package in this environment (postgres:15.1, which is based on Debian GNU/Linux bullseye)?

How about using deb packages for Groonga/PGroonga provided by https://packages.groonga.org/ instead of building Groonga/PGroonga manually? PGroonga's Docker images use them: https://github.com/pgroonga/docker/blob/master/debian/15/Dockerfile

See also: https://pgroonga.github.io/install/debian.html

Member Author

Unfortunately, the images we build and distribute bundle extensions that are compiled during the build process. This keeps the software bill of materials in check and lets us lock down dependency versions, which avoids vulnerabilities introduced upstream and broken dependency chains. It also helps maintain compliance, since project databases might contain sensitive information.

The MeCab tokenizer is already present in Groonga's source code and easy to enable during compilation, but groonga-normalizer-mysql isn't 🤔, so we might have to build it from source as well.

I see.

I think that we can remove groonga-normalizer-mysql. It's for using MySQL compatible normalization (collation) in Groonga. In general, PGroonga users don't use it.

Member Author

I see. Thank you for the information and all the feedback!

I shall be merging this to make Docker images available - we're aiming to roll this out for new projects on the Supabase platform sometime next week.

Member Author

@kou support for this rolled out today for new projects, and projects which pause and then unpause themselves to update to the latest version.

pcnc merged commit 3655512 into develop Jan 11, 2023
pcnc deleted the pcnc/add-pgroonga-mecab-support branch January 11, 2023 11:33
3ru commented Jun 2, 2023

Forgive me for commenting here; I think it is relevant.

In the local development environment of Supabase, specifying TokenMecab returns the following error.

pgroonga: [option][tokenizer][validate] invalid tokenizer: <TokenMecab>: [info][set][default-tokenizer][(anonymous)] unknown tokenizer: <TokenMecab>

Production did not generate this error.

Extensions are enabled.

postgres=> select * from pg_available_extensions where name like 'pgroonga';
   name   | default_version | installed_version |                                    comment
----------+-----------------+-------------------+--------------------------------------------------------------------------------
 pgroonga | 2.4.0           | 2.4.0             | Super fast and all languages supported full text search index based on Groonga
(1 row)

I think mecab isn't in here.

# groonga --version |grep mecab
# 
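
The same check can be wrapped in a small helper so it is easy to repeat in both the local container and other images. This is only a sketch: has_mecab is a hypothetical helper name, and it relies on the same assumption as the grep above, namely that a build with MeCab support mentions mecab somewhere in the `groonga --version` output.

```shell
# has_mecab TEXT: succeed if TEXT (e.g. captured `groonga --version` output)
# mentions mecab anywhere, case-insensitively.
has_mecab() {
  printf '%s\n' "$1" | grep -qi 'mecab'
}

# Capture the real output when the groonga binary is present; otherwise
# leave it empty so the check reports that support was not detected.
if command -v groonga >/dev/null 2>&1; then
  version_output="$(groonga --version 2>&1)"
else
  version_output=""
fi

if has_mecab "$version_output"; then
  echo "MeCab support appears to be compiled in"
else
  echo "MeCab support not detected"
fi
```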

I don't know much about these things. Which library does this issue belong to? If commenting here is wrong, I would like to raise the issue in the correct place.

kou commented Jun 2, 2023

"the local development environment of Supabase"

What is it? Could you explain how to set it up?

3ru commented Jun 2, 2023

@kou

Local environments can be easily set up using the Supabase CLI.

Local Development Document

Member Author

pcnc commented Jun 5, 2023

@sweatybridge I think there might be a divergence between our platform AMIs and Docker images - groonga-tokenizer-mecab would need to be installed in addition to pgroonga

Member

dshukertjr commented

@pcnc
I got a report saying that the MeCab tokenizer stopped working again, and I was able to reproduce it on a Supabase instance with the latest Postgres version, v15.6.1.100.

When running the following:

CREATE TABLE memos (
  id integer,
  content text
);

CREATE INDEX pgroonga_content_index ON memos
  USING pgroonga (content)
  WITH (tokenizer='TokenMecab');

we get the following error.

ERROR:  22023: pgroonga: [option][tokenizer][validate] invalid tokenizer: <TokenMecab>: [info][set][default-tokenizer][(anonymous)] unknown tokenizer: <TokenMecab>

Would you have any idea what could be the cause?
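
For anyone re-running this reproduction against several images, a shell sketch like the following can classify the result. It assumes psql is installed and that DATABASE_URL (a hypothetical variable name) points at a throwaway database, since it creates the memos table exactly as in the report above; is_unknown_tokenizer is likewise a hypothetical helper.

```shell
# is_unknown_tokenizer MSG: succeed if MSG contains PGroonga's
# "unknown tokenizer" error for TokenMecab, as quoted in this thread.
is_unknown_tokenizer() {
  printf '%s\n' "$1" | grep -q 'unknown tokenizer: <TokenMecab>'
}

# Only attempt the reproduction when a target database was provided.
if [ -n "${DATABASE_URL:-}" ] && command -v psql >/dev/null 2>&1; then
  # Feed the SQL from the report to psql and capture stdout and stderr.
  output="$(psql "$DATABASE_URL" -v ON_ERROR_STOP=1 2>&1 <<'SQL'
CREATE TABLE memos (id integer, content text);
CREATE INDEX pgroonga_content_index ON memos
  USING pgroonga (content)
  WITH (tokenizer='TokenMecab');
SQL
)"
  if is_unknown_tokenizer "$output"; then
    echo "TokenMecab is not available in this database"
  else
    echo "no unknown-tokenizer error seen"
  fi
else
  echo "DATABASE_URL not set or psql missing; skipping reproduction"
fi
```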

Member Author

pcnc commented Aug 6, 2024

@dshukertjr Might be related to the PG15.6 bump - will work with @samrose to figure this out

samrose mentioned this pull request Aug 9, 2024
damonrand pushed a commit to cepro/postgres that referenced this pull request Jun 15, 2025
* chore: add pgroonga MeCab tokenizer support; build job optimization

* chore: bump version

* chore: add libmecab-dev package

* chore: bump postgres version

* chore: move groonga packages to PGroonga task file

* chore: remove groonga-normalizer-mysql

* chore: bump postgres version