chore: add pgroonga MeCab tokenizer support #462
- name: groonga - download & install dependencies
  apt:
    pkg:
      - libmecab2
I think that libmecab-dev is needed to build PGroonga's MeCab support. libmecab2 doesn't provide header files.
Good catch, thanks!
I picked those libraries from the dependency list of Ubuntu 20's groonga-tokenizer-mecab so that PGroonga's build chain detects them and turns on MeCab support during compilation.
This did enable MeCab support in the resulting build. I agree, though, that including libmecab-dev is desirable for the stability and consistency of future builds.
I just added it. This also works because libmecab2 is its only dependency.
Ah, we already install libmecab-dev in ansible/tasks/docker/setup.yml.
So we don't need libmecab-dev here. (We don't need libmecab2 either, because it's installed automatically via libmecab-dev.)
We may want to move the remaining mecab-naist-jdic to ansible/tasks/docker/setup.yml.
Ah, I see! I wasn't aware that the MeCab-related packages were also defined in ansible/tasks/docker/setup.yml. Thanks again!
I shall move them to ansible/tasks/postgres-extensions/24-pgroonga.yml, as ideally we'd want all of an extension's dependencies contained within the extension's Ansible task: we build both Docker images for local development and self-hosting and Amazon AMIs for the Supabase platform.
I'm also removing libmecab2 from the task file.
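A minimal sketch of how the extension's task file might declare the MeCab packages once they are moved there, assuming the same apt layout as the snippet under review. The package list is inferred from this discussion and the update_cache option is an assumption; this is not the merged change.

```yaml
# Sketch only: ansible/tasks/postgres-extensions/24-pgroonga.yml
# Package list inferred from the review discussion, not copied from the PR.
- name: groonga - download & install dependencies
  apt:
    pkg:
      - libmecab-dev      # headers needed at build time; pulls in libmecab2
      - mecab-naist-jdic  # dictionary used by MeCab
    update_cache: yes     # assumption: refresh the package index first
```

Keeping the list in the extension's own task means the Docker and AMI builds resolve the same dependencies.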
- name: groonga - download & install dependencies
  apt:
    pkg:
      - groonga-normalizer-mysql
Oh, wait. Can we use groonga-normalizer-package in this environment? (postgres:15.1 that is based on Debian GNU/Linux bullseye?)
How about using deb packages for Groonga/PGroonga provided by https://packages.groonga.org/ instead of building Groonga/PGroonga manually? PGroonga's Docker images use them: https://github.com/pgroonga/docker/blob/master/debian/15/Dockerfile
See also: https://pgroonga.github.io/install/debian.html
Unfortunately, the images we build and distribute bundle extensions that are compiled during the build process. This keeps the software bill of materials in check and lets us lock down dependency versions, which avoids vulnerabilities introduced upstream and broken dependency chains. It also helps maintain compliance, since project databases might contain sensitive information.
The MeCab tokenizer is already present in Groonga's source code and easy to enable during compilation, but groonga-normalizer-mysql isn't 🤔, so we might have to build it from source as well.
I see.
I think that we can remove groonga-normalizer-mysql. It's for using MySQL-compatible normalization (collation) in Groonga. In general, PGroonga users don't use it.
I see. Thank you for the information and all the feedback!
I shall be merging this to make Docker images available - we're aiming to roll this out for new projects on the Supabase platform sometime next week.
@kou support for this rolled out today for new projects, and for projects that are paused and then unpaused to update to the latest version.
Forgive me for commenting here, as I think it is relevant. In the local development environment of Supabase, specifying TokenMecab as the tokenizer produces the following error:
pgroonga: [option][tokenizer][validate] invalid tokenizer: <TokenMecab>: [info][set][default-tokenizer][(anonymous)] unknown tokenizer: <TokenMecab>
Production did not generate this error. Extensions are enabled.
postgres=> select * from pg_available_extensions where name like 'pgroonga';
   name   | default_version | installed_version | comment
----------+-----------------+-------------------+---------
 pgroonga | 2.4.0           | 2.4.0             | Super fast and all languages supported full text search index based on Groonga
(1 row)
I don't know much about these things, but which library is this an issue with? I would like to raise the issue in the correct place if commenting here is wrong.
What is it? Could you explain how to set it up?
Local environments can be easily set up using the Supabase CLI.
@sweatybridge I think there might be a divergence between our platform AMIs and Docker images -
@pcnc When running the following:
CREATE TABLE memos (
  id integer,
  content text
);
CREATE INDEX pgroonga_content_index ON memos
  USING pgroonga (content)
  WITH (tokenizer='TokenMecab');
we get the following error.
Would you have any idea what could be the cause?
@dshukertjr Might be related to the PG15.6 bump - will work with @samrose to figure this out
* chore: add pgroonga MeCab tokenizer support; build job optimization
* chore: bump version
* chore: add libmecab-dev package
* chore: bump postgres version
* chore: move groonga packages to PGroonga task file
* chore: remove groonga-normalizer-mysql
* chore: bump postgres version
What kind of change does this PR introduce?
--with-mecab defaults to true when building Groonga, though it reverts to false if the library checks fail.
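Because a failed library check turns MeCab support off without stopping the build, a guard before compilation could make the failure visible. The tasks below are a hedged sketch and not part of this PR: the task names and the registered variable are hypothetical, and the header path /usr/include/mecab.h is an assumption about where libmecab-dev places it on Debian-based images.

```yaml
# Sketch only: fail early instead of silently building without MeCab.
- name: groonga - check for MeCab header (hypothetical task)
  ansible.builtin.stat:
    path: /usr/include/mecab.h   # assumed install location of the header
  register: mecab_header

- name: groonga - stop if MeCab header is missing (hypothetical task)
  ansible.builtin.fail:
    msg: "mecab.h not found; Groonga would be built without MeCab support"
  when: not mecab_header.stat.exists
```

Placed before the Groonga build task, this turns a missing development package into an explicit error rather than a build that lacks TokenMecab.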