Language assets for Norwegian
Closed, Resolved, Public

Description

Event Timeline

@Ladsgroup, can you run Bad-Words-Detection-System for nowiki?

@Galar71, once the run finishes, a page will be generated at https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/no. From there, we'll ask you to review and sort the output lists. Once you are done, we'll take over, get the resulting lists integrated into revscoring, and start building prediction models.

It's running. Tell me if it's not there after 24 hours.

Looks like we're ready to go: https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/no

@Galar71, could you review the "generated list" and split it into "badwords" and "informal words"? You can ignore the "common words" for now.

Hi...

I added some "bad" and "informal" words, but not necessarily all from the generated list... I can take a closer look at this if it's wrong to add words that are not in the generated list... I just did a quick review (a bit short on time right now), but I can redo it from the generated list if that's how it's supposed to be done.

A couple of questions in that regard:

  • Should ALL the words in the generated list be split into the Bad and Informal categories, or is a subset enough?
  • Is it ok to add words to the "bad" and "informal" categories that do not exist in the generated list?

Best Regards,
Galar71

Hey, thank you for your help!

  • Should ALL the words in the generated list be split into the Bad and Informal categories, or is a subset enough?

As much as you think is enough. It would be great to have all possible bad words, though. If a word is a total false positive, let's say "hamster", you can ignore it.

  • Is it ok to add words to the "bad" and "informal" categories that do not exist in the generated list?

Totally. The generated list is just a helper.
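A minimal sketch of the sorting rules described in these answers, assuming each list is simply a collection of words. The function name, the dictionary keys, and the sample words (`w1`, `w2`, `w3`) are hypothetical; nothing here is part of Bad-Words-Detection-System or revscoring.

```python
def review_status(generated, bad, informal):
    """Summarize how the hand-sorted lists relate to the generated list."""
    generated, bad, informal = set(generated), set(bad), set(informal)
    sorted_words = bad | informal
    return {
        # Generated words left out on purpose (false positives) or not yet sorted.
        "left_out": generated - sorted_words,
        # Words added by the reviewer that the generator never proposed (allowed).
        "added_by_hand": sorted_words - generated,
        # Words placed in both lists, which probably need a second look.
        "in_both_lists": bad & informal,
    }


# Placeholder data: "hamster" is the false positive mentioned in the thread.
status = review_status(
    generated=["hamster", "w1", "w2"],
    bad=["w1", "w3"],
    informal=["w2"],
)
```

Here `status["left_out"]` contains only the ignored false positive, and `status["added_by_hand"]` contains the word the reviewer contributed independently of the generated list.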

Hi, @Halfak.

I'll go through it, hopefully in the next few days, and do a more thorough review of the generated list, and then I'll copy the relevant words to the "Bad" and "Informal" lists. I'll make sure to leave "hamster" out. ;)

Best Regards,
Galar71

@Galar71, hey, is it finished? Please keep me posted :) Thanks

As someone who can read Norwegian, it looks finished.

(I'd possibly move "cool" from "bad words" to "informal".)

I went through and copied some commonly used insults into the bad words list. I noticed that the generated list contains quite a lot of variations. Will you be stemming the words?

We don't do stemming, mostly for performance reasons, I think.
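To illustrate what skipping stemming implies for the lists, here is a minimal sketch, assuming the words are matched as whole tokens with a single compiled regular expression. This is not revscoring's actual implementation; the helper names and the placeholder entries are hypothetical and are not taken from the real nowiki lists.

```python
import re

# Placeholder entries for illustration only.
WORDS = ["dust", "dusten", "duster"]


def compile_word_list(words):
    """Build one case-insensitive regex that matches any listed word as a whole token."""
    # Longest alternatives first, so the longest listed form is tried first.
    alternatives = sorted((re.escape(w) for w in words), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_matches(pattern, text):
    """Return every listed word found in the text, lowercased."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


pattern = compile_word_list(WORDS)
matches = find_matches(pattern, "Du er en DUST, og dustene vet det.")
```

With these placeholder entries, `matches` holds only the listed form `dust`. The inflected form `dustene` is not on the list, so it is not matched. Without stemming, each variation that should be caught has to appear in the list explicitly.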