Language is a city to the building of which every human being brought a stone.
Ralph Waldo Emerson
Stone I. The wikokit parser. This program parses Wiktionaries, then builds and populates machine-readable Wiktionaries.
Stone II. PHP API (piwidict project) for working with the machine-readable Wiktionary.
The goal of this project is to extract semi-structured information from Wiktionary and construct a machine-readable dictionary (database + API + GUI).
Download new Wiktionary parsed databases from Academic Torrents:
- Russian Wiktionary parsed ruwikt20230901;
- English Wiktionary parsed enwikt20231001.
Archives of Wiktionary parsed databases are available at whinger.krc.karelia.ru/soft/wikokit.
How to import a dump of the parsed Wiktionary into MySQL (in Russian).
I) The maximum goal (in the distant future) is to extract all information (i.e. all sections of each entry) from all Wiktionaries and convert the data to a machine-readable format.
II) Today's result. The machine-readable Wiktionary currently contains the following information extracted from the Russian Wiktionary and the English Wiktionary:
- word's language and part of speech;
- meanings / definitions;
- semantic relations;
- translations;
- (^) context labels (from definitions);
- (^) quotations (text + bibliographic data).
(^) Context labels and quotations were extracted only from Russian Wiktionary.
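To make the list above more concrete, here is a minimal sketch of how one machine-readable entry could be modelled in memory. The class and field names (`Entry`, `Meaning`, `labels`, and so on) are assumptions made for this illustration, not the wikokit or piwidict API, and the sample data is written by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Meaning:
    """One definition together with the data attached to it."""
    definition: str
    labels: list[str] = field(default_factory=list)        # context labels
    quotations: list[str] = field(default_factory=list)    # quotation text only
    relations: dict[str, list[str]] = field(default_factory=dict)     # relation name -> words
    translations: dict[str, list[str]] = field(default_factory=dict)  # language code -> words


@dataclass
class Entry:
    """A word in one language and one part of speech."""
    word: str
    language: str
    part_of_speech: str
    meanings: list[Meaning] = field(default_factory=list)


# Hypothetical sample data, written by hand for illustration only.
entry = Entry(
    word="car",
    language="en",
    part_of_speech="noun",
    meanings=[
        Meaning(
            definition="A wheeled vehicle that moves independently.",
            relations={"synonyms": ["automobile"]},
            translations={"ru": ["автомобиль"], "fr": ["voiture"]},
        )
    ],
)

print(entry.meanings[0].translations["ru"])
```

Attaching relations and translations to a meaning rather than to the whole entry is a design choice of this sketch; it keeps each translation tied to the sense it translates.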
The structure (tables and relations) of the Wiktionary parsed database (database layout, see the file wikt_parsed_empty_with_foreign_keys.png):
Set of tables related to quotations (fragment of the Wiktionary parsed database):
Machine-readable Wiktionary framework:
I would like all two hundred Wiktionaries to be parsed by this parser. But I know only Russian and English :)
If you are a developer and are interested in adding modules to parse "your Wiktionary", then
- start with the paper describing the database (tables and relations) of the machine-readable Wiktionary: Transformation of Wiktionary entry structure into tables and relations in a relational database schema. 2010. Note that new tables related to quotations and context labels have been added since that publication, see Machine-readable database schema;
- GettingStartedWiktionaryParser — install parser and try to parse English Wiktionary and Russian Wiktionary;
- Play with parsed English or Russian Wiktionary SQL — download dumps of Wiktionary parsed databases from Academic Torrents;
- OneMoreWiktionary — extend parser in order to extract invaluable information from your Wiktionary.
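As a rough idea of what "playing with" a parsed database can look like, the sketch below builds a tiny in-memory SQLite database and looks up translations of a word. The table and column names (`entry`, `meaning`, `translation`) are simplified assumptions for this example and do not reproduce the real wikokit schema; consult the schema paper and the database layout file for the actual tables.

```python
import sqlite3

# Toy schema: table and column names are illustrative assumptions,
# not the actual layout of the Wiktionary parsed database.
SCHEMA = """
CREATE TABLE entry (
    id   INTEGER PRIMARY KEY,
    word TEXT NOT NULL,
    lang TEXT NOT NULL,
    pos  TEXT NOT NULL
);
CREATE TABLE meaning (
    id         INTEGER PRIMARY KEY,
    entry_id   INTEGER NOT NULL REFERENCES entry(id),
    definition TEXT NOT NULL
);
CREATE TABLE translation (
    id         INTEGER PRIMARY KEY,
    meaning_id INTEGER NOT NULL REFERENCES meaning(id),
    lang       TEXT NOT NULL,
    word       TEXT NOT NULL
);
"""


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with one hand-written entry."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO entry VALUES (1, 'car', 'en', 'noun')")
    conn.execute("INSERT INTO meaning VALUES (1, 1, 'A wheeled vehicle.')")
    conn.executemany(
        "INSERT INTO translation VALUES (?, ?, ?, ?)",
        [(1, 1, "ru", "автомобиль"), (2, 1, "fr", "voiture")],
    )
    return conn


def translations(conn, word, source_lang, target_lang):
    """Return translations of `word` into `target_lang`, sorted by word."""
    rows = conn.execute(
        """
        SELECT t.word
        FROM entry e
        JOIN meaning m ON m.entry_id = e.id
        JOIN translation t ON t.meaning_id = m.id
        WHERE e.word = ? AND e.lang = ? AND t.lang = ?
        ORDER BY t.word
        """,
        (word, source_lang, target_lang),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_toy_db()
print(translations(conn, "car", "en", "ru"))
```

With a real parsed dump, the same join pattern would apply, but against the table names documented in the database schema.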
The machine-readable dictionary database statistics:
- English Wiktionary: total, semantic relations, translations, part of speech
- Russian Wiktionary: total, semantic relations, translations, part of speech, context labels, quote (languages & sources, authors with clusters, other authors, years)
Wiki tool kit (wikokit) contains several projects related to wiki:
./common_wiki — common (low-level) functions to handle data of Wikipedia and Wiktionary in MySQL database,
./common_wiki_jdbc — functions to handle data of Wiktionary in MySQL and SQLite databases (JDBC, Java SE) (depends on common_wiki.jar).
./android/common_wiki_alink — Eclipse copy (source link) of ./common_wiki (!NetBeans)
./android/common_wiki_android — functions for access to Wiktionary in Android SQLite version of database (depends on common_wiki.jar).
./android/magnetowordik — Android word game (Wiktionary thesaurus).
./hits_wiki — API for access to Wikipedia in MySQL database, algorithms to search synonyms in Wikipedia (depends on jcfd.jar, common_wiki.jar).
./TGWikiBrowser — visual browser to search for synonyms in local or remote Wikipedia (depends on hits_wiki.jar and common_wiki.jar)
./wikidf — Wiki Index Database (list of lemmas and links to wiki pages, which contain these lemmas).
./wikt_parser — Wiktionary parser, which creates a MySQL database (like WordNet) from a Wiktionary MySQL dump file. The project goal is to convert Wiktionary articles to a machine-readable format. (It depends on common_wiki, common_wiki_jdbc)
./wiwordik — Visualization of parsed Wiktionary database. wiki + word = wiwordik.
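The idea behind converting Wiktionary articles to a machine-readable format can be illustrated with a toy example. The sketch below is not the wikt_parser implementation (which is written in Java and handles far more of the entry structure); it only assumes the common English Wiktionary convention of a level-2 heading for the language, deeper headings for parts of speech, and `#` lines for definitions. The function name `parse_entry`, the `POS_NAMES` set, and the sample text are all hypothetical.

```python
import re

# A wiki heading such as "==English==" or "===Noun===".
HEADING = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$")

# Small, assumed subset of part-of-speech heading names.
POS_NAMES = {"Noun", "Verb", "Adjective", "Adverb"}


def parse_entry(wikitext):
    """Return a list of {language, pos, definitions} records."""
    records = []
    language = None
    current = None  # record for the part-of-speech section being read
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            if level == 2:
                # A level-2 heading starts a new language section.
                language, current = title, None
            elif title in POS_NAMES and language is not None:
                current = {"language": language, "pos": title, "definitions": []}
                records.append(current)
            else:
                # Any other section (etymology, pronunciation, ...) is skipped.
                current = None
            continue
        # "# text" is a definition; "#:" (example), "#*" (quotation)
        # and "##" (sub-definition) lines are ignored in this sketch.
        if current is not None and re.match(r"^#[^:*#]", line):
            current["definitions"].append(line[1:].strip())
    return records


# Hand-written sample text, not copied from any Wiktionary.
SAMPLE = """\
==English==
===Etymology===
Etymology text goes here.
===Noun===
# A wheeled vehicle.
#: ''She drove her car to work.''
# A railway carriage.
===Verb===
# To travel by car.
==French==
===Noun===
# A coach.
"""

records = parse_entry(SAMPLE)
for record in records:
    print(record["language"], record["pos"], record["definitions"])
```

Each record corresponds to one (language, part of speech) pair, which is the same granularity as the first item in the list of extracted information above.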
Code from the previous project, Synarcher, is used in wikokit.
- A. Krizhanovsky, A. Smirnov. An approach to automated construction of a general-purpose lexical ontology based on Wiktionary // Journal of Computer and Systems Sciences International, 2013, Vol. 52, No. 2, pp. 215–225.
- A. Smirnov, T. Levashova, A. Karpov, I. Kipyatkova, A. Ronzhin, A. Krizhanovsky, N. Krizhanovsky. Analysis of the quotation corpus of the Russian Wiktionary // Research in Computing Science, Vol. 56, pp. 101-112, 2012.
- A. Krizhanovsky. A quantitative analysis of the English lexicon in Wiktionaries and WordNet // International Journal of Intelligent Information Technologies (IJIIT), October-December 2012, Vol. 8, No. 4, pp. 13-22.
- F. Lin, A. Krizhanovsky. Multilingual ontology matching based on Wiktionary data accessible via SPARQL endpoint // In: Proceedings of the 13th Russian Conference on Digital Libraries RCDL’2011. October 19-22, Voronezh, Russia. – pp. 19-26.
- A. A. Krizhanovsky. Transformation of Wiktionary entry structure into tables and relations in a relational database schema. Preprint. 2010.
- A. A. Krizhanovsky. The comparison of Wiktionary thesauri transformed into the machine-readable format. Preprint. 2010.
- A. A. Krizhanovsky, F. Lin. Related terms search based on WordNet / Wiktionary and its application in Ontology Matching // In: Proceedings of the 11th Russian Conference on Digital Libraries RCDL’2009. September 17-21, Petrozavodsk, Russia. – pp. 363-369.
- Крижановский А.А., Смирнов А.В., Круглов В.М., Крижановская Н.Б., Кипяткова И.С. Автоматическое извлечение словарных помет из Русского Викисловаря // Труды СПИИРАН. 2014. Вып. 2(33). С. 164-185.
- Крижановский А.А., Смирнов А.В. Подход к автоматизированному построению общецелевой лексической онтологии на основе данных викисловаря // Известия РАН. Теория и системы управления. N2, 2013, С. 53-63.
- Крижановский А. А., Луговая Н. Б., Круглов В. М. Извлечение и анализ дат произведений в корпусе цитат онлайн-словаря // Информационные технологии и письменное наследие: материалы VI междунар. науч. конф. El'Manuscript-12 (Петрозаводск, 3-8 сентября 2012) / отв. ред. В.А.Баранов, А.Г.Варфоломеев. – Петрозаводск; Ижевск, 2012. – 328 с. – C. 137—142. ISBN 978-5-8021-1402-5. (PDF)
- Смирнов А.В., Круглов В.М., Крижановский А.А., Луговая Н.Б., Карпов А.А., Кипяткова И.С. Количественный анализ лексики русского WordNet и викисловарей // Труды СПИИРАН. 2012. Вып. 23. С. 231–253.
- Крижановский А. Количественный анализ лексики английского языка в викисловарях и Wordnet // Труды СПИИРАН. 2011. Вып. 19. С. 87–101.
- Крижановский А. Оценка использования корпусов и электронных библиотек в Русском Викисловаре // Труды международной конференции «Корпусная лингвистика–2011». – СПб.: С.-Петербургский гос. университет, Филологический факультет, 2011, 348 с. – C. 217—222. ISBN 978-5-8465-0005-5.
- Крижановский А. Преобразование структуры словарной статьи Викисловаря в таблицы и отношения реляционной базы данных. Препринт. 2010.
- Крижановский А. Сравнение тезаурусов Русского и Английского Викисловарей, преобразованных в машиночитаемый формат. Препринт. 2010.
- Крижановский А. Машинная обработка Русского Викисловаря // Вики-конференция 2009. 24—25 октября, Санкт-Петербург.
- Java Wiktionary Library (JWKTL)
- perl-wiktionary-parser // github.com, Perl module
- wiktionary_parser // github.com, Perl module
- Dbnary
This program is multi-licensed and may be used under the terms of any of the following licenses:
- EPL, Eclipse Public License V1.0 or later, http://www.eclipse.org/legal
- LGPL, GNU Lesser General Public License V3.0 or later, http://www.gnu.org/licenses/lgpl.html
- GPL, GNU General Public License V3.0 or later, http://www.gnu.org/licenses/gpl.html
- AL, Apache License, V2.0 or later, http://www.apache.org/licenses
- BSD, New BSD License, http://www.opensource.org/licenses/bsd-license
See documentation.