Wikidata:WikiProject Couchdb

The purpose of this project is to investigate how running Wikidata on CouchDB could work.

Primary goals

  1. to find out whether it is useful for the community to have access to a fully updated copy of Wikidata in CouchDB (ideally ingesting changes every 60 seconds)
  2. to investigate the time needed to load the human items in Wikidata
  3. to investigate the time needed to load the scholarly items in Wikidata (to be defined later)
  4. to investigate the time needed to load all items in Wikidata
  5. to investigate how to keep items in CouchDB updated based on a stream of edits from e.g. Kafka (if we have access to that)
  6. to investigate how to run example searches on items in Wikidata using MapReduce (see the sketch after this list)
  7. to investigate whether ChatGPT or similar generative AI could help rewrite example SPARQL queries as MapReduce views
  8. to investigate optimal hardware requirements for a minutely updated CouchDB with fast response times for queries
  9. to find out whether user-defined views are feasible. How does a view affect disk space? Can we accommodate 1,000 views? 1 million?
  10. to find out whether we can rely on replica ElasticSearch indices to find the items we are interested in and avoid a huge number of CouchDB indices. Maybe we can get away with indexing just a few selected fields, such as the QID and the P31 field, and brute-forcing the rest.
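
As a starting point for goal 6, here is a minimal sketch of a MapReduce view that indexes items by their P31 (instance of) value. The database name wikidata, the design document and view names, and the admin credentials are placeholders, not decisions of this project.

curl -X PUT http://localhost:5984/wikidata/_design/examples \
  -u admin:password \
  -H 'Content-Type: application/json' \
  -d '{
    "views": {
      "by_p31": {
        "map": "function (doc) { var claims = (doc.claims && doc.claims.P31) || []; claims.forEach(function (c) { if (c.mainsnak && c.mainsnak.datavalue) { emit(c.mainsnak.datavalue.value.id, null); } }); }",
        "reduce": "_count"
      }
    }
  }'

# count all items that are instances of human (Q5)
curl 'http://localhost:5984/wikidata/_design/examples/_view/by_p31?key=%22Q5%22'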

About CouchDB

Apache CouchDB offers high availability, excellent throughput and scalability. These goals were achieved using immutable data structures - but they have a price: disk space. CouchDB was designed under the assumption that disk space is cheap.[1]

Installing CouchDB

See https://wiki.archlinux.org/title/CouchDB
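
On Arch Linux the steps from that wiki page boil down to roughly the following sketch; the exact package name, service name and configuration file location should be checked against the wiki page, and the admin credentials are placeholders.

sudo pacman -S couchdb
# before the first start, set an admin user in the [admins] section of local.ini
# (see the wiki page above for the configuration file location)
sudo systemctl enable --now couchdb
curl http://localhost:5984/   # should return a JSON welcome message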

Loading items as documents from the dump

See https://github.com/maxlath/import-wikidata-dump-to-couchdb

Maxlath noted 2024-09-11:

This tool was a bit of a naive implementation; if I wanted to do that today, I would do it differently, and make sure to use CouchDB bulk mode:
  1. Get a wikidata json dump
  2. Optionally, filter to get the desired subset. In any case, turn the dump into valid NDJSON (drop the first and last lines and the comma at the end of each line).
  3. Pass each entity through a function to move the "id" attribute to "_id", using https://github.com/maxlath/ndjson-apply, to match CouchDB requirements.
  4. Bulk upload the result to CouchDB using https://github.com/maxlath/couchdb-bulk2
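
A hedged sketch of steps 1-4 using jq, split and curl instead of the dedicated tools; the dump file name, database name, batch size and admin credentials are placeholders.

# steps 1-3: dump -> valid NDJSON -> move "id" to "_id"
zcat wikidata-dump.json.gz \
  | sed '1d;$d' \
  | sed 's/,$//' \
  | jq -c 'del(.id) + {_id: .id}' \
  > entities.ndjson

# step 4: upload in batches of 1000 documents via the _bulk_docs API
split -l 1000 entities.ndjson batch_
for f in batch_*; do
  jq -s '{docs: .}' "$f" \
    | curl -X POST http://localhost:5984/wikidata/_bulk_docs \
        -u admin:password \
        -H 'Content-Type: application/json' -d @-
done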


Filtering items before load

See https://github.com/maxlath/wikibase-dump-filter
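
For example, filtering down to the human items for goal 2 would look roughly like the line below; the --claim flag follows the tool's README and should be double-checked against the current documentation.

# keep only items that are instances of human (P31 = Q5)
zcat wikidata-dump.json.gz | wikibase-dump-filter --claim P31:Q5 > humans.ndjson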

Pre-format entities into clickable URLs

https://github.com/maxlath/wikibase-dump-formatter

How IDs work

CouchDB stores its documents in a B+ tree. Each additional or updated document is stored as a leaf node, and may require re-writing intermediary and parent nodes. You may be able to take advantage of sequencing your own ids more effectively than the automatically generated ids if you can arrange them to be sequential yourself.[2]
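
If sequential _ids turn out to matter for load performance here, one option (an assumption of this project page, not advice from the CouchDB documentation) would be to zero-pad the numeric part of each entity ID so documents sort in ascending order, e.g. with jq:

# Q42 -> Q0000000042; could replace the plain id-to-_id step in the loading pipeline above
jq -c 'del(.id) + {_id: (.id[0:1] + ("0000000000" + .id[1:])[-10:])}'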

Disk space requirements

Minimum:

  • json.gz dump download:
    • items: 140 GB (as of 2024-09-11)
    • lexemes: 0.47 GB (as of 2024-09-11)
  • CouchDB instance: ? GB

Compaction

Database compaction compresses the database file by removing unused file sections created during updates. Old document revisions are replaced with a small amount of metadata called a tombstone, which is used for conflict resolution during replication. The number of stored revisions (and their tombstones) can be configured using the _revs_limit URL endpoint.[3]
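
Compaction is triggered per database with an admin-only POST; the database name and credentials below are placeholders.

curl -X POST http://localhost:5984/wikidata/_compact \
  -u admin:password \
  -H 'Content-Type: application/json'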

Reducing number of revisions

To reduce the size of the list of old revisions, CouchDB offers a parameter _revs_limit (revisions limit) which limits the number of revisions stored in a database. By default it is set to 1000. It can be changed by issuing an HTTP PUT command:

curl -X PUT -d "10" http://localhost:5984/testdb/_revs_limit

Note that reducing the revisions limit increases the risk of getting conflicts during replication. Therefore you should only do that if you replicate often (before 10 new revisions have been created) or if you don’t use replication at all[...][4]

Purge

A database purge permanently removes the references to documents in the database. Normal deletion of a document within CouchDB does not remove the document from the database; instead, the document is marked as _deleted=true (and a new revision is created). This is to ensure that deleted documents can be replicated to other databases as having been deleted. This also means that you can check the status of a document and identify that the document has been deleted by its absence. The purge request must include the document IDs, and for each document ID, one or more revisions that must be purged. Documents can be previously deleted, but it is not necessary. Revisions must be leaf revisions.[5]
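
A purge request might look like the following; the document ID and revision are placeholders, and the revision must be a leaf revision as noted above.

curl -X POST http://localhost:5984/wikidata/_purge \
  -u admin:password \
  -H 'Content-Type: application/json' \
  -d '{"Q42": ["2-e0165f450f6c89dc6b071c075dde3c4d"]}'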

Clustering

Cluster: The nodes of a CouchDB cluster internally replicate with each other via optimized network connections. This is intended to be used with servers that are in the same data center. This allows for database sharding to improve performance.[6]
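
Sharding is fixed when a database is created, via the q (number of shards) and n (number of copies of each shard) parameters; the values below are placeholders, not a sizing recommendation.

curl -X PUT 'http://localhost:5984/wikidata?q=8&n=3' -u admin:password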

Replication

Replication is natively supported by CouchDB and could mean improved performance for European or Asian Wikidata users if WMF replicated to data centers in those regions, which is not currently possible/feasible with Blazegraph.

One of CouchDB’s strengths is the ability to synchronize two copies of the same database. This enables users to distribute data across several nodes or data centers, but also to move data more closely to clients. Replication involves a source and a destination database, which can be on the same or on different CouchDB instances. The aim of replication is that at the end of the process, all active documents in the source database are also in the destination database and all documents that were deleted in the source database are also deleted in the destination database (if they even existed).[7]
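
A one-off or continuous replication can be started with a single request to the _replicate endpoint; the URLs and credentials below are placeholders.

curl -X POST http://localhost:5984/_replicate \
  -u admin:password \
  -H 'Content-Type: application/json' \
  -d '{
    "source": "http://localhost:5984/wikidata",
    "target": "http://replica.example.org:5984/wikidata",
    "continuous": true,
    "create_target": true
  }'

For a replication that survives restarts, the same document can instead be stored in the _replicator database.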

Members

References

  1. https://eclipsesource.com/blogs/2012/07/11/reducing-couchdb-disk-space-consumption/#comment-16725
  2. https://docs.couchdb.org/en/stable/best-practices/documents.html
  3. https://docs.couchdb.org/en/stable/maintenance/compaction.html
  4. https://eclipsesource.com/blogs/2012/07/11/reducing-couchdb-disk-space-consumption/#comment-16725
  5. https://docs.couchdb.org/en/stable/api/database/misc.html
  6. https://docs.couchdb.org/en/stable/cluster/index.html
  7. https://docs.couchdb.org/en/stable/replication/intro.html