User Details
- User Since: Mar 31 2015, 8:12 PM (500 w, 1 d)
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: Jberkel [ Global Accounts ]
Jul 9 2024
Done some testing with the latest (20240701) dumps (allowing for some tolerance around the moment of dump generation):
Jun 4 2024
That's good news. I've done some tests, and it's looking much better now. The XML dumps haven't been released yet (due to T365501), so there's no baseline to do more detailed testing.
May 26 2024
Latest HTML enwikt dump (20240520) vs XML dump:
May 23 2024
It's probably just the new content, with the baseline still being incomplete. I'll check with the XML dumps.
Apr 18 2024
The HTML dumps are pretty much useless until T351712 is fixed.
Mar 25 2024
Can anyone clarify though? It seems that the new sub-tasks are now stuck again.
Mar 18 2024
It probably means the investigation has been "resolved". The main task is now T351712 + subtasks.
Feb 19 2024
I'll add a command to automatically clear the tmp storage; that should help.
I've deleted tmp and other unused stuff; it's now down to 16 GB. Is that acceptable?
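Roughly what I have in mind for that cleanup command, as a minimal sketch (the tmp path here is hypothetical; the real location would be whatever the tool actually uses):

```python
import shutil
from pathlib import Path

TMP_DIR = Path.home() / "tmp"  # hypothetical location of the tool's tmp storage

def clear_tmp(tmp_dir: Path) -> None:
    """Remove everything under tmp_dir, keeping the directory itself."""
    for entry in tmp_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()

if __name__ == "__main__":
    clear_tmp(TMP_DIR)
```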
Feb 5 2024
Could you explain a bit more what this means, please?
Jan 26 2024
Latest enwikt dump is now at 9.6 GB, still some way to go to reach the 13 GB of the 20230701 dump (also incomplete, but still useful as a baseline).
Jan 9 2024
OK. I think it might be worth putting a disclaimer somewhere, perhaps on https://dumps.wikimedia.org/other/enterprise_html/, to warn users that the dumps are incomplete.
@REsquito-WMF thanks! So this means the next dumps will have more data, but will still be incomplete until this other bug is fixed?
Dec 11 2023
@REsquito-WMF not sure if the changes were already in place, but the current enwiktionary NS0 dump is still at 3.5 GB (compared to 13 GB on 20230701).
Nov 6 2023
Is there anything going to be done about this? The enterprise dumps have been in full failure mode for a few months now and are absolutely unusable. I really don't know how an obvious total failure of service can stay in triage hell for such a long time. I understand WMF resources are limited, but then at least let volunteers help out with this. My question about the code generating the dumps above is still unanswered. The transparency/communication on this whole issue has been miserable.
Oct 27 2023
We don't really need to keep all the old dumps around, so I've started deleting all dump files from before 2023. Different files are needed for different purposes: for the stats, and for the "wanted entries" on Wiktionary. After generating the dumps, all the data "lives" on Wiktionary, except for the raw data, which is hosted on ~tools.digero/www and shouldn't be deleted. Right now it uses about 1.3G.
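For reference, the deletion is along these lines, as a sketch (the directory and the YYYYMMDD-in-filename convention are assumptions):

```python
import re
from pathlib import Path

DUMP_DIR = Path("dumps")  # hypothetical location of the archived dump files

for path in sorted(DUMP_DIR.iterdir()):
    # assumes each dump file carries a YYYYMMDD date component in its name
    match = re.search(r"(20\d{6})", path.name)
    if match and match.group(1) < "20230101":
        print("deleting", path.name)
        path.unlink()
```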
Oct 24 2023
@tstarling Thanks for unblocking this! 🙌
Oct 5 2023
Some random guessing: perhaps the error-handling code is borked, and it just finishes the dump and closes the file (without failing the process)? But why then would so many repositories hit errors at the same time? All the 7-20 dumps seem to be affected; maybe there were some site-wide network/server problems that weren't handled properly?
Oct 4 2023
I've suggested this previously: the XML dumps have been around for a very long time and, compared to the HTML version, they are very reliable, so why not simply base the HTML dumps off them? Then you could easily compare the counts of the two files, and it would also help users who consume both types of data.
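To illustrate the count comparison, a hedged sketch (file names are assumptions; the XML dump covers all namespaces while the HTML NS0 dump doesn't, so the delta is only approximate without namespace filtering):

```python
import bz2
import tarfile

def count_xml_pages(path: str) -> int:
    """Count <page> elements by scanning the decompressed XML line by line."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return sum(line.count("<page>") for line in f)

def count_html_articles(path: str) -> int:
    """Count NDJSON lines (one article per line) across the tarball's members."""
    total = 0
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            f = tar.extractfile(member)
            if f is not None:
                total += sum(1 for _ in f)
    return total

# hypothetical file names, following the usual dump naming patterns
xml_count = count_xml_pages("enwiktionary-20231001-pages-articles.xml.bz2")
html_count = count_html_articles("enwiktionary-NS0-20231020-ENTERPRISE-HTML.json.tar.gz")
print(f"XML: {xml_count}  HTML: {html_count}  delta: {xml_count - html_count}")
```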
Sep 20 2023
Any idea why this would affect primarily non-wikipedia instances? Is the code which generates these dumps available somewhere?
Sep 15 2023
Weirdly, there seems to be less variation in file sizes for Wikipedia dumps:
Aug 24 2023
file sizes from the most recent enwikt HTML dumps (NS0):
Jul 24 2023
Hasn't been fixed yet, data is still missing.
Jul 21 2023
OK, I hope this can be rolled out quickly; it can't get much worse than the current state.
I just checked the latest dumps (2023-07-20), and it's now worse: there are around 2.5 million pages missing from the HTML dump (using the XML dump as a baseline).
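Roughly how such a "missing pages" figure can be computed, as a sketch: take the XML run's all-titles-in-ns0 file as the baseline and subtract the article names present in the HTML dump. File names and the HTML dump's "name" field are assumptions, and holding all titles in memory is fine for a one-off check but not tiny.

```python
import gzip
import json
import tarfile

def baseline_titles(path: str) -> set[str]:
    """Titles from the XML run's all-titles-in-ns0.gz (underscore form)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        next(f)  # skip the page_title header line
        return {line.rstrip("\n").replace("_", " ") for line in f}

def html_titles(path: str) -> set[str]:
    """Article names from the Enterprise HTML NDJSON tarball."""
    titles = set()
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            f = tar.extractfile(member)
            if f is None:
                continue
            for line in f:
                titles.add(json.loads(line)["name"])  # "name" field assumed
    return titles

missing = baseline_titles("enwiktionary-20230720-all-titles-in-ns0.gz") \
        - html_titles("enwiktionary-NS0-20230720-ENTERPRISE-HTML.json.tar.gz")
print(f"{len(missing)} pages missing from the HTML dump")
```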
Jul 12 2023
Why was this already marked as resolved? New dumps haven't even been published yet, so it's impossible to verify.
Jun 12 2023
Closing this; maybe it'll be useful for future reference. I haven't added documentation to wikitech; not sure where it should go.
I'll see if I can prebuild the binaries and then just launch the commands without Gradle to avoid this issue (so the locks are only held during building, not execution).
Jun 10 2023
There are ~150 entries missing from the HTML dump (compared to 2200 earlier):
It looks like the situation has improved with the latest dump (20230601, enwikt):
Jun 9 2023
Looks like the files have finally been synced to toolforge!
Jun 5 2023
The rsync, which copies the files over to the NFS share accessible to toolforge, is still in progress.
Jun 2 2023
Looks like the data was copied successfully this time! I've downloaded the enwiktionary-NS0 dump and the checksum matches.
May 29 2023
It might be the case that we are just serving the checksum of the previous dump.
Meaning: we are grabbing the checksum before the upload has finished.
@ArielGlenn if the API side isn't fixed until the June run would it be possible to ignore the checksums and copy the files regardless? We've been dump-less for 2 months now…
May 25 2023
@Protsack.stephan Where are the checksums calculated? Can you re-index the metadata of the dump files on the API side so that they match the actual file content? It looks like they might get calculated before the file is fully processed, or they are calculated from a different version of the file (as you indicated in your comment)?
@ArielGlenn Is the downloaded data usable, that is, can you decompress the files without error? If the files are OK, maybe it's a problem with the checksum generation: if the checksums are off only for some files, it could be related to the file size, perhaps some sort of overflow in the code where the hashes are calculated?
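The check I'm suggesting, as a sketch: recompute the downloaded file's checksum and stream through a full decompression pass. The hash algorithm (SHA-256) and the file name are assumptions; use whatever the published metadata actually specifies.

```python
import hashlib
import tarfile

def file_sha256(path: str, chunk: int = 1 << 20) -> str:
    """Recompute the file's digest for comparison with the published checksum."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def decompresses_cleanly(path: str) -> bool:
    """Stream through the whole tarball; truncation or corruption raises."""
    try:
        with tarfile.open(path, "r:gz") as tar:
            for member in tar:
                f = tar.extractfile(member)
                if f is not None:
                    while f.read(1 << 20):
                        pass
        return True
    except (tarfile.TarError, EOFError, OSError):
        return False

path = "enwiktionary-NS0-ENTERPRISE-HTML.json.tar.gz"  # hypothetical local file
print(file_sha256(path), decompresses_cleanly(path))
```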
May 16 2023
Another question: where are the enterprise dumps stored on toolforge now? They seem to have stopped updating in October last year.
Thanks for moving this one forward!
May 11 2023
Perhaps the same underlying issue as T305407.
May 8 2023
The files haven't materialized, guess something is still amiss…
May 4 2023
Yes, that's what I meant, thanks 🤞
Ok, so the files have been generated, but not copied? Can they be recovered?
May 2 2023
Thanks! Is there any way to check the HTML dump progress/state "from the outside"? The XML dumps have a status page + the machine-readable dumpstatus.json.
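For reference, checking an XML run's state programmatically looks like this (the run date here is just an example); something comparable for the HTML dumps would make progress checkable:

```python
import json
from urllib.request import urlopen

# every XML dump run publishes a machine-readable status file
URL = "https://dumps.wikimedia.org/enwiktionary/20230501/dumpstatus.json"

with urlopen(URL) as resp:
    status = json.load(resp)

# the "jobs" map reports per-job state ("done", "failed", "waiting", ...)
for job, info in sorted(status["jobs"].items()):
    print(f"{job:40s} {info['status']}")
```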
Apr 11 2023
Related to T318371
Mar 24 2023
Ok, let me know once you have dumps available with the new infra and I'll re-generate them.
On the English Wiktionary we now use HTML dumps to generate our stats. Some of our content is not in the mainspace and is therefore not reflected in the statistics. There are also problems generating information related to proto-languages, which live in the Reconstruction: namespace.
Thanks, are you referring to the deprecation of restbase/MCS? On the English Wiktionary, we're relying more and more on these dumps for statistics and maintenance tasks, and many editors have noticed problems with data derived from these dumps.
Mar 22 2023
Another cache-failure ticket, though probably not related: T226931
Mar 16 2023
Looks like T122934 is relevant and would help with this. Unfortunately, there's been no movement on that task recently.
Dec 5 2022
It works when adding -t latest.
Dec 4 2022
I've been looking at submitting a patch for this myself, but while building the Docker images from https://gerrit.wikimedia.org/g/operations/docker-images/toollabs-images I get the following error:
Nov 9 2022
I have disabled all gadgets and beta features (except "Visual Editing" and "New wikitext mode"), still the same result.
I've also tried it with Safari (see screenshot).
Oct 10 2022
The stats now have a correct timestamp, but there's still missing data. Can you please fix this? With this unpredictable mix of old and new data they're useless for most purposes right now; we might as well not generate them at all.
Oct 7 2022
Hmm, dumps are still not available…
Oct 4 2022
Thanks for the update!
Oct 2 2022
@Protsack.stephan great! However, it looks like the October dumps haven't been generated yet?
Aug 22 2022
@nfliu Unfortunately, the HTML dumps don't seem to be very reliable at the moment.
Jul 1 2022
Any updates on this? The task has been moved around a bit recently, but it's not clear what is happening. Is it difficult to fix?
Apr 21 2022
Just a thought: perhaps the HTML dumps should be generated from the XML dumps, so that the revisions in both match (and they can both be used interchangeably without consistency problems).
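As a sketch of the consistency check this would make possible: map each title to its revision id in both dumps and count divergences. The HTML dump's "name" and "version.identifier" fields are assumptions based on the Enterprise schema, and the file names are hypothetical.

```python
import bz2
import json
import tarfile
import xml.etree.ElementTree as ET

def xml_revisions(path: str) -> dict[str, int]:
    """title -> latest revision id, streamed from a pages-articles dump."""
    revs = {}
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag.endswith("}page"):
                ns = elem.tag[: -len("page")]  # the "{...export-x.y/}" prefix
                title = elem.findtext(ns + "title")
                revid = elem.findtext(f"{ns}revision/{ns}id")
                if title and revid:
                    revs[title] = int(revid)
                elem.clear()  # keep memory bounded while streaming
    return revs

def html_revisions(path: str) -> dict[str, int]:
    """title -> revision id, from the Enterprise HTML NDJSON tarball."""
    revs = {}
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            f = tar.extractfile(member)
            if f is None:
                continue
            for line in f:
                obj = json.loads(line)
                # field names assumed from the Enterprise schema
                revs[obj["name"]] = obj["version"]["identifier"]
    return revs

xml_revs = xml_revisions("enwiktionary-pages-articles.xml.bz2")
html_revs = html_revisions("enwiktionary-NS0-ENTERPRISE-HTML.json.tar.gz")
diverging = sum(1 for t, r in html_revs.items() if xml_revs.get(t, r) != r)
print(f"{diverging} pages with diverging revisions")
```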
Dec 7 2021
Seeing that T114072 is marked as resolved, is it now possible to implement this?
Nov 21 2021
Even if you don't want to change this behavior, it should probably be mentioned in the documentation of #iferror. Because of this limitation, the function is practically useless when used with Scribunto.
Nov 5 2021
Thanks for fixing this so quickly, I'll wait for rc2 and re-test.
Aug 8 2021
If Lua on MediaWiki can't be upgraded to 5.2 or later (T178146 is stalled, with "re-evaluation in 2024"), maybe just the GC changes could be backported to 5.1, to have at least some predictable GC behaviour?