User Details
- User Since
- Nov 12 2020, 6:16 PM (207 w, 11 h)
- Availability
- Available
- LDAP User
- Fabian Kaelin
- MediaWiki User
- FKaelin (WMF) [ Global Accounts ]
Yesterday
Adding a quick snippet to look at the percentage of revisions that have the previous revision as their parent. Generally above 99%; for enwiki it is 99.98%.
The snippet pasted above now returns the same maxmind metadata for all hosts the job ran on: 2024-10-29 19:58:23.
Tue, Oct 29
Agreed that the SimpleSkeinOperator on kubernetes airflow is not needed anymore (and that the code should use the kubernetes-native operator instead).
This dataset has been reviewed and approved by privacy and legal (details on asana). Note that since the clickstream dataset is older than the L3SC process, both the dataset itself and the expansion of the set of languages were reviewed and approved.
Mon, Oct 28
Mon, Oct 21
To follow up on my previous comment:
Another point for discussion: the mediawiki history dumps are published as tsv files (without a header) for the community. Changing the definition of user_is_anonymous could have an impact on consumers in the community? It would likely involve a prior notice to the community, cc @KinneretG who is working on an announcement about temp accounts to the research list.
Thu, Oct 17
Wed, Oct 16
Summarizing my take-away from this slack thread about how to use html datasets in airflow dags.
Tue, Oct 15
Fri, Oct 4
Thank you @MunizaA, your MR is merged.
Thu, Oct 3
This is completed.
Oct 2 2024
Great - thank you @TAndic. I am closing this as resolved, as this means it will work the same as before, which should be fine for Research.
Oct 1 2024
@TAndic, do you have suggestions for who to tag here?
Thanks for the clarifications @Mayakp.wiki, though in my opinion we should still consider adding a user_is_named flag as a replacement for the previous definition of user_is_anonymous, to minimize the downstream implications of this change.
Sep 27 2024
Weekly updates
- Held a hands-on Tuesday meeting for research scientists to use and contribute to the shared research code base
- Notebooks to get started and for contributing to research codebase
- Notebook for strategies to process webrequest logs at scale
Sep 25 2024
Instead, we want each consumer to stop and think how they want to handle temporary users.
@Ottomata thanks for sharing - the intricacies of naming/classifying the user types are real!
From a data/metrics usage perspective, the user_is_anonymous field seems to be mostly used for the binary anon/editor classification (e.g. in wikistats the editor types are "Anonymous - a user that is not logged in" and "User - a registered, logged in"; in research we create datasets/models that use this as a feature). In my understanding, this binary nature will not change (we might want to update some naming); i.e. the temp accounts will eventually replace all anonymous edits (once the feature is rolled out to all wikis).
Sep 23 2024
This is done: research-api-template repository
Sep 11 2024
@KCVelaga_WMF I misread this code previously - for now the model loading/inference for the various variants of the revert risk model is not unified (we plan to do that though). There are separate loading and classify methods for the multi-lingual model. The feature extraction pipeline for all models is generalized, but the inference step needs some more work. I started updating this notebook to work with the multi-lingual model but ran into some torch/transformer version issues.
Sep 9 2024
Weekly updates
- preparation for the Tuesday hands-on session scheduled for Sep 17th
- developing teaching notebooks
Sep 4 2024
Aug 26 2024
A process for publishing survey data has been established, so I am inclined to close this task as resolved and track the release of new datasets in other phabs. Do you have a preference @Miriam?
Aug 24 2024
This work has been merged at last, with this MR.
No updates
Aug 19 2024
There is no pre-computed dataset available. The implementation is general, e.g. by passing a multi-lingual model url it could create predictions for that model. See the risk observatory pipeline code as an example that uses this transformation end-to-end; this can be run via a notebook (pip install the repo and import/execute the run method). To create a pipeline that generates predictions for the multi-lingual model via an airflow dag, we would need to create a research engineering request.
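For concreteness, running it from a notebook would look roughly like the sketch below; the package, import path, and run() arguments are placeholders (the risk observatory pipeline code linked above is the authoritative example).
```
# Hypothetical sketch only: the package, import path, and run() arguments
# below are placeholders, not the actual API of the research repo.

# In a notebook cell, first install the repo into the kernel's environment:
#   !pip install git+https://gitlab.wikimedia.org/repos/research/<repo>.git

from research_transforms.revertrisk import run  # placeholder import path

# Passing a multi-lingual model url makes the pipeline produce predictions
# for that model variant.
run(
    model_url="https://example.org/models/revertrisk-multilingual",  # placeholder
    wiki_db="enwiki",
    snapshot="2024-09",
)
```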
Weekly updates
- Defined scoping for Q1, updated description
- Started defining scope/style of Tuesday meeting
Jul 23 2024
Jul 8 2024
Finally got around to this. Thank you @YLiou_WMF for the data file, this looks good to me in general.
Jun 19 2024
I confirm that this request is legit, also adding @XiaoXiao-WMF as manager.
Jun 17 2024
Thanks. Is this now using AQS 2? It has been a moment, can you point to a current/good example job that writes to an AQS cassandra dataset from airflow?
Summary of developments:
- implementation of end-to-end ml training workflows for
- revert risk model (done)
- add-a-link model (in progress)
- airflow dags to execute the pipelines (scheduled for retraining pipelines, manual trigger for development)
- discussions for how new model versions can be deployed
- for now, continue with manual process established by ML platform
- T366528 to track automation, as a manual process will not scale as research puts more training pipelines into production
- guide for contributing to the repository containing training workflows
- future work in collab with ML platform
- GPU support
- enable using new ML boxes once they become available
- use the GPUs available on existing infra in production airflow jobs (maybe via a sprint with ML platform that we didn't get to in Q4 FY24)
- standards for ML training
- there is a style guide and existing ml training pipelines to base new work on, but we refrained from introducing framework-like code or abstractions - instead we used the existing infrastructure.
- led by the ML Platform team, we should revisit this once the new ML boxes become available, as there will be a need for new tooling at that point
- related: the current tooling for end-to-end ML training workflows is not convenient for iterative research/development (setup/deployment is error-prone and too involved for one-off use cases); research engineering has a goal in FY25 to improve researcher tooling
Weekly updates
- initial review on the MR
- meeting with Aisha/Martin to discuss MR and how to approach remaining work
Weekly update
- pipelines are merged
- airflow dags are deployed, final testing in progress
Jun 13 2024
In T358366#9831389 I asked if other fields could be added to the schema; in particular the diff between two revisions, which is frequently used by research (wikidiff). I agree with @xcollazo's concerns, but this led me to think about the implications of computing the diff separately with regard to reconciliation.
- the diff is expensive to compute, as the parent revision might be at any moment in the past and is not necessarily the most recent previous revision. The wikidiff pipeline batches jobs by page (i.e. a batch contains the full history of the pages in the batch).
- the full diff dataset is computed for each snapshot to follow the "snapshot pattern". However, it is not significantly cheaper to make this pipeline incremental (e.g. only append diffs for the new month of revisions), as any revision in the past can be a parent revision, so the join is still expensive
- so how would one go about "enriching" wmf_dumps.wikitext_raw_rc2 with a diff column? The job could filter the full history for only the pages changed in that hour (broadcast join) and then do the self join (see the sketch after this list), but that would still require a full pass over the data, which seems expensive. This certainly is solvable, e.g. one could decrease the update interval, but it is tempting to instead implement the diff as a streaming "enrichment" pipeline.
- this would look similar to the existing "page change" job, e.g. query mediawiki for the current and parent revision text and compute the diff (maybe with a cache for the previous wikitext for each page which is the most common parent revision)
- however, this leads to the question of correctness/reconciliation, since this diff dataset would not be derived from wmf_dumps.wikitext_raw_rc2 and would thus require its own reconciliation mechanism? Which would be an argument in favour of the "Is wmf_dumps.wikitext_raw the right 'place' to check whether we are missing events or not? Shouldn't we do these checks upstream?" point raised above.
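To make the batch variant concrete, here is a rough sketch of the self join that attaches the parent revision's text to each revision. Table and column names assume the wmf.mediawiki_wikitext_history schema and the snapshot value is illustrative; this is not the actual wikidiff pipeline code.
```
# Rough sketch: join each revision to its parent revision's text.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

revisions = spark.table("wmf.mediawiki_wikitext_history").where(
    "snapshot = '2024-05' AND wiki_db = 'enwiki'"
)

parents = revisions.select(
    F.col("revision_id").alias("parent_revision_id"),
    F.col("revision_text").alias("parent_text"),
)

# The parent can be any revision in the page's history, so the join is over
# the full history, not just the latest month of revisions.
with_parent = revisions.join(
    parents,
    revisions["revision_parent_id"] == parents["parent_revision_id"],
    how="left",
)
```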
Jun 3 2024
May 31 2024
Indeed, different versions of the database seem to be present on cluster hosts.
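For reference, a quick way to compare what each host has locally; the path and database edition below are assumptions based on the default GeoIP install location.
```
# Print the build date of the local MaxMind database on a host; running this
# on each cluster host makes version drift visible. Path/edition are
# assumptions (default GeoIP install location), adjust as needed.
import datetime
import maxminddb

reader = maxminddb.open_database("/usr/share/GeoIP/GeoIP2-City.mmdb")
meta = reader.metadata()
build = datetime.datetime.utcfromtimestamp(meta.build_epoch)
print(f"{meta.database_type}: built {build:%Y-%m-%d %H:%M:%S}")
reader.close()
```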
May 23 2024
- This pipeline is implemented (MR)
- Remaining work: schedule an airflow dag to regularly compute new topics dataset
Closing this task as resolved as the storage request was handled.
Closing this as resolved. After more discussion and some experimentation, it was decided that doing batch inference within the distributed jobs (e.g. by broadcasting the model to the workers) is preferable. Pasting the comments from the relevant slack thread here.
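For context, the broadcast pattern looks roughly like the sketch below. The "model" is a stand-in dict of weights and the columns are placeholders; in the real jobs it would be the loaded revert-risk (or similar) model.
```
# Minimal sketch of batch inference inside a Spark job by broadcasting the
# model to the workers.
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

model_weights = {"bias": 0.1, "per_char": 0.001}  # placeholder "model"
bc_model = spark.sparkContext.broadcast(model_weights)

@F.udf(returnType=T.DoubleType())
def predict(text):
    # Each executor reads the broadcast value once instead of shipping the
    # model with every task.
    w = bc_model.value
    return float(min(w["bias"] + w["per_char"] * len(text), 1.0))

df = spark.createDataFrame([("some wikitext",), ("more wikitext",)], ["revision_text"])
df.withColumn("prediction", predict("revision_text")).show()
```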
May 22 2024
More on "Availability" / time travel. This question is not easy to answer, as it also relates to the current snapshot approach, which forces a pipeline to reason about the past in a rather limiting way. Aka "do you want the data as it looked today, or 1 month ago, or 2 month ago?", and finding out if/how the past data is different is not trivial and rarely practical. Generally pipelines either
- offload dealing with the snapshot semantics to the consumers by producing snapshotted datasets themselves
- implement a pseudo-incremental dataset by disregarding the "new past" and any changes it might contain.
For this reason I find it hard to define requirements for time travel; it is basically a new capability (for example, the replacement for "mediawiki_wikitext_current" could be a transformation of a time travel query). Starting with 90 days should be sufficient, as it is strictly an improvement over what one can do now.
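For illustration, "the data as it looked a month ago" could then be a query rather than a separate snapshot. A sketch, assuming an Iceberg-backed table and Spark 3.3+ SQL time travel syntax; table/column names and the timestamp are illustrative.
```
# Sketch of a time travel query: read the table state at a past timestamp
# instead of keeping snapshot copies.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_then = spark.sql("""
    SELECT wiki_db, revision_id
    FROM wmf_dumps.wikitext_raw
    TIMESTAMP AS OF '2024-04-22 00:00:00'
    WHERE wiki_db = 'enwiki'
""")

df_now = spark.table("wmf_dumps.wikitext_raw").where("wiki_db = 'enwiki'")

# e.g. the "new" revisions since then, without waiting for a monthly snapshot
new_revisions = df_now.join(df_then, "revision_id", "left_anti")
```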
- Availability: Research is mostly treating the current history dumps as a pseudo-incremental dataset - i.e. pipelines that depend on the history wait for a new snapshot to be released and then only use the "new" data from that snapshot (aka the revisions created in the month since the last snapshot was generated). This means that wmf_dumps.wikitext_raw would allow us to significantly reduce the latency - roughly from 1 month (waiting for the snapshot interval to trigger) + 12 days (dump processing) to a few hours.
- Schema: As the schemas are almost identical, my main question is about extending the existing dataset in ways that depend on the snapshot mechanism. For example, research has a number of use cases that involve comparing the revision text with the parent revision text. This involves a computationally expensive self join and some pitfalls, so there is a wikidiff job that creates (yet another) version of the wikitext history that includes a column with the unified diff between the current and parent revision (see the small example after this list).
- Could we add the diff to the proposed wmf_dumps.wikitext_raw? As the parent revision could be at any point in the past, this would likely require the equivalent of wmf.mediawiki_wikitext_current to be available when new revisions are ingested into the dataset.
- More generally, what is the replacement for the wmf.mediawiki_wikitext_current?
- Data quality: the discussion around correctness of the events data T120242 also applies in this context. For research in particular, many use cases don't have high requirements (e.g. for training datasets for ML, or for metrics datasets that involve models that can also be "incorrect"), and we could/would migrate existing jobs to the new dumps table once it is available/supported in prod.
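As referenced above, a small example of what the "unified diff" column contains per revision pair, using Python's difflib with made-up revision texts; the actual wikidiff job computes this at scale over the full history.
```
# Unified diff between a parent and current revision text.
import difflib

parent_text = "Cats are small mammals.\nThey are often kept as pets.\n"
current_text = "Cats are small domesticated mammals.\nThey are often kept as pets.\n"

diff = "\n".join(
    difflib.unified_diff(
        parent_text.splitlines(),
        current_text.splitlines(),
        fromfile="parent_revision",
        tofile="current_revision",
        lineterm="",
    )
)
print(diff)
```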
Apr 30 2024
I am closing this as done - a summary:
This task requires design/implementation. Given that the current implementation is stable, I am moving this task to the freezer until there is a more urgent need for an incremental dataset.
Update on the use of gitlab issues:
- Research doesn't use them for team internal planning, work is tracked in Phabricator.
- Some researchers use gitlab issues for managing tasks with external collaborators (e.g. outreachy internships), as it is more convenient than depending on another tool.
Apr 16 2024
Closing this. Deploying on CloudVPS is supported, blubber integration to be done when a kubernetes deploy is needed.
Done - code
Removing due date and moving to backlog to prioritize.
Apr 3 2024
@Pablo thanks for flagging - there was indeed an issue with the wikidiff table: it is an external hive table, the required data was on hdfs and triggered the risk observatory dag, but the hive table itself was not being correctly updated, so no data was ingested. This is now fixed, and the dashboard shows data up to Feb 24.
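For reference, a typical way to make the metastore pick up data that already landed on HDFS for an external table; not necessarily the exact fix applied here, and the table/partition names are illustrative.
```
# Refresh an external Hive table's partition metadata so queries see the data
# already present on HDFS. Table/partition names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Scan the table location and register any partition directories that the
# metastore does not know about yet.
spark.sql("MSCK REPAIR TABLE research.wikidiff")

# Or register a single partition explicitly:
spark.sql("ALTER TABLE research.wikidiff ADD IF NOT EXISTS PARTITION (snapshot='2024-02')")
```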
Mar 21 2024
Pasting this reply from a slack thread for context
Mar 5 2024
Mar 4 2024
Weekly updates
- Interesting development with the ml team: there is a conversation with a European HPC infra provider about getting compute resources, and research projects are good candidates. Naturally this is relevant to this cloud GPU initiative, and research is very interested.
Feb 29 2024
These directories can be removed both on the stat clients and hdfs. Thanks!
Feb 27 2024
This is fixed (MR) and the data is available.
Feb 12 2024
Weekly updates
- Trained the simplification model using a 3 billion parameter base model (flan-t5-xl) on a single H100 (80GB). Results look promising.
- Training for 2 epochs (~10h), running inference on test datasets (~6h), and downloading model weights: total cost ~$50
- The fine-tuned model weights are on stat1008. Validated that inference on the currently available GPU in the WMF infra works (it is slow); a loading sketch is below.
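As mentioned in the last bullet, loading the fine-tuned weights and generating looks roughly like this; the local path and prompt are placeholders, and the GPU is used if one is available on the host.
```
# Sketch of loading fine-tuned seq2seq weights and running inference with
# transformers; the path is a placeholder for the weights on the stat host.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "/path/to/flan-t5-xl-simplification"  # placeholder
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device)

inputs = tokenizer(
    "Simplify: The feline was exceedingly loquacious.", return_tensors="pt"
).to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```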
Feb 6 2024
Feb 5 2024
The cultural geographical gap data is now in production, aggregated at the level of the WMF regions. The gap name is geography_cultural_wmf_region (e.g. see here); the documentation is also updated, as well as the example intersections notebook.
For completeness: the datasets are also documented for the hive tables (which are equivalent to the published datasets) that are only available internally; see datahub (SSO login required)
This is done: Datasets.
Weekly updates
- initial experiments with lambda labs, using text simplification as a use case (T354653)
- tested with A100 (40GB) and H100 (80GB) to validate the approach and get an estimate of the cost for fine-tuning runs.
- for a model size that can be trained on WMF infra (T5 large, 700M params), 1 epoch takes ~24h on WMF infra. On lambda labs, 1 epoch costs ~$6 (the time depends on hardware: ~4h on an A100, ~2h on an H100).
- next up: use a model (3B param model) that can't currently be fine-tuned using WMF infra, but can be served using WMF infra.
Jan 29 2024
Weekly updates
Jan 25 2024
Also for reference, at some point I created a template superset dashboard which mirrors the content_gap_metric hive tables - here: https://superset.wikimedia.org/superset/dashboard/472 - that is just a draft with example charts.