
Milimetric (Dan Andreescu)
Staff Engineer (Data Engineering)

User Details

User Since
Oct 8 2014, 5:48 PM (517 w, 2 d)
Availability
Available
IRC Nick
Milimetric
LDAP User
Milimetric
MediaWiki User
Milimetric (WMF) [ Global Accounts ]

Recent Activity

Yesterday

Milimetric added a comment to T372677: Figure a performant way to read all data from revision table via Spark.

Well, as usual the actual logic is pretty easy but I'm fighting with java/scala to build this jar with the correct version of scala. For some reason it's picking up 2.13. But in theory I kind of have this working, the patch is super simple and would be very easy to build and maintain as we upgrade Spark. It only requires a few lines of code to be changed. So except for java being insane, this kind of thing could work well:

Fri, Sep 6, 8:58 PM · Dumps 2.0 (Kanban Board)

Fri, Aug 23

Milimetric added a comment to T371099: No longer use removed cuc_actiontext column in analytics/refinery.

I have deployed this so it should be picked up by the job that starts on Sep. 1st. I have not dropped the column out of the table because old data is deleted quickly anyway and it doesn't hurt the insert.

Fri, Aug 23, 4:37 PM · Data Products (Data products Sprint 18), Data-Engineering
Milimetric moved T371099: No longer use removed cuc_actiontext column in analytics/refinery from To Deploy to Done on the Data Products (Data products Sprint 18) board.
Fri, Aug 23, 4:34 PM · Data Products (Data products Sprint 18), Data-Engineering
Milimetric edited projects for T371099: No longer use removed cuc_actiontext column in analytics/refinery, added: Data Products (Data products Sprint 18); removed Data Products (Data Products Sprint 17).

this got lost in the sprint move and almost broke a bunch of stuff, but I had a nightmare about it so it's all good

Fri, Aug 23, 1:06 PM · Data Products (Data products Sprint 18), Data-Engineering
Milimetric moved T371099: No longer use removed cuc_actiontext column in analytics/refinery from Sprint Backlog to To Deploy on the Data Products (Data products Sprint 18) board.
Fri, Aug 23, 1:04 PM · Data Products (Data products Sprint 18), Data-Engineering

Mon, Aug 19

Milimetric added a comment to T372677: Figure a performant way to read all data from revision table via Spark.

Seems easy enough to extend, here's the PR that added support for timestamp, but I doubt upstream would ever want to merge. So this would just be our personal little hack.

Mon, Aug 19, 9:53 PM · Dumps 2.0 (Kanban Board)
Milimetric claimed T372677: Figure a performant way to read all data from revision table via Spark.

temporarily grabbing this to look into the "modify the JDBC data source to accept a function" option.

Mon, Aug 19, 8:17 PM · Dumps 2.0 (Kanban Board)
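As a rough illustration of the "modify the JDBC data source to accept a function" option: Spark's JDBC reader splits a read into range predicates over a partition column, and the idea would be to apply an SQL expression to that column before comparing. The sketch below only builds such predicates in plain Python. The function name, the `wrap` parameter, and the CAST expression are hypothetical and are not taken from the actual patch.

```python
def partition_predicates(column, lower, upper, num_partitions, wrap=None):
    """Build WHERE-clause predicates that split [lower, upper) into ranges.

    `wrap` is an optional callable that turns the column name into an SQL
    expression (hypothetical hook standing in for "accept a function").
    """
    if num_partitions < 2 or upper <= lower:
        raise ValueError("need at least 2 partitions and upper > lower")

    target = wrap(column) if wrap else column
    stride = max((upper - lower) // num_partitions, 1)
    # Interior boundaries between consecutive partitions.
    bounds = [lower + i * stride for i in range(1, num_partitions)]

    # First partition also collects NULLs so no row is dropped.
    predicates = [f"{target} < {bounds[0]} OR {target} IS NULL"]
    for lo, hi in zip(bounds, bounds[1:]):
        predicates.append(f"{target} >= {lo} AND {target} < {hi}")
    # Last partition is open-ended so values above `upper` are still read.
    predicates.append(f"{target} >= {bounds[-1]}")
    return predicates


# Plain numeric column.
plain = partition_predicates("rev_id", 0, 100, 4)

# Same split, but comparing on an expression over the column (assumed cast).
wrapped = partition_predicates(
    "rev_timestamp", 0, 100, 4, wrap=lambda c: f"CAST({c} AS UNSIGNED)"
)
```

The first and last predicates are left open-ended so that rows outside the stated bounds still land in some partition; the bounds only control how the work is divided.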
Milimetric added a comment to T369868: Improve handling of delete, restore, and merge from incremental update.

I'm back on dumps after some time, things I'm going to look at, in this order:

Mon, Aug 19, 3:37 PM · Dumps 2.0 (Kanban Board)
Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

So to confirm, this means:

  • over 1% of page views (after deducting known bots and spiders) are coming from clients with user agents that are entirely unknown to ua-parser. That is, the "Other" is already there in the raw wmf.webrequest_text dataset, and we've not created or normalized anything else to "Other".
  • 0.26% is "Redacted" where we replace/normalize/summarise for privacy reasons browser/OS names in our pipeline.
Mon, Aug 19, 2:21 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Fri, Aug 16

Milimetric edited P67350 Queries looking into unique device numbers in Hong Kong.
Fri, Aug 16, 6:41 PM
Milimetric created P67350 Queries looking into unique device numbers in Hong Kong.
Fri, Aug 16, 6:39 PM

Thu, Aug 15

Milimetric moved T342267: Investigate surprising "10% Other" portion of Analytics Browsers report from To Deploy to Done on the Data Products (Data Products Sprint 17) board.

Hm.. it seems the "Other" bucket has grown slightly larger than our prediction of 0.26% at T342267#9998984

Thu, Aug 15, 2:37 AM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Tue, Aug 13

Milimetric updated the task description for T372412: Archive unused limn-*-data Gerrit repositories and delete their Github mirrors.
Tue, Aug 13, 2:41 PM · Projects-Cleanup, Wikimedia-GitHub, Release-Engineering-Team
Milimetric created T372412: Archive unused limn-*-data Gerrit repositories and delete their Github mirrors.
Tue, Aug 13, 2:38 PM · Projects-Cleanup, Wikimedia-GitHub, Release-Engineering-Team
Milimetric reassigned T356743: AQS deployment guide from SGupta-WMF to apaskulin.
Tue, Aug 13, 11:14 AM · Data Products (Data products Sprint 18), AQS2.0
Milimetric reassigned T368035: Data gateway integration in CIM APIs from SGupta-WMF to Sfaci.
Tue, Aug 13, 11:12 AM · Data Products (Data Products Sprint 19), AQS2.0

Mon, Aug 12

Milimetric moved T372364: Bug: pivot does not handle varied casing from Sprint Backlog to Code Review / Tech Input on the Data Products (Data Products Sprint 17) board.
Mon, Aug 12, 11:11 PM · Data Products (Data Products Sprint 17), Data-Engineering
Milimetric created T372364: Bug: pivot does not handle varied casing.
Mon, Aug 12, 11:07 PM · Data Products (Data Products Sprint 17), Data-Engineering
Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

the new graphs are up. The pivot transformation failed for all the browser family reports, so I'm still fixing that. But, for example, we can now compare these two:

Mon, Aug 12, 7:14 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Fri, Aug 9

Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

The backfill job should be done sometime this weekend, and I'll rerun the weekly job then.

Fri, Aug 9, 1:23 AM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Aug 7 2024

Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

UPDATE: the below didn't work, I just ended up deleting the DAG and setting its start date to 2015-06-01

Aug 7 2024, 9:27 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Status update on this: the new job is running, I'm going to keep it here until we vet the data. But new data should start showing up right away, and we can compare dashboards side by side and day by day:

Aug 7 2024, 9:09 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric created T372014: Problem deploying - missing airflow_client dependency.
Aug 7 2024, 8:47 PM · Dumps 2.0 (Kanban Board), Data-Engineering (Q1 2024 July 1st - September 30th)

Aug 6 2024

Milimetric added a comment to T369847: Setup basic send and receive wiring between a MW instance and a Statsig cloud instance.

I think I need to speak with Sam and understand this N / M enrollments problem some more because I'm not getting it right now. With that caveat, thoughts:

Aug 6 2024, 2:03 PM · Data Products (Data products Sprint 18), Metrics Platform

Jul 31 2024

Milimetric added a project to T217792: Add wikitech (labswiki) to the sqoop list: Data Products (Data Products Sprint 17).

@VirginiaPoundstone adding this to our sprint because it's basically a no-op and an easy resolution to an old task.

Jul 31 2024, 6:31 PM · Data Products (Data Products Sprint 19), Data-Engineering

Jul 30 2024

Milimetric moved T371031: Spike: Deep Dive on Growthbook data pipeline from In Process to Code Review / Tech Input on the Data Products (Data Products Sprint 17) board.
Jul 30 2024, 8:00 PM · Data Products (Data Products Sprint 17)
Milimetric updated the task description for T371031: Spike: Deep Dive on Growthbook data pipeline.
Jul 30 2024, 8:00 PM · Data Products (Data Products Sprint 17)
Milimetric updated the task description for T337562: Decide how to split wmf database into functional areas.
Jul 30 2024, 7:42 PM · Data Pipelines (Sprint 14)
Milimetric added a comment to T217792: Add wikitech (labswiki) to the sqoop list.

Oh! Thanks for the reminder, this is now available but not included in the sqoop lists. I'll make a patch, easy enough. labswiki seems to be available in both the analytics replicas and cloud replicas (as labswiki_p in the latter)

Jul 30 2024, 2:08 PM · Data Products (Data Products Sprint 19), Data-Engineering

Jul 29 2024

Izno awarded T342267: Investigate surprising "10% Other" portion of Analytics Browsers report a Like token.
Jul 29 2024, 10:06 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric added a comment to T371031: Spike: Deep Dive on Growthbook data pipeline.

Some initial thoughts: https://docs.google.com/document/d/1NorKzBiQyz2nXCUUkGqdFP8SfPD0bzBMdfPO183dqNY/edit

Jul 29 2024, 9:21 PM · Data Products (Data Products Sprint 17)
Milimetric added a project to T371099: No longer use removed cuc_actiontext column in analytics/refinery: Data Products (Data Products Sprint 17).

Adding this to our Sprint board as it needs to get a look and deployment. But the change looks good, just have to alter the table and make sure the timing of all the jobs works out.

Jul 29 2024, 8:18 PM · Data Products (Data products Sprint 18), Data-Engineering
Milimetric merged T371319: Update sqoop code to remove cuc_actiontext from query and table into T371099: No longer use removed cuc_actiontext column in analytics/refinery.
Jul 29 2024, 8:17 PM · Data Products (Data products Sprint 18), Data-Engineering
Milimetric merged task T371319: Update sqoop code to remove cuc_actiontext from query and table into T371099: No longer use removed cuc_actiontext column in analytics/refinery.
Jul 29 2024, 8:16 PM · CheckUser, DBA, Data-Engineering, Data Products, Schema-change-in-production
Milimetric placed T371319: Update sqoop code to remove cuc_actiontext from query and table up for grabs.
Jul 29 2024, 8:05 PM · CheckUser, DBA, Data-Engineering, Data Products, Schema-change-in-production
Milimetric created T371319: Update sqoop code to remove cuc_actiontext from query and table.
Jul 29 2024, 8:01 PM · CheckUser, DBA, Data-Engineering, Data Products, Schema-change-in-production
Milimetric moved T368253: MetricsPlatform: Add performance instrumentation from Code Review / Tech Input to In Process on the Data Products (Data Products Sprint 17) board.
Jul 29 2024, 4:10 PM · MW-1.43-notes (1.43.0-wmf.20; 2024-08-27), Data Products (Data products Sprint 18), Metrics Platform
Milimetric moved T342267: Investigate surprising "10% Other" portion of Analytics Browsers report from Sprint Backlog to To Deploy on the Data Products (Data Products Sprint 17) board.
Jul 29 2024, 4:07 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric edited projects for T342267: Investigate surprising "10% Other" portion of Analytics Browsers report, added: Data Products (Data Products Sprint 17); removed Data Products (Data Products Sprint 16).
Jul 29 2024, 3:24 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric moved T342267: Investigate surprising "10% Other" portion of Analytics Browsers report from Sign Off to To Deploy on the Data Products (Data Products Sprint 16) board.

great, moving this to get deployed. Steps will be:

Jul 29 2024, 3:22 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

Just for the record, we met and discussed @Joe's proposal (this task's description) and were in general agreement that it's the best way forward. We have follow-up discussions to have and coordination to do, but we're aligned on the idea.

Jul 29 2024, 3:04 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

Jul 27 2024

Milimetric added a comment to T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.

K, as a final update here, the pipeline is:

Jul 27 2024, 12:37 PM · Wikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)

Jul 26 2024

Milimetric added a comment to T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.

manually running this in a screen on an-launcher1002:

Jul 26 2024, 6:30 PM · Wikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)
Milimetric added a comment to T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.

Ok, this ended up being very involved. I believe the root of all the confusion is that all the dumps jobs assume the PREVIOUS dump finished and work only on the CURRENT dump. So we ran around dumpsdata and snapshot hosts, hardcoding 20240701 where it was looking for "latest" and we're not sure whether we broke anything. At the end of the day, we basically figured that the snapshot1010 version of dumps files seemed all good, and we just rsynced them over to dumpsdata and clouddumps hosts. The rsync service that runs ALSO assumes this "latest" thing, but not for all files, just for the status files. So as far as we could tell everything was already rsynced except the status and html files. The monitor/html generation service ALSO assumes this "latest" thing so we weren't able to run that to generate the html, even after trying to hack it, but the html files were already on the snapshot hosts so we just moved those over with rsync too. The base rsync excludes json and html, so we just hacked it to include them instead.

Jul 26 2024, 6:21 PM · Wikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)
Milimetric added a comment to T369868: Improve handling of delete, restore, and merge from incremental update.

Quick spike to size up how many revisions we're dealing with on a daily basis:

Jul 26 2024, 1:52 PM · Dumps 2.0 (Kanban Board)

Jul 25 2024

Milimetric claimed T371031: Spike: Deep Dive on Growthbook data pipeline.
Jul 25 2024, 2:46 PM · Data Products (Data Products Sprint 17)
Milimetric created T371031: Spike: Deep Dive on Growthbook data pipeline.
Jul 25 2024, 2:45 PM · Data Products (Data Products Sprint 17)
Milimetric added a comment to T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.

oof, I just realized this is for the month BEFORE. I see that's still in-process:

Jul 25 2024, 2:06 PM · Wikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)

Jul 24 2024

Milimetric created T370948: Spike: MediaWiki db schema and reconciliation.
Jul 24 2024, 6:53 PM · Dumps 2.0 (Kanban Board)

Jul 23 2024

Milimetric added a comment to T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.

Ok, dug into this a bit more. Looks like the job set up to import the dumps XML is running fine but the status file says wikidatawiki is still in progress. Specifically it says this:

Jul 23 2024, 9:27 PM · Wikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)
Milimetric added a comment to T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.

The airflow sensor timed out. But I never saw an alert for it (maybe it was before this week). I cleared it and will report back here in a bit after it has a chance to think about running again.

Jul 23 2024, 7:18 PM · Wikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)
Milimetric moved T362783: Add instrumentation for actor signatures from Ready to Deploy to Done on the Data-Engineering (Q1 2024 July 1st - September 30th) board.

Deployed, started job, waiting to see if it works.

Jul 23 2024, 7:06 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review
Milimetric moved T362785: Add host level instrumentation on webrequest from Ready to Deploy to Done on the Data-Engineering (Q1 2024 July 1st - September 30th) board.

I deployed this and started the job, checking in now to make sure it runs.

Jul 23 2024, 7:05 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review
Milimetric changed the visibility for F56356662: running_reconciliation_queries.py.
Jul 23 2024, 5:42 PM

Jul 22 2024

Milimetric added a comment to T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions.
  • wmf.mediawiki_history: duplicate revision/create records indeed exist, some have 4 copies and some 2 copies but all spot-checked duplicates come in even numbers
  • wmf_raw.mediawiki_revision: does not show the same duplication
  • analytics mysql replicas: the pages those revisions belong to were moved and had some delete/restore and delete/revision actions in the logging table
  • cloud replicas: agrees with analytics replicas
Jul 22 2024, 9:59 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform
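The spot check described in the bullets above (duplicates exist, and they seem to come in even numbers) can be sketched in a few lines. This is only an illustration over an in-memory list of revision ids; the function names and the sample ids are made up, and the real check would presumably run as a query against wmf.mediawiki_history.

```python
from collections import Counter


def duplicate_counts(revision_ids):
    """Return {revision_id: copies} for ids that appear more than once."""
    counts = Counter(revision_ids)
    return {rev: n for rev, n in counts.items() if n > 1}


def all_even(dupes):
    """True if every duplicated id appears an even number of times."""
    return all(n % 2 == 0 for n in dupes.values())


# Made-up sample: id 101 has 2 copies, id 303 has 4, id 202 is unique.
sample_ids = [101, 101, 202, 303, 303, 303, 303]
dupes = duplicate_counts(sample_ids)
evenly_duplicated = all_even(dupes)
```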
Milimetric added a comment to T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate.

+1 to decom

Jul 22 2024, 4:05 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review, MW-1.43-notes (1.43.0-wmf.8; 2024-06-04), MediaWiki-Platform-Team (Radar), Event-Platform, MediaWiki-General
Milimetric added a comment to T370394: Drop gb_by from globalblocks table.

We do not currently use globalblocks anywhere I know of or searched.

Jul 22 2024, 3:53 PM · Data-Engineering, Schema-change-in-production, DBA
Milimetric added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

Sorry to ask this very basic question, but I found a bunch of others didn't know: how exactly is Dumps blocking the php 8 upgrade? Like, if we leave everything exactly as-is and just upgrade PHP, would it not run the way it is currently set up? On the surface I see no big difference between the current setup and a containerized MW running on the same servers, so I'm curious about the nuance I'm missing here.

Jul 22 2024, 12:50 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Milimetric added a comment to T365074: Requesting access to cassandra-staging-devs for milimetric.

Sorry, I just signed it, I'm sure I signed it or some form of it at some point before, I've been an employee for like 12 years almost :P

Jul 22 2024, 12:42 PM · SRE, SRE-Access-Requests

Jul 19 2024

Milimetric created T370551: Bug: Cassandra Unique Devices not loading Wikifunctions mobile data.
Jul 19 2024, 7:28 PM · Wikifunctions, Abstract Wikipedia team, Data-Platform
Milimetric moved T342267: Investigate surprising "10% Other" portion of Analytics Browsers report from Code Review / Tech Input to To Deploy on the Data Products (Data Products Sprint 16) board.

ok, moving to ready to deploy. I'm going to ping @Krinkle one more time for data review. I executed this as I was testing and the results are available in milimetric.browser_general_test. You can query this like this:

Jul 19 2024, 4:29 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Jul 17 2024

Milimetric added a subtask for T369898: Reduce the number of resource_change and resource_purge events emitted due to template changes: T370298: Spike: estimate how much of resource_purge could be filtered out by joining to webrequest.
Jul 17 2024, 3:15 PM · Essential-Work, MW-1.43-notes (1.43.0-wmf.16; 2024-07-30), serviceops, Performance Issue, MediaWiki-Engineering, MediaWiki-Core-HTTP-Cache, ChangeProp
Milimetric added a parent task for T370298: Spike: estimate how much of resource_purge could be filtered out by joining to webrequest: T369898: Reduce the number of resource_change and resource_purge events emitted due to template changes.
Jul 17 2024, 3:14 PM · MediaWiki-Engineering
Milimetric created T370298: Spike: estimate how much of resource_purge could be filtered out by joining to webrequest.
Jul 17 2024, 3:14 PM · MediaWiki-Engineering

Jul 16 2024

Milimetric added a comment to T370108: Missed pageview data over API.

weird... quick steps as I look into this.

Jul 16 2024, 10:14 PM · Analytics-Data-Problem, Data Products, Pageviews-API, Data-Engineering

Jul 12 2024

Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Ok, sent updated code, it's fast now due to a CACHE statement, but that doesn't change the query plan which is still absolutely nuts, check this out:

Jul 12 2024, 2:46 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric added a comment to T365487: Progress: Build a data visualization tool for the evolution of Wikipedia articles maintained by WikiProjects.

Quick spark-sql query to get link changes where someone tags a new wiki project on the talk page:

Jul 12 2024, 2:32 PM · Research-foundational, Research, Outreachy (Round 28), Outreach-Programs-Projects

Jul 11 2024

Milimetric moved T369868: Improve handling of delete, restore, and merge from incremental update from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.
Jul 11 2024, 8:36 PM · Dumps 2.0 (Kanban Board)
Milimetric created T369868: Improve handling of delete, restore, and merge from incremental update.
Jul 11 2024, 8:36 PM · Dumps 2.0 (Kanban Board)
Milimetric moved T368176: [Dumps 2] Spike: Figure root causes of missing rows when doing reconciliation from Sprint Backlog to Done on the Dumps 2.0 (Kanban Board) board.

OK, so it seems most problems do indeed track back to not applying delete and restore events. It feels like we can mark this task complete. We can find a way to apply delete/restore/merge, and then run these queries again and see what we need to reconcile. The period I looked at above was 10 days of enwiki revisions. If anyone disagrees, do move this task back.

Jul 11 2024, 5:01 PM · Event-Platform, Data-Engineering, Dumps 2.0 (Kanban Board)
Milimetric added a comment to T368176: [Dumps 2] Spike: Figure root causes of missing rows when doing reconciliation.

That one mismatch_page that has no other reason listed is apparently part of a merge, so if we're not following up on delete/restore properly then this makes perfect sense because merges are more complicated still. Here are the two pages involved and the logging table records for them:

Jul 11 2024, 3:39 PM · Event-Platform, Data-Engineering, Dumps 2.0 (Kanban Board)
Milimetric added a comment to T368176: [Dumps 2] Spike: Figure root causes of missing rows when doing reconciliation.

Ok, I think I got this query to make sense... the results:

Jul 11 2024, 3:13 PM · Event-Platform, Data-Engineering, Dumps 2.0 (Kanban Board)

Jul 10 2024

Milimetric added a comment to T368176: [Dumps 2] Spike: Figure root causes of missing rows when doing reconciliation.

I am still trying to find an elegant way to change the queries and show all this, but I just wanted to share results so far:

Jul 10 2024, 8:14 PM · Event-Platform, Data-Engineering, Dumps 2.0 (Kanban Board)
Milimetric updated the task description for T368303: REQUEST: Add Special:AllEvents to allowlist for campaigns-product pageview tracking.
Jul 10 2024, 5:07 PM · Data Products (Data products Sprint 18), Event-Discovery, Data-Platform
Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Apologies for the week delay here, I was out sick, picking it back up soon.

Jul 10 2024, 4:11 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Jul 9 2024

Milimetric updated the task description for T368782: MediaWiki Reconciliation API.
Jul 9 2024, 4:10 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Dumps 2.0 (Kanban Board)

Jul 3 2024

Milimetric set the point value for T342267: Investigate surprising "10% Other" portion of Analytics Browsers report to 13.
Jul 3 2024, 7:51 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric added a comment to T358373: [Dumps 2] Reconcillation mechanism to detect and fetch missing/mismatched revisions.

Quick summary of last meeting. Luke started working on a draft of what we were talking about (see the reconciliation flow on https://miro.com/app/board/uXjVNfaohl0=/).

Jul 3 2024, 7:00 PM · Patch-For-Review, Dumps 2.0 (Kanban Board)

Jul 1 2024

Milimetric added a comment to T368176: [Dumps 2] Spike: Figure root causes of missing rows when doing reconciliation.

My first hunch, that the revisions were coming from only specific pages, is wrong:

Jul 1 2024, 3:23 PM · Event-Platform, Data-Engineering, Dumps 2.0 (Kanban Board)
Milimetric claimed T368176: [Dumps 2] Spike: Figure root causes of missing rows when doing reconciliation.
Jul 1 2024, 1:45 PM · Event-Platform, Data-Engineering, Dumps 2.0 (Kanban Board)

Jun 26 2024

Milimetric added a project to T368405: Special:Homepage is rendered much slower (<1 sec to 2+ sec): Data Products.
Jun 26 2024, 5:43 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Growth-Team (FY2024-25 Q1 Sprint 1), MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Data Products, User-Michael, Data-Platform, Performance Issue, GrowthExperiments-Homepage
Milimetric added a comment to T365952: Special:Statistics disagrees with stats.wikimedia.org on the number of active users..

are we literally saying that we should just change the value of statistics-users-active to Editors? Code here: https://gerrit.wikimedia.org/g/mediawiki/core/+/1aa990f1725bf81caaf44527b9e778b5a8fe7e4d/languages/i18n/en.json#1950

Jun 26 2024, 4:07 PM · MediaWiki-Engineering, MediaWiki-Special-pages
Milimetric added a comment to T367781: Drop deprecated abuse filter fields on wmf wikis.

Thanks for pinging us, we don't use abuse filter tables anywhere I'm aware of, so this shouldn't affect us.

Jun 26 2024, 3:59 PM · Schema-change-in-production, Data-Engineering, DBA
Milimetric added a comment to T367856: Cleanup revision table schema.

Thanks for pinging us on this. The sqoop code should run without modification, so we're good downstream. Thank you!

Jun 26 2024, 3:57 PM · Schema-change-in-production, Data-Engineering, DBA, Data Products
Milimetric added a comment to T364548: [SPIKE] Design API for the standardised page lifecycle instrument mixin.

obligatory reference: https://www.mediawiki.org/wiki/Extension:NavigationTiming (is this roughly related?)

Jun 26 2024, 3:44 PM · Data Products, Patch-For-Review, Metrics Platform

Jun 25 2024

Milimetric claimed T366944: MPIC: Enable API to return sample rates per wiki.
Jun 25 2024, 5:43 PM · Data Products (Data Products Sprint 15), Metrics Platform
Milimetric closed T367526: Cloud VPS "dashiki" project Buster deprecation as Resolved.
Jun 25 2024, 12:59 PM · Cloud-VPS (Debian Buster Deprecation)
Milimetric added a comment to T367526: Cloud VPS "dashiki" project Buster deprecation.

This is now done.

Jun 25 2024, 12:59 PM · Cloud-VPS (Debian Buster Deprecation)
Milimetric updated the task description for T367526: Cloud VPS "dashiki" project Buster deprecation.
Jun 25 2024, 12:59 PM · Cloud-VPS (Debian Buster Deprecation)
Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Great question, @mforns. This was mostly for performance reasons. I couldn't find a way to get Spark to optimally work on the full day of pageviews without first aggregating it like this to > 250. But the execution plan I ended up with looks pretty wild. Let's talk tomorrow when you have some time. I'm attaching the change here.

Jun 25 2024, 12:50 AM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Jun 24 2024

Milimetric moved T368183: MPIC: Build Location + Sample Rates component from Sprint Backlog to In Process on the Data Products (Data Products Sprint 15) board.
Jun 24 2024, 4:15 PM · Metrics Platform, Data Products (Data Products Sprint 15)
Milimetric claimed T367526: Cloud VPS "dashiki" project Buster deprecation.

I've migrated and shut off the old instances. I will delete them in a couple of days, just in case. But everything's working fine without them. Did not know about the wmflabs -> wmcloud automatic redirect, that made everything very simple.

Jun 24 2024, 2:46 PM · Cloud-VPS (Debian Buster Deprecation)
Milimetric added a comment to T366004: Add page-title to the x_analytics header.

I grouped a couple of tasks under this so we're less likely to lose them in the fray.

Jun 24 2024, 1:52 PM · Data-Engineering
Milimetric added subtasks for T366004: Add page-title to the x_analytics header: T304362: Pageview definition relies on X-Analytics to determine special pages, T240676: Develop a consistent rule for which special pages count as pageviews.
Jun 24 2024, 1:49 PM · Data-Engineering
Milimetric added a parent task for T304362: Pageview definition relies on X-Analytics to determine special pages: T366004: Add page-title to the x_analytics header.
Jun 24 2024, 1:49 PM · Analytics-Data-Problem, Patch-Needs-Improvement, Data-Platform-SRE
Milimetric added a parent task for T240676: Develop a consistent rule for which special pages count as pageviews: T366004: Add page-title to the x_analytics header.
Jun 24 2024, 1:49 PM · Movement-Insights, Data-Engineering-Icebox, Campaign-Registration

Jun 21 2024

Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

The simpler way to do this, just two phases as opposed to progressive, gets us fairly similar results, with about 200 fewer rows which are all detailing specific browser versions.

Jun 21 2024, 8:12 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
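For readers unfamiliar with the idea, a two-phase roll-up of browser rows might look roughly like the sketch below. It is an assumption-laden illustration, not the query from the gerrit change: the row shape (family, version, views), the function name, the "Other" labels, and the sample numbers are all invented here, and the cutoff of 250 is borrowed from the "> 250" aggregation mentioned elsewhere on this task.

```python
from collections import defaultdict


def roll_up(rows, threshold):
    """Two-phase roll-up of (family, version, views) rows.

    Phase 1: versions with views <= threshold fold into (family, "Other").
    Phase 2: anything still <= threshold folds into ("Other", "Other").
    """
    totals = defaultdict(int)
    for family, version, views in rows:
        totals[(family, version)] += views

    # Phase 1: keep well-represented versions, fold the rest per family.
    phase1 = defaultdict(int)
    for (family, version), views in totals.items():
        key = (family, version) if views > threshold else (family, "Other")
        phase1[key] += views

    # Phase 2: fold families whose remainder is still too small.
    phase2 = defaultdict(int)
    for (family, version), views in phase1.items():
        key = (family, version) if views > threshold else ("Other", "Other")
        phase2[key] += views
    return dict(phase2)


# Invented sample data.
rows = [
    ("Chrome", "120", 1000),
    ("Chrome", "119", 100),
    ("Chrome", "118", 200),
    ("Obscure", "1", 5),
]
rolled = roll_up(rows, threshold=250)
```

Total views are preserved by construction, since every row ends up in exactly one output bucket; only the level of detail changes.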
Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

We get a ton more detailed results this way, and the total coverage increases to 99.7%. Still not 99.9%, but I think we may have too much detail at some point. I'm fairly happy with these results, and I'm going to prepare the new browser general query as a gerrit change. It'll be good to get some review.

Jun 21 2024, 8:03 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric moved T368113: Design and merge the new tables of file tables from Incoming (new tickets) to To be estimated/discussed on the Data-Engineering board.

This might affect some data we sqoop into HDFS and some of how we compute commons impact metrics or similar future metrics. We have to wait until a schema change is proposed to know for sure.

Jun 21 2024, 6:38 PM · Data-Engineering, Data Products, Schema-change, DBA
Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

From a discussion with @Krinkle about the data, a preliminary idea of how to roll up is:

Jun 21 2024, 3:09 PM · Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki