Meta data
The ugly truth revealed when someone “scrapes” the scrapers.
Some days, one can’t help but feel that the pesky ethernet cable plugged into the back of the router does more harm than good. The sad part is, it never used to. What began as a tool to interconnect academic and military networks grew into a beautifully simple yet powerful place where people felt open and inspired (and, most importantly, safe) to share freely. Before the age of social media, clickbait, fake news, malware, scams, propaganda, and grifters profiting at everyone’s expense, ordinary people were putting themselves out there, without reservation, to the entire world. Now, since the advent of Web 3.0, the rules have changed, and the game has been rigged by the tech giants.
Ironically, my previous post about AI being trained on our writing in order to masquerade as us tells only part of the story. If only it stopped there. Sadly, the truth is far worse. The AI grifters and their tech giant overlords truly believe that your content is theirs. Your words, your photos, your messages and conversations; everything you post online, all theirs. It has been well known for some time that these companies ignore or, worse, actively bypass the blocks people put in place to protect their content from scraping. Again, if that were the extent of it, things would be bad enough. Spoiler alert: it isn’t (emphasis mine):
Rather than scraping from sites directly, many of the addresses on Meta’s leaked list belong to Content Delivery Networks (CDNs) that are used by websites to cache and store information to improve site performance. […] The data are captured using an internal tool called Web Crawler. Regardless of whether it is removed by the site, scraped data continues to live on Meta’s internal servers and databases, employees say.
The above article was linked in a Mastodon thread, where it was discovered that content on that instance and numerous others (those listed thus far, with many more yet to come) was being scraped by Meta to feed more data to its Asshat Interloper. More troubling is that they’re now also targeting CDNs (e.g. Cloudflare) to do it. As rancid icing on this already scorched bundt cake, they’re keeping the data they retrieve, regardless of whether it’s subsequently removed from the target in question, because “fuck you, your data belongs to us, and we don’t give a shit.”
What makes this situation especially alarming is that Cloudflare, which already has massive influence and control over internet traffic and routing, recently released anti-scraping tools on its own platform, even making them available as part of its free offering. The pragmatist in me would simply assume Cloudflare could block Meta’s bots from directly accessing the data; the realist knows this battle is far more complex. Perplexity, for example, has actively (even gloatingly) taken every step to maliciously obfuscate its identity in order to subvert blockers. If Meta, along with other tech/AI giants, can evade mitigation, then freely rifle through public CDN infrastructure to take what they want, how the hell is anyone supposed to combat it? What are we left with if consent means nothing?
That right there is the crux of this whole fiasco: consent.
The tech and AI giants use the legal arguments of “fair use” and “public domain” in defence of their actions. Sorry, that’s a steaming mound of horse shit. It’s akin to someone plagiarizing copyrighted work from a novel, then justifying it because the book was available in a public library. That’s the entire point of copyright: protecting content ownership and attribution, regardless of where it’s published. Why should digital words on a news, media, literary, or personal site be any different, or any less worthy of protection or consent? Obviously, things shared on a social media platform are another beast altogether, and it is no doubt much more difficult to determine what there is subject to copyright protection (if anything can be at all). That said, it doesn’t make what Meta is doing any less of a dick move.
Apple seems to get the lion’s share of flak these days for its actions, but let’s not pretend the other tech giants aren’t cut from the exact same cloth. It’s not just what you see that warrants judgement; it’s what you don’t. Meta was clearly hoping to get away with this surreptitiously, behind the scenes, only someone (internally, it seems) busted them. The problem now at hand is that the internet itself is the target. Irrespective of which company’s hardware or software you use, just wanting to post words online makes them the apparent property of AI grifters, consent be damned.
The second that cable was attached, all bets were off.
As has been proven time and again, things you post on public silos should be treated just like email: consider them insecure and non-private. Such was already the case before AI theft; even more so now. These are simply not the correct tools for private and confidential conversation or information exchange. There was never going to be even an illusion of privacy on any social network, with or without AI. Capitalism, as is its wont, simply changed the rules of the game and skewed them further in favour of the elites.
But where does that leave writing on personal sites with regard to the onslaught of AI theft? I wish there were an easy answer to this question, but sadly that’s just not the case. Pandora’s box has been opened, and all signs point to it being permanent. Short of sweeping legislation against content theft from the top down, this will be a difficult battle to win directly. The only feasible blocking options range from robots.txt, to server-level rules, to placing a CDN proxy in front. All are a moving target at best, and even the last of these is now coming under threat. Beyond making peace with the reality of the situation, this will continue to be a cat and mouse game.
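For the robots.txt end of that range, here is a minimal sketch in Python of how a rule-abiding crawler would read such a file, using the standard library's `urllib.robotparser`. The crawler name `ExampleAIBot`, the rules, and the URL are all made up for illustration (each vendor publishes its own user-agent tokens), and none of this stops a scraper that simply ignores the file or lies about who it is.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "ExampleAIBot" is an assumed, made-up crawler
# name; substitute the real tokens published by the bots you want to block.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honours robots.txt may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/posts/meta-data"  # placeholder URL
    print("ExampleAIBot allowed:", is_allowed("ExampleAIBot", page))
    print("SomeBrowser allowed:", is_allowed("SomeBrowser", page))
```

The file is only a request. It works when the crawler chooses to check it, which is exactly why the server-level and CDN options exist as the next rungs on the ladder.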
Although there isn’t currently an easy answer to the above question, I do know what the right answer is: press on, and keep writing. Web 3.0 is here to stay. If we allow it and this Asshat Interloper to take away our drive, will, and passion to put proverbial pen to paper, we let it destroy the values, charm, and authenticity the earlier web — and still parts of the current one — had to offer. We owe it to ourselves to keep that spirit alive.
Lest we unplug that ethernet cable and concede defeat.