CONTACT

@_juliarosenberg
julia@julia-rosenberg.com


ABOUT ME

currently: ventures lead, uniswap labs

previously: 
+ co-founder & ceo, metropolis 
(fka orca protocol) 
+ co-founder, overlly
+ m&a, acreage holdings
+ student, NYU


WRITING


09-30-24 data trespassing
08-15-24 building in vegas

Data Trespassing: AI's Growing Appetite for Data
09-30-24
As AI scales, hardware demands—GPUs, data warehouses, cooling systems—are expanding to match the computing required for model training and inference. But hardware is only part of the equation; another looming bottleneck is data access. Unlike traditional data extraction (for ads or personalization), AI models depend on vast, diverse datasets. AGIs aren’t built on a handful of key data points; they require breadth and variety of information to truly evolve. Each piece of data may be insignificant alone, but collectively these pieces form the crucial landscape powering these systems.

What we’re seeing now is rampant “data trespassing”—a term that seems increasingly apt. The public web is a free-for-all, allowing AI models to siphon off public data (in ethically and legally murky ways). Cases like hiQ Labs v. LinkedIn carved out a grey area where scrapers don’t technically violate laws like the CFAA. But how long will this loophole last? Today’s AI models depend on free, publicly available data, but this reliance is unlikely to endure as regulatory and economic pressures mount. 

As AI becomes more specialized—more domain-specific, more "intelligent"—the demand for both general knowledge and curated, private datasets will surge. We’re already familiar with information companies turning their data pipelines into a lucrative commodity (from Pinterest’s 2015 Promoted Pins to X’s 2023 API access shift). Traditional data sellers are wrestling with pricing strategies and usage boundaries as AI companies strike deals with media firms to secure vast swaths of private data for training. This is just the start of increasing data privatization and monetization.

[image] LinkedIn: pre-selected settings allow personal data to be used for AI model training


So, what about our personal data? The debate over user privacy is, at its core, a conversation about data monetization. In isolation, our data may seem worthless, but in a diverse collective, it’s invaluable. Companies like Vana are enabling data portability and consumer data monetization, allowing users to self-upload their data or contribute to data collectives. Will we ever reap the benefits of our digital trails? Maybe. There’s plenty of room to further privatize (and thus monetize) internet data, but it’s complicated—data scraping disruptors, complex pricing models, digital ownership precedents, etc.

And what about the free, open internet? The web, once a digital commons, is colliding with the growing monetization of data. This isn’t about ad-targeting anymore. We’re entering a realm of AI models that anticipate and shape human behavior on a level far beyond today’s algorithms. The new competitive edge is the quality of data fed into the machine–better data in = better data out. As AI’s appetite for data grows, so too will the conflict over ownership, profit, and access. The challenge isn’t just technical—it's legal, monetary, and ethical. The future of AI will shape and be shaped by what we feed it, and the question is no longer just how much data, but more so what data, and who provides it.