The Deepdive

Perplexity AI And The Hidden Data Pipeline

Allan & Ida Season 3 Episode 50


You type a sensitive question into an AI search box and feel the same relief as whispering into a private confessional. Now imagine learning that the “confessional” may be wired to the biggest ad networks on earth. That’s the unsettling thread we pull today as we unpack a series of major legal filings aimed at Perplexity AI, including privacy class actions, a copyright mega-suit that reaches across the generative AI industry, and Amazon’s federal injunction over autonomous browsing. 

We walk through the core privacy allegations in plain language: tracking pixels, third-party analytics scripts, and forensic-style request logs that purportedly show chat text and AI responses leaving a user’s device. We also dig into the psychology of “incognito mode” and why a privacy toggle can feel protective while the underlying data architecture still routes information outward. Along the way, we ask what it means if intimate queries about money, health, relationships, or legal fears become raw material for targeted advertising profiles. 

Then we shift to agentic AI with Perplexity’s Comet, where the stakes move from speech to action. Amazon’s injunction forces a sharp question: even if you give an AI agent your credentials and consent, can a platform still ban that agent and treat continued access as unauthorized under the Computer Fraud and Abuse Act? Finally, we connect the dots to the copyright wars, shadow libraries, BitTorrent downloads, stealth crawlers, and retrieval augmented generation, all pointing to a single pattern: boundary-breaking data acquisition as the default fuel for AI capabilities. 

If this raised your eyebrows, subscribe for more deep dives, share this with a friend who uses AI for sensitive questions, and leave a review. What's your line: what should never be collected or automated by a chatbot?


Allan

So uh I want you to imagine something for a second. Just picture you're walking into this like totally soundproof confessional booth.

Ida

Okay, setting the scene. I like it.

Allan

Right. It's totally dark, completely private, and you are there to ask your absolute deepest, most sensitive, maybe you know, slightly embarrassing questions.

Ida

We all have them.

Allan

Exactly. So you sit down, you whisper your secrets into the dark, and you get this immense sense of relief.

Ida

Sounds nice, honestly.

Allan

It does. But then uh you walk outside, you look up, and you realize that booth had a hidden microphone wired directly to a giant glowing billboard in the middle of Times Square.

Ida

Oh no!

Allan

Yeah, broadcasting literally every word you just said, right next to a targeted ad for like whatever you just confessed to.

Ida

It is a genuinely terrifying visual.

Allan

Yeah.

Ida

And yet, based on the massive stack of legal filings we are unpacking today, it is uncomfortably close to how the artificial intelligence industry is currently operating behind closed doors.

Allan

Welcome to today's deep dive. So our mission today is to look at this whole series of recent, honestly bombshell lawsuits aimed squarely at major AI companies.

Ida

But with a very specific, intense focus on Perplexity AI.

Allan

Yes, exactly. We are looking at a 135-page privacy class action lawsuit, a massive sweeping copyright infringement case, and uh a federal injunction from Amazon.

Ida

Which is a heavy lineup, but our goal here isn't just to, you know, list off court cases like a textbook.

Allan

Right. Nobody wants that. We want to look under the hood at the mechanics of these systems and understand how the AI sausage is actually made.

Ida

And the contrast here is what makes this so fascinating to me, because AI search engines, and Perplexity in particular, have so aggressively pitched themselves to you, the user, as the sleek, clean alternative.

Allan

Yeah, the privacy conscious anti-Google, basically.

Ida

Exactly. They sell this pristine interface. They sell the idea of a direct answer without the tracking. But uh the stakes we're talking about today aren't just about a technical glitch or a vaguely worded privacy policy.

Allan

No, it's way bigger than that.

Ida

Right. We are looking at a fundamental shift in the digital economy where the raw material powering these multi-billion dollar valuations appears to be composed of our leaked anxieties.

Allan

And massive automated heists of intellectual property. Which, you know, if you have ever sat alone in your room late at night and asked an AI a sensitive medical question.

Ida

Or asked it to explain a complicated financial decision you were too embarrassed to ask a human about.

Allan

Yes. Then this deep dive directly impacts you. So we had to start with the privacy mirage, because this is the stuff that hits the user first.

Ida

Let's look at the class action lawsuits. Specifically, there's Doe v. Perplexity, which was filed recently by a man in Utah, and a preceding case, Meyer v. Perplexity, out of California.

Allan

And the core of these lawsuits revolves around the alleged use of deeply embedded tracking software.

Ida

Right, which for anyone who works in digital marketing, the technology at the center of this really isn't new.

Allan

No, it's super common.

Ida

Exactly. The plaintiffs allege that the second you log on to Perplexity's interface, the platform is quietly executing tracking code, specifically things like the Facebook Pixel and Google Analytics directly on your device.

Allan

So uh for those who might not spend their days deep in ad tech, a pixel is essentially an invisible one-by-one image embedded in the website's code.

Ida

Yep. Just hiding there.

Allan

And when your browser loads that invisible image, it triggers a server call to a third party. In this case, Meta or Google.

Ida

Which is standard practice if you're a retail site trying to track if somebody bought a pair of shoes.
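To make the mechanism concrete, here's a minimal Python sketch of the kind of request a tracking pixel fires when the browser "loads" that invisible one-by-one image. The endpoint and parameter names mirror the publicly documented Meta Pixel shape, but this is an illustration of the general pattern, not Perplexity's actual code.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_pixel_url(pixel_id: str, event: str, page_url: str) -> str:
    """Build the GET request a tracking pixel triggers.

    The browser fetches a 1x1 image from the tracker's server; the
    interesting payload rides along in the query string.
    """
    params = {
        "id": pixel_id,   # which advertiser account this event belongs to
        "ev": event,      # event name, e.g. PageView
        "dl": page_url,   # the page being viewed, often including query text
        "noscript": "1",
    }
    return "https://www.facebook.com/tr?" + urlencode(params)

url = build_pixel_url(
    "1234567890", "PageView",
    "https://example.com/search?q=my+sensitive+question")
```

The point is that the "image" load is just an HTTP request, so anything encoded into the URL, including the text of what you searched, travels to the third party's server.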

Allan

Sure. But finding one inside an AI search engine that literally markets itself as an anonymous alternative to Google. That's wild.

Ida

It really is.

Allan

And the plaintiffs actually included these specific HTTP request logs in the lawsuit. They showed data strings like uh _fbp=fb.1, followed by a timestamp and an ID.

Ida

They documented the exact digital fingerprints of the data leaving the building, and the volume and specificity of the data in those request logs is where this goes from, you know, a standard privacy concern to a massive breach of trust.

Allan

Because it's not just basic metadata, right?

Ida

No, not at all. It's not just what browser you are using or your IP address. The trackers are allegedly capturing the full text of your chat.

Allan

Wait, the full text.

Ida

Verbatim. They are transmitting your exact search queries and the AI's direct responses straight to Meta and Google servers. And they're tying it all together with personal identifiers like your email address and your Facebook ID.

Allan

Wow. Which means your most sensitive, completely unfiltered thoughts are just flying out the back door.

Ida

Exactly.

Allan

Like the Utah man in the lawsuit, John Doe, he was asking Perplexity about his tax obligations, details about his family finances.

Ida

I think there were questions about Roth IRA conversions and a potential cannabis investment in there, too.

Allan

Right. And the filings mention other users asking about deeply personal medical symptoms, relationship advice, private legal questions.

Ida

Stuff you wouldn't want your closest friends to know, let alone the world's largest advertising brokers.

Allan

Yeah, it completely shatters the illusion of the AI as this impartial, isolated oracle. You're treating the machine like a private diary, but the data pipeline is treating you like a commodity.

Ida

And Meta and Google can harvest that verbatim text, pair it with your real identity using that Facebook ID, and use it to build hyper-targeted, incredibly intimate advertising profiles.

Allan

Okay, but here's the thing perplexity is trying to compete with Google. Like their entire brand identity, their pitch to investors is literally we are the Google killer.

Ida

Right.

Allan

So why on earth would they be secretly hard-coding Google Analytics into their front end and feeding Meta, their own users, highly sensitive data? That seems like handing your entire playbook to the opposing team.

Ida

Well, it comes down to the brutal reality of scaling a tech platform. This data isn't just digital oil, right? It's a topographical map of human vulnerability and user behavior.

Allan

Ah, I see where you're going with this.

Ida

Yeah. Perplexity, as a rapidly growing startup, needs enterprise grade analytics to understand how people are interacting with its site, where they drop off, and how to acquire new users cheaply.

Allan

And Meta and Google provide the most powerful radar equipment in the world to do that.

Ida

Exactly. And they provide it essentially for free so long as they get a copy of the map in return.

Allan

Wow. So Perplexity gets the growth metrics it desperately needs to show investors, and Meta and Google get your secrets to fuel their ad networks.

Incognito Mode That Still Leaks

Ida

The cognitive dissonance is staggering. You accept Perplexity's terms of service thinking you are in a walled garden, but these third-party trackers are just quietly slurping up the garden in the background.

Allan

Wait, it gets better. And by better, I mean significantly worse. Let's talk about incognito mode.

Ida

Oh boy, yes.

Allan

Because Perplexity heavily markets this feature where you can toggle a switch to create anonymous threads that don't save to your history. It implicitly promises the user a safe space for those really sensitive questions.

Ida

But according to the technical forensic analysis presented in the lawsuit, the data transmission to Meta and Google happens even when users activate that incognito mode.

Allan

You've got to be kidding me.

Ida

I wish I was. Toggling the switch on the user interface might stop the chat from showing up in your personal history log on the screen, but it allegedly doesn't stop the underlying code from executing those server calls to the third-party trackers.

Allan

It's like putting on a fake mustache to go to a bar, while simultaneously handing the bartender your real driver's license, your social security card, and your personal diary.

Ida

That is the perfect analogy. You feel completely hidden because you flipped a switch on the screen, but the data architecture underneath doesn't care about your fake mustache.
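The disconnect the plaintiffs allege can be sketched in a few lines. This is a hypothetical toy model, not Perplexity's code: the point is that an incognito flag which only gates the visible history does nothing to an analytics call that runs unconditionally.

```python
# Hypothetical sketch of the alleged architecture: the incognito flag
# gates only the on-screen history, not the third-party tracker call.

tracker_log = []   # stand-in for data leaving for a third-party server
history = []       # stand-in for the user's visible chat history

def fire_tracker(event: dict) -> None:
    tracker_log.append(event)  # in real life: an HTTP call to an ad network

def handle_query(text: str, incognito: bool) -> None:
    fire_tracker({"query": text})   # runs no matter what
    if not incognito:               # the toggle only controls this branch
        history.append(text)

handle_query("how do I refinance quietly?", incognito=True)
```

After this runs, `history` is empty, so the user feels hidden, but `tracker_log` holds the query anyway. That is the fake-mustache problem in code form.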

Allan

Now, to be fair to the companies involved, Perplexity has stated they haven't been served with this newest March 2026 suit yet and can't verify the technical claims.

Ida

Right. And Meta has pointed to a policy that explicitly forbids advertisers from sending them sensitive personal information.

Allan

But if the forensic logs in this lawsuit actually hold up in court, it exposes a massive contradiction between the marketing of AI privacy and the mechanical reality of how these platforms actually operate.

Amazon Injunction And AI Agents

Ida

Yeah, so Perplexity is allegedly perfectly happy to leave your backdoor wide open to Meta and Google. But when it comes to other platforms' boundaries, they suddenly have a very different operational philosophy.

Allan

Which is a perfect segue because this brings us to the injunction from Amazon, where we move from perplexity leaking your secrets out to perplexity aggressively breaking into places it shouldn't go.

Ida

Right. And this injunction in Amazon v. Perplexity is fascinating because it forces us to deal with the legal reality of agentic AI.

Allan

Yeah, for anyone who hasn't been following the absolute bleeding edge of the industry, we're moving past passive AI, you know, where you ask a question and it just generates a block of text.

Ida

We are entering the era of agents. Perplexity has this AI-powered browser feature called Comet, which is designed to autonomously carry out tasks on your behalf across the web.

Allan

So it doesn't just tell you you gotta buy something, it literally goes and tries to buy it for you.

Ida

Exactly. And Comet was allegedly accessing users' password-protected Amazon accounts to execute tasks.

Allan

Wait, how was it getting in?

Ida

Well, it was disguising itself as a standard web browser, bypassing Amazon security measures, and navigating the internal architecture of accounts.

Allan

And Perplexity's defense in court was essentially built on user consent, right? Yeah. They argued we are only doing this because the user explicitly told us to.

Ida

Yep. Their stance was the user handed us their login credentials and asked us to perform a specific task on their behalf. We have permission.

Allan

And honestly, my first instinct is to completely agree with Perplexity's defense there.

Ida

Really?

Allan

Yeah. Look, if I give my best friend the keys to my apartment to water my plants while I'm out of town, my landlord cannot call the police and have my friend arrested for trespassing.

Ida

I mean, logically that makes sense.

Allan

Right. I gave them explicit permission to be there. Why is a digital agent acting on my behalf treated any differently?

Ida

Well, that is the exact tension U.S. District Judge Maxine M. Chesney had to untangle, and her ruling sets a massive precedent. She drew a sharp legal line between user permission and platform authorization.

Allan

Okay, break that down for me.

Ida

Yes, you gave Perplexity the digital keys to your Amazon account. But Amazon, acting as the landlord of the digital building, sent Perplexity a formal cease and desist letter. They essentially said, We do not care if our tenant invited you in, you, the automated robot, are explicitly banned from the premises.

Allan

So the platform's terms of service absolutely override my personal consent regarding my own account data.

Ida

Under the Computer Fraud and Abuse Act, or the CFAA, yes.

Allan

The CFAA. That's a federal anti-hacking law from like the 1980s, right?

Ida

Exactly. Famously inspired in part by the movie WarGames. And applying a 40-year-old law to an autonomous web agent is tricky, but the judge relied on precedent showing that once a platform explicitly revokes authorization, continued access becomes a federal offense. The platform owns the infrastructure. Amazon even proved a cognizable loss under the CFAA by showing they had to spend over $5,000 just deploying engineers to develop technical countermeasures to detect and block Comet's stealthy, disguised activity on their servers.

Allan

We are entering this completely bizarre legal gray area. These highly advanced AI agents are acting as our personal proxies, but legally they are essentially trespassing on corporate property to do our bidding.

Ida

And the judge completely dismissed Perplexity's argument that stopping this behavior would stifle tech innovation.

Allan

Yeah, she ruled that the public interest in preventing unauthorized access to private computer systems massively outweighs a startup's desire to maintain a first mover advantage in the AI shopping space.

The Copyright Book Heist Pipeline

Ida

Which creates a fascinating contradiction. The AI industry wants frictionless access to everything to make these agents work. But the internet is fundamentally made of private, gated communities. And if we look at how these models got smart enough to act as agents in the first place, that complete disregard for digital boundaries is actually the foundational engineering principle of the entire industry.

Allan

Which brings us to the great AI book heist.

Ida

The heist, yes.

Allan

We have this sweeping copyright lawsuit filed against basically every major AI player by a group of prominent authors. One of the lead plaintiffs is John Carreyrou.

Ida

The investigative journalist who famously wrote Bad Blood detailing the massive fraud at Theranos.

Allan

Yeah. And I love that this exists, but also, why? Like, if you are a tech company orchestrating a massive, legally dubious data sweep, maybe don't steal the life's work of the specific journalist who specializes in taking down billion-dollar tech frauds.

Ida

Strategically, perhaps not the best target to agitate.

Allan

Seriously.

Ida

But this lawsuit meticulously outlines how these models didn't just casually scrape publicly available blogs, they intentionally targeted shadow libraries to acquire massive volumes of copyrighted text.

Allan

We're talking about sites like LibGen, Z-Library, and Bibliotik.

Ida

They ingested data sets with names like Books3, which contains hundreds of thousands of pirated books.

Allan

And the infrastructure of piracy is remarkably resilient, right? Like when the FBI seizes the domains for a site like Z Library, the pirate communities instantly spin up decentralized mirrors.

Ida

Like the Pirate Library Mirror, or PiLiMi.

Allan

Which is such a funny name. PiLiMi sounds like a cute startup, but it's basically a hydra-like repository of stolen intellectual property.

Ida

Exactly. And the mechanism of how tech companies acquired this data is crucial. The lawsuit details how they utilize BitTorrent to download these massive libraries.

Allan

Right. And for anyone who remembers the early 2000s internet, when you torrent a file, you aren't just downloading it from one central server.

Ida

No, you are pulling a tiny encrypted piece of the file from thousands of other computers on the peer-to-peer network.

Allan

But the software is designed so that while you are downloading, which is called leeching, you are simultaneously uploading those pieces to other users, which is seeding.

Ida

Which means they weren't just passively downloading pirated books.

Allan

They were actively distributing them.

Ida

Yes. The mechanical act of acquiring the data meant these tech companies were actively distributing pirated materials back into the network.
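The leech-while-you-seed dynamic is easy to see in a toy model. This is a deliberately simplified sketch with no real networking, just to show why downloading via BitTorrent mechanically implies redistribution: a peer serves the pieces it already holds even while it is still fetching the rest.

```python
# Toy model of the BitTorrent dynamic described above. Heavily
# simplified; real clients track piece hashes, peers, and trackers.

class Peer:
    def __init__(self, name, pieces=()):
        self.name = name
        self.pieces = set(pieces)
        self.uploaded = 0  # count of pieces this peer has sent to others

    def request_from(self, other, piece):
        """Fetch one piece from another peer, if that peer has it."""
        if piece in other.pieces:
            self.pieces.add(piece)
            other.uploaded += 1

seeder = Peer("seeder", pieces=range(4))  # has the whole file
downloader = Peer("downloader")           # the party "just downloading"
latecomer = Peer("latecomer")

downloader.request_from(seeder, 0)      # downloader leeches piece 0...
latecomer.request_from(downloader, 0)   # ...and immediately re-serves it
```

Even though `downloader` only ever asked for the file, it has already uploaded a piece to someone else. That is the mechanical sense in which acquiring the data meant distributing it.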

Allan

That is wild.

Ida

Every step of the large language model training pipeline requires making unauthorized copies. You copy it to download it, you copy it to pre-process the text, you copy it to tokenize it, and you copy it again and again as you run it through the neural network to adjust the weights.

Allan

So the obvious question here is why risk this level of federal copyright exposure? Why not just stick to scraping Wikipedia, Reddit, and public domain text?

Ida

Because long form prose is the absolute gold standard for teaching an artificial intelligence how to think.

Allan

Okay, how so?

Ida

Well, if you want an AI to understand complex logical reasoning, how to structure a persuasive argument, how narrative flows across hundreds of pages, or how syntax and rhythm operate at a high level, you need books.

Allan

You need the structured thought of professional authors.

Ida

Exactly. The lawsuit actually quotes an Anthropic co-founder, essentially admitting that to create a model with truly advanced generative capabilities, you cannot rely on internet chatter. You need the entire text of diverse, professionally edited books.

Allan

And instead of paying the creators for that immense value, the industry just took it.

Ida

They just took it.

Allan

And the lengths they went to take it are incredible. There were these investigations by Cloudflare and Wired, which are cited in these discussions, and they found that Perplexity was deploying what they call stealth crawlers.

Ida

Right. So when you build a website, you can put up a digital do not enter sign called a robots.txt file.

Allan

It's like a standard internet protocol designed to politely tell automated web crawlers not to scrape your site.

Ida

Exactly. And Cloudflare's digital forensics team discovered that Perplexity's crawlers were allegedly ignoring those protocols entirely.
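For a sense of how simple the "do not enter" sign is, here's how a well-behaved crawler consults robots.txt using Python's standard library. The file below is a made-up example ("PerplexityBot" is the crawler name Perplexity publicly declares, but this rule set is hypothetical); the alleged misbehavior amounts to never running this check at all.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt banning one named bot but allowing everyone else.
robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler checks before fetching and identifies itself honestly.
allowed_bot = rp.can_fetch("PerplexityBot", "https://example.com/article")
# A crawler claiming to be a human's Chrome browser falls under "*".
allowed_ui = rp.can_fetch("Mozilla/5.0", "https://example.com/article")
```

`allowed_bot` comes back False and `allowed_ui` True, which is exactly why spoofing a browser user agent defeats the sign: robots.txt is purely an honor system keyed on the name the crawler chooses to announce.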

Allan

Just blowing right past the signs.

Ida

But even more aggressively, they were deploying undeclared automated agents that were actively spoofing their identity. The code was written to impersonate the Google Chrome browser used by a normal human.

Allan

They put the fake mustache on the robot.

Ida

They did.

Allan

They are writing code to dynamically change their user agent strings, routing their traffic through different IP addresses, and actively deceiving the host servers, all to sneak past firewalls and scrape content they don't want to pay for.

Ida

And this leads to why the authors are so furious about the RAG system, retrieval-augmented generation.

Allan

Yeah, explain RAG, because this mechanism is crucial to understand here.

Ida

Right, because it operates differently than just training the model. During training, the AI reads the book to learn how to speak, embedding the patterns into its weights.

Allan

Okay, so that's the foundation.

Ida

Yes. But RAG allows the AI to act like it's taking an open-book test. When a user asks a question, the AI searches a massive external vector database in real time, retrieves the specific information, and patches it directly into the answer it generates for you.
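The retrieve-then-generate loop can be sketched in miniature. Everything here is a toy stand-in: real systems rank passages with vector embeddings rather than word overlap, and the "library" contents below are invented, but the shape of the pipeline is the same.

```python
# Minimal sketch of retrieval-augmented generation: find the most
# relevant passage in an external store, then splice it into the prompt
# that a language model would complete. Hypothetical data throughout.

library = {
    "chapter-1": "The startup promised a revolutionary blood test.",
    "chapter-2": "Whistleblowers described faked demos to investors.",
}

def retrieve(query: str) -> str:
    """Return the stored passage sharing the most words with the query."""
    q = set(query.lower().split())
    return max(library.values(),
               key=lambda text: len(q & set(text.lower().split())))

def answer(query: str) -> str:
    context = retrieve(query)
    # A real system would now send this assembled prompt to the model.
    return f"Context: {context}\nQuestion: {query}\nAnswer: ..."

prompt = answer("What did whistleblowers say to investors?")
```

The key point for the copyright fight is that the retrieved text is quoted verbatim into the prompt at answer time, separately from whatever the model absorbed during training.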

Allan

Which means it's not just using Carreyrou's book to learn grammar, it's using a pirated copy of his book as a reference library to bypass his sales.

Ida

Precisely. The complaint points out that when asked about Carreyrou's work, Perplexity could spit out highly specific chapter-by-chapter summaries, pulling the thematic sequencing directly from the pirated text it was querying in real time.

Allan

So it is functioning as a direct free substitute for the author's paid copyrighted work. That is just a staggering appropriation of value.

Ida

It really is. But you know, if we zoom out, this isn't just a Perplexity problem. This aggressive data acquisition strategy is the original sin of the entire generative AI arms race.

Allan

Oh, for sure. The copyright suit names Anthropic, Google, Meta, and OpenAI. Literally every single major player is accused of making massive copies of pirated works to build their multi-billion dollar ecosystems.

Ida

Anthropic allegedly used a dataset called the Pile. Google used the C4 dataset, Meta used Books3 for its Llama models, and OpenAI used LibGen for the GPT series.

Allan

But out of this entire mountain of legal filings, I have to say my absolute favorite detail, the cherry on top of this massive digital heist, is XAI's Grok model.

Ida

Oh, yes.

Allan

Grok essentially took the witness stand and enthusiastically testified against its own creators. This is simultaneously impressive and completely ridiculous.

Ida

It is genuinely one of the most remarkable self-owns in recent tech history because Grok was specifically designed by Elon Musk's XAI to be rebellious, right? It's supposed to be anti-censorship, answering questions without the typical corporate guardrails that companies like OpenAI put on their models.

Allan

And it turns out it was entirely too honest. The lawsuit details these chat logs where a user simply asks Grok how it manages to know the contents of so many obscure books.

Ida

And what did it say?

Allan

Grok candidly replies, I basically vacuumed up whatever was out there, and LibGen has been one of the biggest whatever was out there troves for years. It just straight up confesses to the crime.

Ida

It goes into so much detail. Grok told the user, When I seem to know obscure academic monographs, out-of-print novels, or textbooks that normally cost $200, there's a decent chance some of that knowledge traces back to files that originally lived on LibGen.

Allan

Unbelievable.

Ida

It literally understood its own training provenance and cheerfully explained the mechanics of the shadow library to the user.

Allan

Of course it did. They spent billions of dollars to build a superintelligent AI that is literally too smart and too structurally honest to keep its own creators' legal secrets.

Ida

It perfectly highlights the reckless breakneck speed of this entire industry. They deployed these stealth crawlers to scrape absolutely everything they could find, blindly vacuuming up the internet to win the capability arms race, just assuming they could litigate the consequences later.

Allan

Shoot first, ask permission in court later.

Ida

Exactly. And to be clear here, these stealth crawlers are completely ideology blind. Whether it's vacuuming up right-wing manifestos or left-wing political theory from these shadow libraries, the algorithm genuinely doesn't care.

Allan

Right. It's completely impartial.

Ida

Yeah, it's not taking a stance, it's just blindly consuming massive amounts of data to build its vocabulary and predictive capabilities.

The Bigger Pattern And Liability

Allan

The technology has no morality, it just has an insatiable appetite. Which honestly brings us to the bigger picture. Let's zoom out.

All of these lawsuits, the privacy breaches with the tracking pixels, the autonomous trespassing on Amazon's servers, the massive copyright infringement via BitTorrent, they are all symptoms of the exact same underlying engineering philosophy: move fast, break the digital boundaries, and take the data. But when you read through the stark realities of these legal filings, you see that the foundation of the promised AI utopia is currently built on a bedrock of pirated books, ignored digital property rights, and the surreptitious monetization of our most private queries.

Ida

Absolutely.

Allan

So what does this say about us as a society? I mean, we want the convenience so badly. We want the magic answer engine to solve our problems instantly.

Ida

We are so eager for the technology to work that we are willing to treat a corporate chatbot like a private confessional.

Allan

Completely forgetting that the priest on the other side of the screen is backed by the world's largest, most aggressive data brokers who are logging every single word.

Ida

We are actively trading our privacy and the financial livelihoods of human creators for the ability to get a five-second summary of a book we didn't want to buy, or an answer to a question we were too lazy to research ourselves.

Allan

A genuinely unsettling thought to leave you with today. We've talked about how these AI agents are starting to take actions on our behalf, using our credentials to navigate platforms like Amazon. And we've seen how they brazenly ignore the legal boundaries of the platforms they interact with. So what happens when something inevitably goes wrong?

Ida

Oh, that's a scary thought.

Allan

If an AI agent like Perplexity's Comet, operating under your name, using your passwords, and executing a prompt you gave it, bypasses a firewall, violates the Computer Fraud and Abuse Act, or accidentally commits financial fraud while trying to optimize a purchase for you.

Ida

Who holds the legal liability?

Allan

Exactly. If the robot commits a federal crime because you asked it to water your digital plants, is the tech company responsible or are you the one going to jail?

Ida

That is the multi-billion dollar legal question that nobody in Silicon Valley seems to have an answer for yet.

Allan

We are building autonomous proxies without understanding the rules of engagement. So the next time you step into that digital confessional booth or hand over your passwords to a helpful AI agent, just remember to look up.

Ida

Always look up.

Allan

Because that Times Square billboard is always on, the microphone is definitely recording, and you might just be the one held responsible for whatever the machine decides to do next.