Amazon probing AI startup Perplexity for scraping sites without permission

Amazon is investigating buzzy AI startup Perplexity for allegedly violating its cloud division’s rules by improperly “scraping” content from other websites without permission, according to a report Friday.

Perplexity, which recently drew a $3 billion valuation, is allegedly ignoring a well-known web standard called the Robots Exclusion Protocol, commonly referred to as robots.txt, which news publishers and other sites use to show automated bots which pages they aren’t allowed to scrape, tech outlet Wired reported.

While adhering to the standard isn’t required by law, most internet firms opt to follow the protocol. Compliance is also mandatory for companies that rely on Amazon Web Services, such as Perplexity.
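
For readers unfamiliar with the mechanics, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard-library parser. The bot name "ExampleBot", the example.com URLs and the rules themselves are hypothetical and chosen purely for illustration; they are not taken from any publisher's actual file or from Perplexity's systems.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bars "ExampleBot" from /premium/ and
# leaves the rest of the site open to every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /premium/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/premium/story"),
        ("ExampleBot", "https://example.com/news/story"),
        ("OtherBot", "https://example.com/premium/story"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The check is purely advisory: nothing in the file technically stops a crawler that skips this step, which is why compliance comes down to convention and to a hosting provider's terms.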

“AWS’s terms of service prohibit customers from using our services for any illegal activity, and our customers are responsible for complying with our terms and all applicable laws,” an Amazon spokesperson said in a statement.

Scrutiny of Perplexity’s practices has intensified after Forbes accused the company earlier this month of “directly ripping off” articles written by its reporters and others by CNBC and Bloomberg, including those that were behind paywalls.

Wired approached Amazon after its own investigation determined that Perplexity allegedly used an “unpublished IP address” to scrape websites operated by its parent company Condé Nast, even though the publisher was trying to block access.

The outlet said that representatives from other outlets, including Forbes, the New York Times and the Guardian, had detected the same IP address visiting their servers.

The Post reached out to Amazon for comment.

Perplexity spokesperson Sara Platnick pushed back on Wired’s report, calling it “inaccurate.”

“Our PerplexityBot — which runs on AWS — respects robots.txt, and we confirmed that Perplexity-controlled services are not crawling in any way that violates AWS Terms of Service,” Platnick said in a statement.

“AWS looked into WIRED’s media query as part of a standard protocol for investigating reports of abuse of AWS resources,” Platnick added. “We had not heard anything from AWS prior to a WIRED reporter contacting them. To say that AWS is ‘investigating’ Perplexity outside of this specific WIRED inquiry is incorrect. AWS is a valuable partner to Perplexity and we are grateful for their ongoing collaboration.”

Platnick told Wired that the PerplexityBot would bypass the robots.txt protocol in the “very infrequent” circumstance that a user included a specific URL in their query.

Perplexity CEO Aravind Srinivas had previously slammed Wired’s findings, asserting that they “reflect a deep and fundamental misunderstanding of how Perplexity and the Internet work.” 

Forbes had taken issue with a feature called “Perplexity Pages,” a product that displays “curated” articles that pull details from articles written by third-party news outlets.

The original authors weren’t credited by name, even when the wording of Perplexity’s posts closely matched that of the source text.

Instead, Perplexity used what Forbes described as “small, easy-to-miss logos” linking back to the original sources.

In one egregious example, Perplexity’s chatbot churned out a version of an exclusive, paywalled Forbes report on ex-Google CEO Eric Schmidt’s military drone project.

“Our reporting on Eric Schmidt’s stealth drone project was posted this AM by @perplexity_ai,” Forbes Executive Editor John Paczkowski wrote on X at the time. “It rips off most of our reporting. It cites us, and a few that reblogged us, as sources in the most easily ignored way possible.”

Srinivas said the tool “has rough edges” but otherwise denied wrongdoing.
