Trending Spy News

AI is killing the old web, and the new web struggles to be born

An illustration of a woman typing on a keyboard, her face replaced with lines of code.
Image: The Verge

In recent months, the signs and portents have been accumulating with increasing speed. Google is trying to kill the 10 blue links. Twitter is being abandoned to bots and blue ticks. There’s the junkification of Amazon and the enshittification of TikTok. Layoffs are gutting online media. A job posting looking for an “AI editor” expects “output of 200 to 250 articles per week.” ChatGPT is being used to generate whole spam sites. Etsy is flooded with “AI-generated junk.” Chatbots cite one another in a misinformation ouroboros. LinkedIn is using AI to stimulate tired users. Snapchat and Instagram hope bots will talk to you when your friends don’t. Redditors are staging blackouts. Stack Overflow mods are on strike. The Internet Archive is fighting off data scrapers, and “AI is tearing Wikipedia apart.” The old web is dying, and the new web struggles to be born.

The web is always dying, of course; it’s been dying for years, killed by apps that divert traffic from websites or algorithms that reward supposedly shortening attention spans. But in 2023, it’s dying again — and, as the litany above suggests, there’s a new catalyst at play: AI.

The problem, in extremely broad strokes, is this. Years ago, the web used to be a place where individuals made things. They made homepages, forums, and mailing lists, and a small bit of money with them. Then companies decided they could do things better. They created slick and feature-rich platforms and threw their doors open for anyone to join. They put boxes in front of us, and we filled those boxes with text and images, and people came to see the content of those boxes. The companies chased scale, because once enough people gather anywhere, there’s usually a way to make money off them. But AI changes these assumptions.

Given money and compute, AI systems, particularly the generative models currently in vogue, scale effortlessly. They produce text and images in abundance, and soon, music and video, too. Their output can potentially overrun or outcompete the platforms we rely on for news, information, and entertainment. But the quality of these systems is often poor, and they’re built in a way that is parasitical on the web today. These models are trained on strata of data laid down during the last web-age, which they recreate imperfectly. Companies scrape information from the open web and refine it into machine-generated content that’s cheap to generate but less reliable. This product then competes for attention with the platforms and people that came before it. Sites and users are reckoning with these changes, trying to decide how to adapt and if they even can.

A screenshot of Google’s “search generative experience.”
Screenshot by Jay Peters / The Verge
Google is remaking search by placing AI-generated answers ahead of data sources.

In recent months, discussions and experiments at some of the web’s most popular and useful destinations — sites like Reddit, Wikipedia, Stack Overflow, and Google itself — have revealed the strain created by the appearance of AI systems.

Reddit’s moderators are staging blackouts after the company said it would steeply increase charges to access its API, with the company’s execs saying the changes are (in part) a response to AI firms scraping its data. “The Reddit corpus of data is really valuable,” Reddit founder and CEO Steve Huffman told The New York Times. “But we don’t need to give all of that value to some of the largest companies in the world for free.” This is not the only factor — Reddit is trying to squeeze more revenue from the platform before a planned IPO later this year — but it shows how such scraping is both a threat and an opportunity to the current web, something that makes companies rethink the openness of their platforms.

Wikipedia is familiar with being scraped in this way. The company’s information has long been repurposed by Google to furnish “knowledge panels,” and in recent years, the search giant has started paying for this information. But Wikipedia’s moderators are debating how to use newly capable AI language models to write articles for the site itself. They’re acutely aware of the problems associated with these systems, which fabricate facts and sources with misleading fluency, but know they offer clear advantages in terms of speed and scope. “The risk for Wikipedia is people could be lowering the quality by throwing in stuff that they haven’t checked,” Amy Bruckman, a professor of online communities and author of Should You Believe Wikipedia? told Motherboard recently. “I don’t think there’s anything wrong with using it as a first draft, but every point has to be verified.”

Stack Overflow offers a similar but perhaps more extreme case. Like Reddit, its mods are also on strike, and like Wikipedia’s editors, they’re worried about the quality of machine-generated content. When ChatGPT launched last year, Stack Overflow was the first major platform to ban its output. As the mods wrote at the time: “The primary problem is that while the answers which ChatGPT produces have a high rate of being incorrect, they typically look like they might be good and the answers are very easy to produce.” It takes too much time to sort the results, and so mods decided to ban it outright.

The site’s management, though, had other plans. The company has since essentially reversed the ban by increasing the burden of evidence needed to stop users from posting AI content, and it announced it wants to instead take advantage of this technology. Like Reddit, Stack Overflow plans to charge firms that scrape its data while building its own AI tools — presumably to compete with them. The fight with its moderators is about the site’s standards and who gets to enforce them. The mods say AI output can’t be trusted, but execs say it’s worth the risk.

All these difficulties, though, pale in significance to changes taking place at Google. Google Search underwrites the economy of the modern web, distributing attention and revenue to much of the internet. Google has been spurred into action by the popularity of Bing AI and ChatGPT as alternative search engines, and it’s experimenting with replacing its traditional 10 blue links with AI-generated summaries. But if the company goes ahead with this plan, then the changes would be seismic.

A writeup of Google’s AI search beta from Avram Piltch, editor-in-chief of tech site Tom’s Hardware, highlights some of the problems. Piltch says Google’s new system is essentially a “plagiarism engine.” Its AI-generated summaries often copy text from websites word-for-word but place this content above source links, starving them of traffic. It’s a change that Google has been pushing for a long time, but look at the screenshots in Piltch’s piece and you can see how the balance has shifted firmly in favor of excerpted content. If this new model of search becomes the norm, it could damage the entire web, writes Piltch. Revenue-strapped sites would likely be pushed out of business and Google itself would run out of human-generated content to repackage.

Again, it’s the dynamics of AI — producing cheap content based on others’ work — that is underwriting this change, and if Google goes ahead with its current AI search experience, the effects would be difficult to predict. Potentially, it would damage whole swathes of the web that most of us find useful — from product reviews to recipe blogs, hobbyist homepages, news outlets, and wikis. Sites could protect themselves by locking down entry and charging for access, but this would also be a huge reordering of the web’s economy. In the end, Google might kill the ecosystem that created its value, or change it so irrevocably that its own existence is threatened.

An image showing a robot performing various tasks
Illustration by Alex Castro / The Verge

But what happens if we let AI take the wheel here, and start feeding information to the masses? What difference does it make?

Well, the evidence so far suggests it’ll degrade the quality of the web in general. As Piltch notes in his review, for all AI’s vaunted ability to recombine text, it’s people who ultimately create the underlying data — whether that’s journalists picking up the phone and checking facts or Reddit users who have had exactly that battery issue with the new DeWalt cordless ratchet and are happy to tell you how they fixed it. By contrast, the information produced by AI language models and chatbots is often incorrect. The tricky thing is that when it’s wrong, it’s wrong in ways that are difficult to spot.

Here’s an example. Earlier this year, I was researching AI agents: systems that use language models like ChatGPT to connect with web services and act on behalf of the user, ordering groceries or booking flights. In one of the many viral Twitter threads extolling the potential of this tech, the author imagines a scenario in which a waterproof shoe company wants to commission some market research and turns to AutoGPT (a system built on top of OpenAI’s language models) to generate a report on potential competitors. The resulting write-up is basic and predictable. It lists five companies, including Columbia, Salomon, and Merrell, along with bullet points that supposedly outline the pros and cons of their products. “Columbia is a well-known and reputable brand for outdoor gear and footwear,” we’re told. “Their waterproof shoes come in various styles” and “their prices are competitive in the market.” You might look at this and think it’s so trite as to be basically useless (and you’d be right), but the information is also subtly wrong.

To check the contents of the report, I ran it by someone I thought would be a reliable source on the topic: a moderator for the r/hiking subreddit named Chris. Chris told me that the report was essentially filler. “There are a bunch of words, but no real value in what’s written,” he said. It doesn’t mention important factors like the difference between men’s and women’s shoes or the types of fabric used. It gets facts wrong and ranks brands with a bigger web presence as more worthy. Overall, says Chris, there’s just no expertise in the information — only guesswork. “If I were asked this same question I would give a completely different answer,” he said. “Taking advice from AI will most likely result in hurt feet on the trail.”

This is the same complaint identified by Stack Overflow’s mods: that AI-generated misinformation is insidious because it’s often invisible. It’s fluent but not grounded in real-world experience, and so it takes time and expertise to unpick. If machine-generated content supplants human authorship, it would be hard — impossible, even — to fully map the damage. And yes, people are plentiful sources of misinformation, too, but if AI systems also choke out the platforms where human expertise currently thrives, then there will be less opportunity to remedy our collective errors.

A screenshot of a website with a headline beginning with the phrase “Sorry, as an AI language model.”
Image: NewsGuardTech
More websites are being filled with cheap and faulty AI-generated content.

The effects of AI on the web are not simple to summarize. Even in the handful of examples cited above, there are many different mechanisms at play. In some cases, it seems like the perceived threat of AI is being used to justify changes desired for other reasons (as with Reddit), while in others, AI is a weapon in a struggle between workers who create a site’s value and the people who run it (Stack Overflow). There are also other domains where AI’s capacity to fill boxes is having different effects — from social networks experimenting with AI engagement to shopping sites where AI-generated junk is competing with other wares.

In each case, there’s something about AI’s ability to scale — the simple fact of its raw abundance — that changes a platform. Many of the web’s most successful sites are those that leverage scale to their advantage, either by multiplying social connections or product choice, or by sorting the huge conglomeration of information that constitutes the internet itself. But this scale relies on masses of humans to create the underlying value, and humans can’t beat AI when it comes to mass production. (Even if there is a lot of human work behind the scenes necessary to create AI.) There’s a famous essay in the field of machine learning known as “The Bitter Lesson,” which notes that decades of research prove that the best way to improve AI systems is not by trying to engineer intelligence but by simply throwing more computer power and data at the problem. The lesson is bitter because it shows that machine scale beats human curation. And the same might be true of the web.

Does this have to be a bad thing, though? If the web as we know it changes in the face of artificial abundance? Some will say it’s just the way of the world, noting that the web itself killed what came before it, and often for the better. Printed encyclopedias are all but extinct, for example, but I prefer the breadth and accessibility of Wikipedia to the heft and reassurance of Encyclopedia Britannica. And for all the problems associated with AI-generated writing, there are plenty of ways to improve it, too — from improved citation functions to more human oversight. Plus, even if the web is flooded with AI junk, it could prove to be beneficial, spurring the development of better-funded platforms. If Google consistently gives you garbage results in search, for example, you might be more inclined to pay for sources you trust and visit them directly.

Really, the changes AI is currently causing are just the latest in a long struggle in the web’s history. Essentially, this is a battle over information — over who makes it, how you access it, and who gets paid. But just because the fight is familiar doesn’t mean it doesn’t matter, nor does it guarantee the system that follows will be better than what we have now. The new web is struggling to be born, and the decisions we make now will shape how it grows.

FTC investigating OpenAI on ChatGPT data collection and publication of false information

OpenAI CEO Samuel Altman Testifies To Senate Committee On Rules For Artificial Intelligence
Photo by Win McNamee / Getty Images

The Federal Trade Commission (FTC) is investigating ChatGPT creator OpenAI over possible consumer harm through its data collection and the publication of false information.

First reported by The Washington Post, the FTC sent a 20-page letter to the company this week. The letter requests documents related to developing and training its large language models, as well as data security.

The FTC wants to get detailed information on how OpenAI vets information used in training for its models and how it prevents false claims from being shown to ChatGPT users. It also wants to learn more about how APIs connect to its systems and how data is protected when accessed by third parties.

The FTC declined to comment. OpenAI did not immediately respond to requests for comment.

This is the first major US investigation into OpenAI, which burst into the public consciousness over the past year with the release of ChatGPT. The popularity of ChatGPT and the large language models that power it kicked off an AI arms race prompting competitors like Google and Meta to release their own models.

The FTC has signaled increased regulatory oversight of AI before. In 2021, the agency warned companies against using biased algorithms. Industry watchdog Center for AI and Digital Policy also called on the FTC to stop OpenAI from launching new GPT models in March.

Large language models can put out factually inaccurate information. OpenAI warns ChatGPT users that it can occasionally generate incorrect facts, and the first public demo of Google’s chatbot Bard did not inspire confidence in its accuracy. And based on personal experience, both have spit out incredibly flattering, though completely invented, facts about me. Other people have gotten in trouble for using ChatGPT. A lawyer was sanctioned for submitting fake cases created by ChatGPT, and a Georgia radio host sued the company for results that claimed he was accused of embezzlement.

US lawmakers have shown great interest in AI, both in understanding the technology and in possibly enacting regulations around it. The Biden administration released a plan to provide a responsible framework for AI development, including a $140 million investment to launch research centers. Supreme Court Justice Neil Gorsuch also discussed chatbots’ potential legal liability earlier this year.

It is in this environment that AI leaders like OpenAI CEO Sam Altman have made the rounds in Washington. Altman lobbied Congress to create regulations around AI.

OpenAI will use Associated Press news stories to train its models

An illustration of a cartoon brain with a computer chip imposed on top.
Illustration by Alex Castro / The Verge

OpenAI will train its AI models on The Associated Press’ news stories for the next two years, thanks to an agreement first reported by Axios. The deal between the two companies will give OpenAI access to some of the content in AP’s archive as far back as 1985.

As part of the agreement, AP will gain access to OpenAI’s “technology and product expertise,” although it’s not clear exactly what that entails. AP has long been exploring AI features and began generating reports about company earnings in 2014. It later leveraged the technology to automate stories about Minor League Baseball and college sports.

AP joins OpenAI’s growing list of partners. On Tuesday, the AI company announced a six-year deal with Shutterstock that will let OpenAI license images, videos, music, and metadata to train its text-to-image model, DALL-E. BuzzFeed also says it will use AI tools provided by OpenAI to “enhance” and “personalize” its content. OpenAI is also working with Microsoft on a number of AI-powered products as part of Microsoft’s partnership and “multibillion dollar investment” into the company.

“The AP continues to be an industry leader in the use of AI; their feedback — along with access to their high-quality, factual text archive — will help to improve the capabilities and usefulness of OpenAI’s systems,” Brad Lightcap, OpenAI’s chief operating officer, says in a statement.

Earlier this year, AP announced AI-powered projects that will publish Spanish-language news alerts and document public safety incidents in a Minnesota newspaper. The outlet also launched an AI search tool that’s supposed to make it easier for news partners to find photos and videos in its library based on “descriptive language.”

AP’s partnership with OpenAI seems like a natural next step, but there are still a lot of crucial details missing about how the outlet will use the technology. AP makes it clear it “does not use it in its news stories.”

Congress is trying to stop discriminatory algorithms again

A person with their hand hovering over the Like button on Facebook.
Photo by Amelia Holowaty Krales / The Verge

US policymakers hope to require online platforms to disclose information about their algorithms and allow the government to intervene if these are found to discriminate based on criteria like race or gender.

Sen. Edward Markey (D-MA) and Rep. Doris Matsui (D-CA) reintroduced the Algorithmic Justice and Online Platform Transparency Act, which aims to ban the use of discriminatory or “harmful” automated decision-making. It would also establish safety standards, require platforms to provide a plain language explanation of algorithms used by websites, publish annual reports on content moderation practices, and create a governmental task force to investigate discriminatory algorithmic processes.

The bill applies to “online platforms” or any commercial, public-facing website or app that “provides a community forum for user-generated content.” This can include social media sites, content aggregation services, or media and file-sharing sites.

Markey and Matsui introduced a previous version of the bill in 2021. It moved to the Subcommittee on Consumer Protection and Commerce but died in committee.

Data-based decision-making, including social media recommendation algorithms or machine learning systems, often lives in proverbial black boxes. This opacity sometimes exists because of intellectual property concerns or a system’s complexity.

But lawmakers and regulators worry this could obscure biased decision-making with a huge impact on people’s lives, well beyond the reach of the online platforms the bill covers. Insurance companies, including those working with Medicaid patients, already use algorithms to grant or deny patient coverage. Agencies such as the FTC signaled in 2021 that they may pursue legal action against biased algorithms.

Calls to make algorithms more transparent have grown over the years. After several scandals in 2018, including the Cambridge Analytica debacle, AI research group AI Now found governments and companies don’t have a way to punish organizations that produce discriminatory systems. In a rare move, Facebook and Instagram announced the formation of a group to study potential racial bias in their algorithms.

“Congress must hold Big Tech accountable for its black-box algorithms that perpetuate discrimination, inequality, and racism in our society – all to make a quick buck,” Markey said in a statement.

Most proposed regulations around AI and algorithms include a push to create more transparency. The European Union’s proposed AI Act, in its final stages of negotiation, also noted the importance of transparency and accountability.
