🕷️ Crawler Inspector

Raw Queries and Responses

1. Shard Calculation

Query:

Response:

Calculated Shard: 21 (from laksa006)

2. Crawled Status Check

Query:

curl -X POST \
  'http://laksa021.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  -d "SELECT
        getAhrefsURLFromUnparsed(src_unparsed) AS found_url,
        ifNull(toUnixTimestamp(download_stamp), 0) AS crawl_time,
        ifNull(toUnixTimestamp(props_url_first_seen), 0) AS first_indexed_time,
        download_http_code AS http_code,
        src_unparsed AS src_unparsed,
        src_root_hash AS src_root_hash,
        history_drop_reason AS history_drop_reason,
        meta_title AS meta_title,
        meta_descriptions AS meta_descriptions,
        attrs_boilerpipe_text AS attrs_boilerpipe_text,
        attrs_markdown AS attrs_markdown,
        attrs_readable_markdown AS attrs_readable_markdown,
        meta_canonical AS meta_canonical,
        ml_categories_json AS ml_categories_json,
        ml_types_json AS ml_types_json,
        ml_intent_types_json AS ml_intent_types_json,
        meta_language AS meta_language,
        attrs_author AS attrs_author,
        ifNull(toUnixTimestamp(attrs_publish_time), 0) AS attrs_publish_time,
        ifNull(toUnixTimestamp(attrs_original_publish_time), 0) AS attrs_original_publish_time,
        ifNull(attrs_is_republished, 0) AS attrs_is_republished,
        ifNull(attrs_nr_words, 0) AS attrs_nr_words,
        ifNull(attrs_boilerpipe_nr_words, 0) AS attrs_boilerpipe_nr_words,
        ifNull(body_ext_links_number, 0) AS body_ext_links_number,
        ifNull(body_int_links_number, 0) AS body_int_links_number,
        ifNull(meta_nofollow, 0) AS meta_nofollow,
        ifNull(meta_noarchive, 0) AS meta_noarchive,
        ifNull(props_was_rendered, 0) AS props_was_rendered,
        ifNull(src_redirect, '') AS src_redirect,
        ifNull(download_time_msec, 0) AS download_time_msec,
        ifNull(download_ttfb_msec, 0) AS download_ttfb_msec,
        ifNull(download_size, 0) AS download_size
      FROM crawler3.page_info_local FINAL
      PREWHERE (src_root_hash, src_unparsed) IN ((
        getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL('https://www.theatlantic.com/technology/archive/2025/09/youtube-ai-training-data-sets/684116/')),
        getAhrefsUnparsedNoserviceFromURL('https://www.theatlantic.com/technology/archive/2025/09/youtube-ai-training-data-sets/684116/')
      ))
      FORMAT JSONEachRow"
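
For repeating this check against other URLs, the request above can be sketched in Python. This is a minimal illustration, not the inspector's actual implementation: the helper names (`shard_endpoint`, `sql_string`, `build_lookup_query`, `basic_auth`, `parse_rows`, `to_utc`) are hypothetical, the shard-to-host naming is only inferred from "Calculated Shard: 21" and the `laksa021` host in the query, and the column list is deliberately shortened. The sketch builds the request pieces and parses a JSONEachRow body; it does not send anything.

```python
import base64
import json
from datetime import datetime, timezone
from typing import Optional


def shard_endpoint(shard: int) -> str:
    # Assumption: shard N lives on laksaNNN.int.ahrefs:8124, inferred only
    # from the shard number and the host used in the query above.
    return f"http://laksa{shard:03d}.int.ahrefs:8124/"


def sql_string(value: str) -> str:
    # Quote a value as a ClickHouse string literal, escaping backslashes
    # and single quotes so an arbitrary URL cannot break out of the literal.
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_lookup_query(url: str) -> str:
    # Shortened, hypothetical variant of the query above: same PREWHERE
    # key lookup, but only three of the selected columns.
    unparsed = f"getAhrefsUnparsedNoserviceFromURL({sql_string(url)})"
    return (
        "SELECT getAhrefsURLFromUnparsed(src_unparsed) AS found_url, "
        "ifNull(toUnixTimestamp(download_stamp), 0) AS crawl_time, "
        "download_http_code AS http_code "
        "FROM crawler3.page_info_local FINAL "
        "PREWHERE (src_root_hash, src_unparsed) IN "
        f"((getAhrefsRootHashFromUnparsed({unparsed}), {unparsed})) "
        "FORMAT JSONEachRow"
    )


def basic_auth(user: str, password: str = "") -> str:
    # HTTP Basic credentials: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def parse_rows(body: str) -> list:
    # JSONEachRow returns one JSON object per line.
    return [json.loads(line) for line in body.split("\n") if line.strip()]


def to_utc(ts: int) -> Optional[datetime]:
    # The query maps NULL timestamps to 0 via ifNull, so 0 means "not set".
    return datetime.fromtimestamp(ts, tz=timezone.utc) if ts else None
```

`basic_auth("api")` reproduces the `Authorization` header value used in the curl command, and `to_utc` turns fields such as `crawl_time` and `first_indexed_time` into readable UTC datetimes.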

Response:

{"found_url":"https:\/\/www.theatlantic.com\/technology\/archive\/2025\/09\/youtube-ai-training-data-sets\/684116\/","crawl_time":1776258825,"first_indexed_time":1757517259,"http_code":200,"src_unparsed":"com,theatlantic!www,\/technology\/archive\/2025\/09\/youtube-ai-training-data-sets\/684116\/ s443","src_root_hash":"13119341252700813021","history_drop_reason":null,"meta_title":"At Least 15 Million YouTube Videos Have Been Snatched by AI Companies - The Atlantic","meta_descriptions":["At least 15 million videos have been snatched by tech companies."],"attrs_boilerpipe_text":"Editor’s note: This analysis is part of\nThe Atlantic\n’s investigation into how YouTube videos are taken to train AI tools. You can use the search tool directly\nhere\n, to see whether videos you’ve created or watched are included in the data sets. This work is part of\nAI Watchdog\n,\nThe Atlantic\n’s ongoing investigation into the generative-AI industry.\nW\nhen Jon Peters uploaded his first video\nto YouTube in 2010, he had no idea where it would lead. He was a professional woodworker running a small business who decided to film himself making a dining table with some old legs he had found in a barn. It turned out that people liked his candid style, and as he posted more videos, a fan base began to grow. “All of a sudden there’s people who appreciate the work I’m doing,” he told me. “The comments were a motivator.” Fifteen years later, his channel has more than 1 million subscribers. Sometimes he gets photos of people in their shops, following his guidance from a big TV on the wall—most of his viewers, Peters told me, are woodworkers looking to him for instruction.\nBut\nPeters’s channel\ncould soon be obsolete, along with millions of other videos created by people who share their expertise and advice on YouTube. 
Over the past few months, I’ve discovered more than 15.8 million videos from more than 2 million channels that tech companies have, without permission, downloaded to train AI products. Nearly 1 million of them, by my count, are how-to videos. You can find these videos in at least 13 different data sets distributed by AI developers at tech companies, universities, and research organizations, through websites such as Hugging Face, an online AI-development hub.\nIn most cases the videos are anonymized, meaning that titles and creator names are not included. I was able to identify the videos by extracting unique identifiers from the data sets and looking them up on YouTube—similar to the process I followed when I revealed the contents of the\nBooks3\n,\nOpenSubtitles\n, and\nLibGen\ndata sets. You can search the data sets using the tool below, typing in channel names like “MrBeast” or “James Charles,” for example.\n(\nA note for users: Just because a video appears in these data sets does not mean it was used for training by AI companies, which could choose to omit certain videos when developing their products.\n)\nTo create AI products capable of generating video, developers need huge quantities of videos, and YouTube has become a common source. Although YouTube does offer paying subscribers the ability to download videos and watch them through the company’s app whenever they’d like, this is something different: Video files are being ripped from YouTube en masse and saved in files that are then fed to AI algorithms. This kind of downloading\nviolates the platform’s terms of service\n, but many tools allow AI developers to download videos in this way. YouTube appears to have done little, if anything, to stop the mass downloading, and the company did not respond to my request for comment.\nNot all YouTube videos are copyrighted (and some are uploaded by people who don’t own the copyrights), but many are. 
Unauthorized copying or distribution of those videos is illegal, but whether AI training constitutes a form of copying or distribution is still a question being debated in many ongoing lawsuits. Tech companies have argued that training is a “fair use” of copyrighted work, and some\njudges have disagreed\nin their responses. How the courts ultimately apply the law to this novel technology could have massive consequences for creators’ motivations to post their work on YouTube and similar platforms—if tech companies are able to continue taking creators’ work to build AI products that compete with them, then creators may have little choice but to stop sharing.\nG\nenerative-AI tools are already producing\nvideos that compete with human-made work on YouTube. AI-generated history videos with hundreds of thousands of views and many inaccuracies\nare drowning out\nfact-checked, expert-produced content. Popular music-remix videos are frequently created\nusing this technology\n, and many of them perform better than human-made videos.\nThe problem extends far beyond YouTube, however. Most modern chatbots are “multimodal,” meaning they can respond to a question by creating relevant media. Google’s Gemini chatbot, for instance, will produce short clips for paying users. Soon, you may be able to ask ChatGPT or another generative-AI tool about how to build a table from found legs and get a custom how-to video in response. Even if that response isn’t as good as any video Peters would make, it will be immediate, and it will be tailor-made to your specifications. The online-publishing business has already been\ndecimated by text-generation tools\n; video creators should expect similar challenges from generative-AI tools in the near future.\nMany major tech companies have used these data sets to train AI, according to research papers I’ve read and AI developers I’ve spoken with. The group includes Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. 
I reached out to each of these companies to ask about their use of these data sets. Only Meta, Amazon, and Nvidia responded. All three said they “respect” content creators and believe that their use of the work is legal under existing copyright law. Amazon also shared that, where video is concerned, it is currently focused on developing ways to generate “compelling, high-quality advertisements from simple prompts.”\nWe can’t be certain whether all these companies will use the videos to create for-profit video-generating tools. Some of the work they’ve done may be simply experimental. But a few of these companies have an obvious interest in pursuing commercial products: Meta, for instance, is developing a suite of tools called\nMovie Gen\nthat creates videos from text prompts, and Snap offers\n“AI Video Lenses”\nthat allow users to augment their videos with generative AI. Videos such as the ones in these data sets are the raw material for products like these; much as ChatGPT couldn’t write like Shakespeare without first “reading” Shakespeare, a video generator couldn’t construct a fake newscast without “watching” tons of recorded broadcasts. In fact, a large number of the videos in these data sets are from news and educational channels, such as the BBC (which has at least 33,000 videos in the data sets, across its various brands) and TED (nearly 50,000). Hundreds of thousands of others—if not more—are from individual creators, such as Peters.\nAI companies are more interested in some videos than others. 
A spreadsheet leaked\nto\n404 Media\nby a former employee at Runway, which builds AI video-generation tools, shows what the company valued about certain channels: “high camera movement,” “beautiful cinematic landscapes,” “high quality scenes from movies,” “super high quality sci-fi short films.” One channel was labeled “THE HOLY GRAIL OF CAR CINEMATICS SO FAR”; another was labeled “only 4 videos but they are really well done.”\nDevelopers seek out high-quality videos in a variety of ways. Curators of two of the data sets collected here—HowTo100M and HD-VILA-100M—prioritized videos with high view counts on YouTube, equating popularity with quality. The creators of another data set, HD-VG-130M,\nnoted\nthat “high view count does not guarantee video quality,” and used an AI model to select videos of high “aesthetic quality.” Data-set creators often try to avoid videos that contain overlaid text, such as subtitles and logos, so these identifying features don’t appear in videos generated by their model. So, some advice for YouTubers: Putting a watermark or logo on your videos, even a small one, makes them less desirable for training.\nTo prepare the videos for training, developers split the footage into short clips, in many cases cutting wherever there is a scene or camera change. Each clip is then given an English-language description of the visual scene so the model can be trained to correlate words with moving images, and to generate videos from text prompts. AI developers have a few methods of writing these captions. One way is to pay workers to do it. Another is to use separate AI models to generate a description automatically. The latter is more common, because of its lower cost.\nA\nI video tools aren’t yet\nas mainstream as chatbots or image generators, but they are already in wide use. You may already have seen AI-manipulated video without realizing it. For example, TED has been using AI to dub speakers’ talks in different languages. 
This includes the video as well as the audio: Speakers’ mouths are\nlip-synched\nwith the new words so it looks like they’re speaking Japanese, French, or Russian. Nishat Ruiter, TED’s general counsel, told me this is done with the speakers’ knowledge and consent.\nThere are also consumer-facing products for tweaking videos with AI. If your face doesn’t look right, for example, you can try a face-enhancer such as\nFacetune\n, or ditch your mug entirely with a face-swapper such as\nFacewow\n. With Runway’s\nAleph\n, you can change the colors of objects, or turn sunshine into a snowstorm.\nThen there are tools that generate new videos based on an image you provide. Google\nencourages Gemini users\nto animate their “favorite photos.” The result is a clip that extrapolates eight seconds of movement from an initial image, making a person dance, cook, or\nswing a golf club\n. These are often both amazing and creepy. “Talking head generation”—for\nemployee-orientation videos\n, for example—is also advancing.\nVidnoz AI\npromises to generate “Realistic AI Spokespersons of Any Style.” A company called\nArcads\nwill generate a complete advertisement, with actors and voiceover. ByteDance, the company that operates TikTok, offers a similar product called Symphony Creative Studio. Other applications of AI video generation include\nvirtual try-on of clothes\n,\ngenerating custom video games\n, and animating\ncartoon characters and people\n.\nSome companies are both working with AI and simultaneously fighting to defend their content from being pilfered by AI companies. This reflects the Wild West mentality in AI right now—companies exploiting legal gray areas to see how they can profit. As I investigated these data sets, I learned about an incident involving TED—again, one of the most-pilfered organizations in the data sets captured here, and one that is attempting to employ AI to advance its own business. 
In June, the Cannes Lions international advertising festival gave one of its Grand Prix awards to an ad that included deepfaked footage from a TED talk by DeAndrea Salvador, currently a state senator in North Carolina. The ad agency, DM9, “used AI cloning to change her talk and repurposed it for a commercial ad campaign,” Ruiter told me on a video call recently. When the manipulation was discovered, the Cannes Lions festival\nwithdrew the award\n. Last month, Salvador\nsued\nDM9 along with its clients—Whirlpool and Consul—for misappropriation of her likeness, among other things. DM9 apologized for the incident and\ncited\n“a series of failures in the production and sending” of the ad. A spokesperson from Whirlpool told me the company was unaware the senator’s remarks had been altered.\nOthers in the film industry have filed lawsuits against AI companies for training with their content. In June, Disney and Universal sued Midjourney, the maker of an image-generating tool that can produce images containing recognizable characters (Warner Brothers\njoined\nthe lawsuit last week). The lawsuit called Midjourney a “bottomless pit of plagiarism.” The following month, two adult-film companies sued Meta for downloading (and distributing through BitTorrent) more than 2,000 of their videos. Neither Midjourney nor Meta has responded to the allegations, and neither responded to my request for comment. One YouTuber filed their own lawsuit: In August of last year,\nDavid Millette sued Nvidia\nfor unjust enrichment and unfair competition with regard to the training of its\nCosmos AI\n, but the case was voluntarily dismissed months later.\nThe Disney characters and the deepfaked Salvador ad are just two instances of how these tools can be damaging. The floodgates may soon be opening further. Thanks to the enormous amount of investment in the technology, generated videos are beginning to appear everywhere. 
One company, DeepBrain AI,\npays “creators”\nto post AI-generated videos made with its tools on YouTube. It currently offers $500 for a video that gets 10,000 views, a relatively low threshold. Companies that run social-media platforms, such as Google and Meta, also pay users for content, through ad-revenue sharing, and many directly\nencourage\nthe posting of AI-generated content. Not surprisingly, a coterie of\ngurus\nhas arrived to teach the secrets of making money with AI-generated content.\nGoogle\nand\nMeta\nhave also trained AI tools on large quantities of videos from their own platforms: Google has taken\nat least 70 million\nclips from YouTube, and Meta has taken more than\n65 million clips from Instagram\n. If these companies succeed in flooding their platforms with synthetic videos, human creators could be left with the unenviable task of competing with machines that churn out endless content based on their original work. And social media will become even less social than it is.\nI asked Peters if he knew his videos had been taken from YouTube to train AI. He said he didn’t, but he wasn’t surprised. “I think everything’s gonna get stolen,” he told me. But he didn’t know what to do about it. 
“Do I quit, or do I just keep making videos and hope people want to connect with a person?”","attrs_markdown":"[Skip to content](https:\/\/www.theatlantic.com\/technology\/archive\/2025\/09\/youtube-ai-training-data-sets\/684116\/#main-content)\n\n## Site Navigation\n- [Popular](https:\/\/www.theatlantic.com\/most-popular\/)[Latest](https:\/\/www.theatlantic.com\/latest\/)[Newsletters](https:\/\/www.theatlantic.com\/newsletters\/)\n  \n  ## Sections\n  - [Ideas](https:\/\/www.theatlantic.com\/ideas\/)\n  - [Politics](https:\/\/www.theatlantic.com\/politics\/)\n  - [Economy](https:\/\/www.theatlantic.com\/economy\/)\n  - [Global](https:\/\/www.theatlantic.com\/international\/)\n  - [National Security](https:\/\/www.theatlantic.com\/national-security\/)\n  - [Washington Week](https:\/\/www.theatlantic.com\/category\/washington-week-atlantic\/)\n  - [Features](https:\/\/www.theatlantic.com\/category\/features\/)\n  - [Technology](https:\/\/www.theatlantic.com\/technology\/)\n  - [AI Watchdog](https:\/\/www.theatlantic.com\/category\/ai-watchdog\/)\n  - [Science](https:\/\/www.theatlantic.com\/science\/)\n  - [Planet](https:\/\/www.theatlantic.com\/projects\/planet\/)\n  - [Health](https:\/\/www.theatlantic.com\/health\/)\n  - [Philosophy](https:\/\/www.theatlantic.com\/category\/philosophy\/)\n  - [Education](https:\/\/www.theatlantic.com\/education\/)\n  - [Culture](https:\/\/www.theatlantic.com\/culture\/)\n  - [Comedy](https:\/\/www.theatlantic.com\/category\/comedy\/)\n  - [Family](https:\/\/www.theatlantic.com\/family\/)\n  - [Books](https:\/\/www.theatlantic.com\/books\/)\n  - [Fiction](https:\/\/www.theatlantic.com\/category\/fiction\/)\n  - [Photography](https:\/\/www.theatlantic.com\/photo\/)\n  - [Events](https:\/\/www.theatlantic.com\/atlantic-across-america\/)\n  \n  - [![](https:\/\/cdn.theatlantic.com\/_next\/static\/images\/nav-archive-promo-5541b02ae92f1a9276249e1c6c2534ee.png)Explore The Atlantic Archive](https:\/\/www.theatlantic.com\/archive\/)\n  - 
[![games promo icon](https:\/\/cdn.theatlantic.com\/media\/games\/games_promo_color.png)Play The Atlantic Games](https:\/\/www.theatlantic.com\/games\/)\n  - [Listen to Podcasts and Articles](https:\/\/www.theatlantic.com\/audio\/)\n  \n  ## The Print Edition\n  [![View the current print edition](https:\/\/www.theatlantic.com\/magazine\/images\/current-issue.420.jpg)](https:\/\/www.theatlantic.com\/magazine\/)\n  \n  [Latest Issue](https:\/\/www.theatlantic.com\/magazine\/)[Past Issues](https:\/\/www.theatlantic.com\/magazine\/backissues\/)\n  ***\n  [Give a Gift](https:\/\/accounts.theatlantic.com\/products\/gift)\n- [Popular](https:\/\/www.theatlantic.com\/most-popular\/)\n- [Latest](https:\/\/www.theatlantic.com\/latest\/)\n- [Newsletters](https:\/\/www.theatlantic.com\/newsletters\/)\n\n- [Sign In](https:\/\/accounts.theatlantic.com\/login\/)\n- [Subscribe](https:\/\/www.theatlantic.com\/subscribe\/navbar\/)\n\n[Technology](https:\/\/www.theatlantic.com\/technology\/)\n\n# AI Is Coming for YouTube Creators\nAt least 15 million videos have been snatched by tech companies.\n\nBy [Alex Reisner](https:\/\/www.theatlantic.com\/author\/alex-reisner\/)\n\n![Animated illustration of data sets filled with binders labeled YouTube](https:\/\/cdn.theatlantic.com\/thumbor\/9h8vqTtTTTInm7xO_B0aNxJANEc=\/0x0:2000x1125\/960x540\/media\/img\/mt\/2025\/09\/ai_watchdog_youtube\/original.gif)\n\nIllustration by Matteo Giuseppe Pani \/ The Atlantic\n\nSeptember 10, 2025\n\nShare\n\nSave\n\n![Animated illustration of data sets filled with binders labeled YouTube](https:\/\/cdn.theatlantic.com\/thumbor\/3XF4RLuo7rs0K1ifxgZyLBufKP8=\/438x0:1563x1125\/80x80\/media\/img\/mt\/2025\/09\/ai_watchdog_youtube\/original.gif)\n\nListen\n\n−\n\n1\\.0x\n\n\\+\n\nSeek\n\n0:0014:50\n\n*Editor’s note: This analysis is part of* The Atlantic*’s investigation into how YouTube videos are taken to train AI tools. 
You can use the search tool directly [here](https:\/\/www.theatlantic.com\/technology\/archive\/2025\/09\/search-youtube-videos-generative-ai\/684158\/), to see whether videos you’ve created or watched are included in the data sets. This work is part of [AI Watchdog](https:\/\/www.theatlantic.com\/category\/ai-watchdog\/),* The Atlantic*’s ongoing investigation into the generative-AI industry.*\n***\nWhen Jon Peters uploaded his first video to YouTube in 2010, he had no idea where it would lead. He was a professional woodworker running a small business who decided to film himself making a dining table with some old legs he had found in a barn. It turned out that people liked his candid style, and as he posted more videos, a fan base began to grow. “All of a sudden there’s people who appreciate the work I’m doing,” he told me. “The comments were a motivator.” Fifteen years later, his channel has more than 1 million subscribers. Sometimes he gets photos of people in their shops, following his guidance from a big TV on the wall—most of his viewers, Peters told me, are woodworkers looking to him for instruction.\n\nBut [Peters’s channel](https:\/\/www.youtube.com\/@JonPetersArtHome) could soon be obsolete, along with millions of other videos created by people who share their expertise and advice on YouTube. Over the past few months, I’ve discovered more than 15.8 million videos from more than 2 million channels that tech companies have, without permission, downloaded to train AI products. Nearly 1 million of them, by my count, are how-to videos. You can find these videos in at least 13 different data sets distributed by AI developers at tech companies, universities, and research organizations, through websites such as Hugging Face, an online AI-development hub.\n\nIn most cases the videos are anonymized, meaning that titles and creator names are not included. 
I was able to identify the videos by extracting unique identifiers from the data sets and looking them up on YouTube—similar to the process I followed when I revealed the contents of the [Books3](https:\/\/www.theatlantic.com\/technology\/archive\/2023\/09\/books3-database-generative-ai-training-copyright-infringement\/675363\/), [OpenSubtitles](https:\/\/www.theatlantic.com\/technology\/archive\/2024\/11\/opensubtitles-ai-data-set\/680650\/), and [LibGen](https:\/\/www.theatlantic.com\/technology\/archive\/2025\/03\/libgen-meta-openai\/682093\/) data sets. You can search the data sets using the tool below, typing in channel names like “MrBeast” or “James Charles,” for example.\n\n(*A note for users: Just because a video appears in these data sets does not mean it was used for training by AI companies, which could choose to omit certain videos when developing their products.*)\n\nTo create AI products capable of generating video, developers need huge quantities of videos, and YouTube has become a common source. Although YouTube does offer paying subscribers the ability to download videos and watch them through the company’s app whenever they’d like, this is something different: Video files are being ripped from YouTube en masse and saved in files that are then fed to AI algorithms. This kind of downloading [violates the platform’s terms of service](https:\/\/www.bloomberg.com\/news\/articles\/2024-04-04\/youtube-says-openai-training-sora-with-its-videos-would-break-the-rules), but many tools allow AI developers to download videos in this way. YouTube appears to have done little, if anything, to stop the mass downloading, and the company did not respond to my request for comment.\n\nNot all YouTube videos are copyrighted (and some are uploaded by people who don’t own the copyrights), but many are. 
Unauthorized copying or distribution of those videos is illegal, but whether AI training constitutes a form of copying or distribution is still a question being debated in many ongoing lawsuits. Tech companies have argued that training is a “fair use” of copyrighted work, and some [judges have disagreed](https:\/\/www.theatlantic.com\/technology\/archive\/2025\/07\/anthropic-meta-ai-rulings\/683526\/) in their responses. How the courts ultimately apply the law to this novel technology could have massive consequences for creators’ motivations to post their work on YouTube and similar platforms—if tech companies are able to continue taking creators’ work to build AI products that compete with them, then creators may have little choice but to stop sharing.\n\nGenerative-AI tools are already producing videos that compete with human-made work on YouTube. AI-generated history videos with hundreds of thousands of views and many inaccuracies [are drowning out](https:\/\/www.404media.co\/ai-generated-boring-history-videos-are-flooding-youtube-and-drowning-out-real-history\/) fact-checked, expert-produced content. Popular music-remix videos are frequently created [using this technology](https:\/\/www.youtube.com\/watch?v=eIahbtBz6Uo), and many of them perform better than human-made videos.\n\nThe problem extends far beyond YouTube, however. Most modern chatbots are “multimodal,” meaning they can respond to a question by creating relevant media. Google’s Gemini chatbot, for instance, will produce short clips for paying users. Soon, you may be able to ask ChatGPT or another generative-AI tool about how to build a table from found legs and get a custom how-to video in response. Even if that response isn’t as good as any video Peters would make, it will be immediate, and it will be tailor-made to your specifications. 
The online-publishing business has already been [decimated by text-generation tools](https:\/\/www.theatlantic.com\/technology\/archive\/2025\/06\/generative-ai-pirated-articles-books\/683009\/); video creators should expect similar challenges from generative-AI tools in the near future.\n\nMany major tech companies have used these data sets to train AI, according to research papers I’ve read and AI developers I’ve spoken with. The group includes Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. I reached out to each of these companies to ask about their use of these data sets. Only Meta, Amazon, and Nvidia responded. All three said they “respect” content creators and believe that their use of the work is legal under existing copyright law. Amazon also shared that, where video is concerned, it is currently focused on developing ways to generate “compelling, high-quality advertisements from simple prompts.”\n\nWe can’t be certain whether all these companies will use the videos to create for-profit video-generating tools. Some of the work they’ve done may be simply experimental. But a few of these companies have an obvious interest in pursuing commercial products: Meta, for instance, is developing a suite of tools called [Movie Gen](https:\/\/ai.meta.com\/research\/movie-gen\/) that creates videos from text prompts, and Snap offers [“AI Video Lenses”](http:\/\/theverge.com\/news\/628354\/snap-snapchat-ai-video-lenses) that allow users to augment their videos with generative AI. Videos such as the ones in these data sets are the raw material for products like these; much as ChatGPT couldn’t write like Shakespeare without first “reading” Shakespeare, a video generator couldn’t construct a fake newscast without “watching” tons of recorded broadcasts. 
In fact, a large number of the videos in these data sets are from news and educational channels, such as the BBC (which has at least 33,000 videos in the data sets, across its various brands) and TED (nearly 50,000). Hundreds of thousands of others—if not more—are from individual creators, such as Peters.\n\nAI companies are more interested in some videos than others. A spreadsheet leaked [to *404 Media*](https:\/\/www.404media.co\/runway-ai-image-generator-training-data-youtube\/) by a former employee at Runway, which builds AI video-generation tools, shows what the company valued about certain channels: “high camera movement,” “beautiful cinematic landscapes,” “high quality scenes from movies,” “super high quality sci-fi short films.” One channel was labeled “THE HOLY GRAIL OF CAR CINEMATICS SO FAR”; another was labeled “only 4 videos but they are really well done.”\n\nDevelopers seek out high-quality videos in a variety of ways. Curators of two of the data sets collected here—HowTo100M and HD-VILA-100M—prioritized videos with high view counts on YouTube, equating popularity with quality. The creators of another data set, HD-VG-130M, [noted](https:\/\/arxiv.org\/pdf\/2305.10874) that “high view count does not guarantee video quality,” and used an AI model to select videos of high “aesthetic quality.” Data-set creators often try to avoid videos that contain overlaid text, such as subtitles and logos, so these identifying features don’t appear in videos generated by their model. So, some advice for YouTubers: Putting a watermark or logo on your videos, even a small one, makes them less desirable for training.\n\nTo prepare the videos for training, developers split the footage into short clips, in many cases cutting wherever there is a scene or camera change. Each clip is then given an English-language description of the visual scene so the model can be trained to correlate words with moving images, and to generate videos from text prompts. 
AI developers have a few methods of writing these captions. One way is to pay workers to do it. Another is to use separate AI models to generate a description automatically. The latter is more common, because of its lower cost.\n\nAI video tools aren’t yet as mainstream as chatbots or image generators, but they are already in wide use. You may already have seen AI-manipulated video without realizing it. For example, TED has been using AI to dub speakers’ talks in different languages. This includes the video as well as the audio: Speakers’ mouths are [lip-synched](https:\/\/blog.ted.com\/announcing-ai-adapted-multilingual-ted-talks\/) with the new words so it looks like they’re speaking Japanese, French, or Russian. Nishat Ruiter, TED’s general counsel, told me this is done with the speakers’ knowledge and consent.\n\nThere are also consumer-facing products for tweaking videos with AI. If your face doesn’t look right, for example, you can try a face-enhancer such as [Facetune](https:\/\/www.facetuneapp.com\/create\/video-face-editor), or ditch your mug entirely with a face-swapper such as [Facewow](https:\/\/facewow.ai\/face-swap\/video\/). With Runway’s [Aleph](https:\/\/runwayml.com\/research\/introducing-runway-aleph), you can change the colors of objects, or turn sunshine into a snowstorm.\n\nThen there are tools that generate new videos based on an image you provide. Google [encourages Gemini users](https:\/\/blog.google\/products\/gemini\/photo-to-video\/) to animate their “favorite photos.” The result is a clip that extrapolates eight seconds of movement from an initial image, making a person dance, cook, or [swing a golf club](https:\/\/chromeunboxed.com\/i-just-tried-geminis-new-photo-to-video-feature-and-im-blown-away\/). These are often both amazing and creepy. “Talking head generation”—for [employee-orientation videos](https:\/\/www.youtube.com\/watch?v=2nzdDQr_LqA), for example—is also advancing. 
[Vidnoz AI](https:\/\/www.vidnoz.com\/) promises to generate “Realistic AI Spokespersons of Any Style.” A company called [Arcads](https:\/\/www.arcads.ai\/) will generate a complete advertisement, with actors and voiceover. ByteDance, the company that operates TikTok, offers a similar product called Symphony Creative Studio. Other applications of AI video generation include [virtual try-on of clothes](https:\/\/github.com\/showlab\/Awesome-Video-Diffusion?tab=readme-ov-file#virtual-try-on), [generating custom video games](https:\/\/github.com\/showlab\/Awesome-Video-Diffusion?tab=readme-ov-file#game-generation), and animating [cartoon characters and people](https:\/\/shiyi-zh0408.github.io\/projectpages\/FlexiAct\/).\n\nSome companies are both working with AI and simultaneously fighting to defend their content from being pilfered by AI companies. This reflects the Wild West mentality in AI right now—companies exploiting legal gray areas to see how they can profit. As I investigated these data sets, I learned about an incident involving TED—again, one of the most-pilfered organizations in the data sets captured here, and one that is attempting to employ AI to advance its own business. In June, the Cannes Lions international advertising festival gave one of its Grand Prix awards to an ad that included deepfaked footage from a TED talk by DeAndrea Salvador, currently a state senator in North Carolina. The ad agency, DM9, “used AI cloning to change her talk and repurposed it for a commercial ad campaign,” Ruiter told me on a video call recently. When the manipulation was discovered, the Cannes Lions festival [withdrew the award](https:\/\/www.canneslions.com\/news\/cannes-lions-statement-dm9-entries-into-cannes-lions-2025). Last month, Salvador [sued](https:\/\/www.courthousenews.com\/wp-content\/uploads\/2025\/08\/deandrea-salvador-whirlpool-complaint.pdf) DM9 along with its clients—Whirlpool and Consul—for misappropriation of her likeness, among other things. 
DM9 apologized for the incident and [cited](https:\/\/www.linkedin.com\/posts\/dm9_nota-de-esclarecimento-na-semana-passada-activity-7343346894644875264-sHmV\/?rcm=ACoAAABPu1wBY1EaJ5vQb_gdSm1BybbXG1_20hE) “a series of failures in the production and sending” of the ad. A spokesperson from Whirlpool told me the company was unaware the senator’s remarks had been altered.\n\nOthers in the film industry have filed lawsuits against AI companies for training with their content. In June, Disney and Universal sued Midjourney, the maker of an image-generating tool that can produce images containing recognizable characters (Warner Brothers [joined](https:\/\/apnews.com\/article\/warner-bros-midjourney-ai-copyright-lawsuit-dc-studios-b87d80d7b4a4dfdcf0ee149d30830551) the lawsuit last week). The lawsuit called Midjourney a “bottomless pit of plagiarism.” The following month, two adult-film companies sued Meta for downloading (and distributing through BitTorrent) more than 2,000 of their videos. Neither Midjourney nor Meta has responded to the allegations, and neither responded to my request for comment. One YouTuber filed their own lawsuit: In August of last year, [David Millette sued Nvidia](https:\/\/www.courtlistener.com\/docket\/69045427\/millette-v-nvidia-corporation\/) for unjust enrichment and unfair competition with regard to the training of its [Cosmos AI](https:\/\/www.nvidia.com\/en-us\/ai\/cosmos\/), but the case was voluntarily dismissed months later.\n\nThe Disney characters and the deepfaked Salvador ad are just two instances of how these tools can be damaging. The floodgates may soon be opening further. Thanks to the enormous amount of investment in the technology, generated videos are beginning to appear everywhere. One company, DeepBrain AI, [pays “creators”](https:\/\/www.aistudios.com\/promotion\/creator-join) to post AI-generated videos made with its tools on YouTube. It currently offers \\$500 for a video that gets 10,000 views, a relatively low threshold. 
Companies that run social-media platforms, such as Google and Meta, also pay users for content, through ad-revenue sharing, and many directly [encourage](https:\/\/blog.youtube\/news-and-events\/new-shorts-creation-tools-2025\/) the posting of AI-generated content. Not surprisingly, a coterie of [gurus](https:\/\/www.youtube.com\/watch?v=TWpg1RmzAbc) has arrived to teach the secrets of making money with AI-generated content.\n\n[Google](https:\/\/arxiv.org\/abs\/2007.14937) and [Meta](https:\/\/arxiv.org\/abs\/1905.00561) have also trained AI tools on large quantities of videos from their own platforms: Google has taken [at least 70 million](https:\/\/arxiv.org\/abs\/2007.14937) clips from YouTube, and Meta has taken more than [65 million clips from Instagram](https:\/\/arxiv.org\/abs\/1905.00561). If these companies succeed in flooding their platforms with synthetic videos, human creators could be left with the unenviable task of competing with machines that churn out endless content based on their original work. And social media will become even less social than it is.\n\nI asked Peters if he knew his videos had been taken from YouTube to train AI. He said he didn’t, but he wasn’t surprised. “I think everything’s gonna get stolen,” he told me. But he didn’t know what to do about it. 
“Do I quit, or do I just keep making videos and hope people want to connect with a person?”\n\n### About the Author\n[Alex Reisner](https:\/\/www.theatlantic.com\/author\/alex-reisner\/)\n\n[Alex Reisner](https:\/\/www.theatlantic.com\/author\/alex-reisner\/) is a staff writer at *The Atlantic.*\n\nExplore More Topics\n\n[YouTube](https:\/\/www.theatlantic.com\/tag\/product\/youtube\/)\n\n## Popular Links\n- ### About\n  - [Our History](https:\/\/www.theatlantic.com\/history\/)\n  - [Careers](https:\/\/www.theatlantic.com\/jobs\/)\n- ### Contact\n  - [Help Center](https:\/\/support.theatlantic.com\/)\n  - [Contact Us](https:\/\/www.theatlantic.com\/contact\/)\n  - [Atlantic Brand Partners](https:\/\/atlanticbrandpartners.com\/)\n  - [Press](https:\/\/www.theatlantic.com\/press-releases\/)\n  - [Reprints & Permissions](https:\/\/support.theatlantic.com\/hc\/en-us\/articles\/360011460753-Permissions-to-reprint-or-reproduce-content-from-The-Atlantic)\n- ### Podcasts\n  - [Radio Atlantic](https:\/\/www.theatlantic.com\/podcasts\/radio-atlantic\/)\n  - [The David Frum Show](https:\/\/www.theatlantic.com\/podcasts\/the-david-frum-show\/)\n  - [Galaxy Brain](https:\/\/www.theatlantic.com\/podcasts\/galaxy-brain\/)\n  - [Autocracy in America](https:\/\/www.theatlantic.com\/podcasts\/autocracy-in-america\/)\n  - [How to Age Up](https:\/\/www.theatlantic.com\/podcasts\/how-to-build-a-happy-life\/)\n- ### Subscription\n  - [Purchase](https:\/\/www.theatlantic.com\/subscribe\/footer-cover\/)\n  - [Give a Gift](https:\/\/www.theatlantic.com\/subscribe\/footer-gift\/)\n  - [Manage Subscription](https:\/\/accounts.theatlantic.com\/)\n  - [Group Subscriptions](https:\/\/www.theatlantic.com\/group-subscriptions\/)\n  - [Atlantic Editions](https:\/\/www.theatlantic.com\/atlantic-editions\/)\n  - [Newsletters](https:\/\/www.theatlantic.com\/newsletters\/)\n- ### Follow\n\n### About\n- [Our History](https:\/\/www.theatlantic.com\/history\/)\n- 
[Careers](https:\/\/www.theatlantic.com\/jobs\/)\n\n### Contact\n- [Help Center](https:\/\/support.theatlantic.com\/)\n- [Contact Us](https:\/\/www.theatlantic.com\/contact\/)\n- [Atlantic Brand Partners](https:\/\/atlanticbrandpartners.com\/)\n- [Press](https:\/\/www.theatlantic.com\/press-releases\/)\n- [Reprints & Permissions](https:\/\/support.theatlantic.com\/hc\/en-us\/articles\/360011460753-Permissions-to-reprint-or-reproduce-content-from-The-Atlantic)\n\n### Podcasts\n- [Radio Atlantic](https:\/\/www.theatlantic.com\/podcasts\/radio-atlantic\/)\n- [The David Frum Show](https:\/\/www.theatlantic.com\/podcasts\/the-david-frum-show\/)\n- [Galaxy Brain](https:\/\/www.theatlantic.com\/podcasts\/galaxy-brain\/)\n- [Autocracy in America](https:\/\/www.theatlantic.com\/podcasts\/autocracy-in-america\/)\n- [How to Age Up](https:\/\/www.theatlantic.com\/podcasts\/how-to-build-a-happy-life\/)\n\n### Subscription\n- [Purchase](https:\/\/www.theatlantic.com\/subscribe\/footer-cover\/)\n- [Give a Gift](https:\/\/www.theatlantic.com\/subscribe\/footer-gift\/)\n- [Manage Subscription](https:\/\/accounts.theatlantic.com\/)\n- [Group Subscriptions](https:\/\/www.theatlantic.com\/group-subscriptions\/)\n- [Atlantic Editions](https:\/\/www.theatlantic.com\/atlantic-editions\/)\n- [Newsletters](https:\/\/www.theatlantic.com\/newsletters\/)\n### Follow\n\n## Site Information\n- [Privacy Policy](https:\/\/www.theatlantic.com\/privacy-policy\/)\n- [Your Privacy Choices](https:\/\/www.theatlantic.com\/do-not-sell-my-personal-information\/)\n- [Advertising Guidelines](https:\/\/www.theatlantic.com\/advertising-guidelines\/)\n- [Terms & Conditions](https:\/\/www.theatlantic.com\/terms-and-conditions\/)\n- [Terms of Sale](https:\/\/www.theatlantic.com\/terms-of-sale\/)\n- [Responsible Disclosure](https:\/\/www.theatlantic.com\/responsible-disclosure-policy\/)\n- [Site Map](https:\/\/www.theatlantic.com\/site-map\/)\n\nTheAtlantic.com © 2026 The Atlantic Monthly Group. 
All Rights Reserved.\n\nThis site is protected by reCAPTCHA and the Google [Privacy Policy](https:\/\/policies.google.com\/privacy) and [Terms of Service](https:\/\/policies.google.com\/terms) apply","attrs_readable_markdown":"*Editor’s note: This analysis is part of* The Atlantic*’s investigation into how YouTube videos are taken to train AI tools. You can use the search tool directly [here](https:\/\/www.theatlantic.com\/technology\/archive\/2025\/09\/search-youtube-videos-generative-ai\/684158\/), to see whether videos you’ve created or watched are included in the data sets. This work is part of [AI Watchdog](https:\/\/www.theatlantic.com\/category\/ai-watchdog\/),* The Atlantic*’s ongoing investigation into the generative-AI industry.*\n***\nWhen Jon Peters uploaded his first video to YouTube in 2010, he had no idea where it would lead. He was a professional woodworker running a small business who decided to film himself making a dining table with some old legs he had found in a barn. It turned out that people liked his candid style, and as he posted more videos, a fan base began to grow. “All of a sudden there’s people who appreciate the work I’m doing,” he told me. “The comments were a motivator.” Fifteen years later, his channel has more than 1 million subscribers. Sometimes he gets photos of people in their shops, following his guidance from a big TV on the wall—most of his viewers, Peters told me, are woodworkers looking to him for instruction.\n\nBut [Peters’s channel](https:\/\/www.youtube.com\/@JonPetersArtHome) could soon be obsolete, along with millions of other videos created by people who share their expertise and advice on YouTube. Over the past few months, I’ve discovered more than 15.8 million videos from more than 2 million channels that tech companies have, without permission, downloaded to train AI products. Nearly 1 million of them, by my count, are how-to videos. 
You can find these videos in at least 13 different data sets distributed by AI developers at tech companies, universities, and research organizations, through websites such as Hugging Face, an online AI-development hub.\n\nIn most cases the videos are anonymized, meaning that titles and creator names are not included. I was able to identify the videos by extracting unique identifiers from the data sets and looking them up on YouTube—similar to the process I followed when I revealed the contents of the [Books3](https:\/\/www.theatlantic.com\/technology\/archive\/2023\/09\/books3-database-generative-ai-training-copyright-infringement\/675363\/), [OpenSubtitles](https:\/\/www.theatlantic.com\/technology\/archive\/2024\/11\/opensubtitles-ai-data-set\/680650\/), and [LibGen](https:\/\/www.theatlantic.com\/technology\/archive\/2025\/03\/libgen-meta-openai\/682093\/) data sets. You can search the data sets using the tool below, typing in channel names like “MrBeast” or “James Charles,” for example.\n\n(*A note for users: Just because a video appears in these data sets does not mean it was used for training by AI companies, which could choose to omit certain videos when developing their products.*)\n\nTo create AI products capable of generating video, developers need huge quantities of videos, and YouTube has become a common source. Although YouTube does offer paying subscribers the ability to download videos and watch them through the company’s app whenever they’d like, this is something different: Video files are being ripped from YouTube en masse and saved in files that are then fed to AI algorithms. This kind of downloading [violates the platform’s terms of service](https:\/\/www.bloomberg.com\/news\/articles\/2024-04-04\/youtube-says-openai-training-sora-with-its-videos-would-break-the-rules), but many tools allow AI developers to download videos in this way. 
YouTube appears to have done little, if anything, to stop the mass downloading, and the company did not respond to my request for comment.\n\nNot all YouTube videos are copyrighted (and some are uploaded by people who don’t own the copyrights), but many are. Unauthorized copying or distribution of those videos is illegal, but whether AI training constitutes a form of copying or distribution is still a question being debated in many ongoing lawsuits. Tech companies have argued that training is a “fair use” of copyrighted work, and some [judges have disagreed](https:\/\/www.theatlantic.com\/technology\/archive\/2025\/07\/anthropic-meta-ai-rulings\/683526\/) in their responses. How the courts ultimately apply the law to this novel technology could have massive consequences for creators’ motivations to post their work on YouTube and similar platforms—if tech companies are able to continue taking creators’ work to build AI products that compete with them, then creators may have little choice but to stop sharing.\n\nGenerative-AI tools are already producing videos that compete with human-made work on YouTube. AI-generated history videos with hundreds of thousands of views and many inaccuracies [are drowning out](https:\/\/www.404media.co\/ai-generated-boring-history-videos-are-flooding-youtube-and-drowning-out-real-history\/) fact-checked, expert-produced content. Popular music-remix videos are frequently created [using this technology](https:\/\/www.youtube.com\/watch?v=eIahbtBz6Uo), and many of them perform better than human-made videos.\n\nThe problem extends far beyond YouTube, however. Most modern chatbots are “multimodal,” meaning they can respond to a question by creating relevant media. Google’s Gemini chatbot, for instance, will produce short clips for paying users. Soon, you may be able to ask ChatGPT or another generative-AI tool about how to build a table from found legs and get a custom how-to video in response. 
Even if that response isn’t as good as any video Peters would make, it will be immediate, and it will be tailor-made to your specifications. The online-publishing business has already been [decimated by text-generation tools](https:\/\/www.theatlantic.com\/technology\/archive\/2025\/06\/generative-ai-pirated-articles-books\/683009\/); video creators should expect similar challenges from generative-AI tools in the near future.\n\nMany major tech companies have used these data sets to train AI, according to research papers I’ve read and AI developers I’ve spoken with. The group includes Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. I reached out to each of these companies to ask about their use of these data sets. Only Meta, Amazon, and Nvidia responded. All three said they “respect” content creators and believe that their use of the work is legal under existing copyright law. Amazon also shared that, where video is concerned, it is currently focused on developing ways to generate “compelling, high-quality advertisements from simple prompts.”\n\nWe can’t be certain whether all these companies will use the videos to create for-profit video-generating tools. Some of the work they’ve done may be simply experimental. But a few of these companies have an obvious interest in pursuing commercial products: Meta, for instance, is developing a suite of tools called [Movie Gen](https:\/\/ai.meta.com\/research\/movie-gen\/) that creates videos from text prompts, and Snap offers [“AI Video Lenses”](http:\/\/theverge.com\/news\/628354\/snap-snapchat-ai-video-lenses) that allow users to augment their videos with generative AI. Videos such as the ones in these data sets are the raw material for products like these; much as ChatGPT couldn’t write like Shakespeare without first “reading” Shakespeare, a video generator couldn’t construct a fake newscast without “watching” tons of recorded broadcasts. 
In fact, a large number of the videos in these data sets are from news and educational channels, such as the BBC (which has at least 33,000 videos in the data sets, across its various brands) and TED (nearly 50,000). Hundreds of thousands of others—if not more—are from individual creators, such as Peters.\n\nAI companies are more interested in some videos than others. A spreadsheet leaked [to *404 Media*](https:\/\/www.404media.co\/runway-ai-image-generator-training-data-youtube\/) by a former employee at Runway, which builds AI video-generation tools, shows what the company valued about certain channels: “high camera movement,” “beautiful cinematic landscapes,” “high quality scenes from movies,” “super high quality sci-fi short films.” One channel was labeled “THE HOLY GRAIL OF CAR CINEMATICS SO FAR”; another was labeled “only 4 videos but they are really well done.”\n\nDevelopers seek out high-quality videos in a variety of ways. Curators of two of the data sets collected here—HowTo100M and HD-VILA-100M—prioritized videos with high view counts on YouTube, equating popularity with quality. The creators of another data set, HD-VG-130M, [noted](https:\/\/arxiv.org\/pdf\/2305.10874) that “high view count does not guarantee video quality,” and used an AI model to select videos of high “aesthetic quality.” Data-set creators often try to avoid videos that contain overlaid text, such as subtitles and logos, so these identifying features don’t appear in videos generated by their model. So, some advice for YouTubers: Putting a watermark or logo on your videos, even a small one, makes them less desirable for training.\n\nTo prepare the videos for training, developers split the footage into short clips, in many cases cutting wherever there is a scene or camera change. Each clip is then given an English-language description of the visual scene so the model can be trained to correlate words with moving images, and to generate videos from text prompts. 
AI developers have a few methods of writing these captions. One way is to pay workers to do it. Another is to use separate AI models to generate a description automatically. The latter is more common, because of its lower cost.\n\nAI video tools aren’t yet as mainstream as chatbots or image generators, but they are already in wide use. You may already have seen AI-manipulated video without realizing it. For example, TED has been using AI to dub speakers’ talks in different languages. This includes the video as well as the audio: Speakers’ mouths are [lip-synched](https:\/\/blog.ted.com\/announcing-ai-adapted-multilingual-ted-talks\/) with the new words so it looks like they’re speaking Japanese, French, or Russian. Nishat Ruiter, TED’s general counsel, told me this is done with the speakers’ knowledge and consent.\n\nThere are also consumer-facing products for tweaking videos with AI. If your face doesn’t look right, for example, you can try a face-enhancer such as [Facetune](https:\/\/www.facetuneapp.com\/create\/video-face-editor), or ditch your mug entirely with a face-swapper such as [Facewow](https:\/\/facewow.ai\/face-swap\/video\/). With Runway’s [Aleph](https:\/\/runwayml.com\/research\/introducing-runway-aleph), you can change the colors of objects, or turn sunshine into a snowstorm.\n\nThen there are tools that generate new videos based on an image you provide. Google [encourages Gemini users](https:\/\/blog.google\/products\/gemini\/photo-to-video\/) to animate their “favorite photos.” The result is a clip that extrapolates eight seconds of movement from an initial image, making a person dance, cook, or [swing a golf club](https:\/\/chromeunboxed.com\/i-just-tried-geminis-new-photo-to-video-feature-and-im-blown-away\/). These are often both amazing and creepy. “Talking head generation”—for [employee-orientation videos](https:\/\/www.youtube.com\/watch?v=2nzdDQr_LqA), for example—is also advancing. 
[Vidnoz AI](https:\/\/www.vidnoz.com\/) promises to generate “Realistic AI Spokespersons of Any Style.” A company called [Arcads](https:\/\/www.arcads.ai\/) will generate a complete advertisement, with actors and voiceover. ByteDance, the company that operates TikTok, offers a similar product called Symphony Creative Studio. Other applications of AI video generation include [virtual try-on of clothes](https:\/\/github.com\/showlab\/Awesome-Video-Diffusion?tab=readme-ov-file#virtual-try-on), [generating custom video games](https:\/\/github.com\/showlab\/Awesome-Video-Diffusion?tab=readme-ov-file#game-generation), and animating [cartoon characters and people](https:\/\/shiyi-zh0408.github.io\/projectpages\/FlexiAct\/).\n\nSome companies are both working with AI and simultaneously fighting to defend their content from being pilfered by AI companies. This reflects the Wild West mentality in AI right now—companies exploiting legal gray areas to see how they can profit. As I investigated these data sets, I learned about an incident involving TED—again, one of the most-pilfered organizations in the data sets captured here, and one that is attempting to employ AI to advance its own business. In June, the Cannes Lions international advertising festival gave one of its Grand Prix awards to an ad that included deepfaked footage from a TED talk by DeAndrea Salvador, currently a state senator in North Carolina. The ad agency, DM9, “used AI cloning to change her talk and repurposed it for a commercial ad campaign,” Ruiter told me on a video call recently. When the manipulation was discovered, the Cannes Lions festival [withdrew the award](https:\/\/www.canneslions.com\/news\/cannes-lions-statement-dm9-entries-into-cannes-lions-2025). Last month, Salvador [sued](https:\/\/www.courthousenews.com\/wp-content\/uploads\/2025\/08\/deandrea-salvador-whirlpool-complaint.pdf) DM9 along with its clients—Whirlpool and Consul—for misappropriation of her likeness, among other things. 
DM9 apologized for the incident and [cited](https:\/\/www.linkedin.com\/posts\/dm9_nota-de-esclarecimento-na-semana-passada-activity-7343346894644875264-sHmV\/?rcm=ACoAAABPu1wBY1EaJ5vQb_gdSm1BybbXG1_20hE) “a series of failures in the production and sending” of the ad. A spokesperson from Whirlpool told me the company was unaware the senator’s remarks had been altered.\n\nOthers in the film industry have filed lawsuits against AI companies for training with their content. In June, Disney and Universal sued Midjourney, the maker of an image-generating tool that can produce images containing recognizable characters (Warner Brothers [joined](https:\/\/apnews.com\/article\/warner-bros-midjourney-ai-copyright-lawsuit-dc-studios-b87d80d7b4a4dfdcf0ee149d30830551) the lawsuit last week). The lawsuit called Midjourney a “bottomless pit of plagiarism.” The following month, two adult-film companies sued Meta for downloading (and distributing through BitTorrent) more than 2,000 of their videos. Neither Midjourney nor Meta has responded to the allegations, and neither responded to my request for comment. One YouTuber filed their own lawsuit: In August of last year, [David Millette sued Nvidia](https:\/\/www.courtlistener.com\/docket\/69045427\/millette-v-nvidia-corporation\/) for unjust enrichment and unfair competition with regard to the training of its [Cosmos AI](https:\/\/www.nvidia.com\/en-us\/ai\/cosmos\/), but the case was voluntarily dismissed months later.\n\nThe Disney characters and the deepfaked Salvador ad are just two instances of how these tools can be damaging. The floodgates may soon be opening further. Thanks to the enormous amount of investment in the technology, generated videos are beginning to appear everywhere. One company, DeepBrain AI, [pays “creators”](https:\/\/www.aistudios.com\/promotion\/creator-join) to post AI-generated videos made with its tools on YouTube. It currently offers \\$500 for a video that gets 10,000 views, a relatively low threshold. 
Companies that run social-media platforms, such as Google and Meta, also pay users for content, through ad-revenue sharing, and many directly [encourage](https:\/\/blog.youtube\/news-and-events\/new-shorts-creation-tools-2025\/) the posting of AI-generated content. Not surprisingly, a coterie of [gurus](https:\/\/www.youtube.com\/watch?v=TWpg1RmzAbc) has arrived to teach the secrets of making money with AI-generated content.\n\n[Google](https:\/\/arxiv.org\/abs\/2007.14937) and [Meta](https:\/\/arxiv.org\/abs\/1905.00561) have also trained AI tools on large quantities of videos from their own platforms: Google has taken [at least 70 million](https:\/\/arxiv.org\/abs\/2007.14937) clips from YouTube, and Meta has taken more than [65 million clips from Instagram](https:\/\/arxiv.org\/abs\/1905.00561). If these companies succeed in flooding their platforms with synthetic videos, human creators could be left with the unenviable task of competing with machines that churn out endless content based on their original work. And social media will become even less social than it is.\n\nI asked Peters if he knew his videos had been taken from YouTube to train AI. He said he didn’t, but he wasn’t surprised. “I think everything’s gonna get stolen,” he told me. But he didn’t know what to do about it. 
“Do I quit, or do I just keep making videos and hope people want to connect with a person?”","meta_canonical":null,"ml_categories_json":"{\"\/Internet_and_Telecom\":743,\"\/Internet_and_Telecom\/Web_Services\":656,\"\/Science\":445,\"\/Science\/Computer_Science\":441,\"\/Science\/Computer_Science\/Machine_Learning_and_Artificial_Intelligence\":440,\"\/Internet_and_Telecom\/Web_Services\/Search_Engine_Optimization_and_Marketing\":276,\"\/Law_and_Government\":238,\"\/Law_and_Government\/Legal\":231,\"\/News\":128,\"\/News\/Technology_News\":125,\"\/Law_and_Government\/Legal\/Intellectual_Property\":120}","ml_types_json":"{\"\/Article\":999,\"\/Article\/News_Update\":904}","ml_intent_types_json":"{\"Informational\":999}","meta_language":"en","attrs_author":"Alex Reisner","attrs_publish_time":1757516340,"attrs_original_publish_time":1756684800,"attrs_is_republished":0,"attrs_nr_words":"2535","attrs_boilerpipe_nr_words":"2234","body_ext_links_number":43,"body_int_links_number":84,"meta_nofollow":0,"meta_noarchive":0,"props_was_rendered":0,"src_redirect":"","download_time_msec":50,"download_ttfb_msec":48,"download_size":46146}

3. Robots.txt Check

Query:

Response:

4. Spam/Ban Check

Query:

Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE

✅ CRAWLED (7 days ago)

🤖 ROBOTS ALLOWED

Page Info Filters

Filter	Status	Condition	Details
HTTP status	PASS	`download_http_code = 200`	HTTP 200
Age cutoff	PASS	`download_stamp > now() - 6 MONTH`	0.3 months ago
History drop	PASS	`isNull(history_drop_reason)`	No drop reason
Spam/ban	PASS	`fh_dont_index != 1 AND ml_spam_score = 0`	ml_spam_score=0
Canonical	PASS	`meta_canonical IS NULL OR = '' OR = src_unparsed`	Not set
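
The five conditions in the table above can be read as one combined predicate over a crawled row. The following is a minimal sketch of that reading, not the inspector's actual implementation. The function names and the example row are hypothetical, the six-month cutoff is approximated as 183 days, SQL NULL handling is simplified to Python `None`, and the idea that a page counts as indexable only when every filter passes is an assumption drawn from the PASS column.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "6 MONTH" is approximated as 183 days for illustration.
AGE_CUTOFF = timedelta(days=183)


def page_info_filters(row: dict, now: datetime) -> dict:
    """Evaluate each filter from the table; True means PASS (hypothetical helper)."""
    canonical = row.get("meta_canonical")
    return {
        "HTTP status": row.get("download_http_code") == 200,
        "Age cutoff": row["download_stamp"] > now - AGE_CUTOFF,
        "History drop": row.get("history_drop_reason") is None,
        "Spam/ban": row.get("fh_dont_index") != 1 and row.get("ml_spam_score") == 0,
        "Canonical": canonical is None
        or canonical == ""
        or canonical == row.get("src_unparsed"),
    }


def is_indexable(row: dict, now: datetime) -> bool:
    """Assumed rule: a page is indexable only if every filter passes."""
    return all(page_info_filters(row, now).values())


# Example row loosely modeled on the values shown on this page.
# "src_unparsed" is a placeholder, not the real stored key.
now = datetime(2026, 4, 22, 13, 13, 45, tzinfo=timezone.utc)
row = {
    "download_http_code": 200,
    "download_stamp": datetime(2026, 4, 15, 13, 13, 45, tzinfo=timezone.utc),
    "history_drop_reason": None,
    "fh_dont_index": 0,
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "example-unparsed-key",
}

for name, passed in page_info_filters(row, now).items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
print("indexable:", is_indexable(row, now))
```

Returning the per-filter results as a dictionary keeps the individual PASS/FAIL statuses available for display while the overall verdict is derived from them.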

Page Details

Property	Value
URL	https://www.theatlantic.com/technology/archive/2025/09/youtube-ai-training-data-sets/684116/
Last Crawled	2026-04-15 13:13:45 (7 days ago)
First Indexed	2025-09-10 15:14:19 (7 months ago)
HTTP Status Code	200

Content

Meta Title	At Least 15 Million YouTube Videos Have Been Snatched by AI Companies - The Atlantic
Meta Description	At least 15 million videos have been snatched by tech companies.
Meta Canonical	null

Boilerpipe Text

Editor’s note: This analysis is part of The Atlantic’s investigation into how YouTube videos are taken to train AI tools. You can use the search tool directly here, to see whether videos you’ve created or watched are included in the data sets. This work is part of AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry. When Jon Peters uploaded his first video to YouTube in 2010, he had no idea where it would lead. He was a professional woodworker running a small business who decided to film himself making a dining table with some old legs he had found in a barn. It turned out that people liked his candid style, and as he posted more videos, a fan base began to grow. “All of a sudden there’s people who appreciate the work I’m doing,” he told me. “The comments were a motivator.” Fifteen years later, his channel has more than 1 million subscribers. Sometimes he gets photos of people in their shops, following his guidance from a big TV on the wall—most of his viewers, Peters told me, are woodworkers looking to him for instruction. But Peters’s channel could soon be obsolete, along with millions of other videos created by people who share their expertise and advice on YouTube. Over the past few months, I’ve discovered more than 15.8 million videos from more than 2 million channels that tech companies have, without permission, downloaded to train AI products. Nearly 1 million of them, by my count, are how-to videos. You can find these videos in at least 13 different data sets distributed by AI developers at tech companies, universities, and research organizations, through websites such as Hugging Face, an online AI-development hub. In most cases the videos are anonymized, meaning that titles and creator names are not included.
I was able to identify the videos by extracting unique identifiers from the data sets and looking them up on YouTube—similar to the process I followed when I revealed the contents of the Books3, OpenSubtitles, and LibGen data sets. You can search the data sets using the tool below, typing in channel names like “MrBeast” or “James Charles,” for example. (A note for users: Just because a video appears in these data sets does not mean it was used for training by AI companies, which could choose to omit certain videos when developing their products.) To create AI products capable of generating video, developers need huge quantities of videos, and YouTube has become a common source. Although YouTube does offer paying subscribers the ability to download videos and watch them through the company’s app whenever they’d like, this is something different: Video files are being ripped from YouTube en masse and saved in files that are then fed to AI algorithms. This kind of downloading violates the platform’s terms of service, but many tools allow AI developers to download videos in this way. YouTube appears to have done little, if anything, to stop the mass downloading, and the company did not respond to my request for comment. Not all YouTube videos are copyrighted (and some are uploaded by people who don’t own the copyrights), but many are. Unauthorized copying or distribution of those videos is illegal, but whether AI training constitutes a form of copying or distribution is still a question being debated in many ongoing lawsuits. Tech companies have argued that training is a “fair use” of copyrighted work, and some judges have disagreed in their responses.
How the courts ultimately apply the law to this novel technology could have massive consequences for creators’ motivations to post their work on YouTube and similar platforms—if tech companies are able to continue taking creators’ work to build AI products that compete with them, then creators may have little choice but to stop sharing. Generative-AI tools are already producing videos that compete with human-made work on YouTube. AI-generated history videos with hundreds of thousands of views and many inaccuracies are drowning out fact-checked, expert-produced content. Popular music-remix videos are frequently created using this technology, and many of them perform better than human-made videos. The problem extends far beyond YouTube, however. Most modern chatbots are “multimodal,” meaning they can respond to a question by creating relevant media. Google’s Gemini chatbot, for instance, will produce short clips for paying users. Soon, you may be able to ask ChatGPT or another generative-AI tool about how to build a table from found legs and get a custom how-to video in response. Even if that response isn’t as good as any video Peters would make, it will be immediate, and it will be tailor-made to your specifications. The online-publishing business has already been decimated by text-generation tools; video creators should expect similar challenges from generative-AI tools in the near future. Many major tech companies have used these data sets to train AI, according to research papers I’ve read and AI developers I’ve spoken with. The group includes Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. I reached out to each of these companies to ask about their use of these data sets. Only Meta, Amazon, and Nvidia responded. All three said they “respect” content creators and believe that their use of the work is legal under existing copyright law.
Amazon also shared that, where video is concerned, it is currently focused on developing ways to generate “compelling, high-quality advertisements from simple prompts.” We can’t be certain whether all these companies will use the videos to create for-profit video-generating tools. Some of the work they’ve done may be simply experimental. But a few of these companies have an obvious interest in pursuing commercial products: Meta, for instance, is developing a suite of tools called Movie Gen that creates videos from text prompts, and Snap offers “AI Video Lenses” that allow users to augment their videos with generative AI. Videos such as the ones in these data sets are the raw material for products like these; much as ChatGPT couldn’t write like Shakespeare without first “reading” Shakespeare, a video generator couldn’t construct a fake newscast without “watching” tons of recorded broadcasts. In fact, a large number of the videos in these data sets are from news and educational channels, such as the BBC (which has at least 33,000 videos in the data sets, across its various brands) and TED (nearly 50,000). Hundreds of thousands of others—if not more—are from individual creators, such as Peters. AI companies are more interested in some videos than others. A spreadsheet leaked to 404 Media by a former employee at Runway, which builds AI video-generation tools, shows what the company valued about certain channels: “high camera movement,” “beautiful cinematic landscapes,” “high quality scenes from movies,” “super high quality sci-fi short films.” One channel was labeled “THE HOLY GRAIL OF CAR CINEMATICS SO FAR”; another was labeled “only 4 videos but they are really well done.” Developers seek out high-quality videos in a variety of ways. Curators of two of the data sets collected here—HowTo100M and HD-VILA-100M—prioritized videos with high view counts on YouTube, equating popularity with quality. 
The creators of another data set, HD-VG-130M, noted that “high view count does not guarantee video quality,” and used an AI model to select videos of high “aesthetic quality.” Data-set creators often try to avoid videos that contain overlaid text, such as subtitles and logos, so these identifying features don’t appear in videos generated by their model. So, some advice for YouTubers: Putting a watermark or logo on your videos, even a small one, makes them less desirable for training. To prepare the videos for training, developers split the footage into short clips, in many cases cutting wherever there is a scene or camera change. Each clip is then given an English-language description of the visual scene so the model can be trained to correlate words with moving images, and to generate videos from text prompts. AI developers have a few methods of writing these captions. One way is to pay workers to do it. Another is to use separate AI models to generate a description automatically. The latter is more common, because of its lower cost. AI video tools aren’t yet as mainstream as chatbots or image generators, but they are already in wide use. You may already have seen AI-manipulated video without realizing it. For example, TED has been using AI to dub speakers’ talks in different languages. This includes the video as well as the audio: Speakers’ mouths are lip-synched with the new words so it looks like they’re speaking Japanese, French, or Russian. Nishat Ruiter, TED’s general counsel, told me this is done with the speakers’ knowledge and consent. There are also consumer-facing products for tweaking videos with AI. If your face doesn’t look right, for example, you can try a face-enhancer such as Facetune, or ditch your mug entirely with a face-swapper such as Facewow. With Runway’s Aleph, you can change the colors of objects, or turn sunshine into a snowstorm. Then there are tools that generate new videos based on an image you provide. 
Google encourages Gemini users to animate their “favorite photos.” The result is a clip that extrapolates eight seconds of movement from an initial image, making a person dance, cook, or swing a golf club. These are often both amazing and creepy. “Talking head generation”—for employee-orientation videos, for example—is also advancing. Vidnoz AI promises to generate “Realistic AI Spokespersons of Any Style.” A company called Arcads will generate a complete advertisement, with actors and voiceover. ByteDance, the company that operates TikTok, offers a similar product called Symphony Creative Studio. Other applications of AI video generation include virtual try-on of clothes, generating custom video games, and animating cartoon characters and people. Some companies are both working with AI and simultaneously fighting to defend their content from being pilfered by AI companies. This reflects the Wild West mentality in AI right now—companies exploiting legal gray areas to see how they can profit. As I investigated these data sets, I learned about an incident involving TED—again, one of the most-pilfered organizations in the data sets captured here, and one that is attempting to employ AI to advance its own business. In June, the Cannes Lions international advertising festival gave one of its Grand Prix awards to an ad that included deepfaked footage from a TED talk by DeAndrea Salvador, currently a state senator in North Carolina. The ad agency, DM9, “used AI cloning to change her talk and repurposed it for a commercial ad campaign,” Ruiter told me on a video call recently. When the manipulation was discovered, the Cannes Lions festival withdrew the award. Last month, Salvador sued DM9 along with its clients—Whirlpool and Consul—for misappropriation of her likeness, among other things. DM9 apologized for the incident and cited “a series of failures in the production and sending” of the ad. 
A spokesperson from Whirlpool told me the company was unaware the senator’s remarks had been altered. Others in the film industry have filed lawsuits against AI companies for training with their content. In June, Disney and Universal sued Midjourney, the maker of an image-generating tool that can produce images containing recognizable characters (Warner Brothers joined the lawsuit last week). The lawsuit called Midjourney a “bottomless pit of plagiarism.” The following month, two adult-film companies sued Meta for downloading (and distributing through BitTorrent) more than 2,000 of their videos. Neither Midjourney nor Meta has responded to the allegations, and neither responded to my request for comment. One YouTuber filed their own lawsuit: In August of last year, David Millette sued Nvidia for unjust enrichment and unfair competition with regard to the training of its Cosmos AI, but the case was voluntarily dismissed months later. The Disney characters and the deepfaked Salvador ad are just two instances of how these tools can be damaging. The floodgates may soon be opening further. Thanks to the enormous amount of investment in the technology, generated videos are beginning to appear everywhere. One company, DeepBrain AI, pays “creators” to post AI-generated videos made with its tools on YouTube. It currently offers $500 for a video that gets 10,000 views, a relatively low threshold. Companies that run social-media platforms, such as Google and Meta, also pay users for content, through ad-revenue sharing, and many directly encourage the posting of AI-generated content. Not surprisingly, a coterie of gurus has arrived to teach the secrets of making money with AI-generated content. Google and Meta have also trained AI tools on large quantities of videos from their own platforms: Google has taken at least 70 million clips from YouTube, and Meta has taken more than 65 million clips from Instagram. 
If these companies succeed in flooding their platforms with synthetic videos, human creators could be left with the unenviable task of competing with machines that churn out endless content based on their original work. And social media will become even less social than it is. I asked Peters if he knew his videos had been taken from YouTube to train AI. He said he didn’t, but he wasn’t surprised. “I think everything’s gonna get stolen,” he told me. But he didn’t know what to do about it. “Do I quit, or do I just keep making videos and hope people want to connect with a person?”

Markdown

## Site Navigation - [Popular](https://www.theatlantic.com/most-popular/)[Latest](https://www.theatlantic.com/latest/)[Newsletters](https://www.theatlantic.com/newsletters/) ## Sections - [Ideas](https://www.theatlantic.com/ideas/) - [Politics](https://www.theatlantic.com/politics/) - [Economy](https://www.theatlantic.com/economy/) - [Global](https://www.theatlantic.com/international/) - [National Security](https://www.theatlantic.com/national-security/) - [Washington Week](https://www.theatlantic.com/category/washington-week-atlantic/) - [Features](https://www.theatlantic.com/category/features/) - [Technology](https://www.theatlantic.com/technology/) - [AI Watchdog](https://www.theatlantic.com/category/ai-watchdog/) - [Science](https://www.theatlantic.com/science/) - [Planet](https://www.theatlantic.com/projects/planet/) - [Health](https://www.theatlantic.com/health/) - [Philosophy](https://www.theatlantic.com/category/philosophy/) - [Education](https://www.theatlantic.com/education/) - [Culture](https://www.theatlantic.com/culture/) - [Comedy](https://www.theatlantic.com/category/comedy/) - [Family](https://www.theatlantic.com/family/) - [Books](https://www.theatlantic.com/books/) - [Fiction](https://www.theatlantic.com/category/fiction/) - [Photography](https://www.theatlantic.com/photo/) - [Events](https://www.theatlantic.com/atlantic-across-america/) - [![](https://cdn.theatlantic.com/_next/static/images/nav-archive-promo-5541b02ae92f1a9276249e1c6c2534ee.png)Explore The Atlantic Archive](https://www.theatlantic.com/archive/) - [![games promo icon](https://cdn.theatlantic.com/media/games/games_promo_color.png)Play The Atlantic Games](https://www.theatlantic.com/games/) - [Listen to Podcasts and Articles](https://www.theatlantic.com/audio/) ## The Print Edition [![View the current print 
edition](https://www.theatlantic.com/magazine/images/current-issue.420.jpg)](https://www.theatlantic.com/magazine/) [Latest Issue](https://www.theatlantic.com/magazine/)[Past Issues](https://www.theatlantic.com/magazine/backissues/) *** [Give a Gift](https://accounts.theatlantic.com/products/gift) [Technology](https://www.theatlantic.com/technology/) # AI Is Coming for YouTube Creators At least 15 million videos have been snatched by tech companies. By [Alex Reisner](https://www.theatlantic.com/author/alex-reisner/) ![Animated illustration of data sets filled with binders labeled YouTube](https://cdn.theatlantic.com/thumbor/9h8vqTtTTTInm7xO_B0aNxJANEc=/0x0:2000x1125/960x540/media/img/mt/2025/09/ai_watchdog_youtube/original.gif) Illustration by Matteo Giuseppe Pani / The Atlantic September 10, 2025 *Editor’s note: This analysis is part of* The Atlantic*’s investigation into how YouTube videos are taken to train AI tools. You can use the search tool directly [here](https://www.theatlantic.com/technology/archive/2025/09/search-youtube-videos-generative-ai/684158/), to see whether videos you’ve created or watched are included in the data sets. This work is part of [AI Watchdog](https://www.theatlantic.com/category/ai-watchdog/),* The Atlantic*’s ongoing investigation into the generative-AI industry.* *** When Jon Peters uploaded his first video to YouTube in 2010, he had no idea where it would lead. 
He was a professional woodworker running a small business who decided to film himself making a dining table with some old legs he had found in a barn. It turned out that people liked his candid style, and as he posted more videos, a fan base began to grow. “All of a sudden there’s people who appreciate the work I’m doing,” he told me. “The comments were a motivator.” Fifteen years later, his channel has more than 1 million subscribers. Sometimes he gets photos of people in their shops, following his guidance from a big TV on the wall—most of his viewers, Peters told me, are woodworkers looking to him for instruction. But [Peters’s channel](https://www.youtube.com/@JonPetersArtHome) could soon be obsolete, along with millions of other videos created by people who share their expertise and advice on YouTube. Over the past few months, I’ve discovered more than 15.8 million videos from more than 2 million channels that tech companies have, without permission, downloaded to train AI products. Nearly 1 million of them, by my count, are how-to videos. You can find these videos in at least 13 different data sets distributed by AI developers at tech companies, universities, and research organizations, through websites such as Hugging Face, an online AI-development hub. In most cases the videos are anonymized, meaning that titles and creator names are not included. I was able to identify the videos by extracting unique identifiers from the data sets and looking them up on YouTube—similar to the process I followed when I revealed the contents of the [Books3](https://www.theatlantic.com/technology/archive/2023/09/books3-database-generative-ai-training-copyright-infringement/675363/), [OpenSubtitles](https://www.theatlantic.com/technology/archive/2024/11/opensubtitles-ai-data-set/680650/), and [LibGen](https://www.theatlantic.com/technology/archive/2025/03/libgen-meta-openai/682093/) data sets. 
You can search the data sets using the tool below, typing in channel names like “MrBeast” or “James Charles,” for example. (*A note for users: Just because a video appears in these data sets does not mean it was used for training by AI companies, which could choose to omit certain videos when developing their products.*) To create AI products capable of generating video, developers need huge quantities of videos, and YouTube has become a common source. Although YouTube does offer paying subscribers the ability to download videos and watch them through the company’s app whenever they’d like, this is something different: Video files are being ripped from YouTube en masse and saved in files that are then fed to AI algorithms. This kind of downloading [violates the platform’s terms of service](https://www.bloomberg.com/news/articles/2024-04-04/youtube-says-openai-training-sora-with-its-videos-would-break-the-rules), but many tools allow AI developers to download videos in this way. YouTube appears to have done little, if anything, to stop the mass downloading, and the company did not respond to my request for comment. Not all YouTube videos are copyrighted (and some are uploaded by people who don’t own the copyrights), but many are. Unauthorized copying or distribution of those videos is illegal, but whether AI training constitutes a form of copying or distribution is still a question being debated in many ongoing lawsuits. Tech companies have argued that training is a “fair use” of copyrighted work, and some [judges have disagreed](https://www.theatlantic.com/technology/archive/2025/07/anthropic-meta-ai-rulings/683526/) in their responses. 
How the courts ultimately apply the law to this novel technology could have massive consequences for creators’ motivations to post their work on YouTube and similar platforms—if tech companies are able to continue taking creators’ work to build AI products that compete with them, then creators may have little choice but to stop sharing. Generative-AI tools are already producing videos that compete with human-made work on YouTube. AI-generated history videos with hundreds of thousands of views and many inaccuracies [are drowning out](https://www.404media.co/ai-generated-boring-history-videos-are-flooding-youtube-and-drowning-out-real-history/) fact-checked, expert-produced content. Popular music-remix videos are frequently created [using this technology](https://www.youtube.com/watch?v=eIahbtBz6Uo), and many of them perform better than human-made videos. The problem extends far beyond YouTube, however. Most modern chatbots are “multimodal,” meaning they can respond to a question by creating relevant media. Google’s Gemini chatbot, for instance, will produce short clips for paying users. Soon, you may be able to ask ChatGPT or another generative-AI tool about how to build a table from found legs and get a custom how-to video in response. Even if that response isn’t as good as any video Peters would make, it will be immediate, and it will be tailor-made to your specifications. The online-publishing business has already been [decimated by text-generation tools](https://www.theatlantic.com/technology/archive/2025/06/generative-ai-pirated-articles-books/683009/); video creators should expect similar challenges from generative-AI tools in the near future. Many major tech companies have used these data sets to train AI, according to research papers I’ve read and AI developers I’ve spoken with. The group includes Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. I reached out to each of these companies to ask about their use of these data sets. 
Only Meta, Amazon, and Nvidia responded. All three said they “respect” content creators and believe that their use of the work is legal under existing copyright law. Amazon also shared that, where video is concerned, it is currently focused on developing ways to generate “compelling, high-quality advertisements from simple prompts.” We can’t be certain whether all these companies will use the videos to create for-profit video-generating tools. Some of the work they’ve done may be simply experimental. But a few of these companies have an obvious interest in pursuing commercial products: Meta, for instance, is developing a suite of tools called [Movie Gen](https://ai.meta.com/research/movie-gen/) that creates videos from text prompts, and Snap offers [“AI Video Lenses”](http://theverge.com/news/628354/snap-snapchat-ai-video-lenses) that allow users to augment their videos with generative AI. Videos such as the ones in these data sets are the raw material for products like these; much as ChatGPT couldn’t write like Shakespeare without first “reading” Shakespeare, a video generator couldn’t construct a fake newscast without “watching” tons of recorded broadcasts. In fact, a large number of the videos in these data sets are from news and educational channels, such as the BBC (which has at least 33,000 videos in the data sets, across its various brands) and TED (nearly 50,000). Hundreds of thousands of others—if not more—are from individual creators, such as Peters. AI companies are more interested in some videos than others. 
A spreadsheet leaked [to *404 Media*](https://www.404media.co/runway-ai-image-generator-training-data-youtube/) by a former employee at Runway, which builds AI video-generation tools, shows what the company valued about certain channels: “high camera movement,” “beautiful cinematic landscapes,” “high quality scenes from movies,” “super high quality sci-fi short films.” One channel was labeled “THE HOLY GRAIL OF CAR CINEMATICS SO FAR”; another was labeled “only 4 videos but they are really well done.” Developers seek out high-quality videos in a variety of ways. Curators of two of the data sets collected here—HowTo100M and HD-VILA-100M—prioritized videos with high view counts on YouTube, equating popularity with quality. The creators of another data set, HD-VG-130M, [noted](https://arxiv.org/pdf/2305.10874) that “high view count does not guarantee video quality,” and used an AI model to select videos of high “aesthetic quality.” Data-set creators often try to avoid videos that contain overlaid text, such as subtitles and logos, so these identifying features don’t appear in videos generated by their model. So, some advice for YouTubers: Putting a watermark or logo on your videos, even a small one, makes them less desirable for training. To prepare the videos for training, developers split the footage into short clips, in many cases cutting wherever there is a scene or camera change. Each clip is then given an English-language description of the visual scene so the model can be trained to correlate words with moving images, and to generate videos from text prompts. AI developers have a few methods of writing these captions. One way is to pay workers to do it. Another is to use separate AI models to generate a description automatically. The latter is more common, because of its lower cost. AI video tools aren’t yet as mainstream as chatbots or image generators, but they are already in wide use. You may already have seen AI-manipulated video without realizing it. 
For example, TED has been using AI to dub speakers’ talks in different languages. This includes the video as well as the audio: Speakers’ mouths are [lip-synched](https://blog.ted.com/announcing-ai-adapted-multilingual-ted-talks/) with the new words so it looks like they’re speaking Japanese, French, or Russian. Nishat Ruiter, TED’s general counsel, told me this is done with the speakers’ knowledge and consent. There are also consumer-facing products for tweaking videos with AI. If your face doesn’t look right, for example, you can try a face-enhancer such as [Facetune](https://www.facetuneapp.com/create/video-face-editor), or ditch your mug entirely with a face-swapper such as [Facewow](https://facewow.ai/face-swap/video/). With Runway’s [Aleph](https://runwayml.com/research/introducing-runway-aleph), you can change the colors of objects, or turn sunshine into a snowstorm. Then there are tools that generate new videos based on an image you provide. Google [encourages Gemini users](https://blog.google/products/gemini/photo-to-video/) to animate their “favorite photos.” The result is a clip that extrapolates eight seconds of movement from an initial image, making a person dance, cook, or [swing a golf club](https://chromeunboxed.com/i-just-tried-geminis-new-photo-to-video-feature-and-im-blown-away/). These are often both amazing and creepy. “Talking head generation”—for [employee-orientation videos](https://www.youtube.com/watch?v=2nzdDQr_LqA), for example—is also advancing. [Vidnoz AI](https://www.vidnoz.com/) promises to generate “Realistic AI Spokespersons of Any Style.” A company called [Arcads](https://www.arcads.ai/) will generate a complete advertisement, with actors and voiceover. ByteDance, the company that operates TikTok, offers a similar product called Symphony Creative Studio. 
Other applications of AI video generation include [virtual try-on of clothes](https://github.com/showlab/Awesome-Video-Diffusion?tab=readme-ov-file#virtual-try-on), [generating custom video games](https://github.com/showlab/Awesome-Video-Diffusion?tab=readme-ov-file#game-generation), and animating [cartoon characters and people](https://shiyi-zh0408.github.io/projectpages/FlexiAct/). Some companies are both working with AI and simultaneously fighting to defend their content from being pilfered by AI companies. This reflects the Wild West mentality in AI right now—companies exploiting legal gray areas to see how they can profit. As I investigated these data sets, I learned about an incident involving TED—again, one of the most-pilfered organizations in the data sets captured here, and one that is attempting to employ AI to advance its own business. In June, the Cannes Lions international advertising festival gave one of its Grand Prix awards to an ad that included deepfaked footage from a TED talk by DeAndrea Salvador, currently a state senator in North Carolina. The ad agency, DM9, “used AI cloning to change her talk and repurposed it for a commercial ad campaign,” Ruiter told me on a video call recently. When the manipulation was discovered, the Cannes Lions festival [withdrew the award](https://www.canneslions.com/news/cannes-lions-statement-dm9-entries-into-cannes-lions-2025). Last month, Salvador [sued](https://www.courthousenews.com/wp-content/uploads/2025/08/deandrea-salvador-whirlpool-complaint.pdf) DM9 along with its clients—Whirlpool and Consul—for misappropriation of her likeness, among other things. DM9 apologized for the incident and [cited](https://www.linkedin.com/posts/dm9_nota-de-esclarecimento-na-semana-passada-activity-7343346894644875264-sHmV/?rcm=ACoAAABPu1wBY1EaJ5vQb_gdSm1BybbXG1_20hE) “a series of failures in the production and sending” of the ad. 
A spokesperson from Whirlpool told me the company was unaware the senator’s remarks had been altered. Others in the film industry have filed lawsuits against AI companies for training with their content. In June, Disney and Universal sued Midjourney, the maker of an image-generating tool that can produce images containing recognizable characters (Warner Brothers [joined](https://apnews.com/article/warner-bros-midjourney-ai-copyright-lawsuit-dc-studios-b87d80d7b4a4dfdcf0ee149d30830551) the lawsuit last week). The lawsuit called Midjourney a “bottomless pit of plagiarism.” The following month, two adult-film companies sued Meta for downloading (and distributing through BitTorrent) more than 2,000 of their videos. Neither Midjourney nor Meta has responded to the allegations, and neither responded to my request for comment. One YouTuber filed their own lawsuit: In August of last year, [David Millette sued Nvidia](https://www.courtlistener.com/docket/69045427/millette-v-nvidia-corporation/) for unjust enrichment and unfair competition with regard to the training of its [Cosmos AI](https://www.nvidia.com/en-us/ai/cosmos/), but the case was voluntarily dismissed months later. The Disney characters and the deepfaked Salvador ad are just two instances of how these tools can be damaging. The floodgates may soon be opening further. Thanks to the enormous amount of investment in the technology, generated videos are beginning to appear everywhere. One company, DeepBrain AI, [pays “creators”](https://www.aistudios.com/promotion/creator-join) to post AI-generated videos made with its tools on YouTube. It currently offers \$500 for a video that gets 10,000 views, a relatively low threshold. Companies that run social-media platforms, such as Google and Meta, also pay users for content, through ad-revenue sharing, and many directly [encourage](https://blog.youtube/news-and-events/new-shorts-creation-tools-2025/) the posting of AI-generated content. 
Not surprisingly, a coterie of [gurus](https://www.youtube.com/watch?v=TWpg1RmzAbc) has arrived to teach the secrets of making money with AI-generated content. [Google](https://arxiv.org/abs/2007.14937) and [Meta](https://arxiv.org/abs/1905.00561) have also trained AI tools on large quantities of videos from their own platforms: Google has taken [at least 70 million](https://arxiv.org/abs/2007.14937) clips from YouTube, and Meta has taken more than [65 million clips from Instagram](https://arxiv.org/abs/1905.00561). If these companies succeed in flooding their platforms with synthetic videos, human creators could be left with the unenviable task of competing with machines that churn out endless content based on their original work. And social media will become even less social than it is. I asked Peters if he knew his videos had been taken from YouTube to train AI. He said he didn’t, but he wasn’t surprised. “I think everything’s gonna get stolen,” he told me. But he didn’t know what to do about it. 
“Do I quit, or do I just keep making videos and hope people want to connect with a person?” ### About the Author [Alex Reisner](https://www.theatlantic.com/author/alex-reisner/) [Alex Reisner](https://www.theatlantic.com/author/alex-reisner/) is a staff writer at *The Atlantic.* Explore More Topics [YouTube](https://www.theatlantic.com/tag/product/youtube/) ## Popular Links - ### About - [Our History](https://www.theatlantic.com/history/) - [Careers](https://www.theatlantic.com/jobs/) - ### Contact - [Help Center](https://support.theatlantic.com/) - [Contact Us](https://www.theatlantic.com/contact/) - [Atlantic Brand Partners](https://atlanticbrandpartners.com/) - [Press](https://www.theatlantic.com/press-releases/) - [Reprints & Permissions](https://support.theatlantic.com/hc/en-us/articles/360011460753-Permissions-to-reprint-or-reproduce-content-from-The-Atlantic) - ### Podcasts - [Radio Atlantic](https://www.theatlantic.com/podcasts/radio-atlantic/) - [The David Frum Show](https://www.theatlantic.com/podcasts/the-david-frum-show/) - [Galaxy Brain](https://www.theatlantic.com/podcasts/galaxy-brain/) - [Autocracy in America](https://www.theatlantic.com/podcasts/autocracy-in-america/) - [How to Age Up](https://www.theatlantic.com/podcasts/how-to-build-a-happy-life/) - ### Subscription - [Purchase](https://www.theatlantic.com/subscribe/footer-cover/) - [Give a Gift](https://www.theatlantic.com/subscribe/footer-gift/) - [Manage Subscription](https://accounts.theatlantic.com/) - [Group Subscriptions](https://www.theatlantic.com/group-subscriptions/) - [Atlantic Editions](https://www.theatlantic.com/atlantic-editions/) - [Newsletters](https://www.theatlantic.com/newsletters/) - ### Follow ## Site Information - [Privacy Policy](https://www.theatlantic.com/privacy-policy/) - [Your Privacy Choices](https://www.theatlantic.com/do-not-sell-my-personal-information/) - [Advertising Guidelines](https://www.theatlantic.com/advertising-guidelines/) - [Terms & Conditions](https://www.theatlantic.com/terms-and-conditions/) - [Terms of Sale](https://www.theatlantic.com/terms-of-sale/) - [Responsible Disclosure](https://www.theatlantic.com/responsible-disclosure-policy/) - [Site Map](https://www.theatlantic.com/site-map/) TheAtlantic.com © 2026 The Atlantic Monthly Group. All Rights Reserved. This site is protected by reCAPTCHA and the Google [Privacy Policy](https://policies.google.com/privacy) and [Terms of Service](https://policies.google.com/terms) apply

Readable Markdown

*Editor’s note: This analysis is part of* The Atlantic*’s investigation into how YouTube videos are taken to train AI tools. You can use the search tool directly [here](https://www.theatlantic.com/technology/archive/2025/09/search-youtube-videos-generative-ai/684158/), to see whether videos you’ve created or watched are included in the data sets. This work is part of [AI Watchdog](https://www.theatlantic.com/category/ai-watchdog/),* The Atlantic*’s ongoing investigation into the generative-AI industry.* *** When Jon Peters uploaded his first video to YouTube in 2010, he had no idea where it would lead. He was a professional woodworker running a small business who decided to film himself making a dining table with some old legs he had found in a barn. It turned out that people liked his candid style, and as he posted more videos, a fan base began to grow. “All of a sudden there’s people who appreciate the work I’m doing,” he told me. “The comments were a motivator.” Fifteen years later, his channel has more than 1 million subscribers. Sometimes he gets photos of people in their shops, following his guidance from a big TV on the wall—most of his viewers, Peters told me, are woodworkers looking to him for instruction. But [Peters’s channel](https://www.youtube.com/@JonPetersArtHome) could soon be obsolete, along with millions of other videos created by people who share their expertise and advice on YouTube. Over the past few months, I’ve discovered more than 15.8 million videos from more than 2 million channels that tech companies have, without permission, downloaded to train AI products. Nearly 1 million of them, by my count, are how-to videos. You can find these videos in at least 13 different data sets distributed by AI developers at tech companies, universities, and research organizations, through websites such as Hugging Face, an online AI-development hub. In most cases the videos are anonymized, meaning that titles and creator names are not included. 
I was able to identify the videos by extracting unique identifiers from the data sets and looking them up on YouTube—similar to the process I followed when I revealed the contents of the [Books3](https://www.theatlantic.com/technology/archive/2023/09/books3-database-generative-ai-training-copyright-infringement/675363/), [OpenSubtitles](https://www.theatlantic.com/technology/archive/2024/11/opensubtitles-ai-data-set/680650/), and [LibGen](https://www.theatlantic.com/technology/archive/2025/03/libgen-meta-openai/682093/) data sets. You can search the data sets using the tool below, typing in channel names like “MrBeast” or “James Charles,” for example. (*A note for users: Just because a video appears in these data sets does not mean it was used for training by AI companies, which could choose to omit certain videos when developing their products.*) To create AI products capable of generating video, developers need huge quantities of videos, and YouTube has become a common source. Although YouTube does offer paying subscribers the ability to download videos and watch them through the company’s app whenever they’d like, this is something different: Video files are being ripped from YouTube en masse and saved in files that are then fed to AI algorithms. This kind of downloading [violates the platform’s terms of service](https://www.bloomberg.com/news/articles/2024-04-04/youtube-says-openai-training-sora-with-its-videos-would-break-the-rules), but many tools allow AI developers to download videos in this way. YouTube appears to have done little, if anything, to stop the mass downloading, and the company did not respond to my request for comment. Not all YouTube videos are copyrighted (and some are uploaded by people who don’t own the copyrights), but many are. Unauthorized copying or distribution of those videos is illegal, but whether AI training constitutes a form of copying or distribution is still a question being debated in many ongoing lawsuits. 
Tech companies have argued that training is a “fair use” of copyrighted work, and some [judges have disagreed](https://www.theatlantic.com/technology/archive/2025/07/anthropic-meta-ai-rulings/683526/) in their responses. How the courts ultimately apply the law to this novel technology could have massive consequences for creators’ motivations to post their work on YouTube and similar platforms—if tech companies are able to continue taking creators’ work to build AI products that compete with them, then creators may have little choice but to stop sharing.

Generative-AI tools are already producing videos that compete with human-made work on YouTube. AI-generated history videos with hundreds of thousands of views and many inaccuracies [are drowning out](https://www.404media.co/ai-generated-boring-history-videos-are-flooding-youtube-and-drowning-out-real-history/) fact-checked, expert-produced content. Popular music-remix videos are frequently created [using this technology](https://www.youtube.com/watch?v=eIahbtBz6Uo), and many of them perform better than human-made videos.

The problem extends far beyond YouTube, however. Most modern chatbots are “multimodal,” meaning they can respond to a question by creating relevant media. Google’s Gemini chatbot, for instance, will produce short clips for paying users. Soon, you may be able to ask ChatGPT or another generative-AI tool about how to build a table from found legs and get a custom how-to video in response. Even if that response isn’t as good as any video Peters would make, it will be immediate, and it will be tailor-made to your specifications. The online-publishing business has already been [decimated by text-generation tools](https://www.theatlantic.com/technology/archive/2025/06/generative-ai-pirated-articles-books/683009/); video creators should expect similar challenges from generative-AI tools in the near future.

Many major tech companies have used these data sets to train AI, according to research papers I’ve read and AI developers I’ve spoken with. The group includes Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. I reached out to each of these companies to ask about their use of these data sets. Only Meta, Amazon, and Nvidia responded. All three said they “respect” content creators and believe that their use of the work is legal under existing copyright law. Amazon also shared that, where video is concerned, it is currently focused on developing ways to generate “compelling, high-quality advertisements from simple prompts.”

We can’t be certain whether all these companies will use the videos to create for-profit video-generating tools. Some of the work they’ve done may be simply experimental. But a few of these companies have an obvious interest in pursuing commercial products: Meta, for instance, is developing a suite of tools called [Movie Gen](https://ai.meta.com/research/movie-gen/) that creates videos from text prompts, and Snap offers [“AI Video Lenses”](http://theverge.com/news/628354/snap-snapchat-ai-video-lenses) that allow users to augment their videos with generative AI.

Videos such as the ones in these data sets are the raw material for products like these; much as ChatGPT couldn’t write like Shakespeare without first “reading” Shakespeare, a video generator couldn’t construct a fake newscast without “watching” tons of recorded broadcasts. In fact, a large number of the videos in these data sets are from news and educational channels, such as the BBC (which has at least 33,000 videos in the data sets, across its various brands) and TED (nearly 50,000). Hundreds of thousands of others—if not more—are from individual creators, such as Peters.

AI companies are more interested in some videos than others.
A spreadsheet leaked [to *404 Media*](https://www.404media.co/runway-ai-image-generator-training-data-youtube/) by a former employee at Runway, which builds AI video-generation tools, shows what the company valued about certain channels: “high camera movement,” “beautiful cinematic landscapes,” “high quality scenes from movies,” “super high quality sci-fi short films.” One channel was labeled “THE HOLY GRAIL OF CAR CINEMATICS SO FAR”; another was labeled “only 4 videos but they are really well done.”

Developers seek out high-quality videos in a variety of ways. Curators of two of the data sets collected here—HowTo100M and HD-VILA-100M—prioritized videos with high view counts on YouTube, equating popularity with quality. The creators of another data set, HD-VG-130M, [noted](https://arxiv.org/pdf/2305.10874) that “high view count does not guarantee video quality,” and used an AI model to select videos of high “aesthetic quality.”

Data-set creators often try to avoid videos that contain overlaid text, such as subtitles and logos, so these identifying features don’t appear in videos generated by their model. So, some advice for YouTubers: Putting a watermark or logo on your videos, even a small one, makes them less desirable for training.

To prepare the videos for training, developers split the footage into short clips, in many cases cutting wherever there is a scene or camera change. Each clip is then given an English-language description of the visual scene so the model can be trained to correlate words with moving images, and to generate videos from text prompts. AI developers have a few methods of writing these captions. One way is to pay workers to do it. Another is to use separate AI models to generate a description automatically. The latter is more common, because of its lower cost.

AI video tools aren’t yet as mainstream as chatbots or image generators, but they are already in wide use. You may already have seen AI-manipulated video without realizing it.
For example, TED has been using AI to dub speakers’ talks in different languages. This includes the video as well as the audio: Speakers’ mouths are [lip-synched](https://blog.ted.com/announcing-ai-adapted-multilingual-ted-talks/) with the new words so it looks like they’re speaking Japanese, French, or Russian. Nishat Ruiter, TED’s general counsel, told me this is done with the speakers’ knowledge and consent.

There are also consumer-facing products for tweaking videos with AI. If your face doesn’t look right, for example, you can try a face-enhancer such as [Facetune](https://www.facetuneapp.com/create/video-face-editor), or ditch your mug entirely with a face-swapper such as [Facewow](https://facewow.ai/face-swap/video/). With Runway’s [Aleph](https://runwayml.com/research/introducing-runway-aleph), you can change the colors of objects, or turn sunshine into a snowstorm.

Then there are tools that generate new videos based on an image you provide. Google [encourages Gemini users](https://blog.google/products/gemini/photo-to-video/) to animate their “favorite photos.” The result is a clip that extrapolates eight seconds of movement from an initial image, making a person dance, cook, or [swing a golf club](https://chromeunboxed.com/i-just-tried-geminis-new-photo-to-video-feature-and-im-blown-away/). These are often both amazing and creepy.

“Talking head generation”—for [employee-orientation videos](https://www.youtube.com/watch?v=2nzdDQr_LqA), for example—is also advancing. [Vidnoz AI](https://www.vidnoz.com/) promises to generate “Realistic AI Spokespersons of Any Style.” A company called [Arcads](https://www.arcads.ai/) will generate a complete advertisement, with actors and voiceover. ByteDance, the company that operates TikTok, offers a similar product called Symphony Creative Studio.
Other applications of AI video generation include [virtual try-on of clothes](https://github.com/showlab/Awesome-Video-Diffusion?tab=readme-ov-file#virtual-try-on), [generating custom video games](https://github.com/showlab/Awesome-Video-Diffusion?tab=readme-ov-file#game-generation), and animating [cartoon characters and people](https://shiyi-zh0408.github.io/projectpages/FlexiAct/).

Some companies are both working with AI and simultaneously fighting to defend their content from being pilfered by AI companies. This reflects the Wild West mentality in AI right now—companies exploiting legal gray areas to see how they can profit.

As I investigated these data sets, I learned about an incident involving TED—again, one of the most-pilfered organizations in the data sets captured here, and one that is attempting to employ AI to advance its own business. In June, the Cannes Lions international advertising festival gave one of its Grand Prix awards to an ad that included deepfaked footage from a TED talk by DeAndrea Salvador, currently a state senator in North Carolina. The ad agency, DM9, “used AI cloning to change her talk and repurposed it for a commercial ad campaign,” Ruiter told me on a video call recently. When the manipulation was discovered, the Cannes Lions festival [withdrew the award](https://www.canneslions.com/news/cannes-lions-statement-dm9-entries-into-cannes-lions-2025). Last month, Salvador [sued](https://www.courthousenews.com/wp-content/uploads/2025/08/deandrea-salvador-whirlpool-complaint.pdf) DM9 along with its clients—Whirlpool and Consul—for misappropriation of her likeness, among other things. DM9 apologized for the incident and [cited](https://www.linkedin.com/posts/dm9_nota-de-esclarecimento-na-semana-passada-activity-7343346894644875264-sHmV/?rcm=ACoAAABPu1wBY1EaJ5vQb_gdSm1BybbXG1_20hE) “a series of failures in the production and sending” of the ad.
A spokesperson from Whirlpool told me the company was unaware the senator’s remarks had been altered.

Others in the film industry have filed lawsuits against AI companies for training with their content. In June, Disney and Universal sued Midjourney, the maker of an image-generating tool that can produce images containing recognizable characters (Warner Brothers [joined](https://apnews.com/article/warner-bros-midjourney-ai-copyright-lawsuit-dc-studios-b87d80d7b4a4dfdcf0ee149d30830551) the lawsuit last week). The lawsuit called Midjourney a “bottomless pit of plagiarism.” The following month, two adult-film companies sued Meta for downloading (and distributing through BitTorrent) more than 2,000 of their videos. Neither Midjourney nor Meta has responded to the allegations, and neither responded to my request for comment. One YouTuber filed their own lawsuit: In August of last year, [David Millette sued Nvidia](https://www.courtlistener.com/docket/69045427/millette-v-nvidia-corporation/) for unjust enrichment and unfair competition with regard to the training of its [Cosmos AI](https://www.nvidia.com/en-us/ai/cosmos/), but the case was voluntarily dismissed months later.

The Disney characters and the deepfaked Salvador ad are just two instances of how these tools can be damaging. The floodgates may soon be opening further. Thanks to the enormous amount of investment in the technology, generated videos are beginning to appear everywhere. One company, DeepBrain AI, [pays “creators”](https://www.aistudios.com/promotion/creator-join) to post AI-generated videos made with its tools on YouTube. It currently offers \$500 for a video that gets 10,000 views, a relatively low threshold. Companies that run social-media platforms, such as Google and Meta, also pay users for content, through ad-revenue sharing, and many directly [encourage](https://blog.youtube/news-and-events/new-shorts-creation-tools-2025/) the posting of AI-generated content.
Not surprisingly, a coterie of [gurus](https://www.youtube.com/watch?v=TWpg1RmzAbc) has arrived to teach the secrets of making money with AI-generated content.

[Google](https://arxiv.org/abs/2007.14937) and [Meta](https://arxiv.org/abs/1905.00561) have also trained AI tools on large quantities of videos from their own platforms: Google has taken [at least 70 million](https://arxiv.org/abs/2007.14937) clips from YouTube, and Meta has taken more than [65 million clips from Instagram](https://arxiv.org/abs/1905.00561). If these companies succeed in flooding their platforms with synthetic videos, human creators could be left with the unenviable task of competing with machines that churn out endless content based on their original work. And social media will become even less social than it is.

I asked Peters if he knew his videos had been taken from YouTube to train AI. He said he didn’t, but he wasn’t surprised. “I think everything’s gonna get stolen,” he told me. But he didn’t know what to do about it. “Do I quit, or do I just keep making videos and hope people want to connect with a person?”

ML Classification

ML Categories

/Internet_and_Telecom		74.3%
/Internet_and_Telecom/Web_Services		65.6%
/Science		44.5%
/Science/Computer_Science		44.1%
/Science/Computer_Science/Machine_Learning_and_Artificial_Intelligence		44.0%
/Internet_and_Telecom/Web_Services/Search_Engine_Optimization_and_Marketing		27.6%
/Law_and_Government		23.8%
/Law_and_Government/Legal		23.1%
/News		12.8%
/News/Technology_News		12.5%
/Law_and_Government/Legal/Intellectual_Property		12.0%

Raw JSON

{
    "/Internet_and_Telecom": 743,
    "/Internet_and_Telecom/Web_Services": 656,
    "/Science": 445,
    "/Science/Computer_Science": 441,
    "/Science/Computer_Science/Machine_Learning_and_Artificial_Intelligence": 440,
    "/Internet_and_Telecom/Web_Services/Search_Engine_Optimization_and_Marketing": 276,
    "/Law_and_Government": 238,
    "/Law_and_Government/Legal": 231,
    "/News": 128,
    "/News/Technology_News": 125,
    "/Law_and_Government/Legal/Intellectual_Property": 120
}
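
The percentages in the ML Categories list appear to be the raw JSON scores divided by 10 (for example, 743 is shown as 74.3%), which suggests the scores are stored as per-mille integers. That reading is inferred from the numbers on this page and is not documented here. A minimal Python sketch of the conversion, using a subset of the values above and a hypothetical helper name `to_percent_rows`:

```python
import json

# Subset of the raw classifier output shown above.
# Assumption: scores are per-mille integers (0-1000), inferred from the page.
RAW_JSON = """
{
    "/Internet_and_Telecom": 743,
    "/Science/Computer_Science": 441,
    "/News": 128
}
"""


def to_percent_rows(raw_json: str) -> list[tuple[str, str]]:
    """Turn {label: per-mille score} into (label, 'NN.N%') rows, highest score first."""
    scores = json.loads(raw_json)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(label, f"{score / 10:.1f}%") for label, score in ranked]


if __name__ == "__main__":
    for label, pct in to_percent_rows(RAW_JSON):
        print(f"{label}\t{pct}")
```

The same helper should apply to the page-type and intent-type JSON further down, since those values follow the same pattern (999 shown as 99.9%).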

ML Page Types

/Article		99.9%
/Article/News_Update		90.4%

Raw JSON

{
    "/Article": 999,
    "/Article/News_Update": 904
}

ML Intent Types

Informational		99.9%

Raw JSON

{
    "Informational": 999
}

Content Metadata

Language

Author

Alex Reisner

Publish Time

2025-09-10 14:59:00 (7 months ago)

Original Publish Time

2025-09-01 00:00:00 (7 months ago)

Republished

Word Count (Total)

2,535

Word Count (Content)

2,234

Links

External Links

Internal Links

Technical SEO

Meta Nofollow

Meta Noarchive

JS Rendered

Redirect Target

null

Performance

Download Time (ms)

TTFB (ms)

Download Size (bytes)

46,146

Shard

21 (laksa)

Root Hash

13119341252700813021

Unparsed URL

com,theatlantic!www,/technology/archive/2025/09/youtube-ai-training-data-sets/684116/ s443