Reddit datasets for NLP and machine learning
She budgeted two weeks to scrape and clean a training corpus. A colleague sent one link: four million Reddit posts, already paired with summaries, already cleaned. Two weeks became an afternoon.
Don't collect what you can download
If you are building an NLP or ML project and your first instinct is to scrape Reddit yourself, pause — there is a good chance the data you need already exists as a published, cleaned dataset, and using it will save you weeks. Reddit has been a workhorse corpus for language research for over a decade, and the community has packaged big slices of it for specific tasks: summarization, conversational modeling, sentiment, topic analysis. Collecting from scratch makes sense only when no existing set fits your exact need.
The short answer to "which dataset" is: Webis-TLDR-17 if you are doing summarization, ConvoKit if you need conversation structure, the Pushshift dumps on Academic Torrents if you need raw scale and full control, and the Hugging Face dataset hub for everything in between and the fastest start. The rest of this page is what each one is, what it is good for, and the licensing catch that matters more than people expect.
The datasets worth knowing
A lineage note: most "historical Reddit" datasets, including the Academic Torrents dumps and many Hugging Face sets, ultimately derive from the Pushshift collection. That is convenient, because the data is consistent and well understood, but it also means they share Pushshift's quirks and its cut-off: they contain little or nothing from after the archive's end date.
Webis-TLDR-17: the summarization standard
If you are working on text summarization, Webis-TLDR-17 is the obvious starting point. It is built on a clever observation: Reddit users routinely write their own "TL;DR" one-line summary at the end of a long post. That gives you nearly four million naturally occurring document-summary pairs — human-written, at scale, for free — which is exactly the supervised signal abstractive summarization models need. It became a standard benchmark for that reason and is hosted on Hugging Face for one-line loading.
The caveats are the interesting part. TL;DRs are noisy supervision — people write them inconsistently, sometimes as jokes, sometimes summarizing only part of the post — so the pairs are weaker labels than a curated summarization set. And the content is Reddit, with all the informality, toxicity, and topical skew that implies. For research and pre-training it is excellent; for a production summarizer you will want to filter and clean it rather than train on it raw.
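One way to act on that advice is a cheap heuristic filter run before training. The sketch below is a minimal example, not a standard recipe: the field names `content` and `summary` are assumed to match the Hugging Face copy of the dataset (check the dataset card), and the thresholds are arbitrary starting points to tune.

```python
import hashlib
from typing import Iterable, Iterator


def clean_pairs(
    rows: Iterable[dict],
    min_summary_words: int = 3,
    min_content_words: int = 30,
    max_ratio: float = 0.5,
) -> Iterator[dict]:
    """Yield rows whose summary plausibly summarizes the post.

    Field names ("content", "summary") are assumptions, not a confirmed schema.
    """
    seen = set()
    for row in rows:
        content = (row.get("content") or "").strip()
        summary = (row.get("summary") or "").strip()
        n_content = len(content.split())
        n_summary = len(summary.split())
        # Drop empty or joke TL;DRs and posts too short to need a summary.
        if n_summary < min_summary_words or n_content < min_content_words:
            continue
        # A "summary" nearly as long as the post is not compressing anything.
        if n_summary / n_content > max_ratio:
            continue
        # Drop exact duplicate pairs; store a digest, not the full text.
        key = hashlib.sha1((content + "\x00" + summary).lower().encode("utf-8")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield row


sample_pairs = [
    {"content": "word " * 40, "summary": "a short but real summary"},
    {"content": "word " * 40, "summary": "lol"},                       # summary too short
    {"content": "word " * 40, "summary": "a short but real summary"},  # exact duplicate
    {"content": "tiny post", "summary": "still a tiny post here"},     # post too short
]
kept_pairs = list(clean_pairs(sample_pairs))
print(len(kept_pairs))
```

Because the function takes any iterable of dicts, it can sit between a streaming loader and a training loop without holding the whole corpus in memory. It will not catch a TL;DR that summarizes only part of a post; that needs a content-aware check.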
ConvoKit: when structure matters
Cornell's ConvoKit takes a different angle. Instead of flat text, it packages Reddit as conversations — structured objects of speakers, utterances, and reply relationships — which is what you want if your work is about discourse, conversational dynamics, disagreement, or anything where who-replied-to-whom is part of the signal. It includes a large corpus drawn from hundreds of thousands of subreddits and a smaller, more tractable set for quick experiments, with a consistent Python API for loading and analysis.
The reason to reach for ConvoKit over a raw dump is that it has already done the hard structural work of reconstructing conversation trees and attaching metadata, and it ships with tooling for common conversational-analysis tasks. If your model or study cares about the shape of the conversation and not just the text, starting from ConvoKit saves you the parsing and gives you a format other researchers recognize.
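To make "the hard structural work" concrete, the sketch below rebuilds reply trees from a flat list of comments. It illustrates the general technique only and is not ConvoKit's implementation; the field names `id` and `reply_to` are assumptions chosen for the example.

```python
from collections import defaultdict


def build_threads(comments):
    """Group flat comments into reply trees.

    Each comment is a dict with "id" and "reply_to" (None for a top-level
    item). Returns (roots, children), where children maps a comment id to
    the ids of its direct replies. A reply whose parent is missing from the
    data is treated as the root of its own thread.
    """
    known = {c["id"] for c in comments}
    children = defaultdict(list)
    roots = []
    for c in comments:
        parent = c.get("reply_to")
        if parent is None or parent not in known:
            roots.append(c["id"])
        else:
            children[parent].append(c["id"])
    return roots, dict(children)


def depth(children, node):
    """Length of the longest reply chain starting at node (recursive)."""
    return 1 + max((depth(children, kid) for kid in children.get(node, [])), default=0)


flat = [
    {"id": "a", "reply_to": None},
    {"id": "b", "reply_to": "a"},
    {"id": "c", "reply_to": "b"},
    {"id": "d", "reply_to": "a"},
    {"id": "e", "reply_to": "zzz"},  # parent not present in the data
]
roots, children = build_threads(flat)
print(roots, children, depth(children, "a"))
```

Even this toy version has to decide what to do with orphaned replies, which is the kind of decision a packaged corpus has already made and documented for you. The recursive `depth` helper is fine for a demonstration but would need an iterative rewrite for very deep threads.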
The Pushshift dumps: maximum control
When no packaged dataset fits — you need specific subreddits, a specific date range, specific fields, or simply everything — the Pushshift dumps on Academic Torrents are the raw material. They cover roughly 2005 through 2025 as per-subreddit, compressed NDJSON files, and with the open-source parsing scripts you can shape them into whatever your pipeline needs. This is the most flexible and the most laborious option: you own the cleaning, the filtering, the deduplication, and the storage.
Choose this route when control matters more than convenience — building a domain-specific corpus from particular communities, or a dataset whose exact composition you need to justify in a paper. For most other purposes, a packaged dataset gets you to results faster, which is why the dumps are the power-user option rather than the default. The download-a-subreddit guide covers the mechanics of working with these files.
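As a rough sketch of what owning the filtering looks like, the function below streams NDJSON lines and keeps records from chosen subreddits inside a date range. It assumes each record carries `subreddit` and `created_utc` fields, and it leaves decompression to the caller: pass it any iterable of text lines.

```python
import json
from datetime import datetime, timezone


def filter_dump(lines, subreddits, start, end):
    """Yield records from `subreddits` with start <= created_utc < end.

    `lines` is any iterable of NDJSON text lines (one JSON object per line).
    `start` and `end` are timezone-aware datetimes.
    """
    wanted = {s.lower() for s in subreddits}
    lo, hi = start.timestamp(), end.timestamp()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
            # created_utc may be stored as a number or as a string.
            created = float(rec["created_utc"])
        except (ValueError, KeyError, TypeError):
            continue  # skip malformed rows rather than abort a long run
        if str(rec.get("subreddit", "")).lower() in wanted and lo <= created < hi:
            yield rec


sample_lines = [
    '{"subreddit": "AskScience", "created_utc": 1577836800, "title": "in range"}',
    '{"subreddit": "askscience", "created_utc": "1262304000", "title": "too old"}',
    '{"subreddit": "pics", "created_utc": 1577836800, "title": "wrong subreddit"}',
    "not json at all",
]
range_start = datetime(2019, 1, 1, tzinfo=timezone.utc)
range_end = datetime(2021, 1, 1, tzinfo=timezone.utc)
titles = [r["title"] for r in filter_dump(sample_lines, ["askscience"], range_start, range_end)]
print(titles)
```

With a real file, the `lines` argument would be a text stream over the decompressed data. If the files are Zstandard-compressed, as these dumps commonly are, that requires a zstd reader, which is not shown here.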
Match the dataset to the task
A quick router from what you are building to where to start:
- Summarization → Webis-TLDR-17. Millions of post-and-TL;DR pairs are purpose-built for it.
- Conversational AI, discourse, disagreement → ConvoKit. You need the reply structure, and it is already reconstructed.
- Sentiment, classification, fine-tuning → search the Hugging Face hub first; there is often a task-specific Reddit set already labeled.
- A domain-specific corpus from particular subreddits → the Pushshift dumps on Academic Torrents, filtered to your communities.
- Quick experiments and prototyping → a small Hugging Face set or ConvoKit's small corpus; do not download hundreds of gigabytes to test an idea.
- Current data — this month's posts → none of these. Pre-built sets are frozen; for live data you need the API or an archive's recent slice.
The licensing catch nobody reads
This is the part that trips up commercial projects. A dataset being publicly downloadable does not mean you are free to do anything with it. Each set carries a licence and terms — some are research-only, some restrict commercial use, and underneath all of them sits Reddit's own position that its content is not free for commercial training without an agreement, which is exactly what its 2025 lawsuits are about. For academic research, the established Reddit datasets are broadly fine to use and cite. For training a commercial model, "I downloaded it from Hugging Face" is not a licence, and you need to check the dataset's terms and consider Reddit's.
There is also the human layer. These corpora contain real people's words, including content later deleted, and increasingly there is scrutiny — legal, ethical, and reputational — of models trained on social data without consent. None of this is a reason to avoid the datasets for legitimate research; it is a reason to read the licence, document your data provenance, and think before you build something commercial on top of them. The legality guide covers the broader picture, and none of this is legal advice.
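Documenting provenance can be as light as a small manifest stored next to the data. The sketch below is one possible shape, with illustrative field names and placeholder values: it records where a file came from, which licence you read, and a checksum so the exact bytes can be identified later.

```python
import hashlib
import json
from datetime import date


def provenance_record(data: bytes, name: str, source: str, licence: str, retrieved: str) -> dict:
    """Describe a downloaded dataset file. All field names are illustrative."""
    return {
        "name": name,
        "source": source,
        "licence": licence,
        "retrieved": retrieved,
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


record = provenance_record(
    data=b'{"id": "abc"}\n',        # stand-in for the real file contents
    name="example-reddit-subset",   # hypothetical dataset name
    source="placeholder: where you downloaded it",
    licence="placeholder: copy the licence name from the dataset card",
    retrieved=date(2025, 1, 1).isoformat(),  # placeholder date
)
print(json.dumps(record, indent=2))
```

For multi-gigabyte files you would feed the hash in chunks instead of reading everything into memory. The point is only that the record is written at download time, when the source and licence are still in front of you.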
Honest caveats
- Every pre-built set is frozen in time: it ends at its collection date and tells you nothing about recent posts. For current data, use the API or an archive's recent slice.
- Reddit data is biased data — it skews toward certain demographics, languages, and topics, and carries the toxicity and misinformation of the open internet. Models trained on it inherit all of that.
- Deleted content is often included — Pushshift-lineage sets retain posts users removed, which raises consent and ethics questions for anything beyond aggregate research.
- Licences vary and matter — research-friendly is not the same as commercial-friendly; read each dataset's terms and weigh Reddit's own restrictions before commercial use.
- Cleaning is still on you — even packaged sets need filtering for your task; "pre-built" means collected and structured, not necessarily ready to train on raw.
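Part of that cleaning can be sketched in a few lines. The example below drops rows whose text is only Reddit's `[deleted]` or `[removed]` placeholder and replaces usernames with a salted hash so that analysis stays aggregate. The field names `body` and `author` and the salt handling are assumptions for illustration. Hashing reduces casual identifiability but is not full anonymization, and this step cannot detect text that a user deleted after it was captured.

```python
import hashlib

PLACEHOLDERS = {"[deleted]", "[removed]", ""}


def scrub(rows, salt: str):
    """Drop placeholder-only rows and pseudonymize authors.

    Assumes each row is a dict with "body" and "author" keys.
    """
    for row in rows:
        body = (row.get("body") or "").strip()
        if body in PLACEHOLDERS:
            continue
        author = row.get("author")
        if author and author != "[deleted]":
            digest = hashlib.sha256((salt + author).encode("utf-8")).hexdigest()
            author = "user_" + digest[:12]
        else:
            author = "[deleted]"  # keep deleted accounts as a single marker
        yield {**row, "body": body, "author": author}


raw_rows = [
    {"author": "alice", "body": "Useful comment."},
    {"author": "bob", "body": "[removed]"},
    {"author": "alice", "body": " Another one. "},
    {"author": None, "body": "[deleted]"},
]
cleaned = list(scrub(raw_rows, salt="keep-this-secret"))
print(len(cleaned))
```

The same author maps to the same pseudonym within a run, so per-user statistics still work, while the original usernames never reach the training or analysis files.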
If you wanted analysis, not a training set
Worth a gut-check on what you are actually after. If you are training or fine-tuning a model, the datasets above are exactly right and this is the page for you. But a fair number of people reach for "a Reddit dataset" when what they really want is an answer about a market or an audience — what people complain about, how sentiment splits, which problems recur — and a raw corpus is a heavy, indirect way to get that. rawneed is built for that second goal: ask a question in plain English and get a ranked, sourced report, with the data work handled. It is not a place to download a training set, and if that is your need these datasets serve you better. But if "get a Reddit dataset" was a means to an insight, this is the more direct route.
Frequently asked questions
What is the best Reddit dataset for NLP?
It depends on the task. Webis-TLDR-17 (about 3.8 million posts paired with author-written TL;DRs) is the standard for summarization. Cornell's ConvoKit is best when you need conversation structure for discourse or conversational modeling. The Pushshift dumps on Academic Torrents give you raw full history for custom corpora, and the Hugging Face hub has many task-specific labeled Reddit sets for sentiment and classification.
Where can I download Reddit datasets for machine learning?
The main sources are the Hugging Face dataset hub (the fastest start, with many task-specific Reddit sets and one-line loading), Cornell's ConvoKit for conversation-structured corpora, and Academic Torrents for the full historical Pushshift dumps. Webis-TLDR-17, the summarization standard, is on Hugging Face. Start with Hugging Face unless you specifically need raw scale or conversation structure.
Is the Webis-TLDR-17 dataset still available?
Yes, it is hosted on Hugging Face and loadable in one line. It contains roughly 3.8 million Reddit posts paired with their author-written TL;DR summaries, drawn from 2006 to 2016, and remains a standard benchmark for abstractive summarization. The supervision is noisy because TL;DRs are inconsistent, so filter and clean it for production work rather than training on it raw.
Can I use Reddit datasets to train a commercial model?
Not freely. Public availability is not a commercial licence. Each dataset has its own terms, some of them research-only, and Reddit's own position is that its content is not free for commercial training without an agreement, a position it is pressing in its 2025 lawsuits. For academic research the established sets are broadly fine to use and cite. For a commercial model, check the dataset licence and Reddit's terms, and consider licensing access. This is not legal advice.
Do Reddit datasets include deleted comments?
Often yes. Datasets derived from the Pushshift collection, including the Academic Torrents dumps and many Hugging Face sets, frequently retain posts and comments that users later deleted or moderators removed, because they were captured at posting time. This is convenient for completeness but raises real consent and ethics questions, so handle deleted content as aggregate signal rather than republishing it or attributing it to individuals.
How current are pre-built Reddit datasets?
They are not current: every packaged dataset is frozen at its collection date. The Pushshift dumps run to roughly 2025, and task-specific sets are often much older; Webis-TLDR-17, for example, stops at 2016. They are excellent for training and benchmarking but useless for questions about recent or live activity. For current data you need the official API or the recent slice of an archive, not a static dataset.