How to get Reddit data (the honest map)

He needed two years of one subreddit by Friday. Pushshift was dead, the API docs were a pricing table, and the top Stack Overflow answer was a broken 2019 snippet. The data exists. The map to it had just gone stale everywhere he looked.

The short answer, before the detail

There is no longer one obvious way to get Reddit data, which is exactly why this question got hard. For years the answer was "use Pushshift" — and Pushshift, for general researchers, is gone. So the real answer in 2026 is a decision, not a tool: pick the method that matches the shape of what you need. A few hundred recent threads for analysis is a different job than a decade of one subreddit for an NLP model, and they take different roads.

Here is the whole map in one breath. For a small, recent pull, a no-code exporter or Reddit's own public endpoints will do it in minutes. For programmatic, ongoing access, the official Data API is the durable route — free for personal and academic use, metered once you go commercial. For history older than a few months, you want an archive built on the old Pushshift dumps, with Arctic Shift the current go-to. For training data, several large Reddit corpora are already cleaned and published, so you may not need to collect anything at all. The rest of this page is which of those fits your job, what each one costs, and where the legal lines are.

The six roads, compared

Method | Best for | Cost | Skill needed
No-code exporter (CSV) | One thread or one sub, recent, to a spreadsheet | Free tier, then paid per export | None (paste a URL)
Public JSON endpoints | Light, personal, recent pulls; quick scripts | Free | Some (a little code or curl)
Official Data API (OAuth) | Ongoing, programmatic, production access | Free for personal/academic use; about $0.24 per 1,000 calls commercial | Developer
Historical archive (Arctic Shift) | Posts and comments older than a few months | Free | Low to developer, depending on UI vs dumps
Bulk subreddit download (BDFR) | An entire subreddit or user history to disk | Free | Comfortable with a terminal
Pre-built dataset | NLP / ML training, no collection needed | Free | Data tooling to load it

This table encodes two non-obvious things. First, "free" and "easy" rarely live in the same row: the free routes ask for a little technical comfort, and the paste-a-URL routes start charging once you go past a handful of exports. Second, freshness and history are a trade-off. The archives are deep but lag the present by weeks to months; the live methods are current but shallow on history. Few sources give you both at once.

Start with the question, not the tool

The most common way people waste a day here is reaching for a method before they have written down what they actually need. Four questions settle almost every case. How far back do you need to go — this week, or years? How much do you need — one thread, one subreddit, or the whole platform? How often — once, or on a schedule forever? And what happens to the data — a glance in a spreadsheet, or a column in a machine-learning pipeline?

Those four answers point at a row in the table above with almost no ambiguity. "A few recent threads, once, into a spreadsheet, to read" is a no-code export — you are done in ten minutes. "Two years of one subreddit, once, to analyze" is a historical archive. "Live mentions of my brand, every day, forever, into a dashboard" is the official API. The mistake is starting from "what tool do people use" instead of "what is the shape of my need," because the popular tool is often built for a different shape than yours.
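The four questions can be written down as a small decision function. This is only a sketch of the reasoning in this section: the function name, the parameter names, the scope labels, and the three-month cutoff for "old" are illustrative assumptions, not rules from any tool.

```python
def pick_method(years_back: float, scope: str, recurring: bool, for_training: bool) -> str:
    """Map the four questions to one row of the comparison table.

    scope is one of "thread", "subreddit", "platform" (hypothetical labels).
    """
    if for_training:
        # Training data: look for a published corpus before collecting anything.
        return "pre-built dataset"
    if recurring:
        # Anything on a schedule belongs on the authenticated, supported route.
        return "official Data API"
    if years_back > 0.25:
        # "Older than a few months", taken here as roughly three months.
        return "historical archive"
    if scope == "thread":
        return "no-code exporter"
    if scope == "subreddit":
        return "bulk downloader or public JSON endpoints"
    return "official Data API"


# The three examples from the paragraph above:
print(pick_method(0, "thread", recurring=False, for_training=False))
print(pick_method(2, "subreddit", recurring=False, for_training=False))
print(pick_method(0, "platform", recurring=True, for_training=False))
```

The order of the checks is the design choice worth noticing: what happens to the data and how often you need it are asked first, because they rule out more rows than size does.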

Match the method to the job

The same four questions, turned into concrete recommendations:

  • Recent + small + read it once → a no-code CSV exporter. Paste the URL, download, open in your spreadsheet. No account, no code.
  • Recent + small + a quick script → Reddit's public JSON endpoints. Free, no key, fine for light personal use; unauthenticated traffic is throttled hard, so keep your request rate gentle.
  • Ongoing + programmatic + you write code → the official Data API with OAuth. Free for personal and academic use; metered once it is commercial.
  • Old + any size → a Pushshift-successor archive. Arctic Shift for a live API and web UI; the Academic Torrents dumps for bulk offline work.
  • A whole subreddit or user history → a bulk downloader like BDFR, which authenticates properly and writes everything to disk.
  • Training a model → start with a published Reddit dataset before you collect anything. The cleaning has already been done for you.
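For the "quick script" row, here is a minimal sketch of reading a public JSON listing with only the Python standard library. It assumes the usual listing shape (posts under data.children, each with its own data object). The user agent string and the function names are placeholders, and the request may be throttled or refused, so treat this as a starting point rather than a dependable client.

```python
import json
import urllib.request

USER_AGENT = "example-research-script/0.1"  # hypothetical; describe your own script


def listing_url(subreddit: str, limit: int = 25) -> str:
    """Build the URL for a subreddit's newest posts as JSON."""
    return f"https://www.reddit.com/r/{subreddit}/new.json?limit={limit}"


def parse_listing(payload: dict) -> list[dict]:
    """Flatten a listing payload into one small dict per post."""
    rows = []
    for child in payload["data"]["children"]:
        post = child["data"]
        rows.append({
            "id": post["id"],
            "title": post["title"],
            "score": post["score"],
            "created_utc": post["created_utc"],
            # Keep the permalink so every row can be traced back to the thread.
            "permalink": "https://www.reddit.com" + post["permalink"],
        })
    return rows


def fetch_new(subreddit: str, limit: int = 25) -> list[dict]:
    """Fetch one page of recent posts. Makes one network request per call."""
    request = urllib.request.Request(
        listing_url(subreddit, limit), headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_listing(json.load(response))
```

Parsing is kept separate from fetching so the same `parse_listing` can be reused on JSON you saved earlier, which also makes it easy to check without touching the network.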

The fastest path for a one-off pull

  1. Copy the URL

    Grab the link to the specific thread or the subreddit page you care about. That is the only input most no-code exporters need.

  2. Paste it into an exporter

    A web-based comment exporter takes the URL and returns a downloadable file. The free tier covers a thread or two; bulk or repeated exports move you to a paid plan.

  3. Choose your columns

    Most exporters let you pick fields such as author, score, timestamp, body, and permalink. Keep the permalink: it is how you trace any row back to the live thread later.

  4. Download as CSV

    Open it in Sheets or Excel. For anything heavier, such as a whole sub or a recurring pull, this is where you graduate to the API or an archive instead.
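If you later move from an exporter to a script, steps 3 and 4 amount to something like the sketch below. The column names and the `comments_to_csv` helper are illustrative assumptions; your source may name its fields differently.

```python
import csv
import io

# Hypothetical column choice, mirroring the fields listed in step 3.
COLUMNS = ["author", "score", "created_utc", "body", "permalink"]


def comments_to_csv(comments, columns=COLUMNS) -> str:
    """Write an iterable of comment dicts to CSV text, keeping only `columns`."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any fields that were not chosen as columns.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for comment in comments:
        writer.writerow(comment)
    return buffer.getvalue()


example = [{
    "author": "alice",
    "score": 3,
    "created_utc": 1700000000,
    "body": "Hello, world",
    "permalink": "/r/example/comments/abc/_/def/",
    "extra_field": "dropped",
}]
print(comments_to_csv(example))
```

The `csv` module quotes the body automatically when it contains commas or newlines, which is the detail hand-rolled string joins usually get wrong.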

The official API, briefly

Reddit's Data API is the road Reddit wants you on, and for anything programmatic and ongoing it is the durable choice — the one least likely to break under you. The headline numbers: the free tier allows roughly 100 requests per minute per authenticated client, and is meant for personal and academic use. Commercial use moves you to paid access, widely reported at about $0.24 per 1,000 calls, with negotiated enterprise tiers above that. Authentication via OAuth is effectively required now; unauthenticated traffic is throttled hard.
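Those two headline numbers are enough for a back-of-envelope estimate. The sketch below uses the approximate figures quoted above (about 100 requests per minute on the free tier, about $0.24 per 1,000 commercial calls). Both are reported values rather than official quotes, and the function names are made up for illustration.

```python
FREE_REQUESTS_PER_MINUTE = 100      # approximate free-tier limit quoted above
COMMERCIAL_USD_PER_1K_CALLS = 0.24  # widely reported figure, not an official quote


def minutes_at_free_limit(calls: int) -> float:
    """Minimum wall-clock minutes to make `calls` requests at the free rate."""
    return calls / FREE_REQUESTS_PER_MINUTE


def commercial_cost_usd(calls: int) -> float:
    """Estimated commercial cost in USD for `calls` requests."""
    return calls / 1000 * COMMERCIAL_USD_PER_1K_CALLS


calls = 50_000
print(f"{minutes_at_free_limit(calls):.0f} minutes at the free rate limit")
print(f"about ${commercial_cost_usd(calls):.2f} at the reported commercial rate")
```

For 50,000 calls that works out to 500 minutes of wall-clock time on the free tier, or roughly $12 at the reported commercial rate. What counts as one "call" depends on how many items each request returns, so the number of posts you need is not the number of calls.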

Most people never touch the raw API directly. They use a wrapper — PRAW, the long-standing Python library, is the usual starting point, and bulk tools build on top of it. The full breakdown of tiers, what counts as a "call," how the free quota actually behaves, and when you cross into paid territory has its own guide, because the pricing is where most projects get a nasty surprise.
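As a rough picture of what the wrapper route looks like, here is a minimal PRAW sketch in read-only mode. It assumes you have registered an app with Reddit to obtain a client ID and secret. The helper names `make_client` and `recent_posts`, and the choice of fields, are illustrative rather than anything PRAW prescribes.

```python
def make_client(client_id: str, client_secret: str, user_agent: str):
    """Create a read-only PRAW client from app credentials."""
    import praw  # third-party package: pip install praw

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def recent_posts(reddit, subreddit_name: str, limit: int = 100) -> list[dict]:
    """Collect the newest submissions of one subreddit as plain dicts."""
    rows = []
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        rows.append({
            "id": submission.id,
            "title": submission.title,
            "score": submission.score,
            "created_utc": submission.created_utc,
            "permalink": submission.permalink,
        })
    return rows
```

Usage would look like `recent_posts(make_client("YOUR_ID", "YOUR_SECRET", "your-script/0.1"), "python", limit=50)`, with the credentials replaced by your own. Passing the client in as an argument keeps the collection logic separate from the credentials.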

Historical data: the Pushshift problem

For most of Reddit's research history, the answer to "how do I get old data" was Pushshift — a third-party archive that mirrored nearly everything. In 2023 Reddit restricted Pushshift to verified moderators only, and for general researchers it effectively went dark. Hundreds of published studies had relied on it; overnight, the standard tool was gone. If a tutorial tells you to use Pushshift and it is more than a couple of years old, that is the tell that it is stale.

The gap got filled by successors built on the surviving Pushshift dumps. Arctic Shift is the current go-to: a free archive with a live API, a web search interface, and downloadable dumps. For bulk offline work, the same underlying data is redistributed as per-subreddit files on Academic Torrents, covering roughly 2005 through 2025. Which one you want depends on whether you need a few queries or a few hundred gigabytes — covered in full in the alternatives guide.
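For the bulk route, most of the work is filtering a very large file down to the window you care about. The sketch below assumes the dump has already been decompressed into newline-delimited JSON with a `created_utc` field on each record, which is how the Pushshift-style dumps are commonly described; check the files you actually download. The function name and date window are illustrative.

```python
import json
from datetime import datetime, timezone


def in_window(lines, start: datetime, end: datetime):
    """Yield records whose created_utc falls in [start, end).

    `lines` is any iterable of JSON strings, one record per line, so the
    file is streamed rather than loaded into memory.
    """
    start_ts, end_ts = start.timestamp(), end.timestamp()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # created_utc may be stored as a number or as a string; normalise it.
        created = float(record["created_utc"])
        if start_ts <= created < end_ts:
            yield record


sample_lines = [
    '{"id": "a", "created_utc": 1672531200}',
    '{"id": "b", "created_utc": 1600000000}',
    '{"id": "c", "created_utc": "1700000000"}',
]
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, tzinfo=timezone.utc)
print([record["id"] for record in in_window(sample_lines, start, end)])
```

Because `in_window` is a generator over lines, the same function works on an open file handle or a decompression stream without holding the whole file in memory.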

Is any of this legal?

The honest answer is "mostly, with real caveats, and it depends on what you do next." Scraping genuinely public pages is unlikely to break the US Computer Fraud and Abuse Act: in hiQ v. LinkedIn, the Ninth Circuit held that scraping publicly accessible pages likely does not count as access "without authorization" under that statute. But that is not the whole story. Reddit's User Agreement separately prohibits scraping and commercial use of its content without an agreement, so the live risk is contract and terms of service, not hacking law, and Reddit is actively enforcing it, including a 2025 lawsuit against several data-scraping companies. The short version: collecting public Reddit data for personal research or analysis sits in a well-trodden, low-risk zone; reselling it, training a commercial model on it, or evading Reddit's technical controls at scale is where the exposure is real. Respect robots.txt, do not hammer the servers, do not republish people's content as your own, and when in doubt use the official API, which is the licensed path by design. The full legal picture has its own guide, and none of this is legal advice.
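"Respect robots.txt" can be checked mechanically with Python's standard library. The sketch below parses robots.txt text you have already fetched and asks whether a given user agent may request a URL. The sample rules, the user agent, and the example.com URLs are made up for illustration and are not Reddit's actual file. Passing this check says nothing about the User Agreement, which is a separate question.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed(SAMPLE_ROBOTS, "example-bot", "https://example.com/public/page"))
print(allowed(SAMPLE_ROBOTS, "example-bot", "https://example.com/private/page"))
```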

Datasets you don't have to build

If your goal is training or analysis rather than a specific live question, the fastest path is often to skip collection entirely. Several large Reddit corpora are already cleaned, structured, and published — the Webis-TLDR-17 summarization set of nearly four million posts, Cornell's ConvoKit conversation corpora, the historical Pushshift dumps on Academic Torrents, and a steady supply of Reddit datasets on Hugging Face. For a lot of NLP and ML work, one of these is a better starting point than anything you would assemble yourself, because the deduplication, formatting, and licensing headaches are already handled.

The catch is that pre-built datasets are frozen in time and shaped by someone else's collection choices. They are perfect for training and benchmarking, useless for "what is the sub saying this week." Knowing which published set fits which task — and what each one's licence actually permits — is its own guide below.

Honest caveats

  • Reddit is actively tightening data access, not loosening it — methods that work today can change, so prefer the official, authenticated routes for anything you need to keep running.
  • Rate limits are real on every free route — pull gently, add delays, and expect the occasional empty response under throttling rather than a clean error.
  • Archives lag reality — historical sources are weeks to months behind the live site, so they are wrong for anything time-sensitive.
  • Deleted and removed content is a minefield — old dumps may contain posts users later deleted; re-publishing those raises real ethical and, in some regions, legal questions.
  • Getting the data is the easy half — a million rows of raw Reddit JSON is not insight. The work is classifying, aggregating, and interpreting it, which is a different problem than acquisition.
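The "pull gently, add delays" advice can be sketched as a small retry helper. It treats an empty result as possible throttling and waits longer before each retry. The function name, the default delays, and the retry count are assumptions to tune, and `fetch` stands in for whatever zero-argument callable performs your request.

```python
import time


def fetch_with_backoff(fetch, max_tries: int = 5, base_delay: float = 2.0, sleep=time.sleep):
    """Call `fetch()` until it returns something non-empty, backing off in between.

    Waits base_delay, then 2x, then 4x, and so on. Returns None if every
    attempt came back empty.
    """
    for attempt in range(max_tries):
        result = fetch()
        if result:
            return result
        if attempt < max_tries - 1:
            # Exponential backoff: each retry waits twice as long as the last.
            sleep(base_delay * 2 ** attempt)
    return None
```

An empty result can also be a genuinely empty page, so a `None` here means "nothing came back after several gentle tries", not proof of throttling. The `sleep` parameter exists so the delays can be replaced when you want to check the logic without waiting.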

When you want the answer, not the pipeline

Every method on this page hands you raw data and leaves the hard part — turning thousands of messy threads into a defensible answer — entirely to you. That gap is the reason rawneed exists. You give it a question in plain English; it gathers the relevant threads, classifies each one into structured fields (pain intensity, willingness to pay, sentiment, tools mentioned), and returns a ranked report with a link back to every source thread. No API keys to manage, no exporter quotas, no archive to download and parse. If your real goal is the insight rather than the data engineering, that is the shortcut — and if you genuinely need raw rows for your own pipeline, the methods above are exactly right and you should use them.

Frequently asked questions

What is the easiest way to get Reddit data?

For a one-off pull from a single thread or subreddit, a no-code CSV exporter is easiest — you paste the URL and download a file, no account or code required. The free tier covers a thread or two; bulk or repeated exports move you to a paid plan. For anything ongoing or large, the easiest durable route is the official Data API through a wrapper like PRAW.

Is Pushshift still available in 2026?

Not for general researchers. In 2023 Reddit restricted Pushshift to verified moderators only, and for everyone else it effectively went dark. The data lives on through successors built on the old Pushshift dumps — Arctic Shift offers a live API and web search interface, and the same underlying archive is redistributed in bulk on Academic Torrents.

How much does Reddit data cost?

It depends on the route. No-code exporters and the public endpoints are free for light use. The official Data API is free for personal and academic use up to its rate limits; commercial use is metered, widely reported at about $0.24 per 1,000 calls, with negotiated enterprise tiers above that. Historical archives and pre-built datasets are generally free to download.

Do I need to know how to code to get Reddit data?

No, for the simple cases. No-code exporters turn a URL into a CSV with no programming at all, and archive web interfaces like Arctic Shift let you search and download through a browser. You only need code once you want programmatic, ongoing, or large-scale access through the official API — and even then, libraries like PRAW do most of the heavy lifting.

Is it legal to collect Reddit data?

Collecting genuinely public Reddit data for personal research or analysis sits in a low-risk, well-trodden zone — scraping public pages is unlikely to violate US computer-access law. The real caveats are Reddit's own terms, which prohibit scraping and commercial use without an agreement, and which Reddit is actively enforcing. Reselling the data, training a commercial model on it, or evading technical controls at scale is where exposure becomes real. The official API is the licensed path. This is not legal advice.

What is the best source for historical Reddit data?

Arctic Shift is the current go-to for most people — a free archive with a live API, a web search UI, and downloadable dumps, built on the surviving Pushshift data. If you need bulk data offline, the same archive is published as per-subreddit files on Academic Torrents, covering roughly 2005 through 2025. Pushshift itself is no longer an option for general researchers.

Keep reading

Use case: Write content about what your audience actually asks

Write about the questions your audience is actually asking.

Use case: See what people really say about your competitors

Track how buyers really compare tools and why they switch.

Guide: Pushshift alternatives that actually work in 2026

Her dissertation pipeline ran on Pushshift for two years. One morning every call returned a 403. The data she needed still existed; it had just moved, quietly, to three different places nobody had told her about.

Guide: Reddit API pricing, explained without the panic

The headlines said Reddit's API change cost one app developer $20 million a year. So when a solo dev needed 5,000 posts for a side project, she budgeted for the worst. Her actual bill came to exactly zero; she just had to know which tier she was in.

Guide: How to export Reddit comments to CSV

She had the perfect thread: 600 comments arguing about exactly the feature her team was debating. She needed it as rows in a spreadsheet by the 2pm standup, not as an afternoon of copy-paste. There is a five-minute way and a five-hour way.

Guide: Is scraping Reddit legal? An honest, non-lawyer answer

His lawyer's answer was the one founders hate: "it depends." But it depends on a small number of specific things, and once he understood which side of each line his project sat on, the grey area got a lot smaller.

Guide: How to download an entire subreddit

He wrote a clean script to pull every post in a subreddit, ran it, and got exactly 1,000 posts back. The subreddit had 80,000. The wall he hit is the single most important thing to understand before you start.

Guide: Reddit datasets for NLP and machine learning

She budgeted two weeks to scrape and clean a training corpus. A colleague pointed at a Hugging Face link: four million Reddit posts, already paired with summaries, already cleaned. The two weeks became an afternoon.

Guide: How to analyze Reddit data (without code)

Reading is not analyzing. A 1,400-comment thread you scroll for twenty minutes teaches you nothing you can write down. Here’s the repeatable, no-code method that does.
