Methodology

How the pipeline actually works

No black box. Here is exactly how a plain-English claim becomes a ranked, sourced report — the steps, the scoring, the cost, and how we test that it holds up.

From a claim to a report in six steps

  1. State the hypothesis

    You write your claim in plain English — e.g. “indie founders struggle to get their first 100 customers.”

  2. Clarify the scope

    A short set of AI clarifying questions sharpens who you mean and what counts, so the search does not drift.

  3. Suggest & tier subreddits

    Candidate subreddits are validated against r/<sub>/about.json — dead, private, or sub-500-member communities are dropped, and the rest are tiered Bullseye / Decent / Off-topic.

  4. Mine real search queries

    Queries are built from the actual title phrases people post in your chosen subs, with frequency counts, so you search the way your audience writes — not the way you guess they do.

  5. Set the pipeline knobs

    Choose how many threads to classify and how many run in parallel; a live cost estimate updates as you adjust, so there are no surprises.

  6. Review & launch

    Watch live logs, a running cost meter, and a per-thread status grid as the run executes. Stop and resume any time — runs are resumable.
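Steps 3 and 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, the about.json field names (kind, subreddit_type, subscribers) reflect our understanding of Reddit's public payload, and the phrase miner uses plain word n-grams, which may differ from how the real miner extracts phrases.

```python
import re
from collections import Counter

MIN_MEMBERS = 500  # threshold taken from step 3 above


def is_usable_subreddit(about: dict) -> bool:
    """Return True if a parsed about.json payload looks like a live, public sub.

    Assumes the payload shape {"kind": "t5", "data": {...}}; error payloads
    for dead or private subs lack that shape and are rejected.
    """
    if about.get("kind") != "t5":
        return False
    data = about.get("data") or {}
    if data.get("subreddit_type") != "public":
        return False
    return (data.get("subscribers") or 0) >= MIN_MEMBERS


def mine_title_phrases(titles, n=2, top=5):
    """Count word n-grams across post titles and return the most frequent."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9']+", title.lower())
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    sample_titles = [  # made-up titles, for illustration only
        "How do I get my first customers",
        "Getting first customers is hard",
        "first customers?",
    ]
    print(mine_title_phrases(sample_titles, n=2, top=3))
```

The frequency counts are what make the queries trustworthy: a phrase that appears in many titles is one the audience actually uses.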

What each thread is scored on

Every thread is classified into a fixed schema so the fields actually aggregate across hundreds of posts instead of becoming unique per-thread prose:

  • pain_signal — 0–100 intensity of the frustration in the thread
  • wtp_tier — willingness to pay, bucketed high / medium / low / none
  • tools_mentioned[] — the products and services named in the discussion
  • sentiment_toward_tools — positive / negative / mixed / neutral, per tool
  • primary_use_case — market research, lead gen, brand monitoring, content ideation, or other
  • relevance_score — 0–10 match between the thread and your claim
  • key_quotes, best_quote_from_OP, best_quote_from_top_reply — verbatim, with links back to source
  • summary — a one-line, plain-English gist of the thread
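To show why a fixed schema aggregates cleanly, here is a minimal sketch of the fields above as a Python dataclass with range and enum checks. The class name, the validation rules, and the wtp_distribution helper are assumptions made for illustration; the source specifies only the field names and their allowed values.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List

WTP_TIERS = {"high", "medium", "low", "none"}
SENTIMENTS = {"positive", "negative", "mixed", "neutral"}
USE_CASES = {"market research", "lead gen", "brand monitoring",
             "content ideation", "other"}


@dataclass
class ThreadClassification:
    pain_signal: int                 # 0-100
    wtp_tier: str                    # one of WTP_TIERS
    relevance_score: float           # 0-10
    primary_use_case: str            # one of USE_CASES
    summary: str
    tools_mentioned: List[str] = field(default_factory=list)
    sentiment_toward_tools: Dict[str, str] = field(default_factory=dict)
    key_quotes: List[str] = field(default_factory=list)
    best_quote_from_OP: str = ""
    best_quote_from_top_reply: str = ""

    def __post_init__(self):
        # Reject values outside the fixed vocabulary so every record aggregates.
        if not 0 <= self.pain_signal <= 100:
            raise ValueError("pain_signal must be within 0-100")
        if not 0 <= self.relevance_score <= 10:
            raise ValueError("relevance_score must be within 0-10")
        if self.wtp_tier not in WTP_TIERS:
            raise ValueError(f"unknown wtp_tier: {self.wtp_tier!r}")
        if self.primary_use_case not in USE_CASES:
            raise ValueError(f"unknown primary_use_case: {self.primary_use_case!r}")
        for tool, sentiment in self.sentiment_toward_tools.items():
            if sentiment not in SENTIMENTS:
                raise ValueError(f"unknown sentiment for {tool!r}: {sentiment!r}")


def wtp_distribution(threads):
    """Count threads per willingness-to-pay tier."""
    return Counter(t.wtp_tier for t in threads)
```

Because wtp_tier can only take four values, counting it across hundreds of threads is a one-line tally, which free-form prose would not allow.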

What a run costs and how long it takes

Stage                        Time                        Cost
Fetch threads                2–3 min                     Free (public JSON)
Filter (rule-based prune)    < 1 sec                     Free
Classify with Gemini         10–15 min (~300 threads)    $0.13–0.40
Render the report            < 1 sec                     Free

Cost scales with comments per thread: ~$0.13 at five comments each, ~$0.30–0.40 at fifty to a hundred. Gemini 2.5 Flash is billed at $0.30 per million input tokens and $2.50 per million output tokens. Each thread takes roughly 11 seconds to classify, so the 10–15 minute figure for ~300 threads relies on classifying several threads in parallel.
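The arithmetic behind a cost and time estimate can be sketched as follows. The two prices and the 11-second figure come from the text above; the function names, the per-thread token counts, and the batch-based timing model are assumptions for illustration only, since real token usage depends on thread length.

```python
import math

INPUT_USD_PER_MTOK = 0.30    # Gemini 2.5 Flash input price quoted above
OUTPUT_USD_PER_MTOK = 2.50   # Gemini 2.5 Flash output price quoted above


def estimate_run_cost(threads, input_tokens_per_thread, output_tokens_per_thread):
    """Estimated classification cost in USD for a run."""
    input_cost = threads * input_tokens_per_thread * INPUT_USD_PER_MTOK / 1_000_000
    output_cost = threads * output_tokens_per_thread * OUTPUT_USD_PER_MTOK / 1_000_000
    return input_cost + output_cost


def estimate_minutes(threads, parallelism=1, seconds_per_thread=11.0):
    """Rough wall-clock estimate: threads run in batches of `parallelism`."""
    batches = math.ceil(threads / parallelism)
    return batches * seconds_per_thread / 60


if __name__ == "__main__":
    # Hypothetical token counts; substitute measured values from a real run.
    print(f"${estimate_run_cost(300, 1000, 100):.3f}")
    print(f"{estimate_minutes(300, parallelism=5):.0f} min")
```

Under this model, 300 threads at 11 seconds each would take 55 minutes one at a time and 11 minutes with five in flight, which is consistent with the 10–15 minute range in the table.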

Built to survive Reddit’s API

Every thread comes from Reddit’s public JSON endpoints — no OAuth, no API key, no quota approval to wait on.

Runs are resumable and tolerate rate-limiting: a 429 or a dropped connection mid-run is caught and retried rather than losing progress. A top-N cap lets you bound exactly how many threads get classified, which is the main cost lever.
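One way to get this behavior is retry with exponential backoff plus a checkpoint file, sketched below. Everything here is an assumption for illustration: the function names, the backoff schedule, the JSON checkpoint format, and the injected fetch callable (which would wrap the real HTTP request) are not taken from the pipeline itself.

```python
import json
import time
import urllib.error
from pathlib import Path


def fetch_with_retry(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on HTTP 429 and dropped connections."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # other HTTP errors are not retried in this sketch
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # treat as a transient network failure
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")


def run_resumable(urls, fetch, checkpoint: Path, top_n=None, sleep=time.sleep):
    """Fetch up to top_n URLs, skipping any already saved in the checkpoint."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for url in urls[:top_n]:
        if url in done:
            continue  # finished in an earlier run
        done[url] = fetch_with_retry(fetch, url, sleep=sleep)
        checkpoint.write_text(json.dumps(done))  # persist after every thread
    return done
```

Writing the checkpoint after every thread means a stopped run loses at most the thread in flight, and top_n bounds how many threads ever reach the paid classification stage.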

That public-JSON foundation is also why the pipeline is not exposed to the API-pricing changes that have shut down other Reddit research tools.

How we test that it holds up

We do not just assert the pipeline works — we measure the parts that could quietly fail, with scripts anyone can re-run.

When we needed to know whether the wizard’s AI-suggested subreddits could be trusted, we probed 100 suggestions across ten domains: 90 were live, public communities and only one was a hallucinated name. When we needed to know whether exact-phrase Reddit search beats loose keyword matching, we ran a quoted-vs-tokenized A/B and found roughly three in four multi-word phrases return enough exact-match results to use directly.
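The quoted-vs-tokenized comparison can be sketched as below. This is not the published test script: the helper names, the limit of 25, and the min_hits threshold for "enough exact-match results" are assumptions, and the hit counts in the demo are made up to show the calculation, not measured results.

```python
from urllib.parse import urlencode


def search_url(subreddit, phrase, exact):
    """Build a public search.json URL; exact=True wraps the phrase in quotes."""
    query = f'"{phrase}"' if exact else phrase
    params = {"q": query, "restrict_sr": 1, "limit": 25}
    return f"https://www.reddit.com/r/{subreddit}/search.json?{urlencode(params)}"


def usable_exact_share(exact_hit_counts, min_hits=5):
    """Fraction of phrases whose quoted search returned at least min_hits results."""
    if not exact_hit_counts:
        return 0.0
    usable = sum(1 for hits in exact_hit_counts if hits >= min_hits)
    return usable / len(exact_hit_counts)


if __name__ == "__main__":
    print(search_url("startups", "first customers", exact=True))
    print(search_url("startups", "first customers", exact=False))
    # Illustrative counts only; a real run would read these from search results.
    print(usable_exact_share([10, 0, 7, 6]))
```

Running both URL variants for each mined phrase and comparing result counts is the core of the A/B; the threshold decides when a quoted phrase is safe to use directly.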

Both tests live as reproducible scripts, and both carry caveats we keep in view — small samples, and a search ranker that shifts over time. We publish those caveats next to the numbers rather than burying them.

Maintained by · Founder, Reddit Research Pipeline.

Validate what people actually say, not what you wish they would.