How we tested
All five tools were tested between May 18 and June 5, 2026, on their current paid tiers (or the free tier where that is the headline product). The criteria are weighted most heavily toward citation grounding and synthesis quality, which decide whether the output is usable in real research. Cost and the free-tier ceiling also carry substantial weight, because many readers are students or independent researchers.
Retrieval Accuracy
We ran the same six structured research questions on each tool (three biomedical, three social-science) and counted how many of the top ten retrieved papers were relevant against a librarian-curated reference set, computing precision-at-10 per tool.
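For readers who want to reproduce the metric, here is a minimal sketch of precision-at-10 in Python. The function names and the paper IDs are hypothetical, and dividing by k even when a tool returns fewer than ten papers is an assumption of this sketch rather than a detail of our scoring sheet.

```python
def precision_at_k(retrieved, relevant, k=10):
    """Share of the top-k retrieved papers that appear in the reference set.

    retrieved: ranked list of paper IDs returned by a tool
    relevant:  set of paper IDs from the curated reference set
    Assumption: a short result list is still divided by k, so missing
    results count against the tool.
    """
    top_k = retrieved[:k]
    hits = sum(1 for paper_id in top_k if paper_id in relevant)
    return hits / k


def mean_precision_at_k(runs, k=10):
    """Average precision-at-k over several (retrieved, relevant) query runs."""
    if not runs:
        raise ValueError("at least one query run is required")
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)


# Hypothetical example: ten ranked results, four of them in the reference set.
example_retrieved = [f"p{i}" for i in range(1, 11)]
example_relevant = {"p1", "p3", "p5", "p7", "p42"}
print(precision_at_k(example_retrieved, example_relevant))  # prints 0.4
```

Averaging `precision_at_k` over the six questions with `mean_precision_at_k` gives one figure per tool.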
Citation Grounding
We pulled twenty AI-generated claims from each tool's output across the test queries and checked, sentence by sentence, whether the cited paper actually contained the claim, recording each as correct, paraphrased-but-defensible, misattributed, or fabricated.
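The tally itself is simple bookkeeping. The sketch below shows one way to turn twenty hand-assigned labels into counts and shares per category. The `supported_share` figure, which groups correct and paraphrased-but-defensible claims together, is an illustrative assumption and not a number we report, and the sample labels are invented.

```python
from collections import Counter

# The four verdicts used in the citation-grounding check.
CATEGORIES = (
    "correct",
    "paraphrased-but-defensible",
    "misattributed",
    "fabricated",
)


def grounding_summary(labels):
    """Count each verdict and compute its share of all checked claims."""
    unknown = set(labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not labels:
        raise ValueError("no claims were checked")
    counts = Counter(labels)
    total = len(labels)
    summary = {c: {"count": counts[c], "share": counts[c] / total} for c in CATEGORIES}
    # Assumption: treat the first two categories as "supported by the cited paper".
    supported = counts["correct"] + counts["paraphrased-but-defensible"]
    summary["supported_share"] = supported / total
    return summary


# Invented labels for one tool's twenty claims.
sample = (
    ["correct"] * 14
    + ["paraphrased-but-defensible"] * 3
    + ["misattributed"] * 2
    + ["fabricated"] * 1
)
print(grounding_summary(sample)["fabricated"]["count"])  # prints 1
```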
Synthesis & Extraction Quality
Two reviewers independently scored each tool's structured output (evidence table, summary report, or Consensus Meter answer) against a human-written gold synthesis on five rubric items: decisions captured, sample sizes and methods extracted, contradictory findings flagged, hallucinations introduced, and length discipline. The two scores per task were averaged.
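As a rough illustration of the averaging step, the sketch below combines two reviewers' rubric scores for one task. The numeric scale, the item-by-item averaging, and the unweighted mean across the five items are all assumptions made for the example, and the scores shown are invented.

```python
# The five rubric items named in the method description.
RUBRIC_ITEMS = (
    "decisions captured",
    "sample sizes and methods extracted",
    "contradictory findings flagged",
    "hallucinations introduced",
    "length discipline",
)


def average_reviewer_scores(reviewer_a, reviewer_b):
    """Average two reviewers' scores item by item for a single task."""
    missing = [
        item for item in RUBRIC_ITEMS
        if item not in reviewer_a or item not in reviewer_b
    ]
    if missing:
        raise ValueError(f"missing rubric items: {missing}")
    return {item: (reviewer_a[item] + reviewer_b[item]) / 2 for item in RUBRIC_ITEMS}


def task_score(averaged):
    """Unweighted mean across rubric items (an assumption of this sketch)."""
    return sum(averaged.values()) / len(averaged)


# Invented scores on a hypothetical 0-to-5 scale.
reviewer_a = dict.fromkeys(RUBRIC_ITEMS, 4)
reviewer_b = dict.fromkeys(RUBRIC_ITEMS, 2)
print(task_score(average_reviewer_scores(reviewer_a, reviewer_b)))  # prints 3.0
```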
Source Coverage
We recorded each tool's underlying corpus (size and source: Semantic Scholar, OpenAlex, PubMed, the open web, or user-uploaded only), the file types it accepts, and whether retracted papers are automatically excluded.
Value at Paid Tier
We priced one user on each tool's standard paid plan (annual billing) against the free tier's real ceiling (the published cap on credits, reports, sources, or daily queries) and recorded what a heavy user actually has to pay to keep working without hitting a limit.
We ran every tool through the same battery of tasks, so the differences below come down to the products, not the briefs. The full per-criterion marks are above; the notes here cover where the ranking turned.
Why Elicit leads
Elicit wins on the dimension that decides this category for any reader doing real research work: defensibility.
As of May 2026, Elicit Systematic Review supports PRISMA 2020 guidelines and is reproducible, traceable, and auditable at every step, and an internal evaluation against 994 Cochrane reviews reported 95% search recall, 97% abstract screening, 99% full-text screening, and 96% extraction.
Those are the numbers a reviewer or supervisor will want to see; nothing else in our test cites figures that close to a methodological standard.
The corpus is the other reason it leads.
Elicit searches 138 million academic papers and 545,000 clinical trials through Semantic Scholar with semantic, not keyword, matching, and every summary, table cell, and report sentence links to the source paper with extracted supporting quotes, sharply reducing hallucination risk. That sentence-level citation discipline is what separates a tool worth trusting from one worth proofreading.
The trade-offs are real but narrow.
The free Basic tier ships with 5,000 one-time credits rather than refreshing them monthly, which makes the free plan more of a trial than a sustainable workflow. Elicit Pro at $49/month (or $499/year) is aimed at professional researchers and includes 12 reports per month, unlimited search across 138 million papers, and unlimited high-accuracy columns, a real step up from Plus for anyone without institutional funding.
When Consensus is the right call
Consensus is the tool we recommend when the deliverable is an evidence-weighted answer to a focused question rather than a full literature review.
The architecture matters here: AI runs only after real papers are retrieved, so fabricated citations are eliminated structurally. There are no AI-invented papers to hallucinate.
The 2026 pricing ladder is Free / Pro $10 / Deep $45, and a 40% student discount with a verified academic email brings Premium annual to $5.39/month ($64.68/year), the most aggressive student price in our test.
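The student-price arithmetic can be checked in a few lines. The sketch below confirms that $64.68 per year works out to $5.39 per month, then backs out the pre-discount annual price that a 40% discount would imply. That implied figure is an inference from the two quoted numbers, not a published Consensus price, and the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def monthly_equivalent(annual_price):
    """Annual price spread over twelve months, rounded to the cent."""
    return (Decimal(annual_price) / 12).quantize(CENT, rounding=ROUND_HALF_UP)


def implied_list_price(discounted_price, discount_rate):
    """Price before a percentage discount, inferred from the discounted price."""
    remaining = 1 - Decimal(discount_rate)
    return (Decimal(discounted_price) / remaining).quantize(CENT, rounding=ROUND_HALF_UP)


print(monthly_equivalent("64.68"))          # prints 5.39
print(implied_list_price("64.68", "0.40"))  # prints 107.80 (inferred, not published)
```

Using `Decimal` rather than floats keeps the cent rounding exact, which matters when comparing prices that differ by pennies.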
The limits are honest.
Consensus is built for quick evidence-based answers rather than formal systematic reviews, where Elicit’s structured workflow and automated reporting excel, and the system does not automatically exclude retracted papers. If an article has been retracted but is still present on OpenAlex or Semantic Scholar, Consensus might still cite it, so Retraction Watch or the journal’s own website remains a necessary check.
The free pick: NotebookLM
NotebookLM is the answer for the reader who wants to do serious work on their own documents without paying anything.
It removes the hallucination risk that affects many general models by grounding every response in uploaded documents and refusing to invent unsupported claims.
The free plan gives 100 notebooks, 50 sources per notebook, 50 chat queries per day, 3 Audio Overviews per day, 3 Video Overviews per day, 10 reports per day, and 10 Deep Research sessions per month, and since January 2026 all plans, including free, have access to Gemini’s full 1-million-token context window. The difference between plans is daily usage caps and source limits, not context capacity.
The architectural limits are the reason it ranks third rather than higher.
NotebookLM caps each notebook at 50 sources on the free tier, with individual sources limited to roughly 500,000 words, sufficient for focused projects like a class assignment with ten papers, but a constraint for literature reviews with 100+ papers or multi-year thesis research.
Notebooks also can’t share context with each other; splitting sources across notebooks loses the cross-source querying that’s the whole point.
When Perplexity Deep Research wins
If the topic turns on what happened this week, Perplexity is the right tool.
Perplexity Deep Research is the fastest end-to-end research agent at 2 to 4 minutes per report, with transparent citations on every claim. It launched on February 14, 2025 with PDF and Perplexity Pages export, free to all users at 5 queries per day and raised to 500 queries per day on Pro. Perplexity tightened the Pro Deep Research allowance in early 2026, so the current cap is worth checking before relying on it.
The honest limit is that it isn’t an academic tool.
Perplexity draws from the open web, which mixes low-quality sources with high-quality ones. That makes it less constrained than Elicit or Consensus: more versatile but less rigorous, and not appropriate as a sole source for clinical decisions.
Deep Research cites 50 sources per report, but it doesn’t independently fact-check those sources. It relies entirely on their credibility, so if the top search results contain biased or outdated information, Deep Research propagates that bias into its output.
What didn’t make the cut
SciSpace is the one tool in our test that we mark Not Recommended at its current value. The headline price is fine.
At $12 per month, the Premium plan provides unlimited AI Copilot queries, unlimited paper summaries, advanced literature review tools, unlimited paraphrasing, and full access to over 40,000 journal-formatting templates. But the value calculation breaks down inside the product.
SciSpace signs users up for pricey plans, but credits not used in a month vanish, with no roll-over, so users end up paying for capacity they never use, and even after upgrading, users report hitting random limits like character caps that stop the new features from working, with promised benefits failing to show up after payment.
Against NotebookLM (free) and Consensus Pro ($10/month), we can’t recommend it.