How we test, and what the mark means.
Every verdict comes out of the same process: a fixed battery of tasks for the category, graded against a written rubric, marked in stars, and re-run as the tools change.
We do not rate on impressions, and we do not run vendor demos. Each tool faces the same set of tasks — built to isolate one quality at a time — and the results are scored against a rubric we keep fixed across the category. We weigh what a tool gets wrong as heavily as what it gets right, and we publish the test plan on every ranking so a reader can see how the mark was earned.
A single number hides too much to stand alone, so every ranking lists its exact tests in a "How we tested" section, and every product entry shows how it rated on each criterion. The criteria below are the spine of that process; the specific tasks differ by category, because testing an image generator is not the same as testing a coding assistant.
What we evaluate
Each tool runs the same fixed battery of tasks for its category — the same prompts, briefs, or codebases — and two reviewers grade every result against a written rubric without seeing which tool produced it, so a brand name never moves the mark.
We repeat the hardest tasks across many runs and record how often a tool returns a correct, usable result without intervention. A tool that nails one demo but drifts on the tenth attempt is marked down for it.
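As an illustration of the reliability measure described above, here is a minimal sketch. It assumes each run is recorded as a simple pass or fail; the function name and the sample data are hypothetical, not the Testing Desk's actual tooling.

```python
def success_rate(outcomes):
    """Fraction of runs that returned a correct, usable result without intervention.

    `outcomes` is a list of booleans, one per run (hypothetical record format).
    """
    if not outcomes:
        raise ValueError("need at least one recorded run")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Hypothetical record of ten runs of one hard task: two runs drifted.
runs = [True, True, True, False, True, True, True, True, True, False]
rate = success_rate(runs)
print(f"{rate:.0%} of runs usable without intervention")
```

A tool that passes a single demo would show a perfect rate on one run; repeating the task is what exposes the failures.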
We measure how precisely a tool can be directed and revised — placing or editing one element, constraining a format, or correcting a single region — counting the attempts it takes to reach an acceptable result.
We read the published terms and confirm them against the maker’s documentation: how the model was trained, what commercial rights and indemnification the paid plan grants, and how the tool handles unsafe or restricted requests.
We price a month of real, observed usage at the tier we tested, then divide by the number of results we judged usable — so a cheap tool that needs five retries is not allowed to look like a bargain.
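The arithmetic behind that figure can be sketched as follows. The prices and counts are invented for illustration, and the function name is an assumption rather than part of any published tooling.

```python
def cost_per_usable_result(monthly_cost, usable_results):
    """Monthly cost at the tested tier divided by the results judged usable."""
    if usable_results <= 0:
        raise ValueError("no usable results: cost per usable result is undefined")
    return monthly_cost / usable_results


# Hypothetical figures: both tools produce 100 results in the month.
# The cheap tool yields one usable result in five; the pricier one, nine in ten.
cheap = cost_per_usable_result(10.00, 20)
pricey = cost_per_usable_result(30.00, 90)

print(f"cheap tool:  ${cheap:.2f} per usable result")
print(f"pricey tool: ${pricey:.2f} per usable result")
```

Under these assumed numbers the tool with the lower sticker price costs more per usable result, which is the effect the measure is meant to expose.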
Because these products change often, we re-run the battery on each meaningful update and date every verdict. A recommendation can be withdrawn when a rival ships or a tool regresses, and we say what changed.
How we rate
Each criterion is marked on a five-star scale in half-star steps: a filled star for a criterion a tool handles well, a half star for a partial result, and an empty star where it fails. The marks are weighted toward what matters most for the category and resolved into an overall star rating, rounded to the nearest half star. Because every product is rated on every criterion, a reader can see exactly where one earned its standing and where it lost it.
A product whose overall rating reaches four stars or more carries the solid Recommended stamp; anything below four stars is marked Not Recommended. The stamp is the verdict, the plain call on whether to use the tool, and it sits beside the star rating, not in place of it.
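One way the weighting, rounding, and stamp could fit together is sketched below. The criterion names, weights, and marks are hypothetical. Two points are assumptions rather than anything the text specifies: the stamp is applied to the rounded overall rating, and exact ties round up.

```python
import math


def overall_rating(marks, weights):
    """Weighted mean of per-criterion star marks, rounded to the nearest half star."""
    total_weight = sum(weights[name] for name in marks)
    weighted = sum(marks[name] * weights[name] for name in marks) / total_weight
    # floor(2x + 0.5) / 2 rounds to the nearest 0.5, with ties rounding up
    # (assumption; Python's built-in round() would round ties to even).
    return math.floor(weighted * 2 + 0.5) / 2


def stamp(rating):
    """Four stars or more earns the Recommended stamp."""
    return "Recommended" if rating >= 4.0 else "Not Recommended"


# Hypothetical criteria, marks, and category weights.
marks = {"quality": 4.5, "reliability": 4.0, "control": 3.5, "value": 3.0}
weights = {"quality": 3, "reliability": 3, "control": 2, "value": 2}

rating = overall_rating(marks, weights)
print(rating, stamp(rating))
```

With these invented numbers the weighted mean is 3.85, which rounds to 4.0 stars. Whether the threshold should apply before or after rounding is a design choice the text leaves open.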
Nothing here is final. AI products ship meaningful changes often, so every verdict is dated and the battery is re-run on each major release. A recommendation can be withdrawn when a rival catches up or a tool regresses, and when that happens we update the ranking and state what changed.
Independence
We take no sponsorships and no payment for placement. A product cannot buy its way onto a ranking, buy a higher position, or buy a better mark. Our verdicts reflect our testing and nothing else.
Margaret Ashworth leads testing of image and video generators and the design tools built on top of them. She grades on prompt fidelity, artifact rates, licensing clarity, and the cost of an acceptable final frame.
Theodore Pruitt evaluates general assistants, reasoning models, and coding tools against fixed task batteries. He weighs accuracy and refusal behavior over benchmark scores, and reports what a tool gets wrong before what it gets right.
Constance Whitfield covers search, productivity suites, and knowledge and data tools. Her tests favor citations a reader can verify, exportable output, and pricing that holds up past the free tier.
Lionel Sackville designs the scoring rubric the Testing Desk uses and runs voice, audio, and cross-category tests. He is responsible for keeping the criteria comparable from one verdict to the next.