Lookback vs Maze: how to choose the right usability testing tool

Lookback and Maze are not interchangeable tools for the same job. Choosing between them is a methodology decision before it is a software decision. Using the wrong one does not just slow you down; it actively misleads you.

5 min read

The most consequential choice in usability research is not which tool you open. It is which kind of evidence you decide to collect. That choice determines what questions you can honestly answer at the end of a study, and it has to be made before you recruit a single participant or build a single prototype flow.

Lookback and Maze are both usability research platforms, but they are not substitutes for each other. They produce fundamentally different kinds of evidence, and treating them as interchangeable, choosing between them on price, familiarity, or team preference, is the most reliable way to arrive at findings that feel solid but answer the wrong question.

The epistemological gap between moderated and unmoderated research

Moderated research is a conversation. The researcher is present, watching in real time, able to follow the hesitation before a tap, the half-formed sentence that trails off, the moment a user recovers from confusion and rationalises it as intentional. That recovery is often the most important data point in a session. A participant who backtracks, finds the right path, and then says “oh yeah, that makes sense” has just told you something a click map will never surface: they were lost, they found their way, and they left with a false sense of fluency.

Unmoderated research is measurement. The researcher is absent. Participants move through a prototype on their own, and the platform records what they did, where they clicked, how long they spent, whether they completed the task. The value of this format is not that it is cheaper or faster, though it often is both. The value is that it scales. You can run 200 sessions in the time it takes to moderate 10, and with enough sessions, the aggregate pattern becomes statistically meaningful in a way that qualitative observation alone cannot be.

The danger is not choosing one over the other. The danger is using Maze’s quantitative outputs to answer questions that only Lookback’s qualitative depth can address, or treating Lookback sessions as statistically representative when they are not. These are different epistemic commitments, and confusing them produces confident-sounding findings that are structurally wrong.

What Lookback is actually built for

Lookback is a moderated session platform built around the assumption that the researcher needs to be present and that presence needs to be manageable. Its remote observation tools allow teammates and stakeholders to watch live sessions through a separate observer link, which means that a product manager and a designer can be in the room without the participant knowing they are being watched by three people. That separation matters. Participant behaviour changes when they feel scrutinised.

The platform handles session recording, note-taking, and clip-tagging in ways that are designed to reduce the cognitive load on the moderator during the session itself. The researcher can focus on the conversation rather than the logistics.

Where Lookback’s remote setup creates data quality problems is in anything that depends on environmental context. When a participant is at home, on their own device, the researcher cannot see the surrounding conditions, the size of the screen, the quality of the connection, the physical setting. A participant completing a checkout flow on a 13-inch laptop in a quiet room is having a meaningfully different experience from one doing it on a phone with one hand while making coffee. Lookback does not eliminate this problem, and no remote moderated platform does. It is worth accounting for in your recruitment screener and session notes.

Lookback is the right tool when your research question requires knowing what it felt like to use something, when the emotional texture of an experience is the evidence, not just the outcome.

What Maze is actually built for

Maze integrates natively with Figma, which means a designer can push a prototype directly into a Maze study without rebuilding flows or managing export formats. That integration reduces the friction between design iteration and research validation, which is practically significant when a team is moving through multiple versions of a flow in a short sprint cycle.

The platform generates heatmaps that show where participants clicked relative to where you intended them to click, and funnel analytics that show where sessions dropped off across a task sequence. These outputs are genuinely useful when the research question is about success rates and drop-off points: whether and where things happened, not why they happened.

Where Maze produces misleading results is in studies where the prototype fidelity is too low for participants to orient themselves without a moderator’s context-setting, or where the task scenario requires interpretation that participants are not given. An unmoderated participant who abandons a flow may have misunderstood the task, encountered a Figma interaction that did not behave as expected, been interrupted, or genuinely failed to navigate the interface. Maze records all of these as the same event: an incomplete session. The aggregate drop-off metric looks clean. The underlying cause is invisible.

Maze is the right tool when your research question is about the frequency and location of success and failure across a population, when you need to know what happened at scale, not why it happened in depth.

The scenarios where each tool produces misleading results

Using Maze to evaluate a concept that is genuinely novel, where participants have no prior mental model to draw on, tends to produce artificially low success rates that reflect confusion about the concept itself, not problems with the interface. The fix is not to add more sessions. The fix is to run moderated sessions first, understand what participants bring to the concept, and then use Maze to test specific flow decisions once the conceptual frame is established.

Using Lookback to generate a success rate is a different kind of error. A researcher who moderates eight sessions, sees six participants complete a task successfully, and reports a 75% completion rate has done something that looks like measurement but is not. Eight sessions, even conducted well, do not constitute a sample from which a percentage can be extracted with any reliability. What those eight sessions can tell you, if the moderator was skilled and the analysis was rigorous, is a detailed account of what the experience was like for a specific set of people in specific circumstances. That is valuable. It is not the same as a statistic.
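To make that concrete, here is a minimal sketch of the uncertainty involved. The numbers are illustrative, and the Wilson score interval is just one standard way to put a confidence range around an observed completion rate, but it shows why six completions out of eight sessions cannot honestly be reported as "75%", while the same rate across 200 unmoderated sessions carries far more weight.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an observed completion rate."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 6 of 8 moderated sessions: the "75%" is really anywhere from roughly 41% to 93%
print(wilson_interval(6, 8))

# 150 of 200 unmoderated sessions: the same 75% narrows to roughly 69% to 80%
print(wilson_interval(150, 200))
```

The point is not that teams should report confidence intervals from moderated studies. It is that the interval around eight sessions is so wide that the percentage itself is not the finding.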

The 87% task success rate that Maze might generate tells you what happened. It does not tell you what it meant. A participant who completes a checkout in 34 seconds with no misclicks has produced data that looks like success. Whether they understood what they were buying, whether they felt confident in the return policy, whether they would return, none of that is in the funnel.

The context that neither tool captures

Maze tells you what users did in the prototype. Lookback tells you how they felt doing it. What neither captures is the category context that explains why they approached the task the way they did in the first place.

Participants arrive at a research session with accumulated experience, prior tools they have used, frustrations they carry, expectations formed by years of interacting with products in your category. A participant who hesitates before the pricing page may be doing so because of something your design did, or because of something a competitor’s design trained them to expect. A participant who skips the onboarding flow entirely may have been burned by onboarding flows before and made a decision before they ever saw yours.

This prior-experience layer is not something you can surface through task observation, moderated or unmoderated. It lives in the conversations people have when they are not in a research session: in forums, community threads, and review sites where users describe what they actually think about tools in your category, without being asked and without performing for a researcher. Mimir is built for this layer, continuously monitoring unprompted conversation across forums, review platforms, and communities, so the category context that shapes your participants’ behaviour is visible before they ever enter a session.

If your usability research is producing findings that feel complete but are not explaining behaviour in your category, Mimir monitors the unprompted conversation that neither Lookback nor Maze can reach. Start for free.

