The Low‑Traffic Dilemma
Museum web teams often aspire to use standard UX A/B testing methods to improve their websites. However, mid-sized and smaller museums typically have too little web traffic to gather statistically solid results. Traditional approaches that work for high-traffic sites may not yield clear insights when your audience is modest.
Our next few articles will explore why conventional UX testing falls short in low-traffic scenarios and provide practical alternatives tailored for museums. Museum digital teams need realistic strategies to improve user experience despite limited visitor numbers.
We’ll cover how much traffic is needed for A/B tests, why common testing methods struggle on small sites, and what to do instead. We’ll discuss leveraging Google Analytics (GA), Tag Manager, heatmaps, recordings, and qualitative feedback as alternatives to standard A/B testing.
In this introductory article we’ll start by understanding the core challenge: the volume of traffic required for meaningful A/B testing.
Traffic Thresholds for Meaningful A/B Testing
A/B testing (showing two versions, A and B, to different users and measuring which performs better) relies on a large enough sample size to detect real differences. So how much traffic is “enough”? There’s no single magic number, but experts offer some rules of thumb:
- Thousands of Visits and Conversions: A test generally becomes viable only if the page or feature gets a few thousand visits per month and roughly 100 conversions per variant over the test’s duration; below that, statistical confidence is out of reach. One guide suggests roughly 1,000 visitors per week (or ~50 conversions per week) on the page in question as a baseline for A/B testing. If your museum’s page visits are far below that, A/B testing will be an uphill battle. Here’s a handy A/B calculator to help determine whether your traffic is sufficient to test an interaction.
- Example – Long Test Duration: Even with a decent traffic flow, the required numbers add up quickly. For example, with about 1,000 daily visits to a page and a 2% conversion rate, an A/B test looking for a modest +10% improvement could require around 103,000 visitors (roughly 103 days) to reach a reliable 95% confidence result. That’s over three months for a single test! Smaller museums that only get a few thousand visitors per month would need years, not months, to gather 100k visits. (The sample-size sketch after this list shows how figures like these are calculated.)
- Very Low Traffic – Below a Few Hundred a Month: If the traffic to a specific page or feature you want to test is extremely low (say, only a few hundred visits per month), traditional A/B testing is not practical at all. In such cases, it’s often better to pause on A/B testing and focus on qualitative insights by collecting feedback through surveys and interviews.
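If you want to sanity-check figures like these yourself, the standard two-proportion sample-size formula is straightforward to compute. Below is a minimal Python sketch using only the standard library; the exact result depends on the significance level and statistical power you assume, so it won’t match any particular calculator (or the ~103,000-visitor figure above) exactly.

```python
from statistics import NormalDist

def visitors_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative lift
    over a baseline conversion rate with a two-sided z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

# 2% baseline conversion rate, hoping to detect a +10% relative lift
n = visitors_per_variant(0.02, 0.10)
print(f"~{n:,.0f} visitors per variant, ~{2 * n:,.0f} in total")
# At ~1,000 visits a day, divide the total by daily traffic to estimate test length.
```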
Statistical testing needs a critical mass. Museums like the Met or MoMA might have the volume to meet these thresholds, but most other museums do not.
Let’s apply these thresholds to a couple of common museum transactions that are testable and measurable: buying tickets and making donations.
Transaction‑Level Volume: Why “100k Visits” Might Still Be Too Little
When we talk about sample-size requirements, it’s easy to quote overall site traffic, but the real threshold applies to each specific interaction you want to measure:
As a rule of thumb, the sample size you need is tied to the interaction you’re modifying, not to overall site traffic. If you’re running a micro-copy experiment, say, changing the wording on a “Buy Tickets” call-to-action, plan on at least 5,000 to 10,000 pageviews of that specific ticket-launch page and roughly 100 CTA clicks per variant before you can trust the result.
For a bigger change, such as collapsing a two-step donation form into a single screen, the bar rises: you’ll want around 20,000 completed visits to the donation form itself and about 200 finished donations per version to reach a meaningful level of confidence. The deeper the funnel and the more consequential the action, the more raw interactions you need before the data can speak reliably.
So while a museum’s homepage might get 100,000 visits per month, the ticketing launch page might only see 12,000 of those, and a mid‑form “Choose time slot” step only 4,000. That smaller denominator is what drives the real test length.
Quick Estimator
As a rough floor, to detect a 1% absolute change at a given step, plan on at least 5,000 visitors at that step plus roughly 100 goal events per variant.
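To turn a threshold like that into a calendar estimate, divide the required sample by the traffic the specific step actually receives. Here is a minimal sketch reusing the illustrative step volumes from above; the numbers are examples, not benchmarks.

```python
def test_length_days(visitors_needed_total, monthly_visits_at_step):
    """Rough number of days a test must run, given how many visits
    the specific funnel step receives each month."""
    daily_visits = monthly_visits_at_step / 30
    return visitors_needed_total / daily_visits

# The same 10,000-visitor micro-copy threshold, applied at different funnel depths
for step, monthly_visits in [("homepage", 100_000),
                             ("ticket launch page", 12_000),
                             ("choose time slot step", 4_000)]:
    print(f"{step}: ~{test_length_days(10_000, monthly_visits):.0f} days")
# homepage: ~3 days | ticket launch page: ~25 days | choose time slot step: ~75 days
```

The deeper the step sits in the funnel, the smaller its denominator, and the longer the same test takes.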
Why Standard UX Testing Methods Struggle with Low Traffic
Common UX testing techniques, especially quantitative methods like A/B or multivariate testing, assume you can capture enough user data to separate real trends from random noise. With low traffic, this assumption breaks down. Here are the key reasons these methods may not yield strong insights for a small museum website:
- Inconclusive Results Due to Sample Size: A/B testing results are typically considered “reliable” at 95% statistical significance, meaning that if there were no real difference between the variants, a result this extreme would show up only about 5% of the time. Reaching 95% confidence requires a substantial number of user interactions. With low traffic, you might run an experiment for weeks or months and still see the testing tool hover at 60% or 70% significance, essentially a coin flip. Below about 95%, the data are not statistically reliable and drawing firm conclusions is risky. You could easily pick a “winner” variant that in reality isn’t better; the apparent difference was just noise. Too few users make an A/B test result murky. (A short sketch after this list shows how to check the significance of a result yourself.)
- Tiny Changes Are Undetectable: Small UX tweaks (like a minor copy change or button color change) usually produce only tiny differences in user behavior, for example a 0.5% improvement in conversion rate. On a low-traffic site, such a minimal lift is well within the margin of error: the effect is drowned out by normal variability. As one specialist notes, if you run a test with a very subtle change on a smaller site, it’s “very likely that you won’t be able to identify any change” in the metrics. Thus, common “micro-optimization” tests (like testing two shades of a CTA button) aren’t worth it for low-traffic museums; any impact would be so small that your analytics can’t reliably distinguish it from randomness.
- More Variations = More Traffic Needed: It’s tempting to test many ideas at once (A/B/C/D tests with multiple variants), but low-traffic sites can rarely support that. Every additional variant splits your audience further, requiring more total visitors to reach significance. For example, one calculation showed that with a given baseline conversion rate, an A/B test might need ~61,000 visitors, an A/B/C test ~91,000, and an A/B/C/D test around 122,000 to reach 95% confidence for a 10% improvement (the sketch at the end of this section shows how the total scales with each extra variant). Small museums simply don’t have that volume, so testing more than one change at a time is usually out of the question. This is why standard multivariate testing (which might work for Amazon or large museums) is impractical on a low-traffic site.
- Long Timeframes Introduce Variables: If you try to run a test for an extended period (say 3-6 months to accumulate enough users), other factors can change during that time: seasonal traffic fluctuations, school holidays, external events, or content updates. These can cloud your results. For instance, you might think version B is winning after a few months, but perhaps a special exhibit or holiday in that period temporarily boosted traffic from a certain audience segment. With such a long test, it’s hard to isolate the cause of changes. Essentially, low traffic forces longer tests, and longer tests make it harder to control external influences, undermining the validity of the “clean” A/B comparison.
- Surveys and Quantitative User Feedback Challenges: It’s not just A/B tests. Any method that relies on large sample surveys or polls can fall flat with a small audience. An on-site satisfaction survey, for example, might only get a dozen responses over several weeks on a low-traffic museum site. That’s often too few to confidently generalize (“10 people said search was hard to use” is a very limited insight if you have thousands of visitors monthly). Results can be skewed by just a couple of outspoken individuals. So, typical UX surveys or feedback forms may not reach a critical mass of responses for robust conclusions in this context. (They can still provide anecdotal clues, but not statistical certainty.)
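To see why low-traffic results stall below that 95% bar, you can run the same two-proportion z-test that most A/B tools use under the hood. Here is a minimal sketch with hypothetical numbers: 400 visitors per variant and a gap in conversions that looks meaningful at a glance.

```python
from statistics import NormalDist

def ab_confidence(visitors_a, conversions_a, visitors_b, conversions_b):
    """Two-proportion z-test. Returns 1 - p-value, roughly the 'confidence'
    figure many testing dashboards display."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = abs(p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(z))
    return 1 - p_value

# 2.5% vs 3.5% conversion on 400 visitors per variant looks like a real gap...
print(f"{ab_confidence(400, 10, 400, 14):.0%}")  # ...but lands around 59%, far from 95%
```

Even a full percentage-point difference in conversion rate only reaches about 59% at this volume, which is exactly the “hovering” behavior described in the first bullet.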
In short, methods that depend on quantity struggle when you don’t have quantity. Low traffic means low data volume, which means high uncertainty for quantitative testing. This doesn’t mean you should give up on improving UX. Rather, it means shifting your approach.
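Before moving on, the same arithmetic shows why the multi-variant totals quoted above grow so quickly: every variant, including the control, needs its own full sample, so the required traffic scales roughly linearly with the number of versions, and tools that correct for multiple comparisons push it higher still. A minimal sketch with an illustrative per-variant figure:

```python
def total_visitors_needed(visitors_per_variant, num_variants):
    """Total traffic an experiment needs when every variant
    (including the control) must reach the same sample size."""
    return visitors_per_variant * num_variants

per_variant = 30_500  # illustrative per-variant figure, e.g. from a sample-size calculator
for variants in (2, 3, 4):
    print(f"{variants} variants: ~{total_visitors_needed(per_variant, variants):,}")
# 2 variants: ~61,000 | 3 variants: ~91,500 | 4 variants: ~122,000
```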
Large vs. Mid-Size vs. Small Museums: Different Approaches
Not all museums are in the same boat when it comes to web traffic. It’s helpful to calibrate your UX optimization approach based on the scale of your online audience:
- Large Museums (e.g. The Met, MoMA): Major institutions with widespread recognition tend to have high web traffic (hundreds of thousands or millions of visits per year). In these cases, standard A/B testing and UX research methods absolutely apply. A site like the Met’s might accumulate test data quickly enough to reach significance on important metrics. Large museums can use full-featured testing tools, run multiple experiments sequentially, and iterate in a data-driven way much like an e-commerce site would. For example, a big museum might A/B test the design of their ticket purchase page or the wording of membership sign-up calls to action, since they have sufficient volume of transactions to get results in a reasonable time. These organizations might also invest in extensive user research (surveys, usability labs, etc.) because they have the resources and audience size to support it. (One caveat: even with high overall traffic, certain niche pages or microsites of a large museum might still be low-traffic; so they may need to mix methods depending on the section of the site.)
- Mid-Size Museums: A mid-sized museum (perhaps a city museum or university-affiliated museum) often falls into a middle ground. Their websites may have a steady stream of visitors, maybe tens of thousands of visits a month, but not the massive numbers of the top-tier institutions. For these museums, limited A/B testing is possible, but must be done selectively and patiently. You might be able to run an A/B test on your highest-traffic pages or during peak visitor seasons. For instance, if your exhibition calendar page or education programs page gets a lot of hits, you could test a new layout there, but you may need to run the test longer (several weeks or a couple of months) to accumulate enough data. It’s critical to focus on high-impact changes: don’t waste your limited testing “bandwidth” on trivial tweaks.
Mid-size museums should also supplement any A/B tests with qualitative methods (more on those later) to make up for the thinner quantitative data. In practice, a mid-size museum’s web team might do something like: use Google Analytics to spot a problem (say, many users abandon the online donation form), brainstorm a bold design improvement, and run an A/B test on that form knowing it might take 4-6 weeks to get an answer. At the same time, they could conduct a few user interviews or watch session recordings to gather insight while the test runs. Hybrid approaches are key. You have enough traffic to attempt some testing, but not enough to rely on it alone.
- Small Museums: Smaller museums, local galleries, or historical sites often have very low web traffic (a few thousand visits per month or less). In this scenario, traditional A/B testing usually isn’t feasible at all, at least not if you expect statistically significant outcomes. It could take so long to get results that the exercise isn’t worthwhile. For small sites, the web team’s efforts are usually better spent on qualitative and observational UX research (like direct user feedback) and on implementing best practices rather than continuous testing. This doesn’t mean small museums should just redesign on a whim; rather, they should use data in non-traditional ways.
For example, instead of an A/B test to decide on a homepage change, a small museum might do 5 one-on-one usability tests with representative visitors and combine that with any available analytics to make an informed decision. If a small museum does attempt A/B tests, they might use “lightweight” tests where the bar for success is lower (e.g. looking for very large differences or using 80–90% confidence as acceptable evidence), essentially treating A/B results as directional hints. But more often, small museums will lean on the alternative validation methods discussed later rather than formal multivariate experiments.
It’s important to note that any museum, regardless of size, can practice data-informed UX improvement, even if that data isn’t coming from classic A/B tests. Every team can do Conversion Rate Optimization (CRO) by following a rigorous process: use analytics to identify issues, develop hypotheses for improvement, implement changes carefully, and measure what you can. The difference lies in the tools and techniques: large museums can validate changes with large-scale experiments, while small museums might validate changes with a mix of expert review and qualitative feedback.
In the next article in this series on UX testing for museums, we’ll guide you through a process for identifying exactly which features need testing and how to define the parameters for measuring them. Because you can’t test what you haven’t defined, and you can’t evaluate without clear parameters.