Landing Page Testing Strategy

Part IV: The Mechanics of Testing

Landing page testing is one of the most powerful techniques available in your conversion improvement arsenal. In this part of the book we will discuss some of the common questions relating to testing, how to best prepare your content for testing, and the actual testing methods.

Chapter 10: Common Testing Questions
Chapter 11: Preparing for Testing
Chapter 12: Testing Methods

Chapter 10: Common Testing Questions

In the previous chapter you learned how to select elements for landing page tests. Our focus was on how to pick elements that would have the greatest impact on your landing page performance. We are sure that you can't wait to get started. But how do you interpret the results of your tests? To understand the power and limitations of testing methods, you first need to grasp the basics of the underlying math.

Chapter Contents
Lies, Damn Lies, and Statistics
Crash Course in Probability and Statistics
Have I Found Something Better?
How Sure Do I Need to Be?
How Much Better Is It?
How Long Should My Test Run?

Lies, Damn Lies, and Statistics

There are three kinds of lies: lies, damned lies, and statistics.
(Mark Twain, quoting Benjamin Disraeli)

The statistics branch of mathematics has a poor reputation among the public. Yet much of modern science and economics is based on it in a fundamental way. So is public policy. Since public policy is a matter of priorities and heated debate about the allocation of government budgets, statistics has gotten pulled into the fray to support or undermine various political positions. Unscrupulous or ignorant people have corrupted it for their own purposes.

While there is nothing wrong with statistics itself, there are many common misuses of it. In this section, we survey these misuses along with some implications for landing page tuning.

As additional background we recommend Rival Hypotheses: Alternative Interpretations of Data-Based Conclusions (HarperCollins, 1979) by Schuyler W. Huck. The book describes one hundred social sciences experiments along with the possible experimental problems that may cast doubt on, or completely invalidate, the reported results.

Throwing Away Part of the Data

Statistical studies are based on a confidence level in the answer (commonly 95 percent). If you conduct a large number of experiments, even two identical effects can seem different based simply on a statistical streak. For example, if you flipped a coin five times you might be surprised to see it come up heads every time and might even suspect that it could be loaded. However, this is exactly the result that you would expect based simply on random chance about 3 percent of the time. So if this experiment were repeated one hundred times, a series of all heads would be expected to come up about three times. Unscrupulous people might rerun the experiment many times and report a single all-heads result as proof that the coin was loaded. By discarding the remaining experiments that did not support their desired conclusion, they are misrepresenting the results.

As you will see in Chapter 12, "Testing Methods," there are sometimes valid (or at least practical) reasons to hold out some of the data that you collect during a landing page test. But do not cherry-pick and only look for data that supports your conclusions.
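To make the coin-flip arithmetic concrete, here is a quick Python sketch (ours, not part of the original text) that reproduces the roughly 3 percent figure:

```python
# A quick sketch verifying the coin-flip arithmetic above.
p_all_heads = 0.5 ** 5               # probability of five heads in a row with a fair coin
expected_in_100 = 100 * p_all_heads  # expected number of all-heads streaks in 100 repeats

print(f"P(5 heads in a row) = {p_all_heads:.3f}")                 # about 0.031, roughly 3 percent
print(f"Expected all-heads runs in 100 experiments: {expected_in_100:.1f}")  # about 3
```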
Biased Samples

Statistics assumes that a random selection of test subjects was drawn from the population in question. However, samples can be biased by oversampling or undersampling certain groups. In extreme cases, no representatives are drawn from a particular subset of the population.

For example, online or call-in polls are often skewed by definition. They represent self-selecting groups of people who are motivated enough to answer the polls. This usually implies that they have a strong opinion and want to express it. So these types of polls tend to produce more polarized results with a disproportionately large percentage of extreme views (at the expense of the more moderate outlook of the silent majority).

Let's take a look at some common types of sampling bias.

Traffic Filtering

In landing page testing, you generally want to get as wide a range of traffic sources as possible. That way, they are more likely to be representative of your visitor population as a whole. However, you generally want traffic sources that are recurring, controllable, and stable. If your traffic does not have these characteristics, it may be hard to tune. For this reason, you may want to remove unstable sources (such as some of your larger but highly variable affiliates or volatile social networking traffic) from your testing mix. You should also generally remove nonrecurring e-mail traffic.

Data Collection Method

Let's assume that you have picked appropriate filters for your traffic and are selecting the largest possible stable group among your population of visitors. If you have implemented your test properly, then each new visitor should be assigned at random to see one of the alternative versions of your landing page. However, even this sample may not be completely random because of technology considerations.

For example, many testing tools (see the Appendix, "Landing Page Testing Tools," for a partial list) require visitors to have JavaScript turned on in their web browser and to accept first-party cookies (small files left on visitors' hard disks by the website in question, which contain information about their visit and can be used to customize and personalize their experience upon return visits). If a visitor does not meet these technical criteria, they are not included in the test and are simply shown the original landing page.

Based on current web usage statistics, these technical requirements disqualify fewer than 5 percent of Internet users. When your test is completed, you are forced to make some assumptions about how that 5 percent will react to your new page design. You assume that they will act like the other 95 percent that you are able to track. But this may not be true. Since they have JavaScript or first-party cookies turned off, they may represent a small, self-selecting group of people who are more cautious, technically savvy, or concerned with privacy. Such people may indeed behave differently than the rest of the population. As a practical matter, this does not change your recommendations very much. Since the missing 5 percent represents such a small segment, even a significant difference in their behavior will be overwhelmed by the much larger conversion rate improvements that you usually uncover among the sampled visitors. However, it is important to be aware of such technical sampling issues.

Sequential Testing

Another type of sampling bias can be introduced by sequential testing.
For example, you may test your original design for a month, and then replace it with another one during the following month. It is hard to reach any conclusions after this kind of experiment. Any number of external factors may have changed between the two testing periods. For example, there may have been a holiday with common family vacations, some major breaking news that affected your industry, or a major public relations announcement. The point is, you are comparing apples to oranges.

In landing page testing you should always try to collect data from your original version and your tested alternatives in parallel. This will allow you to control for (or at least detect and factor in) any changes in the external environment. Only use sequential testing as a last resort.

Short Data Collection

Even if you run your tests by splitting the available traffic and showing different versions of your site design in parallel, you may still run into biased sampling issues related to short data collection periods. Experiments involving very high data collection rates may be especially prone to this.

For example, let's assume that you are testing two alternative versions of your page and are measuring clickthroughs to a particular target page as your conversion action. Because of the high traffic to your landing page, you collect about 10,000 conversion actions in the first hour of your test. This data shows you that one of your versions outperforms the other to a very high level of statistical confidence. Many people would conclude the test at this point and immediately install the best performer as the new landing page.

But what if we were to tell you that the data was collected in the middle of the night? You might correctly conclude that people visiting your site during the day are a different population, or at least that they behave differently then. The same is true of weekday (accessing the Internet from work) versus weekend (accessing the Internet from home) traffic. Regardless of your data rate, you should collect data for at least a one-week period (or multiple whole-week increments if your data rate is low). This will allow you to get rid of the short-term biases discussed earlier. Of course, this does not address the question of longer-term seasonality (which will be covered in more detail in the "Not Accounting for Seasonality" section of Chapter 15, "Avoiding Real-World Pitfalls").

Overgeneralization

Overgeneralization is the erroneous extension of your test conclusions to a setting where the original results no longer apply. For example, let's say that you set up an experiment to count the ants in your kitchen and tracked it for a full week during a record cold spell in the wintertime. Your finding was that there were no ants in the kitchen at all during the study period. However, it would probably be incorrect to assume that the same would hold true during a heat wave in the summer. Often the overgeneralization is not made by the original researcher, but rather by those who subsequently summarize or cite the results.

A common overgeneralization in landing page testing is to assume that traffic sources that were not part of your original test will behave in the same way as the tested population. For example, if you see a particular effect with your PPC traffic, you should not assume that it will hold up when you expose the new landing page to your in-house e-mail list.
Loaded Questions

The answers that people give in surveys can be manipulated to skew the results in a certain direction. This is done by asking the question in a certain way, or preceding it with information that will support the desired answer.

For example, imagine a survey that is polling about support for a salary raise for local firefighters. Depending on which side of the issue the pollster was on, you might imagine two different questions:

• Given the chronic neglect of city streets and the rising crime rate due to the understaffing of our police force, do you support a raise for our firefighters at this time?
• After considering the extraordinary risks that firefighters face every day to protect your family and property, do you support a raise for our firefighters at this time?

In normal surveying, loaded questions and the context for how the information is presented can be a problem. But in landing page testing, you stand this premise on its head. You want to create loaded landing page content. In fact, your whole goal is to see what your audience responds to best. A cynic might even say that landing page testing is the scientific and systematic discovery of the best audience manipulations available to you.

False Causality

Correlation does not imply causation.
(Common scientific saying)

This saying does not use the word imply in its common sense (that is, to suggest). The scientific sense of imply (taken from formal logic) can be better translated as require. If reread this way it can be paraphrased as "because effects are related or occur together, one does not necessarily cause the other." There may be a third previously unrecognized lurking variable (also called a confounding variable, or confounding factor) that causes the other two.

For example, if we told you that the vast majority of car accidents occur within five miles of people's residences, you might be tempted to start taking the bus instead of driving. But it would be wrong to conclude that accidents are caused by proximity to your home. There is a third confounding variable that could explain both: people do the vast majority of their driving close to home, and accidents are directly related to the time spent driving.

In landing page optimization, many people insist on extracting so-called "learnings" from their test results. Hindsight is used to rationalize why a particular landing page version had a higher conversion rate. For example, you may test two call-to-action buttons: orange and green. If the green one performs better, you may be tempted to conclude that your audience likes the color green more than orange. In fact, there may be another explanation: the contrast of the button color with the main color theme of the page. If your page was predominantly orange themed, the orange call-to-action button would seem muted and may get lost in a scene composed of similar colors. The green button may perform better not because of the actual color used but because the contrasting color sticks out and seems more prominent. There may also be more subtle issues relating to other design changes that were also made at the same time. For example, the green button may have been a different size, or perhaps it used a different color for the call-to-action text. It may have been these look-and-feel factors rather than the button color that increased the propensity of people to act.
Trying to rationalize results after the test is a dangerous activity because it may cause you to inappropriately fixate on elements of your design that had nothing to do with the performance improvement. You should try to restrain yourself from engaging in this kind of after-the-fact myth construction.

Crash Course in Probability and Statistics

Let's go back to the roots of the statistics underlying landing page testing. Within the vast field of mathematics, we will guide you down to the specific subset that you will need to understand. Along the way, we will point out the specific relationship to landing page optimization. And since landing page testing is often a messy business, we will also flag where real-world considerations and issues deviate from the theoretical framework. This drill-down is a quick overview. You may need to do some additional background reading in the areas of probability and statistics.

Probability Theory

Probability theory is a branch of mathematics that deals with the description and analysis of random events. The key building blocks of this framework are as follows:

Random Variables: A random variable is a quantity whose value is random or unpredictable, and to which you can assign a probability distribution function. The probability distribution function determines the set of possible values that can be assigned to the random variable, along with their likelihood. The total of all possible outcomes' likelihoods must by definition equal one (that is, one of the possible outcomes must happen, and its value will be assigned to the random variable). Let's use a fair gaming die as an example. The top face of the die can take on one of six possible outcomes (1, 2, 3, 4, 5, 6). The probability distribution function is uniform (there is an equal one-in-six chance of any value between 1 and 6 coming up). When you sum all of the possible probabilities, they add up to exactly one.

Stochastic Processes: There are two kinds of processes considered in probability theory: deterministic and stochastic. A deterministic process will go along a set path depending on its starting conditions. In other words, if you know where it starts, you can exactly compute where it will end up at some point in the future. A stochastic process (also called a random process) is more difficult to understand. You cannot tell exactly where it will end up, but you know (based on its probability distribution function) that certain outcomes are more likely. In the simplest case, a stochastic process can be described as a sequence of samples from random variables. If these samples can be associated with particular points in time, it is a time series (a series of data points that were measured at successive times). In our die example, the stochastic process is the repeated roll of the die. Each roll will produce a random variable outcome (one of the six possible values), and successive rolls are independent of each other (what was rolled on the previous attempt has no influence on the likelihood of any particular number coming up on the next roll).

Events: An event in probability theory is a set of possible outcomes to which a probability is assigned (the set of all possible outcomes is called the sample space). In the simplest case, the set of possible outcomes is finite. Each of the basic possible outcomes is called an elementary event, but more complex events can be constructed by selecting larger groupings of elementary events (a proper subset of the sample space). In our die example, the elementary events are the individual possible values of a die roll. But you can also construct other events and assign the proper probabilities to them (for example, an even roll of the die, which has a probability of one-half, or a roll with a value greater than 4, which has a probability of one-third).
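As an illustration of these building blocks, the following Python sketch (ours, not from the original text) simulates the die as a stochastic process and estimates the probabilities of the two example events; the sample size is an arbitrary choice:

```python
# A minimal sketch illustrating the die example: a uniform random variable,
# a stochastic process of repeated rolls, and two compound events.
import random

random.seed(1)                                            # fixed seed so the sketch is reproducible
rolls = [random.randint(1, 6) for _ in range(100_000)]    # the stochastic process

p_even = sum(1 for r in rolls if r % 2 == 0) / len(rolls)
p_gt_4 = sum(1 for r in rolls if r > 4) / len(rolls)

print(f"P(even roll) ~ {p_even:.3f}  (theory: 1/2 = 0.500)")
print(f"P(roll > 4)  ~ {p_gt_4:.3f}  (theory: 1/3 = 0.333)")
```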
Probability Applied to Landing Page Testing

So how does all of this apply to landing page optimization? The random variables are the visits to your site from the traffic sources that you have selected for the test. As we have already mentioned, the audience itself may be subject to sampling bias. The probability distribution function is pretty simple in most cases. You are counting whether or not the conversion happened as a result of the visit. You are assuming that there is some underlying and fixed probability of the conversion happening, and that the only other possible outcome is that the conversion does not happen (that is, a visit is a Bernoulli random variable that can result in a conversion, or not). As an example, let's assume that the actual conversion rate for a landing page is 2 percent. So there is a small chance that the conversion will happen (2 percent), and a much larger chance that it will not (98 percent) for any particular visitor. As you can see, the sum of the two possible outcome probabilities, as required, exactly equals one (2% + 98% = 100%).
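The following Python sketch (ours, not from the original text) simulates this Bernoulli model of visits; the 2 percent rate comes from the example above, while the 5,000-visit sample size is an arbitrary assumption:

```python
# A minimal sketch: visits modeled as Bernoulli trials with the
# 2 percent conversion rate used in the example above.
import random

random.seed(42)                      # reproducible illustration
TRUE_CONVERSION_RATE = 0.02          # assumed underlying probability of conversion
visits = 5_000                       # arbitrary sample size for the illustration

conversions = sum(1 for _ in range(visits) if random.random() < TRUE_CONVERSION_RATE)
observed_rate = conversions / visits

print(f"Observed conversion rate: {observed_rate:.4f} "
      f"(true underlying rate: {TRUE_CONVERSION_RATE})")
```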
The stochastic process is the flow of visitors from the traffic sources used for the test. Key assumptions about the process are that the behavior of the visitors does not change over time and that the population from which visitors are drawn remains the same. Unfortunately, both of these assumptions are routinely violated to a greater or lesser extent in the real world. The behavior of visitors changes due to seasonal factors, or with changing sophistication and knowledge levels about your products or industry. The population itself changes based on your current marketing mix. Most businesses are constantly adjusting and tweaking their traffic sources (for example, by changing PPC bid prices and the resulting keyword mix that their audience arrives from). The result is that your time series, which is supposed to return a steady stream of yes or no answers (based on a fixed probability of a conversion), actually has a changing probability of conversion. In mathematical terms, your time series is nonstationary and changes its behavior over time.

The independence of the random variables in the stochastic process is also a critical theoretical requirement. However, the behavior on each visit is not necessarily independent. A person may come back to your landing page a number of times, and their current behavior would obviously be influenced by their previous visits. You might also have a bug or an overload condition where the actions of some users influence the actions that other users can take. For this reason it is best to use a fresh stream of visitors (with a minimal percentage of repeat visitors if possible) for your landing page test audience. Repeat visitors are by definition biased because they have voluntarily chosen to return to your site and are not seeing it for the first time at random. This is also a reason to avoid using landing page testing with an audience consisting of your in-house e-mail list. The people on the list are biased because they have self-selected to receive ongoing messages from you, and because they have already been exposed to previous communications.

The event itself can also be more complicated than the simple did-the-visitor-convert determination. In an e-commerce catalog, it is important to know not only whether a sale happened, but also its value. If you were to tune only for a higher conversion rate, you could achieve that by pushing low-margin and low-cost products that people are more likely to buy. But this would not necessarily result in the highest profits. Some tests involve tuning for the highest possible revenue per visitor (or profit per visitor after considering the variable costs of the conversion action). For these kinds of situations, you need to consider real-valued random variables and their cumulative distribution functions. That discussion is more involved and is beyond the scope of this book.

Law of Large Numbers

The law of large numbers states that if a random variable with an underlying probability (p) is observed repeatedly during independent experiments, the ratio of the observed frequency of that event to the total number of experiments will converge to p.

Let's continue with our die-rolling example. The law of large numbers guarantees that if you roll the die enough times, the percentage of sixes rolled will approach exactly 1/6 of the total number of rolls (that is, its expected percentage in the probability distribution function). An intuitive way of understanding this is that over the long run, any streaks of rolling non-sixes will eventually be counteracted by streaks of rolling extra sixes.

The exciting thing about this law is that it ties something that you can observe (the actual conversion percentage in our test) to the unknown underlying actual conversion rate of your landing page. It guarantees the stable long-term results of the random visitor events.

However, before you start celebrating, it is important to realize that this law is based on a very large number of samples and only guarantees that you will over the long term eventually come close to the actual conversion rate. In reality, your knowledge of the actual conversion rate will accumulate slowly. Moreover, the law of large numbers does not guarantee that you will converge to the correct answer with a small amount of data. In fact, it almost guarantees that over a short period of time, your estimate of the conversion rate will be incorrect. Short-term streaks can and do cause conversion rates to significantly deviate from the true value. The best way to look at this situation is to keep in mind that collecting more data allows you to make increasingly accurate estimates of the true underlying conversion rate. However, your estimate will always be subject to some error; moreover, you can know only approximate bounds on the size of this error.
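Here is a small Python sketch (ours, not from the original text) showing the law of large numbers at work on a simulated conversion rate; the assumed 2 percent rate and the checkpoints are arbitrary illustration choices:

```python
# A minimal sketch of the law of large numbers in action: the running
# conversion-rate estimate drifts early on, then settles near the assumed
# true rate as more visitors are observed.
import random

random.seed(7)
TRUE_RATE = 0.02                     # assumed underlying conversion rate

conversions = 0
for visit in range(1, 200_001):
    if random.random() < TRUE_RATE:
        conversions += 1
    if visit in (100, 1_000, 10_000, 100_000, 200_000):
        print(f"After {visit:>7} visits: estimated rate = {conversions / visit:.4f}")
```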
The Normal Distribution

The Gaussian, or normal, distribution (also commonly called the bell curve because of its characteristic shape) occurs commonly in observations about science and nature. The exact shape and position of the bell curve is defined by two parameters: the position of its center point, and how wide it is. The bell curve can be tall and almost needle-like, or a wide, low smudge (as shown in Figure 10.1).

Figure 10.1 Different bell curve shapes (normal distributions with μ = 0 and σ² = 0.2, 1.0, and 5.0)

The shape and position of the curve are described by the following parameters:

Mean (μ): The mean is the sum of all of the observed values divided by the number of values observed. It is also commonly called the "average" value.

Variance (σ²): The variance shows how spread out or scattered the values are around the mean. If they are tightly clustered, the variance is lower. If they are very spread out, the variance is higher.

Standard Deviation (σ): A standard deviation is defined as the square root of the variance. It is often more useful than the variance itself since it is directly comparable to the underlying measurement.

The unit normal distribution is a special case of the more general Gaussian distribution. Basically, a particular Gaussian distribution can be standardized (by moving its mean to zero and magnifying or shrinking it so that it has a standard deviation equal to 1). Normalizing a particular bell curve allows you to easily compare its properties to those of other bell curves. The area contained under any normal distribution is always one by definition.

The 68-95-99.7 rule (also called the empirical rule) tells you that for a normal distribution almost all values lie within three standard deviations of the mean (see Figure 10.2).

Figure 10.2 The normal distribution (about 34.1 percent of values fall in each band within one standard deviation of the mean, 13.6 percent in each band between one and two standard deviations, 2.1 percent between two and three, and 0.1 percent beyond three)

About 68 percent of the values are within one standard deviation of the mean (μ ± σ). About 95 percent of the values are within two standard deviations of the mean (μ ± 2σ). About 99.7 percent of the values lie within three standard deviations of the mean (μ ± 3σ). For many scientific and engineering purposes, the 95 percent confidence limit is commonly used as the dividing line for making decisions. In other words, if your answer falls into the plus or minus two standard deviation band around a predicted value, it is considered to be consistent with the prediction.

The Central Limit Theorem

The Central Limit Theorem tells you that regardless of the distribution that the original random variables were drawn from, if that distribution has a finite variance, their average will tend to conform to the normal distribution. This is the case across a wide range of real processes, including data collection in landing page testing. The Central Limit Theorem assures you that the conversion rate estimate that you observe for a particular landing page design will look like a normal distribution. This allows you to estimate the probable range of values for the actual underlying conversion rate. The more data you collect, the tighter your estimate will become.
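The 68-95-99.7 percentages can be recomputed directly from the unit normal distribution. The following Python sketch (ours, not from the original text) does so using only the standard library:

```python
# A minimal sketch recomputing the 68-95-99.7 rule from the
# cumulative distribution function of the unit normal distribution.
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """Cumulative probability of the unit normal distribution up to z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

for k in (1, 2, 3):
    within = normal_cdf(k) - normal_cdf(-k)   # probability of falling within +/- k sigma
    print(f"Within {k} standard deviation(s) of the mean: {within:.1%}")
# Prints roughly 68.3%, 95.4%, and 99.7%.
```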
Statistical Methods

One of the common questions answered by statistics is whether a relationship exists between some predictors (independent variables) and the response or resulting effects (the dependent variables). Often, your experiments can be arranged so that when you detect such a relationship, you can say that changes in the independent variables caused the changes in the dependent variables. There are two main types of statistical studies:

Experimental Studies: In experimental studies you first take measurements of the environment that you are studying. You then change the environment in a preplanned way and see if the changes have resulted in a different outcome than before. Landing page testing is a form of experimental study. The environment that you are changing is the design of your landing page. The outcome that you are measuring is typically the conversion rate. As we mentioned earlier, landing page testing and tuning is usually done in parallel, not sequentially. This means that you should split your available traffic and randomly alternate the version of your landing page shown to each new visitor. A portion of your test traffic should always see the original version of the page. This approach will eliminate many of the problems with sequential testing.

Observational Studies: Observational studies, by contrast, do not involve any manipulation or changes to the environment in question. You simply gather the data and then analyze it for any interesting correlations between your independent and dependent variables. For example, you may be running PPC marketing programs on two different search engines. You collect data for a month on the total number of clicks from each campaign and the resulting number of conversions. You can then see if the conversion rate between the two traffic sources is truly different or possibly due to chance.

The basic steps of any scientific experiment are well known. We have summarized them next with notes on their applicability to landing page testing. Chapter 14, "Developing Your Action Plan," covers these steps and all of the other required landing page testing activities in more detail.

Plan the Research: Determine the landing page to tune, your traffic sources for the test, and the traffic levels available. Understand and try to correct for or eliminate any sampling biases among your population.

Design the Experiment: Create a written test plan that explicitly lays out the alternative landing page elements that you intend to test (your independent variables). Define the performance measurement that you will be trying to improve (typically the conversion rate for a key process on the landing page).

Collect the Data: You will need to collect the number of visits or impressions for your test pages as well as the number and value of any conversions.

Summarize the Data: Use descriptive statistics (see the next section) to summarize your findings. Hide unnecessary levels of detail.

Draw Conclusions: Use inferential statistics (see the next section) to see what information can be gleaned from your data sample about the underlying population of visitors on your landing page. Normally this would involve statistical tests to see if any of your alternative landing page designs are better than the original.

Present the Results: Document and present the results of your experiment. This can be a casual e-mail or a detailed formal report, depending on your circumstances and purpose.

Applied Statistics

Statistical theory (also known as mathematical statistics) is based on probability theory and mathematical analysis, and is used to understand the theoretical basis of statistics. Applied statistics falls into two basic types:

Descriptive Statistics: Descriptive statistics is used to summarize or describe a collection of data.
This can be done numerically or graphically. Basic numerical descriptions include the mean, median, mode, variance, and standard deviation. Graphical summaries include various kinds of graphs and charts.

Inferential Statistics: Inferential statistics is used to reach conclusions that go beyond the specific data that you have collected. In effect, you are trying to infer the behavior of the larger process or population from which you drew your test sample. Examples of possible inferences include answers to yes-or-no questions (hypothesis testing), as well as other techniques like Analysis of Variance (ANOVA), regression analysis, and many other multivariate methods such as cluster analysis, multidimensional scaling, and factor analysis.

Both types of applied statistics are commonly used in landing page testing and tuning. Unfortunately, descriptive statistics are often viewed as a substitute for the proper inferential tests and are used to make decisions. Remember, descriptive statistics only summarize or describe the data that you have observed. They do not tell you anything about the meaning or implications of your observations. Proper hypothesis testing must be done to see if differences in your data are likely to be due to random chance or are truly significant.

Have I Found Something Better?

Landing page optimization is based on statistics, and statistics is based in turn on probability theory. Probability theory is concerned with the study of random events, but a lot of people might object that the behavior of your landing page visitors is not "random." Your visitors are not as simple as the roll of a die. They visit your landing page for a reason and act (or fail to act) based on their own internal motivations. So what does probability mean in this context?

Let's conduct a little thought experiment. Imagine that we are about to flip a fair coin. It has the potential to be in one of two states (heads or tails). What would you estimate the probability of it coming up heads to be? Fifty percent, right? So would we.

Now imagine that we have flipped the coin and covered up the result after catching it. The process of flipping is now complete, and the coin has taken on one particular state. Now what would you estimate the probability of it coming up heads to be? Fifty percent again, right? We would agree, because neither of us knows any more than before the coin was flipped.

Now imagine if we peeked at the coin without letting you see it. What would you estimate the probability of it coming up heads to be? Still 50 percent, right? How about us? We would no longer agree with you. Having seen the outcome of the flip event, we would declare that the probability of it coming up heads is either zero or 100 percent (depending on what we have seen). How can two parties experience the same event and come to two different conclusions? Who is correct? The answer is: both of us. We are basing our answers on different available information. Not having seen the outcome of the flip, you must assume that the coin can still come up heads. In effect, for you the coin has not been flipped, but rather remains in a state of pre-flipped potential. We know more, so our answer is different.

So probability can be viewed as simply taking the best guess given the available information. The more information you have, the more accurate your guess will become. Let's look at this in the context of the simplest type of landing page optimization.
Let's assume that you have a constant flow of visitors to your landing page from a steady and unchanging traffic source. You decide to test two versions of your page design, and split your traffic evenly and randomly between them. In statistical terminology, you have two stochastic processes (experiences with your landing pages), with their own random variables (visitors drawn from the same population) and their own measurable binary events (either visitors convert or they do not). The true probability of conversion for each page is not known, but must be between zero and one. This true probability of conversion is what you normally call the conversion rate, and you assume that it is fixed.

From the law of large numbers you know that as you sample a very large number of visitors, the measured conversion rate will approach the true probability of conversion. From the Central Limit Theorem you also know that the chances of the actual value falling within three standard deviations of your observed mean are very high (99.7 percent) and that the width of the normal distribution will continue to narrow (depending only on the amount of data that you have collected). Basically, measured conversion rates will wander within ever-narrower ranges as they get closer and closer to their true respective conversion rates. By seeing the amount of overlap between the two bell curves representing the normal distributions of the conversion rate, you can determine the likelihood of one version of the page being better than the other.

One of the most common questions in inferential statistics is to see if two samples are really different or if they could have been drawn from the same underlying population as a result of random chance alone. You can compare the average performance between two groups by using a t-test computation. In landing page testing, this kind of analysis would allow you to compare the difference in conversion rate between two versions of your site design. Let's suppose that your new version had a higher conversion rate than the original. The t-test would tell you if this difference was likely due to random chance or if the two were actually different.

There is a whole family of related t-test formulas based on the circumstances. The appropriate one for head-to-head landing page optimization tests is the unpaired one-tailed equal-variance t-test. The test produces a single number as its output. The higher this number is, the higher the statistical certainty that the two outcomes being measured are truly different. It is very easy to compute and requires only basic spreadsheet formulas.
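As a rough illustration, the sketch below (ours, not from the original text) runs an unpaired, equal-variance t-test on per-visit outcomes coded as 1 (converted) or 0 (did not convert). The visit and conversion counts are made-up values, and SciPy is an assumed dependency; the same computation can be done with spreadsheet formulas as noted above:

```python
# A minimal sketch of the unpaired, equal-variance t-test applied to
# per-visit conversion outcomes. Counts are illustrative only.
from scipy import stats

# Version A: 5,000 visits, 100 conversions. Version B: 5,000 visits, 130 conversions.
a = [1] * 100 + [0] * 4_900
b = [1] * 130 + [0] * 4_870

t_stat, p_two_tailed = stats.ttest_ind(b, a, equal_var=True)
# One-tailed test: is B better than A?
p_one_tailed = p_two_tailed / 2 if t_stat > 0 else 1 - p_two_tailed / 2

print(f"t statistic: {t_stat:.2f}")
print(f"One-tailed p-value (B better than A): {p_one_tailed:.4f}")
```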
How Sure Do I Need to Be?

Online marketers often make the mistake of looking only at the descriptive statistics for their test and neglect to do even basic inferential statistics to see if their answers are due simply to random chance. They often do not have the knowledge or discipline to specify the desired confidence in their answer ahead of time, and to patiently collect enough data until that level of confidence is reached. There are three common issues associated with a lack of statistical confidence.

Collecting Insufficient Data

Early in an experiment, when you have collected only a relatively small amount of data, the measured conversion rates may fluctuate wildly. If the first visitor for one of the page designs happens to convert, for instance, your measured conversion rate is 100 percent. It is tempting to draw conclusions during this early period, but doing so commonly leads to error. Just as you would not conclude a coin could never come up tails after seeing it come up heads just three times, you should not pick a page design before collecting enough data. What many people forget is that there can (and should) be short-term streaks that significantly skew the conversion rates in low-data situations. Remember, the laws of probability only guarantee the accuracy and stability of results for very large sample sizes. For smaller sample sizes, a lot of fuzz and uncertainty remain.

The way to deal with this is to decide on your desired confidence level ahead of time. How sure do you want to be in your answer: 90, 95, 99 percent, even higher? Your level of certainty completely depends on your business goals and the consequences of being wrong. If a lot of money is involved, you should probably insist on higher confidence levels.

Let's consider the simplest example. You are trying to decide whether version A or B is best. You have split your traffic equally to test both options and have gotten 90 conversions on A and 100 conversions on B. Is B really better than A? Many people would answer yes, since 100 is obviously higher than 90. But the statistical reality is not so clear-cut.

Confidence in your answer can be expressed by means of a Z-score, which is easy to calculate in cases like this. The Z-score tells you how many standard deviations away from the observed mean your data is. In other words, it is the same as the number of standard deviations in the test's normal distribution. The Z-score therefore follows the 68-95-99.7 rule that we discussed earlier: Z=1 means that you are about 68 percent sure of your answer, Z=2 means about 95 percent sure, and Z=3 means about 99.7 percent sure. Pick an appropriate confidence level, and then wait to collect enough data to reach it.

Let's pick a 95 percent confidence level for our earlier example. This means that you want to be right 19 out of 20 times. So you will need to collect enough data to get a Z-score of 2 or more.

The calculation of the Z-score depends on the standard deviation (σ). For conversion rates that are less than 30 percent, this formula is fairly accurate:

σ = √(number of conversions)

In our example for B, the standard deviation would be calculated as follows:

σ = √100 = 10

So you are about 68 percent sure (Z=1) that the real value of B is between 90 and 110 (100 plus or minus 10). In other words, there is roughly a one-in-three chance that A is actually bigger than the lower end of the estimated range, and you may just be seeing a lucky streak for B. Similarly, at your current data amounts, you are about 95 percent sure (Z=2) that the real value of B is between 80 and 120 (100 plus or minus 20). So there is a good chance that the 90 conversions on A are actually better than the bottom-end estimate of 80 for B.

Confidence levels are often illustrated with a graph. The error bars on the quantity being measured represent the range of possible values (the confidence interval) that would include results within the selected confidence level. Figure 10.3 shows 95 percent confidence error bars (represented by the dashed lines) for our example. As you can see, the bottom of B's error bars falls below the top of A's error bars, so the two ranges overlap. This implies that A might actually be higher than B, despite B's current streak of good luck in the current sample.

Figure 10.3 Confidence error bars (little data)
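The sketch below (ours, not from the original text) applies this approximation to the 90-versus-100 example, following the simplified treatment above, which measures the gap in units of B's standard deviation:

```python
# A minimal sketch of the approximation above: sigma ~ sqrt(conversions),
# a Z-score for the gap between A and B, and the 95% (Z=2) intervals.
from math import sqrt

conversions_a = 90
conversions_b = 100

sigma_a = sqrt(conversions_a)        # about 9.5
sigma_b = sqrt(conversions_b)        # exactly 10

# How many of B's standard deviations separate the two observed counts?
z = (conversions_b - conversions_a) / sigma_b
print(f"Z-score for the observed difference: {z:.1f}")   # 1.0, short of the Z=2
                                                          # needed for ~95% confidence

# 95 percent (Z=2) confidence intervals for each version:
print(f"A: {conversions_a - 2 * sigma_a:.0f} to {conversions_a + 2 * sigma_a:.0f}")
print(f"B: {conversions_b - 2 * sigma_b:.0f} to {conversions_b + 2 * sigma_b:.0f}")
```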
If you wanted to be 95 percent sure that B is better than A, you would need to collect much more data. In our example, this level of confidence would be reached when A had about 1,350 conversions and B had about 1,500 conversions. Note that even though the ratio between A and B remains the same, the standard deviations have become much smaller relative to the difference between the two counts, thus raising the Z-score. As you can see from Figure 10.4, the confidence error bars have now "uncrossed," so you can be 95 percent confident that B actually is better than A.

Figure 10.4 Confidence error bars (more data)

All of this may seem a little intimidating at first, but the math for these calculations can easily be programmed into a spreadsheet formula. After that, you just plug in the current test numbers and see if your desired confidence level has been reached yet. Believe us; this is preferable to making wrong decisions one-third of the time, as you might have done in this section's example.
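If you want a rough feel for how much data that takes, the sketch below (ours, and only an approximation, not the calculation used for the figures above) scales up the observed 90:100 split until the plus-or-minus two standard deviation error bars stop overlapping. Because σ ≈ √(conversions) is itself an approximation, it lands near, rather than exactly at, the 1,350 and 1,500 figures quoted above:

```python
# A rough sketch (an approximation, not the original calculation): keep the
# observed 90:100 ratio and scale up the data until the +/- 2 sigma error
# bars for A and B no longer overlap.
from math import sqrt

scale = 1
while True:
    a = 90 * scale
    b = 100 * scale
    top_of_a = a + 2 * sqrt(a)       # upper end of A's 95% error bar
    bottom_of_b = b - 2 * sqrt(b)    # lower end of B's 95% error bar
    if bottom_of_b > top_of_a:       # the bars have "uncrossed"
        print(f"Roughly {a} conversions on A and {b} on B are needed.")
        break
    scale += 1
```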
Confusing Significance with Importance

In the preceding section we discussed how people often want to believe that large effects are statistically significant when they do not have enough data to support such a conclusion. Because of their lack of statistical literacy, many people also make the converse mistake: they believe that just because they have found something statistically significant, it is also practically important. The word significant in statistical terms means only that you have high enough confidence in your answer. It does not mean that the effect found in your test is large or important. If you collect a large enough data sample, even tiny differences can be found to be statistically significant. Most people would probably not get excited if the difference between two landing page versions on which they collected test data for a long time turned out to be extremely small (yet significant to the required confidence level). Even if you reach a high level of statistical confidence, you may not have found an effect that is interesting in practical terms.

Understanding the Results

The null hypothesis in probability and statistics is the starting assumption that nothing other than random chance is operating to create the observed effect that you see in a particular set of data. Basically, it assumes that the measured effects are the same across the independent conditions being tested. There are no differences or relationships between these independent variables and the dependent outcomes: equal until proven otherwise. The null hypothesis is rejected if your data set is unlikely to have been produced by chance. The significance of the results is described by the confidence level that was defined for the test (as described by the acceptable error, or "alpha level"). For example, it is harder to reject the null hypothesis at 99 percent confidence (alpha 0.01) than at 95 percent confidence (alpha 0.05).

Even if the null hypothesis is rejected at a certain confidence level, no alternative hypothesis is proven thereby. The only conclusion you can draw is that some effect is going on. But you do not know its cause. If the experiment was designed properly, the only things that changed were the experimental conditions, so it is logical to attribute a causal effect to them. However, as we have already discussed, there are often subtle and gross sampling biases and test design errors in landing page optimization, and the documented effects can also be attributed to these. Under such conditions you can only strictly state that there is a high degree of correlation between the tested changes and the corresponding outcomes, but not true causality. Having said that, online marketing is an applied discipline that has to earn its keep, so don't let such considerations dissuade you from using statistics to run your tests. We just feel obliged to point out the specific deviations from the pure underlying math.

What if the null hypothesis is not rejected? This simply means that you did not find any statistically significant differences. That is not the same as stating that there was no difference. Remember, accepting the null hypothesis merely means that the observed differences might have been due simply to random chance, not that they must have been. Statistics cannot prove that there was no difference between two test conditions. The absence of evidence for a difference does not provide any evidence for the notion that no difference exists.

How Much Better Is It?

Internet marketing produces a detailed and quantifiable view of your online campaign activities. As we discussed earlier, most of the numbers produced fall under the general category of descriptive statistics. Descriptive statistics produces summaries
