Difference between Round off and Truncation error

distinguish between round off errors and truncation errors, truncation error vs roundoff error,show round-off errors and truncation errors by example,what is round off and truncation error
GregDeamons Profile Pic
GregDeamons,New Zealand,Professional
Published Date:03-08-2017
Your Website URL(Optional)
Comment
cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 88 4 Roundoff and Truncation Errors CHAPTER OBJECTIVES The primary objective of this chapter is to acquaint you with the major sources of errors involved in numerical methods. Specific objectives and topics covered are • Understanding the distinction between accuracy and precision. • Learning how to quantify error. • Learning how error estimates can be used to decide when to terminate an iterative calculation. • Understanding how roundoff errors occur because digital computers have a limited ability to represent numbers. • Understanding why floating-point numbers have limits on their range and precision. • Recognizing that truncation errors occur when exact mathematical formulations are represented by approximations. • Knowing how to use the Taylor series to estimate truncation errors. • Understanding how to write forward, backward, and centered finite-difference approximations of first and second derivatives. • Recognizing that efforts to minimize truncation errors can sometimes increase roundoff errors. YOU’VE GOT A PROBLEM n Chap. 1, you developed a numerical model for the velocity of a bungee jumper. To solve the problem with a computer, you had to approximate the derivative of velocity I with a finite difference. dv v v(t ) − v(t ) i+1 i ∼ = = dt t t − t i+1 i 88cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 89 4.1 ERRORS 89 Thus, the resulting solution is not exact—that is, it has error. In addition, the computer you use to obtain the solution is also an imperfect tool. Be- cause it is a digital device, the computer is limited in its ability to represent the magnitudes and precision of numbers. Consequently, the machine itself yields results that contain error. So both your mathematical approximation and your digital computer cause your re- sulting model prediction to be uncertain. Your problem is: How do you deal with such un- certainty? In particular, is it possible to understand, quantify and control such errors in order to obtain acceptable results? This chapter introduces you to some approaches and concepts that engineers and scientists use to deal with this dilemma. 4.1 ERRORS Engineers and scientists constantly find themselves having to accomplish objectives based on uncertain information. Although perfection is a laudable goal, it is rarely if ever at- tained. For example, despite the fact that the model developed from Newton’s second law is an excellent approximation, it would never in practice exactly predict the jumper’s fall. A variety of factors such as winds and slight variations in air resistance would result in de- viations from the prediction. If these deviations are systematically high or low, then we might need to develop a new model. However, if they are randomly distributed and tightly grouped around the prediction, then the deviations might be considered negligible and the model deemed adequate. Numerical approximations also introduce similar discrepancies into the analysis. This chapter covers basic topics related to the identification, quantification, and mini- mization of these errors. General information concerned with the quantification of error is reviewed in this section. This is followed by Sections 4.2 and 4.3, dealing with the two major forms of numerical error: roundoff error (due to computer approximations) and trun- cation error (due to mathematical approximations). We also describe how strategies to re- duce truncation error sometimes increase roundoff. Finally, we briefly discuss errors not directly connected with the numerical methods themselves. These include blunders, model errors, and data uncertainty. 4.1.1 Accuracy and Precision The errors associated with both calculations and measurements can be characterized with regard to their accuracy and precision. Accuracy refers to how closely a computed or mea- sured value agrees with the true value. Precision refers to how closely individual computed or measured values agree with each other. These concepts can be illustrated graphically using an analogy from target practice. The bullet holes on each target in Fig. 4.1 can be thought of as the predictions of a numer- ical technique, whereas the bull’s-eye represents the truth. Inaccuracy (also called bias) is defined as systematic deviation from the truth. Thus, although the shots in Fig. 4.1c are more tightly grouped than in Fig. 4.1a, the two cases are equally biased because they are both centered on the upper left quadrant of the target. Imprecision (also called uncertainty), on the other hand, refers to the magnitude of the scatter. Therefore, although Fig. 4.1b and d are equally accurate (i.e., centered on the bull’s-eye), the latter is more precise because the shots are tightly grouped.cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 90 90 ROUNDOFF AND TRUNCATION ERRORS Increasing accuracy (a) (b) (c) (d) FIGURE 4.1 An example from marksmanship illustrating the concepts of accuracy and precision: (a) inaccurate and imprecise, (b) accurate and imprecise, (c) inaccurate and precise, and (d) accurate and precise. Numerical methods should be sufficiently accurate or unbiased to meet the require- ments of a particular problem. They also should be precise enough for adequate design. In this book, we will use the collective term error to represent both the inaccuracy and imprecision of our predictions. 4.1.2 Error Definitions Numerical errors arise from the use of approximations to represent exact mathematical op- erations and quantities. For such errors, the relationship between the exact, or true, result and the approximation can be formulated as True value = approximation + error (4.1) By rearranging Eq. (4.1), we find that the numerical error is equal to the discrepancy between the truth and the approximation, as in E = true value − approximation (4.2) t where E is used to designate the exact value of the error. The subscript t is included to des- t ignate that this is the “true” error. This is in contrast to other cases, as described shortly, where an “approximate” estimate of the error must be employed. Note that the true error is commonly expressed as an absolute value and referred to as the absolute error. A shortcoming of this definition is that it takes no account of the order of magnitude of the value under examination. For example, an error of a centimeter is much more significant Increasing precisioncha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 91 4.1 ERRORS 91 if we are measuring a rivet than a bridge. One way to account for the magnitudes of the quantities being evaluated is to normalize the error to the true value, as in true value − approximation True fractional relative error = true value The relative error can also be multiplied by 100% to express it as true value − approximation ε = 100% (4.3) t true value where ε designates the true percent relative error. t For example, suppose that you have the task of measuring the lengths of a bridge and a rivet and come up with 9999 and 9 cm, respectively. If the true values are 10,000 and 10 cm, respectively, the error in both cases is 1 cm. However, their percent relative errors can be computed using Eq. (4.3) as 0.01% and 10%, respectively. Thus, although both mea- surements have an absolute error of 1 cm, the relative error for the rivet is much greater. We would probably conclude that we have done an adequate job of measuring the bridge, whereas our estimate for the rivet leaves something to be desired. Notice that for Eqs. (4.2) and (4.3), E and ε are subscripted with a t to signify that the error is based on the true value. For the example of the rivet and the bridge, we were pro- vided with this value. However, in actual situations such information is rarely available. For numerical methods, the true value will only be known when we deal with functions that can be solved analytically. Such will typically be the case when we investigate the theo- retical behavior of a particular technique for simple systems. However, in real-world ap- plications, we will obviously not know the true answer a priori. For these situations, an alternative is to normalize the error using the best available estimate of the true value—that is, to the approximation itself, as in approximate error ε = 100% (4.4) a approximation where the subscript a signifies that the error is normalized to an approximate value. Note also that for real-world applications, Eq. (4.2) cannot be used to calculate the error term in the numerator of Eq. (4.4). One of the challenges of numerical methods is to determine error estimates in the absence of knowledge regarding the true value. For example, certain numerical methods use iteration to compute answers. In such cases, a present approxima- tion is made on the basis of a previous approximation. This process is performed repeat- edly, or iteratively, to successively compute (hopefully) better and better approximations. For such cases, the error is often estimated as the difference between the previous and pre- sent approximations. Thus, percent relative error is determined according to present approximation − previous approximation ε = 100% (4.5) a present approximation This and other approaches for expressing errors is elaborated on in subsequent chapters. The signs of Eqs. (4.2) through (4.5) may be either positive or negative. If the approx- imation is greater than the true value (or the previous approximation is greater than the cur- rent approximation), the error is negative; if the approximation is less than the true value, the error is positive. Also, for Eqs. (4.3) to (4.5), the denominator may be less than zero,cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 92 92 ROUNDOFF AND TRUNCATION ERRORS which can also lead to a negative error. Often, when performing computations, we may not be concerned with the sign of the error but are interested in whether the absolute value of the percent relative error is lower than a prespecified tolerance ε . Therefore, it is often useful s to employ the absolute value of Eq. (4.5). For such cases, the computation is repeated until ε ε (4.6) a s This relationship is referred to as a stopping criterion. If it is satisfied, our result is assumed to be within the prespecified acceptable level ε . Note that for the remainder of this text, we s almost always employ absolute values when using relative errors. It is also convenient to relate these errors to the number of significant figures in the ap- proximation. It can be shown (Scarborough, 1966) that if the following criterion is met, we can be assured that the result is correct to at least n significant figures. 2−n ε = (0.5 × 10 )% (4.7) s EXAMPLE 4.1 Error Estimates for Iterative Methods Problem Statement. In mathematics, functions can often be represented by infinite se- ries. For example, the exponential function can be computed using 2 3 n x x x x e = 1 + x + + +···+ (E4.1.1) 2 3 n Thus, as more terms are added in sequence, the approximation becomes a better and better x estimate of the true value of e . Equation (E4.1.1) is called a Maclaurin series expansion. x Starting with the simplest version, e = 1, add terms one at a time in order to estimate 0.5 e . After each new term is added, compute the true and approximate percent relative errors 0.5 with Eqs. (4.3) and (4.5), respectively. Note that the true value is e = 1.648721.... Add terms until the absolute value of the approximate error estimate ε falls below a prespeci- a fied error criterion ε conforming to three significant figures. s Solution. First, Eq. (4.7) can be employed to determine the error criterion that ensures a result that is correct to at least three significant figures: 2−3 ε = (0.5 × 10 )% = 0.05% s Thus, we will add terms to the series until ε falls below this level. a The first estimate is simply equal to Eq. (E4.1.1) with a single term. Thus, the first estimate is equal to 1. The second estimate is then generated by adding the second term as in x e = 1 + x or for x = 0.5 0.5 e = 1 + 0.5 = 1.5 This represents a true percent relative error of Eq. (4.3)     1.648721 − 1.5   ε = × 100% = 9.02% t   1.648721cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 93 4.1 ERRORS 93 Equation (4.5) can be used to determine an approximate estimate of the error, as in     1.5 − 1   ε = × 100% = 33.3% a   1.5 Because ε is not less than the required value of ε , we would continue the computation by a s 2 adding another term, x /2, and repeating the error calculations. The process is continued until ε ε . The entire computation can be summarized as a s Terms Result ε , % ε , % t a 1 1 39.3 2 1.5 9.02 33.3 3 1.625 1.44 7.69 4 1.645833333 0.175 1.27 5 1.648437500 0.0172 0.158 6 1.648697917 0.00142 0.0158 Thus, after six terms are included, the approximate error falls below ε = 0.05%, and the s computation is terminated. However, notice that, rather than three significant figures, the result is accurate to five This is because, for this case, both Eqs. (4.5) and (4.7) are con- servative. That is, they ensure that the result is at least as good as they specify. Although, this is not always the case for Eq. (4.5), it is true most of the time. 4.1.3 Computer Algorithm for Iterative Calculations Many of the numerical methods described in the remainder of this text involve iterative calculations of the sort illustrated in Example 4.1. These all entail solving a mathematical problem by computing successive approximations to the solution starting from an initial guess. The computer implementation of such iterative solutions involves loops. As we saw in Sec. 3.3.2, these come in two basic flavors: count-controlled and decision loops. Most iter- ative solutions use decision loops. Thus, rather than employing a pre-specified number of iterations, the process typically is repeated until an approximate error estimate falls below a stopping criterion as in Example 4.1. To do this for the same problem as Example 4.1, the series expansion can be ex- pressed as n n  x x ∼ e = n i=0 An M-file to implement this formula is shown in Fig. 4.2. The function is passed the value to be evaluated (x) along with a stopping error criterion (es) and a maximum allowable number of iterations (maxit). If the user omits either of the latter two parameters, the func- tion assigns default values. cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 94 94 ROUNDOFF AND TRUNCATION ERRORS function fx,ea,iter = IterMeth(x,es,maxit) % Maclaurin series of exponential function % fx,ea,iter = IterMeth(x,es,maxit) % input: % x = value at which series evaluated % es = stopping criterion (default = 0.0001) % maxit = maximum iterations (default = 50) % output: % fx = estimated value % ea = approximate relative error (%) % iter = number of iterations % defaults: if nargin2isempty(es),es=0.0001;end if nargin3isempty(maxit),maxit=50;end % initialization iter = 1; sol = 1; ea = 100; % iterative calculation while (1) solold = sol; sol = sol + x iter / factorial(iter); iter = iter + 1; if sol=0 ea=abs((sol - solold)/sol)100; end if ea=es iter=maxit,break,end end fx = sol; end FIGURE 4.2 An M-file to solve an iterative calculation. This example is set up to evaluate the Maclaurin series x expansion for e as described in Example 4.1. The function then initializes three variables: (a)iter, which keeps track of the number of iterations, (b) sol, which holds the current estimate of the solution, and (c) a variable, ea,which holds the approximate percent relative error. Note that ea is initially set to a value of 100 to ensure that the loop executes at least once. These initializations are followed by a decision loop that actually implements the iterative calculation. Prior to generating a new solution, the previous value, sol, is first as- signed to solold. Then a new value of sol is computed and the iteration counter is in- cremented. If the new value of sol is nonzero, the percent relative error, ea, is deter- mined. The stopping criteria are then tested. If both are false, the loop repeats. If either is true, the loop terminates and the final solution is sent back to the function call. When the M-file is implemented, it generates an estimate for the exponential function which is returned along with the approximate error and the number of iterations. For 1 example, e can be evaluated as format long approxval, ea, iter = IterMeth(1,1e-6,100)cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 95 4.2 ROUNDOFF ERRORS 95 approxval = 2.718281826198493 ea = 9.216155641522974e-007 iter = 12 We can see that after 12 iterations, we obtain a result of 2.7182818 with an −7 approximate error estimate of = 9.2162 × 10 %. The result can be verified by using the built-in exp function to directly calculate the exact value and the true percent rel- ative error, trueval=exp(1) trueval = 2.718281828459046 et=abs((trueval- approxval)/trueval)100 et = 8.316108397236229e-008 As was the case with Example 4.1, we obtain the desirable outcome that the true error is less than the approximate error. 4.2 ROUNDOFF ERRORS Roundoff errors arise because digital computers cannot represent some quantities ex- actly. They are important to engineering and scientific problem solving because they can lead to erroneous results. In certain cases, they can actually lead to a calculation going unstable and yielding obviously erroneous results. Such calculations are said to be ill-conditioned. Worse still, they can lead to subtler discrepancies that are difficult to detect. There are two major facets of roundoff errors involved in numerical calculations: 1. Digital computers have magnitude and precision limits on their ability to represent numbers. 2. Certain numerical manipulations are highly sensitive to roundoff errors. This can re- sult from both mathematical considerations as well as from the way in which comput- ers perform arithmetic operations. 4.2.1 Computer Number Representation Numerical roundoff errors are directly related to the manner in which numbers are stored in a computer. The fundamental unit whereby information is represented is called a word. This is an entity that consists of a string of binary digits, or bits. Numbers are typically stored in one or more words. To understand how this is accomplished, we must first review some material related to number systems. A number system is merely a convention for representing quantities. Because we have 10 fingers and 10 toes, the number system that we are most familiar with is the decimal, orcha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 96 96 ROUNDOFF AND TRUNCATION ERRORS base-10, number system. A base is the number used as the reference for constructing the system. The base-10 system uses the 10 digits—0, 1, 2, 3, 4, 5, 6, 7, 8, and 9—to represent numbers. By themselves, these digits are satisfactory for counting from 0 to 9. For larger quantities, combinations of these basic digits are used, with the position or place value specifying the magnitude. The rightmost digit in a whole number represents a number from 0 to 9. The second digit from the right represents a multiple of 10. The third digit from the right represents a multiple of 100 and so on. For example, if we have the number 8642.9, then we have eight groups of 1000, six groups of 100, four groups of 10, two groups of 1, and nine groups of 0.1, or 3 2 1 0 −1 (8 × 10 ) + (6 × 10 ) + (4 × 10 ) + (2 × 10 ) + (9 × 10 ) = 8642.9 This type of representation is called positional notation. Now, because the decimal system is so familiar, it is not commonly realized that there are alternatives. For example, if human beings happened to have eight fingers and toes we would undoubtedly have developed an octal, or base-8, representation. In the same sense, our friend the computer is like a two-fingered animal who is limited to two states—either 0 or 1. This relates to the fact that the primary logic units of digital com- puters are on/off electronic components. Hence, numbers on the computer are repre- sented with a binary, or base-2, system. Just as with the decimal system, quantities can be represented using positional notation. For example, the binary number 101.1 is 2 1 0 −1 equivalent to (1 × 2 )+(0 × 2 ) + (1 × 2 ) + (1 × 2 ) = 4 + 0 + 1 + 0.5 = 5.5 in the decimal system. Integer Representation. Now that we have reviewed how base-10 numbers can be rep- resented in binary form, it is simple to conceive of how integers are represented on a com- puter. The most straightforward approach, called the signed magnitude method, employs the first bit of a word to indicate the sign, with a 0 for positive and a 1 for negative. The re- maining bits are used to store the number. For example, the integer value of 173 is repre- sented in binary as 10101101: 7 5 3 2 0 (10101101) = 2 + 2 + 2 + 2 + 2 = 128 + 32 + 8 + 4 + 1 = (173) 2 10 Therefore, the binary equivalent of −173 would be stored on a 16-bit computer, as depicted in Fig. 4.3. If such a scheme is employed, there clearly is a limited range of integers that can be rep- resented. Again assuming a 16-bit word size, if one bit is used for the sign, the 15 remaining bits can represent binary integers from 0 to 111111111111111. The upper limit can be con- 14 13 1 0 . . . verted to a decimal integer, as in (1 × 2 ) + (1 × 2 ) + + (1 × 2 ) + (1 × 2 ) = 32,767. FIGURE 4.3 The binary representation of the decimal integer –173 on a 16-bit computer using the signed magnitude method. 1000000010101101 Magnitude Signcha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 97 4.2 ROUNDOFF ERRORS 97 15 Note that this value can be simply evaluated as 2 − 1. Thus, a 16-bit computer word can store decimal integers ranging from −32,767 to 32,767. In addition, because zero is already defined as 0000000000000000, it is redundant to use the number 1000000000000000 to define a “minus zero.” Therefore, it is conven- tionally employed to represent an additional negative number: −32,768, and the range is n1 n1 from −32,768 to 32,767. For an n-bit word, the range would be from 2 to 2 − 1. Thus, 32-bit integers would range from −2,147,483,648 to +2,147,483,647. Note that, although it provides a nice way to illustrate our point, the signed magnitude method is not actually used to represent integers for conventional computers. A preferred approach called the 2s complement technique directly incorporates the sign into the number’s magnitude rather than providing a separate bit to represent plus or minus. Regardless, the range of numbers is still the same as for the signed magnitude method described above. The foregoing serves to illustrate how all digital computers are limited in their capability to represent integers. That is, numbers above or below the range cannot be represented. A more serious limitation is encountered in the storage and manipulation of fractional quanti- ties as described next. Floating-Point Representation. Fractional quantities are typically represented in com- puters using floating-point format. In this approach, which is very much like scientific notation, the number is expressed as e ±s × b where s = the significand (or mantissa), b = the base of the number system being used, and e = the exponent. Prior to being expressed in this form, the number is normalized by moving the decimal place over so that only one significant digit is to the left of the decimal point. This is done so computer memory is not wasted on storing useless nonsignificant zeros. For example, a value 0 like 0.005678 could be represented in a wasteful manner as 0.005678 × 10 . However, nor- −3 malization would yield 5.678 × 10 which eliminates the useless zeroes. Before describing the base-2 implementation used on computers, we will first ex- plore the fundamental implications of such floating-point representation. In particular, what are the ramifications of the fact that in order to be stored in the computer, both the mantissa and the exponent must be limited to a finite number of bits? As in the next example, a nice way to do this is within the context of our more familiar base-10 decimal world. EXAMPLE 4.2 Implications of Floating-Point Representation Problem Statement. Suppose that we had a hypothetical base-10 computer with a 5-digit word size. Assume that one digit is used for the sign, two for the exponent, and two for the mantissa. For simplicity, assume that one of the exponent digits is used for its sign, leaving a single digit for its magnitude. Solution. A general representation of the number following normalization would be s d 0 0 s d .d × 10 1 1 2cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 98 98 ROUNDOFF AND TRUNCATION ERRORS where s and s = the signs, d = the magnitude of the exponent, and d and d = the mag- 0 1 0 1 2 nitude of the significand digits. Now, let’s play with this system. First, what is the largest possible positive quantity that can be represented? Clearly, it would correspond to both signs being positive and all magnitude digits set to the largest possible value in base-10, that is, 9: +9 Largest value=+9.9 × 10 So the largest possible number would be a little less than 10 billion. Although this might seem like a big number, it’s really not that big. For example, this computer would be inca- 23 pable of representing a commonly used constant like Avogadro’s number (6.022 × 10 ). In the same sense, the smallest possible positive number would be −9 Smallest value=+1.0 × 10 Again, although this value might seem pretty small, you could not use it to represent a −34 quantity like Planck’s constant (6.626 × 10 J·s). Similar negative values could also be developed. The resulting ranges are displayed in Fig. 4.4. Large positive and negative numbers that fall outside the range would cause an overflow error. In a similar sense, for very small quantities there is a “hole” at zero, and very small quantities would usually be converted to zero. Recognize that the exponent overwhelmingly determines these range limitations. For example, if we increase the mantissa by one digit, the maximum value increases slightly to 9 9.99 × 10 . In contrast, a one-digit increase in the exponent raises the maximum by 90 orders 99 of magnitude to 9.9 × 10 When it comes to precision, however, the situation is reversed. Whereas the significand plays a minor role in defining the range, it has a profound effect on specifying the precision. This is dramatically illustrated for this example where we have limited the significand to only 2 digits. As in Fig. 4.5, just as there is a “hole” at zero, there are also “holes” between values. −5 For example, a simple rational number with a finite number of digits like 2 = 0.03125 −2 would have to be stored as 3.1 × 10 or 0.031. Thus, a roundoff error is introduced. For this case, it represents a relative error of 0.03125 − 0.031 = 0.008 0.03125 FIGURE 4.4 The number line showing the possible ranges corresponding to the hypothetical base-10 floating-point scheme described in Example 4.2. Minimum Smallest Maximum 9 9 9 9 9.9  10 1.0  10 1.0  10 9.9  10 Overflow Underflow Overflow “Hole” at zerocha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 99 4.2 ROUNDOFF ERRORS 99 0.01 0.1 0.98 0.99 1 1.1 1.2 FIGURE 4.5 A small portion of the number line corresponding to the hypothetical base-10 floating-point scheme described in Example 4.2. The numbers indicate values that can be represented exactly. All other quantities falling in the “holes” between these values would exhibit some roundoff error. While we could store a number like 0.03125 exactly by expanding the digits of the significand, quantities with infinite digits must always be approximated. For example, a 0 commonly used constant such as π (= 3.14159…) would have to be represented as 3.1 × 10 or 3.1. For this case, the relative error is 3.14159 − 3.1 = 0.0132 3.14159 Although adding significand digits can improve the approximation, such quantities will always have some roundoff error when stored in a computer. Another more subtle effect of floating-point representation is illustrated by Fig. 4.5. Notice how the interval between numbers increases as we move between orders of mag- nitude. For numbers with an exponent of −1 (i.e., between 0.1 and 1), the spacing is 0.01. Once we cross over into the range from 1 to 10, the spacing increases to 0.1. This means that the roundoff error of a number will be proportional to its magnitude. In addition, it means that the relative error will have an upper bound. For this example, the maximum relative error would be 0.05. This value is called the machine epsilon (or machine precision). As illustrated in Example 4.2, the fact that both the exponent and significand are finite means that there are both range and precision limits on floating-point representation. Now, let us examine how floating-point quantities are actually represented in a real computer using base-2 or binary numbers. First, let’s look at normalization. Since binary numbers consist exclusively of 0s and 1s, a bonus occurs when they are normalized. That is, the bit to the left of the binary point will always be one This means that this leading bit does not have to be stored. Hence, nonzero binary floating-point numbers can be expressed as e ±(1 + f ) × 2 where f = the mantissa (i.e., the fractional part of the significand). For example, if we nor- −3 −3 malized the binary number 1101.1, the result would be 1.1011 × (2) or (1 + 0.1011) × 2 .cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 100 100 ROUNDOFF AND TRUNCATION ERRORS Signed exponent Mantissa 11 bits 52 bits Sign (1 bit) FIGURE 4.6 The manner in which a floating-point number is stored in an 8-byte word in IEEE double- precision format. Thus, although the original number has five significant bits, we only have to store the four fractional bits: 0.1011. By default, MATLAB has adopted the IEEE double-precision format in which eight bytes (64 bits) are used to represent floating-point numbers. As in Fig. 4.6, one bit is re- served for the number’s sign. In a similar spirit to the way in which integers are stored, the exponent and its sign are stored in 11 bits. Finally, 52 bits are set aside for the mantissa. However, because of normalization, 53 bits can be stored. Now, just as in Example 4.2, this means that the numbers will have a limited range and precision. However, because the IEEE format uses many more bits, the resulting number system can be used for practical purposes. Range. In a fashion similar to the way in which integers are stored, the 11 bits used for the exponent translates into a range from −1022 to 1023. The largest positive number can be represented in binary as +1023 Largest value=+1.1111... 1111 × 2 where the 52 bits in the mantissa are all 1. Since the significand is approximately 2 (it is ac- −52 1024 308 tually 2 − 2 ), the largest value is therefore 2 = 1.7977 × 10 . In a similar fashion, the smallest positive number can be represented as −1022 Smallest value=+1.0000... 0000 × 2 –1022 –308 This value can be translated into a base-10 value of 2 = 2.2251 × 10 . Precision. The 52 bits used for the mantissa correspond to about 15 to 16 base-10 digits. Thus, π would be expressed as format long pi ans = 3.14159265358979 –52 –16 Note that the machine epsilon is 2 = 2.2204 × 10 .cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 101 4.2 ROUNDOFF ERRORS 101 MATLAB has a number of built-in functions related to its internal number representa- tion. For example, the realmax function displays the largest positive real number: format long realmax ans = 1.797693134862316e+308 Numbers occurring in computations that exceed this value create an overflow. In MATLAB they are set to infinity, inf. The realmin function displays the smallest positive real number: realmin ans = 2.225073858507201e-308 Numbers that are smaller than this value create an underflow and, in MATLAB, are set to zero. Finally, the eps function displays the machine epsilon: eps ans = 2.220446049250313e-016 4.2.2 Arithmetic Manipulations of Computer Numbers Aside from the limitations of a computer’s number system, the actual arithmetic manipula- tions involving these numbers can also result in roundoff error. To understand how this occurs, let’s look at how the computer performs simple addition and subtraction. Because of their familiarity, normalized base-10 numbers will be employed to illus- trate the effect of roundoff errors on simple addition and subtraction. Other number bases would behave in a similar fashion. To simplify the discussion, we will employ a hypothet- ical decimal computer with a 4-digit mantissa and a 1-digit exponent. When two floating-point numbers are added, the numbers are first expressed so that they have the same exponents. For example, if we want to add 1.557 + 0.04341, the com- 1 1 puter would express the numbers as 0.1557 × 10 + 0.004341 × 10 . Then the mantissas 1 are added to give 0.160041 × 10 . Now, because this hypothetical computer only carries a 1 4-digit mantissa, the excess number of digits get chopped off and the result is 0.1600 × 10 . Notice how the last two digits of the second number (41) that were shifted to the right have essentially been lost from the computation. Subtraction is performed identically to addition except that the sign of the subtrahend is reversed. For example, suppose that we are subtracting 26.86 from 36.41. That is, 2 0.3641 × 10 2 −0.2686 × 10 2 0.0955 × 10 For this case the result must be normalized because the leading zero is unnecessary. So 1 we must shift the decimal one place to the right to give 0.9550 × 10 = 9.550. Notice thatcha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 102 102 ROUNDOFF AND TRUNCATION ERRORS the zero added to the end of the mantissa is not significant but is merely appended to fill the empty space created by the shift. Even more dramatic results would be obtained when the numbers are very close as in 3 0.7642 × 10 3 −0.7641 × 10 3 0.0001 × 10 0 which would be converted to 0.1000 × 10 = 0.1000. Thus, for this case, three nonsignif- icant zeros are appended. The subtracting of two nearly equal numbers is called subtractive cancellation. It is the classic example of how the manner in which computers handle mathematics can lead to numerical problems. Other calculations that can cause problems include: Large Computations. Certain methods require extremely large numbers of arithmetic manipulations to arrive at their final results. In addition, these computations are often inter- dependent. That is, the later calculations are dependent on the results of earlier ones. Con- sequently, even though an individual roundoff error could be small, the cumulative effect over the course of a large computation can be significant. A very simple case involves sum- ming a round base-10 number that is not round in base-2. Suppose that the following M-file is constructed: function sout = sumdemo() s = 0; for i = 1:10000 s = s + 0.0001; end sout = s; When this function is executed, the result is format long sumdemo ans = 0.99999999999991 The format long command lets us see the 15 significant-digit representation used by MATLAB. You would expect that sum would be equal to 1. However, although 0.0001 is a nice round number in base-10, it cannot be expressed exactly in base-2. Thus, the sum comes out to be slightly different than 1. We should note that MATLAB has fea- tures that are designed to minimize such errors. For example, suppose that you form a vector as in format long s = 0:0.0001:1; For this case, rather than being equal to 0.99999999999991, the last entry will be exactly one as verified by s(10001) ans = 1cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 103 4.3 TRUNCATION ERRORS 103 Adding a Large and a Small Number. Suppose we add a small number, 0.0010, to a large number, 4000, using a hypothetical computer with the 4-digit mantissa and the 1-digit exponent. After modifying the smaller number so that its exponent matches the larger, 4 0.4000 × 10 4 0.0000001 × 10 4 0.4000001 × 10 4 which is chopped to 0.4000 × 10 . Thus, we might as well have not performed the addi- tion This type of error can occur in the computation of an infinite series. The initial terms in such series are often relatively large in comparison with the later terms. Thus, after a few terms have been added, we are in the situation of adding a small quantity to a large quan- tity. One way to mitigate this type of error is to sum the series in reverse order. In this way, each new term will be of comparable magnitude to the accumulated sum. Smearing. Smearing occurs whenever the individual terms in a summation are larger than the summation itself. One case where this occurs is in a series of mixed signs. Inner Products. As should be clear from the last sections, some infinite series are partic- ularly prone to roundoff error. Fortunately, the calculation of series is not one of the more common operations in numerical methods. A far more ubiquitous manipulation is the cal- culation of inner products as in n  x y = x y + x y +···+ x y i i 1 1 2 2 n n i=1 This operation is very common, particularly in the solution of simultaneous linear algebraic equations. Such summations are prone to roundofferror. Consequently, it is often desirable to compute such summations in double precision as is done automatically in MATLAB. 4.3 TRUNCATION ERRORS Truncation errors are those that result from using an approximation in place of an exact mathematical procedure. For example, in Chap. 1 we approximated the derivative of veloc- ity of a bungee jumper by a finite-difference equation of the form Eq. (1.11) dv v v(t ) − v(t ) i+1 i ∼ = = (4.8) dt t t − t i+1 i A truncation error was introduced into the numerical solution because the difference equa- tion only approximates the true value of the derivative (recall Fig. 1.3). To gain insight into the properties of such errors, we now turn to a mathematical formulation that is used widely in numerical methods to express functions in an approximate fashion—the Taylor series. 4.3.1 The Taylor Series Taylor’s theorem and its associated formula, the Taylor series, is of great value in the study of numerical methods. In essence, the Taylor theorem states that any smooth function can be approximated as a polynomial. The Taylor series then provides a means to express this idea mathematically in a form that can be used to generate practical results.True Second order First order cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 104 104 ROUNDOFF AND TRUNCATION ERRORS f(x) f(x ) i Zero order f(x )  f(x ) i1 i 1.0 f(x )  f(x )  f(x )h i1 i i 0.5 f (x ) i 2 f(x )  f(x )  f(x )h  h i1 i i 2 f(x ) i1 0 x x  1 x  0 i1 i h FIGURE 4.7 4 3 2 The approximation of f (x) =−0.1x − 0.15x − 0.5x − 0.25x + 1.2 at x = 1 by zero-order, first-order, and second-order Taylor series expansions. A useful way to gain insight into the Taylor series is to build it term by term. A good problem context for this exercise is to predict a function value at one point in terms of the function value and its derivatives at another point. Suppose that you are blindfolded and taken to a location on the side of a hill facing downslope (Fig. 4.7). We’ll call your horizontal location x and your vertical distance with i respect to the base of the hill f (x ). You are given the task of predicting the height at a i position x , which is a distance h away from you. i+1 At first, you are placed on a platform that is completely horizontal so that you have no idea that the hill is sloping down away from you. At this point, what would be your best guess at the height at x ? If you think about it (remember you have no idea whatsoever i+1 what’s in front of you), the best guess would be the same height as where you’re standing now You could express this prediction mathematically as ∼ f (x ) f (x ) = (4.9) i+1 i This relationship, which is called the zero-order approximation, indicates that the value of f at the new point is the same as the value at the old point. This result makes intuitive sense because if x and x are close to each other, it is likely that the new value is probably sim- i i+1 ilar to the old value. Equation (4.9) provides a perfect estimate if the function being approximated is, in fact, a constant. For our problem, you would be right only if you happened to be standing on a perfectly flat plateau. However, if the function changes at all over the interval, addi- tional terms of the Taylor series are required to provide a better estimate. So now you are allowed to get off the platform and stand on the hill surface with one leg positioned in front of you and the other behind. You immediately sense that the frontcha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 105 4.3 TRUNCATION ERRORS 105 foot is lower than the back foot. In fact, you’re allowed to obtain a quantitative estimate of the slope by measuring the difference in elevation and dividing it by the distance between your feet. With this additional information, you’re clearly in a better position to predict the height at f (x ). In essence, you use the slope estimate to project a straight line out to i+1 x . You can express this prediction mathematically by i+1  ∼ f (x ) = f (x ) + f (x )h (4.10) i+1 i i This is called a first-order approximation because the additional first-order term consists of  a slope f (x ) multiplied by h, the distance between x and x . Thus, the expression is i i i+1 now in the form of a straight line that is capable of predicting an increase or decrease of the function between x and x . i i+1 Although Eq. (4.10) can predict a change, it is only exact for a straight-line, or linear, trend. To get a better prediction, we need to add more terms to our equation. So now you are allowed to stand on the hill surface and take two measurements. First, you measure the slope behind you by keeping one foot planted at x and moving the other one back a dis- i  tance x. Let’s call this slope f (x ). Then you measure the slope in front of you by keep- i b ing one foot planted at x and moving the other one forward x. Let’s call this slope i  f (x ). You immediately recognize that the slope behind is milder than the one in front. i f Clearly the drop in height is “accelerating” downward in front of you. Thus, the odds are that f (x ) is even lower than your previous linear prediction. i As you might expect, you’re now going to add a second-order term to your equation and make it into a parabola. The Taylor series provides the correct way to do this as in  f (x ) i  2 ∼ f (x ) = f (x ) + f (x )h + h (4.11) i+1 i i 2 To make use of this formula, you need an estimate of the second derivative. You can use the last two slopes you determined to estimate it as   f (x ) − f (x ) i i f b  ∼ f (x ) = (4.12) i+1 x Thus, the second derivative is merely a derivative of a derivative; in this case, the rate of change of the slope. Before proceeding, let’s look carefully at Eq. (4.11). Recognize that all the values subscripted i represent values that you have estimated. That is, they are numbers. Conse- quently, the only unknowns are the values at the prediction position x . Thus, it is a qua- i+1 dratic equation of the form 2 ∼ f (h) = a h + a h + a 2 1 0 Thus, we can see that the second-order Taylor series approximates the function with a second- order polynomial. Clearly, we could keep adding more derivatives to capture more of the function’s cur- vature. Thus, we arrive at the complete Taylor series expansion  (3) (n) f (x ) f (x ) f (x ) i i i  2 3 n f (x ) = f (x ) + f (x )h + h + h +···+ h + R (4.13) i+1 i i n 2 3 ncha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 106 106 ROUNDOFF AND TRUNCATION ERRORS Note that because Eq. (4.13) is an infinite series, an equal sign replaces the approximate sign that was used in Eqs. (4.9) through (4.11). A remainder term is also included to account for all terms from n + 1 to infinity: (n+1) f (ξ) n+1 R = h (4.14) n (n + 1) where the subscript n connotes that this is the remainder for the nth-order approximation and ξ is a value of x that lies somewhere between x and x . i i+1 We can now see why the Taylor theorem states that any smooth function can be ap- proximated as a polynomial and that the Taylor series provides a means to express this idea mathematically. In general, the nth-order Taylor series expansion will be exact for an nth-order poly- nomial. For other differentiable and continuous functions, such as exponentials and sinu- soids, a finite number of terms will not yield an exact estimate. Each additional term will contribute some improvement, however slight, to the approximation. This behavior will be demonstrated in Example 4.3. Only if an infinite number of terms are added will the series yield an exact result. Although the foregoing is true, the practical value of Taylor series expansions is that, in most cases, the inclusion of only a few terms will result in an approximation that is close enough to the true value for practical purposes. The assessment of how many terms are required to get “close enough” is based on the remainder term of the expansion (Eq. 4.14). This relationship has two major drawbacks. First, ξ is not known exactly but merely lies somewhere between x and x . Second, to evaluate Eq. (4.14), we need to determine the i i+1 (n + 1)th derivative of f (x). To do this, we need to know f (x). However, if we knew f (x), there would be no need to perform the Taylor series expansion in the present context Despite this dilemma, Eq. (4.14) is still useful for gaining insight into truncation errors. This is because we do have control over the term h in the equation. In other words, we can choose how far away from x we want to evaluate f (x), and we can control the num- ber of terms we include in the expansion. Consequently, Eq. (4.14) is often expressed as n+1 R = O(h ) n n+1 n+1 where the nomenclature O(h ) means that the truncation error is of the order of h . That is, the error is proportional to the step size h raised to the (n + 1)th power. Although this approximation implies nothing regarding the magnitude of the derivatives that multi- n+1 ply h , it is extremely useful in judging the comparative error of numerical methods based on Taylor series expansions. For example, if the error is O(h), halving the step size 2 will halve the error. On the other hand, if the error is O(h ), halving the step size will quar- ter the error. In general, we can usually assume that the truncation error is decreased by the addition of terms to the Taylor series. In many cases, if h is sufficiently small, the first- and other lower-order terms usually account for a disproportionately high percent of the error. Thus, only a few terms are required to obtain an adequate approximation. This property is illus- trated by the following example.cha01102_ch04_088-122.qxd 12/17/10 8:00 AM Page 107 4.3 TRUNCATION ERRORS 107 EXAMPLE 4.3 Approximation of a Function with a Taylor Series Expansion Problem Statement. Use Taylor series expansions with n = 0 to 6 to approximate f (x) = cos x at x = π/3 on the basis of the value of f (x) and its derivatives at i+1 x = π/4. Note that this means that h = π/3 − π/4 = π/12. i Solution. Our knowledge of the true function allows us to determine the correct value f (π/3) = 0.5. The zero-order approximation is Eq. (4.9)     π π ∼ f = cos = 0.707106781 3 4 which represents a percent relative error of     0.5 − 0.707106781   ε = 100% = 41.4% t   0.5  For the first-order approximation, we add the first derivative term where f (x)=−sin x:        π π π π ∼ f = cos − sin = 0.521986659 3 4 4 12 which has ε= 4.40%. For the second-order approximation, we add the second deriva- t  tive term where f (x)=− cos x:          2 π π π π cos(π/4) π ∼ f = cos − sin − = 0.497754491 3 4 4 12 2 12 with ε= 0.449%. Thus, the inclusion of additional terms results in an improved esti- t mate. The process can be continued and the results listed as in (n) Order nf (x) f (π/3) ε t 0 cos x 0.707106781 41.4 1 −sin x 0.521986659 4.40 2 −cos x 0.497754491 0.449 −2 3 sin x 0.499869147 2.62 × 10 −3 4 cos x 0.500007551 1.51 × 10 −5 5 −sin x 0.500000304 6.08 × 10 −6 6 −cos x 0.499999988 2.44 × 10 Notice that the derivatives never go to zero as would be the case for a polynomial. Therefore, each additional term results in some improvement in the estimate. However, also notice how most of the improvement comes with the initial terms. For this case, by the time we have added the third-order term, the error is reduced to 0.026%, which means that we have attained 99.974% of the true value. Consequently, although the addition of more terms will reduce the error further, the improvement becomes negligible.

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.