Lecture Notes: Linear Regression Analysis

14 Linear Regression

CHAPTER OBJECTIVES

The primary objective of this chapter is to introduce you to how least-squares regression can be used to fit a straight line to measured data. Specific objectives and topics covered are

• Familiarizing yourself with some basic descriptive statistics and the normal distribution.
• Knowing how to compute the slope and intercept of a best-fit straight line with linear regression.
• Knowing how to generate random numbers with MATLAB and how they can be employed for Monte Carlo simulations.
• Knowing how to compute and understand the meaning of the coefficient of determination and the standard error of the estimate.
• Understanding how to use transformations to linearize nonlinear equations so that they can be fit with linear regression.
• Knowing how to implement linear regression with MATLAB.

YOU'VE GOT A PROBLEM

In Chap. 1, we noted that a free-falling object such as a bungee jumper is subject to the upward force of air resistance. As a first approximation, we assumed that this force was proportional to the square of velocity, as in

F_U = c_d v^2    (14.1)

where F_U = the upward force of air resistance (N = kg m/s^2), c_d = a drag coefficient (kg/m), and v = velocity (m/s).

Expressions such as Eq. (14.1) come from the field of fluid mechanics. Although such relationships derive in part from theory, experiments play a critical role in their formulation. One such experiment is depicted in Fig. 14.1. An individual is suspended in a wind tunnel (any volunteers?) and the force measured for various levels of wind velocity. The result might be as listed in Table 14.1.

FIGURE 14.1 Wind tunnel experiment to measure how the force of air resistance depends on velocity.

TABLE 14.1 Experimental data for force (N) and velocity (m/s) from a wind tunnel experiment.

v, m/s   10   20   30   40   50   60   70   80
F, N     25   70  380  550  610 1220  830 1450

The relationship can be visualized by plotting force versus velocity. As in Fig. 14.2, several features of the relationship bear mention. First, the points indicate that the force increases as velocity increases. Second, the points do not increase smoothly, but exhibit rather significant scatter, particularly at the higher velocities. Finally, although it may not be obvious, the relationship between force and velocity may not be linear. This conclusion becomes more apparent if we assume that force is zero for zero velocity.

FIGURE 14.2 Plot of force versus wind velocity for an object suspended in a wind tunnel.

In Chaps. 14 and 15, we will explore how to fit a "best" line or curve to such data. In so doing, we will illustrate how relationships like Eq. (14.1) arise from experimental data.
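Before moving on, it can help to reproduce Fig. 14.2 yourself. The following is a minimal MATLAB sketch (our addition, not part of the original chapter) that enters the Table 14.1 data and plots force against velocity; the variable names v and F are our own choices:

% Wind tunnel data from Table 14.1
v = [10 20 30 40 50 60 70 80];          % velocity, m/s
F = [25 70 380 550 610 1220 830 1450];  % force, N

plot(v, F, 'o')                         % scatter plot of the raw measurements
xlabel('v, m/s'), ylabel('F, N')
title('Force versus velocity (Table 14.1)')

The scatter and the apparent curvature noted above are immediately visible in the resulting plot.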
14.1 STATISTICS REVIEW

Before describing least-squares regression, we will first review some basic concepts from the field of statistics. These include the mean, standard deviation, residual sum of the squares, and the normal distribution. In addition, we describe how simple descriptive statistics and distributions can be generated in MATLAB. If you are familiar with these subjects, feel free to skip the following pages and proceed directly to Section 14.2. If you are unfamiliar with these concepts or are in need of a review, the following material is designed as a brief introduction.

14.1.1 Descriptive Statistics

Suppose that in the course of an engineering study, several measurements were made of a particular quantity. For example, Table 14.2 contains 24 readings of the coefficient of thermal expansion of a structural steel. Taken at face value, the data provide a limited amount of information—that is, that the values range from a minimum of 6.395 to a maximum of 6.775. Additional insight can be gained by summarizing the data in one or more well-chosen statistics that convey as much information as possible about specific characteristics of the data set. These descriptive statistics are most often selected to represent (1) the location of the center of the distribution of the data and (2) the degree of spread of the data set.

TABLE 14.2 Measurements of the coefficient of thermal expansion of structural steel.

6.495   6.595   6.615   6.635   6.485   6.555
6.665   6.505   6.435   6.625   6.715   6.655
6.755   6.625   6.715   6.575   6.655   6.605
6.565   6.515   6.555   6.395   6.775   6.685

Measure of Location. The most common measure of central tendency is the arithmetic mean. The arithmetic mean (ȳ) of a sample is defined as the sum of the individual data points (y_i) divided by the number of points (n), or

\bar{y} = \frac{\sum y_i}{n}    (14.2)

where the summation (and all the succeeding summations in this section) is from i = 1 through n.

There are several alternatives to the arithmetic mean. The median is the midpoint of a group of data. It is calculated by first putting the data in ascending order. If the number of measurements is odd, the median is the middle value. If the number is even, it is the arithmetic mean of the two middle values. The median is sometimes called the 50th percentile.

The mode is the value that occurs most frequently. The concept usually has direct utility only when dealing with discrete or coarsely rounded data. For continuous variables such as the data in Table 14.2, the concept is not very practical. For example, there are actually four modes for these data: 6.555, 6.625, 6.655, and 6.715, which all occur twice. If the numbers had not been rounded to 3 decimal digits, it would be unlikely that any of the values would even have repeated twice. However, if continuous data are grouped into equispaced intervals, the mode can be an informative statistic. We will return to the mode when we describe histograms later in this section.

Measures of Spread. The simplest measure of spread is the range, the difference between the largest and the smallest value. Although it is certainly easy to determine, it is not considered a very reliable measure because it is highly sensitive to the sample size and is very sensitive to extreme values.

The most common measure of spread for a sample is the standard deviation (s_y) about the mean:

s_y = \sqrt{\frac{S_t}{n - 1}}    (14.3)

where S_t is the total sum of the squares of the residuals between the data points and the mean, or

S_t = \sum (y_i - \bar{y})^2    (14.4)

Thus, if the individual measurements are spread out widely around the mean, S_t (and, consequently, s_y) will be large. If they are grouped tightly, the standard deviation will be small. The spread can also be represented by the square of the standard deviation, which is called the variance:

s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}    (14.5)

Note that the denominator in both Eqs. (14.3) and (14.5) is n − 1. The quantity n − 1 is referred to as the degrees of freedom. Hence S_t and s_y are said to be based on n − 1 degrees of freedom.
This nomenclature derives from the fact that the sum of the quantities upon which S_t is based (i.e., \bar{y} - y_1, \bar{y} - y_2, ..., \bar{y} - y_n) is zero. Consequently, if \bar{y} is known and n − 1 of the values are specified, the remaining value is fixed. Thus, only n − 1 of the values are said to be freely determined. Another justification for dividing by n − 1 is the fact that there is no such thing as the spread of a single data point. For the case where n = 1, Eqs. (14.3) and (14.5) yield a meaningless result of infinity.

We should note that an alternative, more convenient formula is available to compute the variance:

s_y^2 = \frac{\sum y_i^2 - \left(\sum y_i\right)^2 / n}{n - 1}    (14.6)

This version does not require precomputation of \bar{y} and yields an identical result as Eq. (14.5).

A final statistic that has utility in quantifying the spread of data is the coefficient of variation (c.v.). This statistic is the ratio of the standard deviation to the mean. As such, it provides a normalized measure of the spread. It is often multiplied by 100 so that it can be expressed in the form of a percent:

c.v. = \frac{s_y}{\bar{y}} \times 100\%    (14.7)

EXAMPLE 14.1 Simple Statistics of a Sample

Problem Statement. Compute the mean, median, variance, standard deviation, and coefficient of variation for the data in Table 14.2.

Solution. The data can be assembled in tabular form and the necessary sums computed as in Table 14.3.

TABLE 14.3 Data and summations for computing simple descriptive statistics for the coefficients of thermal expansion from Table 14.2.

  i      y_i     (y_i − ȳ)^2      y_i^2
  1     6.395      0.04203       40.896
  2     6.435      0.02723       41.409
  3     6.485      0.01323       42.055
  4     6.495      0.01103       42.185
  5     6.505      0.00903       42.315
  6     6.515      0.00723       42.445
  7     6.555      0.00203       42.968
  8     6.555      0.00203       42.968
  9     6.565      0.00123       43.099
 10     6.575      0.00063       43.231
 11     6.595      0.00003       43.494
 12     6.605      0.00002       43.626
 13     6.615      0.00022       43.758
 14     6.625      0.00062       43.891
 15     6.625      0.00062       43.891
 16     6.635      0.00122       44.023
 17     6.655      0.00302       44.289
 18     6.655      0.00302       44.289
 19     6.665      0.00422       44.422
 20     6.685      0.00722       44.689
 21     6.715      0.01322       45.091
 22     6.715      0.01322       45.091
 23     6.755      0.02402       45.630
 24     6.775      0.03062       45.901
  Σ   158.400      0.21700     1045.657

The mean can be computed as [Eq. (14.2)]

\bar{y} = \frac{158.4}{24} = 6.6

Because there are an even number of values, the median is computed as the arithmetic mean of the middle two values: (6.605 + 6.615)/2 = 6.61.

As in Table 14.3, the sum of the squares of the residuals is 0.217000, which can be used to compute the standard deviation [Eq. (14.3)]:

s_y = \sqrt{\frac{0.217000}{24 - 1}} = 0.097133

the variance [Eq. (14.5)]:

s_y^2 = (0.097133)^2 = 0.009435

and the coefficient of variation [Eq. (14.7)]:

c.v. = \frac{0.097133}{6.6} \times 100\% = 1.47\%

The validity of Eq. (14.6) can also be verified by computing

s_y^2 = \frac{1045.657 - (158.400)^2/24}{24 - 1} = 0.009435
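As a check on Example 14.1, the following MATLAB sketch (our addition, not from the original notes) implements Eqs. (14.2) through (14.7) directly rather than relying on built-in functions; it assumes the Table 14.2 readings are placed in a column vector s:

% Table 14.2 data (coefficient of thermal expansion)
s = [6.495 6.595 6.615 6.635 6.485 6.555 6.665 6.505 6.435 6.625 6.715 6.655 ...
     6.755 6.625 6.715 6.575 6.655 6.605 6.565 6.515 6.555 6.395 6.775 6.685]';

n    = length(s);
ybar = sum(s)/n;                          % Eq. (14.2): arithmetic mean
St   = sum((s - ybar).^2);                % Eq. (14.4): sum of squared residuals
sy   = sqrt(St/(n - 1));                  % Eq. (14.3): standard deviation
sy2  = (sum(s.^2) - sum(s)^2/n)/(n - 1);  % Eq. (14.6): variance, alternative form
cv   = sy/ybar*100;                       % Eq. (14.7): coefficient of variation, %

Running this reproduces the values from Example 14.1: ybar = 6.6, St = 0.217, sy = 0.097133, sy2 = 0.009435, and cv = 1.47.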
14.1.2 The Normal Distribution

Another characteristic that bears on the present discussion is the data distribution—that is, the shape with which the data are spread around the mean. A histogram provides a simple visual representation of the distribution. A histogram is constructed by sorting the measurements into intervals, or bins. The units of measurement are plotted on the abscissa and the frequency of occurrence of each interval is plotted on the ordinate.

As an example, a histogram can be created for the data from Table 14.2. The result (Fig. 14.3) suggests that most of the data are grouped close to the mean value of 6.6.

FIGURE 14.3 A histogram used to depict the distribution of data. As the number of data points increases, the histogram often approaches the smooth, bell-shaped curve called the normal distribution.

Notice also that, now that we have grouped the data, we can see that the bin with the most values is from 6.6 to 6.64. Although we could say that the mode is the midpoint of this bin, 6.62, it is more common to report the most frequent range as the modal class interval.

If we have a very large set of data, the histogram often can be approximated by a smooth curve. The symmetric, bell-shaped curve superimposed on Fig. 14.3 is one such characteristic shape—the normal distribution. Given enough additional measurements, the histogram for this particular case could eventually approach the normal distribution.

The concepts of the mean, standard deviation, residual sum of the squares, and normal distribution all have great relevance to engineering and science. A very simple example is their use to quantify the confidence that can be ascribed to a particular measurement. If a quantity is normally distributed, the range defined by \bar{y} - s_y to \bar{y} + s_y will encompass approximately 68% of the total measurements. Similarly, the range defined by \bar{y} - 2s_y to \bar{y} + 2s_y will encompass approximately 95%.

For example, for the data in Table 14.2, we calculated in Example 14.1 that \bar{y} = 6.6 and s_y = 0.097133. Based on our analysis, we can tentatively make the statement that approximately 95% of the readings should fall between 6.405734 and 6.794266. Because it is so far outside these bounds, if someone told us that they had measured a value of 7.35, we would suspect that the measurement might be erroneous.

14.1.3 Descriptive Statistics in MATLAB

Standard MATLAB has several functions to compute descriptive statistics. (MATLAB also offers a Statistics Toolbox that provides a wide range of common statistical tasks, from random number generation, to curve fitting, to design of experiments and statistical process control.) For example, the arithmetic mean is computed as mean(x). If x is a vector, the function returns the mean of the vector's values. If it is a matrix, it returns a row vector containing the arithmetic mean of each column of x. The following is the result of using mean and the other statistical functions to analyze a column vector s that holds the data from Table 14.2:

format short g
mean(s), median(s), mode(s)
ans =
        6.6
ans =
       6.61
ans =
      6.555
min(s), max(s)
ans =
      6.395
ans =
      6.775
range = max(s) - min(s)
range =
       0.38
var(s), std(s)
ans =
  0.0094348
ans =
   0.097133

These results are consistent with those obtained previously in Example 14.1. Note that although there are four values that occur twice, the mode function only returns the first of the values: 6.555.
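These built-in functions also make it easy to check the normal-range argument from Section 14.1.2. The following snippet (our addition; it again assumes the data are in s) counts the fraction of readings within two standard deviations of the mean:

% Fraction of the Table 14.2 readings falling within ybar +/- 2*sy
ybar = mean(s);  sy = std(s);
frac = sum(abs(s - ybar) <= 2*sy) / length(s)

For these 24 readings it returns 23/24 ≈ 0.96 (only the low reading 6.395 falls outside), consistent with the roughly 95% expected for a normal distribution.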
MATLAB can also be used to generate a histogram based on the hist function. The hist function has the syntax

[n, x] = hist(y, x)

where n = the number of elements in each bin, x = a vector specifying the midpoint of each bin, and y is the vector being analyzed. For the data from Table 14.2, the result is

[n, x] = hist(s)
n =
     1     1     3     1     4     3     5     2     2     2
x =
    6.414   6.452   6.49   6.528   6.566   6.604   6.642   6.68   6.718   6.756

The resulting histogram depicted in Fig. 14.4 is similar to the one we generated by hand in Fig. 14.3. Note that all the arguments and outputs with the exception of y are optional. For example, hist(y) without output arguments just produces a histogram bar plot with 10 bins determined automatically based on the range of values in y.

FIGURE 14.4 Histogram generated with the MATLAB hist function.

14.2 RANDOM NUMBERS AND SIMULATION

In this section, we will describe two MATLAB functions that can be used to produce a sequence of random numbers. The first (rand) generates numbers that are uniformly distributed, and the second (randn) generates numbers that have a normal distribution.

14.2.1 MATLAB Function: rand

This function generates a sequence of numbers that are uniformly distributed between 0 and 1. A simple representation of its syntax is

r = rand(m, n)

where r = an m-by-n matrix of random numbers. The following formula can then be used to generate a uniform distribution on another interval:

runiform = low + (up - low) * rand(m, n)

where low = the lower bound and up = the upper bound.

EXAMPLE 14.2 Generating Uniform Random Values of Drag

Problem Statement. If the initial velocity is zero, the downward velocity of the free-falling bungee jumper can be predicted with the following analytical solution (Eq. 1.9):

v = \sqrt{\frac{gm}{c_d}} \tanh\left(\sqrt{\frac{g c_d}{m}}\, t\right)

Suppose that g = 9.81 m/s^2 and m = 68.1 kg, but c_d is not known precisely. For example, you might know that it varies uniformly between 0.225 and 0.275 (i.e., ±10% around a mean value of 0.25 kg/m). Use the rand function to generate 1000 random uniformly distributed values of c_d and then employ these values along with the analytical solution to compute the resulting distribution of velocities at t = 4 s.

Solution. Before generating the random numbers, we can first compute the mean velocity:

v_{mean} = \sqrt{\frac{9.81(68.1)}{0.25}} \tanh\left(\sqrt{\frac{9.81(0.25)}{68.1}}\, 4\right) = 33.1118 m/s

We can also generate the range:

v_{low} = \sqrt{\frac{9.81(68.1)}{0.275}} \tanh\left(\sqrt{\frac{9.81(0.275)}{68.1}}\, 4\right) = 32.6223 m/s

v_{high} = \sqrt{\frac{9.81(68.1)}{0.225}} \tanh\left(\sqrt{\frac{9.81(0.225)}{68.1}}\, 4\right) = 33.6198 m/s

Thus, we can see that the velocity varies by

\Delta v = \frac{33.6198 - 32.6223}{2(33.1118)} \times 100\% = 1.5063\%

The following script generates the random values for c_d, along with their mean, standard deviation, percent variation, and a histogram:

clc, format short g
n=1000; t=4; m=68.1; g=9.81;
cd=0.25; cdmin=cd-0.025, cdmax=cd+0.025
r=rand(n,1);
cdrand=cdmin+(cdmax-cdmin)*r;
meancd=mean(cdrand), stdcd=std(cdrand)
Deltacd=(max(cdrand)-min(cdrand))/meancd/2*100
subplot(2,1,1)
hist(cdrand), title('(a) Distribution of drag')
xlabel('cd (kg/m)')

The results are

meancd =
    0.25018
stdcd =
   0.014528
Deltacd =
     9.9762

These results, as well as the histogram (Fig. 14.5a), indicate that rand has yielded 1000 uniformly distributed values with the desired mean value and range. The values can then be employed along with the analytical solution to compute the resulting distribution of velocities at t = 4 s.

vrand=sqrt(g*m./cdrand).*tanh(sqrt(g*cdrand/m)*t);
meanv=mean(vrand)
Deltav=(max(vrand)-min(vrand))/meanv/2*100
subplot(2,1,2)
hist(vrand), title('(b) Distribution of velocity')
xlabel('v (m/s)')

FIGURE 14.5 Histograms of (a) uniformly distributed drag coefficients and (b) the resulting distribution of velocity.
The results are

meanv =
    33.1151
Deltav =
     1.5048

These results, as well as the histogram (Fig. 14.5b), closely conform to our hand calculations.

The foregoing example is formally referred to as a Monte Carlo simulation. The term, which is a reference to Monaco's Monte Carlo casino, was first used by physicists working on nuclear weapons projects in the 1940s. Although it yields intuitive results for this simple example, there are instances where such computer simulations yield surprising outcomes and provide insights that would otherwise be impossible to determine. The approach is feasible only because of the computer's ability to implement tedious, repetitive computations in an efficient manner.

14.2.2 MATLAB Function: randn

This function generates a sequence of numbers that are normally distributed with a mean of 0 and a standard deviation of 1. A simple representation of its syntax is

r = randn(m, n)

where r = an m-by-n matrix of random numbers. The following formula can then be used to generate a normal distribution with a different mean (mn) and standard deviation (s):

rnormal = mn + s * randn(m, n)

EXAMPLE 14.3 Generating Normally-Distributed Random Values of Drag

Problem Statement. Analyze the same case as in Example 14.2, but rather than employing a uniform distribution, generate normally-distributed drag coefficients with a mean of 0.25 and a standard deviation of 0.01443.

Solution. The following script generates the random values for c_d, along with their mean, standard deviation, coefficient of variation (expressed as a %), and a histogram:

clc, format short g
n=1000; t=4; m=68.1; g=9.81;
cd=0.25;
stdev=0.01443;
r=randn(n,1);
cdrand=cd+stdev*r;
meancd=mean(cdrand), stdevcd=std(cdrand)
cvcd=stdevcd/meancd*100
subplot(2,1,1)
hist(cdrand), title('(a) Distribution of drag')
xlabel('cd (kg/m)')

The results are

meancd =
    0.24988
stdevcd =
   0.014465
cvcd =
     5.7887

These results, as well as the histogram (Fig. 14.6a), indicate that randn has yielded 1000 normally distributed values with the desired mean, standard deviation, and coefficient of variation. The values can then be employed along with the analytical solution to compute the resulting distribution of velocities at t = 4 s.

vrand=sqrt(g*m./cdrand).*tanh(sqrt(g*cdrand/m)*t);
meanv=mean(vrand), stdevv=std(vrand)
cvv=stdevv/meanv*100
subplot(2,1,2)
hist(vrand), title('(b) Distribution of velocity')
xlabel('v (m/s)')

The results are

meanv =
     33.117
stdevv =
    0.28839
cvv =
     0.8708

These results, as well as the histogram (Fig. 14.6b), indicate that the velocities are also normally distributed with a mean that is close to the value that would be computed using the mean drag and the analytical solution. In addition, we compute the associated standard deviation, which corresponds to a coefficient of variation of ±0.8708%.

FIGURE 14.6 Histograms of (a) normally-distributed drag coefficients and (b) the resulting distribution of velocity.
Although simple, the foregoing examples illustrate how random numbers can be easily generated within MATLAB. We will explore additional applications in the end-of-chapter problems.

14.3 LINEAR LEAST-SQUARES REGRESSION

Where substantial error is associated with data, the best curve-fitting strategy is to derive an approximating function that fits the shape or general trend of the data without necessarily matching the individual points. One approach to do this is to visually inspect the plotted data and then sketch a "best" line through the points. Although such "eyeball" approaches have commonsense appeal and are valid for "back-of-the-envelope" calculations, they are deficient because they are arbitrary. That is, unless the points define a perfect straight line (in which case, interpolation would be appropriate), different analysts would draw different lines.

To remove this subjectivity, some criterion must be devised to establish a basis for the fit. One way to do this is to derive a curve that minimizes the discrepancy between the data points and the curve. To do this, we must first quantify the discrepancy. The simplest example is fitting a straight line to a set of paired observations: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). The mathematical expression for the straight line is

y = a_0 + a_1 x + e    (14.8)

where a_0 and a_1 are coefficients representing the intercept and the slope, respectively, and e is the error, or residual, between the model and the observations, which can be represented by rearranging Eq. (14.8) as

e = y - a_0 - a_1 x    (14.9)

Thus, the residual is the discrepancy between the true value of y and the approximate value, a_0 + a_1 x, predicted by the linear equation.

14.3.1 Criteria for a "Best" Fit

One strategy for fitting a "best" line through the data would be to minimize the sum of the residual errors for all the available data, as in

\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)    (14.10)

where n = total number of points. However, this is an inadequate criterion, as illustrated by Fig. 14.7a, which depicts the fit of a straight line to two points. Obviously, the best fit is the line connecting the points. However, any straight line passing through the midpoint of the connecting line (except a perfectly vertical line) results in a minimum value of Eq. (14.10) equal to zero because positive and negative errors cancel.

One way to remove the effect of the signs might be to minimize the sum of the absolute values of the discrepancies, as in

\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} |y_i - a_0 - a_1 x_i|    (14.11)

Figure 14.7b demonstrates why this criterion is also inadequate. For the four points shown, any straight line falling within the dashed lines will minimize the sum of the absolute values of the residuals. Thus, this criterion also does not yield a unique best fit.

FIGURE 14.7 Examples of some criteria for "best fit" that are inadequate for regression: (a) minimizes the sum of the residuals, (b) minimizes the sum of the absolute values of the residuals, and (c) minimizes the maximum error of any individual point.
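To make the sign-cancellation problem of Eq. (14.10) concrete, the short sketch below (our illustration, not from the original notes) compares two candidate lines through the two points (1, 1) and (2, 2): the exact line y = x and the horizontal line y = 1.5 through the midpoint. Both drive the sum of the residuals to zero, so the criterion cannot distinguish them:

x = [1 2];  y = [1 2];          % two data points

% Candidate 1: the line connecting the points (a0 = 0, a1 = 1)
e1 = y - (0 + 1*x);
sum(e1)                         % returns 0

% Candidate 2: a horizontal line through the midpoint (a0 = 1.5, a1 = 0)
e2 = y - (1.5 + 0*x);
sum(e2)                         % also returns 0, even though the fit is worse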
A third strategy for fitting a best line is the minimax criterion. In this technique, the line is chosen that minimizes the maximum distance that an individual point falls from the line. As depicted in Fig. 14.7c, this strategy is ill-suited for regression because it gives undue influence to an outlier—that is, a single point with a large error. It should be noted that the minimax principle is sometimes well-suited for fitting a simple function to a complicated function (Carnahan, Luther, and Wilkes, 1969).

A strategy that overcomes the shortcomings of the aforementioned approaches is to minimize the sum of the squares of the residuals:

S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2    (14.12)

This criterion, which is called least squares, has a number of advantages, including that it yields a unique line for a given set of data. Before discussing these properties, we will present a technique for determining the values of a_0 and a_1 that minimize Eq. (14.12).

14.3.2 Least-Squares Fit of a Straight Line

To determine values for a_0 and a_1, Eq. (14.12) is differentiated with respect to each unknown coefficient:

\frac{\partial S_r}{\partial a_0} = -2 \sum (y_i - a_0 - a_1 x_i)

\frac{\partial S_r}{\partial a_1} = -2 \sum (y_i - a_0 - a_1 x_i) x_i

Note that we have simplified the summation symbols; unless otherwise indicated, all summations are from i = 1 to n. Setting these derivatives equal to zero will result in a minimum S_r. If this is done, the equations can be expressed as

0 = \sum y_i - \sum a_0 - \sum a_1 x_i

0 = \sum x_i y_i - \sum a_0 x_i - \sum a_1 x_i^2

Now, realizing that \sum a_0 = n a_0, we can express the equations as a set of two simultaneous linear equations with two unknowns (a_0 and a_1):

n a_0 + \left(\sum x_i\right) a_1 = \sum y_i    (14.13)

\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 = \sum x_i y_i    (14.14)

These are called the normal equations. They can be solved simultaneously for

a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}    (14.15)

This result can then be used in conjunction with Eq. (14.13) to solve for

a_0 = \bar{y} - a_1 \bar{x}    (14.16)

where \bar{y} and \bar{x} are the means of y and x, respectively.

EXAMPLE 14.4 Linear Regression

Problem Statement. Fit a straight line to the values in Table 14.1.

Solution. In this application, force is the dependent variable (y) and velocity is the independent variable (x). The data can be set up in tabular form and the necessary sums computed as in Table 14.4.

TABLE 14.4 Data and summations needed to compute the best-fit line for the data from Table 14.1.

  i     x_i     y_i     x_i^2     x_i y_i
  1      10      25       100        250
  2      20      70       400      1,400
  3      30     380       900     11,400
  4      40     550     1,600     22,000
  5      50     610     2,500     30,500
  6      60   1,220     3,600     73,200
  7      70     830     4,900     58,100
  8      80   1,450     6,400    116,000
  Σ     360   5,135    20,400    312,850

The means can be computed as

\bar{x} = \frac{360}{8} = 45        \bar{y} = \frac{5,135}{8} = 641.875

The slope and the intercept can then be calculated with Eqs. (14.15) and (14.16) as

a_1 = \frac{8(312,850) - 360(5,135)}{8(20,400) - (360)^2} = 19.47024

a_0 = 641.875 - 19.47024(45) = -234.2857

Using force and velocity in place of y and x, the least-squares fit is

F = -234.2857 + 19.47024 v

The line, along with the data, is shown in Fig. 14.8.

FIGURE 14.8 Least-squares fit of a straight line to the data from Table 14.1.

Notice that although the line fits the data well, the negative intercept means that the equation predicts physically unrealistic negative forces at low velocities.
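The normal-equation solution of Eqs. (14.15) and (14.16) is only a few lines of MATLAB. The following sketch (our addition) reproduces Example 14.4 and, as a cross-check, compares the result against MATLAB's built-in polyfit, which performs the same least-squares fit for a first-degree polynomial:

x = [10 20 30 40 50 60 70 80];            % velocity, m/s
y = [25 70 380 550 610 1220 830 1450];    % force, N

n  = length(x);
a1 = (n*sum(x.*y) - sum(x)*sum(y)) / (n*sum(x.^2) - sum(x)^2);  % Eq. (14.15)
a0 = mean(y) - a1*mean(x);                                      % Eq. (14.16)
% a1 = 19.4702 and a0 = -234.2857, matching Example 14.4

p = polyfit(x, y, 1)   % built-in fit; returns the same coefficients as [a1 a0]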
In Section 14.4, we will show how transformations can be employed to derive an alternative best-fit line that is more physically realistic.

14.3.3 Quantification of Error of Linear Regression

Any line other than the one computed in Example 14.4 results in a larger sum of the squares of the residuals. Thus, the line is unique and in terms of our chosen criterion is a "best" line through the points. A number of additional properties of this fit can be elucidated by examining more closely the way in which residuals were computed. Recall that the sum of the squares is defined as [Eq. (14.12)]

S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2    (14.17)

Notice the similarity between this equation and Eq. (14.4):

S_t = \sum (y_i - \bar{y})^2    (14.18)

In Eq. (14.18), the square of the residual represented the square of the discrepancy between the data and a single estimate of the measure of central tendency—the mean. In Eq. (14.17), the square of the residual represents the square of the vertical distance between the data and another measure of central tendency—the straight line (Fig. 14.9).

FIGURE 14.9 The residual in linear regression represents the vertical distance between a data point and the straight line.

The analogy can be extended further for cases where (1) the spread of the points around the line is of similar magnitude along the entire range of the data and (2) the distribution of these points about the line is normal. It can be demonstrated that if these criteria are met, least-squares regression will provide the best (i.e., the most likely) estimates of a_0 and a_1 (Draper and Smith, 1981). This is called the maximum likelihood principle in statistics. In addition, if these criteria are met, a "standard deviation" for the regression line can be determined as [compare with Eq. (14.3)]

s_{y/x} = \sqrt{\frac{S_r}{n - 2}}    (14.19)

where s_{y/x} is called the standard error of the estimate. The subscript notation "y/x" designates that the error is for a predicted value of y corresponding to a particular value of x. Also, notice that we now divide by n − 2 because two data-derived estimates—a_0 and a_1—were used to compute S_r; thus, we have lost two degrees of freedom. As with our discussion of the standard deviation, another justification for dividing by n − 2 is that there is no such thing as the "spread of data" around a straight line connecting two points. Thus, for the case where n = 2, Eq. (14.19) yields a meaningless result of infinity.

Just as was the case with the standard deviation, the standard error of the estimate quantifies the spread of the data. However, s_{y/x} quantifies the spread around the regression line as shown in Fig. 14.10b, in contrast to the standard deviation s_y that quantified the spread around the mean (Fig. 14.10a).

FIGURE 14.10 Regression data showing (a) the spread of the data around the mean of the dependent variable and (b) the spread of the data around the best-fit line. The reduction in the spread in going from (a) to (b), as indicated by the bell-shaped curves at the right, represents the improvement due to linear regression.

These concepts can be used to quantify the "goodness" of our fit. This is particularly useful for comparison of several regressions (Fig. 14.11). To do this, we return to the original data and determine the total sum of the squares around the mean for the dependent variable (in our case, y). As was the case for Eq. (14.18), this quantity is designated S_t. This is the magnitude of the residual error associated with the dependent variable prior to regression. After performing the regression, we can compute S_r, the sum of the squares of the residuals around the regression line, with Eq. (14.17).
This characterizes the residual error that remains after the regression. It is, therefore, sometimes called the unexplained sum of the squares. The difference between the two quantities, S_t − S_r, quantifies the improvement or error reduction due to describing the data in terms of a straight line rather than as an average value. Because the magnitude of this quantity is scale-dependent, the difference is normalized to S_t to yield

r^2 = \frac{S_t - S_r}{S_t}    (14.20)

where r^2 is called the coefficient of determination and r is the correlation coefficient (= \sqrt{r^2}). For a perfect fit, S_r = 0 and r^2 = 1, signifying that the line explains 100% of the variability of the data. For r^2 = 0, S_r = S_t and the fit represents no improvement. An alternative formulation for r that is more convenient for computer implementation is

r = \frac{n \sum (x_i y_i) - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\, \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}    (14.21)

FIGURE 14.11 Examples of linear regression with (a) small and (b) large residual errors.

EXAMPLE 14.5 Estimation of Errors for the Linear Least-Squares Fit

Problem Statement. Compute the total standard deviation, the standard error of the estimate, and the correlation coefficient for the fit in Example 14.4.

Solution. The data can be set up in tabular form and the necessary sums computed as in Table 14.5.

TABLE 14.5 Data and summations needed to compute the goodness-of-fit statistics for the data from Table 14.1.

  i     x_i     y_i     a_0 + a_1 x_i     (y_i − ȳ)^2     (y_i − a_0 − a_1 x_i)^2
  1      10      25          −39.58          380,535              4,171
  2      20      70          155.12          327,041              7,245
  3      30     380          349.82           68,579                911
  4      40     550          544.52            8,441                 30
  5      50     610          739.23            1,016             16,699
  6      60   1,220          933.93          334,229             81,837
  7      70     830        1,128.63           35,391             89,180
  8      80   1,450        1,323.33          653,066             16,044
  Σ     360   5,135                        1,808,297            216,118

The standard deviation is [Eq. (14.3)]

s_y = \sqrt{\frac{1,808,297}{8 - 1}} = 508.26

and the standard error of the estimate is [Eq. (14.19)]

s_{y/x} = \sqrt{\frac{216,118}{8 - 2}} = 189.79

Thus, because s_{y/x} < s_y, the linear regression model has merit. The extent of the improvement is quantified by [Eq. (14.20)]

r^2 = \frac{1,808,297 - 216,118}{1,808,297} = 0.8805

or r = \sqrt{0.8805} = 0.9383. These results indicate that 88.05% of the original uncertainty has been explained by the linear model.
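These error measures are as easy to compute in MATLAB as the fit itself. Continuing from the sketch after Example 14.4 (so x, y, a0, and a1 are assumed to be in the workspace), the following lines (our addition) reproduce Example 14.5:

St  = sum((y - mean(y)).^2);        % total sum of squares, Eq. (14.18): 1,808,297
Sr  = sum((y - a0 - a1*x).^2);      % residual sum of squares, Eq. (14.17): 216,118
sy  = sqrt(St/(length(y) - 1));     % standard deviation: 508.26
syx = sqrt(Sr/(length(y) - 2));     % standard error of the estimate: 189.79
r2  = (St - Sr)/St;                 % coefficient of determination: 0.8805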
Before proceeding, a word of caution is in order. Although the coefficient of determination provides a handy measure of goodness-of-fit, you should be careful not to ascribe more meaning to it than is warranted. Just because r^2 is "close" to 1 does not mean that the fit is necessarily "good." For example, it is possible to obtain a relatively high value of r^2 when the underlying relationship between y and x is not even linear. Draper and Smith (1981) provide guidance and additional material regarding assessment of results for linear regression. In addition, at the minimum, you should always inspect a plot of the data along with your regression curve.

A nice example was developed by Anscombe (1973). As in Fig. 14.12, he came up with four data sets consisting of 11 data points each. Although their graphs are very different, all have the same best-fit equation, y = 3 + 0.5x, and the same coefficient of determination, r^2 = 0.67. This example dramatically illustrates why developing plots is so valuable.
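As a capstone for this section, the pieces above can be collected into a small convenience function. The sketch below is our own (the name fitline and its interface are hypothetical, not from the textbook); it returns the slope, intercept, and r^2 and, in the spirit of Anscombe's lesson, always plots the data along with the fitted line:

function [a1, a0, r2] = fitline(x, y)
% FITLINE  Least-squares straight line with a mandatory diagnostic plot.
%   [a1,a0,r2] = fitline(x,y) fits y = a0 + a1*x by Eqs. (14.15)-(14.16),
%   computes the coefficient of determination by Eq. (14.20), and plots
%   the data with the fitted line so the fit is always inspected visually.
n = length(x);
if length(y) ~= n, error('x and y must be the same length'), end
a1 = (n*sum(x.*y) - sum(x)*sum(y)) / (n*sum(x.^2) - sum(x)^2);
a0 = mean(y) - a1*mean(x);
Sr = sum((y - a0 - a1*x).^2);       % residual sum of squares
St = sum((y - mean(y)).^2);         % total sum of squares
r2 = (St - Sr)/St;
xp = linspace(min(x), max(x));      % line for plotting
plot(x, y, 'o', xp, a0 + a1*xp, '-')
xlabel('x'), ylabel('y')
title(sprintf('y = %.4f + %.4f x,  r^2 = %.4f', a0, a1, r2))
end

For the wind tunnel data, [a1,a0,r2] = fitline(v, F) reproduces the fit of Examples 14.4 and 14.5 and displays it in one step.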
