FORECASTING: CAUSAL FORECASTING WITH LINEAR REGRESSION

In the preceding six sections, we have focused on time series forecasting methods, i.e., methods that forecast the next value in a time series based on its previous values. We now turn to another type of approach to forecasting.

Causal Forecasting

In some cases, the variable to be forecasted has a rather direct relationship with one or more other variables whose values will be known at the time of the forecast. If so, it would make sense to base the forecast on this relationship. This kind of approach is called causal forecasting.

Causal forecasting obtains a forecast of the quantity of interest (the dependent variable) by relating it directly to one or more other quantities (the independent variables) that drive the quantity of interest.

Table 27.2 shows some examples of the kinds of situations where causal forecasting sometimes is used. In each of the first three cases, the indicated dependent variable can be expected to go up or down rather directly with the independent variable(s) listed in the rightmost column. The last case also applies when some quantity of interest (e.g., sales of a product) tends to follow a steady trend upward (or downward) with the passage of time (the independent variable that drives the quantity of interest).

Linear Regression

We will focus on the type of causal forecasting where the mathematical relationship between the dependent variable and the independent variable(s) is assumed to be a linear one (plus some random fluctuations). The analysis in this case is referred to as linear regression.

To illustrate the linear regression approach, suppose that a publisher of textbooks is concerned about the initial press run for her books. She sells books both through bookstores and through mail orders. The latter channel relies on an extensive advertising campaign online, as well as through publishing media and direct mail. The advertising campaign is conducted prior to the publication of the book. The sales manager has noted that there is a rather interesting linear relationship between the number of mail orders and the number sold through bookstores during the first year. He suggests that this relationship be exploited to determine the initial press run for subsequent books.

Thus, if the number of mail order sales for a book is denoted by X and the number of bookstore sales by Y, then the random variables X and Y exhibit a degree of association. However, there is no functional relationship between these two random variables; i.e., given the number of mail order sales, one does not expect to determine exactly the number of bookstore sales. For any given number of mail order sales, there is a range of possible bookstore sales, and vice versa.

What, then, is meant by the statement, “The sales manager has noted that there is a rather interesting linear relationship between the number of mail orders and the number sold through bookstores during the first year”? Such a statement implies that the expected value of the number of bookstore sales is linear with respect to the number of mail order sales, i.e.,

E[Y | X = x] = A + Bx.

Thus, if the number of mail order sales is x for many different books, the average number of corresponding bookstore sales would tend to be approximately A + Bx. This relationship between X and Y is referred to as a degree of association model.

As already suggested in Table 27.2, other examples of this degree of association model can easily be found. A college admissions officer may be interested in the relationship between a student’s performance on the college entrance examination and subsequent performance in college. An engineer may be interested in the relationship between tensile strength and hardness of a material. An economist may wish to predict a measure of inflation as a function of the cost of living index, and so on.

The degree of association model is not the only model of interest. In some cases, there exists a functional relationship between two variables that may be linked linearly. In a forecasting context, one of the two variables is time, while the other is the variable of interest. In Sec. 27.6, such an example was mentioned in the context of the generating process of the time series being represented by a linear trend superimposed with random fluctuations, i.e.,

Xt = A + Bt + et,

where A is a constant, B is the slope, and et is the random error, assumed to have expected value equal to zero and constant variance. (The symbol Xt can also be read as X given t or as X | t.) It follows that

E(Xt) = A + Bt.

Note that both the degree of association model and the exact functional relationship model lead to the same linear relationship, and their subsequent treatment is almost identical. Hence, the publishing example will be explored further to illustrate how to treat both kinds of models, although the special structure of the model

E(Xt) = A + Bt,

with t taking on integer values starting with 1, leads to certain simplified expressions. In the standard notation of regression analysis, X represents the independent variable and Y represents the dependent variable of interest. Consequently, the notational expression for this special time series model now becomes

Yt = A + Bt + et.
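
The simplified expressions for the time series case can be sketched in code. The following is a minimal sketch, not the text's own implementation: it assumes the observations are equally spaced at t = 1, 2, ..., n and uses the standard least-squares slope and intercept, with the closed forms (n + 1)/2 for the mean of t and n(n^2 - 1)/12 for the sum of squared deviations of t. The function name `fit_linear_trend` and the sample series are hypothetical.

```python
def fit_linear_trend(y):
    """Least-squares fit of y_t = a + b*t, assuming t = 1, 2, ..., n."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    t_bar = (n + 1) / 2              # mean of 1..n
    s_tt = n * (n * n - 1) / 12      # sum of (t - t_bar)^2 for t = 1..n
    y_bar = sum(y) / n
    s_ty = sum((t - t_bar) * (y_t - y_bar) for t, y_t in enumerate(y, start=1))
    b = s_ty / s_tt                  # estimated slope
    a = y_bar - b * t_bar            # estimated intercept
    return a, b


# Hypothetical noise-free series generated from A = 10, B = 2.
series = [10 + 2 * t for t in range(1, 13)]
a, b = fit_linear_trend(series)

# Forecast for the next period, t = n + 1.
next_forecast = a + b * (len(series) + 1)
print(a, b, next_forecast)
```

With real data the random error et is present, so the fitted a and b only approximate A and B.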

Method of Least Squares

Suppose that bookstore sales and mail order sales are given for 15 books. These data appear in Table 27.3, and the resulting plot is given in Fig. 27.7.

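It is evident that the points in Fig. 27.7 do not lie on a straight line. Hence, it is not clear where the line should be drawn to show the linear relationship. Suppose that an arbitrary line, given by the expression ~y = a + bx, is drawn through the data. A measure of how well this line fits the data can be obtained by computing the sum of squares of the vertical deviations of the actual points from the fitted line. Thus, let yi represent the bookstore sales of the ith book and xi the corresponding mail order sales. Denote by ~yi the point on the fitted line corresponding to the mail order sales of xi. The proposed measure of fit is then given by

\[ Q = \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2 = \sum_{i=1}^{n} \left[ y_i - (a + b x_i) \right]^2 . \]

The method of least squares chooses the values of a and b that minimize Q, which gives

\[ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}, \]

where \bar{x} and \bar{y} are the sample means of the xi and the yi. Applying these formulas to the data in Table 27.3 yields the least-squares line, and this is the line drawn in Fig. 27.7. Such a line is referred to as a regression line.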
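
To make the least-squares calculation concrete, here is a minimal sketch in Python. It assumes the standard closed-form solution (slope equal to the sum of cross-deviations divided by the sum of squared x-deviations, intercept equal to the mean of y minus the slope times the mean of x). The function names and the five data points are hypothetical; they are not the values from Table 27.3.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x minimizing squared vertical deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def sum_squared_deviations(xs, ys, a, b):
    """Sum of squared vertical deviations of the points from the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustration data (not from Table 27.3).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
q_best = sum_squared_deviations(xs, ys, a, b)

# Any other line, e.g. one with a shifted intercept, fits worse by this measure.
q_other = sum_squared_deviations(xs, ys, a + 0.1, b)
print(a, b, q_best, q_other)
```

Comparing `q_best` with `q_other` shows the defining property of the regression line: no other choice of intercept and slope gives a smaller sum of squared deviations.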

An Excel template called Linear Regression is available in your OR Courseware for calculating a regression line in this way. A procedure in the forecasting area of your IOR Tutorial also will perform this calculation for you, as well as enable you to graphically investigate the effect of making changes in the data.

This regression line is useful for forecasting purposes. For a given value of x, the corresponding value of ~y on the regression line represents the forecast.

The decision maker may be interested in some measure of uncertainty that is associated with this forecast. This measure is easily obtained provided that certain assumptions can be made. Therefore, for the remainder of this section, it is assumed that

1. A random sample of n pairs (x1, Y1), (x2, Y2), . . . , (xn, Yn) is to be taken.

2. The Yi are normally distributed with mean A + Bxi and variance σ² (independent of i).

The assumption that Yi is normally distributed is not a critical assumption in determining the uncertainty in the forecast, but the assumption of constant variance is crucial. Furthermore, an estimate of this variance is required.

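An unbiased estimate of σ² is given by s²(y|x), where

\[ s_{y|x}^2 = \frac{\sum_{i=1}^{n} (y_i - \tilde{y}_i)^2}{n - 2} \]

and ~yi = a + bxi is the point on the regression line corresponding to xi.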
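
As a rough sketch of this variance estimate, the code below fits the least-squares line and then divides the sum of squared residuals by n - 2, which is the usual unbiased estimator for simple linear regression. The function name `residual_variance` and the data are hypothetical and are not taken from Table 27.3.

```python
def residual_variance(xs, ys):
    """Estimate the error variance as SSE / (n - 2) around the least-squares line."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points to estimate the variance")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # Sum of squared vertical deviations from the fitted line.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Two parameters (a and b) were estimated, hence n - 2 degrees of freedom.
    return sse / (n - 2)


# Hypothetical illustration data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
s2 = residual_variance(xs, ys)
print(s2)
```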

A very important reason for obtaining the linear relationship between two variables is to use the line for future decision making. From the regression line, it is possible to estimate E(Y | x) by a point estimate (the forecast) and a confidence interval estimate (a measure of forecast uncertainty).

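For example, the publisher might want to use this approach to estimate the expected number of bookstore sales corresponding to mail order sales of, say, 1,400, by both a point estimate and a confidence interval estimate for forecasting purposes. Denoting the given value of x by x*, the point estimate of E(Y | x = x*) is ~y* = a + bx*, and the endpoints of a 100(1 - α) percent confidence interval for E(Y | x = x*) are given by

\[ a + b x^{*} \pm t_{\alpha/2;\,n-2}\; s_{y|x} \sqrt{\frac{1}{n} + \frac{(x^{*} - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \]

where t(α/2; n - 2) is the upper 100(α/2) percentage point of the t distribution with n - 2 degrees of freedom and s(y|x) is the square root of the estimate of σ².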

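The decision maker often wants an interval for the actual value of a future observation rather than for its expected value. The endpoints of a 100(1 - α) percent prediction interval are given by

\[ a + b x_{+} \pm t_{\alpha/2;\,n-2}\; s_{y|x} \sqrt{1 + \frac{1}{n} + \frac{(x_{+} - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \]

where t(α/2; n - 2) is the upper 100(α/2) percentage point of the t distribution with n - 2 degrees of freedom and s(y|x) is the square root of the estimate of σ². For any given value of x (denoted here by x+), the probability is 1 - α that the value of the future Y+ associated with x+ will fall in this interval.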

Thus, in the publishing example, if x+ is 1,400, then the corresponding 95 percent prediction interval for the number of bookstore sales is given by 6,060 ± 315, which is naturally wider than the confidence interval for the expected number of bookstore sales, 6,060 ± 141.
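
The difference between the two intervals can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the standard simple-regression formulas, in which the prediction interval adds 1 under the square root to account for the variability of the new observation itself. The critical t value is passed in by the caller (looked up from a t table) rather than computed. The function name and data are hypothetical and do not reproduce the publishing example.

```python
import math


def forecast_intervals(xs, ys, x_new, t_crit):
    """Return (forecast, ci_half_width, pi_half_width) at x_new.

    t_crit is the upper alpha/2 point of the t distribution with n - 2
    degrees of freedom, supplied by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))            # estimated standard deviation
    forecast = a + b * x_new
    spread = 1 / n + (x_new - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * math.sqrt(spread)        # for the expected value
    pi_half = t_crit * s * math.sqrt(1 + spread)    # for one future value
    return forecast, ci_half, pi_half


# Hypothetical data with n = 5, so n - 2 = 3 degrees of freedom.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT_3_DF = 3.182  # upper 0.025 point of t with 3 d.f. (95 percent intervals)

forecast, ci_half, pi_half = forecast_intervals(xs, ys, 6.0, T_CRIT_3_DF)
print(forecast, ci_half, pi_half)
```

The prediction half-width always exceeds the confidence half-width, and both grow as `x_new` moves away from the mean of the observed x values.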

This method of finding a prediction interval works fine if it is only being done once. However, it is not feasible to use the same data to find multiple prediction intervals with various values of x+ in this way and then specify a probability that all these predictions will be correct. For example, suppose that the publisher wants prediction intervals for several different books. For each individual book, she still is able to use these expressions to find the prediction interval and then make the prediction that the bookstore sales will be within this interval, where the probability is 1 - α that the prediction will be correct. However, what she cannot do is specify a probability that all these predictions will be correct. The reason is that these predictions are all based upon the same statistical data, so the predictions are not statistically independent. If the predictions were independent and if k future bookstore sales were being predicted, with each prediction being made with probability 1 - α, then the probability would be (1 - α)^k that all k predictions of future bookstore sales will be correct. Unfortunately, the predictions are not independent, so the actual probability cannot be calculated, and (1 - α)^k does not even provide a reasonable approximation.
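
For reference, the hypothetical independent case is easy to tabulate. The short sketch below only evaluates the product of the individual probabilities, which is what the joint probability would be if the k predictions were independent; as noted above, that is not the situation when all intervals come from the same fitted line, so these numbers are not the actual joint probabilities. The function name is hypothetical.

```python
def joint_coverage_if_independent(alpha, k):
    """Probability that k independent predictions, each correct with
    probability 1 - alpha, are all correct."""
    return (1 - alpha) ** k


# With alpha = 0.05, the independent-case value falls quickly as k grows.
for k in (1, 5, 10):
    print(k, joint_coverage_if_independent(0.05, k))
```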

This difficulty can be overcome by using simultaneous tolerance intervals. Using this technique, the publisher can take the mail order sales of any book, find an interval (based on the previously determined linear regression line) that will contain the actual bookstore sales with probability at least 1 - α, and repeat this for any number of books having the same or different mail order sales. Furthermore, the probability is P that all these predictions will be correct. An alternative interpretation is as follows. If every publisher followed this procedure, each using his or her own linear regression line, then 100P percent of the publishers (on average) would find that at least 100(1 - α) percent of their bookstore sales fell into the predicted intervals. The expression for the endpoints of each such tolerance interval is given by

\[ a + b x_{+} \pm c^{**}\, s_{y|x} \sqrt{\frac{1}{n} + \frac{(x_{+} - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \]

where c** is a tabulated constant that depends on n, P, and α.

Thus, the publisher can predict that the bookstore sales corresponding to known mail order sales will fall in these tolerance intervals. Such statements can be made for as many books as the publisher desires. Furthermore, the probability is P that at least 100(1 - α) percent of bookstore sales corresponding to mail order sales will fall in these intervals. If P is chosen as 0.90 and α = 0.05, the appropriate value of c** is 11.625. Hence, the number of bookstore sales corresponding to mail order sales of 1,400 books is predicted to fall in the interval 6,060 ± 759. If another book had mail order sales of 1,353, the bookstore sales are predicted to fall in the interval 5,258 ± 390, and so on. At least 95 percent of the bookstore sales will fall into their predicted intervals, and these statements are made with confidence 0.90.
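
The half-width of 759 can be cross-checked against the confidence interval quoted earlier. The sketch below rests on two assumptions that are not stated in the text above: that the tolerance half-width equals c** times the same standard-error term that the confidence interval multiplies by the t value, and that the 95 percent confidence interval used the t value for n - 2 = 13 degrees of freedom (about 2.160, since 15 books were observed). The variable names are hypothetical.

```python
# Values quoted in the text for mail order sales of 1,400.
CI_HALF_WIDTH = 141       # 95 percent confidence interval half-width
C_STAR_STAR = 11.625      # tolerance constant for P = 0.90, alpha = 0.05

# Assumption: upper 0.025 point of the t distribution with 13 d.f.
T_CRIT_13_DF = 2.160

# Back out the standard-error term shared by both intervals (assumed form).
std_error_term = CI_HALF_WIDTH / T_CRIT_13_DF

# Scale it by c** instead of the t value to get the tolerance half-width.
tolerance_half_width = C_STAR_STAR * std_error_term
print(round(tolerance_half_width))
```

The result agrees with the quoted 6,060 ± 759 after rounding, which suggests the assumed form is the one used, although the quoted 141 is itself rounded.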

To summarize, we now have described three measures of forecast uncertainty. The first (in the preceding subsection) is a confidence interval on the expected value of the random variable Y (for example, bookstore sales) given the observed value x of the independent variable X (for example, mail order sales). The second is a prediction interval on the actual value that Y will take on, given x. The third is simultaneous tolerance intervals on a succession of actual values that Y will take on given a succession of observed values of X.
