Lab_TS_1_RM
Asmi Ariv
2022-10-10
Time Series Forecasting
In this lab, we will go through time series data and learn how to build a machine learning model to forecast values for future periods.
Kindly go through the video lectures before proceeding with this lab.
We will use the forecast and tseries packages for this lab.
library(forecast)
library(tseries)
To begin with, we will use the dataset AirPassengers from base R.
It is the classic Box & Jenkins airline data. Monthly totals of international airline passengers, 1949 to 1960. The format of the data is monthly time series, in thousands.
Analysing the data
dim(AirPassengers)
## NULL
length(AirPassengers)
## [1] 144
start(AirPassengers)
## [1] 1949    1
end(AirPassengers)
## [1] 1960   12
class(AirPassengers)
## [1] "ts"
sum(is.na(AirPassengers))
## [1] 0
AirPassengers
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
summary(AirPassengers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   104.0   180.0   265.5   280.3   360.5   622.0
As we can see, the data is a one-dimensional sequence of 144 data points with no missing values. The start time is Jan 1949 and the end time is Dec 1960.
The minimum number of monthly passengers is 104 (Nov 1949) and the maximum is 622 (Jul 1960). The mean is 280.3 and the median is 265.5.
Time series data is usually one-dimensional, but we also come across multivariate data in this domain. The data is already in time series format, as its class is “ts”. If it were not of the ts datatype, we could use the ts() function to convert it.
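As a minimal sketch of that conversion (the vector below is made-up illustration data, not from this lab), ts() takes the start period and the frequency:

```r
# Hypothetical monthly data for 2020 (values invented for illustration).
x <- c(120, 135, 128, 150, 162, 158, 171, 180, 166, 149, 133, 140)
# Convert the plain numeric vector into a monthly series starting Jan 2020.
x.ts <- ts(x, start = c(2020, 1), frequency = 12)
class(x.ts)      # "ts"
frequency(x.ts)  # 12
```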
Let’s look at the cycle of the data (we are not referring to the cyclical component here).
frequency(AirPassengers)
## [1] 12
cycle(AirPassengers)
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 1 2 3 4 5 6 7 8 9 10 11 12
## 1950 1 2 3 4 5 6 7 8 9 10 11 12
## 1951 1 2 3 4 5 6 7 8 9 10 11 12
## 1952 1 2 3 4 5 6 7 8 9 10 11 12
## 1953 1 2 3 4 5 6 7 8 9 10 11 12
## 1954 1 2 3 4 5 6 7 8 9 10 11 12
## 1955 1 2 3 4 5 6 7 8 9 10 11 12
## 1956 1 2 3 4 5 6 7 8 9 10 11 12
## 1957 1 2 3 4 5 6 7 8 9 10 11 12
## 1958 1 2 3 4 5 6 7 8 9 10 11 12
## 1959 1 2 3 4 5 6 7 8 9 10 11 12
## 1960   1   2   3   4   5   6   7   8   9  10  11  12
What we gather from the above table is that the data is monthly and the cycle is yearly, as each cycle has 12 months.
As we have seen, a time series may have four components:
- Trend
- Seasonal
- Cyclical
- Noise
We usually deal with trend, seasonal and noise components in building a model as the cyclical component is very rare.
Plotting the data
plot.ts(AirPassengers)
abline(lm(AirPassengers~time(AirPassengers)))
As we can see, the data has all three components: trend (it is going upward), seasonality (there are noticeable regular intra-year fluctuations), and also some noise.
However, we can see that the movement in the data grows with time (the series gets wider). It appears that the variance is a function of time in this data. This could indicate multiplicative behaviour. In such cases we log-transform the data so that we can use an additive model-building approach.
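We can check this quickly. The snippet below (a side sketch, not part of the lab's main flow) compares the spread of the series early vs. late, before and after a log transform:

```r
# Compare the spread of the first two and last two years of the series.
early <- window(AirPassengers, end = c(1950, 12))
late  <- window(AirPassengers, start = c(1959, 1))
sd(early); sd(late)            # the spread grows several-fold over time
sd(log(early)); sd(log(late))  # after the log, the spreads are much closer
```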
Let’s look at the boxplot of the data to see the month-on-month variation more clearly.
boxplot(AirPassengers~cycle(AirPassengers), xlab="Months", ylab="No. of Passengers per month")
From the boxplot above, we can see that during July-August there is a sudden jump in the average number of monthly passengers, while February and November see very low averages. We can get more details when we decompose the data into its different components.
Let’s decompose the data into its components
dc = decompose(AirPassengers)
plot(dc)
As we can see from the plot, we have a significant amount of all three components in the data. Here, random means noise.
We can also look at the values of these components and analyse their behaviour against their plots.
dc$trend
##       Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1949 NA NA NA NA NA NA 126.7917 127.2500
## 1950 131.2500 133.0833 134.9167 136.4167 137.4167 138.7500 140.9167 143.1667
## 1951 157.1250 159.5417 161.8333 164.1250 166.6667 169.0833 171.2500 173.5833
## 1952 183.1250 186.2083 189.0417 191.2917 193.5833 195.8333 198.0417 199.7500
## 1953 215.8333 218.5000 220.9167 222.9167 224.0833 224.7083 225.3333 225.3333
## 1954 228.0000 230.4583 232.2500 233.9167 235.6250 237.7500 240.5000 243.9583
## 1955 261.8333 266.6667 271.1250 275.2083 278.5000 281.9583 285.7500 289.3333
## 1956 309.9583 314.4167 318.6250 321.7500 324.5000 327.0833 329.5417 331.8333
## 1957 348.2500 353.0000 357.6250 361.3750 364.5000 367.1667 369.4583 371.2083
## 1958 375.2500 377.9167 379.5000 380.0000 380.7083 380.9583 381.8333 383.6667
## 1959 402.5417 407.1667 411.8750 416.3333 420.5000 425.5000 430.7083 435.1250
## 1960 456.3333 461.3750 465.2083 469.3333 472.7500 475.0417 NA NA
## Sep Oct Nov Dec
## 1949 127.9583 128.5833 129.0000 129.7500
## 1950 145.7083 148.4167 151.5417 154.7083
## 1951 175.4583 176.8333 178.0417 180.1667
## 1952 202.2083 206.2500 210.4167 213.3750
## 1953 224.9583 224.5833 224.4583 225.5417
## 1954 247.1667 250.2500 253.5000 257.1250
## 1955 293.2500 297.1667 301.0000 305.4583
## 1956 334.4583 337.5417 340.5417 344.0833
## 1957 372.1667 372.4167 372.7500 373.6250
## 1958 386.5000 390.3333 394.7083 398.6250
## 1959 437.7083 440.9583 445.8333 450.6250
## 1960       NA       NA       NA       NA
plot(dc$trend)
As we can see, there is a constant upward trend; hence there is a year-on-year increase in the average number of monthly passengers.
As you can see from the trend plot and values, during Jan-Jun 1949 and Jul-Dec 1960 there are no values (all NAs). This is because decompose() estimates the trend with a centred 12-month moving average, which cannot be computed for the first and last six months of the series.
We also notice that during May-Dec 1953 the graph is almost flat, meaning there wasn't much upward trend during that period; the trend component accounted for an average of only around 224 passengers per month. Similar plateaus are observed during Sep-Nov 1957 (avg. trend 372) and Apr-Jun 1958 (avg. trend 380).
This could suggest some significant events/incidents that might have affected the trend during these periods.
Let’s look at seasonal component:
dc$seasonal
##            Jan        Feb        Mar       Apr       May       Jun
## 1949 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1950 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1951 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1952 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1953 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1954 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1955 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1956 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1957 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1958 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1959 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## 1960 -24.748737 -36.188131 -2.241162 -8.036616 -4.506313 35.402778
## Jul Aug Sep Oct Nov Dec
## 1949 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1950 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1951 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1952 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1953 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1954 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1955 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1956 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1957 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1958 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1959 63.830808 62.823232 16.520202 -20.642677 -53.593434 -28.619949
## 1960  63.830808  62.823232  16.520202 -20.642677 -53.593434 -28.619949
plot(dc$seasonal)
As we can see from the seasonal plot and values, there is a jump in the average number of passengers travelling during Jun-Sep every year. The biggest hikes are seen in July and August (around 62-63), while June is around 35 and September around 16; these numbers represent the seasonal component of the increase in passengers. There is also a noticeable dip during Jan-Feb and Oct-Dec every year, November being the lowest and February the second lowest. Therefore, the seasonal component can cause a sudden rise or a sudden fall on top of the general trend of the data.
Based on these findings, airline companies could come up with offers and also plan their inventories, no. of flights, etc. Even related businesses such as hotels, transportation services, restaurants, etc. can come up with their business plans to optimize their profits.
Let’s look at the noise component
dc$random
##            Jan        Feb        Mar        Apr        May        Jun
## 1949 NA NA NA NA NA NA
## 1950 8.4987374 29.1047980 8.3244949 6.6199495 -7.9103535 -25.1527778
## 1951 12.6237374 26.6464646 18.4078283 6.9116162 9.8396465 -26.4861111
## 1952 12.6237374 29.9797980 6.1994949 -2.2550505 -6.0770202 -13.2361111
## 1953 4.9154040 13.6881313 17.3244949 20.1199495 9.4229798 -17.1111111
## 1954 0.7487374 -6.2702020 4.9911616 1.1199495 2.8813131 -9.1527778
## 1955 4.9154040 2.5214646 -1.8838384 1.8282828 -3.9936869 -2.3611111
## 1956 -1.2095960 -1.2285354 0.6161616 -0.7133838 -1.9936869 11.5138889
## 1957 -8.5012626 -15.8118687 0.6161616 -5.3383838 -4.9936869 19.4305556
## 1958 -10.5012626 -23.7285354 -15.2588384 -23.9633838 -13.2020202 18.6388889
## 1959 -17.7929293 -28.9785354 -3.6338384 -12.2967172 4.0063131 11.0972222
## 1960 -14.5845960 -34.1868687 -43.9671717 -0.2967172 3.7563131 24.5555556
## Jul Aug Sep Oct Nov Dec
## 1949 -42.6224747 -42.0732323 -8.4785354 11.0593434 28.5934343 16.8699495
## 1950 -34.7474747 -35.9898990 -4.2285354 5.2260101 16.0517677 13.9116162
## 1951 -36.0808081 -37.4065657 -7.9785354 5.8093434 21.5517677 14.4532828
## 1952 -31.8724747 -20.5732323 -9.7285354 5.3926768 15.1767677 9.2449495
## 1953 -25.1641414 -16.1565657 -4.4785354 7.0593434 9.1351010 4.0782828
## 1954 -2.3308081 -13.7815657 -4.6868687 -0.6073232 3.0934343 0.4949495
## 1955 14.4191919 -5.1565657 2.2297980 -2.5239899 -10.4065657 1.1616162
## 1956 19.6275253 10.3434343 4.0214646 -10.8989899 -15.9482323 -9.4633838
## 1957 31.7108586 32.9684343 15.3131313 -4.7739899 -14.1565657 -9.0050505
## 1958 45.3358586 58.5101010 0.9797980 -10.6906566 -31.1148990 -33.0050505
## 1959 53.4608586 61.0517677 8.7714646 -13.3156566 -30.2398990 -17.0050505
## 1960         NA         NA         NA         NA         NA         NA
plot(dc$random)
Trend shows long-term upward or downward movement, while the random noise is a short-term shock in the data. The random noise cannot be predicted using long-term data; therefore, we look at the values at the most recent time stamps to predict the next period.
Training the model
Let’s first build the model without any transformation in the data to see how our model performs.
Since our data has all three components, we will train the model using the Holt-Winters exponential smoothing method.
We can either use HoltWinters() from base R (its fit object does not include residuals) or hw() from the forecast package (which does).
ap.ts.model = HoltWinters(AirPassengers)
ap.ts.model
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = AirPassengers)
##
## Smoothing parameters:
## alpha: 0.2479595
## beta : 0.03453373
## gamma: 1
##
## Coefficients:
## [,1]
## a 477.827781
## b 3.127627
## s1 -27.457685
## s2 -54.692464
## s3 -20.174608
## s4 12.919120
## s5 18.873607
## s6 75.294426
## s7 152.888368
## s8 134.613464
## s9 33.778349
## s10 -18.379060
## s11 -87.772408
## s12  -45.827781
names(ap.ts.model)
## [1] "fitted"       "x"            "alpha"        "beta"         "gamma"
## [6] "coefficients" "seasonal"     "SSE"          "call"
So, we have the smoothing parameters alpha, beta and gamma, along with the coefficients for the level, trend and seasonal components.
As expected, the seasonal coefficients s7 (July) and s8 (August) are large positive values, and the coefficient for June is also quite significant. These are the months that see a sudden rise in the average number of monthly passengers.
Coefficient s11 for November is a large negative value, again expected, as the figures drop suddenly during this month. s2 and s12, for February and December, are also large negative values; we did see the drop in numbers during those months.
Plot the model
Let’s plot the model:
plot(ap.ts.model)
The graph in black is the original data and the one in red shows the values fitted by the model.
The model seems to fit the data well.
Forecast
Let’s make the forecast for the next two years. The frequency is 12; hence two years corresponds to h = 24.
ap.for = forecast(ap.ts.model, h=24) #Function of package "forecast"
plot(ap.for)
From the plot, we can see that our model is able to forecast the next two years using the same pattern as the original data.
Validating the model
Let’s validate the model
We will validate the model using the residuals of the model.
We will plot Autocorrelation Function (ACF) of the residual.
The ACF tells us how a signal (or data series) is correlated with a time-shifted version of itself (its lags). Since we are dealing with univariate time series data, a single variable whose values change in time, it is in a way both the predictor and the response variable. Therefore, we calculate autocorrelation instead of correlation.
Therefore, the ACF provides a measure of similarity between the observed values of X at time t and at time t+T.
Here, T is the delay time, or lag.
Hence, the autocorrelation of X(t) at lag T is the correlation of X(t) and X(t+T).
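As a small sketch of this definition, we can compute the lag-1 sample autocorrelation by hand and compare it with acf(); the short series here is just the 1949 passenger numbers, used for illustration:

```r
x <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118)  # 1949 values
n <- length(x)
xbar <- mean(x)
# Sample autocorrelation at lag 1: correlation of X(t) with X(t+1),
# using the same normalisation acf() uses (deviations from the overall mean).
r1 <- sum((x[-n] - xbar) * (x[-1] - xbar)) / sum((x - xbar)^2)
r1.acf <- acf(x, lag.max = 1, plot = FALSE)$acf[2]
all.equal(r1, r1.acf)  # TRUE
```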
For our model to be valid, the residuals must be random and should have close to 0 autocorrelation value at all lags. It means that lags should not exceed the significance bounds.
One way to check that is to plot the ACF of residuals using acf() function in R
acf(ap.for$residuals, lag.max=20, na.action=na.pass)
The vertical lines show autocorrelations at different lags. If a lag crosses the dotted blue lines (representing the significance bounds), there is significant autocorrelation at that lag.
If just one lag crosses the significance bounds in the residual ACF, we can still accept the model, since we would expect one in twenty of the autocorrelations for the first twenty lags to exceed the 95% significance bounds by chance alone (1 in 20 is 5%).
But if more than one lag crosses the significance bounds, the residuals are autocorrelated at several lags and we have to reject the model. In our case, we do see many lags crossing the significance bounds (dotted blue lines); hence our model is not able to explain all the variance in the data.
We can also run a Ljung-Box test, which tells us whether or not there is autocorrelation in the residuals. If the p-value is less than 0.05, there is autocorrelation.
Box.test(ap.for$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: ap.for$residuals
## X-squared = 109.13, df = 20, p-value = 2.831e-14
Since the p-value is much smaller than 0.05, we can say that the residuals are autocorrelated and lack randomness, while for our model to be valid the residuals have to be random. So, again, we reject our model.
plot.ts(ap.for$residuals)
hist(ap.for$residuals)
As expected, there is a pattern in the residuals. We can see that the variance of the residuals is not constant, i.e. it is a function of time. However, the histogram of the residuals suggests that they are approximately normally distributed.
Training model on transformed data
As discussed earlier, we need to transform our data to get rid of the patterns in the random noise component and to obtain a constant variance.
Let’s do a log transformation of the data and see how it performs.
ap.log = log(AirPassengers)
plot(ap.log)
abline(lm(ap.log~time(ap.log)))
As we can see, the variance in the data now looks almost constant.
Build the model on transformed data
Let’s decompose the log data
dc.log = decompose(ap.log)
plot(dc.log)
So, the transformed data still has all the components: trend, seasonality and noise.
Train the model
ap.log.model = HoltWinters(ap.log)
ap.log.model
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = ap.log)
##
## Smoothing parameters:
## alpha: 0.3266015
## beta : 0.005744138
## gamma: 0.8206654
##
## Coefficients:
## [,1]
## a 6.172308435
## b 0.008981893
## s1 -0.073201087
## s2 -0.140973564
## s3 -0.036703294
## s4 0.014522733
## s5 0.032554237
## s6 0.154873570
## s7 0.294317062
## s8 0.276063997
## s9 0.088237657
## s10 -0.032657089
## s11 -0.198012716
## s12 -0.102863837
Again, s7 and s8 have high positive coefficient values and s11 and s2 have high negative coefficient values, as expected.
Plotting the model
plot(ap.log.model, xlim=c(1949, 1962))
The model fits the data well.
Make forecast
Forecast values for next two years
ap.log.for = forecast(ap.log.model, h=24) #Function of package "forecast"
plot(ap.log.for, xlim=c(1949, 1963))To get the values in original unit, we need to transform the forecast values taking exponential of log values
exp.for.fit = exp(ap.log.for$fitted) # Fitted values without log transformation
exp.for.x = exp(ap.log.for$x) # Original values without log transformation
exp.for.mean = exp(ap.log.for$mean) #Forecast values of next two years without log transformation
ts.plot(exp.for.fit, exp.for.x, exp.for.mean, col=c("black", "red", "blue"), xlim=c(1949, 1963))
As we can see, the model has performed well on the original scale of the data (after undoing the log transformation).
Validating the model
ACF plot of residuals
acf(ap.log.for$residuals, lag.max=20, na.action=na.pass)
As we can see, only one lag crosses the dotted blue lines (significance bounds). Hence, we can accept the model.
Let’s run the Ljung-Box test.
Box.test(ap.log.for$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: ap.log.for$residuals
## X-squared = 32.698, df = 20, p-value = 0.03641
The Box test does suggest some evidence of autocorrelation. But let's plot the residuals to verify further.
plot(ap.log.for$residuals)
hist(ap.log.for$residuals)
The residual plot does look quite random, and the histogram looks approximately normal. Therefore, we can say that the model is almost valid. However, there is always scope for improvement.
Box-Cox transformation
We will try one more transformation of the data using Box-Cox method.
Box-Cox transformation is used to deal with the non-constant variance.
BC transformation also helps in making data look approximately normally distributed.
This transformation is parameterized by lambda
Let’s try
hist(AirPassengers)
As we can see, the data is not normally distributed.
We will use the boxcox() function from the MASS package.
library(MASS)
bc = boxcox(AirPassengers~time(AirPassengers))
(lambda <- bc$x[which.max(bc$y)])
## [1] 0.06060606
So, the lambda value is 0.06060606.
Let’s do the transformation
For the Box-Cox transformation of y using lambda, we use the following equations:
For lambda != 0
y(lambda) = ((y^lambda)-1)/lambda
For lambda = 0
y(lambda) = log(y)
So, when we used the log transformation earlier, we in a way used a Box-Cox transformation with lambda taken to be 0.
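The two cases above can be written as a small helper function (illustrative only; the lab applies the lambda > 0 branch inline below):

```r
# Illustrative Box-Cox transform implementing the piecewise formula above.
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
box_cox(exp(1), 0)  # 1, the log branch
box_cox(4, 0.5)     # (sqrt(4) - 1) / 0.5 = 2
```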
ap.bc = (AirPassengers^lambda-1)/lambda # Our lambda > 0
hist(ap.bc)
As we can see, the transformed data does look close to normally distributed.
plot(ap.bc)
The variance appears constant now.
Train the model and forecast
ap.bc.model = HoltWinters(ap.bc)
ap.bc.model
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = ap.bc)
##
## Smoothing parameters:
## alpha: 0.321893
## beta : 0.007546935
## gamma: 0.8546614
##
## Coefficients:
## [,1]
## a 7.48336154
## b 0.01257751
## s1 -0.10405712
## s2 -0.20167684
## s3 -0.05556218
## s4 0.02350682
## s5 0.04793015
## s6 0.22364971
## s7 0.42840608
## s8 0.40005828
## s9 0.12644648
## s10 -0.04668913
## s11 -0.28514189
## s12 -0.14726881
Plot the model and forecast values
plot(ap.bc.model)
ap.bc.for = forecast(ap.bc.model, h=24)
plot(ap.bc.for)
The model does seem to perform well.
Validating model
acf(ap.bc.for$residuals, lag.max=20, na.action=na.pass)
Box.test(ap.bc.for$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: ap.bc.for$residuals
## X-squared = 36.082, df = 20, p-value = 0.01504
plot.ts(ap.bc.for$residuals)
hist(ap.bc.for$residuals)
Even though the Box test shows evidence of autocorrelation in the residuals, the plots suggest the model can be accepted, as the residuals look random and their distribution appears approximately normal.
ets() in R
There is a function ets() in the package forecast. This function automatically selects the best type of model fit.
ap.ets = ets(AirPassengers, model = "ZZZ", lambda="auto") #Read about the function by typing ?ets in R
# In model option 1st letter is for error, 2nd for trend and 3rd for seasonality
ap.ets
## ETS(A,A,A)
##
## Call:
## ets(y = AirPassengers, model = "ZZZ", lambda = "auto")
##
## Box-Cox transformation: lambda= -0.2947
##
## Smoothing parameters:
## alpha = 0.4025
## beta = 1e-04
## gamma = 1e-04
##
## Initial states:
## l = 2.5695
## b = 0.0018
## s = -0.0215 -0.0437 -0.0143 0.0132 0.04 0.0407
## 0.0218 -0.0018 -0.0011 0.0049 -0.0207 -0.0175
##
## sigma: 0.0078
##
## AIC AICc BIC
## -664.3179 -659.4607 -613.8310
names(ap.ets)
## [1] "loglik"     "aic"        "bic"        "aicc"       "mse"
## [6] "amse" "fit" "residuals" "fitted" "states"
## [11] "par" "m" "method" "series" "components"
## [16] "call"       "initstate"  "sigma2"     "x"          "lambda"
ets() has selected an ETS(A,A,A) model: additive error, additive trend and additive seasonality, fitted after an automatic Box-Cox transformation (lambda = -0.2947). The function uses the letters N = none, A = additive, M = multiplicative and Z = automatically selected. Since we supplied "ZZZ" to the model argument, it selected the best combination for each component.
Note that no damping parameter phi appears in the output, so the trend has not been damped here; a damped-trend fit would be reported as ETS(A,Ad,A).
Here sigma is the estimated standard deviation of the residuals. The optimization criterion can be changed via the opt.crit argument of ets().
plot(ap.ets)
The above plot shows the different components of the fitted model.
Forecast
ap.est.for = forecast(ap.ets,h=24)
plot(ap.est.for)
Validating the model
acf(residuals(ap.est.for), lag.max=20)
Box.test(residuals(ap.est.for), lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(ap.est.for)
## X-squared = 42.693, df = 20, p-value = 0.002244
hist(as.vector(residuals(ap.ets)))
plot.ts(residuals(ap.ets))
The Ljung-Box p-value is again below 0.05, so some autocorrelation remains in the residuals of the ets() model.
ARIMA
ARIMA is another method of building a time series model.
AR stands for Auto-regression, I stands for Integrated and MA stands for Moving Average.
AR can be understood as the data regressing on itself; it accounts for long-term effects that persist.
It is much like a linear regression and can be written as follows:
X(t) = theta1*X(t-1) + theta2*X(t-2) + … + error(t)
Here, we want to calculate X in time t based on X in the time lags of t-1, t-2, t-3,… and some error in time t.
In our example of passengers, the average number of passengers every month keeps growing and continues month-on-month and year-on-year. It is a long term effect and we are looking at annual cycle with monthly frequencies.
MA is related to the noise in the data caused by short-term shocks or sudden changes, whose effect vanishes quickly.
X(t) = theta*random_noise(t-1) + error(t)
We are calculating X in time t based on random noise in the time lags of t-1 and some error in time t
In our example, suppose a sudden change in weather causes flights to be cancelled for the next few days. This would cause a sudden drop in the average number of passengers, but within a few days things would return to normal. So the effect of the weather change on the average number of passengers was short term and vanished quickly.
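To see the two behaviours side by side, we can simulate artificial AR and MA series with arima.sim(); this is a sketch with made-up coefficients, not part of the AirPassengers analysis:

```r
set.seed(42)
# AR(2): each value depends on the previous two values, so effects persist.
ar.sim <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 200)
# MA(1): each value depends only on the previous shock, so effects die out.
ma.sim <- arima.sim(model = list(ma = 0.7), n = 200)
# The ACF of the MA(1) series cuts off after lag 1, while the
# ACF of the AR(2) series decays gradually.
acf(ar.sim, lag.max = 20)
acf(ma.sim, lag.max = 20)
```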
However, we must look at the autocorrelation plots of the data to see at how many lags (t-1, t-2, etc.) the series is correlated with X at time t. Based on this, we decide how many lag terms to use for AR and MA.
However, to build an ARIMA model, our data must be stationary, that is, its variance and mean should not be a function of time.
Detrending: If there is a trend in the data, mean cannot be constant over time and hence, we must remove the trend from the data. It is called detrending.
Differencing: One method to remove the trend-related non-stationarity is to take the difference of consecutive terms, X(t)-X(t-1), and use that to build the model. This differencing is the integrated (I) part of ARIMA.
Log Transformation: If the variance of the data is not constant, i.e., if it is dependent on time, we transform the data by taking log of it.
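A one-line sketch of what differencing does (the numbers are arbitrary):

```r
x <- c(10, 12, 15, 19)
diff(x)                 # X(t) - X(t-1): 2 3 4
# Combined with a log transform, diff(log(x)) approximates
# period-on-period growth rates.
round(diff(log(x)), 3)  # 0.182 0.223 0.236
```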
Testing for stationarity in the data
We can look at the ACF chart of the data and find out if the data is stationary. If the ACF decays very slowly that means the data is not stationary.
acf(AirPassengers)
As expected, the ACF plot doesn't decay quickly; hence, the data is not stationary.
There is a test called the Augmented Dickey-Fuller test to check whether the data is stationary. Before we run it, we log-transform the data to get constant variance over time and difference the log-transformed data to remove the trend and get a constant mean. We will use the adf.test() function from the tseries package.
adf.test(diff(log(AirPassengers)), alternative="stationary", k=0)
##
## Augmented Dickey-Fuller Test
##
## data: diff(log(AirPassengers))
## Dickey-Fuller = -9.6003, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
Since the p-value is less than 0.05, we can reject the null hypothesis and accept the alternative: the transformed data is stationary.
Degrees of ARIMA
ARIMA has three order parameters, p, d and q:
- p is the order of AR
- d is the order of I (differencing)
- q is the order of MA
We already know d is 1, as we are using one differencing.
In MA, the shock vanishes quickly, so it is easy to find the cut-off lag. But for AR, the correlation dies down gradually, and it is not as easy to find the cut-off. Hence, we look at the Partial Autocorrelation Function (PACF): the partial correlations at each lag give us the order of the AR part.
ACF: A bar chart of the coefficients of correlation between time series and its lags
PACF: A bar chart of the partial correlation coefficients between a time series and its lags
Since, we have already decided to train the model on difference of log transformed data, we will plot the ACF and PACF to find out the degrees of AR and MA.
Acf(diff(log(AirPassengers))) #Using the function of forecast package
Acf(diff(log(AirPassengers)), type = "partial")
Looking at the PACF graph above, we may keep p as 0 or 1, given that we are using one differencing. Looking at the ACF graph, q may be 1 or 2.
For seasonality, since it is a 12-month cycle, the PACF cuts off at lag 12, so the seasonal p could be 0; looking at the ACF, the seasonal q could again be 1. The seasonal d should also be 1, as we apply one seasonal differencing.
Let’s train the model using arima() function in R.
ap.arima <- arima(log(AirPassengers), c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
ap.arima
##
## Call:
## arima(x = log(AirPassengers), order = c(0, 1, 1), seasonal = list(order = c(0,
## 1, 1), period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -483.4
Forecast
Let’s forecast for next two years.
pred <- predict(ap.arima, n.ahead = 2*12)
Let’s plot the model
ts.plot(AirPassengers, exp(pred$pred), log = "y", lty = c(1,3))
Validating the model
acf(ap.arima$residuals, lag.max=20)
Box.test(ap.arima$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: ap.arima$residuals
## X-squared = 17.688, df = 20, p-value = 0.6079
qqnorm(as.vector(ap.arima$residuals))
qqline(as.vector(ap.arima$residuals))
hist(as.vector(ap.arima$residuals))
plot.ts(ap.arima$residuals)
All our residual tests show that our model is valid:
- The ACF plot shows none of the 20 lags crossing the significance bounds.
- The Box test shows a large p-value (>> 0.05), implying the residuals are random and no significant autocorrelation exists among the lags.
- The histogram and normal Q-Q plot show that the residuals are approximately normally distributed, as expected for a well-fitting model.
- The residuals-vs-time plot looks random, without any pattern.
Hence, we conclude that our model is valid.
auto.arima() in R
There is a function, auto.arima() in R that automatically searches for the best ARIMA model and returns the same.
Let’s build the model using this function and see if the models match.
ap.arima #Model that we trained
##
## Call:
## arima(x = log(AirPassengers), order = c(0, 1, 1), seasonal = list(order = c(0,
## 1, 1), period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -483.4
auto.arima(log(AirPassengers))
## Series: log(AirPassengers)
## ARIMA(0,1,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 = 0.001371: log likelihood = 244.7
## AIC=-483.4   AICc=-483.21   BIC=-474.77
As we can see, auto.arima() has selected the same model as we did. You can use this function to get an idea of how to proceed with your model building, but you should take its help only after properly analysing the data and having built and tried different models yourself.
So, we have gone through various methods to train time series models. Many analysts prefer ARIMA over Holt-Winters, but there are also those who use the latter more often. The idea is to train a model that gives higher accuracy and lower error.
Time series is in itself a very vast field, and we have only tried to cover the basics in this lab. For deeper knowledge, you can experiment with more techniques and tools on different datasets.
Exercise: Use some other time series dataset from R or any other source, analyse the data, and train models using all the methods given in this lab.