class: center, middle, inverse, title-slide .title[ # CSSS/POLS 512 ] .subtitle[ ## Lab 2: Time Series Diagnostics — ACF, PACF, and the Box-Jenkins Method ] .author[ ### Ramses Llobet ] .date[ ### Spring 2026 ] --- # Today's Plan .pull-left[ **Part 1: Building Up a Time Series** - What components can a time series have? - Building the composite equation step by step - Simulations for each component **Part 2: ACF/PACF & "Guess the Process"** - ACF/PACF as diagnostic tools - The identification table - Interactive: identify 9 unknown series ] .pull-right[ **Part 3: Stationarity Tests & Residual Diagnostics** - ADF, KPSS: what they test and when to use them - Ljung-Box and Jarque-Bera - Estimation and model comparison **Part 4: Practice** - Diagnose 6 mystery series using Box-Jenkins - ~20 minutes hands-on ] --- class: inverse, center, middle # Part 1: Building Up a Time Series --- # The Big Picture Most time series can be decomposed into a combination of recognizable components: `$$y_t = \underbrace{\beta_0}_\text{level} + \underbrace{\beta_1 t}_\text{trend} + \underbrace{S_t}_\text{seasonal} + \underbrace{\phi_1 y_{t-1}}_\text{AR(1)} + \underbrace{\theta_1 \varepsilon_{t-1}}_\text{MA(1)} + \underbrace{\varepsilon_t}_\text{white noise}$$` -- Not every series has all of these. The goal of **Box-Jenkins diagnostics** is to determine which components are present and how strong they are. -- The diagnostic order matters: **first** address trend and seasonality, **then** identify AR and MA. Let's build this equation **one component at a time**. --- # Step 1: Level + White Noise `$$y_t = \beta_0 + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma^2)$$` The simplest possible time series: a constant mean plus random noise. No memory, no patterns. This is our **null model** — what "no structure" looks like. 
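In R, simulating this null model takes one line — a minimal sketch with assumed values `\(\beta_0 = 2\)` and `\(\sigma = 1\)` (illustrative, not necessarily the values behind the figure):

```r
set.seed(123)                                # reproducible draws
beta0 <- 2                                   # assumed level (illustrative)
wn <- ts(beta0 + rnorm(200, sd = 1))         # constant mean + Gaussian noise
Box.test(wn, lag = 10, type = "Ljung-Box")   # white noise: expect a large p-value
```

Every diagnostic in this lab should come back clean on a series like this.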
-- <img src="Lab2_slides_files/figure-html/build-step1-1.svg" width="864" style="display: block; margin: auto;" /> .footnote[*Code: `Lab2.Rmd` → Section 1.1 (ts objects) and Section 1.4 (white noise, ACF/PACF, Ljung-Box)*] --- # Step 2: Add a Trend `$$y_t = \beta_0 + \color{firebrick}{\beta_1 t} + \varepsilon_t$$` Now the mean drifts over time. Each period, the expected value increases by `\(\beta_1\)`. The series fluctuates around a **line**, not a constant. -- <img src="Lab2_slides_files/figure-html/build-step2-1.svg" width="864" style="display: block; margin: auto;" /> .footnote[*Code: `Lab2.Rmd` → Section 2.1 (deterministic trends, detrending with `lm()`)*] --- # Step 3: Add Seasonality `$$y_t = \beta_0 + \beta_1 t + \color{firebrick}{S_t} + \varepsilon_t$$` A repeating pattern at a fixed period (e.g., 12 months). The series oscillates with a predictable rhythm layered on top of trend and noise. -- <img src="Lab2_slides_files/figure-html/build-step3-1.svg" width="864" style="display: block; margin: auto;" /> -- Trend and seasonality are the **first things to address** — they dominate the ACF and mask the AR/MA structure underneath. .footnote[*Code: `Lab2.Rmd` → Section 1.3 (STL preview) and Section 2.3 (deseasonalization: STL vs. `lm()` + seasonal means)*] --- # Step 4: Add Autoregressive Dependence (AR) `$$y_t = \color{firebrick}{\phi_1 y_{t-1}} + \varepsilon_t, \quad |\phi_1| < 1$$` Now each observation depends on the **previous value**. The series develops **persistence** — smooth, wandering patterns. High values tend to be followed by high values. -- <img src="Lab2_slides_files/figure-html/build-step4-1.svg" width="864" style="display: block; margin: auto;" /> -- Compare to white noise: the AR(1) series is **smoother** — it "remembers" where it has been. 
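To generate this kind of persistence yourself, `stats::arima.sim()` is the standard tool — a sketch with an assumed `\(\phi_1 = 0.85\)` (illustrative, not necessarily the value in the figure):

```r
set.seed(123)
# phi = 0.85 is an illustrative value for a strongly persistent AR(1)
ar1 <- arima.sim(model = list(ar = 0.85), n = 200)
acf(ar1)    # tails off: roughly geometric decay
pacf(ar1)   # cuts off: one dominant spike at lag 1
```

Re-run with `ar = 0.3` and `ar = 0.95` to see how `\(\phi_1\)` controls the smoothness.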
.footnote[*Code: `Lab2.Rmd` → Section 1.5 (AR processes, varying φ, AR(2))*] --- # Step 5: Moving Average Shocks (MA) `$$y_t = \varepsilon_t + \color{firebrick}{\theta_1 \varepsilon_{t-1}}$$` Now the current value depends on **past shocks**, not past values. A shock moves the series when it hits and for exactly one period after, then disappears — **finite memory**. -- <img src="Lab2_slides_files/figure-html/build-step5-1.svg" width="864" style="display: block; margin: auto;" /> -- Compare to AR(1): the MA(1) series is **rougher** — less persistence, quicker return to the mean. .footnote[*Code: `Lab2.Rmd` → Section 1.6 (MA processes) and Section 1.7 (ARMA)*] --- # The Full Composite Putting it all together — and in **diagnostic order**: `$$y_t = \underbrace{\beta_0}_\text{level} + \underbrace{\beta_1 t}_\text{trend} + \underbrace{S_t}_\text{seasonal} + \underbrace{\phi_1 y_{t-1}}_\text{AR} + \underbrace{\theta_1 \varepsilon_{t-1}}_\text{MA} + \underbrace{\varepsilon_t}_\text{noise}$$` -- The **Box-Jenkins method**: address each layer in order, then identify what remains. | Step | What you do | What you remove | |:-----|:------------|:----------------| | 1. **Visual inspection** | Plot the series | — | | 2. **Detrend** | Regress on `\(t\)`, keep residuals | Trend | | 3. **Deseasonalize** | Subtract seasonal means or STL | Seasonality | | 4. **Identify AR/MA** | Examine ACF/PACF of the cleaned series | — | | 5. **Estimate & diagnose** | Fit model, check residuals for white noise | AR/MA structure | -- If residuals still show structure → **revise and repeat**. --- class: inverse, center, middle # Part 2: ACF/PACF & "Guess the Process" --- # ACF and PACF: The Diagnostic Tools **ACF** — correlation between `\(y_t\)` and `\(y_{t-k}\)` (includes indirect paths through intermediate lags). **PACF** — correlation between `\(y_t\)` and `\(y_{t-k}\)` **after removing** the linear effect of lags `\(1, \dots, k-1\)`.
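Both are one-liners in R; `plot = FALSE` returns the numeric values instead of the chart (the AR(1) input below is an assumed example):

```r
set.seed(1)
x <- arima.sim(model = list(ar = 0.7), n = 500)   # assumed AR(1) input
a <- acf(x,  lag.max = 10, plot = FALSE)
p <- pacf(x, lag.max = 10, plot = FALSE)
a$acf[2]   # lag-1 ACF (a$acf[1] is lag 0, which is always 1)
p$acf[1]   # lag-1 PACF (pacf() starts at lag 1; equals the lag-1 ACF there)
qnorm(0.975) / sqrt(length(x))   # the +/- 1.96/sqrt(n) confidence band
```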
-- ### The Identification Table | Process | ACF | PACF | |:--------|:----|:-----| | **AR(p)** | Tails off (decays) | **Cuts off** after lag `\(p\)` | | **MA(q)** | **Cuts off** after lag `\(q\)` | Tails off (decays) | | **ARMA(p,q)** | Tails off | Tails off | -- "Cuts off" = drops to zero (within confidence bands) after a specific lag. "Tails off" = decays gradually (exponential, sinusoidal, or both). The blue dashed **confidence bands** are `\(\pm 1.96/\sqrt{n}\)` — they get narrower with more data. A single spike barely crossing the band is likely noise (5% will cross by chance). **Important:** This table only works on **stationary** data — detrend and deseasonalize first! --- # How Each Process Looks in ACF/PACF <img src="Lab2_slides_files/figure-html/acf-pacf-grid-1.svg" width="1008" style="display: block; margin: auto;" /> .footnote[*Code: `Lab2.Rmd` → Section 1.8 (identification table)*] --- # The Ljung-Box Test: "Are My Residuals Clean?" The **Ljung-Box test** is the formal answer to: *"Is there any autocorrelation left?"* - `\(H_0\)`: No autocorrelation (white noise) - `\(H_1\)`: Serial correlation exists -- ### Why this matters for regression .pull-left[ **Ljung-Box fails to reject** `\((p > 0.05)\)`: Residuals are white noise → your OLS standard errors, `\(t\)`-tests, and confidence intervals are **valid** → regular regression is fine. ] .pull-right[ **Ljung-Box rejects** `\((p \leq 0.05)\)`: Autocorrelation in residuals → your standard errors are **too small** → you're over-rejecting `\(H_0\)` → you need to model the dynamics (ARMA) or use robust SEs (Newey-West). ] -- **Bottom line:** This test is the bridge between time series diagnostics and the regression you already know. If Ljung-Box passes, you're done. If it fails, you need the tools from this lab. .footnote[`Box.test(residuals(fit), lag = 10, type = "Ljung-Box")`] --- # Now Let's Practice — Guess the Process! For each series I will show you: 1. 
First, the **time series plot** — look at it and form a hypothesis 2. Then, the **ACF and PACF** — match to the identification table -- Series A–D are **clean** (stationary, no trend or seasonality). Series E–G have **complications** (trend, seasonality, unit root). The .Rmd covers additional processes (negative AR, AR(2)) that we skip here for time — work through them in the lab document. --- # Guess the Process: Series A <img src="Lab2_slides_files/figure-html/series-a-ts-1.svg" width="864" style="display: block; margin: auto;" /> --- # Series A: ACF and PACF <img src="Lab2_slides_files/figure-html/series-a-acf-1.svg" width="864" style="display: block; margin: auto;" /> -- **Answer: White Noise.** No significant spikes in either ACF or PACF — no autocorrelation structure. --- # Guess the Process: Series B <img src="Lab2_slides_files/figure-html/series-b-ts-1.svg" width="864" style="display: block; margin: auto;" /> --- # Series B: ACF and PACF <img src="Lab2_slides_files/figure-html/series-b-acf-1.svg" width="864" style="display: block; margin: auto;" /> -- **Answer: AR(1), `\(\phi = 0.85\)`.** ACF decays slowly; PACF has a single significant spike at lag 1. --- # Guess the Process: Series C <img src="Lab2_slides_files/figure-html/series-c-ts-1.svg" width="864" style="display: block; margin: auto;" /> --- # Series C: ACF and PACF <img src="Lab2_slides_files/figure-html/series-c-acf-1.svg" width="864" style="display: block; margin: auto;" /> -- **Answer: MA(1), `\(\theta = 0.9\)`.** ACF cuts off sharply after lag 1; PACF tails off. The **mirror image** of the AR(1) pattern. 
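The cutoff is easy to confirm numerically — for an MA(1), the theoretical ACF is `\(\theta/(1+\theta^2)\)` at lag 1 and zero beyond (a simulation sketch using the same `\(\theta = 0.9\)`):

```r
set.seed(123)
ma1 <- arima.sim(model = list(ma = 0.9), n = 500)
r <- acf(ma1, lag.max = 5, plot = FALSE)$acf
r[2]      # lag 1: in theory 0.9 / (1 + 0.9^2), about 0.5
r[3:6]    # lags 2-5: should sit inside the confidence bands
```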
--- # Guess the Process: Series D <img src="Lab2_slides_files/figure-html/series-d-ts-1.svg" width="864" style="display: block; margin: auto;" /> --- # Series D: ACF and PACF <img src="Lab2_slides_files/figure-html/series-d-acf-1.svg" width="864" style="display: block; margin: auto;" /> -- **Answer: ARMA(1,1), `\(\phi = 0.7, \theta = 0.4\)`.** Both ACF and PACF tail off — neither cuts off cleanly. When this happens → compare candidate models with AIC. --- class: inverse, center, middle # Now the Trickier Ones: Complications --- # Series E: What's going on here? <img src="Lab2_slides_files/figure-html/series-e-ts-1.svg" width="864" style="display: block; margin: auto;" /> --- # Series E: Reveal <img src="Lab2_slides_files/figure-html/series-e-acf-1.svg" width="864" style="display: block; margin: auto;" /> -- **AR(1) with `\(\phi = 0.6\)` + deterministic trend.** The ACF decays very slowly — this is the trend's signature, not genuine persistence. **Fix: detrend first** (regress on time), then re-examine ACF/PACF. --- # Series F: What's going on here? <img src="Lab2_slides_files/figure-html/series-f-ts-1.svg" width="864" style="display: block; margin: auto;" /> --- # Series F: Reveal <img src="Lab2_slides_files/figure-html/series-f-acf-1.svg" width="864" style="display: block; margin: auto;" /> -- **AR(1) + seasonal component (period = 12).** ACF shows peaks at lags 12, 24, 36. PACF: spike at lag 1 (AR) and lag 12 (seasonal). **Fix: remove seasonality first** (STL or seasonal means), then diagnose the remainder. --- # Series G: What's going on here? <img src="Lab2_slides_files/figure-html/series-g-ts-1.svg" width="864" style="display: block; margin: auto;" /> --- # Series G: Reveal <img src="Lab2_slides_files/figure-html/series-g-acf-1.svg" width="864" style="display: block; margin: auto;" /> -- **Random walk** ( `\(y_t = y_{t-1} + \varepsilon_t\)` ). ACF barely decays — stays near 1.0 across all lags. This is **non-stationary**. 
**Fix: first-difference** ( `\(\Delta y_t\)` ) to recover white noise. --- # Summary: Complications and Their Signatures | Complication | What you see | ACF signature | What to do | |:-------------|:-------------|:--------------|:-----------| | **Deterministic trend** | Upward/downward drift | Very slow decay | Detrend with `lm()` | | **Seasonality** | Repeating periodic pattern | Spikes at period multiples | STL or seasonal means | | **Unit root** | Wandering, no fixed mean | Barely decays from 1.0 | Difference ( `\(\Delta y\)` ) | -- **Key lesson:** Handle trends, seasonality, and non-stationarity *before* reading the identification table. The ACF/PACF rules only work on **stationary** data. .footnote[*Code: `Lab2.Rmd` → Section 2.1 (trends), Section 2.2 (unit roots, ADF/KPSS), Section 2.3 (seasonality comparison)*] --- class: inverse, center, middle # Part 3: Stationarity Tests & Residual Diagnostics --- # Stationarity Tests: ADF and KPSS Two tests with **opposite null hypotheses** — use both together: -- .pull-left[ ### ADF (Augmented Dickey-Fuller) - `\(H_0\)`: **Unit root** (non-stationary) - `\(H_1\)`: Stationary - Small `\(p\)` → reject → **evidence for stationarity** - `tseries::adf.test(x)` ] .pull-right[ ### KPSS - `\(H_0\)`: **Stationary** - `\(H_1\)`: Unit root - Small `\(p\)` → reject → **evidence for non-stationarity** - `tseries::kpss.test(x)` ] -- ### The Decision Table | ADF result | KPSS result | Interpretation | |:-----------|:------------|:---------------| | Reject | Fail to reject | Both agree: **stationary** | | Fail to reject | Reject | Both agree: **non-stationary** → difference | | Fail to reject | Fail to reject | Ambiguous — need more data or alternative tests | | Reject | Reject | Contradictory — possible structural break | .footnote[*Code: `Lab2.Rmd` → Section 2.2 (stationarity tests, differencing, diagnostic toolkit table)*] --- # Why Use Both Tests? Hypothesis tests can only provide evidence **against** the null, not **for** it. 
-- - If ADF **fails to reject** the unit root null, that could mean: - (a) There really is a unit root, **or** - (b) ADF simply lacks **power** (common when `\(\phi \approx 0.95\)` or small `\(n\)`) -- - KPSS flips the null → a KPSS failure to reject is at least *consistent* with stationarity, and paired with an ADF rejection it amounts to **positive evidence for stationarity**, not just absence of evidence. -- - The **Phillips-Perron test** (`PP.test()`) is an alternative to ADF with the same null. It handles serial correlation differently (nonparametric correction). Useful when ADF and KPSS disagree. --- # Residual Diagnostics: The Finish Line After fitting an ARMA model, check whether residuals `\(\hat{e}_t\)` look like **white noise**: -- .pull-left[ ### Ljung-Box test - `\(H_0\)`: No autocorrelation (white noise) - `\(H_1\)`: Serial correlation exists - **Pass:** `\(p > 0.05\)` → residuals are clean - `Box.test(resid, type = "Ljung-Box")` ### Jarque-Bera test - `\(H_0\)`: Residuals are normally distributed - `\(H_1\)`: Non-normal (excess skew/kurtosis) - `tseries::jarque.bera.test(resid)` ] .pull-right[ ### All-in-one: `checkresiduals()` ```r fit <- Arima(y, order = c(1, 0, 1)) checkresiduals(fit) ``` Produces: 1. Residual time plot 2. Residual ACF 3. Histogram 4. Ljung-Box `\(p\)`-value If residuals **fail** → revise model → re-estimate → check again. ] .footnote[*Code: `Lab2.Rmd` → Section 3.1 (estimation), Section 3.2 (residual checks, Jarque-Bera), Section 3.3 (AIC comparison)*] --- # Diagnostic Toolkit Summary | Test | R function | `\(H_0\)` | Use when...
| |:-----|:-----------|:-------|:------------| | **Ljung-Box** | `Box.test(x, type="Ljung-Box")` | White noise | Testing for remaining autocorrelation | | **Jarque-Bera** | `jarque.bera.test(x)` | Normal distribution | Checking residual normality | | **ADF** | `adf.test(x)` | Unit root | Checking if differencing is needed | | **KPSS** | `kpss.test(x)` | Stationary | Complementing ADF | | **Phillips-Perron** | `PP.test(x)` | Unit root | Alternative to ADF | -- **A note on `auto.arima()`:** Useful as a sanity check, but not a substitute for understanding the Box-Jenkins procedure. It can miss the correct specification or propose complex models that are hard to interpret. Always verify with `checkresiduals()`. --- class: inverse, center, middle # Part 4: Practice — Diagnose Unknown Series --- # Practice Instructions Open **`Lab2.Rmd`**, Part 4. You will find **6 mystery time series**. For each one: 1. **Plot** the series — look for trends, seasonality, stationarity 2. **Test stationarity** — run ADF and KPSS 3. **Examine** the ACF and PACF (after detrending/deseasonalizing if needed) 4. **Identify** a candidate model using the identification table 5. **Estimate** the model with `Arima()` 6. **Diagnose** the residuals — are they white noise? -- .pull-left[ | ACF | PACF | → Model | |:----|:-----|:--------| | Tails off | Cuts off at `\(p\)` | AR$(p)$ | | Cuts off at `\(q\)` | Tails off | MA$(q)$ | | Tails off | Tails off | ARMA$(p,q)$ — compare with AIC | ] .pull-right[ **Tips:** - Series B has `frequency = 12` - If ACF decays very slowly → check for trend or unit root *first* - If both ACF/PACF tail off → try ARMA(1,1) as a starting point - When in doubt, let `auto.arima()` give you a second opinion ] You have **~20 minutes**. We will debrief together. 
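The whole loop, as a template on a series whose answer is known in advance (an assumed ARMA(1,1); requires the `tseries` and `forecast` packages — substitute each mystery series for `y`):

```r
library(tseries)    # adf.test(), kpss.test()
library(forecast)   # Arima(), checkresiduals()

set.seed(42)
y <- arima.sim(model = list(ar = 0.7, ma = 0.4), n = 300)  # known ARMA(1,1)

plot(y)                                  # 1. visual inspection
adf.test(y); kpss.test(y)                # 2. stationarity (opposite nulls)
acf(y); pacf(y)                          # 3. both tail off -> try ARMA
fit <- Arima(y, order = c(1, 0, 1))      # 4-5. candidate model
checkresiduals(fit)                      # 6. residuals ~ white noise?
AIC(fit, Arima(y, order = c(2, 0, 0)))   # tie-breaker between candidates
```

If step 6 fails, go back to step 4 with the residual ACF in hand.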
--- # Wrap-Up **Today we covered:** - The composite equation: level, trend, seasonality, AR, MA, noise — in diagnostic order - ACF and PACF as fingerprints for AR, MA, and ARMA processes - Common complications: trends, seasonality, unit roots — handle these first - Stationarity tests: ADF + KPSS (opposite nulls, use together) - Residual diagnostics: Ljung-Box + Jarque-Bera -- **Coming up next:** - Estimation and interpretation of ARMA models with covariates - Model selection: AIC, cross-validation - Forecasting with ARIMA -- **Self-study pointers:** - Lab2.Rmd Appendices A and B (Box-Jenkins flowchart + mathematical details) - Shumway & Stoffer, Ch. 3 (ARIMA models) - Practice: simulate your own ARMA processes, try to identify them --- class: inverse, center, middle # Questions?