Session 21: AI-Enhanced Forecasting | Financial Modeling Course

Learning Objectives

What You'll Learn Today

Section 1

Linear Regression for Financial Forecasting

From Excel TREND to scikit-learn — building predictive models

💭

Think About It

In Session 5, we forecasted revenue using growth rates and analyst estimates. What if we could predict revenue using GDP growth, industry trends, and seasonality — all at once?

That's exactly what multiple linear regression does: it finds the mathematical relationship between multiple input variables and your target forecast.

📖 Why Machine Learning for Forecasting?

Traditional financial forecasting relies on analyst judgment — you pick a growth rate, maybe adjust for seasonality. ML-based forecasting is different:

Aspect	📊 Traditional (Excel)	🤖 ML-Based (Python)
Method	Manual growth rate + gut feel	Statistical relationship from historical data
Inputs	1–2 drivers (growth rate, margin)	10+ features (GDP, seasonality, lag values, industry trends)
Accuracy	Varies with analyst skill	Measurable (R², RMSE, MAPE)
Speed	Slow — update each cell manually	Instant — retrain model with new data
Bias	Optimism bias, anchoring	Data-driven (but can inherit data biases)
Confidence	No confidence interval	Prediction intervals + uncertainty quantification

⚠️Important Caveat

ML does NOT replace financial judgment. It augments it. The best approach combines ML predictions with domain knowledge — use ML as one input in your forecasting toolkit, not as a black box that replaces thinking.

📐 Simple Linear Regression — The Math

Simple Linear Regression

y = β₀ + β₁x₁ + ε

Where: y = target (revenue), x₁ = feature (GDP growth), β = coefficients, ε = error

Multiple Linear Regression

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

Multiple features: GDP growth + industry index + seasonality + lag values
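To make the equation concrete, here is a minimal sketch of how the coefficients β can be estimated by least squares using only NumPy. The data is synthetic and the "true" coefficients are made-up values chosen for illustration; scikit-learn's LinearRegression solves the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40

# Synthetic features (hypothetical values, for illustration only)
gdp = rng.normal(6.5, 0.5, n)            # x1: GDP growth (%)
it_index = np.linspace(100, 280, n)      # x2: industry index

# Generate revenue from known coefficients plus noise (the epsilon term)
true_beta = np.array([5000.0, 300.0, 250.0])   # beta0, beta1, beta2
X = np.column_stack([np.ones(n), gdp, it_index])
y = X @ true_beta + rng.normal(0, 50, n)

# Least squares: choose beta to minimise the sum of squared errors
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, est, true in zip(["beta0", "beta1 (GDP)", "beta2 (IT index)"],
                           beta_hat, true_beta):
    print(f"{name:18s} estimated={est:10.2f}  true={true:10.2f}")
```

The estimates land close to the generating coefficients but not exactly on them, because the noise term ε is never observed directly.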

Key Metrics

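R² (R-squared) = 1 − (SS_res / SS_tot) → Share of variance explained (at most 1; can be negative on test data when the model does worse than predicting the mean)
RMSE = √(Σ(y_actual − y_pred)² / n) → Typical prediction error in original units (large errors are weighted more heavily)
MAPE = mean(|y_actual − y_pred| / |y_actual|) × 100 → Average absolute error as a % of actuals (undefined when y_actual = 0)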
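The three metrics can be computed by hand in a few lines. The sketch below uses four made-up actual and predicted values (not taken from any example in this session) so each intermediate quantity is easy to check.

```python
import numpy as np

# Hypothetical actual vs predicted values (₹ Cr), for illustration only
y_actual = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([102.0, 108.0, 123.0, 127.0])

residuals = y_actual - y_pred                        # -2, 2, -3, 3
ss_res = np.sum(residuals ** 2)                      # 4 + 4 + 9 + 9 = 26
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # 225 + 25 + 25 + 225 = 500

r2 = 1 - ss_res / ss_tot                             # 1 - 26/500 = 0.948
rmse = np.sqrt(np.mean(residuals ** 2))              # sqrt(26 / 4)
mape = np.mean(np.abs(residuals) / np.abs(y_actual)) * 100

print(f"R²:   {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAPE: {mape:.2f}%")
```

These match what r2_score, mean_squared_error (after taking the square root) and mean_absolute_percentage_error (after multiplying by 100) return in scikit-learn.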

✏️ Worked Example 1: Predicting TCS Revenue with Linear Regression

Given: 8 years of TCS quarterly revenue data with GDP growth and IT industry index.

Calculate: (i) Correlation between features and revenue. (ii) Train a multiple regression model. (iii) Evaluate R² and RMSE. (iv) Forecast next 4 quarters.

Python — Multiple Linear Regression for Revenue Forecasting

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_percentage_error

# ========================================
# STEP 1: Create/Load the Dataset
# ========================================
# Simulated quarterly data for TCS (₹ Cr)
np.random.seed(42)
quarters = pd.date_range('2016-Q1', periods=32, freq='Q')

data = pd.DataFrame({
    'Quarter': quarters,
    'Revenue': [26000, 27500, 29000, 30500,
                31000, 32500, 34200, 36000,
                36500, 38200, 40000, 42000,
                43000, 44800, 46500, 48500,
                49200, 51000, 53000, 55500,
                56200, 58500, 61000, 64000,
                65000, 67000, 69500, 72000,
                73000, 75500, 78000, 81000],
    'GDP_Growth': [7.1, 7.3, 7.0, 7.4, 7.5, 7.2, 7.6, 7.3,
                   6.8, 7.0, 6.9, 7.1, 6.5, 6.7, 6.6, 6.8,
                   6.3, 6.5, 6.4, 6.6, 6.1, 6.3, 6.2, 6.4,
                   5.9, 6.1, 6.0, 6.2, 5.8, 6.0, 5.9, 6.1],
    'IT_Index': [100, 102, 105, 108, 112, 115, 118, 122,
                 128, 132, 136, 140, 148, 153, 158, 163,
                 170, 176, 182, 188, 196, 203, 210, 218,
                 225, 233, 240, 248, 255, 264, 272, 280]
})

# ========================================
# STEP 2: Feature Engineering
# ========================================
# Add time-based features
data['Quarter_Num'] = range(1, len(data) + 1)  # Linear time trend
data['Q_Month'] = data['Quarter'].dt.month  # 3, 6, 9, 12

# Cyclical encoding for quarter seasonality
data['Q_Sin'] = np.sin(2 * np.pi * data['Q_Month'] / 12)
data['Q_Cos'] = np.cos(2 * np.pi * data['Q_Month'] / 12)

# Lag features (previous quarter revenue)
data['Revenue_Lag1'] = data['Revenue'].shift(1)
data['Revenue_Lag4'] = data['Revenue'].shift(4)  # Same quarter last year

# YoY growth
data['Revenue_YoY'] = data['Revenue'].pct_change(4) * 100

# Drop rows with NaN (from lagging)
data_clean = data.dropna().reset_index(drop=True)

print("Feature Correlation with Revenue:")
print("=" * 45)
corr = data_clean[['Revenue', 'GDP_Growth', 'IT_Index', 'Quarter_Num',
                    'Q_Sin', 'Q_Cos', 'Revenue_Lag1', 'Revenue_Lag4']].corr()
print(corr['Revenue'].sort_values(ascending=False).round(3))

# ========================================
# STEP 3: Train-Test Split (Time-Series!)
# ========================================
# IMPORTANT: Never shuffle time-series data!
train = data_clean.iloc[:20]  # First 20 quarters for training
test = data_clean.iloc[20:]   # Last 8 quarters for testing

features = ['GDP_Growth', 'IT_Index', 'Quarter_Num', 'Q_Sin', 'Q_Cos',
            'Revenue_Lag1', 'Revenue_Lag4']

X_train = train[features]
y_train = train['Revenue']
X_test = test[features]
y_test = test['Revenue']

# ========================================
# STEP 4: Train the Model
# ========================================
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# ========================================
# STEP 5: Evaluate
# ========================================
print("\n" + "=" * 50)
print("MODEL PERFORMANCE")
print("=" * 50)
print(f"Training R²:    {r2_score(y_train, y_pred_train):.4f}")
print(f"Test R²:        {r2_score(y_test, y_pred_test):.4f}")
print(f"Test RMSE:      ₹{np.sqrt(mean_squared_error(y_test, y_pred_test)):,.0f} Cr")
print(f"Test MAPE:      {mean_absolute_percentage_error(y_test, y_pred_test)*100:.2f}%")

print("\nFeature Coefficients:")
print("-" * 35)
for feat, coef in sorted(zip(features, model.coef_), key=lambda x: abs(x[1]), reverse=True):
    print(f"  {feat:20s}: {coef:>12,.2f}")
print(f"  {'Intercept':20s}: {model.intercept_:>12,.2f}")

# ========================================
# STEP 6: Visualize — Actual vs Predicted
# ========================================
fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(data_clean['Quarter'], data_clean['Revenue'], 'b-o',
        label='Actual Revenue', linewidth=2, markersize=5)
ax.plot(train['Quarter'], y_pred_train, 'g--',
        label='Training Prediction', linewidth=1.5, alpha=0.7)
ax.plot(test['Quarter'], y_pred_test, 'r-s',
        label='Test Prediction', linewidth=2, markersize=6)

# Mark the train/test boundary on the chart
ax.axvline(x=test['Quarter'].iloc[0], color='gray', linestyle=':', alpha=0.5)
ax.text(test['Quarter'].iloc[0], data_clean['Revenue'].max() * 0.95,
        ' ← Train | Test →', fontsize=10, color='gray')

ax.set_title('TCS Revenue: ML Forecast vs Actual', fontsize=14, fontweight='bold')
ax.set_xlabel('Quarter')
ax.set_ylabel('Revenue (₹ Cr)')
ax.legend(fontsize=11)
ax.grid(alpha=0.3)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig('revenue_forecast.png', dpi=150, bbox_inches='tight')
plt.show()
print("\n✅ Chart saved: revenue_forecast.png")

Feature Correlation with Revenue:
=============================================
Revenue         1.000
Revenue_Lag1    0.999
Revenue_Lag4    0.996
IT_Index        0.994
Quarter_Num     0.994
Q_Cos           0.062
Q_Sin          -0.041
GDP_Growth     -0.880
Name: Revenue, dtype: float64

==================================================
MODEL PERFORMANCE
==================================================
Training R²:    0.9997
Test R²:        0.9989
Test RMSE:      ₹412 Cr
Test MAPE:      0.58%

Feature Coefficients:
-----------------------------------
  GDP_Growth          :       510.83
  Q_Sin               :       -95.42
  Q_Cos               :        68.17
  Quarter_Num         :        42.30
  IT_Index            :        28.15
  Revenue_Lag1        :         0.72
  Revenue_Lag4        :        -0.15
  Intercept           :    -8,240.50

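Interpreting Results: A test R² of 0.9989 means the model explains 99.89% of the variance in test-period revenue. A MAPE of 0.58% means predictions are off by about 0.6% of actual revenue on average. Revenue_Lag1 (previous quarter) and IT_Index are among the features most strongly correlated with revenue. Raw coefficients are not directly comparable because the features are on different scales, and this smooth simulated series is much easier to fit than real revenue data.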
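Task (iv), forecasting the next four quarters, needs one extra step when the model uses lag features: each prediction has to be fed back in as the lag for the following quarter. Below is a minimal sketch of that recursive loop. It uses a noise-free synthetic series (not the TCS data) and plain NumPy least squares instead of scikit-learn, so the result can be checked exactly. The helper name make_row is hypothetical.

```python
import numpy as np

# Deterministic synthetic series: linear trend plus a repeating quarterly pattern
seasonal = np.array([3.0, -2.0, 4.0, -5.0])
t = np.arange(24)
revenue = 100 + 5 * t + seasonal[t % 4]

def make_row(t_idx, history):
    """Features for period t_idx, built only from values observed before it."""
    return [1.0, t_idx, history[-1], history[-4]]   # intercept, trend, lag 1, lag 4

# Training matrix: the first 4 periods are skipped because lag 4 is unavailable
X = np.array([make_row(i, revenue[:i]) for i in range(4, len(revenue))])
y = revenue[4:]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Recursive forecast: append each prediction so it becomes a lag for later steps
history = list(revenue)
forecasts = []
for h in range(4):
    t_next = len(revenue) + h
    pred = float(np.dot(make_row(t_next, history), beta))
    forecasts.append(pred)
    history.append(pred)

print([round(f, 2) for f in forecasts])
```

With real data the external drivers (GDP growth, the IT index) are also unknown for future quarters, so they would have to be supplied as assumptions or forecast separately.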

Section 2

Time-Series Decomposition

Breaking down revenue into trend, seasonality, and noise

📖 The 4 Components of Any Time Series

Every financial time series can be decomposed into four components:

Component	What It Captures	Example (TCS Revenue)	Python
Trend (T)	Long-term direction — the underlying growth path	Steady increase from ₹26K Cr to ₹81K Cr over 8 years	Moving average or linear fit
Seasonal (S)	Repeating patterns at fixed intervals	Q3 (Oct-Dec) is typically strongest for Indian IT	statsmodels seasonal_decompose
Cyclical (C)	Non-fixed-period fluctuations tied to business cycles	IT spending dips during global recessions (2-5 year cycles)	Hodrick-Prescott filter
Irregular (I)	Random noise — what can't be explained	One-time deal wins, currency shocks, COVID	Residual after removing T + S

Additive vs Multiplicative Decomposition

Additive: y(t) = T(t) + S(t) + C(t) + I(t)
Multiplicative: y(t) = T(t) × S(t) × C(t) × I(t)

Use Additive when: seasonal variation is relatively constant over time
Use Multiplicative when: seasonal variation grows with the trend level

💡Rule of Thumb

For revenue (grows over time) → use Multiplicative. For margins or growth rates (bounded) → use Additive.
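As a rough illustration of what a multiplicative decomposition does, the sketch below applies the classical recipe by hand: estimate the trend with a centered moving average, divide it out, then average the ratios by quarter. The series is synthetic, with seasonal factors chosen for illustration. This is a simplified sketch of the classical method, not a reproduction of any particular library's implementation.

```python
import numpy as np

# Synthetic quarterly series: linear trend times a fixed seasonal pattern (no noise)
true_seasonal = np.array([1.02, 0.97, 1.04, 0.97])   # factors average to 1.0
n = 24
t = np.arange(n)
trend = 9500 + 170 * t
y = trend * true_seasonal[t % 4]

# Centered 4-quarter moving average: half weight on the two end points
weights = np.array([0.5, 1.0, 1.0, 1.0, 0.5]) / 4
trend_est = np.convolve(y, weights, mode="valid")     # covers t = 2 .. n-3

# Multiplicative model: detrend by division, then average the ratios per quarter
ratios = y[2:n - 2] / trend_est
quarter = t[2:n - 2] % 4
raw = np.array([ratios[quarter == q].mean() for q in range(4)])
seasonal_idx = raw / raw.mean()                       # normalise so indices average 1

for q in range(4):
    print(f"Quarter {q + 1}: estimated={seasonal_idx[q]:.4f}  true={true_seasonal[q]:.4f}")
```

For an additive series the same steps apply with subtraction in place of division, and the seasonal effects are centered on 0 instead of 1.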

✏️ Worked Example 2: Decomposing HUL Revenue & Forecasting

Task: Decompose 6 years of Hindustan Unilever quarterly revenue, then forecast the next 4 quarters using trend + seasonal components.

Python — Time-Series Decomposition & Forecast

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# ========================================
# STEP 1: Create HUL Revenue Data
# ========================================
np.random.seed(42)
quarters = pd.date_range('2018-Q1', periods=24, freq='Q')

# HUL quarterly revenue (₹ Cr) — strong seasonality in FMCG
base_trend = np.linspace(9500, 13500, 24)
seasonality = np.tile([200, -300, 400, -100], 6)  # Q2 dip (pre-monsoon), Q3 peak (festive)
noise = np.random.normal(0, 150, 24)
revenue = base_trend + seasonality + noise

data = pd.Series(revenue, index=quarters, name='Revenue')
data.index.name = 'Quarter'

# ========================================
# STEP 2: Decompose
# ========================================
# Multiplicative decomposition (revenue grows over time)
decomposition = seasonal_decompose(data, model='multiplicative', period=4)

# Plot the decomposition
fig, axes = plt.subplots(4, 1, figsize=(14, 12), sharex=True)

decomposition.observed.plot(ax=axes[0], color='#2563EB', linewidth=2)
axes[0].set_ylabel('Observed')
axes[0].set_title('HUL Revenue — Time Series Decomposition', fontweight='bold', fontsize=14)

decomposition.trend.plot(ax=axes[1], color='#10B981', linewidth=2)
axes[1].set_ylabel('Trend')

decomposition.seasonal.plot(ax=axes[2], color='#F59E0B', linewidth=2)
axes[2].set_ylabel('Seasonal')

decomposition.resid.plot(ax=axes[3], color='#EF4444', linewidth=1.5, marker='o', markersize=4)
axes[3].set_ylabel('Residual')

for ax in axes:
    ax.grid(alpha=0.3)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

plt.xlabel('Quarter')
plt.tight_layout()
plt.savefig('decomposition.png', dpi=150, bbox_inches='tight')
plt.show()

# ========================================
# STEP 3: Extract Seasonal Indices
# ========================================
seasonal_indices = decomposition.seasonal.groupby(decomposition.seasonal.index.month).mean()
print("\nSeasonal Indices (Multiplicative):")
print("=" * 40)
months = {3: 'Q1 (Jan-Mar)', 6: 'Q2 (Apr-Jun)', 9: 'Q3 (Jul-Sep)', 12: 'Q4 (Oct-Dec)'}
for m, name in months.items():
    idx = seasonal_indices.get(m, 1.0)
    effect = "↑ Above avg" if idx > 1 else "↓ Below avg"
    print(f"  {name}: {idx:.4f}  ({effect})")

# ========================================
# STEP 4: Forecast Next 4 Quarters
# ========================================
# Method: Project trend linearly, then multiply by seasonal index
from sklearn.linear_model import LinearRegression

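# The centered moving average leaves NaN at both ends of the trend.
# Fit on each point's original time position so the forecast starts
# right after the last observed quarter, not after the last non-NaN trend value.
trend = decomposition.trend
positions = np.arange(len(trend))
valid = trend.notna().values

trend_model = LinearRegression()
trend_model.fit(positions[valid].reshape(-1, 1), trend[valid].values)

# Forecast trend for the 4 quarters after the last observed quarter
future_X = np.arange(len(trend), len(trend) + 4).reshape(-1, 1)
future_trend = trend_model.predict(future_X)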

# Apply seasonal indices
future_months = [3, 6, 9, 12]  # Q1-Q4
future_seasonal = np.array([seasonal_indices.get(m, 1.0) for m in future_months])

forecast = future_trend * future_seasonal

print("\n" + "=" * 50)
print("FORECAST: HUL Revenue — Next 4 Quarters")
print("=" * 50)
forecast_quarters = pd.date_range('2024-Q1', periods=4, freq='Q')
for q, t, s, f in zip(forecast_quarters, future_trend, future_seasonal, forecast):
    # Timestamp.strftime has no quarter directive, so build the label from attributes
    print(f"  {q.year}-Q{q.quarter}:  Trend=₹{t:,.0f}  ×  Seasonal={s:.4f}  =  ₹{f:,.0f} Cr")

print(f"\nTotal FY2024 Forecast: ₹{sum(forecast):,.0f} Cr")

Seasonal Indices (Multiplicative):
========================================
  Q1 (Jan-Mar): 1.0215  (↑ Above avg)
  Q2 (Apr-Jun): 0.9678  (↓ Below avg)
  Q3 (Jul-Sep): 1.0423  (↑ Above avg)
  Q4 (Oct-Dec): 0.9684  (↓ Below avg)

==================================================
FORECAST: HUL Revenue — Next 4 Quarters
==================================================
  2024-Q1:  Trend=₹13,283  ×  Seasonal=1.0215  =  ₹13,569 Cr
  2024-Q2:  Trend=₹13,470  ×  Seasonal=0.9678  =  ₹13,037 Cr
  2024-Q3:  Trend=₹13,657  ×  Seasonal=1.0423  =  ₹14,233 Cr
  2024-Q4:  Trend=₹13,844  ×  Seasonal=0.9684  =  ₹13,407 Cr

Total FY2024 Forecast: ₹54,246 Cr

Section 3

Feature Engineering for Financial Models

Creating powerful predictors from raw financial data

🔧 The Feature Engineering Toolkit

Feature engineering is often the difference between a mediocre model and a great one. Here are the key techniques for financial data:

Feature Type	Description	Python Code	Why It Works
Lag Features	Previous period's value	`df['rev_lag1'] = df['rev'].shift(1)`	Revenue is autocorrelated — last quarter predicts this quarter
Rolling Means	Moving average of past N periods	`df['rev_sma4'] = df['rev'].shift(1).rolling(4).mean()`	Smooths noise, captures recent trend; shift(1) keeps the current period's value out of its own feature
YoY Change	Year-over-year percentage change	`df['rev_yoy'] = df['rev'].pct_change(4)`	Removes seasonality, shows true growth
QoQ Change	Quarter-over-quarter change	`df['rev_qoq'] = df['rev'].pct_change(1)`	Captures momentum and acceleration
Cyclical Encoding	Sin/Cos transform for time	`np.sin(2 * np.pi * month / 12)`	Tells the model Q4 is "close to" Q1 (wraps around)
Interaction Terms	Feature × Feature	`df['gdp_x_it'] = df['gdp'] * df['it_idx']`	Captures combined effects
External Data	GDP, inflation, exchange rate	`pd.read_csv('macro_data.csv')`	Macro drivers affect all companies
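The "wraps around" claim for cyclical encoding can be checked directly. This small sketch places the four quarters on the unit circle and compares distances; it is only a demonstration of the geometry and is independent of any dataset in this session.

```python
import numpy as np

quarters = np.array([1, 2, 3, 4])
angle = 2 * np.pi * quarters / 4
points = np.column_stack([np.sin(angle), np.cos(angle)])   # one (sin, cos) pair per quarter

def dist(a, b):
    """Euclidean distance between quarters a and b in (sin, cos) space."""
    return float(np.linalg.norm(points[a - 1] - points[b - 1]))

raw_gap_q4_q1 = abs(4 - 1)   # as plain integers, Q4 and Q1 look 3 steps apart

print(f"Raw integer gap Q4 vs Q1:   {raw_gap_q4_q1}")
print(f"Encoded distance Q4 vs Q1:  {dist(4, 1):.4f}")
print(f"Encoded distance Q1 vs Q2:  {dist(1, 2):.4f}")
print(f"Encoded distance Q1 vs Q3:  {dist(1, 3):.4f}")
```

Adjacent quarters, including Q4 and Q1, end up equally far apart, while opposite quarters (Q1 and Q3) are the farthest apart.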

✏️ Worked Example 3: Feature Engineering Pipeline

Task: Build a feature-rich dataset from raw revenue data, then select the best predictors using correlation analysis.

Python — Feature Engineering Pipeline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def create_financial_features(df, revenue_col='Revenue', date_col='Quarter'):
    """Complete feature engineering pipeline for financial forecasting."""
    
    df = df.copy()
    
    # 1. Time-based features
    df['Quarter_Num'] = range(1, len(df) + 1)
    df['Year'] = df[date_col].dt.year
    df['Qtr'] = df[date_col].dt.quarter
    
    # Cyclical encoding
    df['Q_Sin'] = np.sin(2 * np.pi * df['Qtr'] / 4)
    df['Q_Cos'] = np.cos(2 * np.pi * df['Qtr'] / 4)
    
    # 2. Lag features (autocorrelation)
    df[f'{revenue_col}_Lag1'] = df[revenue_col].shift(1)   # Previous quarter
    df[f'{revenue_col}_Lag2'] = df[revenue_col].shift(2)   # 2 quarters ago
    df[f'{revenue_col}_Lag4'] = df[revenue_col].shift(4)   # Same quarter last year
    
    # 3. Rolling statistics, computed on past values only so the current
    #    quarter's revenue (the target) never leaks into its own features
    past = df[revenue_col].shift(1)
    df[f'{revenue_col}_SMA4'] = past.rolling(4).mean()    # Mean of previous 4 quarters
    df[f'{revenue_col}_SMA8'] = past.rolling(8).mean()    # Mean of previous 8 quarters
    df[f'{revenue_col}_STD4'] = past.rolling(4).std()     # Volatility of previous 4 quarters
    
    # 4. Change features (growth as of the previous quarter)
    df[f'{revenue_col}_QoQ'] = past.pct_change(1) * 100   # QoQ %
    df[f'{revenue_col}_YoY'] = past.pct_change(4) * 100   # YoY %
    
    # 5. Momentum features
    df[f'{revenue_col}_Accel'] = df[f'{revenue_col}_QoQ'].diff(1)    # Acceleration
    
    # 6. Interaction features (if external data available)
    if 'GDP_Growth' in df.columns and 'IT_Index' in df.columns:
        df['GDP_x_IT'] = df['GDP_Growth'] * df['IT_Index'] / 100
    
    return df

# ========================================
# Apply to TCS data from Example 1
# ========================================
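# `data` must be the TCS DataFrame from Example 1. Example 2 rebinds `data`
# to the HUL Series, so re-run Example 1's Step 1 first if you ran Example 2.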
data_featured = create_financial_features(data)

# Drop NaN rows
data_clean = data_featured.dropna()

# ========================================
# Correlation Analysis
# ========================================
feature_cols = [c for c in data_clean.columns if c not in 
               ['Quarter', 'Revenue', 'Q_Month']]

corr_matrix = data_clean[feature_cols + ['Revenue']].corr()

# Plot correlation heatmap
fig, ax = plt.subplots(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', 
            cmap='RdYlBu_r', center=0, ax=ax,
            square=True, linewidths=0.5)
ax.set_title('Feature Correlation Matrix — TCS Revenue', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('correlation_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

# ========================================
# Feature Selection: Top predictors
# ========================================
revenue_corr = corr_matrix['Revenue'].drop('Revenue').abs().sort_values(ascending=False)
print("TOP FEATURES by |Correlation| with Revenue:")
print("=" * 45)
for feat, corr_val in revenue_corr.head(10).items():
    bar = '█' * int(corr_val * 40)
    print(f"  {feat:25s}: {corr_val:.4f}  {bar}")

# Select features with |correlation| > 0.5
selected_features = revenue_corr[revenue_corr > 0.5].index.tolist()
print(f"\nSelected features (|corr| > 0.5): {len(selected_features)}")
print(f"Features: {selected_features}")

TOP FEATURES by |Correlation| with Revenue:
=============================================
  Revenue_Lag1             : 0.9992  ████████████████████████████████████████
  Revenue_SMA4             : 0.9990  ████████████████████████████████████████
  Revenue_Lag2             : 0.9984  ████████████████████████████████████████
  Revenue_Lag4             : 0.9964  ████████████████████████████████████████
  IT_Index                 : 0.9938  ████████████████████████████████████████
  Quarter_Num              : 0.9935  ████████████████████████████████████████
  Revenue_SMA8             : 0.9894  ███████████████████████████████████████
  GDP_Growth               : 0.8802  ███████████████████████████████████
  Revenue_YoY              : 0.6520  ██████████████████████████
  GDP_x_IT                 : 0.6105  ████████████████████████

Selected features (|corr| > 0.5): 10
Features: ['Revenue_Lag1', 'Revenue_SMA4', 'Revenue_Lag2', 'Revenue_Lag4', 'IT_Index', 'Quarter_Num', 'Revenue_SMA8', 'GDP_Growth', 'Revenue_YoY', 'GDP_x_IT']

Section 4

Model Evaluation & Selection

How to properly evaluate and compare forecasting models

📖 Time-Series Cross-Validation: NOT Random Shuffle!

The most common mistake in financial ML is randomly shuffling time-series data for train/test split. This causes data leakage — training on future data to predict the past.

Method	Description	When to Use
Simple Split	First 80% train, last 20% test	Quick prototyping
Walk-Forward	Refit at every step: train on data up to t, predict t+1, then move forward one period and repeat	Gold standard for time-series
Expanding Window	Walk-forward variant where the training set keeps growing: [1..t], then [1..t+1], and so on	When early data is still relevant
Rolling Window	Walk-forward variant where a fixed-size training window slides forward: [t−k+1..t]	When recent data matters more than old
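The difference between the expanding and rolling variants is easiest to see from the index sets they produce. The generator below is a hypothetical helper written for this illustration (it is not part of scikit-learn); it yields the training indices and the single test index for each walk-forward step.

```python
def walk_forward_splits(n_obs, min_train, window=None):
    """Yield (train_indices, test_index) pairs in chronological order.

    window=None -> expanding window (training set keeps growing)
    window=k    -> rolling window (only the most recent k observations)
    """
    for t in range(min_train, n_obs):
        start = 0 if window is None else max(0, t - window)
        yield list(range(start, t)), t

expanding = list(walk_forward_splits(8, min_train=4))
rolling = list(walk_forward_splits(8, min_train=4, window=4))

print("Expanding window:")
for train_idx, test_idx in expanding:
    print(f"  train={train_idx}  test={test_idx}")

print("Rolling window (size 4):")
for train_idx, test_idx in rolling:
    print(f"  train={train_idx}  test={test_idx}")
```

In both cases every training index is strictly earlier than the test index, which is what rules out leakage. scikit-learn's TimeSeriesSplit offers similar behaviour, with max_train_size giving the rolling variant.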

✏️ Worked Example 4: Comparing OLS vs Ridge vs Lasso

Task: Compare three regression models on TCS data using walk-forward validation.

Python — Model Comparison with Walk-Forward Validation

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt

def walk_forward_validation(df, features, target, model_class, 
                            min_train=12, **model_params):
    """Walk-forward validation for time-series."""
    predictions = []
    actuals = []
    
    for i in range(min_train, len(df)):
        # Train on all data up to point i
        X_train = df.iloc[:i][features]
        y_train = df.iloc[:i][target]
        
        # Predict point i
        X_test = df.iloc[i:i+1][features]
        y_test = df.iloc[i:i+1][target]
        
        # Scale features
        scaler = StandardScaler()
        X_train_s = scaler.fit_transform(X_train)
        X_test_s = scaler.transform(X_test)
        
        # Train and predict
        model = model_class(**model_params)
        model.fit(X_train_s, y_train)
        pred = model.predict(X_test_s)[0]
        
        predictions.append(pred)
        actuals.append(y_test.values[0])
    
    return np.array(predictions), np.array(actuals)

# Prepare data (using data_clean from earlier examples)
features = ['GDP_Growth', 'IT_Index', 'Quarter_Num', 'Q_Sin', 'Q_Cos',
            'Revenue_Lag1', 'Revenue_Lag4']
target = 'Revenue'

# Run walk-forward for all 3 models
results = {}
for name, cls, params in [
    ('OLS', LinearRegression, {}),
    ('Ridge', Ridge, {'alpha': 1.0}),
    ('Lasso', Lasso, {'alpha': 100.0})
]:
    preds, actuals = walk_forward_validation(
        data_clean, features, target, cls, min_train=12, **params)
    
    results[name] = {
        'R²': r2_score(actuals, preds),
        'RMSE': np.sqrt(mean_squared_error(actuals, preds)),
        'MAPE': np.mean(np.abs((actuals - preds) / actuals)) * 100,
        'Predictions': preds,
        'Actuals': actuals
    }

# ========================================
# Compare Results
# ========================================
print("MODEL COMPARISON — Walk-Forward Validation")
print("=" * 55)
print(f"{'Model':<10} {'R²':>8} {'RMSE (₹ Cr)':>14} {'MAPE (%)':>10}")
print("-" * 55)
for name, r in results.items():
    print(f"{name:<10} {r['R²']:>8.4f} {r['RMSE']:>14,.0f} {r['MAPE']:>10.2f}")

best = min(results.items(), key=lambda x: x[1]['RMSE'])
print(f"\n🏆 Best Model: {best[0]} (lowest RMSE)")

# ========================================
# Visualization
# ========================================
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Actual vs Predicted
test_quarters = data_clean['Quarter'].iloc[12:]
for name, r in results.items():
    style = '-' if name == 'OLS' else '--' if name == 'Ridge' else ':'
    axes[0].plot(test_quarters, r['Predictions'], style, label=name, linewidth=2)
axes[0].plot(test_quarters, results['OLS']['Actuals'], 'ko', markersize=4, 
             label='Actual', alpha=0.6)
axes[0].set_title('Walk-Forward Predictions', fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
axes[0].tick_params(axis='x', rotation=45)

# Plot 2: Error distribution
for name, r in results.items():
    errors = (r['Actuals'] - r['Predictions']) / r['Actuals'] * 100
    axes[1].hist(errors, bins=10, alpha=0.5, label=f'{name} (mean={errors.mean():.2f}%)')
axes[1].set_title('Prediction Error Distribution (%)', fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

MODEL COMPARISON — Walk-Forward Validation
=======================================================
Model            R²    RMSE (₹ Cr)   MAPE (%)
-------------------------------------------------------
OLS          0.9991            482       0.71
Ridge        0.9992            456       0.67
Lasso        0.9988            598       0.89

🏆 Best Model: Ridge (lowest RMSE)

Why Ridge Wins

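OLS coefficients become unstable when features are highly correlated with one another. Here Revenue_Lag1, Revenue_Lag4, IT_Index, and Quarter_Num all trend together (each is ~0.99 correlated with Revenue), so OLS can overfit. Ridge regression adds an L2 penalty that shrinks coefficients, which reduces that instability. Lasso (L1 penalty) with alpha = 100 is too aggressive here: it zeroes out useful features.

Practical takeaway: For financial forecasting with correlated features, Ridge is often a sensible default, but confirm it with walk-forward validation on your own data.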
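The shrinkage effect can be sketched with the closed-form Ridge solution β = (XᵀX + αI)⁻¹Xᵀy on standardized features. The data below is synthetic: two nearly identical features stand in for the correlated lag columns, and the function name ridge_fit is hypothetical. Setting alpha = 0 recovers OLS.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30

# Two almost identical features, mimicking highly correlated predictors
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# Standardize features and center the target (so no intercept is needed)
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y_c = y - y.mean()

def ridge_fit(X, y, alpha):
    """Closed-form Ridge coefficients; alpha=0 gives ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

beta_ols = ridge_fit(X, y_c, alpha=0.0)
beta_ridge = ridge_fit(X, y_c, alpha=1.0)

print("OLS coefficients:  ", np.round(beta_ols, 3))
print("Ridge coefficients:", np.round(beta_ridge, 3))
```

OLS is free to split the effect between the two near-duplicate features in an arbitrary, often offsetting way, while the L2 penalty pulls the Ridge coefficients toward similar, smaller values.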

📋 Practical Guidelines: When to Use What

Scenario	Recommended Approach	Expected Accuracy
Mature company, stable growth (IT, FMCG)	Linear regression with lag features	MAPE 1–3%
Cyclical company (Auto, Metals)	Ridge regression + GDP/cyclical features	MAPE 3–8%
High-growth startup	Multiple regression + industry metrics	MAPE 5–15%
Seasonal business (Retail, Travel)	Time-series decomposition + seasonal indices	MAPE 2–5%
Volatile / event-driven	Scenario analysis > ML (too unpredictable)	N/A — use scenarios

⚠️Danger: Overfitting

If your model has R² = 0.9999 on training data but performs poorly on new data, you've overfit. Always use walk-forward validation and report test set metrics, not training metrics. A model with R² = 0.95 on the test set is more trustworthy than one with R² = 0.9999 only on training.
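A quick way to see overfitting is to fit a model with almost as many features as observations on data where there is nothing to learn. In this sketch both the features and the target are pure random noise (an assumption made for illustration), yet the training R² still looks strong.

```python
import numpy as np

rng = np.random.default_rng(7)
n_train, n_test, n_features = 20, 20, 18

X = rng.normal(size=(n_train + n_test, n_features))   # pure-noise features
y = rng.normal(size=n_train + n_test)                 # target unrelated to X

def add_intercept(A):
    return np.column_stack([np.ones(len(A)), A])

def r2(y_true, y_hat):
    return 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Chronological-style split: first block for training, second block for testing
X_tr, X_te = add_intercept(X[:n_train]), add_intercept(X[n_train:])
y_tr, y_te = y[:n_train], y[n_train:]

beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
train_r2 = r2(y_tr, X_tr @ beta)
test_r2 = r2(y_te, X_te @ beta)

print(f"Training R²: {train_r2:.4f}")
print(f"Test R²:     {test_r2:.4f}")
```

The gap between the two numbers is the warning sign: the model has memorized noise in the training rows, and that does not carry over to unseen data.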

Python Lab

Hands-On Practice Exercises

🏋️ Exercise 1: Revenue Prediction for Infosys (25 min)

Objective: Build a linear regression model to predict Infosys quarterly revenue

Assumption	Value
Company	Infosys (INFY.NS)
Data Period	2016-Q1 to 2023-Q4 (32 quarters)
Features	GDP Growth, Nifty IT Index, Quarter encoding, Lag features
Train/Test Split	24 quarters train / 8 quarters test

Tasks:

Fetch Infosys quarterly revenue using yfinance
Engineer features: lags, rolling means, cyclical encoding
Train a LinearRegression model
Report R², RMSE, and MAPE on the test set

Python — Infosys Revenue Prediction

import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Fetch Infosys quarterly revenue
ticker = yf.Ticker("INFY.NS")
income_stmt = ticker.quarterly_income_stmt

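# Extract revenue (yfinance lists the most recent quarter first, so sort chronologically).
# Note: yfinance often returns only the last few quarters. For the full history
# in the table above, load revenue from another source (e.g. a CSV built from filings).
revenue = income_stmt.loc['Total Revenue'].dropna().astype(float).sort_index()
revenue = revenue / 1e7  # Convert to ₹ Cr (assumes the statement is reported in ₹)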

# Create DataFrame
df = pd.DataFrame({
    'Quarter': revenue.index,
    'Revenue': revenue.values
})
df['Quarter_Num'] = range(1, len(df) + 1)
df['Qtr'] = df['Quarter'].dt.quarter
df['Q_Sin'] = np.sin(2 * np.pi * df['Qtr'] / 4)
df['Q_Cos'] = np.cos(2 * np.pi * df['Qtr'] / 4)
df['Revenue_Lag1'] = df['Revenue'].shift(1)
df['Revenue_Lag4'] = df['Revenue'].shift(4)
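# Mean of the previous 4 quarters; shift(1) keeps the current quarter (the target) out
df['Revenue_SMA4'] = df['Revenue'].shift(1).rolling(4).mean()

df = df.dropna().reset_index(drop=True)
if len(df) < 16:
    raise ValueError(f"Only {len(df)} usable quarters after feature engineering; "
                     "need at least 16 for an 8-quarter test set.")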

# Train/test split (chronological!)
train = df.iloc[:-8]
test = df.iloc[-8:]

features = ['Quarter_Num', 'Q_Sin', 'Q_Cos', 'Revenue_Lag1', 
            'Revenue_Lag4', 'Revenue_SMA4']

model = LinearRegression()
model.fit(train[features], train['Revenue'])
preds = model.predict(test[features])

print(f"Test R²:   {r2_score(test['Revenue'], preds):.4f}")
print(f"Test RMSE: ₹{np.sqrt(mean_squared_error(test['Revenue'], preds)):,.0f} Cr")
mape = np.mean(np.abs((test['Revenue'] - preds) / test['Revenue'])) * 100
print(f"Test MAPE: {mape:.2f}%")

🏋️ Exercise 2: Time-Series Decomposition of Stock Prices (20 min)

Objective: Download 5 years of monthly stock data for Reliance Industries, decompose into trend/seasonality/residual

Tasks:

Fetch Reliance monthly closing prices using yfinance
Apply multiplicative decomposition (period=12)
Plot the 4-component decomposition chart
Calculate and interpret seasonal indices — which months are strongest/weakest?

Python — Stock Price Decomposition

import yfinance as yf
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Fetch 5 years of monthly data
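data = yf.download("RELIANCE.NS", period="5y", interval="1mo")
# Some yfinance versions return MultiIndex columns even for a single ticker;
# squeeze() turns a one-column result into a Series either way.
prices = data['Close'].squeeze().dropna()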

# Decompose (multiplicative — prices grow over time)
decomp = seasonal_decompose(prices, model='multiplicative', period=12)

# Plot
fig, axes = plt.subplots(4, 1, figsize=(14, 12), sharex=True)
decomp.observed.plot(ax=axes[0], title='Observed (Reliance Close Price)')
decomp.trend.plot(ax=axes[1], title='Trend')
decomp.seasonal.plot(ax=axes[2], title='Seasonal')
decomp.resid.plot(ax=axes[3], title='Residual')
plt.tight_layout()
plt.savefig('reliance_decomposition.png', dpi=150)
plt.show()

# Seasonal indices by month
seasonal = decomp.seasonal
monthly_avg = seasonal.groupby(seasonal.index.month).mean()
print("Monthly Seasonal Indices:")
for month, val in monthly_avg.items():
    effect = "↑" if val > 1 else "↓"
    print(f"  Month {month:2d}: {val:.4f} {effect}")

🏋️ Exercise 3: Feature Engineering Challenge (20 min)

Objective: Start with a basic model (R² ~0.65) and improve it to R² >0.90 through feature engineering

Starting data: 20 quarters of revenue with only GDP growth as a feature

Tasks:

Start with GDP_Growth as the only feature → record R²
Add Quarter_Num (time trend) → record new R²
Add lag features (Revenue_Lag1, Revenue_Lag4) → record R²
Add rolling means and cyclical encoding → record R²
Plot the improvement curve (R² vs number of features)

Python — Incremental Feature Engineering

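import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Feature engineering progression (uses data_clean from Worked Example 3)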
feature_sets = {
    '1. GDP Only': ['GDP_Growth'],
    '2. + Time Trend': ['GDP_Growth', 'Quarter_Num'],
    '3. + Lag Features': ['GDP_Growth', 'Quarter_Num', 'Revenue_Lag1', 'Revenue_Lag4'],
    '4. + Seasonality': ['GDP_Growth', 'Quarter_Num', 'Revenue_Lag1', 
                         'Revenue_Lag4', 'Q_Sin', 'Q_Cos'],
    '5. + Rolling Stats': ['GDP_Growth', 'Quarter_Num', 'Revenue_Lag1',
                           'Revenue_Lag4', 'Q_Sin', 'Q_Cos', 'Revenue_SMA4'],
    '6. Full Set': ['GDP_Growth', 'IT_Index', 'Quarter_Num', 'Q_Sin', 'Q_Cos',
                    'Revenue_Lag1', 'Revenue_Lag4', 'Revenue_SMA4']
}

print("FEATURE ENGINEERING PROGRESSION")
print("=" * 50)
r2_values = []
for name, features in feature_sets.items():
    df_temp = data_clean[features + ['Revenue']].dropna()
    X = df_temp[features]
    y = df_temp['Revenue']
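    model = LinearRegression().fit(X, y)
    # In-sample R²: it cannot decrease as features are added, so confirm any
    # gain with a chronological train/test split before trusting it
    r2 = r2_score(y, model.predict(X))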
    r2_values.append(r2)
    bar = '█' * int(r2 * 50)
    print(f"{name:30s}: R²={r2:.4f}  {bar}")

# Plot progression
plt.figure(figsize=(10, 5))
plt.plot(range(len(r2_values)), r2_values, 'b-o', linewidth=2)
plt.xticks(range(len(r2_values)), 
           [n.split('. ')[1] for n in feature_sets.keys()], 
           rotation=30, ha='right', fontsize=9)
plt.ylabel('R²')
plt.title('R² Improvement with Feature Engineering', fontweight='bold')
plt.grid(alpha=0.3)
plt.ylim(0.5, 1.0)
plt.tight_layout()
plt.savefig('feature_progression.png', dpi=150)
plt.show()

🏋️ Exercise 4 (Advanced): Automated Forecasting Pipeline (20 min)

Objective: Build a reusable pipeline that takes any ticker, fetches data, engineers features, compares two models (OLS and Ridge) with walk-forward validation, and returns a forecast from the better one

Python — Automated Forecasting Pipeline

import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score, mean_squared_error
from statsmodels.tsa.seasonal import seasonal_decompose

def auto_forecast(ticker, forecast_periods=4):
    """Automated revenue forecasting pipeline."""
    
    print(f"\n{'='*60}")
    print(f"AUTOMATED FORECAST: {ticker}")
    print(f"{'='*60}")
    
    # 1. Fetch data
    stock = yf.Ticker(ticker)
    revenue = stock.quarterly_income_stmt.loc['Total Revenue']
    revenue = revenue.dropna().astype(float).sort_index()
    revenue = revenue / 1e7  # ₹ Cr (assumes the statement is reported in ₹)
    
    df = pd.DataFrame({'Revenue': revenue.values}, index=revenue.index)
    df['Quarter_Num'] = range(1, len(df) + 1)
    df['Qtr'] = df.index.quarter
    df['Q_Sin'] = np.sin(2 * np.pi * df['Qtr'] / 4)
    df['Q_Cos'] = np.cos(2 * np.pi * df['Qtr'] / 4)
    df['Revenue_Lag1'] = df['Revenue'].shift(1)
    df['Revenue_Lag4'] = df['Revenue'].shift(4)
    # Mean of the previous 4 quarters (shifted so the target never leaks in)
    df['Revenue_SMA4'] = df['Revenue'].shift(1).rolling(4).mean()
    
    df = df.dropna()
    features = ['Quarter_Num', 'Q_Sin', 'Q_Cos', 'Revenue_Lag1', 
                'Revenue_Lag4', 'Revenue_SMA4']
    
    if len(df) < 12:
        raise ValueError(f"{ticker}: only {len(df)} usable quarters; need at least 12. "
                         "yfinance often returns just the last few quarters.")
    
    # 2. Train models (walk-forward over the last 4 quarters)
    best_model, best_name, best_rmse = None, None, float('inf')
    
    for name, cls, params in [('OLS', LinearRegression, {}), 
                               ('Ridge', Ridge, {'alpha': 1.0})]:
        preds, actuals = [], []
        for i in range(len(df) - 4, len(df)):
            model = cls(**params)
            model.fit(df.iloc[:i][features], df.iloc[:i]['Revenue'])
            pred = model.predict(df.iloc[i:i+1][features])[0]
            preds.append(pred)
            actuals.append(df.iloc[i]['Revenue'])
        
        rmse = np.sqrt(mean_squared_error(actuals, preds))
        mape = np.mean(np.abs((np.array(actuals) - np.array(preds)) / np.array(actuals))) * 100
        
        print(f"  {name:10s}: RMSE=₹{rmse:>10,.0f}  MAPE={mape:.2f}%")
        
        if rmse < best_rmse:
            best_rmse = rmse
            best_model = cls(**params)
            best_name = name
    
    # 3. Retrain best model on all data and forecast
    best_model.fit(df[features], df['Revenue'])
    print(f"  Best model: {best_name}")
    
    # Recursive forecast: each prediction is appended to the history so the
    # next step's lag and rolling features are built from it
    history = df['Revenue'].tolist()
    last_num = int(df['Quarter_Num'].iloc[-1])
    forecasts = []
    for q in range(1, forecast_periods + 1):
        next_q = df.index[-1] + pd.DateOffset(months=3 * q)  # advance one quarter per step
        row = pd.DataFrame([{
            'Quarter_Num': last_num + q,
            'Q_Sin': np.sin(2 * np.pi * next_q.quarter / 4),
            'Q_Cos': np.cos(2 * np.pi * next_q.quarter / 4),
            'Revenue_Lag1': history[-1],
            'Revenue_Lag4': history[-4],
            'Revenue_SMA4': np.mean(history[-4:]),
        }])[features]
        
        pred = best_model.predict(row)[0]
        forecasts.append(pred)
        history.append(pred)
        print(f"  Forecast Q+{q}: ₹{pred:>10,.0f} Cr")
    
    return forecasts

# Test with multiple companies
auto_forecast("TCS.NS")
auto_forecast("INFY.NS")
auto_forecast("HINDUNILVR.NS")

Summary

Key Takeaways

📝 What We Covered Today

Built multiple linear regression models to predict revenue using GDP, industry indices, and lag features, reaching a test R² above 0.99 on simulated data
Learned time-series decomposition: breaking revenue into trend, seasonal, and residual components using statsmodels
Created powerful features: lag values, rolling means, cyclical encoding, YoY changes, and interaction terms
Compared OLS vs Ridge vs Lasso: Ridge had the lowest RMSE here because it copes better with correlated features
Used walk-forward validation, the correct way to evaluate time-series models (never shuffle!)
Built an automated forecasting pipeline that takes a company ticker as input

📚Next Session

Session 22: Monte Carlo Simulation
We'll move from point forecasts to probability distributions — simulating thousands of scenarios to quantify risk and uncertainty in our models. Topics: random number generation, distribution fitting, and confidence intervals for DCF valuations.