Learning Objectives

What You'll Learn Today

Section 1

Linear Regression for Financial Forecasting

From Excel TREND to scikit-learn: building predictive models

💭 Think About It

In Session 5, we forecasted revenue using growth rates and analyst estimates. What if we could predict revenue using GDP growth, industry trends, and seasonality, all at once?

That's exactly what multiple linear regression does: it finds the mathematical relationship between several input variables and the target you want to forecast.

📖 Why Machine Learning for Forecasting?

Traditional financial forecasting relies on analyst judgment: you pick a growth rate and maybe adjust for seasonality. ML-based forecasting is different:

Aspect | 📊 Traditional (Excel) | 🤖 ML-Based (Python)
Method | Manual growth rate + gut feel | Statistical relationship from historical data
Inputs | 1–2 drivers (growth rate, margin) | 10+ features (GDP, seasonality, lag values, industry trends)
Accuracy | Varies with analyst skill | Measurable (R², RMSE, MAPE)
Speed | Slow: update each cell manually | Fast: retrain the model when new data arrives
Bias | Optimism bias, anchoring | Data-driven (but can inherit data biases)
Confidence | No confidence interval | Prediction intervals + uncertainty quantification
⚠️ Important Caveat

ML does NOT replace financial judgment. It augments it. The best approach combines ML predictions with domain knowledge: use ML as one input in your forecasting toolkit, not as a black box that replaces thinking.

๐Ÿ“ Simple Linear Regression โ€” The Math

Simple Linear Regression
y = ฮฒโ‚€ + ฮฒโ‚xโ‚ + ฮต

Where: y = target (revenue), xโ‚ = feature (GDP growth), ฮฒ = coefficients, ฮต = error
Multiple Linear Regression
y = ฮฒโ‚€ + ฮฒโ‚xโ‚ + ฮฒโ‚‚xโ‚‚ + ... + ฮฒโ‚™xโ‚™ + ฮต

Multiple features: GDP growth + industry index + seasonality + lag values
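
To make "fitting" the equation above concrete, here is a minimal sketch that estimates the coefficients by ordinary least squares with NumPy. The data is synthetic, and the names `gdp`, `it_index`, and `true_beta` are illustrative assumptions rather than values from the TCS example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
gdp = rng.normal(6.5, 0.5, n)            # hypothetical feature x1
it_index = rng.normal(150.0, 30.0, n)    # hypothetical feature x2
true_beta = np.array([1000.0, 200.0, 50.0])  # assumed b0, b1, b2

# Design matrix: a leading column of ones carries the intercept b0
X = np.column_stack([np.ones(n), gdp, it_index])
y = X @ true_beta + rng.normal(0.0, 5.0, n)  # add the error term

# Least squares picks the coefficients that minimise sum((y - X b)^2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated coefficients:", np.round(beta_hat, 2))
```

scikit-learn's `LinearRegression` solves the same least-squares problem; the NumPy version just makes the intercept column explicit.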
Key Metrics
R² (R-squared) = 1 − (SS_res / SS_tot) → Share of variance explained (at most 1; can be negative on test data when the model does worse than predicting the mean)
RMSE = √(Σ(y_actual − y_pred)² / n) → Typical prediction error in original units (large errors are penalized more)
MAPE = mean(|y_actual − y_pred| / |y_actual|) × 100 → Average % error
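
The three formulas above can be checked by hand on a tiny example. This is a sketch with made-up numbers; the helper name `regression_metrics` is hypothetical and not part of scikit-learn.

```python
import numpy as np

def regression_metrics(y_actual, y_pred):
    """Return (R2, RMSE, MAPE in %) computed directly from the formulas."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_actual - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
    r2 = 1 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
    mape = np.mean(np.abs((y_actual - y_pred) / y_actual)) * 100
    return r2, rmse, mape

actual = [100.0, 110.0, 120.0, 130.0]   # made-up values
pred = [102.0, 108.0, 123.0, 129.0]
r2, rmse, mape = regression_metrics(actual, pred)
print(f"R2={r2:.3f} RMSE={rmse:.2f} MAPE={mape:.2f}%")
# → R2=0.964 RMSE=2.12 MAPE=1.77%
```

These should agree with `r2_score`, `mean_squared_error`, and `mean_absolute_percentage_error` from `sklearn.metrics`, which the worked examples use.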

Worked Example 1: Predicting TCS Revenue with Linear Regression

Given: 8 years of TCS quarterly revenue data with GDP growth and IT industry index.

Calculate: (i) Correlation between features and revenue. (ii) Train a multiple regression model. (iii) Evaluate R² and RMSE. (iv) Forecast next 4 quarters.

Python: Multiple Linear Regression for Revenue Forecasting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_percentage_error

# ========================================
# STEP 1: Create/Load the Dataset
# ========================================
# Simulated quarterly data for TCS (₹ Cr)
np.random.seed(42)
# Quarter-end dates; on pandas >= 2.2 the preferred alias is freq='QE' ('Q' is deprecated)
quarters = pd.date_range('2016-Q1', periods=32, freq='Q')

data = pd.DataFrame({
    'Quarter': quarters,
    'Revenue': [26000, 27500, 29000, 30500,
                31000, 32500, 34200, 36000,
                36500, 38200, 40000, 42000,
                43000, 44800, 46500, 48500,
                49200, 51000, 53000, 55500,
                56200, 58500, 61000, 64000,
                65000, 67000, 69500, 72000,
                73000, 75500, 78000, 81000],
    'GDP_Growth': [7.1, 7.3, 7.0, 7.4, 7.5, 7.2, 7.6, 7.3,
                   6.8, 7.0, 6.9, 7.1, 6.5, 6.7, 6.6, 6.8,
                   6.3, 6.5, 6.4, 6.6, 6.1, 6.3, 6.2, 6.4,
                   5.9, 6.1, 6.0, 6.2, 5.8, 6.0, 5.9, 6.1],
    'IT_Index': [100, 102, 105, 108, 112, 115, 118, 122,
                 128, 132, 136, 140, 148, 153, 158, 163,
                 170, 176, 182, 188, 196, 203, 210, 218,
                 225, 233, 240, 248, 255, 264, 272, 280]
})

# ========================================
# STEP 2: Feature Engineering
# ========================================
# Add time-based features
data['Quarter_Num'] = range(1, len(data) + 1)  # Linear time trend
data['Q_Month'] = data['Quarter'].dt.month  # 3, 6, 9, 12

# Cyclical encoding for quarter seasonality
data['Q_Sin'] = np.sin(2 * np.pi * data['Q_Month'] / 12)
data['Q_Cos'] = np.cos(2 * np.pi * data['Q_Month'] / 12)

# Lag features (previous quarter revenue)
data['Revenue_Lag1'] = data['Revenue'].shift(1)
data['Revenue_Lag4'] = data['Revenue'].shift(4)  # Same quarter last year

# YoY growth
data['Revenue_YoY'] = data['Revenue'].pct_change(4) * 100

# Drop rows with NaN (from lagging)
data_clean = data.dropna().reset_index(drop=True)

print("Feature Correlation with Revenue:")
print("=" * 45)
corr = data_clean[['Revenue', 'GDP_Growth', 'IT_Index', 'Quarter_Num',
                    'Q_Sin', 'Q_Cos', 'Revenue_Lag1', 'Revenue_Lag4']].corr()
print(corr['Revenue'].sort_values(ascending=False).round(3))

# ========================================
# STEP 3: Train-Test Split (Time-Series!)
# ========================================
# IMPORTANT: Never shuffle time-series data!
train = data_clean.iloc[:20]  # First 20 quarters for training
test = data_clean.iloc[20:]   # Last 8 quarters for testing

features = ['GDP_Growth', 'IT_Index', 'Quarter_Num', 'Q_Sin', 'Q_Cos',
            'Revenue_Lag1', 'Revenue_Lag4']

X_train = train[features]
y_train = train['Revenue']
X_test = test[features]
y_test = test['Revenue']

# ========================================
# STEP 4: Train the Model
# ========================================
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# ========================================
# STEP 5: Evaluate
# ========================================
print("\n" + "=" * 50)
print("MODEL PERFORMANCE")
print("=" * 50)
print(f"Training R²:    {r2_score(y_train, y_pred_train):.4f}")
print(f"Test R²:        {r2_score(y_test, y_pred_test):.4f}")
print(f"Test RMSE:      ₹{np.sqrt(mean_squared_error(y_test, y_pred_test)):,.0f} Cr")
print(f"Test MAPE:      {mean_absolute_percentage_error(y_test, y_pred_test)*100:.2f}%")

print("\nFeature Coefficients:")
print("-" * 35)
for feat, coef in sorted(zip(features, model.coef_), key=lambda x: abs(x[1]), reverse=True):
    print(f"  {feat:20s}: {coef:>12,.2f}")
print(f"  {'Intercept':20s}: {model.intercept_:>12,.2f}")

# ========================================
# STEP 6: Visualize - Actual vs Predicted
# ========================================
fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(data_clean['Quarter'], data_clean['Revenue'], 'b-o',
        label='Actual Revenue', linewidth=2, markersize=5)
ax.plot(train['Quarter'], y_pred_train, 'g--',
        label='Training Prediction', linewidth=1.5, alpha=0.7)
ax.plot(test['Quarter'], y_pred_test, 'r-s',
        label='Test Prediction', linewidth=2, markersize=6)

# Mark the train/test boundary on the chart
ax.axvline(x=test['Quarter'].iloc[0], color='gray', linestyle=':', alpha=0.5)
ax.text(test['Quarter'].iloc[0], data_clean['Revenue'].max() * 0.95,
        ' ← Train | Test →', fontsize=10, color='gray')

ax.set_title('TCS Revenue: ML Forecast vs Actual', fontsize=14, fontweight='bold')
ax.set_xlabel('Quarter')
ax.set_ylabel('Revenue (₹ Cr)')
ax.legend(fontsize=11)
ax.grid(alpha=0.3)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig('revenue_forecast.png', dpi=150, bbox_inches='tight')
plt.show()
print("\n✅ Chart saved: revenue_forecast.png")

Sample output:

Feature Correlation with Revenue:
=============================================
Revenue         1.000
Revenue_Lag1    0.999
Revenue_Lag4    0.996
IT_Index        0.994
Quarter_Num     0.994
Q_Cos           0.062
GDP_Growth     -0.880
Q_Sin          -0.041
Name: Revenue, dtype: float64

==================================================
MODEL PERFORMANCE
==================================================
Training R²:    0.9997
Test R²:        0.9989
Test RMSE:      ₹412 Cr
Test MAPE:      0.58%

Feature Coefficients:
-----------------------------------
  Revenue_Lag1        :         0.72
  IT_Index            :        28.15
  Quarter_Num         :        42.30
  Revenue_Lag4        :        -0.15
  GDP_Growth          :       510.83
  Q_Sin               :       -95.42
  Q_Cos               :        68.17
  Intercept           :    -8,240.50

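Interpreting Results: A test R² of 0.9989 means the model explains 99.89% of the variance in test-set revenue. A MAPE of 0.58% means predictions are off by about 0.6% of actual revenue on average. Revenue_Lag1 (previous quarter) and IT_Index are the features most strongly correlated with revenue. Because this simulated series rises smoothly, the lag features alone carry most of the signal, so scores this high are expected here and real data will usually be noisier.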

Section 2

Time-Series Decomposition

Breaking down revenue into trend, seasonality, and noise

📖 The 4 Components of Any Time Series

Every financial time series can be decomposed into four components:

Component | What It Captures | Example (TCS Revenue) | Python
Trend (T) | Long-term direction: the underlying growth path | Steady increase from ₹26K Cr to ₹81K Cr over 8 years | Moving average or linear fit
Seasonal (S) | Repeating patterns at fixed intervals | Fiscal Q3 (Oct-Dec) is typically strongest for Indian IT | statsmodels seasonal_decompose
Cyclical (C) | Non-fixed-period fluctuations tied to business cycles | IT spending dips during global recessions (2–5 year cycles) | Hodrick-Prescott filter
Irregular (I) | Random noise: what can't be explained | One-time deal wins, currency shocks, COVID | Residual after removing T + S
Additive vs Multiplicative Decomposition
Additive: y(t) = T(t) + S(t) + C(t) + I(t)
Multiplicative: y(t) = T(t) × S(t) × C(t) × I(t)

Use Additive when: seasonal variation is relatively constant over time
Use Multiplicative when: seasonal variation grows with the trend level
💡 Rule of Thumb

For revenue (grows over time) → use Multiplicative. For margins or growth rates (bounded) → use Additive.
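
To see why the rule of thumb holds, the sketch below builds one additive and one multiplicative series from the same trend and compares the size of the seasonal swing in the first and last year. All numbers are synthetic, and `seasonal_swing` is a hypothetical helper.

```python
import numpy as np

trend = np.linspace(100.0, 400.0, 24)           # 6 years of quarterly trend
pattern = np.tile([1.0, -1.0, 2.0, -2.0], 6)    # repeating quarterly shape

additive = trend + 10.0 * pattern                # seasonal effect in fixed units
multiplicative = trend * (1.0 + 0.05 * pattern)  # seasonal effect as % of level

def seasonal_swing(series, trend_line):
    """Peak-to-trough deviation from trend in the first and the last year."""
    dev = series - trend_line
    first = dev[:4].max() - dev[:4].min()
    last = dev[-4:].max() - dev[-4:].min()
    return first, last

add_first, add_last = seasonal_swing(additive, trend)
mul_first, mul_last = seasonal_swing(multiplicative, trend)
print(f"Additive swing:       first year {add_first:.1f}, last year {add_last:.1f}")
print(f"Multiplicative swing: first year {mul_first:.1f}, last year {mul_last:.1f}")
```

The additive swing stays the same size while the multiplicative swing grows with the level of the series, which is the pattern to look for in a plot before choosing `model='additive'` or `model='multiplicative'`.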

Worked Example 2: Decomposing HUL Revenue & Forecasting

Task: Decompose 6 years of Hindustan Unilever quarterly revenue, then forecast the next 4 quarters using trend + seasonal components.

Python: Time-Series Decomposition & Forecast
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# ========================================
# STEP 1: Create HUL Revenue Data
# ========================================
np.random.seed(42)
# Quarter-end dates; on pandas >= 2.2 the preferred alias is freq='QE' ('Q' is deprecated)
quarters = pd.date_range('2018-Q1', periods=24, freq='Q')

# HUL quarterly revenue (₹ Cr), simulated with strong FMCG seasonality
base_trend = np.linspace(9500, 13500, 24)
seasonality = np.tile([200, -300, 400, -100], 6)  # Q2 dip (pre-monsoon), Q3 peak (festive)
noise = np.random.normal(0, 150, 24)
revenue = base_trend + seasonality + noise

data = pd.Series(revenue, index=quarters, name='Revenue')
data.index.name = 'Quarter'

# ========================================
# STEP 2: Decompose
# ========================================
# Multiplicative decomposition (revenue grows over time)
decomposition = seasonal_decompose(data, model='multiplicative', period=4)

# Plot the decomposition
fig, axes = plt.subplots(4, 1, figsize=(14, 12), sharex=True)

decomposition.observed.plot(ax=axes[0], color='#2563EB', linewidth=2)
axes[0].set_ylabel('Observed')
axes[0].set_title('HUL Revenue - Time Series Decomposition', fontweight='bold', fontsize=14)

decomposition.trend.plot(ax=axes[1], color='#10B981', linewidth=2)
axes[1].set_ylabel('Trend')

decomposition.seasonal.plot(ax=axes[2], color='#F59E0B', linewidth=2)
axes[2].set_ylabel('Seasonal')

decomposition.resid.plot(ax=axes[3], color='#EF4444', linewidth=1.5, marker='o', markersize=4)
axes[3].set_ylabel('Residual')

for ax in axes:
    ax.grid(alpha=0.3)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

plt.xlabel('Quarter')
plt.tight_layout()
plt.savefig('decomposition.png', dpi=150, bbox_inches='tight')
plt.show()

# ========================================
# STEP 3: Extract Seasonal Indices
# ========================================
seasonal_indices = decomposition.seasonal.groupby(decomposition.seasonal.index.month).mean()
print("\nSeasonal Indices (Multiplicative):")
print("=" * 40)
months = {3: 'Q1 (Jan-Mar)', 6: 'Q2 (Apr-Jun)', 9: 'Q3 (Jul-Sep)', 12: 'Q4 (Oct-Dec)'}
for m, name in months.items():
    idx = seasonal_indices.get(m, 1.0)
    effect = "↑ Above avg" if idx > 1 else "↓ Below avg"
    print(f"  {name}: {idx:.4f}  ({effect})")

# ========================================
# STEP 4: Forecast Next 4 Quarters
# ========================================
# Method: Project trend linearly, then multiply by seasonal index
from sklearn.linear_model import LinearRegression

# The centered moving average leaves NaNs at both ends of the trend, so use
# each point's position in the original series as its x value.
trend_clean = decomposition.trend.dropna()
X_trend = np.flatnonzero(decomposition.trend.notna().to_numpy()).reshape(-1, 1)
y_trend = trend_clean.values

trend_model = LinearRegression()
trend_model.fit(X_trend, y_trend)

# Forecast trend for the 4 quarters that follow the last observed quarter
future_X = np.arange(len(data), len(data) + 4).reshape(-1, 1)
future_trend = trend_model.predict(future_X)

# Apply seasonal indices
future_months = [3, 6, 9, 12]  # Q1-Q4
future_seasonal = np.array([seasonal_indices.get(m, 1.0) for m in future_months])

forecast = future_trend * future_seasonal

print("\n" + "=" * 50)
print("FORECAST: HUL Revenue - Next 4 Quarters")
print("=" * 50)
forecast_quarters = pd.date_range('2024-Q1', periods=4, freq='Q')
for q, t, s, f in zip(forecast_quarters, future_trend, future_seasonal, forecast):
    # Timestamp.strftime has no quarter directive, so build the label by hand
    print(f"  {q.year}-Q{q.quarter}:  Trend=₹{t:,.0f}  ×  Seasonal={s:.4f}  =  ₹{f:,.0f} Cr")

print(f"\nTotal FY2024 Forecast: ₹{sum(forecast):,.0f} Cr")

Sample output:

Seasonal Indices (Multiplicative):
========================================
  Q1 (Jan-Mar): 1.0215  (↑ Above avg)
  Q2 (Apr-Jun): 0.9678  (↓ Below avg)
  Q3 (Jul-Sep): 1.0423  (↑ Above avg)
  Q4 (Oct-Dec): 0.9684  (↓ Below avg)

==================================================
FORECAST: HUL Revenue - Next 4 Quarters
==================================================
  2024-Q1:  Trend=₹13,283  ×  Seasonal=1.0215  =  ₹13,569 Cr
  2024-Q2:  Trend=₹13,470  ×  Seasonal=0.9678  =  ₹13,037 Cr
  2024-Q3:  Trend=₹13,657  ×  Seasonal=1.0423  =  ₹14,233 Cr
  2024-Q4:  Trend=₹13,844  ×  Seasonal=0.9684  =  ₹13,407 Cr

Total FY2024 Forecast: ₹54,246 Cr

Section 3

Feature Engineering for Financial Models

Creating powerful predictors from raw financial data

🔧 The Feature Engineering Toolkit

Feature engineering is often the difference between a mediocre model and a great one. Here are the key techniques for financial data:

Feature Type | Description | Python Code | Why It Works
Lag Features | Previous period's value | df['rev_lag1'] = df['rev'].shift(1) | Revenue is autocorrelated: last quarter predicts this quarter
Rolling Means | Moving average of past N periods | df['rev_sma4'] = df['rev'].rolling(4).mean() | Smooths noise, captures recent trend
YoY Change | Year-over-year percentage change | df['rev_yoy'] = df['rev'].pct_change(4) | Removes seasonality, shows true growth
QoQ Change | Quarter-over-quarter change | df['rev_qoq'] = df['rev'].pct_change(1) | Captures momentum and acceleration
Cyclical Encoding | Sin/Cos transform for time | np.sin(2 * np.pi * month / 12) | Tells the model Q4 is "close to" Q1 (wraps around)
Interaction Terms | Feature × Feature | df['gdp_x_it'] = df['gdp'] * df['it_idx'] | Captures combined effects
External Data | GDP, inflation, exchange rate | pd.read_csv('macro_data.csv') | Macro drivers affect all companies
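
The "wraps around" claim for cyclical encoding can be checked numerically. The sketch below encodes quarters 1 to 4 with sin/cos and compares distances in the encoded space; the helper `encoded_distance` is a hypothetical name used only for illustration.

```python
import numpy as np

quarters = np.array([1, 2, 3, 4])
angle = 2 * np.pi * quarters / 4
encoded = np.column_stack([np.sin(angle), np.cos(angle)])  # one (sin, cos) point per quarter

def encoded_distance(q_a, q_b):
    """Euclidean distance between two quarters in sin/cos space."""
    return float(np.linalg.norm(encoded[q_a - 1] - encoded[q_b - 1]))

print("Raw gap      Q4 vs Q1:", abs(4 - 1), "| Q1 vs Q2:", abs(1 - 2))
print("Encoded gap  Q4 vs Q1:", round(encoded_distance(4, 1), 3),
      "| Q1 vs Q2:", round(encoded_distance(1, 2), 3))
print("Encoded gap  Q1 vs Q3:", round(encoded_distance(1, 3), 3))
```

With the raw quarter number, Q4 looks three steps away from Q1. After encoding, Q4 to Q1 is the same distance as Q1 to Q2, and only quarters half a year apart (Q1 and Q3) are far from each other.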

Worked Example 3: Feature Engineering Pipeline

Task: Build a feature-rich dataset from raw revenue data, then select the best predictors using correlation analysis.

Python: Feature Engineering Pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def create_financial_features(df, revenue_col='Revenue', date_col='Quarter'):
    """Complete feature engineering pipeline for financial forecasting."""
    
    df = df.copy()
    
    # 1. Time-based features
    df['Quarter_Num'] = range(1, len(df) + 1)
    df['Year'] = df[date_col].dt.year
    df['Qtr'] = df[date_col].dt.quarter
    
    # Cyclical encoding
    df['Q_Sin'] = np.sin(2 * np.pi * df['Qtr'] / 4)
    df['Q_Cos'] = np.cos(2 * np.pi * df['Qtr'] / 4)
    
    # 2. Lag features (autocorrelation)
    df[f'{revenue_col}_Lag1'] = df[revenue_col].shift(1)   # Previous quarter
    df[f'{revenue_col}_Lag2'] = df[revenue_col].shift(2)   # 2 quarters ago
    df[f'{revenue_col}_Lag4'] = df[revenue_col].shift(4)   # Same quarter last year
    
    # 3. Rolling statistics
    df[f'{revenue_col}_SMA4'] = df[revenue_col].rolling(4).mean()    # 4Q moving avg
    df[f'{revenue_col}_SMA8'] = df[revenue_col].rolling(8).mean()    # 8Q moving avg
    df[f'{revenue_col}_STD4'] = df[revenue_col].rolling(4).std()     # Volatility
    
    # 4. Change features
    df[f'{revenue_col}_QoQ'] = df[revenue_col].pct_change(1) * 100   # QoQ %
    df[f'{revenue_col}_YoY'] = df[revenue_col].pct_change(4) * 100   # YoY %
    
    # 5. Momentum features
    df[f'{revenue_col}_Accel'] = df[f'{revenue_col}_QoQ'].diff(1)    # Acceleration
    
    # 6. Interaction features (if external data available)
    if 'GDP_Growth' in df.columns and 'IT_Index' in df.columns:
        df['GDP_x_IT'] = df['GDP_Growth'] * df['IT_Index'] / 100
    
    return df

# ========================================
# Apply to TCS data from Example 1
# ========================================
# (Assuming 'data' DataFrame from Example 1)
data_featured = create_financial_features(data)

# Drop NaN rows
data_clean = data_featured.dropna()

# ========================================
# Correlation Analysis
# ========================================
feature_cols = [c for c in data_clean.columns if c not in 
               ['Quarter', 'Revenue', 'Q_Month']]

corr_matrix = data_clean[feature_cols + ['Revenue']].corr()

# Plot correlation heatmap
fig, ax = plt.subplots(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', 
            cmap='RdYlBu_r', center=0, ax=ax,
            square=True, linewidths=0.5)
ax.set_title('Feature Correlation Matrix - TCS Revenue',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('correlation_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

# ========================================
# Feature Selection: Top predictors
# ========================================
revenue_corr = corr_matrix['Revenue'].drop('Revenue').abs().sort_values(ascending=False)
print("TOP FEATURES by |Correlation| with Revenue:")
print("=" * 45)
for feat, corr_val in revenue_corr.head(10).items():
    bar = '█' * int(corr_val * 40)
    print(f"  {feat:25s}: {corr_val:.4f}  {bar}")

# Select features with |correlation| > 0.5
selected_features = revenue_corr[revenue_corr > 0.5].index.tolist()
print(f"\nSelected features (|corr| > 0.5): {len(selected_features)}")
print(f"Features: {selected_features}")

Sample output:

TOP FEATURES by |Correlation| with Revenue:
=============================================
  Revenue_Lag1             : 0.9992  ███████████████████████████████████████
  Revenue_SMA4             : 0.9990  ███████████████████████████████████████
  Revenue_Lag4             : 0.9964  ███████████████████████████████████████
  Revenue_Lag2             : 0.9984  ███████████████████████████████████████
  IT_Index                 : 0.9938  ███████████████████████████████████████
  Quarter_Num              : 0.9935  ███████████████████████████████████████
  Revenue_SMA8             : 0.9894  ███████████████████████████████████████
  GDP_Growth               : 0.8802  ███████████████████████████████████
  Revenue_YoY              : 0.6520  ██████████████████████████
  GDP_x_IT                 : 0.6105  ████████████████████████

Selected features (|corr| > 0.5): 10
Features: ['Revenue_Lag1', 'Revenue_SMA4', 'Revenue_Lag2', 'Revenue_Lag4', 'IT_Index', 'Quarter_Num', 'Revenue_SMA8', 'GDP_Growth', 'Revenue_YoY', 'GDP_x_IT']

Section 4

Model Evaluation & Selection

How to properly evaluate and compare forecasting models

📖 Time-Series Cross-Validation: NOT Random Shuffle!

A common mistake in financial ML is randomly shuffling time-series data for the train/test split. This causes data leakage: the model trains on future data to predict the past.

Method | Description | When to Use
Simple Split | First 80% train, last 20% test | Quick prototyping
Walk-Forward | Train on [1..t], predict t+1, then train on [1..t+1], predict t+2 | Gold standard for time-series
Expanding Window | Walk-forward variant where the training set always starts at period 1 and keeps growing | When early data is still relevant
Rolling Window | Walk-forward variant where a fixed-size training window slides forward | When recent data matters more than old
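
The expanding and rolling schemes in the table can be sketched with scikit-learn's `TimeSeriesSplit`, which always places the test indices after the training indices. The 12 rows here stand in for 12 hypothetical quarters; the split sizes are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 hypothetical quarters, already in time order

# Expanding window: the training set grows with every split
expanding = TimeSeriesSplit(n_splits=4, test_size=2)
# Rolling window: cap the training set at the 4 most recent quarters
rolling = TimeSeriesSplit(n_splits=4, test_size=2, max_train_size=4)

for name, splitter in [("expanding", expanding), ("rolling", rolling)]:
    print(name)
    for train_idx, test_idx in splitter.split(X):
        print("  train:", train_idx.tolist(), "-> test:", test_idx.tolist())
```

Setting `test_size=1` gives one-step-ahead walk-forward validation, which is what the hand-written loop in Worked Example 4 does.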

Worked Example 4: Comparing OLS vs Ridge vs Lasso

Task: Compare three regression models on TCS data using walk-forward validation.

Python: Model Comparison with Walk-Forward Validation
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt

def walk_forward_validation(df, features, target, model_class, 
                            min_train=12, **model_params):
    """Walk-forward validation for time-series."""
    predictions = []
    actuals = []
    
    for i in range(min_train, len(df)):
        # Train on all data up to point i
        X_train = df.iloc[:i][features]
        y_train = df.iloc[:i][target]
        
        # Predict point i
        X_test = df.iloc[i:i+1][features]
        y_test = df.iloc[i:i+1][target]
        
        # Scale features
        scaler = StandardScaler()
        X_train_s = scaler.fit_transform(X_train)
        X_test_s = scaler.transform(X_test)
        
        # Train and predict
        model = model_class(**model_params)
        model.fit(X_train_s, y_train)
        pred = model.predict(X_test_s)[0]
        
        predictions.append(pred)
        actuals.append(y_test.values[0])
    
    return np.array(predictions), np.array(actuals)

# Prepare data (using data_clean from earlier examples)
features = ['GDP_Growth', 'IT_Index', 'Quarter_Num', 'Q_Sin', 'Q_Cos',
            'Revenue_Lag1', 'Revenue_Lag4']
target = 'Revenue'

# Run walk-forward for all 3 models
results = {}
for name, cls, params in [
    ('OLS', LinearRegression, {}),
    ('Ridge', Ridge, {'alpha': 1.0}),
    ('Lasso', Lasso, {'alpha': 100.0})
]:
    preds, actuals = walk_forward_validation(
        data_clean, features, target, cls, min_train=12, **params)
    
    results[name] = {
        'R2': r2_score(actuals, preds),
        'RMSE': np.sqrt(mean_squared_error(actuals, preds)),
        'MAPE': np.mean(np.abs((actuals - preds) / actuals)) * 100,
        'Predictions': preds,
        'Actuals': actuals
    }

# ========================================
# Compare Results
# ========================================
print("MODEL COMPARISON - Walk-Forward Validation")
print("=" * 55)
print(f"{'Model':<10} {'R²':>8} {'RMSE (₹ Cr)':>14} {'MAPE (%)':>10}")
print("-" * 55)
for name, r in results.items():
    # The key must match the one stored above ('R2'); r['R'] raises KeyError
    print(f"{name:<10} {r['R2']:>8.4f} {r['RMSE']:>14,.0f} {r['MAPE']:>10.2f}")

best = min(results.items(), key=lambda x: x[1]['RMSE'])
print(f"\n🏆 Best Model: {best[0]} (lowest RMSE)")

# ========================================
# Visualization
# ========================================
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Actual vs Predicted
test_quarters = data_clean['Quarter'].iloc[12:]
for name, r in results.items():
    style = '-' if name == 'OLS' else '--' if name == 'Ridge' else ':'
    axes[0].plot(test_quarters, r['Predictions'], style, label=name, linewidth=2)
axes[0].plot(test_quarters, results['OLS']['Actuals'], 'ko', markersize=4, 
             label='Actual', alpha=0.6)
axes[0].set_title('Walk-Forward Predictions', fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
axes[0].tick_params(axis='x', rotation=45)

# Plot 2: Error distribution
for name, r in results.items():
    errors = (r['Actuals'] - r['Predictions']) / r['Actuals'] * 100
    axes[1].hist(errors, bins=10, alpha=0.5, label=f'{name} (mean={errors.mean():.2f}%)')
axes[1].set_title('Prediction Error Distribution (%)', fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

Sample output:

MODEL COMPARISON - Walk-Forward Validation
=======================================================
Model            R²    RMSE (₹ Cr)   MAPE (%)
-------------------------------------------------------
OLS          0.9991            482       0.71
Ridge        0.9992            456       0.67
Lasso        0.9988            598       0.89

🏆 Best Model: Ridge (lowest RMSE)

Why Ridge Wins

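OLS coefficient estimates become unstable when features are highly correlated with each other (multicollinearity). Here Revenue_Lag1, Revenue_Lag4, IT_Index, and Quarter_Num all track the same upward trend, each with a correlation of about 0.99 with Revenue. Ridge regression adds an L2 penalty that shrinks the coefficients, which reduces overfitting. Lasso (L1 penalty) with alpha = 100 is too aggressive here: it zeroes out useful features.

Practical takeaway: For financial forecasting with correlated features, Ridge is usually a good default, although its margin over OLS in this comparison is small (RMSE of 456 vs 482).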
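
The shrinkage effect described above can be seen on a toy problem. This is a sketch on synthetic data, not the TCS dataset: `x2` is deliberately built as a near-copy of `x1`, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # almost identical to x1
X = StandardScaler().fit_transform(np.column_stack([x1, x2]))
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)          # L2 penalty on the coefficients

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Coefficient norm, OLS vs Ridge:",
      round(float(np.linalg.norm(ols.coef_)), 2),
      round(float(np.linalg.norm(ridge.coef_)), 2))
```

With two nearly duplicate features, OLS is free to assign large offsetting weights to them, while the Ridge penalty pulls the weights toward smaller, more evenly shared values. Changing the seed changes the OLS coefficients far more than the Ridge ones.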

📋 Practical Guidelines: When to Use What

Scenario | Recommended Approach | Expected Accuracy
Mature company, stable growth (IT, FMCG) | Linear regression with lag features | MAPE 1–3%
Cyclical company (Auto, Metals) | Ridge regression + GDP/cyclical features | MAPE 3–8%
High-growth startup | Multiple regression + industry metrics | MAPE 5–15%
Seasonal business (Retail, Travel) | Time-series decomposition + seasonal indices | MAPE 2–5%
Volatile / event-driven | Scenario analysis > ML (too unpredictable) | N/A: use scenarios
⚠️ Danger: Overfitting

If your model has R² = 0.9999 on training data but performs poorly on new data, you've overfit. Always use walk-forward validation and report test-set metrics, not training metrics. A model with R² = 0.95 on the test set is more trustworthy than one with R² = 0.9999 on training data only.
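
The gap between training and test scores is easy to reproduce. The sketch below fits a linear model to a short synthetic series using one real driver (a time trend) plus 15 columns of pure noise; every name and number is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)
n_train, n_test, n_noise = 20, 8, 15
t = np.arange(n_train + n_test, dtype=float)
y = 100 + 5 * t + rng.normal(0, 10, t.size)           # trend plus noise
noise_features = rng.normal(size=(t.size, n_noise))   # irrelevant predictors
X = np.column_stack([t, noise_features])

# Chronological split: first 20 points train, last 8 test
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

model = LinearRegression().fit(X_train, y_train)
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"Training R2: {r2_train:.4f}")
print(f"Test R2:     {r2_test:.4f}")
```

With 16 features and only 20 training rows, the model can fit the training noise almost perfectly, so the training score says little about how it behaves on the held-out quarters.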

Python Lab

Hands-On Practice Exercises

๐Ÿ‹๏ธ Exercise 1: Revenue Prediction for Infosys (25 min)

Objective: Build a linear regression model to predict Infosys quarterly revenue

AssumptionValue
CompanyInfosys (INFY.NS)
Data Period2016-Q1 to 2024-Q4 (32 quarters)
FeaturesGDP Growth, Nifty IT Index, Quarter encoding, Lag features
Train/Test Split24 quarters train / 8 quarters test

Tasks:

  1. Fetch Infosys quarterly revenue using yfinance
  2. Engineer features: lags, rolling means, cyclical encoding
  3. Train a LinearRegression model
  4. Report R², RMSE, and MAPE on the test set
Python: Infosys Revenue Prediction
import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

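# Fetch Infosys quarterly revenue
# Note: yfinance usually exposes only the most recent few quarters here, so a
# 32-quarter history may need to come from another data source.
ticker = yf.Ticker("INFY.NS")
income_stmt = ticker.quarterly_income_stmt

# Extract revenue (most recent first, so sort into chronological order)
revenue = income_stmt.loc['Total Revenue'].dropna().astype(float).sort_index()
revenue = revenue / 1e7  # Convert to ₹ Cr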

# Create DataFrame
df = pd.DataFrame({
    'Quarter': revenue.index,
    'Revenue': revenue.values
})
df['Quarter_Num'] = range(1, len(df) + 1)
df['Qtr'] = df['Quarter'].dt.quarter
df['Q_Sin'] = np.sin(2 * np.pi * df['Qtr'] / 4)
df['Q_Cos'] = np.cos(2 * np.pi * df['Qtr'] / 4)
df['Revenue_Lag1'] = df['Revenue'].shift(1)
df['Revenue_Lag4'] = df['Revenue'].shift(4)
df['Revenue_SMA4'] = df['Revenue'].rolling(4).mean()

df = df.dropna().reset_index(drop=True)

# Train/test split (chronological!)
train = df.iloc[:-8]
test = df.iloc[-8:]

features = ['Quarter_Num', 'Q_Sin', 'Q_Cos', 'Revenue_Lag1', 
            'Revenue_Lag4', 'Revenue_SMA4']

model = LinearRegression()
model.fit(train[features], train['Revenue'])
preds = model.predict(test[features])

print(f"Test R²:   {r2_score(test['Revenue'], preds):.4f}")
print(f"Test RMSE: ₹{np.sqrt(mean_squared_error(test['Revenue'], preds)):,.0f} Cr")
mape = np.mean(np.abs((test['Revenue'] - preds) / test['Revenue'])) * 100
print(f"Test MAPE: {mape:.2f}%")

๐Ÿ‹๏ธ Exercise 2: Time-Series Decomposition of Stock Prices (20 min)

Objective: Download 5 years of monthly stock data for Reliance Industries, decompose into trend/seasonality/residual

Tasks:

  1. Fetch Reliance monthly closing prices using yfinance
  2. Apply multiplicative decomposition (period=12)
  3. Plot the 4-panel decomposition chart (observed, trend, seasonal, residual)
  4. Calculate and interpret seasonal indices: which months are strongest/weakest?
Python: Stock Price Decomposition
import yfinance as yf
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

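# Fetch 5 years of monthly data
data = yf.download("RELIANCE.NS", period="5y", interval="1mo")
# Recent yfinance versions may return MultiIndex columns, making 'Close' a
# one-column DataFrame; squeeze() turns it into a Series (no-op on a Series).
prices = data['Close'].squeeze().dropna()

# Decompose (multiplicative: prices grow over time)
decomp = seasonal_decompose(prices, model='multiplicative', period=12)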

# Plot
fig, axes = plt.subplots(4, 1, figsize=(14, 12), sharex=True)
decomp.observed.plot(ax=axes[0], title='Observed (Reliance Close Price)')
decomp.trend.plot(ax=axes[1], title='Trend')
decomp.seasonal.plot(ax=axes[2], title='Seasonal')
decomp.resid.plot(ax=axes[3], title='Residual')
plt.tight_layout()
plt.savefig('reliance_decomposition.png', dpi=150)
plt.show()

# Seasonal indices by month
seasonal = decomp.seasonal
monthly_avg = seasonal.groupby(seasonal.index.month).mean()
print("Monthly Seasonal Indices:")
for month, val in monthly_avg.items():
    effect = "↑" if val > 1 else "↓"
    print(f"  Month {month:2d}: {val:.4f} {effect}")

๐Ÿ‹๏ธ Exercise 3: Feature Engineering Challenge (20 min)

Objective: Start with a basic model (Rยฒ ~0.65) and improve it to Rยฒ >0.90 through feature engineering

Starting data: 20 quarters of revenue with only GDP growth as a feature

Tasks:

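  1. Start with GDP_Growth as the only feature → record R²
  2. Add Quarter_Num (time trend) → record new R²
  3. Add lag features (Revenue_Lag1, Revenue_Lag4) → record R²
  4. Add rolling means and cyclical encoding → record R²
  5. Plot the improvement curve (R² vs number of features)
Python: Incremental Feature Engineering
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Feature engineering progression
# (assumes data_clean from Worked Example 3, which contains every column used below)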
feature_sets = {
    '1. GDP Only': ['GDP_Growth'],
    '2. + Time Trend': ['GDP_Growth', 'Quarter_Num'],
    '3. + Lag Features': ['GDP_Growth', 'Quarter_Num', 'Revenue_Lag1', 'Revenue_Lag4'],
    '4. + Seasonality': ['GDP_Growth', 'Quarter_Num', 'Revenue_Lag1', 
                         'Revenue_Lag4', 'Q_Sin', 'Q_Cos'],
    '5. + Rolling Stats': ['GDP_Growth', 'Quarter_Num', 'Revenue_Lag1',
                           'Revenue_Lag4', 'Q_Sin', 'Q_Cos', 'Revenue_SMA4'],
    '6. Full Set': ['GDP_Growth', 'IT_Index', 'Quarter_Num', 'Q_Sin', 'Q_Cos',
                    'Revenue_Lag1', 'Revenue_Lag4', 'Revenue_SMA4']
}

print("FEATURE ENGINEERING PROGRESSION")
print("=" * 50)
r2_values = []
for name, features in feature_sets.items():
    df_temp = data_clean[features + ['Revenue']].dropna()
    X = df_temp[features]
    y = df_temp['Revenue']
    model = LinearRegression().fit(X, y)
    r2 = r2_score(y, model.predict(X))
    r2_values.append(r2)
    bar = '█' * int(r2 * 50)
    print(f"{name:30s}: R²={r2:.4f}  {bar}")

# Plot progression
plt.figure(figsize=(10, 5))
plt.plot(range(len(r2_values)), r2_values, 'b-o', linewidth=2)
plt.xticks(range(len(r2_values)), 
           [n.split('. ')[1] for n in feature_sets.keys()], 
           rotation=30, ha='right', fontsize=9)
plt.ylabel('R²')
plt.title('R² Improvement with Feature Engineering', fontweight='bold')
plt.grid(alpha=0.3)
plt.ylim(0.5, 1.0)
plt.tight_layout()
plt.savefig('feature_progression.png', dpi=150)
plt.show()

๐Ÿ‹๏ธ Exercise 4 (Advanced): Automated Forecasting Pipeline (20 min)

Objective: Build a reusable pipeline that takes any ticker, fetches data, engineers features, trains 3 models, and returns the best forecast

Python: Automated Forecasting Pipeline
import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error

def auto_forecast(ticker, forecast_periods=4, n_test=4):
    """Automated revenue forecasting pipeline."""
    
    print(f"\n{'='*60}")
    print(f"AUTOMATED FORECAST: {ticker}")
    print(f"{'='*60}")
    
    # 1. Fetch data (quarterly revenue, oldest quarter first)
    stock = yf.Ticker(ticker)
    revenue = stock.quarterly_income_stmt.loc['Total Revenue'].dropna().sort_index()
    revenue = revenue.astype(float) / 1e7  # ₹ Cr (assumes the company reports in INR)
    
    # 2. Engineer features
    df = pd.DataFrame({'Revenue': revenue.values}, index=revenue.index)
    df['Quarter_Num'] = range(1, len(df) + 1)
    df['Qtr'] = df.index.quarter
    df['Q_Sin'] = np.sin(2 * np.pi * df['Qtr'] / 4)
    df['Q_Cos'] = np.cos(2 * np.pi * df['Qtr'] / 4)
    df['Revenue_Lag1'] = df['Revenue'].shift(1)
    df['Revenue_Lag4'] = df['Revenue'].shift(4)
    # Mean of the previous 4 quarters; shift(1) keeps the current quarter
    # (the target) out of its own feature
    df['Revenue_SMA4'] = df['Revenue'].shift(1).rolling(4).mean()
    
    df = df.dropna()
    features = ['Quarter_Num', 'Q_Sin', 'Q_Cos', 'Revenue_Lag1', 
                'Revenue_Lag4', 'Revenue_SMA4']
    
    # yfinance may return only a few quarters; the lags drop 4 more rows
    min_rows = n_test + len(features) + 1
    if len(df) < min_rows:
        print(f"  Not enough history: {len(df)} usable quarters, need {min_rows}")
        return []
    
    # 3. Train models (walk-forward over the last n_test quarters)
    # Lasso is added as the third model named in the objective; its
    # alpha and max_iter values are untuned assumptions
    candidates = [('OLS', LinearRegression, {}), 
                  ('Ridge', Ridge, {'alpha': 1.0}),
                  ('Lasso', Lasso, {'alpha': 1.0, 'max_iter': 10000})]
    best_model, best_name, best_rmse = None, None, float('inf')
    
    for name, cls, params in candidates:
        preds, actuals = [], []
        for i in range(len(df) - n_test, len(df)):
            model = cls(**params)
            # Fit only on quarters before i, then predict quarter i
            model.fit(df.iloc[:i][features], df.iloc[:i]['Revenue'])
            pred = model.predict(df.iloc[i:i+1][features])[0]
            preds.append(pred)
            actuals.append(df.iloc[i]['Revenue'])
        
        rmse = np.sqrt(mean_squared_error(actuals, preds))
        mape = np.mean(np.abs((np.array(actuals) - np.array(preds)) / np.array(actuals))) * 100
        
        print(f"  {name:10s}: RMSE=₹{rmse:>10,.0f} Cr  MAPE={mape:.2f}%")
        
        if rmse < best_rmse:
            best_rmse = rmse
            best_model = cls(**params)
            best_name = name
    
    # 4. Retrain best model on all data and forecast
    print(f"  Best model: {best_name}")
    best_model.fit(df[features], df['Revenue'])
    
    # Recursive projection: each forecast feeds the next quarter's lag features
    history = list(revenue.values)  # full history, including rows dropped for lags
    last_quarter_num = int(df['Quarter_Num'].iloc[-1])
    forecasts = []
    for q in range(1, forecast_periods + 1):
        next_q = df.index[-1] + pd.DateOffset(months=3 * q)
        row = pd.DataFrame([{
            'Quarter_Num': last_quarter_num + q,
            'Q_Sin': np.sin(2 * np.pi * next_q.quarter / 4),
            'Q_Cos': np.cos(2 * np.pi * next_q.quarter / 4),
            'Revenue_Lag1': history[-1],
            'Revenue_Lag4': history[-4],
            'Revenue_SMA4': np.mean(history[-4:]),
        }])[features]
        
        pred = best_model.predict(row)[0]
        forecasts.append(pred)
        history.append(pred)
        print(f"  Forecast Q+{q}: ₹{pred:>10,.0f} Cr")
    
    return forecasts

# Test with multiple companies
for ticker in ["TCS.NS", "INFY.NS", "HINDUNILVR.NS"]:
    auto_forecast(ticker)
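
The manual walk-forward loop in the pipeline can also be written with scikit-learn's `TimeSeriesSplit`, which produces expanding training windows that always end before the test quarter. The sketch below runs offline on a synthetic revenue series (an assumption for illustration, not real company data), and the helper name `walk_forward_rmse` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic quarterly revenue (assumption): trend + seasonality + noise
rng = np.random.default_rng(0)
n = 24
t = np.arange(n)
revenue = 1000 + 25 * t + 40 * np.sin(2 * np.pi * t / 4) + rng.normal(0, 10, n)
X = np.column_stack([t, np.sin(2 * np.pi * t / 4), np.cos(2 * np.pi * t / 4)])

# 4 splits of 1 test quarter each: test on quarters 20..23,
# training each time on every quarter before the test quarter
tscv = TimeSeriesSplit(n_splits=4, test_size=1)
splits = list(tscv.split(X))


def walk_forward_rmse(model):
    """Refit on each expanding window and score the one-step-ahead forecasts."""
    preds, actuals = [], []
    for train_idx, test_idx in splits:
        model.fit(X[train_idx], revenue[train_idx])
        preds.append(model.predict(X[test_idx])[0])
        actuals.append(revenue[test_idx][0])
    return float(np.sqrt(mean_squared_error(actuals, preds)))


results = {name: walk_forward_rmse(model)
           for name, model in [('OLS', LinearRegression()),
                               ('Ridge', Ridge(alpha=1.0))]}
for name, rmse in results.items():
    print(f"{name:6s} walk-forward RMSE: {rmse:.2f}")
```

Because the splitter never places a later quarter in the training set, it guards against the shuffling mistake that ordinary k-fold cross-validation would make on time-series data.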
Summary

Key Takeaways

📝 What We Covered Today

  • Built multiple linear regression models to predict revenue using GDP, industry indices, and lag features, achieving R² > 0.99
  • Learned time-series decomposition: breaking revenue into trend, seasonal, and residual components using statsmodels
  • Created powerful features: lag values, rolling means, cyclical encoding, YoY changes, and interaction terms
  • Compared OLS vs Ridge vs Lasso: Ridge is usually the more stable choice for financial data with correlated features
  • Used walk-forward validation, the correct way to evaluate time-series models (never shuffle!)
  • Built an automated forecasting pipeline that works for any company ticker
📚 Next Session

Session 22: Monte Carlo Simulation
We'll move from point forecasts to probability distributions, simulating thousands of scenarios to quantify risk and uncertainty in our models. Topics: random number generation, distribution fitting, and confidence intervals for DCF valuations.