What You'll Learn Today
Linear Regression for Financial Forecasting
From Excel TREND to scikit-learn: building predictive models
In Session 5, we forecasted revenue using growth rates and analyst estimates. What if we could predict revenue using GDP growth, industry trends, and seasonality, all at once?
That's exactly what multiple linear regression does: it finds the mathematical relationship between multiple input variables and your target forecast.
Why Machine Learning for Forecasting?
Traditional financial forecasting relies on analyst judgment: you pick a growth rate, maybe adjust for seasonality. ML-based forecasting is different:
| Aspect | Traditional (Excel) | ML-Based (Python) |
|---|---|---|
| Method | Manual growth rate + gut feel | Statistical relationship from historical data |
| Inputs | 1-2 drivers (growth rate, margin) | 10+ features (GDP, seasonality, lag values, industry trends) |
| Accuracy | Varies with analyst skill | Measurable (R², RMSE, MAPE) |
| Speed | Slow: update each cell manually | Fast: retrain model with new data |
| Bias | Optimism bias, anchoring | Data-driven (but can inherit data biases) |
| Confidence | No confidence interval | Prediction intervals + uncertainty quantification |
ML does NOT replace financial judgment. It augments it. The best approach combines ML predictions with domain knowledge: use ML as one input in your forecasting toolkit, not as a black box that replaces thinking.
Simple Linear Regression: The Math
y = β₀ + β₁x₁ + ε
Where: y = target (revenue), x₁ = feature (GDP growth), β = coefficients, ε = error
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Multiple features: GDP growth + industry index + seasonality + lag values
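To see what "finding the coefficients" means in practice, here is a minimal sketch that generates synthetic data from known β values and recovers them with ordinary least squares. The data, the coefficient values, and the variable names are illustrative assumptions, not TCS figures.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical inputs (assumed values, for illustration only)
gdp_growth = rng.normal(6.5, 0.5, n)     # x1
it_index = rng.normal(150.0, 30.0, n)    # x2
noise = rng.normal(0.0, 50.0, n)         # epsilon

# Known coefficients: intercept, beta_1, beta_2
true_beta = np.array([1000.0, 300.0, 20.0])
revenue = (true_beta[0] + true_beta[1] * gdp_growth
           + true_beta[2] * it_index + noise)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), gdp_growth, it_index])

# Least-squares estimate of [beta_0, beta_1, beta_2]
beta_hat, *_ = np.linalg.lstsq(X, revenue, rcond=None)
residuals = revenue - X @ beta_hat

print("true beta:     ", true_beta)
print("estimated beta:", beta_hat.round(2))
```

The estimates land near the true values but not exactly on them, because ε adds noise; scikit-learn's LinearRegression solves the same least-squares problem.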
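R² (R-squared) = 1 − (SS_res / SS_tot): how much variance is explained (0 to 1 on training data; can be negative on test data when the model does worse than predicting the mean)
RMSE = √(Σ(y_actual − y_pred)² / n): typical prediction error in original units
MAPE = mean(|y_actual − y_pred| / |y_actual|) × 100: average percentage error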
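The three metrics are short enough to compute by hand. The sketch below applies the formulas to four made-up actual/predicted pairs (the numbers are assumptions chosen only to keep the arithmetic easy to follow).

```python
import numpy as np

# Hypothetical actual and predicted values
y_actual = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([102.0, 108.0, 123.0, 128.0])

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y_actual - y_pred) ** 2)            # 4 + 4 + 9 + 4 = 21
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # 225 + 25 + 25 + 225 = 500
r2 = 1 - ss_res / ss_tot

# RMSE = sqrt(mean squared error), in the same units as y
rmse = np.sqrt(ss_res / len(y_actual))

# MAPE = mean absolute percentage error
mape = np.mean(np.abs(y_actual - y_pred) / np.abs(y_actual)) * 100

print(f"R²   = {r2:.4f}")
print(f"RMSE = {rmse:.4f}")
print(f"MAPE = {mape:.2f}%")
```

These are the same quantities that r2_score, mean_squared_error, and mean_absolute_percentage_error return in scikit-learn (the last one as a fraction rather than a percentage).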
Worked Example 1: Predicting TCS Revenue with Linear Regression
Given: 8 years of TCS quarterly revenue data with GDP growth and IT industry index.
Calculate: (i) Correlation between features and revenue. (ii) Train a multiple regression model. (iii) Evaluate R² and RMSE. (iv) Forecast next 4 quarters.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_percentage_error
# ========================================
# STEP 1: Create/Load the Dataset
# ========================================
# Simulated quarterly data for TCS (₹ Cr)
np.random.seed(42)
# Quarter-end dates; on pandas 2.2+ the alias 'QE' replaces the deprecated 'Q'
quarters = pd.date_range('2016-Q1', periods=32, freq='Q')
data = pd.DataFrame({
    'Quarter': quarters,
    'Revenue': [26000, 27500, 29000, 30500,
                31000, 32500, 34200, 36000,
                36500, 38200, 40000, 42000,
                43000, 44800, 46500, 48500,
                49200, 51000, 53000, 55500,
                56200, 58500, 61000, 64000,
                65000, 67000, 69500, 72000,
                73000, 75500, 78000, 81000],
    'GDP_Growth': [7.1, 7.3, 7.0, 7.4, 7.5, 7.2, 7.6, 7.3,
                   6.8, 7.0, 6.9, 7.1, 6.5, 6.7, 6.6, 6.8,
                   6.3, 6.5, 6.4, 6.6, 6.1, 6.3, 6.2, 6.4,
                   5.9, 6.1, 6.0, 6.2, 5.8, 6.0, 5.9, 6.1],
    'IT_Index': [100, 102, 105, 108, 112, 115, 118, 122,
                 128, 132, 136, 140, 148, 153, 158, 163,
                 170, 176, 182, 188, 196, 203, 210, 218,
                 225, 233, 240, 248, 255, 264, 272, 280]
})
# ========================================
# STEP 2: Feature Engineering
# ========================================
# Add time-based features
data['Quarter_Num'] = range(1, len(data) + 1) # Linear time trend
data['Q_Month'] = data['Quarter'].dt.month # 3, 6, 9, 12
# Cyclical encoding for quarter seasonality
data['Q_Sin'] = np.sin(2 * np.pi * data['Q_Month'] / 12)
data['Q_Cos'] = np.cos(2 * np.pi * data['Q_Month'] / 12)
# Lag features (previous quarter revenue)
data['Revenue_Lag1'] = data['Revenue'].shift(1)
data['Revenue_Lag4'] = data['Revenue'].shift(4) # Same quarter last year
# YoY growth
data['Revenue_YoY'] = data['Revenue'].pct_change(4) * 100
# Drop rows with NaN (from lagging)
data_clean = data.dropna().reset_index(drop=True)
print("Feature Correlation with Revenue:")
print("=" * 45)
corr = data_clean[['Revenue', 'GDP_Growth', 'IT_Index', 'Quarter_Num',
                   'Q_Sin', 'Q_Cos', 'Revenue_Lag1', 'Revenue_Lag4']].corr()
print(corr['Revenue'].sort_values(ascending=False).round(3))
# ========================================
# STEP 3: Train-Test Split (Time-Series!)
# ========================================
# IMPORTANT: Never shuffle time-series data!
train = data_clean.iloc[:20] # First 20 quarters for training
test = data_clean.iloc[20:] # Last 8 quarters for testing
features = ['GDP_Growth', 'IT_Index', 'Quarter_Num', 'Q_Sin', 'Q_Cos',
            'Revenue_Lag1', 'Revenue_Lag4']
X_train = train[features]
y_train = train['Revenue']
X_test = test[features]
y_test = test['Revenue']
# ========================================
# STEP 4: Train the Model
# ========================================
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
# ========================================
# STEP 5: Evaluate
# ========================================
print("\n" + "=" * 50)
print("MODEL PERFORMANCE")
print("=" * 50)
print(f"Training R²: {r2_score(y_train, y_pred_train):.4f}")
print(f"Test R²:     {r2_score(y_test, y_pred_test):.4f}")
print(f"Test RMSE:   ₹{np.sqrt(mean_squared_error(y_test, y_pred_test)):,.0f} Cr")
print(f"Test MAPE:   {mean_absolute_percentage_error(y_test, y_pred_test)*100:.2f}%")
print("\nFeature Coefficients:")
print("-" * 35)
# Sort by absolute coefficient size (features are unscaled, so sizes depend on units)
for feat, coef in sorted(zip(features, model.coef_), key=lambda x: abs(x[1]), reverse=True):
    print(f"  {feat:20s}: {coef:>12,.2f}")
print(f"  {'Intercept':20s}: {model.intercept_:>12,.2f}")
# ========================================
# STEP 6: Visualize Actual vs Predicted
# ========================================
fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(data_clean['Quarter'], data_clean['Revenue'], 'b-o',
        label='Actual Revenue', linewidth=2, markersize=5)
ax.plot(train['Quarter'], y_pred_train, 'g--',
        label='Training Prediction', linewidth=1.5, alpha=0.7)
ax.plot(test['Quarter'], y_pred_test, 'r-s',
        label='Test Prediction', linewidth=2, markersize=6)
# Mark the train/test boundary
ax.axvline(x=test['Quarter'].iloc[0], color='gray', linestyle=':', alpha=0.5)
ax.text(test['Quarter'].iloc[0], data_clean['Revenue'].max() * 0.95,
        ' ← Train | Test →', fontsize=10, color='gray')
ax.set_title('TCS Revenue: ML Forecast vs Actual', fontsize=14, fontweight='bold')
ax.set_xlabel('Quarter')
ax.set_ylabel('Revenue (₹ Cr)')
ax.legend(fontsize=11)
ax.grid(alpha=0.3)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig('revenue_forecast.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nChart saved: revenue_forecast.png")
Interpreting Results: R² of 0.9989 means the model explains 99.89% of revenue variance. MAPE of 0.58% means predictions are, on average, within about 0.6% of actual revenue, which is exceptionally accurate. Revenue_Lag1 (previous quarter) and IT_Index are the strongest predictors. Because the features are unscaled, coefficient sizes depend on each feature's units, so compare them with care.
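One way to put a headline MAPE in context is to compare it with a naive forecast that simply repeats the previous quarter's revenue. The sketch below does this for the last 8 quarters of the simulated TCS series from Example 1; the naive baseline is an illustrative assumption added here, not part of the original example.

```python
import numpy as np

# Simulated TCS quarterly revenue from Example 1 (₹ Cr)
revenue = np.array([26000, 27500, 29000, 30500, 31000, 32500, 34200, 36000,
                    36500, 38200, 40000, 42000, 43000, 44800, 46500, 48500,
                    49200, 51000, 53000, 55500, 56200, 58500, 61000, 64000,
                    65000, 67000, 69500, 72000, 73000, 75500, 78000, 81000],
                   dtype=float)

n_test = 8
actual = revenue[-n_test:]            # last 8 quarters (the test window)
naive_pred = revenue[-n_test - 1:-1]  # each forecast = previous quarter's revenue

naive_mape = np.mean(np.abs(actual - naive_pred) / actual) * 100
print(f"Naive (last quarter) MAPE: {naive_mape:.2f}%")
```

A regression model earns its keep only to the extent that its test MAPE beats this kind of baseline, since a strongly trending series makes almost any lag-based model look accurate.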
Time-Series Decomposition
Breaking down revenue into trend, seasonality, and noise
The 4 Components of Any Time Series
Every financial time series can be decomposed into four components:
| Component | What It Captures | Example (TCS Revenue) | Python |
|---|---|---|---|
| Trend (T) | Long-term direction: the underlying growth path | Steady increase from ₹26K Cr to ₹81K Cr over 8 years | Moving average or linear fit |
| Seasonal (S) | Repeating patterns at fixed intervals | Fiscal Q3 (Oct-Dec) is typically strongest for Indian IT | statsmodels seasonal_decompose |
| Cyclical (C) | Non-fixed-period fluctuations tied to business cycles | IT spending dips during global recessions (2-5 year cycles) | Hodrick-Prescott filter |
| Irregular (I) | Random noise: what can't be explained | One-time deal wins, currency shocks, COVID | Residual after removing T + S |
Additive: y(t) = T(t) + S(t) + C(t) + I(t)
Multiplicative: y(t) = T(t) × S(t) × C(t) × I(t)
Use Additive when: seasonal variation is relatively constant over time
Use Multiplicative when: seasonal variation grows with the trend level
For revenue (grows over time) → use Multiplicative. For margins or growth rates (bounded) → use Additive.
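The difference between the two forms is easiest to see on synthetic numbers. In this sketch (trend and seasonal values are made-up assumptions) the same rising trend is combined with an additive pattern and a multiplicative pattern, and the peak-to-trough seasonal swing is measured in the first and last year.

```python
import numpy as np

t = np.arange(24)                      # 24 quarters = 6 years
trend = 10_000 + 250 * t               # hypothetical rising trend

additive_season = np.tile([200, -300, 400, -300], 6)            # sums to 0 per year
multiplicative_season = np.tile([1.02, 0.97, 1.04, 0.97], 6)    # averages to 1 per year

y_add = trend + additive_season
y_mul = trend * multiplicative_season


def seasonal_swing(y, trend, year):
    """Peak-to-trough deviation from trend within one year (0-based year)."""
    deviation = (y - trend)[4 * year: 4 * year + 4]
    return deviation.max() - deviation.min()


first_add, last_add = seasonal_swing(y_add, trend, 0), seasonal_swing(y_add, trend, 5)
first_mul, last_mul = seasonal_swing(y_mul, trend, 0), seasonal_swing(y_mul, trend, 5)

print(f"Additive swing:       year 1 = {first_add:.0f}, year 6 = {last_add:.0f}")
print(f"Multiplicative swing: year 1 = {first_mul:.1f}, year 6 = {last_mul:.1f}")
```

The additive swing stays the same size in every year, while the multiplicative swing widens as the trend level rises, which is the pattern to look for when choosing between the two models.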
Worked Example 2: Decomposing HUL Revenue & Forecasting
Task: Decompose 6 years of Hindustan Unilever quarterly revenue, then forecast the next 4 quarters using trend + seasonal components.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
# ========================================
# STEP 1: Create HUL Revenue Data
# ========================================
np.random.seed(42)
quarters = pd.date_range('2018-Q1', periods=24, freq='Q')
# HUL quarterly revenue (₹ Cr), with strong seasonality typical of FMCG
base_trend = np.linspace(9500, 13500, 24)
seasonality = np.tile([200, -300, 400, -100], 6) # Q2 dip (pre-monsoon), Q3 peak (festive)
noise = np.random.normal(0, 150, 24)
revenue = base_trend + seasonality + noise
data = pd.Series(revenue, index=quarters, name='Revenue')
data.index.name = 'Quarter'
# ========================================
# STEP 2: Decompose
# ========================================
# Multiplicative decomposition (revenue grows over time)
decomposition = seasonal_decompose(data, model='multiplicative', period=4)
# Plot the decomposition
fig, axes = plt.subplots(4, 1, figsize=(14, 12), sharex=True)
decomposition.observed.plot(ax=axes[0], color='#2563EB', linewidth=2)
axes[0].set_ylabel('Observed')
axes[0].set_title('HUL Revenue: Time Series Decomposition', fontweight='bold', fontsize=14)
decomposition.trend.plot(ax=axes[1], color='#10B981', linewidth=2)
axes[1].set_ylabel('Trend')
decomposition.seasonal.plot(ax=axes[2], color='#F59E0B', linewidth=2)
axes[2].set_ylabel('Seasonal')
decomposition.resid.plot(ax=axes[3], color='#EF4444', linewidth=1.5, marker='o', markersize=4)
axes[3].set_ylabel('Residual')
for ax in axes:
    ax.grid(alpha=0.3)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
plt.xlabel('Quarter')
plt.tight_layout()
plt.savefig('decomposition.png', dpi=150, bbox_inches='tight')
plt.show()
# ========================================
# STEP 3: Extract Seasonal Indices
# ========================================
seasonal_indices = decomposition.seasonal.groupby(decomposition.seasonal.index.month).mean()
print("\nSeasonal Indices (Multiplicative):")
print("=" * 40)
months = {3: 'Q1 (Jan-Mar)', 6: 'Q2 (Apr-Jun)', 9: 'Q3 (Jul-Sep)', 12: 'Q4 (Oct-Dec)'}
for m, name in months.items():
    idx = seasonal_indices.get(m, 1.0)
    effect = "↑ Above avg" if idx > 1 else "↓ Below avg"
    print(f"  {name}: {idx:.4f} ({effect})")
# ========================================
# STEP 4: Forecast Next 4 Quarters
# ========================================
# Method: Project trend linearly, then multiply by seasonal index
from sklearn.linear_model import LinearRegression
# The centered moving-average trend is NaN for the first and last 2 quarters,
# so fit on the original time positions instead of re-indexing from 0
trend = decomposition.trend
valid = trend.notna().to_numpy()
X_trend = np.arange(len(trend))[valid].reshape(-1, 1)
y_trend = trend[valid].values
trend_model = LinearRegression()
trend_model.fit(X_trend, y_trend)
# Forecast trend for the 4 quarters after the last observed quarter
future_X = np.arange(len(data), len(data) + 4).reshape(-1, 1)
future_trend = trend_model.predict(future_X)
# Apply seasonal indices
future_months = [3, 6, 9, 12]  # Q1-Q4
future_seasonal = np.array([seasonal_indices.get(m, 1.0) for m in future_months])
forecast = future_trend * future_seasonal
print("\n" + "=" * 50)
print("FORECAST: HUL Revenue, Next 4 Quarters")
print("=" * 50)
forecast_quarters = pd.date_range('2024-Q1', periods=4, freq='Q')
for q, t, s, f in zip(forecast_quarters, future_trend, future_seasonal, forecast):
    # Timestamp.strftime has no quarter directive, so build the label by hand
    print(f"  {q.year}-Q{q.quarter}: Trend=₹{t:,.0f} × Seasonal={s:.4f} = ₹{f:,.0f} Cr")
print(f"\nTotal 2024 Forecast: ₹{sum(forecast):,.0f} Cr")
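To make the seasonal indices less of a black box, the sketch below approximates by hand what a classical multiplicative decomposition does: estimate the trend with a centered moving average, divide the series by that trend, and average the ratios by quarter. The series is synthetic with known (assumed) indices, and the steps are a simplified approximation; the exact filter and edge handling in statsmodels may differ.

```python
import numpy as np
import pandas as pd

n_quarters = 24
quarter = np.arange(n_quarters) % 4 + 1              # 1, 2, 3, 4, 1, ...
trend_true = np.linspace(9500, 13500, n_quarters)    # hypothetical trend
season_true = np.array([1.02, 0.97, 1.04, 0.97])     # hypothetical indices, mean 1
revenue = pd.Series(trend_true * season_true[quarter - 1])

# Centered 2x4 moving average: a 4-quarter mean, averaged over 2 adjacent
# windows, then shifted back so each value lines up with its center quarter
trend_est = revenue.rolling(4).mean().rolling(2).mean().shift(-2)

# Ratio of actual to trend isolates the seasonal effect (NaN at the edges)
ratio = revenue / trend_est

# Average ratio per quarter, rescaled so the four indices average to 1
seasonal_idx = ratio.groupby(quarter).mean()
seasonal_idx = seasonal_idx / seasonal_idx.mean()

print(seasonal_idx.round(4))
```

Because the synthetic series was built from known indices, the recovered values can be checked against season_true directly, which is a useful sanity test before trusting a decomposition on real revenue.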
Feature Engineering for Financial Models
Creating powerful predictors from raw financial data
The Feature Engineering Toolkit
Feature engineering is often the difference between a mediocre model and a great one. Here are the key techniques for financial data:
| Feature Type | Description | Python Code | Why It Works |
|---|---|---|---|
| Lag Features | Previous period's value | df['rev_lag1'] = df['rev'].shift(1) | Revenue is autocorrelated: last quarter predicts this quarter |
| Rolling Means | Moving average of past N periods | df['rev_sma4'] = df['rev'].shift(1).rolling(4).mean() | Smooths noise, captures recent trend (shift first so the current quarter is excluded) |
| YoY Change | Year-over-year percentage change | df['rev_yoy'] = df['rev'].pct_change(4) | Largely removes seasonality, shows underlying growth |
| QoQ Change | Quarter-over-quarter change | df['rev_qoq'] = df['rev'].pct_change(1) | Captures momentum and acceleration |
| Cyclical Encoding | Sin/Cos transform for time | np.sin(2 * np.pi * month / 12) | Tells the model Q4 is "close to" Q1 (wraps around) |
| Interaction Terms | Feature ร Feature | df['gdp_x_it'] = df['gdp'] * df['it_idx'] | Captures combined effects |
| External Data | GDP, inflation, exchange rate | pd.read_csv('macro_data.csv') | Macro drivers affect all companies |
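The "wraps around" property of cyclical encoding can be checked numerically. This small sketch encodes the four quarters as points on a circle and compares distances; the helper name distance is a hypothetical choice for illustration.

```python
import numpy as np

quarters = np.array([1, 2, 3, 4])
angle = 2 * np.pi * quarters / 4
encoded = np.column_stack([np.sin(angle), np.cos(angle)])  # one (sin, cos) row per quarter


def distance(a, b):
    """Euclidean distance between two quarters in (sin, cos) space."""
    return float(np.linalg.norm(encoded[a - 1] - encoded[b - 1]))


raw_gap_q4_q1 = abs(4 - 1)        # as plain integers, Q4 and Q1 look far apart
cyc_gap_q4_q1 = distance(4, 1)    # on the circle they are neighbours
cyc_gap_q1_q2 = distance(1, 2)
cyc_gap_q1_q3 = distance(1, 3)    # opposite quarters are the farthest apart

print(f"Raw gap Q4 to Q1:      {raw_gap_q4_q1}")
print(f"Cyclical gap Q4 to Q1: {cyc_gap_q4_q1:.3f}")
print(f"Cyclical gap Q1 to Q2: {cyc_gap_q1_q2:.3f}")
print(f"Cyclical gap Q1 to Q3: {cyc_gap_q1_q3:.3f}")
```

With the raw quarter number, a linear model treats Q4 as three steps away from Q1; with the sin/cos pair, Q4 to Q1 is exactly as close as Q1 to Q2.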
Worked Example 3: Feature Engineering Pipeline
Task: Build a feature-rich dataset from raw revenue data, then select the best predictors using correlation analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def create_financial_features(df, revenue_col='Revenue', date_col='Quarter'):
    """Complete feature engineering pipeline for financial forecasting.

    Every revenue-derived feature uses past quarters only (note the shift(1)),
    so none of them contains the value being predicted.
    """
    df = df.copy()
    prev = df[revenue_col].shift(1)  # revenue known before the target quarter
    # 1. Time-based features
    df['Quarter_Num'] = range(1, len(df) + 1)
    df['Year'] = df[date_col].dt.year
    df['Qtr'] = df[date_col].dt.quarter
    # Cyclical encoding
    df['Q_Sin'] = np.sin(2 * np.pi * df['Qtr'] / 4)
    df['Q_Cos'] = np.cos(2 * np.pi * df['Qtr'] / 4)
    # 2. Lag features (autocorrelation)
    df[f'{revenue_col}_Lag1'] = df[revenue_col].shift(1)  # Previous quarter
    df[f'{revenue_col}_Lag2'] = df[revenue_col].shift(2)  # 2 quarters ago
    df[f'{revenue_col}_Lag4'] = df[revenue_col].shift(4)  # Same quarter last year
    # 3. Rolling statistics over the previous 4 or 8 quarters
    df[f'{revenue_col}_SMA4'] = prev.rolling(4).mean()  # 4Q moving avg
    df[f'{revenue_col}_SMA8'] = prev.rolling(8).mean()  # 8Q moving avg
    df[f'{revenue_col}_STD4'] = prev.rolling(4).std()   # Volatility
    # 4. Change features, as of the most recent completed quarter
    df[f'{revenue_col}_QoQ'] = df[revenue_col].pct_change(1).shift(1) * 100  # QoQ %
    df[f'{revenue_col}_YoY'] = df[revenue_col].pct_change(4).shift(1) * 100  # YoY %
    # 5. Momentum features
    df[f'{revenue_col}_Accel'] = df[f'{revenue_col}_QoQ'].diff(1)  # Acceleration
    # 6. Interaction features (if external data available)
    if 'GDP_Growth' in df.columns and 'IT_Index' in df.columns:
        df['GDP_x_IT'] = df['GDP_Growth'] * df['IT_Index'] / 100
    return df
# ========================================
# Apply to TCS data from Example 1
# ========================================
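# Assumes the TCS 'data' DataFrame from Example 1. Example 2 reuses the name
# 'data' for the HUL series, so re-run Example 1's Step 1 first if needed.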
data_featured = create_financial_features(data)
# Drop NaN rows
data_clean = data_featured.dropna()
# ========================================
# Correlation Analysis
# ========================================
feature_cols = [c for c in data_clean.columns if c not in
                ['Quarter', 'Revenue', 'Q_Month']]
corr_matrix = data_clean[feature_cols + ['Revenue']].corr()
# Plot correlation heatmap
fig, ax = plt.subplots(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
            cmap='RdYlBu_r', center=0, ax=ax,
            square=True, linewidths=0.5)
ax.set_title('Feature Correlation Matrix: TCS Revenue',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('correlation_matrix.png', dpi=150, bbox_inches='tight')
plt.show()
# ========================================
# Feature Selection: Top predictors
# ========================================
revenue_corr = corr_matrix['Revenue'].drop('Revenue').abs().sort_values(ascending=False)
print("TOP FEATURES by |Correlation| with Revenue:")
print("=" * 45)
for feat, corr_val in revenue_corr.head(10).items():
    bar = '█' * int(corr_val * 40)
    print(f"  {feat:25s}: {corr_val:.4f} {bar}")
# Select features with |correlation| > 0.5
selected_features = revenue_corr[revenue_corr > 0.5].index.tolist()
print(f"\nSelected features (|corr| > 0.5): {len(selected_features)}")
print(f"Features: {selected_features}")
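One pitfall worth checking in any feature pipeline is target leakage: a rolling statistic computed on the revenue column includes the current quarter unless the series is shifted first. The toy series below (made-up numbers) shows the difference between the two versions.

```python
import pandas as pd

# Hypothetical revenue series, one value per quarter
rev = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 150.0], name="rev")

# Leaky: the window ending at row i includes rev[i], the value being predicted
leaky_sma4 = rev.rolling(4).mean()

# Safe: shift by one quarter first, so the window covers rows i-4 .. i-1 only
safe_sma4 = rev.shift(1).rolling(4).mean()

comparison = pd.DataFrame({"rev": rev, "leaky_sma4": leaky_sma4, "safe_sma4": safe_sma4})
print(comparison)
```

The safe version is available one row later and always lags the leaky one by a quarter; that lag is exactly the information a forecaster would not have had at prediction time.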
Model Evaluation & Selection
How to properly evaluate and compare forecasting models
Time-Series Cross-Validation: NOT Random Shuffle!
The most common mistake in financial ML is randomly shuffling time-series data for train/test split. This causes data leakage: the model trains on future data to predict the past.
| Method | Description | When to Use |
|---|---|---|
| Simple Split | First 80% train, last 20% test | Quick prototyping |
| Walk-Forward | Train on [1..t], predict t+1, then train on [1..t+1], predict t+2 | Gold standard for time-series |
| Expanding Window | Like walk-forward but training set keeps growing | When early data is still relevant |
| Rolling Window | Fixed-size window slides forward | When recent data matters more than old |
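The expanding and rolling schemes in the table differ only in where the training window starts. The generator below is a minimal sketch of both; the function name time_series_splits and its parameters are hypothetical, chosen for illustration.

```python
def time_series_splits(n_obs, min_train, window=None):
    """Yield (train_indices, test_indices) for one-step-ahead validation.

    window=None  -> expanding window (training set keeps growing)
    window=k     -> rolling window (only the most recent k observations)
    """
    for t in range(min_train, n_obs):
        start = 0 if window is None else max(0, t - window)
        yield list(range(start, t)), [t]


expanding = list(time_series_splits(8, min_train=4))
rolling = list(time_series_splits(8, min_train=4, window=4))

print("Expanding window:")
for train_idx, test_idx in expanding:
    print(f"  train={train_idx} -> test={test_idx}")

print("Rolling window (size 4):")
for train_idx, test_idx in rolling:
    print(f"  train={train_idx} -> test={test_idx}")
```

In both cases every training index precedes the test index, which is the property a random shuffle destroys. scikit-learn's TimeSeriesSplit offers a ready-made expanding split, and its max_train_size argument caps the window to get rolling behaviour.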
Worked Example 4: Comparing OLS vs Ridge vs Lasso
Task: Compare three regression models on TCS data using walk-forward validation.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt
def walk_forward_validation(df, features, target, model_class,
                            min_train=12, **model_params):
    """Walk-forward validation for time-series."""
    predictions = []
    actuals = []
    for i in range(min_train, len(df)):
        # Train on all data up to point i
        X_train = df.iloc[:i][features]
        y_train = df.iloc[:i][target]
        # Predict point i
        X_test = df.iloc[i:i+1][features]
        y_test = df.iloc[i:i+1][target]
        # Scale features (fit the scaler on training rows only to avoid leakage)
        scaler = StandardScaler()
        X_train_s = scaler.fit_transform(X_train)
        X_test_s = scaler.transform(X_test)
        # Train and predict
        model = model_class(**model_params)
        model.fit(X_train_s, y_train)
        pred = model.predict(X_test_s)[0]
        predictions.append(pred)
        actuals.append(y_test.values[0])
    return np.array(predictions), np.array(actuals)
# Prepare data (using data_clean from earlier examples)
features = ['GDP_Growth', 'IT_Index', 'Quarter_Num', 'Q_Sin', 'Q_Cos',
            'Revenue_Lag1', 'Revenue_Lag4']
target = 'Revenue'
# Run walk-forward for all 3 models
results = {}
for name, cls, params in [
    ('OLS', LinearRegression, {}),
    ('Ridge', Ridge, {'alpha': 1.0}),
    ('Lasso', Lasso, {'alpha': 100.0})
]:
    preds, actuals = walk_forward_validation(
        data_clean, features, target, cls, min_train=12, **params)
    results[name] = {
        'R²': r2_score(actuals, preds),
        'RMSE': np.sqrt(mean_squared_error(actuals, preds)),
        'MAPE': np.mean(np.abs((actuals - preds) / actuals)) * 100,
        'Predictions': preds,
        'Actuals': actuals
    }
# ========================================
# Compare Results
# ========================================
print("MODEL COMPARISON: Walk-Forward Validation")
print("=" * 55)
print(f"{'Model':<10} {'R²':>8} {'RMSE (₹ Cr)':>14} {'MAPE (%)':>10}")
print("-" * 55)
for name, r in results.items():
    print(f"{name:<10} {r['R²']:>8.4f} {r['RMSE']:>14,.0f} {r['MAPE']:>10.2f}")
best = min(results.items(), key=lambda x: x[1]['RMSE'])
print(f"\nBest Model: {best[0]} (lowest RMSE)")
# ========================================
# Visualization
# ========================================
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Plot 1: Actual vs Predicted
test_quarters = data_clean['Quarter'].iloc[12:]
for name, r in results.items():
    style = '-' if name == 'OLS' else '--' if name == 'Ridge' else ':'
    axes[0].plot(test_quarters, r['Predictions'], style, label=name, linewidth=2)
axes[0].plot(test_quarters, results['OLS']['Actuals'], 'ko', markersize=4,
             label='Actual', alpha=0.6)
axes[0].set_title('Walk-Forward Predictions', fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
axes[0].tick_params(axis='x', rotation=45)
# Plot 2: Error distribution
for name, r in results.items():
    errors = (r['Actuals'] - r['Predictions']) / r['Actuals'] * 100
    axes[1].hist(errors, bins=10, alpha=0.5, label=f'{name} (mean={errors.mean():.2f}%)')
axes[1].set_title('Prediction Error Distribution (%)', fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.savefig('model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
OLS can overfit when features are highly correlated with one another. Revenue_Lag1 and Revenue_Lag4 are both ~0.99 correlated with Revenue, so they are also nearly collinear with each other, which makes the OLS coefficients unstable. Ridge regression adds an L2 penalty that shrinks coefficients, reducing overfitting. Lasso (L1 penalty) is too aggressive here: it zeroes out useful features.
Practical takeaway: For financial forecasting with correlated features, Ridge is usually the best choice.
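The shrinkage effect can be seen directly from the closed-form solution. The sketch below builds two nearly collinear lag features from synthetic numbers (all values are assumptions) and solves the penalized normal equations for alpha = 0 (plain OLS) and alpha = 1 (ridge) on standardized features.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 24

# Two hypothetical lag features that move almost in lockstep
lag1 = np.linspace(30_000, 75_000, n) + rng.normal(0, 300, n)
lag4 = lag1 - 6_000 + rng.normal(0, 300, n)
y = 1.03 * lag1 + rng.normal(0, 500, n)

# Standardize features and center the target so no intercept is needed
X = np.column_stack([lag1, lag4])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()


def ridge_coefs(Xs, yc, alpha):
    """Solve (X'X + alpha * I) beta = X'y; alpha = 0 gives ordinary least squares."""
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(p), Xs.T @ yc)


beta_ols = ridge_coefs(Xs, yc, alpha=0.0)
beta_ridge = ridge_coefs(Xs, yc, alpha=1.0)
feature_corr = np.corrcoef(lag1, lag4)[0, 1]

print(f"corr(lag1, lag4) = {feature_corr:.4f}")
print("OLS coefficients:  ", beta_ols.round(1))
print("Ridge coefficients:", beta_ridge.round(1))
```

With collinear inputs OLS is free to split the effect between the two features in an erratic way, while the penalty pulls the ridge coefficients toward a smaller, more even solution. How much shrinkage helps out of sample still has to be judged with walk-forward validation.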
Practical Guidelines: When to Use What
| Scenario | Recommended Approach | Expected Accuracy |
|---|---|---|
| Mature company, stable growth (IT, FMCG) | Linear regression with lag features | MAPE 1-3% |
| Cyclical company (Auto, Metals) | Ridge regression + GDP/cyclical features | MAPE 3-8% |
| High-growth startup | Multiple regression + industry metrics | MAPE 5-15% |
| Seasonal business (Retail, Travel) | Time-series decomposition + seasonal indices | MAPE 2-5% |
| Volatile / event-driven | Scenario analysis over ML (too unpredictable) | N/A (use scenarios) |
If your model has R² = 0.9999 on training data but performs poorly on new data, you've overfit. Always use walk-forward validation and report test set metrics, not training metrics. A model with R² = 0.95 on the test set is more trustworthy than one with R² = 0.9999 only on training.
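The gap between training and test R² is easy to reproduce. In this sketch (entirely synthetic, with assumed parameters) only one feature carries signal, ten are pure noise, and the model is fitted on just 12 observations, so it has as many parameters as data points.

```python
import numpy as np

rng = np.random.default_rng(7)


def make_data(n):
    """One informative feature, ten noise features, plus an intercept column."""
    signal = rng.normal(0, 1, n)
    noise_feats = rng.normal(0, 1, (n, 10))
    y = 2.0 * signal + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), signal, noise_feats])
    return X, y


def r2(y, y_hat):
    """R² = 1 - SS_res / SS_tot."""
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


X_train, y_train = make_data(12)    # 12 rows, 12 parameters: the fit can interpolate
X_test, y_test = make_data(200)     # fresh data from the same process

beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_r2 = r2(y_train, X_train @ beta)
test_r2 = r2(y_test, X_test @ beta)

print(f"Training R²: {train_r2:.4f}")
print(f"Test R²:     {test_r2:.4f}")
```

The training score looks perfect because the model has memorized the noise; the test score shows what that memorization is worth on data it has not seen.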
Hands-On Practice Exercises
Exercise 1: Revenue Prediction for Infosys (25 min)
Objective: Build a linear regression model to predict Infosys quarterly revenue
| Assumption | Value |
|---|---|
| Company | Infosys (INFY.NS) |
| Data Period | 2016-Q1 to 2023-Q4 (32 quarters) |
| Features | GDP Growth, Nifty IT Index, Quarter encoding, Lag features |
| Train/Test Split | 24 quarters train / 8 quarters test |
Tasks:
- Fetch Infosys quarterly revenue using yfinance
- Engineer features: lags, rolling means, cyclical encoding
- Train a LinearRegression model
- Report R², RMSE, and MAPE on the test set
import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Fetch Infosys quarterly revenue
ticker = yf.Ticker("INFY.NS")
income_stmt = ticker.quarterly_income_stmt
# Columns arrive most recent first; sort the index for chronological order
revenue = income_stmt.loc['Total Revenue'].dropna().astype(float).sort_index()
revenue = revenue / 1e7  # Convert to ₹ Cr
# Note: yfinance usually exposes only the most recent few quarters here.
# The split below needs the full history, so with fewer rows load the
# 32 quarters from another source (for example a CSV built from filings).
# Create DataFrame
df = pd.DataFrame({
    'Quarter': revenue.index,
    'Revenue': revenue.values
})
df['Quarter_Num'] = range(1, len(df) + 1)
df['Qtr'] = df['Quarter'].dt.quarter
df['Q_Sin'] = np.sin(2 * np.pi * df['Qtr'] / 4)
df['Q_Cos'] = np.cos(2 * np.pi * df['Qtr'] / 4)
df['Revenue_Lag1'] = df['Revenue'].shift(1)
df['Revenue_Lag4'] = df['Revenue'].shift(4)
# Shift before rolling so the average covers the previous 4 quarters only
df['Revenue_SMA4'] = df['Revenue'].shift(1).rolling(4).mean()
df = df.dropna().reset_index(drop=True)
# Train/test split (chronological!)
train = df.iloc[:-8]
test = df.iloc[-8:]
features = ['Quarter_Num', 'Q_Sin', 'Q_Cos', 'Revenue_Lag1',
            'Revenue_Lag4', 'Revenue_SMA4']
model = LinearRegression()
model.fit(train[features], train['Revenue'])
preds = model.predict(test[features])
print(f"Test R²:   {r2_score(test['Revenue'], preds):.4f}")
print(f"Test RMSE: ₹{np.sqrt(mean_squared_error(test['Revenue'], preds)):,.0f} Cr")
mape = np.mean(np.abs((test['Revenue'] - preds) / test['Revenue'])) * 100
print(f"Test MAPE: {mape:.2f}%")
Exercise 2: Time-Series Decomposition of Stock Prices (20 min)
Objective: Download 5 years of monthly stock data for Reliance Industries, decompose into trend/seasonality/residual
Tasks:
- Fetch Reliance monthly closing prices using yfinance
- Apply multiplicative decomposition (period=12)
- Plot the 4-component decomposition chart
- Calculate and interpret seasonal indices: which months are strongest/weakest?
import yfinance as yf
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
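# Fetch 5 years of monthly data
data = yf.download("RELIANCE.NS", period="5y", interval="1mo")
# Recent yfinance versions can return a one-column DataFrame for 'Close';
# squeeze() turns it into a Series either way
prices = data['Close'].squeeze().dropna()
# Decompose (multiplicative, since prices grow over time)
decomp = seasonal_decompose(prices, model='multiplicative', period=12)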
# Plot
fig, axes = plt.subplots(4, 1, figsize=(14, 12), sharex=True)
decomp.observed.plot(ax=axes[0], title='Observed (Reliance Close Price)')
decomp.trend.plot(ax=axes[1], title='Trend')
decomp.seasonal.plot(ax=axes[2], title='Seasonal')
decomp.resid.plot(ax=axes[3], title='Residual')
plt.tight_layout()
plt.savefig('reliance_decomposition.png', dpi=150)
plt.show()
# Seasonal indices by month
seasonal = decomp.seasonal
monthly_avg = seasonal.groupby(seasonal.index.month).mean()
print("Monthly Seasonal Indices:")
for month, val in monthly_avg.items():
    effect = "↑" if val > 1 else "↓"
    print(f"  Month {month:2d}: {val:.4f} {effect}")
Exercise 3: Feature Engineering Challenge (20 min)
Objective: Start with a basic model (R² ~0.65) and improve it to R² >0.90 through feature engineering
Starting data: 20 quarters of revenue with only GDP growth as a feature
Tasks:
- Start with GDP_Growth as the only feature; record R²
- Add Quarter_Num (time trend); record new R²
- Add lag features (Revenue_Lag1, Revenue_Lag4); record R²
- Add rolling means and cyclical encoding; record R²
- Plot the improvement curve (R² vs number of features)
# Feature engineering progression
# Assumes data_clean from Worked Example 3 (it must contain Revenue_SMA4)
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
feature_sets = {
    '1. GDP Only': ['GDP_Growth'],
    '2. + Time Trend': ['GDP_Growth', 'Quarter_Num'],
    '3. + Lag Features': ['GDP_Growth', 'Quarter_Num', 'Revenue_Lag1', 'Revenue_Lag4'],
    '4. + Seasonality': ['GDP_Growth', 'Quarter_Num', 'Revenue_Lag1',
                         'Revenue_Lag4', 'Q_Sin', 'Q_Cos'],
    '5. + Rolling Stats': ['GDP_Growth', 'Quarter_Num', 'Revenue_Lag1',
                           'Revenue_Lag4', 'Q_Sin', 'Q_Cos', 'Revenue_SMA4'],
    '6. Full Set': ['GDP_Growth', 'IT_Index', 'Quarter_Num', 'Q_Sin', 'Q_Cos',
                    'Revenue_Lag1', 'Revenue_Lag4', 'Revenue_SMA4']
}
print("FEATURE ENGINEERING PROGRESSION")
print("=" * 50)
r2_values = []
for name, features in feature_sets.items():
    df_temp = data_clean[features + ['Revenue']].dropna()
    X = df_temp[features]
    y = df_temp['Revenue']
    model = LinearRegression().fit(X, y)
    # In-sample R² (fit and scored on the same rows), so it can only rise as
    # features are added; confirm any gain with walk-forward validation
    r2 = r2_score(y, model.predict(X))
    r2_values.append(r2)
    bar = '█' * int(r2 * 50)
    print(f"{name:30s}: R²={r2:.4f} {bar}")
# Plot progression
plt.figure(figsize=(10, 5))
plt.plot(range(len(r2_values)), r2_values, 'b-o', linewidth=2)
plt.xticks(range(len(r2_values)),
           [n.split('. ')[1] for n in feature_sets.keys()],
           rotation=30, ha='right', fontsize=9)
plt.ylabel('R²')
plt.title('R² Improvement with Feature Engineering', fontweight='bold')
plt.grid(alpha=0.3)
plt.ylim(0.5, 1.0)
plt.tight_layout()
plt.savefig('feature_progression.png', dpi=150)
plt.show()
Exercise 4 (Advanced): Automated Forecasting Pipeline (20 min)
Objective: Build a reusable pipeline that takes any ticker, fetches data, engineers features, trains two models (OLS and Ridge), and returns the best forecast
import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

def auto_forecast(ticker, forecast_periods=4):
    """Automated revenue forecasting pipeline."""
    print(f"\n{'='*60}")
    print(f"AUTOMATED FORECAST: {ticker}")
    print(f"{'='*60}")
    # 1. Fetch data
    stock = yf.Ticker(ticker)
    revenue = stock.quarterly_income_stmt.loc['Total Revenue']
    revenue = revenue.dropna().astype(float).sort_index() / 1e7  # ₹ Cr
    df = pd.DataFrame({'Revenue': revenue.values}, index=revenue.index)
    df['Quarter_Num'] = range(1, len(df) + 1)
    df['Qtr'] = df.index.quarter
    df['Q_Sin'] = np.sin(2 * np.pi * df['Qtr'] / 4)
    df['Q_Cos'] = np.cos(2 * np.pi * df['Qtr'] / 4)
    df['Revenue_Lag1'] = df['Revenue'].shift(1)
    df['Revenue_Lag4'] = df['Revenue'].shift(4)
    # Shift before rolling so the average covers the previous 4 quarters only
    df['Revenue_SMA4'] = df['Revenue'].shift(1).rolling(4).mean()
    df = df.dropna()
    features = ['Quarter_Num', 'Q_Sin', 'Q_Cos', 'Revenue_Lag1',
                'Revenue_Lag4', 'Revenue_SMA4']
    # yfinance often returns only a handful of quarters; the walk-forward test
    # below needs 4 test rows plus training rows (8 is an arbitrary minimum)
    if len(df) < 8:
        print(f"  Not enough history ({len(df)} usable quarters); skipping.")
        return []
    # 2. Train models (walk-forward over the last 4 quarters)
    best_model, best_name, best_rmse = None, None, float('inf')
    for name, cls, params in [('OLS', LinearRegression, {}),
                              ('Ridge', Ridge, {'alpha': 1.0})]:
        preds, actuals = [], []
        for i in range(len(df) - 4, len(df)):
            model = cls(**params)
            model.fit(df.iloc[:i][features], df.iloc[:i]['Revenue'])
            preds.append(model.predict(df.iloc[i:i+1][features])[0])
            actuals.append(df.iloc[i]['Revenue'])
        preds, actuals = np.array(preds), np.array(actuals)
        rmse = np.sqrt(mean_squared_error(actuals, preds))
        mape = np.mean(np.abs((actuals - preds) / actuals)) * 100
        print(f"  {name:10s}: RMSE=₹{rmse:>10,.0f}  MAPE={mape:.2f}%")
        if rmse < best_rmse:
            best_model, best_name, best_rmse = cls(**params), name, rmse
    print(f"  Best model: {best_name}")
    # 3. Retrain best model on all data and forecast recursively
    best_model.fit(df[features], df['Revenue'])
    history = df['Revenue'].tolist()
    last_date = df.index[-1]
    quarter_num = int(df['Quarter_Num'].iloc[-1])
    forecasts = []
    for q in range(1, forecast_periods + 1):
        next_q = last_date + pd.DateOffset(months=3 * q)
        angle = 2 * np.pi * next_q.quarter / 4
        row = pd.DataFrame([{
            'Quarter_Num': quarter_num + q,
            'Q_Sin': np.sin(angle),
            'Q_Cos': np.cos(angle),
            'Revenue_Lag1': history[-1],
            'Revenue_Lag4': history[-4],
            'Revenue_SMA4': np.mean(history[-4:]),
        }])[features]
        pred = best_model.predict(row)[0]
        forecasts.append(pred)
        history.append(pred)  # feed the forecast back in as the next lag
        print(f"  Forecast Q+{q}: ₹{pred:>10,.0f} Cr")
    return forecasts

# Test with multiple companies
auto_forecast("TCS.NS")
auto_forecast("INFY.NS")
auto_forecast("HINDUNILVR.NS")
Key Takeaways
What We Covered Today
- Built multiple linear regression models to predict revenue using GDP, industry indices, and lag features, achieving R² > 0.99
- Learned time-series decomposition: breaking revenue into trend, seasonal, and residual components using statsmodels
- Created powerful features: lag values, rolling means, cyclical encoding, YoY changes, and interaction terms
- Compared OLS vs Ridge vs Lasso: Ridge wins for financial data with correlated features
- Used walk-forward validation, the correct way to evaluate time-series models (never shuffle!)
- Built an automated forecasting pipeline that works for any company ticker
Session 22: Monte Carlo Simulation
We'll move from point forecasts to probability distributions, simulating thousands of scenarios to quantify risk and uncertainty in our models. Topics: random number generation, distribution fitting, and confidence intervals for DCF valuations.