Intermediate Machine Learning: Advanced Techniques and Production-Ready Models
Table of Contents
- Introduction
- Advanced Feature Engineering
- Ensemble Methods
- Handling Imbalanced Data
- Hyperparameter Optimization
- Building ML Pipelines
- Model Validation and Cross-Validation
- Model Interpretation and Explainability
- Model Deployment Basics
- Real-World Project: Complete End-to-End Example
- Best Practices and Common Pitfalls
- Continue Your Learning Journey
- Related Topics
- Further Reading
Introduction
You've mastered the basics of machine learning—you can load data, train models, and make predictions. But real-world ML is messier and more complex. Production models require sophisticated feature engineering, careful optimization, robust validation strategies, and deployment pipelines.
This intermediate guide bridges the gap between tutorial code and production-ready machine learning systems. You'll learn techniques that data scientists and ML engineers use daily to build models that actually work in the real world.
Prerequisites: This guide assumes you're comfortable with:
- Basic ML concepts (supervised/unsupervised learning)
- Python and common libraries (pandas, numpy, scikit-learn)
- Training and evaluating simple models
If you need a refresher, start with our Beginner's Guide to AI and Machine Learning.
Advanced Feature Engineering
Features are often the single biggest driver of model performance. Great features with a simple model frequently outperform poor features with a complex model.
Handling Missing Data
Missing data is inevitable in real-world datasets. Here are strategies beyond simple deletion:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Load dataset with missing values
df = pd.read_csv('data.csv')
# Analyze missing data patterns
print(df.isnull().sum())
print(f"\nPercentage missing:\n{df.isnull().mean() * 100}")
# Strategy 1: Mean/Median/Mode Imputation
imputer_mean = SimpleImputer(strategy='mean')
df['age_imputed'] = imputer_mean.fit_transform(df[['age']])
# Strategy 2: Forward/Backward Fill (time series)
df['temperature_ffill'] = df['temperature'].ffill()  # forward fill; fillna(method=...) is deprecated in recent pandas
# Strategy 3: KNN Imputation (uses similar rows)
imputer_knn = KNNImputer(n_neighbors=5)
df_knn_imputed = pd.DataFrame(
imputer_knn.fit_transform(df.select_dtypes(include=[np.number])),
columns=df.select_dtypes(include=[np.number]).columns
)
# Strategy 4: Iterative Imputation (MICE algorithm)
imputer_iterative = IterativeImputer(random_state=42)
df_iter_imputed = pd.DataFrame(
imputer_iterative.fit_transform(df.select_dtypes(include=[np.number])),
columns=df.select_dtypes(include=[np.number]).columns
)
# Strategy 5: Add missing indicator feature
df['age_is_missing'] = df['age'].isnull().astype(int)
Best Practices:
- Analyze why data is missing (MCAR, MAR, or MNAR)
- Create "missing" indicator features when missingness is informative
- Document imputation strategy for reproducibility
- Compare model performance with different strategies (see the sketch below)
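For the last point, here is a minimal sketch of such a comparison, assuming a numeric feature matrix X with missing values and a binary target y (both placeholders); the LogisticRegression baseline is just a stand-in for whatever model you plan to use:
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Candidate imputers evaluated with the same model and CV splits
imputers = {
    'mean': SimpleImputer(strategy='mean'),
    'median': SimpleImputer(strategy='median'),
    'knn': KNNImputer(n_neighbors=5),
}

for name, imputer in imputers.items():
    # Imputation stays inside the pipeline, so it is fit on training folds only
    pipe = Pipeline([('imputer', imputer), ('model', LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")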
Encoding Categorical Variables
Beyond simple one-hot encoding:
from sklearn.preprocessing import OrdinalEncoder
from category_encoders import TargetEncoder, BinaryEncoder, HashingEncoder
# Sample dataset
df = pd.DataFrame({
'city': ['NYC', 'LA', 'Chicago', 'NYC', 'Houston', 'LA'] * 100,
'education': ['BS', 'MS', 'PhD', 'HS', 'BS', 'MS'] * 100,
'salary': [75000, 95000, 120000, 45000, 80000, 100000] * 100
})
# 1. Label Encoding (ordinal relationships)
ordinal_encoder = OrdinalEncoder(
categories=[['HS', 'BS', 'MS', 'PhD']]
)
df['education_ordinal'] = ordinal_encoder.fit_transform(df[['education']])
# 2. One-Hot Encoding (no ordinal relationship)
df_onehot = pd.get_dummies(df, columns=['city'], prefix='city')
# 3. Target Encoding (using target variable)
target_encoder = TargetEncoder(cols=['city'])
df['city_target_encoded'] = target_encoder.fit_transform(
    df[['city']], df['salary']
)['city']
# 4. Binary Encoding (for high cardinality)
binary_encoder = BinaryEncoder(cols=['city'])
df_binary = binary_encoder.fit_transform(df)
# 5. Hashing Encoding (for very high cardinality)
hash_encoder = HashingEncoder(cols=['city'], n_components=8)
df_hashed = hash_encoder.fit_transform(df)
# 6. Frequency Encoding
city_freq = df['city'].value_counts(normalize=True)
df['city_frequency'] = df['city'].map(city_freq)
When to Use Each:
- Ordinal: Natural ordering (small, medium, large)
- One-Hot: Few categories (less than 10), no order
- Target: High cardinality, supervised learning
- Binary: High cardinality (100-1000 categories)
- Hashing: Very high cardinality (greater than 1000 categories)
- Frequency: When frequency correlates with target
Feature Scaling and Transformation
from sklearn.preprocessing import (
StandardScaler, MinMaxScaler, RobustScaler,
PowerTransformer, QuantileTransformer
)
import scipy.stats as stats
# 1. StandardScaler (mean=0, std=1)
scaler_standard = StandardScaler()
df['age_standard'] = scaler_standard.fit_transform(df[['age']])
# 2. MinMaxScaler (range 0-1)
scaler_minmax = MinMaxScaler()
df['salary_minmax'] = scaler_minmax.fit_transform(df[['salary']])
# 3. RobustScaler (robust to outliers)
scaler_robust = RobustScaler()
df['income_robust'] = scaler_robust.fit_transform(df[['income']])
# 4. Log Transformation (for right-skewed data)
df['price_log'] = np.log1p(df['price']) # log1p = log(1+x)
# 5. Box-Cox Transformation (makes data more normal; requires strictly positive values)
power_transformer = PowerTransformer(method='box-cox')
df['sales_boxcox'] = power_transformer.fit_transform(df[['sales']] + 1)  # +1 shifts zeros; 'yeo-johnson' also handles non-positive data
# 6. Quantile Transformation (to uniform/normal distribution)
quantile_transformer = QuantileTransformer(output_distribution='normal')
df['feature_quantile'] = quantile_transformer.fit_transform(df[['feature']])
# Detect skewness
skewness = df['salary'].skew()
print(f"Skewness: {skewness}")
if abs(skewness) > 0.5:
print("Consider transformation")
Creating Derived Features
# 1. Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = pd.DataFrame(
poly.fit_transform(df[['age', 'income']]),
columns=poly.get_feature_names_out(['age', 'income'])
)
# 2. Interaction Features
df['age_income_interaction'] = df['age'] * df['income']
df['is_senior_high_earner'] = ((df['age'] > 65) & (df['income'] > 100000)).astype(int)
# 3. Binning/Discretization
df['age_group'] = pd.cut(
df['age'],
bins=[0, 18, 35, 50, 65, 100],
labels=['child', 'young_adult', 'middle_aged', 'senior', 'elderly']
)
# 4. Date/Time Features
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
# 5. Aggregate Features
customer_stats = df.groupby('customer_id').agg({
'purchase_amount': ['mean', 'sum', 'std', 'count'],
'days_since_purchase': 'min'
}).reset_index()
customer_stats.columns = ['_'.join(col).strip() for col in customer_stats.columns.values]
# 6. Text Features
df['text_length'] = df['review'].str.len()
df['word_count'] = df['review'].str.split().str.len()
df['capital_ratio'] = df['review'].apply(lambda x: sum(c.isupper() for c in x) / max(len(x), 1))
Feature Selection
Reducing the number of features can improve model performance and interpretability:
from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif,
    RFE, SelectFromModel, VarianceThreshold
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
# Remove low variance features
variance_selector = VarianceThreshold(threshold=0.1)
X_high_variance = variance_selector.fit_transform(X)
# 1. Univariate Feature Selection
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()
# 2. Mutual Information
mi_selector = SelectKBest(score_func=mutual_info_classif, k=10)
mi_selector.fit(X, y)
mi_scores = pd.DataFrame({
'feature': X.columns,
'score': mi_selector.scores_
}).sort_values('score', ascending=False)
# 3. Recursive Feature Elimination
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
rfe.fit(X, y)
rfe_features = X.columns[rfe.support_].tolist()
# 4. L1-based Feature Selection
lasso_selector = SelectFromModel(
LogisticRegression(penalty='l1', solver='liblinear'),
threshold='median'
)
lasso_selector.fit(X, y)
selected_features_lasso = X.columns[lasso_selector.get_support()].tolist()
# 5. Feature Importance from Tree Models
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
# Plot top 20 features
feature_importance.head(20).plot(x='feature', y='importance', kind='barh', figsize=(10, 8), legend=False)
plt.title('Top 20 Most Important Features')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()
Ensemble Methods
Combining multiple models often outperforms single models.
Bagging (Bootstrap Aggregating)
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# Basic Bagging
bagging = BaggingClassifier(
estimator=DecisionTreeClassifier(),
n_estimators=100,
max_samples=0.8,
max_features=0.8,
bootstrap=True,
random_state=42
)
bagging.fit(X_train, y_train)
# Random Forest (advanced bagging for trees)
rf = RandomForestClassifier(
n_estimators=200,
max_depth=10,
min_samples_split=5,
min_samples_leaf=2,
max_features='sqrt',
bootstrap=True,
n_jobs=-1,
random_state=42
)
rf.fit(X_train, y_train)
Boosting
from sklearn.ensemble import (
AdaBoostClassifier, GradientBoostingClassifier
)
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
# AdaBoost
ada = AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=1),
n_estimators=100,
learning_rate=1.0,
random_state=42
)
ada.fit(X_train, y_train)
# Gradient Boosting
gb = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
subsample=0.8,
random_state=42
)
gb.fit(X_train, y_train)
# XGBoost (often among the strongest performers on tabular data)
xgb = XGBClassifier(
n_estimators=200,
learning_rate=0.05,
max_depth=6,
subsample=0.8,
colsample_bytree=0.8,
gamma=0,
reg_alpha=0,
reg_lambda=1,
random_state=42,
n_jobs=-1
)
xgb.fit(X_train, y_train)
# LightGBM (fast and efficient)
lgbm = LGBMClassifier(
n_estimators=200,
learning_rate=0.05,
num_leaves=31,
max_depth=-1,
subsample=0.8,
colsample_bytree=0.8,
random_state=42,
n_jobs=-1
)
lgbm.fit(X_train, y_train)
# CatBoost (handles categorical features automatically)
catboost = CatBoostClassifier(
iterations=200,
learning_rate=0.05,
depth=6,
l2_leaf_reg=3,
random_state=42,
verbose=False
)
catboost.fit(X_train, y_train, cat_features=['category_col1', 'category_col2'])
Stacking
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Base models
estimators = [
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('xgb', XGBClassifier(n_estimators=100, random_state=42)),
('svc', SVC(probability=True, random_state=42))
]
# Meta-model
stacking = StackingClassifier(
estimators=estimators,
final_estimator=LogisticRegression(),
cv=5
)
stacking.fit(X_train, y_train)
# Predict
y_pred = stacking.predict(X_test)
print(f"Stacking Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Voting
from sklearn.ensemble import VotingClassifier
# Hard voting (majority vote)
voting_hard = VotingClassifier(
estimators=estimators,
voting='hard'
)
voting_hard.fit(X_train, y_train)
# Soft voting (weighted probabilities)
voting_soft = VotingClassifier(
estimators=estimators,
voting='soft',
weights=[2, 3, 1] # Give more weight to XGBoost
)
voting_soft.fit(X_train, y_train)
Handling Imbalanced Data
Real-world datasets are often imbalanced (e.g., fraud detection, disease diagnosis).
Evaluation Metrics for Imbalanced Data
from sklearn.metrics import (
classification_report, confusion_matrix,
precision_recall_curve, roc_auc_score,
average_precision_score, f1_score
)
import matplotlib.pyplot as plt
# Don't use accuracy alone!
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
# Better metrics
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_pred_proba):.4f}")
print(f"Average Precision: {average_precision_score(y_test, y_pred_proba):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"\n{classification_report(y_test, y_pred)}")
# Precision-Recall Curve
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True)
plt.show()
Resampling Techniques
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek, SMOTEENN
from collections import Counter
print(f"Original class distribution: {Counter(y_train)}")
# 1. Random Oversampling
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)
print(f"After oversampling: {Counter(y_ros)}")
# 2. SMOTE (Synthetic Minority Over-sampling)
smote = SMOTE(random_state=42, k_neighbors=5)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {Counter(y_smote)}")
# 3. ADASYN (Adaptive Synthetic Sampling)
adasyn = ADASYN(random_state=42)
X_adasyn, y_adasyn = adasyn.fit_resample(X_train, y_train)
# 4. Random Undersampling
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print(f"After undersampling: {Counter(y_rus)}")
# 5. Combination: SMOTE + Tomek Links
smote_tomek = SMOTETomek(random_state=42)
X_combined, y_combined = smote_tomek.fit_resample(X_train, y_train)
print(f"After SMOTE+Tomek: {Counter(y_combined)}")
Class Weights
from sklearn.utils.class_weight import compute_class_weight
# Compute class weights
class_weights = compute_class_weight(
class_weight='balanced',
classes=np.unique(y_train),
y=y_train
)
class_weight_dict = dict(enumerate(class_weights))
# Use in models
rf_weighted = RandomForestClassifier(
n_estimators=100,
class_weight=class_weight_dict,
random_state=42
)
rf_weighted.fit(X_train, y_train)
# For XGBoost
scale_pos_weight = len(y_train[y_train==0]) / len(y_train[y_train==1])
xgb_weighted = XGBClassifier(
scale_pos_weight=scale_pos_weight,
random_state=42
)
xgb_weighted.fit(X_train, y_train)
Hyperparameter Optimization
Tuning hyperparameters can dramatically improve performance.
Grid Search
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', None]
}
# Grid search with cross-validation
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1,
verbose=2
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Use best model
best_model = grid_search.best_estimator_
Random Search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define parameter distributions
param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.3),   # samples from [0.01, 0.31]
    'subsample': uniform(0.5, 0.5),        # samples from [0.5, 1.0]
    'colsample_bytree': uniform(0.5, 0.5),
    'min_child_weight': randint(1, 10)
}
# Random search (faster than grid search)
random_search = RandomizedSearchCV(
XGBClassifier(random_state=42),
param_distributions=param_dist,
n_iter=100, # Number of random combinations to try
cv=5,
scoring='roc_auc',
n_jobs=-1,
random_state=42,
verbose=2
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
Bayesian Optimization
from skopt import BayesSearchCV
from skopt.space import Real, Integer
# Define search space
search_spaces = {
'n_estimators': Integer(100, 500),
'max_depth': Integer(3, 15),
'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
'subsample': Real(0.5, 1.0),
'colsample_bytree': Real(0.5, 1.0),
'gamma': Real(0, 5),
'reg_alpha': Real(0, 10),
'reg_lambda': Real(0, 10)
}
# Bayesian optimization
bayes_search = BayesSearchCV(
XGBClassifier(random_state=42),
search_spaces,
n_iter=50,
cv=5,
scoring='roc_auc',
n_jobs=-1,
random_state=42,
verbose=2
)
bayes_search.fit(X_train, y_train)
print(f"Best parameters: {bayes_search.best_params_}")
print(f"Best CV score: {bayes_search.best_score_:.4f}")
Building ML Pipelines
Pipelines ensure reproducible, maintainable code.
Basic Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Define transformers for different column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['city', 'education', 'employment_type']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Create full pipeline
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Train pipeline
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Save pipeline
import joblib
joblib.dump(pipeline, 'model_pipeline.pkl')
# Load and use
loaded_pipeline = joblib.load('model_pipeline.pkl')
predictions = loaded_pipeline.predict(new_data)
Custom Transformers
from sklearn.base import BaseEstimator, TransformerMixin
class OutlierRemover(BaseEstimator, TransformerMixin):
    """Clips values outside the IQR bounds rather than dropping rows."""
    def __init__(self, factor=1.5):
        self.factor = factor
    def fit(self, X, y=None):
        # Calculate IQR bounds for each feature
        self.Q1 = np.percentile(X, 25, axis=0)
        self.Q3 = np.percentile(X, 75, axis=0)
        self.IQR = self.Q3 - self.Q1
        return self
    def transform(self, X):
        # Clip outliers to the IQR bounds (keeps DataFrames as DataFrames)
        lower_bound = self.Q1 - (self.factor * self.IQR)
        upper_bound = self.Q3 + (self.factor * self.IQR)
        X_transformed = X.copy()
        if isinstance(X_transformed, pd.DataFrame):
            for i, col in enumerate(X_transformed.columns):
                X_transformed[col] = X_transformed[col].clip(lower_bound[i], upper_bound[i])
        else:
            X_transformed = np.clip(X_transformed, lower_bound, upper_bound)
        return X_transformed
class FeatureCreator(BaseEstimator, TransformerMixin):
def fit(self, X, y=None):
return self
def transform(self, X):
X_transformed = X.copy()
# Create new features
X_transformed['age_income_ratio'] = X_transformed['age'] / (X_transformed['income'] + 1)
X_transformed['is_high_earner'] = (X_transformed['income'] > 100000).astype(int)
return X_transformed
# Use in pipeline
pipeline_custom = Pipeline([
('outlier_remover', OutlierRemover(factor=1.5)),
('feature_creator', FeatureCreator()),
('scaler', StandardScaler()),
('model', RandomForestClassifier())
])
pipeline_custom.fit(X_train, y_train)
Model Validation and Cross-Validation
Advanced Cross-Validation Strategies
from sklearn.model_selection import (
    KFold, StratifiedKFold, GroupKFold,
    TimeSeriesSplit, cross_val_score, cross_validate
)
# 1. K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(f"KFold CV scores: {scores}")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
# 2. Stratified K-Fold (preserves class distribution)
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=stratified_kfold, scoring='roc_auc')
# 3. Group K-Fold (for grouped data, e.g., multiple samples per patient)
groups = df['patient_id'] # Ensure same patient isn't in train and test
group_kfold = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=groups, cv=group_kfold)
# 4. Time Series Split (for temporal data)
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# Train and evaluate
# 5. Multiple Metrics Cross-Validation
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc'
}
cv_results = cross_validate(
model, X, y,
cv=stratified_kfold,
scoring=scoring,
return_train_score=True
)
for metric in scoring:
print(f"{metric}: {cv_results[f'test_{metric}'].mean():.4f} "
f"(+/- {cv_results[f'test_{metric}'].std() * 2:.4f})")
Model Interpretation and Explainability
Understanding why a model makes predictions is crucial for trust and debugging.
Feature Importance
import shap
import lime
import lime.lime_tabular
# 1. Tree-based Feature Importance
feature_importance = pd.DataFrame({
'feature': X_train.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
plt.figure(figsize=(10, 8))
plt.barh(feature_importance['feature'][:20], feature_importance['importance'][:20])
plt.xlabel('Importance')
plt.title('Top 20 Feature Importances')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
# 2. Permutation Importance (model-agnostic)
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(
model, X_test, y_test,
n_repeats=10,
random_state=42,
n_jobs=-1
)
perm_importance_df = pd.DataFrame({
'feature': X_train.columns,
'importance': perm_importance.importances_mean
}).sort_values('importance', ascending=False)
# 3. SHAP Values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Note: for some binary classifiers this returns a list with one array per class;
# in that case use shap_values[1] (positive class) and explainer.expected_value[1] below
# Summary plot
shap.summary_plot(shap_values, X_test, plot_type="bar")
# Individual prediction explanation (first test sample)
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])
# 4. LIME (Local Interpretable Model-agnostic Explanations)
explainer_lime = lime.lime_tabular.LimeTabularExplainer(
X_train.values,
feature_names=X_train.columns,
class_names=['Class 0', 'Class 1'],
mode='classification'
)
# Explain a single prediction
exp = explainer_lime.explain_instance(
X_test.iloc[0].values,
model.predict_proba,
num_features=10
)
exp.show_in_notebook()
Partial Dependence Plots
from sklearn.inspection import PartialDependenceDisplay
# Show how predictions change with feature values
features_to_plot = [0, 1, (0, 1)] # Individual and interaction
PartialDependenceDisplay.from_estimator(
model, X_train, features_to_plot,
feature_names=X_train.columns
)
plt.tight_layout()
plt.show()
Model Deployment Basics
Saving and Loading Models
import joblib
import pickle
# Save with joblib (recommended for scikit-learn)
joblib.dump(model, 'model.pkl', compress=3)
# Load model
loaded_model = joblib.load('model.pkl')
# Save with pickle
with open('model.pkl', 'wb') as f:
pickle.dump(model, f)
# Load with pickle
with open('model.pkl', 'rb') as f:
loaded_model = pickle.load(f)
# Save entire pipeline
joblib.dump(pipeline, 'pipeline.pkl')
Creating a Simple Prediction API
from flask import Flask, request, jsonify
import pandas as pd
import joblib

app = Flask(__name__)
# Load model at startup
model = joblib.load('model_pipeline.pkl')
@app.route('/predict', methods=['POST'])
def predict():
# Get JSON data
data = request.get_json()
# Convert to DataFrame
df = pd.DataFrame([data])
# Make prediction
prediction = model.predict(df)
probability = model.predict_proba(df)
# Return result
return jsonify({
'prediction': int(prediction[0]),
'probability': float(probability[0][1])
})
@app.route('/health', methods=['GET'])
def health():
return jsonify({'status': 'healthy'})
if __name__ == '__main__':
app.run(debug=False, host='0.0.0.0', port=5000)
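Once the server is running, you can exercise it with a small client. This is a hedged example: the feature names in the payload are placeholders and must match the columns the pipeline was trained on.
import requests

# Hypothetical payload; keys must match the training columns of the pipeline
payload = {'age': 42, 'income': 85000, 'city': 'NYC', 'education': 'MS'}

response = requests.post('http://localhost:5000/predict', json=payload)
print(response.json())  # e.g. {'prediction': 0, 'probability': 0.12}

# Health check
print(requests.get('http://localhost:5000/health').json())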
Model Monitoring
import numpy as np
from scipy import stats
class ModelMonitor:
    def __init__(self, reference_data):
        # Keep the reference sample and its summary statistics for later comparisons
        self.reference_data = reference_data
        self.reference_mean = reference_data.mean()
        self.reference_std = reference_data.std()

    def detect_data_drift(self, new_data, threshold=0.05):
        """Detect if new data distribution differs significantly"""
        drift_detected = {}
        for col in new_data.columns:
            # Kolmogorov-Smirnov test against the reference sample
            statistic, p_value = stats.ks_2samp(
                self.reference_data[col],
                new_data[col]
            )
            drift_detected[col] = p_value < threshold
        return drift_detected

    def check_prediction_distribution(self, predictions):
        """Monitor if the prediction distribution changes.
        Assumes this monitor was initialized with reference predictions (1-D),
        so reference_mean and reference_std are scalars."""
        mean_pred = predictions.mean()
        # z-test of the new mean against the reference mean
        z_score = (mean_pred - self.reference_mean) / (self.reference_std / np.sqrt(len(predictions)))
        return abs(z_score) > 3  # 3-sigma rule
# Usage
monitor = ModelMonitor(X_train)
drift = monitor.detect_data_drift(X_new_batch)
if any(drift.values()):
print("Data drift detected! Consider retraining model.")
Real-World Project: Complete End-to-End Example
Let's put it all together with a credit card fraud detection project:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score, precision_score, recall_score
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline
import joblib
# 1. Load Data
df = pd.read_csv('creditcard.csv')
# 2. Exploratory Data Analysis
print(df.info())
print(df.describe())
print(f"Fraud percentage: {df['Class'].mean() * 100:.2f}%")
# 3. Feature Engineering
df['Amount_log'] = np.log1p(df['Amount'])
df['Time_hour'] = (df['Time'] / 3600) % 24
# 4. Prepare Features
X = df.drop(['Class', 'Amount', 'Time'], axis=1)
y = df['Class']
# 5. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 6. Handle Imbalanced Data
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
# 7. Create Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', XGBClassifier(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.05,
        # scale_pos_weight is omitted here: the training data is already balanced
        # with SMOTE, and combining both would over-correct toward the minority class
        random_state=42
    ))
])
# 8. Cross-Validation
# Note: resampling before CV lets synthetic samples leak across folds and can inflate
# the scores; see the fold-safe sketch after this example
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
pipeline, X_train_balanced, y_train_balanced,
cv=cv,
scoring=['roc_auc', 'precision', 'recall', 'f1'],
return_train_score=True
)
print("Cross-Validation Results:")
for metric in ['roc_auc', 'precision', 'recall', 'f1']:
print(f"{metric}: {cv_results[f'test_{metric}'].mean():.4f} "
f"(+/- {cv_results[f'test_{metric}'].std() * 2:.4f})")
# 9. Train Final Model
pipeline.fit(X_train_balanced, y_train_balanced)
# 10. Evaluate on Test Set
y_pred = pipeline.predict(X_test)
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
print("\nTest Set Performance:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
# 11. Save Model
joblib.dump(pipeline, 'fraud_detection_model.pkl')
# 12. Create Monitoring Dashboard
def create_monitoring_report(y_true, y_pred, y_pred_proba):
report = {
'total_predictions': len(y_pred),
'fraud_detected': sum(y_pred),
'fraud_rate': sum(y_pred) / len(y_pred),
'roc_auc': roc_auc_score(y_true, y_pred_proba),
'precision': precision_score(y_true, y_pred),
'recall': recall_score(y_true, y_pred)
}
return report
monitoring_report = create_monitoring_report(y_test, y_pred, y_pred_proba)
print(f"\nMonitoring Report: {monitoring_report}")
Best Practices and Common Pitfalls
Do's
- Always split data before any preprocessing (a leakage-safe sketch follows this list)
- Use cross-validation for reliable performance estimates
- Monitor for data drift in production
- Document your feature engineering process
- Version your models and data
- Test your pipeline end-to-end before deployment
- Keep train/test distributions similar
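To make the first point concrete, here is a minimal leakage-safe sketch (variable names are illustrative): split first, then fit every preprocessing step on the training split only.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split before computing any statistics from the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse the fitted scaler on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # no refitting: test statistics never influence training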
Don'ts
- Don't use test data for any training decisions
- Don't ignore class imbalance
- Don't rely solely on accuracy for imbalanced datasets (see the illustration after this list)
- Don't forget to scale features when necessary
- Don't skip exploratory data analysis
- Don't deploy without monitoring
- Don't trust a single validation metric
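As an illustration of the accuracy trap, here is a small synthetic example (data generated on the fly, so the exact numbers are illustrative) where a model that always predicts the majority class scores about 99% accuracy while catching zero fraud cases:
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic dataset with roughly 1% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10000, 5))
y_demo = (rng.random(10000) < 0.01).astype(int)

# A classifier that always predicts the majority class
dummy = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
y_hat = dummy.predict(X_demo)

print(f"Accuracy: {accuracy_score(y_demo, y_hat):.3f}")  # ~0.99 despite learning nothing
print(f"Recall: {recall_score(y_demo, y_hat):.3f}")      # 0.000, misses every positive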
Continue Your Learning Journey
You've now mastered intermediate machine learning techniques! You can:
- Engineer powerful features
- Handle imbalanced datasets
- Build ensemble models
- Optimize hyperparameters
- Create production pipelines
- Deploy and monitor models
Want to go deeper? Check out our Advanced Machine Learning Guide where you'll learn:
- Deep learning and neural networks
- Transformers and attention mechanisms
- Natural Language Processing
- Computer Vision
- Advanced MLOps practices
- Custom architectures and research
Need a refresher on basics? Review our Beginner's Guide to AI and Machine Learning.
Related Topics
- Airflow - Orchestrate ML pipelines in production
- Data Processing Pipeline Patterns - Build robust data pipelines
- Apache Spark - Process large-scale ML datasets
- Databricks - Unified analytics for ML
Further Reading
Related Articles
AI and Machine Learning for Beginners: Your Complete Getting Started Guide
A comprehensive beginner-friendly guide to understanding AI and Machine Learning concepts. Learn the fundamentals, set up your first ML environment, and build your first machine learning model from scratch with Python and scikit-learn.
Advanced Machine Learning: Deep Learning, NLP, Computer Vision, and MLOps
Master advanced ML topics including deep learning architectures, transformers, natural language processing, computer vision, transfer learning, and production MLOps. Build state-of-the-art models and deploy them at scale.
A List of Python Natural Language Processing (NLP) libraries
Explore Python's top NLP libraries like NLTK, spaCy, Gensim, TextBlob, and Transformers, each specializing in tasks like tokenization, topic modeling, sentiment analysis, and state-of-the-art language processing.