PyTorch for Python

What is PyTorch?

  • Open-Source Machine Learning Framework: PyTorch is a popular open-source library used primarily for deep learning, but it is also well suited to general machine learning tasks.
  • Based on Torch: It's a Python adaptation of the Lua-based Torch scientific computing framework.
  • Core Applications:
    • Computer Vision (image/video processing)
    • Natural Language Processing (text analysis, translation)
    • Reinforcement Learning
    • Building and training deep neural networks

Key Features

  • Tensor Computations with GPU Acceleration: PyTorch heavily utilizes tensors (multidimensional arrays) for efficient numerical computations and can leverage the power of GPUs for faster processing.
  • Dynamic Computational Graphs: In contrast to frameworks like TensorFlow (prior to v2), PyTorch builds computational graphs on the fly. This enhances flexibility for debugging and experimentation during model development (a short example follows this list).
  • Pythonic and User-Friendly: PyTorch seamlessly integrates with the Python ecosystem and feels natural to Python developers, making it easy to learn and adopt.
  • Extensive Community and Ecosystem: PyTorch boasts a large, active community and a wide array of pre-trained models, tutorials, and tools.
  • Researcher-Friendly: Its dynamic nature and focus on flexibility make it highly favored in research environments where rapid prototyping and experimentation are crucial.
  • Production-Ready: While popular in research, PyTorch is equally capable for production-level deployments in applications like self-driving cars and natural language processing systems.
  • Strong Competition to TensorFlow: It's one of the primary competitors to Google's TensorFlow framework, with both offering distinct advantages.
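
As a quick illustration of the first two features, here is a minimal sketch (the shapes are arbitrary) of a tensor placed on the GPU when one is available and a gradient computed through a graph that is built as the code runs:

import torch

# Use the GPU if available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(3, device=device, requires_grad=True)
# The graph is defined by running Python code, so ordinary control flow works
y = (x ** 2).sum() if x.sum() > 0 else (3 * x).sum()
y.backward()
print(x.grad)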

Getting Started with PyTorch

  1. Installation: pip install torch torchvision (Often you'll install torchvision for computer vision tools)
  2. Basics: Learn about tensors, neural network modules, automatic differentiation, and model training loops (a compact example follows this list).
  3. Resources: The official tutorials (https://pytorch.org/tutorials) and documentation (https://pytorch.org/docs) cover each of these topics in depth.
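
For step 2, here is a minimal sketch that touches each of those basics in a few lines; the layer sizes and random data are purely illustrative:

import torch
import torch.nn as nn

# Tensors: a batch of 8 samples with 4 features, plus regression targets
x = torch.randn(8, 4)
y = torch.randn(8, 1)

# Neural network module and an optimizer
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One step of a training loop: forward, loss, autograd backward, update
pred = model(x)
loss = nn.functional.mse_loss(pred, y)
loss.backward()          # automatic differentiation fills parameter gradients
optimizer.step()         # update weights
optimizer.zero_grad()    # clear gradients before the next step
print(loss.item())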

Installation Guide

PyTorch offers flexible installation options based on your hardware and requirements:

CPU-Only Installation

For CPU-only environments (testing, development without GPU):

pip install torch torchvision torchaudio

GPU Installation with CUDA

For NVIDIA GPUs, install CUDA-enabled PyTorch. Check your CUDA version first (for example with nvidia-smi, which reports the highest CUDA version your driver supports):

# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

AMD GPU with ROCm

For AMD GPUs using ROCm:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6

Conda Installation

Using conda for environment management:

# CPU version
conda install pytorch torchvision torchaudio cpuonly -c pytorch

# CUDA version
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

Verify Installation

import torch

# Check PyTorch version
print(f"PyTorch version: {torch.__version__}")

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

Core Concepts Deep Dive

Tensors

Tensors are the fundamental building blocks in PyTorch - multidimensional arrays similar to NumPy arrays but with GPU acceleration capabilities.

import torch

# Creating tensors
tensor_1d = torch.tensor([1, 2, 3, 4])
tensor_2d = torch.tensor([[1, 2], [3, 4]])
tensor_zeros = torch.zeros(3, 3)
tensor_ones = torch.ones(2, 4)
tensor_random = torch.randn(3, 3)  # Normal distribution

# Tensor from NumPy
import numpy as np
numpy_array = np.array([1, 2, 3])
tensor_from_numpy = torch.from_numpy(numpy_array)

# Tensor properties
print(f"Shape: {tensor_2d.shape}")
print(f"Data type: {tensor_2d.dtype}")
print(f"Device: {tensor_2d.device}")

# Tensor operations
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Element-wise operations
addition = a + b
multiplication = a * b
dot_product = torch.dot(a, b)

# Matrix operations
matrix_a = torch.randn(3, 4)
matrix_b = torch.randn(4, 5)
matrix_mult = torch.matmul(matrix_a, matrix_b)

Automatic Differentiation (Autograd)

PyTorch's autograd system automatically computes gradients for backpropagation:

# Enable gradient tracking
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([6.0, 4.0], requires_grad=True)

# Define computation
z = x * y
loss = z.sum()

# Compute gradients automatically
loss.backward()

# Access gradients
print(f"Gradient of x: {x.grad}")  # dL/dx
print(f"Gradient of y: {y.grad}")  # dL/dy

# Gradients accumulate by default, so clear them before the next backward pass
x.grad.zero_()

Computational Graphs

PyTorch builds dynamic computational graphs that record operations:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

# Operations create computational graph
z = x ** 2 + y ** 3
z.backward()

print(f"dz/dx = {x.grad}")  # 2*x = 6.0
print(f"dz/dy = {y.grad}")  # 3*y^2 = 48.0

Building Neural Networks

PyTorch provides nn.Module as the base class for all neural networks:

Basic Neural Network Architecture

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNet, self).__init__()
        # Define layers
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        # Define forward pass
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Instantiate model
model = SimpleNet(input_size=784, hidden_size=128, output_size=10)
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

Convolutional Neural Network (CNN)

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super(CNN, self).__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)

        # Fully connected layers (64 * 7 * 7 assumes 28x28 inputs: two 2x2 poolings give 7x7 maps)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        # Conv block 1
        x = self.pool(F.relu(self.conv1(x)))
        # Conv block 2
        x = self.pool(F.relu(self.conv2(x)))
        # Flatten
        x = x.view(-1, 64 * 7 * 7)
        # Fully connected
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

cnn_model = CNN(num_classes=10)

Common Activation Functions

# ReLU (most common)
relu = nn.ReLU()

# LeakyReLU
leaky_relu = nn.LeakyReLU(negative_slope=0.01)

# Sigmoid
sigmoid = nn.Sigmoid()

# Tanh
tanh = nn.Tanh()

# Softmax (for classification output)
softmax = nn.Softmax(dim=1)

# GELU (used in transformers)
gelu = nn.GELU()

Training Workflow

Complete training pipeline with data loading, loss computation, and optimization:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# 1. Prepare data
X_train = torch.randn(1000, 20)  # 1000 samples, 20 features
y_train = torch.randint(0, 2, (1000,))  # Binary classification

train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# 2. Initialize model
model = SimpleNet(input_size=20, hidden_size=64, output_size=2)

# 3. Define loss function
criterion = nn.CrossEntropyLoss()

# 4. Define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 5. Training loop
num_epochs = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for epoch in range(num_epochs):
    model.train()  # Set to training mode
    running_loss = 0.0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(train_loader):
        # Move to device
        data, target = data.to(device), target.to(device)

        # Zero gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(data)
        loss = criterion(outputs, target)

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()

        # Statistics
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

    # Epoch statistics
    epoch_loss = running_loss / len(train_loader)
    accuracy = 100. * correct / total
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}, Accuracy: {accuracy:.2f}%')

# 6. Evaluation mode (assumes a val_loader built the same way as train_loader)
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for data, target in val_loader:
        data, target = data.to(device), target.to(device)
        preds = model(data).argmax(dim=1)
        correct += preds.eq(target).sum().item()
        total += target.size(0)
print(f'Validation accuracy: {100. * correct / total:.2f}%')

Learning Rate Scheduling

# Learning rate scheduler
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

# In training loop
for epoch in range(num_epochs):
    # ... training code ...
    scheduler.step()  # Update learning rate

PyTorch vs TensorFlow

| Feature               | PyTorch                             | TensorFlow                                   |
|-----------------------|-------------------------------------|----------------------------------------------|
| Computational Graph   | Dynamic (define-by-run)             | Static (TF 1.x), Dynamic (TF 2.x with Eager) |
| Debugging             | Easier with Python debugger         | More complex in TF 1.x, improved in 2.x      |
| Production Deployment | TorchServe, ONNX (sketch below)     | TensorFlow Serving, TFLite, TF.js            |
| Learning Curve        | More Pythonic, easier for beginners | Steeper initially, improved with TF 2.x      |
| Community             | Strong in research, academia        | Strong in industry, production               |
| Mobile Deployment     | PyTorch Mobile                      | TensorFlow Lite (more mature)                |
| Visualization         | TensorBoard (via integration)       | TensorBoard (native)                         |
| API Design            | More flexible, explicit control     | More high-level options (Keras)              |
| Performance           | Excellent for research workflows    | Optimized for production scale               |
| Ecosystem             | Growing rapidly                     | More mature, extensive                       |
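
To make the Production Deployment row concrete, here is a minimal sketch of exporting the SimpleNet model (defined in the Building Neural Networks section below) to ONNX; the file name and tensor names are illustrative:

import torch

# Illustrative export of SimpleNet to the ONNX interchange format
model = SimpleNet(input_size=20, hidden_size=64, output_size=2)
model.eval()

dummy_input = torch.randn(1, 20)
torch.onnx.export(
    model,
    dummy_input,
    "simplenet.onnx",          # output file name is illustrative
    input_names=["features"],
    output_names=["logits"],
)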

When to Use PyTorch

  • Research and experimentation
  • Rapid prototyping
  • When you need dynamic computational graphs
  • Academic projects and papers
  • Computer vision and NLP research

When to Use TensorFlow

  • Production deployment at scale
  • Mobile and edge device deployment
  • When you need TensorFlow Extended (TFX) for MLOps
  • JavaScript/web deployment (TF.js)
  • Established production pipelines

GPU Acceleration

Moving Tensors to GPU

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Move tensors to GPU
tensor_cpu = torch.randn(1000, 1000)
tensor_gpu = tensor_cpu.to(device)

# Alternative methods
tensor_gpu = tensor_cpu.cuda()  # Explicit CUDA
tensor_gpu = torch.randn(1000, 1000, device='cuda')  # Create directly on GPU

# Move back to CPU
tensor_cpu = tensor_gpu.cpu()

# Check tensor device
print(f"Tensor is on: {tensor_gpu.device}")

GPU Memory Management

# Clear GPU cache
torch.cuda.empty_cache()

# Get memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

# Set specific GPU
torch.cuda.set_device(0)  # Use GPU 0

# Context manager for specific GPU
with torch.cuda.device(1):
    # Operations on GPU 1
    pass

Multi-GPU Training

# DataParallel (simple but older method)
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)

model.to(device)

# DistributedDataParallel (recommended for multi-GPU)
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group (launch with torchrun, which sets LOCAL_RANK and friends)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Wrap model
model = DDP(model.to(local_rank), device_ids=[local_rank])

Common Use Cases

Computer Vision - Image Classification

import torchvision
import torchvision.transforms as transforms

# Data preprocessing
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                       std=[0.229, 0.224, 0.225])
])

# Load pretrained model (newer torchvision versions use weights=... instead of pretrained=True)
model = torchvision.models.resnet50(pretrained=True)

# Modify for custom number of classes
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tuning: freeze all layers, then unfreeze only the new classifier head
for param in model.parameters():
    param.requires_grad = False

# Unfreeze final layer
for param in model.fc.parameters():
    param.requires_grad = True

Natural Language Processing - Text Classification

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super(TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256, num_classes)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        # x shape: (batch_size, sequence_length)
        embedded = self.embedding(x)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # Concatenate the final forward and backward hidden states
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        output = self.dropout(hidden)
        output = self.fc(output)
        return output

# Initialize model
vocab_size = 10000
model = TextClassifier(vocab_size=vocab_size, embed_dim=100, num_classes=5)

Reinforcement Learning - Simple Agent

class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        q_values = self.fc3(x)
        return q_values

# Agent action selection (current_state is the environment observation, e.g. from env.reset())
state = torch.FloatTensor(current_state)
q_values = model(state)
action = q_values.argmax().item()

Best Practices

Model Saving and Loading

# Save entire model
torch.save(model, 'model_complete.pth')

# Load entire model (recent PyTorch versions may require weights_only=False here)
model = torch.load('model_complete.pth')

# Save model state dict (recommended)
torch.save(model.state_dict(), 'model_weights.pth')

# Load model state dict
model = SimpleNet(input_size=20, hidden_size=64, output_size=2)
model.load_state_dict(torch.load('model_weights.pth'))

# Save training checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')

# Load checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

Debugging Techniques

# Check for NaN values
assert not torch.isnan(loss).any(), "Loss contains NaN"

# Gradient clipping (prevent exploding gradients)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Register hooks for debugging
def print_grad(grad):
    print(f"Gradient: {grad}")

x = torch.tensor([1.0], requires_grad=True)
x.register_hook(print_grad)

# Anomaly detection
torch.autograd.set_detect_anomaly(True)

# Check model architecture (requires the third-party torchsummary package)
from torchsummary import summary
summary(model, input_size=(1, 28, 28))  # input_size should match the model's expected input

Profiling Performance

import torch.profiler as profiler

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    # Your training/inference code here (example input shape matches SimpleNet)
    input_tensor = torch.randn(1, 20)
    model(input_tensor)

# Print results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Export for Chrome trace viewer
prof.export_chrome_trace("trace.json")

PyTorch Ecosystem

TorchVision - Computer Vision

import torchvision

# Pretrained models
resnet = torchvision.models.resnet50(pretrained=True)
vgg = torchvision.models.vgg16(pretrained=True)
efficientnet = torchvision.models.efficientnet_b0(pretrained=True)

# Datasets
train_dataset = torchvision.datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

# Data augmentation transforms (renamed to avoid shadowing the transforms module imported above)
augmentation = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.RandomRotation(10),
    torchvision.transforms.ToTensor(),
])

TorchText - Natural Language Processing

# Note: torchtext has undergone significant API changes
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')

# Build vocabulary (texts is any iterable of raw strings, e.g. a list of sentences)
def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

texts = ["PyTorch makes NLP easier", "torchtext builds vocabularies"]
vocab = build_vocab_from_iterator(yield_tokens(texts), specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])

TorchAudio - Audio Processing

import torchaudio

# Load audio
waveform, sample_rate = torchaudio.load("audio.wav")

# Transforms
spectrogram = torchaudio.transforms.Spectrogram()
mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=16000)

# Apply transform
spec = mel_spectrogram(waveform)

PyTorch Lightning - Simplified Training

import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = SimpleNet(20, 64, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

# Train with Lightning
lit_model = LitModel()
trainer = pl.Trainer(max_epochs=10, accelerator='gpu')  # use accelerator='cpu' if no GPU
trainer.fit(lit_model, train_loader)

Performance Optimization

JIT Compilation with TorchScript

# Trace-based compilation: tracing records the operations run on an example input
model = SimpleNet(20, 64, 2)
model.eval()

example_input = torch.randn(1, 20)
traced_model = torch.jit.trace(model, example_input)

# Save traced model
traced_model.save("model_traced.pt")

# Script-based compilation (supports control flow)
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")

# Load and use
loaded_model = torch.jit.load("model_traced.pt")
output = loaded_model(example_input)

Mixed Precision Training

from torch.cuda.amp import autocast, GradScaler

# Initialize gradient scaler
scaler = GradScaler()

for epoch in range(num_epochs):
    for data, target in train_loader:
        # Move the batch to the same device as the model
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()

        # Automatic mixed precision
        with autocast():
            output = model(data)
            loss = criterion(output, target)

        # Scaled backward pass
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

DataLoader Optimization

# Optimize data loading
train_loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,          # Parallel data loading
    pin_memory=True,        # Faster GPU transfer
    persistent_workers=True, # Keep workers alive
    prefetch_factor=2       # Prefetch batches
)

# Custom collate function for variable-length sequences
def collate_fn(batch):
    sequences, labels = zip(*batch)
    # Pad sequences
    padded_sequences = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True)
    return padded_sequences, torch.tensor(labels)

loader = DataLoader(dataset, batch_size=32, collate_fn=collate_fn)

Model Optimization Techniques

# Gradient accumulation for larger effective batch size
accumulation_steps = 4

for i, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = criterion(output, target)
    loss = loss / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Gradient checkpointing for memory efficiency
from torch.utils.checkpoint import checkpoint

class CheckpointedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())

    def forward(self, x):
        # Trade compute for memory: activations are recomputed during the backward pass
        x = checkpoint(self.layer1, x)
        x = checkpoint(self.layer2, x)
        return x

Troubleshooting

Common Errors and Solutions

CUDA Out of Memory

# Solution 1: Reduce batch size
batch_size = 16  # Instead of 64

# Solution 2: Gradient accumulation
# (shown in optimization section)

# Solution 3: Clear cache
torch.cuda.empty_cache()

# Solution 4: Use gradient checkpointing
# (shown in optimization section)

RuntimeError: Expected all tensors to be on the same device

# Problem: Tensors on different devices
# Solution: Ensure all tensors are on same device
model = model.to(device)
data = data.to(device)
target = target.to(device)

Gradient becomes NaN

# Solution 1: Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Solution 2: Lower learning rate
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# Solution 3: Check for invalid operations
assert not torch.isnan(loss).any()

DataLoader Worker Crashes

# Solution: Reduce num_workers or set to 0
train_loader = DataLoader(dataset, batch_size=32, num_workers=0)

# On Windows, use proper main guard
if __name__ == '__main__':
    # DataLoader code here
    pass

Model Not Learning

# Check 1: Verify gradients are flowing
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: {param.grad.abs().mean()}")

# Check 2: Ensure model is in training mode
model.train()

# Check 3: Verify learning rate
print(f"Learning rate: {optimizer.param_groups[0]['lr']}")

# Check 4: Check data and labels
print(f"Data range: {data.min()} to {data.max()}")
print(f"Unique labels: {target.unique()}")

Slow Training Speed

# Profiling to find bottlenecks
import time

start = time.time()
for i, (data, target) in enumerate(train_loader):
    if i == 0:
        print(f"First batch loading time: {time.time() - start:.2f}s")

    # Training code
    pass

# Solutions:
# - Increase num_workers in DataLoader
# - Use pin_memory=True
# - Move data preprocessing to GPU if possible
# - Use mixed precision training

Related Articles

Python Cheat Sheet

Awesome Python frameworks. A curated list of awesome Python frameworks, libraries, software and resources.