1. Deep Neural Networks (DNNs)
- Structure of a DNN (see the code sketch after this list)
- Input Layer:
- The first layer in the network receives raw input data (e.g., images, text, numerical values).
- Each neuron in the input layer corresponds to a feature of the input data.
- Hidden Layers:
- Intermediate layers that process input data through weighted connections and activation functions.
- Multiple hidden layers allow DNNs to learn complex patterns and hierarchical representations.
- Features:
- Each layer extracts progressively higher-level features from the input.
- Non-linear activation functions (e.g., ReLU, Sigmoid, Tanh) introduce non-linearity to the model.
- Output Layer:
- Produces the final predictions based on the learned features.
- The number of neurons in the output layer corresponds to the task:
- Single neuron for binary classification.
- Multiple neurons for multi-class classification (one per class) or multi-output regression.
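A minimal sketch of this layer structure in PyTorch; the layer sizes (784 input features, 128 hidden units, 10 output classes) are illustrative assumptions rather than values taken from these notes.

```python
import torch
import torch.nn as nn

class SimpleDNN(nn.Module):
    def __init__(self, in_features=784, hidden=128, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),   # input layer -> first hidden layer
            nn.ReLU(),                        # non-linear activation
            nn.Linear(hidden, hidden),        # second hidden layer
            nn.ReLU(),
            nn.Linear(hidden, num_classes),   # output layer: one neuron per class
        )

    def forward(self, x):
        return self.net(x)

model = SimpleDNN()
logits = model(torch.randn(32, 784))  # a batch of 32 flattened inputs
print(logits.shape)                   # torch.Size([32, 10])
```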
- Challenges in Training Deep Networks
- Vanishing Gradients:
- Gradients become very small as they are backpropagated through many layers, leading to slow or stalled weight updates.
- Common in networks with Sigmoid or Tanh activation functions.
- Exploding Gradients:
- Gradients grow excessively large during backpropagation, causing instability and divergence.
- Often occurs in RNNs or when weights are poorly initialized.
- Solutions (gradient clipping and batch normalization are sketched in code below):
- Gated architectures such as LSTMs, which mitigate vanishing gradients in sequential models.
- Gradient clipping to prevent gradients from exceeding a threshold.
- Batch normalization to stabilize training dynamics.
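A sketch of two of the fixes above, gradient clipping and batch normalization, in PyTorch; the clipping threshold (max_norm=1.0), layer sizes, and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.BatchNorm1d(128),   # normalizes layer inputs, stabilizing training dynamics
    nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = criterion(model(x), y)
loss.backward()
# Clip the global gradient norm so exploding gradients cannot destabilize the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```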
2. Training Techniques
- Regularization Methods
- Techniques to prevent overfitting and improve generalization (all three are combined in the sketch after this list):
- Weight Decay (L2 Regularization):
- Penalizes large weights by adding a term proportional to their squared magnitude to the loss function.
- Dropout:
- Randomly drops neurons during training to prevent reliance on specific features.
- Effective in reducing overfitting in large networks.
- Batch Normalization:
- Normalizes layer inputs to a standard distribution, stabilizing gradients and speeding up training.
- Acts as a form of regularization by introducing noise during training.
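One way to combine the three regularizers above in PyTorch: dropout and batch normalization sit inside the model, while weight decay (L2) is passed to the optimizer. The dropout rate, weight-decay coefficient, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalizes activations of the previous layer
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly zeroes 50% of activations during training
    nn.Linear(256, 10),
)
# weight_decay adds an L2 penalty proportional to the squared weight magnitudes.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

model.train()  # dropout and batch norm behave differently in train vs. eval mode
model.eval()   # switch off dropout and use running batch-norm statistics at test time
```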
- Early Stopping (a minimal loop is sketched after this list)
- Monitors the model’s performance on a validation set during training.
- Stops training when validation performance ceases to improve, preventing overfitting.
- Benefits:
- Reduces computation time.
- Ensures the model is not overly tailored to the training data.
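A minimal early-stopping loop. The helpers train_one_epoch() and evaluate(), the data loaders, and the patience of 5 epochs are hypothetical placeholders assumed to be defined elsewhere; they are not part of these notes.

```python
import torch

best_val_loss = float("inf")
patience, epochs_without_improvement = 5, 0   # hypothetical patience value

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_loss = evaluate(model, val_loader)            # hypothetical helper
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best weights so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}")
            break
```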
- Optimization Algorithms
- Methods for updating weights to minimize the loss function (instantiation sketched after this list):
- Stochastic Gradient Descent (SGD):
- Updates weights using the gradient of the loss function for a single training example.
- Benefits: Simple and efficient for large datasets.
- Challenges: Noisy updates can lead to slow convergence.
- Adam (Adaptive Moment Estimation):
- Combines momentum and adaptive learning rates for faster convergence.
- Adjusts the learning rate for each parameter based on past gradients.
- RMSProp:
- Maintains a moving average of squared gradients to normalize updates.
- Well-suited for non-stationary objectives like RNNs.
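A sketch of how these optimizers are instantiated in PyTorch; the learning rates are common defaults used here as illustrative assumptions, and the stand-in linear model exists only so the snippet runs.

```python
import torch

model = torch.nn.Linear(784, 10)  # stand-in model for illustration

sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam = torch.optim.Adam(model.parameters(), lr=1e-3)        # adaptive per-parameter learning rates
rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3)  # moving average of squared gradients

# All three share the same update pattern:
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```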
- Mini-Batch Gradient Descent
- Divides the training data into small batches, performing weight updates for each batch.
- Advantages:
- Balances efficiency (compared to batch gradient descent) and stability (compared to SGD).
- The gradient noise from mini-batch sampling acts as a mild implicit regularizer, reducing overfitting.
- Widely used in modern deep learning frameworks due to its scalability and robustness.
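A mini-batch training sketch using a PyTorch DataLoader; the synthetic dataset, batch size of 64, and stand-in linear model are illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 784), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)  # yields one mini-batch per iteration

model = torch.nn.Linear(784, 10)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

for xb, yb in loader:              # one weight update per mini-batch
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```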