【Deep Learning OCR Series·7】CTC Loss Function and Training Techniques
Posted: 2025-08-19
Category: Advanced Guides
The principles, implementation, and training techniques of the CTC loss function, the core technology for solving the sequence alignment problem. A deep dive into the forward-backward algorithm, decoding strategies, and optimization methods.
## Introduction
Connectionist Temporal Classification (CTC) is an important breakthrough in deep learning sequence modeling, especially in the field of OCR. CTC solves the fundamental problem of mismatch between the length of the input sequence and the output sequence, enabling end-to-end sequence learning. This article will delve into the mathematical principles, algorithm implementation, and training optimization techniques of CTC.
## CTC Basic Concepts
### Sequence alignment issues
In OCR tasks, we face the following challenges:
**Length mismatch**: The length of the input image feature sequence is different from the output text sequence length. For example, a word containing 3 characters may correspond to a feature sequence of 100 time steps.
**Uncertain Position**: The exact position of each character in the image is unknown. Traditional methods require precise character segmentation, which is difficult in practical applications.
**Difficulty of Character Segmentation**: Connected writing, handwritten text, and artistic fonts are hard to split accurately into individual characters.
### CTC's Solution
CTC solves sequence alignment problems in the following innovative ways:
**Introducing a blank token**: A special blank token handles alignment. It does not correspond to any output character and is used to separate repeated characters and to pad the sequence.
**Path probability**: CTC computes the probability of every possible alignment path, where each path represents one correspondence between characters and time steps.
**Dynamic programming**: The forward-backward algorithm computes path probabilities efficiently, avoiding enumeration of all possible paths.
## CTC Mathematical Principles
### Basic Definitions
Given the input sequence X = (x₁, x₂, ..., x_T) and the target sequence Y = (y₁, y₂, ..., y_U), where T ≥ U.
**Label set**: L = {1, 2, ..., K}, containing K character categories.
**Extended label set**: L_ext = L ∪ {blank}, which adds the blank token.
**Alignment path**: A sequence of length T, π = (π₁, π₂, ..., π_T), where π_t ∈ L_ext.
### Mapping of paths to tags
CTC defines a mapping function B that converts an alignment path into the output label sequence:
1. Merge consecutive duplicate characters
2. Remove all blank tokens
The order matters: a genuinely repeated character in the output survives only if a blank separates its occurrences in the path.
**Mapping examples**:
- π = (a, a, blank, b, blank, b, b) → B(π) = (a, b, b)
- π = (blank, c, c, a, blank, t) → B(π) = (c, a, t)
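To make the mapping concrete, here is a tiny sketch of B; `ctc_collapse` and the `"-"` blank symbol are illustrative choices, not part of any specific library.
```python
def ctc_collapse(path, blank="-"):
    """Apply the CTC mapping B: merge consecutive duplicates, then drop blanks."""
    output, prev = [], None
    for token in path:
        if token != prev and token != blank:
            output.append(token)
        prev = token
    return output

print(ctc_collapse(["a", "a", "-", "b", "-", "b", "b"]))  # ['a', 'b', 'b']
print(ctc_collapse(["-", "c", "c", "a", "-", "t"]))       # ['c', 'a', 't']
```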
### CTC loss function
The CTC loss function is defined as the negative logarithm of the sum of all path probabilities mapped to the target sequence Y:
L_CTC = -log P(Y|X) = -log Σ_{π ∈ B⁻¹(Y)} P(π|X)
where B⁻¹(Y) is the set of all paths that map to Y.
**Path probability**: Assuming the predictions at each time step are conditionally independent, the probability of a single path is:
P(π|X) = ∏_{t=1}^{T} y_t^{π_t}
where y_t^{π_t} is the probability the network assigns to label π_t at time step t.
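In practice the loss rarely needs to be implemented by hand. As a reference point, here is a minimal usage sketch of PyTorch's `torch.nn.CTCLoss`; the shapes and the blank=0 convention follow the PyTorch API, while the random tensors stand in for real network outputs and labels.
```python
import torch
import torch.nn as nn

# Illustrative sizes: T=50 time steps, N=4 batch items, C=28 classes (blank index 0).
T, N, C = 50, 4, 28
log_probs = torch.randn(T, N, C).log_softmax(dim=2)         # (T, N, C) log-probabilities
targets = torch.randint(1, C, (N, 10), dtype=torch.long)    # label indices, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)       # length of each input sequence
target_lengths = torch.randint(5, 11, (N,), dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```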
## Forward-Backward Algorithm
### Forward Algorithm
The forward algorithm calculates the path probability from the start of the sequence to the current position.
**Extended Label Sequence**: To simplify the computation, expand the target sequence Y into Y_ext by inserting a blank token between every pair of characters and at the beginning and end, so |Y_ext| = 2U + 1.
**Initialization**:
- α₁(1) = y₁^{blank} (the first position is the initial blank)
- α₁(2) = y₁^{Y_ext[2]} (the second position is the first character of Y)
- α₁(s) = 0 for all other positions
**Recursive Formula**:
For t > 1 and position s:
- If Y_ext[s] is blank, or equals the character two positions back (Y_ext[s] = Y_ext[s−2]):
α_t(s) = (α_{t-1}(s) + α_{t-1}(s-1)) × y_t^{Y_ext[s]}
- Otherwise:
α_t(s) = (α_{t-1}(s) + α_{t-1}(s-1) + α_{t-1}(s-2)) × y_t^{Y_ext[s]}
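The recursion above translates almost directly into code. Below is a didactic sketch in plain probability space (real implementations work in log space, as discussed in the numerical stability section later); `ctc_forward`, `probs`, and `target` are illustrative names.
```python
import numpy as np

def ctc_forward(probs, target, blank=0):
    """Forward (alpha) recursion over the extended label sequence.
    probs: (T, C) per-frame label probabilities; target: non-empty list of label ids."""
    T = probs.shape[0]
    ext = [blank]
    for y in target:                        # build Y_ext = blank, y1, blank, ..., yU, blank
        ext += [y, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]          # start at the initial blank
    alpha[0, 1] = probs[0, ext[1]]          # or at the first character
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s] + (alpha[t - 1, s - 1] if s >= 1 else 0.0)
            # Skipping the intervening blank is only allowed between different characters
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    total = alpha[T - 1, S - 1] + alpha[T - 1, S - 2]   # end in final blank or last character
    return total, alpha
```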
### Backward Algorithm
The backward algorithm calculates the path probability from the current position to the end of the sequence.
**Initialization**:
- β_T(|Y_ext|) = 1 (the path ends at the final blank)
- β_T(|Y_ext|−1) = 1 (the path ends at the last character)
- β_T(s) = 0 for all other positions
**Recursive Formula**:
For t < T and position s:
- If Y_ext[s] is blank, or equals the character two positions ahead (Y_ext[s] = Y_ext[s+2]):
β_t(s) = β_{t+1}(s) × y_{t+1}^{Y_ext[s]} + β_{t+1}(s+1) × y_{t+1}^{Y_ext[s+1]}
- Otherwise:
β_t(s) = β_{t+1}(s) × y_{t+1}^{Y_ext[s]} + β_{t+1}(s+1) × y_{t+1}^{Y_ext[s+1]} + β_{t+1}(s+2) × y_{t+1}^{Y_ext[s+2]}
Note that each successor term is weighted by its own label probability at time t+1; with this convention β_t(s) excludes the emission at time t, which matches the total-probability and gradient formulas below.
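A matching sketch of the backward recursion, under the same convention (β_t excludes the emission at time t, so α_t(s) · β_t(s) is the total probability of valid paths passing through position s at time t); `ctc_backward` is an illustrative name.
```python
import numpy as np

def ctc_backward(probs, target, blank=0):
    """Backward (beta) recursion over the extended label sequence."""
    T = probs.shape[0]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S = len(ext)
    beta = np.zeros((T, S))
    beta[T - 1, S - 1] = 1.0                # finish at the final blank
    beta[T - 1, S - 2] = 1.0                # or at the last character
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1, s] * probs[t + 1, ext[s]]
            if s + 1 < S:
                b += beta[t + 1, s + 1] * probs[t + 1, ext[s + 1]]
            # Skipping the intervening blank is only allowed between different characters
            if s + 2 < S and ext[s] != blank and ext[s] != ext[s + 2]:
                b += beta[t + 1, s + 2] * probs[t + 1, ext[s + 2]]
            beta[t, s] = b
    return beta
```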
### Gradient Calculation
**Total probability**: P(Y|X) = α_T(|Y_ext|) + α_T(|Y_ext|−1)
**Gradient of the label probability**:
∂(−ln P(Y|X)) / ∂y_t^k = −(1 / P(Y|X)) × Σ_{s: Y_ext[s]=k} α_t(s) × β_t(s) / y_t^k
## CTC Decoding Strategies
### Greedy decoding
Greedy decoding selects the label with the highest probability at each time step:
π_t = argmax_k y_t^k
Then apply B mapping to get the final sequence.
**Pros**: Simple to compute and fast
**Cons**: May not find the globally optimal sequence
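A minimal greedy decoder sketch; `greedy_decode` and the blank=0 convention are illustrative assumptions.
```python
import numpy as np

def greedy_decode(probs, blank=0):
    """Take the most probable label per frame, then apply the B mapping."""
    best_path = np.argmax(probs, axis=1)    # (T,) best label index per time step
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(int(label))
        prev = label
    return decoded
```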
### Beam search decoding
Beam search maintains multiple candidate paths, expanding the most promising paths at each time step.
**Algorithm Steps**:
1. Initialize: The candidate collection contains empty paths
2. For each time step:
- Extend all candidate paths
- Keep the K paths with the highest probability
3. Return the complete path with the highest probability
**Parameter Tuning**:
- Beam Width K: Balances computational complexity with decoding quality
- Length Penalty: Avoid favoring short sequences
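A naive sketch of beam search over raw alignment paths (no prefix merging and no length penalty); `beam_search_decode` and `beam_width` are illustrative names. Keeping whole paths like this is exactly what the prefix variant below improves on.
```python
import numpy as np

def beam_search_decode(probs, beam_width=5, blank=0):
    """Keep the beam_width most probable alignment paths, then collapse the best one."""
    T, C = probs.shape
    beams = [((), 1.0)]                                  # (path, path probability)
    for t in range(T):
        candidates = []
        for path, p in beams:
            for c in range(C):
                candidates.append((path + (c,), p * probs[t, c]))
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]                  # prune to the best K paths
    best_path, _ = beams[0]
    decoded, prev = [], None
    for c in best_path:                                  # apply the B mapping
        if c != prev and c != blank:
            decoded.append(c)
        prev = c
    return decoded
```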
### Prefix beam search
Prefix beam search scores output prefixes rather than raw alignment paths, so paths that collapse to the same prefix are no longer counted separately.
**Core idea**: Merge all alignment paths that collapse to the same output prefix and sum their probabilities, tracking for each prefix whether it currently ends in a blank or in a character.
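A minimal sketch of prefix beam search in the commonly used formulation that tracks, for each prefix, the probability of ending in a blank versus a non-blank; `prefix_beam_search` is an illustrative name, and language-model scoring is omitted.
```python
import collections

def prefix_beam_search(probs, beam_width=10, blank=0):
    """probs: (T, C) per-frame label probabilities; returns the best label sequence."""
    T, C = probs.shape
    beams = {(): (1.0, 0.0)}                             # prefix -> (p_blank, p_non_blank)
    for t in range(T):
        new_beams = collections.defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for c in range(C):
                p = probs[t, c]
                if c == blank:                           # blank keeps the prefix unchanged
                    new_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and c == prefix[-1]:
                    new_beams[prefix][1] += p_nb * p         # repeated frame, same prefix
                    new_beams[prefix + (c,)][1] += p_b * p   # new character after a blank
                else:
                    new_beams[prefix + (c,)][1] += (p_b + p_nb) * p
        # Keep only the beam_width most probable prefixes
        top = sorted(new_beams.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beams = {k: (v[0], v[1]) for k, v in top[:beam_width]}
    best = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return list(best[0])
```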
## Training Techniques and Optimization
### Data preprocessing
**Sequence Length Processing**:
- Dynamic batching: Grouping sequences of similar length
- Padding strategy: pad short sequences to a common length (see the collate sketch after this section)
- Truncation strategy: truncate overly long sequences when necessary
**Label Preprocessing**:
- Character set normalization: unify character encoding and letter case
- Special character handling: Handles punctuation marks and spaces
- Vocabulary Building: Build a complete glossary of characters
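One possible collate-function sketch for dynamic batching of variable-length samples (`ctc_collate` and the `(features, labels)` sample format are assumptions for illustration); it returns the four tensors a CTC loss typically expects. Grouping samples of similar length before batching keeps the amount of padding, and thus wasted computation, small.
```python
import torch
from torch.nn.utils.rnn import pad_sequence

def ctc_collate(batch):
    """batch: list of (features [T_i, D], labels [U_i]) tensor pairs."""
    feats, labels = zip(*batch)
    input_lengths = torch.tensor([f.shape[0] for f in feats], dtype=torch.long)
    target_lengths = torch.tensor([l.shape[0] for l in labels], dtype=torch.long)
    padded_feats = pad_sequence(feats, batch_first=True)   # pad inputs to the batch maximum
    targets = torch.cat(labels)                            # CTC accepts concatenated targets
    return padded_feats, targets, input_lengths, target_lengths
```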
### Training Strategy
**Curriculum Learning**:
Start training with simple samples and gradually increase the difficulty:
- Short to long sequences
- Clear image to blurry image
- Regular fonts to handwritten fonts
**Data Augmentation**:
- Geometric transformations: rotation, scaling, shearing
- Noise addition: Gaussian noise, salt and pepper noise
- Lighting changes: brightness, contrast adjustments
**Regularization Techniques**:
- Dropout: Prevent overfitting
- Weight decay: L2 regularization
- Label Smoothing: Reduces overconfidence
### Hyperparameter tuning
**Learning Rate Scheduling**:
- Warm-up strategy: use a small learning rate for the first few epochs
- Cosine annealing: The learning rate decays according to the cosine function
- Adaptive Tuning: Adjusts based on validation set performance
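A small sketch of warm-up followed by cosine annealing using PyTorch's `LambdaLR`; `warmup_cosine`, `warmup_steps`, and `total_steps` are illustrative names, and the scheduler is stepped once per iteration.
```python
import math
import torch

def warmup_cosine(optimizer, warmup_steps, total_steps):
    def lr_lambda(step):
        if step < warmup_steps:                          # linear warm-up
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay toward 0
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage sketch: call scheduler.step() after every optimizer.step()
```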
**Batch Size Selection**:
- Memory Limitations: Consider GPU memory capacity
- Gradient stability: larger batches give more stable gradient estimates
- Convergence Speed: Balance training speed and stability
## Practical Application Considerations
### Computational Optimization
**Memory Optimization**:
- Gradient checkpointing: trades recomputation for a smaller activation memory footprint
- Mixed-precision training: reduce memory requirements with FP16 (see the sketch after this subsection)
- Dynamic graph optimization: optimize memory allocation in the computation graph
**Speed Optimization**:
- Parallel Computing: Utilizes GPU parallel processing capabilities
- Algorithm optimization: use an efficient forward-backward implementation
- Batch Optimization: Set batch sizes appropriately
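A rough sketch of a mixed-precision training step with `torch.cuda.amp`; `model`, `criterion` (for example an `nn.CTCLoss` instance), and the tensor arguments are placeholders rather than a prescribed interface.
```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, criterion, images, targets, input_lengths, target_lengths):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                  # forward pass in mixed precision
        log_probs = model(images)                    # expected shape (T, N, C)
        loss = criterion(log_probs, targets, input_lengths, target_lengths)
    scaler.scale(loss).backward()                    # scaled backward avoids FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```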
### Numerical stability
**Probability Calculation**:
- Log-space computation: avoid numerical underflow from multiplying many small probabilities (see the sketch below)
- Numeric clipping: Limits the range of probability values
- Normalization Techniques: Ensure the validity of probability distributions
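In log space the multiplications of the forward-backward recursions become additions, and sums of probabilities are combined with a log-sum-exp; a minimal hand-rolled sketch (libraries provide vectorized equivalents):
```python
import math

def logsumexp(a, b):
    """Stable log(exp(a) + exp(b)) for combining path probabilities in log space."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# Log-space forward recursion (sketch):
# log_alpha[t][s] = logsumexp(log_alpha[t-1][s], log_alpha[t-1][s-1]) + log_prob[t][ext[s]]
```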
**Gradient Stability**:
- Gradient clipping: prevents exploding gradients
- Weight Initialization: Use a suitable initialization strategy
- Batch normalization: stabilizes the training process
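A small sketch of global-norm gradient clipping around the backward pass; `clipped_step` and `max_norm=5.0` are illustrative choices.
```python
import torch

def clipped_step(model, optimizer, loss, max_norm=5.0):
    """Backward pass followed by global-norm gradient clipping, then the update."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
```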
## Performance Evaluation
### Evaluation metrics
**Character-Level Accuracy**:
Accuracy_char = Number of characters correctly recognized / Total number of characters
**Sequence-Level Accuracy**:
Accuracy_seq = Number of Exactly Correct Sequences / Total Number of Sequences
**Edit Distance**:
Measures the difference between the predicted sequence and the ground-truth sequence as the minimum number of insertion, deletion, and substitution operations.
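A small sketch of Levenshtein edit distance and the character error rate (CER) derived from it; `edit_distance` and `character_error_rate` are illustrative names.
```python
def edit_distance(pred, target):
    """Minimum number of insertions, deletions and substitutions to turn pred into target."""
    m, n = len(pred), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,             # deletion
                           dp[i][j - 1] + 1,             # insertion
                           dp[i - 1][j - 1] + cost)      # substitution
    return dp[m][n]

def character_error_rate(pred, target):
    return edit_distance(pred, target) / max(1, len(target))

print(character_error_rate("ctc loss", "ctc lost"))      # 1 substitution / 8 chars = 0.125
```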
### Error Analysis
**Common Error Types**:
- Character Confusion: Misidentification of similar characters
- Duplication errors: CTC can incorrectly merge or duplicate repeated characters
- Length error: Inaccurate sequence length predictions
**Improvement Strategies**:
- Difficult sample mining: Focus on training samples with high error rates
- Post-processing optimization: Corrects errors using language models
- Ensembling: combine predictions from multiple models
## Summary
The CTC loss function provides a powerful tool for sequence modeling, especially when dealing with alignment problems. By introducing the blank token and a dynamic programming algorithm, CTC achieves end-to-end sequence learning without complex preprocessing steps.
**Key Takeaways**:
- CTC solves the problem of mismatched input and output sequence lengths
- Forward-backward algorithms provide efficient probability calculations
- A suitable decoding strategy is crucial for the final performance
- Training techniques and optimization strategies significantly impact model performance
**Application Suggestions**:
- Choose the appropriate decoding strategy for the specific task
- Emphasis on data preprocessing and enhancement techniques
- Focus on numerical stability and computational efficiency
- Post-processing optimization based on domain knowledge
The successful application of CTC has laid an important foundation for the development of deep learning in the field of sequence modeling, and also provided key support for the progress of OCR technology.
Tags:
CTC loss function
Connectionist Temporal Classification
Sequence alignment
Forward-backward algorithm
Dynamic programming
OCR training
Sequence modeling