A from-scratch implementation of the Transformer architecture as described in "Attention Is All You Need" (Vaswani et al., 2017). This project implements the encoder-decoder architecture with multi-head attention, positional encoding, and all core components from the original paper.
This repository contains a complete implementation of the Transformer model, including:
- Encoder-Decoder Architecture: Full transformer with stacked encoder and decoder layers
- Multi-Head Attention: Self-attention and cross-attention mechanisms
- Positional Encoding: Sinusoidal positional encodings
- Layer Normalization: Pre-norm architecture with residual connections (the original paper applies layer normalization after each residual connection, i.e. post-norm)
- Feed-Forward Networks: Position-wise feed-forward neural networks
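As an illustration of the sinusoidal positional encodings listed above, here is a minimal standard-library sketch of the formula from the paper. The function name `sinusoidal_positional_encoding` and its arguments are hypothetical and are not taken from transformer_utils.py, which may organize this differently (for example as a PyTorch tensor).

```python
import math


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for paired sin/cos columns")
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Columns i (sin) and i + 1 (cos) share the divisor 10000^(i / d_model).
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            table[pos][i + 1] = math.cos(angle)
    return table


pe = sinusoidal_positional_encoding(max_len=4, d_model=8)
print(pe[0][:4])  # position 0 alternates sin(0) and cos(0)
```

Each row of the table would be added to the token embedding at that position, so the model can tell positions apart without any learned position parameters.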
.
├── custom_transformers.py # Main Transformer model implementation
├── transformer_layers.py # Encoder and Decoder layer implementations
├── transformer_sublayers.py # Core sublayer components (attention, FFN, etc.)
├── transformer_utils.py # Utility functions (positional encoding, attention, etc.)
├── notebook/ # Jupyter notebooks for training and evaluation
│ ├── train_with_custom_BPE.ipynb
│ ├── train_with_prebuilt_tokenizers.ipynb
│ ├── test_sublayers.ipynb
│ └── testing_evals.ipynb
├── data/ # Training datasets
│ ├── bible/ # Bible translation dataset (en-zh)
│ ├── php_docs/ # PHP documentation dataset (en-zh)
│ └── wmt/ # WMT translation datasets
└── requirements.txt # Python dependencies
- Pure PyTorch Implementation: No reliance on high-level transformer libraries
- Modular Design: Clean separation of concerns with sublayers, layers, and full model
- Training Notebooks: Ready-to-use notebooks for training with different tokenization strategies
- Evaluation Metrics: Comprehensive evaluation with BLEU and ROUGE scores
- Multiple Datasets: Support for various machine translation datasets
- Custom BPE Tokenizer: Implementation of Byte-Pair Encoding from scratch
- Prebuilt Tokenizers: Support for Hugging Face tokenizers (BERT, Marian)
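To give a feel for what a from-scratch Byte-Pair Encoding tokenizer involves, the sketch below learns merge rules from a toy word-frequency table. All names here (`count_pairs`, `merge_pair`, `learn_bpe`) are hypothetical; the BPE code in the training notebook may differ, for instance in how it marks word boundaries or breaks ties.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across a {symbol_tuple: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} mapping."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Ties go to the pair that was counted first (an arbitrary choice here).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, vocab = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)
```

Applying the learned merges in order to new text reproduces the same segmentation at inference time, which is why the merge list is usually saved alongside the model.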
See SETUP.md for detailed installation instructions.
- Clone the repository and navigate to the project directory:

      git clone <repository-url>
      cd original_transformers_for_fun

- Create and activate a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Download SpaCy language models:

      python -m spacy download en_core_web_sm

- Start Jupyter Notebook:

      jupyter notebook

- Open one of the training notebooks:
  - notebook/train_with_custom_BPE.ipynb - Training with custom Byte-Pair Encoding
  - notebook/train_with_prebuilt_tokenizers.ipynb - Training with prebuilt tokenizers (recommended for beginners)
- Run the notebook cells to start training your transformer model.
- Evaluate your trained model:
  - notebook/testing_evals.ipynb - Evaluate model performance with BLEU and ROUGE metrics
Note: Make sure to run the first cell in each notebook that adds the parent directory to the Python path.
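For reference, a path-setup cell of this kind typically looks like the sketch below. It assumes the notebook's working directory is notebook/, so that its parent is the repository root containing custom_transformers.py; the actual first cell in each notebook may be written differently.

```python
import sys
from pathlib import Path

# Assumption: the current working directory is notebook/, one level below
# the repository root that holds the model source files.
project_root = str(Path.cwd().parent)
if project_root not in sys.path:
    sys.path.insert(0, project_root)
```

Once the root is on the path, imports such as `from custom_transformers import Transformer` can resolve from inside the notebook.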
The implementation follows the original Transformer architecture:
- Encoder: Stack of N identical layers, each with:
  - Multi-head self-attention
  - Position-wise feed-forward network
  - Residual connections and layer normalization
- Decoder: Stack of N identical layers, each with:
  - Masked multi-head self-attention
  - Multi-head cross-attention (encoder-decoder attention)
  - Position-wise feed-forward network
  - Residual connections and layer normalization
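The attention sublayers above all build on scaled dot-product attention, and the decoder's masked self-attention adds a causal mask so that a position cannot look at later positions. The following is a minimal single-head sketch in plain Python, not the repository's PyTorch code. The function names and the mask convention (True means "may attend") are assumptions and may not match transformer_utils.py.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_mask(size):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(size)] for i in range(size)]


def scaled_dot_product_attention(queries, keys, values, mask=None):
    """Return (outputs, weights); every mask row must allow at least one key."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if mask is not None:
            # Blocked positions get -inf so softmax assigns them zero weight.
            scores = [s if mask[i][j] else float("-inf")
                      for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
        all_weights.append(weights)
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = scaled_dot_product_attention(x, x, x, mask=causal_mask(3))
print(w[0])  # the first position can only attend to itself
```

Multi-head attention runs several such attention computations in parallel on learned linear projections of the inputs and concatenates the results.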
    from custom_transformers import Transformer

    # Initialize model
    model = Transformer(
        source_vocab_size=10000,
        target_vocab_size=10000,
        embedding_dim=512,
        num_of_heads=8,
        dropout_prob=0.1,
        n=6,  # number of layers
        global_max_seq_len=512,
        src_padding_idx=0,
        tgt_padding_idx=0
    )

    # Forward pass. src_tokens, tgt_tokens, the two lengths, and the three
    # masks are placeholders for values prepared by your data pipeline.
    output = model(
        source_input=src_tokens,
        target_input=tgt_tokens,
        src_max_len=src_length,
        tgt_max_len=tgt_length,
        encoder_mask=enc_mask,
        decoder_mask=dec_mask,
        encoder_decoder_mask=enc_dec_mask
    )

The project includes several machine translation datasets:
- Bible Dataset: English-Chinese translation pairs
- PHP Documentation: Technical documentation translations
- WMT: Various language pairs from WMT datasets
Pre-trained model checkpoints are stored in:
- notebook/model_weights_with_custom_BPE/ - Models trained with custom BPE tokenizer
- notebook/model_weights_with_prebuilt_tokenizer/ - Models trained with Hugging Face tokenizers
Note: These directories are excluded from git (see .gitignore) due to file size.
This is an educational project. Please refer to individual dataset licenses in the data/ directory.
- Vaswani, A., et al. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems.
- custom_transformers.py: Main Transformer class with Encoder, Decoder, and full model
- transformer_layers.py: EncoderLayer and DecoderLayer implementations
- transformer_sublayers.py: Core components (MultiHeadAttention, FeedForward, LayerNorm)
- transformer_utils.py: Helper functions for positional encoding, masking, and attention
The project includes several notebooks for testing and evaluation:
- notebook/test_sublayers.ipynb - Unit tests for individual components
- notebook/testing_evals.ipynb - Comprehensive evaluation with BLEU and ROUGE metrics
The testing_evals.ipynb notebook provides comprehensive evaluation of trained models:
- BLEU Scores: Computes BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores
  - Measures n-gram precision between predicted and reference translations
  - Handles both English (word-level) and Chinese (character-level) tokenization
- ROUGE Scores: Computes ROUGE-1, ROUGE-2, and ROUGE-L scores
  - Measures recall-oriented n-gram overlap
  - Includes precision, recall, and F-measure for each metric
  - Uses character-level tokenization for Chinese text
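To make "n-gram precision" concrete, the sketch below computes the clipped (modified) n-gram precision that BLEU is built from, for one word-level English pair and one character-level Chinese pair. It is an illustration only: the helper names are hypothetical, and a full BLEU score additionally combines several n-gram orders with a geometric mean and a brevity penalty, which the notebook leaves to NLTK according to the dependency list.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of `candidate` against a single `reference`."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    ref_counts = Counter(ngrams(reference, n))
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference, which penalizes repeated words.
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# English is split on whitespace; Chinese is split into characters.
en_ref, en_hyp = "the cat is on the mat".split(), "the the the cat".split()
zh_ref, zh_hyp = list("我爱机器学习"), list("我爱学习")
print(modified_precision(en_ref, en_hyp, 1))  # 3 of 4 unigrams credited -> 0.75
print(modified_precision(zh_ref, zh_hyp, 2))  # 2 of 3 bigrams match
```

Clipping is what stops a degenerate output such as "the the the the" from scoring perfect unigram precision.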
To run evaluation:
- Open notebook/testing_evals.ipynb
- Configure the model checkpoint path and tokenizer type
- Run all cells to evaluate on the test set
- View average scores and example translations
Key dependencies include:
- PyTorch: Deep learning framework
- Transformers: Hugging Face library for tokenizers
- SpaCy: Natural language processing
- Jieba: Chinese text segmentation
- NumPy/Pandas: Data manipulation
- NLTK: Natural language toolkit (for BLEU scores)
- rouge-score: ROUGE metric computation
See requirements.txt for the complete list.
The evaluation notebook (testing_evals.ipynb) provides detailed metrics for model performance:
- BLEU Scores: Standard metric for machine translation evaluation. The thresholds below are rough rules of thumb; BLEU depends on the dataset, tokenization, and number of references, so scores are not directly comparable across setups.
  - BLEU-4 > 0.3: Good translation quality
  - BLEU-4 > 0.5: Very good translation quality
  - BLEU-4 > 0.7: Excellent translation quality
- ROUGE Scores: Recall-oriented evaluation metrics
  - Useful for evaluating fluency and coverage
  - Character-level tokenization for Chinese text
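As a sketch of what a recall-oriented score looks like, the snippet below computes ROUGE-L style precision, recall, and F1 from the longest common subsequence (LCS) of character-level tokens. The function names are hypothetical, and the rouge-score package used by the notebook may differ in details such as tokenization and stemming.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) based on the longest common subsequence."""
    if not reference or not candidate:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)  # share of the candidate that is matched
    recall = lcs / len(reference)     # share of the reference that is covered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Character-level tokens for Chinese text.
p, r, f = rouge_l(list("我爱机器学习"), list("我爱学习"))
print(p, r, f)
```

Here the candidate is fully contained in the reference (precision 1.0) but covers only four of its six characters, which is exactly the kind of omission that recall exposes.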
The notebook automatically handles:
- Space removal for Chinese text (Chinese doesn't use spaces)
- Character-level vs word-level tokenization
- Error handling and progress tracking
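The first two items can be pictured with a small helper like the one below. This is a hypothetical sketch (the function name and the "zh" language code are assumptions), not the notebook's actual preprocessing code.

```python
def tokenize_for_eval(text, lang):
    """Split text into evaluation tokens: characters for "zh", words otherwise."""
    if lang == "zh":
        # str.split() with no arguments drops all Unicode whitespace,
        # including the full-width space U+3000, before splitting into characters.
        return list("".join(text.split()))
    return text.split()


print(tokenize_for_eval("我 爱 学习", "zh"))
print(tokenize_for_eval("I love learning", "en"))
```

Applying the same tokenization to both the prediction and the reference matters more than the exact scheme, since the metrics only compare token sequences.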
This implementation is for educational purposes and follows the original paper's architecture. For production use, consider optimizations and improvements beyond the original design.
This is a learning project. Feel free to fork and experiment with different architectures, optimizations, or training strategies.