Original Transformers for Fun

A from-scratch implementation of the Transformer architecture as described in "Attention Is All You Need" (Vaswani et al., 2017). This project implements the encoder-decoder architecture with multi-head attention, positional encoding, and all core components from the original paper.

Overview

This repository contains a complete implementation of the Transformer model, including:

Encoder-Decoder Architecture: Full transformer with stacked encoder and decoder layers
Multi-Head Attention: Self-attention and cross-attention mechanisms
Positional Encoding: Sinusoidal positional encodings
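Layer Normalization: Pre-norm architecture with residual connections (note that the original paper applies layer normalization after the residual addition, i.e. post-norm)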
Feed-Forward Networks: Position-wise feed-forward neural networks
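
As an illustration of the sinusoidal positional encodings listed above, here is a minimal, dependency-free sketch in plain Python. The repository itself works with PyTorch tensors, and the function name and signature below are assumptions made for this example, not the project's actual API. It follows the formula from the paper: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the cosine of the same angle.

```python
import math


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings.

    Hypothetical helper for illustration; not the repository's function.
    """
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        # i walks over the even dimensions, so i / d_model equals 2k / d_model.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)
            if i + 1 < d_model:
                row[i + 1] = math.cos(angle)
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(max_len=4, d_model=8)
```

Each row of the table is added to the token embedding at the same position, which gives the model access to token order without any recurrence.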

Project Structure

.
├── custom_transformers.py      # Main Transformer model implementation
├── transformer_layers.py        # Encoder and Decoder layer implementations
├── transformer_sublayers.py     # Core sublayer components (attention, FFN, etc.)
├── transformer_utils.py         # Utility functions (positional encoding, attention, etc.)
├── notebook/                    # Jupyter notebooks for training and evaluation
│   ├── train_with_custom_BPE.ipynb
│   ├── train_with_prebuilt_tokenizers.ipynb
│   ├── test_sublayers.ipynb
│   └── testing_evals.ipynb
├── data/                        # Training datasets
│   ├── bible/                   # Bible translation dataset (en-zh)
│   ├── php_docs/                # PHP documentation dataset (en-zh)
│   └── wmt/                     # WMT translation datasets
└── requirements.txt             # Python dependencies

Features

Pure PyTorch Implementation: No reliance on high-level transformer libraries
Modular Design: Clean separation of concerns with sublayers, layers, and full model
Training Notebooks: Ready-to-use notebooks for training with different tokenization strategies
Evaluation Metrics: Comprehensive evaluation with BLEU and ROUGE scores
Multiple Datasets: Support for various machine translation datasets
Custom BPE Tokenizer: Implementation of Byte-Pair Encoding from scratch
Prebuilt Tokenizers: Support for Hugging Face tokenizers (BERT, Marian)
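
To make the "Byte-Pair Encoding from scratch" feature more concrete, the following is a small sketch of the classic BPE merge-learning loop in plain Python. It is not the code from the training notebook; the function names and the toy word-frequency table are assumptions chosen for illustration.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Greedily learn up to `num_merges` merges, most frequent pair first."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the first pair seen
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words


# Toy corpus (hypothetical): word -> frequency
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
```

At tokenization time the learned merges are replayed in order on each new word, so frequent character sequences end up as single tokens.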

Installation

See SETUP.md for detailed installation instructions.

Quick Setup

Clone the repository and navigate to the project directory:

git clone <repository-url>
cd original_transformers_for_fun

Create and activate a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Download the spaCy English model:

python -m spacy download en_core_web_sm

Quick Start

Start Jupyter Notebook:

jupyter notebook

Open one of the training notebooks:
- notebook/train_with_custom_BPE.ipynb - Training with custom Byte-Pair Encoding
- notebook/train_with_prebuilt_tokenizers.ipynb - Training with prebuilt tokenizers (recommended for beginners)

Run the notebook cells to start training your transformer model.

Evaluate your trained model:
- notebook/testing_evals.ipynb - Evaluate model performance with BLEU and ROUGE metrics

Note: Run the first cell in each notebook before anything else. It adds the parent directory to the Python path so that the modules in the repository root can be imported.
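
For reference, a path-setup cell of this kind typically looks like the sketch below. The exact code in the notebooks may differ; the helper name is an assumption, and the sketch assumes the notebook is launched from inside the notebook/ directory.

```python
import os
import sys


def add_parent_to_path(start_dir=None):
    """Prepend the parent of start_dir (default: current directory) to sys.path."""
    base = os.path.abspath(start_dir or os.getcwd())
    parent = os.path.dirname(base)
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent


# With the working directory set to notebook/, this makes the repository root importable.
project_root = add_parent_to_path()
```

After this cell has run, imports such as "from custom_transformers import Transformer" resolve against the repository root.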

Model Architecture

The implementation follows the original Transformer architecture:

Encoder: Stack of N identical layers, each with:
- Multi-head self-attention
- Position-wise feed-forward network
- Residual connections and layer normalization
Decoder: Stack of N identical layers, each with:
- Masked multi-head self-attention
- Multi-head cross-attention (encoder-decoder attention)
- Position-wise feed-forward network
- Residual connections and layer normalization
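
The attention sublayers above are all built on scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. The sketch below shows the single-head computation in plain Python lists, with an optional causal mask as used by the decoder's masked self-attention. It is a teaching sketch under assumed names, not the repository's PyTorch implementation, and it leaves out the learned projections and the split into multiple heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v, causal=False):
    """q: [Tq][d], k: [Tk][d], v: [Tk][dv]. Returns (output, weights)."""
    d_k = len(k[0])
    output, weights = [], []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            if causal and j > i:
                # Future positions are masked out before the softmax.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q_i, k_j))
                scores.append(dot / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        output.append([sum(w_j * v_j[c] for w_j, v_j in zip(w, v))
                       for c in range(len(v[0]))])
    return output, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = scaled_dot_product_attention(x, x, x, causal=True)
```

With causal=True the first position can only attend to itself, which is what prevents the decoder from looking at future target tokens during training.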

Usage Example

from custom_transformers import Transformer

# Initialize model
model = Transformer(
    source_vocab_size=10000,
    target_vocab_size=10000,
    embedding_dim=512,
    num_of_heads=8,
    dropout_prob=0.1,
    n=6,  # number of layers
    global_max_seq_len=512,
    src_padding_idx=0,
    tgt_padding_idx=0
)

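# Forward pass: src_tokens, tgt_tokens, the max lengths, and the three masks are
# placeholders that must be prepared beforehand (they are not defined in this snippet)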
output = model(
    source_input=src_tokens,
    target_input=tgt_tokens,
    src_max_len=src_length,
    tgt_max_len=tgt_length,
    encoder_mask=enc_mask,
    decoder_mask=dec_mask,
    encoder_decoder_mask=enc_dec_mask
)
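
The usage example above passes three masks without showing how they are built. The sketch below illustrates the usual idea with nested Python lists: a padding mask hides pad tokens, and the decoder mask combines a lower-triangular (causal) mask with the target padding mask. The helper names, the "True means may attend" convention, and the list-based shapes are all assumptions for this example; the repository's transformer_utils.py may use a different convention and tensor layout.

```python
def padding_mask(token_ids, pad_idx=0):
    """True where the token is real, False where it is padding."""
    return [[tok != pad_idx for tok in seq] for seq in token_ids]


def causal_mask(length):
    """Lower-triangular matrix: position i may attend to positions 0..i."""
    return [[j <= i for j in range(length)] for i in range(length)]


def decoder_self_attention_mask(target_ids, pad_idx=0):
    """Combine the causal mask with the target padding mask, per sequence."""
    masks = []
    for keep in padding_mask(target_ids, pad_idx):
        n = len(keep)
        tri = causal_mask(n)
        masks.append([[tri[i][j] and keep[j] for j in range(n)] for i in range(n)])
    return masks


# Hypothetical batch of one sequence each; 0 is the padding index.
src = [[5, 7, 9, 0]]
tgt = [[1, 4, 0]]
enc_mask = padding_mask(src)
dec_mask = decoder_self_attention_mask(tgt)
```

The encoder-decoder (cross-attention) mask is typically derived from the source padding mask, since every target position may look at every non-padding source position.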

Datasets

The project includes several machine translation datasets:

Bible Dataset: English-Chinese translation pairs
PHP Documentation: Technical documentation translations
WMT: Various language pairs from WMT datasets

Model Checkpoints

Pre-trained model checkpoints are stored in:

notebook/model_weights_with_custom_BPE/ - Models trained with custom BPE tokenizer
notebook/model_weights_with_prebuilt_tokenizer/ - Models trained with Hugging Face tokenizers

Note: These directories are excluded from git (see .gitignore) due to file size.

License

This is an educational project. Please refer to individual dataset licenses in the data/ directory.

References

Vaswani, A., et al. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems.

Development

Project Structure Details

custom_transformers.py: Main Transformer class with Encoder, Decoder, and full model
transformer_layers.py: EncoderLayer and DecoderLayer implementations
transformer_sublayers.py: Core components (MultiHeadAttention, FeedForward, LayerNorm)
transformer_utils.py: Helper functions for positional encoding, masking, and attention
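
Since the Overview mentions a pre-norm arrangement, the following sketch contrasts the two common ways of combining a sublayer with layer normalization and a residual connection. It is a plain-Python illustration with assumed function names, and it omits the learnable gain and bias that a real LayerNorm module has.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x, sublayer):
    """Original paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer):
    """Pre-norm variant: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def double(v):
    """Stand-in sublayer for the demo (attention or FFN in a real model)."""
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0, 4.0]
post = post_norm_block(x, double)
pre = pre_norm_block(x, double)
```

In the pre-norm form the residual path carries x through unchanged, which is often reported to make deep stacks easier to train; the post-norm form is the one described in the paper.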

Testing and Evaluation

The project includes several notebooks for testing and evaluation:

notebook/test_sublayers.ipynb - Unit tests for individual components
notebook/testing_evals.ipynb - Comprehensive evaluation with BLEU and ROUGE metrics

Evaluation Metrics

The testing_evals.ipynb notebook provides comprehensive evaluation of trained models:

BLEU Scores: Computes BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores
- Measures n-gram precision between predicted and reference translations
- Handles both English (word-level) and Chinese (character-level) tokenization
ROUGE Scores: Computes ROUGE-1, ROUGE-2, and ROUGE-L scores
- Measures recall-oriented overlap (n-grams for ROUGE-1 and ROUGE-2, longest common subsequence for ROUGE-L)
- Includes precision, recall, and F-measure for each metric
- Uses character-level tokenization for Chinese text
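
The core of BLEU is the clipped (modified) n-gram precision. The sketch below shows that one ingredient in plain Python, together with the word-level versus character-level tokenization described above. It is not the notebook's code (which, per the dependency list, relies on NLTK), the function names are assumptions, and the brevity penalty and the geometric mean over n-gram orders are deliberately left out.

```python
from collections import Counter


def tokenize(text, lang):
    """Word-level for English, character-level (spaces removed) for Chinese."""
    if lang == "zh":
        return [ch for ch in text if not ch.isspace()]
    return text.split()


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


cand = tokenize("the the the cat", "en")
ref = tokenize("the cat sat", "en")
p1 = modified_precision(cand, ref, 1)
p2 = modified_precision(cand, ref, 2)
zh_tokens = tokenize("我 爱 你", "zh")
```

Clipping is what stops a candidate from scoring well by repeating a common word: "the" appears three times in the candidate but is credited only once.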

To run evaluation:

Open notebook/testing_evals.ipynb
Configure the model checkpoint path and tokenizer type
Run all cells to evaluate on the test set
View average scores and example translations

Dependencies

Key dependencies include:

PyTorch: Deep learning framework
Transformers: Hugging Face library for tokenizers
SpaCy: Natural language processing
Jieba: Chinese text segmentation
NumPy/Pandas: Data manipulation
NLTK: Natural language toolkit (for BLEU scores)
rouge-score: ROUGE metric computation

See requirements.txt for the complete list.

Evaluation Results

The evaluation notebook (testing_evals.ipynb) provides detailed metrics for model performance:

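BLEU Scores: Standard metric for machine translation evaluation. The thresholds below are rough rules of thumb on a 0-1 scale; scores are only comparable when computed on the same test set with the same tokenization.
- BLEU-4 > 0.3: Good translation quality
- BLEU-4 > 0.5: Very good translation quality
- BLEU-4 > 0.7: Excellent translation quality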
ROUGE Scores: Recall-oriented evaluation metrics
- Useful for evaluating fluency and coverage
- Character-level tokenization for Chinese text
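
ROUGE-L is based on the longest common subsequence (LCS) between candidate and reference. The sketch below computes it on character-level tokens in plain Python. It is an illustration under assumed names, not the notebook's implementation (which, per the dependency list, uses the rouge-score package), and it uses the plain harmonic mean for the F-measure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return ROUGE-L precision, recall, and F-measure for two token lists."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Character-level tokens for Chinese text (toy example).
scores = rouge_l(list("我爱学习"), list("我爱机器学习"))
```

Here every candidate character appears in order in the reference, so precision is perfect while recall reflects the two reference characters the candidate missed.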

The notebook automatically handles:

Space removal for Chinese text (written Chinese does not separate words with spaces)
Character-level vs word-level tokenization
Error handling and progress tracking

Notes

This implementation is for educational purposes and follows the original paper's architecture. For production use, consider optimizations and improvements beyond the original design.

Contributing

This is a learning project. Feel free to fork and experiment with different architectures, optimizations, or training strategies.
