diff --git a/source/_static/tools/poet/prompt-12.png b/source/_static/tools/poet/prompt-12.png index 38c3560d..4e3c46fe 100644 Binary files a/source/_static/tools/poet/prompt-12.png and b/source/_static/tools/poet/prompt-12.png differ diff --git a/source/_static/tools/poet/prompt-13.png b/source/_static/tools/poet/prompt-13.png index f4db2e90..dae4d0f1 100644 Binary files a/source/_static/tools/poet/prompt-13.png and b/source/_static/tools/poet/prompt-13.png differ diff --git a/source/_static/tools/poet/prompt-4.png b/source/_static/tools/poet/prompt-4.png index a9f35dae..e63de809 100644 Binary files a/source/_static/tools/poet/prompt-4.png and b/source/_static/tools/poet/prompt-4.png differ diff --git a/source/_static/tools/poet/prompt-contenxt-use-existing-1.png b/source/_static/tools/poet/prompt-contenxt-use-existing-1.png new file mode 100644 index 00000000..26141898 Binary files /dev/null and b/source/_static/tools/poet/prompt-contenxt-use-existing-1.png differ diff --git a/source/_static/tools/poet/prompt-contenxt-use-existing-2.png b/source/_static/tools/poet/prompt-contenxt-use-existing-2.png new file mode 100644 index 00000000..8d919393 Binary files /dev/null and b/source/_static/tools/poet/prompt-contenxt-use-existing-2.png differ diff --git a/source/_static/tools/poet/prompt-context-custom-1.png b/source/_static/tools/poet/prompt-context-custom-1.png index 43dd6efe..9bcee131 100644 Binary files a/source/_static/tools/poet/prompt-context-custom-1.png and b/source/_static/tools/poet/prompt-context-custom-1.png differ diff --git a/source/_static/tools/poet/prompt-context-msa.png b/source/_static/tools/poet/prompt-context-msa.png index 9f636b71..742b905a 100644 Binary files a/source/_static/tools/poet/prompt-context-msa.png and b/source/_static/tools/poet/prompt-context-msa.png differ diff --git a/source/_static/tools/poet/query-1.png b/source/_static/tools/poet/query-1.png index 785b3231..a2636666 100644 Binary files a/source/_static/tools/poet/query-1.png 
and b/source/_static/tools/poet/query-1.png differ diff --git a/source/_static/tools/poet/query-2.png b/source/_static/tools/poet/query-2.png index 04fd8bd6..88c7bbfc 100644 Binary files a/source/_static/tools/poet/query-2.png and b/source/_static/tools/poet/query-2.png differ diff --git a/source/_static/tools/poet/refolding settings.png b/source/_static/tools/poet/refolding settings.png new file mode 100644 index 00000000..9e1ed507 Binary files /dev/null and b/source/_static/tools/poet/refolding settings.png differ diff --git a/source/_static/tools/poet/sampling-parameters.png b/source/_static/tools/poet/sampling-parameters.png index 762f9d9f..546d0858 100644 Binary files a/source/_static/tools/poet/sampling-parameters.png and b/source/_static/tools/poet/sampling-parameters.png differ diff --git a/source/web-app/poet/generate-sequences.rst b/source/web-app/poet/generate-sequences.rst index e9088fe4..08fe4448 100644 --- a/source/web-app/poet/generate-sequences.rst +++ b/source/web-app/poet/generate-sequences.rst @@ -11,6 +11,9 @@ What You Need Before Starting This tool requires a multiple sequence alignment (MSA), from which it builds a prompt. You can choose an existing prompt, upload your own MSA or have the OpenProtein.AI model generate one for you. If you aren't already familiar with prompts, we recommend learning more about OpenProtein.AI's `prompts and prompt sampling methods <./prompts.rst>`_ before diving in. +Sampling parameters +^^^^^^^^^^^^^^^^^^^^^ + You also need to know about sampling parameters, which are settings that regulate randomness. These include temperature, top-p, and top-k. - *Top-p* (also known as nucleus sampling) limits sampling to amino acids with sum likelihoods which do not exceed the specified value. As a result, the list of possible amino acids is dynamically selected based on the sum of likelihood scores achieving the top-p value. 
For example, setting a top-p of 0.8 limits sampling to amino acids summing to an 80% or greater probability. Other amino acids are ignored. @@ -22,10 +25,31 @@ You also need to know about sampling parameters, which are settings that regulat A note on the *Random seed* setting: this determines the state of the random number generator for random sampling. If it is set to a specific number, the algorithm will sample the same set of sequences each time. We recommend not defining this seed unless you are reproducing a job. +Sequence generators +^^^^^^^^^^^^^^^^^^^^ +Sequence generators create amino-acid sequences under different types of conditioning. The platform provides two models: + +- Use *PoET-2* when you want to explore the natural sequence landscape, generate diverse variants without requiring a structure, or control sampling behavior using temperature, top-p, and top-k. PoET-2 is ideal for most sequence design and library generation. + +- *ProteinMPNN* generates structure-conditioned sequences that are compatible with a given 3D backbone. Use ProteinMPNN when you already have a structure (experimental, predicted, or generated) and want to optimize stability, redesign interfaces, or produce multiple sequences for the same fold. ProteinMPNN is ideal for structure-first sequence design. + + +Structure generators +^^^^^^^^^^^^^^^^^^^^ +Structure generators create new 3D protein backbones or binder–target complexes. These define the shapes that sequences will later be designed for. + +- *RFdiffusion* generates new protein backbones under geometric or functional constraints. Use RFdiffusion when designing new folds, symmetric assemblies, enzyme scaffolds, or any backbone that will later be sequence-designed with ProteinMPNN. + +- *BoltzGen* generates binder structures conditioned on a target surface, with optional sequence co-design. Use BoltzGen when designing binders to a target protein and exploring diverse binding modes.
BoltzGen's primary output is a binder backbone and binder–target complex. + + + Generating Sequences --------------------- -Navigate to the tool by opening the **PoET** dropdown menu, then selecting **Generate Sequences.** You can choose the model used to run the job. We recommend using PoET-2 for most use cases. +Navigate to the tool by opening the **PoET** dropdown menu, then selecting **Generate Sequences.** You can choose which models serve as the sequence generator and the structure generator. For the sequence generator, we offer PoET-2, PoET-1, and ProteinMPNN, depending on your task. We recommend PoET-2 for most use cases. For the structure generator, we offer RFdiffusion and BoltzGen. + + Step 1: Prompt Query ^^^^^^^^^^^^^^^^^^^^^ @@ -42,11 +66,36 @@ Refer to `Creating a Context <./prompts.rst#creating-a-context>`_ to learn about Step 3: Sampling Settings ^^^^^^^^^^^^^^^^^^^^^^^^^^ -Set your parameters to control sampling behavior. In particular, **temperature**, **top-p**, and **top-k** provide the ability to focus sampling around highly likely sequences. We recommend that you use either top-p or top-k on a given job, not both. You can choose the default structure prediction model to generate the sequence structures after the job completes. +Set your parameters to control sampling behavior. In particular, **temperature**, **top-p**, and **top-k** provide the ability to focus sampling around highly likely sequences. We recommend that you use either top-p or top-k on a given job, not both. .. image:: /_static/tools/poet/sampling-parameters.png :alt: Sampling Parameters +Step 4: Refolding Settings +^^^^^^^^^^^^^^^^^^^^^^^^^^ +Refolding predicts the 3D structure of generated protein sequences to verify that they fold into the intended structure. After sequence generation, each sequence is passed through a structure prediction model, and the predicted structure is compared to the expected backbone.
This step helps identify sequences that are structurally consistent with the design objective. + +Refolding settings allow you to: + +- Select a structure prediction model +- Define template usage + +Each generated sequence is folded using the selected prediction model. The prediction can optionally use a template to guide the folding process. You may select from the following structure prediction models: +- Boltz-2 +- Protenix +- ESMFold +- Minifold + +Templates help constrain structure prediction and improve structural agreement when a reference structure is available. + +You must select one of the following template options: +- Use generated structure +- Use Query structure +- No template + +.. image:: /_static/tools/poet/refolding-settings.png + :alt: Refolding settings + You're ready to generate custom sequences! Click **Run.** The job may take a few minutes depending on how busy the service is, how long your sequences are, and how many sequences you want to score. A 400 (Bad request) error code may be due to the following: diff --git a/source/web-app/poet/prompts.rst b/source/web-app/poet/prompts.rst index 44a6f999..ea0e3932 100644 --- a/source/web-app/poet/prompts.rst +++ b/source/web-app/poet/prompts.rst @@ -57,11 +57,13 @@ The sequence editor provides buttons at the top for efficient query editing: - **Unmask Sequence** — restore all positions to match the reference sequence - **Mask Selected Residues** — mask only the highlighted positions - **Mask Unselected Residues** — mask all positions except the highlighted ones +- **Delete Undefined Structures** - delete positions with undefined structures after uploading a structure file Additional keyboard shortcuts include: - Copy and paste sequences (Ctrl + C / V) - Replace highlighted positions with a character (e.g., highlight positions 1–50 and press `X` to mask that region) +- To add a chain whose sequence and structure are unknown, type `/`, `x`, `*`, followed by the
number of residues in the chain. For example, to add a chain of 80 residues, you can input `/x*80` as a shortcut. These tools allow precise control over the query, enabling you to define exactly which residues or structural positions should guide PoET-2's generation. @@ -77,12 +79,14 @@ You can create a prompt context in three ways: 1. Use Existing Prompt ~~~~~~~~~~~~~~~~~~~~~~~ -If you've previously uploaded prompts, you can reuse them. In the **Prompt Type** dropdown, +If you've previously uploaded prompts, you can reuse them. Under **Choose from project**, select an existing prompt. The sequences from that prompt will automatically load. -.. image:: /_static/tools/poet/prompt-contenxt-use-existing.png +.. image:: /_static/tools/poet/prompt-contenxt-use-existing-1.png :alt: Use existing prompt +.. image:: /_static/tools/poet/prompt-contenxt-use-existing-2.png + :alt: Use existing prompt 2. Create Custom Context ~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -91,9 +95,10 @@ To create a custom prompt context, in the **Prompt Type** dropdown, select **Cre 1. **Upload files**: Click **Choose Files** to select files for your context. We support .fa, .fasta for FASTA files, and .pdb, .cif for structure files. 2. **Manually enter sequences**: Paste sequences in CSV or FASTA format, then click **Upload**. If you use CSV content, please note the following requirements: - - It must not include a header row. - - It can contain a maximum of 2 columns. - - If there are 2 columns, the first one must be the sequence names. + + - It must not include a header row. + - It can contain a maximum of 2 columns. + - If there are 2 columns, the first one must be the sequence names. .. image:: /_static/tools/poet/prompt-context-custom-1.png :alt: Create custom context @@ -109,161 +114,48 @@ If a structure file contains multiple chains, you can select which chain to use 3. Build From MSA ~~~~~~~~~~~~~~~~~~ -There are serveral options to create a context from an MSA: - -1.
**Use Existing MSA**: Select an existing MSA from the current project. -2. **Upload MSA**: Upload an MSA file directly. -3. **Run Homology Search Using a Seed Sequence**: Enter a single seed sequence, and PoET will generate an MSA by searching for homologs. Note: If multiple sequences are entered, only the first one will be used. - -.. image:: /_static/tools/poet/prompt-context-msa.png - :alt: Manage prompts - -You can further customize your analysis by: - -- **Number of prompts to ensemble**: Choose 1 to sample a single prompt, or 2-15 to increase diversity. We recommend 3-5 prompts for most use cases. -- **Prompt Sampling Method**: Start with the default settings and fine-tune them based on your results. - - -Uploading and Saving a Sequence-Only Prompt --------------------------------------------- - -Without a Project -~~~~~~~~~~~~~~~~~~ -On the **Projects** page, select a PoET tool from the navigation bar. Under **Prompt Definition**, click **Select a file** and choose a `.fasta` or `.csv` file. Ensure **Prompt** is selected before uploading. - -.. image:: /_static/tools/poet/prompt-1.png - :alt: Uploading prompt without a project - -Within a Project -~~~~~~~~~~~~~~~~~ -Prompts can be uploaded via: - -- **Project Page:** Click **Upload**, select **Prompt**, and upload your `.fasta` or `.csv` file. - -.. image:: /_static/tools/poet/prompt-2.png - :alt: Uploading prompt from project - -- **Left Sidebar:** Click the **Upload** button under the **Prompt** section and select your file. - -.. image:: /_static/tools/poet/prompt-3.png - :alt: Uploading prompt within a project - -.. image:: /_static/tools/poet/prompt-4.png - :alt: Uploaded prompt preview - -- **From a MSA:** On an existing MSA page, click **Create Prompt**. - -.. image:: /_static/tools/poet/prompt-5.png - :alt: Create prompt from MSA page - - -What is a Multiple Sequence Alignment (MSA)? 
---------------------------------------------- - -Multiple sequence alignment (MSA) is a technique for biological sequence analysis. It consists of a sequence alignment of three or more biological sequences that usually have an evolutionary relationship. - -Why is MSA Useful? -~~~~~~~~~~~~~~~~~~~ +A multiple sequence alignment (MSA) places three or more related protein sequences in register so that equivalent residues are compared in the same positions. Protein MSAs often include gap characters (`-`) to preserve this relationship across insertions and deletions. -The resulting MSA can be used to infer sequence homology and conduct phylogenetic analysis to assess the sequences’ shared evolutionary origins. Biologically sound and accurate alignments show homology and relationships, allowing for new member identification and the comparison of similar sequences. Accuracy is vital because subsequent analyses depend on the MSA results. +Use this option when you already have an MSA, want to reuse an MSA from the project, or want OpenProtein to build an MSA from a seed sequence. PoET samples sequences from the MSA according to the prompt sampling parameters and uses those sampled sequences as the prompt context. -When building a prompt from an MSA, include sequences you want to optimize. The model learns the patterns of the proteins and predicts sequences that best fit that list. Since the model views proteins in their entirety, you cannot optimize for a specific property or activity. +When uploading a precomputed MSA, make sure the sequences are aligned and use gap tokens where needed. FASTA files should use `.fa` or `.fasta`; CSV files should not include a header row. +Create the Prompt Context +^^^^^^^^^^^^^^^^^^^^^^^^^ -Creating a Prompt Using a MSA -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +There are several ways to create a context from an MSA: -Without a Project -^^^^^^^^^^^^^^^^^^^ -Navigate to any PoET tool under **Prompt Definition**. 
You can either input the MSA directly or upload an existing `.fa`, `.fasta`, or `.csv` file. - -.. image:: /_static/tools/poet/prompt-6.png - :alt: Uploading MSA without a project - -Within a Project -^^^^^^^^^^^^^^^^^^ -MSAs can be uploaded via: - -- **Project Page:** Click **Upload**, select **MSA**, and input or upload a `.fa`, `.fasta`, or `.csv` file. - -.. image:: /_static/tools/poet/prompt-7.png - :alt: Uploading MSA on project page - -.. image:: /_static/tools/poet/prompt-8.png - :alt: Uploading MSA popup on project page - -- **Left Sidebar**: Click the **Upload** button under the **MSA** section and input or upload a file. - -.. image:: /_static/tools/poet/prompt-9.png - :alt: sidebar MSA upload button - -.. image:: /_static/tools/poet/prompt-10.png - :alt: Uploading MSA popup within a project - - -What is a Seed Sequence? ---------------------------- -A seed sequence is a single protein sequence provided by the user to initiate a homology search. PoET does a homology search using public databases like uniprot to build an MSA from the seed sequence. PoET then creates a prompt by randomly selecting sequences from the MSA. - - -Creating a Prompt via Homology Search based on a Seed Sequence -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Without a Project -^^^^^^^^^^^^^^^^^^ - -Navigate to any PoET tool under **Prompt Definition**, input a seed sequence, and select **Single Sequence**. +1. **Use Existing MSA**: Select an existing MSA from the current project. +2. **Upload MSA**: Upload an aligned MSA file directly. +3. **Run Homology Search Using a Seed Sequence**: Enter a single protein sequence to start a homology search. Choose a representative sequence for the protein family or design target, because OpenProtein searches for homologs from this seed, builds an MSA from the results, and samples that MSA to create the prompt context. -.. image:: /_static/tools/poet/prompt-11.png - :alt: entering seed sequence without a project +.. 
image:: /_static/tools/poet/prompt-context-msa.png + :alt: Manage prompts +.. _prompt-sampling-definitions: -Within a Project -^^^^^^^^^^^^^^^^^^ +Prompt Sampling Parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Homology search from a seed sequence can be initiated via: +The selection of prompt sequences from the MSA is controlled by several prompt sampling parameters. -- **Project Page**: Click **Upload**, select **MSA**, input a single sequence, and click **Search for homologs to build MSA**. +The **sampling method** field defines the strategy used for selecting prompt sequences from the homologs found by homology search, or from the provided MSA. We recommend using the default **Neighbors** method. The other options are **Top** and **Random**. -.. image:: /_static/tools/poet/prompt-7.png - :alt: Uploading MSA on project page - -.. image:: /_static/tools/poet/prompt-12.png - :alt: Uploading MSA on project page - - -- **Left Sidebar**: Click the **Upload** button under the **MSA** section, input a sequence, and click **Search for homologs to build MSA**. +The **homology level** field allows you to generate more or less diverse prompt sequences. If your protein comes from a highly diverse family, or if you want to explore a large and diverse set of variants, use a lower homology level. If you need more focused generation, use a higher homology level and set a minimum similarity threshold to focus the prompt on the local sequence landscape around your seed. -.. image:: /_static/tools/poet/prompt-9.png - :alt: sidebar MSA upload button +The default **maximum similarity** and **minimum similarity** parameters work well across a wide range of protein families. Tune these parameters when you want to adjust the diversity of sequences modeled by PoET. -.. image:: /_static/tools/poet/prompt-13.png - :alt: single seq popup sidebar +- **Number of prompts to ensemble**: Choose 1 to sample a single prompt, or 2-15 to increase diversity. We recommend 3-5 prompts for most use cases. 
+- **Sampling method**: Defines the sampling strategy used for selecting prompt sequences from the homologs found by homology search, or from the provided MSA. The following strategies are available: -Prompt Sampling Parameters -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + - **Top**: Select sequences based on the order in which they occur in the MSA. + - **Random**: Select sequences randomly without replacement in the MSA. + - **Neighbors**: Sample more diverse, less redundant sequences from the MSA by sampling each sequence with weight inversely proportional to its number of homologs in the MSA. -- **Sampling method**: defines the sampling strategy used for selecting prompt sequences from the homologs found by homology search, or from the provided MSA. The following strategies are available: - - **Top**: Select sequences based on the order in which they occur in the MSA - - **Random**: Select sequences randomly without replacement in the MSA - - **Neighbors**: Sample more diverse, less redundant sequences from the MSA by sampling each sequence with weight inversely proportional to its number of homologs in the MSA. - **Homology level**: This parameter controls the identity level at which two sequences are considered “neighbors” - that is, redundant - in the MSA. This is equivalent to the homology level used to calculate the number of effective sequences in protein families. - **Random seed**: The seed for the random number generator used to sample from the MSA. Using the same seed with the same MSA and sampling parameters will guarantee that the same results are generated each time. Different seeds will produce different prompt samples. - **Maximum similarity to seed sequence**: The maximum similarity to the seed sequence allowed when selecting sequences for the prompt. No sequence with identity greater than this to the seed will be included. - **Minimum similarity to seed sequence**: The minimum similarity to the seed sequence allowed when selecting sequences for the prompt. 
No sequence with identity less than this to the seed will be included. This is useful for creating prompts that are highly focused on the local sequence space around the seed. - **Maximum number of sequences**: The number of sequences sampled from the MSA to form the prompt. The same sequence will not be sampled from the MSA more than once, so the number of sequences in the prompt will never be greater than the number of sequences in the MSA. - **Maximum total number of residues**: The maximum total number of residues in all sequences sampled from the MSA to form the prompt. For example, if this is set to 1000, sequences will be sampled from the MSA up to a maximum cumulative length of 1000 residues. - -Prompt Sampling Explained -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The selection of prompt sequences from the MSA is controlled by several prompt sampling parameters. - -The **sampling method** field defines the sampling strategy used for selecting prompt sequences from the homologs found by homology search, or from the provided MSA. We recommend using the default **Neighbors** method. The other options are **Top** and **Random**. - -The **homology level** field allows you to generate more or less diverse prompt sequences. -- If your protein comes from a highly diverse family or you wish to explore a large and diverse set of variants, adjusting the homology level to be lower will select higher diversity prompt sequences and generate higher diversity sequence distributions. -- If you need more focused generation, use a higher homology level and set a minimum similarity threshold to ensure the prompt focuses on the local sequence landscape around your seed. - -The default **maximum** and **minimum similarity parameters** are set to values which perform well across a wide range of protein families. These can be tuned to adjust the diversity of sequences that will be modeled by PoET.