How I Redesigned a Thermostable Enzyme with ProteinMPNN, Validated with AlphaFold2

— Originally published at dev.to

E155 and E215 at 6.03 Å - consistent with the roughly 5.5 Å nucleophile/acid-base
separation typical of retaining glycoside hydrolases such as a GH5 endoglucanase.

This step matters. If I had accepted the literature numbering without checking, I would have fixed the wrong residues and the constrained run would have been biologically meaningless.
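A distance check of this kind can be sketched in a few lines of plain Python. This is a minimal sketch, not the exact script from the project: it assumes the distance is measured between the closest pair of carboxylate oxygens (OE1/OE2) of the two glutamates, which the text above does not specify, and the function names are hypothetical.

```python
import math

def glu_carboxylate_oxygens(pdb_lines, chain, resnum):
    """Return (x, y, z) tuples for the OE1/OE2 atoms of one residue.

    Relies on the fixed-column layout of PDB ATOM records.
    """
    coords = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()
        if (line[21] == chain and int(line[22:26]) == resnum
                and atom_name in ("OE1", "OE2")):
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords

def min_carboxylate_distance(pdb_lines, chain, res_a, res_b):
    """Smallest oxygen-oxygen distance (in Å) between two glutamates.

    Assumption: the closest OE-OE pair is the relevant measurement.
    """
    a = glu_carboxylate_oxygens(pdb_lines, chain, res_a)
    b = glu_carboxylate_oxygens(pdb_lines, chain, res_b)
    return min(math.dist(p, q) for p in a for q in b)
```

To use it, read the structure with `open("cel5a_clean.pdb").readlines()` and call `min_carboxylate_distance(lines, "A", 155, 215)`. Running the same check on the literature numbering and on the PDB numbering shows quickly which pair sits at a plausible catalytic distance.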


Step 2: B-factor profile

Before running ProteinMPNN, I extracted per-residue B-factors from the PDB file to
identify flexibility hotspots - regions that might benefit most from redesign:

High-flexibility regions (B > 30 Ų):

  • Residues 246–250: B-factor up to 96.85 Ų - extremely mobile surface loop
  • N-terminus (residues 1–3)
  • Loop 169–173

These are the regions where ProteinMPNN has the most freedom and where
thermostabilising mutations may be most impactful in a follow-up study.
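For reference, a B-factor profile like this can be built without any external library. The sketch below assumes the per-residue value is the mean over all atoms of the residue (using only CA atoms would be an equally reasonable choice), and the function names are hypothetical.

```python
from collections import defaultdict

def per_residue_bfactor(pdb_lines, chain="A"):
    """Mean B-factor per residue number for one chain.

    Assumption: averages over all atoms of each residue, reading the
    B-factor from columns 61-66 of each ATOM record.
    """
    values = defaultdict(list)
    for line in pdb_lines:
        if line.startswith("ATOM") and line[21] == chain:
            resnum = int(line[22:26])
            values[resnum].append(float(line[60:66]))
    return {res: sum(b) / len(b) for res, b in sorted(values.items())}

def flexible_residues(bfactors, threshold=30.0):
    """Residue numbers whose mean B-factor exceeds the threshold (Å^2)."""
    return [res for res, b in bfactors.items() if b > threshold]
```

Calling `flexible_residues(per_residue_bfactor(lines))` on the lines of the cleaned PDB file returns the residue numbers above the 30 Ų cut-off used here.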


Step 3: ProteinMPNN - unconstrained run

First I ran ProteinMPNN without any constraints to understand what sequences it naturally prefers for this backbone:

python protein_mpnn_run.py \
    --pdb_path cel5a_clean.pdb \
    --out_folder mpnn_output/temp_0.1 \
    --num_seq_per_target 100 \
    --sampling_temp 0.1 \
    --seed 42

Five temperatures (0.1, 0.2, 0.3, 0.5, 0.8), 100 sequences each = 500 total.
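The sweep over all five temperatures can be scripted rather than typed by hand. This is a sketch under two assumptions: that each temperature gets its own output folder named after the `temp_0.1` pattern shown above, and that the same seed is reused for every run. The helper names are hypothetical; the flags are the ones from the command above.

```python
import subprocess

TEMPERATURES = ["0.1", "0.2", "0.3", "0.5", "0.8"]

def build_mpnn_command(temp, pdb_path="cel5a_clean.pdb", n_seqs=100, seed=42):
    """Build the ProteinMPNN command line for one sampling temperature."""
    return [
        "python", "protein_mpnn_run.py",
        "--pdb_path", pdb_path,
        # Assumed folder naming, extrapolated from the single command shown
        "--out_folder", f"mpnn_output/temp_{temp}",
        "--num_seq_per_target", str(n_seqs),
        "--sampling_temp", temp,
        "--seed", str(seed),
    ]

def run_sweep(temperatures=TEMPERATURES):
    """Run ProteinMPNN once per temperature, stopping on the first failure."""
    for temp in temperatures:
        subprocess.run(build_mpnn_command(temp), check=True)
```

Calling `run_sweep()` from the ProteinMPNN directory launches the five runs one after another. Building the command as a list keeps each run reproducible and easy to log.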

The result was surprising: at T=0.1, ProteinMPNN placed threonine at position 155 (93/100 sequences) and alanine at position 215 (97/100 sequences). Both catalytic glutamates were replaced with non-catalytic residues.

This is not a bug - it's the correct behaviour. ProteinMPNN optimises for backbone fit, not biological function. Threonine and alanine may pack better against the local structure than glutamate, but they cannot perform the retaining mechanism. This finding motivates the constrained run.
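Counts like these are easy to reproduce from the sampled sequences. The sketch below assumes the designs are available as a FASTA file and that the first record is the native sequence, so it should be skipped before counting. The `offset` parameter is a hypothetical knob for cases where PDB residue numbers do not match the sequence index.

```python
from collections import Counter

def read_fasta(lines):
    """Return a list of (header, sequence) tuples from FASTA lines."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def residue_counts(sequences, position, offset=0):
    """Count amino acids at a 1-based position across sequences.

    offset: subtract this if residue numbering starts later than 1
    (assumption: numbering is contiguous, with no gaps or insertions).
    """
    index = position - 1 - offset
    return Counter(seq[index] for seq in sequences)
```

For example, `residue_counts([seq for _, seq in records[1:]], 155)` gives the residue distribution at position 155 over the sampled designs, skipping the assumed native record.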


Step 4: ProteinMPNN - catalytic-constrained run

Fix E155 and E215, let everything else vary:

import json

# Positions to keep fixed, keyed by PDB name and then chain ID
fixed_positions = {
    "cel5a_clean": {
        "A": [155, 215]
    }
}
with open("fixed_positions.jsonl", "w") as f:
    f.write(json.dumps(fixed_positions) + "\n")

Then pass the file to ProteinMPNN:

python protein_mpnn_run.py \
    --pdb_path cel5a_clean.pdb \
    --fixed_positions_jsonl fixed_positions.jsonl \
    --num_seq_per_target 100 \
    --sampling_temp 0.1

Results:

Temperature   Mean score   Mean recovery   E155 preserved   E215 preserved
0.1           0.763        52.6%           100%             100%
0.2           0.788        52.1%           100%             100%
0.3           0.830        51.5%           100%             100%
0.5           0.983        49.1%           100%             100%
0.8           1.369        43.1%           100%             100%

Both catalytic glutamates are preserved in 100% of sequences at every temperature, as expected for fixed positions. The score distributions are nearly identical to those of the unconstrained run: fixing 2 of 605 residues costs essentially nothing in terms of backbone fit.
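The per-temperature means can be computed from the FASTA headers of the sampled sequences. This sketch assumes each header is a comma-separated list of `key=value` pairs that includes `score` and `seq_recovery` (for example `T=0.1, sample=1, score=0.75, seq_recovery=0.52`). Treat that layout as an assumption and check it against your own output files; the function names are hypothetical.

```python
from statistics import mean

def parse_header(header):
    """Parse 'T=0.1, sample=1, score=0.75, ...' into a dict.

    Values are converted to float where possible, otherwise kept as text.
    """
    fields = {}
    for part in header.split(","):
        if "=" not in part:
            continue
        key, value = part.strip().split("=", 1)
        try:
            fields[key] = float(value)
        except ValueError:
            fields[key] = value
    return fields

def summarise(headers):
    """Mean score and mean sequence recovery over a set of headers."""
    parsed = [parse_header(h) for h in headers]
    return {
        "mean_score": mean(p["score"] for p in parsed),
        "mean_recovery": mean(p["seq_recovery"] for p in parsed),
    }
```

Running `summarise` once per temperature folder, on the headers of the sampled sequences only, yields one table row per temperature.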


Step 5: AlphaFold2 validation

Top 20 constrained designs (lowest MPNN score, T=0.1) were folded using
ColabFold on a free T4 GPU:

colabfold_batch \
  top20_constrained_candidates.fasta \
  af2_structures/designs \
  --model-type alphafold2_ptm \
  --num-recycle 3 \
  --msa-mode single_sequence

I also folded the wildtype under identical conditions as a baseline.
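The selection step that produces the candidate FASTA can be sketched as follows. It assumes the designs have already been collected as `(name, score, sequence)` tuples, and it relies on lower MPNN scores being better, as in the selection described above. The function names and the tuple layout are hypothetical.

```python
def top_n_by_score(records, n=20):
    """Return the n records with the lowest MPNN score.

    records: iterable of (name, score, sequence) tuples.
    """
    return sorted(records, key=lambda r: r[1])[:n]

def to_fasta(records):
    """Format (name, score, sequence) tuples as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, _score, seq in records)
```

Writing `to_fasta(top_n_by_score(records))` to `top20_constrained_candidates.fasta` gives a file in the shape expected by the `colabfold_batch` call.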

Result: 20/20 designs beat wildtype pLDDT.

Design            MPNN score   AF2 pLDDT   delta pLDDT vs WT
CEL5A_FIXED_013   0.7532       46.04       +12.71
CEL5A_FIXED_012   0.7513       45.69       +12.37
CEL5A_FIXED_014   0.7535       45.61       +12.29
CEL5A_FIXED_010   0.7508       44.48       +11.16
CEL5A_FIXED_004   0.7471       43.72       +10.40

The per-residue pLDDT profiles show the biggest improvements in the catalytic domain region (residues 150–250) - exactly where the constrained redesign introduced the most changes around the fixed glutamates.
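The mean pLDDT and the delta against wildtype can be recomputed from the predicted structures, because AlphaFold2 writes per-residue pLDDT into the B-factor column of its output PDB files. The sketch below averages over CA atoms, which is an assumption about how the mean was taken here, and uses hypothetical function names.

```python
def mean_plddt(pdb_lines):
    """Mean pLDDT over CA atoms, read from the B-factor column."""
    values = [
        float(line[60:66])
        for line in pdb_lines
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    return sum(values) / len(values)

def delta_vs_wildtype(design_plddt, wildtype_plddt):
    """Difference in mean pLDDT between each design and the wildtype.

    design_plddt: dict mapping design name to mean pLDDT.
    """
    return {
        name: round(value - wildtype_plddt, 2)
        for name, value in design_plddt.items()
    }
```

The same per-residue values, kept as a list instead of averaged, are what a per-residue profile plot would be drawn from.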


A note on absolute pLDDT values

The pLDDT values (33–46) look low compared to typical small protein benchmarks. This is expected for a 605 aa two-domain protein in single-sequence mode - AlphaFold2 relies heavily on co-evolutionary information from MSAs to correctly position domains. Single-sequence mode lacks this signal.

The meaningful comparison is relative pLDDT under identical conditions, not absolute values. Every design is predicted with higher confidence than the wildtype under the same conditions.


What I'd do next

  1. MSA-mode validation for top 5 designs - proper pLDDT with full
    co-evolutionary information
  2. Rosetta ΔΔG scoring - filter by predicted thermostability change
  3. Molecular dynamics - simulate the top 3 designs at 55°C and 70°C to
    assess thermal stability of the catalytic dyad geometry
  4. Experimental validation - express in E. coli, measure Tm by DSF,
    compare CMC activity at elevated temperatures

Key lessons

1. Verify catalytic residues in the structure itself, not just from the literature.
PDB numbering often differs from publication numbering. Checking the roughly 6 Å
distance between the candidate residues is more reliable than assuming the literature values transfer directly.

2. Run unconstrained first.
The unconstrained run revealed that ProteinMPNN almost never places glutamate at either catalytic position. Without this finding, the constrained run would lack a clear motivation.

3. Fixing 2/605 residues is essentially free.
Score distributions between constrained and unconstrained runs are almost
identical. You can preserve the catalytic residues without sacrificing
backbone fit.

4. Low absolute pLDDT is not failure.
Single-sequence mode for large multi-domain proteins typically yields low absolute pLDDT. Design around relative comparisons, and always fold the wildtype under the same conditions as your baseline.


GitHub: https://github.com/Farhan89082/proteinmpnn-cel5a

If you have questions about any stage - especially the catalytic residue
identification or the constrained run setup - drop them in the comments.
