How I Redesigned a Thermostable Enzyme with ProteinMPNN, Validated with AlphaFold2

Jul 4 • 3 min read

— Originally published at dev.to

E155 and E215 at 6.03 Å - close to the ~5.5 Å nucleophile/acid-base separation
typical of retaining glycoside hydrolases, as expected for a GH5 retaining endoglucanase.

This step matters. If I had accepted the literature numbering without checking, I would have fixed the wrong residues, and the constrained run would have been biologically meaningless.
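
The distance check can be sketched in plain Python. This is a minimal sketch, not the exact script used here: it assumes the 6.03 Å figure refers to the closest pair of side-chain carboxylate oxygens (OE1/OE2), which the text above does not specify, and the function names are hypothetical.

```python
import math

def carboxylate_oxygens(pdb_text, chain, resnum):
    """Return (x, y, z) tuples for the OE1/OE2 atoms of one Glu residue."""
    coords = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Fixed-column PDB fields: atom name 13-16, residue name 18-20,
        # chain ID 22, residue number 23-26, x/y/z 31-54 (1-based columns).
        atom, resname = line[12:16].strip(), line[17:20]
        if (resname == "GLU" and line[21] == chain
                and int(line[22:26]) == resnum and atom in ("OE1", "OE2")):
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords

def min_carboxylate_distance(pdb_text, chain, res_a, res_b):
    """Shortest oxygen-oxygen distance (Å) between two Glu side chains."""
    a = carboxylate_oxygens(pdb_text, chain, res_a)
    b = carboxylate_oxygens(pdb_text, chain, res_b)
    return min(math.dist(p, q) for p in a for q in b)
```

Calling `min_carboxylate_distance(open("cel5a_clean.pdb").read(), "A", 155, 215)` would give the separation for the residue pair discussed above. Running the same function over every pair of glutamates is one way to find the catalytic pair when the literature numbering does not match the file.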

Step 2: B-factor profile

Before running ProteinMPNN, I computed per-residue B-factors to identify
flexibility hotspots - regions that might benefit most from redesign:

High-flexibility regions (B > 30 Å²):

Residues 246–250: B-factor up to 96.85 Å² - extremely mobile surface loop
N-terminus (residues 1–3)
Loop 169–173

These are the regions where the backbone is most mobile, and where
thermostabilising mutations would likely be most impactful in a follow-up study.
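
A B-factor profile like this can be pulled straight from the ATOM records. The sketch below is one possible approach under stated assumptions: it averages over all atoms in each residue (the text above does not say whether the reported values are per-atom maxima, CA-only, or residue means), and the function names are hypothetical.

```python
from collections import defaultdict

def per_residue_bfactor(pdb_text, chain="A"):
    """Mean B-factor (Å²) per residue number for one chain."""
    values = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[21] == chain:
            # Residue number is in columns 23-26, B-factor in columns 61-66.
            values[int(line[22:26])].append(float(line[60:66]))
    return {res: sum(v) / len(v) for res, v in sorted(values.items())}

def flexible_regions(bfactors, cutoff=30.0):
    """Group consecutive residues above the cutoff into (start, end) ranges."""
    hot = [res for res, b in bfactors.items() if b > cutoff]
    regions = []
    for res in hot:
        if regions and res == regions[-1][1] + 1:
            regions[-1][1] = res         # extend the current run
        else:
            regions.append([res, res])   # start a new run
    return [tuple(r) for r in regions]
```

With the cutoff of 30 Å² used above, `flexible_regions(per_residue_bfactor(pdb_text))` returns contiguous ranges in the same form as the list of hotspots.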

Step 3: ProteinMPNN - unconstrained run

First I ran ProteinMPNN without any constraints to understand what sequences it naturally prefers for this backbone:

python protein_mpnn_run.py \
    --pdb_path cel5a_clean.pdb \
    --out_folder mpnn_output/temp_0.1 \
    --num_seq_per_target 100 \
    --sampling_temp 0.1 \
    --seed 42

I repeated this at five sampling temperatures (0.1, 0.2, 0.3, 0.5, 0.8) with 100 sequences each, for 500 sequences in total.
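
One way to script the temperature sweep is a small Python wrapper around the command above. This is a sketch under assumptions: it reuses only the flags shown above, follows the `mpnn_output/temp_<T>` folder pattern from that command, and reuses seed 42 for every temperature, which the text does not confirm. The function names are hypothetical.

```python
import subprocess

TEMPERATURES = [0.1, 0.2, 0.3, 0.5, 0.8]

def build_command(temp, pdb="cel5a_clean.pdb", n_seq=100, seed=42):
    """Assemble one protein_mpnn_run.py call for a single sampling temperature."""
    return [
        "python", "protein_mpnn_run.py",
        "--pdb_path", pdb,
        "--out_folder", f"mpnn_output/temp_{temp}",
        "--num_seq_per_target", str(n_seq),
        "--sampling_temp", str(temp),
        "--seed", str(seed),
    ]

def run_all(dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for temp in TEMPERATURES:
        cmd = build_command(temp)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop on the first failed run
```

`run_all()` only prints the five commands; `run_all(dry_run=False)` executes them from inside the ProteinMPNN checkout.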

The result was surprising: at T=0.1, ProteinMPNN placed threonine at position 155
(93/100 sequences) and alanine at position 215 (97/100 sequences). Both catalytic glutamates were replaced with non-catalytic residues.

This is not a bug - it's the correct behaviour. ProteinMPNN optimises for backbone fit, not biological function. Threonine and alanine may pack better against the local structure than glutamate, but they cannot perform the retaining mechanism. This finding motivates the constrained run.
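
Tallies like 93/100 and 97/100 come from counting the residue at one position across all designed sequences. A minimal sketch, assuming the output FASTA lists the native sequence as its first record followed by the samples, that the design is a single chain, and that sequence positions line up with the residue numbering of the cleaned PDB (true only if the chain starts at 1 with no gaps). The function names are hypothetical.

```python
from collections import Counter

def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def residue_counts(sequences, position):
    """Count amino acids at a 1-based sequence position across designs."""
    return Counter(seq[position - 1] for seq in sequences)
```

For example, `residue_counts([s for _, s in read_fasta(text)[1:]], 155).most_common(3)` lists the three most frequent residues at position 155, skipping the native record.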

Step 4: ProteinMPNN - catalytic-constrained run

Fix E155 and E215, let everything else vary:

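import json

# Keys: PDB name (file name without .pdb) -> chain ID -> positions to keep fixed.
# Positions are 1-based indices along the chain; they match PDB residue numbers
# only if the cleaned chain starts at 1 with no gaps.
fixed_positions = {
    "cel5a_clean": {
        "A": [155, 215]
    }
}
with open("fixed_positions.jsonl", "w") as f:
    f.write(json.dumps(fixed_positions) + "\n")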

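# --out_folder is needed by the script; the folder name here is illustrative
python protein_mpnn_run.py \
    --pdb_path cel5a_clean.pdb \
    --fixed_positions_jsonl fixed_positions.jsonl \
    --out_folder mpnn_output_fixed/temp_0.1 \
    --num_seq_per_target 100 \
    --sampling_temp 0.1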

Results:

Temperature	Mean score	Mean recovery	E155 preserved	E215 preserved
0.1	0.763	52.6%	100%	100%
0.2	0.788	52.1%	100%	100%
0.3	0.830	51.5%	100%	100%
0.5	0.983	49.1%	100%	100%
0.8	1.369	43.1%	100%	100%

100% catalytic preservation at all temperatures. And the score distributions are nearly identical to the unconstrained run - fixing 2 out of 605 residues costs essentially nothing in terms of backbone fit.
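
The mean score and recovery columns can be recomputed from the FASTA headers. This sketch assumes each sampled record has a header of the form `>T=0.1, sample=1, score=0.7532, global_score=0.7532, seq_recovery=0.5260`, and that the native record carries no `sample=` field; check this against your own output before relying on it. The function name is hypothetical.

```python
import re
from statistics import mean

# Captures the score and sequence recovery from a sampled-design header.
HEADER = re.compile(r"sample=\d+, score=([\d.]+).*?seq_recovery=([\d.]+)")

def summarise(fasta_text):
    """Return (mean score, mean sequence recovery) over sampled designs."""
    scores, recoveries = [], []
    for line in fasta_text.splitlines():
        match = HEADER.search(line) if line.startswith(">") else None
        if match:  # the native record has no "sample=" field and is skipped
            scores.append(float(match.group(1)))
            recoveries.append(float(match.group(2)))
    return mean(scores), mean(recoveries)
```

Running `summarise` on the constrained and unconstrained files for the same temperature gives a quick side-by-side check of how much the two fixed positions cost.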

Step 5: AlphaFold2 validation

Top 20 constrained designs (lowest MPNN score, T=0.1) were folded using
ColabFold on a free T4 GPU:

colabfold_batch \
  top20_constrained_candidates.fasta \
  af2_structures/designs \
  --model-type alphafold2_ptm \
  --num-recycle 3 \
  --msa-mode single_sequence

Also folded the wildtype under identical conditions as a baseline.

Result: 20/20 designs beat wildtype pLDDT.

Design	MPNN score	AF2 pLDDT	delta pLDDT vs WT
CEL5A_FIXED_013	0.7532	46.04	+12.71
CEL5A_FIXED_012	0.7513	45.69	+12.37
CEL5A_FIXED_014	0.7535	45.61	+12.29
CEL5A_FIXED_010	0.7508	44.48	+11.16
CEL5A_FIXED_004	0.7471	43.72	+10.40

The per-residue pLDDT profiles show the biggest improvements in the catalytic domain region (residues 150–250) - exactly where the constrained redesign introduced the most changes around the fixed glutamates.
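
A regional comparison like this can be sketched from per-residue pLDDT lists. The sketch assumes each ColabFold run leaves a scores JSON file with a `plddt` list, and that design and wildtype have the same length so indices line up; the function names and any file paths passed in are hypothetical.

```python
import json
from statistics import mean

def load_plddt(scores_json_path):
    """Read the per-residue pLDDT list from one scores JSON file."""
    with open(scores_json_path) as fh:
        return json.load(fh)["plddt"]

def window_delta(design, wildtype, start, end):
    """Mean pLDDT difference (design - wildtype) over 1-based residues start..end."""
    if len(design) != len(wildtype):
        raise ValueError("design and wildtype must have the same length")
    return mean(design[start - 1:end]) - mean(wildtype[start - 1:end])
```

`window_delta(design_plddt, wt_plddt, 150, 250)` would give the mean gain over the region discussed above, and passing `1` and `len(wt_plddt)` gives the whole-protein delta.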

A note on absolute pLDDT values

The pLDDT values (33–46) look low compared to typical small protein benchmarks. This is expected for a 605 aa two-domain protein in single-sequence mode - AlphaFold2 relies heavily on co-evolutionary information from MSAs to correctly position domains. Single-sequence mode lacks this signal.

The meaningful comparison is relative pLDDT under identical conditions, not absolute values. Every design is predicted with higher confidence than wildtype under the same conditions.

What I'd do next

MSA-mode validation for the top 5 designs - proper pLDDT with full co-evolutionary information
Rosetta ΔΔG scoring - filter by predicted thermostability change
Molecular dynamics - simulate the top 3 designs at 55°C and 70°C to assess thermal stability of the catalytic glutamate geometry
Experimental validation - express in E. coli, measure Tm by DSF, compare CMC activity at elevated temperatures

Key lessons

1. Verify catalytic residues in the structure itself, not from literature numbering alone.
PDB numbering often differs from publication numbering. Checking that the two candidate glutamates sit about 6 Å apart is more reliable than assuming the literature values transfer directly.

2. Run unconstrained first.
The unconstrained run revealed that ProteinMPNN actively avoids glutamate at both catalytic positions. Without this finding, there would be no evidence that the constraint was needed in the first place.

3. Fixing 2/605 residues is essentially free.
Score distributions between constrained and unconstrained runs are almost
identical. You can preserve the catalytic residues without a measurable
penalty in MPNN score.

4. Low absolute pLDDT is not failure.
Single-sequence mode for large multi-domain proteins typically yields low absolute pLDDT. Design for relative comparisons, and always fold the wildtype under the same conditions as your baseline.

GitHub: https://github.com/Farhan89082/proteinmpnn-cel5a

If you have questions about any stage - especially the catalytic residue
identification or the constrained run setup - drop them in the comments.
