Hardening Two Multi Tenant SaaS APIs:

Joshua R. Gutierrez

Jun 5 • 9 min read

Originally published at dev.to

Hardening Two Multi Tenant SaaS APIs

What We Found, What We Fixed, and What Changed

Security hardening is not glamorous work.

Most of it is careful reading, careful verification, uncomfortable edge cases, and refusing to trust assumptions that have quietly become part of the system.

Recently, I completed a remediation pass across two multi tenant SaaS products: Site2CRM and Made4Founders. Both products had grown into real platforms with authenticated dashboards, public webhooks, billing flows, CRM integrations, OAuth style connections, file uploads, background jobs, and customer owned data.

That kind of product has a wide attack surface.

The work started with security reports and ended with two hardened branches, 24 total commits, more than 140 new regression tests, database migrations, centralized security utilities, audit logging, startup gates, and CI checks that now block the same classes of mistakes from coming back.

This article is a breakdown of what we found, how we fixed it, and the engineering lessons that came out of the process.

The Goal Was Not Just to Close Findings

A security report usually arrives as a list of issues.

That can make the work feel transactional.

Fix this endpoint.

Add this check.

Reject this payload.

Patch this route.

That is necessary, but it is not enough.

The real question is:

What class of mistake allowed this bug to exist?

That question changed the remediation strategy.

For every finding, the goal became clear:

Fix the specific issue.
Add a regression test for that finding.
Centralize the security pattern where possible.
Add a guard so the same mistake is harder to reintroduce.
Document any deployment or data migration steps clearly.

That approach turned the work from a cleanup pass into a hardening pass.

Finding 1: Tenant Data Must Always Be Scoped by Tenant

In a multi tenant application, the most important security rule is simple:

A customer should only be able to access data that belongs to their organization.

That rule sounds obvious, but enforcing it consistently is where systems get tested.

A risky query often looks harmless:

lead = db.query(Lead).filter(Lead.id == lead_id).first()

The problem is that lead_id alone is not a tenant boundary.

In a multi tenant system, the query needs to prove both identity and ownership:

lead = (
    db.query(Lead)
    .filter(
        Lead.id == lead_id,
        Lead.organization_id == current_user.organization_id,
    )
    .first()
)

The same principle applies to updates, deletes, dashboard feeds, background jobs, calendar items, configuration records, and integration data.

In one case, a calendar feed was using a join path that could leak data across tenants. In another, a vault configuration model had rows that were not safely attached to an organization. Both issues were fixed by making organization ownership explicit, adding migrations where needed, and creating regression coverage around the exact failure modes.

The broader lesson was this:

Tenant isolation should not depend on developer memory.

It should be a pattern the codebase makes easy and the test suite actively defends.
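
One way to make that pattern easy is a lookup helper that cannot be called without an organization. The sketch below is an illustration only. It uses sqlite3 from the standard library instead of the ORM shown above so that it runs on its own, and the table, columns, and function names are assumptions rather than the products' actual schema.

```python
import sqlite3
from typing import Optional


def get_lead_for_org(
    conn: sqlite3.Connection, lead_id: int, organization_id: int
) -> Optional[sqlite3.Row]:
    """Return the lead only if it belongs to the caller's organization."""
    # Identity and ownership are checked in the same query, so a valid id
    # from another tenant looks exactly like a missing row.
    return conn.execute(
        "SELECT id, organization_id, name FROM leads "
        "WHERE id = ? AND organization_id = ?",
        (lead_id, organization_id),
    ).fetchone()


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with one lead per organization."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE leads (id INTEGER PRIMARY KEY, "
        "organization_id INTEGER NOT NULL, name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO leads (id, organization_id, name) VALUES (?, ?, ?)",
        [(1, 10, "Acme lead"), (2, 20, "Globex lead")],
    )
    return conn
```

Because organization_id is a required argument, forgetting the tenant filter becomes a call error instead of a silent data leak.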

Finding 2: Organization Admins Are Not Platform Admins

One of the clearest authorization lessons came from a global resource protected by a tenant level role check.

Most SaaS products have organization roles like:

OWNER
ADMIN
USER

Those roles are useful inside a customer account. An organization owner should be able to manage users, settings, forms, integrations, and billing details for that organization.

But tenant authority is not platform authority.

A customer OWNER should not be able to manage global platform resources, issue marketplace codes, access internal tools, or make changes that affect other organizations.

The risky pattern was conceptually simple:

if current_user.role not in ["OWNER", "ADMIN"]:
    raise HTTPException(status_code=403)

That check answers the wrong question.

It asks:

Is this user powerful inside their own organization?

For global platform actions, the application needs to ask a different question:

Is this user actually trusted to operate the platform itself?

The fix was to introduce a real platform staff boundary and move global operations behind that boundary:

require_platform_staff(current_user)

This was not just a one line fix. Tests were added to prove that tenant owners and tenant admins could not access platform scoped functionality. A static guard was also added so that future routes operating on global tables cannot quietly be protected only by tenant roles.

The lesson:

Never reuse customer roles for platform administration.

They represent different trust models.
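
The body of require_platform_staff is not shown in this article. A minimal sketch of what such a boundary could look like follows. The is_platform_staff flag, the User shape, and the exception type are assumptions made for illustration. In a web framework the exception would typically be mapped to a 403 response.

```python
from dataclasses import dataclass


class PlatformAccessDenied(Exception):
    """Raised when a caller lacks platform level trust."""


@dataclass(frozen=True)
class User:
    email: str
    role: str  # tenant role: "OWNER", "ADMIN", or "USER"
    is_platform_staff: bool = False  # hypothetical flag, set only by the platform


def require_platform_staff(user: User) -> None:
    """Allow the call only for platform staff."""
    # The tenant role is deliberately never consulted here: being OWNER of
    # one organization says nothing about trust over the whole platform.
    if not user.is_platform_staff:
        raise PlatformAccessDenied(f"{user.email} is not platform staff")
```

Keeping the flag separate from the role column means no tenant level action, such as promoting a user to OWNER, can grant platform access by accident.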

Finding 3: Public Webhooks Are Public, Not Trusted

Public webhook endpoints need to be reachable by external providers.

That does not mean they should trust incoming requests.

Several webhook surfaces needed stronger sender verification. The risk was not that the endpoints existed. The risk was that state changing payloads could be processed without proving they came from the provider they claimed to represent.

The corrected model is straightforward:

Read the raw request body.
Verify the provider signature.
Reject missing or invalid signatures.
Parse the payload only after verification.
Apply idempotency where relevant.
Mutate state.
Return success.

This mattered for provider events such as inbound messaging events, email bounce and complaint notifications, and billing lifecycle events.

Billing webhooks received extra attention because they can change plan state. A forged billing event can potentially activate, cancel, downgrade, or otherwise alter customer access.

The fix was to make verification mandatory. If the provider webhook secret or webhook ID is missing in production, the app should fail closed. If the signature is invalid, the route should reject the request instead of logging the error and continuing.
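
The verification steps above can be sketched with the standard library. This assumes a provider that sends a hex encoded HMAC-SHA256 of the raw body in a header. Real providers differ (many also sign a timestamp), so the scheme, the function name, and the exception type here are illustrative assumptions. Idempotency and state mutation are left out.

```python
import hashlib
import hmac
import json
from typing import Optional


class WebhookRejected(Exception):
    """Raised when a webhook request must not be processed."""


def verify_and_parse(
    raw_body: bytes, signature_header: Optional[str], secret: str
) -> dict:
    """Verify the signature over the raw bytes, then parse the payload."""
    # Fail closed: a missing secret is a rejection, never a bypass.
    if not secret:
        raise WebhookRejected("webhook secret is not configured")
    if not signature_header:
        raise WebhookRejected("missing signature")

    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature were correct.
    if not hmac.compare_digest(expected.encode(), signature_header.encode()):
        raise WebhookRejected("invalid signature")

    # Parsing happens only after the sender has been verified.
    return json.loads(raw_body)
```

The signature is computed over the raw bytes, not over a re-serialized payload, because any change in whitespace or key order would change the digest.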

The lesson:

A webhook endpoint can be public without being trusted.

Reachability and trust are separate things.

Finding 4: External Install Identifiers Are Not Authentication

One integration trusted a client supplied install identifier as if it were a credential.

That is dangerous.

Install IDs, instance IDs, account IDs, and external resource IDs are identifiers. They are not proof that the caller owns the installation.

The hardened approach was to require a signed provider token, verify it server side, and only then extract the canonical installation identity from the verified payload.

The corrected flow became clear:

Receive signed provider instance token.
Verify the token using the provider app secret.
Extract the canonical instance ID from the verified payload.
Resolve the install record from that verified identity.
Refuse cross organization rebinding unless ownership is proven.

This fixed both the authentication issue and the related install rebinding risk.

The lesson:

Do not authenticate integrations with bare IDs.

Use provider signed assertions, server side verification, and explicit ownership rules.
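
The flow above can be sketched as follows. The token format used here, a base64url signature and a base64url JSON payload joined by a dot, is an assumption for illustration. The real format depends on the provider and should be taken from its documentation. The instanceId field and the in-memory installs mapping are also hypothetical.

```python
import base64
import hashlib
import hmac
import json
from typing import Dict


class InstallAuthError(Exception):
    """Raised when an install cannot be authenticated or bound."""


def _b64url_decode(segment: str) -> bytes:
    # Restore the padding that base64url tokens usually omit.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def verified_instance_id(token: str, app_secret: str) -> str:
    """Return the instance id from the token, but only if the signature is valid."""
    try:
        sig_part, payload_part = token.split(".", 1)
        expected = hmac.new(
            app_secret.encode(), payload_part.encode(), hashlib.sha256
        ).digest()
        if not hmac.compare_digest(_b64url_decode(sig_part), expected):
            raise InstallAuthError("signature mismatch")
        # The payload is read only after verification succeeds.
        return str(json.loads(_b64url_decode(payload_part))["instanceId"])
    except (ValueError, KeyError, TypeError) as exc:
        raise InstallAuthError("malformed token") from exc


def bind_install(
    installs: Dict[str, int], instance_id: str, organization_id: int
) -> None:
    """Bind a verified instance to an organization, refusing silent rebinding."""
    owner = installs.get(instance_id)
    if owner is not None and owner != organization_id:
        raise InstallAuthError("install already bound to another organization")
    installs[instance_id] = organization_id
```

This sketch always refuses a cross organization rebind. A real system would need a separate, explicit flow for proving ownership before moving an install.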

Finding 5: User Controlled URLs Need SSRF Protection

Several features across the two products involved server side requests to external URLs.

That pattern is common:

Outbound webhooks
RSS imports
Social integrations
Callback URLs
Remote media fetches
Health checks

The security risk is SSRF, or Server Side Request Forgery.

SSRF occurs when a user can influence a URL that the server requests. Without guardrails, an attacker may be able to point the server at internal infrastructure.

Examples include localhost, private network addresses, cloud metadata services, link local addresses, and internal admin panels.

The fix was to centralize outbound fetching through a safe fetch utility.

That utility blocks dangerous destinations, applies timeouts, restricts redirects, limits response size, and prevents internal response bodies from being reflected back to users.

The important part was centralization.

Instead of asking every route to remember SSRF rules, risky outbound requests now go through one safer path.
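
The destination check at the core of such a utility might look like the sketch below. It is not the products' actual safe fetch code. The names are hypothetical, and timeouts, redirect limits, and response size caps are left out. The resolver is injectable so the check can be tested without network access.

```python
import ipaddress
import socket
from typing import Callable, List
from urllib.parse import urlsplit


class UnsafeURLError(ValueError):
    """Raised when a user supplied URL points somewhere it should not."""


def resolve_host(host: str) -> List[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_public_address(raw_ip: str) -> bool:
    """Return True only for addresses outside internal and special ranges."""
    ip = ipaddress.ip_address(raw_ip)
    # Unwrap IPv4-mapped IPv6 such as ::ffff:127.0.0.1 before checking.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # includes 169.254.169.254 metadata endpoints
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def assert_safe_url(
    url: str, resolve: Callable[[str], List[str]] = resolve_host
) -> List[str]:
    """Validate scheme and destination, returning the vetted addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise UnsafeURLError("only http and https are allowed")
    if not parts.hostname:
        raise UnsafeURLError("URL has no host")
    addresses = resolve(parts.hostname)
    # Every resolved address must be public, not just the first one.
    if not addresses or not all(is_public_address(a) for a in addresses):
        raise UnsafeURLError("destination is not a public address")
    return addresses
```

One caveat with this shape: if the HTTP client resolves the hostname again after validation, a DNS answer that changes between the two lookups can bypass the check. Connecting to one of the returned addresses, and re-running the check on every redirect hop, closes that gap.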

The lesson:

Any feature that lets a user provide a URL should be treated as a network boundary.

Finding 6: Cryptography Should Not Have Production Fallbacks

Development conveniences can become production vulnerabilities if they are allowed to survive past local use.

The hardening pass removed weak secret fallbacks and hardcoded encryption key fallbacks. Token encryption was versioned, and a re encryption path was added so legacy stored values could be migrated safely.

The improved model included several important rules:

No hardcoded production fallback keys.
An explicit application encryption key requirement.
A versioned encrypted token format.
A migration script for legacy tokens.
A production flag to reject legacy plaintext tokens after migration.

The lesson:

Cryptographic failure should be loud.

If the key is missing in production, the app should not quietly invent one.
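
A small sketch of two of those rules follows: a key loader with no fallback, and a reader that distinguishes versioned values from legacy plaintext. The environment variable name, the enc:v1: prefix, and the function names are assumptions for illustration. No cipher is shown here. The sketch only covers how a stored value is classified before decryption.

```python
from typing import Mapping, Tuple

TOKEN_PREFIX = "enc:v1:"  # hypothetical version marker for encrypted values


class MissingKeyError(Exception):
    """Raised when the encryption key is not configured."""


class TokenFormatError(Exception):
    """Raised when a stored token is in a format that is no longer accepted."""


def load_encryption_key(env: Mapping[str, str]) -> bytes:
    """Return the configured key, or fail loudly. There is no default."""
    key = env.get("APP_ENCRYPTION_KEY", "")
    if not key:
        raise MissingKeyError("APP_ENCRYPTION_KEY is required")
    return key.encode()


def classify_stored_token(value: str, reject_legacy: bool) -> Tuple[str, str]:
    """Return (format, body) for a stored token value."""
    if value.startswith(TOKEN_PREFIX):
        # The body still has to be decrypted by the caller.
        return "v1", value[len(TOKEN_PREFIX):]
    if reject_legacy:
        # After migration, an unversioned value is treated as an error.
        raise TokenFormatError("legacy plaintext token rejected")
    return "legacy", value
```

The version prefix is what makes a later key rotation or algorithm change possible without guessing how each row was written.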

Finding 7: File Uploads Should Stay Boring

File uploads are easy to underestimate.

For brand logos and media uploads, the safest approach was to restrict formats and improve storage behavior.

The hardening changes included rejecting risky file types where they were unnecessary, using high entropy media names, enforcing body size limits, handling chunked uploads safely, and avoiding active content in logo uploads.

One practical example was SVG.

SVG can be useful, but it can also contain active content. If served from the wrong origin or with weak headers, it can become a stored XSS risk.

XSS means Cross Site Scripting. It occurs when attacker controlled content executes JavaScript in a trusted browser context.

For a logo upload feature, SVG was not worth the additional risk.

The lesson:

If a product only needs images, do not accept formats that behave like documents or code.
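
A minimal sketch of a boring logo upload check follows, assuming only PNG and JPEG are needed. The size limit, names, and exception type are hypothetical. Checking leading bytes is not full image validation, but it is enough to refuse SVG and other text based formats regardless of the filename or content type the client claims.

```python
import secrets
from typing import Optional

MAX_LOGO_BYTES = 2 * 1024 * 1024  # hypothetical limit

# Leading bytes of the only formats this sketch accepts.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
}


class UploadRejected(Exception):
    """Raised when an uploaded file must not be stored."""


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return the extension for a recognized image, or None."""
    for magic, ext in _SIGNATURES.items():
        if data.startswith(magic):
            return ext
    return None


def store_name_for_logo(data: bytes) -> str:
    """Validate the upload and return a high entropy storage name."""
    if len(data) > MAX_LOGO_BYTES:
        raise UploadRejected("file too large")
    ext = sniff_image_type(data)
    if ext is None:
        raise UploadRejected("only PNG and JPEG logos are accepted")
    # The name comes from the server, never from the client's filename.
    return f"{secrets.token_urlsafe(32)}.{ext}"
```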

Finding 8: Security Relevant Configuration Should Fail Closed

Some risks were not about code paths. They were about missing configuration.

Examples included missing webhook secrets, missing encryption keys, missing CAPTCHA secrets, missing Redis for rate limiting, weak application secret keys, and insecure production cookie settings.

The fix was to add production startup gates.

In development, flexible configuration is helpful.

In production, missing security configuration should stop the app from starting.

A startup gate turns a hidden runtime weakness into an obvious deployment failure.

The lesson:

Failing to boot is better than booting insecurely.
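
A startup gate can be as small as the sketch below. The variable names, the minimum secret length, and the APP_ENV switch are assumptions for illustration, not the products' real settings.

```python
from typing import List, Mapping

# Hypothetical names for the settings mentioned above.
REQUIRED_IN_PRODUCTION = (
    "SECRET_KEY",
    "APP_ENCRYPTION_KEY",
    "WEBHOOK_SECRET",
    "CAPTCHA_SECRET",
    "REDIS_URL",
)
MIN_SECRET_KEY_LENGTH = 32  # hypothetical threshold


def config_problems(env: Mapping[str, str]) -> List[str]:
    """Return every security relevant configuration problem found."""
    problems = [
        f"{name} is not set" for name in REQUIRED_IN_PRODUCTION if not env.get(name)
    ]
    secret = env.get("SECRET_KEY", "")
    if secret and len(secret) < MIN_SECRET_KEY_LENGTH:
        problems.append("SECRET_KEY is too short")
    if env.get("SESSION_COOKIE_SECURE", "").lower() != "true":
        problems.append("SESSION_COOKIE_SECURE must be true")
    return problems


def startup_gate(env: Mapping[str, str]) -> None:
    """Refuse to start in production when any check fails."""
    if env.get("APP_ENV") != "production":
        return  # development keeps its flexibility
    problems = config_problems(env)
    if problems:
        # Report everything at once so one deploy attempt reveals all gaps.
        raise RuntimeError("refusing to start: " + "; ".join(problems))
```

Calling startup_gate(os.environ) before the application begins serving requests turns each missing setting into a failed deploy with a readable message.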

Finding 9: Rate Limiting Should Not Silently Degrade

Rate limiting is often treated as a nice to have, but for authentication and abuse prevention it is a security control.

If Redis is unavailable and the system silently falls back to per process memory, each worker keeps its own counter, so the effective limit multiplies with the number of workers.

For example, a limit of 10 attempts may effectively become 40 attempts across four workers.

The production behavior was hardened so that security sensitive rate limiting depends on a real shared backend.

The lesson:

A degraded security control should be visible, not silent.
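
The arithmetic is easy to reproduce. The toy limiter below is a simplified stand-in with no time window. The class and function names are invented for the demonstration, and requests are spread round robin across workers the way a load balancer might.

```python
from collections import defaultdict
from typing import DefaultDict, List


class InProcessLimiter:
    """A counter that lives in one worker's memory."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts: DefaultDict[str, int] = defaultdict(int)

    def allow(self, key: str) -> bool:
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


def allowed_attempts(workers: List[InProcessLimiter], attempts: int, key: str) -> int:
    """Send attempts round robin across workers and count how many pass."""
    return sum(workers[i % len(workers)].allow(key) for i in range(attempts))


# Four workers, each with its own memory: the limit of 10 applies per worker.
isolated = [InProcessLimiter(limit=10) for _ in range(4)]
print(allowed_attempts(isolated, attempts=100, key="login:203.0.113.7"))  # 40

# Four workers sharing one counter, as a shared backend would provide.
shared = InProcessLimiter(limit=10)
print(allowed_attempts([shared] * 4, attempts=100, key="login:203.0.113.7"))  # 10
```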

Finding 10: Regression Tests Are Part of the Fix

Every meaningful finding received a named regression test.

That mattered.

A test named after a security issue tells future maintainers why the behavior exists. It also prevents a fix from being accidentally removed during a refactor.

Examples of the test coverage included:

Tenant data cannot be accessed across organizations.
Platform scoped routes require platform staff.
Unsigned webhooks are rejected.
Invalid webhook signatures are rejected.
Unsafe webhook URLs are rejected.
SVG logo uploads are rejected.
Weak production secrets fail startup.
Legacy plaintext tokens can be rejected after migration.

The lesson:

If a security issue was important enough to fix, it is important enough to test.
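
As an illustration of the naming idea, a regression test can carry the finding in its class name and docstring. The function under test here is a tiny hypothetical stand-in, and the naming scheme is an assumption rather than the exact convention used in these repositories.

```python
import os
import unittest

ALLOWED_LOGO_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # hypothetical allow list


def is_allowed_logo_extension(filename: str) -> bool:
    """Stand-in for the real upload check."""
    return os.path.splitext(filename)[1].lower() in ALLOWED_LOGO_EXTENSIONS


class SvgLogoRegressionTest(unittest.TestCase):
    """Regression for the logo upload finding.

    SVG logos can carry active content, so they must stay rejected.
    Do not relax this without a security review.
    """

    def test_svg_logo_upload_is_rejected(self) -> None:
        self.assertFalse(is_allowed_logo_extension("logo.svg"))
        self.assertFalse(is_allowed_logo_extension("LOGO.SVG"))

    def test_png_logo_upload_is_still_accepted(self) -> None:
        self.assertTrue(is_allowed_logo_extension("logo.PNG"))
```

The second test matters as much as the first: it shows the fix did not break the legitimate path.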

Finding 11: Static Guards Catch What Tests Miss

Tests are excellent for specific behavior.

Static lint is better for broad architectural patterns.

Both repositories now have security lint guards for risky patterns:

Tenant owned queries without organization scope.
Raw outbound requests using user controlled URLs.
State changing public routes without auth or signature classification.
Global platform routes protected only by tenant roles.

The linter supports a baseline, which is important for mature codebases.

A baseline allows known accepted cases to remain documented while CI fails only on new violations. That keeps the guard practical instead of noisy.

The preferred workflow is simple:

Fix the violation.
Add a clearly marked exception only when intentional.
Update the baseline only when the exception is reviewed and accepted.

The lesson:

The best time to add a guard is right after fixing the bug class.

That is when the pattern is fresh, the risk is understood, and the team knows what should never happen again.

Finding 12: Some Fixes Require Operational Discipline

Not every security fix ends with a commit.

Some work has to happen during deployment.

Examples include coordinating breaking integration changes, setting required production secrets, running database migrations, inspecting quarantined records, re encrypting stored tokens, enabling flags that reject legacy formats, scrubbing sensitive files from Git history, and running full test suites before release.

These steps were intentionally left as human controlled actions because they affect production data, integration behavior, and deployment timing.

That is part of responsible hardening.

The code can be ready before production is ready.

The lesson:

A remediation branch is not deployable until the operational checklist is complete.

What Changed by the End

Across both products, the hardening pass produced:

24 total commits.
More than 140 new regression tests.
Database migrations.
Centralized webhook verification.
Centralized SSRF safe fetch behavior.
Fail closed production startup gates.
Platform staff authorization.
Audit logging for sensitive actions.
Versioned token encryption.
Tenant isolation checks.
Security lint guards enforced in CI.
Main branches left untouched for review.

More importantly, the products now have a stronger security posture.

The fixes were not just scattered patches. They became reusable boundaries.

Tenant data access now has stronger conventions.

Platform operations now have a separate trust model.

Webhook verification now follows a consistent pattern.

Outbound URL fetching now has a safer path.

Production misconfiguration now fails early.

Dangerous patterns now have CI visibility.

Final Takeaway

The most valuable security work is not only fixing what was found.

It is asking what the finding reveals about the system.

A good remediation process should answer several questions:

What failed?
Where else could it fail?
What is the correct shared pattern?
How do we test the fix?
How do we prevent the class from returning?
What must be true before deployment?

That mindset turns a security report into a stronger codebase.

Patching closes issues.

Hardening changes the system.
