Hardening Two Multi-Tenant SaaS APIs
What We Found, What We Fixed, and What Changed
Security hardening is not glamorous work.
Most of it is careful reading, careful verification, uncomfortable edge cases, and refusing to trust assumptions that have quietly become part of the system.
Recently, I completed a remediation pass across two multi-tenant SaaS products: Site2CRM and Made4Founders. Both products had grown into real platforms with authenticated dashboards, public webhooks, billing flows, CRM integrations, OAuth-style connections, file uploads, background jobs, and customer-owned data.
That kind of product has a wide attack surface.
The work started with security reports and ended with two hardened branches, 24 total commits, more than 140 new regression tests, database migrations, centralized security utilities, audit logging, startup gates, and CI checks that now block the same classes of mistakes from coming back.
This article is a breakdown of what we found, how we fixed it, and the engineering lessons that came out of the process.
The Goal Was Not Just to Close Findings
A security report usually arrives as a list of issues.
That can make the work feel transactional.
Fix this endpoint.
Add this check.
Reject this payload.
Patch this route.
That is necessary, but it is not enough.
The real question is:
What class of mistake allowed this bug to exist?
That question changed the remediation strategy.
For every finding, the goal became clear:
- Fix the specific issue.
- Add a regression test for that finding.
- Centralize the security pattern where possible.
- Add a guard so the same mistake is harder to reintroduce.
- Document any deployment or data migration steps clearly.
That approach turned the work from a cleanup pass into a hardening pass.
Finding 1: Tenant Data Must Always Be Scoped by Tenant
In a multi-tenant application, the most important security rule is simple:
A customer should only be able to access data that belongs to their organization.
That rule sounds obvious, but enforcing it consistently is where systems get tested.
A risky query often looks harmless:
lead = db.query(Lead).filter(Lead.id == lead_id).first()
The problem is that lead_id alone is not a tenant boundary.
In a multi-tenant system, the query needs to prove both identity and ownership:
lead = (
    db.query(Lead)
    .filter(
        Lead.id == lead_id,
        Lead.organization_id == current_user.organization_id,
    )
    .first()
)
The same principle applies to updates, deletes, dashboard feeds, background jobs, calendar items, configuration records, and integration data.
In one case, a calendar feed was using a join path that could leak data across tenants. In another, a vault configuration model had rows that were not safely attached to an organization. Both issues were fixed by making organization ownership explicit, adding migrations where needed, and creating regression coverage around the exact failure modes.
The broader lesson was this:
Tenant isolation should not depend on developer memory.
It should be a pattern the codebase makes easy and the test suite actively defends.
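One way to make the scoped lookup the easy path is to put it behind a single accessor that cannot be called without an organization ID. The sketch below is illustrative only: `LeadRepository` and `get_for_org` are hypothetical names, and an in-memory list stands in for the real database session so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Lead:
    id: int
    organization_id: int
    name: str


class LeadRepository:
    """Hypothetical in-memory stand-in for a database-backed repository."""

    def __init__(self, leads: Iterable[Lead]):
        self._leads = list(leads)

    def get_for_org(self, lead_id: int, organization_id: int) -> Optional[Lead]:
        # Both conditions must match. A lead that exists in another tenant
        # is reported exactly like a lead that does not exist at all.
        for lead in self._leads:
            if lead.id == lead_id and lead.organization_id == organization_id:
                return lead
        return None


repo = LeadRepository([
    Lead(id=1, organization_id=10, name="Acme lead"),
    Lead(id=2, organization_id=20, name="Globex lead"),
])
```

With this shape, a caller in organization 10 asking for lead 2 gets `None`, the same answer it would get for an ID that was never issued.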
Finding 2: Tenant Authority Is Not Platform Authority
One of the clearest authorization lessons came from a global resource protected by a tenant-level role check.
Most SaaS products have organization roles like:
- OWNER
- ADMIN
- USER
Those roles are useful inside a customer account. An organization owner should be able to manage users, settings, forms, integrations, and billing details for that organization.
But tenant authority is not platform authority.
A customer OWNER should not be able to manage global platform resources, issue marketplace codes, access internal tools, or make changes that affect other organizations.
The risky pattern was conceptually simple:
if current_user.role not in ["OWNER", "ADMIN"]:
    raise HTTPException(status_code=403)
That check answers the wrong question.
It asks:
Is this user powerful inside their own organization?
For global platform actions, the application needs to ask a different question:
Is this user actually trusted to operate the platform itself?
The fix was to introduce a real platform staff boundary and move global operations behind that boundary:
require_platform_staff(current_user)
This was not just a one-line fix. Tests were added to prove that tenant owners and tenant admins could not access platform-scoped functionality. A static guard was also added so that future routes touching global tables cannot quietly be protected only by tenant roles.
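As a rough illustration of what such a boundary can look like, here is a minimal sketch. Only the name `require_platform_staff` comes from the fix itself; the `is_platform_staff` flag, the `User` shape, and the `Forbidden` exception (standing in for an HTTP 403) are assumptions made for the example.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Stand-in for an HTTP 403 response in this sketch."""


@dataclass(frozen=True)
class User:
    id: int
    organization_id: int
    role: str                        # tenant role: OWNER, ADMIN, or USER
    is_platform_staff: bool = False  # hypothetical flag, separate from role


def require_platform_staff(user: User) -> None:
    # Deliberately ignores user.role: being powerful inside one
    # organization says nothing about being trusted to run the platform.
    if not user.is_platform_staff:
        raise Forbidden("platform staff required")
```

The design point is that the check never reads the tenant role, so promoting someone to OWNER inside a customer account cannot widen platform access.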
The lesson:
Never reuse customer roles for platform administration.
They represent different trust models.
Finding 3: Public Webhooks Are Public, Not Trusted
Public webhook endpoints need to be reachable by external providers.
That does not mean they should trust incoming requests.
Several webhook surfaces needed stronger sender verification. The risk was not that the endpoints existed. The risk was that state-changing payloads could be processed without proving they came from the provider they claimed to represent.
The corrected model is straightforward:
- Read the raw request body.
- Verify the provider signature.
- Reject missing or invalid signatures.
- Parse the payload only after verification.
- Apply idempotency where relevant.
- Mutate state.
- Return success.
This mattered for provider events such as inbound messaging events, email bounce and complaint notifications, and billing lifecycle events.
Billing webhooks received extra attention because they can change plan state. A forged billing event can potentially activate, cancel, downgrade, or otherwise alter customer access.
The fix was to make verification mandatory. If the provider webhook secret or webhook ID is missing in production, the app should fail closed. If the signature is invalid, the route should reject the request instead of logging the error and continuing.
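The verification order described above can be sketched with the standard library. This assumes a provider that signs the raw body with HMAC-SHA256 and sends the hex digest in a header; real providers differ in header names, encodings, and timestamp handling, so their documentation is the authority. `handle_webhook`, `InvalidSignature`, and the in-memory idempotency set are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional


class InvalidSignature(Exception):
    pass


# In-memory only for the sketch; a real deployment needs shared storage.
_seen_event_ids = set()


def handle_webhook(raw_body: bytes, signature_header: Optional[str], secret: bytes) -> str:
    if not secret:
        # Fail closed: never process events when no secret is configured.
        raise RuntimeError("webhook secret is not configured")
    if not signature_header:
        raise InvalidSignature("missing signature")

    # Verify against the raw bytes, before any JSON parsing.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature_header.encode()):
        raise InvalidSignature("signature mismatch")

    event = json.loads(raw_body)
    event_id = event["id"]  # assumed field name
    if event_id in _seen_event_ids:
        return "duplicate"  # idempotent: acknowledge without reprocessing
    _seen_event_ids.add(event_id)
    # State mutation would happen here, after verification succeeded.
    return "processed"
```

Comparing with `hmac.compare_digest` instead of `==` avoids leaking how many leading characters of the signature matched.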
The lesson:
A webhook endpoint can be public without being trusted.
Reachability and trust are separate things.
Finding 4: External Install Identifiers Are Not Authentication
One integration trusted a client-supplied install identifier as if it were a credential.
That is dangerous.
Install IDs, instance IDs, account IDs, and external resource IDs are identifiers. They are not proof that the caller owns the installation.
The hardened approach was to require a signed provider token, verify it server-side, and only then extract the canonical installation identity from the verified payload.
The corrected flow became clear:
- Receive signed provider instance token.
- Verify the token using the provider app secret.
- Extract the canonical instance ID from the verified payload.
- Resolve the install record from that verified identity.
- Refuse cross-organization rebinding unless ownership is proven.
This fixed both the authentication issue and the related install rebinding risk.
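To make the flow concrete, here is a sketch under a stated assumption about the token format: `base64url(signature).base64url(json_payload)`, signed with HMAC-SHA256 using the app secret. That format, the `instance_id` field, and every function name below are illustrative assumptions; the actual provider's documented format should be used instead.

```python
import base64
import hashlib
import hmac
import json


class InvalidToken(Exception):
    pass


class RebindRefused(Exception):
    pass


def _b64url_decode(value: str) -> bytes:
    # Restore the padding that URL-safe tokens usually strip.
    return base64.urlsafe_b64decode(value + "=" * (-len(value) % 4))


def verify_instance_token(token: str, app_secret: bytes) -> str:
    """Return the instance ID only if the signature checks out."""
    try:
        sig_part, payload_part = token.split(".", 1)
        signature = _b64url_decode(sig_part)
    except ValueError as exc:
        raise InvalidToken("malformed token") from exc
    expected = hmac.new(app_secret, payload_part.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise InvalidToken("bad signature")
    # The payload is parsed only after the signature has been verified.
    return json.loads(_b64url_decode(payload_part))["instance_id"]


def bind_install(installs: dict, instance_id: str, organization_id: int) -> None:
    """Bind a verified instance to an organization, refusing silent rebinds."""
    owner = installs.get(instance_id)
    if owner is not None and owner != organization_id:
        raise RebindRefused("install already belongs to another organization")
    installs[instance_id] = organization_id
```

The caller never passes an instance ID directly; the only way to obtain one is through `verify_instance_token`.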
The lesson:
Do not authenticate integrations with bare IDs.
Use provider-signed assertions, server-side verification, and explicit ownership rules.
Finding 5: User-Controlled URLs Need SSRF Protection
Several features across the two products involved server-side requests to external URLs.
That pattern is common:
- Outbound webhooks
- RSS imports
- Social integrations
- Callback URLs
- Remote media fetches
- Health checks
The security risk is SSRF, or Server-Side Request Forgery.
SSRF occurs when a user can influence a URL that the server requests. Without guardrails, an attacker may be able to point the server at internal infrastructure.
Examples include localhost, private network addresses, cloud metadata services, link-local addresses, and internal admin panels.
The fix was to centralize outbound fetching through a safe fetch utility.
That utility blocks dangerous destinations, applies timeouts, restricts redirects, limits response size, and prevents internal response bodies from being reflected back to users.
The important part was centralization.
Instead of asking every route to remember SSRF rules, risky outbound requests now go through one safer path.
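The destination check at the heart of such a utility can be sketched with the standard library. This is not the utility from the hardening pass, only an illustration: `assert_safe_url` and `UnsafeDestination` are hypothetical names, and timeouts, redirect handling, and response size limits are left out.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


class UnsafeDestination(Exception):
    pass


def assert_safe_url(url: str, resolve=socket.getaddrinfo) -> None:
    """Raise UnsafeDestination unless every resolved address is public."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise UnsafeDestination("scheme not allowed")
    if not parts.hostname:
        raise UnsafeDestination("missing host")
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = resolve(parts.hostname, port, type=socket.SOCK_STREAM)
    except (OSError, ValueError) as exc:
        raise UnsafeDestination("host or port could not be resolved") from exc
    for info in infos:
        # Drop any IPv6 scope suffix such as "%eth0" before parsing.
        address = info[4][0].split("%", 1)[0]
        # is_global is False for loopback, private, link-local (including
        # the 169.254.169.254 metadata address), and reserved ranges.
        if not ipaddress.ip_address(address).is_global:
            raise UnsafeDestination(f"{address} is not a public address")
```

A check like this only helps if the subsequent request connects to the address that was validated and every redirect target is checked again; otherwise a second DNS lookup or a redirect can still land on an internal host.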
The lesson:
Any feature that lets a user provide a URL should be treated as a network boundary.
Finding 6: Cryptography Should Not Have Production Fallbacks
Development conveniences can become production vulnerabilities if they are allowed to survive past local use.
The hardening pass removed weak secret fallbacks and hardcoded encryption key fallbacks. Token encryption was versioned, and a re-encryption path was added so legacy stored values could be migrated safely.
The improved model included several important rules:
- No hardcoded production fallback keys.
- An explicit application encryption key requirement.
- A versioned encrypted token format.
- A migration script for legacy tokens.
- A production flag to reject legacy plaintext tokens after migration.
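Two of these rules, the explicit key requirement and the flag that rejects legacy values, can be sketched without any cipher code. The `enc:v1:` prefix, the `APP_ENCRYPTION_KEY` variable name, and the minimum key length are assumptions made for illustration, and the actual encryption and decryption steps are intentionally omitted.

```python
from typing import Mapping

TOKEN_PREFIX_V1 = "enc:v1:"  # hypothetical version marker
MIN_KEY_LENGTH = 32          # assumed minimum, not a recommendation


class LegacyTokenRejected(Exception):
    pass


def load_encryption_key(env: Mapping[str, str]) -> bytes:
    # No fallback: a missing or short key is a hard error, never a default.
    key = env.get("APP_ENCRYPTION_KEY", "")
    if len(key) < MIN_KEY_LENGTH:
        raise RuntimeError("APP_ENCRYPTION_KEY is missing or too short")
    return key.encode()


def classify_stored_token(stored: str, reject_legacy: bool) -> str:
    """Decide how a stored value should be handled based on its prefix."""
    if stored.startswith(TOKEN_PREFIX_V1):
        return "v1"
    if reject_legacy:
        # After migration, unversioned values are treated as errors.
        raise LegacyTokenRejected("unversioned token found after migration")
    return "legacy"  # still allowed while the migration is in progress
```

The version prefix is what makes a later key rotation or algorithm change possible without guessing how an old value was stored.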
The lesson:
Cryptographic failure should be loud.
If the key is missing in production, the app should not quietly invent one.
Finding 7: File Uploads Should Stay Boring
File uploads are easy to underestimate.
For brand logos and media uploads, the safest approach was to restrict formats and improve storage behavior.
The hardening changes included rejecting risky file types where they were unnecessary, using high-entropy media names, enforcing body size limits, handling chunked uploads safely, and avoiding active content in logo uploads.
One practical example was SVG.
SVG can be useful, but it can also contain active content. If served from the wrong origin or with weak headers, it can become a stored XSS risk.
XSS means Cross-Site Scripting. It occurs when attacker-controlled content executes JavaScript in a trusted browser context.
For a logo upload feature, SVG was not worth the additional risk.
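A minimal sketch of an allowlist approach for logos, assuming only PNG and JPEG are needed. The size limit, the function name, and the choice of formats are illustrative assumptions. Checking leading bytes is a coarse filter, not full image validation.

```python
import secrets

# Leading bytes of the only formats this sketch accepts.
_ALLOWED_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": ".png",
    b"\xff\xd8\xff": ".jpg",
}
MAX_LOGO_BYTES = 2 * 1024 * 1024  # assumed limit


class UploadRejected(Exception):
    pass


def storage_name_for_logo(data: bytes) -> str:
    """Return a random storage name, or reject the upload."""
    if len(data) > MAX_LOGO_BYTES:
        raise UploadRejected("file too large")
    for signature, extension in _ALLOWED_SIGNATURES.items():
        if data.startswith(signature):
            # The stored name comes from the server, never from the client.
            return secrets.token_urlsafe(32) + extension
    # SVG, HTML, and anything else unrecognized ends up here.
    raise UploadRejected("unsupported image format")
```

Because the decision is based on content rather than the client-supplied filename or MIME type, renaming `logo.svg` to `logo.png` does not get it through.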
The lesson:
If a product only needs images, do not accept formats that behave like documents or code.
Finding 8: Security-Relevant Configuration Should Fail Closed
Some risks were not about code paths. They were about missing configuration.
Examples included missing webhook secrets, missing encryption keys, missing CAPTCHA secrets, missing Redis for rate limiting, weak application secret keys, and insecure production cookie settings.
The fix was to add production startup gates.
In development, flexible configuration is helpful.
In production, missing security configuration should stop the app from starting.
A startup gate turns a hidden runtime weakness into an obvious deployment failure.
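A startup gate can be as small as one function called before the app begins serving traffic. The variable names, the minimum secret length, and the `ENVIRONMENT` switch below are assumptions for the sketch, not the products' real configuration keys.

```python
from typing import Mapping

REQUIRED_IN_PRODUCTION = (
    "SECRET_KEY",
    "APP_ENCRYPTION_KEY",
    "WEBHOOK_SECRET",
    "CAPTCHA_SECRET",
    "REDIS_URL",
)
MIN_SECRET_LENGTH = 32  # assumed threshold


class InsecureConfiguration(Exception):
    pass


def enforce_startup_gate(env: Mapping[str, str]) -> None:
    """Raise before serving traffic if production config is unsafe."""
    if env.get("ENVIRONMENT") != "production":
        return  # development stays flexible
    problems = [f"{name} is not set" for name in REQUIRED_IN_PRODUCTION
                if not env.get(name)]
    secret = env.get("SECRET_KEY", "")
    if secret and len(secret) < MIN_SECRET_LENGTH:
        problems.append("SECRET_KEY is too short")
    if env.get("SESSION_COOKIE_SECURE", "").lower() != "true":
        problems.append("SESSION_COOKIE_SECURE must be true")
    if problems:
        # Report everything at once so one deploy attempt surfaces all gaps.
        raise InsecureConfiguration("; ".join(problems))
```

In a real entry point this would be called with `os.environ`. One caveat with this shape: a mistyped `ENVIRONMENT` value would skip the gate, so the production marker itself deserves the same care.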
The lesson:
Failing to boot is better than booting insecurely.
Finding 9: Rate Limiting Should Not Silently Degrade
Rate limiting is often treated as a nice-to-have, but for authentication and abuse prevention it is a security control.
If Redis is unavailable and the system silently falls back to per-process memory, limits become weaker under multiple workers.
For example, a limit of 10 attempts may effectively become 40 attempts across four workers, because each worker keeps its own counter.
The production behavior was hardened so that security-sensitive rate limiting depends on a real shared backend.
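The arithmetic is easy to reproduce with a toy simulation. The limiter below is a deliberately simplified stand-in (no time window, no Redis), and the round-robin dispatch is an assumption about how requests reach workers.

```python
class CountingLimiter:
    """Toy limiter: allows `limit` attempts per key, with no expiry."""

    def __init__(self, limit: int):
        self.limit = limit
        self.counts = {}

    def allow(self, key: str) -> bool:
        count = self.counts.get(key, 0)
        if count >= self.limit:
            return False
        self.counts[key] = count + 1
        return True


def accepted_attempts(limiters, attempts: int, key: str = "login:alice") -> int:
    """Send attempts round-robin across limiters and count the accepted ones."""
    accepted = 0
    for i in range(attempts):
        if limiters[i % len(limiters)].allow(key):
            accepted += 1
    return accepted


shared_backend = [CountingLimiter(10)]                     # one shared counter
per_process = [CountingLimiter(10) for _ in range(4)]      # one counter per worker

shared_total = accepted_attempts(shared_backend, 100)      # 10
per_process_total = accepted_attempts(per_process, 100)    # 40
```

The configured limit is identical in both cases; only the place where the counter lives changes the number of attempts an attacker actually gets.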
The lesson:
A degraded security control should be visible, not silent.
Finding 10: Regression Tests Are Part of the Fix
Every meaningful finding received a named regression test.
That mattered.
A test named after a security issue tells future maintainers why the behavior exists. It also prevents a fix from being accidentally removed during a refactor.
Examples of the test coverage included:
- Tenant data cannot be accessed across organizations.
- Platform-scoped routes require platform staff.
- Unsigned webhooks are rejected.
- Invalid webhook signatures are rejected.
- Unsafe webhook URLs are rejected.
- SVG logo uploads are rejected.
- Weak production secrets fail startup.
- Legacy plaintext tokens can be rejected after migration.
The lesson:
If a security issue was important enough to fix, it is important enough to test.
Finding 11: Static Guards Catch What Tests Miss
Tests are excellent for specific behavior.
Static lint is better for broad architectural patterns.
Both repositories now have security lint guards for risky patterns:
- Tenant-owned queries without organization scope.
- Raw outbound requests using user-controlled URLs.
- State-changing public routes without auth or signature classification.
- Global platform routes protected only by tenant roles.
The linter supports a baseline, which is important for mature codebases.
A baseline allows known accepted cases to remain documented while CI fails only on new violations. That keeps the guard practical instead of noisy.
The preferred workflow is simple:
- Fix the violation.
- Add a clearly marked exception only when intentional.
- Update the baseline only when the exception is reviewed and accepted.
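To show how a baseline-aware guard can work, here is a small sketch built on Python's `ast` module. It covers only one rule (raw outbound HTTP calls) and is not the linter from either repository; the function names and the finding format are hypothetical.

```python
import ast

# Calls that should go through the safe fetch path instead.
RAW_HTTP_CALLS = {
    ("requests", "get"), ("requests", "post"),
    ("httpx", "get"), ("httpx", "post"),
}


def find_raw_http_calls(source: str, filename: str) -> set:
    """Return findings such as 'app/feeds.py:requests.get'."""
    findings = set()
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and (node.func.value.id, node.func.attr) in RAW_HTTP_CALLS
        ):
            findings.add(f"{filename}:{node.func.value.id}.{node.func.attr}")
    return findings


def new_violations(findings: set, baseline: set) -> list:
    # Only findings missing from the reviewed baseline should fail CI.
    return sorted(findings - baseline)
```

Keying findings by file and call rather than by line number keeps the baseline stable when code moves, at the cost of not noticing a second identical call in an already baselined file. A production guard would likely want a finer key, such as the enclosing function name.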
The lesson:
The best time to add a guard is right after fixing the bug class.
That is when the pattern is fresh, the risk is understood, and the team knows what should never happen again.
Finding 12: Some Fixes Require Operational Discipline
Not every security fix ends with a commit.
Some work has to happen during deployment.
Examples include coordinating breaking integration changes, setting required production secrets, running database migrations, inspecting quarantined records, re-encrypting stored tokens, enabling flags that reject legacy formats, scrubbing sensitive files from Git history, and running full test suites before release.
These steps were intentionally left as human-controlled actions because they affect production data, integration behavior, and deployment timing.
That is part of responsible hardening.
The code can be ready before production is ready.
The lesson:
A remediation branch is not deployable until the operational checklist is complete.
What Changed by the End
Across both products, the hardening pass produced:
- 24 total commits.
- More than 140 new regression tests.
- Database migrations.
- Centralized webhook verification.
- Centralized SSRF-safe fetch behavior.
- Fail-closed production startup gates.
- Platform staff authorization.
- Audit logging for sensitive actions.
- Versioned token encryption.
- Tenant isolation checks.
- Security lint guards enforced in CI.
- Main branches left untouched for review.
More importantly, the products now have better security shape.
The fixes were not just scattered patches. They became reusable boundaries.
Tenant data access now has stronger conventions.
Platform operations now have a separate trust model.
Webhook verification now follows a consistent pattern.
Outbound URL fetching now has a safer path.
Production misconfiguration now fails early.
Dangerous patterns now have CI visibility.
Final Takeaway
The most valuable security work is not only fixing what was found.
It is asking what the finding reveals about the system.
A good remediation process should answer several questions:
- What failed?
- Where else could it fail?
- What is the correct shared pattern?
- How do we test the fix?
- How do we prevent the class from returning?
- What must be true before deployment?
That mindset turns a security report into a stronger codebase.
Patching closes issues.
Hardening changes the system.