We Read a French Health-Tech Giant's Open-Source AI Pipeline Next to Its Paper. And...

— Originally published at flamehaven.space

When a Health-AI Release Becomes a Governance Surface

Biotech and health-AI didn't slow down in 2026.

By June, the Flamehaven team's tracking of disclosed biotechnology equity rounds showed 113 venture and strategic financings. Roughly $17.0 billion deployed, year to date.[1] Therapeutics platforms still take most of the deal count. But AI-native infrastructure (model platforms, data pipelines, diagnostic engines) is eating a bigger slice of every quarter than the one before it.

Narrow that down to France and the picture sharpens.

AI made up 62% of the €8.2 billion raised by French tech ventures in 2025. The country now counts more than 750 AI-focused companies.[2] Healthcare is one of the best-funded verticals inside that boom. Paris Region healthcare startups alone pulled in more than €3 billion over the past three years, the largest healthcare VC total of any region in the EU.[3]

None of that money showed up by accident. France 2030 put €54 billion behind it on purpose. A public Health Data Hub, built on real hospital EHR data. Compute poured into the Jean Zay supercomputer cluster. All of it built so French health-AI companies could train and publish without leaning on American infrastructure.[4]

Doctolib runs the appointment-booking and teleconsultation platform that roughly 80 million patients across France, Germany, Italy, and the Netherlands use to see a doctor. €348 million in annual revenue. 2,900 employees.[5] This is the kind of company whose AI releases get read by hospital procurement officers before anyone in machine learning even opens the GitHub link.

On June 20, 2026, Doctolib's research lab published exactly that kind of release. A paper.[7] Two open models on HuggingFace. A public GitHub repository.[6] All dropped the same day: the paper and models under open licenses, and the code as a public repository describing a pipeline for building French medical language models from web data, paired with a public dataset called FineMed.

The Flamehaven team audits AI repositories in bio and health-adjacent domains for a living. Mostly that means reading a README against a test suite, writing down where they disagree, and moving to the next one. This time the team did not move on. One sentence in the paper kept refusing to match what the repository actually does. Pulling on that thread led through the code, into the company's own commercial position, and out the other side into three explanations that remain plausible from the public evidence, though not equally supported.

Here's what the team found, in the order they found it.


What's Actually Sitting in the Repository

Before suspecting anything, you have to know what you're looking at. So start with the boring pass first.

The repository is genuinely well-organized. The README lays out three clean stages (acquisition, classification, and synthesis), each backed by a matching .slurm job-runner script. Pydantic schemas enforce structure on every LLM output. The tokenizer chain runs end to end: SentencePiece training, vocabulary trimming, fertility checks that actually pass. Across thirty-one Python files, zero AST parse errors. Zero bare except: blocks quietly swallowing exceptions. This is not amateur code, and it would be dishonest to pretend otherwise.
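For readers who want to repeat that kind of boring pass on any Python codebase, here is a minimal sketch of the two static checks named above: does each file parse, and does it contain a bare except: handler. The function name audit_source and the sample strings are illustrative assumptions, not the team's actual tooling.

```python
import ast


def audit_source(source: str, filename: str = "<memory>") -> dict:
    """Return the parse status and line numbers of bare `except:` handlers."""
    try:
        tree = ast.parse(source, filename=filename)
    except SyntaxError as exc:
        return {"parse_error": f"{exc.msg} (line {exc.lineno})", "bare_excepts": []}
    bare = [
        node.lineno
        for node in ast.walk(tree)
        # An ExceptHandler with no exception type is a bare `except:`.
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]
    return {"parse_error": None, "bare_excepts": sorted(bare)}


# Hypothetical inputs standing in for files read from disk.
clean = "try:\n    x = 1\nexcept ValueError:\n    x = 0\n"
sloppy = "try:\n    x = 1\nexcept:\n    pass\n"
broken = "def f(:\n    pass\n"

for label, text in [("clean", clean), ("sloppy", sloppy), ("broken", broken)]:
    print(label, audit_source(text))
```

In practice one would loop this over every .py file under the repository root. A clean result says the code is syntactically sound, not that it runs.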

But sitting underneath that competence are a few things that look like exactly what they are, the kind of thing every engineer ships at 2 a.m. and means to fix later:

  • No LICENSE file in the code repository, even though both released models carry Apache-2.0 on HuggingFace.
  • A real reproducibility bug. The parameter p2_short_floor defaults to 128 through the main pipeline entrypoint. Call the underlying worker function directly, which is exactly what happens when a researcher debugs one file in isolation, and it silently defaults to 256 instead. No error. No warning. Just a different training corpus depending on which door you walked through, with the two conflicting defaults sitting 234 lines apart. That matters beyond aesthetics: if the corpus produced by one path differs from the corpus produced by the other, and the paper does not specify which invocation path was used, then a third party working from the public repository alone cannot independently verify the benchmark results in Tables 5 and 6.
  • An undocumented dependency wall. Try to actually run this and convert_to_mds.py --help dies on ModuleNotFoundError: No module named 'snappy', a C-extension dependency three layers deep that the README never mentions. llm_generate.py --help fails the same way on vllm, because the synthesis pipeline that built the training corpus has no CPU fallback and needs a GPU the docs do not flag as mandatory.
  • Multiple hardcoded absolute file paths, including active ones, scattered across six files and pointing straight at Doctolib's workspace on the Jean Zay supercomputer cluster. Not commented out. Live.
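The dual-default item is the easiest of these to underestimate, so here is a minimal sketch of the pattern. Only the parameter name p2_short_floor and the values 128 and 256 come from the audit. The function names, and the assumption that the parameter acts as a minimum-length cutoff, are hypothetical and chosen purely for illustration.

```python
def filter_worker(docs, p2_short_floor=256):
    """Hypothetical worker: keep documents at least `p2_short_floor` characters long."""
    return [d for d in docs if len(d) >= p2_short_floor]


def main_entrypoint(docs, p2_short_floor=128):
    """Hypothetical entrypoint: forwards its own default down to the worker."""
    return filter_worker(docs, p2_short_floor=p2_short_floor)


docs = ["a" * 100, "b" * 200, "c" * 300]

via_entrypoint = main_entrypoint(docs)  # cutoff of 128 applies
via_worker = filter_worker(docs)        # cutoff of 256 applies

# Same input, same code, different corpus size depending on the door used.
print(len(via_entrypoint), len(via_worker))
```

Neither call raises or warns. The only visible symptom is a corpus of a different size, which is why this class of bug tends to survive until someone compares outputs across invocation paths.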

The team ran the dependency commands directly on a real Python environment. Both failed on first invocation — snappy missing before mosaicml-streaming could initialize, vllm missing immediately on import. Neither dependency appears as a prerequisite in the README. That is not a gap caught in review; it is a documented failure from execution.
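A failure like this can be surfaced before any script is invoked. The sketch below probes for importable top-level modules with the standard library only. The module names snappy and vllm are the ones reported above, the helper name missing_modules is an assumption, and what the check prints depends entirely on the environment it runs in.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the current environment cannot find."""
    # find_spec returns None for an absent top-level module, without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the two import errors described above.
print(missing_modules(["snappy", "vllm"]))
```

A probe like this only reports whether a module can be located. It says nothing about whether a GPU is present, which is a separate requirement the README would still need to state.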

None of that, on its own, should raise anyone's pulse. Research code ships unpolished constantly. The Flamehaven team has audited far messier repositories sitting under far smaller claims than this one.

What actually stopped the team wasn't any single item on that list. It was what happened once they stopped reading the code on its own terms and started holding it up against the one sentence the paper uses to describe what this thing is supposed to be.


The Sentence, and What's Sitting Underneath It

The paper's contributions list states it as a formal, numbered claim. The phrasing is specific enough to check:

"We release FineMed... together with FineMed-rephrased, and a reproducible curation pipeline with multi-axis annotators."

The repository's own README lands on almost the identical phrase, independently, in its opening paragraph, written for a completely different audience:

"This repository holds the reproducible data-curation pipeline..."

Two documents. Two different readers in mind, one for peer reviewers who will never open a terminal, one for GitHub visitors who will. Both land, unprompted, on the same word. That convergence is what made it worth checking, line by line, whether the code underneath actually earns that word or just borrows it.

Here's what sits directly under the functions the paper names as completed work:

| What the Paper Calls It | Where It Lives in Code | What's Actually There |
|---|---|---|
| §5.1, the "coarse pre-screen" stage | llm_rewrite.py, line 840 | # tmp pre-filter, above the active filtering call |
| §3.3, "style dimensions" | llm_rewrite.py, line 629 | # tmp: v3 (default), on the live default branch |
| §3.1, pipeline orchestration | llm_generate.py, line 13 | # tmp control, above a module-level env-var read |

Three places. Three markers, each one reading, in plain English: not finished yet. And all three sit precisely on the parts of the pipeline that decide whether someone outside Doctolib could rebuild this from raw web data to a trained model. Not on the model architecture. Not on the entity extraction logic. Specifically on the parts that determine whether the process is repeatable by anyone who isn't already inside the building.

There was a fourth candidate the team almost included. The reason it didn't make the cut matters more than the marker itself.

A ## tmp comment in split_document.py looked, at first glance, like it tagged an active filter from the paper's §4.1. So the team didn't take it at face value. They opened the file and checked what the comment was actually sitting on top of:

  • The comment itself: ## tmp: keep only med01 ∧ edu4, which reads exactly like an active filter tag.
  • The lines directly underneath it: a dataset.filter(...) call.
  • The status of that call: every line of it commented out. Not running. Not reachable. Dead.

So the marker was not tagging live code. It was a historical note sitting on top of a block someone had already disabled, the kind of comment that survives a refactor by accident, not a flag on something the pipeline still does. Getting that distinction wrong would have meant reporting four signals instead of three. The difference between those two numbers is the entire reason this kind of checking exists in the first place.
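The live-versus-dead distinction can be approximated mechanically. The sketch below is a heuristic under stated assumptions, not the team's method: it finds comments containing "tmp", then checks whether the next non-blank line is itself a comment. The sample source is invented to mirror the two situations described, and the function name is hypothetical.

```python
import io
import tokenize


def classify_tmp_markers(source: str):
    """Return (line_number, 'live' | 'dead') for each comment containing 'tmp'."""
    lines = source.splitlines()
    results = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.COMMENT or "tmp" not in tok.string.lower():
            continue
        row, col = tok.start
        if lines[row - 1][:col].strip():
            # Code precedes the comment on the same line, so it tags live code.
            results.append((row, "live"))
            continue
        # Otherwise inspect the next non-blank line below the marker.
        following = next((ln.strip() for ln in lines[row:] if ln.strip()), "")
        dead = following.startswith("#") or not following
        results.append((row, "dead" if dead else "live"))
    return results


# Invented sample: one marker above running code, one above a disabled block.
sample = (
    "# tmp pre-filter\n"
    "rows = [r for r in rows if r]\n"
    "\n"
    "## tmp: keep only med01 ∧ edu4\n"
    "# dataset = dataset.filter(\n"
    "#     lambda x: x['med'] >= 1\n"
    "# )\n"
    "done = True\n"
)

print(classify_tmp_markers(sample))
```

The heuristic only looks one line ahead, so a marker followed by an unrelated explanatory comment would be misread as dead. It narrows the list a human has to open and check; it does not replace opening the file.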

Three real markers stayed on the table. Which raised an obvious next question: was this confined to one section that happened to catch the eye, or did it run through the whole paper?


Checking All Twelve Claims, Not Just the One That Looked Bad

So the team went back and mapped every formal contribution the paper makes against the code that's supposed to implement it. All twelve. Each one scored as fully matching, partially matching, or simply impossible to check from outside the company.

Eight came back fully verified:

  • The two-stage annotation architecture
  • The medical-term density thresholds
  • The educational quality thresholds
  • The 8-class medical entity extractor, where active code matches the paper's entity table exactly
  • The multi-dimensional rewriting styles
  • The tokenizer vocabulary alignment
  • The medical-content gating logic
  • The diverse genre/audience sampling

Zero came back contradicted. Nothing the team found anywhere in this codebase disproves anything the paper claims about what the final model does, or how it performs. That needs to be said plainly, because it's easy for it to get lost in everything that follows: this is not a story about a model that doesn't do what it says.

Four claims didn't clear the same bar:

  • "Reproducible curation pipeline" - Partial. The CLI exists. The training script does not appear in the main repository checkout. Three # tmp markers sit on active functions. Multiple hardcoded paths make outside execution materially harder.
  • PII replaced with fictional values - Partial. Not because the team found a contradiction, but because the paper's own Ethics section already acknowledges that no post-hoc audit of instruction compliance exists.
  • Coarse pre-screen filter - Partial, for the same # tmp reason as above.
  • 3-phase, 240-billion-token training run - Unverifiable. There is no training execution script anywhere in the main repository checkout. Not broken. Not incomplete. Absent. Just fragments of sbatch job-submission syntax that name the phases and launch nothing.

Look at where those four land against where the eight verified ones land, because the split isn't random. Every verified claim describes what the system does once it's built. Every non-verified claim describes whether anyone outside Doctolib can rebuild it. Twelve coin flips don't sort themselves into two clean piles by accident.
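For a rough sense of scale, the arithmetic behind that intuition can be written down. The sketch assumes the eight "what it does" and four "can it be rebuilt" labels had been fixed before scoring, and that the four non-verified outcomes fell on claims uniformly at random. Neither assumption strictly holds here, since the grouping was observed after the fact, so read the number as an illustration, not a significance test.

```python
from math import comb

total_claims = 12
non_verified = 4

# Ways to choose which 4 of the 12 claims fail to verify.
arrangements = comb(total_claims, non_verified)

# Only one of those arrangements puts all 4 on the rebuildability claims.
p_clean_split = 1 / arrangements

print(arrangements, round(p_clean_split, 4))
```

Under those assumptions a perfectly clean split would show up by chance about once in 495 tries, roughly 0.2% of the time.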

One thing has to sit next to that finding before it goes any further. The repository's own README states plainly that the actual pretraining code lives in a separate ModernBERT/ Git submodule, credited openly to an outside open-source project, and that submodule simply was not initialized in the checkout the team audited. It is entirely possible the training script being called "absent" here is one git clone --recurse-submodules away from existing. The team flagged this as the single highest-priority item to resolve, and it is surfaced here rather than quietly smoothed over because that is the whole point of doing this work in public.
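Whether a checkout is in that state is cheap to test. The sketch below reads the path entries from a .gitmodules file and reports any submodule directory that is missing or empty, which is what a plain clone without --recurse-submodules leaves behind. The helper name, the temporary directory, and the example.invalid URL are assumptions for the demonstration; only the ModernBERT path comes from the audit.

```python
import re
import tempfile
from pathlib import Path


def uninitialized_submodules(repo_root):
    """List declared submodule paths whose directories are missing or empty."""
    root = Path(repo_root)
    gitmodules = root / ".gitmodules"
    if not gitmodules.is_file():
        return []
    declared = re.findall(
        r"^[ \t]*path[ \t]*=[ \t]*(.+?)[ \t]*$", gitmodules.read_text(), flags=re.M
    )
    return [
        p for p in declared
        if not (root / p).is_dir() or not any((root / p).iterdir())
    ]


# Demonstration on a throwaway directory, not on the audited repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".gitmodules").write_text(
        '[submodule "ModernBERT"]\n'
        "\tpath = ModernBERT\n"
        "\turl = https://example.invalid/ModernBERT.git\n"
    )
    (root / "ModernBERT").mkdir()
    before = uninitialized_submodules(root)
    # Simulate initialization by placing any content in the directory.
    (root / "ModernBERT" / ".git").write_text("gitdir: ../.git/modules/ModernBERT\n")
    after = uninitialized_submodules(root)

print(before, after)
```

An empty result means the declared directories have content. It does not confirm that a training script is among that content, which still has to be checked by hand.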

What that open question doesn't do is erase the other three. Those sit on different files, different functions, and don't depend on a submodule that was never cloned.

One more thing belongs here before moving on. The model cards on HuggingFace do contain explicit intended-use boundaries: both DoctoBERT-fr and DoctoModernBERT-fr are framed as encoders for downstream NLP tasks, not medical devices, and both cards state that their outputs must not drive clinical decisions.[9] That matters. The weaker surface in this release is not the model card. It is the code repository as a reproducibility artifact.


The Company Sitting Underneath the Code

By this point, the audit stops being only a code question.

Doctolib's core business has a ceiling built into it. It's a per-professional SaaS subscription in a market with a finite number of doctors and rising local competition. The company's own numbers point at exactly where it's looking next: roughly €115 million in R&D spend in 2024, an AI consultation assistant already logging more than 2 million consultations, according to its own CFO.[5] DoctoBERT isn't a side project floating apart from that strategy. The paper says, in its own words, that the model was evaluated on "a proprietary clinical NER task from a real-world production setting." This pipeline is already touching a live internal clinical workflow. It's not sitting safely inside an academic benchmark where the stakes are lower.

Which makes the licensing asymmetry worth a second look:

  • Model weights (DoctoBERT-fr, DoctoModernBERT-fr): Apache-2.0 on HuggingFace. Open, unambiguous, free to build on commercially.
  • The code that produced those weights: no license file at all.

Under the Berne Convention, to which France and every other EU member state is a party, copyright attaches automatically, so no license does not mean unlicensed in any casual sense. It means default copyright, all rights reserved, by operation of law. The weights are free to use. The pipeline that built them, legally, is not.

Separately: in November 2025, France's competition authority fined Doctolib €4,665,000 for abusing its dominant position in online medical appointment booking. That says nothing about DoctoBERT directly, but it does confirm that Doctolib is already operating under active regulatory scrutiny as a dominant-market actor — a fact worth holding when reading the governance gaps in the code.

A missing LICENSE file is genuinely one of the most common oversights in software. Most teams that skip it aren't doing it on purpose, and one missing file isn't a smoking gun on its own.

It doesn't sit on its own, though. Line it up against everything else:

  • Three live # tmp markers on exactly the functions that determine reproducibility
  • Multiple hardcoded paths into supercomputer directories only Doctolib has keys to
  • An undocumented dependency chain that breaks on a missing C-extension nobody warned about
  • A training script that, in the public checkout, simply isn't there

Every one of those facts points in the same direction, toward outside reproduction being harder than the paper's own headline sentence claims it is. Not one points the other way. Taken together, they are difficult to dismiss as random noise.

There's a regulatory layer that keeps this from being purely a software-hygiene complaint. Under the EU AI Act as currently written, DoctoBERT likely clears the bar for non-applicability at this stage. The scientific-research exemption and the pre-market exemption both plausibly cover a release shaped like this one.[8] Nothing in the AI Act's high-risk categories cleanly captures a general medical encoder being fine-tuned downstream for NER. That's a defensible legal position today, and nothing here disputes it.

But "not legally required to be reproducible" and "actually reproducible" are different claims. The second one starts mattering the moment a hospital procurement team asks for an auditable pipeline as a condition of doing business. So does an insurer's compliance officer. So does a government certification reviewer, exactly the customers this pivot is aimed at. The repository on GitHub right now cannot hand them one.

Some of these barriers are ordinary research-code realities. GPU-bound inference, HPC paths, and version-sensitive dependencies are standard features of large-scale ML pipelines, and they are not, by themselves, enough to defeat a reproducibility claim.

The stronger finding is narrower than that. Even setting infrastructure entirely aside, the repository contains code-level divergences, the p2_short_floor dual default being the clearest, where the same inputs can yield different outputs depending on which entry path is used.

That gap has nothing to do with whether you have access to Jean Zay. It exists inside the code itself.
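Divergences of this kind can be hunted for statically. The sketch below walks function definitions and reports any parameter name whose literal default differs between definitions. It is a hedged illustration: the helper name, file names, and sample functions are hypothetical, and only the p2_short_floor values of 128 and 256 come from the audit.

```python
import ast
from collections import defaultdict


def divergent_defaults(sources: dict) -> dict:
    """Map parameter name -> {(file, function): default} where literal defaults disagree."""
    seen = defaultdict(dict)
    for filename, source in sources.items():
        for node in ast.walk(ast.parse(source, filename=filename)):
            if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            positional = node.args.posonlyargs + node.args.args
            defaults = node.args.defaults
            # Positional defaults align with the last len(defaults) parameters.
            pairs = list(zip(positional[len(positional) - len(defaults):], defaults))
            # Keyword-only parameters carry None where no default is given.
            pairs += [
                (arg, default)
                for arg, default in zip(node.args.kwonlyargs, node.args.kw_defaults)
                if default is not None
            ]
            for arg, default in pairs:
                if isinstance(default, ast.Constant):
                    seen[arg.arg][(filename, node.name)] = default.value
    return {
        name: sites for name, sites in seen.items()
        if len(set(sites.values())) > 1
    }


# Hypothetical stand-ins for two files in a pipeline.
sources = {
    "entrypoint.py": "def main(docs, p2_short_floor=128):\n    return docs\n",
    "worker.py": "def worker(docs, *, p2_short_floor=256, seed=0):\n    return docs\n",
}

report = divergent_defaults(sources)
print(report)
```

A check like this only sees literal defaults in function signatures. Defaults set through argparse, environment variables, or config files would need their own pass.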


So Which of the Three Is It

The Flamehaven team sat with the whole picture before writing any of this down. The three markers. The license asymmetry. The dependency wall. The missing training script. A company visibly pivoting toward the exact customers who will ask the hardest questions about all of it. Three explanations remain plausible from the public evidence, but they are not equally supported. Nobody on the team is going to pretend the distinction can be proved cleanly from public commits alone, because it cannot.

1. The release-debt explanation: this is simply an unfinished release.

Research teams push code to GitHub the same week as a paper deadline constantly, without anyone assigned to harden it for a stranger trying to rebuild it from nothing. The # tmp markers, the hardcoded cluster paths, and the missing license are exactly what shows up when an internal pipeline ships under deadline pressure and nobody circles back. Nothing here required a strategic decision. It only required a launch date.

2. The disclosure-boundary explanation: the release was scoped deliberately.

Open enough to earn the credibility of having published a method. Not open enough for anyone outside Doctolib's compute environment to fully rebuild what the paper describes. The weights carry the license that buys goodwill. The code that would let someone replicate the full process carries ambiguity instead. No deception required, just a decision, somewhere, about where the boundary of openness should sit.

3. The serious explanation: this was designed to look open while functioning closed.

The team checked this one hardest, specifically because it is the easiest to under-check out of discomfort. They did not find support for it. The prompts are in the repository in full, including the exact rewriting instructions handed to the LLM. The filtering logic is there. The job-submission scripts are there. If the goal had been to mislead outsiders into believing they had something they do not, publishing a hollow shell would have been the easier move by a wide margin. That is not what is here. What is here is real, working logic, with one specific step left undone.

The team can rule out that third explanation with reasonable confidence.

The public evidence does not cleanly separate the first from the second. Both produce exactly the artifact sitting in this repository today, and the difference between release-hardening debt and an intentional disclosure boundary lives inside a decision at Doctolib that no public commit history will ever show.


The Boundary That Remains

What the team can say without claiming motive is this: the gap between the paper's headline sentence and what the public can currently do with this repository is real. It is measurable. And it lands, with observable precision, exactly on the claims that determine whether anyone outside Doctolib can verify this work independently.

You now have the same evidence the Flamehaven team did. Where you land on those two explanations is worth sitting with. So is the question of whether you'd land somewhere different if it were your hospital's compliance officer doing the asking, instead of an outside audit team with no stake in the outcome either way.


This audit is based on static code analysis, AST parsing, and partial runtime verification of a public GitHub repository at a fixed commit, cross-referenced against the published arXiv paper and project README. No proprietary Doctolib systems, internal infrastructure, or non-public data were accessed. Findings reflect code-observable and text-observable evidence, not a determination of intent. This is not a legal opinion.


For readers who want to inspect the evidence trail in more detail, Flamehaven has also published two supporting artifacts.

The full technical diagnostic report provides the deeper codebase review, including methodology notes, claim traceability, runtime-verified dependency findings, repository hygiene findings, and the corrected scope limitations behind this newsletter:

https://flamehaven.space/writing/doctobert-codebase-diagnostic-report/

The verification ledger records the STEM-BIO-AI / governance-side assessment and preserves the audit entry as a structured public evidence record:

https://flamehaven01.github.io/Flamehaven-Verification-Ledger/#doctobert

This newsletter is the narrative summary. The report and ledger are provided for readers who want to see how the review was conducted, what was checked, and where the remaining uncertainty still sits.


References

[1]. Biotech VC Funding Tracker 2026 - BioBucks. Disclosed biotechnology equity rounds, year to date as of June 2026.

[2]. The French AI Boom You Can't Afford to Ignore in 2026 - Atera, February 2026. French tech and AI venture funding figures for 2025. (Secondary source — primary statistical sourcing pending verification.)

[3]. Health & Healthtech in the Paris Region - Choose Paris Region. Three-year healthcare VC totals for the Paris Region.

[4]. France 2030: Digital Health Acceleration Strategy - OECD.AI. France 2030 budget allocation, Health Data Hub, and Jean Zay supercomputer investment.

[5]. French healthtech unicorn Doctolib hits €348m ARR, eyes profitability - Sifted, 2024. Doctolib ARR, operating loss, subscription revenue model, and growth figures.

[6]. doctolib-lab/doctobert - GitHub. Repository audited at the commit referenced in this piece.

[7]. Where Does the Signal Live? A Web Data Recipe for Medical Encoder Pretraining - arXiv:2606.22079, June 20, 2026.

[8]. Regulation (EU) 2024/1689 (EU AI Act), Official Journal of the European Union, OJ L 2024/1689, 12 July 2024, CELEX 32024R1689 — Articles 2(6), 2(8), 2(12), 6(2), Annex III, Article 51(2).

[9]. doctolib-lab/doctobert-fr-base and doctolib-lab/doctomodernbert-fr-base - HuggingFace Hub model cards. Intended-use boundaries, license (Apache-2.0), and non-clinical scope disclaimers.
