A couple of weeks ago, a founder building a clinical documentation tool asked me a version of the question I get more than any other in healthcare AI: “Can we use GPT-4o, or does HIPAA mean we have to build everything ourselves?”
The question makes sense. HIPAA’s reputation precedes it. Most founders walk into healthcare AI conversations assuming any cloud model is automatically off-limits, that local-only is the only safe architecture, and that any use of patient data in an LLM call is a violation waiting to happen.
None of that is exactly right.
You can use GPT-4o, Claude, Gemini, and most major cloud models with patient data, provided you structure the architecture correctly. The architecture is what determines compliance, not the model. But “structure it correctly” is doing a lot of work in that sentence. This post breaks down what correct actually looks like across four patterns we’ve used in production healthcare builds.
What HIPAA Actually Requires From Your AI Stack
HIPAA covers Protected Health Information (PHI): 18 specific identifiers including name, date of birth, address, medical record numbers, and diagnoses tied to an individual. The full list is in the HHS Privacy Rule guidance.
When you’re building AI on top of PHI, three requirements directly shape your architecture.
First, Business Associate Agreements. If you’re sending PHI to a third-party vendor (like an LLM provider’s API), that vendor must sign a BAA with you. A BAA is a contract where the vendor acknowledges it’s handling PHI and commits to specific security and privacy obligations. No BAA, no PHI in their API calls.
Second, audit trails. Every access to PHI that goes through your system, including AI calls, needs to be logged: who accessed what, when, and why. This applies to LLM calls as much as it applies to database reads. The logging requirement often catches teams off-guard: it’s not enough to log the application layer if your LLM calls bypass your audit middleware.
Third, the minimum necessary standard. You shouldn’t send more PHI to an LLM than the specific task requires. If you’re asking AI to summarize a medication list, you shouldn’t be sending the full patient record including social security number and insurance policy details. This is less a technical constraint and more a documentation obligation: you need to be able to explain, in writing, why each piece of PHI you process is necessary for the task.
That’s the compliance skeleton. Everything else in healthcare AI architecture is a variation on how you satisfy these three requirements under different constraints.
Which LLM Providers Have Signed BAAs (And What That Gets You)
The short list: OpenAI (Enterprise tier), Anthropic (Enterprise), Google (healthcare-configured Vertex AI), AWS (Bedrock under their BAA-covered services), and Azure (OpenAI Service under Azure’s HIPAA BAA).
All of them will sign BAAs. None of them will sign BAAs on standard developer plans.
This changes your cost floor in ways that catch most teams off-guard. OpenAI Enterprise doesn’t have public pricing, but it’s meaningfully higher than the standard API pay-as-you-go rates. Vertex AI healthcare configurations have their own pricing structure. The developer-tier versions of these APIs (the ones most prototypes are built on) are not HIPAA-covered by default. You can check OpenAI’s enterprise privacy practices for what their BAA actually covers.
A BAA from a vendor also doesn’t mean you can throw PHI at the API without other controls in place. The BAA covers the vendor’s obligations, not yours. You still need the audit logging, the minimum necessary principle, access controls on your side, and a documented risk analysis. A signed BAA is a necessary condition. It’s not a sufficient one.
Where founders get into trouble: they sign the BAA with their LLM provider, send PHI to the API, and consider the compliance box checked. Then they get a breach notification letter six months later because a developer’s API key was leaked in a public commit, exposing the PHI that went through those calls. The BAA didn’t protect them from a key management failure on their own side.
We’ve seen this pattern enough times that we now build API key rotation and secret scanning into the project kickoff checklist for any healthcare AI build we take on.
The Four Patterns We Use in Production
Most healthcare AI architectures we’ve seen and built land in one of four patterns. Here’s how each works, where each is the right call, and where each breaks.
Pattern 1: De-identify First, Then Query
The most common pattern we implement. Before any PHI touches an LLM API, a de-identification step removes or replaces the 18 HIPAA identifiers. Names become placeholders like “Patient_01”, dates get shifted or generalized, specific addresses become regional references. The de-identified data goes to the LLM. The LLM’s response comes back, gets re-identified where needed via a mapping table kept in your own infrastructure, and reaches the end user.
This works for document summarization, clinical note drafting, medication extraction, and most analysis tasks where the AI doesn’t need the patient’s specific name to produce a useful output. The de-identification step runs in your own infrastructure, so the only data leaving your network to the LLM API is de-identified. HHS has published guidance on acceptable de-identification methods if you need to know exactly what the Safe Harbor standard requires.
The limitation: some use cases genuinely need identifiable PHI to reach the model. A personalized patient communication tool can’t replace the patient’s name with “Patient_01”; that defeats the purpose. A tool that cross-references specific insurance records needs real identifiers. For those cases, you need one of the next three patterns.
Pattern 2: Local Model for PHI, Cloud Model for Everything Else
A hybrid approach. PHI-heavy tasks run against a locally-hosted model (Llama 3.3 70B or a smaller fine-tuned variant for lower compute), which never sends data outside your infrastructure. Tasks that don’t involve PHI, or that operate on already-de-identified data, hit the cloud API.
The practical split we use: clinical note processing and patient record summarization go to the local model. General analysis, formatting, template generation, and non-PHI reasoning go to the cloud model. The cloud model is faster and cheaper per call; the local model is slower, more expensive to host, and fully under your control.
Running a 70B model at production throughput is not trivial. Expect dedicated GPU instances (A100s or A10Gs) for anything beyond light usage. Infrastructure cost alone, ignoring software, runs $8,000 to $15,000 per month for a healthcare SaaS handling a few thousand daily patient interactions. That number needs to be in your financial model before you commit to this pattern.
Pattern 3: Full On-Premises Deployment
No data leaves your network. Everything runs on infrastructure you operate: the LLM, the vector database, the embedding models, the observability stack. You get nothing from the cloud LLM providers.
This is the architecture for hospital systems whose security teams have ruled out any patient data reaching third-party APIs under any circumstances. It’s also the architecture for government health programs and regulated insurance contexts where data sovereignty is a hard requirement.
The cost is significantly higher: dedicated infrastructure, an internal MLOps team to manage model updates and hardware health, and the ongoing work of staying current with model improvements without vendor-managed updates. One hospital system we spoke to during scoping was running $40,000 per month in infrastructure for an on-prem LLM deployment before any application costs.
This pattern makes sense when the alternative (failing to win the enterprise healthcare contract or losing an incumbent vendor relationship) is more expensive than the infrastructure cost.
Pattern 4: SaaS with Vendor-Managed Compliance
Some LLM-powered healthcare SaaS tools (Epic’s AI modules, certain Microsoft Copilot healthcare configurations, Azure Health Data Services) come with their own BAAs and managed compliance posture. If the vendor’s tool fits your use case and you’re willing to operate within their constraints, this is the fastest path to compliant deployment.
The downside: you’re building on someone else’s AI layer, which means less customization, less control over model updates, and a dependency on their pricing and roadmap. It’s the right trade for teams that want to move fast on a defined, contained use case without building the compliance infrastructure themselves.
Matching the Pattern to Your Situation
Three questions narrow the choice quickly.
Does your use case require identifiable PHI to reach the model? If not, Pattern 1 (de-identify first) is almost always the right call. It keeps infrastructure costs low, uses the best available cloud models, and satisfies BAA requirements without a full local hosting setup. This covers most clinical documentation, report summarization, and extraction use cases. In our experience, it handles roughly 80% of the healthcare AI features founders actually want to build.
What’s your institutional risk tolerance? A direct-to-consumer telehealth app has different constraints than a hospital system deploying AI to read radiology reports. The higher the institutional complexity and procurement scrutiny, the more likely you’ll need Pattern 2 or 3. Patterns 2 and 3 are harder to build, but they shorten the security review cycle with large hospital IT departments.
What’s your infrastructure budget? If you’re pre-Series A, Pattern 1 or Pattern 4 (if a suitable SaaS vendor exists for your specific use case) keeps your burn rate manageable. Pattern 3 is appropriate when you’ve got enterprise contracts that justify the monthly infrastructure spend.
The mistake I see most often: teams default to full on-prem (Pattern 3) because it feels safest, without modeling what Pattern 1 actually costs and what the realistic compliance risk is. De-identify-first handles the majority of healthcare AI use cases at a fraction of the cost. The cautious choice isn’t always the right engineering choice.
One practical note on RAG pipelines in healthcare contexts: if you’re building retrieval-augmented systems on top of clinical documents (which most healthcare AI builds end up needing), the RAG architecture decisions matter as much as the PHI handling layer. We’ve seen teams get the HIPAA architecture right and then build a retrieval system that surfaces PHI in unexpected ways.
What HIPAA Compliance Doesn’t Cover (And What Still Goes Wrong)
HIPAA compliance is a legal framework, not a technical security specification. Checking the compliance boxes doesn’t mean your system is secure.
A few failure modes we’ve seen in healthcare AI builds that were technically HIPAA-compliant when reviewed:
Prompt injection through patient-controlled inputs. A patient entering their own “medical history” into a form can inject instructions that manipulate your LLM’s behavior. The PHI handling architecture was correct; the input sanitization wasn’t. Compliant doesn’t mean injection-safe.
Model outputs that allow re-identification. De-identified inputs sometimes produce outputs that allow re-identification when combined with other data. This is an emerging area. De-identification isn’t foolproof at scale, and the regulatory treatment of inference-based re-identification is still developing. It’s a real technical risk that compliance reviews don’t catch.
Audit log gaps on failure paths. Logging that covers the happy path often misses retries, async calls, and error branches. We’ve found logging gaps in production systems that had passed compliance reviews, because the review checked that logging existed, not that it was comprehensive across all execution paths. Good LLM observability tooling catches these gaps before a compliance auditor does.
Third-party integrations without BAAs. You sign a BAA with your LLM provider. You send de-identified data correctly. Then a developer adds an error tracking library that sends exception traces (which sometimes include PHI fragments) to a third-party service. That service has no BAA. The main path was compliant; the debugging path wasn’t.
HIPAA compliance means you’ve checked the required boxes. Security means the boxes you checked match the actual threat model. In healthcare AI, you need both. The second is harder to audit than the first.
FAQ
Do I need a BAA with my LLM provider to use AI with patient data?
Yes, if PHI reaches the LLM API. The main providers (OpenAI Enterprise, Anthropic Enterprise, Google Vertex AI healthcare configurations, AWS Bedrock, and Azure OpenAI) will all sign BAAs, but not on standard developer or pay-as-you-go plans. If you’re using Pattern 1 (de-identify first), PHI never reaches the LLM API, which changes the calculus. You still need to verify that your de-identification process meets HIPAA’s Safe Harbor or Expert Determination standard, and that you’re not inadvertently sending PHI through other channels like error logs.
What’s the difference between HIPAA Safe Harbor and Expert Determination for de-identification?
HIPAA defines two methods for de-identifying PHI. Safe Harbor removes all 18 specified identifiers from the data. Expert Determination uses statistical methods to demonstrate that the risk of re-identification is very small, allowing more flexible removal patterns, but requiring a qualified statistician to certify the method. For most healthcare AI builds, Safe Harbor is simpler to implement and easier to audit. Expert Determination makes sense when Safe Harbor would strip too much clinically relevant context for the AI task to produce useful output.
How much does a HIPAA-compliant LLM architecture actually cost?
It depends on the pattern. De-identify-first (Pattern 1) with an enterprise cloud LLM adds roughly $1,000 to $3,000 per month in API costs at modest scale (under 10,000 patient interactions per day), plus a few hundred dollars per month in compute for the de-identification pipeline itself. Hybrid local-plus-cloud (Pattern 2) adds $8,000 to $15,000 per month in GPU infrastructure. Full on-prem (Pattern 3) starts at $20,000 per month in infrastructure. Pattern 4 (SaaS with BAA) has vendor-specific pricing, typically per-seat or per-use-case subscription pricing that varies significantly by vendor.
Can I use GPT-4o or Claude with actual patient records?
Yes, with a signed BAA and the right architecture. OpenAI offers BAAs through OpenAI Enterprise. Anthropic offers BAAs for enterprise deployments. Both models handle healthcare AI tasks well: note summarization, medication extraction, clinical decision support drafts, and diagnostic reasoning support. The model choice matters less than the architecture around it. Don’t spend six weeks benchmarking GPT-4o vs. Claude on healthcare tasks before you’ve solved the PHI handling architecture. The architecture is the harder, more consequential problem.
How long does it take to ship a HIPAA-compliant healthcare AI product?
For a typical clinical documentation AI or patient communication tool using Pattern 1 (de-identify-first), we’ve shipped production-ready systems in 6 to 10 weeks. That timeline covers: vendor BAA negotiation (1 to 2 weeks if your legal team moves fast), de-identification pipeline build and validation (2 to 3 weeks depending on document types and PHI density), LLM integration and clinical workflow integration (3 to 4 weeks), and security review (1 week if run in parallel). The wildcard is always institutional approval on the client side: a hospital system’s IT security review can add 4 to 12 weeks regardless of how ready your technical system is. Build that timeline separately from the engineering timeline and don’t let the two block each other.
If you’re building AI for a healthcare product and trying to figure out which architecture fits your use case, PHI volume, and regulatory context, we can usually give you a clear answer in 30 minutes. Book a call and we’ll tell you which pattern makes sense for your situation.