```mermaid
flowchart TD
A([Data Request\nSubmitted]) --> B{Classification\nTier?}
B -->|Public or Internal| C[Approved Enterprise\nTool — no BAA required]
B -->|Regulated PHI| D[Honest Broker Review\nRe-id risk assessment]
B -->|Restricted| E[Governance Committee\nApproval Required]
D --> F{Expert Determination\nPasses?}
F -->|No| G[Redesign or Decline]
F -->|Yes| H[BAA + ZDR Verified]
E --> H
H --> I{Processing\nLocation?}
I -->|Enterprise cloud| J[Proceed with\naudit logging]
I -->|External API| K[Zero-Retention\nclause confirmed]
K --> J
J --> L([Output + Audit Log\nRetained Institutionally])
```
17 Data Access and Governance
The most common bottleneck in AMC AI deployment is not model selection, compute availability, or vendor relationships. It is data governance. Not in the abstract sense — every institution has HIPAA policies — but in the specific sense that most AMC data governance frameworks were designed for uses that look nothing like LLM training or inference. The HIPAA Safe Harbor de-identification standard was written to support data sharing for research and public health uses. It was not designed to account for a model that can infer a patient’s identity from the statistical patterns in clinical notes after every explicitly identifying field has been removed. Understanding why existing frameworks are insufficient, and what a governance structure adequate to the LLM era requires, is the starting point for sound data strategy.
17.1 The AMC Data Mosaic
An academic medical center holds data that falls under at least four distinct regulatory regimes, and AI pipelines frequently cross the lines between them without anyone noticing. Clinical data generated by patient care is governed primarily by HIPAA. Student and trainee evaluations are governed by FERPA, which prohibits disclosure of educational records without consent. Research data involving human subjects is governed by the Common Rule (45 CFR 46) and, for federally funded genomic data, by additional NIH consent and data sharing requirements. Administrative and business data is governed by institutional policy and non-disclosure obligations.
The governance failure happens when a researcher asks an LLM to analyze de-identified clinical notes alongside trainee evaluation records and administrative email threads, passes all of it into a proprietary API with a standard enterprise agreement, and calls the result compliant because no single data type was explicitly prohibited. The pipeline crosses four regulatory regimes, creates potential for cross-dataset re-identification, and sends everything to an external service with whatever data retention policies the standard enterprise agreement allows. This is not a hypothetical; it is a pattern that occurs regularly at institutions without explicit AI data governance policies.
17.2 Data Classification for AI
A data classification framework designed for the LLM era needs at least four tiers that account not just for the sensitivity of the data in isolation but for its sensitivity in combination with the capabilities of the model processing it.
Public data can be used freely, including with public AI services: institutional press releases, public-facing policy documents, de-identified aggregate statistics.
Internal data includes non-PHI business data — operational metrics, general financial information, non-patient-facing communications. This data should be processed only through services with a signed data processing agreement, but does not require a HIPAA BAA. Many institutions use consumer-grade AI tools for internal data tasks without adequate contractual protections; this is a gap the classification framework should close.
Regulated data is the primary risk tier: PHI under HIPAA, student records under FERPA, and individually identifiable research data under the Common Rule. This data requires a BAA or equivalent privacy agreement, zero-data-retention provisions in vendor contracts, and where possible, local processing rather than transmission to external services.
Restricted data is the highest tier: genomic sequences, psychiatric treatment records, HIV status, substance use treatment records, and data subject to specific research consent restrictions. NIH policy has begun restricting the export of AI models trained on certain genomic datasets, recognizing that the model weights themselves can encode identifying information. Restricted data should not leave the institutional network boundary without specific governance approval.
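To make the four tiers enforceable in tooling rather than only in policy text, the classification can be expressed as a simple lookup that a request intake form or approval workflow can call before any data moves. The sketch below is illustrative only: the tier names follow the framework above, but the destination attributes and the function itself are hypothetical, not an existing institutional system.

```python
from enum import Enum

class DataTier(Enum):
    PUBLIC = 1      # press releases, public policy documents, de-identified aggregates
    INTERNAL = 2    # non-PHI business data; requires a data processing agreement
    REGULATED = 3   # PHI, FERPA records, identifiable research data; BAA + zero retention
    RESTRICTED = 4  # genomic, psychiatric, HIV, substance use; stays inside the network

def is_destination_permitted(tier: DataTier, *, has_dpa: bool = False,
                             has_baa_with_zdr: bool = False,
                             inside_network: bool = False,
                             governance_approved: bool = False) -> bool:
    """Return True if a destination satisfies the minimum controls for the tier."""
    if tier is DataTier.PUBLIC:
        return True
    if tier is DataTier.INTERNAL:
        return has_dpa or inside_network
    if tier is DataTier.REGULATED:
        return inside_network or has_baa_with_zdr
    if tier is DataTier.RESTRICTED:
        return inside_network and governance_approved
    return False

# Example: an external API with a BAA and zero-data-retention clause, no governance sign-off.
print(is_destination_permitted(DataTier.REGULATED, has_baa_with_zdr=True))   # True
print(is_destination_permitted(DataTier.RESTRICTED, has_baa_with_zdr=True))  # False
```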
17.3 The Limits of De-identification
The HIPAA Privacy Rule defines two de-identification methods. Safe Harbor removes 18 specific categories of direct identifiers. Expert Determination uses a statistical assessment to verify that the residual re-identification risk is very small (U.S. Department of Health and Human Services, Office for Civil Rights 2012). In 2012, when these standards were codified, the primary risk cases were database linkage attacks using public records. The threat model has changed materially since then.
Gymrek and colleagues demonstrated that genomic data de-identified under Safe Harbor could be re-identified using public genealogical databases and surname inference — using techniques available to any moderately sophisticated analyst (Gymrek et al. 2013). The threat model for clinical notes has evolved similarly. Models trained on large corpora of clinical text can memorize rare clinical strings — unusual diagnoses, distinctive procedure sequences, specific medication combinations — and reproduce them under adversarial prompting. A note that contains no direct identifiers may still contain a clinical signature unique enough to identify a specific patient to someone with access to supplementary information.
The governance implication is direct: Safe Harbor is not an adequate privacy guarantee for LLM training datasets. Expert Determination — a statistical verification that the specific data in the specific modeling context poses negligible re-identification risk — should be the default for AI training on clinical data. For inference, the appropriate safeguard is a BAA with zero-data-retention provisions that prevent the model provider from retaining prompt content after the session concludes.
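Expert Determination is ultimately a statistical argument, and one of its first steps is usually an assessment of how many records remain unique on quasi-identifying attributes even after direct identifiers are removed. A minimal sketch of that uniqueness screen, with illustrative column names rather than any real extract, might look like this:

```python
import pandas as pd

# Hypothetical de-identified extract; column names and values are illustrative only.
notes = pd.DataFrame({
    "age_band":  ["80-89", "80-89", "30-39", "30-39", "30-39"],
    "zip3":      ["021",   "021",   "606",   "606",   "606"],
    "diagnosis": ["rare_metabolic_dx", "hip_fracture", "asthma", "asthma", "asthma"],
})

quasi_identifiers = ["age_band", "zip3", "diagnosis"]

# Equivalence class size: how many records share each quasi-identifier combination.
class_size = notes.groupby(quasi_identifiers)["diagnosis"].transform("size")

# Records that are unique on quasi-identifiers carry the highest re-identification risk;
# an Expert Determination review would flag these for suppression or generalization.
unique_records = notes[class_size == 1]
print(f"{len(unique_records)} of {len(notes)} records are unique on quasi-identifiers")
```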
17.4 The AI-Ready Honest Broker
Most AMCs have an existing honest broker function: a person or team that processes data requests for research use, applies de-identification, and manages sharing agreements. This function was designed for structured data exports. It was not designed for the AI-era pattern: unstructured text, imaging data, multi-modal inputs, and requests for ongoing access to live data streams rather than static exports.
The AI-ready data broker function needs to add several capabilities. It needs the ability to assess re-identification risk for unstructured text, not just structured fields. It needs to evaluate vendor BAAs specifically for AI provisions — standard BAAs written before 2022 frequently do not address model training, prompt retention, or output logging. It needs to maintain an approved tool registry mapping permitted data types to approved services, so that individual researchers do not make case-by-case risk assessments without institutional guidance.
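The approved tool registry can start as something very simple: a mapping from registered services to the highest data tier each is cleared for, consulted before any request proceeds. The sketch below uses hypothetical tool names and a default-deny posture for unregistered services.

```python
# Tool names and contract facts are hypothetical; the contract fields document the
# basis for each tool's tier ceiling and would come from the BAA audit.
APPROVED_TOOLS = {
    "enterprise_chat":  {"max_tier": "internal",   "baa": False, "zero_retention": False},
    "clinical_llm_api": {"max_tier": "regulated",  "baa": True,  "zero_retention": True},
    "onprem_enclave":   {"max_tier": "restricted", "baa": True,  "zero_retention": True},
}

TIER_ORDER = ["public", "internal", "regulated", "restricted"]

def tool_permitted(tool: str, data_tier: str) -> bool:
    """Check whether a registered tool is approved for a given classification tier."""
    entry = APPROVED_TOOLS.get(tool)
    if entry is None:
        return False  # unregistered tools are denied by default
    return TIER_ORDER.index(data_tier) <= TIER_ORDER.index(entry["max_tier"])

print(tool_permitted("enterprise_chat", "regulated"))   # False: internal ceiling, no BAA
print(tool_permitted("clinical_llm_api", "regulated"))  # True
```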
17.5 FHIR and OMOP as AI Substrate
The interoperability standards that AMCs have adopted for data exchange — HL7 FHIR and the OMOP Common Data Model — are not just data formats. They are the substrate on which clinical AI is increasingly built and validated.
FHIR R5 provides standardized resource definitions for clinical data that allow AI models to query, retrieve, and write structured information across different EHR implementations. The SMART on FHIR authorization framework enables granular, patient-specific access scopes that are essential for the least-privilege agentic system design described in Chapter 11. For AMCs procuring AI tools that need EHR interaction, FHIR compatibility is increasingly the minimum requirement — both for richer data access and for the access logging that governance requires.
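As a concrete illustration of what FHIR-based access looks like in practice, the sketch below issues a single Observation search against a FHIR endpoint. The base URL, patient identifier, and token are placeholders; in a SMART on FHIR deployment the token would be issued through the OAuth2 authorization flow with an appropriately narrow scope (for example, a patient-level read scope on Observation).

```python
import requests

FHIR_BASE = "https://fhir.example-amc.org/fhir"  # hypothetical endpoint
ACCESS_TOKEN = "..."                             # obtained via SMART on FHIR authorization

# Search for vital-sign Observations for one patient; parameters are illustrative.
response = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "example-patient-id", "category": "vital-signs", "_count": 50},
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Accept": "application/fhir+json",
    },
    timeout=30,
)
response.raise_for_status()
bundle = response.json()  # a FHIR Bundle of Observation resources
print(bundle.get("total"), "observations returned")
```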
The OMOP CDM is the dominant standard for large-scale observational research, and it has become the basis for multi-institutional AI validation (Singhal et al. 2023). A model validated on OMOP-formatted data from one institution can be tested against OMOP-formatted data from another without custom integration work. For AMCs participating in research consortia — NIH N3C, PCORnet, NIH Bridge2AI — OMOP compatibility is typically required.
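For orientation, a typical OMOP query is unremarkable SQL over the common tables. The sketch below counts condition occurrences by standard concept; the table and column names follow the OMOP CDM, while the sqlite connection is a stand-in for whatever warehouse the institution actually runs.

```python
import sqlite3  # placeholder for the institutional OMOP database connection

QUERY = """
SELECT c.concept_name,
       COUNT(*) AS n_occurrences
FROM condition_occurrence co
JOIN concept c ON c.concept_id = co.condition_concept_id
GROUP BY c.concept_name
ORDER BY n_occurrences DESC
LIMIT 20;
"""

conn = sqlite3.connect("omop_cdm.db")  # hypothetical local OMOP extract
for concept_name, n in conn.execute(QUERY):
    print(f"{concept_name}: {n}")
```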
17.6 Vendor Contracts: The Non-Negotiables
The BAA for any AI vendor processing PHI must address provisions that standard HIPAA BAAs frequently omit. AMC legal teams should require:
No-training clause: The vendor may not use AMC data — prompts, completions, user interactions — to train or fine-tune models. This is standard in enterprise tiers from major providers but absent in default tiers; it must be explicitly verified.
Zero-data-retention for prompts: Prompt content containing PHI must not be retained after the session concludes. This means no logging for safety monitoring involving human review, and no storage in training pipelines.
Output ownership: Clinical notes and analyses generated from institutional data are institutional property. Vendor agreements should not claim ownership of AI-generated outputs based on institutional inputs.
Algorithmic change notification: The vendor must notify the institution before making significant model changes that could affect clinical outputs. An unannounced model update in a clinical workflow integration is a safety event.
Table 17.1: Non-negotiable contract provisions for AI vendors processing PHI.

| Contract Provision | Risk Addressed | Negotiability |
|---|---|---|
| No-training on customer data | PHI memorization in model weights | Mandatory |
| Zero-data-retention for prompts | PHI leakage through session logging | Mandatory |
| Output ownership by institution | IP rights to AI-generated clinical content | Mandatory |
| Right to audit | Verify data handling compliance | Required for high-risk use |
| Algorithmic change notification | Unexpected behavior change in clinical AI | Required for clinical integration |
| US-only data residency | Jurisdictional compliance | Required for restricted data |
17.7 TEFCA and the Nationwide Exchange Layer
The Trusted Exchange Framework and Common Agreement — TEFCA — established a network of Qualified Health Information Networks (QHINs) for nationwide clinical data exchange when it went live in 2023 (Office of the National Coordinator for Health Information Technology 2023). Epic Nexus, Health Gorilla, Oracle Health, and a handful of other organizations have received QHIN designation, creating for the first time a governed national pipe through which clinical data can flow across institutional boundaries.
The implications for AI data access are significant, and more complicated than they appear. TEFCA defines a set of permitted Exchange Purposes — Treatment, Payment, Health Care Operations, Public Health, Individual Access, and a handful of others — that determine under what conditions data can be queried across the network. Research is notably absent from the permitted purposes for nationwide exchange in the current framework. An AMC that wants to use TEFCA-accessed data to train or validate an AI model is operating in legally ambiguous territory unless it can characterize the use as Health Care Operations, a framing that has a defined statutory meaning and does not extend to all AI development activities.
The practical implication is that TEFCA is most immediately useful for clinical AI applications that operate at the point of care — retrieving a patient’s medication history from an external health system to inform a clinical decision, for example — and least immediately useful for the large-scale data aggregation that model training requires. AMCs participating in AI research consortia will need to route their data sharing through existing research frameworks — IRB oversight, data use agreements, OMOP standardization — rather than through TEFCA’s exchange infrastructure, at least until research is added as a permitted exchange purpose.
The QHIN connection that TEFCA requires does, however, create an infrastructure opportunity for AI governance that did not exist before: every query that passes through a QHIN-connected endpoint is logged. For the first time, the institution has a national-scale audit trail for the external data it accesses. That audit trail is an asset for the AI governance program, providing evidence of the provenance of external data inputs to clinical AI systems in a way that was previously unavailable.
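How that audit trail gets used is an implementation detail, but the general shape is straightforward: periodically summarize which external organizations were queried and how often, and attach that summary to the provenance documentation for downstream AI systems. The sketch below assumes, purely for illustration, that events are exported as FHIR AuditEvent JSON; actual export formats vary by QHIN and gateway vendor.

```python
import json
from collections import Counter

def summarize_external_queries(path: str) -> Counter:
    """Count queries per responding organization from a hypothetical AuditEvent export."""
    with open(path) as f:
        events = json.load(f)  # assumed: a JSON array of FHIR AuditEvent resources
    counts = Counter()
    for event in events:
        for agent in event.get("agent", []):
            org = agent.get("who", {}).get("display", "unknown")
            counts[org] += 1
    return counts

# counts = summarize_external_queries("qhin_audit_export.json")  # hypothetical file
# print(counts.most_common(10))
```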
17.8 The NIH Data Management and Sharing Policy Tension
NIH’s Data Management and Sharing Policy, which took effect in January 2023, requires that all research conducted with NIH funding produce a data management plan and share the resulting scientific data to the extent permitted by law and subject to privacy and ethical constraints (National Institutes of Health 2020). The policy was written to address the reproducibility crisis in biomedical science — if every investigator shared their data, more findings could be independently verified. In the AI context, it creates a specific governance problem: what counts as “scientific data” when the research involves training or fine-tuning a language model?
NIH clarified in a 2025 notice that AI models trained on controlled-access genomic data are Data Derivatives subject to the same access restrictions as the underlying data (National Institutes of Health 2025). The model weights — the billions of numerical parameters that encode what the model learned — may contain information about the training data that must not be shared openly. This is the “model weights as PHI” problem: a model trained on genomic data from a controlled-access cohort may memorize rare clinical signatures in ways that allow adversarial reconstruction of individual patient data, and releasing that model is equivalent to releasing a derivative of the controlled-access dataset.
The resulting tension is not abstract. An investigator conducting NIH-funded research on a clinical AI model is simultaneously obligated to share scientific data (by the DMS Policy) and obligated not to share data that could enable re-identification (by HIPAA, IRB consent terms, and the 2025 genomic AI notice). The resolution requires distinguishing among what must be shared, what may be shared, and what must not be shared. Model documentation — model cards, performance reports, training data descriptions — can satisfy much of the transparency obligation without releasing the weights themselves. Code and analysis scripts can be shared. The trained model weights require a case-by-case governance determination that should involve the institution’s research compliance office, not the individual investigator.
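The shareable layer of that distinction can be made concrete as a model card artifact that travels with the research outputs while the weights stay under controlled access. The sketch below uses illustrative field names and numbers, not a mandated NIH or institutional schema.

```python
import json

# A minimal model card: performance and training data descriptions are shared,
# weights are not. All values are hypothetical placeholders.
model_card = {
    "model_name": "example-clinical-classifier",
    "intended_use": "Research prototype; not for clinical decision-making.",
    "training_data": {
        "description": "De-identified clinical notes, 2018-2023, single institution.",
        "access_tier": "controlled-access; weights treated as a data derivative",
    },
    "evaluation": {
        "held_out_auroc": 0.87,  # illustrative figure only
        "subgroup_performance_reported": True,
    },
    "weights_shared": False,
    "weights_access_process": "Case-by-case determination via research compliance office.",
}

with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```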
17.9 Synthetic Data as a Governance Instrument
One response to the tension between data access and privacy protection is to replace real clinical data with synthetic data — artificial patient records that preserve the statistical structure of the original data without containing information about any real individual. The technology for generating high-quality synthetic clinical data has matured substantially since 2020. Variational autoencoders, generative adversarial networks, and diffusion models have all been applied to EHR data generation, with recent methods producing synthetic records whose analytic utility approaches that of real records on many clinical research tasks.
The governance value of synthetic data is that it allows the institution to provide data access — for model development, testing, algorithm validation, and education — at a risk level substantially lower than real PHI. A developer testing a new clinical NLP pipeline does not need real patient data to verify that the pipeline runs correctly; synthetic data with the same structural properties serves that purpose without the privacy exposure. A student learning to write SQL queries against clinical data does not need to practice on actual patients.
But synthetic data has real limitations that institutional governance needs to acknowledge. First, synthetic data inherits the biases of the data it was generated from. A synthetic dataset generated from an EHR that underrepresents certain demographic groups will underrepresent those groups in the synthetic version. Bias auditing of synthetic data requires the same demographic stratification that bias auditing of real data requires. Second, for purposes that require population-level validity — epidemiological analysis, model external validation, clinical trial simulations — the fidelity of synthetic data is an empirical question that requires evaluation, not an assumption. The institution should not claim that a model trained on synthetic data performs equivalently to a model trained on real data without evidence.
Third, the privacy protection that synthetic data provides is not absolute. Membership inference attacks — attempts to determine whether a specific individual’s data was in the training set — can be applied to synthetic data generators as well as to trained models. The privacy guarantee of synthetic data depends on the generation method and the characteristics of the source data. For high-dimensional or rare-condition data, where individuals may be uniquely identifiable by their clinical signature, synthetic generation requires formal privacy guarantees — differential privacy, for example — rather than relying on the assumption that no record in the synthetic set corresponds to any real person.
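One screening check an institution can run before releasing a synthetic dataset is distance to closest record: flagging synthetic rows that sit unusually close to a real row, which suggests near-copying rather than generation. The sketch below uses random placeholder matrices in place of encoded patient records, and it is a heuristic screen, not a substitute for formal guarantees such as differential privacy.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 20))        # placeholder for encoded real records
synthetic = rng.normal(size=(1000, 20))   # placeholder for encoded synthetic records

# Distance from each synthetic record to its nearest real record (DCR).
nn = NearestNeighbors(n_neighbors=1).fit(real)
dcr, _ = nn.kneighbors(synthetic)

threshold = np.quantile(dcr, 0.01)  # flag the closest 1% for manual review
flagged = int((dcr <= threshold).sum())
print(f"median DCR: {np.median(dcr):.3f}, flagged for review: {flagged}")
```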
17.10 Federated Learning and the Governance of Distributed Data
Federated learning offers a different architecture for the same problem: rather than sharing data with a central model, participating institutions keep their data local and share only model updates — the numerical gradients that represent what the model learned from a local training round. A central aggregator combines the updates from multiple institutions into an improved global model, which is then distributed back to the nodes for the next training round. The data never leaves the hospital’s firewall.
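The aggregation step at the center of this architecture is simple enough to show directly. The sketch below implements weighted federated averaging over simulated site updates; production frameworks such as NVIDIA FLARE wrap this core step in secure transport, participant authentication, and audit logging.

```python
import numpy as np

def local_update(global_weights: np.ndarray, local_gradient: np.ndarray,
                 lr: float = 0.01) -> np.ndarray:
    """One simulated local training step; only the resulting weights leave the site."""
    return global_weights - lr * local_gradient

def federated_average(site_weights: list[np.ndarray], site_sizes: list[int]) -> np.ndarray:
    """Aggregate site updates weighted by the number of local training examples."""
    total = sum(site_sizes)
    return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

# Three simulated sites contribute updates; gradients here are random placeholders.
global_w = np.zeros(5)
updates = [local_update(global_w, np.random.default_rng(i).normal(size=5)) for i in range(3)]
global_w = federated_average(updates, site_sizes=[1200, 800, 2000])
print(global_w)
```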
Production federated learning infrastructure exists in healthcare. The NIH Bridge2AI program is developing standardized data and AI-ready infrastructure across four large data generation projects, with federated coordination as a design principle (National Institutes of Health 2023). The Medical Imaging and Data Resource Center (MIDRC) has used federated learning to validate imaging AI models across multiple institutions without centralizing imaging data (MIDRC Consortium 2024). NVIDIA FLARE is the dominant production framework for healthcare federated learning, with documented deployments at major academic medical centers.
Governance of federated learning participation is meaningfully different from governance of centralized data sharing, but it is not governance-free. Participating in a federated training consortium requires a data use agreement with the consortium that specifies what the model is being trained to do, who controls the aggregated model, and how the model can be used after training. An AMC that contributes gradients to a federated training round for a commercial AI product is contributing to the development of that product — a contribution that has intellectual property implications and that may or may not be reflected in the benefit the institution receives from the trained model.
The most important governance question for federated learning is model poisoning risk: a malicious participant can contribute corrupted gradients that degrade the global model or embed adversarial behaviors. For clinical AI models where the output influences patient care, this is not a theoretical concern. Federated learning consortia need technical safeguards — gradient inspection, anomaly detection, trusted node certification — and governance frameworks that define what audit mechanisms participants can rely on to verify the integrity of their contribution and the safety of the models they receive.
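One of the simpler technical safeguards is a norm check on incoming updates before aggregation, which catches the crudest forms of poisoning. The sketch below flags any site update whose norm exceeds a multiple of the cohort median; the threshold and data are illustrative, and real consortia layer this with trusted-node certification and richer anomaly detection.

```python
import numpy as np

def flag_oversized_updates(updates: list[np.ndarray], ratio: float = 3.0) -> list[int]:
    """Return indices of updates whose L2 norm exceeds ratio times the median norm."""
    norms = np.array([np.linalg.norm(u) for u in updates])
    ceiling = ratio * np.median(norms)
    return [i for i, n in enumerate(norms) if n > ceiling]

rng = np.random.default_rng(1)
updates = [rng.normal(scale=1.0, size=100) for _ in range(9)]
updates.append(rng.normal(scale=50.0, size=100))  # a simulated corrupted contribution
print(flag_oversized_updates(updates))            # flags the oversized update (index 9)
```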
17.11 Where to Start
17.11.1 Starter Project 1: AI Data Governance Policy and BAA Audit
What it is: A review of all current vendor agreements for AI tools handling PHI, assessing whether each BAA includes the provisions in Table 17.1, combined with a draft institutional AI data governance policy operationalizing the classification framework and honest broker requirements above.
Why now: Many AMC AI deployments occurred under standard enterprise agreements not specifically negotiated for AI use. A BAA audit surfaces contractual gaps before an adverse event. A data governance policy closes the structural gap that allowed non-compliant tools to be deployed.
How to execute: Legal and compliance review each active vendor agreement. Gaps are prioritized by data sensitivity and tool risk tier. Renegotiation is requested for high-priority gaps; tools with unresolvable gaps are candidates for replacement. The governance policy is drafted by legal and informatics jointly, reviewed by the AI Steering Committee, and published with a defined annual review cycle.
Buy vs. build: Legal and governance work. Commercial data governance platforms can maintain the inventory, but the policy analysis requires legal judgment.
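If the team wants a lightweight way to track audit findings, the inventory can be as simple as a set comparison between each agreement's verified provisions and the mandatory list in Table 17.1. The vendors and findings in the sketch below are hypothetical.

```python
# Provision keys mirror Table 17.1; vendor names and findings are hypothetical.
REQUIRED_PROVISIONS = {
    "no_training_on_customer_data",
    "zero_data_retention_for_prompts",
    "institutional_output_ownership",
    "right_to_audit",
    "algorithmic_change_notification",
}

vendor_agreements = {
    "scribe_vendor_a": {"no_training_on_customer_data", "zero_data_retention_for_prompts"},
    "chat_vendor_b":   set(REQUIRED_PROVISIONS),
}

for vendor, provisions in vendor_agreements.items():
    gaps = REQUIRED_PROVISIONS - provisions
    status = "compliant" if not gaps else "gaps: " + ", ".join(sorted(gaps))
    print(f"{vendor}: {status}")
```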
17.11.2 Starter Project 2: Institutional AI Data Enclave
What it is: A technically isolated compute environment in which regulated AMC data can be used for AI development and experimentation without transmitting PHI to external services.
Why now: Demand for AI development access to clinical data is increasing, and case-by-case IRB and data governance review does not scale. A standing enclave with pre-approved data protocols, audit logging, and network isolation provides a faster path for approved researchers while preventing accidental data egress.
How to execute: Build on existing research computing infrastructure with added network controls and an approved tool list. Pre-approve OMOP-formatted clinical datasets at appropriate de-identification levels. Establish a lightweight application process for enclave access that replaces per-project data governance review for compliant use cases.
Buy vs. build: Build infrastructure on HIPAA-eligible cloud or on-premises HPC. The institutional work is in governance design, access controls, and audit processes layered on top of the compute environment.