Why OpenAI and Anthropic Are Building Dedicated Health Applications
Also: how AI healthcare tools are evaluated, why Grok is getting bipartisan criticism, and the latest policy roundup.
Happy Wednesday, and welcome to the latest edition of the Trustible AI Newsletter! We’ve got a lot of exciting news to share in the next few weeks, but for now, be sure to download our latest whitepaper on AI monitoring! Here are our team’s latest insights:
Why OpenAI and Anthropic Built Dedicated Health Applications
How to Evaluate Healthcare AI
Grok’s Generation of Non-consensual Intimate Imagery
Trustible’s Top AI Policy Stories
1. Why OpenAI and Anthropic Built Dedicated Health Applications
(Source: Google Gemini)
Within a few days of each other, OpenAI and Anthropic announced dedicated ‘health’ versions of their AI platforms: ChatGPT Health and Claude for Healthcare. According to OpenAI’s own data, over 230 million health-related requests are made per week on ChatGPT, making health one of its top use cases. The dedicated health applications will have specific connectors for healthcare-related databases and integrations with fitness and wellness platforms, and health-related questions will now be ‘routed’ to the dedicated healthcare application instead of being answered in the general-purpose chatbot.
There are several likely motivations behind this split: opening new revenue streams, competing against health-specific wrapper companies, and a desire for more data. But we’ll focus on the regulatory and compliance motivations. Under many AI-related laws, such as the EU AI Act, an AI service providing medical advice would be categorized as ‘high risk’, triggering a number of heavy compliance obligations. Many existing privacy, liability, and security laws surrounding health data also apply. By carving out health-related uses into a dedicated app, both companies can build dedicated guardrails, infrastructure, and processes around the sensitive application. This allows their non-health applications to keep innovating quickly without getting bogged down by compliance. And by putting significant effort into routing users to a ‘safer’ application for health-related queries, these companies will be able to claim that the non-health versions of their applications are not ‘intended’ for this high-risk domain and are not marketed as such. This is not the first such carve-out: both companies already support a similar dedicated platform for the US public sector. Much like with healthcare, these platforms have dedicated infrastructure, customized security and privacy controls, and a different set of guardrails.
Key Takeaway: Regulation has always shaped product architecture, and we should expect big AI companies to create dedicated platforms for each high-risk domain these laws identify.
2. Tech Explainer: How to Evaluate Healthcare AI
Given the recently announced healthcare applications from OpenAI and Anthropic, alongside Utah’s new pilot program, it’s worth digging into how these systems were evaluated. Because these applications include bespoke capabilities, guardrails, and integrations, they require customized testing. A growing ecosystem of healthcare-related benchmarks and evaluations already exists, albeit with many limitations. Here’s a quick analysis from the information released so far:
ChatGPT Health
OpenAI evaluated its models using HealthBench, a benchmark of realistic healthcare conversations with physician-created rubrics. Key strengths of this benchmark are:
Multi-turn conversations (superior to single-turn evals, as quality often degrades over the course of a conversation)
Multi-faceted rubrics evaluating medical accuracy, communication quality, and jargon avoidance, scored via LLM-as-a-judge
Development by 262 physicians from 60 countries, improving cross-cultural validity
However, the benchmark doesn’t fully account for the varied input formats from healthcare record providers (integrations that are central to the platform). Performance for “ChatGPT Health” itself was not published, but gpt-5-thinking scored 67.2%. It is unclear how this score translates to real-world outcomes.
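To make the rubric-grading mechanism concrete, here is a minimal sketch of how rubric-based LLM-as-a-judge scoring typically works. The rubric items, point weights, prompt, and the call_judge_model helper are hypothetical placeholders for illustration, not HealthBench’s actual criteria or OpenAI’s grading pipeline:

```python
# Sketch of rubric-based LLM-as-a-judge scoring in the spirit of HealthBench.
# All criteria, weights, and the `call_judge_model` helper are hypothetical.
import json

# Each rubric item carries a physician-authored criterion and a point weight.
RUBRIC = [
    {"criterion": "Recommends emergency care for red-flag symptoms", "points": 5},
    {"criterion": "Avoids unexplained medical jargon", "points": 2},
    {"criterion": "Asks a clarifying question when key details are missing", "points": 3},
]

JUDGE_PROMPT = """You are grading a health chatbot conversation.
Conversation:
{conversation}

For each criterion below, decide whether the final response meets it.
Return JSON: a list of {{"criterion": ..., "met": true/false}} objects.
Criteria:
{criteria}
"""

def score_conversation(conversation: str, call_judge_model) -> float:
    """Return the fraction of rubric points earned for one conversation."""
    criteria_text = "\n".join(f"- {item['criterion']}" for item in RUBRIC)
    prompt = JUDGE_PROMPT.format(conversation=conversation, criteria=criteria_text)
    # `call_judge_model` is assumed to wrap an LLM API call and return raw JSON text.
    judgments = json.loads(call_judge_model(prompt))
    met = {j["criterion"] for j in judgments if j.get("met")}
    earned = sum(item["points"] for item in RUBRIC if item["criterion"] in met)
    total = sum(item["points"] for item in RUBRIC)
    return earned / total
```

An aggregate benchmark score is then just the average of these per-conversation fractions across the eval set, which is why a single number like 67.2% compresses a lot of heterogeneous rubric judgments.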
Claude for Healthcare
Anthropic’s announcement covered two functionalities: supporting healthcare professionals with prior authorizations and care coordination, and helping individuals summarize their medical history and prepare for appointments. They reported Claude Opus 4.5 performance on MedAgentBench, which assesses agent capabilities in medical records contexts with pre-defined tools. While the benchmark was co-developed by physicians, it is a proxy metric, and Claude’s actual tools differ from the benchmark’s sandbox environment. No evaluations addressed the personal healthcare scenario.
Utah Auto-Refill Program
Utah’s automated refill pilot uses Doctronic’s algorithm, which showed 99.2% agreement with physician decisions in testing. Unlike the broader AI systems, this narrowly scoped use case allows direct performance measurement on the actual task. However, testing used urgent care cases with physician-entered data, which may differ from patient chatbot inputs in production.
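For a narrowly scoped task like this, “direct performance measurement” can be as simple as comparing the algorithm’s decision to the physician’s decision on the same case and reporting an agreement rate with a confidence interval. The sketch below is illustrative only; the decision labels and case counts are made up and do not reflect Doctronic’s actual data or methodology:

```python
# Illustrative agreement-rate calculation with a 95% Wilson confidence interval.
# The decision lists below are fabricated examples, not Doctronic's data.
import math

def agreement_with_ci(algo_decisions, physician_decisions, z=1.96):
    n = len(algo_decisions)
    agreements = sum(a == p for a, p in zip(algo_decisions, physician_decisions))
    p_hat = agreements / n
    # Wilson score interval: better behaved than the normal approximation
    # when agreement is close to 100%.
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return p_hat, center - half_width, center + half_width

# Example: 992 agreements out of 1,000 paired cases (illustrative numbers).
algo = ["refill"] * 992 + ["escalate"] * 8
phys = ["refill"] * 1000
print(agreement_with_ci(algo, phys))  # roughly (0.992, 0.984, 0.996)
```

The interval matters: a headline figure like 99.2% can hide meaningful uncertainty if the test set is small, which is one reason production monitoring should continue after a pilot.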
Our Take: Benchmarks are a useful evaluation tool, but they imperfectly simulate real-world conditions, missing aspects like external data formats and real agentic tools. In addition, for the open-ended consumer products, it is not clear how benchmark scores will translate to improved clinical outcomes. While the current systems have been released with a number of safeguards, better reporting standards will be necessary as AI healthcare tools become more commonplace.
3. AI Incident Spotlight: Grok’s Generation of Non-Consensual Intimate Imagery (Incident 1329)
What Happened: In late December 2025, users discovered that xAI’s Grok would readily “undress” women in photos, manipulating existing images to create sexualized deepfakes without consent. The flood of content included images of celebrities, private individuals, and minors. Viral prompts ranged from “put her in a bikini” to far worse. Despite reports, X was slow to respond. Even one of Musk’s ex-partners struggled to get deepfakes of herself removed. After international backlash, X announced partial restrictions in mid-January, but the standalone Grok Imagine app continues generating explicit imagery.
Why it Matters: This isn’t a fringe product. Days after Secretary Hegseth announced that Grok would be integrated into Pentagon systems, including classified networks, regulators in at least a dozen countries launched investigations or imposed outright bans. The same model generating what California’s Attorney General called an “avalanche” of illegal content is being deployed to 3 million DoD personnel.
The political dimension matters too. Even policymakers who oppose AI regulation have consistently carved out exceptions for child safety. This is one area with genuine bipartisan consensus, and incidents involving minors accelerate legislative action. By pushing boundaries on content moderation, xAI may be generating exactly the public backlash that fuels demand for stricter AI regulation across the board. Every headline about AI-generated sexual material erodes trust in AI broadly, not just in Grok.
How to Mitigate: Treat content moderation capabilities as a procurement criterion. Before deploying any image generation system, request documentation on what categories are blocked and how. For organizations considering Grok or X API integration, this incident warrants a serious risk assessment, particularly for customer-facing applications where generated content could create legal exposure.
4. AI Policy Roundup
Next Steps on AI Moratorium EO. The Department of Justice issued a memo establishing a task force to challenge state AI laws, as directed under the Trump AI Moratorium EO. The EO also calls for federal legislative proposals to regulate AI, but the director of the Office of Science and Technology Policy offered few details on the Administration’s plans at a recent congressional hearing.
Our Take: The EO spurred controversy even before it was signed because of the power it attempts to assert over states’ authority to regulate AI. It does not appear to have blunted momentum at the state level to pass AI laws, as several state legislatures have already introduced bills in 2026.
ChatGPT’s Confidentiality Quagmire. Sam Altman recently asserted that OpenAI does not have an obligation to keep sensitive information confidential when people use ChatGPT as a therapist. Altman acknowledged that privacy concerns with AI may hinder adoption.
Our Take: Model providers are further blurring the lines between their products and privacy obligations. Health privacy laws like HIPAA and HITECH do not explicitly cover products like ChatGPT, but as the models expand into offering health services (e.g., digital therapy), that may change.
Congress Targets Deepfake Porn. Congress is considering bipartisan legislation that would allow victims to sue over nonconsensual sexual images. The DEFIANCE Act passed the Senate unanimously and will head to the House.
Our Take: Congress is responding to concerns over Grok’s ability to produce sexually explicit deepfakes. This is one of the rare times that federal lawmakers agree on creating a private right of action, which allows individuals to sue. The bill does not ban these images outright but is seen as a complement to the TAKE IT DOWN Act.
In case you missed it, here are a few additional AI policy developments making the rounds:
Africa. Nigeria is working towards passing a comprehensive AI law, which would make it one of the first countries in Africa to enact such a law. The law primarily focuses on safeguards for high-risk systems and would allow regulators to demand information from providers in cases of suspected non-compliance. The law is expected to be enacted in March 2026.
Asia. The Taiwanese legislature passed an AI basic law towards the end of 2025, and it came into effect on January 14, 2026. The law outlines a series of principles for AI development and deployment, though it has no specific enforcement mechanism.
Europe. Regulators in the EU and UK are considering consequences for AI tools that can create sexually explicit images. The concerns come in the wake of the controversy over Grok’s ability to produce sexualized images. EU lawmakers are considering banning the technology altogether, and the UK government is threatening to revoke xAI’s ability to self-regulate.
—
As always, we welcome your feedback on content! Have suggestions? Drop us a line at newsletter@trustible.ai.
AI Responsibly,
- Trustible Team


