@chunhualiao
Created April 6, 2025 15:56

Thanks! I’ll look into the best publicly accessible AI models (including both free and paid/public APIs) that currently perform best in medical diagnostics and question answering, especially for text and multi-modal inputs like images or charts.

I’ll compare their accuracy, strengths, limitations, access models, and pricing where applicable. I’ll get back to you shortly with a breakdown of which models lead in each area.

State-of-the-Art AI Models for Medical Diagnostics and Q&A

Overview

Advances in large language models (LLMs) have led to AI systems that can answer medical questions and even assist in diagnostic reasoning. Recent models achieve exam-level performance on medical licensing questions, a dramatic leap from just a few years ago (Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments | Scientific Reports). Both general-purpose models (like OpenAI's GPT series) and domain-tuned models (from Google, startups, and open-source communities) are pushing accuracy higher while striving to remain safe and reliable for clinical use. Many top performers now approach physician-level expertise on benchmark questions (Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts - PMC), though real-world deployment still requires caution. Crucially, newer systems are also exploring multi-modal inputs – analyzing not just text but images (e.g. X-rays, pathology slides) or other data – to support richer diagnostic use cases. The following sections break down the leading publicly accessible models, comparing their capabilities, performance, and accessibility (free vs. paid), with a focus on diagnostic and general medical question answering (as opposed to research literature tasks).

Figure 1: Rapid improvement in AI medical Q&A performance. Panel (a) shows the accuracy of various models on the USMLE-style MedQA benchmark over time, rising from ~30–50% in early models to 86.5% with Google's Med-PaLM 2, and further to 90–91% with GPT-4 and Google's latest Med-Gemini in 2024 (Fig. 1: Med-PaLM 2 performance on MultiMedQA | Nature Medicine) (Advancing medical AI with Med-Gemini). This surpasses the ~60% accuracy of GPT-3.5 (ChatGPT) (Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments | Scientific Reports) and earlier biomedical models. Panel (b) indicates Med-PaLM 2's answers were often rated higher quality than physician-written answers on multiple evaluation axes by expert raters (reflecting factors like consensus, comprehension, recall, and reasoning), while maintaining a low likelihood of harmful or biased content (Toward expert-level medical question answering with large language models | Nature Medicine). These trends highlight how quickly medical AI models are approaching expert-level performance.

General-Purpose LLMs in Medicine

OpenAI GPT-4

OpenAI's GPT-4 is a large general-purpose model that has demonstrated outstanding medical knowledge and reasoning. Though not specifically trained on medical data, GPT-4 achieved an ~86% score on USMLE exam-style questions, placing it in the 90th+ percentile of test-takers (Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments | Scientific Reports). This is a huge jump over its predecessor GPT-3.5, which scored ~60% and barely passed (Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments | Scientific Reports). In head-to-head evaluations, GPT-4's answers have often matched or exceeded physicians' answers in accuracy and thoroughness. For example, a study of 251 consumer health questions in cardiology found GPT-4's advice was rated as accurate as expert physicians' and even slightly preferred in blinded comparisons (Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts - PMC). GPT-4 can also draft detailed clinical reports and reasoning. That said, it isn't flawless – e.g. it failed to find the correct diagnosis in ~20% of complex cases in one trial (Evaluating GPT-4 as an academic support tool for clinicians), and it may occasionally produce plausible but incorrect statements ("hallucinations"). OpenAI has integrated some guardrails and it generally refuses overtly harmful instructions, but medical users must still verify its outputs.

Multi-modal abilities: Notably, GPT-4 introduced a vision-enabled version (sometimes called GPT-4V) that can accept images (e.g. an X-ray or ECG chart) along with text. This opens the door for multi-modal diagnostics – however, current performance is mixed. One study found text-only GPT-4 (given a radiology case description) correctly diagnosed 43% of musculoskeletal radiology cases, similar to a radiology resident's 41%, but the same model given the actual images (GPT-4V) got only 8% correct (ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology - PMC). In other words, GPT-4's general vision skills are not yet reliable for specialized medical imagery, and it often misinterprets details (ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology - PMC). Users report that GPT-4V can describe straightforward images (e.g. an EKG strip or a rash photo) in general terms, but it may miss subtle clinical findings (Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data ...) (Exploring the clinical applicability of GPT-4 with vision - ScienceDirect). Therefore, while GPT-4 with vision is accessible and promising for multi-modal input, it should be used with extreme caution for image-based diagnosis at present.
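For readers who want to experiment with this multi-modal input themselves, the snippet below is a minimal sketch of sending a text question plus an image through OpenAI's Chat Completions API using the official Python SDK. The model name, prompt, and image file are illustrative placeholders rather than recommendations from this comparison, and any output should be treated as unverified.

```python
# Minimal sketch: text prompt plus an image sent to a vision-capable GPT-4-class
# model via OpenAI's Chat Completions API (Python SDK). Model name, prompt, and
# image file are illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical local image; encode it as a base64 data URL.
with open("chest_xray.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable GPT-4-class model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe any notable findings in this chest X-ray. "
                         "This is for education only, not a diagnosis."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)
```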

Other General Models: Anthropic’s Claude 2 and Google’s Bard (PaLM 2) are two other general AI chat models that are publicly accessible (Claude via API or interfaces like Poe, and Bard as a free Google service). These have conversational ability but relatively lower medical exam performance. Studies have found Bard scoring well below passing on medical questions (e.g. ~16% on a med entrance test vs. GPT-4’s 43%) (JMIR Medical Education - Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard). Claude 2’s medical accuracy is not as extensively published, but in general GPT-4 remains the strongest general model for medical Q&A. Still, such models can be useful for non-critical medical queries or as adjuncts. They also typically cost less: Bard is free, and Claude’s API pricing is comparable to or lower than OpenAI’s. However, for high-stakes diagnostic reasoning, GPT-4 is currently the go-to among general models, given its demonstrated exam-level knowledge.

Specialized Medical LLMs

To further improve accuracy and reliability, several organizations have created medical-domain LLMs by fine-tuning on medical knowledge and clinical texts. These models are explicitly geared towards diagnostics and medical Q&A. The most prominent examples – Google's Med-PaLM 2 and Med-Gemini, Jivi MedX, and John Snow Labs' healthcare LLMs – are compared in the table further below.

Open-Source and Other Notable Models

Beyond the above, the open-source community has produced medical LLMs by fine-tuning open models (like Meta's LLaMA or smaller GPT-style models) on medical text. Examples include BioGPT (Microsoft), PubMedGPT, MedAlpaca, PMC-LLaMA, and others. These typically lag behind the big proprietary models in raw accuracy, but are improving steadily (Toward expert-level medical question answering with large language models | Nature Medicine). For instance, early biomedical transformers achieved ~45–50% on the MedQA exam, and newer open models (~7B parameters) can reach somewhere in the 60–70% range. The Hugging Face medical leaderboard noted that some 7B open models (e.g. Stanford's Alpaca Medicine, UPenn's Med 🦙 (MedLLaMA)) show competitive performance on certain tasks despite their smaller size (The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare). Their appeal lies in being free and customizable – organizations can fine-tune them on private data, and deployment can be local (addressing data privacy concerns). However, no open model yet matches the top-tier closed models on general medical expertise. Using them in practice would require significant validation and likely task-specific tuning. There are also specialist AI solutions (not LLM-based) for specific diagnostics – for example, image-based models like CheXNet for chest X-rays or dermatology lesion classifiers – but those are beyond the scope of "general question-answering" capabilities. It's worth noting that researchers are working on combining such specialist models with LLMs (for example, an LLM that calls a radiology CNN for an image, then explains the result). Some open multi-modal projects are exploring this, but they remain experimental.
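As a concrete illustration of the "free and customizable" point, the sketch below loads a LLaMA-derived medical fine-tune locally with the Hugging Face transformers library. The checkpoint name medalpaca/medalpaca-7b is an assumed example; substitute whichever model from the Open Medical-LLM Leaderboard fits your hardware and licensing constraints.

```python
# Sketch: run an open-source medical LLM locally with Hugging Face transformers.
# The checkpoint name below is an assumed example of a LLaMA-based medical
# fine-tune; swap in any model that fits your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "medalpaca/medalpaca-7b"  # assumed example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # halves memory on a GPU; use float32 on CPU
    device_map="auto",           # requires the `accelerate` package
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = (
    "Question: A 55-year-old presents with crushing substernal chest pain "
    "radiating to the left arm. What is the most likely diagnosis?\nAnswer:"
)
output = generator(prompt, max_new_tokens=128, do_sample=False)
print(output[0]["generated_text"])
```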

In summary, open-source medical models are improving and can be useful for non-critical or research applications, especially if one cannot use paid APIs. Yet for highest accuracy and broad medical knowledge, the proprietary models (GPT-4, Med-PaLM, etc.) still have an edge in 2025.

Performance, Accuracy & Safety Comparison

The table below compares key attributes of several leading models relevant to medical Q&A and diagnostics:

| Model | Modalities | Performance (Medical QA Benchmarks) | Notable Capabilities | Access / Pricing |
|---|---|---|---|---|
| OpenAI GPT-4 | Text; images (limited) | ~86% accuracy on USMLE-style questions (Scientific Reports), near top tier; comparable to physicians on many answers (PMC). Struggles with direct image diagnosis (8% on one musculoskeletal imaging test) (PMC). | Excellent general reasoning and vast knowledge; passes medical exams; can analyze case text thoroughly. Limited vision capability for now. | API access (≈$0.03–0.06 per 1K tokens) (Scientific Reports); ChatGPT Plus $20/mo for the GUI. |
| Google Med-PaLM 2 | Text | 86.5% on USMLE-style MedQA (Nature Medicine); outperforms previous domain models. In physician ratings, its answers were often preferred over human doctors' (Nature Medicine). | Medical domain-tuned LLM by Google; high accuracy and aligned to clinical norms, with low rates of harmful content (Nature Medicine). Excels at complex medical Q&A. | Not publicly available (pilot use in select hospitals); no pricing (research model). |
| Google Med-Gemini | Text, images, data | 91.1% on USMLE-style MedQA (Advancing medical AI with Med-Gemini), currently the highest reported. Also sets the state of the art on image-based medical questions, well ahead of GPT-4. | Next-gen multimodal medical AI: can interpret 2D/3D scans, clinical images, even genomic data. Integrates web search; strong long-form reasoning. | Not publicly released (Google research prototype); likely to appear in future Google Health offerings. |
| Jivi MedX | Text | ~91.7 (out of 100) average across multiple medical exam benchmarks (insideAI News); ranked #1 on one leaderboard in 2024. | Startup-built medical LLM trained on a huge proprietary medical dataset. Aims for accurate and affordable diagnostic advice. | Expected via API/product (planned 2024 launch); pricing TBD (likely subscription or license). |
| John Snow Labs (Healthcare LLM) | Text | ~87.3 on a composite medical QA benchmark (John Snow Labs), above GPT-4's recorded score; also offers smaller models (7B, 3B) with best-in-class size-to-accuracy ratios. | Enterprise-grade medical LLM with an emphasis on clinical guidelines and factual accuracy. Can be deployed privately; supports custom tuning. | Commercial software (contact for pricing; typically an enterprise license). On-prem or cloud API via the John Snow Labs platform. |
| Open-source medical LLMs (e.g. MedAlpaca, OpenBioLM) | Text (some experimental vision) | Variable and generally lower: older BioGPT ~50% on USMLE-style questions; newer 7B models ~60–70% on some benchmarks (Open Medical-LLM Leaderboard). Improving with each iteration. | Community-driven models fine-tuned on medical text. Fully transparent and customizable. Require careful validation; may lack rigorous safety tuning. | Free to use (open source); the cost is computing resources. No usage fees, but no formal support or guarantees. |

Table Notes: "Modalities" indicates whether the model handles only text input or can also analyze images and other data. Performance is summarized from benchmark studies – note these focus on multiple-choice Q&A accuracy and may not fully capture real clinical scenario performance. Each model's capabilities and alignment measures (for safety) also differ: e.g., Med-PaLM 2 underwent physician review for harm reduction (Toward expert-level medical question answering with large language models | Nature Medicine), while open models may not have such guardrails by default. Access/Pricing distinguishes free vs. paid: GPT-4, Claude, etc. require payment or subscriptions, whereas Bard and some open models are free. Enterprise solutions like John Snow Labs come with support (and costs) but offer integration into clinical workflows.

Safety and Reliability in Practice

While the accuracy numbers are impressive, real-world reliability is an equally important metric. Medical AI must be used carefully because errors can have serious consequences. Key considerations include:

  • Hallucinations and Errors: Even top models sometimes produce incorrect statements with high confidence. For example, multimodal LLMs might hallucinate seeing abnormalities on a perfectly normal image (Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation - PMC). Pure text models can occasionally misinterpret a case or recall outdated guidelines. A systematic review found that across many studies, GPT-4 was the most evaluated model and generally showed high accuracy, but no model is error-free (A systematic review and meta-analysis of diagnostic performance ...). Thus, any AI-generated diagnosis or advice should be verified by human clinicians. These AIs are best used as assistive tools to double-check reasoning or provide a second opinion, rather than as autonomous decision-makers.

  • Safety Tuning: The specialized models tend to have more rigorous safety checks. Google's Med-PaLM project, for instance, evaluated answers on potential harm, inappropriate content, and bias – Med-PaLM 2 had 90.6% of its answers rated as low risk of harm by clinicians, an improvement over the prior model (Toward expert-level medical question answering with large language models | Nature Medicine). OpenAI similarly improved GPT-4's alignment compared to earlier GPT-3.5, and it is less likely to give unethical or dangerous advice (it usually refuses requests that would violate medical ethics, like giving dosing for illegal drugs). Nonetheless, guardrails can be outsmarted or misused. The user should always apply medical judgment before acting on AI output. If possible, use these models in a setting that allows sourcing (e.g., retrieving cited medical literature or guidelines to back up answers). Some implementations combine LLMs with information retrieval from medical databases to increase factual accuracy (Toward expert-level medical question answering with large language models | Nature Medicine); a minimal sketch of this retrieval-augmented pattern follows this list.

  • Regulatory and Ethical Use: As of 2025, no AI model is FDA-approved as an autonomous diagnostic device in general medicine. However, they can be used under clinician supervision. Privacy is a concern too – using a cloud AI on real patient data may violate HIPAA or other regulations if the service isn't compliant. That's why solutions like John Snow Labs (on-premise deployment) or open-source local models might be chosen by hospitals for data-sensitive tasks. If you use a model via an API, always ensure that patient data is de-identified or that a Business Associate Agreement is in place. Safety also means knowing the model's limits: for instance, an LLM might provide a decent differential diagnosis for common symptoms, but it lacks real clinical exam capability and may not know subtle patient-specific factors.
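The retrieval-augmented pattern mentioned in the Safety Tuning point can be sketched as follows. This is a toy example under stated assumptions: the "retrieval" step is a hard-coded snippet store with naive keyword scoring standing in for a real guideline or literature index, and the model name is illustrative.

```python
# Toy sketch of grounding an LLM answer in retrieved reference text. The
# retrieval step is a placeholder; a real system would query a vetted
# guideline or literature database.
from openai import OpenAI

GUIDELINE_SNIPPETS = [
    "Community-acquired pneumonia in low-risk adults is commonly treated "
    "with amoxicillin or a macrolide; severity scores guide admission.",
    "For suspected pulmonary embolism, use a validated score (e.g. Wells) "
    "plus D-dimer before imaging in low-probability patients.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Naive keyword-overlap retrieval; stand-in for a real search index."""
    scored = sorted(
        GUIDELINE_SNIPPETS,
        key=lambda s: len(set(question.lower().split()) & set(s.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer_with_sources(question: str) -> str:
    context = "\n".join(retrieve(question))
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the provided reference text. "
                        "If the reference does not cover the question, say so."},
            {"role": "user",
             "content": f"Reference:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer_with_sources("How is low-risk community-acquired pneumonia treated?"))
```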

In real-world pilots, doctors have found these AIs helpful for generating suggestions and educational explanations. A study in an emergency department showed GPT-4 could suggest the correct diagnosis more often than junior physicians for triage cases (ChatGPT With GPT-4 Outperforms Emergency Department ...), and another report found GPT-4 improved doctors' management plans in complex cases (GPT-4 gives physicians an edge in complex case management). But importantly, the final decisions were made by the human clinicians, who used the AI as a reference. The bottom line: the best models can augment medical decision-making and knowledge retrieval, but they are not infallible. Proper oversight, verification, and complementary use of traditional diagnostic methods remain essential.

Recommendations

For those looking to leverage AI for medical diagnostics or Q&A today, here are some recommendations:

  • For best performance: OpenAI's GPT-4 is currently the most accessible top-tier model for medical reasoning. It offers near-expert accuracy on general medical questions (Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments | Scientific Reports) and is available via API and ChatGPT with relatively easy setup. If you have a budget for API calls or a ChatGPT Plus subscription, GPT-4 is a strong choice for state-of-the-art general medical AI assistance. Just remember to use it judiciously and double-check critical outputs. Google's Med-PaLM 2 is similarly strong, but since it is not openly available, it is not an option for most users yet (keep an eye on Google's health AI offerings in case that changes).

  • Multi-modal needs: If you require image analysis (like interpreting scans or photos), be cautious. GPT-4 with vision can handle simple tasks but is not reliable enough for serious image diagnostics (ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology - PMC). In radiology or pathology, specialized AI (FDA-approved algorithms for detection) or a human specialist is still the gold standard. That said, research models like Med-Gemini show what's coming – soon we may have APIs that can take an entire patient record (text, labs, images) and provide comprehensive answers. For now, one practical approach is a hybrid: use separate tools for images (e.g. a radiology AI that outputs findings) and feed those findings into an LLM for explanation or Q&A (see the sketch after this list). No single public model yet does it all.

  • Free vs Paid: If budget is a concern, open-source models can be explored. For example, a fine-tuned LLaMA model (like "Med-Alpaca") can be run locally at zero API cost. Just be aware the accuracy will be significantly lower – they might answer straightforward questions well but falter on complex cases. Free general models like Bard are improving but, as noted, still underperform on medical specifics (JMIR Medical Education - Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard). Using the free ChatGPT (GPT-3.5) is an option for basic queries, but it may miss nuances and even make mistakes a medical student wouldn't (Comparing ChatGPT and GPT-4 performance in USMLE soft skill ...). In our assessment, the reliability gap between GPT-3.5 and GPT-4 is large in medicine, so investing in GPT-4 or a similarly capable model is worthwhile if you need serious diagnostic support.

  • Enterprise solutions: If you represent a healthcare institution, consider engaging with groups like John Snow Labs or emerging vendors like Jivi. These companies offer more bespoke solutions – e.g., models that can be deployed within your data firewall, or that come pre-loaded with clinical guidelines and formulary knowledge. They may also offer fine-tuning on your own data (like past cases or EHR notes) which can boost relevance. The trade-off is cost and the need to vet the vendor’s claims. Always request validation results specific to your use-case (for example, how does the model perform on your specialty’s questions, not just generic benchmarks).
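The hybrid image-plus-LLM approach suggested in the multi-modal recommendation above might look like the sketch below: a dedicated, validated imaging model produces structured findings, and the LLM is only asked to explain them rather than read the image itself. The imaging step here is a placeholder stub and the model name is illustrative; this is a pattern sketch, not a clinical implementation.

```python
# Sketch of the hybrid pattern: a dedicated imaging model produces findings,
# and an LLM only explains them in plain language. The imaging step is a stub;
# in practice it would be a validated radiology model, not this placeholder.
from openai import OpenAI

def run_imaging_model(image_path: str) -> dict:
    """Placeholder for a dedicated radiology model's structured output."""
    return {
        "study": "chest X-ray, PA view",
        "findings": ["right lower lobe consolidation"],
        "confidence": 0.87,
    }

def explain_findings(findings: dict) -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "You explain imaging findings to a clinician. Do not "
                        "add findings that are not in the input."},
            {"role": "user", "content": f"Structured findings: {findings}"},
        ],
    )
    return resp.choices[0].message.content

report = run_imaging_model("patient_042_cxr.png")
print(explain_findings(report))
```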

In conclusion, 2024–2025 has been a breakthrough period for medical AI models. General LLMs like GPT-4 have essentially become "consultants" that can answer a wide range of medical questions with a high level of competence (Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts - PMC), while new specialized models are pushing accuracy even further and incorporating multimodal data (Advancing medical AI with Med-Gemini). For now, GPT-4 remains the workhorse solution for many, given its combination of performance and availability. But on the horizon, systems like Google's Med-Gemini and others promise even more holistic diagnostic AI. Regardless of choice, the user should remain at the center: use these tools to inform and enhance your own clinical reasoning, but do not rely on them blindly. With proper use, AI models can improve efficiency (e.g. quick answers to clinical questions, drafting documentation) and broaden access to medical knowledge – ultimately supporting better patient care when integrated responsibly.

Sources: The information above is drawn from recent studies and benchmarks of medical AI models, including Nature Medicine and Scientific Reports articles evaluating Med-PaLM and GPT-4 (Fig. 1: Med-PaLM 2 performance on MultiMedQA | Nature Medicine) (Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments | Scientific Reports), conference publications on multimodal medical AI (Advancing medical AI with Med-Gemini), and industry reports of new models like Jivi MedX (AI Startup Jivi's LLM Beats OpenAI's GPT-4 & Google's Med-PaLM 2 in Answering Medical Questions - insideAI News) and John Snow Labs' suite (John Snow Labs Achieves New State-of-the-Art Medical LLM Accuracy Benchmarks Outperforming GPT-4, Med-PaLM2, and Hundreds of Others - John Snow Labs). Performance metrics (e.g. USMLE scores) and safety findings (like risk-of-harm assessments) are cited inline. All these sources collectively indicate which models are currently top-of-class and how they differ in capability and accessibility.
