Triaging Patients With Artificial Intelligence for Respiratory Symptoms in Primary Care to Improve Patient Outcomes: A Retrospective Diagnostic Accuracy Study

Abstract

PURPOSE Respiratory symptoms are the most common presenting complaint in primary care. Often these symptoms are self-resolving, but they can indicate a severe illness. With increasing physician workload and health care costs, triaging patients before in-person consultations would be helpful, possibly offering low-risk patients other means of communication. The objective of this study was to train a machine learning model to triage patients with respiratory symptoms before visiting a primary care clinic and examine patient outcomes in the context of the triage.

METHODS We trained a machine learning model using only clinical features available before a medical visit. Clinical text notes were extracted from 1,500 records for patients who received 1 of 7 International Classification of Diseases 10th Revision codes (J00, J10, J11, J15, J20, J44, J45). All primary care clinics in the Reykjavík area of Iceland were included. The model scored patients in 2 extrinsic data sets and divided them into 10 risk groups (higher values having greater risk). We analyzed selected outcomes in each group.

RESULTS Risk groups 1 through 5 consisted of younger patients with lower C-reactive protein values, re-evaluation rates in primary and emergency care, antibiotic prescription rates, chest x-ray (CXR) referrals, and CXRs with signs of pneumonia, compared with groups 6 through 10. Groups 1 through 5 had no CXRs with signs of pneumonia or diagnosis of pneumonia by a physician.

CONCLUSIONS The model triaged patients in line with expected outcomes. The model can reduce the number of CXR referrals by eliminating them in risk groups 1 through 5, thus decreasing clinically insignificant incidentaloma findings without input from clinicians.

INTRODUCTION

Health care costs have steadily increased in recent decades.¹ General practitioners face a greater number of patients,^2,3 with more comorbidities⁴ and demands,⁵ and diagnostic test orders have increased substantially.⁶ Around 20% of patient visits to general practitioners stem from self-resolving symptoms,⁷ and up to 72% of patient visits are due to acute respiratory symptoms.⁸ Overuse and misuse of diagnostic tests is a well-known problem in primary care^9,10 that increases incidental findings.^11,12 The same applies to antibiotic prescribing,¹³ especially for respiratory tract infections,¹⁴ leading to increased bacterial resistance.¹⁵ The causes for clinical resource misapplication are multifactorial, but patient demands, human biases, and time pressure play substantial roles.^16,17

Machine learning models (MLMs) are thought to be similar or superior to physicians in multiple clinical tasks.^18-27 Patient triage using MLMs is reportedly comparable to triage by physicians.^28,29 Research in tertiary settings showed MLMs to be superior to physicians at estimating patient risk when ordering diagnostic tests.³⁰ Clinical guidelines and scoring systems can standardize diagnosis and treatment and improve the quality of care while reducing costs,^31-34 but remain underused.^35-37 Guideline applicability, usability, and time scarcity are cited as reasons why.^38,39

Structured triage with standardized questionnaires is likely safer than unstructured triage.⁴⁰ Assistance from a clinical decision support system increases triage quality.⁴¹ By design, MLMs use standardized inputs, making them a good fit for integration into a clinical decision support system, and such systems have been shown to reduce health care costs by 14%.⁴² Triaging patients at the time of appointment scheduling is even more important since the COVID-19 pandemic. Methods to identify patients well suited to virtual consultations are needed as they now make up 13% to 17% of consultations across all specialties.⁴³

Clinical text notes (CTNs) are a written record of a physician’s interpretation of the patient’s symptoms and signs, reasons for clinical decisions made during the consultation, and actions taken (eg, imaging referrals, prescriptions written). The objective of this study was to train a patient triage MLM on symptoms and signs (clinical features) of patients with respiratory symptoms, using only features the patient could be asked about in order to mimic previsit triage. We extracted the clinical features from CTNs.

This MLM, which we refer to as a respiratory symptom triage model (RSTM), divides patients into 10 risk groups (with increasing risk from groups 1 to 10) based on a score. We validated the RSTM by examining patient outcomes, stratified by risk group, on intrinsic data, and in 2 separate extrinsic (unseen) data sets. Evaluating MLM performance in a medical context is complex, and it is often unclear which benchmarks to use. Many reports benchmark MLMs against physicians’ diagnoses, which are affected by human biases and errors.⁴⁴ Benchmarking the RSTM against multiple patient outcomes likely serves as a better performance metric, and, to our knowledge, no reports have examined MLM triage performance in this way.

METHODS

In this retrospective diagnostic accuracy study, we obtained 44,007 medical records of 23,819 patients from a medical database common to all primary care clinics in the Reykjavík area of Iceland. Each record contained a CTN with diagnostic referrals and results, diagnoses, and prescriptions written.

The selection criteria were patients over the age of 18 years who were diagnosed by a physician from January 1, 2016 through December 31, 2018 with 1 of 7 International Classification of Diseases 10th Revision (ICD-10) codes: J00 (common cold), J10 and J11 (influenza), J15 (bacterial pneumonia), J20 (acute bronchitis), J44 (chronic obstructive pulmonary disease [COPD]), and J45 (asthma), including subgroups. We removed CTNs containing fewer than 250 characters, resulting in 17,177 CTNs included in this study.
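The selection criteria above can be read as a simple record filter. The following is a minimal sketch only: the `Record` layout and the `is_eligible` helper are hypothetical names introduced for illustration, the diagnosis date window is omitted for brevity, and matching ICD-10 subgroups by code prefix is an assumption about how "including subgroups" was applied.

```python
from dataclasses import dataclass

# ICD-10 code prefixes named in the selection criteria; subgroups such as
# "J44.1" are matched by prefix (an assumption made for this sketch).
INCLUDED_PREFIXES = ("J00", "J10", "J11", "J15", "J20", "J44", "J45")
MIN_NOTE_CHARS = 250


@dataclass
class Record:  # hypothetical record layout, for illustration only
    age: int
    icd10: str
    note: str


def is_eligible(record: Record) -> bool:
    """Apply the age, diagnosis-code, and note-length criteria."""
    return (
        record.age > 18
        and record.icd10.upper().startswith(INCLUDED_PREFIXES)
        and len(record.note) >= MIN_NOTE_CHARS
    )


records = [
    Record(age=42, icd10="J44.1", note="x" * 300),  # kept
    Record(age=17, icd10="J00", note="x" * 300),    # excluded: age
    Record(age=60, icd10="J15.9", note="short"),    # excluded: note too short
    Record(age=35, icd10="I10", note="x" * 300),    # excluded: code not listed
]
eligible = [r for r in records if is_eligible(r)]
```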

In our previous work, we trained a deep neural network to extract clinical features from CTNs,⁴⁵ which we call the clinical feature extraction model. We randomly selected 7,000 CTNs as input to the clinical feature extraction model and discarded CTNs with fewer than 8 clinical features, increasing the odds of having enough clinical features in each for the RSTM. The clinical feature extraction model also extracted presenting complaints, which we used to limit inclusion to only patients presenting with acute or subacute respiratory symptoms. The complete list of presenting complaints is in Supplemental Table 1. We removed 95 CTNs from follow-up consultations and 223 CTNs with multiple topics to include only CTNs in which patients presented with a new respiratory complaint as a single complaint. Thus, for patients diagnosed with COPD and asthma, only cases of exacerbation were included.

Applying these filters reduced the set of 7,000 CTNs to 2,942. Of those, 2,000 CTNs were randomly selected and manually annotated by a single physician. As annotating CTNs is costly, the final number of 2,000 CTNs was limited by funding. We split the resulting data set randomly into training (75%, 1,500 CTNs) and test (25%, 500 CTNs) sets. We also annotated an additional 664 CTNs with influenza ICD-10 codes (J10, J11) as a second test set to examine the generalizability of the RSTM further because there were no influenza patients in the training data set.
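The random 75%/25% split described above can be sketched as follows. This is an illustration under assumptions: the helper name `split_notes`, the use of integer note identifiers, and the fixed seed are not taken from the study.

```python
import random


def split_notes(note_ids, train_fraction=0.75, seed=0):
    """Shuffle note identifiers and split them into training and test sets."""
    shuffled = list(note_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 2,000 annotated notes yield 1,500 training and 500 test notes.
train_ids, test_ids = split_notes(range(2000))
```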

Subsequently, we trained the RSTM on features that patients can be asked about and measure themselves, imitating a setting where triage takes place before a medical consultation. We chose the input features that a web-based triage system could obtain directly from a patient without other human assistance to ensure the model fits into a clinical workflow. The training objective was to predict the likelihood of a patient having a lower respiratory tract infection. We considered all diagnoses except J00 (common cold) to be a lower respiratory tract infection.
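The binary training target described above could be derived from the ICD-10 code along these lines. The function name `is_lower_rti` is hypothetical, and matching J00 by prefix is an assumption.

```python
def is_lower_rti(icd10_code: str) -> int:
    """Return 1 for every included diagnosis except J00 (common cold)."""
    return 0 if icd10_code.upper().startswith("J00") else 1


labels = [is_lower_rti(code) for code in ["J00", "J15.9", "J20.9", "J44.1"]]
```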

The RSTM had a single output: a score between 0 and 1, where patient scores approaching 1 have an increased probability of a lower respiratory tract infection diagnosis. We performed 25 repeats of a 4-fold stratified nested cross validation for hyperparameter search and intrinsic validation. We then trained the RSTM on the training data set with optimized hyperparameters before splitting patients in the test sets into 10 risk groups based on the score they received. The risk score interval for each group was 0.1, and we refer to groups 1 through 5 as the low-risk groups and 6 through 10 as the high-risk groups.
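The mapping from score to risk group can be illustrated with a short sketch. It assumes that each group covers a half-open interval of width 0.1 (so a score of exactly 0.5 falls in group 6) and that a score of exactly 1 is placed in group 10; the text does not state how boundary values were handled, and the function names are hypothetical.

```python
def risk_group(score: float) -> int:
    """Map a score in [0, 1] to a risk group from 1 to 10 (interval width 0.1)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # Boundary handling is an assumption: [0, 0.1) -> 1, ..., [0.9, 1.0] -> 10.
    return min(int(score * 10) + 1, 10)


def is_low_risk(score: float) -> bool:
    """Groups 1 through 5 are the low-risk groups."""
    return risk_group(score) <= 5


groups = [risk_group(s) for s in (0.0, 0.05, 0.49, 0.5, 1.0)]
```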

Annotation

The annotation method was inspired by researchers who applied similar annotations on medical text.¹⁹ It assigned binary and numerical values to clinical features, representing the presence or absence of signs and symptoms in the CTNs, as they were described in text. These values constituted the patient’s health state as described by a physician during the medical consultation when the CTN was written. A detailed description of the annotation process is in the Supplemental Appendix. We gave missing binary features the value of 0. Missing numerical features were replaced by randomly sampling from the normal distribution for that feature to reduce the odds of the model simply learning where features are missing, which would be more likely for a patient with less severe disease.
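The two missing-value rules above can be sketched as follows. The text does not specify how the normal distribution for a feature was parameterized; this sketch assumes the mean and standard deviation are estimated from the observed values of that feature. The helper names and the example values are hypothetical.

```python
import random
import statistics


def impute_binary(values):
    """Missing binary features (None) receive the value 0."""
    return [0 if v is None else v for v in values]


def impute_numeric(values, rng):
    """Replace missing numerical values with draws from a normal distribution.

    Assumption: the distribution is fitted to the observed values of the feature.
    """
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    return [rng.gauss(mu, sigma) if v is None else v for v in values]


rng = random.Random(0)  # seeded so the sketch is reproducible
cough = impute_binary([1, None, 0, None])
temperature = impute_numeric([37.2, None, 38.5, 36.9, None], rng)
```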

Model Architecture, Hyperparameter Optimization, and Training

The classifier we used was a type of logistic regression with Least Absolute Shrinkage and Selection Operator penalty. We used Shapley Additive Explanation⁴⁶ values to extract the 50 most impactful clinical features to use as input features into the RSTM to reduce the risk of a spurious correlation between the input and output data. A full list of the model clinical features can be found in Supplemental Table 2. We performed 25 repeats of a 4-fold stratified nested cross validation with grid search on the training set, to optimize the hyperparameters of the RSTM. Only class weight and the penalty (C parameter) were optimized, resulting in use of a balanced parameter for class weight and a C value of 0.1 during training. Then we trained the RSTM on the training data set before running inference on the patients in the test sets.
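A minimal scikit-learn sketch of this setup is shown below, using synthetic data in place of the annotated clinical features. Several details are assumptions rather than the study's configuration: the grid values other than the reported C of 0.1 and balanced class weight, the `liblinear` solver, the ROC AUC scoring metric, the inner-loop fold count, and the use of 2 repeats instead of 25 to keep the example fast. The Shapley-based feature selection step is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedStratifiedKFold,
    StratifiedKFold,
    cross_val_score,
)

# Synthetic stand-in for 50 selected clinical features (not study data).
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=10, random_state=0
)

# Logistic regression with an L1 (LASSO) penalty.
lasso_logreg = LogisticRegression(penalty="l1", solver="liblinear")

# Only C and class weight are searched, as in the text; grid values are assumed.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

# Inner loop selects hyperparameters; outer loop estimates performance.
inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=2, random_state=0)

search = GridSearchCV(lasso_logreg, param_grid, cv=inner_cv, scoring="roc_auc")
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

# Final model with the hyperparameters reported in the text.
final_model = LogisticRegression(
    penalty="l1", solver="liblinear", C=0.1, class_weight="balanced"
).fit(X, y)
risk_scores = final_model.predict_proba(X)[:, 1]  # score between 0 and 1
```

The positive-class probability from `predict_proba` plays the role of the single model output that is later binned into risk groups.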

Outcomes and Statistical Analysis

For each risk group, we examined the following outcomes: mean C-reactive protein (CRP) value, ICD-10 code distribution, the proportion of patients re-evaluated in primary care and emergency departments within 7 days, the proportion of patients referred for a CXR, CXRs with signs of pneumonia and incidentalomas, and proportion of patients receiving antibiotic prescriptions. C-reactive protein values were only available if the physician deemed it necessary and were extracted from the CTN, since rapid-CRP test results are saved only in the CTN and not in a structured database in Iceland. Referrals for CXRs and results are linked to a CTN, and the textual answer from the radiologist was manually annotated in the same manner as the CTNs. Except for incidentalomas and ICD-10 codes, a positive or a higher outcome value indicated more severe disease for a given patient. Notes about consolidations, infiltrations, and pneumonia-like signs in the CXR’s text description were considered positive signs of pneumonia. All data sources were from the patients’ electronic health records. The 95% CIs were calculated by sorting the values for each outcome and calculating the 2.5% and 97.5% percentiles. We used a 2-sided Fisher’s exact test to calculate P values for binary variables and a 2-sided Mann-Whitney U test for continuous variables. We considered P < .05 to be significant. We implemented data analysis in Python (version 3.8) and trained and validated the RSTM with the scikit-learn library (0.22.1).⁴⁷
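The interval and test calculations described above might look like the following sketch, which uses NumPy and SciPy. The text names Python and scikit-learn but does not state which library performed the statistical tests, so the use of `scipy.stats` is an assumption. All numbers below are made up for illustration and are not study data.

```python
import numpy as np
from scipy.stats import fisher_exact, mannwhitneyu


def percentile_interval(values, lower=2.5, upper=97.5):
    """Sort the values and return the 2.5th and 97.5th percentiles."""
    lo, hi = np.percentile(np.sort(values), [lower, upper])
    return float(lo), float(hi)


# Hypothetical CRP values (continuous) in low-risk and high-risk groups.
crp_low = [3, 5, 8, 4, 6, 7, 9]
crp_high = [12, 30, 18, 45, 22, 16, 60]

# Hypothetical 2x2 table: rows = low/high risk, columns = antibiotics yes/no.
antibiotics_table = [[10, 40], [35, 15]]

ci = percentile_interval(np.arange(101))
_, p_binary = fisher_exact(antibiotics_table, alternative="two-sided")
p_continuous = mannwhitneyu(crp_low, crp_high, alternative="two-sided").pvalue

significant = p_continuous < 0.05  # significance threshold from the text
```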

RESULTS

A total of 2,000 CTNs from 1,915 patients were included in the final data set. There were 26,971 annotations, for an average of 13.5 annotations per CTN. The flowchart of CTN selection of the first test set is shown in Figure 1. In the second test set, 664 CTNs from 652 patients were included. Table 1 shows the demographics for each data set, ICD-10 code, and mean outcome distribution. The baseline outcome rates are similar to those reported by others.^48-51 Patients with pneumonia on CXRs received antibiotic prescriptions in 46% of cases. All incidentalomas were of nodule subtype and none had clinical significance. Table 2 compares the outcome rates in the low-risk and high-risk groups in the test sets with calculated P values. There was a significant difference between the groups in CRP values and antibiotic treatment in test set 1 and only in antibiotic treatment in test set 2. No evaluations in the emergency department resulted in a pneumonia diagnosis, and 83% received the same ICD-10 code as they received in the initial consultation. No primary care re-evaluations resulted in a pneumonia diagnosis, and 80% received the same ICD-10 code they initially received.

Figure 1.

Study flowchart for clinical text note selection.

CFEM = clinical feature extraction model; CTN = clinical text note; PC = primary care.

Table 1.

Demographics, ICD-10 Code Distributions, and Outcomes in the Data Sets

Table 2.

Comparison of Outcome Rates in the Test Sets Between Low-Risk and High-Risk Groups

Outcome distributions stratified by risk group are shown in Figure 2 (training set), Figure 3 (set 1), and Figure 4 (set 2). The low-risk groups in the training set (Figure 2) contain no positive CXRs, 52% of the incidentalomas, and 9% of CXR referrals. In the first test set, the low-risk groups included one-third of the patients, who were younger and had lower CRP values, antibiotic prescription rates, and re-evaluation rates, no positive CXRs, and 19% of CXR referrals. In the second test set, 45% of patients and 35% of CXR referrals were in low-risk groups, which had no CXRs with signs of pneumonia and the single incidentaloma found. The outcome trends in Figures 2, 3, and 4 show rising outcome rates with higher risk groups for all outcomes, except for re-evaluation in primary care and CRP values in the second test set.

Figure 2.

The outcome distribution in the cross-validated data set.

CXR = chest x-ray; CRP = C-reactive protein; ICD-10 = International Classification of Diseases, 10th Revision; J00 = common cold; J15 = bacterial pneumonia; J20 = acute bronchitis; J44 = chronic obstructive pulmonary disease; J45 = asthma.

Notes: (A-E) bars represent 95% CIs. (F) shaded area represents 95% CIs.

Figure 3.

The outcome distribution in test set 1.

CXR = chest x-ray; CRP = C-reactive protein; dL = deciliter; ED = emergency department; ICD-10 = International Classification of Diseases, 10th Revision; J00 = common cold; J15 = bacterial pneumonia; J20 = acute bronchitis; J44 = chronic obstructive pulmonary disease; J45 = asthma; mg = milligram; PC = primary care.

Notes: (B) bars represent 95% CIs. (F) shaded area represents 95% CIs.

Figure 4.

The outcome distribution in test set 2.

CXR = chest x-ray; CRP = C-reactive protein; dL = deciliter; ICD-10 = International Classification of Diseases, 10th Revision; mg = milligram; PC = primary care.

Notes: (B) bars represent 95% CIs. (F) shaded area represents 95% CIs. ICD-10 code distribution in test set 2 was not examined for these symptomatically similar patients.

DISCUSSION

In this large retrospective study, we show, for the first time, the results of patient triage by MLMs in primary care, using only data available before a medical consultation, in the context of patient outcomes. The RSTM performs the triage such that patients in high-risk groups have more severe outcomes than those in lower-risk groups. Importantly, no patient in the lowest 5 risk groups had a CXR with signs of pneumonia or a pneumonia ICD-10 code. Despite patients in test set 2 coming from a different population than patients in the training data set, the triage shows an outcome distribution pattern similar to that of test set 1, further validating that the RSTM triages pre-consultation patients similarly. The nested cross validation shows an underlying signal across the whole data set, allowing the RSTM to triage the patients aligned with expected outcomes, regardless of how the data set is split and ordered. The outcome distribution is similar in all data sets, indicating a general good model fit to the data. Interestingly, the RSTM is ignorant of ICD-10 code subtypes but scores J15 (bacterial pneumonia) patients at an increasing rate in groups 4 through 10, while J00 (common cold) and J20 (acute bronchitis) decrease proportionally. J44 (COPD) was only found in groups 2 through 8, indicating that the model considers patients with pneumonia (J15) and COPD (J44) most likely for worse outcomes, matching reality.

Findings Compared With Other Studies

We were unable to find similar studies, but multiple studies have attempted to derive a diagnostic rule for pneumonia from the signs and symptoms of patients. All but 1 include features in their rules which make them incomparable to the RSTM. When we compare the scores of the RSTM and the diagnostic rule from the 1 comparable study,⁵² we see a linear correlation (Supplemental Figure 1). Those authors concluded that using the diagnostic rule in clinical settings would substantially reduce antibiotic use and CXR imaging,⁵² which coincides with our findings. We also compared the score of the RSTM to the Anthonisen score,⁵³ which recommends antibiotic treatment for exacerbation of COPD if 2 of 3 cardinal symptoms are positive (increased sputum expectoration, increased dyspnea, purulent sputum production). Their score coincides well with the risk prediction of the RSTM (Supplemental Figure 2) and recommends that COPD patients in the low-risk groups should not be treated with antibiotics.
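The Anthonisen criterion as summarized above reduces to counting cardinal symptoms. The sketch below reads "2 of 3" as "at least 2 of 3", which is an assumption, and the function name is hypothetical. It is an illustration of the rule as described here, not clinical guidance.

```python
def anthonisen_recommends_antibiotics(
    increased_sputum: bool, increased_dyspnea: bool, purulent_sputum: bool
) -> bool:
    """Return True when at least 2 of the 3 cardinal symptoms are present."""
    positive = sum([increased_sputum, increased_dyspnea, purulent_sputum])
    return positive >= 2


one_symptom = anthonisen_recommends_antibiotics(True, False, False)
two_symptoms = anthonisen_recommends_antibiotics(True, False, True)
```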

Clinical Implications

If the RSTM shows similar performance in clinical settings, it could be implemented as a web-based tool, potentially triaging patients online before they make an appointment. The triage could potentially identify patients with low risk of lower respiratory tract infection who could be attended to without the need for face-to-face consultations. The RSTM could eliminate CXR referrals for patients in groups where the probability of them being positive is low or nonexistent, which would remove up to one-third of CXRs and possibly one-half of the incidentalomas without missing a positive CXR. Despite all patients in the low-risk groups receiving diagnoses where the benefit of antibiotics is debatable, antibiotics were substantially prescribed. Reducing antibiotic prescriptions in the low-risk groups would increase prescription quality. The RSTM score needs no input from clinicians and can be ready when a patient enters the examination room, resulting in an easy-to-use, unambiguous, applicable score with a meaningful effect. Thus, the RSTM can possibly reduce costs for patients, the health care system, and society.

Strengths and Limitations

The strengths of this study include a large data set of patients with 2 distinct test sets. Using multiple patient outcomes, stratified by risk groups, gives more insight into the performance and safety of the triage than using only physicians’ diagnoses as benchmarks. The study is subject to limitations and biases of a retrospective methodology, and the findings must be validated prospectively. The CTNs are a written record of the physician’s interpretation of patients’ symptoms and signs and contain human errors and biases, making the RSTM erroneous and biased. Removing CTNs with fewer than 8 clinical features creates selection bias, likely toward patients with more severe symptoms. Direct data collection from patients would provide higher-quality training data. There is availability bias in the CRP values and CXR outcomes. Performing annotation with multiple physicians would likely result in higher-quality annotations.

Received for publication September 9, 2022.
Revision received December 9, 2022.
Accepted for publication January 12, 2023.
