For the Project 2 subject, I collaborated with four other students to produce a model that determines the triage category for people presenting to an Emergency Department. My main task in the project was to clean the free text fields. It is often said that 80% of data science is cleaning the data; in this project, I would put the percentage higher.
We were provided with a dataset of 556,652 Emergency Medical Records from four NSW hospital Emergency Departments over a 3-year period (2017–2019). It contained 77 variables consisting of demographic data (age, gender), presenting information, and vital signs (heart rate, blood pressure, GCS, etc.). Two of the fields for presenting information were free text.
Free text notes made by the Triage Nurse do not conform to proper sentence structure. Free text fields increase the likelihood of misspellings and typing errors, and the hectic workflow in emergency departments makes errors even more likely. In previous studies, text fields have proven difficult to clean and aggregate, and have been excluded (Zhang et al., 2017). Because the text provided significant predictive power, in this study it was not discarded but cleaned and transformed into word vectors using a Term Frequency – Inverse Document Frequency (TF-IDF) vectorizer.
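To make the TF-IDF step concrete, here is a minimal sketch using only the Python standard library. It is not the vectorizer used in the project, whose library and settings are not described here. It uses the plain weighting tf × log(N / df), whereas library implementations typically add smoothing and normalisation. The function name `tfidf_vectors` and the example notes are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using the plain tf * log(N / df) weighting."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of notes containing each word at least once.
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty notes
        vectors.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors


# Made-up triage-style notes, not taken from the project dataset.
notes = [
    "chest pain radiating to left arm",
    "abdominal pain and vomiting",
    "vomiting and diarrhoea since yesterday",
]
vocabulary, vectors = tfidf_vectors(notes)
```

A word that appears in few notes (such as "chest" above) receives a higher weight than one that appears in many (such as "pain"), which is why rare misspellings inflate the vocabulary without adding useful signal.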
Triage Presenting Information contained 555,654 observations up to 57 words long per patient, with a total of 10,544,474 words, 114,542 of them unique. Of the unique words, 109,774 occurred fewer than 100 times and 65,790 occurred only once, suggesting a high number of misspellings and typing errors. Similar results were seen in the Triage Presenting Additional Information text, the top 20 of which are shown in Figure 1.
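Counts like these can be produced with a simple frequency table. The sketch below is illustrative only: it assumes whitespace tokenisation and lowercasing, which may differ from how the project counted words, and `vocabulary_profile` is a hypothetical helper name.

```python
from collections import Counter


def vocabulary_profile(notes, rare_threshold=100):
    """Summarise word frequencies across a collection of free text notes."""
    counts = Counter(word for note in notes for word in note.lower().split())
    return {
        "total_words": sum(counts.values()),
        "unique_words": len(counts),
        # Words seen fewer than `rare_threshold` times.
        "rare_words": sum(1 for c in counts.values() if c < rare_threshold),
        # Words seen exactly once are strong candidates for typing errors.
        "singletons": sum(1 for c in counts.values() if c == 1),
        "most_common": counts.most_common(5),
    }


# Made-up notes; the threshold is lowered so the tiny sample is meaningful.
notes = ["vomiting and diarrhoea", "vomitting since am", "vomiting x3"]
profile = vocabulary_profile(notes, rare_threshold=2)
```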
The 50 most common words in Triage Presenting Information are shown in Figure 2.
In order to reduce the number of misspellings and typing errors, unique words were converted to lowercase and put through a spellchecker package. Any words not spelled correctly were written out to a spreadsheet to be manually checked. This was a very time-consuming job.
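The flagging step can be sketched as follows. This is an assumption-laden illustration rather than the project's code: a plain set lookup stands in for the spellchecker package, the dictionary is a tiny made-up word list, and the review sheet is written as CSV to an in-memory buffer instead of a spreadsheet file.

```python
import csv
import io


def flag_unknown_words(notes, dictionary):
    """Return the sorted unique lowercase words that are not in the dictionary."""
    unique_words = {word for note in notes for word in note.lower().split()}
    return sorted(word for word in unique_words if word not in dictionary)


def write_review_sheet(words, handle):
    """Write one row per flagged word, leaving the correction column blank."""
    writer = csv.writer(handle)
    writer.writerow(["word", "correction"])
    for word in words:
        writer.writerow([word, ""])


# Hypothetical dictionary and notes for illustration only.
dictionary = {"chest", "pain", "hypertension", "vomiting", "and"}
notes = ["Chest pain and hypertention", "VOMITTING and chest pain"]

unknown = flag_unknown_words(notes, dictionary)
sheet = io.StringIO()
write_review_sheet(unknown, sheet)
```

Working on unique words rather than every occurrence keeps the manual review list as short as possible, although it also removes the context a reviewer would need for ambiguous cases.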
As the words were now out of context we could only accept corrections that were obvious. For longer words that had the error towards the end, the spellchecker was good at returning the correct spelling. For example, with “hypertention” the second ‘t’ should be ‘s’. For some drug names, shorter words, and words with the error towards the beginning, the spellchecker returned words that were less likely to be correct. For example, “goord” could be the condition “GORD” or it could be “good”. And “gunting” could be grunting, hunting, or Huntington's or even something else. The file of corrections was then loaded into an SQL table.
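The difference between long and short words can be illustrated with `difflib` from the Python standard library, used here as a stand-in for the spellchecker package. The candidate vocabulary is a made-up list, and a real spellchecker may rank suggestions differently.

```python
import difflib

# Hypothetical candidate vocabulary; a real dictionary would be far larger.
vocabulary = ["hypertension", "hypotension", "gord", "good", "grunting", "hunting"]


def suggest(word, vocabulary, n=3):
    """Return up to n vocabulary words that most closely resemble `word`."""
    return difflib.get_close_matches(word.lower(), vocabulary, n=n)


long_word = suggest("hypertention", vocabulary)  # one clear best candidate
short_word = suggest("goord", vocabulary)        # "gord" and "good" both plausible
ambiguous = suggest("gunting", vocabulary)       # "grunting" and "hunting" both plausible
```

With a long word, a single wrong letter barely changes the similarity score, so the intended word stands out. With a short word, one edit can produce several equally close candidates, which is why those cases needed a human decision or were left uncorrected.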
To clean the text data, abbreviations were expanded (e.g. ‘c/o’ to ‘complains of’, and ‘-ve’ to ‘negative’), and slashes and hyphens were replaced by spaces to prevent words from being concatenated when punctuation was removed. Any extra white space was then removed. To perform corrections, the text fields were split into individual words, which were matched against the table of corrections.
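The cleaning steps above can be sketched as a single function. The two abbreviations come from the description above; the corrections dictionary stands in for the SQL table and its entries are hypothetical. Lowercasing the note before matching is also an assumption.

```python
import re

ABBREVIATIONS = {"c/o": "complains of", "-ve": "negative"}  # examples from the text
CORRECTIONS = {  # hypothetical entries standing in for the corrections table
    "hypertention": "hypertension",
    "vomitting": "vomiting",
}


def clean_note(text, abbreviations=ABBREVIATIONS, corrections=CORRECTIONS):
    """Expand abbreviations, strip punctuation, and apply word-level corrections."""
    text = text.lower()
    # Expand abbreviations first, while their slashes and hyphens still exist.
    for short, full in abbreviations.items():
        pattern = r"(?<!\w)" + re.escape(short) + r"(?!\w)"
        text = re.sub(pattern, full, text)
    # Replace slashes and hyphens with spaces so neighbouring words stay separate.
    text = re.sub(r"[/-]", " ", text)
    # Remove the remaining punctuation.
    text = re.sub(r"[^\w\s]", "", text)
    # split() discards extra white space; each word is then looked up.
    words = text.split()
    return " ".join(corrections.get(word, word) for word in words)


cleaned = clean_note("C/O chest-pain, vomitting++  PMHx: hypertention. Trop -ve")
```

The order matters: if slashes and hyphens were replaced before the abbreviations were expanded, ‘c/o’ would become the two meaningless tokens ‘c’ and ‘o’.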
We had planned to spend 3 weeks of the 6-week project on data cleaning. This turned out to be inadequate, and we decided to stop at the end of week 4 and work with what we had. In Figure 3 we can see the impact of cleaning the text fields. With Triage Presenting Information, shown in dark blue, there was a 5.87% decrease in the total number of words and a 22.43% decrease in the number of unique words. With Triage Presenting Additional Information the percentage decreases were 0.67% and 16.95% respectively. Even with this drop in unique words, a large number of words remained to feed into the modelling.
As seen in Figure 4, the word misspelled in the greatest number of ways across both free text fields was ‘diarrhoea’ (249), while misspellings of ‘vomiting’ (7989) had the greatest impact.
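Both measures can be derived from the corrections table and the word counts. The sketch below assumes that "impact" means the total number of misspelled occurrences mapped to a word, which is my reading of the figure rather than a documented definition. All data and names in it are made up.

```python
from collections import Counter, defaultdict


def misspelling_summary(corrections, word_counts):
    """Map each correct word to (number of distinct misspellings, total occurrences)."""
    variants = defaultdict(set)
    occurrences = Counter()
    for wrong, right in corrections.items():
        if wrong == right:
            continue  # not a misspelling
        variants[right].add(wrong)
        occurrences[right] += word_counts.get(wrong, 0)
    return {word: (len(variants[word]), occurrences[word]) for word in variants}


# Hypothetical corrections and frequencies for illustration only.
corrections = {
    "diarhoea": "diarrhoea",
    "diarrhoae": "diarrhoea",
    "diarrohea": "diarrhoea",
    "vomitting": "vomiting",
    "vomting": "vomiting",
}
word_counts = {"diarhoea": 4, "diarrhoae": 10, "diarrohea": 1, "vomitting": 120, "vomting": 3}

summary = misspelling_summary(corrections, word_counts)
most_variants = max(summary, key=lambda word: summary[word][0])
most_occurrences = max(summary, key=lambda word: summary[word][1])
```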
Using an ensemble method we were able to fairly accurately predict categories 1, 2, and 5. A lot more work is required to match human-level results.
A big limitation of the project was the timeframe: we only had 6 weeks. More time and resources would have allowed several more passes through the text data to identify and match low-frequency words against a more robust dictionary, which would likely have improved the model results. Extracting meaning from repetition counts (e.g. vomiting x8) and from numbers referred to in the text (e.g. 3/7 for 3 days) also warrants further investigation and feature engineering.
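As a starting point for that feature engineering, the two patterns mentioned above could be picked out with regular expressions. This is an exploratory sketch that was not part of the project. The patterns are assumptions, and reading denominators of 52 and 12 as weeks and months extends the ‘3/7’ example using common clinical shorthand.

```python
import re

# "vomiting x8" -> word followed by a repetition count (assumed pattern).
REPEAT_PATTERN = re.compile(r"\b([a-z]+)\s+x\s*(\d+)\b")
# "3/7" -> duration shorthand; excludes longer numbers such as "120/70".
DURATION_PATTERN = re.compile(r"(?<![\d/])(\d+)/(7|52|12)(?![\d/])")
# Only the /7 = days reading appears in the text; the others are assumptions.
UNIT_BY_DENOMINATOR = {"7": "days", "52": "weeks", "12": "months"}


def extract_features(text):
    """Pull repetition counts and duration shorthand out of a free text note."""
    text = text.lower()
    repeats = {word: int(n) for word, n in REPEAT_PATTERN.findall(text)}
    durations = [
        (int(n), UNIT_BY_DENOMINATOR[denominator])
        for n, denominator in DURATION_PATTERN.findall(text)
    ]
    return {"repeats": repeats, "durations": durations}


features = extract_features("Vomiting x8 for 3/7, BP 120/70")
```

An extraction like this would need to run before slashes are replaced with spaces during cleaning, otherwise ‘3/7’ is reduced to two unrelated numbers.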
References
Zhang, X., Kim, J., Patzer, R.E., Pitts, S., Patzer, A., & Schrager, J.D. (2017). Prediction of emergency department hospital admission based on natural language processing and neural networks. Methods of Information in Medicine, 56(5). doi:10.3414/ME17-01-0024