Why health is the ‘worst case’ scenario for text mining

16 minute read

Teaching a machine to read a medical record is a lot harder than many folks thought. Will we ever get there?

Think about how many keystrokes you make every day in your practice. Now, multiply that number by 744,437 – the number of registered health practitioners in Australia.

That’ll give you a rough idea of just how much free text is being generated by the health sector each day.

This free text is a goldmine glistening with potentially useful data.

Let’s take something most doctors would really like to know about their patients: what medications they are currently taking. The patient sitting in front of you can’t remember and they’ve forgotten to bring their medicine kit. How nice would it be if a computer could trawl through all the written medical records that exist about the patient and compile a list of likely recent medications?

This kind of intelligent system would never be out of work. For example, it could find all the patients overdue for a pap smear or a vaccination. Or it could match patients with rare kinds of cancers to clinical trials, compile data on adverse events or create registries. The list of potential uses is endless.

But there’s a snag, and it’s a major one. The language used in electronic medical records is deeply confusing to a computer.

Here’s a typical example of what a specialist might key into their computer at the end of a consult:

“Pt 18yof, independent. R) sarp pain++ @rest 0000hrs. AV called. x15 paracetamol? x2 oxycodone x5 valium and ?? number of lasix. pt @ TNH ED 1/7 ago with same present. Hx ideation. 3/12 56 called for assault. Anxious+++ CATT tream involved.”

What does that mean? I have no idea. A computer would have trouble figuring it out too.

A lot of this notation has been made up by the doctor, so it’s not like the computer can cross-reference it with a standardised list of medical abbreviations. Text mining is tricky at the best of times, but healthcare is one of the hardest areas to work in. 

“I call it worst case,” says Professor Wray Buntine, a machine learning expert at Monash University in Melbourne, who also kindly supplied the specialist consult example above.

“Just about anything else is better, because [in medicine] there are huge amounts of jargon and abbreviations. There are hundreds of thousands of different, very specialised drug names, medical disease names. Several of them have multiple acronyms and common-language versions.

“And also, there’s a lot of ambiguity. Some jargon has different meanings in different places. In the emergency ward it means one thing, in another ward it might mean something else.

“A lot of medical text isn’t grammatical, it’s in bullet points. You just have a stream of consciousness almost.”

It gets worse. Doctors don’t always have the best spelling and their fingers slip just like the rest of us, stamping typos into the medical record. The word diarrhoea might actually appear as diorhea, dihorrea, diaherea, dierrhea or diorrea.

And there are as many typo combinations as there are colours in a paint shop catalogue. Atorvastatin, for example, could mutate into aotrvastatin, atrovastatin, taorvastaint, atovasattin, and on it goes.

Typos can easily lead to unintended meanings that mess with computers. Here’s an example helpfully provided by US informatician Associate Professor David Hanauer:

When a doctor types in “9 yr old fe,sle, strong family hx of food allergies”, a bot reading that line might assume the patient had a condition called SLE or systemic lupus erythematosus. But if you look at an English language QWERTY keyboard, it’s easy to see the mistake; the ‘,’ button is right next to the ‘m’ button and the ‘s’ is right next to the ‘a’. So, the record should read “9 yr old female”. A couple of slipped keys created a medical condition!
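
That kind of slip is easy to model. The sketch below is only a toy illustration (not Professor Hanauer’s actual system), asking whether a garbled token could be an adjacent-key mistyping of an intended word; the adjacency table is a hand-made stand-in covering only the keys in this example:

```python
# Toy adjacent-key typo checker. The adjacency map is a hand-made
# stand-in covering only the keys needed for this example.
ADJACENT = {
    ",": {"m", "."},                  # ',' sits beside 'm' on QWERTY
    "s": {"a", "d", "w", "x", "z"},   # 's' sits beside 'a'
}

def could_be(garbled: str, intended: str) -> bool:
    """True if every character matches outright or is an adjacent-key slip."""
    if len(garbled) != len(intended):
        return False
    return all(g == i or i in ADJACENT.get(g, set())
               for g, i in zip(garbled, intended))

could_be("fe,sle", "female")   # the slip from the example above
```

A real spell-checker would combine this kind of keyboard evidence with word frequencies and context, but the geometry alone already explains how ‘female’ became ‘fe,sle’.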

Sometimes it isn’t even a typo that baffles the computer though. Does ‘CA’ mean calcium, cancer or California? The answer is: it depends on the context.

You’d think that numbers would be fairly straightforward, but they aren’t. Ten could be written as ‘10’ or ‘ten’ or ‘X’ (Roman numerals). And ‘10 years old’ could be written as ‘ten yo’, ‘10 years’, ‘10 yrs’, ‘tenyo’ or numerous other combinations. Even worse is ‘4’, which in Roman numerals becomes ‘IV’, a string that doubles as the common abbreviation for ‘intravenous’.

Doctors don’t speak the same lingo either. In a single day at a US hospital, doctors described ‘fever’ in 278 different ways and ‘ear pain’ in 123 different ways across 213 patients, according to an analysis of digital medical records.

Medical language has a lot of redundancy built into it too. Cancer could appear as carcinoma, ca, tumour or neoplasm in a medical record, while ‘white blood cell count’ could read: white count, leukocyte count or WBC.

And the English language is filled with homonyms – words that are spelled the same but have completely different meanings. A human could easily distinguish between “feeling blue” and “turning blue”, but a computer might not discern that there’s quite a big medical difference between the two phrases.


But this still isn’t even scratching the surface of the complexity, nuance and uncertainty that permeate medical records.

For example, a doctor might write in a diagnosis followed by ‘???’. To a human, three question marks is an alarming amount of uncertainty. But how does a computer translate that feeling of doubt in the language of 0s and 1s?

Not everything a doctor writes on a patient’s chart is gospel, either. Incorrect diagnoses spread like viruses through medical records, getting replicated every time a new record is created. While a specialist would carefully weigh the evidence behind each diagnosis, computers might be completely thrown off course by these hidden landmines.

We could just force doctors to write in a more standardised, structured format. That would make the computer’s life a lot easier.

But structured data, such as box ticking, standardised forms or use of ICD-11 codes, actually contains substantially less information than free text. You lose a lot of data about the series of events and the medical reasoning that brought the doctor to a particular diagnosis or treatment plan.   

It’s easier for AI to read standardised records but these kinds of records don’t capture the subtleties, uncertainties and broader context that help make for good healthcare decisions. Not to mention that entering structured data would be clunky and annoying and slow down workflow. 

In a pivotal paper on the topic, US biomedical informaticians concluded there was a direct conflict between the desire for structured data and the need for flexibility in medical records.

The “narrative expressivity” of free text should not be sacrificed for the sake of structure and standardisation, they argued. Instead, EMRs should have a hybrid model where both types of data could be recorded.


All these peculiarities of medicine make it incredibly hard for “natural language processing” systems to parse the text, says Professor Buntine.

So what is “parsing”? It is basically how computers read human language. The computer scans free text and classifies each word it encounters as a verb, noun, subject, object, drug name, disease name, and so on.
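
In its simplest form, that classification step is just a lookup. The sketch below is a deliberately crude illustration; the tiny hand-made word lists stand in for the gazetteers and statistical models a real parser would use:

```python
# Crude word-classification sketch. The word lists are tiny hand-made
# stand-ins for the terminologies a real system would draw on.
DRUGS = {"paracetamol", "oxycodone", "atorvastatin"}
DISEASES = {"asthma", "diarrhoea", "lupus"}

def tag(text):
    """Label each token as DRUG, DISEASE or OTHER."""
    labels = []
    for token in text.lower().split():
        if token in DRUGS:
            labels.append((token, "DRUG"))
        elif token in DISEASES:
            labels.append((token, "DISEASE"))
        else:
            labels.append((token, "OTHER"))
    return labels

tag("patient prescribed paracetamol for diarrhoea")
```

Everything this article describes – typos, ad hoc abbreviations, ambiguous acronyms – breaks exactly this lookup step, which is why medical text is so hard to parse.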

Natural language processing is a branch of computer science that tries to extract meaning from free text using coding, artificial intelligence and linguistics.

There are some relatively simple forms of such processing, such as when your smartphone translates your finger painting into text, or when Google annoyingly finishes your sentences.

Then there are slightly more clever systems, such as search engines that retrieve information not just containing the key word you typed in, but also information containing synonyms or related concepts.

And then there’s the holy grail: an intelligent system that can crunch thousands of pages of free text data in order to help solve a complex medical problem, such as recommending the best treatment for a cancer patient. This is the kind of system that the group behind IBM Watson Oncology and other research teams are hoping to build.

While grand technology projects in health are garnering a lot of interest, there’s a growing realisation that deciphering free text in the health sector is much more difficult than, say, teaching Watson to play Jeopardy.

In fact, the failure of IBM Watson to live up to the hype has been well documented. A STAT investigation in 2017 found Watson for Oncology was floundering. The system didn’t tell doctors anything they didn’t already know. The supercomputer was still in its “toddler stage”, one cancer specialist said.

“Teaching a machine to read a record is a lot harder than anyone thought,” one ex-project leader admitted. 

For all the above reasons, natural language processing “hasn’t really hit the mainstream for a lot of tasks for medical research”, says David Hanauer, an associate professor of pediatrics at the University of Michigan who has specialised in health informatics.

There is some really low-hanging fruit, however. Over the past 15 years, Professor Hanauer has been building a basic search engine for doctors’ notes called EMERSE.

It’s a bit like doing a Google search. The system doesn’t do any complicated thinking, it just retrieves information and lets the doctor, or the researcher, figure out the rest. 

It’s a free, open source system that is currently being used at the University of Michigan, the University of North Carolina and the University of Cincinnati, and there are a few other centres working on getting it installed.

“Amazingly, most medical record systems don’t really have a good way to help the clinicians find information,” says Professor Hanauer.

What EMERSE does is allow the investigator to enter many different clinical terms, such as diagnoses, symptoms and medications. Then it searches the medical record not just for those specific words, but for a range of synonyms, abbreviations, acronyms and shorthand descriptors.
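
EMERSE’s internals aren’t spelled out here, but the general idea of query expansion can be sketched in a few lines. The synonym table below is an illustrative stand-in, reusing the cancer and white-blood-cell examples from earlier:

```python
# Query expansion sketch: a search term fans out to synonyms and
# abbreviations before matching. The synonym table is an illustrative
# stand-in; naive substring matching over-matches in real notes.
SYNONYMS = {
    "cancer": ["cancer", "carcinoma", "tumour", "neoplasm"],
    "white blood cell count": ["white blood cell count", "white count",
                               "leukocyte count", "wbc"],
}

def expanded_search(term, notes):
    """Return every note containing the term or any known synonym."""
    variants = SYNONYMS.get(term.lower(), [term.lower()])
    return [n for n in notes
            if any(v in n.lower() for v in variants)]

notes = ["WBC elevated on admission", "no evidence of carcinoma"]
expanded_search("white blood cell count", notes)
```

The hard part, of course, is building and maintaining the synonym table itself, which in a real system runs to many thousands of terms.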

EMERSE can help researchers find a cohort of patients across an entire system, or it can highlight certain terms within notes to speed up chart reviews.

“It will show them where all those things are mentioned in the notes so that they can drill down very quickly and find it and then do the data abstraction that they’re trying to do,” says Professor Hanauer.

EMERSE could also work as a memory aid for doctors and patients during a consultation.

Let’s say a patient comes in with a headache and tells their doctor that a medication they took five years ago worked really well but they can’t remember the name of it.

“So, with a tool like this, you can basically just type in the word ‘headache’,” says Professor Hanauer. “And then within a few seconds… ‘Oh, I see where that term was mentioned… from that time a few years ago’.”



You’d expect there to be some collaboration between researchers working on natural language processing in health because of the advantages of pooling data. (AI systems get smarter when they are exposed to new and different clinical data sets.) But actually, the opposite is happening. Instead of collaborative projects, we see lots of independent research groups building their own systems from scratch.

Why? Because health data has a lot of privacy restrictions. While normal human language can be sourced through Wikipedia, the language of doctors is only found inside clinical notes, and research teams need special permission to work with these sensitive records.

It might seem like a duplication of effort for so many research groups to be trying to solve the same problems independently. But medical records look and sound very different depending on which group of doctors made them, so custom-built processing might actually prove superior to joint projects.

Dr Rachel Knevel, a senior scientist at Leiden University Medical Centre in the Netherlands, is working on one such project. She’s interested in delving into the question of what triggers rheumatoid arthritis using cluster analysis.

To do this, it’s a lot faster to develop a program that can quickly scan tens of thousands of patient charts, rather than try to read each chart individually, she says. “I think it’s a way to make science more efficient,” she says.

Dr Knevel’s team has been developing a system that can identify clinical records that mention drugs for rheumatoid arthritis. The major problem is that the record is full of typos and spelling mistakes, she says.

Methotrexate is often written into the clinical record using the abbreviation ‘MTX’. But sometimes there’s a typo, and the word becomes ‘MXT’. Sometimes it’s ‘TXM’ or ‘NTX’.

The more letters that switch place or are replaced by another letter, the less likely it is that the doctor intended to write ‘MTX’.

Identifying the likely typos for methotrexate is essentially a maths problem, and the Leiden team solved it using something called the “Damerau–Levenshtein distance”. This is an algorithm that measures the distance between two words as the number of character operations (insertions, deletions, substitutions and transpositions of adjacent characters) required to transform one word into the other.
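
The Leiden team’s own code isn’t reproduced here, but the metric itself is textbook material. This is a standard implementation of the restricted variant (also called optimal string alignment):

```python
def dl_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    the number of insertions, deletions, substitutions and adjacent
    transpositions needed to turn string a into string b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i          # delete everything in a
    for j in range(len(b) + 1):
        d[0][j] = j          # insert everything in b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

dl_distance("MXT", "MTX")   # one adjacent transposition away
```

A swap of two adjacent letters costs 1, so ‘MXT’ sits closer to the intended drug name than a scrambled string like ‘TXM’ does, which is exactly the intuition about how likely a typo is.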

They trained their algorithm on around 6,000 EMR entries and validated it on around 3,000 more. It proved highly accurate at detecting entries in which rheumatoid arthritis drugs (and their typo variants) were prescribed.

Another example of an independent NLP project is happening in Geneva, Switzerland.

Patrick Ruch from the Swiss Institute of Bioinformatics’ text mining group is leading several natural language processing projects to support personalised medicine in oncology.

Dr Ruch says his team has designed an algorithm that distils the “large universe of papers” into a list ranked in order of importance, with the most robust research on common mutations at the top and the more personalised, but less robust, studies on very specific variants second.

It’s a similar system to commercial products such as IBM Watson, says Dr Ruch. The major difference is that the Swiss system has a closed loop with the oncologists, so if a third-line treatment was recommended by the computer system, the research team knows within a few weeks whether this was helpful or not in real life.   

“I cannot claim that we have saved someone,” says Dr Ruch. “It is too early.”


CSIRO has a Data61 team scattered across Sydney and Brisbane who work on NLP problems in healthcare. They’ve got three main projects under way, according to senior research scientist Dr Sarvnaz Karimi:

Firstly, in a project that started around five years ago, CSIRO’s Data61 team used data from askapatient.com to identify what drug adverse effects were being experienced in the community.

Anyone can contribute to the US website askapatient.com. It publishes reviews of medications by patients, including lots of rich data about potential side effects.

The Data61 system identified whether people were reporting their own symptoms or a friend’s, and whether there was language of negation (e.g. ‘I didn’t get a headache’).
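
CSIRO’s actual classifier isn’t described in detail, so the following is only a crude, rule-based sketch of the idea: flag a symptom as negated when a negation cue appears in the few words before it.

```python
import re

# Crude negation detection: a symptom mention counts as negated if a
# negation cue appears within the four words before it. The cue list
# is an illustrative stand-in for a real lexicon.
NEGATION_CUES = r"\b(?:no|not|didn't|never|denies|without)\b"

def symptom_negated(text, symptom):
    """True if the symptom appears and a negation cue closely precedes it."""
    m = re.search(re.escape(symptom), text, re.IGNORECASE)
    if not m:
        return False
    window = " ".join(text[:m.start()].split()[-4:])
    return re.search(NEGATION_CUES, window, re.IGNORECASE) is not None

symptom_negated("I didn't get a headache", "headache")   # negated
```

Fixed-window rules like this miss plenty of real-world phrasings, which is why production systems layer statistical models on top, but the sketch shows why negation has to be handled at all before any statistics are computed.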

The system then translated the informal language used by patients into standardised medical language, so that statistics could be created from the free text data.

In a second project, the Data61 team also created a system that could pick up the early rumblings of thunderstorm asthma over Twitter.

By reading tweets relating to asthma and breathing problems, their algorithms would have detected the 2016 thunderstorm asthma crisis in Melbourne five hours before the first news report, according to a study published in Epidemiology this year.

The third project deals with a big problem in medical research – the difficulty of matching patients to particular clinical trials.

Clinical trials often don’t have enough participants because it is difficult for patients to figure out which trials they are eligible for. To enrol in a clinical trial for cancer, for example, a patient might need to have a particular type of genetic mutation and no comorbidities.

There are databases, such as clinicaltrials.gov in the US, but these aren’t easily searchable. Patients and doctors still have to read through each trial and figure out if they fit the criteria.

To solve this problem, the Data61 team is working on a program where a patient can type in their specific characteristics (such as their age and gene mutation) and the algorithm brings up all the trials that contain free text descriptions that match.

The reason why natural language processing is needed is because there are many different ways to describe gene names and age brackets, so a simple word search is insufficient.
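
As a rough illustration of why (this is not Data61’s system; the patterns and trial snippets are invented for the example), both the query and the trial text can be normalised onto canonical tokens before matching:

```python
import re

# Normalise a few common spellings of the same eligibility facts onto
# canonical tokens, then match. The patterns and trial snippets are
# invented for illustration; HER2 and ERBB2 genuinely name the same gene.
def normalise(text):
    text = text.lower()
    text = re.sub(r"\b18\s*(?:years?|yrs?|yo)\b", "age18", text)
    text = re.sub(r"\beighteen\s*(?:years?|yrs?|yo)?\b", "age18", text)
    text = re.sub(r"\bher[\s-]?2\b", "erbb2", text)
    return text

trials = [
    "Patients eighteen yrs or older with HER-2 positive tumours",
    "Adults 18 years and over, ERBB2 amplified",
]

# Both trials describe the same criteria in different words, so a plain
# keyword search for "18 yo" or "HER2" would miss at least one of them.
wanted = [t for t in normalise("18 yo, HER2 positive").replace(",", " ").split()
          if t in ("age18", "erbb2")]
matches = [t for t in trials if all(tok in normalise(t) for tok in wanted)]
```

A production matcher would use full terminologies rather than a handful of regexes, but the principle is the same: collapse the many surface forms into one canonical form on both sides before comparing.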


Professor Jon Patrick left the University of Sydney around seven years ago to start two natural language processing companies in healthcare: HLA Global and iCIMS.

Professor Patrick is working closely with the Sydney Adventist Hospital to create cancer patient records for multidisciplinary care team meetings.

The system draws in reports from surgeons, chemotherapists and radiotherapists, and, as an “extra twist”, the hospital has asked the company to design a program that pulls data from pathology reports into a structured summary, says Professor Patrick.

“We’ve got lung and breast going at the moment, and we’ve just started to work on urology and gynaecology,” he says.

But Professor Patrick’s major client is from the US: the California Cancer Registry. The registry mines medical records to generate data on cancer trends. To speed up this work, they’ve hired Professor Patrick’s company to create systems that can help sift through the thousands of clinical reports.   

There are several layers of difficulty involved in teaching a computer to accurately read these reports, says Professor Patrick.

Take a pathology report, for example: it classically has a macro description, a micro description, a final diagnosis, a description of the specimen and a clinical history, but it can also have supplementary records with biomarkers and genomic tests, he says.

“So, that diversity just makes it richer and trickier to get what you want,” says Professor Patrick.

“And if you think about pathology, again, the core information about the diagnosis should be in the final diagnosis section. ‘Should’ be. But all too often it’s not.”

Even when a natural language processing system appears to be working fine, it can get thrown every time a new source of data is added – such as when a new pathology provider structures its reports slightly differently, he says.

The company often does program revisions on a weekly basis to keep on top of these changes.

Sometimes the system can identify a single doctor whose eccentric sentence structure or word choice is continually causing the computer headaches.

“We certainly see reports that cause trouble where we can instantly identify who the author is,” says Professor Patrick.

It seems like some electronic medical records are about as indecipherable to a computer as a doctor’s handwritten note is to the average human patient. 
