A²I : NLP in Healthcare

  • August 02, 2018

The medical field has recently gained attention in Natural Language Processing (NLP) and data analytics because of the amount of data it generates and the speed at which it does so. The amount of data that health care systems produce is directly proportional to the number of lethargic humans who treat their health as an afterthought. In other words, unless we humans take care of our bodies, at least to the extent of the love we show our iPhones, the data produced every second will have no end. This exponential growth of health care data has pushed researchers and industry to extract essential information from it and to replace humans with machines.


Before getting into the actual topic, it is essential to know what is meant by the structure of text data. The structure of data is simply the way in which it is perceived and manipulated by an individual. If the data is easy to understand and meaningful insights can be derived from it directly (by plotting a graph of numerical values or tabulating text, for example), it is structured. Data that is not organized in a repeating or clustered layout, and that needs manual effort before meaningful entities can be derived from it, is unstructured. With this distinction in mind, we can look at the key problems the medical industry suffers from.
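
To make the distinction concrete, here is a minimal Python sketch. The patient record, the note text, the field names and the two helper functions are all made up for illustration and do not come from any real system. The same blood pressure reading is stored once as a structured record and once as a free-text note.

```python
import re

# Structured: every value sits in a named field (hypothetical example data).
structured_record = {
    "patient_id": "P001",
    "systolic_bp": 150,
    "diastolic_bp": 95,
    "medication": "lisinopril",
}

# Unstructured: the same facts written as a free-text note (hypothetical).
unstructured_note = "Pt seen today, BP 150/95, started on lisinopril."


def systolic_from_record(record):
    """Reading a structured value is a direct field lookup."""
    return record["systolic_bp"]


def systolic_from_note(note):
    """Reading the same value from free text needs a pattern that
    anticipates how the clinician happened to write it."""
    match = re.search(r"BP\s*(\d{2,3})\s*/\s*(\d{2,3})", note)
    return int(match.group(1)) if match else None


print(systolic_from_record(structured_record))
print(systolic_from_note(unstructured_note))
```

The lookup on the structured record works for every record with that field. The pattern for the note only works as long as the reading is written as "BP 150/95"; a note saying "blood pressure was one fifty over ninety five" would need another rule, which is the extra effort unstructured data demands.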


Handwritten data

Until a few decades ago, the data produced in this domain was all on pen and paper. To be more precise, hospitals managed patient data in a handful of papers and files. Storing data as hard copies always made it difficult to manage and to use effectively. The drawbacks included the amount of space the records consumed, the handwriting and grammar that differed from one doctor, clinician or physician to the next (even when template forms were used), the difficulty of maintaining the documents and related test reports for each patient, and the waste of paper for every notification. Here a notification covers a patient's data from the history of his or her ailment to the prescriptions, lab tests, observations, next-visit updates, changes of medication and much more. Though there are quite a few people in charge of maintaining these patient records, we humans cannot be perfect all the time. Patients have their own difficulties in managing their files and medical records. According to Dr. Robert E. Hoyt’s authoritative textbook, HEALTH INFORMATICS: A PRACTICAL GUIDE (the 6th edition), the information that goes missing during patient visits to the hospital includes lab results (45%), dictations (39%), radiology results (28%), history and physical exams (27%) and pathology results (15%). This shows that nothing was systematic even though there was manual intervention in all of these areas. Entering handwritten data into templates is in no way equal to producing a structured document, as these documents could not be stored as digital text by any means. Later came medical dictation, which produces dictated documents.


Taking a closer look at dictated documents helps us understand how difficult it is to maintain a patient's health record and extract the necessary information from it. The procedure for building up a patient's record is basically as follows. First, the doctor listens to all the difficulties the patient has been facing, asks questions to clear up any doubts, and understands the patient’s problem. Next, the doctor records all the necessary information gathered from the meeting on a cassette or a mobile phone recorder. The recording is then sent to a transcriptionist, who is in charge of listening to it and converting the speech into text. Transcriptionists are appointed specifically to understand the speech and enter it in the prescribed fields as digital text. Once this is done, the doctor re-checks the document and signs it electronically to acknowledge it as a legally approved copy. The transcriptionist carries as much responsibility for these dictated documents as the doctor who approves them: if a wrong medication or symptom is entered in a category, the result is a serious error and an unsafe record. For this reason, transcription editors were appointed to cross-check the actual speech against the entered text before it is sent to the doctor for approval. A patient's entry therefore passes through three stages of validation (transcriptionist, editor and doctor) before its final approval. Every one of these stages requires manual effort and intelligence.


Electronic Health Record

With the arrival of the Electronic Health Record (EHR), data that was previously handwritten got digitized to some extent. Data that used to be recorded by other means can be recorded and processed directly in the EHR. An automatic speech-to-text approach helps automate the process by filling in the details in the appropriate fields of the provided template. Though this process still needs manual intervention, it has replaced the job of transcriptionists to a great extent. Documents produced using EHRs include records of the patient’s family history, reason for initial complaint, diagnosis and treatment, prescription medication, lab tests and results, record of visits, administrative and billing data, patient demographics, progress notes, vital signs, medical histories, immunization dates, allergies, radiology images and so on. Almost all the details of a patient are readily available to clinicians and physicians at any point in time. Medical coding, which was earlier the only technique used to fetch patient data (and only partially), has now been extended to extracting information from EHR data. If the EHR system is connected to the cloud, other doctors in the same circle can also use it to proceed with any medical process when necessary. A patient portal could be included as well, so that new patients enter their details themselves instead of going through a receptionist. EHRs have become beneficial in organizing patient details, in reducing redundancy and effort, in serving as a medium for sharing multimedia information (radiology images, scan reports, X-rays and so on), and in their proven ability to support extraction of information from the data they contain.


Though digitization has started evolving in health care, the quantity of unstructured data has kept growing. Peter J. Embi, MD, MS, President and CEO at Regenstrief Institute, said that “Eighty percent of clinical data is locked away in unstructured physician notes that can’t be read by an EHR and so can’t be accessed by advanced decision support and quality improvement applications.” From the current state of text analytics in healthcare, it was found that health systems simply aren’t making use of text analytics; fewer than 5% did as of July 2016. This shows how much research and development is still required in the field of medicine.


Handwritten documents were once the only means of storing a patient's data. Though their growth has reduced considerably, people still rely on them and they are still being produced. Unstructured data does not only include handwritten free text such as patient details noted down on paper, prescriptions, discharge summaries and many more; it also includes emails and attachments, typed transcriptions, audio messages, radiology and pathology outputs and so on. To be more specific, any document that requires a human to read, capture and interpret it falls under the unstructured category.


Even if we assume that part of the medical data has been digitized and automated using EHRs, and that no human intervention is needed to deal with it, the output of the system is always re-checked and corrected by a human with the relevant domain knowledge. Though the process has moved from manual to automatic speech-to-text, the reduction in errors has been only modest. More human support is still required, because the speech recorded by a clinician or physician (each with a different pronunciation) may not be clear and precise enough for a system to extract information from it. This simply shows that the system is not fully automated and that saving the data accurately for future use still takes care.


Adding intelligence to Healthcare—NLP

  • Existing development

This is where Natural Language Processing took the lead in automating the process. With the help of NLP in health care, the final step that required a human to re-check the documents is being replaced by NLP-based systems. There are hospitals around the world that have already started using NLP on a daily basis. Scott Evans, Director of Medical Informatics at Intermountain Healthcare in Salt Lake City, Utah, United States, said that Intermountain Healthcare has already started using NLP. He said that NLP-based systems helped them spot patients with a specific disease even though they had come in for other treatment, and that they were able to identify people with heart failure from 25 different types of free-text documents stored in the EHR. Rasu Shrestha, chief innovation officer at University of Pittsburgh Medical Center, Pennsylvania, United States, said that they use NLP for clinical decision support to manage risk around chronic diseases more intelligently. They use an NLP-based system called MedCPU, which has been added on top of the EHR to bring intelligence to the physician’s workflow. He made the point that NLP has the ability to mine information from data that we assume to be less informative.


So far, the medical domain has mostly used NLP that works on a rule-based methodology. A rule-based system is simply a set of hand-coded rules for extracting valuable information from medical data. Depending on the knowledge that needed to be extracted from each set of documents, rules were framed to suit the structure of each document. Simply put, document-specific rules were used for knowledge extraction. A little later, algorithm-driven models came into use and reduced the workload of manually encoding rules for the data. Recently, researchers have moved on to applying deep learning to health care data. Though NLP-based solutions show promising results, progress is really slow. The data and its structure add complexity to the process, which eventually slows progress down. This is probably the reason why there is demand for this area of research, and why more students, scholars, professors, industrialists and researchers have started focusing on finding an ideal solution for mining information from medical data.
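
As a rough illustration of what hand-coded rules look like, here is a minimal Python sketch. The categories (problem, test, treatment), the word lists and the sample note are all assumptions chosen for this example. They are not taken from any hospital system or from a real clinical vocabulary.

```python
import re

# Hypothetical hand-coded rules: each category maps to a list of patterns.
# A real rule set would be far larger and tuned to each document type.
RULES = {
    "problem": [r"\b(heart failure|chest pain|hypertension|diabetes)\b"],
    "test": [r"\b(ecg|echocardiogram|chest x-ray|blood test)\b"],
    "treatment": [r"\b(aspirin|metformin|lisinopril)\b"],
}


def extract_entities(text):
    """Apply every rule to the text and collect the matched terms
    per category, lowercased and without duplicates."""
    found = {label: [] for label in RULES}
    for label, patterns in RULES.items():
        for pattern in patterns:
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                term = match.group(0).lower()
                if term not in found[label]:
                    found[label].append(term)
    return found


# Made-up clinical note, for illustration only.
note = (
    "Patient reports chest pain. ECG ordered. "
    "Started on aspirin; history of hypertension."
)
print(extract_entities(note))
```

The weakness described above shows up quickly: a note that says "HTN" instead of "hypertension", or a discharge summary laid out differently, is missed until someone writes another rule for it. That manual effort is what algorithm-driven models try to reduce.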


This blog gives a simple introduction to the complexity of the structure of health care data and to what creates the need for NLP in it. It has also shown some real-world use cases of NLP in hospitals and described the technology and research behind them.


A working demo of MedNLU, a Natural Language Understander that has been fed with medical data, is available on the Arnekt website. It is one application of digitized medical data, and it helps find the tests, treatments and problems in large pieces of medical text. More on NLP in health care and the logic behind MedNLU will follow in my next blog (NLP tools in Health care).