Medical Natural Language Processing
In this video, VG Vinod Vydiswaran, Associate Professor of Learning Health Sciences and Associate Professor of Information, discusses the building blocks for natural language processing and information extraction for health data.
“Hand drawn flat design npl illustration” by Freepik is available via Freepik.
Transcript
Hello everyone. Welcome to Module 2 of this MOOC. In this module, we are going to go from just looking at formatted text to building a pipeline for natural language processing and information extraction for health data. In this video, we're going to talk about the building blocks of such a pipeline through this discussion of medical natural language processing.

The goal of medical natural language processing, or MNLP for short, is to convert text data to structured knowledge. You have information such as the website WebMD giving you this information about Erbitux helping treat lung cancer. From this article, the structured knowledge is really the fact that Erbitux is a treatment that helps treat the condition lung cancer. An advanced lung condition is a modified version of this condition, one that has increased in severity.

This kind of text data can come from anywhere. It could be websites, or it could be tweets, such as this tweet from the CDC, the Centers for Disease Control in the US, talking about sneezing on someone spreading Ebola. Then you have other tweets that combine information in the text content with other multimedia sources to say, for example, that dizziness is a symptom of CO poisoning. Only the image actually tells us the additional detail that CO poisoning means carbon monoxide poisoning. There is a slight distinction between what is in the text and what is in the picture. In this MOOC and in this module, we are going to focus only on text data, to understand how we can convert this text data into knowledge, into something that can then be processed in downstream tasks.

Typical natural language processing tasks include identifying medical concepts. This is the problem of named entity recognition that we are focusing on in this entire MOOC. An example would be to identify the concept "chest pain" or the concept of antihistamine. Then we may have to extend beyond just identifying concepts to extracting assertions, attributes, and values corresponding to certain concepts. This task, in general, is called information extraction. An example of that would be if you have the concept of blood pressure and then you have possible values like high, low, or normal. Similarly, we might have a tumor that might be absent or benign or malignant or in situ, and so on. These are the specific values in a key-value pair.

Another variation of this is to identify medical relationships between concepts. This, in general, is called relation identification. For example, you find the relationship between Prozac as a treatment and clinical depression. This is a triplet: there is a treatment relation between two entities, Prozac and clinical depression. Similarly, cataract is a condition and decreased vision is an outcome of cataract, so you have the relation "causes": cataract causes decreased vision. These are typically the tasks of identifying key elements, whether they are concepts, attributes, or relationships, expressed in text.

But before we do any of this, we need to look at more fundamental NLP tasks. When you look at text, it does not come as concepts and relationships; it comes as words and sentences. We need to first identify those concepts. The first step in any NLP system, and in any information extraction system, is identifying these building blocks clearly.
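Before turning to those building blocks, here is a minimal sketch of how the structured output described above, concepts, attribute-value pairs, and relation triplets, could be represented as data structures. It is written in Python, and the class and field names are illustrative assumptions, not part of the course's tooling or of any standard.

```python
# A minimal sketch of the structured knowledge that medical NLP aims to
# produce from free text. Class and field names are illustrative only.
from dataclasses import dataclass

@dataclass
class Concept:
    text: str           # the span as it appears in text, e.g. "chest pain"
    semantic_type: str  # e.g. "medication", "condition", "finding"

@dataclass
class Attribute:
    concept: Concept
    name: str           # e.g. "value" or "status"
    value: str          # e.g. "high", "benign", "absent"

@dataclass
class Relation:
    subject: Concept
    predicate: str      # e.g. "treats", "causes"
    object_: Concept    # trailing underscore avoids shadowing the builtin name

# Example from the transcript: Erbitux (a treatment) treats lung cancer (a condition)
erbitux = Concept("Erbitux", "medication")
lung_cancer = Concept("lung cancer", "condition")
treats = Relation(erbitux, "treats", lung_cancer)

# Attribute-value pair: blood pressure = high
blood_pressure = Concept("blood pressure", "finding")
bp_high = Attribute(blood_pressure, "value", "high")

print(treats)
print(bp_high)
```

Downstream tasks would consume records like these rather than the raw text.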
These building blocks include, for example, defining what a word is, counting words and their frequencies, finding sentence boundaries where one sentence ends and another starts, and finding sections, where a paragraph or a particular section within a clinical note ends and another starts. These are basically about the way the text is laid out. Then we look into grammar: part-of-speech tagging, for example, what are the nouns, what are the verbs, what are the modifiers to the verbs, that is, adverbs, and what are the modifiers to nouns, that is, adjectives, and so on. Then, to understand what is happening in a sentence, we parse the sentence to derive meaning out of it. Then there are further advanced NLP concepts such as semantic roles, for example, who is doing the action, on what object the action is done, and so on.

Finally, we also have identifying entities in a sentence; this is the task of named entity recognition. Or we may want to find out which pronoun in a particular sentence or paragraph refers to which entity. For example, if there is a sentence where we say "Rebecca Moody visited the clinic yesterday; she was suffering from such-and-such," the word "she" has to refer to the patient in this case. But if the first sentence had both the patient and a doctor, and the second sentence said "she was suffering from," then you know that the pronoun has to refer to the patient, the first subject in the previous sentence. This is called pronoun resolution. There are many such tasks, so you can have other co-reference resolution tasks and so on. In this video and the next, we are going to talk in more detail about the individual steps that are needed to first identify the building blocks of language.

Let's look at the steps in linguistic analysis. First, you would split a document into paragraphs and split the paragraphs into sentences; in general, this task is called sentence splitting. Second, we take the sentences and split them into words, or tokens; this is done by a process called tokenization. Third, we need to find the root form of each word; this is called lemmatization, finding the lemma, which is the root form of the word. Then we identify what part of speech each word has in a particular sentence; that is called part-of-speech tagging. And then we find the grammatical structure of the sentence; this is called parsing.
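As a concrete illustration of these steps, here is a minimal sketch using the open-source spaCy library. The choice of toolkit is an assumption, the course does not prescribe one, and the sketch assumes the small general-purpose English model has been downloaded (python -m spacy download en_core_web_sm); its entity types are general-purpose, not medical concepts.

```python
# A minimal sketch of sentence splitting, tokenization, lemmatization,
# part-of-speech tagging, parsing, and named entity recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Rebecca Moody visited the clinic yesterday. "
        "She was suffering from chest pain and was given an antihistamine.")
doc = nlp(text)

# Step 1: sentence splitting
for sent in doc.sents:
    print("SENTENCE:", sent.text)

# Steps 2-5: tokenization, lemmatization, part-of-speech tagging, and parsing
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)

# Named entity recognition (general-purpose entity types, not medical concepts)
for ent in doc.ents:
    print("ENTITY:", ent.text, ent.label_)
```

A clinical pipeline would typically substitute models adapted to clinical notes, but the sequence of steps is the same.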
Let's start with sentence splitting. We might think that identifying where one sentence ends and another begins is pretty straightforward. When you read a paragraph of text, you know where to stop, where a sentence ends and where the next one starts. We know that a full stop, a question mark, or an exclamation mark is typically where a sentence ends. But it's not that straightforward. Sentence splitting is an important yet non-trivial first step in NLP. How would you do it? We know that full stops are a valid way of ending a sentence, but are all full stops indicative of the end of a sentence? Let's think about it. Then we have question marks and exclamation marks. All of these are good ways of ending sentences, and identifying them would be a good hint that a sentence ends.

But let's look at the full stop example in more detail. You do want full stops to end sentences, unless you have a single letter before the full stop. What does that mean? That is typically how initials are written. You also have single letters before full stops in acronyms; for example, WHO, the World Health Organization, may be written W.H.O. These are acronyms. You might have digits before and after a full stop, which could indicate currency, so 2.85 could be some amount of currency. Then we check whether what comes before the full stop is an honorific or a title: Mrs., Miss, Mr., Dr., Prof., and so on. We check whether it is a known abbreviation or not; we gave the example of WHO, and country names like U.S.A. or U.K. behave similarly. But in health, we have another area where this dot is used extensively: ICD-9 and ICD-10 codes, the diagnostic and billing codes that are used so widely. Then, in biomedical literature in general, the full stop also appears in chemical names, gene abbreviations, and so on.

You'll notice that even the simple task of splitting sentences on full stops is not trivial: not every full stop, not every dot, is the end of a sentence. You need to look at the different ways in which it can be used that do not indicate the end of a sentence, where we may want to continue. How do you do this? It is typically done in a rule-based system, where you look at a full stop and, considering the context around it, decide whether it is a valid end of a sentence or not. So that would be the first step in NLP. In the next video, we're going to look at the next steps of identifying tokens, or word boundaries, and then finding the meanings of those words.
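To make the rule-based approach described above concrete, here is a minimal sketch of a sentence splitter that treats a full stop as a boundary unless the surrounding context suggests otherwise. The rules and the abbreviation list are illustrative assumptions and are far from complete; real systems use much richer rule sets or trained models.

```python
# A minimal sketch of a rule-based sentence splitter: a full stop ends a
# sentence unless context (digits, initials, acronyms, honorifics) says otherwise.
import re

# Honorifics and abbreviations that should not end a sentence (illustrative list)
KNOWN_ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "fig", "vs", "etc"}

def is_sentence_boundary(text, i):
    """Decide whether the full stop at position i ends a sentence."""
    before, after = text[:i], text[i + 1:]

    # Rule 1: digits on both sides -> a decimal number or dotted code (e.g. 2.85)
    if before and before[-1].isdigit() and after and after[0].isdigit():
        return False

    # Find the word immediately preceding the full stop
    match = re.search(r"[A-Za-z.]+$", before)
    token = match.group().lower() if match else ""

    # Rule 2: a single letter or a dotted acronym (initials, W.H.O.) before the dot
    if re.fullmatch(r"([a-z]\.)*[a-z]", token):
        return False

    # Rule 3: a known honorific or abbreviation before the dot
    if token.rstrip(".") in KNOWN_ABBREVIATIONS:
        return False

    return True

def split_sentences(text):
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in "?!" or (ch == "." and is_sentence_boundary(text, i)):
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences(
    "Dr. Smith noted a creatinine of 2.85 on admission. "
    "The W.H.O. was notified! Was the patient re-admitted?"
))
```

On this example the sketch yields three sentences, leaving "Dr.", "2.85", and "W.H.O." intact rather than splitting on their internal dots.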