Using MetaMap with Python to access the UMLS MetaThesaurus - A Quick Start Guide

The National Library of Medicine (NLM) created a tremendous resource called the Unified Medical Language Systems (UMLS). The UMLS is a suite of data and software resources that offers several tools for processing, representing, and analyzing biomedical text data. One exciting use case that we’ll explore here is using UMLS to identify signs and symptoms that are documented in the text of clinical encounter notes.

MetaMap is a software tool that identifies concepts in the text of clinical notes that are found in UMLS ontologies. MetaMap is written in Java and there is a convenient wrapper in Python: PyMetaMap written by Anthony Rios. There are other Python wrappers around MetaMap but PyMetaMap was the most straightforward to get running on my machine so that’s what we’ll use here.

Step 1: Installation

Make sure you have have MetaMap installed. Instructions are here: https://metamap.nlm.nih.gov/Installation.shtml

I used the default settings for the installation and it has worked out fine.

NB To use most UMLS resources you will have to register for an account and sign a data use agreement. This is free!

Step 2: Get some text data

Because working with clinical notes is often challening due to issues of de-identification, here are a few sample text notes we’ll use here. These are completely made up and not based on any real people.

note1 = "Mrs. Jones came in today complaining of a lot of chest pain. She denies shortness of breath and fevers."

note2 = "Mr. Smith has been having pain in his knee for many weeks. It hurts when he walks. There is no swelling, erythema, or micromotion tenderness."

note3 = "Sandy Lemon has been having headaches for two months that are associated with blurry vision and numbness in her right arm."

# Put them all together in a list
note_list = [note1, note2, note3]

Step 3: Install PyMetaMap

The last time I tried this, the version available via pip was old. So installed the latest version (0.2) directly from the GitHub site here: https://github.com/AnthonyMRios/pymetamap

I accomplished this using:

pip install git+https://github.com/AnthonyMRios/pymetamap.git

Now load the necessary modules and set up the relevant servers in the background:

# Load MetaMap
from pymetamap import MetaMap

# Import os to make system calls
import os

# For pausing
from time import sleep

# Setup UMLS Server
metamap_base_dir = '/gwshare/umls_2021/metamap/public_mm/'
metamap_bin_dir = 'bin/metamap20'
metamap_pos_server_dir = 'bin/skrmedpostctl'
metamap_wsd_server_dir = 'bin/wsdserverctl'

# Start servers
os.system(metamap_base_dir + metamap_pos_server_dir + ' start') # Part of speech tagger
os.system(metamap_base_dir + metamap_wsd_server_dir + ' start') # Word sense disambiguation 

# Sleep a bit to give time for these servers to start up
sleep(60)

Set up a MetaMap object to do the work. If you are running into memory issues with this approach, see some tips here: https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/OutOfMemory.pdf

metam = MetaMap.get_instance(metamap_base_dir + metamap_bin_dir)

Now let’s process some text.

cons, errs = metam.extract_concepts(note_list,
                                word_sense_disambiguation = True,
                                restrict_to_sts = ['sosy'], # signs and symptoms
                                composite_phrase = 1, # for memory issues
                                prune = 30)
                                
# Look at the output:
print(cons)

## [ConceptMMI(index='USER', mm='MMI', score='5.18', preferred_name='Dyspnea', cui='C0013404', semtypes='[sosy]', trigger='["SHORTNESS OF BREATH"-tx-1-"shortness of breath"-noun-1]', location='TX', pos_info='73/19', tree_codes=''), ConceptMMI(index='USER', mm='MMI', score='5.18', preferred_name='Fever', cui='C0015967', semtypes='[sosy]', trigger='["Fevers"-tx-1-"fevers"-noun-1]', location='TX', pos_info='97/6', tree_codes=''), ConceptMMI(index='USER', mm='MMI', score='3.45', preferred_name='Mastodynia', cui='C0024902', semtypes='[sosy]', trigger='["of breast pain"-tx-1-"of chest pain"-noun-0]', location='TX', pos_info='38/2,50/10', tree_codes=''), ConceptMMI(index='USER', mm='MMI', score='3.71', preferred_name='Knee pain', cui='C0231749', semtypes='[sosy]', trigger='["Pain in knee"-tx-1-"pain in knee"-noun-0]', location='TX', pos_info='27/7,39/4', tree_codes=''), ConceptMMI(index='USER', mm='MMI', score='3.68', preferred_name='Sore to touch', cui='C0234233', semtypes='[sosy]', trigger='["TENDERNESS (NOS)"-tx-1-"tenderness"-noun-1]', location='TX', pos_info='131/10', tree_codes=''), ConceptMMI(index='USER', mm='MMI', score='3.57', preferred_name='Headache', cui='C0018681', semtypes='[sosy]', trigger='["Headaches"-tx-1-"headaches"-noun-0]', location='TX', pos_info='29/9', tree_codes=''), ConceptMMI(index='USER', mm='MMI', score='3.56', preferred_name='Numbness', cui='C0028643', semtypes='[sosy]', trigger='["NUMBNESS"-tx-1-"numbness"-noun-0]', location='TX', pos_info='97/8', tree_codes='')]

And try to put it into a more usable format:

import pandas as pd

def get_keys_from_mm(concept, klist):
    conc_dict = concept._asdict()
    conc_list = [conc_dict.get(kk) for kk in klist]
    return(tuple(conc_list))
        
        
keys_of_interest = ['preferred_name', 'cui', 'semtypes', 'pos_info']
cols = [get_keys_from_mm(cc, keys_of_interest) for cc in cons]
results_df = pd.DataFrame(cols, columns = keys_of_interest)

# See results
print(results_df)

## preferred_name       cui semtypes    pos_info
## 0        Dyspnea  C0013404   [sosy]       73/19
## 1          Fever  C0015967   [sosy]        97/6
## 2     Mastodynia  C0024902   [sosy]  38/2,50/10
## 3      Knee pain  C0231749   [sosy]   27/7,39/4
## 4  Sore to touch  C0234233   [sosy]      131/10
## 5       Headache  C0018681   [sosy]        29/9
## 6       Numbness  C0028643   [sosy]        97/8

And there you found the signs and symptoms with their associated plain language name, cui, and position. For processing many notes in parallel, I’ve found it works to load the UMLS servers once but create a new MetaMap object inside each process.

Good luck, and happy processing!

Assistant Professor of Medicine and Informatics

I am a pulmonary and critical care physician at the University of Pennsylvania Perelman School of Medicine, and core faculty member at the Palliative and Advanced Illness Research (PAIR) Center. My work seeks to translate innovations in artificial intelligence methods into bedside clinical decision support systems that alleviate uncertainty in critical clinical decisions. My research interests include clinical informatics, artificial intelligence, natural language processing, machine learning, population health, and pragmatic trials.