CEDARS is provided as-is with no guarantee whatsoever and users agree to be held responsible for compliance with their local government/institutional regulations. All CEDARS installations should be reviewed with institutional information security authorities.
CEDARS was tested on a desktop PC. R 3.5.0 or above and all dependency packages need to be installed:
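A quick way to confirm the R version requirement before installing anything is sketched below. The package vector is a placeholder: only devtools is named in this manual, so the remaining dependencies would have to be added by hand.

```r
# Minimal pre-installation check (illustrative sketch, not part of CEDARS)
min_r_version <- "3.5.0"
r_version_ok <- getRversion() >= min_r_version

# Placeholder list: extend with the actual CEDARS dependencies
needed_packages <- c("devtools")
installed <- rownames(installed.packages())
missing_packages <- setdiff(needed_packages, installed)

if (!r_version_ok) {
  message("R ", min_r_version, " or above is required; found ", getRversion())
}
if (length(missing_packages) > 0) {
  message("Packages still to install: ", paste(missing_packages, collapse = ", "))
}
```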
RStudio is required to use the app locally and to publish it to RStudio Connect. A MongoDB installation is required to hold all the project data, preferably on a dedicated server.
Lastly, the Unified Medical Language System (UMLS) MRCONSO.RRF file is required for searches using Concept Unique Identifiers (CUIs). The UMLS is a rich compendium of biomedical lexicons. It is maintained by the National Institutes of Health (NIH), and an account is required to access the associated files. Those files are not included with the CEDARS R package, but CEDARS is designed to use them natively so individual users can easily include them in their annotation pipeline.
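For orientation, MRCONSO.RRF is a pipe-delimited text file in which each row links a CUI to one term string. The sketch below reads a few rows with base R and lists the terms attached to one CUI. The three sample rows and their identifiers are synthetic, the helper terms_for_cui() is hypothetical, and this is not how CEDARS itself loads the file.

```r
# Column layout of MRCONSO.RRF; each row also ends with a trailing "|"
mrconso_cols <- c("CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI",
                  "SAUI", "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR",
                  "SRL", "SUPPRESS", "CVF")

# Synthetic rows for illustration only (identifiers are made up)
sample_lines <- c(
  "C9000001|ENG|P|L9000001|PF|S9000001|Y|A9000001||||SRC1|PT|X001|Pulmonary embolism|0|N||",
  "C9000001|ENG|S|L9000002|PF|S9000002|Y|A9000002||||SRC2|SY|X002|Lung embolism|0|N||",
  "C9000002|ENG|P|L9000003|PF|S9000003|Y|A9000003||||SRC1|PT|X003|Deep vein thrombosis|0|N||"
)

# For the real file, replace text = sample_lines with file = "<path>/MRCONSO.RRF"
mrconso <- read.table(text = sample_lines, sep = "|", quote = "",
                      comment.char = "", header = FALSE,
                      colClasses = "character",
                      col.names = c(mrconso_cols, "trailing"))

# Hypothetical helper: all distinct term strings for one CUI in one language
terms_for_cui <- function(cui, lang = "ENG") {
  unique(mrconso$STR[mrconso$CUI == cui & mrconso$LAT == lang])
}

terms <- terms_for_cui("C9000001")
```

The real file is large, so in practice it would be filtered to the languages and sources of interest before being used in a query.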
The CEDARS app runs from within a Shiny instance. It is possible to use either RStudio Connect or alternatively a dedicated server running Shiny Server. The former is easy to use from RStudio desktop but requires an existing RStudio Connect installation within your organization, while the latter is typically more costly and labor intensive.
Users connect to the Shiny app by accessing a web URL provided by RStudio Connect or a web server on a dedicated Shiny Server installation. CEDARS pulls data from the database, processes it and presents it to the end users. Data entered by users is processed by CEDARS and saved to the database. RStudio Connect and Shiny Server Pro can automatically spawn multiple processes ("workers") when several end users access the app simultaneously. This is rarely needed with CEDARS, since in most implementations only a few abstractors (i.e. fewer than 5) will have access to the interface. In most cases, if pre-search is used (see below), CEDARS will run fairly quickly with 2 simultaneous users in a single-threaded setup.
CEDARS can handle authentication, but ideally this will be done by Active Directory through RStudio Connect. This approach ensures optimal integration with an organization's IT architecture.
CEDARS R Package Installation
From RStudio, and with the devtools package installed:
This will install CEDARS on a desktop or server. In the case of RStudio Connect implementations, CEDARS will be automatically installed when uploading the app through RStudio.
Extracting clinical event dates with CEDARS is a simple, sequential process:
Project Execution Overview
The CEDARS package includes its data entry interface in the form of a Shiny app. However, additional information from the administrator is required for the app instance to connect with MongoDB, including:
database user ID and password
host server and port (default is 27017)
whether or not Active Directory will be used for user authentication
destination path to save the app, mapped from the R working directory
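To show how the user ID, password, host and port listed above fit together, here is a minimal sketch that assembles a standard MongoDB connection string. The function build_mongo_uri() and the example values are hypothetical; CEDARS builds its own connection from the saved credentials, so this only illustrates the pieces involved.

```r
# Hypothetical helper: assemble a standard mongodb:// connection string
build_mongo_uri <- function(user, password, host, port = 27017) {
  # Percent-encode credentials so characters such as "@" or ":" do not
  # break the URI structure
  user_enc <- utils::URLencode(user, reserved = TRUE)
  pass_enc <- utils::URLencode(password, reserved = TRUE)
  sprintf("mongodb://%s:%s@%s:%d", user_enc, pass_enc, host, as.integer(port))
}

# Example values are placeholders, not real credentials
uri <- build_mongo_uri("cedars_admin", "p@ss:word", "db.example.org")
```

Encoding the credentials matters in practice, because a password containing "@" or ":" would otherwise be misread as part of the host section.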
The function save_credentials() must be called to generate the app and associated Rdata file:
Both the app.R and db_credentials.Rdata files must be uploaded to the Shiny instance.
RStudio Connect App Upload
This option assumes you have an account with your institution's RStudio Connect service. From within RStudio, simply navigate to the folder where the app was saved and click on the app.R file. Click the "Publish to Server" icon, making sure both necessary files are included and hit "Publish".
Shiny Server App Upload
A discussion of the installation process and use of Shiny Server is beyond the scope of this manual; pertinent information can be found on the maker's website. The CEDARS app.R and db_credentials.Rdata files should be uploaded to the desired app directory.
Creating the database and populating it with data pertaining to the project occurs as follows:
Preparing the Database
Each data collection task on a given cohort of patients is a distinct CEDARS "project" with its own MongoDB database including all collections needed to operate. Different projects cannot share the same database or collections. This encapsulation allows for reliable backup and deletion of project data upon completion, also avoiding data corruption due to cross-talk between different annotation tasks. Initialization is the process by which necessary collections are generated and populated with project-specific data.
The function create_project() generates a database which will hold all collections pertaining to the project. If the CEDARS project administrator has database creation privileges, a new MongoDB database will be created and its collections generated automatically. If database creation privileges have not been granted, the MongoDB administrator can create the blank database instead. Once this is done, create_project() can be used to generate the collections:
The function upload_notes() transfers the raw clinical corpus to the CEDARS database. This would typically consist of a collection of clinical notes or radiology reports formatted as a dataframe with the following fields:
"patient_id" Patient-specific unique identifier, typically a medical record number
"text_id" Unique identifier for the text segment
"text" Text segment, can be a whole note or a section, sub-section etc.
"text_date" Date of the clinical encounter or radiology test
"doc_id" Unique identifier for the document
"text_sequence" Optional, if a document contains more than one text segment (each with a distinct text_id), this field indicates the order of the segments/sections
"text_tag_1" Optional metadata, for example patient's name, medical professional name, note section name, etc.
"text_tag_2" Optional metadata
"text_tag_3" Optional metadata
"text_tag_4" Optional metadata
"text_tag_5" Optional metadata
"text_tag_6" Optional metadata
"text_tag_7" Optional metadata
"text_tag_8" Optional metadata
"text_tag_9" Optional metadata
"text_tag_10" Optional metadata
In a typical use case, there would be a large number of patients/notes located on a separate server, so a custom batch function to download notes and transfer to CEDARS one patient at a time would have to be devised, e.g.:
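library(RODBC)  # provides sqlQuery()
for (i in seq_along(patient_list$recnum)) {
  recnum <- patient_list$recnum[i]
  # the closing quote completes the SQL string literal around recnum
  download_notes <- sqlQuery(origin_db_rodbc, paste0("SELECT * FROM CORE_DB WHERE RECNUM = '", recnum, "'"))
  if (is.data.frame(download_notes) && nrow(download_notes) > 0) {
    # convert field names and ensure their format is compliant with CEDARS, then pass the result to upload_notes()
    print(paste("notes uploaded for patient #", i, sep=""))
  } else print(paste("no notes for patient #", i, sep=""))
}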
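The conversion step in the loop above has to produce a dataframe with the fields listed earlier. A minimal sketch of such a dataframe and a pre-upload check follows. The column types, the sample values and the helper check_notes() are assumptions made for illustration, and the first five fields are treated as required only because the list does not mark them as optional.

```r
# Field names taken from the list above
required_fields <- c("patient_id", "text_id", "text", "text_date", "doc_id")
optional_fields <- c("text_sequence", paste0("text_tag_", 1:10))

# Made-up example: one document split into two text segments
notes <- data.frame(
  patient_id    = c("1001", "1001"),
  text_id       = c("doc1_s1", "doc1_s2"),
  text          = c("Chest CT: no pulmonary embolism.", "Follow up in 3 months."),
  text_date     = as.Date(c("2020-01-15", "2020-01-15")),
  doc_id        = c("doc1", "doc1"),
  text_sequence = c(1L, 2L),
  text_tag_1    = c("radiology report", "radiology report"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: report missing or unexpected columns and repeated text_id values
check_notes <- function(df) {
  list(
    missing = setdiff(required_fields, names(df)),
    unknown = setdiff(names(df), c(required_fields, optional_fields)),
    duplicated_text_id = anyDuplicated(df$text_id) > 0
  )
}

result <- check_notes(notes)
```

Running a check of this kind on each batch before upload catches renamed or missing columns early, rather than after part of the corpus has already been transferred.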
Once all notes or reports have been uploaded, unique values for metadata tags can be obtained with the download_filtered_tags() function. This can be useful to select specific document types to include in or exclude from the query. By default, tags featured in the uploaded corpus will be matched with the stored CEDARS query, but this can be overridden:
CEDARS uses the UDPipe natural language processing (NLP) pipeline for paragraph/sentence boundary detection, tokenization, lemmatization, part-of-speech tagging and dependency parsing. The function automatic_NLP_processor() will assess the project for missing annotations and process documents as needed:
uri_fun <- mongo_uri_standard
# Only UDPipe is supported for now
# Use your favorite model, can be standard issue or custom fitted
Sometimes a cohort of patients will already have been assessed with other methods and CEDARS is used as a redundant method to pick up any previously missed events. In this use case, a list of known clinical events with their dates will exist. This information can be loaded on CEDARS as a "starting point", so as to avoid re-discovering already documented events. The function upload_events() can be used to insert these data:
# event_dates is a dataframe with patient unique IDs and event dates
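As an illustration of what such a dataframe of known events might look like, the sketch below builds one and keeps the earliest known date per patient. The column names patient_id and event_date, the sample values, and the assumption that a single date per patient is wanted are all hypothetical; the exact format expected by upload_events() should be checked in the package documentation.

```r
# Made-up list of previously documented events
event_dates <- data.frame(
  patient_id = c("1001", "1002", "1002"),
  event_date = as.Date(c("2019-03-02", "2019-06-05", "2018-11-20")),
  stringsAsFactors = FALSE
)

# Sort by patient and date, then keep the first (earliest) row per patient
event_dates <- event_dates[order(event_dates$patient_id, event_dates$event_date), ]
earliest_events <- event_dates[!duplicated(event_dates$patient_id), ]
rownames(earliest_events) <- NULL
```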
The CEDARS search query incorporates the following wildcards:
"?": for one character, for example "r?d" would match "red" or "rod" but not "reed"
"*": for zero to any number of characters, for example "r*" would match "red", "rod", "reed", "rd", etc.
CEDARS also applies the following Boolean operators:
"AND": both conditions present
"OR": either condition present
"!": negation, for example "!red" would only match sentences without the word "red"
Lastly, the "(" and ")" operators can be used to further develop logic within a query.
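The wildcard and Boolean semantics described above can be tried out with base R. This sketch relies on utils::glob2rx(), which happens to use the same meaning for "?" and "*". The helper matches_token() is hypothetical and this is not the CEDARS search engine, only a way to see which words a pattern would cover.

```r
# Hypothetical helper: which tokens match a wildcard pattern?
matches_token <- function(pattern, tokens) {
  # glob2rx() turns "?" into "." and "*" into ".*", anchored to the whole token
  grepl(utils::glob2rx(pattern), tokens, ignore.case = TRUE)
}

tokens <- c("red", "rod", "reed", "rd")
q_hits <- matches_token("r?d", tokens)  # "?" stands for exactly one character
s_hits <- matches_token("r*", tokens)   # "*" stands for zero or more characters

# Sentence-level logic equivalent to the query: r?d AND !blue
sentence <- c("the", "rod", "was", "green")
and_not <- any(matches_token("r?d", sentence)) &&
  !any(matches_token("blue", sentence))
```

Here "r?d" covers "red" and "rod" only, while "r*" covers all four tokens, which matches the examples given for each wildcard.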
Once NLP annotations are complete and the search query has been defined, the system will automatically start parsing the annotations for matches whenever an end user logs into CEDARS. However, depending on the frequency (density) of hits, it can take several seconds to find a sentence to present to the end user. This can impact user experience, especially in multi-user settings. For this reason, records should ideally be pre-searched:
Detailed information is provided, including which sentences were reviewed.
CEDARS is by definition semi-automated, and depending on the specific use case and search query some events might be missed. This problem should be quantified by means of a systematic, old-fashioned review of randomly selected patients. Typically, at least 200 patients would be selected and their corpora reviewed manually for events. Alternatively, a different method (e.g. billing codes) could be used. This audit dataset should be compared with the CEDARS event table to estimate the sensitivity of the search query in the cohort at large. If sensitivity falls below the previously established minimum acceptable value, the search query scope should be broadened, followed by a database reset, uploading of previously identified events and a new human annotation pass, followed by a repeat audit.
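The sensitivity estimate from such an audit can be computed in a few lines. In the sketch below the patient identifiers, the variable names and the 0.9 threshold are all made up for illustration. Sensitivity is the proportion of audit-confirmed events that CEDARS also captured, reported with an exact binomial confidence interval.

```r
# Hypothetical audit result: patients with a manually confirmed event
audit_positive <- c("p01", "p04", "p07", "p09", "p12")
# Hypothetical CEDARS event table: patients with a recorded event
cedars_positive <- c("p01", "p04", "p09", "p12", "p15")

true_pos <- sum(audit_positive %in% cedars_positive)
false_neg <- length(audit_positive) - true_pos
sensitivity <- true_pos / (true_pos + false_neg)

# Exact (Clopper-Pearson) 95% confidence interval for the sensitivity
ci <- binom.test(true_pos, true_pos + false_neg)$conf.int

# Assumed project-specific threshold, to be fixed before the audit
min_acceptable <- 0.9
needs_broader_query <- sensitivity < min_acceptable
```

With a small number of confirmed events the confidence interval is wide, which is one reason for auditing at least 200 patients.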
Once all events have been tallied and the audit results are satisfactory, the CEDARS project database can, if desired, be deleted from the MongoDB server. This is an irreversible operation: