Approximately 21 million RTF documents (transcripts and letters) are accessible online, from a mixture of hospitals, clinics, specialties, and private practices. The data comes exclusively from the US and some overseas US territories and holdings.
Goal: create a data product segmented by region, care setting (ambulatory vs. acute care), specialty, gender, and age.
The documents primarily contain semi-structured transcriptions of doctor audio recordings. They also contain a mix of metadata. The presence and location of metadata within the documents are determined by the customer.
Metadata usually occurs in a page header (or footer), or simply at the start of the document.
Labels often identify the metadata field (e.g., NAME, DATE OF BIRTH).
Sections and subsections are inserted as needed to further structure the narrative (e.g., VITAL SIGNS, HISTORY).
The documents often contain lists to enumerate similar information (e.g., medications, problems, diagnoses).
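The structure described above (labeled metadata, section headings, and lists) could be picked apart with a few regular expressions. The following is a minimal sketch only, assuming the RTF has already been converted to plain text, that labels are upper-case and followed by a colon, that section headings are upper-case lines on their own, and that list items are numbered. None of these conventions are confirmed by the source, since layout is customer-specific; the function name parse_document and the regular expressions are hypothetical.

```python
import re

# Assumed conventions (not confirmed by the source): labels are upper-case
# and end with a colon; section headings are upper-case lines on their own;
# list items start with a number followed by "." or ")".
LABEL_RE = re.compile(r"^([A-Z][A-Z /]+):\s*(\S.*)$")
SECTION_RE = re.compile(r"^([A-Z][A-Z /]+):?$")
ITEM_RE = re.compile(r"^\d+[.)]\s+(.*)$")


def parse_document(text):
    """Split plain text into labeled metadata and named sections."""
    metadata = {}
    sections = {}
    current = None  # name of the section being filled, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        label = LABEL_RE.match(line)
        section = SECTION_RE.match(line)
        if label and current is None:
            # Labeled fields before the first section are treated as metadata.
            metadata[label.group(1).strip()] = label.group(2).strip()
        elif section:
            current = section.group(1).strip()
            sections[current] = {"text": [], "items": []}
        elif current is not None:
            item = ITEM_RE.match(line)
            if item:
                sections[current]["items"].append(item.group(1))
            else:
                sections[current]["text"].append(line)
    return metadata, sections
```

Because metadata may instead sit in a page header or footer, a real pipeline would likely need per-customer rules on top of this; the sketch only covers the "metadata at the start of the document" case.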
A template system is used for all customers to minimize structural differences between documents.
For all of these files, we also have the full document metadata in the DB (e.g., provider, various dates, practice data, work types).
Metadata can be exported from the DB along with the documents themselves.
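Once metadata has been exported, the segmentation by region, care setting, specialty, gender, and age could be sketched as a simple count per combination. This is an illustration under stated assumptions, not the actual export schema: the field names (region, care_setting, specialty, gender, age) and the age bands are hypothetical, and each record is assumed to be a dictionary describing one document.

```python
from collections import Counter

# Hypothetical age bands; the source does not specify how age is bucketed.
AGE_BANDS = [(0, 17, "0-17"), (18, 39, "18-39"), (40, 64, "40-64"), (65, 200, "65+")]


def age_band(age):
    """Map an age in years to a band label, or "unknown" if it fits none."""
    if age is None:
        return "unknown"
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "unknown"


def segment_counts(records):
    """Count documents per (region, care setting, specialty, gender, age band)."""
    counts = Counter()
    for rec in records:
        key = (
            rec.get("region", "unknown"),
            rec.get("care_setting", "unknown"),
            rec.get("specialty", "unknown"),
            rec.get("gender", "unknown"),
            age_band(rec.get("age")),
        )
        counts[key] += 1
    return counts
```

Missing fields fall into an "unknown" bucket rather than being dropped, so that gaps in the exported metadata stay visible in the resulting counts.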