
How unstructured EMR data helps pharma find patients

By Yash Rathi, Senior Director of Data Intelligence | MMIT

Ilan Behm, Vice President of Real-World Data Engagement | Norstella

As therapies have become more complex, pharma companies are now challenged to achieve precision targeting within a much tighter timeframe. While claims data is readily available, one of its key limitations is the lack of timeliness.

Many manufacturers now rely on specialized lab data—from imaging results to genetic testing and genomics—to identify eligible patients and their providers. As lab data is often the key driver in diagnostic decisions, this is an excellent source for commercial targeting initiatives. But what about understanding the intent behind the testing?

Unlike lab data, EMR data provides insights into physician sentiment as well as the patient’s care journey. In fact, unstructured EMR data contains the richest recorded details about a patient’s care, from biomarker levels and tumor specifics to the reasoning behind a treatment plan.

To learn more about this underappreciated data source, I spoke to Ilan Behm, vice president of Real-World Data Engagement at Norstella.

Q: What exactly is unstructured EMR data? 

A: When a patient has an office visit, their doctor uses drop-down menus to add specific values to their chart, things like vital signs and diagnostic codes. We call that structured EMR data. The unstructured data is basically all the rest of the information that contextualizes the visit: why the patient has come in, what was done during past appointments, and what the plan is for the future.

Unstructured data is captured when a doctor records and transcribes their clinical notes, or when they write free text directly into the patient’s chart. This type of data is prone to typos and redundancies, and the wording varies quite a bit from physician to physician. However, these fields are often the only place you can find the richest information about a patient, like their biomarker levels or tumor size.  

Q: How can that rich unstructured data be made searchable and usable? 

A: Depending on the case, we might first deploy large language models and natural language processing to search these clinical notes for specific keywords of interest. Standardized data science techniques help us confirm that these keywords mean what we think they mean, and that they’re not leading us to a false positive. For example, there’s a huge difference between a past diagnosis of breast cancer and a family history of breast cancer.
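As a rough illustration of the false-positive check described above, the sketch below labels each sentence that mentions a keyword by its likely context. The cue phrases and the function name classify_mention are hypothetical and chosen only for this example; it is not the pipeline discussed in the interview, which would rely on far richer models than a handful of regular expressions.

```python
import re

# Hypothetical cue phrases, for illustration only.
FAMILY_CUES = re.compile(
    r"\b(family history|mother|father|sister|brother|aunt|grandmother)\b", re.I
)
NEGATION_CUES = re.compile(
    r"\b(no evidence of|denies|negative for|ruled out)\b", re.I
)


def classify_mention(note: str, keyword: str) -> list[str]:
    """Label each sentence that mentions `keyword` by its likely context."""
    labels = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", note):
        if keyword.lower() not in sentence.lower():
            continue
        if FAMILY_CUES.search(sentence):
            labels.append("family_history")
        elif NEGATION_CUES.search(sentence):
            labels.append("negated")
        else:
            labels.append("patient_history")
    return labels


note = (
    "Patient has a past diagnosis of breast cancer, treated in 2019. "
    "Family history of breast cancer in her aunt. "
    "No evidence of breast cancer recurrence on imaging."
)
print(classify_mention(note, "breast cancer"))
# ['patient_history', 'family_history', 'negated']
```

Only the first sentence would count toward identifying a patient with a personal history of the disease; the other two are the kind of false positive a plain keyword search would let through.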

After extracting this information on a note and patient level, we gather data on when the keyword was used and in what context: is there a date? Which encounter was this note from? At which health system or office did this visit occur? This allows us to relate the note to other EMR data points, like which doctors were involved in that visit, what medications were prescribed during the encounter, etc. 
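One way to picture that linking step is a lookup from each extracted mention to its encounter record. The Mention and Encounter shapes below, along with every field name and sample value, are assumptions made for this sketch rather than an actual EMR schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Mention:
    encounter_id: str
    keyword: str


@dataclass
class Encounter:
    encounter_id: str
    patient_id: str
    visit_date: date
    site: str
    provider: str
    medications: list[str]


def attach_context(mentions, encounters):
    """Join each keyword mention to the encounter its note came from."""
    by_id = {e.encounter_id: e for e in encounters}
    rows = []
    for m in mentions:
        enc = by_id.get(m.encounter_id)
        if enc is None:
            continue  # no encounter record to relate this note to
        rows.append({
            "patient_id": enc.patient_id,
            "keyword": m.keyword,
            "visit_date": enc.visit_date,
            "site": enc.site,
            "provider": enc.provider,
            "medications": enc.medications,
        })
    return rows


# Made-up sample records.
encounters = [
    Encounter("E1", "P1", date(2023, 4, 2), "Clinic North", "Dr. A", ["drug_x"]),
]
mentions = [Mention("E1", "HER2 positive"), Mention("E9", "stage 3")]
rows = attach_context(mentions, encounters)
print(rows)
```

The second mention is dropped because its encounter is unknown, which is a reminder that a note is only as useful as the metadata it can be tied back to.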

Q: Why is unstructured EMR data particularly useful within oncology? 

A: If we look at only structured data from the EMR, we can see what kind of tumor a patient has. We may even be able to see if the cancer has metastasized, and if the patient has a secondary cancer. But most of the targeted therapies available today have to be deployed at specific stages of the tumor. All of the pivotal information manufacturers need—details of the patient’s cancer staging, disease progression, and tumor biology—is recorded in the clinical notes. That’s what makes unstructured EMR data so essential.

On top of that, most of these targeted oncology therapies are also associated with specific biomarkers. If a therapy is only applicable to a tiny subset of the patient population, let’s say 10% of the stage 3 colorectal cancer patients who’ve tested positive for a certain biomarker, then the race is on to find those patients. Time is of the essence in cases like these: life science companies need to identify those eligible patients and their treating physicians as quickly as possible, because lives depend upon it. 

Q: It seems like timeliness would be a big driver within rare disease as well. Can you speak to that space?  

A: Yes, absolutely. Rare diseases are hard to diagnose, and we’re talking about very small patient populations. There may only be one or two treatments available, so it’s all the more important to get to those patients and their physicians as quickly as possible. Their providers may not even know what this condition is, nor how to treat it. Pharma companies must be fast and strategic about finding and educating treating physicians in time.

You know, more and more rare diseases are being identified now, but the diseases themselves aren’t new—they were just previously unnamed, which naturally means they didn’t have an ICD-10 code. To find an undiagnosed patient population in rare disease, you might need to search unstructured EMR data for patients who experienced a handful of different symptoms, which occurred in a certain pattern within a specific timeframe, in such a way that suggests they could have this newly identified condition.
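A simplified version of that kind of search is sketched below. It assumes symptom mentions have already been extracted from the notes as (date, symptom) pairs, and it reduces "a certain pattern" to "at least N distinct target symptoms inside one time window". The function name, symptom labels, and thresholds are all hypothetical.

```python
from datetime import date, timedelta


def matches_pattern(events, symptoms, min_distinct, window_days):
    """Return True if at least `min_distinct` of the target `symptoms`
    occur within any window of `window_days` days.

    `events` is a list of (date, symptom) pairs for one patient.
    """
    relevant = sorted((d, s) for d, s in events if s in symptoms)
    window = timedelta(days=window_days)
    for i, (start, _) in enumerate(relevant):
        # Distinct target symptoms seen from this event forward, inside the window.
        seen = {s for d, s in relevant[i:] if d - start <= window}
        if len(seen) >= min_distinct:
            return True
    return False


target = {"fatigue", "joint pain", "rash"}

patient_a = [
    (date(2021, 1, 10), "fatigue"),
    (date(2021, 3, 1), "joint pain"),
    (date(2021, 5, 15), "rash"),
]
patient_b = [
    (date(2019, 1, 1), "fatigue"),
    (date(2021, 6, 1), "rash"),
]

print(matches_pattern(patient_a, target, min_distinct=3, window_days=180))  # True
print(matches_pattern(patient_b, target, min_distinct=3, window_days=180))  # False
```

A real rare-disease definition would also care about the order and severity of symptoms, which this sketch ignores.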

Of course, you need longitudinal data to do that, to see years into a patient’s history. To see the totality of events that patients are experiencing, you really need all of these real-world datasets in tandem: both structured and unstructured EMR data, open and closed claims data, lab tests and results. Integrating all of it is really the only way to see that total patient journey. 

Q: So how can this unstructured EMR data be bridged to other datasets? 

A: Typically, if a manufacturer already has claims data from Vendor A, and they want to use a supplemental dataset, they’ll pay Vendor B for a data pull or a subscription to acquire lab data, or EMR data. But then, they would also have to pay a third-party company to bridge those two datasets and link their respective patient IDs. Not all of the patients would match, so they’d lose some of the patient files in the process. And they’d also have to do expert determination, which might require a fourth vendor to ensure patients remain unidentifiable after the data is linked.
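To make the match-loss point concrete, here is a minimal sketch of bridging two vendors' patient IDs through a shared de-identified token. It assumes both vendors' records have already been tokenized the same way; the IDs, tokens, and the bridge function are invented for illustration and do not describe any particular vendor's process.

```python
def bridge(claims_tokens: dict[str, str], emr_tokens: dict[str, str]):
    """Link two vendors' patient IDs through a shared token.

    Each argument maps a vendor-specific patient ID to its token.
    Returns the linked ID pairs and the share of claims patients matched.
    """
    emr_by_token = {tok: pid for pid, tok in emr_tokens.items()}
    linked = {
        claims_id: emr_by_token[tok]
        for claims_id, tok in claims_tokens.items()
        if tok in emr_by_token
    }
    match_rate = len(linked) / len(claims_tokens) if claims_tokens else 0.0
    return linked, match_rate


# Made-up IDs and tokens.
vendor_a_claims = {"A1": "t1", "A2": "t2", "A3": "t3", "A4": "t4"}
vendor_b_emr = {"B7": "t2", "B8": "t4", "B9": "t9"}

linked, match_rate = bridge(vendor_a_claims, vendor_b_emr)
print(linked)      # {'A2': 'B7', 'A4': 'B8'}
print(match_rate)  # 0.5
```

In this toy case, half of the claims patients have no counterpart in the EMR file, so their records fall out of any linked analysis.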

Basically, there’s always a loss of data fidelity associated with this process. It’s also expensive and time-consuming. If a manufacturer wanted to use this data on a weekly basis, they might choose to pay just once for data tokenization and harmonization, but they would still have to keep bridging files every week, running quality control, and so on and so forth. It’s not a very sustainable process.

That’s why NorstellaLinQ is a real game-changer, because we’ve already integrated our real-world datasets. We use the same Norstella patient ID across our data, whether that data originated from open claims, closed claims, lab tests, vital tables, wherever. It allows us to tie the information contained in unstructured clinical notes to the rest of the patient’s journey, so we can see the full picture of how their care and disease have progressed over time.

We eliminate the time and expense of harmonization, as well as the loss of data fidelity. Our clients don’t have to wait; they can just start running analytics right away. 

Q: Tell me more about the data science techniques used to validate this data. 

A: When working with this data, you don’t just use one AI, machine learning, or large language model to find your results. You’re looking for consensus between multiple models, which helps to ensure that the end results are an accurate representation. For example, let me explain how AI and data science can be used in a predictive manner, to extrapolate our initial findings to a broader population.
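The idea of looking for consensus can be sketched as a simple vote across model outputs. The two-thirds threshold, the label strings, and the consensus function are assumptions for this example; the interview does not specify how agreement is actually measured.

```python
from collections import Counter


def consensus(labels, min_agreement=2 / 3):
    """Return the label that enough models agree on, or None if there is
    no sufficient agreement (for example, to route the note to review)."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    if votes / len(labels) >= min_agreement:
        return label
    return None


# Hypothetical outputs from three different models reading the same note.
print(consensus(["stage_3", "stage_3", "stage_2"]))  # stage_3
print(consensus(["stage_3", "stage_2", "stage_1"]))  # None
```

Returning None rather than a best guess keeps low-agreement notes out of the final dataset until someone has looked at them.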

In type I diabetes, one of the key measures our clients look for in unstructured EMR data is islet autoantibody testing, a blood test which can be used to diagnose type I diabetes, or to determine if the patient has type II diabetes. The lab data reveals that the autoantibody testing is occurring, but we don’t know the sentiment behind why the test was ordered until we look in the clinical notes in the EMR.

By studying all the characteristics of the patient, physician, and the sequencing of ordered tests, we can confirm which patterns indicate that a physician is ordering this test to confirm type I diabetes. We can then apply that pattern, via a machine learning model, to the broader patient population, for patients where we don’t have access to their physicians’ unstructured clinical notes.
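As a deliberately simple stand-in for the machine learning model mentioned above, the sketch below learns how often each combination of characteristics went with a confirmed type I intent in patients whose notes are available, then applies those rates to patients without notes. The feature values, the 0.8 threshold, and the function names are hypothetical; a real model would use many more features and a proper learning algorithm.

```python
from collections import defaultdict


def fit_rates(labeled):
    """Learn, per feature pattern, the share of tests ordered on suspicion
    of type I diabetes. `labeled` is a list of (features, suspected) pairs,
    where the label comes from reading the clinical notes."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for features, suspected in labeled:
        totals[features] += 1
        positives[features] += int(suspected)
    return {f: positives[f] / totals[f] for f in totals}


def predict(rates, features, threshold=0.8):
    """Apply the learned rates to a patient with no notes available.
    Returns None when the pattern was never seen in the labeled data."""
    rate = rates.get(features)
    if rate is None:
        return None
    return rate >= threshold


# Made-up labeled examples: (ordering specialty, patient age band).
labeled = (
    [(("endocrinology", "under_30"), True)] * 4
    + [(("endocrinology", "under_30"), False)]
    + [(("primary_care", "over_50"), False)] * 2
    + [(("primary_care", "over_50"), True)]
)
rates = fit_rates(labeled)

print(predict(rates, ("endocrinology", "under_30")))  # True
print(predict(rates, ("primary_care", "over_50")))    # False
print(predict(rates, ("cardiology", "over_50")))      # None
```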

In this way, we can predict with confidence that a particular subset of early autoantibody testing was performed because the physicians suspected type I diabetes. By knowing a physician’s intention, we can help life science companies understand who the experts are in a given field, and which HCPs and HCOs might benefit from additional education and awareness from field teams. 

Learn more about how our unstructured EMR data can help your team find eligible patients and their prescribers.

