How many times have you wished the data were better in the healthcare industry? We run into data issues in EMRs, data warehouses, analytics and especially in Population Health platforms. We don’t even need to talk about how important feature engineering is for applications that leverage Big Data, Machine Learning or Artificial Intelligence. Several Population Health companies failed because they didn’t tackle data issues correctly or underestimated the importance of data!
In this article, I am going to briefly talk about a technique you can employ to tackle data issues for your use cases. I call it the Data Hurdle Strategy. And ‘for your use cases’ is the key. You have to start with your use case and identify all the data you need (no more, no less) and the formats you need the data in. For each of these data points, think about the external data journey (before the data reach your platform) and the internal data journey (after the data have reached your platform), identify all the data hurdles, their scale and impact, and finally the mitigation strategies.
Before the data reached your platform:
Identify all the data sources, whether external or intra-company upstream applications, and for each one capture: the source fidelity or trust level (e.g. the fidelity of data coming from CMS would be very high); the number of hops the data took before reaching you, i.e. how many EMRs, interface engines, HIEs and data platforms the data moved through before reaching your application or platform; the frequency at which you need the data; how many sources send you the same data points; the variation or consistency of the same data points across different sources; the data availability percentage by source at the basic unit level (e.g. patient or provider); and the identifiers each source uses for your basic units. We live in a world where every company has its own platforms, so the data could have been tampered with anywhere along the way.
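One way to keep this inventory honest is to write it down as structured records rather than in someone’s head. The sketch below is only an illustration: the `DataSource` fields, the 0-to-1 scales, the thresholds and the three sample sources are all assumptions made for the example, not values from any real implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str             # hypothetical label for the source
    fidelity: float       # trust level on an assumed 0.0 to 1.0 scale
    hops: int             # systems the data passed through before reaching you
    availability: float   # share of patients that have the data point, 0.0 to 1.0
    patient_id_type: str  # identifier the source uses for the basic unit


def sources_needing_review(sources, min_fidelity=0.7, max_hops=3, min_availability=0.8):
    """Return the names of sources that miss any of the (assumed) thresholds."""
    flagged = []
    for s in sources:
        if s.fidelity < min_fidelity or s.hops > max_hops or s.availability < min_availability:
            flagged.append(s.name)
    return flagged


# Illustrative inventory; every number here is made up for the example.
inventory = [
    DataSource("cms_claims", fidelity=0.95, hops=1, availability=0.98, patient_id_type="MBI"),
    DataSource("clinic_emr_via_hie", fidelity=0.60, hops=4, availability=0.85, patient_id_type="MRN"),
    DataSource("lab_feed", fidelity=0.85, hops=2, availability=0.70, patient_id_type="MRN"),
]

print(sources_needing_review(inventory))
```

The thresholds should come from your use case, not from this sketch; the point is simply that once the inventory exists as data, the weak sources surface on their own.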
After the data reached your platform:
Identify all the cleansing steps in your applications, the mappings and normalization the data went through, the filling strategies you used when data were not available, the new data points you derived in your platform, the dependencies of the derived data, and the trail of your data through your analytics measures, facts, features, dimensions, etc.
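Filling strategies are the easiest part of this trail to lose, because a filled value looks exactly like an observed one downstream. A minimal sketch, assuming records are plain dictionaries and using a hypothetical `fill_missing` helper and `smoking_status` field, is to store a flag next to every value you fill:

```python
def fill_missing(records, field, default):
    """Fill missing values and record whether each value was filled or observed.

    Returns new dictionaries; the input records are left unchanged.
    """
    out = []
    for record in records:
        was_filled = record.get(field) is None
        row = dict(record)
        if was_filled:
            row[field] = default
        # Keep the provenance next to the value so analytics can exclude filled data.
        row[f"{field}_was_filled"] = was_filled
        out.append(row)
    return out


# Illustrative records; the field name and values are assumptions for the example.
records = [
    {"patient_id": "p1", "smoking_status": "never"},
    {"patient_id": "p2", "smoking_status": None},
    {"patient_id": "p3"},
]

for row in fill_missing(records, "smoking_status", "unknown"):
    print(row)
```

With the flag in place, any measure or feature built on the field can report how much of it rests on filled values.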
Identify the Data Hurdles, their Scale and Impact:
Once you have identified all the information discussed above, it is not difficult to identify the data hurdles you may have, their scale, and the impact they would have on your use cases. Examples of data hurdles include delayed data, unavailable data (data that were never captured), unstructured data (captured but not in a structured format), structured but un-coded data, data captured but never sent to you, data sent to you but ignored, and data available in different code sets. One example of the scale of a data hurdle is inconsistency across sources: if different sources send the same data in different formats, you could have a 3x, 4x or 10x scale issue, and it needs to be addressed with more urgency if the data are very important. Different types of data points can also have a different impact on your application. For example, a wrong patient identifier sent to your platform may mean that the entire data set is unusable, so the impact is the highest (this is a real-world example). You must fix these types of data hurdles first.
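The ordering described above can be sketched as a small hurdle register. Everything in it is an assumption for illustration: the `Hurdle` fields, the 1-to-5 impact scale and the sample entries. It sorts by impact first and uses scale only as a tie-breaker, so that a single highest-impact hurdle such as a wrong patient identifier is never outranked by a widespread but less damaging one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hurdle:
    name: str
    scale: int   # e.g. number of sources or formats affected (3x, 4x, 10x)
    impact: int  # assumed scale: 1 (minor) to 5 (data unusable)


def prioritize(hurdles):
    """Order hurdles by impact first, then by scale, highest first."""
    return sorted(hurdles, key=lambda h: (h.impact, h.scale), reverse=True)


# Illustrative register; the scores are made up for the example.
register = [
    Hurdle("delayed_claims", scale=2, impact=3),
    Hurdle("lab_codes_in_different_code_sets", scale=4, impact=3),
    Hurdle("wrong_patient_identifier", scale=1, impact=5),
]

for hurdle in prioritize(register):
    print(hurdle.name, hurdle.impact, hurdle.scale)
```

Multiplying scale by impact into a single score is a common alternative, but it would let a 4x formatting issue outrank the identifier problem here, which is the opposite of what you want.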
Data Hurdle Mitigation Strategy:
Then come up with a mitigation strategy for each hurdle type. Sometimes you may need multiple mitigation strategies. For example, if data are not captured in the source EMR, the corrective action is to ask the upstream EMR technologists or data managers to create a form or change the EMR workflow to capture the data you need. If data are received in a different format, the mitigation strategy might be to do standard lookups and transform the data into the format you need. Data provenance is a big issue in the healthcare industry, and one mitigation strategy could be to not use certain data coming from unreliable, low-fidelity platforms. Depending on the types of data hurdles you are dealing with, you may have to use different data validation strategies too. For example, if a data point has gone through more than five hops (i.e. the data were too liquid), you may have to validate its accuracy at every hop, because the data might have been tampered with at any data lake along the way.
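The lookup-and-transform mitigation can be sketched in a few lines. The lookup table, the `sex` field and the source values below are hypothetical; in practice the table would come from whatever standard code set your use case requires. Note that unmapped values are set aside for review rather than guessed, in keeping with the advice not to make assumptions without confirming upstream.

```python
# Hypothetical lookup from source-specific values to the format the platform needs.
SEX_LOOKUP = {
    "m": "male", "male": "male", "1": "male",
    "f": "female", "female": "female", "2": "female",
}


def normalize(value, lookup=SEX_LOOKUP):
    """Return the target code for a source value, or None if it cannot be mapped."""
    if value is None:
        return None
    return lookup.get(str(value).strip().lower())


def transform(records, field="sex"):
    """Split records into normalized ones and ones that need manual review."""
    clean, unmapped = [], []
    for record in records:
        code = normalize(record.get(field))
        if code is None:
            unmapped.append(record)  # never guess; confirm with the upstream source
        else:
            clean.append({**record, field: code})
    return clean, unmapped


# Illustrative input from three imaginary sources.
incoming = [
    {"patient_id": "p1", "sex": "M"},
    {"patient_id": "p2", "sex": " Female "},
    {"patient_id": "p3", "sex": "X9"},
]

clean, unmapped = transform(incoming)
print(clean)
print(unmapped)
```

The size of the `unmapped` list over time is itself a useful measure of how well the mitigation is working.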
A few final thoughts:
Use the 80-20 rule, i.e. don’t waste 80% of your time on 20% of the data. Sometimes it’s okay to do a few things manually. As one of my Cambridge professors said, never trust the data; or rather, trust, but verify. Inferences from existing data will most likely turn out to be wrong the moment you bring in more data. Also, never make assumptions without confirming them with upstream entities. Finally, bring in data experts who understand technology, know the business fundamentals and have industry know-how.
I have helped multiple clients use the Data Hurdle Strategy to complete successful Population Health Management platform implementations. Message me and I can help you succeed in your data and Pop Health journey as well. We help health systems, health plans and health IT companies, so we have seen it all!