Intelligent Data Profiling for Healthcare Data Lakes Using AI-Enhanced Analytics
Abstract
Improved data quality, governance, and interoperability are critical to fulfilling the potential of data lakes for healthcare analytics. Careful profiling of diverse source data is a prerequisite for a high-quality data lake. Nevertheless, the existing literature provides no clear guidance on operationalizing data profiling for heterogeneous healthcare data lakes, in part because key aspects of the profiling process remain underexplored. A set of necessary and sufficient data-governance requirements guides the subsequent profiling of structure, schema, lineage, quality, and anomalies in both clinical and nonclinical databases of an operational healthcare data lake. The profiling results provide crucial evidence for intelligent data-governance decision-making and are formally disseminated within a metadata catalog.
The growing role of artificial intelligence (AI), and especially machine learning (ML), in the analytic processes that use data lakes has given rise to the notion of AI-Enhanced Analytics, which extends standard data-analysis processes with profiled knowledge about the data in order to improve the discovery, quality, applicability, and generalization of results. Profiling plays a key role in traditional data warehouses and their ETL processes, yet guidance on profiling complex, heterogeneous data lakes remains limited, especially with respect to operational aspects. AI-Enhanced Analytics capabilities have now been employed to fulfill these operational requirements, enabling improved data-governance decision-making. The profiling aspects specifically address the fulfillment of data-governance requirements for data lakes in healthcare analytics.
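As a rough illustration of what column-level profiling for a metadata catalog might look like, the following minimal Python sketch computes a null fraction, a distinct count, the observed value types, and numeric outliers for each field of a small record set. It is not the method used in this article: the function names `profile_column` and `profile_records`, the sample field names, and the median-absolute-deviation outlier rule with a threshold of 5 are all assumptions made for the example.

```python
from statistics import median


def profile_column(values, threshold=5.0):
    """Profile one column: null rate, distinct count, value types, outliers.

    The outlier rule (median absolute deviation) and its threshold are
    illustrative assumptions, not taken from the article.
    """
    non_null = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    profile = {
        "count": len(values),
        "null_fraction": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "types": sorted({type(v).__name__ for v in non_null}),
        "outliers": [],
    }
    if len(numeric) >= 3:
        med = median(numeric)
        mad = median([abs(v - med) for v in numeric])
        if mad > 0:
            # Flag values whose distance from the median exceeds threshold * MAD.
            profile["outliers"] = [v for v in numeric if abs(v - med) / mad > threshold]
    return profile


def profile_records(records):
    """Build a catalog-style dict mapping each field name to its profile."""
    columns = sorted({key for row in records for key in row})
    return {col: profile_column([row.get(col) for row in records]) for col in columns}


# Hypothetical sample records; field names are invented for illustration.
sample_records = [
    {"patient_id": "p1", "heart_rate": 72},
    {"patient_id": "p2", "heart_rate": 75},
    {"patient_id": "p3", "heart_rate": 70},
    {"patient_id": "p4", "heart_rate": 74},
    {"patient_id": "p5", "heart_rate": 400},
    {"patient_id": "p6", "heart_rate": None},
]

catalog = profile_records(sample_records)
```

In this sketch, `catalog["heart_rate"]` records one missing value out of six and flags 400 as an outlier, the kind of evidence a governance process could review before the data are used for analytics. Lineage and schema profiling are not covered here.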
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.