Predictive ETL Failure Detection in Healthcare Data Pipelines Using Anomaly Detection Algorithms
Abstract
Failure detection in healthcare data pipelines remains a largely unsolved problem, even though ETL (extract, transform, load) processes are critical for timely healthcare insights and ETL logs are available across many pipelines to support analytics. This study proposes an architecture that detects ETL failures up to 1 hour before they occur by training anomaly detection models on log and ETL status data from preceding periods. Five months of ETL logs from a healthcare data warehouse, with associated ground truth labels, are used to train, validate, and benchmark four anomaly detection approaches (Isolation Forest, One-Class SVM, LSTM-VAE, and ARIMA) for predictive failure detection. Two additional months of data serve as a further test set for the models trained on three months of data and tested on one month, and a four-day test set serves the same purpose for the models trained on one month and tested on three days.
Results demonstrate the potential of the proposed architecture for predictive ETL failure detection in healthcare data pipelines. Factors contributing to ETL failures are identified in the feature engineering stage, which clarifies the predictive power of individual features and informs dataset partitioning for better model training. Ultimately, these findings support future closed-loop control of ETL tasks through automatic recovery and alerting on upcoming failures, with potential application to ETL systems outside the healthcare domain.
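To make the one-hour prediction horizon concrete, the following minimal sketch shows one way the supervised targets and a time-ordered train/test partition could be built from log windows. It is an illustration under stated assumptions, not the study's implementation: the function names `label_windows` and `chronological_split`, the 30-minute window spacing, the example timestamps, and the 0.75 training fraction (chosen only to echo a three-months-to-one-month ratio) are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def label_windows(window_ends, failure_times, horizon=timedelta(hours=1)):
    """Return 1 for each window that is followed by a failure within `horizon`, else 0."""
    failures = sorted(failure_times)
    labels = []
    for end in window_ends:
        # Index of the first failure strictly after this window's end time.
        i = bisect_right(failures, end)
        labels.append(int(i < len(failures) and failures[i] - end <= horizon))
    return labels


def chronological_split(rows, train_fraction=0.75):
    """Split time-ordered rows without shuffling, so no future data leaks into training."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical example: six windows ending every 30 minutes, one failure at 02:15.
start = datetime(2024, 1, 1, 0, 0)
window_ends = [start + timedelta(minutes=30 * k) for k in range(6)]
failure_times = [datetime(2024, 1, 1, 2, 15)]

labels = label_windows(window_ends, failure_times)
train_labels, test_labels = chronological_split(labels)
print(labels)  # windows ending at 01:30 and 02:00 fall inside the one-hour horizon
```

Features computed over each window (for example error counts or run durations, both assumptions here) would then be paired with these labels and passed to an anomaly detector. Splitting by time rather than at random keeps the evaluation closer to how a deployed detector would see the data.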

This work is licensed under a Creative Commons Attribution 4.0 International License.