End-to-End Data Pipeline
By:
Dingkun Yang
Posted:
Project Link: Click this link
Description
This repository contains the code for an end-to-end Apache Airflow data pipeline, run in Docker containers, which extracts data from both a CSV file and a Postgres database into S3 buckets (using minio). The data is then processed with Apache Spark and loaded back into buckets, after which Python scripts analyze it to build a User Behaviour Metric that is stored in DuckDB (acting as the warehouse). Finally, the results are visualized with Quarto and Plotly.
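Orchestration is driven by an Airflow DAG. The sketch below shows how such a DAG could wire the steps together; the DAG id, task names, and callables are illustrative assumptions, not the repository's actual code.

```python
# Minimal sketch of the orchestration DAG; all names here are
# illustrative assumptions, not the repository's actual code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_s3():
    """Pull rows from Postgres / read the CSV and upload them to a minio bucket."""


def run_spark_job():
    """Submit the Spark job that processes and classifies the raw data."""


def build_user_behaviour_metric():
    """Aggregate the processed data and write the metric table into DuckDB."""


with DAG(
    dag_id="user_behaviour_pipeline",  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    transform = PythonOperator(task_id="spark_transform", python_callable=run_spark_job)
    load = PythonOperator(task_id="build_metric", python_callable=build_user_behaviour_metric)

    # The >> chaining gives Airflow the extract -> transform -> load order.
    extract >> transform >> load
```

Each task would be implemented as a Python callable (or a dedicated operator), and Airflow schedules and retries them according to the dependency chain.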
Architecture
The architecture of the data pipeline is as follows:
Airflow: orchestrates the data pipeline's DAGs.
Postgres: stores Airflow's metadata and the data to be processed.
DuckDB: acts as the data warehouse for the processed data.
Quarto with Plotly: converts code in markdown format to HTML files that can be embedded in the app or served as is.
Apache Spark: processes the data and runs a classification algorithm (see the sketch after this list).
minio: provides an S3-compatible, open-source storage system.
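As a rough illustration of the Spark step, the sketch below reads raw CSV data from a minio bucket over the S3A connector, aggregates it, and writes the result back as Parquet. The endpoint, credentials, bucket names, and column names are assumptions, and s3a:// support requires the hadoop-aws package on the Spark classpath.

```python
# Minimal sketch of the Spark processing step; endpoint, credentials,
# bucket and column names are assumptions. Requires the hadoop-aws
# package on the Spark classpath for s3a:// support.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("user-behaviour")
    # Point the S3A connector at the local minio server instead of AWS.
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the raw events from the extraction bucket.
events = spark.read.csv("s3a://raw/events.csv", header=True, inferSchema=True)

# A simple per-user aggregation standing in for the real metric logic.
metric = events.groupBy("user_id").agg(F.count("*").alias("event_count"))

# Write the processed data back to a bucket for the downstream DuckDB load.
metric.write.mode("overwrite").parquet("s3a://processed/user_behaviour_metric")
```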
Screenshots
Airflow Running DAG
Airflow running with Docker
Interactive Dashboard with Quarto and Plotly
You can view the rendered HTML dashboard file here
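For a sense of how the dashboard charts are produced, the sketch below queries the metric table out of DuckDB and renders it with Plotly; the database file, table, and column names are illustrative assumptions. Placed in a Quarto `.qmd` code cell, the figure is embedded directly in the rendered HTML.

```python
# Minimal sketch of a dashboard chart; the DuckDB file, table and
# column names are illustrative assumptions.
import duckdb
import plotly.express as px

con = duckdb.connect("warehouse.duckdb", read_only=True)
metric = con.execute(
    "SELECT user_id, event_count FROM user_behaviour_metric "
    "ORDER BY event_count DESC LIMIT 20"
).df()

fig = px.bar(metric, x="user_id", y="event_count", title="Top users by activity")
fig.show()  # In a Quarto code cell the figure is embedded in the HTML output.
```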