My first practice ETL project

Hello everyone! This is the first time I'm using Medium to write a blog, to record my first ETL project.

I work in HR and have no programming background, but I'm interested in data, so I wanted to practice my skills after completing the Road to Data Engineer course from Data — Data Science ชิลชิล :)

My first project comes from #CloudGuruChallenge — Event-Driven Python on AWS: build an automated ETL processing pipeline for COVID-19 data in the USA using Python and cloud services, adapted to the tasks I can do with my skills. (Thank you Admin Perth for recommending it to me!)

Before starting, here are the tools I used in this project:

  1. Google Colab for the Python code
  2. Google Cloud Platform (Storage, Composer (Apache Airflow), BigQuery)
    You can get $300 of free credit via this link
  3. Power BI Desktop

Here is the data pipeline concept


So let’s start the first step!

  1. In Google Colab, download this CSV file from GitHub. (This is a daily dump of US COVID-19 data from a repository maintained by The New York Times. Every day the file is updated with an additional row of data.)
import pandas as pd

url = ""
data_ny = pd.read_csv(url)
Data from New York Times

2. The exercise wants to show recovered cases as well as confirmed cases and deaths, but the NYT data does not track recoveries, so I need to pull US recovery data from this Johns Hopkins dataset and merge it into my data.

Remark: the case and death counts in the Johns Hopkins dataset are not aligned with the NYT data, so I will use the NYT data as the main data and only copy the recovery data from Johns Hopkins.

url_jh = ""
data_jh = pd.read_csv(url_jh)

# select only the US rows from the JH data
data_jh_usa = data_jh[data_jh['Country/Region'] == 'US']

# then drop the other columns, keeping only Date and Recovered
data_jh_usa_recovered = data_jh_usa.drop(['Confirmed', 'Deaths', 'Province/State', 'Country/Region'], axis=1)
Data from Johns Hopkins, showing Recovered as floats, which I will fix later

3. It's time to merge the data! But before doing that, I have to convert the two date columns into datetime, then use a left join to merge the two datasets.

data_ny['date'] = pd.to_datetime(data_ny['date'])
data_jh_usa_recovered['Date'] = pd.to_datetime(data_jh_usa_recovered['Date'])
data_ny_jh = data_ny.merge(data_jh_usa_recovered, how="left", left_on="date", right_on="Date")
Merged data
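Why the datetime conversion and the left join matter can be seen on a tiny toy example (the numbers below are made up, not real COVID figures): the left join keeps every NYT row, and any day the JH data lacks simply gets NaN in Recovered.

```python
import pandas as pd

# Toy stand-ins for the NYT and Johns Hopkins frames (hypothetical values).
nyt = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"],
    "cases": [30, 53, 80],
    "deaths": [1, 2, 4],
})
jh = pd.DataFrame({
    "Date": ["2020-03-01", "2020-03-02"],  # one day missing on purpose
    "Recovered": [5.0, 7.0],
})

# Both date columns must be datetime, or the join keys never match.
nyt["date"] = pd.to_datetime(nyt["date"])
jh["Date"] = pd.to_datetime(jh["Date"])

# Left join keeps every NYT row; days JH lacks get NaN in Recovered.
merged = nyt.merge(jh, how="left", left_on="date", right_on="Date")
print(merged["Recovered"].tolist())  # [5.0, 7.0, nan]
```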

4. Clean the data further by dropping NA rows, dropping the Date column, and changing the floats into integers, then export the notebook as a Python file (.py).

import numpy as np

data_ny_jh.dropna(subset=["Recovered"], inplace=True)
data_ny_jh.drop(['Date'], axis=1, inplace=True)
# change the Recovered floats into integers
data_ny_jh['Recovered'] = data_ny_jh['Recovered'].astype(np.int64)
The completed, transformed data
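Steps 2–4 can also be collected into one reusable function, which is handy when the code later runs as a scheduled task. This is a sketch using toy frames: the column names match the real datasets, but the values are made up.

```python
import pandas as pd

def transform(data_ny: pd.DataFrame, data_jh_usa_recovered: pd.DataFrame) -> pd.DataFrame:
    """Merge NYT case/death data with JH recovery data, then clean it up."""
    data_ny["date"] = pd.to_datetime(data_ny["date"])
    data_jh_usa_recovered["Date"] = pd.to_datetime(data_jh_usa_recovered["Date"])
    merged = data_ny.merge(data_jh_usa_recovered, how="left",
                           left_on="date", right_on="Date")
    merged.dropna(subset=["Recovered"], inplace=True)   # keep only days JH covers
    merged.drop(columns=["Date"], inplace=True)          # duplicate of "date"
    merged["Recovered"] = merged["Recovered"].astype(int)  # float -> int
    return merged

# Toy data (made-up numbers) to show the shape of the result.
ny = pd.DataFrame({"date": ["2020-03-01", "2020-03-02"],
                   "cases": [30, 53], "deaths": [1, 2]})
jh = pd.DataFrame({"Date": ["2020-03-01"], "Recovered": [5.0]})
out = transform(ny, jh)
print(out.dtypes["Recovered"])  # an integer dtype
```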

5. To bring our cleaned data into an automated pipeline, go to Google Cloud Platform and create a Cloud Composer environment to use Apache Airflow, then install the packages our code needs (each step takes about 15–20 minutes).

install packages we use

6. Open the Cloud Shell, upload the Python file, and use the editor to set up my code to run in Airflow (check my code on GitHub at the bottom).

set up the code to run in airflow

7. In the Composer cluster, go to the DAGs folder from the environment configuration by clicking the link at DAGs folder, then use the command line to copy our pipeline into the dags folder:

gsutil cp __file name__ gs://__bucket name__/dags

8. Once the file is in the DAGs folder, we can go to the Airflow web UI to see our automated data pipeline, which runs from getting the data and transforming it through to sending it to BigQuery.
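For readers who haven't seen a DAG file before, a minimal sketch of what such a pipeline file can look like is below. The DAG id, schedule, and the `run_etl` callable are assumptions for illustration, not my actual code; see my real pipeline on GitHub at the bottom.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl():
    # In the real pipeline: download NYT + JH data, merge and clean with
    # pandas (as in the steps above), then load the result into BigQuery.
    ...

with DAG(
    dag_id="covid19_etl",            # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",      # the NYT file updates once a day
    catchup=False,
) as dag:
    etl_task = PythonOperator(task_id="etl", python_callable=run_etl)
```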

9. Finally, I made the data visualization by importing the data from Google BigQuery.

Check my code on GitHub :