Data Engineering — Building an ETL Pipeline for Canadian ETFs

Web scraping → Transformation & Data Validation → Loading into Snowflake → EDA in Snowflake (part 1)

Sarthak Girdhar
8 min read · Mar 10, 2024

I am back yet again with another Data Engineering project. This time, I am building an ETL pipeline for Canadian ETFs. The data is extracted by scraping two websites. In the first project, we extracted data from the NYC Open Data API.

Introduction

We will build a Canadian ETF Screener in this project. This article is divided into two parts. You can read part 2 here.

In the first part, we will:

  1. Write code in Python to extract/scrape data from Wikipedia and Investing.com.
  2. Transform the extracted data.
  3. Perform data validation.

In the second part, we will:

  4. Load the data into Snowflake.
  5. Join the two datasets to create One Big Table (OBT).
  6. Do Exploratory Data Analysis (EDA), whilst writing some intermediate-level SQL queries.

Scraping Wikipedia

We start by importing the necessary libraries. The data will be extracted from this Wikipedia page. You will also notice that we use the Beautiful Soup library to accomplish our mission!

# import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# wikipedia url to scrape data from
wiki_url = "https://en.wikipedia.org/wiki/List_of_Canadian_exchange-traded_funds"

wiki_response = requests.get(wiki_url)

# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(wiki_response.text, 'html.parser')
wiki_etf_table = soup.find('table',{'class':"wikitable"})

# parse the wikipedia table with pandas (read_html returns a list of dataframes)
wiki_etf_df = pd.read_html(str(wiki_etf_table))

# keep the first (and only) table as a dataframe
wiki_etf_df = pd.DataFrame(wiki_etf_df[0])

We have successfully extracted the Wikipedia table. Let’s take a look at the contents of the wiki_etf_df dataframe.
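If you are following along, the first few rows shown below can be pulled up with a quick call like this:

# inspect the first five rows of the scraped table
wiki_etf_df.head()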

wiki_etf_df first 5 rows

Not too bad! Let’s transform this dataframe now so that it is analysis-ready.

Transforming wiki_etf_df

We can start by dropping the last four columns, Inverse through Active Managed. They are not adding much value to the dataset.

# drop redundant columns from the dataframe
wiki_etf_df.drop(wiki_etf_df.loc[:, 'Inverse':'Active Managed'].columns, inplace=True, axis=1)

Next, we can also drop the first 5 characters in the Symbol column.

# remove first 5 characters/extract string from 6th character onwards
wiki_etf_df['Symbol'] = wiki_etf_df['Symbol'].str[5:]

MER stands for Management Expense Ratio. The MER is the combined cost of managing a fund, including operating expenses and taxes. This means that if you invested $1,000 in an ETF with an MER of 0.28%, the fund would deduct $2.80 per year as expenses.
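As a quick sanity check of that arithmetic:

# $1,000 invested at a 0.28% MER
investment = 1000
mer = 0.28 / 100           # 0.28% expressed as a decimal
print(investment * mer)    # 2.8 dollars per year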

Now, we need to think about how our data will be used by downstream users. For example, an end user would like to know which ETFs have a low MER because they want to invest in them. If we keep the MER column as {0.28%, 0.82%, 0.13%, …}, it won't cast into FLOAT and, as a result, we won't be able to use it for mathematical operations. The other option is to change the data type to VARIANT; we would then be able to use it in mathematical calculations, however, I think FLOAT is just cleaner. Therefore, we will remove the % sign from the numbers in the MER column.

# remove last character
wiki_etf_df['MER'] = wiki_etf_df['MER'].str[:-1]

# rename column
wiki_etf_df.rename({'MER': 'MER (%)'}, axis=1, inplace=True)
2 wiki_etf_df [‘MER (%)’]

Once again, there are two more problematic entries: '-' and '0.03% (after mgmt fee rebate'. We will replace these entries with the appropriate values. This step would normally be tackled in the Data Validation phase, however, we are handling it here for now.

# replace the problematic entries with clean numeric strings
wiki_etf_df['MER (%)'] = wiki_etf_df['MER (%)'].replace('-', '0.00')
wiki_etf_df['MER (%)'] = wiki_etf_df['MER (%)'].replace('0.03% (after mgmt fee rebate', '0.03')

It’s done! Before we move on to checking for data quality, let’s take another look at our dataframe.

3 wiki_etf_df post transformation

Data Validation for wiki_etf_df

Let’s start with checking if there are any NULL values in the wiki_etf_df dataframe.
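A per-column null count is enough for this; a minimal sketch of such a check (the screenshot below shows the result of this kind of check):

# count NULL/NaN values in each column
wiki_etf_df.isnull().sum()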

4 wiki_etf_df check for NULL values

It appears only Issuer and Total Assets (MM) have null values. Let’s explore them one by one.

Let’s start with the Issuer column.
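One way to pull up the affected row is to filter on the missing value; a minimal sketch:

# show the row(s) where Issuer is missing
wiki_etf_df[wiki_etf_df['Issuer'].isnull()]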

We notice that the row with the missing Issuer is the Horizons Active US Floating Rate Bond ETF, so the Issuer is Horizons. We also notice that there is no information available for Total Assets (MM) and MER (%). Perhaps the fund is no longer active.

Anyway, we won’t be deleting the row. It will probably be automatically taken care of in the join later.

Let’s impute the value Horizons into the Issuer column.

wiki_etf_df.loc[wiki_etf_df.index[121], 'Issuer'] = "Horizons"

6 wiki_etf_df.loc[121,:]

Now, let’s take a look at the Total Assets (MM) column.

7 wiki_etf_df[‘Total Assets (MM)’]

Yes, it appears that there are some ETFs by Hamilton that are probably no longer active. As this dataset has been scraped from Wikipedia, we can expect that not all of the information will be up to date.

We’ll still replace the NaN values with 0, instead of deleting them.

wiki_etf_df['Total Assets (MM)'] = wiki_etf_df['Total Assets (MM)'].fillna(0)

So far, so good! Let’s run some final checks and move on to the next task.
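A minimal sketch of what such final checks might include (an assumption; the screenshot below shows what was actually run):

# confirm no remaining nulls, sensible dtypes, and no duplicate rows
wiki_etf_df.isnull().sum()
wiki_etf_df.dtypes
wiki_etf_df.duplicated().sum()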

8 wiki_etf_df final checks

I think we’re good. Let’s move on to scraping Investing.com.

Scraping ca.Investing.com

We will use this webpage/table to extract our data.

When we attempt to scrape the Investing.com table (just like the Wikipedia table), we get HTTP error code 403. A 403 response means that the server understood the request but chose to deny it. In other words, it is forbidden.
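For reference, the blocked attempt looks something like this (the exact behaviour depends on Cloudflare's bot protection, so your mileage may vary):

# a direct request to the listing page gets denied
blocked_response = requests.get("https://ca.investing.com/etfs/canada-etfs")
print(blocked_response.status_code)   # 403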

As ca.investing.com is protected by Cloudflare, we won't be able to scrape it in a straightforward manner. There are ways around this, for example with Selenium, but that's beyond the scope of this project.

Therefore, we will just scrape Google's cached copy of the page. It won't give us the most current ETF prices, but that's okay…

header = {
'authority': 'www.google.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'max-age=0',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
# Add more headers as needed
}

# url to scrape data from
investing_url = "https://webcache.googleusercontent.com/search?q=cache:https://ca.investing.com/etfs/canada-etfs"

investing_response = requests.get(investing_url, headers=header)

# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(investing_response.text, 'html.parser')
investing_etf_table = soup.find('table',{'id':"etfs"})

# parse the table with pandas (read_html returns a list of dataframes)
investing_etf_df = pd.read_html(str(investing_etf_table))

# keep the first (and only) table as a dataframe
investing_etf_df = pd.DataFrame(investing_etf_df[0])

We have successfully extracted the table. Let’s take a look at the contents of the investing_etf_df dataframe.

9 investing_etf_df first 5 rows

Time to transform this dataframe.

Kindly note that all the code for this project is also available on my GitHub.

Transforming investing_etf_df

The first column (the Canadian flag icon) and the eighth column (the clock icon) come out as NaN, so we can certainly delete those two.

The other columns that can also be deleted are Name (since it's already present in wiki_etf_df) and Time. The Time column gives us a date/time for the Last column (the last known price). Since we only have one price snapshot in our dataset, keeping the date doesn't serve any useful purpose.

# delete redundant columns
investing_etf_df.drop(investing_etf_df.columns[[0, 1, 6, 7]], axis=1, inplace=True)

We’ll also rename our column Last to Price, since it is more descriptive.

# rename column Last -> Price
investing_etf_df.rename({'Last': 'Price'}, axis=1, inplace=True)

Finally, we’ll also remove the % sign from the values in the Chg. % column.

# remove last character of the string
investing_etf_df['Chg. %'] = investing_etf_df['Chg. %'].str[:-1]

For the Vol. column, we'll notice that there are a few values ending in M and the rest ending in K. We'll convert all of these values into a single FLOAT type.

# expand the K/M suffixes into arithmetic expressions, then evaluate them as floats
investing_etf_df['Vol.'] = investing_etf_df['Vol.'].replace({'K': '*1e3', 'M': '*1e6'}, regex=True).map(pd.eval).astype(float)
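If you would rather avoid pd.eval, a small explicit parser does the same job. This is only a sketch: parse_volume is a hypothetical helper, and it assumes the values look like '123.45K' or '1.2M'.

# hypothetical alternative to the pd.eval approach above
def parse_volume(value: str) -> float:
    multipliers = {'K': 1e3, 'M': 1e6}
    suffix = value[-1]
    if suffix in multipliers:
        return float(value[:-1].replace(',', '')) * multipliers[suffix]
    return float(value.replace(',', ''))

# investing_etf_df['Vol.'] = investing_etf_df['Vol.'].map(parse_volume)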

All the transformations are completed. Before we move on to data validation, let's take a quick look at our dataframe investing_etf_df.

10 investing_etf_df post transformations

Data Validation for investing_etf_df

Let’s start with checking if there are any NULL values in the investing_etf_df dataframe.

11 investing_etf_df check for NULL values

Upon looking at the contents of all the columns in the investing_etf_df dataframe, you’ll find that there are no data quality issues present.

# check for any inconsistencies/errors in all the columns of the dataframe

investing_etf_df['Symbol'].unique()

investing_etf_df['Price'].unique()

investing_etf_df['Chg. %'].unique()

investing_etf_df['Vol.'].unique()

Kindly note that ca.Investing.com was scraped on 1 March 2024 for this project. If you are reading this any time after that, there's a pretty decent chance that there are a few data quality concerns in the dataset that now need to be addressed.

Conclusion

This brings us to the end of part 1. We scraped data from two different websites and stored it as pandas dataframes. We then transformed those dataframes to suit our analytical needs, whilst also performing some data validation.

This dataset doesn’t suffer much from poor data quality. If you wish to see a list of more comprehensive data quality checks, take a look at my earlier project.

In the next part, we will load both of these dataframes into Snowflake as two separate tables, join them together, and then perform some basic Exploratory Data Analysis (EDA).
