From email to data warehouse – Google Edition

Use the Google technology stack to automatically load csv files received over email into BigQuery.

In the data world, sending data over email is something that has been around for a very long time and it’s not going away anytime soon.

In many cases data received on email is very important. It can be used to power some reporting or analysis to yield some key insight. So it makes sense to make the data more durable and useable than sitting in your inbox by moving it to your data warehouse in BigQuery.

This blog post describes a process that will take a csv attached to an email (sent to a gmail address) and load it into a BigQuery table, via Google Sheets as the intermediate layer.

Tools / Platforms used:

  • Gmail
  • Google Sheets
  • BigQuery
  • Google Apps script

Pre-requisites:

  • The email with csv attached data arrives at a predictable time
  • There is a single csv file attached to the email
  • There is a consistent pattern to identify the specific email with the csv attachment (eg a specific sender , subject title combination)

Guide

This solution makes use of Google Apps Script.

STEP 1: Create a Google Spreadsheet that the data from the csv will be loaded into.

STEP 2: In this spreadsheet open the script editor and paste in the following code blocks. Modifications will be needed to be able to apply specifically to your use case.

  1. Change 1: Line 5 || Change search criteria to retrieve the email with csv attached
  2. Change 2: Line 23 || enter the id of the Google Sheet that the csv data will be loaded into
  3. Change 3: Line 24 || change “autoupload” to the tab name in which the csv data will be loaded
  4. Change 4: Line 34 || change “G1” to the cell id of where the last updated time will be saved

function importCSVFromGmail() {
  // Data in attached CSV file is saved to the Google sheet: https://docs.google.com/spreadsheets/d/{SPREADSHEET_ID}


  var threads = GmailApp.search("from:reports@example.com");

  // get latest message in thread
  var message = threads[0].getMessages()[0];
  var sentDate = message.getDate()

  // Loop through all message attachments
  var i;
  for (i = 0; i < message.getAttachments().length; i++) {
    var attachment = message.getAttachments()[i];

    // Is the attachment a CSV file
    // There should only be 1 csv attached
    // Behaviour of script may be unexpected if multiple csv's are attached.
    if (attachment.getContentType() === "text/csv") {

      // Select Google sheet and tab named "autoupload"
      var spreadsheet = SpreadsheetApp.openById("{SPREADSHEET_ID}");
      var sheet = spreadsheet.getSheetByName("autoupload").activate();

      var csvData = Utilities.parseCsv(attachment.getDataAsString(), ",");

      // Remember to clear the content of the sheet before importing new data
      sheet.clearContents();
      sheet.getRange(1, 1, csvData.length, csvData[0].length).setValues(csvData);

      console.log("Data saved to spreadsheet from report sent on " + sentDate)


      // Enter value of last run date time in spreadsheet in cell G1
      // Change this line to save the last run date time in a different cell
      // or comment out this entire section

      my_range = sheet.getRange("G1");
      let currentDate = new Date();
      let cDay = currentDate.getDate();
      my_range.setValue("Data last updated at: " + currentDate);

    }
  }
}

STEP 3: Schedule the Google apps script code to run at a time when the csv attached email is expected to be in your inbox.

For example, if the email is recieved at around 5:30am each day, go ahead and schedule the task to run at between 6am and 7am.

Select “Triggers” from the left side bar

Click the “Add trigger” button in the bottom right corner.

In the modal that pops up set the code to run at a time the csv attached email is expected to be in the gmail inbox. The example below shows a schedule to run the code everyday between 6am and 7am.

When scheduling this for the first time, authorisation to use gmail will need to be granted.

Step 4: Run the google apps script code to load data into the spreadsheet by clicking the run button

Step 5: with the data now in the Google Sheet from the csv file, set up BigQuery to be able to query this google sheet by following this guide.

Step 6: Create an SQL query, that will query the Google Sheet (where the data from emailed csv now sits) and place the data in another table

INSERT `destination_table_to_insert_data_into` (column_1, column_2, column3) 
SELECT
    column_1,
    column_2,
    column_3,
FROM 
  `bigquery_table_that_reads_from_google_sheet` 
WHERE 
    --some conditions to only pull the data you want. eg not blank rows
    column_1 IS NOT null

Step 7: Set up the created query as a scheduled query in BigQuery by following this guide.

Since the csv data will be loaded into the google sheet at some point between 6am and 7am each morning (according to the schedule set up for Google Apps script) it makes sense to schedule the SQL query to run after 7am each morning.

Conclusion

And there you have it. An automated process that will load a csv file received on email (gmail) and load the corresponding data in the csv into a table in BigQuery (via a route through Google Sheets)

Send a Comment

Your email address will not be published.