Loading data to BigQuery from a local database

  • I want to start a data warehouse in Google BigQuery, but I'm not sure how to schedule jobs that get the data into the cloud.

    To give some background: I have a MySQL database hosted on-prem, and I currently take a dump of it each night as a backup. My idea is to send this dump to Google Cloud and have the data imported into BigQuery. I thought I could upload the dump and use Cloud Scheduler to trigger something that opens the dump and imports it, but I'm unsure how these services all fit together.

    I'm a bit of a newbie with Google Cloud, so if there is a better way to achieve this, I'm happy to change my plan of action.

    Thanks in advance.

      October 2, 2021 2:30 PM IST
    0
  • Now that the new EXTERNAL_QUERY function has launched and you can query a Cloud SQL instance from BigQuery, your best shot right now is:

    1. Set up replication from your current instance to a Cloud SQL instance, following this guide.
    2. Understand how Cloud SQL federated queries let you query Cloud SQL instances from BigQuery; a sketch of creating the required connection follows this list.
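
    The 'connection_id' referenced in the queries below has to exist before you can run them. A minimal sketch of creating one with the bq CLI, assuming a MySQL Cloud SQL instance; the project, region, instance, database, and credential values here are placeholders, not from the original post:

    bq mk --connection --location=us --project_id=my_project \
        --connection_type=CLOUD_SQL \
        --properties='{"instanceId":"my_project:us-central1:my_instance","database":"my_db","type":"MYSQL"}' \
        --connection_credential='{"username":"my_user","password":"my_password"}' \
        my_connection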

    This way you get live access to your relational database. Example query that you run in BigQuery:

    SELECT * FROM EXTERNAL_QUERY(
      'connection_id',
      '''SELECT * FROM mysqltable AS c ORDER BY c.customer_id''');

    You can even join a BigQuery table with a Cloud SQL table:

    Example:

    SELECT c.customer_id, c.name, SUM(t.amount) AS total_revenue,
    rq.first_order_date
    FROM customers AS c
    INNER JOIN transaction_fact AS t ON c.customer_id = t.customer_id
    LEFT OUTER JOIN EXTERNAL_QUERY(
      'connection_id',
      '''SELECT customer_id, MIN(order_date) AS first_order_date
      FROM orders
      GROUP BY customer_id''') AS rq ON rq.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name, rq.first_order_date;
      October 4, 2021 1:51 PM IST
    0
  • In order to achieve this, you will need to create a Cloud Storage bucket by running gsutil mb gs://BUCKET_NAME.

    After creating the bucket, you need to create a Cloud Function triggered by the bucket using the finalize event, which fires whenever a new object is written to the bucket (a deploy sketch follows the sample function below).

    You can follow this sample function:

    'use strict';
    
    // Current versions of these client libraries export named classes
    const {Storage} = require('@google-cloud/storage');
    const {BigQuery} = require('@google-cloud/bigquery');
    
    // Instantiates the clients
    const storage = new Storage();
    const bigquery = new BigQuery();
    
    /**
     * Creates a BigQuery load job to load a file from Cloud Storage and write the data into BigQuery.
     *
     * @param {object} data The event payload.
     * @param {object} context The event metadata.
     */
    exports.loadFile = (data, context) => {
        const datasetId = 'Your_Dataset_name';
        const tableId = 'Your_Table_ID';
    
        // Load job config: assumes a CSV file with a single header row
        const jobMetadata = {
            sourceFormat: 'CSV',
            skipLeadingRows: 1,
            writeDisposition: 'WRITE_APPEND'
        };
    
        // Loads data from a Google Cloud Storage file into the table
        bigquery
            .dataset(datasetId)
            .table(tableId)
            .load(storage.bucket(data.bucket).file(data.name), jobMetadata)
            .catch(err => {
                console.error('ERROR:', err);
            });
    
        console.log(`Loading from gs://${data.bucket}/${data.name} into ${datasetId}.${tableId}`);
    };
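
    A minimal sketch of deploying this as a first-generation Cloud Function with the gcloud CLI, assuming the code above is saved as index.js in the current directory; the runtime version and bucket name are placeholders:

    gcloud functions deploy loadFile \
        --runtime nodejs16 \
        --trigger-resource BUCKET_NAME \
        --trigger-event google.storage.object.finalize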


    Then create your BigQuery dataset and destination table using your desired schema, for example with the bq CLI shown below.
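
    A hedged sketch; the dataset, table, and schema names here are made-up placeholders, not from the original post:

    bq mk --dataset my_project:my_dataset
    bq mk --table my_dataset.my_table customer_id:INTEGER,name:STRING,amount:FLOAT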

    Now you can upload your CSV file into the bucket, and the uploaded data will appear in BigQuery.
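
    For example, with a hypothetical file name:

    gsutil cp nightly_dump.csv gs://BUCKET_NAME/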

      October 5, 2021 1:05 PM IST
    0
  • Required permissions
    When you load data into BigQuery, you need permissions to run a load job and permissions that let you load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need permissions to access the bucket that contains your data.

    BigQuery permissions
    At a minimum, the following permissions are required to load data into BigQuery. These permissions are required if you are loading data into a new table or partition, or if you are appending or overwriting a table or partition.

    bigquery.tables.create
    bigquery.tables.updateData
    bigquery.jobs.create

    The following predefined IAM roles include both the bigquery.tables.create and bigquery.tables.updateData permissions:

    bigquery.dataEditor
    bigquery.dataOwner
    bigquery.admin

    The following predefined IAM roles include the bigquery.jobs.create permission:

    bigquery.user
    bigquery.jobUser
    bigquery.admin

    In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access lets the user create and update tables in the dataset by using a load job.

    For more information on IAM roles and permissions in BigQuery, see Access control.

    Cloud Storage permissions
    To load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions. If you are using a URI wildcard, you must also have storage.objects.list permissions.

    The predefined IAM role storage.objectViewer can be granted to provide both storage.objects.get and storage.objects.list permissions.
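
    As a hedged illustration, the roles above could be granted to the service account that runs the load job with the gcloud CLI; the project and service-account names here are placeholders:

    gcloud projects add-iam-policy-binding my-project \
        --member='serviceAccount:loader@my-project.iam.gserviceaccount.com' \
        --role='roles/bigquery.dataEditor'

    gcloud projects add-iam-policy-binding my-project \
        --member='serviceAccount:loader@my-project.iam.gserviceaccount.com' \
        --role='roles/bigquery.jobUser'

    gcloud projects add-iam-policy-binding my-project \
        --member='serviceAccount:loader@my-project.iam.gserviceaccount.com' \
        --role='roles/storage.objectViewer'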
      October 7, 2021 12:52 PM IST
    0