
ETL approaches to bulk load data in Cloud SQL

  • I need to ETL data into my Cloud SQL instance. This data comes from API calls. Currently, I'm running custom Java ETL code in Kubernetes with CronJobs that makes requests to collect this data and loads it into Cloud SQL. The problem comes with managing the ETL code and monitoring the ETL jobs. The current solution may not scale well when more ETL processes are incorporated. In this context, I need to use an ETL tool.

    My Cloud SQL instance contains two types of tables: common transactional tables and tables that contain data coming from the API. The second type is mostly read-only from an "operational database" perspective, and a large portion of those tables is bulk-updated every hour (in batch) to discard the old data and refresh the values.

    Considering this context, I noticed that Cloud Dataflow is the ETL tool provided by GCP. However, it seems that this tool is more suitable for big data applications that need to do complex transformations and ingest data in multiple formats. Also, in Dataflow, data is processed in parallel and worker nodes are scaled as needed. Since Dataflow is a distributed system, the ETL process might incur overhead allocating resources just to do a simple bulk load. In addition, I noticed that Dataflow doesn't have a dedicated sink for Cloud SQL. This probably means that Dataflow isn't the right tool for simple bulk load operations into a Cloud SQL database.

    For my current needs, I only have to do simple transformations and bulk load the data. However, in the future we might want to handle other sources of data (PNG, JSON, CSV files) and sinks (Cloud Storage and maybe BigQuery). Also, in the future we might want to ingest streaming data and store it in Cloud SQL. In this sense, the underlying Apache Beam model is really interesting, since it offers a unified model for batch and streaming.

    Given all this context, I can see two approaches:

    1) Use an ETL tool like Talend in the Cloud to help with monitoring and maintaining ETL jobs.

    2) Use Cloud Dataflow, since we may need streaming capabilities and integration with all kinds of sources and sinks.

    The problem with the first approach is that I may end up using Cloud Dataflow anyway when future requirements arrive, and that would be bad for my project in terms of infrastructure costs, since I would be paying for two tools.

    The problem with the second approach is that Dataflow doesn't seem to be suitable for simple bulk load operations into a Cloud SQL database.

    Is there something I am getting wrong here? Can someone enlighten me?

      October 14, 2021 1:21 PM IST
    0
  • A typical bulk load flow performs the following operations (a sketch of some of these steps follows the list):

    1. It extracts data from the source.
    2. It then creates CSV, JSON, XML, Avro, or Parquet files in the staging area, which can be one of the following: local file system, remote FTP or SFTP server, Azure Blob, Google Cloud Storage, or Amazon S3. The actual location of the staging area depends on the specific implementation of the bulk load command for the destination database.
    3. If needed, it compresses files using the gzip algorithm.
    4. It checks to see if the destination table exists, and if it does not, creates the table using metadata from the source.
    5. It executes the user-defined SQL statements for the bulk load.
    6. It then optionally executes user-defined or automatically generated MERGE statements.
    7. In the end, it cleans up the remaining files in the staging area, if needed.
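
    As a rough illustration, here is a minimal Java sketch of steps 2-3 and 7 (staging a gzipped CSV file in Google Cloud Storage), assuming the google-cloud-storage client library; the bucket, object name, and CSV contents are hypothetical, and the bulk load command itself (step 5) is database-specific, so it only appears as a comment.

    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class StagingSketch {
      public static void main(String[] args) throws Exception {
        String csv = "id,name,value\n1,metric_a,42\n2,metric_b,7\n";

        // Step 3: compress the extracted data with gzip.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
          gzip.write(csv.getBytes(StandardCharsets.UTF_8));
        }

        // Step 2: place the file in the staging area (here, a GCS bucket).
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobInfo blob = BlobInfo.newBuilder(BlobId.of("my-staging-bucket", "staging/api_metrics.csv.gz"))
            .setContentType("text/csv")
            .setContentEncoding("gzip")
            .build();
        storage.create(blob, buffer.toByteArray());

        // Steps 4-6 would follow here: create the destination table if needed,
        // run the database-specific bulk load command, then an optional MERGE
        // into the target table.

        // Step 7: clean up the staging file once the load has succeeded.
        storage.delete(blob.getBlobId());
      }
    }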
      November 2, 2021 2:33 PM IST
    0
  • Bulk loading of data refers to the process by which huge volumes of data can be loaded into a database quickly. Bulk loading allows data to be moved from one system to another much faster than with usual data loading methods. Bulk loading on cloud platforms typically involves parallelization and the use of multiple nodes to speed up the process. Read about BryteFlow for bulk loading.

    Bulk loading loads data in big chunks

    Bulk loading is a method of loading data into a database in big chunks. Imagine loading the details of purchase transactions for a certain period into your database. With usual methods you might enter one order at a time, but bulk loading takes files with similar information and loads hundreds of thousands of records in a very short time. If you are moving data from one DBMS to another, you cannot expect to carry the data across as-is to the new DB. You would need to dump the data to a file format that the new DB can read and recognize, and then replicate the data into the new DB.
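
    For example, here is a minimal sketch of that "dump to a readable format, then load it" step, assuming a PostgreSQL-flavored Cloud SQL instance and the PostgreSQL JDBC driver's COPY API; the connection details, file path, and table are hypothetical.

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    import java.io.FileReader;
    import java.io.Reader;
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class CsvCopySketch {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://10.0.0.5:5432/appdb", "etl_user", "secret");
             Reader csv = new FileReader("/tmp/api_metrics.csv")) {

          // COPY streams the whole file in a single operation instead of
          // issuing one INSERT per row.
          CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
          long rows = copy.copyIn(
              "COPY api_metrics (id, name, value) FROM STDIN WITH (FORMAT csv, HEADER true)",
              csv);
          System.out.println("Loaded " + rows + " rows");
        }
      }
    }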

    Bulk loading - why you need it and how it differs from usual data loading

    Loading data into indexed tables is usually optimized for inserting rows one at a time. When you are loading a large amount of data in one go, inserting rows one at a time is time-consuming and inefficient; in fact, it can degrade the performance of the entire database. This is where bulk loading of data comes in. Bulk loading, or bulk insert, relies on other, more efficient data loading processes, as in the batched-insert sketch below.
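
    For a concrete sense of the contrast, here is a minimal JDBC sketch that batches inserts instead of sending one row per round trip; the connection details and table are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchInsertSketch {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://10.0.0.5:5432/appdb", "etl_user", "secret")) {
          conn.setAutoCommit(false); // one transaction instead of one per row

          String sql = "INSERT INTO api_metrics (name, value) VALUES (?, ?)";
          try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < 100_000; i++) {
              ps.setString(1, "metric_" + i);
              ps.setInt(2, i);
              ps.addBatch();            // accumulate rows client-side
              if ((i + 1) % 1_000 == 0) {
                ps.executeBatch();      // send 1,000 rows per round trip
              }
            }
            ps.executeBatch();          // flush the remainder
          }
          conn.commit();
        }
      }
    }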

    Why is bulk loading faster than routine data loading?

    Bulk loading operations are usually not logged and transactional integrity may not be ideal. Bulk loading operations often bypass triggers and integrity checks like constraints. Bypassing these time-consuming processes speeds up data loading performance to a great degree when you have large amounts of data to transfer.
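
    To make that concrete, here is a sketch of what skipping some of that bookkeeping can look like, assuming a PostgreSQL-flavored Cloud SQL instance; the table names are hypothetical, the options shown (UNLOGGED tables, DISABLE TRIGGER) are Postgres-specific, and disabling triggers requires owning the table.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ReducedOverheadLoadSketch {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://10.0.0.5:5432/appdb", "etl_user", "secret");
             Statement st = conn.createStatement()) {
          conn.setAutoCommit(false);

          // Unlogged staging table: writes skip the write-ahead log, trading
          // crash-safety of the staged rows for load speed.
          st.execute("CREATE UNLOGGED TABLE IF NOT EXISTS api_metrics_stage "
              + "(LIKE api_metrics INCLUDING DEFAULTS)");

          // Skip user-defined triggers on the target while the bulk copy runs.
          st.execute("ALTER TABLE api_metrics DISABLE TRIGGER USER");
          st.execute("INSERT INTO api_metrics SELECT * FROM api_metrics_stage");
          st.execute("ALTER TABLE api_metrics ENABLE TRIGGER USER");

          conn.commit();
        }
      }
    }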

      November 13, 2021 2:15 PM IST
    0
  • You can use Cloud Dataflow just for loading operations. Here is a tutorial on how to perform ETL operations with Dataflow. It uses BigQuery but you can adapt it to connect to your Cloud SQL or other JDBC sources.
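
    For reference, here is a minimal sketch of what writing to Cloud SQL from a Beam/Dataflow pipeline might look like using the generic JdbcIO connector (there is no dedicated Cloud SQL sink); the JDBC URL, credentials, and table are hypothetical.

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.KV;

    public class CloudSqlSinkSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // In a real pipeline this PCollection would come from the API
        // extraction step; here it is just two in-memory records.
        p.apply(Create.of(Arrays.asList(KV.of("metric_a", 42), KV.of("metric_b", 7))))
         .apply(JdbcIO.<KV<String, Integer>>write()
             .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                     "org.postgresql.Driver",
                     "jdbc:postgresql://10.0.0.5:5432/appdb") // hypothetical Cloud SQL address
                 .withUsername("etl_user")
                 .withPassword("secret"))
             .withStatement("INSERT INTO api_metrics (name, value) VALUES (?, ?)")
             .withPreparedStatementSetter((element, statement) -> {
               statement.setString(1, element.getKey());
               statement.setInt(2, element.getValue());
             }));

        p.run().waitUntilFinish();
      }
    }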

    More examples can be found on the official Google Cloud Platform GitHub page for Dataflow analysis of user-generated content.

    You can also have a look at this GCP ETL architecture example that automates the tasks of extracting data from operational databases.

    For simpler ETL operations, Dataprep is an easy tool to use and provides flow scheduling as well.

      October 21, 2021 2:21 PM IST
    0