I'm relatively new to GCP and just starting to set up and evaluate my organization's architecture on GCP.
Scenario: Data will flow into a Pub/Sub topic (high frequency, low volume of data). The goal is to move that data into Bigtable. From my understanding, you can do that either with a Cloud Function triggered by the topic or with Dataflow.
Now, I have previous experience with Cloud Functions, which I am satisfied with, so that would be my pick.
I fail to see the benefit of choosing one over the other, so my question is: when should I choose which of these products?
Thanks
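For reference, this is roughly the Cloud Function approach I have in mind: a minimal Java background-function sketch, where the project, instance, table, and column family names are placeholders.

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.RowMutation;
import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import java.io.IOException;
import java.util.Base64;

public class PubSubToBigtable implements BackgroundFunction<PubSubToBigtable.PubSubMessage> {

  // Shape of the Pub/Sub event payload; `data` arrives base64-encoded.
  public static class PubSubMessage {
    public String data;
    public String messageId;
  }

  // Placeholders -- substitute real project/instance/table names.
  private static final String PROJECT_ID = "my-project";
  private static final String INSTANCE_ID = "my-instance";
  private static final String TABLE_ID = "events";

  // Created once per function instance and reused across invocations.
  private static BigtableDataClient client;

  private static synchronized BigtableDataClient getClient() throws IOException {
    if (client == null) {
      client = BigtableDataClient.create(PROJECT_ID, INSTANCE_ID);
    }
    return client;
  }

  @Override
  public void accept(PubSubMessage message, Context context) throws Exception {
    String payload = new String(Base64.getDecoder().decode(message.data));
    // One small write per message; the row key is derived from the message id.
    RowMutation mutation =
        RowMutation.create(TABLE_ID, "event#" + message.messageId)
            .setCell("cf", "payload", payload);
    getClient().mutateRow(mutation);
  }
}
```

This does one row write per message; my assumption is that a Dataflow pipeline would instead batch and window the writes, which is part of what I can't evaluate.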
I need to ETL data into my Cloud SQL instance. This data comes from API calls. Currently, I'm running custom Java ETL code in Kubernetes with CronJobs that makes requests to collect this data and load it into Cloud SQL. The problem comes with managing the ETL code and monitoring the ETL jobs. The current solution may not scale well as more ETL processes are incorporated. In this context, I need to use an ETL tool.
My Cloud SQL instance contains two types of tables: common transactional tables and tables that contain data that comes from the API. The second type is mostly read-only from an "operational database" perspective, and a large portion of those tables is bulk updated every hour (in batch) to discard the old data and refresh the values (roughly the pattern sketched below).
Considering this context, I noticed that Cloud Dataflow is the ETL tool provided by GCP. However, it seems that this tool is more suitable for big data applications that need to do complex transformations and ingest data in multiple formats. Also, in Dataflow, the…
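For context, each hourly refresh currently boils down to something like the following simplified JDBC sketch; the api_data table and ApiRow type are made up for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class HourlyRefresh {

  // Hypothetical record type holding one row fetched from the upstream API.
  public record ApiRow(String id, String value) {}

  // Replaces the contents of a read-mostly table in one transaction,
  // so readers never observe a half-refreshed table.
  public static void refresh(Connection conn, List<ApiRow> rows) throws Exception {
    conn.setAutoCommit(false);
    try (PreparedStatement clear = conn.prepareStatement("DELETE FROM api_data");
         PreparedStatement insert =
             conn.prepareStatement("INSERT INTO api_data (id, value) VALUES (?, ?)")) {
      clear.executeUpdate();
      for (ApiRow row : rows) {
        insert.setString(1, row.id());
        insert.setString(2, row.value());
        insert.addBatch();
      }
      insert.executeBatch();
      conn.commit();
    } catch (Exception e) {
      conn.rollback();
      throw e;
    }
  }
}
```

It's this kind of hand-rolled code, multiplied across many APIs and tables, that I'd like an ETL tool to manage and monitor for me.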
Nathan Marz, in his book "Big Data", describes how to maintain files of data in HDFS and how to optimize file sizes to be as near the native HDFS block size as possible, using his Pail library running on top of MapReduce.
Is it possible to achieve the same result in Google Cloud Storage?
Can I use Google Cloud Dataflow instead of MapReduce for this purpose?
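To make the question concrete, here is roughly the kind of compaction job I have in mind in Beam/Dataflow, assuming newline-delimited text files; the bucket paths and shard count are placeholders, and I'd pick the shard count so that total input size divided by numShards lands near my target file size.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CompactGcsFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read many small files and rewrite them as a fixed number of shards,
    // so each output file lands near the desired target size.
    p.apply("ReadSmallFiles", TextIO.read().from("gs://my-bucket/incoming/*"))
     .apply("WriteConsolidated",
         TextIO.write().to("gs://my-bucket/compacted/part").withNumShards(8));

    p.run().waitUntilFinish();
  }
}
```

What I don't know is whether this is an idiomatic replacement for Pail's block-size-aware consolidation, given that GCS is an object store rather than a block-based filesystem.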
Task: We have to set up a periodic sync of records from Spanner to BigQuery. Our Spanner database has a relational table hierarchy.
Option considered: I was thinking of using Dataflow templates to set up this data pipeline.
Option 1: Set up a job with the 'Cloud Spanner to Cloud Storage Text' Dataflow template and then another with the 'Cloud Storage Text to BigQuery' template. Con: the first template works on only a single table, and we have many tables to export.
Option 2: Use the 'Cloud Spanner to Cloud Storage Avro' template, which exports the entire database. Con: I only need to export selected tables within the database, and I don't see a template to import Avro into BigQuery.
Question: Please suggest the best option for setting up this pipeline.
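For scale, this is the kind of per-table pipeline I think I'd have to hand-write if no template fits: a sketch assuming one table with Id (INT64) and Name (STRING) columns, where all project, instance, database, and table identifiers are placeholders.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.spanner.Struct;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class SpannerToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromSpanner",
            SpannerIO.read()
                .withInstanceId("my-instance")     // placeholder
                .withDatabaseId("my-database")     // placeholder
                .withQuery("SELECT Id, Name FROM MyTable"))
     // Convert each Spanner Struct into a BigQuery TableRow.
     .apply("StructToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((Struct s) -> new TableRow()
                    .set("Id", s.getLong("Id"))
                    .set("Name", s.getString("Name"))))
     // Overwrite the destination on each periodic run; assumes the
     // BigQuery table already exists with a matching schema.
     .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run().waitUntilFinish();
  }
}
```

I'd rather not write and maintain one of these per table, which is why I'm asking whether a template-based setup can cover this.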