Benefits with Dataflow over cloud functions when moving data?

  • I'm relatively new to GCP and am just starting to set up and evaluate my organization's architecture on GCP.

    Scenario:
    Data will flow into a Pub/Sub topic (high frequency, low amount of data). The goal is to move that data into Bigtable. From my understanding, you can do that either with a Cloud Function triggered by the topic or with Dataflow.

    I have previous experience with Cloud Functions, which I am satisfied with, so that would be my pick.

    I fail to see the benefit of choosing one over the other. So my question is: when should I choose which of these products?

    Thanks

      November 30, 2021 12:40 PM IST
  • Both solutions could work. Dataflow will scale better if your Pub/Sub traffic grows to large volumes, but Cloud Functions should work fine for small volumes. I would check the quotas page (especially the rate-limit section) to make sure your workload fits within the Cloud Functions limits: https://cloud.google.com/functions/quotas

    Another thing to consider is that Dataflow can guarantee exactly-once processing of your data, so that no duplicates end up in Bigtable. Cloud Functions will not do this for you out of the box. If you go with a functions approach, make sure that the content of the Pub/Sub message alone determines which Bigtable cell is written to; that way, if the function is retried several times, the same data will simply overwrite the same Bigtable cell.
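
    A minimal sketch of that idempotent-write idea, assuming a Pub/Sub-triggered function whose event carries a base64-encoded JSON payload. The payload fields `device_id` and `event_ts` and the injected `write_cell` callable are hypothetical stand-ins; a real function would write through a Bigtable client instead.

```python
import base64
import json


def row_key_for(event: dict) -> bytes:
    """Derive the row key from the message content only, never from
    the delivery time or a random ID, so retries hit the same row."""
    payload = json.loads(base64.b64decode(event["data"]))
    # Hypothetical payload fields; use whatever uniquely identifies a record.
    return f'{payload["device_id"]}#{payload["event_ts"]}'.encode("utf-8")


def handle_message(event: dict, context=None, write_cell=print):
    """Entry point shaped like a Pub/Sub background function.

    write_cell is a hypothetical callable (key, value) standing in for
    the Bigtable write. Assumption: the real write should also use a
    cell timestamp derived from the message, because Bigtable can keep
    several timestamped versions of one cell.
    """
    key = row_key_for(event)
    value = base64.b64decode(event["data"])
    write_cell(key, value)
    return key
```

    Delivering the same message twice produces the same key, so the second write overwrites the first rather than adding a new row.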

      December 7, 2021 12:24 PM IST
  • Your needs sound relatively straightforward, and Dataflow may be overkill for what you're trying to do. If Cloud Functions does what you need, then maybe stick with that. I often find that simplicity is key when it comes to maintainability.

    However, when you need to perform transformations such as merging these events by user before storing them in Bigtable, that's where Dataflow really shines:

    https://beam.apache.org/documentation/programming-guide/#groupbykey
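
    To show the shape of a "merge by user" step, here is a plain-Python sketch (not Beam code) of what grouping by key and then merging each group looks like. The event fields `user_id` and `ts` and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def group_by_user(events: Iterable[dict]) -> dict:
    """Collect all events that share a user_id, like a group-by-key step."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["user_id"]].append(event)
    return dict(grouped)


def merge_user_events(user_id: str, events: Iterable[dict]) -> dict:
    """Collapse one user's events into a single record to store."""
    events = list(events)
    return {
        "user_id": user_id,
        "event_count": len(events),
        "last_ts": max(e["ts"] for e in events),
    }


def merged_rows(events: Iterable[dict]) -> list:
    """Group, then merge each group into one output row per user."""
    return [merge_user_events(uid, evs) for uid, evs in group_by_user(events).items()]
```

    In a Beam pipeline the grouping would be done by `GroupByKey` on `(user_id, event)` pairs, and on an unbounded Pub/Sub source that requires windowing or a trigger so that each group can be emitted. A single-message Cloud Function has no equivalent, because it sees only one event per invocation.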

      December 10, 2021 11:10 AM IST
  • My primary use of Cloud Functions with Dataflow pipelines is to start a Dataflow job through an API request. The Dataflow pipeline is exported as a template to Cloud Storage.

    The Cloud Function can be triggered by an event in Cloud Storage (a new file is added to the bucket), and it then starts the Dataflow pipeline, usually passing the uploaded file as an input parameter.
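
    A rough sketch of that pattern, assuming a Cloud Storage-triggered background function and the `google-api-python-client` package. `PROJECT`, `REGION`, `TEMPLATE_PATH`, the `load-` job-name prefix, and the template parameter name `inputFile` are all hypothetical placeholders; the parameter names depend on your own template.

```python
import re
import time

PROJECT = "my-project"                                   # hypothetical
REGION = "us-central1"                                   # hypothetical
TEMPLATE_PATH = "gs://my-templates/load-file-template"   # hypothetical


def build_launch_body(bucket: str, name: str, now=None) -> dict:
    """Build the request body for a template launch from a GCS event."""
    now = int(time.time()) if now is None else now
    # Keep the job name to lowercase letters, digits, and hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())[:20].strip("-") or "file"
    return {
        "jobName": f"load-{slug}-{now}",
        "parameters": {"inputFile": f"gs://{bucket}/{name}"},
    }


def on_file_uploaded(event: dict, context=None):
    """Entry point: launch the template for the newly uploaded object."""
    # Imported lazily so the helper above works without the client library.
    from googleapiclient.discovery import build

    body = build_launch_body(event["bucket"], event["name"])
    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().locations().templates().launch(
        projectId=PROJECT, location=REGION, gcsPath=TEMPLATE_PATH, body=body
    )
    return request.execute()
```

    Keeping the body construction in its own function makes the parameter mapping easy to test without calling the Dataflow API.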

    I don’t use Cloud Functions directly within Dataflow pipelines. If there were such a need, a better approach, in my opinion, would be to publish messages to Pub/Sub from the Dataflow pipeline and then configure a Cloud Function to be triggered by that Pub/Sub topic.

      December 16, 2021 12:35 PM IST