QBoard » Big Data » Big Data - Data Storage : Hive, HBase, MongoDB, Teradata.. » Flattening event data in big query with one query

Flattening event data in big query with one query

  • We have over 100m rows in big query of analytics data. Each record is an event attached to an id.
    A simplification:
    ID EventId Timestamp
    Is it possible to flatten this to one table holding rows like:
    ID timestamp-period event1 event2 event3 event4
    Where the event columns hold the counts of the number of events for that id in that time period?
    So far, i've managed to do it on small data sets with 2 queries. One to create rows that hold counts for an individual event id and another to flatten these in to one row after. The reason I haven't yet been able to do this accross the whole data set is that bigquery runs out of resources - not entirely sure why.
    These two queries look something like this:
    SELECT VideoId, date_1, IF(EventId = 1, INTEGER(count), 0) AS user_play, IF(EventId = 2, INTEGER(count), 0) AS auto_play, IF(EventId = 3, INTEGER(count), 0) AS pause, IF(EventId = 4, INTEGER(count), 0) AS replay, IF(EventId = 5, INTEGER(count), 0) AS stop, IF(EventId = 6, INTEGER(count), 0) AS seek, IF(EventId = 7, INTEGER(count), 0) AS resume, IF(EventId = 11, INTEGER(count), 0) AS progress_25, IF(EventId = 12, INTEGER(count), 0) AS progress_50, IF(EventId = 13, INTEGER(count), 0) AS progress_75, IF(EventId = 14, INTEGER(count), 0) AS progress_90, IF(EventId = 15, INTEGER(count), 0) AS data_loaded, IF(EventId = 16, INTEGER(count), 0) AS playback_complete, IF(EventId = 30, INTEGER(count), 0) AS object_click, IF(EventId = 31, INTEGER(count), 0) AS object_rollover, IF(EventId = 32, INTEGER(count), 0) AS object_clickthrough, IF(EventId = 33, INTEGER(count), 0) AS object_shown, IF(EventId = 34, INTEGER(count), 0) AS object_close, IF(EventId = 40, INTEGER(count), 0) AS logo_clickthrough, IF(EventId = 41, INTEGER(count), 0) AS endframe_clickthrough, IF(EventId = 42, INTEGER(count), 0) AS startframe_clickthrough, IF(EventId = 61, INTEGER(count), 0) AS share_facebook, IF(EventId = 62, INTEGER(count), 0) AS share_twitter, IF(EventId = 63, INTEGER(count), 0) AS open_social_panel, IF(EventId = 70, INTEGER(count), 0) AS embed_code_requested, IF(EventId = 80, INTEGER(count), 0) AS player_impression, IF(EventId = 81, INTEGER(count), 0) AS player_loaded, IF(EventId = 90, INTEGER(count), 0) AS html5_impression, IF(EventId = 91, INTEGER(count), 0) AS html5_load, IF(EventId = 95, INTEGER(count), 0) AS fallback_impression, IF(EventId = 96, INTEGER(count), 0) AS fallback_load, IF(EventId = 152, INTEGER(count), 0) AS object_impression, IF(EventId = 200, INTEGER(count), 0) AS ping, IF(EventId = 250, INTEGER(count), 0) AS facebook_clickthrough, IF(EventId = 251, INTEGER(count), 0) AS twitter_clickthrough, IF(EventId = 252, INTEGER(count), 0) AS other_clickthrough, IF(EventId = 253, INTEGER(count), 0) AS qr_clickthrough, IF(EventId = 254, INTEGER(count), 0) AS banner_clickthrough, IF(EventId = 280, INTEGER(count), 0) AS banner_impression, IF(EventId = 281, INTEGER(count), 0) AS banner_loaded, IF(EventId = 282, INTEGER(count), 0) AS banner_data_loaded, IF(EventId = 284, INTEGER(count), 0) AS banner_forward, IF(EventId = 285, INTEGER(count), 0) AS banner_back, IF(EventId = 300, INTEGER(count), 0) AS mobile_preview_loaded, IF(EventId = 301, INTEGER(count), 0) AS mobile_preview_clickthrough, IF(EventId = 302, INTEGER(count), 0) AS mobile_preview_clickthrough_back, IF(EventId = 310, INTEGER(count), 0) AS product_search_click, IF(EventId = 311, INTEGER(count), 0) AS promo_code_click, IF(EventId = 320, INTEGER(count), 0) AS player_share_facebook, IF(EventId = 321, INTEGER(count), 0) AS player_share_twitter, IF(EventId = 322, INTEGER(count), 0) AS player_share_googleplus, IF(EventId = 323, INTEGER(count), 0) AS player_share_email, IF(EventId = 324, INTEGER(count), 0) AS player_share_embed, IF(EventId = 401, INTEGER(count), 0) AS youtube_error_2, IF(EventId = 402, INTEGER(count), 0) AS youtube_error_100, IF(EventId = 403, INTEGER(count), 0) AS youtube_error_101, FROM ( SELECT VideoId, EventId, count(*) as count, Date(timestamp) as date_1 FROM [data.data_1] GROUP EACH BY VideoId, EventId, date_1 ) ORDER BY data_loaded DESC;
    Then just a group by on id and timestamp creates the full aggregated table.
    Am I doing this the right way, and do I just need to do it on a small partition of the dataset or is there a better way to aggregate like this that will use bigquery in a more efficient way?
    Thanks in advance, Mat
      October 23, 2021 4:30 PM IST
    0
  • My guess is that you're running out of resources because of the ORDER BY at the end. Everything else should be able to be done in parallel. Also note that if you remove the order by, you will be able to use the 'allow large results' flag and write out a large table of the results (if the results are > 128MB).

     
      November 26, 2021 12:23 PM IST
    0
  • You may try using a pivot query here. For example, if you wanted the difference in minutes for each person/campaign from the time of receiving an email to opening that email, you could try this:

    SELECT
        campaign_id,
        person_id,
        TIMESTAMP_DIFF(
            MAX(CASE WHEN event_type = 'Opened Email' THEN timestamp END),
            MAX(CASE WHEN event_type = 'Received Email' THEN timestamp END),
            MINUTE) AS diff_in_minutes
    FROM yourTable
    GROUP BY
        campaign_id,
        person_id;

     

    Note: This answer was given to the original question, which was later changed substantially.

     
      January 8, 2022 2:46 PM IST
    0
  • In order to query multiple REPEATED Records as we intended to do originally, we’ll need to make use of the FLATTEN function. This acts similarly to Entity SQL’s FLATTEN function by purposefully flattening the specified field into the rest of the dataset.

    For example, if we want to perform our original query to return all the data from our persons table, we’ll need to FLATTEN one of the REPEATED records:

    SELECT
      *
    FROM
      FLATTEN([primary.persons], children)
    

    Here we’re FLATTENING the children REPEATED Record into the rest of the table, so our results are duplicated as often as necessary to accomodate for every repetition of nested fields (children and citiesLives):

    Row kind    fullName    age gender  phoneNumber_areaCode    phoneNumber_number  children_name   children_gender children_age    citiesLived_place   citiesLived_yearsLived
    1   person  John Doe    22  Male    206 1234567 Jane    Female  6   Seattle 1995
    2   person  John Doe    22  Male    206 1234567 Jane    Female  6   Stockholm   2005
    3   person  John Doe    22  Male    206 1234567 John    Male    15  Seattle 1995
    4   person  John Doe    22  Male    206 1234567 John    Male    15  Stockholm   2005
    5   person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 1989
    6   person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 1993
    7   person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 1998
    8   person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 2002
    9   person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Washington DC   1990
    10  person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Washington DC   1993
    ...
      December 22, 2021 1:23 PM IST
    0