
Lambda Architecture with Apache Spark

  • I'm trying to implement a Lambda Architecture using the following tools: Apache Kafka to receive all the datapoints, Spark for batch processing (Big Data), Spark Streaming for real time (Fast Data) and Cassandra to store the results.

    Also, all the datapoints I receive are related to a user session, so for the batch processing I'm only interested in processing the datapoints once the session has finished. Since I'm using Kafka, the only way to solve this (assuming that all the datapoints are stored in the same topic) is for the batch job to fetch all the messages in the topic and then ignore those that correspond to sessions that have not yet finished.

    So, what I'd like to ask is:

    • Is this a good approach to implement the Lambda Architecture? Or should I use Hadoop and Storm instead? (I can't find information about people using Kafka together with Apache Spark for the batch processing / MapReduce part.)
    • Is there a better approach to solve the user sessions problem?

    Thanks.

      December 10, 2021 11:23 AM IST
    0
  • This is a good approach. Using Spark for both the speed and batch layers lets you write the logic once and use it in both contexts.

    Concerning your session issue, since you're doing that in batch mode, why not just ingest the data from Kafka into HDFS or Cassandra and then write queries for full sessions there? You could use Spark Streaming's "direct connection" to Kafka to do this.
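
    For illustration, here is a minimal sketch of that ingestion step with the Kafka direct stream API (spark-streaming-kafka-0-10). The broker address, topic name, consumer group and HDFS path are placeholders, not details from the original question:

    ```java
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class KafkaToHdfsIngest {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("kafka-to-hdfs-ingest");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "kafka:9092");        // placeholder broker
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "session-ingest");             // placeholder group
            kafkaParams.put("auto.offset.reset", "earliest");
            kafkaParams.put("enable.auto.commit", false);

            // "Direct" connection: one RDD partition per Kafka partition, offsets tracked by Spark
            JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                    ssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(
                            Arrays.asList("session-events"), kafkaParams)); // placeholder topic

            // Dump each micro-batch of raw events to HDFS; the batch job can later read
            // these files and keep only the datapoints of sessions that have finished.
            stream.map(ConsumerRecord::value)
                  .foreachRDD((rdd, time) -> {
                      if (!rdd.isEmpty()) {
                          rdd.saveAsTextFile("hdfs:///data/raw-sessions/" + time.milliseconds());
                      }
                  });

            ssc.start();
            ssc.awaitTermination();
        }
    }
    ```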

      December 28, 2021 12:21 PM IST
    0
  • The Lambda architecture just explained is the basis for the setup of our demo ETL system. The goal of this system is to test and try out tools of the Cloudera CDH platform and its interfaces in order to realize a minimal example of a Lambda ETL pipeline. The focus was on Apache Spark, a framework for cluster computing.

    Because of the vast size of the Hadoop ecosystem, with its myriad components, we have focused our implementation on a few common tools, namely Hive, Spark and Kafka. The upstream source system is an SAP Bank Analyzer 9 on a HANA database.

    Figure 3: Data flows in the implemented Lambda architecture

    A business use case for discounting cash flows was implemented. These cash flows are stored in the HANA DB as part of financial transactions in the SAP Bank Analyzer. From there they are transferred to the Hadoop system with the help of Spark, where they are discounted. This requires current market data, which is not entered and updated manually here but obtained from the Internet via a public API. For such "Open Data", the REST interface of the European Central Bank to its Statistical Data Warehouse is a good choice, from which a variety of market data and rates can be obtained. In the example discussed here, EURIBOR money market rates and EUROYIELD capital market rates were used for the present value calculation of the cash flows. The discounted cash flow is then stored in a Hive table in HDFS. However, only the delta of the records is written, i.e. a record is only stored if it does not yet exist in the table or if it has been updated. Otherwise the result is filtered out and not persisted in Hive.
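
    As a rough illustration of this delta step, here is a sketch under the assumption that new or changed records are detected with a simple except() against the current table contents; the table name is invented and not part of the actual implementation:

    ```java
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HiveDeltaWriter {

        // Persists only the new or changed discounted cash flow records into the
        // target Hive table; records that already exist unchanged are filtered out.
        public static void writeDelta(SparkSession spark, Dataset<Row> discounted) {
            // current contents of the target Hive table (invented name)
            Dataset<Row> existing = spark.table("dwh.discounted_cashflows");

            // except() keeps the rows not already present in the table, i.e. records
            // that are new or whose values have changed since the last run
            Dataset<Row> delta = discounted.except(existing);

            // append only the delta; existing records stay untouched
            delta.write().mode("append").saveAsTable("dwh.discounted_cashflows");
        }
    }
    ```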

    The technical implementation consists of two different Java programs and self-written Spark Java libraries. Although Spark itself is written in Scala, we used the Java API: our experience with customers has shown that they prefer Java because of the better availability of developers and its wider adoption.

    In the following, we take a more detailed look at the individual components of the ETL pipeline. The program (1) for loading the market data receives JSON files from the ECB Statistical Data Warehouse via a REST call. These files are then parsed to extract and re-bundle the relevant data. Using the Kafka Java API, a Kafka producer is implemented that writes the data, formatted as a JSON string, into a Kafka topic (2). Since only the latest version of the market data is needed, such a topic is an easy-to-use key-value store.
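
    A minimal sketch of such a producer with the plain Kafka Java client; the broker address, topic name, key and JSON payload are made-up examples rather than the actual ECB data:

    ```java
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class MarketDataPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");   // placeholder broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // invented example: rate series id as key, re-bundled JSON as value
            String key = "EURIBOR_3M";
            String value = "{\"series\":\"EURIBOR_3M\",\"date\":\"2021-12-30\",\"rate\":-0.573}";

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // keyed writes make the topic usable as a simple key-value store:
                // the latest record per key is the current market data
                producer.send(new ProducerRecord<>("market-data", key, value)); // placeholder topic
                producer.flush();
            }
        }
    }
    ```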

    Of course, this step can also be done directly in Spark and you can also skip the caching of the data in Kafka Topics. However, the focus was to test as many interfaces as possible with a simple use case. In addition, the traceability of older calculations is ensured in this way.

    The main program for loading cash flows (3) was developed using the Spark Java API. Two versions of the program were created for this purpose, one for stream processing and a second for batch processing. Thanks to the possibility of using Spark Streaming for batch processing via the "One-Time-Micro-Batch" trigger setting, the implementation and maintenance effort stays limited: most of the code can be used for both cases, and the processing mode is simply selected as needed via a configuration file. Such a one-time run brings all the known advantages of the Spark streaming library, such as the automatic recovery of the query from the created checkpoints after an unintentional system shutdown or crash. At the same time, all advantages of batch processing are retained, such as the reduction of costs through targeted cluster startup and shutdown.
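
    Roughly sketched with the Structured Streaming Java API, such a switchable trigger could look like this; the Kafka settings, paths and the way the mode is passed in are assumptions, with Trigger.Once() standing in for the "One-Time-Micro-Batch" setting:

    ```java
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import org.apache.spark.sql.streaming.Trigger;

    public class CashFlowLoader {
        public static void main(String[] args) throws Exception {
            // assumption: the processing mode is passed in, e.g. read from a config file
            boolean oneTimeBatch = args.length > 0 && "batch".equals(args[0]);

            SparkSession spark = SparkSession.builder().appName("cashflow-loader").getOrCreate();

            // read the cash flow messages from Kafka (requires spark-sql-kafka-0-10)
            Dataset<Row> cashFlows = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "kafka:9092")  // placeholder broker
                    .option("subscribe", "cashflows")                 // placeholder topic
                    .option("startingOffsets", "earliest")
                    .load()
                    .selectExpr("CAST(value AS STRING) AS json");

            StreamingQuery query = cashFlows.writeStream()
                    .format("parquet")
                    .option("path", "hdfs:///warehouse/cashflows")            // placeholder path
                    .option("checkpointLocation", "hdfs:///checkpoints/cashflows")
                    // Trigger.Once() runs a single micro-batch and stops: the same code gives
                    // batch semantics while keeping the streaming checkpoints for recovery.
                    .trigger(oneTimeBatch ? Trigger.Once() : Trigger.ProcessingTime("1 minute"))
                    .start();

            query.awaitTermination();
        }
    }
    ```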

      January 6, 2022 1:06 PM IST
    0
  • In times when standing still is already considered a step backwards, it is especially important for businesses to react faster to trends and to draw the right conclusions from them. For this reason, decision-making processes are no longer based only on data from classic databases, which transmit their data to downstream systems once a day, usually overnight, but also on data from various sources such as social media, log files, images, sensor data, etc. This means that not only has the heterogeneity of the data increased, but also the rate at which it arrives and thus the speed at which it is necessary to react. If, for example, an ATM has a defect or runs out of cash, this should result in timely action to keep customer satisfaction high. Modern IT architectures must take these changed circumstances into account.

    In the context of big data scenarios, the Lambda architecture is a frequently used architectural pattern in IT system landscapes when it comes to reconciling the requirements of two different user groups. On the one hand, there are users who have always had to process and evaluate data of high quality, usually enriched with additional, calculated key figures. These "classic" users need the data for specific key dates in departments such as reporting, accounting, risk or controlling. On the other hand, there are users with a short-term need for information who have to react quickly to events. This can be the defective ATM for the maintenance technician, but also the next boycott call against a certain company on social media for a stock trader.

    Figure 1: Lambda architecture overview

    How can you unite these different users on one platform?

    The Lambda architecture achieves this by using two different layers. One is a Batch Layer, which ensures consistency. This is achieved by the important principle of immutability of the data. This means that the source data is never changed, only copies are created and saved. The unchangeable source information is used as a basis for the calculation of further KPIs, which are added to the source data as new copies. This ensures a clean separation of source data and derived, calculated data. This layer is especially important for the classical users.
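
    As a small, purely illustrative sketch of this principle in the same Spark Java style as above: the source table is only read, and the derived KPIs are written as a separate copy; all table and column names are invented:

    ```java
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;

    public class BatchLayerKpiJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("batch-layer-kpi")
                    .enableHiveSupport()
                    .getOrCreate();

            // immutable source data: only read, never updated in place (invented table name)
            Dataset<Row> transactions = spark.table("raw.transactions");

            // derived, calculated KPI (invented columns): a new copy alongside the source data
            Dataset<Row> kpis = transactions
                    .groupBy(col("account_id"))
                    .sum("amount")
                    .withColumnRenamed("sum(amount)", "total_amount");

            // written to its own table; the source data stays untouched
            kpis.write().mode("overwrite").saveAsTable("derived.account_totals");
        }
    }
    ```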

    The Speed Layer is especially important for the real-time analysis of data. It is meant to close the comparatively large time window until data from the Batch Layer becomes available. Only the latest data from the upstream systems is processed and, if necessary, additional KPIs are calculated on the fly and made available in real time. The calculated KPIs are usually only a subset of those calculated in the Batch Layer. As soon as the calculation in the Batch Layer is completed at a later point in time, the missing KPIs are added to the Serving Layer. The goal of the Speed Layer is to provide a preliminary view in real time at the expense of completeness and accuracy.

    In the Serving Layer, both user groups can create their reports from one or both layers according to their requirements.

    Modern big data architectures in the financial industry or in Industry 4.0 often work with a Lambda architecture. Here, streaming sources (sensor data, the Internet of Things or change data capture from databases) are tapped and evaluated for the Speed Layer. In many cases, the Speed Layer is equipped with machine learning methods to evaluate data more effectively and automatically. The Batch Layer is fed from the relevant upstream and ERP systems. Due to the heterogeneity of the data formats, a data lake is often used for storage. The data of both layers is available for different applications such as machine learning or reporting.

      December 11, 2021 3:15 PM IST
    0