
Why does Spark SQL consider the support of indexes unimportant?

  • Quoting the Spark DataFrames, Datasets and SQL manual:

    A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.

    Being new to Spark, I'm a bit baffled by this for two reasons:

    1. Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?

    2. Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type to take forever instead of returning immediately.

    I understand that indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected.

    I'm wondering, then, why the Spark SQL team considers indexes so unimportant that they are off its road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?

      September 7, 2021 1:52 PM IST
  • Indexing input data

    • The fundamental reason why indexing over external data sources is not in Spark's scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it uses, it cannot reliably monitor changes and, as a consequence, cannot maintain indices.
    • If a data source supports indexing, Spark can use it indirectly through mechanisms like predicate pushdown (see the sketch below).
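
    For example, here is a minimal spark-shell style sketch of checking that a filter is pushed down to a Parquet source; the path /data/events and the column eventType are hypothetical:

      // Verify that a simple filter is pushed down to the Parquet reader
      // instead of being applied after a full scan.
      // The path "/data/events" and column "eventType" are made up.
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()

      val events = spark.read.parquet("/data/events")

      // The equality predicate is handed to the data source; Parquet can use
      // per-row-group statistics to skip blocks with no matching rows.
      events.filter("eventType = 'click'").explain()
      // The physical plan should list the predicate under "PushedFilters".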

    Indexing Distributed Data Structures:

    • Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
    • A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing and maintaining indices. This is a common pattern used by different in-memory columnar systems; a sketch follows this list.
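
    A minimal sketch of that layout pattern, assuming a hypothetical dataset with a dt column used as the partition key:

      // Write the data partitioned by a frequently filtered column, then let
      // Spark prune whole directories at read time. Paths and columns are
      // assumptions for illustration.
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("layout-sketch").getOrCreate()

      // Partitioning by date creates one directory per value, e.g. dt=2021-09-01/.
      spark.read.parquet("/data/raw_events")
        .write
        .partitionBy("dt")
        .parquet("/data/events_by_date")

      // A query filtering on the partition column reads only the matching
      // directories; look for "PartitionFilters" in the plan.
      spark.read.parquet("/data/events_by_date")
        .filter("dt = '2021-09-01'")
        .explain()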

    That being said, some forms of indexed structures do exist in the Spark ecosystem. Most notably, Databricks provides a Data Skipping Index on its platform.
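
    On Databricks (or a recent open-source Delta Lake release), a hedged sketch of data skipping with Z-ordering might look like this; the table events and the column userId are assumptions:

      // OPTIMIZE ... ZORDER BY co-locates rows with similar values, so per-file
      // min/max statistics let selective queries skip most files.
      // Table "events" and column "userId" are made up for illustration.
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("skipping-sketch").getOrCreate()

      spark.sql("OPTIMIZE events ZORDER BY (userId)")

      // A point lookup now reads only the files whose statistics overlap
      // userId = 42, rather than scanning the whole table.
      spark.sql("SELECT * FROM events WHERE userId = 42").show()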

    Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random access support.

    Of course this raises a question: if you require efficient random access, why not use a system designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.

      October 16, 2021 1:22 PM IST
  • The reason why joining with UDFs requires a Cartesian product is quite simple. Since you pass an arbitrary function with a possibly infinite domain and non-deterministic behavior, the only way to determine its value is to pass the arguments and evaluate it. That means you simply have to check all possible pairs of rows.

    Simple equality, on the other hand, has predictable behavior. If you use a t1.foo = t2.bar condition, you can simply shuffle the t1 and t2 rows by foo and bar respectively to get the expected result; the sketch below contrasts the two plans.
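
    A small sketch contrasting the two plans (the tables and columns are made up):

      // Compare the physical plans of an equi-join and a join on an opaque UDF.
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.udf

      val spark = SparkSession.builder().appName("join-sketch").getOrCreate()
      import spark.implicits._

      val t1 = Seq((1, "a"), (2, "b")).toDF("foo", "v1")
      val t2 = Seq((1, "x"), (3, "y")).toDF("bar", "v2")

      // Plain equality: Spark shuffles both sides by the join key and can use
      // a sort-merge or hash join.
      t1.join(t2, $"foo" === $"bar").explain()

      // Opaque UDF as the join condition: Spark cannot reason about it, so it
      // has to evaluate the function for every pair of rows (a nested-loop,
      // Cartesian-style join).
      val sameKey = udf((a: Int, b: Int) => a == b)
      t1.join(t2, sameKey($"foo", $"bar")).explain()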

    And just to be precise, in relational algebra an outer join is actually expressed using a natural join; anything beyond that is simply an optimization.

      November 17, 2021 12:53 PM IST
  • Spark SQL is mostly about distributed in-memory computations on a massive scale.
     
    Spark SQL provides Spark programmers with the benefits of relational processing (e.g., declarative queries and optimized storage), and allows SQL users to call complex analytics libraries in Spark (e.g., machine learning).
     
    Now, talking about indexes:
    Indexing lets a system move the disk head to the precise location on disk you want and read just that record, instead of scanning everything.
     
    Indexing input data
    • The reason why indexing over external data sources is not supported by Spark is that Spark is not a data management system but a batch data processing engine. Since the data used by Spark comes from different sources and is not owned by Spark, it cannot reliably monitor changes and, as a consequence, cannot maintain indices.

    • Now, if the data source that provides data to Spark supports indexing, Spark can use it indirectly through mechanisms like predicate pushdown (see the sketch below).
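
    For instance, a hedged sketch with the JDBC data source (the connection details, table and column names are made up): when Spark pushes the filter into the generated SQL, the database is free to answer it with its own index.

      // The filter below is pushed into the SQL Spark sends to the database,
      // so an index on orders.customer_id (if one exists there) can be used.
      // URL, credentials, table and column names are assumptions.
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("jdbc-pushdown-sketch").getOrCreate()

      val orders = spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/shop")
        .option("dbtable", "orders")
        .option("user", "reader")
        .option("password", "<password>")
        .load()

      // Check "PushedFilters" in the physical plan.
      orders.filter("customer_id = 12345").explain()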

    Indexing Distributed Data Structures:
    • Standard indexing techniques require a persistent and well-defined data distribution, but the exact distribution of data in Spark is nondeterministic, and the data itself is ephemeral.

    • A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, provides efficient distributed access without the overhead of creating, storing and maintaining indices. This common pattern is used by different in-memory columnar systems; a sketch follows this list.
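
    Along the same lines, bucketing is another layout technique Spark offers instead of an index; here is a hedged sketch (paths, table and column names are made up):

      // Bucketing pre-hashes rows by a key into a fixed number of files, so
      // joins and aggregations on that key can avoid a shuffle at query time.
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()

      spark.read.parquet("/data/events")
        .write
        .bucketBy(64, "userId")
        .sortBy("userId")
        .format("parquet")
        .saveAsTable("events_bucketed")

      // An aggregation on userId over the bucketed table needs no Exchange
      // node in the plan, because the data is already hash-distributed by key.
      spark.table("events_bucketed")
        .groupBy("userId").count()
        .explain()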

    That being said, some forms of indexed structures do exist in the Spark ecosystem.
    Projects like Succinct (mostly inactive today) take a different approach and use advanced compression techniques with random access support.
     
    And of course this raises a question: if we require efficient random access, why not use a system designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark is evolving as a project, and its future direction may change.
      September 15, 2021 11:22 PM IST