
Hadoop on cloud latency impact

  • I'm a big data architect with no cloud experience.

    I have always worked with Hadoop on premises, where server locality is a serious concern because poor placement can introduce higher latency.

    Now that Hadoop is being integrated with the cloud, I'm wondering:

    1. Do cloud providers (AWS, Azure) offer a way to place the hosts of a single cluster in the same locality to reduce latency?
    2. How do we manage the latency of transferring huge data sets from local machines to the cloud?
      August 5, 2021 4:01 PM IST
    • Distributed computing is one component adding to cloud latency's complexity. With enterprise data centers a thing of the past, the nature of applications has changed completely, from being contained within a local infrastructure to being distributed all over the world. The proliferation of Big Data applications using tools such as R and Hadoop is incentivizing distributed computing even more. The problem lies in the fact that these applications, deployed all over the world, see varying degrees of latency on each of their Internet connections. Furthermore, those latencies depend entirely on Internet traffic, which waxes and wanes as applications compete for the same bandwidth and infrastructure.
    • Virtualization adds another layer of complexity to latency in the cloud. Gone are the days of rack-mounted servers; enterprises are building virtualized environments for consolidation and cost efficiency, and today's data centers are a complex web of hypervisors running dozens of virtual machines. Unfortunately, virtualized network infrastructure can introduce its own series of packet delays before the data even leaves the rack.
    • Another complexity layer lies in the lack of measurement tools for modern applications. While ping and traceroute can test an Internet connection, modern applications have nothing to do with ICMP, the protocol behind those tools. Instead, modern applications and networks use protocols such as HTTP and FTP, and their performance needs to be measured accordingly (a sketch of HTTP-level measurement follows this list).
    • Traffic prioritization and Quality of Service (QoS) add yet another layer to cloud latency's complexity. Pre-cloud, Service Level Agreements (SLAs) and QoS were created to prioritize traffic and to make sure that latency-sensitive applications would have the network resources to run properly. Cloud and virtualized services make this a dated process, since we now need to distinguish between failures such as a server outage, a failed network card, a fault in the storage infrastructure, or a security exploit. Different cloud applications have different tolerances for network latency, depending on their criticality; while an application controlling back-office reporting may tolerate lower uptime, not all corporate processes can absorb downtime without a significant impact on the business. This makes it increasingly important for SLAs to prioritize particular applications based on performance and availability.
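
      To make the measurement point concrete, here is a minimal sketch of timing application-level latency over HTTP rather than ICMP (the URL and sample count are illustrative assumptions):

      ```python
      import time
      import requests  # pip install requests

      URL = "https://service.example.com/health"  # placeholder endpoint
      samples = []

      for _ in range(10):
          start = time.perf_counter()
          requests.get(URL, timeout=5)                           # one HTTP round trip
          samples.append((time.perf_counter() - start) * 1000)   # milliseconds

      print(f"min={min(samples):.1f} ms  "
            f"avg={sum(samples)/len(samples):.1f} ms  "
            f"max={max(samples):.1f} ms")
      ```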
      October 16, 2021 12:46 PM IST
  • I think this question belongs on Server Fault rather than Stack Overflow. That said, I can still try to help!

    • These cloud providers let you choose the region (and, within it, the Availability Zone) your systems are hosted in. They are not "on-premises", since they run in a remote data center, but if your nodes are in the same region the latency between them will be lower than you might expect. These companies, AWS in particular, work hard to keep traffic fast even when it crosses regions: sending and receiving messages between different countries remains very quick as long as you stay inside their network. Many people create VPNs inside AWS for the sole purpose of using its network, because it is surprisingly low-latency. One concrete locality mechanism is sketched after this list.

    • Generally, you don't have to worry about latency outside of your software's processing latency. That's one of the benefits of using a cloud provider.
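
      On question 1 specifically, AWS exposes EC2 placement groups: a "cluster" placement group asks for instances to be packed onto low-latency hardware within one Availability Zone. A hedged sketch with boto3 (the AMI, group name, and instance sizes are placeholders, not a tested deployment):

      ```python
      import boto3  # pip install boto3

      ec2 = boto3.client("ec2", region_name="us-east-1")

      # A "cluster" placement group packs instances close together on the
      # network, which is exactly the locality the question asks about.
      ec2.create_placement_group(GroupName="hadoop-pg", Strategy="cluster")

      # Launch the Hadoop worker nodes into that placement group.
      ec2.run_instances(
          ImageId="ami-0123456789abcdef0",   # placeholder AMI
          InstanceType="m5.2xlarge",
          MinCount=4,
          MaxCount=4,
          Placement={"GroupName": "hadoop-pg"},
      )
      ```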

      August 6, 2021 12:55 PM IST
  • Hadoop is increasingly being adopted as the go-to platform for large-scale data analytics. However, it is still not clear that Hadoop is always the optimal choice for traditional data warehousing, reporting, and analysis, especially in its "out of the box" configuration. That is because Hadoop itself is not a database, even though some data organization methods are adapted to and firmly ingrained within its distributed architecture.

    The first is the distributed file organization itself – the Hadoop Distributed File System, or HDFS. While the data organization provided by HDFS is intended to deliver linear scalability for capturing large data volumes, aspects of HDFS will impact the performance of reporting and analytical applications.

    Hadoop is deployed across a collection of compute and data nodes, and your data files are distributed across those data nodes. Because one of the foundational aspects of Hadoop is fault tolerance, component failure is expected; to mitigate this risk, HDFS not only distributes your file, it also replicates its chunks across different nodes so that if one node fails, the data is still accessible on another. In fact, by default the data is stored redundantly three times.

    Redundancy has two performance impacts. The first is on sizing: your storage requirement will be three times the size of your data set! The second involves data loading time: because the data is replicated, it has to be written to disk three times, increasing the time it takes to store the file. The impacts of data redundancy are, for the most part, static, and can be addressed by changing the default replication factor, as the sketch below illustrates.
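
    To make the sizing arithmetic concrete, a small sketch (the data set size is an illustrative assumption; the default of 3 is HDFS's dfs.replication setting, which can be lowered in hdfs-site.xml or per path with hdfs dfs -setrep):

    ```python
    # Back-of-the-envelope HDFS sizing under replication.
    replication_factor = 3   # HDFS default (dfs.replication)
    dataset_tb = 10          # hypothetical logical data set size, in TB

    raw_storage_tb = dataset_tb * replication_factor
    print(f"Raw storage required: {raw_storage_tb} TB")   # -> 30 TB

    # Every block is also written replication_factor times at load time,
    # so ingest I/O is amplified by the same factor.
    print(f"Write amplification on load: {replication_factor}x")
    ```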

    Hadoop also provides data organization for reporting. Hive is Hadoop's standard for providing an SQL interface for running queries. However, it is important to recognize that, despite the promise of high-performance execution, data distribution has a more insidious performance impact on reporting because of data access latency.

    Data access latency (sometimes just referred to as "latency") is the time it takes for data to be moved from its storage location to the computation location. For most simple queries (that is, filtering data sets using conditional SQL SELECT statements), data access latency is not an issue: each compute node reads the chunks of the data set stored locally on it, so the latency is limited to the time it takes to stream the records from disk.

    The problem arises with more complex queries (e.g., JOINs), where potentially all the records in one table are compared with all the records in another. Each record in one of the tables then has to be accessed by all of the compute nodes. If the distribution of the data chunks is not aligned with the compute nodes, data must be sent over the interconnect network from the data nodes to all of the compute nodes. Between the time it takes to access the data, package it for network transmission, and push it through the network, the result is dramatically increased latency, made worse by the bursty style of the communication, which taxes the network bandwidth.

    In other words, for anything other than embarrassingly parallel queries, SQL-style reporting may not perform as naively expected of a high-performance analytics platform; the sketch below contrasts the two query shapes. In my next post, we will consider some ways to optimize queries so that they are less impacted by the data organization.
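
    A minimal sketch of the two query shapes, assuming a HiveServer2 endpoint reachable via PyHive (the host, tables, and columns are hypothetical placeholders):

    ```python
    from pyhive import hive  # pip install pyhive

    conn = hive.connect(host="hive.example.internal", port=10000)
    cur = conn.cursor()

    # 1) Embarrassingly parallel filter: each node scans only its local
    #    blocks, so latency is essentially disk streaming time.
    cur.execute("SELECT * FROM clicks WHERE event_date = '2021-08-01'")

    # 2) JOIN: rows from both tables must be shuffled across the network
    #    so matching keys meet on the same node; latency now includes
    #    packaging the data and bursty network transfer.
    cur.execute("""
        SELECT c.user_id, u.country
        FROM clicks c
        JOIN users u ON c.user_id = u.user_id
    """)
    rows = cur.fetchall()
    ```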

      August 24, 2021 1:50 PM IST
  • For all its complexity, cloud latency has created a great opportunity for innovative cloud-based solutions, such as the Radar benchmarking tool from Cedexis, which provides insight into what goes on across various IaaS providers, and tools like Gomez, which help compare providers. While such tools are helpful for spotting trends, the overarching solution to measuring and mitigating cloud latency is more consistent network connections.

    The best available option is a dedicated connection to a public cloud platform. Amazon's Direct Connect is the best we have seen at providing predictable bandwidth and latency. [Disclosure: Interxion recently announced a direct connection to the AWS platform in each of its data centers.] Windows Azure offers a comparable option – both are particularly useful for companies looking to build hybrid solutions, as they allow some data to be stored on premises while other solution components are migrated to the cloud.
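
    On the original question of moving huge data sets into the cloud, parallel multipart uploads are the usual way to keep a long, high-latency link busy. A hedged sketch using boto3's managed transfer (the bucket and paths are placeholders):

    ```python
    import boto3  # pip install boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Split large files into parts uploaded concurrently, so per-request
    # latency on the WAN link is amortized across many parallel streams.
    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # use multipart above 64 MB
        multipart_chunksize=64 * 1024 * 1024,
        max_concurrency=16,
    )
    s3.upload_file("/data/export/part-00000", "my-ingest-bucket",
                   "landing/part-00000", Config=config)
    ```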

    Finally, by colocating within a third-party data center, companies can rest assured that their cloud applications are equipped to handle all of latency’s challenges and reap extra benefits in terms of monitoring, troubleshooting, support, and cost. Colocation facilities that offer specific Cloud Hubs can provide excellent connectivity and cross-connections with cloud providers, exchanges and carriers that improve performance and reduce latency to end users. Furthermore, colocation data centers ensure that companies not only have the best coverage for their business, but also a premium network at their fingertips.

    In this connected, always-on world, users increasingly demand immediate results for optimal website and application performance. For businesses looking to boost ROI and maintain customer satisfaction, every millisecond counts. While several dimensions and complicating factors of latency can introduce a number of disturbances for users and providers of cloud services, having dedicated network connections can help avoid these pitfalls and achieve optimal cloud performance.

      October 8, 2021 1:09 PM IST