
General architecture for a long-running data-processing system in Java?

  • I've been asked to port a legacy data processing application over to Java.

    The current version of the system is composed of a number of (badly written) Excel sheets. The sheets implement a big loop: a number of data sources are polled. These sources are a mixture of CSV and XML-based web services.

    The process is conceptually simple:

    It's stateless, meaning the calculations depend purely on the inputs. The results of the calculations are published (currently by writing a number of CSV files to some standard locations on the network).

    Having published the results the polling cycle begins again.

    The process will not need an admin GUI, but it would be neat if I could implement some kind of web-based control panel. It would be nothing pretty and purely for internal use. The control panel would do little more than display stats about the source feeds and possibly force a refresh of the input feeds in the event of a problem. This component is purely optional in the first delivery round.

    A critical feature of this system will be fault tolerance. Some of the input feeds are notoriously buggy, and I'd like my system to recover when some of the inputs are broken. In that case it would not be possible to update the output, so I'd like it to keep polling until the problem is resolved, possibly sending some XMPP messages to indicate the status of the system. Overall, the system should run without intervention for long periods of time.

    Users currently have a custom client that polls the CSV files, which (hopefully) will not need to be rewritten. If I do this job properly, they will not notice that the engine behind the system has been re-implemented.

    I'm not a Java developer (I mainly do Python), but the JVM is a requirement in this case. My manager has given me generous time to learn.

    What I want to know is how to begin architecting this kind of project. I'd like to make use of frameworks and good patterns where possible. Are there any big building blocks that might help me get a good-quality system running faster?

    UPDATE0: Nobody has mentioned Spring yet. Does this framework have a role to play in this kind of application?

      August 18, 2021 2:01 PM IST
  • You can use lots of big complex frameworks to "help" you do this. Learning these can be CV++.

    In your case I would suggest you try making the system as simple as possible. It will perform better and be easier to maintain (it's also more likely to work).

    So I would take each of the requirements and ask yourself: how simple can I make this? This is not about being lazy (you have to think harder) but about good practice, IMHO.
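
    As a rough illustration of how small the core can be, here is a minimal sketch that uses only the JDK. The names (`SimplePoller`, `runCycle`, `publish`, `start`) are made up for this example, each feed is modelled as a `Callable<String>`, and joining the feed results line by line is only a placeholder for the real calculation. It also assumes the output location usually supports atomic renames, which may not hold on every network share.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SimplePoller {

    /** One polling cycle: read every feed, calculate, publish. Returns false if anything failed. */
    static boolean runCycle(List<Callable<String>> feeds, Path output) {
        try {
            StringBuilder csv = new StringBuilder();
            for (Callable<String> feed : feeds) {
                // Placeholder for the real calculation: one output line per feed.
                csv.append(feed.call()).append('\n');
            }
            publish(output, csv.toString());
            return true;
        } catch (Exception e) {
            // A broken feed must not kill the loop; the last good output file is left untouched.
            System.err.println("Cycle failed, keeping last good output: " + e);
            return false;
        }
    }

    /** Write to a temp file in the same directory, then rename it over the target. */
    static void publish(Path output, String content) throws IOException {
        Path dir = output.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "out", ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        try {
            // Readers polling the CSV should never observe a half-written file.
            Files.move(tmp, output, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback for file systems without atomic rename (assumption: acceptable here).
            Files.move(tmp, output, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Run cycles forever, waiting periodSeconds between the end of one and the start of the next. */
    static ScheduledExecutorService start(List<Callable<String>> feeds, Path output, long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> runCycle(feeds, output), 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

    The one subtle point is that `runCycle` catches every exception itself. A `ScheduledExecutorService` stops rescheduling a task once a run throws, so letting a feed error escape would silently end the polling loop.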

      August 19, 2021 1:59 PM IST
  • Have a look at the Pentaho ETL tool or Talend OpenStudio.
    These tools provide access to files, databases and so on. You can write your own plugin or adapter if you need one. Talend generates Java code which you can compile and run.
      August 26, 2021 5:37 PM IST
  • There is a tool in the Java ecosystem that solves (almost) all integration problems.

    It is called Apache Camel (http://camel.apache.org/). It is built around the concepts of Consumers and Producers, with Enterprise Integration Patterns in between. It lets you configure fault tolerance and concurrent processing, and it supports periodic polling. It has components for XML, CSV and XMPP. It is easy to define time-triggered background jobs and to integrate with any messaging system you like for job queuing.

    If you wrote such a system from scratch, it would take weeks and weeks, and you would probably still miss some of the error conditions.
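
    To give a feel for the kind of error handling you would otherwise write by hand, here is a minimal retry-with-backoff sketch in plain Java. This is not Camel's API; `Retry` and `withBackoff` are hypothetical names for illustration, and the attempt count and delays are arbitrary.

```java
import java.util.concurrent.Callable;

public class Retry {

    /**
     * Calls the task up to maxAttempts times, doubling the wait between attempts.
     * Returns the first successful result, or rethrows the last failure.
     */
    static <T> T withBackoff(Callable<T> task, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread is being shut down.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);   // wait before the next attempt
                    delay *= 2;            // exponential backoff
                }
            }
        }
        throw last;
    }
}
```

    Even this small helper has to decide what counts as retryable, how to treat interruption and when to give up, and that is only one of the error conditions a polling system runs into.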

      September 20, 2021 1:18 PM IST