
Are there any good resources/best-practices to "industrialize" code in R for a data science project?

  • I need to "industrialize" R code for a data science project, because the project will be rerun several times in the future with fresh data. The new code should be easy to follow even for people who have not worked on the project before, and they should be able to redo the whole workflow quickly. I am therefore looking for tips, suggestions, resources and best practices on how to achieve this.

    Thank you for your help in advance!

      January 8, 2022 3:19 PM IST
  • It is easy to get lost among the many files in a project's folder, so the folder should be structured properly: link

    Naming conventions that I use: firstsecond.

    Set the random seed so that the outputs are reproducible. Documentation is important: you can insert a Roxygen skeleton in RStudio (default shortcut Ctrl+Alt+Shift+R).
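
    As a minimal sketch of both points, the function below carries roxygen2-style comments and takes the seed as an argument, so a rerun with the same data and the same seed gives the same split. The name split_train_test and its parameters are hypothetical, chosen only for illustration.

```r
#' Split a data frame into training and test sets
#'
#' Hypothetical helper, for illustration only.
#'
#' @param data A data frame.
#' @param train_fraction Fraction of rows assigned to the training set.
#' @param seed Integer passed to set.seed() so the split is reproducible.
#' @return A list with elements `train` and `test`.
split_train_test <- function(data, train_fraction = 0.8, seed = 42) {
  set.seed(seed)  # fix the random number generator state
  all_rows <- seq_len(nrow(data))
  n_train <- floor(train_fraction * nrow(data))
  train_rows <- sample(all_rows, size = n_train)
  test_rows <- setdiff(all_rows, train_rows)
  list(
    train = data[train_rows, , drop = FALSE],
    test  = data[test_rows, , drop = FALSE]
  )
}

# Two runs with the same seed select the same rows
first  <- split_train_test(mtcars, seed = 1)
second <- split_train_test(mtcars, seed = 1)
identical(first$train, second$train)
```

    Passing the seed as an argument, rather than calling set.seed() once at the top of a script, keeps each step reproducible even when it is run on its own.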

    I usually separate the code into smaller, logically cohesive scripts, and use a main.R script that calls the others.
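
    One way such a main.R could look, as a sketch only: it lists the step scripts in order and sources each one into a shared environment. The function name run_pipeline, the folder name and the file names are all hypothetical; the demo at the end writes two throwaway scripts to a temporary folder so the example runs on its own.

```r
# main.R: entry point that runs every step of the workflow in order.
# Folder and file names are hypothetical; adapt them to your project.

run_pipeline <- function(script_dir, steps) {
  paths <- file.path(script_dir, steps)
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    stop("Missing script(s): ", paste(not_found, collapse = ", "))
  }
  env <- new.env(parent = globalenv())  # shared workspace for all steps
  for (path in paths) {
    message("Running ", path)
    source(path, local = env)  # objects created by a step stay in env
  }
  invisible(env)
}

# Demo with two throwaway scripts written to a temporary folder
demo_dir <- file.path(tempdir(), "pipeline_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("x <- 1:10", file.path(demo_dir, "01_load.R"))
writeLines("total <- sum(x)", file.path(demo_dir, "02_summarise.R"))

result <- run_pipeline(demo_dir, c("01_load.R", "02_summarise.R"))
result$total
```

    Numbering the file names makes the execution order visible in the folder listing, which helps people who have not worked on the project before.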

    If the project depends on a specific set of packages, consider using packrat. Once it is set up, it manages the packages installed in a project-specific library.

      January 10, 2022 12:27 PM IST
  • You can make an R package out of your project, because a package has everything you need for a standalone project that you want to share with others:

    1. Easy to share, download and install
    2. R has a very efficient documentation system for your functions and objects when you work within RStudio. Combined with roxygen2, it lets you document every function precisely, and it makes the code clearer because you need fewer inline comments (but do add them where needed).
    3. You can specify quite easily which dependencies your package needs, so that everyone knows what to install for your project to work. You can also use packrat if you want to mimic Python's virtualenv.
    4. R also provides a long-form documentation system called vignettes, which are similar to a printed notebook: they can display code, text, code results, etc. This is where you write guidelines on how to use the functions, provide detailed instructions for a certain method, etc. Once the package is installed, they are automatically included and available to all users.
    5. The only downside is the following: since R is a functional programming language, a package consists mainly of functions and some other relevant objects (data, for instance), but not really scripts.

    More details about the last point: if your project consists of a script that calls a set of functions to do something, the script cannot appear directly within the package. There are two options here: a) you write a dispatcher function that runs the set of functions to do the job, so that users just have to call one function to run the whole method (not really good for maintenance); b) you make the whole script appear in a vignette (see above). With this method, people just have to write a single R file (which can be copy-pasted from the vignette), which may look like this:

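    library(mydatascienceproject)  # the package built from the project
    library(...)                   # any other packages the script needs
    ...
    dothis()                       # run the steps in order
    dothat()
    finishwork()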

    That enables you to run the whole job from a terminal or a remote machine with Rscript, as follows (using argparse to add arguments):

    Rscript myautomatedtask.R --arg1 anargument --arg2 anotherargument
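
    To give an idea of how the script can read those arguments, here is a minimal sketch that uses base R's commandArgs() rather than the argparse package mentioned above, so it runs without extra dependencies. The function name parse_args is hypothetical, and it assumes every argument comes as a "--name value" pair.

```r
# Sketch of reading "--name value" pairs in base R (no argparse needed).
# parse_args is a hypothetical helper, not part of any package.

parse_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) %% 2 != 0) {
    stop("Expected arguments as '--name value' pairs")
  }
  idx <- seq_along(args)
  flags  <- args[idx %% 2 == 1]  # odd positions: --name
  values <- args[idx %% 2 == 0]  # even positions: value
  if (!all(startsWith(flags, "--"))) {
    stop("Argument names must start with '--'")
  }
  # Return a named list, e.g. list(arg1 = "anargument")
  setNames(as.list(values), sub("^--", "", flags))
}

# Inside myautomatedtask.R you would call: opts <- parse_args()
# Here the arguments are passed explicitly so the example runs anywhere.
opts <- parse_args(c("--arg1", "anargument", "--arg2", "anotherargument"))
opts$arg1
opts$arg2
```

    All values arrive as character strings, so convert them (for example with as.numeric()) before use. argparse adds type conversion, defaults and a help message on top of this.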
    

    And finally, if you write a bash file that calls Rscript, you can automate everything!

    Feel free to read Hadley Wickham's book about R packages: it is super clear, full of best practices, and a great help when writing your packages.

     

      January 14, 2022 2:10 PM IST