Language detection in R with the textcat package : how to restrict to a few languages?

  • I need to detect the language of many short texts in R. I am using the textcat package, which determines which of many (say 30) European languages each text is written in. However, I know my texts are either French or English (or, more generally, a small subset of the languages handled by textcat).
    How can I pass this knowledge to the textcat functions?
    Thanks,

      August 5, 2020 11:55 AM IST
    0
  • This might work. Presumably you wish to restrict the language choices to English or French to reduce the misclassification rate. Without example text for which the desired result is known I cannot test the approach below. However, it does seem to restrict the language choices to English and French.
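    library(textcat)
    
    # keep only the English and French profiles from the built-in profile database
    my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french")]
    my.profiles
    
    my.text <- c("This is an English sentence.",
                 "Das ist ein deutscher Satz.",
                 "Il s'agit d'une phrase française.",
                 "Esta es una frase en espa~nol.")
    
    # every text is now assigned to one of the two remaining languages
    textcat(my.text, p = my.profiles)
    
    # [1] "english" "english" "french"  "french"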
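    
    To apply the same restriction to many short texts at once, the idea above can be sketched on a data frame column. This is only an illustration: the data frame docs, its columns, and the third example sentence are made up, and the name en.fr.profiles is arbitrary.
    
    ```r
    library(textcat)
    
    # Same subsetting as above: keep only the English and French profiles.
    keep <- names(TC_byte_profiles) %in% c("english", "french")
    en.fr.profiles <- TC_byte_profiles[keep]
    
    # Hypothetical data; replace with your own texts.
    docs <- data.frame(
      id = 1:3,
      text = c("This is an English sentence.",
               "Il s'agit d'une phrase française.",
               "Where is the nearest train station?"),
      stringsAsFactors = FALSE
    )
    
    # One detected language per row, chosen only among the kept profiles.
    docs$lang <- textcat(docs$text, p = en.fr.profiles)
    
    # Quick tally; useNA shows any texts that could not be classified.
    table(docs$lang, useNA = "ifany")
    ```
    
    Note that a restricted profile set forces every text into one of the kept languages, so texts in any other language (like the German and Spanish sentences above) will still be labelled English or French.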
      August 5, 2020 11:59 AM IST
    0
  • The textcat package does this. It can detect 74 'languages' (more properly, language/encoding combinations), more with other extensions. Details and examples are in this freely available article:

    Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52(6), 1-17.

    Here is one of their examples:

    library("textcat")
    textcat(c(
      "This is an English sentence.",
      "Das ist ein deutscher Satz.",
      "Esta es una frase en espa~nol."))
    # [1] "english" "german" "spanish"
      September 15, 2020 3:40 PM IST
    0
  • Try http://cran.r-project.org/web/packages/cldr/ which brings Google Chrome's language detection to R.

    # install from the CRAN archive
    url <- "http://cran.us.r-project.org/src/contrib/Archive/cldr/cldr_1.1.0.tar.gz"
    pkgFile <- "cldr_1.1.0.tar.gz"
    download.file(url = url, destfile = pkgFile)
    install.packages(pkgs = pkgFile, type = "source", repos = NULL)
    unlink(pkgFile)
    # or devtools::install_version("cldr", version = "1.1.0")
    
    # usage
    library(cldr)
    demo(cldr)
      September 15, 2020 3:41 PM IST
    0
  • There is also an R package called "franc" that works pretty well. Although it is slower than the others, I had a better experience with it than with cld2, and especially cld3.
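    
    For the original question (restricting detection to French and English), a minimal sketch with franc might look like the following. It assumes that franc() accepts a whitelist argument of ISO 639-3 codes and classifies one string per call, as I understand its documentation; check ?franc before relying on it. The example sentences are made up.
    
    ```r
    library(franc)
    
    # Hypothetical example texts; very short strings may come back as "und"
    # (undetermined) rather than a language code.
    texts <- c("This is an English sentence, long enough to classify.",
               "Il s'agit d'une phrase française assez longue.")
    
    # Assumption: franc() handles one string at a time, so loop with vapply().
    # "eng" and "fra" are the ISO 639-3 codes for English and French.
    langs <- vapply(texts, franc, character(1),
                    whitelist = c("eng", "fra"), USE.NAMES = FALSE)
    langs
    ```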
      September 15, 2020 3:43 PM IST
    0