Language detection in R with the textcat package: how to restrict to a few languages?

Tags : nlp R

Viaan Prakash

461

I need to detect the language of many short texts using R. I am using the textcat package, which finds which of many (say 30) European languages each text is written in. However, I know my texts are either French or English (or, more generally, a small subset of the languages handled by textcat).
How can I pass this knowledge to the textcat functions?
Thanks,

August 5, 2020 11:55 AM IST

0
Samar Patil

346 3
This might work. Presumably you wish to restrict the language choices to English or French to reduce the misclassification rate. Without example text for which the desired result is known I cannot test the approach below. However, it does seem to restrict the language choices to English and French.
```
library(textcat)

# Keep only the English and French profiles from the byte n-gram profile set
my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french")]
my.profiles

my.text <- c("This is an English sentence.",
             "Das ist ein deutscher Satz.",
             "Il s'agit d'une phrase française.",
             "Esta es una frase en espa~nol.")

# Classify against the restricted profiles only
textcat(my.text, p = my.profiles)

# [1] "english" "english" "french"  "french"
```
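The subsetting step above can be wrapped in a small helper that fails early when a requested language has no profile. This is only a sketch: `restrict_profiles` is a hypothetical name, not a function from textcat, and the textcat call is guarded so the block still runs if the package is not installed.

```r
# Hypothetical helper (not part of textcat): keep only the named profiles,
# and stop with an error if a requested language is missing from the set.
restrict_profiles <- function(profiles, languages) {
  unknown <- setdiff(languages, names(profiles))
  if (length(unknown) > 0) {
    stop("No profile for: ", paste(unknown, collapse = ", "))
  }
  # Same logical subsetting as in the answer above
  profiles[names(profiles) %in% languages]
}

# Usage with textcat, assuming the package is installed
if (requireNamespace("textcat", quietly = TRUE)) {
  library(textcat)
  en_fr <- restrict_profiles(TC_byte_profiles, c("english", "french"))
  print(names(en_fr))
  print(textcat(c("This is an English sentence.",
                  "Il s'agit d'une phrase française."),
                p = en_fr))
}
```

Checking the names up front avoids silently ending up with an empty or partial profile set when a language name is misspelled.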
August 5, 2020 11:59 AM IST

0
Shivakumar Kota

102 9
The textcat package does this. It can detect 74 'languages' (more properly, language/encoding combinations), more with other extensions. Details and examples are in this freely available article:

Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52(6), 1-17.

Here's one of their examples:
```
library("textcat")
textcat(c(
  "This is an English sentence.",
  "Das ist ein deutscher Satz.",
  "Esta es una frase en espa~nol."))
# [1] "english" "german"  "spanish"
```
September 15, 2020 3:40 PM IST

0

Pranav B

106 5

Try http://cran.r-project.org/web/packages/cldr/ which brings Google Chrome's language detection to R.

```
# install from archive
url <- "http://cran.us.r-project.org/src/contrib/Archive/cldr/cldr_1.1.0.tar.gz"
pkgFile <- "cldr_1.1.0.tar.gz"
download.file(url = url, destfile = pkgFile)
install.packages(pkgs = pkgFile, type = "source", repos = NULL)
unlink(pkgFile)
# or devtools::install_version("cldr", version = "1.1.0")

# usage
library(cldr)
demo(cldr)
```

September 15, 2020 3:41 PM IST

Nitara Bobal

53

There is also an R package called "franc" that works pretty well. Although it is slower than the others, I had a better experience with it than with cld2 and especially cld3.

September 15, 2020 3:43 PM IST

0
