
Language Detection


Model Overview

Introduction:


Predicting the natural language of a text is often an important step in a Natural Language Processing (NLP) pipeline. For use cases like translation or sentiment analysis, it helps to know the language of the input text first. For example, when you go to Google Translate, it detects the language of the text before translating it.




There are various approaches to language identification. Here we use character n-grams as features, train several machine learning algorithms, and choose the best one. In the end, we show that an accuracy of over 95% can be achieved with this approach.


Dataset:


Dataset link: https://www.kaggle.com/zarajamshaid/language-identification-datasst


WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs covering 235 languages, with 1,000 rows/paragraphs per language.


After data selection and preprocessing, I kept 22 selected languages from the original dataset, which include the following:



The resulting CSV file has 2 columns and contains 22 labels with 1,000 rows per label, for a total of 22,000 rows.


Model used:


Bag of Words (BoW) is used for feature extraction. BoW builds a vocabulary of all the unique words occurring across the documents in the training set, and represents each article/paragraph/text as a feature vector of counts over that vocabulary; these vectors are then used to train the machine learning algorithms. The CountVectorizer is used for the n-gram approach, where n controls how many words (or characters) are taken together as a single token during classification. Training is done on the following machine learning algorithms:



  • Random Forest Classifier

  • Logistic Regression


Multiple models are evaluated, from uni-gram up to 10-gram word/char features, each fitted with the machine learning algorithms above.
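The feature extraction and training steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the actual training script: the toy corpus and its labels stand in for the WiLI-2018 data, and character bigrams stand in for the full uni-gram-to-10-gram sweep.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the 22-language, 22,000-row dataset (illustrative only).
texts = [
    "this is a sentence written in english",
    "the quick brown fox jumps over the lazy dog",
    "ceci est une phrase écrite en français",
    "le renard brun saute par dessus le chien",
    "esta es una frase escrita en español",
    "el zorro marrón salta sobre el perro",
]
labels = ["English", "English", "French", "French", "Spanish", "Spanish"]

# Bag-of-words over character n-grams: analyzer="char" makes CountVectorizer
# count character sequences (here 1- and 2-grams) instead of whole words.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# Fit the same feature matrix with both algorithms and compare them.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.score(X, labels))
```

In the real experiment, each (n-gram size, algorithm) pair would be scored on a held-out test split rather than the training data shown here.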


Results:


The accuracy scores on the test dataset for all the models created are as follows:



After analysis, the final model chosen is a uni-gram model, which is used for predictions on the random text entered by users. The accuracy achieved by this model is 95%.
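The final prediction step can be sketched as a single pipeline that takes raw user text and returns a language label. This is a hedged illustration, not the deployed model: it assumes character uni-grams (the report does not say whether the final uni-gram model is word- or character-level) and again uses a toy corpus in place of the 22-language dataset.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the 22-language training data (illustrative only).
texts = [
    "this is a sentence written in english",
    "the quick brown fox jumps over the lazy dog",
    "ceci est une phrase écrite en français",
    "le renard brun saute par dessus le chien",
    "esta es una frase escrita en español",
    "el zorro marrón salta sobre el perro",
]
labels = ["English", "English", "French", "French", "Spanish", "Spanish"]

# Uni-gram bag-of-words plus classifier in one pipeline, so raw
# user-entered text can be classified without a separate vectorize step.
pipeline = Pipeline([
    ("bow", CountVectorizer(analyzer="char", ngram_range=(1, 1))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

def detect_language(user_text: str) -> str:
    """Predict the language of arbitrary user-entered text."""
    return pipeline.predict([user_text])[0]
```

Wrapping the vectorizer and classifier together also guarantees that user input at prediction time goes through exactly the same feature extraction as the training data.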


 NOTE: The greater the length of the input text, the better the accuracy.
 

