How can we use unsupervised learning techniques on a data-set, and then label the clusters?

  • First up, this is most certainly homework (so no full code samples please). That said...

    I need to test an unsupervised algorithm next to a supervised algorithm, using the Neural Network toolbox in Matlab. The data set is the UCI Artificial Characters Database. The problem is, I've had a good tutorial on supervised algorithms, and been left to sink on unsupervised.

    So I know how to create a self-organising map using selforgmap, and then I train it using train(net, trainingSet). I don't understand what to do next. I know that it's clustered the data I gave it into (hopefully) 10 clusters (one for each letter).

    Two questions then:

    • How can I then label the clusters (given that I have a comparison pattern)?
      • Am I trying to turn this into a supervised learning problem when I do this?
    • How can I create a confusion matrix on (another) testing set to compare to the supervised algorithm?

    I think I'm missing something conceptual or jargon-based here - all my searches come up with supervised learning techniques. A point in the right direction would be much appreciated. My existing code is below:

    P = load('-ascii', 'pattern');
    T = load('-ascii', 'target');
    
    % data needs to be transposed
    P = P';
    T = T';
    
    T = T(find(sum(T')), :);
    
    mynet = selforgmap([10 10]);
    mynet.trainparam.epochs = 5000;
    mynet = train(mynet, P);
    
    
    P = load('-ascii', 'testpattern');
    T = load('-ascii', 'testtarget');
    
    P = P';
    T = T';
    T = T(find(sum(T')), :);
    
    Y = sim(mynet,P);
    Z = compet(Y);
    
    % this gives me a confusion matrix for supervised techniques:
    C = T*Z'
    
     
      September 23, 2021 3:16 PM IST
  • Since you ask this very basic question, it looks like it's worth specifying what Machine Learning itself is.
    Machine Learning is a class of algorithms that are data-driven, i.e. unlike "normal" algorithms, it is the data that "tells" what the "good answer" is. Example: a hypothetical non-machine-learning algorithm for face detection in images would try to define what a face is (a round, skin-colored disk with dark areas where you expect the eyes, etc.). A machine learning algorithm would not have such a coded definition, but would "learn by example": you show it several images of faces and non-faces, and a good algorithm will eventually learn to predict whether or not an unseen image is a face.
    This particular example of face detection is supervised, which means that your examples must be labeled, i.e. they must explicitly say which ones are faces and which ones aren't.
    In an unsupervised algorithm your examples are not labeled, i.e. you don't say anything. Of course, in such a case the algorithm itself cannot "invent" what a face is, but it can try to cluster the data into different groups, e.g. it can distinguish that faces are very different from landscapes, which are very different from horses.
    Since another answer mentions it (though in an incorrect way): there are "intermediate" forms of supervision, i.e. semi-supervised and active learning. Technically, these are supervised methods in which there is some "smart" way to avoid needing a large number of labeled examples. In active learning, the algorithm itself decides which examples you should label (e.g. it can be pretty sure about a landscape and a horse, but it might ask you to confirm whether a picture of a gorilla is a face). In semi-supervised learning, a small labeled set is combined with a large amount of unlabeled data. In co-training, for example, two different algorithms start with the labeled examples and then "tell" each other what they think about a large number of unlabeled examples. From this "discussion" they learn.
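    For the SOM in the question, one common way to put names on clusters is a majority vote: after unsupervised training, each neuron takes the most frequent class among the labeled samples it wins. The sketch below is an illustration under assumptions, not code from this thread. The vectors winners and classes are hypothetical; with the question's variables they could plausibly be obtained as vec2ind(Z) and vec2ind(T), provided every column of Z and T is one-hot. The toy values are made up.

    ```matlab
    % Toy stand-ins (assumed data, not from the question):
    % winners(i) = index of the winning SOM neuron for training sample i
    % classes(i) = known class index of training sample i
    winners = [1 1 1 2 2 3 3 3 3];
    classes = [1 1 2 2 2 3 3 3 1];
    numNeurons = 4;   % neuron 4 wins no sample in this toy set

    % Label each neuron with the most frequent class among its samples.
    neuronLabel = zeros(1, numNeurons);   % 0 marks "no label"
    for k = 1:numNeurons
        members = classes(winners == k);
        if ~isempty(members)
            neuronLabel(k) = mode(members);
        end
    end

    disp(neuronLabel)   % neuronLabel is [1 2 3 0]
    ```

    The labels are consulted only after training, so the clustering itself stays unsupervised; only the naming step uses supervision.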
      September 24, 2021 12:22 PM IST
  • Could this video be of any help? It doesn't answer your question, but it shows that human interaction may be required even to select the number of clusters. Automatically labeling clusters is even harder.

    If you think about it, there's no guarantee that clustering will be done based on the depicted number. The network might group digits based on the width of the line, the smoothing of the font, etc.
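
    One way to check whether the clusters follow the depicted class at all is to cross-tabulate cluster membership against the known labels. The following is a minimal sketch on made-up toy data (the variable names clusters and classes are assumptions, not from this thread); it builds a contingency table and computes the cluster purity.

    ```matlab
    % Toy stand-ins (assumed data): cluster index and true class per sample.
    clusters = [1 1 1 1 2 2 2 2];
    classes  = [1 1 2 2 1 1 2 2];

    % counts(c, k) = number of samples of class c that fell into cluster k.
    counts = accumarray([classes(:) clusters(:)], 1);

    % Purity: fraction of samples that belong to their cluster's majority class.
    purity = sum(max(counts, [], 1)) / numel(clusters);

    disp(counts)   % [2 2; 2 2]
    disp(purity)   % 0.5
    ```

    Here both clusters contain an even mix of the two classes, so the purity is 0.5. That is the pattern you would expect if the grouping followed some other property, such as line width, rather than the class.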

      September 27, 2021 2:02 PM IST
  • Unsupervised models are used when the outcome (or class label) of each sample is not available in your data. If you want to use your method to perform a classification task, you need those labels in order to assess how good the method is. If this is the case, i.e. class labels are available, I recommend that you test and compare your method with other well-known supervised machine learning models.
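    When labels are available, a clustering can be scored like a classifier once each cluster has been mapped to a class. The sketch below assumes a hypothetical vector neuronLabel holding that mapping (for instance from a majority vote over labeled training samples) and made-up test data; none of these names or values come from the thread.

    ```matlab
    % Assumed inputs (toy values, not from the question):
    neuronLabel = [1 2 3 3];        % class assigned to each cluster/neuron
    testWinners = [1 2 2 3 4 1];    % winning neuron per test sample
    testClasses = [1 2 1 3 3 2];    % true class per test sample
    numClasses  = 3;

    % Predict by looking up the label of each test sample's winning neuron.
    % This assumes every winning neuron has a label in 1..numClasses.
    predicted = neuronLabel(testWinners);

    % C(i, j) = number of test samples of true class i predicted as class j.
    C = accumarray([testClasses(:) predicted(:)], 1, [numClasses numClasses]);
    accuracy = trace(C) / numel(testClasses);

    disp(C)          % [1 1 0; 1 1 0; 0 0 2]
    disp(accuracy)   % 4/6, about 0.6667
    ```

    The resulting matrix has the same class-by-class shape as the confusion matrix of a supervised model evaluated on the same test set, so the two can be read side by side.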
      October 5, 2021 1:20 PM IST
  • You can use your clustering method on data with the labels removed and then check its efficiency by counting how many samples labeled with the same class went to the same clusters. The trick here is that you cannot use the precision, recall, etc. metrics that you usually use to check the efficiency of classification. The most common metrics for clustering evaluation are the Rand index, the Jaccard index, and B-cubed. Here in paragraph "5.3 Evaluating clusters" I suggest using the F-measure:

    You can see in the formula how different it is from the F-measure used for the analysis of classification. Importantly, you will not be able to compare the efficiency of your clustering method to that of classification methods.
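
    The pair-counting metrics named above can be sketched in a few lines. This is an illustration on made-up toy data (variable names are assumptions), using the standard definitions: the Rand index is the fraction of sample pairs on which the clustering and the labels agree, and the Jaccard index ignores pairs that are apart in both.

    ```matlab
    % Toy stand-ins (assumed data).
    clusters = [1 1 2 2 3];
    classes  = [1 1 1 2 2];

    n = numel(clusters);
    sameCluster = bsxfun(@eq, clusters(:), clusters(:)');   % n-by-n logical
    sameClass   = bsxfun(@eq, classes(:),  classes(:)');
    pairMask    = triu(true(n), 1);                          % each unordered pair once

    a = nnz( sameCluster &  sameClass & pairMask);   % together in both
    b = nnz( sameCluster & ~sameClass & pairMask);   % same cluster, different class
    c = nnz(~sameCluster &  sameClass & pairMask);   % same class, different cluster
    d = nnz(~sameCluster & ~sameClass & pairMask);   % apart in both

    randIndex = (a + d) / (a + b + c + d);
    jaccard   = a / (a + b + c);

    disp([a b c d])     % [1 1 3 5]
    disp(randIndex)     % 0.6
    disp(jaccard)       % 0.2
    ```

    Neither metric needs a mapping from clusters to class names, which is why they work even when the number of clusters differs from the number of classes.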


    But if by classification you don't mean a Machine Learning method, but just that you want to use your clusters as a basis for terminological research - for example, that these clusters relate to some expert-defined classes of disorders, then you need to compare your own clustering method to other clustering (sic!) methods, not ML classification ones.
      October 6, 2021 3:13 PM IST