How can we use unsupervised learning techniques on a data-set, and then label the clusters?

First up, this is most certainly homework (so no full code samples please). That said...

I need to test an unsupervised algorithm next to a supervised algorithm, using the Neural Network toolbox in Matlab. The data set is the UCI Artificial Characters Database. The problem is, I've had a good tutorial on supervised algorithms, and been left to sink on unsupervised.

So I know how to create a self-organising map using selforgmap, and then I train it using train(net, trainingSet). I don't understand what to do next. I know that it has clustered the data I gave it into (hopefully) 10 clusters (one for each letter).

Two questions then:

1. How can I then label the clusters (given that I have a comparison pattern)?
   - Am I trying to turn this into a supervised learning problem when I do this?
2. How can I create a confusion matrix on (another) testing set to compare to the supervised algorithm?

I think I'm missing something conceptual or jargon-based here - all my searches come up with supervised learning techniques. A point in the right direction would be much appreciated. My existing code is below:

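% load the training patterns and their targets
P = load('-ascii', 'pattern');
T = load('-ascii', 'target');

% data needs to be transposed (one column per sample)
P = P';
T = T';

% keep only the target rows (classes) that actually occur
T = T(find(sum(T')), :);

% 10-by-10 map, i.e. 100 neurons
mynet = selforgmap([10 10]);
mynet.trainParam.epochs = 5000;
mynet = train(mynet, P);

% load and prepare the test set in the same way
P = load('-ascii', 'testpattern');
T = load('-ascii', 'testtarget');

P = P';
T = T';
T = T(find(sum(T')), :);

% winning neuron for each test sample, as one-hot columns
Y = sim(mynet,P);
Z = compet(Y);

% this gives me a confusion matrix for supervised techniques:
C = T*Z'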

September 23, 2021 3:16 PM IST

Viaan Prakash

Could this video be of any help? It doesn't answer your question, but it shows that human interaction may be required even to select the number of clusters. Automatically labeling the clusters is even harder.

If you think about it, there's no guarantee that the clustering will be based on the depicted number. The network might group digits by the width of the line, by the smoothing of the font, etc.
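
If some labelled samples are available, one way to check whether the clusters follow the classes is to count how often each class lands in each cluster and then name every cluster after its most frequent class. The sketch below is only an illustration on made-up toy data: the variable names (classIdx, clusterIdx, testClass, testCluster) are hypothetical, and each cluster index stands for whatever the winning neuron of a trained map would be. It uses base functions only.

```matlab
% Toy data (assumed for illustration): true class and winning cluster
% for each of 8 labelled training samples.
classIdx   = [1 1 1 2 2 2 3 3];
clusterIdx = [2 2 3 1 1 1 3 3];

% counts(i, j) = number of samples of class i that landed in cluster j
counts = accumarray([classIdx(:) clusterIdx(:)], 1);

% Majority vote: name each cluster after its most frequent class
[votes, clusterLabel] = max(counts, [], 1);

% Purity: share of samples that agree with their cluster's label
purity = sum(votes) / numel(classIdx);

% Held-out samples (also made up): translate each winning cluster into
% its label, then tabulate true class (rows) against predicted class (columns)
testClass   = [1 2 3 3];
testCluster = [2 1 3 2];
predicted   = clusterLabel(testCluster);
confusion   = accumarray([testClass(:) predicted(:)], 1, [3 3]);

disp(counts);
disp(clusterLabel);
disp(purity);
disp(confusion);
```

On this toy data clusterLabel comes out as [2 1 3] and purity as 0.875. A cluster that received no labelled samples would silently get label 1 from max, so empty clusters need separate handling. A low purity is a hint that the clustering picked up something other than the class, as described above.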

September 27, 2021 2:02 PM IST

Maryam Bains

You can run your clustering method on the data with the labels removed and then check its efficiency by counting how many samples labeled with the same class went to the same clusters. The catch here is that you cannot use precision, recall, etc., the metrics that you usually use to check the efficiency of classification. The most common metrics for clustering evaluation are Rand, Jaccard, and B-cubed. Here, in paragraph "5.3 Evaluating clusters", I suggest using the F-measure:

Preprint A Linguistic Model of Classifying Community Pages in a Socia...

You can see in the formula how different it is from the F-measure used for the analysis of classification. Importantly, you will not be able to compare the efficiency of your clustering method with that of classification methods.

But if by classification you don't mean a Machine Learning method, but just that you want to use your clusters as a basis for terminological research (for example, that these clusters relate to some expert-defined classes of disorders), then you need to compare your own clustering method to other clustering (sic!) methods, not ML classification ones.
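
As a rough illustration of the pair-counting idea behind the Rand and Jaccard indices mentioned above, the sketch below compares a clustering with known class labels without ever naming the clusters. The toy vectors classIdx and clusterIdx are made up for the example, and the formulas are the standard textbook definitions, not the F-measure from the linked preprint.

```matlab
% Toy data (assumed for illustration): true class and cluster per sample
classIdx   = [1 1 1 2 2 2 3 3];
clusterIdx = [2 2 3 1 1 1 3 3];
n = numel(classIdx);

% n-by-n logical masks: do samples i and j share a class / a cluster?
sameClass   = bsxfun(@eq, classIdx(:), classIdx(:)');
sameCluster = bsxfun(@eq, clusterIdx(:), clusterIdx(:)');

% Count each unordered pair once (strict upper triangle)
pairs = triu(true(n), 1);

a = sum( sameClass(pairs) &  sameCluster(pairs));  % together in both
b = sum( sameClass(pairs) & ~sameCluster(pairs));  % same class, split up
c = sum(~sameClass(pairs) &  sameCluster(pairs));  % different classes, merged
d = sum(~sameClass(pairs) & ~sameCluster(pairs));  % apart in both

randIndex    = (a + d) / (a + b + c + d);
jaccardIndex = a / (a + b + c);

disp([a b c d]);
disp(randIndex);
disp(jaccardIndex);
```

Both indices lie between 0 and 1, with 1 meaning that the clustering and the class labels group the samples identically. Because only pairs are compared, the numbering of the clusters does not matter.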

October 6, 2021 3:13 PM IST
