Implementation of a K Means Clustering to classify documents by language

Recently i’ve been interested in machine learning, and made some sample implementations to understand better the subject. In particular recently i’ve implemented a simplified version of the K-Means Clustering classifier and then I decided to apply this algorithm to a more practical task.

I’ve implemented the K-Means Clustering to classify some text by the language it’s written in. In brief, you can provide some different text to the algorithm, decide in how many groups you want them to be classified and then run the algorithm. It will partition the text depending on the different relative frequency of the letters in it, trying to recognize some structure in the different languages. It’s not perfect, but works, and there’s a live version to try here.

The larger the text is, the more the algorithm will be accurate. In fact, with short text the result will be quite random.