Sunday, 19 August 2012

How to find out whether a string is in Latin!

I am currently developing sentiment analysis software for my thesis. One of the problems I faced was finding out whether a sentence is in English, or at least whether it is written in Latin characters. Google offered me several solutions, such as replacing characters using ugly regular expressions. The truth is that I do not like such solutions, since I am not sure about the computational cost of regular expressions, especially on large text content, so I tried to find a cleaner solution!
Finally I came up with the idea of checking whether the String can be encoded in a charset that contains only Latin characters, like ISO-8859-1 or US-ASCII! So here is my code:

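    import java.nio.charset.Charset;
    // US-ASCII covers only unaccented Latin; use "ISO-8859-1" for ISO Latin
    static final Charset latin = Charset.forName("US-ASCII");
    // Returns true if every character of content can be encoded in the charset above
    public static boolean isLatin(String content) {
        return latin.newEncoder().canEncode(content);
    }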
Note: newEncoder() returns a new CharsetEncoder on every call, and the Javadoc says CharsetEncoder instances are not safe for use by multiple concurrent threads, so I prefer to call it each time I need an encoder instead of sharing one.
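
To see how the choice of charset changes the result, here is a small sketch that runs the same check against both US-ASCII and ISO-8859-1. The class name `LatinCheck`, the helper `canEncode` and the sample strings are only illustrative assumptions, not part of my thesis code; non-ASCII characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class LatinCheck {

    // Same idea as isLatin, but with the charset passed in as a parameter.
    // A fresh encoder is created on every call, so nothing is shared between threads.
    static boolean canEncode(String content, Charset charset) {
        return charset.newEncoder().canEncode(content);
    }

    public static void main(String[] args) {
        String english = "Hello world";
        String french = "caf\u00e9";                                 // "cafe" with an accented e
        String greek = "\u039a\u03c5\u03c1\u03b9\u03b1\u03ba\u03ae"; // the Greek word for Sunday

        String[] samples = { english, french, greek };
        for (String s : samples) {
            System.out.println(
                "ascii=" + canEncode(s, StandardCharsets.US_ASCII)
                + " latin1=" + canEncode(s, StandardCharsets.ISO_8859_1));
        }
    }
}
```

With US-ASCII the accented French word is rejected, while ISO-8859-1 accepts it; the Greek word is rejected by both. So US-ASCII is the stricter "looks like plain English" test, and ISO-8859-1 is closer to "written in a Western European Latin script".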
