I am currently developing sentiment analysis software for my thesis. One of the problems I faced was finding out whether a sentence is in English, or at least whether it is written in Latin characters. Google offered several solutions, such as matching or replacing with ugly regular expressions. The truth is that I do not like such solutions, since I am not sure about the computational cost of regular expressions, especially on large text content, so I looked for a cleaner solution!
Finally I came up with the idea of checking whether the String can be encoded in a charset that contains only Latin characters, like ISO-8859-1 or US-ASCII! So here is my code:
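import java.nio.charset.Charset;

static final Charset latin = Charset.forName("US-ASCII"); // or "ISO-8859-1" for ISO Latin

// Returns true if every character of content can be encoded in the chosen charset.
public static boolean isLatin(String content) {
    return latin.newEncoder().canEncode(content);
}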
Note: a CharsetEncoder instance is not safe for use by multiple concurrent threads, so I call newEncoder() each time I need an encoder instead of sharing a single instance.
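To see how the choice of charset changes the result, here is a small self-contained sketch. The class name LatinCheck and the helper canEncodeIn are hypothetical names I made up for this illustration, and the sample strings are just examples; the check itself is the same newEncoder().canEncode(...) call used above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
class LatinCheck {

    // Hypothetical helper: true if every character of content
    // can be encoded in the given charset.
    public static boolean canEncodeIn(String content, Charset charset) {
        // A fresh encoder per call, since CharsetEncoder is not thread safe.
        return charset.newEncoder().canEncode(content);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        String[] samples = {
            "Hello world",                                      // plain ASCII
            "caf\u00e9",                                        // e with acute accent
            "\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"  // a Greek word
        };
        for (String s : samples) {
            boolean ascii = canEncodeIn(s, StandardCharsets.US_ASCII);
            boolean latin1 = canEncodeIn(s, StandardCharsets.ISO_8859_1);
            System.out.println(s + " -> US-ASCII: " + ascii + ", ISO-8859-1: " + latin1);
        }
    }
}
```

The plain ASCII sample passes both checks, the accented sample passes only ISO-8859-1, and the Greek sample fails both. Keep in mind that ISO-8859-1 does not contain every Latin letter: characters such as ő (U+0151) or ł (U+0142) are rejected, so the check really means "fits in this charset" rather than "is written in the Latin script".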