Obscene language detector


If you run a website with a community, you have to make sure people keep their language clean. Usually you hire staff (moderators, or "website police") to review users' posts, comments, and so on. But what do you do if the website is too big? With so many posts and messages, can you really review all of them?

The solution is to automate the moderation process. Let the robots decide what to do with each post. Our first robot, naturally, has to check a post or comment for obscene language. I'll run the detection on the client side (in the browser), so I'll use JavaScript.

First of all, we have to define which words count as bad. That's not so hard. Here we are:
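As a minimal sketch, the blacklist can be a plain array. The words below are mild placeholders chosen for illustration (only "ery" comes from this article); a real list would be your own.

```javascript
// Hypothetical blacklist: harmless placeholder words stand in for real obscenities.
// Everything is stored in lowercase so later comparisons can ignore case.
const blacklist = ["ery", "darn", "heck"];
```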

Then we have to check the message against every word in the blacklist. So we split the user's message into separate words and compare each one with the blacklist.
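The word-by-word check could look roughly like this. The function name `findBadWords` and the placeholder blacklist are my assumptions, and the split only understands plain Latin letters.

```javascript
// Hypothetical blacklist of placeholder words.
const blacklist = ["ery", "darn", "heck"];

// Split the message into words and return the ones found in the blacklist.
function findBadWords(message, words = blacklist) {
  const tokens = message
    .toLowerCase()
    .split(/[^a-z]+/) // assumption: anything that is not a-z separates words
    .filter(Boolean); // drop empty strings left by leading/trailing separators
  return tokens.filter((token) => words.includes(token));
}

console.log(findBadWords("Well, DARN it")); // [ 'darn' ]
console.log(findBadWords("a very nice day")); // []
```

Note that "very" passes here only because whole words are compared; that changes as soon as we start looking inside words.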

Hurray! We can detect bad words now. You implement this feature on your website and you are happy. But the next day you see that some sly user deliberately misspelled a bad word to get past the filter, and other people noticed. They decide to use the same trick to hurt someone's feelings. Oh, that's no good. How do we catch those misspellings?

Keep calm! I have an idea for that. Instead of a plain yes or no, we estimate how likely it is that a piece of text is a bad word, as a percentage. Sounds good, but how do we achieve that? Well, it isn't so simple, but I'll try to explain it.

First, take each bad word from your blacklist and look for it in the message. If you can't find it, cut one letter off the bad word and run the check again. If you still find nothing, cut again, and again... you get it, right? The more letters you have to cut away before something matches, the lower the probability.
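Here is one way to read that idea, as a sketch. I assume the letters are cut from the end of the bad word, and that the score is simply the share of the word that still matched; `matchScore` is a hypothetical name.

```javascript
// Returns a number between 0 and 1: the fraction of badWord (counted from its
// first letter) that appears somewhere in the text. 1 means an exact hit.
function matchScore(text, badWord) {
  const haystack = text.toLowerCase();
  // Start with the whole word, then cut one letter off the end each round.
  for (let len = badWord.length; len > 0; len--) {
    if (haystack.includes(badWord.slice(0, len))) {
      return len / badWord.length;
    }
  }
  return 0; // not even the first letter was found
}

console.log(matchScore("what the heck", "heck")); // 1
console.log(matchScore("what the hec", "heck")); // 0.75
console.log(matchScore("hello", "darn")); // 0
```

You would then pick a threshold (say 0.75, an arbitrary value) above which the message is flagged. Cutting only from the end catches a changed last letter but not a changed first one, so treat this as a starting point.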

Oh, and one more important note: have you noticed that users sometimes drop the space and glue two bad words together? So we have to check the message not only word by word but letter by letter, looking for any sequence of letters that resembles a bad word.
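A letter-by-letter scan might be sketched like this: at every position in the message we ask whether a bad word starts there. The name `scanLetters` and the word list are placeholders of mine.

```javascript
// Hypothetical blacklist of placeholder words.
const blacklist = ["darn", "heck"];

// Walk through the message one character at a time and record every
// position where a blacklisted word begins, ignoring word boundaries.
function scanLetters(message, words = blacklist) {
  const text = message.toLowerCase();
  const hits = [];
  for (let i = 0; i < text.length; i++) {
    for (const word of words) {
      if (text.startsWith(word, i)) {
        hits.push({ word, index: i });
      }
    }
  }
  return hits;
}

console.log(scanLetters("you darnheck"));
// [ { word: 'darn', index: 4 }, { word: 'heck', index: 8 } ]
```

Because word boundaries are ignored, two glued words are both found, but so is any bad word hiding inside a harmless one.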

Oh yes, another improvement: some nasty words can be part of an acceptable word. For example, our bad word "ery" is part of the normal word "very". In this case we use a whitelist to exclude those letter sequences from the check. Here is the complete example code. It is just a sample. I wouldn't recommend using it in production, but you can adapt it to your needs:
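The sketch below shows one possible way to combine the whitelist with the scan inside words. The word lists and function names are placeholders for illustration, and I assume whitelisted sequences are blanked out before the blacklist is checked.

```javascript
// Hypothetical lists: "ery" is the example bad word, the rest are placeholders.
const blacklist = ["ery", "darn"];
const whitelist = ["very", "every", "bakery"];

// Replace every whitelisted sequence with spaces of the same length, so its
// letters can no longer match and neighbouring words do not get glued together.
function maskWhitelisted(text, goodWords = whitelist) {
  let masked = text;
  for (const good of goodWords) {
    masked = masked.split(good).join(" ".repeat(good.length));
  }
  return masked;
}

// Return the blacklisted words that still appear anywhere in the masked text.
function detectBadWords(message, badWords = blacklist, goodWords = whitelist) {
  const text = maskWhitelisted(message.toLowerCase(), goodWords);
  return badWords.filter((bad) => text.includes(bad));
}

console.log(detectBadWords("Every bakery is very good")); // []
console.log(detectBadWords("that was ery rude")); // [ 'ery' ]
console.log(detectBadWords("a very darn day")); // [ 'darn' ]
```

The whitelist has to be maintained by hand, and every acceptable word that contains a bad sequence needs its own entry.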