Obscene language detector


If you run a website with a community, you have to watch what people post to keep their language clean. Usually you hire staff (moderators, or "website police") to review users' posts, comments, etc. But what are you going to do if the website is too big? There are so many posts and messages; how can you check all of them?

The solution is to automate the moderation process. Let the robots decide what to do with each post. So, obviously, our first robot has to check a post or comment for obscene language. I'll do the detection on the client side (in the browser), so I have to use JavaScript.

First of all, we have to define which words are bad. That's not so hard. Here they are:
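A minimal sketch of such a black list. The entries are harmless placeholders chosen for illustration (only "ery" comes from this article), and the name `blackList` is an assumption too:

```javascript
// Black list of forbidden words, stored in lower case.
// Placeholder entries for illustration only; put your real list here.
const blackList = ["ery", "darn", "heck"];
```

Keeping every entry in lower case means the message only has to be lower-cased once before comparing.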

Then we have to check the message against every word in the black list. So we split the user's message into separate words and compare each one with the black list.
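The word-by-word check could look roughly like this. It is only a sketch: the function name `findBadWords` and the placeholder list are assumptions, and the split pattern only understands ASCII letters and digits.

```javascript
// Placeholder black list (assumption for illustration).
const blackList = ["ery", "darn", "heck"];

// Returns every word of the message that appears in the black list.
function findBadWords(message, list) {
  const words = message
    .toLowerCase()
    .split(/[^a-z0-9]+/) // cut on anything that is not an ASCII letter or digit
    .filter(Boolean);    // drop empty strings left at the edges
  return words.filter((word) => list.includes(word));
}

console.log(findBadWords("Well, darn it!", blackList)); // [ 'darn' ]
console.log(findBadWords("Very nice", blackList));      // []
```

Note that "very" passes here because whole words are compared, not parts of words.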

Hurray! We can detect bad words now. You've implemented this feature on your website and now you are happy. But the next day you see that some sly user deliberately misspelled a bad word, and other people saw it. They decide to use the same trick to hurt someone's feelings. Oh, that's no good. How do we catch those misspellings?

Keep calm! I have an idea for that. You have to estimate the probability, as a percentage, that a word is a bad word. Sounds good. But how do we achieve that? Well, it isn't so simple, but I'll try to explain.

First, take each bad word from your black list and look for it in the message. If you can't find it, slice one letter off the bad word and check again. If you still find nothing, slice again, and again... you get it, right? The fewer letters you had to slice off before you got a match, the higher the probability that it is a bad word.
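Here is one possible reading of that slicing idea, as a sketch. The name `badWordScore`, slicing from the end of the word, and the `minLength` cut-off of 3 letters are all my assumptions; tune them for your own list.

```javascript
// Scores how likely `word` is a disguised form of `badWord`, from 0 to 100.
// Letters are sliced off the END of the bad word until the rest is found.
function badWordScore(word, badWord, minLength = 3) {
  const candidate = word.toLowerCase();
  for (let len = badWord.length; len >= minLength; len--) {
    const piece = badWord.slice(0, len); // first `len` letters of the bad word
    if (candidate.includes(piece)) {
      // Share of the bad word that still matched, as a percentage.
      return Math.round((len / badWord.length) * 100);
    }
  }
  return 0; // nothing of at least `minLength` letters matched
}

console.log(badWordScore("darn", "darn"));  // 100
console.log(badWordScore("darm", "darn"));  // 75
console.log(badWordScore("hello", "darn")); // 0
```

You could then flag a message when the score is above some threshold, say 70. Slicing only from the end catches a changed last letter, but not a typo in the middle of the word, so treat it as a starting point.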

Oh, and one more important note: have you noticed that users sometimes skip the space to combine two bad words into one? Therefore we have to check the message not only word by word, but letter by letter. We'll try to find any sequence that looks similar to a bad word.
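A sketch of the letter-by-letter scan, assuming exact matching of each black-list entry at every position (the name `findSequences` and the sample words are placeholders I made up):

```javascript
// Walks through the message one letter at a time and reports every
// black-list entry that starts at that position.
function findSequences(message, list) {
  const text = message.toLowerCase();
  const hits = [];
  for (let start = 0; start < text.length; start++) {
    for (const bad of list) {
      if (text.startsWith(bad, start)) {
        hits.push({ word: bad, index: start });
      }
    }
  }
  return hits;
}

console.log(findSequences("you darnheck", ["darn", "heck"]));
// [ { word: 'darn', index: 4 }, { word: 'heck', index: 8 } ]
```

Because the scan ignores word boundaries, two glued words are both found. The price is that it also fires inside innocent words.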

Oh yes! Another improvement: a nasty word can be part of an acceptable word. For example, our bad word "ery" is part of the normal word "very". In this case we have to use a WhiteList to exclude such sequences of letters in the message. Here is the complete example code. It is just a sample. I wouldn't recommend using it in production, but you can modify it for your purposes:
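As a minimal sketch of the white-list step only, you could blank out the allowed words before scanning. The lists and the names `maskWhiteListed` and `containsBadWord` are assumptions made up for illustration:

```javascript
// Placeholder lists (assumptions for illustration).
const blackList = ["ery"];
const whiteList = ["very", "every", "query"];

// Replaces each white-listed word with spaces so it cannot trigger a match.
function maskWhiteListed(message, allowed) {
  let masked = message.toLowerCase();
  // Longest first, so "every" is blanked before the shorter "very".
  const sorted = [...allowed].sort((a, b) => b.length - a.length);
  for (const word of sorted) {
    masked = masked.split(word).join(" ".repeat(word.length));
  }
  return masked;
}

// True if any black-list entry survives after the white list is applied.
function containsBadWord(message, banned, allowed) {
  const masked = maskWhiteListed(message, allowed);
  return banned.some((bad) => masked.includes(bad));
}

console.log(containsBadWord("This is very good", blackList, whiteList)); // false
console.log(containsBadWord("What an ery", blackList, whiteList));       // true
```

Replacing with spaces instead of deleting keeps the letter positions unchanged and stops the neighbours of a removed word from gluing together into a new match.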