WinDeveloper IMF Tune


Mind Your Language (charset) Spammer!

Alexander Zammit


Software Development Consultant. Involved in the development of various Enterprise software solutions. Today focused on Blockchain and DLT technologies.

  • Published: Sep 16, 2008
  • Category: Anti-Spam

The email character set can be useful in filtering foreign spam. Today we see how character sets relate to languages, how SMTP emails convey non-English text, and how this information can be used in filtering spam.

Spam filtering involves analyzing various pieces of information. The email itself is of course a rich source. The SMTP command parameters and DNS also contribute their share. Today we look at just one piece of the puzzle: the language used to author an email or, more precisely, the character set. I recently had to refresh my memory on this topic because of the inclusion of a Language Blacklist in the latest IMF Tune v4.1 release. Here I share some observations I made in the process.

The Link between Character Sets, Languages and Locations

Before moving further, I will cover some basics. In simple terms, a character set is a collection of characters allowing us to express ourselves in one or more languages. A single character set is often able to cover a number of languages that differ only slightly in the characters they need. For example, Windows-1252 caters for English and various Western European languages.

The picture is complicated by the large number of character sets defined, the overlap between them, and the different names (aliases) used to identify the same character set. The same language could be expressed using different character sets. Alternatively, the same character set may be in use but identified using different aliases. Compounding everything, we also have Unicode, a huge character set that is able to represent almost all languages.
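The alias problem can be illustrated with a minimal Python sketch. It relies on the standard library codec registry to map different labels onto one canonical codec name. Note that these canonical names are Python's own, not the IANA registered names, and the helper `canonical_charset` is a hypothetical name introduced here for illustration.

```python
import codecs

def canonical_charset(name):
    """Return Python's canonical codec name for a charset label, or None if unknown."""
    try:
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None

# Different labels that refer to the same character set resolve to one name.
for label in ("ISO-8859-1", "latin1", "L1", "Windows-1252", "cp1252"):
    print(label, "->", canonical_charset(label))
```

A filter that compares character set names would need some normalization of this kind, otherwise a blacklist entry for one alias would miss emails labelled with another.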

The point here is to stress the distinction between character sets and languages. Normally we should not consider a character set as being representative of one language. At best it often relates to a set of different languages.

Another important point to appreciate is that nearly all character sets used in email share the 7-bit ASCII range, which covers English. An email declaring a Cyrillic character set might in fact be composed entirely of plain English. It is not unusual to use a computer for both personal and work email. An immigrant might configure his computer to easily communicate with his family in his native tongue and then switch to English when working. All his emails end up stamped with the same character set.
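The shared ASCII range is easy to verify. The sketch below, assuming Python 3.7 or later for `str.isascii`, checks that English text produces identical bytes under several character sets, and shows how a body declared as Cyrillic can be tested for being plain English. The function names and the list of character sets are illustrative choices, not taken from any product.

```python
# Character sets chosen for illustration; all are ASCII-compatible.
CHARSETS = ("windows-1252", "koi8-r", "windows-1251",
            "iso-8859-5", "iso-2022-jp", "utf-8")

def same_bytes_everywhere(ascii_text, charsets=CHARSETS):
    """True if ASCII-only text encodes to the same bytes in every charset."""
    reference = ascii_text.encode("ascii")
    return all(ascii_text.encode(cs) == reference for cs in charsets)

def is_plain_ascii(raw_body, declared_charset):
    """True if a body contains only ASCII characters despite its declared charset."""
    return raw_body.decode(declared_charset, errors="replace").isascii()

print(same_bytes_everywhere("Cheap watches, buy now"))
print(is_plain_ascii(b"Cheap watches, buy now", "koi8-r"))
print(is_plain_ascii("Привет".encode("koi8-r"), "koi8-r"))
```

The second call models the case described above: the email is stamped with a Cyrillic character set, yet nothing in the body actually requires it.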

It is also normal to associate a language with one or more countries or regions. However, it is good to remember that a language is also distinct from the email origin. Knowing the language does not tell us for sure where the email is coming from.

My goal here is to put into perspective what we are talking about. In practice there is a good chance you will find character set based filtering to be very effective. Nevertheless, as with other filter types, it is worth appreciating what the filtering criteria are and their inherent strengths and weaknesses.

Character Sets in SMTP Emails

SMTP, as defined in RFC2821, only allows the use of 7-bit ASCII characters. This is a very small set that is unable to go much beyond the English language. Thus it was necessary to enable SMTP emails to somehow convey text from other languages. The MIME standard provided a solution, defining methods for encoding non-ASCII text.

The basic idea is that of encoding character sequences from other sets using exclusively the 7-bit ASCII repertoire. MIME provides two solutions, one for email bodies and the other for headers such as the Subject, From and To.

MIME breaks emails into parts, packaging together blocks of content and headers. An email body is contained within a MIME part whose headers identify the character set and encoding type. In this manner an email exposes the character set used when authoring the content and specifies the encoding used to package it in 7-bit ASCII. This is enough for the receiving end to retrieve the body as originally intended.

Here is a snippet showing a body in the ISO-2022-JP character set, widely used for the Japanese language.

(Image: MIME Content-Type charset)

The Content-Type header is the one identifying the body character set with the value:

charset="iso-2022-jp"

ISO-2022-JP is the character set name as registered with the Internet Assigned Numbers Authority (IANA). Thus anyone interested in processing emails can safely identify character sets through this standardized naming.
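As a rough sketch of how a receiver might read this information, the following uses Python's standard `email` package to extract the declared body character set and recover the text. The sample message, its addresses and the helper name `body_charset_and_text` are made up for illustration; a real message would normally arrive from the SMTP server rather than being built in code.

```python
import email
from email import policy

def body_charset_and_text(raw_message):
    """Return (declared charset, decoded text) for the text body of a raw email."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    # For a simple single-part message, the body is the message itself.
    part = msg.get_body(preferencelist=("plain", "html")) or msg
    return part.get_content_charset(), part.get_content()

# Hypothetical sample: a Japanese greeting encoded in ISO-2022-JP,
# which uses only 7-bit byte values and so can travel as "7bit".
SAMPLE_BODY = "こんにちは".encode("iso-2022-jp")
RAW = (
    b"From: sender@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="iso-2022-jp"\r\n'
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"\r\n" + SAMPLE_BODY + b"\r\n"
)

charset, text = body_charset_and_text(RAW)
print(charset, text.strip())
```

For filtering purposes only the first value matters. Decoding the text is shown simply to confirm that the declared character set is what allows the body to be recovered as authored.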

The previous image also showed the encoding of the From and Subject headers. Here the space is more restricted. Everything must be encapsulated within the header itself, i.e. the character set name, the encoding type and the actual header text value. Note how we can easily identify the character set name in the first part of the encoded value. If you are interested in the technical details, I suggest checking RFC2047. What is worth appreciating is that the same elements are involved in the case of both headers and bodies.
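An encoded header value has the form =?charset?encoding?text?=, so the character set name can be picked out without decoding the text itself. A minimal sketch using `decode_header` from Python's standard library follows; the function name `header_charsets` and the sample Subject values are assumptions made for this example.

```python
from email.header import decode_header

def header_charsets(header_value):
    """Return the set of charset names declared in a header's encoded words."""
    # decode_header yields (value, charset) pairs; charset is None for plain text
    # and is reported in lowercase for encoded words.
    return {charset for _, charset in decode_header(header_value) if charset}

# Hypothetical Subject values: one Japanese (base64), one Western European
# (quoted-printable), and one plain ASCII header with no encoded words.
print(header_charsets("=?ISO-2022-JP?B?GyRCJDMkcyRLJEEkTxsoQg==?="))
print(header_charsets("=?iso-8859-1?q?p=F6stal?="))
print(header_charsets("Plain English subject"))
```

A plain ASCII header yields an empty set, which reflects the earlier point that an English-only header carries no character set declaration at all.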

Conclusions

When blocking emails by character set, we are making the assumption that anyone using a particular character set is a spammer. The logic behind this involves the association with languages and country of origin. This is a reasonable assumption although it does not cater for all possible scenarios that arise in practice.

At WinDeveloper I tested this with excellent results, although it is worth saying that in my case the character set filter is only one of a number of layers, which include overriding whitelists. Blocked emails included some declaring a Cyrillic character set but composed entirely in English. I certainly did not mind this, since these were spam in any case.

I would certainly recommend considering such a filter. Considerations should include understanding the type of legitimate emails normally received. An organization whose business interests span the globe may experience different results from one whose reach does not extend beyond the state boundaries. Hopefully today I have armed you with enough understanding to help you make the right decision.
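The layering of a character set blacklist under an overriding whitelist can be sketched as follows. This is only an illustration of the idea under simple assumptions (a sender whitelist of email addresses and exact character set matching after alias normalization); it is not how IMF Tune implements its Language Blacklist, and all names and addresses are hypothetical.

```python
import codecs

def _canonical(name):
    """Normalize a charset label so that aliases compare equal."""
    try:
        return codecs.lookup(name.strip()).name
    except LookupError:
        return name.strip().lower()

def should_block(sender, declared_charsets, blacklist, whitelist):
    """Block when any declared charset is blacklisted, unless the sender is whitelisted."""
    if sender.lower() in whitelist:
        return False  # the whitelist overrides the charset check
    blocked = {_canonical(cs) for cs in blacklist}
    return any(_canonical(cs) in blocked for cs in declared_charsets)

BLACKLIST = {"koi8-r", "windows-1251"}     # example Cyrillic charsets
WHITELIST = {"partner@example.org"}        # example trusted sender

print(should_block("unknown@example.net", ["KOI8-R"], BLACKLIST, WHITELIST))
print(should_block("partner@example.org", ["KOI8-R"], BLACKLIST, WHITELIST))
print(should_block("unknown@example.net", ["us-ascii"], BLACKLIST, WHITELIST))
```

Note that the decision depends only on the declared character sets, so an English email stamped as Cyrillic is blocked just the same, as in the test results described above.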

Copyright © 2005 - 2024 All rights reserved. ExchangeInbox.com is not affiliated with Microsoft Corporation