Encode and Decode MIME Base64 encoded binaries

I became interested in how email attachments are encoded and decoded while working on spam-filtering software. I found that many spam emails include an attached HTML document that is decoded by the user's mail client.

I wanted my spam filter to look for common words and phrases in mail documents as one way to determine whether they were unsolicited, but I found that, without the ability to scan attached HTML files, many spam emails were not being filtered. To improve the yield of my filtering software, I decided to add Base64 capability so that attached HTML documents could be filtered.

Base64 is described in RFC 2045, which specifies the format of Internet message bodies. Base64 encoding is used to convert binary files into a format that can be easily transmitted, with as little regard to operating system support as possible. To do this, all data is transmitted in six-bit chunks, each represented by one printable character from the following encoding table:

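Table 1: The Base64 Alphabet

| Value | Encoding | Value | Encoding | Value | Encoding | Value | Encoding |
|-------|----------|-------|----------|-------|----------|-------|----------|
| 0     | A        | 17    | R        | 34    | i        | 51    | z        |
| 1     | B        | 18    | S        | 35    | j        | 52    | 0        |
| 2     | C        | 19    | T        | 36    | k        | 53    | 1        |
| 3     | D        | 20    | U        | 37    | l        | 54    | 2        |
| 4     | E        | 21    | V        | 38    | m        | 55    | 3        |
| 5     | F        | 22    | W        | 39    | n        | 56    | 4        |
| 6     | G        | 23    | X        | 40    | o        | 57    | 5        |
| 7     | H        | 24    | Y        | 41    | p        | 58    | 6        |
| 8     | I        | 25    | Z        | 42    | q        | 59    | 7        |
| 9     | J        | 26    | a        | 43    | r        | 60    | 8        |
| 10    | K        | 27    | b        | 44    | s        | 61    | 9        |
| 11    | L        | 28    | c        | 45    | t        | 62    | +        |
| 12    | M        | 29    | d        | 46    | u        | 63    | /        |
| 13    | N        | 30    | e        | 47    | v        |       |          |
| 14    | O        | 31    | f        | 48    | w        | (pad) | =        |
| 15    | P        | 32    | g        | 49    | x        |       |          |
| 16    | Q        | 33    | h        | 50    | y        |       |          |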

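Data is transmitted in chunks of 24 bits (three bytes), each encoded as a block of four characters. If the size of the file to be transmitted is not evenly divisible by 24 bits, the last chunk is padded with zero bits, and the "=" pad character fills out the final block of four: one leftover byte is encoded as two characters followed by "==", and two leftover bytes as three characters followed by "=".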
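The padding arithmetic can be sketched as two small helpers. The names Base64PadCount and Base64EncodedLength are hypothetical and are not part of the functions listed later in this article; the length they report does not count any line breaks added to the output.

```vb
'---Hypothetical helpers for illustration only.

'---Number of "=" pad characters needed for lByteCount input bytes (0, 1, or 2)
Public Function Base64PadCount(ByVal lByteCount As Long) As Long
    Base64PadCount = (3 - (lByteCount Mod 3)) Mod 3
End Function

'---Encoded length, excluding line breaks: four characters are output
'---for every group of up to three input bytes
Public Function Base64EncodedLength(ByVal lByteCount As Long) As Long
    Base64EncodedLength = 4 * ((lByteCount + 2) \ 3)
End Function
```

For example, a 4-byte input needs two pad characters and produces 8 output characters.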

Blocks of characters are grouped into lines of no more than 76 characters, with a carriage return and line feed separating each line. When decoding, any characters not in the Base64 alphabet, including these line breaks, should be ignored.
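As a rough sketch of the line-wrapping rule, the following helper breaks an already-encoded string into 76-character lines. WrapBase64Lines is a hypothetical name used only for illustration, and it assumes the input contains no line breaks yet.

```vb
'---Hypothetical helper for illustration only.
'---Splits an encoded string into lines of at most 76 characters,
'---separated by carriage return + line feed.
Public Function WrapBase64Lines(ByVal sEncoded As String) As String
    Const lLineLen As Long = 76
    Dim lPos As Long
    Dim sResult As String

    sResult = ""
    For lPos = 1 To Len(sEncoded) Step lLineLen
        '---Separate lines with CRLF; no CRLF is added after the last line
        If lPos > 1 Then sResult = sResult & vbCrLf
        sResult = sResult & Mid$(sEncoded, lPos, lLineLen)
    Next lPos

    WrapBase64Lines = sResult
End Function
```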

The specification requires the "=" pad character only when the final chunk holds fewer than 24 bits of data, so encoded data does not always end with an "=" sign. I have found that when no additional non-encoded data follows the encoded block in an email message, end-of-file is often the only indication of the end of the encoded block.

**Decoding Base64**

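To decode data, each block of four characters is combined into a 24-bit number, which is then divided into three 8-bit bytes. The first character in the block supplies the high-order bits, and the high-order byte is the first byte of the output.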

Here's an example:

The text "scr" encoded as Base64 is "c2Ny". Using the Base64 alphabet, "c2Ny" decodes to the values 28, 54, 13, 50, or in binary 011100 110110 001101 110010. The first character supplies the high-order six bits, so the values are concatenated in the order they appear, giving the 24-bit number "011100 110110 001101 110010". The decimal representation of this binary number is 7,562,098.
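The composite value in this example can be checked with a short sketch. Base64GroupToLong is a hypothetical helper, not part of the listing at the end of the article, and it assumes its argument is exactly four valid alphabet characters with no padding.

```vb
'---Hypothetical helper for illustration only.
'---Combines four Base64 characters into one 24-bit number,
'---with the first character in the highest six bits.
Public Function Base64GroupToLong(ByVal sGroup As String) As Long
    Const sBase64Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    Dim lPos As Long
    Dim lValue As Long

    lValue = 0
    For lPos = 1 To 4
        '---Multiplying by 64 shifts the running total left by six bits.
        '---The comparison must be case-sensitive: "a" and "A" are different values.
        lValue = lValue * 64 + _
            (InStr(1, sBase64Alphabet, Mid$(sGroup, lPos, 1), vbBinaryCompare) - 1)
    Next lPos

    Base64GroupToLong = lValue
End Function
```

Calling Base64GroupToLong("c2Ny") returns 7562098, the composite number from the example.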

This composite 24-bit number is then divided into 8-bit chunks, so our 24-bit binary sequence, "011100 110110 001101 110010", becomes "01110011 01100011 01110010". To determine the value of each 8-bit chunk, we first apply a mask to isolate the high-, middle-, or low-order byte. We then shift the value rightward by dividing it.

For example, the value 16711680 is the decimal equivalent of the binary number "11111111 00000000 00000000". If the bitwise And operator is used to apply this number as a mask to our composite 24-bit number, we get 7,536,640, or "01110011 00000000 00000000". We then divide this number by 65536, or 2^16, to shift the value rightward, so that we have "00000000 00000000 01110011". The decimal value of "01110011" is 115, or "s", the first of our three original bytes, decoded.
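The mask-and-divide step can be sketched as a single helper that extracts any one of the three bytes. ByteFromComposite and its lIndex parameter are hypothetical names for illustration. Integer division (the \ operator) is used here so that the result is never rounded.

```vb
'---Hypothetical helper for illustration only.
'---Extracts one byte from a 24-bit composite number.
'---lIndex: 2 = high-order byte, 1 = middle byte, 0 = low-order byte
Public Function ByteFromComposite(ByVal lComposite As Long, ByVal lIndex As Long) As Long
    Select Case lIndex
        Case 2
            '---Mask with 11111111 00000000 00000000, then shift right 16 bits
            ByteFromComposite = (lComposite And 16711680) \ 65536
        Case 1
            '---Mask with 00000000 11111111 00000000, then shift right 8 bits
            ByteFromComposite = (lComposite And 65280) \ 256
        Case Else
            '---Mask with 00000000 00000000 11111111; no shift needed
            ByteFromComposite = lComposite And 255
    End Select
End Function
```

For the composite number 7562098, indexes 2, 1, and 0 return 115, 99, and 114, the character codes of "s", "c", and "r".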

**Encoding Base64**

To encode Base64, you must first combine a group of three 8-bit values into a single 24-bit number, and then divide the 24-bit number into four 6-bit numbers, which are output using the Base64 alphabet. Note that if fewer than three bytes remain to encode, the missing bytes are treated as 0, and each output character that contains no input bits is replaced by the "=" pad character.

To encode "scr", we would first create a 24-bit composite number that combines the three characters. The ASCII equivalents of "s", "c", and "r" are 115, 99, and 114, so our 24-bit number is equal to 115 * 256^2 + 99 * 256^1 + 114 * 256^0, or 7,562,098. Notice that the first character in the string is accorded the highest byte in the composite number.
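The same calculation can be written as a short loop. BytesToComposite is a hypothetical helper for illustration; it assumes the group holds one to three single-byte (ANSI) characters and treats any missing trailing bytes as zero.

```vb
'---Hypothetical helper for illustration only.
'---Combines up to three characters into one 24-bit number,
'---with the first character in the highest byte.
Public Function BytesToComposite(ByVal sGroup As String) As Long
    Dim lPos As Long
    Dim lValue As Long

    lValue = 0
    For lPos = 1 To 3
        '---Multiplying by 256 shifts the running total left by eight bits
        lValue = lValue * 256
        '---Missing bytes contribute zero
        If lPos <= Len(sGroup) Then
            lValue = lValue + Asc(Mid$(sGroup, lPos, 1))
        End If
    Next lPos

    BytesToComposite = lValue
End Function
```

Calling BytesToComposite("scr") returns 7562098.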

Once we've derived our composite number, we chop it into four 6-bit numbers. We do this by using a mask to isolate just the bits corresponding to the part of the number that we want, and then dividing the result to shift it rightward, in the same manner that we used when decoding.

The four six-bit numbers are then encoded using the Base64 alphabet. For "scr", the four numbers are 28, 54, 13, and 50, which encode as "c2Ny".
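A minimal sketch of this last step follows. CompositeToBase64 is a hypothetical helper, and it takes a slightly different route from the one described above: it divides first and masks afterwards, which yields the same six-bit values. It does not handle padding.

```vb
'---Hypothetical helper for illustration only.
'---Splits a 24-bit number into four 6-bit values, highest first,
'---and looks each one up in the Base64 alphabet.
Public Function CompositeToBase64(ByVal lComposite As Long) As String
    Const sBase64Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    Dim lPos As Long
    Dim lDivisor As Long
    Dim lIndex As Long
    Dim sResult As String

    sResult = ""
    lDivisor = 262144          '---2^18: shifts the top six bits to the bottom
    For lPos = 1 To 4
        '---Integer division shifts right; And 63 keeps the low six bits
        lIndex = (lComposite \ lDivisor) And 63
        sResult = sResult & Mid$(sBase64Alphabet, lIndex + 1, 1)
        lDivisor = lDivisor \ 64
    Next lPos

    CompositeToBase64 = sResult
End Function
```

Calling CompositeToBase64(7562098) returns "c2Ny".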

There are other possible applications for Base64 encoding than sending and receiving email attachments. Base64 could also be used to encode binary files for transmission via a message queue, or for storage in a database.

Here are functions for encoding and decoding Base64 using the algorithm described above:

    Private Function DecodeBase64(sEncodedText As String) As String
        Const sBase64Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
        Dim lTextLen As Long
        Dim lGroupPos As Long
        Dim lBase64Value As Long
        Dim lCurSrcPos As Long
        Dim sCurSrcChar As String
        Dim lBase64AlphaPos As Long
        Dim iByte1 As Integer
        Dim iByte2 As Integer
        Dim iByte3 As Integer
        Dim sOutputString As String

        lTextLen = Len(sEncodedText)
        sOutputString = ""
        lCurSrcPos = 1

        Do While lCurSrcPos <= lTextLen
            lBase64Value = 0
            lGroupPos = 0

            '---Collect up to four Base64 characters into one 24-bit number
            Do While lGroupPos < 4 And lCurSrcPos <= lTextLen
                sCurSrcChar = Mid$(sEncodedText, lCurSrcPos, 1)
                If sCurSrcChar = "=" Then
                    '---Pad character: end of encoded text
                    lCurSrcPos = lTextLen
                Else
                    '---Case-sensitive lookup; characters not in the alphabet are ignored
                    lBase64AlphaPos = InStr(1, sBase64Alphabet, sCurSrcChar, vbBinaryCompare)
                    If lBase64AlphaPos > 0 Then
                        lBase64Value = lBase64Value + (64 ^ (3 - lGroupPos)) * (lBase64AlphaPos - 1)
                        lGroupPos = lGroupPos + 1
                    End If
                End If
                lCurSrcPos = lCurSrcPos + 1
            Loop

            '---Convert 24-bit number into three bytes (integer division avoids rounding)
            iByte3 = (lBase64Value And 16711680) \ 65536
            iByte2 = (lBase64Value And 65280) \ 256
            iByte1 = (lBase64Value And 255)

            '---Output only the bytes that were actually encoded:
            '---2 characters carry 1 byte, 3 carry 2 bytes, 4 carry 3 bytes
            If lGroupPos >= 2 Then sOutputString = sOutputString & Chr$(iByte3)
            If lGroupPos >= 3 Then sOutputString = sOutputString & Chr$(iByte2)
            If lGroupPos = 4 Then sOutputString = sOutputString & Chr$(iByte1)
        Loop

        DecodeBase64 = sOutputString
    End Function

    Private Function EncodeBase64(sDecodedText As String) As String
        Const sBase64Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
        Dim lTextLen As Long
        Dim lGroupPos As Long
        Dim lBase64Value As Long
        Dim lCurSrcPos As Long
        Dim lPadCount As Long
        Dim iByte1 As Integer
        Dim iByte2 As Integer
        Dim iByte3 As Integer
        Dim iByte4 As Integer
        Dim lOutputLen As Long
        Dim sGroup As String
        Dim sOutputString As String

        lTextLen = Len(sDecodedText)
        sOutputString = ""
        lOutputLen = 0
        lCurSrcPos = 1

        Do While lCurSrcPos <= lTextLen
            lBase64Value = 0
            lPadCount = 0

            '---Combine up to three bytes into one 24-bit number, first byte highest
            For lGroupPos = 0 To 2
                If lCurSrcPos <= lTextLen Then
                    lBase64Value = lBase64Value + _
                        (256 ^ (2 - lGroupPos)) * Asc(Mid$(sDecodedText, lCurSrcPos, 1))
                Else
                    '---Missing byte: counts as zero and is padded with "="
                    lPadCount = lPadCount + 1
                End If
                lCurSrcPos = lCurSrcPos + 1
            Next lGroupPos

            '---Convert 24-bit number into four six-bit values
            '---(each mask covers exactly six bits; integer division avoids rounding)
            iByte4 = (lBase64Value And 16515072) \ 262144
            iByte3 = (lBase64Value And 258048) \ 4096
            iByte2 = (lBase64Value And 4032) \ 64
            iByte1 = (lBase64Value And 63)

            '---Start a new line after every 76 output characters
            If lOutputLen > 0 And lOutputLen Mod 76 = 0 Then
                sOutputString = sOutputString & vbCrLf
            End If

            sGroup = Mid$(sBase64Alphabet, iByte4 + 1, 1) & Mid$(sBase64Alphabet, iByte3 + 1, 1)
            If lPadCount = 2 Then
                sGroup = sGroup & "=="
            ElseIf lPadCount = 1 Then
                sGroup = sGroup & Mid$(sBase64Alphabet, iByte2 + 1, 1) & "="
            Else
                sGroup = sGroup & Mid$(sBase64Alphabet, iByte2 + 1, 1) & _
                    Mid$(sBase64Alphabet, iByte1 + 1, 1)
            End If

            sOutputString = sOutputString & sGroup
            lOutputLen = lOutputLen + 4
        Loop

        EncodeBase64 = sOutputString
    End Function