JLION.COM - Encode and Decode MIME Base64 encoded binaries
3/16/03 Internet, VB.6

I became interested in how email attachments are encoded and decoded while working on spam-filtering software. I found that many spam emails include an attached HTML document that is decoded by the user's mail client.

I wanted my spam filter to look for common words and phrases in mail documents as one way to determine whether they were unsolicited. Without the ability to scan attached HTML files, however, many spam emails were not being filtered. To improve the catch rate of my filtering software, I decided to add Base64 decoding so that attached HTML documents could be scanned as well.

Base64 is described in RFC 2045, which specifies the format of Internet message bodies. Base64 encoding converts binary files into a form that can be transmitted easily, with as little dependence on operating system support as possible. To do this, all data is transmitted in six-bit chunks, each represented by one character from the following encoding table:

                    Table 1: The Base64 Alphabet

     Value Encoding  Value Encoding  Value Encoding  Value Encoding
         0 A            17 R            34 i            51 z
         1 B            18 S            35 j            52 0
         2 C            19 T            36 k            53 1
         3 D            20 U            37 l            54 2
         4 E            21 V            38 m            55 3
         5 F            22 W            39 n            56 4
         6 G            23 X            40 o            57 5
         7 H            24 Y            41 p            58 6
         8 I            25 Z            42 q            59 7
         9 J            26 a            43 r            60 8
        10 K            27 b            44 s            61 9
        11 L            28 c            45 t            62 +
        12 M            29 d            46 u            63 /
        13 N            30 e            47 v
        14 O            31 f            48 w         (pad) =
        15 P            32 g            49 x
        16 Q            33 h            50 y
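
Because the table is just the 64 characters in value order, it can be held in a single string and indexed by position. Here is a minimal sketch of lookups in both directions. The names BASE64_ALPHABET, Base64CharFromValue and Base64ValueFromChar are my own, chosen for illustration.

```vb
' Hypothetical helpers for illustration.
' The constant belongs in the declarations section of a standard module.
Private Const BASE64_ALPHABET As String = _
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

' Returns the character that encodes a six-bit value (0 to 63).
Public Function Base64CharFromValue(ByVal lValue As Long) As String
    ' Mid$ positions start at 1, the table values start at 0
    Base64CharFromValue = Mid$(BASE64_ALPHABET, lValue + 1, 1)
End Function

' Returns the six-bit value (0 to 63) of a Base64 character,
' or -1 if the character is not in the alphabet (for example "=" or a line break).
Public Function Base64ValueFromChar(ByVal sChar As String) As Long
    If Len(sChar) <> 1 Then
        Base64ValueFromChar = -1
    Else
        ' vbBinaryCompare keeps "a" (26) distinct from "A" (0)
        Base64ValueFromChar = InStr(1, BASE64_ALPHABET, sChar, vbBinaryCompare) - 1
    End If
End Function
```

With these, Base64CharFromValue(28) gives "c" and Base64ValueFromChar("/") gives 63, matching the table.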

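Data is processed in chunks of 24 bits (three bytes), each of which becomes a block of four characters. If the size of the file to be transmitted is not evenly divisible by 24 bits, then the last chunk is padded with zero bits, and any output character that holds no input data at all is written as "=".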

Blocks of characters are grouped into lines of no more than 76 characters, with a carriage return and line feed separating each line. When decoding, any characters not in the Base64 alphabet, including these line breaks, should be ignored.
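
The line-length rule can be applied as a separate pass over text that has already been encoded. This is only a sketch under that assumption, and WrapBase64Lines is a hypothetical name rather than part of any library.

```vb
' Hypothetical helper: breaks already-encoded text into lines of at most
' 76 characters, separated by carriage return + line feed.
Public Function WrapBase64Lines(ByVal sEncoded As String) As String
    Const LINE_LEN As Long = 76

    Dim lPos As Long
    Dim sResult As String

    sResult = ""
    For lPos = 1 To Len(sEncoded) Step LINE_LEN
        ' Separator goes between lines, so none is added before the first
        If lPos > 1 Then sResult = sResult & vbCrLf
        ' Mid$ returns a shorter piece for the final, partial line
        sResult = sResult & Mid$(sEncoded, lPos, LINE_LEN)
    Next lPos

    WrapBase64Lines = sResult
End Function
```

Since 76 is a multiple of four, a line break placed this way never falls inside a four-character block.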

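The specification uses the "=" sign only as padding at the very end of the encoded data: two "=" signs when the final chunk holds one byte, one "=" sign when it holds two bytes, and none when the data length is an exact multiple of three bytes. Encoded data therefore does not always end with "=", and I have found that it often does not. In that case the end of the encoded block is indicated by whatever follows it in the email message, or by end-of-file if no additional non-encoded data follows.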
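
Under RFC 2045 the number of "=" signs depends only on how many bytes are being encoded: none when the length is a multiple of three, two when one byte is left over, and one when two bytes are left over. A small sketch of that arithmetic follows. The function names are hypothetical and are there only to make the rule concrete.

```vb
' Hypothetical helpers for illustration.

' Number of "=" characters that end the encoding of lByteCount input bytes.
Public Function Base64PadCount(ByVal lByteCount As Long) As Long
    ' 0 bytes left over -> 0, 1 left over -> 2, 2 left over -> 1
    Base64PadCount = (3 - (lByteCount Mod 3)) Mod 3
End Function

' Number of output characters, padding included and line breaks excluded.
Public Function Base64EncodedLength(ByVal lByteCount As Long) As Long
    ' Every started group of three bytes produces four characters
    Base64EncodedLength = 4 * ((lByteCount + 2) \ 3)
End Function
```

For three input bytes this gives four characters and no padding. For four input bytes it gives eight characters, the last two of which are "=".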

Decoding Base64

To decode data, each block of four characters is combined into a 24-bit number, which is then divided into three 8-bit bytes. The first character in the block supplies the high-order six bits.

Here's an example:

The text "scr" encoded as Base64 is "c2Ny". Using the Base64 alphabet, "c2Ny" decodes to the values 28, 54, 13, 50, or in binary 011100 110110 001101 110010. Since the first character holds the high-order bits, the four values are joined in the order they appear, so the composite is 28 * 64^3 + 54 * 64^2 + 13 * 64^1 + 50 * 64^0. The decimal representation of this binary number is 7,562,098.

This composite 24-bit number is then divided into 8-bit chunks, so our 24-bit binary sequence "011100 110110 001101 110010" becomes "01110011 01100011 01110010". To determine the value of each 8-bit chunk, we first apply a mask to get only the high-, middle-, or low-order byte. We then shift the value rightward by dividing it by a power of two.

For example, the value 16711680 is the decimal equivalent of the binary number "11111111 00000000 00000000". If the bitwise And operator is used to apply this number as a mask to our composite 24-bit number, we get 7,536,640, or "01110011 00000000 00000000". We then divide this number by 65536, or 2^16, to shift the value rightward so we then have "00000000 00000000 01110011". The decimal value of "01110011" is 115, the ASCII code for "s", the first of our three original bytes, decoded.
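
The mask-and-divide step can be tried in the Immediate window with the composite from the example. This is a sketch only, and ShowDecodedBytes is a hypothetical name. I have written the masks as decimal literals because a hex literal such as &HFF00 is read by VB6 as a negative Integer unless it is given a trailing "&".

```vb
' Hypothetical demonstration routine: splits the 24-bit composite for "c2Ny"
' into its three bytes using a mask followed by integer division.
Public Sub ShowDecodedBytes()
    Dim lComposite As Long
    Dim lHigh As Long, lMiddle As Long, lLow As Long

    lComposite = 7562098

    lHigh = (lComposite And 16711680) \ 65536   ' top 8 bits, shifted right 16 bits
    lMiddle = (lComposite And 65280) \ 256      ' middle 8 bits, shifted right 8 bits
    lLow = lComposite And 255                   ' bottom 8 bits, no shift needed

    Debug.Print lHigh; lMiddle; lLow
    Debug.Print Chr$(lHigh) & Chr$(lMiddle) & Chr$(lLow)
End Sub
```

Running it should print the values 115, 99 and 114, followed by the text scr. The integer-division operator "\" is used instead of "/" so that no floating-point rounding can creep in.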

Encoding Base64

To encode Base64, you must first combine a group of three 8-bit values into a single 24-bit number, and then divide the 24-bit number into four 6-bit numbers, which are output using the Base64 alphabet. Note that if there are fewer than three bytes left to encode, the missing bytes are treated as 0, and each output character that contains no input data is replaced with "=".

To encode "scr", we would first create a 24-bit composite number that combines the three characters. The ASCII equivalents of "s", "c", and "r" are 115, 99, and 114, so our 24-bit number is equal to 115 * 256^2 + 99 * 256^1 + 114 * 256^0, or 7,562,098. Notice that the first character in the string is accorded the highest byte in the composite number.
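
The composite can be computed directly, as in the sketch below. ComposeTriple is a hypothetical name, and the function assumes its argument holds exactly three single-byte (ANSI) characters.

```vb
' Hypothetical helper: packs three characters into one 24-bit number,
' with the first character in the highest byte.
' Assumes sText holds exactly three single-byte (ANSI) characters.
Public Function ComposeTriple(ByVal sText As String) As Long
    ' CLng avoids Integer overflow: Asc returns an Integer, and
    ' Integer * 256 overflows for character codes of 128 and above.
    ComposeTriple = CLng(Asc(Mid$(sText, 1, 1))) * 65536 _
                  + CLng(Asc(Mid$(sText, 2, 1))) * 256 _
                  + CLng(Asc(Mid$(sText, 3, 1)))
End Function
```

ComposeTriple("scr") returns 7562098, the same composite reached from the other direction in the decoding example.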

Once we've derived our composite number, we chop it into four 6-bit numbers. We do this by using a mask to find just the bits corresponding to the part of the number that we want, and then dividing the result to shift it rightward, in the same manner that we used when decoding.

The four six-bit numbers are then encoded using the Base64 alphabet. For "scr" they are 28, 54, 13, and 50, which the alphabet maps to "c2Ny".
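
Here is a sketch of that split for a single composite number. EncodeComposite is a hypothetical name. Each mask selects exactly six bits (decimal 16515072, 258048, 4032 and 63), and each divisor is the matching power of two.

```vb
' Hypothetical helper: turns one 24-bit composite into four Base64 characters.
Public Function EncodeComposite(ByVal lComposite As Long) As String
    Const sAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    Dim lIndex1 As Long, lIndex2 As Long, lIndex3 As Long, lIndex4 As Long

    lIndex1 = (lComposite And 16515072) \ 262144   ' bits 18 to 23
    lIndex2 = (lComposite And 258048) \ 4096       ' bits 12 to 17
    lIndex3 = (lComposite And 4032) \ 64           ' bits 6 to 11
    lIndex4 = lComposite And 63                    ' bits 0 to 5

    ' Mid$ positions start at 1, so add 1 to each six-bit value
    EncodeComposite = Mid$(sAlphabet, lIndex1 + 1, 1) & _
                      Mid$(sAlphabet, lIndex2 + 1, 1) & _
                      Mid$(sAlphabet, lIndex3 + 1, 1) & _
                      Mid$(sAlphabet, lIndex4 + 1, 1)
End Function
```

EncodeComposite(7562098) returns "c2Ny".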

There are other possible applications for Base64 encoding than sending and receiving email attachments. Base64 could also be used to encode binary files for transmission via a message queue, or for storage in a database.

Here are functions for encoding and decoding Base64 using the algorithm described above:

Private Function DecodeBase64(sEncodedText As String) As String
    Const sBase64Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    Dim lTextLen As Long

    Dim lGroupPos As Long       '---Valid characters collected in the current block
    Dim lBase64Value As Long    '---24-bit composite number for the current block

    Dim lCurSrcPos As Long
    Dim sCurSrcChar As String
    Dim lBase64AlphaPos As Long

    Dim bEndOfData As Boolean

    Dim sOutputString As String

    lTextLen = Len(sEncodedText)

    sOutputString = ""
    bEndOfData = False

    lCurSrcPos = 1
    Do While lCurSrcPos <= lTextLen And Not bEndOfData
        lBase64Value = 0
        lGroupPos = 0

        '---Collect up to four valid characters, stopping at "=" or end of text
        Do While lGroupPos < 4 And lCurSrcPos <= lTextLen
            sCurSrcChar = Mid$(sEncodedText, lCurSrcPos, 1)
            lCurSrcPos = lCurSrcPos + 1

            If sCurSrcChar = "=" Then
                '---Padding marks the end of the encoded text.
                bEndOfData = True
                Exit Do
            End If

            lBase64AlphaPos = InStr(1, sBase64Alphabet, sCurSrcChar, vbBinaryCompare)
            If lBase64AlphaPos > 0 Then
                '---The first character of a block holds the high-order six bits
                lBase64Value = lBase64Value + (64 ^ (3 - lGroupPos)) * (lBase64AlphaPos - 1)
                lGroupPos = lGroupPos + 1
            End If
            '---Any other character (line breaks, for example) is ignored
        Loop

        '---Convert 24-bit number into bytes, keeping only those that were
        '---fully present: 2 characters give 1 byte, 3 give 2, 4 give 3
        If lGroupPos >= 2 Then
            sOutputString = sOutputString & Chr$((lBase64Value And 16711680) \ 65536)
        End If
        If lGroupPos >= 3 Then
            sOutputString = sOutputString & Chr$((lBase64Value And 65280) \ 256)
        End If
        If lGroupPos = 4 Then
            sOutputString = sOutputString & Chr$(lBase64Value And 255)
        End If
    Loop

    DecodeBase64 = sOutputString
End Function

Private Function EncodeBase64(sDecodedText As String) As String
    Const sBase64Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    Dim lTextLen As Long

    Dim lGroupPos As Long
    Dim lBytesInGroup As Long   '---Input bytes actually present in the current block

    Dim lBase64Value As Long    '---24-bit composite number for the current block

    Dim lCurSrcPos As Long

    Dim iByte1 As Integer
    Dim iByte2 As Integer
    Dim iByte3 As Integer
    Dim iByte4 As Integer

    Dim lOutputLen As Long

    Dim sGroup As String
    Dim sOutputString As String

    lTextLen = Len(sDecodedText)

    sOutputString = ""
    lOutputLen = 0

    lCurSrcPos = 1
    Do While lCurSrcPos <= lTextLen
        '---Start a new line after every 76 output characters
        If lOutputLen > 0 And lOutputLen Mod 76 = 0 Then
            sOutputString = sOutputString & vbCrLf
        End If

        '---Combine up to three bytes; missing bytes are left as zero
        lBase64Value = 0
        lBytesInGroup = 0
        For lGroupPos = 0 To 2
            If lCurSrcPos <= lTextLen Then
                lBase64Value = lBase64Value + (256 ^ (2 - lGroupPos)) * Asc(Mid$(sDecodedText, lCurSrcPos, 1))
                lBytesInGroup = lBytesInGroup + 1
            End If
            lCurSrcPos = lCurSrcPos + 1
        Next lGroupPos

        '---Convert 24-bit number into four six-bit values
        iByte4 = (lBase64Value And 16515072) \ 262144
        iByte3 = (lBase64Value And 258048) \ 4096
        iByte2 = (lBase64Value And 4032) \ 64
        iByte1 = (lBase64Value And 63)

        sGroup = Mid$(sBase64Alphabet, iByte4 + 1, 1) & _
                 Mid$(sBase64Alphabet, iByte3 + 1, 1) & _
                 Mid$(sBase64Alphabet, iByte2 + 1, 1) & _
                 Mid$(sBase64Alphabet, iByte1 + 1, 1)

        '---Replace output characters that hold no input data with "="
        If lBytesInGroup = 1 Then
            sGroup = Left$(sGroup, 2) & "=="
        ElseIf lBytesInGroup = 2 Then
            sGroup = Left$(sGroup, 3) & "="
        End If

        sOutputString = sOutputString & sGroup
        lOutputLen = lOutputLen + 4
    Loop

    EncodeBase64 = sOutputString
End Function
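
A quick way to exercise the two functions is a round trip over inputs of different lengths, so that each padding case is covered. The routine below is a sketch only. TestBase64RoundTrip is a hypothetical name, and it has to live in the same module as the functions because both are declared Private.

```vb
' Hypothetical test routine. Place it in the same module as
' EncodeBase64 and DecodeBase64, since both are declared Private.
Public Sub TestBase64RoundTrip()
    Dim vSample As Variant
    Dim sOriginal As String
    Dim sEncoded As String
    Dim sDecoded As String

    ' Lengths 1 to 4 cover each possible remainder when dividing by three
    For Each vSample In Array("s", "sc", "scr", "scr!")
        sOriginal = CStr(vSample)
        sEncoded = EncodeBase64(sOriginal)
        sDecoded = DecodeBase64(sEncoded)

        Debug.Print sOriginal & " -> " & sEncoded & _
                    "  round trip OK: " & CStr(sDecoded = sOriginal)
    Next vSample
End Sub
```

Run it from the Immediate window and check that every line reports True.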