Wednesday, 4 February 2009

Using binary data in python

For those who don't know, I am writing a python gui-frontend to an existing mysql database.

The original program that wrote out the data was written by an old-school C guy. Some of the data has been put into the tables in a "blob" format. Example
mysql> describe esttable;
+----------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+-------------+------+-----+---------+-------+
| serialno | int(11) | NO | PRI | 0 | |
| data | blob | YES | | NULL | |
+----------+-------------+------+-----+---------+-------+

so when I read this table with the MySQLdb module (a third party module for python) I get a string like this.
import MySQLdb
db=MySQLdb.connect(host="localhost",user="myUser",passwd="myPassword",db="myDatabase")
Cursor=db.cursor()
Cursor.execute('select data from esttable where serialno=31599')
Cursor.fetchall()
(('\x01\x00o\x00\n\x0f\x00\x00\x04\x00\xc9\x00\x00\x00\x00\x00\x01\x00
\xe9\x03"\x0b\x00\x00\x02\x00\x87\x05@\x1f\x00\x00',),)

You will note that this is packed with characters which need to be handled with care (understatement!)

So how to make sense of it?
One way is to use python's build in Struct module, which allows conversion between python and C types of variable.

after a bit of decoding.... I noticed the above is in chunks of 8 bytes, with /x00 at every 2,6,7 position. I also realise (by experiment) that the data contains 2 distinct unsigned doubles so the following code works for me.
import struct
def blobToList(blob):
pythonlist=[]
i=0
for i in range(0,len(blob),8):
number=struct.unpack_from('b',blob,i)[0]
item=struct.unpack_from('H',blob,i+2)[0]
cost=struct.unpack_from('H',blob,i+4)[0]
pythonlist.append("%s %s $%d.%02d"%(number,item,cost/100,cost%100))
return pythonlist
if __name__=="__main__":
testdata ='\x01\x00o\x00\n\x0f\x00\x00\x04\x00\xc9\x00\x00\x00\x00\x00'
testdata+='\x01\x00\xe9\x03"\x0b\x00\x00\x02\x00\x87\x05@\x1f\x00\x00'
print blobToList(testdata)

which gives me the following output
['1 111 $38.50', '4 201 $0.00', '1 1001 $28.50', '2 1415 $80.00']

and that makes more sense to me (just).

No comments: