Python3, PyMARC, Unicode & File Opening

This is a subtle error, and there is a serious underlying computing issue behind it: encoding.

In Python 3 you need to make sure you open the file in the correct mode. In Python 2 the mode didn't really matter except for the line endings.

For PyMARC you want to open the file in binary mode, open('filename.mrc', 'rb'), so that you get bytes out of the file handle, not characters (class <str>).

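from pymarc import MARCReader

# Open in binary mode ('rb') so MARCReader receives bytes, not str.
with open("CanadaGovt.mrc", "rb") as fh:
    reader = MARCReader(fh)
    count = 0
    for record in reader:
        count += 1
        print(record.leader)
print(count)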

You can stop here and it should just work. Keep reading if you want an explanation, or skip to the end for a link to a really good presentation that explains the issue.

This is a bit strange and has to do with how Python 3 handles Unicode. Basically, we want to get the raw bytes out of the file.

MARCReader splits the transmission file into one binary chunk per record and puts each chunk into the Record class.
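
As a rough illustration of what that chunking involves, here is a minimal sketch that splits a binary stream into records using the five-digit length that starts every MARC record. It is a simplified stand-in and not pymarc's actual code; the name iter_raw_records and the two fake records are made up for the example.

```python
import io


def iter_raw_records(handle):
    """Yield each record from a binary file handle as a bytes object.

    Hypothetical helper: assumes every record begins with its total
    length written as five ASCII digits, as in a MARC leader.
    """
    while True:
        length_field = handle.read(5)
        if len(length_field) < 5:
            return  # end of file (or a truncated record)
        record_length = int(length_field.decode("ascii"))
        # The length counts the five digits too, so read the remainder.
        rest = handle.read(record_length - 5)
        yield length_field + rest


# Two fake "records": 12 bytes and 8 bytes long, lengths included.
fake_file = io.BytesIO(b"00012ABCDEFG" + b"00008XYZ")
chunks = list(iter_raw_records(fake_file))
print(chunks)
```

The lengths are counted in bytes, which is one reason the handle has to be binary: once text has been decoded, a single character may stand for several bytes and the counts no longer line up.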

Inside Record, which you have already opened up a bit, that binary chunk is then 'decoded' into strings, or in Python 3 terms, character strings that support Unicode.

To back up a little: there are two string-like classes in Python 3. One is "class <str>", a sequence of Unicode characters (https://docs.python.org/3/library/stdtypes.html#textseq). The other is "bytes" (https://docs.python.org/3/library/stdtypes.html#typebytes), a sequence of single bytes (integers in the range 0-255; ASCII covers only 0-127 of that range). When you read a file in 'rb' mode you get the bytes out. If you just use 'r', Python decodes the bytes for you and hands back characters, so a reader that expects raw bytes receives already-decoded text instead. The difference matters because one character can take several bytes: for example, the 'PIG' emoji is actually four bytes long in UTF-8 (b'\xf0\x9f\x90\x96').

Trying the following in a Python 3 interpreter may help:

>>> import unicodedata
>>> s = unicodedata.lookup("PIG")
>>> s
'🐖'
>>> s.encode('utf8')
b'\xf0\x9f\x90\x96'
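
The same difference shows up at the file level. The sketch below writes that one character to a temporary file and reads it back in both modes; the file name pig.txt and the variable names are just for illustration.

```python
import os
import tempfile

pig = "\N{PIG}"  # the same character as unicodedata.lookup("PIG")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pig.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(pig)

    # Binary mode: the raw bytes, exactly as stored on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: Python decodes the bytes into characters for us.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(type(raw), len(raw))    # bytes object, four bytes
print(type(text), len(text))  # str object, one character
print(raw.decode("utf-8") == text)
```

Passing encoding="utf-8" explicitly keeps the text-mode result from depending on the platform's default encoding.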

Python 2 is more complicated. It also has two string-like classes, but the roles are reversed. In Python 2, "class <str>" is a sequence of bytes and doesn't know anything about the characters inside it, while "class <unicode>" is a sequence of Unicode characters, like Python 3's "class <str>". To make things even more complicated, Python 2 implicitly coerces between the two types, turning something like 'hello' + u' world' into u'hello world', which Python 3 doesn't allow.

Python 2:

Python 2.7.6 (default, Sep 9 2014, 15:04:36) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> u = u'Hello' + ' World!'
>>> u
u'Hello World!'

Python 3:

Python 3.4.3 (default, Feb 25 2015, 21:28:45) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> u = 'Hello' + b' World!'
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly
>>>
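
In Python 3 the way around that TypeError is to convert explicitly, in whichever direction you actually want. A small sketch, assuming the bytes are plain ASCII (the variable names are only for illustration):

```python
greeting = "Hello"       # str: a sequence of characters
raw_tail = b" World!"    # bytes: a sequence of integers 0-255

# Decode the bytes to get a str result...
as_text = greeting + raw_tail.decode("ascii")

# ...or encode the str to get a bytes result.
as_bytes = greeting.encode("ascii") + raw_tail

# Mixing the two without converting still fails.
try:
    greeting + raw_tail
    mixed_failed = False
except TypeError:
    mixed_failed = True

print(as_text, as_bytes, mixed_failed)
```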

PS: there is a really good presentation from PyCon at http://nedbatchelder.com/text/unipain.html that deals with a lot of what I explained above. For extra bonus points, reading the pymarc module to see how it handles the Unicode sandwich would be a really good illustration.
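
For a feel of what the Unicode sandwich looks like in code, here is a minimal sketch: bytes on the outside, str in the middle. The function shout is hypothetical and is not taken from pymarc; it only shows the decode, process, encode shape.

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical example of the decode / process / encode pattern."""
    text = raw.decode(encoding)   # bottom slice: bytes in, decode early
    text = text.upper()           # filling: work only with str
    return text.encode(encoding)  # top slice: encode late, bytes out


result = shout("café 🐖".encode("utf-8"))
print(result.decode("utf-8"))
```

All of the character-level work happens on str, so the encoding only has to be named at the two edges.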
