Python3, PyMARC, Unicode & File Opening

Posted on April 3, 2015 by Sean Chen

A subtle error, with a serious underlying computing issue behind it: encoding.

You need to make sure you are opening the file in the correct mode in Python3. In Python2 it didn’t really matter except for the line endings.

For PyMARC you want to open the file in binary mode (open('filename.mrc', 'rb')) so you get bytes out of the file handle, not characters (class <str>).

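from pymarc import MARCReader

# Open in binary mode ('rb') so MARCReader receives bytes, not str.
with open("CanadaGovt.mrc", "rb") as fh:
    reader = MARCReader(fh)
    count = 0
    for record in reader:
        count += 1
        print(record.leader)
print(count)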

For more you can keep going… but stop here and it should just work. Keep going if you want an explanation or skip to the end for a link to a really good presentation that explains the issue.

This is a bit strange and has to do with how Python 3 handles Unicode. Basically we want to get the raw binary out of the file.

The MARCReader chunks the transmission file record by record and puts each binary chunk into the Record class.
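
As a rough illustration of that chunking (a minimal sketch, not pymarc's actual code): each MARC 21 record begins with a leader whose first five bytes give the total record length as ASCII digits, so a reader can slice records out of a binary file handle one at a time. The helper name iter_raw_records and the two fake records are made up for this example.

```python
import io


def iter_raw_records(fh):
    """Yield each record from a binary file handle as a bytes object.

    Hypothetical helper for illustration; assumes every record starts
    with a 5-digit ASCII length field, as in the MARC 21 leader.
    """
    while True:
        first5 = fh.read(5)
        if not first5:
            break  # clean end of file
        if len(first5) < 5:
            raise ValueError("truncated record length")
        length = int(first5)  # int() accepts ASCII digits given as bytes
        rest = fh.read(length - 5)
        if len(rest) < length - 5:
            raise ValueError("truncated record")
        yield first5 + rest


# Two fake "records" whose first five bytes state their own length.
fake_file = io.BytesIO(b"00010abcde" + b"00008xyz")
chunks = list(iter_raw_records(fake_file))
print(chunks)
```

The lengths count bytes, not characters, which is why the handle has to be opened in 'rb' mode: in text mode read(5) would count decoded characters and the offsets would drift as soon as a record contained non-ASCII data.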

Inside Record (which you’ve actually opened up a bit) that binary chunk is then ‘decoded’ into strings, or in Python 3 terms: character strings which support Unicode. To back up here: there are two “string”-like classes in Python 3. One is “class <str>”, which is a sequence of Unicode characters (https://docs.python.org/3/library/stdtypes.html#textseq). The other is “bytes” (https://docs.python.org/3/library/stdtypes.html#typebytes), which is a sequence of single bytes (integers in the range 0-255; only the values 0-127 correspond to ASCII). When you read a file in ‘rb’ mode you get the bytes out. If you just do ‘r’, Python decodes the bytes into characters for you, so a reader that expects raw bytes and counts byte lengths gets characters instead, and a single character can take several bytes. For example the ‘PIG’ emoji is actually four bytes long in UTF-8 (b'\xf0\x9f\x90\x96').

If you do the following in a python3 interpreter I hope it helps:

>>> import unicodedata
>>> s = unicodedata.lookup("PIG")
>>> s
'🐖'
>>> s.encode('utf8')
b'\xf0\x9f\x90\x96'
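
The same distinction shows up at the file level. Here is a small sketch (the scratch file name pig.txt is just an assumption for the demo) that writes the PIG emoji to disk and reads it back in both modes:

```python
import os
import tempfile

# Write the PIG emoji to a scratch file as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "pig.txt")
with open(path, "wb") as fh:
    fh.write("\N{PIG}".encode("utf8"))

# Binary mode: you get the raw bytes back, untouched.
with open(path, "rb") as fh:
    raw = fh.read()

# Text mode: Python decodes the bytes into characters for you.
with open(path, "r", encoding="utf8") as fh:
    text = fh.read()

print(type(raw), len(raw))    # bytes, 4 of them
print(type(text), len(text))  # str, 1 character
```

Same file, same content, but 'rb' hands you four bytes and 'r' hands you one character.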

Python 2 is much more complicated. There are also two string-like classes in Python 2, but the roles are reversed: “class <str>” is a sequence of bytes and doesn’t know anything about the characters inside of it, while “class <unicode>” is a sequence of Unicode characters, like Python 3’s “class <str>”. To make things even more complicated, Python 2 lets you do operations which implicitly coerce between the two types, changing something like 'hello' + u' world' into u'hello world', which is something Python 3 doesn’t allow.

Python 2:

Python 2.7.6 (default, Sep 9 2014, 15:04:36) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> u = u'Hello' + ' World!'
>>> u
u'Hello World!'

Python 3:

Python 3.4.3 (default, Feb 25 2015, 21:28:45) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> u = 'Hello' + b' World!'
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly
>>>

PS: there is a really good presentation from PyCon at http://nedbatchelder.com/text/unipain.html that deals with a lot of what I explained above. For extra bonus points, eyeing the pymarc module to see how it handles the Unicode sandwich would be a really good illustration.
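
If you haven't met the term: the "Unicode sandwich" means bytes on the outside, str on the inside. You decode as early as possible, do all your work on str, and encode as late as possible. A toy sketch of the idea (the function shout is hypothetical, and this is not how pymarc is written):

```python
def shout(raw, encoding="utf8"):
    """Hypothetical example: decode on input, work in str, encode on output."""
    text = raw.decode(encoding)    # bytes -> str at the input edge
    text = text.upper() + "!"      # all processing happens on str
    return text.encode(encoding)   # str -> bytes at the output edge


result = shout("café".encode("utf8"))
print(result)
print(result.decode("utf8"))

# The Python 3 way to mix the two types: convert explicitly.
greeting = "Hello" + b" World!".decode("ascii")
print(greeting)
```

The last two lines are the explicit version of the concatenation that fails in the Python 3 session above.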

Waving a Dead Fish

Posted on March 27, 2015 by Sean Chen

I’ve been using Vagrant & VirtualBox for development on my OS X machines for my solo projects. But in an effort to get an intern started on developing a front-end to a project I started a while ago, I ran into a really strange problem getting Vagrant working on Windows.

So as a tale of caution for whatever robot wants to pick up this bleg.

Bootcamp partition on a Mid-2010 MacBook Pro. Running a dormant OS X and a full Windows 7. The Windows 7 is the main environment:

Install Vagrant
Install VirtualBox
Install Git for Windows

Use the git bash shell since it has SSH to stand up the boxes with vagrant init, vagrant up.

And then stuck (similar to Vagrant stuck connection timeout retrying):

==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
    default: Adapter 2: hostonly
==> default: Forwarding ports...
    default: 22 => 2222 (adapter 1)
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...
    default: Error: Connection timeout. Retrying...

Well, we booted into the VM with a head and it looked like the boot got interrupted by some sort of kernel panic due to:

Spurious ACK on isa0060/serio0. Some program might be trying to access hardware directly.

Ok makes sense…the machine isn’t booting up and there has to be a reason why.

Long story short. The Windows 7 partition didn’t have virtualization enabled, and there is no BIOS setting or switch somewhere to do it. So what do you do:

How to enable hardware virtualization on a MacBook?

Like waving a dead fish in front of your computer.

Boot into OS X.
System Preferences > select the Startup Disk preference pane
Select the Boot Camp partition with Windows
Restart into the Boot Camp partition
Magic

Go figure

RDA Treaties (April 2014)

Posted on May 13, 2014 by Sean Chen

So the instructions for access points for treaties changed significantly with the April update to Resource Description and Access (see the LC Summary of RDA Updates for April 2014 Update of the RDA Toolkit).

To summarize, the biggest changes are:

Treaties, etc. is no longer the preferred title for treaties. Instead the preferred title is the name of the treaty in an official source, legal literature or other official designation (RDA 6.19.2.7)
The signatory doesn’t form part of the authorized access point for either multilateral or bilateral treaties. (RDA 6.29.1.15)
- NOT: ~~110 1# Canada. ǂt Treaties, etc. ǂd 1992 October 7~~
- NOW: North American Free Trade Agreement ǂd (1992 October 7)
- NOT: ~~Malaysia. ǂt Treaties, etc. ǂg United States, ǂd 2006 July 28~~
- NOW: Treaty Between the United States of America and Malaysia on Mutual Legal Assistance in Criminal Matters ǂd (2006 July 28)
Some other things to note include:
- The sources for the name of the treaty are given in order of preference, which lets you name the treaty in a logical manner.
- Capitalization should follow guidance as mentioned in the RDA Appendix A.18 Names of Documents.
- Variant access points should be added through the signatories (RDA 6.29.3.3)

So how about an example:

010    no2014066728
040 ## NcD-L ǂb eng ǂe rda ǂc NcD-L
046 ## ǂk 18801117
130 #0 Treaty Between the United States and China, Concerning Immigration
       ǂd (1880 November 17)
377 ## eng
380 ## Treaties ǂ2 lcsh
410 1# United States. ǂt Treaty Between the United States and China,
       Concerning Immigration ǂd (1880 November 17)
410 1# China. ǂt Treaty Between the United States and China, Concerning Immigration
       ǂd (1880 November 17)
430 #0 Treaty Concerning the Immigration of Chinese ǂd (1880 November 17)
430 #0 Angell Treaty ǂd (1880 November 17)
670 ## Treaty, laws, and regulations governing the admission of Chinese, 1906:
       ǂb page 3 (Treaty Between the United States and China, Concerning
       Immigration; Treaty Concerning the Immigration of Chinese; Concluded
       November 17, 1880; ratification advised by Senate May 5, 1881; ratified
       by the President May 9, 1881; ratification exchanged July 19, 1881;
       proclaimed October 5, 1881; 22 Stat. 826)
670 ## United States statutes at large, volume 22 (1864-1883): ǂb page 826
       (Treaty between the United States and China, concerning immigration)
670 ## Office of the Historian, U.S. Department of State, viewed May 13, 2014:
       ǂb content (In 1880, the Hayes Administration appointed U.S. diplomat
       James B. Angell to negotiate a new treaty with China. The resulting
       Angell Treaty permitted the United States to restrict, but not
       completely prohibit, Chinese immigration) 
       ǂu http://history.state.gov/milestones/1866-1898/chinese-immigration

Some things to note:

Date of the Treaty; just taking the date of signing/earliest date associated.
Preferred title: as it is in Statutes at Large. If it was a later treaty (post 1950) I’d use other sources like United States Treaties and Other International Agreements or Treaties and Other International Act Series.
Capitalization as specified in A.18.

org-mode ctrl-a & ctrl-e

Posted on March 26, 2014 by Sean Chen

I had been using a customized ctrl-a and ctrl-e (beginning-of-line and end-of-line) in my Emacs.

(defun smart-beginning-of-line ()
  "Move point to first non-whitespace character or beginning-of-line.

If point was already at that position, move point to beginning of line."
  (interactive)
  (let ((oldpos (point)))
    (back-to-indentation)
    (and (= oldpos (point))
         (beginning-of-line))))

For those of you who are OS X users: these are basic Emacs keybindings, which OS X binds in a similar way out of the box.

Org-mode has been my note taking, todo list, and everything else for a while. But one thing is that the keybindings haven’t quite been right. In a heading like

* Tasks

ctrl-a should go to the logical beginning of the heading (the T in Tasks). Instead, in Org-mode the cursor would jump to the true beginning of the line. Uhuh, that makes sense but it isn’t what my brain _really_ wants.

Thus the amazingness of org and having a setting for everything:

org-special-ctrl-a/e

Which smartly moves the cursor to where it should belong.

Thus the docstring speaketh:

org-special-ctrl-a/e is a variable defined in `org.el'.
Its value is t
Original value was nil

Documentation:
Non-nil means `C-a' and `C-e' behave specially in headlines and items.

When t, `C-a' will bring back the cursor to the beginning of the
headline text, i.e. after the stars and after a possible TODO
keyword. In an item, this will be the position after bullet and
check-box, if any. When the cursor is already at that position,
another `C-a' will bring it to the beginning of the line.

`C-e' will jump to the end of the headline, ignoring the presence
of tags in the headline. A second `C-e' will then jump to the
true end of the line, after any tags. This also means that, when
this variable is non-nil, `C-e' also will never jump beyond the
end of the heading of a folded section, i.e. not after the
ellipses.

When set to the symbol `reversed', the first `C-a' or `C-e' works
normally, going to the true line boundary first. Only a directly
following, identical keypress will bring the cursor to the
special positions.

This may also be a cons cell where the behavior for `C-a' and
`C-e' is set separately.

You can customize this variable.

Gotcha: sRGB, Emacs 24, themes

Posted on March 9, 2014 by Sean Chen

I’ve been working with the Solarized color theme in my Emacs for a while. The Homebrew recipe for Emacs has an option to pull in a patch which corrects the Cocoa port of Emacs to handle sRGB colors correctly. But for the longest time I couldn’t get the colors to exactly line up with the references.

But I finally figured out that the theme was expecting a variable to be set:

(setq solarized-broken-srgb nil)

From the customize information:

Emacs bug #8402 results in incorrect color handling on Macs. If this is t (the default on Macs), Solarized works around it with alternative colors. However, these colors are not totally portable, so you may be able to edit the “Gen RGB” column in solarized-definitions.el to improve them further.

The gotcha is that if you set this through customize, the default custom.el generally loads after init.el in a lightly managed Emacs. So if you thought you were setting the variable in customize and it would work, you are wrong, since themes are normally loaded from your init.el, either through a separate library or, as in mine, directly.

So for me to load solarized with correct sRGB support:

(setq solarized-broken-srgb nil)
(load-theme 'solarized-dark t)

Installing Jekyll on OSX 10.9

Posted on January 19, 2014 by Sean Chen

Installing Jekyll

Recipe for installing RVM + Latest Stable Ruby + Jekyll on OS X 10.9. This is mostly so I can experiment with using GitHub Pages to publish web sites. Loosely following the instructions from GitHub on how to set Jekyll up.

Install Homebrew

$ ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"

Install RVM + Latest Stable Ruby

RVM isn’t provided as a formula in homebrew/homebrew since RVM installs on a per-user basis and does some other non-homebrew’y stuff. I’m depending on RVM’s autolibs feature to install all its dependencies automatically using Homebrew.

$ \curl -sSL https://get.rvm.io | bash -s stable --ruby=

RVM will run and your shell should be set up to use the new Ruby.

Other options include using the system ruby as the default and then invoking rvm to call the relevant ruby we want.

$ rvm --default use system
Now using system ruby.
Now using system ruby.
Warning! Executable 'ruby' missing, something went wrong with this ruby installation!
Warning! Executable 'gem' missing, something went wrong with this ruby installation!
Warning! Executable 'irb' missing, something went wrong with this ruby installation!
$ ruby -v
ruby 2.0.0p247 (2013-06-27 revision 41674) [universal.x86_64-darwin13]
$ which ruby
/usr/bin/ruby

In that case I go back to the newly installed ruby to isolate it from the system environment:

$ rvm use 2.1.0

Install Jekyll with RubyGems

Bundler is installed automatically by RVM, so dependencies should be installed!

$ gem install jekyll

It Lives

$ jekyll new newProject
New jekyll site installed in /Users/gugek/Desktop/newProject.
$ cd newProject/
$ jekyll serve
Configuration file: /Users/gugek/Desktop/newProject/_config.yml
            Source: /Users/gugek/Desktop/newProject
       Destination: /Users/gugek/Desktop/newProject/_site
      Generating... done.
    Server address: http://0.0.0.0:4000
  Server running... press ctrl-c to stop.

org-mode agenda

Posted on March 30, 2013 by Sean Chen

I’ve been using Emacs’ org-mode to handle project and task tracking. It wasn’t clear to me what a number of the views in the agenda mode did until I had to go back and see everything I’ve been doing for the last year:

‘a’ Agenda for current week or day

Week view of your (active) TODOs. With some options you can see archived and hidden events and TODOs. The default brings in any TODO that has an active timestamp or is scheduled. After you bring it up you can then change the view to include everything from the last month, year, or arbitrary date. This is the view you need if you want to see completed tasks in an archive: use the ‘Log-All’ function when you have the view up, and add the archive option if you are using an archive file.

‘L’ timeline

Timeline view of all date tagged items in the current org-mode buffer. Strangely, this view doesn’t respond to any of the agenda options, except for viewing things in logged format. You’d think it could give you an overview but it doesn’t.

‘t’ List of all TODOs

This is the list of active TODOs. A tasklist, which is configurable with a number of options to sort and surface the particular ones to the top.

Authorized Access Points

Posted on March 29, 2013 by Sean Chen

I’ve been using RDA for original cataloging since October 2012 at MPOW. With authority records things have been great. There is a lot more flexibility to add and not add things. One hangup is the transition period where records need to be evaluated. Some weirdnesses I’ve encountered have been titles of nobility; see: 100 1# Vitzthum, Wolfgang, $c Graf.

Other issues include media types for streaming media. Actually digital files in general are handled poorly. Everything needs a carrier. What exactly is the carrier for a file that is sitting on a filesystem in the cloud?

338 ## other $2 rdacarrier

or perhaps if it is published:

338 ## online resource $2 rdacarrier

The file characteristics get handled elsewhere; not a terribly horrible thing, but not exactly intuitive.

More thoughts later.

Found in my ALA Cataloging Rules for Author and Title Entries

I can’t imagine a day when librarians had to correspond by typewritten letter when they needed a cataloging rule answer.

Textarea’s Firefox (Windows 7) & Safari (OS X Lion)

Posted on May 3, 2012 by Sean Chen

I was looking into using org-mode in Emacs (I am a vi user, though I had started as an Emacs user in college). But I came across a really amazing Stack Overflow question and answer on using Emacs’ org-mode, Markdown, and a Firefox plugin, It’s All Text. The idea of being able to edit a textarea in my editor of choice was obvious in hindsight, so I went ahead and started using the plugin, calling vim instead, and modifying the relevant ftdetect and ftplugin files to handle the hosts that I would be editing in.

So for example, I edit a Confluence wiki for project tracking and documentation. My ftdetect then for the confluencewiki.vim syntax is:

" confluence filetype file
au BufRead,BufNewFile *.cfl,*.confluence set filetype=confluencewiki
au BufRead,BufNewFile wiki.duke.edu.*.txt set filetype=confluencewiki

The plugin pulls the text out of the textarea, adds it to a vim buffer and then lets me edit it, while monitoring the file to update the control in the webpage. Pretty awesome.

Of course after using it for half a day on Windows at MPOW, I had to do it on my MacBook Air, where I use Safari. I know there are reasons to not use Safari, but my big reason for using it is that I remain consistent with all the Fluid site-specific browsers I use for a number of important sites: Gmail, Google Docs, Facebook. Of course Safari’s extension mechanism isn’t as well used as Firefox’s, but it seems the itch to edit a textarea with an external editor isn’t limited to Firefox: thus Quick Cursor, which basically acts as a sophisticated copy-and-paste clipboard into any control that accepts text in OS X. I got the impression that the operating system used to have some other mechanism to achieve something similar, but for whatever reason that has been deprecated.

schenizzle

Libraries, Classification, Cataloging, Carrboro, Duke

Author Archives: Sean Chen
