As mentioned previously, I am following the Fast AI course and as part of the second lesson you are encouraged to develop an Image Classifier. An Image Classifier is a type of program which can look at an image and decide which bucket the image should go in. In the first lesson the example Image Classifier let you know if an image was a picture of a dog or a cat, and in this week’s lesson the Image Classifier distinguished between pictures of Brown, Grizzly and Teddy Bears.
In the previous post I discussed a couple of ideas but decided to go with the Tribunal records identification, as this seemed the easiest to get started with. As a reminder, the Cardiganshire War Tribunal records are a fascinating collection covering the communication between the community in Ceredigion and the Military Tribunals which decided whether applicants were allowed to avoid military conscription during WW1. The records were part of a National Library of Wales (NLW) crowdsourcing project working with volunteers both to transcribe the records and to retain a link between each transcription and the field on the form. The completed transcriptions were made available by the NLW and can be found on Paul McCann’s Github Page as IIIF Annotations.
The archive contains many different forms and supporting correspondence. My plan is to create a classifier which will identify the form type. This is now mostly an academic exercise, as all of the pages have already been identified and transcribed, but if someone else were to digitise a set of tribunal records they could in theory use this tool to identify which types of document they have. I can test this theory using the sample tribunal record copies in Manchester Archive and the National Archives.
To fit the data structure discussed in the course there should be a directory per category (or bucket) containing images in that set. So there should be 9 directories containing images for the following document types:
The dataset from Paul contains Annotation Lists for each of the different districts that make up the county of Cardiganshire (now called Ceredigion). The districts are:
Inside these files are lists of annotations, one annotation per page, each of which looks as follows:
The first job is to find the page tag and link it to the image it references. The tag is buried in the text of the annotation, and to pick it out we need to split the string, find "Tag:" and take the value that follows it. I used the following Python method to do this conversion:
This will return a hash table with all of the fields in the string separated out, which gives us the page tag.
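The real method is in typeImage.py, but a minimal sketch of this kind of field parser, using a made-up annotation text purely for illustration, could look something like this:

```python
def parse_fields(text):
    """Split an annotation's transcribed text into a field name -> value hash table.

    Assumes the text contains lines of the form "Name: value",
    e.g. "Tag: Form of Application".
    """
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            fields[name.strip()] = value.strip()
    return fields

# Made-up example text, for illustration only
example = "Tag: Form of Application\nDistrict: Aberystwyth"
print(parse_fields(example)["Tag"])   # -> Form of Application
```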
The second part of the task is to identify the image this annotation points to. To start we need to find the canvas id, which is held in the following field of the annotation:
If this were a set of images I didn’t know, I would find the manifest, look for this canvas id and then find the reference to the IIIF Image. Luckily I can take a shortcut, as I know the last number before the .json is actually the Image Identifier. I can then use the following to get access to the Image information:
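As a rough sketch, with a placeholder base URL standing in for the actual NLW IIIF Image API endpoint, fetching the image information (the info.json) could look like this:

```python
import requests

# Placeholder base URL; the real NLW IIIF Image API endpoint will be different
IIIF_BASE = "https://example.org/iiif/image"

def get_image_info(image_id):
    """Fetch the IIIF info.json for an image, which lists the sizes the server offers."""
    response = requests.get(f"{IIIF_BASE}/{image_id}/info.json")
    response.raise_for_status()
    return response.json()
```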
One decision that needed to be made was the size of image to request. The course recommends:
“We don’t have a lot of data for our problem (150 pictures of each sort of bear at most), so to train our model, we’ll use RandomResizedCrop with an image size of 224 px, which is fairly standard for image classification, and default aug_transforms”
(From ‘Training Your Model, and Using It to Clean your Data’)
so I can request an IIIF Image at this size by running:
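As a rough sketch, the request uses the IIIF Image API size parameter to ask for a 224-pixel-wide copy; the base URL here is again a placeholder rather than the real NLW endpoint:

```python
import requests

IIIF_BASE = "https://example.org/iiif/image"  # placeholder, not the real NLW endpoint

def download_image(image_id, out_path, width=224):
    """Ask the IIIF Image API to scale the image to the given width and save it."""
    url = f"{IIIF_BASE}/{image_id}/full/{width},/0/default.jpg"
    response = requests.get(url)
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)
```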
The full code for the program to extract the tags and images from the Annotation Lists is called typeImage.py and can be found on GitHub. Once this program has run you will have a directory for each tag containing the following number of images:
To access it in the next steps I needed to make it public, so I zipped it up and put it here: TribunalTypeImages.zip.
The next part of the task was to take this zip file of images and train a model that can identify the different forms. The course recommends doing this by creating a Jupyter Notebook. I haven’t quite got my head around how these notebooks fit in with or compare to virtual machines, Docker or Vagrant, but they allow you to run code and mix in Markdown descriptions to comment on the process. Using Google Colab you can run these Jupyter notebooks on Google’s virtual machine infrastructure for free. One of the main advantages of this approach is that Colab provides virtual machines with performant graphics cards, which appear essential for training models. This was mentioned in the first lesson; I had a go at running the Jupyter notebook locally and, even though my machine isn’t slow, it took a surprisingly long time to train the basic example models. Even though I was sorely tempted, lesson one advised against spending (wasting) time building your own computer with a powerful graphics card and recommended just using the free or low-cost options linked to in the course.
The Jupyter notebook is embedded in this blog as a read-only version, with the saved output from when I ran it.
You should be able to open a version you can run yourself by clicking the Open in Colab button below. In the Colab version, when you see a block of code you should be able to run it by moving your mouse cursor to the top left part of the code block; a play button should then appear. You should run the code blocks in order and wait for each one to finish before moving on to the next.
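The embedded notebook contains the full detail, but the core fastai steps look roughly like the sketch below; the dataset path, architecture and number of epochs here are assumptions rather than a copy of the notebook:

```python
from fastai.vision.all import *

# Assumes the zip has been downloaded and extracted so there is one folder per tag
path = Path("TribunalTypeImages")

dls = ImageDataLoaders.from_folder(
    path,
    valid_pct=0.2,                               # hold back 20% of images for validation
    seed=42,
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms(),                 # the default augmentations
)

learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

# Look at the images the model got most wrong
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9)
```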
I was really pleased to see the results of the Image Classification. It worked on most of the images, and for the ones where it didn’t there seemed to be clear reasons why not. It would be interesting to look further into the top losses to see if they could be used to identify data that needs correcting. The only real downside to this project is the lack of applicability to real-world problems. It’s very specific to this collection, but I wonder if the techniques could be applied to identify different types of objects in a collection. One thought I’ve had is whether you could train a classifier to identify types of images, e.g. Maps, Newspapers, Manuscripts etc. This may be something I try next, although finding the source data will be more challenging.