2021-01-03 - UFO Files
Image by PrinceKickster, CC0, via Wikimedia Commons
Back in the day the TV programme “The X Files” was a big thing. We actually used to wait an entire week for a new episode! Anyway, I’ve started to re-watch the series (seem to have lots of home time this year!) and in doing so a curious question arose; “what are the UFO reported sightings like in the UK?", what are the numbers, where, when etc.
I’m not really looking for too much detail, just curious. A quick google search and I’m directed to the National Archive and then on to a publication listing from the Ministry of Defence which contains a list of UFO sighting reports recorded between 1997 and 2009. This is linked from the highlights guide, page 8 which states “a spreadsheet” but instead of a spreadsheet there’s a series of PDF documents. OK so not recent but still good for the same time the programme was first broadcast.
That seems like a nice little learning challenge, transforming the PDF documents to a usable format for further analysis. I’d recently watched a presentation on “Datasette” by Simon Willison which is a “tool for exploring and publishing data” and was eager to try it out on this data but also to have the potential to use other tools if I wanted.
Now, before we get started, I’m sure there are many ways of doing this better but this is what I did and it worked for me :-)
Firstly, I needed to pull each PDF file from the website. A little shell script and we can get the URLs for each PDF file and store them in a file for further use:
wget https://www.gov.uk/government/publications/ufo-reports-in-the-uk -P ../srchtml
grep '<a href="' ../srchtml/ufo-reports-in-the-uk | grep pdf | cut -d '"' -f4 > ../urls/urls.txt
Now I had the urls listing I can loop through them and download the files (I guess this could have been done in a single step but hey, this is what I did):
while IFS="" read -r url; do
wget $url -P "../srcpdfs"
done < ../urls/urls.txt
Now I had the PDF files I did visual inspection of each which highlighted a few points of note when thinking about how to process them:
- There are multiple pages in each file
- There is a table like display (grid lines) that contains data that is spread across those pages in each file
- There is a consistent number of columns to the table, across each page and each file
- There is a header row on page 1 of the table shown, but the text within each cell has minor differences across the files
- In some files the header row of the table displayed is repeated on each page but not in all files
- In some files the bottom border of the final row of the table on page 1 is missing (this will be a problem to solve)
- In some instances there are empty rows in the table, all cells are displayed with no content
- Cell content appears to contain line breaks/returns
On to transforming the PDF tables to CSV. There were quite a few python libraries that came up when I was trying this out. The two I tried were Camelot and PDFPlumber.
Camelot
The Camelot cli got me a fair part of the way to where I wanted to be, probably 90% plus. The documentation on the cli tool could have provided more of an example of use (it’s probably there somewhere) but here’s what I used for an individual PDF file whilst testing:
camelot -p 1-end -o ufo_report_2009.csv -f csv lattice ufo_report_2009.pdf
-p is the pages and here I’m specifying start from page 1 and go to the end, by default it will just act on the first page
-o is the output file name. Actually the first part of the file name generated. There’s one per page named something like namefromoption-page-number-table-number.csv e.g. ufo_report_2009-page-1-table-1.csv)
-f is the output format and here I’m specifying csv
lattice is the command mode that is being used to identify the table and cell boundaries. Lattice uses the lines between text
ufo_report_2009.pdf is the file to process
To mass process the pdf files to csv the following loop was run:
for filename in ../srcpdfs/*.pdf; do
camelot -p 1-end -o ../csvs/"${filename##*/}".csv -f csv lattice $filename
done
For each file there is now 1 csv file per page representing the table on the page. The text in each field in the data in the csv files is enclosed in quotation marks and delimited with a comma. If there is no header row to the table shown on the page in the PDF then there’s no header to the csv file. Pandas was used to generate a combined PDF file and apply these data cleansing steps:
- Remove header rows, especially important where the files repeat the headers
- Remove line breaks from each field
- Remove empty rows
- Adjust the value of Area where the Area has been included in the Town / Village as interpreted by Camelot
- Add page number, report year and file name
- Apply consistent header and export
The resulting file output gave me a single csv file with 2925 rows inclusive of a single header row. The code for this is in the repository as a Jupyter Notebook. The same can be achieved without calling the cli, an example is also in the repository in a separate Notebook.
PDFPlumber
PDFPlumber also offers a CLI option but as best as I could tell it doesn’t extract tables of data to nicely formatted files in the same way that Camelot does. The documentation [link] gives a good starting point on which to build a pattern for processing which was similar for the most part with the Camelot pattern above.
For each file:
- Open with PDFPlumber to return an instance pdfplumber .PDF class
- For each page in the file:
- Extract the table using the extract_table() method which returns a list from which a Pandas dataframe is generated (essentially 1 per page) Note that for 2 PDFs, I had to pass a custom setting tp extract_table() to add a line to the bottom of the table on page 1 in order for the last row of data to be recognised, see the Notebook in the repository that details identifying the values used to draw the lines.
- Remove header rows (including repeated)
- Remove line breaks and empty rows
- Adjust instances where the last letter of the Town / Village should be the first letter of the Area (this is different to the way that Camelot had extracted these instances)
- Add page number, report year and file name
- Apply a consistent header and export useful for testing/debug against single years
- Combine each years csv file into a single csv file with a single header row
The resulting file output gave me a single csv file with 2925 rows inclusive of a single header row. Again, the code for this is in a Notebook in the repository.
Now enough of this mucking around with files, let’s start looking at some data. Well, I lied, a bit more work but not much more… We can use the csvs-sqlite utility to convert the output csv file to a a sqlite database:
csvs-to-sqlite ufo_all_data.csv ufo_activity_all.db
Then use Datasette to serve the data for our investigations:
datasette ufo_activity_all.db
What’s in the data
So now we have a nice web interface to a sqlite database from where we can start to form some casual (i.e there’s nothing scientific about this or is there?!) observations of the data using only well populated fields
- There are 2924 reports recorded between 1997 and 2009, with 2009 recording the highest number of reported sightings, 643
- London was the most common area/location with 140 reported sightings between 97 and 09, with Essex and Kent in the next 2 slots (note: image below only showing a Location where count of records > 25 across the entire data set)
- It appears that most sightings were in the evening hours which I guess would track given what we see in the short descriptions (below) and August appears to show the highest number of sightings
- From the short description supplied we can see some of the most common words associated with the records (no real surprises there)
To sum up…
This was a fun little exercise, converting pretty straightforward PDF files to csv using different libraries and then using datasette to view and interrogate the data. I’m sure there are better ways of achieving the same outcome but for me this is the way that I did it. Now, back to the X Files!