Text Processing and Computer-assisted Reading
"I still love books. Nothing a computer can do can compare to a book. You can’t really put a book on the Internet. Three companies have offered to put books by me on the Net, and I said, ‘If you can make something that has a nice jacket, nice paper with that nice smell, then we’ll talk.’ "
― Ray Bradbury
In the quote above, Ray Bradbury, a famous American author, argues that reading on a computer cannot compare to reading an actual book. To what extent is this claim right? In today's post, I include this quote only to make clear that I won't be comparing the comfort of reading a book with that of reading on a computer screen. Instead, I will explain how a computer can be used to analyze a large collection of written documents efficiently, using the US State of the Union Addresses and Tanzanian Swahili newspapers as case studies.
Since 1790, United States presidents have reported to Congress on the state of the union, as the Constitution requires. Over the years, the address has been delivered sometimes as a speech and sometimes as a written message. The Programming Historian shows an interesting example of how computer-assisted reading can be used to analyze patterns over time within a particular set of texts, using the State of the Union Addresses as a case study. Computer-assisted reading can carry out quite complicated analyses in a fraction of the time a human reader would need.
In the case of the State of the Union Addresses, the computer-assisted reading is done in the R language. For it to be successful, the documents must be properly prepared. First, they need to be converted to plain text files (.txt), a format that the computer can read easily.
It would be interesting to apply a similar approach to Swahili newspapers from Tanzania, the country I come from. Tanzania has gone through different historical phases, including colonial and post-colonial periods. I am most interested in working with the post-colonial newspapers, because they are easily available from library archives.
Using text analysis, I would like to investigate the most frequent words in the newspapers and see if there is any correlation with the political movements and changes that happened in the country at the time. I would expect the most frequent words to shift as the country moved from one mode of production to another. Examples include the shift of Tanzania's economy and politics from socialism to capitalism in 1992 and the wide adoption of the internet in the 2000s.
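To give a feel for what "most frequent words" means in practice, here is a minimal sketch. The analysis described in this post would be done in R; I use Python here only for illustration. The function name, the sample sentence, and the stopword list are all made up for this example and do not come from any real newspaper.

```python
import re
from collections import Counter

def most_frequent_words(text, n=10, stopwords=frozenset()):
    """Return the n most common words in text, ignoring case and stopwords."""
    # Keep runs of letters only. This is a deliberately simple tokenizer:
    # it would split Swahili words written with an apostrophe (e.g. ng'ombe),
    # so a real analysis would need a more careful rule.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Invented sample sentence, used only to show the output shape.
sample = "Uhuru na kazi. Uhuru ni kazi, na kazi ni maendeleo."
print(most_frequent_words(sample, n=3, stopwords={"na", "ni"}))
# → [('kazi', 3), ('uhuru', 2), ('maendeleo', 1)]
```

Removing very common function words (here "na" and "ni") is a choice rather than a requirement, but without it the top of the list tends to be dominated by words that say little about the topics of the day.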
To perform this analysis in R, the newspapers have to be converted to text files. It would also be better to name the files so that they include the publisher and the publication year. This way, it will be possible to tell whether certain publishers have a preferred set of words, and to analyze how the list of most frequent words varies over time or around political events.
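The naming idea can be sketched as follows. Again, this is Python used for illustration rather than the R workflow, and everything specific in it is an assumption of mine: the `<publisher>_<year>.txt` pattern, the function names, and the example file name are hypothetical.

```python
import re
from collections import Counter, defaultdict
from pathlib import Path

# Hypothetical naming convention: <publisher>_<year>.txt
FILENAME_PATTERN = re.compile(r"^(?P<publisher>.+)_(?P<year>\d{4})\.txt$")

def parse_filename(name):
    """Return (publisher, year) for a matching file name, else None."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return match.group("publisher"), int(match.group("year"))

def top_words_by(folder, key="year", n=5):
    """Count words per year (or per publisher) across all .txt files in folder."""
    grouped = defaultdict(Counter)
    for path in sorted(Path(folder).glob("*.txt")):
        parsed = parse_filename(path.name)
        if parsed is None:
            continue  # skip files that do not follow the naming convention
        publisher, year = parsed
        group = year if key == "year" else publisher
        text = path.read_text(encoding="utf-8").lower()
        grouped[group].update(re.findall(r"[^\W\d_]+", text))
    return {group: counts.most_common(n) for group, counts in grouped.items()}

print(parse_filename("publisherA_1991.txt"))
# → ('publisherA', 1991)
```

Calling `top_words_by("some_folder", key="publisher")` would group the counts by publisher instead of by year, which is the comparison between publishers described above.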
To conclude, none of these analyses is possible without digitized texts. So if I could have said a few words to Ray Bradbury before he passed away, I would have said, ‘digitized texts are nicer than a jacket’.