Posted on: 5 December, 2018
Application deadline: January 22, 2019

Web archives and cities: mining the web to learn our cities, University of Birmingham, UK

Tuesday, January 22, 2019
Competition Funded PhD Project (European/UK Students Only)

Alan Turing Institute Doctoral Scholarship, University of Birmingham

Project Description

This project will explore how the dynamics of cities are reflected in, and can therefore be sensed by mining, online content. Not only do cities change, but they change faster than secondary data are updated. Moreover, cities change in myriad ways, so official sources of structured, secondary data may not be able to capture these changes. This project will utilise an innovative data source of billions of archived web pages under the .uk domain during the period 1996-2013. It will exploit the unstructured textual data contained in these webpages in order to understand the changes that cities in the UK have undergone. An essential element of this process will be the geolocation of these data. Specifically, this project will answer the following key research questions:

• How are the dynamics of the UK urban system reflected in online content?
• Can we detect or even predict the dynamics of the inner structures of cities in the UK by mining online content?
• Can we understand urban functions and create urban typologies by using online content? How does such a ‘digital’ understanding of cities compare with our long-established understanding based on traditional data sources?

The underpinning logic of this project is that the web contains valuable information about our cities, for example, regarding their physical and functional characteristics, the economic and social activities that take place within them, the intensity of these activities, and their spatial and temporal signatures. Information about these characteristics can be extracted from web pages using corpus linguistics methods. Such information can be longitudinal, depending on the temporal dimension of digital archives. What is more challenging is the spatial dimension of these data, which depends on geolocation processes. Various approaches can be suggested, for example interrogating these archived web pages using place names, gazetteers or actual UK address elements such as postcodes.
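As an illustration of the postcode-based approach mentioned above, the following Python sketch pulls UK postcodes out of a page's text. It is only a minimal example under stated assumptions: the regular expression is a simplified approximation of the postcode format rather than the official validation rule, and the function names and the sample sentence are hypothetical, not part of the project's tooling.

```python
import re

# Simplified approximation of the UK postcode format (an assumption, not the
# official validation rule): outward code such as "B15" or "SW1A", an optional
# space, then an inward code such as "2TT".
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s?([0-9][A-Z]{2})\b")


def extract_postcodes(text):
    """Return normalised postcodes in order of first appearance, without duplicates."""
    found = []
    for outward, inward in POSTCODE_RE.findall(text.upper()):
        postcode = f"{outward} {inward}"
        if postcode not in found:
            found.append(postcode)
    return found


def postcode_area(postcode):
    """Return the leading letters of a postcode, e.g. 'B' for 'B15 2TT'."""
    return re.match(r"[A-Z]{1,2}", postcode).group(0)


# Hypothetical page text used only for illustration.
sample = "Visit us at Edgbaston, Birmingham B15 2TT or our London office, sw1a 1aa."
postcodes = extract_postcodes(sample)
areas = [postcode_area(p) for p in postcodes]
```

Matching on the postcode area (the leading letters) gives a coarse spatial unit, while the full postcode allows a finer one. A real pipeline would also need to check candidate strings against a list of valid postcodes, since a pattern this loose can match strings that are not postcodes.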

This project will use, but will not be limited to, data from the Internet Archive, the most complete archive of web pages (Holzmann et al., 2016; Ainsworth et al., 2011). It will employ the JISC UK Web Domain Dataset, which is a subset of the Internet Archive curated by the British Library. These data contain the web addresses of billions of webpages within the .uk domain that were archived by the Internet Archive during the period 1996-2013, together with their archiving timestamps. The British Library has also generated a subset of this dataset called Geoindex, which contains circa 2.5 billion web addresses of archived webpages that include at least one UK postcode.

These unstructured textual data will be interrogated by employing corpus analytics in order to create meanings, themes and classifications. The student will have the opportunity to approach the above questions from specific thematic viewpoints, including, but not limited to, land values, tourism and local governance. Topic modelling and similar methods will be applied first to small samples of the corpora and then scaled up. These methods will be coupled with statistical modelling and spatial analysis in order to understand the spatiality of these processes.
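To give a flavour of working with a small sample first, the sketch below ranks the most distinctive terms per postcode area using a TF-IDF score. This is a deliberately simpler stand-in for topic modelling, not the method the project prescribes; the function names, the grouping by postcode area and the toy texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def distinctive_terms(docs_by_area, top_n=3):
    """Rank terms per area by TF-IDF, pooling each area's texts into one document."""
    counts = {
        area: Counter(tokenize(" ".join(texts)))
        for area, texts in docs_by_area.items()
    }
    n_areas = len(counts)

    # Document frequency: in how many areas does each term occur?
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    result = {}
    for area, term_counts in counts.items():
        total = sum(term_counts.values())
        scores = {
            term: (n / total) * math.log(n_areas / doc_freq[term])
            for term, n in term_counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[area] = [term for term, _ in ranked[:top_n]]
    return result


# Toy texts keyed by hypothetical postcode areas, for illustration only.
docs = {
    "B": ["Canal side hotel near the city centre", "Hotel rooms and canal tours"],
    "M": ["Football stadium tickets in the city", "Stadium tours and football museum"],
}
top_terms = distinctive_terms(docs, top_n=2)
```

Terms shared by every area receive a score of zero, so only vocabulary that distinguishes one area from another rises to the top. On the full archive the same idea would need a distributed implementation and a proper topic model rather than this in-memory toy.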

This project will push the envelope of the methodological toolkit of quantitative geography. Although corpus analytics have been employed in human geography, their use has been limited to social media data (Martin & Schuurman, 2017). And although business studies have used web mining and data from the Internet Archive before, their scope was rather limited and they ignored the spatial signatures of these data (Shapira et al., 2016; Li et al., 2016; Musso & Merletti, 2016; Papagiannidis et al., 2017).

The student will utilise the UoB High Performance Computing facilities. S/he will benefit from the Big Data training opportunities within the UoB, including the BlueBear training. S/he will tap into existing UoB big data training networks, including those offered by the DREAM CDT, the Birmingham director of which is the project’s main supervisor. The student will also join training courses, whenever necessary, at the co-supervisor’s institution, given the short distance. Both supervisors are very well placed within existing RCUK-funded big data research networks.

– Relevant social science background in either geography/planning/urban studies or linguistics. Alternatively, a computer science background and willingness to engage with the above disciplines.
– Strong computational background including experience in R or Python.
– Good statistical knowledge.
– Preferably, experience in Natural Language Processing and Machine Learning.

Funding Notes

To support students, the Turing offers a generous tax-free stipend of £20,500 per annum, a travel allowance and conference fund, and tuition fees for a period of 3.5 years.


Arribas-Bel, D. & Tranos, E. (2018). Characterizing the Spatial Structure(s) of Cities “on the fly”: the Space-Time LISA Calendar. Geographical Analysis, 50: 162-181.
Arribas-Bel, D. & Tranos, E. (2018). Big Urban Data: Challenges and Opportunities for Geographical Analysis. Geographical Analysis, 50: 123-124.
Grieve, J. & Wieling, M. Regional Dialectology: Quantitative Approaches using R. Under contract with Cambridge University Press.
Grieve, J. (2016). Regional Variation in Written American English. Cambridge University Press.
Grieve, J., Nini, A. & Guo, D. (2018). Mapping lexical innovation on American social media. Journal of English Linguistics, 46: 293-319.
Huang, Y., Guo, D., Kasakoff, A. & Grieve, J. (2016). Understanding US regional linguistic variation with Twitter data analysis. Computers, Environment and Urban Systems, 59: 244-255.
Kim, C., Reddy, S., Stanford, J., Wyschogrod, E. & Grieve, J. (2018). Bring on the crowd! Using online audio crowdsourcing for large-scale New England Dialectology and acoustic sociophonetics. American Speech (forthcoming).
Tranos, E. & Mack, E. (2018). Big data: A new opportunity for transport geography? Journal of Transport Geography (forthcoming).
Tranos, E. (2013). The Geography of the Internet: Cities, Regions and Internet Infrastructure in Europe. Cheltenham, UK: Edward Elgar (New Horizons in Regional Science Series).
Tranos, E. & Mack, E. (2016). Broadband provision and knowledge intensive firms: a causal relationship? Regional Studies, 50: 1113-1126.
Tranos, E. & Nijkamp, P. (2015). Mobile Phone Usage in Complex Urban Systems: a space-time, aggregated human activity study. Journal of Geographical Systems, 17: 157-185.