Improving Historical Census Transcriptions: A Machine Learning Approach
Published:
Recommended citation: Dahl, Christian M., Sam Il Myoung Hwang, Torben S. D. Johansen, Munir Squires (2024). “Improving Historical Census Transcriptions: A Machine Learning Approach”. https://www.dropbox.com/scl/fi/ay275j12rqeru6rsncw9w/DHTS_06302024.pdf?rlkey=rhb0dg7sayoobcqxxm84cdcrb&e=1&st=i25zzn4s&dl=0
Authors: Christian M. Dahl, Sam Il Myoung Hwang, Torben S. D. Johansen, and Munir Squires.
Download: You can access the working paper here.
Abstract: Historical U.S. censuses have been an important data source for economics, particularly because they allow researchers to track individuals’ life outcomes over long periods of time. However, linking individuals across multiple census rounds is challenging, often because of errors in name transcription. In this paper, we improve the name transcription in historical U.S. censuses using a machine-learning model. Our approach results in a significant increase in the likelihood of linking individuals across censuses. We also find that our model performs especially well when human transcribers struggle, i.e., when the legibility of names on the original census form is low. The increased linkage rate is observed across nearly all socio-demographic subgroups, including those that are typically difficult to link.