Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE
Published in arXiv, 2024
Recommended citation: Dahl, Christian M., Torben S. D. Johansen, Christian Vedel (2024). “Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE”. In: arXiv preprint arXiv:2402.13604 https://arxiv.org/abs/2402.13604
Authors: Christian M. Dahl, Torben S. D. Johansen, and Christian Vedel.
Download: You can access the working paper here.
Code: You can find the code for the project here. A YouTube video showing how to use the code is also available.
Abstract: This paper introduces a new tool, OccCANINE, to automatically transform occupational descriptions into the HISCO classification system. The manual work involved in processing and classifying occupational descriptions is error-prone, tedious, and time-consuming. We finetune a preexisting language model (CANINE) to do this automatically, thereby performing in seconds and minutes what previously took days and weeks. The model is trained on 14 million pairs of occupational descriptions and HISCO codes in 13 different languages contributed by 22 different sources. Our approach is shown to have accuracy, recall, and precision above 90 percent. Our tool breaks the metaphorical HISCO barrier and makes this data readily available for analysis of occupational structures with broad applicability in economics, economic history, and various related disciplines.
Citing
If you would like to cite our paper, please use the following BibTeX entry:
@article{dahl2024hisco,
title={Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE},
author={Dahl, Christian M. and Johansen, Torben S. D. and Vedel, Christian},
journal={arXiv preprint arXiv:2402.13604},
year={2024}
}