Data

The University of Turku hosts internationally acknowledged, digitized data on linguistics, human genetics, history and archaeology, and disease history, as well as pedigrees with life histories and multi-generational health data. Together they offer unique opportunities for collaboration across disciplines. These datasets form the core of the human diversity project:

Human life history
Collections of parish records, compiled by the Human life history group, provide year-to-year information on births, marriages, deaths, movement, socioeconomic status, local demography and environmental conditions for Finnish people from 1700 to the present. Digitized statistical yearbooks provide information on farm land, household size, average income, and birth and mortality rates; they are partly published in Honkola et al. 2018 and Ketola et al. 2021. A comprehensive spatial model of historical travel effort covers the factors that have affected human movement in the past (Rantanen et al. 2021). Each hindering or facilitating factor has been assigned a speed value, and these values are combined into raster surfaces that make it possible to compare opportunities for human movement within Finland.
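The idea of combining per-factor speed values into a raster surface can be illustrated with a minimal sketch. It is not the method of Rantanen et al. (2021): the multiplicative combination rule, the base speed, the layer names (`terrain`, `roads`), the cell size and the function names are all assumptions chosen for illustration.

```python
# Illustrative sketch only: the multiplicative rule and all values are assumptions,
# not taken from Rantanen et al. (2021).

def combine_speed_layers(base_speed_kmh, factor_layers):
    """Scale a base travel speed by each layer's per-cell multiplier.

    A multiplier below 1 hinders movement, above 1 facilitates it,
    and 0 marks a cell as impassable.
    """
    rows = len(factor_layers[0])
    cols = len(factor_layers[0][0])
    speed = [[base_speed_kmh for _ in range(cols)] for _ in range(rows)]
    for layer in factor_layers:
        for r in range(rows):
            for c in range(cols):
                speed[r][c] *= layer[r][c]
    return speed


def crossing_time_hours(speed, cell_size_km):
    """Convert a speed raster into the time needed to cross each cell."""
    return [
        [cell_size_km / v if v > 0 else float("inf") for v in row]
        for row in speed
    ]


# Hypothetical 2 x 2 rasters: terrain slows some cells, a road speeds one up.
terrain = [[1.0, 0.5], [0.5, 0.0]]
roads = [[1.0, 1.0], [2.0, 1.0]]

speed = combine_speed_layers(4.0, [terrain, roads])
time = crossing_time_hours(speed, cell_size_km=1.0)
```

Comparing the resulting `time` rasters for two regions, or for two periods with different road networks, is one simple way to contrast movement opportunities cell by cell.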

Genetic data
Genetic information on Finnish people from the Mesolithic to the present, in the form of genomic sequences and genome-wide SNPs, is available through SUGRIGE and via POSEIDON. We have a Finland-wide modern genomic dataset from the National FinHealth 2017 Study and the National FINRISK Study (via THL). In addition, multiple population-based cohorts in the collections of the Centre for Population Health Research make it possible to address the effects of past and present environments on subsequent generations.

Language data
Communication, spoken and written, has been recorded for Finnish and for the Uralic languages as a whole. Historical linguistic data has been collected by the BEDLAN team. It includes the Dialect atlas of Finnish (Lauri Kettunen 1940), which describes the Finnish dialectal landscape of 100 years ago, before urbanisation. Its 213 features include phonological, lexical and morphological details. An updated and modified version is available on request. The Uralic language family data includes UraLex, a basic vocabulary of 26 Uralic languages with cognate and loanword information. The Uralic Typological database, UraTyp, is collected in collaboration with Grambank and the University of Tartu, and it covers 35 Uralic languages and 360 binary features of language structure. The Geographical database of the Uralic languages contains past and current distributions of the Uralic languages, both as the original digital spatial datasets and as finalized maps. The data presents the state-of-the-art knowledge of the Uralic languages and their dialects. Environmental variables (Roose et al. ms).

Among the more contemporary corpus collections, the Turku Paraphrase Corpus allows detection of semantically similar text patterns even when there is no lexical overlap. The corpus consists of 100,000+ pairs of manually annotated paraphrases in their document context. The CORE corpus series is a set of multilingual corpora representing the unrestricted Web, composed of documents with manual annotations on text genres or registers (Biber 1988), that is, text categories such as news, discussion forums, how-to pages and opinionated texts. This collection can be used as such for linguistic analysis of registers, but above all it is meant for the development of machine learning models. The Finnish Internet Parsebank and other web-crawled datasets are available on request.

Cultural data
Archaeological objects provide information on culture and society. The Archaeological Artefact Database of Finland (AADA) quantifies typologically discernible prehistoric artefacts in Finland from the Stone Age, Bronze Age and Iron Age. AADA is available on request. Linguistic, genetic and archaeological information is also combined using geospatial methods. Access to the geospatial datasets is facilitated through two user interfaces. The Uralic Historical Atlas (URHIA), developed as part of the URKO project, offers interactive access to the Uralic language speaker areas and archaeological artefacts through a map interface, making it easy to browse and share the data. The Uralic Areal Typology can be accessed via a web service created by the Max Planck Institute for Evolutionary Anthropology and provides access to UraTyp.