Data preprocessing and tokenization techniques for technical Ukrainian texts

Sergii Volodymyrovych Mashtalir; Oleksandr Volodymyrovych Nikolenko

doi:10.15276/aait.06.2023.22

Authors

Sergii Volodymyrovych Mashtalir, Kharkiv National University of Radio Electronics, 14, Nauky Ave. Kharkiv, 61166, Ukraine, https://orcid.org/0000-0002-0917-6622
Oleksandr Volodymyrovych Nikolenko, Uzhhorod National University, 14, University Str. Uzhhorod, 88000, Ukraine, https://orcid.org/0000-0002-6422-7824

DOI:

https://doi.org/10.15276/aait.06.2023.22

Keywords:

Multilingual natural language processing, data preprocessing, tokenization, technical Ukrainian texts, lemmatization

Abstract

Natural Language Processing (NLP) has advanced significantly thanks to machine learning, deep learning, and artificial intelligence, which have broadened its applicability and improved human-computer interaction. However, NLP systems still struggle with incomplete and error-laden data, which can lead to biased model outputs. Specialized technical domains pose additional challenges, demanding domain-specific fine-tuning and custom lexicons. Moreover, many languages lack comprehensive NLP support, which hinders accessibility. In this context, we explore novel NLP data preprocessing and tokenization techniques tailored to technical Ukrainian texts. We work with a dataset of automotive repair labor entity names, which is known to contain errors and domain-specific terms, often in a blend of Ukrainian and Russian. Our goal is to classify these entities accurately, which requires comprehensive data cleaning, preprocessing, and tokenization. Our approach modifies classical NLP preprocessing by incorporating language detection, recognition of specific Cyrillic characters, compound word decomposition, and abbreviation handling. Text line normalization standardizes characters, punctuation, and abbreviations, improving consistency. Stopwords are curated to enhance classification relevance. Translation from Russian to Ukrainian relies on detailed classifiers and results in a correspondence dictionary. Tokenization addresses concatenated tokens, spelling errors, common prefixes in compound words, and abbreviations. Lemmatization, which is crucial for languages such as Ukrainian and Russian, builds dictionaries that map word forms to lemmas, with a focus on noun cases. The result is a robust token dictionary suitable for various NLP tasks, enhancing the accuracy and reliability of applications, particularly in technical Ukrainian contexts. This research contributes to the evolving landscape of NLP data preprocessing and tokenization, offering insights for handling domain-specific languages.
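
The abstract names several preprocessing steps without showing them. As a rough illustration only, the following Python sketch shows what line normalization, character-based language guessing, and dictionary lemmatization might look like. It is not the authors' implementation: the function names, the sets of language-specific letters, the normalization rules, and the tiny lemma dictionary are all assumptions made for this example.

```python
import re
import unicodedata

# Letters present in one alphabet but absent from the other. This is a general
# fact about the two alphabets; using it as a detector is an assumption here.
UKRAINIAN_ONLY = set("ґєії")
RUSSIAN_ONLY = set("ёъыэ")

# Hypothetical form-to-lemma dictionary; a real one would be built from data.
LEMMAS = {"фільтра": "фільтр", "колодок": "колодка"}


def normalize_line(text: str) -> str:
    """Lowercase, unify apostrophes, drop other punctuation, collapse spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[’ʼ`]", "'", text)       # apostrophe variants -> ASCII '
    text = re.sub(r"[^\w\s'-]", " ", text)   # keep letters, digits, ' and -
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split a normalized line on whitespace."""
    return normalize_line(text).split()


def guess_language(text: str) -> str:
    """Guess 'uk' or 'ru' by counting letters unique to each alphabet."""
    lowered = text.lower()
    uk = sum(ch in UKRAINIAN_ONLY for ch in lowered)
    ru = sum(ch in RUSSIAN_ONLY for ch in lowered)
    if uk > ru:
        return "uk"
    if ru > uk:
        return "ru"
    return "unknown"  # no distinguishing letters, or a tie


def lemmatize(tokens: list[str], lemmas: dict[str, str]) -> list[str]:
    """Replace each token with its lemma when the dictionary knows it."""
    return [lemmas.get(token, token) for token in tokens]


if __name__ == "__main__":
    line = "Заміна  гальм.колодок (перед.)"
    print(guess_language(line), lemmatize(tokenize(line), LEMMAS))
```

A character-count detector of this kind cannot decide lines that contain no distinguishing letters, which is one reason a mixed Ukrainian and Russian dataset would also need the dictionary-based translation step the abstract mentions.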

Author Biographies

Sergii Volodymyrovych Mashtalir, Kharkiv National University of Radio Electronics, 14, Nauky Ave. Kharkiv, 61166, Ukraine

Doctor of Engineering Sciences, Professor. Professor of the Informatics Department
Scopus: 36183980100

Oleksandr Volodymyrovych Nikolenko, Uzhhorod National University, 14, University Str. Uzhhorod, 88000, Ukraine

Specialist in Applied Mathematics. PhD student
