Data preprocessing and tokenization techniques for technical Ukrainian texts
DOI:
https://doi.org/10.15276/aait.06.2023.22
Keywords:
Multilingual natural language processing, data preprocessing, tokenization, technical Ukrainian texts, lemmatization
Abstract
Natural Language Processing (NLP) has advanced significantly through machine learning, deep learning, and artificial intelligence, which has broadened its applicability and improved human-computer interaction. However, NLP systems still struggle with incomplete and error-laden data, which can bias model outputs. Specialized technical domains pose additional challenges because they demand domain-specific fine-tuning and custom lexicons. Moreover, many languages lack comprehensive NLP support, which limits accessibility. In this context, we explore novel NLP data preprocessing and tokenization techniques tailored to technical Ukrainian texts. We work with a dataset of automotive repair labor entity names, which contains errors and domain-specific terms and is often written in a blend of Ukrainian and Russian. Our goal is to classify these entities accurately, which requires comprehensive data cleaning, preprocessing, and tokenization. Our approach modifies classical NLP preprocessing by incorporating language detection, recognition of specific Cyrillic characters, splitting of compound words, and abbreviation handling. Text line normalization standardizes characters, punctuation, and abbreviations to improve consistency. Stopwords are curated to enhance classification relevance. Translation from Russian to Ukrainian leverages detailed classifiers and results in a correspondence dictionary. Tokenization addresses concatenated tokens, spelling errors, common prefixes in compound words, and abbreviations. Lemmatization, which is crucial in languages such as Ukrainian and Russian, builds dictionaries that map word forms to lemmas, with a focus on noun cases. The result is a robust token dictionary suitable for various NLP tasks, which improves the accuracy and reliability of applications, particularly in technical Ukrainian contexts.
This research contributes to NLP data preprocessing and tokenization practice and offers insights for handling domain-specific text in languages with limited NLP support.
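The normalization and language detection steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `normalize_line` and `guess_language`, the character sets, and the sample strings are assumptions chosen for illustration. The sketch relies only on the fact that the letters і, ї, є, ґ occur in Ukrainian but not in Russian, while ы, э, ъ, ё occur in Russian but not in Ukrainian.

```python
import re
import unicodedata

# Letters that occur in only one of the two alphabets (illustrative assumption:
# these sets are enough to give a coarse hint for short text lines).
UK_ONLY = set("іїєґ")
RU_ONLY = set("ыэъё")
# Apostrophe variants commonly found in Ukrainian text, unified to "'".
APOSTROPHES = "\u2019\u02bc`"


def normalize_line(text: str) -> str:
    """Lowercase, unify apostrophes, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    for ch in APOSTROPHES:
        text = text.replace(ch, "'")
    # Replace every character that is not a letter, digit, or apostrophe
    # (underscore included) with a space.
    text = re.sub(r"[^\w']|_", " ", text)
    return " ".join(text.split())


def guess_language(text: str) -> str:
    """Return 'uk', 'ru', 'mixed', or 'unknown' from language-specific letters."""
    letters = set(normalize_line(text))
    has_uk = bool(letters & UK_ONLY)
    has_ru = bool(letters & RU_ONLY)
    if has_uk and has_ru:
        return "mixed"
    if has_uk:
        return "uk"
    if has_ru:
        return "ru"
    return "unknown"


if __name__ == "__main__":
    for line in ["Заміна  мастила, (двигун)", "Снятие крышки", "Замена масла"]:
        print(normalize_line(line), "->", guess_language(line))
```

A line such as "Замена масла" contains no language-specific letter, so the sketch returns `unknown` for it. Character tests alone therefore cannot separate the two languages, which is consistent with the abstract's use of a Russian-to-Ukrainian correspondence dictionary alongside character recognition.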