NLP ‘RECIPES’ FOR TEXT CORPORA: APPROACHES TO COMPUTING THE PROBABILITY OF A SEQUENCE OF TOKENS

Monika Porwoł

doi:10.28925/2311-2425.2021.151

Authors

Monika Porwoł State University of Applied Sciences in Racibórz, Institute of Modern Language Studies

DOI:

https://doi.org/10.28925/2311-2425.2021.151

Abstract

Investigation in the hybrid architectures for Natural Language Processing (NLP) requires overcoming complexity in various intellectual traditions pertaining to computer science, formal linguistics, logic, digital humanities, ethical issues and so on. NLP as a subfield of computer science and artificial intelligence is concerned with interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to text (and speech) in order to create systems, such as: machine translation (converting from text in a source language to text in a target language), document summarization (converting from long texts into short texts), named entity recognition, predictive typing, et cetera. Undoubtedly, NLP phenomena have been implanted in our daily lives, for instance automatic Machine Translation (MT) is omnipresent in social media (or on the world wide web), virtual assistants (Siri, Cortana, Alexa, and so on) can recognize a natural voice or e-mail services use detection systems to filter out some spam messages. The purpose of this paper, however, is to outline the linguistic and NLP methods to textual processing. Therefore, the bag-of-n-grams concept will be discussed here as an approach to extract more details about the textual data in a string of a grouped words. The n-gram language model presented in this paper (that assigns probabilities to sequences of words in text corpora) is based on findings compiled in Sketch Engine, as well as samples of language data processed by means of NLTK library for Python. Why would one want to compute the probability of a word sequence? The answer is quite obvious – in various systems for performing tasks, the goal is to generate texts that are more fluent. Therefore, a particular component is required, which computes the probability of the output text. The idea is to collect information how frequently the n-grams occur in a large text corpus and use it to predict the next word. Counting the number of occurrences can also envisage certain drawbacks, for instance there are sometimes problems with sparsity or storage. Nonetheless, the language models and specific computing ‘recipes’ described in this paper can be used in many applications, such as machine translation, summarization, even dialogue systems, etc. Lastly, it has to be pointed out that this piece of writing is a part of an ongoing work tentatively termed as LADDER (Linguistic Analysis of Data in the Digital Era of Research) that touches upon the process of datacization^[1] that might help to create an intelligent system of interdisciplinary information.

Keywords: linguistics, Natural Language Processing; language modeling; tokenization; term frequency; bag-of-words; probabilistic classification, bag-of-n-grams; n-gram; Sketch Engine; Python; NLTK.

[1] The term pertains to “the process of transforming information resources accessed by humans into information resources accessed by machines” and it was formulated by an American lexicographer – Erin McKean – and a founder of ‘Reverb’ that compiles the online dictionary ‘Wordnik’.

Downloads

Download data is not yet available.

NLP ‘RECIPES’ FOR TEXT CORPORA: APPROACHES TO COMPUTING THE PROBABILITY OF A SEQUENCE OF TOKENS

Authors

DOI:

Abstract

Downloads

Downloads

Published

How to Cite

Issue

Section

counters2

Information

Language

Make a Submission