Text is a form of unstructured data. Unlike a spreadsheet or an XML file, text has no predefined structure that a computer can easily interpret. You therefore need to clean and pre-process your data before you can analyse it. How you prepare your data will depend on the project you are undertaking and your research questions.
Cleaning and pre-processing data involves standardising your text and removing words and characters that aren’t relevant. After performing these steps, you’ll be left with a nice “clean” text dataset that is ready to be analysed.
Some text and data mining (TDM) tools provide cleaning and pre-processing methods via their interface. These TDM tools include:
Alternatively, you may need to do some programming to prepare your corpus for your analysis. Tutorials, such as ones provided by the Programming Historian, can help you get started.
There are several methods available to clean and pre-process your dataset in preparation for analysis:
Tokenisation is a method where a string of text is broken down into individual units, known as tokens. These tokens can be individual words, characters or phrases. You will need tokenisation for most text and data analysis methods.
In English it’s common to split your text up into individual words or 2–3-word phrases. Splitting your text into phrases is called “n-gram tokenisation”, where “n” is the number of words in the phrase.
Example
Sample sentence: “The cat sat on a mat. Then the cat saw a rat.”
This text can be tokenised as follows:
Words (sometimes called "unigrams"): "The", "cat", "sat", "on", "a", "mat", "Then", "the", "cat", "saw", "a", "rat"
2-word phrases (often called "bigrams" or "2-grams"): "The cat", "cat sat", "sat on", "on a", "a mat", "Then the", "the cat", "cat saw", "saw a", "a rat"
3-word phrases (often called "trigrams" or "3-grams"): "The cat sat", "cat sat on", "sat on a", "on a mat", "Then the cat", "the cat saw", "cat saw a", "saw a rat"
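If you are preparing your corpus with code, the sketch below shows one way to produce these tokens in Python using the NLTK library. NLTK is not the only option, just a common one; the sample sentence is the one above and everything else is illustrative.

```python
# A minimal sketch of word and n-gram tokenisation using Python and NLTK.
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

nltk.download("punkt")  # one-off download of the tokeniser models

text = "The cat sat on a mat. Then the cat saw a rat."

words = word_tokenize(text)        # unigrams (punctuation becomes its own token)
bigrams = list(ngrams(words, 2))   # 2-word phrases
trigrams = list(ngrams(words, 3))  # 3-word phrases

print(words)
print(bigrams)
print(trigrams)
```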
For languages that don’t separate words in their writing, such as Chinese, Thai or Vietnamese, tokenisation will require more thought to identify how the text should be split to enable the desired analysis.
Potential pitfalls
Splitting up words based on character spaces can change meaning or cause things to be grouped incorrectly in cases where multiple words are used to indicate a single thing. For example, place names such as "New York" or compounds such as "ice cream" will be split into two separate tokens even though they refer to a single thing.
Use both phrase tokenisation and single word tokenisation to mitigate this issue.
Computers often treat capitalised versions of words as different to their lowercase counterparts, which can cause problems during analysis. Make all text lowercase to avoid this problem.
Example
Uncorrected text contains: "Whale", "whale" and "WHALE", which the computer counts as three different words.
Convert all text to lowercase to get one number: "whale", now counted three times.
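As a rough sketch, converting to lowercase needs nothing more than Python's built-in string methods; the tokens below are illustrative.

```python
# A minimal sketch of case normalisation with plain Python string methods.
from collections import Counter

tokens = ["Whale", "whale", "WHALE", "song"]   # illustrative tokens

lowered = [t.lower() for t in tokens]          # all three variants become "whale"
print(Counter(lowered))                        # Counter({'whale': 3, 'song': 1})
```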
Potential pitfalls
Sometimes capital letters help to distinguish between things that are different. For example, if your documents refer to both a person named “Rose” and the flower called “rose”, then converting the name to lowercase will result in these two different things being grouped together.
Other pre-processing techniques, such as named entity recognition, can help avoid this pitfall.
Variations in spelling can cause problems in text analysis as the computer will treat different spellings of the same word as different words. Choose a single spelling and replace any other variants in your text with that version.
For a large dataset, tokenise words first and then standardise the spelling. Alternatively, you can use tools such as VARD to do the work for you.
Example
Uncorrected text contains: “paediatric”, “pediatric”, and “pædiatric”.
Replace all variants with: “paediatric”.
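If you would rather script this step than use a dedicated tool, a simple mapping from variant spellings to your chosen spelling is often enough. The sketch below is illustrative and assumes you have already tokenised your text.

```python
# A minimal sketch of spelling standardisation using a hand-built mapping.
# Build the mapping from the variants you actually find in your corpus
# (or use a dedicated tool such as VARD instead).
spelling_map = {
    "pediatric": "paediatric",
    "pædiatric": "paediatric",
}

tokens = ["paediatric", "pediatric", "pædiatric", "ward"]
standardised = [spelling_map.get(t, t) for t in tokens]

print(standardised)   # ['paediatric', 'paediatric', 'paediatric', 'ward']
```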
Potential pitfalls
If you’re specifically looking at the use of different spellings or how spelling can change over time, using this method won’t be helpful.
Punctuation or special characters can clutter your data and make analysing the text difficult. Errors in optical character recognition (OCR) can also result in unusual non-alphanumeric characters being mistakenly added to your text.
Identify characters in your text that are neither letters nor numbers and remove them.
Example
Uncorrected text contains: “coastline” and “coastline;”.
Removing the punctuation will correctly identify them as the same word.
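A regular expression is one straightforward way to strip non-alphanumeric characters. The sketch below takes a blanket approach, so adapt the pattern if your text needs apostrophes, accents or other characters kept.

```python
# A minimal sketch of removing punctuation and special characters with a regular expression.
import re

tokens = ["coastline", "coastline;", "sea-shell"]

# \w matches letters, digits and underscores; \s matches whitespace.
# Everything else (punctuation, special characters) is removed.
cleaned = [re.sub(r"[^\w\s]", "", t) for t in tokens]

print(cleaned)   # ['coastline', 'coastline', 'seashell']
```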
Potential pitfalls
If you’re specifically looking at how certain punctuation or special characters are used, this method will remove important information. This will also be the case when using data with mixed languages or text where punctuation is important (e.g. names or words in French). You will need to take a more targeted approach to any non-alphanumeric character removal.
Other pre-processing steps, such as tokenisation by sentences, may also rely on punctuation.
Stopwords are commonly used words, like “the”, “is”, “that”, “a”, etc., that don’t offer much insight into the text in your documents. It is best to filter stopwords out before analysing text.
You can use existing stopword lists to remove common words from numerous languages.
If there are specific words that are common in your documents, but aren’t relevant to your analysis, you can customise existing stopword lists by adding your own words to them.
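The sketch below shows one way to do this in Python using NLTK's English stopword list; the extra words added to the list are purely illustrative.

```python
# A minimal sketch of stopword removal using NLTK's English stopword list.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-off download of the stopword lists

stop_words = set(stopwords.words("english"))
stop_words.update({"chapter", "page"})  # illustrative custom additions

tokens = ["the", "cat", "sat", "on", "a", "mat"]
filtered = [t for t in tokens if t not in stop_words]

print(filtered)   # ['cat', 'sat', 'mat']
```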
Potential pitfalls
Before using a stopword list, particularly one created by someone else, check to make sure that it doesn’t contain any words that you would like to analyse.
Part-of-speech tagging is used to provide context to text.
For computers, text is just a long string of characters that doesn't mean anything. Part-of-speech tagging helps computers comprehend the structure and meaning of sentences by categorising text into different word classes, such as nouns, verbs, adjectives and adverbs.
This categorisation enables further processing and analyses, such as lemmatisation, sentiment analysis, or any analysis where you wish to look more closely at a specific class of words.
Example
Sample sentence: “They refuse to permit us to obtain the refuse permit.”
Part-of-speech tagged: They (pronoun) refuse (verb) to (to) permit (verb) us (pronoun) to (to) obtain (verb) the (determiner) refuse (noun) permit (noun).
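The sketch below tags the same sentence with NLTK's default tagger, one option among many. The tags it outputs follow the Penn Treebank scheme, so they are abbreviated versions of the word classes shown above.

```python
# A minimal sketch of part-of-speech tagging with NLTK's default tagger.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "They refuse to permit us to obtain the refuse permit."
tagged = nltk.pos_tag(word_tokenize(sentence))

print(tagged)
# e.g. [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ...]
```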
Potential pitfalls
If your part-of-speech tagging software was trained on text that is very different to the text in your corpus, it might struggle to correctly classify a significant number of words in your text. For example, if the tagger was trained on modern newspapers, it might have a hard time tagging social media posts or 18th century novels.
Also, some phrases are inherently ambiguous. For example, in the phrase “the duchess was entertaining last night”, the word “entertaining” could be a verb (the duchess threw a party last night) or an adjective (the duchess was a delightful and amusing companion last night).
In some cases, it’s helpful for your analysis if different words with the same root are recognised as the same word.
For instance, “swim”, “swims”, “swimming”, “swam” and “swum”, would normally be treated as different words by your computer, but you might want them to all be recognised as forms of “swim”.
Stemming and lemmatisation are two different methods for reducing words to a core root so that they can be grouped in this way.
In stemming, a set of general rules is used to identify the end bits of words that can be chopped off to leave the core root of the word. The resulting “stem” may or may not be a real word.
Several different stemming algorithms exist, such as the Snowball or Lancaster stemmers. These will produce different results, so look at the rules they apply and trial them on your data to decide which will suit your needs. Implementing a stemming algorithm will require you to undertake some programming.
Example
simplify => simplif
simplified => simplif
simplification => simplif
So, all three words, which would have been counted separately, are now grouped together through a single word stem, “simplif”, although this stem itself isn’t a valid English word.
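The sketch below runs two of NLTK's stemmers over the same words so you can compare the stems each one produces; which to use depends on your data and purpose.

```python
# A minimal sketch comparing two NLTK stemmers on the same words.
from nltk.stem import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

words = ["simplify", "simplified", "simplification"]

snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

print([snowball.stem(w) for w in words])   # stems produced by the Snowball rules
print([lancaster.stem(w) for w in words])  # stems produced by the Lancaster rules
```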
Potential pitfalls
Languages are irregular and complex, so any stemming algorithm won’t do exactly what you want it to 100% of the time.
No algorithm will be perfect, so you will have to test them and decide whether any of them do a good enough job to achieve your purpose.
A lemmatiser analyses each word to find its dictionary root form, known as its lemma. This analysis requires the lemmatiser to understand the context in which the word is used, so before you can lemmatise your text you'll need to pre-process it with part-of-speech tagging.
There are several lemmatisers available to use, however all of them will require you to undertake some programming.
Example
Uncorrected text contains: “am”, “are”, “is”, “was” and “were”.
These words can be lemmatised to: “be”.
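As a rough sketch, NLTK's WordNet lemmatiser can do this; note that it needs to be told the part of speech (here "v" for verb), which is why part-of-speech tagging normally comes first.

```python
# A minimal sketch of lemmatisation with NLTK's WordNet lemmatiser.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-off download of the WordNet data

lemmatiser = WordNetLemmatizer()
words = ["am", "are", "is", "was", "were"]

# In a real pipeline the part of speech ("v" for verb here) would come
# from a part-of-speech tagging step rather than being hard-coded.
print([lemmatiser.lemmatize(w, pos="v") for w in words])   # ['be', 'be', 'be', 'be', 'be']
```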
Potential pitfalls
As in stemming, precision and nuances can be lost during lemmatisation. For example, “operating” can have quite different meanings from the verb form through to compound noun forms, such as “operating theatre” or “operating system”. If the lemmatiser that you use reduces all of these to “operate”, then you can end up grouping things together that should remain separate. Lemmatisation is also a slower process than stemming, as more analysis is involved.