Text is a form of unstructured data. Unlike a spreadsheet or an XML file, text has no predefined structure that a computer can easily interpret. You therefore need to clean and pre-process your data before you can analyse it. How you prepare your data will depend on the project you are undertaking and your research questions.
Cleaning and pre-processing data involves standardising your text and removing words and characters that aren’t relevant. After performing these steps, you’ll be left with a nice “clean” text dataset that is ready to be analysed.
Some text and data mining (TDM) tools provide cleaning and pre-processing methods via their interface. These TDM tools include:
Alternatively, you may need to do some programming to prepare your corpus for your analysis. Tutorials, such as ones provided by the Programming Historian, can help you get started.
There are several methods available to clean and pre-process your dataset in preparation for analysis:
Tokenisation is a method where a string of text is broken down into individual units, known as tokens. These tokens can be individual words, characters or phrases. You will need tokenisation for most text and data analysis methods.
In English it’s common to split your text up into individual words or 2–3-word phrases. Splitting your text into phrases is called “n-gram tokenisation”, where “n” is the number of words in the phrase.
Example
Sample sentence: “The cat sat on a mat. Then the cat saw a rat.”
This text can be tokenised as follows:
Words (sometimes called "unigrams"): "The", "cat", "sat", "on", "a", "mat", "Then", "the", "cat", "saw", "a", "rat"
2-word phrases (often called "bigrams" or "2-grams"): "The cat", "cat sat", "sat on", "on a", "a mat", "Then the", "the cat", "cat saw", "saw a", "a rat"
3-word phrases (often called "trigrams" or "3-grams"): "The cat sat", "cat sat on", "sat on a", "on a mat", "Then the cat", "the cat saw", "cat saw a", "saw a rat"
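If you are preparing your corpus with code, the sketch below shows one way to produce these tokens in Python using the NLTK library. NLTK is not the only option, just a common one; the sample sentence is the one above and everything else is illustrative.

```python
# A minimal sketch of word and n-gram tokenisation using Python and NLTK.
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

nltk.download("punkt")  # one-off download of the tokeniser models

text = "The cat sat on a mat. Then the cat saw a rat."

words = word_tokenize(text)        # unigrams (punctuation becomes its own token)
bigrams = list(ngrams(words, 2))   # 2-word phrases
trigrams = list(ngrams(words, 3))  # 3-word phrases

print(words)
print(bigrams)
print(trigrams)
```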
For languages that don’t separate words in their writing, such as Chinese, Thai or Vietnamese, tokenisation will require more thought to identify how the text should be split to enable the desired analysis.
Potential pitfalls
Splitting up words based on character spaces can change meaning or cause things to be grouped incorrectly in cases where multiple words are used to indicate a single thing. For example, place names such as "New York" or compounds such as "ice cream" will be split into two separate tokens even though they refer to a single thing.
Use both phrase tokenisation and single word tokenisation to mitigate this issue.
Computers often treat capitalised versions of words as different to their lowercase counterparts, which can cause problems during analysis. Make all text lowercase to avoid this problem.
Example
Uncorrected text contains: "Whale", "whale" and "WHALE", which the computer counts as three different words.
Convert all text to lowercase to get one number: "whale", now counted three times.
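As a rough sketch, converting to lowercase needs nothing more than Python's built-in string methods; the tokens below are illustrative.

```python
# A minimal sketch of case normalisation with plain Python string methods.
from collections import Counter

tokens = ["Whale", "whale", "WHALE", "song"]   # illustrative tokens

lowered = [t.lower() for t in tokens]          # all three variants become "whale"
print(Counter(lowered))                        # Counter({'whale': 3, 'song': 1})
```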
Potential pitfalls
Sometimes capital letters help to distinguish between things that are different. For example, if your documents refer to both a person named “Rose” and the flower called “rose”, then converting the name to lowercase will result in these two different things being grouped together.
Other pre-processing techniques, such as named entity recognition, can help avoid this pitfall.
Variations in spelling can cause problems in text analysis as the computer will treat different spellings of the same word as different words. Choose a single spelling and replace any other variants in your text with that version.
For a large dataset, tokenise words first and then standardise the spelling. Alternatively, you can use tools such as VARD to do the work for you.
Example
Uncorrected text contains: “paediatric”, “pediatric”, and “pædiatric”.
Replace all variants with: “paediatric”.
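If you would rather script this step than use a dedicated tool, a simple mapping from variant spellings to your chosen spelling is often enough. The sketch below is illustrative and assumes you have already tokenised your text.

```python
# A minimal sketch of spelling standardisation using a hand-built mapping.
# Build the mapping from the variants you actually find in your corpus
# (or use a dedicated tool such as VARD instead).
spelling_map = {
    "pediatric": "paediatric",
    "pædiatric": "paediatric",
}

tokens = ["paediatric", "pediatric", "pædiatric", "ward"]
standardised = [spelling_map.get(t, t) for t in tokens]

print(standardised)   # ['paediatric', 'paediatric', 'paediatric', 'ward']
```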
Potential pitfalls
If you’re specifically looking at the use of different spellings or how spelling can change over time, using this method won’t be helpful.
Punctuation or special characters can clutter your data and make analysing the text difficult. Errors in optical character recognition (OCR) can also result in unusual non-alphanumeric characters being mistakenly added to your text.
Identify characters in your text that are neither letters nor numbers and remove them.
Example
Uncorrected text contains: “coastline” and “coastline;”.
Removing the punctuation will correctly identify them as the same word.
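A regular expression is one straightforward way to strip non-alphanumeric characters. The sketch below takes a blanket approach, so adapt the pattern if your text needs apostrophes, accents or other characters kept.

```python
# A minimal sketch of removing punctuation and special characters with a regular expression.
import re

tokens = ["coastline", "coastline;", "sea-shell"]

# \w matches letters, digits and underscores; \s matches whitespace.
# Everything else (punctuation, special characters) is removed.
cleaned = [re.sub(r"[^\w\s]", "", t) for t in tokens]

print(cleaned)   # ['coastline', 'coastline', 'seashell']
```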
Potential pitfalls
If you’re specifically looking at how certain punctuation or special characters are used, this method will remove important information. This will also be the case when using data with mixed languages or text where punctuation is important (e.g. names or words in French). You will need to take a more targeted approach to any non-alphanumeric character removal.
Other pre-processing steps, such as tokenisation by sentences, may also rely on punctuation.
Stopwords are commonly used words, like “the”, “is”, “that”, “a”, etc., that don’t offer much insight into the text in your documents. It is best to filter stopwords out before analysing text.
You can use existing stopword lists to remove common words from numerous languages.
If there are specific words that are common in your documents, but aren’t relevant to your analysis, you can customise existing stopword lists by adding your own words to them.
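The sketch below shows one way to do this in Python using NLTK's English stopword list; the extra words added to the list are purely illustrative.

```python
# A minimal sketch of stopword removal using NLTK's English stopword list.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-off download of the stopword lists

stop_words = set(stopwords.words("english"))
stop_words.update({"chapter", "page"})  # illustrative custom additions

tokens = ["the", "cat", "sat", "on", "a", "mat"]
filtered = [t for t in tokens if t not in stop_words]

print(filtered)   # ['cat', 'sat', 'mat']
```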
Potential pitfalls
Before using a stopword list, particularly one created by someone else, check to make sure that it doesn’t contain any words that you would like to analyse.
Part-of-speech tagging is used to provide context to text.
For computers, text is just a long string of characters that doesn't mean anything. Part-of-speech tagging helps computers comprehend the structure and meaning of sentences by categorising text into different word classes, such as nouns, verbs, adjectives and adverbs.
This categorisation enables further processing and analyses, such as lemmatisation, sentiment analysis, or any analysis where you wish to look more closely at a specific class of words.
Example
Sample sentence: “They refuse to permit us to obtain the refuse permit.”
Part-of-speech tagged: They (pronoun) refuse (verb) to (to) permit (verb) us (pronoun) to (to) obtain (verb) the (determiner) refuse (noun) permit (noun).
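The sketch below tags the same sentence with NLTK's default tagger, one option among many. The tags it outputs follow the Penn Treebank scheme, so they are abbreviated versions of the word classes shown above.

```python
# A minimal sketch of part-of-speech tagging with NLTK's default tagger.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "They refuse to permit us to obtain the refuse permit."
tagged = nltk.pos_tag(word_tokenize(sentence))

print(tagged)
# e.g. [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ...]
```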
Potential pitfalls
If your part-of-speech tagging software was trained on text that is very different to the text in your corpus, it might struggle to correctly classify a significant number of words in your text. For example, if the tagger was trained on modern newspapers, it might have a hard time tagging social media posts or 18th century novels.
Also, some phrases are inherently ambiguous. For example, in the phrase “the duchess was entertaining last night”, the word “entertaining” could be a verb (the duchess threw a party last night) or an adjective (the duchess was a delightful and amusing companion last night).
In some cases, it’s helpful for your analysis if different words with the same root are recognised as the same word.
For instance, “swim”, “swims”, “swimming”, “swam” and “swum”, would normally be treated as different words by your computer, but you might want them to all be recognised as forms of “swim”.
Stemming and lemmatisation are two different methods for reducing words to a core root so that they can be grouped in this way.
In stemming, a set of general rules is used to identify the end bits of words that can be chopped off to leave the core root of the word. The resulting “stem” may or may not be a real word.
Several different stemming algorithms exist, such as the Snowball or Lancaster stemmers. These will produce different results, so look at the rules they apply and trial them on your data to decide which will suit your needs. Implementing a stemming algorithm will require you to undertake some programming.
Example
simplify => simplif
simplified => simplif
simplification => simplif
So, all three words, which would have been counted separately, are now grouped together through a single word stem, “simplif”, although this stem itself isn’t a valid English word.
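The sketch below runs two of NLTK's stemmers over the same words so you can compare the stems each one produces; which to use depends on your data and purpose.

```python
# A minimal sketch comparing two NLTK stemmers on the same words.
from nltk.stem import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

words = ["simplify", "simplified", "simplification"]

snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

print([snowball.stem(w) for w in words])   # stems produced by the Snowball rules
print([lancaster.stem(w) for w in words])  # stems produced by the Lancaster rules
```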
Potential pitfalls
Languages are irregular and complex, so any stemming algorithm won’t do exactly what you want it to 100% of the time.
No algorithm will be perfect, so you will have to test them and decide whether any of them do a good enough job to achieve your purpose.
A lemmatiser analyses each word to find its dictionary root form, known as its lemma. This analysis requires the lemmatiser to understand the context in which the word is used, so before you can lemmatise your text you'll need to pre-process it with part-of-speech tagging.
There are several lemmatisers available to use, however all of them will require you to undertake some programming.
Example
Uncorrected text contains: “am”, “are”, “is”, “was” and “were”.
These words can be lemmatised to: “be”.
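As a rough sketch, NLTK's WordNet lemmatiser can do this; note that it needs to be told the part of speech (here "v" for verb), which is why part-of-speech tagging normally comes first.

```python
# A minimal sketch of lemmatisation with NLTK's WordNet lemmatiser.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-off download of the WordNet data

lemmatiser = WordNetLemmatizer()
words = ["am", "are", "is", "was", "were"]

# In a real pipeline the part of speech ("v" for verb here) would come
# from a part-of-speech tagging step rather than being hard-coded.
print([lemmatiser.lemmatize(w, pos="v") for w in words])   # ['be', 'be', 'be', 'be', 'be']
```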
Potential pitfalls
As in stemming, precision and nuances can be lost during lemmatisation. For example, “operating” can have quite different meanings from the verb form through to compound noun forms, such as “operating theatre” or “operating system”. If the lemmatiser that you use reduces all of these to “operate”, then you can end up grouping things together that should remain separate. Lemmatisation is also a slower process than stemming, as more analysis is involved.