Text and data mining (TDM) can be used to capture key concepts, trends and patterns in your research.
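Common TDM methods include topic modelling, sentiment analysis, term frequency and TF-IDF, collocation analysis and named entity recognition.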
Topic modelling is a text and data mining method that scans your texts to identify groups of words that often appear in the same documents as each other.
For more information on this method, visit “Topic modeling made just simple enough”.
See how researchers have used topic modelling in Examples of TDM in research.
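The idea of words that "often appear in the same documents as each other" can be illustrated with a small sketch. This is not a topic model (tools such as LDA fit a statistical model over a large corpus); it only counts, for a hypothetical four-document toy corpus, how many documents each pair of words shares. The corpus, the function name `cooccurrence_counts` and the threshold of 2 are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each string stands in for one document.
documents = [
    "river flood rain water",
    "rain water storm flood",
    "election vote party leader",
    "party leader vote campaign",
]

def cooccurrence_counts(docs):
    """Count how many documents each pair of distinct words shares."""
    pair_counts = Counter()
    for doc in docs:
        vocabulary = sorted(set(doc.lower().split()))
        # combinations() over a sorted list yields each pair once, in a fixed order.
        pair_counts.update(combinations(vocabulary, 2))
    return pair_counts

pair_counts = cooccurrence_counts(documents)

# Keep only the pairs that appear together in at least 2 documents.
frequent_pairs = sorted(pair for pair, n in pair_counts.items() if n >= 2)
print(frequent_pairs)
```

In this toy corpus the surviving pairs fall into two groups, one about weather and one about politics, which is the kind of grouping a real topic model would try to discover automatically.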
Topic modelling can be used to:
Topic modelling can’t be used:
There are 2 types of sentiment analysis you can use:
1. Lexicon-based sentiment analysis
2. Machine learning sentiment analysis
Machine learning approaches may identify specific emotions (e.g. sadness, anger, fear, joy, surprise), rather than just an overall positive or negative sentiment.
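As a rough illustration of the lexicon-based approach, the sketch below scores a text by adding up the scores of the words it contains. The six-word `LEXICON`, its scores and the function names are hypothetical; real sentiment lexicons contain thousands of scored words, and real tools often also handle negation and intensifiers, which this sketch does not.

```python
import re

# Hypothetical mini-lexicon: positive words score above 0, negative words below 0.
LEXICON = {"good": 1, "great": 2, "happy": 1, "bad": -1, "terrible": -2, "sad": -1}

def lexicon_sentiment(text, lexicon=LEXICON):
    """Sum the scores of every lexicon word in the text; unknown words score 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)

def label(score):
    """Turn a numeric score into an overall sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in [
    "A good, happy day",
    "The food was great but the service was terrible and sad",
    "The train arrived",
]:
    print(sentence, "->", label(lexicon_sentiment(sentence)))
```

Because the score is just a sum over individual words, a sentence such as "not good" would still be scored as positive here, which is one reason lexicon-based results need checking.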
See how researchers have used both types of sentiment analysis in Examples of TDM in research.
Sentiment analysis can be used to:
Sentiment analysis doesn’t work well with:
Term frequency analyses how often a word or phrase appears in a document or in your corpus. In its simplest form, term frequency is calculated by counting the number of times the term is used. This can provide insight into the topics most frequently discussed in your text.
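In its simplest form, this counting can be sketched in a few lines. The sample text and the function name `term_frequency` are made up for illustration, and the tokenisation (lowercasing and keeping only runs of letters) is one simple assumption among many possible choices.

```python
import re
from collections import Counter

def term_frequency(text):
    """Count how many times each lowercase word appears in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Hypothetical sample text.
sample = "The library supports research. The library lends books."
counts = term_frequency(sample)

print(counts["library"])      # occurrences of a single term
print(counts.most_common(3))  # the most frequent terms overall
```

Note that very common words such as "the" are counted alongside more informative ones, so raw counts usually need filtering or weighting before they say much about the topics in a text.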
Term frequency-inverse document frequency (TF-IDF) is a related method that can identify more meaningful frequent words. In TF-IDF, how often a term is used in one document is weighed against how many documents in the corpus use that term. This differentiates terms that are common within particular documents from terms that are common across all or most documents in the corpus.
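The sketch below shows one common textbook variant of TF-IDF, assuming the score for a term in a document is (count of the term / number of terms in the document) × log(number of documents / number of documents containing the term). Software packages often use slightly different formulas (for example with smoothing), so treat this as an illustration rather than what any particular tool computes. The three-document corpus and function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(corpus):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy corpus.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
scores = tf_idf(corpus)

# Print the terms of the first document, highest score first.
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

In this corpus "the" appears in every document, so its score is zero even though it is the most frequent word, while "mat", which appears in only one document, scores higher than "cat", which appears in two.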
Term frequency and TF-IDF can be used to:
Term frequency and TF-IDF do not account for:
A collocation is a group of 2 or more words that appear close together more often than would be expected by chance.
Collocations can be:
See how researchers have used collocation analysis in Examples of TDM in research.
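One way to make "more often than would be expected by chance" concrete is pointwise mutual information (PMI), which compares how often two words appear next to each other with how often they would if they were independent. PMI is chosen here only as an example of an association measure; other measures exist, and the sample text, the function name `bigram_pmi` and the `min_count` threshold are assumptions. The sketch only considers directly adjacent word pairs.

```python
import math
import re
from collections import Counter

def bigram_pmi(text, min_count=2):
    """Score adjacent word pairs by PMI, skipping pairs seen fewer than min_count times."""
    tokens = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_tokens
        p_second = unigrams[second] / n_tokens
        # PMI above 0 means the pair occurs more often than chance would predict.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores

# Hypothetical sample text.
text = ("she moved to new york and he moved to new york "
        "because new york is big and the city is big")
scores = bigram_pmi(text)

for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(" ".join(pair), round(score, 2))
```

On a text this small the scores are not meaningful in themselves; the point is only that a pair such as "new york" receives a positive score because the two words occur together far more often than their individual frequencies would suggest.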
Collocation analysis can be used to:
Named entity recognition (NER) is a process where software analyses text to locate words that a human would recognise as a distinct entity. These entities are then classified into categories, such as person, location, organisation, nationality, time, date, etc.
Some named entity recognisers, such as spaCy, have a set of predefined categories that they have been trained to identify. Others, such as Stanford NER, allow you to define your own categories. Defining your own categories means you’ll need to train the recogniser to identify the entities you’re interested in. To do this, you’ll need to manually classify many documents, which is a time-consuming and laborious process.
For example, if you want a list of all the people mentioned in your text, your computer can’t produce it before NER has been performed, because it has no concept of what a person is. Once the entities in your text have been classified, the computer can easily list every entity with a “person” tag.
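The final step described here, listing entities by tag once they have been classified, can be sketched as follows. The code does not perform NER itself: the `tagged_entities` list is written by hand to stand in for the output of a tagger, and its contents, the category names and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical tagger output: (entity text, category) pairs, written by hand
# to stand in for what an NER tool would return for a document.
tagged_entities = [
    ("Marie Curie", "person"),
    ("Paris", "location"),
    ("Sorbonne", "organisation"),
    ("Pierre Curie", "person"),
    ("1903", "date"),
]

def entities_by_category(tagged):
    """Group entity texts under their category, keeping document order."""
    grouped = defaultdict(list)
    for text, category in tagged:
        grouped[category].append(text)
    return dict(grouped)

grouped = entities_by_category(tagged_entities)

# List every entity that was given the "person" tag.
people = grouped.get("person", [])
print(people)
```

Real tools use their own label names and data structures, so the grouping step would need adapting to whatever your recogniser returns.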
Named entity recognition can be used to:
Make sure the NER tool you use has been trained on text that is similar to the kind of text you’re working with.
An NER tagger trained on American terms and locations may mislabel Fairfax as a geographic location, rather than an organisation, if you run it over newspapers published by Australasian Fairfax.
A tagger is never going to get everything right, so you will likely end up with some missed or misclassified entities.