About text and data mining
Data mining is the use of computational methods on big datasets to uncover new insights that might not be found through smaller, focused analyses.
When the datasets are bodies of text, this process is often termed text mining and can complement traditional close readings of texts.
Using text and data mining in your research
Text and data mining (TDM) can open up new areas of scholarly inquiry by making use of the large volumes of digital and digitised information that are increasingly available.
TDM is an iterative process of inquiry and discovery. As your project progresses, you may need to update or change your research question or analysis methods.
Get started by visiting the Text and data mining methods page. If you are using TDM as part of a broader systematic review, see the Systematic reviews page.
Text and data mining support pages
Support from the Library
We can support your TDM activities with:
help finding out which library-licensed data sources can be mined
advice on forming a search strategy for corpora creation
guidance on using the Gale Digital Scholar Lab and ProQuest TDM Studio
Contact an Academic Liaison Librarian for help.
The Sydney Informatics Hub
The Sydney Informatics Hub provides free training courses, from introductory to advanced, including courses on programming, data collection and statistics.
They also run a monthly drop-in Hacky Hour where you can bring your statistics, programming and data science questions to get advice from experts.
Examples of TDM in research
Analysing 150 years of British periodicals
Content analysis of 150 years of British periodicals explores changes in culture and society through changes in language.
This project analysed 28.6 billion words from 35.9 million articles contained in 120 UK regional newspapers over the period 1800-1950.
Researchers examined changes in values, political interests, the rise of 'Britishness' as a concept, the spread of technological innovations, social changes and much more.
Methods used: term frequency, named entity recognition.
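Term frequency analysis at this scale simply tracks how often words of interest appear across a corpus. A minimal sketch of the idea, using Python's standard library (this is an illustration with invented example text, not the project's actual pipeline):

```python
from collections import Counter

def term_frequencies(text, terms):
    """Count how often each term of interest appears in a text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {term: counts[term] for term in terms}

# Hypothetical snippet standing in for a digitised newspaper article
article = "the railway opened in 1850 and the railway changed the town"
print(term_frequencies(article, ["railway", "town"]))
# {'railway': 2, 'town': 1}
```

At research scale these counts would be computed per year or per publication and plotted over time to reveal, for example, the spread of a technology through the language.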
Toy safety from online reviews
In Toy safety surveillance from online reviews, researchers outline how they developed a classification system for different types of safety and performance defects related to toys.
The lists of 'danger words' were then used to score a large sample of the more than one million Amazon.com product reviews in the 'Toys and Games' category from 1999 to 2014.
Method used: statistical analysis – correlation coefficient.
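A correlation coefficient measures how strongly two quantities move together, for example review 'danger word' scores and reported defects. A minimal sketch of Pearson's r, with invented illustrative numbers rather than the study's data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: danger-word scores vs. defect reports per product
danger_scores = [0.1, 0.4, 0.35, 0.8, 0.7]
defect_reports = [1, 3, 2, 6, 5]
print(round(pearson(danger_scores, defect_reports), 2))  # 0.99
```

A value near 1 indicates a strong positive relationship; near 0, no linear relationship.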
Climate change discussions in social media
Researchers undertook Topic modeling and sentiment analysis of global climate change tweets to look at public opinion on climate change across space and time.
They explored how people on Twitter felt about the changing climate, the impact of weather events on discussions, and how topics of discussion varied across countries and regions.
Methods used: topic modelling, sentiment analysis.
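One common approach to sentiment analysis scores each post against a lexicon of positive and negative words. A toy sketch of that idea (the lexicon and tweets here are invented; real studies use large validated lexicons or trained classifiers):

```python
# Hypothetical mini-lexicon for illustration only
LEXICON = {"hope": 1, "progress": 1, "disaster": -1, "crisis": -1, "worried": -1}

def sentiment_score(tweet):
    """Average lexicon score of the words in a tweet; 0 if none match."""
    words = tweet.lower().split()
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(sentiment_score("climate crisis is a disaster"))  # -1.0
print(sentiment_score("real progress gives me hope"))   # 1.0
```

Aggregating such scores by country and date is what allows researchers to compare opinion across space and time.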
Legal documents in water disputes
In Understanding water disputes in Chile with text and data mining tools, legal documents are used to identify key themes arising from disputes on water rights in Chile.
Coupled with geospatial data, the authors create a picture of where certain disputes occur and what legal actions are taken.
Methods used: topic modelling, geographic analysis, network analysis.
Forecasting election results
The sentiments of social media posts are analysed in Every tweet counts? How sentiment analysis of social media can improve our knowledge of citizens’ political preferences with an application to Italy and France.
The researchers compare their analysis with traditional mass surveys in Italy and France and find a consistent correlation between the social media results and the survey results, suggesting an alternative method for forecasting electoral outcomes.
Methods used: sentiment analysis.
Identifying research communities
In The evolution of the American Journal of Psychology 1, 1887–1903: A Network Investigation, network analysis is used to identify the development of research communities in the early years of psychology as a research discipline.
Methods used: network analysis.
Representations of refugees and asylum seekers
In Fleeing, Sneaking, Flooding: A Corpus Analysis of Discursive Constructions of Refugees and Asylum Seekers in the UK Press, 1996–2005, researchers investigate how refugees and asylum seekers were discussed in British newspapers and how representations changed through time.
Methods used: collocation analysis.
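Collocation analysis asks which words tend to occur near a node word (such as 'refugees') within a fixed window. A minimal sketch of window-based collocate counting, using an invented example sentence rather than the study's corpus:

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

# Hypothetical text standing in for newspaper data
corpus = "refugees flooding into the country refugees fleeing conflict".split()
print(collocates(corpus, "refugees").most_common(3))
```

In practice, researchers rank collocates by an association measure (such as mutual information) rather than raw counts, which is how patterns like 'flooding' co-occurring with 'refugees' are surfaced.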
Identifying authorship
A stylometric analysis is used in Computer stylometry of C. S. Lewis’s The Dark Tower and related texts to identify C.S. Lewis as the most likely author of the unfinished novel, whose authorship had previously been contested.
Methods used: stylometric analysis.
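Stylometry typically compares how often authors use common function words (the, and, of, ...), which are hard to imitate. A toy sketch of the idea: build a function-word frequency profile for each text and measure which candidate's profile lies closest to the disputed text's. The word list and texts below are invented for illustration, not drawn from the study:

```python
from collections import Counter
import math

FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it"]

def profile(text):
    """Relative frequency of each function word in a text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in FUNCTION_WORDS]

def distance(p, q):
    """Euclidean distance between profiles; smaller means more similar style."""
    return math.dist(p, q)

# Hypothetical texts for illustration
disputed = "the tower stood in the dark and it was the end of a dream"
candidate_a = "the ship sailed in the night and it was the last of a voyage"
candidate_b = "stars burned bright while shadows danced across frozen plains"

d_a = distance(profile(disputed), profile(candidate_a))
d_b = distance(profile(disputed), profile(candidate_b))
print(d_a < d_b)  # True: candidate A's function-word usage is closer
```

Real stylometric studies use hundreds of features and more robust measures (such as Burrows' Delta), but the attribution logic is the same.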