Text and data mining

Analyse large scale datasets in your research

The Library is currently trialling a text and data mining support service. View the services available and email for further information.

Data mining is the process of applying open-ended computational methods to large scale datasets to discover new insights that may not be revealed through targeted smaller scale analyses. When the datasets used are bodies of text, this process is often termed text mining and can provide a complementary approach to traditional close readings of texts. Text and data mining (TDM) approaches can open up new areas of scholarly enquiry.

How to do it
  • The first step in a text and data mining project is to establish the dataset you want to mine. This involves collating the data and getting it into a form that allows it to be queried and analysed using computational methods. There are a number of ways of going about this:

    • acquire the data on a hard drive or other storage device and build a local database
    • download the data via FTP and build a local database
    • harvest the data in bulk from online sources and build a local database
    • harvest the data selectively from online sources and cache locally
    • query or analyse online data sources in place using an API or online computational laboratory environment
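
    As a minimal sketch of the "build a local database" step shared by several of the options above, the following Python example loads a handful of collated text records into SQLite and runs a simple keyword query. The table layout, the function names (`build_corpus_db`, `search`) and the sample records are assumptions made for illustration; a real project would choose a schema that suits its own data.

```python
import sqlite3


def build_corpus_db(documents, db_path=":memory:"):
    """Load (doc_id, title, body) records into a local SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "doc_id TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    # INSERT OR REPLACE lets the load be re-run without duplicating rows.
    conn.executemany(
        "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)", documents
    )
    conn.commit()
    return conn


def search(conn, term):
    """Return the ids of documents whose body contains the given term."""
    rows = conn.execute(
        "SELECT doc_id FROM documents WHERE body LIKE ? ORDER BY doc_id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


# Hypothetical sample records standing in for a collated dataset.
sample = [
    ("d1", "Intro", "Text mining complements close reading."),
    ("d2", "Methods", "Harvest the data and cache it locally."),
    ("d3", "Ethics", "Data mining can enable re-identification."),
]
conn = build_corpus_db(sample)
print(search(conn, "mining"))
```

    Passing a file path instead of `":memory:"` would keep the database on disk between sessions. Note that `%` and `_` in the search term act as wildcards in a SQL `LIKE` pattern.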

    There are many different tools available for web harvesting, constructing a database, interacting with APIs, and querying and analysing datasets. For example, data and text mining packages are available for use with many programming languages and data analysis software tools. MALLET and Weka are both machine-learning text mining packages for use with Java, the tm package can be used in R, while the textmining package does the same in Python. For those without programming skills, Voyant provides an interactive web-based environment for text mining. For information and links to more data and text mining tools please see the Data Analysis and Visualisation Toolkit.
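
    To give a sense of what the simplest kind of text mining looks like, here is a small sketch that counts term frequencies across a set of texts using only the Python standard library. It is not based on any of the packages named above; the stopword list, the tokenisation rule and the sample sentences are assumptions chosen to keep the example short.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


def term_frequencies(texts, stopwords=STOPWORDS):
    """Count how often each non-stopword term occurs across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


corpus = [
    "The whale is the subject of the book.",
    "A whale, a ship and the sea.",
]
freqs = term_frequencies(corpus)
print(freqs.most_common(3))
```

    Dedicated packages add features such as stemming, topic modelling and classification, but most start from a term count of this kind.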

    For advice or assistance with using text or data mining tools, contact the data scientists at the Sydney Informatics Hub.

Issues to consider
  • Copyright

    The large datasets used in data and text mining are often sourced from pre-existing research outputs, original creative works, or proprietary data owned by commercial enterprises. This means that performing data and text mining requires you to access, copy and process material that may be protected by copyright. If you have any questions or need guidance on complying with copyright during data mining activities, please contact the Library’s Copyright Services team.

    Licence conditions

    Each data provider has its own standards and procedures that you must follow in order to use its data legally. From the outset of your project, ensure that your intended data mining activities, and the subsequent publication of your research results, comply with the provider's licensing terms and conditions. For example, many data providers license their data to be mined for research purposes only and either prohibit, or require special negotiation for, data mining with potential commercial applications. If you have any questions about licensing conditions or negotiating permission for potential commercial applications of data mining with data providers, please contact the Digital Collections team.


    Ethics and privacy

    Since text and data mining sometimes involves the collation and linkage of separate datasets, you should take care to seek appropriate ethics approvals and conduct privacy impact assessments. Even if all the original datasets contain only de-identified data, data linkage and data mining can have the unforeseen consequence of enabling individuals to be re-identified. If you have questions about whether you require ethics approval for text and data mining activities, please contact Ethics and Research Integrity.
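
    The re-identification risk can be illustrated with a rough check in the style of k-anonymity: count how many records share each combination of attributes, and see how the smallest group shrinks once more attributes are combined. The following Python sketch uses invented records and hypothetical field names (`postcode`, `birth_year`); it is an illustration of the idea, not a substitute for a privacy impact assessment.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same attribute values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return min(Counter(keys).values())


# Invented, de-identified records as they might look after linkage.
records = [
    {"postcode": "2006", "birth_year": 1980},
    {"postcode": "2006", "birth_year": 1980},
    {"postcode": "2006", "birth_year": 1975},
    {"postcode": "2050", "birth_year": 1980},
    {"postcode": "2050", "birth_year": 1980},
]

k_postcode = smallest_group_size(records, ["postcode"])
k_linked = smallest_group_size(records, ["postcode", "birth_year"])
print(k_postcode, k_linked)
```

    A smallest group of size 1 means at least one record is unique on the combined attributes, which is the situation in which re-identification becomes plausible.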

    Online mining etiquette

    Even if the licence permits it, some approaches to text and data mining are considered poor etiquette because of the inconvenience they can cause to data providers. For example, bulk scraping or programmatic querying via APIs without rate limiting can place a significant burden on data providers’ servers, causing slow response times or even downtime for other users. Best practice is to check the requirements of the data provider and comply with their preferences regarding data mining activities.
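
    One simple way to rate-limit programmatic querying is to enforce a minimum interval between successive requests. The Python sketch below shows the idea; the `Throttle` class, the `fetch_all` helper and the placeholder fetch function are assumptions made for illustration, and the appropriate interval is whatever the data provider specifies.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Pause until the minimum interval has elapsed.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(items, fetch, throttle):
    """Call fetch on each item, pausing between calls as the throttle requires."""
    results = []
    for item in items:
        throttle.wait()
        results.append(fetch(item))
    return results


def fake_fetch(query):
    """Placeholder standing in for a real API request."""
    return query.upper()


# Example run with a short interval so the demonstration finishes quickly.
results = fetch_all(["alpha", "beta"], fake_fetch, Throttle(0.01))
print(results)
```

    In practice `fake_fetch` would be replaced with a call to the provider's API, and the interval set according to the provider's published limits.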

Publicly available data sources
Library licensed data sources