Creating a dataset

About datasets

A dataset is a collection of texts, also called a corpus. The choices you make in assembling your dataset will be crucial to the success of your project.  

Having developed a research question, you need to:

  • Consider what content you need to answer your research question.

  • Find resources that you can mine for information.

Working through these steps first will save time and help you choose the mining methods best suited to your project.

Assembling a dataset

Questions to consider when assembling a dataset:

  • Is the data available for me to use?

  • Where does the data come from? For example, is it drawn from primary or secondary sources?

  • What biases might there be?

  • What is the geographical coverage of the dataset?

  • What is the time period or date range that the data covers?

  • Is the data clean and ready to use? What kinds of cleaning might the data require?

  • What kind of data can you access? For instance, do you have access to metadata, abstracts or full text?

  • Are there any legal, ethical or financial limitations on using the data?
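To give a sense of what "cleaning" can involve, here is a minimal Python sketch of a few common steps: normalising Unicode characters, rejoining words split across line breaks, and collapsing extra whitespace. The function name clean_text and the choice of steps are illustrative assumptions; the cleaning your data needs will depend on its source.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Apply a few common, conservative cleaning steps to raw text.

    Hypothetical helper for illustration; adjust the steps to your data.
    """
    # Normalise Unicode so visually identical characters compare equal
    # (for example, a non-breaking space becomes an ordinary space).
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "min-\ning" -> "mining".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of whitespace, including line breaks, to a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text("Text  min-\ning\u00a0and   data\n"))
```

Note that the hyphen step will also join genuine compounds that happen to fall at a line break (such as "machine-readable"), so check a sample of the output before cleaning a whole dataset.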

Some library databases allow text and data mining. There are also publicly available datasets.

Academic Liaison Librarians can give advice on where to find relevant information and how to search to find useful results. 

Making your text machine readable

A computer must be able to read your documents to perform text and data mining.  

Test whether a document is machine readable by using the “find” command to search for a word that you can see in the text.  

  • If the computer can find the word, it can read your document.

If the computer can’t find the word, use Adobe Acrobat to perform optical character recognition (OCR). OCR software looks at an image, identifies the text and then adds the text to the file. This will make the text machine readable for text and data mining.  

  • Note: OCR software doesn’t work well on handwriting. You may need to type out handwritten text to make it machine readable. This is a time-consuming process and may not be possible if you have a lot of handwritten text.  

Intelligent character recognition (ICR), a form of OCR that can learn and recognise handwriting, is still being developed. Transcription by hand is the most practical option for now.
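If you have many PDF files, the "find" test described above can be automated. The sketch below assumes the third-party pypdf package is installed; the function names are hypothetical. It extracts whatever text layer a PDF has and checks whether a word you can see on the page appears in it.

```python
def contains_word(extracted_text: str, word: str) -> bool:
    """Return True if `word` occurs in the extracted text, ignoring case."""
    return word.casefold() in extracted_text.casefold()


def pdf_text(path: str) -> str:
    """Extract the text layer of a PDF (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party package, imported on demand

    reader = PdfReader(path)
    # Pages that are only scanned images yield no text.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def is_machine_readable(path: str, visible_word: str) -> bool:
    """Mimic the "find" test: can the computer find a word you can see?"""
    return contains_word(pdf_text(path), visible_word)
```

For example, is_machine_readable("scan.pdf", "dataset") returning False for a hypothetical file scan.pdf would suggest that the file needs OCR before it can be mined.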

Licensing, copyright and ethics

You need to comply with any licensing, copyright and ethics requirements from the start of your project all the way to publication of your research.

Licensing and copyright

The large datasets used in text and data mining often come from pre-existing research outputs, original creative works, or proprietary data owned by commercial enterprises. 

This means that text and data mining may require you to access, copy and process copyright-protected material.

Data providers each have their own standards and procedures that you must follow to legally use the data they provide. For example, many data providers license their data to be mined for research purposes only and either prohibit or require special negotiation for data mining with potential commercial applications.

If you have any questions about licensing conditions or negotiating permission for commercial applications of data mining, contact library.digitalcollections@sydney.edu.au.

If you have any questions about complying with copyright during data mining activities, contact copyright@sydney.edu.au.

Ethics

Even if all the original datasets contain de-identified data, data linkage and data mining can sometimes enable re-identification of de-identified data.

When combining separate datasets for text and data mining, you should seek appropriate ethics approvals and conduct privacy impact assessments before commencing.
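The following Python sketch illustrates how linkage can lead to re-identification. All records are made up for illustration, and the field names are assumptions. A de-identified dataset is matched against a second dataset containing names, using fields the two have in common (sometimes called quasi-identifiers). Where exactly one record matches, the person is effectively re-identified.

```python
# Entirely fictional records, for illustration only.
health = [
    {"postcode": "2006", "birth_year": 1984, "gender": "F", "diagnosis": "asthma"},
    {"postcode": "2006", "birth_year": 1991, "gender": "M", "diagnosis": "diabetes"},
    {"postcode": "2050", "birth_year": 1984, "gender": "F", "diagnosis": "migraine"},
]
membership = [
    {"name": "Person A", "postcode": "2006", "birth_year": 1984, "gender": "F"},
    {"name": "Person B", "postcode": "2037", "birth_year": 1975, "gender": "M"},
]

# Fields present in both datasets (an assumption for this example).
QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def key(record):
    """Build a comparison key from the shared fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(deidentified, named):
    """Return (name, record) pairs where exactly one record matches."""
    by_key = {}
    for record in deidentified:
        by_key.setdefault(key(record), []).append(record)
    matches = []
    for person in named:
        candidates = by_key.get(key(person), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((person["name"], candidates[0]))
    return matches


for name, record in link(health, membership):
    print(name, "->", record["diagnosis"])
```

Neither dataset on its own connects a name to a diagnosis, but the combination does. Running a check like this on your own combined data can help inform a privacy impact assessment.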

If you have questions regarding whether you require ethics approval for text and data mining activities, contact Ethics and Research Integrity.

Online mining etiquette

Best practice is to check the requirements of the data provider and comply with their preferences regarding data mining activities.

Causing inconvenience to data providers can be bad etiquette, even if the licence permits it. For example, bulk scraping a data provider's website to extract information can place a significant burden on the data provider's servers.

Make sure you use rate limiting when using an application programming interface (API) to automate accessing and downloading content. Rate limiting controls the number of automated requests you send to the data provider’s servers over a given period to avoid overloading them. Not using rate limiting can cause slow response times or even down time for other users.
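As a rough illustration, rate limiting can be as simple as pausing between requests. The Python sketch below is one possible approach, not a provider's required method: the RateLimiter class and fetch_all function are hypothetical names, and the fetch argument stands in for whatever function you use to call the provider's API.

```python
import time


class RateLimiter:
    """Allow at most one request every `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the limiter can be tested
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until enough time has passed since the previous request."""
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            remaining = self.min_interval - elapsed
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()


def fetch_all(urls, fetch, limiter):
    """Call `fetch(url)` for each URL, pausing between calls as required."""
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results
```

For example, fetch_all(urls, fetch, RateLimiter(2.0)) would send no more than one request every two seconds. The two-second interval is only a placeholder; use the limit stated in the data provider's terms or API documentation.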
