You need to comply with all licensing, copyright and ethics requirements from the start of your project through to publication of your research.
Licensing and copyright
The large datasets used in text and data mining often come from pre-existing research outputs, original creative works, or proprietary data owned by commercial enterprises.
This means that text and data mining may require you to access, copy and process copyright-protected material.
Data providers each have their own standards and procedures that you must follow to legally use the data they provide. For example, many data providers license their data to be mined for research purposes only and either prohibit or require special negotiation for data mining with potential commercial applications.
If you have any questions about licensing conditions or negotiating permission for commercial applications of data mining, contact library.digitalcollections@sydney.edu.au.
If you have any questions about complying with copyright during data mining activities, contact copyright@sydney.edu.au.
Ethics
Even if every original dataset contains only de-identified data, linking and mining those datasets can sometimes make it possible to re-identify individuals.
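To illustrate how linkage can undo de-identification, here is a minimal Python sketch. Everything in it is hypothetical: the records are invented, and the field names, the `QUASI_IDENTIFIERS` tuple and the `link_unique` helper are assumptions made for this example only. It joins a de-identified dataset to a second dataset that contains names, using attributes that appear in both.

```python
from collections import defaultdict

# Invented example records; none of this is real data.
deidentified_health = [
    {"postcode": "2006", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "2006", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "2050", "birth_year": 1990, "sex": "M", "diagnosis": "migraine"},
]
named_membership = [
    {"name": "Person A", "postcode": "2006", "birth_year": 1984, "sex": "F"},
    {"name": "Person B", "postcode": "2050", "birth_year": 1990, "sex": "M"},
    {"name": "Person C", "postcode": "2050", "birth_year": 1990, "sex": "M"},
]

# Attributes that are not names, but appear in both datasets (an assumption).
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def link_unique(deidentified, identified, keys=QUASI_IDENTIFIERS):
    """Return (identified_row, deidentified_row) pairs whose combination of
    key values occurs exactly once in each dataset."""
    def index(rows):
        grouped = defaultdict(list)
        for row in rows:
            grouped[tuple(row[k] for k in keys)].append(row)
        return grouped

    deid_index = index(deidentified)
    id_index = index(identified)
    matches = []
    for key, deid_rows in deid_index.items():
        id_rows = id_index.get(key, [])
        # A one-to-one match ties a named person to a "de-identified" record.
        if len(deid_rows) == 1 and len(id_rows) == 1:
            matches.append((id_rows[0], deid_rows[0]))
    return matches
```

In this invented data, only one combination of postcode, birth year and sex is unique in both datasets, so only that record can be tied to a name. The other records stay ambiguous because two named people share the same values. A privacy impact assessment would normally look at how many such unique combinations a proposed linkage would create.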
When combining separate datasets for text and data mining, you should seek appropriate ethics approvals and conduct privacy impact assessments before commencing.
If you have questions regarding whether you require ethics approval for text and data mining activities, contact Ethics and Research Integrity.
Online mining etiquette
Best practice is to check the requirements of the data provider and comply with their preferences regarding data mining activities.
Causing inconvenience to data providers can be bad etiquette, even if the licence permits the activity. For example, bulk scraping a data provider's website to extract information can place a significant burden on the data provider's servers.
Make sure you use rate limiting when using an application programming interface (API) to automate accessing and downloading content. Rate limiting controls the number of automated requests you send to the data provider’s servers over a given period to avoid overloading them. Without rate limiting, your requests can cause slow response times or even downtime for other users.
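As a rough illustration of client-side rate limiting, the Python sketch below spaces requests evenly so that no more than a chosen number are sent in a given period. The `RateLimiter` class, the `fetch_all` helper and the `fetch` callable are hypothetical names invented for this example, not part of any data provider's API. The right limit is whatever the data provider specifies in its terms or API documentation.

```python
import time


class RateLimiter:
    """Space calls evenly: at most `max_requests` per `per_seconds`."""

    def __init__(self, max_requests, per_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        if max_requests <= 0 or per_seconds <= 0:
            raise ValueError("max_requests and per_seconds must be positive")
        # Minimum gap between two consecutive requests, in seconds.
        self.min_interval = per_seconds / max_requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self):
        """Block until the next request may be sent; return seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the earliest time the following request may go out.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


def fetch_all(urls, fetch, limiter):
    """Call the hypothetical `fetch(url)` for each URL, pausing as needed."""
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results
```

For example, `fetch_all(urls, fetch, RateLimiter(max_requests=1, per_seconds=2))` would send at most one request every two seconds, where `fetch` is whatever function you use to download a single item. The `clock` and `sleep` parameters exist only so the timing can be tested without real delays.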