Finding datasets

Find datasets to use in your research or teaching

Researchers, governments and other institutions make data available in ways that can vary widely. Finding a dataset to use in your own work can therefore involve searching numerous data sources. It’s useful to understand the different ways in which people make their data available to ensure that you start your search in the best place possible.

For advice or assistance in locating data to reuse, please contact your Academic Liaison Librarian.

Before you start searching

Make your search for data as efficient as possible by figuring out a few things before you get started. Think about and answer the following questions:

  • What information must be included in the dataset for it to be useful to me? This could include required fields, information about the spatial or temporal coverage of the data, or a description of any processing or modifications that have been made to the data.
  • Which file format(s) am I able to work with? If you are planning on analysing data using specific software, ensure that you know the file formats that can be used with that software.
  • Are there any reuse licences that are incompatible with how I will use the data? For instance, if you plan on commercialising your research, then you should avoid datasets that stipulate a non-commercial condition in their reuse licence.

Knowing the answers to these questions will help you perform an effective search by identifying the keywords, filter conditions and data sources that are most appropriate for the data that you want to find.
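
One way to make these answers concrete is to write them down in a form you can check candidate datasets against. The following is a minimal Python sketch, not a prescribed method. The class, the function names and the example values are all hypothetical, and the licence check assumes Creative Commons style identifiers (such as CC-BY-NC-4.0), in which "NC" marks a non-commercial condition.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SearchCriteria:
    """Answers to the pre-search questions (all names here are illustrative)."""
    required_fields: set[str] = field(default_factory=set)
    usable_formats: set[str] = field(default_factory=set)  # file extensions, e.g. {".csv"}
    commercial_use: bool = False


def format_is_usable(filename: str, criteria: SearchCriteria) -> bool:
    """True if the file extension is one the planned software can read."""
    return Path(filename).suffix.lower() in criteria.usable_formats


def licence_is_compatible(licence_id: str, criteria: SearchCriteria) -> bool:
    """Assumes Creative Commons style identifiers, where 'NC' means non-commercial."""
    non_commercial = "NC" in licence_id.upper().split("-")
    return not (criteria.commercial_use and non_commercial)


# Hypothetical answers for a project that plans to commercialise its results.
criteria = SearchCriteria(
    required_fields={"station_id", "date", "rainfall_mm"},
    usable_formats={".csv", ".xlsx"},
    commercial_use=True,
)

print(format_is_usable("rainfall_2020.CSV", criteria))   # True
print(licence_is_compatible("CC-BY-NC-4.0", criteria))   # False
print(licence_is_compatible("CC-BY-4.0", criteria))      # True
```

Licences that don't follow this naming pattern need to be read individually; the check above is only a first filter.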

Where to look for data

Data repositories
  • Data repositories allow creators to deposit datasets so that the data is preserved and made available to others. A repository will provide a record for each dataset that includes descriptive information about the data (metadata). This information enables others to understand what the dataset is about and whether it’s relevant to them.

    Repositories can make data available under different access conditions. Some repositories are open access, meaning that all data stored there can be freely accessed by anyone. Other repositories require people to meet specific conditions, such as proof of approval for the proposed research from an ethics board, before access to data will be granted. Some repositories will have both openly accessible datasets and datasets with access conditions. Repositories may also allow metadata-only records to be published.

    Discipline-specific data repositories contain data relating to a specific subject or field of research. When looking for data from a specific research area, discipline-specific repositories are a great place to start your search. Use the Registry of Research Data Repositories to find a data repository for your field of research.

    General data repositories accept data from any discipline. If you’re looking for data resulting from multi-disciplinary research or your research area doesn’t have any discipline-specific repositories, then a general repository, such as Zenodo, figshare or Dryad, would be a good place to search.

    Institutional data repositories contain data produced by researchers working at a specific institution. If you’re looking for data from a particular researcher or research group, and you know what university or organisation they work for, then it can be worthwhile to search the organisation’s institutional repository. The University of Sydney’s institutional repository contains datasets, publications and theses produced by researchers at the University.

    Data portals and registries are useful tools to help you find datasets stored in repositories. Data portals allow you to search across multiple data repositories from a single search interface. They can be a great place to start searching if there are a large number of discipline-specific repositories in your research area. Data registries provide lists or catalogues of datasets that are stored elsewhere. Descriptive information about the datasets, usually in the form of metadata-only records, is provided by the registry, along with a link to where the data is stored. Researchers may choose to publish metadata-only records in several registries or repositories to ensure their data is as findable as possible.
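
Many repositories can also be searched programmatically, which can help when you want to repeat the same search over time. As an illustration only, the Python sketch below builds a keyword search URL for dataset records in Zenodo. The endpoint and the parameter names (q, type, size) are assumptions based on Zenodo's public REST API and should be checked against its current documentation; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for Zenodo's record search; confirm in Zenodo's API documentation.
ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def build_dataset_search_url(keywords: str, max_results: int = 10) -> str:
    """Build a search URL restricted to records of type 'dataset'."""
    params = {"q": keywords, "type": "dataset", "size": max_results}
    return f"{ZENODO_RECORDS_API}?{urlencode(params)}"


print(build_dataset_search_url("soil moisture", max_results=5))
# https://zenodo.org/api/records?q=soil+moisture&type=dataset&size=5
```

The sketch only constructs the URL. Opening it in a browser, or requesting it with an HTTP client, would return the matching records, and the same keywords can be entered in the repository's own search page.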

Government or other organisation datasets
  • Many governments and other organisations choose to make data that they have collected or produced available for others to reuse. Government and NGO datasets are often well described and high quality, making them easy to reuse. Data containing sensitive information, such as Australian Bureau of Statistics microdata, may require you to apply or meet certain conditions before you can be granted access. Some data providers charge a fee or require a subscription to access data. The Library subscribes to some data sources, making the data available to University of Sydney staff and students.

Author or project websites
  • Some researchers choose to make their data available on personal, research group, or project websites. This is an ad hoc form of data sharing, resulting in variable data and metadata quality, which can make some datasets difficult to find and reuse. Long-term access to data isn’t guaranteed, as it relies on the continuing investment of time and resources by individual researchers. As there is no central authority for data made available on websites, internet search engines are the best way of finding this data.

Data accompanying research publications
  • Increasingly, data access statements are being required in research publications to let readers know where the data underlying the research can be found. Ideally, this will point to data published in a repository; however, some journals allow data to be made available as part of the supplementary material to the article. This means that access to the data is limited to people or institutions with access to the article, generally through a paid subscription to the journal.

Data journals
  • Data journals can be good sources of well-described datasets. These journals provide a platform for researchers to publish their data that mirrors how traditional journal articles are published, including the peer review process. A data paper will contain the information necessary to understand and reuse the data, but generally won’t include any findings based on the data. The paper will include the dataset itself, or a link to it. Data journals can be discipline-specific, such as Geoscience Data Journal, or general or multidisciplinary, such as Scientific Data. Search for data journals as you would any other academic journal, using the Library’s search tools and internet search engines.

Assessing data for reuse

Once you have found a dataset, you should spend time assessing it for both relevance and quality. Use the metadata associated with the dataset to make sure that it meets all of the criteria that you established before you started searching. Also consider the trustworthiness of the data source. For instance, has the data been produced by a reputable organisation or researcher active in the field? Spending time on this step can be worthwhile, as it helps you avoid investing a lot of effort in analysing data that you later realise can’t be used for your purposes.
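
Part of this assessment can be written as a simple checklist. The Python sketch below is one possible approach, assuming the dataset's metadata has been copied into a dictionary. The keys (fields, coverage_start, coverage_end, creator), the function name and the example record are all hypothetical, and real metadata records will be structured differently.

```python
from datetime import date


def assess_metadata(metadata: dict, required_fields: set[str],
                    needed_start: date, needed_end: date) -> list[str]:
    """Return a list of problems; an empty list means the record passes these checks."""
    problems = []

    # Relevance: does the dataset contain every field the research needs?
    missing = required_fields - set(metadata.get("fields", []))
    if missing:
        problems.append("missing fields: " + ", ".join(sorted(missing)))

    # Relevance: does the temporal coverage span the period of interest?
    start = metadata.get("coverage_start")
    end = metadata.get("coverage_end")
    if start is None or end is None:
        problems.append("temporal coverage not stated")
    elif start > needed_start or end < needed_end:
        problems.append("temporal coverage does not span the period needed")

    # Trustworthiness: without a named creator the source is hard to evaluate.
    if not metadata.get("creator"):
        problems.append("no creator listed, so the source is hard to evaluate")

    return problems


record = {  # hypothetical metadata record
    "fields": ["station_id", "date", "rainfall_mm"],
    "coverage_start": date(2005, 1, 1),
    "coverage_end": date(2015, 12, 31),
    "creator": "Example Research Group",
}

print(assess_metadata(record, {"station_id", "date", "temperature_c"},
                      date(2000, 1, 1), date(2020, 12, 31)))
# ['missing fields: temperature_c', 'temporal coverage does not span the period needed']
```

A check like this only flags gaps in what the metadata states. Judging whether the organisation or researcher behind the data is reputable still requires your own reading.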