Data Management Matters And Why You Need To Take It Seriously

Last Updated on March 5, 2020

We live in a world drowning in data. Internet tracking, stock market movement, genome sequencing technologies and their ilk all produce enormous amounts of data.

Most of this data is someone else’s responsibility, generated by someone else, stored in someone else’s database, which is maintained and made available by… you guessed it… someone else.

But. Whenever we carry out a machine learning project, we are working with a small subset of all the data that is out there.

Whether you generate your own data, or use publicly available data, your results must be reproducible. And the reproducibility of an analysis depends crucially on data management.

Data Management Matters

Photo by Ken Teegardin, some rights reserved

What is Data Management?

Data management is the process of storing, handling and securing raw data and any associated metadata.

This process includes:

  1. Identifying appropriate data for your analysis
  2. Downloading the data
  3. Reformatting as necessary
  4. Cleaning the data
  5. Storing data in an appropriate repository
  6. Backing up the data
  7. Annotating with metadata
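
As a rough illustration of the storing, backing up and annotating steps in the list above, here is a minimal Python sketch that records a checksum and some basic metadata next to a raw data file. The function names (`sha256_of`, `write_metadata`, `is_unchanged`), the `.meta.json` sidecar convention and the metadata fields are assumptions made for this example, not part of any particular tool or standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecar_path(data_path: Path) -> Path:
    """Hypothetical convention: metadata lives in '<file name>.meta.json'."""
    return data_path.with_name(data_path.name + ".meta.json")


def write_metadata(data_path: Path, source: str, notes: str = "") -> Path:
    """Write a JSON sidecar describing data_path and return the sidecar path."""
    metadata = {
        "file": data_path.name,
        "source": source,  # where the data came from
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": data_path.stat().st_size,
        "sha256": sha256_of(data_path),
        "notes": notes,  # free-text description of any cleaning or reformatting
    }
    sidecar = sidecar_path(data_path)
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar


def is_unchanged(data_path: Path) -> bool:
    """Compare the file's current checksum with the one stored in its sidecar."""
    recorded = json.loads(sidecar_path(data_path).read_text())
    return recorded["sha256"] == sha256_of(data_path)
```

Calling `write_metadata(Path("raw.csv"), source="...")` once after downloading, and `is_unchanged(Path("raw.csv"))` before each analysis or after restoring from a backup, is one simple way to notice when a raw file has been modified or corrupted.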