How to simplify a dataset

Defry Hamdhana
2 min read · Feb 26, 2021

Since data is the most important ingredient in machine learning, the first thing we should do is simplify the data we have collected. Almost all raw data is messy and noisy at the start, so we have to filter and clean it before it can be processed. Here are three steps you can take to simplify a dataset:

1. Reduce noisy data

Suppose we have an Excel spreadsheet with many rows and columns, one of which holds the date of birth. In practice that column may contain a mix of types and formats: text, integers, and so on. In this step we homogenize the date-of-birth column, converting every value to a single format, so that it is ready to use.
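As a minimal sketch of this step, the snippet below converts a mixed date-of-birth column into `date` objects using only Python's standard library. The list of formats, the rule that integers are read as YYYYMMDD, and the names `KNOWN_FORMATS`, `parse_birth_date`, and `raw_column` are assumptions made for illustration; a real dataset will need its own list.

```python
from datetime import date, datetime

# Hypothetical text formats; adjust to whatever actually appears in your data.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")

def parse_birth_date(value):
    """Return a date for a raw cell value, or None if it cannot be parsed."""
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value
    if isinstance(value, int):
        # Assumption: integers are written as YYYYMMDD, e.g. 19900526.
        text, formats = str(value), ("%Y%m%d",)
    else:
        text, formats = str(value).strip(), KNOWN_FORMATS
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next format
    return None  # leave unparseable values for manual review

raw_column = ["1990-05-26", "26/05/1990", 19900526, "26 May 1990", "unknown"]
clean_column = [parse_birth_date(v) for v in raw_column]
```

Values that cannot be parsed come back as `None` instead of raising an error, so we can count them and decide later whether to fix or drop those rows.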

2. Reduce dimensionality

In general, a dataset may contain thousands or even tens of thousands of variables, for example in customer or transaction data. As far as possible, we should remove variables that duplicate information already carried by another. For example, suppose the dataset contains columns for both date of birth and age. These two columns carry essentially the same information, so we can drop the date-of-birth column and keep only the age column.
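The date of birth and age example can be sketched as follows. Before dropping a column, it is worth checking that it really is redundant. The sample rows, the fixed `reference` date, and the helper `age_on` are hypothetical and only serve the illustration.

```python
from datetime import date

def age_on(birth_date, reference):
    """Whole years between birth_date and reference."""
    had_birthday = (reference.month, reference.day) >= (birth_date.month, birth_date.day)
    return reference.year - birth_date.year - (0 if had_birthday else 1)

# Hypothetical sample rows.
rows = [
    {"customer_id": 1, "date_of_birth": date(1990, 5, 26), "age": 30},
    {"customer_id": 2, "date_of_birth": date(1985, 12, 1), "age": 35},
]
reference = date(2021, 2, 26)  # assumed date on which the ages were recorded

# The age column is redundant only if it can be derived from date_of_birth.
redundant = all(age_on(r["date_of_birth"], reference) == r["age"] for r in rows)

if redundant:
    reduced = [{k: v for k, v in r.items() if k != "date_of_birth"} for r in rows]
else:
    reduced = rows  # keep both columns and investigate the mismatch
```

If the check fails, the two columns disagree somewhere, which is itself a sign of noisy data that should be cleaned before any column is removed.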

3. Find important variables or combinations

Look for the important variables, or combinations of variables, in the dataset. Suppose we have 10 variables. We can analyze which of them matter for the task and which do not, then eliminate the unimportant ones to get a new dataset that is ready to be processed. An example is removing the name variable, because a person's name does not add value to the analysis we will do.
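Which variables count as important depends on the analysis, so the snippet below is only one possible heuristic, not a general rule. It flags columns that are constant across all rows, and text columns whose value is different in every row, such as a name. The function `uninformative_columns` and the sample rows are hypothetical.

```python
def uninformative_columns(rows):
    """Flag constant columns and text columns that are unique in every row."""
    flagged = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        distinct = len(set(values))
        is_constant = distinct == 1
        is_identifier_like = (
            distinct == len(values) and all(isinstance(v, str) for v in values)
        )
        if is_constant or is_identifier_like:
            flagged.append(column)
    return flagged

# Hypothetical sample rows.
rows = [
    {"name": "Alice", "country": "ID", "age": 30, "purchases": 4},
    {"name": "Budi", "country": "ID", "age": 35, "purchases": 1},
    {"name": "Citra", "country": "ID", "age": 30, "purchases": 7},
]

to_drop = uninformative_columns(rows)
reduced = [{k: v for k, v in r.items() if k not in to_drop} for r in rows]
```

A heuristic like this only produces candidates. We should still look at each flagged column ourselves before removing it, because a column that looks unimportant here may matter for a different question.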
