In data analysis, sparse data is a variable in which most of the cells do not contain actual data: the values are zero or empty. Sparse data is not the same as missing data. A missing value tells us nothing about what the value actually is, whereas a sparse value is explicitly zero (or empty). In AI (Artificial Intelligence) inference and machine learning, sparsity refers to values that will not significantly impact a calculation, or to a matrix of numbers that contains many zeros. Matrices that consist mostly of zero or empty values are called sparse, as opposed to dense matrices, in which most of the values are non-zero.
Suppose you are training a machine learning (ML) or deep learning model with the small, simple dataset below. It is five-dimensional: there are five features that can, when desired, jointly be used to create predictions.
For example, the features could be measurements of electrical current, particle counts, or anything similar. A value of zero means that no measurement was recorded; it acts as a null.
This is what such a table can look like:
| Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 |
|-----------|-----------|-----------|-----------|-----------|
| 0         | 0         | 0         | 0         | 0         |
| 0         | 0         | 0         | 7.7       | 0         |
| 1.26      | 0         | 0         | 0         | 0         |
| 0         | 0         | 0         | 0         | 0         |
| 2.12      | 0         | 2.11      | 0         | 0         |
| 0         | 0         | 0         | 0         | 0         |
| 0         | 0         | 0         | 0         | 0         |
| 0         | 0         | 0         | 0         | 0         |
| 0         | 0         | 0         | 0         | 0         |
| 0         | 1.28      | 0         | 0         | 0         |
| 0         | 0         | 0         | 0         | 0         |
| 0         | 0         | 0         | 0         | 1.87      |
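For reference, here is a minimal sketch (assuming the table above, written with decimal points) that loads this dataset as a NumPy array and measures how sparse it is:

import numpy as np

# The 12 x 5 dataset from the table above
dataset = np.array([
    [0,    0,    0,    0,   0   ],
    [0,    0,    0,    7.7, 0   ],
    [1.26, 0,    0,    0,   0   ],
    [0,    0,    0,    0,   0   ],
    [2.12, 0,    2.11, 0,   0   ],
    [0,    0,    0,    0,   0   ],
    [0,    0,    0,    0,   0   ],
    [0,    0,    0,    0,   0   ],
    [0,    0,    0,    0,   0   ],
    [0,    1.28, 0,    0,   0   ],
    [0,    0,    0,    0,   0   ],
    [0,    0,    0,    0,   1.87],
])

# Only 6 of the 60 cells contain an actual measurement; the rest are zero
print(np.count_nonzero(dataset), "non-zero values out of", dataset.size)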
Sparse data
A variable with sparse data is one in which a relatively high percentage of its cells do not contain actual data. These empty or null values still take up storage space in the file or dataset.
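To get a feel for the storage aspect (a rough sketch, assuming SciPy is available), the same 12 × 5 table can be stored in a compressed sparse row (CSR) format that only keeps the non-zero values and their positions:

import numpy as np
from scipy.sparse import csr_matrix

# Dense version of the 12 x 5 table: every cell is stored, zeros included
dense = np.zeros((12, 5))
dense[1, 3], dense[2, 0], dense[4, 0] = 7.7, 1.26, 2.12
dense[4, 2], dense[9, 1], dense[11, 4] = 2.11, 1.28, 1.87

# CSR version: only the 6 non-zero values plus their indices are stored
sparse = csr_matrix(dense)

print("Dense bytes :", dense.nbytes)  # 60 values * 8 bytes = 480
print("Sparse bytes:", sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)

At this toy size the difference is modest, but the gap grows quickly as the share of zeros increases.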
Feature Scaling with Sparse Data
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature 1 from the table above, reshaped into a single-column array
samples_feature = np.array([0, 0, 1.26, 0, 2.12, 0, 0, 0, 0, 0, 0, 0]).reshape(-1, 1)

# Standardization: subtract the mean and divide by the standard deviation
scaler = StandardScaler()
scaler.fit(samples_feature)
standardized_dataset = scaler.transform(samples_feature)
print(standardized_dataset)
The output of standardizing this feature with StandardScaler:
[[-0.43079317]
[-0.43079317]
[ 1.49630526]
[-0.43079317]
[ 2.81162641]
[-0.43079317]
[-0.43079317]
[-0.43079317]
[-0.43079317]
[-0.43079317]
[-0.43079317]
[-0.43079317]]
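To see where these numbers come from (a sketch of the arithmetic, not part of the original code), standardization subtracts the feature's mean and divides by its standard deviation:

import numpy as np

samples_feature = np.array([0, 0, 1.26, 0, 2.12, 0, 0, 0, 0, 0, 0, 0])

mean = samples_feature.mean()  # ~0.2817
std = samples_feature.std()    # ~0.6538 (population standard deviation, as StandardScaler uses)

print((0 - mean) / std)     # ~ -0.43079317, what every zero becomes
print((1.26 - mean) / std)  # ~  1.49630526
print((2.12 - mean) / std)  # ~  2.81162641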
Note that standardization has destroyed the sparsity: every zero has been shifted to -0.43079317, so the array is now dense. Scikit-learn's MaxAbsScaler, which divides each value by the feature's maximum absolute value, avoids this:

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# The same sparse feature as before
samples_feature = np.array([0, 0, 1.26, 0, 2.12, 0, 0, 0, 0, 0, 0, 0]).reshape(-1, 1)

# Absolute-maximum scaling: divide every value by max(|x|), so zeros stay zero
scaler = MaxAbsScaler()
scaler.fit(samples_feature)
scaled_dataset = scaler.transform(samples_feature)
print(scaled_dataset)

The output of MaxAbsScaler:
[[0. ]
[0. ]
[0.59433962]
[0. ]
[1. ]
[0. ]
[0. ]
[0. ]
[0. ]
[0. ]
[0. ]
[0. ]]
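The arithmetic behind this output is simple (again, just a sketch): MaxAbsScaler divides every value by the largest absolute value in the feature, so the zeros stay exactly zero:

import numpy as np

samples_feature = np.array([0, 0, 1.26, 0, 2.12, 0, 0, 0, 0, 0, 0, 0])
max_abs = np.max(np.abs(samples_feature))  # 2.12

print(1.26 / max_abs)  # 0.59433962...
print(2.12 / max_abs)  # 1.0
print(0 / max_abs)     # 0.0, zeros are preserved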
Notably, the output would be exactly the same if we applied MinMaxScaler, which is Scikit-learn's implementation of min-max normalization, to the dataset we used above:
[[0. ]
[0. ]
[0.59433962]
[0. ]
[1. ]
[0. ]
[0. ]
[0. ]
[0. ]
[0. ]
[0. ]
[0. ]]
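For completeness, here is a sketch of how that output could be reproduced (assuming the same samples_feature array as before):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

samples_feature = np.array([0, 0, 1.26, 0, 2.12, 0, 0, 0, 0, 0, 0, 0]).reshape(-1, 1)

# Min-max normalization: (x - min) / (max - min), mapping the feature into [0, 1]
scaler = MinMaxScaler()
print(scaler.fit_transform(samples_feature))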
Now, here's the catch: all values in the original input array are non-negative, so the minimum value is zero. Because min-max normalization rescales by the minimum and maximum values, every value ends up in the [0, 1] range and the zeros remain zero. And since the maximum absolute value here equals the overall maximum value, MaxAbsScaler produces exactly the same result.
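As a quick check of that claim, using the values of Feature 1: when the minimum is 0, the min-max formula (x - min) / (max - min) collapses to x / max, which is exactly what MaxAbsScaler computes for non-negative data:

x_min, x_max = 0.0, 2.12  # minimum and maximum of Feature 1

for x in (0.0, 1.26, 2.12):
    min_max = (x - x_min) / (x_max - x_min)  # MinMaxScaler formula
    max_abs = x / abs(x_max)                 # MaxAbsScaler formula (max equals max abs here)
    print(x, min_max, max_abs)               # identical results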
What if we used a dataset where negative values are present?
samples_feature = np.array([-2.40, -6.13, 0.24, 0, 0, 0, 0, 0, 0, 2.13]).reshape(-1, 1)
Min-max normalization would produce this:
[[0.45157385]
[0. ]
[0.77118644]
[0.74213075]
[0.74213075]
[0.74213075]
[0.74213075]
[0.74213075]
[0.74213075]
[1. ]]
Bye bye sparsity!
MaxAbsScaler, on the other hand, behaves as we would expect and preserves the zeros:
[[-0.39151713]
[-1. ]
[ 0.03915171]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0. ]
[ 0.34747145]]
That is why you should prefer absolute-maximum scaling (using MaxAbsScaler) when you are working with a sparse dataset.
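As a closing sketch (illustrative, not from the original example): MaxAbsScaler can even be fed a SciPy sparse matrix directly and keeps it sparse, whereas StandardScaler with its default mean-centering cannot be applied to sparse input at all:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Sparse version of the 12 x 5 table from the beginning of the article
dense = np.zeros((12, 5))
dense[1, 3], dense[2, 0], dense[4, 0] = 7.7, 1.26, 2.12
dense[4, 2], dense[9, 1], dense[11, 4] = 2.11, 1.28, 1.87
X_sparse = csr_matrix(dense)

# Scaling happens column by column; the result is still sparse
scaled = MaxAbsScaler().fit_transform(X_sparse)
print(type(scaled), scaled.nnz)  # still a CSR matrix with 6 stored values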