Preprocessing: why you should generate polynomial features before standardizing
For whatever reason, my main challenge in learning data science as a newbie has been organizing my workflow. I saw several threads on Stack Overflow about preprocessing and which order to apply standardization and polynomial features in, but no in-depth explanations.
Cleaning data is straightforward enough. Fitting and scoring a model is three lines of code. But what happens in between?
The steps between cleaning and fitting (a.k.a. preprocessing) offer plenty of opportunities to unwittingly bury mistakes under layers of code, only for them to come back and bite you in the ass later.
We’re just going to deal with standardization, dummy variables, and scikit-learn’s PolynomialFeatures.
Rule #1: Don’t standardize dummy variables.
When you standardize, you take each variable, subtract its mean, and divide by its standard deviation, so the values are expressed in standard deviations from the mean. The result is known as a z-score.
This makes sense for continuous variables, but not for categorical variables. Separate your features into two sets, one continuous and one categorical, before you proceed.
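Here's a minimal sketch of that split, assuming a pandas DataFrame with made-up column names ("age" and "income" as continuous features, "is_member" as an existing dummy):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up example data: two continuous columns and one dummy column.
df = pd.DataFrame({
    "age": [23, 35, 51, 44],
    "income": [41000, 58000, 90000, 72000],
    "is_member": [0, 1, 1, 0],
})

continuous_cols = ["age", "income"]
dummy_cols = ["is_member"]

# Standardize only the continuous columns; pass the dummies through untouched.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), continuous_cols),
    ("keep", "passthrough", dummy_cols),
])

X = preprocess.fit_transform(df)
```

ColumnTransformer keeps the two sets in one object, so you don't have to split and re-join the arrays by hand.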
Rule #2: Always standardize AFTER generating PolynomialFeatures.
1.) Loss of signal.
When you create feature interactions, you’re generating new values that are products and squares of your original features.
When you standardize, you’re converting values to z-scores, which are usually between -3 and +3.
By creating interactions between z-score-sized values, you’ll often end up with values even smaller in magnitude than the ones you started with.
To better illustrate this, imagine multiplying values between 0 and 1 by each other. You can only end up with more values between 0 and 1.
The purpose of squaring values in PolynomialFeatures is to increase signal. To retain that signal, it’s better to generate the interactions first and standardize second.
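To put rough numbers on that, here's a tiny sketch with made-up values (nothing from a real dataset):

```python
import numpy as np

# Z-score-sized values: multiplying them together mostly shrinks them toward zero.
z = np.array([0.5, -0.8, 0.3, 1.2])
print(z * z)      # [0.25 0.64 0.09 1.44]

# The same idea on a raw, unstandardized scale: squaring spreads the values out.
raw = np.array([5.0, 8.0, 3.0, 12.0])
print(raw * raw)  # [ 25.  64.   9. 144.]
```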
2.) Making random negatives.
When you standardize, you turn a set of only positive values into positive and negative values (with the mean at zero).
When you multiply negative by positive, you get negative.
Doing this to your data will create negative values from previously all-positive values.
In other words, your data will be jacked up.
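Putting both points together, here's a minimal sketch of the two orderings on a made-up, all-positive feature matrix; the numbers are just for illustration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Two all-positive, made-up features.
X = np.array([[2.0, 40.0],
              [3.0, 10.0],
              [5.0, 80.0],
              [8.0, 20.0]])

# Wrong order: standardize first, then build interactions from the z-scores.
wrong = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
])
# Column 3 of the output is the x1*x2 interaction; it now mixes positive and
# negative signs even though every raw value was positive.
print(wrong.fit_transform(X)[:, 3])

# Recommended order: build interactions on the raw values, then standardize
# everything that comes out (including the squares and products).
right = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
])
X_ready = right.fit_transform(X)
```

Using a Pipeline also locks the order of the steps in place, so you can't accidentally flip them later on.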
Rule #3: Don’t make interactions with dummy variables.
Dummy variables are either 0 or 1.
Multiplying anything by 1 doesn’t change it. Multiplying anything by 0 makes it zero.
You get no additional information from making interactions this way.
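Here's a quick check with made-up 0/1 values to see why:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One dummy column (d) and one continuous column (x), made up for illustration.
X = np.array([[0.0, 3.0],
              [1.0, 5.0],
              [1.0, 2.0],
              [0.0, 7.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))
# Output columns: [d, x, d^2, d*x, x^2]
# d^2 is identical to d, and d*x is just x with some rows zeroed out.
```

If you want to skip generating these columns in the first place, apply PolynomialFeatures only to the continuous set from Rule #1 (for example, inside the ColumnTransformer shown there) and add the dummies back afterwards.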
—
Preprocessing is an important step because it happens upstream of the rest of your data science workflow. Do it right, and the rest of your process will be way better! Hopefully understanding the reasoning behind these steps will help you keep the process clear in your mind.
Anything I can write about to help you find success in data science or trading? Tell me about it here: https://bit.ly/3mStNJG