Log-Log plotting

A log-log plot makes it possible to discover power-law dependencies between variables: when both the X and Y axes are logarithmic and the points form a straight line, there is likely a power-law relationship between X and Y of the form:

y = ax^m

To find the exponent m, calculate:

m = log(y2/y1) / log(x2/x1)

Where y1 and y2 are the values of the function at the arguments x1 and x2, respectively.

for R:

# slope between the two points (x1, y1) = (51590, 1000) and (x2, y2) = (166221, 6747)
m <- log(6747/1000) / log(166221/51590)  # ≈ 1.63
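
With more than two points, a least-squares fit of log(y) against log(x) gives both the exponent m (the slope) and the constant a (exp of the intercept). A minimal sketch in R, using made-up x and y values:

# hypothetical data roughly following y = 2 * x^1.5, with a bit of noise
x <- c(10, 50, 100, 500, 1000)
y <- 2 * x^1.5 * exp(rnorm(length(x), sd = 0.05))

fit <- lm(log(y) ~ log(x))   # straight-line fit in log-log space
m <- coef(fit)[2]            # slope -> exponent m, close to 1.5
a <- exp(coef(fit)[1])       # intercept -> constant a, close to 2

plot(x, y, log = "xy")       # the points lie on a straight line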

More info: https://en.wikipedia.org/wiki/Log%E2%80%93log_plot

RMSE vs MAE

Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are both expressed in the same units as the predicted variable, which is convenient. Both also treat the sign of the error the same way: underestimating and overestimating are penalized equally. The big difference is that RMSE squares the errors before averaging, so it gives high weight to large errors. If you want to make sure individual errors stay small, RMSE is the better measure to check.

Another implication of the RMSE formula that is not often discussed has to do with sample size. Using MAE, we can put a lower and an upper bound on RMSE. The lower bound is MAE ≤ RMSE: the RMSE is always greater than or equal to the MAE, with equality when all of the errors have the same magnitude.

The upper bound is RMSE ≤ MAE * sqrt(n), where n is the number of test samples. The gap between RMSE and MAE is greatest when all of the prediction error comes from a single test sample. That sample's error is then MAE * n (and 0 for all other samples), so the mean squared error is (MAE * n)^2 / n = MAE^2 * n, and taking the square root gives RMSE = MAE * sqrt(n).
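
A quick numeric check of these bounds in R (the error vectors are made up):

rmse <- function(err) sqrt(mean(err^2))
mae  <- function(err) mean(abs(err))

errors <- c(2, -1, 0.5, -3, 1.5)                     # arbitrary example errors
mae(errors) <= rmse(errors)                          # TRUE: MAE <= RMSE
rmse(errors) <= mae(errors) * sqrt(length(errors))   # TRUE: RMSE <= MAE * sqrt(n)

# extreme case: all of the error sits in a single sample -> RMSE hits the upper bound
one_big <- c(10, 0, 0, 0, 0)
rmse(one_big)                         # 4.472 = sqrt(20)
mae(one_big) * sqrt(length(one_big))  # 4.472 = 2 * sqrt(5), the same value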

Regression loss functions

  • MAE (a.k.a. L1) - robust to outliers, but handles small errors poorly: the gradient has the same magnitude for tiny and huge errors, so it does not shrink near the minimum
  • MSE (a.k.a. L2) - vulnerable to outliers (the penalty grows quadratically), smooth when the error is < 1.0
  • Log-Cosh - Σ log(cosh(y_pred - y)), smoother than L2 and twice differentiable (which might be useful for XGBoost)
  • Huber Loss (Smooth L1) - basically L1 that switches to L2 when the error is below a threshold delta; delta is a parameter (1.0 by default in PyTorch). See the sketch after this list.
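
A minimal sketch of these losses in R (my own function names, operating element-wise on prediction vectors):

mae_loss     <- function(y_pred, y) mean(abs(y_pred - y))
mse_loss     <- function(y_pred, y) mean((y_pred - y)^2)
logcosh_loss <- function(y_pred, y) sum(log(cosh(y_pred - y)))
huber_loss   <- function(y_pred, y, delta = 1.0) {
  err <- abs(y_pred - y)
  # quadratic (L2-like) below delta, linear (L1-like) above it
  mean(ifelse(err < delta, 0.5 * err^2, delta * (err - 0.5 * delta)))
}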

KL-Divergence

D_KL(P || Q) - the KL divergence measures the inefficiency of using the probability distribution Q to approximate the true probability distribution P. For discrete distributions it is D_KL(P || Q) = Σ P(x) * log(P(x) / Q(x)).
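
A minimal sketch for discrete distributions in R (the vectors p and q are made up; each must sum to 1 and be strictly positive):

kl_divergence <- function(p, q) sum(p * log(p / q))

p <- c(0.5, 0.3, 0.2)   # "true" distribution P
q <- c(0.4, 0.4, 0.2)   # approximating distribution Q
kl_divergence(p, q)     # ≈ 0.025, > 0 because Q differs from P
kl_divergence(p, p)     # 0, no inefficiency when Q equals P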
