@josef-pkt
Created January 30, 2019 00:50
Basic GAM example with formula after merge
@Simply-Adi

Hello, thanks for this valuable resource on how to use GAM in Python. I would like to know how I should specify knot locations.
I tried bs.knot_kwds={'knots':(num1,num2)}
It runs without error. But, is this the correct way?

@josef-pkt
Author

The docstring says list of dict, because splines could have several variables and we need several splines. Try brackets [ ] around the dict.

You can inspect the BSplines instance to see how the knots were set.
Note: in BSplines everything is a list of splines, e.g. from above bs = BSplines(x_spline, df=[12, 10], degree=[3, 3])

AFAICS, [s.knots for s in bs.smoothers] should show the knots for each univariate bspline.
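
As a minimal sketch of the suggestion above, the snippet below passes `knot_kwds` to the constructor as a list with one dict per spline column and then reads the knots back from `bs.smoothers`. The data and knot values are made up for illustration. The `df` values assume that statsmodels expects `df - (degree + 1)` interior knots per spline; if your version raises a `ValueError` about the number of knots, adjust `df` accordingly.

```python
import numpy as np
from statsmodels.gam.api import BSplines

# Hypothetical data: two explanatory variables, one column each.
x_spline = np.column_stack([
    np.linspace(0.0, 10.0, 200),
    np.linspace(-5.0, 5.0, 200),
])

# One dict per column, wrapped in a list (illustrative knot values).
knot_kwds = [
    {"knots": [3.0, 7.0]},        # 2 interior knots -> df=6 assumed for degree 3
    {"knots": [-2.0, 0.0, 2.0]},  # 3 interior knots -> df=7 assumed for degree 3
]

bs = BSplines(x_spline, df=[6, 7], degree=[3, 3], knot_kwds=knot_kwds)

# Inspect the knots that each univariate spline actually uses.
knots_per_spline = [s.knots for s in bs.smoothers]
for name, knots in zip(bs.variable_names, knots_per_spline):
    print(name, knots)
```

Each entry of `knots_per_spline` should contain the interior knots that were passed in, together with repeated boundary knots at the ends of the data range.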

@tripartio

Thanks for this example. I'm an R mgcv user looking for equivalents in Python; your package is the best I've found so far.

I have two questions about bs = BSplines(x_spline, df=[12, 10], degree=[3, 3]):

  • Is df the maximum number of knots for a spline, like the k parameter in s(weight, k=12) in an mgcv formula?
  • What is degree? The documentation says "degree(s) of the spline; the same length and type rules apply as to df" but I don't understand what that means. What would the mgcv equivalent be?

@josef-pkt
Author

In general, the splines were based on the patsy definition and implementation; more information is available at https://patsy.readthedocs.io/en/latest/spline-regression.html

The main change that we made to the definition of splines is to add additional options for boundary knots to match mgcv.

I don't really remember the details.
df is likely the number of implied basis functions, i.e. the number of columns after dropping a column for the implicit constant.

I never remember "degree" versus "order" of polynomials: one is the highest power, the other is the number of terms.
It looks like degree=3 is the standard cubic bspline.

Examples are in the unit tests, and the unit tests were written to match mgcv (as far as possible).
Checking briefly: df is k in mgcv (based on the Poisson B-spline example).
The knot location was difficult to match up between patsy/statsmodels and mgcv (I guess to remove ambiguity with knot options),
e.g. statsmodels\gam\tests\results\results_mpg_bs_poisson.r forces R to use the same knots as we have.
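
One way to check how `df` maps to the number of columns is to build a spline and look at the shape of its basis. This is only a sketch with made-up data; it assumes `bs.basis` holds the design matrix of the spline terms, and the comparison with mgcv's `k` is left to verify against your own R output.

```python
import numpy as np
from statsmodels.gam.api import BSplines

# Hypothetical single explanatory variable, as a 2-D array with one column.
x = np.linspace(0.0, 1.0, 100)[:, None]

bs = BSplines(x, df=[12], degree=[3])

# Rows are observations, columns are the spline basis functions.
basis_shape = bs.basis.shape
print("df requested:", 12, "basis shape:", basis_shape)
```

If the column count comes out one less than `df`, that would be consistent with a column being dropped for the implicit constant.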

@tripartio

Thanks for the link to Patsy. That's a really useful package!

From the link, I note:

In patsy one can specify the number of degrees of freedom directly (actual number of columns of the resulting design matrix) whereas in mgcv one has to specify the number of knots to use. For instance, in the case of cyclic regression splines (with no additional constraints) the actual degrees of freedom is the number of knots minus one.

So, it seems that df (statsmodels) is k (mgcv) minus one.

Also, it seems that you're right about degree=3 meaning cubic splines:

bs() can produce B-spline bases of arbitrary degrees – e.g., degree=0 will give produce piecewise-constant functions, degree=1 will produce piecewise-linear functions, and the default degree=3 produces cubic splines.
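
The effect of `degree` in the quoted patsy description can be tried directly with patsy's `bs()`. The sketch below uses made-up data and an arbitrary `df=5`; it only checks the shape of the resulting basis for each degree, which per the patsy documentation should have `df` columns.

```python
import numpy as np
from patsy import bs

# Hypothetical evenly spaced data on [0, 1].
x = np.linspace(0.0, 1.0, 50)

shapes = {}
for degree in (0, 1, 3):
    # degree=0: piecewise constant, degree=1: piecewise linear, degree=3: cubic
    basis = bs(x, df=5, degree=degree)
    shapes[degree] = basis.shape
    print("degree", degree, "-> basis shape", basis.shape)
```

Plotting the columns of `basis` against `x` for each degree makes the difference between step, linear, and cubic pieces visible.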

To be honest, I really don't understand the mathematics behind splines and all that, but at least with this information, I can line up your documentation with mgcv's. Thanks!
