This is a minor bug-fix release from 0.16.1. It includes a large number of bug fixes along with some new features (the pipe() method), enhancements, and performance improvements.
We recommend that all users upgrade to this version.
Highlights include:
A new pipe method, see here
Documentation on how to use numba with pandas, see here
What’s new in v0.16.2
New features
Pipe
Other enhancements
API changes
Performance improvements
Bug fixes
Contributors
We’ve introduced a new method DataFrame.pipe(). As suggested by the name, pipe should be used to pipe data through a chain of function calls. The goal is to avoid confusing nested function calls like
# df is a DataFrame
# f, g, and h are functions that take and return DataFrames
f(g(h(df), arg1=1), arg2=2, arg3=3)  # noqa F821
The logic flows from inside out, and function names are separated from their keyword arguments. This can be rewritten as
(df.pipe(h)  # noqa F821
   .pipe(g, arg1=1)  # noqa F821
   .pipe(f, arg2=2, arg3=3)  # noqa F821
 )
Now both the code and the logic flow from top to bottom. Keyword arguments are next to their functions. Overall the code is much more readable.
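As a rough illustration of the rewrite described above, the sketch below defines three small helper functions and checks that the nested call and the pipe chain give the same result. The helpers drop_missing, scale and add_total are hypothetical stand-ins for h, g and f; they are not part of pandas.

```python
import pandas as pd


# Hypothetical helpers standing in for h, g and f from the text.
def drop_missing(df):
    """Return the frame without rows that contain missing values."""
    return df.dropna()


def scale(df, factor=1.0):
    """Multiply every value in the frame by `factor`."""
    return df * factor


def add_total(df, name="total"):
    """Append a column holding the row-wise sum."""
    return df.assign(**{name: df.sum(axis=1)})


df = pd.DataFrame({"a": [1.0, 2.0, None], "b": [4.0, 5.0, 6.0]})

# Nested form: reads from the inside out.
nested = add_total(scale(drop_missing(df), factor=10), name="total")

# Piped form: reads from top to bottom, keywords next to their functions.
piped = (df.pipe(drop_missing)
           .pipe(scale, factor=10)
           .pipe(add_total, name="total"))
```

Both spellings call the same functions in the same order, so the choice between them is purely about readability.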
In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument. When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple of (function, keyword) indicating where the DataFrame should flow. For example:
In [1]: import statsmodels.formula.api as sm

In [2]: bb = pd.read_csv('data/baseball.csv', index_col='id')

# sm.ols takes (formula, data)
In [3]: (bb.query('h > 0')
   ...:    .assign(ln_h=lambda df: np.log(df.h))
   ...:    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   ...:    .fit()
   ...:    .summary()
   ...:  )
   ...:
Out[3]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                     hr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     34.28
Date:                Mon, 07 Dec 2020   Prob (F-statistic):           3.48e-15
Time:                        12:18:05   Log-Likelihood:                -205.92
No. Observations:                  68   AIC:                             421.8
Df Residuals:                      63   BIC:                             432.9
Df Model:                           4
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
year            4.2277      2.324      1.819      0.074      -0.417       8.872
g               0.1841      0.029      6.258      0.000       0.125       0.243
==============================================================================
Omnibus:                       10.875   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
Skew:                           0.537   Prob(JB):                     0.000175
Kurtosis:                       5.225   Cond. No.                     1.49e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
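For readers without statsmodels or the baseball data set, the (function, keyword) form can also be tried with a small self-contained sketch. The function summarize below is hypothetical; like sm.ols it takes its data as a second argument named data, so the tuple tells pipe which keyword should receive the DataFrame.

```python
import pandas as pd


# Hypothetical function whose data argument is NOT the first positional one.
def summarize(label, data, column="x"):
    """Format the mean of one column with a leading label."""
    return f"{label}: {data[column].mean():.1f}"


df = pd.DataFrame({"x": [1, 2, 3, 6]})

# The tuple (summarize, "data") routes the filtered frame to the `data`
# keyword; the remaining positional argument becomes `label`.
result = df[df.x > 1].pipe((summarize, "data"), "mean of x")

# Equivalent direct call, shown for comparison.
direct = summarize("mean of x", data=df[df.x > 1])
```

Here result and direct are the same string, since pipe only forwards the frame under the requested keyword.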
The pipe method is inspired by Unix pipes, which stream text through processes. More recently, dplyr and magrittr have introduced the popular (%>%) pipe operator for R.
See the documentation for more. (GH10129)
Added rsplit to Index/Series StringMethods (GH10303)
Removed the hard-coded size limits on the DataFrame HTML representation in the IPython notebook, leaving this to IPython itself (only for IPython v3.0 or greater). This eliminates the duplicate scroll bars that appeared in the notebook with large frames (GH10231).
Note that the notebook has a toggle output scrolling feature to limit the display of very large frames (by clicking left of the output). You can also configure the way DataFrames are displayed using the pandas options, see here.
The axis parameter of DataFrame.quantile now also accepts index and column. (GH9543)
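The rsplit and quantile entries above can be tried with a short sketch like the following. The sample strings and numbers are made up for illustration, and the quantile call only exercises the "index" spelling of the axis argument.

```python
import pandas as pd

# rsplit splits from the right, so limiting the number of splits keeps the
# left-hand part of each string intact.
s = pd.Series(["a_b_c", "d_e_f"])
from_right = s.str.rsplit("_", n=1)
from_left = s.str.split("_", n=1)

# DataFrame.quantile accepts the axis by name as well as by number.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
by_name = df.quantile(0.5, axis="index")
by_number = df.quantile(0.5, axis=0)
```

With n=1 the two split directions differ only in which side keeps the remaining delimiter, which is the usual reason to reach for rsplit.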
Holiday now raises NotImplementedError if both offset and observance are used in the constructor instead of returning an incorrect result (GH10217).
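A minimal sketch of the Holiday change, assuming the class and the nearest_workday observance function are imported from pandas.tseries.holiday. The holiday name and date are arbitrary placeholders.

```python
import pandas as pd
from pandas.tseries.holiday import Holiday, nearest_workday

# Passing both `offset` and `observance` is rejected up front instead of
# silently producing an incorrect holiday rule.
try:
    Holiday(
        "Example Day",  # placeholder name
        month=7,
        day=4,
        offset=pd.DateOffset(days=1),
        observance=nearest_workday,
    )
    raised = False
except NotImplementedError:
    raised = True

# Using only one of the two arguments is still fine.
observed_only = Holiday("Example Day", month=7, day=4, observance=nearest_workday)
```

Code that previously combined the two arguments needs to fold the shift into a single observance function or a single offset.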
Improved Series.resample performance with dtype=datetime64[ns] (GH7754)
Improved performance of str.split when expand=True (GH10081)
Bug in Series.hist raises an error when a one row Series was given (GH10214)
Bug where HDFStore.select modifies the passed columns list (GH7212)
Bug in Categorical repr with display.width of None in Python 3 (GH10087)
Bug in to_json with certain orients and a CategoricalIndex would segfault (GH10317)
Bug where some of the nan functions do not have consistent return dtypes (GH10251)
Bug in DataFrame.quantile on checking that a valid axis was passed (GH9543)
Bug in groupby.apply aggregation for Categorical not preserving categories (GH10138)
Bug in to_csv where date_format is ignored if the datetime is fractional (GH10209)
Bug in DataFrame.to_json with mixed data types (GH10289)
Bug in cache updating when consolidating (GH10264)
Bug in mean() where integer dtypes can overflow (GH10172)
Bug where Panel.from_dict does not set dtype when specified (GH10058)
Bug in Index.union raises AttributeError when passing array-likes. (GH10149)
Bug in Timestamp's microsecond, quarter, dayofyear, week and daysinmonth properties, which returned np.int type instead of the built-in int. (GH10050)
Bug in NaT raising AttributeError when accessing the daysinmonth and dayofweek properties. (GH10096)
Bug in Index repr when using the max_seq_items=None setting (GH10182).
Bug in getting timezone data with dateutil on various platforms (GH9059, GH8639, GH9663, GH10121)
Bug in displaying datetimes with mixed frequencies; display ‘ms’ datetimes to the proper precision. (GH10170)
Bug in setitem where type promotion is applied to the entire block (GH10280)
Bug in Series arithmetic methods may incorrectly hold names (GH10068)
Bug in GroupBy.get_group when grouping on multiple keys, one of which is categorical. (GH10132)
Bug where DatetimeIndex and TimedeltaIndex names are lost after timedelta arithmetic (GH9926)
Bug in DataFrame construction from nested dict with datetime64 (GH10160)
Bug in Series construction from dict with datetime64 keys (GH9456)
Bug in Series.plot(label="LABEL") not correctly setting the label (GH10119)
Bug in plot not defaulting to matplotlib axes.grid setting (GH9792)
Bug causing strings containing an exponent, but no decimal to be parsed as int instead of float in engine='python' for the read_csv parser (GH9565)
Bug in Series.align resets name when fill_value is specified (GH10067)
Bug in read_csv causing index name not to be set on an empty DataFrame (GH10184)
Bug in SparseSeries.abs resets name (GH10241)
Bug in TimedeltaIndex slicing may reset freq (GH10292)
Bug in GroupBy.get_group raises ValueError when group key contains NaT (GH6992)
Bug in SparseSeries constructor ignores input data name (GH10258)
Bug in Categorical.remove_categories causing a ValueError when removing the NaN category if underlying dtype is floating-point (GH10156)
Bug where infer_freq infers time rule (WOM-5XXX) unsupported by to_offset (GH9425)
Bug in DataFrame.to_hdf() where table format would raise a seemingly unrelated error for invalid (non-string) column names. This is now explicitly forbidden. (GH9057)
Bug in masking an empty DataFrame (GH10126).
Bug where MySQL interface could not handle numeric table/column names (GH10255)
Bug in read_csv with a date_parser that returned a datetime64 array of other time resolution than [ns] (GH10245)
Bug in Panel.apply when the result has ndim=0 (GH10332)
Bug in read_hdf where auto_close could not be passed (GH9327).
Bug in read_hdf where open stores could not be used (GH10330).
Bug in adding empty DataFrames; the result is now a DataFrame that .equals an empty DataFrame (GH10181).
Bug in to_hdf and HDFStore which did not check that complib choices were valid (GH4582, GH8874).
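A few of the fixes listed above are easy to check interactively. The sketch below covers three of them (Timestamp properties returning built-in int, Series.align keeping names when fill_value is given, and mean() on large integers). The sample values are made up, and the checks describe behaviour with the fixes in place rather than reproducing the original bug reports.

```python
import pandas as pd

# GH10050: Timestamp properties should be built-in ints.
ts = pd.Timestamp("2015-06-12 10:30:00.000123")
timestamp_fields = {
    "microsecond": ts.microsecond,
    "quarter": ts.quarter,
    "dayofyear": ts.dayofyear,
}
all_builtin_int = all(isinstance(v, int) for v in timestamp_fields.values())

# GH10067: Series.align should keep each name when fill_value is specified.
left = pd.Series([1, 2], index=["a", "b"], name="left")
right = pd.Series([3, 4], index=["b", "c"], name="right")
aligned_left, aligned_right = left.align(right, fill_value=0)

# GH10172: the mean of large int64 values should not wrap around.
# The sum of these three values exceeds the int64 range.
big = pd.Series([2**62, 2**62, 2**62], dtype="int64")
big_mean = big.mean()
```

Each check is independent, so any one of them can be lifted into a regression test on its own.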
A total of 34 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
Andrew Rosenfeld
Artemy Kolchinsky
Bernard Willers +
Christer van der Meeren
Christian Hudon +
Constantine Glen Evans +
Daniel Julius Lasiman +
Evan Wright
Francesco Brundu +
Gaëtan de Menten +
Jake VanderPlas
James Hiebert +
Jeff Reback
Joris Van den Bossche
Justin Lecher +
Ka Wo Chen +
Kevin Sheppard
Mortada Mehyar
Morton Fox +
Robin Wilson +
Sinhrks
Stephan Hoyer
Thomas Grainger
Tom Ajamian
Tom Augspurger
Yoshiki Vázquez Baeza
Younggun Kim
austinc +
behzad nouri
jreback
lexual
rekcahpassyla +
scls19fr
sinhrks