This is the first major release for Python Polars. Please check out the upgrade guide for help navigating the breaking changes when upgrading to this version.
💥 Breaking changes
- Change default engine for `read_excel` to `"calamine"` (#17263)
- Implement binary serialization of LazyFrame/DataFrame/Expr and set it as the default format (#17223)
- Streamline optional dependency definitions in `pyproject.toml` (#17168)
- Update `read/scan_parquet` to disable Hive partitioning by default for file inputs (#17106)
- Split `replace` functionality into two separate methods (#16921)
- Default to writing binview data to IPC, mark `compression` argument as keyword-only (#17084)
- Remove re-export of type aliases (#17032)
- Rename `ModuleUpgradeRequired` and `PolarsPanicError` error, remove `InvalidAssert` error (#17033)
- Change data orientation inference logic for DataFrame construction and warn when row orientation is inferred (#16976)
- Properly apply `strict` parameter in Series constructor (#16939)
- Remove supertype definition of List and non-List types (#16918)
- Consistently convert to given time zone in Series constructor (#16828)
- Update `reshape` to return Array types instead of List types (#16825)
- Default to raising on out-of-bounds indices in all `get`/`gather` operations (#16841)
- Native `selector` XOR set operation, guarantee consistent selector column-order (#16833)
- Set `infer_schema_length` as keyword-only argument in `str.json_decode` (#16835)
- Update `set_sorted` to only accept a single column (#16800)
- Remove deprecated parameters in `Series.cut/qcut` and update struct field names (#16741)
- Expedited removal of certain deprecated functionality (#16754)
- Update some error types to more appropriate variants (#15030)
- Scheduled removal of deprecated functionality (#16715)
- Change default `offset` in `group_by_dynamic` from 'negative `every`' to 'zero' (#16658)
- Constrain access to globals from `DataFrame.sql` in favor of top-level `pl.sql` (#16598)
- Read 2D NumPy arrays as `Array` type instead of `List` (#16710)
- Update `clip` to no longer propagate nulls in the given bounds (#14413)
- Change `str.to_datetime` to default to microsecond precision for format specifiers `"%f"` and `"%.f"` (#13597)
- Update resulting column names in `pivot` when pivoting by multiple values (#16439)
- Preserve nulls in `ewm_mean`, `ewm_std`, and `ewm_var` (#15503)
- Restrict casting for temporal data types (#14142)
- Support Decimal types by default when converting from Arrow (#15324)
- Remove serde functionality from `pl.read_json` and `DataFrame.write_json` (#16550)
- Update function signature of `nth` to allow positional input of indices, remove `columns` parameter (#16510)
- Rename struct fields of `rle` output to `len`/`value` and update data type of `len` field (#15249)
- Remove class variables from some DataTypes (#16524)
- Add `check_names` parameter to `Series.equals` and default to `False` (#16610)
⚠️ Deprecations
- Deprecate `LazyFrame.fetch` (#17278)
- Deprecate `size` parameter in parametric testing strategies in favor of `min_size`/`max_size` (#17128)
- Split `replace` functionality into two separate methods (#16921)
- Rename `DataFrame.melt` to `unpivot` and make parameters consistent with `pivot` (#17095)
- Remove re-export of exceptions at top-level (#17059)
- Deprecate `dt.mean`/`dt.median` in favor of `mean`/`median` (#16888)
- Deprecate `LazyFrame.with_context` in favor of horizontal concatenation (#16860)
- Rename parameter `descending` to `reverse` in `top_k` methods (#16817)
- Rename `str.concat` to `str.join` and update default delimiter (#16790)
- Deprecate `arctan2d` in favor of `arctan2(...).degrees()` (#16786)
🚀 Performance improvements
- Rechunk before `group_by` iteration (#17302)
- Improve `unique` performance by adding RangedUniqueKernel for primitive arrays (#17166)
- Improve `unique` performance by creating UniqueKernel and improve bool implementation (#17160)
- Default to writing binview data to IPC, mark `compression` argument as keyword-only (#17084)
- Parallelize arrow conversion if binview -> large_bin (#17083)
- Garbage collect buffers in `if-then-else` view kernel (#16993)
- Desugar `AND` filter into multiple nodes (#16992)
- Optimize generic `arg_sort` of row-encoding (#16894)
- Improve `rle_id` iteration performance and set sorted flags (#16893)
- Optimize `sort` for String and Binary types (#16871)
- Use `split_at` in `split` (#16865)
- Use `split_at` instead of double slice in chunk splits. (#16856)
- Don't rechunk in `align_` if arrays are aligned (#16850)
- Don't create small chunks in parallel collect. (#16845)
- Add dedicated no-null branch in `arg_sort` (#16808)
- Speed up `dt.offset_by` 2x for constant durations (#16728)
- Toggle coalesce in `join` if non-coalesced key isn't projected (#16677)
- Make `dt.truncate` 1.5x faster when `every` is just a single duration (and not an expression) (#16666)
- Always prune unused columns in semi/anti join (#16665)
✨ Enhancements
- Add SQL support for `NATURAL` joins and the `COLUMNS` function (#17295)
- Add `str.extract_many` expression (#17304)
- Change default engine for `read_excel` to `"calamine"` (#17263)
- Deprecate `LazyFrame.fetch` (#17278)
- Support '%' in pathnames for async scan (#17271)
- Support `SQL` Struct/JSON field access operators (#17226)
- Exclude directories from glob expansion result (#17174)
- Support SQL `ORDER BY ALL` syntax (#17212)
- Support PostgreSQL `^@` ("starts with"), and `~~`, `~~*`, `!~~`, `!~~*` ("like", "ilike") string-matching operators (#17251)
- Support SQL `SELECT * ILIKE` wildcard syntax (#17169)
- Support `SQL` temporal functions `STRFTIME` and `STRPTIME`, and typed literal syntax (#17245)
- Support date/datetime for hive parts (#17256)
- Implement binary serialization of LazyFrame/DataFrame/Expr and set it as the default format (#17223)
- Allow no-op `round/ceil/floor` on integer types (#17241)
- Support loading from datasets where the hive columns are also stored in the file (#17203)
- Implement serde for Null columns (#17218)
- Support Decimal types in `write_csv/write_json` (#14209)
- Add optional "default" to `get_column` DataFrame method (#17176)
- Improve SQL support for array indexing, increase test coverage (#16972)
- Support reading byte stream split encoded floats and doubles in parquet (#17099)
- Add `float_scientific` option to `write_csv`/`sink_csv` (#17111)
- Support `Struct` field selection in the SQL engine, `RENAME` and `REPLACE` select wildcard options (#17109)
- Update `DataFrame.pivot` to allow `index=None` when `values` is set (#17126)
- Update `read/scan_parquet` to disable Hive partitioning by default for file inputs (#17106)
- Improve ipython autocomplete for LazyFrame and DataFrame (#17091)
- Split `replace` functionality into two separate methods (#16921)
- Improve schema inference for hive partitions (#17079)
- Rename `DataFrame.melt` to `unpivot` and make parameters consistent with `pivot` (#17095)
- Print row index in `explain` and `show_graph` (#17074)
- Support top-level `pl.col` autocompletion for iPython (#17080)
- Remove re-export of exceptions at top-level (#17059)
- Implement predicate and projection pushdown for `read_ndjson` (#17068)
- Allow (non-)coalescing in join_asof (#17066)
- Turn off coalescing and fix mutation of join on expressions (#17061)
- Expand NDJson glob into one SCAN (#17063)
- Do not parse hive partitions from user provided base directory path (#17055)
- Support directory paths in scans for Parquet, IPC and CSV (#17017)
- Implement general array equality checks (#17043)
- Add `strict` parameter to `DataFrame/LazyFrame.drop` and fix behavior to default to True (#17044)
- Rename `ModuleUpgradeRequired` and `PolarsPanicError` error, remove `InvalidAssert` error (#17033)
- Add `rechunk` parameter to `read_delta` (#16991)
- Allow experimental metadata use on release (#17005)
- Add simple version of `json_normalize` (#17015)
- Change data orientation inference logic for DataFrame construction and warn when row orientation is inferred (#16976)
- Desugar `AND` filter into multiple nodes (#16992)
- Handle textio even if not correct (#16971)
- Properly apply `strict` parameter in Series constructor (#16939)
- Add SQL support for `INTERSECT` and `EXCEPT` ops (#16960)
- Add `PerformanceWarning` to LazyFrame properties (#16964)
- Add `collect_schema` method to `LazyFrame` and `DataFrame` (#16929)
- Allow setting file cache TTL on a per-file basis (#16891)
- Support Decimal inputs for `lit` (#16950)
- Implement multiply and division for lhs duration (#16948)
- Raise on invalid temporal arithmetic (#16934)
- Always end with an in-memory sink on collect (#16928)
- Add `DataFrame.style` namespace (#16809)
- Add `Schema` class (#16873)
- Normalize `value_counts` (#16917)
- Implement equality for more Array types (#16902)
- Set up some of the infrastructure for new streaming engine (#16900)
- Cache downloaded cloud IPC files (#16892)
- Consistently convert to given time zone in Series constructor (#16828)
- Improve `read_csv` SQL table reading function defaults (better handle dates) (#16866)
- Support SQL `VALUES` clause and inline renaming of columns in CTE & derived table definitions (#16851)
- Support Python `Enum` values in `lit` (#16858)
- Convert to given time zone in `.str.to_datetime` when values are offset-aware (#16742)
- Update `reshape` to return Array types instead of List types (#16825)
- Default to raising on out-of-bounds indices in all `get`/`gather` operations (#16841)
- Support `SQL` "SELECT" with no tables, optimise registration of globals (#16836)
- Native `selector` XOR set operation, guarantee consistent selector column-order (#16833)
- Extend recognised `EXTRACT` and `DATE_PART` SQL part abbreviations (#16767)
- Improve error message when raising integers to negative integers, improve docs (#16827)
- Return datetime for mean/median of Date column (#16795)
- Update `set_sorted` to only accept a single column (#16800)
- Expose overflowing cast (#16805)
- Update `group_by` iteration and `partition_by` to always return tuple keys (#16793)
- Support array arithmetic for equally sized shapes (#16791)
- Expedited removal of certain deprecated functionality (2) (#16779)
- Removal of `read_database_uri` passthrough from `read_database` (#16783)
- Remove `pyxlsb` engine from `read_excel` (#16784)
- Add `check_order` parameter to `assert_series_equal` (#16778)
- Enforce deprecation of keyword arguments as positional (#16755)
- Support cloud storage in `scan_csv` (#16674)
- Streamline SQL `INTERVAL` handling and improve related error messages, update `sqlparser-rs` lib (#16744)
- Support use of ordinal values in SQL `ORDER BY` clause (#16745)
- Support executing polars SQL against `pandas` and `pyarrow` objects (#16746)
- Remove deprecated parameters in `Series.cut/qcut` and update struct field names (#16741)
- Expedited removal of certain deprecated functionality (#16754)
- Remove deprecated functionality from rolling methods (#16750)
- Update `date_range` to no longer produce datetime ranges (#16734)
- Mark `min_periods` as keyword-only for `rolling` methods (#16738)
- Remove deprecated `top_k` parameters `nulls_last`, `maintain_order`, and `multithreaded` (#16599)
- Support order-by in window functions (#16743)
- Add SQL support for `NULLS FIRST/LAST` ordering (#16711)
- Update some error types to more appropriate variants (#15030)
- Initial SQL support for `INTERVAL` strings (#16732)
- Scheduled removal of deprecated functionality (2) (#16724)
- Scheduled removal of deprecated functionality (#16715)
- Enforce deprecation of `offset` arg in `truncate` and `round` (#16655)
- Change default `offset` in `group_by_dynamic` from 'negative `every`' to 'zero' (#16658)
- Constrain access to globals from `DataFrame.sql` in favor of top-level `pl.sql` (#16598)
- Read 2D NumPy arrays as `Array` type instead of `List` (#16710)
- Update `clip` to no longer propagate nulls in the given bounds (#14413)
- Change `str.to_datetime` to default to microsecond precision for format specifiers `"%f"` and `"%.f"` (#13597)
- Update resulting column names in `pivot` when pivoting by multiple values (#16439)
- Preserve nulls in `ewm_mean`, `ewm_std`, and `ewm_var` (#15503)
- Restrict casting for temporal data types (#14142)
- Add many more auto-inferable datetime formats for `str.to_datetime` (#16634)
- Support Decimal types by default when converting from Arrow (#15324)
- Remove serde functionality from `pl.read_json` and `DataFrame.write_json` (#16550)
- Update function signature of `nth` to allow positional input of indices, remove `columns` parameter (#16510)
- Rename struct fields of `rle` output to `len`/`value` and update data type of `len` field (#15249)
- Remove class variables from some DataTypes (#16524)
- Add `check_names` parameter to `Series.equals` and default to `False` (#16610)
- Dedicated `SQLInterface` and `SQLSyntax` errors (#16635)
- Add `DIV` function support to the SQL interface (#16678)
- Support non-coalescing streaming left join (#16672)
- Allow wildcard and exclude before struct expansions (#16671)
🐞 Bug fixes
- Raise on invalid shape dataframe arithmetic (#17322)
- Fix panic in window case (#17320)
- Raise errors instead of panicking when `sink_csv` fails (#17313)
- Raise if join keys are passed to cross join (#17305)
- Ensure we don't close extant `adbc` connections in `write_database` (#17298)
- Don't null on oob in `list.get` for column index (#17276)
- Fix issue where sliced PyArrow record batches were not handled correctly (#17058)
- Don't oob on nulls in `list.get` (#17262)
- Fix list getter with nulls (#17261)
- Respect `nulls_last` parameter in aggregate `sort_by` (#17249)
- Fix literal slice in group by (#17242)
- Fix `DataFrame.top_k` not handling nulls correctly (#17239)
- Update implementation of Enum support in `lit` to address spurious test failure (#17187)
- Use explicit turbofish to help rustc (#17159)
- Raise on invalid set dtypes (#17157)
- Fix corrupted reads for hive parts from cloud and projection pushdown failure on hive parts (#17152)
- Set intersection supertype (#17154)
- `ChainedWhen` should not inherit `Expr` (#17142)
- Fix decompress_impl for csv with n_rows set (#17118)
- Fix incorrect window std for chunked series (#17110)
- Fix panic when using `fold` in certain situations (#17114)
- Fix melt panic (#17088)
- Fix expression autocomplete in IPython (#17072)
- Exclude index from expansion in rolling/group_by_dynamic (#17086)
- Update some `Series` dunder method type signatures (#17053)
- Fix oob of join with literals and empty table (#17047)
- Don't silently accept multi-table FROM clauses (implicit JOIN syntax) (#17028)
- Don't split up ANDed filters that are group-aware (#17031)
- Harden "async" check for users with out-of-date `sqlalchemy` libraries (#17029)
- Error when `sort_by` of unequal length (#17026)
- Properly catch not found explode cols (#17020)
- Correctly convert data frames to NumPy for C index order (#17000)
- Raise on invalid arithmetic shapes (#16986)
- Don't pushdown predicates in cross join if they refer to both tables (#16983)
- Fix projection pushdown with literal joins (#16981)
- Fix edge case in DataFrame constructor data orientation inference (#16975)
- Raise on list of objects (#16959)
- Handle strictness for Decimal Series construction (#15309)
- Don't panic in object to anyvalue (#16957)
- Properly set `FAST_EXPLODE_LIST` metadata (#16951)
- Raise informative error when writing object to file (#16954)
- Remove supertype definition of List and non-List types (#16918)
- Remove unwrap in `extend()` (#16890)
- Fix `should_rechunk` check (#16852)
- Ensure `read_excel` and `read_ods` return identical frames across all engines when given empty spreadsheet tables (#16802)
- Consistent behaviour when "infer_schema_length=0" for `read_excel` (#16840)
- Standardised additional SQL interface errors (#16829)
- Ensure that split ChunkedArray also flattens chunks (#16837)
- Reduce needless panics in comparisons (#16831)
- Reset if next caller clones inner series (#16812)
- Raise on non-positive json schema inference (#16770)
- Rewrite implementation of `top_k/bottom_k` and fix a variety of bugs (#16804)
- Fix comparison of UInt64 with zero (#16799)
- Fix incorrect parquet statistics written for UInt64 values > Int64::MAX (#16766)
- Fix boolean distinct (#16765)
- `DATE_PART` SQL syntax/parsing, improve some error messages (#16761)
- Include `pl.` qualifier for inner dtypes in `to_init_repr` (#16235)
- Column selection wasn't applied when reading CSV with no rows (#16739)
- Panic on empty df / null List(Categorical) (#16730)
- Only flush if operator can flush in streaming outer join (#16723)
- Raise unsupported cat array (#16717)
- Assert SQLInterfaceError is raised (#16713)
- Restrict casting for temporal data types (#14142)
- Handle nested categoricals in `assert_series_equal` when `categorical_as_str=True` (#16700)
- Improve `read_database` check for SQLAlchemy async Session objects (#16680)
- Reduce scope of multi-threaded numpy conversion (#16686)
- Full null on dyn int (#16679)
- Fix filter shape on empty null (#16670)
📖 Documentation
- Update version switcher for 1.0.0 final release (#16848)
- Finish upgrade guide for 1.0.0 (#17257)
- Minor layout/terminology improvement for `selector` set ops (#17299)
- Mark hypothesis testing functionality as unstable (#17258)
- Add SQL docs for the `CAST` and `TRY_CAST` functions (#17214)
- Mark `plot` namespace as unstable (#17205)
- Bump docs dependencies (#17199)
- More accurate and helpful docs for user defined functions (#15194)
- Add doc examples to `concat_list` (#17127)
- Add "coming from pandas" note to `DataFrame.unique` docstring (#17119)
- Fix some warnings during doc build (#17077)
- Properly expose `InProcessQuery` in docs, mark as unstable (#17097)
- Add upgrade guide for Python Polars 1.0.0 (#16914)
- Lots of additions to the SQL reference docs (#16990)
- Minor doctest fixes (#17002)
- Include a doc entry for every exception type (#17001)
- Fixup bullet points in `write_parquet` docstring (#16909)
- Update version switcher for 1.0.0 prereleases (#16847)
- Update link from Python API reference to user guide (#16849)
- Update docstring/test/etc usage of `select` and `with_columns` to idiomatic form (#16801)
- Update versioning docs for 1.0.0 (#16757)
- Add docstring example for `DataFrame.limit` (#16753)
- Fix incorrect stated value of `include_nulls` in `DataFrame.update` docstring (#16701)
- Update deprecation docs in the user guide (#14315)
- Add example for index count in `DataFrame.rolling` (#16600)
- Improve docstring of `Expr/Series.map_elements` (#16079)
- Add missing `polars.sql` docs entry and small docstring update (#16656)
📦 Build system
- Update Cargo.lock (#17284)
- Streamline optional dependency definitions in `pyproject.toml` (#17168)
- Update rustc 2024-06-23 (#17135)
- Do not set environment variable on import (#17101)
- Fix config flag for Tracemalloc (#17098)
- Pin optional NumPy dependency to `< 2.0.0` for now (#17060)
🛠️ Other improvements
- Fix typo in join validation error message (#17296)
- Fix linting issue in docs (#17292)
- Use typed `iter` in `list.get` (#17286)
- Rename `type_aliases` module to `_typing` (#17282)
- Add ability to have pipeline blockers in new streaming engine (#17247)
- Support date/datetime for hive parts (#17256)
- Refactor serde tests, add hypothesis tests (#17216)
- Refactor parsing of data type inputs to Polars data types (#17164)
- Skip all moto AWS tests for now (#17178)
- Add missing spaces in `cargo.toml` (#17145)
- Minor test refactor for `concat_list` (#17120)
- Remove re-export of data type groups (#17073)
- Add pivot test #17081 (#17090)
- Minor cleanup to better define boundaries of public API (#17051)
- Support directory paths in scans for Parquet, IPC and CSV (#17017)
- Remove re-export of type aliases (#17032)
- Remove file cache test (#17038)
- Update exception imports in test suite (#17035)
- Point polars-stream to crates/ again (#17024)
- Fix failing file cache test in CI (#17014)
- Add some parametric tests for sort functionality (#17008)
- Pin NumPy to <2.0 for now (#16999)
- Use proper join type in test (#16994)
- Fix file cache verbose logging leakage during pytest (#16984)
- Skip another intermittently failing AWS test (#16980)
- Update test suite to explicitly use `orient="row"` in DataFrame constructor when applicable (#16977)
- Remove redundant projection attribute in IR::DataFrameScan (#16952)
- Factor out some apply calls in duration namespace (#16941)
- Skip intermittently failing AWS test (#16908)
- Refactor expression parsing utils (#16906)
- Set up some of the infrastructure for new streaming engine (#16900)
- Refactor parts of IR. (#16899)
- Add fundamentals for new async-based streaming execution engine (#16884)
- Move around some existing tests (#16877)
- Remove inner `Arc` from `FileCacheEntry` (#16870)
- Do not update stable API reference on prerelease (#16846)
- Update links to API references (#16843)
- Prepare update of API reference URLs (#16816)
- Rename allow_overflow to wrap_numerical (#16807)
- Set `infer_schema_length` as keyword-only argument in `str.json_decode` (#16835)
- Don't enter streaming engine for groupby-> agg mean/median … (#16810)
- Improve safety of amortized_iter (#16820)
- Remove needless inner type clone (#16718)
- Fix incorrect debug assertion in `ChunkedArray::from_chunks_and_dtype` (#16697)
- Update version resolver for `1.0.0` release (#16705)
- Avoid AWS pinning to outdated crc32c version (#16681)
Thank you to all our contributors for making this release possible!
@IvanIsCoding, @JamesCE2001, @JulianCologne, @KDruzhkin, @Kylea650, @MarcoGorelli, @Mottl, @Object905, @SeanTater, @adamreeve, @alexander-beedie, @bertiewooster, @borchero, @c-peters, @coastalwhite, @datapythonista, @datenzauberai, @dependabot, @dependabot[bot], @eitsupi, @flisky, @henryharbeck, @itamarst, @jqnatividad, @lukeshingles, @machow, @marenwestermann, @mcrumiller, @montanarograziano, @nameexhaustion, @orlp, @p3i0t, @ritchie46, @sherlockbeard, @stinodego, @tkellogg, @universalmind303 and @wence-