Polars: py-1.0.0 Release

Release date: July 1, 2024
Previous version: py-1.0.0-rc.2 (released June 24, 2024)
Magnitude: 6,149 Diff Delta
Contributors: 15 total committers

56 Commits in this Release

Top Contributors in py-1.0.0

stinodego
alexander-beedie
coastalwhite
orlp
nameexhaustion
flisky
ritchie46
mcrumiller
wence-
JamesCE2001

Release Notes Published

This is the first major release for Python Polars. Please check out the upgrade guide for help navigating the breaking changes when upgrading to this version.

💥 Breaking changes

  • Change default engine for read_excel to "calamine" (#17263)
  • Implement binary serialization of LazyFrame/DataFrame/Expr and set it as the default format (#17223)
  • Streamline optional dependency definitions in pyproject.toml (#17168)
  • Update read/scan_parquet to disable Hive partitioning by default for file inputs (#17106)
  • Split replace functionality into two separate methods (#16921)
  • Default to writing binview data to IPC, mark compression argument as keyword-only (#17084)
  • Remove re-export of type aliases (#17032)
  • Rename ModuleUpgradeRequired and PolarsPanicError error, remove InvalidAssert error (#17033)
  • Change data orientation inference logic for DataFrame construction and warn when row orientation is inferred (#16976)
  • Properly apply strict parameter in Series constructor (#16939)
  • Remove supertype definition of List and non-List types (#16918)
  • Consistently convert to given time zone in Series constructor (#16828)
  • Update reshape to return Array types instead of List types (#16825)
  • Default to raising on out-of-bounds indices in all get/gather operations (#16841)
  • Native selector XOR set operation, guarantee consistent selector column-order (#16833)
  • Set infer_schema_length as keyword-only argument in str.json_decode (#16835)
  • Update set_sorted to only accept a single column (#16800)
  • Remove deprecated parameters in Series.cut/qcut and update struct field names (#16741)
  • Expedited removal of certain deprecated functionality (#16754)
  • Update some error types to more appropriate variants (#15030)
  • Scheduled removal of deprecated functionality (#16715)
  • Change default offset in group_by_dynamic from 'negative every' to 'zero' (#16658)
  • Constrain access to globals from DataFrame.sql in favor of top-level pl.sql (#16598)
  • Read 2D NumPy arrays as Array type instead of List (#16710)
  • Update clip to no longer propagate nulls in the given bounds (#14413)
  • Change str.to_datetime to default to microsecond precision for format specifiers "%f" and "%.f" (#13597)
  • Update resulting column names in pivot when pivoting by multiple values (#16439)
  • Preserve nulls in ewm_mean, ewm_std, and ewm_var (#15503)
  • Restrict casting for temporal data types (#14142)
  • Support Decimal types by default when converting from Arrow (#15324)
  • Remove serde functionality from pl.read_json and DataFrame.write_json (#16550)
  • Update function signature of nth to allow positional input of indices, remove columns parameter (#16510)
  • Rename struct fields of rle output to len/value and update data type of len field (#15249)
  • Remove class variables from some DataTypes (#16524)
  • Add check_names parameter to Series.equals and default to False (#16610)

⚠️ Deprecations

  • Deprecate LazyFrame.fetch (#17278)
  • Deprecate size parameter in parametric testing strategies in favor of min_size/max_size (#17128)
  • Split replace functionality into two separate methods (#16921)
  • Rename DataFrame.melt to unpivot and make parameters consistent with pivot (#17095)
  • Remove re-export of exceptions at top-level (#17059)
  • Deprecate dt.mean/dt.median in favor of mean/median (#16888)
  • Deprecate LazyFrame.with_context in favor of horizontal concatenation (#16860)
  • Rename parameter descending to reverse in top_k methods (#16817)
  • Rename str.concat to str.join and update default delimiter (#16790)
  • Deprecate arctan2d in favor of arctan2(...).degrees() (#16786)

🚀 Performance improvements

  • Rechunk before group_by iteration (#17302)
  • Improve unique performance by adding RangedUniqueKernel for primitive arrays (#17166)
  • Improve unique performance by creating UniqueKernel and improve bool implementation (#17160)
  • Default to writing binview data to IPC, mark compression argument as keyword-only (#17084)
  • Parallelize arrow conversion if binview -> large_bin (#17083)
  • Garbage collect buffers in if-then-else view kernel (#16993)
  • Desugar AND filter into multiple nodes (#16992)
  • Optimize generic arg_sort of row-encoding (#16894)
  • Improve rle_id iteration performance and set sorted flags (#16893)
  • Optimize sort for String and Binary types (#16871)
  • Use split_at in split (#16865)
  • Use split_at instead of double slice in chunk splits. (#16856)
  • Don't rechunk in align_ if arrays are aligned (#16850)
  • Don't create small chunks in parallel collect. (#16845)
  • Add dedicated no-null branch in arg_sort (#16808)
  • Speed up dt.offset_by 2x for constant durations (#16728)
  • Toggle coalesce in join if non-coalesced key isn't projected (#16677)
  • Make dt.truncate 1.5x faster when every is just a single duration (and not an expression) (#16666)
  • Always prune unused columns in semi/anti join (#16665)

✨ Enhancements

  • Add SQL support for NATURAL joins and the COLUMNS function (#17295)
  • Add str.extract_many expression (#17304)
  • Change default engine for read_excel to "calamine" (#17263)
  • Deprecate LazyFrame.fetch (#17278)
  • Support '%' in pathnames for async scan (#17271)
  • Support SQL Struct/JSON field access operators (#17226)
  • Exclude directories from glob expansion result (#17174)
  • Support SQL ORDER BY ALL syntax (#17212)
  • Support PostgreSQL ^@ ("starts with"), and ~~,~~*,!~~,!~~* ("like", "ilike") string-matching operators (#17251)
  • Support SQL SELECT * ILIKE wildcard syntax (#17169)
  • Support SQL temporal functions STRFTIME and STRPTIME, and typed literal syntax (#17245)
  • Support date/datetime for hive parts (#17256)
  • Implement binary serialization of LazyFrame/DataFrame/Expr and set it as the default format (#17223)
  • Allow no-op round/ceil/floor on integer types (#17241)
  • Support loading from datasets where the hive columns are also stored in the file (#17203)
  • Implement serde for Null columns (#17218)
  • Support Decimal types in write_csv/write_json (#14209)
  • Add optional "default" to get_column DataFrame method (#17176)
  • Improve SQL support for array indexing, increase test coverage (#16972)
  • Support reading byte stream split encoded floats and doubles in parquet (#17099)
  • Add float_scientific option to write_csv/sink_csv (#17111)
  • Support Struct field selection in the SQL engine, RENAME and REPLACE select wildcard options (#17109)
  • Update DataFrame.pivot to allow index=None when values is set (#17126)
  • Update read/scan_parquet to disable Hive partitioning by default for file inputs (#17106)
  • Improve ipython autocomplete for LazyFrame and DataFrame (#17091)
  • Split replace functionality into two separate methods (#16921)
  • Improve schema inference for hive partitions (#17079)
  • Rename DataFrame.melt to unpivot and make parameters consistent with pivot (#17095)
  • Print row index in explain and show_graph (#17074)
  • Support top-level pl.col autocompletion for iPython (#17080)
  • Remove re-export of exceptions at top-level (#17059)
  • Implement predicate and projection pushdown for read_ndjson (#17068)
  • Allow (non-)coalescing in join_asof (#17066)
  • Turn off coalescing and fix mutation of join on expressions (#17061)
  • Expand NDJson glob into one SCAN (#17063)
  • Do not parse hive partitions from user provided base directory path (#17055)
  • Support directory paths in scans for Parquet, IPC and CSV (#17017)
  • Implement general array equality checks (#17043)
  • Add strict parameter to DataFrame/LazyFrame.drop and fix behavior to default to True (#17044)
  • Rename ModuleUpgradeRequired and PolarsPanicError error, remove InvalidAssert error (#17033)
  • Add rechunk parameter to read_delta (#16991)
  • Allow experimental metadata use on release (#17005)
  • Add simple version of json_normalize (#17015)
  • Change data orientation inference logic for DataFrame construction and warn when row orientation is inferred (#16976)
  • Desugar AND filter into multiple nodes (#16992)
  • Handle textio even if not correct (#16971)
  • Properly apply strict parameter in Series constructor (#16939)
  • Add SQL support for INTERSECT and EXCEPT ops (#16960)
  • Add PerformanceWarning to LazyFrame properties (#16964)
  • Add collect_schema method to LazyFrame and DataFrame (#16929)
  • Allow setting file cache TTL on a per-file basis (#16891)
  • Support Decimal inputs for lit (#16950)
  • Implement multiply and division for lhs duration (#16948)
  • Raise on invalid temporal arithmetic (#16934)
  • Always end with an in-memory sink on collect (#16928)
  • Add DataFrame.style namespace (#16809)
  • Add Schema class (#16873)
  • Normalize value_counts (#16917)
  • Implement equality for more Array types (#16902)
  • Set up some of the infrastructure for new streaming engine (#16900)
  • Cache downloaded cloud IPC files (#16892)
  • Consistently convert to given time zone in Series constructor (#16828)
  • Improve read_csv SQL table reading function defaults (better handle dates) (#16866)
  • Support SQL VALUES clause and inline renaming of columns in CTE & derived table definitions (#16851)
  • Support Python Enum values in lit (#16858)
  • Convert to given time zone in .str.to_datetime when values are offset-aware (#16742)
  • Update reshape to return Array types instead of List types (#16825)
  • Default to raising on out-of-bounds indices in all get/gather operations (#16841)
  • Support SQL "SELECT" with no tables, optimise registration of globals (#16836)
  • Native selector XOR set operation, guarantee consistent selector column-order (#16833)
  • Extend recognised EXTRACT and DATE_PART SQL part abbreviations (#16767)
  • Improve error message when raising integers to negative integers, improve docs (#16827)
  • Return datetime for mean/median of Date column (#16795)
  • Update set_sorted to only accept a single column (#16800)
  • Expose overflowing cast (#16805)
  • Update group_by iteration and partition_by to always return tuple keys (#16793)
  • Support array arithmetic for equally sized shapes (#16791)
  • Expedited removal of certain deprecated functionality (2) (#16779)
  • Removal of read_database_uri passthrough from read_database (#16783)
  • Remove pyxlsb engine from read_excel (#16784)
  • Add check_order parameter to assert_series_equal (#16778)
  • Enforce deprecation of keyword arguments as positional (#16755)
  • Support cloud storage in scan_csv (#16674)
  • Streamline SQL INTERVAL handling and improve related error messages, update sqlparser-rs lib (#16744)
  • Support use of ordinal values in SQL ORDER BY clause (#16745)
  • Support executing polars SQL against pandas and pyarrow objects (#16746)
  • Remove deprecated parameters in Series.cut/qcut and update struct field names (#16741)
  • Expedited removal of certain deprecated functionality (#16754)
  • Remove deprecated functionality from rolling methods (#16750)
  • Update date_range to no longer produce datetime ranges (#16734)
  • Mark min_periods as keyword-only for rolling methods (#16738)
  • Remove deprecated top_k parameters nulls_last, maintain_order, and multithreaded (#16599)
  • Support order-by in window functions (#16743)
  • Add SQL support for NULLS FIRST/LAST ordering (#16711)
  • Update some error types to more appropriate variants (#15030)
  • Initial SQL support for INTERVAL strings (#16732)
  • Scheduled removal of deprecated functionality (2) (#16724)
  • Scheduled removal of deprecated functionality (#16715)
  • Enforce deprecation of offset arg in truncate and round (#16655)
  • Change default offset in group_by_dynamic from 'negative every' to 'zero' (#16658)
  • Constrain access to globals from DataFrame.sql in favor of top-level pl.sql (#16598)
  • Read 2D NumPy arrays as Array type instead of List (#16710)
  • Update clip to no longer propagate nulls in the given bounds (#14413)
  • Change str.to_datetime to default to microsecond precision for format specifiers "%f" and "%.f" (#13597)
  • Update resulting column names in pivot when pivoting by multiple values (#16439)
  • Preserve nulls in ewm_mean, ewm_std, and ewm_var (#15503)
  • Restrict casting for temporal data types (#14142)
  • Add many more auto-inferable datetime formats for str.to_datetime (#16634)
  • Support Decimal types by default when converting from Arrow (#15324)
  • Remove serde functionality from pl.read_json and DataFrame.write_json (#16550)
  • Update function signature of nth to allow positional input of indices, remove columns parameter (#16510)
  • Rename struct fields of rle output to len/value and update data type of len field (#15249)
  • Remove class variables from some DataTypes (#16524)
  • Add check_names parameter to Series.equals and default to False (#16610)
  • Dedicated SQLInterface and SQLSyntax errors (#16635)
  • Add DIV function support to the SQL interface (#16678)
  • Support non-coalescing streaming left join (#16672)
  • Allow wildcard and exclude before struct expansions (#16671)

🐞 Bug fixes

  • Raise on invalid shape dataframe arithmetic (#17322)
  • Fix panic in window case (#17320)
  • Raise errors instead of panicking when sink_csv fails (#17313)
  • Raise if join keys are passed to cross join (#17305)
  • Ensure we don't close extant adbc connections in write_database (#17298)
  • Don't null on oob in list.get for column index (#17276)
  • Fix issue where sliced PyArrow record batches were not handled correctly (#17058)
  • Don't oob on nulls in list.get (#17262)
  • Fix list getter with nulls (#17261)
  • Respect nulls_last parameter in aggregate sort_by (#17249)
  • Fix literal slice in group by (#17242)
  • Fix DataFrame.top_k not handling nulls correctly (#17239)
  • Update implementation of Enum support in lit to address spurious test failure (#17187)
  • Use explicit turbofish to help rustc (#17159)
  • Raise on invalid set dtypes (#17157)
  • Fix corrupted reads for hive parts from cloud and projection pushdown failure on hive parts (#17152)
  • Set intersection supertype (#17154)
  • ChainedWhen should not inherit Expr (#17142)
  • Fix decompress_impl for csv with n_rows set (#17118)
  • Fix incorrect window std for chunked series (#17110)
  • Fix panic when using fold in certain situations (#17114)
  • Fix melt panic (#17088)
  • Fix expression autocomplete in IPython (#17072)
  • Exclude index from expansion in rolling/group_by_dynamic (#17086)
  • Update some Series dunder method type signatures (#17053)
  • Fix oob of join with literals and empty table (#17047)
  • Don't silently accept multi-table FROM clauses (implicit JOIN syntax) (#17028)
  • Don't split up ANDed filters that are group-aware (#17031)
  • Harden "async" check for users with out-of-date sqlalchemy libraries (#17029)
  • Error when sort_by of unequal length (#17026)
  • Properly catch not found explode cols (#17020)
  • Correctly convert data frames to NumPy for C index order (#17000)
  • Raise on invalid arithmetic shapes (#16986)
  • Don't pushdown predicates in cross join if they refer to both tables (#16983)
  • Fix projection pushdown with literal joins (#16981)
  • Fix edge case in DataFrame constructor data orientation inference (#16975)
  • Raise on list of objects (#16959)
  • Handle strictness for Decimal Series construction (#15309)
  • Don't panic in object to anyvalue (#16957)
  • Properly set FAST_EXPLODE_LIST metadata (#16951)
  • Raise informative error when writing object to file (#16954)
  • Remove supertype definition of List and non-List types (#16918)
  • Remove unwrap in extend() (#16890)
  • Fix should_rechunk check (#16852)
  • Ensure read_excel and read_ods return identical frames across all engines when given empty spreadsheet tables (#16802)
  • Consistent behaviour when "infer_schema_length=0" for read_excel (#16840)
  • Standardised additional SQL interface errors (#16829)
  • Ensure that split ChunkedArray also flattens chunks (#16837)
  • Reduce needless panics in comparisons (#16831)
  • Reset if next caller clones inner series (#16812)
  • Raise on non-positive json schema inference (#16770)
  • Rewrite implementation of top_k/bottom_k and fix a variety of bugs (#16804)
  • Fix comparison of UInt64 with zero (#16799)
  • Fix incorrect parquet statistics written for UInt64 values > Int64::MAX (#16766)
  • Fix boolean distinct (#16765)
  • DATE_PART SQL syntax/parsing, improve some error messages (#16761)
  • Include pl. qualifier for inner dtypes in to_init_repr (#16235)
  • Column selection wasn't applied when reading CSV with no rows (#16739)
  • Panic on empty df / null List(Categorical) (#16730)
  • Only flush if operator can flush in streaming outer join (#16723)
  • Raise unsupported cat array (#16717)
  • Assert SQLInterfaceError is raised (#16713)
  • Restrict casting for temporal data types (#14142)
  • Handle nested categoricals in assert_series_equal when categorical_as_str=True (#16700)
  • Improve read_database check for SQLAlchemy async Session objects (#16680)
  • Reduce scope of multi-threaded numpy conversion (#16686)
  • Full null on dyn int (#16679)
  • Fix filter shape on empty null (#16670)

📖 Documentation

  • Update version switcher for 1.0.0 final release (#16848)
  • Finish upgrade guide for 1.0.0 (#17257)
  • Minor layout/terminology improvement for selector set ops (#17299)
  • Mark hypothesis testing functionality as unstable (#17258)
  • Add SQL docs for the CAST and TRY_CAST functions (#17214)
  • Mark plot namespace as unstable (#17205)
  • Bump docs dependencies (#17199)
  • More accurate and helpful docs for user defined functions (#15194)
  • Add doc examples to concat_list (#17127)
  • Add "coming from pandas" note to DataFrame.unique docstring (#17119)
  • Fix some warnings during doc build (#17077)
  • Properly expose InProcessQuery in docs, mark as unstable (#17097)
  • Add upgrade guide for Python Polars 1.0.0 (#16914)
  • Lots of additions to the SQL reference docs (#16990)
  • Minor doctest fixes (#17002)
  • Include a doc entry for every exception type (#17001)
  • Fixup bullet points in write_parquet docstring (#16909)
  • Update version switcher for 1.0.0 prereleases (#16847)
  • Update link from Python API reference to user guide (#16849)
  • Update docstring/test/etc usage of select and with_columns to idiomatic form (#16801)
  • Update versioning docs for 1.0.0 (#16757)
  • Add docstring example for DataFrame.limit (#16753)
  • Fix incorrect stated value of include_nulls in DataFrame.update docstring (#16701)
  • Update deprecation docs in the user guide (#14315)
  • Add example for index count in DataFrame.rolling (#16600)
  • Improve docstring of Expr/Series.map_elements (#16079)
  • Add missing polars.sql docs entry and small docstring update (#16656)

📦 Build system

  • Update Cargo.lock (#17284)
  • Streamline optional dependency definitions in pyproject.toml (#17168)
  • Update rustc 2024-06-23 (#17135)
  • Do not set environment variable on import (#17101)
  • Fix config flag for Tracemalloc (#17098)
  • Pin optional NumPy dependency to < 2.0.0 for now (#17060)

🛠️ Other improvements

  • Fix typo in join validation error message (#17296)
  • Fix linting issue in docs (#17292)
  • Use typed iter in list.get (#17286)
  • Rename type_aliases module to _typing (#17282)
  • Add ability to have pipeline blockers in new streaming engine (#17247)
  • Support date/datetime for hive parts (#17256)
  • Refactor serde tests, add hypothesis tests (#17216)
  • Refactor parsing of data type inputs to Polars data types (#17164)
  • Skip all moto AWS tests for now (#17178)
  • Add missing spaces in cargo.toml (#17145)
  • Minor test refactor for concat_list (#17120)
  • Remove re-export of data type groups (#17073)
  • Add pivot test #17081 (#17090)
  • Minor cleanup to better define boundaries of public API (#17051)
  • Support directory paths in scans for Parquet, IPC and CSV (#17017)
  • Remove re-export of type aliases (#17032)
  • Remove file cache test (#17038)
  • Update exception imports in test suite (#17035)
  • Point polars-stream to crates/ again (#17024)
  • Fix failing file cache test in CI (#17014)
  • Add some parametric tests for sort functionality (#17008)
  • Pin NumPy to <2.0 for now (#16999)
  • Use proper join type in test (#16994)
  • Fix file cache verbose logging leakage during pytest (#16984)
  • Skip another intermittently failing AWS test (#16980)
  • Update test suite to explicitly use orient="row" in DataFrame constructor when applicable (#16977)
  • Remove redundant projection attribute in IR::DataFrameScan (#16952)
  • Factor out some apply calls in duration namespace (#16941)
  • Skip intermittently failing AWS test (#16908)
  • Refactor expression parsing utils (#16906)
  • Set up some of the infrastructure for new streaming engine (#16900)
  • Refactor parts of IR. (#16899)
  • Add fundamentals for new async-based streaming execution engine (#16884)
  • Move around some existing tests (#16877)
  • Remove inner Arc from FileCacheEntry (#16870)
  • Do not update stable API reference on prerelease (#16846)
  • Update links to API references (#16843)
  • Prepare update of API reference URLs (#16816)
  • Rename allow_overflow to wrap_numerical (#16807)
  • Set infer_schema_length as keyword-only argument in str.json_decode (#16835)
  • Don't enter streaming engine for groupby-> agg mean/median … (#16810)
  • Improve safety of amortized_iter (#16820)
  • Remove needless inner type clone (#16718)
  • Fix incorrect debug assertion in ChunkedArray::from_chunks_and_dtype (#16697)
  • Update version resolver for 1.0.0 release (#16705)
  • Avoid AWS pinning to outdated crc32c version (#16681)

Thank you to all our contributors for making this release possible! @IvanIsCoding, @JamesCE2001, @JulianCologne, @KDruzhkin, @Kylea650, @MarcoGorelli, @Mottl, @Object905, @SeanTater, @adamreeve, @alexander-beedie, @bertiewooster, @borchero, @c-peters, @coastalwhite, @datapythonista, @datenzauberai, @dependabot, @dependabot[bot], @eitsupi, @flisky, @henryharbeck, @itamarst, @jqnatividad, @lukeshingles, @machow, @marenwestermann, @mcrumiller, @montanarograziano, @nameexhaustion, @orlp, @p3i0t, @ritchie46, @sherlockbeard, @stinodego, @tkellogg, @universalmind303 and @wence-