Python from R I: package importing (and why learning new languages sucks)
Python from R is a collaborative initiative to write a book to help experienced R users learn Python. In the words of my boi Benjamin Wolfe in the Python from R Slack channel:
- We're a group of people who either learned Python from R, are R users currently learning Python, or have yet to learn Python, but are comfortable with R.
- What we hope to do is write some collaborative resources for others like ourselves to find their way more easily.
- We will work in the open, for anyone "out there" to find and benefit from (and collaborate on!) along the way.
- In the end, we envision this looking like a book, whether that means a literal, physical, published print book, or an online one.
MOTIVATION
When I was writing my concept map for the book (we'll write about it eventually), an analogy with learning a foreign language crossed my mind: it's impossible to learn a new language by translating word by word. It's not only a matter of vocabulary. I mean, each language has its own grammar, phrasal verbs, diction, expressions, pace etc. That kind of issue also appears when learning a new programming language, and I think importing packages is a good, yet very simple, example of that.
CALLING A FUNCTION FROM A PACKAGE
R EXPERIENCE
In R (R Core Team 2020), every package installed in the library trees is visible to the session from the moment a terminal is opened. Those packages are available to users at all times and can be called explicitly. For example:
CASE 1: EXPLICIT CALL
# search for machine learning measures that contain "AUC" in the {mlr3} package
mlr3::mlr_measures$keys("auc")
## [1] "classif.auc" "classif.prauc"
But that way of calling functions is usually reserved for packages that won't be needed very often. Otherwise, it's customary for R users to load and attach the entire package's namespace to the search path [1] with a library() call.
CASE 2: ATTACHING
# tired: explicitly calling from {ggplot2}
p1 = mtcars |>
  ggplot2::ggplot(ggplot2::aes(x = hp, y = mpg, color = factor(cyl))) +
  ggplot2::geom_point() +
  ggplot2::labs(color = "cyl")

# wired: attaching {ggplot2} namespace
library(ggplot2)

p2 = mtcars |>
  ggplot(aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = "cyl")

# are they equivalent?
all.equal(p1, p2)
## [1] TRUE
The problem appears when there are namespace conflicts. Say we attach {plotly} for interactive graphs. Usually, users just don't care for startup warnings 🚨, and that may eventually lead them to inconsistent results or tricky errors.
# attaching plotly
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
That can be avoided by attaching only the specific functions you're actually gonna use:
CASE 3: ATTACHING SPECIFIC FUNCTIONS
# detaching plotly
detach("package:plotly")

# attaching only ggplotly():
library(plotly, include.only = "ggplotly")
And no conflict warning will be triggered. Unfortunately, I don't hear much about the include.only argument from the R community 🤷. On the contrary, meta packages such as {tidyverse}, which load and attach A LOT of stuff into the namespace (often unnecessary for what you're about to do), are quite common.
PYTHON EXPERIENCE
All 3 cases stated before are possible in Python, but the community standards are very different, especially regarding awareness of what is loaded into the namespace (or symbol table, as it is called in Python [2]).
Firstly, installed packages aren't immediately available. So if I try, for example, to list {pandas} functions/methods/attributes, it'll result in an error:
# inspecting {pandas}
dir(pandas)
## Error in py_call_impl(callable, dots$args, dots$keywords): NameError: name 'pandas' is not defined
##
## Detailed traceback:
## File "<string>", line 1, in <module>
One can check the symbol table with the following statement.
# what is attached into the symbol table?
print(*globals(), sep = "\n")
## __name__
## __doc__
## __package__
## __loader__
## __spec__
## __annotations__
## __builtins__
## sys
## os
## r
Depending on what system/tools you're using, the Python interpreter may or may not attach a few modules at startup. If you start a REPL (a Python interactive terminal), no modules will be attached to the symbol table. If you start a Jupyter notebook, a few modules necessary for it to run will be loaded. In this case, since I'm running Python from R via {reticulate}, some modules have been loaded:
- sys: for access to some variables and functions used by the interpreter
- os: for OS routines for NT or Posix
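A minimal sketch of the difference between a module being loaded by the interpreter and its name being attached to the symbol table. It uses only the standard library, and fresh_table is a hypothetical dictionary standing in for an empty symbol table, so the result doesn't depend on what your own session has already imported.

```python
import sys

# A plain dict playing the role of an empty symbol table (illustrative assumption)
fresh_table = {}

# Run "import os" against that table only
exec("import os", fresh_table)

# Names actually bound in the table (exec adds __builtins__, which we skip)
bound = sorted(n for n in fresh_table if not n.startswith("__"))
print(bound)  # ['os']

# os pulls in helper modules such as os.path: loaded, but not bound as names
print("os.path" in sys.modules)  # True
print("os.path" in fresh_table)  # False
```

So globals() shows only what has been bound by name, while sys.modules keeps track of everything the interpreter has loaded so far.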
So if I want to work with {pandas}, I need to attach it to the symbol table with an equivalent to R's library(). And just like its cousin function, Python's import also comes in different flavours.
Firstly, import pandas will make the package available for explicit calls.
# import pandas
import pandas

# what is attached into the symbol table?
print(*globals(), sep = "\n")
## __name__
## __doc__
## __package__
## __loader__
## __spec__
## __annotations__
## __builtins__
## sys
## os
## r
## pandas
Note that only {pandas} is attached to the symbol table, not its functions/methods/attributes. So that statement is not equivalent to library(). For us to create a simple dataframe with {pandas}:
CASE 1: EXPLICIT CALL
# this won't work because DataFrame isn't in symbol table
DataFrame(
    {
        "capital": ["Vitoria", "São Paulo", "Rio de Janeiro"],
        "state": ["Espírito Santo", "São Paulo", "Rio de Janeiro"]
    }
)

# this will
pandas.DataFrame(
    {
        "capital": ["Vitoria", "São Paulo", "Rio de Janeiro"],
        "state": ["Espírito Santo", "São Paulo", "Rio de Janeiro"]
    }
)
## Error in py_call_impl(callable, dots$args, dots$keywords): NameError: name 'DataFrame' is not defined
##
## Detailed traceback:
## File "<string>", line 1, in <module>
## capital state
## 0 Vitoria Espírito Santo
## 1 São Paulo São Paulo
## 2 Rio de Janeiro Rio de Janeiro
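The same behaviour can be reproduced without {pandas}. The sketch below is only an illustration: it uses math from the standard library, and namespace is a hypothetical dictionary standing in for a fresh symbol table.

```python
# A plain dict standing in for an empty symbol table (illustrative assumption)
namespace = {}

# "import math" binds a single name: the module itself
exec("import math", namespace)
bound = sorted(n for n in namespace if not n.startswith("__"))
print(bound)  # ['math']

# Explicit call through the module name works
explicit_result = eval("math.sqrt(16)", namespace)
print(explicit_result)  # 4.0

# The bare name was never bound, so it fails just like DataFrame(...) did
try:
    eval("sqrt(16)", namespace)
    bare_name_failed = False
except NameError as err:
    bare_name_failed = True
    print(f"NameError: {err}")
```

The module name works as a prefix, much like mlr3:: or ggplot2:: in the R examples.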
If we were to replicate library() behavior (i.e. load and attach the entire {pandas} functions/methods/attributes into the symbol table), then:
CASE 2: ATTACHING
# importing entire {pandas} into symbol table
from pandas import *

# the updated symbol table
print(*globals(), sep = "\n")

# and now this works
DataFrame(
    {
        "capital": ["Vitoria", "São Paulo", "Rio de Janeiro"],
        "state": ["Espírito Santo", "São Paulo", "Rio de Janeiro"]
    }
)
## __name__
## __doc__
## __package__
## __loader__
## __spec__
## __annotations__
## __builtins__
## sys
## os
## r
## pandas
## compat
## get_option
## set_option
## reset_option
## describe_option
## option_context
## options
## core
## errors
## util
## io
## tseries
## arrays
## plotting
## Int8Dtype
## Int16Dtype
## Int32Dtype
## Int64Dtype
## UInt8Dtype
## UInt16Dtype
## UInt32Dtype
## UInt64Dtype
## CategoricalDtype
## PeriodDtype
## IntervalDtype
## DatetimeTZDtype
## StringDtype
## BooleanDtype
## NA
## isna
## isnull
## notna
## notnull
## Index
## CategoricalIndex
## Int64Index
## UInt64Index
## RangeIndex
## Float64Index
## MultiIndex
## IntervalIndex
## TimedeltaIndex
## DatetimeIndex
## PeriodIndex
## IndexSlice
## NaT
## Period
## period_range
## Timedelta
## timedelta_range
## Timestamp
## date_range
## bdate_range
## Interval
## interval_range
## DateOffset
## to_numeric
## to_datetime
## to_timedelta
## Grouper
## factorize
## unique
## value_counts
## NamedAgg
## array
## Categorical
## set_eng_float_format
## Series
## DataFrame
## SparseDtype
## infer_freq
## offsets
## eval
## concat
## lreshape
## melt
## wide_to_long
## merge
## merge_asof
## merge_ordered
## crosstab
## pivot
## pivot_table
## get_dummies
## cut
## qcut
## api
## show_versions
## ExcelFile
## ExcelWriter
## read_excel
## read_csv
## read_fwf
## read_table
## read_pickle
## to_pickle
## HDFStore
## read_hdf
## read_sql
## read_sql_query
## read_sql_table
## read_clipboard
## read_parquet
## read_orc
## read_feather
## read_gbq
## read_html
## read_json
## read_stata
## read_sas
## read_spss
## json_normalize
## test
## testing
## capital state
## 0 Vitoria Espírito Santo
## 1 São Paulo São Paulo
## 2 Rio de Janeiro Rio de Janeiro
But you won't see any experienced Python user doing that kind of thing because they're worried about loading that many names into the symbol table and the conflicts it may cause. An acceptable approach would be attaching only a few frequently used names, as in:
CASE 3: ATTACHING SPECIFIC FUNCTIONS
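# detaching {pandas}: drop every public name the star import brought in
# (pop() skips names that were never attached instead of raising KeyError)
for name in list(vars(pandas)):
    if not name.startswith("_"):
        globals().pop(name, None)

# attaching only DataFrame()
from pandas import DataFrame

# the updated symbol table
print(*globals(), sep = "\n")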
## __name__
## __doc__
## __package__
## __loader__
## __spec__
## __annotations__
## __builtins__
## sys
## os
## r
## name
## DataFrame
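To see why a crowded symbol table worries Python users, here's a small sketch of a silent name clash. It's an illustration under my own assumptions rather than anything from {pandas}: it star-imports math and cmath from the standard library into table, a hypothetical dictionary standing in for the symbol table. Unlike R's masking messages, Python prints nothing when the second import overwrites sqrt.

```python
# A plain dict standing in for the symbol table (illustrative assumption)
table = {}

# First star import: sqrt comes from math
exec("from math import *", table)
owner_before = table["sqrt"].__module__

# Second star import silently replaces sqrt with the cmath version
exec("from cmath import *", table)
owner_after = table["sqrt"].__module__

print(owner_before, owner_after)  # math cmath

# Importing one specific name keeps the table small and the origin obvious
small_table = {}
exec("from math import sqrt", small_table)
public = sorted(n for n in small_table if not n.startswith("__"))
print(public)  # ['sqrt']
```

Whichever star import runs last wins, so the result of calling sqrt depends on import order.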
According to The Hitchhiker's Guide to Python (Reitz and Schlusser 2016), case 2 is the worst possible scenario and is generally considered bad practice since it "makes code harder to read and makes dependencies less compartmentalized." That claim is endorsed by Python's official docs (Python Software Foundation 2021):
"Although certain modules are designed to export only names that follow certain patterns when you use import *, it is still considered bad practice in production code."
In the opinion of the guide authors, case 3 would be a better option because it pinpoints specific names [3], while case 1 would be the best practice, for "Being able to tell immediately where a class or function comes from greatly improves code readability and understandability in all but the simplest single file projects."
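The docs quote above mentions modules that are designed to export only certain names under import *. One way a module does that is by defining __all__. Here's a hedged sketch: toy_pkg, public_helper and internal_helper are hypothetical names, and the module is built in memory just for the demonstration.

```python
import sys
import types

# Hypothetical in-memory module, created only for this illustration
toy = types.ModuleType("toy_pkg")
toy.public_helper = lambda: "public"
toy.internal_helper = lambda: "internal"
toy.__all__ = ["public_helper"]  # the only name a star import should bind

# Register it so the import statement can find it
sys.modules["toy_pkg"] = toy

# Star-import into a plain dict standing in for the symbol table
namespace = {}
exec("from toy_pkg import *", namespace)
imported = sorted(n for n in namespace if not n.startswith("__"))
print(imported)  # ['public_helper']

# Clean up the hypothetical module
del sys.modules["toy_pkg"]
```

Without __all__, a star import binds every name in the module that doesn't start with an underscore, which is part of why it's hard to know what ends up in the symbol table.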
TL;DR
When learning a new programming language, simply finding equivalent code for the practices you already have may be misleading. Here we're able to see that an equivalent of R's library() call is actually considered bad practice in Python, and if you do that in a job interview, you shouldn't expect them to call you back.
CITATION
For attribution, please cite this work as:
Alberson Miranda. Jun 12, 2021. "Python from R I: package importing (and why learning new languages sucks)". https://datamares.netlify.app/en/post/2021-06-07-python-from-r-i-package-importing/.
BibTeX citation:
@misc{datamares,
title = {Python from R I: package importing (and why learning new languages sucks)},
author = {Alberson Miranda},
year = {2021},
url = {https://datamares.netlify.app/en/post/2021-06-07-python-from-r-i-package-importing/}
}
REFERENCES
[1] An ordered list where R will look for a function. Can be accessed with search() (Wickham 2015).
[2] I guess? I don't know, still learning lol
[3] Python Foundation says "There is nothing wrong with using from package import specific_submodule! In fact, this is the recommended notation unless the importing module needs to use submodules with the same name from different packages."