Python from R I: package importing (and why learning new languages sucks)

Python from R is a collaborative initiative to write a book to help experienced R users learn Python. In the words of my boi Benjamin Wolfe in the Python from R Slack channel:

  • We’re a group of people who either learned Python from R, are R users currently learning Python, or have yet to learn Python, but are comfortable with R.
  • What we hope to do is write some collaborative resources for others like ourselves to find their way more easily.
  • We will work in the open, for anyone “out there” to find and benefit from (and collaborate on!) along the way.
  • In the end, we envision this looking like a book — whether that means a literal, physical, published print book, or an online one.

MOTIVATION

When I was writing my concept map for the book (we’ll write about it eventually), an analogy with learning a foreign language crossed my mind: it’s impossible to learn a new language by translating word by word. It’s not only a matter of vocabulary. Each language has its own grammar, phrasal verbs, diction, expressions, pace, etc. The same kind of issue appears when learning a new programming language, and I think importing packages is a good, yet very simple, example of that.

CALLING A FUNCTION FROM A PACKAGE

R EXPERIENCE

In R (R Core Team 2020), every package installed in the library trees is listed whenever a session starts. Those listed packages are available to users at all times and can be called explicitly. For example:

CASE 1: EXPLICIT CALL

# search for machine learning measures that contain "AUC" in the {mlr3} package
mlr3::mlr_measures$keys("auc")
## [1] "classif.auc"   "classif.prauc"

But that way of calling functions usually takes place only if that particular package won’t be required very often. Otherwise, it’s customary for R users to load and attach the entire package’s namespace to the search path [1] with a library() call.

CASE 2: ATTACHING

# tired: explicitly calling from {ggplot2}
p1 = mtcars |>
  ggplot2::ggplot(ggplot2::aes(x = hp, y = mpg, color = factor(cyl))) +
  ggplot2::geom_point() +
  ggplot2::labs(color = "cyl")

# wired: attaching {ggplot2} namespace
library(ggplot2)

p2 = mtcars |>
  ggplot(aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = "cyl")

# are they equivalent?
all.equal(p1, p2)
## [1] TRUE

The problem appears when there are namespace conflicts. Say we attach {plotly} for interactive graphs. Usually, users just don’t care for startup warnings 😨 and that may eventually lead them to inconsistent results or tricky errors.

# attaching plotly
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

That can be avoided by attaching only the specific functions you’re actually gonna use:

CASE 3: ATTACHING SPECIFIC FUNCTIONS

# detaching plotly
detach("package:plotly")

# attaching only ggplotly():
library(plotly, include.only = "ggplotly")

And no conflict warning will be triggered. Unfortunately, I don’t hear much about the include.only argument from the R community 🤷‍♂. On the contrary, meta packages such as {tidyverse}, which load and attach A LOT of stuff into the namespace (often unnecessary for what you’re about to do), are quite common.

PYTHON EXPERIENCE

All 3 cases stated before are possible in Python, but the community standards are very different, especially regarding awareness of what is loaded into the namespace (or symbol table, as it is called in Python [2]).

Firstly, installed packages aren’t immediately available. So if I try, for example, listing {pandas} functions/methods/attributes it’ll result in an error:

# inspecting {pandas}
dir(pandas)
## Error in py_call_impl(callable, dots$args, dots$keywords): NameError: name 'pandas' is not defined
## 
## Detailed traceback:
##   File "<string>", line 1, in <module>

One can check the symbol table with the following statement.

# what is attached into the symbol table?
print(*globals(), sep = "\n")
## __name__
## __doc__
## __package__
## __loader__
## __spec__
## __annotations__
## __builtins__
## sys
## os
## r

Depending on what system/tools you’re using, the Python interpreter may or may not attach a few modules at startup. If you start a REPL (a Python interactive terminal), no extra modules will be attached to the symbol table. If you start a Jupyter notebook, a few modules necessary for it to run will be loaded. In this case, since I’m running Python from R via {reticulate}, some modules have been loaded:

  • sys: for access to some variables and functions used by the interpreter
  • os: for OS routines for NT or Posix
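
A rough way to see the difference between a module being loaded and a module being attached is to compare sys.modules (the interpreter’s cache of everything imported so far) with the module objects bound in globals(). This is a minimal sketch using only the standard library; the variable names loaded and attached are mine, and the exact contents will vary with how Python was started.

```python
import sys
import types

# every module the interpreter has imported so far, attached or not
loaded = set(sys.modules)

# module objects that are actually bound to a name in the global symbol table
attached = {
    name
    for name, obj in list(globals().items())
    if isinstance(obj, types.ModuleType)
}

print(len(loaded), "modules loaded")
print(sorted(attached), "attached to the symbol table")
```

In R terms, this is roughly the gap between loadedNamespaces() and search(): plenty of things are loaded behind the scenes, but only a handful of names are reachable without a prefix.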

So if I want to work with {pandas}, I need to attach it to the symbol table with an equivalent to R’s library(). And just like its cousin function, Python’s import also comes in different flavours.

Firstly, import pandas will make the package available for explicit calls.

# import pandas
import pandas

# what is attached into the symbol table?
print(*globals(), sep = "\n")
## __name__
## __doc__
## __package__
## __loader__
## __spec__
## __annotations__
## __builtins__
## sys
## os
## r
## pandas

Note that only {pandas} is attached to the symbol table, not its functions/methods/attributes. So that statement is not equivalent to library(). For us to create a simple dataframe with {pandas}:

CASE 1: EXPLICIT CALL

# this won't work because DataFrame isn't in symbol table
DataFrame(
  {
    "capital": ["Vitoria", "São Paulo", "Rio de Janeiro"],
    "state": ["Espírito Santo", "São Paulo", "Rio de Janeiro"]
  }
)

# this will
pandas.DataFrame(
  {
    "capital": ["Vitoria", "São Paulo", "Rio de Janeiro"],
    "state": ["Espírito Santo", "São Paulo", "Rio de Janeiro"]
  }
)
## Error in py_call_impl(callable, dots$args, dots$keywords): NameError: name 'DataFrame' is not defined
## 
## Detailed traceback:
##   File "<string>", line 1, in <module>
##           capital           state
## 0         Vitoria  Espírito Santo
## 1       São Paulo       São Paulo
## 2  Rio de Janeiro  Rio de Janeiro
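
The same behaviour can be reproduced without {pandas}, which may help if it isn’t installed. The sketch below uses math from the standard library as a stand-in (my choice, not part of the original example): import math binds only the name math, so sqrt() has to be reached through it.

```python
import math

# explicit call: the function is reached through the module name
root = math.sqrt(16)
print(root)  # 4.0

# the bare name was never bound to the symbol table, so this raises NameError
try:
    sqrt(16)
except NameError as err:
    message = str(err)
    print(message)
```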

If we were to replicate library() behavior (i.e. load and attach all of {pandas}’ functions/methods/attributes into the symbol table), then:

CASE 2: ATTACHING

# importing entire {pandas} into symbol table
from pandas import *

# the updated symbol table
print(*globals(), sep = "\n")

# and now this works
DataFrame(
  {
    "capital": ["Vitoria", "São Paulo", "Rio de Janeiro"],
    "state": ["Espírito Santo", "São Paulo", "Rio de Janeiro"]
  }
)
## __name__
## __doc__
## __package__
## __loader__
## __spec__
## __annotations__
## __builtins__
## sys
## os
## r
## pandas
## compat
## get_option
## set_option
## reset_option
## describe_option
## option_context
## options
## core
## errors
## util
## io
## tseries
## arrays
## plotting
## Int8Dtype
## Int16Dtype
## Int32Dtype
## Int64Dtype
## UInt8Dtype
## UInt16Dtype
## UInt32Dtype
## UInt64Dtype
## CategoricalDtype
## PeriodDtype
## IntervalDtype
## DatetimeTZDtype
## StringDtype
## BooleanDtype
## NA
## isna
## isnull
## notna
## notnull
## Index
## CategoricalIndex
## Int64Index
## UInt64Index
## RangeIndex
## Float64Index
## MultiIndex
## IntervalIndex
## TimedeltaIndex
## DatetimeIndex
## PeriodIndex
## IndexSlice
## NaT
## Period
## period_range
## Timedelta
## timedelta_range
## Timestamp
## date_range
## bdate_range
## Interval
## interval_range
## DateOffset
## to_numeric
## to_datetime
## to_timedelta
## Grouper
## factorize
## unique
## value_counts
## NamedAgg
## array
## Categorical
## set_eng_float_format
## Series
## DataFrame
## SparseDtype
## infer_freq
## offsets
## eval
## concat
## lreshape
## melt
## wide_to_long
## merge
## merge_asof
## merge_ordered
## crosstab
## pivot
## pivot_table
## get_dummies
## cut
## qcut
## api
## show_versions
## ExcelFile
## ExcelWriter
## read_excel
## read_csv
## read_fwf
## read_table
## read_pickle
## to_pickle
## HDFStore
## read_hdf
## read_sql
## read_sql_query
## read_sql_table
## read_clipboard
## read_parquet
## read_orc
## read_feather
## read_gbq
## read_html
## read_json
## read_stata
## read_sas
## read_spss
## json_normalize
## test
## testing
##           capital           state
## 0         Vitoria  Espírito Santo
## 1       São Paulo       São Paulo
## 2  Rio de Janeiro  Rio de Janeiro

But you won’t see any experienced Python user doing that kind of thing because they’re worried about loading that many names into the symbol table and the possible conflicts it may cause. An acceptable approach would be attaching only a few frequently used names as in:

CASE 3: ATTACHING SPECIFIC FUNCTIONS

# detaching {pandas}: drop every public name it exported from the symbol table
for name in vars(pandas):
    if not name.startswith("_"):
        globals().pop(name, None)  # pop() skips names that were never attached

# attaching only DataFrame()
from pandas import DataFrame

# the updated symbol table
print(*globals(), sep = "\n")
## __name__
## __doc__
## __package__
## __loader__
## __spec__
## __annotations__
## __builtins__
## sys
## os
## r
## name
## DataFrame
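
To check what a selective import binds without touching your own session, one option is to run it inside a scratch dictionary with exec(). This is only an illustration device: the namespace dictionary is a stand-in for the global symbol table and math stands in for {pandas}. It is not something you would do in real code.

```python
# a scratch dictionary standing in for the global symbol table
namespace = {}

# run the selective import inside the scratch namespace
exec("from math import sqrt", namespace)

# ignore dunder entries such as __builtins__, which exec() adds on its own
user_names = sorted(n for n in namespace if not n.startswith("__"))

print(user_names)            # ['sqrt']
print("math" in namespace)   # False: the module name itself was not bound
print(namespace["sqrt"](9))  # 3.0
```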

According to The Hitchhiker’s Guide to Python (Reitz and Schlusser 2016), case 2 is the worst possible scenario and it’s generally considered bad practice since it “makes code harder to read and makes dependencies less compartmentalized.” That claim is endorsed by Python’s official docs (Python Software Foundation 2021):

“Although certain modules are designed to export only names that follow certain patterns when you use import *, it is still considered bad practice in production code.”

In the opinion of the guide’s authors, case 3 would be a better option because it pinpoints specific names [3], while case 1 would be the best practice, for “Being able to tell immediately where a class or function comes from greatly improves code readability and understandability in all but the simplest single file projects.”
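
To make the conflict risk concrete, here is a small sketch with two standard library modules that both define sqrt(). The star imports are run in a scratch dictionary through exec() so they don’t pollute a real session; that setup and the variable names are my own assumptions for illustration.

```python
import cmath
import math

# a scratch dictionary standing in for the global symbol table
namespace = {}

# first star import: sqrt now points to math.sqrt
exec("from math import *", namespace)
first = namespace["sqrt"]

# second star import: sqrt is silently rebound to cmath.sqrt
exec("from cmath import *", namespace)
second = namespace["sqrt"]

print(first is math.sqrt)    # True
print(second is cmath.sqrt)  # True
print(second(-1))            # 1j, whereas math.sqrt(-1) raises ValueError
```

Unlike R’s “The following object is masked” message, Python rebinds the name without saying a word, so whichever star import runs last wins.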

TL;DR

When learning a new programming language, simply finding equivalent code for the practices you already have may be misleading. Here we’re able to see that an equivalent of R’s library() call is actually considered bad practice in Python, and if you do that in a job interview, you should not expect them to call you back 😂

CITATION

For attribution, please cite this work as:

Alberson Miranda. Jun 12, 2021. "Python from R I: package importing (and why learning new languages sucks)". https://datamares.netlify.app/en/post/2021-06-07-python-from-r-i-package-importing/.

BibTex citation:

  @misc{datamares,
    title = {Python from R I: package importing (and why learning new languages sucks)},
    author = {Alberson Miranda},
    year = {2021},
    url = {https://datamares.netlify.app/en/post/2021-06-07-python-from-r-i-package-importing/}
  }

REFERENCES

Python Software Foundation. 2021. The Python Tutorial. https://docs.python.org/3/tutorial/index.html.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Reitz, Kenneth and Schlusser, Tanya. 2016. The Hitchhiker’s Guide to Python! O’Reilly Media. https://docs.python-guide.org/.
Wickham, Hadley. 2015. R Packages. O’Reilly Media; 1st edition. https://r-pkgs.org/index.html.

  1. An ordered list where R will look for a function. Can be accessed with search() (Wickham 2015).

  2. I guess? I don’t know, still learning lol 😂

  3. The Python Software Foundation says “There is nothing wrong with using from package import specific_submodule! In fact, this is the recommended notation unless the importing module needs to use submodules with the same name from different packages.”
