Python from R I: package importing (and why learning new languages sucks)

Author

Alberson Miranda

Published

June 12, 2021

TL;DR

When learning a new programming language, simply finding equivalent code for the habits you already have may be misleading. Here we see that the equivalent of R’s library() call is actually considered bad practice in Python, and if you do it in a job interview, don’t expect a callback.

1 Motivation

An analogy about learning a foreign language crossed my mind: it’s impossible to learn a new language by translating word by word. It’s not only a matter of vocabulary; each language has its own grammar, phrasal verbs, diction, expressions, pace etc. The same kind of issue appears when learning a new programming language, and I think importing packages is a good, yet very simple, example of that.

2 Calling A Function From A Package

2.1 R Experience

In R, every package installed in the library trees is available as soon as a session starts. Those packages can be used at any time by calling their functions explicitly. For example:

2.1.1 Case 1: Explicit Call

# search for machine learning measures that contain "AUC" in the {mlr3} package
mlr3::mlr_measures$keys("auc")
[1] "classif.auc"       "classif.mauc_au1p" "classif.mauc_au1u"
[4] "classif.mauc_aunp" "classif.mauc_aunu" "classif.mauc_mu"  
[7] "classif.prauc"    

But that way of calling functions usually takes place only if that particular package won’t be required very often. Otherwise, it’s customary for R users to load and attach the entire package namespace to the search path1 with a library() call.

2.1.2 Case 2: Attaching

# tired: explicitly calling from {dplyr}
t1 = mtcars |>
  dplyr::mutate(hp_by_cyl = hp / cyl)

# wired: attaching the {dplyr} namespace
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
t2 = mtcars |>
  mutate(hp_by_cyl = hp / cyl)

# are they equivalent?
all.equal(t1, t2)
[1] TRUE

The problem appears when there are namespace conflicts. Did you notice the message about objects being masked from {stats} and {base}? Users often just ignore startup messages 😨, and that may eventually lead them to inconsistent results or tricky errors.

That can be avoided by attaching only the specific functions you’re actually gonna use:

2.1.3 Case 3: Attaching Specific Functions

# detaching dplyr
detach("package:dplyr")

# attaching only mutate():
library(dplyr, include.only = "mutate")

And no conflict message is triggered. Unfortunately, I don’t hear much about the include.only argument from the R community 🤷‍♂. On the contrary, meta packages such as {tidyverse}, which load and attach A LOT of stuff into the search path, often unnecessary for what you’re about to do, are quite common.

2.2 Python Experience

All of the 3 cases stated before are possible in Python, but the community standards are very different, especially regarding awareness of what is loaded into the namespace — or symbol table, as it is called in Python2.

Firstly, installed packages aren’t immediately available: before an import, even the name pandas raises a NameError. Once {pandas} is imported, we can list its functions/methods/attributes:

# inspecting modules in {pandas}
import pandas
dir(pandas)
['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_built_with_meson', '_config', '_is_numpy_dev', '_libs', '_pandas_datetime_CAPI', '_pandas_parser_CAPI', '_testing', '_typing', '_version_meson', 'annotations', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'concat', 'core', 'crosstab', 'cut', 'date_range', 'describe_option', 'errors', 'eval', 'factorize', 'from_dummies', 'get_dummies', 'get_option', 'infer_freq', 'interval_range', 'io', 'isna', 'isnull', 'json_normalize', 'lreshape', 'melt', 'merge', 'merge_asof', 'merge_ordered', 'notna', 'notnull', 'offsets', 'option_context', 'options', 'pandas', 'period_range', 'pivot', 'pivot_table', 'plotting', 'qcut', 'read_clipboard', 'read_csv', 'read_excel', 'read_feather', 'read_fwf', 'read_gbq', 'read_hdf', 'read_html', 'read_json', 'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss', 'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata', 'read_table', 'read_xml', 'reset_option', 'set_eng_float_format', 'set_option', 'show_versions', 'test', 'testing', 'timedelta_range', 'to_datetime', 'to_numeric', 'to_pickle', 'to_timedelta', 'tseries', 'unique', 'util', 'value_counts', 'wide_to_long']

One can check the symbol table with the following statement.

# what is attached into the symbol table?
print(*globals(), sep = "\n")
__name__
__doc__
__package__
__loader__
__spec__
__annotations__
__builtins__
r
pandas

Depending on what system/tools you’re using, the Python interpreter will load a few modules or none. If you start a REPL — a Python interactive terminal —, no extra modules will be loaded. If you start a Jupyter notebook, a few modules necessary for it to run will be loaded. In this case, since I’m running Python from R via {reticulate}, some modules have been loaded:

  • sys: for access to variables and functions used by the interpreter
  • os: for OS routines for NT or POSIX
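You can watch this bookkeeping directly. A minimal stdlib sketch of the difference between a module being loaded (registered in sys.modules) and being attached (bound as a name in the symbol table):

```python
import sys
import importlib

# sys.modules tracks every module the interpreter has loaded so far
print("sys" in sys.modules)  # → True

# importlib can load a module without attaching any name here:
# json ends up in sys.modules, but no `json` name is bound in globals()
importlib.import_module("json")
print("json" in sys.modules)  # → True: loaded
print("json" in globals())    # → False: not attached
```

So "loaded" and "attached" are two separate things in Python, which is exactly the distinction the import flavours below play with.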

So if I want to work with {pandas}, I need to attach it to the symbol table with an equivalent of R’s library(). And just like its cousin, Python’s import also comes in different flavours.

Firstly, import pandas will make the package available for explicit calls.

# import pandas
import pandas

# what is attached into the symbol table?
print(*globals(), sep = "\n")
__name__
__doc__
__package__
__loader__
__spec__
__annotations__
__builtins__
r
pandas

Note that only the name pandas is attached to the symbol table, not its functions/methods/attributes. So that statement is not an equivalent of library(). If we try to create a simple dataframe with {pandas}:

2.2.1 Case 1: Explicit Call

# this will result in a NameError: name 'DataFrame' is not defined
DataFrame(
  {
    "capital": ["Vitoria", "São Paulo", "Rio de Janeiro"],
    "state": ["Espírito Santo", "São Paulo", "Rio de Janeiro"]
  }
)
NameError: name 'DataFrame' is not defined
# this will work
pandas.DataFrame(
  {
    "capital": ["Vitoria", "São Paulo", "Rio de Janeiro"],
    "state": ["Espírito Santo", "São Paulo", "Rio de Janeiro"]
  }
)
          capital           state
0         Vitoria  Espírito Santo
1       São Paulo       São Paulo
2  Rio de Janeiro  Rio de Janeiro

If we were to replicate library()’s behavior (i.e. load and attach all of the {pandas} functions/methods/attributes into the symbol table), then:

2.2.2 Case 2: Attaching

# importing entire {pandas} into symbol table
from pandas import *

# the updated symbol table
print(*globals(), sep = "\n")
__name__
__doc__
__package__
__loader__
__spec__
__annotations__
__builtins__
r
pandas
ArrowDtype
BooleanDtype
Categorical
CategoricalDtype
CategoricalIndex
DataFrame
DateOffset
DatetimeIndex
DatetimeTZDtype
ExcelFile
ExcelWriter
Flags
Float32Dtype
Float64Dtype
Grouper
HDFStore
Index
IndexSlice
Int16Dtype
Int32Dtype
Int64Dtype
Int8Dtype
Interval
IntervalDtype
IntervalIndex
MultiIndex
NA
NaT
NamedAgg
Period
PeriodDtype
PeriodIndex
RangeIndex
Series
SparseDtype
StringDtype
Timedelta
TimedeltaIndex
Timestamp
UInt16Dtype
UInt32Dtype
UInt64Dtype
UInt8Dtype
api
array
arrays
bdate_range
concat
crosstab
cut
date_range
describe_option
errors
eval
factorize
get_dummies
from_dummies
get_option
infer_freq
interval_range
io
isna
isnull
json_normalize
lreshape
melt
merge
merge_asof
merge_ordered
notna
notnull
offsets
option_context
options
period_range
pivot
pivot_table
plotting
qcut
read_clipboard
read_csv
read_excel
read_feather
read_fwf
read_gbq
read_hdf
read_html
read_json
read_orc
read_parquet
read_pickle
read_sas
read_spss
read_sql
read_sql_query
read_sql_table
read_stata
read_table
read_xml
reset_option
set_eng_float_format
set_option
show_versions
test
testing
timedelta_range
to_datetime
to_numeric
to_pickle
to_timedelta
tseries
unique
value_counts
wide_to_long
# and now this works
DataFrame(
  {
    "capital": ["Vitoria", "São Paulo", "Rio de Janeiro"],
    "state": ["Espírito Santo", "São Paulo", "Rio de Janeiro"]
  }
)
          capital           state
0         Vitoria  Espírito Santo
1       São Paulo       São Paulo
2  Rio de Janeiro  Rio de Janeiro

But you won’t see experienced Python users doing that kind of thing, because they’re wary of dumping that amount of names into the symbol table and of the conflicts it may cause. An acceptable approach would be attaching only a few frequently used names, as in:
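Those conflicts are easy to trigger even with the stdlib alone. For example, os also defines an open(), so importing it shadows the builtin in the current namespace; a minimal sketch:

```python
import os

# before the import, `open` is the familiar builtin for files
print(open is os.open)  # → False

from os import open  # silently shadows the builtin here

# the same name now points at the low-level os.open, whose
# signature differs (it requires integer flags), so code that
# relied on the builtin breaks in non-obvious ways
print(open is os.open)  # → True
```

This is the Python counterpart of the masking messages R printed when attaching {dplyr}, except Python stays silent about it.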

2.2.3 Case 3: Attaching Specific Functions

# detaching {pandas}: remove the star-imported names from the symbol table
for name in list(vars(pandas)):
    if not name.startswith("_") and name != "pandas":
        globals().pop(name, None)
# attaching only DataFrame()
from pandas import DataFrame

# the updated symbol table
print(*globals(), sep = "\n")
__name__
__doc__
__package__
__loader__
__spec__
__annotations__
__builtins__
r
pandas
name
DataFrame

According to The Hitchhiker’s Guide to Python [@pythonguide], case 2 is the worst possible scenario and is generally considered bad practice, since it “makes code harder to read and makes dependencies less compartmentalized”. That claim is endorsed by Python’s official docs [@pythontutorial]:

Although certain modules are designed to export only names that follow certain patterns when you use import *, it is still considered bad practice in production code.

In the opinion of the guide’s authors, case 3 is a better option because it pinpoints specific names3, while case 1 is the best practice: “Being able to tell immediately where a class or function comes from greatly improves code readability and understandability in all but the simplest single file projects.”
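One more flavour deserves mention, since it’s the idiom you’ll actually see everywhere in Python code: the aliased import, as in import pandas as pd. It keeps every call explicit, as in case 1, while saving keystrokes. A sketch of the same pattern with a stdlib module:

```python
# aliased import: the module is bound under a short name and
# every call remains explicit about where the function lives
import statistics as st

print(st.mean([1, 2, 3]))  # → 2
```

With {pandas} the convention is so strong that import pandas as pd is effectively universal, and calls read pd.DataFrame(...).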

Footnotes

  1. An ordered list where R will look for a function. Can be accessed with search().↩︎

  2. I guess? I don’t know, still learning lol 😂↩︎

  3. Python Foundation says “There is nothing wrong with using from package import specific_submodule! In fact, this is the recommended notation unless the importing module needs to use submodules with the same name from different packages.”↩︎