I was learning about MongoDB tonight and the instructor went over some basic Python code. I would normally skip them but as a learn I was skimming through each Python lecture to make sure nothing strikes me (I don't want to miss any COOL tricks). Well, Python: for loops with lists caught my attention.
The loop variable item was accessible and used post-loop. HOW! I thought it was local.
My first thought was "damn could this be a static thing? It doesn't look right!". I fired up a shell, typed in some code, and tried the same code and the instructor wasn't lying!
I dissembled the code in the shell, and this is what I get:
>>> def f(): ... l = ['1', '2', '3'] ... l2 = [] ... for item in l: ... pass ... l2.append(item) ... print l2 ... >>> f() ['3'] >>> import dis >>> dis.dis(f) 2 0 LOAD_CONST 1 ('1') 3 LOAD_CONST 2 ('2') 6 LOAD_CONST 3 ('3') 9 BUILD_LIST 3 12 STORE_FAST 0 (l) 3 15 BUILD_LIST 0 18 STORE_FAST 1 (l2) 4 21 SETUP_LOOP 14 (to 38) 24 LOAD_FAST 0 (l) 27 GET_ITER >> 28 FOR_ITER 6 (to 37) 31 STORE_FAST 2 (item) 5 34 JUMP_ABSOLUTE 28 >> 37 POP_BLOCK 6 >> 38 LOAD_FAST 1 (l2) 41 LOAD_ATTR 0 (append) 44 LOAD_FAST 2 (item) 47 CALL_FUNCTION 1 50 POP_TOP 7 51 LOAD_FAST 1 (l2) 54 PRINT_ITEM 55 PRINT_NEWLINE 56 LOAD_CONST 0 (None) 59 RETURN_VALUEThis doesn't really help me. I don't remember all the goodies about dis from Python conference last summer. So instead, I googled this first.
http://stackoverflow.com/questions/3611760/scoping-in-python-for-loops
I wasn't the first one. The accepted answer referenced a mailing-list discussion on mails.python.org.
The last value of the loop variable (in this case, item) will live in the surrounding scope.
l = ['1', '2', '3'] l2 = [] for item in l: print id(item) pass l2.append(item) print id(item)
The result is as follow:
yeukhon@yeukhon-P5E-VM-DO:~/tmp$ python wtf.py
3077889320
3077837144
3077839616
3077839616
The first three are the three iterations of the list l and we can see the last output matches the last iteration. This means the value persists. The value continues to reside.
I didn't stop here. I went on. I wanted to find out the underlying assembler code. So I use Cython to translate my python code into C code.
I removed the print statements from the previous code, and then I generated this C code. The translator did an amazing job. Sadly, for some strange reason I couldn't compile it because Python.h was not found. I probably have to setup the library so the compiler can discover that file? I already have python-dev (and python2.7-dev) installed. Probably....
In any case, a few things:
I know that python keeps a namespace, and they are keys/values pair. You can see the static char starting from line 480.
/* Implementation of 'wtf' */ static char __pyx_k__1[] = "1"; static char __pyx_k__2[] = "2"; static char __pyx_k__3[] = "3"; static char __pyx_k__l[] = "l"; static char __pyx_k__l2[] = "l2"; static char __pyx_k__item[] = "item"; static char __pyx_k____main__[] = "__main__"; static char __pyx_k____test__[] = "__test__";
Now I want to dive deeper and see how they keep the last value in the surrounding scope.
Starting from line 655 we enter the for loop. The code is a bit messy to read at first. Many of the Py_* are actually CPython's C API. I have to google some of them to understand what they do and what their return values are. My quest is to find the last reference to the object holding the value item is not garbage collected.
I explained the relevant portion of the code here. It's very hard to read, so make sure you are prepared to read it...
But to summarize, at the end of the loop,
if (PyObject_SetAttr(__pyx_m, __pyx_n_s__item, __pyx_t_1) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
This line saves the value of the last iteration into the static variable item in memory! After this, the append statement looks like this.
__pyx_t_2 = __Pyx_GetName(__pyx_m, __pyx_n_s__l2); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L1_error;} __Pyx_GOTREF(__pyx_t_2); __pyx_t_1 = __Pyx_GetName(__pyx_m, __pyx_n_s__item); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 6; __pyx_clineno = __LINE__; goto __pyx_L1_error;} __Pyx_GOTREF(__pyx_t_1);
As you see, we get the value directly from __pyx_n_s__item. It's a static variable.
Lesson learned:
1. First thing I recognized was all the ugly branch statements in the code. It might be just the translation (btw, Cython did a great job though... I can't even dare to think how I would go about writing such translator.... geesh). But still, under the hook, there are lots of checks. Branch statements are very expensive. So if this was not optimized, we would waste lots of cycles!
2. Static is everywhere in the translated C code! Look at the object variable item. It's static.
static PyObject *__pyx_n_s__item;
3. It's a great challenge to read all these codes. I remember all these SET, INCREF stuff from my programming language paradigm class which I took last year. It's cool to see these guys again. Honestly, if I weren't doing this, I wouldn't be digging through the C API on python.org
Some references:
http://docs.python.org/2/c-api/refcounting.html
http://stackoverflow.com/questions/4657764/py-incref-decref-when
http://docs.python.org/2/c-api/structures.html
http://docs.python.org/2/c-api/list.html
http://wiki.cython.org/enhancements/refnanny
http://wiki.cython.org/enhancements/pypy
http://mail.python.org/pipermail/python-ideas/2008-October/002109.html
http://stackoverflow.com/questions/2525518/writing-code-translator-from-python-to-chttp://docs.python.org/2/library/dis.html