Python词典类型

页面导航：

英文教程的下载地址
词典类型
通过key来获取对应的value
更新词典中的数据
删除词典中的数据
词典的key必须具有hash值
结束语

英文教程的下载地址：

    本篇文章是根据英文教程《Python Tutorial》来写的学习笔记。该英文教程的下载地址如下：

    百度盘地址：http://pan.baidu.com/s/1c0eXSQG

    DropBox地址：点此进入DropBox链接

    Google Drive：点此进入Google Drive链接

    这是学习笔记，不是翻译，因此，内容上会与英文原著有些不同。以下记录是根据英文教程的第十二章来写的。(文章中的部分链接，包括下载链接，可能需要通过代理访问!)

    本篇文章也会涉及到Python的C源代码，这些C源码都是2.7.8版本的。想使用gdb来调试python源代码的话，就需要按照前面"Python基本的操作运算符"文章中所说的，使用configure --with-pydebug命令来重新编译安装python。

    如果Python的官方网站取消了Python-2.7.8.tgz的源码包的话，可以在以下两个链接中下载到：

    DropBox：Python-2.7.8.tgz的DropBox网盘链接

    Google Drive：Python-2.7.8.tgz的Google Drive网盘链接

词典类型：

    先通过一个简单的例子来看下词典的初始化过程：

[email protected]:~$ gdb -q python
....................................................
>>> dict1 = {123:345, 234:456, 789:1011}
....................................................

Breakpoint 1, PyEval_EvalFrameEx (f=0xb7daca44, throwflag=0)
    at Python/ceval.c:2234
2234	            x = _PyDict_NewPresized((Py_ssize_t)oparg);
(gdb) l
2229	            }
2230	            break;
2231	
2232	
2233	        case BUILD_MAP:
2234	            x = _PyDict_NewPresized((Py_ssize_t)oparg);
2235	            PUSH(x);
2236	            if (x != NULL) continue;
2237	            break;
2238	
(gdb) c
Continuing.

Breakpoint 2, PyEval_EvalFrameEx (f=0xb7daca44, throwflag=0)
    at Python/ceval.c:2240
2240	            w = TOP();     /* key */
(gdb) l
2235	            PUSH(x);
2236	            if (x != NULL) continue;
2237	            break;
2238	
2239	        case STORE_MAP:
2240	            w = TOP();     /* key */
2241	            u = SECOND();  /* value */
2242	            v = THIRD();   /* dict */
2243	            STACKADJ(-2);
2244	            assert (PyDict_CheckExact(v));
(gdb) 
2245	            err = PyDict_SetItem(v, w, u);  /* v[w] = u */
2246	            Py_DECREF(u);
2247	            Py_DECREF(w);
2248	            if (err == 0) continue;
2249	            break;
2250	
....................................................
(gdb) ptype (PyDictObject *)v
type = struct _dictobject {
    struct _object *_ob_next;
    struct _object *_ob_prev;
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *, PyObject *, long int);
    PyDictEntry ma_smalltable[8];
} *
....................................................
>>> dict1
{234: 456, 123: 345, 789: 1011}
>>>

    在词典初始化时，内部会先通过BUILD_MAP操作码，调用_PyDict_NewPresized函数来创建一个新的空词典。再通过STORE_MAP操作码调用PyDict_SetItem函数，依次将每个key-value添加到词典中(每执行一次STORE_MAP操作码就添加一对key-value)。

    例如，上面的dict1在初始化时，就会先通过_PyDict_NewPresized函数为其预分配一段空间，再执行三次PyDict_SetItem函数，从而将123: 345，234: 456，789: 1011这三对key-value依次添加到词典中。在显示时，234: 456之所以出现在123: 345的前面，是因为词典在插入key-value时，是根据key的hash值以及下面会介绍的一套算法，来确定插入位置的。因此，key-value的添加顺序可以和词典中实际的存储顺序完全不同。

    词典对应的C结构体为PyDictObject，具体结构如下：

struct _dictobject {
    struct _object *_ob_next;
    struct _object *_ob_prev;
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *, PyObject *, long int);
    PyDictEntry ma_smalltable[8];
}

前4个字段是每个Python对象都具有的字段。词典的key-value都存储在上面的ma_table数组中，该数组里的每一项都是一个PyDictEntry结构体类型的成员，该类型定义在Include/dictobject.h的头文件中：

typedef struct {
    /* Cached hash code of me_key.  Note that hash codes are C longs.
     * We have to use Py_ssize_t instead because dict_popitem() abuses
     * me_hash to hold a search finger.
     */
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

    该结构里的第一个me_hash字段用于存储key的hash值，第二个me_key字段用于存储key的对象指针，第三个me_value字段则用于存储value的对象指针。

    ma_table数组里可以存储三类成员：

    第一类是empty(空的成员)，也就是me_key与me_value都是NULL(空指针)的成员，表示该成员从未被使用过。

    第二类是正在使用的有效的成员，即me_value不为NULL的成员，也就是我们可以通过key访问到的成员。

    第三类是被删除掉的成员，即me_key为dummy字符串对象且me_value为NULL的成员，在词典中删除某个key-value时，只会将该key所在的ma_table数组成员里的me_key设置为dummy，同时将me_value设置为NULL而已。因此，在对词典进行删除成员操作时，是不会调整ma_table数组的实际尺寸的。哪怕你将词典中所有的key-value都删除掉，ma_table数组的大小都不会发生变化，只不过是原来的key-value成员的me_key变为了dummy而已。dummy是"<dummy key>"字符串对象，dummy仅用于表示对应的成员已被删除了。删除掉的成员可以在以后的添加操作中被重利用。

    ma_table数组中始终会预留一部分empty成员。这样当我们在词典中搜索某个key时，如果该key不存在于词典中时，就会将dummy成员或者empty成员作为结果返回，由于这两类成员中的me_value都是NULL，这样就可以通过返回成员中的me_value来判断key是否存在于词典中了(存在则me_value不为NULL，不存在则me_value为NULL)。

    在前面介绍的PyDictObject结构体中：

    ma_fill字段表示ma_table数组中一共填充了多少个非empty的成员(包括有效成员和删除掉的成员在内)。

    ma_used字段则表示ma_table数组中一共填充了多少个有效的成员。

    ma_mask字段的值是ma_table数组的大小减1，ma_table数组的大小始终是2的次方值，例如：8，16，32等。因为在通过key的hash值计算对应的ma_table数组成员的入口索引时，需要对2的次方值进行取余运算，因此将ma_mask设置为数组大小减1，就可以直接通过按位与运算来加快取余计算。

    ma_lookup字段是一个函数指针，该指针所指向的函数可以根据key及key的hash值来计算出对应的ma_table数组成员的入口位置。

    在创建词典时，如果成员数小于等于5，则这些成员可以直接存储在ma_smalltable数组中，此时ma_table会直接指向ma_smalltable。之所以要小于等于5，是因为至少需要在数组中预留三分之一的empty成员。对于ma_smalltable数组而言，由于其大小是8，因此就需要至少预留3个empty成员。当成员数超过5时，Python就会为ma_table分配额外的堆空间，此时的ma_table就不再指向ma_smalltable了。

通过key来获取对应的value：

    前面介绍了词典的初始化过程，下面再来看下词典获取key对应的value的过程：

[email protected]:~$ gdb -q python
....................................................
>>> dict1 = {123:345, 234:456, 789:1011}
>>> dict1[234]
....................................................

Breakpoint 1, PyEval_EvalFrameEx (f=0xb7d8ce44, throwflag=0)
    at Python/ceval.c:1388
1388	            w = POP();
(gdb) l
1383	            SET_TOP(x);
1384	            if (x != NULL) continue;
1385	            break;
1386	
1387	        case BINARY_SUBSCR:
1388	            w = POP();
1389	            v = TOP();
1390	            if (PyList_CheckExact(v) && PyInt_CheckExact(w)) {
1391	                /* INLINE: list[int] */
1392	                Py_ssize_t i = PyInt_AsSsize_t(w);
(gdb) 
1393	                if (i < 0)
1394	                    i += PyList_GET_SIZE(v);
1395	                if (i >= 0 && i < PyList_GET_SIZE(v)) {
1396	                    x = PyList_GET_ITEM(v, i);
1397	                    Py_INCREF(x);
1398	                }
1399	                else
1400	                    goto slow_get;
1401	            }
1402	            else
(gdb) 
1403	              slow_get:
1404	                x = PyObject_GetItem(v, w);
1405	            Py_DECREF(v);
1406	            Py_DECREF(w);
1407	            SET_TOP(x);
1408	            if (x != NULL) continue;
1409	            break;
1410	
....................................................
(gdb) n
1404	                x = PyObject_GetItem(v, w);
(gdb) s
PyObject_GetItem (o=0xb7d44df4, key=0x8208838) at Objects/abstract.c:139
139	    if (o == NULL || key == NULL)
....................................................
144	        return m->mp_subscript(o, key);
(gdb) s
dict_subscript (mp=0xb7d44df4, key=0x8208838) at Objects/dictobject.c:1169
1169	    assert(mp->ma_table != NULL);
(gdb) c
Continuing.
456
>>>

可以看到，Python内部最终会通过dict_subscript这个C函数来获取key对应的value。该函数定义在Objects/dictobject.c文件中：

static PyObject *
dict_subscript(PyDictObject *mp, register PyObject *key)
{
    PyObject *v;
    long hash;
    PyDictEntry *ep;
    assert(mp->ma_table != NULL);
    if (!PyString_CheckExact(key) ||
        (hash = ((PyStringObject *) key)->ob_shash) == -1) {
        hash = PyObject_Hash(key);
        if (hash == -1)
            return NULL;
    }
    ep = (mp->ma_lookup)(mp, key, hash);
    if (ep == NULL)
        return NULL;
    v = ep->me_value;
    ....................................................
    return v;
}

该函数会先获取到key对应的hash值，再通过ma_lookup所指向的函数，根据hash值，将对应的ma_table数组成员的位置给计算出来。最后再将成员中的me_value作为结果返回。

由hash值计算出对应的数组成员的原理，可以参考下面这段注释(该注释同样位于Objects/dictobject.c文件中)：

/* See large comment block below.  This must be >= 1. */
#define PERTURB_SHIFT 5

/*
Major subtleties ahead:  Most hash schemes depend on having a "good" hash
function, in the sense of simulating randomness.  Python doesn't:  its most
important hash functions (for strings and ints) are very regular in common
cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>

This isn't necessarily bad!  To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints.
The same is approximately true when keys are "consecutive" strings.  So this
gives better-than-random behavior in common cases, and that's very desirable.

OTOH, when collisions occur, the tendency to fill contiguous slices of the
hash table makes a good collision resolution strategy crucial.  Taking only
the last i bits of the hash code is also vulnerable:  for example, consider
[i << 16 for i in range(20000)] as a set of keys.  Since ints are their own
hash codes, and this fits in a dict of size 2**15, the last 15 bits of every
hash code are all 0:  they *all* map to the same table index.

But catering to unusual cases should not slow the usual ones, so we just take
the last i bits anyway.  It's up to collision resolution to do the rest.  If
we *usually* find the key we're looking for on the first try (and, it turns
out, we usually do -- the table load factor is kept under 2/3, so the odds
are solidly in our favor), then it makes best sense to keep the initial index
computation dirt cheap.

The first half of collision resolution is to visit table indices via this
recurrence:

    j = ((5*j) + 1) mod 2**i

For any initial j in range(2**i), repeating that 2**i times generates each
int in range(2**i) exactly once (see any text on random-number generation for
proof).  By itself, this doesn't help much:  like linear probing (setting
j += 1, or j -= 1, on each loop trip), it scans the table entries in a fixed
order.  This would be bad, except that's not the only thing we do, and it's
actually *good* in the common cases where hash keys are consecutive.  In an
example that's really too small to make this entirely clear, for a table of
size 2**3 the order of indices is:

    0 -> 1 -> 6 -> 7 -> 4 -> 5 -> 2 -> 3 -> 0 [and here it's repeating]

If two things come in at index 5, the first place we look after is index 2,
not 6, so if another comes in at index 6 the collision at 5 didn't hurt it.
Linear probing is deadly in this case because there the fixed probe order
is the *same* as the order consecutive keys are likely to arrive.  But it's
extremely unlikely hash codes will follow a 5*j+1 recurrence by accident,
and certain that consecutive hash codes do not.

The other half of the strategy is to get the other bits of the hash code
into play.  This is done by initializing a (unsigned) vrbl "perturb" to the
full hash code, and changing the recurrence to:

    j = (5*j) + 1 + perturb;
    perturb >>= PERTURB_SHIFT;
    use j % 2**i as the next table index;

Now the probe sequence depends (eventually) on every bit in the hash code,
and the pseudo-scrambling property of recurring on 5*j+1 is more valuable,
because it quickly magnifies small differences in the bits that didn't affect
the initial index.  Note that because perturb is unsigned, if the recurrence
is executed often enough perturb eventually becomes and remains 0.  At that
point (very rarely reached) the recurrence is on (just) 5*j+1 again, and
that's certain to find an empty slot eventually (since it generates every int
in range(2**i), and we make sure there's always at least one empty slot).

Selecting a good value for PERTURB_SHIFT is a balancing act.  You want it
small so that the high bits of the hash code continue to affect the probe
sequence across iterations; but you want it large so that in really bad cases
the high-order hash bits have an effect on early iterations.  5 was "the
best" in minimizing total collisions across experiments Tim Peters ran (on
both normal and pathological cases), but 4 and 6 weren't significantly worse.

Historical:  Reimer Behrends contributed the idea of using a polynomial-based
approach, using repeated multiplication by x in GF(2**n) where an irreducible
polynomial for each table size was chosen such that x was a primitive root.
Christian Tismer later extended that to use division by x instead, as an
efficient way to get the high bits of the hash code into play.  This scheme
also gave excellent collision statistics, but was more expensive:  two
if-tests were required inside the loop; computing "the next" index took about
the same number of operations but without as much potential parallelism
(e.g., computing 5*j can go on at the same time as computing 1+perturb in the
above, and then shifting perturb can be done while the table index is being
masked); and the PyDictObject struct required a member to hold the table's
polynomial.  In Tim's experiments the current scheme ran faster, produced
equally good collision statistics, needed less code & used less memory.

Theoretical Python 2.5 headache:  hash codes are only C "long", but
sizeof(Py_ssize_t) > sizeof(long) may be possible.  In that case, and if a
dict is genuinely huge, then only the slots directly reachable via indexing
by a C long can be the first slot in a probe sequence.  The probe sequence
will still eventually reach every slot in the table, but the collision rate
on initial probes may be much higher than this scheme was designed for.
Getting a hash code as fat as Py_ssize_t is the only real cure.  But in
practice, this probably won't make a lick of difference for many years (at
which point everyone will have terabytes of RAM on 64-bit boxes).
*/

    上面注释里最主要的一个公式就是：j = ((5*j) + 1) mod 2**i ，当j的初始值为0到2**i - 1之间的任意整数时，循环计算此公式2**i次，就可以将0到2**i - 1之间的每个值都生成一遍。

    假设i为3，那么2**i就是2的3次方，也就是8。当j的初始值为0到7的任意整数时，假设为5，那么将5放到公式中，j = ((5 * 5) + 1) mod 8 就可以计算出j的新值为2(公式中的mod表示取余运算)。

    再将j的新值2放到公式中：j = ((5 * 2) + 1) mod 8 得到j为3。

    将3放入公式中：j = ((5 * 3) + 1) mod 8 得到j为0。

    将0放入公式中：j = ((5 * 0) + 1) mod 8 得到j为1。

    将1放入公式中：j = ((5 * 1) + 1) mod 8 得到j为6。

    将6放入公式中：j = ((5 * 6) + 1) mod 8 得到j为7。

    将7放入公式中：j = ((5 * 7) + 1) mod 8 得到j为4。

    这样通过将公式循环计算8次(2的3次方)，获取到的值的情况就是 5 -> 2 -> 3 -> 0 -> 1 -> 6 -> 7 -> 4，也就是将0到7这8个数都获取了一遍。再将4放入公式中，结果又会从5开始循环下去：5 -> 2 -> 3 -> 0 -> 1 -> 6 -> 7 -> 4 -> 5 -> 2 -> 3 -> 0 .......

    假设某个key的hash值为123，同时假定词典的ma_table数组的大小为2**3即8。那么，在计算key对应的数组成员的索引值时，会先计算 123 mod 8 = 3，这里通过取余运算，将123转换为0到7之间的数，词典在查询时，就会先将索引3的数组成员的me_key与当前要查询的key的hash值及对象值进行比较，如果都相等，那么就说明123对应的数组成员的索引为3，否则就再将3放入到j = ((5 * 3) + 1) mod 8的公式中，计算得到0，再将索引0的数组成员的me_key与key进行比较。循环这么找下去。直到找到对应的符合条件的成员为止，如果找不到就返回empty或dummy成员。

    在实际计算时，Python还加入了一个perturb的计算因子，perturb是由hash值与PERTURB_SHIFT宏(该宏对应为5)通过位移得到的值。

    因此，实际的计算公式为：

j = (5*j) + 1 + perturb
perturb >> PERTURB_SHIFT
使用j mod 2**i作为数组索引值。

    要注意上面这个公式与原来公式的区别，原来公式中j的值还经过了mod 2**i的取余运算，但是这里的j的值仅由(5*j) + 1 + perturb来计算出来，j mod 2**i仅在此作为索引值而已，取余运算在这里不会修改j的值。

    perturb的初始值为hash值，每运算一次公式，perturb的值就会右移PERTURB_SHIFT位。

    因此当key的hash值为123，词典的ma_table的大小为8时，一开始还是先计算出 123 mod 8 = 3，然后在索引3的ma_table数组成员中对key进行比较，如果不符合条件，则将3放入 j = ((5 * 3) + 1 + 123) 的公式中，得到139，139 mod 8 = 3，由于索引3已经不符合条件了，再进入公式 j = (5 * 139) + 1 + (123 >> 5) 得到j为699，699 mod 8 = 3，还是索引3，还是不符合条件，再进入公式 j = (5 * 699) + 1 + (123 >> 5 >> 5) 即 j = (5 * 699) + 1 + 0 ，得到j为3496，3496 mod 8 = 0，再对索引0的数组成员进行判断，以此类推，循环下去。

    Python加入的perturb计算因子，在上面这个123的例子中并没有原始公式那么好，重复计算出两次3，不过在其他例子中，它有可能会比原始公式要查询的快些。

    词典对象的ma_lookup字段所指向的函数，默认是lookdict_string函数，该函数就用到了上面的计算公式。lookdict_string也定义在Objects/dictobject.c文件里：

static PyDictEntry *
lookdict_string(PyDictObject *mp, PyObject *key, register long hash)
{
    register size_t i;
    register size_t perturb;
    register PyDictEntry *freeslot;
    register size_t mask = (size_t)mp->ma_mask;
    PyDictEntry *ep0 = mp->ma_table;
    register PyDictEntry *ep;

    /* Make sure this function doesn't have to handle non-string keys,
       including subclasses of str; e.g., one reason to subclass
       strings is to override __eq__, and for speed we don't cater to
       that here. */
    if (!PyString_CheckExact(key)) {
#ifdef SHOW_CONVERSION_COUNTS
        ++converted;
#endif
        // 当要查询的key不是字符串对象时，
        // 会将词典对象的ma_lookup字段
        // 指向lookdict函数。
        mp->ma_lookup = lookdict;
        return lookdict(mp, key, hash);
    }
    // 先将hash值与mask进行按位与运算，
    // 由于mask的值是ma_table数组的大小减一，
    // 而ma_table的大小又是2的次方值。
    // 因此，按位与运算的结果等效于
    // hash与ma_table大小的取余运算的结果。
    // 假设hash值为123，ma_table数组大小为8，
    // 那么mask的值就为8 - 1 = 7，
    // 123 & 7 = 3，
    // 123和8进行取余运算的结果也是3，
    // 按位与运算比取余运算的效率更高些。
    // 下面的i = hash & mask表达式就是先计算
    // 出hash值与数组大小的余数。
    i = hash & mask;
    // 先根据余数作为索引对数组成员进行判断。
    ep = &ep0[i];
    // 如果是empty成员或者该成员的me_key等于查询的
    // key时，就将该成员的指针值作为结果返回。
    if (ep->me_key == NULL || ep->me_key == key)
        return ep;
    // 如果是dummy成员，就先将该成员的指针值
    // 暂时保存到freeslot变量中。
    // 如果稍候找不到符合条件的成员(即没有找到
    // key值相等的成员)时，就会将dummy成员
    // 作为结果返回。dummy成员属于被删除掉
    // 的成员，将其返回后，可以对其进行重利用。
    if (ep->me_key == dummy)
        freeslot = ep;
    else {
        // 如果数组成员的me_hash值与
        // 要查询的key的hash值相等，
        // 且数组成员的me_key对象的值与
        // 要查询的key对象的值相等时，
        // 也会将该成员作为结果返回。
        if (ep->me_hash == hash && _PyString_Eq(ep->me_key, key))
            return ep;
        freeslot = NULL;
    }

    // 通过之前介绍的公式，
    // 循环对数组成员进行判断。
    /* In the loop, me_key == dummy is by far (factor of 100s) the
       least likely outcome, so test for that last. */
    for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
        // (i << 2) + i + perturb + 1等效于
        // i * 4 + i + 1 + perturb即
        // (5 * i) + 1 + perturb，
        // 与之前介绍的公式一致。
        i = (i << 2) + i + perturb + 1;
        // i & mask相当于i与数组大小进行取余运算。
        // 取余运算的结果作为数组索引值。
        ep = &ep0[i & mask];
        // 如果计算出来的数组成员的me_key
        // 为NULL(空指针)时，就说明当前成员
        // 是empty成员，当搜索到empty
        // 成员时，就说明要查询的key不存在于
        // 词典的ma_table数组中，就将
        // dummy成员或者empty成员作为结果返回。
        if (ep->me_key == NULL)
            return freeslot == NULL ? ep : freeslot;
        // 如果计算出来的数组成员的hash值与
        // 需要查询的key的hash值相等，
        // 且数组成员的me_key的对象指针
        // 等于key的对象指针，或者me_key
        // 对象的值等于key对象的值时，就说明
        // 当前数组成员中包含了我们要查询的key，
        // 就将该成员的指针值作为结果返回。
        if (ep->me_key == key
            || (ep->me_hash == hash
            && ep->me_key != dummy
            && _PyString_Eq(ep->me_key, key)))
            return ep;
        // 如果freeslot变量没被设置过，
        // 那么就将计算过程中找到的第一个
        // dummy成员的指针值保存到freeslot变量里。
        if (ep->me_key == dummy && freeslot == NULL)
            freeslot = ep;
    }
    assert(0);          /* NOT REACHED */
    return 0;
}

从上面的代码中可以看到，当要查询的key不是字符串对象时，词典对象的ma_lookup就会被指向为lookdict函数，该函数的查询key的过程与上面这个lookdict_string函数的查询计算过程是差不多的。

更新词典中的数据：

先来看个简单的例子：

[email protected]:~$ gdb -q python
....................................................
>>> dict1 = {'hello': 123, 'world': 456}
>>> dict1['hello'] = 100
....................................................

Breakpoint 3, PyEval_EvalFrameEx (f=0xb7daca44, throwflag=0)
    at Python/ceval.c:1714
1714	            w = TOP();
(gdb) l
1709	            Py_XDECREF(w);
1710	            if (err == 0) continue;
1711	            break;
1712	
1713	        case STORE_SUBSCR:
1714	            w = TOP();
1715	            v = SECOND();
1716	            u = THIRD();
1717	            STACKADJ(-3);
1718	            /* v[w] = u */
(gdb) 
1719	            err = PyObject_SetItem(v, w, u);
1720	            Py_DECREF(u);
1721	            Py_DECREF(v);
1722	            Py_DECREF(w);
1723	            if (err == 0) continue;
1724	            break;
1725	
....................................................
(gdb) s
PyObject_SetItem (o=0xb7ca2994, key=0xb7ca1a00, value=0x8208328)
    at Objects/abstract.c:167
167	    if (o == NULL || key == NULL || value == NULL) {
....................................................
173	        return m->mp_ass_subscript(o, key, value);
(gdb) s
dict_ass_sub (mp=0xb7ca2994, v=0xb7ca1a00, w=0x8208328)
    at Objects/dictobject.c:1208
1208	    if (w == NULL)
....................................................
1211	        return PyDict_SetItem((PyObject *)mp, v, w);
(gdb) s
PyDict_SetItem (op=0xb7ca2994, key=0xb7ca1a00, value=0x8208328)
    at Objects/dictobject.c:802
802	    if (!PyDict_Check(op)) {
....................................................
818	    return dict_set_item_by_hash_or_entry(op, key, hash, NULL, value);
(gdb) s
dict_set_item_by_hash_or_entry (op=0xb7ca2994, key=0xb7ca1a00, 
    hash=-1267296259, ep=0x0, value=0x8208328) at Objects/dictobject.c:759
759	    mp = (PyDictObject *)op;
....................................................
765	        if (insertdict(mp, key, hash, value) != 0)
(gdb) s
insertdict (mp=0xb7ca2994, key=0xb7ca1a00, hash=-1267296259, value=0x8208328)
    at Objects/dictobject.c:549
549	    assert(mp->ma_lookup != NULL);
....................................................
556	    return insertdict_by_entry(mp, key, hash, ep, value);
(gdb) s
insertdict_by_entry (mp=0xb7ca2994, key=0xb7ca1a00, hash=-1267296259, 
    ep=0xb7ca29f4, value=0x8208328) at Objects/dictobject.c:515
515	    MAINTAIN_TRACKING(mp, key, value);
....................................................
(gdb) c
Continuing.
>>> dict1['hello']
100
>>>

可以看到，Python内部会通过 dict_ass_sub -> PyDict_SetItem -> dict_set_item_by_hash_or_entry -> insertdict -> insertdict_by_entry 这几个C函数对词典里的数据进行更新操作，这几个函数都定义在Objects/dictobject.c文件中：

/*
Internal routine to insert a new item into the table when you have entry object.
Used by insertdict.
*/
static int
insertdict_by_entry(register PyDictObject *mp, PyObject *key, long hash,
                    PyDictEntry *ep, PyObject *value)
{
    PyObject *old_value;

    MAINTAIN_TRACKING(mp, key, value);
    // 如果key存在于词典中，即
    // 查询到的数组成员的me_value
    // 不为NULL(空指针)时，
    // 就对该数组成员中的me_value进行更新。
    if (ep->me_value != NULL) {
        old_value = ep->me_value;
        ep->me_value = value;
        Py_DECREF(old_value); /* which **CAN** re-enter */
        Py_DECREF(key);
    }
    // 如果key不存在于词典中，
    // 则将key-value添加到词典中。
    else {
        if (ep->me_key == NULL)
            mp->ma_fill++;
        else {
            assert(ep->me_key == dummy);
            Py_DECREF(dummy);
        }
        ep->me_key = key;
        ep->me_hash = (Py_ssize_t)hash;
        ep->me_value = value;
        mp->ma_used++;
    }
    return 0;
}


/*
Internal routine to insert a new item into the table.
Used both by the internal resize routine and by the public insert routine.
Eats a reference to key and one to value.
Returns -1 if an error occurred, or 0 on success.
*/
static int
insertdict(register PyDictObject *mp, PyObject *key, long hash, PyObject *value)
{
    register PyDictEntry *ep;

    assert(mp->ma_lookup != NULL);
    // 通过ma_lookup指向的函数
    // 查询key对应的数组成员，
    // 如果key存在于词典中，下面的
    // insertdict_by_entry函数就会对
    // 找到的数组成员里的me_value进行
    // 更新操作。
    // 如果key不存在于词典中，
    // 那么ma_lookup就会返回dummy成员
    // 或者empty成员，下面的
    // insertdict_by_entry函数就会
    // 使用dummy或empty成员进行
    // 添加key-value的操作。
    ep = mp->ma_lookup(mp, key, hash);
    if (ep == NULL) {
        Py_DECREF(key);
        Py_DECREF(value);
        return -1;
    }
    return insertdict_by_entry(mp, key, hash, ep, value);
}

....................................................

static int
dict_set_item_by_hash_or_entry(register PyObject *op, PyObject *key,
                               long hash, PyDictEntry *ep, PyObject *value)
{
    register PyDictObject *mp;
    register Py_ssize_t n_used;

    mp = (PyDictObject *)op;
    // 词典的ma_table数组中至少要
    // 包含一个empty成员。
    assert(mp->ma_fill <= mp->ma_mask);  /* at least one empty slot */
    n_used = mp->ma_used;
    Py_INCREF(value);
    Py_INCREF(key);
    if (ep == NULL) {
        // 通过上面的insertdict函数
        // 去完成更新数据或者添加数据的操作。
        if (insertdict(mp, key, hash, value) != 0)
            return -1;
    }
    else {
        if (insertdict_by_entry(mp, key, hash, ep, value) != 0)
            return -1;
    }
    // 当词典的ma_table数组中填充的非empty成员
    // 的数量，超过了ma_table数组大小的
    // 三分之二时，就对ma_table数组进行扩容操作，
    // 这样就可以确保数组中有三分之一的成员
    // 属于empty成员了。
    /* If we added a key, we can safely resize.  Otherwise just return!
     * If fill >= 2/3 size, adjust size.  Normally, this doubles or
     * quaduples the size, but it's also possible for the dict to shrink
     * (if ma_fill is much larger than ma_used, meaning a lot of dict
     * keys have been * deleted).
     *
     * Quadrupling the size improves average dictionary sparseness
     * (reducing collisions) at the cost of some memory and iteration
     * speed (which loops over every possible entry).  It also halves
     * the number of expensive resize operations in a growing dictionary.
     *
     * Very large dictionaries (over 50K items) use doubling instead.
     * This may help applications with severe memory constraints.
     */
    if (!(mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2))
        return 0;
    return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
}

....................................................

int
PyDict_SetItem(register PyObject *op, PyObject *key, PyObject *value)
{
    register long hash;

    if (!PyDict_Check(op)) {
        PyErr_BadInternalCall();
        return -1;
    }
    assert(key);
    assert(value);
    // 先得到key的hash值，
    // 字符串对象的hash值，
    // 一般是在字符串对象初始化时就计算好了的。
    // 另外，整数对象的hash值就等于整数对象的值，
    // 例如：123整数对应的hash值就是123。
    if (PyString_CheckExact(key)) {
        hash = ((PyStringObject *)key)->ob_shash;
        if (hash == -1)
            hash = PyObject_Hash(key);
    }
    else {
        hash = PyObject_Hash(key);
        if (hash == -1)
            return -1;
    }
    // 进入上面的dict_set_item_by_hash_or_entry函数
    // 去完成更新数据或者添加数据的操作。
    return dict_set_item_by_hash_or_entry(op, key, hash, NULL, value);
}

....................................................

static int
dict_ass_sub(PyDictObject *mp, PyObject *v, PyObject *w)
{
    if (w == NULL)
        return PyDict_DelItem((PyObject *)mp, v);
    else
        // 进入上面的PyDict_SetItem函数去完成
        // 更新数据或者添加数据的操作。
        return PyDict_SetItem((PyObject *)mp, v, w);
}

删除词典中的数据：

下面再通过例子来看下，Python是如何完成词典数据的删除操作的：

[email protected]:~$ gdb -q python
....................................................
>>> dict1 = {123:345, 234:456, 789:1011}
>>> dict1
{234: 456, 123: 345, 789: 1011}
>>> del dict1[123]
....................................................

Breakpoint 4, PyEval_EvalFrameEx (f=0xb7ca4034, throwflag=0)
    at Python/ceval.c:1727
1727	            w = TOP();
(gdb) l
1722	            Py_DECREF(w);
1723	            if (err == 0) continue;
1724	            break;
1725	
1726	        case DELETE_SUBSCR:
1727	            w = TOP();
1728	            v = SECOND();
1729	            STACKADJ(-2);
1730	            /* del v[w] */
1731	            err = PyObject_DelItem(v, w);
(gdb) 
1732	            Py_DECREF(v);
1733	            Py_DECREF(w);
1734	            if (err == 0) continue;
1735	            break;
1736	
....................................................
1731	            err = PyObject_DelItem(v, w);
(gdb) s
PyObject_DelItem (o=0xb7ca2994, key=0x820815c) at Objects/abstract.c:199
199	    if (o == NULL || key == NULL) {
....................................................
205	        return m->mp_ass_subscript(o, key, (PyObject*)NULL);
(gdb) s
dict_ass_sub (mp=0xb7ca2994, v=0x820815c, w=0x0) at Objects/dictobject.c:1208
1208	    if (w == NULL)
(gdb) s
1209	        return PyDict_DelItem((PyObject *)mp, v);
(gdb) s
PyDict_DelItem (op=0xb7ca2994, key=0x820815c) at Objects/dictobject.c:829
829	    if (!PyDict_Check(op)) {
(gdb) c
Continuing.
>>> dict1
{234: 456, 789: 1011}
>>>

可以看到，Python内部最终会通过PyDict_DelItem函数去完成词典的成员数据的删除操作，该C函数同样定义在Objects/dictobject.c文件中：

int
PyDict_DelItem(PyObject *op, PyObject *key)
{
    register PyDictObject *mp;
    register long hash;
    register PyDictEntry *ep;
    PyObject *old_value, *old_key;

    if (!PyDict_Check(op)) {
        PyErr_BadInternalCall();
        return -1;
    }
    assert(key);
    // 先得到key的hash值。
    if (!PyString_CheckExact(key) ||
        (hash = ((PyStringObject *) key)->ob_shash) == -1) {
        hash = PyObject_Hash(key);
        if (hash == -1)
            return -1;
    }
    mp = (PyDictObject *)op;
    // 通过ma_lookup指向的函数，
    // 根据key的hash值来得到
    // 对应的数组成员。
    ep = (mp->ma_lookup)(mp, key, hash);
    if (ep == NULL)
        return -1;
    // 如果得到的是dummy或者empty成员
    // (这两类成员的me_value都是NULL)，
    // 则说明key不存在于词典中，
    // 就会抛出KeyError的错误。
    if (ep->me_value == NULL) {
        set_key_error(key);
        return -1;
    }
    // 如果找到了key所对应的有效的数组成员，
    // 则将该成员的me_key设置为dummy字符串对象。
    // 同时将成员的me_value设置为NULL。
    // 最后再将词典对象的ma_used的值减一，
    // 也就是将词典的有效的成员数减一，
    // 从而完成删除key对应的成员的操作。
    // 由此可以看出来，删除词典数据
    // 并不会对词典的ma_table数组的大小
    // 产生任何影响。它仅仅是将原来的
    // 有效成员变为了dummy成员而已。
    old_key = ep->me_key;
    Py_INCREF(dummy);
    ep->me_key = dummy;
    old_value = ep->me_value;
    ep->me_value = NULL;
    mp->ma_used--;
    // 将成员中原来的key和value的
    // 对象的引用计数减一。
    // 当这些对象的引用计数减到0时，
    // 就可以被Python的垃圾回收机制
    // 给处理掉。
    Py_DECREF(old_value);
    Py_DECREF(old_key);
    return 0;
}

从上面的注释中可以看到，当key不存在时，进行删除操作会抛出KeyError的错误：

[email protected]:~$ gdb -q python
....................................................
>>> dict1
{234: 456, 789: 1011}
>>> del dict1[123]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 123
>>>

词典的key必须具有hash值：

从上面的这些例子中可以看到，无论是访问词典中的成员，还是更新，添加，删除词典成员的操作，都需要先根据key的hash值来计算出对应的成员位置。因此，词典相关的key对象必须具有hash值，如果某个Python对象无法计算出hash值的话，那么使用该对象作为key时，就会抛出如下所示的TypeError的错误：

[email protected]:~$ gdb -q python
....................................................
>>> dict1[{123 : 234}] = 567
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>> dict1[{1,2,3,4}] = 567
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> dict1[[1,2,3,4]] = 567
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>>

像dict(词典)，set，以及list(列表)类型的对象，他们的成员数据都是可变的，因此，无法计算出这些对象的确切的hash值来。在这些类型的C结构里的tp_hash字段也都被指向为PyObject_HashNotImplemented函数(下面以词典类型为例)：

PyTypeObject PyDict_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "dict",
    sizeof(PyDictObject),
    0,
    ................................................
    (hashfunc)PyObject_HashNotImplemented,      /* tp_hash */
    0,                                          /* tp_call */
    0,                                          /* tp_str */
    ................................................
};

上面这个词典类型定义在Objects/dictobject.c文件里，当Python尝试去计算词典对象的hash值时，就会通过tp_hash所指向的PyObject_HashNotImplemented函数尝试计算hash值，该C函数定义在Objects/object.c文件中：

long
PyObject_HashNotImplemented(PyObject *self)
{
    PyErr_Format(PyExc_TypeError, "unhashable type: '%.200s'",
                 self->ob_type->tp_name);
    return -1;
}

    可以看到，该函数会直接抛出一个TypeError的错误。其他的set，list类型的对象，在尝试计算hash值时，最终也都会进入到该函数中。因此，这些类型的对象在作为词典key时，就会抛出 TypeError: unhashable type: ... 的错误了。

结束语：

    以上就是和词典类型相关的内容。下一篇将介绍与词典相关的脚本函数及方法。

    OK，休息，休息一下 o(∩_∩)o~~

    如果我们想取得成功，就必须先找到属于自己的“世界”，只有在自己的“世界”里，才能将潜能发挥到最大。

—— 大时代

上下篇

下一篇： Python词典相关的脚本函数及方法

上一篇： Python元组类型及相关函数

Python元组类型及相关函数

Python基本的I/O操作 (一)

Python用户自定义函数

Python的数字类型及相关的类型转换函数

Python基本的操作运算符

Python的time模块

zenglong 2015-08-04 03:57:16