[aligner] Enable the user to reuse the cache returned by alignall_iterative() (closes #116938)

The cache returned by alignall_iterative() can be passed back to the function, either to perform another alignment with different parameters or simply to make sure that everything has been correctly caught.
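To make the reuse pattern concrete, here is a minimal sketch. It does not call nazca: `toy_align` is a hypothetical stand-in that only mimics the contract described in this changeset (a `cache` keyword, and a returned dictionary mapping each align item's id to a `(target item's id, distance)` tuple). The rule of keeping the smallest distance is an assumption made for illustration; the update logic of the real function is not shown in this patch.

```python
def toy_align(pairs, threshold, cache=None):
    """Hypothetical stand-in for alignall_iterative(); not nazca code.

    `pairs` is an iterable of (align_id, target_id, distance) candidates.
    """
    # Same idiom as the patch: start from the caller's cache when one is given.
    cache = cache or {}
    for align_id, target_id, distance in pairs:
        if distance > threshold:
            continue
        known = cache.get(align_id)
        # Assumption: keep only the best (smallest-distance) match seen so far.
        if known is None or distance < known[1]:
            cache[align_id] = (target_id, distance)
    return cache


# First pass with a strict threshold.
first = toy_align([("a1", "t1", 0.4), ("a2", "t7", 0.9)], threshold=0.5)

# Second pass with a looser threshold, fed with the cache of the first pass.
second = toy_align([("a1", "t3", 0.1), ("a2", "t7", 0.9)],
                   threshold=1.0, cache=first)
```

One consequence of the `cache or {}` idiom is worth knowing: a non-empty dictionary passed as `cache` is updated in place (here `second` and `first` are the same object), whereas an empty dictionary is falsy and is silently replaced by a fresh one, so the caller's empty dictionary stays empty.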

author: Simon Chabot <simon.chabot@logilab.fr>
changeset: f942f2393fb2
branch: default
phase: public
hidden: no
parent revision: #4b6119e623cf [aligner] Add the alignall_iterative() function (closes #116932)
child revision: #7498d98cde7a [aligner] Enable the user to customize the equality_threshold (closes #116940)
files modified by this revision
aligner.py
# HG changeset patch
# User Simon Chabot <simon.chabot@logilab.fr>
# Date 1365091810 -7200
# Thu Apr 04 18:10:10 2013 +0200
# Node ID f942f2393fb2af222f7b34b8fd89f16f9a4d893e
# Parent 4b6119e623cf80f4a82c1d4b4546f5d91f1ffbae
[aligner] Enable the user to reuse the cache returned by alignall_iterative() (closes #116938)
The cache returned by alignall_iterative() can be passed back to the function, either to perform another alignment with different parameters or simply to make sure that everything has been correctly caught.

diff --git a/aligner.py b/aligner.py
@@ -176,14 +176,14 @@
         `alignset` and `targetset` are the sets to align. Each set contains
         lists where the first column is the identifier of the item, and the others are
         the attributs to align. (Note that the order is important !) Both must
         have the same number of columns.

-        `treatments` is a dictionnary of dictionnaries.
-        Each key is the indice of the row, and each value is a dictionnary
+        `treatments` is a dictionary of dictionaries.
+        Each key is the indice of the row, and each value is a dictionary
         that contains the treatments to do on the different attributs.
-        Each dictionnary is built as the following:
+        Each dictionary is built as the following:

             treatment = {'normalization': [f1, f2, f3],
                          'norm_params': {'arg1': arg01, 'arg2': arg02},
                          'metric': d1,
                          'metric_params': {'arg1': arg11},
@@ -299,18 +299,24 @@
             yield alignset[alignid][0], targetset[bestid][0]

 def alignall_iterative(alignfile, targetfile, alignformat, targetformat,
                        threshold, size=10000, treatments=None, indexes=(1,1),
                        mode='kdtree', neighbours_threshold=0.1, n_clusters=None,
-                       kwordsgram=1, siglen=200):
+                       kwordsgram=1, siglen=200, cache=None):

     """ This function helps you to align *huge* files.
         It takes your csv files as arguments and split them into smaller ones
-        (files of `size` lines), and runs the alignement on those files.
+        (files of `size` lines), and runs the alignment on those files.

         `alignformat` and `targetformat` are keyworded arguments given to the
         nazca.dataio.parsefile function.
+
+        This function returns its own cache. The cache is quite simply a
+        dictionary having align items' id as keys and tuples (target item's id,
+        distance) as value. This dictionary can be regiven to this function to
+        perform another alignment (with different parameters, or just to be
+        sure everything has been caught)
     """

     #Split the huge files into smaller ones
     aligndir = mkdtemp()
     targetdir = mkdtemp()
@@ -320,11 +326,11 @@
     #Compute the number of iterations that must be done to achieve the alignement
     nb_iterations = len(alignfiles) * len(targetfiles)
     current_it = 0

     doneids = set([]) #Contains the id of perfectly aligned data
-    cache = {} #Contains the better known alignements
+    cache = cache or {} #Contains the better known alignments

     try:
         for alignfile in alignfiles:
             alignset = parsefile(osp.join(aligndir, alignfile), **alignformat)
             for targetfile in targetfiles: