File Key Indexing


John Clark wrote on 27 Nov 1995:

I am working the FCC files for radio transmitters. Over 3,000,000 records in dBASE format. [...]

I am inverting the file (at least that is what I think I am doing). I take all the transmitters assinged to "CALIFORNIA, STATE OF" store that name once and have a vector of pointers pointing to occurrences of it. Saves a lot of space and gives me fast lookup in any conbination of searches. (I also do the same with city names, etc.)

[...] but checking a matrix to see if the data is already there is a bit slow.


I replied:

Ahh... Finding distinct names by looking them up one at a time will be extremely slow no matter what search technique you use. You need to process large blocks of names at once using either sorting or a hashing lookup algorithm (as is used by the latest version of MATIOTA, which can be found in http://www.chilton.com/~jimw). Even if you can't process the whole file at once, looping on 100,000-record blocks with MATIOTA should be very fast compared to alternatives.

And then followed up with some code:


From: Jim Weigang <jimw>
To: John Clark
Date: Tue, 28 Nov 1995 19:58:53 -0500
Subject: Re: ROWFIND and/or APL2

I think the following should be a pretty fast way of building the file pointers. Might keep the run time down to just one football game. (If you use DEFINEFNS to read this in, the resulting programs will run under either APL*PLUS II or III. The {del}: lines adjust the signature in the assembler object code.)

     {del} Z{<-}A MOREPTRS B;D;F;I;J;K;N;P;Q
[1]    @Extends a pointer list for more keys
[2]    @    B  - character matrix of new key values
[3]    @  1{pick}A  - number of records previously processed
[4]    @  2{pick}A  - character matrix of distinct key values
[5]    @  3{pick}A  - nested vector having an item for each row of 2{pick}A.{+
   +}  Each item
[6]    @         is a vector giving the indices of records having that key
[7]    @    Z  - like A, but updated for the new data
[8]    @
[9]    @ On the first call, A should be 0 (0 0{rho}'') ({iota}0).  On {+
   +}subsequent calls,
[10]   @   A should be the Z from the previous call.
[11]   @
[12]   N D P{<-}A
[13]   I{<-}(D OVER B)MATIOTA B      @ lookup keys
[14]   J{<-}J/{iota}{rho}J{<-}(I>1{take}{rho}D)^(I{iota}I)={iota}{rho}I @ {+
   +}indices of new distinct keys in B
[15]   I{<-}(({iota}1{take}{rho}D),I[J]){iota}I         @ recode to {+
   +}extended D indices
[16]   D{<-}D OVER B[J;]             @ add the new keys
[17]   P{<-}P,({rho}J){rho}{enclose}{iota}0               @ give 'em empty {+
   +}pointer lists
[18]   I{<-}I[Q{<-}{gradeup}I]                  @ order the indices
[19]   F{<-}I{/=}{neg}1{rotate}I & F[{iota}{signum}{rho}F]{<-}1       @ {+
   +}first marks
[20]   K{<-}F/I                      @ indices that occur in this batch
[21]   P[K]{<-}P[K],{each}F #PENCLOSE N+Q @ extend the pointer lists
[22]   Z{<-}(N+1{take}{rho}B) D P             @ arg for next iteration
     {del}

     {del} Z{<-}A MATIOTA B;T
[1]    @Looks up rows of {omega} in matrix {alpha}.  Returns row indices or {+
   +}0 if not found
[2]    @ The args may be character scalars, vectors, or matrices.
[3]    @ The result is a vector if B is a matrix, or a scalar if B is a
[4]    @    vector or scalar.
[5]    @
[6]    {->}(T{<-}''{rho}(#STPTR'Z A B T')#CALL MATIOTA{delta}OBJ){drop}0
[7]    #ERROR(1 2 8 3{iota}T){pick}'RANK ERROR' 'WS FULL' 'WS FULL' 'SYNTAX {+
   +}ERROR' 'DOMAIN ERROR'
[8]    @ Copyright (c) 1988, 1989, 1994, 1995 by Jim Weigang
     {del}

{del}.    MATIOTA{delta}OBJ{<-}1345730611 858915563 11201665 1686700032 {neg{+
+}}949602268 406594628 {neg}2071370240 37412 1552508416 {neg}1989785052 1717{+
+}707852 2049201289 1149814155 {neg}326416604 38749638 40322502 41895366 993{+
+}0182 {neg}972947456 {neg}973058491 {neg}973052347 {neg}973046203 38533 {ne{+
+}g}1956249600 {neg}1989788603 {neg}1989785531 {neg}1989779387 {neg}19897732{+
+}43 36997 {neg}2050529792 168 1837957120 28862024 15224832 1476395008 20253{+
+}57358 1358954495 {neg}855345832 {neg}16258104 11412628 1821048832 18379714{+
+}92 45639240 15224832 1476395008 1354268718 1358954495 {neg}855345832 {neg}{+
+}16258104 11412628 1500643328 941911179 5012864 1837978485 45639264 1522483{+
+}2 1476395008 548962350 1358954495 {neg}855345832 {neg}16258104 11412628 69{+
+}5336960 941911179 6585728 {neg}1070914699 {neg}1965520064 {neg}109032115 {{+
+}neg}2144373760 527696377 1946352000 {neg}352210913 {neg}352145396 {neg}352{+
+}079864 {neg}352014332 16721152 887685120 {neg}1962934269 116086877 {neg}19{+
+}57149301 1166629981 140347652 {neg}1958690773 1699580632 1946220928 331284{+
+}71 {neg}109049996 {neg}351833086 1885178812 1166739179 1952287600 {neg}199{+
+}4898039 1166744669 475892488 209699643 {neg}936653941 {neg}1993847415 1831{+
+}84453 {neg}886323061 {neg}1993847415 {neg}2050610083 148 {neg}2147372030 2{+
+}130797949 {neg}2013219838 38309 {neg}1750759424 {neg}1962934272 {neg}19203{+
+}95187 160 9481613 {neg}1201274880 {neg}397410301 0 {neg}1199559080 {neg}45{+
+}3 74799184 132892877 {neg}1373334273 251658240 {neg}50046 611093503 {neg}1{+
+}732410568 0 41124879 {neg}1114963968 156 {neg}1410088917 294275 39814159 1{+
+}166671872 251739207 115076 1946303488 482654299 1710686209 150996268 15693{+
+}90653 408748804 {neg}2147097609 {neg}2093050438 {neg}225771310 1622734987 {+
+}{neg}150994943 671427685 {neg}1962932862 71666648 {neg}2095553033 {neg}206{+
+}2614278 28 386039 {neg}2093636375 {neg}231014190 23889679 141688832 {neg}2{+
+}028996549 354 721706379 723497417 {neg}523026240 41337659 {neg}1040255518 {+
+}54939016 1044744640 41698758 24987078 25118150 54412683 {neg}2054618043 13{+
+}6 1719168397 1342178232 232 {neg}2144446464 {neg}176456 1968722175 {neg}33{+
+}9161852 613744391 174 {neg}28343793 1821114367 {neg}1114949596 132 8424843{+
+} {neg}1070923776 {neg}1410073528 55854475 1166626885 138750732 {neg}196291{+
+}6471 {neg}1052244923 1166607072 1044744976 50520257 33925 1111853312 {neg}{+
+}402623093 346 {neg}1962745663 33933 18647808 1962933123 1112902408 {neg}19{+
+}95418365 274041626 {neg}1962863479 1160316997 275612416 {neg}1949468412 11{+
+}66634053 1749353236 {neg}1960819319 40069 809863424 10524043 {neg}49066803{+
+}2 {neg}1983773950 1972053061 17098772 {neg}524222464 {neg}2070574334 {neg}{+
+}1962934272 {neg}75300324 {neg}1958710017 48808387 50882039 {neg}125086651 {+
+}{neg}1961593461 {neg}1069929395 544581363 990397835 41360453 1301020299 {n{+
+}eg}1341725912 1974399776 49004811 809864003 132847753 {neg}1958585085 {neg{+
+}}1950618853 1157700677 809861908 608537348 1915766073 9693589 {neg}1946419{+
+}200 2106158205 {neg}1668969708 {neg}1996488704 {neg}1651822523 160 5052102{+
+}5 610109912 {neg}1960032885 1435173981 611617544 1173171028 337935499 9965{+
+}91755 1973875648 {neg}1960792016 992486476 2115773524 67013384 1974399985 {+
+}609499914 {neg}217416420 {neg}1962511186 {neg}349428668 612141848 {neg}195{+
+}9818960 {neg}349428668 1005650705 2115773524 611582724 {neg}348437208 2089{+
+}529982 2080447524 1149441060 {neg}1962659804 993010812 1965302908 {neg}211{+
+}8112364 43716 {neg}768883968 {neg}1054061173 132317929 {neg}774884435 7377{+
+}96802 743279552 {neg}2147229312 326369529 33128620 {neg}524219020 {neg}109{+
+}007864 {neg}1056672766 866912480 1839839938 {neg}487088319 1300943403 {neg{+
+}}1029370042 12829323 941217 142513

{del}:    MATIOTA{delta}OBJ[#IO]{<-}(1345730611 2000042035)[1+1{epsilon}{+
+}#SYSID #SS'Win']

     {del} Z{<-}A OVER B;S
[1]    @Forms a matrix with {alpha} over {omega}
[2]    {->}(S{<-}''{rho}(#STPTR'Z A B')#CALL OVER{delta}OBJ){drop}0
[3]    {->}(S=1)/L1
[4]    #ERROR(2 3 4{iota}S){pick}'RANK ERROR' 'WS FULL' 'SYNTAX ERROR' '{+
   +}DOMAIN ERROR'
[5]    @ Use an APL version for mixed types, Booleans, or nested:
[6]   L1:S{<-}{rho}A{<-}({neg}2{take}1 1,{rho}A){rho}A
[7]    S{<-}0 1{times}S{max}{rho}B{<-}({neg}2{take}1 1,{rho}B){rho}B
[8]    Z{<-}(A{<-}(S{max}{rho}A){take}A){commabar}B{<-}(S{max}{rho}B){take}B
[9]    @ Copyright (c) 1988, 1994 by Jim Weigang
     {del}

{del}.    OVER{delta}OBJ{<-}1345730611 858915563 {neg}1955795837 1684900332 {+
+}843417958 39684454 441289062 34031046 35603910 37176774 411078 1983942 355{+
+}6806 1620070 4557158 407210342 809863526 1212532582 {neg}1201274880 {neg}3{+
+}97410303 0 2021666392 1968722095 {neg}339161852 609550084 7179632 178278 5{+
+}9472 777519104 1351710848 {neg}855345832 {neg}16454712 1919951956 10863357{+
+}80 {neg}1991621239 1166691909 1946172421 1946238010 1946303522 {neg}352145{+
+}370 {neg}352210928 {neg}352079860 {neg}352014328 {neg}351948796 16721152 5{+
+}35363584 {neg}1962934271 1166610501 {neg}1962087602 1166610501 340101962 {{+
+}neg}1924250231 {neg}1201268627 {neg}397410302 0 {neg}1199559080 {neg}217 7{+
+}4799184 82561229 1881429247 1821096818 {neg}1070898140 1380288832 {neg}197{+
+}4057591 3939653 20717172 37488244 {neg}1913976204 {neg}1993849461 21674963{+
+}7 {neg}1993849461 1166758469 1447397676 973358474 {neg}2062607291 {neg}143{+
+} 1010058632 1007514624 1007252482 {neg}385518589 {neg}163 6243782 {neg}206{+
+}2614468 4 543114694 55201163 1435193933 1448426318 1435173757 1532332374 3{+
+}7045702 {neg}1992274551 {neg}1047837611 {neg}2146442505 {neg}215 {neg}2062{+
+}560709 {neg}223 {neg}1925692023 {neg}1201262483 {neg}397410301 0 {neg}1199{+
+}559080 {neg}401 74799184 82561229 1881429247 {neg}17333745 1821114367 {neg{+
+}}1946393564 1972059261 1246071564 {neg}1957798517 401082445 {neg}196293427{+
+}2 1166746741 1448971090 {neg}400536181 6 {neg}997998549 1564197740 4281505{+
+}96 {neg}1962934272 877497025 {neg}930357037 {neg}217912895 {neg}1966525531{+
+} 65110216 938190067 {neg}1956100727 {neg}1020568763 {neg}751546998 1615170{+
+}016 {neg}522992757 1166727307 1616218975 {neg}880077589 {neg}217912895 {ne{+
+}g}1966525531 65110219 {neg}896817933 1308601075 {neg}1008239256 950505 941{+
+}16

{del}:    OVER{delta}OBJ[#IO]{<-}(1345730611 2000042035)[1+1{epsilon}{+
+}#SYSID #SS'Win'] 

Here's an example of how to use MOREPTRS:

{del}.    ST{<-}9 13{rho}'TEXAS        IOWA         CALIFORNIA   MASSACHUSET{+
+}TSARKANSAS     WASHINGTON   OHIO         LOUISIANA    MISSISSIPPI  '

      B1{<-}ST[?10{rho}1{take}{rho}ST;]   @ generate some data
      B2{<-}ST[?10{rho}1{take}{rho}ST;]
      B3{<-}ST[?10{rho}1{take}{rho}ST;]
      B4{<-}ST[?10{rho}1{take}{rho}ST;]

      Z0{<-}0 (0 0{rho}'') ({iota}0)
      Z1{<-}Z0 MOREPTRS B1   @ build the pointers
      Z2{<-}Z1 MOREPTRS B2
      Z3{<-}Z2 MOREPTRS B3
      Z4{<-}Z3 MOREPTRS B4

      P{<-}3{pick}Z4
      Z4[1 2] ((({rho}P),1){rho}P)
 40   OHIO           1 9 11 14 17 22 23 30
      WASHINGTON              2 7 16 24 26
      MISSISSIPPI      3 10 18 21 37 38 40
      LOUISIANA                       4 27
      ARKANSAS                     5 31 36
      TEXAS                     6 13 25 39
      MASSACHUSETTS             8 19 29 34
      CALIFORNIA         12 15 20 28 32 33
      IOWA                              35

      B{<-}B1{commabar}B2{commabar}B3{commabar}B4

      ('OHIO'MATIOTA B)/{iota}1{take}{rho}B   @ check the results
1 9 11 14 17 22 23 30

      ('WASHINGTON'MATIOTA B)/{iota}1{take}{rho}B
2 7 16 24 26

      ('MISSISSIPPI'MATIOTA B)/{iota}1{take}{rho}B
3 10 18 21 37 38 40

      ('IOWA'MATIOTA B)/{iota}1{take}{rho}B
35

You can use the program with numeric keys by replacing OVER with "," and MATIOTA with {iota}, and changing B[J;] to B[J], etc.

Jim


Home Page