4GL Problem for Las Vegas Fall 1990 DECUS Symposium Table of Contents 1. Introduction ...................................... 1 2. Distribution of data .............................. 1 3. Description of data ............................... 2 3.1 MASTER Record .................................... 2 3.2 DETAIL Record .................................... 3 3.2.1 DETAIL_BINARY Record ........................... 4 3.2.2 Distribution of DETAIL Records ................. 4 3.3 FUNDCODE Table ................................... 6 4. Description of Problem ............................ 7 5. How data was generated ............................ 10 Instructions for Las Vegas (Fall 1990) 4GL Problem by Joe H. Gallagher, Ph. D., Managing Editor Wombat Examiner & 4GL Dispatch 816-276-4068 days 913-894-9550 evenings 1. Introduction The 4GL Problem Panel at the New Orleans 1990 Spring DECUS Symposium was a very well received presentation. A 4GL Problem Panel is scheduled for the Las Vegas 1990 Fall DECUS Symposium. The problem for Las Vegas builds on and extends the problem from New Orleans. The problem for New Orleans was to browse, both forward and backwards, through a set of "people oriented" data. The problem for Las Vegas extends the browse to include both a MASTER set and a DETAIL set of data. The data is modelled on a MASTER set of donors with the DETAIL records of the amount and date of their donations. In addition to browsing the data base, certain processing (ADD, MODIFY, and DELETE) is required for the solution of the problem. One optional report is also part of of the solution. 2. Distribution of data Data for the Las Vegas (Fall 1990) 4GL Problem is distributed on one 1600 bpi magtape in VMS Backup format or on two 5.25 inch high density (1.2 Meg) IBM diskettes. Other formats may be available by calling me at one of the numbers above. The VMS Backup save set name is VEGASPROB.BCK; the tape was created with the command: BACKUP/VERIFY/LOG *.* MSA0:VEGASPROB.BCK/SAVE_SET The files in this distribution are: Filename Description AAREADME.TXT This file containing the complete description of the data and the problem. This file was created using MASS11 (so it is actually in stream format). MASTER.DTR The DATATRIEVE record definition of the master file. MASTER.DAT The master record data; 5518 records, each record is 75 bytes long, in 820 blocks. DETAIL.DTR The DATATRIEVE record definition of the DETAIL file. - 1 - DETAIL_BINARY.DTR The DATATRIEVE record definition of the DETAIL with DATE_OF_DONATION in the usual 64-bit binary format. This file is not on the IBM Diskette distribution. DETAIL.DAT The detail record data - 36027 records, each record is 21 bytes long, in 1549 blocks. (In the IBM diskette distribution, this file is on the second diskette.) DETAIL_BINARY.DAT The detail record data - 36027 records, each record is 23 bytes long, in 1689 blocks. This file is not on the IBM Diskette distribution. FUNDCODE.DTR The record definition of the FUNDCODE table. FUNDCODE.DAT A table which maps the FUND-CODE (a one digit number 0 to 9) to the abbreviation of the name of the FUND and the full name of the FUND. Ten records, each record is 56 bytes long, in 2 blocks. The order of the files on a particular media may be different from the order specified above. 3. Description of data The data consists of three files: the MASTER record, the DETAIL record, and the FUNDCODE translation table. 3.1 MASTER Record There are 5518 records in the MASTER file; each record is 75 bytes long. The MASTER record contains only ASCII text; there is no binary data in the file. The data is, therefore, printable and transferable with KERMIT. The record layout is: 01 MASTER_REC. 03 ID_NUMBER PIC 9(7). 03 LAST_NAME PIC X(15). ! may contain " ", "'", or "-" 03 FIRST_NAME PIC X(10). 03 MIDDLE_INITIAL PIC X(1). 03 TITLE_OF_ADDRESS PIC X(4). 03 SENIORITY PIC X(4). 03 ADDR PIC X(15). 03 CITY PIC X(12). 03 STATE PIC X(2). 03 ZIP_CODE PIC 9(5). ; The MASTER file is in the order of the ID_NUMBER. The record layout for the MASTER file is contained in the file MASTER.DTR. The values in ID_NUMBER are random but can span the full range of 0000000 to 9999999. That is, there are 5518 unique numbers in the 7 digit range. - 2 - The LAST_NAME field may contain a space, single quote, or dash. The data in the LAST_NAME field has been revised from the New Orleans data. The distribution of names more nearly approximates the distribution of names in the United States. In this distribution, the names WILLIAMS, SMITH, JOHNSON, JONES, BROWN, MILLER, DAVIS, and WHITE have the highest frequency of occurrence. Names of Asian and Scandinavian origin are probably under represented. Names of English, Irish, Scotch, German, Italian, and eastern European origin are well represented. Names of American Blacks and Hispanic are also proportionally represented. Most other ethnic groups are represented by a few names. Other fields (TITLE_OF_ADDRESS, SENIORITY, ADDR, CITY, STATE, and ZIP_CODE) are unchanged from the New Orleans data. In a certain number of randomly selected records, two fields (FIRST_NAME, MIDDLE_INITIAL) have been slightly changed. In some records the middle initial has been set to a blank. In other records, the middle initial has been move into the first part of the first name and the middle initial has been set to a blank. (For example: if the first name were "JOHN" and the middle initial were "L", the new first name is not "L. JOHN" and the middle initial is " "). 3.2 DETAIL Record There are 36027 records in the DETAIL file; each record is 21 bytes long. The DETAIL record contains only ASCII numeric data; there is no binary data in the file. The data is, therefore, printable and transferable with KERMIT. The record layout is: 01 DETAIL_REC. 03 ID_NUMBER PIC 9(7). 03 DATE_OF_DONATION PIC 9(6). ! in MMDDYY order 03 DATE REDEFINES DATE_OF_DONATION. 05 MONTH PIC 9(2). 05 DAY PIC 9(2). 05 YEAR PIC 9(2). 03 FUND_CODE PIC 9(1). 03 AMOUNT_OF_DONATION PIC 9(5)V99 . ; The data in the DETAIL record is SORTED by date of the donation (ascending date order) and then by ID_NUMBER if there are more than one donation on a given date. The ID_NUMBER span can span the range of 0000000 to 9999999. However, there are no orphan records. Every DETAIL record has a master record. However, there are many (1199 to be exact) MASTER records without a DETAIL record. The DATE_OF_DONATION is in MMDDYY format. The dates are in 1987, 1988, and 1989. - 3 - The FUND_CODE is a one digit number between 0 and 9. The ten different digits are approximately evenly distributed. One digit was used in order to same space. The FUND_CODE may be translated into the name of the fund using the FUND_CODE translation table. The format of the AMOUNT_OF_DONATION is 5 digits to the left of the decimal and 2 digits to the right of the decimal. However, the decimal point is implied and does not use up a byte in the file. For example, if the data in AMOUNT_OF_DONATION is 0012345, the amount of the donation is $123.45. The DETAIL file contains only digits; there are no alpha characters in the file. 3.2.1 DETAIL_BINARY Record On the VMS Backup version of the distribution, there is an alternate version of the data. In this version, the DETAIL_BINARY version, the DATE_OF_DONATION is expressed in the usual 64-bit binary format. There are 36027 records in the DETAIL_BINARY file; each record is 23 bytes long. The DETAIL_BINARY record contains binary data; therefore, this file is not printable and nor transferable with KERMIT (with out using 8-bit binary mode). The record layout is: 01 DETAIL_BINARY_REC. 03 ID_NUMBER PIC 9(7). 03 DATE_OF_DONATION USAGE IS DATE. 03 FUND_CODE PIC 9(1). 03 AMOUNT_OF_DONATION PIC 9(5)V99 . ; 3.2.2 Distribution of DETAIL Records The distribution of DETAIL records per MASTER records is given by the following table: - 4 - Number Number Total of of number detail master of records records detail per records master 0 1199 0 1 996 996 2 364 728 3 487 1461 4 254 1016 5 187 935 6 96 576 7 73 511 8 229 1832 9 174 1566 10 135 1350 11 288 3168 12 418 5016 13 9 117 14 18 252 15 23 345 16 42 672 17 71 1207 18 97 1746 19 0 0 20 0 0 21 1 21 22 1 22 23 2 46 24 6 144 25 0 0 26 0 0 27 1 27 28 0 0 29 0 0 30 0 0 31 1 31 32 4 128 33 10 330 34 35 1190 35 98 3430 36 199 7164 Totals 5518 36027 - 5 - 3.3 FUNDCODE Table In order to save space in the DETAIL records, the fund to which a donation is given is indicated by a single digit in the range 0 to 9. This FUND-CODE may be translated into the FUND-ABBREVIATION or the FUND-NAME from date from the FUNDCODE table. The data in the table is: 01 FUND_REC. 03 FUND_CODE PIC 9. 03 FUND_ABBREVIATION PIC X(5). 03 FUND_NAME PIC X(50). ; FUND FUND FUND CODE ABBREVIATION NAME 0 UNWAY United Way 1 UNCA United Negro College Appeal 2 UNICF UN International Children's Educational Fund 3 AREDC American Red Cross 4 SALVA Salvation Army 5 INFAM International Famine Relief 6 UJA United Jewish Appeal 7 AMCHC American Catholic Charities 8 AEART American Endowment for the Arts 9 GMFND The German-Marshall Fund of the United States The data is contained in the file FUNDCODE.DAT. There are ten records of 56 characters each. - 6 - 4. Description of Problem ----------------------------------------------------------------- The 4GL Problem for the Las Vegas 1990 Fall DECUS Symposium Joe H. Gallagher, 4GL Solutions, Overland Park, KS ----------------------------------------------------------------- The 4GL Problem Panel at the New Orleans 1990 Spring DECUS Symposium was a very well received presentation. A 4GL Problem Panel is scheduled for the Las Vegas 1990 Fall DECUS Symposium. The problem for Las Vegas builds on and extends the problem from New Orleans. The problem for New Orleans was to browse, both forward and backwards, through a set of "people oriented" data. The problem for Las Vegas extends the browse to include DETAIL records in addition to the MASTER record. The data is modelled on a MASTER set of donors with the DETAIL records of the amount and date of their donations. In addition to browsing the data base, certain processing (ADD, MODIFY, and DELETE) is required for the solution of the problem. One report is an optional part of the solution. The DATA: The MASTER record consists of ID_NUMBER (7 digit number), LAST_NAME, FIRST_NAME, MIDDLE_INITIAL (may be missing), TITLE_OF_ADDRESS (MR, MS, MRS, MISS, etc), SENIORITY (JR, SR, II, III, etc), ADDRESS, CITY, STATE_CODE, and ZIP_CODE (5 digits). The number of DETAIL records and the total AMOUNT_OF_DONATIONs for a particular donor are to be managed as part of the MASTER record. There are 5518 MASTER records. The DETAIL records consists of ID_NUMBER, DATE_OF_DONATION (dates are in 1987, 1988, and 1989), FUND_CODE (1 digit number between 0 and 9), and the AMOUNT_OF_DONATION. The largest single donation is $99,999.999. There are 36027 DETAIL records. A FUNDCODE table is provided which may be used to translate the FUND_CODE to the FUND_ABBREVIATION (5 characters) or the FUND_NAME (50 characters). There are 10 FUNDCODE records. A complete and detailed description of the MASTER and DETAIL records was published in September 1990 issue of the Newsletter, Volume 6, Number 1, Section DTR, pages 6-10. The Problem: Part 1. The Browse Phase The user may enter the ID_NUMBER, the LAST_NAME, part of the LAST_NAME, or the LAST_NAME to be used in a SOUNDEX search. None, one, or many records are selected. If no records are found, the user is notified and given an opportunity to make another selection. If one record is found, control is passed to the second part of the problem. If many records are found, the user may browse forwards and backwards through the records to choose one or none. These records must be "browsed" in sort - 7 - order by LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, ID_NUMBER. If none is chosen, the user may try again. If one is chosen, control passes to the second part of the problem. During the browse, the display shall include all information in the MASTER record (including the number of DETAIL records and the total of all donations) and all DETAIL records or whatever subset of them will fit on the screen. The DETAIL records (or the subset that will fix on the screen) must be display by decreasing date. That is, the most recent donation (if any) must be displayed first. The abbreviation of the fund name should be display. The fund code may also be displayed. At no time should the contents of DETAIL records and MASTER records be display on the screen which do not belong together. Part 2. The Processing Phase Once a MASTER record has been chosen by the browse phase, the user may then ADD a DETAIL record, MODIFY a DETAIL record, DELETE a DETAIL record, or do no further processing of the record hierarchy. The user may also delete the MASTER record and all of its DETAIL records. No "orphan" DETAIL records are allowed. At the conclusion of an ADD, MODIFY, or DELETE, the count of DETAIL records and the total amount of all donations must be UPDATED and the record hierarchy displayed for further processing (if any). When no further processing is required on the chosen record hierarchy, control is return to the browse phase. Note that the problem DOES NOT require that MASTER records be added or modified. The entry of the FUND_CODE or the FUND_ABBREVIATION must be validated and help should be provided on possible valid entries. Part 3. Optional Reporting Phase An optional part of the problem is to make a report of the top 100 donors. This part of the application does not have to integrate into the browse and processing phases; in fact, it may be a totally separate application. The top 100 donors are to be reported (printed on a line printer, 80 column wide) by decreasing total amount of donations. That is, the highest donor must appear first in the report. The donors complete name should printed. The punctuation of the name is critical. Examples of the format of two detail lines in the report are: MR. JOHN E. SMITH, JR. $67,890.00 MISS MARY JONES $998.23 Parts of the Problem Which are not Constrained: Because there are too many details in a complete application to describe in a fixed amount of Symposia session time, there are many details which are not specified as part of the problem. The problem solver may choose to design their application as best meets their 4GL in the following areas: - 8 - 1. Loading the data is not part of the problem. Solvers may load the data any way they choose and they do not need to report how they loaded their data since it is assumed that the application would have been developed in their 4GL and the data would have been entered one record at a time. 2. The count of DETAIL records and the total of donations for each donor may be either a virtual or a real field in the MASTER record. The solver must report how they have chosen to manage these two fields. If these fields are computed and actually stored in the MASTER record, then database constraints must apply. 3. The date of donations may be stored as MMDDYY or as a 64-bit binary VMS date data type. This choice should be reported. 4. The fund code may be retained or the DETAIL records may be converted to the abbreviation of the name of the fund. However, the abbreviation of the name of the fund must be displayed during the browse phase. 5. Adding and modifying MASTER records is NOT part of the problem. Solvers may develop the application so that this is possible, but solvers should not report if their code has this capability. 6. Solvers may choose which fields are keys; there are no restrictions on the choice. However, the solver should report which fields they have keyed. 7. Solvers may choose to have the user enter either the fund code or the abbreviation of the fund name when entering or modifying DETAIL records. Either may be entered, but entries must be validated. Reporting on Development of the Application: As part of the symposia session, the problem solvers must be prepared to report the following: 1. the experience of the developer with 3GL's and their 4GL 2. what background knowledge of the user has been assumed 3. what training is, or would be required, for the user to operate the browse and processing phases of the application 4. what software is required to run the application (all licenses products) 5. what part or parts of the application is done in a 3GL (for example, the SOUNDEX algorithm could be done in MACRO) 6. what part or parts of the application could not be done in their 4GL 7. how long it took to develop each part of the application 8. how long does it take to find and display a single record which has 36 detail records for a system with a specific CPU and disk. - 9 - 5. How data was generated ----------------------------------------------------------------- Developing Applications in DATATRIEVE: Generating Test Data Joe H. Gallagher, 4GL Solutions, Overland Park, KS ----------------------------------------------------------------- Most of our efforts in developing applications are spent in thinking about and analyzing the problem or writing, testing, and debugging procedures, reports, FMS or TDMS screens, tables, domain definitions, and record definitions. When an application becomes more complex (the number of concurrently accessed domains), when the number of simultaneous users increases, and when remote data is accessed (data accessed across a network), the ability to accurately simulate the "real-world" with trivial data is no longer possible. In such situations it is necessary to create data to test our application design and to evaluate the performance of our application. The best test data for an application, of course, is the "real" data. But there are many instances when the "real" data has not yet been acquired or may not yet be available to the application developer. It is these instances were one needs to generate test data that will be address by this article. The examples used in this article are taken from the command files and procedures which were used to generated the data for the Las Vegas 1990 Fall Symposium 4GL Problem. A complete description of the Las Vegas 4GL Problem will appear in a future issue of this newsletter. Types of Data With respect to generation of data, there are two broad categories. The first is data that because of its non-mathematical nature or because there are a discrete number of occurrence among a very, very large number of possibilities can only be generated from lists of existing data. The second is data that is numeric, either continuous or discrete, and which can be generated by mathematical methods. List Data The best example of first kind of data, list data, is first and last names. Mathematically there a billions more possible names than actually occur in a given language (English, for example); so there is no effective way to generate real names (as opposed to nonsense names) with a random number generator. The only way to get realistic names is to take names from a pre-existing list. One source of possible names is the telephone book. John Babiarz used his local telephone book to generate names for the 1990 Spring New Orleans 4GL Problem. He used a little over 1100 unique names and used each one 4 times with four different first names. However, he did not make a complete or random sample of the names in the phone book. The frequency of occurrence of names was - 10 - peculiar, if not bazaar, as there were more that 400 names beginning with "Z" in a set of 5518 names. I chose another approach. Since I work in a fairly large hospital and have some good contacts in the Data Processing Department, I was able to secure the list of last names of the most recent 32000 hospital admissions. This list of names has some rather interesting properties. First there were about 20 names in the list whose length was more that 15 characters. All but three of the names were due to hyphenated names (the concatenation of two last names). All three of the "naturally" long names were of Germanic origin. They were 16, 17, and 19 characters long (and the 19 character one was a real mouth full). Because the master file in the problem set had already specified that the LAST-NAME field was PIC X(15), I discarded all last names longer than 15 characters. Names on a hospital admitting list do have one peculiar property. If a name appears on the list, then there is too high a probability that the next name on the list will be the same. The occurs when several members of the same family are involved in a motor vehicle accident and they are all admitted through the emergency room at the same time. Because of this peculiarity, I did remove the duplicates if they occurred together. The set of names did have a good representation of English, Irish, Scotch, German, American Black, and Hispanic names with a few Scandinavian, Asian, Middle Eastern, and Indian names. However, because the 4GL Problem was of project of the 4GL SIG, I check the names against the list of members of the SIG Steering Committee and if a steering committee member's name was not present I added it. Thus, the list I used came from three sources: more than 1100 unique names from John Babiarz's telephone book, about 20 names from the SIG Steering Committee list who weren't already on the list, and more than 4000 more from the patient admissions' list at Research Medical Center to make a list of 5518 names. While the list had some of the properties of a typical list of American names (Smith, Jones, and Brown occurred 46, 37, and 35 times), there were still 67.4% (3718 of 5518) which were unique. In a typical list there are often only 33% to 40% unique names. Now with a list of 5518 last names in a domain NEWNAMES, I needed to load these names into the MASTER domain. If the MASTER domain did not already have values in the other fields, I could just STORE the records. But to move the first name from NEWNAMES to the first last-name in MASTER, and the second to the second and so on, there is a bit of a trick! The MASTER record is - 11 - 01 MASTER_REC. 03 ID_NUMBER PIC 9(7). 03 LAST_NAME PIC X(15). ! may contain " ", "'", or "-" 03 FIRST_NAME PIC X(10). 03 MIDDLE_INITIAL PIC X(1). 03 TITLE_OF_ADDRESS PIC X(4). 03 SENIORITY PIC X(4). 03 ADDR PIC X(15). 03 CITY PIC X(12). 03 STATE PIC X(2). 03 ZIP_CODE PIC 9(5). ; and the NEWNAMES record is 01 NEWNAMES_REC. 03 NAMES PIC X(15). ; One would like to be able to have DATATRIEVE do something like ready MASTER modify ready NEWNAMES read for a in NEWNAMES for next 1 b in MASTER modify using b.last-name = a.names but, of course, the "FOR NEXT 1" is not legal syntax in DTR. Richard Copeland suggested something like ready MASTER write ready NEWNAMES read find MASTER declare size_of_master usage is integer. size_of_master = count of current declare x usage is integer. for a in NEWNAMES begin x = running count modify b in MASTER with FN$MOD(running count,size_of_master) = x using last-name = names end which would work, but its performance would be poor. The solutions I used was: ready MASTER modify ready NEWNAMES read find c in MASTER for NEWNAMES begin select next c modify using LAST-NAME = NAMES end - 12 - which apparently violates the documented restriction that a SELECT may not appear within a BEGIN-END block. However, in this case it works just fine and gives the desired result. Mathematically Generated Data The second type of data can be, but is not always, generated by some mathematical function or algorithm. If the data is continuous or almost continuous, a function with or without the use of a random number generator is often used. But if the data is discrete, many times it is easier to develop a table or a frequency of occurrence for the values. Discrete Values For the Las Vegas 4GL Problem, we needed DETAIL records which would be associated with the MASTER records. The frequency of occurrence of the DETAIL records with respect to the MASTER records is given by the values in the domain DISTRIBUTION. I actually used a spread sheet to fiddle with the values so that the records would come out to exactly 5518. The three fields in the DISTRIBUTION domain are N, the number of DETAIL records per MASTER record, M, the number of MASTER records having this number of DETAIL records, and Z, total number of detail records (the product of N and M). - 13 - N M Z 0 1199 0 1 996 996 2 364 728 3 487 1461 4 254 1016 5 187 935 6 96 576 7 73 511 8 229 1832 9 174 1566 10 135 1350 11 288 3168 12 418 5016 13 9 117 14 18 252 15 23 345 16 42 672 17 71 1207 18 97 1746 19 0 0 20 0 0 21 1 21 22 1 22 23 2 46 24 6 144 25 0 0 26 0 0 27 1 27 28 0 0 29 0 0 30 0 0 31 1 31 32 4 128 33 10 330 34 35 1190 35 98 3430 36 199 7164 Totals 5518 36027 To create DETAIL records with the distribution gives by the values in DISTRIBUTION, the following skeleton set of commands and statements can be used: - 14 - ready DISTRIBUTION read ready MASTER modify define file for DETAIL; ready DETAIL write find a in MASTER for DISTRIBUTION begin repeat m begin select next a repeat n begin store DETAIL using begin id-number = a.id-number . . end end end end and again the trick of using SELECT NEXT A in a collection is used to step through the MASTER domain. Continuous Uniform Random Numbers A random number generator makes numbers uniformly, but pseudo-randomly, distributed on the interval 0 <= x < 1. There are many kinds of random number generators and many kinds of tests for them as well. For information on random number generators see "Encyclopedia of Computer Science", Anthony Ralsont, Editor, Petrocelli/Charter, New York, 1976, pp. 1192-7, or "The Art of Computer Programming" Volume 2 (Seminumerical Algorithms). Addison Wesley, Reading, MA, 1969. The VAX-FORTRAN random number generator has properties better than the additive-multiplicative generator and a shift generator since it is apparently has properties of both types. A self-seeding routine to call the FORTRAN random number generator was described in Volume 4, Number 1, Section DTR, pages 11-12 of the newsletter. The FORTRAN code is: ! ! a self-initializing, FORTRAN interface for ! DATATRIEVE to the VAX random number generator. ! real*4 function fn_random() real*4 mth$random integer*4 quad(2),seed equivalence (quad,seed) if (seed .eq. 0) then call sys$gettim(quad) end if fn_random = mth$random(seed) return end The linkage to attach this function to DATATRIEVE is - 15 - ; ; FN$RANDOM- a pseudo-random number generator ; ; output is an F-floating pseudo-random number 0<= x < 1 ; $DTR$FUN_DEF FN$RANDOM, FN_RANDOM, 0 $DTR$FUN_OUT_ARG TYPE = FUN$K_VALUE, DTYPE = DSC$K_DTYPE_F $DTR$FUN_NOOPTIMIZE $DTR$FUN_END_DEF One can then use the random number generator to create number distributed between LOW_VALUE and HIGH_VALUE by the expression LOW_VALUE + fn$random * (HIGH_VALUE - LOW_VALUE) However if one needs numbers exponentially distributed between 10 and 100,000, once could use the expression FN$EXP(FN$LN(10.0)*(1.0 + 4.0*FN$RANDOM)) Discrete Random Numbers To make integers (discrete numbers) rather than real (continuous) number, judicious use of FN$FLOOR or FN$NINT will convert the continues variable to a discrete one. However, if you need unique integer, you have to keep track of which integers have already been used. For that, you need a domain and a table like DEFINE DOMAIN INDEX USING INDEX_RECORD ON INDEX.DAT; DEFINE RECORD INDEX_RECORD USING 01 INDEX_REC. 03 ID PIC 9(7). ; DEFINE TABLE INDEX_TABLE FROM DOMAIN INDEX ID : ID END_TABLE and code like ready INDEX write declare temp-id pic 9(7). . . . temp-id = 9999999 * fn$random while (temp-id in index-table) begin if temp-id eq 9999999 then begin temp-id = 0 end temp-id = temp-id + 1 end store INDEX using id = temp-id . . . to keep track of which numbers have already been used. - 16 - Normally Distributed (Gaussian) Numbers Normally distributed numbers can be created from uniformly distributed numbers by inverting the Gaussian function. Unfortunately, there is no exact closed-form expression for this inverse distribution. There is, however, a good approximation to the needed function. The FORTRAN routine FN_ZPROB gives such a result. ! A real function to invert the probability function. ! See Handbook of Mathematical Functions with Formulas, Graphs, ! and Mathematical Tables, Edited by Milton Abramowitz and ! Irene A. Stegun, National Bureau of Standards, Applied ! Mathematics Series, 55, page 933. ! ! input 0 =< rin =< 1.0 ! output -9.262 =< rout =< +9.262 ! ! If the input is uniformly distributed between 0 and 1, ! the output is normally distributed about zero with a ! standard deviation of 1 (except for the very ends of ! the distribution). ! ! Because of the singularity at rin = 0 and rin = 1 ! this function sets ! ! 0 <= rin <= 1.0E-20 rout = -9.262 ! 1 - 1.0E-20 <= rin <= 1.0 rout = +9.262 ! ! Thus, the tails of the distribution are "chopped" off ! at the very, very ends. ! real function fn_zprob(rin) real*4 rin, rout, p, t logical side ! 0 to 0.5 -> TRUE; 0.5 to 1 -> FALSE side = .TRUE. p = rin if ( p .gt. 0.5 ) then side = .FALSE. ! data is 0.5 to 1 p = 1.0 - p ! invert the data end if if ( p .lt. 1.0E-20) then ! is data too close to 0? p = 1.0E-20 ! yes, set to a small value end if t = sqrt( 2.0 * alog ( 1.0/p)) rout = t - (2.515517 + t * (0.802853 + t * 0.010328))/ 1 (1.0 + t * (1.432788 + t * (0.189269 + t*0.001308))) ! ! Size of error in this formula is ! | error(p) | < 0.0004.5 ! However, near input of 0.5, output of formula can go ! slightly negative. ! May sure value does not go negative, here. ! - 17 - if (rout .lt. 0.0) then rout = 0.0 end if if (side) then ! which side are we on? rout = -1.0 * rout ! 0 to 0.5 maps to -9.262 to 0.0 end if fn_zprob = rout ! return value return end The linkage to DATATRIEVE is ; ; FN$ZPROB- a normally distributed function ; ; input is an F-floating number 0<= z <= 1 ; output is an F-floating number normally distributed ; $DTR$FUN_DEF FN$ZPROB, FN_ZPROB, 1 $DTR$FUN_OUT_ARG TYPE = FUN$K_VALUE, DTYPE = DSC$K_DTYPE_F $DTR$FUN_IN_ARG TYPE = FUN$K_REF, DTYPE = DSC$K_DTYPE_F, ORDER = 1 $DTR$FUN_NOOPTIMIZE $DTR$FUN_END_DEF The function combination FN$ZPROB(FN$RANDOM) can be used to normally and randomly distribute values with a mean, MEAN, and a standard deviation, STD_DEV, with the expression MEAN + STD_DEV * FN$ZPROB(FN$RANDOM) Generation of Test Data for the DETAIL Record All of the techniques described above are used to generate records for the DETAIL domain. The record definition is: 01 DETAIL_REC. 03 ID_NUMBER PIC 9(7). 03 DATE_OF_DONATION USAGE IS DATE. 03 FUND_CODE PIC 9(1). 03 AMOUNT_OF_DONATION PIC 9(5)V99 . ; The variable ID_NUMBER is uniformly distributed over the range 0 to 9999999, but each value is unique. The DATE_OF_DONATION is normally distributed about a target date which is one nth (where n is the number of donations) of the time between the beginning and ending donation dates. These start and stop dates depend on the number of donations given. The start date is determine by a complex CHOICE OF expression which depends on the number of donations. The standard deviation about these target dates is arithmetically distributed between one half and one fifth of the mean time between donations. The donations are in 1987, 1988, and 1989. Donation dates are also restricted to be week days; that is, Saturdays and Sundays are not allowed. The FUND_CODE is - 18 - uniformly distributed between 0 and 9. The AMOUNT_OF_DONATION is exponentially distributed between $10.00 and $1,000.00 or $10.00 and $100,000.00 depending on the number of donations. Ninety- eight percent of the donations are truncated to 3 significant figures; 2% of the donations have all digits. The DATATRIEVE code to generate the data in DETAIL is: ! ! a command file to create new id numbers and to ! create DETAIL records for the Las Vegas 4GL Problem ! declare start-date usage is date. declare stop-date usage is date. declare start-quad usage is quad. declare stop-quad usage is quad. declare delta usage is quad. declare std usage is quad. declare index-quad usage is quad. declare temp-quad usage is quad. declare temp-date usage is date. declare ln_of_10 usage is real. declare trunk usage is integer. ln_of_10 = fn$ln(10.0) ! declare dollar-base usage is real. declare dollar-range usage is real. declare t usage is real. declare amount pic 9(6)v99 . declare tamount pic 9(6)v99 . declare fund pic 9. declare tfund pic 9. ! stop-date = "12/31/89" stop-quad = stop-date ! ready DISTRIBUTION read ready MASTER modify define file for DETAIL; ready DETAIL write define file for INDEX key=id; ready INDEX write declare temp-id pic 9(7). find a in MASTER ! for DISTRIBUTION begin repeat m begin temp-id = 9999999 * fn$random while temp-id in INDEX-table begin if temp-id eq 9999999 then begin temp-id = 0 end temp-id = temp-id + 1 end - 19 - store INDEX using id = temp-id select next a modify using id-number = temp-id if (n gt 0) begin start_date = choice of n bt 1 and 6 then "1/1/87" n bt 7 and 12 then "1/1/89" n bt 13 and 18 then "7/1/88" n bt 19 and 24 then "1/1/88" else "1/1/87" end_choice start-quad = start-date delta = (stop-quad - start-quad)/n index-quad = start-quad + delta/2 std = delta/(2.0 + 3.0*fn$random) if ( n bt 1 and 6) then begin dollar-base = 1.0 dollar-range = 4.0 end else begin dollar-base = 1.0 dollar-range = 2.0 end t = fn$random t = t * t tamount=fn$exp((dollar-base+t*dollar-range)* ln_of_10) tfund = fn$floor(10.0 * fn$random) repeat n begin amount = tamount if (fn$random le 0.05) then begin t = fn$random amount = fn$exp( (dollar-base + t * dollar-range) * ln_of_10) end if (fn$random lt 0.98) then begin trunk = fn$floor(amount/100.0) if (trunk lt 1) then trunk = 1 amount = trunk * fn$floor(amount/trunk) end fund = tfund if (fn$random le 0.15) then begin fund = fn$floor(10.0 * fn$random) end temp-quad = index-quad + std * fn$zprob(fn$random) temp-date = temp-quad if (temp-date before "1/1/87") then begin temp-date = "1/2/87" end if (temp-date after "12/29/89") then begin temp-date = "12/29/89" end - 20 - temp-date = fn$date(fn$upcase(format temp-date using dd-mmm-yyyy |" 00:00:00.00")) while (format temp-date using w(3) eq "Sat","Sun") begin temp-date = temp-date + 1 end store DETAIL using begin id-number = temp-id date-of-donation = temp-date fund-code = fund amount-of-donation = amount end index-quad = index-quad + delta end end end end Perhaps the techniques in this article will give you the knowledge and tools to generate data to test your application. - 21 -