A "descriptor" for a data type (int, float, double, etc.) determines the range and kind of values, the size of storage, and the allowable operations (elements of C). All languages have a set of primitive data types, i.e., types that are built in and not defined in terms of other data types. The primitive data types of most languages include integer, float, character, and array. Some primitive data types are merely reflections of the hardware; others require some non-hardware support. The size of a data type is platform dependent in C/C++ but not in Java; see C's /usr/include/limits.h.
Floating-point formats:

  Single-precision = 32 bits (bits numbered 31 down to 0)
    sign     exponent     fraction
    1 bit    8 bits       23 bits

  Double-precision = 64 bits
    sign     exponent     fraction
    1 bit    11 bits      52 bits

Rounding problem: converting a decimal floating-point value to binary is often inexact, because many decimal fractions (0.1, for example) have no finite binary representation.
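The rounding problem can be observed directly in C. The sketch below is illustrative only: the helper names `tenth_plus_fifth` and `nearly_equal` and the tolerance of 4 * DBL_EPSILON are assumptions made for this example, not part of any standard API. It assumes the usual IEEE 754 double-precision format shown above.

```c
#include <float.h>
#include <stdbool.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double x)
{
    return x < 0 ? -x : x;
}

/* True when a and b differ by at most a small relative tolerance.
   The factor 4 * DBL_EPSILON is an illustrative choice, not a standard. */
bool nearly_equal(double a, double b)
{
    double scale = abs_d(a) > abs_d(b) ? abs_d(a) : abs_d(b);
    return abs_d(a - b) <= 4.0 * DBL_EPSILON * scale;
}

/* 0.1 + 0.2 computed at run time; volatile keeps the compiler from
   folding the addition away at compile time. */
double tenth_plus_fifth(void)
{
    volatile double a = 0.1, b = 0.2;
    return a + b;
}
```

Comparing `tenth_plus_fifth()` to 0.3 with `==` fails on IEEE 754 hardware, while `nearly_equal` accepts the two values; comparing with a tolerance is the usual way to live with the rounding.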
o Character String Type Operations: assignment and copying, comparison (=, >, etc.), catenation, substring reference, and pattern matching (see Perl regexes).
Some regex metacharacters:
  ?     0 or 1 time
  *     0 or more times
  +     1 or more times
  []    marks a character class to match a single character
  ^     negation inside []; anchor at beginning of line otherwise
  $     marks end of line
  ()    substring delimiters (must be escaped in vim)
  $1    refers to first matched substring (this is \1 in vim)
  {}    a{n} means match 'a' exactly "n" times

Some of Perl's abbreviations:
  \d    is a digit and represents [0-9]
  \s    is a whitespace character and represents [\ \t\r\n\f]
  \w    is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
  \D    is a negated \d; it represents any character but a digit [^0-9]
  \S    is a negated \s; it represents any non-whitespace character [^\s]
  \W    is a negated \w; it represents any non-word character [^\w]
  The period '.' matches any character but "\n"

Ex.
  $x = 'the cat in the hat';
  $x =~ /^(.*)(at)(.*)$/;    # $1 = 'the cat in the h'; $2 = 'at'; $3 = '' (0 matches)

What do these regexes match?
  /^[A-Za-z][A-Za-z\d]+/
  /[yY][eE][sS]/;
  /[0-9a-fA-F]/;
  /[^0-9]{4}/;

Substring replacement (substitution) syntax:
  s/regex/replacement/modifiers
  ex. s/^'(.*)'$/$1/;    # strip single quotes that occur at start and end of line

Try out a few things with this file in vim.
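C has no built-in regexes, but the practice patterns can be tried from C through the POSIX <regex.h> interface (an assumption: this header is part of POSIX, not ISO C, so it is available on UNIX-like systems only). POSIX extended regexes do not define Perl's \d, so it is written out as 0-9 here. The helper name `regex_matches` is hypothetical.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when `text` contains a match for the POSIX extended
   regular expression `pattern`; returns false when there is no match
   or the pattern fails to compile. */
bool regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, `regex_matches("[^0-9]{4}", "12abcd")` succeeds because "abcd" is a run of four non-digits, while the same pattern fails on "ab1cd2", where no run of non-digits is longer than two.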
Primitive in SNOBOL4 (a string manipulation language); many operations and elaborate pattern matching.
Primitive in Java as a String class (don't need to include anything extra).
String length options (only Ada supports all three):
1. Static: COBOL, Java's String class
2. Limited Dynamic Length: C/C++; \0 marks end of string in fixed array (example)
3. Dynamic (no maximum): SNOBOL4, Perl, JavaScript
As a primitive type strings aid writability. But C was a systems language and did not see strings as a priority. Inexpensive as a primitive type with static length. Dynamic length is nice, but is it worth the expense?
String implementation options:
1. Static length: need compile-time descriptor for length and address
2. Limited dynamic length: may need a run-time descriptor for length (not in C)
3. Dynamic length: need run-time descriptor for address, max length and current length; allocation/de-allocation is the biggest implementation problem
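The third option can be sketched in C. This is a minimal illustration, not how any particular language implements its strings: the type `DynString`, its functions, and the doubling growth policy starting at 8 characters are all assumptions made for the example. The struct holds exactly the three descriptor fields listed above.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical run-time descriptor for a dynamic-length string. */
typedef struct {
    char  *addr;     /* address of the character storage (on the heap) */
    size_t max_len;  /* capacity currently allocated, excluding '\0' */
    size_t cur_len;  /* number of characters currently stored */
} DynString;

/* Start with an empty string and no storage. */
void dyn_init(DynString *s)
{
    s->addr = NULL;
    s->max_len = 0;
    s->cur_len = 0;
}

/* Catenate `text` onto `s`, growing the storage when needed.
   Returns false if allocation fails (s is left unchanged). */
bool dyn_append(DynString *s, const char *text)
{
    size_t add = strlen(text);
    size_t needed = s->cur_len + add;
    if (needed > s->max_len || s->addr == NULL) {
        size_t new_max = s->max_len ? s->max_len : 8;
        while (new_max < needed)
            new_max *= 2;                 /* illustrative growth policy */
        char *p = realloc(s->addr, new_max + 1);
        if (p == NULL)
            return false;
        s->addr = p;
        s->max_len = new_max;
    }
    memcpy(s->addr + s->cur_len, text, add + 1);  /* copies the '\0' too */
    s->cur_len = needed;
    return true;
}

/* Release the storage and reset the descriptor. */
void dyn_free(DynString *s)
{
    free(s->addr);
    dyn_init(s);
}
```

The allocation and de-allocation calls are where the cost mentioned above shows up: every growth step may copy the whole string to a new heap block.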
Design issues: Can an enumeration constant appear in more than one type definition, and if so, how is the type of an occurrence of that constant checked? Are enumeration values coerced to integer? Is any other type coerced to an enumeration type?
Evaluation of Enumerated Type: Aid to readability, e.g., no need to code a color as a number. Aid to reliability, e.g., compiler can check operations (don't allow colors to be added). No enumeration variable can be assigned a value outside its defined range. Ada, C#, and Java 5.0 provide better support for enumeration than C++; e.g. enumeration type variables are not coerced into integer types
type Days is (mon, tue, wed, thu, fri, sat, sun);
subtype Weekdays is Days range mon..fri;
subtype Index is Integer range 1..100;
Day1: Days;
Day2: Weekdays;
Day2 := Day1;

Subrange Evaluation: Aids readability. Makes clear that variables of a subrange can store only a certain range of values. Helps reliability. Assigning a value outside the range to a subrange variable is detected as an error (in Ada, a Constraint_Error is raised at run time when the violation cannot be caught at compile time).
Implementation of User-Defined Ordinal Types:
Enumeration types are implemented as integers. Subrange types are implemented like their parent types, with code inserted by the compiler to restrict assignments to subrange variables.
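Both points can be seen from C, where enumeration constants really are integers and no subrange type exists. In the sketch below, `assign_weekday` is a hypothetical hand-written stand-in for the check an Ada compiler would insert for a subtype such as Weekdays; C itself inserts nothing of the kind.

```c
#include <stdbool.h>

/* Enumeration constants are integers: mon == 0, tue == 1, ... */
enum Days { mon, tue, wed, thu, fri, sat, sun };

/* C silently coerces an enumeration value to int. */
int day_number(enum Days d)
{
    return d;
}

/* Stores `value` in *target and returns true only when it lies in the
   subrange mon..fri; otherwise the assignment is rejected and *target
   is left unchanged. */
bool assign_weekday(enum Days *target, enum Days value)
{
    if (value < mon || value > fri)
        return false;
    *target = value;
    return true;
}
```

Because of the coercion, C also accepts arithmetic such as `sat + 1`, which is exactly the kind of operation a stricter enumeration type (Ada, C#, Java 5.0) would refuse.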
Array Design Issues:
What types are legal for subscripts?
Are subscripting expressions in element references range checked?
When is the subscript range bound (when is the size of array determined)?
When does allocation take place?
What is the maximum number of subscripts?
Can array objects be initialized?
Is any kind of slice allowed?
Array Indexing:
Indexing (or subscripting) is a mapping from indices to elements
array_name (index_value_list) -> an element
Index Syntax:
Array Index (Subscript) Types:
FORTRAN, C: integer only
Pascal: any ordinal type (integer, Boolean, char, enumeration)
Ada: integer or enumeration (includes Boolean and char)
Java: integer types only
Range Checking:
C, C++, Perl, and Fortran do not specify range checking.
Ada, Java, ML, C# specify range checking. Ada makes it MANDATORY.
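Since C does not specify range checking, a program that wants it has to supply the test itself. A minimal sketch follows; the function name `checked_get` and its interface are assumptions for illustration, and the caller must pass the array length because a C array does not carry its own bounds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Range-checked read: stores list[index] in *out and returns true only
   when 0 <= index < length; otherwise returns false and leaves *out
   untouched. */
bool checked_get(const int *list, size_t length, long index, int *out)
{
    if (index < 0 || (size_t)index >= length)
        return false;
    *out = list[index];
    return true;
}
```

In Ada or Java the equivalent test is generated by the language implementation on every subscripted reference, which is where the run-time cost of range checking comes from.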
Five Array Allocation Categories:
C Example of all five methods
1. Static: the subscript range is statically bound and storage allocation is also at compile time (not on the runtime stack) Advantage: efficiency (no dynamic allocation)
2. Fixed stack-dynamic: the subscript range is statically bound, but allocation is done at declaration time during function call Advantage: space efficiency
3. Stack-dynamic: the subscript range is dynamically bound and the storage allocation is dynamic (on run-time stack) Advantage: flexibility (the size of an array need not be known until the array is to be used)
4. Fixed heap-dynamic: similar to fixed stack-dynamic: storage binding is heap dynamic but the size of the array is fixed
5. Heap-dynamic: binding of the subscript range and storage allocation are both dynamic and can change during execution. Advantage: flexibility (arrays can grow or shrink during program execution)
C and C++ arrays that include the static modifier are static.
C and C++ arrays without the static modifier are fixed stack-dynamic.
Ada arrays can be stack-dynamic.
C and C++ provide fixed heap-dynamic arrays.
C# includes a second array class, ArrayList, that provides heap-dynamic arrays (an ArrayList grows as elements are added).
Perl and JavaScript support heap-dynamic arrays.
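The five categories can be approximated in one C file. Treat this as a sketch under assumptions: the function names are invented for the example, the stack-dynamic case relies on C99 variable-length arrays (optional in C11 and later), and the heap-dynamic case is emulated with realloc, since C has no array type that resizes itself.

```c
#include <stdlib.h>

/* 1. Static: subscript range and storage are fixed before run time. */
static int static_arr[4];

/* Sums 0..n-1 through a stack-dynamic array. Requires n > 0. */
int sum_stack_dynamic(int n)
{
    int fixed_stack[4] = {0};  /* 2. Fixed stack-dynamic: size fixed at
                                  compile time, storage allocated on the
                                  stack when the function is called. */
    int vla[n];                /* 3. Stack-dynamic: size chosen at run time. */
    int total = fixed_stack[0] + static_arr[0];
    for (int i = 0; i < n; i++) {
        vla[i] = i;
        total += vla[i];
    }
    return total;
}

/* 4. Fixed heap-dynamic: size chosen at run time, then never changed.
   Returns zero-filled storage, or NULL on failure; the caller frees it. */
int *make_fixed_heap(size_t n)
{
    return calloc(n, sizeof(int));
}

/* 5. Heap-dynamic (emulated): realloc lets the array grow or shrink.
   Returns the possibly moved array, or NULL if resizing failed, in
   which case the old array is still valid. */
int *resize_heap(int *arr, size_t new_n)
{
    return realloc(arr, new_n * sizeof(int));
}
```

The trade-off named in the list is visible here: the static and fixed stack-dynamic arrays need no allocation calls at all, while the last two pay for their flexibility with heap management.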
Array Initialization:
Some languages allow initialization at the time of storage allocation:
// C, C++, Java, C#
int list [] = {4, 5, 7, 83};                 <== an aggregate constant is assigned to list
char name [] = "freddie";                    // C and C++
char *names [] = {"Bob", "Jake", "Joe"};     // C and C++
String[] names = {"Bob", "Jake", "Joe"};     // Java String objects
Array Operations:
1,2,3,4 + 2,3,1,5 = 3,5,4,9
Rectangular and Jagged Arrays:
Slices:
A slice is a method to reference a substructure of an array.
Mechanism (only useful in languages that have array operations)
Assume, the following array declarations in Fortran 95:
Integer, Dimension (10) :: Vector         ! a 1-dimensional array
Integer, Dimension (3, 3) :: Matrix       ! a 2-dimensional array
Integer, Dimension (3, 3, 4) :: Cube      ! a 3-dimensional array

Slice Examples:
Vector (3:6) is a four element array
Matrix (:,2) is the second column of Matrix (an array)
Matrix (2:3,:) is the 2nd and 3rd rows of Matrix (a matrix)
Cube (2,:,:) is a matrix
Cube (:,:,2:3) is another 3-dimensional array
Implementation of Arrays:
Access function maps subscript expressions to an address in the array
Access function for single-dimensioned arrays:
address(list[k]) = address (list[lower_bound]) + ((k-lower_bound) * element_size)
Accessing Elements of Multi-dimensioned Arrays:
array_starting_address + ((i * n_cols + j) * element_size)
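Both access functions can be written out and checked against the addresses the C compiler itself computes. The function names are hypothetical, and the two-dimensional version assumes row-major order with zero-based subscripts, which is how C lays out its arrays.

```c
#include <stddef.h>

/* Address of list[k] for a one-dimensional array; `base` is the address
   of the element whose subscript is lower_bound. */
const char *address_1d(const char *base, long lower_bound, long k,
                       size_t element_size)
{
    return base + (k - lower_bound) * (long)element_size;
}

/* Address of element (i, j) in a row-major two-dimensional array with
   zero-based subscripts and n_cols columns per row. */
const char *address_2d(const char *base, size_t n_cols, size_t i, size_t j,
                       size_t element_size)
{
    return base + (i * n_cols + j) * element_size;
}
```

For `int m[3][4]`, `address_2d((const char *)m, 4, 2, 3, sizeof(int))` yields the same address as `&m[2][3]`: two full rows of four elements are skipped, then three more elements.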
Compile-Time Descriptors for arrays:
All arrays: starting address, element type, index type
Single-dimensional: index range (e.g. lower and upper bound)
Multi-dimensional: index range 1 ... index range n, number of dimensions
An unordered collection of data elements indexed by an equal number of values called keys. User defined keys must be stored. Design issue: What is the form of references to elements? Examples: a hash in Perl and support in Java/C++/JavaScript
Illustrated:

* RECORD TYPES

A possibly heterogeneous aggregate of data elements individually identified by names (usually called fields)

Design issues
  What is the syntactic form of references to the field?
  Are elliptical references allowed? (An ellipsis is an omitted part that is understood from context.)

Ex: structs in C

COBOL (level numbers show nested records)
  01 EMP-RECORD.
     02 EMP-NAME.
        03 FIRST PIC IS X(20).
        03 MIDDLE PIC IS X(10).
        03 LAST PIC IS X(20).
     02 HOURLY-RATE PIC IS 99V99.

  Record Field Reference: MIDDLE OF EMP-NAME OF EMP-RECORD
  Fully qualified references include all record names.
  Elliptical references allow leaving out some parts of record names as long as the reference is unambiguous; e.g. FIRST, FIRST OF EMP-NAME, and FIRST OF EMP-RECORD are elliptical references to the employee's first name.

Ada
  Record structures are indicated in an orthogonal way:
  type Emp_Name_Type is record
    First: String (1..20);
    Mid: String (1..10);
    Last: String (1..20);
  end record;
  type Emp_Rec_Type is record
    Employ_Name: Emp_Name_Type;
    Hourly_Rate: Float;
  end record;
  Emp_Rec: Emp_Rec_Type;

  Record Field Reference: Emp_Rec.Employ_Name.Mid   <== dot notation - most commonly used

Operations on Records
  Assignment is very common if the types are identical
  Ada allows record comparison
  Ada records can be initialized with aggregate literals
  COBOL provides MOVE CORRESPONDING: copies a field of the source record to the corresponding field in the target record

Evaluation and Comparison to Arrays
  Straightforward and safe design
  Records are used when the collection of data values is heterogeneous
  Access to array elements is slower than access to record fields, because subscripts are dynamic (field names are static); e.g. myArray[i]
  Dynamic subscripts could be used with record field access, but would disallow type checking and be much slower

Implementation of Record Type:

* UNION TYPES

A union is a type whose variables can store values of different types at different times during execution

Design issues
  Should type checking be required?
  Should unions be embedded in records?

Unions:
  o every member begins at offset 0 from the address of the union
  o the size is the size of the largest member
  o only one member value can be stored in a union object at a time

Discriminated vs. Free Unions
  Fortran, C, and C++ provide union constructs with no language support for type checking; the union in these languages is called a free union; Ex. C
  Type checking of unions requires that each union include a type indicator called a discriminant

Discriminated Union Type Ex. Ada
  type Shape is (Circle, Triangle, Rectangle);
  type Colors is (Red, Green, Blue);
  type Figure (Form: Shape) is record
    Filled: Boolean;
    Color: Colors;
    case Form is
      when Circle => Diameter: Float;
      when Triangle => Leftside, Rightside: Integer;
                       Angle: Float;
      when Rectangle => Side1, Side2: Integer;
    end case;
  end record;
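For comparison, here is how the Ada Figure record might be approximated with a C free union. This is a sketch, not a translation: the type and function names are invented, the Filled and Color fields are omitted for brevity, and the discriminant `form` is maintained by hand. Nothing in C stops code from reading the wrong member, which is the safety gap the discriminated union closes.

```c
#include <stddef.h>

typedef enum { CIRCLE, TRIANGLE, RECTANGLE } Shape;

/* Free union: the language does not record which member is live. */
typedef union {
    struct { float diameter; } circle;
    struct { int leftside, rightside; float angle; } triangle;
    struct { int side1, side2; } rectangle;
} FigureData;

/* Hand-made discriminated union: `form` plays the role of the Ada
   discriminant, but callers are not forced to consult it. */
typedef struct {
    Shape form;
    FigureData data;
} Figure;

/* Checked accessor: stores the diameter in *out and returns 1 only when
   the figure really is a circle; returns 0 otherwise. */
int figure_diameter(const Figure *f, float *out)
{
    if (f->form != CIRCLE)
        return 0;
    *out = f->data.circle.diameter;
    return 1;
}
```

The layout rules listed above also hold for `FigureData`: every member sits at offset 0, and the union is at least as large as its largest member (the triangle variant here).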
Evaluation of Unions:
  Potentially unsafe construct if no type checking.
  Java and C# do not support unions.
  Reflective of growing concerns for safety in programming languages.

*Note: unions and structures are prevalent in C system programming (see semaphore example).
  C was developed to be a low-level system programming language
  C is still the most popular choice for low-level programming
  For applications programming, the power of C is also its downfall (see the obfuscated C code contest)
POINTER AND REFERENCE TYPES
o A pointer type variable holds a memory address or a special value; e.g., nil
o Provides indirect addressing
o Provides a way to manage dynamic memory
o A pointer can access memory that is dynamically created (e.g. heap)
o Pointers add addressing flexibility and control dynamic storage management

Design Issues of Pointers
  What are the scope and lifetime of a pointer variable?
  What is the lifetime of a heap-dynamic variable?
  Are pointers restricted as to the type of value to which they can point?
  Are pointers used for dynamic storage management, indirect addressing, or both?
  Should the language support pointer types, reference types, or both?

Pointer Operations
  Two fundamental operations: assignment and dereferencing
  Assignment is used to set a pointer variable's value to some useful address
  Dereferencing yields the value stored at the location represented by the pointer's value
  Dereferencing can be explicit or implicit
  C++ has an explicit operation via *; e.g. j = *ptr sets j to the value located at ptr

Problems with Pointers
  Dangling pointer: a pointer points to a heap-dynamic variable that has been de-allocated
  Dangling object: an allocated heap-dynamic variable has no pointer (memory leak)
  Cross-linked pointers: heap-dynamic variable with two pointers (sometimes desirable)

Pointers in C/C++
  Powerful and dangerous; Ex. C
  Explicit dereferencing and address-of operators
  (void *) can point to any type and be type checked but not de-referenced (Ex. C)
  In C, pointers can point to pointers that point to pointers, and so on (C double pointers)
  Limited pointer arithmetic is supported in C/C++
    float stuff[100];
    float *p;
    p = stuff;
    *(p+5) is equivalent to stuff[5] and p[5]
    *(p+i) is equivalent to stuff[i] and p[i]
    p = p + stuff;   <= not OK

Pointers in Fortran 95 point to heap and non-heap variables that have the TARGET attribute assigned in the declaration; dereferencing is implicit

C-style pointers in C# are available only in unsafe code:
    unsafe {
      int i = 10;
      int* px1 = &i;
    }

Reference Types
  C++ includes a special kind of pointer type called a reference type that is used primarily for formal parameters; a C++ reference type behaves strictly like an alias (see C++ code)
  In a reference, you never see the address and you have implicit assignment and dereferencing (see C++ code); THERE IS NO EXPLICIT DEREFERENCING
  Advantages of both pass-by-reference and pass-by-value
  Java extends C++ reference variables to replace pointers entirely
  References refer to class instances
  C# includes both the references of Java and C pointers (in unsafe code)

Evaluation of Pointers
  Dangling pointers and dangling objects are problems, as is heap management
  Pointers or references are necessary for dynamic data structures, so a language cannot be designed without them

Representations of Pointers
  Depends on register and address size
  Large computers use single values
  Intel microprocessors use segment and offset (2 16-bit addresses)

Solutions to Dangling Pointer Problem
  o Automatically de-allocate dynamic objects at the end of the pointer's scope (Ada to some extent)
  o Tombstone: extra heap cell that is a pointer to the heap-dynamic variable
      The actual pointer variable points only at tombstones (another indirection)
      When the heap-dynamic variable is de-allocated, the tombstone remains but is set to nil
      Costly in time and space; not used in any major language
  o Locks-and-keys: pointer values are represented as (key, address) pairs
      Heap-dynamic variables are represented as the variable plus a cell for an integer lock value
      When a heap-dynamic variable is allocated, a lock value is created and placed in the lock cell and in the key cell of the pointer
      Allows multiple pointers to point to the same variable (the key is copied also)
      When memory is deallocated, the lock value is changed so that the remaining keys no longer match, which prevents other access
  o Best solution: no explicit deallocation possible

Heap Management
  A very complex run-time process
  Single-size cells vs. variable-size cells
  Two approaches to Garbage Collection
  (1) Eager approach: reclamation is gradual (C approach)
      Evaluation: less efficient, but you have the maximum amount of memory at any given time
  (2) Lazy approach: reclamation occurs when the list of available space becomes empty
      The run-time system allocates storage cells as requested and disconnects pointers from cells as necessary; garbage collection then begins
      Evaluation: more efficient, but when you need it most, it works worst (takes most time when the program needs most of the cells in the heap)

* BIT TYPES (not covered in text)

Systems programming often requires access to individual bits and data structures that take up less storage than primitive data types
The use of pointers and bit-level operators can almost replace assembly code (Note: only about 10% of UNIX is assembly code; the rest is C)
Example: C's bit fields and displaying bit by bit
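A small sketch of both ideas follows. The struct `Flags`, its field names, and the helper `bits_to_string` are hypothetical and chosen only for illustration; how bit fields are laid out inside the storage word is implementation-defined in C.

```c
#include <limits.h>
#include <stddef.h>

/* Bit fields: three one-bit flags and a 5-bit counter share one
   unsigned storage unit. */
struct Flags {
    unsigned readable   : 1;
    unsigned writable   : 1;
    unsigned executable : 1;
    unsigned count      : 5;   /* holds 0..31 */
};

/* Writes the bits of `value`, most significant first, into buf as a
   string of '0' and '1' characters. buf must have room for
   sizeof(unsigned) * CHAR_BIT + 1 characters. */
void bits_to_string(unsigned value, char *buf)
{
    size_t nbits = sizeof(unsigned) * CHAR_BIT;
    for (size_t i = 0; i < nbits; i++) {
        unsigned mask = 1u << (nbits - 1 - i);   /* select one bit */
        buf[i] = (value & mask) ? '1' : '0';
    }
    buf[nbits] = '\0';
}
```

Passing 5u to `bits_to_string` produces a string that ends in 0101, and incrementing a `count` field that holds 31 wraps it back to 0, since the field has only five bits.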