Homework 2 - Chapter 3
Due: Tuesday January 28, 2014 at 11:55pm
Questions are (mostly) from the book:
- 3.1: Data quality can be assessed in terms of several issues, including
accuracy, completeness, and consistency. For each of the above three
issues, discuss how data quality assessment can depend on the intended
use of the data, giving examples. Propose two other dimensions of data
quality.
- Discuss issues to consider during data cleaning. In your discussion,
highlight how each of the data cleaning methods presented in the book
handles specific issues.
- 3.4: Discuss issues to consider during data integration.
- 3.6: Use the following methods to normalize the dataset: 200, 300, 400,
600, 1000
- min-max normalization by setting min=0 and max=1
- z-score normalization
- z-score normalization using the mean absolute deviation instead
of the standard deviation
- normalization by decimal scaling
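For reference, the four normalization methods named above can be sketched as follows. This is a minimal sketch, not a required solution format: the function names are hypothetical, the standard deviation is assumed to be the population form (dividing by n), and the sample values in the usage note are illustrative rather than the dataset from the question.

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    # Linearly map [min(values), max(values)] onto [new_min, new_max].
    # Assumes the values are not all equal (otherwise the range is zero).
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    # Subtract the mean and divide by the population standard deviation
    # (assumption: divide by n, not n - 1).
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def z_score_mad(values):
    # Same as z_score, but divide by the mean absolute deviation.
    n = len(values)
    mean = sum(values) / n
    mad = sum(abs(v - mean) for v in values) / n
    return [(v - mean) / mad for v in values]

def decimal_scaling(values):
    # Divide by 10**j, where j is the smallest integer such that
    # every scaled absolute value is below 1.
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

For example, min_max([10, 20, 30]) maps the values to 0.0, 0.5, and 1.0, and decimal_scaling([5, 50, 500]) divides every value by 1000.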
- 3.8: Using the data for age and body fat given in Exercise 2.4:
age 23 23 27 39 41 47 49 50 52 54 56 57 58 58 60 61
%fat 9.5 26.5 7.8 31.4 25.9 27.4 27.2 31.2 34.6 28.8 33.4 30.2 34.1 32.9 41.2 35.7
Answer the following:
- Normalize the two attributes based on z-score normalization.
- Calculate the correlation coefficient (Pearson's product moment coefficient).
- Compute the covariance.
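For reference, the covariance and Pearson's coefficient asked for above can be sketched as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, and both quantities use the population form (dividing by n). Check which convention your answer is expected to use.

```python
import math

def covariance(xs, ys):
    # Population covariance: mean of the products of deviations from the means.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson(xs, ys):
    # Covariance divided by the product of the two standard deviations.
    # The variance of an attribute is its covariance with itself.
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)
```

A coefficient above 0 suggests the attributes are positively correlated and one below 0 that they are negatively correlated. The covariance always has the same sign as the coefficient.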
- 3.13: Propose an algorithm, in pseudocode or in your favorite programming
language, for the following:
- The automatic generation of a concept hierarchy for nominal data based
on the number of distinct values of attributes in the given schema.
- The automatic generation of a concept hierarchy for numeric data based
on the equal-width partitioning rule.
- The automatic generation of a concept hierarchy for numeric data based
on the equal-frequency partitioning rule.
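As a starting point, the three rules above can be sketched for a single level of a hierarchy. This is a rough sketch, not a complete answer: the function names, the dict-of-columns input, and the choice of k bins are all assumptions, and a full concept hierarchy would apply the numeric partitioning recursively to each bin.

```python
def nominal_hierarchy(columns):
    # columns: dict mapping attribute name -> list of values (hypothetical input).
    # Attributes with fewer distinct values are placed higher in the hierarchy.
    return sorted(columns, key=lambda name: len(set(columns[name])))

def equal_width_bins(values, k):
    # Split the range [min, max] into k intervals of equal width.
    # Assumes the values are not all equal.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # the maximum goes in the last bin
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    # Sort the values, then cut them into k bins of (nearly) equal size.
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]
```

On the illustrative values 0, 1, 2, 3, 4, 10 with k = 2, equal-width partitioning puts 10 alone in the second bin, while equal-frequency partitioning yields two bins of three values each.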