Data Sampling
Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.
It enables data scientists, predictive modelers and other data analysts to work with a smaller, more manageable subset of data, rather than trying to analyze the entire data population.
With a representative sample, they can build and run analytical models more quickly, while still producing accurate findings.
Probability Data Sampling
In statistics, a simple random sample (or SRS) is a subset of individuals (a sample) chosen from a larger set (a population), where each individual is chosen randomly and with the same probability.
- Sampling without replacement: once an object is selected, it is removed from the population, so it cannot appear twice in the sample.
- Sampling with replacement: a selected object is not removed from the population, so it may be selected again.
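A minimal Python sketch of the two variants, using a toy population of integer IDs (the names and sizes here are illustrative, not from the text):

```python
import random

population = list(range(1000))  # toy population: 1,000 object IDs

# Without replacement: a drawn object leaves the population,
# so no object can appear twice in the sample.
sample_without = random.sample(population, 100)

# With replacement: a drawn object stays in the population,
# so the same object may be drawn again.
sample_with = random.choices(population, k=100)

print(len(set(sample_without)))  # always 100: duplicates are impossible
print(len(set(sample_with)))     # usually < 100: repeats do occur
```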
In statistics, stratified sampling is a method of sampling from a population which can be partitioned into subpopulations (strata); a sample is then drawn from each stratum separately.
In practice, a SQL query can be used to pull each stratum out of the dataset as a sub-dataset, as in the sketch below.
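A minimal sketch of that workflow, assuming a hypothetical SQLite file winery.db with a grapes(diameter, field_id) table, where each field_id defines one stratum:

```python
import random
import sqlite3

# Assumed setup: a SQLite database "winery.db" with a table
# grapes(diameter REAL, field_id INTEGER); each field is one stratum.
conn = sqlite3.connect("winery.db")
cur = conn.cursor()

sample = []
for (field_id,) in cur.execute("SELECT DISTINCT field_id FROM grapes").fetchall():
    # The SQL query pulls one stratum as a sub-dataset ...
    rows = cur.execute(
        "SELECT diameter, field_id FROM grapes WHERE field_id = ?",
        (field_id,),
    ).fetchall()
    # ... and we draw the same fraction (here 10%) from every stratum.
    sample.extend(random.sample(rows, max(1, len(rows) // 10)))
```

Drawing the same fraction from every stratum keeps each field represented in proportion to its size.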
Systematic sampling creates samples by setting an interval at which to extract data from the dataset according to a predetermined rule. With systematic sampling, the element of randomness in selections only applies to the first item selected. After that, the rule dictates the selections.
An example of systematic sampling would be creating a 100-element sample by selecting every 10th row of a 1,000-row dataset.
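For instance, a toy Python version of that 1,000-row example:

```python
import random

data = list(range(1000))   # stand-in for a 1,000-row dataset
k = 10                     # interval = population size / sample size

# The only random choice is the starting offset; the fixed
# interval rule determines every selection after it.
start = random.randrange(k)
sample = data[start::k]    # every 10th row -> 100 elements

print(len(sample))         # 100
```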
In statistics, cluster sampling is a sampling plan used when mutually homogeneous yet internally heterogeneous groupings are evident in a statistical population.
In this sampling plan, the total population is divided into these groups (known as clusters) and a simple random sample of the groups is selected. The elements in each selected cluster are then sampled.
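A two-stage sketch in Python, with made-up clusters standing in for the real groupings:

```python
import random

# Toy population: 10 clusters (e.g. vineyards) of 100 elements each.
clusters = {c: [(c, i) for i in range(100)] for c in range(10)}

# Stage 1: a simple random sample of whole clusters.
chosen = random.sample(sorted(clusters), 3)

# Stage 2: sample elements inside each selected cluster
# (taking every element instead gives one-stage cluster sampling).
sample = [x for c in chosen for x in random.sample(clusters[c], 20)]
```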
Now, suppose I own a winery and collect grapes to make wine. In the dataset, the first row is the diameter of each individual fruit, and the second row is the ID of the field where it was collected.
- What is the mean and standard deviation of the entire dataset?
- The dataset is so large that I want to sample 10% of the data points. Estimate the mean and standard deviation using "sampling without replacement" vs. "sampling with replacement", with 100 runs for each approach. For "sampling with replacement", do you ever see the same data point sampled twice?
- The dataset is still so large that I want to sample 2% of the data points. Estimate the mean and standard deviation using "sampling without replacement" vs. "sampling with replacement", with 100 runs for each approach. Do you ever see a run with no data points from field 1?
- Sample 2% of the data points from the entire dataset. Estimate the mean and standard deviation using "stratified sampling" and "systematic sampling". What conclusion can you draw? (A simulation sketch follows this list.)
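Since the grape dataset itself is not included here, the sketch below runs the same experiments on synthetic data; the field structure, sizes, and distributions are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the winery data: 10,000 grapes from
# three fields whose mean diameters differ slightly.
field_id = rng.integers(1, 4, size=10_000)
diameter = rng.normal(10.0 + field_id, 1.5)

print("truth:", diameter.mean(), diameter.std())

def estimate(frac, replace, runs=100):
    """Average (mean, std) estimate over `runs` random samples."""
    n = int(len(diameter) * frac)
    stats = [(s.mean(), s.std())
             for s in (rng.choice(diameter, size=n, replace=replace)
                       for _ in range(runs))]
    return np.mean(stats, axis=0)

for frac in (0.10, 0.02):
    print(frac, "without replacement:", estimate(frac, replace=False))
    print(frac, "with replacement:   ", estimate(frac, replace=True))

# Stratified sampling at 2%: sample 2% within each field separately.
strata = [diameter[field_id == f] for f in (1, 2, 3)]
strat = np.concatenate(
    [rng.choice(s, size=int(0.02 * len(s)), replace=False) for s in strata])
print("stratified:", strat.mean(), strat.std())

# Systematic sampling at 2%: every 50th grape after a random start.
syst = diameter[rng.integers(50)::50]
print("systematic:", syst.mean(), syst.std())
```

Counting duplicate draws in the with-replacement runs, and runs whose sample contains no grapes from field 1, is how you would answer the questions above.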
Non-probability Data Sampling
Nonprobability sampling is a form of sampling that does not utilise random sampling techniques where the probability of getting any particular sample may be calculated.
Nonprobability samples are not intended to be used to infer from the sample to the general population in statistical terms. In cases where external validity is not of critical importance to the study's goals or purpose, researchers might prefer to use nonprobability sampling. Researchers may seek to use iterative nonprobability sampling for theoretical purposes, where analytical generalization is considered over statistical generalization.
----Wikipedia
Convenience sampling (also known as grab sampling, accidental sampling, or opportunity sampling) is a type of non-probability sampling that involves the sample being drawn from that part of the population that is close to hand.
Convenience sampling is not often recommended by official statistical agencies for research due to the possibility of sampling error and lack of representation of the population.
In the design of experiments, consecutive sampling, also known as total enumerative sampling, is a sampling technique in which every subject meeting the criteria of inclusion is selected until the required sample size is achieved.
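As a small sketch of that rule (the stream and inclusion criterion here are illustrative):

```python
def consecutive_sample(subjects, meets_criteria, target_size):
    """Take every eligible subject, in arrival order, until full."""
    sample = []
    for subject in subjects:
        if meets_criteria(subject):
            sample.append(subject)
            if len(sample) == target_size:
                break
    return sample

# Toy usage: the first 5 even numbers encountered in a stream.
print(consecutive_sample(range(100), lambda x: x % 2 == 0, 5))
```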
Judgment Sample / Purposive Sampling
A judgment sample, or expert sample, is a type of non-random sample that is selected based on the opinion of an expert.
Results obtained from a judgment sample are subject to some degree of bias, because the sample's frame (i.e. the variables that define the population to be studied) and the actual population are not identical.
In quota sampling, a population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. This means the interviewer can deliberately target the kinds of subjects they want to sample.
This second step makes the technique non-probability sampling. In quota sampling, there is non-random sample selection and this can be unreliable. For example, interviewers might be tempted to interview those people in the street who look most helpful, or may choose to use accidental sampling to question those closest to them, to save time.
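A sketch of that two-step procedure, assuming hypothetical person records with gender and age keys (nothing here is a fixed API, just an illustration of quota filling):

```python
def quota_sample(stream, quotas, min_age=45, max_age=60):
    """Fill a fixed quota per segment with whoever is met first."""
    counts = {segment: 0 for segment in quotas}
    picked = []
    for person in stream:  # e.g. passers-by, in encounter order
        seg = person["gender"]
        if (seg in counts and counts[seg] < quotas[seg]
                and min_age <= person["age"] <= max_age):
            picked.append(person)
            counts[seg] += 1
        if counts == quotas:  # every quota filled
            break
    return picked

# The interviewer example above: 200 females and 300 males aged 45-60.
# sample = quota_sample(people, {"female": 200, "male": 300})
```

Because subjects are taken in encounter order rather than by random chance, the result depends on who happens to be available first, which is exactly the source of unreliability described above.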
In sociology and statistics research, snowball sampling (or chain sampling, chain-referral sampling, referral sampling) is a nonprobability sampling technique where existing study subjects recruit future subjects from among their acquaintances. Thus the sample group is said to grow like a rolling snowball. As the sample builds up, enough data are gathered to be useful for research.