Chengwei LEI, Ph.D.    Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

 

Data Summaries

 


 

 

Many approaches may be viewed as data “summarization”. The most immediate effect of summarizing data is to take data that may be overwhelming to work with and reduce it to a few key summary values that can be viewed, often in a table or plot.

 

 



Data Central Tendency





Now, I own a winery, and I collect grapes to make the wine. The majority of what I collect is Merlot (super good; BIG and sweet); some are blueberries (bad, they ruin my wine; SMALL and sour).

Here are the data from yesterday. The first row is the diameter of each individual fruit; the second row is whether it is a Merlot or not.

How can we summarize this data?






Max, Min

 

Median
What if we have an even number of data points? What if we have grouped data?
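To make the odd versus even question concrete, here is a minimal sketch of the median. The function name `median_of` and the diameters are made up for illustration; they are not the data from the slide. With an even count, the usual convention is to average the two middle values.

```python
def median_of(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical fruit diameters (illustrative only).
odd_sample = [9, 21, 22, 24, 25]
even_sample = [9, 10, 21, 22, 24, 25]

print("min/max:", min(even_sample), max(even_sample))
print("median (odd count):", median_of(odd_sample))
print("median (even count):", median_of(even_sample))
```

Python's standard library offers the same behavior through `statistics.median`.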



 

Mean (Average)
Basic situation? What if not every value counts equally?
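One way to read "not everything equal" is the weighted mean, where each value carries its own weight. The sketch below assumes that reading; the function names, scores, and weights are hypothetical.

```python
def simple_mean(values):
    """Arithmetic mean: every value counts equally."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Mean where each value contributes in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical example: three scores, the last one counted most heavily.
scores = [80, 90, 70]
weights = [2, 3, 5]

print("simple mean:  ", simple_mean(scores))
print("weighted mean:", weighted_mean(scores, weights))
```

When all weights are equal, the weighted mean reduces to the simple mean.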

 

Mode
 By definition?
An empirical relationship! Magic Spell
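A minimal sketch of the mode by definition (the most frequent value), next to one commonly cited empirical relationship, often attributed to Karl Pearson: mode ≈ 3 × median − 2 × mean. That this is the relationship the slide has in mind is an assumption. It is only a rough approximation for moderately skewed, unimodal data, and the diameters below are made up.

```python
from statistics import mean, median, multimode


def pearson_mode_estimate(values):
    """Empirical approximation: mode ~ 3 * median - 2 * mean."""
    return 3 * median(values) - 2 * mean(values)


# Hypothetical fruit diameters (illustrative only).
diameters = [20, 21, 22, 22, 22, 23, 24, 25, 28]

# By definition: the value(s) that occur most often.
print("mode by definition:", multimode(diameters))
# By the empirical relationship: an estimate, not an exact answer.
print("empirical estimate:", pearson_mode_estimate(diameters))
```

`multimode` returns a list because a data set can have more than one most frequent value.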

 


What it looks like

 

Symmetrical

Negative skew (left-skewed)   VS   Positive skew (right-skewed)


During the three years between the 2019 and 2022 surveys, the COVID-19 pandemic caused severe disruptions to the U.S. labor market and broader economic activity, leading to unprecedented levels of fiscal support. Against this backdrop, U.S. families experienced increases in median and mean inflation-adjusted income, measured for the year before the survey.

 Median income rose a relatively modest 3 percent, from $67,900 in 2018 to $70,300 in 2021. Mean income increased 15 percent—one of the largest three-year changes in mean income over the history of the modern SCF—from $123,400 in 2018 to $141,900 in 2021.

------from FEDERAL_RESERVE_Report
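The gap between the 3 percent and 15 percent changes is what one would expect from a right-skewed distribution: growth concentrated at the top pulls the mean up while the median barely moves. The sketch below uses invented household incomes (in thousands of dollars), not SCF data, to show the effect.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars (NOT survey data).
before = [30, 40, 50, 60, 70, 80, 500]
# Same households, but only the top income has grown.
after = [30, 40, 50, 60, 70, 80, 1000]

for label, incomes in (("before", before), ("after", after)):
    print(f"{label}: median = {median(incomes)}, mean = {mean(incomes):.1f}")
```

In both lists the mean sits well above the median, which is the signature of positive (right) skew.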


More visualized?
Histogram
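A histogram groups values into equal-width bins and counts how many fall into each. Below is a minimal text-only sketch; the function name `histogram_counts`, the bin width, and the diameters are assumptions chosen for illustration.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Map each bin's lower edge to the number of values in [edge, edge + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical fruit diameters (illustrative only).
diameters = [8, 9, 11, 19, 21, 22, 22, 23, 24, 26]
bin_width = 5

for edge, count in histogram_counts(diameters, bin_width).items():
    # One '#' per value in the bin.
    print(f"[{edge:>2}, {edge + bin_width:>2}) {'#' * count}")
```

The tallest bar marks where most values cluster, which makes symmetry or skew visible at a glance.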

 

 

 

 

 





Standard Deviation



In statistics, the standard deviation is a measure of the amount of variation of the values of a variable about its mean.

A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range. The standard deviation is commonly used in the determination of what constitutes an outlier and what does not.

Standard deviation may be abbreviated SD or std dev, and is most commonly represented in mathematical texts and equations by the lowercase Greek letter σ (sigma).
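A minimal sketch of the population standard deviation computed step by step, followed by a simple outlier check. The data and the cutoff of 2 standard deviations are assumptions for illustration; the cutoff is a convention that varies by application.

```python
import math


def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    mu = sum(values) / len(values)
    squared_deviations = [(x - mu) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / len(values))


def outliers(values, k):
    """Values lying more than k standard deviations from the mean."""
    mu = sum(values) / len(values)
    sd = population_sd(values)
    return [x for x in values if abs(x - mu) > k * sd]


# Hypothetical data sets with the same mean (10) but different spread.
tight = [9, 10, 10, 10, 11]
spread = [2, 6, 10, 14, 18]
print("SD of tight: ", round(population_sd(tight), 3))
print("SD of spread:", round(population_sd(spread), 3))

# Hypothetical data with one unusually large value.
with_outlier = [10, 11, 9, 10, 10, 30]
print("flagged:", outliers(with_outlier, k=2))
```

This divides by n (the population form, as in `statistics.pstdev`); the sample form divides by n − 1 (as in `statistics.stdev`).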

 

A plot of normal distribution


68–95–99.7 rule

 

For the normal (distribution) curve: (μ: mean, σ: standard deviation)
From μ–σ to μ+σ: contains about 68% of the measurements
From μ–2σ to μ+2σ: contains about 95% of it
From μ–3σ to μ+3σ: contains about 99.7% of it
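The three percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2), which does not depend on μ or σ. The helper name `normal_coverage` is hypothetical.

```python
import math


def normal_coverage(k):
    """Probability that a normal variable lies within k sigma of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.2%}")
```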

 


 

Variance is the expected value of the squared deviation from the mean of a random variable.
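As a sketch of this definition, the code below computes Var(X) = E[(X − μ)²] for a discrete random variable and checks it against the equivalent form E[X²] − μ². The fair six-sided die and the helper names are assumptions chosen for illustration; exact fractions avoid rounding noise.

```python
from fractions import Fraction


def expected_value(dist, f=lambda x: x):
    """E[f(X)] for a discrete distribution given as {value: probability}."""
    return sum(p * f(x) for x, p in dist.items())


def variance(dist):
    """Expected squared deviation from the mean."""
    mu = expected_value(dist)
    return expected_value(dist, lambda x: (x - mu) ** 2)


# Hypothetical example: a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

mu = expected_value(die)
print("mean:", mu)
print("variance (definition):", variance(die))
print("variance (E[X^2] - mu^2):", expected_value(die, lambda x: x * x) - mu ** 2)
```

The standard deviation is the square root of this quantity, which brings the measure back to the original units of the data.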

 

 




Common Probability Distributions



 

Data scientists have hundreds of probability distributions from which to choose. Where to start?

 

Common probability distributions and some key relationships