Chengwei LEI, Ph.D., Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

 Data Science


 

What is Data Science

Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results. (Definition by Amazon)

 

Under the big umbrella of the "Science of Intelligence", Data Science usually refers to Data Mining and Machine Learning.

 

What is the difference between Machine Learning and Data Mining

Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases). Data mining uses many machine learning methods, but with different goals; on the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy.
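The contrast above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the toy data (hours studied and a pass/fail label) and the function names `predict_1nn` and `two_means` are hypothetical and do not come from the course material. The first function predicts a label from known training examples (the machine learning view), while the second looks for previously unknown groups in unlabeled values (the data mining view), using a one-dimensional k-means with k = 2.

```python
from statistics import mean

# Hypothetical toy data: (hours_studied, label). The labels are known,
# so this plays the role of training data.
train = [(1.0, "fail"), (2.0, "fail"), (3.0, "fail"),
         (7.0, "pass"), (8.0, "pass"), (9.0, "pass")]


def predict_1nn(x, labeled):
    """Prediction: return the label of the training example closest to x."""
    nearest = min(labeled, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def two_means(values, iterations=10):
    """Discovery: split unlabeled values into two groups (1-D k-means, k=2).

    Assumes the values contain at least two distinct numbers, so that
    neither group is ever empty.
    """
    c0, c1 = min(values), max(values)  # initial centroids
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        # Move each centroid to the mean of its group.
        c0, c1 = mean(g0), mean(g1)
    return sorted(g0), sorted(g1)


# Supervised: uses the labels to predict an unseen case.
prediction = predict_1nn(6.5, train)

# Unsupervised: ignores the labels and discovers structure in the values alone.
groups = two_means([x for x, _ in train])
```

Here `prediction` depends on the labels in `train`, whereas `groups` is computed from the raw values only, which is why the same grouping step can also serve as preprocessing for a learner.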

 




Introduction to Data Science


Data Mining


Machine Learning


Data Visualization




Useful Math Skills

Mathematics for Machine Learning (local download)

Statistical Thinking for the 21st Century (GitHub)

 


Data Visualization

Python Data Visualization Cookbook

 









 

Ethical problems in data science

Before we start

Introduction

 






Data Collection

 
Data Preprocessing

 
Data Cleaning
Data Integration
Data Reduction
 


 
Intro to python

Python Crawler

Intro to R

Data Exploration / Understanding Data

 
Data Summaries

 
Data Central Tendency
Standard Deviation / Variance
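
As a minimal sketch of the two summary topics above, the measures can be computed with Python's standard `statistics` module. The sample values are hypothetical and were chosen only so that the results come out as round numbers.

```python
import statistics

# Hypothetical sample, chosen only so the results are round numbers.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency: where the data is centred.
center = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
}

# Spread: how far the data lies from its centre.
# The "p" variants divide by n (population); the others divide by n - 1 (sample).
spread = {
    "population_variance": statistics.pvariance(data),
    "population_std": statistics.pstdev(data),
    "sample_variance": statistics.variance(data),
    "sample_std": statistics.stdev(data),
}
```

For this sample the mean is 5, the median 4.5, the mode 4, the population variance 4 and the population standard deviation 2. The sample variance is slightly larger (32/7) because it divides by n - 1 instead of n.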
 
 Data Plots / Visualized Summarization


Basic Plots
Boxplot
Histogram
Quantile plots
Heatmap / Mesh
 
Data Visualization



Simpson’s Paradox
 

  
WordClouds

Computer Science

Math/Statistics

Visualization


 
 Optimization Problems
 
Classic Optimization

 
Brute Force
Greedy Algorithm
Dynamic Programming
 
Stochastic Algorithms

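Genetic Algorithm (GA)
Evolution Strategies (ES)
Differential Evolution (DE)
Particle Swarm Optimization (PSO)
Ant Colony Optimization (ACO)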

 

Learning Types

Supervised
Unsupervised
Semi-supervised

Data Splitting
Overfitting
 


Classification

Decision trees
Logistic regression
K-Nearest Neighbours
Naive Bayes
SVM
Artificial neural networks
 


Clustering

Hierarchical Clustering
K-means Clustering
Mean Shift Clustering
DBSCAN
Agglomerative Clustering
Affinity Propagation

Soft Clustering
 


Network Analysis

Graph Mining
Random Walk
PageRank
Web mining
Information Network Analysis
 


Trend Analysis

Time-series Prediction
Deviation Analysis
Sequential Pattern Mining
Periodicity Analysis
 


Biological Data Analysis

Motif Finding
Bio Sequence Analysis
Bio Network Analysis
Pathway Analysis
 



 Basic Tools
 
Measurements

 
Distance Measure
Similarity / Correlation
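
A minimal sketch of one common choice for each of the two measurements above: Euclidean distance as the distance measure and the Pearson coefficient as the correlation measure. These are assumed examples only (the outline does not say which measures the course uses), and the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient, in the range [-1, 1].

    Assumes both sequences have the same length and that neither
    is constant (otherwise the denominator would be zero).
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


distance = euclidean_distance([0, 0], [3, 4])
rising = pearson_correlation([1, 2, 3], [2, 4, 6])
falling = pearson_correlation([1, 2, 3], [6, 4, 2])
```

A distance grows as two points move apart, while a correlation reports how closely two variables move together: here `rising` is 1 (perfectly aligned) and `falling` is -1 (perfectly opposed), regardless of how far apart the values are.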
 
Statistical Analysis


Z-test
t-test
U-test
Statistical Dependence
p-value
Confidence Interval
ANOVA table

 
Data Preprocessing

Normalization
Data Sampling
Data Cleaning
  
Evaluations

Regression Problems
Classification Problems
Clustering Problems
    

 
Regression

Simple Linear Regression
Polynomial Regression
Curve Fitting
Logistic Regression
 

 
Statistical Modeling


Hidden Markov Model
Bayesian Network
 

Anomaly Analysis

Anomaly detection
Anomaly/Outlier analysis
 


Find Features

Feature Selection
Dimension Reduction
Feature Abstraction
Similarity-based Analysis
 


Pattern Discovery

Pattern evaluation
Pattern selection
Pattern interpretation
Pattern visualization
 








 


 

 Python Data Visualization Cookbook


Result Presentation / Result Visualization












Popular Tools
Pyplot / Plotly







KNIME





Tableau
SeaTable
Power BI
Dialogflow