Chengwei LEI, Ph.D.    Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

 Data Science


 

What is Data Science?

Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results. (Definition by Amazon)

 

Under the big umbrella of "Science of Intelligence", Data Science usually refers to Data Mining and Machine Learning.

 

What is the difference between Machine Learning and Data Mining?

Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases). Data mining uses many machine learning methods, but with different goals; on the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy.
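The contrast above can be sketched with a toy example. This is only an illustration under stated assumptions, not a method prescribed by the course: the data values, the labels, and the helper names fit_centroids, predict, and two_means are all hypothetical. The first part learns from labeled training data in order to predict the label of a new value (the machine learning goal); the second part receives no labels and discovers two groups on its own (the data mining goal), using a simple two-cluster k-means.

```python
# Toy 1-D measurements; all values and labels are made up for illustration.
train_x = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
train_y = ["low", "low", "low", "high", "high", "high"]


def fit_centroids(xs, ys):
    """Supervised: learn one mean value per known label."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Predict the label whose learned mean is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(xs, iterations=10):
    """Unsupervised: split unlabeled values into two groups (k-means, k=2)."""
    lo, hi = min(xs), max(xs)  # deterministic starting centers
    if lo == hi:
        return list(xs), []  # all values identical, nothing to separate
    for _ in range(iterations):
        # Assign each value to the nearer center, then recompute the centers.
        group_lo = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        group_hi = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo = sum(group_lo) / len(group_lo)
        hi = sum(group_hi) / len(group_hi)
    return group_lo, group_hi


# Prediction from known properties learned on the training data.
centroids = fit_centroids(train_x, train_y)
print(predict(centroids, 4.2))  # prints "high"

# Discovery of previously unknown structure in unlabeled data.
unlabeled = [2.1, 1.9, 2.0, 7.8, 8.1, 8.0]
print(two_means(unlabeled))
```

The same grouping routine could also serve as a preprocessing step for a classifier, which is the overlap between the two fields that the paragraph describes.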

 




Introduction to Data Science


Data Mining


Machine Learning


Data Visualization




Useful Math Skills

Mathematics for Machine Learning (local download)

Statistical Thinking for the 21st Century (GitHub)

 


Data Visualization

Python Data Visualization Cookbook

 









 

Ethical problems in data science

Introduction

Before we start






Data Collection

 
Data Preprocessing

 
Data Cleaning
Data Integration
Data Reduction
 


 
Intro to python

Python Crawler

Intro to R

Understand Data / Summarize Data

 
Data Exploration

 
Statistical Summaries
Standard Deviation (STD)
Confidence Interval
 
Visualized Summarization


Basic Plots
Histogram
Boxplot
Quantile plots
 

Simpson’s Paradox
 

  
WordClouds

Computer Science

Math/Statistics

Visualization


 
 Optimization Problems
 
Classic Optimization

 
Brute Force
Greedy Algorithm
Dynamic Programming
 
Stochastic Algorithms

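Genetic Algorithm (GA)
Evolution Strategy (ES)
Differential Evolution (DE)
Particle Swarm Optimization (PSO)
Ant Colony Optimization (ACO)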

 

Machine Learning Types

Supervised
Unsupervised
Semi-supervised

Data Splitting
Overfitting
 


Classification

Decision trees
Logistic regression
K-Nearest Neighbours
Naive Bayes
SVM
Artificial neural networks
 


Clustering

Hierarchical Clustering
K-means Clustering
Mean Shift Clustering
DBSCAN
Agglomerative Clustering
Affinity Propagation

Soft Clustering
 



Network Analysis

Graph Mining
Random Walk
PageRank
Web mining
Information network analysis
 

Trend Analysis

Time-series Prediction
Deviation Analysis
Sequential Pattern Mining
Periodicity Analysis
 

Biological Data Analysis

Motif Finding
Bio Sequence Analysis
Bio Network Analysis
Pathway Analysis
 



 Basic Tools
 
Measurements

 
Distance Measure
Correlation Coefficient
SSE / MSE / R-squared
 
Statistical Tests

u-test
t-test
ANOVA table
 

 

Use Data Carefully

Normalization
Data Sampling
Outlier detection
 



Anomaly Analysis

Anomaly detection
Anomaly/Outlier analysis
 



Regression

Linear Regression
Nonlinear Regression
Curve Fitting
Logistic Regression
 

 
Statistical Modeling


Hidden Markov Model
Bayesian Network
 


Find Features

Feature Selection
Dimension Reduction
Feature Abstraction
Similarity-based Analysis
 


Pattern Discovery

Pattern evaluation
Pattern selection
Pattern interpretation
Pattern visualization
 








 


 



Result Presentation / Result Visualization












Popular Tools
Pyplot / Plotly







KNIME





Tableau
SeaTable
Power BI
Dialogflow