Data Science
What is Data Science?
Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results. (Definition by Amazon)
Under the big umbrella of the "Science of Intelligence", Data Science usually refers to Data Mining and Machine Learning.
What is the difference between Machine Learning and Data Mining?
Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases). Data mining uses many machine learning methods, but with different goals; on the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy.
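The contrast above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions: the toy points, the labels, and the helper names `predict_1nn` and `kmeans` are invented here and do not come from any source cited on this page. The first function predicts a label from labelled training data (the machine-learning goal); the second groups the same points without ever seeing the labels (the data-mining goal of discovering unknown structure).

```python
import math

# Toy 2-D points and labels, invented purely for illustration.
POINTS = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
LABELS = ["a", "a", "a", "b", "b", "b"]  # used only by the supervised function


def predict_1nn(train_points, train_labels, query):
    """Prediction: return the label of the training point nearest to `query`."""
    nearest = min(range(len(train_points)),
                  key=lambda i: math.dist(train_points[i], query))
    return train_labels[nearest]


def kmeans(points, k, iterations=10):
    """Discovery: group points into k clusters without using any labels."""
    # Farthest-point initialisation keeps this sketch deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    assignment = []
    for _ in range(iterations):
        # Assign every point to the index of its nearest centroid.
        assignment = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment


if __name__ == "__main__":
    print(predict_1nn(POINTS, LABELS, (4.9, 5.1)))  # b
    print(kmeans(POINTS, 2))                        # [0, 0, 0, 1, 1, 1]
```

The clustering recovers the same two groups that the labels describe, but it only reports anonymous cluster indices; giving those groups a meaning is the interpretation step of knowledge discovery. In practice a library such as scikit-learn would normally replace both hand-written functions.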
Useful Math Skills
Mathematics for Machine Learning (local download)
Statistical Thinking for the 21st Century (GitHub)
Data Visualization
Python Data Visualization Cookbook
Ethical problems in data science
Data Collection
Data Preprocessing: Data Cleaning, Data Integration, Data Reduction
Python Crawler
Intro to R
Data Exploration / Understand Data
Data Summaries: Data Central Tendency, Standard Deviation / Variance
Data Plots / Visualized
Summarization: Basic Plots, Boxplot, Histogram, Quantile plots, Heatmap / Mesh
Data Visualization
Simpson’s Paradox
Computer Science
Math/Statistics
Visualization
Optimization
Problems
Classic Optimization: Brute Force, Greedy Algorithm, Dynamic Programming
Stochastic Algorithms: GA, ES, DE, PSO, ACO
Learning
Types: Supervised, Unsupervised, Semi-supervised, Data Splitting, Overfitting
Classification: Decision trees, Logistic regression, K-Nearest Neighbours, Naive Bayes, SVM, Artificial neural networks
Clustering: Hierarchical Clustering, K-means Clustering, Mean Shift Clustering, DBSCAN, Agglomerative Clustering, Affinity Propagation, Soft Clustering
Network Analysis: Graph Mining, Random Walk, PageRank, Web mining, Information network analysis
Trend Analysis: Time-series Prediction, Deviation Analysis, Sequential Pattern Mining, Periodicity Analysis
Biological Data
Analysis: Motif Finding, Bio Sequence Analysis, Bio Network Analysis, Pathway Analysis
Basic
Tools
Measurements: Distance Measure, Similarity/Correlation
Statistical Analysis: Z-test, t-test, U-test, Statistical Dependence, p-value, Confidence Interval, ANOVA table
Data Preprocessing: Normalization, Data Sampling, Data Cleaning
Evaluations: Regression Problems, Classification Problems, Clustering Problems
Regression: Simple Linear Regression, Polynomial Regression, Curve Fitting, Logistic Regression
Statistical Modeling: Hidden Markov Model, Bayesian Network
Anomaly Analysis: Anomaly detection, Anomaly/Outlier analysis
Find Features: Feature Selection, Dimension Reduction, Feature Abstraction, Similarity-based Analysis
Pattern Discovery: Pattern evaluation, Pattern selection, Pattern interpretation, Pattern visualization
Python Data Visualization Cookbook
Result Presentation / Result Visualization
Popular Tools
KNIME
Tableau
SeaTable
Power BI
Dialogflow