Data Science
What is Data Science
Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results. (Definition by Amazon)
Under the big umbrella of the "Science of Intelligence", Data Science usually refers to Data Mining and Machine Learning.
What is the difference between Machine Learning and Data Mining
Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases). Data mining uses many machine learning methods, but with different goals; on the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy.
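The contrast above can be sketched on toy data. This is a minimal illustration rather than a reference implementation: the data points, the labels "small" and "large", and the function names `predict_1nn` and `two_means` are all made up for this example. The first function predicts a known property (a label) from labeled training data; the second discovers group structure in the same values with the labels removed.

```python
# Toy 1-D data (hypothetical): the same measurements, with and without labels.
labeled = [(1.0, "small"), (1.2, "small"), (0.8, "small"),
           (5.0, "large"), (5.3, "large"), (4.7, "large")]
unlabeled = [x for x, _ in labeled]


def predict_1nn(train, x):
    """Machine-learning view: predict a known property (the label) of a
    new point from labeled training data, using 1-nearest neighbour."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def two_means(values, iterations=10):
    """Data-mining view: discover previously unknown groups in unlabeled
    data (k-means with k=2 in one dimension).
    Assumes at least two distinct values."""
    c1, c2 = min(values), max(values)  # simple initialisation of the two centres
    for _ in range(iterations):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)


print(predict_1nn(labeled, 4.0))   # label of the closest training point
print(two_means(unlabeled))        # two groups found without any labels
```

The same grouping routine could also serve as a preprocessing step for a learner, which is the overlap the paragraph describes.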
Useful Math Skills
Mathematics for Machine Learning (local download)
Statistical Thinking for the 21st Century (GitHub)
Data Visualization
Python Data Visualization Cookbook
Ethical problems in data science
Data Collection
Data Preprocessing: Data Cleaning, Data Integration, Data Reduction
Python Crawler
Intro to R
Understand Data / Summarize Data
Data Exploration: Statistical Summaries, STD, Confidence Interval
Visualized Summarization: Basic Plots, Histogram, Boxplot, Quantile Plots
Simpson’s Paradox
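Simpson's Paradox is easiest to see with numbers. The sketch below uses hypothetical counts invented purely for illustration (the treatments "A" and "B" and the groups "easy" and "hard" are not from any real study). Treatment A has the higher success rate inside each group, yet B has the higher rate once the groups are pooled, because A was applied mostly to the hard cases.

```python
# Hypothetical (successes, trials) counts, invented purely for illustration.
counts = {
    "A": {"easy": (8, 10), "hard": (30, 100)},
    "B": {"easy": (70, 100), "hard": (2, 10)},
}


def subgroup_rate(treatment, group):
    """Success rate of one treatment within one group."""
    successes, trials = counts[treatment][group]
    return successes / trials


def overall_rate(treatment):
    """Success rate of one treatment with all groups pooled together."""
    successes = sum(s for s, _ in counts[treatment].values())
    trials = sum(t for _, t in counts[treatment].values())
    return successes / trials


for group in ("easy", "hard"):
    print(group, subgroup_rate("A", group), subgroup_rate("B", group))
print("overall", overall_rate("A"), overall_rate("B"))
```

The reversal comes from the unequal group sizes, which is why a summary statistic should be checked against the subgroups it aggregates.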
Computer Science
Math/Statistics
Visualization
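Optimization Problems
Classic Optimization: Brute Force, Greedy Algorithm, Dynamic Programming
Stochastic Algorithms: GA (Genetic Algorithm), ES (Evolution Strategies), DE (Differential Evolution), PSO (Particle Swarm Optimization), ACO (Ant Colony Optimization)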
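The three classic approaches listed above can be compared on one small problem. The sketch below picks the minimum-coin change problem as an assumed example (the problem choice, the coin set, and the function names are illustrative and not taken from the linked pages). Brute force enumerates every combination, the greedy algorithm always takes the largest coin that fits, and dynamic programming builds the answer from smaller amounts.

```python
from itertools import product


def brute_force(coins, amount):
    """Try every combination of coin counts; return the fewest coins, or None."""
    best = None
    ranges = [range(amount // c + 1) for c in coins]
    for counts in product(*ranges):
        if sum(n * c for n, c in zip(counts, coins)) == amount:
            total = sum(counts)
            if best is None or total < best:
                best = total
    return best


def greedy(coins, amount):
    """Always take the largest coin that fits; fast, but not always optimal."""
    used = 0
    for c in sorted(coins, reverse=True):
        used += amount // c
        amount %= c
    return used if amount == 0 else None


def dynamic_programming(coins, amount):
    """best[a] is the fewest coins needed to make amount a."""
    inf = float("inf")
    best = [0] + [inf] * amount
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and best[a - c] + 1 < best[a]:
                best[a] = best[a - c] + 1
    return best[amount] if best[amount] != inf else None


coins = [1, 3, 4]  # assumed positive integer denominations
print(brute_force(coins, 6), greedy(coins, 6), dynamic_programming(coins, 6))
```

With coins 1, 3 and 4 and a target of 6, the greedy choice (4 + 1 + 1) uses more coins than the optimum (3 + 3) found by the other two methods, which shows the trade-off between speed and optimality.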
Machine Learning
Types: Supervised, Unsupervised, Semi-supervised, Data Splitting, Overfitting
Classification: Decision Trees, Logistic Regression, K-Nearest Neighbours, Naive Bayes, SVM, Artificial Neural Networks
Clustering: Hierarchical Clustering, K-means Clustering, Mean Shift Clustering, DBSCAN, Agglomerative Clustering, Affinity Propagation, Soft Clustering
Network Analysis: Graph Mining, Random Walk, PageRank, Web Mining, Information Network Analysis
Trend Analysis: Time-series Prediction, Deviation Analysis, Sequential Pattern Mining, Periodicity Analysis
Biological Data Analysis: Motif Finding, Bio Sequence Analysis, Bio Network Analysis, Pathway Analysis
Basic Tools
Measurements: Distance Measure, Correlation Coefficient, SSE/MSE/R²
Statistical Tests: u-test, t-test, ANOVA table
Use Data Carefully: Normalization, Data Sampling, Outlier Detection
Anomaly Analysis: Anomaly Detection, Anomaly/Outlier Analysis
Regression: Linear Regression, Nonlinear Regression, Curve Fitting, Logistic Regression
Statistical Modeling: Hidden Markov Model, Bayesian Network
Find Features: Feature Selection, Dimension Reduction, Feature Abstraction, Similarity-based Analysis
Pattern Discovery: Pattern Evaluation, Pattern Selection, Pattern Interpretation, Pattern Visualization
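The measurements named in the list above (distance, correlation coefficient, SSE, MSE and R²) have short textbook definitions. The sketch below writes them out in plain Python, assuming Euclidean distance and the Pearson correlation coefficient as the intended variants; the function names are illustrative, and a library such as NumPy would normally be used in practice.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples.
    Assumes neither sample is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def sse(y_true, y_pred):
    """Sum of squared errors between observations and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mse(y_true, y_pred):
    """Mean squared error: SSE divided by the number of observations."""
    return sse(y_true, y_pred) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / total sum of squares.
    Assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    sst = sum((t - mean) ** 2 for t in y_true)
    return 1 - sse(y_true, y_pred) / sst


print(euclidean((0, 0), (3, 4)))
print(pearson([1, 2, 3], [2, 4, 6]))
print(sse([1, 2, 3], [1, 2, 4]), mse([1, 2, 3], [1, 2, 4]), r_squared([1, 2, 3], [1, 2, 4]))
```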
Result Presentation / Result Visualization
Popular Tools
KNIME
Tableau
SeaTable
Power BI
Dialogflow