Python For Data Science Cheat Sheet
Scikit-Learn
Learn Python for data science Interactively at www.DataCamp.com
Scikit-learn
DataCamp
Learn Python for Data Science Interactively
Loading The Data
Also see NumPy & Pandas
Scikit-learn is an open source Python library that
implements a range of machine learning,
preprocessing, cross-validation and visualization
algorithms using a unified interface.
>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F','F'])
>>> X[X < 0.7] = 0
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse
matrices. Other types that are convertible to numeric arrays, such as Pandas
DataFrame, are also acceptable.
Create Your Model
Model Fi!ing
Prediction
Tune Your Model
Evaluate Your Model’s Performance
Grid Search
Randomized Parameter Optimization
Linear Regression
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression(normalize=True)
Support Vector Machines (SVM)
>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')
Naive Bayes
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
KNN
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassier(n_neighbors=5)
Supervised learning
>>> lr.t(X, y)
>>> knn.t(X_train, y_train)
>>> svc.t(X_train, y_train)
Unsupervised Learning
>>> k_means.t(X_train)
>>> pca_model = pca.t_transform(X_train)
Accuracy Score
>>> knn.score(X_test, y_test)
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)
Classification Report
>>> from sklearn.metrics import classication_report
>>> print(classication_report(y_test, y_pred))
Confusion Matrix
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))
Cross-Validation
>>> from sklearn.cross_validation import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))
Classification Metrics
>>> from sklearn.grid_search import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
"metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,
param_grid=params)
>>> grid.t(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)
>>> from sklearn.grid_search import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
"weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn,
param_distributions=params,
cv=4,
n_iter=8,
random_state=5)
>>> rsearch.t(X_train, y_train)
>>> print(rsearch.best_score_)
A Basic Example
>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().t(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassier(n_neighbors=5)
>>> knn.t(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)
Supervised Learning Estimators
Unsupervised Learning Estimators
Principal Component Analysis (PCA)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)
K Means
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)
Fit the model to the data
Fit the model to the data
Fit to data, then transform it
Preprocessing The Data
Standardization
Normalization
>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().t(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)
Training And Test Data
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X,
y,
random_state=0)
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().t(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)
Binarization
>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).t(X)
>>> binary_X = binarizer.transform(X)
Encoding Categorical Features
Supervised Estimators
>>> y_pred = svc.predict(np.random.random((2,5)))
>>> y_pred = lr.predict(X_test)
>>> y_pred = knn.predict_proba(X_test)
Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)
>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.t_transform(y)
Imputing Missing Values
Predict labels
Predict labels
Estimate probability of a label
Predict labels in clustering algos
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)
>>> imp.t_transform(X_train)
Generating Polynomial Features
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.t_transform(X)
Regression Metrics
Mean Absolute Error
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> mean_absolute_error(y_true, y_pred)
Mean Squared Error
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)
R² Score
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
Clustering Metrics
Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
V-measure
>>> from sklearn.metrics import v_measure_score
>>> metrics.v_measure_score(y_true, y_pred)
Estimator score method
Metric scoring functions
Precision, recall, f1-score
and support