Intro to Data Science and Python
Module 1: Introduction to Data Science
Data Scientist has been ranked the number one job on Glassdoor, and the average salary of a data scientist is over $120,000 in the United States according to Indeed! Data Science is a rewarding career that allows you to solve some of the world’s most interesting problems! In this module we will take a simple, broad look at Data Science and its different areas.
Module 2: No-Code Machine Learning
This module introduces the Designer tool, a drag and drop interface for creating machine learning models without writing any code. You will learn how to create a training pipeline that encapsulates data preparation and model training, and then convert that training pipeline to an inference pipeline that can be used to predict values from new data, before finally deploying the inference pipeline as a service for client applications to consume.
Hands-on : Design a Machine Learning Model using ML Studio
Module 3: Python – A Quick Review
In this module, you will get a quick review of the Python language. We will not go in depth, but we will discuss some important components of the language. Please note that this is not meant to be a comprehensive overview of Python or of programming in general.
Hands-on : Environment Setup and Jupyter Notebook Intro.
Hands-on : Python Code Along
Hands-on : Python Review Exercise
Presentation File:
Related Materials:
- Data Concept
- To know more about Data Concept you can click [this] link.
- ML Performance Metrics:
- AzureML End-to-End Lecture Series
Azure ML Cheat Sheet
Algorithm Summary
Source: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
Automobile Price Prediction
The Problem
This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process “symboling”. A value of +3 indicates that the auto is risky; -3 indicates that it is probably pretty safe.
The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc.) and represents the average loss per car per year.
Note: Several of the attributes in the database could be used as a “class” attribute.
Draw whatever inferences you can from the data and build a price prediction model.
The Data
This dataset consists of data from the 1985 Ward’s Automotive Yearbook.
Sources:
1) 1985 Model Import Car and Truck Specifications, 1985 Ward’s Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037
The dataset is available in Azure ML Studio under Saved Datasets > Samples > Automobile Price Data (Raw).
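In the raw file, missing values appear as “?”, so several numeric columns load as strings. A minimal pandas sketch of the first cleaning step (the column names and sample values below are illustrative, not the full dataset):

```python
import numpy as np
import pandas as pd

# Illustrative sample mimicking the raw data, where missing values appear as "?"
raw = pd.DataFrame({
    "symboling": [3, 1, -1],
    "normalized-losses": ["?", "164", "95"],
    "horsepower": ["111", "?", "114"],
    "price": ["13495", "16500", "?"],
})

# Replace "?" with NaN and convert the affected columns to numeric
df = raw.replace("?", np.nan)
for col in ["normalized-losses", "horsepower", "price"]:
    df[col] = pd.to_numeric(df[col])

# Rows with a missing target ("price") cannot be used for supervised training
train = df.dropna(subset=["price"])
print(len(train))  # 2 usable rows remain
```

After this step the remaining missing predictor values can be imputed (for example with a column median) before model training.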
Walmart Store Sales Forecasting
The Problem
One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line.
You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.
In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
The Data
stores.csv
This file contains anonymized information about the 45 stores, indicating the type and size of store.
train.csv
This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:
- Store – the store number
- Dept – the department number
- Date – the week
- Weekly_Sales – sales for the given department in the given store
- IsHoliday – whether the week is a special holiday week
test.csv
This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.
features.csv
This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
- Store – the store number
- Date – the week
- Temperature – average temperature in the region
- Fuel_Price – cost of fuel in the region
- MarkDown1-5 – anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
- CPI – the consumer price index
- Unemployment – the unemployment rate
- IsHoliday – whether the week is a special holiday week
Here are the data for download:
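Because holiday weeks count five times as much in the evaluation, the natural metric is a weighted mean absolute error. A small sketch of how that weighting can be computed (a sketch of the idea, not the official scoring script):

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    """Weighted mean absolute error: holiday weeks get weight 5, others weight 1."""
    w = np.where(is_holiday, 5.0, 1.0)
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.sum(w * err) / np.sum(w)

# Toy example: one holiday week (error 12) and one regular week (error 6)
print(wmae([100, 200], [112, 194], [True, False]))  # (5*12 + 1*6) / 6 = 11.0
```

Note how the holiday error dominates the score: a model that only fits non-holiday weeks will look much worse under this metric than under plain MAE.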
Bike Sharing Demand
Forecast use of a city bikeshare system
The Problem
Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.
Apart from the interesting real-world applications of bike-sharing systems, the characteristics of the data they generate make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike-sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.
Data Fields
datetime – hourly date + timestamp
season – 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday – whether the day is considered a holiday
workingday – whether the day is neither a weekend nor holiday
weather – 1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp – temperature in Celsius
atemp – “feels like” temperature in Celsius
humidity – relative humidity
windspeed – wind speed
casual – number of non-registered user rentals initiated
registered – number of registered user rentals initiated
count – number of total rentals
Datasets
The dataset is available in Azure ML Studio under Saved Datasets > Samples > Bike Rental UCI Dataset.
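Since demand varies strongly by hour and weekday, a common first step is to expand the datetime field into explicit features with pandas (a minimal sketch on two illustrative rows):

```python
import pandas as pd

# Derive hour-of-day and day-of-week features from the datetime field
df = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 13:00:00"],
    "count": [16, 84],
})
df["datetime"] = pd.to_datetime(df["datetime"])
df["hour"] = df["datetime"].dt.hour
df["dayofweek"] = df["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
print(df[["hour", "dayofweek"]].values.tolist())  # [[0, 5], [13, 5]]
```

These derived columns let a model capture the daily and weekly rental cycles that the raw timestamp hides.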
Heart Diseases Prediction
The Problem
The term “heart disease” is often used interchangeably with the term “cardiovascular disease”. Cardiovascular disease generally refers to conditions that involve narrowed or blocked blood vessels that can lead to a heart attack, chest pain (angina) or stroke. Other heart conditions, such as those that affect your heart’s muscle, valves or rhythm, also are considered forms of heart disease.
This makes heart disease a major concern to be dealt with. But it is difficult to identify heart disease because of several contributory risk factors such as diabetes, high blood pressure, high cholesterol, abnormal pulse rate, and many other factors. Due to such constraints, scientists have turned towards modern approaches like Data Science and Machine Learning for predicting the disease.
The Data
In this practice, we will apply machine learning approaches (and eventually compare them) to classify whether or not a person is suffering from heart disease, using one of the most widely used datasets: the Cleveland Heart Disease dataset from the UCI Repository.
Data Source URL : http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
Solution
There are two ways we can do this: either we can solve it the Azure ML Designer (no-code) way, or we can do it using a Python notebook.
- Let’s do this using Azure ML Designer (Azure ML Studio -Classic)
- If you’re Python savvy, you can follow [this] link to get the .ipynb files; to read the blog about this problem scope, you can visit this [link].
Hints:
- Edit Metadata info and set the new column names: age,sex,chestpaintype,resting_blood_pressure,serum_cholestrol,fasting_blood_sugar,resting_ecg,max_heart_rate,exercise_induced_angina,st_depression_induced_by_exercise,slope_of_peak_exercise,number_of_major_vessel,thal,heart_disease_diag
- Edit Metadata info and change the data type to Integer for the following columns: heart_disease_diag,age,sex
- Edit Metadata info and mark the following columns as categorical: sex,chestpaintype,exercise_induced_angina,number_of_major_vessel,slope_of_peak_exercise,fasting_blood_sugar,thal,resting_ecg
- Clean Missing Value
- Apply SQL Transformation
SELECT *,
CASE
WHEN heart_disease_diag < 1 THEN 0
ELSE 1
END AS HeartDiseaseCat
FROM t1;
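If you take the Python notebook route instead of the Designer, the same transformation is a one-liner in pandas: the diagnosis column in the Cleveland data takes values 0-4, and the CASE expression above collapses 1-4 into a single positive class.

```python
import pandas as pd

# Equivalent of the SQL CASE expression: diagnosis values >= 1 become class 1
t1 = pd.DataFrame({"heart_disease_diag": [0, 1, 2, 3, 0, 4]})
t1["HeartDiseaseCat"] = (t1["heart_disease_diag"] >= 1).astype(int)
print(t1["HeartDiseaseCat"].tolist())  # [0, 1, 1, 1, 0, 1]
```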
Dataset Download
Mindmap for Python
https://gitmind.com/app/doc/e105869745
Google Colab Notebook (Python Crash Course)
- Click here
Assessment 01 Zip
Assessment 01 Solution
- Click here
Assessment 02 Zip
Assessment 02 Solution
- Click here
I wanted to point out some helpful links for practice. Don’t worry about being able to do these exercises; I just want you to be aware of the links so you can visit them later.
Basic Practice:
More Mathematical (and Harder) Practice:
https://projecteuler.net/archives
List of Practice Problems:
http://www.codeabbey.com/index/task_list
A SubReddit Devoted to Daily Practice Problems:
https://www.reddit.com/r/dailyprogrammer
A very tricky website with very few hints and tough problems (not for beginners, but still interesting):
Mindmap
https://gitmind.com/app/doc/e139096651
Regression Performance Metrics
Classification Performance Metrics
Open Jupyter Notebook
- Please follow this link to learn how to open Jupyter Notebook from your local machine (in your specified directory)
- For a complete User Manual check out the Bryn Mawr College Computer Science Guide.
Data Analysis and Machine Learning ( Preamble )
Module 1: Python for Data Analysis ( Pandas )
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
Hands-on : Using Python Pandas Library
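A small taste of the workflow the hands-on session covers: building a DataFrame, filtering rows with a boolean mask, and aggregating with groupby (the data here is made up for illustration):

```python
import pandas as pd

# Core pandas operations: construct, filter, group, aggregate
df = pd.DataFrame({
    "city": ["Dhaka", "Dhaka", "Chittagong"],
    "sales": [100, 150, 80],
})
high = df[df["sales"] > 90]                  # boolean filtering keeps 2 rows
totals = df.groupby("city")["sales"].sum()   # split-apply-combine
print(totals["Dhaka"])  # 250
```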
Module 2: Data Visualization/EDA/Data Analysis ( Descriptive Statistics and Seaborn)
In this part of the course we will discuss methods of descriptive statistics. You will learn what cases and variables are and how you can compute measures of central tendency (mean, median and mode) and dispersion (standard deviation and variance). Next, we discuss how to assess relationships between variables, and we introduce the concepts correlation and regression.
Hands-on : Using Python Seaborn Visualization Library
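The descriptive measures discussed above are one-liners in pandas. A minimal sketch on a made-up series:

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])
print(s.mean())     # 5.0  (central tendency)
print(s.median())   # 4.5
print(s.mode()[0])  # 4    (most frequent value)
print(s.std())      # sample standard deviation (square root of the variance)

# Correlation between two variables
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 6, 8, 10])
print(x.corr(y))    # 1.0  (perfect positive linear relationship)
```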
Module 3: Data Analytics / Machine Learning
In this part of the course we will discuss one of the best known Machine Learning Library Scikit-Learn, a package that provides efficient versions of a large number of common algorithms. Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation. A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.
Hands-on : Using Python scikit-learn Library
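The uniform API mentioned above means every estimator follows the same fit/predict pattern, so swapping algorithms is essentially a one-line change. A tiny sketch on toy data:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# Every scikit-learn estimator exposes the same fit/predict interface,
# so switching algorithms changes only the constructor call.
preds = {}
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X, y)
    preds[type(model).__name__] = float(model.predict([[5]])[0])
print(preds)
```

The linear model extrapolates (predicting about 10 for x=5), while the tree can only return values seen in its leaves, which is a nice illustration of how model choice shapes predictions.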
Code Along for Python Pandas (Google Colab link)
- Click here
Exercise: 1 for Pandas
- Download the .ipynb file from here [SF Salaries Exercise] (it’s a zip file; unzip it before use)
- Download the dataset from here [Salaries] (it’s a zip file; unzip it before use)
- Solution Colab Link is here
Exercise: 2 for Pandas
- Download the .ipynb file from here [Ecommerce Purchases Exercise] (it’s a zip file; unzip it before use)
- Download the dataset from here [Ecommerce Purchases] (it’s a zip file; unzip it before use)
- Solution Colab Link is here
Code Along for Python Seaborn (Google Colab link)
- Click here
Exercise: 1 for Seaborn
- Download the .ipynb file from here [Seaborn Exercises] (it’s a zip file; unzip it before use)
- Solution Colab Link is here
Exercise: 2 for Pandas with Seaborn
- Download the .ipynb file from here [01-911 Calls Data Capstone Project] (it’s a zip file; unzip it before use)
- Download the dataset from here [911] (it’s a zip file; unzip it before use)
- Solution Colab Link is here
Code Along for Python Machine Learning – Sklearn (Google Colab link)
- Click here
Dataset: Download Sample Dataset USA_Housing
Feature Engineering and EDA ( Comprehensive )
Click here to see a glimpse of the activities needed before putting your data into training with a machine learning algorithm.
Download Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
Notebook: Click here
- Problem Statement:
Customer churn prediction refers to the process of identifying customers who are likely to stop using a product or service in the near future. It is a valuable predictive analytics technique used by businesses to forecast customer behavior and take proactive measures to retain customers.
- Objective :
The objective of this project is to predict whether or not a customer is about to churn.
- Kaggle Dataset Link :
https://www.kaggle.com/datasets/blastchar/telco-customer-churn - Dataset [ archive ]
- What you have to do:
- A complete Exploratory Data Analysis (EDA) with Feature Engineering
- Notebook Link
- Ask your EDA Questions and Find Answers. Click here
Create Model with Machine Learning Algorithm(s)
Step 1: Create ML Model by applying appropriate Machine Learning Algorithm depending on Dataset and Business Objective
Regression models (both linear and non-linear) are used for predicting a real value, like salary for example. If your independent variable is time, then you are forecasting future values; otherwise your model is predicting present but unknown values. Regression techniques vary from Linear Regression to SVR and Random Forest Regression.
In this part, you will understand and learn how to implement the following Machine Learning Regression models:
- Simple Linear Regression
- Multiple Linear Regression
- Polynomial Regression
- Support Vector Regression (SVR)
- Decision Tree Regression
- Random Forest Regression
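As a concrete starting point, here is the first model on the list, Simple Linear Regression, fit on synthetic data whose true relationship (y = 3x + 7, an assumption made up for this sketch) we can check against the learned parameters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 7 plus a little Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 7 + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # close to the true 3 and 7
```

The other regressors on the list plug into the exact same fit call, which is what makes comparing them straightforward.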
Unlike regression, where you predict a continuous number, you use classification to predict a category. There is a wide variety of classification applications, from medicine to marketing. Classification models include linear models like Logistic Regression and SVM, and nonlinear ones like K-NN, Kernel SVM, and Random Forests.
In this part, you will understand and learn how to implement the following Machine Learning Classification models:
- Logistic Regression
- K-Nearest Neighbors (K-NN)
- Support Vector Machine (SVM)
- Kernel SVM
- Naive Bayes
- Decision Tree Classification
- Random Forest Classification
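For instance, the first classifier on the list, Logistic Regression, on synthetic and easily separable data (illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data: class 1 whenever the single feature is positive
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)

# Hold out a test set so the accuracy estimate is honest
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
clf = LogisticRegression().fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # accuracy on the held-out set
```

The other classifiers on the list (K-NN, SVM, trees, and so on) swap in with the same fit/score calls.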
Step 2: Evaluate ML Model performance using Performance Metrics
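Regression is typically scored with MAE, MSE, and R², and classification with accuracy and a confusion matrix; all of these live in sklearn.metrics. A minimal sketch on hand-made predictions:

```python
from sklearn.metrics import (
    mean_absolute_error, r2_score, accuracy_score, confusion_matrix,
)

# Regression metrics on toy predictions
y_true, y_pred = [3.0, 5.0, 7.0], [2.5, 5.0, 8.0]
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(r2_score(y_true, y_pred))             # fraction of variance explained

# Classification metrics on toy predictions
c_true, c_pred = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
print(accuracy_score(c_true, c_pred))       # 0.8 (4 of 5 correct)
print(confusion_matrix(c_true, c_pred))     # rows = true class, cols = predicted
```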
Step 3: Fine-Tune the Hyperparameters of ML Model to optimize its performance
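A common way to tune hyperparameters is an exhaustive grid search with cross-validation, for example via scikit-learn's GridSearchCV (the grid below is a small illustrative choice, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Try every combination in the grid, scoring each with 3-fold cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

For larger grids, RandomizedSearchCV samples the grid instead of enumerating it, which usually finds good settings far faster.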
Step 4: Explain Blackbox ML Model to understand the Local and Global Explainability using SHAP (SHapley Additive exPlanations)
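In practice this step uses the shap library; conceptually, SHAP assigns each feature its Shapley value from game theory: the feature's average marginal contribution across all orderings. A from-scratch sketch on a hypothetical 2-feature model, where an "absent" feature is replaced by a baseline of 0 (an illustrative simplification):

```python
from itertools import permutations

def model(x1, x2):
    return 3 * x1 + 2 * x2  # toy "black box" we want to explain

def value(subset, x):
    # Model output when only the features in `subset` are "present";
    # absent features fall back to the baseline value 0.
    filled = [x[i] if i in subset else 0 for i in range(2)]
    return model(*filled)

x = [1.0, 4.0]
phi = [0.0, 0.0]
orders = list(permutations(range(2)))
for order in orders:          # average marginal contribution over all orderings
    seen = set()
    for i in order:
        before = value(seen, x)
        seen.add(i)
        phi[i] += (value(seen, x) - before) / len(orders)

print(phi)  # [3.0, 8.0]; the contributions sum to model(1, 4) = 11
```

This additivity (attributions summing to the prediction minus the baseline) is exactly the property that makes SHAP explanations easy to read, both per-prediction (local) and aggregated over a dataset (global).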
MLOps
- Download and install VSCode
- Organize your Data Science and Machine Learning Project using Cookiecutter Data Science