Series 3 : Analysis and ML

Practice #1

Background Information:

We use Python Pandas and Python Matplotlib/Seaborn to analyze and answer business questions about 12 months' worth of sales data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc.

We start by cleaning our data. Tasks during this section include:

Drop NaN values from DataFrame
Removing rows based on a condition
Change the type of columns (to_numeric, to_datetime, astype)
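
The cleaning tasks above could look roughly like the following sketch. The column names (Order ID, Quantity Ordered, Price Each, Order Date), the date format, and the sample rows are assumptions made for illustration; check them against the actual CSV files before reusing this.

```python
import pandas as pd

# Tiny stand-in for the sales data; column names and values are assumed.
raw = pd.DataFrame({
    "Order ID": ["1001", None, "Order ID", "1002"],
    "Quantity Ordered": ["2", None, "Quantity Ordered", "1"],
    "Price Each": ["11.95", None, "Price Each", "700.00"],
    "Order Date": ["04/19/19 08:46", None, "Order Date", "04/07/19 22:30"],
})

# 1. Drop rows in which every value is NaN.
clean = raw.dropna(how="all")

# 2. Remove rows based on a condition (here: header rows repeated in the data).
clean = clean[clean["Order ID"] != "Order ID"].copy()

# 3. Change column types with to_numeric, astype and to_datetime.
clean["Quantity Ordered"] = pd.to_numeric(clean["Quantity Ordered"])
clean["Price Each"] = clean["Price Each"].astype(float)
clean["Order Date"] = pd.to_datetime(clean["Order Date"], format="%m/%d/%y %H:%M")
```

Filtering with a boolean mask, as in step 2, is the usual pandas way to remove rows by condition.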

Once we have cleaned up our data a bit, we move to the data exploration section. In this section we explore 4 high-level business questions related to our data:

What was the best month for sales? How much was earned that month?
What city sold the most product?
What time should we display advertisements to maximize the likelihood of customers buying product?
What product sold the most? Why do you think it sold the most?

To answer these questions we walk through many different pandas & matplotlib methods. They include:

Concatenating multiple csvs together to create a new DataFrame (pd.concat)
Adding columns
Parsing cells as strings to make new columns (.str)
Using the .apply() method
Using groupby to perform aggregate analysis
Plotting bar charts and line graphs to visualize our results
Labeling our graphs
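
As a rough sketch of how several of the methods above fit together, the example below answers the "best month" question on a few made-up rows. The column names and address format are assumptions, and plotting is left out so that the sketch only needs pandas.

```python
import pandas as pd

# Two stand-ins for monthly CSV files; in practice each would come from pd.read_csv.
april = pd.DataFrame({
    "Quantity Ordered": [2, 1],
    "Price Each": [11.95, 700.0],
    "Order Date": pd.to_datetime(["2019-04-19 08:46", "2019-04-07 22:30"]),
    "Purchase Address": ["917 1st St, Dallas, TX 75001",
                         "682 Chestnut St, Boston, MA 02215"],
})
may = pd.DataFrame({
    "Quantity Ordered": [1],
    "Price Each": [150.0],
    "Order Date": pd.to_datetime(["2019-05-02 13:05"]),
    "Purchase Address": ["333 8th St, Los Angeles, CA 90001"],
})

# Concatenate the monthly frames into one DataFrame.
sales = pd.concat([april, may], ignore_index=True)

# Add columns derived from existing ones.
sales["Month"] = sales["Order Date"].dt.month
sales["Sales"] = sales["Quantity Ordered"] * sales["Price Each"]

# Parse a cell as a string to build a new column (.str) ...
sales["City"] = sales["Purchase Address"].str.split(",").str[1].str.strip()
# ... or do the same kind of parsing with .apply().
sales["State"] = sales["Purchase Address"].apply(lambda a: a.split(",")[2].split()[0])

# Aggregate with groupby: total sales per month, then pick the best month.
monthly_sales = sales.groupby("Month")["Sales"].sum()
best_month = monthly_sales.idxmax()
```

The resulting monthly_sales Series can be handed to a bar chart, with the month on the x-axis and the sales total on the y-axis.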

Dataset

One Drive

https://rb.gy/t74leq (Short Link)
https://1drv.ms/u/s!AlmMDylfGrbdhCxrkyx7uN4ZxeeZ?e=sQl4vQ (Full Link)

Colab Link

https://colab.research.google.com/drive/1Htd0Dop4YQZh1-b_YM4wIrpGEpCo5Qex?usp=sharing

Practice #2

Used Dataset

World Happiness Report from 2017 to 2019

The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be.

The World Happiness Report was written by a group of independent experts acting in their personal capacities. Any views expressed in this report do not necessarily reflect the views of any organization, agency or program of the United Nations.

Questions

Which are the happiest countries in the world?
Which are the least happy countries in the world?
What is the most significant factor associated with happiness?
Top 10 countries for each criteria
Correlation coefficients between factors (e.g., how is happiness related to life expectancy)
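
A minimal sketch of how these questions might be approached with pandas follows. The column names and all values are invented for illustration and are not taken from the World Happiness Report.

```python
import pandas as pd

# Illustrative values only; not real report data.
happiness = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Score": [7.5, 6.9, 5.8, 4.6, 3.4],
    "GDP per capita": [1.40, 1.30, 1.00, 0.70, 0.35],
    "Healthy life expectancy": [1.00, 0.95, 0.80, 0.55, 0.30],
})

# Most and least happy countries by score.
top = happiness.nlargest(2, "Score")["Country"].tolist()
bottom = happiness.nsmallest(2, "Score")["Country"].tolist()

# Pearson correlation coefficients between the numeric factors.
corr = happiness.drop(columns="Country").corr()

# Factor most strongly correlated with the score (excluding the score itself).
strongest = corr["Score"].drop("Score").abs().idxmax()
```

Correlation only measures linear association, so a high coefficient is a starting point for the "most significant factor" question rather than a final answer.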

Colab Link:

https://colab.research.google.com/drive/1EYnhz9b2MPg6OxJ0-eAh4jZMhDKAMKcg?usp=sharing

Practice #3

Background Information:

A financial technology (fintech) company has launched a mobile app. The app brings financial services such as bank loans and savings together in one place. It has two versions, free and premium. The free version contains basic features; customers who want to use the premium features have to pay to unlock them.

The main goal of the company is to sell the premium version of the app with low advertising cost, but it does not know how to do this. For that reason, it offered the premium features in the free version for 24 hours to collect data on customer behavior. The company has now hired you to find insights in the collected data.

Your main job is to predict whether a new customer is interested in buying the product. If a customer will buy the product anyway, there is no need to give that customer an offer and lose revenue. Offers should go only to customers who are interested in the premium version but cannot afford its cost. That way the company can target its offers and earn more money.

Dataset

One Drive

https://rb.gy/t74leq (Short Link)
https://1drv.ms/u/s!AlmMDylfGrbdhCxrkyx7uN4ZxeeZ?e=sQl4vQ (Full Link)

Know about dataset

As you can see in the fineTech_appData DataFrame, there is data for 50,000 users with 12 different features. Each feature is described briefly below.

1. user: Unique ID for each user.

2. first_open: Date (yy-mm-dd) and time (Hour:Minute:Seconds:Milliseconds) at which the user first logged in to the app.

3. dayofweek: The day of the week on which the user logged in.

0: Sunday
1: Monday
2: Tuesday
3: Wednesday
4: Thursday
5: Friday
6: Saturday

4. Hour: Time of day, in 24-hour format, at which the customer logged in. It is correlated with the dayofweek column.

5. age: The age of the registered user.

6. screen_list: The name of multiple screens seen by customers, which are separated by a comma.

7. numscreens: The total number of screens seen by customers.

8. minigame: The app contains small games related to finance. 1 if the customer played a mini-game, otherwise 0.

9. used_premium_feature: If the customer used the premium feature of the app then 1 otherwise 0.

10. enrolled: If the user bought a premium feature app then 1 otherwise 0.

11. enrolled_date: Date (yy-mm-dd) and time (Hour:Minute:Seconds:Milliseconds) at which the user bought the premium version of the app.

12. liked: Each screen of the app has a like button. 1 if the customer liked a screen, otherwise 0.
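
To make the column descriptions concrete, here is a small sketch that parses the two timestamp columns and derives a couple of helper columns. The two rows are made up, and the derived names hours_to_enroll and screens_counted are hypothetical; they are not part of the dataset.

```python
import pandas as pd

# Two made-up rows that follow the column descriptions above.
app = pd.DataFrame({
    "user": [1, 2],
    "first_open": ["2013-06-01 10:00:00.000", "2013-06-02 21:30:00.000"],
    "screen_list": ["Home,Loan,Credit1", "Home"],
    "enrolled": [1, 0],
    "enrolled_date": ["2013-06-01 13:00:00.000", None],
})

# Parse the timestamp columns; a missing enrolment date becomes NaT.
app["first_open"] = pd.to_datetime(app["first_open"])
app["enrolled_date"] = pd.to_datetime(app["enrolled_date"])

# Hours between first opening the app and enrolling (NaN if never enrolled).
app["hours_to_enroll"] = (
    (app["enrolled_date"] - app["first_open"]).dt.total_seconds() / 3600
)

# Number of screens recovered from the comma-separated screen_list.
app["screens_counted"] = app["screen_list"].str.split(",").str.len()
```

A column such as hours_to_enroll is one possible way to study how quickly users who enroll make that decision.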

Practice #4

Background Information:

For Telco companies it is key to attract new customers and at the same time avoid contract terminations (=churn) to grow their revenue generating base. Looking at churn, different reasons trigger customers to terminate their contracts, for example better price offers, more interesting packages, bad service experiences or change of customers’ personal situations.

Churn analytics provides valuable capabilities to predict customer churn and also define the underlying reasons that drive it. The churn metric is mostly shown as the percentage of customers that cancel a product or service within a given period (mostly months). If a Telco company had 10 million customers on the 1st of January and received 500K contract terminations by the 31st of January, the monthly churn for January would be 5%.
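
The January example can be checked with a few lines of Python. The function name churn_rate is just an illustrative choice.

```python
def churn_rate(customers_at_start: int, terminations: int) -> float:
    """Return churn for a period as a percentage of the starting customer base."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * terminations / customers_at_start


# 10 million customers on 1 January, 500K terminations during January.
january_churn = churn_rate(10_000_000, 500_000)
print(f"{january_churn:.1f}%")  # prints "5.0%"
```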

Telcos apply machine learning models to predict churn on an individual customer basis and take counter measures such as discounts, special offers or other gratifications to keep their customers. A customer churn analysis is a typical classification problem within the domain of supervised learning.

Based on the introduction the key challenge is to predict if an individual customer will churn or not. To accomplish that, machine learning models are trained based on 80% of the sample data. The remaining 20% are used to apply the trained models and assess their predictive power with regards to “churn / not churn”. A side question will be, which features actually drive customer churn. That information can be used to identify customer “pain points” and resolve them by providing goodies to make customers stay.
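
A minimal sketch of the 80/20 split, using only pandas, is shown below. The feature names and values are made up, and the fixed random seed is an arbitrary choice to keep the split repeatable. scikit-learn's train_test_split is a common alternative for the same step.

```python
import pandas as pd

# Made-up customer records; "churn" is the label to predict.
customers = pd.DataFrame({
    "tenure_months": [1, 24, 3, 60, 12, 2, 36, 5, 48, 8],
    "monthly_charge": [70, 40, 85, 30, 55, 90, 45, 80, 35, 75],
    "churn": [1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
})

# Shuffle and take 80% of the rows for training.
train = customers.sample(frac=0.8, random_state=42)
# The remaining 20% are held out to assess predictive power.
test = customers.drop(train.index)

# Separate the features from the label for both parts.
X_train, y_train = train.drop(columns="churn"), train["churn"]
X_test, y_test = test.drop(columns="churn"), test["churn"]
```

Keeping the held-out rows untouched during training is what makes the later assessment of "churn / not churn" predictions meaningful.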

Dataset

One Drive

https://rb.gy/t74leq (Short Link)
https://1drv.ms/u/s!AlmMDylfGrbdhCxrkyx7uN4ZxeeZ?e=sQl4vQ (Full Link)

Colab Link

https://colab.research.google.com/drive/1cUypUtkskwFOf7mbV8uh4LKyZoAl3tP-?usp=sharing

Resources

Machine Learning (Model) Performance Metrics

Model Performance Metrics for Classification Model

Model Performance Metrics for Regression Model

Series 2 : Intro to Statistics and EDA

e-Attendance

Please follow the steps to complete your e-attendance

Click this link

Course Outline

Module 1: Data and Visualization ( Descriptive Statistics and Seaborn)

In the first part of the course we will discuss methods of descriptive statistics. You will learn what cases and variables are and how you can compute measures of central tendency (mean, median and mode) and dispersion (standard deviation and variance). Next, we discuss how to assess relationships between variables, and we introduce the concepts of correlation and regression.
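
As a small illustration of the measures named above, the sketch below uses Python's built-in statistics module on a made-up list of values. It uses the population versions of variance and standard deviation; the sample versions are statistics.variance and statistics.stdev.

```python
import statistics

# Made-up observations for one variable.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency.
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Measures of dispersion (population versions).
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)
```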

Hands-on : Using Python Seaborn Visualization Library

Module 2: Python for Data Analysis ( Numpy)

NumPy, which stands for Numerical Python, is a library consisting of multidimensional array objects and a collection of routines for processing those arrays. NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.
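
A short sketch of the multidimensional array object and a few of the routines that work on it, using arbitrary example values:

```python
import numpy as np

# A 2-dimensional array with 2 rows and 3 columns.
a = np.array([[1, 2, 3], [4, 5, 6]])

shape = a.shape              # (rows, columns)
dims = a.ndim                # number of dimensions

# Element-wise arithmetic applies to the whole array without a loop.
scaled = a * 10

# Aggregations can run along an axis: 0 = down the rows, 1 = across the columns.
col_means = a.mean(axis=0)
row_sums = a.sum(axis=1)
```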

Hands-on : Using Python Numpy Library

Module 3: Python for Data Analysis ( Pandas )

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Hands-on : Using Python Pandas Library

Presentation File

Resources ( Books )

Resources ( Others and Colab Link)

Colab link

Seaborn

https://colab.research.google.com/drive/1vNDDqRudgB93JloRWwPnmiMgNlge7INW?usp=sharing

https://deepnote.com/project/c331f89d-b7a8-4fd8-b634-658950c235b4

Numpy

https://colab.research.google.com/drive/1vBUs-WbzGvNVUKAyIDTuTq3V4I3Zg1qo?usp=sharing

https://deepnote.com/project/c331f89d-b7a8-4fd8-b634-658950c235b4#%2FNumpy_Notebook.ipynb

Pandas

https://colab.research.google.com/drive/1I3EHK6rj6jhgCfH4UCP9o1XpwNuZ3ToN?usp=sharing

911 Prob

https://colab.research.google.com/drive/1M55VxiOhze4mBVxDJ2cJvimJTUQO6pe_?usp=sharing

Linear Regression

https://colab.research.google.com/drive/1PjaD5zKtFkO9wJFuiDEAIR7COtct77fo?usp=sharing

Resources (Shared Links)

One Drive

https://rb.gy/t74leq (Short Link)
https://1drv.ms/u/s!AlmMDylfGrbdhCxrkyx7uN4ZxeeZ?e=sQl4vQ (Full Link)

Google Drive

https://bit.ly/robipython (Short Link)
https://drive.google.com/drive/folders/1Aa79OIK8E7As5LqxkuSXboLu9VC8t06E?usp=sharing (Full Link)

Jupyter Notebook

Please follow this link to know about how to open Jupyter Notebook from your Local Machine (in your specified directory)

Series 1 : Intro to Data Science and Python

e-Attendance

Please follow the steps to complete your e-attendance

Click this link

Prerequisite

Primary Requirements

Some programming experience (e.g. C, C++, Java, QBasic (!) etc. )
At least high school level math skills will be required.
Passion to learn

IDE Requirements

A popular environment for Data Science is Anaconda, a Python distribution that bundles Jupyter Notebook and other tools. You can download and install it from here. Make sure you download the Python 3.7 distribution.

F.A.Q

» I don’t have the admin permission to install any software (Don’t worry !)

Google Colab [if you already have Google Account ]
Azure Notebook [if you already have Microsoft Account]
Both are free to use!

» Is there any way I can do Machine Learning Analytics with Less Code or No Code?

Yes ! We can.

» Really ? How to do that?

Step 1 : Please go to this site https://studio.azureml.net/

Step 2 : Use any Microsoft Account(youremail@hotmail.com / outlook.com) to Register and Login

Course Outline

Module 1: Introduction to Data Science

Data Scientist has been ranked the number one job on Glassdoor and the average salary of a data scientist is over $120,000 in the United States according to Indeed! Data Science is a rewarding career that allows you to solve some of the world’s most interesting problems! In this module we give a simple introduction to Data Science and its different areas.

Module 2: No-Code Machine Learning

This module introduces the Designer tool, a drag and drop interface for creating machine learning models without writing any code. You will learn how to create a training pipeline that encapsulates data preparation and model training, and then convert that training pipeline to an inference pipeline that can be used to predict values from new data, before finally deploying the inference pipeline as a service for client applications to consume.

Hands-on : Design a Machine Learning Model using ML Studio

Module 3: Python – A Quick Review

In this module, you will get a quick review of the Python language. We will not go into depth, but we will discuss some important components of the language. Please note, this is not meant to be a comprehensive overview of Python or programming in general.

Hands-on : Environment Setup and Jupyter Notebook Intro.

Hands-on : Python Code Along

Hands-on : Python Review Exercise

Presentation File

Machine Learning – Introduction

Related Materials:

Data Concept
- To know more about Data Concept you can click [this] link.
ML Performance Metrics:
- For Supervised Machine Learning
AzureML End-to-End Lecture Series
- Doing Machine Learning with Azure ML (3-day course)

Cheat Sheet

Azure ML Cheat Sheet

Algorithm Summary

Source: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

Practice

Automobile Price Prediction

The Problem

This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process “symboling”. A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.

The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc…), and represents the average loss per car per year.

Note: Several of the attributes in the database could be used as a “class” attribute.

Draw whatever inferences you can from the data and build a price prediction model.

The Data

This dataset consists of data from the 1985 Ward’s Automotive Yearbook. Here are the sources:

Sources:

1) 1985 Model Import Car and Truck Specifications, 1985 Ward’s Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037

The dataset is available in Azure ML Studio under Saved Datasets > Samples > Automobile Price Data (Raw)

Walmart Store Sales Forecasting

The Problem

One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line.

You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
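
One way to read the five-fold holiday weighting is as a weighted mean absolute error, sketched below. The use of absolute error is an assumption here, since the text above only states the weighting, and the function name weighted_mae and the sample numbers are hypothetical.

```python
def weighted_mae(actual, predicted, is_holiday, holiday_weight=5.0):
    """Mean absolute error in which holiday weeks count holiday_weight times."""
    weights = [holiday_weight if holiday else 1.0 for holiday in is_holiday]
    total_error = sum(
        w * abs(a - p) for w, a, p in zip(weights, actual, predicted)
    )
    return total_error / sum(weights)


# One normal week (error 10) and one holiday week (error 20, weighted 5x).
score = weighted_mae([100.0, 200.0], [90.0, 220.0], [False, True])
```

With these numbers the holiday week contributes 100 of the 110 weighted error units, which shows how strongly errors on holiday weeks dominate the score.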

The Data

stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of store.

train.csv

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

Store – the store number
Dept – the department number
Date – the week
Weekly_Sales – sales for the given department in the given store
IsHoliday – whether the week is a special holiday week

test.csv

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

features.csv

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

Store – the store number
Date – the week
Temperature – average temperature in the region
Fuel_Price – cost of fuel in the region
MarkDown1-5 – anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
CPI – the consumer price index
Unemployment – the unemployment rate
IsHoliday – whether the week is a special holiday week
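
Since the files share the Store and Date fields, they can be combined into one table before modeling. The sketch below uses made-up rows in place of the CSV files; the stores.csv column names Type and Size are assumptions based on the description above.

```python
import pandas as pd

# Made-up rows standing in for train.csv, features.csv and stores.csv.
train = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 2],
    "Date": ["2010-02-05", "2010-02-05"],
    "Weekly_Sales": [24000.0, 50000.0],
    "IsHoliday": [False, False],
})
features = pd.DataFrame({
    "Store": [1],
    "Date": ["2010-02-05"],
    "Temperature": [42.0],
    "Fuel_Price": [2.5],
    "IsHoliday": [False],
})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [150000]})

# Join features on (Store, Date) and store details on Store.
# IsHoliday is dropped from features because train already carries it.
merged = (
    train
    .merge(features.drop(columns="IsHoliday"), on=["Store", "Date"], how="left")
    .merge(stores, on="Store", how="left")
)
merged["Date"] = pd.to_datetime(merged["Date"])
```

A left join keeps every row of train even when a store or week has no matching feature data, in which case the new columns hold missing values.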

Here are the data for download:

Self Practice

Bike Sharing Demand
Forecast use of a city bikeshare system

The Problem

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems because of their important role in traffic, environmental and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data these systems generate make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

Data Fields

datetime – hourly date + timestamp
season – 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday – whether the day is considered a holiday
workingday – whether the day is neither a weekend nor holiday
weather – weather category:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp – temperature in Celsius
atemp – “feels like” temperature in Celsius
humidity – relative humidity
windspeed – wind speed
casual – number of non-registered user rentals initiated
registered – number of registered user rentals initiated
count – number of total rentals
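
The fields above can be prepared with a few pandas steps, sketched here on two made-up rows. The check that casual plus registered equals count follows from the field descriptions and should be confirmed on the real data.

```python
import pandas as pd

# Two made-up hourly records using the field names listed above.
rides = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"],
    "season": [1, 1],
    "casual": [3, 8],
    "registered": [13, 32],
    "count": [16, 40],
})

# Parse the timestamp and pull out the hour of day as a feature.
rides["datetime"] = pd.to_datetime(rides["datetime"])
rides["hour"] = rides["datetime"].dt.hour

# Map the season codes to readable names.
season_names = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
rides["season_name"] = rides["season"].map(season_names)

# Use rides["count"], not rides.count, because count is also a DataFrame method.
totals_match = bool((rides["casual"] + rides["registered"] == rides["count"]).all())
```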

Datasets

The dataset is available in Azure ML Studio under Saved Datasets > Samples > Bike Rental UCI Dataset

Heart Diseases Prediction

The Problem

The term “heart disease” is often used interchangeably with the term “cardiovascular disease”. Cardiovascular disease generally refers to conditions that involve narrowed or blocked blood vessels that can lead to a heart attack, chest pain (angina) or stroke. Other heart conditions, such as those that affect your heart’s muscle, valves or rhythm, also are considered forms of heart disease.

This makes heart disease a major concern to be dealt with. But it is difficult to identify heart disease because of several contributory risk factors such as diabetes, high blood pressure, high cholesterol, abnormal pulse rate, and many other factors. Due to such constraints, scientists have turned towards modern approaches like Data Science and Machine Learning for predicting the disease.

The Data

In this practice, we will apply Machine Learning approaches (and eventually compare them) to classify whether a person is suffering from heart disease, using one of the most widely used datasets: the Cleveland Heart Disease dataset from the UCI Repository.

Data Source URL : http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data

Solution

There are two ways to do this: with Azure ML Designer (the no-code way) or with a Python notebook.

Let’s do this using Azure ML Designer (Azure ML Studio - Classic).
If you’re Python savvy, you can follow [this] link to get the ipynb files. To read the blog about this problem, visit this [link].

Hints:

Use Edit Metadata to set the new column names: age,sex,chestpaintype,resting_blood_pressure,serum_cholestrol,fasting_blood_sugar,resting_ecg,max_heart_rate,exercise_induced_angina,st_depression_induced_by_exercise,slope_of_peak_exercise,number_of_major_vessel,thal,heart_disease_diag
Use Edit Metadata to change the data type to Integer for the following columns: heart_disease_diag,age,sex
Use Edit Metadata to make the following columns categorical: sex,chestpaintype,exercise_induced_angina,number_of_major_vessel,slope_of_peak_exercise,fasting_blood_sugar,thal,resting_ecg
Clean Missing Value
Apply SQL Transformation

SELECT *, 
CASE 
WHEN heart_disease_diag < 1 THEN 0
ELSE 1 
END AS HeartDiseaseCat 
FROM t1;
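
For readers following the Python notebook route, a pandas equivalent of the SQL transformation above might look like this sketch. It assumes missing values have already been cleaned, and the sample values are made up.

```python
import pandas as pd

# Made-up diagnosis values standing in for the cleaned dataset (table t1).
t1 = pd.DataFrame({"heart_disease_diag": [0, 2, 1, 0, 4]})

# Same rule as the CASE expression: below 1 becomes 0, everything else becomes 1.
t1["HeartDiseaseCat"] = (t1["heart_disease_diag"] >= 1).astype(int)
```

This turns the multi-level diagnosis into a binary target, which is what a two-class classifier expects.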

Dataset Download

heart_data

Resources (Shared Links)

One Drive

https://rb.gy/t74leq (Short Link)
https://1drv.ms/u/s!AlmMDylfGrbdhCxrkyx7uN4ZxeeZ?e=sQl4vQ (Full Link)

Google Drive

https://bit.ly/robipython (Short Link)
https://drive.google.com/drive/folders/1Aa79OIK8E7As5LqxkuSXboLu9VC8t06E?usp=sharing (Full Link)

Mind Map

https://gitmind.com/app/doc/19b619760

Python Notebook (Google Colab)

https://colab.research.google.com/drive/1_q_sppc6hKpMWGiQjV2Ptu6JqgM5Bcw4#scrollTo=PenOwfv4pCFC

Resources (Dataset)

Dataset

CATelcoCustomerChurnTrainingSample_dataset

Resources (Books)

Books

Info-graphic

Use Case Robi (AI-deathon)

Please download and review the following presentation file.

Overview of the tasks:

Create your own use case for your respective domain.
Describe your data sources clearly.
Explain how it will impact the business/domain and the end user.
List any known or unknown challenges for this particular case.

Here are the PPT slides:

Template for your Business Case (AI-deathon)

Data Analytics Training (2020)

Python Packages Cheat Sheet

[NumPy]

[Pandas]

[Matplotlib]

[Scikit Learn]
