Series 3 : Analysis and ML
Background Information:
We use Python Pandas and Python Matplotlib/Seaborn to analyze and answer business questions about 12 months' worth of sales data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc.
We start by cleaning our data (a sketch of these steps follows the list). Tasks during this section include:
- Dropping NaN values from the DataFrame
- Removing rows based on a condition
- Changing the type of columns (to_numeric, to_datetime, astype)
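A minimal sketch of these cleaning steps, assuming hypothetical column names such as Order Date, Quantity Ordered, and Price Each (the actual names depend on the dataset):

import pandas as pd

# Load the sales data (file name illustrative).
df = pd.read_csv("all_data.csv")

# Drop rows that are entirely NaN.
df = df.dropna(how="all")

# Remove rows based on a condition, e.g. stray header rows
# repeated inside the file.
df = df[df["Order Date"] != "Order Date"]

# Change column types (to_numeric, to_datetime, astype).
df["Quantity Ordered"] = pd.to_numeric(df["Quantity Ordered"])
df["Price Each"] = df["Price Each"].astype(float)
df["Order Date"] = pd.to_datetime(df["Order Date"])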
Once we have cleaned up our data a bit, we move on to the data exploration section. In this section we explore four high-level business questions related to our data:
- What was the best month for sales? How much was earned that month?
- What city sold the most product?
- What time should we display advertisements to maximize the likelihood of customers buying products?
- What product sold the most? Why do you think it sold the most?
To answer these questions we walk through many different pandas & matplotlib methods (a combined sketch follows the list). They include:
- Concatenating multiple CSVs together to create a new DataFrame (pd.concat)
- Adding columns
- Parsing cells as strings to make new columns (.str)
- Using the .apply() method
- Using groupby to perform aggregate analysis
- Plotting bar charts and line graphs to visualize our results
- Labeling our graphs
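The sketch below strings several of these methods together (file paths and column names are assumptions, continuing the example above):

import glob

import matplotlib.pyplot as plt
import pandas as pd

# Concatenate the monthly CSVs into a single DataFrame (pd.concat).
files = glob.glob("Sales_Data/*.csv")
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
df = df[df["Order Date"] != "Order Date"].dropna(how="all")

# Add columns derived from existing ones.
df["Order Date"] = pd.to_datetime(df["Order Date"])
df["Month"] = df["Order Date"].dt.month

# Parse cells as strings to make a new column (.str).
df["City"] = df["Purchase Address"].str.split(",").str[1].str.strip()

# Use .apply() for row-wise logic.
df["Sales"] = df.apply(
    lambda row: float(row["Quantity Ordered"]) * float(row["Price Each"]),
    axis=1,
)

# groupby for aggregate analysis, then a labeled bar chart.
monthly = df.groupby("Month")["Sales"].sum()
monthly.plot(kind="bar")
plt.xlabel("Month")
plt.ylabel("Sales (USD)")
plt.title("Sales by month")
plt.show()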
Dataset
One Drive
- https://rb.gy/t74leq (Short Link)
- https://1drv.ms/u/s!AlmMDylfGrbdhCxrkyx7uN4ZxeeZ?e=sQl4vQ (Full Link)
Colab Link
https://colab.research.google.com/drive/1Htd0Dop4YQZh1-b_YM4wIrpGEpCo5Qex?usp=sharing
Used Dataset
World Happiness Report from 2017 to 2019
The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be.
The World Happiness Report was written by a group of independent experts acting in their personal capacities. Any views expressed in this report do not necessarily reflect the views of any organization, agency or program of the United Nations.
Questions
- Which are the happiest countries in the world?
- Which are the least happy countries in the world?
- What is the most significant factor associated with happiness?
- What are the top 10 countries for each criterion?
- What are the correlation coefficients between factors (e.g., how is happiness related to life expectancy)? (see the sketch below)
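A minimal sketch of how these questions can be approached with pandas (file and column names are assumptions; the report files name their columns slightly differently each year):

import pandas as pd

# Load one year of the report (file name illustrative).
df = pd.read_csv("world_happiness_2019.csv")

# Most and least happy countries by score.
print(df.nlargest(10, "Score")[["Country or region", "Score"]])
print(df.nsmallest(10, "Score")[["Country or region", "Score"]])

# Correlation of every numeric factor with the happiness score,
# e.g. how happiness relates to healthy life expectancy.
corr = df.corr(numeric_only=True)["Score"].sort_values(ascending=False)
print(corr)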
Colab Link:
https://colab.research.google.com/drive/1EYnhz9b2MPg6OxJ0-eAh4jZMhDKAMKcg?usp=sharing
Background Information:
A Financial Technology company (FinTech company) has launched a mobile app. The app brings financial services such as bank loans and savings together in one place. It comes in two versions, free and premium: the free version contains the basic features, and if a customer wants to use the premium features they have to pay to unlock them.
The company's main goal is to sell the premium version of the app at a low advertising cost, but it does not know how to do that. For that reason, the company made the premium features available in the free app for 24 hours to collect data on customer behavior. It then hired you to find insights in the collected data (customer behavior).
Your main job is to predict whether or not a new customer is interested in buying the product. If a customer will buy the product anyway, there is no need to give that customer an offer and lose revenue. Offers should only go to customers who are interested in using the premium app but cannot afford its cost. By giving offers to those customers, the company earns more money.
Dataset
One Drive
- https://rb.gy/t74leq (Short Link)
- https://1drv.ms/u/s!AlmMDylfGrbdhCxrkyx7uN4ZxeeZ?e=sQl4vQ (Full Link)
Know About the Dataset
As you can see in the fineTech_appData DataFrame, there are 50,000 users' records with 12 different features. Let's look briefly at each feature (a loading sketch follows the list).
1. user: Unique ID for each user.
2. first_open: Date (yy-mm-dd) and time (hour:minute:second:millisecond) when the user first logged in to the app.
3. dayofweek: The day of the week on which the user logged on:
0: Sunday
1: Monday
2: Tuesday
3: Wednesday
4: Thursday
5: Friday
6: Saturday
4. Hour: The hour of the day (24-hour format) at which the customer logged on. It is correlated with the dayofweek column.
5. age: The age of the registered user.
6. screen_list: The name of multiple screens seen by customers, which are separated by a comma.
7. numscreens: The total number of screens seen by customers.
8. minigame: The app contains small games related to finance. 1 if the customer played a mini-game, otherwise 0.
9. used_premium_feature: 1 if the customer used a premium feature of the app, otherwise 0.
10. enrolled: 1 if the user bought the premium version, otherwise 0.
11. enrolled_date: Date (yy-mm-dd) and time (hour:minute:second:millisecond) when the user bought the premium version.
12. liked: Each screen of the app has a like button; 1 if the customer liked the screen, otherwise 0.
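A minimal loading sketch based on the feature descriptions above (the file name is assumed from the DataFrame name; the 48-hour cutoff is an illustrative modeling choice, not part of the dataset):

import pandas as pd

data = pd.read_csv("fineTech_appData.csv")

# Parse the timestamp columns; enrolled_date is missing for users
# who never bought the premium version.
data["first_open"] = pd.to_datetime(data["first_open"], errors="coerce")
data["enrolled_date"] = pd.to_datetime(data["enrolled_date"], errors="coerce")

# Hours between first open and enrollment; enrollments long after
# first open can be treated as "not enrolled" for this campaign.
hours = (data["enrolled_date"] - data["first_open"]).dt.total_seconds() / 3600
data.loc[hours > 48, "enrolled"] = 0  # cutoff is an assumption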
Background Information:
For telco companies it is key to attract new customers and at the same time avoid contract terminations (churn) in order to grow their revenue-generating base. Looking at churn, different reasons trigger customers to terminate their contracts, for example better price offers, more attractive packages, bad service experiences or changes in customers’ personal situations.
Churn analytics provides valuable capabilities to predict customer churn and also to identify the underlying reasons that drive it. The churn metric is usually shown as the percentage of customers that cancel a product or service within a given period (usually months). If a telco company had 10 million customers on the 1st of January and received 500K contract terminations by the 31st of January, the monthly churn for January would be 5%.
Telcos apply machine learning models to predict churn on an individual customer basis and take countermeasures such as discounts, special offers or other gratifications to keep their customers. A customer churn analysis is a typical classification problem within the domain of supervised learning.
Based on this introduction, the key challenge is to predict whether an individual customer will churn or not. To accomplish that, machine learning models are trained on 80% of the sample data. The remaining 20% is used to apply the trained models and assess their predictive power with regard to “churn / not churn” (a sketch follows). A side question is which features actually drive customer churn. That information can be used to identify customer “pain points” and resolve them by providing goodies that make customers stay.
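A minimal sketch of this workflow, assuming a hypothetical telco_churn.csv with a binary Churn column (logistic regression stands in for whichever models the notebook actually trains):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("telco_churn.csv")
X = pd.get_dummies(df.drop(columns=["Churn"]))  # one-hot encode categoricals
y = df["Churn"]

# 80/20 split as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# Which features drive churn? Inspect the fitted coefficients.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs.tail(10))  # strongest positive drivers of churn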
Dataset
One Drive
- https://rb.gy/t74leq (Short Link)
- https://1drv.ms/u/s!AlmMDylfGrbdhCxrkyx7uN4ZxeeZ?e=sQl4vQ (Full Link)
Colab Link
https://colab.research.google.com/drive/1cUypUtkskwFOf7mbV8uh4LKyZoAl3tP-?usp=sharing
Series 2 : Intro to Statistics and EDA
Please follow the steps to complete your e-attendance
- Click this link
Module 1: Data and Visualization ( Descriptive Statistics and Seaborn)
In the first part of the course we will discuss methods of descriptive statistics. You will learn what cases and variables are and how to compute measures of central tendency (mean, median and mode) and dispersion (standard deviation and variance). Next, we discuss how to assess relationships between variables, and we introduce the concepts of correlation and regression.
Hands-on : Using Python Seaborn Visualization Library
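A small taste of the hands-on, using seaborn's built-in tips dataset so the snippet is self-contained (the session's own examples may differ):

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

# Central tendency: mean, median, mode.
print(tips["total_bill"].mean(),
      tips["total_bill"].median(),
      tips["total_bill"].mode()[0])

# Dispersion: standard deviation and variance.
print(tips["total_bill"].std(), tips["total_bill"].var())

# Relationship between two variables: correlation and a regression plot.
print(tips["total_bill"].corr(tips["tip"]))
sns.regplot(data=tips, x="total_bill", y="tip")
plt.show()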
Module 2: Python for Data Analysis ( Numpy)
NumPy, which stands for Numerical Python, is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and a collection of routines for working with those arrays.
Hands-on : Using Python Numpy Library
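A few core NumPy idioms as a warm-up (illustrative, not the full hands-on):

import numpy as np

# Create an array and inspect its shape and dtype.
a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(a.shape, a.dtype)  # (2, 3) float64

# Vectorized arithmetic: no Python loops needed.
print(a * 2 + 1)

# Aggregations along an axis.
print(a.sum(axis=0))   # column sums
print(a.mean(axis=1))  # row means

# Boolean masking and slicing.
print(a[a > 2])   # elements greater than 2
print(a[:, 1])    # second column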
Module 3: Python for Data Analysis ( Pandas )
Pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.
Hands-on : Using Python Pandas Library
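And a few core pandas idioms (again illustrative):

import pandas as pd

# Build a small DataFrame from a dict.
df = pd.DataFrame({
    "city": ["A", "A", "B"],
    "sales": [120, 90, 75],
})

# Inspect, filter, and aggregate.
print(df.head())
print(df[df["sales"] > 80])
print(df.groupby("city")["sales"].sum())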
Colab link
Seaborn
https://colab.research.google.com/drive/1vNDDqRudgB93JloRWwPnmiMgNlge7INW?usp=sharing
https://deepnote.com/project/c331f89d-b7a8-4fd8-b634-658950c235b4
Numpy
https://colab.research.google.com/drive/1vBUs-WbzGvNVUKAyIDTuTq3V4I3Zg1qo?usp=sharing
https://deepnote.com/project/c331f89d-b7a8-4fd8-b634-658950c235b4#%2FNumpy_Notebook.ipynb
Pandas
https://colab.research.google.com/drive/1I3EHK6rj6jhgCfH4UCP9o1XpwNuZ3ToN?usp=sharing
911 Prob
https://colab.research.google.com/drive/1M55VxiOhze4mBVxDJ2cJvimJTUQO6pe_?usp=sharing
Linear Regression
https://colab.research.google.com/drive/1PjaD5zKtFkO9wJFuiDEAIR7COtct77fo?usp=sharing
One Drive
- https://rb.gy/t74leq (Short Link)
- https://1drv.ms/u/s!AlmMDylfGrbdhCxrkyx7uN4ZxeeZ?e=sQl4vQ (Full Link)
Google Drive
Please follow this link to learn how to open a Jupyter Notebook on your local machine (in your specified directory).
Series 1 : Intro to Data Science and Python
Please follow the steps to complete your e-attendance
- Click this link
Primary Requirements
- Some programming experience (e.g. C, C++, Java, QBasic (!) etc. )
- At least high-school-level math skills are required.
- Passion to learn
IDE Requirements
- The most popular IDE for Data Science is Anaconda. You can download and install it from here. Make sure you download the Python 3.7 distribution.
F.A.Q
» I don’t have the admin permission to install any software (Don’t worry !)
- Google Colab [if you already have Google Account ]
- Azure Notebook [if you already have Microsoft Account]
- Both are free to use!
» Is there any way I can do Machine Learning analytics with less code or no code?
Yes, we can!
» Really ? How to do that?
Step 1 : Go to https://studio.azureml.net/
Step 2 : Use any Microsoft account (youremail@hotmail.com / outlook.com) to register and log in
Module 1: Introduction to Data Science
Data Scientist has been ranked the number one job on Glassdoor, and the average salary of a data scientist is over $120,000 in the United States according to Indeed! Data Science is a rewarding career that allows you to solve some of the world’s most interesting problems. In this module we will get a simple introduction to Data Science and its different areas.
Module 2: No-Code Machine Learning
This module introduces the Designer tool, a drag-and-drop interface for creating machine learning models without writing any code. You will learn how to create a training pipeline that encapsulates data preparation and model training, convert that training pipeline to an inference pipeline that can predict values from new data, and finally deploy the inference pipeline as a service for client applications to consume.
Hands-on : Design a Machine Learning Model using ML Studio
Module 3: Python – A Quick Review
In this module, you will get a quick review of the Python language. We will not go in depth, but we will discuss some important components of the language. Please note, this is not meant to be a comprehensive overview of Python or of programming in general.
Hands-on : Environment Setup and Jupyter Notebook Intro.
Hands-on : Python Code Along
Hands-on : Python Review Exercise
Presentation File:
Machine Learning – Introduction
Related Materials:
- Data Concept
- To know more about Data Concept you can click [this] link.
- ML Performance Metrics:
- AzureML End-to-End Lecture Series
Azure ML Cheat Sheet
Algorithm Summary
Source: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
Automobile Price Prediction
The Problem
This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process “symboling”. A value of +3 indicates that the auto is risky; -3 that it is probably pretty safe.
The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc…), and represents the average loss per car per year.
Note: Several of the attributes in the database could be used as a “class” attribute.
Please report whatever inferences you can draw from the data, and build a price prediction model.
The Data
This dataset consists of data from the 1985 Ward’s Automotive Yearbook. Here are the sources:
Sources:
1) 1985 Model Import Car and Truck Specifications, 1985 Ward’s Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037
The dataset is available in Azure ML Studio under Saved Datasets > Samples > Automobile Price Data (Raw)
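If you prefer the Python route, here is a minimal sketch of a price model, assuming a CSV export of the same data (file name illustrative; only numeric features are used for brevity):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("automobile_price_raw.csv")
df = df.dropna(subset=["price"])  # price is the target

# Keep numeric features and fill remaining gaps with column medians.
X = df.select_dtypes("number").drop(columns=["price"])
X = X.fillna(X.median())
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))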
Walmart Store Sales Forecasting
The Problem
One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line.
You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.
In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
The Data
stores.csv
This file contains anonymized information about the 45 stores, indicating the type and size of store.
train.csv
This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:
- Store – the store number
- Dept – the department number
- Date – the week
- Weekly_Sales – sales for the given department in the given store
- IsHoliday – whether the week is a special holiday week
test.csv
This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.
features.csv
This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
- Store – the store number
- Date – the week
- Temperature – average temperature in the region
- Fuel_Price – cost of fuel in the region
- MarkDown1-5 – anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
- CPI – the consumer price index
- Unemployment – the unemployment rate
- IsHoliday – whether the week is a special holiday week
Here are the data for download:
Bike Sharing Demand
Forecast use of a city bikeshare system
The Problem
Bike sharing systems are a new generation of traditional bike rentals, where the whole process from membership to rental and return has become automatic. Through these systems, a user is able to easily rent a bike from a particular position and return it at another position. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental and health issues.
Apart from interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used to sense mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.
Data Fields
datetime – hourly date + timestamp
season – 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday – whether the day is considered a holiday
workingday – whether the day is neither a weekend nor holiday
weather – 1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp – temperature in Celsius
atemp – “feels like” temperature in Celsius
humidity – relative humidity
windspeed – wind speed
casual – number of non-registered user rentals initiated
registered – number of registered user rentals initiated
count – number of total rentals
Datasets
The dataset is available in Azure ML Studio under Saved Datasets > Samples > Bike Rental UCI Dataset
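As a Python alternative, a minimal demand-forecasting sketch, assuming a CSV export with the fields listed above (file name illustrative):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("bike_rental.csv", parse_dates=["datetime"])

# Derive time features from the hourly timestamp.
df["hour"] = df["datetime"].dt.hour
df["dayofweek"] = df["datetime"].dt.dayofweek
df["month"] = df["datetime"].dt.month

features = ["season", "holiday", "workingday", "weather",
            "temp", "atemp", "humidity", "windspeed",
            "hour", "dayofweek", "month"]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(df[features], df["count"])
print(model.predict(df[features].head()))  # in-sample sanity check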
Heart Diseases Prediction
The Problem
The term “heart disease” is often used interchangeably with the term “cardiovascular disease”. Cardiovascular disease generally refers to conditions that involve narrowed or blocked blood vessels that can lead to a heart attack, chest pain (angina) or stroke. Other heart conditions, such as those that affect your heart’s muscle, valves or rhythm, also are considered forms of heart disease.
This makes heart disease a major concern to be dealt with. But it is difficult to identify heart disease because of several contributory risk factors such as diabetes, high blood pressure, high cholesterol, abnormal pulse rate, and many other factors. Due to such constraints, scientists have turned towards modern approaches like Data Science and Machine Learning for predicting the disease.
The Data
In this practice, we will apply Machine Learning approaches (and eventually compare them) to classify whether a person is suffering from heart disease or not, using one of the most widely used datasets: the Cleveland Heart Disease dataset from the UCI Repository.
Data Source URL : http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
Solution
There are two ways we can do this: either we solve it the Azure ML Designer (no-code) way, or we do it using a Python notebook.
- Let’s do this using Azure ML Designer (Azure ML Studio -Classic)
- If you’re Python savvy, you can follow [this] link to get the ipynb files; to read the blog about this problem scope, you can visit this [link]
Hints:
- Edit the metadata and set the new column names: age,sex,chestpaintype,resting_blood_pressure,serum_cholestrol,fasting_blood_sugar,resting_ecg,max_heart_rate,exercise_induced_angina,st_depression_induced_by_exercise,slope_of_peak_exercise,number_of_major_vessel,thal,heart_disease_diag
- Edit the metadata and change the data type to Integer for the following columns: heart_disease_diag,age,sex
- Edit the metadata and make the following columns categorical: sex,chestpaintype,exercise_induced_angina,number_of_major_vessel,slope_of_peak_exercise,fasting_blood_sugar,thal,resting_ecg
- Clean missing values
- Apply an SQL Transformation (a Python equivalent is sketched after the query):
SELECT *,
CASE
WHEN heart_disease_diag < 1 THEN 0
ELSE 1
END AS HeartDiseaseCat
FROM t1;
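For the Python route, a sketch that mirrors these hints (column names taken from the hints above; the raw file has no header row and uses “?” for missing values):

import pandas as pd

cols = ["age", "sex", "chestpaintype", "resting_blood_pressure",
        "serum_cholestrol", "fasting_blood_sugar", "resting_ecg",
        "max_heart_rate", "exercise_induced_angina",
        "st_depression_induced_by_exercise", "slope_of_peak_exercise",
        "number_of_major_vessel", "thal", "heart_disease_diag"]
url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.cleveland.data")
df = pd.read_csv(url, header=None, names=cols, na_values="?")

# Clean missing values, then binarize the target exactly as the SQL
# above does: 0 stays 0, anything >= 1 becomes 1.
df = df.dropna()
df["HeartDiseaseCat"] = (df["heart_disease_diag"] >= 1).astype(int)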
Dataset Download
One Drive
- https://rb.gy/t74leq (Short Link)
- https://1drv.ms/u/s!AlmMDylfGrbdhCxrkyx7uN4ZxeeZ?e=sQl4vQ (Full Link)
Google Drive
- https://bit.ly/robipython (Short Link)
- https://drive.google.com/drive/folders/1Aa79OIK8E7As5LqxkuSXboLu9VC8t06E?usp=sharing (Full Link)
Mind Map
Python Notebook (Google Colab)
Please download and review the following presentation file.
Overview of the tasks:
- You have to generate your own use case based on your respective domain.
- Describe your data sources clearly.
- Explain how it will impact the business/domain and also the end user.
- Note any known or unknown challenges for this particular case.
Here are the ppt slides: