Machine Learning

Machine Learning is an application of Artificial Intelligence that gives systems the ability to learn and improve from experience automatically, without relying on rule-based programming. It is often described as the science of getting computers to learn and act as humans do.

There are four types of machine learning algorithms:

  1. Supervised Learning: In Supervised Learning, the correct output for each training example is known. The model is trained on well-labeled data with the expectation that it learns from that data and predicts the outputs for new inputs.
    Supervised Learning can be placed into two categories:
    • Classification
    • Regression
  2. Unsupervised Learning: Unsupervised Learning is the type of learning in which the output for the given data is unknown. The model is trained on information that is neither classified nor labeled, and is expected to find structure in that information without guidance.
    Example: Clustering Algorithms.
  3. Semi-Supervised Learning: Semi-Supervised Learning sits between Supervised and Unsupervised Learning. The training data consists of both labeled and unlabeled data, which is the kind of data typically available in the real world.
  4. Reinforcement Learning: In Reinforcement Learning, the model is exposed to an environment and is trained by trial and error. This allows the model to determine automatically the ideal behavior within a given context in order to maximize its performance.
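
The difference between the supervised and unsupervised settings can be illustrated with a minimal sketch. It assumes scikit-learn is installed; the toy data, the variable names and the choice of DecisionTreeClassifier, LinearRegression and KMeans are illustrative assumptions, not taken from this document.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up for illustration): one feature per sample.
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]

# Supervised classification: every training sample has a known class label.
class_labels = ["low", "low", "low", "high", "high", "high"]
classifier = DecisionTreeClassifier(random_state=0).fit(X, class_labels)
predicted_class = classifier.predict([[2.5]])[0]

# Supervised regression: every training sample has a known numeric target.
targets = [2.0, 4.0, 6.0, 20.0, 22.0, 24.0]  # here target = 2 * feature
regressor = LinearRegression().fit(X, targets)
predicted_value = regressor.predict([[5.0]])[0]

# Unsupervised clustering: no labels are given; the model groups the samples.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = clusterer.labels_

print(predicted_class, round(predicted_value, 2), cluster_ids)
```

In the two supervised cases the call to fit receives both inputs and known outputs, whereas the clustering model receives only the inputs.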
Use Cases by Type of Machine Learning and Industry (Oil and Gas, Health, Energy, Consumer)


Supervised Learning use cases:

  Oil and Gas:
  • Sucker rod pump failure can be predicted by analyzing and classifying dynamometer card lift patterns.
  • * A sucker rod is a steel rod approximately 30 feet long. It is used in the oil industry to connect the surface and downhole components. A sucker rod pump lifts oil from the reservoir. A dynamometer is a measuring device used in rod pumping well analytics.
  • Support Vector Machine algorithms are used in the oil and gas industry for image classification and for text and hypertext categorization. In oilfields, they are leveraged specifically in exploration to understand the position of rock layers and to create 2D and 3D models representing the subsoil.

  Health:
  • Classification Algorithms such as Bayes Net, SVMs and Functional Trees are applied to the detection of Coronary Artery Disease (heart disease).
  • Coronary Artery Disease occurs when the coronary artery is narrowed or blocked by cholesterol-containing deposits.
  • Prediction of diabetes, based on insulin production and improper insulin utilization rates, using Naïve Bayes and Decision Trees.
  • Heart Disease Prediction System using the K-Nearest Neighbour Algorithm.
  • Diagnosis of liver disease using classification algorithms.

  Energy:
  • Deploying, managing and optimizing renewable energy using Classification Algorithms.
  • Energy collected from renewable resources such as sunlight and wind is called Renewable Energy.
  • Predicting energy consumption demand using the Nearest Neighbor Algorithm.
  • Solar power forecasting using time-series supervised learning algorithms.

  Consumer:
  • Regression algorithms such as Linear Regression are used to predict continuous outcomes, for example house values in the real estate business and customer lifetime value in the insurance sector. Logistic Regression is used to estimate a probability, such as the likelihood that a customer will buy something.
  • Multivariate Regression is used in the retail sector, where customers choose based on a number of variables such as brand, price and product. This helps decision makers find the best combination of factors to increase footfall in a store.
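
As a minimal sketch of the two kinds of prediction mentioned for the consumer sector, the following contrasts a regression model that outputs a continuous value with Logistic Regression, which outputs a probability. It assumes scikit-learn and NumPy are installed; the house sizes, prices, visit counts and variable names are hypothetical data invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical data: house size in square metres and sale price.
sizes = np.array([[50.0], [80.0], [100.0], [120.0]])
prices = np.array([150_000.0, 240_000.0, 300_000.0, 360_000.0])

price_model = LinearRegression().fit(sizes, prices)
estimated_price = price_model.predict([[90.0]])[0]  # a continuous value

# Hypothetical data: number of store visits and whether the customer bought.
visits = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
bought = np.array([0, 0, 0, 0, 1, 1, 1, 1])

purchase_model = LogisticRegression().fit(visits, bought)
# predict_proba returns [P(no purchase), P(purchase)] for each input row.
purchase_probability = purchase_model.predict_proba([[8]])[0, 1]

print(round(estimated_price), round(purchase_probability, 2))
```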


Unsupervised Learning use cases:

  Oil and Gas:
  • Anomaly detection in turbomachines using Clustering Algorithms.
  • * Turbomachines are devices that add energy to, or extract energy from, a continuously flowing fluid.

  Health:
  • Clustering Algorithms to detect recurring events such as breast cancer recurrence.
  • Breast cancer recurrence is the reappearance of cancer after treatment.

  Energy:
  • A system to understand and analyze energy consumption patterns using Clustering Algorithms.

  Consumer:
  • Customer segmentation using the K-means Algorithm to analyze customers' spending behavior and identify which ones are likely to be loyal customers.
  • Inventory categorization based on sales or other manufacturing metrics using Clustering Algorithms.
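
Customer segmentation with K-means can be sketched roughly as follows. This assumes scikit-learn and NumPy are installed; the customer figures and the rule "the higher-spend segment is the likely-loyal one" are hypothetical simplifications for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual spend, purchases per year].
customers = np.array([
    [200.0, 2], [250.0, 3], [220.0, 2],        # occasional shoppers
    [5000.0, 40], [5200.0, 45], [4800.0, 38],  # frequent, high-spend shoppers
])

# No labels are supplied; K-means groups the customers into two segments.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
segments = kmeans.labels_

# Assumption: treat the segment whose centre has the higher spend as loyal.
loyal_segment = int(np.argmax(kmeans.cluster_centers_[:, 0]))
likely_loyal = customers[segments == loyal_segment]

print(likely_loyal)
```

In practice the features would usually be scaled before clustering so that spend does not dominate the distance calculation.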


Semi-Supervised Learning use cases:

  Oil and Gas:
  • Data Integration Modeling is applied to Drill Hole Planning through Semi-Supervised Learning.
  • Drill Hole Planning is the process of locating new bore holes, based on mud weight and the drill bit, in order to drill a successful well.

  Health:
  • Support Vector Machines and Kernel Regularised Least Squares have been used with partially labeled data to build a health care decision-making model.
  • Semi-Supervised Learning is used to build a cancer detection model.

  Energy:
  • Energy models can be built using Semi-Supervised Learning and clustering approaches.
  • Energy modeling is used to determine or estimate building energy usage.

  Consumer:
  • In Customer Relationship Management, customer behavior models can be built using Semi-Supervised Learning. Semi-Supervised Learning has also been employed to predict the chronology of archaeological sites.
  • Fake review detection can also be done using Semi-Supervised Learning.
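
The idea of learning from a mix of labeled and unlabeled data can be sketched with scikit-learn's LabelPropagation, in which unlabeled samples are marked with -1. This is only an illustration under assumed toy data; it is not the method used in any of the use cases listed above.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Toy data (made up): two well-separated groups of one-feature samples.
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]])

# Only one sample per group is labeled; -1 marks an unlabeled sample.
y = np.array([0, -1, -1, -1, 1, -1, -1, -1])

model = LabelPropagation(kernel="rbf", gamma=1.0)
model.fit(X, y)

# transduction_ holds the label inferred for every training sample.
inferred_labels = model.transduction_
new_prediction = model.predict(np.array([[4.8]]))[0]

print(inferred_labels, new_prediction)
```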


Reinforcement Learning use cases:

  Oil and Gas:
  • Controlling oil production in smart wells with an MPC (Model Predictive Control) strategy, using Reinforcement Learning.

  Health:
  • Learning treatment policies in the medical sciences, that is, finding optimal treatment policies.
  • Reinforcement Learning is used extensively in robotics.

  Energy:
  • Reinforcement Learning algorithms are used to build autonomous energy management systems, which can carry out functions such as failure/outlier detection, system modification and recovery.

  Consumer:
  • Dynamic pricing in retail markets uses Reinforcement Learning algorithms.
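
Trial-and-error learning can be illustrated with one of the simplest reinforcement learning setups, an epsilon-greedy multi-armed bandit applied to a made-up pricing problem. The prices, purchase probabilities and parameter values below are hypothetical, and a real dynamic pricing system would be considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical environment: candidate prices and the hidden probability
# that a customer buys at each price. The agent never reads these directly.
prices = np.array([5.0, 10.0, 15.0])
buy_probability = np.array([0.9, 0.6, 0.2])

estimated_revenue = np.zeros(len(prices))  # running average reward per price
times_tried = np.zeros(len(prices), dtype=int)
epsilon = 0.1  # fraction of steps spent exploring a random price

for _ in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(len(prices)))      # explore
    else:
        action = int(np.argmax(estimated_revenue))   # exploit the best so far
    # Reward: the price if the simulated customer buys, otherwise nothing.
    reward = prices[action] if rng.random() < buy_probability[action] else 0.0
    times_tried[action] += 1
    # Incremental update of the running average for the chosen price.
    estimated_revenue[action] += (reward - estimated_revenue[action]) / times_tried[action]

best_price = prices[int(np.argmax(estimated_revenue))]
print(best_price, estimated_revenue.round(2))
```

The agent is never told which price is best; it discovers this from the rewards it receives, which is the defining feature of Reinforcement Learning.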


Scikit-Learn

  • Scikit-Learn is a robust, free machine learning library that is well suited to end-to-end machine learning projects.
  • Scikit-Learn is built on top of math and data Python libraries such as NumPy and SciPy, and works alongside Matplotlib and Pandas.
  • Scikit-Learn is not concerned with the loading, manipulation and visualization of data. It is focused entirely on machine learning tasks such as data modeling.
  • At the time of writing, the current version of Scikit-Learn is 0.21.1.
  • Scikit-Learn is open-source.
  • Scikit-Learn is a free library for Python.
  • Some popular models provided by Scikit-learn include:
    • Clustering
    • Regression
    • Decision Trees
    • Neural Networks
    • SVMs
    • Naive Bayes
  • J.P. Morgan uses Scikit-Learn widely across the bank for classification, predictive analytics, etc.
  • Scikit-Learn is also used in machine learning algorithms that recommend hotels and destinations to customers.


Limitations of Scikit-Learn:

  • Scikit-Learn places less emphasis on statistics than R does.
  • Deep Learning and Reinforcement Learning are currently out of scope for Scikit-Learn.

K-Nearest Neighbour using Scikit-learn:

The K-Nearest Neighbor classification algorithm stores all available cases and classifies new cases based on a similarity measure, such as a distance function.
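
A minimal sketch of K-Nearest Neighbor classification with Scikit-Learn is shown here. It assumes the bundled Iris dataset as example data; the split ratio, random seed and the choice of 5 neighbors are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labeled dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# "Training" stores the cases; prediction looks up the 5 closest ones.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)  # fraction of correct test predictions
print(f"Test accuracy: {accuracy:.2f}")
```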

DBSCAN Clustering Algorithm using Scikit-learn:

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) groups points that are close to each other according to a distance metric, and marks points that lie in low-density regions as outliers.
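
A minimal sketch of DBSCAN with Scikit-Learn follows. The points and the eps and min_samples values are chosen by hand for illustration; Scikit-Learn labels noise points with -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (made up): two dense groups and one isolated point.
points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],  # dense group 1
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],  # dense group 2
    [10.0, 0.0],                                     # isolated point
])

# eps: neighbourhood radius; min_samples: points needed to form a dense region.
dbscan = DBSCAN(eps=0.5, min_samples=3).fit(points)
labels = dbscan.labels_

print(labels)  # [ 0  0  0  0  1  1  1  1 -1]
```

Unlike K-means, the number of clusters is not specified in advance; it follows from the density parameters.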


Pandas: Python Data Analysis Library

  • Pandas is an open-source Python library providing easy-to-use, flexible, fast and expressive data structures for structured (tabular, multi-dimensional) data as well as time-series data.
  • It provides powerful tools for real-world data analysis.
  • There are two primary data structures in Pandas:
    • 1-dimensional: Series
    • 2-dimensional: DataFrame
  • Pandas is built on top of NumPy and is intended to integrate well with other third-party libraries.
  • At the time of writing, the current stable version of Pandas is 0.24.2.
  • Pandas supports the Python programming language.
  • Important features of Pandas:
    1. Can easily handle missing data values.
    2. Automatic and Explicit data alignment.
    3. Robust IO tools for loading data from flat files.
    4. Time-series specific functionality.
  • Pandas is an ideal tool for a data scientist who works with data in different stages like data munging, cleaning, analysing, visualizing etc.
  • Stock prediction, advertising and Natural Language Processing are some of the applications of Pandas.
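
The features listed above can be sketched in a few lines. The data and variable names are made up for the example, and the calls shown (fillna, index alignment on addition, read_csv, resample) are only a small sample of what Pandas offers.

```python
import io

import numpy as np
import pandas as pd

# 1-dimensional Series with a missing value, filled with the mean.
temperatures = pd.Series([21.5, np.nan, 23.0], index=["mon", "tue", "wed"])
filled = temperatures.fillna(temperatures.mean())

# Automatic alignment: values are matched by index label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])
aligned_sum = a + b

# IO tools: read a flat file (here an in-memory CSV with one empty field).
csv_text = "name,score\nann,90\nbob,\n"
scores = pd.read_csv(io.StringIO(csv_text))

# 2-dimensional DataFrame with a time-series index, summed in 2-day bins.
df = pd.DataFrame(
    {"sales": [100, 120, 90, 110]},
    index=pd.date_range("2019-01-01", periods=4, freq="D"),
)
two_day_totals = df.resample("2D").sum()

print(filled, aligned_sum, scores, two_day_totals, sep="\n\n")
```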


NumPy

  • NumPy is a general-purpose package for array processing and scientific computation with Python. It provides a high-performance multi-dimensional array object and tools to work with arrays.
  • Some important features of NumPy include:
    • A powerful N-dimensional array object (numpy.ndarray).
    • Broadcasting functions for element-wise operations between scalars and arrays.
    • Tools for integrating C/C++ and Fortran code.
    • Linear algebra, Fourier transform and random number generation capabilities.
  • At the time of writing, the current version of NumPy is 1.16.3.
  • NumPy is an open-source library for the Python programming language.
  • American Express, Facebook and JPMorgan Chase are some of the companies that use NumPy.
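
The NumPy features listed above can be sketched briefly as follows. The arrays and variable names are illustrative only.

```python
import numpy as np

# N-dimensional array object (numpy.ndarray).
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])

# Broadcasting: the 1-D array is applied element-wise to every row.
shifted = matrix + np.array([10.0, 20.0])

# Linear algebra: solve matrix @ x = [5, 11] for x.
solution = np.linalg.solve(matrix, np.array([5.0, 11.0]))

# Fourier transform of a unit impulse.
spectrum = np.fft.fft(np.array([1.0, 0.0, 0.0, 0.0]))

# Random number generation with a fixed seed for repeatability.
rng = np.random.default_rng(seed=42)
samples = rng.normal(size=1000)

print(shifted, solution, spectrum, samples.mean(), sep="\n")
```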


Limitations of NumPy:

  1. NumPy is not supported within Google App Engine applications, so it cannot be used in applications built on that platform.
  2. Since NumPy is partly written in C, it is deemed partly “unsafe”.


Matplotlib

  • Matplotlib is a 2D plotting library for the Python programming language.
  • Matplotlib can produce high-quality figures in a variety of formats.
  • Matplotlib is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack.
  • Matplotlib's greatest strength is its visualization capacity, allowing visual access to huge amounts of data in easily digestible visuals. It tries to make easy things easy and hard things possible.
  • At the time of writing, the stable version of Matplotlib is 3.0.3.
  • Matplotlib is open-source.
  • Matplotlib supports several plot types such as line, bar, scatter and histogram.
  • Matplotlib can be used in Python Scripts, the Python and IPython shells, the Jupyter notebook, web application servers and graphical user interface tools.
  • Matplotlib’s visualizations are used in stock market analysis, weather forecasting etc.


Limitations of Matplotlib:

  • Matplotlib is relatively limited with regard to interactive and animated visualization.

[Figure: A simple plot using Matplotlib.]

[Figure: A histogram plot using Matplotlib.]

[Figure: A scatter plot using Matplotlib.]
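
Plots of the three kinds captioned above (a simple line plot, a histogram and a scatter plot) could be produced along the following lines. This is a sketch with randomly generated data, not the code behind the original figures; the "Agg" backend and the in-memory buffer are assumptions made so the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display required
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
x = np.linspace(0, 2 * np.pi, 100)

fig, (ax_line, ax_hist, ax_scatter) = plt.subplots(1, 3, figsize=(12, 3.5))

ax_line.plot(x, np.sin(x))                       # simple line plot
ax_line.set_title("Simple plot")

ax_hist.hist(rng.normal(size=500), bins=20)      # histogram
ax_hist.set_title("Histogram")

ax_scatter.scatter(rng.random(50), rng.random(50))  # scatter plot
ax_scatter.set_title("Scatter plot")

fig.tight_layout()

# Render to PNG in memory; pass a file name instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```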


Seaborn

  • Seaborn is a graphics library built on top of Matplotlib and closely integrated with the Pandas library.
  • Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.
  • At the time of writing, the current version of Seaborn is 0.9.0.
  • Seaborn is open-source.
  • Seaborn supports the Python programming language.
  • Some important functionalities that Seaborn offers include:
    • Automatic estimation and plotting of linear regression models for different kinds of dependent variables.
    • High-level abstractions for structuring multi-plot grids, which make it easy to build complex visualizations.
    • Specialized support for categorical variables to show observations or aggregate statistics.
    • Easy switching between different visual representations that can be parameterized with the same dataset-oriented API.
    • Fully polished, customized plots can be created with additional steps.
  • Seaborn is used for plotting kernel density estimates, building heatmaps etc. 
  • Examples of plots that can be produced with Seaborn:
    • Multiple bivariate KDE plots
    • Grouped violin plots
    • Time series plots with error bands
    • Annotated heatmaps
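
Two of the plot types mentioned above, a violin plot and an annotated heatmap, can be sketched as follows. The sketch assumes Seaborn, Pandas and Matplotlib are installed; the data is randomly generated and the column names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(seed=0)

# Hypothetical measurements for two groups, held in a pandas DataFrame.
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 50),
    "value": np.concatenate([rng.normal(0, 1, 50), rng.normal(2, 1, 50)]),
})

fig, (ax_violin, ax_heat) = plt.subplots(1, 2, figsize=(9, 3.5))

# Violin plot: one distribution per category, taken straight from the DataFrame.
sns.violinplot(data=df, x="group", y="value", ax=ax_violin)

# Annotated heatmap: each cell is coloured and labelled with its value.
table = pd.DataFrame(
    rng.integers(0, 100, size=(3, 3)), index=list("abc"), columns=list("xyz")
)
sns.heatmap(table, annot=True, fmt="d", ax=ax_heat)

fig.tight_layout()
```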


TensorFlow

  • TensorFlow is a free, end-to-end open-source platform for machine learning.
  • TensorFlow is a computation framework for building machine learning tools.
  • TensorFlow provides a variety of toolkits that allow you to construct models at your preferred level of abstraction.
  • At the time of writing, the stable version of TensorFlow is 1.13.1; a 2.0.0 preview release is also available.
  • TensorFlow supports the Python programming language.
  • Some use cases of TensorFlow include:
    1. Voice/Sound Recognition
    2. Text Based Applications
    3. Image Recognition
    4. Time Series
  • TensorFlow consists of the following two components:
    1. a graph protocol buffer
    2. a runtime that executes the (distributed) graph.
  • Smart Retail Checkout with Object Recognition uses TensorFlow.
  • TensorFlow is used to develop digital and predictive models in Oil and Gas Equipment Monitoring and Analysis.
  • TensorFlow is used to detect Breast Cancer.
  • TensorFlow can be used to predict the energy consumption of a building.

Limitations of TensorFlow:

  • TensorFlow lags behind some of its competitors in speed and usability.

Library Name | What it is used for
Scikit-learn | Machine learning library that provides a wide range of supervised and unsupervised learning algorithms for building machine learning models.
Pandas | Data munging (cleaning and preparing data for input into models) and fast analytics.
NumPy | Scientific computing. It provides a high-performance multi-dimensional array and a wide range of functions to manipulate and compute on those arrays.
Matplotlib | 2D graphs and plots.
Seaborn | Data visualization library built on top of Matplotlib. It provides additional plot types and a more polished look than traditional Matplotlib plots.
TensorFlow | Machine learning library used for building and training machine learning models, such as classification and prediction models.