Demystifying Machine Learning and R

Artificial intelligence is based on several foundations such as mathematics, statistics and neuroscience. Its main objective is to study agents that perceive their environments and make decisions. Artificial intelligence includes several fields such as machine learning, natural language processing and robotics.

The first artificial intelligence programme, Logic Theorist, was created in 1956. It was designed to imitate human problem-solving skills, and it proved 38 of the first 52 theorems in chapter 2 of the Principia Mathematica.

Since then, rapid advances in data storage and computer processing power have significantly accelerated the progress of artificial intelligence. Notable milestones include IBM’s Deep Blue, the chess-playing computer that defeated world champion Garry Kasparov in 1997, and AlphaGo, developed by DeepMind, which defeated one of the best human players at Go.

Machine learning is a subset of artificial intelligence and is seen as a key enabler of artificial general intelligence. This article describes the basics of machine learning and discusses how R can be used to build models and algorithms.

  1. Machine learning

Machine learning is a data analytics technique that helps computers learn from experience. Machine learning algorithms find natural patterns in data, which helps generate insight and make better predictions and decisions. They are used in different applications such as:

  • Natural language processing for voice recognition
  • Medical diagnosis and computational biology for cancer detection and drug discovery
  • Finance applications such as stock trading and credit scoring
  • Image processing and computer vision
  • Automotive and manufacturing applications such as autonomous cars and predictive maintenance

The performance of an algorithm depends on the quality of the data and the number of samples available: cleaner data and more samples generally improve both how quickly a model learns and how accurate it becomes. There are different types of machine learning algorithms; the choice depends on the nature of the application and the desired outcome.

  • Supervised learning:

Supervised machine learning builds a model that makes predictions in the presence of uncertainty. The algorithm takes a known set of input data and known responses and trains the model to generate predictions. The learning process continues until the desired level of accuracy is reached. The learning is called supervised as the correct answers are known in advance.

Supervised learning uses classification and regression techniques to predict outcomes. Classification techniques predict discrete responses, whereas regression techniques predict continuous outcomes. Classification can be used when the data can be categorised or separated into classes. Regression can be used when the response is a real number within a range, such as temperature.
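The difference between the two techniques can be sketched with R's built-in data sets. This is only an illustration: the choice of the `cars` and `mtcars` data, of `lm()` and `glm()` as the models, and of the 0.5 cut-off are assumptions made for the example, not recommendations from this article.

```r
# Regression: predict a continuous response (stopping distance) from speed.
reg_fit  <- lm(dist ~ speed, data = cars)
reg_pred <- predict(reg_fit, newdata = data.frame(speed = 15))

# Classification: predict a discrete response (am: 0 = automatic, 1 = manual)
# from car weight, using logistic regression.
clf_fit   <- glm(am ~ wt, data = mtcars, family = binomial)
clf_prob  <- predict(clf_fit, newdata = data.frame(wt = 3), type = "response")
clf_class <- ifelse(clf_prob > 0.5, "manual", "automatic")

reg_pred   # a real number
clf_class  # one of two labels
```

The regression model returns a number on a continuous scale, while the classifier returns a probability that is then turned into one of two class labels.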

  • Unsupervised learning:

Unsupervised learning finds hidden patterns in unlabelled data. It is used to cluster a population into groups based on similarities and differences, and the algorithm acts on the information without guidance. Clustering is the most common unsupervised learning technique. Its objective is to organise data into classes by finding natural groupings among objects, so that similarity is high within a class and low between classes.

There are two types of clustering techniques: hard and soft clustering. In hard clustering, each data point belongs to exactly one cluster, whereas in soft clustering, instead of being put into a single cluster, each data point is assigned a probability of belonging to each cluster.
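As a rough sketch of the two approaches, the example below uses `kmeans()` from base R for hard clustering and `cmeans()` (fuzzy c-means) from the e1071 package for soft clustering. It assumes e1071 is installed; the `iris` data and the choice of three clusters are arbitrary. Note that fuzzy c-means returns membership degrees, which play the role of the probability figure described above.

```r
library(e1071)  # assumed installed; provides cmeans()

set.seed(42)
x <- as.matrix(iris[, 1:4])

# Hard clustering: each observation gets exactly one cluster label.
hard <- kmeans(x, centers = 3, nstart = 10)
head(hard$cluster)

# Soft clustering: each observation gets a membership degree per cluster,
# and the degrees in each row sum to 1.
soft <- cmeans(x, centers = 3, m = 2)
head(round(soft$membership, 2))
```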

  • Reinforcement learning:

Reinforcement learning trains the machine to make specific decisions based on a system of rewards and punishments. The machine learns from experience and continually improves its knowledge through trial and error.

A key feature of reinforcement learning is that it explicitly considers a goal-directed agent interacting with an uncertain environment. All reinforcement learning agents have goals, can sense their environments, and can choose actions to influence their environments. Examples of reinforcement learning applications include robotics and game playing.
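The trial-and-error loop can be illustrated with one of the simplest reinforcement learning settings, a two-armed bandit solved with an epsilon-greedy rule. Everything here is a hypothetical toy written in base R: the reward probabilities, the value of `epsilon` and the number of steps are assumptions chosen for the example.

```r
set.seed(1)
true_reward_prob <- c(0.3, 0.7)     # hypothetical environment: action 2 pays more often
n_actions        <- length(true_reward_prob)
value_estimate   <- numeric(n_actions)  # agent's current estimate of each action's value
pull_count       <- integer(n_actions)
epsilon          <- 0.1             # share of steps spent exploring

for (step in 1:2000) {
  # Explore with probability epsilon, otherwise exploit the best estimate so far.
  if (runif(1) < epsilon) {
    action <- sample.int(n_actions, 1)
  } else {
    action <- which.max(value_estimate)
  }
  reward <- rbinom(1, 1, true_reward_prob[action])  # 1 = reward, 0 = no reward
  pull_count[action] <- pull_count[action] + 1
  # Incremental average: move the estimate towards the observed reward.
  value_estimate[action] <- value_estimate[action] +
    (reward - value_estimate[action]) / pull_count[action]
}

round(value_estimate, 2)
pull_count
```

The agent is never told which action is better; it has to discover this from the rewards it receives.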

  • Deep learning:

Deep learning is a subset of machine learning. It is a family of algorithms based on artificial neural networks, which mimic the biological neural networks of the human brain. The basic building block is the perceptron, a mathematical representation of a biological neuron. As in the human brain, an artificial neural network is made up of several layers of interconnected perceptrons.

Input values are passed through the hidden layers of the network until they reach the output layer, which represents the prediction. Each connection into a node carries a weight: the node multiplies each input value by its weight, sums the results and passes the sum through an activation function. There are several types of artificial neural networks, such as convolutional neural networks and recurrent neural networks.

Deep learning has significantly improved the accuracy of machine learning models and has led to great advances in fields such as computer vision and speech recognition.
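The flow of values from input to output can be sketched as a forward pass in base R. The layer sizes, the random weights and the sigmoid activation are assumptions made for illustration; a real network would learn its weights from data.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(7)
x  <- c(0.5, -1.2, 3.0)               # one input example with 3 features
W1 <- matrix(rnorm(4 * 3), nrow = 4)  # hidden layer: 4 nodes, 3 weights each
b1 <- rnorm(4)                        # one bias per hidden node
W2 <- matrix(rnorm(1 * 4), nrow = 1)  # output layer: 1 node, 4 weights
b2 <- rnorm(1)

# Each node multiplies its inputs by its weights, adds a bias,
# and passes the sum through the activation function.
hidden <- sigmoid(W1 %*% x + b1)
output <- sigmoid(W2 %*% hidden + b2)

output  # the prediction, a value between 0 and 1
```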

  2. R for machine learning

R was developed in the early 1990s. It was originally used in academia but is increasingly used in business settings as well as in commercial software. It is widely used for data analysis and statistical computing. In 2017, R was listed among the top ten programming languages.

Data management activities such as data labelling and filtering are intuitive in R. Packages such as dplyr, tidyr, readr and SparkR have made data import, data manipulation and large-scale computation faster.

  • caret package:

caret is a powerful R package for machine learning. It combines model training and prediction in a single interface.

caret provides a structure for comparing different models and makes parameter tuning easy, which can improve the modelling workflow. It supports more than 200 models, and the method argument switches between them, for example between gradient boosting and linear models. The individual R packages still perform the actual statistics and modelling; for example, the randomForest() function from the randomForest package fits the random forest model.

Comparing different types of models is one key feature of caret. One way to do this is the resamples() function, which collects the resampling results of several trained models so that they can be compared side by side. Other performance metrics are also available within the package, which helps determine the optimal parameters for a given algorithm. The train() function can be used to build any of the supported predictive models.
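A minimal sketch of this workflow is shown below, assuming the caret package is installed. The `mtcars` data, the two `method` values ("lm" and "knn") and the 5-fold cross-validation are choices made for the example only.

```r
library(caret)  # assumed installed

ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# Same train() call, different `method` value. Resetting the seed gives
# both models the same folds, so their results are directly comparable.
set.seed(123)
fit_lm <- train(mpg ~ wt + hp, data = mtcars, method = "lm", trControl = ctrl)

set.seed(123)
fit_knn <- train(mpg ~ wt + hp, data = mtcars, method = "knn", trControl = ctrl)

# Collect the resampling results of both models and compare them.
results <- resamples(list(linear = fit_lm, knn = fit_knn))
summary(results)
```

`summary(results)` reports the distribution of each metric (for regression, RMSE, R-squared and MAE) across the folds for every model.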

  • randomForest package:

This package is widely used in machine learning. It builds a large number of decision trees, each from a random sample of the observations and of the variables. The trees are then combined, and their votes are counted to predict the class of the response variable.
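A short sketch, assuming the randomForest package is installed; the `iris` data and the values of `ntree` and `mtry` are arbitrary choices for illustration.

```r
library(randomForest)  # assumed installed

set.seed(2024)
# ntree: number of trees; mtry: variables sampled at random for each split.
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 200, mtry = 2)

print(rf_fit)       # includes the out-of-bag error estimate
importance(rf_fit)  # contribution of each variable

# Each tree votes and the majority class is returned.
rf_pred <- predict(rf_fit, newdata = iris[c(1, 51, 101), ])
rf_pred
```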

  • e1071 package:

This package is widely used for its implementations of machine learning algorithms such as support vector machines, fuzzy clustering and the naïve Bayes classifier.
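The following sketch fits two of these classifiers with `svm()` and `naiveBayes()`, assuming e1071 is installed. The 100/50 split of `iris` and the radial kernel are assumptions made for the example.

```r
library(e1071)  # assumed installed

set.seed(10)
train_rows <- sample(nrow(iris), 100)
train_set  <- iris[train_rows, ]
test_set   <- iris[-train_rows, ]

svm_fit <- svm(Species ~ ., data = train_set, kernel = "radial")
nb_fit  <- naiveBayes(Species ~ ., data = train_set)

svm_pred <- predict(svm_fit, test_set)
nb_pred  <- predict(nb_fit, test_set)

# Confusion matrix for the SVM and overall accuracy for naive Bayes.
table(predicted = svm_pred, actual = test_set$Species)
mean(nb_pred == test_set$Species)
```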

  • rpart package:

This package is used to build classification and regression models. The rpart() function fits a decision tree by recursive partitioning, which exposes the relationships between the response and the predictor variables. In a business context, this helps to understand relationships between variables, for example the relationship between sales volume and sales campaigns.
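As an illustration, the sketch below fits a classification tree to the `kyphosis` data set that ships with rpart. It stands in for a business data set such as sales figures, which is not available here.

```r
library(rpart)  # ships with standard R installations

# Classification tree: does kyphosis occur after surgery, given the
# patient's age and the number and position of vertebrae involved?
tree_fit <- rpart(Kyphosis ~ Age + Number + Start,
                  data = kyphosis, method = "class")

print(tree_fit)               # the splits show how predictors relate to the response
tree_fit$variable.importance  # which variables matter most

tree_pred <- predict(tree_fit, newdata = kyphosis, type = "class")
mean(tree_pred == kyphosis$Kyphosis)
```

For a continuous response such as sales volume, the same call with `method = "anova"` would fit a regression tree instead.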

  • kernlab package:

The kernlab package is used to implement support vector machines. It offers various kernel functions, such as the polynomial kernel and the hyperbolic tangent kernel. Kernel-based machine learning methods are used to solve challenging clustering, classification and regression problems. Users can either use the predefined kernel functions in kernlab or build their own.
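Both options can be sketched with `ksvm()`, assuming kernlab is installed. The two-class subset of `iris`, the polynomial degree and the hand-written `linear_kernel` function are all assumptions for the example; a user-defined kernel is an ordinary function of two vectors that is given the class "kernel".

```r
library(kernlab)  # assumed installed

# A two-class subset of iris keeps the example small.
two_class <- droplevels(iris[iris$Species != "setosa", ])

# Predefined kernel: polynomial kernel of degree 2.
svm_poly <- ksvm(Species ~ ., data = two_class,
                 kernel = "polydot", kpar = list(degree = 2))

# User-defined kernel (hypothetical example): a plain dot product.
linear_kernel <- function(x, y) sum(x * y)
class(linear_kernel) <- "kernel"
svm_custom <- ksvm(Species ~ ., data = two_class, kernel = linear_kernel)

poly_pred <- predict(svm_poly, two_class)
mean(poly_pred == two_class$Species)  # accuracy on the training data
```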

  • nnet package:

The nnet package implements feed-forward neural networks with a single hidden layer. Artificial neural networks mimic the processing of the brain and can be used to model complex patterns and prediction problems.
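A minimal sketch with `nnet()` is shown below. The hidden layer size, the weight decay and the iteration limit are arbitrary values picked for the example, and the accuracy is measured on the training data only.

```r
library(nnet)  # ships with standard R installations

set.seed(99)
# size: number of nodes in the hidden layer; decay: weight regularisation.
nn_fit <- nnet(Species ~ ., data = iris, size = 4, decay = 0.01,
               maxit = 200, trace = FALSE)

nn_pred <- predict(nn_fit, iris, type = "class")
mean(nn_pred == iris$Species)
```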

  3. Tips for using R
  1. Understanding R syntax: R is intuitive and can be learned easily. If you already use Python or C++, the R syntax will be easy to understand.
  2. Getting data into R: the easiest way to get data into R is usually via a CSV file.
  3. Modifying data in R: the createDataPartition() function from caret can be used to split data, whereas the maxDissim() function creates sub-samples based on maximum dissimilarity.
  4. Efficient workflow in R: the key to an efficient workflow is a clear objective and a good selection of packages that are relevant to the desired computation and analysis.
  5. Using machine learning in R: there are many packages that make machine learning in R easy. For example, the rpart package can be used to create a decision tree, the NeuralNetTools package can be used to draw a neural network and the kernlab package can be used to train a support vector machine.
  6. Graphical capabilities in R: R has advanced graphical capabilities and is supported by many frameworks such as ggplot2 and htmlwidgets.
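Tips 2 and 3 can be sketched together as follows, assuming caret is installed. Since no real file accompanies this article, the example first writes the built-in `iris` data to a temporary CSV file; the 80/20 split is an arbitrary choice.

```r
library(caret)  # assumed installed; provides createDataPartition()

# Write a built-in data set to a temporary CSV file, then read it back in.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)
flowers <- read.csv(csv_path, stringsAsFactors = TRUE)

# Split into training and test sets while preserving the class proportions.
set.seed(5)
train_idx <- createDataPartition(flowers$Species, p = 0.8, list = FALSE)
train_set <- flowers[train_idx, ]
test_set  <- flowers[-train_idx, ]

table(train_set$Species)
table(test_set$Species)
```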

Conclusion

There is no doubt that machine learning will continue to play a significant role in the advancement of artificial intelligence. Machine learning algorithms that are efficient, accurate and transparent will be key in ensuring that artificial intelligence is accepted and seen as an ethical and a sustainable tool. The future of machine learning is expected to flourish with the development of quantum computing, collaborative learning, cognitive services and more advanced algorithms.

Coding and programming languages are an essential part of the skill set of data scientists and machine learning scientists. The sections above describe some of the R packages and provide some tips for beginners. However, there are other skills that are important to master, such as problem solving and communication.

Solving real business problems with machine learning in a way that adds to the business bottom line is important. This means looking beyond data and algorithms, adopting a business mindset and being able to ask questions such as why, what and how.

Communication and stakeholder management are also essential skills. Data scientists and machine learning scientists often have to work in and lead teams. They need to be able to influence others and pitch their ideas at different levels of the organisation.
