Machine Learning

Data Cleaning in a Nutshell

“Better data beats fancier algorithms.”

“Garbage in, garbage out” is the motto to follow when building an accurate machine learning model.

If the data under analysis is not accurate, then it is not useful. No matter how sophisticated your model is, without data cleaning it will deliver biased and inaccurate results.

Thus, data cleaning, also called data cleansing or data scrubbing, is one of the most crucial parts of machine learning.

What is data cleaning?

Data cleansing can be understood as the process of making data ready for analysis.

Eliminating null records and unnecessary columns, fixing outliers (junk values), and restructuring the data to enhance its readability are some of the components of data cleaning.

Data cleaning also focuses on increasing the accuracy of the dataset by rectifying the existing information, instead of just removing chunks of useless data.

Steps involved in data cleaning

There is no single procedure for data cleaning; it varies from one dataset to another. However, having a roadmap is essential to keep you on the right track.

Given below are the basic steps which can be followed to create a template for your data cleaning process.

Eliminating duplicates and irrelevant observations

  • Duplicate or redundant values affect the model's accuracy to a large extent: repeated records add extra weight to some observations, pulling results toward either the correct or the incorrect side and producing bias.
  • Irrelevant data adds no value to the dataset and should be dropped to save resources such as memory and processing time.
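As a sketch of how this might look in practice, here is a minimal pandas example; the dataset and column names are invented for illustration:

```python
import pandas as pd

# Hypothetical dataset with one duplicate row and one irrelevant column.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [100.0, 250.0, 250.0, 75.0],
    "session_token": ["a1", "b2", "b2", "c3"],  # adds no analytical value
})

# Drop exact duplicate rows, then drop the column that adds no value.
df = df.drop_duplicates()
df = df.drop(columns=["session_token"])

print(len(df))  # 3 rows remain
```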

Rectifying structural errors

  • Structural errors include inconsistencies in naming conventions, typos, and incorrect capitalization. These typographical errors result in mislabeled classes or categories.
  • For instance, a model might treat “NA” and “Not Applicable” as two different categories, even though they represent the same value. Such structural variations make the algorithms inefficient and lead to unreliable results.
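The “NA” versus “Not Applicable” situation can be handled by mapping every variant onto one canonical label; a minimal pandas sketch, with invented category values:

```python
import pandas as pd

# Hypothetical column where the same category appears under several spellings.
status = pd.Series(["NA", "Not Applicable", "n/a", "Approved", "approved"])

# Normalize whitespace and capitalization, then map variants to one label.
canonical = (
    status.str.strip()
          .str.lower()
          .replace({"na": "not applicable", "n/a": "not applicable"})
)

print(canonical.unique())  # ['not applicable' 'approved']
```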

Filtering out irrelevant outliers

  • Outliers are values that do not fit the rest of the dataset under observation; they can be understood as noise in the data.
  • Outliers often arise from manual errors or data-entry mistakes. However, outliers are not always incorrect, so they should not be dropped without a valid reason.
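One common heuristic for flagging suspicious values is the 1.5 × IQR rule; here is a minimal sketch on hypothetical readings (whether to drop a flagged point remains a judgment call, as noted above):

```python
import pandas as pd

# Hypothetical sensor readings with one obvious junk value.
values = pd.Series([10, 12, 11, 13, 12, 11, 9, 10, 500])

# Flag points lying outside 1.5 * IQR of the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [500]
```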

Handling missing data

Handling missing values is the trickiest step in the data cleaning process. Missing values cannot simply be ignored or eliminated, since they may represent something crucial.

Following are a couple of the most common methods to deal with the missing data:

  • Removing the observations that have missing values, which might result in losing useful information.
  • Imputing the missing values based on other observations. Since imputation is based on assumptions rather than actual measurements, it adds no new information and may compromise data integrity.
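Both options can be sketched in a few lines of pandas; the toy dataset and the median-imputation choice here are illustrative assumptions, not the only way to impute:

```python
import pandas as pd

# Hypothetical dataset with two missing ages.
df = pd.DataFrame({"age": [25, None, 31, None, 40],
                   "score": [88, 92, 79, 85, 90]})

# Option 1: drop rows with missing values (loses the scores 92 and 85).
dropped = df.dropna()

# Option 2: impute the missing ages with the median of the observed ages.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

print(len(dropped), imputed["age"].tolist())
```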

Some data cleansing tools

Data cleaning is a crucial step in machine learning for achieving accuracy and efficiency.

Performing data cleansing on huge volumes of data manually is tedious and error-prone.

This makes data cleaning tools valuable, since they help keep large amounts of data clean and consistent.

OpenRefine, TIBCO Clarity, Trifacta Wrangler, IBM InfoSphere QualityStage, Cloudingo, etc. are some of the most popular data cleaning tools.


Working with clean data comes with a lot of advantages: improved efficiency, a reduced error margin, accuracy, consistency, better decision making, and many more.

Thus, the data should be cleansed before fitting any model with it.

If you want to invest in data cleaning, you can learn it by implementing it in Python or R.


Learners' styles of learning!

I have been into academic teaching and corporate training since 2009, and have been delivering technology sessions to Engineering students as well as Information Technology enthusiasts.

One of the greatest realizations I have had during my sessions is that effective content delivery first requires understanding the learners' needs and learning styles.

Generally, the needs can be guessed initially from the objectives of the course/training, but for learning styles guesswork will not help, so it is really necessary to do a pre-survey with the intended audience.

This pre-survey has always helped me understand how the group is composed in terms of the four learning styles: Visual, Auditory, Reading/Writing, and Kinesthetic.

This pre-survey is a great tool for planning content delivery with the required demonstrations, reading materials, and practice for participants. The effectiveness I have achieved with this approach is far better than the guesswork I relied on in my initial years of teaching and training.

I would love to hear your take on this.

Machine Learning Webinar

Machine Learning- let us get started!

Machine learning is one of the most popular domains that new-age application developers and companies are capitalizing on. It is a field of computer science that leverages the applied practice of mathematics and statistics.

Why did this create such a buzz?

Because it reduces the intensive logic implementation needed to process massive quantities of data (generally known as big data), and the results are promising in terms of finding patterns in the data, leading to better business-oriented decisions.

Now, as a beginner, the concept of machine learning can be overwhelming, as there is plenty of scattered information available across the web, including various theoretical courses and vendor documentation.

So here I will try to give you a simple flow for how, as a beginner, you can familiarize yourself with the machine learning domain and where you can start looking in the first place.

The formal definition could be:

Machine learning (ML) is a field of computer science concerned with programs that learn, that is, with the question of how to construct computer programs that automatically improve with experience.

Now you might also be thinking about how artificial intelligence is different from machine learning, so here is a big picture for you.

Here you can see that machine learning is a subset, in fact a more specialized form, of artificial intelligence, and it in turn supports the deep learning domain for more intense and intelligent applications.

(Image: diagram showing deep learning as a subset of machine learning, which is itself a subset of artificial intelligence.)

Now the next point to understand is why we want computer programs to improve with experience. It's because:

  • we have huge data and we want to make decisions or predictions from it
  • we want computers to learn to identify patterns without being explicitly programmed to

And as is often said, DATA is the new currency of this digital world and is priceless. Therefore, it is essential to utilize it to unlock the unique potential of your business.

Great, you know why it is essential for computers to improve.

Now, as a programmer, what should you know so that this automation can be achieved?

Types of machine learning

Broadly, there are three:

Supervised Learning

This is the simplest to implement; it primarily solves problems related to regression and classification. Most importantly, the data available for analysis is structured, with minimal anomalies, and even the anomalies that are present can be rectified using statistical measures.

General use cases implemented with supervised learning include image classification, fraud detection, and weather/market forecasting. So you can infer that wherever straightforward predictions are needed, supervised learning fits.
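A minimal supervised-learning sketch with scikit-learn, using invented labeled data (hours studied versus exam score); the numbers are purely illustrative:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical labeled data: each example pairs a feature with a known label.
X = [[1], [2], [3], [4], [5]]   # feature: hours studied
y = [52, 57, 61, 68, 73]        # label: exam score

# Fit a regression model on the labeled examples, then predict a new case.
model = LinearRegression().fit(X, y)
prediction = model.predict([[6]])  # expected score for 6 hours of study
```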

Unsupervised Learning

This works toward the same objective of prediction, but the complexity is increased, because the data available for analysis is either minimally structured or totally unstructured. Therefore, an added process of clustering or dimensionality reduction is required before prediction can take place.

So this requires more insight into the working concepts of statistical procedures and is the next stage of learning in ML. General use cases include customer segmentation, recommender systems, feature discovery, etc.
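A minimal clustering sketch with scikit-learn's KMeans, on invented, unlabeled customer data; the two segments are discovered without any labels being provided:

```python
from sklearn.cluster import KMeans

# Hypothetical unlabeled customer data: [annual spend, visits per month].
X = [[100, 2], [120, 3], [110, 2],     # low-spend customers
     [900, 20], [950, 22], [880, 19]]  # high-spend customers

# Let the algorithm discover two segments on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_  # cluster assignment for each customer
```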

Reinforcement Learning

This basically leverages the power of both supervised and unsupervised procedures, with the added factor of iterative learning when errors (mispredictions) occur in the data interpretation.

The procedures (algorithms) implemented in such a system are designed so that they can tune their attributes/parameters (variables), testing a variety of values to find the best combination. For example, neural networks have many parameters: the number of layers, the number of neurons in each layer, the connection density between neurons, the weights, etc.

The general use cases for such implementations are robot navigation, learning tasks, game AI, self-driving cars, etc.
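To make the "improve with experience" idea concrete, here is a toy Q-learning sketch (a classic reinforcement-learning algorithm) on an invented 5-cell corridor; real systems such as game AI or robot navigation are far more elaborate:

```python
import random

# The agent starts at cell 0 and is rewarded only for reaching cell 4.
n_states, actions = 5, [-1, +1]        # actions: move left or right
q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

random.seed(0)
for _ in range(500):                   # episodes of trial-and-error learning
    s = 0
    while s != 4:
        # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: q[(s, x)])
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == 4 else 0.0
        # Update the estimate from the observed outcome (learning from experience).
        best_next = max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
        s = s_next

# After training, the greedy policy should move right (+1) from every cell.
policy = [max(actions, key=lambda a: q[(s, a)]) for s in range(4)]
print(policy)  # [1, 1, 1, 1]
```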

The interesting point is that, for each type of learning, plenty of algorithms have been published as APIs in various open-source ML libraries such as scikit-learn, Keras, TensorFlow, etc., while for data management in working memory (RAM) the primary libraries are pandas and NumPy.

Here is a webinar discussion on the machine learning types and related topics.

So, as a programmer, it has become very easy to implement your use cases, provided you know what problem you are trying to solve, what data you will use, which algorithm you are going to apply, and which library supports it.

Machine Learning implementation steps

  1. Define your problem statement.
  2. Get data from various sources and pre-process it for feeding to the selected algorithm(s).
  3. Build the model by selecting the right ML algorithm and testing it with the data.
  4. Optimize and improve (this requires repeating steps 2 and 3 until satisfactory results are produced).
  5. Summarize the results / tell a story using various data visualizations.
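The steps above can be sketched end to end with scikit-learn; the dataset (the classic iris sample) and the algorithm chosen here are illustrative, not prescriptive:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: problem statement - classify iris flowers by species.
# Step 2: get the data and pre-process it (here: a train/test split).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 3: build a model with a chosen algorithm and test it with the data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Step 4: if accuracy is unsatisfactory, revisit steps 2-3
#         (more data, another algorithm, tuned parameters).
# Step 5: summarize/visualize the results for your audience.
print(f"test accuracy: {accuracy:.2f}")
```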

That's it: if you follow these steps, you are through with your ML implementation work.

Now the next question is how to know which library to look into and which language should be learned so that the implementation can be hassle-free.

Possible Machine learning track

  • Choose a programming language: Python or R. As a beginner, I would prefer Python, as it is easy to follow and many of the libraries supported by the ML community are written in Python. Apart from this, you should have CRUD skills in SQL. Also, you are not required to be an expert programmer from day one; you will become one as you practice.
  • Practice your data processing/wrangling using pandas & NumPy. Also practice with the Matplotlib library to familiarize yourself with data visualization using various charts.
  • Now, once you are through with the first two stages, it is time to spread your wings and get your hands dirty with algorithms from the scikit-learn/Keras libraries, or any others of interest as per your problem statement. Take your time to work on various small implementations: start with regression-based algorithms, then classification, clustering, and so on. Spend some good time practising these, as this will lay the foundation for your enterprise career.
  • Finally, it's time to move on to the enterprise solutions used by the industry for processing real-time data, such as Presto, Hive, Hadoop, AWS ML toolkits, Spark, etc.

Moreover, apart from all that is mentioned above, each cloud service provider has its own service stack to support the machine learning environment within its platform. It is always up to your inclination toward a provider whether you additionally learn their platform-specific tools over and above what we have discussed.

If you have a different view or something to discuss, feel free to start a discussion thread below. I would love to join in.

Who am I to teach you about machine learning?

Well, I have been working intensively in ML to solve my Ph.D. Research problem and have been through various ML projects to test out multiple hypotheses.

Apart from this, I have been mentoring the budding researchers working on finding solutions to complex problems in the cloud computing domain.

You may read my brief career progress on the About page or check my LinkedIn.

Look forward to having you in the webinar and have a great discussion.



Virtualization in Cloud Computing

Virtualization, one of the most popular technologies, has revolutionized the way infrastructure is maintained and rented as a public utility.

You will be surprised to know that it was first used in the early 1970s and was introduced by IBM to virtualize its mainframe systems through hardware virtualization, which is still a popular mode of service delivery.

Let’s start with a fundamental question.

What is virtualization?

The Red Hat documentation defines it well:

Virtualization is a technology that allows you to create multiple simulated environments or dedicated resources from a single, physical hardware system. Software called a hypervisor connects directly to that hardware and allows you to split 1 system into separate, distinct, and secure environments known as virtual machines (VMs). These VMs rely on the hypervisor’s ability to separate the machine’s resources from the hardware and distribute them appropriately.

So virtualization clearly allows web hosting and data center service providers to make the best use of their hardware investments while maintaining optimal operational costs.

Thanks to this, service providers have been able to pass on the benefits to their users/customers through low-cost services and a low barrier to entry for new customers.

Virtualization technology delivers unprecedented benefits, including:

  • Increased performance and optimized sharing of compute capacity.
  • Better use of under- and over-utilized hardware and software resources.
  • Reduced indirect carbon emissions.

Virtualization technology has a long history of development and has gone through various phases of research, development, and implementation. In fact, a couple of research projects have been conducted to support the simulation of cloud-based virtualization systems. These systems model real-world cloud behavior to develop state-of-the-art optimization policies.


Hello world!

“Hello World!” is the first message a new programmer uses to test-run a program.

I am using the same to test my first blog post, display it to the world, and present myself to the digital world.

This blog intends to be a digital version of the real-life Anupinder Singh and will share my learnings from my personal and professional experiences.

With the hope of reaching new heights, here is my first blog test!