Over the last eight years, I have spent considerable time analysing data, with a focus on prediction and inference. This naturally led me to the emerging field of data science, which, over the same period, has been exploding with powerful and accessible tools and techniques.
Using an increasingly broad and advanced range of solutions ultimately fuelled a thirst for more extensive and in-depth knowledge of all crucial aspects of the discipline.
In November 2018 my contract with Vodafone ended. It was an excellent opportunity to take some time out and study all the subjects and specialisations to the level I had set out to achieve.
I am delighted to announce that I have now completed this journey, one that started many years ago.
What have I been up to?
I’ve been busy with quite a lot, but I’d like to highlight the following:
Mathematics for Machine Learning
I did not read mathematics at university, even though I majored in highly numerate subjects, including Marketing Research (Statistics) and Financial Management.
I wanted a fundamental understanding of algorithms and completed Mathematics for Machine Learning by Imperial College London on Coursera.
Statistics & Probability
I studied statistics many years ago and wanted to refresh my knowledge and update my toolbag. I did a refresher in statistics and probability, focussing specifically on using programming libraries to perform statistical calculations and tests, and on reviewing aspects that I hadn’t used in a while and had thus forgotten.
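To give a flavour of the kind of library-based calculation I mean, here is a small sketch using only Python’s standard library: a normal-approximation confidence interval for a sample mean. The dataset is made up purely for illustration.

```python
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the sample mean."""
    n = len(sample)
    m = mean(sample)
    se = stdev(sample) / n ** 0.5                   # standard error of the mean
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.96 for a 95% interval
    return m - z * se, m + z * se

data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
low, high = mean_confidence_interval(data)
```

For small samples one would normally prefer a t-interval (e.g. via SciPy), but the standard-library version above is enough to show the pattern.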
An Introduction to Statistical Learning is considered by many to be the bible of machine learning. It covers some of the most important modelling and prediction techniques, and the knowledge is transferable even though its applications are in R.
I studied the book a few years ago and fancied doing the associated online course at some point. Earlier this year I completed the online course Statistical Learning by Stanford University, with distinction.
I supercharged my insight further by working through several books on building machine learning algorithms from scratch.1
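As an example of the “from scratch” exercises involved, here is a sketch of simple linear regression fitted by batch gradient descent, using no libraries at all. The data and hyperparameters are my own illustrative choices, not from any particular book.

```python
def fit_linear(xs, ys, lr=0.01, epochs=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Noise-free points on the line y = 3x + 1, so the fit should recover w≈3, b≈1
xs = [0, 1, 2, 3, 4]
ys = [1, 4, 7, 10, 13]
w, b = fit_linear(xs, ys)
```

Writing even a toy optimiser like this makes the closed-form and library versions far less mysterious.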
No data scientist’s toolkit is complete without the ability to process big data; it’s fast becoming commonplace. Up to now, I’ve not needed big data technologies for my work, because both Python and R can be coaxed into working with reasonably large datasets.
I decided to focus on Spark rather than the more mature Hadoop MapReduce. I occasionally use SparklyR, mainly because it is a doddle to set up locally within the RStudio IDE, making it easy to experiment with and learn. Ultimately, though, I chose to concentrate on the Python ecosystem and completed a few courses in PySpark.
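The MapReduce model mentioned above can be sketched in plain Python, without any cluster machinery, using the canonical toy example of a word count:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct key (word)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark beats mapreduce", "spark is fast"]
counts = reduce_phase(map_phase(lines))  # counts["spark"] == 2
```

In PySpark the same computation is a couple of RDD transformations (`flatMap` to emit the pairs, `reduceByKey` to sum them), with Spark handling the distribution across the cluster.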
For those interested in data science, I maintain a page with a high-level outline of the process and associated resources.