dplyr 0.8.1 grouping functions update

Changes to group_modify() and group_map()

RStudio has just released a minor update to dplyr. The team has rethought the new purrr-style grouping functions used to iterate over grouped tibbles. The changes include: group_map() is now used for iterating over grouped tibbles without making assumptions about the return type of each operation; it combines the results in a list, similar to purrr::map(). The previous behaviour was renamed to group_modify(), which always returns a grouped tibble, evaluating each operation and combining the results with a reconstructed grouping structure, similar to purrr::modify(). [Read More]
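As a rough illustration of the difference between the two functions, here is a minimal sketch using the built-in mtcars dataset. The choice of dataset, grouping column and per-group operations are assumptions made for this example and do not come from the original post.

```r
library(dplyr)

# Assumption for illustration: group the built-in mtcars data by cylinder count
by_cyl <- mtcars %>% group_by(cyl)

# group_map(): the per-group result can be any object (here a fitted model),
# and the results come back as a plain list with one element per group
models <- by_cyl %>% group_map(~ lm(mpg ~ wt, data = .x))

# group_modify(): the per-group result must be a data frame,
# and the results are combined back into a grouped tibble
heads <- by_cyl %>% group_modify(~ head(.x, 2))

class(models)  # a list
class(heads)   # includes "grouped_df"
```

In both calls, .x refers to the rows of the current group, without the grouping column.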

Vroom

Supercharge data import in R

I’m very excited to learn about vroom, RStudio’s latest tidyverse offering. It imports data a lot faster than existing R solutions. Check out the following benchmark, which compares a handful of similar functions and the interactions between various libraries. The speed is already a game-changer, but the following features sweeten the deal. Similar to readr: vroom shares many features with readr, including nearly all of the parsing features of readr for delimited and fixed width files. [Read More]
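For a sense of what a basic import looks like, here is a minimal sketch that writes a small CSV to a temporary file and reads it back with vroom(). The temporary file and the use of mtcars are assumptions made for this example. It does not reproduce the benchmark mentioned above.

```r
library(vroom)

# Assumption for illustration: create a small CSV file to import
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)

# Read the delimited file; column types are guessed, much as readr does
cars <- vroom(path, delim = ",")

dim(cars)
```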

Sabbatical

an exhaustive review of data science fundamentals

I am now available for Management Consultant | Data Scientist roles. Please get in touch if you need my help. During the last eight years, I spent considerable time analysing data, with a focus on prediction and inference. It naturally led me to the emerging Data Science field which, at the same time, had been exploding with powerful and accessible tools and techniques. Using an increasingly broad and advanced range of solutions ultimately fuelled a thirst for more extensive and in-depth knowledge of all crucial aspects of the discipline. [Read More]

Planning weekly food

using R to menu plan and create a food budget

Introduction: A few years ago I woke up to an epiphany, realising that I was becoming my dad. I had started a campaign against wastefulness, switching off lights and eating leftovers to name but a few examples. I set out to transform our menu planning and the weekly food shop as part of this crusade. Menu planning is a chore which comes easily to some. For others like me, though, it is just another thing to think about on top of an already busy life. [Read More]

Summarising tables

Approach to streamline workflow when summarising tables

Introduction: The end goal of the data science process is to communicate findings, typically to a non-technical audience. It is the most important deliverable of the process, even if not the first thing that springs to mind when considering data science. Fantastic insights are of no use if the intended audience doesn’t understand or trust them. It is therefore vital to take care when presenting findings. There are typical and often repeated actions when summarising data in tables. [Read More]

Low-cost housing in South Africa

Reporting state changes of large-scale programmes over time

Introduction: The purpose of this case study is to explore aspects of reporting state changes of large-scale programmes over time. A state change in this context refers to the shift in statuses of multiple activities performed during the delivery of a project, the project forming part of a more extensive programme of works (a concentrated portfolio of project activities). We could attempt this using Excel, and perhaps we’ll be successful as the current dataset only contains c. [Read More]

Simulating data and file-based ETL

Introduction: Data Scientists spend a lot of time importing, cleaning, tidying and transforming data before any decent analysis can start. Like many, the industry that I work in typically emails files to communicate data and reports. I follow a consistent approach to ETL and subsequent data concentration to better manage the accumulation of multiple, disparate files from a variety of sources and in different formats. This tutorial demonstrates a simplified version of this process. [Read More]

Hello World

Last year Louis Columbus of Forbes stated that Machine Learning Engineers, Data Scientists, and Big Data Engineers rank among the top emerging jobs on LinkedIn. David Robinson’s Stack Overflow article about The Incredible Growth of Python substantiates this observation. Data science in the UK is still emerging, whereas in the US it appears to have taken off in a big way. It is encouraging to read about mainstream adoption in the UK, even if it is still early days. [Read More]