Apr 1, 2022
R is the 18th letter of the Latin alphabet. It represents the rhotic consonant, or the r sound. It goes back to the Greek Rho, the Phoenician Resh before that, and before that the Egyptian rêš, which is the same word the Egyptians had for head. R appears in about seven and a half percent of the words in the English dictionary.
And R is probably the best language out there for programming around various statistical and machine learning tasks. We may use tools like TensorFlow imported into languages like Python to prototype, but R is incredibly performant for all the maths. And so it has become an essential piece of software for data scientists.
The R programming language was created in 1993 by two statisticians, Robert Gentleman and Ross Ihaka, at the University of Auckland, New Zealand. It has since been ported to practically every operating system and is available at r-project.org. R began as an implementation of the "S" language, and the name "R" is partly a play on S and partly taken from the first initial of its two authors; S itself lived on in a commercial software package that we’ll discuss in a bit. R was primarily written in C, with some Fortran and, since then, a good deal of R itself.
And there have been statistical packages since the very first computers were used for math.
IBM in fact packaged up BMDP when they first started working on the idea at the UCLA Health Computing Facility. That was 1957. Then came SPSS out of the University of Chicago in 1968. And the same year, John Sall and others gave us SAS (or Statistical Analysis System) out of North Carolina State University. And those evolved from those early days through into the 80s with the advent of object oriented everything, and thus got not only windowing interfaces but also extensibility, code sharing, and, as we moved into the 90s, acquisitions. BMDP was acquired by SPSS, which was then acquired by IBM, and the products were getting more expensive but not getting a ton of key updates for the same scientific and medical communities.
And so we saw the upstarts in the 80s, Data Desk and JMP and others. These were tools built for windowing operating systems and in object oriented languages. We got the ability to interactively manipulate data, zoom in and spin three dimensional representations of data, and all kinds of pretty aspects. But they were not a programmer's tool.
S was begun in the seventies at Bell Labs and was supposed to be a statistical MATLAB, a language specifically designed for number crunching. And the statistical techniques were far beyond where SPSS and SAS had stopped. And with the breakup of Ma Bell, parts of Bell became Lucent, which sold S to Insightful Corporation, which released S-PLUS and would later get bought by TIBCO. Keep in mind, Bell was testing line quality and statistics and, going back to World War II, employed some of the top scientists in those fields, ones who would later create large chunks of the quality movement and implementations like Six Sigma. Once S went to what was basically a standalone software company, it became less about the statistics and more about porting to different computers to make more money.
Private equity and portfolio conglomerates are, by nature, after improving the multiples on a line of business. But sometimes statisticians in various fields might feel left behind. And this is where R comes into the picture. R gained popularity among statisticians because it made it easier to write complicated statistical algorithms without learning an entire programming language. Its popularity has grown significantly since then. R has been described as a cross between MATLAB and SPSS, but much faster.
R was initially designed to be a language that could handle statistical analysis and other types of data mining, an offshoot of which we now call machine learning. R is also an open-source language and as with a number of other languages has plenty of packages available through a package repository - which they call CRAN (Comprehensive R Archive Network). This allows R to be used in fields outside of statistics and data science or to just get new methods to do math that doesn’t belong in the main language.
There are over 18,000 packages for R. One of the more popular is ggplot2, an open-source data visualization package. data.table is another that performs programmatic data manipulation operations. dplyr provides functions designed to enable data frame manipulation in an intuitive manner. tidyr helps create tidier data. Shiny generates interactive web apps. And there are plenty of packages to make R easier, faster, and more extensible.
By 2015, more than 10 million people used R every month and it’s now the 13th most popular language in use. And the needs have expanded. We can drop R scripts into other programs and tools for processing. And some of the workloads are huge. This led to support for parallel computing in R, specifically using MPI (Message Passing Interface).
R is one of the most popular languages used for statistical analysis, statistical graphics generation, and data science projects. There are other languages and tools for specific uses, but R has even started being used alongside those.
The latest version, R 4.1.2, was released on November 1, 2021. R development, as with most thriving open source solutions, is guided by a group of core developers supported by contributions from the broader community. It became popular because it provides all the essential features for data mining and graphics needed for academic research and industry applications, and because of its pluggable, robust, and versatile nature.
And projects like TensorFlow, NumPy, and scikit-learn have evolved for other languages. And there are services from companies like Amazon that can host and process assets from both, whether using NoSQL databases for unstructured data or using Jupyter notebooks.
A Jupyter Notebook is a JSON document, following a versioned schema, that contains an ordered list of input/output cells which can contain code, text (using Markdown), formulas, algorithms, plots, and even media like audio or video. Project Jupyter was a spin-off of IPython, but the goal was to create a language-agnostic tool where we could execute aspects in Ruby or Haskell or Python or even R. This gives us so many ways to get our data into the notebook, in batches or deep learning environments or whatever pipeline needs to be built based on an organization’s stack, especially if the notebook has a frontend based on Amazon SageMaker Notebooks, Google's Colaboratory, or Microsoft's Azure Notebooks.
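To make that "JSON document with an ordered list of cells" idea a little more concrete, here is a minimal sketch in Python that builds a two-cell notebook as a plain dictionary and serializes it. The helper names (make_notebook, markdown_cell, code_cell) are made up for illustration, the field names follow the version 4 notebook format as best I understand it, and the "ir" kernel name assumes the R kernel is registered under that name. Treat it as a rough picture of the structure rather than a validated notebook writer.

```python
import json


def markdown_cell(text):
    # A text cell: Markdown source, no outputs.
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    # A code cell: the source plus an (empty) list of outputs,
    # and no execution count because it has not been run yet.
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def make_notebook(cells, kernel_name="ir", language="R"):
    # Hypothetical helper: wraps an ordered list of cells in the
    # top-level notebook structure. The kernelspec values are assumptions.
    return {
        "cells": cells,
        "metadata": {
            "kernelspec": {
                "name": kernel_name,
                "display_name": language,
                "language": language,
            }
        },
        "nbformat": 4,        # major version of the schema
        "nbformat_minor": 4,  # minor version of the schema
    }


notebook = make_notebook([
    markdown_cell("# Summary statistics"),
    code_cell("summary(c(1, 2, 3, 4))"),  # R code, stored as plain text
])

# The notebook on disk is just this dictionary written out as JSON.
notebook_json = json.dumps(notebook, indent=1)
```

Writing notebook_json to a file ending in .ipynb is all a notebook really is on disk, which is why the same document can carry R, Python, or anything else a kernel exists for.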
Think about this. About 25% of languages lack a rhotic consonant. Sometimes it seems like we’ve got languages that do everything or that we’ve built products that do everything. But I bet no matter the industry or focus or sub-specialty, there’s still 25% more automation or investigation into our own data to be done. Because there always will be.