## DEFINITION AND FUNCTION

R is an interactive programming language for data manipulation, used mainly for data science (the application of scientific methods to extract insights from data). R is gaining in popularity and is among the most widely used languages for statistical computing. R is a dialect of “S”, a statistical programming language developed at Bell Labs in the 1970s. R is an open source language, supported by the R Foundation for Statistical Computing.

## INSTALLATION OF R PROGRAMMING LANGUAGE

R can be downloaded and installed from https://www.r-project.org/. Once the R language is installed on your local PC, you can start programming with R using a text editor or an integrated development environment (IDE) such as “RStudio” (https://www.rstudio.com/).

## USAGE OF R PROGRAMMING LANGUAGE FOR DATA ANALYSIS

The R programming language can be used for exploration and statistical processing of data that is stored locally or available online. Interactive data applications are often built in R with the Shiny package. Several online data science services (for example, Microsoft Azure) and proprietary business intelligence tools (for example, Tableau) have incorporated the R language.

## CAPABILITIES OF THE R PROGRAMMING LANGUAGE

The basic programming capabilities of the R language are called “base R”, and include the basic commands, functions, and example datasets of the language. R’s built-in example datasets are accessible via the “data()” command. Interactive programming lessons are available through the add-on “swirl” package, which is started with the “swirl()” command. Datasets are read into the R programming environment with commands such as “read.table()”, “read.csv()”, and “read.delim()”. More sophisticated data science programming with R usually relies on externally downloadable packages of functions. These R packages are downloaded with the base R command “install.packages()”, and then loaded into a session with “library()” or “require()”. Among the most popular R packages are “ggplot2” for drawing statistical plots, and “rmarkdown” for creating documents from R code.
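As a rough illustration of the commands named above, the sketch below lists a few built-in datasets, round-trips one of them through a CSV file, and loads a package only if it is already installed. The temporary file and the variable names (`csv_path`, `cars_from_csv`) are assumptions made for the example, not part of any fixed workflow.

```r
# Show the names of a few datasets bundled with base R.
head(data(package = "datasets")$results[, "Item"])

# Load one built-in dataset by name.
data("mtcars")

# Write the dataset to a temporary CSV file and read it back with read.csv().
# tempfile() keeps the example from touching your working directory.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = TRUE)
cars_from_csv <- read.csv(csv_path, row.names = 1)
dim(cars_from_csv)

# Load ggplot2 if it is installed; otherwise say how to install it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
} else {
  # install.packages("ggplot2")  # uncomment to install; needs internet access
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
}
```

Checking with “requireNamespace()” before calling “library()” is one way to keep a script from stopping with an error on a machine where the package is missing.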

## R PROGRAMMING LANGUAGE IN ITS SIMPLEST FORM

At the most basic level, the R language is usable as a calculator by simply typing mathematical operations, like “25 * 4”, into the programming environment. The assignment operator, “<-”, is used to set variables in the R language (for example, “y <- x * 25”). A basic concatenation method in R is the command “c()”. This command concatenates, or combines, elements into a basic vector. For example, “x <- c("Corolla", "Firebird", "Europa")”, or “x <- c(10, 20, 30, 40)”. A sequence of consecutive integers can be generated with the “:” operator. For example, “x <- 10:20”.
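The short session below puts these basics together. The variable names (`models`, `tens`, `r`) are chosen only for illustration.

```r
# Arithmetic typed at the prompt is evaluated immediately.
25 * 4

# Assign a value with <- and reuse it in another expression.
x <- 4
y <- x * 25

# c() combines elements into a vector; all elements share one type.
models <- c("Corolla", "Firebird", "Europa")
tens <- c(10, 20, 30, 40)

# The colon operator builds a sequence of consecutive integers.
r <- 10:20
length(r)

# Arithmetic on a vector is applied element by element.
tens / 10
```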

## CREATING AN ENABLING WORKING ENVIRONMENT FOR “R” ON YOUR PC

In order to work with data files stored on your PC, the working directory should be set with the command “setwd()”. The present working directory is accessible with “getwd()”. Data sets loaded into the programming environment are examinable with basic commands, like “dim()”, “object.size()”, “class()”, “nrow()”, “ncol()”, “head()”, “tail()”, “names()”, “summary()”, and “str()”. The basic plotting commands are “plot()” and “hist()”. The ggplot2 package greatly expands on the graphing and plotting capabilities of base R.

Once a data set has been read into R (for example, with “data <- mtcars”), base R allows for easy sub-setting with brackets. To subset the variable “data”, you only need to specify the rows and columns to keep. Rows are specified before the comma within the brackets, and columns are specified after the comma. A range of rows or columns is defined by the “:” operator, so “data[1:3, 1:5]” returns the first 3 rows and first 5 columns of the mtcars data frame. The “c()” function is usable within the brackets, in order to select a group of rows or columns. For example, “data[c(1, 3, 4), ]” selects rows 1, 3, and 4, while “data[c(1, 3, 4)]”, with no comma, selects columns 1, 3, and 4.

A dataset can be printed to the console with the “print()” function. If your dataset contains dates stored as text, the command “as.Date()” converts them so that the R environment treats them as actual date values.
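A minimal sketch of these inspection and sub-setting commands, using the built-in mtcars data frame, might look like the following. The path passed to “setwd()” is a hypothetical placeholder and is left commented out, and the two date strings are made-up values.

```r
getwd()                          # print the current working directory
# setwd("path/to/project")       # hypothetical path; uncomment and adjust

data <- mtcars                   # copy of the built-in data frame

dim(data)                        # number of rows and columns
names(data)                      # column names
str(data)                        # column types and first few values
head(data, 3)                    # first three rows

first_block <- data[1:3, 1:5]          # rows 1 to 3, columns 1 to 5
some_rows   <- data[c(1, 3, 4), ]      # rows 1, 3 and 4, all columns
some_cols   <- data[, c("mpg", "hp")]  # all rows, two columns chosen by name

# Convert text in year-month-day form into Date values.
dates <- as.Date(c("2020-01-15", "2020-03-01"))
class(dates)
diff(dates)                      # time between the two dates, in days
```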


## BASIC MATHEMATICAL FUNCTIONS OF R PROGRAMMING LANGUAGE

R’s basic mathematical functions include “sum()”, “mean()”, “range()”, and “quantile()”. The “matrix()” and “data.frame()” commands create matrices and data frames for statistical processing. Repetitive operations that would otherwise need explicit loops are made convenient in R by the “apply()”, “lapply()”, and “mapply()” functions. Rows and columns are bindable with “rbind()” and “cbind()”. Data types are convertible with base R conversion functions, like “as.character()”, “as.factor()”, “as.numeric()”, and “as.double()”. Using only base R, programmers can create user-defined functions with the “function” keyword. For example, “firstRow <- function(x) {x[1,]}”. Once defined, the variable name is used to call the function, as in “firstRow(mtcars)”.
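The functions in this section can be tried out on small made-up inputs, as in the sketch below. All values and variable names other than “firstRow” and “mtcars” are invented for illustration.

```r
v <- c(2, 4, 6, 8)
sum(v)
mean(v)
range(v)                         # smallest and largest value
quantile(v)

# matrix() fills column by column unless byrow = TRUE is given.
m <- matrix(1:6, nrow = 2)
row_totals <- apply(m, 1, sum)   # 1 means "apply over rows"
col_totals <- apply(m, 2, sum)   # 2 means "apply over columns"

squares <- lapply(1:3, function(i) i^2)           # returns a list
sums <- mapply(function(a, b) a + b, 1:3, 4:6)    # pairs up two inputs

# Build a data frame, then add a row and a column.
df <- data.frame(id = 1:2, name = c("a", "b"))
df <- rbind(df, data.frame(id = 3, name = "c"))
df <- cbind(df, score = c(90, 80, 70))

# Convert between data types.
as.numeric("3.14")
as.character(42)

# A user-defined function that returns the first row of its input.
firstRow <- function(x) {
  x[1, ]
}
firstRow(mtcars)
```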

## STEP BY STEP GUIDE ON DATA SCIENCE WITH R PROGRAMMING LANGUAGE

The basic methodology of data science with R is to first set the working directory, install and load the required packages, input the data, programmatically explore the data, plot the data exploration, correlate significant variables within the data, fit a linear model to the correlated data, and finally create a forecast from the fitted model. The R script should be thoroughly commented to assist other programmers (using the commenting character, “#”), and the process of data science exploration and discovery should be explained for reproducibility of findings.
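The steps above can be sketched with base R alone. The example below is only one possible reading of that workflow: it assumes the built-in mtcars data stands in for your own input data, picks car weight (wt) and fuel economy (mpg) as the pair of variables to study, and uses two hypothetical weights for the forecast.

```r
# Step 1: input the data (here, a built-in dataset instead of a file).
cars <- mtcars

# Step 2: explore the data programmatically.
str(cars)
summary(cars$mpg)

# Step 3: plot the relationship of interest.
plot(cars$wt, cars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Step 4: measure how strongly the two variables are correlated.
wt_mpg_cor <- cor(cars$wt, cars$mpg)

# Step 5: fit a linear model and add the fitted line to the plot.
fit <- lm(mpg ~ wt, data = cars)
summary(fit)
abline(fit)

# Step 6: predict mpg for two hypothetical car weights.
new_cars <- data.frame(wt = c(2.5, 3.5))
forecast <- predict(fit, newdata = new_cars, interval = "prediction")
forecast
```

The “interval” argument asks “predict()” to return lower and upper bounds alongside each fitted value, which helps to show how uncertain the forecast is.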

## CONCLUSION

In the future, I hope to write a blog post that goes beyond base R, and explores the powerful capabilities of the R language for interactively discovering new insights from the many datasets now available via the internet. That post will require an overview of R packages, ggplot2 graphics, and statistical inference with linear modeling, logistic modeling, etc. Thank you for reading this first blog post, an introduction to R programming basics for data science.
