R Programming – Data Structures, Basic Statistics, and Plotting

Vector in R

A vector stores a series of elements within the R programming environment. All elements of a vector have the same type. The vector is a basic data structure in R, and is easy to create and manipulate with R language commands.

To create a vector from a range of integers, the colon operator, “:”, is used. For example, x <- 10:20. A second method of creating vectors is the combine function, c(). For example, x <- c("A", "B", "C", "D"), or x <- c(10, 20, 30, 40). More complex sequences can be created using the seq() function, where either the step size (the by argument) or the number of points in the interval (the length.out argument) can be specified.
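The three ways of creating a vector described above can be sketched as follows. The variable names are illustrative only.

```r
# Range of integers with the colon operator
x <- 10:20

# Combine individual values with c()
letters_vec <- c("A", "B", "C", "D")
tens <- c(10, 20, 30, 40)

# seq() with an explicit step size (by) ...
by_step <- seq(0, 1, by = 0.25)
# ... or with a fixed number of points in the interval (length.out)
n_points <- seq(0, 1, length.out = 5)

print(x)
print(by_step)
print(n_points)
```

Both seq() calls here produce the same five values, 0, 0.25, 0.5, 0.75 and 1.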

Vector data types can be logical, integer, double, character, complex or raw. Because a vector must have elements of a single type, the c() function coerces mixed elements to the most general type present (for example, logical to integer, integer to double, and double to character). A vector’s type can be checked with the typeof() function. The number of elements in a vector can be checked with the length() function.
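A short sketch of coercion and of the typeof() and length() checks, using made-up values:

```r
# Mixing logical, integer and double values gives a double vector
mixed_num <- c(TRUE, 2L, 3.5)

# Mixing anything with a character value gives a character vector
mixed_chr <- c(1, "a", TRUE)

print(typeof(mixed_num))   # "double"
print(typeof(mixed_chr))   # "character"
print(mixed_chr)           # "1" "a" "TRUE"
print(length(mixed_chr))   # 3

# The colon operator yields integers, while c() with plain numbers yields doubles
print(typeof(10:20))       # "integer"
print(typeof(c(10, 20)))   # "double"
```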

Elements of a vector are accessed by indexing with square brackets, and the index itself can be an integer, logical or character vector. Positive integers select elements by position. Negative integers specify the elements to exclude. Positive and negative integers cannot be mixed in the same index. A logical index returns the elements at the positions where the index is TRUE. A character index selects elements by name, which requires a named vector.
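The indexing rules above can be illustrated with a small named vector. The names and values are arbitrary.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

by_position    <- x[c(1, 3)]                       # elements 1 and 3
without_second <- x[-2]                            # everything except element 2
by_logical     <- x[c(TRUE, FALSE, TRUE, FALSE)]   # keep positions that are TRUE
by_condition   <- x[x > 15]                        # logical index built from a test
by_name        <- x[c("a", "d")]                   # select by element name

# Mixing positive and negative positions is an error; try() captures it here
mixed <- try(x[c(1, -2)], silent = TRUE)

print(by_position)
print(without_second)
print(by_condition)
print(by_name)
print(inherits(mixed, "try-error"))   # TRUE
```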

Using the above indexing methods together with assignment, it is possible to modify selected elements of a vector, or every element at once. Extracting specific elements and combining them into new vectors is also possible. Assigning NULL to a vector’s name discards its contents, so that the name then refers to NULL. To remove the variable itself, use the rm() function.
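A minimal sketch of modifying, combining and discarding a vector follows. Note that assigning NULL leaves the name defined (with the value NULL), whereas rm() removes the name. The variable names are illustrative.

```r
v <- c(10, 20, 30, 40)

v[2] <- 25          # replace one element by position: 10 25 30 40
v[v > 25] <- 0      # replace elements that match a condition: 10 25 0 0
v <- v * 2          # transform every element: 20 50 0 0

# Take the first two elements and combine them with a new value
first_two_plus <- c(v[1:2], 99)   # 20 50 99

doubled <- v        # keep a copy before discarding v
v <- NULL           # v still exists, but is now NULL
v_is_null <- is.null(v)

rm(v)               # v no longer exists in the current environment
v_exists <- exists("v", inherits = FALSE)

print(doubled)
print(first_two_plus)
print(v_is_null)    # TRUE
print(v_exists)     # FALSE
```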

Matrix Variables in R

A matrix is a two-dimensional data structure in R programming in which all elements have the same type, that is, the data is homogeneous. Each row and each column of a matrix can be treated as a vector. Matrices can be created using the matrix() function. The dimensions of the matrix are defined by passing values for the nrow and ncol arguments. Values are filled column by column by default; to fill the matrix row by row instead, set byrow = TRUE.
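The difference between the default column-wise fill and byrow = TRUE is easiest to see side by side. This is a small sketch with arbitrary values.

```r
# Default: values fill the matrix column by column
m_col <- matrix(1:6, nrow = 2, ncol = 3)
print(m_col)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: values fill the matrix row by row
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(m_row)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```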

The functions cbind() and rbind(), “column bind” and “row bind”, create matrices by combining vectors as columns or rows. Assigning dimensions to a vector with dim(), for example dim(x) <- c(2, 3), also turns it into a matrix. Elements of a matrix are accessible via square bracket indexing in the form m[row, column]. Leaving the row or the column position empty returns an entire column or row as a vector. Negative numbers exclude the rows and columns specified. Indexing is possible with integer vectors or logical vectors.
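The binding, reshaping and indexing operations described above might look like this in practice. The vectors a, b and v are made up for illustration.

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

by_cols <- cbind(a, b)   # 3 rows, 2 columns
by_rows <- rbind(a, b)   # 2 rows, 3 columns

# Setting dim() on a plain vector turns it into a matrix
v <- 1:6
dim(v) <- c(2, 3)

one_element <- by_rows[2, 3]            # row 2, column 3
first_row   <- by_rows[1, ]             # whole first row as a vector
no_col_one  <- by_rows[, -1]            # drop the first column
large_vals  <- by_rows[by_rows > 4]     # logical indexing returns a vector

print(by_cols)
print(by_rows)
print(one_element)   # 6
print(large_vals)    # 5 6
```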

The dimnames argument of matrix() allows rows and columns to be named at creation. The functions colnames() and rownames() allow column and row names to be read or modified afterwards. Once rows and columns are named, indexing by row and column name is supported. Rows and columns can be added to a matrix with rbind() and cbind(), and a matrix can be reshaped by assigning new dimensions with dim().
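A brief sketch of naming and extending a matrix, with invented row and column names:

```r
# Name rows and columns at creation with the dimnames argument
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("A", "B", "C")))
by_name_before <- m["r2", "B"]

# Rename afterwards with colnames() and rownames()
colnames(m) <- c("x", "y", "z")
rownames(m)[1] <- "first"
by_name_after <- m["first", "z"]

# Add a row of column totals and, separately, an extra column
with_total <- rbind(m, total = colSums(m))
with_extra <- cbind(m, w = c(7, 8))

print(m)
print(by_name_before)   # 4
print(by_name_after)    # 5
print(with_total)
print(with_extra)
```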

The attributes of a matrix can be checked with the attributes() function. The dimensions of a matrix, that is, its number of rows and columns, can be checked with the dim() function. The class() function reports whether an object is a matrix or a data frame.
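These inspection functions can be tried on a small matrix and a small data frame. One detail worth knowing: in R 4.0 and later, class() on a matrix returns both "matrix" and "array", so the sketch below checks membership rather than equality.

```r
m  <- matrix(1:6, nrow = 2)
df <- data.frame(a = 1:2, b = c("x", "y"))

print(attributes(m))   # a list holding the dim attribute
print(dim(m))          # 2 3
print(class(m))        # includes "matrix"
print(class(df))       # "data.frame"

m_is_matrix <- "matrix" %in% class(m)
df_is_frame <- is.data.frame(df)
```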

Data Frames in R

A data frame is a two-dimensional data structure in R programming. Unlike a matrix, a data frame allows its columns to have different types, although each individual column is homogeneous. Therefore, numeric, factor and character data can be stored, and processed, within the same data frame.

The str() function summarizes the structure of a data frame. The summary() function produces a statistical summary of each column. New columns can be added by assigning a vector to a new column name with the assignment operator. New rows are added using rbind().
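A minimal sketch of inspecting and growing a data frame follows. The column names and values are invented for illustration, and stringsAsFactors = FALSE is set explicitly so the behaviour is the same on R versions before 4.0.

```r
df <- data.frame(
  name = c("Ann", "Bob", "Cy"),
  age  = c(28, 35, 42),
  stringsAsFactors = FALSE
)

str(df)              # structure: column names, types, first values
print(summary(df))   # per-column summary statistics

# Add a column by assigning a vector to a new column name
df$senior <- df$age > 30

# Add a row by binding a one-row data frame with matching columns
new_row <- data.frame(name = "Dee", age = 51, senior = TRUE,
                      stringsAsFactors = FALSE)
df <- rbind(df, new_row)

print(df)
```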

The individual columns of a data frame are accessible with single brackets, [ ], with double brackets, [[ ]], or with the $ operator. Individual elements can be accessed by combining these methods, for example by selecting a column with $ and then an element of it with brackets.
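The access methods can be compared on a small example data frame. The names are hypothetical.

```r
df <- data.frame(
  name = c("Ann", "Bob", "Cy"),
  age  = c(28, 35, 42),
  stringsAsFactors = FALSE
)

ages_dollar  <- df$age        # column as a vector
ages_double  <- df[["age"]]   # same column, selected by name
ages_single  <- df["age"]     # single brackets keep a one-column data frame

second_age   <- df[2, "age"]  # row 2 of column "age"
third_name   <- df$name[3]    # $ followed by bracket indexing
over_thirty  <- df[df$age > 30, ]   # rows that satisfy a condition

print(ages_dollar)
print(class(ages_single))   # "data.frame"
print(second_age)           # 35
print(third_name)           # "Cy"
print(over_thirty)
```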

Performing calculations with the data contained in a data frame is made easy with the apply() function, which applies the same function to every row or every column and so removes the need for an explicit loop. Because apply() converts the data frame to a matrix first, it is best suited to columns of a single type. Searching for specific text within a data frame column is possible with the grep() function. The order() function returns the ordering used to sort the rows of a data frame by one or more columns. Data frames can be merged on common columns with the merge() function.
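The four functions mentioned above can be sketched on two small, made-up data frames. Only the numeric columns are passed to apply(), since it works on a matrix internally.

```r
scores <- data.frame(id = c(3, 1, 2),
                     math = c(70, 90, 80),
                     art = c(65, 85, 95))
people <- data.frame(id = c(1, 2, 3),
                     name = c("Ann", "Bob", "Anna"),
                     stringsAsFactors = FALSE)

# apply(): margin 2 means per column, margin 1 means per row
col_means  <- apply(scores[, c("math", "art")], 2, mean)
row_totals <- apply(scores[, c("math", "art")], 1, sum)

# grep(): positions of the names that start with "An"
hits <- grep("^An", people$name)

# order(): row ordering that sorts scores by id
sorted <- scores[order(scores$id), ]

# merge(): join the two data frames on their shared id column
merged <- merge(scores, people, by = "id")

print(col_means)
print(row_totals)
print(hits)
print(sorted)
print(merged)
```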

Standard Deviation in R

Standard deviation is a statistical measure of the variation, or dispersion, within a set of numerical values. A low standard deviation indicates that the data points cluster close to the mean. A high standard deviation indicates that the data points are spread over a wider range of values around the mean.

The variance is the square of the standard deviation. Variance has a central role in statistics and is used for descriptive statistics, statistical inference, hypothesis testing, goodness of fit, and Monte Carlo sampling. Unlike the variance, the standard deviation is expressed in the same units as the data.
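In R, the sd() and var() functions compute the sample standard deviation and the sample variance, both using a denominator of n - 1. The sketch below, on an arbitrary set of values, checks the relationship between the two and reproduces sd() by hand.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

sample_var <- var(x)   # sample variance
sample_sd  <- sd(x)    # sample standard deviation

# Manual calculation: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
manual_sd  <- sqrt(manual_var)

print(sample_var)
print(sample_sd)
print(all.equal(sample_sd^2, sample_var))   # TRUE
print(all.equal(sample_sd, manual_sd))      # TRUE
```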

Standard deviation helps to check whether experimental results fall within an expected range, which supports quality control in data gathering. Confidence intervals, including those used to compare two sets of data, are calculated from the standard deviation of the data. Standard deviation also supports estimates of how future data drawn from the same source are likely to vary.
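As one illustration of how the standard deviation enters a confidence interval, the sketch below computes a 95% interval for the mean of a single sample from the t distribution and compares it with the interval reported by t.test(). This is a single-sample case chosen for brevity, not the two-sample comparison, and the data values are arbitrary.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Standard error of the mean, derived from the standard deviation
se <- sd(x) / sqrt(n)

# Critical value for a two-sided 95% interval with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

ci_manual  <- mean(x) + c(-1, 1) * t_crit * se
ci_builtin <- as.numeric(t.test(x)$conf.int)

print(ci_manual)
print(ci_builtin)
print(all.equal(ci_manual, ci_builtin))   # TRUE
```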

Scatter plot in R

A scatter plot is a mathematical diagram that uses Cartesian coordinates to display the values of, typically, two variables for a set of data. Additional variables can be displayed using color coding. The data is displayed as a collection of points. When one variable is controlled or known (the independent variable), it is displayed on the horizontal axis, and the variable whose response is being visualized (the dependent variable) is displayed on the vertical axis. If neither variable depends on the other, either one can be placed on either axis, and the scatter plot shows the degree of correlation between the two variables.

The basic way to create a scatter plot in the R programming language is with the plot() function, for example, plot(x, y). Titles for the entire plot and for the x and y axes can be specified with the main, xlab and ylab arguments. The abline() function adds a straight line to the plot, such as a fitted regression line. The scatterplot() function, within the car package, offers further options for analyzing data with scatter plots.
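A minimal base R sketch of a titled scatter plot with a regression line follows. The data is simulated purely for illustration, and the line is fitted with lm(), which is one common choice rather than the only one.

```r
set.seed(1)                      # make the simulated data reproducible
x <- 1:20
y <- 2 * x + rnorm(20, sd = 3)   # roughly linear data with noise

# Scatter plot with a main title and axis labels
plot(x, y,
     main = "y versus x",
     xlab = "x (independent variable)",
     ylab = "y (dependent variable)")

# Fit a simple linear regression and draw it over the points
fit <- lm(y ~ x)
abline(fit, col = "red")

print(coef(fit))
```

When run non-interactively, for example with Rscript, the plot is written to a file named Rplots.pdf instead of being shown on screen.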

In the R programming language, scatter plots are useful for data visualization because statistical processing can be applied to the plotted data points. Useful variations of scatter plots in R include correlation plots, gradient shading or color binning, sunflower plots, and 3D scatter plots.
