## Basics of statistics for data science: Descriptive statistics with R code

```
## Load the data
data("iris")
# View the data
View(iris)
# The data frame has 5 variables: Sepal.Length, Sepal.Width,
# Petal.Length, Petal.Width and Species (the species of the flower)
head(iris)
# The above command prints the first 6 rows of the data frame.
# Let us look at the structure of the data
str(iris)
# Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are numeric variables
# Species is a factor variable
# Let us calculate the descriptive statistics for these variables
# Measures of central tendency describe the most typical value of a variable
# Mean: the arithmetic average of the data; sensitive to outliers
# The command for calculating the mean is mean()
# The command below calculates the mean of the first column (Sepal.Length)
mean(iris[,1])
# The other alternative is
mean(iris$Sepal.Length) # The $ operator selects a particular column (variable) from a data frame
# Median: the midpoint of the values once they are arranged in ascending order
# The command for calculating the median is median()
median(iris$Sepal.Length)
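# Mode: the value that appears most often.
# Base R has no function for the statistical mode (mode() returns the storage type),
# so tabulate the values and keep the most frequent one(s)
freq <- table(iris$Sepal.Length)
as.numeric(names(freq)[freq == max(freq)])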
# Measures of dispersion describe the spread of the data set
# A frequency distribution shows the number (or percent) of occurrences of each value or group of values
# This is especially relevant for categorical/factor variables
# Species is the categorical variable in this data frame
table(iris$Species)
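# A small sketch (not part of the original walkthrough): table() returns counts only;
# prop.table() converts those counts into proportions, and multiplying by 100 gives percentages
species_counts <- table(iris$Species)
species_pct <- 100 * prop.table(species_counts)
round(species_pct, 1)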
# Range identifies the maximum and minimum values in a set of numbers
range(iris$Sepal.Length)
# Standard deviation measures how far values typically lie from the mean, in the units of the data;
# for roughly bell-shaped (normal) data, about 68% of values fall within one standard deviation of the mean
sd(iris$Sepal.Length)
# Variance is the square of the standard deviation; base R provides var() for it
var(iris$Sepal.Length)
# The ratio Std. dev / mean is a different measure: the coefficient of variation
cv <- sd(iris$Sepal.Length) / mean(iris$Sepal.Length)
cv
# To get a fuller set of descriptives in one call, use the summary() command
summary(iris$Sepal.Length)
# This gives the min, 1st Quartile, median, mean, 3rd Quartile and maximum value
# Another way to get a full set of descriptives is the describe() command
# describe() is part of the psych package, so install that package first (needed only once)
install.packages("psych")
library(psych)
describe(iris$Sepal.Length)
# This gives the output in terms of mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se
# Descriptives across groups
describeBy(iris$Sepal.Length, iris$Species)
# This gives the descriptives separately for each species
```
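
The built-in functions used in the script can be cross-checked against their textbook formulas. The block below is a minimal sketch rather than part of the original walkthrough. `stat_mode()` is a hypothetical helper name introduced here for illustration (base R has no statistical mode function), and the variance uses the sample formula with an `n - 1` denominator, which is what `var()` and `sd()` compute.

```r
# Cross-check R's descriptive functions against their formulas.
# Assumption: stat_mode() is a hypothetical helper defined here for illustration;
# it is not part of base R or the psych package.
data("iris")
x <- iris$Sepal.Length
n <- length(x)

# Mean: sum of the values divided by the number of values
manual_mean <- sum(x) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - manual_mean)^2) / (n - 1)

# Standard deviation: square root of the variance
manual_sd <- sqrt(manual_var)

# Mode of a numeric vector: returns every value tied for the highest count
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

# Each comparison should print TRUE
all.equal(manual_mean, mean(x))
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))

# Most frequent sepal length(s)
stat_mode(x)

# Group means in base R, a lightweight alternative to describeBy()
tapply(iris$Sepal.Length, iris$Species, mean)
```

Returning all tied values, rather than only the first, keeps a variable with several equally common values from looking unimodal.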