Exploratory Data Analysis

Exploration Data Example in R using dplyr Package

Riza Purnaramadhan
12 min readApr 6, 2021
Photo by Tekton on Unsplash

Bismillahirrahmanirrahim, Assalamu’alaikum Warahmatullahi Wabarakatuh, pada kali ini saya ingin membahas tentang eksplorasi data di R menggunakan package dplyr.

Install dan Load Package

Tentunya package yang digunakan adalah dplyr yaitu package yang digunakan untuk manipulasi data, ketik pada r console Install.packages(“dplyr”) lalu tekan enter.

Load Package dplyr

Setelah tahap installasi selesai, langkah pertama adalah load package untuk bisa menggunakan package, ketikan perintah seperti berikut:

library(dplyr)

Memasukan data ke dalam R

Data yang saya gunakan adalah data sample super store training yang saya dapatkan dari perkuliahan hehe, datanya bisa kalian download disini, kemudian untuk memasukan data ke dalam R maka ketikkan perintah berikut :
training=read.csv(file = "D:\\Data_Tableau\\SampleSuperstoreTraining.csv")

lokasi filenya sesuaikan tempat kalian menyimpan file di computer kalian yaa. Kemudian karena datanya ribuan saya hanya tampilkan data yang teratas pada tampilan kali ini hehe,tapi untuk melihat seluruh data bisa dilihat di Environment pada program R atau bisa ketikan perintah view() pada R console, untuk melihat data teratas bisa dengan cara ketikan perintah seperti berikut:

head(training)

berikut hasilnya :

##   Row Order.Priority Order.Date Order Discount Unit.Price Order.Quantity
## 1 1 High 1/1/2010 28774 10% $6.00 32
## 2 2 Not Specified 1/1/2010 88028 8% $96.00 2
## 3 3 Critical 1/2/2010 9285 6% $41.00 3
## 4 4 Low 1/2/2010 37537 0% $292.00 4
## 5 5 Low 1/2/2010 37537 7% $101.00 43
## 6 6 Low 1/2/2010 37537 5% $155.00 32
## Sales Profit Shipping.Cost Product.Base.Margin Department
## 1 $172.80 $106.36 $5.00 68.0% Office Supplies
## 2 $176.64 $45.64 $35.00 50.0% Office Supplies
## 3 $115.62 $33.90 $3.00 36.0% Office Supplies
## 4 $1,168.00 $605.08 $49.00 56.0% Furniture
## 5 $4,038.99 $2,647.66 $45.00 69.0% Furniture
## 6 $4,712.00 $2,671.40 $7.00 59.0% Office Supplies
## Container Category
## 1 Small Box Storage & Organization
## 2 Large Box Storage & Organization
## 3 Small Box Binders and Binder Accessories
## 4 Jumbo Drum Chairs & Chairmats
## 5 Jumbo Drum Chairs & Chairmats
## 6 Small Box Storage & Organization
## Item
## 1 Perma STOR-ALL\231 Hanging File Box, 13 1/8"W x 12 1/4"D x 10 1/2"H
## 2 Safco Industrial Wire Shelving
## 3 Avery Trapezoid Ring Binder, 3" Capacity, Black, 1040 sheets
## 4 Hon 4070 Series Pagoda\231 Armless Upholstered Stacking Chairs
## 5 Hon Valutask\231 Swivel Chairs
## 6 Dual Level, Single-Width Filing Carts
## Customer.Segment Customer Customer.Name Region State
## 1 Small Business 1656 Joy Corbett AsiaPac Central
## 2 Home Office 2211 Anita Hahn North America Maryland
## 3 Consumer 949 Ernest Oh North America California
## 4 Corporate 68 Scott Bunn North America New York
## 5 Corporate 68 Scott Bunn North America New York
## 6 Corporate 68 Scott Bunn North America New York
## Country...Region City Postal.Code Ship.Date
## 1 Fiji Suva NA 1/2/2010
## 2 United States of America Bowie 20715 1/3/2010
## 3 United States of America Los Angeles 90049 1/4/2010
## 4 United States of America New York City 10177 1/2/2010
## 5 United States of America New York City 10177 1/4/2010
## 6 United States of America New York City 10177 1/9/2010
## Ship.Mode
## 1 Regular Air
## 2 Express Air
## 3 Regular Air
## 4 Delivery Truck
## 5 Delivery Truck
## 6 Regular Air
## SubRegion
## 1
## 2 East
## 3 West
## 4 East
## 5 East
## 6 East

Filter Data

Disini saya ingin nge-filter data dan ingin melihat data consumer saja, data consumer berada pada kolom customer segment. dan saya akan tampilkan data teratas karena datanya terlalu banyak, untuk melihat seluruh data bisa dilihat di Environment pada program R atau bisa ketikan perintah view() pada R console, ketikan syntax seperti berikut:

trainingfilter1=filter(training, Customer.Segment=="Consumer")
head(trainingfilter1)

berikut hasilnya :

##   Row Order.Priority Order.Date Order Discount Unit.Price Order.Quantity
## 1 3 Critical 1/2/2010 9285 6% $41.00 3
## 2 7 Critical 1/2/2010 44069 9% $9.00 16
## 3 8 Critical 1/2/2010 44069 8% $15.00 43
## 4 9 Critical 1/2/2010 89083 6% $41.00 1
## 5 11 Critical 1/2/2010 90853 9% $9.00 4
## 6 12 Critical 1/2/2010 90853 8% $15.00 11
## Sales Profit Shipping.Cost Product.Base.Margin Department
## 1 $115.62 $33.90 $3.00 36.0% Office Supplies
## 2 $131.04 $42.64 $2.00 40.0% Office Supplies
## 3 $593.40 $197.95 $2.00 39.0% Office Supplies
## 4 $38.54 $9.30 $3.00 36.0% Office Supplies
## 5 $32.76 $9.16 $2.00 40.0% Office Supplies
## 6 $151.80 $49.15 $2.00 39.0% Office Supplies
## Container Category
## 1 Small Box Binders and Binder Accessories
## 2 Wrap Bag Paper
## 3 Wrap Bag Paper
## 4 Small Box Binders and Binder Accessories
## 5 Wrap Bag Paper
## 6 Wrap Bag Paper
## Item
## 1 Avery Trapezoid Ring Binder, 3" Capacity, Black, 1040 sheets
## 2 Black Print Carbonless Snap-Off® Rapid Letter, 8 1/2" x 7"
## 3 White GlueTop Scratch Pads
## 4 Avery Trapezoid Ring Binder, 3" Capacity, Black, 1040 sheets
## 5 Black Print Carbonless Snap-Off® Rapid Letter, 8 1/2" x 7"
## 6 White GlueTop Scratch Pads
## Customer.Segment Customer Customer.Name Region State
## 1 Consumer 949 Ernest Oh North America California
## 2 Consumer 1154 Marjorie Bailey AsiaPac ?saka
## 3 Consumer 1154 Marjorie Bailey AsiaPac ?saka
## 4 Consumer 950 Jane Shah North America Minnesota
## 5 Consumer 1155 Alex Nicholson North America California
## 6 Consumer 1155 Alex Nicholson North America California
## Country...Region City Postal.Code Ship.Date Ship.Mode
## 1 United States of America Los Angeles 90049 1/4/2010 Regular Air
## 2 Japan Osaka NA 1/4/2010 Express Air
## 3 Japan Osaka NA 1/2/2010 Regular Air
## 4 United States of America Prior Lake 55372 1/4/2010 Regular Air
## 5 United States of America Montebello 90640 1/4/2010 Express Air
## 6 United States of America Montebello 90640 1/2/2010 Regular Air
## SubRegion
## 1 West
## 2
## 3
## 4 Central
## 5 West
## 6 West

Kita coba filter data yang lain lagii yaa, saya ingin nge-filter data dan ingin melihat data Scott Bunn saja, data Scott Bunn berada pada kolom Customer Name, ketikan syntax seperti berikut:

trainingfilter4=filter(training, Customer.Name=="Scott Bunn")
head(trainingfilter4)

berikut hasilnya :

##    Row Order.Priority Order.Date Order Discount Unit.Price Order.Quantity
## 1 4 Low 1/2/2010 37537 0% $292.00 4
## 2 5 Low 1/2/2010 37537 7% $101.00 43
## 3 6 Low 1/2/2010 37537 5% $155.00 32
## 4 369 High 2/2/2010 55713 9% $123.00 49
## 5 8597 Critical 5/2/2012 2978 1% $206.00 53
## 6 8598 Critical 5/2/2012 2978 9% $163.00 57
## Sales Profit Shipping.Cost Product.Base.Margin Department
## 1 $1,168.00 $605.08 $49.00 56.0% Furniture
## 2 $4,038.99 $2,647.66 $45.00 69.0% Furniture
## 3 $4,712.00 $2,671.40 $7.00 59.0% Office Supplies
## 4 $5,484.57 $3,847.55 $70.00 74.0% Furniture
## 5 $10,808.82 $6,421.62 $20.00 60.0% Technology
## 6 $8,454.81 $2,767.30 $20.00 39.0% Office Supplies
## Container Category
## 1 Jumbo Drum Chairs & Chairmats
## 2 Jumbo Drum Chairs & Chairmats
## 3 Small Box Storage & Organization
## 4 Jumbo Drum Chairs & Chairmats
## 5 Small Box Telephones and Communication
## 6 Small Box Envelopes
## Item
## 1 Hon 4070 Series Pagoda\231 Armless Upholstered Stacking Chairs
## 2 Hon Valutask\231 Swivel Chairs
## 3 Dual Level, Single-Width Filing Carts
## 4 Global High-Back Leather Tilter, Burgundy
## 5 iDEN i550
## 6 Multimedia Mailers
## Customer.Segment Customer Customer.Name Region State
## 1 Corporate 68 Scott Bunn North America New York
## 2 Corporate 68 Scott Bunn North America New York
## 3 Corporate 68 Scott Bunn North America New York
## 4 Corporate 68 Scott Bunn North America New York
## 5 Corporate 68 Scott Bunn North America New York
## 6 Corporate 68 Scott Bunn North America New York
## Country...Region City Postal.Code Ship.Date
## 1 United States of America New York City 10177 1/2/2010
## 2 United States of America New York City 10177 1/4/2010
## 3 United States of America New York City 10177 1/9/2010
## 4 United States of America New York City 10177 2/4/2010
## 5 United States of America New York City 10177 5/3/2012
## 6 United States of America New York City 10177 5/4/2012
## Ship.Mode
## 1 Delivery Truck
## 2 Delivery Truck
## 3 Regular Air
## 4 Delivery Truck
## 5 Regular Air
## 6 Regular Air
## SubRegion
## 1 East
## 2 East
## 3 East
## 4 East
## 5 East
## 6 East

Penggunaan Perintah Select

Disini saya ingin memilih data yang akan ditampilkan, yaitu data pada kolom Customer.Name, Item, dan Region saja, ketikan syntax seperti berikut :

trainingselect=dplyr::select(training, Customer.Name, Item, Region)
head(trainingselect)

berikut hasilnya :

##   Customer.Name
## 1 Joy Corbett
## 2 Anita Hahn
## 3 Ernest Oh
## 4 Scott Bunn
## 5 Scott Bunn
## 6 Scott Bunn
## Item
## 1 Perma STOR-ALL\231 Hanging File Box, 13 1/8"W x 12 1/4"D x 10 1/2"H
## 2 Safco Industrial Wire Shelving
## 3 Avery Trapezoid Ring Binder, 3" Capacity, Black, 1040 sheets
## 4 Hon 4070 Series Pagoda\231 Armless Upholstered Stacking Chairs
## 5 Hon Valutask\231 Swivel Chairs
## 6 Dual Level, Single-Width Filing Carts
## Region
## 1 AsiaPac
## 2 North America
## 3 North America
## 4 North America
## 5 North America
## 6 North America

Menyusun Data

Disini saya ingin mengurutkan data profit dengan perintah arrange, ketikan syntax seperti berikut :

training_ar= arrange(training, Profit)
head(training_ar)

berikut hasilnya :

##     Row Order.Priority Order.Date Order Discount Unit.Price Order.Quantity
## 1 2413 Critical 9/11/2010 90438 6% $3.00 1
## 2 512 Critical 2/17/2010 90588 5% $6.00 1
## 3 3977 Not Specified 2/24/2011 90127 8% $6.00 4
## 4 8341 Not Specified 4/9/2012 86479 9% $2.00 8
## 5 8457 High 4/18/2012 87657 2% $6.00 1
## 6 12816 Low 4/12/2013 87313 8% $6.00 4
## Sales Profit Shipping.Cost Product.Base.Margin Department
## 1 $2.82 -$0.01 $1.00 39.0% Office Supplies
## 2 $5.70 -$0.02 $2.00 38.0% Office Supplies
## 3 $22.08 -$0.04 $7.00 37.0% Office Supplies
## 4 $14.56 -$0.04 $5.00 40.0% Office Supplies
## 5 $5.88 -$0.04 $4.00 68.0% Technology
## 6 $22.08 -$0.04 $7.00 37.0% Office Supplies
## Container Category
## 1 Small Box Labels
## 2 Wrap Bag Paper
## 3 Small Box Paper
## 4 Small Box Binders and Binder Accessories
## 5 Small Pack Computer Peripherals
## 6 Small Box Paper
## Item Customer.Segment Customer
## 1 Avery 479 Corporate 753
## 2 Adams Telephone Message Books, 5 1/4” x 11” Corporate 936
## 3 Xerox 214 Small Business 1660
## 4 Avery Reinforcements for Hole-Punch Pages Small Business 3030
## 5 Imation 3.5" IBM-Formatted Diskettes, 10/Pack Corporate 25
## 6 Xerox 210 Home Office 98
## Customer.Name Region State Country...Region
## 1 Elisabeth Massey North America Arizona United States of America
## 2 Robyn Garner North America California United States of America
## 3 Keith Liu Duncan North America Texas United States of America
## 4 Janice Woods North America Oregon United States of America
## 5 Virginia Hardison North America California United States of America
## 6 Sarah Cooke Anthony North America Kansas United States of America
## City Postal.Code Ship.Date Ship.Mode
## 1 Prescott 86301 9/11/2010 Express Air
## 2 Redlands 92374 2/19/2010 Express Air
## 3 Mesquite 75150 2/25/2011 Regular Air
## 4 Aloha 97007 4/9/2012 Regular Air
## 5 Lake Elsinore 92530 4/20/2012 Regular Air
## 6 Olathe 66062 4/16/2013 Express Air
## SubRegion
## 1 West
## 2 West
## 3 Central
## 4 West
## 5 West
## 6 Central

Bisa dilihat pada kolom profit, data otomatis tersusun dari yang terkecil sampai yang terbesar.

Filter, Select, dan Menyusun data

Disini saya ingin memilih data Item dan Unit,Price lalu menyusunnya, kemudian me-filter kolom Unit.Price dengan harga barang yang >= 100 (lebih dari sama dengan 100) dengan data sebanyak 16798. Karena variabel Unit.Price bukan tipe data numeric maka saya ingin mengubanhnya menjadi tipe data numeric agar bisa dieksekusi, sebelumnya saya ingin melihat tipe data dari tiap variabelnya dengan cara ketikkan perintah berikut :

str(training)

berikut hasilnya untuk melihat struktur data :

## 'data.frame':    16798 obs. of  26 variables:
## $ Row : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.Priority : Factor w/ 5 levels "Critical","High",..: 2 5 1 3 3 3 1 1 1 3 ...
## $ Order.Date : Factor w/ 1427 levels "1/1/2010","1/1/2011",..: 1 1 44 44 44 44 44 44 44 44 ...
## $ Order : int 28774 88028 9285 37537 37537 37537 44069 44069 89083 87946 ...
## $ Discount : Factor w/ 17 levels "0%","1%","10%",..: 3 15 13 1 14 12 16 15 13 12 ...
## $ Unit.Price : Factor w/ 237 levels "$1,000.00","$1,271.00",..: 198 235 161 122 11 45 230 41 161 45 ...
## $ Order.Quantity : int 32 2 3 4 43 32 16 43 1 8 ...
## $ Sales : Factor w/ 8176 levels "$0.90","$0.94",..: 2281 2303 1607 264 4939 5152 1830 6340 4826 281 ...
## $ Profit : Factor w/ 11737 levels "-$0.01","-$0.02",..: 2183 8038 6683 9478 4478 4495 7786 4054 11310 9812 ...
## $ Shipping.Cost : Factor w/ 86 levels "$0.00","$1.00",..: 50 34 30 49 45 71 20 20 30 71 ...
## $ Product.Base.Margin: Factor w/ 61 levels "14.5%","3.6%",..: 43 25 7 31 44 34 13 10 7 34 ...
## $ Department : Factor w/ 3 levels "Furniture","Office Supplies",..: 2 2 2 1 1 2 2 2 2 2 ...
## $ Container : Factor w/ 7 levels "Jumbo Box","Jumbo Drum",..: 5 3 5 2 2 5 7 7 5 5 ...
## $ Category : Factor w/ 17 levels "Appliances","Binders and Binder Accessories",..: 15 15 2 4 4 15 11 11 2 15 ...
## $ Item : Factor w/ 1263 levels "\"While you Were Out\" Message Book, One Form per Page",..: 862 921 229 606 625 395 276 1115 229 395 ...
## $ Customer.Segment : Factor w/ 4 levels "Consumer","Corporate",..: 4 3 1 2 2 2 1 1 1 2 ...
## $ Customer : int 1656 2211 949 68 68 68 1154 1154 950 67 ...
## $ Customer.Name : Factor w/ 3403 levels "Aaron Davies Bruce",..: 1686 133 981 2946 2946 2946 2219 2219 1430 916 ...
## $ Region : Factor w/ 4 levels "AsiaPac","EMEA",..: 1 4 4 4 4 4 1 1 4 4 ...
## $ State : Factor w/ 149 levels "?saka","Addis Ababa",..: 22 66 19 85 85 85 1 1 71 19 ...
## $ Country...Region : Factor w/ 50 levels "Algeria","Argentina",..: 14 49 49 49 49 49 25 25 49 49 ...
## $ City : Factor w/ 1523 levels "Aberdeen","Abidjan",..: 1327 136 760 916 916 916 992 992 1100 893 ...
## $ Postal.Code : int NA 20715 90049 10177 10177 10177 NA NA 55372 94559 ...
## $ Ship.Date : Factor w/ 1459 levels "1/1/2011","1/1/2012",..: 49 94 106 49 106 127 106 49 106 127 ...
## $ Ship.Mode : Factor w/ 3 levels "Delivery Truck",..: 3 2 3 1 1 3 2 3 3 3 ...
## $ SubRegion : Factor w/ 6 levels "","Canada "| __truncated__,..: 1 4 6 4 4 4 1 1 3 6 ...

kemudian saya ingin mengubah tipe data kolom Unit.Price yang sebelumnya bertipe factor menjadi tipe numeric agar bisa dieksekusi dengan syntax seperti berikut:

training$Unit.Price=as.numeric(training$Unit.Price)

Mengecek tipe data yg diubah apakah sudah benar atau belum dengan cara ketikkan syntax seperti berikut :

is.numeric(training$Unit.Price)

berikut hasilnya :

## [1] TRUE

Jika sudah benar maka selanjutnya kita eksekusi !

training_ar2=
training %>%
select(Item, Unit.Price) %>%
arrange(Item, Unit.Price) %>%
filter(Unit.Price>=100)%>%
head(training_ar2)

berikut hasilnya :

##                                                   Item Unit.Price
## 1 "While you Were Out" Message Book, One Form per Page 157
## 2 "While you Were Out" Message Book, One Form per Page 157
## 3 "While you Were Out" Message Book, One Form per Page 157
## 4 "While you Were Out" Message Book, One Form per Page 157
## 5 "While you Were Out" Message Book, One Form per Page 157
## 6 "While you Were Out" Message Book, One Form per Page 157

Membuat Bar Chart

Disini saya akan membuat sebuah bar chart tentang persentase dari Region, berikut adalah syntaxnya:

library(tidyverse)
library(cowplot)
fillColor = "red"
plotx= function(ds, fillColorName= fillColor) {
ds%>%
group_by(Region) %>%
filter(!is.na(Region)) %>%
summarise(Count= n()) %>%
mutate(TotalCount= nrow(ds)) %>%
mutate(Percentage=(Count/TotalCount) * 100) %>%
arrange(desc(Count)) %>%
ungroup() %>%
mutate (Region = reorder(Region,Count)) %>%
head(10) %>%

ggplot(aes(x= Region, y= Percentage)) +
geom_bar( stat='identity', colour="red", fill= fillColorName) +
geom_text(aes(x= Region, y=1, label= paste0("(",round(Percentage,2),"%)", sep="")),
hjust=0, vjust=.5, size=4,colour='black', fontface= 'bold') +
labs( x='Group',
y= 'Percentage', title= 'Percentage of Region') +
coord_flip() +
theme_bw()
}
all= plotx(training)
plot_grid(all)

maka berikut hasilnya :

Sekian tentang eksplorasi data nya, kalian bisa coba aplikasikan dengan data yang berbeda :) jika bermanfaat bisa dishare yaa :) terimakasih

Referensi :

  1. http://www.rpubs.com/arumprimandari/368022
  2. https://www.kaggle.com/ambarish/detailed-analysis-kaggle-survey/notebook

--

--