Exploratory Data Analysis

Exploration Data Example in R using dplyr Package

Photo by Tekton on Unsplash

Install dan Load Package

Load Package dplyr

library(dplyr)

Memasukan data ke dalam R

head(training)
##   Row Order.Priority Order.Date Order Discount Unit.Price Order.Quantity
## 1 1 High 1/1/2010 28774 10% $6.00 32
## 2 2 Not Specified 1/1/2010 88028 8% $96.00 2
## 3 3 Critical 1/2/2010 9285 6% $41.00 3
## 4 4 Low 1/2/2010 37537 0% $292.00 4
## 5 5 Low 1/2/2010 37537 7% $101.00 43
## 6 6 Low 1/2/2010 37537 5% $155.00 32
## Sales Profit Shipping.Cost Product.Base.Margin Department
## 1 $172.80 $106.36 $5.00 68.0% Office Supplies
## 2 $176.64 $45.64 $35.00 50.0% Office Supplies
## 3 $115.62 $33.90 $3.00 36.0% Office Supplies
## 4 $1,168.00 $605.08 $49.00 56.0% Furniture
## 5 $4,038.99 $2,647.66 $45.00 69.0% Furniture
## 6 $4,712.00 $2,671.40 $7.00 59.0% Office Supplies
## Container Category
## 1 Small Box Storage & Organization
## 2 Large Box Storage & Organization
## 3 Small Box Binders and Binder Accessories
## 4 Jumbo Drum Chairs & Chairmats
## 5 Jumbo Drum Chairs & Chairmats
## 6 Small Box Storage & Organization
## Item
## 1 Perma STOR-ALL\231 Hanging File Box, 13 1/8"W x 12 1/4"D x 10 1/2"H
## 2 Safco Industrial Wire Shelving
## 3 Avery Trapezoid Ring Binder, 3" Capacity, Black, 1040 sheets
## 4 Hon 4070 Series Pagoda\231 Armless Upholstered Stacking Chairs
## 5 Hon Valutask\231 Swivel Chairs
## 6 Dual Level, Single-Width Filing Carts
## Customer.Segment Customer Customer.Name Region State
## 1 Small Business 1656 Joy Corbett AsiaPac Central
## 2 Home Office 2211 Anita Hahn North America Maryland
## 3 Consumer 949 Ernest Oh North America California
## 4 Corporate 68 Scott Bunn North America New York
## 5 Corporate 68 Scott Bunn North America New York
## 6 Corporate 68 Scott Bunn North America New York
## Country...Region City Postal.Code Ship.Date
## 1 Fiji Suva NA 1/2/2010
## 2 United States of America Bowie 20715 1/3/2010
## 3 United States of America Los Angeles 90049 1/4/2010
## 4 United States of America New York City 10177 1/2/2010
## 5 United States of America New York City 10177 1/4/2010
## 6 United States of America New York City 10177 1/9/2010
## Ship.Mode
## 1 Regular Air
## 2 Express Air
## 3 Regular Air
## 4 Delivery Truck
## 5 Delivery Truck
## 6 Regular Air
## SubRegion
## 1
## 2 East
## 3 West
## 4 East
## 5 East
## 6 East

Filter Data

trainingfilter1=filter(training, Customer.Segment=="Consumer")
head(trainingfilter1)
##   Row Order.Priority Order.Date Order Discount Unit.Price Order.Quantity
## 1 3 Critical 1/2/2010 9285 6% $41.00 3
## 2 7 Critical 1/2/2010 44069 9% $9.00 16
## 3 8 Critical 1/2/2010 44069 8% $15.00 43
## 4 9 Critical 1/2/2010 89083 6% $41.00 1
## 5 11 Critical 1/2/2010 90853 9% $9.00 4
## 6 12 Critical 1/2/2010 90853 8% $15.00 11
## Sales Profit Shipping.Cost Product.Base.Margin Department
## 1 $115.62 $33.90 $3.00 36.0% Office Supplies
## 2 $131.04 $42.64 $2.00 40.0% Office Supplies
## 3 $593.40 $197.95 $2.00 39.0% Office Supplies
## 4 $38.54 $9.30 $3.00 36.0% Office Supplies
## 5 $32.76 $9.16 $2.00 40.0% Office Supplies
## 6 $151.80 $49.15 $2.00 39.0% Office Supplies
## Container Category
## 1 Small Box Binders and Binder Accessories
## 2 Wrap Bag Paper
## 3 Wrap Bag Paper
## 4 Small Box Binders and Binder Accessories
## 5 Wrap Bag Paper
## 6 Wrap Bag Paper
## Item
## 1 Avery Trapezoid Ring Binder, 3" Capacity, Black, 1040 sheets
## 2 Black Print Carbonless Snap-Off® Rapid Letter, 8 1/2" x 7"
## 3 White GlueTop Scratch Pads
## 4 Avery Trapezoid Ring Binder, 3" Capacity, Black, 1040 sheets
## 5 Black Print Carbonless Snap-Off® Rapid Letter, 8 1/2" x 7"
## 6 White GlueTop Scratch Pads
## Customer.Segment Customer Customer.Name Region State
## 1 Consumer 949 Ernest Oh North America California
## 2 Consumer 1154 Marjorie Bailey AsiaPac ?saka
## 3 Consumer 1154 Marjorie Bailey AsiaPac ?saka
## 4 Consumer 950 Jane Shah North America Minnesota
## 5 Consumer 1155 Alex Nicholson North America California
## 6 Consumer 1155 Alex Nicholson North America California
## Country...Region City Postal.Code Ship.Date Ship.Mode
## 1 United States of America Los Angeles 90049 1/4/2010 Regular Air
## 2 Japan Osaka NA 1/4/2010 Express Air
## 3 Japan Osaka NA 1/2/2010 Regular Air
## 4 United States of America Prior Lake 55372 1/4/2010 Regular Air
## 5 United States of America Montebello 90640 1/4/2010 Express Air
## 6 United States of America Montebello 90640 1/2/2010 Regular Air
## SubRegion
## 1 West
## 2
## 3
## 4 Central
## 5 West
## 6 West
trainingfilter4=filter(training, Customer.Name=="Scott Bunn")
head(trainingfilter4)
##    Row Order.Priority Order.Date Order Discount Unit.Price Order.Quantity
## 1 4 Low 1/2/2010 37537 0% $292.00 4
## 2 5 Low 1/2/2010 37537 7% $101.00 43
## 3 6 Low 1/2/2010 37537 5% $155.00 32
## 4 369 High 2/2/2010 55713 9% $123.00 49
## 5 8597 Critical 5/2/2012 2978 1% $206.00 53
## 6 8598 Critical 5/2/2012 2978 9% $163.00 57
## Sales Profit Shipping.Cost Product.Base.Margin Department
## 1 $1,168.00 $605.08 $49.00 56.0% Furniture
## 2 $4,038.99 $2,647.66 $45.00 69.0% Furniture
## 3 $4,712.00 $2,671.40 $7.00 59.0% Office Supplies
## 4 $5,484.57 $3,847.55 $70.00 74.0% Furniture
## 5 $10,808.82 $6,421.62 $20.00 60.0% Technology
## 6 $8,454.81 $2,767.30 $20.00 39.0% Office Supplies
## Container Category
## 1 Jumbo Drum Chairs & Chairmats
## 2 Jumbo Drum Chairs & Chairmats
## 3 Small Box Storage & Organization
## 4 Jumbo Drum Chairs & Chairmats
## 5 Small Box Telephones and Communication
## 6 Small Box Envelopes
## Item
## 1 Hon 4070 Series Pagoda\231 Armless Upholstered Stacking Chairs
## 2 Hon Valutask\231 Swivel Chairs
## 3 Dual Level, Single-Width Filing Carts
## 4 Global High-Back Leather Tilter, Burgundy
## 5 iDEN i550
## 6 Multimedia Mailers
## Customer.Segment Customer Customer.Name Region State
## 1 Corporate 68 Scott Bunn North America New York
## 2 Corporate 68 Scott Bunn North America New York
## 3 Corporate 68 Scott Bunn North America New York
## 4 Corporate 68 Scott Bunn North America New York
## 5 Corporate 68 Scott Bunn North America New York
## 6 Corporate 68 Scott Bunn North America New York
## Country...Region City Postal.Code Ship.Date
## 1 United States of America New York City 10177 1/2/2010
## 2 United States of America New York City 10177 1/4/2010
## 3 United States of America New York City 10177 1/9/2010
## 4 United States of America New York City 10177 2/4/2010
## 5 United States of America New York City 10177 5/3/2012
## 6 United States of America New York City 10177 5/4/2012
## Ship.Mode
## 1 Delivery Truck
## 2 Delivery Truck
## 3 Regular Air
## 4 Delivery Truck
## 5 Regular Air
## 6 Regular Air
## SubRegion
## 1 East
## 2 East
## 3 East
## 4 East
## 5 East
## 6 East

Penggunaan Perintah Select

trainingselect=dplyr::select(training, Customer.Name, Item, Region)
head(trainingselect)
##   Customer.Name
## 1 Joy Corbett
## 2 Anita Hahn
## 3 Ernest Oh
## 4 Scott Bunn
## 5 Scott Bunn
## 6 Scott Bunn
## Item
## 1 Perma STOR-ALL\231 Hanging File Box, 13 1/8"W x 12 1/4"D x 10 1/2"H
## 2 Safco Industrial Wire Shelving
## 3 Avery Trapezoid Ring Binder, 3" Capacity, Black, 1040 sheets
## 4 Hon 4070 Series Pagoda\231 Armless Upholstered Stacking Chairs
## 5 Hon Valutask\231 Swivel Chairs
## 6 Dual Level, Single-Width Filing Carts
## Region
## 1 AsiaPac
## 2 North America
## 3 North America
## 4 North America
## 5 North America
## 6 North America

Menyusun Data

training_ar= arrange(training, Profit)
head(training_ar)
##     Row Order.Priority Order.Date Order Discount Unit.Price Order.Quantity
## 1 2413 Critical 9/11/2010 90438 6% $3.00 1
## 2 512 Critical 2/17/2010 90588 5% $6.00 1
## 3 3977 Not Specified 2/24/2011 90127 8% $6.00 4
## 4 8341 Not Specified 4/9/2012 86479 9% $2.00 8
## 5 8457 High 4/18/2012 87657 2% $6.00 1
## 6 12816 Low 4/12/2013 87313 8% $6.00 4
## Sales Profit Shipping.Cost Product.Base.Margin Department
## 1 $2.82 -$0.01 $1.00 39.0% Office Supplies
## 2 $5.70 -$0.02 $2.00 38.0% Office Supplies
## 3 $22.08 -$0.04 $7.00 37.0% Office Supplies
## 4 $14.56 -$0.04 $5.00 40.0% Office Supplies
## 5 $5.88 -$0.04 $4.00 68.0% Technology
## 6 $22.08 -$0.04 $7.00 37.0% Office Supplies
## Container Category
## 1 Small Box Labels
## 2 Wrap Bag Paper
## 3 Small Box Paper
## 4 Small Box Binders and Binder Accessories
## 5 Small Pack Computer Peripherals
## 6 Small Box Paper
## Item Customer.Segment Customer
## 1 Avery 479 Corporate 753
## 2 Adams Telephone Message Books, 5 1/4” x 11” Corporate 936
## 3 Xerox 214 Small Business 1660
## 4 Avery Reinforcements for Hole-Punch Pages Small Business 3030
## 5 Imation 3.5" IBM-Formatted Diskettes, 10/Pack Corporate 25
## 6 Xerox 210 Home Office 98
## Customer.Name Region State Country...Region
## 1 Elisabeth Massey North America Arizona United States of America
## 2 Robyn Garner North America California United States of America
## 3 Keith Liu Duncan North America Texas United States of America
## 4 Janice Woods North America Oregon United States of America
## 5 Virginia Hardison North America California United States of America
## 6 Sarah Cooke Anthony North America Kansas United States of America
## City Postal.Code Ship.Date Ship.Mode
## 1 Prescott 86301 9/11/2010 Express Air
## 2 Redlands 92374 2/19/2010 Express Air
## 3 Mesquite 75150 2/25/2011 Regular Air
## 4 Aloha 97007 4/9/2012 Regular Air
## 5 Lake Elsinore 92530 4/20/2012 Regular Air
## 6 Olathe 66062 4/16/2013 Express Air
## SubRegion
## 1 West
## 2 West
## 3 Central
## 4 West
## 5 West
## 6 Central

Filter, Select, dan Menyusun data

str(training)
## 'data.frame':    16798 obs. of  26 variables:
## $ Row : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.Priority : Factor w/ 5 levels "Critical","High",..: 2 5 1 3 3 3 1 1 1 3 ...
## $ Order.Date : Factor w/ 1427 levels "1/1/2010","1/1/2011",..: 1 1 44 44 44 44 44 44 44 44 ...
## $ Order : int 28774 88028 9285 37537 37537 37537 44069 44069 89083 87946 ...
## $ Discount : Factor w/ 17 levels "0%","1%","10%",..: 3 15 13 1 14 12 16 15 13 12 ...
## $ Unit.Price : Factor w/ 237 levels "$1,000.00","$1,271.00",..: 198 235 161 122 11 45 230 41 161 45 ...
## $ Order.Quantity : int 32 2 3 4 43 32 16 43 1 8 ...
## $ Sales : Factor w/ 8176 levels "$0.90","$0.94",..: 2281 2303 1607 264 4939 5152 1830 6340 4826 281 ...
## $ Profit : Factor w/ 11737 levels "-$0.01","-$0.02",..: 2183 8038 6683 9478 4478 4495 7786 4054 11310 9812 ...
## $ Shipping.Cost : Factor w/ 86 levels "$0.00","$1.00",..: 50 34 30 49 45 71 20 20 30 71 ...
## $ Product.Base.Margin: Factor w/ 61 levels "14.5%","3.6%",..: 43 25 7 31 44 34 13 10 7 34 ...
## $ Department : Factor w/ 3 levels "Furniture","Office Supplies",..: 2 2 2 1 1 2 2 2 2 2 ...
## $ Container : Factor w/ 7 levels "Jumbo Box","Jumbo Drum",..: 5 3 5 2 2 5 7 7 5 5 ...
## $ Category : Factor w/ 17 levels "Appliances","Binders and Binder Accessories",..: 15 15 2 4 4 15 11 11 2 15 ...
## $ Item : Factor w/ 1263 levels "\"While you Were Out\" Message Book, One Form per Page",..: 862 921 229 606 625 395 276 1115 229 395 ...
## $ Customer.Segment : Factor w/ 4 levels "Consumer","Corporate",..: 4 3 1 2 2 2 1 1 1 2 ...
## $ Customer : int 1656 2211 949 68 68 68 1154 1154 950 67 ...
## $ Customer.Name : Factor w/ 3403 levels "Aaron Davies Bruce",..: 1686 133 981 2946 2946 2946 2219 2219 1430 916 ...
## $ Region : Factor w/ 4 levels "AsiaPac","EMEA",..: 1 4 4 4 4 4 1 1 4 4 ...
## $ State : Factor w/ 149 levels "?saka","Addis Ababa",..: 22 66 19 85 85 85 1 1 71 19 ...
## $ Country...Region : Factor w/ 50 levels "Algeria","Argentina",..: 14 49 49 49 49 49 25 25 49 49 ...
## $ City : Factor w/ 1523 levels "Aberdeen","Abidjan",..: 1327 136 760 916 916 916 992 992 1100 893 ...
## $ Postal.Code : int NA 20715 90049 10177 10177 10177 NA NA 55372 94559 ...
## $ Ship.Date : Factor w/ 1459 levels "1/1/2011","1/1/2012",..: 49 94 106 49 106 127 106 49 106 127 ...
## $ Ship.Mode : Factor w/ 3 levels "Delivery Truck",..: 3 2 3 1 1 3 2 3 3 3 ...
## $ SubRegion : Factor w/ 6 levels "","Canada "| __truncated__,..: 1 4 6 4 4 4 1 1 3 6 ...
training$Unit.Price=as.numeric(training$Unit.Price)
is.numeric(training$Unit.Price)
## [1] TRUE
training_ar2=
training %>%
select(Item, Unit.Price) %>%
arrange(Item, Unit.Price) %>%
filter(Unit.Price>=100)%>%
head(training_ar2)
##                                                   Item Unit.Price
## 1 "While you Were Out" Message Book, One Form per Page 157
## 2 "While you Were Out" Message Book, One Form per Page 157
## 3 "While you Were Out" Message Book, One Form per Page 157
## 4 "While you Were Out" Message Book, One Form per Page 157
## 5 "While you Were Out" Message Book, One Form per Page 157
## 6 "While you Were Out" Message Book, One Form per Page 157

Membuat Bar Chart

library(tidyverse)
library(cowplot)
fillColor = "red"
plotx= function(ds, fillColorName= fillColor) {
ds%>%
group_by(Region) %>%
filter(!is.na(Region)) %>%
summarise(Count= n()) %>%
mutate(TotalCount= nrow(ds)) %>%
mutate(Percentage=(Count/TotalCount) * 100) %>%
arrange(desc(Count)) %>%
ungroup() %>%
mutate (Region = reorder(Region,Count)) %>%
head(10) %>%

ggplot(aes(x= Region, y= Percentage)) +
geom_bar( stat='identity', colour="red", fill= fillColorName) +
geom_text(aes(x= Region, y=1, label= paste0("(",round(Percentage,2),"%)", sep="")),
hjust=0, vjust=.5, size=4,colour='black', fontface= 'bold') +
labs( x='Group',
y= 'Percentage', title= 'Percentage of Region') +
coord_flip() +
theme_bw()
}
all= plotx(training)
plot_grid(all)

Referensi :

Statistician | Data Analyst at Speedwork Indonesia | SQL Addict

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store