Data Wrangling and Visualization Certificate Program

According to a New York Times article by Steve Lohr (2014), data scientists spend 50% to 80% of their time on data cleaning and transformation processes called data wrangling and 20%-50% of their time on data modeling, implying the importance of skills needed for the data wrangling task.

“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets (Steve Lohr, August 17, 2014).”

However, most degree programs focus on data modeling, presumably because that is most technically challenging and worthy of a degree. Most courses in various types of data science programs do not offer a course in data wrangling and visualization systematically, but they expect students to use data wrangling and visualization in conjunction with modeling, making students face two challenges at the same time. The same is true in most statistics classes. Students have to deal with learning not only statistics topics but also programming software. Thus, this certification is designed to help students without much basic knowledge of R, a primary statistical analysis software used by data scientists, by giving them the necessary knowledge in programming so that they can focus more on statistics/machine learning topics in their future endeavors. Further, this course is also aimed to give data science aspirants introductory knowledge and skills to help them get started.

Course Overview

There are a total of 12 modules plus a capstone project in this course. In Modules 1 and 2, you will learn about data science in R and its ecosystem for modern data science that utilizes the concept of tidy data manifested through the Tidyverse mega package. You will learn about R’s capability and compare it with Python. In Modules 3 and 4, you will learn how to visualize the data strategically to support the kinds of storytelling for your audience, which is the final goal of this course. By knowing the stories you want to tell, you will set your visualization goal from the very beginning.

The rest of the modules are devoted to teaching you the tools you need to shape the data to support data visualization. That is, you will learn how to wrangle the data to support the visualization in Modules 5 and 6 (How to tidy data and transform data), Modules 7 and 8 (How to deal with various types of data format such as strings, factor, date, and time), and Modules 9 and 10 (How to import and export various forms of data). Modules 11 and 12 (Programming) are not part of the wrangling process, but adding programming skills will help you save a lot of time by automating repetitive coding tasks. You will have learned most of the concepts and skills needed for the Capstone project by the time you finish Module 12.

The capstone project will be introduced right after Module 4 so that you can start tackling the problems each time you learn important concepts. See which questions you can answer after you finish Module 4. Revisit the capstone project again after Modules 5 and 6, again after Modules, 7 and 8, and so on and on. In the process, you will continue to modify your coding and improve it better and better each time you revisit it. By the end of Module 12, your capstone project should be ready for submission.

Class begins on February 3, 2025. Class ends on May 11, 2025. (Modality: Online Asynchronous)

Registration Info

Course Outline

In this module, you will also learn how to make the best use of the Quarto Markdown within Rstudio to enhance reproducible publication. RStudio is an IDE (Integrated Development Environment) that makes learning R much easier. In Quarto, you can run not only R codes but also Python codes or codes for other programming languages. With this preparation done, you will learn what you can do with R. While the program will teach you Base R, it will focus on the Tidyverse approach of data science with the use of the Tidyverse package. In base R, one had to use several parentheses together, adding a code inside the parentheses and inside yet another parenthesis, making coding complex. This makes your R code hard to read and understand. In modern R, you can zoom on data, using the pipe operator (%>% or |>), which allows information to move from top to bottom through a pipe. That is, you first start with data, then pipe the data into the next steps -- to tidy, transform, or visualize data, resulting in a chain of