Data Wrangling and Visualization Certificate Program (DWV 101)

Data Wrangling and Visualization Program Overview

Why Learn Data Wrangling and Visualization using Modern R - Tidyverse?

According to a New York Times article by Steve Lohr (2014), data scientists spend 50% to 80% of their time on data wrangling (i.e., data cleaning and transformation), which leaves only 20% to 50% for data modeling. This underscores how important data wrangling skills are.

“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” (Steve Lohr, August 17, 2014)

However, most degree programs focus on data modeling, presumably because it is seen as the most technically challenging part of the work and the most worthy of a degree. Few data science programs offer a systematic course in data wrangling and visualization. Instead, they expect students to pick up these skills while learning modeling, so students face two challenges at once. The same is true in most statistics classes, where students must learn both statistics and a programming language. To make matters worse, statistics classes often use Base R, which many beginners find harder to learn and which can dampen the enthusiasm they started with.

Learning to code can and should be much easier. Modern data science in R relies heavily on the Tidyverse, a collection of packages that is easier to learn and fun to use. For data visualization, for instance, the Tidyverse includes ggplot2, one of the most popular data visualization packages. With ggplot2, you can customize a chart in almost any way you want to support the strategic storytelling in your presentation.

Why This Certificate Program?

This certificate program is designed to help both novices and intermediate users of Base R learn the Tidyverse way of coding for data wrangling and visualization, so that they can focus on statistics or machine learning topics when they take those classes. The course also aims to give aspiring data scientists the knowledge and skills for two of the most common tasks of data analysts and data scientists, data wrangling and visualization, to help them get started in their careers.

The course follows the recommendation of Hadley Wickham, the chief architect of the popular Tidyverse collection of packages that you will use throughout the course. We will start with the most important aspect of the Tidyverse: data visualization. After you master that topic, we will teach you the nuts and bolts of manipulating and transforming all sorts of data to support the kind of visualization you want to create for your storytelling. We will not stop at wrangling, either. Each time you learn a new wrangling skill, you will be guided to produce a chart as the final product. This way, you will keep applying your visualization skills across all 12 modules and the capstone project.

Our main tool for producing visualizations and the reports that contain them is Quarto, an open-source technical publishing system for creating a variety of documents, including articles, websites, blogs, books, and presentations. It supports multiple programming languages, such as Python, R, Julia, and Observable JavaScript, making it a versatile tool for data scientists and technical writers alike.

The certification includes 12 modules and a capstone project, to be completed in about 13 weeks. Using the Tidyverse framework (modern data science in R), the course comprehensively covers data wrangling and visualization so that you can present complex data to a non-technical audience, with a focus on strategic data visualization. Please see the Course Outline below for the content of each module.

The price is set to be competitive with online courses available elsewhere, giving you unparalleled value for the quality of education you will get. We invite you to take the time to learn about how we can help you on your journey. Contact us at Insights-Lab@cpp.edu if you have any questions.

Skills Covered

Strategic data visualization; literate coding; reproducible research; wrangling for all types of data; making interactive or animated charts and dashboards; modern wrangling and visualization skills using the Tidyverse in R; importing and exporting data from and to the web, cloud, and local drive; joining multiple relational datasets; dealing with labeled data (e.g., SAS, SPSS, Stata); functional coding; creating reports in HTML, PDF, and PowerPoint formats; publishing on the web; automating and simplifying repetitive tasks; styling documents with CSS; etc.

Tools Covered

R, RStudio, Quarto, gt, gtsummary, gtExtras, htmltools, HTML/CSS, shiny, dplyr, tidyr, stringr, forcats, lubridate, readr, purrr, ggplot2, magrittr, tidyverse, plotly, ggrepel, ggthemes, GGally, gganimate, GitHub, web scraping with rvest, haven, broom, patchwork, naniar, here, scales, labelled, sjlabelled, sjPlot, kableExtra, glue, etc.

Prerequisite

None. No prior coding experience is needed.

Expected Outcome

Participants will receive a Certificate of Completion upon successful completion of the program. Your letter grade will appear on a professional program transcript, but the course will not count toward an academic credit-bearing degree program. In addition to the certificate, graduates earn a shareable digital badge issued through Badgr. Display it on LinkedIn, your resume, or your personal website to verify your new data wrangling and visualization skills. You will also come away with over 20 codebooks and a comprehensive capstone project, which will aid you in future endeavors. Upon completing the course, students will be able to:

Explain the role of data visualization as part of persuasive presentation and storytelling.
Choose the most effective visualization method for a given dataset and audience.
Wrangle data to support strategic visualization objectives.
Generate insights and recommendations.
Produce effective and persuasive presentations.
Practice reproducible data visualizations to promote transparency, credibility, and ethics.

Course Outline

Course Overview

This course has 12 modules plus a capstone project. In Modules 1 and 2, you will learn about literate programming and about R and its ecosystem for modern data science, which is built on the concept of tidy data as implemented in the Tidyverse collection of packages. You will learn about R’s capabilities and compare them with Python’s. In Modules 3 and 4, you will learn how to visualize data strategically to support the storytelling you want to do for your audience, which is the final goal of this course. By knowing the stories you want to tell, you will set your visualization goal from the very beginning.

The remaining modules teach you the tools you need to shape data to support visualization. You will learn how to tidy and transform data in Modules 5 and 6, how to work with various data types such as strings, factors, dates, and times in Modules 7 and 8, and how to import and export various forms of data and share your visualizations in Modules 9 and 10. Modules 11 and 12 (Programming) are not part of the wrangling process, but programming skills will save you a lot of time by automating repetitive coding tasks. By the time you finish Module 12, you will have learned most of the concepts and skills needed for the capstone project.

The capstone project allows participants to choose a topic and data of interest to them and to share their findings through a variety of data visualization methods, such as charts, tables, slides, and interactive dashboards, culminating in publishing them online on a website.

In this module, you will learn how to install R and RStudio. You will also learn how to make the best use of Quarto within RStudio to support reproducible publication. RStudio is an IDE (Integrated Development Environment) that makes learning R much easier. In Quarto, you can run not only R code but also Python code or code in other programming languages. With this preparation done, you will learn what you can do with R. While the program will teach you some Base R, it will focus on the Tidyverse approach to data science.

In Base R, chaining several operations often means nesting function calls, placing one call inside the parentheses of another and that one inside yet another. Nested code must be read from the inside out, which makes it hard to read and understand. In modern R, you can keep the focus on the data by using the pipe operator (%>% or |>), which passes the result of one step to the next so that the code reads from top to bottom. That is, you start with the data and then pipe it into the next steps (tidy, transform, or visualize), resulting in a chain of clear and readable operations. This approach simplifies coding, making it more intuitive and easier to debug. Each step in the chain performs a specific task, allowing you to focus on one operation at a time while maintaining a logical flow.
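The contrast between nested calls and a pipe chain can be sketched as follows. This is a minimal illustration, not taken from the course materials: it assumes the dplyr package is installed and R 4.1 or later (for the native |> pipe), and it uses R's built-in mtcars dataset. The filter threshold and the column name mean_mpg are arbitrary choices made for the example.

```r
library(dplyr)

# Nested style: the calls must be read from the inside out.
# (filter first, then group_by, then summarise, then arrange)
nested <- arrange(
  summarise(
    group_by(filter(mtcars, hp > 100), cyl),
    mean_mpg = mean(mpg)
  ),
  desc(mean_mpg)
)

# Piped style: the same steps, read from top to bottom.
piped <- mtcars |>
  filter(hp > 100) |>                  # keep cars with more than 100 horsepower
  group_by(cyl) |>                     # group by number of cylinders
  summarise(mean_mpg = mean(mpg)) |>   # average fuel economy per group
  arrange(desc(mean_mpg))              # sort from highest to lowest average

# Both versions compute the same summary table.
print(piped)
print(all.equal(nested, piped))
```

The magrittr pipe (%>%) could be substituted for |> in this sketch without changing the result. In either form, each line of the piped version does one thing, so a step can be commented out or inspected on its own when debugging.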