Data Wrangling and Visualization
According to a New York Times article by Steve Lohr (2014), data scientists spend 50% to 80% of their time on data cleaning and transformation, a process known as data wrangling, which leaves only 20% to 50% of their time for data modeling. This split underscores the importance of data wrangling skills.
“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” (Steve Lohr, August 17, 2014)
However, most degree programs focus on data modeling, presumably because it is the most technically challenging part and is seen as worthy of a degree. Most data science programs do not teach data wrangling and visualization systematically; instead, they expect students to pick up these skills while learning modeling, so students face two challenges at the same time. The same is true in most statistics classes, where students must learn not only statistics topics but also the software used to apply them. Thus, this certification is designed to help students without much background in R, one of the primary statistical programming languages used by data scientists, by giving them the programming knowledge they need so that they can focus on statistics and machine learning topics in their future endeavors. The course also aims to give aspiring data scientists the introductory knowledge and skills they need to get started.
Course Overview
There are a total of 12 modules plus a capstone project in this course. In Modules 1 and 2, you will learn about data science in R and its ecosystem for modern data science that utilizes the concept of tidy data manifested through the Tidyverse mega package. You will learn about R’s capability and compare it with Python. In Modules 3 and 4, you will learn how to visualize the data strategically to support the kinds of storytelling for your audience, which is the final goal of this course. By knowing the stories you want to tell, you will set your visualization goal from the very beginning.
The rest of the modules are devoted to teaching you the tools you need to shape the data to support data visualization. That is, you will learn how to wrangle the data to support the visualization in Modules 5 and 6 (how to tidy and transform data), Modules 7 and 8 (how to deal with various data types such as strings, factors, dates, and times), and Modules 9 and 10 (how to import and export various forms of data). Modules 11 and 12 (Programming) are not part of the wrangling process, but programming skills will help you save a lot of time by automating repetitive coding tasks. You will have learned most of the concepts and skills needed for the capstone project by the time you finish Module 12.
The capstone project will be introduced right after Module 4 so that you can start tackling its problems as you learn each important concept. See which questions you can answer after you finish Module 4. Revisit the capstone project after Modules 5 and 6, again after Modules 7 and 8, and so on. In the process, you will keep revising and improving your code each time you revisit it. By the end of Module 12, your capstone project should be ready for submission.
Class begins on August 26, 2024. Class ends on November 24, 2024. (Modality: Online Asynchronous)
Course Outline
In this module, you will learn how to install R and RStudio. You will also learn how to make the best use of R Markdown within RStudio. RStudio is an IDE (Integrated Development Environment) that makes learning R much easier. In RStudio, you can run not only R code but also Python code or code written in other programming languages. With this preparation done, you will learn what you can do with R. While the program will teach you Base R, it will focus on the Tidyverse approach to data science using the Tidyverse package.
Upon successful completion of this module, you will be able to:
- Install R and RStudio.
- Describe the layout and menus of RStudio.
- Create, run, and save an R script file.
- Install R packages and load them.
- Create an R Markdown file, run and organize its code, and save it.
The goal of this module is to introduce an overview of the world of data science in R and get you ready for the rest of the modules.
Specifically, you will explore the universe of R and data science in general in this module. As R spreads through the academic and research community, more and more college students are learning statistics with R. R was initially developed as a statistical tool. What else can we do with R? What do data scientists do with R?
For starters, you can create a chart any way you want with ggplot2. You can create and update Word, HTML, PowerPoint, and PDF files right from an R Markdown file. You can animate your charts and make them interactive. You can create a website and a dashboard with Shiny and shinydashboard. You can build machine learning models with caret or tidymodels. You can run Python in R with the reticulate package. Some packages help you create charts with a menu-driven approach (e.g., ggThemeAssist and esquisse).
In this module, you will learn R's capabilities and resources that you can use to learn the skills. There is no way you can learn everything in this short course. Thus, we will focus on some fundamental topics, while you will be led to resources for advanced topics you can tackle in the future.
Upon successful completion of this module, you will be able to:
- Describe the concept of the Tidyverse way of coding.
- List the pros and cons of R and Python for data science.
- Describe the capability of the Tidyverse package in R.
- Explain how to use online resources provided by the R community.
ggplot2 is a plotting package that provides helpful commands to create complex plots from data. It provides a programmatic interface for specifying which variables to plot, how they are displayed, and their general visual properties. As a result, only minimal changes are needed if the underlying data changes or if we decide to switch from a bar plot to a scatterplot, which helps in creating publication-quality plots with little adjustment and tweaking. It is not surprising that ggplot2 is included in the Tidyverse.
In this module, you will be introduced to the concepts of the grammar of graphics as applied to visualizing data with the ggplot2 package. Every ggplot starts with defining the data and the aesthetic mappings of its variables, which form the foundation. Then, you add a geometry layer on top of that foundation. These two pieces are the basics of any ggplot; the rest of the layers are optional. You will also learn how to create charts for one or more continuous or categorical variables in your data set, and how to plot charts after you select and filter the variables you want to plot.
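The layering idea described above can be sketched as follows. This is only an illustrative example, assuming the ggplot2 package is installed; it uses the `mpg` data set that ships with ggplot2, and the object names (`p_foundation`, `p_scatter`, `p_full`, `p_bar`) are made up for this sketch.

```r
library(ggplot2)

# Foundation: the data plus the aesthetic mappings (which variable goes where).
p_foundation <- ggplot(mpg, aes(x = displ, y = hwy))

# Geometry layer: draw one point per row on top of the foundation.
p_scatter <- p_foundation + geom_point()

# Optional layers: a third (categorical) variable mapped to color,
# axis labels, and a theme.
p_full <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon") +
  theme_minimal()

# A single categorical variable: a bar chart of counts per vehicle class.
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()
```

Typing the name of a plot object (for example `p_full`) at the console renders it. Switching from one chart type to another usually means changing only the geometry layer.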
Upon successful completion of this module, you will be able to:
- Explain the concept of the grammar of graphics when visualizing data with the ggplot2 package.
- Be familiar with various types of charts.
- Visualize counts, proportions, and geospatial data.
- Select appropriate charts based on strategic considerations (e.g., the characteristics of the data and audience).
- Create a chart that involves one or two variables.
- Create a chart by adding a categorical moderator (3rd variable) to the chart involving two variables.
- Create correlation charts.
In this module, you will be introduced to advanced topics of the grammar of graphics and ggplot2 extensions such as patchwork and gganimate. You will also observe how data scientists use ggplot2 to customize charts for better communication, including live screencast demonstrations of clever and creative data visualizations. Advanced users or motivated beginners are encouraged to tackle the advanced visualization codebook with interactive, animated, or geospatial charts.
Upon successful completion of this module, you will be able to:
- Customize correlation charts.
- Create a chart that involves four or five variables by adding layers of geometry.
- Modify charts for storytelling by customizing axes, labels, coordinates, themes, etc.
- Explain when to use interactive or animated charts or dashboards.
- Visualize geospatial data on a map.
- Read charts and generate insights.
Understanding the concept of data wrangling is important because many professionals in technology industries have to work with different data types and with data from various sources. In this module, you will be introduced to data wrangling. In a broad sense, data wrangling includes (1) importing data, (2) tidying data, and (3) transforming data. After the wrangling process, you can proceed with visualizing and modeling the data.
In this module, we will focus on the last two stages of the wrangling process -- tidying and transforming data. You will also be introduced to a modern data structure called the tibble and learn how tibbles differ from data frames. One of the first steps in data wrangling in the Tidyverse ecosystem is to reshape the data to be tidy using the "tidyr" package. You will learn the concept of tidy data and how reshaping the data helps you visualize it. Once your data is tidy, you will want to transform it with the help of the "dplyr" package, and you will learn some key dplyr functions for doing so. In base R, chaining several operations means nesting one function call inside the parentheses of another, which makes the code hard to read and understand. In modern R, you can use the pipe operator (%>% or |>), which passes the result of each step to the next so that the code reads from top to bottom. That is, you start with the data and then pipe it into functions that tidy or transform it. Once wrangling is done, you pipe the wrangled data into ggplot() to add the necessary mappings and geometry, as well as optional layers such as coordinates and themes, according to the grammar of graphics.
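To make the contrast between nested calls and a pipeline concrete, here is a minimal sketch. It assumes R 4.1 or later (for the native `|>` pipe) and the dplyr, tidyr, and ggplot2 packages; the `scores_wide` data and its column names are hypothetical and exist only for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical wide data: one row per student, one column per exam.
scores_wide <- tibble(
  student = c("Ana", "Ben", "Cai"),
  exam1   = c(80, 65, 92),
  exam2   = c(85, 70, 88)
)

# Nested base R style: the calls must be read from the inside out.
nested <- round(mean(sqrt(scores_wide$exam1)), 1)

# Piped style: the same steps, read from left to right (or top to bottom).
piped <- scores_wide$exam1 |> sqrt() |> mean() |> round(1)

# Tidy first (one row per student and exam), then transform.
scores_summary <- scores_wide |>
  pivot_longer(cols = c(exam1, exam2), names_to = "exam", values_to = "score") |>
  group_by(student) |>
  summarise(mean_score = mean(score), .groups = "drop") |>
  arrange(desc(mean_score))

# Finally, pipe the wrangled data into ggplot() and add a geometry layer.
scores_plot <- scores_summary |>
  ggplot(aes(x = student, y = mean_score)) +
  geom_col()
```

Both `nested` and `piped` hold the same number; the only difference is how easy the code is to read. The magrittr pipe `%>%` could be used in place of `|>` here.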
Upon successful completion of this module, you will be able to:
- Describe the concept of Data Wrangling.
- Describe how tibbles are different from data frames.
- Explain how to convert wide or long data to "Tidy" data.
- Explain how to merge relational data sets using joins.
- Be familiar with key dplyr verbs and use them to transform data.
- Use the pipe operator to shape the data to prepare for analysis and visualization.
Continuing from Module 5, you will expand your horizons by going deeper into the topics with various approaches.
First, you will learn how data scientists wrangle and visualize data for their projects by watching demonstrations. Next, you will be exposed to more advanced topics and associated functions such as recode(), across(), case_when(), rownames_to_column(), distinct(), rowwise(), and c_across().
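As a rough illustration of how a few of these functions fit together, consider the sketch below. It assumes R 4.1 or later and the dplyr package; the `sales` data, its columns, and the size thresholds are invented for the example.

```r
library(dplyr)

# Hypothetical quarterly sales with one exact duplicate row.
sales <- tibble(
  region = c("West", "West", "East", "East"),
  q1     = c(10, 10, 7, 12),
  q2     = c(14, 14, 9, 15)
)

result <- sales |>
  distinct() |>                              # drop exact duplicate rows
  rowwise() |>                               # compute row by row
  mutate(total = sum(c_across(q1:q2))) |>    # sum across columns within a row
  ungroup() |>
  mutate(
    # case_when(): the first matching condition wins.
    size = case_when(
      total >= 25 ~ "large",
      total >= 20 ~ "medium",
      TRUE        ~ "small"
    ),
    # across(): apply the same function to several columns at once.
    across(q1:q2, \(x) x * 1000)
  )
```

After the pipeline runs, `result` has three rows, a `total` column computed per row, and a `size` label derived from that total.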
Upon successful completion of this module, you will be able to:
- Describe the concept of Data Wrangling.
- Describe how tibbles are different from data frames.
- Explain how to convert wide or long data to "Tidy" data.
- Explain how to merge relational data sets using joins.
- Be familiar with key dplyr verbs and use them to transform data.
- Use the pipe operator to shape the data to prepare for analysis and visualization.
Data types (numeric, character, logical, factor, dates) are the building blocks of data structures (vectors, matrices, data frames, and lists). Most of the time, you are likely to work with data frames, or tibbles in the Tidyverse framework, which resemble spreadsheets with rows of observations and columns of features or variables. Matrices are similar to data frames in that they consist of rows and columns. A major difference is that all values in a matrix must be of the same data type, whereas each column of a data frame can hold a different type. You can perform various operations on data to filter and view parts of it or to create new variables. Knowing the differences between the data types and structures will give you the basic knowledge needed to manipulate data later.
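The difference between a matrix and a data frame can be seen in a few lines of base R. The sketch below is only illustrative; the object and column names are made up.

```r
# A matrix holds a single data type: mixing numbers and text
# coerces every value to character.
m <- cbind(id = c(1, 2, 3), name = c("a", "b", "c"))
typeof(m)   # "character"

# A data frame keeps a separate type for each column.
df <- data.frame(
  id     = c(1, 2, 3),
  name   = c("a", "b", "c"),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
col_types <- sapply(df, class)

# Filter rows and create a new variable.
passed_only   <- df[df$passed, ]
df$id_squared <- df$id^2
```

Here `col_types` reports numeric, character, and logical columns living side by side in one data frame, which a matrix cannot do.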
In addition to Base R functions, the Tidyverse has many packages that help you work with different types of vectors much more efficiently than base R does. With the stringr package, you can wrangle strings (or characters) and regular expressions. With forcats, you can wrangle factors, which R uses to represent categorical data. With the lubridate package, you can wrangle dates and times.
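A small taste of each of the three packages is sketched below, assuming R 4.1 or later with stringr, forcats, and lubridate installed. The example values are invented, except for the two dates, which are this course's start and end dates.

```r
library(stringr)
library(forcats)
library(lubridate)

# stringr: clean up messy text and detect a pattern with a regular expression.
raw_names   <- c("  alice SMITH ", "bob  jones")
clean_names <- raw_names |> str_squish() |> str_to_title()
has_digits  <- str_detect(c("room 12", "lobby"), "\\d+")

# forcats: factor levels default to alphabetical order; reorder them explicitly.
sizes         <- factor(c("small", "large", "medium", "small"))
sizes_ordered <- fct_relevel(sizes, "small", "medium", "large")
size_counts   <- fct_count(sizes_ordered)

# lubridate: parse dates and do date arithmetic.
start_date    <- ymd("2024-08-26")
end_date      <- mdy("11/24/2024")
days_between  <- as.numeric(end_date - start_date)
one_week_on   <- start_date + weeks(1)
```

`clean_names` becomes "Alice Smith" and "Bob Jones", and `size_counts` lists the levels in the order small, medium, large rather than alphabetically.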
Upon successful completion of this module, you will be able to:
- Explain atomic R data types - numeric, character, logical, factors, and dates.
- Explain the differences between data structures -- vectors, matrices, data frames, and lists.
- Understand when to use each data type.
- Create data frames.
- Use popular functions from base R and Tidyverse to view and manipulate data frames that contain the various data types.
- Use popular functions from the stringr package to handle strings and regular expressions.
- Use popular functions from the forcats package to manipulate factors in R.
- Use popular functions from the lubridate package to manipulate dates.
Building on what you learned in the previous module, you will deepen your understanding of R data types, with particular emphasis on the Tidyverse.
You will develop a practical sense of how to choose appropriate functions from the packages associated with particular data types (stringr, forcats, and lubridate) by watching live coding sessions and well-prepared demonstrations of those packages. You will also be given the opportunity to practice more advanced topics with those packages.
Upon successful completion of this module, you will be able to:
- Use advanced functions from base R and Tidyverse to view and manipulate data frames that contain the various data types.
- Use advanced functions from the stringr package to handle strings and regular expressions.
- Use advanced functions from the forcats package to manipulate factors in R.
- Manipulate dates and times using advanced functions from the lubridate package.
- Use cheat sheets available online to find the right functions quickly.
Being able to import different formats of data is important because data scientists deal with secondary data collected by others, and secondary data exist everywhere inside and outside the organizations they work for. Therefore, learning how to import and export data is a fundamental step, which we introduce after you learn about data types and structures in Modules 7 and 8. You must use appropriate tools to import and wrangle the data depending on the data format. Gathering data from outside your company, such as from websites, is also a great way to extend your organization's capability and add value to your company; thus, web scraping is another form of importing data that you can keep in your toolbox. Further, you will learn how to deal with so-called labelled data produced by menu-driven proprietary statistical software such as SAS, SPSS, and Stata.
In addition to the Base R functions, Tidyverse has some packages that help you deal with different formats of data created by well-known software.
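As a minimal sketch of a round trip between R and a file, the example below writes a small data frame to a CSV file and reads it back with the readr package. The `orders` data is hypothetical, and a temporary file stands in for a real path. Other formats follow the same pattern with different packages, for example `readxl::read_excel()` for Excel files and `haven::read_sav()` for SPSS files.

```r
library(readr)

# Hypothetical data to export.
orders <- data.frame(
  order_id = 1:3,
  amount   = c(19.99, 5.5, 42)
)

# Export to a CSV file (a temporary file is used here for illustration).
path <- tempfile(fileext = ".csv")
write_csv(orders, path)

# Import the file back; readr guesses a type for each column.
orders_back <- read_csv(path, show_col_types = FALSE)
```

`orders_back` is a tibble with the same rows and columns as `orders`, which is a quick way to confirm that an export and import pair preserves the data.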
Upon successful completion of this module, you will be able to:
- Explain how to create a GitHub repository and collaborate with others on the same R projects.
- Effectively load and look through built-in datasets in R.
- Import various formats of data (csv, xlsx, and SPSS) to RStudio.
- Scrape data from the web using SelectorGadget and rvest.
- Import multiple external data sets and work with them.
- Work with labeled data in R.
- Export/save output data to a local PC and push it to GitHub.
Building on what you learned in the previous module about data import and export, you will learn how to share outputs more effectively in this module. First, you will learn the concept of literate coding and reproducible research using tools such as R Markdown and Quarto. Second, you will learn how to publish an interactive visualization dashboard online.
Upon successful completion of this module, you will be able to:
- Explain how reproducible research with literate coding enhances efficiency, ethics, transparency, and credibility.
- Explain how Quarto is different from R Markdown (Rmd).
- Describe how to make an interactive dashboard with Shiny.
- Produce customized tables from imported SPSS data effectively in R.
- Create charts from imported SPSS data in R.
The ability to create functions of your own will bring you even closer to being a data scientist. Although R has many frequently used built-in functions, you will want to be able to create your own functions tailored to your job or tasks whenever you find yourself or your team performing a particular task repeatedly. Building your own functions saves time.
In this module, you will be introduced to various types of vectors, logical operators, for loops, and if-else statements. You will also be introduced to good coding practices and see how they improve your code's efficiency. To make your code more readable, type fewer keystrokes, and execute a batch of jobs faster, you may want to use the family of map() functions from the purrr package, which is part of the Tidyverse, or the apply() family of functions from base R. Furthermore, you will be introduced to some built-in R functions, which are written using the same principles you will learn in this module. By the end of this module, you should be able to consider yourself a programmer.
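The pieces named above fit together roughly as in the sketch below, which assumes the purrr package is installed. The `grade_letter()` function and its cutoffs are hypothetical and serve only to show the same task done with a for loop, with `map_chr()`, and with `sapply()`.

```r
library(purrr)

# A small custom function with an if-else branch (cutoffs are made up).
grade_letter <- function(score) {
  if (score >= 90) {
    "A"
  } else if (score >= 80) {
    "B"
  } else {
    "C or below"
  }
}

scores <- c(95, 82, 67)

# Option 1: a for loop that fills a pre-allocated vector.
letters_loop <- character(length(scores))
for (i in seq_along(scores)) {
  letters_loop[i] <- grade_letter(scores[i])
}

# Option 2: purrr's map_chr(), which always returns a character vector.
letters_map <- map_chr(scores, grade_letter)

# Option 3: base R's sapply().
letters_apply <- sapply(scores, grade_letter)
```

All three options produce the same result; the map and apply versions simply express the repetition in a single line.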
Upon successful completion of this module, you will be able to:
- Describe good coding practices.
- Use the various types of logical operators.
- Use for loops in conjunction with other statements (if-else, next, break, etc.).
- Create your own R functions.
- Use map functions from the purrr package to increase efficiency.
- Use a family of apply() functions to simplify repetitive tasks.
- Be familiar with some built-in R functions.
Building on the fundamentals of programming you learned in the previous module, you will deepen your understanding of R programming in this module. You will also practice more on a family of map() functions from purrr package. Further, you will be exposed to more packages and tools that allow you to go beyond the familiar Tidyverse ecosystem.
Specifically, we will go deeper into functions and iteration with more explanations and examples to practice. Then, you will learn more about the purrr package and its family of map functions through demonstrations and exercises, including walk(), walk2(), and pwalk().
Because these concepts are hard, you will be given many examples to help you understand them. For instance, you will be introduced to a way to run thousands of statistical tests using the purrr and broom packages.
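One common pattern for running many tests at once is sketched below. It is an illustration only, assuming R 4.1 or later with dplyr, tidyr, purrr, and broom installed. It uses the built-in `mtcars` data and runs just three one-sample t-tests (one per cylinder group, against an arbitrary reference value of 20 mpg), but the same pipeline scales to thousands of groups.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

tests <- mtcars |>
  nest(data = -cyl) |>                    # one row per cylinder group
  mutate(
    # Run one t-test per group (mu = 20 is an arbitrary reference value).
    test   = map(data, \(d) t.test(d$mpg, mu = 20)),
    # broom::tidy() turns each test result into a one-row data frame.
    tidied = map(test, tidy)
  ) |>
  unnest(tidied) |>
  select(cyl, estimate, statistic, p.value)
```

The result is a single table with one row per test, which can then be filtered, sorted, or plotted like any other data frame.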
Upon successful completion of this module, you will be able to:
- Explain a good coding style.
- Describe all components of a function and their roles.
- Set return values and describe the environment in a function.
- Explain the differences between various for-loop variations.
- Map over multiple arguments using map2() and pmap().
- Use walk(), walk2(), and pwalk() in the middle of the pipelines.
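The last two outcomes can be illustrated with a short sketch, assuming R 4.1 or later and the purrr package. The vectors of box dimensions are made up for the example.

```r
library(purrr)

# Hypothetical box dimensions.
widths  <- c(2, 3, 4)
heights <- c(5, 6, 7)
depths  <- c(1, 2, 3)

# map2(): iterate over two inputs in parallel.
areas <- map2_dbl(widths, heights, \(w, h) w * h)

# pmap(): iterate over any number of inputs supplied as a list.
volumes <- pmap_dbl(list(widths, heights, depths), \(w, h, d) w * h * d)

# walk(): call a function for its side effect (here, printing a message).
# It returns its input invisibly, so the pipeline can continue afterwards.
half_areas <- areas |>
  walk(\(a) message("area: ", a)) |>
  map_dbl(\(a) a / 2)
```

`walk2()` and `pwalk()` work the same way as `walk()` but take two inputs or a list of inputs, mirroring `map2()` and `pmap()`.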
Working on a hands-on project is a great way to solidify your learning. Such a comprehensive, practical project can be added to a resume, portfolio, or LinkedIn profile. Use all the tools you learned from all 12 modules. Sometimes, you may have to use additional materials given in each module.
The capstone project will be challenging due to its scope and nature. If you worked hard on all 12 modules, you will have ideas to try, but you are not likely to remember every function you need. Review the codebooks and find code that you can apply to the new content.
Upon successful completion of this module, you will be able to:
- Be familiar with R data structure.
- Perform various operations to view and manipulate data.
- Import various types of data and export output data to other types of data.
- Wrangle data to “tidy” form for visualization and modeling.
- Write code to automate or simplify routine operations.
- Create charts, using R’s built-in functions as well as popular packages such as ggplot2.
Prerequisite
- No prior coding experience is needed.
Convenience with Responsibility
- Take the program anywhere in the world as the program is delivered online.
- Fully asynchronous offering, meaning that there is no set class time. Each module takes one week to finish, and all modules take six weeks to finish.
- However, you are responsible for managing your time so that the assignment associated with each module is finished by the deadline set on Canvas.
Learning Objectives
- Each module will follow the Quality Matters framework that has been proven effective for online learning success. That is, each module will start with learning outcomes, followed by step-by-step instructions, including a one-hour video lecture, supplemental materials to reinforce the lecture, and practice assignment(s).
- An assignment will be given out for each topic and graded with feedback in order to ensure that students can apply what they learned to a different task.
Skills Covered
- R, RStudio, R Markdown, dplyr, tidyr, stringr, forcats, lubridate, readr, purrr, ggplot2, magrittr, Tidyverse, plotly, ggrepel, ggthemes, GGally, gganimate, GitHub, web scraping with SelectorGadget, haven, broom, patchwork, naniar, here, scales, labelled, sjlabelled, sjPlot, kableExtra, glue, etc.
Rigorous Assessment and Verification
- To receive a certificate of achievement, participants must earn a grade of C or higher in each module.
- Watching a video is never sufficient to demonstrate your knowledge and skills in the topic, which is why we give students hands-on practice assignments.
- The certificate is issued only when you demonstrate that you achieved the learning outcomes.
- Students who want to take various data science programs (e.g., MS in Business Analytics, etc.) and various statistics courses at undergraduate as well as graduate levels.
- Company employees who need to learn R Programming.
- Anyone who wants to have a career in data science and business analytics.
- Anyone who wants to learn R Programming.
Dr. Jae Min Jung is a Professor of Marketing and the director of the Center for Customer Insights and Digital Marketing (CCIDM) at Cal Poly Pomona. He is also the director of the MS in Digital Marketing program. He received a Ph.D. in Marketing from the University of Cincinnati and an MBA degree with a concentration in Business Statistics from the University of North Texas. Currently, Dr. Jung is interested in applying econometrics and data science methods to consumer behavior and is working on several projects in the area of social media and digital marketing that deal with firm-level data, national-level data, and individual social media data. His research has been published in journals such as European Journal of Marketing, International Marketing Review, Journal of Business Research, Journal of Cross-Cultural Psychology, Marketing Letters, and Psychology & Marketing. Dr. Jung has taught various courses including Marketing Research, Data Mining for Marketing Decisions, and Marketing Analytics, often incorporating real-world proprietary customer data from companies. Dr. Jung orchestrated the design and production of a series of R workshops that attracted hundreds of participants from both the campus and the business community. This experience led to the offering of DWV 100, which is a prerequisite for MS in Digital Marketing students who lack coding experience. His research and teaching efforts have earned him recognition, including the prestigious Jagdish N. Sheth Research Award, Wall of COOL, Provost Teacher-Scholar Program Award, and Faculty of the Year Award.
Dr. Carsten Lange is a Professor of Economics. Dr. Lange received his Ph.D. in Economics from the University of Hannover, Germany. Dr. Lange specializes in money supply, inflation, central bank policy, and economic impact analysis. Dr. Lange has developed expertise in machine learning and AI, including neural networks and deep learning, serving as a member of Cal Poly’s High Performance Computing Cluster Working Group and engaging with the campus community on these topics. Most recently, he used his expertise in analytics and computer technology to create a database that tracks the estimated number of active COVID-19 cases in the most populous counties in all continental states of the U.S. His collaborations include applying machine learning to analyze electronic properties of crystals, predicting Major League Baseball pitcher salaries with artificial intelligence, and using GIS to predict urban development in North Carolina. Dr. Lange’s research has been published in books and peer-reviewed journals such as the Journal of Risk Finance, International Journal of Monetary Economics and Finance, Quarterly Review of Economics and Finance, and Journal of Emerging Markets. Dr. Lange has taught a number of subjects at both graduate and undergraduate levels, including Economic Statistics, Mathematical Economics, Spatial Statistics and Analyses, Neural Networks, and Machine Learning. His teaching and research efforts have earned him numerous awards and honors, including the Innovative Approaches to Instruction Award, Provost Teacher-Scholar Program Award, and Golden Leave Award.
Each time we offer the program, we ask participants to evaluate the course and the instructor. The following are selected testimonials from participants' anonymous course evaluations.
- “I enjoyed the step-by-step workshop video followed by an application assignment that pushed me to use the tools I learned to apply to code of my own. It was well-packaged and helpful to do alongside work and school.”
- “I really appreciate how organized this course was. With being a full time participant with many obligations, this course was very straightforward and clear to take. I really liked how the tools and resources were provided for us to use, and I was able to figure out my own problems when programming. I would recommend this program to others!”
- “I like how thorough the instructions and notes are throughout the Canvas course. I also like the videos since they are easy to follow. The feedback on the homework along with having the key are important to me to help improve my coding skills. I haven't attended office hours but I like the choice between two days in the evening. Whenever I have sent an email, I always receive a reply and/or feedback.”
3801 W. Temple Ave.
Bldg 220C-140
Pomona CA 91768
https://www.cpp.edu/cpge
Phone: 909-869-2288
Email : CPGEinfo@cpp.edu
Office Hours:
Monday – Friday 8:00 AM to 5:00 PM