Center for Customer Insights and Digital Marketing

Data Wrangling and Visualization Certificate Program

 

Overview

Why Learn Data Wrangling and Visualization using Modern R - Tidyverse?

According to a New York Times article by Steve Lohr (2014), data scientists spend 50% to 80% of their time on data wrangling (i.e., data cleaning and transformation) processes and 20%-50% of their time on data modeling, implying the importance of skills needed for the data wrangling task.

“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets (Steve Lohr, August 17, 2014).”

However, most degree programs focus on data modeling, presumably because it is the most technically challenging part and is considered worthy of a degree. Most data science programs do not teach data wrangling and visualization systematically; instead, they expect students to pick up these skills while learning modeling, so students face two challenges at the same time. The same is true in most statistics classes, where students have to learn not only statistics topics but also the programming software. This certification is therefore designed to help students with little or no knowledge of R, a primary statistical analysis software used by data scientists, by giving them the programming knowledge they need so that they can focus on statistics and machine learning topics. The course also aims to give data science aspirants introductory knowledge and skills to help them get started.

Learning to code can be, and should be, much easier. Modern data science in R relies heavily on the Tidyverse collection of packages, which is easier to learn and fun to use. For data visualization, for instance, the Tidyverse includes ggplot2, the most popular data visualization package in R. Using ggplot2, you can customize your charts creatively and persuasively in almost any way you want to support your strategic storytelling in a presentation.

Why This Certificate Program?

This certification is designed to help both novices and intermediate users of Base R by giving them the necessary knowledge of the Tidyverse way of coding for data wrangling and visualization, so that they can focus on statistics or machine learning topics when they take those classes. The course also aims to give data science aspirants adequate knowledge and skills in the most common tasks of data analysts and data scientists, data wrangling and visualization, to help them get started with their careers.

The course follows the recommendation of Hadley Wickham, the chief architect of the popular Tidyverse mega package you will learn throughout the course. We will start with the most important aspect of Tidyverse, data visualization! After you master that topic, we will teach you the nuts and bolts of how to manipulate and transform all sorts of data to support the kind of visualization you want to create for your storytelling. Each time you learn how to wrangle data, we will not stop there: you will be guided to produce a chart as the final step. This way, you will keep applying your visualization skills across all six modules.

The certification includes six well-structured modules (each with three assignments requiring 20 hours of study time) and a capstone project, which are to be completed in about 12 weeks. The course comprehensively covers wrangling and visualizing data using the Tidyverse framework, a modern approach to data science in R, so that you can prepare presentations of complex data for a non-technical audience with a focus on strategic data visualization. Please see Course Outline below for the content of each module.

The price is set to be competitive with online courses available elsewhere, giving you unparalleled value for the quality of education you will get. We invite you to take the time to learn how we can help you on your journey, and to contact us at Insights-Lab@cpp.edu if you have any questions.

Skills Covered

Strategic data visualization, literate coding, reproducible research, wrangling for all types of data, making interactive or animated charts and dashboards, modern wrangling and visualization skills using Tidyverse in R, importing and exporting data from and to the web, cloud, and local drive, joining multiple relational datasets, dealing with labeled data (e.g., SAS, SPSS, Stata), functional coding, creating reports in HTML, PDF, and PPT, automation and simplification of repetitive tasks, etc.

Tools Covered

R, RStudio, R Markdown, dplyr, tidyr, stringr, forcats, lubridate, readr, purrr, ggplot2, magrittr, Tidyverse, plotly, ggrepel, ggthemes, GGally, gganimate, GitHub, web scraping with SelectorGadget, haven, broom, patchwork, naniar, here, scales, labelled, sjlabelled, sjPlot, kableExtra, glue, etc.

Prerequisite

None. No prior coding experience is needed.

Expected Outcome

Participants will receive a Certificate of Completion upon successful completion of the program. Your letter grade will appear on a professional program transcript, but the course will not count toward a credit-bearing academic degree program. You can list the certificate on your resume. You will also come away with over 20 codebooks and a comprehensive capstone project, which will aid you in future endeavors. Upon completing the course, students will be able to:

  1. Explain the role of data visualization as part of persuasive presentation and storytelling. 
  2. Choose the most effective visualization method appropriate for a given data and audience characteristics.
  3. Wrangle data to support strategic visualization objectives.
  4. Generate insights and recommendations and present them effectively and persuasively.
  5. Practice reproducible data visualizations to promote transparency, credibility and ethics.

Course Outline

Course Overview

There are a total of six modules plus an optional capstone project in this course. In Module 1, you will learn about data science in R and its ecosystem for modern data science that utilizes the concept of tidy data manifested through the Tidyverse mega package. You will learn about R’s capability and compare it with Python. In Module 2, you will learn how to visualize the data strategically to support the kinds of storytelling for your audience, which is the final goal of this course. By knowing the stories you want to tell, you will set your visualization goal from the very beginning. 

The rest of the modules are devoted to teaching you the tools you need to shape the data to support data visualization. That is, you will learn how to wrangle the data to support the visualization in Module 3 (How to tidy data and transform data), Module 4 (How to deal with various data types such as strings, factors, dates, and times), and Module 5 (How to import and export various forms of data). Module 6 (Programming) is not part of the wrangling process, but adding programming skills will help you save a lot of time by automating repetitive coding tasks. You will have learned most of the concepts and skills needed for the capstone project in Module 2 through Module 5.

The capstone project will be introduced right after Module 2 so that you can start tackling its problems each time you learn important concepts. See which questions you can answer after you finish Module 2. Revisit the capstone project after Module 3, again after Module 4, and so on. In the process, you will keep modifying and improving your code each time you revisit it. By the end of Module 6, your capstone project should be ready for submission.

Module 1

The goal of this module is to give an overview of the world of data science in R and get you ready for the rest of the modules.

In this module, you will learn how to install R and RStudio. You will also learn how to make the best use of R Markdown within RStudio. RStudio is an IDE (Integrated Development Environment) that makes learning R much easier. In RStudio, you can run not only R code but also Python code or code in other programming languages. With this preparation done, you will learn what you can do with R. While the program will teach you Base R, it will focus on the Tidyverse approach to data science using the Tidyverse package.

In addition, you will learn about the universe of R for data science in general. Many college students learn statistics with R, so R is known as a statistics tool. What else can you do with R? What do data scientists do with R? For starters, you can create a chart any way you want with ggplot2. You can create and update Word, HTML, PPT, and PDF files right from an R Markdown file. You can animate your charts and make them interactive. You can create a website and a dashboard with Shiny. You can build machine learning models with caret or tidymodels. You can run Python in R with the reticulate package. There are also packages that help you create charts with a menu-driven approach (e.g., ggThemeAssist and esquisse).

In this module, you will learn R's capabilities and resources that you can use to learn the skills. There is no way you can learn everything in this short course; we will focus on some fundamental topics, while you will be given resources for advanced topics that you can tackle in the future. 

Upon successful completion of this module, you will be able to:

  1. Install R and RStudio.
  2. Describe the layout and menus of RStudio.
  3. Start an R script file, run code, and save the file.
  4. Install R packages and load them.
  5. Start an R Markdown file, run and organize code, and save the file.
  6. Describe the capability of the Tidyverse package in R.
  7. Explain how to use online resources provided by the R community.

Module 2

ggplot2 is a plotting package that provides helpful commands to create complex plots from data. It provides a programmatic interface for specifying which variables to plot, how they are displayed, and general visual properties. As a result, only minimal changes are needed if the underlying data change or if we decide to switch from a bar plot to a scatterplot. This helps in creating publication-quality plots with minimal adjustment and tweaking. It is not surprising, then, that ggplot2 is included in Tidyverse.

In this module, you will be introduced to the concepts of the grammar of graphics for visualizing data with the ggplot2 package. Every ggplot starts with defining the data and the aesthetics of variables, which is the foundation. Then, you add a geometry layer over the foundation. These two are the basics of any ggplot; the rest of the layers are optional. You will also learn how to create charts when your data set has one or more continuous or categorical variables. Furthermore, you will learn how to plot charts after selecting and filtering the variables that you want to plot in your data set.
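As a rough illustration of the layering described above, the following sketch builds a chart step by step. It assumes the ggplot2 package is installed and uses R's built-in mtcars data set; the choice of variables and labels is only an example, not taken from the course materials.

```r
library(ggplot2)

# Foundation: the data and the aesthetic mapping of variables to axes.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Geometry layer: draw the mapped data as points,
# colored by a categorical third variable.
p <- p + geom_point(aes(color = factor(cyl)))

# Optional layers: labels and a theme.
p <- p +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders") +
  theme_minimal()

print(p)
```

Because the geometry is a separate layer, swapping geom_point() for another geom changes the chart type without touching the data or the mapping.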

Upon successful completion of this module, you will be able to:

  1. Explain the concept of the grammar of graphics when visualizing data with the ggplot2 package.
  2. Be familiar with various types of charts.
  3. Visualize counts, proportions, and geospatial data.
  4. Select appropriate charts based on strategic considerations (e.g., the characteristics of the data and audience).
  5. Create a chart that involves one or two variables. 
  6. Create a chart by adding a categorical moderator (3rd variable) to a chart involving two variables.
  7. Create a chart that involves four or five variables by adding layers of geometry.
  8. Modify charts for storytelling by customizing axis, labels, coordinates, themes, etc.
  9. Explain when to use interactive or animated charts or dashboard styles.

Module 3

Understanding the concept of data wrangling is important because many professionals in technology industries face different data types and have to deal with various sources of data. In this module, you will be introduced to data wrangling. In a broad sense, data wrangling includes (1) importing data, (2) tidying data, and (3) transforming data. After the wrangling process, you can proceed to visualizing and modeling the data.

In this module, we will focus on the last two stages of the wrangling process: tidying and transforming data. You will also be introduced to a modern data structure called the tibble and learn how tibbles differ from data frames. One of the first steps in data wrangling in the Tidyverse ecosystem is to reshape the data into tidy form using the "tidyr" package. You will learn the concept of tidy data and how reshaping the data helps you visualize it. Once your data is tidy, you will want to transform it with the help of the "dplyr" package, and you will learn some key dplyr functions for doing so. In base R, one often had to nest function calls, placing one call inside the parentheses of another and that inside yet another, which makes R code hard to read and understand. In modern R, you can use the pipe operator (%>% or |>), which passes the result of one step to the next so that the code reads from top to bottom. That is, you start with the data, then pipe it into functions that tidy or transform it. Once wrangling is done, you pipe the wrangled data into ggplot() to do all the necessary mapping and geometry as well as other optional layers of coordinates and themes, according to the grammar of graphics.
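To make the contrast between nested calls and the pipe concrete, here is a minimal sketch. It assumes the dplyr and ggplot2 packages are installed and uses the built-in mtcars data set; the summary computed (average miles per gallon by number of cylinders) is an illustrative choice only.

```r
library(dplyr)
library(ggplot2)

# Nested style: the calls must be read from the inside out.
nested <- arrange(
  summarise(group_by(mtcars, cyl), avg_mpg = mean(mpg)),
  desc(avg_mpg)
)

# Piped style: the same steps, read from top to bottom.
# (In R 4.1 or later, the native pipe |> can be used the same way.)
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))

# Once wrangling is done, pipe the result into ggplot().
chart <- piped %>%
  ggplot(aes(x = factor(cyl), y = avg_mpg)) +
  geom_col()

print(chart)
```

Both versions produce the same summary table; the piped version simply lists the steps in the order they happen.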

Upon successful completion of this module, you will be able to:

  1. Describe the concept of Data Wrangling.
  2. Describe how tibbles are different from data frames.
  3. Explain how to convert wide or long data to "Tidy" data.
  4. Explain how to merge relational data sets using joins.
  5. Be familiar with key dplyr verbs and use them to transform data.
  6. Use the pipe operator to shape the data to prepare for analysis and visualization.

Module 4

Data types (numeric, character, logical, factor, dates) are the building blocks of data structures (vectors, matrices, data frames, and lists). Most of the time, you are likely to work with data frames, or tibbles in the Tidyverse framework, which are spreadsheet-like tables with rows of observations and columns of features or variables. Matrices are similar to data frames in that they consist of rows and columns. A major difference is that a matrix holds values of a single data type, whereas a data frame can mix columns of different types. You can perform various operations on data to filter and view parts of it or to create new variables. Knowing the differences between the data types and structures will give you the basic knowledge needed to manipulate data later.

In addition to Base R functions, Tidyverse has many packages that help you deal with different types of vectors much more efficiently than base R does. With the stringr package, you can wrangle strings (or characters) and regular expressions. With forcats, you can wrangle factors, R's representation of categorical data. With the lubridate package, you can wrangle dates and times.
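The difference between matrices and data frames can be seen in a few lines of base R. This is a small sketch with made-up values; the column names and numbers are hypothetical.

```r
# A matrix holds a single data type. Mixing types forces coercion:
# here every element becomes a character string.
mixed <- matrix(c(1, "a", TRUE), nrow = 1)
typeof(mixed)

# A data frame can mix types across columns.
df <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  age = c(31, 45, 28),
  member = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
col_classes <- sapply(df, class)

# Filter rows and create a new variable.
over_30 <- df[df$age > 30, ]
df$age_months <- df$age * 12
```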
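A short sketch of the three packages mentioned above follows, assuming stringr, forcats, and lubridate are installed. The sample names, sizes, and date are made-up values for illustration.

```r
library(stringr)
library(forcats)
library(lubridate)

# stringr: remove extra whitespace, standardize case,
# and match a regular expression.
raw_names <- c("  alice SMITH ", "bob jones", "Carol  Lee")
clean_names <- str_to_title(str_squish(raw_names))
ends_in_ee <- str_detect(clean_names, "ee$")

# forcats: factor levels default to alphabetical order;
# fct_relevel() puts them in a meaningful order.
sizes <- factor(c("small", "large", "medium", "small"))
sizes <- fct_relevel(sizes, "small", "medium", "large")

# lubridate: parse a date, extract a component, and do date arithmetic.
start <- mdy("08/17/2014")
start_year <- year(start)
finish <- start + weeks(12)
```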

Upon successful completion of this module, you will be able to:

  1. Explain atomic R data types - numeric, character, logical, factors, and dates.
  2. Explain the differences between data structures -- vectors, matrices, data frames, and lists.
  3. Understand when to use each data type.
  4. Create data frames.
  5. Use various functions from base R and Tidyverse to view and manipulate data frames that contain the various data types.
  6. Use functions from stringr package to handle strings and regular expressions.
  7. Use functions from the forcats package to manipulate factors in R.
  8. Manipulate dates and times using the lubridate package.

Module 5

Being able to import different formats of data is important because data scientists deal with secondary data collected by others, and secondary data exists everywhere inside and outside the organizations they work for. Therefore, learning how to import and export data is a fundamental step after learning about data types and structures in Module 4. Depending on the data format, you will need to use the appropriate tools to import and wrangle the data. Gathering data from outside your company, such as from websites, is also a great way to extend your organization's capability and add value to your company; web scraping is therefore another form of importing data to have in your toolbox. Further, you will learn how to deal with so-called labeled data produced by menu-driven proprietary statistical software such as SAS, SPSS, and Stata.
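As one small example of exporting and importing, the sketch below writes a few rows to a CSV file and reads them back. It assumes the readr package (version 2.0 or later) is installed; the temporary file path and the use of the built-in mtcars data set are illustrative choices.

```r
library(readr)

# Export: write the first five rows of a built-in data set to a CSV file.
path <- tempfile(fileext = ".csv")
write_csv(head(mtcars, 5), path)

# Import: read the file back in as a tibble.
cars <- read_csv(path, show_col_types = FALSE)
dim(cars)
```

Other formats follow the same pattern with a different reader, for example functions from the haven package for SPSS, SAS, or Stata files.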

In addition to the Base R functions, Tidyverse has some packages that help you deal with different formats of data created by well-known software. Upon successful completion of this module, you will be able to:

  1. Explain how to create a GitHub repository and collaborate with others on the same R projects.
  2. Effectively load and look through built-in datasets in R.
  3. Import various formats of data (csv, xlsx, and SPSS) to RStudio.
  4. Scrape data from the web using SelectorGadget and rvest.
  5. Import multiple external data sets and work with them.
  6. Work with labeled data in R.
  7. Export/save output data to a local PC and push it to GitHub.
  8. Explain how reproducible research with literate coding enhances efficiency, ethics, transparency, and credibility.

Module 6

The ability to create functions of your own will bring you even closer to being a data scientist. Although R has many frequently used built-in functions, you will want to be able to create your own functions tailored to your job or tasks, perhaps because you find yourself or your team performing a certain task repeatedly. By building your own functions, you can save time.

In this module, you will be introduced to various types of vectors, logical operators, for loops, if-else statements, and so on. You will also be introduced to good coding practices and see how they improve the efficiency of your code. To make your code more readable, require fewer keystrokes, and run a batch of jobs faster, you may want to use the family of map() functions from the purrr package, which is part of the Tidyverse, or the apply() family of functions from base R. Furthermore, you will be introduced to some built-in R functions, which are built on the same set of principles you will learn in this module.
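The sketch below shows the same repetitive task done three ways: with a for loop, with purrr's map(), and with base R's lapply(). It assumes the purrr package is installed; the helper function rescale01() and the choice of mtcars columns are hypothetical examples.

```r
library(purrr)

# A user-defined function (hypothetical example):
# rescale a numeric vector to the range 0 to 1.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

cols <- mtcars[c("mpg", "hp", "wt")]

# for loop with an if statement: skip any non-numeric column.
loop_result <- list()
for (nm in names(cols)) {
  if (!is.numeric(cols[[nm]])) next
  loop_result[[nm]] <- rescale01(cols[[nm]])
}

# The same task with purrr::map() and with base R's lapply().
map_result <- map(cols, rescale01)
apply_result <- lapply(cols, rescale01)
```

All three approaches return a named list of rescaled columns; the map() and lapply() versions do it in a single line each.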

After you finish this module, you should be considered a programmer. Upon successful completion of this module, you will be able to:

  1. Describe good coding practices.
  2. Use the various types of logical operators.
  3. Use for loops in conjunction with other statements (if-else, next, break, etc.).
  4. Create your own R functions.
  5. Use map functions from the purrr package to increase efficiency.
  6. Use a family of apply() functions to simplify repetitive tasks.
  7. Be familiar with some built-in R functions.

Capstone Project

This project is optional for the DWV 100 course. If you want a comprehensive, practical project that can be added to a resume, or would like to challenge yourself, try this project. Use all the tools you learned in all six modules. Sometimes, you will have to use the additional materials given in each module.

The capstone project will be challenging due to its scope and nature. If you worked hard on all six modules, you will have an idea of what to try, but you are not likely to remember all the functions you need. Review the codebooks and find the code that you can apply to the new content.

Upon successful completion of this module, you will be able to:

  1. Be familiar with R data structure.
  2. Perform various operations to view and manipulate data.
  3. Import various types of data and export output data to other types of data.
  4. Wrangle data to “tidy” form for visualization and modeling.
  5. Write code to automate or simplify routine operations.
  6. Create charts, using R’s built-in functions as well as popular packages such as ggplot2.

Targeted Careers and Job Outlook

Data science is one of the top two jobs in America in 2021 (Glassdoor). According to the US Bureau of Labor Statistics, employment of data scientists is expected to rise 22 percent by 2030, far faster than the eight percent average for all occupations. According to Harvard Business School, data science can be used in all areas of business (accounting, finance, manufacturing, management, marketing, and operations) for such tasks as gaining customer insights, increasing security, informing internal finances, streamlining manufacturing, and forecasting future marketing trends. Data science methods are used in the hard sciences as well. There are several different jobs under the broad umbrella of data science: Data Engineer, Data Analyst, Researcher, Business Executive, Entrepreneur, Full-Stack Data Scientist, etc. “Yet, to harness the power of big data, it isn’t necessary to be a data scientist,” according to HBS. Whatever your goal is, this certificate program is designed to introduce you to the R programming language.

Key Features of Certificate Program

  • Take the program anywhere in the world as the program is delivered online.
  • The offering is fully asynchronous, meaning that there is no set class time. It takes two weeks to finish one module and 12 weeks to finish all modules, including the capstone project.
  • However, you will need to manage your time so that the assignment associated with each module is finished by the deadline set on Canvas.

  • Each module will follow the Quality Matters framework that has been proven effective for online learning success. That is, each module will start with learning outcomes, followed by step-by-step instructions, including a one-hour video lecture, supplemental materials to reinforce the lecture, and assignment(s).
  • Each module will have one Principle Assignment (optional with bonus points) and an Application Assignment.
  • The Principle Assignment is intended for participants to spend the time watching lecture videos and organizing the learning, leading to a nice codebook that can be used for future reference. Since some experienced participants may prefer to skip the process, this assignment is not required, but participants who submit the assignment will be given extra credit and will be provided an additional set of codes that can expand what was taught in the video. 
  • The Application Assignment is intended to ensure that participants can apply what they learned to real-world situations that require critical thinking, problem solving, and creativity. Each Application Assignment is composed of about 10 questions that are usually connected to each other with the same data set, serving as a mini project. 
  • An instructor (professor with a Ph.D. with ample teaching and consulting experience) and the knowledgeable mentors at the Center for Customer Insights and Digital Marketing will be available to answer your questions immediately. See “Support for Learning” for details.

  • To receive a certificate of completion, participants must receive a grade of at least D- in the course.
  • You can choose to do only the beginner level or both the beginner and intermediate levels. With two optional assignments and one required assignment per module, suitable for both beginners and experienced coders, everyone should be able to pass the course by spending 3 hours per week. You are welcome to save study materials to your local computer for future study or reference. To fully achieve the learning outcomes during the course, however, you are expected to spend about 10 hours a week.
  • Watching a video is never sufficient to demonstrate your knowledge and skills in the topic, which is why we give participants hands-on practice assignments - Principle Assignment (optional) and Application Assignment.  
  • The Capstone Project gives participants the opportunity to utilize everything they learned to import, tidy, transform, and visualize messy data. The project will be available for participants to tackle for 12 weeks and can be added to their resume separately, demonstrating their skills and confidence.

Support for Learning

  • Instructor. The instructor will provide timely feedback on your assignments so that you can use it to improve your grades on the next assignments. The instructor will also provide advice on data analytics careers in general and will provide in-depth coding documents (Rmd files) that can expand your learning. The instructor will also offer office hours via Zoom for those who prefer personal interaction.
  • 24/7 Support by Mentors. Our knowledgeable mentors will guide your learning and are focused on answering your questions and motivating you to stay on track during the 12-week period. To be successful in a certificate program, it is very important to stay on track. You can contact the mentors at any time with any quick questions you may have so that you are not slowed down in your progress. These mentors at the Center for Customer Insights and Digital Marketing are employees who are knowledgeable on the topics. Thus, they will provide the highest level of assistance under the close supervision of the instructors.
  • Learning Community. Students are encouraged to post questions and answer questions among themselves in our password-protected Canvas, which is a leading Learning Management System, to create a safe online learning environment. Mentors and the instructors will also regularly check in to give support to the learning community.

Who Should Enroll in This Certificate Program?

  • Students who want to take various data science programs (e.g., MS in Digital Marketing, MS in Business Analytics, etc.) and various statistics courses at undergraduate as well as graduate levels.
  • Company employees who want to add a data science tool beyond MS Excel to their tool kit.
  • Anyone who wants to have a career in data science and business analytics.
  • Anyone from around the world who wants to learn R programming is welcome to apply.
  • Both novices and experienced users are welcome! There are optional capstone assignments that experienced R users may want to tackle. Likewise, there are optional Principle Assignments that new users of R may want to work on to ensure they have a strong foundation.

Program Offering Timeline

Following is the planned schedule for the next offering:

  • To Be Announced

Testimonials

Each time we offer the program, we conduct a course evaluation to gather participants’ feedback on the course and instructor. The following are selected testimonials from participants who completed the anonymous course evaluation.

  • “I enjoyed the step-by-step workshop video followed by an application assignment that pushed me to use the tools I learned to apply to code of my own. It was well-packaged and helpful to do alongside work and school.”
  • “I really appreciate how organized this course was. With being a full time participant with many obligations, this course was very straightforward and clear to take. I really liked how the tools and resources were provided for us to use, and I was able to figure out my own problems when programming. I would recommend this program to others!”
  • “I like how thorough the instructions and notes are throughout the Canvas course. I also like the videos since they are easy to follow. The feedback on the homework along with having the key are important to me to help improve my coding skills. I haven't attended office hours but I like the choice between two days in the evening. Whenever I have sent an email, I always receive a reply and/or feedback.”

About the Instructors

Dr. Jae Jung is a Professor of Marketing and the director of the Center for Customer Insights and Digital Marketing (CCIDM) at Cal Poly Pomona. He is also the director of the MS in Digital Marketing program. He received a Ph.D. in Marketing from the University of Cincinnati and an MBA degree with a concentration in Business Statistics from the University of North Texas. Currently, Dr. Jung is interested in applying econometrics and data science methods to consumer behavior, and is working on several projects in the area of social media and digital marketing dealing with firm-level data, national-level data, and individual social media data. His research has been published in journals such as European Journal of Marketing, International Marketing Review, Journal of Business Research, Journal of Cross-Cultural Psychology, Marketing Letters, and Psychology & Marketing. Dr. Jung has taught various courses including Marketing Research, Data Mining for Marketing Decisions, and Marketing Analytics, often incorporating real-world proprietary customer data of companies. Dr. Jung orchestrated the design and production of a series of R workshops that attracted hundreds of participants from both the campus and the business community. This experience has led to the offering of DWV 100, which is a prerequisite for MS in Digital Marketing students if they lack coding experience. His research and teaching efforts have earned him recognition, including the prestigious Jagdish N. Sheth Research Award, Wall of COOL, Provost Teacher-Scholar Program Award, and Faculty of the Year Award.

Dr. Carsten Lange is a Professor of Economics. Dr. Lange received his Ph.D. in Economics from the University of Hannover, Germany. Dr. Lange specializes in money supply, inflation, central bank policy, and economic impact analysis. Dr. Lange developed expertise in machine learning and AI, including neural networks and deep learning, serving as a member of Cal Poly’s High Performance Computing Cluster Working Group and engaging with the campus community on these topics. Most recently, he used his expertise in analytics and computer technology to create a database that tracks the estimated number of active COVID-19 cases in the most populous counties in all continental states of the U.S. Collaborations include applying machine learning to analyze electronic properties of crystals, predicting Major League Baseball pitcher salaries with artificial intelligence, and using GIS to predict urban development in North Carolina. Dr. Lange’s research has been published in books and peer-reviewed journals such as the Journal of Risk Finance, International Journal of Monetary Economics and Finance, Quarterly Review of Economics and Finance, and Journal of Emerging Markets. Dr. Lange has taught a number of subjects at both graduate and undergraduate levels, including Economic Statistics, Mathematical Economics, Spatial Statistics and Analyses, Neural Networks, and Machine Learning. His teaching and research efforts have earned him numerous awards and honors, including the Innovative Approaches to Instruction Award, Provost Teacher-Scholar Program Award, and Golden Leave Award.

 

Registration

Register Here

Request Information