Exploratory Data Analysis and Statistical Inference in R
It is a truth universally acknowledged, that a Srishti Art and Design student, in possession of a good Mac, must be terrified of coding.
-Code and Prejudice, Jane Austen, 1813
1 Abstract
This Course takes Business Practitioners and Art-Design Peasants on a journey of Exploratory Data Analysis (EDA) and Statistics: using data to derive insights, make predictions, and decide on plans of action that can be communicated and actualized in a Design and Business context.
“EDA is the process of examining the data at hand, to find patterns in it, and to obtain insights. It is not a rigid process, but a fluid, iterative one, wherein one chases down alleyways and cul-de-sacs of inquiry, and sometimes comes back empty-handed. But in the end, one has a thorough understanding of the data structure and its variables: their values, their measures and variations, and their possible relationships. One also develops hypotheses for modelling and statistical investigation, and plans for applying machine learning procedures to the data to enable decision making.”
The Course starts with Descriptive Analytics / EDA: Datasets from various domains of business, design, and scientific activity are introduced. The datasets are motivated from the point of view of the types of information they contain: students will relate the Data Variables (Qualitative and Quantitative) to various types of Data/Information Visualizations.
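As a minimal sketch of how variable types suggest chart types, the chunk below uses only base R and the built-in `mtcars` dataset. Both are assumptions made so that the code runs without installing anything; the datasets and plotting tools used in the course itself may differ.

```r
# Built-in dataset: 32 cars, 11 variables (ships with base R)
cars <- mtcars

# Treat the number of cylinders as a Qualitative (categorical) variable
cars$cyl <- factor(cars$cyl)

# One Qualitative variable -> counts per category -> bar chart
cyl_counts <- table(cars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars")

# One Quantitative variable -> distribution -> histogram
hist(cars$mpg, xlab = "Miles per gallon", main = "Distribution of mpg")

# Quantitative vs Qualitative -> one box plot per category
boxplot(mpg ~ cyl, data = cars,
        xlab = "Cylinders", ylab = "Miles per gallon")
```

The pattern to notice is that the question "what kind of variable is this?" comes before the question "which chart should I draw?".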
Statistical Concepts such as Sampling, Hypothesis Tests, Simulation / Modelling, and Uncertainty will be introduced. Students will be motivated to set up their own Hypotheses, which they will answer by analyzing data that they gather using a random sampling experiment.
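One way to see how Sampling, Simulation, and Hypothesis Tests fit together is a permutation test. The sketch below is illustrative only: it uses the built-in `mtcars` data as a stand-in for data that students would gather themselves, and the hypothesis (manual and automatic cars differ in mean fuel economy) is an assumption chosen for the example.

```r
set.seed(42)  # make the random shuffles reproducible

# am: 0 = automatic, 1 = manual (mtcars ships with base R)
mpg <- mtcars$mpg
am  <- mtcars$am

# Observed difference in group means (manual minus automatic)
observed_diff <- mean(mpg[am == 1]) - mean(mpg[am == 0])

# Simulate the null hypothesis "transmission does not matter":
# shuffle the labels and recompute the difference, many times
n_shuffles <- 5000
null_diffs <- replicate(n_shuffles, {
  shuffled <- sample(am)
  mean(mpg[shuffled == 1]) - mean(mpg[shuffled == 0])
})

# Two-sided p-value: the share of shuffles at least as extreme as observed
p_value <- mean(abs(null_diffs) >= abs(observed_diff))

hist(null_diffs, xlab = "Difference in mean mpg under the null",
     main = "Permutation distribution")
abline(v = observed_diff, lty = 2)  # mark the observed difference
```

If the dashed line sits far out in the tail of the histogram, the observed difference is hard to explain by chance alone.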
Predictive Analytics will take us into looking at Data and training standard ML algorithms to make predictions with new Data. Regression, Clustering, and Classification will be covered.
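A rough sketch of two of these ideas, using only base R: a regression model trained on some rows and evaluated on held-out rows, and a clustering of the built-in `iris` data. The datasets, the 8-row holdout, and the choice of three clusters are all assumptions made for illustration, not the course's prescribed workflow.

```r
set.seed(1)  # reproducible split and cluster starts

# --- Regression: predict mpg from weight for cars the model has not seen ---
test_rows <- sample(nrow(mtcars), 8)
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

fit <- lm(mpg ~ wt, data = train)          # train on 24 cars
predicted <- predict(fit, newdata = test)  # predict for the other 8
rmse <- sqrt(mean((test$mpg - predicted)^2))  # typical prediction error, in mpg

# --- Clustering: group flowers without looking at the species labels ---
measurements <- scale(iris[, 1:4])  # put all four measurements on one scale
clusters <- kmeans(measurements, centers = 3, nstart = 20)

# Afterwards, compare the discovered clusters with the known species
comparison <- table(cluster = clusters$cluster, species = iris$Species)
print(comparison)
```

Holding back some rows before fitting is what lets the error estimate speak to new data rather than to data the model has already seen.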
Prescriptive Analytics will deal with coming to terms with the uncertainty in Predictions, and using tools such as ML, Linear/non-Linear Programming, and Decision-Making to make Business Decisions, with an assessment of the Risks involved.
The Course will include a full Exploratory Data Analysis Workflow using RStudio and Quarto, covering Data Gathering and Cleaning, Descriptive and Predictive Analytics, Prescriptive Analytics and Decision Making, and Communication, resulting in publication-worthy documents (HTML / PDF / Word) and/or a personal website.
A diagram from a very popular textbook by Ismay and Kim may be relevant here:
2 What you will learn
- Data Basics: What does data look like and why should we care?
- Understand the R language, and appreciate how close it is to plain English, for the most part
- Rapidly and intuitively create Graphs and Data Visualizations using geometric metaphors to explore data for insights
- Use Statistical Tests, Procedures, Models, and Simulations to answer Business and Design Questions
- Create crisp and readable Reports with Quarto that can be shared in a Design and Business Context
- Use ML algorithms such as Regression, Classification, and Clustering to develop Business Insights
- Use Linear Programming to make Business Decisions
- Use Git and GitHub to manage your code and website versions, and to share your work with others
- Create a Personal Website to showcase your work
- Use AI tools in RStudio where appropriate for code and text completion, and for code development using prompts in English (Hindi, Kannada, Tamil, Gujarati, Tulu…)
3 Our Tools
R https://cran.r-project.org/ and RStudio https://posit.co/
R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. RStudio is an integrated development environment (IDE) for R and Python.
Quarto https://quarto.org
Quarto is an open-source scientific and technical publishing system built on Pandoc, and integrated into RStudio. It extends the functionality of R Markdown, Jupyter Notebooks, and other markdown formats to provide a unified authoring framework for data science and scientific communication. Quarto supports multiple programming languages, including R, Python, Julia, and Observable JavaScript.
(Very Foreseeable Future): Positron IDE https://positron.posit.co/
Positron IDE is a new IDE from Posit, in public beta as of June 2024. Positron unifies exploration and production work in one free, AI-powered environment, empowering the full spectrum of data science in Python and R.
(In a No-Code version of this course): Orange Data Mining https://orangedatamining.com/
Orange is FOSS visual point-and-click software for Data Mining and ML, developed at the University of Ljubljana, Slovenia. It can be thought of as Photoshop for Data Science. It is component-based software, with a large toolbox of components (called widgets) for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques. It includes a wide range of standard ML algorithms, and can also be extended using Python scripts. Orange is used for both research and education at many universities worldwide.
Radiant – Business analytics using R and Shiny https://radiant-rstats.github.io/docs/index.html
Radiant is a FOSS, platform-independent, browser-based interface for business analytics in R, developed at the University of California, San Diego. The application is based on the Shiny package and can be run using R, or in your browser with no installation required. The tool automatically installs a version of R and adds a Shiny-based GUI that removes the need to write R code. Radiant can also be installed on top of an existing installation of R and invoked from within RStudio.
4 References
Visualization and R language
- Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. R for Data Science (2e). https://r4ds.hadley.nz. The most important reference for data visualization and analysis in R. Available free online.
- Robert Kabacoff. Modern Data Visualization with R. https://rkabacoff.github.io/datavis/. Available free online.
- Jack Dougherty and Ilya Ilyankou, Hands-On Data Visualization: Interactive Storytelling from Spreadsheets to Code, https://handsondataviz.org/. Available free online.
- Claus O. Wilke, Fundamentals of Data Visualization, https://clauswilke.com/dataviz/. Available free online.
- Jonathan Schwabish, Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks, Columbia University Press, 2021.
- Alberto Cairo, The Functional Art: An introduction to information graphics and visualization, New Riders. 2013. ISBN-9780133041361.
- Cole Nussbaumer Knaflic, Storytelling With Data: A Data Visualization Guide for Business Professionals, Wiley 2015. ISBN-9781119002253.
- Reza Mohammadi (2025). Data Science Foundations and Machine Learning with R: From Data to Decisions. https://book-data-science-r.netlify.app/
Analytics
- Judd, C.M., McClelland, G.H., & Ryan, C.S. (2017). Data Analysis: A Model Comparison Approach To Regression, ANOVA, and Beyond (3rd ed.). Routledge. https://doi.org/10.4324/9781315744131
- Thomas Maydon, The 4 Types of Data Analytics. https://www.kdnuggets.com/2017/07/4-types-data-analytics.html
- Keith McNulty. Handbook of Regression Modeling in People Analytics: With Examples in R, Python and Julia https://peopleanalytics-regression-book.org. Available free online.
- James R Evans, Business Analytics: Methods, Models, and Decisions, Pearson Education, 2021.
- Dimitris Bertsimas, Robert Freund, Data, Models, and Decisions: the Fundamentals of Management Science, Dynamic Ideas Press, 2004.
- Cliff T. Ragsdale, Spreadsheet Modeling & Decision Analysis: A Practical Introduction to Management Science, South Western, Cengage Learning, Mason, OH, 2012.
Statistics
- Mine Cetinkaya-Rundel, Johanna Hardin. Introduction to Modern Statistics. https://openintro-ims2.netlify.app. Available free online.
- Daniel T. Kaplan. Statistical Models (second edition). https://dtkaplan.github.io/SM2-bookdown/. Available free online.
- Daniel T. Kaplan, Compact Introduction to Classical Inference, 2020. https://dtkaplan.github.io/CompactInference/. Available free online.
- Daniel T. Kaplan and Frank Shaw, Statistical Modeling: Computational Technique. https://www.mosaic-web.org/go/SM2-technique/. Available free online.
- Jonas Kristoffer Lindeløv. Common statistical tests are linear models (or: how to teach stats). https://lindeloev.github.io/tests-as-linear/. Available free online.
5 Pedagogical Notes
Why this course is what it is, and why it does what it does! Read this only if you are planning to be an educator yourself, or if you want to understand why your Teacher is…like this!!
PRIMM
The method followed will be based on PRIMM:
- PREDICT: Inspect the code and guess at what the code might do, write predictions
- RUN: the code provided and check what happens
- INFER: what the parameters of the code do and write comments to explain. What bells and whistles can you see?
- MODIFY: the parameters of the code provided to understand the options available. Write comments to show what you have aimed for and achieved.
- MAKE: take an idea/concept of your own, and graph it.
So in this course, wherever you see “YOUR TURN”, please respond with questions about the data, explanations, more questions, and, if you are already confident, code chunks to create new calculations and graphs.
And why teach R in this way?
Because we all know two things:
- Pretty decent English
- \(y = mx + c\)
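To make the point concrete, here is a small sketch of how an R model formula can be read aloud as English and then unpacked into \(y = mx + c\). The dataset (`mtcars`) and the variable names `m_slope`, `c_intercept`, and `y_hat` are assumptions chosen for illustration.

```r
# Read aloud: "fit a linear model of mpg as a function of wt,
# using the data in mtcars"
line_fit <- lm(mpg ~ wt, data = mtcars)

# The two numbers that define the straight line y = mx + c
c_intercept <- coef(line_fit)[["(Intercept)"]]  # the "c"
m_slope     <- coef(line_fit)[["wt"]]           # the "m"

# Use the line by hand: predicted mpg for a hypothetical car with wt = 3
# (wt is recorded in units of 1000 lb)
y_hat <- m_slope * 3 + c_intercept

# Draw the data and the fitted line
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lb)", ylab = "Miles per gallon")
abline(line_fit)
```

The hand-computed `y_hat` matches what `predict(line_fit, data.frame(wt = 3))` returns, which is a useful check that the model really is just a straight line.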
Let us hear from Amelia McNamara:
6 Learning R with AI
Of course.
7 Business Analytics Courses elsewhere
- University of California, San Diego, Rady School of Management. http://lab.rady.ucsd.edu/sawtooth/business_analytics_in_r/index.html
8 Modules
Title | Date |
---|---|
| Dec 31, 2022 |
| Dec 31, 2022 |
| Nov 30, 2022 |
| Apr 13, 2023 |
| Feb 9, 2022 |
| Aug 1, 2025 |
| Sep 8, 2025 |
| Sep 2, 2024 |
| Jul 21, 2025 |