{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "87ede4f5",
   "metadata": {},
   "source": [
    "\n",
    "<a id='data-statistical-packages'></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c509628b",
   "metadata": {},
   "source": [
    "# General, Data, and Statistics Packages"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0a4ecaf",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "- [General, Data, and Statistics Packages](#General,-Data,-and-Statistics-Packages)  \n",
    "  - [Overview](#Overview)  \n",
    "  - [DataFrames](#DataFrames)  \n",
    "  - [Statistics and Econometrics](#Statistics-and-Econometrics)  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ecd560f",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "This lecture explores some of the key packages for working with data and doing statistics in Julia.\n",
    "\n",
    "In particular, we will examine the `DataFrame` object in detail (i.e., construction, manipulation, querying, visualization, and nuances like missing data).\n",
    "\n",
    "While Julia is not an ideal language for pure cookie-cutter statistical analysis, it has many useful packages to provide those tools as part of a more general solution.\n",
    "\n",
    "This list is not exhaustive, and others can be found in organizations such as [JuliaStats](https://github.com/JuliaStats), [JuliaData](https://github.com/JuliaData/), and  [QueryVerse](https://github.com/queryverse)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5938e655",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "using LinearAlgebra, Statistics, DataFrames"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "301ed2d3",
   "metadata": {},
   "source": [
    "## DataFrames\n",
    "\n",
    "A useful package for working with data is [DataFrames.jl](https://github.com/JuliaStats/DataFrames.jl).\n",
    "\n",
    "The most important data type provided is a `DataFrame`, a two dimensional array for storing heterogeneous data.\n",
    "\n",
    "Although data can be heterogeneous within a `DataFrame`, the contents of the columns must be homogeneous\n",
    "(of the same type).\n",
    "\n",
    "This is analogous to a `data.frame` in R, a `DataFrame` in Pandas (Python) or, more loosely, a spreadsheet in Excel.\n",
    "\n",
    "There are a few different ways to create a DataFrame."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22de2164",
   "metadata": {},
   "source": [
    "### Constructing and Accessing a DataFrame\n",
    "\n",
    "The first is to set up columns and construct a dataframe by assigning names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ad55c11",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "using DataFrames\n",
    "\n",
    "# note use of missing\n",
    "commodities = [\"crude\", \"gas\", \"gold\", \"silver\"]\n",
    "last_price = [4.2, 11.3, 12.1, missing]\n",
    "df = DataFrame(commod = commodities, price = last_price)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b55d1115",
   "metadata": {},
   "source": [
    "Columns of the `DataFrame` can be accessed by name using `df.col`, as below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23321c4d",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df.price"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a46b869",
   "metadata": {},
   "source": [
    "Note that the type of this array has values `Union{Missing, Float64}` since it was created with a `missing` value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59bd7586",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df.commod"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a30bdf75",
   "metadata": {},
   "source": [
    "The `DataFrames.jl` package provides a number of methods for acting on `DataFrame`’s, such as `describe`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc6075cf",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "DataFrames.describe(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3ecc0fb",
   "metadata": {},
   "source": [
    "While often data will be generated all at once, or read from a file, you can add to a `DataFrame` by providing the key parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "912d630a",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "nt = (commod = \"nickel\", price = 5.1)\n",
    "push!(df, nt)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c311755",
   "metadata": {},
   "source": [
    "Named tuples can also be used to construct a `DataFrame`, and have it properly deduce all types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d2643d4",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "nt = (t = 1, col1 = 3.0)\n",
    "df2 = DataFrame([nt])\n",
    "push!(df2, (t = 2, col1 = 4.0))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b86043e3",
   "metadata": {},
   "source": [
    "In order to modify a column, access the mutating version by the symbol `df[!, :col]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50a5d897",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df[!, :price]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7302815",
   "metadata": {},
   "source": [
    "Which allows modifications, like other mutating `!` functions in julia."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29031053",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df[!, :price] *= 2.0  # double prices"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24b19ab5",
   "metadata": {},
   "source": [
    "As discussed in the next section, note that the [fundamental types](https://julia.quantecon.org/../getting_started_julia/fundamental_types.html#missing), is propagated, i.e. `missing * 2 === missing`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "323e5103",
   "metadata": {},
   "source": [
    "### Working with Missing\n",
    "\n",
    "As we discussed in [fundamental types](https://julia.quantecon.org/../getting_started_julia/fundamental_types.html#missing), the semantics of `missing` are that mathematical operations will not silently ignore it.\n",
    "\n",
    "In order to allow `missing` in a column, you can create/load the `DataFrame`\n",
    "from a source with `missing`’s, or call `allowmissing!` on a column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66d65eb1",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "allowmissing!(df2, :col1) # necessary to add in a for col1\n",
    "push!(df2, (t = 3, col1 = missing))\n",
    "push!(df2, (t = 4, col1 = 5.1))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27f3896d",
   "metadata": {},
   "source": [
    "We can see the propagation of `missing` to caller functions, as well as a way to efficiently calculate with non-missing data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aac53ff6",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "@show mean(df2.col1)\n",
    "@show mean(skipmissing(df2.col1))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9a0c3ad",
   "metadata": {},
   "source": [
    "And to replace the `missing`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f34b556",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df2.col1 .= coalesce.(df2.col1, 0.0) # replace all missing with 0.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "315884db",
   "metadata": {},
   "source": [
    "### Manipulating and Transforming DataFrames\n",
    "\n",
    "One way to do an additional calculation with a `DataFrame` is to use the `@transform` macro from `DataFramesMeta.jl`.\n",
    "\n",
    "The following are code only blocks, which would require installation of the packages in a separate environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "577efab0",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "using DataFramesMeta\n",
    "f(x) = x^2\n",
    "df2 = @transform(df2, :col2=f.(:col1))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c4394a2",
   "metadata": {},
   "source": [
    "### Categorical Data\n",
    "\n",
    "For data that is [categorical](https://juliadata.github.io/DataFrames.jl/stable/man/categorical/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad5ae34e",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "using CategoricalArrays\n",
    "id = [1, 2, 3, 4]\n",
    "y = [\"old\", \"young\", \"young\", \"old\"]\n",
    "y = CategoricalArray(y)\n",
    "df = DataFrame(id = id, y = y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d8c6ffa",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "levels(df.y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e6b45f1",
   "metadata": {},
   "source": [
    "### Visualization, Querying, and Plots\n",
    "\n",
    "The `DataFrame` (and similar types that fulfill a standard generic interface) can fit into a variety of packages.\n",
    "\n",
    "One set of them is the [QueryVerse](https://github.com/queryverse).\n",
    "\n",
    "**Note:** The QueryVerse, in the same spirit as R’s tidyverse, makes heavy use of the pipeline syntax `|>`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a461719",
   "metadata": {},
   "source": [
    "## Statistics and Econometrics\n",
    "\n",
    "While Julia is not intended as a replacement for R, Stata, and similar specialty languages, it has a growing number of packages aimed at statistics and econometrics.\n",
    "\n",
    "Many of the packages live in the [JuliaStats organization](https://github.com/JuliaStats/).\n",
    "\n",
    "A few to point out\n",
    "\n",
    "- [StatsBase](https://github.com/JuliaStats/StatsBase.jl) has basic statistical functions such as geometric and harmonic means, auto-correlations, robust statistics, etc.  \n",
    "- [StatsFuns](https://github.com/JuliaStats/StatsFuns.jl) has a variety of mathematical functions and constants such as pdf and cdf of many distributions, softmax, etc.  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8338398b",
   "metadata": {},
   "source": [
    "### General Linear Models\n",
    "\n",
    "To run linear regressions and similar statistics, use the [GLM](http://juliastats.github.io/GLM.jl/latest/) package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66d21288",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "using GLM\n",
    "\n",
    "x = randn(100)\n",
    "y = 0.9 .* x + 0.5 * rand(100)\n",
    "df = DataFrame(x = x, y = y)\n",
    "ols = lm(@formula(y~x), df) # R-style notation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d54bf3bd",
   "metadata": {},
   "source": [
    "To display the results in a useful tables for LaTeX and the REPL, use\n",
    "[RegressionTables](https://github.com/jmboehm/RegressionTables.jl/) for output\n",
    "similar to the Stata package esttab and the R package stargazer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c5a892e",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "using RegressionTables\n",
    "regtable(ols)\n",
    "# regtable(ols,  renderSettings = latexOutput()) # for LaTex output"
   ]
  }
 ],
 "metadata": {
  "date": 1764120400.1029074,
  "filename": "data_statistical_packages.md",
  "kernelspec": {
   "display_name": "Julia",
   "language": "julia",
   "name": "julia-1.12"
  },
  "title": "General, Data, and Statistics Packages"
 },
 "nbformat": 4,
 "nbformat_minor": 5
}