DATA ANALYSIS WITH VECTOR FUNCTIONAL PROGRAMMING

A tour of the Q programming language


HISTORY OF VECTOR LANGUAGES

➤ Vectors (arrays), not scalars, are the principal data type

➤ Not a new idea (APL, 1965)

➤ Ok… maybe new compared to functional programming (λ-calculus, 1930s)

➤ Ken Iverson’s Iverson Notation

➤ Notation as a tool of thought

➤ Notation for people first, computers later

➤ Influenced: Mathematica, Matlab, R, Julia

➤ Descendants: I.N. → APL, J, A+, K, Q

Q PRIMER

The basic concepts

FUNCTION APPLICATION

➤ Monadic functions have a word name and take their argument on the right

q)abs -1
1
q)til 10
0 1 2 3 4 5 6 7 8 9

➤ Dyadic verbs appear between the arguments

q)1+2
3
q)9 mod 3
0

➤ Function application is a verb

q)abs@-1
1
q)(-). 1 2
-1

ATOMIC FUNCTIONS

➤ Primitive functions (and verbs) are atomic (apply to atoms)

➤ Evaluation is always right-to-left

➤ Typically read top-down (left-to-right)

q)5*10+til 5
50 55 60 65 70
q)-1*0 1 2 3 4
0 -1 -2 -3 -4
q)5*(1;2 3;(4;5 6);7 8;9)
(5;10 15;(20;25 30);35 40;45)
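For readers coming from other languages, the atomic behaviour shown above can be sketched in Python. This is only an analogue, not how q implements it: the helper name `atomic` is made up for this illustration, and plain Python lists stand in for q lists.

```python
def atomic(f):
    """Lift a two-argument scalar function so it recurses into nested lists.

    Hypothetical helper for illustration; q's primitives do this natively.
    """
    def lifted(x, y):
        if isinstance(x, list) and isinstance(y, list):
            if len(x) != len(y):
                raise ValueError("length")  # q signals 'length here
            return [lifted(a, b) for a, b in zip(x, y)]
        if isinstance(x, list):
            return [lifted(a, y) for a in x]
        if isinstance(y, list):
            return [lifted(x, b) for b in y]
        return f(x, y)  # both arguments are atoms
    return lifted

times = atomic(lambda a, b: a * b)
plus = atomic(lambda a, b: a + b)

# 5*10+til 5, evaluated right to left: til 5, then 10+, then 5*
print(times(5, plus(10, list(range(5)))))
# 5*(1;2 3;(4;5 6);7 8;9)
print(times(5, [1, [2, 3], [4, [5, 6]], [7, 8], 9]))
```

The nested call order mirrors the right-to-left evaluation rule: the innermost (rightmost) expression runs first.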

LIST VERBS

➤ List primitives (we have them too, just use fewer characters):

take (#)

q)2#til 10
0 1
q)-2#til 10
8 9

join (,)

q)(til 4),til 4
0 1 2 3 0 1 2 3

split (_)

q)0 3 6_til 9
0 1 2
3 4 5
6 7 8

MAPPING A LIST - FP 101

q)0 3 6_til 9
0 1 2
3 4 5
6 7 8
q)count each 0 3 6_til 9
3 3 3

➤ If dyadic, combine with an adverb (a pairing operator)

➤ e.g. take (#) + each-both (') = take-each-both (#')

q)3#0
0 0 0
q)3 3 3#'0 1 2
0 0 0
1 1 1
2 2 2
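The list verbs and the each-both pairing above can be approximated in Python as follows. The function names `take`, `cut` and `each_both` are invented for this sketch, and `take` only covers the cases shown on the slides (it assumes a negative count is no larger than the list).

```python
from itertools import cycle, islice

def take(n, x):
    """Rough analogue of q's take (#); an atom is treated as a one-item list."""
    items = x if isinstance(x, list) else [x]
    if n >= 0:
        return list(islice(cycle(items), n))  # repeats items if n > len
    return items[n:]                          # last |n| items (simplified)

def cut(indices, x):
    """Rough analogue of q's split (_) with a list of start indices."""
    ends = indices[1:] + [len(x)]
    return [x[i:j] for i, j in zip(indices, ends)]

def each_both(f, xs, ys):
    """Apply a dyadic function to corresponding items of two lists."""
    return [f(a, b) for a, b in zip(xs, ys)]

rows = cut([0, 3, 6], list(range(9)))
print(rows)                                     # the three sublists
print([len(r) for r in rows])                   # count each
print(each_both(take, [3, 3, 3], [0, 1, 2]))    # 3 3 3#'0 1 2
```

In q the adverb modifies the verb itself (`#'`), whereas here the pairing is an ordinary higher-order function taking `take` as an argument.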

But Wait! There’s More!

ADVERBS

noun   verb  adverb  noun
3 3 3  #     '       0 1 2

FOLD AND SCAN ARE ADVERBS … MORE FP 101

➤ Fold (/) is an adverb, we call it over

q)0+/til 5
10

A plus reduction over 0 1 2 3 4

➤ Scan (\) returns the incremental values of over (left-to-right)

q)0+\til 5
0 1 3 6 10

Partial sums of 0 1 2 3 4
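Over and scan correspond closely to `functools.reduce` and `itertools.accumulate` in Python's standard library. A minimal sketch of the two slide examples, offered as an analogue only:

```python
from functools import reduce
from itertools import accumulate
from operator import add

xs = list(range(5))                  # til 5

total = reduce(add, xs, 0)           # 0+/til 5: fold with a seed of 0
partial = list(accumulate(xs, add))  # running totals, matching 0+\til 5

print(total)
print(partial)
```

One detail differs: `accumulate` is called without an initial value here, because with `initial=0` Python would also emit the seed itself as the first element, which q's scan does not.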

FLEXIBLE MAPPING WITH ADVERBS

➤ Only 6 adverbs, but they come up all the time

each-left (\:)

q)(floor;ceiling)@\:5.5
5 6

each-right (/:)

q)max@/:0 3 6_til 9
2 5 8

each-prior (':)

q)0-':til 5
0 1 1 1 1

compose: each-left-each-right (\:/:)

q)(min;max)@\:/:0 3 6_til 9
0 2
3 5
6 8
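The three mapping adverbs above can be mimicked with small higher-order functions in Python. All names here (`each_left`, `each_right`, `each_prior`, `apply`) are hypothetical helpers for this sketch, not part of any library.

```python
import math
from operator import sub

def apply(f, x):
    """Function application as a dyadic function, like q's @."""
    return f(x)

def each_left(f, xs, y):
    """f applied between each item of xs and the whole of y."""
    return [f(x, y) for x in xs]

def each_right(f, x, ys):
    """f applied between the whole of x and each item of ys."""
    return [f(x, y) for y in ys]

def each_prior(f, xs, first):
    """f applied between each item and its predecessor; `first` seeds item 0."""
    previous = [first] + xs[:-1]
    return [f(cur, prev) for cur, prev in zip(xs, previous)]

rows = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

print(each_left(apply, [math.floor, math.ceil], 5.5))   # (floor;ceiling)@\:5.5
print(each_right(apply, max, rows))                     # max@/:rows
print(each_prior(sub, list(range(5)), 0))               # 0-':til 5
# (min;max)@\:/:rows: each-left inside each-right
print(each_right(lambda fs, r: each_left(apply, fs, r), [min, max], rows))
```

The last line shows why composition is compact in q: two adverbs stack on one verb, where Python needs a nested lambda.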

THINKING IN ARRAYS

Prime Numbers

THINKING IN ARRAYS - NO STINKING LOOPS*

function isPrime(n) {
  if (n < 2) return false;
  var q = Math.floor(Math.sqrt(n));
  for (var i = 2; i <= q; i++) {
    if (n % i == 0) { return false; }
  }
  return true;
}

* Steve Apter, nsl.com

THINKING IN ARRAYS

[Figures, in order: "x mod y" over 1 .. 100; "x mod y = 0"; "y = 1" and "y = x"; "primes"]

THE RESULT

➤ Extremely concise, 111 bytes

➤ 29 characters left for emojis when tweeting it!

p:{n where 2=sum 0=n mod/:n:1+til x}
rle:{(count;first)@\:/:(where not =':[x])_x}
expand:{(),/(#).'x}

Only short programs have any hope of being correct
~ Arthur Whitney
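To make the three one-liners easier to follow, here is a Python sketch of the same ideas. It is an analogue under stated assumptions, not a translation of q's evaluation: `p` counts divisors for every number from 1 to x and keeps those with exactly two, while `rle` uses `itertools.groupby` in place of q's each-prior comparison.

```python
from itertools import groupby

def p(x):
    """Primes up to x: numbers in 1..x with exactly two divisors in 1..x."""
    n = list(range(1, x + 1))
    divisor_counts = [sum(1 for y in n if v % y == 0) for v in n]
    return [v for v, c in zip(n, divisor_counts) if c == 2]

def rle(xs):
    """Run-length encode into [count, value] pairs."""
    return [[len(list(run)), value] for value, run in groupby(xs)]

def expand(pairs):
    """Inverse of rle: repeat each value count times and join the results."""
    return [value for count, value in pairs for _ in range(count)]

print(p(20))
encoded = rle(list("aaabcc"))
print(encoded)
print(expand(encoded))
```

Like the q version, `p` does quadratic work over the whole range instead of looping with an early exit; the trade is more arithmetic for no explicit control flow.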


HOW CAN WE USE Q FOR DATA ANALYSIS?

➤ Q has dictionaries (associations) and tables (flipped dictionaries)

➤ Tables are first-class and columnar, operations on columns are fast and efficient

➤ It is actually the scripting language for kdb+

➤ Has an integrated SQL-like query language called q-sql

select avg price by sym from trades where date>.z.d-5

➤ Has really nice temporal types, temporal arithmetic, and temporal joins
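For comparison, the q-sql query above (average price by symbol over the last five days) might look like this in plain Python. The row layout, the sample trades and the `today` parameter are all made up for this sketch; in q, `.z.d` supplies the current date.

```python
from collections import defaultdict
from datetime import date, timedelta

def avg_price_by_sym(trades, today):
    """Filter to dates after today-5 days, group by sym, average the price."""
    cutoff = today - timedelta(days=5)
    buckets = defaultdict(list)
    for trade in trades:
        if trade["date"] > cutoff:
            buckets[trade["sym"]].append(trade["price"])
    return {sym: sum(prices) / len(prices) for sym, prices in buckets.items()}

# Hypothetical sample rows, not from the talk
trades = [
    {"sym": "AAA", "date": date(2024, 1, 9), "price": 10.0},
    {"sym": "AAA", "date": date(2024, 1, 8), "price": 12.0},
    {"sym": "BBB", "date": date(2024, 1, 7), "price": 5.0},
    {"sym": "BBB", "date": date(2024, 1, 1), "price": 99.0},  # too old, filtered out
]
print(avg_price_by_sym(trades, today=date(2024, 1, 10)))
```

The q version works column by column on the table, whereas this sketch walks row by row, which is the main reason the columnar form can be so much shorter and faster.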

Q FOR DATA ANALYSIS

STEP 1. GET SOME DATA

// System commands start with \
\wget .../pantheon.tsv
\wget .../pageviews_2008-2013.tsv -O pageviews.tsv

// ETL in Q
people:("iSiSSSSSffsissssiffiiff";enlist"\t")0:`:pantheon.tsv;
pageviews:("iSSiSisssss",72#"i";enlist"\t")0:`:pageviews.tsv;

Callouts on the load expression: column types, tab separated, file name

Monthly page visit information for people on Wikipedia

Each month is a single column

We have a short fat table, want a long skinny table…

STEP 2. CLEAN THE DATA!

// All of the months
months:"M"$ssr[;"-";"."] each string 11_cols pageviews;

// Create a new table of the months flattened
monthly:ungroup 2!([]id:pageviews`id;lang:pageviews`lang;month:(count pageviews)#enlist months;clicks:flip pageviews c:11_cols pageviews)

// Left-join click information with person information
clickinfo:monthly lj `id`lang xkey people;

id  name            occupation lang
-----------------------------------
307 Abraham Lincoln POLITICIAN af
307 Abraham Lincoln POLITICIAN am
307 Abraham Lincoln POLITICIAN an
307 Abraham Lincoln POLITICIAN ang
307 Abraham Lincoln POLITICIAN ar
307 Abraham Lincoln POLITICIAN arz
…

id  lang month   clicks
-----------------------
307 af   2008.01 4
307 af   2008.02 5
307 af   2008.03 0
307 af   2008.04 5
307 af   2008.05 5
307 af   2008.06 1
…

Callouts: month values; long skinny table, 4 columns; left join
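The reshaping in this step, one column per month turned into one row per month and then joined to the people table on id and lang, can be sketched in Python without any libraries. The helper names `melt` and `left_join` are invented here, and the sample rows are illustrative: the "af" clicks follow the slide, the "am" clicks are made up.

```python
def melt(rows, key_cols):
    """Turn a short fat table (one column per month) into a long skinny one."""
    out = []
    for row in rows:
        keys = {k: row[k] for k in key_cols}
        for col, value in row.items():
            if col not in key_cols:
                # "2008-01" becomes "2008.01", as the ssr call does in q
                out.append({**keys, "month": col.replace("-", "."), "clicks": value})
    return out

def left_join(rows, lookup, key_cols):
    """Attach lookup columns to each row; unmatched rows are left unchanged."""
    return [{**r, **lookup.get(tuple(r[k] for k in key_cols), {})} for r in rows]

pageviews = [
    {"id": 307, "lang": "af", "2008-01": 4, "2008-02": 5},
    {"id": 307, "lang": "am", "2008-01": 0, "2008-02": 2},  # hypothetical values
]
people = {
    (307, "af"): {"name": "Abraham Lincoln", "occupation": "POLITICIAN"},
    (307, "am"): {"name": "Abraham Lincoln", "occupation": "POLITICIAN"},
}

monthly = melt(pageviews, ["id", "lang"])
clickinfo = left_join(monthly, people, ["id", "lang"])
for row in clickinfo:
    print(row)
```

One difference worth noting: q's `lj` fills unmatched rows with nulls in the joined columns, while this sketch simply omits those keys.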


STEP 3. ASK SOME QUESTIONS

select from clickinfo where occupation like "COMPUTER SCIENTIST"

STEP 4…CLEAN THE DATA… AGAIN…

// Wikipedia page name for a given year, and a helper to download it
file:{"List_of_Google_Doodles_in_",string`year$x};
wget:{system"wget https://en.wikipedia.org/wiki/",file x};

// Parse one downloaded page into a dictionary of month -> celebrated names
process:{
  values:(string`January`February`March`April`May`June`July`August`September`October`November`December)!til 12;
  doc:read0 hsym`$file x;
  pars:where doc like\:"<p>*";
  celebrated:`$first@/:/:"\""vs/:/:(@).'flip(d;where@/:not(d:"title=\""vs/:doc pars)like\:\:"<p>*");
  headings:{[doc;x]first pos where(doc pos:x+neg til 10)like\:"<h3>*"}[doc]each pars;
  months:x+values first@/:"_"vs/:first@'"\""vs/:("id=\""vs/:doc headings)@'1;
  :raze each celebrated group months;
  };

years:2010.01 2011.01 2012.01 2013.01m;
wget each years;
results:raze process each years;
doodles:ungroup 1!flip`month`name!(key;value)@\:results;

month   name
----------------------------
2010.01 Isaac Newton
2010.01 Django Reinhard
2010.01 Anton Chekhov
2010.02 2010 Winter Olympics
…

Sample paragraph from the source page:

<p>On <b>Tuesday, July 6, 2010</b>, the birth of <a href="/wiki/Frida_Kahlo" title="Frida Kahlo">Frida Kahlo</a> was celebrated with a gold Google logo wrapped with vines, flowers, and a painting of herself in her painting styles.<sup id="cite_ref-18" class="reference"><a href="#cite_note-18">[18]</a></sup></p>

PARALLELIZATION IN Q

➤ The code is the same as in Step 4 except for one word: each becomes peach

results:raze process each years;

results:raze process peach years;

Done!
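The one-word change from each to peach has a rough Python counterpart in swapping a list comprehension for an executor's `map`. In this sketch `process` is a hypothetical stand-in for the page parser, and threads are used only to keep the example self-contained. As I understand it, q's `peach` relies on secondary threads or processes configured at startup, so treat the mapping between the two as loose.

```python
from concurrent.futures import ThreadPoolExecutor

def process(year):
    """Hypothetical stand-in for the page parser: one entry per month."""
    return {f"{year}.{month:02d}": [] for month in range(1, 13)}

def each(f, xs):
    """Sequential map, like q's each."""
    return [f(x) for x in xs]

def peach(f, xs, workers=4):
    """Parallel map with the same result order, loosely like q's peach."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, xs))

years = [2010, 2011, 2012, 2013]
sequential = each(process, years)
parallel = peach(process, years)
print(sequential == parallel)
```

For CPU-bound parsing in Python, `ProcessPoolExecutor` would be the more likely choice; the call shape stays the same.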

STEP 5. ASK SOME MORE QUESTIONS!

// Annotate the doodled months in the main table
clickinfo:update doodle:(date,'name) in doodles from clickinfo;

// Get the average and median ratio between the max monthly clicks (with and without
// the doodled month) and the min monthly clicks, excluding 0-click months
(avg;med)@\:{exec(%).(max clicks where doodle;max clicks where not doodle)-min clicks from flip x where not clicks=0}each select clicks,doodle by name from clickinfo where name in doodles`name

58.34461 10.30895

Average: 58x Median: 10x

name                 |
---------------------| --------
Winsor McCay         | 508.5705
Albert Szent-Györgyi | 404.2465
Nicolas Steno        | 360.9331
Gideon Sundback      | 340.303
Mary Leakey          | 337.1806
Dennis Gabor         | 274.4389
Grace Hopper         | 220.8074
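The ratio computed above can be spelled out step by step in Python. For each person it drops zero-click months, subtracts the minimum monthly clicks from both peaks, and divides the doodle-month peak by the peak of the other months. The function name and the sample data are invented for this sketch and do not reproduce the talk's results.

```python
from statistics import mean, median

def doodle_ratio(months):
    """months: list of (clicks, is_doodle_month) pairs for one person.

    Assumes at least one doodle month and one other month remain after
    filtering, and that the non-doodle peak exceeds the minimum.
    """
    live = [(clicks, doodle) for clicks, doodle in months if clicks != 0]
    lowest = min(clicks for clicks, _ in live)
    peak_doodle = max(clicks for clicks, doodle in live if doodle)
    peak_other = max(clicks for clicks, doodle in live if not doodle)
    return (peak_doodle - lowest) / (peak_other - lowest)

# Hypothetical monthly click counts, not from the talk
people = {
    "Person A": [(10, False), (0, False), (20, False), (110, True)],
    "Person B": [(5, False), (15, False), (35, True)],
}

ratios = {name: doodle_ratio(months) for name, months in people.items()}
print(ratios)
print(mean(ratios.values()), median(ratios.values()))
```

A gap between average and median, as in the slide's 58x versus 10x, usually indicates a few very large ratios pulling the average up, which matches the long tail in the table of names.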


WHY SHOULD YOU CARE?

…and summary


WHY SHOULD YOU CARE?

➤ High-level expressive notation

➤ Not just someone’s pet project

➤ Developed by Kx Systems (since 1993)

➤ Practical (dicts, tables, q-sql, temporals, etc…)

➤ Very fast

➤ Memory is getting larger, vector operations are getting faster (SIMD, SSE, AVX2, AVX512, …)

➤ …benchmarks available online

➤ It’s interesting, different, and will change how you think

THANKS!

l:{(3=not[x]*n)or(or). 3 4=\:x*n:2{flip+':[x]+1_x,0b}/x}

➤ Some references:

  ➤ Two books:

    ➤ Q Tips - Nick Psaris

    ➤ Q for Mortals - Jeff Borror

  ➤ code.kx.com

  ➤ kx.com

    ➤ /software-download.php

    ➤ /community.php

➤ Notation as a Tool of Thought - K. Iverson’s Turing Award Paper

@timthornton6