DATA ANALYSIS WITH VECTOR FUNCTIONAL PROGRAMMING
A tour of the Q programming language
HISTORY OF VECTOR LANGUAGES
➤ Vectors (arrays), not scalars, are the principal data type
➤ Not a new idea (APL, 1965)
➤ Ok… maybe new compared to functional programming (λ-calculus, 1930s)
➤ Ken Iverson’s Iverson Notation
➤ Notation as a tool of thought
➤ Notation for people first, computers later
➤ Influenced: Mathematica, Matlab, R, Julia
➤ Descendants: I.N. → APL, J, A+, K, Q
Q PRIMER
The basic concepts
FUNCTION APPLICATION
➤ Monadic functions have a word name and take their argument to the right
    abs -1        / 1
    til 10        / 0 1 2 3 4 5 6 7 8 9
➤ Dyadic verbs appear between the arguments
    1+2           / 3
    9 mod 3       / 0
➤ Function application is a verb
    abs@-1        / 1
    (-). 1 2      / -1
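The application forms above can be compared side by side. This is a minimal sketch; `add` is a hypothetical lambda introduced only for illustration and is not part of the talk.

```q
/ add is a hypothetical two-argument lambda, used only for illustration
add:{x+y}

r1:neg 5          / monadic: the argument sits to the right, giving -5
r2:neg@5          / @ applies a function to a single argument, giving -5
r3:add[2;3]       / bracket application with two arguments, giving 5
r4:add . 2 3      / . applies a function to a list of arguments, giving 5
```

Because `@` and `.` are themselves verbs, they can be modified by adverbs like any other verb.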
ATOMIC FUNCTIONS
➤ Primitive functions (and verbs) are atomic (apply to atoms)
➤ Evaluation is always right-to-left
➤ Typically read top-down (left-to-right)
    5*10+til 5                   / 50 55 60 65 70
    -1*0 1 2 3 4                 / 0 -1 -2 -3 -4
    5*(1;2 3;(4;5 6);7 8;9)      / (5;10 15;(20;25 30);35 40;45)
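Right-to-left evaluation means there is no operator precedence to remember. A small sketch of the consequence, alongside the atom-to-list pairing that atomic verbs perform:

```q
r1:2*3+4            / evaluated as 2*(3+4), so 14 rather than 10
r2:(2*3)+4          / parentheses force the other order: 10
r3:10 20 30+1       / an atom pairs with every item: 11 21 31
r4:1 2 3*10 20 30   / equal-length lists pair item by item: 10 40 90
```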
LIST VERBS
➤ List primitives (we have them too, we just use fewer characters):
take (#)
    2#til 10          / 0 1
    -2#til 10         / 8 9
join (,)
    (til 4),til 4     / 0 1 2 3 0 1 2 3
split (_)
    0 3 6_til 9       / (0 1 2;3 4 5;6 7 8)
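A few further cases of the same primitives, as a hedged sketch: take cycles through its argument when asked for more items than exist, and `_` drops items when its left argument is an atom rather than a list.

```q
r1:5#1 2 3      / take cycles when it runs out of items: 1 2 3 1 2
r2:2_til 5      / with an atom on the left, _ drops from the front: 2 3 4
r3:-2_til 5     / a negative count drops from the end: 0 1 2
r4:0 2_til 5    / with a list on the left, _ cuts at those indices: (0 1;2 3 4)
```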
MAPPING A LIST - FP 101
    0 3 6_til 9               / (0 1 2;3 4 5;6 7 8)
    count each 0 3 6_til 9    / 3 3 3
➤ If dyadic, combine with an adverb (a pairing operator)
➤ e.g., take (#) + each-both (') = take-each-both (#')
    3#0                       / 0 0 0
    3 3 3#'0 1 2              / (0 0 0;1 1 1;2 2 2)
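Each-both works with any dyadic verb, not only take. A minimal sketch with join and take:

```q
r1:1 2 3,'4 5 6        / join each-both pairs items up: (1 4;2 5;3 6)
r2:1 2 3#'`a`b`c       / take each-both: (enlist`a;`b`b;`c`c`c)
r3:2#'(1 2 3;4 5 6)    / an atom on the left pairs with every item: (1 2;4 5)
```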
But Wait! There’s More!
ADVERBS
noun    verb  adverb  noun
3 3 3   #     '       0 1 2
FOLD AND SCAN ARE ADVERBS … MORE FP 101
➤ Fold (/) is an adverb, we call it over
    0+/til 5    / 10
A plus reduction over 0 1 2 3 4
➤ Scan (\) returns the incremental values of over (left-to-right)
    0+\til 5    / 0 1 3 6 10
Partial sums of 0 1 2 3 4
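Over and scan combine with any dyadic verb. A short sketch using multiplication and max in place of plus:

```q
r1:(*/)1+til 5      / product of 1 2 3 4 5: 120
r2:(*\)1+til 5      / running products: 1 2 6 24 120
r3:(|/)3 1 4 1 5    / | is max on numbers, so this is the maximum: 5
r4:100+/1 2 3       / dyadic form with a seed value on the left: 106
```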
FLEXIBLE MAPPING WITH ADVERBS
➤ Only 6 adverbs, but they come up all the time
each-left (\:)
    (floor;ceiling)@\:5.5         / 5 6
each-right (/:)
    max@/:0 3 6_til 9             / 2 5 8
each-prior (':)
    0-':til 5                     / 0 1 1 1 1
compose: each-left-each-right (\:/:)
    (min;max)@\:/:0 3 6_til 9     / (0 2;3 5;6 8)
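The difference between each-left and each-right is easiest to see with the same verb and the same arguments. A minimal sketch:

```q
r1:1 2 3+\:10 20     / each-left: every left item with the whole right: (11 21;12 22;13 23)
r2:1 2 3+/:10 20     / each-right: the whole left with every right item: (11 12 13;21 22 23)
r3:0-':1 4 9 16      / each-prior: every item minus its predecessor, seeded with 0: 1 3 5 7
r4:"abc",/:\:"xy"    / composed adverbs give every pairing: (("ax";"ay");("bx";"by");("cx";"cy"))
```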
THINKING IN ARRAYS
Prime Numbers
THINKING IN ARRAYS - NO STINKING LOOPS*
    function isPrime(n) {
      if (n < 2) return false;
      var q = Math.floor(Math.sqrt(n));
      for (var i = 2; i <= q; i++) { if (n % i == 0) { return false; } }
      return true;
    }
* Steve Apter nsl.com
THINKING IN ARRAYS
[Figure sequence, panels labelled: x mod y over 1 .. 100; x mod y = 0; y = 1 and y = x; primes]
THE RESULT
➤ Extremely concise, 111 bytes
➤ 29 characters left for emojis when tweeting it!
    p:{n where 2=sum 0=n mod/:n:1+til x}
    rle:{(count;first)@\:/:(where not =':[x])_x}
    expand:{(),/(#).'x}
Only short programs have any hope of being correct
~ Arthur Whitney
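The three definitions on this slide can be tried on small inputs. The sketch below repeats them so that it is self-contained; the inputs `20` and `"aaabbc"` are illustrative choices, not taken from the talk.

```q
/ definitions as shown on the slide
p:{n where 2=sum 0=n mod/:n:1+til x}            / numbers in 1..x with exactly two divisors
rle:{(count;first)@\:/:(where not =':[x])_x}    / run-length encode: (count;item) per run
expand:{(),/(#).'x}                             / inverse of rle

r1:p 20                   / 2 3 5 7 11 13 17 19
r2:rle "aaabbc"           / ((3;"a");(2;"b");(1;"c"))
r3:expand rle "aaabbc"    / "aaabbc"
```

Note that `p` builds the full x-by-x table of remainders, so it trades memory for brevity; that is the point of the slide rather than a recommendation for large x.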
HOW CAN WE USE Q FOR DATA ANALYSIS?
➤ Q has dictionaries (associations) and tables (flipped dictionaries)
➤ Tables are first-class and columnar, operations on columns are fast and efficient
➤ It is actually the scripting language for kdb+
➤ Has an integrated sql-like query language called q-sql
    select avg price by sym from trades where date>.z.d-5
➤ Has really nice temporal types, temporal arithmetic, and temporal joins
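The points above can be sketched on a toy table. The names `d` and `t` and the values in them are hypothetical, chosen only to keep the example small.

```q
/ d and t are hypothetical toy data, used only for illustration
d:`a`b`c!1 2 3                               / a dictionary maps keys to values
t:flip `sym`price!(`x`y`x`y;10 20 30 40f)    / flipping a dictionary of columns gives a table

r1:select avg price by sym from t    / keyed table: x has 20, y has 30
r2:exec avg price by sym from t      / the same figures as a dictionary
r3:2016.03.01-2016.02.01             / date arithmetic: 29 days (2016 is a leap year)
r4:`month$2016.02.14                 / cast a date to its month: 2016.02m
```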
Q FOR DATA ANALYSIS
STEP 1. GET SOME DATA
    // System commands start with \
    \wget .../pantheon.tsv
    \wget .../pageviews_2008-2013.tsv -O pageviews.tsv
    // ETL in Q
    people:("iSiSSSSSffsissssiffiiff";enlist"\t")0:`:pantheon.tsv;
    pageviews:("iSSiSisssss",72#"i";enlist"\t")0:`:pageviews.tsv;
Annotations on the load expression: column types, tab separated, file name
Monthly page visit information for people on Wikipedia
We have a short fat table, want a long skinny table…
Each month is a single column
STEP 2. CLEAN THE DATA!
    // All of the months
    months:"M"$ssr[;"-";"."]each string 11_cols pageviews;
    // Create a new table of the months flattened
    monthly:ungroup 2!([]id:pageviews`id;lang:pageviews`lang;month:(count pageviews)#enlist months;clicks:flip pageviews c:11_cols pageviews)
    // Left-join click information with person information
    clickinfo:monthly lj `id`lang xkey people;

people (keyed on id and lang):
    id  name            occupation lang
    -----------------------------------
    307 Abraham Lincoln POLITICIAN af
    307 Abraham Lincoln POLITICIAN am
    307 Abraham Lincoln POLITICIAN an
    307 Abraham Lincoln POLITICIAN ang
    307 Abraham Lincoln POLITICIAN ar
    307 Abraham Lincoln POLITICIAN arz
    …

monthly (long skinny table, 4 columns, month values):
    id  lang month   clicks
    -----------------------
    307 af   2008.01 4
    307 af   2008.02 5
    307 af   2008.03 0
    307 af   2008.04 5
    307 af   2008.05 5
    307 af   2008.06 1
    …

Left join
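The wide-to-long reshaping and the left join in this step can be sketched on a two-row table. The names `wide`, `tall`, `info` and `joined`, and all values, are hypothetical stand-ins for the pageviews and people tables.

```q
/ wide, tall, info and joined are hypothetical toy tables, not the talk's data
wide:([]id:1 2;m1:10 20;m2:30 40)    / one column per month
c:1_cols wide                        / the month columns: `m1`m2

/ one row per id holding a list of months and a list of clicks, then flattened by ungroup
tall:ungroup ([]id:wide`id;month:(count wide)#enlist c;clicks:flip wide c)

info:([]id:1 2;name:`ann`bob)
joined:tall lj `id xkey info         / left join on id adds the name column
```

Here `tall` has one row per (id, month) pair, which is the long skinny shape the slide asks for.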
STEP 3. ASK SOME QUESTIONS
    select from clickinfo where occupation like "COMPUTER SCIENTIST"
STEP 4…CLEAN THE DATA… AGAIN…
    file:{"List_of_Google_Doodles_in_",string`year$x};
    wget:{system"wget https://en.wikipedia.org/wiki/",file x};
    process:{
      values:(string`January`February`March`April`May`June`July`August`September`October`November`December)!til 12;
      doc:read0 hsym`$file x;
      pars:where doc like\:"<p>*";
      celebrated:`$first@/:/:"\""vs/:/:(@).'flip(d;where@/:not(d:"title=\""vs/:doc pars)like\:\:"<p>*");
      headings:{[doc;x]first pos where(doc pos:x+neg til 10)like\:"<h3>*"}[doc]each pars;
      months:x+values first@/:"_"vs/:first@'"\""vs/:("id=\""vs/:doc headings)@'1;
      :raze each celebrated group months;
      };
    years:2010.01 2011.01 2012.01 2013.01m;
    wget each years;
    results:raze process each years;
    doodles:ungroup 1!flip`month`name!(key;value)@\:results;

Sample paragraph from the downloaded page:
    <p>On <b>Tuesday, July 6, 2010</b>, the birth of <a href="/wiki/Frida_Kahlo" title="Frida Kahlo">Frida Kahlo</a> was celebrated with a gold Google logo wrapped with vines, flowers, and a painting of herself in her painting styles.<sup id="cite_ref-18" class="reference"><a href="#cite_note-18">[18]</a></sup></p>

Resulting doodles table:
    month   name
    ----------------------------
    2010.01 Isaac Newton
    2010.01 Django Reinhard
    2010.01 Anton Chekhov
    2010.02 2010 Winter Olympics
    …
PARALLELIZATION IN Q
With file, wget and process defined as in Step 4, the sequential version maps process over the years with each:
    years:2010.01 2011.01 2012.01 2013.01m;
    wget each years;
    results:raze process each years;
    doodles:ungroup 1!flip`month`name!(key;value)@\:results;
Replacing each with peach on the processing line parallelizes it:
    results:raze process peach years;
Done!
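The each-to-peach swap can be checked on a trivial function. In this sketch `sq` is a hypothetical stand-in for an expensive function such as process.

```q
/ sq is a hypothetical stand-in for an expensive per-item function
sq:{x*x}

r1:sq each til 5     / sequential map: 0 1 4 9 16
r2:sq peach til 5    / same result, with the work eligible to be spread over secondary threads
```

Secondary threads are only available when q is started with the `-s` command-line option; without them, `peach` runs like `each`, so the swap is safe but gives no speed-up.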
STEP 5. ASK SOME MORE QUESTIONS!
    // Annotate the doodled months in the main table
    clickinfo:update doodle:(date,'name) in doodles from clickinfo;
    // Get the average and median ratio between the max monthly clicks (with and without
    // the doodled month) and the min monthly clicks, excluding 0-click months
    (avg;med)@\:{exec(%).(max clicks where doodle;max clicks where not doodle)-min clicks
      from flip x where not clicks=0}each select clicks,doodle by name from clickinfo where name in doodles`name

    58.34461 10.30895

Average: 58x Median: 10x

    name                 |
    ---------------------| --------
    Winsor McCay         | 508.5705
    Albert Szent-Györgyi | 404.2465
    Nicolas Steno        | 360.9331
    Gideon Sundback      | 340.303
    Mary Leakey          | 337.1806
    Dennis Gabor         | 274.4389
    Grace Hopper         | 220.8074
WHY SHOULD YOU CARE?
…and summary
➤ High-level expressive notation
➤ Not just someone's pet project
➤ Developed by Kx Systems (since 1993)
➤ Practical (dicts, tables, q-sql, temporals, etc…)
➤ Very fast
➤ memory is getting larger, vector operations getting faster (SIMD, SSE, AVX2, AVX512, …)
➤ …benchmarks available online
➤ It’s interesting, different, and will change how you think
THANKS!
    l:{(3=not[x]*n)or(or). 3 4=\:x*n:2{flip+':[x]+1_x,0b}/x}
➤ Some references:
➤ Two books: Q Tips - Nick Psaris; Q for Mortals - Jeff Borror
➤ code.kx.com
➤ kx.com (/software-download.php, /community.php)
➤ Notation as a Tool of Thought - K. Iverson's Turing Award Paper
@timthornton6