readme update

author: Andrew Gallant <jamslam@gmail.com> 2014-11-28 00:14:03 -0500
committer: Andrew Gallant <jamslam@gmail.com> 2014-11-28 00:14:03 -0500
commit: a7290db704f885009102d516f0418a60d11a7b13 (patch)
tree: aa8b6a88e8ad3976507d4e33bc2380dc9836be8e
parent: fed1d5fab88cccf28525a6bdb18e4e4956d553cc (diff)
1 files changed, 279 insertions, 4 deletions
diff --git a/README.md b/README.md
index 9a6a6c0..8436048 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,11 @@
-xsv is a command line program for slicing, analyzing, splitting and joining
-CSV files. There are two primary goals: performance and compositionality. To be
-more concrete:
+xsv is a command line program for indexing, slicing, analyzing, splitting
+and joining CSV files. There are two primary goals: performance and
+compositionality. To be more concrete:
 
 1. With xsv, it should be easy to perform simple tasks.
 2. Behavior that affects performance should be made explicit (and documented)
    in the command line interface.
-3. xsv commands should be easily composable.
+3. xsv commands should be composable, but not at the expense of performance.
 
 This README contains information on how to install `xsv` and a full set of
 examples that demonstrate much of its functionality.
@@ -13,5 +13,280 @@ examples that demonstrate much of its functionality.
 [![Build status](https://api.travis-ci.org/BurntSushi/xsv.png)](https://travis-ci.org/BurntSushi/xsv)
 
 
+### A whirlwind tour
+
+Let's say you're playing with some of the data from the
+[Data Science Toolkit](https://github.com/petewarden/dstkdata), which contains
+several CSV files. Maybe you're interested in the population counts of each
+city in the world. So grab the data and start examining it:
+
+```bash
+$ curl -LO http://burntsushi.net/stuff/worldcitiespop.csv
+$ xsv headers worldcitiespop.csv
+1   Country
+2   City
+3   AccentCity
+4   Region
+5   Population
+6   Latitude
+7   Longitude
+```
+
+The next thing you might want to do is get an overview of the kind of data that
+appears in each column. The `stats` command will do this for you:
+
+```bash
+$ xsv stats worldcitiespop.csv --everything | xsv table
+field       type     min            max            min_length  max_length  mean          stddev         median     mode         cardinality
+Country     Unicode  ad             zw             2           2                                                   cn           234
+City        Unicode   bab el ahmar  Þykkvibaer     1           91                                                  san jose     2351892
+AccentCity  Unicode   Bâb el Ahmar  ïn Bou Chella  1           91                                                  San Antonio  2375760
+Region      Unicode  00             Z9             0           2                                        13         04           397
+Population  Integer  7              31480498       0           8           47719.570634  302885.559204  10779                   28754
+Latitude    Float    -54.933333     82.483333      1           12          27.188166     21.952614      32.497222  51.15        1038349
+Longitude   Float    -179.983333    180            1           14          37.08886      63.22301       35.28      23.8         1167162
+```
+
+The `xsv table` command takes any CSV data and formats it into aligned columns
+using [elastic tabs](https://github.com/BurntSushi/tabwriter). You'll notice
+that it even gets alignment right with respect to Unicode characters.
+
+So, this command takes about 12 seconds to run on my machine, but we can speed
+it up by creating an index and re-running the command:
+
+```bash
+$ xsv index worldcitiespop.csv
+$ xsv stats worldcitiespop.csv --everything | xsv table
+...
+```
+
+Which cuts it down to about 8 seconds on my machine. (And creating the index
+takes less than 2 seconds.)
+
+Notably, the same type of "statistics" command in another
+[CSV command line toolkit](https://csvkit.readthedocs.org/en/0.9.0/)
+takes about 2 minutes to produce similar statistics on the same data set.
+
+Creating an index gives us more than just faster statistics gathering. It also
+makes slice operations extremely fast because *only the sliced portion* has to
+be parsed. For example, let's say you wanted to grab the last 10 records:
+
+```bash
+$ xsv count worldcitiespop.csv
+3173958
+$ xsv slice worldcitiespop.csv -s 3173948 | xsv table
+Country  City               AccentCity         Region  Population  Latitude     Longitude
+zw       zibalonkwe         Zibalonkwe         06                  -19.8333333  27.4666667
+zw       zibunkululu        Zibunkululu        06                  -19.6666667  27.6166667
+zw       ziga               Ziga               06                  -19.2166667  27.4833333
+zw       zikamanas village  Zikamanas Village  00                  -18.2166667  27.95
+zw       zimbabwe           Zimbabwe           07                  -20.2666667  30.9166667
+zw       zimre park         Zimre Park         04                  -17.8661111  31.2136111
+zw       ziyakamanas        Ziyakamanas        00                  -18.2166667  27.95
+zw       zizalisari         Zizalisari         04                  -17.7588889  31.0105556
+zw       zuzumba            Zuzumba            06                  -20.0333333  27.9333333
+zw       zvishavane         Zvishavane         07      79876       -20.3333333  30.0333333
+```
+
+These commands are *instantaneous* because they run in time and memory
+proportional to the size of the slice (which means they will scale to
+arbitrarily large CSV data).
+
+Switching gears a little bit, you might not always want to see every column in
+the CSV data. In this case, maybe we only care about the country, city and
+population. So let's take a look at 10 random rows:
+
+```bash
+$ xsv select Country,AccentCity,Population worldcitiespop.csv \
+  | xsv sample 10 \
+  | xsv table
+Country  AccentCity       Population
+cn       Guankoushang
+za       Klipdrift
+ma       Ouled Hammou
+fr       Les Gravues
+la       Ban Phadèng
+de       Lüdenscheid      80045
+qa       Umm ash Shubrum
+bd       Panditgoan
+us       Appleton
+ua       Lukashenkivske
+```
+
+Whoops! It seems some cities don't have population counts. How pervasive is
+that?
+
+```bash
+$ xsv frequency worldcitiespop.csv --limit 5
+field,value,count
+Country,cn,238985
+Country,ru,215938
+Country,id,176546
+Country,us,141989
+Country,ir,123872
+City,san jose,328
+City,san antonio,320
+City,santa rosa,296
+City,santa cruz,282
+City,san juan,255
+AccentCity,San Antonio,317
+AccentCity,Santa Rosa,296
+AccentCity,Santa Cruz,281
+AccentCity,San Juan,254
+AccentCity,San Miguel,254
+Region,04,159916
+Region,02,142158
+Region,07,126867
+Region,03,122161
+Region,05,118441
+Population,(NULL),3125978
+Population,2310,12
+Population,3097,11
+Population,983,11
+Population,2684,11
+Latitude,51.15,777
+Latitude,51.083333,772
+Latitude,50.933333,769
+Latitude,51.116667,769
+Latitude,51.133333,767
+Longitude,23.8,484
+Longitude,23.2,477
+Longitude,23.05,476
+Longitude,25.3,474
+Longitude,23.1,459
+```
+
+(The `xsv frequency` command builds a frequency table for each column in the
+CSV data. This one only took 5 seconds.)
+
+So it seems that most cities do not have a population count associated with
+them at all. No matter---we can adjust our previous command so that it only
+shows rows with a population cound:
+
+```bash
+$ xsv search -s Population '[0-9]' worldcitiespop.csv \
+  | xsv select Country,AccentCity,Population \
+  | xsv sample 10 \
+  | xsv table
+Country  AccentCity       Population
+es       Barañáin         22264
+es       Puerto Real      36946
+at       Moosburg         4602
+hu       Hejobaba         1949
+ru       Polyarnyye Zori  15092
+gr       Kandíla          1245
+is       Ólafsvík         992
+hu       Decs             4210
+bg       Sliven           94252
+gb       Leatherhead      43544
+```
+
+Erk. Which country is `at`? No clue, but the Data Science Toolkit has a CSV
+file called `countrynames.csv`. Let's grab it and do a join so we can see which
+countries these are:
+
+```bash
+curl -LO https://gist.githubusercontent.com/anonymous/063cb470e56e64e98cf1/raw/98e2589b801f6ca3ff900b01a87fbb7452eb35c7/countrynames.csv
+$ xsv headers countrynames.csv
+1   Abbrev
+2   Country
+$ xsv join --no-case  Country sample.csv Abbrev countrynames.csv | xsv table
+Country  AccentCity       Population  Abbrev  Country
+es       Barañáin         22264       ES      Spain
+es       Puerto Real      36946       ES      Spain
+at       Moosburg         4602        AT      Austria
+hu       Hejobaba         1949        HU      Hungary
+ru       Polyarnyye Zori  15092       RU      Russian Federation | Russia
+gr       Kandíla          1245        GR      Greece
+is       Ólafsvík         992         IS      Iceland
+hu       Decs             4210        HU      Hungary
+bg       Sliven           94252       BG      Bulgaria
+gb       Leatherhead      43544       GB      Great Britain | UK | England | Scotland | Wales | Northern Ireland | United Kingdom
+```
+
+Whoops, now we have two columns called `Country` and an `Abbrev` column that we
+no longer need. This is easy to fix by re-ordering columns with the `xsv
+select` command:
+
+```bash
+$ xsv join --no-case  Country sample.csv Abbrev countrynames.csv \
+  | xsv select 'Country[1],AccentCity,Population' \
+  | xsv table
+Country                                                                              AccentCity       Population
+Spain                                                                                Barañáin         22264
+Spain                                                                                Puerto Real      36946
+Austria                                                                              Moosburg         4602
+Hungary                                                                              Hejobaba         1949
+Russian Federation | Russia                                                          Polyarnyye Zori  15092
+Greece                                                                               Kandíla          1245
+Iceland                                                                              Ólafsvík         992
+Hungary                                                                              Decs             4210
+Bulgaria                                                                             Sliven           94252
+Great Britain | UK | England | Scotland | Wales | Northern Ireland | United Kingdom  Leatherhead      43544
+```
+
+Perhaps we can do this with the original CSV data? Indeed we can---because
+joins in `xsv` are fast.
+
+```bash
+$ xsv join --no-case Abbrev countrynames.csv Country worldcitiespop.csv \
+  | xsv select '!Abbrev,Country[1]' \
+  > worldcitiespop_countrynames.csv
+$ xsv sample 10 worldcitiespop_countrynames.csv | xsv table
+Country                      City                   AccentCity             Region  Population  Latitude    Longitude
+Sri Lanka                    miriswatte             Miriswatte             36                  7.2333333   79.9
+Romania                      livezile               Livezile               26      1985        44.512222   22.863333
+Indonesia                    tawainalu              Tawainalu              22                  -4.0225     121.9273
+Russian Federation | Russia  otar                   Otar                   45                  56.975278   48.305278
+France                       le breuil-bois robert  le Breuil-Bois Robert  A8                  48.945567   1.717026
+France                       lissac                 Lissac                 B1                  45.103094   1.464927
+Albania                      lumalasi               Lumalasi               46                  40.6586111  20.7363889
+China                        motzushih              Motzushih              11                  27.65       111.966667
+Russian Federation | Russia  svakino                Svakino                69                  55.60211    34.559785
+Romania                      tirgu pancesti         Tirgu Pancesti         38                  46.216667   27.1
+```
+
+The `!Abbrev,Country[1]` syntax means, "remove the `Abbrev` column and remove
+the second occurrence of the `Country` column." Since we joined with
+`countrynames.csv` first, the first `Country` name (fully expanded) is now
+included in the CSV data.
+
+This `xsv join` command takes about 7 seconds on my machine. The performance
+comes from constructing a very simple hash index of one of the CSV data files
+given.
+
+
+### Installation
+
+Installing `xsv` is a bit hokey right now. Ideally, I could release binaries
+for Linux, Mac and Windows. Currently, I'm only able to release binaries for
+Linux because I don't know how to cross compile Rust programs.
+
+With that said, you can grab the
+[latest release](https://github.com/BurntSushi/xsv/releases/latest)
+(Linux x86_64 binary) from GitHub:
+
+```bash
+$ curl -sOL https://github.com/BurntSushi/xsv/releases/download/0.4.9/xsv-0.4.9-x86_64-unknown-linux-gnu.tar.gz
+$ tar xf xsv-0.4.9-x86_64-unknown-linux-gnu.tar.gz
+$ cd xsv-0.4.9-x86_64-unknown-linux-gnu/
+$ ./xsv --version
+0.4.9
+```
+
+Alternatively, you can compile from source by
+[installing Cargo](https://crates.io/install)
+([Rust's](http://www.rust-lang.org/) package manager)
+and building `xsv`:
+
+```bash
+git clone git://github.com/BurntSushi/xsv
+cd xsv
+cargo build --release
+```
+
+Compilation will probably take 1-2 minutes depending on your machine. The
+binary will end up in `./target/release/xsv`.
+
 **WORK IN PROGRESS**.
author	Andrew Gallant <jamslam@gmail.com>	2014-11-28 00:14:03 -0500
committer	Andrew Gallant <jamslam@gmail.com>	2014-11-28 00:14:03 -0500
commit	a7290db704f885009102d516f0418a60d11a7b13 (patch)
tree	aa8b6a88e8ad3976507d4e33bc2380dc9836be8e
parent	fed1d5fab88cccf28525a6bdb18e4e4956d553cc (diff)