Splitting the Logs With csplit

csplit is a POSIX command to split a file into sub-files using a line-delimiter:


csplit [options] <file-name> <pattern>

Patterns

The pattern can be a line number (to split at that line) or a regular expression.

Example: Split the ape-tools log into separate tests

Each test starts with a recording of the contents of the Parameters named-tuple. It can be matched with the pattern:


INFO.*Running\ Parameters

By default csplit splits only at the first match. To separate the lines that come before the first test from the rest of the log:


csplit apetools.log /INFO.*Running\ Parameters/

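The forward slashes enclose the pattern and tell csplit to save everything up to, but not including, the matched line in one file and the rest in another. If you enclose the pattern in percent signs instead, csplit discards the text before the matched line and saves only the text from the match onward: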


csplit apetools.log %INFO.*Running\ Parameters%

In the example we're not doing anything hugely useful, as the tests are all in the same file. To tell csplit to break it up for more than the first match you use the {<count>} option. Since you want all the tests you can pass in a wild-card instead of an exact number:


csplit apetools.log /INFO.*Running\ Parameters/ {*}

File-Names

csplit names the output files it creates from two parts: a prefix and a suffix. The default prefix is xx and the default suffix is %02d (the printf format for an integer with at least 2 digits). So the previous command would produce a set of files (xx00, xx01, etc.). If you want to make them a little more memorable you can change the prefix and suffix:


csplit apetools.log /INFO.*Running\ Parameters/ {*} --prefix apetest --suffix-format %03d.log

The output for this would be a series of files:


apetest000.log, apetest001.log, ...
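
As a rough illustration of what the split and the naming do (this is not how csplit itself is implemented), here is a minimal Python sketch. The function names and the sample log lines are made up for the example; only the pattern, prefix, and suffix format come from the commands above.

```python
import re


def split_at_matches(lines, pattern):
    """Split lines into chunks, starting a new chunk at each matching line."""
    expression = re.compile(pattern)
    chunks = [[]]
    for line in lines:
        # As with /pattern/, the matching line begins the next chunk.
        if expression.search(line):
            chunks.append([])
        chunks[-1].append(line)
    return chunks


def chunk_names(count, prefix="apetest", suffix_format="%03d.log"):
    """Combine the prefix with the printf-style suffix for each chunk index."""
    return [prefix + suffix_format % index for index in range(count)]


# Hypothetical log lines, only for illustration.
log = [
    "DEBUG setup",
    "INFO: Running Parameters(a=1)",
    "DEBUG test one",
    "INFO: Running Parameters(a=2)",
    "DEBUG test two",
]
chunks = split_at_matches(log, r"INFO.*Running Parameters")
names = chunk_names(len(chunks))
```

The first chunk holds whatever came before the first test, which is why the first output file usually isn't a test.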

The Sigmoid

The sigmoid is a function with an s shape that's used in logistic regression (and
elsewhere).


The Equation



\[\begin{aligned}
\sigma(z) &= \frac{1}{1+e^{-z}}\\
\sigma(0) &= \frac{1}{2}\\
 \lim_{z \to \infty} \sigma(z) &= 1\\
 \lim_{z \to -\infty} \sigma(z) &= 0\\
 \end{aligned}\]

Applying to a Linear Regression Classifier

  1. Compute z as the sum of the inputs, each multiplied by a weight learned by the classifier, then apply the Sigmoid to z.
  2. Classify using round(Sigmoid(z))
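
The two steps can be sketched in Python as follows. This is a minimal sketch that assumes z is the weighted sum of the inputs and that the weights have already been learned; the function names and the example weights are hypothetical.

```python
import math


def sigmoid(z):
    """Return 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(inputs, weights):
    """Return 1 or 0 for one data point, given already-learned weights."""
    # Assumption: z is the weighted sum of the inputs.
    z = sum(x * w for x, w in zip(inputs, weights))
    return round(sigmoid(z))


# Hypothetical inputs and weights, only for illustration.
positive = classify([1.0, 2.0], [0.5, 0.5])
negative = classify([1.0, 2.0], [-1.0, -1.0])
```

Note that Python rounds exactly 0.5 down to 0, so a point sitting right on the boundary is classified as 0 here.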

Logistic Regression

Main Idea: Find the parameters for a line that partitions a data set.

General Approach (MLIA: p. 84)

  1. Collect Data: any method
  2. Prepare Data: Convert to numeric data if needed.
  3. Analyze: Any method.
  4. Train: Find the optimal coefficients to classify the data.
  5. Use: Given new data, classify it based on the previously classified data.

Pros, Cons, and Data Types

Pros:
  • Computationally Cheap
  • Easy to implement
  • Easy to interpret
Cons:
  • Susceptible to overfitting
  • Not always accurate
Data Types:
  • Numeric Values
  • Nominal Values

The Arch ARM glibc Error

Background

When installing packages on the PogoPlugs running Arch Linux you sometimes get this error:


error: failed to commit transaction (conflicting files)
glibc: /lib exists in file system
Errors occurred, no packages were upgraded.

The Fix

You can fix it using the instructions in this post:


pacman -R pcmciautils
mv /etc/profile.d/locale.sh /etc/profile.d/locale.sh.pacnew
pacman -Syu --ignore glibc

When prompted:

  • If it asks you if you want to update pacman say n
  • If it asks you if you want to replace any packages say y
  • If it asks you if you want to skip any packages say y

After the installation is done:


pacman -Su

When prompted:

  • Say no to anything to do with pacman
  • Say yes to anything else (within reason)

What Should Happen

The last step should update glibc so you don't get the errors anymore. To test it install whatever you were installing when the error occurred.

Finding Binomial Probabilities

The Equation

The probability of getting k successes in n trials is:
\[\begin{aligned}
p(k) &= {n \choose k} p^k (1-p)^{n-k}
\end{aligned}\]

The Variables

Variable          Meaning
$n$               total number of trials
$k$               number of successful trials
$p(k)$            probability of $k$ successes
$n-k$             the number of failures
$p$               probability of success for a single trial
$1-p$             probability of failure for a single trial
${n \choose k}$   number of ways to choose which $k$ of the $n$ trials are successes

Calculating $n$ choose $k$

\[\begin{aligned}
{n \choose k} &= \frac{n!}{k!(n - k)!}
\end{aligned}\]
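
A minimal Python sketch of the two equations above, using math.comb (Python 3.8 or later) for $n$ choose $k$. The function name is made up for the example.

```python
import math


def binomial_probability(k, n, p):
    """Probability of exactly k successes in n trials."""
    # math.comb(n, k) is n! / (k! (n - k)!)
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Example: exactly 2 heads in 4 flips of a fair coin.
two_of_four = binomial_probability(2, 4, 0.5)
```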

Inequalities

To find the probability of less than $y$ successes, sum the probabilities from $k=0$ to $k=y-1$:
\[\begin{aligned}
p(k < y) &= \sum_{i=0}^{y-1} p(k_i)
\end{aligned}\]
To find the probability of getting greater than $y$ successes, sum the probabilities from $k=y+1$ to $k=n$:
\[\begin{aligned}
p(k > y) &= \sum_{i=y+1}^{n} p(k_i)
\end{aligned}\]
To find the probability of getting $y$ or fewer successes, sum the probabilities from $k=0$ to $k=y$:
\[\begin{aligned}
p(k \leq y) &= \sum_{i=0}^{y} p(k_i)
\end{aligned}\]
To find the probability of getting $y$ or more successes, sum the probabilities from $k=y$ to $k=n$:
\[\begin{aligned}
p(k \geq y) &= \sum_{i=y}^{n} p(k_i)
\end{aligned}\]
To find the probability of from $y$ to $z$ successes, sum the probabilities from $k=y$ to $k=z$:
\[\begin{aligned}
p(y \leq k \leq z) &= \sum_{i=y}^{z} p(k_i)
\end{aligned}\]
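
All five inequalities are the same sum with different end points, so one helper covers them. A minimal sketch, with hypothetical function names and example values:

```python
import math


def binomial_probability(k, n, p):
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def probability_between(low, high, n, p):
    """p(low <= k <= high): sum the single-value probabilities."""
    return sum(binomial_probability(k, n, p) for k in range(low, high + 1))


# Example values, only for illustration.
n, p, y = 10, 0.5, 3
less_than = probability_between(0, y - 1, n, p)     # p(k < y)
greater_than = probability_between(y + 1, n, n, p)  # p(k > y)
at_most = probability_between(0, y, n, p)           # p(k <= y)
at_least = probability_between(y, n, n, p)          # p(k >= y)
```

The lower sums start at $k=0$ because zero successes is a possible outcome.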

Mean and Standard Deviation

The mean and Standard Deviation of a Binomial Distribution are:
\[\begin{aligned}
\mu &= np\\
\sigma &= \sqrt{np(1-p)}
\end{aligned}\]
Source: Statistics For Dummies, 2nd edition
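
A short Python sketch of the mean and standard deviation formulas, with made-up example values:

```python
import math


def binomial_mean(n, p):
    """Mean of a binomial distribution: np."""
    return n * p


def binomial_standard_deviation(n, p):
    """Standard deviation of a binomial distribution: sqrt(np(1 - p))."""
    return math.sqrt(n * p * (1 - p))


# Example: 100 flips of a fair coin.
mean = binomial_mean(100, 0.5)
deviation = binomial_standard_deviation(100, 0.5)
```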

The Sign Test

What is the purpose of the sign test?

To test if the population median equals a given value (e.g. the sample median).

What are the assumptions made?

  • Independence: Each sample is independent of the others.
  • Identical Distributions
  • Continuity: There are no ties 
    • Needed for the hypothesis test
    • Means there is exactly one point x such that F(x) = 1/2, and that point is the median ($\theta$)

How do you use it?

  1. Set up the null hypothesis:
    \[ \begin{aligned}
    H_0 &: m = m_0
    \end{aligned} \]
    $m$ is the true median, $m_0$ is the proposed median.
  2. Set up the alternative hypothesis as one of:
    \[\begin{aligned}
    H_a &: m \neq m_0\\
    H_a &: m > m_0\\
    H_a &: m < m_0
    \end{aligned}\]
  3. Collect a random sample from the population.
  4. Assign a 1 or 0 to each value in the data:
    • $sign = 0$ if $value < m_0$
    • $sign = 1$ if $value > m_0$
    • remove the value from the sample if $value = m_0$
  5. Sum all the signs to get $k$:
    \[\begin{aligned}
    k &= \sum_{i=0}^{n-1} sign(value_i)
    \end{aligned} \]
  6. Find $k$ on the binomial distribution:
    • $n$ is the current sample size (after removing the values equal to $m_0$)
    • $p = 0.5$ (If $H_0$ is true, half are above, half are below)
  7. Find the p-value, where $x$ is the binomial count of 1s:
    • if $H_a$ uses <, add the probabilities for $x \leq k$
    • if $H_a$ uses >, add the probabilities for $x \geq k$
    • if $H_a$ uses $\neq$, double the smaller of those two sums
  8. Draw a conclusion:
    If the p-value $< \alpha$ (0.05 for a 95% confidence level), reject $H_0$. Otherwise don't reject.
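
The procedure can be sketched in Python as below. This is a minimal sketch rather than a reference implementation: the function name and the alternative argument are made up for the example, and the two-sided case assumes the usual convention of doubling the smaller tail.

```python
import math


def sign_test(values, m0, alternative="two-sided"):
    """Return (k, n, p_value) for the sign test against the median m0."""
    kept = [value for value in values if value != m0]  # drop values equal to m0
    n = len(kept)
    k = sum(1 for value in kept if value > m0)  # the sum of the signs

    def tail(low, high):
        # Binomial(n, 0.5) probability that low <= x <= high
        return sum(math.comb(n, x) for x in range(low, high + 1)) * 0.5**n

    lower, upper = tail(0, k), tail(k, n)
    if alternative == "less":
        p_value = lower
    elif alternative == "greater":
        p_value = upper
    else:
        # Assumption: two-sided p-value is twice the smaller tail, capped at 1.
        p_value = min(1.0, 2 * min(lower, upper))
    return k, n, p_value


# Hypothetical data: is the median different from 2?
k, n, p_value = sign_test([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], m0=2)
reject = p_value < 0.05
```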

Signs, Ranks, and Signed Ranks

The following are used in non-parametric statistical tests.

What is a sign?

  • a value $\in \{0, 1\}$ assigned to each member of a data-set
  • If the data-member is greater than the test-value, it's assigned a 1. Otherwise it gets a 0.

    What is a rank?

    • The number representing a data point's place in the ordered set (1-based index)
    • If multiple points have the same value, each is reassigned the mean of their original ranks.

    What is an absolute rank?

    • Each point is reassigned its absolute value
    • Then the set is Ranked

    What is a signed rank?

    1. Assign a 1 to each data point greater than the test-value.
    2. Assign a 0 to the remaining points.
    3. Find the absolute rank.
    4. For each point, assign it a signed rank:
      \[SignedRank = Sign \times AbsoluteRank\]

    What is a rank sum?

    • the sum of all the ranks for a data-set.
    Source: Statistics II for Dummies
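
A minimal Python sketch of the definitions above, following the steps as written in these notes (0/1 signs, ranks of the absolute values). The function names and sample data are made up for the example.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their original ranks."""
    positions = {}
    for index, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(index)
    mean_rank = {value: sum(p) / len(p) for value, p in positions.items()}
    return [mean_rank[value] for value in values]


def signed_ranks(values, test_value):
    """Sign (1 if above the test value, else 0) times the absolute rank."""
    signs = [1 if value > test_value else 0 for value in values]
    absolute_ranks = ranks([abs(value) for value in values])
    return [sign * rank for sign, rank in zip(signs, absolute_ranks)]


# Hypothetical data, only for illustration.
tied = ranks([10, 20, 20, 30])
signed = signed_ranks([-3, 1, 2], 0)
rank_sum = sum(tied)
```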

    PogoPlug Arch Linux Installation

    Background

    The PogoPlug is a network-attached storage device that by default runs BusyBox. By installing Arch Linux you gain access to a larger set of installable packages.

    Steps

    1. Power up the PogoPlug and attach it to your LAN via an ethernet cable.

    2. Go to pogoplug's site and activate it.

    3. Once you activate the device click on your login name at the top left of the screen and choose:

      • Settings > Security > Enable SSH access for this PogoPlug device
    4. SSH into your device (the password is what you set it to in the previous step):


      ssh root@<IP address>
    5. Kill the PogoPlug software:


      killall hbwd
    6. Download and install a bootloader:


      cd /tmp
      wget http://jeff.doozan.com/debian/uboot/install_uboot_mtd0.sh
      chmod +x install_uboot_mtd0.sh
      ./install_uboot_mtd0.sh
    7. Insert a USB drive into the PogoPlug

    8. Start fdisk on the PogoPlug:


      /sbin/fdisk /dev/sda
    9. At the prompt enter the commands to create a partition:


      o
      p
      n
      p
      1
    10. Accept all the defaults after entering 1, then write the partition table and exit when you reach the prompt Command (m for help):


      w
    11. Create the filesystem on the USB:


      wget http://archlinuxarm.org/os/pogoplug/mke2fs
      chmod 755 mke2fs
      ./mke2fs /dev/sda1
      mkdir usb
      mount /dev/sda1 usb
    12. Download and install Arch Linux:


      cd usb
      wget http://archlinuxarm.org/os/ArchLinuxARM-armv5te-latest.tar.gz
      tar -xzvf ArchLinuxARM-armv5te-*.tar.gz
      rm ArchLinuxARM-armv5te-*.tar.gz
      sync
    13. Clean up and reboot:


      cd ..
      umount usb
      /sbin/reboot
    14. The new SSH login and password are root, root

    Standard Deviation, Standard Error, and Confidence Intervals

    What is Standard Deviation?

    • The spread around the center of a normal distribution
    • The amount of variation in a population
    • One standard deviation is the distance from the mean to the point where the normal curve changes from concave down to concave up

    What is the Sample Standard Deviation?

    • The variation around the mean of a sample from the population
    Calculation:
    \[\begin{aligned}
    s &= \sqrt{\frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{n-1}}
    \end{aligned} \]
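
As a sketch of the calculation: sum the squared deviations from the sample mean, divide by $n-1$, and take the square root. The function name and data are made up; statistics.stdev from the standard library computes the same quantity and is used here only as a cross-check.

```python
import math
import statistics


def sample_standard_deviation(data):
    """Square root of the summed squared deviations divided by n - 1."""
    mean = sum(data) / len(data)
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return math.sqrt(squared_deviations / (len(data) - 1))


# Hypothetical sample, only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
by_hand = sample_standard_deviation(data)
library = statistics.stdev(data)
```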

    What is the Empirical Rule?

    • 68% of the data falls within 1 Standard Deviation of the center
    • 95% of the data falls within 2 Standard Deviations of the center
    • 99.7% of the data falls within 3 Standard Deviations of the center
    • If the percentages don't match the data, the distribution isn't normal

    What is the Standard Error?

    • The amount of variation the measure of central tendency (e.g. the mean) has from sample to sample:
      \[ \begin{aligned}
      SE &= \frac{\sigma}{\sqrt{n}}
      \end{aligned}\]
    • The Sample Standard Deviation measures the variation within a sample; the Standard Error measures how close the Sample Mean is likely to be to the Population Mean

    What is a Margin Of Error?

    • A multiple of the Standard Error
    • The multiple is based on the Confidence Level you want
    • For example: 68% Confidence uses a multiple of 1 (see the Empirical Rule)
    \[\begin{aligned}
    MOE &= multiple \times StandardError
    \end{aligned}\]
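
A minimal Python sketch of the Standard Error and Margin of Error. It assumes a normal distribution and takes the multiple from the standard library's NormalDist rather than the rounded Empirical Rule values; the function names and numbers are made up for the example.

```python
import math
from statistics import NormalDist


def standard_error(sigma, n):
    """Standard deviation divided by the square root of the sample size."""
    return sigma / math.sqrt(n)


def margin_of_error(sigma, n, confidence=0.95):
    """Multiple of the standard error for the given confidence level."""
    # z multiple that leaves (1 - confidence) / 2 in each tail
    multiple = NormalDist().inv_cdf(0.5 + confidence / 2)
    return multiple * standard_error(sigma, n)


# Hypothetical values: sigma = 10, n = 25.
error = standard_error(10, 25)
margin = margin_of_error(10, 25, confidence=0.95)
```

For 95% confidence the computed multiple is about 1.96, which the Empirical Rule rounds to 2.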

    What is a Confidence Interval?

    • Level of Confidence based on the percent of the distribution your Margin of Error covers
    • If you have a small data set or you don't know the population Standard Deviation, use the t-distribution with $n-1$ degrees of freedom:
      \[\begin{aligned}
      ConfidenceInterval &= \bar{x} \pm t_{n-1} \frac{s}{\sqrt{n}}
      \end{aligned}\]
    • n is the size of the sample
    • s is the Sample Standard Deviation
    • If you know that the distribution is normal and/or your sample is large, use the z-score instead.
    Source: Statistics For Dummies
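
A minimal Python sketch of a confidence interval for the mean, covering only the large-sample case that uses the z multiple. It takes the sample standard deviation divided by the square root of the sample size as the standard error. The standard library has no t-distribution, so the small-sample case would need a t table or a third-party library. The function name and data are made up for the example.

```python
import math
import statistics
from statistics import NormalDist


def confidence_interval(sample, confidence=0.95):
    """Return (low, high) around the sample mean using the z multiple."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * s / math.sqrt(n)
    return mean - margin, mean + margin


# Hypothetical large sample, only for illustration.
sample = list(range(1, 101))
low, high = confidence_interval(sample, confidence=0.95)
```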