csplit is a POSIX command to split a file into sub-files using a line-delimiter:


csplit [options] <file-name> <pattern>

Patterns

The pattern can be a line number (to split at that line) or a regular expression.

Example: Split the ape-tools log into separate tests

Each test starts with a recording of the contents of the Parameters named-tuple. It can be matched with the pattern:


INFO.*Running\ Parameters

By default csplit stops after the first match, so this command only splits off the lines that come before the first match:


csplit apetools.log /INFO.*Running\ Parameters/

The forward slashes enclose the pattern and tell csplit to save all the text up to, but not including, the matched line as its own file. If you use percent signs instead, csplit skips the text before the match without saving it and keeps only the text from the matched line onward:


csplit apetools.log %INFO.*Running\ Parameters%

So far this isn't hugely useful, since all the tests still end up in the same file. To tell csplit to repeat the split for more than the first match, add a {<count>} argument after the pattern. Since you want all the tests, you can pass in a wildcard (a GNU extension) instead of an exact number:


csplit apetools.log /INFO.*Running\ Parameters/ {*}

File-Names

csplit names the output files it creates based on two parts -- a prefix and a suffix. The default prefix is xx and the default suffix is %02d (the printf format for an integer zero-padded to at least 2 digits). So the previous command would produce a set of files (xx00, xx01, etc.). If you want to make them a little more memorable you can change the prefix and suffix:


csplit apetools.log /INFO.*Running\ Parameters/ {*} --prefix apetest --suffix-format %03d.log

The output for this would be a series of files:


apetest000.log, apetest001.log, ...
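
To make the splitting and naming logic concrete, here is a rough Python sketch of what the command above does. It only illustrates the idea and is not how csplit is implemented; `split_on_pattern`, `piece_names`, and the sample log lines are made up for this example.

```python
import re


def split_on_pattern(lines, pattern):
    """Start a new piece at every line that matches the pattern.

    The first piece holds whatever comes before the first match.
    """
    regex = re.compile(pattern)
    pieces = [[]]
    for line in lines:
        if regex.search(line):
            pieces.append([])  # the matched line begins the next piece
        pieces[-1].append(line)
    return pieces


def piece_names(count, prefix="xx", suffix_format="%02d"):
    """Build output names the way the prefix and suffix format combine."""
    return [prefix + suffix_format % index for index in range(count)]


# Hypothetical log contents, just for illustration
log = [
    "startup noise",
    "INFO: Running Parameters(a=1)",
    "result 1",
    "INFO: Running Parameters(a=2)",
    "result 2",
]
pieces = split_on_pattern(log, r"INFO.*Running Parameters")
names = piece_names(len(pieces), prefix="apetest", suffix_format="%03d.log")
for name, piece in zip(names, pieces):
    print(name, piece)
```

The first file collects the lines before the first test, which is why the tests themselves start at apetest001.log.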

The Sigmoid

Cloistered Monkey

2012-10-27 23:54

The sigmoid is a function with an s shape that's used in logistic regression (and
elsewhere).

The Equation

\[\begin{aligned}
\sigma(z) &= \frac{1}{1+e^{-z}}\\
\sigma(0) &= \frac{1}{2}\\
\lim_{z \to \infty} \sigma(z) &= 1\\
\lim_{z \to -\infty} \sigma(z) &= 0\\
\end{aligned}\]

Applying to a Linear Regression Classifier

Apply the Sigmoid to z multiplied by a weight learned by the classifier.
Classify using round(Sigmoid(z))
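
A minimal Python sketch of the equation and the rounding step. The function names and the `weight` parameter are assumptions for illustration; the two-branch form is only there so that large negative inputs don't overflow `math.exp`.

```python
import math


def sigmoid(z):
    """Return 1 / (1 + e^-z), computed so large |z| can't overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # safe: z is negative, so this is at most 1
    return exp_z / (1.0 + exp_z)


def classify(z, weight=1.0):
    """Label 1 if sigmoid(weight * z) rounds up, else 0."""
    return round(sigmoid(weight * z))


print(sigmoid(0))     # 0.5
print(classify(2.0))  # 1
print(classify(-2.0)) # 0
```

Note that Python's `round` sends exactly 0.5 to 0, so an input of 0 is classified as 0 here.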

Logistic Regression

Cloistered Monkey

2012-10-27 22:31

Main Idea: Find the parameters for a line that partitions a data set.

General Approach (MLIA: p. 84)

Collect Data: any method
Prepare Data: Convert to numeric data if needed.
Analyze: Any method.
Train: Find the optimal coefficients to classify the data.
Use: Given new data, classify it based on the previously classified data.
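
The Train and Use steps might look something like the sketch below. The notes above don't say how the coefficients are found; batch gradient ascent is assumed here as one common choice, and the function names, step size, iteration count, and toy data are all made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def train(rows, labels, step=0.1, iterations=1000):
    """Find weights by batch gradient ascent (an assumed optimizer).

    rows: feature lists, each starting with a constant 1.0 for the intercept
    labels: the 0 or 1 class for each row
    """
    weights = [0.0] * len(rows[0])
    for _ in range(iterations):
        # error = actual label minus predicted probability, per row
        errors = [
            label - sigmoid(sum(w * x for w, x in zip(weights, row)))
            for row, label in zip(rows, labels)
        ]
        # nudge each weight in the direction that shrinks the errors
        for j in range(len(weights)):
            gradient = sum(error * row[j] for error, row in zip(errors, rows))
            weights[j] += step * gradient
    return weights


def predict(weights, row):
    """Classify a new row with the trained weights."""
    return round(sigmoid(sum(w * x for w, x in zip(weights, row))))


# Toy one-feature data set: small values are class 0, large values class 1
rows = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
labels = [0, 0, 0, 1, 1, 1]
weights = train(rows, labels)
print([predict(weights, row) for row in rows])
```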

Pros, Cons, and Data Types

Pros:

Computationally Cheap

Easy to implement

Easy to interpret

Cons:

Susceptible to overfitting

Not always accurate

Data Types:

Numeric Values

Nominal Values

Sidebar on Nominal Values

Nominal Values are data that you can determine to be equivalent to other data or belonging to a set of data, but no ordering or other numeric calculations are possible.

Dichotomous: Belongs to one of two groups

Non-Dichotomous: Belongs to one of multiple groups

Nominal Values are usually summarized using frequencies or percentages (and sometimes summarized by mode).
Column (bar) charts are the best form of graphical representation (along with pie charts)
These are also called categorical or qualitative values

The Arch ARM glibc Error

Cloistered Monkey

2012-10-17 19:56

Background

When installing packages on PogoPlugs running Arch Linux you sometimes get this error:


error: failed to commit transaction (conflicting files)
glibc: /lib exists in file system
Errors occurred, no packages were upgraded.

The Fix

You can fix it using the instructions in this post:


pacman -R pcmciautils
mv /etc/profile.d/locale.sh /etc/profile.d/locale.sh.pacnew
pacman -Syu --ignore glibc

When prompted:

If it asks you if you want to update pacman say n

If it asks you if you want to replace any packages say y

If it asks you if you want to skip any packages say y

After the installation is done:


pacman -Su

When prompted:

Say no to anything to do with pacman

Say yes to anything else (within reason)

What Should Happen

The last step should update glibc so you don't get the errors anymore. To test it, install whatever you were installing when the error occurred.

The Three Parts of a Measurement of Central Tendency

Cloistered Monkey

2012-10-11 23:58

Hypothesis Test
Confidence Interval
Point Estimate (Confidence Interval of 0)

Examples
	NonParametric	Parametric
Hypothesis Test	Sign Test	t Test
Confidence Interval	Associated Test	t Confidence Interval
Point Estimate	sample median	sample mean

Source: Stat 5102 Notes, Charles J. Geyer, April 13, 2003

Finding Binomial Probabilities

Cloistered Monkey

2012-10-03 23:50

The Equation

The probability of getting k successes in n trials is:
\[\begin{aligned}
p(k) &= {n \choose k} p^k (1-p)^{n-k}
\end{aligned}\]

The Variables
Variable	Meaning
$n$	total number of trials
$k$	number of successful trials
$p(k)$	probability of $k$ successes
$n-k$	the number of failures
$p$	probability of success for a single trial
$1-p$	probability of failure for a single trial
${n \choose k}$	number of combinations (ways to pick which $k$ of the $n$ trials succeed)

Calculating $n$ choose $k$

\[\begin{aligned}
{n \choose k} &= \frac{n!}{k!(n - k)!}
\end{aligned}\]

Inequalities

To find the probability of less than $y$ successes, sum the probabilities from $k=0$ to $k=y-1$:
\[\begin{aligned}
p(k < y) &= \sum_{i=0}^{y-1} p(k_i)
\end{aligned}\]
To find the probability of getting greater than $y$ successes, sum the probabilities from $k=y+1$ to $k=n$:
\[\begin{aligned}
p(k > y) &= \sum_{i=y+1}^{n} p(k_i)
\end{aligned}\]
To find the probability of getting $y$ or fewer successes, sum the probabilities from $k=0$ to $k=y$:
\[\begin{aligned}
p(k \leq y) &= \sum_{i=0}^{y} p(k_i)
\end{aligned}\]
To find the probability of getting $y$ or more successes, sum the probabilities from $k=y$ to $k=n$:
\[\begin{aligned}
p(k \geq y) &= \sum_{i=y}^{n} p(k_i)
\end{aligned}\]
To find the probability of from $y$ to $z$ successes, sum the probabilities from $k=y$ to $k=z$:
\[\begin{aligned}
p(y \leq k \leq z) &= \sum_{i=y}^{z} p(k_i)
\end{aligned}\]
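
The sums above are easy to check numerically. This is a small sketch using `math.comb` (Python 3.8 or later); the function names and the values of `n`, `p`, and `y` are arbitrary choices for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def probability_between(low, high, n, p):
    """Probability that low <= k <= high (both ends included)."""
    return sum(binomial_pmf(k, n, p) for k in range(low, high + 1))


n, p, y = 10, 0.5, 3
less_than = probability_between(0, y - 1, n, p)     # p(k < y)
greater_than = probability_between(y + 1, n, n, p)  # p(k > y)
at_most = probability_between(0, y, n, p)           # p(k <= y)
at_least = probability_between(y, n, n, p)          # p(k >= y)
print(less_than, greater_than, at_most, at_least)
```

A useful sanity check is that complementary pairs add up to 1, for example `less_than + at_least`.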

Mean and Standard Deviation

The mean and Standard Deviation of a Binomial Distribution are:
\[\begin{aligned}
\mu &= np\\
\sigma &= \sqrt{np(1-p)}
\end{aligned}\]
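
As a quick sketch, the two formulas translate directly into Python (the function names and the example values are made up for illustration):

```python
from math import sqrt


def binomial_mean(n, p):
    """Expected number of successes in n trials."""
    return n * p


def binomial_sd(n, p):
    """Standard deviation of the number of successes."""
    return sqrt(n * p * (1 - p))


# 100 fair coin flips
print(binomial_mean(100, 0.5), binomial_sd(100, 0.5))
```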
Source: Statistics For Dummies, 2nd edition

The Sign Test

Cloistered Monkey

2012-10-02 22:54

What is the purpose of the sign test?

To test if the population median equals a given value (e.g. the sample median).

What are the assumptions made?

Independence: Each sample is independent of the others.
Identical Distributions
Continuity: There are no ties

Needed for the hypothesis test
Means there is exactly one point $x$ such that $F(x) = 1/2$ and that point is the median ($\theta$)

How do you use it?

Set up the null hypothesis:
\[ \begin{aligned}
H_0 &: m = m_0
\end{aligned} \]

$m$ is the true median, $m_0$ is the proposed median.

Set up the alternative hypothesis as one of:
\[\begin{aligned}
H_a &: m \neq m_0\\
H_a &: m > m_0\\
H_a &: m < m_0
\end{aligned}\]
Collect a random sample from the population.
Assign a 1 or 0 to each value in the data:
- $sign = 0$ if $value < m_0$
- $sign = 1$ if $value > m_0$
- remove the value from the sample if $value = m_0$
Sum all the signs to get $k$:
\[\begin{aligned}
k &= \sum_{i=0}^{n-1} sign(value_i)
\end{aligned} \]
Find $k$ on the binomial distribution:
- $n$ is the current sample size (after removing the values equal to $m_0$)
- $p = 0.5$ (If $H_0$ is true, half are above, half are below)
Find the p-value:
- if $H_a$ uses <, add probabilities for $x \leq k$
- if $H_a$ uses >, add probabilities for $x \geq k$
- if $H_a$ uses $\neq$, double the smaller tail: $2 \times sum(x \geq k)$ when $k$ is above $n/2$, $2 \times sum(x \leq k)$ when it is below
Draw a conclusion:

If $p\text{-value} < \alpha$ (0.05 for 95% confidence), reject $H_0$. Otherwise don't reject $H_0$.
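
The steps above can be sketched in Python. The function name, the `alternative` argument, and the sample data are assumptions for illustration. For the two-sided case this sketch doubles the smaller tail probability (capped at 1), which matches $2 \times sum(x \geq k)$ whenever $k$ is above $n/2$.

```python
from math import comb


def sign_test(values, m0, alternative="two-sided"):
    """Return (k, n, p_value) for the sign test of H0: median = m0.

    alternative is "less", "greater", or "two-sided".
    """
    kept = [value for value in values if value != m0]  # drop ties with m0
    n = len(kept)
    k = sum(1 for value in kept if value > m0)  # count of the 1 signs

    def pmf(x):
        # binomial probability of x successes with p = 0.5
        return comb(n, x) * 0.5**n

    lower = sum(pmf(x) for x in range(0, k + 1))  # probabilities for x <= k
    upper = sum(pmf(x) for x in range(k, n + 1))  # probabilities for x >= k
    if alternative == "less":
        p_value = lower
    elif alternative == "greater":
        p_value = upper
    else:
        p_value = min(1.0, 2 * min(lower, upper))
    return k, n, p_value


# Made-up sample, testing whether the median is greater than 10
sample = [12, 15, 9, 20, 17, 14, 10, 18, 16, 19]
k, n, p_value = sign_test(sample, 10, alternative="greater")
print(k, n, p_value)
print("reject H0" if p_value < 0.05 else "don't reject H0")
```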

Signs, Ranks, and Signed Ranks

Cloistered Monkey

2012-10-02 21:07

The following are used in non-parametric statistical tests.

What is a sign?

a value $\in \{0, 1\}$ assigned to each member of a data-set
If the data-member is greater than the test-value, it's assigned a 1. Otherwise it gets a 0.

What is a rank?

The number representing a data point's place in the ordered set (1-based index)
If multiple points have the same value each is reassigned the mean of their original Ranks.

What is an absolute rank?

Each point is reassigned its absolute value
Then the set is Ranked

What is a signed rank?

Assign a 1 to each data point greater than the test-value.
Assign a 0 to the remaining points.
Find the absolute rank.
For each point, assign it a signed rank:

\[SignedRank = Sign \times AbsoluteRank\]

What is a rank sum?

the sum of all the ranks for a data-set.
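
Here is a small Python sketch of these definitions as they are written above (ranks with ties averaged, absolute ranks from the absolute value of each point, and 0/1 signs). The function names and the sample values are made up for illustration.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their original ranks."""
    ordered = sorted(values)
    mean_rank = {}
    for value in set(values):
        first = ordered.index(value) + 1  # rank of the first tied value
        count = ordered.count(value)      # how many points share the value
        mean_rank[value] = first + (count - 1) / 2
    return [mean_rank[value] for value in values]


def signed_ranks(values, test_value):
    """Sign (0 or 1) times the absolute rank, for every point."""
    signs = [1 if value > test_value else 0 for value in values]
    absolute_ranks = ranks([abs(value) for value in values])
    return [sign * rank for sign, rank in zip(signs, absolute_ranks)]


data = [3, -1, 4, -1, 5]
print(ranks(data))            # the two -1 values share rank 1.5
print(signed_ranks(data, 0))
print(sum(ranks(data)))       # the rank sum
```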

Source: Statistics II for Dummies

PogoPlug Arch Linux Installation

Cloistered Monkey

2012-10-02 05:20

Background

The PogoPlug is a network-attached storage device that by default runs BusyBox. By installing Arch Linux you gain access to a larger set of installable packages.

Steps

Power up the PogoPlug and attach it to your LAN via an ethernet cable.
Go to pogoplug's site and activate it.
Once you activate the device click on your login name at the top left of the screen and choose:
- Settings > Security > Enable SSH access for this PogoPlug device
SSH into your device (the password is what you set it to in the previous step):
```
ssh root@<IP address>
```
Kill the PogoPlug software:
```
killall hbwd
```

Download and install a bootloader:


```
cd /tmp
wget http://jeff.doozan.com/debian/uboot/install_uboot_mtd0.sh
chmod +x install_uboot_mtd0.sh
./install_uboot_mtd0.sh
```

Insert a USB drive into the PogoPlug
Start fdisk on the PogoPlug:
```
/sbin/fdisk /dev/sda
```
At the prompt enter the commands to create a partition:
```
o
p
n
p
1
```
Accept all the defaults after entering 1, then write the partition table and exit when you reach the prompt Command (m for help):
```
w
```

Create the filesystem on the USB:


```
wget http://archlinuxarm.org/os/pogoplug/mke2fs
chmod 755 mke2fs
./mke2fs /dev/sda1
mkdir usb
mount /dev/sda1 usb
```

Download and install Arch Linux:


```
cd usb
wget http://archlinuxarm.org/os/ArchLinuxARM-armv5te-latest.tar.gz
tar -xzvf ArchLinuxARM-armv5te-*.tar.gz
rm ArchLinuxARM-armv5te-*.tar.gz
sync
```

Clean up and reboot:
```
cd ..
umount usb
/sbin/reboot
```
The new SSH login and password are both root.

Standard Deviation, Standard Error, and Confidence Intervals

Cloistered Monkey

2012-10-01 21:28

What is Standard Deviation?

The spread around the center of a normal distribution
The amount of variation in a population
The distance from the mean to the point where the normal curve changes from concave down to concave up

What is the Sample Standard Deviation?

The variation around the mean of a sample from the population

Calculation:
\[\begin{aligned}
s &= \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}
\end{aligned} \]
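
As a sketch, the calculation sums the squared deviations of every point from the sample mean and divides by $n-1$ before taking the root. The function name and the data are made up for illustration; the standard library's `statistics.stdev` computes the same quantity and is used here only as a cross-check.

```python
from math import sqrt
import statistics


def sample_standard_deviation(data):
    """Square root of the summed squared deviations over n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]
s = sample_standard_deviation(data)
print(s, statistics.stdev(data))  # the two values should agree
```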

What is the Empirical Rule?

68% of the data falls within 1 Standard Deviation of the center
95% of the data falls within 2 Standard Deviations of the center
99.7% of the data falls within 3 Standard Deviations of the center
If the percentages don't match the data, the distribution isn't normal

What is the Standard Error?

The amount of variation in the measure of central tendency (e.g. the mean) from sample to sample:
\[ \begin{aligned}
SE &= \frac{\sigma}{\sqrt{n}}
\end{aligned}\]
The Sample Standard Deviation measures the variation within a sample; the Standard Error measures how far the Sample Mean is likely to be from the Population Mean.

What is a Margin Of Error?

A multiple of the Standard Error
The multiple is based on the Confidence Interval you want
For example: 68% Confidence uses a multiple of 1 (see the Empirical Rule)

\[\begin{aligned}
MOE &= multiple \times StandardError
\end{aligned}\]
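
A short sketch of the Standard Error and Margin of Error formulas. The function names and the example numbers are made up for illustration; the multiple of 2 corresponds to roughly 95% confidence under the Empirical Rule.

```python
from math import sqrt


def standard_error(standard_deviation, n):
    """Standard deviation divided by the square root of the sample size."""
    return standard_deviation / sqrt(n)


def margin_of_error(multiple, standard_deviation, n):
    """The chosen multiple times the standard error."""
    return multiple * standard_error(standard_deviation, n)


# Standard deviation of 10 with a sample of 100
print(standard_error(10, 100))      # 1.0
print(margin_of_error(2, 10, 100))  # 2.0
```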

What is a Confidence Interval?

Level of Confidence based on the percent of the distribution your Margin of Error covers
If you have a small data set or you don't know that the distribution is normal use the t-distribution:
\[\begin{aligned}
ConfidenceInterval &= \bar{x} \pm t_{n-1} \frac{s}{\sqrt{n}}
\end{aligned}\]
n is the size of the sample
s is the Sample Standard Deviation
If you know that the distribution is normal and/or your sample is large, use the z-score instead.
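
For the large-sample case, a z-based interval can be sketched with only the standard library (`statistics.NormalDist`, Python 3.8 or later). The function name and the data are made up for illustration. The standard library has no t-distribution, so the t version would need a table or a package such as SciPy for the multiple.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_confidence_interval(data, confidence=0.95):
    """Return (low, high) for the mean using a z multiple."""
    n = len(data)
    center = mean(data)
    # z multiple that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(data) / sqrt(n)
    return center - margin, center + margin


# Made-up measurements
data = [9.8, 10.2, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.1, 9.9]
low, high = z_confidence_interval(data)
print(low, high)
```

A sample of 10 is really too small for the z multiple; it is used here only to keep the example short.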

Source: Statistics For Dummies