Looking for the right archive format with deduplication

I’m often copying my files to an external disk as a backup option. So far, nothing fancy.

However, I end up with many copies of more or less the same folders over time. And that takes a lot of space for nothing.

Incremental backups, which only add new files to an archive, might look like a solution, but over time the paths change and the structure changes, so the deduplication would break.

One solution could be to format my disk with a filesystem such as ZFS or BTRFS, which have deduplication built in. However, I then couldn't use my disk for quick operations/archiving on Windows, nor on Mac (at least not easily).

So I want an archive format with:

  • Deduplication
  • Some compression (lots of source code files)

One might think, and I often came across that statement while searching the web for a solution, that one shouldn't worry about this because compression algorithms inherently support deduplication. That is actually wrong: they are not made to merge very large patterns. The following experiment shows it with files of random data (incompressible):

-rwxrwxrwx 1 tbarbette tbarbette 10M Apr 17 21:09 binA
-rwxrwxrwx 1 tbarbette tbarbette 10M Apr 17 21:10 binACOPY
-rwxrwxrwx 1 tbarbette tbarbette 10M Apr 17 21:09 binB
-rwxrwxrwx 1 tbarbette tbarbette 256M Apr 17 21:19 binBIG
-rwxrwxrwx 1 tbarbette tbarbette 256M Apr 17 21:19 binBIGCOPY
-rwxrwxrwx 1 tbarbette tbarbette 10M Apr 17 21:09 binC
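
For the record, the test files can be generated along these lines (a sketch, assuming GNU coreutils, where head accepts size suffixes):

head -c 10M /dev/urandom > binA
cp binA binACOPY
head -c 10M /dev/urandom > binB
head -c 256M /dev/urandom > binBIG
cp binBIG binBIGCOPY
head -c 10M /dev/urandom > binC

The copies are byte-identical, so a perfect deduplicator should store each blob only once: 3 x 10M + 256M = 286M, which is about the 287M the best formats reach below.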

Here's the size for a few formats:

Format                   Size   Time     Command
tar.gz                   553M   19.48s   tar -cpazf [archive] [file]
tar.xz                   553M   303s     tar -cpaJf [archive] [file]
zpaq                     287M   18.7s    zpaq add [archive] [file]
wim                      287M   4.4s     7zz a [archive].wim [file]
.7z                      543M   37s      7zz a [archive].7z [file]
.7z (512M dictionary)    287M   140s     7zz a [archive].7z -m0=LZMA2:d512m [file]
.7z (1513M dictionary)   287M   114s     7zz a [archive].7z -m0=LZMA2:d1513m [file]

zpaq is the clear winner, considering it can also compress at the same time. However, it's not very widespread and has no good GUI available, and I wonder about recovery in case of problems.

wim has no compression, so it needs to be encapsulated in something else. The problem is then adding files: I have to decompress the inner wim archive first. Since the idea is to back up the same computer again and again, and one of those archives is 200G, adding a single file would take a huge amount of time, while zpaq can add one quite fast.
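
For reference, this is what incremental additions look like with zpaq (the archive and folder names are placeholders):

zpaq add backup.zpaq ~/files    # first run stores everything
zpaq add backup.zpaq ~/files    # later runs append only new/changed blocks
zpaq list backup.zpaq           # inspect the stored versions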

7zip with a dictionary large enough to hold big duplicates would seem to be a good compromise.

Any input?

MPTCP on Windows with WSL2

Limitations

It is possible to use MPTCP, but WSL2 uses a single virtual interface, which prevents advertising multiple paths. There might be a solution using multiple forwarded ports, but I haven't managed to make it work yet.

Prerequisite

Install Ubuntu in WSL2 (simply look for Ubuntu in the Microsoft Store)

Optional: Allow Windows to keep both Wifi and Ethernet open

Windows will automatically turn off Wifi when Ethernet is plugged in. If you want to try MPTCP over Wifi + Ethernet (or 4G through USB, it's all the same), you must disable this behavior:

1. Open Registry Editor.

2. Go to HKEY_LOCAL_MACHINE\Software\Policies\Microsoft\Windows\WcmSvc\Local.

3. Create/change the fMinimizeConnections registry DWORD to 0.

4. Close Registry Editor and reboot.
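
If you prefer the command line, the same change should be doable from an elevated command prompt (the key and value names are the ones from the steps above):

reg add "HKLM\Software\Policies\Microsoft\Windows\WcmSvc\Local" /v fMinimizeConnections /t REG_DWORD /d 0 /f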

Step 1: Install an MPTCP-compatible Kernel (easier than it sounds!)

sudo apt install build-essential flex bison libssl-dev libelf-dev pahole
git clone https://github.com/microsoft/WSL2-Linux-Kernel.git
cd WSL2-Linux-Kernel
cp Microsoft/config-wsl .config

Edit .config and replace "# CONFIG_MPTCP is not set" with CONFIG_MPTCP=y
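
Or in one line with sed, assuming the line appears exactly as in a default config:

sed -i 's/^# CONFIG_MPTCP is not set$/CONFIG_MPTCP=y/' .config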

make -j4
cp arch/x86/boot/vmlinux.bin /mnt/c/vmlinux

Then shut down WSL in a CMD window:

wsl --shutdown

And to boot into your new kernel, add a file at C:\Users\$USER\.wslconfig containing:

[wsl2]
kernel=C:\\vmlinux

Step 2: Install mptcpd

This provides the "mptcpize" command, which runs a legacy TCP application over MPTCP.

sudo apt install mptcpd

Step 3: Try it out!

sudo apt install iperf
sudo tcpdump -i lo -w capture.pcap            # terminal 1: capture the loopback traffic
mptcpize run iperf -s                         # terminal 2: server
mptcpize run iperf -c 127.0.0.1 -b 1k -l 1    # terminal 3: client

Then open capture.pcap with Wireshark and you should see MPTCP instead of TCP 🙂
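
If you don't see MPTCP, you can double-check that the new kernel has it enabled via sysctl (this knob exists since Linux 5.6); it should report 1:

sysctl net.mptcp.enabled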

Step 4: SSH and failover

[todo!]

GitLab with Docker: Fixing “Error: PG::ConnectionBad” or “DETAIL: The data directory was initialized by PostgreSQL version 9.6, which is not compatible with this version 11.7.”

An issue I had where I could not find any fix online. The closest help I could find was this blog post: https://gotanbl.com/foss/how-update-gitlab-in-docker/, but there are multiple mistakes in the article, and the author is not reachable to fix them, so I'll put the main points here…

The root of the issue (in my case) is that updating GitLab (with Docker at least) is quite cumbersome. If you update almost every week, you can always upgrade to "latest". If you only do it from time to time, you have to upgrade to the latest minor of your current major, then to the first release of the next major, then to its latest minor, and so on… And there is no way to do that automatically. So sometimes, running the "usual command" won't work.

The problem that leads to this error is that the PostgreSQL database is only upgraded by some specific GitLab versions. If you skip the right one, the database never gets upgraded and all subsequent updates will break…

Worse, you'll be told to run some commands to fix this and that… The problem is that the Docker container dies as it fails to start, so you won't even be able to enter those commands.

The solution

First, note your current version:

sudo docker exec -it gitlab bash
cat /opt/gitlab/version-manifest.txt | grep gitlab-ce | awk '{print $2}'

Then, stop and remove the container. It's safe, as the real files, database, etc. are kept in $GITLAB_HOME:

sudo docker exec -t gitlab gitlab-backup create
sudo docker stop gitlab
sudo docker rm gitlab

Basically, you'll have to follow a specific upgrade path, which can be found at https://docs.gitlab.com/ce/update/#upgrade-paths

At the time of writing, this is the path:

8.11.x -> 8.12.0 -> 8.17.7 -> 9.5.10 -> 10.8.7 -> 11.11.8 -> 12.0.12 -> 12.1.17 -> 12.10.14 -> 13.0.14 -> 13.1.11 -> 13.x (latest)

So if your version is 11.10, you'll have to upgrade to 11.11.8, then continue up to the latest.

To update to a given version, do the following.

Verify that the command to run the container matches what you used to install GitLab in the first place; you should re-use exactly the same command, where only the final $VERSION changes:

export GITLAB_HOME=/srv/gitlab
sudo docker run --detach --hostname gitlab.tombarbette.be --env GITLAB_OMNIBUS_CONFIG="external_url 'https://gitlab.tombarbette.be/'; gitlab_rails['gitlab_shell_ssh_port'] = 2022; " --publish 2443:443 --publish 2080:80 --publish 2022:22 --name gitlab --restart always --volume $GITLAB_HOME/config:/etc/gitlab --volume $GITLAB_HOME/logs:/var/log/gitlab --volume $GITLAB_HOME/data:/var/opt/gitlab gitlab/gitlab-ce:$VERSION

$VERSION should be the version with a "-ce.0" suffix, for instance 11.11.8-ce.0; these are the Docker image tags that can be found at https://hub.docker.com/r/gitlab/gitlab-ce/tags?page=1&ordering=last_updated
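
For instance, for the first step of the path above:

export VERSION=11.11.8-ce.0
# then re-run the docker command above, which ends in gitlab/gitlab-ce:$VERSION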

Normally, when you launch that command, the PostgreSQL upgrade is done automatically. If you still see complaints, you can force the upgrade to version XXX with:

gitlab-ctl pg-upgrade -v XXX

Where XXX is the target PostgreSQL major version (11 for the error in the title).
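
Note that gitlab-ctl runs inside the container, so with Docker this would presumably look like:

sudo docker exec -it gitlab gitlab-ctl pg-upgrade -v 11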

After launching a given version, you have to wait for GitLab to completely start, to be sure all migrations have finished.

If in trouble, you might want to check the logs with:

sudo docker logs -f gitlab

The logs in $GITLAB_HOME typically only become meaningful once this problem is fixed and GitLab has completely started, so they were not helpful for me.

So at this point, go back to the version list above and advance one version at a time…

It may seem crazy, but to avoid going through this again you have no choice but to update every week… or you'll have to play with versions again…

Now I run a cron script that backs up and updates GitLab every week. In any case, GitLab is a horror memory-wise and slows down over time, so removing and re-adding the container every week is actually helpful…

#!/bin/bash
sudo docker exec -t gitlab gitlab-backup create
sudo docker stop gitlab
sudo docker rm gitlab
export GITLAB_HOME=/srv/gitlab
sudo docker run --detach --hostname gitlab.tombarbette.be --env GITLAB_OMNIBUS_CONFIG="external_url 'https://gitlab.tombarbette.be/'; gitlab_rails['gitlab_shell_ssh_port'] = 2022; " --publish 2443:443 --publish 2080:80 --publish 2022:22 --name gitlab --restart always --volume $GITLAB_HOME/config:/etc/gitlab --volume $GITLAB_HOME/logs:/var/log/gitlab --volume $GITLAB_HOME/data:/var/opt/gitlab gitlab/gitlab-ce:latest

A poster of our latest work, CrossRSS, a Stateless CPU-Aware Datacenter Load-Balancer

Today we present a poster of our latest work at CoNEXT'20: CrossRSS! CrossRSS is a load-balancer that spreads the load uniformly, even across the cores inside the servers. It uses knowledge of the dispatching done inside the servers (RSS) to purposely select less-loaded cores, without any server modification or inter-core communication on the server. Learn more by watching the short video!

The poster session will be held on the 4th of December at 2:30 CET on the Mozilla VR Hub.

Extended Abstract ; Hub ; Video ; Poster-As-Slides

Dynamic DNS with OVH

It may not be well known, but OVH lets you have your own dynamic DNS if you rent a domain name, surely a better option than the weird paid service from dyndns.org. I will explain how to handle the update on Linux using ddclient.

On the manager

Connect to https://www.ovh.com/manager/web/#/configuration/domain/, select your domain name, and create a new DynHost with the button on the right.

Enter a sub-domain name such as "mydns" (.tombarbette.be), and set it to your actual IP for now, or just 8.8.8.8 for the time being.

That's not all: you then have to create a login that will be allowed to update that DNS entry. Select the second button to manage access and create a new login.

Choose a login (probably the name of the subdomain), the subdomain it applies to, and a password.

On the server

sudo apt install ddclient

Then edit /etc/ddclient.conf:

# OVH DynHost speaks the dyndns2 protocol
protocol=dyndns2
# Discover the current public IP via the web
use=web,web=checkip.dyndns.com
server=www.ovh.com
login=tombarbette.be-mydns
password='password'
mydns.tombarbette.be

Just run "sudo ddclient" to update once, then "sudo service ddclient restart" to keep it updated automatically.
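
To verify that the record was actually updated, you can query it directly (using the subdomain from this example):

dig +short mydns.tombarbette.be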

May this be helpful to someone; personally I just forget all this every time, so I wanted to leave a post-it somewhere.

Creating a dynamic and redundant array with LVM and MDADM

RAID5 creates an array of N+1 drives where N is the number of drives' worth of real data; the capacity of the remaining drive is used to store parity about the others (in practice, the parity information is stored in chunks spread across all drives, not on a single drive). RAID5 allows losing any one of the drives without losing data, thanks to that parity, and is cheaper than RAID1: for the same number of drives, the usable capacity is N out of N+1 instead of half.

mdadm is the tool of choice to build a RAID5 array. Given 3 disks, the command to build a RAID5 array is:

[code lang="bash"]mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1[/code]

Problem is, RAID5 arrays are not easily splittable/shrinkable/resizable; the operation is complex and must be done offline. The solution is to use LVM on top of mdadm to build a big volume group which is "protected" by RAID5, allowing dynamic partitions on top of it:

[code lang="bash"]pvcreate /dev/md0
vgcreate group0 /dev/md0[/code]

And then create multiple, online-resizeable partitions with:

[code lang="bash"]lvcreate /dev/group0 -n system -L 10G
mkfs.ext4 /dev/mapper/group0-system[/code]

[code lang="bash"]lvcreate /dev/group0 -n home -L 50G
mkfs.ext4 /dev/mapper/group0-home[/code]

To resize a partition, one can do:

[code lang="bash"]lvresize /dev/mapper/group0-home -L +10G
resize2fs /dev/mapper/group0-home[/code]

This adds 10G to the logical volume, then grows the filesystem. It even works on the system partition, without needing any reboot.
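
The mdadm array itself can also grow later. A sketch, assuming a fourth disk /dev/sdd1 (the reshape runs in the background and can take many hours):

[code lang="bash"]mdadm --add /dev/md0 /dev/sdd1
mdadm --grow /dev/md0 --raid-devices=4
pvresize /dev/md0[/code]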


Enable Wifi N access point with hostapd

I have been using an ODROID (a Raspberry Pi-like mini-PC, but more powerful) as a Wifi access point for my smartphone and my camera for quite a long time. I had forgotten that my USB Wifi dongle was compatible with Wifi N (only on 2.4GHz), so my hostapd config file was:

[code]interface=wlan3
ssid=Barbette-Chambre
hw_mode=g
channel=11
bridge=br0
wpa=2
wpa_passphrase=YOURPASSPHRASE
wpa_key_mgmt=WPA-PSK
wpa_pairwise=CCMP
rsn_pairwise=CCMP
wpa_ptk_rekey=600[/code]

Here is the speed result with iperf:

[ 4] local 10.0.0.44 port 5001 connected with 10.0.0.175 port 48727
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.2 sec 18.4 MBytes 15.1 Mbits/sec

Normally, this should be 54Mbits/s, but we know wifi is crap…

And to enable Wifi N:

[code]interface=wlan3
ssid=Barbette-Chambre
# Yes, "g" is not an error. Wifi N builds on top of G 😉
hw_mode=g
channel=11
bridge=br0
ieee80211n=1
wmm_enabled=1
country_code=BE
ht_capab=[HT20][HT40][SHORT-GI-20][SHORT-GI-40]
ieee80211d=1
wpa=2
wpa_passphrase=YOURPASSPHRASE
wpa_key_mgmt=WPA-PSK
wpa_pairwise=CCMP
rsn_pairwise=CCMP
wpa_ptk_rekey=600[/code]


And the speed result is now:

[ 4] local 10.0.0.44 port 5001 connected with 10.0.0.175 port 48754
[ 4] 0.0-10.1 sec 30.6 MBytes 25.4 Mbits/sec

Better, but still not the 150Mbits/s of Wifi N… Still, it's an improvement!

ZSH: Open terminal where you left off, for each session

There are some snippets of ZSH configuration available on the web which let you re-open a session in the folder where it was last closed. The problem is that you often launch 3 sessions at the same time, work in them, and then quit/reboot/lose the SSH connections/… So you will re-log 3 sessions, which will all start in the same last-opened folder.

I propose a version that keeps the last folder per session. Each ZSH session receives a number and writes its current folder to a per-session file. When you open a new session, it restores the folder from the file matching its session number.


Add somewhere in .zshrc:

[code]mkdir -p ~/.cwd/
# Number this session by counting the running zsh processes
session_num=`pgrep zsh | wc -l`
# Record the current folder at every cd, both per-session and globally
function cd() {
    builtin cd "$@";
    echo "$PWD" > ~/.cwd/$session_num
    echo "$PWD" > ~/.cwd/last
}
# Restore this session's last folder, falling back to the global one
function cwd() {
    if [ -e ~/.cwd/$session_num ] ; then
        cd "$(cat ~/.cwd/$session_num)"
    else
        cd "$(cat ~/.cwd/last)"
    fi
    echo "This is session #$session_num"
}[/code]

And at the bottom of the file:

[code]cwd[/code]


HTop

Maybe you already know the program "top"; "htop" is its enhanced version, and it is very useful to see how your system handles its load and where that load comes from.

[Screenshot: htop]

You've got your CPU load per core at the top. Here I've got two processors with 8 cores each and hyperthreading activated, so 32 logical cores. The green part is the percentage of time spent in your programs, and the red part is the percentage of time spent in the kernel. You also get the memory usage and the list of processes.

[Screenshot: top – Remember…]

Post it – What to save when reinstalling a server

More of a post-it for myself: what to save when formatting or doing a major upgrade of my Linux servers.


– SQL databases – Saving /var/lib/mysql is possible, but you'll have to change some maintenance passwords. The easiest is to export your databases with the export function of phpMyAdmin or with mysqldump (see the sketch after this list).
– /var/svn – If you have an SVN server
– /var/www – Websites
– /etc – Configuration. Do not re-apply everything! Pick only some config files, for example:

  • /etc/dhcp/dhcpd.conf – Dhcp server config file
  • /etc/php/php.ini (can change according to linux distributions) – Php configuration
  • /etc/apache/sites-enabled – Apache websites
  • /etc/apache/apache2.conf – Apache config
  • /etc/passwd /etc/shadow /etc/group – Users, groups, and passwords. But it may be a good idea to force everyone to change passwords at the same time, and to clean up…
  • And many many mores…

– /home – If you want to keep user data. You shouldn't have to save this when re-formatting your system, because /home should always be a separate partition from the system "/" partition.
– ~/public_html – If you're not saving the home directories, save your local websites…
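
For the database dump mentioned in the first item, a minimal sketch (the output file name is arbitrary):

mysqldump --all-databases --events --routines -u root -p > all-databases.sql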


Any other ideas?