Archive formats...

Don Y

I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

What's come to mind includes (I'm not being pedantic, here -- sometimes
using file extensions to represent file formats):

7z
ace
apk
arc
arj
brotli
bzip2
cab
cfs
compress
cpio
cpt
dar
dmg
egg
gzip
jar
lbr
lha
lz4
lzip
lzma
lzop
lzx
mpq
pea
rar
rpm
shar
sit
sitx
sq
sqx
tar
xar
xz
zip
zoo
zopfli
zpaq
zstd

Daunting list, eh? Any others that I have overlooked?
 
On 11/30/2021 9:56 PM, Don Y wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

Daunting list, eh? Any others that I have overlooked?

Ugh! Skip that. I've apparently missed *dozens* (scores?)... :<
 
On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

If by compressed you mean made smaller, that's obviously false.

--
Jasen.
 
On 11/30/2021 11:53 PM, Jasen Betts wrote:
On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

If by compressed you mean made smaller, that's obviously false.

No, I mean "original (uncompressed) content being made obscure,
without the *intent* of *hiding* the content". I.e., you can't
(generally) peek into a compressed archive and understand what it
contains, without some assistance from tools.

Given "foo.zip", tell me *anything* about foo? Repeat for
"folder.tgz"? Or, "volume.imz"? (yet, there's nothing that
prevents you from using a tool to examine the contents;
unlike "message.pem")

[And, the *contents* of the archive -- along with the compression
algorithm used -- determine if the result is (physically) smaller
or larger than the original. Compressing (TRULY) random data will
inevitably result in a larger result. Compressing data with
"predictable" patterns will most often result in a space savings
if the chosen compressor knows how to exploit those patterns.]
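
(A quick way to see that for yourself -- just a sketch, with made-up filenames:

dd if=/dev/urandom of=random.bin bs=1M count=1 # 1 MB of pseudo-random data
gzip -k random.bin # -k keeps the original alongside the .gz
ls -l random.bin random.bin.gz # the .gz is typically slightly LARGER

Already-random input leaves the compressor nothing to exploit, so the
format's own bookkeeping overhead makes the output grow.)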
 
On 01/12/2021 08:12, Don Y wrote:
On 11/30/2021 11:53 PM, Jasen Betts wrote:
On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

If by compressed you mean made smaller, that's obviously false.

No, I mean "original (uncompressed) content being made obscure,
without the *intent* of *hiding* the content".  I.e., you can't
(generally) peek into a compressed archive and understand what it
contains, without some assistance from tools.

Some archive formats have the directory in a form where you can read it
fairly easily even if it isn't quite in plaintext.
Given "foo.zip", tell me *anything* about foo?  Repeat for
"folder.tgz"?  Or, "volume.imz"?  (yet, there's nothing that
prevents you from using a tool to examine the contents;
unlike "message.pem")

If you want to compare the effectiveness of the different algorithms
then compressing a chunk of web content or a random executable will span
a reasonable range of important use cases.

There was a nice DOS tool called something like ifl that could read the
contents of most common archive formats. Look for it on Simtel.
[And, the *contents* of the archive -- along with the compression
algorithm used -- determine if the result is (physically) smaller
or larger than the original.  Compressing (TRULY) random data will
inevitably result in a larger result.  Compressing data with
"predictable" patterns will most often result in a space savings
if the chosen compressor knows how to exploit those patterns.]

Bytewise entropy of the source material will give you a reasonable
independent estimate of how compressible or otherwise it is.

Highly compressed material tends toward ln(256) ~ 5.545
png ~ 5.25
jpg ~ 5.20
exe's ~ 4.4
text ~ 2.0

It can be used to classify unknown data to likely type of file.

(ignoring the sign) H = sum_i p_i ln(p_i)

p_i = n_i / N

where n_i = number of times token i appears, and
N = sum_over_i(n_i) = file size

It will give you a fair guess at whether a given file can still be
compressed by a general compression algorithm. You have to work
incredibly hard to get the last 2% reduction in size.
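
(For the curious, that bytewise entropy is easy to estimate from a shell;
a rough sketch, "somefile" being a placeholder:

od -An -v -tu1 somefile | tr -s ' ' '\n' | grep -v '^$' | sort -n | uniq -c |
awk '{ n[$2] = $1; N += $1 }
END { for (i in n) { p = n[i]/N; H -= p*log(p) }
printf "entropy %.3f nats/byte (max %.3f)\n", H, log(256) }'

Values close to ln(256) mean a general-purpose compressor has little left to
work with.)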

--
Regards,
Martin Brown
 
On 12/1/2021 2:36 AM, Martin Brown wrote:
On 01/12/2021 08:12, Don Y wrote:
<snip>

No, I mean "original (uncompressed) content being made obscure,
without the *intent* of *hiding* the content". I.e., you can't
(generally) peek into a compressed archive and understand what it
contains, without some assistance from tools.

Some archive formats have the directory in a form where you can read it fairly
easily even if it isn't quite in plaintext.

Yes. And, "image" files often allow one to read the "plaintext"
of the contained file -- though often not in a contiguous manner.

Given "foo.zip", tell me *anything* about foo? Repeat for
"folder.tgz"? Or, "volume.imz"? (yet, there's nothing that
prevents you from using a tool to examine the contents;
unlike "message.pem")

If you want to compare the effectiveness of the different algorithms then
compressing a chunk of web content or a random executable will span a
reasonable range of important use cases.

There was a nice DOS tool called something like ifl that could read the
contents of most common archive formats. Look for it on Simtel.

Right now, I'm just looking to see how many different ways file
contents are typically "obscured" (without "information hiding"
being an explicit goal)

[And, the *contents* of the archive -- along with the compression
algorithm used -- determine if the result is (physically) smaller
or larger than the original. Compressing (TRULY) random data will
inevitably result in a larger result. Compressing data with
"predictable" patterns will most often result in a space savings
if the chosen compressor knows how to exploit those patterns.]

Bytewise entropy of the source material will give you a reasonable independent
estimate of how compressible or otherwise it is.

Highly compressed material tends toward ln(256) ~ 5.545
png ~ 5.25
jpg ~ 5.20
exe's ~ 4.4
text ~ 2.0

It can be used to classify unknown data to likely type of file.

(ignoring the sign) H = sum_i p_i ln(p_i)

p_i = n_i / N

where n_i = number of times token i appears, and
N = sum_over_i(n_i) = file size

It will give you a fair guess at whether a given file can still be compressed
by a general compression algorithm. You have to work incredibly hard to get the
last 2% reduction in size.


I'm not concerned about the compressibility of the file or how
effective a particular tool is at achieving that compression.

Rather, the fact that compressors are commonly applied to
files (and "archives" are files) and, as a result, alter
their representation as a side-effect of their goal.

The only other "regularly applied" tools that alter file
contents typically involve encryption (of varying degrees).

[I can think of no other reason to alter a file's content]
 
On 01/12/2021 10:56, Don Y wrote:
On 12/1/2021 2:36 AM, Martin Brown wrote:

It will give you a fair guess at whether a given file can still be
compressed by a general compression algorithm. You have to work
incredibly hard to get the last 2% reduction in size.

I'm not concerned about the compressibility of the file or how
effective a particular tool is at achieving that compression.

Rather, the fact that compressors are commonly applied to
files (and "archives" are files) and, as a result, alter
their representation as a side-effect of their goal.

There are quite a few backup programs that use their own proprietary
encoding and compression, sometimes allowing a tradeoff of speed vs
redundancy vs compression. The one I use historically names its files
with extensions .000 .001 and makes them just under 2^32 bytes each.

Backups are not much use if they easily become write only, read never.

Cue April 1st adverts for infinite capacity write only memory...

The only other "regularly applied" tools that alter file
contents typically involve encryption (of varying degrees).

[I can think of no other reason to alter a file's content]

To make it more compressible is one such reason; lossy compression always wins
out over lossless unless it is a very peculiar edge case.

--
Regards,
Martin Brown
 
On a sunny day (Tue, 30 Nov 2021 21:56:46 -0700) it happened Don Y
<blockedofcourse@foo.invalid> wrote in <so6vaj$q31$1@dont-email.me>:

I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

<snip>

Daunting list, eh? Any others that I have overlooked?

Probably.
I usually use
tar -zcvf my_archive.tgz /xx/yyy/*

The 'v' lists what it does, including filenames.
You could probably use
tar -zcvf my_archive.tgz /xx/yyy/* 2>my_archive_contents.txt
to get a plain text content file, and save it with the archive.
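
(Depending on the tar implementation the -v listing may land on stdout rather
than stderr; a more portable way to capture a contents list is to ask the
finished archive directly, e.g.

tar -ztvf my_archive.tgz > my_archive_contents.txt
)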

Ultimately YOU decide how you compress / store / encrypt, whatever.
US launch codes are all zeros, I've read, as in a stress situation those poor guys cannot
remember anything more complicated, so something like Kamalatypezerozerozero.txt would be a good compression example.
Standards.....

This message was muddified by BiBo

ISO format for Blu-ray
contains 3 movies in .ts (transport stream format):

disc number:
991
Thu Aug 23 14:24:40 CEST 2018
BD-R25GB
ext2
Mediarange 4x inkjet printable
LG BH10LS38
Method:
PLEASE STOP ANY RTL_SDR (write data errors observed when that is running)!
Make sure you have enough disk space.
dd if=/dev/zero bs=100000000 count=242 > bluray.iso
mke2fs bluray.iso
mount -o loop=/dev/loop0 bluray.iso /mnt/loop
cp ... /mnt/loop/
du /mnt/loop
#umount /dev/loop0
umount /mnt/loop
cd /mnt/sda1/video/satellite
growisofs -speed=4 -dvd-compat -Z /dev/dvd=bluray.iso
dvdimagecmp -a bluray.iso -b /dev/dvd
l /mnt/loop
total 19283944
-rw-r--r-- 1 root root 3906700000 Aug 19 08:38 bond_golden_eye_1995.ts amovie
-rw-r--r-- 1 root root 12913150172 Aug 19 20:15 pirates_of_the_caribbean_dead_mans_chest_2006_HD.ts amovie
-rw-r--r-- 1 root root 2907600000 Aug 20 12:45 bond_spectre_2015.ts amovie


So the .ts contains video in mpeg2 format (so compressed) and several audio channels in mp2 (so also compressed) format.

There seem to be new audio and video compression methods every few years...
If you want a list of many of those, type
ffmpeg -formats
File formats:
D. = Demuxing supported
.E = Muxing supported
--
E 3g2 3GP2 (3GPP2 file format)
E 3gp 3GP (3GPP file format)
D 4xm 4X Technologies
E a64 a64 - video for Commodore 64
D aac raw ADTS AAC (Advanced Audio Coding)
DE ac3 raw AC-3
D act ACT Voice file format
D adf Artworx Data Format
E adts ADTS AAC (Advanced Audio Coding)
DE adx CRI ADX
D aea MD STUDIO audio
D afc AFC
DE aiff Audio IFF
DE alaw PCM A-law
DE alsa ALSA audio output
DE amr 3GPP AMR
D anm Deluxe Paint Animation
D apc CRYO APC
D ape Monkey's Audio
D aqtitle AQTitle subtitles
DE asf ASF (Advanced / Active Streaming Format)
E asf_stream ASF (Advanced / Active Streaming Format)
DE ass SSA (SubStation Alpha) subtitle
DE ast AST (Audio Stream)
DE au Sun AU
DE avi AVI (Audio Video Interleaved)
E avm2 SWF (ShockWave Flash) (AVM2)
D avr AVR (Audio Visual Research)
D avs AVS
D bethsoftvid Bethesda Softworks VID
D bfi Brute Force & Ignorance
D bin Binary text
D bink Bink
DE bit G.729 BIT file format
D bmv Discworld II BMV
D brstm BRSTM (Binary Revolution Stream)
D c93 Interplay C93
DE caf Apple CAF (Core Audio Format)
DE cavsvideo raw Chinese AVS (Audio Video Standard) video
D cdg CD Graphics
D cdxl Commodore CDXL video
D concat Virtual concatenation script
E crc CRC testing
DE daud D-Cinema audio
D dfa Chronomaster DFA
DE dirac raw Dirac
DE dnxhd raw DNxHD (SMPTE VC-3)
D dsicin Delphine Software International CIN
DE dts raw DTS
D dtshd raw DTS-HD
DE dv DV (Digital Video)
D dv1394 DV1394 A/V grab
E dvd MPEG-2 PS (DVD VOB)
D dxa DXA
D ea Electronic Arts Multimedia
D ea_cdata Electronic Arts cdata
DE eac3 raw E-AC-3
D epaf Ensoniq Paris Audio File
DE f32be PCM 32-bit floating-point big-endian
DE f32le PCM 32-bit floating-point little-endian
E f4v F4V Adobe Flash Video
DE f64be PCM 64-bit floating-point big-endian
DE f64le PCM 64-bit floating-point little-endian
D fbdev Linux framebuffer
DE ffm FFM (FFserver live feed)
DE ffmetadata FFmpeg metadata in text
D film_cpk Sega FILM / CPK
DE filmstrip Adobe Filmstrip
DE flac raw FLAC
D flic FLI/FLC/FLX animation
DE flv FLV (Flash Video)
E framecrc framecrc testing
E framemd5 Per-frame MD5 testing
D frm Megalux Frame
DE g722 raw G.722
DE g723_1 raw G.723.1
D g729 G.729 raw format demuxer
DE gif GIF Animation
D gsm raw GSM
DE gxf GXF (General eXchange Format)
DE h261 raw H.261
DE h263 raw H.263
DE h264 raw H.264 video
E hls Apple HTTP Live Streaming
D hls,applehttp Apple HTTP Live Streaming
DE ico Microsoft Windows ICO
D idcin id Cinematic
D idf iCE Draw File
D iff IFF (Interchange File Format)
DE ilbc iLBC storage
DE image2 image2 sequence
DE image2pipe piped image2 sequence
D ingenient raw Ingenient MJPEG
D ipmovie Interplay MVE
E ipod iPod H.264 MP4 (MPEG-4 Part 14)
DE ircam Berkeley/IRCAM/CARL Sound Format
E ismv ISMV/ISMA (Smooth Streaming)
D iss Funcom ISS
D iv8 IndigoVision 8000 video
DE ivf On2 IVF
DE jacosub JACOsub subtitle format
D jv Bitmap Brothers JV
DE latm LOAS/LATM
D lavfi Libavfilter virtual input device
D lmlm4 raw lmlm4
D loas LOAS AudioSyncStream
D lvf LVF
D lxf VR native stream (LXF)
DE m4v raw MPEG-4 video
E matroska Matroska
D matroska,webm Matroska / WebM
E md5 MD5 testing
D mgsts Metal Gear Solid: The Twin Snakes
DE microdvd MicroDVD subtitle format
DE mjpeg raw MJPEG video
E mkvtimestamp_v2 extract pts as timecode v2 format, as defined by mkvtoolnix
DE mlp raw MLP
D mm American Laser Games MM
DE mmf Yamaha SMAF
E mov QuickTime / MOV
D mov,mp4,m4a,3gp,3g2,mj2 QuickTime / MOV
E mp2 MP2 (MPEG audio layer 2)
DE mp3 MP3 (MPEG audio layer 3)
E mp4 MP4 (MPEG-4 Part 14)
D mpc Musepack
D mpc8 Musepack SV8
DE mpeg MPEG-1 Systems / MPEG program stream
E mpeg1video raw MPEG-1 video
E mpeg2video raw MPEG-2 video
DE mpegts MPEG-TS (MPEG-2 Transport Stream)
D mpegtsraw raw MPEG-TS (MPEG-2 Transport Stream)
D mpegvideo raw MPEG video
E mpjpeg MIME multipart JPEG
D mpl2 MPL2 subtitles
D mpsub MPlayer subtitles
D msnwctcp MSN TCP Webcam stream
D mtv MTV
DE mulaw PCM mu-law
D mv Silicon Graphics Movie
D mvi Motion Pixels MVI
DE mxf MXF (Material eXchange Format)
E mxf_d10 MXF (Material eXchange Format) D-10 Mapping
D mxg MxPEG clip
D nc NC camera feed
D nistsphere NIST SPeech HEader REsources
D nsv Nullsoft Streaming Video
E null raw null video
DE nut NUT
D nuv NuppelVideo
DE ogg Ogg
DE oma Sony OpenMG audio
DE oss OSS (Open Sound System) playback
D paf Amazing Studio Packed Animation File
D pjs PJS (Phoenix Japanimation Society) subtitles
D pmp Playstation Portable PMP
E psp PSP MP4 (MPEG-4 Part 14)
D psxstr Sony Playstation STR
D pva TechnoTrend PVA
D pvf PVF (Portable Voice Format)
D qcp QCP
D r3d REDCODE R3D
DE rawvideo raw video
E rcv VC-1 test bitstream
D realtext RealText subtitle format
D rl2 RL2
DE rm RealMedia
DE roq raw id RoQ
D rpl RPL / ARMovie
DE rso Lego Mindstorms RSO
DE rtp RTP output
DE rtsp RTSP output
DE s16be PCM signed 16-bit big-endian
DE s16le PCM signed 16-bit little-endian
DE s24be PCM signed 24-bit big-endian
DE s24le PCM signed 24-bit little-endian
DE s32be PCM signed 32-bit big-endian
DE s32le PCM signed 32-bit little-endian
DE s8 PCM signed 8-bit
D sami SAMI subtitle format
DE sap SAP output
D sbg SBaGen binaural beats script
E sdl SDL output device
D sdp SDP
E segment segment
D shn raw Shorten
D siff Beam Software SIFF
DE smjpeg Loki SDL MJPEG
D smk Smacker
E smoothstreaming Smooth Streaming Muxer
D smush LucasArts Smush
D sol Sierra SOL
DE sox SoX native
DE spdif IEC 61937 (used on S/PDIF - IEC958)
DE srt SubRip subtitle
E stream_segment,ssegment streaming segment muxer
D subviewer SubViewer subtitle format
D subviewer1 SubViewer v1 subtitle format
E svcd MPEG-2 PS (SVCD)
DE swf SWF (ShockWave Flash)
D tak raw TAK
D tedcaptions TED Talks captions
D thp THP
D tiertexseq Tiertex Limited SEQ
D tmv 8088flex TMV
DE truehd raw TrueHD
D tta TTA (True Audio)
D tty Tele-typewriter
D txd Renderware TeXture Dictionary
DE u16be PCM unsigned 16-bit big-endian
DE u16le PCM unsigned 16-bit little-endian
DE u24be PCM unsigned 24-bit big-endian
DE u24le PCM unsigned 24-bit little-endian
DE u32be PCM unsigned 32-bit big-endian
DE u32le PCM unsigned 32-bit little-endian
DE u8 PCM unsigned 8-bit
D vc1 raw VC-1
D vc1test VC-1 test bitstream
E vcd MPEG-1 Systems / MPEG program stream (VCD)
D video4linux2,v4l2 Video4Linux2 device grab
D vivo Vivo
D vmd Sierra VMD
E vob MPEG-2 PS (VOB)
D vobsub VobSub subtitle format
DE voc Creative Voice
D vplayer VPlayer subtitles
D vqf Nippon Telegraph and Telephone Corporation (NTT) TwinVQ
DE w64 Sony Wave64
DE wav WAV / WAVE (Waveform Audio)
D wc3movie Wing Commander III movie
E webm WebM
D webvtt WebVTT subtitle
D wsaud Westwood Studios audio
D wsvqa Westwood Studios VQA
DE wtv Windows Television (WTV)
DE wv WavPack
D xa Maxis XA
D xbin eXtended BINary text (XBIN)
D xmv Microsoft XMV
D xwma Microsoft xWMA
D yop Psygnosis YOP
DE yuv4mpegpipe YUV4MPEG pipe

If you want to know about some format media file use
mediainfo filename

For example:
mediainfo /root/martian_lauching_1.avi
Complete name : /root/martian_lauching_1.avi
Format : AVI
Format/Info : Audio Video Interleave
File size : 391 KiB
Duration : 1s 8ms
Overall bit rate : 3 180 Kbps
Writing application : Lavf54.6.100

Video
Format : MPEG Video
Codec ID : mpg2
Codec ID/Info : (MPEG-1/2) FFmpeg
Codec ID/Hint : Ffmpeg
Duration : 980ms
Bit rate : 2 973 Kbps
Width : 720 pixels
Height : 576 pixels
Display aspect ratio : 5/4
Frame rate : 50.000 fps
Standard : PAL
Resolution : 24 bits
Bits/(Pixel*Frame) : 0.143
Stream size : 356 KiB (91%)

Audio
Format : MPEG Audio
Format version : Version 1
Format profile : Layer 2
Codec ID : 50
Codec ID/Hint : MP1
Duration : 1s 8ms
Bit rate mode : Constant
Bit rate : 192 Kbps
Channel(s) : 2 channels
Sampling rate : 48.0 KHz
Resolution : 16 bits
Stream size : 23.6 KiB (6%)
Alignment : Aligned on interleaves
Interleave, duration : 23 ms (1.17 video frame)
Interleave, preload duration : 240 ms
 
Jan Panteltje wrote:
> Don Y wrote:

<snip>

Daunting list, eh? Any others that I have overlooked?

Probably.
I usually use
tar -zcvf my_archive.tgz /xx/yyy/*

Some people use tar.gz as the suffix.

Self-extracting Microsoft .EXE files make it easy on the user.

Self-extracting unix shell archives are absolutely elegant:

https://alt.sources.narkive.com/k7MHsAnN/example-code-for-reading-and-writing-data-via-bscan-spartan6-with-urjtag-and-python

Although shell archives use .shar as a suffix by convention, they
actually accommodate any old suffix. Here's how you unpack a shar:

sh filename.shar
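
The trick is that the archive *is* a shell script: the payload rides along in
here-documents and gets recreated when the script runs. A minimal hand-rolled
sketch (filename made up; real shar tools also add checksums and uuencode
binaries):

#!/bin/sh
# minimal self-extracting shell archive: running it recreates hello.txt
cat > hello.txt << 'SHAR_EOF'
Hello from inside the archive.
SHAR_EOF
echo "extracted hello.txt"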

There's more archive (and even more compression) suffixes, ranked by
popularity, at the link below:

https://fileinfo.com/filetypes/compressed

Danke,

--
Don, KB7RPU, https://www.qsl.net/kb7rpu
There was a young lady named Bright Whose speed was far faster than light;
She set out one day In a relative way And returned on the previous night.
 
On 12/1/2021 4:25 AM, Martin Brown wrote:
On 01/12/2021 10:56, Don Y wrote:
On 12/1/2021 2:36 AM, Martin Brown wrote:

It will give you a fair guess at whether a given file can still be
compressed by a general compression algorithm. You have to work incredibly
hard to get the last 2% reduction in size.

I'm not concerned about the compressibility of the file or how
effective a particular tool is at achieving that compression.

Rather, the fact that compressors are commonly applied to
files (and "archives" are files) and, as a result, alter
their representation as a side-effect of their goal.

There are quite a few backup programs that use their own proprietary encoding
and compression sometimes allowing a tradeoff of speed vs redundancy vs
compression. The one I use historically names its files with extensions .000
.001 and makes them just under 2^32 bytes each.

Backups are not much use if they easily become write only, read never.

Agreed. There are also many file extensions that are proprietary
reassignments of standard file formats (e.g., .whatever being
ZIP under a different -- less obvious -- name)

> Cue April 1st adverts for infinite capacity write only memory...

One of my oldest "saved adverts" was for a Signetics WoM.

The only other "regularly applied" tools that alter file
contents typically involve encryption (of varying degrees).

[I can think of no other reason to alter a file's content]

To make it more compressible is one such reason; lossy compression always wins out
over lossless unless it is a very peculiar edge case.

Yes, but then you're not encoding the original file -- just an approximation
of it!

If I strip the EXIF tags from a photo, have I changed the file?
(I've certainly made it smaller!).

One can "translate" Inuit to English and get an *approximation* of
what was said. Converting back to Inuit will likely not give you
the same "statement", though.

And, folks don't casually convert text files into ".inu" format
for any particular reason! :>
 
On 12/1/2021 4:37 AM, Jan Panteltje wrote:
On a sunny day (Tue, 30 Nov 2021 21:56:46 -0700) it happened Don Y
blockedofcourse@foo.invalid> wrote in <so6vaj$q31$1@dont-email.me>:

I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

Daunting list, eh? Any others that I have overlooked?

Probably.
I usually use
tar -zcvf my_archive.tgz /xx/yyy/*

'tar cvpf' in my case.

> Ultimately YOU decide how you compress / store. encrypt, whatever.

Or, *someone else* has already made that decision. If the format
is "well known" *and* not protected with a key, you can recover the
original file(s) at a later date.

And, potentially recompress using a different algorithm. You
now have three versions of the same file: the original
compressed form, the recovered file and the newly compressed
form. All are, effectively, the same file.
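
Something along these lines (made-up names; -k = keep input, on reasonably
recent GNU tools):

gunzip -k foo.txt.gz # recover the original, keep the compressed copy
xz -k foo.txt # recompress with a different algorithm
ls -l foo.txt foo.txt.gz foo.txt.xz # three representations of one file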

ISO format for blueray
contains 3 movies in .ts (transport stream format):

Hmmm... I'd not considered media formats.

Different CODECs produce different outputs. I'm not sure
you can take "source" and process it through two different
CODECs and still recover the (exact!) *same* source from
each of them -- let alone try to convert from one CODEC
to another.

OTOH, *containers* can arguably act as different "envelopes"
on the same encoded streams. So, converting from one
container to another is a completely reversible process
(barring the presence of additional nonportable metadata)

> mke2fs bluray.iso
> mount -o loop=/dev/loop0 bluray.iso /mnt/loop

I completely missed the "image formats": iso, dd, vmdk, etc.

Your mount(8) example is a perfect example of the point I am making:
once mounted, you effectively have "recovered" the original files
contained in that image. You now have an accessible *copy* of the
files that are contained in that image!
 
On 12/1/2021 8:02 AM, Don wrote:
Jan Panteltje wrote:
Don Y wrote:

snip

Daunting list, eh? Any others that I have overlooked?

Probably.
I usually use
tar -zcvf my_archive.tgz /xx/yyy/*

Some people use tar.gz as the suffix.

And, some people pipe tar to gzip instead of using the -z switch.
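
Something like:

tar -cvf - /xx/yyy/* | gzip > my_archive.tar.gz
gunzip -c my_archive.tar.gz | tar -xvf - # and back the other way

Functionally the same as -z; it just invokes the compressor explicitly.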

> Self-extracting Microsoft .EXE files make it easy on the user.

Hmmmm... another form I'd not considered (though there are several
tools that will build SE executables while encoding the original
content in other forms *within* the executable).

Self-extracting unix shell archives are absolutely elegant:

https://alt.sources.narkive.com/k7MHsAnN/example-code-for-reading-and-writing-data-via-bscan-spartan6-with-urjtag-and-python

Although shell archives use .shar as a suffix by convention, they
actually accommodate any old suffix. Here's how you unpack a shar:

sh filename.shar

Yes, the whole notion of file extensions is just a needless complication.

tar -czpf my_archive.tgz /xx/yyy/*
mv my_archive.tgz my_archive.mytriviallydisguisedfiletype

Moral of story: you can't rely on file name/extension to tell you *anything*.
(file(1) is your friend)
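
For instance, continuing the disguised-name example above:

file my_archive.mytriviallydisguisedfiletype
# reports gzip compressed data, no matter what the name claims

file(1) keys off magic numbers in the content, not the extension.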

There's more archive (and even more compression) suffixes, ranked by
popularity, at the link below:

https://fileinfo.com/filetypes/compressed

Thanks, I've been finding multiple such "lists". Way more types
than I'd initially imagined! <frown>
 
In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
Jasen Betts <usenet@revmaps.no-ip.org> wrote:
On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

If by compressed you mean made smaller, that's obviously false.

If we interpret "compressed" to mean "compressed without information
loss", Jasen is correct. This can't be done.

The proof is based on the pigeonhole principle. If you have an
existing archive whose length is N bits, there are 2^N possible
combinations of bits in that archive. If you claim that you can
always compress such an archive further (and make it smaller), then
you're claiming that the (re-compressed) archive will always be no
larger than N-1 bits in length.

The maximum number of bit combinations available in the compressed
representation is a sum: 2^0 + 2^1 + 2^2 + ... + 2^(N-2) +
2^(N-1). This sum is equal to 2^N - 1.

This means that the total number of compressed representations is,
at best, one less than the number of uncompressed representations.
As the Pigeonhole Principle phrases it, you have one less pigeonhole
in your office desk, than you have slips of paper that you need to
put into the pigeonholes.

So, you're left with two possibilities:

(1) The compression algorithm can always map each of the 2^N inputs
onto a specific pigeonhole. But, since you've got one fewer
pigeonholes than inputs, two of the inputs must map to the same
pigeonhole. Two different input archives, compress down to the
same output archive.

When it comes time to decompress, the decompression algorithm
can only produce one output (which presumably will be one of
those two inputs). It can't successfully reconstruct the second
input. If you compress the "unlucky" input, and then decompress,
you get the wrong result. This contradicts (and disproves)
your starting assumption that you can always compress without
loss.

(2) One or more of the 2^N inputs doesn't map into any of the
2^N - 1 pigeonholes. It maps into something longer (2^N or
more bits long), or it causes the compression algorithm to
crash, hang, explode, or cross the streams and instantly end
all life as we know it.

This contradicts your starting assumption that you can always
compress any input further.

JPEG, MPEG, and similar systems are usually called "compression"
algorithms, but it's clearer to think of them as "lossy encoding".
They get around the pigeonhole principle by being willing to
lose information - the decoded signal is not guaranteed to be
identical to the input signal.
 
On a sunny day (Wed, 1 Dec 2021 11:03:57 -0700) it happened Don Y
<blockedofcourse@foo.invalid> wrote in <so8deo$gtt$1@dont-email.me>:

<snip>

ISO format for blueray
contains 3 movies in .ts (transport stream format):

Hmmm... I'd not considered media formats.

Different CODECs produce different outputs. I'm not sure
you can take "source" and process it through two different
CODECs and still recover the (exact!) *same* source from
each of them -- let alone try to convert from one CODEC
to another.

Oh, that is old crypto fun: take pictures of the source binary in hexadecimal,
combined frame by frame into a movie.
Re-encode with a different video codec.
Keep enough bandwidth to keep it readable for a computer.
Likewise for audio: encode / decode with text to speech - speech to text,
maybe translate the language: one two three -> un deux trois.
You can scramble the pictures too so it shows whatever.
There was a discussion some time back in sci.crypt about using fractals,
google shows several papers on fractal encryption.
Converting from one codec to another I do all the time with ffmpeg.
ffmpeg -i q1.avi -i q1.mp2 -f avi -vcodec copy -acodec ac3 -y $1-hd.avi
... ffmpeg -f yuv4mpegpipe -i - -f avi -vcodec libx264 -b 10M -y q1.avi
You may lose detail depending on allocated bandwidth.

OTOH, *containers* can arguably act as different "envelopes"
on the same encoded streams. So, converting from one
container to another is a completely reversible process
(barring the presence of additional nonportable metadata)

mke2fs bluray.iso
mount -o loop=/dev/loop0 bluray.iso /mnt/loop

I completely missed the "image formats": iso, dd, vmdk, etc.

Your mount(8) example is a perfect example of the point I am making:
once mounted, you effectively have "recovered" the original files
contained in that image. You now have an accessible *copy* of the
files that are contained in that image!

Yes
 
On 12/1/2021 11:52 AM, Dave Platt wrote:
In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
Jasen Betts <usenet@revmaps.no-ip.org> wrote:
On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

If by compressed you mean made smaller, that's obviously false.

If we interpret "compressed" to mean "compressed without information
loss", Jasen is correct. This can't be done.

No, you are assuming there is no other (implicit) source of
information that the compressor can rely upon.

I, for example, have JUST designed a compressor that compresses
all occurrences of the string "No, you are assuming there is no
other (implicit) source of information that the compressor can
rely upon." into the hex constant 0xFE.

As such, the first paragraph in my reply, here, can be compressed
to a single byte! The remaining characters in this message are
not affected by my compressor. So, the message ends up SMALLER
as a result of the elided characters in that first paragraph.

My compressor obviously relies on the fact that 0xFE does not
occur in ascii text. (If it did, I'd have to encode *it* in some
other manner)

[Unapplicable "proof" elided]
 
In article <so8t57$6cf$1@dont-email.me>,
Don Y <blockedofcourse@foo.invalid> wrote:
On 12/1/2021 11:52 AM, Dave Platt wrote:
<snip>

If we interpret "compressed" to mean "compressed without information
loss", Jasen is correct. This can't be done.

No, you are assuming there is no other (implicit) source of
information that the compressor can rely upon.

I, for example, have JUST designed a compressor that compresses
all occurrences of the string "No, you are assuming there is no
other (implicit) source of information that the compressor can
rely upon." into the hex constant 0xFE.

As such, the first paragraph in my reply, here, can be compressed
to a single byte! The remaining characters in this message are
not affected by my compressor. So, the message ends up SMALLER
as a result of the elided characters in that first paragraph.

Sure - you can always design a compressor which works very well indeed
for certain classes of input. If you "cherry-pick" the allowable
inputs, you can get extremely high coding gain.

What you can't do, is design any _single_ compressor which is
guaranteed to always compress any _arbitrary_ input (arbitrary sets of
bits), to less bits in total (and here you have to include any magic
"implicit" bits your algorithm may be depending on, such as "which
special compressor was used?" in the output-file header).

My compressor obviously relies on the fact that 0xFE does not
occur in ascii text. (If it did, I'd have to encode *it* in some
other manner)

Yup. The common way in telecom protocols is to "escape" such
special codes, so you'd send escape-0xFE to represent a single
0xFE in the file. Of course, that means that you've just increased
the size of the file rather than decreased it.

You're doing a double cherry-pick here, by pre-defining the
magic input string you'll compress so well, and by declaring the
existence of a compressed-representation token for it which is not
allowed to appear in the input. That combination gives you extremely
high coding gain... for this one magic input string.

It gives you bupkis for any input which doesn't contain that magic
string, though. You get _zero_ compression there.

Fixed-dictionary-based compression schemes (which is essentially what
you are proposing here) can give extremely high coding gain (compression)
as long as most of the input is "words" in the fixed "vocabulary", and
as long as those "words" are significantly longer than the tokens you use
to replace or number them. That amounts to saying "the input must have
a relatively low entropy"... the input isn't just random (or random-like)
collections of bits.

Shannon's source coding theorem is applicable here... it sets a pretty
hard limit on how far you can compress any given input (given the
statistics of the input data) before information loss becomes virtually
certain.
 
Don Y <blockedofcourse@foo.invalid> wrote:
On 11/30/2021 9:56 PM, Don Y wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

Daunting list, eh? Any others that I have overlooked?

Ugh! Skip that. I've apparently missed *dozens* (scores?)... :<

you also duplicated quite a few formats in that original list.
 
On 12/1/2021 6:58 PM, Dave Platt wrote:
In article <so8t57$6cf$1@dont-email.me>,
Don Y <blockedofcourse@foo.invalid> wrote:
On 12/1/2021 11:52 AM, Dave Platt wrote:
<snip>

Sure - you can always design a compressor which works very well indeed
for certain classes of input. If you "cherry-pick" the allowable
inputs, you can get extremely high coding gain.

But you don't have to "cherry pick"! "Data" typically already has
"known characteristics" that compressors can exploit.

JPEG exploits the idea that the human eye won't "notice" certain
loss of detail in photos. MP3 makes similar assumptions wrt
audio. ASCII text is already 14% larger than required (as every
byte has a high-order bit that is KNOWN to be '0'). English
prose can make other assumptions regarding "expectations" of
what follows in a given sequence of words. Speech can be
encoded in a few hundred *bits* per second. etc.

What you can't do, is design any _single_ compressor which is
guaranteed to always compress any _arbitrary_ input (arbitrary sets of
bits), to less bits in total (and here you have to include any magic
"implicit" bits your algorithm may be depending on, such as "which
special compressor was used?" in the output-file header).

And you'll notice there isn't a *single* compressor available!
And, as I said, you can't compress truly random data (and expect
it to get smaller).

But, anyone can choose to apply any compressor to any file.
There's nothing that prevents this from being done.
So, you can find a RAR archive of a set of ZIP files, each
compressing an ISO archive, etc. The ratio of "compressed"
file size to original file size can exceed 1.0. But,
despite that, one can still apply the proper sequence of
DEcompressors to retrieve the original "input".
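
(Peeling such an onion by hand might look like this -- filenames invented:

unrar x backups.rar # yields a set of .zip files
unzip disc1.zip # yields disc1.iso
mount -o loop disc1.iso /mnt/iso # the original files are now readable
)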

My concern over archives and the sorts of "manipulations"
that can be applied to them (the most common of which is
compression) is solely in how it affects recovery of the
"original" content.

My compressor obviously relies on the fact that 0xFE does not
occur in ascii text. (If it did, I'd have to encode *it* in some
other manner)

Yup. The common way in telecom protocols is to "escape" such
special codes, so you'd send escape-0xFE to represent a single
0xFE in the file. Of course, that means that you've just increased
the size of the file rather than decreased it.

You're doing a double cherry-pick here, by pre-defining the
magic input string you'll compress so well, and by declaring the
existence of a compressed-representation token for it which is not
allowed to appear in the input. That combination gives you extremely
high coding gain... for this one magic input string.

But there are many "magic strings" that appear in day to day encounters.
And other "special conditions" that a compressor (and the user
who applies that compressor) can exploit.

Facsimiles tend to contain lots of white space. That can be compressed
as can the runs of "black" (for B&W FAXs). Instead of many megabytes
to represent a single sheet image, you can reduce it to kilobytes.

In your vernacular, it wouldn't give you "bupkis" if you tried to
apply it to a color image. So, you (wisely) wouldn't use that
algorithm in that case.

An "unused" sector on a disk can be represented with a single bit.
(or, RLE the number of such consecutive empty sectors to exploit
the fact that deleted files occupy contiguous space on a volume).
So, you've represented 4,096 bits with *one*.

Granted, after "compression", we can't recover the contents of those
"unused" sectors. But, we typically don't want to. We will trade
that ability for this higher compression rate.

It gives you bupkis for any input which doesn't contain that magic
string, though. You get _zero_ compression there.

So, you design a compressor that exploits the *patterns* that are
present in that other "input".

Fixed-dictionary-based compression schemes (which is essentially what
you are proposing here) can give extremely high coding gain (compression)
as long as most of the input is "words" in the fixed "vocabulary", and
as long as those "words" are significantly longer than the tokens you use
to replace or number them. That amounts to saying "the input must have
a relatively low entropy"... the input isn't just random (or random-like)
collections of bits.

Shannon's source coding theorem is applicable here... it sets a pretty
hard limit on how far you can compress any given input (given the
statistics of the input data) before information loss becomes virtually
certain.

Yes, but that applies to unconstrained data. Where the compressor has no
*additional* knowledge of the content that it can exploit. Few people
encounter such "uncompressable" (raw) data. Hence the appeal and value
of compressors (if they had little/no use, there wouldn't be so many
of them!)
 
On 01/12/2021 22:31, Don Y wrote:
On 12/1/2021 11:52 AM, Dave Platt wrote:
<snip>

If we interpret "compressed" to mean "compressed without information
loss", Jasen is correct.  This can't be done.

No, you are assuming there is no other (implicit) source of
information that the compressor can rely upon.

He is stating a well known and general result.

One that sometimes catches people out. We had offline compression for
bulk data over phone lines that could break some telecom modems'
realtime compression back in the day. Internal buffer overflow, because
the data expanded quite a bit when their simplistic "compression"
algorithm tried to process it in realtime. If it is still around, I
created a document called fullfile which epitomised the maximally
incompressible file. There were already test files: a sample of ASCII
text and an empty file (which essentially tests the baud rate of the
modems at each end).

I, for example, have JUST designed a compressor that compresses
all occurrences of the string "No, you are assuming there is no
other (implicit) source of information that the compressor can
rely upon." into the hex constant 0xFE.

As such, the first paragraph in my reply, here, can be compressed
to a single byte!  The remaining characters in this message are
not affected by my compressor.  So, the message ends up SMALLER
as a result of the elided characters in that first paragraph.

My compressor obviously relies on the fact that 0xFE does not
occur in ascii text.  (If it did, I'd have to encode *it* in some
other manner)

[Unapplicable "proof" elided]

His general point is true though.

Unless there is some other redundant structure in the file you cannot
compress a file where the bytewise entropy is ln(256) or nearly so.

You also have to work much harder to get that very last 1% of additional
compression too - most algorithms don't even try.

PNG is one of the better lossless image ones and gets ~ln(190)
ZIP on larger files gets very close indeed, ~ln(255.7)

--
Regards,
Martin Brown
 
On 12/2/2021 2:24 AM, Martin Brown wrote:
On 01/12/2021 22:31, Don Y wrote:
On 12/1/2021 11:52 AM, Dave Platt wrote:
<snip>

If we interpret "compressed" to mean "compressed without information
loss", Jasen is correct. This can't be done.

No, you are assuming there is no other (implicit) source of
information that the compressor can rely upon.

He is stating a well known and general result.

That only applies in the general case. The fact that most compressors
achieve *some* compression means the general case is RARE in the wild;
typically encountered when someone tries to compress already compressed
content.

<snip>

His general point is true though.

It isn't important to the issues I'm addressing.

*If* compression is used, IT WILL ALREADY HAVE BEEN APPLIED BEFORE I
ENCOUNTER THE (compressed) FILE(s). Any increase or decrease in file
size will already have been "baked in". There is no value to my being
able to "lecture" the content creator that his compression actually
INCREASED the size of his content. (caps are for emphasis, not shouting).

[Compression also affords other features that are absent in its absence.
In particular, most compressors include checksums -- either implied
or explicit -- that further act to vouch for the integrity of the
content. Can you tell me if "foo.txt" is corrupted? What about
"foo.zip"?]
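
[E.g., most of the common tools expose that check directly:

gzip -t foo.gz && echo "gzip stream intact"
unzip -tq foo.zip # tests every member against its stored CRC

whereas a bare foo.txt carries no such self-check.]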

*My* concern is being able to recover the original file(s). REGARDLESS
OF THE COMPRESSORS AND ARCHIVERS USED TO GET THEM INTO THEIR CURRENT FORM.

A user can use an off-the-shelf archiver to "bundle" multiple files
into a single "archive file". So, I need to be able to "unbundle"
them, regardless of the archiver he happened to choose -- hence my
interest in "archive formats".

A user can often opt to "compress" that resulting archive (or, the archive
program may offer that as an option applied while the archive is built).
(Or, an individual file without "bundling")

So, in order to unbundle the archive (or recover the singleton), I need
to be able to UNcompress it. Hence my interest in compressors.

A user *could* opt to encrypt the contents. If so, I won't even attempt
to access the original files. I have no desire to expend resource
"guessing" secrets!

He can also opt to apply some other (wacky, home-baked) encoding or compression
scheme (e.g., when sending executables through mail, I routinely change the
file extension to "xex" and prepend some gibberish at the front of the file
to obscure its signature -- because some mail scanners will attempt to
decompress compressed files to "protect" the recipients, otherwise wrapping
it in a ZIP would suffice). If so, I won't even attempt to access the
original file(s).

One can argue that a user might do some other "silly" transform (ROT13?)
so I could cover those bases with (equally silly) inversions. I want to
identify the sorts of *likely* "processes" to which some (other!) user
could have subjected a file's (or group of files') content and be able
to reverse them.

[I recently encountered some dictionaries that were poorly disguised ZIP
archives]

If the user *chose* to encode his content in BNPF, then I want to be able
to *decode* that content. (as long as I don't have to "guess secrets"
or try to reverse engineer some wacky coding/packing scheme)

It's a relatively simple problem to solve -- once you've identified the
range of *common* archivers/encoders/compressors that might be used!
(e.g., SIT is/was common on Macs)

Unless there is some other redundant structure in the file you cannot compress
a file where the bytewise entropy is ln(256) or nearly so.

You also have to work much harder to get that very last 1% of additional
compression too - most algorithms don't even try.

PNG is one of the better lossless image ones and gets ~ln(190)
ZIP on a larger files gets very close indeed ~ln(255.7)
 
